Using Wikipedia Anchor Text and Weighted Clustering Coefficient to Enhance the Traditional Multi-document Summarization

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2012)

Abstract

Following the traditional approach, we treat summarization as the selection of top-ranked sentences from ranked sentence clusters. To this end, we rank the sentence clusters by word importance, computed by running the PageRank algorithm on a reverse directed word graph of the sentences. Next, to rank the sentences within each cluster, we introduce the use of a weighted clustering coefficient, which we calculate from the PageRank scores of words. Finally, an important issue is the presence of many noisy entries in the text, which degrades the performance of most text mining algorithms. To address this problem, we introduce a phrase mapping scheme based on Wikipedia anchor text. Our experimental results on the DUC-2002 and DUC-2004 datasets show that our system performs better than unsupervised systems and better than or comparable to novel supervised systems in this area.
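
The two scoring steps named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the word graph is built or how PageRank scores enter the coefficient, so everything below is an assumption made for illustration: edges run from each word to the word preceding it (one possible reading of "reverse directed"), an edge weight is the mean PageRank score of its two endpoints, and the coefficient is the common geometric-mean variant of the weighted clustering coefficient. The function names are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def build_reverse_word_graph(sentences):
    """Assumed construction: an edge from each word to the word before it."""
    graph = {}
    for sentence in sentences:
        words = sentence.lower().split()
        for word in words:
            graph.setdefault(word, set())
        for prev, cur in zip(words, words[1:]):
            if prev != cur:
                graph[cur].add(prev)  # reversed direction: points backwards
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict of node -> set of successors."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for u in graph[v]:
                new_rank[u] += damping * rank[v] / len(graph[v])
        rank = new_rank
    return rank


def weighted_clustering(graph, rank):
    """Geometric-mean weighted clustering coefficient on the undirected graph.

    Assumption: the weight of edge (a, b) is the mean PageRank of a and b.
    """
    nbrs = {v: set() for v in graph}
    for v, successors in graph.items():
        for u in successors:
            nbrs[v].add(u)
            nbrs[u].add(v)

    def weight(a, b):
        return (rank[a] + rank[b]) / 2

    max_w = max((weight(a, b) for a in nbrs for b in nbrs[a]), default=1.0)
    coeff = {}
    for v in nbrs:
        k = len(nbrs[v])
        if k < 2:
            coeff[v] = 0.0  # no triangle is possible
            continue
        total = 0.0
        for a, b in combinations(sorted(nbrs[v]), 2):
            if b in nbrs[a]:  # v, a, b form a triangle
                total += (weight(v, a) * weight(v, b) * weight(a, b)) ** (1 / 3) / max_w
        coeff[v] = 2 * total / (k * (k - 1))
    return coeff


if __name__ == "__main__":
    sentences = ["the storm hit the coast", "the coast was hit hard"]
    graph = build_reverse_word_graph(sentences)
    ranks = pagerank(graph)
    for word, c in sorted(weighted_clustering(graph, ranks).items()):
        print(f"{word}: pagerank={ranks[word]:.3f} clustering={c:.3f}")
```

In a full pipeline these word-level scores would still need to be aggregated into cluster and sentence scores, and the Wikipedia anchor text phrase mapping would run beforehand; neither step is sketched here because the abstract gives no detail on them.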

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kumar, N., Srinathan, K., Varma, V. (2012). Using Wikipedia Anchor Text and Weighted Clustering Coefficient to Enhance the Traditional Multi-document Summarization. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28601-8_33

  • DOI: https://doi.org/10.1007/978-3-642-28601-8_33

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28600-1

  • Online ISBN: 978-3-642-28601-8

  • eBook Packages: Computer Science, Computer Science (R0)
