Using Wikipedia Anchor Text and Weighted Clustering Coefficient to Enhance the Traditional Multi-document Summarization

Kumar, Niraj; Srinathan, Kannan; Varma, Vasudeva

doi:10.1007/978-3-642-28601-8_33

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7182)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Following the traditional approach, we treat summarization as the selection of top-ranked sentences from ranked sentence clusters. To this end, we rank the sentence clusters using word importance, computed by running the PageRank algorithm on a reverse-directed word graph of the sentences. Next, to rank the sentences within each cluster, we introduce the use of a weighted clustering coefficient, which we calculate from the PageRank scores of words. Finally, the most important issue is the presence of many noisy entries in the text, which degrades the performance of most text mining algorithms. To address this problem, we introduce a phrase mapping scheme based on Wikipedia anchor text. Our experimental results on the DUC-2002 and DUC-2004 datasets show that our system performs better than unsupervised systems and better than or comparable with novel supervised systems in this area.
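The two graph quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, since the abstract does not give the construction details. Everything specific here is an assumption: whitespace tokenization, edges drawn between adjacent words and pointed from the later word to the earlier one, edge weights set to the mean PageRank score of the two endpoint words, the geometric-mean form of the weighted clustering coefficient, and the hypothetical function names `reverse_word_graph`, `pagerank`, and `weighted_clustering`.

```python
from itertools import combinations


def reverse_word_graph(sentences):
    """Directed word graph; each edge points from a word to the word before it.

    Assumption: adjacency within a sentence defines the edges.
    """
    graph = {}
    for sentence in sentences:
        words = sentence.lower().split()
        for word in words:
            graph.setdefault(word, set())
        for earlier, later in zip(words, words[1:]):
            if earlier != later:
                graph[later].add(earlier)  # reversed reading order
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; dangling mass is spread uniformly."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for target in graph[v]:
                new_rank[target] += damping * rank[v] / len(graph[v])
        rank = new_rank
    return rank


def weighted_clustering(graph, rank):
    """Geometric-mean weighted clustering coefficient on the undirected graph.

    Assumption: the weight of an edge is the mean PageRank of its endpoints.
    """
    neighbours = {v: set() for v in graph}
    for v, targets in graph.items():
        for t in targets:
            neighbours[v].add(t)
            neighbours[t].add(v)

    def weight(a, b):
        return (rank[a] + rank[b]) / 2.0

    max_w = max(
        (weight(a, b) for a in neighbours for b in neighbours[a]), default=1.0
    )
    coeff = {}
    for v, adj in neighbours.items():
        k = len(adj)
        if k < 2:
            coeff[v] = 0.0
            continue
        total = 0.0
        for a, b in combinations(sorted(adj), 2):
            if b in neighbours[a]:  # v, a, b close a triangle
                product = weight(v, a) * weight(v, b) * weight(a, b)
                total += product ** (1.0 / 3.0) / max_w
        coeff[v] = 2.0 * total / (k * (k - 1))
    return coeff


sentences = ["wikipedia anchor text", "text wikipedia phrases"]
graph = reverse_word_graph(sentences)
scores = pagerank(graph)
for word, value in sorted(weighted_clustering(graph, scores).items()):
    print(f"{word}: pagerank={scores[word]:.3f} clustering={value:.3f}")
```

A sentence-level or cluster-level score could then be aggregated from these per-word values, for example by averaging. The abstract does not say which aggregation the authors use, so that step is left out of the sketch.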

Author information

Authors and Affiliations

IIIT-Hyderabad, Hyderabad, 500032, India
Niraj Kumar, Kannan Srinathan & Vasudeva Varma

Editor information

Editors and Affiliations

Center for Computing Research (CIC), National Polytechnic Institute (IPN), Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Kumar, N., Srinathan, K., Varma, V. (2012). Using Wikipedia Anchor Text and Weighted Clustering Coefficient to Enhance the Traditional Multi-document Summarization. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28601-8_33

DOI: https://doi.org/10.1007/978-3-642-28601-8_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28600-1
Online ISBN: 978-3-642-28601-8
eBook Packages: Computer Science, Computer Science (R0)