EM Clustering Algorithm for Automatic Text Summarization

Ledeneva, Yulia; Hernández, René García; Soto, Romyna Montiel; Reyes, Rafael Cruz; Gelbukh, Alexander

doi:10.1007/978-3-642-25324-9_26

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7094)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

Automatic text summarization has emerged as a technique for accessing only the useful information in a text. To assess the quality of the summaries a system produces, the Document Understanding Conference (DUC) 2002 developed a standard set of human-written summaries, called the gold collection, for 567 single news documents. At that conference, only five systems outperformed the baseline heuristic in the single-document extractive summarization task. So far, some approaches have obtained good results by combining different strategies with language-dependent knowledge. In this paper, we present a competitive method based on an EM clustering algorithm that improves the quality of automatic summaries while using practically no language-dependent knowledge. A comparison of this method with three text models is also presented.

Author information

Authors and Affiliations

Unidad Académica Profesional Tianguistenco, Universidad Autónoma del Estado de México, Paraje el Tejocote San Pedro Tlantizapan, 52600, México
Yulia Ledeneva & René García Hernández
Laboratorio de Reconocimiento de Patrones, Instituto Tecnológico de Toluca, Av. Tecnológico s/n, Ex Rancho La Virgen, Metepec, 52140, México
Romyna Montiel Soto & Rafael Cruz Reyes
Centro de Investigación en Computación, Instituto Politécnico Nacional, Av. Juan de Dios Bátiz s/n, D.F., 07738, México
Alexander Gelbukh

Editor information

Editors and Affiliations

Mexican Petroleum Institute (IMP), Eje Central Lazaro Cardenas Norte, 152, Col. San Bartolo Atepehuacan, CP 07730, Mexico D.F., Mexico
Ildar Batyrshin
National Polytechnic Institute (IPN), Center for Computing Research (CIC), Av. Juan de Dios Bátiz, s/n, Col. Nueva Industrial Vallejo, CP 07738, Mexico D.F., Mexico
Grigori Sidorov

About this paper

Cite this paper

Ledeneva, Y., Hernández, R.G., Soto, R.M., Reyes, R.C., Gelbukh, A. (2011). EM Clustering Algorithm for Automatic Text Summarization. In: Batyrshin, I., Sidorov, G. (eds) Advances in Artificial Intelligence. MICAI 2011. Lecture Notes in Computer Science, vol 7094. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25324-9_26

DOI: https://doi.org/10.1007/978-3-642-25324-9_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25323-2
Online ISBN: 978-3-642-25324-9
eBook Packages: Computer Science, Computer Science (R0)