Document Clustering Based on Maximal Frequent Sequences

Hernández-Reyes, Edith; García-Hernández, Rene A.; Carrasco-Ochoa, J. A.; Martínez-Trinidad, J. Fco.

doi:10.1007/11816508_27

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4139)

Included in the following conference series:

International Conference on Natural Language Processing (in Finland)

Abstract

Document clustering aims to discover groups of similar documents. The success of a document clustering algorithm depends on the model used to represent the documents. Documents are commonly represented with the vector space model based on words or n-grams. However, these representations have disadvantages such as high dimensionality and loss of the sequential order of words. In this work, we propose a new document representation in which the maximal frequent sequences of words are used as features of the vector space model. The performance of the proposed model is evaluated by clustering different document collections and is compared, through internal and external measures, against the vector space model based on words and n-grams.
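To make the representation concrete, the following is a minimal Python sketch of a vector space model whose features are maximal frequent sequences of words. It is an illustration under stated assumptions, not the authors' algorithm: sequences are taken to be contiguous word runs, a sequence counts as frequent when it appears in at least `min_support` documents, and feature values are raw occurrence counts. The function names, the brute-force level-wise search, and the toy documents are all hypothetical.

```python
from collections import Counter


def contiguous_ngrams(tokens, n):
    """Return the set of contiguous word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contiguous_subsequence(short, long):
    """True if `short` occurs as a contiguous run inside `long` (both tuples)."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def maximal_frequent_sequences(docs, min_support):
    """Brute-force, level-wise search (assumption: contiguous sequences only).

    A sequence is frequent if it occurs in at least `min_support` documents,
    and maximal if no longer frequent sequence contains it.
    """
    frequent = []
    n = 1
    while True:
        doc_freq = Counter()
        for tokens in docs:
            # Each document contributes at most 1 to a sequence's count.
            doc_freq.update(contiguous_ngrams(tokens, n))
        level = [seq for seq, df in doc_freq.items() if df >= min_support]
        if not level:
            break  # no frequent n-gram means no frequent (n+1)-gram either
        frequent.extend(level)
        n += 1
    maximal = [
        s for s in frequent
        if not any(len(t) > len(s) and is_contiguous_subsequence(s, t)
                   for t in frequent)
    ]
    return sorted(maximal)


def count_occurrences(seq, tokens):
    """Count how many times the word sequence `seq` occurs in `tokens`."""
    k = len(seq)
    return sum(1 for i in range(len(tokens) - k + 1)
               if tuple(tokens[i:i + k]) == seq)


def mfs_vectors(docs, features):
    """One vector per document; one dimension per maximal frequent sequence."""
    return [[count_occurrences(f, tokens) for f in features] for tokens in docs]


# Toy collection (hypothetical data, whitespace tokenization).
docs = [
    "document clustering groups similar documents".split(),
    "document clustering uses the vector space model".split(),
    "the vector space model represents documents".split(),
]
features = maximal_frequent_sequences(docs, min_support=2)
vectors = mfs_vectors(docs, features)
```

In this toy collection eleven distinct words collapse into three features, one of which is the four-word sequence "the vector space model" kept in its original order. This illustrates the two disadvantages the abstract names for word-based and n-gram-based models: high dimensionality and loss of word order. The resulting vectors can be passed to any standard clustering algorithm.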

Author information

Authors and Affiliations

National Institute for Astrophysics, Optics and Electronics, Luis Enrique Erro No. 1, Sta. Ma. Tonantzintla, Puebla, C.P. 72840, México
Edith Hernández-Reyes, Rene A. García-Hernández, J. A. Carrasco-Ochoa & J. Fco. Martínez-Trinidad

Editor information

Editors and Affiliations

Turku Centre for Computer Science (TUCS), Department of Information Technology, University of Turku, Joukahaisenkatu 3-5 B, FIN-20520, Turku, Finland
Tapio Salakoski
Turku Centre for Computer Science (TUCS) and Department of IT, University of Turku, Lemminkäisenkatu 14 A, 20520, Turku, Finland
Filip Ginter & Sampo Pyysalo
Department of Information Technology, University of Turku, Lemminkäisenkatu 14–18 A, FIN-20520, Turku, Finland
Tapio Pahikkala

About this paper

Cite this paper

Hernández-Reyes, E., García-Hernández, R.A., Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F. (2006). Document Clustering Based on Maximal Frequent Sequences. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_27

DOI: https://doi.org/10.1007/11816508_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37334-6
Online ISBN: 978-3-540-37336-0
eBook Packages: Computer Science, Computer Science (R0)
