Abstract
Document clustering aims to discover groups of similar documents. The success of a document clustering algorithm depends on the model used to represent the documents. Documents are commonly represented with the vector space model based on words or n-grams. However, these representations have disadvantages such as high dimensionality and loss of the sequential order of words. In this work, we propose a new document representation in which the maximal frequent sequences of words are used as features of the vector space model. The efficiency of the proposed model is evaluated by clustering different document collections and is compared, through internal and external measures, against the vector space model based on words and on n-grams.
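To make the idea of using maximal frequent sequences as vector space features concrete, the following is a minimal sketch and not the authors' algorithm. It rests on several assumptions that the abstract does not state: sequences are contiguous runs of words, a sequence is "frequent" when it occurs in at least `min_support` documents, candidate length is capped at `max_len` (so maximality is relative to that cap), and feature weights are binary. All function names and the toy documents are hypothetical.

```python
from collections import defaultdict


def contiguous_sequences(tokens, max_len):
    """Return the set of contiguous word sequences of length 1..max_len."""
    seqs = set()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            seqs.add(tuple(tokens[i:i + n]))
    return seqs


def contains(long_seq, short_seq):
    """True if short_seq occurs as a contiguous run inside long_seq."""
    n = len(short_seq)
    return any(long_seq[i:i + n] == short_seq
               for i in range(len(long_seq) - n + 1))


def maximal_frequent_sequences(docs, min_support, max_len=5):
    """Brute-force search (assumed definition, see the lead-in above)."""
    # Document frequency: number of documents containing each sequence.
    doc_freq = defaultdict(int)
    for tokens in docs:
        for seq in contiguous_sequences(tokens, max_len):
            doc_freq[seq] += 1
    frequent = [s for s, c in doc_freq.items() if c >= min_support]
    # Keep only sequences not contained in another frequent sequence.
    maximal = [s for s in frequent
               if not any(s != t and contains(t, s) for t in frequent)]
    return sorted(maximal)


def to_vectors(docs, features):
    """Binary vector per document: 1 if the feature sequence occurs in it."""
    return [[1 if contains(tuple(tokens), f) else 0 for f in features]
            for tokens in docs]


# Hypothetical toy collection, already tokenized into lowercase words.
docs = [
    "the vector space model represents documents".split(),
    "we use the vector space model".split(),
    "document clustering groups similar documents".split(),
    "document clustering is useful".split(),
]

features = maximal_frequent_sequences(docs, min_support=2)
vectors = to_vectors(docs, features)
print(features)
print(vectors)
```

On this toy collection the sketch keeps three features instead of the full word vocabulary, which illustrates how such a representation can reduce dimensionality while preserving word order inside each feature. The resulting vectors could then be passed to any standard clustering algorithm.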
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hernández-Reyes, E., García-Hernández, R.A., Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F. (2006). Document Clustering Based on Maximal Frequent Sequences. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_27
DOI: https://doi.org/10.1007/11816508_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37334-6
Online ISBN: 978-3-540-37336-0
eBook Packages: Computer Science, Computer Science (R0)