Text Clustering

Chapter in Machine Learning for Text

Abstract

The problem of text clustering is that of partitioning a corpus into groups of similar documents. Clustering is an unsupervised learning task because no guidance about specific types of groups (e.g., sports, politics, and so on) is provided in the form of labeled training data.
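As a concrete illustration of partitioning a corpus into groups of similar documents, the following is a minimal sketch using tf-idf vectors and k-means with scikit-learn (cf. the software references in the bibliography); the toy corpus and the choice of two clusters are illustrative assumptions, not examples from the chapter.

```python
# A minimal text-clustering sketch: represent documents as tf-idf
# vectors and partition them with k-means. Corpus and cluster count
# are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

corpus = [
    "football match and football goals",
    "the football team won the match",
    "election budget and election vote",
    "the election vote on the budget",
]

# Each document becomes an L2-normalized tf-idf vector.
X = TfidfVectorizer().fit_transform(corpus)

# Partition the corpus into two groups of similar documents.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Documents on the same topic should share a cluster label.
print(labels)
```

Since no labels guide the grouping, the algorithm discovers the sports/politics split purely from word-overlap structure, which is exactly the unsupervised setting described above.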


Notes

  1. The first eigenvector is not discriminative in terms of the clustering structure and can be dropped. Its value can be shown to depend only on the square root of the frequency of the corresponding term or document.

  2. The general form of symmetric spectral clustering [361] (cf. Sect. 4.8.2), which is applicable to all types of bipartite and non-bipartite graphs, normalizes each row to unit norm. This choice is a worthy alternative.

  3. Although \(\overline{X_{i}}\) is a binary vector, we treat it like a set when we use set-membership notation such as \(t_{j} \in \overline{X_{i}}\). Any binary vector can be viewed as the set of positions at which it takes the value 1.

  4. This model is discussed only in later chapters. The uninitiated reader may choose to skip this section on a first reading.

  5. The sums of the diagonal entries of S and Δ are the same because the trace of a matrix is invariant under similarity transformation [460]. Therefore, the eigenvalues sum to 0. Unless all eigenvalues are 0 (i.e., S = 0), at least one negative eigenvalue must exist.

  6. This section requires an understanding of the classification problem. We recommend that the uninitiated reader skip this section on a first reading and return to it after covering the material in the next chapter. The notations and terminology used in this section assume such an understanding.

  7. The sigmoid function \(1/(1 + e^{-\lambda s})\) can be used to convert an arbitrary score s to the range (0, 1), after which the scores are normalized to sum to 1 over all classes.
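The score conversion in Note 7 can be sketched in a few lines of Python; the input scores and the choice of λ = 1 below are illustrative assumptions.

```python
# Sketch of Note 7: squash arbitrary classifier scores into (0, 1)
# with a sigmoid, then normalize over classes so the values sum to 1.
import math

def scores_to_probabilities(scores, lam=1.0):
    # Sigmoid 1 / (1 + e^(-lambda * s)) maps each score into (0, 1).
    squashed = [1.0 / (1.0 + math.exp(-lam * s)) for s in scores]
    # Normalize so the converted scores sum to 1 over all classes.
    total = sum(squashed)
    return [v / total for v in squashed]

# Three hypothetical per-class scores on an arbitrary scale.
probs = scores_to_probabilities([2.5, -0.3, 0.8])
print(probs)
```

The sigmoid preserves the ranking of the scores, so the highest-scoring class also receives the highest normalized probability.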

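The eigenvalue argument in Note 5 can also be checked numerically: a symmetric matrix with zero diagonal has eigenvalues summing to its (zero) trace, so a nonzero such matrix must have at least one negative eigenvalue. The random matrix below is an illustrative assumption.

```python
# Numerical check of Note 5: zero trace forces the eigenvalues to sum
# to 0, hence a nonzero symmetric matrix has a negative eigenvalue.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 5))
S = (A + A.T) / 2.0       # symmetrize the random matrix
np.fill_diagonal(S, 0.0)  # force the trace to be zero

eigenvalues = np.linalg.eigvalsh(S)
print(eigenvalues.sum())  # approximately 0
print(eigenvalues.min())  # strictly negative, since S is nonzero
```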
Bibliography

  1. C. Aggarwal. Data mining: The textbook. Springer, 2015.

  2. C. Aggarwal, S. Gates, and P. Yu. On using partial supervision for text categorization. IEEE Transactions on Knowledge and Data Engineering, 16(2), 245–255, 2004. [Extended version of ACM KDD 1998 paper “On the merits of building categorization systems by supervised clustering.”]

  3. C. Aggarwal and C. Reddy. Data clustering: algorithms and applications, CRC Press, 2013.

  4. C. Aggarwal and S. Sathe. Outlier ensembles: An introduction. Springer, 2017.

  5. C. Aggarwal and P. Yu. On effective conceptual indexing and similarity search in text data. ICDM Conference, pp. 3–10, 2001.

  6. C. Aggarwal and P. Yu. On clustering massive text and categorical data streams. Knowledge and Information Systems, 24(2), pp. 171–196, 2010.

  7. C. Aggarwal and C. Zhai. Mining text data. Springer, 2012.

  8. J. Allan, R. Papka, and V. Lavrenko. Online new event detection and tracking. ACM SIGIR Conference, 1998.

  9. L. Baker and A. McCallum. Distributional clustering of words for text classification. ACM SIGIR Conference, pp. 96–103, 1998.

  10. Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3, pp. 1137–1155, 2003.

  11. P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results. Advances in Knowledge Discovery and Data Mining, Eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthuruswamy. AAAI Press/MIT Press, 1996.

  12. W. B. Croft. Clustering large files of documents using the single-link method. Journal of the American Society of Information Science, 28, pp. 341–344, 1977.

  13. D. Cutting, D. Karger, J. Pedersen, and J. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. ACM SIGIR Conference, pp. 318–329, 1992.

  14. I. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. ACM KDD Conference, pp. 269–274, 2001.

  15. I. Dhillon and D. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1–2), pp. 143–175, 2001.

  16. C. Ding, X. He, and H. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. SDM Conference, pp. 606–610, 2005.

  17. C. Ding, T. Li, and W. Peng. On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Computational Statistics and Data Analysis, 52(8), pp. 3913–3927, 2008.

  18. C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix t-factorizations for clustering. ACM KDD Conference, pp. 126–135, 2006.

  19. J. Ghosh and A. Acharya. Cluster ensembles: Theory and applications. Data Clustering: Algorithms and Applications, CRC Press, 2013.

  20. S. Gilpin, T. Eliassi-Rad, and I. Davidson. Guided learning for role discovery (glrd): framework, algorithms, and applications. ACM KDD Conference, pp. 113–121, 2013.

  21. T. Hofmann. Probabilistic latent semantic indexing. ACM SIGIR Conference, pp. 50–57, 1999.

  22. T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine learning, 41(1–2), pp. 177–196, 2001.

  23. Q. Le and T. Mikolov. Distributed representations of sentences and documents. ICML Conference, pp. 1188–1196, 2014.

  24. D. Lee and H. Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, pp. 556–562, 2001.

  25. H. Li and K. Yamanishi. Document classification using a finite mixture model. ACL Conference, pp. 39–47, 1997.

  26. Y. Li, C. Luo, and S. Chung. Text clustering with feature selection by using statistical data. IEEE Transactions on Knowledge and Data Engineering, 20(5), pp. 641–652, 2008.

  27. S. Madeira and A. Oliveira. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 1(1), pp. 24–45, 2004.

  28. A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.

  29. T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013. https://arxiv.org/abs/1301.3781

  30. G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography (special issue), 3(4), pp. 235–312, 1990. https://wordnet.princeton.edu/

  31. A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. NIPS Conference, pp. 849–856, 2002.

  32. K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification with labeled and unlabeled data using EM. Machine Learning, 39(2), pp. 103–134, 2000.

  33. H. Paulheim and R. Meusel. A decomposition of the outlier detection problem into a set of supervised learning problems. Machine Learning, 100(2–3), pp. 509–531, 2015.

  34. F. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. ACL Conference, pp. 183–190, 1993.

  35. H. Schütze and C. Silverstein. Projections for Efficient Document Clustering. ACM SIGIR Conference, pp. 74–81, 1997.

  36. F. Shahnaz, M. Berry, V. Pauca, and R. Plemmons. Document clustering using nonnegative matrix factorization. Information Processing and Management, 42(2), pp. 378–386, 2006.

  37. G. Strang. An introduction to linear algebra. Wellesley Cambridge Press, 2009.

  38. E. Voorhees. Implementing agglomerative hierarchic clustering algorithms for use in document retrieval. Information Processing and Management, 22(6), pp. 465–476, 1986.

  39. W. Wilbur and K. Sirotkin. The automatic identification of stop words. Journal of Information Science, 18(1), pp. 45–55, 1992.

  40. C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. NIPS Conference, 2000.

  41. W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. ACM SIGIR Conference, pp. 267–273, 2003.

  42. M. Zaki and W. Meira Jr. Data mining and analysis: Fundamental concepts and algorithms. Cambridge University Press, 2014.

  43. O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. ACM SIGIR Conference, pp. 46–54, 1998.

  44. Y. Zhao and G. Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55(3), pp. 311–331, 2004.

  45. S. Zhong. Efficient streaming text clustering. Neural Networks, 18(5–6), 2005.

  46. http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

  47. https://cran.r-project.org/web/packages/tm/

  48. http://www.cs.waikato.ac.nz/ml/weka/

  49. http://scikit-learn.org/stable/auto_examples/text/document_clustering.html

  50. https://www.mathworks.com/help/stats/cluster-analysis.html

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Aggarwal, C.C. (2018). Text Clustering. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_4

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-73531-3_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73530-6

  • Online ISBN: 978-3-319-73531-3

  • eBook Packages: Computer Science (R0)
