Text Clustering

Chapter in Machine Learning for Text

Abstract

The problem of text clustering is that of partitioning a corpus into groups of similar documents. Clustering is an unsupervised learning task because no guidance about specific types of groups (e.g., sports, politics, and so on) is provided in the form of labeled training data.
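As a concrete illustration of partitioning a corpus into groups of similar documents, the following is a minimal sketch using tf-idf vectors and k-means with scikit-learn (cf. the software references in the bibliography); the toy corpus and the choice of two clusters are illustrative assumptions, not examples from the chapter.

```python
# A minimal text-clustering sketch: represent documents as tf-idf
# vectors and partition them with k-means. Corpus and cluster count
# are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

corpus = [
    "football match and football goals",
    "the football team won the match",
    "election budget and election vote",
    "the election vote on the budget",
]

# Each document becomes an L2-normalized tf-idf vector.
X = TfidfVectorizer().fit_transform(corpus)

# Partition the corpus into two groups of similar documents.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Documents on the same topic should share a cluster label.
print(labels)
```

Since no labels guide the grouping, the algorithm discovers the sports/politics split purely from word-overlap structure, which is exactly the unsupervised setting described above.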


Notes

  1. The first eigenvector is not discriminative in terms of the clustering structure and can be dropped. Its value can be shown to depend only on the square root of the frequency of the corresponding term or document.

  2. The general form of symmetric spectral clustering [361] (cf. Sect. 4.8.2), which is applicable to all types of bipartite and non-bipartite graphs, normalizes each row to unit norm. This choice is a worthy alternative.

  3. Although \(\overline{X_{i}}\) is a binary vector, we treat it like a set when we use set-membership notation such as \(t_{j} \in \overline{X_{i}}\). Any binary vector can be viewed as the set of positions at which it takes the value 1.

  4. This model is discussed only in later chapters. The uninitiated reader may choose to skip this section on a first reading.

  5. The sums of the diagonal entries of S and Δ are the same because the trace of a matrix is invariant under similarity transformation [460]. Therefore, the eigenvalues sum to 0. Unless all eigenvalues are 0 (i.e., S = 0), at least one negative eigenvalue must exist.

  6. This section requires an understanding of the classification problem. We recommend that the uninitiated reader skip this section on a first reading and return to it after covering the material in the next chapter. The notations and terminology used in this section assume such an understanding.

  7. The sigmoid function \(1/(1 + e^{-\lambda s})\) can be used to convert an arbitrary score s to the range (0, 1), after which the scores are normalized to sum to 1 over all classes.
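The score conversion in Note 7 can be sketched in a few lines of Python; the input scores and the choice of λ = 1 below are illustrative assumptions.

```python
# Sketch of Note 7: squash arbitrary classifier scores into (0, 1)
# with a sigmoid, then normalize over classes so the values sum to 1.
import math

def scores_to_probabilities(scores, lam=1.0):
    # Sigmoid 1 / (1 + e^(-lambda * s)) maps each score into (0, 1).
    squashed = [1.0 / (1.0 + math.exp(-lam * s)) for s in scores]
    # Normalize so the converted scores sum to 1 over all classes.
    total = sum(squashed)
    return [v / total for v in squashed]

# Three hypothetical per-class scores on an arbitrary scale.
probs = scores_to_probabilities([2.5, -0.3, 0.8])
print(probs)
```

The sigmoid preserves the ranking of the scores, so the highest-scoring class also receives the highest normalized probability.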

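The eigenvalue argument in Note 5 can also be checked numerically: a symmetric matrix with zero diagonal has eigenvalues summing to its (zero) trace, so a nonzero such matrix must have at least one negative eigenvalue. The random matrix below is an illustrative assumption.

```python
# Numerical check of Note 5: zero trace forces the eigenvalues to sum
# to 0, hence a nonzero symmetric matrix has a negative eigenvalue.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 5))
S = (A + A.T) / 2.0       # symmetrize the random matrix
np.fill_diagonal(S, 0.0)  # force the trace to be zero

eigenvalues = np.linalg.eigvalsh(S)
print(eigenvalues.sum())  # approximately 0
print(eigenvalues.min())  # strictly negative, since S is nonzero
```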
Bibliography

  1. C. Aggarwal. Data mining: The textbook. Springer, 2015.

  2. C. Aggarwal, S. Gates, and P. Yu. On using partial supervision for text categorization. IEEE Transactions on Knowledge and Data Engineering, 16(2), 245–255, 2004. [Extended version of ACM KDD 1998 paper “On the merits of building categorization systems by supervised clustering.”]

  3. C. Aggarwal and C. Reddy. Data clustering: algorithms and applications, CRC Press, 2013.

  4. C. Aggarwal and S. Sathe. Outlier ensembles: An introduction. Springer, 2017.

  5. C. Aggarwal and P. Yu. On effective conceptual indexing and similarity search in text data. ICDM Conference, pp. 3–10, 2001.

  6. C. Aggarwal and P. Yu. On clustering massive text and categorical data streams. Knowledge and Information Systems, 24(2), pp. 171–196, 2010.

  7. C. Aggarwal and C. Zhai. Mining text data. Springer, 2012.

  8. J. Allan, R. Papka, and V. Lavrenko. Online new event detection and tracking. ACM SIGIR Conference, 1998.

  9. L. Baker and A. McCallum. Distributional clustering of words for text classification. ACM SIGIR Conference, pp. 96–103, 1998.

  10. Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3, pp. 1137–1155, 2003.

  11. P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results. Advances in Knowledge Discovery and Data Mining, Eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthuruswamy. AAAI Press/MIT Press, 1996.

  12. W. B. Croft. Clustering large files of documents using the single-link method. Journal of the American Society of Information Science, 28, pp. 341–344, 1977.

  13. D. Cutting, D. Karger, J. Pedersen, and J. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. ACM SIGIR Conference, pp. 318–329, 1992.

  14. I. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. ACM KDD Conference, pp. 269–274, 2001.

  15. I. Dhillon and D. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1–2), pp. 143–175, 2001.

  16. C. Ding, X. He, and H. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. SDM Conference, pp. 606–610, 2005.

  17. C. Ding, T. Li, and W. Peng. On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Computational Statistics and Data Analysis, 52(8), pp. 3913–3927, 2008.

  18. C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix t-factorizations for clustering. ACM KDD Conference, pp. 126–135, 2006.

  19. J. Ghosh and A. Acharya. Cluster ensembles: Theory and applications. Data Clustering: Algorithms and Applications, CRC Press, 2013.

  20. S. Gilpin, T. Eliassi-Rad, and I. Davidson. Guided learning for role discovery (glrd): framework, algorithms, and applications. ACM KDD Conference, pp. 113–121, 2013.

  21. T. Hofmann. Probabilistic latent semantic indexing. ACM SIGIR Conference, pp. 50–57, 1999.

  22. T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine learning, 41(1–2), pp. 177–196, 2001.

  23. Q. Le and T. Mikolov. Distributed representations of sentences and documents. ICML Conference, pp. 1188–1196, 2014.

  24. D. Lee and H. Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, pp. 556–562, 2001.

  25. H. Li and K. Yamanishi. Document classification using a finite mixture model. ACL Conference, pp. 39–47, 1997.

  26. Y. Li, C. Luo, and S. Chung. Text clustering with feature selection by using statistical data. IEEE Transactions on Knowledge and Data Engineering, 20(5), pp. 641–652, 2008.

  27. S. Madeira and A. Oliveira. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 1(1), pp. 24–45, 2004.

  28. A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.

  29. T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013. https://arxiv.org/abs/1301.3781

  30. G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography (special issue), 3(4), pp. 235–312, 1990. https://wordnet.princeton.edu/

  31. A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. NIPS Conference, pp. 849–856, 2002.

  32. K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification with labeled and unlabeled data using EM. Machine Learning, 39(2), pp. 103–134, 2000.

  33. H. Paulheim and R. Meusel. A decomposition of the outlier detection problem into a set of supervised learning problems. Machine Learning, 100(2–3), pp. 509–531, 2015.

  34. F. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. ACL Conference, pp. 183–190, 1993.

  35. H. Schütze and C. Silverstein. Projections for Efficient Document Clustering. ACM SIGIR Conference, pp. 74–81, 1997.

  36. F. Shahnaz, M. Berry, V. Pauca, and R. Plemmons. Document clustering using nonnegative matrix factorization. Information Processing and Management, 42(2), pp. 378–386, 2006.

  37. G. Strang. An introduction to linear algebra. Wellesley Cambridge Press, 2009.

  38. E. Voorhees. Implementing agglomerative hierarchic clustering algorithms for use in document retrieval. Information Processing and Management, 22(6), pp. 465–476, 1986.

  39. W. Wilbur and K. Sirotkin. The automatic identification of stop words. Journal of Information Science, 18(1), pp. 45–55, 1992.

  40. C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. NIPS Conference, 2000.

  41. W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. ACM SIGIR Conference, pp. 267–273, 2003.

  42. M. Zaki and W. Meira Jr. Data mining and analysis: Fundamental concepts and algorithms. Cambridge University Press, 2014.

  43. O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. ACM SIGIR Conference, pp. 46–54, 1998.

  44. Y. Zhao and G. Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55(3), pp. 311–331, 2004.

  45. S. Zhong. Efficient streaming text clustering. Neural Networks, 18(5–6), 2005.

  46. http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

  47. https://cran.r-project.org/web/packages/tm/

  48. http://www.cs.waikato.ac.nz/ml/weka/

  49. http://scikit-learn.org/stable/auto_examples/text/document_clustering.html

  50. https://www.mathworks.com/help/stats/cluster-analysis.html

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Aggarwal, C.C. (2018). Text Clustering. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_4

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-73531-3_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73530-6

  • Online ISBN: 978-3-319-73531-3

  • eBook Packages: Computer Science (R0)
