Spectral Analysis of Text Collection for Similarity-Based Clustering

Li, Wenyuan; Ng, Wee-Keong; Lim, Ee-Peng

doi:10.1007/978-3-540-24775-3_47

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3056)

Included in the following conference series: Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Clustering of natural text collections is generally difficult due to the high dimensionality, heterogeneity, and large size of text collections. These characteristics compound the problem of determining the appropriate similarity space for clustering algorithms. In this paper, we propose to use the spectral analysis of the similarity space of a text collection to predict clustering behavior before actual clustering is performed. Spectral analysis is a technique that has been adopted across different domains to analyze the key encoding information of a system. Spectral analysis for prediction is useful in first determining the quality of the similarity space and discovering any possible problems the selected feature set may present. Our experiments showed that such insights can be obtained by analyzing the spectrum of the similarity matrix of a text collection. We showed that spectrum analysis can be used to estimate the number of clusters in advance.
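
The abstract does not spell out the procedure, so the following is only a minimal sketch of one common way to read cluster structure from the spectrum of a similarity matrix: the eigengap heuristic applied to a degree-normalized cosine similarity matrix. The normalization, the heuristic, and all function names (`cosine_similarity_matrix`, `similarity_spectrum`, `estimate_num_clusters`, `make_toy_collection`) are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (documents x terms)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave empty documents as zero vectors
    Xn = X / norms
    return Xn @ Xn.T


def similarity_spectrum(S):
    """Eigenvalues of D^{-1/2} S D^{-1/2}, sorted in descending order.

    D is the diagonal matrix of row sums of S. This normalization is an
    assumption of the sketch, not necessarily the one used in the paper.
    """
    d = np.maximum(S.sum(axis=1), 1e-12)  # guard against zero row sums
    d_inv_sqrt = 1.0 / np.sqrt(d)
    N = S * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(N)  # N is symmetric; result is ascending
    return eigvals[::-1]


def estimate_num_clusters(eigvals, max_k=10):
    """Eigengap heuristic: k is where the largest drop between
    consecutive leading eigenvalues occurs."""
    top = eigvals[: max_k + 1]
    gaps = top[:-1] - top[1:]
    return int(np.argmax(gaps)) + 1


def make_toy_collection(n_groups=3, docs_per_group=10, terms_per_group=5, seed=0):
    """Hypothetical toy data: each group of documents uses its own vocabulary."""
    rng = np.random.default_rng(seed)
    X = np.zeros((n_groups * docs_per_group, n_groups * terms_per_group))
    for g in range(n_groups):
        rows = slice(g * docs_per_group, (g + 1) * docs_per_group)
        cols = slice(g * terms_per_group, (g + 1) * terms_per_group)
        X[rows, cols] = rng.random((docs_per_group, terms_per_group)) + 0.1
    return X


if __name__ == "__main__":
    X = make_toy_collection()
    S = cosine_similarity_matrix(X)
    spectrum = similarity_spectrum(S)
    print("leading eigenvalues:", np.round(spectrum[:5], 3))
    print("estimated number of clusters:", estimate_num_clusters(spectrum))
```

On the toy collection the three vocabularies are disjoint, so the normalized matrix is block diagonal and the estimate is expected to be three. On real text the leading eigenvalues separate less cleanly, and a spectrum with no pronounced gap can itself be read as a warning that the chosen features or similarity measure may not support a clear clustering.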

Author information

Authors and Affiliations

Centre for Advanced Information Systems, School of Computer Engineering, Nanyang Technological University, Singapore
Wenyuan Li, Wee-Keong Ng & Ee-Peng Lim

Editor information

Editors and Affiliations

School of Engineering and Information Technology, Deakin University, VIC 3125, Australia
Honghua Dai
University of Illinois at Urbana-Champaign, 61801, Urbana, IL, USA
Ramakrishnan Srikant
Faculty of Engineering and Information Technology, Centre for Quantum Computation and Intelligent Systems, and Australian ACS National Committee for Artificial Intelligence, University of Technology, Sydney, Australia
Chengqi Zhang

About this paper

Cite this paper

Li, W., Ng, WK., Lim, EP. (2004). Spectral Analysis of Text Collection for Similarity-Based Clustering. In: Dai, H., Srikant, R., Zhang, C. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2004. Lecture Notes in Computer Science, vol 3056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24775-3_47

DOI: https://doi.org/10.1007/978-3-540-24775-3_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22064-0
Online ISBN: 978-3-540-24775-3
eBook Packages: Springer Book Archive
