Topic-Constrained Hierarchical Clustering for Document Datasets

Zhao, Ying

doi:10.1007/978-3-642-17316-5_17

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6440)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

In this paper, we propose topic-constrained hierarchical clustering, which organizes document datasets into hierarchical trees consistent with a given set of topics. The proposed algorithm is based on a constrained agglomerative clustering framework and a semi-supervised criterion function that simultaneously emphasizes the relationship between documents and topics and the relationships among the documents themselves. The experimental evaluation shows that our algorithm outperformed the traditional agglomerative algorithm by 7.8% to 11.4%.
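
To make the idea concrete, the following is a minimal sketch of agglomerative clustering guided by topics. It is not the criterion function from the paper, which the abstract does not specify. It assumes a hypothetical merge score that mixes a document-document term (average pairwise cosine similarity) with a document-topic term (cosine similarity of the merged centroid to its closest topic). The names `merge_score`, `topic_guided_agglomerate`, and the weight `alpha` are illustrative assumptions.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def merge_score(a, b, docs, topics, alpha):
    """Hypothetical score for merging clusters a and b (lists of doc indices).

    Assumption: a weighted sum of a document-document term and a
    document-topic term. This is an illustration, not the paper's criterion.
    """
    # Document-document term: average pairwise cosine across the two clusters.
    doc_term = sum(dot(docs[i], docs[j]) for i in a for j in b) / (len(a) * len(b))
    # Document-topic term: similarity of the merged centroid to its best topic.
    members = a + b
    dims = len(docs[0])
    centroid = normalize([sum(docs[i][k] for i in members) for k in range(dims)])
    topic_term = max(dot(centroid, t) for t in topics)
    return alpha * doc_term + (1 - alpha) * topic_term


def topic_guided_agglomerate(docs, topics, alpha=0.5):
    """Greedily merge the best-scoring pair until one cluster remains.

    Returns the merged clusters in the order they were formed, which
    describes the hierarchy from the leaves up to the root.
    """
    docs = [normalize(d) for d in docs]
    topics = [normalize(t) for t in topics]
    clusters = [[i] for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:
        candidates = (
            (merge_score(clusters[p], clusters[q], docs, topics, alpha), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        _, p, q = max(candidates, key=lambda c: c[0])
        merged = sorted(clusters[p] + clusters[q])
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
    return merges


# Toy data: two documents near each axis, and one topic vector per axis.
example_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
example_topics = [[1.0, 0.0], [0.0, 1.0]]
example_merges = topic_guided_agglomerate(example_docs, example_topics, alpha=0.5)
print(example_merges)
```

With `alpha=1.0` the sketch reduces to plain average-link agglomerative clustering, and lowering `alpha` gives the topics more influence over which clusters merge first.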

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, China, 100084
Ying Zhao

Editor information

Editors and Affiliations

Faculty of Engineering and Information Technology, University of Technology Sydney, 2007, Sydney, NSW, Australia
Longbing Cao
College of Computer Science, Chongqing University, 400030, Chongqing, China
Yong Feng
College of Computer Science, Chongqing University, 400030, Chongqing, China
Jiang Zhong

About this paper

Cite this paper

Zhao, Y. (2010). Topic-Constrained Hierarchical Clustering for Document Datasets. In: Cao, L., Feng, Y., Zhong, J. (eds) Advanced Data Mining and Applications. ADMA 2010. Lecture Notes in Computer Science, vol 6440. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17316-5_17

DOI: https://doi.org/10.1007/978-3-642-17316-5_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17315-8
Online ISBN: 978-3-642-17316-5
eBook Packages: Computer Science, Computer Science (R0)
