Cluster Analysis and K-means Clustering: An Introduction

Advances in K-means Clustering

Part of the book series: Springer Theses

Abstract

The phrase “data mining” was coined in the late 1980s to describe the activity of extracting interesting patterns from data. Since then, data mining and knowledge discovery has become one of the hottest topics in both academia and industry. It provides valuable business and scientific intelligence hidden in large amounts of historical data.

Notes

  1. http://www.kdd.org/.

  2. http://www.cs.uvm.edu/~icdm/.

  3. http://www.informatik.uni-trier.de/~ley/db/conf/sdm/.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Wu, J. (2012). Cluster Analysis and K-means Clustering: An Introduction. In: Advances in K-means Clustering. Springer Theses. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29807-3_1

  • DOI: https://doi.org/10.1007/978-3-642-29807-3_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29806-6

  • Online ISBN: 978-3-642-29807-3

  • eBook Packages: Computer Science, Computer Science (R0)
