CL.E.KMODES: a modified k-modes clustering algorithm

  • Special Issue Paper
Journal of the Operational Research Society

Abstract

In this paper we present a new method for clustering categorical data sets, named CL.E.KMODES. The proposed method is a modified k-modes algorithm that incorporates a new four-step dissimilarity measure based on elements of the methodological framework of the ELECTRE I multicriteria method. The four-step dissimilarity measure introduces an alternative and more accurate way of assigning objects to clusters: it compares each object with each mode on every attribute they have in common, and then chooses the most appropriate mode, and its corresponding cluster, for that object. The robustness of the proposed method is verified on seven widely used data sets using six clustering evaluation measures.
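The four-step ELECTRE-based measure itself is not reproduced in this preview. As background for the object-to-mode comparison the abstract describes, here is a minimal Python sketch of the simple-matching dissimilarity used by the standard k-modes algorithm (Huang, 1998), the baseline that CL.E.KMODES modifies; function names and data are illustrative, not from the paper.

```python
from typing import List, Sequence

def matching_dissimilarity(x: Sequence[str], mode: Sequence[str]) -> int:
    """Simple-matching dissimilarity (Huang, 1998): the number of
    attributes on which object x and the cluster mode disagree."""
    return sum(1 for x_f, h_f in zip(x, mode) if x_f != h_f)

def assign_to_cluster(x: Sequence[str], modes: List[Sequence[str]]) -> int:
    """Assign x to the cluster whose mode is least dissimilar to it."""
    return min(range(len(modes)), key=lambda k: matching_dissimilarity(x, modes[k]))

# Two cluster modes over three categorical attributes.
modes = [("red", "small", "round"), ("blue", "large", "square")]
x = ("red", "large", "round")
print(matching_dissimilarity(x, modes[0]))  # → 1 (differs only on the second attribute)
print(assign_to_cluster(x, modes))          # → 0
```

CL.E.KMODES replaces this per-attribute mismatch count with a four-step, ELECTRE I-style comparison of each object against each mode before choosing a cluster.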

Figure 1
Figure 2
Figure 3

References

  • Aha D and Murphy P (1994). UCI Repository of Machine Learning Datasets. University of California, Department of Information and Computer Science: Irvine, CA. Available at http://www.ics.uci.edu/~mlearn/MLRepository.html.

  • Ahmad A and Dey L (2007). A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set. Pattern Recogn Lett 28(1): 110–118.

  • Anderberg MR (1973). Cluster Analysis for Applications. Academic Press: New York.

  • Bohanec M and Rajkovic V (1988). Knowledge acquisition and explanation for multi-attribute decision making. Proceedings of the 8th International Workshop on Expert Systems and their Applications. Avignon, France, pp. 59–78.

  • Boriah S, Chandola V and Kumar V (2008). Similarity measures for categorical data: A comparative evaluation. Proceedings of the Eighth SIAM International Conference on Data Mining. Atlanta, GA, pp. 243–254.

  • Bouyssou D (2001). Outranking methods. In: Floudas CA and Pardalos PM (eds). Encyclopedia of Optimization Vol. 4. Kluwer, pp. 255–289.

  • Burnaby T (1970). On a method for character weighting a similarity coefficient, employing the concept of information. Math Geol 2(1): 25–28.

  • Cramer H (1946). The Elements of Probability Theory and Some of its Applications. John Wiley & Sons: New York.

  • Dubes CR (1998). Cluster analysis and related issues. Handbook Pattern Recogn Comput Vis 1: 3–32.

  • Dubes CR and Jain AK (1979). Validity studies in clustering methodologies. Pattern Recogn 11: 235–254.

  • Eskin E, Arnold A, Prerau M, Portnoy L and Stolfo S (2002). A geometric framework for unsupervised anomaly detection. In: Barbara D and Jajodia S (eds). Applications of Data Mining in Computer Security. Kluwer Academic Publishers: Dordrecht (Hingham, MA), pp. 78–100.

  • Figueira J, Mousseau V and Roy B (2005). ELECTRE methods. In: Figueira J, Greco S and Ehrgott M (eds). Multiple Criteria Decision Analysis: State of the Art Surveys. Springer: New York, pp. 133–153.

  • Fowlkes EB and Mallows CL (1983). A method for comparing two hierarchical clusterings. J Am Stat Assoc 78: 553–569.

  • Gambaryan P (1964). A mathematical model of taxonomy. Izvest Akad Nauk Armen SSR 17(12): 47–53.

  • Gan G, Yang Z and Wu J (2005). A genetic k-modes algorithm for clustering categorical data. Proceedings of the 1st International Conference, ADMA'05. Wuhan, China, pp. 195–202.

  • Goodall DW (1966). A new similarity index based on probability. Biometrics 22(4): 882–907.

  • Guha S, Rastogi R and Shim K (2000). ROCK: A robust clustering algorithm for categorical attributes. Inform Syst 25(5): 345–366.

  • Halkidi M, Batistakis Y and Vazirgiannis M (2001). On clustering validation techniques. J Intell Inf Syst 17(2–3): 107–145.

  • He Z, Deng S and Xu X (2005). Improving k-modes algorithm considering frequencies of attribute values in mode. Proceedings of the International Conference on Computational Intelligence and Security, CIS 2005. Xi'an, China, pp. 157–162.

  • Huang Z (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min Knowl Disc 2(3): 283–304.

  • Huang Z and Ng MK (1999). A fuzzy k-modes algorithm for clustering categorical data. IEEE T Fuzzy Syst 7(4): 446–452.

  • Huang Z, Ng MK, Rong H and Li Z (2005). Automated variable weighting in k-means type clustering. IEEE T Pattern Anal 27(5): 657–668.

  • Jain AK, Murty NM and Flynn JP (1999). Data clustering: A review. ACM Comput Surv 31(3): 264–323.

  • Jollois F and Nadif M (2002). Clustering large categorical data. Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD'02. Taipei, Taiwan, pp. 257–263.

  • Le SQ and Ho TB (2005). An association-based dissimilarity measure for categorical data. Pattern Recogn Lett 26(16): 2549–2557.

  • Lin D (1998). An information-theoretic definition of similarity. Proceedings of the 15th International Conference on Machine Learning, ICML'98. Madison, WI, pp. 296–304.

  • MacQueen JB (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, Vol. 1, pp. 281–297.

  • Makarenkov V and Legendre P (2001). Optimal variable weighting for ultrametric and additive trees and k-means partitioning: Methods and software. J Classif 18: 245–271.

  • Maung K (1941). Measurement of association in a contingency table with special reference to the pigmentation of hair and eye colours of Scottish school children. Ann Eugen 11: 189–223.

  • Milligan GW, Soon SC and Sokol LM (1983). The effect of cluster size, dimensionality and the number of clusters on recovery of true cluster structure. IEEE T Pattern Anal 5: 40–47.

  • Modha DS and Spangler WS (2003). Feature weighting in k-means clustering. Mach Learn 52: 217–237.

  • Ng MK and Wong JC (2002). Clustering categorical data sets using tabu search techniques. Pattern Recogn 35(12): 2783–2790.

  • Ng MK, Li MJ, Huang JZ and He Z (2007). On the impact of dissimilarity measure in k-modes clustering algorithm. IEEE T Pattern Anal 29(3): 503–506.

  • Pearson K (1916). On the general theory of multiple contingency with special reference to partial contingency. Biometrika 11(3): 145–158.

  • Pirlot M (1997). A common framework for describing some outranking methods. J Multi-Criteria Decis Anal 6: 86–92.

  • Rand WM (1971). Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336): 846–850.

  • Rosenberg A and Hirschberg J (2007). V-measure: A conditional entropy-based external cluster evaluation measure. Proceedings of the 2007 Conference on Empirical Methods in Natural Language Processing and Computational Language Learning. Prague, pp. 410–420.

  • Roy B (1968). Classement et choix en présence de points de vue multiples: La méthode ELECTRE. RIRO 8: 57–75.

  • Roy B (1991). The outranking approach and the foundations of ELECTRE methods. Theor Decis 31: 49–73.

  • Roy B (1996). Multicriteria Methodology for Decision Aiding. Kluwer Academic Publishers: Dordrecht.

  • San O, Huynh V and Nakamori Y (2004). An alternative extension of the k-means algorithm for clustering categorical data. Int J Appl Math Comput Sci 14(2): 241–247.

  • Shyu ML, Kuruppu-Appuhamilage IP, Chen SC and Chang L (2005). Handling missing values via decomposition of the conditioned set. Proceedings of the IEEE International Conference on Information Reuse and Integration, IRI'05. Las Vegas, NV, pp. 199–204.

  • Siegler RS (1976). Three aspects of cognitive development. Cognitive Psychol 8: 481–520.

  • Smirnov ES (1968). On exact methods in systematics. Syst Zool 17(1): 1–13.

  • Stanfill C and Waltz D (1986). Toward memory-based reasoning. Commun ACM 29(12): 1213–1228.

  • Sun Y, Zhu Q and Chen Z (2002). An iterative initial-points refinement algorithm for categorical data clustering. Pattern Recogn Lett 23(7): 875–884.

  • Theodoridis S and Koutroumbas K (1999). Pattern Recognition. Academic Press: Orlando, FL.

  • Van Rijsbergen CJ (1979). Information Retrieval, 2nd edn. Butterworths: London.

Appendix

The estimation of a new mode is based on the mode-finding process of the k-modes algorithm (Huang, 1998). In more detail, if Y = {y_1, y_2, …, y_n} is a set of n objects evaluated over d attributes with categorical values, then the mode of Y is a vector H = {h_1, h_2, …, h_d} that minimizes

DIS(H, Y) = Σ_{j=1}^{n} D(y_j, H),

where D(y_j, H) is defined according to relation (1). It should be noted that H is not necessarily an element of Y. Let n_{c_{k,f}} be the number of objects having category c_{k,f} on attribute a_f, and fr(a_f = c_{k,f} | Y) = n_{c_{k,f}}/n the relative frequency of category c_{k,f} in Y. Then DIS(H, Y) is minimized if fr(a_f = h_f | Y) ⩾ fr(a_f = c_{k,f} | Y) for h_f ≠ c_{k,f}, for every attribute a_f. Note that a mode of a data set Y is not unique.
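The frequency argument above gives a direct way to compute a mode: for each attribute, take any most frequent category. A minimal Python sketch of this, assuming objects are tuples of category labels (names are illustrative, not from the paper):

```python
from collections import Counter
from typing import List, Sequence

def compute_mode(Y: List[Sequence[str]]) -> List[str]:
    """Return a mode H of the categorical data set Y: for each attribute a_f,
    pick a most frequent category, which minimizes DIS(H, Y) = sum_j D(y_j, H).
    Ties are broken arbitrarily, reflecting that the mode is not unique."""
    d = len(Y[0])
    return [Counter(obj[f] for obj in Y).most_common(1)[0][0] for f in range(d)]

Y = [("a", "x"), ("a", "y"), ("b", "y"), ("a", "y")]
print(compute_mode(Y))  # → ['a', 'y']
```

Note that the returned vector is assembled attribute by attribute, so, as stated above, it need not coincide with any single object of Y.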

About this article

Cite this article

Mastrogiannis, N., Giannikos, I., Boutsinas, B. et al. CL.E.KMODES: a modified k-modes clustering algorithm. J Oper Res Soc 60, 1085–1095 (2009). https://doi.org/10.1057/jors.2009.25
