Abstract
In this paper we present a new method for clustering categorical data sets, named CL.E.KMODES. The proposed method is a modified k-modes algorithm that incorporates a new four-step dissimilarity measure based on elements of the methodological framework of the ELECTRE I multicriteria method. The four-step dissimilarity measure introduces an alternative and more accurate way of assigning objects to clusters. In particular, it compares each object with each mode, for every attribute they have in common, and then chooses the most appropriate mode, and its corresponding cluster, for that object. Seven widely used data sets are tested to verify the robustness of the proposed method on six clustering evaluation measures.
References
Aha D and Murphy P (1994). UCI Repository of Machine Learning Datasets. University of California, Department of Information and Computer Science: Irvine, CA. Available at http://www.ics.uci.edu/~mlearn/MLRepository.html.
Ahmad A and Dey L (2007). A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set. Pattern Recogn Lett 28(1): 110–118.
Anderberg MR (1973). Cluster Analysis for Applications. Academic Press: New York.
Bohanec M and Rajkovic V (1988). Knowledge acquisition and explanation for multi-attribute decision making. Proceedings of the 8th International Workshop on Expert Systems and their Applications. Avignon, France, pp. 59–78.
Boriah S, Chandola V and Kumar V (2008). Similarity measures for categorical data: A comparative evaluation. Proceedings of the Eighth SIAM International Conference on Data Mining. Atlanta, GA, pp. 243–254.
Bouyssou D (2001). Outranking methods. In: Floudas CA and Pardalos PM (eds). Encyclopedia of Optimization, Vol. 4. Kluwer, pp. 255–289.
Burnaby T (1970). On a method for character weighting a similarity coefficient, employing the concept of information. Math Geol 2(1): 25–28.
Cramer H (1946). The Elements of Probability Theory and Some of its Applications. John Wiley & Sons: New York.
Dubes CR (1998). Cluster analysis and related issues. Handbook Pattern Recogn Comput Vis 1: 3–32.
Dubes CR and Jain AK (1979). Validity studies in clustering methodologies. Pattern Recogn 11: 235–254.
Eskin E, Arnold A, Prerau M, Portnoy L and Stolfo S (2002). A geometric framework for unsupervised anomaly detection. In: Barbara D and Jajodia S (eds). Applications of Data Mining in Computer Security. Kluwer Academic Publishers: Dordrecht (Hingham, MA), pp. 78–100.
Figueira J, Mousseau V and Roy B (2005). Electre methods. In: Figueira J, Greco S and Ehrogott M (eds). Multiple Criteria Decision Analysis: State of the Art Surveys. Springer: New York, pp. 133–153.
Fowlkes EB and Mallows CL (1983). A method for comparing two hierarchical clusterings. J Am Stat Assoc 78: 553–569.
Gambaryan P (1964). A mathematical model of taxonomy. Izvest Akad Nauk Armen SSR 17(12): 47–53.
Gan G, Yang Z and Wu J (2005). A genetic k-modes algorithm for clustering categorical data. Proceedings of the 1st International Conference, ADMA'05. Wuhan, China, pp. 195–202.
Goodall DW (1966). A new similarity index based on probability. Biometrics 22(4): 882–907.
Guha S, Rastogi R and Shim K (2000). ROCK: A robust clustering algorithm for categorical attributes. Inform Syst 25(5): 345–366.
Halkidi M, Batistakis Y and Vazirgiannis M (2001). On clustering validation techniques. J Intell Inf Syst 17(2–3): 107–145.
He Z, Deng S and Xu X (2005). Improving k-modes algorithm considering frequencies of attribute values in mode. Proceedings of the International Conference on Computational Intelligence and Security, CIS 2005. Xi'an, China, pp. 157–162.
Huang Z (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min Knowl Disc 2(3): 283–304.
Huang Z and Ng MK (1999). A fuzzy k-modes algorithm for clustering categorical data. IEEE T Fuzzy Syst 7(4): 446–452.
Huang Z, Ng MK, Rong H and Li Z (2005). Automated variable weighting in k-means type clustering. IEEE T Pattern Anal 27(5): 657–668.
Jain AK, Murty NM and Flynn JP (1999). Data clustering: A review. ACM Comput Surv 31(3): 264–323.
Jollois F and Nadif M (2002). Clustering large categorical data. Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD'02. Taipei, Taiwan, pp. 257–263.
Le SQ and Ho TB (2005). An association-based dissimilarity measure for categorical data. Pattern Recogn Lett 26(16): 2549–2557.
Lin D (1998). An information-theoretic definition of similarity. Proceedings of the 15th International Conference on Machine Learning, ICML'98. Madison, WI, pp. 296–304.
MacQueen JB (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, Vol. 1, pp. 281–297.
Makarenkov V and Legendre P (2001). Optimal variable weighting for ultrametric and additive trees and k-means partitioning: Methods and software. J Classif 18: 245–271.
Maung K (1941). Measurement of association in a contingency table with special reference to the pigmentation of hair and eye colours of Scottish school children. Ann Eugen 11: 189–223.
Milligan GW, Soon SC and Sokol LM (1983). The effect of cluster size, dimensionality and the number of clusters on recovery of true cluster structure. IEEE T Pattern Anal 5: 40–47.
Modha DS and Spangler WS (2003). Feature weighting in k-means clustering. Mach Learn 52: 217–237.
Ng MK and Wong JC (2002). Clustering categorical data sets using tabu search techniques. Pattern Recogn 35(12): 2783–2790.
Ng MK, Li MJ, Huang JZ and He Z (2007). On the impact of dissimilarity measure in k-modes clustering algorithm. IEEE T Pattern Anal 29(3): 503–506.
Pearson K (1916). On the general theory of multiple contingency with special reference to partial contingency. Biometrika 11(3): 145–158.
Pirlot M (1997). A common framework for describing some outranking methods. J Multi-Criteria Decis Anal 6: 86–92.
Rand WM (1971). Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336): 846–850.
Rosenberg A and Hirschberg J (2007). V-measure: A conditional entropy-based external cluster evaluation measure. Proceedings of the 2007 Conference on Empirical Methods in Natural Language Processing and Computational Language Learning. Prague, pp. 410–420.
Roy B (1968). Classement et choix en présence de points de vue multiples: La méthode ELECTRE. RIRO 8: 57–75.
Roy B (1991). The outranking approach and the foundations of ELECTRE methods. Theor Decis 31: 49–73.
Roy B (1996). Multicriteria Methodology for Decision Aiding. Kluwer Academic Publishers: Dordrecht.
San O, Huynh V and Nakamori Y (2004). An alternative extension of the k-means algorithm for clustering categorical data. Int J Appl Math Comput Sci 14(2): 241–247.
Shyu ML, Kuruppu-Appuhamilage IP, Chen SC and Chang L (2005). Handling missing values via decomposition of the conditioned set. Proceedings of the IEEE International Conference on Information Reuse and Integration, IRI'05. Las Vegas, NV, pp. 199–204.
Siegler RS (1976). Three aspects of cognitive development. Cognitive Psychol 8: 481–520.
Smirnov ES (1968). On exact methods in systematics. Syst Zool 17(1): 1–13.
Stanfill C and Waltz D (1986). Toward memory-based reasoning. Commun ACM 29(12): 1213–1228.
Sun Y, Zhu Q and Chen Z (2002). An iterative initial-points refinement algorithm for categorical data clustering. Pattern Recogn Lett 23(7): 875–884.
Theodoridis S and Koutroumbas K (1999). Pattern Recognition. Academic Press: Orlando, FL.
Van Rijsbergen CJ (1979). Information Retrieval, 2nd edn. Butterworths: London.
Appendix
The estimation of a new mode is based on the mode-finding process of the k-modes algorithm (Huang, 1998). In more detail, if Y = {y_1, y_2, …, y_n} is a set of n objects evaluated over d attributes with categorical values, then the mode of Y is a vector H = {h_1, h_2, …, h_d} that minimizes the following relation:

DIS(H, Y) = ∑_{j=1}^{n} D(y_j, H),

where D(y_j, H) is defined according to relation (1). It should be noted that H is not necessarily an element of Y. Let n(c_{k,f}) be the number of objects having category c_{k,f} on attribute a_f, and fr(a_f = c_{k,f}/Y) = n(c_{k,f})/n the relative frequency of category c_{k,f} in Y. Then DIS(H, Y) is minimized if fr(a_f = h_f/Y) ⩾ fr(a_f = c_{k,f}/Y) for h_f ≠ c_{k,f}, for every attribute a_f. Note that a mode of a data set Y is not unique.
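The mode-estimation step above reduces to picking, on each attribute, the most frequent category in Y. A minimal sketch in Python, assuming D(y_j, H) is the simple matching dissimilarity of the standard k-modes algorithm; the names `find_mode` and `matching_dissimilarity` are illustrative, not from the paper:

```python
from collections import Counter

def matching_dissimilarity(y, h):
    """Simple matching dissimilarity: count of attributes on which y and h differ."""
    return sum(1 for y_f, h_f in zip(y, h) if y_f != h_f)

def find_mode(Y):
    """Return a mode H of Y: the most frequent category on each attribute.

    H minimizes DIS(H, Y) = sum_j D(y_j, H), need not be an element of Y,
    and is not unique when categories tie in frequency.
    """
    d = len(Y[0])
    return [Counter(y[f] for y in Y).most_common(1)[0][0] for f in range(d)]

Y = [["red", "small"], ["red", "large"], ["blue", "small"]]
H = find_mode(Y)  # ["red", "small"]
```

Here DIS(H, Y) = 0 + 1 + 1 = 2, and no other vector of categories achieves a smaller total, illustrating the frequency condition fr(a_f = h_f/Y) ⩾ fr(a_f = c_{k,f}/Y).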
Mastrogiannis, N., Giannikos, I., Boutsinas, B. et al. CL.E.KMODES: a modified k-modes clustering algorithm. J Oper Res Soc 60, 1085–1095 (2009). https://doi.org/10.1057/jors.2009.25