Abstract
In this paper we present a new method for clustering categorical data sets, named CL.E.KMODES. The proposed method is a modified k-modes algorithm that incorporates a new four-step dissimilarity measure based on elements of the methodological framework of the ELECTRE I multicriteria method. The four-step dissimilarity measure introduces an alternative and more accurate way of assigning objects to clusters. In particular, it compares each object with each mode, for every attribute they have in common, and then chooses the most appropriate mode, and its corresponding cluster, for that object. Seven widely used data sets are tested to verify the robustness of the proposed method on six clustering evaluation measures.
References
Aha D and Murphy P (1994). UCI Repository of Machine Learning Datasets. University of California, Department of Information and Computer Science: Irvine, CA. Available at http://www.ics.uci.edu/~mlearn/MLRepository.html.
Ahmad A and Dey L (2007). A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set. Pattern Recogn Lett 28(1): 110–118.
Anderberg MR (1973). Cluster Analysis for Applications. Academic Press: New York.
Bohanec M and Rajkovic V (1988). Knowledge acquisition and explanation for multi-attribute decision making. Proceedings of the 8th International Workshop on Expert Systems and their Applications. Avignon, France, pp. 59–78.
Boriah S, Chandola V and Kumar V (2008). Similarity measures for categorical data: A comparative evaluation. Proceedings of the Eighth SIAM International Conference on Data Mining. Atlanta, GA, pp. 243–254.
Bouyssou D (2001). Outranking methods. In: Floudas CA and Pardalos PM (eds). Encyclopedia of Optimization, Vol. 4. Kluwer, pp. 255–289.
Burnaby T (1970). On a method for character weighting a similarity coefficient, employing the concept of information. Math Geol 2(1): 25–28.
Cramer H (1946). The Elements of Probability Theory and Some of its Applications. John Wiley & Sons: New York.
Dubes CR (1998). Cluster analysis and related issues. Handbook Pattern Recogn Comput Vis 1: 3–32.
Dubes CR and Jain AK (1979). Validity studies in clustering methodologies. Pattern Recogn 11: 235–254.
Eskin E, Arnold A, Prerau M, Portnoy L and Stolfo S (2002). A geometric framework for unsupervised anomaly detection. In: Barbara D and Jajodia S (eds). Applications of Data Mining in Computer Security. Kluwer Academic Publishers: Dordrecht (Hingham, MA), pp. 78–100.
Figueira J, Mousseau V and Roy B (2005). Electre methods. In: Figueira J, Greco S and Ehrogott M (eds). Multiple Criteria Decision Analysis: State of the Art Surveys. Springer: New York, pp. 133–153.
Fowlkes EB and Mallows CL (1983). A method for comparing two hierarchical clusterings. J Am Stat Assoc 78: 553–569.
Gambaryan P (1964). A mathematical model of taxonomy. Izvest Akad Nauk Armen SSR 17(12): 47–53.
Gan G, Yang Z and Wu J (2005). A genetic k-modes algorithm for clustering categorical data. Proceedings of the 1st International Conference, ADMA'05. Wuhan, China, pp. 195–202.
Goodall DW (1966). A new similarity index based on probability. Biometrics 22(4): 882–907.
Guha S, Rastogi R and Shim K (2000). ROCK: A robust clustering algorithm for categorical attributes. Inform Syst 25(5): 345–366.
Halkidi M, Batistakis Y and Vazirgiannis M (2001). On clustering validation techniques. J Intell Inf Syst 17(2–3): 107–145.
He Z, Deng S and Xu X (2005). Improving k-modes algorithm considering frequencies of attribute values in mode. Proceedings of the International Conference on Computational Intelligence and Security, CIS 2005. Xi'an, China, pp. 157–162.
Huang Z (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min Knowl Disc 2(3): 283–304.
Huang Z and Ng MK (1999). A fuzzy k-modes algorithm for clustering categorical data. IEEE T Fuzzy Syst 7(4): 446–452.
Huang Z, Ng MK, Rong H and Li Z (2005). Automated variable weighting in k-means type clustering. IEEE T Pattern Anal 27(5): 657–668.
Jain AK, Murty NM and Flynn JP (1999). Data clustering: A review. ACM Comput Surv 31(3): 264–323.
Jollois F and Nadif M (2002). Clustering large categorical data. Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD'02. Taipei, Taiwan, pp. 257–263.
Le SQ and Ho TB (2005). An association-based dissimilarity measure for categorical data. Pattern Recogn Lett 26(16): 2549–2557.
Lin D (1998). An information-theoretic definition of similarity. Proceedings of the 15th International Conference on Machine Learning, ICML'98. Madison, WI, pp. 296–304.
MacQueen JB (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, Vol. 1, pp. 281–297.
Makarenkov V and Legendre P (2001). Optimal variable weighting for ultrametric and additive trees and k-means partitioning: Methods and software. J Classif 18: 245–271.
Maung K (1941). Measurement of association in a contingency table with special reference to the pigmentation of hair and eye colours of Scottish school children. Ann Eugen 11: 189–223.
Milligan GW, Soon SC and Sokol LM (1983). The effect of cluster size, dimensionality and the number of clusters on recovery of true cluster structure. IEEE T Pattern Anal 5: 40–47.
Modha DS and Spangler WS (2003). Feature weighting in k-means clustering. Mach Learn 52: 217–237.
Ng MK and Wong JC (2002). Clustering categorical data sets using tabu search techniques. Pattern Recogn 35(12): 2783–2790.
Ng MK, Li MJ, Huang JZ and He Z (2007). On the impact of dissimilarity measure in k-modes clustering algorithm. IEEE T Pattern Anal 29(3): 503–506.
Pearson K (1916). On the general theory of multiple contingency with special reference to partial contingency. Biometrika 11(3): 145–158.
Pirlot M (1997). A common framework for describing some outranking methods. J Multi-Criteria Decis Anal 6: 86–92.
Rand WM (1971). Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336): 846–850.
Rosenberg A and Hirschberg J (2007). V-measure: A conditional entropy-based external cluster evaluation measure. Proceedings of the 2007 Conference on Empirical Methods in Natural Language Processing and Computational Language Learning. Prague, pp. 410–420.
Roy B (1968). Classement et choix en présence de points de vue multiples: La méthode ELECTRE. RIRO 8: 57–75.
Roy B (1991). The outranking approach and the foundations of ELECTRE methods. Theor Decis 31: 49–73.
Roy B (1996). Multicriteria Methodology for Decision Aiding. Kluwer Academic Publishers: Dordrecht.
San O, Huynh V and Nakamori Y (2004). An alternative extension of the k-means algorithm for clustering categorical data. Int J Appl Math Comput Sci 14(2): 241–247.
Shyu ML, Kuruppu-Appuhamilage IP, Chen SC and Chang L (2005). Handling missing values via decomposition of the conditioned set. Proceedings of the IEEE International Conference on Information Reuse and Integration, IRI'05. Las Vegas, NV, pp. 199–204.
Siegler RS (1976). Three aspects of cognitive development. Cognitive Psychol 8: 481–520.
Smirnov ES (1968). On exact methods in systematics. Syst Zool 17(1): 1–13.
Stanfill C and Waltz D (1986). Toward memory-based reasoning. Commun ACM 29(12): 1213–1228.
Sun Y, Zhu Q and Chen Z (2002). An iterative initial-points refinement algorithm for categorical data clustering. Pattern Recogn Lett 23(7): 875–884.
Theodoridis S and Koutroumbas K (1999). Pattern Recognition. Academic Press: Orlando, FL.
Van Rijsbergen CJ (1979). Information Retrieval, 2nd edn. Butterworths: London.
Appendix
The estimation of a new mode is based on the mode-finding process of the k-modes algorithm (Huang, 1998). In more detail, if Y = {y_1, y_2, …, y_n} is a set of n objects evaluated over d attributes with categorical values, then the mode of Y is a vector H = {h_1, h_2, …, h_d} that minimizes the following relation:

DIS(H, Y) = ∑_{j=1}^{n} D(y_j, H),

where D(y_j, H) is defined according to relation (1). It should be noted that H is not necessarily an element of Y. Let n(c_{k,f}) be the number of objects having category c_{k,f} on attribute a_f, and fr(a_f = c_{k,f}/Y) = n(c_{k,f})/n the relative frequency of category c_{k,f} in Y. Then DIS(H, Y) is minimized if fr(a_f = h_f/Y) ⩾ fr(a_f = c_{k,f}/Y) for h_f ≠ c_{k,f}, for every attribute a_f. Note that a mode of a data set Y is not unique.
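The mode-estimation step above reduces to picking, on each attribute, the most frequent category in Y. A minimal sketch in Python, assuming D(y_j, H) is the simple matching dissimilarity of the standard k-modes algorithm; the names `find_mode` and `matching_dissimilarity` are illustrative, not from the paper:

```python
from collections import Counter

def matching_dissimilarity(y, h):
    """Simple matching dissimilarity: count of attributes on which y and h differ."""
    return sum(1 for y_f, h_f in zip(y, h) if y_f != h_f)

def find_mode(Y):
    """Return a mode H of Y: the most frequent category on each attribute.

    H minimizes DIS(H, Y) = sum_j D(y_j, H), need not be an element of Y,
    and is not unique when categories tie in frequency.
    """
    d = len(Y[0])
    return [Counter(y[f] for y in Y).most_common(1)[0][0] for f in range(d)]

Y = [["red", "small"], ["red", "large"], ["blue", "small"]]
H = find_mode(Y)  # ["red", "small"]
```

Here DIS(H, Y) = 0 + 1 + 1 = 2, and no other vector of categories achieves a smaller total, illustrating the frequency condition fr(a_f = h_f/Y) ⩾ fr(a_f = c_{k,f}/Y).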
Mastrogiannis, N., Giannikos, I., Boutsinas, B. et al. CL.E.KMODES: a modified k-modes clustering algorithm. J Oper Res Soc 60, 1085–1095 (2009). https://doi.org/10.1057/jors.2009.25