Determining the Number of Clusters Using Multivariate Ranks

Baragilly, Mohammed; Chakraborty, Biman

doi:10.1007/978-81-322-3643-6_2

Mohammed Baragilly & Biman Chakraborty

Abstract

Determining the number of clusters in multivariate data has become one of the most important issues across a wide range of scientific disciplines. The forward search algorithm is a graphical approach that helps with this task. The traditional forward search based on Mahalanobis distances was introduced by Hadi (1992) and Atkinson (1994), and Atkinson et al. (2004) used it as a clustering method. Like many other Mahalanobis distance-based methods, however, it cannot be correctly applied to asymmetric distributions or, more generally, to distributions that depart from elliptical symmetry. We propose a new forward search methodology based on spatial ranks, in which clusters are grown sequentially, one data point at a time, using spatial ranks with respect to the points already in the subsample. The algorithm starts from a randomly chosen initial subsample. We illustrate with simulated data that the proposed algorithm is robust to the choice of initial subsample and that it performs well for different multivariate mixture distributions. We also propose a modified algorithm based on the volume of central rank regions. Our numerical examples show that it produces the best results under elliptical symmetry.
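
To make the idea of growing a subsample by spatial ranks concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard definition of the spatial rank of a point x with respect to a sample, namely the average of the unit vectors pointing from each sample point to x, whose norm is close to 0 for central points and close to 1 for outlying ones. The selection rule (add the outside point with the smallest rank norm) and the names `spatial_rank`, `forward_search`, `m0` and `seed` are illustrative assumptions; the chapter itself specifies the actual rule and diagnostics.

```python
import numpy as np

def spatial_rank(x, sample):
    """Spatial rank of x w.r.t. sample: mean of unit vectors from sample points to x."""
    diffs = x - sample                          # shape (m, d)
    norms = np.linalg.norm(diffs, axis=1)
    nonzero = norms > 0                         # sample points equal to x contribute zero
    units = np.zeros_like(diffs, dtype=float)
    units[nonzero] = diffs[nonzero] / norms[nonzero, None]
    return units.mean(axis=0)

def forward_search(data, m0=5, seed=0):
    """Grow a subsample one point at a time (assumed rule: smallest rank norm enters)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # Randomly chosen initial subsample, as described in the abstract.
    inside = list(rng.choice(n, size=m0, replace=False))
    trace = []                                  # rank norm of each entering point
    while len(inside) < n:
        outside = [i for i in range(n) if i not in inside]
        rank_norms = [
            np.linalg.norm(spatial_rank(data[i], data[inside])) for i in outside
        ]
        j = int(np.argmin(rank_norms))          # most central remaining point
        trace.append(rank_norms[j])
        inside.append(outside[j])
    return inside, np.array(trace)
```

Under these assumptions, plotting `trace` against the subsample size gives a forward plot in the spirit of the graphical approach: a sharp rise in the entering rank norm suggests that the current cluster has been exhausted and points from another group are starting to enter.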

References

Atkinson AC (1994) Fast very robust methods for the detection of multiple outliers. J Am Stat Assoc 89:1329–1339
Atkinson AC, Mulira H (1993) The stalactite plot for the detection of multivariate outliers. Stat Comput 3:27–35
Atkinson AC, Riani M (2007) Exploratory tools for clustering multivariate data. Comput Stat Data Anal 52:272–285
Atkinson AC, Riani M (2012) Discussion on the paper by Spiegelhalter, Sherlaw-Johnson, Bardsley, Blunt, Wood and Grigg. J Roy Stat Soc 175
Atkinson AC, Riani M, Cerioli A (2004) Exploring multivariate data with the forward search. Springer, New York
Atkinson AC, Riani M, Cerioli A (2006) Random start forward searches with envelopes for detecting clusters in multivariate data. Springer, Berlin, pp 163–171
Atkinson AC, Riani M, Cerioli A (2010) The forward search: theory and data analysis. J Korean Stat Soc 39:117–134
Azzalini A, Bowman A (1990) A look at some data on the Old Faithful geyser. J Roy Stat Soc 39(3):357–365
Banfield J, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803–821
Barber CB, Dobkin DP, Huhdanpaa H (1996) The quickhull algorithm for convex hulls. ACM Trans Math Softw 22(4):469–483
Beale EML (1969) Euclidean cluster analysis. ISI, Voorburg, Netherlands
Calinski T, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat 3:1–27
Chakraborty B (2001) On affine equivariant multivariate quantiles. Ann Inst Stat Math 53:380–403
Chaudhuri P (1996) On a geometric notion of quantiles for multivariate data. J Am Stat Assoc 91:862–872
Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York
Everitt B, Landau S, Leese M, Stahl D (2011) Cluster analysis, 5th edn. Wiley, Chichester
Fraley C, Raftery A (2003) Enhanced model-based clustering, density estimation and discriminant analysis: Mclust. J Classif 20:263–286
Friedman HP, Rubin J (1967) On some invariant criteria for grouping data. J Am Stat Assoc 62:1159–1178
Gan G, Ma C, Wu J (2007) Data clustering: theory, algorithms, and applications. ASA-SIAM series on statistics and applied probability. SIAM, Philadelphia
Gordon AD (1998) Cluster validation. In: Hayashi C, Ohsumi N, Yajima K et al (eds) Data science, classification and related methods. Springer, Tokyo, pp 22–39
Hadi AS (1992) Identifying multiple outliers in multivariate data. J Roy Stat Soc 54:761–771
Hadi AS, Simonoff JS (1993) Procedures for the identification of multiple outliers in linear models. J Am Stat Assoc 88(424):1264–1272
Hartigan JA (1975) Clustering algorithms. Wiley, New York
Kaufman L, Rousseeuw PJ (1990) Finding groups in data. Wiley, New York
Koltchinskii V (1997) M-estimation, convexity and quantiles. Ann Stat 25:435–477
Krzanowski WJ, Lai YT (1985) A criterion for determining the number of clusters in a data set. Biometrics 44:23–34
Marriott FHC (1971) Practical problems in a method of cluster analysis. Biometrics 27:501–514
Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50:159–179
Mojena R (1977) Hierarchical grouping methods and stopping rules: an evaluation. Comput J 20:359–363
Overall JE, Magee KN (1992) Replication as a rule for determining the number of clusters in hierarchical cluster analysis. Appl Psychol Measur 16:119–128
Serfling R (2002) A depth function and a scale curve based on spatial quantiles. In: Dodge Y (ed) Statistical data analysis based on the L1-norm and related methods. Birkhaeuser, pp 25–38
Sugar CA, James GM (2003) Finding the number of clusters in a data set: an information theoretic approach. J Am Stat Assoc 98:750–763
Thorndike RL (1953) Who belongs in a family? Psychometrika 18:267–276
Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J Roy Stat Soc 63:411–423
Venables W, Ripley B (2002) Modern applied statistics with S, 4th edn. Springer, New York

Acknowledgments

The authors thank the editors of ICORS 2015 and the two referees for their helpful remarks and comments on an earlier version of the manuscript. The research of Mohammed Baragilly is partially supported by the Egyptian Government, and he would like to express his appreciation to the Egyptian Cultural Centre and Educational Bureau in London and to the Department of Applied Statistics, Helwan University.

Author information

Authors and Affiliations

School of Mathematics, University of Birmingham, Birmingham, B15 2TT, UK
Mohammed Baragilly & Biman Chakraborty
Department of Applied Statistics, Helwan University, Cairo, Egypt
Mohammed Baragilly

Authors

Mohammed Baragilly
Biman Chakraborty

Corresponding author

Correspondence to Mohammed Baragilly.

Editor information

Editors and Affiliations

Department of Mathematics, University of Trento, Trento, Italy
Claudio Agostinelli
Interdisciplinary Statistical Research Unit, Indian Statistical Institute, Kolkata, India
Ayanendranath Basu
Institute of Statistics and Mathematical Methods in Economics, Vienna University of Technology, Vienna, Austria
Peter Filzmoser
Sampling and Official Statistics Unit, Indian Statistical Institute, Kolkata, India
Diganta Mukherjee

About this paper

Cite this paper

Baragilly, M., Chakraborty, B. (2016). Determining the Number of Clusters Using Multivariate Ranks. In: Agostinelli, C., Basu, A., Filzmoser, P., Mukherjee, D. (eds) Recent Advances in Robust Statistics: Theory and Applications. Springer, New Delhi. https://doi.org/10.1007/978-81-322-3643-6_2

DOI: https://doi.org/10.1007/978-81-322-3643-6_2
Published: 11 November 2016
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-3641-2
Online ISBN: 978-81-322-3643-6
eBook Packages: Mathematics and Statistics (R0)
