Abstract
Model-based approaches to cluster analysis and mixture modeling often involve maximizing classification and mixture likelihoods. Without appropriate constrains on the scatter matrices of the components, these maximizations result in ill-posed problems. Moreover, without constrains, non-interesting or “spurious” clusters are often detected by the EM and CEM algorithms traditionally used for the maximization of the likelihood criteria. Considering an upper bound on the maximal ratio between the determinants of the scatter matrices seems to be a sensible way to overcome these problems by affine equivariant constraints. Unfortunately, problems still arise without also controlling the elements of the “shape” matrices. A new methodology is proposed that allows both control of the scatter matrices determinants and also the shape matrices elements. Some theoretical justification is given. A fast algorithm is proposed for this doubly constrained maximization. The methodology is also extended to robust model-based clustering problems.
Similar content being viewed by others
References
Andrews, J., Wickins, J., Boers, N., McNicholas, P.: teigen: an R package for model-based clustering and classification via the multivariate \(t\) distribution. J. Stat. Softw. 83, 1–32 (2018)
Bagnato, L., Punzo, A., Zoia, M.G.: The multivariate leptokurtic-normal distribution and its application in model-based clustering. Can. J. Stat. 45, 95–119 (2017)
Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821 (1993)
Baudry, J.P., Celeux, G.: EM for mixtures—initialization requires special care. Stat. Comput. 25, 713–726 (2015)
Biernacki, C., Chretien, S.: Degeneracy in the maximum likelihood estimation of univariate. Stat. Probab. Lett. 61, 373–382 (2003)
Biernacki, C., Lourme, A.: Stable and visualizable Gaussian parsimonious clustering models. Stat. Comput. 24, 953–969 (2014)
Browne, R., Subedi, S., McNicholas, P.: Constrained optimization for a subset of the Gaussian parsimonious clustering models (2013). preprint available at arXiv:1306.5824
Celeux, G., Govaert, A.: A classification EM algorithm for clustering and two stochastic versions. Comput. Stat. Data. 14, 315–332 (1992)
Cerioli, A., García-Escudero, L., Mayo-Iscar, A., Riani, M.: Finding the number of normal groups in model-based clustering via constrained likelihoods. J. Comput. Graph Stat. 27, 404–416 (2018)
Coretto, P., Hennig, C.: Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. J. Am. Stat. Assoc. 111, 1648–1659 (2016)
Dang, U., Browne, R., McNicholas, P.D.: Mixtures of multivariate power exponential distributions. Biometrics 71, 1081–1089 (2015)
Day, N.: Estimating the components of a mixture of normal distributions. Biometrika 56, 463–474 (1969)
Dotto, F., Farcomeni, A., García-Escudero, L., Mayo-Iscar, A.: A reweighting approach to robust clustering. Stat. Comput. 28, 477–493 (2018)
Flury, B., Riedwyl, H.: Multivariate Statistics, A Practical Approach. Cambridge University Press, Cambridge (1988)
Friedman, H., Rubin, J.: On some invariant criteria for grouping data. J. Am. Stat. Assoc. 63, 1159–1178 (1967)
Fritz, H., García-Escudero, L., Mayo-Iscar, A.: A fast algorithm for robust constrained clustering. Comput. Stat. Data Anal. 61, 124–136 (2013)
Gallegos, M., Ritter, G.: A robust method for cluster analysis. Ann. Stat. 33, 347–380 (2005)
Gallegos, M., Ritter, G.: Trimming algorithms for clustering contaminated grouped data and their robustness. Adv. Data Anal. Classif. 10, 135–167 (2009)
Gallegos, M.T.: Maximum likelihood clustering with outliers. In: Jajuga, K., Sokolowski, A., Bock, H. (eds.) Classification, Clustering and Data Analysis: Recent Advances and Applications, pp. 247–255. Springer, Berlin (2002)
García-Escudero, L., Gordaliza, A., Matrán, C., Mayo-Iscar, A.: A general trimming approach to robust cluster analysis. Ann. Stat. 36, 1324–1345 (2008)
García-Escudero, L., Gordaliza, A., Matrán, C., Mayo-Iscar, A.: Exploring the number of groups in robust model-based clustering. Stat. Comput. 21, 585–599 (2011)
García-Escudero, L., Gordaliza, A., Mayo-Iscar, A.: A review of robust clustering methods. Adv. Data Anal. Classif. 8, 27–43 (2014a)
García-Escudero, L., Gordaliza, A., Mayo-Iscar, A.: A constrained robust proposal for mixture modeling avoiding spurious solutions. Adv. Data Anal. Classif. 8, 27–43 (2014b)
García-Escudero, L., Gordaliza, A., Matrán, C., Mayo-Iscar, A.: Avoiding spurious local maximizers in mixture modeling. Stat. Comput. 25, 619–633 (2015)
García-Escudero, L., Gordaliza, A., Greselin, F., Ingrassia, S., Mayo-Iscar, A.: Eigenvalues and constraints in mixture modeling: geometric and computational issues. Adv. Data Anal. Classif. 12, 203–233 (2018)
Hathaway, R.: A constrained formulation of maximum likelihood estimation for normal mixture distributions. Ann. Stat. 13, 795–800 (1985)
Hennig, C., Liao, T.F.: How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification. J. R. Stat. Soc. Ser. C 62, 309–369 (2013)
Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2, 193–218 (1985)
Ingrassia, S., Rocci, R.: Constrained monotone EM algorithms for finite mixture of multivariate Gaussians. Comput. Stat. Data Anal. 51, 5339–5351 (2007)
Kiefer, J., Wolfowitz, J.: Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Stat. 27, 887–906 (1956)
Maitra, R., Melnykov, V.: Simulating data to study performance of finite mixture modeling and clustering algorithms. J. Comput. Graph Stat. 19, 354–376 (2010)
Maronna, R., Jacovkis, P.: Multivariate clustering procedures with variable metrics. Biometrics 30, 499–505 (1974)
McLachlan, G., Peel, D.: Finite Mixture Models. Wiley Series in Probability and Statistics. Wiley, New York (2000)
Neykov, N., Filzmoser, P., Dimova, R., Neytchev, P.: Robust fitting of mixtures using the trimmed likelihood estimator. Comput. Stat. Data Anal. 52, 299–308 (2007)
Peel, D., McLachlan, G.J.: Robust mixture modelling using the \(t\) distribution. Stat. Comput. 10, 339–348 (2000)
Punzo, A., McNicholas, P.D.: Parsimonious mixtures of multivariate contaminated normal distributions. Biomet. J. 58, 1506–1537 (2016)
Punzo, A., Mazza, A., McNicholas, P.D.: Contaminatedmixt: An R package for fitting parsimonious mixtures of multivariate contaminated normal distributions. J. Stat. Softw. 85, 1–25 (2018)
Riani, M., Perrotta, D., Torti, F.: FSDA: a Matlab toolbox for robust analysis and interactive data exploration. Chemom. Intell. Lab. Syst. 116, 17–32 (2012)
Riani, M., Cerioli, A., Perrotta, D., Torti, F.: Simulating mixtures of multivariate data with fixed cluster overlap in FSDA library. Adv. Data Anal. Classif. 9, 461–481 (2015)
Riani, M., Atkinson, A., Cerioli, A., Corbellini, A.: Efficient robust methods via monitoring for clustering and multivariate data analysis. Pattern Recognit. 88, 246–260 (2019)
Ritter, G.: Cluster Analysis and Variable Selection. CRC Press, Boca Raton (2014)
Rocci, R., Gattone, S., Di Mari, R.: A data driven equivariant approach to constrained Gaussian mixture modeling. Adv. Data Anal. Classif. 12, 235–260 (2018)
Rousseeuw, P., Van Driessen, K.: A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212–223 (1999)
Seo, B., Kim, D.: Root selection in normal mixture models. Comput. Stat. Data Anal. 56, 2454–2470 (2012)
Zhang, J., Liang, F.: Robust clustering using exponential power mixtures. Biometrics 66, 1078–1086 (2010)
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This research is partially supported by Spanish Ministerio de Economía y Competitividad, Grant MTM2017-86061-C2-1-P, and by Consejería de Educación de la Junta de Castilla y León and FEDER, Grant VA005P17 and VA002G18. This research benefits from the HPC (High Performance Computing) facility of the University of Parma, Italy. M.R. gratefully acknowledges support from the CRoNoS project, reference CRoNoS COST Action IC1408 and the University of Parma project “Statistics for fraud detection, with 237 applications to trade data and financial statement”. The authors also thank the editor, the associate editor, and the anonymous referees for their constructive comments.
Rights and permissions
About this article
Cite this article
García-Escudero, L.A., Mayo-Iscar, A. & Riani, M. Model-based clustering with determinant-and-shape constraint. Stat Comput 30, 1363–1380 (2020). https://doi.org/10.1007/s11222-020-09950-w
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11222-020-09950-w