Abstract
Finding a set of nested partitions of a dataset is useful for uncovering relevant structure at different scales, and is often handled with data-dependent methodologies. In this paper, we introduce a general two-step methodology for model-based hierarchical clustering. Taking the integrated classification likelihood criterion as an objective function, this work applies to every discrete latent variable model (DLVM) for which this quantity is tractable. The first step of the methodology maximizes the criterion with respect to the partition. To address the known problem of sub-optimal local maxima found by greedy hill-climbing heuristics, we introduce a new hybrid algorithm, based on a genetic algorithm, which efficiently explores the space of solutions. The resulting algorithm carefully combines and merges different solutions, and allows the joint inference of the number K of clusters and of the clusters themselves. Starting from this natural partition, the second step of the methodology relies on a bottom-up greedy procedure to extract a hierarchy of clusters. In a Bayesian context, this is achieved by treating the Dirichlet cluster proportion prior parameter \(\alpha \) as a regularization term controlling the granularity of the clustering. A new approximation of the criterion is derived as a log-linear function of \(\alpha \), yielding a simple functional form for the merge decision criterion. This second step allows the exploration of the clustering at coarser scales. The proposed approach is compared with existing strategies in both simulated and real settings, and its results prove particularly relevant. A reference implementation of this work is available in the R package greed accompanying the paper.
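As a rough illustration of the two-step scheme sketched above, the toy code below runs a greedy label-swap search (a plain hill-climbing stand-in for the paper's hybrid genetic algorithm) followed by greedy bottom-up merges that record a hierarchy. The `score` function is a hypothetical stand-in criterion (negative within-cluster sum of squares), not the exact ICL maximized in the paper.

```python
# Illustrative sketch only: greedy partition search, then greedy merges.
import numpy as np

def score(X, z):
    # Hypothetical stand-in criterion: negative within-cluster sum of
    # squares (the paper maximizes the exact ICL instead).
    s = 0.0
    for k in np.unique(z):
        pts = X[z == k]
        s -= ((pts - pts.mean(axis=0)) ** 2).sum()
    return s

def greedy_swaps(X, z):
    # Step 1: hill climbing on single-point relabelings until no swap
    # improves the criterion (a local maximum).
    improved = True
    while improved:
        improved = False
        for i in range(len(z)):
            best_k, best_s = z[i], score(X, z)
            for k in np.unique(z):
                z[i] = k
                s = score(X, z)
                if s > best_s:
                    best_k, best_s, improved = k, s, True
            z[i] = best_k
    return z

def greedy_merges(X, z):
    # Step 2: repeatedly apply the best cluster fusion, recording each
    # level of the resulting hierarchy down to a single cluster.
    history = [z.copy()]
    while len(np.unique(z)) > 1:
        ks = np.unique(z)
        best = None
        for a in range(len(ks)):
            for b in range(a + 1, len(ks)):
                zm = np.where(z == ks[b], ks[a], z)
                s = score(X, zm)
                if best is None or s > best[0]:
                    best = (s, zm)
        z = best[1]
        history.append(z.copy())
    return history

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(3, 0.1, (5, 2))])
z = rng.integers(0, 4, 10)
z = greedy_swaps(X, z)
hier = greedy_merges(X, z)
print(len(np.unique(hier[0])), "clusters refined down to", len(np.unique(hier[-1])))
```

The second step explores coarser and coarser partitions, mirroring the role of \(\alpha \) as a granularity control in the paper's Bayesian formulation.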
Notes
Or a product of Dirichlet distributions \({{\,\mathrm{Dir}\,}}_{{K_{r}}}(\varvec{\alpha }_r) \times {{\,\mathrm{Dir}\,}}_{{K_{c}}}(\varvec{\alpha }_c)\) in the case of co-clustering with the LBM. Apart from an additional notational burden, the rest of the discussion extends easily to this case, which is discussed in detail in the Supplementary Materials.
Available at http://github.com/comeetie/greed
Available at http://www-personal.umich.edu/~mejn/netdata/.
Available at http://data.assemblee-nationale.fr/.
Available at http://www.redhotjazz.com/.
References
Adamic LA, Glance N (2005) The political blogosphere and the 2004 U.S. election: divided they blog. In: Proceedings of the 3rd international workshop on link discovery (LinkKDD '05). ACM, Chicago, pp 36–43
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Automatic Control 19(6):716–723
Andrews JL, McNicholas PD (2013) Using evolutionary algorithms for model-based clustering. Pattern Recognit Lett 34(9):987–992
Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3):803–821
Bar-Joseph Z, Gifford DK, Jaakkola TS (2001) Fast optimal leaf ordering for hierarchical clustering. Bioinformatics 17(1):S22–S29
Bates D, Maechler M (2019) Matrix: sparse and dense matrix classes and methods. R package version 1.2-17
Baudry J-P et al (2010) Combining mixture components for clustering. J Comput Graph Stat 19(2):332–353
Bengtsson H (2019) future: unified parallel and distributed processing in R for everyone. R package version 1.13.0
Bertoletti M, Friel N, Rastelli R (2015) Choosing the number of clusters in a finite mixture model using an exact integrated completed likelihood criterion. METRON 73(2):177–199
Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell 22(7):719–725
Biernacki C, Celeux G, Govaert G (2010) Exact and Monte Carlo calculations of integrated likelihoods for the latent class model. J Stat Plan Inference 140:2991–3002
Blei DM, Kucukelbir A, McAuliffe JD (2017) Variational inference: a review for statisticians. J Am Stat Assoc 112(518):859–877
Bouveyron C et al (2019) Model-based clustering and classification for data science: with applications in R, vol 50. Cambridge University Press, Cambridge
Cole RM (1998) Clustering with genetic algorithms
Côme E, Latouche P (2015) Model selection and clustering in stochastic block models based on the exact integrated complete data likelihood. Stat Model 15(6):564–589
Côme E et al (2021) Supplementary materials: hierarchical clustering with discrete latent variable models and the integrated classification likelihood. Adv Data Anal Classif
Corneli M, Latouche P, Rossi F (2016) Exact ICL maximization in a non-stationary temporal extension of the stochastic block model for dynamic networks. Neurocomputing 192:81–91
Daudin J, Picard F, Robin S (2008) A mixture model for random graphs. Stat Comput 18:173–183
Dempster AP, Laird NM, Rubin DB (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. J Royal Stat Soc Ser B 39(1):1–38
Eddelbuettel D, Balamuta JJ (2017) Extending R with C++: a brief introduction to Rcpp. PeerJ Preprints 5:e3188v1
Eddelbuettel D, Sanderson C (2014) RcppArmadillo: Accelerating R with high-performance C++ linear algebra. Comput Stat Data Anal 71:1054–1063
Eiben AE, Smith JE (2004) Introduction to evolutionary computing, 2nd edn. Springer, Berlin
Everitt BS, Landau S, Leese M (2011) Cluster analysis, 5th edn. Wiley series in probability and statistics. Wiley, Chichester
Fraley C (1998) Algorithms for model-based Gaussian hierarchical clustering. SIAM J Sci Comput 20(1):270–281
Fruhwirth-Schnatter S, Celeux G, Robert CP (2019) Handbook of mixture analysis. Chapman and Hall/CRC, London
Gelman A et al (2004) Bayesian data analysis, 2nd edn. Chapman & Hall/CRC, London
Gleiser PM, Danon L (2003) Community structure in Jazz. Adv Complex Syst 6(4):565–573
Govaert G, Nadif M (2010) Latent block model for contingency table. Commun Stat Theory Methods 39(3):416–425
Heller KA, Ghahramani Z (2005) Bayesian hierarchical clustering. In: Proceedings of the 22nd international conference on machine learning. ACM, pp 297–304
Hruschka ER et al (2009) A survey of evolutionary algorithms for clustering. IEEE Trans Syst Man Cybern Part C Appl Rev 39(2):133–155
Karrer B, Newman MEJ (2011) Stochastic blockmodels and community structure in networks. Phys Rev E 83(1):016107
Mariadassou M, Robin S, Vacher C (2010) Uncovering latent structure in valued graphs: a variational approach. Ann Appl Stat 4(2):715–742
Matias C, Robin S (2014) Modeling heterogeneity in random graphs through latent space models: a selective review. ESAIM Proc Surv 47:55–74
McLachlan G, Peel D (2000) Finite mixture models. Wiley, Hoboken
McLachlan GJ, Krishnan T (2007) The EM algorithm and extensions, vol 382. Wiley, Hoboken
Murtagh F, Raftery AE (1984) Fitting straight lines to point patterns. Pattern Recognit 17(5):479–483
Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Phys. Rev. E 69(2):026113
Newman MEJ, Reinert G (2016) Estimating the number of communities in a network. Phys Rev Lett 117(7):078301
Nowicki K, Snijders TAB (2001) Estimation and prediction for stochastic blockstructures. J Am Stat Assoc 96(455):1077–1087
Peixoto TP (2014) Hierarchical block structures and high-resolution model selection in large networks. Phys Rev X 4(1):011047
Qin T, Rohe K (2013) Regularized spectral clustering under the degree-corrected stochastic blockmodel. In: Advances in neural information processing systems (NIPS)
R Core Team (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Riolo MA et al (2017) Efficient method for estimating the number of communities in a network. Phys Rev E 96(3):032310
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Scrucca L (2016) Genetic algorithms for subset selection in model-based clustering. In: Unsupervised learning algorithms. Springer, pp 55–70
Sneath PHA (1957) The application of computers to taxonomy. Microbiology 17(1):201–226
Sokal RR, Michener CD (1958) A statistical method for evaluating systematic relationships. Univ Kansas Sci Bull 38:1409–1438
Tessier D et al (2006) Evolutionary latent class clustering of qualitative data. Technical report
Vinh NX, Epps J, Bailey J (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J Mach Learn Res 11:2837–2854
Wang YJ, Wong GY (1987) Stochastic blockmodels for directed graphs. J Am Stat Assoc 82(397):8–19
Ward JH Jr (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58(301):236–244
Wyse J, Friel N, Latouche P (2017) Inferring structure in bipartite networks using the latent blockmodel and exact ICL. Network Sci 5(1):45–69
Zhao Y, Levina E, Zhu J (2012) Consistency of community detection in networks under degree-corrected stochastic block models. Ann Stat 40(4):2266–2292
Zhong S, Ghosh J (2003) A unified framework for model-based clustering. J Mach Learn Res 4:1001–1037
Zhu Y, Yan X, Moore C (2014) Oriented and degree-generated block models: generating and inferring communities with inhomogeneous degree distributions. J Complex Netw 2(1):1–18
Zreik R, Latouche P, Bouveyron C (2016) The dynamic random subgraph model for the clustering of evolving networks. Comput Stat
Acknowledgements
The authors would like to thank the editor and the two anonymous referees for their fruitful comments which helped to improve this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
A Dealing with bipartitions
Genetic algorithm The hybrid algorithm presented in Sect. 2 can easily be extended to the co-clustering problem, which simultaneously seeks a partition of the n rows and a partition of the p columns of a data matrix \(\varvec{X}\in {\mathbb {R}}^{n\times p}\). In this case, we work with a partition \(\mathcal {P}\) of \(\{1,\ldots , n+p\}\) under the additional constraint that it decomposes into two disjoint sets of clusters, one forming a partition of the row indices \(\{1,\ldots ,n\}\) and the other a partition of the column indices \(\{n+1,\ldots ,n+p\}\).
This constraint can easily be incorporated by defining \({{\,\mathrm{ICL_{\textit{ex}}}\,}}(\mathcal {P})=-\infty \) for partitions that do not fulfill it, and by initializing the algorithm with admissible solutions. This suffices to ensure that the obtained solutions remain compatible with the constraint, since the admissible set of partitions is closed under the crossover and mutation operations used by the algorithm.
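A minimal sketch of this device, under stated assumptions: `icl` is a hypothetical stand-in scoring function (here a simple cluster-count penalty, not the exact ICL), and any partition mixing row and column indices in one cluster receives a score of minus infinity, so the search never leaves the admissible set.

```python
# Sketch of the -inf penalty for non-admissible bipartitions.
import math

def admissible(partition, n, p):
    # A partition of {0,...,n+p-1} is admissible if no cluster mixes
    # row indices (< n) and column indices (>= n).
    for cluster in partition:
        has_row = any(i < n for i in cluster)
        has_col = any(i >= n for i in cluster)
        if has_row and has_col:
            return False
    return True

def constrained_icl(partition, n, p, icl):
    # Non-admissible partitions are scored -inf, so any maximizer
    # started from an admissible solution stays admissible.
    return icl(partition) if admissible(partition, n, p) else -math.inf

# Toy check with a dummy criterion (n = p = 2):
icl = lambda part: -len(part)
ok  = [{0, 1}, {2, 3}]   # rows {0,1}, columns {2,3}: admissible
bad = [{0, 2}, {1, 3}]   # mixes rows and columns: rejected
print(constrained_icl(ok, 2, 2, icl))   # -2
print(constrained_icl(bad, 2, 2, icl))  # -inf
```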
Hierarchical algorithm Furthermore, the hierarchical methodology of Sect. 3 also extends easily to bi-partitions. Indeed, the LBM prior, a product of two symmetric Dirichlet distributions with a common parameter \(\alpha \), yields a factorized integrated likelihood for \(p(\varvec{Z}\mid \alpha )\) (see Côme et al. 2021 for a detailed discussion). Thus, the \({{\,\mathrm{ICL_{\textit{lin}}}\,}}\) approximation of Equation (10) is still log-linear in \(\alpha \), with intercept \(I(\varvec{Z}) = I(\varvec{Z}^{r}) + I(\varvec{Z}^{c})\), the sum of the row and column intercepts defined in Equation (9). Hence, under the constraint that a row cluster is never merged with a column cluster, one can look for the best row or column fusion at each step, thereby building two dendrograms in parallel with a shared \((\alpha _f)_f\) sequence.
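The constrained fusion step can be sketched as follows; `merge_gain` is a hypothetical stand-in for the paper's merge decision criterion, and the point is only the search structure: candidate merges are restricted to row-row or column-column pairs, so the two dendrograms grow in parallel.

```python
# Sketch of one constrained fusion step for a bipartition.

def best_fusion(row_clusters, col_clusters, merge_gain):
    # Enumerate within-side pairs only: a row cluster is never a
    # candidate for merging with a column cluster.
    candidates = []
    for side, clusters in (("row", row_clusters), ("col", col_clusters)):
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                candidates.append((merge_gain(clusters[a], clusters[b]), side, a, b))
    return max(candidates)  # highest-gain admissible fusion

def fuse(row_clusters, col_clusters, merge_gain):
    # Apply the best fusion in place and report which side it touched.
    _, side, a, b = best_fusion(row_clusters, col_clusters, merge_gain)
    clusters = row_clusters if side == "row" else col_clusters
    clusters[a] = clusters[a] | clusters[b]
    del clusters[b]
    return side

# Toy run: a dummy gain that favours merging small clusters first.
gain = lambda c1, c2: -len(c1 | c2)
rows = [{0}, {1}, {2, 3}]
cols = [{4, 5, 6}, {7}]
side = fuse(rows, cols, gain)
print(side, rows, cols)
```

Iterating this step until each side has a single cluster produces the two parallel dendrograms described above.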
About this article
Cite this article
Côme, E., Jouvin, N., Latouche, P. et al. Hierarchical clustering with discrete latent variable models and the integrated classification likelihood. Adv Data Anal Classif 15, 957–986 (2021). https://doi.org/10.1007/s11634-021-00440-z