Multi-view Clustering of Heterogeneous Health Data: Application to Systemic Sclerosis

José-García, Adán; Jacques, Julie; Filiot, Alexandre; Handl, Julia; Launay, David; Sobanski, Vincent; Dhaenens, Clarisse

doi:10.1007/978-3-031-14721-0_25

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13399)

Included in the following conference series:

International Conference on Parallel Problem Solving from Nature

Abstract

Electronic health records (EHRs) involve heterogeneous data types such as binary, numeric and categorical attributes. As traditional clustering approaches require the definition of a single proximity measure, different data types are typically transformed into a common format or amalgamated through a single distance function. Unfortunately, this early transformation step largely pre-determines the cluster analysis results and can cause information loss, as the relative importance of different attributes is not considered. This exploratory work aims to avoid this premature integration of attribute types prior to cluster analysis through a multi-objective evolutionary algorithm called MVMC. This approach makes it possible to integrate multiple data types into the clustering process, explore trade-offs between them, and determine consensus clusters that are supported across these data views. We evaluate our approach in a case study focusing on systemic sclerosis (SSc), a highly heterogeneous auto-immune disease that can be considered a representative example of an EHR data problem. Our results highlight the potential benefits of multi-view learning in an EHR context. Furthermore, this comprehensive classification, which integrates multiple and varied data sources, will help to better understand disease complications and treatment goals.
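
The contrast between a single amalgamated distance and separate data views can be made concrete with a minimal sketch. It is not the MVMC implementation: the toy records, the helper names (`mismatch`, `scaled_manhattan`, `view_distances`, `gower`) and the simplified, unweighted Gower-style dissimilarity are all assumptions made for illustration. The sketch computes one dissimilarity per view (binary, numeric, categorical) and shows that the amalgamated value is just an average of the per-view values with weights fixed by the number of attributes in each view.

```python
from itertools import combinations

# Hypothetical toy records (not data from the paper): three views per patient.
patients = [
    {"bin": [1, 0, 1], "num": [10.0, 2.0], "cat": ["a", "x"]},
    {"bin": [1, 0, 0], "num": [12.0, 2.5], "cat": ["a", "x"]},
    {"bin": [0, 1, 0], "num": [30.0, 0.5], "cat": ["b", "y"]},
]


def mismatch(x, y):
    """Fraction of positions where two binary or categorical vectors differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def numeric_ranges(records):
    """Range (max - min) of each numeric attribute, used for scaling."""
    columns = list(zip(*(r["num"] for r in records)))
    return [max(c) - min(c) for c in columns]


def scaled_manhattan(x, y, ranges):
    """Mean absolute difference after dividing each attribute by its range."""
    return sum(abs(a - b) / r for a, b, r in zip(x, y, ranges)) / len(x)


def view_distances(p, q, ranges):
    """One dissimilarity per data view, kept separate."""
    return {
        "bin": mismatch(p["bin"], q["bin"]),
        "num": scaled_manhattan(p["num"], q["num"], ranges),
        "cat": mismatch(p["cat"], q["cat"]),
    }


def gower(p, q, ranges):
    """Simplified Gower-style dissimilarity: unweighted mean over all attributes.

    Equivalent to averaging the per-view distances with weights equal to the
    number of attributes in each view, so the weighting is fixed in advance.
    """
    d = view_distances(p, q, ranges)
    sizes = {view: len(p[view]) for view in d}
    return sum(d[view] * sizes[view] for view in d) / sum(sizes.values())


ranges = numeric_ranges(patients)
for i, j in combinations(range(len(patients)), 2):
    per_view = view_distances(patients[i], patients[j], ranges)
    print(i, j, per_view, round(gower(patients[i], patients[j], ranges), 3))
```

In this toy setting, patients 0 and 1 differ in the binary view but are identical in the categorical view; the amalgamated value hides that disagreement, whereas the per-view dissimilarities keep it available to a multi-view method.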

Notes

1. SSc patients in the Internal Medicine Department of University Hospital of Lille, France, between October 2014 and December 2021 as part of the FHU PRECISE project (PREcision health in Complex Immune-mediated inflammatory diseaSEs); sample collection and usage authorization, CPP 2019-A01083-54.
2. Note that the Silhouette score is intended to compare different partitions produced by a single method. Usually, the Rand index is preferred to the Silhouette score to compare two solutions when a ground-truth partition is available [35].
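
The distinction drawn in Note 2 can be illustrated with a small, self-contained sketch. The Rand index is an external measure that needs a reference partition, whereas the Silhouette score is internal and uses only the data and the labels. The four points, the label vectors and the function names below are hypothetical, and Euclidean distance is an assumption; the paper's own experimental setup is not reproduced here.

```python
from itertools import combinations
from math import dist


def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two partitions agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def silhouette(points, labels):
    """Mean silhouette width with Euclidean distance (needs >= 2 clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster: width is defined as 0
            widths.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(widths) / len(widths)


# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
truth = [0, 0, 1, 1]
good = [1, 1, 0, 0]  # same grouping as truth, different label names
bad = [0, 1, 0, 1]   # splits both true groups

print(rand_index(truth, good), silhouette(points, good))
print(rand_index(truth, bad), silhouette(points, bad))
```

The Rand index ignores label names and only checks pairwise co-membership, which is why `good` matches `truth` perfectly even though its labels are swapped.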

References

1. Abdullin, A., Nasraoui, O.: Clustering heterogeneous data sets. In: American Web Congress, pp. 1–8. IEEE (2012)
2. Ahmad, A., Dey, L.: A k-mean clustering algorithm for mixed numeric and categorical data. Data Knowl. Eng. 63(2), 503–527 (2007)
3. Ahmad, A., Khan, S.S.: Survey of state-of-the-art mixed data clustering algorithms. IEEE Access 7, 31883–31902 (2019)
4. Ahmad, A., Khan, S.S.: initKmix-a novel initial partition generation algorithm for clustering mixed data using k-means-based clustering. Expert Syst. Appl. 167, 114149 (2021)
5. Aljalbout, E., Golkov, V., Siddiqui, Y., Strobel, M., Cremers, D.: Clustering with deep learning: taxonomy and new methods (2018). arXiv:1801.07648
6. Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3), 803–821 (1993)
7. Basel, A.J., Rui, F., Nandi, K.A.: Integrative cluster analysis in bioinformatics. John Wiley & Sons, USA (2015)
8. Bécue-Bertaut, M., Pagés, J.: Multiple factor analysis and clustering of a mixture of quantitative, categorical and frequency data. Comput. Stat. Data Anal. 52(6), 3255–3268 (2008)
9. Ben Ali, B., Massmoudi, Y.: K-means clustering based on Gower similarity coefficient: a comparative study. In: International Conference on Modeling, Simulation and Applied Optimization (ICMSAO), pp. 1–5. IEEE (2013)
10. Budiaji, W., Leisch, F.: Simple k-medoids partitioning algorithm for mixed variable data. Algorithms 12(9), 177 (2019)
11. de Carvalho, F., Lechevallier, Y., de Melo, F.M.: Partitioning hard clustering algorithms based on multiple dissimilarity matrices. Pattern Recogn. 45(1), 447–464 (2012)
12. de Carvalho, F.D.A., Lechevallier, Y., de Melo, F.M.: Partitioning hard clustering algorithms based on multiple dissimilarity matrices. Pattern Recogn. 45(1), 447–464 (2012)
13. Chiu, T., Fang, D., Chen, J., Wang, Y., Jeris, C.: A robust and scalable clustering algorithm for mixed type attributes in large database environment. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001), pp. 263–268. Association for Computing Machinery, New York, NY, USA (2001)
14. de Carvalho, F., Lechevallier, Y., Despeyroux, T., de Melo, F.M.: Advances in knowledge discovery and management. In: Zighed, F., Abdelkader, G., Gilles, P., Venturini, B.D. (eds.) Multi-view Clustering on Relational Data, pp. 37–51. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-02999-3_3
15. Foss, A.H., Markatou, M., Ray, B.: Distance metrics and clustering methods for mixed-type data. Int. Stat. Rev. 87(1), 80–109 (2019)
16. Fraley, C., Raftery, A.E.: How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput. J. 41(8), 578–588 (1998)
17. Green, P.E., Rao, V.R.: A note on proximity measures and cluster analysis. J. Mark. Res. 3(6), 359–364 (1969)
18. Harikumar, S., Surya, P.V.: K-medoid clustering for heterogeneous datasets. Procedia Comput. Sci. 70, 226–237 (2015)
19. Hsu, C.C., Chen, C.L., Su, Y.W.: Hierarchical clustering of mixed data based on distance hierarchy. Inf. Sci. 177(20), 4474–4492 (2007)
20. Huang, J., Ng, M., Rong, H., Li, Z.: Automated variable weighting in k-means type clustering. IEEE Trans. Pattern Anal. Mach. Intell. 27(5), 657–668 (2005)
21. Huang, Z.: Clustering large data sets with mixed numeric and categorical values. In: The Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 21–34 (1997)
22. Hunt, L., Jorgensen, M.: Clustering mixed data. WIREs Data Min. Knowl. Disc. 1(4), 352–361 (2011)
23. José-García, A., Gómez-Flores, W.: Automatic clustering using nature-inspired metaheuristics: a survey. Appl. Soft Comput. 41, 192–213 (2016)
24. José-García, A., Gómez-Flores, W.: A survey of cluster validity indices for automatic data clustering using differential evolution. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 314–322. ACM Press (2021). https://doi.org/10.1145/3449639.3459341
25. José-García, A., Handl, J.: On the interaction between distance functions and clustering criteria in multi-objective clustering. In: Ishibuchi, H., Zhang, Q., Cheng, R., Li, K., Li, H., Wang, H., Zhou, A. (eds.) EMO 2021. LNCS, vol. 12654, pp. 504–515. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72062-9_40
26. José-García, A., Handl, J., Gómez-Flores, W., Garza-Fabre, M.: Many-view clustering: an illustration using multiple dissimilarity measures. In: Genetic and Evolutionary Computation Conference - GECCO 2019, pp. 213–214. ACM Press, Prague, Czech Republic (2019)
27. José-García, A., Handl, J., Gómez-Flores, W., Garza-Fabre, M.: An evolutionary many-objective approach to multiview clustering using feature and relational data. Appl. Soft Comput. 108, 107425 (2021)
28. Landi, I., et al.: Deep representation learning of electronic health records to unlock patient stratification at scale. NPJ Digital Med. 3(1), 96 (2020)
29. Li, C., Biswas, G.: Unsupervised learning with mixed numeric and nominal data. IEEE Trans. Knowl. Data Eng. 14(4), 673–690 (2002)
30. Liu, C., Chen, Q., Chen, Y., Liu, J.: A fast multiobjective fuzzy clustering with multimeasures combination. Math. Prob. Eng. 2019, 1–21 (2019)
31. Liu, C., Liu, J., Peng, D., Wu, C.: A general multiobjective clustering approach based on multiple distance measures. IEEE Access 6, 41706–41719 (2018)
32. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11), 2579–2605 (2008)
33. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297. University of California Press (1967)
34. Zhang, Q., Li, H.: MOEA/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Trans. Evol. Comput. 11(6), 712–731 (2007)
35. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
36. Shirkhorshidi, A.S., Aghabozorgi, S., Wah, T.Y.: A comparison study on similarity and dissimilarity measures in clustering continuous data. PLOS ONE 10(12), e0144059 (2015)
37. Sobanski, V., Giovannelli, J., Allanore, Y., et al.: Phenotypes determined by cluster analysis and their survival in the prospective European Scleroderma Trials and Research cohort of patients with systemic sclerosis. Arthritis Rheumatol. 71(9), 1553–1570 (2019)
38. Theodoridis, S., Koutroumbas, K.: Pattern Recognition. Elsevier Inc., Amsterdam (2009)
39. Vandromme, M., Jacques, J., Taillard, J., Jourdan, L., Dhaenens, C.: A biclustering method for heterogeneous and temporal medical data. IEEE Trans. Knowl. Data Eng. 34(2), 506–518 (2022)
40. van de Velden, M., Iodice D’Enza, A., Markos, A.: Distance-based clustering of mixed data. WIREs Comput. Stat. 11(3), e1456 (2019)
41. Wei, M., Chow, T., Chan, R.: Clustering heterogeneous data with k-means by mutual information-based unsupervised feature transformation. Entropy 17(3), 1535–1548 (2015)

Acknowledgments

The authors are grateful to the University of Lille, CHU Lille, and INSERM, funded by the MEL through the I-Site cluster humAIn@Lille.

Author information

Authors and Affiliations

Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, 59000, Lille, France
Adán José-García, Julie Jacques & Clarisse Dhaenens
FGES, Université Catholique de Lille, 59000, Lille, France
Julie Jacques
Univ. Lille, Inserm, CHU Lille, U1286, INFINITE, 59000, Lille, France
Alexandre Filiot & Vincent Sobanski
Univ. Lille, Inserm, CHU Lille, Service de Médecine Interne et Immunologie Clinique, CeRAINO, U1286, INFINITE, 59000, Lille, France
David Launay
Institut Universitaire de France (IUF), Paris, France
Vincent Sobanski
Alliance Manchester Business School, University of Manchester, Manchester, UK
Julia Handl

Corresponding author

Correspondence to Adán José-García.

Editor information

Editors and Affiliations

TU Dortmund, Dortmund, Germany
Günter Rudolph
Leiden University, Leiden, The Netherlands
Anna V. Kononova
Shinshu University, Nagano, Japan
Hernán Aguirre
Technische Universität Dresden, Dresden, Germany
Pascal Kerschke
University of Stirling, Stirling, UK
Gabriela Ochoa
Jožef Stefan Institute, Ljubljana, Slovenia
Tea Tušar

Appendix

This Appendix includes figures complementing the results of the experiments presented in Sect. 5. From Fig. 5 (Appendix), it is clear that the determined number of clusters is three, as the Silhouette index attains its highest value at \(k=3\). Also, from the Pareto front approximations obtained by these configurations, a substantial influence of the {Num} view over the {Bin} and {Gower} views, respectively, is observed. Accordingly, the clustering solutions and the weighted embedding space are remarkably similar between these two data-view configurations.
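
As a rough illustration of what a Pareto front approximation over two data views contains, the sketch below filters candidate clusterings down to the non-dominated ones. It is not the MVMC algorithm: the candidate names, the objective values and the assumption that each view contributes one objective to be minimised are all hypothetical.

```python
def dominates(u, v):
    """True if u is no worse than v on every objective and strictly better
    on at least one (all objectives are minimised)."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def pareto_front(scores):
    """Names of candidates that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]


# Hypothetical objective values per candidate clustering:
# (objective under the numeric view, objective under the binary view).
scores = {
    "A": (0.20, 0.90),  # best for the numeric view
    "B": (0.35, 0.40),  # compromise between the two views
    "C": (0.80, 0.25),  # best for the binary view
    "D": (0.50, 0.60),  # dominated by B
    "E": (0.20, 0.95),  # dominated by A
}

print(pareto_front(scores))  # ['A', 'B', 'C']
```

Each retained candidate represents a different trade-off between views; a front whose solutions cluster near one end would suggest that a single view is driving the result.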

About this paper

Cite this paper

José-García, A. et al. (2022). Multi-view Clustering of Heterogeneous Health Data: Application to Systemic Sclerosis. In: Rudolph, G., Kononova, A.V., Aguirre, H., Kerschke, P., Ochoa, G., Tušar, T. (eds) Parallel Problem Solving from Nature – PPSN XVII. PPSN 2022. Lecture Notes in Computer Science, vol 13399. Springer, Cham. https://doi.org/10.1007/978-3-031-14721-0_25

DOI: https://doi.org/10.1007/978-3-031-14721-0_25
Published: 15 August 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-14720-3
Online ISBN: 978-3-031-14721-0
eBook Packages: Computer Science, Computer Science (R0)
