Multi-view Clustering of Heterogeneous Health Data: Application to Systemic Sclerosis

  • Conference paper
Parallel Problem Solving from Nature – PPSN XVII (PPSN 2022)

Abstract

Electronic health records (EHRs) involve heterogeneous data types such as binary, numeric and categorical attributes. As traditional clustering approaches require the definition of a single proximity measure, different data types are typically transformed into a common format or amalgamated through a single distance function. Unfortunately, this early transformation step largely pre-determines the cluster analysis results and can cause information loss, as the relative importance of different attributes is not considered. This exploratory work aims to avoid this premature integration of attribute types prior to cluster analysis through a multi-objective evolutionary algorithm called MVMC. This approach makes it possible to integrate multiple data types into the clustering process, explore trade-offs between them, and determine consensus clusters that are supported across these data views. We evaluate our approach in a case study focusing on systemic sclerosis (SSc), a highly heterogeneous auto-immune disease that can be considered a representative example of an EHR data problem. Our results highlight the potential benefits of multi-view learning in an EHR context. Furthermore, this comprehensive classification, which integrates multiple and varied data sources, will help to better understand disease complications and treatment goals.
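To make the concern about early integration concrete, the following is a minimal sketch and not the MVMC algorithm itself. It assumes two hypothetical data views (binary and numeric) on toy records that are not taken from the SSc cohort, keeps one dissimilarity matrix per view, and then merges them with a fixed weight `w`. The distance choices (Hamming, range-scaled Manhattan) and all function names are illustrative assumptions.

```python
def hamming(a, b):
    """Fraction of binary attributes on which two records disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def range_scaled_manhattan(a, b, ranges):
    """Mean absolute difference, each attribute scaled by its observed range."""
    return sum(abs(x - y) / r for x, y, r in zip(a, b, ranges)) / len(a)


def view_matrix(rows, dist):
    """Pairwise dissimilarity matrix for one data view."""
    n = len(rows)
    return [[dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


def combine(d_bin, d_num, w):
    """Early integration: weight w on the binary view, (1 - w) on the numeric view."""
    n = len(d_bin)
    return [[w * d_bin[i][j] + (1 - w) * d_num[i][j] for j in range(n)]
            for i in range(n)]


def nearest(d, i):
    """Index of the record closest to record i (excluding i itself)."""
    return min((j for j in range(len(d)) if j != i), key=lambda j: d[i][j])


# Hypothetical toy records: one tuple per patient in each view.
binary_view = [(1, 0, 1), (1, 0, 1), (0, 1, 0)]
numeric_view = [(10.0, 2.0), (50.0, 9.0), (12.0, 2.5)]
ranges = [max(col) - min(col) for col in zip(*numeric_view)]

d_bin = view_matrix(binary_view, hamming)
d_num = view_matrix(
    numeric_view, lambda a, b: range_scaled_manhattan(a, b, ranges)
)

for w in (0.2, 0.5, 0.8):
    neighbour = nearest(combine(d_bin, d_num, w), 0)
    print(f"w={w}: nearest neighbour of record 0 is record {neighbour}")
```

With these toy values, the nearest neighbour of record 0 switches from record 2 to record 1 as the weight on the binary view grows, which is the sense in which a fixed amalgamation pre-determines the grouping. A multi-view approach keeps `d_bin` and `d_num` separate and explores the trade-off instead of fixing `w` in advance.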

Notes

  1. SSc patients in the Internal Medicine Department of University Hospital of Lille, France, between October 2014 and December 2021 as part of the FHU PRECISE project (PREcision health in Complex Immune-mediated inflammatory diseaSEs); sample collection and usage authorization, CPP 2019-A01083-54.

  2. Note that the Silhouette score is intended to compare different partitions produced by a single method. Usually, the Rand index is preferred to the Silhouette score to compare two solutions when a ground-truth partition is available [35].
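As an illustration of the Rand index mentioned in note 2, here is a minimal pure-Python sketch. The label vectors are hypothetical toy data, and the function name `rand_index` is an assumption for this example. The index is the fraction of record pairs on which two partitions agree, that is, pairs placed together in both or apart in both.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of record pairs on which two partitions agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same records")
    if len(labels_a) < 2:
        raise ValueError("at least two records are needed to form a pair")
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical ground truth and two candidate clustering solutions.
truth = [0, 0, 0, 1, 1, 1]
solution_a = [1, 1, 1, 0, 0, 0]  # same partition, different label names
solution_b = [0, 0, 1, 1, 1, 1]  # one record assigned to the wrong group

print(rand_index(truth, solution_a))
print(rand_index(truth, solution_b))
```

Because only pair co-membership is compared, the index is unaffected by how clusters are named, so `solution_a` scores 1.0 against `truth` even though its labels are swapped.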

References

  1. Abdullin, A., Nasraoui, O.: Clustering heterogeneous data sets. In: American Web Congress, pp. 1–8. IEEE (2012)

  2. Ahmad, A., Dey, L.: A k-mean clustering algorithm for mixed numeric and categorical data. Data Knowl. Eng. 63(2), 503–527 (2007)

  3. Ahmad, A., Khan, S.S.: Survey of state-of-the-art mixed data clustering algorithms. IEEE Access 7, 31883–31902 (2019)

  4. Ahmad, A., Khan, S.S.: initKmix-a novel initial partition generation algorithm for clustering mixed data using k-means-based clustering. Expert Syst. Appl. 167, 114149 (2021)

  5. Aljalbout, E., Golkov, V., Siddiqui, Y., Strobel, M., Cremers, D.: Clustering with deep learning: taxonomy and new methods (2018). arXiv:1801.07648

  6. Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3), 803–821 (1993)

  7. Basel, A.J., Rui, F., Nandi, K.A.: Integrative cluster analysis in bioinformatics. John Wiley & Sons, USA (2015)

  8. Bécue-Bertaut, M., Pagès, J.: Multiple factor analysis and clustering of a mixture of quantitative, categorical and frequency data. Comput. Stat. Data Anal. 52(6), 3255–3268 (2008)

  9. Ben Ali, B., Massmoudi, Y.: K-means clustering based on Gower similarity coefficient: a comparative study. In: International Conference on Modeling, Simulation and Applied Optimization (ICMSAO), pp. 1–5. IEEE (2013)

  10. Budiaji, W., Leisch, F.: Simple k-medoids partitioning algorithm for mixed variable data. Algorithms 12(9), 177 (2019)

  11. de Carvalho, F., Lechevallier, Y., de Melo, F.M.: Partitioning hard clustering algorithms based on multiple dissimilarity matrices. Pattern Recogn. 45(1), 447–464 (2012)

  12. de Carvalho, F.D.A., Lechevallier, Y., de Melo, F.M.: Partitioning hard clustering algorithms based on multiple dissimilarity matrices. Pattern Recogn. 45(1), 447–464 (2012)

  13. Chiu, T., Fang, D., Chen, J., Wang, Y., Jeris, C.: A robust and scalable clustering algorithm for mixed type attributes in large database environment. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001), pp. 263–268. Association for Computing Machinery, New York, NY, USA (2001)

  14. de Carvalho, F., Lechevallier, Y., Despeyroux, T., de Melo, F.M.: Advances in knowledge discovery and management. In: Zighed, F., Abdelkader, G., Gilles, P., Venturini, B.D. (eds.) Multi-view Clustering on Relational Data, pp. 37–51. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-02999-3_3

  15. Foss, A.H., Markatou, M., Ray, B.: Distance metrics and clustering methods for mixed-type data. Int. Stat. Rev. 87(1), 80–109 (2019)

  16. Fraley, C., Raftery, A.E.: How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput. J. 41(8), 578–588 (1998)

  17. Green, P.E., Rao, V.R.: A note on proximity measures and cluster analysis. J. Mark. Res. 3(6), 359–364 (1969)

  18. Harikumar, S., Surya, P.V.: K-medoid clustering for heterogeneous datasets. Procedia Comput. Sci. 70, 226–237 (2015)

  19. Hsu, C.C., Chen, C.L., Su, Y.W.: Hierarchical clustering of mixed data based on distance hierarchy. Inf. Sci. 177(20), 4474–4492 (2007)

  20. Huang, J., Ng, M., Rong, H., Li, Z.: Automated variable weighting in k-means type clustering. IEEE Trans. Pattern Anal. Mach. Intell. 27(5), 657–668 (2005)

  21. Huang, Z.: Clustering large data sets with mixed numeric and categorical values. In: The Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 21–34 (1997)

  22. Hunt, L., Jorgensen, M.: Clustering mixed data. WIREs Data Min. Knowl. Disc. 1(4), 352–361 (2011)

  23. José-García, A., Gómez-Flores, W.: Automatic clustering using nature-inspired metaheuristics: a survey. Appl. Soft Comput. 41, 192–213 (2016)

  24. José-García, A., Gómez-Flores, W.: A survey of cluster validity indices for automatic data clustering using differential evolution. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 314–322. ACM Press (2021). https://doi.org/10.1145/3449639.3459341

  25. José-García, A., Handl, J.: On the interaction between distance functions and clustering criteria in multi-objective clustering. In: Ishibuchi, H., Zhang, Q., Cheng, R., Li, K., Li, H., Wang, H., Zhou, A. (eds.) EMO 2021. LNCS, vol. 12654, pp. 504–515. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72062-9_40

  26. José-García, A., Handl, J., Gómez-Flores, W., Garza-Fabre, M.: Many-view clustering: an illustration using multiple dissimilarity measures. In: Genetic and Evolutionary Computation Conference - GECCO 2019, pp. 213–214. ACM Press, Prague, Czech Republic (2019)

  27. José-García, A., Handl, J., Gómez-Flores, W., Garza-Fabre, M.: An evolutionary many-objective approach to multiview clustering using feature and relational data. Appl. Soft Comput. 108, 107425 (2021)

  28. Landi, I., et al.: Deep representation learning of electronic health records to unlock patient stratification at scale. NPJ Digital Med. 3(1), 96 (2020)

  29. Li, C., Biswas, G.: Unsupervised learning with mixed numeric and nominal data. IEEE Trans. Knowl. Data Eng. 14(4), 673–690 (2002)

  30. Liu, C., Chen, Q., Chen, Y., Liu, J.: A fast multiobjective fuzzy clustering with multimeasures combination. Math. Prob. Eng. 2019, 1–21 (2019)

  31. Liu, C., Liu, J., Peng, D., Wu, C.: A general multiobjective clustering approach based on multiple distance measures. IEEE Access 6, 41706–41719 (2018)

  32. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11), 2579–2605 (2008)

  33. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297. University of California Press (1967)

  34. Zhang, Q., Li, H.: MOEA/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Trans. Evol. Comput. 11(6), 712–731 (2007)

  35. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)

  36. Shirkhorshidi, A.S., Aghabozorgi, S., Wah, T.Y.: A comparison study on similarity and dissimilarity measures in clustering continuous data. PLOS ONE 10(12), e0144059 (2015)

  37. Sobanski, V., Giovannelli, J., Allanore, Y., et al.: Phenotypes determined by cluster analysis and their survival in the prospective European Scleroderma Trials and Research cohort of patients with systemic sclerosis. Arthritis Rheumatol. 71(9), 1553–1570 (2019)

  38. Theodoridis, S., Koutroumbas, K.: Pattern Recognition. Elsevier Inc., Amsterdam (2009)

  39. Vandromme, M., Jacques, J., Taillard, J., Jourdan, L., Dhaenens, C.: A biclustering method for heterogeneous and temporal medical data. IEEE Trans. Knowl. Data Eng. 34(2), 506–518 (2022)

  40. van de Velden, M., Iodice D’Enza, A., Markos, A.: Distance-based clustering of mixed data. WIREs Comput. Stat. 11(3), e1456 (2019)

  41. Wei, M., Chow, T., Chan, R.: Clustering heterogeneous data with k-means by mutual information-based unsupervised feature transformation. Entropy 17(3), 1535–1548 (2015)

Acknowledgments

The authors are grateful to the University of Lille, CHU Lille, and INSERM, funded by the MEL through the I-Site cluster humAIn@Lille.

Author information

Corresponding author

Correspondence to Adán José-García.

Appendix

This Appendix includes figures complementing the results of the experiments presented in Sect. 5. Fig. 5 shows that the determined number of clusters is three, as the Silhouette index reaches its highest value at \(k=3\). In addition, the Pareto front approximations obtained for the two configurations, {Bin,Num} and {Num,Gower}, show a substantial influence of the {Num} view over the {Bin} and {Gower} views, respectively. Accordingly, the clustering solutions and the weighted embedding spaces are remarkably similar between these two data-view configurations.

Fig. 5. MVMC clustering solutions for two data-view configurations, {Bin,Num} and {Num,Gower}. Each configuration includes (i) the convergence plots shown in blue and gray, with the best solution marked in red; (ii) the Pareto front approximation corresponding to the estimated k value; (iii) the clustering solution, which is visualized in a weighted embedding space associated with the data views in the configuration. (Color figure online)
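The selection of \(k\) by the highest Silhouette index, as described in this Appendix, can be sketched as follows. This is a minimal illustration under stated assumptions and not the MVMC procedure: the one-dimensional points and the candidate labelings for each \(k\) are hypothetical, the candidate partitions could come from any clustering method, and the function name `silhouette` is chosen for this example. Singleton clusters are scored 0, following Rousseeuw's definition [35].

```python
from statistics import mean


def silhouette(points, labels, dist=lambda a, b: abs(a - b)):
    """Mean silhouette width of a partition (higher is better)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster: width defined as 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = mean(dist(points[i], points[j]) for j in own)
        # b: mean distance to the closest other cluster
        b = min(
            mean(dist(points[i], points[j]) for j in members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return mean(scores)


# Hypothetical 1-D data with three visible groups.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]

# Hypothetical candidate partitions, one per value of k.
candidates = {
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
    4: [0, 0, 0, 1, 1, 1, 2, 2, 3],
}

scores = {k: silhouette(points, labs) for k, labs in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data the three-cluster partition obtains the highest mean silhouette width, mirroring the way \(k=3\) is read off the convergence plots in Fig. 5.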

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

José-García, A. et al. (2022). Multi-view Clustering of Heterogeneous Health Data: Application to Systemic Sclerosis. In: Rudolph, G., Kononova, A.V., Aguirre, H., Kerschke, P., Ochoa, G., Tušar, T. (eds) Parallel Problem Solving from Nature – PPSN XVII. PPSN 2022. Lecture Notes in Computer Science, vol 13399. Springer, Cham. https://doi.org/10.1007/978-3-031-14721-0_25

  • DOI: https://doi.org/10.1007/978-3-031-14721-0_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-14720-3

  • Online ISBN: 978-3-031-14721-0

  • eBook Packages: Computer Science, Computer Science (R0)
