Mining High-Dimensional CyTOF Data: Concurrent Gating, Outlier Removal, and Dimension Reduction

  • Conference paper
Databases Theory and Applications (ADC 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10538)

Abstract

Cytometry is a powerful tool in the clinical diagnosis of health disorders, in particular immunodeficiency diseases and acute leukemia. Recent technological advancements have enabled up to 100 measurements to be taken simultaneously on each cell, thus generating high-throughput and high-dimensional datasets. Current analysis, which relies on manual segmentation of cell populations (gating) on sequential low-dimensional projections of the data, is subjective, time-consuming, and error-prone. It is also known that these multidimensional cytometric data typically exhibit non-normal features, including asymmetry, multimodality, and heavy tails. This presents a great challenge to traditional clustering methods, which are typically based on symmetric distributions.

In recent years, non-normal distributions have received increasing interest in the statistics literature. In particular, finite mixtures of skew distributions have emerged as a promising alternative to the traditional normal mixture modelling. However, these models are not well suited to high-dimensional settings.
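To give a sense of why unrestricted mixture components become unwieldy as the number of markers grows, the short sketch below counts free covariance parameters per component. It assumes the standard factor-analytic form Sigma = B B^T + D with q latent factors, which is the usual mixtures-of-factor-analyzers parameterization and not necessarily the exact one used in this paper. The function names are illustrative only.

```python
def full_cov_params(p: int) -> int:
    """Free parameters in an unrestricted p x p covariance matrix."""
    return p * (p + 1) // 2


def factor_cov_params(p: int, q: int) -> int:
    """Free parameters in B B^T + D, with B of size p x q and D diagonal.

    The q(q-1)/2 term removes the rotational indeterminacy of B.
    """
    return p * q + p - q * (q - 1) // 2


if __name__ == "__main__":
    # p = number of markers; q = 5 is an arbitrary illustrative choice.
    for p in (10, 40, 100):
        print(p, full_cov_params(p), factor_cov_params(p, q=5))
```

With p = 100 markers, an unrestricted covariance matrix has 5050 free parameters per component, whereas the factor-analytic form with q = 5 has 590.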

This paper describes a flexible statistical approach designed to perform unsupervised clustering, dimension reduction, and outlier removal simultaneously for cytometric data. The approach is based on finite mixtures of multivariate skew normal factor analyzers (SkewFA) with threshold pruning. The model can be fitted by maximum likelihood (ML) via an expectation-maximization (EM) algorithm. An application to a large CyTOF dataset is presented to demonstrate the usefulness of the SkewFA model and to illustrate its effectiveness relative to other algorithms.
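The general idea of combining EM with threshold-based outlier removal can be illustrated with a minimal sketch. This is not the SkewFA algorithm: plain Gaussian components stand in for skew normal factor analyzers, and the pruning rule (exclude any cell whose mixture log density falls below a fixed threshold from the M-step) is an assumption about what "threshold pruning" might look like, not the paper's definition. The names `em_with_pruning`, `init_means`, and `log_threshold` are hypothetical.

```python
import numpy as np


def log_gauss(X, mean, cov):
    """Log density of N(mean, cov) evaluated at each row of X."""
    p = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return -0.5 * (p * np.log(2 * np.pi) + logdet + maha)


def em_with_pruning(X, init_means, log_threshold, n_iter=50):
    """EM for a Gaussian mixture (stand-in for SkewFA, an assumption).

    Rows whose mixture log density is below log_threshold are treated
    as outliers and excluded from the M-step. Returns labels (-1 for
    pruned rows) and the fitted component means. Assumes p >= 2.
    """
    means = np.array(init_means, dtype=float)
    g, p = means.shape
    covs = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(p)] * g)
    weights = np.full(g, 1.0 / g)
    for _ in range(n_iter):
        # E-step: log joint density per component, then responsibilities.
        log_joint = np.column_stack([
            np.log(weights[k]) + log_gauss(X, means[k], covs[k])
            for k in range(g)
        ])
        log_mix = np.logaddexp.reduce(log_joint, axis=1)
        resp = np.exp(log_joint - log_mix[:, None])
        # Pruning: low-density rows get zero weight in the M-step.
        keep = log_mix >= log_threshold
        resp_kept = resp * keep[:, None]
        # M-step on the retained rows only.
        nk = resp_kept.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        for k in range(g):
            means[k] = resp_kept[:, k] @ X / nk[k]
            diff = X - means[k]
            covs[k] = (resp_kept[:, k, None] * diff).T @ diff / nk[k]
            covs[k] += 1e-6 * np.eye(p)  # small ridge for stability
    labels = np.where(keep, resp.argmax(axis=1), -1)
    return labels, means
```

A call such as `em_with_pruning(X, init_means=[[0, 0], [10, 10]], log_threshold=-15.0)` returns one label per row, with -1 marking rows flagged as outliers. The threshold is on the log-density scale and would need to be tuned to the data. A fixed threshold is only one option; trimming a fixed fraction of the lowest-density cells is another common choice.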


Corresponding author

Correspondence to Sharon X. Lee.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Lee, S.X. (2017). Mining High-Dimensional CyTOF Data: Concurrent Gating, Outlier Removal, and Dimension Reduction. In: Huang, Z., Xiao, X., Cao, X. (eds) Databases Theory and Applications. ADC 2017. Lecture Notes in Computer Science, vol 10538. Springer, Cham. https://doi.org/10.1007/978-3-319-68155-9_14

  • DOI: https://doi.org/10.1007/978-3-319-68155-9_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-68154-2

  • Online ISBN: 978-3-319-68155-9

  • eBook Packages: Computer Science, Computer Science (R0)
