Abstract
Cytometry is a powerful tool in the clinical diagnosis of health disorders, in particular immunodeficiency diseases and acute leukemia. Recent technological advancements have enabled up to 100 measurements to be taken simultaneously on each cell, thus generating high-throughput and high-dimensional datasets. Current analysis, which relies on manual segmentation of cell populations (gating) on sequential low-dimensional projections of the data, is subjective, time-consuming, and error-prone. It is also known that these multidimensional cytometric data typically exhibit non-normal features, including asymmetry, multimodality, and heavy tails. This presents a great challenge to traditional clustering methods, which are typically based on symmetric distributions.
In recent years, non-normal distributions have received increasing interest in the statistics literature. In particular, finite mixtures of skew distributions have emerged as a promising alternative to traditional normal mixture modelling. However, these models are not well suited to high-dimensional settings.
This paper describes a flexible statistical approach that simultaneously performs unsupervised clustering, dimension reduction, and outlier removal for cytometric data. The approach is based on finite mixtures of multivariate skew normal factor analyzers (SkewFA) with threshold pruning. The model can be fitted by maximum likelihood (ML) via an expectation-maximization (EM) algorithm. An application to a large CyTOF dataset is presented to demonstrate the usefulness of the SkewFA model and to illustrate its effectiveness relative to other algorithms.
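The clustering-with-pruning idea can be illustrated with a deliberately simplified sketch. It is not the SkewFA model itself: it uses univariate Azzalini skew normal components rather than multivariate skew normal factor analyzers, it takes the mixture parameters as given instead of estimating them by EM, and it assumes that "threshold pruning" means flagging observations whose fitted mixture density falls below a cut-off, which is only one plausible reading. All function names, parameter values, and the threshold are hypothetical.

```python
import math


def skew_normal_pdf(x, xi, omega, alpha):
    """Univariate Azzalini skew normal density (location xi, scale omega, shape alpha)."""
    z = (x - xi) / omega
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)    # standard normal pdf at z
    Phi = 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))   # standard normal cdf at alpha*z
    return 2.0 / omega * phi * Phi


def e_step(x, weights, params):
    """Return the mixture density at x and the posterior membership probabilities."""
    joint = [w * skew_normal_pdf(x, *p) for w, p in zip(weights, params)]
    total = sum(joint)
    if total > 0.0:
        resp = [j / total for j in joint]
    else:
        # Density underflowed to zero: fall back to uniform memberships.
        resp = [1.0 / len(joint)] * len(joint)
    return total, resp


def cluster_with_pruning(data, weights, params, density_threshold):
    """Assign each point to its most probable component, or -1 if flagged as an outlier.

    Assumption: a point is pruned when its mixture density is below density_threshold.
    """
    labels = []
    for x in data:
        density, resp = e_step(x, weights, params)
        if density < density_threshold:
            labels.append(-1)
        else:
            labels.append(max(range(len(resp)), key=resp.__getitem__))
    return labels


# Hypothetical two-component mixture: (location, scale, shape) per component.
weights = [0.6, 0.4]
params = [(0.0, 1.0, 3.0), (6.0, 1.5, -2.0)]
data = [0.5, 1.2, 5.0, 5.8, 20.0]

labels = cluster_with_pruning(data, weights, params, density_threshold=1e-4)
print(labels)
```

In this toy setting the last observation lies far from both components, so its mixture density is negligible and it is labelled `-1` rather than being forced into a cluster. A full implementation would alternate this E-step with an M-step that updates the component parameters.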
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Lee, S.X. (2017). Mining High-Dimensional CyTOF Data: Concurrent Gating, Outlier Removal, and Dimension Reduction. In: Huang, Z., Xiao, X., Cao, X. (eds) Databases Theory and Applications. ADC 2017. Lecture Notes in Computer Science, vol 10538. Springer, Cham. https://doi.org/10.1007/978-3-319-68155-9_14
DOI: https://doi.org/10.1007/978-3-319-68155-9_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68154-2
Online ISBN: 978-3-319-68155-9
eBook Packages: Computer Science (R0)