Abstract
Principal component analysis (PCA) is widely used to analyze high-dimensional data, but it is very sensitive to outliers. Robust PCA methods seek fits that are unaffected by the outliers and can therefore be trusted to reveal them. FastHCS (high-dimensional congruent subsets) is a robust PCA algorithm suitable for high-dimensional applications, including cases where the number of variables exceeds the number of observations. After detailing the FastHCS algorithm, we carry out an extensive simulation study and three real data applications, the results of which show that FastHCS is systematically more robust to outliers than state-of-the-art methods.
References
Björck, Å., Golub, G.H.: Numerical methods for computing angles between linear subspaces. Math. Comput. 27(2), 579–594 (1973)
Christensen, B.C., Houseman, E.A., Marsit, C.J., Zheng, S., Wrensch, M.R., Wiemels, J.L., Nelson, H.H., Karagas, M.R., Padbury, J.F., Bueno, R., Sugarbaker, D.J., Yeh, R., Wiencke, J.K., Kelsey, K.T.: Aging and environmental exposure alter tissue-specific DNA methylation dependent upon CpG Island context. PLoS Genet. 5(8), e1000602 (2009)
Croux, C., Ruiz-Gazen, A.: High breakdown estimators for principal components: the projection-pursuit approach revisited. J. Multivar. Anal. 95, 206–226 (2005)
Donoho, D.L.: Breakdown properties of multivariate location estimators. Ph.D. Qualifying Paper Harvard University (1982)
Debruyne, M., Hubert, M.: The influence function of the Stahel-Donoho covariance estimator of smallest outlyingness. Stat. Probab. Lett. 79(3), 275–282 (2009)
Deepayan, S.: Lattice: Multivariate Data Visualization with R. Springer, New York (2008)
Dyrby, M., Engelsen, S.B., Nørgaard, L., Bruhn, M., Lundsberg Nielsen, L.: Chemometric quantitation of the active substance in a pharmaceutical tablet using near infrared (NIR) transmittance and NIR FT Raman spectra. Appl. Spectrosc. 56(5), 579–585 (2002)
Hubert, M., Rousseeuw, P.J., Vanden Branden, K.: ROBPCA: a new approach to robust principal components analysis. Technometrics 47, 64–79 (2005)
Hubert, M., Rousseeuw, P., Vakili, K.: Shape bias of robust covariance estimators: an empirical study. Stat. Pap. 55(1), 15–28 (2014)
Jensen, D.R.: The structure of ellipsoidal distributions, II. Principal components. Biom. J. 28, 363–369 (1986)
Jolliffe, I.T.: Principal Component Analysis, 2nd edn. Springer, New York (2002)
Krzanowski, W.J.: Between-groups comparison of principal components. J. Am. Stat. Assoc. 74(367), 703–707 (1979)
Li, G., Chen, Z.: Projection-pursuit approach to robust dispersion matrices and principal components: primary theory and Monte Carlo. J. Am. Stat. Assoc. 80, 759–766 (1985)
Locantore, N., Marron, J.S., Simpson, D.G., Tripoli, N., Zhang, J.T., Cohen, K.L.: Robust principal component analysis for functional data. Test 8(1), 1–73 (1999)
Maronna, R.A., Yohai, V.J.: The behavior of the Stahel-Donoho Robust multivariate estimator. J. Am. Stat. Assoc. 90(429), 330–341 (1995)
Maronna, R.: Principal components and orthogonal regression based on Robust scales. Technometrics 47, 264–273 (2005)
Maronna, R.A., Martin, R.D., Yohai, V.J.: Robust Statistics: Theory and Methods. Wiley, New York (2006)
Muirhead, R.J.: Aspects of Multivariate Statistical Theory. Wiley, New York (1982)
R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2014)
Rousseeuw, P.J., Leroy, A.M.: Robust Regression and Outlier Detection. Wiley, New York (1987)
Schmitt, E., Öllerer, V., Vakili, K.: The finite sample breakdown point of PCS. Stat. Probab. Lett. 94, 214–220 (2014)
Seber, G.A.F.: Matrix Handbook for Statisticians. Wiley Series in Probability and Statistics. Wiley, New York (2008)
Stahel, W.: Breakdown of Covariance Estimators. Research Report 31, Fachgruppe für Statistik, E.T.H. Zürich (1981)
Todorov, V., Filzmoser, P.: An object-oriented framework for Robust multivariate analysis. J. Stat. Softw. 32, 1–47 (2009)
Tyler, D.E.: Finite sample breakdown points of projection based multivariate location and scatter statistics. Ann. Stat. 22(2), 1024–1044 (1994)
Vakili, K., Schmitt, E.: Finding multivariate outliers with FastPCS. Comput. Stat. Data Anal. 69, 54–66 (2014)
Van Breukelen, M., Duin, R.P.W., Tax, D.M.J., Den Hartog, J.E.: Handwritten digit recognition by combined classifiers. Kybernetika 34, 381–386 (1998)
Wu, W., Massart, D.L., de Jong, S.: The Kernel PCA algorithms for wide data. Part I: theory and algorithms. Chemom. Intell. Lab. Syst. 36, 165–172 (1997)
Yohai, V.J., Maronna, R.A.: The maximum bias of Robust covariances. Commun. Stat. Theory Methods 19, 2925–2933 (1990)
Acknowledgments
The authors wish to acknowledge the helpful comments from two anonymous referees and the editor which improved this paper.
Electronic supplementary material
Appendices
Appendix 1: Vulnerability of the I-index to orthogonal outliers
Throughout this appendix, let \(\pmb Y\) be an \(n\times p\) data matrix of uncontaminated observations drawn from a rank q distribution \(\mathscr {F}\), with q an integer satisfying \(2<q<\min (p,n)\). However, we do not observe \(\pmb Y\) but an \(n\times p\) (potentially) corrupted data matrix \(\pmb Y^{\varepsilon }\) that consists of \(g<n\) observations from \(\pmb Y\) and \(c=n-g\) arbitrary values, with \(\varepsilon =c/n\) denoting the (unknown) rate of contamination. Throughout, \(h=\lceil (n+q+1)/2\rceil \) and the PCA estimates \((\pmb t^I, \pmb L_q^I,\pmb P_q^I)\) are defined as in Sect. 2, with \((\pmb L_q^I)_j,1\leqslant j\leqslant q\) denoting the j-th diagonal entry of \(\pmb L_q^I\).
We will consider the finite sample breakdown point (Donoho 1982) in the context of PCA, following Li and Chen (1985):
Equation (11) defines the so-called finite sample explosion breakdown point and Eq. (12) the so-called finite sample implosion breakdown point of PCA estimates \((\pmb t, \pmb L_q,\pmb P_q)\), and the general finite sample breakdown point is \(\varepsilon ^*_n = \min (\varepsilon _1, \varepsilon _2)\).
The following assumptions [as per, for example, Tyler (1994)] all pertain to the original, uncontaminated, data set \(\pmb Y\). We will consider the case where the point cloud formed by \(\pmb Y\) lies in general position in \(\mathbb {R}^q\). The following definition of general position is adapted from Rousseeuw and Leroy (1987):
Definition 1
General position in \(\mathbb {R}^q\). \(\pmb Y\) is in general position in \(\mathbb {R}^q\) if no more than q points of \(\pmb Y\) lie in any \((q-1)\)-dimensional affine subspace. For q-dimensional data, this means that there are no more than q points of \(\pmb Y\) on any hyperplane, so that any \(q+1\) points of \(\pmb Y\) always determine a q-simplex with non-zero volume.
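Definition 1 can be checked numerically. The following sketch (our own illustration, not code from the paper) tests whether every set of \(q+1\) points spans a non-degenerate q-simplex, i.e. whether its q edge vectors have full rank:

```python
# Numeric check of Definition 1: Y (n x q) is in general position in R^q if no
# q+1 of its rows lie in a (q-1)-dimensional affine subspace.
from itertools import combinations

import numpy as np


def in_general_position(Y, tol=1e-10):
    """True if every q+1 rows of Y determine a q-simplex with non-zero volume."""
    n, q = Y.shape
    for idx in combinations(range(n), q + 1):
        pts = Y[list(idx)]
        diffs = pts[1:] - pts[0]  # the q edge vectors of the candidate simplex
        if np.linalg.matrix_rank(diffs, tol=tol) < q:
            return False          # degenerate simplex: points share a hyperplane
    return True


rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 3))   # continuous data lie in general position a.s.
print(in_general_position(Y))     # True

Y_bad = Y.copy()
Y_bad[:4, 2] = 0.0                # force q+1 = 4 points onto a 2-dim plane
print(in_general_position(Y_bad)) # False
```

The exhaustive loop over \(\binom{n}{q+1}\) subsets is only practical for small n, but it mirrors the definition directly.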
The I-index is shift invariant so that, w.l.o.g., we only consider cases where the good observations are centered at the origin. Throughout, we will also assume that the members of \(\pmb Y\) are bounded:
for some bounded scalar \(U_0\) depending only on the uncontaminated observations and that the uncontaminated observations contain no duplicates:
Theorem 1
The implosion breakdown point, \(\varepsilon _2(\pmb t^I, \pmb L_q^I,\pmb P_q^I)\), is \((n-h+1)/n\).
Proof
If at least h rows of \(\pmb Y^{\varepsilon }\) are in general position in \(\mathbb {R}^q\), any subset of h observations will contain at least \(q+1\) observations in general position. This guarantees that the qth eigenvalue corresponding to any h-subset is non-zero (Seber 2008). Thus, it follows that \(\varepsilon _2(\pmb t^I, \pmb L_q^I,\pmb P_q^I)=(n-h+1)/n\). \(\square \)
1.1 Finite sample explosion breakdown of \((\pmb t^I, \pmb L_q^I,\pmb P_q^I)\)
Denote by \(\pmb z\in \mathbb {R}^p\) an outlying entry of \(\pmb Y^{\varepsilon }\) and set \(\pmb z^m=||\pmb z\pmb P_0^m||\). The only outliers capable of causing explosion breakdown must satisfy:
for some bounded scalars \(U_1\) and \(U_2\) depending only on the uncontaminated observations.
Proof
Suppose that the outliers do not satisfy Eq. (13) so that \(\max _{i}||\pmb y_i^{\varepsilon }||\leqslant U_1\), but that the PCA estimates \((\pmb t^I, \pmb L_q^I,\pmb P_q^I)\) break down. This leads to a contradiction since
Therefore, for a contaminated h-subset to cause explosion breakdown, the outliers must satisfy Eq. (13).
Assume that an outlier \(\pmb z\) does not satisfy Condition (14). Schmitt et al. (2014) showed that any h subset \(H^m\) indexing \(\pmb z\) will have an unbounded value of \(I(H^m,\pmb S_0^m)\) if and only if \(\pmb z^m\) is unbounded. But for the uncontaminated data, it holds that
so if the contaminated data set \(\pmb Y^{\varepsilon }\) contains at least h entries from the original data matrix \(\pmb Y\), then it is always possible to construct a subset \(H^l\) of entries of \(\pmb Y\) for which \(I(H^l,\pmb S_0^l)\) is bounded, so that \(H^m\) will never be selected over \(H^l\). \(\square \)
Appendix 2: The finite sample breakdown point of FastHCS
In this appendix, we derive the finite sample breakdown point of FastHCS. Define \(\pmb Y\), \(\pmb Y^\varepsilon \) and \(\varepsilon ^*_n\) as in Appendix 1. Recall that
where \(H^-=H^{PP} \setminus H^{I}\). Then, if \(D(\pmb Y^\varepsilon ,H^I,H^{PP})>0\) or if \(\displaystyle \max _{j=1}^q{\mathop {{{\mathrm{var}}}}\limits _{i \in H^{-}}}(\pmb y^\varepsilon _i\pmb P^{PP}_j)=0\) then the final FastHCS estimates are based on \(H^{PP}\). Otherwise, they are based on \( H^{I}\).
Lemma 1
If \(||\pmb y^\varepsilon _i||>U_1\) and \(\varepsilon < (n-1)/2n\), then \(i \notin H^{\bullet }\).
Proof
Debruyne and Hubert (2009) showed that the population breakdown point of \((\pmb t^{PP}, \pmb L_q^{PP},\pmb P_q^{PP})\) is 50 %, which corresponds to a finite sample breakdown point of \((n-1)/2n\). Consequently, \(H^{PP}\) will not index any data point for which \(||\pmb y^\varepsilon _i||>U_1\). Since \(H^{\bullet }\) indexes the overlap between \(H^I\) and \(H^{PP}\), if \(||\pmb y^\varepsilon _i||>U_1\), then \(i \notin H^{\bullet }\).
Lemma 2
When \(\pmb Y\) is in general position, \(n>q>2\), and \(\varepsilon < \varepsilon _1 = (n-1)/2n\), then \((\pmb L_q^I)_1 <\infty \).
Proof
We will proceed by showing that both denominators in Eq. (17) are bounded, while of the two numerators only the one depending on \(H^{PP}\) is bounded.
Lemma 1 implies there exists a fixed constant \(U_4\) such that
for any orthogonal matrix \(\pmb P\). Similarly, since the projection pursuit approach has a breakdown point of \((n-1)/2n\), there exists a fixed \(U_5\) such that
As a consequence of (18) and (19), there exists a fixed constant \(U_6\) such that:
Next, note that
[Equation (22) follows from Appendix 1, Theorem 1], so that
is not bounded from above. Conversely, \((\pmb t^{PP}, \pmb L_q^{PP},\pmb P_q^{PP})\) has an explosion breakdown point of \((n-1)/2n\), so that there exists a fixed \(U_8\) such that:
From Eq. (20) and the unboundedness of (23) it follows that the left-hand side in Eq. (17) is unbounded. However, by Eqs. (20) and (24), the right-hand side of Eq. (17) is bounded from above so that in cases where outliers cause explosion breakdown of \((\pmb t^I, \pmb L_q^I,\pmb P_q^I)\), criterion (17) will select \(H^* = H^{PP}\). Since the breakdown point of \((\pmb t^{PP}, \pmb L_q^{PP},\pmb P_q^{PP})\) is \((n-1)/2n\), we have that \(\varepsilon _1 = (n-1)/2n\). \(\square \)
Lemma 3
When \(\pmb Y\) is in general position, \(n>q>2\), and \(\varepsilon < \varepsilon _2 = (n-h+1)/n\), then \( (\pmb L_q^I)_q > 0\).
Proof
By Appendix 1, Theorem 1, we have that the implosion breakdown point of \((\pmb t^I, \pmb L_q^I,\pmb P_q^I)\) is \((n-h+1)/n\). The implosion breakdown point of \((\pmb t^{PP}, \pmb L_q^{PP},\pmb P_q^{PP})\) is \((n-1)/2n\), which is higher, so it follows that \(\varepsilon _2=(n-h+1)/n\).
Theorem 2
For \(n>p+1>2\), the finite sample breakdown point of \(\pmb L_q\) is \(\varepsilon ^*_n=(n-h+1)/n\).
Proof
The finite sample breakdown point of \(\pmb L_q\) is \(\min (\varepsilon _1, \varepsilon _2)\). Given Lemmas 2 and 3, \(\min ((n-1)/2n, (n-h+1)/n) = (n-h+1)/n\). \(\square \)
Appendix 3: Measures of dissimilarity for robust PCA fits
The objective of the simulation studies in Sect. 3.3 is to measure how much the fitted PCA parameters \((\pmb t,\pmb L_q^{},\pmb P_q^{})\) obtained by four robust PCA methods deviate from the true \((\pmb \mu ^u,\pmb \varLambda _q^{u},\pmb \varPi _q^{u})\) when they are exposed to outliers. One way to compare PCA fits is with respect to their eigenvectors, as in the maxsub criterion (Björck and Golub 1973):
where \(\lambda _q(\pmb D_q)\) is the smallest eigenvalue of the matrix \( \pmb D_q^{}=\pmb \varPi _q^\top \pmb P_q^{}\pmb P_q^\top \pmb \varPi _q^{}\). The maxsub has an appealing geometrical interpretation as it represents the maximum angle between a vector in \(\pmb \varPi _q\) and the vector most parallel to it in \(\pmb P_q\). However, it does not exhaustively account for the dissimilarity between two sets of eigenvectors. As an alternative to the maxsub, Krzanowski (1979) proposes the total dissimilarity:
which is an exhaustive measure of dissimilarity for orthogonal matrices. Furthermore, because \(\sum _{j=1}^q\lambda _j(\pmb D_q)={{\mathrm{Tr}}}(\pmb D_q)\) and \(|\pmb D_q|=1\) (Krzanowski 1979), it is readily seen that (25) is a measure of sphericity of \(\pmb D_q\) [it is proportional to the likelihood ratio test statistic for non-sphericity of \(\pmb D_q\) (Muirhead 1982, pp. 333–335)]. However, note that (25) forfeits the geometric interpretation enjoyed by the maxsub.
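The maxsub criterion can be sketched directly from the text: form \(\pmb D_q^{}=\pmb \varPi _q^\top \pmb P_q^{}\pmb P_q^\top \pmb \varPi _q^{}\) and take the largest principal angle. The exact displayed formula is not reproduced above, so the form \(\arccos \sqrt{\lambda _q(\pmb D_q)}\) used below is an assumption consistent with the surrounding description:

```python
# Sketch of the maxsub criterion (Björck and Golub 1973) as described in the
# text: the maximum angle between a vector in Pi_q and the vector most
# parallel to it in P_q. The formula arccos(sqrt(lambda_min(D_q))) is assumed,
# since the displayed equation is not reproduced in the text.
import numpy as np


def maxsub(Pi_q, P_q):
    """Largest principal angle (radians) between the column spaces of Pi_q and P_q."""
    D_q = Pi_q.T @ P_q @ P_q.T @ Pi_q
    lam_min = np.linalg.eigvalsh(D_q)[0]  # smallest eigenvalue of D_q
    return np.arccos(np.sqrt(np.clip(lam_min, 0.0, 1.0)))


# Identical subspaces give angle 0; orthogonal subspaces give angle pi/2.
Pi = np.eye(4)[:, :2]      # span{e1, e2}
P_same = np.eye(4)[:, :2]
P_orth = np.eye(4)[:, 2:]  # span{e3, e4}
print(maxsub(Pi, P_same))  # 0.0
print(maxsub(Pi, P_orth))  # pi/2
```

Note that maxsub depends only on the smallest eigenvalue of \(\pmb D_q\), which is exactly why it is not exhaustive: the remaining \(q-1\) eigenvalues are ignored.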
In any case, measures of dissimilarity based solely on eigenvectors, such as the maxsub or sumsub, necessarily fail to account for bias in the estimation of the eigenvalues. This is problematic when used to evaluate robust fits because it is possible for outliers to exert substantially more influence on \(\pmb L_q\) than on \(\pmb P_q\). An extreme example is given by the so-called good leverage type of contamination in which the outliers lie on the subspace spanned by \(\pmb \varPi _q\) so that even the classical PCA estimate (whose eigenvalues can be made arbitrarily bad by such outliers) will have low values of \(\text {maxsub}(\pmb P_q)\).
In contrast, we are interested in an exhaustive measure of dissimilarity; one that summarizes the effects of the outliers on all the parameters of the PCA fit into a single number, so that the algorithms can be ranked in terms of total dissimilarity. To construct such a measure, it is logical to base it on \(\pmb \varSigma _q^u=\pmb \varPi _q^{u}\pmb \varLambda _q^{u}(\pmb \varPi _q^{u})^{\top }\) and its estimate \(\pmb V_q=\pmb P_q^{}\pmb L_q^{}\pmb P_q^{\top }\) because they contain all the parameters of the fitted model. For our purposes, one need only consider the effects of outliers on \(\pmb G_q=|\pmb V_q|^{-1/q}\pmb V_q\), the shape component of \(\pmb V_q\) (Hubert et al. 2014). This is because to rank the observations in a contaminated sample in terms of their true outlyingness (and thus reveal the outliers), it is sufficient to estimate the shape component of \(\pmb \varSigma _q^u\) correctly. Consequently, an exhaustive measure of dissimilarity between \(\pmb G_q\) and \(\pmb \varGamma _q=|\pmb \varSigma _q|^{-1/q}\pmb \varSigma _q\) is given by \(\phi ((\pmb \varGamma ^u_q)^{-1/2}\pmb G_{q}(\pmb \varGamma ^u_q)^{-1/2})\), where \(\phi \) is any measure of non-sphericity of its argument. In practice several choices of \(\phi \) are possible, the simplest being the condition number of \(\pmb W\), defined as the ratio of the largest to the smallest eigenvalue of \(\pmb W\) (Maronna and Yohai 1995), explaining the definition of \(\text {bias}(\pmb V_q)\).
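The bias measure just described can be sketched as follows (an illustration under the stated definitions, not the paper's own code; the helper names are ours):

```python
# Sketch of the shape-based bias measure: take the shape components
# G_q = |V_q|^(-1/q) V_q and Gamma_q = |Sigma_q|^(-1/q) Sigma_q, whiten G_q by
# Gamma_q^(-1/2), and use the condition number as the non-sphericity measure phi.
import numpy as np


def shape(S):
    """Shape component of a q x q positive definite matrix (unit determinant)."""
    q = S.shape[0]
    return S / np.linalg.det(S) ** (1.0 / q)


def bias(V_q, Sigma_q):
    """Condition number of Gamma^(-1/2) G Gamma^(-1/2); 1 means no shape bias."""
    G, Gamma = shape(V_q), shape(Sigma_q)
    # symmetric inverse square root of Gamma via its eigendecomposition
    w, U = np.linalg.eigh(Gamma)
    Gamma_inv_half = U @ np.diag(w ** -0.5) @ U.T
    W = Gamma_inv_half @ G @ Gamma_inv_half
    ev = np.linalg.eigvalsh(W)
    return ev[-1] / ev[0]  # ratio of largest to smallest eigenvalue


Sigma = np.diag([4.0, 2.0, 1.0])             # illustrative "true" Sigma_q
print(bias(Sigma, Sigma))                    # ~1.0: shape recovered exactly
print(bias(np.diag([8.0, 2.0, 1.0]), Sigma)) # 2.0: first eigenvalue inflated
```

Because both matrices are reduced to their shape components first, an estimate that merely rescales \(\pmb \varSigma _q^u\) by a constant also scores 1, consistent with the argument that only shape matters for ranking outlyingness.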
Schmitt, E., Vakili, K. The FastHCS algorithm for robust PCA. Stat Comput 26, 1229–1242 (2016). https://doi.org/10.1007/s11222-015-9602-5