Abstract
This paper proposes an approach for constructing histogram values for the principal components of interval-valued observations. Le-Rademacher and Billard (J Comput Graph Stat 21:413–432, 2012) show that, for a principal component analysis on interval-valued observations, the resulting observations in principal component space are polytopes formed by the convex hulls of linearly transformed vertices of the observed hyper-rectangles. We propose an algorithm that translates these polytopes into histogram-valued data, providing numerical values for the principal components that can be used as input to further analysis. Other existing methods of principal component analysis for interval-valued data construct the principal components themselves as intervals, which implicitly assumes that all values within an observation are uniformly distributed along the principal component axes. However, this assumption holds only in special cases where the variables in the dataset are mutually uncorrelated. The representation of the principal components as histogram values proposed herein more accurately reflects the variation in the internal structure of the observations in principal component space. As a consequence, subsequent analyses that use histogram-valued principal components as input yield improved accuracy.
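To make the geometric idea concrete, the following Python sketch projects the vertices of one observed hyper-rectangle onto a single loading vector and approximates the resulting distribution along that axis by a histogram. It is only an illustration and not the algorithm proposed in the paper: the function names `projected_vertices` and `pc_histogram`, the Monte Carlo approximation, the assumption that values are uniformly distributed inside the hyper-rectangle, and the omission of mean-centering are all assumptions made here for brevity.

```python
import itertools

import numpy as np


def projected_vertices(lower, upper, loading):
    """Project all 2^p vertices of a hyper-rectangle onto one loading vector."""
    # Each vertex picks either the lower or the upper bound on every variable.
    corners = np.array(list(itertools.product(*zip(lower, upper))), dtype=float)
    return corners @ np.asarray(loading, dtype=float)


def pc_histogram(lower, upper, loading, n_bins=5, n_samples=100_000, seed=0):
    """Approximate one observation's distribution along a PC axis (hypothetical helper).

    Assumes points are uniform inside the hyper-rectangle and uses sampling,
    which is a stand-in for an exact computation.
    """
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    loading = np.asarray(loading, dtype=float)

    # The extreme scores of a linear map over a box are attained at its vertices.
    scores = projected_vertices(lower, upper, loading)
    lo, hi = scores.min(), scores.max()

    # Sample points uniformly inside the box and project them onto the axis.
    points = rng.uniform(lower, upper, size=(n_samples, lower.size))
    projected = points @ loading

    counts, edges = np.histogram(projected, bins=n_bins, range=(lo, hi))
    return edges, counts / counts.sum()


# Unit square projected onto its diagonal direction.
edges, freqs = pc_histogram([0.0, 0.0], [1.0, 1.0], [2 ** -0.5, 2 ** -0.5])
print(edges)
print(freqs)
```

In this example the relative frequencies peak in the middle bin rather than being flat, which illustrates why an interval with an implied uniform distribution can misrepresent an observation along a principal component axis.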
References
Anderson TW (1963) Asymptotic theory for principal components analysis. Ann Math Stat 34:122–148
Anderson TW (1984) An introduction to multivariate statistical analysis, 2nd edn. Wiley, New York
Bertrand P, Goupil F (2000) Descriptive statistics for symbolic data. In: Bock H-H, Diday E (eds) Analysis of symbolic data: explanatory methods for extracting statistical information from complex data. Springer, Berlin, pp 106–124
Billard L (2008) Sample covariance functions for complex quantitative data. In: Mizuta M, Nakano J (eds) Proceedings world conference of the international association for statistical computing. Japan, pp 157–163
Billard L, Diday E (2003) From the statistics of data to the statistics of knowledge: symbolic data analysis. J Am Stat Assoc 98:470–487
Billard L, Diday E (2006) Symbolic data analysis: conceptual statistics and data mining. Wiley, New York
Bock H-H, Diday E (eds) (2000) Analysis of symbolic data: explanatory methods for extracting statistical information from complex data. Springer, Berlin
Cazes P (2002) Analyse Factorielle d’un Tableau de Lois de Probabilité. Revue de Statistique Appliquée 50(3):5–24
Cazes P, Chouakria A, Diday E, Schektman Y (1997) Extension de l’Analyse en Composantes Principales à des Données de Type Intervalle. Revue de Statistique Appliquée 45(3):5–24
Chouakria A (1998) Extension des Méthodes d’analyse Factorielle à des Données de Type Intervalle. Université Paris, Dauphine, Doctoral Thesis
Coppi R, Giordani P, D’Urso P (2006) Component models for fuzzy data. Psychometrika 71:733–761
Davidson KR, Donsig AP (2002) Real analysis with real applications. Prentice Hall, New Jersey
Diday E (1987) Introduction à l’Approche Symbolique en Analyse des Données. CEREMADE, Université Paris, Premières Journées Symbolique-Numérique, pp 21–56
Douzal-Chouakria A, Billard L, Diday E (2011) Principal component analysis for interval-valued observations. Stat Anal Data Min 4:229–246
Gioia F, Lauro NC (2006) Principal component analysis on interval data. Comput Stat 21:343–363
Giordani P, Kiers HAL (2004) Principal component analysis of symmetric fuzzy data. Comput Stat Data Anal 45:519–548
Ichino M (2011) The quantile method for symbolic principal component analysis. Stat Anal Data Min 4:184–198
Irpino A, Lauro NC, Verde R (2003) Visualizing symbolic data by closed shapes. In: Schader M, Gaul W, Vichi M (eds) Between data science and applied data analysis. Springer, Berlin, pp 244–251
Johnson RA, Wichern DW (2002) Applied multivariate statistical analysis, 5th edn. Prentice Hall, New Jersey
Jolliffe IT (2004) Principal component analysis, 2nd edn. Springer, New York
Lauro NC, Palumbo F (2000) Principal component analysis of interval data: a symbolic data analysis approach. Comput Stat 15:73–87
Lauro NC, Verde R, Irpino A (2008) Principal component analysis of symbolic data described by intervals. In: Diday E, Noirhomme-Fraiture M (eds) Symbolic data analysis and the SODAS software. Wiley, Chichester, pp 279–311
Leroy B, Chouakria A, Herlin I, Diday E (1996) Approche Géométrique et Classification pour la Reconnaissance de Visage. Reconnaissance des Formes et Intelligence Artificielle, INRIA and IRISA and CNRS, France, pp 548–557
Le-Rademacher J, Billard L (2012) Symbolic-covariance principal component analysis and visualization for interval-valued data. J Comput Graph Stat 21:413–432
Makosso Kallyth S, Diday E (2010) Analyse en Axes Principaux de Variables Symboliques de Type Histogrammes. Act. XLII Journées de Statistiques, Marseille, France, pp 1–6. http://hal.archives-ouvertes.fr/inria-00494681/
Palumbo F, Lauro NC (2003) A PCA for interval-valued data based on midpoints and radii. In: Yanai H, Okada A, Shigemasu K, Kano Y, Meulman J (eds) New developments in psychometrics. Springer, Tokyo, pp 641–648
Acknowledgments
The authors wish to thank the Editor, the Associate Editor, and the referees for their thorough review and thoughtful comments. Partial support to both authors from NSF grants is gratefully acknowledged.
Cite this article
Le-Rademacher, J., Billard, L. Principal component histograms from interval-valued observations. Comput Stat 28, 2117–2138 (2013). https://doi.org/10.1007/s00180-013-0399-4
DOI: https://doi.org/10.1007/s00180-013-0399-4