Risk of Selection of Irrelevant Features from High-Dimensional Data with Small Sample Size

Maciejewski, Henryk

doi:10.1007/978-3-319-13881-7_44

Part of the book series: Springer Proceedings in Mathematics & Statistics (PROMS, volume 122)

Abstract

In this work we demonstrate the effect of small sample size on the risk that feature selection algorithms select irrelevant features from high-dimensional data. We develop a simple analytical model to quantify this risk and verify the model by means of simulation. These results (i) explain the inherent instability of feature selection from high-dimensional, small-sample-size data and (ii) can be used to estimate the minimum sample size required for stable feature selection. Such results are useful when dealing with data from high-throughput studies.
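The kind of simulation the abstract refers to can be illustrated with a minimal Monte Carlo sketch. It is not the chapter's analytical model or its simulation protocol, neither of which appears on this page. It assumes a hypothetical setup: Gaussian data, `k` relevant features with true correlation `rho` to the target, `d - k` irrelevant features, and selection of the top `k` features by absolute sample Pearson correlation. The function name and all parameter defaults are illustrative.

```python
import random
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def irrelevant_fraction(n, d=300, k=10, rho=0.5, trials=20, seed=0):
    """Average fraction of the top-k selected features that are irrelevant.

    Assumed setup (hypothetical, for illustration only): features 0..k-1
    have true correlation rho with the target; features k..d-1 are pure
    noise. Features are ranked by |sample correlation| with the target.
    """
    rng = random.Random(seed)
    noise_scale = sqrt(1 - rho ** 2)
    total = 0.0
    for _ in range(trials):
        # Target variable for n samples.
        y = [rng.gauss(0, 1) for _ in range(n)]
        scores = []
        for j in range(d):
            if j < k:
                # Relevant feature: unit variance, correlation rho with y.
                x = [rho * yi + noise_scale * rng.gauss(0, 1) for yi in y]
            else:
                # Irrelevant feature: independent of y.
                x = [rng.gauss(0, 1) for _ in range(n)]
            scores.append((abs(pearson(x, y)), j))
        # Select the k features with the largest absolute correlation.
        top = sorted(scores, reverse=True)[:k]
        total += sum(1 for _, j in top if j >= k) / k
    return total / trials


if __name__ == "__main__":
    for n in (10, 50, 200):
        print(n, irrelevant_fraction(n, trials=5))
```

Under these assumptions, the returned fraction should be high for small `n` and fall toward zero as `n` grows, which is the qualitative effect the abstract describes. Sweeping `n` until the fraction drops below a chosen tolerance gives a rough, simulation-based estimate of a minimum sample size for this toy setup only.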

Author information

Authors and Affiliations

Institute of Computer Engineering, Control and Robotics, Wrocław University of Technology, ul. Janiszewskiego 11-17, 50-370, Wrocław, Poland
Henryk Maciejewski

Corresponding author

Correspondence to Henryk Maciejewski.

Editor information

Editors and Affiliations

Institute of Statistics, RWTH Aachen University, Aachen, Germany
Ansgar Steland
Dept. of Computer Engineering, Control and Robotics, Wrocław University of Technology, Wrocław, Poland
Ewaryst Rafajłowicz
Inst. of Mathematics and Computer Science, Wrocław University of Technology, Wrocław, Poland
Krzysztof Szajowski

About this paper

Cite this paper

Maciejewski, H. (2015). Risk of Selection of Irrelevant Features from High-Dimensional Data with Small Sample Size. In: Steland, A., Rafajłowicz, E., Szajowski, K. (eds) Stochastic Models, Statistics and Their Applications. Springer Proceedings in Mathematics & Statistics, vol 122. Springer, Cham. https://doi.org/10.1007/978-3-319-13881-7_44

DOI: https://doi.org/10.1007/978-3-319-13881-7_44
Published: 05 February 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13880-0
Online ISBN: 978-3-319-13881-7
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
