Risk of Selection of Irrelevant Features from High-Dimensional Data with Small Sample Size

  • Conference paper
Stochastic Models, Statistics and Their Applications

Part of the book series: Springer Proceedings in Mathematics & Statistics (PROMS, volume 122)

Abstract

In this work we demonstrate how small sample size affects the risk that feature selection algorithms select irrelevant features from high-dimensional data. We develop a simple analytical model to quantify this risk and verify the model by means of simulation. These results (i) explain the inherent instability of feature selection from high-dimensional, small sample size data and (ii) can be used to estimate the minimum sample size required for good stability of the selected features. Such results are useful when dealing with data from high-throughput studies.
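
The effect described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's analytical model or its simulation design, which are not reproduced on this page. Everything in it is an assumption made for illustration: the Gaussian data model, ranking features by absolute Pearson correlation with the target, and the hypothetical names and parameters (`irrelevant_fraction`, `effect`, `n_relevant`, `k`).

```python
import random
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def irrelevant_fraction(n, p, n_relevant, effect, k, trials, seed=0):
    """Average share of irrelevant features among the top-k selected.

    Assumed setup (illustrative only): target y ~ N(0, 1); a relevant
    feature is effect * y plus N(0, 1) noise; an irrelevant feature is
    pure N(0, 1) noise. Features are ranked by |correlation with y|.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        y = [rng.gauss(0, 1) for _ in range(n)]
        scores = []
        for j in range(p):
            if j < n_relevant:
                # Feature that truly depends on the target.
                x = [effect * yi + rng.gauss(0, 1) for yi in y]
            else:
                # Feature that is independent of the target.
                x = [rng.gauss(0, 1) for _ in range(n)]
            scores.append((abs(pearson(x, y)), j))
        # Keep the k features with the largest absolute correlation.
        top = sorted(scores, reverse=True)[:k]
        total += sum(1 for _, j in top if j >= n_relevant) / k
    return total / trials


if __name__ == "__main__":
    for n in (10, 40, 160):
        frac = irrelevant_fraction(
            n=n, p=300, n_relevant=10, effect=0.8, k=10, trials=10
        )
        print(f"n={n:4d}  irrelevant share of top-10: {frac:.2f}")
```

Under these assumptions, the share of irrelevant features among the selected ones should shrink as the sample size `n` grows, because spurious correlations of noise features with the target become smaller. Varying `p` or `effect` shows how dimensionality and signal strength shift the sample size needed for a stable selection.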

Author information

Corresponding author

Correspondence to Henryk Maciejewski.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Maciejewski, H. (2015). Risk of Selection of Irrelevant Features from High-Dimensional Data with Small Sample Size. In: Steland, A., Rafajłowicz, E., Szajowski, K. (eds) Stochastic Models, Statistics and Their Applications. Springer Proceedings in Mathematics & Statistics, vol 122. Springer, Cham. https://doi.org/10.1007/978-3-319-13881-7_44
