
Part of the book series: Intelligent Systems Reference Library ((ISRL,volume 147))

Abstract

The advent of Big Data, and especially of datasets with high dimensionality, has made it essential to identify the relevant features of the data. In this scenario the importance of feature selection is beyond doubt, and many methods have been developed, although researchers do not agree on which one is best for any given setting. This chapter provides the reader with the foundations of feature selection (see Sect. 2.1) as well as a description of state-of-the-art feature selection methods (Sect. 2.2). These methods are then analyzed on several synthetic datasets (Sect. 2.3) in an attempt to draw conclusions about their performance when dealing with a growing number of irrelevant features, noise in the data, redundancy and interaction between attributes, and a small ratio between the number of samples and the number of features. Finally, in Sect. 2.4, some state-of-the-art methods are analyzed to study their scalability, i.e. the impact of an increase in the size of the training set on the computational performance of an algorithm in terms of accuracy, training time and stability.

Part of the content of this chapter was previously published in Knowledge and Information Systems (https://doi.org/10.1007/s10115-012-0487-8 and https://doi.org/10.1007/s10115-017-1140-3).
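To make the filter idea behind many of the methods discussed in the chapter concrete, the following is a minimal sketch, not taken from the chapter itself: a univariate filter that scores each feature by the absolute Pearson correlation with the class label and keeps the top-ranked ones. The synthetic dataset, the scoring choice and all names here are illustrative assumptions.

```python
# Sketch of a univariate filter for feature selection (illustrative only):
# rank features by |Pearson correlation| with a binary class label.
import random
import statistics

random.seed(42)

n_samples, n_features = 500, 10
X = [[random.random() for _ in range(n_features)] for _ in range(n_samples)]
# By construction, only features 0 and 1 determine the class;
# the remaining eight features are irrelevant noise.
y = [1 if row[0] + row[1] > 1.0 else 0 for row in X]

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((z - mb) ** 2 for z in b)
    return cov / (va * vb) ** 0.5 if va and vb else 0.0

# Score every feature and rank them from most to least relevant.
scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
ranking = sorted(range(n_features), key=lambda j: -scores[j])
top2 = ranking[:2]
print(top2)  # the two truly relevant features should rank first
```

On data like this, the two relevant features receive clearly higher scores than the noise features, which is exactly the behavior the synthetic-dataset experiments in Sect. 2.3 are designed to probe, including how it degrades as irrelevant or redundant features are added.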



Author information


Correspondence to Verónica Bolón-Canedo.


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter


Cite this chapter

Bolón-Canedo, V., Alonso-Betanzos, A. (2018). Feature Selection. In: Recent Advances in Ensembles for Feature Selection. Intelligent Systems Reference Library, vol 147. Springer, Cham. https://doi.org/10.1007/978-3-319-90080-3_2

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-90080-3_2


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-90079-7

  • Online ISBN: 978-3-319-90080-3

  • eBook Packages: Engineering, Engineering (R0)
