Skip to main content

A General Feature Engineering Wrapper for Machine Learning Using \(\epsilon \)-Lexicase Survival

  • Conference paper
  • First Online:
Genetic Programming (EuroGP 2017)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 10196))

Included in the following conference series:

Abstract

We propose a general wrapper for feature learning that interfaces with other machine learning methods to compose effective data representations. The proposed feature engineering wrapper (FEW) uses genetic programming to represent and evolve individual features tailored to the machine learning method with which it is paired. In order to maintain feature diversity, \(\epsilon \)-lexicase survival is introduced, a method based on \(\epsilon \)-lexicase selection. This survival method preserves semantically unique individuals in the population based on their ability to solve difficult subsets of training cases, thereby yielding a population of uncorrelated features. We demonstrate FEW with five different off-the-shelf machine learning methods and test it on a set of real-world and synthetic regression problems with dimensions varying across three orders of magnitude. The results show that FEW is able to improve model test predictions across problems for several ML methods. We discuss and test the scalability of FEW in comparison to other feature composition strategies, most notably polynomial feature expansion.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    https://github.com/lacava/few.

  2. 2.

    https://pypi.python.org/pypi/FEW.

References

  1. Arnaldo, I., O’Reilly, U.M., Veeramachaneni, K.: Building predictive models via feature synthesis, pp. 983–990. ACM Press (2015)

    Google Scholar 

  2. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)

    Article  Google Scholar 

  3. De Melo, V.V.: Kaizen programming, pp. 895–902. ACM Press (2014)

    Google Scholar 

  4. Foster, D., Karloff, H., Thaler, J.: Variable selection is hard. In: Proceedings of The 28th Conference on Learning Theory, pp. 696–709 (2015)

    Google Scholar 

  5. Friedman, J., Hastie, T., Tibshirani, R.: The elements of statistical learning. Springer series in statistics, vol. 1. Springer, Berlin (2001)

    MATH  Google Scholar 

  6. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)

    MATH  Google Scholar 

  7. Harrison, D., Rubinfeld, D.L.: Hedonic housing prices and the demand for clean air. J. Environ. Econ. Manage. 5(1), 81–102 (1978)

    Article  MATH  Google Scholar 

  8. Helmuth, T., Spector, L., Matheson, J.: Solving uncompromising problems with lexicase selection. IEEE Trans. Evol. Comput. PP(99), 1–1 (2014)

    Google Scholar 

  9. Iba, H., Sato, T.: Genetic Programming with Local Hill-Climbing. Tech. Rep. ETL-TR-94-4, Electrotechnical Laboratory, 1-1-4 Umezono, Tsukuba-city, Ibaraki, 305, Japan (1994). http://www.cs.ucl.ac.uk/staff/W.Langdon/ftp/papers/Iba_1994_GPlHC.pdf

  10. Icke, I., Bongard, J.C.: Improving genetic programming based symbolic regression using deterministic machine learning. In: IEEE Congress on Evolutionary Computation (CEC), 2013, pp. 1763–1770. IEEE (2013)

    Google Scholar 

  11. Kamath, U., Lin, J., De Jong, K.: SAX-EFG: an evolutionary feature generation framework for time series classification, pp. 533–540. ACM Press (2014)

    Google Scholar 

  12. Kommenda, M., Kronberger, G., Winkler, S., Affenzeller, M., Wagner, S.: Effects of constant optimization by nonlinear least squares minimization in symbolic regression. In: Blum, C., Alba, E., Bartz-Beielstein, T., Loiacono, D., Luna, F., Mehnen, J., Ochoa, G., Preuss, M., Tantar, E., Vanneschi, L. (eds.) GECCO 2013 Companion, pp. 1121–1128. ACM, Amsterdam (2013)

    Chapter  Google Scholar 

  13. La Cava, W., Danai, K., Spector, L., Fleming, P., Wright, A., Lackner, M.: Automatic identification of wind turbine models using evolutionary multiobjective optimization. Renew. Energy Part 2 87, 892–902 (2016)

    Article  Google Scholar 

  14. La Cava, W., Spector, L., Danai, K.: Epsilon-Lexicase Selection for Regression, pp. 741–748. ACM Press (2016)

    Google Scholar 

  15. Liskowski, P., Krawiec, K., Helmuth, T., Spector, L.: Comparison of semantic-aware selection methods in genetic programming. In: Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation, GECCO Companion 2015, pp. 1301–1307. ACM, New York (2015)

    Google Scholar 

  16. McConaghy, T.: FFX: fast, scalable, deterministic symbolic regression technology. In: Riolo, R., Vladislavleva, E., Moore, J.H. (eds.) Genetic Programming Theory and Practice IX. Genetic and Evolutionary Computation, pp. 235–260. Springer, New York (2011)

    Chapter  Google Scholar 

  17. Muharram, M., Smith, G.D.: Evolutionary constructive induction. IEEE Trans. Knowl. Data Eng. 17(11), 1518–1528 (2005)

    Article  Google Scholar 

  18. Muharram, M.A., Smith, G.D.: The effect of evolved attributes on classification algorithms. In: Gedeon, T.T.D., Fung, L.C.C. (eds.) AI 2003. LNCS (LNAI), vol. 2903, pp. 933–941. Springer, Heidelberg (2003). doi:10.1007/978-3-540-24581-0_80

    Chapter  Google Scholar 

  19. Muharram, M.A., Smith, G.D.: Evolutionary feature construction using information gain and gini index. In: Keijzer, M., O’Reilly, U.-M., Lucas, S., Costa, E., Soule, T. (eds.) EuroGP 2004. LNCS, vol. 3003, pp. 379–388. Springer, Heidelberg (2004). doi:10.1007/978-3-540-24650-3_36

    Chapter  Google Scholar 

  20. Olson, R.S., Bartley, N., Urbanowicz, R.J., Moore, J.H.: Evaluation of a tree-based pipeline optimization tool for automating data science. arXiv preprint (2016). http://arxiv.org/abs/1603.06212

  21. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

    MathSciNet  MATH  Google Scholar 

  22. Redmond, M., Baveja, A.: A data-driven software tool for enabling cooperative information sharing among police departments. Eur. J. Oper. Res. 141(3), 660–678 (2002)

    Article  MATH  Google Scholar 

  23. Spector, L.: Assessment of problem modality by differential performance of lexicase selection in genetic programming: a preliminary report. In: Proceedings of the Fourteenth International Conference on Genetic and Evolutionary Computation Conference Companion, pp. 401–408 (2012)

    Google Scholar 

  24. Tan, K.C., Lee, T.H., Khor, E.F.: Evolutionary algorithms with dynamic population size and local exploration for multiobjective optimization. IEEE Trans. Evol. Comput. 5(6), 565–588 (2001)

    Article  Google Scholar 

  25. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodological) 58, 267–288 (1996)

    MathSciNet  MATH  Google Scholar 

  26. Topchy, A., Punch, W.F.: Faster genetic programming based on local gradient search of numeric leaf values. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), pp. 155–162 (2001)

    Google Scholar 

  27. Torres-Sospedra, J., Montoliu, R., Martnez-Us, A., Avariento, J.P., Arnau, T.J., Benedito-Bordonau, M., Huerta, J.: UJIIndoorLoc: A new multi-building and multi-floor database for WLAN fingerprint-based indoor localization problems. In: 2014 International Conference on Indoor Positioning and Indoor Navigation (IPIN), pp. 261–270. IEEE (2014)

    Google Scholar 

  28. Tsanas, A., Xifara, A.: Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy Build. 49, 560–567 (2012)

    Article  Google Scholar 

  29. Vanneschi, L., Cuccu, G.: A study of genetic programming variable population size for dynamic optimization problems. In: International Conference on Evolutionary Computation (ICEC 2009), pp. 119–126. Madeira, Portugal (2009)

    Google Scholar 

  30. Vladislavleva, E., Smits, G., Hertog, D.: Order of nonlinearity as a complexity measure for models generated by symbolic regression via pareto genetic programming. IEEE Trans. Evol. Comput. 13(2), 333–349 (2009)

    Article  Google Scholar 

  31. White, D.R., McDermott, J., Castelli, M., Manzoni, L., Goldman, B.W., Kronberger, G., Jakowski, W., O’Reilly, U.M., Luke, S.: Better GP benchmarks: community survey results and proposals. Genet. Program. Evolvable Mach. 14(1), 3–29 (2012)

    Google Scholar 

Download references

Acknowledgments

This work was supported by the Warren Center for Network and Data Science at the University of Pennsylvania, as well as NIH grants P30-ES013508, AI116794 and LM009012.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to William La Cava .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

La Cava, W., Moore, J. (2017). A General Feature Engineering Wrapper for Machine Learning Using \(\epsilon \)-Lexicase Survival. In: McDermott, J., Castelli, M., Sekanina, L., Haasdijk, E., García-Sánchez, P. (eds) Genetic Programming. EuroGP 2017. Lecture Notes in Computer Science(), vol 10196. Springer, Cham. https://doi.org/10.1007/978-3-319-55696-3_6

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-55696-3_6

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-55695-6

  • Online ISBN: 978-3-319-55696-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics