Automating Biomedical Data Science Through Tree-Based Pipeline Optimization

Olson, Randal S.; Urbanowicz, Ryan J.; Andrews, Peter C.; Lavender, Nicole A.; Kidd, La Creis; Moore, Jason H.

doi:10.1007/978-3-319-31204-0_9

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9597)

Included in the following conference series:

European Conference on the Applications of Evolutionary Computation

Abstract

Over the past decade, data science and machine learning have grown from a mysterious art form to a staple tool across a variety of fields in academia, business, and government. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning: pipeline design. We implement a Tree-based Pipeline Optimization Tool (TPOT) and demonstrate its effectiveness on a series of simulated and real-world genetic data sets. In particular, we show that TPOT can build machine learning pipelines that achieve competitive classification accuracy and discover novel pipeline operators, such as synthetic feature constructors, that significantly improve classification accuracy on these data sets. We also highlight the current challenges to pipeline optimization, such as the tendency to produce pipelines that overfit the data, and suggest future research paths to overcome these challenges. As such, this work represents an early step toward fully automating machine learning pipeline design.

Acknowledgments

We thank Sebastian Raschka for his valuable input during the development of this project. We also thank the Michigan State University High Performance Computing Center for the use of their computing resources. This work was supported by National Institutes of Health grants LM009012, LM010098, and EY022300.

Author information

Authors and Affiliations

Institute for Biomedical Informatics, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA, 19104, USA
Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews & Jason H. Moore
University of Louisville, 505 S. Hancock St., Louisville, KY, 40202, USA
Nicole A. Lavender & La Creis Kidd

Corresponding author

Correspondence to Randal S. Olson.

Editor information

Editors and Affiliations

Politecnico di Torino, Turin, Italy
Giovanni Squillero
Aalborg University, Copenhagen, Denmark
Paolo Burelli

About this paper

Cite this paper

Olson, R.S., Urbanowicz, R.J., Andrews, P.C., Lavender, N.A., Kidd, L.C., Moore, J.H. (2016). Automating Biomedical Data Science Through Tree-Based Pipeline Optimization. In: Squillero, G., Burelli, P. (eds) Applications of Evolutionary Computation. EvoApplications 2016. Lecture Notes in Computer Science, vol 9597. Springer, Cham. https://doi.org/10.1007/978-3-319-31204-0_9

DOI: https://doi.org/10.1007/978-3-319-31204-0_9
Published: 15 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31203-3
Online ISBN: 978-3-319-31204-0
eBook Packages: Computer Science, Computer Science (R0)
