Abstract
Genetic programming has found recent success as a tool for learning sets of features for regression and classification. Multidimensional genetic programming is a useful variant of genetic programming for this task because it represents candidate solutions as sets of programs. These sets of programs expose additional information that can be exploited for building block identification. In this work, we discuss this architecture and others in terms of their propensity for allowing heuristic search to utilize information during the evolutionary process. We investigate methods for biasing the components of programs that are promoted in order to guide search towards useful and complementary feature spaces. We study two main approaches: (1) the introduction of new objectives and (2) the use of specialized semantic variation operators. We find that a semantic crossover operator based on stagewise regression leads to significant improvements on a set of regression problems. The inclusion of semantic crossover produces state-of-the-art results in a large benchmark study of open-source regression problems in comparison to several state-of-the-art machine learning approaches and other genetic programming frameworks. Finally, we look at the collinearity and complexity of the data representations produced by different methods, in order to assess whether relevant, concise, and independent factors of variation can be produced in application.
References
I. Arnaldo, K. Krawiec, U.M. O’Reilly, Multiple regression genetic programming, in Proceedings of the 2014 Conference on Genetic and Evolutionary Computation (ACM Press, 2014), pp. 879–886. https://doi.org/10.1145/2576768.2598291. http://dl.acm.org/citation.cfm?doid=2576768.2598291. Accessed 15 Oct 2019
I. Arnaldo, U.M. O’Reilly, K. Veeramachaneni, Building predictive models via feature synthesis, in GECCO (ACM Press, 2015), pp. 983–990. https://doi.org/10.1145/2739480.2754693. http://dl.acm.org/citation.cfm?doid=2739480.2754693. Accessed 15 Oct 2019
D.A. Belsley, A guide to using the collinearity diagnostics. Comput. Sci. Econ. Manag. 4(1), 33–50 (1991). https://doi.org/10.1007/BF00426854
Y. Bengio, A. Courville, P. Vincent, Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
P.P. Brahma, D. Wu, Y. She, Why deep learning works: a manifold disentanglement perspective. IEEE Trans. Neural Netw. Learn. Syst. 27(10), 1997–2008 (2016)
M. Castelli, S. Silva, L. Vanneschi, A C++ framework for geometric semantic genetic programming. Genet. Program. Evol. Mach. 16(1), 73–81 (2015). https://doi.org/10.1007/s10710-014-9218-0
T. Chen, C. Guestrin, XGBoost: a scalable tree boosting system, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16 (ACM, New York, NY, USA, 2016), pp. 785–794. https://doi.org/10.1145/2939672.2939785
W.S. Cleveland, Visualizing Data (Hobart Press, New Jersey, 1993)
A. Cline, C. Moler, G. Stewart, J. Wilkinson, An estimate for the condition number of a matrix. SIAM J. Numer. Anal. 16(2), 368–375 (1979). https://doi.org/10.1137/0716029
E. Conti, V. Madhavan, F.P. Such, J. Lehman, K.O. Stanley, J. Clune, Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. arXiv:1712.06560 [cs] (2017)
C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, S. Yang, Adanet: adaptive structural learning of artificial neural networks. arXiv preprint arXiv:1607.01097 (2016)
V.V. De Melo, Kaizen Programming, in GECCO (ACM Press, New York, 2014), pp. 895–902. https://doi.org/10.1145/2576768.2598264. http://dl.acm.org/citation.cfm?doid=2576768.2598264
K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II, in Parallel Problem Solving from Nature PPSN VI, vol. 1917, ed. by M. Schoenauer, K. Deb, G. Rudolph, X. Yao, E. Lutton, J.J. Merelo, H.P. Schwefel (Springer, Berlin, 2000), pp. 849–858. http://repository.ias.ac.in/83498/. Accessed 15 Oct 2019
J. Demšar, Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7(Jan), 1–30 (2006)
C. Eastwood, C.K.I. Williams, A framework for the quantitative evaluation of disentangled representations, in ICLR (2018). https://openreview.net/forum?id=By-7dz-AZ. Accessed 15 Oct 2019
C. Fernando, D. Banarse, M. Reynolds, F. Besse, D. Pfau, M. Jaderberg, M. Lanctot, D. Wierstra, Convolution by evolution: differentiable pattern producing networks. arXiv:1606.02580 [cs] (2016)
R. Ffrancon, M. Schoenauer, Memetic Semantic Genetic Programming (ACM Press, 2015), pp. 1023–1030. https://doi.org/10.1145/2739480.2754697. http://dl.acm.org/citation.cfm?doid=2739480.2754697
S.B. Fine, E. Hemberg, K. Krawiec, U.M. O’Reilly, Exploiting subprograms in genetic programming, in Genetic Programming Theory and Practice XV, Genetic and Evolutionary Computation, ed. by W. Banzhaf, R.S. Olson, W. Tozier, R. Riolo (Springer, Berlin, 2018), pp. 1–16
D. Floreano, P. Dürr, C. Mattiussi, Neuroevolution: from architectures to learning. Evol. Intell. 1(1), 47–62 (2008). https://doi.org/10.1007/s12065-007-0002-4
Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, in Computational Learning Theory, ed. by P. Vitanyi (Springer, Berlin, 1995), pp. 23–37. https://doi.org/10.1007/3-540-59119-2_166
J. Friedman, T. Hastie, R. Tibshirani, The elements of statistical learning. Springer series in statistics, vol. 1 (Springer, Berlin, 2001). http://statweb.stanford.edu/tibs/book/preface.ps. Accessed 15 Oct 2019
A.H. Gandomi, A.H. Alavi, A new multi-gene genetic programming approach to nonlinear system modeling. Part I: materials and structural engineering problems. Neural Comput. Appl. 21(1), 171–187 (2012). https://doi.org/10.1007/s00521-011-0734-z
F. Gomez, J. Schmidhuber, R. Miikkulainen, Efficient non-linear control through neuroevolution, in ECML, vol. 4212 (Springer, 2006), pp. 654–662. http://link.springer.com/content/pdf/10.1007/11871842.pdf#page=676
A. Gonzalez-Garcia, J. van de Weijer, Y. Bengio, Image-to-image translation for cross-domain disentanglement. arXiv preprint arXiv:1805.09730 (2018)
I. Goodfellow, H. Lee, Q.V. Le, A. Saxe, A.Y. Ng, Measuring invariances in deep networks, in Advances in Neural Information Processing Systems (2009), pp. 646–654
M. Graff, E.S. Tellez, E. Villaseñor, S. Miranda, Semantic genetic programming operators based on projections in the phenotype space. Res. Comput. Sci. 94, 73–85 (2015)
N. Hadad, L. Wolf, M. Shahar, A two-step disentanglement method, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 772–780 (2018)
I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, A. Lerchner, \(\beta\)-VAE: Learning basic visual concepts with a constrained variational framework, in ICLR (2017)
A.E. Hoerl, R.W. Kennard, Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67 (1970)
C. Igel, Neuroevolution for reinforcement learning using evolution strategies, in The 2003 Congress on Evolutionary Computation, 2003. CEC’03, vol. 4 (IEEE, 2003), pp. 2588–2595. http://ieeexplore.ieee.org/abstract/document/1299414/. Accessed 15 Oct 2019
V. Ingalalli, S. Silva, M. Castelli, L. Vanneschi, A multi-dimensional genetic programming approach for multi-class classification problems, in Genetic Programming, ed. by M. Nicolau (Springer, Berlin, 2014), pp. 48–60. https://doi.org/10.1007/978-3-662-44303-3_5
G. James, D. Witten, T. Hastie, R. Tibshirani, An introduction to statistical learning, in Springer Texts in Statistics, vol. 103, ed. by N.H. Timm (Springer, New York, 2013). https://doi.org/10.1007/978-1-4614-7138-7
D.P. Kingma, J. Ba, Adam: a method for stochastic optimization. arXiv:1412.6980 [cs] (2014).
S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
M. Kommenda, G. Kronberger, M. Affenzeller, S.M. Winkler, B. Burlacu, Evolving simple symbolic regression models by multi-objective genetic programming, in Genetic Programming Theory and Practice, vol. XIV. Genetic and Evolutionary Computation (Springer, Ann Arbor, MI, 2015)
K. Krawiec, Genetic programming-based construction of features for machine learning and knowledge discovery tasks. Genet. Program. Evol. Mach. 3(4), 329–343 (2002). https://doi.org/10.1023/A:1020984725014
K. Krawiec, On relationships between semantic diversity, complexity and modularity of programming tasks, in Proceedings of the fourteenth international conference on Genetic and evolutionary computation conference (ACM, 2012), pp. 783–790. http://dl.acm.org/citation.cfm?id=2330272. Accessed 15 Oct 2019
K. Krawiec, Behavioral Program Synthesis with Genetic Programming, vol. 618 (Springer, Berlin, 2016)
K. Krawiec, U.M. O’Reilly, Behavioral programming: a broader and more detailed take on semantic GP, in Proceedings of the 2014 Conference on Genetic and Evolutionary Computation (ACM Press, 2014), pp. 935–942. https://doi.org/10.1145/2576768.2598288. http://dl.acm.org/citation.cfm?doid=2576768.2598288. Accessed 15 Oct 2019
A. Kumar, P. Sattigeri, A. Balakrishnan, Variational inference of disentangled latent concepts from unlabeled observations, in ICLR (2018). https://openreview.net/forum?id=H1kG7GZAW. Accessed 15 Oct 2019
W. La Cava, T. Helmuth, L. Spector, J.H. Moore, A probabilistic and multi-objective analysis of lexicase selection and \(\varepsilon\)-lexicase selection. Evolut. Comput. (2018). https://doi.org/10.1162/evco_a_00224
W. La Cava, J. Moore, A general feature engineering wrapper for machine learning using \(\epsilon\)-lexicase survival, in Genetic Programming (Springer, Cham, 2017), pp. 80–95. https://doi.org/10.1007/978-3-319-55696-3_6
W. La Cava, J.H. Moore, Ensemble representation learning: an analysis of fitness and survival for wrapper-based genetic programming methods, in GECCO ’17: Proceedings of the 2017 Genetic and Evolutionary Computation Conference (ACM, Berlin, Germany), pp. 961–968 (2017). https://doi.org/10.1145/3071178.3071215. arXiv:1703.06934
W. La Cava, J.H. Moore, Semantic variation operators for multidimensional genetic programming, in Proceedings of the 2019 Genetic and Evolutionary Computation Conference, GECCO ’19 (ACM, Prague, Czech Republic, 2019). https://doi.org/10.1145/3321707.3321776. arXiv:1904.08577
W. La Cava, S. Silva, K. Danai, L. Spector, L. Vanneschi, J.H. Moore, Multidimensional genetic programming for multiclass classification. Swarm Evolut. Comput. (2018). https://doi.org/10.1016/j.swevo.2018.03.015
W. La Cava, T.R. Singh, J. Taggart, S. Suri, J.H. Moore, Learning concise representations for regression by evolving networks of trees, in International Conference on Learning Representations, ICLR (2019). arXiv:1807.00981 (in press)
W. La Cava, L. Spector, K. Danai, Epsilon-lexicase selection for regression, in Proceedings of the Genetic and Evolutionary Computation Conference 2016, GECCO ’16 (ACM, New York, NY, USA, 2016), pp. 741–748. https://doi.org/10.1145/2908812.2908898
Q. Le, B. Zoph, Using machine learning to explore neural network architecture (2017). https://ai.googleblog.com/2017/05/using-machine-learning-to-explore.html. Accessed 15 Oct 2019
C. Liu, B. Zoph, J. Shlens, W. Hua, L.J. Li, L. Fei-Fei, A. Yuille, J. Huang, K. Murphy, Progressive neural architecture search. arXiv preprint arXiv:1712.00559 (2017)
T. McConaghy, FFX: Fast, scalable, deterministic symbolic regression technology, in Genetic Programming Theory and Practice IX, ed. by R. Riolo, E. Vladislavleva, J.H. Moore (Springer, Berlin, 2011), pp. 235–260. https://doi.org/10.1007/978-1-4614-1770-5_13
D. Medernach, J. Fitzgerald, R.M.A. Azad, C. Ryan, A new wave: a dynamic approach to genetic programming, in Proceedings of the Genetic and Evolutionary Computation Conference 2016, GECCO ’16 (ACM, New York, NY, USA, 2016), pp. 757–764. https://doi.org/10.1145/2908812.2908857
V.V. de Melo, W. Banzhaf, Automatic feature engineering for regression models with machine learning: an evolutionary computation and statistics hybrid. Inf. Sci. (2017). https://doi.org/10.1016/j.ins.2017.11.041
G. Montavon, K.R. Müller, Better representations: invariant, disentangled and reusable, in Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science, ed. by G. Montavon, K.R. Müller (Springer, Berlin, 2012), pp. 559–560
A. Moraglio, K. Krawiec, C.G. Johnson, Geometric semantic genetic programming, in Parallel Problem Solving from Nature-PPSN XII (Springer, 2012), pp. 21–31. http://link.springer.com/chapter/10.1007/978-3-642-32937-1_3. Accessed 15 Oct 2019
M. Muharram, G.D. Smith, Evolutionary constructive induction. IEEE Trans. Knowl. Data Eng. 17(11), 1518–1528 (2005)
L. Muñoz, S. Silva, L. Trujillo, M3GP: multiclass classification with GP, in Genetic Programming (Springer, 2015), pp. 78–91. http://link.springer.com/chapter/10.1007/978-3-319-16501-1_7. Accessed 15 Oct 2019
L. Muñoz, L. Trujillo, S. Silva, M. Castelli, L. Vanneschi, Evolving multidimensional transformations for symbolic regression with M3GP. Memet. Comput. (2018). https://doi.org/10.1007/s12293-018-0274-5
K. Neshatian, M. Zhang, P. Andreae, A filter approach to multiple feature construction for symbolic learning classifiers using genetic programming. IEEE Trans. Evolut. Comput. 16(5), 645–661 (2012)
R.M. O’Brien, A caution regarding rules of thumb for variance inflation factors. Qual. Quant. 41(5), 673–690 (2007). https://doi.org/10.1007/s11135-006-9018-6
R.S. Olson, W. La Cava, P. Orzechowski, R.J. Urbanowicz, J.H. Moore, PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Min. (2017). arXiv preprint arXiv:1703.00512
P. Orzechowski, W. La Cava, J.H. Moore, Where are we now? A large benchmark study of recent symbolic regression methods. arXiv:1804.09331 [cs] (2018). https://doi.org/10.1145/3205455.3205539.
T.P. Pawlak, B. Wieloch, K. Krawiec, Semantic backpropagation for designing search operators in genetic programming. IEEE Trans. Evol. Comput. 19(3), 326–340 (2015)
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12(Oct), 2825–2830 (2011)
H. Pham, M.Y. Guan, B. Zoph, Q.V. Le, J. Dean, Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268 (2018)
E. Real, Using evolutionary AutoML to discover neural network architectures (2018). https://ai.googleblog.com/2018/03/using-evolutionary-automl-to-discover.html. Accessed 15 Oct 2019
E. Real, S. Moore, A. Selle, S. Saxena, Y.L. Suematsu, J. Tan, Q. Le, A. Kurakin, Large-scale evolution of image classifiers. arXiv:1703.01041 [cs] (2017)
M. Schmidt, H. Lipson, Age-fitness pareto optimization, in Genetic Programming Theory and Practice VIII (Springer, 2011), pp. 129–146. http://link.springer.com/chapter/10.1007/978-1-4419-7747-2_8. Accessed 15 Oct 2019
D. Searson, M. Willis, G. Montague, Co-evolution of non-linear PLS model components. J. Chemom. 21(12), 592–603 (2007). https://doi.org/10.1002/cem.1084
D.P. Searson, D.E. Leahy, M.J. Willis, GPTIPS: an open source genetic programming toolbox for multigene symbolic regression, in Proceedings of the International Multiconference of Engineers and Computer Scientists, vol. 1 (IMECS, Hong Kong, 2010), pp. 77–80
S. Silva, L. Munoz, L. Trujillo, V. Ingalalli, M. Castelli, L. Vanneschi, Multiclass classification through multidimensional clustering, in Genetic Programming Theory and Practice XIII, vol. 13 (Springer, Ann Arbor, MI, 2015)
L. Spector, Assessment of problem modality by differential performance of lexicase selection in genetic programming: a preliminary report, in Proceedings of the fourteenth international conference on Genetic and evolutionary computation conference companion (2012), pp. 401–408. http://dl.acm.org/citation.cfm?id=2330846. Accessed 15 Oct 2019
K.O. Stanley, Compositional pattern producing networks: a novel abstraction of development. Genet. Program. Evolvable Mach. 8(2), 131–162 (2007). https://doi.org/10.1007/s10710-007-9028-8
K.O. Stanley, J. Clune, J. Lehman, R. Miikkulainen, Designing neural networks through neuroevolution. Nat. Mach. Intell. 1(1), 24 (2019). https://doi.org/10.1038/s42256-018-0006-z
K.O. Stanley, D.B. D’Ambrosio, J. Gauci, A hypercube-based encoding for evolving large-scale neural networks. Artif. Life 15(2), 185–212 (2009). https://doi.org/10.1162/artl.2009.15.2.15202
K.O. Stanley, R. Miikkulainen, Evolving neural networks through augmenting topologies. Evolut. Comput. 10(2), 99–127 (2002). https://doi.org/10.1162/106365602320169811
R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 58, 267–288 (1996)
R. Tibshirani, T. Hastie, B. Narasimhan, G. Chu, Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. 99(10), 6567–6572 (2002). https://doi.org/10.1073/pnas.082099299
L. Vanneschi, M. Castelli, L. Manzoni, K. Krawiec, A. Moraglio, S. Silva, I. Gonçalves, PSXO: population-wide semantic crossover, in Proceedings of the Genetic and Evolutionary Computation Conference Companion (ACM, 2017), pp. 257–258
E. Vladislavleva, G. Smits, D. den Hertog, Order of nonlinearity as a complexity measure for models generated by symbolic regression via pareto genetic programming. IEEE Trans. Evol. Comput. 13(2), 333–349 (2009). https://doi.org/10.1109/TEVC.2008.926486
W. Whitney, Disentangled representations in neural models. arXiv:1602.02383 [cs] (2016).
B. Zoph, Q.V. Le, Neural architecture search with reinforcement learning (2016). https://arxiv.org/abs/1611.01578
Acknowledgements
This work was supported by NIH Grants K99LM012926-01A1, AI116794 and LM012601, as well as the PA CURE Grant from the Pennsylvania Department of Health. Special thanks to Tilak Raj Singh and other members of the Computational Genetics Lab at the University of Pennsylvania.
Appendix
1.1 Additional experiment information
Table 6 details the hyperparameters for each method used in the experimental results described in Sects. 4 and 5.
1.2 Comparison of selection algorithms
Our initial analysis sought to determine how different selection and survival (SO) approaches performed within this framework. We tested five methods: (1) NSGA2, (2) Lex, (3) LexNSGA2, (4) simulated annealing, and (5) random search. The simulated annealing and random search approaches are described below.
Simulated annealing Simulated annealing (SimAnn) is a non-evolutionary technique that instead models optimization on the metallurgical process of annealing. In our implementation, offspring compete with their parents; when an offspring has multiple parents, it competes with the parent with which it shares more nodes. The probability of an offspring replacing its parent in the population is given by the equation \(P(\text {replace}) = \min \left( 1,\ \exp \left( (F(\text {parent}) - F(\text {offspring}))/t\right) \right)\) (7).
The probability of an offspring replacing its parent is thus a function of its fitness, F, in our case the mean squared loss of the candidate model. In Eq. 7, t is a scheduling parameter that controls the rate of “cooling”, i.e. the rate at which worse steps in the search space are tolerated by the update rule. In accordance with [34], we use an exponential schedule for t, defined as \(t_{g} = (0.9)^g t_0\), where g is the current generation and \(t_0\) is the starting temperature; \(t_0\) is set to 10 in our experiments.
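The acceptance rule and cooling schedule described above can be sketched as follows (a minimal illustration, not FEAT's actual implementation; the function names are ours):

```python
import math
import random

def anneal_accept(f_parent: float, f_offspring: float, t: float) -> bool:
    """Metropolis-style acceptance: always keep an offspring that does not
    increase the loss; accept a worse offspring with probability
    exp(-(F(offspring) - F(parent)) / t)."""
    if f_offspring <= f_parent:
        return True
    return random.random() < math.exp(-(f_offspring - f_parent) / t)

def temperature(g: int, t0: float = 10.0) -> float:
    """Exponential cooling schedule t_g = (0.9)^g * t0, with t0 = 10."""
    return (0.9 ** g) * t0
```

At generation 0 the temperature is 10, so even large fitness regressions are frequently accepted; by generation 50 the temperature has dropped to roughly 0.05, and the search behaves almost greedily.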
Random search We compare the selection and survival methods to random search, in which no assumptions are made about the structure of the search space. To conduct random search, we randomly sample \({\mathbb {S}}\) using the initialization procedure. Since FEAT begins with a linear model of the process, random search will produce a representation at least as good as this initial model on the internal validation set.
A note on archiving When FEAT is used without a complexity-aware survival method (i.e., with Lex, SimAnn, Random), a separate population is maintained that acts as an archive. The archive maintains a Pareto front according to minimum loss and complexity (Eq. 3). At the end of optimization, the archive is tested on a small hold-out validation set. The individual with the lowest validation loss is the final selected model. Maintaining this archive helps protect against overfitting resulting from overly complex/high capacity representations, and also can be interpreted directly to help understand the process being modelled.
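The archiving and final selection steps above can be sketched as a toy example (the data layout and the names `pareto_archive` and `select_final` are illustrative assumptions, not FEAT's API):

```python
def pareto_archive(candidates):
    """Keep candidates not dominated on (loss, complexity): an entry is
    dropped when another entry has loss and complexity both no worse,
    with at least one strictly better. `candidates` is a list of
    (loss, complexity, model) tuples."""
    front = []
    for loss, comp, model in candidates:
        dominated = any(
            l2 <= loss and c2 <= comp and (l2 < loss or c2 < comp)
            for l2, c2, _ in candidates
        )
        if not dominated:
            front.append((loss, comp, model))
    return front

def select_final(front, validation_loss):
    """Pick the archived model with the lowest hold-out validation loss."""
    return min(front, key=lambda entry: validation_loss(entry[2]))
```

Because the front retains the simplest model at every attained loss level, the cheap validation pass at the end trades a small amount of training fit for robustness to overfitting.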
We benchmarked these approaches in a separate experiment on 88 datasets from PMLB [60]. The results are shown in Figs. 13, 14, 15 and 16. Considering Figs. 13 and 14, we see that LexNSGA2 achieves the best average \(R^2\) value while producing small solutions in comparison to Lex. NSGA2, SimAnneal, and random search all produce less accurate models. The runtime comparisons in Fig. 15 show that the methods are mostly within an order of magnitude of each other, with NSGA2 the fastest (due to its maintenance of small representations) and random search the slowest, which suggests that, absent selection pressure, the variation operators tend to increase the average size of solutions over many iterations.
1.3 Illustrative example
We show an illustrative example of the final archive and model selection process from applying FEAT to a galaxy visualization dataset [8] in Fig. 17. The red and blue points correspond to training and validation scores for each archived representation with a square denoting the final model selection. Five of the representations are printed in plain text, with each feature separated by brackets. The vertical lines in the left figure denote the test scores for FEAT, RF and ElasticNet. It is interesting to note that ElasticNet performance roughly matches the performance of a linear representation, and the RF test performance corresponds to the representation \([\tanh (x_0)][\tanh (x_1)]\) that is suggestive of axis-aligned splits for \(x_0\) and \(x_1\). The selected model is shown on the right, with the features sorted according to the magnitudes of \(\beta\) in the linear model. The final representation combines tanh, polynomial, linear and interacting features. This representation is a clear extension of simpler ones in the archive, and the archive thereby serves to characterize the improvement in predictive accuracy brought about by increasing complexity. Although a mechanistic interpretation requires domain expertise, the final representation is certainly concise and amenable to interpretation.
1.4 Statistical comparisons
We perform pairwise comparisons of methods according to the procedure recommended by Demšar [14] for comparing multiple estimators (Table 7). In Table 8, the CV \(R^2\) rankings are compared. In Table 9, the best model size rankings are compared. Note that KernelRidge is omitted from the size comparisons since we don’t have a comparable way of measuring the model size.
La Cava, W., Moore, J.H. Learning feature spaces for regression with genetic programming. Genet Program Evolvable Mach 21, 433–467 (2020). https://doi.org/10.1007/s10710-020-09383-4