Abstract
Genomic selection (GS) is a popular breeding method that uses genome-wide markers to predict plant phenotypes. Empirical studies and simulations have shown that GS can greatly accelerate the breeding cycle, beyond what is possible with traditional quantitative trait locus (QTL) approaches. GS is a regression problem in which single-nucleotide polymorphisms (SNPs) are typically used to predict phenotypes. Because SNP data are extremely high-dimensional, on the order of 100,000 dimensions, accurate phenotypic prediction is difficult, and finding the optimal prediction model is computationally very costly. Moreover, out of thousands of SNPs, usually only a few influence a particular phenotypic trait. We first show that ensemble-based regression techniques give better prediction accuracy than the traditional regression methods used in existing papers. We then further improve the prediction accuracy by using an ensemble of feature selection and feature extraction techniques, which also reduces the time needed to compute the regression model parameters. We predict three traits: grain yield, time to 50% flowering and plant height, for which the existing methods give an accuracy of 0.304, 0.627 and 0.341, respectively. Our proposed regression model gives an accuracy of 0.330, 0.674 and 0.458 for these traits. We also propose a computationally efficient regression model that reduces the computation time by as much as 90% and gives an accuracy of 0.342, 0.580 and 0.411, respectively.
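The general shape of such an approach can be illustrated with a minimal sketch. It is not the authors' model: it chains one feature selection step (univariate F-test), one feature extraction step (PCA) and one ensemble regressor (random forest) from scikit-learn, whereas the abstract describes ensembles of such techniques. The data are synthetic, all function names, sizes and parameter values are hypothetical, and scoring by Pearson correlation between predicted and observed phenotypes is an assumption, since the abstract does not define "accuracy".

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def make_synthetic_snps(n_lines=200, n_snps=2000, n_causal=10, seed=0):
    """Simulate SNPs coded 0/1/2 and a trait driven by a few causal SNPs.

    Purely illustrative data; sizes are far smaller than real SNP panels.
    """
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 3, size=(n_lines, n_snps)).astype(float)
    causal = rng.choice(n_snps, size=n_causal, replace=False)
    effects = rng.normal(0.0, 1.0, size=n_causal)
    y = X[:, causal] @ effects + rng.normal(0.0, 1.0, size=n_lines)
    return X, y


def build_pipeline(k_selected=200, n_components=20, seed=0):
    """Feature selection, then feature extraction, then an ensemble regressor."""
    return Pipeline([
        # Keep the k SNPs with the strongest univariate association.
        ("select", SelectKBest(score_func=f_regression, k=k_selected)),
        # Compress the selected SNPs into a few principal components.
        ("extract", PCA(n_components=n_components, random_state=seed)),
        # Ensemble of decision trees fitted on the reduced features.
        ("model", RandomForestRegressor(n_estimators=100, random_state=seed)),
    ])


def pearson_accuracy(y_true, y_pred):
    """Pearson correlation between observed and predicted phenotypes (assumed metric)."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


X, y = make_synthetic_snps()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipeline = build_pipeline()
pipeline.fit(X_train, y_train)  # selection and PCA are fitted on training lines only
predictions = pipeline.predict(X_test)
accuracy = pearson_accuracy(y_test, predictions)
print(f"Pearson r on held-out lines: {accuracy:.3f}")
```

Wrapping the reduction steps in a single pipeline means they are fitted on the training lines only, so the held-out lines do not leak into SNP selection. Reducing 2000 SNPs to 20 components before fitting also shows why feature reduction can shorten model fitting time, although the size of that saving on real data is not something this sketch measures.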
Funding
This research did not receive any specific funding.
Ethics declarations
Conflict of interest
The authors declare no conflicts of interest.
About this article
Cite this article
Banerjee, R., Marathi, B. & Singh, M. Efficient genomic selection using ensemble learning and ensemble feature reduction. J. Crop Sci. Biotechnol. 23, 311–323 (2020). https://doi.org/10.1007/s12892-020-00039-4
DOI: https://doi.org/10.1007/s12892-020-00039-4