Abstract
Population discrimination and the classification of individuals are of great importance for genetic improvement in population studies and for genetic diversity conservation. Multivariate approaches are often used for these tasks, especially the Fisher and Anderson discriminant functions. New methodologies based on machine learning (ML) have shown promise for such procedures, but they still need further evaluation and comparison. The present study therefore evaluates the efficacy of supervised ML algorithms in classifying populations with different degrees of similarity, comparing them with the discriminant analysis techniques proposed by Anderson and by Fisher. The supervised ML methods tested were Naive Bayes, Decision Tree, k-Nearest Neighbors (kNN), Random Forest, Support Vector Machine (SVM) and Multi-layer Perceptron Neural Networks (MLP/ANN). To compare the classification methods, we used phenotypic data from populations with different degrees of genetic similarity. The data were obtained by simulating genotypic information for different populations subjected to a backcrossing scheme. Accuracy means from 30 repetitions of each classification method were compared by the Friedman and Nemenyi tests at a 95% confidence level. Classification methods based on machine learning algorithms showed superior results to the Fisher and Anderson discriminant functions, achieving high accuracy even where similarity between populations was higher. The kNN, Random Forest, SVM and Naive Bayes algorithms presented the highest accuracy, surpassing the Decision Tree algorithm and even MLP/ANN (which lost accuracy at the 96.88% similarity condition between populations). Thus, the present work shows that ML techniques achieve greater accuracy in the discrimination and classification of populations, without the limitations of the statistical techniques.
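The comparison step described above (accuracies from 30 repetitions, a Friedman test across methods, then Nemenyi pairwise comparisons) can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes eight methods (the six ML algorithms plus the Fisher and Anderson functions), uses made-up accuracies rather than the study's results, omits the tie correction of the Friedman statistic, and takes the two critical values from standard tables; all function and variable names are hypothetical.

```python
import math
import random


def average_ranks(row):
    """Rank one repetition: rank 1 = highest accuracy, ties share the mean rank."""
    order = sorted(range(len(row)), key=lambda j: -row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # 0-based positions i..j share this rank
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(acc):
    """acc[r][m] = accuracy of method m in repetition r.

    Returns (chi-square statistic without tie correction, mean rank per method).
    """
    n, k = len(acc), len(acc[0])
    rank_rows = [average_ranks(row) for row in acc]
    mean_ranks = [sum(r[m] for r in rank_rows) / n for m in range(k)]
    stat = 12 * n / (k * (k + 1)) * sum(R * R for R in mean_ranks) - 3 * n * (k + 1)
    return stat, mean_ranks


def nemenyi_cd(k, n, q_alpha):
    """Critical difference between mean ranks for the Nemenyi test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# Tabulated values for k = 8 methods at alpha = 0.05 (assumed setup).
CHI2_CRIT_DF7 = 14.067  # chi-square critical value, 7 degrees of freedom
Q_ALPHA_K8 = 3.031      # Nemenyi critical value from Demsar's (2006) table

if __name__ == "__main__":
    methods = ["Fisher", "Anderson", "Naive Bayes", "Decision Tree",
               "kNN", "Random Forest", "SVM", "MLP"]
    # Made-up base accuracies, for illustration only (NOT the study's results).
    base = [0.70, 0.71, 0.90, 0.82, 0.91, 0.90, 0.91, 0.85]
    rng = random.Random(0)
    acc = [[b + rng.gauss(0, 0.02) for b in base] for _ in range(30)]

    stat, mean_ranks = friedman_statistic(acc)
    cd = nemenyi_cd(len(methods), len(acc), Q_ALPHA_K8)
    print("Friedman statistic:", round(stat, 2),
          "significant:", stat > CHI2_CRIT_DF7)
    for name, r in sorted(zip(methods, mean_ranks), key=lambda t: t[1]):
        print(f"{name:14s} mean rank {r:.2f}")
    print("Two methods differ if their mean ranks differ by more than", round(cd, 3))
```

Under this scheme, two methods are declared different only when the gap between their mean ranks exceeds the critical difference, which is how a group of methods can be reported as jointly "highest" without a single winner.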
Acknowledgements
We thank CNPq (National Council for Scientific and Technological Development) and CAPES (Coordination for the Improvement of Higher Education Personnel) for the research fellowships (Process PNPD/CAPES 88882.315120/2019-01) and financial support (Process CNPq 301840/2016-4).
Author information
Contributions
LS was involved in conceptualization, methodology, investigation, formal analysis, data curation, and writing (original draft). PMM was involved in methodology and investigation. MLTM was involved in methodology and investigation. WNG was involved in methodology and investigation. MC was involved in methodology and investigation. CSC was involved in methodology and investigation. WSF was involved in investigation and writing (review and editing). RBC was involved in funding acquisition, supervision, investigation, and writing (review and editing).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Data archiving statement
The data used in the current study are available in the Harvard Dataverse repository: Skowronski, Leandro, 2020, "Replication Data for: Machine learning in plant breeding," https://doi.org/10.7910/DVN/BCVCZVV, Harvard Dataverse, DRAFT VERSION.
About this article
Cite this article
Skowronski, L., de Moraes, P.M., de Moraes, M.L.T. et al. Supervised learning algorithms in the classification of plant populations with different degrees of kinship. Braz. J. Bot 44, 371–379 (2021). https://doi.org/10.1007/s40415-021-00703-1
DOI: https://doi.org/10.1007/s40415-021-00703-1