Abstract
Key message
Current genome-enabled prediction models assume normally distributed errors, which makes them sensitive to outliers. We propose a model whose errors are assumed to follow a Laplace distribution, so that it deals better with outliers.
Abstract
Current genome-enabled prediction models use regressions that fit the expected value (mean) of a response variable under the assumption of normally distributed errors, which often makes them sensitive to outliers, whether genetic or environmental. For this reason, we propose a robust Bayesian genome median regression (BGMR) model that fits the median of the response rather than its mean, with errors assumed to follow a Laplace distribution to deal better with outliers. The BGMR model was fitted under a Bayesian framework with Markov chain Monte Carlo sampling, using a location–scale mixture representation of the Laplace distribution. We applied BGMR to two simulated and two real genomic data sets and compared its prediction performance with that of a conventional genomic best linear unbiased prediction (GBLUP) model and the Laplace maximum a posteriori (LMAP) method. The prediction accuracies of BGMR were higher than those of the GBLUP and LMAP methods when there were outliers. The BGMR model could be useful to breeders who need to predict and select genotypes based on data with unknown outliers.
References
Crossa J, de los Campos G, Pérez P, Gianola D, Burgueño J, Araus JL, Makumbi D, Singh RP, Dreisigacker S, Yan J, Arief V, Banziger M, Braun H-J (2010) Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics 186(2):713–724. https://doi.org/10.1534/genetics.110.118521
de los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL (2013) Whole genome regression and prediction methods applied to plant and animal breeding. Genetics 193:327–345
Edgeworth FY (1887) On observations relating to several quantities. Hermathena 6:279–285
Feng C, Wang H, Lu N, Chen T, He H, Lu Y, Tu XM (2014) Log-transformation and its implications for data analysis. Shanghai Arch Psychiatry 26(2):105–109. https://doi.org/10.3969/j.issn.1002-0829.2014.02.009
Feng C, Wang H, Lu N, Tu XM (2012) Log-transformation: applications and interpretation in biomedical research. Stat Med 32:230–239. https://doi.org/10.1002/sim.5486
Gianola D, de los Campos G, Hill WG, Manfredi E, Fernando R (2009) Additive genetic variability and the Bayesian alphabet. Genetics 183(1):347–363
Gianola D, Cecchinato A, Naya H, Schön C-C (2018) Prediction of complex traits: robust alternatives to best linear unbiased prediction. Front Genet 9:195. https://doi.org/10.3389/fgene.2018.00195
Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA (2005) Robust statistics: the approach based on influence functions. Wiley, London
Huber P (1973) Robust regression: asymptotics, conjectures, and Monte Carlo. Ann Stat 1(5):799–821
Koenker R, Bassett G (1978) Regression quantiles. Econometrica 46(1):33–50
Kozumi H, Kobayashi G (2011) Gibbs sampling methods for Bayesian quantile regression. J Stat Comput Simul 81(11):1565–1578
Lange KL, Little RJA, Taylor JMG (1989) Robust statistical modeling using the T-distribution. J Am Stat Assoc 84:881–896
Lehermeier C, Wimmer V, Albrecht T, Auinger HJ, Gianola D, Schmid VJ, Schön CC (2013) Sensitivity to prior specification in Bayesian genome-based prediction models. Stat Appl Genet Mol Biol 12(3):375–391. https://doi.org/10.1515/sagmb-2012-0042
Li Z, Möttönen J, Sillanpää MJ (2015) A robust multiple-locus method for quantitative trait locus analysis of non-normally distributed multiple traits. Heredity 115(6):556–564
Lourenço VM, Pires AM (2014) M-regression, false discovery rates and outlier detection with application to genetic association studies. J Comput Stat Data Anal 78:33–42
Lourenço VM, Pires AM, Kirst M (2011) Robust linear regression methods in association studies. Bioinformatics 27(6):815–821
Lourenço VM, Rodrigues PC, Pires AM, Piepho H-P (2017) A robust DF-REML framework for variance components estimation in genetic studies. Bioinformatics 33(22):3584–3594
Montesinos-López OA, Montesinos-López A, Crossa J, Toledo F, Pérez-Hernández O, Eskridge KM, Rutkoski J (2016) A genomic Bayesian multi-trait and multi-environment model. G3: Genes, Genomes, Genetics 6(9):2725–2744
Nascimento M, de Resende MD, Cruz CD, Nascimento AC, Viana JM, Azevedo CF, Barroso LM (2017) Regularized quantile regression applied to genome-enabled prediction of quantitative traits. Genet Mol Res. https://doi.org/10.4238/gmr16019538
Ould-Estaghvirou SB, Ogutu JO, Piepho HP (2014) Influence of outliers on accuracy estimation in genomic prediction in plant breeding. G3: Genes, Genomes, Genetics 4(12):2317–2328
Park T, Casella G (2008) The Bayesian lasso. J Am Stat Assoc 103(482):681–686
Pérez P, de los Campos G, Crossa J, Gianola D (2010) Genomic-enabled prediction based on molecular markers and pedigree using the BLR package in R. Plant Genome 3:106–116
Pérez-Rodríguez P, de los Campos G (2014) Genome-wide regression and prediction with the BGLR statistical package. Genetics 198:483–495
Pourhoseingholi A, Pourhoseingholi MA, Vahedi M, Moghimi-Dehkordi B, Maserat AS, Zali MR (2009) Relation between demographic factors and hospitalization in patients with gastrointestinal disorders, using quantile regression analysis. East Afr J Public Health 6(1):45–47
Rodrigues PC, Monteiro A, Lourenço VM (2016) A robust AMMI model for the analysis of genotype-by-environment data. Bioinformatics 32(1):58–66
Rousseeuw PJ (1984) Least median of squares regression. J Am Stat Assoc 79(388):871–880
Seber GAF, Lee AJ (2003) Linear regression analysis, 2nd edn. Wiley, Hoboken
Strandén I, Gianola D (1998) Attenuating effects of preferential treatment with Student-t mixed linear models: a simulation study. Genet Sel Evol 30:565–583
Strandén I, Gianola D (1999) Mixed effects linear models with t-distributions for quantitative genetic analysis: a Bayesian approach. Genet Sel Evol 31:25–42. https://doi.org/10.1186/1297-9686-31-1-25
VanRaden PM (2007) Genomic measures of relationship and inbreeding. Interbull Bull 37:33–36
Yohai VJ (1987) High breakdown-point and high efficiency robust estimates for regression. Ann Stat 15(2):642–656
Yu K, Moyeed A (2001) Bayesian quantile regression. Stat Probab Lett 54:437–447
Acknowledgments
We thank all scientists, field workers, and lab assistants from National Programs and CIMMYT who collected the data used in this study. We acknowledge the financial support provided by the Foundation for Research Levy on Agricultural Products (FFL) and the Agricultural Agreement Research Fund (JA) in Norway through NFR Grant 267806. We are also thankful for the financial support provided by CIMMYT CRP (maize and wheat), the Bill & Melinda Gates Foundation, as well as the USAID projects (Cornell University and Kansas State University) that financed the collection of the CIMMYT maize and wheat data analyzed in this study.
Ethics declarations
Conflict of interest
The authors declare they do not have any conflict of interest.
Additional information
Communicated by Mikko J. Sillanpää.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: Deriving full conditional distributions for the Bayesian Laplace regression model
From Eq. (2), given the random effects and \(u = (u_{1} , \ldots ,u_{n} )^{\text{T}} ,\) the joint conditional density of the vector of responses is given by
Fully conditional for \(\mu\)
where \(\tilde{\sigma }_{0}^{2} = \frac{1}{{\sigma_{0}^{ - 2} + 8^{ - 1} \sum\nolimits_{j = 1}^{n} {u_{j}^{ - 1} \sigma^{ - 2} } }}\) and \(\tilde{\mu }_{0} = \tilde{\sigma }_{0}^{2} [\mu_{0} \sigma_{0}^{ - 2} + \sigma^{ - 2} 1^{\text{T}} \varvec{D}_{u}^{ - 1} (\varvec{Y} - \varvec{b})]\). Then \(\mu |{\text{ELSE}}\,\sim\,N\left( {\tilde{\mu }_{0} ,\tilde{\sigma }_{0}^{2} } \right)\).
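As an illustration of this update, the following is a minimal Python sketch that evaluates \(\tilde{\mu }_{0}\) and \(\tilde{\sigma }_{0}^{2}\) and draws one value of \(\mu\). It is not the authors' implementation. The function and argument names are hypothetical, and the code assumes \(\varvec{D}_{u} = {\text{diag}}(8u_{1} , \ldots ,8u_{n} )\), which is the choice that makes the two expressions above agree with each other but is not restated in this appendix.

```python
import random


def mu_full_conditional(y, b, u, sigma2, mu0, sigma0_2):
    """Return (mean, variance) of the normal full conditional of mu.

    Assumption (not stated in this appendix): D_u = diag(8*u_1, ..., 8*u_n),
    so observation j carries precision 1 / (8 * u_j * sigma2).
    """
    # Per-observation precisions; a large u_j gives that observation little weight.
    weights = [1.0 / (8.0 * uj * sigma2) for uj in u]
    # Posterior variance: inverse of prior precision plus summed data precisions.
    post_var = 1.0 / (1.0 / sigma0_2 + sum(weights))
    # Precision-weighted sum of the residuals y_j - b_j.
    weighted_resid = sum(w * (yj - bj) for w, yj, bj in zip(weights, y, b))
    post_mean = post_var * (mu0 / sigma0_2 + weighted_resid)
    return post_mean, post_var


def draw_mu(y, b, u, sigma2, mu0, sigma0_2, rng=random):
    """Draw one sample of mu from its normal full conditional."""
    mean, var = mu_full_conditional(y, b, u, sigma2, mu0, sigma0_2)
    return rng.gauss(mean, var ** 0.5)
```

Under this assumption, an observation with a large latent \(u_{j}\) contributes little precision, which is how the mixture representation limits the influence of outlying records on \(\mu\).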
Fully conditional for \(\varvec{b}\)
Similarly, we have that
where \(\tilde{\varvec{\varSigma }}_{1} = (\varvec{G}^{ - 1} \sigma_{1}^{ - 2} + \sigma^{ - 2} \varvec{D}_{u}^{ - 1} )^{ - 1}\) and \(\tilde{\varvec{b}} = \sigma^{ - 2} \tilde{\varvec{\varSigma }}_{1} \varvec{D}_{u}^{ - 1} (\varvec{Y} - 1\mu )\). So \(\varvec{b}|{\text{ELSE}}\,\sim\,N(\tilde{\varvec{b}},\tilde{\varvec{\varSigma }}_{1} )\).
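For readers who want the intermediate step, the stated mean and covariance can be recovered by completing the square in \(\varvec{b}\). The sketch below is not taken from the source. It assumes a conditional likelihood \(\varvec{Y}|\mu ,\varvec{b},\varvec{u},\sigma^{2} \sim N(1\mu + \varvec{b},\sigma^{2} \varvec{D}_{u} )\) and a prior \(\varvec{b}\sim N(\varvec{0},\sigma_{1}^{2} \varvec{G})\); both are inferred from the form of the result rather than restated in this appendix.

```latex
\begin{aligned}
p(\boldsymbol{b}\mid \text{ELSE})
&\propto \exp\Big\{ -\tfrac{1}{2\sigma^{2}}
   (\boldsymbol{Y}-\mathbf{1}\mu-\boldsymbol{b})^{\mathrm{T}}
   \boldsymbol{D}_{u}^{-1}
   (\boldsymbol{Y}-\mathbf{1}\mu-\boldsymbol{b})
   -\tfrac{1}{2\sigma_{1}^{2}}\,
   \boldsymbol{b}^{\mathrm{T}}\boldsymbol{G}^{-1}\boldsymbol{b} \Big\} \\
&\propto \exp\Big\{ -\tfrac{1}{2}\Big[
   \boldsymbol{b}^{\mathrm{T}}
   \big(\sigma_{1}^{-2}\boldsymbol{G}^{-1}+\sigma^{-2}\boldsymbol{D}_{u}^{-1}\big)
   \boldsymbol{b}
   -2\sigma^{-2}\,\boldsymbol{b}^{\mathrm{T}}\boldsymbol{D}_{u}^{-1}
   (\boldsymbol{Y}-\mathbf{1}\mu) \Big] \Big\} \\
&\propto \exp\Big\{ -\tfrac{1}{2}
   (\boldsymbol{b}-\tilde{\boldsymbol{b}})^{\mathrm{T}}
   \tilde{\boldsymbol{\Sigma}}_{1}^{-1}
   (\boldsymbol{b}-\tilde{\boldsymbol{b}}) \Big\},
\qquad
\tilde{\boldsymbol{\Sigma}}_{1}^{-1}\tilde{\boldsymbol{b}}
   = \sigma^{-2}\boldsymbol{D}_{u}^{-1}(\boldsymbol{Y}-\mathbf{1}\mu).
\end{aligned}
```

The last line is the kernel of a multivariate normal density with mean \(\tilde{\varvec{b}}\) and covariance \(\tilde{\varvec{\varSigma }}_{1}\), matching the expressions given above.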
Fully conditional for \(\sigma_{1}^{2}\)
Fully conditional for \(\sigma^{2}\)
Fully conditional for \(\varvec{u}\)
where \({\text{GIG(}}v,a,b )\) denotes the generalized inverse Gaussian distribution with parameters \(v\), \(a\) and \(b\) (Kozumi and Kobayashi 2011).
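For reference, the density usually attached to this notation can be written out as follows. This is a sketch that assumes the parametrization of Kozumi and Kobayashi (2011), with squared \(a\) and \(b\) in the exponent; the parametrization actually intended here should be checked against Eq. (2) and the full conditional above before this form is used in a sampler. \(K_{v}\) denotes the modified Bessel function of the third kind.

```latex
f(x \mid v, a, b)
  = \frac{(b/a)^{v}}{2\,K_{v}(ab)}\;
    x^{\,v-1}
    \exp\!\Big\{ -\tfrac{1}{2}\big(a^{2}x^{-1} + b^{2}x\big) \Big\},
\qquad x > 0,\quad -\infty < v < \infty,\quad a, b \ge 0 .
```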
Fully conditional for missing values
where \(\varvec{\eta}^{*} = 1^{\varvec{*}} \mu + \varvec{g}^{\varvec{*}}\) is the corresponding linear predictor of the missing values in the model in Eq. (2) and \(\varvec{D}_{u}^{*}\) is a diagonal matrix that retains the elements in \(\varvec{D}_{u}\) corresponding to the missing values.
Appendix 2: Setting hyperparameters for the prior distributions of the BGMR model
The prior mean (\(\mu_{0}\)) for the general mean (\(\mu\)) was set to the sample mean of the response in the training data, while the rest of the hyperparameters for the BGMR model were set similarly to those used in the BGLR software (Pérez-Rodríguez and de los Campos 2014). These rules provide proper but weakly informative prior distributions. We partitioned the total variance–covariance of the phenotypes into two components: (1) the error and (2) the linear predictor. First, the variance of the phenotypes \(y_{i}\) under the model is given by
Therefore, the average of the variance of the individuals, called total variance, is equal to
Then we set \(R_{1}^{2}\) as the proportion of the total variance (\({\mathbf{V}}_{y}\)) that is explained by lines a priori, so that \(V_{g} = R_{1}^{2} {\mathbf{V}}_{y}\), and replace \(\sigma_{1}^{2}\) in \(V_{1}\) by its prior mode, \(\frac{{S_{1} }}{{df_{1} + 2}}\). Once we have set a value for \(df_{1}\), the scale parameter is given by
By default, we set the shape parameter \(df_{1} = 5\) and \(R_{1}^{2} = 0.5\).
Similarly, once there is a value for the shape parameter of the prior distribution of \(\sigma^{2}\), \(df\), the value of the scale parameter is given by
where \(1 - R_{1}^{2}\) is the proportion of the total variance (\({\mathbf{V}}_{y}\)) that is explained by the error a priori. By default, we set \(df = 5\).
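The pdf of the scaled inverse Chi-square distribution with \(v\) degrees of freedom and scale parameter \(S\), \(\chi^{ - 2} \left( {v,S} \right)\), is given by \(f\left( {x|v,S} \right) = \frac{{\left( {S/2} \right)^{v/2} }}{{\Gamma \left( {v/2} \right)}}x^{ - \left( {v/2 + 1} \right)} \exp \left( { - \frac{S}{2x}} \right)\) for \(x > 0\),
and the mean, mode, and variance of this distribution are given by \(\frac{S}{v - 2}\), \(\frac{S}{v + 2}\), and \(\frac{{2S^{2} }}{{\left( {v - 2} \right)^{2} \left( {v - 4} \right)}}\), respectively, where the mean requires \(v > 2\) and the variance requires \(v > 4\). Specifically, the prior mean, mode, and variance are \(\frac{{S_{1} }}{{df_{1} - 2}}\), \(\frac{{S_{1} }}{{df_{1} + 2}}\), and \(\frac{{2S_{1}^{2} }}{{\left( {df_{1} - 2} \right)^{2} \left( {df_{1} - 4} \right)}}\) for \(\sigma_{1}^{2}\), and \(\frac{S}{df - 2}\), \(\frac{S}{df + 2}\), and \(\frac{{2S^{2} }}{{\left( {df - 2} \right)^{2} \left( {df - 4} \right)}}\) for \(\sigma^{2}\).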
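The moment formulas for the scaled inverse Chi-square distribution can be checked numerically with a short Python sketch. The function name and the example scale value below are hypothetical and chosen only for illustration; the default shape value of 5 comes from the text, while the scale parameters depend on expressions not reproduced here.

```python
def scaled_inv_chi2_moments(df, scale):
    """Return (mean, mode, variance) of a scaled inverse Chi-square prior.

    df    : shape parameter (degrees of freedom); must exceed 4 for a finite variance
    scale : scale parameter S
    """
    if df <= 4:
        raise ValueError("the variance is finite only when df > 4")
    mean = scale / (df - 2)
    mode = scale / (df + 2)
    variance = 2.0 * scale ** 2 / ((df - 2) ** 2 * (df - 4))
    return mean, mode, variance


# Illustrative call with the default shape df = 5 and a made-up scale of 7.0.
prior_mean, prior_mode, prior_var = scaled_inv_chi2_moments(5, 7.0)
```

With a shape of 5 the mode is \(S/7\) and the mean is \(S/3\), so the prior is right-skewed; that skew is consistent with the text's choice of the prior mode, rather than the mean, when solving for the scale parameter.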
Appendix 3
Cite this article
Montesinos-López, A., Montesinos-López, O.A., Villa-Diharce, E.R. et al. A robust Bayesian genome-based median regression model. Theor Appl Genet 132, 1587–1606 (2019). https://doi.org/10.1007/s00122-019-03303-6
DOI: https://doi.org/10.1007/s00122-019-03303-6