A robust Bayesian genome-based median regression model

Montesinos-López, Abelardo; Montesinos-López, Osval A.; Villa-Diharce, Enrique R.; Gianola, Daniel; Crossa, José

doi:10.1007/s00122-019-03303-6

Original Article
Published: 12 February 2019

Volume 132, pages 1587–1606, (2019)

Theoretical and Applied Genetics

Abelardo Montesinos-López¹,
Osval A. Montesinos-López²,
Enrique R. Villa-Diharce³,
Daniel Gianola⁴ &
…
José Crossa ORCID: orcid.org/0000-0001-9429-5855⁵

Abstract

Key message

Current genome-enabled prediction models assume normally distributed errors, which makes them sensitive to outliers. We propose a model whose errors are assumed to follow a Laplace distribution in order to deal better with outliers.

Abstract

Current genome-enabled prediction models use regressions that fit the expected value (mean) of a response variable with errors assumed normally distributed, which are often sensitive to outliers, either genetic or environmental. For this reason, we propose a robust Bayesian genome median regression (BGMR) model that fits regressions to the medians of a distribution, with errors assumed to follow a Laplace distribution to deal better with outliers. The BGMR model was evaluated under a Bayesian framework with Markov Chain Monte Carlo sampling using a location–scale mixture representation of the Laplace distribution. The BGMR was implemented with two simulated and two real genomic data sets, and we compared its prediction performance with that of a conventional genomic best linear unbiased prediction (GBLUP) model and the Laplace maximum a posteriori (LMAP) method. The prediction accuracies of BGMR were higher than those of the GBLUP and LMAP methods when there were outliers. The BGMR model could be useful to breeders who need to predict and select genotypes based on data with unknown outliers.

References

Crossa J, de los Campos G, Pérez P, Gianola D, Burgueño J, Araus JL, Makumbi D, Singh RP, Dreisigacker S, Yan J, Arief V, Banziger M, Braun H-J (2010) Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics 186(2):713–724. https://doi.org/10.1534/genetics.110.118521
de los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL (2013) Whole genome regression and prediction methods applied to plant and animal breeding. Genetics 193:327–345
Edgeworth FY (1887) On observations relating to several quantities. Hermathena 6:279–285
Feng C, Wang H, Lu N, Chen T, He H, Lu Y, Tu XM (2014) Log-transformation and its implications for data analysis. Shanghai Arch Psychiatry 26(2):105–109. https://doi.org/10.3969/j.issn.1002-0829.2014.02.009
Feng C, Wang H, Lu N, Tu XM (2012) Log-transformation: applications and interpretation in biomedical research. Stat Med 32:230–239. https://doi.org/10.1002/sim.5486
Gianola D, de los Campos G, Hill WG, Manfredi E, Fernando R (2009) Additive genetic variability and the Bayesian alphabet. Genetics 183(1):347–363
Gianola D, Cecchinato A, Naya H, Schön C-C (2018) Prediction of complex traits: robust alternatives to best linear unbiased prediction. Front Genet 9:195. https://doi.org/10.3389/fgene.2018.00195
Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA (2005) Robust statistics: the approach based on influence functions. Wiley, London
Huber P (1973) Robust regression: asymptotics, conjectures, and Monte Carlo. Ann Stat 1(5):799–821
Koenker R, Bassett G (1978) Regression quantiles. Econometrica 46(1):33–50
Kozumi H, Kobayashi G (2011) Gibbs sampling methods for Bayesian quantile regression. J Stat Comput Simul 81(11):1565–1578
Lange KL, Little RJA, Taylor JMG (1989) Robust statistical modeling using the T-distribution. J Am Stat Assoc 84:881–896
Lehermeier C, Wimmer V, Albrecht T, Auinger HJ, Gianola D, Schmid VJ, Schön CC (2013) Sensitivity to prior specification in Bayesian genome-based prediction models. Stat Appl Genet Mol Biol 12(3):375–391. https://doi.org/10.1515/sagmb-2012-0042
Li Z, Möttönen J, Sillanpää MJ (2015) A robust multiple-locus method for quantitative trait locus analysis of non-normally distributed multiple traits. Heredity 115(6):556–564
Lourenço VM, Pires AM (2014) M-regression, false discovery rates and outlier detection with application to genetic association studies. J Comput Stat Data Anal 78:33–42
Lourenço VM, Pires AM, Kirst M (2011) Robust linear regression methods in association studies. Bioinformatics 27(6):815–821
Lourenço VM, Rodrigues PC, Pires AM, Piepho H-P (2017) A robust DF-REML framework for variance components estimation in genetic studies. Bioinformatics 33(22):3584–3594
Montesinos-López OA, Montesinos-López A, Crossa J, Toledo F, Pérez-Hernández O, Eskridge KM, Rutkoski J (2016) A genomic Bayesian multi-trait and multi-environment model. G3: Genes, Genomes, Genetics 6(9):2725–2744
Nascimento M, de Resende MD, Cruz CD, Nascimento AC, Viana JM, Azevedo CF, Barroso LM (2017) Regularized quantile regression applied to genome-enabled prediction of quantitative traits. Genet Mol Res. https://doi.org/10.4238/gmr16019538
Ould-Estaghvirou SB, Ogutu JO, Piepho HP (2014) Influence of outliers on accuracy estimation in genomic prediction in plant breeding. G3: Genes, Genomes, Genetics 4(12):2317–2328
Park T, Casella G (2008) The Bayesian lasso. J Am Stat Assoc 103(482):681–686
Pérez P, de los Campos G, Crossa J, Gianola D (2010) Genomic-enabled prediction based on molecular markers and pedigree using the BLR package in R. Plant Genome 3:106–116
Pérez-Rodríguez P, de los Campos G (2014) Genome-wide regression and prediction with the BGLR statistical package. Genetics 198:483–495
Pourhoseingholi A, Pourhoseingholi MA, Vahedi M, Moghimi-Dehkordi B, Maserat AS, Zali MR (2009) Relation between demographic factors and hospitalization in patients with gastrointestinal disorders, using quantile regression analysis. East Afr J Public Health 6(1):45–47
Rodrigues PC, Monteiro A, Lourenço VM (2016) A robust AMMI model for the analysis of genotype-by-environment data. Bioinformatics 32(1):58–66
Rousseeuw PJ (1984) Least median of squares regression. J Am Stat Assoc 79(388):871–880
Seber GAF, Lee AJ (2003) Linear regression analysis, 2nd edn. Wiley, Hoboken
Strandén I, Gianola D (1998) Attenuating effects of preferential treatment with Student-t mixed linear models: a simulation study. Genet Sel Evol 30:565–583
Strandén I, Gianola D (1999) Mixed effects linear models with t-distributions for quantitative genetic analysis: a Bayesian approach. Genet Sel Evol 31:25–42. https://doi.org/10.1186/1297-9686-31-1-25
VanRaden PM (2007) Genomic measures of relationship and inbreeding. Interbull Bull 37:33–36
Yohai VJ (1987) High breakdown-point and high efficiency robust estimates for regression. Ann Stat 15(2):642–656
Yu K, Moyeed A (2001) Bayesian quantile regression. Stat Probab Lett 54:437–447

Acknowledgments

We thank all scientists, field workers, and lab assistants from National Programs and CIMMYT who collected the data used in this study. We acknowledge the financial support provided by the Foundation for Research Levy on Agricultural Products (FFL) and the Agricultural Agreement Research Fund (JA) in Norway through NFR Grant 267806. We are also thankful for the financial support provided by CIMMYT CRP (maize and wheat), the Bill & Melinda Gates Foundation, as well as the USAID projects (Cornell University and Kansas State University) that financed the collection of the CIMMYT maize and wheat data analyzed in this study.

Author information

Authors and Affiliations

Departamento de Matemáticas, Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara, 44430, Guadalajara, JAL, Mexico
Abelardo Montesinos-López
Facultad de Telemática, Universidad de Colima, Colima, Mexico
Osval A. Montesinos-López
Departamento de Estadística, Centro de Investigación en Matemáticas (CIMAT), 36240, Guanajuato, Mexico
Enrique R. Villa-Diharce
Departments of Animal Sciences, Dairy Science, and Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, 53706, USA
Daniel Gianola
Biometrics and Statistics Unit and Global Wheat Program, International Maize and Wheat Improvement Center (CIMMYT), Apdo. Postal 6-641, 06600, Mexico, DF, Mexico
José Crossa

Corresponding authors

Correspondence to Osval A. Montesinos-López or Daniel Gianola.

Ethics declarations

Conflict of interest

The authors declare they do not have any conflict of interest.

Additional information

Communicated by Mikko J. Sillanpaa.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Deriving full conditional distributions for the Bayesian Laplace regression model

From Eq. (2), given the random effects $\varvec{b}$ and $\varvec{u} = (u_{1} , \ldots ,u_{n} )^{\text{T}}$, the joint conditional density of the vector of responses is given by

$$\begin{aligned} f(\varvec{y}|\varvec{b}, \varvec{u},\mu , \sigma_{1}^{2} ,\sigma^{2} ) & \propto \prod\limits_{j = 1}^{n} {\frac{1}{{\sqrt {\sigma^{2} u_{j} } }}\exp \left[ { - \frac{{\left( {y_{j} - \mu - b_{j} } \right)^{2} }}{{2\left( 8 \right)\sigma^{2} u_{j} }}} \right]} \\ & \propto \exp \left[ { - \frac{1}{{2\sigma^{2} }}\left( {\varvec{y} - 1\mu - \varvec{b}} \right)^{\text{T}} \varvec{D}_{u}^{ - 1} \left( {\varvec{y} - 1\mu - \varvec{b}} \right)} \right], \\ \end{aligned}$$

where $\varvec{D}_{u} = {\text{diag}}(8u_{1} , \ldots ,8u_{n} )$.

Fully conditional for $\mu$

$$\begin{aligned} f\left( {\mu | {\text{ELSE}}} \right) & \propto f\left( {\varvec{y}|\varvec{b}, \varvec{u},\mu , \sigma_{1}^{2} ,\sigma^{2} } \right)f\left( {\mu |\sigma_{0}^{2} } \right) \\ & \propto \exp \left[ { - \frac{1}{{2\sigma^{2} }}(\varvec{y} - 1\mu - \varvec{b})^{\text{T}} \varvec{D}_{u}^{ - 1} (\varvec{y} - 1\mu - \varvec{b}) - \frac{1}{{2\sigma_{0}^{2} }}(\mu - \mu_{0} )^{2} } \right] \\ & \propto \exp \left[ { - \frac{1}{{2\tilde{\sigma }_{0}^{2} }}(\mu - \tilde{\mu }_{0} )^{2} } \right] \\ \end{aligned}$$

(4)

where $\tilde{\sigma }_{0}^{2} = \frac{1}{{\sigma_{0}^{ - 2} + 8^{ - 1} \sum\nolimits_{j = 1}^{n} {u_{j}^{ - 1} \sigma^{ - 2} } }}$ and $\tilde{\mu }_{0} = \tilde{\sigma }_{0}^{2} [\mu_{0} \sigma_{0}^{ - 2} + \sigma^{ - 2} 1^{\text{T}} \varvec{D}_{u}^{ - 1} (\varvec{y} - \varvec{b})]$. Then $\mu |{\text{ELSE}}\,\sim\,N\left( {\tilde{\mu }_{0} ,\tilde{\sigma }_{0}^{2} } \right)$.
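As a minimal sketch of Eq. (4), the function below computes the mean and variance of the normal full conditional of $\mu$. It assumes $\varvec{D}_{u} = {\text{diag}}(8u_{1} , \ldots ,8u_{n} )$, as implied by the $8^{ - 1} \sum u_{j}^{ - 1} \sigma^{ - 2}$ term in $\tilde{\sigma }_{0}^{2}$; the function and argument names are illustrative and do not come from the article.

```python
def mu_full_conditional(y, b, u, sigma2, mu0, sigma2_0):
    """Return (mean, variance) of the normal full conditional of mu, Eq. (4).

    Assumption: D_u = diag(8 * u_j), so observation j has precision
    1 / (8 * u_j * sigma2). All names here are illustrative.
    """
    weights = [1.0 / (8.0 * uj * sigma2) for uj in u]
    # Posterior precision = prior precision + sum of observation precisions.
    precision = 1.0 / sigma2_0 + sum(weights)
    variance = 1.0 / precision
    # Precision-weighted combination of the prior mean and the residuals y_j - b_j.
    weighted_resid = sum(w * (yj - bj) for w, yj, bj in zip(weights, y, b))
    mean = variance * (mu0 / sigma2_0 + weighted_resid)
    return mean, variance
```

With a very diffuse prior (large `sigma2_0`) and equal $u_{j}$, the conditional mean reduces to the plain average of $y_{j} - b_{j}$, which is a quick sanity check on the formula.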

Fully conditional for $\varvec{b}$

Similarly, we have that

$$\begin{aligned} f(\varvec{b} | {\text{ELSE}}) & \propto f(\varvec{y}|\varvec{b}, \varvec{u},\mu , \sigma_{1}^{2} ,\sigma^{2} )f(\varvec{b} |\sigma_{1}^{2} ) \\ & \propto \exp \left[ { - \frac{1}{{2\sigma^{2} }}(\varvec{y} - 1\mu - \varvec{b})^{\text{T}} \varvec{D}_{u}^{ - 1} (\varvec{y} - 1\mu - \varvec{b}) - \frac{1}{{2\sigma_{1}^{2} }}\varvec{b}^{\text{T}} \varvec{G}^{ - 1} \varvec{b}} \right] \\ & \propto \exp \left[ { - \frac{1}{2}(\varvec{b} - \tilde{\varvec{b}})^{\text{T}} \tilde{\varvec{\varSigma }}_{1}^{ - 1} \left( { \varvec{b} - \tilde{\varvec{b}}} \right)} \right] \\ \end{aligned}$$

(5)

where $\tilde{\varvec{\varSigma }}_{1} = (\varvec{G}^{ - 1} \sigma_{1}^{ - 2} + \sigma^{ - 2} \varvec{D}_{u}^{ - 1} )^{ - 1}$ and $\tilde{\varvec{b}} = \sigma^{ - 2} \tilde{\varvec{\varSigma }}_{1} \varvec{D}_{u}^{ - 1} (\varvec{y} - 1\mu )$. So $\varvec{b}|{\text{ELSE}}\,\sim\,N(\tilde{\varvec{b}},\tilde{\varvec{\varSigma }}_{1} )$.
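The step to the normal kernel in Eq. (5) is a completion of the square, which can be written out as follows. This worked step assumes the prior $\varvec{b}|\sigma_{1}^{2} \sim N(\varvec{0},\sigma_{1}^{2} \varvec{G})$, which is what the $\varvec{G}^{ - 1} \sigma_{1}^{ - 2}$ term in $\tilde{\varvec{\varSigma }}_{1}$ and the quadratic form in Eq. (6) imply. The shorthand $\varvec{r} = \varvec{y} - 1\mu$ and the constants $c_{1}$ and $c_{2}$ (terms not involving $\varvec{b}$) are introduced here only for illustration.

$$\begin{aligned} & - \frac{1}{{2\sigma^{2} }}(\varvec{r} - \varvec{b})^{\text{T}} \varvec{D}_{u}^{ - 1} (\varvec{r} - \varvec{b}) - \frac{1}{{2\sigma_{1}^{2} }}\varvec{b}^{\text{T}} \varvec{G}^{ - 1} \varvec{b} \\ & \quad = - \frac{1}{2}\left[ {\varvec{b}^{\text{T}} \left( {\sigma^{ - 2} \varvec{D}_{u}^{ - 1} + \sigma_{1}^{ - 2} \varvec{G}^{ - 1} } \right)\varvec{b} - 2\sigma^{ - 2} \varvec{b}^{\text{T}} \varvec{D}_{u}^{ - 1} \varvec{r}} \right] + c_{1} \\ & \quad = - \frac{1}{2}\left[ {\varvec{b}^{\text{T}} \tilde{\varvec{\varSigma }}_{1}^{ - 1} \varvec{b} - 2\varvec{b}^{\text{T}} \tilde{\varvec{\varSigma }}_{1}^{ - 1} \tilde{\varvec{b}}} \right] + c_{1} \\ & \quad = - \frac{1}{2}(\varvec{b} - \tilde{\varvec{b}})^{\text{T}} \tilde{\varvec{\varSigma }}_{1}^{ - 1} (\varvec{b} - \tilde{\varvec{b}}) + c_{2} \\ \end{aligned}$$

The third line uses $\tilde{\varvec{\varSigma }}_{1}^{ - 1} \tilde{\varvec{b}} = \sigma^{ - 2} \varvec{D}_{u}^{ - 1} \varvec{r}$, which follows directly from the definition of $\tilde{\varvec{b}}$.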

Fully conditional for $\sigma_{1}^{2}$

$$\begin{aligned} f\left( {\sigma_{1}^{2} | {\text{ELSE}}} \right) & \propto f(\varvec{b} |\sigma_{1}^{2} )f(\sigma_{1}^{2} ) \\ & \propto (\sigma_{1}^{2} )^{{ - \frac{{\nu_{1} + J}}{2} - 1}} \exp \left\{ { - \left( {\frac{{\varvec{b}^{\text{T}} \varvec{G}^{ - 1} \varvec{b} + S_{1} }}{{2\sigma_{1}^{2} }}} \right)} \right\} \\ & \propto \chi^{ - 2} \left( {\nu_{1} + J,\varvec{b}^{\text{T}} \varvec{G}^{ - 1} \varvec{b} + S_{1} } \right) \\ \end{aligned}$$

(6)

Fully conditional for $\sigma^{2}$

$$\begin{aligned} f(\sigma^{2} | {\text{ELSE}}) & \propto f\left( {\varvec{y}|\varvec{b}, \varvec{u},\mu , \sigma_{1}^{2} ,\sigma^{2} } \right)f(\sigma^{2} ) \\ & \propto (\sigma^{2} )^{{ - \frac{df + n}{2} - 1}} \exp \left[ { - \frac{{\left( {\varvec{y} - 1\mu - \varvec{b}} \right)^{\text{T}} \varvec{D}_{u}^{ - 1} \left( {\varvec{y} - 1\mu - \varvec{b}} \right) + S}}{{2\sigma^{2} }}} \right] \\ & \propto \chi^{ - 2} \left( {df + n,\left( {\varvec{y} - 1\mu - \varvec{b}} \right)^{\text{T}} \varvec{D}_{u}^{ - 1} \left( {\varvec{y} - 1\mu - \varvec{b}} \right) + S} \right) \\ \end{aligned}$$

(7)
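Both variance components in Eqs. (6) and (7) have scaled inverse Chi-square full conditionals, so one helper can serve both. The sketch below draws from $\chi^{ - 2} (v,S)$ as $S$ divided by a Chi-square variate with $v$ degrees of freedom. It takes $J$ in Eq. (6) to be the length of $\varvec{b}$, reads $\nu_{1}$ as the prior degrees of freedom ($df_{1}$ in Appendix 2), and assumes $\varvec{D}_{u} = {\text{diag}}(8u_{j} )$; these readings and all function names are assumptions rather than the authors' implementation.

```python
import random

def sample_scaled_inv_chi2(df, scale, rng):
    """Draw x with density proportional to x^(-1 - df/2) * exp(-scale / (2 x))."""
    # A chi-square variate with df degrees of freedom is Gamma(shape=df/2, scale=2).
    return scale / rng.gammavariate(df / 2.0, 2.0)

def sample_sigma2_1(b, G_inv, df1, S1, rng):
    """Eq. (6): chi^{-2}(df1 + len(b), b' G^{-1} b + S1).

    Assumption: J in Eq. (6) equals len(b); G_inv is a list of rows.
    """
    n = len(b)
    quad = sum(b[i] * G_inv[i][k] * b[k] for i in range(n) for k in range(n))
    return sample_scaled_inv_chi2(df1 + n, quad + S1, rng)

def sample_sigma2(y, b, u, mu, df, S, rng):
    """Eq. (7): chi^{-2}(df + n, r' D_u^{-1} r + S), assuming D_u = diag(8 u_j)."""
    quad = sum((yj - mu - bj) ** 2 / (8.0 * uj) for yj, bj, uj in zip(y, b, u))
    return sample_scaled_inv_chi2(df + len(y), quad + S, rng)
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps a chain reproducible.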

Fully conditional for $\varvec{u}$

$$\begin{aligned} P(\varvec{u}\left| {\text{ELSE}} \right.) & \propto f\left( {\varvec{y}|\varvec{b}, \varvec{u},\mu , \sigma_{1}^{2} ,\sigma^{2} } \right)\prod\limits_{j = 1}^{n} {f(u_{j} )} \\ & \propto \mathop \prod \limits_{j = 1}^{n} \frac{1}{{\sqrt {u_{j} } }}{ \exp }\left( { - \frac{{\left( {y_{j} - \mu - b_{j} } \right)^{2} }}{{2\left( 8 \right)\sigma^{2} u_{j} }}} \right){\text{exp(}} - u_{j} )\\ & \propto \mathop \prod \limits_{j = 1}^{n} u_{j}^{{\frac{1}{2} - 1}} { \exp }\left( { - \frac{1}{2}\left[ {\frac{{\left( {y_{j} - \mu - b_{j} } \right)^{2} }}{{8\sigma^{2} }}u_{j}^{ - 1} + 2u_{j} } \right]} \right) \\ & \propto \mathop \prod \limits_{j = 1}^{n} {\text{GIG}}\left( {\frac{1}{2},\frac{{\left( {y_{j} - \mu - b_{j} } \right)^{2} }}{{8\sigma^{2} }},2} \right) \\ \end{aligned}$$

(8)

where ${\text{GIG(}}v,a,b )$ denotes the generalized inverse Gaussian distribution with parameters $v$, $a$ and $b$ (Kozumi and Kobayashi 2011).
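One possible way to draw each $u_{j}$ in Eq. (8) with only the Python standard library is sketched below. It reads ${\text{GIG}}(v,a,b)$ as the density proportional to $x^{v - 1} \exp \{ - \frac{1}{2}(a/x + bx)\}$, which is the form on the third line of Eq. (8). Under that reading, if $u \sim {\text{GIG}}(1/2,a,b)$ then $1/u$ follows an inverse Gaussian distribution with mean $\sqrt {b/a}$ and shape $b$. This reduction and the Michael-Schucany-Haas sampler used for it are not taken from the article, and the function names are illustrative.

```python
import math
import random

def sample_inverse_gaussian(mean, shape, rng):
    """Michael-Schucany-Haas draw from an inverse Gaussian(mean, shape)."""
    z2 = rng.gauss(0.0, 1.0) ** 2
    root = math.sqrt(4.0 * mean * shape * z2 + (mean * z2) ** 2)
    x = mean + (mean * mean * z2) / (2.0 * shape) - (mean / (2.0 * shape)) * root
    # Choose between the two roots of the underlying quadratic.
    if rng.random() <= mean / (mean + x):
        return x
    return mean * mean / x

def sample_u(residual, sigma2, rng):
    """Draw u_j ~ GIG(1/2, a, 2) with a = residual^2 / (8 sigma2), as in Eq. (8).

    Assumption: GIG(v, a, b) has density proportional to
    x^(v - 1) * exp(-(a / x + b * x) / 2).
    """
    a = residual ** 2 / (8.0 * sigma2)
    if a == 0.0:
        # Limiting case: density proportional to x^(-1/2) * exp(-x), a Gamma(1/2, 1).
        return rng.gammavariate(0.5, 1.0)
    w = sample_inverse_gaussian(math.sqrt(2.0 / a), 2.0, rng)
    return 1.0 / w
```

Under this parameterization the conditional mean of $u_{j}$ is $|y_{j} - \mu - b_{j} |/(4\sigma ) + 1/2$, so observations with large residuals receive larger $u_{j}$ and are therefore down-weighted through $\varvec{D}_{u}^{ - 1}$ in the other full conditionals. Very small nonzero residuals can make the inverse Gaussian mean numerically large, so a production sampler would need more care than this sketch takes.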

Fully conditional for missing values

$$\begin{aligned} f\left( {\varvec{y}_{\text{miss}} | {\text{ELSE}}} \right) & \propto f\left( {\varvec{y}_{\text{miss}} |\varvec{b}, \varvec{u},\mu , \sigma_{1}^{2} ,\sigma^{2} } \right) \\ & \propto N\left( {\varvec{\eta}^{*} ,\sigma^{2} \varvec{D}_{u}^{*} } \right) \\ \end{aligned}$$

(9)

where $\varvec{\eta}^{*} = 1^{\varvec{*}} \mu + \varvec{b}^{\varvec{*}}$ is the corresponding linear predictor of the missing values in the model in Eq. (2) and $\varvec{D}_{u}^{*}$ is a diagonal matrix that retains the elements in $\varvec{D}_{u}$ corresponding to the missing values.
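Because $\varvec{D}_{u}^{*}$ is diagonal, Eq. (9) amounts to independent normal draws, one per missing record. The sketch below takes the linear predictor of a missing record to be $\mu + b_{j}$ and its variance to be $8\sigma^{2} u_{j}$ (that is, it assumes the diagonal elements of $\varvec{D}_{u}$ are $8u_{j}$); the function name and arguments are illustrative.

```python
import math
import random

def sample_missing(mu, b_miss, u_miss, sigma2, rng):
    """Draw y_miss from N(mu + b_j, 8 * sigma2 * u_j), one record at a time.

    Assumption: D_u has diagonal elements 8 * u_j, so the conditional
    variance of record j is 8 * sigma2 * u_j.
    """
    return [
        rng.gauss(mu + bj, math.sqrt(8.0 * sigma2 * uj))
        for bj, uj in zip(b_miss, u_miss)
    ]
```

Averaging such draws over the retained iterations of the sampler gives a Monte Carlo estimate of the predicted value for each missing record.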

Appendix 2: Setting hyperparameters for the prior distributions of the BGMR model

The prior mean ($\mu_{0}$) for the general mean ($\mu$) was set to the sample mean of the response in the training data, while the remaining hyperparameters of the BGMR model were set similarly to those used in the BGLR software (Pérez-Rodríguez and de los Campos 2014). These rules provide proper but weakly informative prior distributions. We partitioned the total variance of the phenotypes into two components: (1) the error and (2) the linear predictor. First, the variance of the phenotype $y_{j}$ under the model is given by

$${\text{Var}}(y_{j} ) = {\text{Var}}(b_{j} ) + 8\sigma^{2}$$

Therefore, the average of the variance of the individuals, called total variance, is equal to

$$\frac{1}{n}\mathop \sum \limits_{j = 1}^{n} {\text{Var}}(y_{j} ) = \frac{1}{n}\mathop \sum \limits_{j = 1}^{n} {\text{Var(}}b_{j} )+ 8\sigma^{2} = \frac{1}{n}{\text{tr}}(\varvec{G})\sigma_{1}^{2} + 8\sigma^{2} = V_{1} + V_{\epsilon } .$$

Then, we set $R_{1}^{2}$ as the proportion of the total variance (${\mathbf{V}}_{y}$) that is explained by lines a priori, so that $V_{1} = R_{1}^{2} {\mathbf{V}}_{y}$, and replace $\sigma_{1}^{2}$ in $V_{1}$ by its prior mode, $\frac{{S_{1} }}{{df_{1} + 2}}$. Once we have set a value for $df_{1}$, the scale parameter is given by

$$S_{1} = \frac{{R_{1}^{2} {\mathbf{V}}_{y} }}{{\frac{1}{n}{\text{tr(}}\varvec{G} )}}\left( {df_{1} + 2} \right).$$

By default, we set the shape parameter $df_{1} = 5$ and $R_{1}^{2} = 0.5$.

Similarly, once there is a value for the shape parameter of the prior distribution of $\sigma^{2}$, $df$, the value of the scale parameter is given by

$$S = \frac{{\left( {1 - R_{1}^{2} } \right){\mathbf{V}}_{y} }}{8}\left( {df + 2} \right)$$

where $1 - R_{1}^{2}$ is the proportion of the total variance (${\mathbf{V}}_{y}$) that is explained by the error a priori. By default, we set $df = 5.$
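The two scale rules above can be collected in a small helper. This is a sketch under the stated defaults ($df_{1} = df = 5$, $R_{1}^{2} = 0.5$); the function name and the argument `mean_diag_G`, which stands for $\frac{1}{n}{\text{tr}}(\varvec{G})$, are illustrative and not part of the article or of BGLR.

```python
def prior_scales(v_y, mean_diag_G, r2=0.5, df1=5, df=5):
    """Return (S1, S), the scale hyperparameters for sigma_1^2 and sigma^2.

    v_y         -- total phenotypic variance V_y
    mean_diag_G -- tr(G) / n, the average diagonal element of G
    r2          -- prior proportion of V_y explained by lines (R_1^2)
    """
    # Matches V_1 = mean_diag_G * Mode(sigma_1^2) = r2 * v_y.
    S1 = r2 * v_y / mean_diag_G * (df1 + 2)
    # Matches V_eps = 8 * Mode(sigma^2) = (1 - r2) * v_y.
    S = (1.0 - r2) * v_y / 8.0 * (df + 2)
    return S1, S
```

For example, with ${\mathbf{V}}_{y} = 4$ and $\frac{1}{n}{\text{tr}}(\varvec{G}) = 1$, the defaults give $S_{1} = 14$ and $S = 1.75$. The prior modes are then $\sigma_{1}^{2} = 2$ and $\sigma^{2} = 0.25$, which recover $V_{1} + V_{\epsilon } = 2 + 8(0.25) = 4$.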

The pdf of the scaled inverse Chi-square distribution with $df$ degrees of freedom and scale parameter $S$, $\chi^{ - 2} \left( {df,S} \right)$, is given by

$$f(x;\,df,S) = \frac{{\left( {\frac{S}{2}} \right)^{df/2} }}{{\Gamma (df/2)}}x^{ - 1 - df/2} \exp \left( { - \frac{S}{2x}} \right), x > 0$$

and the mean, mode, and variance of this distribution are given by $\frac{S}{df - 2}$, $\frac{S}{df + 2}$, and $\frac{{2S^{2} }}{{\left( {df - 2} \right)^{2} \left( {df - 4} \right)}}$, respectively. Specifically, the prior mean, mode, and variance for the variance components are:

$$\begin{aligned} & E\left( {\sigma_{1}^{2} } \right) = \frac{{S_{1} }}{{df_{1} - 2}}, \quad {\text{Mode}}\left( {\sigma_{1}^{2} } \right) = \frac{{S_{1} }}{{df_{1} + 2}}\;{\text{and}}\;{\text{Var}}\left( {\sigma_{1}^{2} } \right) = \frac{{2S_{1}^{2} }}{{\left( {df_{1} - 2} \right)^{2} \left( {df_{1} - 4} \right)}} \\ & E\left( {\sigma^{2} } \right) = \frac{S}{df - 2},\quad {\text{Mode}}\left( {\sigma^{2} } \right) = \frac{S}{df + 2} \;{\text{and}}\;{\text{Var}}\left( {\sigma^{2} } \right) = \frac{{2S^{2} }}{{\left( {df - 2} \right)^{2} \left( {df - 4} \right)}}. \\ \end{aligned}$$
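The pdf and moment formulas above can be checked numerically with a short sketch. The function names are illustrative, and the check of the mode simply compares the pdf at $S/(df + 2)$ with nearby points.

```python
import math

def scaled_inv_chi2_pdf(x, df, S):
    """pdf of the scaled inverse Chi-square distribution, as written above."""
    log_pdf = (
        (df / 2.0) * math.log(S / 2.0)
        - math.lgamma(df / 2.0)
        - (1.0 + df / 2.0) * math.log(x)
        - S / (2.0 * x)
    )
    return math.exp(log_pdf)

def scaled_inv_chi2_moments(df, S):
    """Return (mean, mode, variance); the variance is finite only for df > 4."""
    if df <= 4:
        raise ValueError("mean and variance formulas used here require df > 4")
    mean = S / (df - 2.0)
    mode = S / (df + 2.0)
    variance = 2.0 * S ** 2 / ((df - 2.0) ** 2 * (df - 4.0))
    return mean, mode, variance
```

With the default $df = 5$, the prior mean and variance are both finite, but the variance denominator $df - 4 = 1$ is small, which is consistent with the priors being weakly informative.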

Appendix 3

About this article

Cite this article

Montesinos-López, A., Montesinos-López, O.A., Villa-Diharce, E.R. et al. A robust Bayesian genome-based median regression model. Theor Appl Genet 132, 1587–1606 (2019). https://doi.org/10.1007/s00122-019-03303-6

Received: 19 December 2018
Accepted: 02 February 2019
Published: 12 February 2019
Issue Date: 01 May 2019
DOI: https://doi.org/10.1007/s00122-019-03303-6
