Published by De Gruyter, February 15, 2020

A Bayesian Framework for Robust Quantitative Trait Locus Mapping and Outlier Detection

Crispin M. Mutshinda, Andrew J. Irwin and Mikko J. Sillanpää

Abstract

We introduce a Bayesian framework for simultaneous feature selection and outlier detection in sparse high-dimensional regression models, with a focus on quantitative trait locus (QTL) mapping in experimental crosses. More specifically, we incorporate the robust mean-shift outlier handling mechanism into the multiple QTL mapping regression model and apply LASSO regularization concurrently to the genetic effects and the mean-shift terms through the flexible extended Bayesian LASSO (EBL) prior structure. This combines QTL mapping and outlier detection into a single sparse model representation problem. The EBL priors on the mean-shift terms prevent outlying phenotypic values from distorting the genotype-phenotype association and allow outliers to be detected as cases whose mean-shift values remain large after the LASSO shrinkage. Simulation results demonstrate the effectiveness of the methodology at mapping QTLs in the presence of outlying phenotypic values while simultaneously identifying the potential outliers, with performance comparable to that of the standard EBL on outlier-free data.
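
To make the mean-shift idea concrete, the sketch below fits a simplified, non-Bayesian analogue: an L1-penalized least-squares problem in which each observation gets its own shift term, solved by proximal gradient descent (ISTA). It is not the EBL model or the Stan code from the Supplementary Material. The function names, the penalty values `lam_beta` and `lam_gamma`, the centered phenotype (no intercept), the ±1 marker coding, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding: the proximal operator of t * |v|."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def mean_shift_lasso(X, y, lam_beta, lam_gamma, n_iter=5000):
    """Minimise 0.5*||y - X@beta - gamma||^2
    + lam_beta*||beta||_1 + lam_gamma*||gamma||_1 by ISTA.

    beta holds the marker effects; gamma holds one mean-shift term per
    observation (hypothetical notation, not taken from the paper).
    """
    n, p = X.shape
    Z = np.hstack([X, np.eye(n)])            # augmented design [X, I]
    thresh = np.concatenate([np.full(p, lam_beta), np.full(n, lam_gamma)])
    step = 1.0 / np.linalg.norm(Z, 2) ** 2   # 1 / Lipschitz constant of the gradient
    theta = np.zeros(p + n)
    for _ in range(n_iter):
        grad = Z.T @ (Z @ theta - y)
        theta = soft_threshold(theta - step * grad, step * thresh)
    return theta[:p], theta[p:]

# Simulated data (illustrative only): 100 individuals, 30 markers,
# two true effects, and three phenotypes shifted by +10.
rng = np.random.default_rng(0)
n, p = 100, 30
X = rng.choice([-1.0, 1.0], size=(n, p))
beta_true = np.zeros(p)
beta_true[[3, 17]] = [2.0, -1.5]
y = X @ beta_true + rng.normal(0.0, 0.5, size=n)
outliers = np.array([5, 40, 77])
y[outliers] += 10.0

beta_hat, gamma_hat = mean_shift_lasso(X, y, lam_beta=10.0, lam_gamma=2.0)

# Observations with the largest surviving shifts are the outlier candidates.
flagged = np.sort(np.argsort(-np.abs(gamma_hat))[:3])
print("flagged observations:", flagged)
print("largest marker effects at:", np.sort(np.argsort(-np.abs(beta_hat))[:2]))
```

Because both sets of coefficients are shrunk, a shift term survives only when an observation's residual is too large to be explained by the marker effects and noise, which mirrors the detection logic described above. A fully Bayesian treatment would replace the fixed penalties with priors and report posterior summaries instead of point estimates.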

Code and data availability

The R and Stan code used in the analyses as well as the simulated marker data are available in the online Supplementary Material.

Acknowledgements

CMM and AJI were supported by the Simons Collaboration on Computational Biogeochemical Modeling of Marine Ecosystems/CBIOMES (Grant ID: 549935, AJI).

Supplementary Material

The online version of this article offers supplementary material (https://doi.org/10.1515/ijb-2019-0038).


Received: 2019-04-09
Revised: 2020-01-20
Accepted: 2020-02-04
Published Online: 2020-02-15

© 2020 Walter de Gruyter GmbH, Berlin/Boston
