Random forests on distance matrices for imaging genetics studies

Aaron Sim 1, Dimosthenis Tsagkrasoulis 2, and Giovanni Montana
  • 1 Centre for Integrative Systems Biology and Bioinformatics, Department of Life Sciences, Imperial College London, UK
  • 2 Statistics Section, Department of Mathematics, Imperial College London, UK
  • 3 Department of Biomedical Engineering, King’s College London, UK


We propose a non-parametric regression methodology, Random Forests on Distance Matrices (RFDM), for detecting genetic variants associated with quantitative phenotypes that are obtained using neuroimaging techniques and represent the structure or function of the human brain. RFDM, an extension of decision forests, takes as its response a distance matrix that encodes all pairwise phenotypic distances in the random sample. We discuss how to learn such distances directly from the data using manifold learning techniques, and how to define them when the phenotypes are non-vectorial objects such as brain connectivity networks. We also describe an extension of RFDM that detects epistatic effects while keeping the computational complexity low. Extensive simulation results and an application to an imaging genetics study of Alzheimer’s Disease are presented and discussed.
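To make the idea of a distance matrix as the response concrete, the sketch below scores a single tree-node split on SNP genotypes using only pairwise phenotypic distances. It is an illustration under stated assumptions, not the authors' implementation: the function names (`node_dispersion`, `split_gain`, `best_snp_split`), the genotype coding as minor-allele counts in {0, 1, 2}, and the choice of dispersion measure are all hypothetical. The dispersion used here is the sum of squared pairwise distances divided by twice the node size, which for Euclidean distances equals the usual sum of squared deviations from the node centroid.

```python
import numpy as np


def node_dispersion(D, idx):
    """Dispersion of a node computed from pairwise distances only.

    Returns (1 / (2n)) * sum_{i,j} D[i, j]**2 over the samples in idx.
    For Euclidean distances this equals the sum of squared deviations
    from the node centroid. (Assumed criterion, for illustration.)
    """
    idx = np.asarray(idx)
    n = idx.size
    if n == 0:
        return 0.0
    sub = D[np.ix_(idx, idx)]
    return float((sub ** 2).sum() / (2.0 * n))


def split_gain(D, idx, go_left):
    """Reduction in dispersion when the node idx is split by a boolean mask."""
    idx = np.asarray(idx)
    go_left = np.asarray(go_left, dtype=bool)
    left, right = idx[go_left], idx[~go_left]
    return (node_dispersion(D, idx)
            - node_dispersion(D, left)
            - node_dispersion(D, right))


def best_snp_split(D, G, idx):
    """Search all SNPs and genotype thresholds for the largest gain.

    G is an (n_samples, n_snps) array of minor-allele counts in {0, 1, 2}
    (hypothetical coding). Returns (snp_index, threshold, gain); the SNP
    index and threshold are None if no valid split improves the node.
    """
    idx = np.asarray(idx)
    best = (None, None, 0.0)
    for j in range(G.shape[1]):
        for t in (0, 1):  # send genotype <= t left, genotype > t right
            mask = G[idx, j] <= t
            if mask.all() or not mask.any():
                continue  # skip splits that leave one child empty
            gain = split_gain(D, idx, mask)
            if gain > best[2]:
                best = (j, t, gain)
    return best
```

Because the split score depends on the phenotypes only through `D`, the same search applies unchanged whether the distances come from vector-valued phenotypes, from a manifold learning step, or from a metric defined on non-vectorial objects such as connectivity networks.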

