Published by De Gruyter March 10, 2016

Resistant multiple sparse canonical correlation

Jacob Coleman, Joseph Replogle, Gabriel Chandler and Johanna Hardin

Abstract

Canonical correlation analysis (CCA) is a multivariate technique that takes two datasets and forms the most highly correlated possible pairs of linear combinations between them. Each subsequent pair of linear combinations is orthogonal to the preceding pair, meaning that new information is gleaned from each pair. By looking at the magnitude of coefficient values, we can find out which variables can be grouped together, thus better understanding multiple interactions that are otherwise difficult to compute or grasp intuitively. CCA appears to have quite powerful applications to high-throughput data, as we can use it to discover, for example, relationships between gene expression and gene copy number variation. One of the biggest problems of CCA is that the number of variables (often upwards of 10,000) makes biological interpretation of linear combinations nearly impossible. To limit variable output, we have employed a method known as sparse canonical correlation analysis (SCCA), while adding estimation which is resistant to extreme observations or other types of deviant data. In this paper, we have demonstrated the success of resistant estimation in variable selection using SCCA. Additionally, we have used SCCA to find multiple canonical pairs for extended knowledge about the datasets at hand. Again, using resistant estimators provided more accurate estimates than standard estimators in the multiple canonical correlation setting. R code is available and documented at https://github.com/hardin47/rmscca.


Corresponding author: Johanna Hardin, Department of Mathematics, Pomona College, 610 N. College Ave., Claremont, CA 91711, USA, e-mail:

Acknowledgments

JC was supported by the Pomona College Summer Undergraduate Research Program and the Department of Mathematics at Pomona College. JR was supported by a grant to Pomona College from the Howard Hughes Medical Institute through the Precollege and Undergraduate Science Education Program. JH was supported by the Institute for Pure and Applied Mathematics, National Science Foundation Grant DMS-0931852.

Appendix

Consider the first dimension of Y, Y1, which is centered at the sum of the first p1 dimensions of the random variable X. Because most of the correlation among the dimensions of Y comes from their shared dependence on X, we let ΣYY be a diagonal matrix. In contrast, ΣXX has 1 on the diagonal and ρ (=0.2) at the appropriate off-diagonal elements.

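Below is the derivation for the first diagonal entry of ΣYY, σYY,11. The goal is to find σYY,11 such that Cor(Yl1, Yl2)=ρ, where Yl1 and Yl2 are two Y variables in the same group, both centered at the same sum of X variables.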

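\[
\begin{aligned}
Y_l &\sim \mathrm{MVN}_q(\mu_l, \Sigma_{YY}), \quad \text{where } \mu_l = X_l \times B, \quad l = 1, \ldots, n \\
Y_l &= X_l \times B + \varepsilon_l, \quad \text{where } \varepsilon_l \sim \mathrm{MVN}_q(0, \Sigma_{YY}), \quad l = 1, \ldots, n \\
Y_{l1} &= \sum_{i=1}^{p_1} X_{li} + \varepsilon_{l1}, \quad \text{where } \varepsilon_{l1} \overset{iid}{\sim} N(0, \sigma_{YY,11}) \\
\mathrm{Var}(Y_{l1}) &= \mathrm{Var}\Big(\sum_{i=1}^{p_1} X_{li} + \varepsilon_{l1}\Big) = p_1 \sigma_{XX,11} + (p_1^2 - p_1)\,\sigma_{XX,12} + \mathrm{Var}(\varepsilon_{l1}) \\
\mathrm{Var}(Y_{l1}) &= p_1 + (p_1^2 - p_1)\rho + \sigma_{YY,11} \qquad \text{(WLOG)} \\
\mathrm{Cov}(Y_{l1}, Y_{l2}) &= \mathrm{Cov}\Big(\sum_{i=1}^{p_1} X_{li} + \varepsilon_{l1},\ \sum_{i=1}^{p_1} X_{li} + \varepsilon_{l2}\Big) = p_1 \sigma_{XX,11} + p_1(p_1 - 1)\,\sigma_{XX,12} + \mathrm{Cov}(\varepsilon_{l1}, \varepsilon_{l2}) \\
&= p_1 + p_1(p_1 - 1)\rho \qquad \text{(WLOG)} \\
\mathrm{Cor}(Y_{l1}, Y_{l2}) &= \frac{p_1 + (p_1^2 - p_1)\rho}{p_1 + (p_1^2 - p_1)\rho + \sigma_{YY,11}} = \rho \\
\sigma_{YY,11} &= \Big(\frac{1}{\rho} - 1\Big)\big(p_1 + (p_1^2 - p_1)\rho\big)
\end{aligned}
\]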

By increasing the variance for each of the simulated Y variables involved in the true linear relationships, we create correlations of ρ (=0.2 in our simulations) between the Y variables in a group. The cross-covariance matrix between X and Y (ΣXY) is not pre-specified, but rather it is given by the relationship between ΣXX, ΣYY, and B.
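As a quick numerical check of the closed form for σYY,11, the following minimal sketch evaluates the formula and confirms that the implied correlation between two Y variables in a group equals ρ. It is an illustration only and is not taken from the authors' R code. The function names `sigma_yy_11` and `implied_correlation` are hypothetical, and the group size p1 = 5 is an assumed value chosen for the example; only ρ = 0.2 comes from the text.

```python
def shared_variance(p1: int, rho: float) -> float:
    """Variance of the sum of the first p1 X variables, assuming unit
    variances and pairwise correlation rho (as in the appendix)."""
    return p1 + (p1 ** 2 - p1) * rho


def sigma_yy_11(p1: int, rho: float) -> float:
    """Noise variance from the appendix formula:
    sigma_YY,11 = (1/rho - 1) * (p1 + (p1^2 - p1) * rho)."""
    return (1.0 / rho - 1.0) * shared_variance(p1, rho)


def implied_correlation(p1: int, rho: float, sigma: float) -> float:
    """Cor(Y_l1, Y_l2) when both share the same sum of X variables and
    have independent noise terms with common variance sigma."""
    shared = shared_variance(p1, rho)
    return shared / (shared + sigma)


if __name__ == "__main__":
    p1 = 5      # assumed group size, for illustration only
    rho = 0.2   # value used in the simulations described in the text
    sigma = sigma_yy_11(p1, rho)
    print("sigma_YY,11 =", sigma)
    print("implied correlation =", implied_correlation(p1, rho, sigma))
```

The check relies on the same assumptions as the derivation: the noise terms are uncorrelated (ΣYY is diagonal) and both Y variables have the same noise variance.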


Published Online: 2016-3-10
Published in Print: 2016-4-1

©2016 by De Gruyter