High-dimensional data such as gene expression and RNA-sequence data often contain missing values. Subsequent analyses and results based on such incomplete data can suffer strongly from these missing values. Several approaches to the imputation of missing values in gene expression data have been developed, but the task is difficult because of the high dimensionality (number of genes) of the data. Here an imputation procedure is proposed that uses weighted nearest neighbors. Instead of defining nearest neighbors by a distance that includes all genes, the distance is computed only over genes that are apt to contribute to the accuracy of the imputed values. The method aims to avoid the curse of dimensionality, which typically occurs when local methods such as nearest neighbors are applied in high-dimensional settings. The proposed weighted nearest neighbors algorithm is compared to existing missing value imputation techniques such as mean imputation, KNNimpute and the recently proposed imputation by random forests. We use RNA-sequence and microarray data from studies on human cancer to compare the performance of the methods. The results from simulations as well as real studies show that the weighted distance procedure can successfully handle missing values in high-dimensional data structures where the number of predictors is larger than the number of samples. The method typically outperforms the considered competitors.
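The idea of restricting the distance to informative genes can be illustrated with a minimal Python sketch. It is not the authors' implementation. The selection rule (absolute correlation with the gene being imputed), the Gaussian kernel for neighbor weights, the provisional mean fill, and the names `weighted_nn_impute`, `n_genes`, `k` and `bandwidth` are all assumptions made for illustration. The sketch also assumes that every gene has at least one observed value.

```python
import numpy as np


def weighted_nn_impute(X, n_genes=10, k=5, bandwidth=1.0):
    """Illustrative weighted nearest-neighbor imputation (samples x genes).

    Assumptions (not taken from the paper): genes are selected by absolute
    correlation with the target gene, and neighbors are weighted by a
    Gaussian kernel of their distance.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)

    # Provisional mean fill, used only to compute correlations and distances.
    col_means = np.nanmean(X, axis=0)
    provisional = np.where(missing, col_means, X)

    # Absolute gene-gene correlations decide which genes enter each distance.
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.abs(np.corrcoef(provisional, rowvar=False))
    corr = np.nan_to_num(corr)
    np.fill_diagonal(corr, 0.0)

    for j in np.where(missing.any(axis=0))[0]:
        # Keep only the genes most correlated with gene j.
        selected = np.argsort(corr[j])[::-1][:n_genes]
        gene_w = corr[j, selected]
        # Donors are samples in which gene j was observed.
        donors = np.where(~missing[:, j])[0]

        for i in np.where(missing[:, j])[0]:
            diff = provisional[donors][:, selected] - provisional[i, selected]
            # Distance over the selected genes, weighted by their correlation.
            dist = np.sqrt((gene_w * diff ** 2).sum(axis=1))
            nearest = np.argsort(dist)[:k]
            w = np.exp(-0.5 * (dist[nearest] / bandwidth) ** 2)
            if w.sum() == 0:
                # Fall back to equal weights if the kernel underflows.
                w = np.ones_like(w)
            filled[i, j] = np.average(X[donors[nearest], j], weights=w)

    return filled
```

Calling `weighted_nn_impute(X_with_nans)` returns a copy of the matrix in which observed entries are unchanged and each missing entry is a weighted average of observed values of the same gene. In practice the number of selected genes, the number of neighbors and the kernel bandwidth would need tuning, for example by cross-validation on artificially deleted entries.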