
Missing value imputation for gene expression data by tailored nearest neighbors

Shahla Faisal 1,2 and Gerhard Tutz 1
  • 1 Department of Statistics, Ludwig-Maximilians-University Munich, Germany
  • 2 Department of Statistics, Government College University Faisalabad, Pakistan
Corresponding author: Shahla Faisal

Abstract

High-dimensional data such as gene expression and RNA-sequence data often contain missing values. Subsequent analyses and results based on such incomplete data can suffer strongly from these missing values. Several approaches to imputing missing values in gene expression data have been developed, but the task is difficult because of the high dimensionality (number of genes) of the data. Here an imputation procedure is proposed that uses weighted nearest neighbors. Instead of defining nearest neighbors by a distance that includes all genes, the distance is computed only over genes that are apt to contribute to the accuracy of the imputed values. The method aims to avoid the curse of dimensionality, which typically arises when local methods such as nearest neighbors are applied in high-dimensional settings. The proposed weighted nearest neighbors algorithm is compared with existing missing value imputation techniques such as mean imputation, KNNimpute and the recently proposed imputation by random forests. We use RNA-sequence and microarray data from studies on human cancer to compare the performance of the methods. The results from simulations as well as real studies show that the weighted distance procedure can successfully handle missing values in high-dimensional data structures where the number of predictors is larger than the number of samples. The method typically outperforms the considered competitors.
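The general idea of a weighted nearest neighbor imputation with a restricted distance can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how genes are selected or how neighbors are weighted, so the correlation-based selection, the Gaussian kernel, and the names `impute_weighted_nn`, `n_genes` and `bandwidth` are all assumptions made here for illustration. Rows are samples, columns are genes, and `None` marks a missing value.

```python
import math


def pearson(x, y):
    """Pearson correlation over positions where both values are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 3:
        return 0.0
    mx = sum(a for a, _ in pairs) / len(pairs)
    my = sum(b for _, b in pairs) / len(pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def impute_weighted_nn(data, n_genes=2, bandwidth=1.0):
    """Return a copy of `data` with missing entries (None) filled in.

    Assumed scheme (illustrative only): for each gene j with missing values,
    distances between samples use only the `n_genes` genes most correlated
    with j, and neighbors are weighted by a Gaussian kernel of that distance.
    """
    n, p = len(data), len(data[0])
    cols = [[row[j] for row in data] for j in range(p)]
    result = [row[:] for row in data]
    for j in range(p):
        if all(v is not None for v in cols[j]):
            continue  # nothing to impute for this gene
        # Rank the other genes by absolute correlation with gene j.
        ranked = sorted((g for g in range(p) if g != j),
                        key=lambda g: abs(pearson(cols[j], cols[g])),
                        reverse=True)
        selected = ranked[:n_genes]
        observed = [v for v in cols[j] if v is not None]
        for i in range(n):
            if data[i][j] is not None:
                continue
            num = den = 0.0
            for k in range(n):
                if k == i or data[k][j] is None:
                    continue
                # Use only selected genes observed in both samples.
                shared = [g for g in selected
                          if data[i][g] is not None and data[k][g] is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((data[i][g] - data[k][g]) ** 2
                                     for g in shared) / len(shared))
                weight = math.exp(-0.5 * (dist / bandwidth) ** 2)
                num += weight * data[k][j]
                den += weight
            if den > 0:
                result[i][j] = num / den
            elif observed:
                # Fallback when no usable neighbor exists: mean imputation.
                result[i][j] = sum(observed) / len(observed)
    return result


if __name__ == "__main__":
    # Gene 0 tracks gene 1 closely; gene 2 is unrelated noise.
    samples = [
        [1.0, 1.0, 5.0],
        [2.0, 2.0, 1.0],
        [3.0, 3.0, 4.0],
        [None, 2.0, 9.0],
    ]
    filled = impute_weighted_nn(samples, n_genes=1, bandwidth=0.1)
    print(round(filled[3][0], 3))  # prints 2.0
```

In this toy example the distance for gene 0 is computed over gene 1 only, so the unrelated gene 2 does not distort the neighborhood. With a distance over all genes, the large value 9.0 in gene 2 would push the last sample away from its natural neighbor.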



SAGMB publishes significant research on the application of statistical ideas to problems arising from computational biology. The range of topics includes linkage mapping, association studies, gene finding and sequence alignment, protein structure prediction, design and analysis of microarray data, molecular evolution and phylogenetic trees, DNA topology, and database search strategies.
