
Statistical Applications in Genetics and Molecular Biology

Volume 15, Issue 5



A simulation framework for correlated count data of features subsets in high-throughput sequencing or proteomics experiments

Jochen Kruppa
  • Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Foundation, Hannover, D-30559, Germany
/ Frank Kramer
  • Department of Medical Statistics, University Medical Center Göttingen, 37099 Göttingen, Germany
/ Tim Beißbarth
  • Department of Medical Statistics, University Medical Center Göttingen, 37099 Göttingen, Germany
/ Klaus Jung
  • Corresponding author
  • Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Foundation, Hannover, D-30559, Germany
Published Online: 2016-09-21 | DOI: https://doi.org/10.1515/sagmb-2015-0082


As part of the data processing of high-throughput sequencing experiments, count data are produced that represent the number of reads mapping to specific genomic regions. Count data also arise in mass spectrometric experiments for the detection of protein-protein interactions. Evaluating new computational methods for the analysis of sequencing count data or spectral count data from proteomics experiments therefore requires artificial count data. Although some methods for generating artificial sequencing count data have been proposed, they either simulate single sequencing runs, thereby omitting the correlation structure between the individual genomic features, or are limited to specific correlation structures. We propose to draw correlated data from the multivariate normal distribution and to round these continuous data to obtain discrete counts. In our approach, the required distribution parameters can either be constructed in different ways or estimated from real count data. Because rounding affects the correlation structure, we evaluate the use of shrinkage estimators that have already been used in the context of artificial expression data from DNA microarrays. Our approach turned out to be useful for simulating counts for defined subsets of features such as individual pathways or GO categories.

Keywords: count data; gene ontology; next-generation sequencing; pathways; simulation
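
The core idea of the abstract, drawing correlated values from a multivariate normal distribution and rounding them to counts, can be sketched in a few lines. The following Python example is only an illustration under stated assumptions and is not the authors' implementation: the function names `simulate_counts` and `shrink_correlation`, the mean vector, the standard deviations, the correlation matrix, and the clipping of negative values at zero are all hypothetical choices made here. The shrinkage step uses a fixed, arbitrarily chosen intensity `lam` as a simple stand-in; the shrinkage estimators evaluated in the article are not reproduced.

```python
import numpy as np


def simulate_counts(mean, cov, n_samples, seed=0):
    """Draw correlated normal data and round it to non-negative integer counts."""
    rng = np.random.default_rng(seed)
    continuous = rng.multivariate_normal(mean, cov, size=n_samples)
    # Rounding gives discrete values; clipping at zero is an assumption of
    # this sketch so that the result can be read as counts.
    return np.clip(np.rint(continuous), 0, None).astype(int)


def shrink_correlation(counts, lam=0.2):
    """Shrink the sample correlation matrix toward the identity.

    `lam` is a fixed illustrative intensity, not a data-driven estimate.
    """
    r = np.corrcoef(counts, rowvar=False)
    return (1.0 - lam) * r + lam * np.eye(r.shape[0])


# Hypothetical parameters for a subset of three features.
mean = np.array([50.0, 80.0, 120.0])
sd = np.array([5.0, 8.0, 10.0])
corr = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.5],
    [0.3, 0.5, 1.0],
])
cov = corr * np.outer(sd, sd)  # covariance from correlations and standard deviations

counts = simulate_counts(mean, cov, n_samples=2000, seed=1)
est = np.corrcoef(counts, rowvar=False)  # correlation recovered after rounding
shrunk = shrink_correlation(counts, lam=0.2)
```

Comparing `est` with `corr` shows how closely the rounded counts preserve the target correlation structure; with small means or variances, rounding and clipping distort it more strongly, which is the situation in which a regularised estimate becomes relevant.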



About the article

Published in Print: 2016-10-01

Citation Information: Statistical Applications in Genetics and Molecular Biology, Volume 15, Issue 5, Pages 401–414, ISSN (Online) 1544-6115, ISSN (Print) 2194-6302, DOI: https://doi.org/10.1515/sagmb-2015-0082.

©2016 Walter de Gruyter GmbH, Berlin/Boston.
