Published by De Gruyter, September 21, 2016

A simulation framework for correlated count data of features subsets in high-throughput sequencing or proteomics experiments

Jochen Kruppa, Frank Kramer, Tim Beißbarth and Klaus Jung


As part of the data processing in high-throughput sequencing experiments, count data are produced that represent the number of reads mapping to specific genomic regions. Count data also arise in mass spectrometric experiments for the detection of protein-protein interactions. Evaluating new computational methods for the analysis of sequencing count data or spectral count data from proteomics experiments therefore requires artificial count data. Although some methods for the generation of artificial sequencing count data have been proposed, they either simulate single sequencing runs, thereby omitting the correlation structure between the individual genomic features, or are limited to specific structures. We propose to draw correlated data from the multivariate normal distribution and to round these continuous data in order to obtain discrete counts. In our approach, the required distribution parameters can either be constructed in different ways or estimated from real count data. Because rounding affects the correlation structure, we evaluate the use of shrinkage estimators that have already been used in the context of artificial expression data from DNA microarrays. Our approach turned out to be useful for the simulation of counts for defined subsets of features such as individual pathways or GO categories.
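The core idea described above (draw from a multivariate normal distribution, then round to obtain counts) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `simulate_correlated_counts` and `shrink_covariance`, the example means, standard deviations and correlations, the truncation of negative values at zero, and the fixed shrinkage intensity `lam` are all assumptions made for illustration. In particular, the simple shrinkage toward the diagonal shown here only stands in for the shrinkage estimators evaluated in the paper, which are not specified in this abstract.

```python
import numpy as np


def simulate_correlated_counts(mean, cov, n_samples, seed=0):
    """Draw multivariate normal data and round it to non-negative integers."""
    rng = np.random.default_rng(seed)
    continuous = rng.multivariate_normal(mean, cov, size=n_samples)
    # Rounding gives discrete values; clipping at zero is an assumption of
    # this sketch so that the result can be read as counts.
    return np.clip(np.rint(continuous), 0, None).astype(np.int64)


def shrink_covariance(sample_cov, lam):
    """Shrink a covariance matrix toward its diagonal.

    `lam` is a hypothetical fixed intensity in [0, 1]; data-driven
    shrinkage estimators choose this value from the data instead.
    """
    target = np.diag(np.diag(sample_cov))
    return lam * target + (1.0 - lam) * sample_cov


# Illustrative parameters for a small feature subset (three features).
mean = np.array([50.0, 80.0, 120.0])
sd = np.array([5.0, 8.0, 10.0])
corr = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.5],
    [0.3, 0.5, 1.0],
])
cov = corr * np.outer(sd, sd)

counts = simulate_correlated_counts(mean, cov, n_samples=2000)

# Compare the correlation of the rounded counts with the specified one.
empirical_corr = np.corrcoef(counts, rowvar=False)
sample_cov = np.cov(counts, rowvar=False)
shrunk_cov = shrink_covariance(sample_cov, lam=0.2)
```

Comparing `empirical_corr` with `corr` gives a rough impression of how much the rounding step distorts the specified correlation structure. With the fairly large variances chosen here the distortion should be small, whereas features with low means and variances would be affected more strongly.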


The work was partially supported by the German Federal Ministry of Education and Research (BMBF) through the projects ELSA Genoperspektiv (grant number 01GP1402) and e:Med MMML-Demonstrators (grant number 031A428B). The authors would also like to thank the reviewers for further references.



Published Online: 2016-9-21
Published in Print: 2016-10-1

©2016 Walter de Gruyter GmbH, Berlin/Boston