
Statistical Applications in Genetics and Molecular Biology

Editor-in-Chief: Sanguinetti, Guido

Volume 12, Issue 2



Exploring the sampling universe of RNA-seq

Stefanie Tauber
  • Center for Integrative Bioinformatics, Max F Perutz Laboratories, University of Vienna and Medical University of Vienna, Vienna, Austria
Arndt von Haeseler
  • Corresponding author
  • Center for Integrative Bioinformatics, Max F Perutz Laboratories, University of Vienna and Medical University of Vienna, Vienna, Austria
Published Online: 2013-04-16 | DOI: https://doi.org/10.1515/sagmb-2012-0049


How deep is deep enough? Although RNA sequencing is a well-established technology, the sequencing depth required to detect all expressed genes is not known. If we set aside the biological overhead and meta-information, we are dealing with a classical sampling process. Such sampling processes are well known from population genetics and have been thoroughly investigated. Here we use the Pitman Sampling Formula to model the sampling process of RNA sequencing. In doing so, we characterize the sampling by means of two parameters that capture the conglomerate of different sequencing technologies, protocols and their associated biases. We distinguish between two levels of sampling: the number of reads per gene and the number of reads starting at each position of a specific gene. The latter allows us to evaluate the theoretical expectation of uniform coverage and the performance of sequencing protocols in that respect. Most importantly, given a pilot sequencing experiment, we provide an estimate of the size of the underlying sampling universe and, based on these findings, evaluate an estimator for the number of newly detected genes when sequencing an additional sample of arbitrary size.

Keywords: RNA sequencing; sampling; modeling RNA-seq; deep sequencing; Pitman sampling formula
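
To make the two-parameter sampling process mentioned in the abstract more concrete, the following is a minimal sketch of the sequential urn scheme that underlies the Pitman Sampling Formula. It is not the authors' implementation or estimator. The function name `simulate_pitman_sampling` and the parameter names `sigma` and `theta` are assumptions chosen for illustration. The sketch uses the standard predictive rule: after n reads covering k distinct genes, the next read hits a new gene with probability (theta + k*sigma) / (theta + n), and an already observed gene j with probability (n_j - sigma) / (theta + n).

```python
import random


def simulate_pitman_sampling(n_reads, sigma, theta, seed=0):
    """Simulate reads-per-gene counts under the two-parameter urn scheme.

    Illustrative sketch only: names and defaults are assumptions, not
    taken from the article. Requires 0 <= sigma < 1 and theta > -sigma.
    Returns a list whose j-th entry is the read count of the j-th gene
    in order of first detection.
    """
    if not (0 <= sigma < 1) or theta <= -sigma:
        raise ValueError("need 0 <= sigma < 1 and theta > -sigma")

    rng = random.Random(seed)
    counts = []  # counts[j] = number of reads assigned to gene j so far

    for n in range(n_reads):
        k = len(counts)
        # The very first read always detects a new gene.
        p_new = 1.0 if n == 0 else (theta + k * sigma) / (theta + n)
        if rng.random() < p_new:
            counts.append(1)
        else:
            # Pick an observed gene with weight (count - sigma).
            weights = [c - sigma for c in counts]
            j = rng.choices(range(k), weights=weights)[0]
            counts[j] += 1
    return counts


if __name__ == "__main__":
    pilot = simulate_pitman_sampling(n_reads=5000, sigma=0.5, theta=10.0, seed=1)
    print("reads:", sum(pilot), "distinct genes detected:", len(pilot))
```

In this sketch, the number of distinct genes keeps growing as more reads are drawn whenever `sigma` is positive, which is the kind of behaviour the question "how many new genes would an additional sample detect?" depends on. Fitting `sigma` and `theta` to real pilot data is not shown here.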



About the article

Corresponding authors: Stefanie Tauber and Arndt von Haeseler: Center for Integrative Bioinformatics, Max F Perutz Laboratories, University of Vienna and Medical University of Vienna, Vienna, Austria

Citation Information: Statistical Applications in Genetics and Molecular Biology, Volume 12, Issue 2, Pages 175–188, ISSN (Online) 1544-6115, ISSN (Print) 2194-6302, DOI: https://doi.org/10.1515/sagmb-2012-0049.

©2013 by Walter de Gruyter Berlin Boston.
