Several high-throughput technologies have emerged in the past decade, most notably next-generation sequencing, but also methods that estimate abundance levels of proteins and small molecules. Together, these methods are contributing to an enormous collection of experimental data. However, current research in molecular science is typically based on studies with rather small sample sizes, many of them addressing the same disease or target. Findings obtained across platforms and studies often diverge, and strengthening the evidence behind them is an increasingly important task. Hence, there is a strong demand for statistical methods that integrate such findings, for example by combining microarray-based expression measurements with RNA-seq results.
A central task is the integration of such data, which differ in important aspects such as laboratory technology, quantification, scale, and study size. When several studies are combined, the involved sets of genes or of other omics entities usually do not match, and missing observations are likely to occur. Moreover, often only subsets of these data, of unknown size, are relevant or informative. In almost all situations the original metric measurements from the involved studies can be transformed into rank data. Until recently, most integration tools for rank data have been heuristic in nature and could not meet all of the above-mentioned demands. The few statistical integration approaches in use are limited to microarray results (Yang et al., 2006; Plaisier et al., 2010). A general methodology that allows for the integration of other high-throughput technologies, and for a mix of platforms and technologies even when ranked lists are incomplete, had been lacking until the work of Lin and Ding (2009) and Hall and Schimek (2012). Schimek et al. (2012) combined these approaches and extended them with the goal of processing arbitrarily long multiple ranked lists. To turn such novel statistical methods into practical tools, we have implemented them in the
2 Structure and availability of the
3 Implementation and performance of the
The time needed for computation in the modules
4 Brief description of the statistical methods
The purpose of the
For a given set of items, the input is the overlap of rank positions, represented by a sequence of indicators Ij: Ij=1 if the rank given by the second assessor to the item ranked j by the first assessor is at most δ index positions distant from j, and Ij=0 otherwise. The assumption that the variables Ij follow a Bernoulli distribution can be relaxed. There is theoretical and simulation evidence that dependencies among the ranked lists do not impair the estimates (Hall and Schimek, 2012). Besides the distance δ, which reflects the inter-assessor or inter-platform variability, there is a second tuning parameter, the pilot sample size ν, a smoothing parameter that controls for the irregularity of assessments or expression measurements. A graphical method called Δ-plot is implemented in
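The construction of the indicator sequence can be illustrated with a minimal Python sketch. It is not the package's implementation: the function name `overlap_indicators`, the toy item labels, and the convention that an item missing from the second list yields an indicator of 0 are all assumptions made for illustration.

```python
def overlap_indicators(list_a, list_b, delta):
    """Return the indicators I_1..I_N for two ranked lists of item labels.

    I_j = 1 if the item ranked j in list_a sits at most `delta`
    positions away from j in list_b, otherwise I_j = 0.
    """
    # 1-based rank position of every item in the second list
    rank_b = {item: pos for pos, item in enumerate(list_b, start=1)}
    indicators = []
    for j, item in enumerate(list_a, start=1):
        pos_b = rank_b.get(item)
        # Assumption: an item absent from list_b counts as non-overlapping
        close = pos_b is not None and abs(pos_b - j) <= delta
        indicators.append(1 if close else 0)
    return indicators


# Hypothetical toy rankings from two assessors (or platforms)
first = ["miR-a", "miR-b", "miR-c", "miR-d", "miR-e"]
second = ["miR-b", "miR-a", "miR-e", "miR-c", "miR-d"]

print(overlap_indicators(first, second, delta=1))  # [1, 1, 1, 1, 0]
```

A larger δ tolerates more positional disagreement, so more indicators become 1; with δ=0 only exact rank matches count.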
The overall estimate
The principle of the
5 Application to cross platform microRNA profiles
Stimulated by the methodological discussion of microRNA profiling in Baker (2010), we compared non-small cell lung cancer (NSCLC) cell lines grown in vitro and in vivo as xenograft models across platforms. From the NCBI GEO database we retrieved data (Tam et al., 2014) of five in vitro and five in vivo samples from three different platforms: (i) GSE51501, Illumina Human v2 MicroRNA Expression BeadChip; (ii) GSE51504, NanoString nCounter Human v1 miRNA Expression Assay; (iii) GSE51507, Illumina HiSeq 2500 (high-throughput sequencing, abbreviated HTS). Data (i) and (ii) were normalized using Bioconductor’s
Data exploration led to the choice of δ=40 and ν=22 for the inference procedure (for details, please refer to the showcase instructions on the web page). The obtained result was
Finally, we calculated an optimized aggregate list
The CEMC stochastic search algorithm may select items that are top-ranked in only one of the lists (here BeadChip). This applies to the following items in Table 1: hsa-miR-576-5p, hsa-miR-490-5p, hsa-miR-139-5p, hsa-miR-1233, hsa-miR-1284, and hsa-miR-505. In contrast, Fisher’s method tends to select “consensus” items and thus agrees more closely with the aggregation map results. Within the top-5 positions, all methods select the same items; only their order differs. However, apart from this rather limited set of overlapping miRNAs, both aggregate lists from CEMC, as well as the aggregation map discussed before, clearly point to substantial platform differences.
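For readers unfamiliar with Fisher’s method (Fisher, 1925), the following Python sketch shows the standard combination of k independent p-values via the statistic X = -2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The function name and the item p-values are hypothetical, and how the combined values were turned into the consolidated list in this application is not restated here; sorting items by combined p-value is an assumption made for illustration.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values (each in (0, 1]) with Fisher's method.

    Returns the test statistic X = -2 * sum(ln p_i) and its p-value
    under a chi-square distribution with 2k degrees of freedom.
    """
    k = len(pvalues)
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    half = statistic / 2.0
    # For even degrees of freedom 2k the chi-square upper tail has the
    # closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!
    tail = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
    return statistic, tail


# Hypothetical per-platform p-values for three items
item_pvalues = {
    "miR-a": [0.01, 0.02, 0.03],   # consistently small on all platforms
    "miR-b": [0.0001, 0.60, 0.70],  # small on one platform only
    "miR-c": [0.20, 0.30, 0.25],
}

combined = {item: fisher_combined_pvalue(p)[1] for item, p in item_pvalues.items()}
ranking = sorted(combined, key=combined.get)
print(ranking)
```

Because the statistic sums the log p-values over all lists, every platform contributes to an item's combined evidence.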
Using the miRSystem (Lu et al., 2012), we found the final lists of ranked miRNAs (one for Kendall, one for Spearman, and one for Fisher’s method) to be highly enriched for the JAK-STAT signaling pathway and the Hedgehog signaling pathway, both of which were suggested to play an important role in NSCLC. The interesting candidates include hsa-miR-143, which is among a set of 43 miRNAs that were found to be differentially expressed between noncancerous lung tissues and lung cancer tissues (Yanaihara et al., 2006) and has also been suggested as a putative biomarker for NSCLC (Gao et al., 2010). Finally, on rank 1 and on rank 2, respectively, we have the RAB14-targeting tumor suppressor hsa-miR-451 (Wang et al., 2011).
Aggregate list results of the NSCLC application.
First and second columns: CEMC consolidated list results under the distance measures Kendall’s τ and Spearman’s footrule.
Third column: consolidated list using Fisher’s method for combining p-values (miR-symbols in bold coincide with the aggregation map result in Figure 1).
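The two distance measures named in the caption can be sketched in Python as follows. This is a simplified illustration that assumes two complete rankings of the same items; the variants used inside CEMC for partial or top-k lists may differ, and the function names and toy lists are hypothetical.

```python
from itertools import combinations


def spearman_footrule(list_a, list_b):
    """Sum of absolute rank differences between two full rankings."""
    pos_b = {item: i for i, item in enumerate(list_b)}
    return sum(abs(i - pos_b[item]) for i, item in enumerate(list_a))


def kendall_tau_distance(list_a, list_b):
    """Number of item pairs that the two rankings order differently."""
    pos_b = {item: i for i, item in enumerate(list_b)}
    # In every pair (x, y) from combinations, x precedes y in list_a;
    # the pair is discordant when list_b places y before x.
    return sum(1 for x, y in combinations(list_a, 2) if pos_b[x] > pos_b[y])


# Hypothetical toy rankings over the same four items
ranking_1 = ["A", "B", "C", "D"]
ranking_2 = ["B", "A", "D", "C"]

print(spearman_footrule(ranking_1, ranking_2))     # 4
print(kendall_tau_distance(ranking_1, ranking_2))  # 2
```

Both distances are zero for identical rankings and largest for a fully reversed ranking, but they weight displacements differently, which is one reason the two CEMC columns need not coincide.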
A major advantage over ground truth-based and other ad hoc methods is
Gratefully acknowledged are financial support of the Medical University of Graz (MGS, VS) as well as funding by the Deutsche Forschungsgemeinschaft (DFG) to SFB 924 (KGK) and by the US National Science Foundation grant DMS-1220772 (SL).
Fisher, R. A. (1925): Statistical methods for research workers, Edinburgh: Oliver and Boyd, ISBN 0-05-002170-2.
Gao, W., Y. Yu, H. Cao, H. Shen, X. Li, S. Pan and Y. Shu (2010): “Deregulated expression of miR-21, miR-143 and miR-181a in non small cell lung cancer is related to clinicopathologic characteristics or patient prognosis,” Biomed. Pharmacother., 64, 399–408.
Hall, P. and M. G. Schimek (2012): “Moderate-deviation-based inference for random degeneration in paired rank lists,” J. Am. Stat. Assoc., 107, 661–672.
Kugler, K. G., L. A. Mueller and A. Graber (2010): “MADAM: an open source meta-analysis toolbox for R and Bioconductor,” Source Code Biol. Med., 5, 3.
Lin, S. (2010): “Space oriented rank-based data integration,” Stat. Appl. Genet. Mol. Biol., 9, Article 20.
Lin, S. and J. Ding (2009): “Integration of ranked lists via Cross Entropy Monte Carlo with applications to mRNA and microRNA studies,” Biometrics, 65, 9–18.
Love, M. I., W. Huber and S. Anders (2014): “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2,” Genome Biol., 15, 550.
Lu, T.-P., C.-Y. Lee, M.-H. Tsai, Y.-C. Chiu, C. K. Hsiao, L.-C. Lai and E. Y. Chuang (2012): “miRSystem: an integrated system for characterizing enriched functions and pathways of microRNA targets,” PloS One, 7, e42390.
Plaisier, S. B., R. Taschereau, J. A. Wong and T. G. Graeber (2010): “Rank-rank hypergeometric overlap: identification of statistically significant overlap between gene-expression signatures,” Nucleic Acids Res., 38, e169.
Schimek, M. G., A. Myšicková and E. Budinská (2012): “An inference and integration approach for the consolidation of ranked lists,” Commun. Stat. – Simul. C., 41, 1152–1166.
Takahashi, Y., A. R. Forrest, E. Maeno, T. Hashimoto, C. O. Daub and J. Yasuda (2009): “MiR-107 and MiR-185 can induce cell cycle arrest in human non small cell lung cancer cell lines,” PloS One, 4, e6677.
Tam, S., R. de Borja, M.-S. Tsao, and J. D. McPherson (2014): “Robust global microRNA expression profiling using next-generation sequencing technologies,” Lab. Invest., 94, 350–358.
Tibshirani, R., G. Chu, B. Narasimhan and J. Li (2011): “Significance analysis of microarrays – samr: R package version 2.0,” URL http://CRAN.R-project.org/package=samr.
Verzani, J. (2014): “gWidgets: gWidgets API for building toolkit-independent, interactive GUIs – R package version 0.0-54,” URL http://CRAN.R-project.org/package=gWidgets.
Wang, R., Z. Wang, J. Yang, X. Pan, W. De and L. Chen (2011): “MicroRNA-451 functions as a tumor suppressor in human non-small cell lung cancer by targeting ras-related protein 14 (RAB14),” Oncogene, 30, 2644–2658.
Yanaihara, N., N. Caplen, E. Bowman, M. Seike, K. Kumamoto, M. Yi, R. M. Stephens, A. Okamoto, J. Yokota, T. Tanaka, et al. (2006): “Unique microRNA molecular profiles in lung cancer diagnosis and prognosis,” Cancer Cell, 9, 189–198.
Yang, X., S. Bentink, S. Scheid and R. Spang (2006): “Similarities of ordered gene lists,” J. Bioinform. Comput. Biol., 4, 693–708.