Integration of data from multiple omics platforms has become a major challenge in studying complex systems and traits. For integrating data from multiple platforms, the underlying spaces from which the top ranked elements come from are likely to be different. Thus, taking the underlying spaces into consideration explicitly is important, as failure to do so would lead to inefficient use of data and might render biases and/or sub-optimal results. We propose two space oriented classes of heuristic algorithms for integrating ranked lists from omic scale data. These algorithms are either Borda inspired or Markov chain based that take the underlying spaces of the individual ranked lists into account explicitly. We applied this set of algorithms to a number of problems, including one that aims at aggregating results from three cDNA and two Affymetrix gene expression studies in which the underlying spaces between Affymetrix and cDNA platforms are clearly different.
Chromatin interactions mediated by a particular protein are of interest for studying gene regulation, especially the regulation of genes that are associated with, or known to be causative of, a disease. A recent molecular technique, Chromatin interaction analysis by paired-end tag sequencing (ChIA-PET), that uses chromatin immunoprecipitation (ChIP) and high throughput paired-end sequencing, is able to detect such chromatin interactions genomewide. However, ChIA-PET may generate noise (i.e., pairings of DNA fragments by random chance) in addition to true signal (i.e., pairings of DNA fragments by interactions). In this paper, we propose MC_DIST based on a mixture modeling framework to identify true chromatin interactions from ChIA-PET count data (counts of DNA fragment pairs). The model is cast into a Bayesian framework to take into account the dependency among the data and the available information on protein binding sites and gene promoters to reduce false positives. A simulation study showed that MC_DIST outperforms the previously proposed hypergeometric model in terms of both power and type I error rate. A real data study showed that MC_DIST may identify potential chromatin interactions between protein binding sites and gene promoters that may be missed by the hypergeometric model. An R package implementing the MC_DIST model is available at http://www.stat.osu.edu/~statgen/SOFTWARE/MDM.
High-throughput sequencing techniques are increasingly affordable and produce massive amounts of data. Together with other high-throughput technologies, such as microarrays, there are an enormous amount of resources in databases. The collection of these valuable data has been routine for more than a decade. Despite different technologies, many experiments share the same goal. For instance, the aims of RNA-seq studies often coincide with those of differential gene expression experiments based on microarrays. As such, it would be logical to utilize all available data. However, there is a lack of biostatistical tools for the integration of results obtained from different technologies. Although diverse technological platforms produce different raw data, one commonality for experiments with the same goal is that all the outcomes can be transformed into a platform-independent data format – rankings – for the same set of items. Here we present the R package TopKLists, which allows for statistical inference on the lengths of informative (top-k) partial lists, for stochastic aggregation of full or partial lists, and for graphical exploration of the input and consolidated output. A graphical user interface has also been implemented for providing access to the underlying algorithms. To illustrate the applicability and usefulness of the package, we integrated microRNA data of non-small cell lung cancer across different measurement techniques and draw conclusions. The package can be obtained from CRAN under a LGPL-3 license.