Abstract
The rapid development of high throughput experimental techniques has resulted in a growing diversity of genomic datasets being produced and requiring analysis. Therefore, it is increasingly being recognized that we can gain deeper understanding about underlying biology by combining the insights obtained from multiple, diverse datasets. Thus we propose a novel scalable computational approach to unsupervised data fusion. Our technique exploits network representations of the data to identify similarities among the datasets. We may work within the Bayesian formalism, using Bayesian nonparametric approaches to model each dataset; or (for fast, approximate, and massive scale data fusion) can naturally switch to more heuristic modeling techniques. An advantage of the proposed approach is that each dataset can initially be modeled independently (in parallel), before applying a fast post-processing step to perform data integration. This allows us to incorporate new experimental data in an online fashion, without having to rerun all of the analysis. We first demonstrate the applicability of our tool on artificial data, and then on examples from the literature, which include yeast cell cycle, breast cancer and sporadic inclusion body myositis datasets.
1 Introduction
Given the broad variety of high-throughput biological datasets now being generated, attention is increasingly focusing on methods for their integration (Kirk et al., 2012; Lock and Dunson, 2013; Wang et al., 2014). Different technologies allow us to probe different aspects of biological systems, and combining these complementary perspectives can yield greater insight (Altman, 2013; Savage et al., 2013; Schimek et al., 2015).
Computational techniques for data fusion are rapidly evolving, and moving towards potential clinical applications (Altman, 2013). For example, recently proposed methods try to identify cancer subtypes using fused similarity networks applied to a combination of DNA methylation, mRNA expression and miRNA expression datasets (Wang et al., 2014). Cancer subtype discovery has also been a focus of other recent data integration efforts (Yuan et al., 2011; Savage et al., 2013; Lock and Dunson, 2013). Further, data exploration via e.g. simultaneous clustering of gene expression networks (Narayanan et al., 2010), inference of transcriptional module networks (Lemmens et al., 2009), and bi-clustering with connected biologically relevant information (Reiss et al., 2006) are also becoming prominent in the recent data fusion literature.
Data integration methods can be categorized as either unsupervised (the subject of the present work), or supervised (Troyanskaya et al., 2003; Myers et al., 2005; Myers and Troyanskaya, 2007; Huttenhower et al., 2009). Existing unsupervised techniques for the fusion of multiple (i.e. more than two) datasets fall into one of two broad categories: those which are Bayesian (Kirk et al., 2012; Lock and Dunson, 2013) and those which are not (Shen et al., 2009; Wang et al., 2014). For many real-world or large-scale analyses the lack of suitable training sets places severe limitations on supervised algorithms, and unsupervised learning methods have thus gained in popularity. Current Bayesian approaches for unsupervised data fusion rely on mixture model representations of each dataset, with dependencies between datasets modeled either using coefficients that describe the similarity between pairs of datasets (Kirk et al., 2012), or by assuming that each dataset has a structure that adheres – to a lesser or greater degree – to an overall consensus structure that is common to all datasets (Lock and Dunson, 2013). Bayesian approaches have the advantage of having firm probabilistic foundations, of forcing all prior beliefs and assumptions to be formally expressed from the outset, and of allowing the uncertainty in unknown quantities to be encapsulated in (samples from) posterior densities. For these reasons, Bayesian approaches are usually preferred; however, computational cost may prohibit their application to large (e.g. genome-scale) datasets.
In this work we introduce a new approach for the fusion of heterogeneous datasets; the methodology presented here is closely related to a graph-theoretic approach, which may be used to test for associations between disparate sources of data (Balasubramanian et al., 2004). Our approach has two basic steps. In the first, we obtain (independently) for each dataset either an ensemble of networks (in the Bayesian case), or a single network (in the non-Bayesian case), where networks are used to represent the structure and dependencies present in the dataset. In the second, we perform a post-processing step which compares the networks obtained for different datasets, in order to identify common edges, and to thereby assess and quantify the similarities in dataset structure.
Although, in principle, we could consider any type of structure that may be represented by a network, in the present work we are specifically interested in identifying and comparing the clustering structure possessed by each dataset, and it is at this level that we perform data fusion (Figure 1). In the Bayesian case, we employ for the purpose of cluster identification Dirichlet process mixture (DPM) models with either Gaussian process (GP) or multinomial likelihood functions. In our approach, data fusion is performed by constructing the connectivity networks that represent each clustering, and then forming “consensus networks” that identify the clusters upon which multiple datasets agree.
The integration step is somewhat similar to the consensus clustering approach (Monti et al., 2003), which was developed for the purpose of assessing cluster stability and for identifying the consensus across multiple evaluations of the same clustering approach. In contrast with other existing techniques (Kirk et al., 2012; Lock and Dunson, 2013), in our approach each dataset is initially considered independently. Although this is likely to result in some loss of sensitivity, it also means that computations can be performed in a parallel fashion, which offers potentially significant advantages in cases where we are dealing with large datasets, or where it is necessary to rerun computations in order to consider additional datasets.
2 Methods
Here we develop a novel graph theoretical approach for integrating clustering outcomes across several datasets, which we refer to as “GTA” for brevity. GTA may be applied to the output of Bayesian or non-Bayesian clustering algorithms.
2.1 Data integration
There are many different methods for data integration, most of which set out to accomplish one (or both) of the following two key aims: (i) modeling the dependencies that exist within and between datasets; and (ii) combining the predictions derived from one dataset with those derived from another. Assuming for convenience of notation that we wish to integrate just a pair of datasets, we might consider that the “ideal” way to fulfill the first of these two aims would be to model the joint distribution, p(q(1), q(2)|D(1), D(2)), of the predictions, q(1) and q(2), derived from datasets, D(1) and D(2), respectively (for the sake of generality, we leave the definition of q deliberately vague, but these predictions could be, for example, assessments of disease risk). Such an approach poses many potential challenges, not least that the datasets may be of very different types and/or may be of (very) high dimension. In contrast, the second aim just requires us to define a fusion function, f, for combining the predictions. We shall assume that this function is deterministic; i.e. for given predictions, q(1) and q(2), we assume that f maps the predictions onto a single combined output, f(q(1), q(2)). One challenge in this case is to assess the uncertainty/confidence in this output. If we were able to sample M pairs,
In general, modeling the joint distribution, p(q(1), q(2)|D(1), D(2)), is much more computationally demanding than defining a fusion function. To make modeling p(q(1), q(2)|D(1), D(2)) more tractable, here we make an independence assumption and factorize the joint distribution as,
While, in practice, the quality of this approximation will depend upon a number of factors (including the signal-to-noise ratio associated with the individual datasets), it has the advantage of circumventing many of the challenges associated with trying to model p(q(1), q(2)|D(1), D(2)). Moreover, this independence assumption means that we may perform computations for each dataset in parallel, allowing us to scale to potentially large number of datasets. Having made this independence assumption, we may obtain samples from the joint distribution by sampling independently from each of the factors, p(q(1)|D(1)) and p(q(2)|D(2)). For a given fusion function, f, we may then consider the set
2.2 GTA: a graph theoretic approach to unsupervised data integration
We start with a collection of datasets, D(1), …, D(R), each of which comprises measurements taken on a common set of N entities/items (e.g. genes or patients). We consider the situation in which the prediction, q(r), associated with dataset D(r) is the clustering structure possessed by these items. To proceed, we must therefore define: (i) a method for sampling from p(q(r)|D(r)); and (ii) a fusion function, f, for combining the predicted clustering structures derived from the various datasets. Here, our aim is to identify similarities between the clustering structures associated with each of the datasets.
2.2.1 Sampling from p(q(r)|D(r))
We consider two different approaches for sampling from p(q(r)|D(r)). Where possible, we use Dirichlet process mixture modeling, a nonparametric Bayesian approach, which is discussed in the Appendix A. However, in some cases the size of the datasets being considered prohibits the use of such approaches. In these instances, rather than using an ensemble of M samples from p(q(r)|D(r)), we instead take M=1 and treat the maximum likelihood estimate,
2.2.2 Defining the fusion function
In order to define a fusion function, f, we take inspiration from Balasubramanian et al. (2004) work and adopt a graph theoretic approach. We define c to be a clustering of the N items of interest (genes, patients, etc.), so that ci is the cluster label associated with item i. The N×N adjacency matrix is given as,
Given a collection of R such clusterings, say c(1), …, c(R), we may form a corresponding collection of adjacency matrices, A(1), …, A(R), where A(k) is the adjacency matrix representation of clustering c(k). The Hadamard (entry-wise) product of these matrices, H=A(1) ∘…∘A(R), defines a new clustering, in which items i and j appear in the same cluster if and only if all of the clusterings c(1), …, c(R) agree that they should appear in the same cluster. From a graph theoretic perspective, H corresponds to the graph formed by taking the intersection of the graphs defined by A(1), …, A(R) (see Figure 1). Assuming that each of the clusterings, c(1), …, c(R), corresponds to a different dataset, we define our fusion function f to act on these and then return the clustering corresponding to the Hadamard product, H. Equipped with this fusion function, and M samples from p(q(r)|D(r)) (for r=1, …, R), we may proceed to describe with the GTA algorithm (Algorithm 1).
2.2.3 Summarizing the fused output
The output of the GTA algorithm is a collection of Mfused clusterings. While these provide a useful indication of the uncertainty in the fused output, it is often also helpful to condense these into a single, summary fused clustering, c̅. This can be done by constructing a posterior similarity matrix (Fritsch and Ickstadt, 2009) of
2.2.4 Heuristic score of clustering similarity
In some instances, it is useful to measure the compatibility between data sources by comparing the independent clusterings (obtained before data integration) with the fused clustering (obtained using the GTA algorithm). Due to the nature of our technique it is expected to observe more clusters after the data integration process. This is particularly true when studying less related datasets and when integrating more than a few data sources. We therefore define the following measure of similarity between any two data sources Dr and DI,
where
3 GTA applications
In this section we explore the capabilities of GTA to data integration. Here we are interested in comparing our result to the result delivered by data fusion methods, which explicitly model relationships and dependencies between datasets. We start by illustrating the applicability of heuristic score to identify similarities between several data sources. Then we test GTA performance on popular examples from literature, and on sporadic inclusion body myositis dataset.
Example R code and instructions are available from https://sites.google.com/site/gtadatafusion/.
3.1 Capturing similarities across artificial dataset
We consider a S. cerevisiae dataset (Cho et al., 1998) that contains mRNA transcription levels taken to study the cell cycle. From 416 genes that had previously been identified to have periodic changes in transcript levels we select 100, and assign them into seven clusters using DPM with GPR likelihood model (Figure 2A). The details regarding MCMC specification and diagnostics are summarized in Appendix B. To demonstrate the performance of GTA, we further consider the artificial dataset example (Kirk et al., 2012) that was constructed using the same 100 genes. Briefly, this example consists of six data sources, where the first source is the original dataset (see Figure 2A), and the other five were obtained sequentially, by randomly permuting a quarter of gene names in the previous dataset. Applying GTA on pairwise combinations of these datasets we identify the numbers of genes that cluster together before and after the fusion. These can be used for computing the score of agreement and for determining the similarities across all six datasets. The pairwise similarities are summarized in Figure 2B where columns and rows identify which combination of datasets were considered, and color illustrates the level of similarity. Alternatively, the similarity between these datasets can be identified from the final clusterings by computing the adjusted Rand index (ARI) (Hubert and Arabie, 1985). The ARI compares two given partitions of the same list of genes and is based on how often a gene (observation) is associated with the same cluster in both partitions (see Figure 2C).
3.2 Integrating cell cycle datasets
We next compare the results from GTA to the results by Multiple Dataset Integration (MDI) (Kirk et al., 2012). In this example we consider integrating a number of datasets from yeast cell cycle studies. The first dataset contains gene expression time courses (Granovskaia et al., 2010), where mRNA measurements are taken at 41 time points across 551 genes that exhibit oscillatory expression profiles. The second dataset is ChIP-chip data (Harbison et al., 2004) that contains binary information about proteins binding to DNA, and the third is a protein-protein interaction (PPI) dataset from BioGRID (Stark et al., 2006).
In order to apply GTA for data fusion, we initially cluster all datasets by running in parallel three independent DPM models (see Appendices A and B). To specify the DPM, we use Gaussian process likelihood to cluster gene expression time courses, and multinomial likelihood to cluster the transcription factor binding and PPI datasets. As before, further details regarding MCMC specifications and diagnostic plots are summarized in Appendix B. Although we use DPM here to automatically determine the number of cluster for each dataset; it is worth noting that initial clustering could also be done using other methods since GTA operates on pre-clustered dataset.
In this example we are aiming to compare the results from our method to the results from MDI, for this reason we consider two cases: (i) integration of gene expression and transcription factor binding data; (ii) integration of gene expression, transcription factor binding, and protein-protein interaction data.
MDI is a tool that jointly clusters all datasets by modeling dependencies between them. Using MDI approach the final clustering can be extracted by calculating a fusion probability – the probability that any two genes are fused in two datasets, and by removing those genes where this probability is less than 0.5. For this reason, using GTA we can also consider removing genes that lack the evidence of clustering together. This can be achieved by computing the matrix
As a second step in our yeast cell cycle example we consider integrating three datasets: gene expression, transcription factor binding and protein-protein interaction (PPI) data. In order to also fuse the PPI dataset using MDI, it is necessary to rerun joint clustering from the beginning; however, using GTA we can benefit from previously pre-clustered data and perform data fusion in an “online-fashion” using only samples from the posterior. In this case GTA fuses three datasets in under a minute (40.889 s), whereas for MDI it took to generate approximately 90 samples per chain per hour Kirk et al. (2012). Applying thinning-out strategy (removing genes that lack the evidence of clustering together) we obtained a set of 14 genes that can be assigned into 6 clusters (for comparison MDI assigned 16 genes into 5 clusters). Supplementary Figures 3A,B show the final network structures that correspond to fused three datasets using GTA and MDI, respectively; in addition, Supplementary Figure 2 projects clusters from GE data on to TF and PPI. For convenience, Supplementary Material Tables 3 contains further details regarding gene function. In addition to this, we looked at the overlap score, OS=0.214, which is above predefined threshold and suggests that GTA results can be seen as a useful approximation.
3.3 Breast cancer data
In this example we explore the performance of our data integration technique on a breast cancer dataset, where we aim to integrate four different data sources taken from The Cancer Genome Atlas (Cancer Genome Atlas Network, 2012). We use a previously described dataset (Lock and Dunson, 2013) that consists of preselected 348 tumor samples characterized in four different ways: RNA gene expression (645 genes), DNA methylation (574 probes), miRNA expression (423 miRNAs), and reverse phase protein array (171 proteins). In order to cluster all data sources, we adopt a modified version of Bayesian consensus clustering (BCC) approach Lock and Dunson (2013). BCC is a data integration technique that seeks to simultaneously model data specific and shared features by inferring the overall clustering Ĉ (that describes all datasets), and by inferring data specific clusterings
Breast cancer is a heterogeneous disease, for this reason four biologically distinct molecular subtypes can be connected to these dataset (Cancer Genome Atlas Network, 2012; Lock and Dunson, 2013). They are known as Her2, Basal, Luminal A, Luminal B, and are associated with different clinical prognosis (Rakha et al., 2008; Dawood et al., 2011). In order to further assess our results we look at cancer subtypes that are associated with each cluster and compared these to the BCC clusters. Confusion matrices in Table 2 illustrates the summarized results. The first matrix corresponds to the outcome using BCC method (a single run of publicly available code), while the second matrix summarize results from GTA. This further demonstrates that clusters identified using our technique bear similarity to the BCC results.
TCGA tumor subtypes | BCC single run | GTA | ||||
---|---|---|---|---|---|---|
Cl1 | Cl2 | Cl3 | Cl1 | Cl2 | Cl3 | |
Her2 | 5 | 14 | 20 | 28 | 6 | 5 |
Basal | 3 | 65 | 4 | 6 | 65 | 1 |
Luminal A | 77 | 3 | 81 | 98 | 4 | 59 |
Luminal B | 3 | 0 | 73 | 17 | 0 | 59 |
In the table are given numbers of tumor samples per cluster; e.g. GTA cluster 3 contains six samples of Her2, 65 samples of Basal and four samples of Luminal A, for this reason, cluster 3 can be summarized as containing mostly Basal type tumors. Clusters are classified by particular cancer subtype using publicly available R code (Lock and Dunson, 2013). In bold we highlight tumor samples that could be associated with cluster identity.
3.4 Sporadic inclusion body myositis
In this section we apply our technique on clinical gene expression datasets that include: (i) sporadic Inclusion Body Myositis (sIBM), which is an inflammatory muscle disease that progresses very slowly, causes muscular weakness and eventually muscle atrophy (Grau and Selva-O’Callaghan, 2008; Machado et al., 2009); (ii) polymyositis (PM) which causes chronic inflammation of the muscles; and (iii) a dataset containing human protein-protein interactions (BPPI) (see Supplementary Material, Section B for further details). Both sIBM and PM are associated with aging but interestingly sIBM can be frequently misdiagnosed as PM, and the explicit diagnosis can only be confirmed via a muscle biopsy (Dalakas, 2006). Current understanding is that sIBM is driven by two coexisting processes (autoimmune and degenerative); however, the actions by which sIBM occurs are still only poorly understood (Dalakas, 2006; Needham and Mastaglia, 2007).
3.4.1 PM and sIBM data integration
Here we focus on experimental sIBM and PM datasets that have 5 and 3 data points (clinical cases). In order to apply our technique, we select 424 genes that where previously identified to have the largest variation in their expression across all data points (Thorne et al., 2013). In order to cluster these datasets, we employ “mclust” package in R, which fits a Gaussian mixture model and uses the Bayesian information criterion to estimate the number of components. Because we do not have access to the clustering samples from the posterior for each dataset, we use GTA with a special case M=1 to integrate single clusterings from both datasets. The integrative analysis enables partitioning of large PM and sIBM clusters into smaller ones; however, this results in a greater number of clusters after the fusion, see Figure 4. To validate our results we use the “ToppGene” (https://toppgene.cchmc.org) tool, which performs gene set enrichment analysis and allows the detection of functional enrichment for phenotype (disease) and GO terms such as biological function. Such analysis enabled us to identify those clusters (5 out of 12) that are enriched with diseases like rheumatoid arthritis and myositis, and biological processes related to response from immune system. This suggests that genes in disease enriched clusters might play an important and shared role in both PM and sIBM, and for this reason they are promising targets for further analysis.
4 Discussion
In this paper we have developed and illustrated an alternative way to integrate data generated from different sources. GTA mainly relies on the outcomes from Bayesian clustering approaches and is based on concepts from graph theory. We have demonstrated that our technique, can produce similar data fusion results when compared to recently proposed data integration methods. Equally, while GTA can fuse the outcomes from Bayesian clustering algorithms, the special case M=1 can be applied in order to fuse a single clusterings across various datasets.
Here we have demonstrated the applicability of our technique to a variety of typical biological and biomedical problems: the identification of potentially underlying regulatory mechanisms in the yeast cell cycle; sub-typing tumors in breast cancer data; and exploring similarity patterns across inflammatory muscle diseases. We have compared our technique to MDI and BCC, the current state-of-the-art techniques used to address similar data integration problems. The main benefits of our graph-theoretical approach include: (i) the applicability to Bayesian and non-Bayesian type clustering approaches. This means that our methodology can be applied in order to model multiple sources on a genome scale data without facing computational challenges; and (ii) ability to perform clustering in a parallel fashion. These features might be advantageous in situations where it is necessary to rerun computations in order to consider additional datasets. As part of our modeling approach, we have defined a measure of similarity between two data sources. This can be used to evaluate the effects of data integration routine and in order to assess the agreement between data sources.
We note that it is possible to apply our graph-theoretical approach to the outcomes from simple clustering techniques, for example hierarchical or K-means clustering. This could be done by first performing a bootstrapping approach on each gene within a dataset Dr, M times; for further details on bootstrapping see (Efron, 1981) and (Kerr and Churchill, 2001). This would produce a set of bootstrapped datasets
Acknowledgments
This work was supported by the Leverhulme Trust (to JŽ and MPHS), the Royal Society (to MPHS), HFSP (to PK and MPHS) and BBSRC (to MPHS).
Appendix
A Further details regarding clustering using Dirichlet processes
A.1 Dirichlet process mixtures
The methodology for modeling heterogeneity in genomic datasets bears similarity to the structure of an infinite Gaussian mixture model. A Dirichlet process mixture (DPM) model can be derived as a limit of a finite mixture model when the number of mixture components grows to infinity. Precisely, let consider a dataset D={x1, …, xN} that we intend to model by the following mixture model,
where K is the number of components, πk are the mixing proportions, and ℱ (D|θk) are component density functions parameterised with a set of parameters θk. Furthermore, we associate each data point, xi, with a component indicator variable ci∈{1, …, K}. This allow us to track which mixture component generated a data point xi. We can allocate a symmetric Dirichlet prior to the mixing proportions,
where the subscript, “–i,” is a short notation for all indicators excluding i; and n–i,k denotes the number of observations within the cluster k not including observation xi. Now, by taking the limit as K goes to infinity the conditional prior has the following limits,
Combining conditional priors (4) with a likelihood function, ℱ(xi|θk), will result in a conditional posteriors
that are necessary to perform the inference of all parameters associated with model (3), including the number of clusters K. This can be achieved via Markov chain Monte Carlo (MCMC) methods. In equations (5), H denotes a prior for parameters θk, which might be a conjugate prior and depends on the likelihood model.
The DPM model is a general framework, and here we will focus on specific likelihood models. Specifically, we will employ DPM of Gaussian process regression (GPR) models to cluster gene expression time series, and DPM of multinomial models to model categorical/discrete functional genomics data. These two likelihood options are incorporated in our Matlab package DPMSysBio.
A.2 Likelihood function for the time course data
For convenience, bellow we will use the following notation, xi≡xi. Instead of specifying a parametric (e.g. a multivariate-Gaussian) likelihood function, in this work we capture the time course observations xi ={xi(t1), …, xi(tp)}, where xi(tj) denotes the measurement taken on gene i at time point tj, with a regression model. In a regression approach, each gene xi can be expressed as
where fi is a regression function and εij~𝒩(0, σ2) is added to express the potential uncertainty in measurements. In our case we are modeling observations (genes) that tend to cluster together. This means that each cluster can be described by the same “data generating” function fi≡fk and noise
Here, for simplicity, we adopted a zero mean function (m(t)=0, for all t) and squared exponential function,
where ak, lk>0 are the hyper-parameters. Then, the genes (observations) within each cluster, k,
Here, Nk is the number of observations in cluster k. For convenience we can rewrite the above in an expanded form,
where [fs(t1), …, fk(t1), …, fk(tp), …, fk(tp)]T contains Nk replicates of each fk(tj).
Now, we can define a Gaussian process prior,
Here cov(k) is an Nkp×Nkp matrix that is composed of smaller block matrices,
where [cov(ti, tj)] denotes i, j-th a smaller matrix structure. Here, each block matrix is symmetric and positive definite. This enable us to specify the following likelihood function within each cluster k,
A.3 Likelihood function for the discretised data
For convenience in this section we will describe a multinomial model Kirk et al. (2012) to capture categorical data. Typically, the categorical dataset consist of a list of genes (objects) where measurements, r∈{1, …, R}, for each gene are taken at Q distinctive attributes. For genes that tend to cluster together, xrq denotes the number of times q-th attribute receives a value r. Thus, the multivariate probability mass function for categorical data,
where xq=[x1q, …, xRq]; and θrq denotes the cluster related probability for attribute q to receive a value r.
Setting a Dirichlet prior,
where Bq=β1q, …, βRq and Sq=x1q, …, xRq. From the independence between the attributes follows the marginal likelihood function,
where βR×Q is a matrix.
B Inference of the hyper–parameters
In this section we explain how inference is performed for DPM models. Additionally, we provide the Markov chain Monte Carlo (MCMC) run details for S. cerevisiae and yeast cell cycle (examples in the paper).
In the DPMSysBio package, the Gaussian process likelihood is controlled by three hyper-parameters, θc={ac, lc, σc}, that are necessary in order to learn the means and covariances of each cluster. We set a Gaussian priors for the logarithmic versions of hyper-parameters (log(θc)) and employ a Gibbs sampling algorithm as described in Neal (2000) (see Section 6 for details). This algorithm can be seen as the most general Gibbs sampling scheme that can deal with a non-conjugate priors. In order to learn the hyper-parameters of DPM model with multinomial likelihood, we use Dirichlet priors.
Finally, to infer the concentration parameter α we set the following gamma prior α~𝒢(2, 4), and adopt approach Escobar and West (1995) proposed by Escobar and West.
B.1 S. cerevisiae example
For the original time courses Cho et al. (1998) and five perturbed datasets, we ran DPMSysBio package with GPR likelihood for 50,000 iterations recording each 5th sample. This provided us with 10,000 “thinned-out” MCMC samples. Further, we discarded the first 5000 samples as a “burn-in” period and for further analysis we used 5000 samples per each dataset.
B.2 Yeast cell cycle datasets example
For time course dataset (Granovskaia), we ran DPMSysBio package with GP likelihood for 40,000 iterations recording each 5th sample. This provided us with 8000 “thinned-out” MCMC samples. Further, we discarded the first 5000 samples as a “burn-in” period and for further analysis we used 3000 samples.
For transcription factor binding dataset (Harbison), we ran DPMSysBio package with multinomial likelihood for 20,000 iterations recording each 5th sample. This provided us with 4000 “thinned-out” MCMC samples. Further, we discarded the first 2000 samples as a “burn-in” period and for further analysis we used 2000 samples.
For protein-protein interaction dataset (Biogrid), we ran DPMSysBio package with multinomial likelihood for 30,000 iterations recording each 5th sample. This provided us with 6000 “thinned-out” MCMC samples. Further, we discarded the first 3000 samples as a “burn-in” period and for further analysis we used 3000 samples.
Figure 5 illustrates the Markov chains and posterior distributions for the number of clusters, K, and the concentration parameter, α, for Granovskaia (TC), Harbison (TF) and Biogrid (PPI) datasets.
References
Altman, R. B. (2013): “Personal genomic measurements: the opportunity for information integration,” Clin. Pharmacol. Ther., 93, 21–23.Search in Google Scholar
Bader, G. D. and C. W. Hogue (2003): “An automated method for finding molecular complexes in large protein interaction networks,” BMC Bioinformatics, 4, 2.10.1186/1471-2105-4-2Search in Google Scholar PubMed PubMed Central
Balasubramanian, R., T. LaFramboise, D. Scholtens, and R. Gentlman (2004): “A graph-theoretic approach to testing associations between disparate sources of functional genomics data,” Bioinformatics, 20, 3353–3362.10.1093/bioinformatics/bth405Search in Google Scholar PubMed
Cancer Genome Atlas Network (2012): “Comprehensive molecular portraits of human breast tumours,” Nature, 490, 61–70.10.1038/nature11412Search in Google Scholar PubMed PubMed Central
Cho, R. J., M. J. Campbell, E. A. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. G. Wolfsberg, A. E. Gabrielian, D. Landsman, D. J. Lockhart and R. W. Davis (1998): “A genome-wide transcriptional analysis of the mitotic cell cycle,” Mol. Cell, 2, 65–73.Search in Google Scholar
Dalakas, M. C. (2006): “Sporadic inclusion body myositis – diagnosis, pathogenesis and therapeutic strategies,” Nat. Clin. Pract. Neurol., 2, 437–447.Search in Google Scholar
Dawood, S., R. Hu, M. D. Homes, L. C. Collins, S. J. Schnitt, J. Connolly, G. A. Colditz and R. M. Tamimi (2011): “Defining breast cancer prognosis based on molecular phenotypes: results from a large cohort study,” Breast Cancer Res. Treat., 126, 185–192.Search in Google Scholar
Efron, B. (1981): “Nonparametric estimates of standard error: the jackknife, the bootstrap and other methods,” Biometrika, 68, 589–599.10.1093/biomet/68.3.589Search in Google Scholar
Escobar, M. and M. West (1995): “Bayesian density estimation and inference using mixtures,” J. Am. Statist. Assoc., 90, 577–588.Search in Google Scholar
Fritsch, A. and K. Ickstadt (2009): “Improved criteria for clustering based on the posterior similarity matrix,” Bayesian Anal., 4, 367–391.Search in Google Scholar
Granovskaia, M. V., L. J. Jensen, M. E. Ritchie, J. Toedling, Y. Ning, P. Bork, W. Huber and L. M. Steinmetz (2010): “High-resolution transcription atlas of the mitotic cell cycle in budding yeast,” Genome Biol. 11, R24.Search in Google Scholar
Grau, J. M., and A. Selva-O’Callaghan (2008): “Sporadic inclusion body myositis,” In: Diagnostic criteria in autoimmune diseases. New York, NY: Humana Press, 165–168.Search in Google Scholar
Harbison, C. T., D. B. Gordon, T. I. Lee, N. J. Rinaldi, K. D. Macisaac, T. W. Danford, N. M. Hannett, J.-B. Tagne, D. B. Reynolds, J. Yoo, E. G. Jennings, J. Zeitlinger, D. K. Pokholok, M. Kellis, P. A. Rolfe, K. T. Takusagawa, E. S. Lander, D. K. Gifford, E. Fraenkel and R. A. Young (2004): “Transcriptional regulatory code of a eukaryotic genome,” Nature, 431, 99–104.10.1038/nature02800Search in Google Scholar PubMed PubMed Central
Hubert, L. and P. Arabie (1985): “Comparing partitions,” J. Classif., 2, 193–218.Search in Google Scholar
Huttenhower, C., E. M. Haley, M. A. Hibbs, V. Dumeaux, D. R. Barrett, H. A. Coller, and O. G. Troyanskaya (2009): “Exploring the human genome with functional maps,” Genome Res., 19, 1093–1106.Search in Google Scholar
Kerr, M. K. and G. A. Churchill (2001): “Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments,” Proc. Natl. Acad. Sci. USA, 98, 8961–8965.10.1073/pnas.161273698Search in Google Scholar PubMed PubMed Central
Kirk, P., J. E. Griffin, R. S. Savage, Z. Ghahramani and D. L. Wild (2012): “Bayesian correlated clustering to integrate multiple datasets,” Bioinformatics, 28, 3290–3297.10.1093/bioinformatics/bts595Search in Google Scholar PubMed PubMed Central
Lemmens, K., T. De Bie, T. Dhollander, S. C. De Keersmaecker, I. M. Thijs, G. Schoofs, A. De Weerdt, B. De Moor, J. Vanderleyden, J. Collado-Vides, K. Engelen and K Marchal (2009): “Distiller: a data integration framework to reveal condition dependency of complex regulons in escherichia coli,” Genome Biol., 10, R27.Search in Google Scholar
Lock, E. F. and D. B. Dunson (2013): “Bayesian consensus clustering,” Bioinformatics, 29:2610–2616.10.1093/bioinformatics/btt425Search in Google Scholar PubMed PubMed Central
Machado, P., A. Miller, J. Holton and M. Hanna (2009): “Sporadic inclusion body myositis: an unsolved mystery,” Acta Reumatol. Port., 34, 161–182.Search in Google Scholar
Monti, S., P. Tamayo, J. Mesirov, and T. Golub (2003): “Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data,” Machine Learn., 52, 91–118.Search in Google Scholar
Myers, C. L., D. Robson, A. Wible, M. A. Hibbs, C. Chiriac, C. L. Theesfeld, K. Dolinski and O. G. Troyanskaya (2005): “Discovery of biological networks from diverse functional genomic data,” Genome Biol., 6, R114.Search in Google Scholar
Myers, C. L. and O. G. Troyanskaya (2007): “Context-sensitive data integration and prediction of biological networks,” Bioinformatics, 23, 2322–2330.10.1093/bioinformatics/btm332Search in Google Scholar PubMed
Narayanan, M., A. Vetta, E. E. Schadt and J. Zhu (2010): “Simultaneous clustering of multiple gene expression and physical interaction datasets,” PLoS Comput Biol., 6, e1000742.Search in Google Scholar
Neal, R. M. (2000): “Markov chain sampling methods for dirichlet process mixture models,” J. Comput. Graph. Stat., 9, 249–256.Search in Google Scholar
Needham, M. and F. L. Mastaglia (2007): “Inclusion body myositis: current pathogenetic concepts and diagnostic and therapeutic approaches,” Lancet Neurol., 6, 620–631.Search in Google Scholar
Nepusz, T., H. Yu and A. Paccanaro (2012): “Detecting overlapping protein complexes in protein-protein interaction networks,” Nat. Methods, 9, 471–472.Search in Google Scholar
Rakha, E. A., J. S. Reis-Filho and I. O. Ellis (2008): “Basal-like breast cancer: a critical review,” J. Clin. Oncol., 26, 2568–2581.Search in Google Scholar
Rasmussen, C. and C. Williams (2006): Gaussian processes for machine learning, 55 Hayward Street, Cambridge, MA 02142: The MIT Press, first edition.10.7551/mitpress/3206.001.0001Search in Google Scholar
Reiss, D. J., N. S. Baliga and R. Bonneau (2006): “Integrated biclustering of heterogeneous genome-wide datasets for the inference of global regulatory networks,” BMC Bioinformatics, 7, 280.10.1186/1471-2105-7-280Search in Google Scholar PubMed PubMed Central
Savage, R. S., Z. Ghahramani, J. E. Griffin, P. Kirk and D. L. Wild (2013): “Identifying cancer subtypes in glioblastoma by combining genomic, transcriptomic and epigenomic data,” arXiv preprint arXiv:1304.3577.Search in Google Scholar
Schimek, M. G., E. Budinská, K. G. Kugler, V. Švendová, J. Ding and S. Lin (2015): “TopKLists: a comprehensive R package for statistical inference, stochastic aggregation, and visualization of multiple omics ranked lists,” Stat. Appl. Genet. Mol. Biol., 14, 311–316.Search in Google Scholar
Shen, R., A. B. Olshen and M. Ladanyi (2009): “Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis,” Bioinformatics, 25, 2906–2912.10.1093/bioinformatics/btp543Search in Google Scholar PubMed PubMed Central
Stark, C., B.-J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz and M. Tyers (2006): “BioGRID: a general repository for interaction datasets,” Nucleic Acids Res., 34(suppl 1), D535–D539.Search in Google Scholar
Thorne, T., P. Fratta, M. G. Hanna, A. Cortese, V. Plagnol, E. M. Fisher and M. P. H. Stumpf (2013): “Graphical modelling of molecular networks underlying sporadic inclusion body myositis,” Mol. BioSyst., 9, 1736–1742.Search in Google Scholar
Troyanskaya, O. G., K. Dolinski, A. B. Owen, R. B. Altman and D. Botstein (2003): “A bayesian framework for combining heterogeneous data sources for gene function prediction (in saccharomyces cerevisiae),” Proc. Natl. Acad. Sci. USA, 100, 8348–8353.10.1073/pnas.0832373100Search in Google Scholar PubMed PubMed Central
Wang, B., A. M. Mezlini, F. Demir, M. Fiume, Z. Tu, M. Brudno, B. Haibe-Kains and A. Goldenberg (2014): “Similarity network fusion for aggregating data types on a genomic scale,” Nat. Methods, 11, 333–337.Search in Google Scholar
Wu, M., X. Li, C.-K. Kwoh and S.-K. Ng (2009): “A core-attachment based method to detect protein complexes in ppi networks,” BMC Bioinformatics, 10, 169.10.1186/1471-2105-10-169Search in Google Scholar PubMed PubMed Central
Yuan, Y., R. S. Savage and F. Markowetz (2011): “Patient-specific data fusion defines prognostic cancer subtypes,” PLoS Comput. Biol., 7, e1002227.Search in Google Scholar
Zhang, X.-x., Q.-h. Xiao, B. Li, S. Hu, H.-j. Xiong and B.-h. Zhao (2015): “Overlap maximum matching ratio (ommr): a new measure to evaluate overlaps of essential modules,” Frontiers of Information Technology & Electronic Engineering, 16, 293–300.10.1631/FITEE.1400282Search in Google Scholar
Supplemental Material:
The online version of this article (DOI: 10.1515/sagmb-2016-0016) offers supplementary material, available to authorized users.
©2016, Michael P.H. Stumpf et al., published by De Gruyter
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.