We assess the accuracy of various state-of-the-art statistics and machine learning methods for reconstructing gene and protein regulatory networks in the context of circadian regulation. Our study draws on the increasing availability of gene expression and protein concentration time series for key circadian clock components in Arabidopsis thaliana. In addition, gene expression and protein concentration time series are simulated from a recently published regulatory network of the circadian clock in A. thaliana, in which protein and gene interactions are described by a Markov jump process based on Michaelis-Menten kinetics. We closely follow recent experimental protocols, including the entrainment of seedlings to different light-dark cycles and the knock-out of various key regulatory genes. Our study provides relative network reconstruction accuracy scores for a critical comparative performance evaluation, and sheds light on a series of highly relevant questions: it quantifies the influence of systematically missing values related to unknown protein concentrations and mRNA transcription rates, it investigates the dependence of the performance on the network topology and the degree of recurrency, it provides deeper insight into when and why non-linear methods fail to outperform linear ones, it offers improved guidelines on parameter settings in different inference procedures, and it suggests new hypotheses about the structure of the central circadian gene regulatory network in A. thaliana.
Note that the sets of potential regulators are defined for each gene g specifically. That is, the potential regulators for two target variables yg and can be different, e.g., if certain (biologically-motivated) restrictions are imposed.
For consistency with the fundamental equation of transcription, equation (1), we will enforce that each regulator set πg for yg contains the concentration xg of g, symbolically xg∈πg.
Note that vector x·,t includes every available regulator without any dependency on the target gene g.
Note that the repeated bi-partitioning of the genes into targets and putative regulators renders Glasso equivalent to Lasso, as discussed on page 4 of Friedman et al. (2008). Lasso will be discussed in Section 2.3.
We set: ν=0.005, Aδ=2, and Bδ=0.2, as in Grzegorczyk and Husmeier (2012).
We note that the coupled variant of the non-homogeneous Bayesian regression model cannot be represented properly as a graphical model, as the regression parameter vectors are sequentially coupled among adjacent segments via equations (21–22).
For each yg we apply exactly the same permutation to order the realizations of the explanatory variables (covariates) and thereby ensure that the segment-specific design matrices are built properly.
In our study we follow Rogers and Girolami (2005) and use a slightly modified version of the fast marginal likelihood algorithm from Tipping et al. (2003) for optimization.
We use the authors’ terminology, although the model is not a proper Bayesian network.
More precisely, is obtained by deleting the element corresponding to the target variable yg,t in μg,h, and is obtained by deleting the row and the column corresponding to yg,t in Σg,h.
Note that the abbreviation “BGe” was introduced by Geiger and Heckerman (1994) and stands for Bayesian metric for Gaussian networks having score equivalence; see Geiger and Heckerman (1994) for more details.
We turned off the translation of those proteins contributing to interactions we like to surpress.
In the model equations defined by Guerriero et al. (2012) the concentration of P only appears in a product with the binary light indicator L, where the light variable L is equal to zero in the absence of light.
For the Bayesian methods this can be enforced by setting the prior P(πg) to zero for all πg with xg∉πg.
Matlab software for Disciplined Convex Programming:http://cvxr.com/cvx/.
Note that the maximal number of hidden nodes n is restricted by the number of regulators, Gg. In our simulation study we analyzed various data sets, and we employed the lowest Gg as an upper bound on the number of hidden nodes n.
In our study we initialized the EM-algorithm with allocations obtained by the k-means cluster algorithm. Thereby the initial 𝕂g centers of the k-means algorithms were sampled from a multivariate Gaussian N(μ, I) distribution, where I is the identity matrix and μ is a random expectation vector with entries sampled independently from continuous uniform distributions on the interval [–1, +1]. To avoid that the EM-algorithm is initialized with allocations that possess unoccupied (empty) mixture components, we re-sampled the initial centers and re-ran the k-means algorithm whenever we obtained k-means outputs with empty components.
Loosely speaking, this setting (μ0=0 and T0=I) reflects our “prior belief” that all domain variables, i.e., the potential regulators and the target variable, are i.i.d. standard normally distributed.
The sensitivity is the proportion of true interactions that have been detected, the specificity is the proportion of non-interactions that have been avoided.
The work described in the present article is part of the TiMet project on linking the circadian clock to metabolism in plants. TiMet is a collaborative project (Grant Agreement 245143) funded by the European Commission FP7, in response to call FP7-KBBE-2009-3. Parts of the work were done while M.G. was supported by the German Research Foundation (DFG), research grant GR3853/1-1. A.A. is supported by the BBSRC and the TiMet project. We are grateful to Andrew Millar, Alexander Pokhilko, and V. Anne Smith for helpful discussions.
A.1 ANOVA for method evaluation
Figure 15 shows some of the raw results from our study. Clear patterns are not immediately discernible by visual inspection, which motivates the ANOVA method described in Section 4.8. To ascertain that the underlying assumptions of the ANOVA model are satisfied, we carried out a standard residual analysis. The objective is to test whether the residuals are independent and identically (i.i.d) normally distributed. A violation of this assumption would indicate that some structure in the data has not been captured by the decomposition of equation (56), and that e.g. higher-order interaction terms would have to be included.
Figure 16, left panel, shows a quantile-quantile (QQ) plot of the residuals to test the assumption of a normal distribution. The straight line confirms that there is good agreement with this assumption overall, with only minor deviations for the lowest and highest quantiles, suggesting that the residual distribution is slightly heavier-tailed.
Figure 16, right panel, shows a scatter plot of all residuals against the corresponding values fitted with the ANOVA model of equation (56). For low values, the spread of the residuals seems to become slightly tighter, but this effect is weak, and overall there is no clearly discernible pattern of any dependence between the residual distribution and the fitted value.
Figure 17 shows histograms of the residuals for all possible values of the four main effects in equation (56). There are no obvious deviations from a uniform pattern, and the results are consistent with the assumption that the distributions of the residuals are identical and independent of the main effects.
These diagnostics thus do not indicate any clear violation of the model assumptions and suggest that the ANOVA model proposed in Section 4.8 provides an adequate mechanism for extracting trends and patterns from our simulations studies.
A.2 Data: comparison between Biopepa and qRT-PCR profiles, and assessing the effect of the log transformation
The right panel of Figure 18 shows a QQ-plot to compare the distribution of mRNA concentrations between the realistic data (Section 3.1) and the qRT-PCR profiles from the Timet project (Section 3.2). There is only a mild deviation from an overall linear dependence, which suggests that the specific technical aspects of qRT-PCR measurements, described, e.g., in Bengtsson et al. (2008), do not require a major modification of our stochastic-process model of transcriptional regulation, as reviewed in Table 3 and implemented in Biopepa. This further suggests that the patterns and trends observed in the comparative evaluation based on our realistic data are indicative of results for real qRT-PCR data, and can be used for providing estimates of expected prediction accuracy and guiding decisions on model choice.
This in particular concerns the decision of whether or not to log-transform the data. Inserting log-transformed concentrations, into the fundamental equation of transcriptional regulation, equation (1), and applying the chain rule of differential calculus yields:
It is seen that in comparison with equation (1), the log transformation has led to a more complicated functional dependence, not only by including an extra multiplicative factor on the right-hand side, but also by making fg a function of which increases the amount of non-linearity in the system. This suggests that for network reconstruction, a log-transformation of the data will be counterproductive.
To test this conjecture, we have repeated the network reconstruction on the realistic data after subjecting them to a log transformation. The results are summarised in Figure 18, which displays the differences in the form of Blant-Altman plots for the Lasso (center panel) and HBR (left panel) methods. The average AUROC score difference is 0.06 in favor of the original, non-log transformed data. The distribution of paired differences shows that the proportion of negative differences, where the network reconstruction has deteriorated as a consequence of the log transformation, is significantly higher than the proportion of positive differences. This confirms our conjecture that log-transforming the mRNA concentrations is counterproductive. Due to the reasoning in the first paragraph, that patterns observed for the realistic data are indicative of results to be expected for real qRT-PCR data, we have therefore elected not to log-transform the Timet data.
Ahmed, A. and E. P. Xing (2009): “Recovering time-varying networks of dependencies in social and biological studies,” Proc. Natl. Acad. Sci., 106, 11878–11883.Search in Google Scholar
Äijö, T. and H. Lähdesmäki (2009): “Learning gene regulatory networks from gene expression measurements using non-parametric molecular kinetics,” Bioinformatics, 25, 2937–2944.Search in Google Scholar
Andrieu, C. and A. Doucet (1999): “Joint Bayesian model selection and estimation of noisy sinusoids via reversible jump MCMC,” IEEE T Signal Proces., 47, 2667–2676.Search in Google Scholar
Barenco, M., D. Tomescu, D. Brewer, R. Callard, J. Stark, and M. Hubank (2006): “Ranked prediction of p53 targets using hidden variable dynamic modeling,” Genome Biology, 7, R25.Search in Google Scholar
Beal, M., F. Falciani, Z. Ghahramani, C. Rangel, and D. Wild (2005): “A Bayesian approach to reconstructing genetic regulatory networks with hidden factors,” Bioinformatics, 21, 349–356.Search in Google Scholar
Beal, M. (2003): Variational Algorithms for Approximate Bayesian Inference, Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London, UK.Search in Google Scholar
Bengtsson, M., M. Hemberg, P. Rorsman, and A. Ståhlberg (2008): “Quantification of mRNA in single cells and modeling of RT-qPCR induced noise,” BMC Molecular Biology, 9, 63.Search in Google Scholar
Bishop, C. M. (2006): Pattern Recognition and Machine Learning, Singapore: Springer.Search in Google Scholar
Brandt, S. (1999): Data Analysis: Statistical and Computational Methods for Scientists and Engineers, New York, USA: Springer.Search in Google Scholar
Brooks, S. and A. Gelman (1999): “General methods for monitoring convergence of iterative simulations,” J. Comput. Graph. Stat., 7, 434–455.Search in Google Scholar
Butte, A. J. and I. S. Kohane (2000): “Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements,” in Pacific Symposium on Biocomputing, volume 5, 418–429.Search in Google Scholar
Ciocchetta, F. and J. Hillston (2009): “Bio-PEPA: A framework for the modeling and analysis of biological systems,” Theor. Comput. Sci., 410, 3065–3084.Search in Google Scholar
Davies, J. and M. Goadrich (2006): “The relationship between Precision-Recall and ROC curves,” Proceedings of the 23rd International Conference on Machine Learning, 233–240.Search in Google Scholar
Edwards, K., O. Akman, K. Knox, P. Lumsden, A. Thomson, P. Brown, A. Pokhilko, L. Kozma-Bognar, F. Nagy, D. Rand, A. J. Millar. (2010): “Quantitative analysis of regulatory flexibility under changing environmental conditions,” Mol. Syst. Biol., 6, 424.Search in Google Scholar
Feugier, F. and A. Satake (2012): “Dynamical feedback between circadian clock and sucrose availability explains adaptive response of starch metabolism to various photoperiods,” Front. Plant Sci., 3.Search in Google Scholar
Friedman, J., T. Hastie, and R. Tibshirani (2008): “Sparse inverse covariance estimation with the graphical Lasso,” Biostatistics, 9, 432–441.Search in Google Scholar
Friedman, J., T. Hastie, and R. Tibshirani (2010): “Regularization paths for generalized linear models via coordinate descent,” J. Stat. Softw., 33, 1–22.Search in Google Scholar
Friedman, N., M. Linial, I. Nachman, and D. Pe’er (2000): “Using Bayesian networks to analyze expression data,” J. Comput. Biol., 7, 601–620.Search in Google Scholar
Geiger, D. and D. Heckerman (1994): “Learning gaussian networks,” in International Conference on Uncertainty in Artificial Intelligence, Seattle, WA: Morgan Kaufmann Publishers, 235–243.Search in Google Scholar
Gelman, A. and D. Rubin (1992): “Inference from iterative simulation using multiple sequences,” Stat. Sci., 7, 457–472.Search in Google Scholar
Gillespie, D. (1977): “Exact stochastic simulation of coupled chemical reactions,” J. Phys. Chem., 81, 2340–2361.Search in Google Scholar
Grzegorczyk, M. and D. Husmeier (2012): “A non-homogeneous dynamic Bayesian network with sequentially coupled interaction parameters for applications in systems and synthetic biology,” Stat. Appl. Genet. Mol. Biol. (SAGMB), 11, article 7.Search in Google Scholar
Grzegorczyk, M. and D. Husmeier (2013): “Regularization of non-homogeneous dynamic Bayesian networks with global information-coupling based on hierarchical Bayesian models,” Mach. Learn., 91, 1–50.Search in Google Scholar
Guerriero, M., A. Pokhilko, A. Fernández, K. Halliday, A. Millar, and J. Hillston (2012): “Stochastic properties of the plant circadian clock,” J. R. Soc. Interface, 9, 744–756.Search in Google Scholar
Hanley, J. A. and B. J. McNeil (1982): “The meaning and use of the area under a receiver operating characteristic (ROC) curve,” Radiology, 143, 29–36.Search in Google Scholar
Hastie, T., R. Tibshirani, and J. J. H. Friedman (2001): The Elements of Statistical Learning, volume 1, New York: Springer.Search in Google Scholar
Herrero, E., E. Kolmos, N. Bujdoso, Y. Yuan, M. Wang, M. C. Berns, H. Uhlworm, G. Coupland, R. Saini, M. Jaskolski, A. Webb, J. Gonçalves, S. J. Davis. (2012): “EARLY FLOWERING4 recruitment of EARLY FLOWERING3 in the nucleus sustains the Arabidopsis circadian clock,” Plant Cell, 24, 428–443.Search in Google Scholar
Husmeier, D. (1999): Neural Networks for Conditional Probability Estimation: Forecasting Beyond Point Predictions, Perspectives in Neural Computing, London: Springer.Search in Google Scholar
Husmeier, D. (2003): “Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks,” Bioinformatics, 19, 2271–2282.Search in Google Scholar
Kalaitzis, A. A., A. Honkela, P. Gao, and N. D. Lawrence (2013): gptk: Gaussian processes tool-kit, URL , R package version 1.06.Search in Google Scholar
Ko, Y., C. Zhai, and S. Rodriguez-Zas (2007): “Inference of gene pathways using Gaussian mixture models,” in International Conference on Bioinformatics and Biomedicine, Fremont, CA, 362–367.Search in Google Scholar
Ko, Y., C. Zhai, and S. Rodriguez-Zas (2009): “Inference of gene pathways using mixture Bayesian networks,” BMC Syst. Biol., 3, 54.Search in Google Scholar
Kolmos, E., M. Nowak, M. Werner, K. Fischer, G. Schwarz, S. Mathews, H. Schoof, F. Nagy, J. M. Bujnicki, and S. J. Davis (2009): “Integrating ELF4 into the circadian system through combined structural and functional studies,” HFSP J, 3, 350–366.Search in Google Scholar
Lawrence, N. D., M. Girolami, M. Rattray, and G. Sanguinetti (2010): Learning and inference in computational systems biology, Cambridge, MA: MIT Press Cambridge.Search in Google Scholar
Lèbre, S., J. Becq, F. Devaux, G. Lelandais, and M. Stumpf (2010): “Statistical inference of the time-varying structure of gene-regulation networks,” BMC Syst. Biol., 4.Search in Google Scholar
Locke, J. C. W., M. M. Southern, L. Kozma-Bognár, V. Hibberd, P. E. Brown, M. S. Turner, and A. J. Millar (2005): “Extension of a genetic network model by iterative experimentation and mathematical analysis,” Mol. Syst. Biol., 1.Search in Google Scholar
Locke, J. C. W., L. Kozma-Bognár, P. D. Gould, B. Fehér, E. Kevei, F. Nagy, M. S. Turner, A. Hall, and A. J. Millar (2006): “Experimental validation of a predicted feedback loop in the multi-oscillator clock of Arabidopsis thaliana,” Mol. Syst. Biol., 2.Search in Google Scholar
MacKay, D. J. (1992): “Bayesian interpolation,” Neural Comput., 4, 415–447.Search in Google Scholar
Margolin, A. A., I. Nemenman, K. Basso, C. Wiggins, G. Stolovitzky, R. Dalla-Favera, and A. Califano (2006): “ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context,” BMC Bioinformatics, 7.Search in Google Scholar
Marin, J.-M. and C. P. Robert (2007): Bayesian core: A practical approach to computational Bayesian statistics, New York, USA: Springer.Search in Google Scholar
Meyer, P. E., F. Lafitte, and G. Bontempi (2008): “minet: A R/Bioconductor Package for Inferring Large Transcriptional Networks Using Mutual Information,” BMC Bioinformatics, 9.Search in Google Scholar
Morrissey, E. R., M. A. Juárez, K. J. Denby, and N. J. Burroughs (2011): “Inferring the time-invariant topology of a nonlinear sparse gene regulatory network using fully Bayesian spline autoregression,” Biostatistics, 12, 682–694.Search in Google Scholar
Murphy, K. P. (2012): Machine learning: a probabilistic perspective, Cambridge, MA: MIT Press.Search in Google Scholar
Nabney, I. (2002): NETLAB: algorithms for pattern recognition, Springer.Search in Google Scholar
Neuneier, R., F. Hergert, W. Finnoff, and D. Ormoneit (1994): “Estimation of conditional densities: a comparison of neural network approaches,” in International Conference on Artificial Neural Networks, National Cheng Kung University, Taiwan: Springer, 689–692.Search in Google Scholar
Opgen-Rhein, R. and K. Strimmer (2007): “From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data,” BMC Syst. Biol., 1.Search in Google Scholar
Pokhilko, A., A. Fernández, K. Edwards, M. Southern, K. Halliday, and A. Millar (2012): “The clock gene circuit in Arabidopsis includes a repressilator with additional feedback loops,” Mol. Syst. Biol., 8, 574.Search in Google Scholar
Pokhilko, A., S. Hodge, K. Stratford, K. Knox, K. Edwards, A. Thomson, T. Mizuno, and A. Millar (2010): “Data assimilation constrains new connections and components in a complex, eukaryotic circadian clock model,” Mol. Syst. Biol., 6.Search in Google Scholar
Pokhilko, A., P. Mas, A. J. Millar, et al. (2013): “Modeling the widespread effects of TOC1 signaling on the plant circadian clock and its outputs,” BMC Syst. Biol., 7, 1–12.Search in Google Scholar
Rasmussen, C. E., R. M. Neal, G. E. Hinton, D. van Camp, M. Revow, Z. Ghahramani, R. Kustra, and R. Tibshirani (1996): “The DELVE manual,” URL .Search in Google Scholar
Rasmussen, C. E. (1996): Evaluation of Gaussian processes and other methods for non-linear regression, Ph.D. thesis, Citeseer.Search in Google Scholar
Rasmussen, C. and C. Williams (2006): Gaussian processes for machine learning, volume 1, MA: MIT press Cambridge.Search in Google Scholar
Rogers, S. and M. Girolami (2005): “A Bayesian regression approach to the inference of regulatory networks from gene expression data,” Bioinformatics, 21, 3131–3137.Search in Google Scholar
Schäfer, J. and K. Strimmer (2005): “A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics,” Stat. Appl. Genet. Mol. Biol., 4.Search in Google Scholar
Smith, M. and R. Kohn (1996): “Nonparametric regression using Bayesian variable selection,” J Econometrics, 75, 317–343.Search in Google Scholar
Solak, E., R. Murray-Smith, W. E. Leithead, D. J. Leith, and C. E. Rasmussen (2002): “Derivative observations in Gaussian process models of dynamic systems,” Advances in Neural Information Processing Systems, MIT Press: Vancouver, Canada, 1033–1040.Search in Google Scholar
Tibshirani, R. (1995): “Regression shrinkage and selection via the Lasso,” J. R. Stat. Soc. Series B, 58, 267–288.Search in Google Scholar
TiMet (2014): “The TiMet Project - Linking the clock to metabolism: URL .Search in Google Scholar
Tipping, M. and A. Faul (2003): “Fast marginal likelihood maximisation for sparse Bayesian models,” in Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Key West, FL, 1, 3–6.Search in Google Scholar
Tipping, M. (2001): “Spare Bayesian learning and the relevance vector machine,” Journal of Machine Learning Research, 1, 211–244.Search in Google Scholar
Vyshemirsky, V. and M. Girolami (2008): “Bayesian ranking of biochemical system models,” Bioinformatics, 24, 833–839.Search in Google Scholar
Weirauch, M. T., A. Cote, R. Norel, M. Annala, Y. Zhao, T. R. Riley, J. Saez-Rodriguez, T. Cokelaer, A. Vedenko, S. Talukder, DREAM5 Consortium, Bussemaker, H. J., Morris, Q. D., Bulyk, M. L., Stolvitzky, G, and T. R. Hughes (2013): “Evaluation of methods for modeling transcription factor sequence specificity,” Nat. Biotechnol., 31, 126–134.Search in Google Scholar
Werhli, A. V., M. Grzegorczyk, and D. Husmeier (2006): “Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical Gaussian models and Bayesian networks,” Bioinformatics, 22, 2523–2531.Search in Google Scholar
Wilkinson, D. J. (2009): “Stochastic modeling for quantitative description of heterogeneous biological systems,” Nat. Rev. Genet., 10, 122–133.Search in Google Scholar
Wilkinson, D. (2011): Stochastic modeling for systems biology, volume 44, Taylor & Francis, Boca Raton, FL: CRC press.Search in Google Scholar
Zoppoli, P., S. Morganella, and M. Ceccarelli (2010): “TimeDelay-ARACNE: Reverse engineering of gene networks from time-course data by ab information theoretic approach,” BMC Bioinformatics, 11.Search in Google Scholar
Zou, H. and T. Hastie (2005): “Regularization and variable selection via the Elastic Net,” J. R. Stat. Soc. Series B, 67, 301–320.Search in Google Scholar
©2014 by Walter de Gruyter Berlin/Boston