Skip to content
BY 4.0 license Open Access Published by De Gruyter April 7, 2021

Parameter inference with analytical propagators for stochastic models of autoregulated gene expression

Frits Veerman, Nikola Popović and Carsten Marr

Abstract

Stochastic gene expression in regulatory networks is conventionally modelled via the chemical master equation (CME). As explicit solutions to the CME, in the form of so-called propagators, are oftentimes not readily available, various approximations have been proposed. A recently developed analytical method is based on a separation of time scales that assumes significant differences in the lifetimes of mRNA and protein in the network, allowing for the efficient approximation of propagators from asymptotic expansions for the corresponding generating functions. Here, we showcase the applicability of that method to simulated data from a ‘telegraph’ model for gene expression that is extended with an autoregulatory mechanism. We demonstrate that the resulting approximate propagators can be applied successfully for parameter inference in the non-regulated model; moreover, we show that, in the extended autoregulated model, autoactivation or autorepression may be refuted under certain assumptions on the model parameters. These results indicate that our approach may allow for successful parameter inference and model identification from longitudinal single cell data.

1 Introduction and background

Gene expression in regulatory networks is an inherently stochastic process [1]. Mathematical models typically take the form of a chemical master equation (CME), which describes the temporal evolution of the probabilities of observing specific states in the network [2]. Recent advances in single-cell fluorescence microscopy [37] have allowed for the generation of experimental longitudinal data, whereby the fluorescence intensity of mRNA or protein abundances in single cells is measured. While most common models assume the availability of protein abundance data, abundances of mRNA may equally be of interest, depending on the model [8]. Here, we focus on abundances of protein, which we assume to be measured at regular sampling intervals Δt. A typical data set, denoted by Q, thus consists of protein abundances n i at N + 1 different points in time; see Figure 1A. We can group these abundances into transitions n i n i+1; cf. Figure 1B. A model-derived propagator P n i + 1 | n i ( Δ t , Θ ) allows for the calculation of the probabilities of such transitions for some set of model parameters Θ. Summing over all these probabilities for all N observed transitions, we can calculate the log-likelihood L(Θ) of that particular parameter set as

(1.1) L ( Θ ) = i = 0 N 1 log P n i + 1 | n i ( Δ t , Θ ) ,

which can be evaluated over a range of values for the model parameters to yield a ‘log-likelihood landscape’, the maximum of which corresponds to the most likely parameter set Θ subject to the measured data set Q. Assessing the log-likelihood of a model by evaluating the associated propagator for the observed transitions in a time-lapse experiment is a feasible, established approach that has been successfully applied previously [6], [9], [10]. Due to the complex nature of the underlying regulatory networks, explicit expressions for P n i + 1 | n i are difficult to obtain in general. Hence, a variety of approximations have been proposed, which can be either numerical [9], [11] or analytical [12], [13], to cite but a few examples. Here, we apply the analytical method recently developed by the current authors [14], which was based on ideas presented by Popović et al. [12], to obtain fully time-dependent approximate propagators; an outline of the method is given in Section 2.

Figure 1: 
(A) Simulated time series of protein abundance n, with measurements at times t

i
 and sampling interval Δt. (B) Histogram of the frequency k

j
 of transitions 





(



n


0


→
n

)



j




${({n}_{0}\to n)}_{j}$



, inferred from a longer time series with 100 transitions.

Figure 1:

(A) Simulated time series of protein abundance n, with measurements at times t i and sampling interval Δt. (B) Histogram of the frequency k j of transitions ( n 0 n ) j , inferred from a longer time series with 100 transitions.

Our aim in the present article is to demonstrate the applicability of these propagators, as well as to evaluate their performance in the context of parameter inference for synthetic data. Specifically, we showcase the resulting inference procedure for a family of stochastic gene expression models. First, in Section 3, we consider a model that incorporates DNA ‘on’/‘off’ states (‘telegraph model’); see also the work of Raj et al. [15] and Shahrezaei and Swain [10]. Subsequently, in Section 4, that model is extended with an autoregulatory mechanism, whereby protein influences its own production through an autocatalytic reaction. In Section 5, we summarize our results and present an outlook to future research; finally, in Appendix A, we collate the analytical formulae that underly our inference procedure for the family of models showcased here.

2 Method

2.1 Calculation of propagators

Our method [14] is based on an analytical approximation of the probability generating function that is introduced for analysing the CME corresponding to the given gene expression model. Propagators can be calculated from the generating function via the Cauchy integral formula, which implies

(2.1) P n i + 1 | n i ( Δ t , Θ ) = 1 2 π i γ F ( z ; Δ t , n i , Θ ) z n i + 1 + 1 d z ;

here, F(z; Δt, n i , Θ) is the generating function of the (complex) variable z, which additionally depends on the sampling interval Δt, the protein abundance n i , and the model parameter set Θ. The integration contour γ is a closed contour in the complex plane around z = 0. The choice of contour is arbitrary; however, it can have a significant effect on computation times and on the accuracy of the resulting integrals [16]. Here, we choose γ to be a regular 150-sided polygon approximating a circle of radius 0.8 that is centred at the origin in the complex plane, which results in a ‘hybrid analytical–numerical’ procedure for the evaluation of P n i + 1 | n i .

2.2 Parameter inference

The parameter inference procedure proposed here can be divided into the following steps:

2.2.1 Data binning

The simulated data Q is presented as a time series {n i }, 0 ≤ iN, which yields N transitions n i n i+1. Generically, some of these transitions occur more than once. For computational efficiency, we bin the data accordingly to create a binned data set Q ̂ = k j , ( n 0 n ) j , with 1 j N ̂ for N ̂ N , where k j is the frequency of the transition ( n 0 n ) j ; see also Figure 1. Here, N ̂ denotes the number of different transitions observed in the data, that is, the size of the binned data set Q ̂ . We emphasize that binning can substantially accelerate parameter inference, in particular for near-stationary processes.

2.2.2 Marginalization

Frequently, some of the involved species in a model are not observed, and hence have to be marginalized over. In the models discussed in Sections 3 and 4, we assume that protein is measured, while mRNA remains unobserved. Marginalization over unobserved species is usually carried out on the transition probabilities in (2.1). However, since the marginalization procedure is linear, it commutes with the Cauchy integral. Introducing the linear ‘marginalization operator’ M , we may write

(2.2) M P n i + 1 | n i ( Δ t , Θ ) = 1 2 π i γ M F ( z ; Δ t , n i , Θ ) z n i + 1 + 1 d z ,

where M now acts on the generating function F. Therefore, given the analytical approximation for F resulting from our method [14], we define

(2.3) F ̂ ( z ; Δ t , n i , Θ ̂ ) = M F ( z ; Δ t , n i , Θ ) ,

where Θ ̂ Θ is the subset of parameters that remain after the marginalization procedure has been applied. Note that F ̂ is still a fully analytical, general expression which depends on the as yet unspecified values of its arguments.

2.2.3 Evaluation

We choose a set Θ ̂ 0 of numerical values for the parameters in Θ ̂ . Moreover, we specify the integration contour γ, which we discretize as described in Section 2.2.2 to approximate the Cauchy integral in (2.1) by a finite sum. Suppose that the contour γ is discretized as {ζ(l)}, with 0 ≤ lL and ζ(0) = ζ(L); then, the integral of a function G along γ is approximated as

(2.4) γ G ( z ) d z l = 0 L 1 G ( ζ ( l ) ) Δ ζ ( l ) , with  Δ ζ ( l ) = ζ ( l + 1 ) ζ ( l ) .

Now, for every transition ( n 0 n ) j in the binned data set Q ̂ , we evaluate F ̂ , as given in (2.3), for the chosen parameter values Θ ̂ 0 along the discretized contour. We hence obtain the array

(2.5) 1 2 π i F ̂ ( ζ ( l ) ; Δ t , ( n 0 ) j , Θ ̂ 0 ) ζ ( l ) ( n ) j + 1 Δ ζ ( l ) for 0 l L 1  and  1 j N ̂ ,

which we sum over l to find

(2.6) p j ( Θ ̂ 0 , Δ t ) = l = 0 L 1 1 2 π i F ̂ ( ζ ( l ) ; Δ t , ( n 0 ) j , Θ ̂ 0 ) ζ ( l ) ( n ) j + 1 Δ ζ ( l )

as the approximate value of the propagator for the transition ( n 0 n ) j .

2.2.4 Calculation of the log-likelihood

To calculate the log-likelihood of the parameter subset Θ ̂ 0 , we substitute the approximate propagators p j , as defined in (2.6), into (1.1) to obtain

(2.7) L ( Θ ̂ 0 ) = j = 1 N ̂ k j log p j ( Θ ̂ , Δ t ) .

3 Showcase 1: the telegraph model

To demonstrate our parameter inference procedure, we consider a stochastic gene expression model that incorporates DNA ‘on’/‘off’ states (‘telegraph model’) [10], [15]:

(3.1) D k 1 k 0 D * ( DNA activation/deactivation ) , D * ν 0 D * + M ( transcription of DNA to mRNA ) , M ν 1 M + P ( translation of mRNA to protein ) , M d 0 ( decay of mRNA ) , P d 1 ( decay of protein ) .

In recent work [14], we presented an analytical method for obtaining explicit, general, time-dependent expressions for the generating function associated to the CME that arises from the model in (3.1). A pivotal element of the application of that method to (3.1) is the assumption that the protein decay rate d 1 is notably smaller than the decay rate d 0 of mRNA, which implies that the parameter ε d 1 d 0 is small; hence, the associated generating function is approximated to a certain order O = k, corresponding with a theoretical accuracy that is proportional to ɛ k . For more details on the resulting approximation, we refer to Appendix A.

To obtain synthetic data, we simulate the model in (3.1) using Gillespie’s stochastic simulation algorithm (SSA) [17], for fixed values of the (rescaled) parameters

(3.2) κ 0 k 0 d 1 = 1.3 , κ 1 k 1 d 1 = 1.2 , λ ν 0 d 1 = 3.3 , μ ν 1 d 0 = 2.85 , ε d 1 d 0 = 0.1 , and d 1 = 1

on the time interval 0 ≤ t ≤ 10, and we measure the protein abundance n with a fixed sampling interval Δt. As our method assumes that Δt is of order ɛ, cf. again Appendix A, we set Δt = ɛ = 0.1, which yields N = 100 transitions. Finally, we consider random initial states, with mRNA and protein numbers chosen uniformly between 0 and 10, and we assume DNA to be in the ‘on’ state with probability κ 0 κ 0 + κ 1 . Based on the simulated measurement data, we perform the parameter inference procedure described in Section 2. As the data consists of protein abundances only, and as propagators for the model in (3.1) depend on abundances of both mRNA and protein, we marginalize over mRNA, assuming a steady-state distribution reported by Raj et al. [15, Supporting Information, Protocol S1, Eq. (1)]; that distribution coincides with the steady-state limit of the associated fully time-dependent distribution considered by Veerman et al. [14]. We assume that the values of κ 0, κ 1, ɛ, and d 1 are known, and calculate the log-likelihood in (2.7) for varying λ and μ. We scan these two parameters in the range 1 0 3 λ 1 0 3 , 1 0 3 μ 1 0 2 , using a logarithmically spaced grid of 50 × 40 grid points. Figure 2 shows the resulting log-likelihood landscapes and, in particular, a comparison of the performance of the leading (zeroth) order approximation for the generating function, see Figure 2A, with that of the first order approximation in Figure 2B. We emphasize that our choice of λ and μ as the parameters to be inferred is not guided in any way by our analytical method – indeed, any other choice would have served our purpose, under the assumption that ɛ is sufficiently small. Rather, we chose λ and μ for illustrative purposes.

Figure 2: 
Log-likelihood landscapes inferred from a simulation of the telegraph model in (3.1) with N = 100 transitions and parameter values as in (3.2): true value (cross) versus maximum log-likelihood estimate (MLE; dot). (A) Leading (zeroth) order approximation. (B) First order approximation.

Figure 2:

Log-likelihood landscapes inferred from a simulation of the telegraph model in (3.1) with N = 100 transitions and parameter values as in (3.2): true value (cross) versus maximum log-likelihood estimate (MLE; dot). (A) Leading (zeroth) order approximation. (B) First order approximation.

Moreover, it is important to note that, in the analytical derivation of the propagators used to produce Figure 2, all model parameters are assumed to be of order ɛ 0 = 1 [14, Assumption 3.2], which is reflected in the numerical values in (3.2). In particular, it follows that both the ratio of the transcription and translation rates ν 0 ν 1 = ε λ μ and the ratio of the transcription and mRNA decay rates ν 0 d 0 = ε λ are assumed to be small. However, in Figure 2, the ranges over which λ and μ are scanned significantly exceed these estimates without causing apparent issues for our inference procedure. That could be taken as a sign that, while our analytical propagators were derived under certain assumptions on the order of model parameters, the resulting expressions may in fact be valid over a wider parameter range. On the other hand, it is worthwhile to note that noise-induced bistability is observed in a similar gene expression model if the above assumption on λ is abandoned, as considered with an alternative analytical approach in [18].

To quantify the performance of the method developed by Veerman et al. [14] for parameter inference, we compare four different scenarios:

  1. Parameter values as in (3.2), with sampling interval Δt = ɛ = 0.1 on the time interval 0 ≤ t ≤ 10, corresponding to N = 100 transitions, which is the original setup that yields the results shown in Figure 2.

  2. As in (a), with the time interval increased to 0 ≤ t ≤ 100, which yields N = 1000 transitions.

  3. As in (a), with ɛ = 0.01: the sampling interval is decreased accordingly to Δt = ɛ = 0.01; measurements are taken on the time interval 0 ≤ t ≤ 1, which yields N = 100 transitions.

  4. As in (a), with μ = 28.5.

For each scenario, we infer the most likely values of the parameters λ and μ, for increasing approximation order O. The inferred values of λ and μ are compared to the ‘true’ values λ true and μ true, where we consider relative errors to quantify the performance of our inference procedure. The results of that comparison are shown in Figure 3. The accuracy of inference for λ clearly increases when the approximation order O is increased from 0 to 1; the increase in accuracy from O = 1 to O = 2 is obfuscated by grid size effects. A ten-fold increase in the number of transitions (b) increases the accuracy of the leading order approximation, while a ten-fold increase in the value of μ true (d) decreases the accuracy of the leading order approximation. For μ, there is no noticeable increase in accuracy with the approximation order O within the parameter grid used, which could indicate that the number of transitions remains the dominant limiting factor. However, the accuracy of inference for μ increases overall when N is increased (b), the small parameter ɛ is decreased (c), or the value of μ true is increased (d). An alternative explanation for the apparent insensitivity of the error in μ to the approximation order can be found in the underlying mRNA marginalization procedure; however, one would expect that marginalization to negatively influence the sensitivity of transcription (λ), rather than that of translation (μ).

Figure 3: 
Relative error 




Δ


r



(

x

)

≔


x
−


x


true






x


true






${{\Delta}}_{r}(x){:=}\frac{x-{x}_{\text{true}}}{{x}_{\text{true}}}$



 of the inferred parameters λ (A) and μ (B), for increasing approximation order O; for O = k, the propagators 




P




n


i
+
1


|



n


i






${P}_{{n}_{i+1}\vert \,{n}_{i}}$



 are approximated up to and including terms of order ɛ

k
. (a) N = 100 transitions, ɛ = 0.1, λ
true = 3.3, and μ
true = 2.85. (b) N = 1000 transitions, other parameters as in (a). (c) ɛ = 0.01, number of transitions N and other parameters as in (a). (d) μ
true = 28.5, number of transitions N and other parameters as in (a).

Figure 3:

Relative error Δ r ( x ) x x true x true of the inferred parameters λ (A) and μ (B), for increasing approximation order O; for O = k, the propagators P n i + 1 | n i are approximated up to and including terms of order ɛ k . (a) N = 100 transitions, ɛ = 0.1, λ true = 3.3, and μ true = 2.85. (b) N = 1000 transitions, other parameters as in (a). (c) ɛ = 0.01, number of transitions N and other parameters as in (a). (d) μ true = 28.5, number of transitions N and other parameters as in (a).

4 Showcase 2: an autoregulated telegraph model

We extend the telegraph model in (3.1) with an autoregulatory mechanism, where the DNA activation rates are influenced by the presence of protein. Autoregulation is modelled in a catalytic manner, via one of the two following reactions:

(4.1a) D + P a P D * + P ( autoactivation through protein ) ,

(4.1b) D * + P r P D + P ( autorepression through protein ) .

The above pair of autoregulation mechanisms was introduced by Hornos et al. [19], and implemented, e.g. by Iyer-Biswas and Jayaprakash [20]; see Section 5 for a discussion of the physical validity of these mechanisms. Propagators for the telegraph model incorporating both autoregulation through protein, as in (4.1), and autoregulation via mRNA were determined in [14]. The protein-autoregulated telegraph model is also discussed in [21], where it is called the ‘full model’; cf. [21, Figure 1] for a comparison and discussion of several reductions of that model.

To assess the performance of our parameter inference procedure, we fix the parameter values as in (3.2). Again, we marginalize over mRNA, assuming a steady-state distribution; see Section 3. Note that the degree to which that steady-state assumption is valid directly depends on the magnitude of the autoregulation rates, as protein levels influence levels of mRNA through DNA activation rates. We generate six data sets, as follows:

  1. Simulate the model in (3.1) without autoregulation (‘null model’; a P = 0 = r P ) on the time interval 0 ≤ t ≤ 10, which yields N = 100 transitions.

  2. As in (A), with the time interval increased to 0 ≤ t ≤ 100, which yields N = 1000 transitions.

  3. Simulate the extended model {(3.1),(4.1a)} with autoactivation for a P δ = 0.3 on the time interval 0 ≤ t ≤ 10, which yields N = 100 transitions.

  4. As in (C), with the time interval increased to 0 ≤ t ≤ 100, which yields N = 1000 transitions.

  5. Simulate the extended model {(3.1),(4.1b)} with autorepression for r P δ = 0.3 on the time interval 0 ≤ t ≤ 10, which yields N = 100 transitions.

  6. As in (E), with the time interval increased to 0 ≤ t ≤ 100, which yields N = 1000 transitions.

Every data set consists of 10 runs of equal length.

Generating functions for the autoregulated extension, by (4.1), of the telegraph model in (3.1) have been derived in the theoretical companion article [14] to the current work, under the assumption that the autoregulation rate a P or r P is small compared to the protein decay rate d 1. That assumption implies that the ratios a P d 1 α P δ and r P d 1 ρ P δ are small.

Parameter inference now proceeds as follows. We fix a data set, and take a single run from that set. For that run, we determine the log-likelihood of the autoactivated model in {(3.1),(4.1a)}, varying 0 ≤ α P δ ≤ 1; likewise, we determine the log-likelihood of the autorepressed model in {(3.1),(4.1b)}, varying 0 ≤ ρ P δ ≤ 1. The log-likelihood L of the autoregulated extension is then compared with the log-likelihood L 0 of the non-regulated model in (3.1); as before, N denotes the number of data points, where L is defined as in (2.7), that is, we simply evaluate the probability of the observed transitions given the model under consideration, after marginalization over mRNA. The log-likelihood difference LL 0, which is equal to the logarithm of the likelihood ratio, quantifies the evidence for that model. We repeat the above procedure for all 10 runs in the data set, and we determine the mean and standard deviation; the outcome is illustrated in Figure 4. We observe that 1000 transitions suffice to correctly refute autorepression in (B, D), and to correctly refute autoactivation in (F). In the case of 100 transitions, no conclusion can be drawn from (A) and (C), while (E) correctly refutes autoactivation; however, the small difference between L and L 0, with |LL 0|< 1, indicates low significance.

Figure 4: 
Parameter inference for the extended autoregulated model in {(3.1),(4.1)} on the basis of various types of synthetic data, where the performance of the model is quantified via the log-likelihood difference L − L
0. On the vertical axis, L − L
0 indicates the difference between the log-likelihood of the autoactivated or autorepressed model and that of the non-regulated model in (3.1); the higher the value of L − L
0, the more likely the associated model is. In each panel, the solid curve indicates the mean values based on 10 model runs; dashed curves indicate the uncertainty (one standard deviation). On the horizontal axis, the strength of autoregulation is measured by α

P

δ (increasing to the right) or ρ

P

δ (increasing to the left). (A) Data generated from the null (non-regulated) model in (3.1), with N = 100 transitions. (C) Data generated from the model in (3.1), with autoactivation as in (4.1a), for α

P

δ = 0.3 and N = 100 transitions. (E) Data generated from the model in (3.1), with autorepression as in (4.1b), for ρ

P

δ = 0.3 and N = 100 transitions. (B, D, F) as (A, C, E), but with N = 1000 transitions; note that the vertical axis has a different scaling. All other model parameters were assumed to be known.

Figure 4:

Parameter inference for the extended autoregulated model in {(3.1),(4.1)} on the basis of various types of synthetic data, where the performance of the model is quantified via the log-likelihood difference LL 0. On the vertical axis, LL 0 indicates the difference between the log-likelihood of the autoactivated or autorepressed model and that of the non-regulated model in (3.1); the higher the value of LL 0, the more likely the associated model is. In each panel, the solid curve indicates the mean values based on 10 model runs; dashed curves indicate the uncertainty (one standard deviation). On the horizontal axis, the strength of autoregulation is measured by α P δ (increasing to the right) or ρ P δ (increasing to the left). (A) Data generated from the null (non-regulated) model in (3.1), with N = 100 transitions. (C) Data generated from the model in (3.1), with autoactivation as in (4.1a), for α P δ = 0.3 and N = 100 transitions. (E) Data generated from the model in (3.1), with autorepression as in (4.1b), for ρ P δ = 0.3 and N = 100 transitions. (B, D, F) as (A, C, E), but with N = 1000 transitions; note that the vertical axis has a different scaling. All other model parameters were assumed to be known.

It is important to reiterate that the validity of our assumption that mRNA obeys a steady-state distribution, which was used in the marginalization step, is coupled to the magnitude of the autoregulation rates. While a steady-state assumption can hence be argued to be approximately valid for small δ, an increase in autoregulation strength increases the protein level feedback on DNA activation rates, thereby indirectly, but dynamically, influencing levels of mRNA. We propose that the ensuing breakdown of the steady-state assumption is reflected in the behaviour of the log-likelihood difference shown in Figure 4. The assumption can be argued to induce a bias as α P δ, ρ P δ → 1, accordingly skewing the log-likelihood difference, which could also explain the absence of a clear maximum in Figure 4D and F.

5 Discussion

In the present article, we showcase a parameter inference procedure that is based on a recently developed analytical method [14] which allows for the efficient numerical approximation of propagators via the Cauchy integral formula on the basis of asymptotic series for the underlying generating functions. The resulting hybrid analytical–numerical approach reduces the need for computationally expensive simulations; moreover, due to its perturbative nature, it is highly applicable over relatively short time scales, such as those occurring naturally in the calculation of the log-likelihood in (1.1).

We simulate protein expression with a simple stochastic simulation algorithm (SSA) [17]; the resulting protein abundances are in the lower range of experimentally measured values [22], [23], which does not, however, limit our proof of principle. In addition, in the presented application of our approach, we assume that some of the model parameters are known, that protein abundances can be measured at regular sampling intervals, and that there is no noise. We then present results for synthetic data in a family of models for stochastic gene expression from the literature under the central assumption that lifetimes of protein are significantly longer than those of mRNA, which introduces a small parameter ɛ and, hence, a separation of scales. An extensive discussion of the validity of our assumption that ɛ is small can be found elsewhere [10], [14]; see also Section 3. The assumed smallness of ɛ is crucial to the underlying analytical method, as introduced by Veerman et al. [14]. In addition, our approach is specifically attuned to the time variability of the expression process, in the sense that we assume the sampling interval Δt to be small, as well; cf. again Section 3. It is thus ideally suited to describing transient dynamics far from steady state, as is evident in the bursting behaviour seen in Figure 1A.

In Section 3, we discuss a simple (‘telegraph’) model for gene expression without autoregulation, showing that our approach can successfully infer relevant model parameters. Unlike in previous work [24], the underlying implementation avoids potential bias due to zero propagator values and large initial protein numbers through the use of ‘implicit’ series expansions in ɛ; see Appendix A for an in-depth argument. Note that, in the simple telegraph model, transcription factors are assumed to act as single molecules. Our analytical approach would have to be extended to incorporate transcription factor dimer regulation, e.g. via the recently developed linear mapping approximation [25].

In Section 4, we perform a model comparison in an autoregulated extension of the telegraph model. We consider three types of gene regulation: autoactivation, autorepression, and no regulation of DNA activity (null model). For each type, we simulate protein abundance data with 100 and 1000 transitions, respectively. Throughout, we find that 100 data points are insufficient to reject model hypotheses with our approach. With 1000 data points, however, we can successfully reject the non-regulated and the autorepressed model for simulated data from an autoactivated model, in which case we can even infer the correct order of the autoactivation parameter. For simulated autorepression, we can reject the model with autoactivation, but not the non-regulated model. Our approach fails to identify the correct model for data from a non-regulated model for 1000 transitions, where the autoactivated model is clearly, but wrongly, favoured. We believe that more research is needed into the sources of these discrepancies in dependence on both model parameters and the order of our approximation.

In both showcases, we observe a trade-off between the accuracy of inference versus the required computation time. Computation times seem to increase exponentially with the approximation order, at least for the setup realized in this article. For practical purposes, we hence propose an algorithm whereby the fastest, leading order approximation is used to obtain an initial estimate for the underlying model parameters; that estimate can then be improved by including higher order corrections, resulting in a much more computationally efficient procedure.

It is insightful to compare our results with other recent work on parameter inference in regulated gene expression models. In work by Feigelman et al. [9], three models for regulated gene expression with a slightly different structure compared to the models studied in the present article were simulated and inferred via a particle filtering-based inference procedure that employs genealogical information of dividing cells. Interestingly, positive and negative autoregulation could be successfully rejected there for data that was simulated from a no-feedback model. However, the no-feedback model could not be rejected for data originating from the corresponding models with positive or negative feedback. From that comparison with Feigelman et al. [9], we conclude that the structure of the data, the intensity of regulatory feedback, and the chosen inference procedure together will influence the extent of insight which can be obtained from an approach that is based on stochastic models of gene expression.

We emphasize that our analytical method is not restricted to the specific models showcased here. Our aim in the present article is to demonstrate the applicability of the method, and to investigate its performance, rather than to assess the biological validity of a given model. Minimally, our approach can be extended to recent, physiologically more relevant modifications of the telegraph model with autoregulation [19] by Grima et al. [26] and Congxin et al. [27]; another feasible alternative model can be obtained by introduction of a refractory state [28]. Ideally, we would like to test a variety of stochastic gene expression models against a given set of measurement data. Current computational approaches struggle to provide propagators for models with more than one regulated species, which can often only be approximated even in that simple scenario. The principal advantage of our hybrid approach is that propagators can be evaluated in a computationally efficient manner, via a combination of mathematical analysis and numerical integration [14]; other approaches rely either on the calculation of propagators based on direct numerical simulation of the underlying model – which is computationally demanding – or on the assumption that symbolic derivatives of the generating function are explicitly known, which only holds for specific models of relatively low complexity [12].

The input for our propagator-based approach is the abundance of the involved species, viz. of protein. Thus, we assume that absolute protein numbers are measured, which is in practice hampered by an unknown scaling factor between the observed fluorescence and the corresponding abundances, and by noise. While various suggestions for inferring that factor have been made [5], [29], [30], and while a linear scaling is customarily assumed [6], [11], an accurate experimental determination remains extremely challenging.

It is instructive to compare our propagator-based approach to alternative approaches to parameter inference in the context of stochastic gene expression, such as the linear noise approximation [31], a system size expansion with moment closure [32], or tensor-based methods for the corresponding Fokker–Planck equation [33]. Both the linear noise approximation and a system size expansion assume large system sizes and are only valid for large copy numbers, see also [31]; moreover, the impact of moment closure on the relevance of nonlinearities is potentially non-negligible [34]. The relatively recently developed tensor-based methods focus on steady state distributions for large system sizes; their validity away from the thermodynamic limit is as yet unclear, nor is it clear how results are influenced by the particular tensor decomposition method that is chosen. In contrast to these approaches, the propagator-based approach we have employed in this article not only performs well with low copy numbers, i.e. for small system sizes; our analytical propagators are also explicitly time-dependent, in contrast to the usual steady state assumption. Details can be found in our theoretical companion article [14].

Finally, the showcases presented in this article are based on synthetic data that was generated in silico; in the future, we plan to consider experimental data, such as can be found in the work by Suter et al. [6].


Corresponding author: Nikola Popović, University of Edinburgh, School of Mathematics, James Clerk Maxwell Building, King’s Buildings, Peter Guthrie Tait Road, Edinburgh EH9 3FD, UK, E-mail:

Funding source: Leverhulme Trust

Award Identifier / Grant number: RPG-2015-017

Acknowledgements

The authors thank Ramon Grima and Peter Swain (both University of Edinburgh), as well as two anonymous reviewers, for valuable comments and suggestions.

  1. Author contribution: All authors conceived the experiments; F. V. conducted the experiments; F. V. and C. M. analysed the results; F. V. and N. P. wrote the manuscript. All authors reviewed the manuscript.

  2. Research funding: This work has been supported by the Leverhulme Trust, through Research Project Grant RPG-2015-017 (‘The nature of gene expression: model selection and parameter inference’).

  3. Conflict of interest statement: The authors declare no competing interests.

Appendix A: Analytical details

The hybrid analytical–numerical approach developed by Veerman et al. [14] introduces a probability generating function which transforms the CME corresponding to a given stochastic gene expression model into a system of partial differential equations. Explicit expressions for the solutions to these equations are obtained using dynamical systems techniques, in combination with perturbation theory. The values of the associated propagators are recovered through a numerically efficient implementation of the Cauchy integral in (2.1); see also Section 2.

A.1 Approximate generating functions

The generating functions used to approximate propagators in the present article, cf. Section 2, are derived via the analytical method presented by Veerman et al. [14]. For the telegraph model in Section 3, the leading order generating function F ̂ 0 is given by

(A.1) F ̂ 0 ( z ; Δ t , n 0 , ε , λ , μ ) = 1 ( 1 z ) e Δ t n 0 × 1 F 1 κ 0 , κ 0 + κ 1 , ε λ 1 f ( z ) f ( z ) 1 e f ( z ) Δ t ε ,

where 1F1 denotes the confluent hypergeometric function [35, chap. 13] and f(z) = 1 + μ(1 − z)e−Δt . All parameters have been rescaled according to (3.2). The generating function has been marginalized over mRNA, assuming a steady-state distribution reported by Raj et al. [15]. Analogously, the first order approximation F ̂ 1 of the generating function is given by

(A.2) F ̂ 1 ( z ; Δ t , n 0 , ε , λ , μ , χ ) = 1 ( 1 z ) e Δ t n 0 × 1 + ε ( 1 χ ) λ 1 f ( z ) f ( z ) e f ( z ) Δ t ε 1 e f ( z ) Δ t ε f ( z ) + Δ t ε e f ( z ) Δ t ε × 1 F 1 κ 0 , κ 0 + κ 1 , ε λ 1 f ( z ) f ( z ) 1 e f ( z ) Δ t ε × 1 + ε 1 f ( z ) 2 + 1 f ( z ) Δ t ε + ( 1 f ( z ) ) Δ t 2 2 ε 2 ;

here, χ = κ 1 κ 0 + κ 1 . For the autoregulated model discussed in Section 4, the same expressions for the generating functions are used; however, χ now depends on the autoregulatory mechanism according to

(A.3) χ = κ 1 κ 0 + κ 1 no autoregulation , κ 1 κ 0 + κ 1 + α P δ n 0 autoactivation , κ 1 + ρ P δ n 0 κ 0 + κ 1 + ρ P δ n 0 autorepression .

A.2 ‘Implicit’ expansions

It is important to note that neither the leading order generating function in (A.1) nor the first order approximation given by (A.2) are expressed as asymptotic series in powers of ɛ, as would be expected on the basis of the perturbative approach taken by Veerman et al. [14]. The underlying reasoning can be summarized as follows.

First, in the derivation of these generating functions, it was assumed that the sampling interval Δt was small, i.e. of order ɛ; note that this assumption is satisfied in all numerical simulations shown in the current article, where Δt = ɛ throughout. Thus, we can write

(A.4) Δ t = ε Δ s .

The accuracy of our series approximation depends on the asymptotic scaling of Δt; see [14, Remark 3.6]. With the above scaling, the approximation order is equal to the order to which the resulting propagators are accurate, in powers of ɛ.

With the scaling for Δt given in (A.4), an expansion of F ̂ 0 and F ̂ 1 , as defined in (A.1) and (A.2), respectively, into asymptotic series in ɛ to the appropriate order yields

(A.5) F ̂ 0 = z n 0 ,

(A.6) F ̂ 1 = z n 0 1 + ε ( 1 z ) n 0 Δ s z ( 1 χ ) λ μ 1 + μ ( 1 z ) μ ( 1 z ) 1 e [ 1 + μ ( 1 z ) ] Δ s 1 + μ ( 1 z ) + Δ s .

From (A.5), one can immediately conclude that

(A.7) γ F ̂ 0 = δ n , n 0 ,

where δ n , n 0 = 1 if n = n 0 and δ n , n 0 = 0 otherwise. From the series for F ̂ 1 in (A.6), we see that we can write

(A.8) F ̂ 1 ( z ) = z n 0 1 + ε k = 1 f 1 , k z k ,

with appropriately chosen coefficients f 1,k ; hence, it follows that

(A.9) γ F ̂ 1 = 0 if n 0 > n + 1 .

More generally, an expansion of the generating function to order k in ɛ will yield

(A.10) γ F ̂ k = 0 if n 0 > n + k .

From these observations, we conclude that decreasing transitions (n i n i+1), where n i > n i+1 + k, will be assigned a probability that is identically zero. Hence, if such transitions are present in the data, the model is ruled out immediately, as our perturbative approach excludes the possibility that decreasing transitions can occur. One can understand this phenomenon by considering the definition of the small parameter ɛ as the ratio of the protein decay rate d 1 over the mRNA decay rate d 0. A leading order approximation of ɛ = 0 is thus equivalent to taking d 1 → 0 which, in turn, implies that protein does not decay at all, since (natural) protein decay is the only reaction in (3.1) that can decrease protein abundance. By the same reasoning, a straightforward expansion of the generating function to order ɛ k will restrict the model to transitions (n i n i+1), where n i+1n i ≥ − k. It would follow that either the order O of the method would be limited from below by the data, leading to high-order expansions in ɛ and, hence, to increased computation times, or that the method could only be applied to a subset of the given data, which would introduce a bias.

Lastly, an asymptotic expansion such as (A.6) implicitly assumes that all parameters and variables in the model are of order 1 in ɛ. For the series expansion of F ̂ 1 in (A.6), that assumption would significantly restrict the range of λ; in comparison, in Figure 2, likelihood values for λ up to order ɛ −3 are calculated. More importantly, the above assumption would restrict the range of n 0, implying that only a subset of the data – with sufficiently low protein numbers – could be used as input for parameter inference.

We emphasize that none of these difficulties occur with the expressions in (A.1) and (A.2), where the expansion order in ɛ is expressed ‘implicitly’ in the respective functional forms of F ̂ 0 and F ̂ 1 .

References

[1] M. B. Elowitz, A. J. Levine, E. D. Siggia, and P. S. Swain, “Stochastic gene expression in a single cell,” Science, vol. 297, no. 5584, pp. 1183–1186, 2002. https://doi.org/10.1126/science.1070919.Search in Google Scholar

[2] N. G. van Kampen, Stochastic Processes in Physics and Chemistry, Lecture Notes in Mathematics, vol. 888, Amsterdam, New York, North-Holland Publishing Co., 1981.Search in Google Scholar

[3] M. M. Crane, I. B. Clark, E. Bakker, S. Smith, and P. S. Swain, “A microfluidic system for studying ageing and dynamic single-cell responses in budding yeast,” PLoS One, vol. 9, no. 6, p. e100042, 2014. https://doi.org/10.1371/journal.pone.0100042.Search in Google Scholar

[4] A. Filipczyk, C. Marr, S. Hastreiter, et al.., “Network plasticity of pluripotency transcription factors in embryonic stem cells,” Nat. Cell Biol., vol. 17, pp. 1235–1246, 2015. https://doi.org/10.1038/ncb3237.Search in Google Scholar

[5] P. S. Hoppe, M. Schwarzfischer, D. Loeffler, et al.., “Early myeloid lineage choice is not initiated by random PU.1 to GATA1 protein ratios,” Nature, vol. 535, pp. 299–302, 2016. https://doi.org/10.1038/nature18320.Search in Google Scholar

[6] D. M. Suter, N. Molina, D. Gatfield, K. Schneider, U. Schibler, and F. Naef, “Mammalian genes are transcribed with widely different bursting kinetics,” Science, vol. 332, no. 6028, pp. 472–474, 2011. https://doi.org/10.1126/science.1198817.Search in Google Scholar

[7] D. Zenklusen, D. R. Larson, and R. H. Singer, “Single-RNA counting reveals alternative modes of gene expression in yeast,” Nat. Struct. Mol. Biol., vol. 15, pp. 1263–1271, 2008. https://doi.org/10.1038/nsmb.1514.Search in Google Scholar

[8] S. M. Janicki, T. Tsukamoto, S. E. Salghetti, et al.., “From silencing to gene expression: real-time analysis in single cells,” Cell, vol. 116, no. 5, pp. 683–698, 2004. https://doi.org/10.1016/s0092-8674(04)00171-0.Search in Google Scholar

[9] J. Feigelman, S. Ganscha, S. Hastreiter, et al.., “Analysis of cell lineage trees by exact Bayesian inference identifies negative autoregulation of nanog in mouse embryonic stem cells,” Cell Syst., vol. 3, no. 5, pp. 480–490, 2016. https://doi.org/10.1016/j.cels.2016.11.001.Search in Google Scholar PubMed

[10] V. Shahrezaei and P. S. Swain, “Analytical distributions for stochastic gene expression,” Proc. Natl. Acad. Sci. U. S. A., vol. 105, no. 45, pp. 17256–17261, 2008. https://doi.org/10.1073/pnas.0803850105.Search in Google Scholar PubMed PubMed Central

[11] C. Zechner, M. Unger, S. Pelete, M. Peter, and H. Koeppl, “Scalable inference of heterogeneous reaction kinetics from pooled single-cell recordings,” Nat. Methods, vol. 11, pp. 197–202, 2013.10.1038/nmeth.2794Search in Google Scholar PubMed

[12] N. Popović, C. Marr, and P. S. Swain, “A geometric analysis of fast-slow models for stochastic gene expression,” J. Math. Biol., vol. 72, no. 1, pp. 87–122, 2016. https://doi.org/10.1007/s00285-015-0876-1.Search in Google Scholar PubMed

[13] D. Schnoerr, G. Sanguinetti, and R. Grima, “Approximation and inference methods for stochastic biochemical kinetics—a tutorial review,” J. Phys. A, vol. 50, no. 9, p. 093001, 2017. https://doi.org/10.1088/1751-8121/aa54d9.Search in Google Scholar

[14] F. Veerman, C. Marr, and N. Popović, “Time-dependent propagators for stochastic models of gene expression: an analytical method,” J. Math. Biol., vol. 77, pp. 261–312, 2018. https://doi.org/10.1007/s00285-017-1196-4.Search in Google Scholar PubMed PubMed Central

[15] A. Raj, C. S. Peskin, D. Tranchina, D. Y. Vargas, and S. Tyagi, “Stochastic mRNA synthesis in mammalian cells,” PLoS Biol., vol. 4, no. 10, pp. 1707–1719, 2006. https://doi.org/10.1371/journal.pbio.0040309.Search in Google Scholar PubMed PubMed Central

[16] F. Bornemann, “Accuracy and stability of computing high-order derivatives of analytic functions by Cauchy integrals,” Found. Comput. Math., vol. 11, no. 1, pp. 1–63, 2011. https://doi.org/10.1007/s10208-010-9075-z.Search in Google Scholar

[17] D. T. Gillespie, “Exact stochastic simulation of coupled chemical reactions,” J. Phys. Chem., vol. 81, no. 25, pp. 2340–2361, 1977. https://doi.org/10.1021/j100540a008.Search in Google Scholar

[18] A. Duncan, S. Liao, Vejchodský, R. Erban, and R. Grima, “Noise-induced multistability in chemical systems: discrete versus continuum modeling,” Phys. Rev., vol. 91, no. 4, p. 042111, 2015. https://doi.org/10.1103/physreve.91.042111.Search in Google Scholar PubMed

[19] J. E. M. Hornos, D. Schultz, G. C. P. Innocentini, et al.., “Self-regulating gene: an exact solution,” Phys. Rev., vol. 72, p. 051907, 2005. https://doi.org/10.1103/physreve.72.051907.Search in Google Scholar

[20] S. Iyer-Biswas and C. Jayaprakash, “Mixed Poisson distributions in exact solutions of stochastic autoregulation models,” Phys. Rev., vol. 90, p. 052712, 2014. https://doi.org/10.1103/physreve.90.052712.Search in Google Scholar PubMed

[21] J. Holehouse, Z. Cao, and R. Grima, “Stochastic modeling of autoregulatory genetic feedback loops: a review and comparative study,” Biophys. J., vol. 118, no. 7, pp. 1517–1525, 2020. https://doi.org/10.1016/j.bpj.2020.02.016.Search in Google Scholar PubMed PubMed Central

[22] R. Milo, “What is the total number of protein molecules per cell volume? A call to rethink some published values,” Bioessays, vol. 35, no. 12, pp. 1050–1055, 2013. https://doi.org/10.1002/bies.201300066.Search in Google Scholar PubMed PubMed Central

[23] B. Schwanhausser, D. Busse, N. Li, et al.., “Global quantification of mammalian gene expression control,” Nature, vol. 473, no. 7347, pp. 337–342, 2011. https://doi.org/10.1038/nature10098.Search in Google Scholar PubMed

[24] J. Feigelman, N. Popović, and C. Marr, “A case study on the use of scale separation-based analytical propagators for parameter inference in models of stochastic gene regulation,” J. Coupled Syst. Multiscale Dyn., vol. 3, no. 2, pp. 164–173, 2015. https://doi.org/10.1166/jcsmd.2015.1074.Search in Google Scholar

[25] Z. Cao and R. Grima, “Linear mapping approximation of gene regulatory networks with stochastic dynamics,” Nat. Commun., vol. 9, p. 3305, 2018. https://doi.org/10.1038/s41467-018-05822-0.Search in Google Scholar PubMed PubMed Central

[26] R. Grima, D. R. Schmidt, and T. J. Newman, “Steady-state fluctuations of a genetic feedback loop: an exact solution,” J. Chem. Phys., vol. 190, p. 035104, 2012. https://doi.org/10.1063/1.4736721.Search in Google Scholar PubMed

[27] L. Congxin, F. Cesbron, M. Oehler, M. Brunner, and T. Höfer, “Frequency modulation of transcriptional bursting enables sensitive and rapid gene regulation,” Cell Syst., vol. 6, no. 4, pp. 409–423, 2018.10.1016/j.cels.2018.01.012Search in Google Scholar PubMed

[28] B. Zoller, D. Nicolas, N. Molina, and F. Naef, “Structure of silent transcription intervals and noise characteristics of mammalian genes,” Mol. Syst. Biol., vol. 11, no. 7, p. 823, 2015. https://doi.org/10.15252/msb.20156257.Search in Google Scholar PubMed PubMed Central

[29] E. Bakker and P. S. Swain, “Estimating numbers of intracellular molecules through analysing fluctuations in photobleaching,” Nat. Sci. Rep., vol. 9, p. 15238, 2019. https://doi.org/10.1038/s41598-019-50921-7.Search in Google Scholar PubMed PubMed Central

[30] N. Rosenfeld, T. J. Perkins, U. Alon, M. B. Elowitz, and P. S. Swain, “A fluctuation method to quantify in vivo fluorescence data,” Biophys. J., vol. 91, no. 2, pp. 759–766, 2006. https://doi.org/10.1529/biophysj.105.073098.Search in Google Scholar PubMed PubMed Central

[31] P. Thomas, H. Matuschek, and R. Grima, “How reliable is the linear noise approximation of gene regulatory networks?” BMC Genom., vol. 14, no. Suppl 4, p. S5, 2013. https://doi.org/10.1186/1471-2164-14-s4-s5.Search in Google Scholar PubMed PubMed Central

[32] F. Fröhlich, P. Thomas, A. Kazeroonian, F. J. Theis, R. Grima, and J. Hasenauer, “Inference for stochastic chemical kinetics using moment equations and system size expansion,” PLoS Comput. Biol., vol. 12, no. 7, p. e1005030, 2016. https://doi.org/10.1371/journal.pcbi.1005030.Search in Google Scholar PubMed PubMed Central

[33] S. Liao, T. Vejchodský, and R. Erban, “Tensor methods for parameter estimation and bifurcation analysis of stochastic reaction networks,” J. R. Soc. Interface, vol. 12, no. 108, p. 20150233, 2015. https://doi.org/10.1098/rsif.2015.0233.Search in Google Scholar PubMed PubMed Central

[34] C. Kuehn, “Moment closure—a brief review,” in Control of Self-Organising Nonlinear Systems, E. Schöll, S. H. L. Klapp, and P. Hövel, Eds., Switzerland, Springer International Publishing, 2016, pp. 253–271.10.1007/978-3-319-28028-8_13Search in Google Scholar

[35] F. W. J. Olver, A. B. Olde Daalhuis, D. W. Lozier, ., Eds. NIST Digital Library of Mathematical Functions. Available at: http://dlmf.nist.gov/, Release 1.1.1 of 2021-03-15.Search in Google Scholar

Received: 2019-10-16
Revised: 2020-11-21
Accepted: 2021-03-11
Published Online: 2021-04-07
Published in Print: 2022-06-25

© 2021 Frits Veerman et al., published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.

Scroll Up Arrow