Let *n* = *n*_{g} denote the total number of reads aligning to a gene *g* with *k* transcripts, *g* = 1, …, *G*. Assume that $\mathit{X}={\mathit{X}}_{g}=({X}_{1},\mathrm{\dots},{X}_{k})$ is the vector of reads originating from each transcript, generated according to an unknown underlying vector $\bm{\theta}={\bm{\theta}}_{g}=({\theta}_{1},\mathrm{\dots},{\theta}_{k})$ of relative abundances. A priori, a Dirichlet prior is imposed on *𝜽* and, given *𝜽*, the observed reads are generated according to a multinomial distribution, that is,

$$\begin{array}{ccccc}\bm{\theta}\hfill & \sim \hfill & \mathcal{D}({\delta}_{1},\mathrm{\dots},{\delta}_{k})\hfill & & \\ \mathit{X}|\bm{\theta}\hfill & \sim \hfill & \text{Multinomial}(n,\bm{\theta})\hfill & & \end{array}$$

Integrating out *𝜽*, this model leads to the Dirichlet-Multinomial (Mosimann, 1962) distribution:

$$\text{P}(\mathit{X}=\mathit{x})=\left(\genfrac{}{}{0pt}{}{n}{\mathit{x}}\right)\frac{\mathrm{\Gamma}\left({\delta}_{+}\right)}{\mathrm{\Gamma}\left(n+{\delta}_{+}\right)}\prod _{j=1}^{k}\frac{\mathrm{\Gamma}\left({\delta}_{j}+{x}_{j}\right)}{\mathrm{\Gamma}\left({\delta}_{j}\right)},$$

where the first factor denotes the multinomial coefficient and ${\delta}_{+}={\sum}_{j=1}^{k}{\delta}_{j}$. We will write $\mathit{X}|n,\bm{\delta}\sim \mathcal{D}\mathcal{M}(n,\bm{\delta})$. It can be shown that

$$\mathbb{E}\mathit{X}=n\bm{\pi}$$

and

$$\text{Var}\mathit{X}=\left\{1+\frac{n-1}{{\delta}_{+}+1}\right\}n\left\{\text{diag}\left(\bm{\pi}\right)-\bm{\pi}{\bm{\pi}}^{\prime}\right\},$$

where $\bm{\pi}=\{{\delta}_{j}/{\delta}_{+};j=1,\mathrm{\dots},k-1\}$ and diag(*𝝅*) denotes a diagonal matrix with diagonal entries equal to *π*_{1}, …, *π*_{k−1}. Note that as ${\delta}_{+}\to \mathrm{\infty}$ the variance-covariance matrix of the Dirichlet-multinomial distribution reduces to $n\left\{\text{diag}\left(\bm{\pi}\right)-\bm{\pi}{\bm{\pi}}^{\prime}\right\}$, that is, the variance-covariance matrix of the multinomial distribution. In any other case, extra variation is introduced compared to standard multinomial sampling, a well-known property of the Dirichlet-multinomial distribution [see e.g. Neerchal and Morel (1998)].
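As a minimal sketch (not the paper's code), the log-pmf above can be evaluated with log-gamma functions, and the overdispersion factor can be verified by Monte Carlo; this assumes NumPy and SciPy, and names such as `dm_logpmf` are ours:

```python
import numpy as np
from scipy.special import gammaln

def dm_logpmf(x, delta):
    """log P(X = x) for X ~ DM(n, delta), with n = sum(x)."""
    x, delta = np.asarray(x, float), np.asarray(delta, float)
    n, dplus = x.sum(), delta.sum()
    log_coef = gammaln(n + 1.0) - gammaln(x + 1.0).sum()   # multinomial coefficient
    return (log_coef + gammaln(dplus) - gammaln(n + dplus)
            + (gammaln(delta + x) - gammaln(delta)).sum())

# Monte Carlo check of Var X: theta ~ Dirichlet(delta), X | theta ~ Multinomial(n, theta)
rng = np.random.default_rng(0)
delta, n = np.array([2.0, 3.0, 5.0]), 50     # hypothetical hyper-parameters
pi = delta / delta.sum()
xs = np.array([rng.multinomial(n, t) for t in rng.dirichlet(delta, size=20_000)])
factor = 1 + (n - 1) / (delta.sum() + 1)     # overdispersion relative to multinomial
ratio = xs.var(axis=0) / (n * pi * (1 - pi)) # should be close to `factor` coordinate-wise
```

With δ₊ = 10 and n = 50 the empirical variances exceed the multinomial ones by roughly the factor 1 + 49/11, in line with the formula above.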

Consider now that a matrix of (estimated) read counts is available for two different conditions, consisting of *n*_{1} and *n*_{2} replicates. Given two hyper-parameter vectors ${\bm{\delta}}_{1},{\bm{\delta}}_{2}$, let

$$\begin{array}{ccccc}{\mathit{X}}_{i}^{\left(g\right)}|{n}_{1i},{\bm{\delta}}_{1}\hfill & \sim \hfill & \mathcal{D}\mathcal{M}({n}_{1i},{\bm{\delta}}_{1}),\text{ independent for }i=1,\mathrm{\dots},{n}_{1}\hfill & & \\ {\mathit{Y}}_{j}^{\left(g\right)}|{n}_{2j},{\bm{\delta}}_{2}\hfill & \sim \hfill & \mathcal{D}\mathcal{M}({n}_{2j},{\bm{\delta}}_{2}),\text{ independent for }j=1,\mathrm{\dots},{n}_{2},\hfill & & \end{array}$$

where ${\mathit{X}}_{i}^{\left(g\right)}$, ${\mathit{Y}}_{j}^{\left(g\right)}$ denote two independent vectors of (estimated) number of reads for the transcripts of gene *g* = 1, …, *G* for replicate *i* = 1, …, *n*_{1} and *j* = 1, …, *n*_{2} for the first and second condition, respectively. Obviously, *n*_{1i} and *n*_{2j} denote the total number of reads generated from gene *g* for the first and second condition for replicates *i* and *j*.

In this context, DTU inference is based on comparing the hyper-parameters of the Dirichlet-multinomial distribution. Note that *𝜹*_{1} and *𝜹*_{2} are proportional to the average expression level of the specific set of transcripts. Typically, there are large differences in the scale of these parameters, so their direct comparison does not reveal any evidence for DTU. For this reason, it is essential to reparametrize the model as follows:

$$\begin{array}{ccccc}{\bm{\delta}}_{1}\hfill & =\hfill & {d}_{1}{\mathit{g}}_{1}\hfill & & \\ {\bm{\delta}}_{2}\hfill & =\hfill & {d}_{2}{\mathit{g}}_{2},\hfill & & \end{array}$$(4)

where *d*_{1} > 0, *d*_{2} > 0 and ${\mathit{g}}_{1}=({g}_{11},\mathrm{\dots},{g}_{1k})$, ${\mathit{g}}_{2}=({g}_{21},\mathrm{\dots},{g}_{2k})$, with ${\sum}_{i=1}^{k}{g}_{1i}={\sum}_{i=1}^{k}{g}_{2i}=1$ and ${g}_{1i},{g}_{2i}>0$, *i* = 1, …, *k*.
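Since each *g* sums to one, the decomposition in Eq. (4) is the unique split of *𝜹* into a scale and a composition; a one-line sketch with illustrative numbers, assuming NumPy:

```python
import numpy as np

delta = np.array([4.0, 12.0, 24.0])   # hypothetical hyper-parameter vector
d = delta.sum()                       # scale part: since g sums to one, d = delta_+
g = delta / d                         # composition part, positive and on the simplex
```

DTU then concerns only the composition part `g`, leaving the scale `d` free to differ between conditions.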

In this case, DTU inference is based on comparing the null model:

$${\mathcal{M}}_{0}:{\mathit{g}}_{1}={\mathit{g}}_{2}$$

versus the full model where

$${\mathcal{M}}_{1}:{\mathit{g}}_{1}\ne {\mathit{g}}_{2}.$$

A likelihood ratio test is implemented in the DRIMSeq package for testing the hypothesis of the null versus the full model. In this work, we propose to compare the two models by applying approximate Bayesian model selection techniques. In particular, a priori it is assumed that

$$\begin{array}{ccccc}{d}_{i}\hfill & \sim \hfill & \mathcal{E}\left(\lambda \right),\text{ independent for }i=1,2\hfill & & \\ {\mathit{g}}_{i}\hfill & \sim \hfill & \mathcal{D}(1,\mathrm{\dots},1),\text{ independent for }i=1,2,\hfill & & \end{array}$$(5)

and furthermore *d*_{i} and *g*_{j} are mutually independent.
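A draw from the prior in Eq. (5) can be sketched as follows; λ = 1 and *k* = 3 are illustrative values of ours, not recommendations from the paper (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(2)
k, lam = 3, 1.0                                # illustrative transcript count and rate
d = rng.exponential(1.0 / lam, size=2)         # d_1, d_2 ~ Exp(lambda); NumPy takes scale = 1/rate
g = rng.dirichlet(np.ones(k), size=2)          # g_1, g_2 ~ Dirichlet(1, ..., 1), i.e. uniform on the simplex
delta_1, delta_2 = d[0] * g[0], d[1] * g[1]    # back to the DM hyper-parameters of Eq. (4)
```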

In order to perform Bayesian model selection, the Bayes factor (Kass and Raftery, 1995) of the null against the full model is approximated using a two-stage procedure. First, the posterior distribution of each model is approximated using Laplace's approximation (Laplace, 1774, 1986), a well-established practice for approximating posterior moments and posterior distributions (Tierney and Kadane, 1986; Tierney et al., 1989; Azevedo-Filho and Shachter, 1994; Raftery, 1996). Then, the logarithms of the marginal likelihoods of *ℳ*_{0} and *ℳ*_{1} are estimated using independent samples from the posterior distribution via self-normalized sampling importance resampling (Gordon et al., 1993). Finally, the posterior probabilities $p\left({\mathcal{M}}_{0}\right|{\mathit{x}}^{\left(g\right)},{\mathit{y}}^{\left(g\right)})$ and $p\left({\mathcal{M}}_{1}\right|{\mathit{x}}^{\left(g\right)},{\mathit{y}}^{\left(g\right)})$ are estimated assuming equal prior probabilities.
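As a minimal illustration of the first (Laplace) stage, consider a toy Gaussian model where the log marginal likelihood is available in closed form; this is our example, not the paper's Dirichlet-multinomial model, and assumes SciPy:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

y = np.array([0.8, 1.2, 0.5, 1.0])     # toy data: y_i ~ N(mu, 1), prior mu ~ N(0, 1)

def neg_log_joint(mu):
    # -log[ f(y | mu) f(mu) ]
    return -(norm.logpdf(y, mu, 1.0).sum() + norm.logpdf(mu, 0.0, 1.0))

mode = minimize_scalar(neg_log_joint).x            # posterior mode
h = 1e-4                                           # step for a numerical Hessian
hess = (neg_log_joint(mode + h) - 2 * neg_log_joint(mode) + neg_log_joint(mode - h)) / h**2
# Laplace: log m(y) ~= log f(y, mode) + (p/2) log(2*pi) - (1/2) log|H|, with p = 1 here
log_evidence = -neg_log_joint(mode) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(hess)
```

Because this toy posterior is exactly Gaussian, the Laplace approximation is exact up to numerical error; for the Dirichlet-multinomial models the same recipe is applied on the parameter spaces defined below.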

Denote by *g*_{0} the common value of *g*_{1}, *g*_{2} in model *ℳ*_{0}. Let ${\mathit{u}}_{0}=({\mathit{g}}_{0},{d}_{1},{d}_{2})\in {\mathcal{U}}_{0}$, ${\mathit{u}}_{1}=({\mathit{g}}_{1},{\mathit{g}}_{2},{d}_{1},{d}_{2})\in {\mathcal{U}}_{1}$ denote the parameters associated with models *ℳ*_{0} and *ℳ*_{1}, respectively. The underlying parameter spaces are then ${\mathcal{U}}_{0}={\mathcal{P}}_{k-1}\times {(0,+\mathrm{\infty})}^{2}$ and ${\mathcal{U}}_{1}={\mathcal{P}}_{k-1}^{2}\times {(0,+\mathrm{\infty})}^{2}$, where ${\mathcal{P}}_{k-1}$ denotes the (*k* − 1)-dimensional probability simplex. The marginal likelihood of the data under model *ℳ*_{j} is defined as

$$f({\mathit{x}}^{\left(g\right)},{\mathit{y}}^{\left(g\right)}|{\mathcal{M}}_{j})={\int}_{{\mathcal{U}}_{j}}f({\mathit{x}}^{\left(g\right)},{\mathit{y}}^{\left(g\right)}|{\mathit{u}}_{j})f\left({\mathit{u}}_{j}\right|\lambda )\mathrm{d}{\mathit{u}}_{j},\quad j=0,1.$$

According to the basic importance sampling identity, the marginal likelihood can be evaluated using another density *ϕ* on *𝒰*_{j}, as follows

$$f(\mathit{x},\mathit{y}|{\mathcal{M}}_{j})={\int}_{{\mathcal{U}}_{j}}\frac{f({\mathit{x}}^{\left(g\right)},{\mathit{y}}^{\left(g\right)}|{\mathit{u}}_{j})f\left({\mathit{u}}_{j}\right|\lambda )}{\varphi \left({\mathit{u}}_{j}\right)}\varphi \left({\mathit{u}}_{j}\right)\mathrm{d}{\mathit{u}}_{j}.$$

The minimum requirement on *ϕ* is that $\varphi \left({\mathit{u}}_{j}\right)>0$ whenever $f({\mathit{x}}^{\left(g\right)},{\mathit{y}}^{\left(g\right)}|{\mathit{u}}_{j})f\left({\mathit{u}}_{j}\right|\lambda )>0$. Assume that a sample $\left\{{\mathit{u}}_{j}^{\left(i\right)};i=1,\mathrm{\dots},n\right\}$ is drawn from $\varphi (\cdot )$. Then, the importance sampling estimate of the marginal likelihood is

$$\widehat{f}({\mathit{x}}^{\left(g\right)},{\mathit{y}}^{\left(g\right)}|{\mathcal{M}}_{j})=\frac{1}{n}\sum _{i=1}^{n}\frac{f({\mathit{x}}^{\left(g\right)},{\mathit{y}}^{\left(g\right)}|{\mathit{u}}_{j}^{\left(i\right)})f\left({\mathit{u}}_{j}^{\left(i\right)}\right|\lambda )}{\varphi \left({\mathit{u}}_{j}^{\left(i\right)}\right)},\quad j=0,1.$$
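To see this estimator in action on a case with a known answer, the following sketch estimates the log marginal likelihood of a toy conjugate Gaussian model, with a Gaussian candidate *ϕ* centred at the posterior mode standing in for the Laplace approximation; the example and all names are ours (assuming SciPy):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
y = np.array([0.8, 1.2, 0.5, 1.0])     # toy data: y_i ~ N(mu, 1), prior mu ~ N(0, 1)

def log_joint(mu):
    # log f(y | mu) + log f(mu), vectorized over a sample of mu values
    return norm.logpdf(y[:, None], mu, 1.0).sum(axis=0) + norm.logpdf(mu, 0.0, 1.0)

# Candidate phi: a slightly inflated Gaussian at the posterior mode (heavier
# tails than the posterior, so the importance weights are well behaved)
mode, sd = y.sum() / (len(y) + 1), 1.5 / np.sqrt(len(y) + 1)
u = rng.normal(mode, sd, size=100_000)

log_w = log_joint(u) - norm.logpdf(u, mode, sd)   # log importance weights
# Average the weights on the log scale (log-sum-exp for numerical stability)
log_mhat = np.log(np.mean(np.exp(log_w - log_w.max()))) + log_w.max()
```

With 100 000 draws the estimate matches the closed-form log evidence of this toy model to two decimal places.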

The candidate distribution *ϕ* is the approximation of the posterior distribution obtained by Laplace's method. It is well known that basic importance sampling performs reasonably well when the number of parameters is not too large; however, it can be drastically improved using sequential Monte Carlo methods, such as sampling importance resampling (Gordon et al., 1993; Liu and Chen, 1998). The `R` package `LaplacesDemon` (Statisticat, LLC, 2016) is used for this purpose.

Finally, the posterior probability of the DTU model is defined as

$${p}_{g}=\mathbb{P}\left({\mathcal{M}}_{1}\right|{\mathit{x}}^{\left(g\right)},{\mathit{y}}^{\left(g\right)})\propto f({\mathit{x}}^{\left(g\right)},{\mathit{y}}^{\left(g\right)}|{\mathcal{M}}_{1})P\left({\mathcal{M}}_{1}\right),g=1,\mathrm{\dots},G,$$(6)

by also assuming equal prior probabilities, that is, $P\left({\mathcal{M}}_{1}\right)=P\left({\mathcal{M}}_{0}\right)=0.5$. Note that the Bayes factor of the null against the full model is then given by

$${B}_{01}^{\left(g\right)}=\frac{\mathbb{P}\left({\mathcal{M}}_{0}\right|{\mathit{x}}^{\left(g\right)},{\mathit{y}}^{\left(g\right)})}{\mathbb{P}\left({\mathcal{M}}_{1}\right|{\mathit{x}}^{\left(g\right)},{\mathit{y}}^{\left(g\right)})}=\frac{f({\mathit{x}}^{\left(g\right)},{\mathit{y}}^{\left(g\right)}|{\mathcal{M}}_{0})}{f({\mathit{x}}^{\left(g\right)},{\mathit{y}}^{\left(g\right)}|{\mathcal{M}}_{1})},g=1,\mathrm{\dots},G$$

since the prior odds ratio is equal to one.
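Given the two estimated log marginal likelihoods, Eq. (6) and the Bayes factor reduce to simple arithmetic; a sketch with made-up values, assuming NumPy:

```python
import numpy as np

log_m0, log_m1 = -104.2, -101.7     # hypothetical log f(x, y | M_j) estimates

# log-sum-exp for a numerically stable normalizing constant
m = max(log_m0, log_m1)
log_norm = m + np.log(np.exp(log_m0 - m) + np.exp(log_m1 - m))

p_g = np.exp(log_m1 - log_norm)     # posterior probability of the DTU model, Eq. (6)
b01 = np.exp(log_m0 - log_m1)       # Bayes factor of the null against the full model
```

Working on the log scale matters in practice, since the marginal likelihoods themselves underflow double precision for genes with many reads.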

When lowly expressed transcripts are included in the computation, the Laplace approximation often fails to converge. We have found that this problem can be alleviated by pre-filtering lowly expressed transcripts, as also pointed out by Soneson et al. (2015).
