Bayesian estimation of Differential Transcript Usage from RNA-seq data

Next generation sequencing allows the identification of genes consisting of differentially expressed transcripts, a term which usually refers to changes in the overall expression level. A specific type of differential expression is differential transcript usage (DTU) and targets changes in the relative within gene expression of a transcript. The contribution of this paper is to: (a) extend the use of cjBitSeq to the DTU context, a previously introduced Bayesian model which is originally designed for identifying changes in overall expression levels and (b) propose a Bayesian version of DRIMSeq, a frequentist model for inferring DTU. cjBitSeq is a read based model and performs fully Bayesian inference by MCMC sampling on the space of latent state of each transcript per gene. BayesDRIMSeq is a count based model and estimates the Bayes Factor of a DTU model against a null model using Laplace's approximation. The proposed models are benchmarked against the existing ones using a recent independent simulation study as well as a real RNA-seq dataset. Our results suggest that the Bayesian methods exhibit similar performance with DRIMSeq in terms of precision/recall but offer better calibration of False Discovery Rate.


Introduction
High throughput sequencing of cDNA (RNA-seq) [Mortazavi et al., 2008] is an important tool to quantify transcript expression levels and to identify differences between different biological conditions. RNA-seq experiments produce a large number (millions) of short reads (nucleotide sequences) which are typically mapped to the genome or transcriptome. Expression quantification requires estimating the number of reads originating from each transcript in a given sample.
Quantifying the transcriptome between different samples allows the identification of differentially expressed (DE) transcripts between them. However, certain difficulties complicate the inference procedure. In higher eukaryotes, most genes are spliced into alternative transcripts which share specific parts of their sequence (exons). Hence, a given short read typically aligns to different positions of the transcriptome and statistical models are often used to infer the origin probabilistically [Trapnell et al., 2010, Li and Dewey, 2011, Nicolae et al., 2011, Glaus et al., 2012, Rossell et al., 2014, Hensman et al., 2015.
Differential Transcript Expression (DTE) refers to the event where the overall relative expression of a transcript changes between two conditions. In this case, θ k refers to the relative expression of transcript k; k = 1, . . . , K, with respect to the whole set of transcripts, with θ k 0 and K k=1 θ k = 1. On the contrary, Differential Transcript Usage (DTU) refers to the event that the relative within gene abundance of a transcript changes between conditions. Consider a gene g = 1, . . . , G with K g > 1 transcripts. Then, the relative within gene transcript abundance is defined as θ (g) k = θ k j∈g θ j . Obviously, if a transcript belongs to a gene with K g = 1 then it is always non-DTU. According to Gonzàlez-Porta et al. [2013] the dominant transcripts within a gene are likely to be the main contributors to the proteome and switching events between them is a common scenario of gene modification between conditions. Figure 1 illustrates the differences between DTE and DTU, considering a set of three genes (shown in red, blue and green) consisting of 2, 2 and 3 transcripts. In the case of DTE (upper panel) the overall expression of transcripts 1, 2, 6 and 7 change: in particular transcripts 1 and 2 are up-regulated in condition A while trancripts 6 and 7 are up-regulated in condition B. In the lower panel of Figure 1 note that only transcripts 6 and 7 are DTE. However, also note that now the relative expression of these transcripts conditionally on the set of the same-gene transcripts (green color) is not the same between conditions. In general, DTU implies DTE but the reverse is not necessarily true.
In this paper we extend the use of two available methods in order to perform Bayesian inference for the problem of DTU. cjBitSeq [Papastamoulis and Rattray, 2017] was originally introduced as a Bayesian read-based model for DTE inference and here we modify it for the DTU problem. We also propose a Bayesian version of DRIMSeq [Nowicka and Robinson, 2016], a count-based approach originally introduced as a frequentist model for DTU inference. Genomescale studies incorporate a large number of multiple tests, typically at the order of tens of thousands. A crucial issue under a multiple comparisons framework is the control of the False Discovery Rate (FDR), that is, the expected proportions of errors among the rejected hypotheses [Benjamini and Hochberg, 1995]. According to a recent benchmarking study [Soneson et al., 2015], the ability of frequentist count-based methods to control the FDR is drastically improved by pre-filtering low-expressed transcripts. This remains true for the Bayesian version of the count-based method presented here (DRIMSeq). However it is not possible to incorporate such a strategy for read-based methods (cjBitSeq) where transcript expression levels are not known a priori. Therefore, under our Bayesian framework, we also propose the use of transformations of the raw posterior probabilities and filtering the output based on the notion of trust regions which are motivated from realistic scenarios of gene regulation [Gonzàlez-Porta et al., 2013].
The rest of the paper is organized as follows. In Section 2 we briefly describe existing methods. The proposed Bayesian models are presented in Section 3. More specifically, Section 3.1 reviews the cjBitSeq framework and also introduces the necessary prior modifications for the problem of DTU. The likelihood of the DRIMSeq model is presented in Section 3.2 and a Bayesian version is introduced next, along with a detailed description of the inference. Section 4 deals with FDR control procedures. In Section 5 we report our findings on synthetic data using the carefully designed simulation study of Soneson et al. [2015]. In Section 5.1 we compare cjBitSeq and BayesDRIMSeq with respect to the decision rules of Section 4 using power versus achieved FDR plots. In Section 5.2 we benchmark these methods against existing ones and we also report more performance measures, such as ROC and precision/recall curves as well as comparisons in terms of run-time and memory requirements. A real RNA-seq dataset is analysed in Section 6. The manuscript concludes with a Discussion. A prior sensitivity analysis of BayesDRIMSeq as well as a comparison between alternative inputs of BayesDRIMSeq and DRIMSeq based on different quantification methods is provided in the Appendix.

Existing methods
cuffdiff The cufflinks/cuffdiff [Trapnell et al., 2010[Trapnell et al., , 2013 pipeline estimates the expression of a set of transcripts and then performs various differential expression tests both on the transcript and gene level. DTU at the gene level is based on comparing the similarity of two distributions using the square root of the Jensen-Shannon divergence [Osterreicher andVajda, 2003, Endres andSchindelin, 2003]. Following Soneson et al. [2015], we used the gene-wise FDR estimates from the cds.diff output file of cuffdiff (version 2.2.1).
DEXSeq DEXSeq [Anders et al., 2012] is the most popular method for inferring DTU. The genome is divided into disjoint parts of exons (counting bins) and a matrix of read counts into the counting bins is used as input. The default method for counting reads for this purpose is HTSeq [Anders et al., 2015]. Given the estimated reads from HTSeq, a negative binomial generalized linear model is fit and DTU is inferred by testing whether the interaction term between conditions is different from zero.
DRIMSeq This recent package [Nowicka and Robinson, 2016] implements a dirichlet-multinomial model in order to describe the variability between replicates. A likelihood ratio test is performed in order to compare a full model with distinct parameters per condition and a null model which assumes that the parameters are shared. The input is a matrix of counts per transcript. We applied this method using the following filtering criteria: • min gene expr = 1 (Minimal gene expression in cpm) • min feature prop = 0.01 (Minimal proportion for feature expression) • min samps gene expr = 3 (Minimal number of samples where genes should be expressed) • min samps feature prop = 3 (Minimal number of samples where features should be expressed) edgeR The function spliceVariants from the edgeR [Robinson et al., 2010] package can be used to identify genes showing evidence of splice variation using negative binomial generalized linear models. For each gene (containing at least two transcripts) a likelihood ratio test compares a model with an interaction term between each condition against a null model with no interaction term. The input corresponds to a matrix of counts per transcript.
limma The function diffSplice from the limma [Ritchie et al., 2015] package also tests for DTU by fitting negative binomial generalized linear models and performing a likelihood ratio test at the difference of log-fold changes. The input corresponds to a matrix of counts per transcript.
3 New Bayesian approaches cjBitSeq was originally applied to problem of inferring transcripts with DTE and here this model is modified for the problem of DTU. DRIMSeq is a frequentist-based approach for the problem of DTU and this model is now extended under a Bayesian framework. cjBitSeq is a read-based model, that is, the observed data is a matrix of alignments of each read to the transcriptome. On the other hand, DRIMSeq is a count-based model, which uses as input a matrix of (estimated) Interestingly, we note that both distributions were introduced by the same author [Mosimann, 1962, Connor andMosimann, 1969].

cjBitSeq
Let x = (x 1 , . . . , x r ), x i ∈ X , i = 1, . . . , r, denote a sample of r short reads aligned to a given set of K transcripts. The sample space X consists of all sequences of letters A, C, G, T. Assuming that reads are independent, the joint probability density function of the data is written as The number of components (K) is equal to the number of transcripts and it is considered as known since the transcriptome is given. The parameter vector θ = (θ 1 , . . . , θ K ) ∈ P K−1 denotes relative abundances, where P K−1 := {p k 0, k = 1, . . . , K − 1 : The component specific density f k (·) corresponds to the probability of a read aligning at some position of transcript k, k = 1, . . . , K. Since we assume a known transcriptome, {f k } K k=1 are known as well and they are computed according to the methodology described in Glaus et al. [2012], taking into account optional position and sequence-specific bias correction method. Papastamoulis and Rattray [2017] proposed a Bayesian model selection approach for identifying differentially expressed transcripts from RNA-seq data. The methods builds upon the BitSeq model [Glaus et al., 2012, Papastamoulis et al., 2014, Hensman et al., 2015. Compared to other approaches, the main difference of cjBitSeq is that transcript expression and differential expression is jointly modelled. In contrast to other methods where the starting point of the DE analysis is a count matrix, the input of cjBitSeq is the matrix L containing alignment probabilities of each read to the transcriptome. According to Equation (1), the probability of read i aligning at transcript k is given by L ik = f k (x i ) for i = 1, . . . , r and k = 1, . . . , K.
Assume that we have at hand two samples x := (x 1 , . . . , x r ) and y := (y 1 , . . . , y s ), with r and s denoting the number of (mapped) reads for sample x and y, respectively. Now, let θ k and w k denote the unknown relative abundance of transcript k = 1, . . . , K in sample x and y, respectively. Define the parameter vector of relative abundances as θ = (θ 1 , . . . , θ K−1 ; θ K ) ∈ P K−1 and w = (w 1 , . . . , w K−1 ; w K ) ∈ P K−1 . Under the standard BitSeq model the prior on the parameters θ and w would be a product of independent Dirichlet distributions. In this case the probability θ k = w k under the prior is zero and it is not straightforward to define non-DE transcripts. To model differential expression we would instead like to identify instances where transcript expression has not changed between samples. Therefore, we introduce a finite probability for the event θ k = w k . This leads us to define a new model with a non-independent prior for the parameters θ and w.
Definition 1 (State vector). Let c := (c 1 , . . . , c K ) ∈ C, where C is the set defined by: Then, for k = 1, . . . , K let: We will refer to vector c as the state vector of the model.
cjBitSeq was originally applied to the problem of DTE by introducing a cluster representation of aligned reads to transcripts. This clustering approach substantially reduces the dimensionality of the samp ling space and makes the MCMC sampler converge to reasonable time. It is important to mention that clusters are defined under a data-driven algorithm, that is, by searching the alignments of each read and identifying groups of transcripts sharing reads.
Under the same approach, we would be able to infer clusters of transcripts with DTU.
However, since in this work we focus on inference at the gene level, we impose the assumption that clusters are defined as the transcripts of each gene. Otherwise, in some instances it will not be straightforward to perform inference at the gene level, due to the possibility of clusters of transcripts merging multiple genes together. For example, we found that approximately 4.5% of mapped reads align to more than one gene in our simulation experiments of Section 5 using paired-end reads with length 101 base-pairs. In case that a read maps to more than one gene, we only keep the alignments corresponding to transcripts of the gene containing the best score for this specific read. Thus, the cjBitSeq algorithm is applied separately to each gene (consisting of at least two transcripts). outperforms other choices. However, this choice introduces a prior bias to the case of DTU since genes with larger number of transcripts are assigned larger prior probability of DTU than genes with small number of transcripts. Therefore, now it is a priori assumed that the probability of no differential expression within a gene is equally weighted with the event that at least two transcripts exhibit DTU, that is, P(c + = 0) = 0.5. An equal prior probability is assigned to the rest possible configurations. Thus, the prior distribution on the state vector is defined as: This modification is necessary in order to ensure that no prior bias is enforced at the gene-level which is the aim of the analysis in the DTU setup.
A graphical model of the cjBitSeq prior assumptions is shown in Figure 2. The binary state vector c = (c 1 , . . . , c K ) defines differentially or equally expressed transcripts within each gene.
The prior distribution of c is given by Equation (2), although in the general implementation of Papastamoulis and Rattray [2017] an extra level of hierarchy is imposed by the hyper-parameter π, shown in Figure 2. The parameters u and v are a-priori independent Dirichlet random variables. The dimension of u is equal to K, i.e. the number of transcripts for a given gene. On the other hand, v is a random variable with varying dimension, which is defined by the number of differentially expressed transcripts, that is, K k=1 c k . The parameters u and v along with an auxiliary parameter τ define via a suitable one-to-one transformation the actual transcript expression parameters θ and w. According to Theorem 1 of Papastamoulis and Rattray [2017], θ and w are marginally Dirichlet random variables, however they are not independent since the probability of the events {θ k = w k ; k = 1, . . . , K} is positive. At the next level of hierarchy, the latent allocation variables ξ and z define the transcript allocation of each read from sample x and y, respectively, through the equations P (ξ i = k) = θ k , independent for i = 1, . . . , r, and P (z j = k) = w k , independent for j = 1, . . . , s. Papastamoulis and Rattray [2017] showed that the model is conjugate given c. But in order to update (c, v), a reversible-jump mechanism [Green, 1995, Richardson and Green, 1997, Papastamoulis and Iliopoulos, 2009] is required. However, this step can be avoided by analytical integration of (u, v). Thus, a collapsed Gibbs sampler [Geman and Geman, 1984, Gelfand and Smith, 1990, Liu, 1994, Liu et al., 1995 updates the latent allocation variables (ξ and z) of each read to its transcript of origin as well as the binary variables c k of each transcript state (DE or EE). Let x −[i] denote the vector arising from x after excluding its i-th entry. A pseudo-code description of the collapsed Gibbs MCMC sampler is: 1. Update allocation variables for sample x: ξ i |ξ [−i] , z, c, x, y, i = 1, . . . , r.
Note that the update 4 is optional in the sense that it is not required by any of the previous steps, however one can include it in order to also obtain MCMC samples of the transcript expression parameters θ and w. For a detailed description of the conditional distributions involved in steps 1-4 (as well as the alternative RJMCMC sampler) see Papastamoulis and Rattray [2017]. According to our model, it is natural to call a gene as DE if at least two transcripts exhibit DTU. Hence, the posterior probability of DTU for a gene g is defined as and it is estimated by the corresponding ergodic average across the MCMC run (after burn-in).

BayesDRIMSeq
Let n = n g denotes the total number of reads aligning to a gene g with k transcripts, g = 1, . . . , G.
Assume that X = X g = (X 1 , . . . , X k ) is the vector of reads originating from each transcript, according to an underlying vector θ = θ g = (θ 1 , . . . , θ k ) of relative abundances which is unknown.
A priori, a Dirichlet prior is imposed on θ and, given θ, the observed reads are generated according to a multinomial distribution, that is, Integrating out θ, this model leads to the Dirichlet-Multinomial [Mosimann, 1962] distribution: where the first term in the product denotes the multinomial coefficient and δ + = K k=1 δ k . We will write: X|n, δ ∼ DM(n, δ). It can be shown that where π = {δ j /δ + ; j = 1, . . . , k − 1} and diag(π) denotes a diagonal matrix with diagonal entries equal to π 1 , . . . , π k−1 . Note that as δ + → ∞ the variance-covariance matrix of the Dirichletmultinomial distribution reduces to n{diag(π) − ππ }, that is, the variance-covariance matrix of the multinomial distribution. In any other case extra variation is introduced compared to standard multinomial sampling, a well known property of the Dirichlet-multinomial distribution (see e.g. Neerchal and Morel [1998]).
Consider now that a matrix of (estimated) read counts is available for two different conditions, consisting of n 1 and n 2 replicates. Given two hyper-parameter vectors δ 1 , δ 2 , let j denote two independent vectors of (estimated) number of reads for the transcripts of gene g = 1, . . . , G for replicate i = 1, . . . , n 1 and j = 1, . . . , n 2 for the first and second condition, respectively. Obviously, n 1i and n 2j denote the total number of reads generated from gene g for the first and second condition for replicates i and j.
In this context, DTU inference is based on comparing the hyper-parameters of the Dirichlet-Multinomial distribution. Note that δ 1 and δ 2 is proportional to the average expression level of the specific set of transcripts. Typically, there are large differences in the scale of these parameters, thus their direct comparison does not reveal any evidence for DTU. For this reason, it is essential to reparametrize the model as follows: where d 1 > 0, d 2 > 0 and g 1 = (g 11 , . . . , g 1k ), g 2 = (g 21 , . . . , g 2k ), with k i=1 g 1i = k i=1 g 2i = 1 and g 1i , g 2i > 0, i = 1, . . . , k.
In this case, DTU inference is based on comparing the null model: A likelihood ratio test is implemented in the DRIMSeq package for testing the hypothesis of the null versus the full model. In this work, we propose to compare the two models by applying approximate Bayesian model selection techniques. In particular, a priori it is assumed that g i ∼ D(1, . . . , 1) independent for i = 1, 2, and furthermore d i and g j are mutually independent.
In order to perform Bayesian model selection, the Bayes factor [Kass and Raftery, 1995] of the null against the full model is approximated using a two stage procedure. At first, the posterior distribution of each model is approximated using Laplace's approximation [Laplace, 1774[Laplace, , 1986], a well established practice for approximating posterior moments and posterior distributions [Tierney and Kadane, 1986, Tierney et al., 1989, Azevedo-Filho and Shachter, 1994, Raftery, 1996. Then, the logarithm of marginal likelihoods of M 0 and M 1 are estimated using independent samples from the posterior distribution via self-normalized sampling importance resampling [Gordon et al., 1993]. Finally, the posterior probabilities p(M 0 |x (g) , y (g) ), and p(M 1 |x (g) , y (g) ) are estimated assuming equally weighted prior probabilities.
Obviously, the underlying parameter spaces are defined as U 0 = P Kg−1 × (0, +∞) 2 and U 1 = According to the basic importance sampling identity, the marginal likelihood model can be evaluated using another density φ, which is absolutely continuous on U j , as follows The minimum requirement for φ is to satisfy φ(u j ) > 0 whenever f (x (g) , y (g) |u j )f (u j |λ) > 0.
The candidate distribution φ is the approximation of the posterior distribution according to the Laplace's method. It is well known that basic importance sampling performs reasonably well in cases that the number of parameters is not too large. However, it can be drastically improved using sequential Monte Carlo methods, such as sampling importance resampling [Gordon et al., 1993, Liu andChen, 1998]. The R package LaplacesDemon [Statisticat and LLC., 2016] is used for this purpose.
Finally, the posterior probability of the DTU model is defined as p g = P(M 1 |x (g) , y (g) ) ∝ f (x (g) , y (g) |M 1 )P (M 1 ), g = 1, . . . , G, by also assuming equally weighted prior probabilities, that is, P (M 1 ) = P (M 0 ) = 0.5. Note that the Bayes Factor of the null against the full model is then given by since the prior odds ratio is equal to one.
In case that low expressed transcripts are included in the computation, the Laplace approximation faces many convergence problems. We have found that this problem can be alleviated by pre-filtering low expressed transcripts, as also pointed out by Soneson et al. [2015].

Bayesian FDR control for the problem of DTU
In this section we consider various decision rules in order to control the False Discovery Rate (FDR) [Benjamini and Hochberg, 1995, Storey, 2003, Müller et al., 2004, 2006]. Decision rules (9) and (11) are taking into account the whole set of genes and make use of the raw and transformed posterior probabilities, respectively. Intuitively, the transformation of posterior probabilities prioritizes genes consisting of transcripts with large changes in their expression.
Decision rules (10) and (12) are based on filtering the output of (9) and (11) according to a trust region.
A decision rule based on the raw gene-level posterior probabilities of DTU, as defined in Equations (3) and (8), is the following.
Note that for the problem of inferring DTE the decision rule (9) is the one used by Leng et al. [2013]. However, the cjBitSeq model takes into account changes to any subset of transcripts within a gene, thus, (9) may identify a large number of genes consisting of relatively small changes in low expressed transcripts. A more conservative choice will focus our attention to the dominant transcripts, where more reads are available and potentially the results will be more robust.
Next we define a filtering of the output based on a "trust region". Let i and j denote the estimated dominant transcripts in condition A and B, respectively. The trust region corresponds to the subset of genes where the relative ordering of estimated expression levels of dominant transcript switches, that is, Switching events between dominant transcripts have been proposed as a major source of DTU in real RNA-seq data [Gonzàlez-Porta et al., 2013].
Note that in the previous expression we used the notation of transcript expression levels according to cjBitSeq. For BayesDRIMSeq θ and w should be replaced by g 1 and g 2 , respectively.
The decision rule which corresponds to filtering (9) according to G 0 is the following: Note that decision rules d 1 and d 2 are solely based on the posterior probabilities of gene DTU and the trust region, respectively. However, it makes sense to also take into account additional information, such as the magnitude of the change of the within gene relative transcript expression, which is a by-product of our algorithm.
In order to clarify this, consider the following example. Assume that genes g 1 and g 2 both consist of two transcripts. For g 1 , let θ = 0.4. Furthermore, assume that the posterior evidence of DE is the same for both genes, that is, p g 1 = p g 2 = p. In the case that the posterior probability p is sufficiently large, genes g 1 and g 2 will be given the same importance in our discovery list. Note however that for gene g 1 the absolute change in relative expression is 4 times larger than for gene g 2 . Ideally, we would like our discovery list to rank higher gene g 1 than gene g 2 . This is achieved using the following FDR control procedure.
Consider any (Bayesian) method that for each gene yields an estimate of the posterior probability of DTU per gene p g , g = 1, . . . , G.
Here we mention that in the original implementation of cjBitSeq for the DTE problem, the permutation τ was defined as the one that orders the posterior probabilities of transcript DE in decreasing order.
The permutation that takes into account the previously described concept of magnitude change is defined as follows. Let ρ g = max | θ denote the posterior mean estimates of within gene transcript expression for a given transcript k of gene g. Consider the permutation τ = (τ 1 , τ 2 . . . , τ G ) that orders the set {ρ g ; g = 1, . . . , G} in decreasing order, that is: Finally, we combine decision rule d 3 with the trust region G 0 to obtain our final decision rule, that is, 1, 1 g g * and g ∈ G 0 0, otherwise. (12)

Simulation study
In order to assess the performance of the proposed methods and decision rules as well as to compare against existing models, a set of simulation studies is used. Instead of setting up our own simulation scenarios, we followed the pipeline introduced in the recent study of Soneson et al. [2015], where a large number of count-based method is being benchmarked. Synthetic RNA-seq reads are generated from the Drosophila Melanogaster and Homo Sapiens transcriptomes using the RSEM-simulator [Li and Dewey, 2011]. The model parameters for RSEM-simulator were estimated from real datasets using a Negative Binomial model described in Soneson and Delorenzi [2013]. The transcriptomes of these two organisms exhibit strong differences as illustrated in Figure 3. The average number of transcripts per gene is considerably smaller for fruit fly, however the transcripts are longer than for human (see also Supplementary Table 1 of Soneson et al. [2015]).
Each replicate consists of 25 million paired-end reads with length 101 base-pairs. Differential transcript usage was introduced for 1000 genes, by reversing the relative abundance of the two most abundant transcripts in one of the two conditions. The total number of reads for each transcript may or may not be equal across conditions. If the total number of reads generated from a gene is constant, no gene-level differential expression is evident. For the drosophila reads no gene-level differential expression was introduced. For human reads both cases are considered.
Finally, the simulated reads are mapped to the genome or transcriptome with Tophat2  and Bowtie2 [Langmead et al., 2009], respectively. Cufflinks and HTSeq used the alignment files produced by Tophat2, while BitSeqVB and cjBitSeq use the alignment produced from Bowtie2, allowing a maximum of 100 hits per read. The count matrix used as input to DEXSeq is estimated using the default HTSeq method, while BitSeqVB is used for input to edgeR, limma, DRIMSeq and BayesDRIMSeq.

Comparison against existing methods
For cuffdiff, DRIMSeq, edgeR and limma we use the gene-level p-values at the ROC and precision/recall plots and the adjusted q-values at the power versus achieved FDR plot. However, dexSeq reports only the adjusted q-values, hence this method is not shown at ROC and precision/recall curves. Note that for all these methods, the adjusted q-values correspond to the Benjamini and Hochberg [1995] FDR control procedure. For cjBitSeq we used the raw FDR rate (11) at the ROC and precision/recall curves and the adjusted FDR (12) at the power versus achieved FDR plots. For BayesDRIMSeq we used the raw FDR rate (11) at all plots, after pre-filtering isoforms with an average number of reads less than 20.
The performance measures of the evaluated methods are shown in Figure 5. The run-time per method is illustrated in Figure 6, with respect to the maximum amount of virtual memory used by each process. For the counting-based methods, the main computational burden of the two-stage pipeline is due to the first stage (that is, either HTSeq or BitSeqVB). Therefore, it is suggested to run cjBitSeq using at least 8 cores, since the memory requirements stay within reasonable levels. Finally, we mention that isoform pre-filtering is also essential for the computing time of BayesDRIMSeq. In case where no filtering takes place, the wallclock time is increased almost 2.5 times for drosophila and 4.3 times for the human datasets.

Adenocarcinoma dataset
In this section we benchmark the new Bayesian methods against DRIMSeq using real RNA-seq data from human lung normal and adenocarcinoma samples from six Korean female nonsmoking patients [Kim et al., 2013]. The data corresponds to samples from GSM927308 to GSM927319 and was downloaded from NCBIs Gene Expression Omnibus (GEO) under the accession number GSE37764: SRR493937, SRR493939, SRR493941, SRR493943, SRR493945, SRR493947, SRR493949, SRR493951, SRR493953, SRR493955, SRR493957, SRR493959.
The data consist of paired-end reads with length equal to 78 base pairs which were mapped to the reference transcriptome using Bowtie2. The overall alignment rates and the total number of mapped reads range between (70%, 85%) and (22×10 6 , 30×10 6 ), respectively. Next, BitSeq was used in order to calculate the matrix of alignment probabilities (as input to cjBitSeq) as well as to obtain a matrix of estimated counts per transcript (as input to DRIMSeq and BayesDRIMSeq).
Following Nowicka and Robinson [2016], we benchmark our methods using two comparisons: (a) a two-group comparison of 6 normal versus 6 cancer samples and (b) "mock" comparisons where 3 versus 3 samples from the normal condition are compared. For the latter scenario the expectation is to detect no DTU since replicates of the same condition are compared, although the biological variation between the replicates of the normal condition is high [as noted by Nowicka and Robinson, 2016]. The results are displayed in Figure 7, using different cutoff values for controlling the FDR. For the 6 normal versus 6 cancer samples comparison ( Figure   7.a), we conclude that all decision rules contain a large amount of genes which overlap with DRIMSeq (green colored regions), especially for the trust-region adjusted rules d 2 and d 4 . For the "mock" comparison ( Figure 7.b), at first note that a smaller number of DTU genes is inferred.
Second, observe that the decision rule d 4 is capable of substantially reducing the number of false discoveries compared to DRIMSeq and that this number is almost zero when using α = 0.01.

Discussion
In this study we exemplified the use of Bayesian methods for inferring genes with differential transcript usage. For this purpose two previously introduced models were modified and extended: cjBitSeq and a Bayesian version of DRIMSeq. After defining proper decision rules we concluded that both methods exhibit superior or comparable performance with other methods.
This was achieved by using the decision rule defined in Equation (11), shown in the ROC and precision-recall curves. According to (11), the whole sequence of posterior probabilities is transformed with respect to the ordering of the magnitude change of relative expression between conditions. For the read-based method (cjBitSeq) FDR control is improved when the decision rule is combined with a trust region. For the count-based method (BayesDRIMSeq) FDR control is mainly affected by the filtering of low-expressed transcripts, as previously reported under a frequentist context by Soneson et al. [2015]. BayesDRIMSeq exhibits slightly better FDR control than cjBitSeq for the drosophila dataset, however this effect is not so evident for the human datasets. In all cases cjBitSeq is more powerful than BayesDRIMSeq, but at the cost of increased computing time.
Regarding the analysis of real RNA-seq data, we compared our findings to DRIMSeq. We reported results based on a comparison of two different conditions, as well as "mock" comparisons of replicates within the same condition where no evidence of differential expression is expected.
We concluded that our DTU lists contain a large number of genes also detected by DRIMSeq.
Moreover, using conservative decision rules like d 4 we are able to substantially reduce the number of false discoveries when performing comparisons within the same condition.

A Prior Sensitivity of BayesDRIMSeq
According to Equation (6), the prior assumptions of BayesDRIMSeq are depending on the fixed hyperparameter λ. Figure 8 displays the power versus achieved FDR curves based on the decision d 3 as a function of λ ∈ {0.01, 0.1, 0.2, . . . , 1} (after isoform pre-filtering). We conclude that the value λ = 0.5 offers, perhaps, the best trade-off between power and FDR control. In particular, we note that values smaller than 0.5 tend to have small power and, on the other hand, values larger than 0.5 have larger rates of False Discoveries. All results presented in the main paper correspond to λ = 0.5.

B Using Kallisto counts
In the main text we used BitSeqVB count estimates as input to DRIMSeq and BayesDRIMSeq.
According to the recent study of Hensman et al. [2015], BitSeqVB is ranked as one of the most accurate methods for estimating transcript expression levels. Since there is a variety of alternative methods for this purpose, we compare the performance when Kallisto [Bray et al., 2016] counts are being used as input. As shown in Figure 9, we conclude that in drosophila data both BayesDRIMSeq and DRIMSeq perform better when BitSeqVB counts are used. However there is no clear ordering in the human datasets: in both cases BitSeqVB counts correspond to increased power but at the cost of slightly worse FDR calibration. Finally, ROC and precision-recall curves suggest that BitSeqVB leads to slightly increased performance for both methods.