Next-generation sequencing of RNA (RNA-seq) has been replacing microarray technologies in genomic studies in recent years. RNA-seq has several advantages over traditional microarray technologies: it can quantify a larger dynamic range of expression levels, has lower background noise, and has more power to view the entire transcriptome (Wang et al., 2009, 2010; Oshlack et al., 2010). In a typical RNA-seq experiment, a sample of purified RNA is extracted and converted to a cDNA library. This cDNA library is then sequenced on a high-throughput platform. The sequencing process generates millions of reads taken from one or both ends of each cDNA fragment. These short reads are then mapped to the reference genome. The number of short reads mapped to a gene reflects the abundance of this gene in the given sample, i.e. the gene expression. Refer to Oshlack et al. (2010) for a comprehensive introduction to RNA-seq technology.
As in microarray analysis, identifying differentially expressed (DE) genes between different conditions is one of the main goals of RNA-seq data analysis. Due to the high cost of sequencing, many early RNA-seq datasets were produced for two-group comparisons with small sample sizes. Hence many existing RNA-seq analysis packages are designed for the simple two-group comparison. However, as the cost of sequencing decreases, it is expected that more and more RNA-seq data will be produced in the setting of complex experimental designs with multiple explanatory factors. This has been well recognized, and several methods have been proposed to handle complex RNA-seq experiments, either through a generalized linear model framework such as the edgeR package (McCarthy et al., 2012) and the DESeq package (Anders and Huber, 2010), or through a Bayesian framework such as paired baySeq (Hardcastle and Kelly, 2013) and EBSeq (Leng et al., 2013). While these approaches are much more flexible, they do not allow for the inclusion of random effects.
Data arising from complex experiments such as split-plot designs, randomized block designs, and repeated measures in time or space are traditionally analyzed by mixed effects models (Stroup, 2015). Therefore, RNA-seq data from such experiments can be analyzed in similar ways. A few approaches are available to model RNA-seq data allowing random effects. For instance, Blekhman et al. (2010) used the standard generalized linear mixed model (GLMM) through the R package lme4 to analyze an RNA-seq dataset from three species with two genders, where the random effect was used to account for the individual subject effect. Van De Wiel et al. (2013) proposed a Bayesian framework to handle data from very flexible experimental designs allowing random effects.
Although packages allowing random effects are available for RNA-seq analysis, they do not appear to be popular in the literature. In practice, a common approach is to ignore the random effects, treat them as fixed effects, and apply the generalized linear model (GLM) type of approaches mentioned earlier. For instance, McCarthy et al. (2012) considered a real RNA-seq dataset with a paired structure. RNA-seq profiles of tumor and matched normal tissue from three patients with oral squamous cell carcinomas were fitted to a negative binomial log-linear model including both patient effect and treatment effect through the edgeR package. For such an experiment, the conventional practice is to treat the patient effect as random for two reasons: (1) to account for the correlation between the paired observations within each patient; (2) so that the statistical inference can be applied to a broader population rather than only the three observed patients. However, the current version of edgeR can only handle fixed effects, and hence the patient effect was treated as fixed in their analysis. Although the shrinkage estimation of the dispersion parameter borrows strength across genes, it does not account for the correlation due to the same patient. In addition, ignoring the variation due to random patient effects can lead to inflated false positive rates.
Another common practice is to use the partial data pertaining to a comparison of interest and apply a two-group comparison approach, ignoring the rest of the data in the experiment. For instance, Johnston et al. (2013) conducted a 7-day time course RNA-seq experiment to characterize the long-lasting immune response of beetles to bacterial challenges. Specifically, pools of 10 individuals treated with bacterial challenge and two pools of control beetles consisting of five individuals were sampled at each of five consecutive time points, and the experiment was repeated twice in consecutive weeks. This experiment involves repeated measures and random blocks and would traditionally be handled by mixed effects models. However, the temporal response to immune challenge was studied by analyzing the RNA-seq data through pairwise comparisons at each time point, using only the data of that time point. Using partial data pertaining to the comparison of interest discards information and may result in low power in detecting DE genes.
To summarize, we have reviewed the following three practices in the literature for analyzing RNA-seq data arising from complex experimental designs involving random effects: (1) take the experimental design into account by allowing random effects; (2) take the experimental design into account but treat random effects as fixed; and (3) use partial data and focus on two-group comparisons. Others may also argue for transforming the counts and applying a linear mixed model. However, Stroup (2015) pointed out that transformations for count data “in a mixed model setting not only do not help, they tend to make things worse.” Therefore, we do not consider this approach here.
With different practices in the literature, it is fair to ask the following question: does it matter if one ignores the random effects, either by treating them as fixed or by focusing on pairwise comparisons using partial data, when analyzing RNA-seq data from a multi-factor experiment? Answering this question is our goal in this paper. We accomplish the goal by comparing the three practices through real data analysis and simulation studies in the setting of split-plot designs. For each of the three practices, there are multiple packages available. In this paper, we do not intend to give a comprehensive review of all existing methods. Instead, we choose a few popular methods from each category of practice to make our points. In particular, we consider the standard GLMM for the first category. For the second category, we consider the edgeR GLM approach due to its popularity and its flexibility in handling RNA-seq data from multi-factor experiments. For the third category, we consider three popular packages allowing paired data structures: baySeq, DESeq and edgeR.
The rest of the paper is organized as follows. Section 2 gives the motivating example, which has a split-plot design, and details the three common practices for analyzing this type of data. Section 3 analyzes the real data and uses bioinformatics tools to illustrate the effectiveness of the model with random effects described in Section 2.1. Section 4 generates simulated count data from parametric and semiparametric models to validate the necessity of incorporating random effects in complex designs where mixed effects models should be used.
2 A motivating example
Microbe-specific molecules, referred to as microbe-associated molecular patterns (MAMPs), can be recognized by the pattern recognition receptors of the plant innate immune system, and a defense response, termed MAMP-triggered immunity (MTI), is then mounted so that the plant can resist disease (Valdés-López et al., 2014). To understand the MTI response of the soybean to invading pathogens, Valdés-López et al. (2014) produced RNA-seq gene expression data of soybean plants of four genotypes (LD, LDX, 11272 and 11268), where plants of each of the four genotypes were treated with either treatment (MAMP or pathogen) or control (no pathogen). The experimental design is illustrated in Figure 1. Specifically, five plants of each genotype were grown in a separate pot, giving four pots in total. After 3 weeks, leaves were harvested and cut into small slices. Half of these slices received the control (no pathogen) and the other half received the treatment (MAMP or pathogen). All experiments were carried out in a growth chamber with controlled environmental conditions. RNA samples were extracted and sequenced for each combination of genotype and treatment condition. The above procedures were repeated three times to prepare three biological replicates. In total, there were 4 (genotypes) × 2 (treatments) × 3 (biological replicates) = 24 RNA-seq libraries. Researchers were interested in comparing gene expression under different conditions (treatment versus control) for each genotype, as well as gene expression of different genotypes under each condition.
This is an example of a split-plot design, where the whole plot is the pot and the split-plot is the subsample of soybean seedlings within each pot. In addition, the whole plot is nested within the blocks (the biological replicates). Since the response variable is count data, data from such an experiment should be handled by generalized linear mixed effects models with pot effects and replicate effects treated as random effects (Stroup, 2015). Therefore, we first consider the standard GLMM and then compare it with the other two common practices in the literature reviewed in the introduction.
Without loss of generality, we focus on identifying DE genes of soybean plants between treatment and control for only one of the genotypes, LD. In other words, we are interested in identifying MAMP-responsive genes, i.e. genes responding to the invading pathogen, for the LD genotype.
2.1 Three approaches for data analysis
2.1.1 Approach I: generalized linear mixed effects model (GLMM)
If one wants to use all data from the 24 RNA-seq libraries and account for the random effects, one can consider the following gene-specific Poisson mixed-effects model:

$$Y_{gijk} \mid \mu_{gijk} \sim \mathrm{Poisson}(\mu_{gijk}), \qquad \log(\mu_{gijk}) = \log(l_{ijk}) + \eta_{gij} + r_{gk} + p_{gik} + e_{gijk}, \tag{1}$$

$$r_{gk} \sim N(0, \sigma^2_{g,r}), \qquad p_{gik} \sim N(0, \sigma^2_{g,p}), \qquad e_{gijk} \sim N(0, \sigma^2_{g,e}),$$

where $Y_{gijk}$ is the raw read count of gene $g$ for genotype $i$, treatment $j$ and replicate $k$. We use $i$ to index the four genotypes ($i=1, 2, 3, 4$ representing LD, LDX, 11272 and 11268, respectively), $j$ to index the two conditions ($j=1$ for control and $j=2$ for treatment) and $k$ to index biological replicates ($k=1, 2, 3$). In addition, we use $g$ to index genes ($g=1, \ldots, G$), where $G=31,616$ is the total number of genes. $l_{ijk}$ is a scaling factor for each library such that the expected gene expression levels of non-DE genes are comparable after scaling. The library scaling factor adjusts for the differences in expression caused by both the different sequencing depths and the different RNA compositions of the RNA samples. It can be obtained by dividing the effective library size of each library by that of a reference library. Here the effective library size refers to the product of the original library size (the total number of read counts per library) and a normalization factor that adjusts for the RNA composition effect. Throughout the paper, we use the trimmed mean of M-values (TMM) method of Robinson and Oshlack (2010) to calculate the normalization factor for the RNA composition effect. TMM uses a weighted trimmed mean of the log expression ratios across genes to estimate the global fold change between two samples, assuming that the majority of genes are not differentially expressed.
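The definition of the library scaling factor can be illustrated with a minimal Python sketch. It assumes the normalization factors have already been computed elsewhere (for instance by a TMM implementation); the function name, the library sizes and the factors below are hypothetical values chosen only for illustration.

```python
def library_scaling_factors(library_sizes, norm_factors, reference=0):
    """Scaling factor = effective library size relative to a reference library.

    Effective library size = original library size * normalization factor.
    """
    effective = [size * factor for size, factor in zip(library_sizes, norm_factors)]
    ref = effective[reference]
    return [e / ref for e in effective]


# Hypothetical library sizes (total read counts) and normalization factors.
sizes = [20_000_000, 25_000_000, 18_000_000]
factors = [1.00, 0.95, 1.10]
scaling = library_scaling_factors(sizes, factors)
print(scaling)
```

The reference library always receives a scaling factor of 1, and the other libraries are expressed relative to it.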
There are three sources of random effects in the linear model for the log of the Poisson rate in model (1): $r_{gk}$ is the $k$th random replicate effect (block effect); $p_{gik}$ is the random pot effect (random error of the whole-plot units), which is nested within the replicate effect; and $e_{gijk}$ is the random individual sample effect (part of the random error at the level of the split-plot units), which is nested within the pot effect and is included to account for the over-dispersion often associated with such data (Stroup, 2015). In addition, we assume that $r_{gk}$, $p_{gik}$ and $e_{gijk}$ are mutually independent.
$\eta_{gij}$ is the normalized average expression level, on the log scale, of gene $g$ under condition $j$ and genotype $i$. Recall that we are interested in identifying DE genes between treatment and control for one genotype at a time, say genotype LD ($i=1$). Therefore, our testing hypothesis is $H_0: \eta_{g11} = \eta_{g12}$ versus $H_1: \eta_{g11} \neq \eta_{g12}$. Both the full model and the reduced model were fitted to the data using the R lme4 package. A likelihood ratio test was then conducted to compute p-values. The false discovery rate (FDR) was controlled at level α by using the q-value approach of Storey (2002).
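The testing step can be sketched in Python as follows. This is a simplified illustration, not the analysis code of the paper: it assumes the maximized log-likelihoods of the full and reduced models are already available, that the likelihood ratio statistic is referred to a chi-square distribution with one degree of freedom (one constraint under the null), and that the proportion of true nulls is estimated with a single tuning value `lam`. The function names and the p-values are hypothetical.

```python
import math


def lrt_pvalue(loglik_full, loglik_reduced):
    """Likelihood ratio test p-value, assuming a chi-square(1) reference."""
    stat = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    # For one degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def storey_qvalues(pvalues, lam=0.5):
    """q-values with a single-lambda estimate of the null proportion pi0."""
    m = len(pvalues)
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        qvalues[i] = running_min
    return qvalues


pvals = [0.001, 0.01, 0.02, 0.6, 0.9]  # hypothetical per-gene p-values
qvals = storey_qvalues(pvals)
print(qvals)
```

Genes whose q-value falls below the chosen level α would then be declared DE.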
Note that an alternative to the Poisson mixed effects model is a negative binomial mixed-effects model, where the random individual sample effect is dropped and the Poisson distribution is replaced by a negative binomial distribution. For one dimensional data, the negative binomial mixed effects model is considered to be a better model for biological count data according to Stroup (2015). However, when we fit this model to the soybean RNA-seq data, it fails to converge for many genes when using either SAS or the R glmmADMB package. A careful examination of these failed cases shows that the negative binomial mixed effects models either fail to converge or converge very slowly when one or more variance components are zero, which was also observed by Booth et al. (2003). After dropping the random effects with zero variance components, the negative binomial mixed effects model converges well. Because genes are heterogeneous, one needs to adjust the negative binomial mixed effects model individually for each gene to avoid convergence failures, which can be computationally intensive and time consuming. Since we did not encounter such computational issues when fitting a general Poisson mixed effects model to all genes in either the real data analysis or the simulation studies, we use the Poisson GLMM in our study.
2.1.2 Approach II: generalized linear model with shrinkage dispersion parameter estimates, treating random effects as fixed
Although the more appropriate way to deal with count data arising from complex designs such as split-plot designs is a GLMM, in practice one may ignore the random effects structure by treating the random effects as fixed and apply generalized linear models (GLMs) using widely used packages such as edgeR and DESeq, which can handle data from multi-factor experiments with fixed effects only. For our motivating example, one can fit the following gene-specific negative binomial (NB) log-linear model using the GLM functionality of the edgeR package:

$$Y_{gijk} \sim \mathrm{NB}(\mu_{gijk}, \phi_g), \qquad \log(\mu_{gijk}) = \log(l_{ijk}) + \eta_{gij} + r_{gk}, \tag{2}$$

where $\mu_{gijk}$ is the mean and $\phi_g$ is the dispersion parameter of the NB distribution. All the other notations have the same interpretations as in model (1), except that the replicate effect $r_{gk}$ is now treated as a fixed effect. Note that the pot effect in model (1) cannot be included in model (2) as a fixed effect, because otherwise the design matrix becomes singular and some pot effects are not estimable. Also note that the individual sample effect in model (1) is excluded from the model here, since the NB distribution already accounts for the overdispersion caused by the individual sample effect.
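The singularity caused by a fixed pot effect can be checked numerically. The Python sketch below builds an indicator design matrix for the 24 libraries and computes its rank with exact arithmetic. The coding is an assumption made for illustration (cell-means coding for the eight genotype-by-treatment combinations, the first replicate as reference level, one indicator per pot); edgeR may code the design differently, but the rank deficiency does not depend on the coding.

```python
from fractions import Fraction
from itertools import product


def matrix_rank(rows):
    """Rank of a matrix by Gauss-Jordan elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# One row per library: (genotype i, treatment j, replicate k).
libraries = list(product(range(4), range(2), range(3)))


def design(include_pot):
    rows = []
    for i, j, k in libraries:
        cell = [int((i, j) == c) for c in product(range(4), range(2))]  # 8 columns
        rep = [int(k == r) for r in (1, 2)]  # 2 columns, replicate 0 is the reference
        # A pot is identified by (genotype, replicate): 12 columns.
        pot = [int((i, k) == p) for p in product(range(4), range(3))] if include_pot else []
        rows.append(cell + rep + pot)
    return rows


x_model2 = design(include_pot=False)
x_with_pot = design(include_pot=True)
rank_model2 = matrix_rank(x_model2)
rank_with_pot = matrix_rank(x_with_pot)
print(len(x_model2[0]), rank_model2)      # columns and rank without pot effects
print(len(x_with_pot[0]), rank_with_pot)  # columns and rank with pot effects
```

Without pot effects the matrix has full column rank, whereas adding the 12 pot indicators yields more columns than the rank can support, so not all pot effects are estimable.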
Before the negative binomial GLM model is fitted to the data, the dispersion parameters are estimated using the Cox-Reid profile-adjusted likelihood method (McCarthy et al., 2012). Specifically, a common dispersion parameter is estimated for all genes and then an empirical Bayes strategy is applied to squeeze the gene-specific dispersions towards the common dispersion parameter estimate. The amount of shrinkage is determined by a user-specified prior degrees of freedom with a default value of 20 (McCarthy et al., 2012). To identify DE genes between the control and treatment for genotype LD, the data were fitted under the full model and the reduced model respectively, and then the likelihood ratio test was conducted to compute p-values. FDR was controlled at level α by using the q-value approach (Storey, 2002).
Obviously model (2) will not be adequate for genes where the pot effects cannot be ignored. However, if one wants to restrict oneself to fixed effects models, this is the best model one can fit to the whole data. An arguably more popular practice to analyze RNA-seq data arising from complex experiments is to ignore the design of the experiment, and focus on pairwise comparison using partial data.
2.1.3 Approach III: paired data analysis using partial data
As pointed out in the introduction section, another common practice is to use partial data pertaining to a comparison of interest and apply the two-group comparison approach ignoring the rest of data in the experiment. For this motivating example, it means one only uses the six RNA-seq libraries that relate to the genotype LD and then conducts a paired-group comparison because the plants under the treatment and control are paired due to the replicate effect and the pot effect (note the replicate effect is completely confounded with the pot effect when only the data of one genotype is considered).
There are many packages available for such an analysis. See Soneson and Delorenzi (2013) for a comprehensive review. Here we choose three popular packages: baySeq (Hardcastle and Kelly, 2013), edgeR (McCarthy et al., 2012) and DESeq (Anders and Huber, 2010) as representatives. These methods are selected since they are commonly used in RNA-seq analysis (Chung et al., 2013; Qiu et al., 2015). In this paper, we refer to these packages as the paired baySeq, the paired DESeq, and the paired edgeR.
Note that both the paired DESeq and the paired edgeR use the same GLM as in model (2), except that the model is fitted only to the six RNA-seq libraries relating to the genotype LD (the genotype index i is fixed at value 1). The main difference between the paired DESeq and the paired edgeR methods lies in how the dispersion parameter is estimated. Both packages estimate the per-gene dispersion parameters using the Cox-Reid-adjusted maximum likelihood estimators. However, the edgeR package uses an empirical Bayes approach to shrink (or squeeze) the per-gene dispersion parameter estimate toward a common dispersion parameter estimate, while the DESeq package models the mean-dispersion relationship using a regression curve so that information is shared across genes with similar expression abundance. For the DESeq package, the maximum of the fitted dispersion estimate and the per-gene estimate is then taken as the final dispersion parameter estimate.
After the dispersion parameters are estimated, both the paired DESeq and the paired edgeR methods fit the negative binomial log-linear model (2) to the data twice to identify DE genes between treatment and control for the LD genotype: the first time with both the condition effect and the replicate effect, and the second time with the replicate effect only. Then a likelihood ratio test is conducted to obtain a p-value for each gene. We apply the q-value approach to control the FDR (Storey, 2002).
Note that both the paired DESeq and the paired edgeR use a fixed replicate effect to model the paired structure of the partial data, which does not really account for the correlation between the paired observations or the extra variability introduced by the random pair (replicate/pot) effect. The paired baySeq takes a very different approach to the paired data. It treats each pair of observations as a unit and considers the proportion of gene expression contributed by the treatment condition in a pair as the parameter of interest. Specifically, it models the conditional distribution of the treatment count $Y_{g12k}$ given the pair total $n_{g1k} = Y_{g11k} + Y_{g12k}$ by a beta-binomial distribution with adjustment for unequal library sizes. The probability mass function of $Y_{g12k}$ is given by

$$P(Y_{g12k} = y \mid n_{g1k}) = \binom{n_{g1k}}{y} \frac{B(y + \alpha_{g1k},\; n_{g1k} - y + \beta_{g1k})}{B(\alpha_{g1k}, \beta_{g1k})}, \tag{3}$$

where

$$\alpha_{g1k} = \tilde{\pi}_{g1k}\left(\frac{1}{\phi_g} - 1\right), \qquad \beta_{g1k} = (1 - \tilde{\pi}_{g1k})\left(\frac{1}{\phi_g} - 1\right), \qquad \tilde{\pi}_{g1k} = \frac{\pi_{g1}\, l_{12k}}{\pi_{g1}\, l_{12k} + (1 - \pi_{g1})\, l_{11k}},$$

and $B(\cdot, \cdot)$ is the standard beta function. Here $Y_{gijk}$ and $l_{ijk}$ have the same interpretation as in model (1). $\phi_g$ is the dispersion parameter of the beta-binomial model and describes the overdispersion associated with the random pair (pot/replicate) effect, and $\pi_{g1}$ is the average proportion of normalized expression contributed by the treatment group in a pair of genotype LD and hence is the parameter of interest for this model. When there is no differential expression between the treatment and control groups, the value of $\pi_{g1}$ should be 0.5. To identify the DE genes between treatment and control for genotype LD, the testing hypothesis is $H_0: \pi_{g1} = 0.5$ versus $H_1: \pi_{g1} \neq 0.5$. An empirical Bayes framework is used to evaluate the posterior probability that a gene is differentially expressed. The baySeq package produces the estimated posterior probabilities, which can be used for ranking. However, it does not produce a fold change estimate and hence will be excluded from our comparisons of different procedures whenever the fold change criterion is used together with the false discovery rate (Guo et al., 2013).
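To make the beta-binomial model for a pair concrete, the following Python sketch evaluates its probability mass function. It assumes the common mean-dispersion parameterization, in which the two beta parameters are the library-size-adjusted proportion and its complement, each multiplied by (1/ϕ − 1); this parameterization, the function names and the example values are assumptions for illustration and may differ in detail from the baySeq implementation.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def paired_betabinom_pmf(y_trt, n, prop, phi, lib_ctl=1.0, lib_trt=1.0):
    """P(treatment count = y_trt | pair total = n) under a beta-binomial model.

    prop: proportion of normalized expression contributed by the treatment (0 < prop < 1).
    phi: dispersion (0 < phi < 1).
    lib_ctl, lib_trt: library scaling factors of the two libraries in the pair.
    """
    # Adjust the proportion for unequal library sizes.
    adj = prop * lib_trt / (prop * lib_trt + (1.0 - prop) * lib_ctl)
    scale = 1.0 / phi - 1.0
    alpha, beta = adj * scale, (1.0 - adj) * scale
    log_choose = math.lgamma(n + 1) - math.lgamma(y_trt + 1) - math.lgamma(n - y_trt + 1)
    return math.exp(log_choose + log_beta(y_trt + alpha, n - y_trt + beta) - log_beta(alpha, beta))


n_total = 20  # hypothetical pair total
pmf_null = [paired_betabinom_pmf(y, n_total, prop=0.5, phi=0.1) for y in range(n_total + 1)]
pmf_de = [paired_betabinom_pmf(y, n_total, prop=0.7, phi=0.1) for y in range(n_total + 1)]
print(sum(pmf_null), sum(y * p for y, p in enumerate(pmf_de)))
```

Under the null proportion of 0.5 with equal library sizes the distribution is symmetric around half of the pair total, while a proportion of 0.7 shifts the expected treatment count to 70% of the total.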
We will compare the above three different approaches through real data analysis and simulation studies to answer the aforementioned question: does it matter if one ignores the random effects when analyzing RNA-seq data in a multi-factor experiment?
3 Real data analysis
We analyzed the RNA-seq data of Valdés-López et al. (2014) described in Section 2 and focused on the identification of DE genes between treatment and control of the genotype LD. We compared the three approaches reviewed in the last section: the Poisson mixed effects model (denoted by GLMM in the paper), the edgeR GLM, and the paired data analyses using different packages. We filtered out low-count genes by keeping only the genes with at least one count per million in at least three libraries, and used the TMM normalization method of Robinson and Oshlack (2010) for all of the above approaches.
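The filtering rule can be sketched in Python as follows. This is a minimal illustration that assumes counts per million are computed from the raw library totals before filtering; the function name and the toy counts are hypothetical.

```python
def cpm_filter(counts, min_cpm=1.0, min_libraries=3):
    """Keep genes with at least `min_cpm` counts per million in at least `min_libraries` libraries.

    counts: dict mapping gene name -> list of raw counts, one per library.
    """
    n_lib = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_lib)]
    kept = {}
    for gene, row in counts.items():
        cpm = [1e6 * c / size for c, size in zip(row, lib_sizes)]
        if sum(value >= min_cpm for value in cpm) >= min_libraries:
            kept[gene] = row
    return kept


# Toy counts for four libraries; each library total is 1,000,000 reads.
toy_counts = {
    "geneA": [999_998, 999_998, 999_998, 999_999],
    "geneB": [2, 2, 2, 0],
    "geneC": [0, 0, 0, 1],
}
kept_genes = cpm_filter(toy_counts)
print(sorted(kept_genes))
```

In this toy example geneB passes the threshold in three libraries and is kept, while geneC passes in only one library and is removed.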
To ensure that the reported DE genes have meaningful expression changes, we required the fold change (FC) across conditions to be no less than 2, with the false discovery rate (FDR) controlled at level 0.05. Since the paired baySeq package does not provide a fold change estimate, it is excluded from the comparisons here. We summarize the analysis results of these methods in Figure 2. It is clear from the Venn diagram that the GLMM method detects the largest number of DE genes and overlaps well with the other methods. The paired edgeR method comes second in terms of the number of identified DE genes. Note that the list of DE genes identified by either the paired DESeq method or the edgeR GLM method is completely covered by the list identified by both the paired edgeR method and the GLMM method.
The conservative performance of the edgeR GLM approach might be due to the model inadequacy. Although this approach utilizes the whole dataset, it only allows fixed effects in the model, which forces the pot effect to be excluded from the model and the nested structure of the experimental design is not taken care of. The paired DESeq package is known for being conservative due to its conservative way of estimating the dispersion parameter (Lund et al., 2012). Since it uses the maximum of a fitted dispersion parameter estimate and a per-gene estimate as the final dispersion parameter estimate, it is not a surprise that it detects fewer DE genes than the paired edgeR package. The paired edgeR method utilizes only partial data pertaining to the genotype of interest and hence might suffer from loss of information compared to the GLMM approach.
Although the GLMM method detects more DE genes than the paired edgeR package, there are some genes detected by the paired edgeR but not by the GLMM method. It is of interest to compare the extra genes identified by the paired edgeR alone with those identified by the GLMM alone to see which method produces more reliable results. We used two bioinformatics tools for this purpose: MULTICOM-PDCN (Wang et al., 2011, 2013) was used to predict functions, and KAAS (KEGG Automatic Annotation Server) (Moriya et al., 2007) was used to predict KEGG pathways for the DE genes identified by the paired edgeR and the GLMM methods on the soybean data. We did not consider the results of the paired DESeq and the edgeR GLM in the bioinformatics analysis because they did not identify anything new compared to the paired edgeR or the GLMM method. We calculated the number of genes annotated with each of the predicted functions and sorted these functions in descending order. The top 10 functions were selected for each list of DE genes. We calculated a p-value for each of the predicted KEGG pathways using Fisher’s exact test (Agresti, 2002). The predicted KEGG pathways were sorted by their p-values in ascending order, and the top 10 pathways were selected for each list of DE genes.
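A pathway p-value of this kind can be sketched in Python with the hypergeometric tail probability. The sketch assumes a one-sided Fisher's exact test for over-representation of a pathway in a DE gene list; the sidedness, the function name and the numbers are assumptions made for illustration.

```python
from math import comb


def enrichment_pvalue(k, n, K, N):
    """One-sided Fisher's exact test for over-representation.

    N: total number of annotated genes, K: genes in the pathway,
    n: genes in the DE list, k: DE genes that fall in the pathway.
    Returns P(X >= k) for X following a hypergeometric distribution.
    """
    upper = min(n, K)
    numerator = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return numerator / comb(N, n)


# Hypothetical example: 10 genes in total, 4 in the pathway, 3 DE genes, 2 of them in the pathway.
p_pathway = enrichment_pvalue(k=2, n=3, K=4, N=10)
print(p_pathway)
```

Pathways would then be sorted by these p-values in ascending order to obtain the top-ranked pathways for each gene list.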
The function prediction results show that the predicted functions of the DE genes identified by GLMM only are more consistent with those of the overlap of GLMM and the paired edgeR. Table 1 shows the top 10 biological process functions of the DE genes identified by GLMM only, by the paired edgeR only, and by both (their overlap) on the soybean data. Note that the top 10 predicted functions for the overlap are all highly significant, with p-values smaller than $10^{-5}$, and some of these predicted functions are clearly related to plant immune response, such as defense response, plant-type hypersensitive response and response to chitin (one type of MAMP treatment). Therefore, one can have good confidence in the results of the overlap. Interestingly, nine of these ten predicted functions are among the top 10 predicted functions for GLMM only, with significant p-values (smaller than 0.003), including functions such as defense response and plant-type hypersensitive response. On the other hand, only five of the top 10 predicted functions for the overlap are among the top 10 predicted functions for the paired edgeR only, with much less statistical significance. The defense response and plant-type hypersensitive response functions do not appear among the top 10 predicted functions for the paired edgeR only.
The KEGG pathway prediction results also show that the results for GLMM only are more consistent with those for the genes in the overlap of the GLMM and paired edgeR approaches. Table 2 shows the top 10 predicted KEGG pathways of the DE genes identified by the GLMM only, by the paired edgeR only, and by both (their overlap) on the soybean data. Among the top 10 KEGG pathways identified by both the paired edgeR and the GLMM method, five are statistically significant with p-value < 0.05. The two most significant pathways (p-value < $10^{-8}$), ko04626 and ko04075, are also identified by the GLMM only approach with small p-values. Note that the most significant pathway identified by the overlap, the ko04626 pathway, is a plant-pathogen interaction pathway, which is clearly associated with the soybean innate immune response to invading pathogens. This pathway is also identified as the most significant pathway by the GLMM only. However, it is not among the top 10 predicted pathways identified by the paired edgeR only.
To summarize, the GLMM approach not only detects more DE genes than the paired edgeR approach, but the extra DE genes identified by the GLMM approach also show a pattern more consistent with the overlapping DE genes detected by both the GLMM and the paired edgeR approaches, based on the bioinformatics studies. In addition, a substantial number of the genes uniquely identified by the GLMM are involved in the plant-pathogen interaction pathway, and hence the results of the GLMM provide valuable extra information.
Generally speaking, when FDR level is controlled at a fixed level, the methods that detect the largest number of DE genes are preferred. However, for real data analysis, we can only control the nominal FDR level at a fixed level. It is not possible to assess the actual FDR levels for each method because the truth is unknown for the real data. Although the bioinformatics analysis provides extra information, it does not give us an accurate estimate of the actual FDR of different methods either. Therefore, some researchers take another way to compare different methods by focusing on a fixed number of the most significant DE genes (Lund et al., 2012).
Table 3 lists the number of genes that are shared by each pair of methods in their top 500 DE genes. Note that here the rankings of genes are decided purely by their p-values (or estimated posterior probabilities) and no fold change criterion is used. Therefore, the paired baySeq method is included in this comparison. It is clear that the paired DESeq and the paired edgeR have the largest overlap in their top 500 DE genes. These two methods have the next largest overlap with the edgeR GLM. This makes sense because all three methods apply the same GLM to the data, while the first two methods both use partial data and the edgeR GLM uses all data. The paired baySeq comes next in terms of similarity to the previous three. This is likely because the paired baySeq takes a very different approach to address the correlation and overdispersion features of the paired data. The GLMM has the least overlap with the other four methods, which is expected because it uses all of the data and models the nested design structure of the data by allowing for random effects. This comparison only tells us how similar or how different these methods are. It does not tell us which method is better in terms of power or type I error rate. Although the comparative analysis on the real data provides valuable insights into how these methods work, one needs to conduct simulation studies to evaluate the relative performance of these methods in terms of power and type I error rate.
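The pairwise overlap of top-ranked genes can be computed with a few lines of Python. The sketch below is illustrative only: each method is assumed to provide a score per gene where smaller means more significant (a p-value, or one minus the posterior probability for baySeq), and the gene names and scores are hypothetical.

```python
from itertools import combinations


def top_k(scores, k):
    """Set of the k genes with the smallest scores."""
    return set(sorted(scores, key=scores.get)[:k])


def pairwise_overlap(rankings, k):
    """Number of shared genes in the top-k lists of every pair of methods."""
    tops = {name: top_k(scores, k) for name, scores in rankings.items()}
    return {(a, b): len(tops[a] & tops[b]) for a, b in combinations(tops, 2)}


posterior = {"g1": 0.99, "g2": 0.20, "g3": 0.90, "g4": 0.10}  # hypothetical baySeq output
rankings = {
    "GLMM": {"g1": 0.001, "g2": 0.02, "g3": 0.5, "g4": 0.03},
    "paired edgeR": {"g1": 0.002, "g2": 0.4, "g3": 0.01, "g4": 0.02},
    "paired baySeq": {gene: 1.0 - prob for gene, prob in posterior.items()},
}
overlap = pairwise_overlap(rankings, k=2)
print(overlap)
```

With k set to 500 and the real rankings, the resulting dictionary would correspond to the entries of a table such as Table 3.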
4 Simulation study
4.1 Simulation descriptions
In this section, we conduct simulation studies to compare the performance of the three different approaches. We simulate data with the same design as the motivating example of Valdés-López et al. (2014). In other words, there are four genotypes and two conditions with 12 pots nested within the three biological replicates. Let $Y_{gijk}$ be the raw read count of gene $g$ for genotype $i$, treatment $j$ and replicate $k$, and let $\mu_{gijk}$ be the average expression level of gene $g$ for genotype $i$, treatment $j$ and replicate $k$. We generated $\mu_{gijk}$ according to the following log-linear mixed effects model:

$$\log(\mu_{gijk}) = \log(l_{ijk}) + \eta_{gij} + r_{gk} + p_{gik}, \tag{4}$$

where the parameters have the same interpretation as in model (1) and $r_{gk}$ and $p_{gik}$ are mutually independent. The relationship between $Y_{gijk}$ and $\mu_{gijk}$ is modeled through the specification of the distribution of $Y_{gijk}$ given $\mu_{gijk}$. Three different distributions are considered: (1) log-normal Poisson distribution, (2) negative binomial distribution, and (3) nonparametric distribution using the empirical distribution based on real data. We describe the three simulation settings individually below.
4.1.1 Simulation setting I: Poisson mixed-effects model
We assume that $Y_{gijk}$ follows a Poisson distribution, and to account for the over-dispersion observed for individual biological samples, we introduce a random sample effect $e_{gijk}$ in the log-linear model for the mean function. In other words, we simulate the count data according to model (1). Conditioning on the replicate and the pot effects, $Y_{gijk}$ follows a log-normal Poisson distribution.
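The data-generating mechanism of this setting can be sketched in Python for a single gene. This is a minimal illustration under stated assumptions, not the simulation code of the paper: all library scaling factors are set to 1, the random effects are normal with the given variances, and the function names, variance values and mean levels are hypothetical. Poisson draws are produced by counting unit-rate exponential arrivals, which avoids external dependencies.

```python
import math
import random


def rpois(mu, rng):
    """Poisson draw: number of unit-rate exponential arrivals before time mu."""
    count, t = 0, rng.expovariate(1.0)
    while t <= mu:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulate_gene(eta, var_rep, var_pot, var_sample, rng, n_rep=3):
    """Counts for one gene under a Poisson log-normal split-plot model.

    eta[i][j]: log mean expression for genotype i and treatment j.
    Returns a dict keyed by (genotype, treatment, replicate).
    """
    counts = {}
    for k in range(n_rep):
        rep_effect = rng.gauss(0.0, math.sqrt(var_rep))
        for i in range(len(eta)):
            # One pot per genotype within each replicate (whole-plot unit).
            pot_effect = rng.gauss(0.0, math.sqrt(var_pot))
            for j in range(len(eta[i])):
                sample_effect = rng.gauss(0.0, math.sqrt(var_sample))
                mu = math.exp(eta[i][j] + rep_effect + pot_effect + sample_effect)
                counts[(i, j, k)] = rpois(mu, rng)
    return counts


rng = random.Random(2024)
eta = [[math.log(50.0), math.log(100.0)]] + [[math.log(50.0)] * 2 for _ in range(3)]
sim_counts = simulate_gene(eta, var_rep=0.05, var_pot=0.05, var_sample=0.1, rng=rng)
print(len(sim_counts), sim_counts[(0, 0, 0)], sim_counts[(0, 1, 0)])
```

In this example the first genotype has a two-fold difference between treatment and control on the mean scale, while the other three genotypes have none.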
In this simulation model, we need to specify the following parameters: the library scaling factor $l_{ijk}$, the average expression level $\eta_{gij}$ of gene $g$ under condition $j$ and genotype $i$, and the variance components $\sigma^2_{g,r}$, $\sigma^2_{g,p}$ and $\sigma^2_{g,e}$ for the three random effects. The library scaling factor $l_{ijk}$ is usually estimated through a normalization method in real data analysis. Since different packages use different normalization methods, and the normalization method can have a profound impact on the downstream analysis, we set the library scaling factor $l_{ijk}=1$ for all $i$, $j$ and $k$ for the simulated data so that observations from all libraries are comparable without normalization.
The variance component parameters $\sigma^2_{g,r}$, $\sigma^2_{g,p}$ and $\sigma^2_{g,e}$ of model (1) were generated from inverse gamma distributions whose parameters were obtained from real data. Specifically, model (1) was fit to the RNA-seq data of Valdés-López et al. (2014) and the three variance component estimates were obtained for all genes (see Supplementary Figure S1 for the fitted density curves of the estimated variance components of the three random effects). The estimates of each variance component were then fit to an inverse gamma distribution, with the parameters of the inverse gamma distribution estimated by the method of moments (see Supplementary Table S1 for the parameter estimates).
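The method-of-moments step can be sketched in Python. The sketch assumes the shape-scale parameterization of the inverse gamma distribution, with mean b/(a − 1) and variance b²/((a − 1)²(a − 2)) for shape a > 2 and scale b; the function names and the example values are hypothetical and are not the estimates reported in the supplementary material.

```python
import random
from statistics import mean, variance


def inverse_gamma_from_moments(m, v):
    """Shape and scale of an inverse gamma with mean m and variance v."""
    shape = m * m / v + 2.0
    scale = m * (shape - 1.0)
    return shape, scale


def fit_inverse_gamma(values):
    """Method-of-moments fit from a sample of variance component estimates."""
    return inverse_gamma_from_moments(mean(values), variance(values))


def sample_inverse_gamma(shape, scale, rng):
    """If G ~ Gamma(shape, scale=1/scale_ig), then 1/G ~ InverseGamma(shape, scale_ig)."""
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)


# Exact moments of an inverse gamma with shape 5 and scale 2: mean 0.5, variance 1/12.
shape_hat, scale_hat = inverse_gamma_from_moments(0.5, 1.0 / 12.0)
rng = random.Random(7)
new_variances = [sample_inverse_gamma(shape_hat, scale_hat, rng) for _ in range(5)]
print(shape_hat, scale_hat, new_variances)
```

Each simulated gene would receive its own draw from the fitted distribution for each of the three variance components.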
The parameters were generated from real data so that 2000 genes would be differentially expressed and 8000 genes would be non-differentially expressed. Note that even though the data were simulated with four genotypes, we are interested in comparing the treatment versus control for only one genotype at a time as in the motivating example described in Section 2. Without loss of generality, we focus on genotype i=1. Therefore, the mean parameters of the non-DE genes between treatment and control for genotype i=1 satisfy We fit the model (1) to the RNA-seq data of Valdés-López et al. (2014) and there are 3986 significantly differentially expressed genes with q-value less than 0.05 (see Supplementary Figure S2 for the empirical distribution of parameter estimates for the 3986 declared DE genes.) For each simulation, we randomly sample 2000 genes out of the 3986 declared DE genes in real data analysis, and use their estimated as the true parameters for the simulated DE genes, and then randomly sample 8000 genes out of the rest genes and use their estimated for the true parameters of the simulated non-DE genes with
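As an illustration of how counts for a single gene might be generated in setting I, the following Python sketch assumes that the log mean is the sum of a genotype-by-condition mean parameter, a replicate effect, a pot effect nested within replicate, and a sample effect, with the scaling factor fixed at one. The exact form of model (1) is not reproduced here, so this structure, the function names and the layout of `beta` are assumptions made for the example.

```python
import math
import random


def rpois(lam, rng):
    """Draw a Poisson(lam) variate by counting unit-rate arrivals in [0, lam]."""
    t, k = 0.0, 0
    while True:
        t += rng.expovariate(1.0)
        if t > lam:
            return k
        k += 1


def simulate_gene_setting1(beta, var_rep, var_pot, var_sample, rng,
                           n_genotypes=4, n_reps=3, n_conditions=2):
    """Simulate counts for one gene under an assumed log-normal Poisson mixed model.

    beta[i][j] is the log mean for genotype i and condition j (hypothetical layout).
    Returns a dict keyed by (genotype, condition, replicate).
    """
    counts = {}
    for k in range(n_reps):
        r_k = rng.gauss(0.0, math.sqrt(var_rep))               # replicate effect
        for i in range(n_genotypes):
            p_ik = rng.gauss(0.0, math.sqrt(var_pot))          # pot nested in replicate
            for j in range(n_conditions):
                e_ijk = rng.gauss(0.0, math.sqrt(var_sample))  # sample effect
                log_mu = beta[i][j] + r_k + p_ik + e_ijk       # scaling factor is 1
                counts[(i, j, k)] = rpois(math.exp(log_mu), rng)
    return counts


# A non-DE gene for genotype 0: equal log means under both conditions.
beta = [[3.0, 3.0], [3.2, 3.5], [2.8, 3.1], [3.0, 3.4]]
counts = simulate_gene_setting1(beta, 0.05, 0.05, 0.1, random.Random(7))
```

With four genotypes, three replicates and two conditions, this yields 24 counts per gene, matching the size of the motivating design.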
4.2 Simulation setting II: negative binomial mixed-effects model
A more widely used approach to model the over-dispersion observed for individual biological samples is the negative binomial model. Therefore, we also generated data according to the following negative binomial mixed-effects model:
where the mean follows the log-linear mixed model (4). All the notation here has the same interpretation as in model (1), except that the negative binomial distribution has a dispersion parameter ϕg (which replaces the random sample effect in model (1)).
For this simulation model, we generated the library scaling factor lijk, the mean parameters and the variance components of the random effects in the same way as in simulation setting I. In addition, we generated the dispersion parameters ϕg from a gamma distribution with shape parameter 0.85 and scale parameter 0.5, which was used by Robinson and Smyth (2007) in simulation studies to match the empirical distribution of dispersion estimates of a human dataset.
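A negative binomial count with mean μ and dispersion ϕ can be drawn as a gamma-Poisson mixture, which the following Python sketch illustrates. It assumes the common parameterization in which the variance is μ + ϕμ²; the function names are hypothetical, and in the full simulation μ would also carry the random replicate and pot effects.

```python
import random


def rpois(lam, rng):
    """Draw a Poisson(lam) variate by counting unit-rate arrivals in [0, lam]."""
    t, k = 0.0, 0
    while True:
        t += rng.expovariate(1.0)
        if t > lam:
            return k
        k += 1


def rnegbin(mu, phi, rng):
    """Draw a negative binomial count with mean mu and variance mu + phi * mu**2."""
    # Gamma-Poisson mixture: lam ~ Gamma(shape=1/phi, scale=mu*phi), Y | lam ~ Poisson(lam).
    lam = rng.gammavariate(1.0 / phi, mu * phi)
    return rpois(lam, rng)


rng = random.Random(2016)
phi_g = rng.gammavariate(0.85, 0.5)  # gene-level dispersion: shape 0.85, scale 0.5
y = rnegbin(50.0, phi_g, rng)        # one count for a gene with mean 50 (illustrative)
```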
4.2.1 Simulation setting III: semi-parametric models with random effects
In the previous two simulation settings, the count is assumed to follow a parametric distribution (either log-normal Poisson or negative binomial), which may not always describe real data well. To mimic real data better, we use a real human RNA-seq dataset that contains multiple biological replicates and was used by Kvam et al. (2012) for a nonparametric simulation study. The data they used contained RNA-seq read counts obtained from 69 lymphoblastoid cell lines derived from unrelated Nigerian individuals and sequenced at Yale. Since the 69 samples were essentially biological replicates of the same condition, no differential expression across the samples is expected after normalization, and hence the normalized data can be used to generate biological samples in the same group without making any distributional assumptions.
We pre-processed the RNA counts to obtain normalized data by first excluding genes with zero counts across all 69 libraries and then normalizing all libraries by the TMM method in the edgeR package to obtain counts per million (CPM) normalized counts. We further excluded low-count genes whose sum of normalized counts across the 69 libraries was less than 69. There were 15,115 genes left after this preprocessing. For each simulation, 24 libraries were randomly chosen from the 69 samples and assigned to different combinations of the genotype, condition, replicate and pot factor levels according to the experimental design of Valdés-López et al. (2014). For instance, the first two libraries were labeled as observations under the treatment and control conditions for genotype 1 in pot one and replicate one, the following two libraries were labeled as observations under treatment and control for genotype 2 in pot two and replicate one, and so on. Then, for these 24 libraries, 10,000 genes were randomly selected from the 15,115 filtered genes for each simulation. At this point we expect no differential expression between any two levels of any experimental factor, and all 24 samples should be treated as biological replicates of the same condition.
In order to maintain the nested design structure of Valdés-López et al. (2014), we need to impose the replicate effects and pot effects. We generated random replicate effects and pot effects independently from normal distributions in exactly the same way as in simulation setting I. We then multiplied the normalized counts for the kth replicate and ith genotype by a factor that imposes the replicate and pot effects. Finally, we multiplied the resulting value by a second factor that imposes the combined genotype and condition effect, where the parameters were generated in the same way as in simulation setting I to allow 20% (i.e. 2000) of the genes to be DE between treatment and control for genotype one. The final values were rounded to the nearest integers to give the simulated counts. Note that since we used the normalized data as the biological samples, the simulated counts are comparable across libraries, which implies that the scaling factor lijk is fixed at one. With this simulation strategy, the average expression level still follows the log-linear mixed-effects model (4); however, the distribution of the counts is nonparametric. We therefore consider this a semi-parametric setting.
In summary, all three simulation settings keep the original split-plot design structure of our motivating example. The parameters used to generate the data are the same for all three simulation settings, except that the baseline data are generated differently. Supplementary Figure S3 shows box plots of the median coefficient of variation of all genes across 100 simulations under each of the three simulation settings with all parameters matched. As expected, the data generated in simulation settings II and III are noisier than those in setting I, because the parameters used to generate the baseline data for setting I are based on plant data while those for settings II and III are based on human data.
The motivating example described in Section 2 has four genotypes, three replicates and two conditions, for a total of 24 observations per gene. We also conducted simulation studies with only two genotypes to mimic a smaller experiment with only half of the data, while still keeping the split-plot design structure with both the random pot and replicate effects. These simulation studies have 12 observations per gene and were run for all three simulation settings.
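The scaling step in this setting can be sketched in Python as follows. The sketch assumes that, because the mean model is log-linear, each effect is imposed by multiplying the normalized count by the exponential of that effect. The exact factors are not reproduced in the text above, so the exponential form, the function name and its arguments are assumptions made for illustration.

```python
import math


def impose_effects(baseline, rep_effect, pot_effect, beta_effect):
    """Scale a normalized baseline count by exp(effects) and round to an integer.

    baseline    : normalized count taken from a real library
    rep_effect  : random replicate effect on the log scale (assumed)
    pot_effect  : random pot effect on the log scale (assumed)
    beta_effect : genotype-by-condition effect on the log scale (assumed)
    """
    value = baseline * math.exp(rep_effect + pot_effect)  # replicate and pot effects
    value *= math.exp(beta_effect)                        # genotype-by-condition effect
    return int(math.floor(value + 0.5))                   # round to the nearest integer


# A treated sample of a DE gene with a two-fold increase and no random effects.
count = impose_effects(10.0, 0.0, 0.0, math.log(2.0))
```

Rounding uses floor(x + 0.5) rather than Python's built-in `round`, which rounds exact halves to the nearest even integer.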
4.3 Simulation results
For all simulation studies, we use five different packages: (1) the R/lme4 package (version 0.999999_0) for fitting a Poisson mixed-effects model (Blekhman et al., 2010) using all the data; (2) the edgeR GLM of McCarthy et al. (2012) (version 3.0.8) for fitting a GLM that treats the random effects as fixed effects and uses all the data; (3) the edgeR GLM of McCarthy et al. (2012) (version 3.0.8) for paired data, using only the partial data pertinent to the genotype of interest; (4) the baySeq package of Hardcastle and Kelly (2013) (version 1.12.0) for the paired data analysis using only partial data; (5) the DESeq package of Anders and Huber (2010) (version 1.10.1) for the paired data analysis using only partial data.
We evaluate each approach's performance according to two criteria: (1) its ability to rank the true DE genes at the top of the final list of significant genes; (2) its actual FDR and power after controlling the FDR at nominal levels. For each simulation setting, the three approaches with the five packages were applied to the simulated data to identify DE genes between treatment and control for the genotype of interest (genotype one), and the declared DE genes were compared with the truth so that measures such as the number of true positives and the number of false positives could be calculated. For each simulated dataset, we filtered low-count genes as we did for the real data analysis. For the DESeq, baySeq and lme4 packages, we set the normalization factor to one, since the simulated data were generated so that all libraries are comparable. For the edgeR package, one needs to set the scaling factor to the inverse of the original library size so that the effective library sizes are equal. For each of the three simulation settings, this process was run 100 times and the results were averaged over the 100 simulations to produce summary statistics.
Ignoring the random effects by treating them as fixed or by focusing on pairwise comparison with partial data in a multi-factor experiment can lead to increased number of false positives among the top selected genes.
To compare how well different approaches rank the true DE genes at the top, a common approach is to draw ROC curves. ROC curves reflect the ranking of true DE genes among all genes: the higher the ROC curve, the better the method ranks the true DE genes at the top. However, for RNA-seq analysis, biologists are mostly interested in the top selected genes, usually fewer than 1000 genes. Therefore, the false discovery (FD) plot, which shows the number of false positives among the top selected genes versus the number of selected genes, is a preferred tool for comparing methods since it highlights the performance for the top-ranked genes (Robinson and Smyth, 2007).
Figure 3 shows the FD plots for all methods under the various simulation settings. Specifically, genes are ranked according to their p-values (or estimated posterior probabilities). Since there are 2000 true DE genes, the number of false discoveries (y-axis, presented on a log scale) is plotted against the number of top selected genes (x-axis) up to 2000 genes. It is clear that the GLMM approach outperforms the other approaches in almost all cases. In addition, the advantage of the GLMM method over the other methods is more pronounced in the right panels than in the left panels. This can be explained by the fact that the right panels show the results for simulation settings with four genotypes (a total of 24 observations per gene), while the left panels correspond to the simulation settings with two genotypes (a total of 12 observations per gene). Since the paired data analyses use only the six libraries pertaining to the genotype of interest, the loss of information is much more serious when there are 24 libraries in total than when there are 12.
We also notice that the advantage of the GLMM approach over the other approaches is more obvious for simulation settings II and III (bottom four panels of Figure 3) than for simulation setting I (the top panels). Recall that the split-plot unit level variation for data simulated in settings II and III mimics that of the human data, while setting I mimics that of the plant data. Therefore, settings II and III produce much noisier data than setting I, as shown in Supplementary Figure S3, with setting II being the noisiest. The noise level adversely affects the performance of all methods, which can be observed by comparing, across the panels of Figure 3, the minimum number of selected genes at which the first false discovery occurs. For simulation settings II and III, however, the advantage of the GLMM method over the other approaches (particularly the methods based on GLM models) is bigger, even for the two-genotype settings. This suggests that when the data have several large sources of variation, the negative binomial GLM is inadequate to fit the data well. On the other hand, the GLMM approach takes into account the design structure of the data, and hence it performs best in terms of the FD plots even though the model is misspecified for the GLMM method under simulation settings II and III.
Interestingly, the paired baySeq outperforms the paired edgeR and DESeq and is the most similar to the GLMM for settings II and III. A possible reason is that it handles the correlation structure between the paired data by modeling the pair as a whole unit. In addition, it uses a beta-binomial model to account for the overdispersion caused by the pair (pot/replicate) effect. On the other hand, the paired edgeR and DESeq simply treat the random pair (pot/replicate) effect as fixed, which accounts for neither the correlation structure of the pair nor the variation due to the random pair (pot/replicate) effect. Still, the paired baySeq is not as good as the GLMM method because it discards the information in half or more of the data.
In summary, when the data arise from a complex design with multiple sources of variation, ignoring the random effects can lead to increased false positives among the top selected genes. This problem can be aggravated when the noise level of the data is high or when a large proportion of the data is discarded in the analysis.
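The quantity shown in an FD plot can be computed directly from a ranked gene list. The following Python sketch is a minimal illustration rather than the code used in this study; the function name and arguments are hypothetical.

```python
def false_discovery_curve(p_values, is_true_de, max_top=None):
    """Return a list whose entry n-1 is the number of false positives
    among the n top-ranked genes (smallest p-values first)."""
    order = sorted(range(len(p_values)), key=lambda g: p_values[g])
    if max_top is None:
        max_top = len(order)
    curve = []
    false_count = 0
    for g in order[:max_top]:
        if not is_true_de[g]:
            false_count += 1      # a selected gene that is not truly DE
        curve.append(false_count)
    return curve


# Four genes; genes 0 and 3 are truly DE.
p_values = [0.01, 0.2, 0.03, 0.5]
is_true_de = [True, False, False, True]
curve = false_discovery_curve(p_values, is_true_de)  # [0, 1, 2, 2]
```

Plotting `curve` against 1, 2, ..., `max_top` gives the FD plot for one method. A method that supplies posterior probabilities could be ranked by one minus the posterior probability of DE instead of the p-value.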
When the nominal FDR level is fixed, ignoring the random effects in a multifactor experiment can lead to lower power, especially when the data is noisy. However, a word of caution is that the actual FDR levels can be out of control when the model is clearly misspecified with high noise level.
To evaluate FDR and power after controlling the FDR at a nominal level, we declare a gene to be DE if its q-value (Storey, 2002) is smaller than the nominal level and, in addition, its fold change estimate is greater than 2. We consider two nominal levels for FDR control: 0.05 and 0.1. Since the paired baySeq does not provide a fold change estimate, it is excluded from this analysis. Figures 4 and 5 show box plots of the empirical false discovery rates and the empirical true positive rates, respectively, of the different procedures when the nominal FDR level is controlled at 0.05 and 0.1. The empirical false discovery rate is defined as the fraction of false positives among the declared significant genes, and the empirical true positive rate is defined as the proportion of true differentially expressed genes that are detected.
Figure 4 shows that all methods have the actual FDR under control after applying both the q-value and fold change criteria when the data are generated from the Poisson mixed-effects model with a small noise level (see top panels). However, for simulation setting II, the GLMM, the paired edgeR and the edgeR GLM all fail to control the FDR at nominal levels (see middle panels of Figure 4). For this simulation setting, the data are generated from the negative binomial mixed-effects model, while the GLMM approach fits a Poisson mixed-effects model to the data and the paired edgeR and edgeR GLM both fit a negative binomial fixed-effects model to the data. This is therefore a case in which the model is clearly misspecified for all methods, and we would not necessarily expect the true FDR to be controlled at the nominal level, because proper FDR control depends on the correct distribution of the null p-values. When the model is misspecified, the null distribution of the p-values is no longer expected to be uniform, and hence FDR control using the q-value would not be valid. (The simulation results are similar when the Benjamini and Hochberg (1995) method is used to control the FDR; these results are not reported here.) For simulation setting III (bottom panels of Figure 4), the actual FDR of the above three methods is closer to the nominal levels than in setting II, although it is still out of control. A possible reason is that the data are less noisy in simulation setting III than in simulation setting II, and hence it is easier to control the FDR. Note that the paired DESeq is consistently very conservative even though its model assumption is also violated.
The box plots of the empirical true positive rate in Figure 5 show that the ranking of the methods by power is consistent across all simulation settings. The GLMM method is the most powerful, the paired edgeR comes second and the paired DESeq is the least powerful. When the noise level of the data is small (top panels), the paired edgeR and the edgeR GLM do not lose much power compared to the GLMM. The power loss is much more pronounced, however, when the noise level of the data increases (see the middle and bottom panels of Figure 5). Nevertheless, the power comparison for settings II and III must be interpreted with caution, since the actual FDR levels are not controlled in these cases due to model misspecification.
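The two empirical rates can be computed as in the following Python sketch. Note two assumptions: Benjamini-Hochberg adjusted p-values are used here as a simple stand-in for Storey's q-values, and the fold change criterion is applied in either direction on the log2 scale. The function names and arguments are hypothetical.

```python
import math


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values (a stand-in for q-values here)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda g: p_values[g])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest p-value to the smallest
        g = order[rank - 1]
        running_min = min(running_min, p_values[g] * m / rank)
        adjusted[g] = running_min
    return adjusted


def empirical_fdr_tpr(p_values, log2_fc, is_true_de, level=0.05, min_fold=2.0):
    """Empirical FDR and TPR when a gene is declared DE if its adjusted
    p-value is below `level` and its absolute fold change exceeds `min_fold`."""
    adj = bh_adjust(p_values)
    cutoff = math.log2(min_fold)
    declared = [a < level and abs(fc) > cutoff for a, fc in zip(adj, log2_fc)]
    n_declared = sum(declared)
    true_pos = sum(d and t for d, t in zip(declared, is_true_de))
    fdr = (n_declared - true_pos) / n_declared if n_declared else 0.0
    tpr = true_pos / sum(is_true_de) if any(is_true_de) else 0.0
    return fdr, tpr


# Four genes: two are declared DE, and one of those is a false positive.
fdr, tpr = empirical_fdr_tpr([0.001, 0.002, 0.003, 0.8],
                             [2.0, -1.5, 0.5, 3.0],
                             [True, False, True, True])
```

When no gene is declared DE, the empirical FDR is reported as zero by convention in this sketch.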
5 Conclusion and discussion
We examined the consequences of ignoring random effects in a multi-factor experiment for RNA-seq analysis. By comparing a method that allows for random effects (GLMM) with other methods (edgeR, baySeq, DESeq) that ignore them, we show that either using partial data or treating the random effects as fixed leads to an increase in false positives or to low power. In a real RNA-seq data analysis, we use bioinformatics tools to verify that the DE genes identified only by the GLMM approach are consistent with those identified by both the GLMM and other approaches. In simulation studies, we illustrate the advantages of the GLMM in ranking true DE genes at the top of the candidate list under various simulation settings. Even when the model is misspecified for the GLMM method, it ranks the true DE genes at the top better than the other methods that ignore the random effects. This is particularly true when the noise level of the data is high or when a large proportion of the data is discarded by focusing on a paired analysis.
Among the three paired analysis packages, the paired baySeq performs best in terms of its ability to rank the true DE genes at the top when the data have large variation. One possible reason is that the paired baySeq takes into account the correlation structure by treating each pair as a whole, and also accounts for the overdispersion due to the random pair effect through the beta-binomial model. In contrast, the paired edgeR and DESeq treat the random pair effect as a fixed effect in their negative binomial log-linear model, which does not account for the correlation structure of the pair or the extra variation introduced by the random pair effect.
When the nominal FDR level is fixed, ignoring the random effects in a multi-factor experiment can lead to lower power compared to the GLMM method, especially when the data are noisy. However, a word of caution is that many approaches, including the GLMM method and the edgeR GLM method, can be sensitive to model misspecification in terms of the actual FDR control, especially when the data are noisy. Therefore, one needs to use caution in interpreting the results in such cases. For our real data analysis, we use bioinformatics tools to further validate the results of the GLMM method when the nominal FDR level is controlled at 0.05.
As pointed out by Lund et al. (2012), in most cases only a limited number of genes can be followed up with further study after DE genes are identified, due to resource constraints. Hence in practice, instead of controlling the FDR at a nominal level, reporting a fixed number of the most significant genes is another way to summarize the analysis results. In this regard, the GLMM ranks the true DE genes at the top better than the approaches that ignore the random effects. However, when there are not many true DE genes between treatment and control, say fewer than 100, and one still arbitrarily chooses to follow up the top 200 genes, it can be a waste of money and resources. A possible solution is to combine the two approaches for real data analysis. If thousands of significant DE genes are detected at a nominal FDR level, then one can focus on just the top few hundred genes. But if only a few significant DE genes are detected at the nominal FDR level, then the list obtained under FDR control can be more informative than an arbitrarily chosen fixed number of top genes. For such cases, the GLMM approach is still preferred because it is a more powerful procedure than those that ignore the random effects.
In this paper, we chose the standard Poisson mixed-effects model without shrinkage estimation as the representative of methods allowing for random effects in order to make the following point: when the design structure of the real data is taken into account, a method without shrinkage variance estimates can do a better job than methods with shrinkage dispersion estimates that do not adequately model all the sources of variation observed in the data. In other words, shrinkage variance estimation does not completely make up for the damage caused by ignoring the random effects in a multi-factor experiment. That being said, we are not claiming that the standard Poisson mixed-effects model is necessarily the best method to analyze such data. In fact, we expect that the standard Poisson mixed-effects model can be improved in a similar way to how the NB GLM is improved by the edgeR package using shrinkage estimation of the dispersion parameter. The ShrinkSeq package (Van De Wiel et al., 2013) is one such choice that not only allows for random effects in the model but also uses shrinkage estimation for both the mean and the variance. However, control of the false discovery rate remains problematic for the ShrinkSeq package, as pointed out by Soneson and Delorenzi (2013).
Note that an alternative to the Poisson mixed-effects model is the negative binomial mixed-effects model. However, the negative binomial mixed-effects model is sensitive to zero variance components in the model. When one or more variance components of the random effects are zero, fitting the model can be computationally unstable: it either converges very slowly or fails to converge at all (Booth et al., 2003). For RNA-seq data, genes demonstrate great heterogeneity in terms of their sources of variation. Some genes can have stable expression levels across biological samples and hence may not demonstrate overdispersion at the individual sample level, as pointed out by Auer and Doerge (2011), while others demonstrate large overdispersion. Similarly, some genes may have one or more zero variance components for some random effects in the model, while other genes demonstrate large variability due to these random effects. Therefore, if one wants to fit the negative binomial mixed-effects model to RNA-seq data without convergence failures, one needs to adjust the model individually for each gene, which can be computationally intensive and time consuming.
Agresti, A. (2002): Categorical data analysis, 359, Hoboken, NJ, USA: John Wiley & Sons.
Benjamini, Y. and Y. Hochberg (1995): “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” J. Roy. Stat. Soc. B Met., 57, 289–300.
Blekhman, R., J. C. Marioni, P. Zumbo, M. Stephens and Y. Gilad (2010): “Sex-specific and lineage-specific alternative splicing in primates,” Genome Res., 20, 180–189.
Chung, L. M., J. P. Ferguson, W. Zheng, F. Qian, V. Bruno, R. R. Montgomery and H. Zhao (2013): “Differential expression analysis for paired RNA-seq data,” BMC Bioinformatics, 14, 110.
Hardcastle, T. J. and K. A. Kelly (2013): “Empirical Bayesian analysis of paired high-throughput sequencing data with a beta-binomial distribution,” BMC Bioinformatics, 14, 135.
Johnston, P. R., O. Makarova and J. Rolff (2013): “Inducible defenses stay up late: temporal patterns of immune gene expression in Tenebrio molitor,” G3 (Bethesda), 4, 947–955.
Leng, N., J. A. Dawson, J. A. Thomson, V. Ruotti, A. I. Rissman, B. M. Smits, J. D. Haag, M. N. Gould, R. M. Stewart and C. Kendziorski (2013): “EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments,” Bioinformatics, 29, 1035–1043.
Lund, S., D. Nettleton, D. McCarthy and G. Smyth (2012): “Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates,” Stat. Appl. Genet. Mol. Biol., 11, 8.
McCarthy, D. J., Y. Chen and G. K. Smyth (2012): “Differential expression analysis of multifactor RNA-seq experiments with respect to biological variation,” Nuc. Acids Res., 40, 4288–4297.
Moriya, Y., M. Itoh, S. Okuda, A. C. Yoshizawa and M. Kanehisa (2007): “KAAS: an automatic genome annotation and pathway reconstruction server,” Nuc. Acids Res., 35, W182–W185.
Qiu, F., F. Yu and J. Meza (2015): “Evaluation of statistical methods for differential expression analysis of RNA-seq data with paired data design,” in 143rd APHA Annual Meeting and Exposition (October 31-November 4, 2015), APHA.
Stroup, W. W. (2015): “Rethinking the analysis of non-normal data in plant and soil science,” Agron. J., 107, 811–827.
Valdés-López, O., S. M. Khan, R. J. Schmitz, S. Cui, J. Qiu, T. Joshi, D. Xu, B. Diers, J. R. Ecker and G. Stacey (2014): “Genotypic variation of gene expression during the soybean innate immunity response,” Plant Genet. Resour., 12, S27–S30.
Van De Wiel, M. A., G. G. Leday, L. Pardo, H. Rue, A. W. Van Der Vaart and W. N. Van Wieringen (2013): “Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors,” Biostatistics, 14, 113–128.
Wang, Z., X.-C. Zhang, M. H. Le, D. Xu, G. Stacey, and J. Cheng (2011): “A protein domain co-occurrence network approach for predicting protein function and inferring species phylogeny,” PLoS One, 6, e17906.
Wang, Z., R. Cao and J. Cheng (2013): “Three-level prediction of protein function by combining profile-sequence search, profile-profile search, and domain co-occurrence networks,” BMC Bioinformatics, 14, S3.
The online version of this article (DOI: 10.1515/sagmb-2015-0011) offers supplementary material, available to authorized users.
About the article
Published Online: 2016-02-29
Published in Print: 2016-04-01