In RNA-Seq differential expression analysis, RNA-Seq-specific statistical methods such as edgeR and DESeq2 are widely used. These methods are statistical models specifically designed for the characteristics of Gene Counts data.
At the same time, simplified statements such as “You should use edgeR,” “You should use DESeq2,” or “You should not use a t-test for RNA-Seq” are becoming increasingly common. In this article, we examine this issue using an actual dataset.
Characteristics of the dataset: an in vitro experiment with low variability
The dataset used here, GSE49110, is not a biopsy dataset.
It is a cell culture-based RNA-Seq dataset using MDA-MB-231 cells.
The experiment compares cells treated with siRNA targeting ERRα and cells treated with control siRNA.
Each condition was measured in duplicate, so the number of replicates is only two.
On the other hand, the replicate measurements within each condition are very stable,
and the within-group variability is relatively small.
In practice, there are many research datasets of this type. Therefore, we first compare the methods using this kind of dataset and examine the results.
The purpose of this article is not to determine which method is superior. Rather, by clarifying what kinds of expression patterns are judged significant, or not judged significant, when each method is used, we hope this article will serve as a practical reference when choosing an analysis strategy.
Three methods compared
Here, we compared the following three methods for the same comparison between siC and siE1.
- Differential expression analysis using edgeR
- Differential expression analysis using DESeq2
- t-test using data after preprocessing, normalization, and filtering
edgeR and DESeq2 use statistical models designed for RNA-Seq Count data. In contrast, the t-test was applied to values after preprocessing, normalization, and filtering. The t-test was not applied directly to raw Gene Counts or to all measured genes. We reduced unstable genes in the low-Counts region and prepared the data so that a general statistical method could be applied more appropriately.
The specific preprocessing, normalization, and filtering steps were as follows.
- Log2 transformation. Measurements with a value of 0 become missing values at this step.
- Global normalization using the 70th percentile.
- After normalization, values below 20 on the linear scale were replaced with 20.
- Missing values were filled with 16 on the linear scale.
- Values were converted to ratios relative to the average of the siC-treated control samples.
- Genes whose Gene Counts were below 20 in at least one sample in all groups were removed.
When looking only at p-values, differences among methods appear large
First, we compared genes with p < 0.05 in edgeR, DESeq2, and the t-test.
When individual samples are compared directly, small differences in measured values in the low-Counts region can easily become large Fold Changes. As a result, the points spread away from the diagonal line, and the variability appears large.
In contrast, the scatter plots in Fig. 1 compare the average values of the replicate samples in the siC and siE1 groups. When group averages are compared, random variability among individual samples is averaged to some extent. Therefore, the points in the low-Counts region do not spread in the same way as when individual samples are compared. In this dataset, each group average is based on only two samples, so the variability does not strongly converge. However, as the number of samples increases, random variability is averaged further, and the points in the low-Counts region are expected to gather more closely around the diagonal line.
edgeR and DESeq2 determine statistical significance using models that consider the relationship between the mean and variance of Count data. In contrast, the t-test evaluates within-group variability and between-group differences. Therefore, even when the same data are used, the genes judged significant are not the same. As shown in Fig. 2, many genes overlapped among the three methods, but some genes were judged significant or non-significant depending on the method.
Fig1: Comparison of genes with p < 0.05 in edgeR, DESeq2, and the t-test. Black points indicate genes judged significant by each method.
The difference between edgeR and DESeq2 is mainly seen near the boundary
The difference between edgeR and DESeq2 is mainly observed near the boundary between significant and non-significant calls. For genes with clearly large differences, the results of the two methods agree well. However, for genes with p-values near the threshold, the method that judges them significant can differ. (Fig. 2, left)
It has been reported that, in RNA-Seq datasets with a small number of samples, edgeR can show a tendency to detect more candidate genes than DESeq2. In this comparison as well, edgeR judged more genes significant than DESeq2, which is consistent with that tendency. However, the difference was small. This does not mean that edgeR always detects more genes.
Fig2: Comparison of differences among genes with p < 0.05 in edgeR, DESeq2, and the t-test. The difference between edgeR and DESeq2 was mainly observed near the decision boundary, whereas genes judged significant only by the t-test included genes located far from the diagonal line.
Genes judged significant by both edgeR and DESeq2 but non-significant by the t-test
Next, we examined genes that were judged significant by both edgeR and DESeq2, but not significant by the t-test. (Fig. 2, center)
edgeR and DESeq2 do not determine significance using only the variability observed in each individual gene. Instead, they also use the variability of genes with similar Counts values to estimate dispersion. Therefore, genes in high-Counts regions can be judged significant even when their own observed values show some degree of variability.
In contrast, the t-test directly evaluates the within-group variability and between-group difference observed for each gene. Therefore, if the replicate samples show large variability, the gene is less likely to be judged significant.
With this difference in mind, please look at Fig. 2B. Compared with the genes judged significant by all three methods (left), the genes judged significant only by edgeR and DESeq2 showed a characteristic pattern of larger variability (right). Because this dataset is an in vitro experiment with very small overall variability, these genes may have passed the significance threshold more easily in the RNA-Seq-specific models.
Fig2B: Comparison between genes judged significant by all three methods and genes biased toward edgeR or DESeq2. Compared with genes commonly detected by all three methods, genes biased toward edgeR or DESeq2 included genes with larger within-group variability.
From a statistical perspective, when the number of replicates is small, there is a high probability that variability will appear small by chance. In other words, if large variability is observed even with a small number of replicates, it may be more likely to reflect a genuinely high-variability characteristic of that gene than to be a random observation of an outlier.
The model design of edgeR and DESeq2 depends more on the variability of genes with similar Counts values than on the observed variability of each individual gene. This design may not always work in a desirable direction for biological interpretation.
Genes judged significant by the t-test but non-significant by edgeR and DESeq2
Next, we examined genes that were judged significant by the t-test but not significant by either edgeR or DESeq2. (Fig. 3)
What is particularly interesting is that there appear to be two groups: genes near the p-value threshold, which appear near the diagonal line in the scatter plot, and genes with large expression changes, which appear far from the diagonal line. The group near the diagonal line is probably related to the actual variance of the measured values, as discussed above. Here, we focus on the group farther from the diagonal line, shown in black on the left side of Fig. 3.
When these genes were examined individually using Line Graphs, they showed high reproducibility within each group and were clearly separated between groups. Some of the genes judged significant only by the t-test were not low-quality genes with large variability, but instead included genes worth retaining as biologically relevant candidates. For this kind of expression pattern, it is natural that the t-test gives an extremely low p-value, and that is exactly what was observed.
We then considered why these genes were excluded from the significant gene lists of edgeR and DESeq2. DESeq2 has a mechanism that detects extreme count outliers based on Cook's distance. For such genes, the p-value can be set to NA, or the outlier count can be replaced and the model refitted under certain conditions. edgeR also uses robust estimation and quasi-likelihood-based testing to reduce the influence of outliers and excessive variability.
Therefore, even when an individual gene shows stable replicate measurements within each group and a large difference between groups, it can still be treated as an outlier-like expression pattern or dispersion structure in RNA-Seq-specific models and excluded from the significant gene list. In this dataset, this kind of model-based handling may have contributed to the result.
When using edgeR or DESeq2 for DEG analysis, this is a particularly important feature to understand.
Fig3: Example of genes judged significant by the t-test but excluded from the significant gene lists of edgeR and DESeq2. These genes are located far from the diagonal line in the scatter plot, and the Line Graph shows high reproducibility within groups and a large difference between groups.
Preprocessing and filtering help manage the risk in the low-Counts region
If a t-test is applied without preprocessing and filtering, genes in the low-Counts region are judged significant and become mixed into the DEG list. On the other hand, after normalization, values below the Low Signal Cutoff can be raised to the threshold, and genes that consist only of low-Counts measurements and are unlikely to contribute to the analysis can be filtered out. This should greatly reduce candidates from the low-Counts region. To confirm this, we compared genes judged significant after only normalization and log transformation with genes judged significant after appropriate preprocessing and filtering. (Fig. 4)
The difference between the DEG lists before and after processing showed that the removed genes were concentrated in the low-Counts region. This indicates that preprocessing and filtering function as risk management against the low reliability of the low-Counts region.
edgeR and DESeq2 automatically apply statistical models designed for RNA-Seq Count data. This is very convenient, but within the statistical model itself, there is limited room for the analyst to adjust how much of the Counts range should be included in the interpretation according to the biological purpose.
In contrast, when preprocessing and filtering are performed manually while looking at the data, the Low Signal Cutoff threshold can be flexibly changed according to the purpose of the analysis. For example, it can be set somewhat lower for exploratory analysis, or higher for more stringent biomarker candidate selection. Being able to adjust this decision according to the purpose of the analysis has biological interpretive value.
Fig4: Comparison between the t-test without preprocessing/filtering and the t-test after preprocessing/filtering. Genes removed from the DEG list by preprocessing and filtering were concentrated in the low-Counts region, showing that this process helps reduce unstable candidates from the low-Counts region.
Adding Fold Change makes the DEG list closer to a practical candidate list
In actual differential expression analysis, DEG lists are not always created using p-values alone. It is common to add a Fold Change threshold to focus on changes that are easier to interpret biologically. Here, we compared genes with p < 0.05 and at least a 1.4-fold expression difference among edgeR, DESeq2, and the t-test applied to preprocessed, normalized, and filtered data. (Fig. 5)
Fig5: Comparison of genes with p < 0.05 and at least a 1.4-fold expression difference in edgeR, DESeq2, and the t-test. By adding a Fold Change threshold, the differences among methods are examined in a form closer to a practical DEG candidate list.
As a result, compared with the p-value-only comparison, the proportion of genes commonly detected by all three methods increased, and the differences among methods became extremely small (Fig. 6, Venn diagram). This is highly consistent with the observation in Fig. 2 that differences among methods were concentrated near the diagonal line, because many of those genes were removed by the Fold Change threshold.
As described above, genes judged significant only by the t-test included genes located far from the diagonal line in the scatter plot. Because these genes are not affected by adding a Fold Change threshold, they remained in the list. These genes have sufficiently large Fold Changes and clearly visible differences between groups. (Fig. 6, right)
On the other hand, genes judged significant by edgeR or DESeq2 but not by the t-test also showed a characteristic pattern. When examined using Line Graphs, these genes included genes with relatively large variability among replicate samples within each group. (Fig. 6, left)
This result shows that the simple idea that “a t-test should not be used for RNA-Seq” is insufficient. A common impression is that “edgeR and DESeq2 give more conservative results, so these methods should be used instead of a t-test.” However, the result of this examination was the opposite: edgeR and DESeq2 gave more liberal results.
Fig6: After filtering genes to those with p < 0.05 and at least a 1.4-fold expression difference, we compared genes more likely to be detected on the edgeR/DESeq2 side and genes more likely to be detected on the t-test side. The left panel shows genes judged significant by edgeR or DESeq2 but not by the t-test. The right panel shows genes judged significant by the t-test but not by edgeR or DESeq2.
However, the dataset used here is an in vitro experimental dataset with extremely small variability among replicate samples. In addition, each condition has only two replicates, making this an extremely small-sample comparison from a statistical perspective. These conditions likely had a large influence on the results. Therefore, datasets such as biopsy samples or clinical specimens, which often have larger variability and larger sample sizes, may show different results.
Therefore, in a future analysis, we would like to compare edgeR, DESeq2, and the t-test applied to preprocessed data using a medium-sized RNA-Seq biopsy dataset, and examine how much of the tendency observed here is reproduced.
On the other hand, in vitro experiments with very few replicates per group are not rare in real RNA-Seq studies. Based only on the results in this article, for this type of low-variability, small-sample in vitro dataset, the t-test applied to preprocessed data appears to identify candidates that are more biologically convincing than those identified by edgeR or DESeq2.
In such cases, it may be worthwhile to perform a t-test as well, rather than relying only on edgeR or DESeq2. Then, the differences among the methods can be compared when selecting candidate genes.
What matters is not which method to trust, but looking at the data
In this comparison, the results of edgeR, DESeq2, and the t-test are not fundamentally contradictory. When looking only at p-values, the differences among methods appear large. However, after adding a Fold Change condition, the differences become much smaller as a practical DEG candidate list.
On the other hand, when the genes included in the differences among methods are examined individually, different characteristics emerge. Genes judged significant by edgeR or DESeq2 but not by the t-test included genes with relatively large within-group variability. Conversely, genes judged significant by the t-test but not by edgeR or DESeq2 included genes with high within-group reproducibility and large between-group differences.
This result is not a simple matter of “edgeR is correct,” “DESeq2 is correct,” or “the t-test is correct.” The statistical model changes how candidate genes are selected. In RNA-Seq differential expression analysis, results should not be judged only by the name of the method. It is important to examine p-values, Fold Change, Counts level, within-group variability, filtering conditions, and the expression patterns of individual genes together. Statistical methods are important tools for narrowing down candidate genes. However, to decide which genes should remain as targets for interpretation, it is necessary to look at the actual data, not only at p-values.
Visualizing differences among DEG lists in Subio Platform
The RNA-Seq Data Analysis Tutorial introduces the workflow from data import to normalization, filtering, and DEG analysis, and also explains in detail what to look at and how to make decisions at each step.
Many people may think that DEG analysis only requires importing Gene Counts into edgeR or DESeq2 and letting the software process them. However, that approach can miss many important points. We encourage you to try the tutorial.
The tutorial page also provides a download link for an SSA file. An SSA file can be imported into Subio Platform so that you can interact with the data yourself. Please try checking the results of this comparison with your own hands.