How to Use edgeR, DESeq2, and t-test in RNA-Seq Differential Expression Analysis - Practical Rules by Data Type

  • Gene Expression
  • High-Throughput Sequencing

In RNA-Seq differential expression analysis, statistical methods designed for Gene Counts, such as edgeR and DESeq2, are commonly used. These methods are designed to account for the characteristics of RNA-Seq Counts, and they are widely used as standard methods for DEG analysis.

On the other hand, in real RNA-Seq data, the nature of the data differs depending on the data type, such as whether the samples come from an in vitro experiment or biopsy specimens, whether the number of replicates is large or small, whether the input RNA amount is high or low, and whether the study has a paired design. As a result, the expression patterns of genes judged to be significant also differ.

In this page, we first summarize the basic differences among edgeR, DESeq2, and t-test. Then, using four representative data types, we examine what kinds of genes tend to be judged significant by each method, and consider how these methods can be used appropriately in practice.

Basic differences between t-test and edgeR / DESeq2

The major difference between t-test and edgeR / DESeq2 lies in how they handle variance.

A t-test evaluates whether the difference between groups is large based on the observed variation among samples for each gene. In other words, it strongly depends on the within-group variance observed for each gene.

In contrast, edgeR and DESeq2 use not only the Counts of the gene itself, but also the variation of other genes with similar Count levels, to estimate the relationship between the mean and variance across the RNA-Seq dataset.

RNA-Seq Gene Counts have a characteristic structure: in high-expression regions, relative variation tends to be small, whereas variation increases as expression levels become lower. In addition, in low-expression regions, the chance of detection becomes stronger; a gene may be detected in one sample but not detected in another.

edgeR and DESeq2 evaluate differential expression while taking this mean-variance relationship of Count data into account. As a result, when genes judged significant by edgeR or DESeq2 are viewed on a scatter plot, they often appear as a distribution that moves away from the diagonal toward the low-Count side, reflecting the variance structure characteristic of RNA-Seq Counts. However, this does not mean that all differences in the low-Count region are reliable. For this reason, in typical protocols using edgeR or DESeq2, low-expression genes are usually removed from the analysis.

With these characteristics in mind, we next look at four representative data types, how the results of edgeR, DESeq2, and t-test differ, and how these methods can be used in practice.

Case 1: In vitro data with low variance and a small number of samples

A detailed examination of this case is provided in a case study comparing edgeR, DESeq2, and t-test in low-variance in vitro data with a small number of samples.

In in vitro experiments, because there is no variation derived from individual differences, within-group variance may appear very small. Furthermore, when the number of replicates is extremely small, such as n = 2, the variance of a gene can easily be underestimated simply because the two values happen to be close to each other.

Of course, this kind of experimental design is far from what statistical models assume. However, it is also a condition commonly seen in real RNA-Seq data. Therefore, it is important to know what kinds of genes tend to be judged significant by each method under such conditions.

When we actually calculated the results, the observed within-group variance was so small that, regardless of the method used, even genes with very small expression changes were judged significant.

Therefore, in this type of case, it is better not to extract DEGs based on p-values alone, but to combine p-value filtering with a fold change condition. When a fold change condition is added, genes that were significant only because of very small expression changes are removed, and the differences among DEG lists obtained by the three methods become much smaller.

However, even after adding a fold change condition, differences among methods still remain. Genes that are significant by edgeR or DESeq2 but not by t-test tend to have relatively large observed within-group variance. This is because the variance is estimated not only from the observed variance of that gene itself, but also from the variation of other genes with similar Count levels. In low-variance in vitro data, a smaller variance than the observed variation may be assigned, and as a result, edgeR or DESeq2 may produce more liberal significance calls than t-test.

Conversely, genes significant only by t-test tend to have extremely small observed within-group variance. For such genes, the p-value in the t-test can become small simply because the replicate values happened to be close to each other. For this type of gene, it is effective to set an appropriate fold change threshold and remove changes that are too small. This idea applies not only to t-test, but also to edgeR and DESeq2.

Case 2: Medium-sized biopsy-derived data

A detailed examination of this case is provided in a case study comparing edgeR, DESeq2, and t-test in medium-sized biopsy-derived data.

Next, let us consider medium-sized biopsy-derived RNA-Seq data. Compared with Case 1, this type of data has more samples, and in signal regions where expression levels are sufficiently high, variance estimation becomes relatively stable. Therefore, in regions where expression levels are sufficiently high, the differences among edgeR, DESeq2, and t-test become small, and it is reasonable to consider that major differentially expressed genes can be detected commonly by any of these methods.

On the other hand, in this case, many genes in the low-expression region were judged significant only by edgeR or DESeq2. When these genes were examined in detail, they did not appear to reflect clear biological expression differences. Rather, they corresponded to shifts in the lower limit of normalized Gene Counts, which were associated with differences in dynamic range among samples.

In RNA-Seq data, it is not unusual for total read counts to differ by several fold among samples. However, differences in total read counts do not correspond simply to differences in dynamic range. This is because multiple factors are involved, such as RNA quality, sample composition, bias toward specific highly expressed genes, and mapping rate. In any case, differences in dynamic range lead to shifts in the lower limit of normalized Gene Counts. When this shift in the lower limit is biased toward one group, it becomes a cause of apparent group differences.

The statistical models of edgeR and DESeq2 are designed to account for the instability of Counts in low-expression regions. However, they do not explicitly model shifts in the lower limit of normalized Gene Counts caused by differences in dynamic range among samples.

Therefore, for genes that become significant only by edgeR or DESeq2 in the low-Count region, it is necessary not to rely on p-values alone, but to check whether they may reflect apparent differences caused by differences in dynamic range.

On the other hand, unlike Case 1, in medium-sized data with a reasonable number of samples, it is not always necessary to uniformly add a fold change condition. In regions where expression levels are sufficiently high, genes with small but consistent significant changes may reflect small biological changes. However, low-change DEGs should be interpreted more cautiously than genes with large expression changes.

Case 3: Paired design data

A detailed examination of this case is provided in a case study comparing edgeR, DESeq2, and paired t-test in paired design data.

The third case is paired design RNA-Seq data. For example, when control and case samples are obtained from the same patients, a paired t-test cancels out individual differences and evaluates in which direction and by how much each pair changes from control to case. Therefore, compared with an unpaired two-group comparison, paired t-test generally has higher detection power.

On the other hand, edgeR and DESeq2 can also account for individual differences by including paired information such as patient in the design. However, edgeR and DESeq2 do not simply evaluate the direction of change within each pair. They also handle the relationship between the mean and variance of Gene Counts, the uncertainty of low Counts, and variance estimation across the entire model.

Therefore, even when paired design is taken into account in edgeR or DESeq2, the number of significant genes does not necessarily increase. In some cases, the number of significant genes may even decrease compared with an unpaired two-group comparison.

Rather than arguing whether paired t-test or edgeR / DESeq2 is correct, it is practically important to know that this type of difference in results can occur.

Case 4: Low-input RNA-Seq data

A detailed examination of this case is provided in a case study comparing edgeR, DESeq2, and t-test in low-input RNA-Seq.

The fourth case is low-input RNA-Seq. In this case, before deciding which statistical method to use, it is necessary to check whether the data itself has enough reproducibility to support DEG analysis.

With extremely low input amounts, such as 100 cells, there was almost no reproducibility among replicate samples. Many of the factors that appeared as large expression differences were derived from whether a transcript happened to be amplified, or happened not to be detected.

In this type of data, the structure is very different from the variance structure of ordinary bulk RNA-Seq assumed by the statistical models of edgeR and DESeq2. Therefore, even if genes are judged significant by edgeR or DESeq2, it is not surprising if many of them reflect apparent differences derived from amplification or detection, rather than biological expression differences.

This does not mean that edgeR or DESeq2 is wrong. Rather, it is a natural consequence of the input data structure being far outside the variance model assumed for ordinary bulk RNA-Seq.

At 1000 cells, a variance structure somewhat resembling RNA-Seq begins to appear on the scatter plot. In other words, the pattern starts to resemble the mean-variance relationship assumed by edgeR and DESeq2: variation is smaller in high-expression regions, and becomes wider toward low-expression regions.

However, the fact that the variance structure begins to appear is different from saying that DEG analysis results from edgeR or DESeq2 can be trusted. Even at 1000 cells, the data are still unstable compared with ordinary bulk RNA-Seq, and randomness in amplification and detection remains.

In this dataset, the 100k-cell results could be used as reference data. When the 1000-cell results were compared with the 100k-cell results, consistent changes were confirmed only for very highly expressed genes with Gene Counts of 5000 or higher.

In other words, even if 1000-cell data appear to approach the variance structure assumed by edgeR and DESeq2, the DEG results cannot be trusted as they are. DEG lists obtained from low-input RNA-Seq data should be handled extremely cautiously, for example by limiting interpretation to very highly expressed genes.

Practical rules by data type

Data type Main issue Points to note for edgeR / DESeq2 Points to note for t-test Practical rule
Low-variance in vitro data with a small number of samples Within-group variance appears extremely small. Because n is small, gene-level variance estimation is unstable. Because variance is estimated using genes with similar Count levels, genes with relatively large observed variance can still become significant. In this case, edgeR or DESeq2 may appear more liberal than t-test. Genes with extremely small observed within-group variance are likely to become significant. Use a fold change condition in addition to p-values. Adding a fold change condition reduces the differences among the three methods.
Medium-sized biopsy-derived data The number of samples is moderate, but individual differences, tissue composition, total read counts, and differences in detection limits are mixed together. In low-Count regions, apparent differences caused by differences in dynamic range may be judged significant. Results are easier to interpret in signal regions, but caution is needed in low-Count regions and regions with missing values. Apply a low signal cutoff to normalized Gene Counts based on the sample with the narrowest dynamic range.
DEG extraction in low-Count regions should be considered separately.
Paired design If individual differences are not taken into account, baseline differences among patients are mixed into the evaluation of group differences. Individual differences can be accounted for by including patient in the design, but because uncertainty in the whole Count model is also evaluated, the number of significant genes does not necessarily increase. Because paired t-test directly evaluates within-pair differences, it tends to detect small changes that are consistent within patients. However, these cannot simply be dismissed as biologically meaningless noise. Always take the paired structure into account. Genes found only by paired t-test should also be checked as small but consistent within-patient changes using clustering or functional analysis.
Low-input RNA-Seq Randomness in amplification and detection is large, and the data structure can deviate from ordinary bulk RNA-Seq. Genes judged significant may reflect apparent differences derived from amplification or detection. Because t-test directly reflects instability in the observed values, large p-value significance or large fold change does not necessarily mean the result is reliable. Do not trust DEG lists based on p-values alone. Interpret the results even more conservatively than they may appear.

Detailed case studies

Conclusion: Use methods according to the nature of the data and the research purpose

In RNA-Seq DEG analysis, it is recommended to use methods such as edgeR or DESeq2 that can take Gene Counts as input. However, using edgeR or DESeq2 does not mean that the results can always be trusted as they are. As shown above, differences in the nature of the dataset can greatly affect both the DEG list obtained and its reliability.

A t-test is not a method to be applied directly to raw RNA-Seq Gene Counts. However, when appropriate preprocessing, normalization, and filtering are performed, and the analysis is limited to the signal region, it can be a useful comparative method for interpreting results in relation to the observed expression patterns.

Based on what kinds of genes each method tends to judge as significant, the method should be selected according to the research purpose. When necessary, it can also be practically useful to combine DEG lists obtained by multiple methods.

Chef Choosing the Right Knife