Comparing edgeR, DESeq2, and t-test in RNA-Seq Differential Expression Analysis (1): Low-Variance, Small-Sample in vitro Data

  • Gene Expression
  • High-Throughput Sequencing

In RNA-Seq differential expression analysis, RNA-Seq-specific statistical methods such as edgeR and DESeq2 are commonly used. These methods are based on statistical models specifically designed for the characteristics of Gene Counts data.

At the same time, simplified statements such as “you should use edgeR,” “you should use DESeq2,” or “you should not use a t-test for RNA-Seq” are becoming increasingly common. In this article, we compare edgeR, DESeq2, and a t-test applied to preprocessed data using an actual dataset.

Dataset characteristics: low-variance in vitro data

The dataset used in this article, GSE49110, is not a biopsy dataset. It is RNA-Seq data from an in vitro experiment using MDA-MB-231 cells.

The comparison is between cells treated with siRNA targeting ERRα and cells treated with control siRNA. Each condition was measured in duplicate, so the sample size is only n = 2 per group. On the other hand, the replicate samples within each condition are highly stable, and the dataset is characterized by relatively small within-group variance.

In practice, this type of research dataset is very common. Therefore, we first use this type of dataset to compare the methods and examine the results.

The purpose of this article is not to determine which method is superior. Instead, the goal is to examine what types of expression patterns are judged as significant, or not judged as significant, when each method is applied. This can provide a practical reference for method selection and result interpretation.

Three methods compared

Here, we compared the following three methods for the same siC versus siE1 comparison.

  • Differential expression analysis using edgeR
  • Differential expression analysis using DESeq2
  • t-test applied after preprocessing, normalization, and filtering

edgeR and DESeq2 use statistical models designed for RNA-Seq Count data. In contrast, the t-test was applied after preprocessing, normalization, and filtering. It was not applied directly to raw Gene Counts or to all measured genes. Instead, low-count unstable genes were reduced in advance, and the data were prepared into a state where a general statistical method could be applied more appropriately.

The preprocessing, normalization, and filtering steps were as follows.

  1. Log2 transformation. Values of 0 became missing values at this step.
  2. Global normalization using the 70th percentile.
  3. After normalization, values below 20 on the linear scale were replaced with 20.
  4. Missing values were imputed as 16 on the linear scale.
  5. Values were converted to ratios against the mean of the siC control samples.
  6. Genes with Gene Counts below 20 in at least one sample in every group were removed.

When judged only by p-value, the differences among methods appear large

First, we compared genes with p < 0.05 for edgeR, DESeq2, and the t-test.

In scatter plots comparing individual samples, even small differences in measured values can appear as large fold changes in the low-count range. As a result, points spread away from the diagonal, and the low-count range can appear highly variable.

In contrast, the scatter plots in Fig1 compare the mean values of the two replicate samples in the siC group and the two replicate samples in the siE1 group. When group means are compared, random variation among individual samples is averaged to some extent. Therefore, points in the low-count range do not spread as widely as they would in a comparison between individual samples.

However, the fact that group-mean points in the low-count range cluster near the diagonal does not mean that the measurements in this range are stable. Rather, in this range, random measurement fluctuation is likely to have a larger influence than true expression differences. Many values in this range are therefore difficult to interpret as biologically meaningful expression differences.

edgeR and DESeq2 evaluate significance by estimating the relationship between mean expression and variance from the entire Gene Counts dataset. However, if the low-count range is treated as a continuous extension of the variance trend from the middle- to high-count ranges, genes whose measurements are dominated more by noise than signal can also become targets of significance testing. (Fig1 left and middle)

In other words, even when using edgeR or DESeq2, the low-count range is not necessarily included directly as an interpretation target. In actual analyses, filtering to remove lowly expressed genes is commonly used together with these methods. This also indicates that the low-count range is not a range that can be interpreted safely simply by leaving it to the statistical model. Instead, it is a range where analysts need to decide in advance how much of the data should be included in the analysis target.

Case Study421 Fig1 Significant Genes By Method

Fig1: Distribution of genes with p < 0.05 detected by edgeR, DESeq2, and t-test. Black dots represent all genes, and colored dots indicate genes judged as significant by each method. The overall distribution of significant genes is similar among the three methods, but differences are observed in the low-expression range and near the decision boundary.

The difference between edgeR and DESeq2 is mainly seen near the decision boundary

Fig2 shows that many genes overlap among the three methods, while some genes are judged as significant or not significant depending on the method.

The difference between edgeR and DESeq2 is mainly seen near the boundary between significant and non-significant genes. (Fig2 left) For genes with clearly large differences, the results of the two methods agree well. However, for genes with p-values near the threshold, the method used can affect whether a gene is judged as significant. In this dataset, edgeR also tended to judge more genes as significant in the low-count range.

In small-sample RNA-Seq data, edgeR has been reported to detect more candidate genes than DESeq2 in some cases. In this comparison as well, edgeR judged more genes as significant than DESeq2, which is consistent with that tendency. However, this does not mean that edgeR always detects more genes.

Case Study421 Fig2 Differences Between Methods

Fig2: Comparison of genes with p < 0.05 detected by edgeR, DESeq2, and t-test. The lower panels show the overlap among genes detected by the three methods. Genes in the gray-overlaid regions of the Venn diagrams are shown as black dots in the upper scatter plots.

Genes detected only by edgeR and DESeq2 often have relatively larger mean differences

Compared with genes detected only by the t-test, genes detected only by edgeR and DESeq2 are located slightly farther away from the diagonal. (Fig2 middle) These gene groups also include genes that appear to have relatively large within-group variation. (Fig2B middle) This is probably because edgeR and DESeq2 do not rely only on the variation of the four samples for each individual gene. Instead, they also use the variance trend estimated from many genes with similar Gene Counts. Because this dataset originally has small within-group variance, the variance trend estimated from genes with similar Gene Counts may also tend to be small. As a result, genes whose individual measured values appear relatively variable may still have been judged as significant in the edgeR or DESeq2 model.

Genes detected only by the t-test often have small mean differences

In contrast, many genes detected only by the t-test are located very close to the diagonal. (Fig2 right) These genes also show extremely small variance. (Fig2B right)

When the number of replicates is as small as two, the measured values are likely to fall in high-probability regions of the population distribution. As a result, the observed variance can be much smaller than the true population variance. This means that even when the difference between group means is small, a gene can be judged as significant if the within-group variation happens to be observed as extremely small. As long as we judge only by p-value, this result supports the argument that edgeR or DESeq2 should be used for RNA-Seq data rather than a t-test.

Case Study421 Fig2 B Intersection Vs Edge R De Seq2 Specific Genes

Fig2B: Comparison of expression patterns among genes commonly detected by all three methods, genes detected only by edgeR and DESeq2, and genes detected only by the t-test. Genes in the gray-overlaid regions of the Venn diagrams in the lower panels are shown as black lines in the line plots above.

Preprocessing and filtering are risk control for the low-count range

If a t-test is applied without preprocessing or filtering, genes in the low-count range can be judged as significant and become mixed into the DEG list. This range is more easily affected by count discreteness, missing values, and instability caused by low signal. Therefore, interpreting expression differences based only on p-values requires caution.

To examine this, we compared genes judged as significant after only normalization and log2 transformation with genes judged as significant after applying Low Signal Cutoff, missing value imputation, and filtering.

After Low Signal Cutoff, missing value imputation, and filtering were applied, fewer genes in the low Gene Counts range were judged as significant by the t-test. This result indicates that the major risk is not the t-test itself, but rather applying a statistical test directly to unstable data in the low-count range.

edgeR and DESeq2 automatically apply statistical models designed for RNA-Seq Count data. This is very convenient. However, within the statistical model itself, there is limited room for analysts to adjust, according to biological purpose, which count range should be included as an interpretation target. Looking at the left and middle scatter plots in Fig1, the significance calls appear rather aggressive for a situation with only two replicates per group. Even when using edgeR or DESeq2, it is more appropriate to combine the analysis with some form of filtering for lowly expressed genes.

In contrast, when preprocessing and filtering are performed manually while inspecting the data, the Low Signal Cutoff threshold can be adjusted flexibly according to the analysis purpose. For exploratory analysis, a lower threshold may be acceptable. For stricter biomarker candidate selection, a higher threshold may be more appropriate. The ability to change this decision according to the purpose of the analysis has biological interpretive value.

Case Study421 Fig3 T Test Risk Reduction By Preprocessing And Filtering

Fig3: Effect of preprocessing and filtering on the t-test. The left panel shows the result of applying the t-test after only normalization and log2 transformation. The middle panel shows the result after further applying Low Signal Cutoff, missing value imputation, and filtering. The right panel shows the distribution of genes removed from the t-test significant gene list by preprocessing and filtering. Many of these removed genes are located in the low Gene Counts range, showing that preprocessing and filtering reduce the risk of judging unstable low-count genes as significant.

Adding fold change to p-value brings the DEG list closer to practical interpretation

In actual differential expression analysis, DEG lists are not always created based on p-values alone. It is common to add a fold change condition and narrow the list to changes that are easier to interpret biologically. Here, we compared edgeR, DESeq2, and the t-test applied after preprocessing, normalization, and filtering using the condition p < 0.05 and fold change of 1.4 or greater.

When genes are judged only by p-value, genes with small mean differences can also be judged as significant. In particular, with the t-test, genes with extremely small within-group variation can become significant even when the fold change is small.

When the fold change condition is added, genes with small changes close to the diagonal are removed, and the significant genes are narrowed down to regions with larger expression differences.

Case Study421 Fig4 Significant Genes By Method And Fold Change

Fig4: Distribution of genes with p < 0.05 and fold change of 1.4 or greater detected by edgeR, DESeq2, and t-test.

Adding a fold change condition reduces the differences among methods

Fig5 compares how the DEG lists obtained by the three methods change when the condition of p < 0.05 and fold change of 1.4 or greater is applied.

When the methods were compared only by p-value, 355 genes were judged as significant only by the t-test. Many of these genes had extremely small within-group variation but small mean differences. In addition, among genes detected only by edgeR or DESeq2, many genes near the significance boundary also had small expression changes.

When the fold change condition of 1.4 or greater is added, many of these genes with small expression changes are removed. As a result, the overlap among the DEG lists obtained by edgeR, DESeq2, and the t-test becomes larger, and the differences among methods become smaller than when the methods are compared by p-value alone. (Fig5)

In RNA-Seq differential expression analysis, DEG lists are often created using fold change together with p-values. Therefore, this result indicates that in practical RNA-Seq differential expression analysis, the differences among methods can become smaller.

Case Study421 Fig5 Method Specific Genes With Fold Change

Fig5: Comparison of DEG lists extracted using p < 0.05 and fold change of 1.4 or greater. Adding the fold change condition reduces the differences among methods seen when comparing only by p-value, and increases the overlap among the DEG lists obtained by the three methods.

However, adding a fold change condition does not completely eliminate method-dependent differences

Adding the fold change condition of 1.4 or greater greatly reduces the number of genes with small changes that were judged as significant only by the t-test. On the other hand, many genes detected only by edgeR or DESeq2 still remain. (Fig5)

These genes were probably not judged as significant by the t-test because their within-group variation was relatively large. (Fig5B left and middle) In contrast, edgeR and DESeq2 also use the variance trend estimated from the entire Gene Counts dataset. Therefore, genes whose individual profiles appear variable can still be judged as significant by the model.

This result shows that adding a fold change condition does not completely eliminate the differences among methods. Even for DEG lists obtained using edgeR or DESeq2, it is necessary to check the expression patterns of individual genes.

Case Study421 Fig5 B Intersection Vs Edge R De Seq2 Specific Genes

Fig5B: Comparison of expression patterns among gene groups extracted using p < 0.05 and fold change of 1.4 or greater. Even after adding the fold change condition, genes detected only by edgeR or DESeq2 include genes that show relatively large within-group variation when individual samples are inspected.

This result strongly depends on low-variance, small-sample in vitro data

However, the dataset used here is in vitro experimental data, and the variation among replicate samples is extremely small. In addition, each condition has only two replicates, which is an extremely small sample size from a statistical perspective.

This type of in vitro experiment with very few replicates per group is not rare in actual RNA-Seq studies. In low-variance, small-sample data, the variance trend estimated from the entire Gene Counts dataset can become small. As a result, edgeR and DESeq2 can judge some genes as significant even when they are not significant by the t-test. This is one of the characteristics to keep in mind when using edgeR or DESeq2 for this type of dataset.

In the next article, we will compare edgeR, DESeq2, and a t-test applied to preprocessed data using a medium-sized RNA-Seq dataset from biopsy samples. We will then examine to what extent the tendencies observed here are reproduced.

The important point is not which method to trust, but how to judge the data

In this comparison, the results of edgeR, DESeq2, and the t-test are not largely contradictory. When compared only by p-value, the differences among methods appear large. However, after adding a fold change condition, the differences among practical DEG candidate lists became much smaller.

At the same time, when genes included in the method-specific differences were inspected individually, different characteristics became visible. Genes detected only by the t-test often had extremely small within-group variation but small mean differences. These genes were greatly reduced by adding the fold change condition.

Conversely, genes that were significant by edgeR or DESeq2 but not by the t-test included genes whose within-group variation appeared relatively large when individual samples were inspected. When using edgeR or DESeq2 for low-variance, small-sample in vitro data, it is necessary to understand that such genes may be included in the DEG list and interpret them carefully.

For RNA-Seq differential expression analysis using Gene Counts, it is safer to use methods such as edgeR or DESeq2, which can consider the variance structure of the entire count dataset, as the basic approach. At the same time, if normalization, preprocessing, filtering, fold change conditions, and visual inspection are combined appropriately, the t-test can also be used as a practical method for evaluating expression differences that does not greatly contradict edgeR or DESeq2.

Statistical methods are important tools for narrowing down candidate genes. However, to decide which genes should ultimately be interpreted, it is necessary to check the actual data, not just the p-values.

Visualizing differences among DEG lists with Subio Platform

In the RNA-Seq Data Analysis Tutorial, we explain not only how to import data, normalize it, filter it, and perform DEG analysis, but also what to check and how to make decisions at each step.

Many researchers may think that DEG analysis simply means loading Gene Counts into edgeR or DESeq2 and letting the tool process the data. However, many important points can be missed with that approach.

The tutorial page also provides a download link for an SSA file. SSA files can be imported into Subio Platform, allowing you to explore the actual dataset yourself. We encourage you to examine this case study by working with the data directly.

Related Topics

Comparing edgeR, DESeq2, and t-test for RNA-Seq Differential Expression Analysis (2)|A Medium-Sized Biopsy Dataset