Comparing edgeR, DESeq2, and t-test for RNA-Seq Differential Expression Analysis (2)｜A Medium-Sized Biopsy Dataset

Unlike the small-sample, low-variance in vitro dataset used in the previous article, this article uses a biopsy-derived RNA-Seq dataset from GSE121212, consisting of 20 non-lesional skin samples and 20 lesional skin samples, for a total of 40 samples. With a larger number of samples, within-group variance for each gene can be estimated more stably than in the previous example, reducing the risk that genes are judged significant by the t-test simply because their variance was estimated to be extremely low by chance.

At the same time, in biopsy-derived data, individual differences, tissue-state differences, differences in total read count, and instability in the low-count range are more likely to affect the results. Therefore, in this article, we do not simply compare which method is better among edgeR, DESeq2, and the t-test. Instead, we examine the characteristics of the genes judged significant by each method.

In biopsy data, the dynamic range can differ greatly among samples

CaseStudy426 Fig1: Histogram Dynamic Range

Fig1: Comparison of Gene Counts distributions across the 40 samples in GSE121212. After normalization, the signal region on the right side of the histogram is aligned, whereas the low-count region on the left side extends differently from sample to sample. In the table below, genes with no Gene Counts in at least 36 of the 40 samples were removed. The normalized Gene Counts were then log2-transformed, and the samples were sorted by the 25th percentile value in descending order.

In RNA-Seq data, normalization does not make all regions of the distribution uniformly aligned. Before normalization, the left end near zero appears relatively similar across samples, but the position of the signal region on the right side differs among samples, reflecting differences in total read count.

After normalization, the peak of the signal region on the right side becomes aligned. This is a natural result in the sense that normalization aligns the main expression range used for comparisons among samples. However, as a result, differences in how far the left side of the low-count region extends become visible among samples.

In samples with a narrow dynamic range, measurements on the low-expression side are not sufficiently obtained. Therefore, even if normalization algorithms or batch correction methods are used to align the distributions, they cannot restore low-expression information that was not measured in the first place. We have previously examined this point in Case Study No. 403 and No. 413.

In this article, we examine what kinds of results are obtained when edgeR, DESeq2, and the t-test are applied to data that include this kind of instability in the low-count range.

Comparison conditions

In this analysis, we compared the p-values obtained from the three methods. For edgeR and DESeq2, Gene Counts for all genes were used as input, and p-values were calculated based on each statistical model. However, the purpose here was not to evaluate the entire recommended workflow of edgeR or DESeq2. Rather, the goal was to compare what kinds of genes are selected by the three significance-testing methods, including the t-test.

For the t-test in Subio Platform, no special preprocessing such as low signal cutoff or missing-value imputation was applied. The t-test was performed on data that had only been log2-transformed and globally normalized.

Genes with p < 0.05 were then extracted for each method. The log ratio of lesional samples relative to non-lesional samples was used to evaluate the magnitude of change.

The results of edgeR, DESeq2, and the t-test largely overlap

CaseStudy 426 Fig2: Venn and Scatter

Fig2: Comparison of genes with p < 0.05 by edgeR, DESeq2, and the t-test. Many genes were judged significant by all three methods. At the same time, some genes were significant by edgeR or DESeq2 but not by the t-test, and some genes were significant only by the t-test.

As shown in Fig2, the majority of significant genes were shared by all three methods. In this 40-sample biopsy dataset, the major expression differences were detected fairly consistently by edgeR, DESeq2, and the t-test.

This is very different from the small-sample, low-variance in vitro dataset examined in the previous article. In the previous article, each group had only two replicate samples, so the genes judged significant were strongly affected by the degree of variation between the two samples within each group. In contrast, the present dataset includes 20 samples per group. This makes the estimation of within-group variance and between-group differences more stable for each gene, and the differences among the methods appear to be smaller for the major expression changes.

On the other hand, some method-dependent genes still remained. In particular, there was a noticeable group of genes that were judged significant by edgeR or DESeq2 but not by the t-test. However, many of these genes were located in the low-count range. There were approximately 3,000 genes that were significant only by edgeR or DESeq2, but about 2,000 of them were in the low-count range, while about 1,000 were in the signal range. (Fig3B, left and center)

The number of genes significant only by the t-test was 956. Therefore, when the comparison is limited to the signal range, the difference between the number of genes significant only by edgeR or DESeq2 and the number of genes significant only by the t-test is not large. (Fig3B, center and right) This result indicates that, in a dataset with 20 samples per group, the differences among the methods for detecting major expression changes were not as large as those observed in the previous article.

Genes significant by all three methods appear reasonable as group differences

CaseStudy426 Fig3A: Intersection Genes

Fig3A: Heatmaps of genes with p < 0.05 by all three methods: edgeR, DESeq2, and the t-test. The left panel shows all intersection genes, the center panel shows genes with fold change greater than 1.4, and the right panel shows genes with fold change of 1.4 or less. Even among genes with fold change of 1.4 or less, consistent differences between non-lesional and lesional samples can be observed.

Genes judged significant by all three methods clearly separate non-lesional and lesional samples in the heatmap. This result suggests that the intersection of the three methods extracts relatively stable DEG candidates.

What is interesting is that even genes with fold change of 1.4 or less still show patterns that can be interpreted as group differences. (Fig3A, right) In the previous small-sample, low-variance dataset, genes with small changes were often included when extraction was based on p-value alone, and filtering by fold change had practical value.

In the present biopsy dataset, however, where the sample size is larger, even small changes can be observed consistently across many samples. Especially in studies that target small expression changes, it is worth checking genes extracted by p-value alone rather than mechanically excluding them by a fold change cutoff.

Method-specific genes need to be checked separately by expression range

CaseStudy426 Fig3B: Method Specific Genes

Fig3B: Heatmaps of genes judged significant in a method-specific manner. The left panel shows low-count genes significant only by edgeR or DESeq2, the center panel shows signal-range genes significant only by edgeR or DESeq2, and the right panel shows genes significant only by the t-test. The gradient at the bottom shows the 25th percentile of normalized Gene Counts for each sample. Darker colors indicate higher 25th percentile values, representing samples with a narrower dynamic range.

Fig3B examines genes that were not shared by all three methods but were judged significant only by specific methods. Here, genes that were significant by edgeR or DESeq2 but not by the t-test are divided into two groups for closer inspection: genes in the signal range, with average Gene Counts greater than 20 in both non-lesional and lesional samples, and the remaining genes in the low-count range.

In the low-count range shown on the left side of Fig3B, there are genes that appear, at first glance, to be down-regulated in lesional samples. However, when the 25th percentile values shown at the bottom of the heatmap are considered, this pattern appears to correspond to the bias of samples with a narrow dynamic range. In this case, the apparent group difference in the heatmap may reflect sample-specific measurement range differences or experimental factors rather than biological expression changes.

On the other hand, in the lower part of the same Fig3B left panel, some genes are not measured, or are almost OFF, in non-lesional samples, but appear to turn ON in lesional samples. This kind of ON/OFF-type change is a region where count-based methods such as edgeR and DESeq2 can be useful.

However, the important point is that whether samples with a narrow dynamic range are biased toward one group or the other is not necessarily determined by the biological condition. In the low-count range, measurement-range bias can appear as if it were a group difference. Therefore, even for low-count genes judged significant by edgeR or DESeq2, it is risky to interpret them based only on p-values.

Caution is also needed for the signal-range genes shown in the center panel of Fig3B. Because the average Gene Counts exceed 20 in both groups, this group is easier to treat as signal than the low-count range shown in the left panel. However, the heatmap pattern also appears to correspond to the 25th percentile gradient shown at the bottom. In other words, even within the signal range, patterns influenced by sample-to-sample differences in dynamic range may be mixed in.

When the scatter plot in Fig2 is also considered, genes significant only by edgeR or DESeq2 lie above the diagonal, whereas genes significant only by the t-test lie below the diagonal. In both cases, the genes are aligned near the decision boundary. These genes appear less like genes with strong expression differences and more like genes that crossed, or did not cross, the p-value threshold because of small differences in normalization method or fold change estimation.

For the genes significant only by the t-test shown on the right side of Fig3B, the pattern does not appear to represent a large and consistent expression difference between groups. Instead, genes whose magnitude of change was small and whose within-group variance happened to be estimated as small may have passed the significance threshold of the t-test by chance. This group also includes genes that appear to reflect dynamic-range differences in the low-count range.

Therefore, the genes shown in the center and right panels of Fig3B should not be treated directly as high-priority DEG candidates. They should be separated from the genes shared by all three methods and interpreted carefully. In this analysis, it seems more appropriate to treat them as targets for additional confirmation rather than as high-priority DEG candidates.

In the low-count range, filtering based on the presence or absence of measured values is effective

CaseStudy426 Fig3C: ON OFF Genes

Fig3C: ON in lesional and OFF in lesional gene groups overlaid on the low-count genes shown in the left panel of Fig3B. ON/OFF-type genes were extracted as genes for which the proportion of samples without measured values was high in one group, while the proportion of samples with measured values was high in the other group.

In the low-count range, edgeR and DESeq2 may detect ON/OFF-type changes. However, the same region also contains apparent differences caused by dynamic-range bias. Therefore, it is risky to judge low-count genes based only on p-values.

In Fig3C, ON/OFF-type genes were therefore extracted not by the p-values from edgeR or DESeq2, but by the presence or absence of measured values themselves. Specifically, genes were extracted when the proportion of samples without measured values was high in one group and the proportion of samples with measured values was high in the other group. This approach allows ON/OFF-type changes in the low-count range to be checked in a more direct and interpretable way.

In this way, continuous expression differences can be evaluated in the signal range, while ON/OFF-type changes in the low-count range can be extracted based on the presence or absence of measured values. This makes it possible to create a DEG candidate list that includes both the signal range and the low-count range.

What this comparison shows

These results show that, in RNA-Seq differential expression analysis, it is important to distinguish between the low-count range and the signal range.

First, significance testing is difficult in the low-count range. This region is strongly affected by dynamic-range differences in normalized Gene Counts, and even when a pattern looks like a group difference, it may actually reflect sample-specific measurement range differences or experimental factors. Therefore, p-values in the low-count range need to be interpreted carefully.

Second, genes in the signal range that were judged significant by edgeR, DESeq2, and the t-test appear to be fairly reasonable DEG candidates. In a biopsy dataset with a sample size such as this one, even genes with small magnitudes of change may show consistent differences across many samples. When such genes are of interest, mechanically applying a fold change cutoff may not be the best choice.

Third, even in the signal range, method-specific genes need to be interpreted carefully. Genes significant only by edgeR, only by DESeq2, or only by the t-test may be affected by normalization, variance estimation, p-value calculation, or dynamic-range differences. They should be considered separately from genes shared by all three methods.

Fourth, in the low-count range, filtering based on the presence or absence of measured values can be useful even without relying on statistical testing. When ON/OFF-type changes are of interest, directly using the condition that a gene is measured in one group and not measured in the other can be easier to interpret than relying on p-values.

Considering a standard analysis protocol

In this analysis, we deliberately did not apply a low signal cutoff or missing-value imputation in order to compare the three methods. In the standard analysis protocols for edgeR and DESeq2, low-expression genes are filtered, library sizes are normalized, dispersion is estimated, and statistical tests are performed based on a model. However, these steps alone cannot prevent apparent expression changes caused by sample-to-sample differences in dynamic range.

Therefore, in practical analysis, in addition to the standard procedures of edgeR and DESeq2, it is necessary to check the data distribution of each sample and the differences in how far the left side of the normalized histogram extends. The differences in measurement limits on the low-expression side that remain after normalization should not be carried directly into differential expression analysis.

Specifically, a low signal cutoff should be applied to the normalized Gene Counts. However, this cutoff should not be set as low as possible. Instead, it should be set around the lower limit of the signal region, even for the group of samples with the narrowest dynamic range. The values above this cutoff should be treated as the signal range, where continuous expression values are compared. The values below this cutoff should be treated as a separate region where ON/OFF-type changes are evaluated based on the presence or absence of measured values.

If the signal range and the low-count range are not separated in this way, and data that include differences in measurement limits on the low-expression side are analyzed as they are, sample-to-sample differences in dynamic range may be extracted as if they were group differences. These differences in measurement limits on the low-expression side are not explicitly incorporated into the statistical models of edgeR or DESeq2, nor into their standard analysis protocols. However, they do exist in real RNA-Seq data and can become a source of apparent expression changes. This is not merely a theoretical concern. As shown in Case Study No. 403, this problem can also be observed in real RNA-Seq data.

In the signal range, taking the intersection of significant genes from edgeR, DESeq2, and the t-test can be used as a conservative approach for extracting DEG candidates. At least in this dataset, genes significant by all three methods showed a reasonable group-difference structure in the heatmap.

If running all three methods and taking the intersection is difficult, simpler approaches can also be considered for evaluating the signal range: using edgeR or DESeq2, or using the t-test.

When edgeR or DESeq2 is used, it is recommended to check by heatmap whether the expression patterns of the genes extracted by p-value correspond to dynamic-range bias. When the t-test is used, genes with both a small magnitude of change and small within-group variance should be removed from the genes judged significant by p-value. This additional step helps compensate for a weakness often attributed to the t-test when compared with edgeR and DESeq2.

Regardless of which method is used, ON/OFF-type genes in the low-count range need to be extracted separately.

Region	When using intersection	When using edgeR/DESeq2	When using the t-test
Signal range above the low signal cutoff	Use genes significant by all three methods: edgeR, DESeq2, and the t-test, as DEG candidates.	Extract DEG candidates using p-values from edgeR or DESeq2. However, check whether the expression patterns correspond to dynamic-range bias.	Extract DEG candidates using p-values from the t-test. However, remove genes with both a small magnitude of change and small within-group variance.
Low-count range below the low signal cutoff	Do not use p-values as they are. Instead, re-extract ON/OFF-type changes based on the presence or absence of measured values.

Support Help - Theory & Case Study Comparing edgeR, DESeq2, and t-test for RNA-Seq Differential Expression Analysis (2)｜A Medium-Sized Biopsy Dataset