Comparing edgeR, DESeq2, and the t-test for RNA-Seq Differential Expression Analysis (2) - A Medium-Sized Biopsy Dataset

This article is a continuation of Case Study No. 421, “Case Study (1)｜A Low-Variance, Small-Sample in vitro Dataset”.

Unlike the small-sample, low-variance in vitro dataset used in the previous case study, this article uses biopsy-derived RNA-Seq data from GSE121212, consisting of 20 non-lesional skin samples and 20 lesional skin samples, for a total of 40 samples. Because each group contains 20 samples, the within-group variance of each gene can be estimated more stably than in the previous example. This reduces the risk that genes whose variance happens to be very small by chance will become significant in the t-test.

On the other hand, biopsy-derived data are more likely to reflect individual differences, differences in tissue condition, differences in total read depth, and differences in the lower measurement limit on the low-expression side. Therefore, in this article, we compare edgeR, DESeq2, and the t-test using data in which variance estimation is more stable because of the larger sample size, but biological variation and differences in dynamic range still remain.

Dynamic Range Differs Greatly Across Samples in Biopsy Data

CaseStudy426 Fig1: Histogram Dynamic Range

Fig1: Comparison of Gene Counts distributions across 40 samples in GSE121212. In the normalized histograms, the signal region on the right side is aligned, whereas the low-count region on the left side extends differently depending on the sample. In the table below, after removing genes with no Gene Counts in 36 or more of the 40 samples, normalized Gene Counts were log2-transformed and the samples were sorted by the 25th percentile value in descending order.

In RNA-Seq data, normalization does not align all regions uniformly. Before normalization, the left edge near zero in the Gene Counts distribution appears relatively aligned, whereas the position of the signal region on the right side shifts among samples, reflecting differences in total read depth.

After normalization, the peak position of the signal region on the right side becomes aligned. This is a natural result in the sense that normalization aligns the main expression range that we want to compare among samples. However, as a result, differences in how the histogram extends in the low-count region on the left side become visible among samples.

In samples with a narrow dynamic range, sufficient measurements are not obtained on the low-expression side. Therefore, even if we try to align distributions using normalization algorithms or batch correction, information from the unmeasured low-expression side cannot be restored. This has been confirmed in Case Study No. 403 and No. 413.
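The per-sample 25th percentile used in Fig1 as an indicator of dynamic range can be sketched as follows. This is only a minimal illustration in Python, not the procedure used in Subio Platform. The function names, the dictionary layout, and the choice to ignore zero counts before the log2 transformation are our assumptions.

```python
import math
import statistics

def q25_log2(counts):
    """25th percentile of log2 values for one sample.
    Assumption: zero counts are treated as unmeasured and ignored."""
    logged = [math.log2(c) for c in counts if c > 0]
    return statistics.quantiles(logged, n=4, method="inclusive")[0]

def sort_samples_by_q25(samples, max_unmeasured):
    """samples: dict of sample name -> list of normalized counts (same gene order).
    Drops genes that are zero in `max_unmeasured` or more samples,
    then returns sample names sorted by 25th percentile, highest first."""
    names = list(samples)
    n_genes = len(samples[names[0]])
    keep = [g for g in range(n_genes)
            if sum(samples[s][g] == 0 for s in names) < max_unmeasured]
    q25 = {s: q25_log2([samples[s][g] for g in keep]) for s in names}
    return sorted(names, key=q25.get, reverse=True)

# Toy example with three hypothetical samples and five genes.
toy = {
    "narrow": [64, 64, 32, 128, 0],
    "mid":    [8, 16, 64, 128, 0],
    "wide":   [1, 2, 64, 128, 0],
}
order = sort_samples_by_q25(toy, max_unmeasured=3)
```

For the 40-sample dataset described here, `max_unmeasured` would be 36. A sample whose 25th percentile is high has few values on the low-expression side, which is what we call a narrow dynamic range.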

In this article, we examine what happens when edgeR, DESeq2, and the t-test are applied to data that include such differences in the lower measurement limit on the low-expression side.

Comparison Conditions

In this analysis, to compare the p-values obtained by the three methods, all genes in the Gene Counts table were used as input for edgeR and DESeq2, and p-values were calculated based on their respective statistical models. However, the purpose here is not to evaluate the full recommended workflows of edgeR or DESeq2, but to compare what kinds of genes are selected by the three significance-testing methods, including the t-test.

In Subio Platform, where the t-test was performed, no special preprocessing such as low signal cutoff or missing-value imputation was applied. The t-test was applied only after log2 transformation and global normalization.

For each method, genes with p < 0.05 were extracted. Expression change was evaluated using the log ratio of lesional to non-lesional samples.
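As a minimal sketch of the log ratio described above, the following Python snippet computes the difference between the mean log2 values of the two groups. We assume here that the ratio is taken on log2-transformed, strictly positive values. The function name is hypothetical and is not part of Subio Platform, edgeR, or DESeq2.

```python
import math

def log_ratio(lesional, non_lesional):
    """Mean log2 value in lesional samples minus mean log2 value in
    non-lesional samples. Positive means higher in lesional samples.
    Assumption: all input values are strictly positive."""
    mean_l = sum(math.log2(v) for v in lesional) / len(lesional)
    mean_n = sum(math.log2(v) for v in non_lesional) / len(non_lesional)
    return mean_l - mean_n

# A gene at 8 in lesional and 2 in non-lesional samples has a log ratio of 2,
# which corresponds to a 4-fold increase.
example = log_ratio([8, 8], [2, 2])
```

Working on the log scale makes up- and down-regulation symmetric: a 4-fold decrease gives a log ratio of -2.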

Therefore, this analysis is not intended to present a standard recommended protocol. It is a comparison designed to examine the differences among the three significance-testing methods under conditions where low signal cutoff and missing-value handling are not applied.

The Results from edgeR, DESeq2, and the t-test Largely Overlap

CaseStudy426 Fig2: Venn and Scatter

Fig2: Comparison of genes with p < 0.05 in edgeR, DESeq2, and the t-test. Many genes were commonly judged significant by all three methods. At the same time, some genes were significant in edgeR or DESeq2 but not in the t-test, and some genes were significant only in the t-test.

As shown in Fig2, most significant genes were shared among the three methods. In this 40-sample biopsy dataset, the major expression differences were detected quite consistently by edgeR, DESeq2, and the t-test.

This differs greatly from the small-sample, low-variance in vitro dataset examined in the previous article. In the previous article, each group had only two replicate samples, so the genes judged significant were strongly affected by the degree of variation between the two samples within each group. In contrast, the present dataset contains 20 samples per group, making the estimation of each gene's within-group variance and between-group difference more stable. As a result, the differences among methods were smaller for the major expression changes.

However, method-dependent genes still remain. In particular, genes that were significant in edgeR or DESeq2 but not in the t-test were prominent. Many of these genes were located in the low-count region. There were approximately 3,000 genes significant only in edgeR or DESeq2, of which approximately 2,000 were in the low-count region and approximately 1,000 were in the signal region (Fig3B, left and middle).

There were 956 genes significant only in the t-test. Therefore, when limited to the signal region, the difference between the number of genes significant only in edgeR or DESeq2 (approximately 1,000) and the number of genes significant only in the t-test (956) was not as large as it appeared in the Venn diagram (Fig3B, middle and right).
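The gene groups discussed above can be derived from the three lists of significant genes with plain set operations. The following is a minimal Python sketch. The function name, the group labels, and the gene identifiers in the toy example are hypothetical.

```python
def partition_calls(edger, deseq2, ttest):
    """Split significant genes into the groups compared in this article."""
    edger, deseq2, ttest = set(edger), set(deseq2), set(ttest)
    return {
        # significant in all three methods
        "all_three": edger & deseq2 & ttest,
        # significant in edgeR or DESeq2 (or both), but not in the t-test
        "edger_or_deseq2_only": (edger | deseq2) - ttest,
        # significant in the t-test, but in neither edgeR nor DESeq2
        "ttest_only": ttest - (edger | deseq2),
    }

# Toy example with made-up gene identifiers.
groups = partition_calls(
    edger={"g1", "g2", "g3"},
    deseq2={"g1", "g3", "g4"},
    ttest={"g1", "g5"},
)
```

Note that these three groups do not cover every region of the Venn diagram. Genes significant in the t-test and in exactly one of edgeR or DESeq2 fall outside them.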

Genes Significant in All Three Methods Appear Reasonable as Group Differences

CaseStudy426 Fig3A: Intersection Genes

Fig3A: Heatmaps of genes with p < 0.05 in all three methods: edgeR, DESeq2, and the t-test. The left panel shows the full intersection, the middle panel shows genes with fold change greater than 1.4, and the right panel shows genes with fold change of 1.4 or less. Even among genes with fold change of 1.4 or less, consistent differences between non-lesional and lesional samples can be observed.

Genes judged significant by all three methods clearly separate non-lesional and lesional samples in the heatmap. This result indicates that the intersection of the three methods extracts relatively stable DEG candidates.

Interestingly, even genes with a fold change of 1.4 or less still show patterns that are interpretable as group differences (Fig3A, right). In the previous small-sample, low-variance dataset, extracting genes by p-value alone included many genes with small changes, so narrowing the list using a fold change condition had practical value.

However, in biopsy data with a certain number of samples, as in this case, small expression changes can still be observed consistently across many samples. Especially when the research question targets small expression changes, it can be valuable to examine genes extracted by p-value alone rather than mechanically excluding them using a fold change condition.
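Rather than discarding the genes with small changes, a list extracted by p-value can be split at the fold change threshold so that both parts remain available for inspection. The sketch below assumes that the threshold of 1.4 is applied symmetrically to up- and down-regulation on the log2 scale. The function name and input layout are hypothetical.

```python
import math

def split_by_fold_change(log2_ratios, fold=1.4):
    """log2_ratios: dict of gene -> log2 ratio (lesional / non-lesional).
    Returns (genes with fold change > fold, genes with fold change <= fold).
    Assumption: the threshold applies in both directions."""
    threshold = math.log2(fold)  # about 0.485 for a 1.4-fold change
    large = {g for g, r in log2_ratios.items() if abs(r) > threshold}
    small = set(log2_ratios) - large
    return large, small

# Toy example with made-up genes that all passed the p-value condition.
large, small = split_by_fold_change({"a": 1.0, "b": -0.6, "c": 0.3, "d": -0.2})
```

The `small` group corresponds to the right panel of Fig3A and can be examined in a heatmap before deciding whether to keep it.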

Method-Specific Genes Should Be Examined Separately by Expression Region

CaseStudy426 Fig3B: Method Specific Genes

Fig3B: Heatmaps of method-specific significant genes. The left panel shows low-count genes significant only in edgeR or DESeq2, the middle panel shows signal-region genes significant only in edgeR or DESeq2, and the right panel shows genes significant only in the t-test. The gradient below the heatmap indicates the 25th percentile of normalized Gene Counts for each sample. Darker colors indicate a higher 25th percentile value, representing samples with a narrower dynamic range.

In Fig3B, we examine genes that were not shared by all three methods, but were judged significant only by specific methods. Here, genes significant in edgeR or DESeq2 but not in the t-test are divided into two groups: signal-region genes for which the mean Gene Counts exceed 20 in both non-lesional and lesional samples (approximately 1,000 genes), and other low-count-region genes (approximately 2,000 genes).
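The split into signal-region and low-count-region genes described above can be sketched in Python as follows. The cutoff of 20 and the requirement that both group means exceed it come from the text. The function name and the dictionary layout are our assumptions.

```python
def split_by_region(genes, mean_non_lesional, mean_lesional, cutoff=20):
    """Split genes into signal-region and low-count-region lists.
    A gene is in the signal region only if its mean Gene Counts exceed
    the cutoff in BOTH groups; everything else is low-count."""
    signal, low_count = [], []
    for gene in genes:
        in_signal = (mean_non_lesional[gene] > cutoff
                     and mean_lesional[gene] > cutoff)
        (signal if in_signal else low_count).append(gene)
    return signal, low_count

# Toy example with made-up group means.
signal, low_count = split_by_region(
    ["g1", "g2", "g3"],
    mean_non_lesional={"g1": 150, "g2": 35, "g3": 4},
    mean_lesional={"g1": 90, "g2": 12, "g3": 2},
)
```

A gene such as `g2`, which is above the cutoff in only one group, is placed in the low-count region under this rule.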

In the low-count region shown in the left panel of Fig3B, some genes appear at first glance to be down-regulated in lesional samples. However, when we look at the 25th percentile shown below the heatmap, the pattern appears to correspond to a bias toward samples with a narrower dynamic range. In this case, the apparent group difference in the heatmap may reflect differences in dynamic range across samples rather than biological expression changes.

Note that whether samples with a narrow dynamic range happen to be concentrated in one group is not necessarily determined by the biological condition. In the low-count region, such differences in measurement range can appear as group differences. Therefore, for low-count genes judged significant by edgeR or DESeq2, it is risky to rely only on p-values.

In the signal region, genes significant only in edgeR or DESeq2 were located above the diagonal, whereas genes significant only in the t-test were located below the diagonal. Both groups were positioned near the decision boundary. These genes appear less like a distinct group with strong expression differences and more like genes that crossed, or did not cross, the p-value threshold because of small differences in normalization, change estimation, and p-value calculation.

In summary, although there were many genes significant only in edgeR or DESeq2, many of them were located in the low-count region. These genes should be evaluated carefully by visualizing normalized Gene Counts using histograms and line plots, rather than relying only on p-values. In the signal region, many method-specific genes were located near the decision boundary. They should be regarded not as a separate group of genes with strong expression differences, but as genes for which significance and non-significance were separated near the boundary.

In the Low-Count Region, Filtering Based on Whether Values Are Measured Is Useful

CaseStudy426 Fig3C: ON OFF Genes

Fig3C: ON in lesional and OFF in lesional gene groups overlaid on the low-count region shown in the left panel of Fig3B. ON/OFF-type genes were extracted as genes with no measured values in 40% or more of samples on the OFF side and measured values in 60% or more of samples on the ON side.

In the low-count region, edgeR and DESeq2 can sometimes detect ON/OFF-type changes. However, apparent differences caused by dynamic range bias are also mixed in the same region. Therefore, judging low-count-region genes only by p-values is risky.

In Fig3C, instead of using p-values from edgeR or DESeq2, we extracted ON/OFF-type genes using whether measured values were present or absent. Specifically, genes were extracted when 40% or more of samples on the OFF side had no measured value, and 60% or more of samples on the ON side had measured values. This condition broadly covers the gene groups that we want to identify as ON/OFF-type changes in the heatmap.
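The ON/OFF condition above can be written down as a short Python sketch. We assume that an unmeasured value is represented as `None`, and that a gene satisfying the condition in both directions is left unclassified. Both choices, as well as the function names, are ours and are not taken from Subio Platform.

```python
def is_on_off(off_values, on_values, off_missing_min=0.4, on_measured_min=0.6):
    """True if at least 40% of OFF-side samples are unmeasured (None)
    and at least 60% of ON-side samples are measured."""
    off_missing = sum(v is None for v in off_values) / len(off_values)
    on_measured = sum(v is not None for v in on_values) / len(on_values)
    return off_missing >= off_missing_min and on_measured >= on_measured_min

def classify_on_off(non_lesional, lesional):
    """Return 'ON in lesional', 'OFF in lesional', or None."""
    on = is_on_off(off_values=non_lesional, on_values=lesional)
    off = is_on_off(off_values=lesional, on_values=non_lesional)
    if on and not off:
        return "ON in lesional"
    if off and not on:
        return "OFF in lesional"
    return None  # neither direction, or ambiguous (assumption)

# Toy gene: unmeasured in 15 of 20 non-lesional samples,
# measured in 18 of 20 lesional samples.
non_lesional = [None] * 15 + [5.0] * 5
lesional = [7.0] * 18 + [None] * 2
label = classify_on_off(non_lesional, lesional)
```

Because the two thresholds add up to 100%, a gene measured in exactly 60% of the samples in both groups would satisfy the condition in both directions, which is why the sketch treats that case as unclassified.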

In this way, continuous differences in expression level should be evaluated in the signal region, whereas ON/OFF-type changes in the low-count region should be extracted based on whether values are measured. This makes it possible to create a DEG candidate list that includes both the signal region and the low-count region.

What We Can Learn from This Comparison

These results show that in RNA-Seq differential expression analysis, it is important to distinguish the low-count region from the signal region.

First, significance testing becomes difficult in the low-count region. This region easily reflects differences in the dynamic range of normalized Gene Counts. Even if the data appear to show a group difference, the pattern may actually reflect sample-specific measurement range or experimental factors. Therefore, p-values in the low-count region must be interpreted with caution.

Second, in the signal region, genes judged significant by all three methods (edgeR, DESeq2, and the t-test) appear to be fairly reasonable DEG candidates. In biopsy data with a certain number of samples, as in this case, even genes with small expression changes can show consistent differences across many samples. For such genes, it may not be appropriate to mechanically add a fold change condition.

Third, even in the signal region, method-specific gene groups must be interpreted carefully. Genes significant only in edgeR, only in DESeq2, or only in the t-test were often located near the p-value threshold in this dataset. They may reflect differences in normalization, dispersion estimation, p-value calculation, or change estimation. Therefore, it is better to prioritize genes shared by all three methods, and to use method-specific genes as additional candidates only when it is necessary to broaden the list.

Fourth, in the low-count region, filtering based on whether values are measured is more useful than calculating p-values as if the values represented continuous expression levels. When ON/OFF-type changes are of interest, directly using the condition that a gene is measured in one group and not measured in the other group can be easier to interpret than p-values that may reflect dynamic range differences.

Considering a Standard Analysis Protocol

In the analyses so far, we intentionally compared the three methods themselves. For edgeR and DESeq2, all genes in the Gene Counts table were used as input. For the t-test, normalized Gene Counts after log2 transformation and global normalization were used. Therefore, the comparison up to this point was not intended to evaluate the full recommended workflows of edgeR or DESeq2. Rather, it was designed to examine what kinds of genes each method tends to call significant when the data include low-count regions and differences in dynamic range.

In actual analyses, edgeR and DESeq2 are usually applied after filtering out low-expression genes. For edgeR, we removed low-expression genes using filterByExpr. This function does not simply filter genes by total count. Instead, it uses CPM values that take library size into account, and keeps genes that have sufficient expression in a sufficient number of samples. In this dataset, each group contains 20 samples. With the default settings, genes are retained if they have CPM values above the required threshold in approximately 17 or more samples, and also have a sufficient total count across all samples. After this filtering step, TMM normalization, dispersion estimation, and quasi-likelihood GLM testing were performed.
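The figure of approximately 17 samples follows from the documented defaults of filterByExpr (min.count = 10, min.total.count = 15, large.n = 10, min.prop = 0.7). The following Python sketch re-expresses that rule for a single gene in simplified form. It is not the edgeR implementation, the Python function names are hypothetical, and details such as how library sizes are normalized are left out.

```python
import statistics

def min_sample_size(group_sizes, large_n=10, min_prop=0.7):
    """Number of samples in which a gene must pass the CPM cutoff."""
    smallest = min(n for n in group_sizes if n > 0)
    if smallest > large_n:
        # For large groups, only a proportion of the excess is required.
        return large_n + (smallest - large_n) * min_prop
    return smallest

def keep_gene(counts, lib_sizes, group_sizes, min_count=10, min_total_count=15):
    """Simplified per-gene version of the filtering rule."""
    # The CPM cutoff corresponds to min_count reads at the median library size.
    cpm_cutoff = min_count / statistics.median(lib_sizes) * 1e6
    cpm = [c / size * 1e6 for c, size in zip(counts, lib_sizes)]
    n_pass = sum(v >= cpm_cutoff for v in cpm)
    tol = 1e-14  # small tolerance for floating-point comparison
    return (n_pass >= min_sample_size(group_sizes) - tol
            and sum(counts) >= min_total_count - tol)

# Two groups of 20 samples: 10 + (20 - 10) * 0.7 = 17 samples are required.
required = min_sample_size([20, 20])
```

With 40 libraries of equal size, a gene with 10 reads in 18 samples and none elsewhere would be kept, whereas the same gene with 10 reads in only 16 samples would be removed.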

In DESeq2, pre-filtering was performed before running DESeq and results. In this analysis, following the pre-filtering example shown in the DESeq2 vignette, genes were retained if the number of samples with at least 10 counts was greater than or equal to the smallest group size. Because this dataset contains 20 samples in each group, genes were included in the test if they had 10 counts or more in at least 20 of the 40 samples. In addition, independent filtering is performed by default in results.
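The pre-filtering rule described above amounts to counting, for each gene, the samples with at least 10 counts and comparing that number with the smallest group size. The sketch below restates the rule in Python for illustration only. The actual analysis was run in R with DESeq2, and the function name and data layout here are hypothetical.

```python
def prefilter(counts_by_gene, smallest_group_size, min_count=10):
    """Keep genes with at least `min_count` reads in at least
    `smallest_group_size` samples."""
    return [gene for gene, counts in counts_by_gene.items()
            if sum(c >= min_count for c in counts) >= smallest_group_size]

# Toy example with 4 samples and a smallest group size of 2.
kept = prefilter(
    {"g1": [12, 15, 0, 3], "g2": [9, 9, 9, 9], "g3": [0, 0, 0, 50]},
    smallest_group_size=2,
)
```

For this dataset, `smallest_group_size` would be 20. Note that the rule does not look at group membership, so a gene expressed only in the 20 lesional samples still passes.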

The t-test, on the other hand, was applied after preprocessing the data while visually checking the data distribution in Subio Platform. Specifically, normalized Gene Counts below 20 were replaced with 20, missing values were filled with 16, and genes were removed if more than half of the samples in both the non-lesional and lesional groups had Counts values below 20. The t-test was then performed on the preprocessed data.
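The preprocessing steps above can be sketched in Python for a single gene as follows. Several details are our assumptions rather than a description of Subio Platform: missing values are represented as `None`, they count as "below 20" when deciding whether to remove a gene, and the removal check is applied to the values before they are raised to 20.

```python
def preprocess_gene(non_lesional, lesional, floor=20, missing_fill=16):
    """Return cleaned (non_lesional, lesional) values, or None if the gene
    is removed. Inputs are normalized Gene Counts, with None for missing."""
    def mostly_low(values):
        # Assumption: a missing value is treated as below the floor.
        low = sum(v is None or v < floor for v in values)
        return low > len(values) / 2

    # Remove the gene only if BOTH groups are mostly below the floor.
    if mostly_low(non_lesional) and mostly_low(lesional):
        return None

    def clean(values):
        # Missing values are filled with 16; measured values below 20 become 20.
        return [missing_fill if v is None else max(v, floor) for v in values]

    return clean(non_lesional), clean(lesional)

# Toy gene: low in non-lesional samples, clearly expressed in lesional samples.
result = preprocess_gene([5, None, 12, 30], [80, 95, 60, 70])
```

Because missing values are filled with 16, which is below the floor of 20, unmeasured samples remain distinguishable from samples that were measured at a low level.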

All of these procedures help reduce the inclusion of DEG candidates derived from instability in the low-count region or apparent expression changes caused by differences in dynamic range. As a result, the DEG candidates extracted by the three protocols showed a large overlap.

However, some method-specific differences were still observed, and these features were consistent with the previous case study. The t-test tends to call genes significant when the observed variation is extremely small. Conversely, near the boundary between the noise region and the signal region, where variation can increase abruptly, edgeR and DESeq2 take into account not only the variation observed for each individual gene, but also the variation of other genes with similar Counts values. As a result, genes in this boundary region tended to be called significant more easily by edgeR or DESeq2 than by the t-test.
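The sensitivity of the t-test to very small observed variation can be illustrated with a short calculation. The sketch below computes the Welch t-statistic in Python. The source does not state which variant of the t-test was used, so the choice of Welch's form is an assumption, and the log2 expression values are made up.

```python
import math
import statistics

def welch_t(a, b):
    """Welch t-statistic for two groups of log2 expression values.
    Raises ZeroDivisionError if both groups have zero variance."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

# Both genes have the same mean difference of 0.1 on the log2 scale.
tight = welch_t([5.10, 5.11, 5.09, 5.10], [5.00, 5.01, 4.99, 5.00])
loose = welch_t([5.6, 4.6, 5.4, 4.8], [5.5, 4.5, 5.3, 4.7])
```

The gene with almost no within-group variation yields a far larger t-statistic than the gene with ordinary variation, although the expression change is identical. edgeR and DESeq2 moderate this effect by borrowing variance information from other genes with similar Counts values.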

CaseStudy426 Fig4: Official Workflow vs Preprocessed ttest

Fig4: Comparison of results from workflows close to the standard recommended protocols for edgeR and DESeq2, and the t-test after preprocessing in Subio Platform. The upper panels show genes extracted with p < 0.05 and fold change ≥ 1.4 on scatter plots of mean Gene Counts in non-lesional and lesional samples. The middle panels show the overlap of genes extracted by edgeR, DESeq2, and the t-test. The lower panels show line plots of expression patterns for genes whose significance calls differed among methods.

As shown in Fig4, when low-expression filtering and preprocessing of the low-count region are applied appropriately, the major DEG candidates from edgeR, DESeq2, and the t-test largely overlap. Therefore, to obtain reliable results, it is important to look not only at the testing method itself, but also at the data distribution before testing, and to set preprocessing and filtering appropriately.

However, if all genes in the low-count region are simply removed from the analysis, ON/OFF-type genes may be missed. These are genes that are measured in one group but not measured in the other group. As described above, such genes can be extracted without relying on p-values, by using whether measured values are present or absent. Therefore, Subio recommends the following workflow.

Recommended Workflow in Subio

First, set a low signal cutoff for normalized Gene Counts. This cutoff should not be determined in advance as a fixed value such as 10, 20, or 30. Instead, it should be set after examining the distribution of each dataset. In particular, the cutoff should be placed near the lower limit that can still be treated as the signal region in the sample group with the narrowest dynamic range. Values above this cutoff are treated as the signal region, where expression values can be compared as continuous measurements. Values below this cutoff are treated as the low-count region, where ON/OFF-type changes are evaluated based on whether the gene is measured or not measured.

If the signal region and the low-count region are not separated, and data containing differences in the lower measurement limit are analyzed as they are, differences in sample dynamic range can be extracted as group differences. Such differences in the lower measurement limit are not explicitly incorporated into the statistical models of edgeR or DESeq2, nor into standard analysis protocols. However, they do exist in real RNA-Seq data and can generate apparent expression changes. This is not only a theoretical concern. As shown in Case Study No. 403, this problem can also be observed in real RNA-Seq datasets.

In the signal region, taking the intersection of significant genes detected by edgeR, DESeq2, and the t-test can be used as a conservative way to extract robust DEG candidates. At least in this dataset, genes significant in all three methods also showed reasonable group differences in the heatmap.

If it is difficult to run all three methods and take their intersection, simpler approaches can also be considered. In the signal region, one option is to use edgeR or DESeq2. Another option is to use the t-test after appropriate preprocessing.

When using edgeR or DESeq2, genes with large variation near the boundary between the noise region and the signal region may be called significant more easily. Therefore, expression patterns should be checked, and if necessary, candidates with large variation should be removed using CV or related filtering criteria. When using the t-test, combining the p-value condition with a fold change condition helps remove genes that become significant only because within-group variation is small despite a small expression change.
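As one possible form of the CV-based check mentioned above, the following Python sketch flags genes whose coefficient of variation is high within either group. The threshold of 0.5, the use of linear-scale normalized Gene Counts, and the function names are all hypothetical; an appropriate criterion should be chosen after looking at the expression patterns.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

def flag_high_cv(groups_by_gene, max_cv=0.5):
    """groups_by_gene: dict of gene -> (non_lesional values, lesional values).
    Returns genes whose CV exceeds max_cv in at least one group."""
    flagged = []
    for gene, groups in groups_by_gene.items():
        if any(coefficient_of_variation(g) > max_cv for g in groups):
            flagged.append(gene)
    return flagged

# Toy example with made-up normalized Gene Counts.
flagged = flag_high_cv({
    "stable": ([100, 110, 90, 105], [200, 210, 190, 205]),
    "noisy": ([5, 80, 10, 60], [30, 2, 90, 15]),
})
```

Flagged genes are candidates for visual inspection with line plots rather than automatic removal.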

Regardless of which method is used, ON/OFF-type genes in the low-count region should be extracted separately. This allows candidates that are difficult to handle by p-values alone to be captured more safely.

Region	When using intersection	When using edgeR/DESeq2	When using the t-test
Signal region (above the low signal cutoff)	Use genes that are significant in all three methods (edgeR, DESeq2, and the t-test) as DEG candidates.	Use p-values from edgeR or DESeq2 following their standard protocols. However, near the boundary with the noise region, genes with large variation may be called significant more easily, so their expression patterns should be checked and, if necessary, they should be removed from DEG candidates.	Use p-values from the t-test. However, combine the p-value condition with a fold change condition to remove genes with small expression changes.
Low-count region (below the low signal cutoff)	Do not use p-values directly. Instead, re-extract ON/OFF-type changes based on whether the gene is measured or not measured.	Same as for the intersection approach.	Same as for the intersection approach.

The SSA file for this dataset can be downloaded from here. By importing the SSA file into Subio Platform, you can explore the analysis results shown in this article interactively.
