Unlike the small-sample, low-variance in vitro dataset examined in the previous article, this article uses a biopsy-derived RNA-Seq dataset consisting of 20 non-lesional skin samples and 20 lesional skin samples from GSE121212, for a total of 40 samples. With 20 samples in each group, within-group variance for each gene can be estimated more stably than in the previous example, and the risk that genes with accidentally very small variance become significant by the t-test is reduced.
On the other hand, biopsy-derived data are more likely to reflect individual differences, differences in tissue state, differences in total read counts, and differences in measurement limits on the low-expression side. Therefore, in this article, we compare the results of edgeR, DESeq2, and the t-test using a dataset in which variance estimation becomes more stable because of the larger sample size, while biological variability and differences in dynamic range still remain.
Dynamic range can differ greatly among samples in biopsy data
Fig1: Comparison of Gene Count distributions across the 40 samples in GSE121212. In the histograms after normalization, the signal range on the right side is aligned, whereas the left-side low-count range extends differently among samples. In the table below the histograms, genes with no Gene Counts in 36 or more of the 40 samples were removed, the normalized Gene Counts of each sample were log2-transformed, and the samples were ordered by the 25th percentile value in descending order.
In RNA-Seq data, normalization does not make all regions of the distribution uniform. Before normalization, the far-left region near zero appears relatively aligned, whereas the position of the signal range on the right side differs among samples, reflecting differences in total read counts.
After normalization, the peak position of the signal range on the right side becomes aligned. This is a natural result in the sense that the main expression range to be compared among samples is aligned. However, as a consequence, differences in how far the left-side low-count range extends become visible among samples.
In samples with a narrower dynamic range, measured values on the low-expression side are not sufficiently obtained. Therefore, even if a normalization algorithm or batch correction method tries to align the distributions, it cannot restore low-expression information that was not measured. This has been examined in Case Study No. 403 and No. 413.
In this article, we examine what kinds of results are obtained when edgeR, DESeq2, and the t-test are applied to data that include such differences in measurement limits on the low-expression side.
Comparison conditions
In this analysis, we compared the p-values obtained from the three methods. For edgeR and DESeq2, Gene Counts for all genes were used as input, and p-values were calculated based on each statistical model. However, the purpose here was not to evaluate the entire recommended workflow of edgeR or DESeq2. Rather, the goal was to compare what kinds of genes are selected by the three significance-testing methods, including the t-test.
For the t-test in Subio Platform, no special preprocessing such as low signal cutoff or missing-value imputation was applied. The t-test was performed on data that had only been log2-transformed and globally normalized.
Genes with p < 0.05 were then extracted for each method. The log ratio of lesional samples relative to non-lesional samples was used to evaluate the magnitude of change.
Therefore, this analysis is not intended to present a standard recommended protocol. Rather, it is a test designed to examine the differences among the three significance-testing methods under conditions where low signal cutoff and missing-value handling were not applied.
The results of edgeR, DESeq2, and the t-test largely overlap
Fig2: Comparison of genes with p < 0.05 by edgeR, DESeq2, and the t-test. Many genes were judged significant by all three methods. On the other hand, genes significant by edgeR or DESeq2 but not by the t-test, as well as genes significant only by the t-test, were also observed.
As shown in Fig2, the majority of significant genes were shared by all three methods. In this 40-sample biopsy dataset, the major expression differences were detected fairly consistently by edgeR, DESeq2, and the t-test.
This is very different from the small-sample, low-variance in vitro dataset examined in the previous article. In the previous article, each group had only two replicate samples, so the genes judged significant were strongly affected by the degree of variation between the two samples within each group. In contrast, the present dataset includes 20 samples per group. This makes the estimation of within-group variance and between-group differences more stable for each gene, and the differences among the methods appear to be smaller for the major expression changes.
On the other hand, some method-dependent genes still remained. In particular, there was a noticeable group of genes that were judged significant by edgeR or DESeq2 but not by the t-test. However, many of these genes were located in the low-count range. There were approximately 3,000 genes that were significant only by edgeR or DESeq2, but about 2,000 of them were in the low-count range, while about 1,000 were in the signal range. (Fig3B, left and center)
The number of genes significant only by the t-test was 956. Therefore, when the comparison is limited to the signal range, the difference between the number of genes significant only by edgeR or DESeq2 and the number of genes significant only by the t-test is not large. (Fig3B, center and right) This result indicates that, in a dataset with 20 samples per group, the differences among the methods for detecting major expression changes were not as large as those observed in the previous article.
Genes significant by all three methods appear reasonable as group differences
Fig3A: Heatmaps of genes with p < 0.05 by all three methods: edgeR, DESeq2, and the t-test. The left panel shows the entire intersection, the center panel shows genes with a fold change greater than 1.4, and the right panel shows genes with a fold change of 1.4 or less. Even among genes with a fold change of 1.4 or less, consistent differences between non-lesional and lesional samples can be observed.
Genes judged significant by all three methods clearly separate non-lesional and lesional samples in the heatmap. This result indicates that the intersection of the three methods extracts relatively stable DEG candidates.
An interesting point is that even genes with a fold change of 1.4 or less still show patterns that are interpretable as group differences. (Fig3A, right) In the previous small-sample, low-variance dataset, genes extracted by p-value alone included many genes with small magnitudes of change, and applying a fold-change condition had practical value.
However, in a biopsy dataset with a moderate number of samples such as this one, some differences are consistently observed across many samples even when their magnitude is small. Especially in studies targeting small expression changes, it can be valuable to inspect genes extracted by p-value alone, rather than mechanically excluding them by applying a fold-change cutoff.
Method-specific genes need to be examined separately by expression range
Fig3B: Heatmaps of genes judged significant only by specific methods. The left panel shows low-count genes significant only by edgeR or DESeq2, the center panel shows signal-range genes significant only by edgeR or DESeq2, and the right panel shows genes significant only by the t-test. The gradient at the bottom indicates the 25th percentile of normalized Gene Counts for each sample. Darker colors indicate a higher 25th percentile and therefore a narrower dynamic range.
Fig3B examines genes that were not shared by all three methods but were judged significant only by specific methods. Here, genes that were significant by edgeR or DESeq2 but not by the t-test are divided into two groups for closer inspection: genes in the signal range, with average Gene Counts greater than 20 in both non-lesional and lesional samples, and the remaining genes in the low-count range.
In the low-count range shown in the left panel of Fig3B, some genes appear at first glance to be downregulated in lesional samples. However, when the 25th percentile shown at the bottom of the heatmap is examined, this pattern appears to correspond to the bias of samples with a narrower dynamic range. In this case, the group difference seen in the heatmap may reflect sample-to-sample differences in measurement range or experimental factors, rather than biological expression changes.
On the other hand, in the lower part of the same left panel of Fig3B, there are also genes that are not measured, or are nearly OFF, in non-lesional samples but appear to turn ON in lesional samples. Such ON/OFF-type changes are a region where count-based methods such as edgeR and DESeq2 can be useful.
However, what should be noted here is that the group in which samples with a narrower dynamic range are enriched is not necessarily determined by the biological condition. In the low-count range, bias in measurement range can appear as if it were a group difference. Therefore, even for low-count genes judged significant by edgeR or DESeq2, it is risky to interpret them based only on p-values.
The signal range shown in the center panel of Fig3B also requires caution. Because this group consists of genes with average Gene Counts greater than 20 in both groups, it is easier to treat as signal than the low-count range shown in the left panel. However, the heatmap pattern also appears to correspond to the 25th percentile gradient shown at the bottom. In other words, even within the signal range, patterns influenced by sample-to-sample differences in dynamic range may still be mixed in.
Furthermore, when viewed together with the scatter plots in Fig2, genes significant only by edgeR or DESeq2 were located above the diagonal line, whereas genes significant only by the t-test were located below the diagonal line. Both groups were aligned near the decision boundary. Rather than showing strong expression differences, these genes appear to be groups that crossed, or did not cross, the p-value threshold because of small differences in normalization, effect-size estimation, or related factors.
For genes significant only by the t-test in the right panel of Fig3B, the pattern does not appear to represent a large and consistent group difference. Instead, genes with a small magnitude of change and apparently small within-group variance may have passed the significance threshold of the t-test by chance. This group also includes genes in the low-count range that appear to reflect differences in dynamic range.
Therefore, the genes included in the center and right panels of Fig3B should not be treated directly as major DEG candidates. They should be interpreted separately from the genes shared by all three methods. In this analysis, it is more reasonable to treat them as targets for additional confirmation, rather than as high-priority DEG candidates.
In the low-count range, filtering based on the presence or absence of measured values is useful
Fig3C: Overlay of ON in lesional and OFF in lesional genes on the low-count range shown in the left panel of Fig3B. ON/OFF-type genes were extracted as genes for which measured values were absent in 40% or more of the samples on the OFF side, and present in 60% or more of the samples on the ON side.
In the low-count range, edgeR and DESeq2 may detect ON/OFF-type changes. However, apparent differences caused by dynamic-range bias are also mixed into the same region. Therefore, judging low-count genes by p-values alone is risky.
Therefore, in Fig3C, ON/OFF-type genes were extracted using the presence or absence of measured values themselves, rather than the p-values from edgeR or DESeq2. Specifically, genes were extracted when measured values were absent in 40% or more of the samples on the OFF side, and present in 60% or more of the samples on the ON side. With this condition, the genes that we would want to identify as ON/OFF-type changes in the heatmap are largely covered.
In this way, continuous differences in expression levels can be evaluated in the signal range, while ON/OFF-type changes can be extracted in the low-count range based on the presence or absence of measured values. This makes it possible to create a DEG candidate list that includes both the signal range and the low-count range.
What this comparison shows
The results of this comparison show that, in RNA-Seq differential expression analysis, it is important to distinguish between the low-count range and the signal range.
First, significance testing becomes difficult in the low-count range. This region is strongly affected by differences in the dynamic range of normalized Gene Counts. Even if a pattern appears to be a group difference, it may actually reflect sample-to-sample differences in measurement range or experimental factors. Therefore, p-values in the low-count range need to be handled carefully.
Second, genes judged significant by all three methods—edgeR, DESeq2, and the t-test—in the signal range appear to be fairly reasonable DEG candidates. In a biopsy dataset with a moderate number of samples such as this one, even genes with small magnitudes of change can show consistent differences across many samples. When such genes are of interest, it may be better not to mechanically add a fold-change cutoff.
Third, even within the signal range, method-specific genes need to be interpreted carefully. Genes significant only by edgeR, only by DESeq2, or only by the t-test may be affected by normalization, dispersion estimation, p-value calculation, or differences in dynamic range. They should be considered separately from genes shared by all three methods.
Fourth, in the low-count range, filtering based on the presence or absence of measured values can be powerful even without relying on statistical testing. When the goal is to identify ON/OFF-type changes, directly using the condition that a gene is measured in one group and not measured in the other can sometimes be easier to interpret than using p-values.
Considering a standard analysis protocol
In this analysis, we deliberately did not apply low signal cutoff or missing-value imputation in order to compare the three methods. In the standard analysis protocols for edgeR and DESeq2, low-expression genes are filtered, library sizes are normalized, dispersion is estimated, and statistical tests are performed based on a model. However, these steps alone cannot prevent apparent expression changes caused by sample-to-sample differences in dynamic range.
Therefore, in practical analysis, in addition to the standard procedures of edgeR and DESeq2, it is necessary to check the data distribution of each sample and the differences in how far the left side of the normalized histogram extends. The differences in measurement limits on the low-expression side that remain after normalization should not be carried directly into differential expression analysis.
Specifically, a low signal cutoff should be set for the normalized Gene Counts. However, this cutoff should not be set as low as possible. Instead, it should be set around the lower limit of the signal region, even for the group of samples with the narrowest dynamic range. The values above this cutoff should be treated as the signal range, where continuous expression values are compared. The values below this cutoff should be treated as a separate region where ON/OFF-type changes are evaluated based on the presence or absence of measured values.
If the signal range and the low-count range are not separated in this way, and data that include differences in measurement limits on the low-expression side are analyzed as they are, sample-to-sample differences in dynamic range may be extracted as if they were group differences. These differences in measurement limits on the low-expression side are not explicitly incorporated into the statistical models of edgeR or DESeq2, nor into their standard analysis protocols. However, they do exist in real RNA-Seq data and can become a source of apparent expression changes. This is not merely a theoretical concern. As shown in Case Study No. 403, this problem can also be observed in real RNA-Seq data.
In the signal range, taking the intersection of significant genes from edgeR, DESeq2, and the t-test can be used as a robust approach for extracting DEG candidates. At least in this dataset, genes significant by all three methods showed a reasonable group-difference structure in the heatmap.
If running all three methods and taking the intersection is difficult, simpler approaches can also be considered: in the signal range, using edgeR or DESeq2, or using the t-test.
When edgeR or DESeq2 is used, it is recommended to check by heatmap whether the expression patterns of the genes extracted by p-value correspond to dynamic-range bias. When the t-test is used, genes with both a small magnitude of change and small within-group variance should be removed from the genes judged significant by p-value. This additional step helps compensate for a weakness often attributed to the t-test when compared with edgeR and DESeq2.
Regardless of which method is used, ON/OFF-type genes in the low-count range need to be extracted separately.
| Region | When using intersection | When using edgeR/DESeq2 | When using the t-test |
|---|---|---|---|
| Signal range above the low signal cutoff |
Use genes significant by all three methods: edgeR, DESeq2, and the t-test, as DEG candidates. | Extract DEG candidates using p-values from edgeR or DESeq2. However, check whether the expression patterns correspond to dynamic-range bias. | Extract DEG candidates using p-values from the t-test. However, remove genes with both a small magnitude of change and small within-group variance. |
| Low-count range below the low signal cutoff |
Do not use p-values as they are. Instead, re-extract ON/OFF-type changes based on the presence or absence of measured values. | ||