Basics of RNA-Seq DEG Analysis - How to Identify Differentially Expressed Genes Starting from Gene Counts

Do you think RNA-Seq DEG analysis is simply a matter of following the recommended edgeR or DESeq2 protocol?

In practice, RNA-Seq analysis is not that simple. DEG analysis is only one part of the entire RNA-Seq workflow. It cannot be separated from normalization, the handling of low-expression genes, data checking by PCA and clustering, and how the resulting gene list will be used according to the research purpose.

This page explains the basic concepts you should understand before performing RNA-Seq DEG analysis. In particular, it explains why Gene Counts should be used as the starting point instead of TPM or FPKM, why edgeR or DESeq2 is not always sufficient by itself, and how to think about the strictness or looseness of DEG selection depending on the purpose of the analysis.

RNA-Seq DEG analysis should start from Gene Counts

In RNA-Seq analysis, DEG analysis, PCA, clustering, and related analyses should be performed consistently using normalized and transformed data that start from Gene Counts, rather than TPM, FPKM, or RPKM.

Using Gene Counts does not mean that raw Gene Counts should be used directly for PCA or clustering. It means that Gene Counts should be used as the starting point, followed by appropriate normalization, log transformation, Low Signal Cutoff, missing value imputation, and related preprocessing steps, so that the entire analysis is performed using data based on the same concept.
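The following is a minimal sketch of that idea, not the procedure used by any particular tool. It assumes simple library-size scaling as a stand-in for normalization, and it assumes that the Low Signal Cutoff drops genes that never reach the cutoff and raises remaining low values to the cutoff before log transformation. The function name, the cutoff value, and the toy counts are all hypothetical.

```python
import math

def preprocess(counts, low_signal_cutoff, scale=1_000_000):
    """Turn raw gene counts into filtered log2 values.

    counts: dict mapping gene name -> list of raw counts (one per sample).
    Returns a dict with the same layout for genes that pass the cutoff.
    """
    n_samples = len(next(iter(counts.values())))
    # Library size = total counts per sample (simple stand-in for normalization).
    lib_sizes = [sum(row[i] for row in counts.values()) for i in range(n_samples)]

    processed = {}
    for gene, row in counts.items():
        scaled = [c * scale / lib for c, lib in zip(row, lib_sizes)]
        # Assumed cutoff rule: drop a gene if no sample reaches the cutoff.
        if max(scaled) < low_signal_cutoff:
            continue
        # Assumed handling of low values: raise them to the cutoff so that
        # log2 is defined and the unstable low end is flattened.
        processed[gene] = [math.log2(max(v, low_signal_cutoff)) for v in scaled]
    return processed

# Hypothetical toy data: 3 genes x 4 samples.
raw_counts = {
    "GeneA": [500, 520, 1500, 1480],
    "GeneB": [0, 1, 2, 0],
    "GeneC": [99500, 99479, 98498, 98520],
}

log_values = preprocess(raw_counts, low_signal_cutoff=50.0)
for gene, values in log_values.items():
    print(gene, [round(v, 2) for v in values])
```

In this toy example GeneB never reaches the cutoff and is removed. The same `log_values` table would then feed PCA, clustering, and any test run on log-scale values, so that these steps all work from one preprocessing.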

TPM and FPKM are expression values corrected for gene length. Dividing by gene length puts a short gene with only a few reads on the same scale as a long gene with many reads, so unstable measurements from the low-count region become mixed with more reliable measurements from genes with sufficient expression. As a result, it becomes more difficult to judge whether the extracted expression differences are reliable.
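As a rough illustration of this mixing, the sketch below computes TPM for hypothetical genes. The gene names, counts, and lengths are invented for the example.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from raw counts and gene lengths in base pairs."""
    # Reads per kilobase of gene length.
    rpk = {gene: counts[gene] * 1000 / lengths_bp[gene] for gene in counts}
    total = sum(rpk.values())
    return {gene: value / total * 1_000_000 for gene, value in rpk.items()}

# Hypothetical genes: one short gene with 3 reads, one long gene with 300 reads.
example_counts = {"short_low": 3, "long_high": 300, "rest": 100000}
example_lengths = {"short_low": 300, "long_high": 30000, "rest": 2000}

tpm_values = tpm(example_counts, example_lengths)
for gene, value in tpm_values.items():
    print(gene, round(value, 2))
```

Here `short_low` and `long_high` receive the same TPM value, although one rests on 3 reads and the other on 300. Once the values are expressed as TPM, the difference in reliability is no longer visible in the number itself.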

It is sometimes said that TPM and FPKM are suitable for visualization, such as PCA and clustering. However, using Gene Counts for DEG analysis while using TPM or FPKM for PCA and clustering can easily make the analysis confusing. This is because DEG analysis, PCA, and clustering are not independent steps. Their results are compared with each other to judge whether the findings are reasonable. For this reason, it is more practical to keep the analysis consistent by starting from Gene Counts.

For more details, see "Why TPM, FPKM, and RPKM Should Not Be Used for RNA-Seq Differential Expression Analysis - DEG Analysis Should Start from Gene Counts".

edgeR or DESeq2 is not always enough

edgeR and DESeq2 are widely used standard methods for RNA-Seq DEG analysis. They are powerful statistical methods designed to handle the properties of Gene Counts.

However, using edgeR or DESeq2 does not automatically guarantee a reliable DEG list for every RNA-Seq dataset.

RNA-Seq data can differ greatly in quality and characteristics. Read depth, variation between samples, input RNA amount, sampling conditions, library preparation methods, sequencer differences, and many other factors can substantially change the properties of the resulting Gene Counts.

The genes you want to extract also depend on the research purpose. A broad exploratory search for candidate genes and a narrow selection of reproducible diagnostic marker candidates do not require the same type of DEG list.

In other words, there is no magic method that can be applied to every RNA-Seq dataset as-is. In practical analysis, the analyst needs to understand what kinds of expression patterns each method tends to call significant, and what kinds of patterns each method may fail to detect. Based on that understanding, the method must be selected and used appropriately.

The convenience of simply putting Gene Counts into a pipeline and obtaining results can also lead analysts to overlook batch effects, differences in dynamic range between samples, and other data-quality issues. Real RNA-Seq data are not always as clean and well-behaved as textbook examples or statistical models may assume.

Therefore, practical analysis should begin by assessing data quality and deciding on an appropriate analysis strategy. The first plan may not work as expected, so RNA-Seq analysis often becomes an iterative process: run the analysis, check whether the problems have been resolved, and revise the approach when necessary.

For more details, see "Are t-tests Really Inappropriate for RNA-Seq? DESeq2, edgeR, and Why We Should Not Overtrust Statistical Models".

Preprocessing and filtering often have a larger impact than the choice of method

When real RNA-Seq experimental datasets are compared, the DEG list is affected not only by whether edgeR, DESeq2, or a t-test is used, but also strongly by the preprocessing and filtering performed before statistical testing.

The handling of the low-count region is especially important. In the low-count region, apparent differences can arise from whether genes are detected or not detected, or from shifts in the lower limit of normalized Gene Counts caused by differences in dynamic range. If this region is passed directly into statistical testing, unreliable genes can be mixed into the DEG list.
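A small numeric sketch may help here. It assumes a pseudocount of 1 for the fold-change calculation and counts-per-million scaling for normalization. The second function is only one possible reading of how the lower limit can shift between samples, using sequencing depth as the example. All names and numbers are hypothetical.

```python
import math

def log2_fold_change(mean_a, mean_b, pseudocount=1.0):
    """log2 ratio of two group means, with a pseudocount to avoid log(0)."""
    return math.log2((mean_b + pseudocount) / (mean_a + pseudocount))

def smallest_nonzero_cpm(library_size):
    """Counts-per-million value that a single read corresponds to."""
    return 1_000_000 / library_size

# The same absolute difference of 3 reads, in the low-count and high-count regions.
lfc_low = log2_fold_change(0, 3)
lfc_high = log2_fold_change(1000, 1003)
print("low-count region :", round(lfc_low, 3))
print("high-count region:", round(lfc_high, 3))

# The lowest nonzero normalized value depends on sequencing depth.
floor_shallow = smallest_nonzero_cpm(5_000_000)
floor_deep = smallest_nonzero_cpm(20_000_000)
print("lower limit at 5M reads :", floor_shallow)
print("lower limit at 20M reads:", floor_deep)
```

A gene that is simply detected in one group and not in the other shows a fourfold apparent change from only 3 reads, while the same 3-read difference is negligible at 1000 reads. This is why the low end is usually handled before testing rather than passed in as-is.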

On the other hand, when Gene Counts data are appropriately preprocessed, the differences between the results obtained by edgeR, DESeq2, and t-tests become smaller.

However, the differences between methods do not disappear completely. edgeR, DESeq2, and t-tests each have expression patterns that they tend to call significant, as well as patterns that they are less likely to detect. Therefore, in practical DEG analysis, you need to choose a method that is suitable for extracting the expression pattern you are looking for.

This idea is explained in more detail, based on comparisons of several types of real RNA-Seq datasets, in "How to Use edgeR, DESeq2, and t-test in RNA-Seq Differential Expression Analysis - Practical Rules by Data Type".

Statistical significance should be used according to the research purpose

In DEG analysis, statistical significance is evaluated using P-values and FDR. However, statistical significance itself is not enough as the goal of practical analysis.
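For reference, a common way to obtain FDR values from P-values is the Benjamini-Hochberg adjustment. The sketch below is a plain implementation of that adjustment on hypothetical P-values. It is shown only to make the term concrete and is not tied to any specific tool.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical P-values for four genes.
example_p = [0.01, 0.04, 0.03, 0.20]
example_fdr = benjamini_hochberg(example_p)
print([round(q, 4) for q in example_fdr])
```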

No statistical model perfectly represents real experimental data. What matters is how to use statistical methods to find candidate genes that match the purpose of the research.

An expression pattern that tends to be called statistically significant is essentially a step-like pattern: small variation within each group and a large difference between groups. If the purpose is to develop diagnostic markers or classification models, it is reasonable to extract this kind of expression pattern.

On the other hand, if the research purpose is to investigate upstream biological phenomena close to the cause of the observed changes, such step-like expression patterns are not necessarily the only important ones. For example, consider cases where a gene changes only when multiple conditions are combined, or cases where an effect is canceled by an Inhibitor. Biologically important changes can also appear as small changes. Genes related to a specific cell type may be observed in Bulk RNA-Seq as expression patterns with very small fold changes and low variance. If you are looking for such genes, a t-test may be effective.
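The sketch below shows why a t-test can pick up such genes. It computes Welch's t statistic on hypothetical log2 expression values. The choice of Welch's form, the sample values, and the variable names are assumptions made for this example.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for the difference in means (group_b minus group_a)."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_b) - mean(group_a)) / standard_error

# Small difference (0.3 on the log2 scale, about 1.23-fold) with very low variance.
tight_a = [8.00, 8.02, 7.98]
tight_b = [8.30, 8.32, 8.28]

# Large difference (2.0 on the log2 scale, 4-fold) with high variance.
noisy_a = [6.0, 8.0, 10.0]
noisy_b = [8.0, 10.0, 12.0]

t_tight = welch_t(tight_a, tight_b)
t_noisy = welch_t(noisy_a, noisy_b)
print("small change, low variance :", round(t_tight, 2))
print("large change, high variance:", round(t_noisy, 2))
```

The gene with the small but consistent change gets a far larger t statistic than the gene with the 4-fold but noisy change. A fold-change threshold alone would rank the two genes the other way around.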

For this reason, DEG analysis should not be based only on whether a gene is statistically significant. You also need to consider what the resulting gene list will be used for. The way DEG candidates should be handled differs depending on whether the analysis is exploratory, whether the goal is to select biomarker candidates, whether the results will be used for a paper or conference presentation, or whether the list will be used as input for GO analysis or Pathway analysis.

Purpose of analysis	Example approach to DEG candidates
Exploratory analysis	Consider candidates broadly, including the union of DEGs detected by multiple methods.
Selection of biomarker candidates	Place more emphasis on genes commonly detected by multiple methods as more reproducible candidates.
Analysis used to support claims in a paper	Define criteria such as FDR and Fold Change in advance, and interpret the results carefully.
Pathway or enrichment analysis	Keep a somewhat broader candidate list when the purpose is to see biological trends.
Analysis assuming follow-up validation experiments	Keep a somewhat broader candidate list to avoid missing promising candidates.

In this way, DEG analysis should not apply statistically strict criteria uniformly in every situation. The balance between strictness and breadth should be considered according to the research purpose and the stage of the analysis.
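The broad and strict approaches in the table above can be sketched with simple set operations. The gene names and per-method lists below are hypothetical placeholders, not results from real data.

```python
def combine_deg_lists(deg_lists, mode):
    """Combine DEG lists from several methods.

    deg_lists: dict mapping method name -> iterable of gene names.
    mode: "union" for a broad list, "intersection" for a strict list.
    """
    gene_sets = [set(genes) for genes in deg_lists.values()]
    if mode == "union":
        return set.union(*gene_sets)
    if mode == "intersection":
        return set.intersection(*gene_sets)
    raise ValueError("mode must be 'union' or 'intersection'")

# Hypothetical DEG lists from three methods.
deg_by_method = {
    "edgeR": ["GeneA", "GeneB", "GeneC"],
    "DESeq2": ["GeneA", "GeneB", "GeneD"],
    "t-test": ["GeneA", "GeneE"],
}

broad_candidates = combine_deg_lists(deg_by_method, "union")
strict_candidates = combine_deg_lists(deg_by_method, "intersection")
print("exploratory (union)     :", sorted(broad_candidates))
print("biomarker (intersection):", sorted(strict_candidates))
```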

This idea is also explained in the section on extracting differentially expressed genes in "Bulk RNA-Seq Data Analysis Tutorial: Learn the Workflow and How to Interpret Results".
The relationship between statistical significance and models involving multiple factors or an Inhibitor is also discussed in Case Study No.199.

After understanding the background, perform DEG analysis in practice

After understanding these background concepts, the next step is to perform DEG analysis with real data. In Subio Platform, you can use plugins to perform differential expression analysis using t-tests, paired t-tests, ANOVA, and related methods.

If you want to use edgeR or DESeq2, you can export a Gene Counts table from Subio Platform, run edgeR or DESeq2 in R, and then import the resulting P-values and FDR values back into Subio Platform for visualization.

This workflow is explained in "edgeR and DESeq2 Analysis with ChatGPT: RNA-Seq Tutorial Using R", including how to export Gene Counts, create an R script, run edgeR and DESeq2, and re-import the results.

Summary

In RNA-Seq DEG analysis, what matters is not only which statistical method is used, but also what data are used as the starting point, how the data are preprocessed, and what the resulting gene list will be used for.

Starting from Gene Counts instead of TPM or FPKM, and applying appropriate normalization, log transformation, Low Signal Cutoff, filtering, and related steps, keeps PCA, clustering, and DEG analysis consistent with each other. This makes it possible to check whether the results support each other.

edgeR, DESeq2, and t-tests are all tools, and each method has expression patterns that it tends to call significant. What matters is not to overtrust any one method, but to assess the quality of the data and choose DEG candidates that fit the research purpose.

DEG analysis is not simply the task of collecting genes with small P-values. It is part of a complex analysis process for understanding the biological phenomena taking place inside cells.
