In RNA-Seq differential expression analysis, statistical models based on the negative binomial distribution, such as DESeq2 and edgeR, are widely used. On the other hand, t-tests are often described as “not suitable for RNA-Seq” or “inappropriate for count data.”
This explanation is useful as an introductory warning. In particular, it helps prevent the mistake of applying a t-test directly to raw count values.
However, the statement that “a negative binomial model captures the data structure of RNA-Seq better than a t-test” has an important prerequisite. It mainly applies when the data being analyzed are raw count values, or data in which low-expression genes and distributional distortions have not been sufficiently addressed.
Raw count values are non-negative integers. Low-expression regions often contain many zeros and ones. The variance depends on the mean, and it typically exceeds what a simple Poisson model would predict (overdispersion). Applying a t-test, which assumes a normal distribution, directly to such data is risky. In this sense, the negative binomial model can be considered better suited to the properties of raw RNA-Seq count data.
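As a rough illustration of what overdispersion looks like, the sketch below compares the per-gene mean and variance for a few made-up count vectors. The gene names and counts are hypothetical, and the method-of-moments estimate of the dispersion (alpha in variance = mean + alpha * mean^2) is a simplification chosen for readability. It is not how DESeq2 or edgeR estimate dispersion.

```python
from statistics import mean, variance

# Hypothetical raw counts for three genes across six replicates (illustrative only).
counts = {
    "gene_low": [0, 1, 0, 2, 1, 0],
    "gene_mid": [48, 95, 60, 130, 72, 101],
    "gene_high": [900, 1500, 1100, 2100, 1300, 1750],
}


def moment_dispersion(values):
    """Return (mean, sample variance, alpha) with alpha from var = mu + alpha * mu**2."""
    mu = mean(values)
    var = variance(values)  # sample variance (n - 1 denominator)
    if mu == 0:
        return mu, var, float("nan")
    return mu, var, (var - mu) / mu ** 2


summary = {gene: moment_dispersion(values) for gene, values in counts.items()}
for gene, (mu, var, alpha) in summary.items():
    print(f"{gene}: mean={mu:.2f} variance={var:.2f} alpha={alpha:.3f}")
```

Under a Poisson model the variance would equal the mean, so alpha would be close to zero. For the hypothetical mid- and high-expression genes here, the variance is far larger than the mean, which is the pattern a negative binomial model is designed to absorb.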
However, once appropriate normalization, preprocessing, and low-expression filtering have been performed, and the data distribution and relationships among samples have been checked, this explanation can no longer be applied in the same way. At that stage, the data being analyzed are no longer raw count values themselves. They have already been prepared in a form that is better suited to statistical analysis.
For such data, it becomes difficult to make a simple argument that “the negative binomial model is inherently more correct” or that “t-tests are inappropriate for RNA-Seq.” Both are mathematical models for interpreting data, and each should be understood as a method with its own assumptions and limitations.
What matters is not choosing a statistical method first, but checking the state of the data, understanding the effects of preprocessing and filtering, and interpreting the results with a clear view of what the method assumes and what it evaluates.
Why Did t-Tests Come to Be Avoided in RNA-Seq Analysis?
The reason t-tests came to be avoided in RNA-Seq analysis is not simply a matter of mathematical superiority. There were also practical reasons.
When RNA-Seq first became widely used, it was not uncommon for researchers to apply general statistical tests to raw counts or improperly processed data without sufficiently considering low-expression genes, count data with many zeros and ones, differences in library size, or overdispersion.
Applying a simple t-test to such data can greatly distort the results because of unstable values in low-count regions, differences in library size between samples, and variance that depends on the mean expression level.
The spread of RPKM and FPKM also made this problem harder to recognize. Because RPKM and FPKM are values adjusted for library size and gene length, they can appear to be “normalized expression values” that can be used directly for statistical analysis.
However, converting values to RPKM or FPKM does not make low-count measurements more reliable. Even a difference between raw counts of 1 and 2 may appear as a two-fold difference after conversion. Such differences may strongly reflect measurement uncertainty or sampling noise.
In other words, RPKM and FPKM do not solve the variability problem in low-count regions. Instead, they can make that instability look like “normalized continuous values.” As a result, applying a t-test or another general statistical test to RPKM or FPKM values can lead to unstable differences derived from low-expression genes being treated as meaningful changes.
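The arithmetic behind this point can be sketched as follows. The gene length and library size are hypothetical, both samples are assumed to share the same library size, and the noise estimate assumes pure Poisson sampling (it ignores biological variability, so real data would be noisier).

```python
import math


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


def poisson_cv(expected_count):
    """Coefficient of variation of a Poisson count: 1 / sqrt(mean)."""
    return 1.0 / math.sqrt(expected_count)


GENE_LENGTH_BP = 2_000            # hypothetical gene length
TOTAL_MAPPED_READS = 20_000_000   # hypothetical library size, same in both samples

fold_changes = {}
for low, high in [(1, 2), (1000, 2000)]:
    low_rpkm = rpkm(low, GENE_LENGTH_BP, TOTAL_MAPPED_READS)
    high_rpkm = rpkm(high, GENE_LENGTH_BP, TOTAL_MAPPED_READS)
    fold_changes[(low, high)] = high_rpkm / low_rpkm
    print(
        f"counts {low} vs {high}: RPKM {low_rpkm:.3f} vs {high_rpkm:.3f}, "
        f"fold change {fold_changes[(low, high)]:.1f}, "
        f"Poisson CV at the lower count {poisson_cv(low):.3f}"
    )
```

Both comparisons give the same two-fold change in RPKM, but the expected relative sampling noise at a count of 1 is far larger than at a count of 1000. The conversion keeps the ratio and hides how little evidence supports it.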
TPM later appeared and came to replace FPKM and RPKM in many contexts. TPM has the advantage of making expression levels more comparable across samples than FPKM. However, TPM does not fundamentally solve the problem discussed here.
Even after conversion to TPM, issues such as low-count instability, expression-dependent variance, and variability under small-sample conditions do not disappear. Therefore, TPM should not simply be used as input for a general statistical test. Also, negative binomial models assume integer counts, so using TPM as input is outside the assumptions of the model and is inappropriate.
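A minimal sketch of the TPM calculation makes both points visible. The counts and gene lengths are hypothetical and stand in for a whole transcriptome; a real sample would have many thousands of genes.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [rate / total_rate * 1e6 for rate in rates]


# Hypothetical counts and gene lengths for a single sample.
sample_counts = [1, 2, 500, 3000]
gene_lengths_bp = [1000, 1000, 2000, 1500]

tpm_values = tpm(sample_counts, gene_lengths_bp)
for count, value in zip(sample_counts, tpm_values):
    print(f"count={count:>5} -> TPM={value:.2f}")
print(f"sum of TPM = {sum(tpm_values):.1f}")
```

The resulting values are non-integer and always sum to one million within a sample, so they no longer behave like counts. At the same time, the two genes with counts of 1 and 2 still differ by exactly two-fold, so the low-count instability is carried over unchanged.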
To avoid such misuse, RNA-Seq-specific statistical models such as DESeq2 and edgeR became widely recommended. These tools can handle low-expression genes, library size correction, expression-dependent variance, and variance estimation under small-sample conditions within a single framework.
What further supported this shift was the practical efficiency of being able to use Gene Counts directly as input. In contrast to a simple t-test, which can produce large errors if applied directly to unprocessed Gene Counts, RNA-Seq-specific models could receive Gene Counts as input and process them through a standardized analysis workflow. This made them well suited for pipelines that process large amounts of data.
This convenience contributed greatly to the standardization of RNA-Seq analysis. At the same time, however, the reassurance that “the standard model will handle the data once Gene Counts are provided” also encouraged analyses in which researchers received only the final results without sufficiently checking the state of the data.
In this sense, the statement “do not apply a t-test directly to RNA-Seq data” is a reasonable safety warning for beginners. However, if it gradually turns into the simplistic belief that “t-tests are inherently inferior methods for RNA-Seq analysis,” the explanation becomes inaccurate.
After Preprocessing, the Output Tendency of a Method Matters More Than Its Theoretical Superiority
As discussed above, using raw count values, RPKM/FPKM, or TPM directly for statistical testing requires caution.
However, once low-expression genes have been removed, distributions and sample relationships have been checked, and normalization or log transformation has been performed, the premise of the discussion changes. At this stage, the data are no longer raw count values themselves. They have already been prepared in a form that is better suited to statistical analysis.
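One possible form of this preparation is sketched below: library-size scaling to counts per million (CPM), a low-expression filter, and a log transformation. The counts, library sizes, filter thresholds, and pseudocount are all assumptions chosen for illustration. They are not recommended values, and the right choices depend on the dataset.

```python
import math

# Hypothetical counts (gene -> one value per sample) and total mapped reads per sample.
counts = {
    "geneA": [0, 1, 0, 2],
    "geneB": [150, 310, 120, 260],
    "geneC": [2000, 4100, 1500, 3600],
}
library_sizes = [10_000_000, 20_000_000, 8_000_000, 18_000_000]

MIN_CPM = 1.0      # assumed threshold for calling a gene expressed
MIN_SAMPLES = 2    # assumed number of samples that must pass the threshold
PSEUDOCOUNT = 1.0  # assumed offset to avoid log2(0)


def cpm(values, lib_sizes):
    """Scale counts by library size to counts per million."""
    return [value / size * 1e6 for value, size in zip(values, lib_sizes)]


log_cpm = {}
for gene, values in counts.items():
    gene_cpm = cpm(values, library_sizes)
    passing = sum(x >= MIN_CPM for x in gene_cpm)
    if passing >= MIN_SAMPLES:
        log_cpm[gene] = [math.log2(x + PSEUDOCOUNT) for x in gene_cpm]

for gene, values in log_cpm.items():
    print(gene, [round(v, 2) for v in values])
```

In this made-up example, geneA is removed by the filter, and the two remaining genes have similar log values across samples even though their raw counts differ by more than two-fold because of library size.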
For such data, we cannot decide which method is inherently more correct simply by comparing the names “t-test” and “negative binomial model.” No statistical model exactly represents the underlying biological state. Every model is an approximation for evaluating differences under a specific set of assumptions.
In this context, the practical focus shifts away from theoretical superiority. What becomes important is understanding which kinds of genes are more likely to be detected by a given method, and which kinds of genes may be missed. Beyond that, we need to consider, based on biological background, clinical models, the nature of the experimental system, and the experience and intuition of experimental biologists, what kind of expression pattern the target genes are expected to show, and then choose methods and thresholds accordingly.
For example, the appropriate method and thresholding strategy will differ depending on whether you want to focus on genes that show small but consistent changes in a disease model, genes that are strongly switched on or off by a treatment, or subtype-specific genes that respond strongly only in a subset of samples.
Genes with high expression and small but consistent differences between groups, genes with large Fold Change but unstable low-expression measurements, and genes that appear as strong outliers in only some samples may be treated differently depending on the statistical method and preprocessing strategy.
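To make this tendency concrete, the sketch below computes a Welch t statistic for three hypothetical genes that follow the three patterns just described. All values are invented log2 expression levels, and only the test statistic is computed (no p-value), so this shows the direction of the effect rather than a full test.

```python
import math
from statistics import mean, variance

# Hypothetical log2 expression values: (4 control samples, 4 treated samples).
genes = {
    "small_consistent": ([10.0, 10.1, 9.9, 10.0], [10.4, 10.5, 10.3, 10.4]),
    "large_fc_unstable": ([0.0, 1.0, 0.0, 2.0], [3.0, 0.0, 4.0, 1.0]),
    "subset_responder": ([6.0, 6.1, 5.9, 6.0], [6.0, 6.1, 9.0, 9.1]),
}


def welch_t(control, treated):
    """Return (difference in means, Welch t statistic) for two groups."""
    diff = mean(treated) - mean(control)
    standard_error = math.sqrt(
        variance(control) / len(control) + variance(treated) / len(treated)
    )
    return diff, diff / standard_error


results = {name: welch_t(control, treated) for name, (control, treated) in genes.items()}
for name, (diff, t_stat) in results.items():
    print(f"{name}: log2 difference={diff:.2f}, Welch t={t_stat:.2f}")
```

The gene with the smallest difference gets by far the largest t statistic because its variance is tiny, while the gene that responds strongly in only two of four treated samples scores much lower. A ranking by fold change would order these genes differently, which is why the choice of method and threshold should follow from the expression pattern you are looking for.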
Therefore, a statistical method is not a device that automatically produces the correct answer. It is a tool for selecting candidate genes in light of the research objective. In actual analysis, it is essential to check what kind of gene list a method tends to produce, and whether that list fits the research purpose and biological hypothesis.
The Error Rate of a Statistical Model Is Not the Biological Difference Itself
P-values and FDR are indicators calculated under the assumptions of a statistical model. They represent error rates and significance as defined within the model, but they do not directly represent biological differences themselves.
For example, even if a result is statistically significant, a very small Fold Change may not represent a meaningful change for the research question. Conversely, even when the p-value is not extremely small, there may be genes that deserve careful attention when sample size, variability, measurement limits, outliers, and batch effects are considered.
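The gap between statistical significance and effect size can be sketched with a small example. The p-values and log2 fold changes are hypothetical, the Benjamini-Hochberg adjustment is used here as one common way to obtain FDR-adjusted values (the article does not specify a procedure), and the two thresholds are arbitrary assumptions.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical test results for four genes.
gene_ids = ["g1", "g2", "g3", "g4"]
pvalues = [0.005, 0.01, 0.03, 0.04]
log2_fold_changes = [0.15, 1.8, -1.2, 0.4]

PADJ_CUTOFF = 0.05   # assumed significance threshold
MIN_ABS_LOG2FC = 1.0  # assumed effect-size threshold

padj = benjamini_hochberg(pvalues)
significant = [g for g, q in zip(gene_ids, padj) if q < PADJ_CUTOFF]
selected = [
    g
    for g, q, lfc in zip(gene_ids, padj, log2_fold_changes)
    if q < PADJ_CUTOFF and abs(lfc) >= MIN_ABS_LOG2FC
]
print("significant by adjusted p-value:", significant)
print("significant and above effect-size threshold:", selected)
```

All four hypothetical genes pass the adjusted p-value cutoff, but only two also pass the effect-size threshold. Whether the other two matter is a question about the research objective, not something the adjusted p-value can answer.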
In this way, the theoretical assumptions of a statistical method and its practical use in real data analysis do not always align in a simple way. Why t-tests can have theoretical problems yet still work in practice is discussed in more detail in another article.
Bulk RNA-Seq data also include not only protein-coding genes but also non-coding genes. In typical bulk RNA-Seq, most read counts are assigned to protein-coding genes, whereas low-count regions contain values near the measurement limit, low-expression genes, non-coding genes, and measurements that may be affected by annotation or library preparation.
In other words, low-count regions are not simply a collection of “low-expression genes.” They contain overlapping factors with different properties, including measurement uncertainty, gene category differences, library preparation characteristics, and annotation uncertainty.
Putting such complex data into a single statistical model does not automatically guarantee biological interpretation. A statistical model is a tool for interpreting data, not biological judgment itself.
Therefore, what matters in such regions is not to leave the judgment to the statistical model, but for the analyst to consider, in light of the research objective, what kind of expression pattern the genes of interest should show. For example, the appropriate method and thresholding strategy will differ depending on whether you prioritize genes with small Fold Change but also very small variance and consistent differences between groups, or genes with large variance that nevertheless change in a binary manner in a specific subtype or responder group.
The Convenience of Modern Statistical Models Can Make Batch Effects Harder to Notice
Modern statistical models such as DESeq2 and edgeR are highly useful. However, the convenience of entering Gene Counts into a standard pipeline and automatically obtaining differential expression results can also create another problem.
The problem is not the modern statistical model itself. The problem is that, because Gene Counts can be entered and the pipeline returns results automatically, these methods can be used as a standard procedure without sufficiently checking the state of the data. As a result, analyses that still contain batch effects or biased sample composition may be treated as statistically well-formed DEG lists. In this sense, the risk of overlooking such problems may have increased. The importance of checking PCA and expression distributions before deciding that a pattern is a batch effect is discussed in detail in another case study.
A negative binomial model is a model for handling library size and count-data variability. However, it does not automatically detect and remove batch effects caused by experimental date, reagent lot, measurement facility, processing order, sample preparation conditions, or similar factors.
Batch effects are often not random noise. They are systematic distortions linked to sample groups or experimental conditions. If the control group and treatment group were prepared on different days, or measured using different reagent lots, a statistical model may treat those differences as biological differences.
To handle batch effects within a model, they must be explicitly incorporated into the experimental design or model formula. Furthermore, when condition and batch are completely confounded, statistical modeling alone cannot separate biological differences from batch effects.
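Complete confounding can be checked from the sample sheet alone, before any modeling. The sketch below is a minimal version of that check; the sample sheets, field names, and the helper function are hypothetical.

```python
from collections import defaultdict


def batch_determines_condition(samples):
    """Return True when every batch contains samples from only one condition."""
    conditions_by_batch = defaultdict(set)
    for sample in samples:
        conditions_by_batch[sample["batch"]].add(sample["condition"])
    return all(len(conditions) == 1 for conditions in conditions_by_batch.values())


# Hypothetical design: all controls on day1, all treated samples on day2.
confounded_design = [
    {"id": "s1", "condition": "control", "batch": "day1"},
    {"id": "s2", "condition": "control", "batch": "day1"},
    {"id": "s3", "condition": "treated", "batch": "day2"},
    {"id": "s4", "condition": "treated", "batch": "day2"},
]

# Hypothetical design: each day contains both conditions.
balanced_design = [
    {"id": "s1", "condition": "control", "batch": "day1"},
    {"id": "s2", "condition": "treated", "batch": "day1"},
    {"id": "s3", "condition": "control", "batch": "day2"},
    {"id": "s4", "condition": "treated", "batch": "day2"},
]

print("confounded design:", batch_determines_condition(confounded_design))
print("balanced design:", batch_determines_condition(balanced_design))
```

In the first design, knowing the batch tells you the condition, so no model formula can separate the two effects. In the second, a batch term can be added to the model because both conditions are observed within each batch.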
In other words, using a modern statistical model does not solve problems in experimental design or sample composition. If the input data contain structural bias, even an advanced statistical model will return results in a well-formed format while still carrying that bias.
Looking at the Data Manually Is Not an Outdated Approach
Checking data manually during analysis is not simply an old or inefficient approach. Rather, by examining PCA, clustering, histograms, scatter plots, expression distributions, and relationships among samples, you may notice batch effects, outliers, or biased sample composition.
In an analysis where Gene Counts are entered into a pipeline and only the DEG list is received, there are fewer opportunities to check the state of the data along the way. As a result, even if the output is statistically well-formed, it becomes easier to overlook what kind of data structure supports that result.
In contrast, when data are checked step by step, you can examine distributions before and after normalization, the effects of low-expression genes, sample separation in PCA, grouping in clustering, and expression patterns of individual genes. This process cannot be automatically replaced by a statistical model.
For example, if samples separate by experimental date rather than biological condition in PCA, that is important information that should be checked before differential expression analysis. If histograms show large distributional differences between samples, the cause should be considered before moving directly to statistical testing.
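A full PCA needs a numerical library, but the same question (do samples group by date or by condition?) can be asked with plain pairwise correlations. The sketch below is a simplified stand-in for a PCA plot, not a replacement for it. The expression values, sample names, and the date effect built into them are all hypothetical.

```python
import math
from itertools import combinations
from statistics import mean

# Hypothetical log expression for six genes; names encode (condition, date).
samples = {
    ("control", "day1"): [5.0, 8.0, 3.0, 10.0, 6.0, 7.0],
    ("treated", "day1"): [5.3, 8.0, 3.0, 10.3, 6.0, 7.0],
    ("control", "day2"): [7.0, 6.0, 5.0, 8.0, 8.0, 5.0],
    ("treated", "day2"): [7.3, 6.0, 5.0, 8.3, 8.0, 5.0],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


same_date, same_condition = [], []
for (key_a, values_a), (key_b, values_b) in combinations(samples.items(), 2):
    r = pearson(values_a, values_b)
    if key_a[1] == key_b[1]:
        same_date.append(r)
    if key_a[0] == key_b[0]:
        same_condition.append(r)

mean_same_date = mean(same_date)
mean_same_condition = mean(same_condition)
print(f"mean correlation, same date:      {mean_same_date:.3f}")
print(f"mean correlation, same condition: {mean_same_condition:.3f}")
```

In this made-up dataset, samples processed on the same date are far more similar to each other than samples from the same condition. That is the kind of structure that should be investigated before any differential expression test is run.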
If these checks are skipped and only the DEG list is examined, batch effects, outliers, normalization effects, and low-expression gene contamination may go unnoticed. Statistically significant genes may then be treated as biologically meaningful genes without sufficient verification.
Looking at the data manually is not a rejection of statistical models. Rather, it is necessary for interpreting the results of statistical models correctly.

What Matters Is Not Which Method to Trust, but How to Judge the Data
This article is not intended to recommend t-tests. Nor is it intended to reject modern RNA-Seq statistical models such as DESeq2 or edgeR.
The issue is that one-sided explanations such as “t-tests are bad” or “you should use a modern statistical model” are not sufficient for making decisions in RNA-Seq data analysis.
For raw count values or data that have not been sufficiently preprocessed, it is reasonable to say that a negative binomial model is better suited to handling the data structure of RNA-Seq. However, once the data have been properly normalized, preprocessed, and filtered, and their distributions and sample relationships have been checked, we can no longer simply say that “the negative binomial model is correct” or “t-tests are inappropriate.”
Whether we use a modern statistical model or a general t-test, the results depend on the state of the data, preprocessing, filtering, model assumptions, thresholds, sample composition, batch effects, outliers, and other factors. Therefore, the danger is not a particular statistical method itself, but applying a method blindly without checking the state of the data or the assumptions of the method.
A statistical model is not a device that automatically gives the answer. It is a tool for interpreting data. What matters is not which method was used, but whether the results are consistent with the data distribution, relationships among samples, expression patterns, and biological context.
In RNA-Seq analysis, we need to look at the data before choosing a statistical method. And after obtaining statistical results, we need to return to the data and check again.
Look at the data, check the assumptions, and question the results. That is more important than the name of the statistical method in RNA-Seq differential expression analysis.