How to Interpret Persistent Distortion in RNA-Seq Data|CPM, TMM, VST, Quantile Normalization, and ComBat

This article is a follow-up to the previous case study.
In this article, we further examine the sample structure observed in the previous case study by testing how it appears under different normalization and correction methods. The decisions discussed here, including sample exclusion and correction, are not based on this article alone. They are based on the visualization results shown in the previous case study . Please read the previous article first, then read this article as a follow-up.

In the previous article, we showed that when RNA-Seq data were visualized after CPM normalization, some samples displayed expression profiles that differed from the other samples and formed characteristic clusters in the clustering analysis.

When we see such a result, it is natural to wonder: “Is this simply an effect of CPM normalization?” or “Would another normalization method solve the problem?”

In this article, we tested several normalization, transformation, and correction methods using the same dataset, and examined how much the previously observed sample structure changed. We also examined not only clustering results, but also where the gene groups characteristic of the doubt samples were located in the histograms of each sample.

The important point here is not to decide which method is the only correct one. Rather, we need to observe how the data change, or do not change, after normalization or correction, and consider what the remaining structure may reflect.

The structure of the doubt samples remained even after changing normalization and transformation methods

The color blocks in the figures indicate the sample structure observed in the previous analysis. Red indicates the doubt samples, which appeared clearly separated from the other samples in PCA and clustering. Yellow indicates samples labeled as close to doubt. These samples were located between the doubt samples and most of the other samples in PCA, and their histograms also showed expression profiles closer to those of the doubt samples.

Green indicates a cluster detected by clustering the remaining samples after removing the doubt and close to doubt samples. This cluster contained approximately half of the Healthy control samples. Blue indicates the remaining samples, including the other half of the Healthy control samples and the Diseased samples.

The doubt sample structure remained even after TMM-normalized CPM

First, we applied TMM normalization to the same Gene Count data and performed clustering using the TMM-normalized CPM data. TMM normalization is a widely used and useful normalization method in RNA-Seq analysis.

However, in this dataset, the sample structure observed in the previous CPM-normalized data did not change substantially even after using TMM-normalized CPM. The color block at the bottom of Figure 1-a shows the clustering result based on the previous CPM-normalized data. The doubt samples shown in red and the close to doubt samples shown in yellow again formed a group with similar expression profiles after TMM-CPM correction.

In this analysis, we extracted the gene cluster that appeared commonly high in the doubt samples as high@doubt, and the gene cluster that appeared commonly low in the doubt samples as low@doubt. In Figure 1-a, these gene clusters are shown with gray overlays.

TMM-CPM_clustering

Figure 1-a. Clustering result using TMM-normalized CPM data. The gray overlays indicate the high@doubt and low@doubt gene clusters.

The doubt sample structure also remained after DESeq2 VST

Next, we transformed the same Gene Count data using DESeq2 variance stabilizing transformation (VST), imported the transformed values into Subio Platform, and performed clustering. VST is a transformation designed to reduce the relationship between the mean and variance of count values, making it easier to evaluate sample distances and clustering.

Because VST values are already transformed values, we used them directly in Subio Platform without applying an additional log2 transformation.

Even after VST transformation, the doubt samples shown in red and the close to doubt samples shown in yellow again formed a group with similar expression profiles. In Figure 2-a, the high@doubt and low@doubt gene clusters extracted in Figure 1-a are overlaid in gray. Even after VST transformation, these gene groups were still observed as characteristic expression patterns of the doubt samples.

VST_clustering

Figure 2-a. Clustering result using DESeq2 VST-transformed data. The gray overlays indicate the high@doubt and low@doubt gene clusters extracted in Figure 1-a.

Quantile normalization did not eliminate the cluster structure either

As an additional reference comparison, we applied quantile normalization to the raw Gene Count data. Quantile normalization is a method that forces the distributions among samples to become similar, and has been commonly used in microarray analysis.

In RNA-Seq data, however, quantile normalization may also force biologically meaningful differences in expression distributions to become similar. Therefore, it should be treated not as a standard solution, but as a reference comparison showing what happens when the distributions are strongly adjusted.

In this analysis, we applied quantile normalization to the Gene Count data, then performed log2 transformation and centering (ratio to average) before clustering. Even with this strong correction, the cluster structure observed in the previous CPM-normalized data did not change substantially. In Figure 3-a, the high@doubt and low@doubt gene clusters also remained as characteristic patterns associated with the doubt samples.

Quantile_clustering

Figure 3-a. Clustering result using Gene Count data after quantile normalization, followed by log2 transformation and centering (ratio to average). The gray overlays indicate the high@doubt and low@doubt gene clusters extracted in Figure 1-a.

Histograms show that the characteristics of the doubt samples remain after correction

In Gene Count data, the histograms of the doubt samples are shifted to the left

By comparing the histograms of each sample, the characteristics of the doubt samples become clearer. Figure 4 compares the distributions of Gene Count, TMM-CPM, VST, and quantile-normalized data.

In the Gene Count data, the histograms of the doubt samples are clearly shifted to the left. This indicates that the overall dynamic range of count values is narrower in these samples.

In TMM-CPM, the overall histogram appears to shift horizontally while largely preserving its shape. This suggests that TMM-CPM performs a relatively linear-like correction.

In contrast, VST substantially changes the shape of the distribution. This is because VST is a transformation that reduces the relationship between the mean and variance of count values, rather than a correction that simply shifts the histogram horizontally.

Quantile normalization basically corrects the data toward making the distribution shapes similar across samples. Therefore, when a sample group such as the doubt samples has a distribution that is substantially different from the others, and the number of such samples is small, the corrected distribution shape can change considerably.

Even after VST and quantile normalization, the histograms of the doubt samples still appeared to lack the left side of the distribution. This suggests that the original narrow dynamic range still affects the distribution after correction.

Comparison with histograms_overall

Figure 4. Comparison of histograms for Gene Count, TMM-CPM, DESeq2 VST, and quantile-normalized data. In the Gene Count data, the histograms of the doubt samples are shifted to the left, indicating that their original dynamic range is narrow.

high@doubt genes appear to be lifted after correction

Next, we examined where the genes that appear commonly upregulated in the doubt samples (high@doubt) are located in each distribution after correction. In Figure 5, the high@doubt genes are shown in black.

In the Gene Count data, the high@doubt genes are biased toward the left side of the histogram. In other words, these genes are originally located in a low-count region.

However, in TMM-CPM, as the entire histogram of the doubt samples shifts to the right, these genes appear relatively higher after correction. As a result, they appear as genes commonly upregulated in the doubt samples.

In VST and quantile normalization, the effect is not seen as a simple horizontal shift as in TMM-CPM. However, the effect is similar: genes that were originally located within a narrow dynamic range are lifted after correction, and remain relatively high in the doubt samples.

Comparison with histograms_UP@doubt

Figure 5. Histogram view showing high@doubt genes in black. In the Gene Count data, high@doubt genes are biased toward the left side of the histogram. In TMM-CPM, the entire histogram shifts to the right, making these genes appear upregulated after correction.

For low@doubt genes, the correction has an effect but is not complete

Next, we examined the distribution of genes that appear commonly downregulated in the doubt samples (low@doubt). In Figure 6, the low@doubt genes are shown in black.

In the Gene Count data, the low@doubt genes are located on the right side of each sample’s histogram, that is, among genes with relatively high count values. However, because the overall dynamic range is narrow in the doubt samples, the Gene Count values of these genes do not become as high as in the other samples. As a result, they appear markedly lower in the doubt samples.

In TMM-CPM, the conversion to the CPM scale shifts the entire histogram of the doubt samples to the right, and the difference in the low@doubt genes becomes much smaller. However, the difference does not disappear completely, and the low@doubt genes still remain slightly lower after correction.

In VST and quantile normalization, the relationship is harder to see as a simple horizontal shift of the entire histogram, as in TMM-CPM. However, even with these corrections and transformations, the difference in the low@doubt genes becomes much smaller, while it is not completely corrected, and these genes still tend to appear slightly lower in the doubt samples.

Comparison with histograms_DOWN@doubt

Figure 6. Histogram view showing low@doubt genes in black. In the Gene Count data, low@doubt genes are located on the right side of the histogram, but because the dynamic range is narrow in the doubt samples, their count values do not become as high as in the other samples.

A weak distributional difference also remains between the blue and green samples

The doubt and close to doubt samples are clearly different from the other samples even in the overall PCA and clustering. However, when focusing on the high@doubt and low@doubt gene groups, a small distributional difference can also be seen between the blue (4-mixed node) and green (3-healthy node) samples, although it is much weaker than that seen in the doubt samples.

This suggests that the sample structure cannot be explained solely by the doubt samples. A weaker, stepwise difference may also exist among the remaining samples. Therefore, even after removing the doubt samples, it is necessary to examine the structure of the remaining blue and green samples.

Normalization and transformation alone may not solve the problem

Taken together, the expression patterns associated with the doubt samples do not appear to be simple expression differences that were already present as such in the raw Gene Count data. Rather, they are likely to reflect the distortion and narrow dynamic range of the original Gene Count distribution, which appear as relative upregulation or downregulation after normalization or transformation.

In particular, when the histogram shapes differ substantially between samples, that is, when the dynamic ranges differ greatly, the effect can be very problematic. Even after applying different methods such as TMM-CPM, DESeq2 VST, and quantile normalization, the patterns corresponding to high@doubt and low@doubt remained. This shows that distortion in the original data distribution cannot be easily solved simply by changing the correction algorithm.

When the distribution or expression profile of the raw data is strongly distorted in this way, changing normalization or transformation methods alone often cannot restore the data to a state suitable for analysis. If such samples are forced into the analysis after correction, they may emphasize DEG extraction or clustering results in the wrong direction.

Therefore, for samples that clearly show expression profiles different from the others in visualization, it may be safer to exclude them from the analysis, rather than trying to force them into the dataset through correction, as long as the reason for exclusion is clearly recorded based on histograms, PCA, clustering, and related evidence.

Removing strongly distorted samples and examining the remaining structure

Next, we removed the red and yellow samples, which showed strongly distorted expression profiles, and reanalyzed only the remaining blue and green samples.

The red and yellow samples showed expression profiles different from the others in PCA, clustering, and histograms. Therefore, removing these samples and reanalyzing the data is considered a reasonable decision based on visualization results.

However, because we do not have access to laboratory notes or detailed batch information, we cannot determine whether these samples were actually poor-quality samples or whether they corresponded to specific experimental conditions or technical batches.

Here, we used the data after this exclusion to examine the structure of the remaining samples.

ComBat correction using hypothetical batch information weakened the cluster difference

Next, for the data after removing the strongly distorted samples, we treated the remaining green and blue label information as a hypothetical batch variable and applied ComBat correction.

The label information used here is not explicit batch information based on laboratory notes, sequencing runs, or similar records. Therefore, this analysis does not claim that the observed structure is an actual batch effect. Rather, it is a hypothetical examination of how the data change when the visually observed structure is treated as a batch variable.

In this analysis, ComBat correction was applied to DESeq2 VST-transformed values. The VST values and ComBat-corrected values were used directly in Subio Platform without applying an additional log2 transformation.

After ComBat correction, the large difference between clusters observed before correction was greatly weakened. This is consistent with the purpose of applying ComBat correction in this test.

On the other hand, this result alone does not mean that the cluster was an actual technical batch. If a structure observed in clustering is specified as a batch variable, it is natural that the structure becomes weaker after correction. When applying a similar correction to real research data, it is essential to confirm whether there is a valid reason to remove that structure.

Combat-VST_clustering

Figure 7. Clustering result after ComBat correction using the green and blue label information as a hypothetical batch variable. After ComBat correction, the difference between the green and blue samples observed before correction was greatly weakened.

DEG lists change depending on the correction method

As shown by the results above, DEG lists change depending on data structure and on the normalization or correction methods used. In other words, it is risky to perform DEG analysis without correctly understanding the characteristics of the data and the methods used for normalization or correction. Before interpreting a DEG list, we need to confirm what kind of data structure and preprocessing the result is based on.

Correction is not magic that makes data “correct”

In this analysis, structures associated with the doubt samples remained even after comparison using TMM-CPM, DESeq2 VST, and quantile-normalized data. In the histograms as well, the characteristic positions of the high@doubt and low@doubt gene groups did not completely disappear after correction.

On the other hand, after removing the clearly distorted samples, when the remaining green and blue label information was treated as a hypothetical batch and ComBat correction was applied, the difference between the green and blue samples observed before correction was greatly weakened. This result shows that correction methods can change how the data appear and can also change DEG lists.

However, this is not simply a matter of choosing which correction method is correct. Before applying correction, we need to consider what should be treated as the correction target and whether there is a valid reason to remove it.

When experimental conditions or batch information are not available, the corrected result should not be treated as the correct answer. Instead, we need to distinguish between trends that remain stable across multiple analysis conditions and features that change depending on the conditions.

Normalization and correction are not magic that make the data “correct.” After normalization, transformation, or correction, it is important to visualize the data and confirm what has changed and what has remained.

Distorted data cannot be made as if they never existed simply by applying an algorithm afterward. Corrections such as ComBat can weaken a specified structure. However, that does not necessarily mean that the data have been restored to the state they should have been in.

The most important point is not to rely entirely on post-analysis correction, but to minimize unnecessary variation as much as possible at the stages of experimental design, sample preparation, library preparation, sequencing conditions, and related procedures. Analysis is the process of checking the state of the obtained data and determining which differences may be corrected and which differences should not be corrected.