Apparent Differences Created by CPM Normalization and Potential Biases at Multiple Steps

Understanding the Challenges of RNA-Seq Analysis Through Visualization

Case Study: GSE173789

“Why do samples from the same condition show larger variation than the difference between groups?”

One of the unavoidable challenges in omics data analysis is the batch effect, a form of systematic bias. Many researchers assume that once RNA-seq data have been normalized to CPM (counts per million), the samples are ready for comparison. However, depending on the state of the data, CPM normalization can itself create or exaggerate apparent differences between samples.

In this article, we use a real dataset to show how to visually inspect what is happening in the data before relying on advanced correction algorithms.

Focus on Distribution Changes, Not Just Counts

In Subio Platform, experimental parameters and histograms showing value distributions can be viewed in the same table. This makes it easier to visually check whether the data distribution appears to be related to experimental parameters or sample-level information.

In RNA-seq data, samples with extremely low read counts may show relatively elevated expression values after CPM normalization, because each read is divided by a small library size and therefore contributes a larger CPM value. In the histogram shown in this video, this appears as a shift of the distribution to the right.
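The effect described above can be reproduced with a small simulation. The following is a minimal sketch, not the Subio Platform implementation and not the GSE173789 data: the gene count, the log-normal expression model, and the two sequencing depths are all assumptions chosen for illustration. The same "true" expression profile is sequenced once deeply and once shallowly, and the log2 CPM values of the detected genes are compared.

```python
import numpy as np

def cpm(counts):
    """Counts per million: divide one sample's counts by its library size."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum() * 1e6

# Hypothetical simulation (not GSE173789): one set of "true" expression
# proportions, sequenced at two very different depths.
rng = np.random.default_rng(0)
n_genes = 20_000
weights = rng.lognormal(mean=0.0, sigma=2.0, size=n_genes)
proportions = weights / weights.sum()

deep = rng.multinomial(10_000_000, proportions)  # 10 million reads
shallow = rng.multinomial(100_000, proportions)  # 0.1 million reads

def detected_log2_cpm(counts):
    """log2 CPM of genes with at least one read (zeros cannot be logged)."""
    values = cpm(counts)
    return np.log2(values[values > 0])

deep_log = detected_log2_cpm(deep)
shallow_log = detected_log2_cpm(shallow)

# A single read is worth 1e6 / library_size CPM:
# 0.1 CPM in the deep sample, 10 CPM in the shallow one.
print("smallest detected CPM:", 2 ** deep_log.min(), 2 ** shallow_log.min())
print("median log2 CPM of detected genes:",
      np.median(deep_log), np.median(shallow_log))
```

Although both samples share identical underlying expression, the shallow sample loses its weakly expressed genes to zero counts and its smallest detectable value is 100 times higher, so the histogram of detected genes sits further to the right.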

In this dataset, many of the samples that appear biased in PCA are also the samples with smaller read counts or download sizes in SRA. This tendency is especially clear for MS-13, MS-16 to MS-19, and HC-19.

However, not all of the observed bias can be explained by file size alone. For example, some samples that appear biased in PCA do not have extremely small data sizes. Therefore, in addition to the effect of CPM normalization on low-read-count samples, it is also necessary to consider sample heterogeneity and other possible sources of data bias.
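One way to check how much of a PCA pattern tracks sequencing depth is to correlate the first principal component with the log library size. The sketch below uses simulated counts rather than the real dataset; the cohort layout (8 deep and 3 shallow samples with identical biology) and the log2(CPM + 1) transform are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 20_000
weights = rng.lognormal(mean=0.0, sigma=2.0, size=n_genes)
proportions = weights / weights.sum()

# Hypothetical cohort: 8 deep samples and 3 shallow ones, identical biology.
depths = np.array([10_000_000] * 8 + [100_000] * 3)
counts = np.stack([rng.multinomial(d, proportions) for d in depths])  # samples x genes

lib_sizes = counts.sum(axis=1, keepdims=True)
log_cpm = np.log2(counts / lib_sizes * 1e6 + 1.0)

# PCA via SVD of the gene-centered matrix; each row of `scores` is a sample.
centered = log_cpm - log_cpm.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
scores = u * s
explained = s**2 / np.sum(s**2)

r = np.corrcoef(scores[:, 0], np.log10(depths))[0, 1]
print(f"PC1 explains {explained[0]:.0%} of the variance; "
      f"corr(PC1, log10 depth) = {r:.2f}")
```

A strong correlation suggests that read depth is worth investigating, but it does not prove that depth is the cause, and a weak one does not rule out other sources of bias. Samples that look biased despite an ordinary library size still need a separate explanation.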

It is therefore important to examine carefully whether the extracted differentially expressed genes (DEGs) truly reflect biological differences, or whether they include apparent differences caused by data size, normalization, or bias within the sample groups. If data analysis proceeds without visual inspection, there is a risk of misinterpreting the resulting DEG list.

Integrated view of experimental parameters and RNA-Seq data distributions in Subio Platform. A batch effect is visualized as a rightward shift in the histograms of specific samples, independent of biological conditions.

Transcriptome analysis is based on the assumption that, among all genes, most expression levels remain unchanged and only a subset of genes show differential expression. This assumption is what makes normalization possible.

If the shapes of the histograms differ greatly between samples, this assumption may not hold well enough. Therefore, rather than proceeding with analysis based only on normalized values, it is important to examine the distribution shapes and sample-level data sizes when interpreting the results.
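As a rough numerical companion to the visual check, the sketch below flags samples whose median log expression lies far from the rest of the cohort. It is only a screening aid under stated assumptions, not a replacement for looking at the histograms: the function name `flag_shifted_samples`, the use of the median absolute deviation, and the cutoff of 3 are all hypothetical choices, and the input values are toy numbers.

```python
import numpy as np

def flag_shifted_samples(log_expr, n_mads=3.0):
    """Flag samples (columns) whose median is unusually far from the cohort.

    log_expr: genes x samples array of log-scale values; undetected genes
    may be passed as NaN so that they are ignored.
    Returns (boolean flags, per-sample medians).
    """
    medians = np.nanmedian(log_expr, axis=0)
    center = np.median(medians)
    mad = np.median(np.abs(medians - center))
    deviation = np.abs(medians - center) / max(mad, 1e-9)  # guard against MAD = 0
    return deviation > n_mads, medians

# Toy input: six samples with the same distribution shape but different offsets.
base = np.linspace(0.0, 10.0, 101)
offsets = np.array([0.0, 0.1, -0.1, 0.2, -0.2, 2.5])
log_expr = base[:, None] + offsets[None, :]

flags, medians = flag_shifted_samples(log_expr)
print(flags)
```

Only the last sample, shifted by 2.5 log units, is flagged here. A median-based screen detects location shifts only; differences in distribution shape, such as a missing low-expression tail, still have to be judged from the histograms themselves.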

Data Biases Can Arise at Multiple Steps

Identifying samples that clearly require caution, such as those with extremely low read counts, does not necessarily end the analytical process. Even after samples that may strongly violate the assumptions are removed and the data are visualized again, another sample structure may appear.

In this dataset, the Control samples appear to separate into two clusters. One Control cluster shows a profile that is distant from the disease group, whereas the other Control cluster shows a profile closer to the disease group.

In such a situation, at least two possibilities need to be considered.

Possibility 1: The Pattern Reflects Biological Heterogeneity

The Control group may include samples with different biological backgrounds, and these differences may be reflected in their expression profiles. For example, age, sex, inflammatory status, medical history, or differences in cell subset composition could potentially contribute to the observed pattern.

In this case, the separation of the Control group is not simply noise. It may contain important information for interpretation. Therefore, such samples should not be removed without sufficient justification.

Possibility 2: The Pattern Reflects Experimental or Technical Bias

On the other hand, the separation of the Control group may have been caused by technical factors, such as library preparation, sequencing run, sample processing date, storage conditions, RNA quality, or other experimental factors.

In this case, the observed difference may not reflect a biological disease effect, but rather a bias introduced during data generation or processing. Whether such samples should be included in the main analysis needs to be considered carefully.

Some Questions Cannot Be Answered by Algorithms Alone

In this type of situation, simply asking which statistical method should be used is not enough. Changing the algorithm does not eliminate the underlying data structure: the Control samples still appear to separate into two groups.

What is needed here is information outside the computational results. For example, the clinical background of the Control samples, sample collection conditions, library preparation date, sequencing run, RNA quality, read counts, and laboratory records may help determine whether the observed structure reflects biological differences or technical bias.
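Once such records are available, a simple first step is to tabulate cluster membership against each annotation and see whether any of them lines up with the split. The sketch below assumes hypothetical cluster labels and a hypothetical library preparation date column; none of these values come from GSE173789.

```python
from collections import Counter

def cross_tabulate(clusters, annotation):
    """Count samples for each (cluster, annotation value) pair."""
    if len(clusters) != len(annotation):
        raise ValueError("clusters and annotation must have the same length")
    return Counter(zip(clusters, annotation))

# Hypothetical labels for six Control samples.
control_cluster = ["A", "A", "A", "B", "B", "B"]
library_prep_date = ["day1", "day1", "day1", "day2", "day2", "day1"]

table = cross_tabulate(control_cluster, library_prep_date)
for (cluster, value), n in sorted(table.items()):
    print(f"cluster {cluster}  {value}: {n}")
```

A table like this only shows whether an annotation coincides with the clusters. With few samples, and with annotations that are themselves confounded, deciding whether the coincidence is meaningful remains a judgment for the analyst and the people who know the samples.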

Even so, it is not always possible to obtain a clear answer to every question. In such cases, the analyst needs to clarify the assumptions being made based on the available information, and choose an analysis strategy according to those assumptions.

The important point is not to avoid making a judgment, but to make the assumptions behind the analysis explicit. Whichever interpretation is adopted, that assumption will strongly influence how the analysis results should be understood.

In omics data analysis, it is not always possible to reach conclusions through a fully mechanical workflow. It is important to visualize the state of the data, examine relevant background information when necessary, and make conscious analytical decisions.

What Small Changes May Suggest: Complementary Use of Bulk and Single-Cell Analysis

Suppose we decide that some of the Control samples that appear far from the disease group are more likely to reflect technical bias than biological differences. In that case, the expression differences observed between the remaining Control samples and the disease group may become very small.

In bulk RNA-seq, many different cells are measured together. If important changes occur only in a small subset of cells, the overall expression difference may appear small. Therefore, it is natural to consider whether the data should be examined by single-cell RNA-seq, or whether the target cells should be further enriched by cell sorting and the experiment repeated.
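The dilution described above can be made concrete with a little arithmetic. The sketch below assumes, for simplicity, that every cell contributes the same amount of RNA and the same baseline expression of the gene, and that the gene changes by a fixed factor in only a fraction of the cells; the function name `bulk_fold_change` is hypothetical.

```python
import math

def bulk_fold_change(fraction_changed, fold_in_subset):
    """Fold change seen in bulk when only a fraction of cells respond.

    Assumes equal RNA content and equal baseline expression in all cells,
    so the bulk signal is the average over cells:
    (1 - f) * 1 + f * fold = 1 + f * (fold - 1).
    """
    if not 0.0 <= fraction_changed <= 1.0:
        raise ValueError("fraction_changed must be between 0 and 1")
    return 1.0 + fraction_changed * (fold_in_subset - 1.0)

fc = bulk_fold_change(0.05, 4.0)  # a 4-fold change in 5% of the cells
print(f"bulk fold change = {fc:.2f}, log2 = {math.log2(fc):.2f}")
```

Under these assumptions, a 4-fold change confined to 5% of the cells appears in bulk as a fold change of only 1.15, or about 0.2 on the log2 scale, which is easily lost among sample-to-sample variation.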

At the same time, there is another important point to keep in mind. Single-cell analysis is extremely useful for examining differences among cell populations, but its data characteristics differ from those of bulk RNA-seq. It can be strongly affected by dropout, measurement noise, cell number, and preprocessing conditions.

Therefore, before chasing subtle signals with another method, it may be useful to first examine whether small but consistent expression differences can be observed in bulk RNA-seq, where measurement stability is often higher.

The important question is not “bulk or single-cell?” These approaches can be used in a complementary way. Bulk RNA-seq can be used to examine the overall picture and reproducible changes, while single-cell analysis or cell fractionation experiments can be used when needed to deepen the interpretation at the cell population level.

What Is the Role of the Analyst?

The discussion above describes how I interpreted this dataset as I examined it. It is not intended to claim that this is the only correct answer.

What I want to emphasize is the importance of considering multiple possibilities while looking at the data, and clarifying the points at which interpretation may diverge. When it is not possible to narrow the explanation down to a single possibility, the analyst should clearly present the reason and organize the analysis results according to each possible scenario.

This helps support constructive discussion with team leaders, experimental researchers, clinicians, pathologists, and others who understand the background of the data. Mastering advanced statistical methods alone is not the essence of omics data analysis. Visualizing the data, checking assumptions, carefully explaining the characteristics of the data and the analysis results, and organizing possible interpretations are also important parts of the analyst’s work.

What It Means to Learn Data Analysis

Learning data analysis means gaining experience by working with many different datasets, visualizing them, forming hypotheses, checking those hypotheses, and making analytical decisions. Naturally, this cannot be mastered overnight. It requires time, persistence, and repeated practice.

How Subio Supports Your Growth

Subio provides practical learning through online training, where we work together with customers using their own data.

Although the pace varies from person to person, many users can steadily build their analytical thinking and decision-making skills through approximately three to six one-on-one sessions over the course of about one year, using their own data as the learning material.

For those who need results quickly, or who first want their data to be examined from an expert perspective, we recommend our data analysis service.

Subio’s data analysis service is not simply a contract analysis service that delivers a results report. It is designed to help customers understand the characteristics of their data and the analysis steps, and to provide the information needed for customers themselves to make decisions and draw conclusions.

If you are interested, please feel free to contact us.

What you need to learn is not only commands or how to operate tools.
What you truly need to learn is data analysis itself.
