What Is Reproducibility in Single-Cell RNA-Seq Analysis? Explaining Data Quality and Its Limitations | Revisiting 2019 Insights from a 2026 Perspective

This page provides a clear explanation of the limitations and data quality issues in single-cell RNA-Seq (scRNA-Seq) analysis.

Introduction: Have You Ever Looked Behind the “Beautiful Figures”?

Single-cell RNA-Seq (scRNA-Seq) has become a standard approach in omics analysis. The colorful visualizations generated by tools like Loupe Browser often resemble immunostaining or FISH images, giving the impression that they faithfully represent the true state of cells.

However, one must be careful not to be misled by this visual appeal. When the data are examined in detail using Subio Platform, two things become evident beneath these images: a fragility in the data that cannot be fully corrected computationally, and stochastic detection events that occur purely by chance.

1. The Problem of Sequencing Depth in scRNA-Seq and the Illusion of “Cellular Identity”

This is an issue we raised as early as 2019, when single-cell RNA-Seq was just beginning to gain traction.

In protocols such as those from 10x Genomics, the number of reads per cell is only on the order of tens of thousands. Compared to conventional bulk RNA-Seq, where a single sample typically receives millions of reads, this is dramatically low.

A low read count implies an extremely limited dynamic range. When visualizing histograms of read counts across cells in Subio Platform, it becomes immediately clear that there are substantial differences in sequencing depth among cells.
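As a rough illustration of what such a depth histogram is built from, the per-cell totals can also be computed outside Subio Platform. This is a minimal sketch, assuming a genes-by-cells count matrix held in a NumPy array; the function names and the toy numbers are hypothetical and are not taken from GSE164898.

```python
import numpy as np


def depth_per_cell(counts):
    """Total count per cell for a genes x cells matrix (column sums)."""
    return np.asarray(counts).sum(axis=0)


def depth_summary(counts):
    """Summarize how unevenly sequencing depth is spread across cells."""
    depth = depth_per_cell(counts)
    return {
        "min": int(depth.min()),
        "median": float(np.median(depth)),
        "max": int(depth.max()),
        # Guard against division by zero for completely empty cells.
        "max_over_min": float(depth.max() / max(depth.min(), 1)),
    }


# Toy matrix (made-up values): 4 genes as rows, 3 cells as columns.
toy = np.array([
    [5, 0, 120],
    [1, 0, 40],
    [0, 1, 300],
    [2, 0, 75],
])

summary = depth_summary(toy)

# Histogram of log10(depth + 1), the kind of view in which depth
# differences between cells become visible at a glance.
hist, edges = np.histogram(np.log10(depth_per_cell(toy) + 1), bins=5)
print(summary)
```

On real data the same column sums would be taken over tens of thousands of genes, and the spread between the shallowest and deepest cells is what the histogram exposes.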

Figure: scRNA-Seq scatter plot

The figure above shows scatter plots comparing the cells with the lowest sequencing depth against those with the highest (generated from GSE164898). The black dots represent ribosomal protein genes, which are expected to be highly expressed in all cells. From this figure, the following observations can be made:

Low-depth cells: Even ribosomal protein genes are only barely detectable.
High-depth cells: At best, only a few genes show expression levels comparable to ribosomal proteins.

This leads to an important conclusion: the technology itself carries a structural risk of misinterpreting simple “depth variation” as “cellular identity.”

Another fundamental principle must not be forgotten:
“Not detected” does not mean “not expressed.”
It may simply mean that the gene was not detected by chance.
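One way to make this concrete is a simple sampling model. The sketch below assumes that reads are drawn independently, so that the number of reads observed for a gene is approximately Poisson with mean depth × fraction. This is a simplification introduced here for illustration, not a description of any particular protocol, and the example numbers are made up.

```python
import math


def prob_not_detected(depth, fraction):
    """Probability of observing zero reads for a gene.

    depth:    total reads obtained from the cell
    fraction: share of the cell's transcripts that belong to this gene
    Assumes Poisson sampling with mean depth * fraction (an assumption
    made for this illustration).
    """
    return math.exp(-depth * fraction)


# A gene that accounts for 1 in 10,000 transcripts in both cells.
fraction = 1e-4

shallow = prob_not_detected(2_000, fraction)    # roughly 0.82
deep = prob_not_detected(50_000, fraction)      # roughly 0.007

print(f"shallow cell: P(not detected) = {shallow:.3f}")
print(f"deep cell:    P(not detected) = {deep:.3f}")
```

Under these assumptions the same gene, expressed at the same relative level, is missed in most shallow cells and almost never missed in deep ones, purely because of depth.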

The “beautiful figures” produced by analysis tools are, in reality, constructed from such highly unstable data. We must recognize this inherent uncertainty.

2. Do Not Ask AI Only for “Solutions” (A Perspective from 2026)

Now, let us shift to the perspective of 2026. The key difference from 2019 is that anyone can now consult AI for analysis.

If you ask AI, “What are the limitations of scRNA-Seq?”, you will likely receive a reasonable answer. However, try the following: ask AI, “How can batch effects in scRNA-Seq be corrected?” You will quickly be presented with sophisticated methods such as MNN or Harmony.

If you are unaware of the fundamental limitations of data quality, you may accept these answers without question. However, once you understand the underlying fragility of the data, a critical question naturally arises:

“Can data with missing signals truly be ‘corrected’ through computation alone?”

Once this question emerges, you can begin to interrogate AI more deeply—asking about the characteristics and limitations of each method. Instead of focusing only on solutions (How), you examine the underlying problem structure (What). Recognizing contradictions and maintaining a healthy skepticism—this sense of “something feels off”—is essential for analysts to survive in the age of AI. It is the essence of critical thinking.

The Mirage Created by Advanced Normalization

Between 2019 and 2026, there have been advances in normalization methods for scRNA-Seq. CPM (Counts Per Million) is commonly used to correct for depth variation, but in cells with extremely low read counts, the detection of even a single read can disproportionately influence expression values, amplifying variability.
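The arithmetic behind this amplification is short enough to write out. The following sketch uses made-up cell totals (1,000 and 50,000 reads) to show how much a single read is worth after CPM scaling; the helper name is hypothetical.

```python
def cpm(count, total):
    """Counts per million: scale a gene's count by the cell's total."""
    return count * 1_000_000 / total


# One read for the same gene in a shallow cell and in a deep cell.
shallow_cell_total = 1_000
deep_cell_total = 50_000

one_read_shallow = cpm(1, shallow_cell_total)   # 1000.0
one_read_deep = cpm(1, deep_cell_total)         # 20.0

print(one_read_shallow, one_read_deep)
```

In the shallow cell, the difference between detecting zero reads and one read is a jump from 0 to 1,000 CPM, fifty times the jump in the deep cell, even though the underlying event is the same single read.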

This phenomenon—noise amplification in low-input data—is not unique to single-cell analysis. As shown in Case Study 403, it is a universal challenge in data analysis, also encountered when analyzing lowly expressed genes in bulk RNA-Seq.

More advanced methods such as TMM and Median-of-ratios (originally designed for bulk RNA-Seq) are now recommended in some contexts. However, applying them to scRNA-Seq data can be a risky gamble. This is because it is inherently difficult to determine whether the “corrected data” produced by sophisticated algorithms are actually closer to the truth than simple CPM.

These methods rely on a mathematical assumption: that most genes are not differentially expressed. But recall that in scRNA-Seq, only a small subset of highly expressed genes can be reliably detected. If we are effectively observing only this limited subset, the validity of that assumption becomes questionable.
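The practical consequence can be sketched with the median-of-ratios idea itself. In its usual bulk formulation, each sample's size factor is the median, over genes, of the count divided by that gene's geometric mean across samples, and genes with a zero in any sample drop out because their geometric mean is zero. The code below is a simplified reimplementation written for illustration (it is not the code of any specific package), and both toy matrices are made up.

```python
import numpy as np


def usable_gene_count(counts):
    """Number of genes with a nonzero count in every column."""
    return int(np.all(np.asarray(counts) > 0, axis=1).sum())


def median_of_ratios_size_factors(counts):
    """Simplified median-of-ratios size factors for a genes x samples matrix."""
    counts = np.asarray(counts, dtype=float)
    usable = np.all(counts > 0, axis=1)
    if not usable.any():
        raise ValueError("no gene is detected in every column")
    log_counts = np.log(counts[usable])
    # Log of each gene's geometric mean across columns.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Median over genes of the log ratio, per column.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))


# Bulk-like toy data: every gene detected, second sample twice as deep.
bulk_like = np.array([[10, 20], [100, 200], [5, 10], [50, 100]])

# Sparse toy data: most genes are zero in at least one cell.
sparse = np.array([[10, 0], [0, 3], [1, 0], [2, 1]])

print(median_of_ratios_size_factors(bulk_like))  # about [0.71, 1.41]
print(usable_gene_count(bulk_like), usable_gene_count(sparse))
```

In the bulk-like case all four genes support the estimate. In the sparse case only one gene survives, so the "median" rests on a single ratio, which is the situation in which the assumption that most genes are unchanged can no longer be checked against the data.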

In practice, rather than blindly trusting a single method, analysts should compare results across multiple normalization approaches. By examining both normalized and raw count data side by side, one must determine which representation introduces the least distortion—or at least provides the most reasonable interpretation.

If you wish to explore raw scRNA-Seq data in Subio Platform, you can download a matrix file from GEO and ask ChatGPT to generate a Python script to convert it into a dense matrix TSV format (genes as rows, cells as columns). Running the script in Jupyter will produce a file that can be directly imported into Subio. Thanks to AI, such “tasks” have become remarkably easy. For more details, see the separate article: “Generating Code with ChatGPT and Executing It in Jupyter.”
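For orientation, a conversion script of this kind might look roughly like the following. It is a minimal sketch, assuming the GEO download is a Matrix Market coordinate file accompanied by gene and barcode lists with one entry per line (identifier in the first tab-separated column). The function names and the file names in the usage comment are hypothetical, and the exact layout Subio Platform expects should be checked against its own documentation.

```python
import csv
import gzip

import numpy as np


def _open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def read_first_column(path):
    """Read the first tab-separated field of every non-empty line."""
    with _open_text(path) as handle:
        return [line.rstrip("\n").split("\t")[0] for line in handle if line.strip()]


def read_matrix_market(path):
    """Read a Matrix Market coordinate file into a dense integer array."""
    with _open_text(path) as handle:
        # Skip blank lines and '%' comment/header lines.
        lines = (ln for ln in handle if ln.strip() and not ln.startswith("%"))
        n_rows, n_cols, _ = (int(x) for x in next(lines).split())
        dense = np.zeros((n_rows, n_cols), dtype=np.int64)
        for line in lines:
            row, col, value = line.split()
            # Matrix Market indices are 1-based.
            dense[int(row) - 1, int(col) - 1] = int(float(value))
    return dense


def mtx_to_dense_tsv(mtx_path, genes_path, barcodes_path, out_path):
    """Write a dense TSV with genes as rows and cells as columns."""
    genes = read_first_column(genes_path)
    cells = read_first_column(barcodes_path)
    dense = read_matrix_market(mtx_path)
    if dense.shape != (len(genes), len(cells)):
        raise ValueError(f"matrix shape {dense.shape} does not match labels")
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["gene"] + cells)
        for gene, row in zip(genes, dense):
            writer.writerow([gene] + row.tolist())


# Hypothetical usage:
# mtx_to_dense_tsv("matrix.mtx.gz", "features.tsv.gz", "barcodes.tsv.gz", "dense.tsv")
```

Note that a dense matrix stores every zero explicitly, so memory use and output size grow with genes × cells; for large datasets it may be necessary to convert a subset of cells.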

Even as of 2026, scRNA-Seq remains a developing technology, and its reliability is far from absolute. Rather than assuming that “advanced methods guarantee correctness,” we should remain aware of current limitations while anticipating future improvements—particularly in measurement sensitivity and stability (rather than merely in correction algorithms). Above all, we must maintain a mindset of critical thinking.

Conclusion: Distinguishing Truth with Your Own Eyes

Therefore, it is essential to examine data with your own eyes using Subio Platform.

If analysis remains a black box and you simply accept AI-generated “plausible results,” you may never even notice these inconsistencies. By directly observing the distortions and limitations inherent in raw data, and internalizing them through visual understanding, you can move beyond being a passive user of tools.

Only through this process can you evolve into a true analyst—one who uses AI effectively while drawing conclusions based on independent judgment.

Subio’s Philosophy and Approach

At present, the nature of scRNA-Seq data demands a high level of judgment in interpreting analytical results.

For this reason, Subio takes the position that standard analysis services and training, both of which assume a certain level of data quality and reproducibility, should be applied to scRNA-Seq only after careful consideration.

Therefore, analysts must adopt the mindset of examining data with their own eyes and making informed judgments.

Subio Platform is designed as an analysis environment that enables this approach.
