What Is Good Gene Expression Data? Understanding Noise, Reproducibility, and Dynamic Range

When looking at gene expression data, many people first focus on reproducibility. Do replicate samples under the same condition look similar? Do the points align along the diagonal in a scatter plot? Is the correlation coefficient high?

Of course, reproducibility is important when evaluating the quality of gene expression data. But if the correlation coefficient is high, can we really say that the data are “good”?

On this page, we use microarray data as an example to discuss the relationship among noise, reproducibility, and dynamic range. The important point is not to treat noise simply as something to be discarded, but to determine where noise exists and from which range the signal can be interpreted with confidence.

Is noise just garbage?

Imagine measuring the same sample twice using a microarray. When the two measurements are compared in a scatter plot, large variation may appear in the low-signal region. At first glance, such data may look noisy and low in quality.

However, seeing noise in gene expression data is not unnatural. It is unlikely that all genes are expressed in a given cell or tissue. In addition, every measurement system has a limited range in which values can be measured reliably.

Values for genes that are not expressed, or for lowly expressed genes near the detection limit, are less stable even when the same sample is measured repeatedly. For this reason, variation in the low-signal region is a natural feature of expression data.
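The following is a minimal simulation sketch of why low values are less stable. It assumes a simple error model that is not taken from any specific platform: each measurement has a proportional error plus an additive background error. The function names and the parameters `background_sd`, `relative_sd`, and `floor` are hypothetical and chosen only for illustration.

```python
import math
import random
import statistics

def measure(true_level, rng, background_sd=20.0, relative_sd=0.05, floor=1.0):
    """One hypothetical measurement: proportional error plus additive background error."""
    value = true_level * (1.0 + rng.gauss(0.0, relative_sd)) + rng.gauss(0.0, background_sd)
    return max(value, floor)  # clip so that log2 is defined

def replicate_log_sd(true_level, n=2000, seed=0):
    """SD of log2(replicate 1) - log2(replicate 2) over n simulated replicate pairs."""
    rng = random.Random(seed)
    diffs = [
        math.log2(measure(true_level, rng)) - math.log2(measure(true_level, rng))
        for _ in range(n)
    ]
    return statistics.stdev(diffs)

for level in (0, 30, 100, 1000, 10000):
    sd = replicate_log_sd(level)
    print(f"true level {level:>6}: SD of log2 replicate difference = {sd:.2f}")
```

Under these assumptions, the spread of the replicate difference on the log scale is largest for unexpressed genes and shrinks as the true level rises, because the additive background error dominates only when the true level is small.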

The problem is not that noise exists. The problem is that noise becomes hidden, making it difficult to determine the boundary between noise and signal.


Fig1: An example of Affymetrix HG-U133 Plus2.0 data summarized using MAS5. Variation is visible in the low-signal region at the lower left of the scatter plot.

In this figure, variation is visible in the low-signal region. At first glance, the data may appear to have low reproducibility. However, this variation does not necessarily indicate a measurement failure. Rather, it can be interpreted as a region containing many genes that are not expressed or are close to the detection limit.

When viewed as a histogram, the distribution of MAS5 signal values may appear to have a single peak. In practice, however, it is more natural to regard it as a two-component structure in which the noise region and the signal region overlap. Not all low signal values are pure noise; some weak but real signals are also included in this region.

Therefore, the boundary between the noise region and the signal region should not be treated as a single clear line. It should be understood as a gradual transition zone.
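The idea of a gradual transition zone can be sketched with a two-component mixture on the log2 scale. All parameters below (weights, means, and standard deviations of the two components) are illustrative assumptions, not values estimated from the data in Fig1, and modelling each component as a normal distribution is itself a simplification.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

# Hypothetical mixture on the log2 scale (illustrative parameters only).
NOISE = {"weight": 0.4, "mean": 5.0, "sd": 1.0}
SIGNAL = {"weight": 0.6, "mean": 8.5, "sd": 2.0}

def prob_signal(log2_value):
    """Posterior probability that a value belongs to the signal component."""
    noise = NOISE["weight"] * normal_pdf(log2_value, NOISE["mean"], NOISE["sd"])
    signal = SIGNAL["weight"] * normal_pdf(log2_value, SIGNAL["mean"], SIGNAL["sd"])
    return signal / (noise + signal)

for x in range(4, 11):
    print(f"log2 signal {x:>2}: P(signal) = {prob_signal(x):.2f}")
```

With these assumed parameters, the probability of belonging to the signal component rises smoothly over several log2 units rather than jumping at one cutoff, and it stays above zero in the low region, which corresponds to weak but real signals mixed in with noise.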

Does a high correlation coefficient mean good data?

Next, let us look at an example in which the same raw data are summarized using RMA. In RMA-processed data, the large variation in the lower-left region of the scatter plot becomes less visible. As a result, the data may look more reproducible and higher in quality than the MAS5 data.


Fig2: The same raw data as in Fig1, summarized using RMA. In the scatter plot, variation in the low-signal region becomes less noticeable.

Indeed, if we look only at the scatter plot, Fig2 appears cleaner than Fig1. The correlation coefficient is also higher in Fig2 than in Fig1. However, a high correlation coefficient does not necessarily mean that the entire dataset is reliable.

Correlation-based evaluation of reproducibility is most meaningful for genes that lie within a reliable signal region. If genes that are not expressed, or genes near the detection limit, are included in the calculation, the correlation coefficient can be strongly affected by how the data are displayed and by the summarization algorithm used.
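As a rough sketch of this point, the simulation below compares the correlation coefficient calculated over all genes with the coefficient calculated separately for unexpressed and clearly expressed genes. The error model, the proportion of unexpressed genes, and the cutoff of 200 are assumptions made for illustration. Genes are grouped by their true level, which is known only in a simulation; with real data the grouping would have to be estimated.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def log2_measure(true_level, rng):
    """Hypothetical measurement: 5% proportional error plus background error (SD 20)."""
    value = true_level * (1.0 + rng.gauss(0.0, 0.05)) + rng.gauss(0.0, 20.0)
    return math.log2(max(value, 1.0))

rng = random.Random(1)
# Hypothetical sample: half of the genes are not expressed, the rest span 50 to 20000.
true_levels = [0.0] * 5000 + [50.0 * 400.0 ** rng.random() for _ in range(5000)]
pairs = [(log2_measure(t, rng), log2_measure(t, rng)) for t in true_levels]

def region_r(keep):
    """Correlation between replicates for genes whose true level satisfies keep()."""
    selected = [(a, b) for (a, b), t in zip(pairs, true_levels) if keep(t)]
    return pearson([a for a, _ in selected], [b for _, b in selected])

r_all = region_r(lambda t: True)
r_unexpressed = region_r(lambda t: t == 0.0)
r_expressed = region_r(lambda t: t >= 200.0)

print(f"all genes:          r = {r_all:.3f}")
print(f"unexpressed genes:  r = {r_unexpressed:.3f}")
print(f"expressed (>= 200): r = {r_expressed:.3f}")
```

In this sketch the single overall coefficient blends two very different situations: replicate values of unexpressed genes are essentially uncorrelated, while clearly expressed genes agree closely.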

In Fig2, variation in the low-signal region is less visible. However, this does not mean that the noise has disappeared. In the histogram, a high peak appears in the low-value region. This can be interpreted as the result of values from the noise region being compressed into a narrow range.
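The effect of compression can be illustrated with a deliberately simple stand-in. The sketch below does not implement RMA. It applies a started-log transform, log2(x + offset), with a hypothetical offset of 100 to simulated raw values and compares it with a plain log2. The error model and all parameters are assumptions.

```python
import math
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2)
N_NOISE = 5000  # the first N_NOISE genes are unexpressed (true level 0)
true_levels = [0.0] * N_NOISE + [50.0 * 400.0 ** rng.random() for _ in range(5000)]

def raw(true_level):
    """Hypothetical raw value: 5% proportional error plus background error (SD 20)."""
    return true_level * (1.0 + rng.gauss(0.0, 0.05)) + rng.gauss(0.0, 20.0)

raw_pairs = [(raw(t), raw(t)) for t in true_levels]

def plain_log(x):
    return math.log2(max(x, 1.0))

def started_log(x, offset=100.0):
    # Stand-in for any processing that compresses low values; this is NOT RMA.
    return math.log2(max(x + offset, 1.0))

def summarize(transform):
    """Return (overall r, r among unexpressed genes, spread of unexpressed genes)."""
    a = [transform(x) for x, _ in raw_pairs]
    b = [transform(y) for _, y in raw_pairs]
    r_all = pearson(a, b)
    r_noise = pearson(a[:N_NOISE], b[:N_NOISE])
    spread_noise = statistics.pstdev(a[:N_NOISE])
    return r_all, r_noise, spread_noise

for name, transform in (("plain log2", plain_log), ("started log2", started_log)):
    r_all, r_noise, spread = summarize(transform)
    print(f"{name:>12}: overall r = {r_all:.3f}, "
          f"r within noise = {r_noise:.3f}, noise spread = {spread:.2f}")
```

In this sketch the compressing transform narrows the spread of the unexpressed genes and raises the overall correlation, yet the replicate values of those genes remain uncorrelated with each other. The noise is still there; it has only been packed into a narrower range.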

Therefore, even if variation is not visible in a scatter plot, it does not mean that all measured values can be treated as reliable signals. Rather, when noise becomes less visible, it becomes more difficult to decide which values should be included in the analysis.

There is value in seeing noise

Noise is not simply garbage to be removed. From the perspective of data interpretation, there is great value in noise being visible as noise.

This is because visible noise makes it easier to determine the boundary between signal and noise. Which regions of the measured values can be interpreted with confidence? Which regions should be treated with caution? Which genes should be excluded from the analysis? These decisions become easier when noise can be seen.

In contrast, when noise is difficult to see, the data may appear highly reproducible, but it becomes difficult to identify the range that can be interpreted reliably. As a result, low-confidence measurements may be included in the analysis.

In gene expression data analysis, the goal is not to make noise disappear completely. What matters is to understand the relationship between noise and signal in a visible form.

Dynamic range is not simply the width of numerical values

The term “dynamic range” is often used when discussing the performance of gene expression data. However, if dynamic range is understood simply as the difference between the maximum and minimum values, the practical performance of the data can be misunderstood.

Microarray signal values depend on the measurement system, probe design, and summarization algorithm. Therefore, even if one platform produces values spanning four orders of magnitude and another produces values spanning five orders of magnitude, that alone does not mean that the latter is superior.

What matters is not the apparent width of the numerical values. What matters is how many genes are included in the signal region that can be interpreted with confidence.

In this sense, dynamic range should be understood not as “the overall width of the measured value distribution,” but as “the width of the signal region that can be used reliably for analysis.”
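The distinction can be made concrete with a small worked example. The two value lists and the noise ceilings below are invented for illustration and do not describe any real platform. `apparent_range` and `usable_range` are hypothetical helper names.

```python
import math

def apparent_range(values):
    """Orders of magnitude between the smallest and largest measured value."""
    return math.log10(max(values) / min(values))

def usable_range(values, noise_ceiling):
    """Orders of magnitude above the noise ceiling, and the number of genes there."""
    reliable = [v for v in values if v > noise_ceiling]
    if not reliable:
        return 0.0, 0
    return math.log10(max(reliable) / noise_ceiling), len(reliable)

# Hypothetical platforms: A has the wider numerical span but a high noise ceiling.
platform_a = [1, 3, 10, 40, 200, 900, 5000, 30000, 100000]
platform_b = [5, 12, 30, 80, 300, 1200, 6000, 20000, 50000]

for name, values, ceiling in (("A", platform_a, 500), ("B", platform_b, 40)):
    width, n_genes = usable_range(values, ceiling)
    print(f"platform {name}: apparent range = {apparent_range(values):.1f} orders, "
          f"usable range = {width:.1f} orders, genes in signal region = {n_genes}")
```

In this invented example, platform A spans five orders of magnitude but only about 2.3 of them lie above its noise ceiling, with four genes in the signal region. Platform B spans four orders but offers about 3.1 usable orders and six genes in the signal region.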

What is good gene expression data?

So, what kind of data should be considered good gene expression data?

Good gene expression data are not simply data with clean-looking scatter plots. They are not simply data with high correlation coefficients. Nor are they data in which noise is invisible.

Good gene expression data are data in which the relationship between the signal region and the noise region is easy to judge, and reliable measurements can be selected as the basis for analysis.

For this reason, it is important to examine not only scatter plots but also histograms. Instead of relying only on an overall correlation coefficient, low-signal and high-signal regions should be considered separately. Rather than treating all measured values with the same weight, we need to determine which range can be interpreted with confidence.

This way of thinking applies not only to microarrays but also to RNA-Seq. In RNA-Seq, Gene Counts reveal large variation in the low-count region. This is because genes with fewer reads have greater relative measurement uncertainty, and the range of expression levels that can be interpreted reliably depends on sequencing depth.
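The dependence on sequencing depth can be sketched under the assumption that read sampling alone behaves like a Poisson process, in which case the coefficient of variation of a count with mean m is 1/sqrt(m). This covers only sampling noise, so it is a lower bound on real variation. The gene fraction and depths below are hypothetical.

```python
import math

def expected_count(fraction_of_reads, depth):
    """Expected read count for a gene that receives a fixed fraction of all reads."""
    return fraction_of_reads * depth

def poisson_cv(mean_count):
    """Coefficient of variation (SD / mean) of a Poisson count with the given mean."""
    return 1.0 / math.sqrt(mean_count)

fraction = 2e-7  # hypothetical gene receiving 0.2 reads per million
for depth in (5_000_000, 20_000_000, 80_000_000):
    mean = expected_count(fraction, depth)
    print(f"depth {depth:>11,}: expected count = {mean:5.1f}, "
          f"sampling CV = {poisson_cv(mean):.2f}")
```

For this hypothetical gene, the expected count grows from about 1 to 4 to 16 reads as depth increases, and the sampling CV falls from about 1.0 to 0.5 to 0.25. The same gene can therefore sit inside the noise region at one depth and approach the interpretable range at another.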

In contrast, when data are converted to FPKM or TPM, the values are displayed after correction for gene length and library size. As a result, the instability of the low-count region that is visible in Gene Counts becomes harder to recognize. Even if the values look well organized, genes with very low original counts may show large expression differences from only a small difference in the number of reads.
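A short worked example shows how this happens. The sketch uses the standard definitions (FPKM is the count scaled per kilobase of gene length and per million mapped fragments, and TPM rescales the same length-normalized values so that they sum to one million within a sample). The gene lengths, counts, and library size are invented.

```python
def fpkm(count, length_bp, total_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return count * 1e9 / (length_bp * total_fragments)

def tpm_from_fpkm(fpkm_values):
    """Rescale FPKM values of one sample so that they sum to one million."""
    total = sum(fpkm_values)
    return [value / total * 1e6 for value in fpkm_values]

LIBRARY_SIZE = 20_000_000  # hypothetical, same for both samples

# Hypothetical genes: (name, length in bp, count in sample 1, count in sample 2)
genes = [
    ("low_gene", 1000, 1, 3),
    ("high_gene", 2000, 1000, 1100),
]

for name, length, c1, c2 in genes:
    f1 = fpkm(c1, length, LIBRARY_SIZE)
    f2 = fpkm(c2, length, LIBRARY_SIZE)
    print(f"{name}: counts {c1} vs {c2}, FPKM {f1:.2f} vs {f2:.2f}, "
          f"fold change = {f2 / f1:.1f}")
```

Here a difference of only two reads turns into FPKM values of 0.05 and 0.15, an apparent threefold change, while a difference of 100 reads in the well-covered gene gives only a 1.1-fold change. Once the counts are no longer shown, nothing in the FPKM values themselves signals that the first comparison rests on almost no data.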

In other words, FPKM and TPM are values converted into units that make expression levels easier to compare, but at the same time, they make noise in the low-count region less visible. Therefore, if we look only at FPKM or TPM, it becomes difficult to decide which range can be treated as reliable signal.


Fig3: Comparison of the same RNA-Seq data displayed as Gene Counts, FPKM, and TPM. Variation in the low-count region is easier to see in Gene Counts, whereas this instability becomes less visible in FPKM and TPM.

Therefore, in RNA-Seq analysis as well, we should not judge data quality only from the appearance of normalized values. It is important to examine the distribution of Gene Counts, the variation in the low-count region, and how the data look after normalization. Only after confirming the relationship between signal and noise in a visible form should we decide which genes to include in the analysis.

Summary

When evaluating the quality of gene expression data, it is risky to rely only on correlation coefficients or a clean visual appearance.

A high correlation coefficient is a meaningful indicator of reproducibility only when it is calculated within the signal region. When the noise region is included, the coefficient can be affected by how the data are displayed and by the algorithm used to summarize or transform the data.

Data in which noise is visible may look low in quality at first. However, visible noise makes it easier to judge the boundary between signal and noise. In this sense, there is value in noise being visible as noise.

Good gene expression data are not data in which noise appears to be absent. They are data in which the boundary between noise and signal is easy to judge, and a reliable signal region can be selected for analysis.

From this perspective, differences among microarray platforms can also be understood not simply as differences in the width of measured values, but as differences in “which range can be interpreted reliably for analysis.” For details, see Dynamic Range of Microarrays.
