Is The Noise Junk?

Let's say you are comparing technical replicate samples on a scatter plot. If you see charts like Fig4 and Fig5, which do you think is of better quality? Fig4 scatters more widely in part of its range. The correlation coefficient of Fig4 is 0.919, while that of Fig5 is 0.988. I guess many would think Fig5 is better. But the two are generated from the same raw data and quantified by different algorithms: MAS5 (Fig4) and RMA (Fig5). For more about the characteristics of the data, please see "MAS5 vs. RMA". We focus on the reproducibility of microarray data here.

A high correlation coefficient indicates high reproducibility. But this is not always true when you talk about microarray data. Why?

Biologically, you cannot assume that all genes are expressed in a sample. And every measurement system has its own dynamic range. You can rely on signals within the dynamic range (signal range), but not on those outside it (noise range). Microarray data must contain both signals and noise. The noise is a mixture of genes that are not expressed and genes whose expression is too low for the system to detect. You cannot tell these two apart within the noise. So it is natural and reasonable that there is an area where spots scatter widely in the low signal range.

There is such a noise range in Fig4, but not in Fig5. From the viewpoint of experimental biology, the noise is actually there in Fig5 as well; it is just hidden.

Hiding noise is not welcome for analysts because it makes their work difficult. If the noise range is obvious, it is easy to set a boundary between the noise and signal ranges and extract genes with reliable signals (Fig4). But it is not easy to do so in Fig5. If you look closely at the histograms, there are peaks at their left ends. And if you select the genes in the signal range in Fig4, their distributions in the histograms of Fig5 look similar. So you can assume the left peaks are the noise range in Fig5, though it is hidden in the scatter plot.

A high correlation coefficient indicates high reproducibility. This is true for the signal range, but not for the noise range. So a correlation coefficient calculated with all genes is not what matters. This confusion resulted in algorithms that hide noise, or that treat data as if all values were in the signal range. By the way, the dynamic range is the signal range excluding the noise range, so this confusion resulted in algorithms that extend the noise range. The noise is valuable if it looks like noise.
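As a rough illustration of this point, the following Python sketch computes the correlation coefficient twice: once with all genes, and once with only the genes in the signal range. The replicate values are made-up toy numbers, and the threshold of 100 is an assumption borrowed from the MAS5 example; in practice you would choose the boundary by looking at your own scatter plot.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_by_range(rep1, rep2, threshold):
    """Return (r_all, r_signal), both computed on log2-transformed signals.

    A gene is treated as "signal range" only if both replicates reach the
    threshold. Signals are assumed to be positive, as MAS5 values are.
    """
    all_pairs = [(math.log2(a), math.log2(b)) for a, b in zip(rep1, rep2)]
    signal_pairs = [(math.log2(a), math.log2(b)) for a, b in zip(rep1, rep2)
                    if a >= threshold and b >= threshold]
    r_all = pearson(*zip(*all_pairs))
    r_signal = pearson(*zip(*signal_pairs))
    return r_all, r_signal

# Toy technical replicates (hypothetical values, not taken from any dataset).
# The first five genes sit in the noise range, the last five in the signal range.
replicate_1 = [5, 20, 60, 8, 40, 150, 400, 1200, 3000, 9000]
replicate_2 = [30, 6, 15, 70, 90, 160, 380, 1300, 2800, 9500]

r_all, r_signal = correlation_by_range(replicate_1, replicate_2, threshold=100)
print(f"all genes: r = {r_all:.3f}, signal range only: r = {r_signal:.3f}")
```

The number to look at is the second one. The first is pulled down by the noise range, which says nothing about how reproducible the reliable signals are.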

Now let's look at the dynamic ranges of microarray platforms.

Paper fig04 thumb

GSE11670 is another example of HG-U133 Plus2.0 Array data with technical replicates of good quality. Dots scatter widely at the lower left, indicating that signals below 100 are not reliable. Though the MAS5 histograms look unimodal, they are intrinsically bimodal. A single peak forms between signals and noise because the two components are close together and blend gradually. Signals between 10 and 100 are less reliable, but not complete nonsense. You can download the SSA file to import to your Subio Platform.

Paper fig05 thumb

GSE11670 data by RMA. There is no widely scattered area at the lower left, but that doesn't mean there is no noise and all data are signals. Intrinsic noise must exist in a biological context. If you look at the histograms (rows 3 and 4), there is a high peak on the left side. RMA only compresses low signals into a small range. Signals between 10 and 100 are consequently much less reliable compared to MAS5 data.

Dynamic Ranges of Microarray Platforms

You might have heard something like "4-5 orders" about the dynamic range of microarray data. But comparing such numbers across platforms is like saying 100km is longer than 80mi just because 100 is greater than 80. Microarray signals depend on the hardware and software, so you cannot compare numbers from different platforms directly. You need a common unit. So why not ask "How many genes are reliably measured by the platform?" or "How many genes are in the signal range?"
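To make the "common unit" concrete, here is a minimal Python sketch that counts the probes and the distinct genes in the signal range. The probe IDs, gene names, signal values, and the threshold are all hypothetical; the rule that every replicate must reach the threshold is one possible choice, not a fixed definition.

```python
def count_signal_range(signals_by_probe, probe_to_gene, threshold):
    """Count probes and distinct genes whose signal is reliable.

    A probe is counted only if every replicate reaches the threshold
    (an assumed rule for this sketch).
    """
    reliable_probes = [probe for probe, values in signals_by_probe.items()
                       if min(values) >= threshold]
    reliable_genes = {probe_to_gene[probe] for probe in reliable_probes}
    return len(reliable_probes), len(reliable_genes)

# Hypothetical toy data: replicate signals per probe, and a probe-to-gene map.
toy_signals = {
    "probe_1": [250, 270],
    "probe_2": [900, 880],
    "probe_3": [40, 120],   # one replicate falls below the threshold
    "probe_4": [15, 8],
    "probe_5": [300, 310],
}
toy_annotation = {
    "probe_1": "GENE_A",
    "probe_2": "GENE_A",    # two probes designed for the same gene
    "probe_3": "GENE_B",
    "probe_4": "GENE_C",
    "probe_5": "GENE_D",
}

n_probes, n_genes = count_signal_range(toy_signals, toy_annotation, threshold=100)
print(n_probes, "probes and", n_genes, "genes are in the signal range")
```

Because several probes can target one gene, the gene count is the number that can be compared between platforms.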

Affymetrix 3' IVT GeneChip

This is the typical microarray because it was used massively from 2002 to 2004. Fig4 and Fig5 show a dataset measured with the Human Genome U133 Plus 2.0 Array. About 25,000 probe sets are in the signal range. Multiple probe sets are designed for some genes, so about 12,000 genes are measured reliably with this technology.

The signal and noise ranges are not clearly separated in actual data. The boundary is not a line but a zone or spectrum. If the overlapping zone, or boundary zone, is large enough, it forms a peak in the histogram. You see peaks just to the left of the signal area (black) in Fig4, and the left skirts of those peaks are the noise area. You may be able to use data in the boundary zone if you accept that they are less trustworthy. You cannot separate the boundary zone from the noise range in RMA data (Fig5).

You can download the SSA file of GSE11670 to examine on your Subio Platform.

Illumina BeadChip

This product was released after GeneChip, and the vendor claimed very high reproducibility. If you look at a scatter plot of HumanHT-12 V4.0 Expression BeadChip data (Fig6), the variance looks ultra-low. But by now you know this is only an effect of the RMA algorithm. About 15,000 probe sets are in the signal range, and about 11,000 genes are reliably measured. Notice that the dynamic range is narrower than that of GeneChip. Interpretation becomes difficult due to RMA, and there is no choice of quantification method.

You can download the SSA file of GSE25315 to examine on your Subio Platform.

Affymetrix Gene ST Array

Affymetrix changed the fundamental design of the 3' IVT GeneChip and released this array later. MAS5 is no longer available, and you can select only RMA or similar methods, which hide noise. It is very difficult to distinguish the signal and noise ranges (Fig7). In the area of signals >6 (log2), plots scatter more widely as signals go lower. This is the natural form. In contrast, in the area of signals <6, plots converge as signals go lower. This is an artificial form made by the software. Though it is hard to draw the line, about 12,000 – 15,000 probe sets, or about 9,000 – 11,000 genes, are in the signal range.

If you want to know the noise range more clearly, it is a good idea to look at the plots of genes known not to be expressed in the cell or organ. You will see how broadly such plots are distributed. The boundary zone is so wide that the data are very difficult to interpret.
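One way to check the natural versus artificial shape described above is to measure the spread between replicates in bins of signal intensity. The Python sketch below is only an illustration: the log2 values are made up, and the bin width and the use of the mean absolute difference are assumptions, not part of any vendor software.

```python
import math
from collections import defaultdict

def spread_by_intensity(log2_rep1, log2_rep2, bin_width=2.0):
    """Mean absolute log2 difference between replicates, per intensity bin.

    Each gene is assigned to a bin by the average of its two log2 signals.
    Returns a dict mapping the lower edge of each bin to the mean spread.
    """
    bins = defaultdict(list)
    for a, b in zip(log2_rep1, log2_rep2):
        bin_start = math.floor(((a + b) / 2) / bin_width) * bin_width
        bins[bin_start].append(abs(a - b))
    return {start: sum(diffs) / len(diffs) for start, diffs in sorted(bins.items())}

# Hypothetical log2 signals of two technical replicates (toy values).
toy_rep1 = [3.0, 3.1, 4.9, 5.2, 7.0, 7.5, 10.0, 11.0]
toy_rep2 = [3.05, 3.0, 5.3, 4.7, 7.6, 6.9, 10.1, 10.9]

spread = spread_by_intensity(toy_rep1, toy_rep2)
for bin_start, mean_diff in spread.items():
    print(f"log2 signal {bin_start:>4.1f} to {bin_start + 2.0:>4.1f}: mean |diff| = {mean_diff:.3f}")
```

In data with visible noise, the spread keeps growing toward the lowest bins. If it shrinks again at the low end, as in this toy example, low signals have probably been compressed by the quantification algorithm.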

The newer products after the GeneChip Plus 2.0 have narrower dynamic ranges, and the latest Clariom S Array has the narrowest. Its fame as a technology leader has completely faded.

You can download the SSA file of GSE22288 to examine on your Subio Platform.

Agilent Whole Genome 4x44k Microarray

Paper fig07 thumb

GSE22288 was obtained with the Gene ST Array, which came after the 3' IVT Array (HG-U133 Plus 2.0). 3' IVT probes are designed in 3' UTRs because those sequences are more unique than exons. Gene ST probes target exons. It seems this strategy makes the data more ambiguous and less sensitive. New technology is not always better than old. It is not rare in history that a mature technology is superior. There are tons of 3' IVT GeneChip data available on the web. This is another advantage you should consider. You can download the SSA file to import to your Subio Platform.

Although their earlier microarrays were poor, Agilent made a great improvement with the 4x44k series in 2004, and they have kept an unparalleled dynamic range since then. In Fig8, the number of probes in the signal range reaches about 34,000, which means you can measure more than 16,000 genes with this technology. It enables you to measure the expression levels of less-transcribed genes, which has more impact from a biological point of view.

The noise is a mixture of signals of not-expressing and too-low-to-detect genes. If you want to know which genes are not expressed, you must increase sensitivity to reduce the latter. So this technology gives you not only signals of 4,000 to 5,000 more genes, but also a purer list of not-expressing genes. So this is the only technology suited to studies of the ON/OFF behavior of genes.

You can download the SSA file of GSE36082 to examine on your Subio Platform.

Paper fig08 thumb

GSE36082 was obtained with the Agilent Whole Genome 4x44k microarray, which was a great achievement of microarray technology, and their SurePrint microarrays maintain this top quality. There is a steep peak at the left end of the histogram, formed by signals as low as those of negative control probes. What's good is that the edge is brilliantly sharp. It seems you may have a chance to measure slightly expressed genes, or to distinguish genes that are not expressed from those that are lowly expressed. You can download the SSA file to import to your Subio Platform.

The Dynamic Range of RNA-Seq

The noise of microarray data is a mixture of signals of not-expressing and too-low-to-detect genes. RNA-Seq could avoid the former theoretically, but cannot be free from the latter. So, what is the reality?

Count Data

Fig9 shows RNA-Seq count data from a HiSeq 2000. Plots scatter widely in the area of counts <20, which indicates a noise range of too-low-to-detect genes. The signal range has about 12,000 genes. Notice that the dynamic range is almost the same as that of the Affymetrix 3' IVT GeneChip.

But if you look at the data carefully, you will notice a characteristic of RNA-Seq data that makes analysis difficult. There are genes that have count values in some replicate samples and 0 in the others. It is said that, just by chance, a gene may or may not be amplified by PCR, captured by the sequencer, or mapped to the reference. Thus, you cannot conclude that a gene is not expressed even if the count is 0.

So the noise of RNA-Seq data is also a mixture of not-expressing and too-low-to-detect genes, and you cannot distinguish them. It might be possible to overcome this problem theoretically, but it is impossible practically due to the cost. We foresee that RNA-Seq will never be comparable to Agilent microarrays in dynamic range at the same cost.

Additionally, you must carefully consider how to handle 0s in RNA-Seq data. Whichever way you choose, it brings some confusion to the interpretation.
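Before deciding how to handle 0s, it helps to see how many genes are affected. The Python sketch below simply sorts genes by where the zeros occur across replicates. The gene names and counts are hypothetical, and the three category labels are our own naming for this illustration.

```python
def classify_zero_pattern(counts):
    """Classify one gene's replicate counts by where zeros occur."""
    n_zero = sum(1 for c in counts if c == 0)
    if n_zero == 0:
        return "no_zero"
    if n_zero == len(counts):
        return "all_zero"
    return "mixed"  # a count in some replicates, 0 in the others

# Hypothetical counts for three technical replicates per gene.
toy_counts = {
    "gene_a": [120, 131, 118],
    "gene_b": [0, 3, 0],
    "gene_c": [0, 0, 0],
    "gene_d": [7, 0, 5],
}

patterns = {gene: classify_zero_pattern(counts) for gene, counts in toy_counts.items()}
mixed_genes = sorted(gene for gene, pattern in patterns.items() if pattern == "mixed")
print("genes with both zero and non-zero counts:", mixed_genes)
```

The "mixed" genes are the ones where a 0 clearly does not mean "not expressed". Even an "all_zero" gene may only be too low to detect, so this classification describes the data; it does not resolve the ambiguity.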

You can download the SSA file of GSE49110 to examine on your Subio Platform.


Paper fig09 thumb

GSE49110 was obtained with the Illumina HiSeq 2000. Theoretically, RNA-Seq data do not contain noise from not-expressing genes. But you see a noise area at the lower left, and only about 12,000 genes fall into the signal area. This means it cannot measure genes at middle to low expression levels. This is HiSeq 2000 data. You can imagine how poor the sensitivity of the Genome Analyzer (GA) or GA II was. You can download the SSA file to import to your Subio Platform.

Fig10 also shows RNA-Seq data from a HiSeq 2000, but the values are FPKM. Notice that the noise area is not as clear as in the count data (Fig9). FPKM stands for fragments per 1kb of exon model per million mapped reads. "Per million mapped reads" normalizes for the number of mapped reads in each sample. Larger genes have more chances of having mapped reads, so "per 1kb exon model" normalizes for the size of genes within each sample.

You may think these two normalizations are reasonable. But the effect of the "per 1kb exon model" normalization is to move plots of large genes toward the lower left and plots of small genes toward the upper right. This is why the noise area in the FPKM scatter plot is not clear. You must know that FPKM data also contain noise, even though it is not obvious. The average exon size is larger than 1kb, so many plots move toward the lower left, and the distribution range expands by about 2 orders of magnitude. This is only a side effect of gene-size normalization, and it is nonsense to conclude "the dynamic range of RNA-Seq is wider than microarray," or "RNA-Seq is more sensitive in the low signal range."

If you want to know which gene is more highly expressed within one sample, you need to apply gene-size normalization. But if you want to find differentially expressed genes between samples, gene-size normalization is not necessary. It is in count data that you can easily separate the signal and noise ranges. We recommend you use count data for the quality control steps, even if you use FPKM for clustering or other analyses after filtering. And you need to be careful about saying things like ">10 FPKM is reliable," because FPKMs for very small genes can be very high even if the counts are in the noise range.
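The effect of gene size can be checked with a few lines of arithmetic. The Python sketch below computes FPKM from a fragment count, the exon model length, and the total number of mapped fragments. The two genes, their lengths and counts, and the library size of 10 million are hypothetical values chosen only to show the effect.

```python
def fpkm(fragment_count, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon model per million mapped fragments."""
    length_kb = exon_length_bp / 1_000
    mapped_millions = total_mapped_fragments / 1_000_000
    return fragment_count / (length_kb * mapped_millions)

TOTAL_MAPPED = 10_000_000  # hypothetical library size

# A very small gene whose count (15) lies in the noise range of the count data.
small_gene_fpkm = fpkm(15, exon_length_bp=100, total_mapped_fragments=TOTAL_MAPPED)

# A large gene whose count (400) lies well inside the signal range.
large_gene_fpkm = fpkm(400, exon_length_bp=5_000, total_mapped_fragments=TOTAL_MAPPED)

print(f"small gene: count 15,  FPKM {small_gene_fpkm:.1f}")
print(f"large gene: count 400, FPKM {large_gene_fpkm:.1f}")
```

Under these assumptions the small gene gets the higher FPKM although its count is unreliable, which is why a fixed FPKM cutoff cannot replace filtering on counts.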

Though the signal range is not as clear as in count data, the number of reliably measurable genes is about 12,000 anyway.

You can download the SSA file of GSE53567 to examine on your Subio Platform.

Paper fig10 thumb

GSE53567 was also obtained with the Illumina HiSeq 2000, but the values are FPKM, not the counts shown in Fig9. FPKM (or RPKM) values are normalized by gene size, which means larger genes move toward the lower left and smaller genes toward the upper right. Consequently, the signal and noise ranges are less obvious than in count data (Fig9). But it does not mean there is less noise, of course. You can download the SSA file to import to your Subio Platform.

What Is High-Quality Data?

We showed several microarray and RNA-Seq datasets to examine their dynamic ranges. If you are interested in the 10,000 – 12,000 most highly expressed genes, you can choose any of the platforms, even though they differ in the difficulty of interpretation. But if you want to see the 4,000 – 5,000 genes below the highly expressed ones, there is no choice other than Agilent microarrays.

The innovative technology by Agilent has been available since late 2004, but the paradigm shifted to NGS, and this caused transcriptomics to stand still for more than a decade. It is a huge loss, which is due to the high cost of open discussion on omics data. To avoid such waste, we suggest you examine data quality from the following four points.

  • The noise is obviously visible.
  • Many genes are in the signal range, and reliably measurable.
  • The boundary zone between the signal and noise ranges is small.
  • The sensitivity of differential expression is high.

We have not discussed the last point. If you use an algorithm that lowers variability, it sacrifices sensitivity to differential expression. It is a tradeoff. If you expect the difference to be small, you need to plan for very-high-quality raw data and choose algorithms with minimal artificial effects.

Lastly, from the experimental point of view, we recommend you take into account the tradeoff between accuracy and precision (Fig12). You had better be careful about data that look too good. Biased data take your eyes away from biological discovery and easily lead to wrong conclusions. Please read more about the theory of experimental errors.

Paper fig12 thumb

Generally, you cannot pursue both accuracy and precision, because it is a tradeoff. If the accuracy is high, you can fairly estimate the true value by averaging replicated measurements, even if the precision is low. But data showing ultra-high precision are likely to be biased. If so, the source of bias is mainly the quantification and normalization algorithms. You had better bear that in mind when you look at omics data.
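The difference between the two cases can be seen with a small numeric example. In the Python sketch below, the true value and both sets of replicate measurements are hypothetical numbers made up for illustration: one set is accurate but noisy, the other is precise but biased.

```python
import statistics

TRUE_VALUE = 100.0  # hypothetical true expression level

# Hypothetical replicate measurements of the same sample.
accurate_but_noisy = [88.0, 112.0, 95.0, 105.0]
precise_but_biased = [114.8, 115.1, 115.0, 115.1]

def summarize(measurements, true_value):
    """Return the replicate mean, its bias from the true value, and the spread."""
    mean = statistics.mean(measurements)
    return {
        "mean": mean,
        "bias": mean - true_value,                  # accuracy: closeness to the truth
        "spread": statistics.stdev(measurements),   # precision: replicate agreement
    }

noisy_summary = summarize(accurate_but_noisy, TRUE_VALUE)
biased_summary = summarize(precise_but_biased, TRUE_VALUE)

for label, summary in [("accurate, low precision", noisy_summary),
                       ("biased, high precision", biased_summary)]:
    print(f"{label}: mean {summary['mean']:.1f}, "
          f"bias {summary['bias']:+.1f}, spread {summary['spread']:.2f}")
```

Averaging recovers the true value in the first case despite the large spread. In the second case no amount of averaging removes the bias, and in real data you do not know the true value, so the tight replicates alone cannot tell you the bias is there.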

The Solution We Offer, Subio Platform

Nobody knows how to handle omics data. Nobody knows how to extract meaning from omics data. We think it is not a statistician or a bioinformatician but an experimental biologist who will bring a critical idea someday. So it is essential to invite them to the table of discussion, where everybody is equally an amateur.

Please read why we created the Subio Platform software, and what we offer to you who take on this unresolved problem.
