How to interpret microarray data


The Biases in Omics Data

Let's say there are 5 conditions you want to compare, and 10 replicates in each condition. So there are 50 samples in total.

Condition	Number of Samples
before_ R	10
after_ R	10
before_ FC	10
after_ FC	10
after_ RFC	10

GSE15490 was obtained with the Affymetrix HG-U133 Plus2.0 Array and is composed of 50 samples (5 conditions x 10 replicates). The data were quantified with RMA. You can download the SSA file to import into your Subio Platform.

Fig1. An example of Affy HG-U133 Plus2.0 data.

Fig1 shows the result of hierarchical clustering on the dataset. The samples are divided into 2 clusters, replicates 1–5 and replicates 6–10, rather than into the 5 conditions (the yellow vertical line in Fig1). Do you think there really are 2 types of expression profiles across the 5 conditions? From the viewpoint of an experimental biologist, you may instinctively sense that some experimental factor generated this bias.

If that instinct is right, a non-linear bias is involved in the data. It is then not appropriate to compare group averages or to apply ANOVA to detect differentially expressed genes. Maybe you could divide the data into two datasets, analyze them separately, and then draw a conclusion from the two results. Or maybe another idea is needed.

MAS5 and RMA

I re-analyzed the raw data of GSE15490 (Fig1) with another quantification algorithm called MAS5. It generates distribution patterns very different from those in Fig1. You can download the SSA file to import into your Subio Platform.

Fig2. Reanalysis of Fig1 Data with MAS5.

Although the histograms in Fig1 show signal distribution patterns, those signals were generated with RMA, an algorithm for quantifying the raw data (CEL) files that includes a strong non-linear normalization method. So you can re-quantify the raw data files with the MAS5 algorithm to see the data more directly (Fig2).

The data in Fig1 and Fig2 are both generated from the same raw data files, so the quantification algorithm has a significant effect on the result. If you look closely, you may notice that rep 1–5 and rep 6–10 differ in the range between the 25th and 75th percentiles. MAS5 signals (Fig2), which reflect the raw data more directly, have wider dynamic ranges in rep 6–10 (>100) and narrower ones in rep 1–5 (>200). The dynamic range is not the range of all signals, but the range of reliably measured signals. In other words, genes with signals around 100 are reliably measured in rep 6–10, but not in rep 1–5. We do not know what caused this difference between rep 1–5 and rep 6–10. It might come from the staining or washing processes. In any case, there are differences in the dynamic ranges of the signals, and these could produce the two meaningless clusters in Fig1.
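If you want to check the 25th and 75th percentiles numerically rather than by eye, a minimal Python sketch could look like the following. The sample names and signal values are hypothetical toy numbers, not values from GSE15490, and the function names are only illustrative.

```python
import statistics

def quartile_summary(signals):
    """Return (25th percentile, median, 75th percentile) of one sample's signals."""
    q1, q2, q3 = statistics.quantiles(signals, n=4, method="inclusive")
    return q1, q2, q3

def summarize_samples(samples):
    """samples: dict mapping a sample name to its list of signal values."""
    return {name: quartile_summary(values) for name, values in samples.items()}

# Hypothetical toy signals (assumption: not real measurements).
toy = {
    "rep1": [50, 120, 200, 400, 800, 1600, 3200],
    "rep6": [20, 60, 100, 300, 700, 1500, 3100],
}

summary = summarize_samples(toy)
for name, (q1, q2, q3) in summary.items():
    print(f"{name}: 25th={q1:.1f} median={q2:.1f} 75th={q3:.1f}")
```

Listing these numbers for all samples side by side makes it easier to see whether groups of replicates share the same percentile range.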

RMA and other strong normalization algorithms were invented to cancel non-linear biases like this one. They do look successful if you look only at histograms or other charts that visualize signal distribution patterns. But it is very important to know that no normalization technique can truly cancel the bias.

(A) The distribution patterns of RMA and MAS5 signals. Black indicates signals of abundant genes, which are reliably measured in the dynamic range.
(B) The result of hierarchical clustering of the abundant genes (Black in A).

You can download the SSA file to import to your Subio Platform.

Fig3. Comparing RMA and MAS5 data of abundant genes.

By the way, although the RMA (Fig1) and MAS5 (Fig2) signal distributions look very different, they look very similar if you look only at signals within the dynamic range (Fig3 A, black). If you take only those data, the analysis results also look very similar (Fig3 B). The two quantification algorithms differ only in how they treat the noise range, that is, the unreliable signals (Fig3 A, blue).
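Restricting the analysis to abundant genes can be sketched in a few lines of Python. This is only an illustration under assumptions: the gene names, signal values, and the threshold of 200 are hypothetical, and how Subio Platform selects the abundant genes is not described here.

```python
def abundant_genes(signals_by_gene, threshold):
    """Keep genes whose signal is at or above the threshold in every sample."""
    return [
        gene
        for gene, values in signals_by_gene.items()
        if min(values) >= threshold
    ]

# Hypothetical toy MAS5-like signals for 3 samples (assumption: not real data).
toy_mas5 = {
    "geneA": [850.0, 910.0, 790.0],
    "geneB": [35.0, 120.0, 15.0],
    "geneC": [240.0, 260.0, 255.0],
    "geneD": [60.0, 45.0, 80.0],
}

kept = abundant_genes(toy_mas5, threshold=200.0)
print(kept)
```

Requiring the threshold in every sample is a deliberately strict choice. It drops genes that fall into the noise range in even one sample, so only signals inside the dynamic range of all samples go into clustering.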

You can download these SSA files from the links in Fig1–3 and examine them yourself with Subio Platform.

Disadvantages of RMA

The large difference between MAS5 and RMA appears in the low-signal range. We will not discuss which algorithm calculates better in the low range. But from the viewpoint of interpreting results, we cannot overlook two disadvantages of RMA.

Firstly, RMA makes the signal distributions uniform, which hides the non-linear biases involved. This means you need more experience and care: only experienced analysts can sense a non-linear bias from the result of PCA or clustering and would check experimenters, dates, places, or other experimental factors to confirm it. For most analysts, hiding the bias easily leads to a misleading conclusion.

Secondly, RMA makes it difficult to separate the signal range from the noise range. A microarray usually has probes for all transcripts. So there must be a noise range, because not all genes are expressed in any one sample, and because every measurement system has a lower limit of detection. RMA hides the noise, which also misleads most analysts. You can read more about the dynamic range of microarrays.

Moreover, RMA places two restrictions on the experimental design. You should not use RMA if you cannot assume that all samples have almost the same expression profile, or if you will keep adding samples, as in prospective studies. The former situation is very common in biological studies, such as comparisons of different cell types, different compositions of cell types, or different stages of development or pathology.
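Both restrictions can be illustrated with quantile normalization, the normalization step commonly described as part of RMA. The Python sketch below is a simplified stand-in, not the RMA implementation: it skips background correction and probe summarization, does not handle ties, and uses hypothetical toy values.

```python
def quantile_normalize(samples):
    """samples: list of equal-length lists, one per sample, in the same gene order."""
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # For each rank k, average the k-th smallest value over all samples.
    rank_means = [
        sum(s[k] for s in sorted_samples) / len(samples) for k in range(n_genes)
    ]
    normalized = []
    for s in samples:
        # Gene indices ordered by ascending signal within this sample.
        order = sorted(range(n_genes), key=lambda i: s[i])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]
        normalized.append(out)
    return normalized

# Hypothetical toy signals for 4 genes (assumption: not real data).
a = [2.0, 4.0, 6.0, 8.0]
b = [1.0, 9.0, 3.0, 20.0]
c = [10.0, 30.0, 20.0, 40.0]

two = quantile_normalize([a, b])
three = quantile_normalize([a, b, c])

print(sorted(two[0]) == sorted(two[1]))  # every sample gets the same set of values
print(two[0], three[0])                  # sample a before and after adding sample c
```

In this sketch every sample ends up with exactly the same distribution, whatever its original shape was, and the normalized values of sample a change as soon as sample c joins the dataset.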

Consequently, we consider MAS5 superior to RMA from the viewpoint of interpreting data and planning studies. However, MAS5 is no longer available in the new GeneChip systems from Affymetrix, which was taken over by Thermo Fisher Scientific. We think this is a step backward for bioinformatics, caused by a lack of biological perspective.

Raw Data, Raw Data, Raw Data!

We do not intend to criticize one technology in particular, but would like to show an example of the current situation, in which unreasonable myths are holding back the life sciences. "Sophisticated" normalization techniques can make data look good, but cannot really remove the non-linear bias. What is ultimately important is to refine the experimental plan so that you get high-quality raw data that do not need such cosmetic techniques. And we recommend that you use a basic normalization technique to see the true characteristics of the data.
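The text does not name a specific basic normalization technique. As one possible example, the Python sketch below scales each sample by a single factor so that its median equals a common target. The function name, the target of 100, and the toy values are assumptions made for illustration.

```python
import statistics

def median_scale(signals, target=100.0):
    """Multiply all signals of one sample by one factor so the median equals target."""
    factor = target / statistics.median(signals)
    return [value * factor for value in signals]

# Hypothetical toy signals of one sample (assumption: not real data).
sample = [20.0, 50.0, 200.0, 800.0, 3200.0]
scaled = median_scale(sample)
print(scaled)
```

Because every signal in a sample is multiplied by the same factor, the ratios between genes within the sample are unchanged and the shape of the distribution on a log scale is preserved. Differences in dynamic range between samples therefore stay visible instead of being smoothed away.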

The Solution We Offer, Subio Platform

Nobody knows how to handle omics data. Nobody knows how to extract meaning from omics data. We think it is not statisticians or bioinformaticians but experimental biologists who will someday bring the critical idea. So it is essential to invite them to the discussion table, where everybody is equally an amateur.

Please read why we created the Subio Platform software, and what we offer to those of you who take on this unsolved problem.

The solution we offer, Subio Platform.

