GEO contains a large number of legacy microarray datasets generated using Affymetrix GeneChip platforms. Even though RNA-Seq has become the mainstream approach, these legacy datasets are still valuable for re-analysis in studies of disease, drug response, cell differentiation, tissue comparison, and many other topics. For more background on different types of microarrays and their dynamic ranges, please see this related article.
However, for researchers who have only worked with RNA-Seq data, the difference between Affymetrix processing methods such as RMA and MAS5 may not be obvious. Some datasets registered in GEO are processed with RMA, while others are processed with MAS5. Both may be labeled as “normalized expression values,” but the two kinds of values cannot be interpreted or compared in the same way.
The purpose of this article is not to provide a comprehensive explanation of all Affymetrix data processing methods. Instead, we focus on the minimum background needed when re-analyzing legacy datasets from GEO: the difference between RMA-like processed data and data obtained using MAS5.
RMA values are intended for comparisons within the same processing set
RMA (Robust Multi-array Average) processes multiple samples together. It applies background correction, quantile normalization across the samples, and probe-set summarization, and it reports expression values on a log2 scale. Because of the quantile normalization step, the distributions of the processed samples often appear neatly aligned.
However, datasets processed in separate RMA runs cannot be directly compared as expression values on the same scale. For example, RMA values from GEO Series A should not simply be combined with RMA values from another Series B to compare expression levels or fold changes directly.
This is because RMA does not quantify each sample completely independently. Instead, the values are determined using the distribution of all samples included in the processing set. As a result, the expression values depend on which samples were processed together by RMA.
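The dependence on the processing set can be illustrated with a minimal sketch of quantile normalization alone. This is not a full RMA implementation (background correction and probe-set summarization are omitted), ties are not handled specially, and the sample names and log2 values are hypothetical.

```python
def quantile_normalize(samples):
    """Quantile-normalize a dict of {sample_name: [value per probe]}.

    Each sample's k-th smallest value is replaced by the mean of the
    k-th smallest values across all samples in the set.
    """
    names = list(samples)
    n_probes = len(samples[names[0]])
    # Reference distribution: mean of the sorted values at each rank.
    sorted_cols = [sorted(samples[name]) for name in names]
    reference = [
        sum(col[rank] for col in sorted_cols) / len(names)
        for rank in range(n_probes)
    ]
    normalized = {}
    for name in names:
        values = samples[name]
        # Probe indices ordered from lowest to highest signal.
        order = sorted(range(n_probes), key=lambda i: values[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized[name] = out
    return normalized


# Hypothetical log2 intensities for four probes.
s1 = [5.0, 7.0, 9.0, 11.0]
s2 = [5.5, 7.5, 9.5, 11.5]
s3 = [4.0, 6.0, 10.0, 14.0]  # wider spread than s1 and s2

set_a = quantile_normalize({"s1": s1, "s2": s2})
set_b = quantile_normalize({"s1": s1, "s3": s3})

print(set_a["s1"])  # [5.25, 7.25, 9.25, 11.25]
print(set_b["s1"])  # [4.5, 6.5, 9.5, 12.5]
```

The raw values of s1 are identical in both runs, yet its normalized values differ because the reference distribution is built from whichever samples share the run.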
Other methods such as GC-RMA, dChip, and PLIER differ in the details of background correction and probe summarization. However, in practical terms, they require similar caution to RMA because they also process data in a way that depends on multiple samples. They can be used for comparisons within the same processing set, but separately processed datasets should not be directly compared as if they were expression values on the same scale.
RMA makes data look cleaner, but the results depend on the samples included in the processing set
In RMA-processed data, the histogram shapes across samples often appear well aligned. This can make the data look cleanly normalized and ready for comparison.
However, aligned distributions do not mean that the original experimental differences or quality differences have disappeared. Because RMA determines values using the distributions of multiple samples included in the processing set, the resulting expression values are affected by which samples were submitted to RMA together.
For example, if a severely low-quality sample is included in the RMA processing set, it can affect not only that sample but also the values of otherwise normal samples in the same set. Similarly, if different tissue types or cell types are included in the same RMA processing set, distributional differences from tissues outside the intended comparison can be introduced into the normalization process.
Therefore, when re-analyzing RMA-normalized data, you should not assume that the data are directly comparable simply because they have been processed with RMA. It is important to check which samples were included when the RMA values were generated. Even within the same GEO Series, samples from tissue types, conditions, or quality levels outside the intended analysis can affect the comparison of interest.
RMA aligns the data, but it does not correct batch effects
The figure below shows clustering of GSE15490, a dataset processed with RMA. The samples cluster not by condition but by replicate number, with low-numbered and high-numbered replicates forming separate groups. This suggests a possible batch effect. However, the histograms shown at the bottom appear neatly aligned because of quantile normalization.
When the same dataset is clustered after MAS5 processing, it looks completely different from the RMA result. The important point is that the histograms at the bottom reveal two groups of samples: one group with a wider dynamic range and another with a narrower dynamic range. This difference has a strong influence on the clustering result.
In this way, MAS5 can make it easier to recognize differences in dynamic range and possible batch effects from the histograms. In RMA-processed data, those differences may become much less visible.
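The following sketch uses simulated data, not GSE15490, to show how aligned histograms and a batch effect can coexist. It assumes a hypothetical second batch with a probe-specific shift and a compressed dynamic range, and it applies only quantile normalization rather than full RMA. All parameters are illustrative.

```python
import random


def quantile_normalize(columns):
    """Replace each column's k-th smallest value by the mean k-th smallest value."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)]
    result = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        result.append(out)
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
n_probes = 500
baseline = [rng.gauss(8.0, 2.0) for _ in range(n_probes)]
batch_shift = [rng.gauss(0.0, 1.0) for _ in range(n_probes)]  # probe-specific batch effect


def simulate(in_batch_2):
    values = []
    for base, shift in zip(baseline, batch_shift):
        v = base + rng.gauss(0.0, 0.1)  # measurement noise
        if in_batch_2:
            # Batch 2: probe-specific shift plus a compressed dynamic range.
            v = 8.0 + 0.6 * (v + shift - 8.0)
        values.append(v)
    return values


raw = [simulate(False), simulate(False), simulate(True), simulate(True)]
norm = quantile_normalize(raw)

# After normalization every sample contains exactly the same set of values,
# so the histograms are identical.
same_distribution = all(sorted(col) == sorted(norm[0]) for col in norm)

# Samples still resemble their own batch more than the other batch.
within_batch = min(pearson(norm[0], norm[1]), pearson(norm[2], norm[3]))
across_batch = max(pearson(norm[i], norm[j]) for i in (0, 1) for j in (2, 3))

print(same_distribution, within_batch > across_batch)
```

In this simulation the difference in dynamic range is no longer visible in the normalized histograms, while the batch grouping still drives the sample-to-sample correlations that clustering relies on.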
The same raw data can look very different with RMA and MAS5
Here, we compare the same raw Affymetrix HG-U133 Plus 2.0 data processed using RMA and MAS5.
With RMA processing, the distributions among samples are aligned, so the data appear clean and well organized overall.
In contrast, MAS5 processing makes it easier to see differences in sample distributions, the spread of low-signal regions, and the boundary between signal and noise.
This property is useful when re-analyzing legacy data. If you only look at RMA-normalized data, all samples may appear to have similar distributions. However, when the same data are re-quantified using MAS5, you may find that some samples have a narrower signal range, larger noise in the low-signal region, or a different overall dynamic range.
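A simple numeric companion to the visual histogram check is to compare the spread of log2 signals per sample. The sketch below is one possible approach, not a Subio Platform feature: the 5th to 95th percentile span and the 0.8 threshold are arbitrary assumptions, and the sample names and values are hypothetical.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in 0..100) of an already sorted list."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def dynamic_range(log2_values, low_q=5.0, high_q=95.0):
    """Span between two percentiles of one sample's log2 signals."""
    ordered = sorted(log2_values)
    return percentile(ordered, high_q) - percentile(ordered, low_q)


def flag_narrow_samples(samples, ratio_threshold=0.8):
    """Names of samples whose span is below ratio_threshold times the median span."""
    ranges = {name: dynamic_range(values) for name, values in samples.items()}
    ordered = sorted(ranges.values())
    mid = len(ordered) // 2
    median = ordered[mid] if len(ordered) % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    return [name for name, r in ranges.items() if r < ratio_threshold * median]


# Hypothetical log2 MAS5 signals: two samples span about 9 units, one about 4.5.
wide = [4 + i / 10 for i in range(101)]
wide2 = [4.2 + i / 10 for i in range(101)]
narrow = [6 + i / 20 for i in range(101)]
samples = {"sample_1": wide, "sample_2": wide2, "sample_3": narrow}

print(flag_narrow_samples(samples))  # ['sample_3']
```

A flagged sample is not necessarily unusable, but it is a candidate for closer inspection before clustering or differential expression analysis.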
A high correlation coefficient between replicate samples is often interpreted as evidence of high reproducibility. Over time this idea was overemphasized, and microarray data quality came to be discussed mainly in terms of correlation coefficients. Because RMA tends to produce very high correlation coefficients, it became almost a de facto standard from the mid-2000s onward. Making noise less visible, however, does not necessarily make gene expression analysis more reliable.
What to keep in mind when only RMA-processed data are available
Some Affymetrix datasets in GEO are provided mainly as RMA-normalized Series Matrix files. When re-analyzing such data, the first step is to check which samples were included when the RMA values were generated.
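One place to start this check is the header of the Series Matrix file, where submitters describe their processing in free text on lines beginning with `!Sample_data_processing`. The sketch below reads those lines and applies a crude keyword match. The example text and accessions are hypothetical, and the keyword list is an assumption: the field has no fixed vocabulary, so an "unknown" result means the description must be read manually.

```python
import csv
import io
import re

# Heuristic keyword pattern; word boundaries avoid matching "rma" inside "normalized".
METHOD_PATTERN = re.compile(r"\b(gc-?rma|rma|mas\s*5|plier|dchip)\b", re.IGNORECASE)


def read_processing_notes(matrix_text):
    """Map each sample accession to its '!Sample_data_processing' text."""
    accessions = []
    notes = []
    reader = csv.reader(io.StringIO(matrix_text), delimiter="\t")
    for row in reader:
        if not row:
            continue
        if row[0] == "!Sample_geo_accession":
            accessions = row[1:]
        elif row[0] == "!Sample_data_processing":
            notes.append(row[1:])  # a header may contain several such lines
        elif row[0] == "!series_matrix_table_begin":
            break  # the expression table starts here; the header is done
    return {
        acc: " ".join(note_row[i] for note_row in notes if i < len(note_row))
        for i, acc in enumerate(accessions)
    }


def guess_method(note):
    """Return the first processing keyword found in the note, or 'unknown'."""
    match = METHOD_PATTERN.search(note)
    return match.group(1).upper().replace(" ", "") if match else "unknown"


# Hypothetical header excerpt, not taken from a real GEO Series.
example = "\n".join([
    '!Series_title\t"Hypothetical series"',
    '!Sample_geo_accession\t"GSM_EXAMPLE_1"\t"GSM_EXAMPLE_2"',
    '!Sample_data_processing\t"Data were normalized with RMA."\t"Data were normalized with RMA."',
    '!series_matrix_table_begin',
    '"ID_REF"\t"GSM_EXAMPLE_1"\t"GSM_EXAMPLE_2"',
])

methods = {acc: guess_method(note) for acc, note in read_processing_notes(example).items()}
print(methods)  # {'GSM_EXAMPLE_1': 'RMA', 'GSM_EXAMPLE_2': 'RMA'}
```

The header tells you the method, but usually not the exact list of CEL files that went into the run, so the sample list of the Series and the methods section of the paper still need to be checked.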
If the samples were included in the same RMA processing set, then RMA values can be used for comparisons within that set. For example, if control and treatment samples were processed together in the same study, RMA-processed data can be used for clustering, differential expression analysis, and pathway analysis. Even in this case, however, you still need to check for possible batch effects and low-quality samples.
On the other hand, RMA values derived from different Series, different papers, or different processing sets should not be combined and compared directly as expression values. Even if separately processed RMA datasets appear to be on the same numerical scale, the sample sets used during processing were different. Therefore, they should not be treated as expression values on a common basis.
If you want to integrate and re-analyze multiple Affymetrix datasets, it is preferable, whenever possible, to go back to raw data such as CEL files and re-quantify the data using MAS5.
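The reason MAS5 is better suited to integration is that each array is quantified on its own. As a rough illustration, MAS5-style normalization is commonly described as scaling each array so that a trimmed mean of its signals reaches a target value. The sketch below shows only that scaling step, not the full MAS5 algorithm; the 2% trim, the target of 500, and the signal values are illustrative assumptions.

```python
def trimmed_mean(values, trim=0.02):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def scale_to_target(signals, target=500.0, trim=0.02):
    """Scale one array so that its trimmed mean equals `target`."""
    factor = target / trimmed_mean(signals, trim)
    return [s * factor for s in signals]


def scale_set(samples):
    """Scale every array independently; no information is shared between arrays."""
    return {name: scale_to_target(values) for name, values in samples.items()}


# Hypothetical raw signals for 100 probe sets per array.
s1 = [float(v) for v in range(10, 1010, 10)]
s2 = [float(v) for v in range(20, 2020, 20)]
s3 = [float(v) for v in range(5, 505, 5)]

scaled_in_a = scale_set({"s1": s1, "s2": s2})["s1"]
scaled_in_b = scale_set({"s1": s1, "s3": s3})["s1"]

print(scaled_in_a == scaled_in_b)  # True
```

Because the scaling factor for s1 is computed from s1 alone, its values are the same no matter which other arrays are processed alongside it. This is the property that makes separately downloaded CEL files easier to bring onto a common basis, although batch effects between studies still have to be examined.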
When only RMA data are available, or when you want to re-quantify the data with MAS5
Even if you currently only have RMA-normalized data, it may be possible to re-quantify the data from raw files using MAS5 if CEL files are available in GEO.
Subio’s data analysis service can support the re-analysis of Affymetrix datasets published in GEO, including obtaining CEL files, re-quantifying the data with MAS5, creating SSA files that can be opened in Subio Platform, and checking histograms, clustering, and differential expression results.
If you can only find RMA-normalized data, or if you want to re-quantify the data with MAS5 to check the data distribution, please contact Subio.
Summary
RMA and MAS5 have both been used to quantify Affymetrix GeneChip data, but the properties of the resulting values are very different. RMA processes multiple samples together and aligns the distributions among samples, so it can be used for comparisons within the same processing set. However, different RMA processing sets cannot be directly compared as expression values on the same scale.
MAS5 does not align distributions as strongly as RMA. As a result, it is easier to check differences in sample distributions, low-signal regions, and dynamic range. For this reason, when re-analyzing legacy Affymetrix datasets from GEO, it is preferable not to rely only on RMA-normalized data. Whenever possible, it is better to return to CEL files and re-quantify the data with MAS5 to check the data distribution.
When re-analyzing legacy data, you should not trust a normalized expression table without checking its origin. It is important to understand how the values were generated, which samples were included in the processing set, and how the signal and noise regions appear in the data.
Subio Platform can import microarray data processed with RMA or MAS5 and help you re-analyze the data while checking histograms, clustering, and expression patterns.