Comparing the Dynamic Range of Microarray Platforms: Essential Background for Re-Analyzing Legacy Data

Public databases such as GEO contain a large amount of microarray data measured in the past. When re-analyzing these datasets, it is important not to treat them simply as “microarray data,” but to check which platform was used and how the data were summarized.

Even among microarrays, Affymetrix 3' IVT GeneChip, Illumina BeadChip, Affymetrix Gene ST Array, and Agilent Whole Genome / SurePrint Microarray differ in how noise appears, how easily low-signal regions can be handled, and how many genes can be interpreted stably.

When discussing the performance of microarrays, the phrase “wide dynamic range” is often used. However, if dynamic range is compared simply as the difference between the maximum and minimum measured values, the practical performance of gene expression data can be misunderstood.

Microarray signal values depend on the measurement system, probe design, scanner, and summarization algorithm. Therefore, the superiority of a platform cannot be judged only by how many orders of magnitude the measured values span.

What matters is not the apparent width of the numerical values, but how many genes are included in the signal region that can be interpreted with confidence. In other words, what matters is how many genes can be safely used as targets for analysis.
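The difference between the two views can be shown with a minimal sketch. The signal values and the noise ceiling below are made up for illustration, and in real data each platform would need its own cutoff chosen from its own distribution.

```python
import math

def span_in_orders_of_magnitude(values):
    """Apparent dynamic range: log10(max / min) of the positive values."""
    positive = [v for v in values if v > 0]
    return math.log10(max(positive) / min(positive))

def count_in_signal_region(values, noise_ceiling):
    """Number of measurements above a chosen noise ceiling."""
    return sum(1 for v in values if v > noise_ceiling)

# Hypothetical signal values (not real measurements).
platform_a = [1, 2, 3, 4, 5, 6, 500, 100000]             # wide span, few usable values
platform_b = [20, 30, 150, 300, 800, 2000, 5000, 9000]   # narrower span, more usable values

noise_ceiling = 100  # assumed cutoff, shared here only to keep the example short

for name, values in (("A", platform_a), ("B", platform_b)):
    print(name,
          round(span_in_orders_of_magnitude(values), 2),
          count_in_signal_region(values, noise_ceiling))
```

In this toy case, platform A spans more orders of magnitude, but platform B has more values above the noise ceiling.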

The basic ideas of noise, reproducibility, and dynamic range are discussed in “What Is Good Gene Expression Data? Understanding Noise, Reproducibility, and Dynamic Range.” Building on that perspective, this page compares the dynamic range of major microarray platforms.

Affymetrix 3' IVT GeneChip

Affymetrix 3' IVT GeneChip is a representative microarray platform that was widely used from the early 2000s. Human Genome U133 Plus 2.0 Array is one example of this type, and a large amount of data from this platform is available in GEO.

When re-analyzing data from this platform, an important point is that even from the same raw data, MAS5-summarized data and RMA-summarized data show low-signal noise very differently.

Fig1: An example of Affymetrix HG-U133 Plus 2.0 data summarized using MAS5. Variation in the low-signal region is visible in the lower-left part of the scatter plot.

In MAS5-summarized data, variation in the low-signal region is relatively easy to see. At first glance, such data may appear to have low reproducibility. However, this variation is better read as marking a region that contains many genes that are either not expressed or close to the detection limit.

When noise is visible in this way, it becomes easier to decide which regions should be trusted and included in the analysis.

Fig2: The same raw data as in Fig1 (Affymetrix HG-U133 Plus 2.0), summarized using RMA. In the scatter plot, variation in the low-signal region becomes less noticeable.

In contrast, in RMA-summarized data, values in the low-signal region are compressed, making noise less visible. Looking only at the scatter plot, the data may appear cleaner than MAS5 data, but this does not mean that the noise has disappeared. As a result, in RMA data, the boundary between the signal region and the noise region becomes harder to judge.

MAS5 and RMA also differ in how expression changes appear. In RMA-summarized data, not only is variation in the low-signal region compressed, but the overall amplitude of expression changes also tends to appear smaller.

Therefore, if the “two-fold or greater” Fold Change threshold often used for MAS5 data is applied directly to RMA data, the number of detected differentially expressed genes may be greatly reduced. When analyzing RMA data, the Fold Change threshold should be reconsidered according to the data distribution and the amplitude of expression changes.
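Why a compressed scale lowers the number of genes passing a fixed threshold can be illustrated with a toy model. The sketch below is not the RMA algorithm: it only adds a constant offset to both intensities before taking the log ratio, which is one simple way to mimic compression on the low-signal side. The intensity pairs and the offset of 200 are hypothetical.

```python
import math

def log2_fold_change(a, b, offset=0.0):
    """log2 ratio of b over a after adding the same offset to both values."""
    return math.log2((b + offset) / (a + offset))

def count_passing(pairs, offset, threshold_log2):
    """Count pairs whose absolute log2 fold change reaches the threshold."""
    return sum(1 for a, b in pairs
               if abs(log2_fold_change(a, b, offset)) >= threshold_log2)

# Hypothetical (control, treated) intensities; every pair changes at least two-fold.
pairs = [(50, 100), (100, 300), (400, 800), (2000, 4000), (5000, 20000)]

two_fold = 1.0                  # log2(2)
one_and_half = math.log2(1.5)   # a relaxed threshold, shown only for comparison

print(count_passing(pairs, 0.0, two_fold))        # uncompressed scale
print(count_passing(pairs, 200.0, two_fold))      # compressed scale, same threshold
print(count_passing(pairs, 200.0, one_and_half))  # compressed scale, relaxed threshold
```

In this toy model the low-intensity pairs lose most of their apparent fold change, so the same two-fold rule keeps far fewer of them. The relaxed value is not a recommendation; the appropriate threshold should come from the actual data distribution.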

When reusing past Affymetrix data from GEO and other sources, it is necessary to check not only which platform was used, but also how the data were summarized. MAS5 and RMA differ not only in how the noise and signal regions appear, but also in the size of Fold Change and in the thresholds used to extract differentially expressed genes.

In typical Affymetrix 3' IVT GeneChip data, the signal region may contain about 25,000 probe sets. However, because multiple probe sets are often designed for a single gene, it is realistic to regard about 12,000 genes as a practical reference for the number of genes that can be interpreted stably.
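The step from probe sets to genes is a simple many-to-one collapse. A minimal sketch, assuming a probe-set-to-gene annotation table is available; the identifiers below are hypothetical placeholders, not real probe set or gene names.

```python
def count_probe_sets_and_genes(probe_to_gene, signal_probe_sets):
    """Return (number of probe sets, number of distinct annotated genes)."""
    genes = {probe_to_gene[p] for p in signal_probe_sets if p in probe_to_gene}
    return len(signal_probe_sets), len(genes)

# Hypothetical annotation: several probe sets can point to the same gene.
probe_to_gene = {
    "ps_001": "GENE_A",
    "ps_002": "GENE_A",
    "ps_003": "GENE_B",
    "ps_004": "GENE_C",
    "ps_005": "GENE_C",
}

# Probe sets judged to lie in the signal region (assumed given).
signal_probe_sets = ["ps_001", "ps_002", "ps_003", "ps_004", "ps_005"]

n_probe_sets, n_genes = count_probe_sets_and_genes(probe_to_gene, signal_probe_sets)
print(n_probe_sets, n_genes)
```

Probe sets without an annotation are skipped when counting genes, so the gene count can be lower than a naive division of the probe set count would suggest.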

In theory, noise components and signal components can be considered separately. In real data, however, they are not cleanly separated. The boundary is not a single line, but rather a band-like region.

Illumina BeadChip

Illumina BeadChip appeared after GeneChip and emphasized high reproducibility as one of its features. Looking only at scatter plots, variation among replicate samples appears very small, and the data look clean.

However, this clean appearance requires caution. In Illumina BeadChip data, regardless of minor differences in summarization or normalization methods, variation in the low-signal region tends to be compressed, and the data often look similar to the RMA data described above. A scatter plot that appears to contain little noise may therefore simply be hiding instability in the low-signal region.

Fig3: An example of data measured using Illumina BeadChip. Although the scatter plot appears to show little variation, many genes are concentrated in the low-signal region in the histogram.

In Fig3, the peak at the far left of the histogram is steep and narrow, so the boundary between the noise region and the signal region is relatively easy to judge. However, because many genes are concentrated on the low-signal side, the number of genes included in the signal region is not large.
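When the low-signal peak is steep and narrow, a boundary can be estimated directly from the histogram. The sketch below is one possible heuristic, not a method prescribed by any platform: it assumes the tallest histogram bin is the noise peak, estimates the noise spread from the values at or below that peak, and places the boundary k spreads above it. The data are simulated, and `bin_width` and `k` are arbitrary choices.

```python
import math
import random

def noise_boundary(log_values, bin_width=0.25, k=3.0):
    """Estimate a noise/signal boundary from the tallest histogram bin."""
    lowest = min(log_values)
    counts = {}
    for v in log_values:
        idx = int((v - lowest) // bin_width)
        counts[idx] = counts.get(idx, 0) + 1
    peak_idx = max(counts, key=counts.get)             # assumed to be the noise peak
    peak = lowest + (peak_idx + 0.5) * bin_width       # centre of that bin
    # Values left of the peak are less contaminated by true signal,
    # so the spread is measured on that side only.
    left = [v for v in log_values if v <= peak]
    spread = math.sqrt(sum((v - peak) ** 2 for v in left) / len(left))
    return peak + k * spread

# Simulated log2 values: a narrow noise peak plus a broad signal region.
random.seed(0)
noise = [random.gauss(6.0, 0.2) for _ in range(6000)]
signal = [random.uniform(7.5, 14.0) for _ in range(4000)]
values = noise + signal

boundary = noise_boundary(values)
n_signal = sum(1 for v in values if v > boundary)
print(round(boundary, 2), n_signal)
```

Counting the values above the boundary gives the size of the signal region. With real data the result should be checked against the histogram by eye, because the heuristic fails when the noise peak is not the tallest feature.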

Although Illumina BeadChip is designed to target many genes, if we limit the analysis to the range that can be treated as a clear signal region in the histogram, it is realistic to consider about 8,000 to 10,000 genes as stably interpretable. In this sense, the practical dynamic range of Illumina BeadChip is narrower than that of Affymetrix 3' IVT GeneChip or Agilent Whole Genome / SurePrint Microarray.

Affymetrix Gene ST Array

Affymetrix Gene ST Array was introduced as a successor to 3' IVT GeneChip platforms such as HG-U133 Plus 2.0. Successor products in this lineage, such as Clariom S Array and Clariom D Array, are still sold by Thermo Fisher Scientific.

In 3' IVT GeneChip, probes were designed mainly toward the 3' UTR side, where relatively unique sequences are often found. In contrast, Gene ST Array and its successor products use probes designed across the whole gene, especially in exon regions. This design is advantageous for obtaining information about the whole transcript and splicing, but caution is needed when evaluating dynamic range as gene-level expression data.

Gene ST Array data are usually summarized with RMA. Therefore, in scatter plots, variation among replicate samples may appear small, and the data may look clean and highly reproducible overall. However, this appearance also includes the effect of low-signal compression by RMA.

Newer technology is not always superior to older technology. For mature measurement technologies such as microarrays, older platforms may have data characteristics that are easier to understand and more practical to handle. In addition, a large amount of 3' IVT GeneChip data has accumulated in public databases, which makes it useful for re-analysis.

Fig4: An example of data measured using Affymetrix Gene ST Array. In the low-signal region, variation is compressed by RMA, making the boundary between the noise region and the signal region difficult to distinguish.

In Fig4, in regions where the signal values are relatively high, variation increases as expression levels decrease. This is expected: measurement uncertainty becomes larger as expression becomes lower.

However, in even lower-value regions, variation becomes smaller again, and the values appear to converge. Rather than meaning that the noise has actually disappeared, this is better interpreted as low-signal values being compressed by RMA, making noise and bias less visible.
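This rise and subsequent fall in variation can be checked numerically by binning replicate pairs by their mean intensity and computing the spread of the replicate difference in each bin. The sketch below uses a handful of made-up log2 values and arbitrary bin edges; it only shows the bookkeeping, not real Gene ST data.

```python
import statistics

def spread_by_intensity(rep1, rep2, bin_edges):
    """SD of replicate differences (log scale) within bins of mean intensity."""
    result = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        diffs = [a - b for a, b in zip(rep1, rep2) if lo <= (a + b) / 2 < hi]
        sd = statistics.stdev(diffs) if len(diffs) >= 2 else None
        result.append((lo, hi, len(diffs), sd))
    return result

# Hypothetical log2 values for two replicates of the same sample.
rep1 = [5.0, 5.2, 4.9, 6.5, 7.4, 7.0, 10.0, 12.1, 9.0]
rep2 = [5.1, 5.1, 5.0, 7.5, 6.6, 7.6, 10.2, 12.0, 9.3]

bins = spread_by_intensity(rep1, rep2, bin_edges=[4, 6, 8, 14])
for lo, hi, n, sd in bins:
    print(f"{lo}-{hi}: n={n}, sd={sd:.3f}")
```

In these made-up values the middle bin has the largest spread and the lowest bin looks tight again, which is the pattern that should raise suspicion of compression rather than be read as good reproducibility.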

What is particularly troublesome about this microarray is that the boundary between the signal region and the noise region is broad and ambiguous. Even if the scatter plot looks clean, it is difficult to decide how far the data can be trusted and included in the analysis.

Probe design may also contribute to this ambiguity. In Gene ST Array, probes are mainly designed in exon regions. Exon regions may contain sequences shared among isoforms or sequences common to highly homologous gene families. As a result, probe specificity can vary, and some probes are more likely to be affected by cross-hybridization or nonspecific signals.

For this type of data, the range that can be trusted for analysis needs to be chosen carefully. Although it depends on the dataset and tissue, for gene-level expression, it is realistic to regard about 9,000 to 11,000 genes as a practical reference for the number of genes that can be interpreted stably.

When judging the noise region, it can be useful to check where genes that are thought not to be expressed in that cell or tissue appear in the value distribution. However, in data where noise and signal components are broadly mixed, and where the low-signal side is further compressed by RMA, the boundary is not easy to interpret in practice.
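Checking where non-expressed genes fall can be turned into a rough noise ceiling. The sketch below assumes a list of genes believed not to be expressed in the sample is available; the gene names, values, and the choice of the 80th percentile are all hypothetical. A percentile is used instead of the maximum so that a single mislabeled or cross-hybridizing gene does not set the cutoff.

```python
def noise_ceiling_from_negatives(expression, negative_genes, percentile=80):
    """Nearest-rank percentile of the values of presumed non-expressed genes."""
    values = sorted(expression[g] for g in negative_genes if g in expression)
    if not values:
        raise ValueError("none of the negative genes are present in the data")
    rank = (percentile * len(values) + 99) // 100   # integer ceiling of p*n/100
    return values[max(rank, 1) - 1]

# Hypothetical log2 expression values.
expression = {
    "NEG_01": 5.8, "NEG_02": 5.9, "NEG_03": 6.0, "NEG_04": 6.0, "NEG_05": 6.1,
    "NEG_06": 6.1, "NEG_07": 6.2, "NEG_08": 6.3, "NEG_09": 6.4, "NEG_10": 7.0,
    "GENE_A": 6.2, "GENE_B": 6.9, "GENE_C": 8.4, "GENE_D": 11.0, "GENE_E": 5.9,
}
negative_genes = [f"NEG_{i:02d}" for i in range(1, 11)]

ceiling = noise_ceiling_from_negatives(expression, negative_genes)
above = sorted(g for g, v in expression.items()
               if g not in negative_genes and v > ceiling)
print(ceiling, above)
```

When noise and signal are broadly mixed, as described above, the negative genes will be spread widely and the resulting ceiling should be treated as a band rather than a sharp line.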

Agilent Whole Genome 4x44k Microarray

Agilent Whole Genome 4x44k Microarray represents a major advance in the history of microarrays. Its successor, the SurePrint Microarray series, also maintains high data quality.

Fig5: An example of data measured using Agilent Whole Genome 4x44k. The overlap between the noise region and the signal region is small, and the boundary is relatively clear.

In the histogram of Fig5, a sharp peak is visible at the far left. This region is thought to be formed by negative control probes and probes with signal values close to them.

A key feature of Agilent data is that the overlap between the noise region and the signal region is small, and the boundary between them is relatively clear. This makes it easier to distinguish genes that are expressed, even at low levels, from genes that are not expressed.

In this type of data, about 34,000 probes are included in the signal region, and about 15,000 to 16,000 genes can be considered a stably interpretable range. Compared with Affymetrix 3' IVT GeneChip and Illumina BeadChip, low-expression regions are easier to handle, and the practical dynamic range is wide.

A wide dynamic range does not simply mean that many genes in the middle-to-low expression range can be measured. It also has important biological meaning because it becomes easier to distinguish genes that are not expressed from genes that are expressed at low levels.

When examining on/off expression or weak expression changes, it is important to know how far the low-expression region can be trusted. In this respect, Agilent Whole Genome 4x44k and later platforms have practical advantages.

Comparison with RNA-Seq

RNA-Seq and microarrays are often compared with the statement that “RNA-Seq has a wider dynamic range.” However, in such comparisons, it is necessary to ask which microarray platform is being used as the comparison target.

Affymetrix 3' IVT GeneChip, Illumina BeadChip, Affymetrix Gene ST Array, and Agilent Whole Genome / SurePrint Microarray differ greatly in how low-signal regions appear and in how many genes can be interpreted stably. Therefore, if “microarray” is treated as a single category and compared with RNA-Seq, it becomes unclear what is actually being compared.

For example, if a microarray platform with a relatively limited number of genes in the signal region is used as the comparison target, RNA-Seq will tend to appear to have a wider dynamic range. On the other hand, when comparing RNA-Seq with a platform such as Agilent Whole Genome / SurePrint Microarray, which can handle low-expression regions relatively stably, the number of RNA-Seq reads and the variation in the low-count region must be checked.

The dynamic range of RNA-Seq depends, in principle, on read depth. With a small number of reads, it is difficult to measure low-expression genes stably. With sufficient read depth, lower-expression genes can more easily be included in the analysis.

As a rough guide, RNA-Seq with about 30 to 60 million reads (counted as single reads) is often sufficient for quantifying medium-to-high expression genes, and may provide practical utility comparable to mature microarrays. However, if low-expression genes are also to be included stably in the analysis, the required number of reads increases substantially.
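The dependence on read depth can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes pure Poisson sampling noise, where a gene with an expected count of n reads has a coefficient of variation of 1/sqrt(n). It ignores gene length, mapping rate, and biological variation, so real requirements are higher. The abundance fractions are hypothetical.

```python
import math

def expected_reads(total_reads, fraction_of_library):
    """Expected read count for a gene that makes up a given fraction of the library."""
    return total_reads * fraction_of_library

def poisson_cv(expected):
    """Coefficient of variation of a Poisson count with the given mean."""
    return 1.0 / math.sqrt(expected)

def reads_needed(fraction_of_library, target_cv):
    """Total reads at which the Poisson CV for this gene drops to target_cv."""
    return 1.0 / (fraction_of_library * target_cv ** 2)

depth = 30e6  # 30 million reads
for fraction in (1e-5, 1e-6, 1e-7):  # hypothetical abundances
    n = expected_reads(depth, fraction)
    print(f"fraction={fraction:g}: about {n:.0f} reads, CV about {poisson_cv(n):.2f}")

print(f"{reads_needed(1e-7, 0.2):.3g} reads for CV 0.2 at fraction 1e-7")
```

Under these assumptions, each ten-fold drop in abundance requires ten times as many reads to keep the same sampling precision, which is why including low-expression genes raises the required depth so sharply.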

In other words, RNA-Seq does not always have a wider dynamic range than microarrays. If sequencing depth is insufficient, the practical resolution in low-expression regions may be inferior to that of a good microarray.

When comparing RNA-Seq and microarrays, it is necessary to judge not only by the technology name, but also by the type of microarray being used as the comparison target, the number of RNA-Seq reads, variation in the low-count region, the summarization method, and how the signal region appears.

Summary

The dynamic range of a microarray cannot be compared by how many orders of magnitude the measured values span. What matters is how many genes are included in the signal region that can be interpreted with confidence.

Affymetrix 3' IVT GeneChip is a representative microarray platform with a large amount of accumulated data, and is a practical platform for stably interpreting about 12,000 genes. However, because MAS5 and RMA differ in how noise appears and in the amplitude of Fold Change, the summarization method should be checked when re-analyzing data from GEO and other sources.

In Illumina BeadChip, the boundary between the noise region and the signal region can be judged relatively easily from the histogram, but the number of genes included in the signal region is not large. In Affymetrix Gene ST Array, low-signal compression by RMA and the effect of exon-targeted probe design make the boundary between the noise region and the signal region broad and ambiguous.

In contrast, in Agilent Whole Genome 4x44k and later platforms, the overlap between the noise region and the signal region is small, and many genes can be interpreted stably, including low-expression genes. This makes it easier to distinguish genes that are not expressed from genes that are expressed at low levels, and gives these platforms a wide practical dynamic range.

When comparing RNA-Seq and microarrays, “microarray” should not be treated as a single category. It is necessary to check the comparison platform, the number of RNA-Seq reads, and the variation in the low-count region. Rather than judging superiority only by the technology name, it is important to look at the actual data distribution.

Related topic: fixed measurement systems are important for long-term comparison

The differences between RNA-Seq and microarrays are also discussed in “RNA-Seq Is Not a Simple Upgrade from Microarrays in Gene Expression Analysis.” RNA-Seq and microarrays are not in a simple superior-inferior relationship. Because their measurement methods differ, some genes are more stably observed by microarrays, whereas others are more easily detected by RNA-Seq.

In addition, for long-term diagnosis or monitoring, where it is important to compare data using the same criteria over time, microarrays have practical value because the measurement system can be fixed more easily. RNA-Seq is a flexible and expandable technology, but differences in measurement conditions, library preparation, sequencers, and analysis pipelines can change how the data appear. Therefore, it is important to evaluate not only the research purpose and the characteristics of target genes, but also over what period and by what criteria the data need to be compared.

Subio Platform allows you to visually examine microarray and RNA-Seq data and consider which range can be trusted as the target of analysis. Instead of judging data only by their appearance, checking the relationship between noise and signal while proceeding with the analysis is the first step toward handling gene expression data appropriately.