In gene expression analysis, RNA-Seq is often described as a newer and more powerful method than microarrays. Indeed, RNA-Seq has many advantages, such as not depending on predefined probe design, being able to analyze a broader range of transcripts, and offering greater flexibility for reanalysis.
However, RNA-Seq is not a simple upgrade from microarrays. Because RNA-Seq and microarrays are based on different measurement principles, the genes that are easier to detect, the genes that are harder to interpret, and the range of measurements that can be reliably compared are not the same.
There are multiple microarray platforms. On this page, we use Agilent microarrays as an example of a platform that is still sometimes used as a relatively stable source of gene expression data. When we refer to “microarrays” below, we mainly mean Agilent microarrays.
The important question is not which technology is always better. What matters is understanding the characteristics of each method, looking at the actual data, and deciding which range of the data can be trusted for interpretation.
RNA-Seq and microarrays are based on different measurement principles
In RNA-Seq, libraries are prepared from RNA and sequenced. The resulting reads are then assigned to a genome or transcriptome to estimate gene-level expression. As a result, read depth, library preparation, gene length, sequence mappability, the handling of multi-mapping reads, annotation, and expression quantification methods can all affect the results.
In contrast, microarrays measure signals generated by the hybridization of sample-derived nucleic acids to predefined probes. The measurement targets are limited to genes or regions for which probes have been designed, but within the same platform, the fixed probe set can make stable comparisons easier.
In other words, although RNA-Seq and microarrays may both appear to measure “gene expression,” they estimate expression values based on different measurement principles. Therefore, when interpreting differences between results, it is necessary to consider the target nucleotide sequence, read mappability, the possibility of multi-mapping, and the region measured by the probe.
Early RNA-Seq comparison papers should be read carefully from today’s perspective
From the early to mid-2010s, many papers compared RNA-Seq with microarrays. These papers often claimed that RNA-Seq could detect more genes than microarrays, was better for low-expression genes, and had a wider dynamic range.
However, many comparisons from that period focused on demonstrating the advantages of RNA-Seq as a new technology. They often emphasized the number of detected genes or the apparent spread of expression values based on FPKM/RPKM. From today’s perspective, some of these comparisons did not sufficiently distinguish “detected genes” from “genes with stable measurements that can be reliably used for analysis.”
In RNA-Seq, genes assigned only a small number of reads may still be counted as detected. However, genes with only a few reads cannot always be interpreted reliably in terms of expression differences or expression patterns across samples.
In gene expression analysis, the important question is not simply how many genes were detected, but which genes can be used for sample-to-sample comparison, and with what level of reliability. It is risky to conclude that RNA-Seq always provides more useful information based only on the number of detected genes.
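The distinction between "detected" and "usable for comparison" can be sketched with a simple count filter. This is a minimal illustration, not a recommended pipeline: the gene names, the counts, and both thresholds (`min_count`, `min_samples`) are hypothetical and would need to be chosen for each experiment.

```python
from typing import Dict, List


def detected(counts: List[int]) -> bool:
    """A gene counts as 'detected' if any sample has at least one read."""
    return any(c > 0 for c in counts)


def usable(counts: List[int], min_count: int = 10, min_samples: int = 3) -> bool:
    """A gene counts as 'usable' if enough samples reach a minimum read count.

    Both thresholds are assumptions made for this sketch.
    """
    return sum(c >= min_count for c in counts) >= min_samples


# Hypothetical raw read counts for three genes across six samples.
raw_counts: Dict[str, List[int]] = {
    "GENE_A": [250, 310, 275, 120, 140, 133],
    "GENE_B": [0, 2, 1, 0, 3, 1],
    "GENE_C": [0, 0, 0, 0, 0, 0],
}

detected_genes = [g for g, c in raw_counts.items() if detected(c)]
usable_genes = [g for g, c in raw_counts.items() if usable(c)]

print(detected_genes)  # ['GENE_A', 'GENE_B']
print(usable_genes)    # ['GENE_A']
```

In this toy data, GENE_B adds to the number of detected genes, but its counts of 0 to 3 reads are too small to support a sample-to-sample comparison.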
Be careful with the wide dynamic range seen in FPKM/RPKM values
RNA-Seq is often said to have a wider dynamic range than microarrays. However, in early RNA-Seq comparison papers, the wide dynamic range of RNA-Seq was sometimes emphasized even though the number of reads obtained with sequencers at that time was much lower than what is common today.
In particular, during the period when platforms such as the Genome Analyzer II (GAII) were used, RNA-Seq read depth was often limited compared with current bulk RNA-Seq. For low-expression genes, sufficient read counts may not have been obtained. Therefore, caution is needed when such data are used to claim that RNA-Seq can broadly quantify expression even at low expression levels.
Today, deeper sequencing and methods using unique molecular identifiers (UMIs) can reduce the effects of PCR bias and duplicated reads in some cases. However, even when UMIs are used, the number of observed molecules for low-expression genes is still small. Therefore, it is still necessary to consider carefully whether small differences in these counts can be interpreted as stable expression differences across samples.
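Why small molecule counts remain noisy can be illustrated with a simple sampling model. The sketch below assumes that observed counts follow a Poisson distribution, which is a simplification: real RNA-Seq data usually show additional technical and biological variance, so the values here are closer to a lower bound on the noise than to an estimate of it.

```python
import math


def poisson_cv(expected_count: float) -> float:
    """Coefficient of variation (sd / mean) of a Poisson-distributed count."""
    if expected_count <= 0:
        raise ValueError("expected_count must be positive")
    return 1.0 / math.sqrt(expected_count)


# Relative sampling noise at several hypothetical expected molecule counts.
for n in (4, 25, 100, 10000):
    print(f"{n:>6} molecules -> relative sampling noise about {poisson_cv(n):.0%}")
```

Under this assumption, a gene observed at about 4 molecules fluctuates by roughly 50% from sampling alone, while a gene at about 100 molecules fluctuates by roughly 10%. Removing PCR duplicates does not change this relationship.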
In earlier comparison papers, normalized values such as FPKM and RPKM were often used to discuss the wide dynamic range of RNA-Seq. FPKM and RPKM normalize read counts by library size and gene length. As a result, even genes with low counts may appear as continuous expression values after normalization.
However, a wide distribution of normalized values is not the same as having stable and reliable measurements. Especially at low expression levels, the original read counts may be small. If dynamic range is evaluated only from FPKM/RPKM values, the practical quantitative range may be overestimated.
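A short calculation shows how normalization can make a handful of reads look like a continuous scale. The sketch uses the standard FPKM definition (fragments per kilobase of transcript per million mapped fragments). The library size and gene length are hypothetical values chosen for illustration.

```python
def fpkm(fragments: int, gene_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)


# Hypothetical library of 10 million mapped fragments and a 500 bp gene.
LIBRARY = 10_000_000
LENGTH = 500

for n in range(4):
    print(n, "fragments ->", round(fpkm(n, LENGTH, LIBRARY), 3), "FPKM")
```

With these numbers, 1, 2, and 3 fragments become 0.2, 0.4, and 0.6 FPKM. The values look like points on a continuous axis, but each step corresponds to a single fragment, so the lower end of the FPKM distribution says little about how reliably those genes were measured.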
Therefore, conclusions in early RNA-Seq comparison papers stating that “RNA-Seq has a wider dynamic range than microarrays” should not be accepted at face value. They should be read while checking the actual read counts, the original count distribution, the handling of low-expression genes, and the apparent spread caused by FPKM/RPKM values.
Genes and cases where RNA-Seq may have advantages
RNA-Seq has advantages that microarrays do not have. For example, RNA-Seq does not depend on an existing probe set, can more easily incorporate updated annotations through reanalysis, and can be extended to analyses based on sequence information.
Also, even at the same expression level, longer genes tend to generate more reads in RNA-Seq, making them less likely to suffer from insufficient counts when expression is low. Therefore, for genes that are lowly expressed but long and highly mappable, RNA-Seq may make it easier to examine expression patterns.
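The effect of gene length on the available counts can be sketched by inverting the usual FPKM definition (fragments per kilobase per million mapped fragments). The library size, the expression level of 0.5 FPKM, and the gene lengths below are hypothetical, and the calculation ignores mappability and protocol-specific biases.

```python
def expected_fragments(
    fpkm_value: float, gene_length_bp: int, total_mapped_fragments: int
) -> float:
    """Invert the FPKM definition to get the expected fragment count."""
    return fpkm_value * (gene_length_bp / 1e3) * (total_mapped_fragments / 1e6)


LIBRARY = 20_000_000  # hypothetical number of mapped fragments

for length in (500, 2_000, 10_000):
    n = expected_fragments(0.5, length, LIBRARY)
    print(f"{length:>6} bp gene at 0.5 FPKM -> about {n:.0f} fragments")
```

At the same normalized expression level, the 500 bp gene is expected to yield about 5 fragments and the 10,000 bp gene about 100, which is why long, lowly expressed genes can still be compared across samples while short ones may not.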
| Genes or cases where RNA-Seq may have advantages | Reason |
|---|---|
| Genes without designed probes | They are not measured by microarrays if no probe is available |
| Genes to be re-examined with updated annotation | They may be included through re-mapping or re-quantification |
| Low-expression genes that are long and highly mappable | They may still obtain enough reads to examine expression patterns |
| Cases involving transcript structure or isoforms | Sequence information can be used, but short-read RNA-Seq depends on inference and requires sufficient read depth |
However, being detectable by RNA-Seq does not mean that all measurements can be interpreted with the same level of confidence. For low-expression genes, insufficiently annotated genes, and groups of highly similar sequences, read assignment and expression estimation may be unstable.
RNA-Seq is also often said to be useful for examining transcript structure and isoforms. However, transcript-level expression values obtained from ordinary short-read RNA-Seq are estimates based on fragmented reads, not direct observations of full-length transcripts. They should not be treated in the same way as results from long-read RNA-Seq, which can read full-length transcripts more directly.
Furthermore, when expression is estimated at the transcript level rather than the gene level, reads are divided among multiple isoforms. As a result, the number of reads assigned to each transcript decreases, and expression estimates tend to become less stable than gene-level summaries.
If an experiment is designed specifically for isoform analysis, sufficient read depth should be secured for that purpose. However, RNA-Seq data generated for ordinary gene expression analysis often do not have enough reads for reliable transcript-level quantification. If such data are later repurposed for isoform analysis, values may still be output by the software, but they may not be reliable enough for interpretation.
Detecting genes without probes is not the same as interpreting them reliably
One advantage often attributed to RNA-Seq is that, unlike microarrays, it does not require probes to be designed in advance. Therefore, genes without probes, genes included in updated annotations, and newly annotated transcripts can potentially be included in the analysis.
This is an important feature of RNA-Seq. However, “detectable” does not mean “interpretable with the same confidence as well-characterized known genes.”
For insufficiently annotated genes, highly similar sequence families, and low-expression genes, read assignment and expression estimation may be unstable. In particular, genes whose intron-exon structures are experimentally supported by cDNA libraries or ESTs are not equivalent to genes whose structures are not sufficiently supported. The reliability of expression values obtained by RNA-Seq may differ between these cases.
Read assignment and expression estimation are performed automatically by software. However, the output values should not all be treated as measurements with the same level of confidence. The reliability of the measurement can depend on the support level of the gene model, read mappability, the possibility of multi-mapping, and the presence of similar sequences nearby.
Moreover, even if a measurement value is obtained, it may be difficult to interpret biologically if the function of the gene or transcript, the cell types in which it is expressed, or its known biological role is not understood. In practice, even if many uncertain transcripts without annotation are detected, they are often not used as central targets for analysis or interpretation because their relationship to the research question cannot be explained.
In other words, RNA-Seq expands the range of things that can be measured. However, the range of genes that can be interpreted does not expand by the same amount. In ordinary gene expression analysis, the ability to detect unknown genes or transcripts becomes a major practical advantage only in limited cases.
What matters in analysis is not using everything that is detected, but identifying the range that can be measured reliably and explained in a biologically meaningful way.
Genes and cases where microarrays may have advantages
Microarrays depend on predefined probes, so their measurement targets are limited. This is a limitation, but in certain cases it can also become an advantage.
For genes that are both lowly expressed and short, RNA-Seq may not obtain sufficient read counts. Counts may range from zero to only a few reads, and sample-to-sample variation may appear large. On the other hand, if a sufficiently specific probe exists and the signal is clearly above background noise, microarrays may allow more stable comparison in some cases.
However, short genes are not always difficult for RNA-Seq. With current deep bulk RNA-Seq, short genes can still be analyzed stably if they are sufficiently expressed, have unique exons, and show high mappability. The essential issue is not gene length itself, but whether sufficient effective counts are obtained for sample-to-sample comparison.
In gene families that share highly homologous regions, short-read RNA-Seq reads may correspond to multiple genes. Depending on the analysis method, multi-mapping reads may be excluded, or they may be distributed among multiple candidate genes to estimate expression levels.
In such cases, depending on the analysis method, expression estimates may appear closer to each other even when the true expression levels differ between genes. In other words, for groups of highly similar sequences, RNA-Seq measurements may not fully reflect gene-specific expression differences.
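How the handling of multi-mapping reads changes the apparent difference between two homologous genes can be shown with a toy example. The read numbers and the three allocation rules below are hypothetical simplifications and do not reproduce the algorithm of any specific tool. In particular, the "proportional" rule is only a one-step caricature of methods that allocate ambiguous reads according to the evidence from unique reads.

```python
from typing import Tuple


def allocate(
    unique_a: float, unique_b: float, shared: float, method: str
) -> Tuple[float, float]:
    """Return estimated counts for genes A and B under a given rule."""
    if method == "discard":
        # Ignore reads that map to both genes.
        return unique_a, unique_b
    if method == "uniform":
        # Split shared reads equally between the two genes.
        return unique_a + shared / 2, unique_b + shared / 2
    if method == "proportional":
        # Split shared reads in proportion to the unique-read evidence.
        total = unique_a + unique_b
        return (
            unique_a + shared * unique_a / total,
            unique_b + shared * unique_b / total,
        )
    raise ValueError(f"unknown method: {method}")


# Toy truth: gene A produced 180 reads and gene B produced 20 (ratio 9.0).
# Half of each gene's reads fall in a region shared by both genes.
UNIQUE_A, UNIQUE_B, SHARED = 90, 10, 100

for method in ("discard", "uniform", "proportional"):
    a, b = allocate(UNIQUE_A, UNIQUE_B, SHARED, method)
    print(f"{method:>12}: A={a:.0f}, B={b:.0f}, A/B ratio={a / b:.1f}")
```

In this example, the equal split turns a true 9-fold difference into an apparent ratio of about 2.3. Discarding the shared reads preserves the ratio but halves the counts, which matters when expression is already low.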
On the other hand, if probes are designed in regions with high discriminatory power, microarrays may provide signals that are easier to interpret as differences between genes.
Measuring short and highly homologous microRNAs is challenging for both RNA-Seq and microarrays. In microarrays, one possible advantage is that the measurement system can be designed around not only the probe sequence but also the physical properties of hybridization.
Reference: Agilent miRNA Microarrays Technical Overview
| Genes or cases where microarrays may have advantages | Reason |
|---|---|
| Low-expression and short genes | RNA-Seq may not obtain sufficient effective counts |
| Groups of highly similar sequences | Read assignment in RNA-Seq may be ambiguous, and depending on the analysis method, expression estimates may appear closer to each other |
| Genes with well-designed probes | They may provide stable continuous signals for comparison |
| Cases focused on relative comparison of known genes | Within the probe-covered range, stable comparison may be easier |
| Cases where comparison with accumulated historical data is important | Data from the same or similar platforms can be reused |
For long-term diagnostic use, the fixed nature of the measurement system also matters
One advantage of microarrays is that they are based on a physical measurement platform with fixed probe sequences and positions. This can be a limitation, but it also makes it easier to maintain the same measurement system over a long period of time.
For diagnostic tools and long-term clinical testing, it is important to keep measurement targets, measurement methods, and decision criteria as stable as possible. In microarrays, the position of each probe and the gene or region measured by that probe are predefined. Therefore, as long as the same platform is used, the measurement system itself can be kept relatively fixed.
In contrast, RNA-Seq goes through many steps before expression values are obtained: RNA extraction, library preparation, rRNA depletion or poly(A) selection, fragmentation, PCR, sequencing conditions, read mapping, and expression estimation. Changes in reagents, protocols, instruments, or analysis pipelines at any of these steps can affect the resulting expression values.
In research, flexibility can be valuable. However, in diagnostics, fixed measurement systems can also be valuable. RNA-Seq is a highly flexible and powerful method for research, but when used as a diagnostic tool operated under the same criteria for many years, it becomes important to fix, validate, and maintain the reproducibility of the entire measurement system.
The question is not which is correct, but which range can be trusted
The difference between RNA-Seq and microarrays is not simply a difference between “new technology” and “old technology.” Because the measurement principles differ, the genes each method handles well, the genes each method handles poorly, and the ranges that can be reliably compared are different.
Therefore, when RNA-Seq and microarray results differ, we should not immediately conclude that one is correct and the other is wrong. First, it is necessary to examine the original data distribution, read counts, signal intensity, gene length, mappability, probe design, annotation, and sample-to-sample variation.
In gene expression analysis, results should not be judged by technology name alone. It is important to look at the actual data and determine which range can be trusted for interpretation.
Accumulating old and new gene expression data as research assets
Even now that RNA-Seq is widely used, the value of previously generated microarray data has not disappeared. Microarrays have advantages such as long-term data accumulation, validated measurement systems, and ease of comparison with previous studies.
RNA-Seq, on the other hand, has advantages such as flexibility for reanalysis and independence from predefined probes. The important point is not to choose one technology and discard the other. What matters is having an environment where data can be accumulated, compared, and reanalyzed when needed, while understanding the principles and limitations of each measurement method.
By revisiting not only new data but also previously accumulated data, research continuity and reproducibility can be improved. Gene expression data are not simply results to be analyzed once and forgotten; they are research assets that can be re-examined from new perspectives.
Examine RNA-Seq and microarray data from the same perspective with Subio Platform
Subio Platform supports both RNA-Seq data and microarray data. Gene counts, normalized expression values, microarray signal data, and other expression datasets can be imported into the same environment for visualization, filtering, PCA, clustering, differential expression analysis, and enrichment analysis.
This makes it easier to reuse not only new RNA-Seq data, but also previously accumulated microarray data as research assets. Even when technologies change, it is important to have an environment for revisiting, comparing, and interpreting data.
Subio Platform is not based on the assumption that only one of RNA-Seq or microarrays should be used. It provides an environment for accumulating, visualizing, comparing, and understanding both old and new gene expression data. What matters is not leaving conclusions to tools or technology names, but seeing the data, verifying the results, and interpreting them.