Read Depth and Dynamic Range in RNA-Seq

The range of expression levels that can be reliably handled in RNA-Seq depends greatly on read depth. Here, dynamic range does not simply mean the overall spread of measured values. Rather, it refers to the signal range that can be interpreted with confidence, after excluding low-count regions where noise has a large influence.

When read depth is low, most of the available reads are taken up by a relatively small number of highly expressed genes, leaving few reads for lowly expressed genes. Measuring lowly expressed genes stably therefore requires more reads.

In other words, in RNA-Seq, the more you want to include lowly expressed genes in the analysis, the more reads you need. Increasing the read depth brings more genes into the reliable signal range, but extending the dynamic range toward the low-expression side requires a rapidly increasing number of reads.
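The scaling can be made concrete with simple arithmetic. The sketch below assumes, as a simplification, that a gene receives a fixed fraction of all reads, so its expected count is the read depth multiplied by that fraction. The function name `required_depth`, the target of 10 reads, and the example fractions are illustrative assumptions, not values taken from any dataset.

```python
def required_depth(read_fraction: float, min_count: float = 10.0) -> float:
    """Reads needed for a gene's expected count to reach min_count.

    Assumes the gene receives a fixed fraction of all reads (a simplification).
    """
    if not 0.0 < read_fraction <= 1.0:
        raise ValueError("read_fraction must be in (0, 1]")
    return min_count / read_fraction


# Hypothetical fractions of the library taken by progressively rarer genes.
for read_fraction in (1e-4, 1e-5, 1e-6, 1e-7):
    depth = required_depth(read_fraction)
    print(f"fraction {read_fraction:.0e}: about {depth:,.0f} reads")
```

Under this assumption, each ten-fold step toward lower expression needs ten times as many reads to reach the same expected count.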

Note: In this article, read depth is described as single-end read equivalents. For paired-end RNA-Seq data, one read pair should be counted as one fragment. Therefore, if the total read count is reported by counting read 1 and read 2 separately, use approximately half of that number when comparing it with the read depths discussed in this article.
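The conversion in this note can be sketched as a small helper. The function name and its parameters are hypothetical, and the 40 million figure is only an example.

```python
def single_end_equivalents(reported_reads: int, paired_end: bool,
                           mates_counted_separately: bool) -> int:
    """Convert a reported read total to single-end read equivalents."""
    if paired_end and mates_counted_separately:
        # Read 1 and read 2 of a pair come from one fragment, so halve the total.
        return reported_reads // 2
    return reported_reads


# Hypothetical report: 40 million reads, with read 1 and read 2 counted separately.
print(single_end_equivalents(40_000_000, paired_end=True,
                             mates_counted_separately=True))
```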

Fig1: RNA-Seq Read Depth and Dynamic Range

For example, in an RNA-Seq dataset with around 10 million reads, roughly the top 10,000 genes by expression level can be expected to form a relatively stable analysis target, depending on the filtering criteria and the nature of the samples. However, if you try to extend the analysis further into the low-expression region, the required read depth increases substantially.

This point is easy to overlook if you only look at values after conversion to TPM, FPKM, or RPKM. With these units, the range of values may appear wider because of normalization by gene length. However, this does not mean that the measurement reliability of the low-count region has improved. For details, see "Why TPM, FPKM, and RPKM Should Not Be Used for RNA-Seq Differential Expression Analysis".
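A toy calculation illustrates the effect. The sketch below uses the standard TPM definition (counts divided by gene length, rescaled to sum to one million). The counts and gene lengths are made up for illustration.

```python
def tpm(counts, lengths_bp):
    """Compute TPM from raw counts and gene lengths in base pairs."""
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [rate / total_rate * 1_000_000 for rate in rates]


# Hypothetical toy library: three genes with identical low counts but
# different lengths, plus one highly expressed gene.
counts = [3, 3, 3, 9_991]
lengths_bp = [500, 2_000, 8_000, 2_000]

for count, length, value in zip(counts, lengths_bp, tpm(counts, lengths_bp)):
    print(f"count={count:>5}  length={length:>5} bp  TPM={value:,.1f}")
```

The three low-count genes end up with TPM values that differ 16-fold, although each rests on the same three reads. The spread comes from the length normalization, not from additional evidence.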

When evaluating the dynamic range of RNA-Seq data, it is important to check the raw count values and read depth, rather than relying only on TPM, FPKM, or RPKM.

Pay attention to the amount of input RNA

Like read depth, the amount of input RNA has a major effect on the range of expression levels that can be reliably handled in RNA-Seq. In ultra-low-input RNA-Seq and single-cell RNA-Seq, the number of molecules obtained from each sample or cell is limited, so these methods are subject to constraints that differ from those of bulk RNA-Seq. Although technical improvements continue to be made, extra care is needed when trying to quantify lowly expressed genes stably. In particular, in the low-count region, a zero count must be interpreted carefully: the gene may truly not be expressed, or it may simply have gone undetected by chance.
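How often an expressed gene goes undetected by chance can be estimated with a simple model. The sketch below assumes that read counts follow a Poisson distribution, which is a simplification: real RNA-Seq data usually show additional variability, and low-input protocols can lose molecules before sequencing, so these probabilities are better read as lower bounds. The function name and the example expected counts are illustrative.

```python
import math


def prob_zero_count(expected_count: float) -> float:
    """Probability of observing zero reads under a Poisson model."""
    if expected_count < 0:
        raise ValueError("expected_count must be non-negative")
    return math.exp(-expected_count)


# Hypothetical expected counts for lowly expressed genes.
for expected_count in (0.5, 1, 3, 10):
    p_zero = prob_zero_count(expected_count)
    print(f"expected count {expected_count}: P(zero reads) = {p_zero:.4f}")
```

Under this model, a gene with an expected count of 1 is missed in roughly a third of samples, so a zero in that range is weak evidence of absence.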

Pay attention to differences in read depth among samples

Even within the same RNA-Seq dataset, the total number of reads can differ among samples. When looking at multiple datasets, it is not unusual to see about a two-fold difference between samples with the highest and lowest total read counts.

In data analysis, filtering and interpretation should be based on the sample with the lowest read depth. Therefore, when designing an experiment, it is safer not to rely only on the theoretical values listed in the sequencer specifications. It is better to allow for the possibility that the effective number of reads available for analysis may be about half of the theoretical value.
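One way to tie a filter to the shallowest sample is to express a minimum count as counts per million (CPM) in the smallest library. The sketch below is only one possible approach. The function names, the minimum count of 10, and the rule that a gene must pass in every sample are assumptions made for illustration.

```python
def cpm_cutoff(library_sizes, min_count=10):
    """CPM value that corresponds to min_count reads in the smallest library."""
    smallest = min(library_sizes)
    return min_count * 1_000_000 / smallest


def passes_filter(gene_counts, library_sizes, min_count=10):
    """True if the gene reaches the cutoff CPM in every sample (assumed rule)."""
    cutoff = cpm_cutoff(library_sizes, min_count)
    return all(count * 1_000_000 / size >= cutoff
               for count, size in zip(gene_counts, library_sizes))


# Hypothetical two-sample dataset with a two-fold difference in depth.
library_sizes = [20_000_000, 10_000_000]
print(cpm_cutoff(library_sizes))
print(passes_filter([20, 10], library_sizes))
print(passes_filter([30, 5], library_sizes))
```

Because the cutoff is derived from the smallest library, a gene that passes has at least the minimum count in every sample, including the shallowest one.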

Fig2: RNA-Seq Read Depth and Dynamic Range Summary

3' RNA-Seq is still constrained by read depth

3' RNA-Seq sequences mainly the region near the 3' end of transcripts, rather than the entire transcript. It is less affected by gene length and makes it easier to process many samples efficiently.

However, using 3' RNA-Seq does not mean that lowly expressed genes can be measured stably with only a small number of reads. To interpret lowly expressed genes with confidence, sufficient molecular input and read depth are still required.

Summary

In RNA-Seq, if you only look at normalized values or analysis results, it is easy to overlook whether a sufficient number of reads was obtained in the first place. If the read depth is insufficient, interpretation of the low-expression region becomes unstable.

When considering the dynamic range of RNA-Seq data, first check the read depth, and then judge the data by actually examining the count distribution and reproducibility among samples.
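A first look at depth and count distribution can be as simple as the sketch below. It takes raw counts per sample and reports the library size, the number of genes at or above a minimum count, and the share of reads taken by the most highly expressed genes. The sample names, counts, and thresholds are made up for illustration.

```python
def summarize_sample(counts, min_count=10, top_n=2):
    """Summarize one sample's raw counts."""
    library_size = sum(counts)
    genes_at_or_above = sum(1 for count in counts if count >= min_count)
    top_reads = sum(sorted(counts, reverse=True)[:top_n])
    top_share = top_reads / library_size if library_size else 0.0
    return {
        "library_size": library_size,
        "genes_at_or_above_min_count": genes_at_or_above,
        "top_share": top_share,
    }


# Hypothetical raw counts for the same seven genes in two samples.
samples = {
    "sample_A": [5_000, 3_000, 120, 40, 12, 3, 0],
    "sample_B": [2_400, 1_600, 55, 18, 4, 1, 0],
}

for name, counts in samples.items():
    summary = summarize_sample(counts)
    print(name, summary["library_size"],
          summary["genes_at_or_above_min_count"],
          f"{summary['top_share']:.2f}")
```

In this toy data, the shallower sample has fewer genes above the minimum count, which is the kind of difference worth checking before interpreting normalized values.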