RNA-Seq expression levels are reported in two kinds of units. The first is the read count (Count); the second is a normalized value such as TPM, FPKM, or RPKM.
Many commentaries on RNA-Seq data analysis recommend TPM, FPKM, or RPKM over Counts, so the normalized values are what we usually see in papers and conference presentations.
However, you need to understand the difference and choose the right one to analyze the RNA-Seq data properly.
Fig9 plots the Counts from a pair of replicates measured on a HiSeq2000. Where the count is higher than 20, the dots converge on the diagonal, suggesting high reproducibility. Where the count is lower than 20, the dots scatter widely, indicating that the measurements are unreliable. This appears to be the noise range of this data (black dots). The dynamic range is not the entire range over which the values are distributed, but the signal range (colored dots), which excludes the noise range.
Here we use a threshold of 20 for convenience, but in reality the boundary between the noise and signal ranges is not sharp; it is a gradient. In this data, for counts between 10 and 100, the lower the value, the less reliable it is.
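As a minimal sketch of the split described above, the following Python function separates genes into a signal set and a noise set using the threshold of 20. The function name, the dictionary layout, the example counts, and the rule that a gene must reach the threshold in both replicates are assumptions made for illustration; they are not taken from the analysis behind Fig9.

```python
def split_signal_noise(rep1, rep2, threshold=20):
    """Split genes into (signal, noise) lists by a read-count threshold.

    rep1, rep2: dicts mapping gene name -> read count for two replicates.
    Assumption: a gene counts as "signal" only if BOTH replicates reach
    the threshold; everything else is treated as noise.
    """
    signal, noise = [], []
    for gene in sorted(rep1):
        if rep1[gene] >= threshold and rep2.get(gene, 0) >= threshold:
            signal.append(gene)
        else:
            noise.append(gene)
    return signal, noise


# Hypothetical counts, for illustration only.
rep1 = {"geneA": 3, "geneB": 150, "geneC": 25, "geneD": 18}
rep2 = {"geneA": 11, "geneB": 162, "geneC": 22, "geneD": 31}

signal, noise = split_signal_noise(rep1, rep2)
print(signal)  # ['geneB', 'geneC']
print(noise)   # ['geneA', 'geneD']
```

Because the real boundary is a gradient, a hard cutoff like this is only a convenience; trying several thresholds on your own data is a reasonable check.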
Fig10 shows the same data as Fig9, but with the values expressed as TPM. The black dots are the same noise-range genes as in Fig9. You can see that the noise now spreads along the diagonal and is indistinguishable from the genes in the signal range. However, it would be a mistake to interpret this indistinguishability as the absence of noise, because no measurement system is free of noise.
The read count is expected to reflect the size of the gene: larger genes yield more Counts and smaller genes yield fewer. It is therefore often said that Counts should be normalized per 1 kb of exon length. This claim may seem plausible at first glance, but a closer look reveals that it is not.
Notice that normalization by length moves the dots of Fig9 along the diagonal. Genes longer than 1 kb move toward the lower left, and shorter genes move toward the upper right. As a result, this treatment of the data makes the noise invisible. What's more, since the average exon length is much longer than 1 kb, many genes move toward the lower left, and the range of values becomes much broader. Many papers have claimed that TPM, FPKM, and RPKM have a wide dynamic range or are sensitive to lowly expressed genes, but this is a false conclusion.
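To make the diagonal shift concrete, here is a minimal sketch of length normalization using the standard definition of TPM (per-kb rates rescaled to sum to one million). The gene names, counts, and exon lengths are hypothetical, chosen so that a short gene in the noise range ends up with a larger normalized value than a long gene in the signal range.

```python
def per_kb(count, exon_length_bp):
    """Reads per kilobase of exon length."""
    return count / (exon_length_bp / 1000)


def tpm(counts, lengths_bp):
    """TPM: per-kb rates rescaled so that they sum to one million."""
    rates = {g: per_kb(counts[g], lengths_bp[g]) for g in counts}
    total = sum(rates.values())
    return {g: rate / total * 1_000_000 for g, rate in rates.items()}


# Hypothetical genes: a short gene with a noise-range count (5 reads)
# and a long gene with a signal-range count (50 reads).
counts = {"short_noisy": 5, "long_reliable": 50}
lengths_bp = {"short_noisy": 250, "long_reliable": 5000}

print(per_kb(5, 250))    # 20.0
print(per_kb(50, 5000))  # 10.0

values = tpm(counts, lengths_bp)
print(values["short_noisy"] > values["long_reliable"])  # True
```

Both replicates of a gene are divided by the same length, so each dot slides along the diagonal without changing its scatter. The unreliable 5-read gene now sits above the 50-read gene, which is how noise ends up mixed into the apparent signal range.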
By the way, is the claim that "data not normalized per 1 kb is unsuitable for expression analysis" true? If your question is which gene is more (or less) expressed within the same sample, you must normalize by length. However, if you are investigating changes in expression level between samples, length normalization is fundamentally unnecessary, because the same gene has the same length in every sample.
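A small sketch of why the length drops out of a between-sample comparison: when the same gene is divided by the same exon length in both samples, the ratio is unchanged. The counts and the length are hypothetical, and the sketch assumes the same exon length applies in both samples and ignores any per-sample scaling.

```python
import math


def fold_change(value_a, value_b):
    """Ratio of an expression value in sample A to that in sample B."""
    return value_a / value_b


# Hypothetical counts of one gene in two samples, and its exon length.
count_a, count_b = 40, 10
length_kb = 2.5

raw = fold_change(count_a, count_b)
normalized = fold_change(count_a / length_kb, count_b / length_kb)

print(raw)                            # 4.0
print(math.isclose(raw, normalized))  # True
```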
The calculation of TPM, FPKM, and RPKM includes a global normalization to 1 million reads. Behind this lies the assumption that the total amount of transcripts is constant across samples. However, this assumption does not always hold in biological experiments, especially in the cases below.
Forcing per-million normalization on such experiments only does harm. You are better off looking for an appropriate normalization method by inspecting the actual data and working through trial and error.
From the considerations above, we draw two conclusions.
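The following sketch illustrates, with made-up numbers, what can go wrong when the total amount of transcripts differs between samples. It assumes that reads are sampled in proportion to transcript abundance; the gene names and values are hypothetical. Two genes are unchanged and one gene increases tenfold, yet after per-million scaling the unchanged genes appear to drop.

```python
def per_million(counts):
    """Scale counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {g: c / total * 1_000_000 for g, c in counts.items()}


# Hypothetical samples: gene1 and gene2 are unchanged, gene3 rises tenfold,
# so the total amount of transcripts is NOT constant across samples.
control = {"gene1": 100, "gene2": 100, "gene3": 100}
treated = {"gene1": 100, "gene2": 100, "gene3": 1000}

ratio = per_million(treated)["gene1"] / per_million(control)["gene1"]
print(round(ratio, 2))  # 0.25
```

In this toy case gene1 looks fourfold down-regulated although its amount did not change, which is the kind of artifact that forced per-million normalization can introduce.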
The dynamic range of RNA-Seq depends, in principle, on the number of reads. If the number of reads is small, most of them will be taken up by a small number of superabundant genes, so you would expect to need a massive number of reads to measure the expression level of a lowly expressed gene. Let's look at how many reads you need for a desired dynamic range, using data from a public database.
Fig11 shows scatterplots of the count values from pairs of replicate samples with different numbers of reads, taken from the GEO database. As expected, you can see the dynamic range expand as the number of reads increases. Fig12 gives a more detailed summary.
The wider the dynamic range, the more genes will come into the signal range. The number of genes shown in the table is a rough guide to the number of genes in the signal range since the number varies depending on the filtering criterion.
If the number of reads is about 10 million, you will be able to analyze approximately the 10,000 most highly expressed genes. However, if you try to extend the dynamic range further, you will find that the number of reads required increases dramatically. In terms of dynamic range, 20-50 million reads are comparable to the Affymetrix 3' IVT GeneChip. To rival Agilent's microarrays, we would expect you to need about 100 million reads (200 million if paired-end).
In addition to the number of reads, the amount of input RNA has a significant impact on the dynamic range. The dynamic range of ultra-low-input RNA-Seq and single-cell RNA-Seq data is exceptionally narrow.
Also, if you look at many datasets, you will find that in most of them there is about a two-fold difference between the maximum and minimum numbers of reads per sample. By its nature, data analysis must be matched to the sample with the narrowest dynamic range. So it is dangerous to rely on the theoretical maximum number of reads in the sequencer's catalog when estimating the required number of reads. It is more realistic to plan an experiment on the assumption that some samples will yield only half the maximum number of reads.
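As a rough planning aid, the rule of thumb above can be written as a small helper. The function name, its parameters, and the example run size are hypothetical and are not tied to any particular sequencer; the default ratio of 2 reflects the two-fold spread mentioned above, and dividing the catalog output evenly among samples is our simplifying assumption.

```python
def planned_reads_per_sample(catalog_reads_per_run, n_samples, max_min_ratio=2.0):
    """Conservative per-sample read estimate for experiment planning.

    catalog_reads_per_run: theoretical maximum reads per run (catalog value).
    n_samples: number of samples multiplexed in the run.
    max_min_ratio: assumed spread between best and worst sample (about 2).
    """
    nominal = catalog_reads_per_run / n_samples
    return nominal / max_min_ratio


# Hypothetical run: 400 million reads shared by 8 samples.
print(planned_reads_per_sample(400_000_000, 8))  # 25000000.0
```

With these made-up numbers, the nominal 50 million reads per sample should be planned as roughly 25 million, and the expected dynamic range judged against that lower figure.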