RNA-Seq Normalization Presets|How to Handle Gene Counts, TPM, FPKM, and RPKM

When importing RNA-Seq data into Subio Platform, it is important to choose an appropriate normalization and preprocessing scenario according to the type of input data. In particular, Gene Counts, TPM, FPKM, and RPKM differ in what their values represent and how noise appears in the data, so the same preprocessing steps cannot simply be applied to all of them.

For differential expression analysis, Gene Counts should generally be used as the starting point. For checking data structure by PCA, clustering, heat maps, and similar methods, Gene Counts can be easier to interpret after appropriate normalization, log transformation, and filtering. For GO and pathway analysis, biological interpretation is then performed using the gene list obtained from differential expression analysis.

However, in public datasets or previously processed analysis results, Gene Counts may not be available, or obtaining them may require substantial effort. For this reason, this page also explains important points to keep in mind when using TPM, FPKM, or RPKM data.

Finding a proper sequence of normalization and pre-processing.

Subio Platform offers the following normalization scenarios. You can pick one as a template and adjust the options to make it fit the characteristics of the data.

  • Expression Microarray
  • RNA-Seq (Count)
  • RNA-Seq (FPKM, TPM, RPKM)
  • Methylation Beta Values
  • Pre-normalized Log2 Data - Select this if you have pre-normalized log2 ratio data.
  • Nothing - Select this to clear all blocks.

Novice users might feel unconfident about the settings, but we will support them through online training so that they can learn the whole analysis process for the data they want to analyze.

Characteristics of RNA-Seq Gene Counts Data

When you look at a table of RNA-Seq Gene Counts, you will notice that it contains many zeros. A value of 0 means that no reads were mapped to that gene in that sample, but it cannot simply be interpreted as meaning that the gene is not expressed. This is one of the difficult aspects of RNA-Seq data analysis.

The reason becomes clear when you look at experimental data with replicates. For example, a gene may have a nonzero value in one replicate sample, but a value of 0 in another replicate sample. This indicates that even if a gene is expressed, it may not be captured by the sequencer. Whether the count for an expressed gene happens to become 0 is probabilistic, and the lower the expression level, the higher the probability that the observed count will be 0. In actual datasets, genes for which some replicate samples have a count of 0 tend to be concentrated in the low-count region.

Let us consider a simplified example. Suppose you compare the average values between Group A and Group B, and a certain gene appears to be downregulated in Group B. However, when you look more closely, Group B contains several samples with a count of 0, while the measured counts in the remaining samples are not very different from those in Group A. In this case, if the zeros occurred simply because the gene was not captured, it may be more appropriate to consider that the actual expression level has not decreased as much as it appears. Handling zeros in RNA-Seq data is therefore very difficult. With this in mind, let us consider how the data should be processed.

Analyzing Gene Counts Data

When replicate samples are available and their counts are compared in a scatter plot, the points tend to converge along the diagonal in the high-count region, whereas they do not converge well in the low-count region. These correspond to the signal region and the noise region. The boundary between the signal and noise regions is not clear-cut. It also differs from dataset to dataset, but in many bulk RNA-Seq datasets with sufficient read depth, Gene Counts above roughly 100 often fall within the signal region, Gene Counts below roughly 10 often fall within the noise region, and the boundary between signal and noise is often found somewhere between 10 and 100. This boundary does not switch sharply, but instead forms a gradual transition. As described above, genes whose counts happen to become 0 are often found in the noise region. (Black points)

the dynamic range of RNA-Seq - fig1

For this reason, Subio Platform provides a preset normalization scenario called RNA-Seq (Count), which applies the following preprocessing and normalization steps to Count values.

  1. Log Transformation: base 2
  2. Global Normalization: at 90th percentile
  3. Low Signal Cutoff: Replace <30 to 30
  4. Fill Missing Values: 4
  5. Centering

In Step 1, Count 0 is replaced with a missing value.

In Step 2, Global Normalization is applied. This is a process for correcting overall scale differences among samples. Although its calculation method differs from CPM, which normalizes by total read count, and from TMM normalization, which accounts for compositional bias, they share the same general purpose of correcting overall scale differences among samples, so their effects are broadly similar.

However, in practice, normalization may remain incomplete even after dividing by one million. Therefore, even when such normalization has already been applied, it is often safer to apply Global Normalization again. The default value is 90, but this should be adjusted by looking at the data. In a histogram, set the value near the peak of the right-side signal-region distribution. Use the vertical line indicating the 75th percentile as a reference when choosing the value.

Step 3 is a process for aligning the lower limit of the values, because after normalization, the lower limit shifts according to differences in dynamic range. In actual RNA-Seq datasets, it is common for total read counts to differ several-fold among samples. When you look at the distribution of normalized values, shifts are often observed on the lower-limit side. This is one of the major differences from normalization of microarray data. If samples with a narrower dynamic range happen to be enriched in a particular experimental group, the lower limit in that group may be pushed upward, making it appear as if expression has increased. This needs to be prevented.

The key point when setting the lower limit is to set it near the lower end of the signal region in the sample group with the narrowest dynamic range. You may feel that this wastes information from samples with a wider dynamic range, but from the perspective of the reliability of the overall analysis, the reliable range needs to be aligned to the narrowest dynamic range. Otherwise, unreliable measurements from samples with a narrow dynamic range would be included in the analysis.

edgeR and DESeq2 are methods for statistically testing differential expression by modeling the mean-variance relationship in Gene Counts. However, aligning the shift in the lower limit of the reliable measurement range in normalized values caused by sample-to-sample differences in dynamic range is not the purpose of these methods.

Therefore, if many genes in the low-count region remain in the analysis, variation that should originally be treated as a measurement-limit issue or as a difference in dynamic range among samples may be detected as statistical differential expression. In actual RNA-Seq data, when samples with narrow and wide dynamic ranges are unevenly distributed across experimental groups, genes that appear significant because of dynamic-range differences may be mixed into the significant gene list. (See related case studies: No.426 and No.403.)

For this reason, even when using edgeR or DESeq2, it is important to check whether the significant gene list contains genes that are derived from shifts in the lower limit after normalization.

Step 4 is the process of replacing values that became missing because their original Count was 0 with another value. The preset uses 4 by default, but if you set Low Signal Cutoff to 30 in Step 3, it is better to change this value to around 4.5.

Note that the value for Fill Missing Values is specified as an exponent. A value of 4 means 2 to the 4th power, or 16. On the other hand, a value of 4.5 means 2 to the 4.5th power, or approximately 23. Compared with the lower limit of 30 set in Step 3, 4.5 is closer to the lower limit while still being slightly lower.

The value for Fill Missing Values should not be set too much lower than the lower limit specified in Step 3. The reason is that a Gene Count of 0 does not guarantee that the true expression level is sufficiently lower than the lower limit set in Step 3. If the value is uncertain, it is safer to assign a value that does not appear very different from the lower limit, rather than assigning an arbitrarily low value.

Step 5 converts each gene value into a ratio relative to the mean across all samples. If control samples are available, this step can be replaced with the Ratio to Control Samples block.

We have prepared an Excel file that reproduces the calculations described above. Please download it if you would like to examine the details.

Analyzing TPM, FPKM, and RPKM Data

Analyzing TPM, FPKM, or RPKM data is more difficult than analyzing Gene Counts. This is because the noise region and the signal region are mixed together, and replacing zeros with another value can make the false differential-expression problem described earlier more apparent. The unreliable variation derived from low Counts cannot be removed afterwards by normalization or correction. Therefore, when using TPM, FPKM, or RPKM, you need to choose one of the following analysis policies.

  • Exclude as many genes as possible that have variation derived from low Counts, while accepting that fewer genes will remain available for analysis.
  • Prioritize retaining as many genes as possible for analysis, while accepting that the results will contain noise derived from unreliable measurements.

the dynamic range of RNA-Seq - fig2

Now let us look at the Subio Platform preset normalization scenario RNA-Seq (TPM, FPKM, RPKM).

  1. Log Transformation: base 2
  2. Global Normalization: at 90th percentile
  3. Low Signal Cutoff: Replace <0.01 to 0.01
  4. Centering

Steps up to Step 2 are the same as in the RNA-Seq (Count) scenario, so please refer to the explanation above.

Step 3 sets a lower limit, but unlike Gene Counts, it cannot cancel the effect of the noise region. According to the two analysis policies described above, you need to choose either a high threshold or a low threshold.
For Gene Counts, unreliable measurements tend to be concentrated in the low-count region, so they can be handled to some extent. In contrast, for TPM, FPKM, and RPKM, unreliable measurements derived from low Counts are widely dispersed, so Fill Missing Values cannot be used to replace them appropriately. Therefore, values that become missing after log transformation in TPM, FPKM, or RPKM data are safer to leave as missing values rather than forcibly replacing them with another value. When performing differential expression analysis, you need to consider not only ordinary methods based on fold change and P-values, but also genes that have values in Group B but not in Group A, or vice versa. This adds extra work to the analysis, so be careful not to overlook these genes.

The explanation of Step 4 is also the same as Centering above. However, because measurements may be missing in control samples, it is better not to replace this step with the Ratio to Control Samples block.

However, when clustering or other multivariate analyses are performed in this state, missing values can become a problem. In such cases, it may be acceptable to add Fill Missing Values after Centering and replace missing values with a value such as -1, representing decreased expression. This means that you need to switch Fill Missing Values on or off depending on whether you are performing differential expression analysis or multivariate analysis, which also adds extra work.

Related Topics