Don’t trust TPM/FPKM/RPKM too much. They are not guaranteed to cancel the systematic errors in RNA-Seq data.

Case study of GSE159751

You might believe that TPM/FPKM/RPKM values are better than raw read counts for RNA-Seq data analysis. However, per-million-read normalization applies only a single linear scaling to each sample, which is too naive for highly complex omics datasets.
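To make the "per million" idea concrete, here is a minimal sketch of the standard FPKM and TPM formulas in plain Python. The function names, the toy counts, and the gene lengths are illustrative assumptions, not values from any dataset discussed here.

```python
def fpkm(counts, lengths_bp):
    """FPKM for one sample: counts per kilobase of gene per million mapped reads."""
    total_reads = sum(counts)
    # 1e9 = 1e3 (bp -> kb) * 1e6 (reads -> millions)
    return [c * 1e9 / (length * total_reads)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """TPM for one sample: length-normalized rates rescaled to sum to one million."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r * 1e6 / total_rate for r in rates]


# Hypothetical toy sample: three genes with read counts and lengths in bp.
toy_counts = [100, 200, 700]
toy_lengths = [1000, 2000, 1000]

toy_fpkm = fpkm(toy_counts, toy_lengths)
toy_tpm = tpm(toy_counts, toy_lengths)
```

Within one sample, every gene is divided by the same library-size factor. Such a correction can shift or stretch a distribution as a whole, but it cannot change its shape, so any bias that affects genes unevenly is left in place.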

Let’s look at an example: GSE159751. The authors provide FPKM values through the GEO database. Even after per-million-read normalization, the FPKM distributions often vary among samples. In addition, this dataset shows non-linear bias, visible in the distribution shapes, which are unimodal in some samples and bimodal in others. This bias may have been caused by RNA degradation. This movie shows the importance of visual aids for omics data analysis.
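As a rough numerical companion to visual inspection, the sketch below summarizes each sample by the quartiles of its log2-transformed values. The function name, the pseudocount of 1, and the toy values are assumptions made for illustration. Quartiles alone cannot reveal bimodality, so this is a quick first check and not a replacement for plotting the distributions.

```python
import math
import statistics


def log2_quartiles(values, pseudocount=1.0):
    """Return (Q1, median, Q3) of log2(value + pseudocount) for one sample."""
    logged = [math.log2(v + pseudocount) for v in values]
    q1, q2, q3 = statistics.quantiles(logged, n=4, method="inclusive")
    return q1, q2, q3


# Hypothetical FPKM values for two samples (not taken from GSE159751).
samples = {
    "sample_A": [0.0, 1.0, 3.0, 7.0, 15.0],
    "sample_B": [0.0, 0.0, 1.0, 31.0, 63.0],
}

summary = {name: log2_quartiles(vals) for name, vals in samples.items()}
for name, (q1, q2, q3) in summary.items():
    print(f"{name}: Q1={q1:.2f} median={q2:.2f} Q3={q3:.2f}")
```

If the quartiles differ markedly between samples after per-million normalization, that is a hint that the samples are not directly comparable yet.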

The other point is to examine whether “sophisticated” algorithms can remove non-linear biases. Here, we applied quantile normalization, which forces all samples to share an identical distribution. Although it made the distribution shapes match, it did not remove the non-linear bias.
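For readers unfamiliar with the method, here is a minimal sketch of textbook quantile normalization in plain Python. It is not necessarily the implementation used for the analysis above. The function name and toy data are assumptions, and ties are simply broken by input order, whereas production implementations usually average tied ranks.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of equal-length samples (one list per sample).

    Each value is replaced by the mean, across samples, of the values
    that share its rank.
    """
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples)
        for k in range(n_genes)
    ]
    normalized = []
    for s in samples:
        # Gene indices ordered from lowest to highest value in this sample.
        order = sorted(range(n_genes), key=lambda i: s[i])
        out = [0.0] * n_genes
        for rank, gene_idx in enumerate(order):
            out[gene_idx] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical toy data: two samples, three genes.
toy = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
toy_normalized = quantile_normalize(toy)
```

After this step every sample has exactly the same set of values, so the distributions look alike. However, each gene keeps its rank within its own sample. If a bias changed which genes rank high or low in a given sample, that misordering survives the normalization.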

Thanks to bioinformaticians, we have many algorithms for tackling highly complex tasks. But remember that wise experimental biologists must test for themselves whether these algorithms actually work. For now, a good experimental design and execution plan that yields high-quality raw data, together with tools to monitor what is happening in the data, is superior to relying on “sophisticated” algorithms.

If you're concerned about whether there might be issues with your data, feel free to contact Subio.
