RNA-Seq data analysis requires more care than microarray data analysis, so we recommend our Data Analysis Service even more strongly for RNA-Seq than for microarrays. Please read below for details.
If you open a table of RNA-Seq counts or FPKM values in Excel, you will find many "0s" (or blanks), meaning that no reads were obtained for those genes. However, you cannot simply regard those genes as "not expressed."
The reason becomes clear if you have replicated samples: you will easily find genes with some counts in some samples and "0s" in others. So a "0" can mean that the transcript was "not captured" by the sequencer just by chance. This phenomenon is specific to digital (count-based) methods; analog methods like microarrays do not show it. It is random, but it is more likely to happen for lowly expressed genes.
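One way to see why chance zeros concentrate in lowly expressed genes is a simple Poisson sampling model. The sketch below is only an illustration under that assumption (real RNA-Seq counts are usually more variable than a pure Poisson model), and the function names are hypothetical, not part of Subio Platform.

```python
import math

def prob_zero_count(expected_reads: float) -> float:
    """Probability of observing 0 reads for a gene whose expected read
    count is `expected_reads`, ASSUMING Poisson sampling."""
    return math.exp(-expected_reads)

def prob_any_zero(expected_reads: float, n_samples: int) -> float:
    """Probability that at least one of `n_samples` replicates shows 0 reads."""
    p0 = prob_zero_count(expected_reads)
    return 1.0 - (1.0 - p0) ** n_samples

if __name__ == "__main__":
    # The lower the expected count, the more often a "0" appears by chance.
    for expected in (0.5, 1, 2, 5, 10):
        print(
            f"expected reads = {expected:>4}: "
            f"P(0 in one sample) = {prob_zero_count(expected):.4f}, "
            f"P(at least one 0 in 3 replicates) = {prob_any_zero(expected, 3):.4f}"
        )
```

Under this model, a gene expected to yield only about one read per sample gives a "0" in roughly a third of samples, while a gene expected to yield ten reads almost never does.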
Suppose you are comparing expression levels between two groups. A gene of interest shows lower expression in group B than in group A, but some samples in group B have "0" counts. In such cases, those "0s" can cause false detection of differential expression. You must calculate averages and variances after removing the effect of the "0s"; otherwise you will get many false positives in the result. For the same reason, be careful before concluding something like "RNA-Seq detects more differentially expressed genes than microarrays."
So we propose the following pre-processing and normalization steps to handle RNA-Seq data properly for differential expression analysis with Subio Platform. You can load these blocks by selecting the "RNA-Seq (optimized for GDC counts)" scenario.
- Log Transformation
- Global Normalization
- Low Signal Cutoff
- Centering, or Ratio to Control Samples
Placing the Log Transformation block first removes all "0s" (the logarithm of 0 is undefined, so those values become null), which excludes their effect from the subsequent comparisons.
Some data files contain no "0s" because a constant such as one was added to every value to avoid errors in log transformation. If you have such data files, apply the Transform Signals block to subtract that constant (for example, one) before Log Transformation.
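The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not Subio Platform's implementation: the function names are hypothetical, and the specific choices (log base 2, per-sample median for Global Normalization, raising low values to the threshold for Low Signal Cutoff, per-gene mean for Centering) are assumptions made for the example. The matrix is a list of genes (rows) by samples (columns).

```python
import math
from statistics import mean, median

def log_transform(matrix):
    """log2 of each count; a 0 has no logarithm, so it becomes None (null)."""
    return [[math.log2(v) if v > 0 else None for v in row] for row in matrix]

def global_normalize(matrix):
    """Subtract each sample's median log value (ASSUMED statistic),
    ignoring nulls, so that samples become comparable."""
    shifts = []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix if row[j] is not None]
        shifts.append(median(column) if column else 0.0)
    return [[None if v is None else v - shifts[j] for j, v in enumerate(row)]
            for row in matrix]

def low_signal_cutoff(matrix, threshold):
    """Raise non-null values below `threshold` to the threshold
    (ASSUMED behaviour; nulls stay null)."""
    return [[None if v is None else max(v, threshold) for v in row]
            for row in matrix]

def center(matrix):
    """Subtract each gene's mean over its non-null samples."""
    centered = []
    for row in matrix:
        present = [v for v in row if v is not None]
        gene_mean = mean(present) if present else None
        centered.append([None if v is None else v - gene_mean for v in row])
    return centered

# Toy counts: 3 genes x 3 samples; the first gene has a chance "0".
counts = [
    [8, 16, 0],
    [4, 4, 4],
    [2, 8, 32],
]
logged = log_transform(counts)          # first row: [3.0, 4.0, None]
normalized = global_normalize(logged)
cut = low_signal_cutoff(normalized, threshold=-1.0)
processed = center(cut)                 # the "0" stays None throughout

if __name__ == "__main__":
    for row in processed:
        print(row)
```

Because the "0" is turned into a null at the very first step, it never enters the per-sample median or the per-gene mean, which is the point of putting Log Transformation first.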
About a comprehensive list of differentially expressed genes.
Processed signals generated by the scenario above can be null if the raw signal of the numerator or the denominator is 0, so those genes are excluded from subsequent differential expression analysis based on fold change or p-value. However, if all signals in group A are 0 while the signals in group B are positive and reliable, such genes are interesting in the biological context, so you have to extract them with the Filter tool. First, extract the genes whose Ch1RawSignals are 0 in 100% of the samples in group A. Second, extract the genes whose Ch1RawSignals are larger than, for example, 10 in 50% of the samples in group B. Then use the Venn Diagram tool to extract the intersection of the two lists.
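The logic of the two filters and their intersection can be sketched outside the platform like this. It is only an illustration: the function and gene names are hypothetical, and "50% of samples" is read here as "at least 50%," which is an assumption.

```python
def on_off_genes(raw_counts, group_a, group_b, min_count=10, min_fraction=0.5):
    """Return genes that are 0 in every group A sample AND larger than
    `min_count` in at least `min_fraction` of group B samples.

    raw_counts: {gene: {sample: raw count}}
    """
    # Filter 1: raw signal is 0 in 100% of group A samples.
    zero_in_a = {
        gene for gene, counts in raw_counts.items()
        if all(counts[s] == 0 for s in group_a)
    }
    # Filter 2: raw signal > min_count in >= min_fraction of group B samples.
    reliable_in_b = {
        gene for gene, counts in raw_counts.items()
        if sum(counts[s] > min_count for s in group_b) / len(group_b) >= min_fraction
    }
    # Venn diagram intersection of the two gene lists.
    return zero_in_a & reliable_in_b

# Toy data with hypothetical gene and sample names.
raw_counts = {
    "gene1": {"A1": 0, "A2": 0, "B1": 25, "B2": 40, "B3": 3},   # off in A, on in B
    "gene2": {"A1": 0, "A2": 2, "B1": 30, "B2": 50, "B3": 45},  # not all 0 in A
    "gene3": {"A1": 0, "A2": 0, "B1": 12, "B2": 4, "B3": 0},    # too weak in B
}
result = on_off_genes(raw_counts, group_a=["A1", "A2"], group_b=["B1", "B2", "B3"])

if __name__ == "__main__":
    print(sorted(result))
```

Only gene1 passes both filters here. Raising `min_count` or `min_fraction` makes the "positive and reliable" requirement in group B stricter.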
An RNA-Seq data analysis procedure
Please take our Online Training for full instruction in the data analysis.
For clustering or PCA, it is better to fill the missing values with numbers to avoid having many nulls. So, after extracting differentially expressed genes, add the "Fill Missing Values" block and recalculate the Processed Signals, and then run the clustering or PCA. I recommend the following usages of the block.
If the last block is "Centering"
Put the "Fill Missing Values" block last. Replace blanks with 0 (if you assume nothing can be said about their expression levels) or -1 (if you think they are likely to be expressed below the average).
If the last block is "Ratio to Control Samples"
Put the "Fill Missing Values" block before the "Ratio to Control Samples" block. Replace blanks with the "Low Signal Cutoff" threshold or a value slightly smaller than that.
See also "How to Use Fill Missing Values Block."
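The two placements of "Fill Missing Values" described above can be illustrated with a small sketch. The function names are hypothetical, values are assumed to be on a log scale, and "Ratio to Control Samples" is modelled as subtracting the mean log value of the control samples, which is an assumption about that block.

```python
from statistics import mean

def fill_missing(matrix, fill_value):
    """Replace every null (None) with `fill_value`."""
    return [[fill_value if v is None else v for v in row] for row in matrix]

def ratio_to_control(matrix, control_cols):
    """Log ratio to control: subtract the mean of the control samples'
    log values per gene (ASSUMED definition). A gene whose controls are
    all null cannot be computed and stays null."""
    result = []
    for row in matrix:
        controls = [row[j] for j in control_cols if row[j] is not None]
        if not controls:
            result.append([None] * len(row))
            continue
        base = mean(controls)
        result.append([None if v is None else v - base for v in row])
    return result

# Case 1: last block is "Centering" -> fill at the very end with 0 (or -1).
centered = [[0.5, None, -0.5]]
case1 = fill_missing(centered, 0.0)

# Case 2: last block is "Ratio to Control Samples" -> fill BEFORE the ratio
# with the Low Signal Cutoff threshold, so the ratio can be computed.
logged = [[None, None, 6.0, 7.0]]   # controls (columns 0 and 1) are null
cutoff = 3.0                        # hypothetical threshold on the log scale
without_fill = ratio_to_control(logged, control_cols=[0, 1])
case2 = ratio_to_control(fill_missing(logged, cutoff), control_cols=[0, 1])

if __name__ == "__main__":
    print(case1)
    print(without_fill)
    print(case2)
```

In the second case, filling after the ratio would be too late: the gene's ratios are already null because the control values were missing when the ratio was taken.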
Additionally, RNA-Seq data are less stable than microarray data. You may see non-linear biases even among samples in the same batch. We recommend that you do not blindly accept the data as sound, but consult experienced analysts for advice on their quality.
Now you understand that RNA-Seq data analysis is more complicated than microarray data analysis, and why we recommend ordering our Data Analysis Service instead of doing it yourself.