A PRACTICAL Tutorial of RNA-Seq Data Analysis

This tutorial illustrates from importing gene expression data (text files) to interpreting in the biological context. If you have FASTQ files to analyze, please start from another tutorial for the pipeline of computing gene expression levels from FASTQ files.

We use Subio Platform as the data analysis software, which is much less prevalent than R/Bioconductor, but you can use it for academic outputs. It's a shame that beginners often focus too much on learning commands rather than concepts of data analysis. We offer this software because you can visually confirm how an operation affects data step by step. And this is vital if you need to understand the concepts of data analysis neatly. Even if you use R, understanding the concepts of commands is beneficial. So, why won't you start with Subio Platform?

1. Import The Gene Level Expression Data

Let’s import a data set of GSE49110 and analyze, which is composed of eight RNA-Seq samples. You can download a text file of counts data of the eight samples. Or you can download the SSA file of this data set. Here, though I use a table of Counts, you can import FPKM or TPM data in the same way.

You open and edit the file with Excel, but notice that there is a tip in the opening. Delete unnecessary columns and fill blanks in the header row, to make a right table with a column of gene names and eight columns of counts representing eight samples.

Now you have a file to import into Subio Platform. Start the Import Samples Wizard. Use the “Multiple Samples in One File” option in the first step, and the “Create A New Platform” option in the second step.

After importing samples, you would better to add sample information described in the SOFT formatted family file. Now you can see more about them. It is also useful to filter samples by keywords.

Let’s create a Series of the eight samples to visualize and analyze them.

RNA-Seq Data Analysis Tutorial (01) - Import RNA-Seq Counts Data

2. Create and Setup A Series

When the series is loaded, the Series Panel at the left side organizes objects like Measurement Lists and DataSets. The upper panel of “Analysis Browser” draws a scatter plot chart by default. And the “Setup Series” tab is open in the lower panel.

In the Setup Series tab, you click buttons from left to right. The first button is the Edit Parameters. Open the Edit Parameter Window. You import sample information to do it quickly in general. Here, you have four conditions (control and three siRNA treatments), and each has duplicates.

Switch to the Setup DataSet tab to edit and create DataSets. You organize sample groups by prioritizing and marking parameters. In this tutorial, you establish two DataSets. Additionally, it would be good to make notes and attach related files in Sample Info. Tab.

RNA-Seq Data Analysis Tutorial (02) - Create and Setup A Series

3. Editing Normalization & Pre-processing

The normalization and pre-processing is a critical part because it determines the following analysis result. You’d better keep in mind that the real data can be something very different from a textbook assumes. So you have to look at the data in the right way, understand the characteristics, and choose the proper methods according to the traits. Otherwise, you will lead to inadequate conclusions. It would help if you always had supreme attention in this step.

You can start from a Normalization Scenario named “RNA-Seq (Counts),” and adjust options to make it fit to the data. Please see what you have to know, how to operate for it, and how to make decisions according to the data. We repeatedly say that you cannot apply the same way to any data sets. You always have to modify according to the data in front of you. If you are not sure, please order a Data Analysis Service.

Please read also;

RNA-Seq Data Analysis Tutorial (03) - Normalization and Pre-processing

4. Filtering (Quality Control)

As you saw in the previous part, you can’t trust all the measurement values. You have to distinguish trustful measurements from noise which you have to take care so that they don’t disturb the analysis. The Filter tool in the Basic Plug-ins is useful for the task of quality control. If you don’t have a valid license, please request a 5-days free trial.

Remember that one of the basics of quality control for omics is to exclude genes that are not eligible for the analysis, although many beginners try to extract genes with only trustful measurements. The latter way must miss genes like which are not expressed in the control but sharply elevated by a treatment. Please confirm the difference by yourself by comparing the results of various filtering.

And another basic is you do it at two layers. Firstly, you exclude genes with too-low signals at all groups. Secondly, you filter out genes if the experimental factors don’t alter their expression levels.

By the way, If you learn analyzing with DESeq2 in R by yourself, maybe you see the following code for filtering lowly-expresse genes.

dds <- dds[rowSums(counts(dds)) >= 10,]

However, how do you know if the criterion is adequate? From my experience, more is needed, and this code depends on the number of samples. For example, the cutoff should be ten times different when you handle six and 60. So, how do you know the right criterion? The answer is you have to find from watching data. This is why we illustrate how to examine the data to determine the proper way in the movies.

RNA-Seq Data Analysis Tutorial (04) - Filtering (Quality Control)

5. PCA & Clustering

Former sections are preparations, and here we start using analytical tools to extract biological meanings. Please be noticed that you shouldn’t use all genes, but quality controlled genes made with the filter tool.

We recommend you firstly overlook the data to grab the whole landscape, and to point out what you should do. Principal Component Analysis (PCA) is an ideal tool for this purpose. Please remember the three points to interpret the PCA results. The first is the distance between dots. Closely located dots indicate the expression profiles of them are similar. The second is the direction from the origin. The sharp angle shows their expression profiles resemble, while the farther dots indicate the higher amplitudes. If dots locate beyond the origin, their expression profiles are opposite. And the last is principal components (pc) and their contribution rates. You can suppose the horizontal and vertical axes represent different groups of genes. And the contribution rate reflects the size of the groups. So a pc with the highest contribution rate represents the majority. It is useful to understand the trend, but it is nothing to do with biological significance. Only a few genes often determine the fate of the cells in many biological phenomena. In other words, principal components with a tiny contribution rate may indicate the behavior of the significant genes.

With the basics above, let’s take a look at the PCA result of this case. The origin locates at near the left-upper corner, and you find the control samples around the origin. The duplicates in each group closely locates together and it indicates the experiment is highly reproducible. There seems to be common effect by all siRNAs (right-down-ward) and independent effects (right-ward and down-ward.)

Next, we apply hierarchical clustering. If you overlook the heatmap, you'll notice that many genes are commonly colored red. Contrarily, blue-colors are siRNA-oriented. In other words, down-regulated genes seem to be different by siRNAs. What’s important for analysts is to find such differences or characteristics, not to make lists or figures because they can be a clue for speculation for the underlying mechanism.

The results from PCA and clustering provide different viewpoints of the same intrinsic states. Switch the view to understand the data in depth.

RNA-Seq Data Analysis Tutorial (05) - PCA and Clustering

6. Extracting Differentially Expressed Genes (DEG)

In this case study, we saw that the up-regulate genes are largely common, while down-regulated genes are unique to the siRNAs. So, let's extract DEGs and confirm the result with the Venn Diatfam tool.

RNA-Seq Data Analysis Tutorial (06) - Extracting Differentially Expressed Genes (DEG)

7. Gene Annotation and Enrichment Analysis

The RNA-Seq data often contains only one ID column like Gene Symbol, ENST/ENSG ID of EMBL, Entrez Gene ID of NCBI. So you usually have to fill gene annotations by yourself as you get tables from database sites. Here we show how to use the NCBI FTP site. If your data has ENSG or ENST IDs, you have to get the table from Ensembl BioMart. If not, please get it from the database site managing the ID and annotation. Once you have the gene annotation in the platform, you can utilize features like searching or the Enrichment Analysis tool in the Advanced Plug-in.

They often confusingly use terms like “gene ontology (GO) Analysis,” “pathway analysis,” “network analysis,” when they want to ask how to execute the enrichment analysis. Gene Set Enrichment Analysis (GSEA),  Ingenuity Pathway Analysis (IPA), David Functional Annotation, or Meatscape are names of tools to do it.

Please take a look at the tutorial of the enrichment analysis as well.

RNA-Seq Data Analysis Tutorial (07) - Gene Annotation & Enrichment Analysis

8. Genomic Location Specificly Regulated Genes and Motif Sequences.

In the table of gene annotation, please put the chromosomal location information in the dedicated columns in a proper format. Once you have done, you can utilize the Genome View and Chromosome tab of the Subio Platform.

You can theoretically generate the estimated expression levels on the gene or transcript from the RNA-Seq data. However, if you calculate the expression levels of transcripts, sequence reads are split to splicing variants. As a result, it narrows the dynamic range and degrades the quality of the data. Thus, only gene-level estimation is useful in general. It means you can’t know the exact TSS or transcripts precisely.
After you extract DEGs or a cluster of genes sharing a particular expression pattern, you can check if those genes are located evenly on the chromosomes or biased. If you find regions in which genes are mostly up- or down-regulated, you suppose the change of epigenetic status or chromatin structure might cause the effect. Or, if you don’t find the positional bias, you might presume that transcription factors are likely to control their expression levels.

On the other hand, if you download the genomic sequence data and store the gz files under a directory, you can utilize them from the Subio Platform. Open the Find Regions from Seq tool. You can search where the motif sequence specified in the IUPAC nucleotide code is in the genome. Once you got a region list of the motif sequence, you can search genes that have the motif around their TSS. Of course, you can see the expression patterns of those genes.

In this case study, the up- or down-regulated genes are likely to be evenly scattered. The author modulated a nucleic receptor ERR alpha by siRNA, and we can find the consensus sequence of it in Wikipedia. So we can extract genes with the motif around their TSS. There is only one gene in the intersection of the commonly down-regulated genes and genes with the motif. However, the essential genes might not be down-regulated by all three siRNA, or the threshold might have been too strict. You will have more candidates by loosening those criteria.

RNA-Seq Data Analysis Tutorial (08) - Genomic Location Specificly Regulated Genes & Motif Sequences

9 – The last tip for you

Although this tutorial shows a workflow, it is not always applicable to any case. In the real data analysis, you have to make a series of proper decisions according to the data characteristics and the research purpose. If you think it is difficult for you due to the lack of experience, please don't hesitate to contact us for the Data Analysis Service.