We use the Subio Platform as the data analysis software. It is much less prevalent than R/Bioconductor, but you can use it for academic outputs. It's a shame that beginners often focus on learning commands rather than the concepts of data analysis. We offer this software because it lets you visually confirm, step by step, how each operation affects the data, which is vital for understanding data analysis concepts clearly. Even if you use R, understanding the concepts is beneficial. So why not start with the Subio Platform?
1-a. Importing FASTQ Files
Importing RNA-Seq FASTQ files into the Subio Platform means running a pipeline composed of fastp, HISAT2, and StringTie. It works on an ordinary Windows or Mac computer. Even if you can use the command-line tools, running the pipeline through the Subio Platform is more comfortable.
Before you run the pipeline, you have to install and set up these tools. If you have any difficulties with the installation, please order the Troubleshooting Service for FASTQ processing.
Processing FASTQ files generates intermediate results in the same folder, so before you start, make sure the free disk space is at least five times the total size of the FASTQ files. The process will stop partway if the disk runs out of space. The FASTQ file directory can be on an external disk.
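The five-times rule above can be checked before starting a run. The following is a minimal Python sketch, not part of the Subio Platform; the function names and the list of file suffixes are assumptions made for illustration.

```python
import shutil
from pathlib import Path

# Assumption: FASTQ files end in one of these suffixes; adjust to your naming.
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def total_fastq_bytes(folder: Path) -> int:
    """Sum the sizes of the FASTQ files directly inside `folder`."""
    return sum(
        p.stat().st_size
        for p in folder.iterdir()
        if p.is_file() and p.name.lower().endswith(FASTQ_SUFFIXES)
    )


def has_enough_space(folder: Path, factor: int = 5) -> bool:
    """True if free space on the folder's disk is >= factor x the FASTQ total."""
    required = factor * total_fastq_bytes(folder)
    free = shutil.disk_usage(folder).free
    return free >= required
```

Calling `has_enough_space(Path("path/to/fastq_folder"))` (a placeholder path) returns `False` when the disk holding the folder has less than five times the FASTQ total free.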
If you have paired-end FASTQ files, please follow the naming conventions. And if you get any errors, please try troubleshooting.
1-b. Importing The Gene-Level Expression Data
If you already have a table of gene counts, FPKM, or TPM, you don't need to bother with FASTQ processing; import the table into the Subio Platform instead. As an example, we import and analyze the GSE49110 data set, which consists of eight RNA-Seq samples. You can download a text file of the counts data of the eight samples. Or you can download the SSA file to skip to step 2.
Open and edit the text file in Excel, but note the tip about opening it. Delete unnecessary columns and fill in the header row to create a correctly formatted table with a column of gene names and eight columns of counts representing the eight samples.
Now you have a file to import into the Subio Platform. Start the Import Samples Wizard. Use the “Multiple Samples in One File” option in the first step, and the “Create A New Platform” option in the second step.
After importing the samples, it is a good idea to add the sample information described in the SOFT formatted family file. You can then see more about the samples, and the information also helps you filter samples by keywords.
Let’s create a Series of 8 samples to visualize and analyze.
2. Create and Set Up A Series
When the series is loaded, the Series Panel on the left side organizes objects like Measurement Lists and DataSets. The upper panel of “Analysis Browser” draws a scatter plot chart by default. And the “Setup Series” tab is open in the lower panel.
In the Setup Series tab, you click the buttons from left to right. The first button is labeled "Edit Parameters." It opens the Edit Parameter Window. In general, importing the sample information is the quickest way to fill it in. Here, you have four conditions (control and three siRNA treatments), each in duplicate.
Switch to the Setup DataSet tab to edit and create DataSets. You organize sample groups by prioritizing and marking parameters. In this tutorial, you establish two DataSets. Additionally, it is good practice to make notes and attach related files in the Sample Info tab.
3. Editing Normalization & Pre-processing
Normalization and preprocessing are critical because they determine the results of subsequent analyses. Keep in mind that real data can be very different from what a textbook assumes. So you have to look at the data carefully, understand its characteristics, and choose appropriate methods based on those traits. Otherwise, you will be led to inadequate conclusions. Always pay close attention to this step.
You can start from a Normalization Scenario named “RNA-Seq (Counts),” and adjust options to make it fit the data. Please see what you need to know, how to operate it, and how to make decisions based on the data. We repeatedly say that you cannot apply the same approach to any dataset. You always have to adjust based on the data in front of you. If you are not sure, please order a Data Analysis Service.
Please also read:
- Details of the preset normalization scenario for the RNA-Seq data.
- The basics of the normalization and pre-processing
4. Filtering (Quality Control)
As you saw in the previous part, you can’t trust all the measurement values. You have to distinguish trustworthy measurements from noise, and handle the noise so that it doesn’t disturb the analysis. The Filter tool in the Basic Plug-ins is helpful for quality control. If you don’t have a valid license, please request a 5-day free trial.
Remember that one of the basics of quality control in omics is to exclude genes that are not eligible for analysis, whereas many beginners try to extract only the genes whose measurements are all trustworthy. The latter approach inevitably misses genes that are not expressed in the control but are sharply elevated by a treatment. Please confirm the difference yourself by comparing the results of various filter settings.
And another basic is that you do it in two layers. Firstly, you exclude genes with too-low signals in all groups. Secondly, you filter out genes if the experimental factors don’t alter their expression levels.
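The two layers can be sketched in a few lines of Python. This is an illustration of the idea only, not the Filter tool's implementation; the signal values, gene names, and all three cutoffs are hypothetical.

```python
from statistics import mean

# Hypothetical normalized signals: gene -> {group: [replicate values]}.
data = {
    "geneA": {"control": [0.5, 0.8], "siRNA1": [0.6, 0.4]},          # low everywhere
    "geneB": {"control": [0.2, 0.3], "siRNA1": [55.0, 61.0]},        # off in control, induced
    "geneC": {"control": [120.0, 118.0], "siRNA1": [121.0, 119.0]},  # expressed, unchanged
}

MIN_SIGNAL = 10.0  # assumed cutoff; choose it by inspecting your own data
MIN_FOLD = 2.0     # assumed cutoff for "altered by the experimental factor"
FLOOR = 1.0        # assumed floor to avoid dividing by near-zero signals


def group_means(groups):
    return {name: mean(values) for name, values in groups.items()}


def passes_layer1(groups):
    """Layer 1: drop a gene only if its signal is too low in ALL groups."""
    return any(m >= MIN_SIGNAL for m in group_means(groups).values())


def passes_layer2(groups):
    """Layer 2: keep a gene only if its group means differ by >= MIN_FOLD."""
    means = [max(m, FLOOR) for m in group_means(groups).values()]
    return max(means) / min(means) >= MIN_FOLD


layer1 = [g for g, groups in data.items() if passes_layer1(groups)]
layer2 = [g for g in layer1 if passes_layer2(data[g])]
print(layer1)  # ['geneB', 'geneC']
print(layer2)  # ['geneB']
```

Note that geneB survives even though its control measurements are unreliable; requiring trustworthy values in every sample would have discarded it.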
If you learn to analyze with DESeq2 in R on your own, you may see the following code for filtering lowly expressed genes.
dds <- dds[rowSums(counts(dds)) >= 10,]
However, how do you know whether this criterion is adequate? In our experience, a stricter cutoff is usually needed. Moreover, because this code sums the counts across all samples, the appropriate threshold depends on the number of samples. For example, the cutoff should differ tenfold between a dataset of 6 samples and one of 60. So, how do you know which criterion is proper? The answer is that you have to find it by looking at the data.
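To see why a fixed total-count cutoff depends on sample number, here is a small Python sketch with made-up counts. The function names and the per-sample cutoff are illustrative assumptions, not a recommended threshold.

```python
def passes_total(counts, total_cutoff=10):
    """Same logic as a rowSums-style filter: total reads across samples."""
    return sum(counts) >= total_cutoff


def passes_mean(counts, mean_cutoff):
    """Alternative: mean reads per sample, independent of sample number."""
    return sum(counts) / len(counts) >= mean_cutoff


# The same weakly expressed gene (1 read per sample) in two study sizes.
low_gene_6 = [1] * 6
low_gene_60 = [1] * 60

print(passes_total(low_gene_6))   # False (total is 6)
print(passes_total(low_gene_60))  # True  (total is 60)

# A per-sample criterion gives the same decision for both study sizes.
MEAN_CUTOFF = 10 / 6  # hypothetical: what "total >= 10" means for 6 samples
print(passes_mean(low_gene_6, MEAN_CUTOFF))   # False
print(passes_mean(low_gene_60, MEAN_CUTOFF))  # False
```

Either way, the cutoff itself still has to come from inspecting the data, not from the code.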
5. PCA & Clustering
The previous sections were preparation; here we start using analytical tools to extract biological meaning. Please note that you shouldn’t use all genes, only the quality-controlled genes produced with the Filter tool.
We recommend that you first review the data to get the complete picture and identify what you should do. Principal Component Analysis (PCA) is an ideal tool for this purpose. Please remember three points for interpreting PCA results. The first is the distance between dots. Closely located dots indicate that their expression profiles are similar. The second is the direction from the origin. A small angle between two dots, as seen from the origin, shows that their expression profiles resemble each other, while dots farther from the origin have higher amplitudes. If dots are located on opposite sides of the origin, their expression profiles are opposite. The last is the principal components (PCs) and their contribution rates. The horizontal and vertical axes represent different gene groups, and the contribution rate reflects the size of each group. So the PC with the highest contribution rate represents the majority. Understanding this trend is helpful, but it has nothing to do with biological significance. In many biological phenomena, only a few genes determine the fate of the cells. In other words, principal components with a tiny contribution rate may indicate the behavior of the significant genes.
With the basics above, let’s take a look at the PCA result of this case. The origin is near the upper-left corner, and the control samples are around it. The duplicates in each group are closely located, indicating that the experiment is highly reproducible. There seems to be a common effect of all siRNAs (toward the lower right) and independent effects (rightward and downward).
Next, we apply hierarchical clustering. If you look over the heatmap, you'll notice that many genes are commonly colored red. In contrast, the blue regions are siRNA-specific. In other words, the down-regulated genes are silenced differently by each siRNA. What’s important for analysts is to find such differences or characteristics, not to make lists or figures, because they can be a clue for speculation about the underlying mechanism.
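The first two points (distance and direction) are plain geometry on the plot, which the following Python sketch illustrates. The coordinates are hypothetical and are not taken from the GSE49110 result.

```python
import math

# Hypothetical PC1/PC2 coordinates of three dots on a PCA plot.
scores = {
    "s1": (2.0, 1.0),
    "s2": (4.0, 2.0),    # same direction as s1, twice as far from the origin
    "s3": (-2.0, -1.0),  # opposite side of the origin from s1
}


def distance(a, b):
    """Euclidean distance between two dots (point 1: closeness)."""
    return math.dist(a, b)


def cosine(a, b):
    """Cosine of the angle at the origin (point 2: direction)."""
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (math.hypot(*a) * math.hypot(*b))


s1, s2, s3 = scores["s1"], scores["s2"], scores["s3"]
print(round(distance(s1, s2), 3))                   # 2.236
print(round(cosine(s1, s2), 3))                     # 1.0  -> same pattern
print(round(math.hypot(*s2) / math.hypot(*s1), 3))  # 2.0  -> larger amplitude
print(round(cosine(s1, s3), 3))                     # -1.0 -> opposite pattern
```

A cosine near 1 corresponds to a small angle, and a cosine near -1 to dots on opposite sides of the origin.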
The results from PCA and clustering provide different viewpoints of the same intrinsic states. Switch the view to understand the data in depth.
6. Extracting Differentially Expressed Genes (DEG)
In this case study, we saw that the up-regulated genes are largely common, while the down-regulated genes are unique to the siRNAs. So, let's extract DEGs and confirm the result with the Venn Diagram tool.
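What the Venn diagram reports is ordinary set arithmetic on the DEG lists. A minimal Python sketch, with placeholder gene names and hypothetical lists:

```python
# Hypothetical up-regulated DEG lists per siRNA; gene names are placeholders.
up = {
    "siRNA1": {"g1", "g2", "g3"},
    "siRNA2": {"g1", "g2", "g4"},
    "siRNA3": {"g1", "g2", "g3"},
}


def common(sets):
    """Genes present in every list (the center of the Venn diagram)."""
    return set.intersection(*sets.values())


def unique_to(name, sets):
    """Genes present only in the named list."""
    others = set().union(*(s for n, s in sets.items() if n != name))
    return sets[name] - others


print(sorted(common(up)))               # ['g1', 'g2']
print(sorted(unique_to("siRNA2", up)))  # ['g4']
print(sorted(unique_to("siRNA1", up)))  # []
```

Running the same functions on the down-regulated lists lets you compare how much of each direction is shared.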
7. Gene Annotation and Enrichment Analysis
Imported data usually identifies genes only by an ID such as Gene Symbol, Ensembl ENST/ENSG ID, or NCBI Entrez Gene ID. So you usually have to add the gene annotations yourself, using tables obtained from database sites. Here we show how to use the NCBI FTP site. If your data includes ENSG or ENST IDs, you need to retrieve the table from Ensembl BioMart. If not, please get it from the database site managing the ID and annotation. Once you have the gene annotation in the platform, you can utilize features like searching or the Enrichment Analysis tool in the Advanced Plug-in.
People often use terms like “gene ontology (GO) analysis,” “pathway analysis,” and “network analysis” when they want to ask how to execute the enrichment analysis. Gene Set Enrichment Analysis (GSEA), Ingenuity Pathway Analysis (IPA), DAVID Functional Annotation, and Metascape are tools for doing it.
Please also take a look at the tutorial on enrichment analysis.
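One common statistical basis for this kind of analysis is the over-representation test, which asks whether a gene list overlaps a gene set more than chance would predict under a hypergeometric model. The Python sketch below shows only that calculation. It is not the GSEA algorithm, and it is not claimed to be what the Subio Enrichment Analysis tool computes; the function name and all the numbers are hypothetical.

```python
from math import comb


def enrichment_p(N, K, n, k):
    """One-sided hypergeometric p-value P(X >= k).

    N: genes in the background (e.g. quality-controlled genes)
    K: background genes that belong to the gene set
    n: genes in your list (e.g. DEGs)
    k: genes in your list that belong to the gene set
    """
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


# Hypothetical numbers: 15 of 500 DEGs fall in a 200-gene set, out of 20000 genes.
p = enrichment_p(N=20000, K=200, n=500, k=15)
print(f"p = {p:.3g}")
```

In practice many gene sets are tested at once, so the p-values also need a multiple-testing correction.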
8. Genomic Location-Specifically Regulated Genes and Motif Sequences.
In the gene annotation table, please place the chromosomal location information in the dedicated columns in the proper format. Once you have done so, you can utilize the Genome View and Chromosome tab of the Subio Platform.
You can estimate gene or transcript expression levels from RNA-Seq data. However, when calculating transcript expression levels, the sequence reads are split across splicing variants, which narrows the dynamic range and degrades data quality. Thus, only gene-level estimation is generally helpful. This means you can’t know precisely which transcript, and therefore which TSS, is in use.
After you extract DEGs or a cluster of genes sharing a particular expression pattern, you can check if those genes are located evenly on the chromosomes or are biased. If you find regions in which genes are mostly up- or down-regulated, the change of epigenetic status or chromatin structure might cause the effect. Or, if you don’t find the positional bias, you might presume that transcription factors are likely to control their expression levels.
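The positional check can be approximated by binning genes along each chromosome and counting directions per bin. This Python sketch is an illustration only, not the Chromosome tab's method; the records, window size, and thresholds are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (gene, chromosome, start position, direction).
degs = [
    ("g1", "chr1", 1_200_000, "up"),
    ("g2", "chr1", 1_450_000, "up"),
    ("g3", "chr1", 1_900_000, "up"),
    ("g4", "chr2", 5_100_000, "up"),
    ("g5", "chr2", 5_300_000, "down"),
    ("g6", "chr3", 9_000_000, "down"),
]

BIN_SIZE = 1_000_000  # assumed window size in base pairs
MIN_GENES = 3         # assumed minimum number of DEGs in a window
MIN_FRACTION = 0.8    # assumed share of one direction to call a window biased


def biased_bins(records):
    """Return (chromosome, bin start, direction, gene count) for biased windows."""
    bins = defaultdict(lambda: {"up": 0, "down": 0})
    for _gene, chrom, start, direction in records:
        bins[(chrom, start // BIN_SIZE)][direction] += 1
    flagged = []
    for (chrom, index), counts in sorted(bins.items()):
        total = counts["up"] + counts["down"]
        top = max(counts, key=counts.get)
        if total >= MIN_GENES and counts[top] / total >= MIN_FRACTION:
            flagged.append((chrom, index * BIN_SIZE, top, total))
    return flagged


print(biased_bins(degs))  # [('chr1', 1000000, 'up', 3)]
```

An empty result would be consistent with evenly scattered genes, though it depends on the window size and thresholds chosen.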
On the other hand, if you download the genomic sequence data and store the gz files in a directory, you can use them in the Subio Platform. Open the Find Regions from Seq tool. You can search the genome for a motif sequence specified in IUPAC nucleotide codes. Once you have a list of the motif's locations in a region, you can search for genes that have the motif around their TSS. Of course, you can then see the expression patterns of those genes.
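A motif written in IUPAC nucleotide codes can be expanded into a regular expression, as the Python sketch below shows. It illustrates the idea only and says nothing about how the Find Regions from Seq tool works internally. The motif and the sequence are made up, and the search covers the forward strand only.

```python
import re

# Standard IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate an IUPAC motif into a regex pattern string."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_motif(sequence, motif):
    """Return 0-based start positions of all matches, overlaps included."""
    # The lookahead makes matches zero-width so overlapping hits are reported.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]


# Illustrative motif and sequence, not an asserted consensus for any factor.
print(motif_to_regex("RGGTCA"))                    # [AG]GGTCA
print(find_motif("TTAGGTCAGGGTCATT", "RGGTCA"))    # [2, 8]
```

To cover the reverse strand as well, you would run the same search with the reverse complement of the motif.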
In this case study, the up- or down-regulated genes are likely to be evenly scattered. The author modulated the nuclear receptor ERRα using siRNA, and we can find its consensus sequence on Wikipedia. So we can extract genes with the motif around their TSS. There is only one gene at the intersection of the commonly down-regulated genes and the motif. However, the essential genes might not be down-regulated by all three siRNAs, or the threshold might have been too strict. You will have more candidates by loosening those criteria.
9. The Last Tip for You
Although this tutorial shows one workflow, it is not applicable to every case. In real data analysis, you have to make a series of appropriate decisions based on the characteristics of the data and the purpose of the research. If you find this difficult due to a lack of experience, please get in touch with us about the Data Analysis Service.