Mastering RNA-Seq Data Analysis: A Visual Guide from Raw FASTQ to Biological Insights

In this tutorial, we use Subio Platform, an analysis software that balances intuitive operation with advanced visualization.

When starting to learn omics analysis, many beginners hit the steep wall of R and Bioconductor. Getting bogged down in code debugging and learning commands often leads to neglecting the most critical tasks: "Data Interpretation" and "Biological Insight." This is a huge missed opportunity.

The greatest advantage of using Subio Platform is that you can proceed step-by-step while visually confirming how your operations change the data and what features are being highlighted.

  • "Dialogue," Not Just "Task": Instead of throwing data into a black-box algorithm, you can develop a sense of "talking" to your data through histograms and scatter plots.
  • For Future R Users: Even if you aim to use R in the future, visually understanding the "statistical meaning" of each process is an essential foundation for any professional analyst.
  • Proven Results: Subio Platform is a battle-tested tool trusted in numerous conference presentations and journal publications.

As a first step toward taking command of your analysis rather than being "used by the tools," experience the essence of data analysis through this tutorial.

For those who want to dive deeper:
Regarding "batch effects" encountered in actual analysis and the "analyst's decision" beyond statistical values, please also see Case Study 403: [Does CPM Normalization "Fabricate" Batch Effects? The Hidden Pitfalls of Data Analysis]. (This helps cultivate "judgment criteria" while progressing through the tutorial.)

1-a. Importing FASTQ Files

Key Takeaway: The first step in analysis is importing the measured experimental data. We explain the process for both raw data (FASTQ) and data already converted to gene-level expression (gene counts, FPKM, TPM, etc.).

Importing RNA-Seq FASTQ files into Subio Platform runs a pipeline of three tools to calculate expression levels: fastp (read trimming and quality control), HISAT2 (alignment), and StringTie (quantification). You don't need a workstation or UNIX command-line skills; the pipeline runs on a standard Windows or Mac computer. Even for those who can use the command line, Subio Platform is much easier to operate.

Before running the pipeline, these tools must be installed. If this seems difficult, please consider our FASTQ Processing Troubleshooting Service.

Keep FASTQ files in .gz compressed format. A large number of intermediate files will be generated in the same folder, so ensure you have sufficient free disk space (at least 5 times the total size of the FASTQ files). If the disk fills up during execution, the pipeline will stop. FASTQ files can be stored on an external drive.
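
The disk-space guideline above can be checked before starting a run. The following is a minimal Python sketch and is not part of Subio Platform: the function names are hypothetical, the factor of 5 simply mirrors the rule of thumb in the text, and the file pattern assumes your files end in ".fastq.gz" (adjust it if yours end in ".fq.gz").

```python
import shutil
from pathlib import Path


def required_bytes(fastq_sizes, factor=5):
    """Estimated space needed: `factor` times the total FASTQ size."""
    return factor * sum(fastq_sizes)


def has_enough_space(folder, factor=5):
    """Compare free space on the drive holding `folder` with the estimate."""
    folder = Path(folder)
    # Only .gz-compressed FASTQ files are counted (assumed naming pattern).
    sizes = [p.stat().st_size for p in folder.glob("*.fastq.gz")]
    free = shutil.disk_usage(folder).free
    return free >= required_bytes(sizes, factor)


# Two hypothetical files of 2 GB and 3 GB would call for about 25 GB.
print(required_bytes([2_000_000_000, 3_000_000_000]) / 1e9, "GB")
```

Because the intermediate files are written next to the FASTQ files, the check looks at the drive that holds that folder, which matters when the data sits on an external drive.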

For paired-end samples, please follow the naming rules for FASTQ files. If you encounter issues, try our troubleshooting guide.

Importing RNA-Seq FASTQ Files

1-b. Importing Gene-Level Expression Data

If you already have data converted to gene-level expression (gene counts, FPKM, TPM, etc.), you don't need to struggle with FASTQ processing. Just import that text file into Subio Platform.

As an example, let's download and import the expression text file for GSE49110. Alternatively, you can download the SSA file and skip to the next step.

First, edit the downloaded file in Excel. There are a few tips: remove unnecessary columns and rows, and format it into a table with IDs (gene names) and expression levels.
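
If you would rather script this cleanup than do it in Excel, the same idea can be sketched in Python. This is only an illustration under stated assumptions: the column names ("gene", "ctrl_1", "siA_1") and the gene names are made up, and the real GSE49110 file has its own layout that you need to inspect first.

```python
import csv
import io


def trim_expression_table(text, id_column, sample_columns):
    """Keep only the ID column and the chosen sample columns of a
    tab-delimited table, dropping rows that have no ID."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([id_column] + list(sample_columns))
    for row in reader:
        gene_id = (row.get(id_column) or "").strip()
        if not gene_id:
            continue  # skip rows with an empty ID
        writer.writerow([gene_id] + [row[c] for c in sample_columns])
    return out.getvalue()


# Hypothetical input with one unneeded column and one row lacking an ID.
raw = (
    "gene\tchrom\tctrl_1\tsiA_1\n"
    "GENE1\tchr1\t10\t25\n"
    "\tchr2\t0\t0\n"
    "GENE2\tchr3\t7\t3\n"
)
print(trim_expression_table(raw, "gene", ["ctrl_1", "siA_1"]))
```

The result is a plain table with one ID column followed by expression columns, which is the shape described above.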

Import this edited text file into Subio Platform. When starting the "Import Samples" wizard, use the "Multiple Samples in One File" option on the first page, and "Create A New Platform" on the next.

After importing, bring in sample attributes from the SOFT formatted family file. This allows you to view more information and filter by keywords.

RNA-Seq Tutorial (01) - Importing RNA-Seq Counts Data

2. Series Creation and Setup

Key Takeaway: In statistical analysis, you cannot begin without accurately defining the "Experimental Design" (which samples belong to which group). By organizing attribute information and clarifying the "axis" of comparison, you establish a clear analysis environment.

Once the Series is loaded, objects like the Measurement List and DataSet will appear in the Series panel on the left. In the Analysis Browser, a Scatter Plot is drawn by default in the upper pane, while the Setup Series tab opens in the lower pane. Click the "Edit Parameters" button first. Usually, importing info from "Sample Information" is fastest. This dataset has four conditions (control and three types of siRNA treatment), each with two replicates.

Move to the Setup DataSet tab to edit and create DataSets. This tab's role is to define and organize Sample Groups by setting parameter priorities. In this tutorial, we set up two DataSets. Additionally, record information in the Sample Info tab and save related attachments.

RNA-Seq Tutorial (02) - Creating and Setting Up a Series

3. Normalization and Pre-processing

Key Takeaway: Normalization is an essential step to correct systematic errors caused by differences in total read counts, but it should not be blindly trusted (See Case Study 379). Instead of relying solely on presets, develop an eye for judging whether data distortions are correctly resolved by visually checking pre- and post-correction distributions with histograms.

Normalization and pre-processing are critical steps because they can change subsequent results. Real-world data doesn't always match textbook assumptions. If you don't correctly understand the characteristics of the data in front of you and process it accordingly, you risk drawing incorrect conclusions. Proceed with caution.
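
To make "differences in total read counts" concrete, here is a minimal Python sketch of CPM (counts per million) scaling followed by a log2 transform. It is a generic illustration, not the preset used by Subio Platform; the function names, the pseudocount of 1, and the sample values are assumptions made for the example.

```python
import math


def cpm(counts):
    """Scale one sample's raw counts to counts per million (CPM)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c * 1_000_000 / total for c in counts]


def log2_cpm(counts, pseudocount=1.0):
    """log2(CPM + pseudocount); the pseudocount avoids log2(0)."""
    return [math.log2(v + pseudocount) for v in cpm(counts)]


# Two made-up samples with the same composition but a 10x depth difference.
shallow = [10, 30, 60]
deep = [100, 300, 600]
print(cpm(shallow))
print(cpm(deep))
```

In this toy case the two samples become identical after scaling, which is the situation CPM is designed for. When the composition differs between samples (for example, a few genes taking up a large share of the reads), dividing by the total can shift every other gene, which is why checking the distributions before and after correction matters.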

In this tutorial, we start with the "RNA-Seq (Counts)" preset scenario and adjust it to fit the data. Learn what to look for and how to make decisions through this process. Again, this method won't apply to every dataset; real analysis requires constant adjustment. If you lack confidence, please feel free to use our Data Analysis Service.

RNA-Seq Tutorial (03) - Normalization and Pre-Processing

4. Filtering (Quality Control)

Key Takeaway: "Garbage In, Garbage Out." Learn techniques to appropriately exclude unreliable low-expression genes and genes that show no variation, so that only signals with true biological meaning remain.

As seen in the normalization section, not all measured values are reliable. Before analyzing, you must extract values worthy of analysis and handle unanalyzable values so they don't interfere. We use the Filter tool included in the Basic Plug-in. If you don't have the plugin, try our 5-day free trial.

The basic mindset of filtering is not "extracting reliable genes," but "identifying and removing genes unworthy of analysis." This distinction is hard to grasp without experience. The former might miss genes that were not expressed in the control but appeared after treatment. Verify this difference yourself using the Filter tool.

Another basic principle is a two-step process: First, remove genes with values that are too low. Second, remove genes that do not show expression variation.
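
The two-step principle can be sketched in a few lines of Python. This is a simplified illustration, not the Filter tool's implementation: the function name, the thresholds, and the gene names are hypothetical, and values are assumed to be log2 expression levels.

```python
def two_step_filter(table, min_value, min_range):
    """table maps gene -> list of log2 expression values (one per sample).

    Step 1 removes genes whose values are too low in every sample.
    Step 2 removes genes whose values barely vary across samples.
    """
    kept = {}
    for gene, values in table.items():
        if max(values) < min_value:
            continue  # step 1: never reaches a trustworthy level
        if max(values) - min(values) < min_range:
            continue  # step 2: expressed, but shows no variation
        kept[gene] = values
    return kept


# Made-up values for two control and two treated samples.
table = {
    "lowGene": [0.1, 0.3, 0.2, 0.0],
    "flatGene": [8.0, 8.1, 7.9, 8.0],
    "inducedGene": [0.2, 0.1, 6.5, 6.8],
}
print(sorted(two_step_filter(table, min_value=3.0, min_range=1.0)))
```

Step 1 tests the highest value rather than requiring every sample to be high, so a gene that is silent in the control but switched on after treatment is kept. That is the "remove what is unworthy" mindset rather than "extract what is reliable."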

For example, if you are self-studying R/DESeq2, you might see sample code like this:

dds <- dds[rowSums(counts(dds)) >= 10,]

Have you ever copied and pasted a line like this while wondering, "Is it really okay to use the same threshold for any data?" How should the appropriate threshold be decided? The answer is, "You won't know until you look at the data." For small datasets of up to about 100 samples, the "look at the data as you go" approach in this tutorial is the right answer. For larger datasets, pick about 100 samples at random for a preliminary analysis and determine the threshold there before running the full pipeline. Running sample code without looking at the data is a reliable way to get misleading results.
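
One way to "look at the data" before committing to a cutoff is to count how many genes survive at several candidate thresholds. The Python sketch below does that with the same row-sum criterion as the R line quoted earlier, and adds a helper for drawing a random pilot subset from a large study. Function names, the seed, and the counts are made up for illustration.

```python
import random


def survivors_by_threshold(count_rows, thresholds):
    """For each candidate threshold, count the genes whose summed counts
    across samples reach it (the same criterion as rowSums(...) >= t)."""
    row_sums = [sum(row) for row in count_rows]
    return {t: sum(1 for s in row_sums if s >= t) for t in thresholds}


def pick_pilot_samples(sample_names, k=100, seed=0):
    """Randomly choose up to k samples for a preliminary analysis."""
    names = list(sample_names)
    return random.Random(seed).sample(names, min(k, len(names)))


# Made-up counts: one row per gene, one column per sample.
counts = [
    [0, 0, 0, 0],
    [1, 2, 0, 1],
    [3, 4, 2, 5],
    [40, 35, 50, 44],
    [900, 1100, 950, 1020],
]
print(survivors_by_threshold(counts, [1, 5, 10, 50, 100]))
```

A threshold that sits on a plateau (where nearby values keep the same genes) is usually a safer choice than one on a steep slope, but that judgment still depends on the histogram of your own data.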

RNA-Seq Tutorial (04) - Filtering (Quality Control)

5. PCA and Clustering

Key Takeaway: The movement of tens of thousands of genes exceeds human comprehension; we must make it understandable. Gain a bird's-eye view to intuitively grasp sample variation (experimental quality) and similarity between groups to determine "what to investigate next."

Now that the preparation is complete, we move into extracting meaning using various analytical methods. Note that we target only the gene groups that passed Quality Control through filtering.

First, it's best to grasp the overview of expression profiles to clarify points of analysis. Principal Component Analysis (PCA) is useful for this. There are three key points when looking at a PCA plot: 1) Proximity between samples indicates similar profiles. 2) Direction from the origin indicates groups of genes moving similarly, and distance from the origin indicates the magnitude of that movement. 3) Each principal component captures a different pattern of gene movement, and its contribution ratio tells you how much of the total variance that pattern explains. Components with a high contribution ratio capture the general overview, but components with a low contribution ratio may still capture biologically critical movements.
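
For readers curious about where a contribution ratio comes from, the following Python sketch computes the first principal component of a small table using power iteration, so that it needs only the standard library. It is a teaching sketch under stated assumptions, not how Subio Platform computes PCA: the function name and data are hypothetical, only the first component is returned, and the sign of a principal component is arbitrary.

```python
import math


def first_pc(table, iterations=200):
    """Return (sample scores on PC1, contribution ratio of PC1).

    table: list of genes, each a list of log2 values (one per sample).
    """
    n = len(table[0])
    # Center each gene so PCA describes variation, not mean level.
    centered = []
    for row in table:
        mean = sum(row) / n
        centered.append([v - mean for v in row])
    # gram[i][j]: sum over genes of (value in sample i) * (value in sample j).
    gram = [[sum(r[i] * r[j] for r in centered) for j in range(n)]
            for i in range(n)]
    total = sum(gram[i][i] for i in range(n))  # total variation
    if total == 0:
        raise ValueError("no variation in the data")
    # A non-constant start vector; constant vectors are mapped to zero.
    vec = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        nxt = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the unit vector gives the largest eigenvalue.
    eigenvalue = sum(vec[i] * sum(gram[i][j] * vec[j] for j in range(n))
                     for i in range(n))
    scores = [math.sqrt(eigenvalue) * x for x in vec]
    return scores, eigenvalue / total


# Made-up data: columns are ctrl_1, ctrl_2, treated_1, treated_2.
demo = [
    [5.0, 5.1, 7.0, 7.2],
    [3.0, 2.9, 1.0, 1.1],
    [6.0, 6.2, 6.1, 5.9],
]
scores, ratio = first_pc(demo)
print([round(s, 2) for s in scores], round(ratio, 3))
```

Samples with similar scores sit close together on the PC1 axis, and the ratio is the share of total variation that this single axis accounts for.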

In this data's PCA, the distance between replicates is much smaller than between siRNA treatments, showing high-quality data. Compared to the control, siRNA groups show a shared direction (bottom-right) as well as individual movements.
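
The comparison "replicates are closer than treatments" can also be expressed as numbers. The sketch below, in Python, averages Euclidean distances between full expression profiles within and between groups. It is an approximation of what the PCA plot shows (the plot uses projected coordinates), and the sample names and values are invented.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_summary(profiles, groups):
    """profiles: sample -> list of log2 values; groups: sample -> group.

    Returns (mean distance within groups, mean distance between groups).
    """
    within, between = [], []
    for s1, s2 in combinations(sorted(profiles), 2):
        d = euclidean(profiles[s1], profiles[s2])
        (within if groups[s1] == groups[s2] else between).append(d)
    return sum(within) / len(within), sum(between) / len(between)


# Made-up profiles over three genes.
profiles = {
    "ctrl_1": [1.0, 5.0, 3.0],
    "ctrl_2": [1.0, 5.0, 4.0],
    "siA_1": [4.0, 1.0, 3.0],
    "siA_2": [4.0, 1.0, 4.0],
}
groups = {"ctrl_1": "ctrl", "ctrl_2": "ctrl", "siA_1": "siA", "siA_2": "siA"}
within, between = distance_summary(profiles, groups)
print(round(within, 3), round(between, 3))
```

A within-group mean that is much smaller than the between-group mean is the numerical counterpart of tight replicate clusters on the plot.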

Next, apply hierarchical clustering. The heatmap shows that while many genes are commonly upregulated (red) across all siRNAs, downregulated genes (blue) vary by siRNA. An analyst's job is not just to make lists, but to notice these features and gain insights into what is happening inside the cell.

RNA-Seq Tutorial (05) - PCA and Clustering

6. Extracting Differentially Expressed Genes (DEGs)

Key Takeaway: Don't just rely on P-value cutoffs; narrow down genes multi-dimensionally using tools like Venn diagrams. Clarify what "moves commonly" and what "moves uniquely" between different conditions to decide the next target for consideration.

As seen in clustering, upregulated genes are common across the three siRNAs, while downregulated genes have few commonalities. Extract DEGs using Basic Plug-in tools and verify with the Venn Diagram tool.
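
The logic behind a three-way Venn diagram is plain set arithmetic, sketched here in Python. This is a deliberately simplified example: it selects genes by log2 fold change alone, with no statistical test, so it is not a substitute for the Basic Plug-in tools. The threshold of 1.0 (two-fold), the function names, and the fold-change values are assumptions.

```python
def degs(log2_fc, threshold=1.0):
    """Split genes into up- and down-regulated sets by log2 fold change."""
    up = {g for g, fc in log2_fc.items() if fc >= threshold}
    down = {g for g, fc in log2_fc.items() if fc <= -threshold}
    return up, down


def venn3(a, b, c):
    """Sizes of the seven regions of a three-set Venn diagram."""
    return {
        "A only": len(a - b - c),
        "B only": len(b - a - c),
        "C only": len(c - a - b),
        "A&B only": len((a & b) - c),
        "A&C only": len((a & c) - b),
        "B&C only": len((b & c) - a),
        "A&B&C": len(a & b & c),
    }


# Made-up log2 fold changes versus control for three siRNAs.
fc_a = {"g1": 2.0, "g2": 1.5, "g3": -1.2, "g4": 0.1}
fc_b = {"g1": 1.8, "g2": 1.1, "g3": 0.0, "g4": -2.0}
fc_c = {"g1": 2.2, "g2": 0.3, "g3": -0.2, "g4": -0.1}
ups, downs = zip(degs(fc_a), degs(fc_b), degs(fc_c))
print("up:", venn3(*ups))
print("down:", venn3(*downs))
```

Separating the up and down sets before intersecting them is what lets you see a pattern like "shared increases, divergent decreases."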

RNA-Seq Tutorial (06) - Extracting DEGs

7. Gene Annotation and Enrichment Analysis

Key Takeaway: Translate a "list of numbers" into "biological language." Determine which cellular functions (GO) or metabolic pathways are concentrated with statistically significant variations to find clues for discovery.

RNA-Seq count/FPKM tables often only have Gene Symbols or IDs (ENSG, ENST, Entrez). You need to fetch and fill gene annotation tables from database sites. In this tutorial, we use the NCBI FTP Site, but if your data has ENSG or ENST IDs, please obtain them from Ensembl BioMart. For other IDs, use the specific database site that manages those annotations.
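
Before choosing an annotation source, it helps to know which kind of ID your table carries. The Python sketch below guesses the type from the identifier's shape. It is a heuristic with assumptions: it covers only the human-style ENSG/ENST prefixes and all-digit Entrez IDs named above, and the function name is hypothetical.

```python
import re


def guess_id_type(gene_id):
    """Guess which database an identifier comes from by its shape."""
    # Drop a version suffix such as ".17" before matching.
    core = gene_id.strip().split(".")[0]
    if re.fullmatch(r"ENSG\d{11}", core):
        return "Ensembl gene"
    if re.fullmatch(r"ENST\d{11}", core):
        return "Ensembl transcript"
    if re.fullmatch(r"\d+", core):
        return "Entrez Gene ID"
    return "symbol or other"


for example in ["ENSG00000141510.17", "ENST00000269305", "7157", "TP53"]:
    print(example, "->", guess_id_type(example))
```

Running this over the first column of your table shows quickly whether the IDs are consistent or a mixture, which decides where to fetch the annotation from.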

Once the gene annotation is imported, you can use search functions and the Enrichment Analysis tool in the Advanced Plug-in.

Terms like GO analysis, Pathway analysis, and Network analysis often refer to "Enrichment Analysis." Tools like GSEA, IPA, DAVID, and Metascape perform this. See our detailed Enrichment Analysis tutorial for more.
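
Many enrichment tools are built around an over-representation test, which asks whether a gene list contains more members of a term than chance would predict. The Python sketch below shows that idea with a one-sided hypergeometric test. It is a generic illustration, not necessarily the calculation used by Subio Platform or by any tool named above (GSEA, for example, uses a rank-based statistic instead), and it omits multiple-testing correction. The function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_in_term, n_selected, n_overlap):
    """One-sided hypergeometric p-value: the probability of seeing at
    least n_overlap term members among n_selected genes drawn at random
    from n_background genes, n_in_term of which belong to the term."""
    upper = min(n_in_term, n_selected)
    total = comb(n_background, n_selected)
    hits = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return hits / total


# Made-up numbers: 10,000 background genes, 200 in the term,
# 150 selected genes, 12 of which belong to the term.
print(enrichment_p_value(10_000, 200, 150, 12))
```

The background should be the genes that passed filtering, not the whole genome, because genes that could never be selected would otherwise inflate the apparent enrichment.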

RNA-Seq Tutorial (07) - Gene Annotation and Enrichment Analysis

8. Genomic Location-Specific Regulation and Motif Search

Key Takeaway: Approach new regulatory mechanisms from physical positional relationships. Infer the "cause" of expression variation by examining changes dependent on specific genomic regions or associations with transcription factor "motif sequences."

Once you have extracted DEGs or genes with specific expression patterns through clustering, you might want to check if these genes are distributed unevenly across the genome. If a cluster of genes in a specific region shows a synchronized increase or decrease in expression, it may hint at localized epigenetic shifts or structural variations in the chromosome.

If no such bias is observed, it may be more likely that the expression is regulated by transcription factors. You can identify genes having motif sequences near their TSS. Exploring these spatial relationships may provide additional layers of insight into your data. For your reference, the motif sequence search supports IUPAC nucleotide codes.
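
To show what IUPAC support means in practice, the Python sketch below expands a motif written in IUPAC nucleotide codes into a regular expression and reports where it matches. It is an independent illustration, not the platform's search: the function name is hypothetical, the motif and sequence are illustrative only, and only the strand you pass in is scanned (search the reverse complement separately if needed).

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def motif_positions(motif, sequence):
    """0-based start positions of every (possibly overlapping) match of
    an IUPAC motif on the given strand of a DNA sequence."""
    pattern = "".join(IUPAC[base] for base in motif.upper())
    # The lookahead lets overlapping matches be reported too.
    matches = re.finditer(f"(?=({pattern}))", sequence.upper())
    return [m.start() for m in matches]


# Illustrative motif and sequence; substitute your own.
print(motif_positions("TNAAGGTCA", "GGTCAAGGTCATTGAAGGTCA"))  # [2, 12]
```

Here "N" matches any base, so the two hits differ at the second position of the motif. Restricting the scanned sequence to a window around each TSS gives the "motif near TSS" gene list described above.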

In this specific case, there appears to be no expression control biased toward genomic locations. However, we were able to find the binding motif sequence for ERR alpha (which was knocked down by siRNA) on Wikipedia and extract genes possessing this motif near their TSS. By checking the overlap between these genes and the list of genes downregulated by all three types of siRNA, we arrived at a single gene. Nevertheless, as mentioned earlier, it may not be necessary for a gene to be suppressed by all three siRNAs, or the threshold for differential expression analysis might have been too stringent. By relaxing these conditions, the number of candidate genes directly regulated by ERR alpha will likely expand.

RNA-Seq Tutorial (08) - Genomic Location Specific Regulation & Motif Seq

9. Conclusion: Mastering Tools and Becoming a "Decision-Maker"

Key Takeaway: Tutorials are merely introductions to "methods." In the actual field of analysis, it is the analyst, not the tool, who must judge which threshold is appropriate or how to handle batch effects. Remember that sound decisions require informed judgment that goes beyond objective numbers.

This tutorial has introduced the standard flow of RNA-Seq data analysis. However, real data analysis never proceeds in a straight line. The "correct" answer changes depending on data characteristics and research goals.

What matters is not blindly trusting computational results, but "dialoguing" with data through the Viewer, noticing discrepancies, and proceeding with your own rationale. Once you finish learning the methods, the next step is the more essential one: "The Analyst's Decision."

Subio Platform is a powerful partner in your evolution from a "task worker used by tools" to a "researcher who takes command of the analysis." Now, go and derive a one-of-a-kind discovery from your data.
