A Practical Guide to RNA-Seq: Moving Beyond Black-Box Analysis — Become an Analyst Who Thinks in the Age of AI

When beginning RNA-Seq analysis, many researchers quickly encounter a major hurdle: working with R and various analysis packages.
Time is often consumed by writing code and troubleshooting errors, leaving little room for what truly matters—interpreting the data and deriving biological insights.

At the same time, GUI-based automated analysis tools have become increasingly popular. While they allow users to obtain results quickly, they can also encourage analyses to proceed without carefully examining preprocessing steps, normalization, or batch effects.
In RNA-Seq analysis, what truly matters is not simply completing computations, but observing how the data changes at each step and making informed decisions based on those observations.

Subio Platform was designed as an environment for thinking through data by visualizing it.
By emphasizing visualization and providing essential statistical functions, it enables users to carry out analysis with clear intent rather than relying on black-box processes.

In this tutorial, you will follow the standard RNA-Seq analysis workflow while learning how to look at your data and how to make decisions using Subio Platform.
This is a tutorial designed to help you move from “being able to run an analysis and get results somehow” to “understanding your data and being able to explain your analysis.”

For those who need results quickly

This page is intended for those who want to understand RNA-Seq analysis by working through it step by step.
However, if you are in one of the following situations, you may prefer a shorter path:

I want to analyze my own data as quickly as possible
→ Data Analysis Service: We deliver not only processed results, but also an interactive SSA file that allows you to explore the data freely.

I want to learn the workflow quickly using my own data
→ Online Training: Learn the essential analysis workflow efficiently using your own data. Sessions can be recorded for future reference.

I want to start with the basics using demo data
→ Continue with this tutorial.

1-a. Importing FASTQ Files

Key Takeaway: The first step in analysis is importing the measured experimental data. We explain the process for both raw data (FASTQ) and data already converted to gene-level expression (gene counts, FPKM, TPM, etc.).

As discussed in “What is the most efficient way to learn omics data analysis?”, processing FASTQ files is often the first major hurdle for beginners and a step that tends to consume a significant amount of time.
However, FASTQ processing is not the core of research; it is closer to routine execution of defined procedures.

Learning command-line tools, overcoming errors, and successfully completing a pipeline can certainly bring a sense of accomplishment. However, this stage is merely a preparatory step. The core of research—understanding the data and deriving biological insights—lies beyond this point.

What truly deserves the most time and attention is the analysis and interpretation that follow. Spending excessive time and effort on these early steps can make the overall process inefficient.

If you find yourself stuck at this stage, it is perfectly reasonable to delegate this part—to colleagues, services, or even AI.

At the same time, tools that allow you to “just input FASTQ files and click a button,” as well as guides that focus only on such operations, may be convenient but often provide limited opportunities to deepen your understanding of the analysis.

If you already have data converted to gene-level expression values (gene counts, or measures such as RPKM/FPKM, TPM, or CPM, which are not always optimal), there is no need to start from this step. We recommend proceeding directly to Section 1-b.

In this section (1-a), we will explain how to complete this preprocessing step within Subio Platform. However, keep in mind that this is still a preparatory stage—the essential analytical work lies ahead.

Importing RNA-Seq FASTQ files into Subio Platform involves running a pipeline of three tools (fastp, HISAT2, and StringTie) to calculate expression levels. However, you don't need a workstation or UNIX command-line skills; the pipeline runs on standard Windows or Mac computers. Even for those comfortable with the command line, Subio Platform is much easier to operate.
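For reference, here is roughly what that pipeline corresponds to on the command line. This is an illustrative sketch only: the index, annotation, and FASTQ file names are placeholders, and Subio Platform runs the equivalent steps for you.

```shell
# 1) Read trimming / QC with fastp (paired-end; file names are placeholders)
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
      -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz \
      --html fastp_report.html

# 2) Align to the genome with HISAT2 (genome_index is a prebuilt HISAT2 index)
hisat2 -x genome_index -1 trimmed_R1.fastq.gz -2 trimmed_R2.fastq.gz \
    | samtools sort -o sample.sorted.bam -

# 3) Estimate expression with StringTie against a reference annotation
stringtie sample.sorted.bam -e -G annotation.gtf -A gene_abundance.tsv -o sample.gtf
```

Seeing the steps spelled out like this makes it easier to understand what any FASTQ-processing tool, GUI or not, is doing on your behalf.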

Before running the pipeline, these tools must be installed. If this seems difficult, please consider our FASTQ Processing Troubleshooting Service.

Keep FASTQ files in .gz compressed format. A large number of intermediate files will be generated in the same folder, so ensure you have sufficient disk space (at least 5 times the total size of the FASTQ files); if the disk fills up during execution, the pipeline will stop. FASTQ files can be stored on an external drive.

For Paired-end samples, please follow the naming rules for FASTQ files. If you encounter issues, try our troubleshooting guide.

Importing RNA-Seq FASTQ Files

Back to TOC

1-b. Importing Gene-Level Expression Data

If you already have data converted to gene-level expression (gene counts, FPKM, TPM, etc.), you don't need to struggle with FASTQ processing. Just import that text file into Subio Platform.

As an example, let's download and import the expression text file for GSE49110. Alternatively, you can download the SSA file and skip to the next step.

First, edit the downloaded file in Excel. There are a few tips: remove unnecessary columns and rows, and format it into a table with IDs (gene names) and expression levels.
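If you prefer to script this cleanup instead of using Excel, the reshaping can be sketched in Python with the standard library. This is a toy example: the column names and values below are made up and will differ from the actual GSE49110 file.

```python
import csv
import io

# A toy stand-in for a downloaded expression text file: comment rows
# and a description column that we want to drop (names are hypothetical).
raw = """# downloaded from GEO
gene_id\tdescription\tsample_1\tsample_2
GAPDH\thousekeeping\t1032\t998
ACTB\thousekeeping\t876\t901
"""

# Drop comment rows, then keep only the ID column and the expression columns.
rows = [r for r in csv.reader(io.StringIO(raw), delimiter="\t")
        if r and not r[0].startswith("#")]
keep = [0, 2, 3]  # column indices for gene_id, sample_1, sample_2
table = [[row[i] for i in keep] for row in rows]

for row in table:
    print("\t".join(row))
```

The result is exactly the shape the import wizard expects: one ID column followed by one column of expression values per sample.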

Import this edited text file into Subio Platform. When starting the "Import Samples" wizard, use the "Multiple Samples in One File" option on the first page, and "Create A New Platform" on the next.

After importing, bring in sample attributes from the SOFT formatted family file. This allows you to view more information and filter by keywords.

RNA-Seq Tutorial (01) - Importing RNA-Seq Counts Data

▶ Get your data analyzed now (data analysis service)
▶ Master RNA-Seq analysis using your own data (online training)

Back to TOC

2. Series Creation and Setup

Key Takeaway: In statistical analysis, you cannot begin without accurately defining the "Experimental Design" (which samples belong to which group). By organizing attribute information and clarifying the "axis" of comparison, you establish a clear analysis environment.

Once the Series is loaded, objects like the Measurement List and DataSet will appear in the Series panel on the left. In the Analysis Browser, a Scatter Plot is drawn by default in the upper pane, while the Setup Series tab opens in the lower pane. Click the "Edit Parameters" button first. Usually, importing info from "Sample Information" is fastest. This dataset has four conditions (control and three types of siRNA treatment), each with two replicates.

Move to the Setup DataSet tab to edit and create DataSets. This tab's role is to define and organize Sample Groups by setting parameter priorities. In this tutorial, we set up two DataSets. Additionally, record information in the Sample Info tab and save related attachments.

RNA-Seq Tutorial (02) - Creating and Setting Up a Series

Back to TOC

3. Normalization and Pre-processing

Key Takeaway: Normalization is an essential step to correct systematic errors caused by differences in total read counts, but it should not be blindly trusted (See Case Study 379). Instead of relying solely on presets, develop an eye for judging whether data distortions are correctly resolved by visually checking pre- and post-correction distributions with histograms.

Normalization and pre-processing are critical steps because they can change subsequent results. Real-world data doesn't always match textbook assumptions. If you don't correctly understand the characteristics of the data in front of you and process it accordingly, you risk drawing incorrect conclusions. Proceed with caution.
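To make the idea behind normalization concrete, here is a minimal numpy sketch of median-of-ratios normalization (the principle behind DESeq2-style size factors). The count matrix is toy data, and in real data genes with zero counts would have to be excluded before taking logarithms.

```python
import numpy as np

# Toy count matrix: rows = genes, columns = samples (values are made up).
# Sample 2 was "sequenced twice as deeply", so its raw counts are inflated.
counts = np.array([[100, 200],
                   [ 50, 100],
                   [ 25,  50],
                   [ 10,  20]], dtype=float)

# Median-of-ratios size factors: divide each sample by the per-gene
# geometric mean across samples, then take the median over genes.
log_geo_mean = np.log(counts).mean(axis=1)
size_factors = np.exp(np.median(np.log(counts) - log_geo_mean[:, None], axis=0))
normalized = counts / size_factors

print(size_factors)  # sample 2's factor is twice sample 1's
```

After dividing by the size factors, the two samples become directly comparable, which is exactly what you should confirm visually with histograms of the pre- and post-correction distributions.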

In this tutorial, we start with the "RNA-Seq (Counts)" preset scenario and adjust it to fit the data. Learn what to look for and how to make decisions through this process. Again, this method won't apply to every dataset; real analysis requires constant adjustment. If you lack confidence, please feel free to use our Data Analysis Service.

RNA-Seq Tutorial (03) - Normalization and Pre-Processing

▶ Get your data analyzed now (data analysis service)
▶ Master RNA-Seq analysis using your own data (online training)

Back to TOC

4. Filtering (Quality Control)

Key Takeaway: "Garbage In, Garbage Out." Learn techniques to appropriately exclude unreliable low-expression genes and non-varying noise so that only signals with true biological meaning remain.

As seen in the normalization section, not all measured values are reliable. Before analyzing, you must extract values worthy of analysis and handle unanalyzable values so they don't interfere. We use the Filter tool included in the Basic Plug-in. If you don't have the plugin, try our 5-day free trial.

The basic mindset of filtering is not "extracting reliable genes," but "identifying and removing genes unworthy of analysis." This distinction is hard to grasp without experience. The former might miss genes that were not expressed in the control but appeared after treatment. Verify this difference yourself using the Filter tool.

Another basic principle is a two-step process: First, remove genes with values that are too low. Second, remove genes that do not show expression variation.

For example, if you are self-studying R/DESeq2, you might see sample code like this:

dds <- dds[rowSums(counts(dds)) >= 10,]

Have you ever copy-pasted this while wondering whether the same threshold is really appropriate for any dataset, but used it anyway because you were unsure? How should the appropriate threshold be decided? The answer is: you won't know until you look at the data. For small datasets under 100 samples, the "look at the data as you go" approach in this tutorial is the right answer. For large datasets over 100 samples, the practical approach is to pick about 100 samples at random for a preliminary analysis, determine the threshold there, and then run the full pipeline. Running sample code without ever looking at the data is a recipe for misleading results.
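The two-step principle can be sketched in a few lines of numpy. The thresholds and the expression matrix below are illustrative only; the point is that checking "high in at least one sample" (rather than a blanket row sum) preserves genes that appear only after treatment.

```python
import numpy as np

# Toy normalized expression matrix: rows = genes, columns = samples.
expr = np.array([[  2,   3,   2,   3],    # low everywhere      -> remove (step 1)
                 [500, 510, 495, 505],    # high but flat       -> remove (step 2)
                 [ 20,  22, 400, 390]],   # rises after treatment -> keep
                dtype=float)

# Step 1: remove genes whose values are too low in ALL samples.
# Requiring only "high in at least one sample" keeps genes that were
# silent in the control but induced by the treatment.
min_level = 10
passes_level = (expr >= min_level).any(axis=1)

# Step 2: among those, remove genes that show no expression variation.
min_fold_change = 2.0
fold = expr.max(axis=1) / np.maximum(expr.min(axis=1), 1.0)
passes_variation = fold >= min_fold_change

keep = passes_level & passes_variation
print(keep)  # only the third gene survives both steps
```

Whatever thresholds you choose, verify them against the actual distributions in your data rather than inheriting someone else's cutoff.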

RNA-Seq Tutorial (04) - Filtering (Quality Control)

【Start your 5-day free trial】
▶ Remove noisy genes with filtering

Back to TOC

5. PCA and Clustering

Key Takeaway: The movement of tens of thousands of genes exceeds human comprehension; we must make it understandable. Gain a bird's-eye view to intuitively grasp sample variation (experimental quality) and similarity between groups to determine "what to investigate next."

Now that the preparation is complete, we move into extracting meaning using various analytical methods. Note that we target only the gene groups that passed Quality Control through filtering.

First, it's best to grasp an overview of the expression profiles to clarify points of analysis. Principal Component Analysis (PCA) is useful for this. There are three key points when reading a PCA plot: 1) samples plotted close together have similar profiles; 2) the direction from the origin reflects groups of genes moving in a similar way, and the distance from the origin reflects the magnitude of that movement; 3) each principal component captures a different axis of gene movement, with its contribution ratio showing how much of the variance it explains. A component with a high contribution ratio captures the general overview, but components with low contribution ratios can still capture biologically critical movements.
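A minimal numpy sketch shows where PCA scores and contribution ratios come from. The expression matrix is toy data (two groups with two replicates each), not the tutorial dataset.

```python
import numpy as np

# Toy matrix: rows = samples (2 groups x 2 replicates), columns = genes.
X = np.array([[10.0, 1.0, 5.0],
              [10.5, 1.2, 5.1],   # replicate of sample 1
              [ 2.0, 8.0, 5.2],
              [ 2.2, 8.3, 5.0]])  # replicate of sample 3

# PCA via SVD on the mean-centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * S                          # sample coordinates on each PC
contribution = S**2 / (S**2).sum()      # contribution ratio of each PC

print(np.round(contribution, 3))  # PC1 dominates: it separates the two groups
```

Here the replicates land close together while the groups separate along PC1, which is exactly the pattern that indicates high-quality data in the tutorial dataset as well.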

In this data's PCA, the distance between replicates is much smaller than between siRNA treatments, showing high-quality data. Compared to the control, siRNA groups show a shared direction (bottom-right) as well as individual movements.

Next, apply hierarchical clustering. The heatmap shows that while many genes are commonly upregulated (red) across all siRNAs, downregulated genes (blue) vary by siRNA. An analyst's job is not just to make lists, but to notice these features and gain insights into what is happening inside the cell.

RNA-Seq Tutorial (05) - PCA and Clustering

【Start your 5-day free trial】
▶ Explore global patterns with PCA and clustering

Back to TOC

6. Extracting Differentially Expressed Genes (DEGs)

Key Takeaway: Don't just rely on P-value cutoffs; narrow down genes multi-dimensionally using tools like Venn diagrams. Clarify what "moves commonly" and what "moves uniquely" between different conditions to decide the next target for consideration.

As seen in clustering, upregulated genes are common across the three siRNAs, while downregulated genes have few commonalities. Extract DEGs using Basic Plug-in tools and verify with the Venn Diagram tool.
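The logic behind a Venn diagram is simply set intersection and difference. As a sketch (the gene names below are made up), plain Python sets already express the "common vs. unique" breakdown:

```python
# Toy DEG lists for three siRNA treatments vs. control (hypothetical genes).
deg_sirna_a = {"GENE1", "GENE2", "GENE3", "GENE4"}
deg_sirna_b = {"GENE1", "GENE2", "GENE5"}
deg_sirna_c = {"GENE1", "GENE3", "GENE6"}

# Venn-style breakdown with set operations.
common_to_all = deg_sirna_a & deg_sirna_b & deg_sirna_c   # moves commonly
unique_to_a   = deg_sirna_a - deg_sirna_b - deg_sirna_c   # moves uniquely in A

print(sorted(common_to_all))
print(sorted(unique_to_a))
```

The Venn Diagram tool performs the same operations visually, which makes it easier to spot which overlaps deserve follow-up.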

RNA-Seq Tutorial (06) - Extracting DEGs

【Start your 5-day free trial】
▶ Identify differentially expressed genes (DEGs)

Back to TOC

7. Gene Annotation and Enrichment Analysis

Key Takeaway: Translate a "list of numbers" into "biological language." Determine which cellular functions (GO terms) or pathways are significantly enriched among the varying genes to find clues for discovery.

RNA-Seq count/FPKM tables often contain only Gene Symbols or IDs (ENSG, ENST, Entrez). You need to fetch gene annotation tables from database sites and fill them in. In this tutorial, we use the NCBI FTP Site, but if your data has ENSG or ENST IDs, please obtain annotations from Ensembl BioMart. For other IDs, use the specific database site that manages those annotations.

Once the gene annotation is imported, you can use search functions and the Enrichment Analysis tool in the Advanced Plug-in.

Terms like GO analysis, Pathway analysis, and Network analysis often refer to "Enrichment Analysis." Tools like GSEA, IPA, DAVID, and Metascape perform this. See our detailed Enrichment Analysis tutorial for more.
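The statistic underlying most over-representation tests is the hypergeometric distribution: given how many of your DEGs carry a term, is that more overlap than chance would produce? A minimal sketch with the Python standard library (the gene counts below are made-up illustrative numbers):

```python
from math import comb

def hypergeom_pval(N, K, n, k):
    """P(X >= k) when drawing n genes from N measured genes,
    of which K are annotated with the term of interest
    (one-sided over-representation test)."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Toy numbers: 10,000 genes measured, 200 annotated with a GO term,
# 100 DEGs, 10 of which carry that term (~2 expected by chance).
p = hypergeom_pval(N=10_000, K=200, n=100, k=10)
print(f"{p:.2e}")  # far more overlap than expected: strongly enriched
```

Enrichment tools run this kind of test across thousands of terms and then correct for multiple testing, which is why they report adjusted p-values.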

RNA-Seq Tutorial (07) - Gene Annotation and Enrichment Analysis

【Start your 5-day free trial】
▶ Discover GO terms and pathways from your gene list

Back to TOC

8. Genomic Location-Specific Regulation and Motif Search

Key Takeaway: Approach new regulatory mechanisms from physical positional relationships. Infer the "cause" of expression variation by examining changes dependent on specific genomic regions or associations with transcription factor "motif sequences."

Once you have extracted DEGs or genes with specific expression patterns through clustering, you may want to check whether these genes are distributed unevenly across the genome. If a cluster of genes in a specific region shows a synchronized increase or decrease in expression, it could hint at localized epigenetic changes or structural variations in the chromosome.

If no such bias is observed, it may be more likely that the expression is regulated by transcription factors. You can identify genes having motif sequences near their TSS. Exploring these spatial relationships may provide additional layers of insight into your data. For your reference, the motif sequence search supports IUPAC nucleotide codes.
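IUPAC codes make a motif a pattern rather than a fixed string, and that pattern maps directly onto a regular expression. A minimal sketch (the motif and promoter sequence below are hypothetical, not the actual ERR-alpha motif):

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "S": "[GC]", "W": "[AT]",
         "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
         "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}

def motif_to_regex(motif):
    """Convert an IUPAC motif like 'TGACCTTGR' to a regex pattern."""
    return "".join(IUPAC[base] for base in motif.upper())

# Hypothetical motif searched in a hypothetical promoter sequence.
pattern = re.compile(motif_to_regex("TGACCTTGR"))
promoter = "CCATTGACCTTGATTACA"
match = pattern.search(promoter)
print(match.group() if match else "no match")
```

The motif search in Subio Platform applies the same degenerate-code matching near each gene's TSS, so understanding the code table helps you write motifs that are neither too strict nor too permissive.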

In this specific case, there appears to be no expression control biased toward genomic locations. However, we were able to find the binding motif sequence for ERR alpha (which was knocked down by siRNA) on Wikipedia and extract genes possessing this motif near their TSS. By checking the overlap between these genes and the list of genes downregulated by all three types of siRNA, we arrived at a single gene. Nevertheless, as mentioned earlier, it may not be necessary for a gene to be suppressed by all three siRNAs, or the threshold for differential expression analysis might have been too stringent. By relaxing these conditions, the number of candidate genes directly regulated by ERR alpha will likely expand.

RNA-Seq Tutorial (08) - Genomic Location Specific Regulation & Motif Seq

【Start your 5-day free trial】
▶ Explore genes with consensus motifs near TSS

Back to TOC

9. Conclusion: From Using Tools to Becoming a Decision-Maker

This tutorial has focused on introducing methods. However, in real-world data analysis, it is not the tools but the analyst who must decide which thresholds are appropriate and how to handle issues such as batch effects. It is important to recognize the role of subjective, informed decision-making beyond objective numbers.

In this tutorial, we have walked through the standard workflow of RNA-Seq data analysis. In practice, however, data analysis rarely follows a single linear path. The “correct” approach depends on the characteristics of the data and the goals of the study.

Once you have learned the methods, the next and most essential step is to evolve into an analyst who can make appropriate decisions based on informed judgment.

Real-world data is never perfect. This is precisely why analysts must complement incomplete information with their own judgment—grounded in both the data and the research objective.

This is what defines a resilient analyst—one whose value remains intact even in the age of AI.

Subio Platform provides the analytical foundation for this transition—from a user who simply operates tools to a researcher who interprets data and makes decisions. In addition to the software itself, we support this process through data analysis services and online training.

Now, try applying this approach to your own data—and experience the difference.

▶ Analyze your own data now (Data Analysis Service)
▶ Learn RNA-Seq analysis with your own data (Online Training)

Back to TOC