How RNA-Seq Data Analysis Works: A Practical Approach
RNA-Seq data analysis can seem complex, especially for beginners.
This page provides a practical guide to RNA-Seq data analysis, explaining the basic workflow (from preprocessing and normalization to downstream analysis) and key concepts in a clear and structured way. Instead of simply following predefined steps, you will learn how to examine and understand your data, and how to interpret it at each step of the RNA-Seq workflow.
In real-world data analysis, however, following a standard pipeline does not always lead to reliable results.
When beginning RNA-Seq analysis, many researchers quickly encounter a major hurdle: working with R and various analysis packages.
Time is often consumed by writing code and troubleshooting errors, leaving little room for what truly matters—interpreting the data and deriving biological insights.
At the same time, GUI-based automated analysis tools have become increasingly popular. While they allow users to obtain results quickly, they can also encourage analyses to proceed without carefully examining preprocessing steps, normalization, or batch effects.
As a result, the analysis becomes a black box, and opportunities to detect and correct incorrect conclusions may be lost.
So how can you analyze RNA-Seq data while truly understanding what you're doing?
________________________________________
Become an Analyst Who Thinks — Not Just Runs Tools — in the Age of AI
Subio Platform was designed as an environment for thinking through data by visualizing it.
By emphasizing visualization and combining it with essential statistical functions, it enables you to carry out analysis with clear intent—rather than relying on black-box processes.
In this tutorial, you will follow the standard RNA-Seq analysis workflow while learning how to examine your data and make informed decisions using Subio Platform.
This guide is designed to help you move from “running an analysis and getting results somehow” to “understanding your data and being able to explain your analysis with confidence.”
________________________________________
For those who need results quickly
This page is intended for those who want to understand RNA-Seq analysis by working through the tutorial themselves.
However, depending on your situation, a shorter and more efficient path may be available:
▶ I want to analyze my own data as quickly as possible
→ Data Analysis Service: We deliver not only processed results, but also an interactive SSA file that allows you to explore the data freely.
▶ I want to learn the workflow quickly using my own data
→ Online Training: Learn the essential analysis workflow efficiently using your own data. Sessions can be recorded for future reference.
▶ I want to start with the basics using demo data
→ Continue with this tutorial.
________________________________________
Tutorial Workflow
Instead of just performing "tasks," let's move forward by understanding how the data changes at each step.
- Data Preparation and Import
- 1-a. Importing FASTQ Files (Processing from Raw Data)
Note: If you get stuck here, feel free to outsource this step and move to 1-b.
- 1-b. Importing Gene-Level Expression Data and Preparation
- Series Creation and Setup: Organizing Experimental Info
- Normalization and Pre-processing: Aligning Data "Distortions"
- Filtering (Quality Control): Extracting Genes Worthy of Analysis
- PCA and Clustering: Gaining a Bird's-Eye View
- Extracting Differentially Expressed Genes (DEGs)
- Enrichment Analysis
- Genomic Location-Specific Analysis and Motif Search
- Conclusion: Mastering the Tools and Becoming a "Decision-Maker"
________________________________________
1-a. Importing FASTQ Files
Processing FASTQ files is often the first major hurdle for beginners and a step that tends to consume a significant amount of time.
However, FASTQ processing is not the core of research; it is closer to an execution process that follows defined procedures.
Learning command-line tools, overcoming errors, and successfully completing a pipeline can certainly bring a sense of accomplishment. However, this stage is merely a preparatory step. The core of research—understanding the data and deriving biological insights—lies beyond this point.
What truly deserves the most time and attention is the analysis and interpretation that follow. Spending excessive time and effort on these early steps can make the overall process inefficient.
If you find yourself stuck at this stage, it is perfectly reasonable to delegate this part—to colleagues, services, or even AI.
What matters more is how you approach the analysis itself. (See "How to Learn RNA-Seq Data Analysis")
At the same time, tools that allow you to “just input FASTQ files and click a button” can be useful for processing FASTQ files and obtaining gene counts (e.g., platforms such as Galaxy). While these tools can efficiently generate gene-level expression values, the underlying processing steps are not always fully transparent. Using them for downstream analysis can therefore easily turn the process into a black box.
In addition, some web-based tools, such as RaNA-Seq, automate the analysis process. However, they may not provide direct access to gene-level count data.
In RNA-Seq analysis, statistical methods are typically built on gene counts. In particular, widely used approaches such as edgeR and DESeq2 are designed to work with count data.
For this reason, if you plan to use these methods, it is best to choose a workflow that provides gene counts.
When no alternative is available, it is still possible to proceed with normalized values such as TPM. However, it is important to keep in mind that these data are based on different assumptions.
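To see why count data and normalized values rest on different assumptions, here is a rough sketch in Python with made-up toy numbers (not from any real dataset): CPM corrects only for library size, while TPM also corrects for gene length, so TPM values are no longer the raw counts that count-based statistical models expect.

```python
# Sketch: how CPM and TPM are derived from raw counts (hypothetical toy data).
# CPM corrects for library size only; TPM also corrects for gene length,
# so TPM rests on different assumptions than raw counts do.

counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}    # raw reads per gene
lengths_kb = {"geneA": 2.0, "geneB": 1.5, "geneC": 4.0}  # gene length in kb

library_size = sum(counts.values())

# CPM: counts per million mapped reads
cpm = {g: c / library_size * 1e6 for g, c in counts.items()}

# TPM: first normalize by length (reads per kilobase), then scale to one million
rpk = {g: counts[g] / lengths_kb[g] for g in counts}
rpk_total = sum(rpk.values())
tpm = {g: r / rpk_total * 1e6 for g, r in rpk.items()}

print(round(cpm["geneA"]), round(tpm["geneA"]))
```

Note that TPM values always sum to one million per sample, which is exactly the kind of built-in constraint that makes them unsuitable as direct input for count-based methods.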
If you already have data converted to gene-level expression values (gene counts, or less ideal measures such as RPKM/FPKM, TPM, or CPM), you can skip this step and proceed directly to Section 1-b. It does not matter whether you processed the data yourself, asked someone else to process it, or used a tool.
________________________________________
In this section (1-a), we will explain how to complete this preprocessing step within Subio Platform. However, keep in mind that this is still a preparatory stage—the essential analytical work lies ahead.
When importing RNA-Seq FASTQ files into Subio Platform, a pipeline consisting of fastp, HISAT2, and StringTie is executed to calculate gene-level expression values, including gene counts and FPKM.
This allows you to directly use data in formats required for downstream analysis, including count data suitable for statistical analysis.
No workstation or UNIX command-line skills are required. The process can be run on a standard Windows or Mac computer.
Even for users familiar with command-line tools, Subio Platform provides a more streamlined way to perform preprocessing and obtain expression data. This allows you to focus on interpreting the data rather than managing the processing steps.
Before running the pipeline, these tools must be installed. If this seems difficult, please consider our FASTQ Processing Troubleshooting Service.
Keep FASTQ files in .gz compressed format. A large number of intermediate files will be generated in the same folder, so ensure you have sufficient disk space (at least 5 times the total size of the FASTQ files). If the disk runs out during execution, it will stop. FASTQ files can be stored on an external drive.
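Before starting a long run, it can save time to confirm the headroom up front. Here is a minimal sketch of the "at least 5 times" rule of thumb; the numbers are placeholders, and `shutil.disk_usage` is queried for the current drive, so adjust the path to wherever your FASTQ files live.

```python
# Sketch: estimate whether a drive has enough free space for FASTQ
# processing, using the ~5x rule of thumb described above.
import shutil

def enough_space(total_fastq_bytes: int, free_bytes: int, factor: int = 5) -> bool:
    """Return True if free space covers `factor` times the FASTQ size."""
    return free_bytes >= factor * total_fastq_bytes

# Example with made-up numbers: 40 GB of compressed FASTQ files
total_fastq = 40 * 1024**3
free = shutil.disk_usage(".").free  # free space on the current drive
print(enough_space(total_fastq, free))
```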
For Paired-end samples, please follow the naming rules for FASTQ files. If you encounter issues, try our troubleshooting guide.
________________________________________
1-b. Importing Gene-Level Expression Data and Preparation
Preparing RNA-Seq data may seem complex at first, but once you understand the basic workflow, it becomes straightforward.
In this section, we use the GSE49110 dataset to walk through the essential steps of RNA-Seq data preparation, including obtaining gene counts data and organizing annotation. This workflow is not limited to this dataset—it can be applied to many public datasets, such as those from GEO.
In this example, we use GEO data and retrieve annotation from NCBI. However, many RNA-Seq datasets use Ensembl gene IDs. In such cases, you can obtain the corresponding annotation using Ensembl BioMart (MartView).
For those who want to get started more quickly, an SSA file is also provided. By using the SSA file, you can skip the data preparation steps and begin exploring the dataset immediately. It also allows you to review the entire analysis workflow, modify parameters, and reanalyze the data interactively.
Note: The SSA file contains the full analysis workflow and data, but external resources such as pathway databases and publication PDFs are not included due to copyright restrictions.
Although a significant portion of the video focuses on creating and editing a Platform, this step only needs to be done once. Once created, the Platform can be reused, allowing you to start new analyses immediately in future projects.
For detailed step-by-step instructions, please follow along with the video.
RNA-Seq Tutorial (01) - Importing RNA-Seq Counts Data
▶ Get your data analyzed now (data analysis service)
▶ Master RNA-Seq analysis using your own data (online training)
▶ Start exploring immediately with an SSA file (90-second demo)
________________________________________
2. Series Creation and Setup
Once the Series is loaded, objects such as the Measurement List and DataSet will appear in the Series panel on the left.
In the Analysis Browser, a Scatter Plot is displayed in the upper pane by default, while the Setup Series tab opens in the lower pane.
Start by clicking the "Edit Parameters" button.
In most cases, importing information from the "Sample Information" is the fastest approach.
In this dataset, there are four conditions (control and three types of siRNA treatment), each with two replicates.
Next, move to the Setup DataSet tab to define and organize your DataSets.
This step determines how samples are grouped and compared, which directly impacts the interpretation of the results.
In this tutorial, we create two DataSets.
You can also record additional information in the Sample Info tab and attach related files for reference.
________________________________________
3. Normalization and Pre-processing
Normalization and preprocessing are critical steps, as their settings can significantly influence downstream analysis results.
If the characteristics of the data are not properly understood and handled accordingly, this may lead to incorrect conclusions.
In this tutorial, we start from the preset scenario "RNA-Seq (Counts)" and adjust it based on the data.
Through this process, you will learn how to interpret the data—what to look at (and where), and how to make informed decisions.
The settings introduced here are only examples and are not universally applicable to all datasets.
In real analyses, it is essential to adjust parameters according to the specific characteristics of your data.
If you are unsure, consider using our data analysis service.
Low Signal Cutoff Setting
Setting this value higher removes more unstable genes but increases the risk of missing potentially interesting ones (false negatives).
Conversely, setting it lower keeps more unstable genes (false positives) but reduces the risk of missing relevant signals.
In other words, this represents a trade-off between false positives and false negatives.
It is often said that minimizing false positives is important, but that applies mainly to the final stage of research.
In many cases, omics analysis is used for discovery, where avoiding false negatives is often more critical.
If you want to avoid missing potentially interesting genes—especially those in low-expression regions—it is better to set the Low Signal Cutoff to a lower value.
On the other hand, if your goal is to identify highly reliable biomarker candidates, a higher cutoff is more appropriate.
Fill Missing Value Setting
For this parameter, enter a value slightly lower than the threshold set for the Low Signal Cutoff.
Note that the input is interpreted on a logarithmic scale. For example, entering “4” corresponds to 2^4 = 16.
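The interplay between the two settings can be sketched in a few lines of Python. This is a simplified illustration of the idea, not Subio Platform's internal implementation, and the numbers are examples only: raw signals below 2^4 = 16 are treated as unreliable and replaced with a constant fill value slightly below the cutoff.

```python
# Sketch of the Low Signal Cutoff / Fill Missing Value logic on a log2 scale
# (a simplified illustration, not Subio Platform's internal implementation).
import math

def apply_low_signal_cutoff(raw_signals, cutoff_log2=4.0, fill_log2=3.5):
    """Replace signals below 2**cutoff_log2 with a constant fill value.

    Both thresholds are given on a log2 scale: a cutoff of 4 means
    raw signals below 2**4 = 16 are treated as unreliable.
    """
    out = []
    for s in raw_signals:
        if s < 2 ** cutoff_log2:
            out.append(fill_log2)   # fill value slightly below the cutoff
        else:
            out.append(math.log2(s))
    return out

print(apply_low_signal_cutoff([8, 16, 64]))
```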
In this way, the ability to flexibly adjust parameters according to the purpose of the analysis is one of the key strengths of using Subio Platform.
While intentionally setting a lower Low Signal Cutoff can be effective, it requires careful interpretation of the results.
To properly understand the outcome, it is important not only to focus on final p-values, but also to understand the overall analysis workflow.
For related topics, please also refer to the following:
RNA-Seq Tutorial (03) - Normalization and Pre-Processing
▶ Get your data analyzed now (data analysis service)
▶ Master RNA-Seq analysis using your own data (online training)
________________________________________
4. Filtering (Quality Control)
Key Takeaway: “Garbage in, garbage out.” By removing low-expression, unreliable genes and uninformative noise, you can focus on signals that are biologically meaningful.
As discussed in the normalization step, not all measured values are equally reliable.
Before proceeding with the analysis, it is essential to identify which values are suitable for analysis and ensure that unreliable values do not interfere with the results.
For this purpose, we use the Filter tool included in the Basic Plug-in.
If you do not have the plug-in, you can try it with a 5-day free trial.
Fundamental Concept of Filtering
Filtering is not about “selecting reliable genes.”
It is about removing genes that are not suitable for analysis.
This distinction is critical.
If you focus only on selecting “reliable” genes, you may miss genes that are not expressed in the control condition but become expressed after treatment—potentially important biological signals.
Try using the Filter tool to observe this difference in practice.
Filtering Should Be Performed in Two Steps
In most cases, filtering is performed in two stages:
- Remove genes with extremely low expression
- Remove genes with little or no variation across samples
Are Common Thresholds Always Valid?
For example, when learning RNA-Seq analysis with DESeq2, you may often encounter code like this:
dds <- dds[rowSums(counts(dds)) >= 10,]
Have you ever thought:
“Is this threshold really appropriate for any dataset?”
Even if you have doubts, it is easy to copy and paste such code without fully understanding it.
Thresholds Are Not Universal Rules
These thresholds are widely used as rules of thumb, but they are not appropriate for all datasets.
The key point is that such thresholds are not universal rules.
They are simple pre-filters designed to remove extremely low-expression genes, and the appropriate cutoff depends on the characteristics of the data and the purpose of the analysis.
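One simple way to avoid copying a threshold blindly is to check how many genes survive at several cutoffs before committing to one. The sketch below uses a tiny made-up counts table purely for illustration:

```python
# Sketch: check how many genes survive a rowSums-style pre-filter at
# different cutoffs before committing to one (toy counts, made up here).

counts = {
    "geneA": [0, 1, 0, 2],         # near-zero everywhere
    "geneB": [3, 4, 2, 5],         # low but consistent
    "geneC": [120, 98, 130, 110],  # clearly expressed
}

for cutoff in (1, 10, 50):
    kept = [g for g, row in counts.items() if sum(row) >= cutoff]
    print(cutoff, len(kept), kept)
```

Even this toy example shows how a cutoff of 10 versus 50 changes which genes enter the analysis, which is exactly why the choice should follow from the data rather than from copied code.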
How Should You Determine the Threshold?
The answer is simple:
Look at the data.
You need to examine distributions, plots, and overall patterns to make informed decisions.
For TPM/FPKM data or low-input expression datasets, the boundary between noise and signal can be ambiguous, making it difficult to define a simple threshold based on raw signal values (e.g., Ch1 Raw Signal).
In such cases, filtering based on the coefficient of variation (CV) within Sample Groups defined in the Setup DataSet step can be effective.
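The idea behind CV-based filtering can be sketched as follows. This is a conceptual illustration with hypothetical group labels and values, not Subio Platform's Filter tool itself: a gene is kept only if its within-group variability stays below a chosen bound in every Sample Group.

```python
# Sketch: coefficient-of-variation (CV) filtering within sample groups,
# as an alternative when a raw-signal threshold is ambiguous.
# Group labels and values are hypothetical.
import statistics

def group_cv(values):
    """CV = standard deviation / mean (undefined for a zero mean)."""
    m = statistics.mean(values)
    return statistics.stdev(values) / m if m != 0 else float("inf")

# expression values per gene, split by Sample Group (e.g., control vs treated)
genes = {
    "geneA": {"control": [10.0, 10.5], "treated": [10.2, 9.8]},  # stable
    "geneB": {"control": [1.0, 9.0], "treated": [0.5, 8.0]},     # noisy
}

max_cv = 0.5  # keep genes whose CV stays below this in every group
kept = [g for g, groups in genes.items()
        if all(group_cv(v) < max_cv for v in groups.values())]
print(kept)
```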
For small datasets, it is often most reliable to iteratively inspect the data and adjust filtering as you proceed. For larger datasets, a practical approach is to randomly select around 100 samples, perform exploratory analysis to determine appropriate thresholds, and then apply them to the full dataset.
Using sample code without examining the data may lead to misleading or suboptimal results.
RNA-Seq Tutorial (04) - Filtering (Quality Control)
【Start your 5-day free trial】
▶ Remove noisy genes with filtering
________________________________________
5. PCA and Clustering
At this point, the data is ready for analysis.
We now move on to extracting meaningful insights from the data.
Note that the analysis should be performed not on all genes, but only on those that have passed Quality Control through filtering.
First, it is important to obtain an overview of the expression profiles and understand the overall structure of the data.
The most suitable method for this purpose is Principal Component Analysis (PCA).
A key advantage of PCA is that not only distances between samples, but also directions, have clear meaning.
How to Read PCA
When interpreting PCA results, the following three points are essential:
1. Distance
If two points are close, their expression profiles are similar; if they are far apart, they are substantially different.
2. Direction
Samples located in the same direction (from the origin) indicate variation driven by similar groups of genes.
If samples are positioned in opposite directions across the origin, this indicates that their expression changes are in opposite directions.
In addition, the farther a point is from the origin, the larger the magnitude of the variation.
3. Principal Components
Each axis (principal component) represents variation driven by different groups of genes.
Components with higher explained variance capture the overall structure of the data more strongly, but they are not necessarily biologically more important.
In some cases, variation in a small number of genes—captured by components with lower explained variance—may be more biologically meaningful.
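The "distance" and "direction" rules above can be made concrete with a tiny toy example. The following pure-Python sketch (not how Subio Platform computes PCA) finds PC1 by power iteration on a made-up samples-by-genes matrix in which two treated samples shift together away from two control replicates; their PC1 scores land on the opposite side of the origin from the controls, with similar magnitudes.

```python
# Sketch: a tiny PCA via power iteration on a toy expression matrix
# (samples x genes), to illustrate "distance" and "direction" on PC1.
# Pure-stdlib illustration, not how Subio Platform computes PCA.

def pca_first_component(rows):
    n, d = len(rows), len(rows[0])
    # center each column (gene) so the origin is the average profile
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    X = [[r[j] - means[j] for j in range(d)] for r in rows]
    # power iteration on X^T X to find the leading eigenvector (PC1 loading)
    v = [1.0] * d
    for _ in range(200):
        Xv = [sum(X[i][j] * v[j] for j in range(d)) for i in range(n)]
        w = [sum(X[i][j] * Xv[i] for i in range(n)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # scores: projection of each sample onto PC1
    return [sum(X[i][j] * v[j] for j in range(d)) for i in range(n)]

# two control replicates, and two treated samples shifted in the same direction
samples = [[10, 10, 10], [10.2, 9.9, 10.1], [14, 13.8, 14.1], [14.1, 14, 13.9]]
scores = pca_first_component(samples)
print([round(s, 2) for s in scores])
```

Replicates end up close together (small within-group distance), and the shared shift of the treated pair shows up as a common direction along PC1, which is exactly the reading applied to the real dataset below.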
Based on these principles, let us examine the PCA results for this dataset.
The distances between replicates (within-group) are clearly smaller than the distances between siRNA treatments (between-group), indicating that the data is of high quality with low variability.
Compared to the control, all siRNA-treated samples show a shared shift in the same direction (bottom-right), suggesting a common pattern of variation.
At the same time, each treatment exhibits distinct variation: siE2 shifts primarily to the right (PC1), while siE3 shifts downward (PC2).
In addition, although not visible in this 2D PCA plot, another principal component (PC3) captures the characteristic variation of siE1.
How to Read Hierarchical Clustering
Next, we apply hierarchical clustering.
From the heatmap, we can observe that many genes are commonly upregulated (red) across all siRNA treatments.
In contrast, downregulated genes (blue) differ between individual siRNA conditions.
Integrating PCA and Clustering Interpretations
The point is not to create figures and feel satisfied, nor merely to look at them.
The essence of analysis lies in recognizing these patterns and differences, and interpreting what is happening within the cells.
PCA and hierarchical clustering are essentially different ways of viewing the same underlying data structure.
By comparing both results, you can deepen your understanding of the data.
In the normalization and preprocessing step, log ratio transformation was applied. The meaning of the PCA origin depends on what is used as the reference (denominator) in this transformation.
When centering is used, the origin represents the average expression pattern of each gene. When using Ratio to Control Samples, the origin represents the average expression of the designated control group.
In this case, the siC-treated samples appear near the origin because they serve as the reference.
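The difference between the two references can be shown with toy log2 values for a single gene (the numbers below are invented for illustration): centering subtracts the overall mean, whereas Ratio to Control Samples subtracts the control-group mean, which is why the controls then sit near the origin.

```python
# Sketch: how the reference in a log-ratio transform changes what the
# PCA origin means (centering vs. ratio-to-control), using toy log2 values.
import statistics

samples = {"siC_1": 5.0, "siC_2": 5.2, "siE2_1": 7.0, "siE2_2": 7.2}

# Centering: subtract the per-gene mean -> the origin is the average profile
mean = statistics.mean(samples.values())
centered = {s: round(v - mean, 2) for s, v in samples.items()}

# Ratio to Control Samples: subtract the control-group mean
# -> the origin is the control group's average, so controls sit near it
control_mean = statistics.mean([samples["siC_1"], samples["siC_2"]])
ratio_to_control = {s: round(v - control_mean, 2) for s, v in samples.items()}

print(centered)
print(ratio_to_control)
```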
Methods such as UMAP are also widely used today, but they primarily represent structure based on distances. In contrast, PCA provides meaningful information about both distance and direction, making it particularly useful for interpreting the underlying sources of variation.
RNA-Seq Tutorial (05) - PCA and Clustering
【Start your 5-day free trial】
▶ Explore global patterns with PCA and clustering
________________________________________
6. Identifying Differentially Expressed Genes
In this case, as observed in the clustering results from the previous section, genes that are upregulated tend to be shared across the three siRNA conditions, whereas genes that are downregulated show very little overlap. After extracting differentially expressed genes using the Basic Plug-in tools, examine these patterns using the Venn Diagram tool.
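The same kind of overlap check can be expressed with plain set operations. The gene names below are made up purely to mirror the pattern described above: upregulated genes largely shared across the three conditions, downregulated genes barely overlapping.

```python
# Sketch: checking DEG overlap across the three siRNA conditions with
# plain set operations (gene names here are made up).

up = {
    "siE1": {"g1", "g2", "g3", "g5"},
    "siE2": {"g1", "g2", "g3", "g6"},
    "siE3": {"g1", "g2", "g4", "g7"},
}
down = {
    "siE1": {"g10", "g11"},
    "siE2": {"g12", "g13"},
    "siE3": {"g10", "g14"},
}

common_up = up["siE1"] & up["siE2"] & up["siE3"]
common_down = down["siE1"] & down["siE2"] & down["siE3"]
print(sorted(common_up))    # upregulated genes shared by all three
print(sorted(common_down))  # downregulated genes shared by all three
```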
In practice, it is not uncommon to see comparisons between tools such as DESeq2 and edgeR, where results differ—genes deemed significant in one analysis may not be in another. However, such differences often arise from variations in threshold settings or statistical models, and do not, by themselves, directly reflect biological meaning (see Case Study No.355).
Rather, it is crucial to pick up on the subtle inconsistencies in the data—why there is little overlap among downregulated genes, and why upregulated genes tend to show greater commonality.
What mechanisms define the boundary between what is “specific” and what is “common”? From the perspective of a system in which different inputs can lead to similar outputs, how should we interpret this relationship? Would time-course data help reveal these dynamics?
In this way, rather than focusing solely on P-values, it is essential to integrate insights from clustering and PCA to develop hypotheses—and to ask what experiments are needed to test them. Formulating such questions is the essence of analysis.
A small p-value does not necessarily mean biological importance.
Rather, what matters when forming hypotheses is not to focus only on genes with small p-values. Even genes that do not appear statistically significant may exhibit consistent patterns within expression profiles or clustering, and such patterns can carry important biological insights (Case Study No.199).
This may not be directly related to this specific case, but when you want to capture expression changes from a small subset of cells within a sampled tissue, it can be useful to combine multiple criteria such as:
- low expression levels
- low variability
- small changes in expression
What matters is to imagine how the genes of interest would appear in the data, and to devise ways to capture them.
These are precisely the skills required of analysts in the age of AI.
RNA-Seq Tutorial (06) - Extracting DEGs
【Start your 5-day free trial】
▶ Identify differentially expressed genes (DEGs)
________________________________________
7. Enrichment Analysis
Key Takeaway: Translating a “list of numbers” into “biological language” provides clues for interpreting the results of your analysis.
The next step after statistical analysis is to translate these results into biological meaning.
You can use the Enrichment Analysis tool included in the Advanced Plug-in of Subio Platform, or various freely available web-based tools. These tools allow you to interpret gene lists in terms of biological meaning.
Terminology can be confusing, but many approaches referred to as Gene Ontology (GO) analysis or pathway analysis are, in practice, forms of enrichment analysis.
Gene Set Enrichment Analysis (GSEA), DAVID Functional Annotation, and Metascape are among the most widely used tools for performing enrichment analysis.
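Under the hood, most overlap-based enrichment tools ask the same question: is the number of your genes falling in an annotated set larger than chance would predict? A common way to answer it is a hypergeometric test, sketched below with the standard library and toy numbers (not a real gene set or pathway):

```python
# Sketch: the hypergeometric test behind most overlap-based enrichment
# analyses, implemented with the standard library (toy numbers only).
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when drawing n genes from N, of which K are in the set.

    N: total genes in the background
    K: background genes annotated to the term
    n: size of your gene list
    k: genes in your list annotated to the term
    """
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# e.g., 8 of your 100 DEGs fall in a 200-gene pathway, background of 10000
p = hypergeom_pvalue(N=10000, K=200, n=100, k=8)
print(f"{p:.4f}")
```

Note that the background (N) is itself a choice: restricting it to the genes that passed your filtering, rather than the whole genome, can change the p-values substantially.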
Enrichment analysis does not provide definitive conclusions; rather, it offers clues for interpretation.
Because enrichment analysis relies on existing literature and databases, several limitations should be considered:
- Results tend to be biased toward well-studied areas.
- When only a small number of genes are associated with a given phenomenon, it can be difficult to determine whether this reflects a true biological property or simply a lack of prior research.
- Phenomena observed in specific tissues or cell types may be presented as if they were general.
It is not uncommon to feel that the analysis is complete once enrichment results are obtained.
However, this is only the starting point for interpretation.
RNA-Seq Tutorial (07) - Gene Annotation and Enrichment Analysis
【Start your 5-day free trial】
▶ Discover GO terms and pathways from your gene list
________________________________________
8. Genomic Location-Specific Regulation and Motif Search
Once you have extracted DEGs or genes with specific expression patterns through clustering, you might want to check if these genes are distributed unevenly across the genome. If a cluster of genes in a specific region shows a synchronized increase or decrease in expression, it could potentially hint at localized epigenetic shifts or structural variations in the chromosome.
If no such bias is observed, it may be more likely that the expression is regulated by transcription factors. You can identify genes having motif sequences near their TSS. Exploring these spatial relationships may provide additional layers of insight into your data. For your reference, the motif sequence search supports IUPAC nucleotide codes.
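The core of an IUPAC-aware motif search is simply expanding each ambiguity code into the set of bases it represents. The sketch below is a minimal illustration of that idea, not Subio Platform's search engine; the motif and promoter sequences are invented for the example.

```python
# Sketch: expanding IUPAC nucleotide codes into a regular expression to
# scan promoter sequences for a motif (sequences here are made up).
import re

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[GC]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(motif):
    return "".join(IUPAC[base] for base in motif.upper())

# TGACCTY is a hypothetical example motif, not a verified binding site
pattern = re.compile(iupac_to_regex("TGACCTY"))

promoters = {
    "geneA": "ATTGACCTTGGA",  # contains TGACCTT (Y matches T)
    "geneB": "ACGTACGTACGT",  # no match
}
hits = [g for g, seq in promoters.items() if pattern.search(seq)]
print(hits)
```

A real search would also scan the reverse complement of each sequence; the forward-strand scan here is kept short to show only the IUPAC expansion.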
In this specific case, there appears to be no expression control biased toward genomic locations. However, we were able to find the binding motif sequence for ERR alpha (which was knocked down by siRNA) on Wikipedia and extract genes possessing this motif near their TSS. By checking the overlap between these genes and the list of genes downregulated by all three types of siRNA, we arrived at a single gene. Nevertheless, as mentioned earlier, it may not be necessary for a gene to be suppressed by all three siRNAs, or the threshold for differential expression analysis might have been too stringent. By relaxing these conditions, the number of candidate genes directly regulated by ERR alpha will likely expand.
RNA-Seq Tutorial (08) - Genomic Location Specific Regulation & Motif Seq
【Start your 5-day free trial】
▶ Explore genes with consensus motifs near TSS
________________________________________
9. Conclusion: From Using Tools to Becoming a Decision-Maker
This tutorial has focused on introducing methods. However, in real-world data analysis, it is not the tools but the analyst who must decide which thresholds are appropriate and how to handle issues such as batch effects. It is important to recognize the role of subjective, informed decision-making beyond objective numbers.
In this tutorial, we have walked through the standard workflow of RNA-Seq data analysis. In practice, however, data analysis rarely follows a single linear path. The “correct” approach depends on the characteristics of the data and the goals of the study.
Once you have learned the methods, the next and most essential step is to evolve into an analyst who can make appropriate decisions based on informed judgment.
Real-world data is never perfect. This is precisely why analysts must complement incomplete information with their own judgment—grounded in both the data and the research objective.
This is what defines a resilient analyst—one whose value remains intact even in the age of AI.
Subio Platform provides the analytical foundation for this transition—from a user who simply operates tools to a researcher who interprets data and makes decisions. In addition to the software itself, we support this process through data analysis services and online training.
Now, apply this approach to your own data and experience the difference firsthand.
If you are just starting to learn RNA-Seq data analysis, Subio offers two approaches depending on your goals.
■ If you would like to first obtain analysis results and focus on learning through validation and interpretation, consider starting with our data analysis service.
→ Learn more about the Data Analysis Service
■ If you would like to understand and make decisions at each step—from preprocessing to analysis—and become able to carry out the entire workflow on your own, we recommend our online training.
→ Learn more about the Online Training