Bulk RNA-Seq Data Analysis Tutorial: Learn the Workflow and How to Interpret Results

What Happens in RNA-Seq Data Analysis—and How Do You Interpret the Results?

This page explains the basic workflow and concepts of RNA-Seq data analysis
using public data from GEO (Gene Expression Omnibus).

RNA-Seq data analysis usually involves multiple steps, including FASTQ preprocessing, mapping, generating gene counts, normalization, filtering, PCA, clustering, differential expression analysis (DEG analysis), and enrichment analysis such as pathway and Gene Ontology (GO) analysis.

If you would like to review the overall workflow of gene expression data analysis, including RNA-Seq, please also see the following guide.
Analysis Guide

Introduction: What This Tutorial Aims to Help You Achieve

The important point in RNA-Seq data analysis is not simply to follow procedures. At each step, you need to check the state of the data and understand how to evaluate the results and make informed decisions.

This tutorial is designed to help you learn RNA-Seq data analysis without coding. Through data visualization, you will learn how to read analysis results and develop the ability to make your own decisions.

Before moving on to advanced statistical modeling or large-scale pipeline construction, it is highly valuable to first learn what to look at in the data and how to interpret what you see. The fact that no coding is required does not mean that no thinking is required. Rather, it means removing unnecessary complexity so that you can focus on understanding, interpreting, and judging your data.

After building that foundation, you can move on to statistical models and script-based analysis as needed, and develop your analysis into a more practical and reproducible workflow. Learning to first look at the data and understand it, and then moving on to statistics and coding, is also an effective approach from a learning perspective.

If you would like to follow this tutorial hands-on, please start by downloading Subio Platform.

A New Approach: No Coding, But Not Superficial RNA-Seq Data Analysis

Subio Platform is an analysis environment designed to help you see your data and make decisions without coding. In this tutorial, you will follow the basic workflow of RNA-Seq data analysis step by step, while visualizing data distributions and variability, and learning in practice why each step is necessary, what to look at, and how to make decisions.

In each section, we first explain the concept behind the method and the basic knowledge needed to make informed decisions. Then, at the end of the section, you can watch a video showing the actual operation steps. By understanding the meaning of each operation in the text and then confirming the actual procedure in the video, this tutorial is designed to help you learn not by simply imitating the steps, but by developing an understanding that involves analytical judgment.

RNA-Seq Data Analysis in the Age of AI: Become an Analyst Who Thinks

With AI tools such as ChatGPT, you can generate analysis code for methods such as DESeq2 and edgeR. However, to judge whether AI-generated results are appropriate, you still need the ability to see and understand your data.

The role of the analyst is not only to write code, but also to clearly define the analytical goal, check the results, and explain what they mean. This tutorial is designed to help you build that ability.

Move from “following procedures” to
“RNA-Seq data analysis where you can see the data, make decisions, and explain the results.”

That is the goal of this tutorial.

You Don’t Have to Do Everything on Your Own

This tutorial is designed for those who want to understand RNA-Seq data analysis and work through the process with their own hands. We explain the workflow and key decision points as carefully as possible. As a result, it may feel a little overwhelming if you are reading it for the first time.

However, you do not need to complete every step perfectly on your own from the beginning. Depending on your goals and situation, you can start learning from the part that is most useful to you.

If you already have data to analyze and want to focus first on interpretation

If going through the entire tutorial on your own feels like too much, another option is to start by looking at analysis results and learning from interpretation first.

With our Data Analysis Service, we prepare public data such as GEO or TCGA datasets, or your own research data, in a form that can be explored and analyzed. The results are delivered as an SSA file, allowing you to freely browse, validate, and explore the data. This option is suitable if you would like to begin with analysis results in hand and focus on validation and interpretation while looking at the data.

If you want to learn the operation steps and decision points quickly using your own data

With our Online Training, you can use public datasets or your own research data to learn the operations, decision points, and analysis workflow that match your current research objective. This option is suitable if you want to learn where to look and how to make decisions while working with your own data as the example.

If you want to start from the basics using demo data

Please continue through this tutorial step by step. It is designed to help you learn not only the operation steps, but also what to check at each stage and how to make analytical decisions.


1-a. Importing FASTQ Files

Key Takeaway: The first step in analysis is importing the measured experimental data. We explain the process for both raw data (FASTQ) and data already converted to gene-level expression (gene counts, FPKM, TPM, etc.).

Processing FASTQ files is often one of the first steps where many beginners struggle, and it can take a considerable amount of time. This preprocessing is important for obtaining gene expression data, and it can affect the results of downstream analysis. However, it is not biological interpretation itself; rather, it is a technical process that defines the input data used for downstream analysis.

Therefore, in the early stages of learning, instead of spending too much time on the details of FASTQ processing, it is often better to first understand the overall analysis workflow after Gene Count data are obtained, and to learn how to interpret and make decisions based on the results of normalization, filtering, PCA, differential expression analysis, and enrichment analysis.

What truly deserves the most time and attention is the analysis and interpretation that follow. Spending excessive time and effort on these early steps can reduce the overall efficiency of the process.

If you find yourself stuck at this stage, it is perfectly reasonable to delegate this part—to colleagues, services, or even AI.
What matters more is how you approach the analysis itself. (See "How to Learn RNA-Seq Data Analysis")

At the same time, tools that allow you to “just input FASTQ files and click a button” (e.g., platforms such as Galaxy) can be useful for processing FASTQ files and obtaining gene counts. While these tools can efficiently generate gene-level expression values, the underlying processing steps are not always fully transparent, and relying on them for downstream analysis as well can make it harder to see how the results were produced.

In addition, some web-based tools, such as RaNA-Seq, automate the analysis process. However, they may not provide direct access to gene-level count data.

In RNA-Seq analysis, statistical methods are typically built on gene counts. In particular, widely used approaches such as edgeR and DESeq2 are designed to work with count data.
For this reason, if you plan to use these methods, it is best to choose a workflow that provides gene counts.

When no alternative is available, it is still possible to proceed with normalized values such as TPM. However, it is important to keep in mind that these data are based on different assumptions.
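To make the phrase "different assumptions" concrete, here is a minimal Python sketch (for illustration only; it is not part of Subio Platform) that converts the same toy counts into CPM and TPM. The gene names, counts, and gene lengths are hypothetical. CPM rescales by library size only, while TPM first divides by gene length, so the relative ranking of genes can change and the original integer counts that edgeR and DESeq2 expect are no longer available.

```python
# Toy counts for three genes in one sample; all values are hypothetical.
counts = {"geneA": 100, "geneB": 400, "geneC": 500}
lengths_bp = {"geneA": 1000, "geneB": 4000, "geneC": 500}

def cpm(counts):
    """Counts per million: scales by library size only."""
    total = sum(counts.values())
    return {g: c / total * 1e6 for g, c in counts.items()}

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rate = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}  # reads per kilobase
    total_rate = sum(rate.values())
    return {g: r / total_rate * 1e6 for g, r in rate.items()}

cpm_values = cpm(counts)
tpm_values = tpm(counts, lengths_bp)

# geneA and geneB differ 4-fold in CPM but are equal in TPM,
# because geneB is four times longer in this toy example.
for gene in counts:
    print(gene, round(cpm_values[gene]), round(tpm_values[gene]))
```

In this toy case geneB looks four times higher than geneA in CPM but identical in TPM, which is why the two kinds of values should not be interpreted interchangeably.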

How you obtained the data does not matter: you may already have it, have asked someone else to process it, or have used a tool. If you already have gene-level expression values (ideally gene counts, although measures such as RPKM/FPKM, TPM, or CPM can also be used even though they are not always optimal), you can skip this step and proceed directly to Section 1-b.


In this section (1-a), we will explain how to complete this preprocessing step within Subio Platform. However, keep in mind that this is still a preparatory stage—the essential analytical work lies ahead.

When importing RNA-Seq FASTQ files into Subio Platform, a pipeline consisting of fastp, HISAT2, and StringTie is executed to calculate gene-level expression values, including gene counts and FPKM.
This allows you to directly use data in formats required for downstream analysis, including count data suitable for statistical analysis.

No workstation or UNIX command-line skills are required. The process can be run on a standard Windows or Mac computer.
Even for users familiar with command-line tools, Subio Platform provides a more streamlined way to perform preprocessing and obtain expression data. This allows you to focus on interpreting the data rather than managing the processing steps.

Before running the pipeline, these tools must be installed. If this seems difficult, please consider our FASTQ Processing Troubleshooting Service.

Keep FASTQ files in .gz compressed format. A large number of intermediate files will be generated in the same folder, so make sure you have sufficient free disk space (at least 5 times the total size of the FASTQ files). If the disk fills up during execution, the pipeline will stop. FASTQ files can be stored on an external drive.
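The disk-space rule of thumb above can be checked before starting. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical, it counts every file ending in .gz in the folder as a FASTQ file, and it uses the factor of 5 mentioned above as a default.

```python
import shutil
from pathlib import Path

def required_bytes(fastq_dir, factor=5):
    """Sum the sizes of *.gz files in fastq_dir and multiply by the safety factor."""
    total = sum(p.stat().st_size for p in Path(fastq_dir).glob("*.gz"))
    return total * factor

def has_enough_space(fastq_dir, factor=5):
    """Compare free space on the drive holding fastq_dir with the estimated need.

    The drive of the FASTQ folder is checked because the intermediate
    files are written to that same folder.
    """
    free = shutil.disk_usage(fastq_dir).free
    return free >= required_bytes(fastq_dir, factor)

# Example call (the folder path is a placeholder):
# print(has_enough_space("/path/to/fastq_folder"))
```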

For paired-end samples, please follow the naming rules for FASTQ files. If you encounter issues, try our troubleshooting guide.

Importing RNA-Seq FASTQ files



1-b. Importing Gene-Level Expression Data and Preparation

Preparing RNA-Seq data may seem complex at first, but once you understand the basic workflow, it becomes straightforward.

In this section, we use GSE49110 (a GEO RNA-Seq dataset) to walk through the essential steps of RNA-Seq data preparation, including obtaining gene count data and organizing annotation. This workflow is not limited to this dataset; it can be applied to many public datasets, such as those from GEO.

In this example, we use GEO data and retrieve annotation from NCBI. However, many RNA-Seq datasets use Ensembl gene IDs. In such cases, you can obtain the corresponding annotation using Ensembl BioMart (MartView).

For those who want to get started more quickly, an SSA file is also provided. By using the SSA file, you can skip the data preparation steps and begin exploring the dataset immediately. It also allows you to review the entire analysis workflow, modify parameters, and reanalyze the data interactively.
Note: The SSA file contains the full analysis workflow and data, but external resources such as pathway databases and publication PDFs are not included due to copyright restrictions.

Although a significant portion of the video focuses on creating and editing a Platform, this step only needs to be done once. Once created, the Platform can be reused, allowing you to start new analyses immediately in future projects.

For detailed step-by-step instructions, please follow along with the video.

Note: In this video, the related page is introduced as “Importing RNA-Seq Gene Counts Data from GEO.” It is now published as the following page, covering not only Gene Count data but also the integration of gene annotations and sample information:
GEO RNA-Seq Data Analysis Preparation: Obtaining and Integrating Gene Counts, Gene Annotations, and Sample Information

Correction: In this video, the two samples in each condition are described as “biological replicates.” More precisely, they should be described as independent replicate samples. This dataset contains independent replicate samples using the same MDA-MB231 cell line.

RNA-Seq Data Analysis Tutorial (01) - Import RNA-Seq Counts Data

▶ Start exploring immediately with an SSA file (90-second demo)



2. Series Creation and Setup

Key Takeaway: In statistical analysis, everything depends on how you define your experimental design—specifically, which samples belong to which groups. Without this information, the analysis cannot even begin. By organizing sample attributes and defining the axis of comparison, you establish a clear and reliable analysis framework.

Once the Series is loaded, objects such as the Measurement List and DataSet will appear in the Series panel on the left.
In the Analysis Browser, a Scatter Plot is displayed in the upper pane by default, while the Setup Series tab opens in the lower pane.

Start by clicking the "Edit Parameters" button.

In most cases, importing information from the "Sample Information" is the fastest approach.
In this dataset, there are four conditions (control and three types of siRNA treatment), each with two replicates.

Next, move to the Setup DataSet tab to define and organize your DataSets.
This step determines how samples are grouped and compared, which directly impacts the interpretation of the results.
In this tutorial, we create two DataSets.

You can also record additional information in the Sample Info tab and attach related files for reference.

Note: If you continue from the steps in section 1-b, the information shown on your screen, such as File name, will differ from what appears in the video.
In that case, import Sample_title from the Import Sample Information list in the upper-right corner of the screen.
We also provide an SSA file for this tutorial. You may import the SSA file and simply watch the video to understand the overall workflow.
Download GSE49110.ssa

RNA-Seq Data Analysis Tutorial (02) - Create and Setup A Series



3. Normalization and Pre-processing

Key Takeaway: Normalization is an essential step to correct systematic errors caused by differences in total read counts, but it should not be blindly trusted (See Case Study 379). Instead of relying solely on presets, develop an eye for judging whether data distortions are correctly resolved by visually checking pre- and post-correction distributions with histograms.

Normalization and preprocessing are critical steps, as their settings can significantly influence downstream analysis results.
If the characteristics of the data are not properly understood and handled accordingly, this may lead to incorrect conclusions.

In this tutorial, we start from the preset scenario "RNA-Seq (Counts)" and adjust it based on the data.
Through this process, you will learn how to interpret the data—what to look at (and where), and how to make informed decisions.

The settings introduced here are only examples and are not universally applicable to all datasets.
In real analyses, it is essential to adjust parameters according to the specific characteristics of your data.
If you are unsure, consider using our data analysis service.

Low Signal Cutoff Setting

Setting this value higher reduces unstable genes but increases the risk of missing potentially interesting ones (false negatives).
Conversely, setting it lower increases unstable genes but reduces the risk of missing relevant signals.

In other words, this represents a trade-off between false positives and false negatives.

It is often said that minimizing false positives is important, but that applies mainly to the final stage of research.
In many cases, omics analysis is used for discovery, where avoiding false negatives is often more critical.

If you want to avoid missing potentially interesting genes—especially those in low-expression regions—it is better to set the Low Signal Cutoff to a lower value.
On the other hand, if your goal is to identify highly reliable biomarker candidates, a higher cutoff is more appropriate.

Fill Missing Value Setting

For this parameter, enter a value slightly lower than the threshold set for the Low Signal Cutoff.
Note that the input is interpreted on a logarithmic (base-2) scale. For example, entering “4” corresponds to 2^4 = 16.
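The log-scale convention can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in this tutorial: values below the cutoff are treated as missing and replaced by the fill value, and the cutoff itself is given as a raw value. The function name and numbers are hypothetical, and the code illustrates the idea rather than Subio Platform's internal implementation.

```python
import math

def apply_cutoff_and_fill(values, low_signal_cutoff, fill_log2):
    """Replace values below the cutoff with 2**fill_log2, then return log2 values.

    Assumption: sub-cutoff values are treated as missing and then filled.
    """
    fill_value = 2 ** fill_log2  # entering 4 means a raw value of 16
    filled = [v if v >= low_signal_cutoff else fill_value for v in values]
    return [math.log2(v) for v in filled]

raw = [0, 3, 20, 64, 1024]
log2_values = apply_cutoff_and_fill(raw, low_signal_cutoff=32, fill_log2=4)
print(log2_values)  # the three sub-cutoff values all become 4.0
```

Filling with a value just below the cutoff keeps unreliable low values from producing extreme log ratios, while still marking them as lower than any value that passed the cutoff.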

In this way, the ability to flexibly adjust parameters according to the purpose of the analysis is one of the key strengths of using Subio Platform.

While intentionally setting a lower Low Signal Cutoff can be effective, it requires careful interpretation of the results.
To properly understand the outcome, it is important not only to focus on final p-values, but also to understand the overall analysis workflow.

For more information on normalization and checking data distributions, please also see the following:

RNA-Seq Data Analysis Tutorial (03) - Normalization and Pre-processing



4. Filtering (Quality Control)

Key Takeaway: “Garbage in, garbage out.” By removing low-expression, unreliable genes and uninformative noise, you can focus on signals that are biologically meaningful.

As discussed in the normalization step, not all measured values are equally reliable.
Before proceeding with the analysis, it is essential to identify which values are suitable for analysis and ensure that unreliable values do not interfere with the results.

For this purpose, we use the Filter tool included in the Basic Plug-in.
If you do not have the plug-in, you can try it with a 5-day free trial.

Fundamental Concept of Filtering

Filtering is not about “selecting reliable genes.”
It is about removing genes that are not suitable for analysis.

This distinction is critical.

If you focus only on selecting “reliable” genes, you may miss genes that are not expressed in the control condition but become expressed after treatment—potentially important biological signals.
Try using the Filter tool to observe this difference in practice.

Filtering Should Be Performed in Two Steps

In most cases, filtering is performed in two stages:

  • Remove genes with extremely low expression
  • Remove genes with little or no variation across samples
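The two stages can be sketched in a few lines of Python. This is a hedged illustration, not the Filter tool itself: the function name, thresholds, and toy values are hypothetical. Step 1 uses each gene's highest value across samples rather than its mean, so that a gene that is silent in the control but expressed after treatment is not removed.

```python
from statistics import pstdev

def two_step_filter(expr, min_max_log2, min_sd):
    """expr maps gene -> list of log2 values across samples.

    Step 1 drops genes whose highest value never reaches min_max_log2.
    Step 2 drops genes whose standard deviation across samples is below min_sd.
    """
    step1 = {g: v for g, v in expr.items() if max(v) >= min_max_log2}
    step2 = {g: v for g, v in step1.items() if pstdev(v) >= min_sd}
    return step1, step2

# Toy log2 values: two control samples followed by two treated samples.
expr = {
    "always_low": [1.0, 1.2, 0.9, 1.1],
    "flat_high": [8.0, 8.1, 8.0, 7.9],
    "induced": [1.0, 1.1, 6.0, 6.2],
    "changed": [7.0, 7.1, 9.0, 9.2],
}
after_step1, after_step2 = two_step_filter(expr, min_max_log2=4.0, min_sd=0.5)
print(sorted(after_step1))  # always_low is removed
print(sorted(after_step2))  # flat_high is removed as well
```

The thresholds here are placeholders; appropriate values depend on the distributions you see in your own data.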

Are Common Thresholds Always Valid?

For example, when learning RNA-Seq analysis with DESeq2, you may often encounter code like this:

dds <- dds[rowSums(counts(dds)) >= 10,]

Have you ever thought:

“Is this threshold really appropriate for any dataset?”

Even if you have doubts, it is easy to copy and paste such code without fully understanding it.

Thresholds Are Not Universal Rules

These thresholds are widely used as rules of thumb, but they are not appropriate for all datasets.

The key point is that such thresholds are not universal rules.
They are simple pre-filters designed to remove extremely low-expression genes, and the appropriate cutoff depends on the characteristics of the data and the purpose of the analysis.

How Should You Determine the Threshold?

One practical approach is to examine the data directly.

You need to examine distributions, plots, and overall patterns to make informed decisions.

Practical Tip:
For TPM/FPKM data or low-input expression datasets, the boundary between noise and signal can be ambiguous, making it difficult to define a simple threshold based on raw signal values (e.g., Ch1 Raw Signal).
In such cases, filtering based on the coefficient of variation (CV) within Sample Groups defined in the Setup DataSet step can be effective.
Practical Tip:
For small datasets, it is often most reliable to iteratively inspect the data and adjust filtering as you proceed. For larger datasets, a practical approach is to randomly select around 100 samples, perform exploratory analysis to determine appropriate thresholds, and then apply them to the full dataset.
Using sample code without examining the data may lead to misleading or suboptimal results.
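As an illustration of the CV-based tip, the Python sketch below keeps a gene when its replicates agree (low CV) in at least one sample group. That "at least one group" rule is an assumption made here so that genes expressed in only one condition are retained; the function names, group layout, and the max_cv value are hypothetical placeholders to be chosen after inspecting your own data.

```python
from statistics import mean, stdev

def group_cv(values):
    """Coefficient of variation (SD / mean) on the linear scale."""
    m = mean(values)
    return stdev(values) / m if m > 0 else float("inf")

def filter_by_cv(expr, groups, max_cv):
    """Keep genes whose CV is at most max_cv in at least one sample group.

    expr: gene -> list of linear-scale values
    groups: group name -> list of sample indices (two or more per group)
    """
    kept = {}
    for gene, values in expr.items():
        cvs = [group_cv([values[i] for i in idx]) for idx in groups.values()]
        if min(cvs) <= max_cv:
            kept[gene] = values
    return kept

groups = {"control": [0, 1], "treated": [2, 3]}
expr = {
    "stable": [100, 110, 200, 190],
    "noisy": [5, 50, 40, 3],
    "induced": [1, 9, 80, 90],
}
kept = filter_by_cv(expr, groups, max_cv=0.3)
print(sorted(kept))  # noisy is removed; induced is kept via the treated group
```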

RNA-Seq Data Analysis Tutorial (04) - Filtering (Quality Control)




5. PCA and Clustering: How to Read PCA Plots and Heatmaps

Key Takeaway: The movement of tens of thousands of genes exceeds human comprehension; we must make it understandable. Gain a bird's-eye view to intuitively grasp sample variation (experimental quality) and similarity between groups to determine "what to investigate next."

At this point, the data is ready for analysis.
We now move on to extracting meaningful insights from the data.

Note that the analysis should be performed not on all genes, but only on those that have passed Quality Control through filtering.

First, it is important to obtain an overview of the expression profiles and understand the overall structure of the data.
The most suitable method for this purpose is Principal Component Analysis (PCA).

A key advantage of PCA is that not only distances between samples, but also directions, have clear meaning.

How to Read PCA

When interpreting PCA results, the following three points are essential:

1. Distance
If two points are close, their expression profiles are similar; if they are far apart, they are substantially different.

2. Direction
Samples located in the same direction (from the origin) indicate variation driven by similar groups of genes.
If samples are positioned in opposite directions across the origin, this indicates that their expression changes are in opposite directions.
In addition, the farther a point is from the origin, the larger the magnitude of the variation.

3. Principal Components
Each axis (principal component) represents variation driven by different groups of genes.
Components with higher explained variance capture the overall structure of the data more strongly, but they are not necessarily biologically more important.
In some cases, variation in a small number of genes—captured by components with lower explained variance—may be more biologically meaningful.
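To connect these three points to the underlying arithmetic, here is a small self-contained Python sketch of PCA. It is an illustration under stated assumptions, not the algorithm used by Subio Platform: it centers each gene, finds components by power iteration on the sample-by-sample matrix, and uses toy log2 values for four hypothetical samples (two control, two treated).

```python
import math

def center_genes(matrix):
    """matrix[s][g] is the log2 value of gene g in sample s. Subtract each gene's mean."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]

def top_component(gram, iterations=500):
    """Power iteration: (largest eigenvalue, unit eigenvector) of a symmetric matrix."""
    n = len(gram)
    vec = [1.0 + i for i in range(n)]  # arbitrary non-constant start vector
    for _ in range(iterations):
        nxt = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n))
    return value, vec

def pca_scores(matrix, n_components=2):
    """Return per-component sample coordinates and explained-variance fractions."""
    centered = center_genes(matrix)
    n = len(centered)
    gram = [[sum(a * b for a, b in zip(centered[i], centered[j])) for j in range(n)]
            for i in range(n)]
    total = sum(gram[i][i] for i in range(n))
    scores, explained = [], []
    for _ in range(n_components):
        value, vec = top_component(gram)
        scores.append([math.sqrt(max(value, 0.0)) * x for x in vec])
        explained.append(value / total)
        # Deflation: remove the component just found before searching for the next.
        gram = [[gram[i][j] - value * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return scores, explained

# Rows: control 1, control 2, treated 1, treated 2. Columns: three genes.
matrix = [[5.0, 3.0, 7.0], [5.2, 3.1, 7.0], [8.0, 3.0, 5.0], [8.1, 2.9, 5.1]]
scores, explained = pca_scores(matrix)
print([round(x, 2) for x in scores[0]])   # PC1 coordinate of each sample
print([round(x, 3) for x in explained])   # fraction of variance per component
```

In this toy example the two control samples fall on one side of the origin along PC1 and the two treated samples on the other, and PC1 accounts for almost all of the variance. The sign of a component is arbitrary, so only relative positions and directions should be interpreted.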

Based on these principles, let us examine the PCA results for this dataset.

The distances between replicates (within-group) are clearly smaller than the distances between siRNA treatments (between-group), indicating that the data is of high quality with low variability.

Compared to the control, all siRNA-treated samples show a shared shift in the same direction (bottom-right), suggesting a common pattern of variation.
At the same time, each treatment exhibits distinct variation: siE2 shifts primarily to the right (PC1), while siE3 shifts downward (PC2).
In addition, although not visible in this 2D PCA plot, another principal component (PC3) captures the characteristic variation of siE1.

How to Read Hierarchical Clustering

Next, we apply hierarchical clustering.

From the heatmap, we can observe that many genes are commonly upregulated (red) across all siRNA treatments.
In contrast, downregulated genes (blue) differ between individual siRNA conditions.

Integrating PCA and Clustering Interpretations

The key point is not to create figures and feel satisfied, nor simply to look at them.
The essence of analysis lies in recognizing these patterns and differences, and interpreting what is happening within the cells.

PCA and hierarchical clustering are essentially different ways of viewing the same underlying data structure.
By comparing both results, you can deepen your understanding of the data.

Note
In the normalization and preprocessing step, log ratio transformation was applied. The meaning of the PCA origin depends on what is used as the reference (denominator) in this transformation.
When centering is used, the origin represents the average expression pattern of each gene. When using Ratio to Control Samples, the origin represents the average expression of the designated control group.
In this case, the siC-treated samples appear near the origin because they serve as the reference.
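The effect of the reference on the origin can be seen with a single gene. The Python sketch below is a simplified illustration with hypothetical values: it subtracts a baseline in log2 space, using either the mean of all samples (centering) or the mean of the designated control samples.

```python
def log_ratio(values, reference_indices=None):
    """values: log2 expression of one gene across samples.

    With reference_indices=None the gene is centered on its mean over all
    samples; otherwise on the mean of the reference (control) samples.
    """
    ref = values if reference_indices is None else [values[i] for i in reference_indices]
    baseline = sum(ref) / len(ref)
    return [v - baseline for v in values]

gene = [5.0, 5.2, 8.0, 8.2]           # two control samples, then two treated samples
centered = log_ratio(gene)             # zero is the average of all four samples
vs_control = log_ratio(gene, [0, 1])   # zero is the average of the control samples

print([round(v, 2) for v in centered])
print([round(v, 2) for v in vs_control])
```

With centering, control and treated samples sit on opposite sides of zero. With the control as reference, the control samples sit around zero and the treated samples are displaced from it, which matches the control samples appearing near the PCA origin.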

Methods such as UMAP are also widely used today, but they primarily represent structure based on distances. In contrast, PCA provides meaningful information about both distance and direction, making it particularly useful for interpreting the underlying sources of variation.

RNA-Seq Data Analysis Tutorial (05) - PCA and Clustering




6. Identifying Differentially Expressed Genes: How to Interpret Fold Change and P-values

Key Takeaway: Rather than relying solely on P-value cutoffs, use approaches such as Venn diagrams to narrow down genes from multiple perspectives. By distinguishing between genes that change consistently across conditions and those that are condition-specific, you gain insight into what to investigate next and how to validate it.

In this section, we compare gene expression between conditions and identify differentially expressed genes (DEGs).
Rather than simply comparing expression levels, we interpret the results together with fold change, P-values, sample variability, and visualization patterns.

6-a. Exploring Differential Expression through Visualization (Subio Platform)

In this case, as observed in the clustering results from the previous section, there appears to be a bias: genes with increased expression are commonly shared across the three siRNA conditions, whereas genes with decreased expression show very little overlap.

After extracting differentially expressed genes using the Basic Plug-in tools, use the Venn Diagram tool to examine how these gene sets overlap.
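What a Venn diagram computes is simply set membership. The following Python sketch, with placeholder gene IDs and a hypothetical function name, splits three gene lists into their exclusive regions so that "common" and "condition-specific" genes can be listed explicitly.

```python
def venn_regions(sets):
    """Return each exclusive Venn region for a dict of name -> set of genes.

    Keys of the result are tuples of the set names a gene belongs to.
    """
    names = list(sets)
    all_genes = set().union(*sets.values())
    regions = {}
    for gene in all_genes:
        key = tuple(n for n in names if gene in sets[n])
        regions.setdefault(key, set()).add(gene)
    return regions

# Placeholder IDs standing in for genes upregulated under each siRNA.
up = {
    "siE1": {"g1", "g2", "g3"},
    "siE2": {"g1", "g2", "g4"},
    "siE3": {"g1", "g2"},
}
regions = venn_regions(up)
for key in sorted(regions):
    print(key, sorted(regions[key]))
```

Running the same function on the downregulated lists and comparing the size of the three-way region is one way to put numbers on the asymmetry seen in the heatmap.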

It is important to pay attention to such asymmetries. Why is there so little overlap among downregulated genes? Why do upregulated genes tend to be shared across conditions? What mechanisms might exist at the boundary between “common” and “condition-specific” responses? Could time-course data reveal these patterns?

Rather than focusing solely on P-values, combine these observations with clustering and PCA results from previous sections to generate hypotheses. The essence of data analysis lies in asking what hypotheses can be formed and what experiments are needed to test them.

The analysis tools included in the Basic Plug-in of Subio Platform can perform group comparisons using t-tests and ANOVA. On the other hand, in RNA-Seq data analysis, it is sometimes said that “t-tests should not be used.” However, the essential point is not that t-tests themselves are inappropriate, but that they should not be applied mechanically to Gene Counts or similar data without appropriate preprocessing and data checking.

In this tutorial, we have performed normalization, preprocessing, and filtering in the previous sections, while visualizing the data and checking its condition. Through these steps, the data have been prepared so that general statistical methods can be applied more appropriately. Rather than blindly following the results of a specific statistical tool, choosing an analysis method after checking the state of the data is a safer way to perform DEG analysis without being driven by the tool.

A small P-value does not necessarily mean biological importance.

More importantly, you should not focus only on genes with small P-values. Even genes that are not statistically significant may still carry important biological insights if they show consistent patterns in expression profiles or clustering (Case Study No.199).

Practical Tip:
This may not be directly related to this specific case, but when you want to capture expression changes from a small subset of cells within a sampled tissue, it can be useful to combine multiple criteria such as:
  • Low expression levels
  • Low variability
  • Small changes in expression
By combining these conditions (e.g., using multiple filters), you may be able to detect signals that would otherwise be overlooked in standard analyses.

What matters is to imagine how the genes of interest would appear in the data, and to design ways to capture them.
These are precisely the skills required of analysts in the age of AI.

RNA-Seq Data Analysis Tutorial (06) - Extracting Differentially Expressed Genes (DEG)




6-b. RNA-Seq Differential Expression Analysis with edgeR & DESeq2 (Using ChatGPT and R)

Up to this point, you have explored the data visually and gained an understanding of its distributions and expression patterns.

Next, we evaluate differential expression quantitatively using statistical methods such as edgeR and DESeq2.

NOTE:
edgeR and DESeq2 are widely used statistical tools for RNA-Seq analysis that evaluate differential expression from count data. In this tutorial, the focus is not on statistical theory or coding, but on how to interpret the results.

The key point is that these methods do not provide definitive answers, but rather offer evidence to support interpretation.

By leveraging ChatGPT, you can focus on the core of your analysis without spending time writing scripts or debugging errors.

To generate accurate code, it is essential to provide the necessary information clearly and without ambiguity.

Rather than typing ideas directly into the chat, we recommend drafting your instructions in a text editor first, then copying the finalized version into ChatGPT. This simple step reduces ambiguity and leads to more reliable results.

In particular, clearly specifying the following information will significantly improve the quality of the generated code:

  • Input file specifications
    What files are used, what format they are in, and what information they contain.
  • Analysis specifications
    What analysis or processing should be performed using the input data.
  • Output specifications
    Ensure the output is a tab-delimited text file that can be imported into Subio Platform, and specify what information and columns it should include.

If an error message appears, copy and paste it into ChatGPT exactly as it is, without omitting any part.

In this way, rather than trying to write code directly, clearly defining the input, analysis, and output is the key to effectively using ChatGPT for data analysis.

You will use ChatGPT to generate R scripts for running edgeR and DESeq2 based on the specifications defined above.

The goal is not only to obtain results, but also to compare them—so that you can see how different statistical methods can lead to different outcomes. What matters is not which method you use, but how you interpret the results.

In practice, you may encounter situations where “one method shows significance while another does not.” However, such differences are often due to threshold settings or differences in statistical models, and do not directly indicate biological meaning (see Case Study No.355).

Rather than debating which method is “correct,” it is often more constructive to consider how the results can be used.

RNA-Seq data are observed as the combined result of multiple factors, including biological changes in gene expression, variability between samples, library size, low-count genes, and composition bias. edgeR and DESeq2 are methods for analyzing such count data using statistical models, but there is no guarantee that biological phenomena and mathematical models perfectly match. For this reason, rather than treating one method as the single “correct” answer, it is important to understand what kind of results each method produces and how to interpret the differences between them.

In actual analysis, it is important to compare the results obtained by multiple methods and then return to visualization and sample-level expression patterns. For example, you may want to examine genes detected by both methods, genes detected only by one method, or genes with large fold changes but unstable P-values.

From another perspective, when omics analysis is used at the exploratory stage, avoiding false negatives may sometimes be more important than strictly minimizing false positives. For this reason, using the union of DEGs identified by multiple methods can be one possible strategy. In particular, if the next step is enrichment analysis, it may be more useful to keep a somewhat broader set of candidate genes rather than applying overly strict thresholds.
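Once both tools have produced result tables, the comparison described above reduces to a few set operations. The Python sketch below is a hedged example: the function name, the gene IDs, the adjusted P-values, and the 0.05 cutoff are all hypothetical, and it assumes you have already exported one adjusted P-value per gene from each tool.

```python
def classify_calls(edger_fdr, deseq2_fdr, fdr_cutoff=0.05):
    """edger_fdr / deseq2_fdr: gene -> adjusted P-value exported from each tool."""
    sig_a = {g for g, q in edger_fdr.items() if q < fdr_cutoff}
    sig_b = {g for g, q in deseq2_fdr.items() if q < fdr_cutoff}
    return {
        "both": sig_a & sig_b,          # detected by both methods
        "edgeR_only": sig_a - sig_b,
        "DESeq2_only": sig_b - sig_a,
        "union": sig_a | sig_b,         # broader list for exploratory use
    }

# Hypothetical adjusted P-values for four placeholder genes.
edger_fdr = {"g1": 0.001, "g2": 0.03, "g3": 0.20, "g4": 0.04}
deseq2_fdr = {"g1": 0.002, "g2": 0.08, "g3": 0.01, "g4": 0.04}

calls = classify_calls(edger_fdr, deseq2_fdr)
for label, genes in calls.items():
    print(label, sorted(genes))
```

Genes in the "only" groups are good candidates for going back to the sample-level expression patterns, since a value just above or below the cutoff is often the only difference between the two calls.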

For example, in the next section, “7. Gene Annotation and Enrichment Analysis,” we focus on the ESRRA gene, which is knocked down in this experiment. We extract candidate target genes that have ESRRA-binding motif sequences near their TSS from Gene Sets used in GSEA (Gene Set Enrichment Analysis). We then combine these candidate genes with genes whose expression is decreased, as an attempt to narrow down potential ESRRA target genes.
It is also important to view the entire analysis workflow from a broader perspective, including how biological filtering will be applied in later steps.
Even when DEG candidates are kept relatively broad at first, combining them with biological criteria can reduce the candidate list to a practical number of genes that can be examined in detail.
At that stage, candidates can be selected from an integrated perspective, taking into account both statistics and biology, including expression patterns, known gene functions, and literature information.
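As a rough sketch of this idea, the snippet below keeps a deliberately broad set of decreased genes and then intersects it with a motif-based candidate set. All gene names, values, and thresholds are hypothetical and chosen only for illustration; they are not the criteria used in this tutorial.

```python
# Hypothetical per-gene results: (gene, log2 fold change, FDR).
results = [
    ("GENE_A", -1.8, 0.001),
    ("GENE_B", -0.4, 0.200),
    ("GENE_C", -1.1, 0.080),
    ("GENE_D", 2.3, 0.004),
    ("GENE_E", -0.9, 0.030),
]

# Hypothetical candidate target genes taken from a motif-based gene set.
motif_candidates = {"GENE_A", "GENE_C", "GENE_D", "GENE_F"}


def decreased_genes(rows, max_fdr, max_log2fc):
    """Return genes whose expression is decreased under the given thresholds."""
    return {
        gene
        for gene, log2fc, fdr in rows
        if fdr <= max_fdr and log2fc <= max_log2fc
    }


# Step 1: keep the statistical filter relatively broad (illustrative thresholds).
broad = decreased_genes(results, max_fdr=0.10, max_log2fc=-0.5)

# Step 2: apply the biological criterion to reach a practical shortlist.
shortlist = broad & motif_candidates

print("broad candidates:", sorted(broad))
print("shortlist:", sorted(shortlist))
```

The point of the two-step design is that the statistical threshold does not have to carry the whole burden of selection; the biological criterion does much of the narrowing.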

On the other hand, when the goal is final biomarker selection or clinical application, more conservative criteria are required in order to reduce false positives. It is important to be aware of whether the analysis is at the exploratory, validation, or application stage, and to use the results appropriately according to the research purpose.

The table below shows examples of possible strategies depending on the purpose of the analysis. DEG selection strategy may vary depending on the research objective.

| Analysis Purpose | Example Strategy for Handling DEG Candidates |
|---|---|
| Exploratory analysis | Consider a broader set of candidates, including the union of DEGs detected by multiple methods |
| Biomarker candidate selection | Focus on genes commonly detected by multiple methods as more reproducible candidates |
| Analysis used to support claims in a publication | Clearly define criteria such as FDR and fold change in advance, and interpret the results carefully |
| Pathway or enrichment analysis | Keep a somewhat broader set of candidates when the goal is to identify biological trends |
| Analysis followed by validation experiments | Keep a somewhat broader set of candidates to avoid missing promising genes |


The video below shows the practical workflow for edgeR and DESeq2 analysis using R together with ChatGPT.

This type of functional extension can also be applied to other R libraries and Bioconductor tools.
Related: A New Style of RNA-Seq Data Analysis: R/Python for Analysis × Subio for Visualization and Data Management

Calculate edgeR & DESeq2 P-values Using ChatGPT-Generated R Scripts | Subio Platform Workflow

As demonstrated in this video, most of the processing is handled within Subio Platform, while R scripts are used only where necessary. This approach keeps the scripts minimal and significantly reduces long-term maintenance costs.

Finally, the script can be saved as a text file and stored as an attachment. By preserving not only the results but also the analytical process (the script), the analysis can be reproduced, validated, and shared later. This is a key step in transforming analysis from a one-time task into a reproducible, reusable analytical asset.



7. Gene Annotation and Enrichment Analysis: From Statistical Results to Biological Interpretation

Key Takeaway: Translating a “list of numbers” into “biological language” provides clues for interpreting the results of your analysis.

The next step after statistical analysis is to understand the results in terms of biological meaning.


The most basic task for this purpose is probably web searching.
Biological interpretation begins with gathering information about individual genes and related pathways. Functions that support a large number of searches may look modest at first, but they can have a major impact on practical efficiency.

In the Annotation tab of Subio Platform, you can browse a gene list, right-click a gene of interest, and send a query to various databases or search sites. As search engines have become increasingly integrated with AI, this function has become much more useful.

For example,
“Pathways or networks including [gene name]” or
“Does [gene name] relate to [disease name] in humans?”
can be registered as predefined phrases designed for AI-assisted search. This eliminates the need to type the same search phrase every time, and makes it easier to reach information that is one step deeper than simply searching by gene name.
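The idea of a predefined phrase is essentially a text template with a placeholder for the gene name. The sketch below shows that idea in Python; it is not how Subio Platform implements the feature, and the disease name is a placeholder chosen only for the example.

```python
# Predefined search phrases with placeholders, taken from the examples above.
SEARCH_TEMPLATES = [
    "Pathways or networks including {gene}",
    "Does {gene} relate to {disease} in humans?",
]


def build_queries(gene, disease):
    """Fill every template with the selected gene (and disease, where used)."""
    return [template.format(gene=gene, disease=disease) for template in SEARCH_TEMPLATES]


# "breast cancer" is only a placeholder disease name for this illustration.
queries = build_queries("ESRRA", "breast cancer")
for query in queries:
    print(query)
```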

The next frequently used approach is enrichment analysis.
For this, you can use the Advanced Plug-in for Subio Platform, which includes the Enrichment Analysis tool, or you can use various free web-based tools. These tools help you examine biases in functional categories and pathways within a gene list, providing clues that can lead to biological interpretation.
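To make "bias within a gene list" more concrete, the sketch below computes a one-sided hypergeometric P-value, which is one common basis for over-representation tests. It is an assumption-laden illustration: each tool uses its own statistics, background definition, and multiple-testing correction, and the numbers in the example call are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, annotated, selected, overlap):
    """Probability of seeing at least `overlap` annotated genes by chance.

    population: number of genes in the background
    annotated:  background genes belonging to the category (e.g. one GO term)
    selected:   size of the gene list being tested
    overlap:    genes in the list that belong to the category
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 300 DEGs out of 20,000 genes, 10 of them in a 150-gene category.
p_value = hypergeom_pvalue(population=20000, annotated=150, selected=300, overlap=10)
print(f"P(overlap >= 10) = {p_value:.3g}")
```

Note how strongly the result depends on the choice of background (`population`) and on how many genes are annotated to the category, which is one reason different tools report different results for the same gene list.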

Terminology can be confusing, but many approaches referred to as Gene Ontology (GO) analysis or pathway analysis are, in practice, forms of enrichment analysis.
DAVID Functional Annotation, Metascape, Enrichr, g:Profiler, and WebGestalt are representative web-based enrichment analysis tools.
In addition, Gene Set Enrichment Analysis (GSEA) is widely used as a representative method and tool for examining enrichment of gene sets using a ranked gene list.
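The ranked-list idea can be illustrated with a simplified, unweighted running-sum score: walk down the ranked list, step up at genes in the set, step down otherwise, and record the largest deviation from zero. This is only a sketch of the concept. The actual GSEA method weights each step by the ranking metric and assesses significance by permutation, neither of which is shown here, and the gene names are hypothetical.

```python
def running_sum_es(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for a ranked gene list."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, of the ranked genes")

    step_up = 1.0 / n_hits      # increment when a gene belongs to the set
    step_down = 1.0 / n_miss    # decrement when it does not
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += step_up if is_hit else -step_down
        if abs(running) > abs(best):
            best = running      # keep the largest deviation from zero
    return best


# Hypothetical ranked list (most upregulated first) and gene set.
ranked = ["G1", "G2", "G3", "G4", "G5", "G6"]
top_score = running_sum_es(ranked, {"G1", "G2"})
bottom_score = running_sum_es(ranked, {"G5", "G6"})
print("set at top of the ranking:", top_score)
print("set at bottom of the ranking:", bottom_score)
```

A positive score indicates that the gene set is concentrated near the top of the ranking, and a negative score indicates concentration near the bottom.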

These tools use different databases, statistical methods, and ways of summarizing results. Therefore, even when the same gene list is used, the results will not be exactly the same.
For example, DAVID and g:Profiler are useful for checking relatively well-organized results,
whereas Metascape and Enrichr are useful for expanding biological hypotheses through broader databases and visualization.
This is not so much a question of which tool is correct. Rather, it is easier to understand as a trade-off between conservative analysis and exploratory analysis.

Enrichment analysis does not provide definitive conclusions; rather, it offers clues for interpretation.
However, when combined with AI, it becomes even more powerful.
AI can help organize, compare, and integrate enrichment analysis results, making it easier to expand exploratory analysis and generate new biological hypotheses.

Please watch the videos below for practical examples.

Enrichment analysis is based on previous literature and existing databases, so its results have several limitations.

For example, results tend to be biased toward fields that have been studied extensively. Also, when only a small number of genes are associated with a certain biological phenomenon, it can be difficult to determine whether only a few genes are truly involved, or whether the area has simply not been studied enough. In addition, observations made in specific tissues or cell types may sometimes be presented as if they were general biological findings.

Therefore, enrichment analysis results should not be treated as “answers.” Instead, they should be used as clues for deciding what to examine next, while comparing them with the original expression data and visualization results.

With this in mind:
In 7-1, we introduce a basic example of enrichment analysis using DAVID Functional Annotation, along with some extended applications.
In 7-2, we show an example of enrichment analysis using a Subio Platform plug-in, and explore candidate ESRRA-related genes and pathways using GSEA Gene Sets.

RNA-Seq Data Analysis Tutorial (7-1): Gene Annotation and Enrichment Analysis Using Web Tools and AI

RNA-Seq Data Analysis Tutorial (7-2): Enrichment Analysis and Exploring ESRRA Target Candidates

Note: This video does not cover the pathway import procedure, but KEGG Pathways can be imported into Subio Platform and visualized together with gene expression data in the pathway view. For details, please see:
How to Use the Pathway Edit Tool.

【Start your 5-day free trial】
Discover GO terms and pathways from your gene list



8. Genomic Location-Specific Regulation and Motif Search

Key Takeaway: Approach regulatory mechanisms from physical position on the genome. Infer the “cause” of expression changes by examining whether they depend on specific genomic regions or are associated with transcription factor motif sequences.

In the GSEA ERR1_Q2 Gene Set used in the video in the previous section, genes that have an ESRRA (ERR alpha) binding motif sequence within a 4 kbp region centered on the TSS, that is, from 2 kbp upstream to 2 kbp downstream of the TSS, are treated as candidate target genes. However, depending on the purpose of the analysis, this range may feel too broad. Therefore, in this section, we introduce how to define a custom search range, extract genes that contain the motif sequence within that range, and examine their expression patterns.
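Defining a custom search range amounts to computing a window around each TSS while respecting the gene's strand. The sketch below assumes 1-based genomic coordinates and "+"/"-" strand labels; the function name and default values are illustrative and do not correspond to a Subio Platform setting.

```python
def tss_window(tss, strand, upstream=2000, downstream=2000):
    """Return (start, end) of the search window around a TSS.

    For a gene on the minus strand, "upstream" lies at higher coordinates,
    so the two distances are swapped.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 1), end   # do not run past the start of the chromosome


# Default window: 2 kbp on each side of the TSS.
print(tss_window(10000, "+"))
# Narrower custom window: 500 bp upstream to 100 bp downstream, minus-strand gene.
print(tss_window(10000, "-", upstream=500, downstream=100))
```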

Before doing that, once you have extracted differentially expressed genes or identified groups of genes with specific expression patterns by clustering, it is also useful to check whether those genes are unevenly distributed across the genome. If genes located in a specific genomic region are upregulated or downregulated together, changes in the epigenetic state of that region or alterations in chromosomal structure may be involved in the observed expression changes. If no such regional bias is observed, regulation by transcription factors may be a more likely possibility.
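One simple way to look for such a bias is to compare the number of selected genes on each chromosome with the number expected from the background gene distribution. The sketch below works at whole-chromosome resolution with hypothetical data; a finer analysis could count genes per positional bin instead, and a visual check of the distribution remains important.

```python
from collections import Counter


def chromosome_bias(deg_chromosomes, background_chromosomes):
    """Compare observed DEG counts per chromosome with background expectation.

    Returns a list of (chromosome, observed, expected, observed/expected).
    """
    deg_counts = Counter(deg_chromosomes)
    background_counts = Counter(background_chromosomes)
    n_deg = len(deg_chromosomes)
    n_background = len(background_chromosomes)

    rows = []
    for chrom, n_genes in sorted(background_counts.items()):
        expected = n_deg * n_genes / n_background
        observed = deg_counts.get(chrom, 0)
        ratio = observed / expected if expected else 0.0
        rows.append((chrom, observed, expected, ratio))
    return rows


# Hypothetical data: the chromosome of every background gene and of every DEG.
background = ["chr1"] * 50 + ["chr2"] * 30 + ["chr3"] * 20
degs = ["chr1"] * 2 + ["chr2"] * 2 + ["chr3"] * 6

rows = chromosome_bias(degs, background)
for chrom, observed, expected, ratio in rows:
    print(f"{chrom}: observed={observed}, expected={expected:.1f}, ratio={ratio:.2f}")
```

A ratio well above 1 for one chromosome or region is a hint to look more closely, not a conclusion in itself.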

Next, by using the motif sequence search function, you can identify genes that have a specific sequence near the transcription start site (TSS) and examine their expression patterns. By combining analyses based on such positional information, you can further improve the resolution of your understanding of the data. The motif sequence search function also supports IUPAC nucleotide codes, so ambiguous positions in a motif (for example, N for any base) can be written directly.
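To show what searching with IUPAC codes means in practice, the sketch below converts a motif written in IUPAC notation into a regular expression and scans both strands of a sequence. It is an independent illustration, not the implementation used by Subio Platform; the motif `TNAAGGTCA` and the sequence are used only as examples and are not necessarily those used in this tutorial.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}
# Complement of each IUPAC code (e.g. R = A/G pairs with Y = T/C).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def motif_to_regex(motif):
    """Translate an IUPAC motif into a regular-expression pattern."""
    return "".join(IUPAC[base] for base in motif.upper())


def reverse_complement(motif):
    """Reverse complement of a motif, keeping IUPAC ambiguity codes."""
    return motif.upper().translate(COMPLEMENT)[::-1]


def find_motif(sequence, motif):
    """Return sorted (0-based start, strand) pairs, including overlapping matches."""
    sequence = sequence.upper()
    hits = set()
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        # A lookahead is used so that overlapping occurrences are all reported.
        regex = re.compile("(?=(" + motif_to_regex(pattern) + "))")
        hits.update((match.start(), strand) for match in regex.finditer(sequence))
    return sorted(hits)


# Illustrative sequence containing the motif once on each strand.
example_sequence = "GGTCAAGGTCATTTGACCTTGAAC"
hits = find_motif(example_sequence, "TNAAGGTCA")
print(hits)
```

In a real analysis, `sequence` would be the genomic sequence of the search window around each TSS, and the reported positions would be converted back to genomic coordinates.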

In this case, there does not appear to be expression regulation biased toward a specific genomic region. On the other hand, using the ERR alpha binding motif sequence as a reference, we were able to extract genes that have this motif near the transcription start site. We then examined the overlap with the list of genes downregulated by siRNA and narrowed down the candidate genes.

RNA-Seq Data Analysis Tutorial (08) - Genomic Location Specifically Regulated Genes & Motif Sequences

【Start your 5-day free trial】
Explore genes with consensus motifs near TSS


________________________________________

9. Conclusion: From Using Tools to Becoming a Decision-Maker

This tutorial has focused on introducing methods. However, in real-world data analysis, it is not the tools but the analyst who must decide which thresholds are appropriate and how to handle issues such as batch effects. It is important to recognize the role of subjective, informed decision-making beyond objective numbers.

In this tutorial, we have walked through the standard workflow of RNA-Seq data analysis. In practice, however, data analysis rarely follows a single linear path. The “correct” approach depends on the characteristics of the data and the goals of the study.

Once you have learned the methods, the next and most essential step is to evolve into an analyst who can make appropriate decisions based on informed judgment.

Real-world data is never perfect. This is precisely why analysts must complement incomplete information with their own judgment—grounded in both the data and the research objective.

This is one of the key characteristics of an analyst whose value remains relevant in the age of AI.

Subio Platform provides the analytical foundation for this transition—from a user who simply operates tools to a researcher who interprets data and makes decisions. In addition to the software itself, we support this process through data analysis services and online training.

Now, apply this approach to your own data and experience the difference firsthand.

If you are just starting to learn RNA-Seq data analysis, Subio offers two approaches depending on your goals.

■ If you would like to first obtain analysis results and focus on learning through validation and interpretation, consider starting with our data analysis service.
→ Learn more about the Data Analysis Service

■ If you would like to understand and make decisions at each step—from preprocessing to analysis—and become able to carry out the entire workflow on your own, we recommend our online training.
→ Learn more about the Online Training
