RNA-Seq Analysis Without Coding: How to Read Your Data and Make Decisions

How to Perform RNA-Seq Analysis Using GEO Data

This page explains the basic workflow (steps) and key concepts of RNA-Seq analysis using datasets from GEO (Gene Expression Omnibus), in a clear and structured way.

By using publicly available data, anyone can perform the same analysis under the same conditions.
Even if you are new to RNA-Seq analysis, you can start by understanding it through real data.

This is a tutorial for understanding RNA-Seq analysis without coding.
It helps you develop the ability to think through your analysis by visualizing your data—rather than relying on tools.

Unlike tutorials that focus on running pipelines in R or the command line,
this guide emphasizes learning how to interpret your data and make decisions based on visualization.

RNA-Seq analysis may seem complex, but the real challenge is not the tools—it is how you interpret the data.

When starting RNA-Seq analysis, many people quickly encounter difficulties with R and various packages.
Time is often consumed by writing code and fixing errors, leaving little room for interpreting the data and understanding its biological meaning.

On the other hand, there are also many tools available for RNA-Seq analysis that are designed for ease of use.

Many of these tools aim to automate the analysis process,
allowing users to obtain results quickly.
However, they may offer limited opportunities to understand the underlying steps of the analysis.

The goal of this tutorial is not to simply run analysis,
but to help you understand your data and develop the ability to make informed decisions.

________________________________________

Become an Analyst Who Thinks in the Age of AI

Subio Platform is designed as an analysis environment where you can interpret your data without coding.

By visualizing distributions and variability across samples,
it enables you to understand each step of the analysis and make your own decisions.

In this tutorial, you will follow the standard RNA-Seq workflow step by step,
learning the thought process behind analysis—what to look at, and how to make decisions.

Some parts of the workflow also introduce the use of AI tools.
In this context, the role of the analyst is not to write code,
but to clearly define what they want to do and communicate that intent to AI.

From simply “running predefined steps”
to “analyzing, interpreting, and explaining your data.”

________________________________________

For those who need results quickly

This page is intended for those who want to understand RNA-Seq analysis by working through the tutorial themselves.
However, depending on your situation, a shorter and more efficient path may be available:

I want to analyze my own data as quickly as possible
→ Data Analysis Service: We deliver not only processed results, but also an interactive SSA file that allows you to explore the data freely.

I want to learn the workflow quickly using my own data
→ Online Training: Learn the essential analysis workflow efficiently using your own data. Sessions can be recorded for future reference.

I want to start with the basics using demo data
→ Continue with this tutorial.

________________________________________


1-a. Importing FASTQ Files

Key Takeaway: The first step in analysis is importing the measured experimental data. We explain the process for both raw data (FASTQ) and data already converted to gene-level expression (gene counts, FPKM, TPM, etc.).

Processing FASTQ files is often the first major hurdle for beginners and a step that tends to consume a significant amount of time.
However, FASTQ processing is not the core of research; it is closer to an execution process that follows defined procedures.

Learning command-line tools, overcoming errors, and successfully completing a pipeline can certainly bring a sense of accomplishment. However, this stage is merely a preparatory step. The core of research—understanding the data and deriving biological insights—lies beyond this point.

What truly deserves the most time and attention is the analysis and interpretation that follow. Spending excessive time and effort on these early steps can make the overall process inefficient.

If you find yourself stuck at this stage, it is perfectly reasonable to delegate this part—to colleagues, services, or even AI.
What matters more is how you approach the analysis itself. (See "How to Learn RNA-Seq Data Analysis")

At the same time, tools that allow you to “just input FASTQ files and click a button” can be useful for processing FASTQ files and obtaining gene counts (e.g., platforms such as Galaxy). While these tools can efficiently generate gene-level expression values, the underlying processing steps are not always fully transparent. Using them for downstream analysis can therefore easily turn the process into a black box.

In addition, some web-based tools, such as RaNA-Seq, automate the analysis process. However, they may not provide direct access to gene-level count data.

In RNA-Seq analysis, statistical methods are typically built on gene counts. In particular, widely used approaches such as edgeR and DESeq2 are designed to work with count data.
For this reason, if you plan to use these methods, it is best to choose a workflow that provides gene counts.

When no alternative is available, it is still possible to proceed with normalized values such as TPM. However, it is important to keep in mind that these data are based on different assumptions.

Whether you already have the data, have asked someone else to process it, or are using a tool, the method does not matter. If you already have data converted to gene-level expression values (e.g., gene counts, or less ideal measures such as RPKM/FPKM, TPM, or CPM), you can skip this step and proceed directly to Section 1-b.
________________________________________

In this section (1-a), we will explain how to complete this preprocessing step within Subio Platform. However, keep in mind that this is still a preparatory stage—the essential analytical work lies ahead.

When importing RNA-Seq FASTQ files into Subio Platform, a pipeline consisting of fastp, HISAT2, and StringTie is executed to calculate gene-level expression values, including gene counts and FPKM.
This allows you to directly use data in formats required for downstream analysis, including count data suitable for statistical analysis.

No workstation or UNIX command-line skills are required. The process can be run on a standard Windows or Mac computer.
Even for users familiar with command-line tools, Subio Platform provides a more streamlined way to perform preprocessing and obtain expression data. This allows you to focus on interpreting the data rather than managing the processing steps.

Before running the pipeline, these tools (fastp, HISAT2, and StringTie) must be installed. If this seems difficult, please consider our FASTQ Processing Troubleshooting Service.

Keep FASTQ files in .gz compressed format. A large number of intermediate files will be generated in the same folder, so ensure you have sufficient disk space (at least 5 times the total size of the FASTQ files). If the disk runs out during execution, it will stop. FASTQ files can be stored on an external drive.

For Paired-end samples, please follow the naming rules for FASTQ files. If you encounter issues, try our troubleshooting guide.

Importing RNA-Seq FASTQ Files

Back to TOC

________________________________________

1-b. Importing Gene-Level Expression Data and Preparation

Preparing RNA-Seq data may seem complex at first, but once you understand the basic workflow, it becomes straightforward.

In this section, we use GSE49110 (a GEO RNA-Seq dataset) to walk through the essential steps of RNA-Seq data preparation, including obtaining gene count data and organizing annotation. This workflow is not limited to this dataset—it can be applied to many public datasets, such as those from GEO.

In this example, we use GEO data and retrieve annotation from NCBI. However, many RNA-Seq datasets use Ensembl gene IDs. In such cases, you can obtain the corresponding annotation using Ensembl BioMart (MartView).

For those who want to get started more quickly, an SSA file is also provided. By using the SSA file, you can skip the data preparation steps and begin exploring the dataset immediately. It also allows you to review the entire analysis workflow, modify parameters, and reanalyze the data interactively.
Note: The SSA file contains the full analysis workflow and data, but external resources such as pathway databases and publication PDFs are not included due to copyright restrictions.

Although a significant portion of the video focuses on creating and editing a Platform, this step only needs to be done once. Once created, the Platform can be reused, allowing you to start new analyses immediately in future projects.

For detailed step-by-step instructions, please follow along with the video.

RNA-Seq Tutorial (01) - Importing RNA-Seq Counts Data

▶ Get your data analyzed now (data analysis service)
▶ Master RNA-Seq analysis using your own data (online training)
▶ Start exploring immediately with an SSA file (90-second demo)

Back to TOC

________________________________________

2. Series Creation and Setup

Key Takeaway: In statistical analysis, everything depends on how you define your experimental design—specifically, which samples belong to which groups. Without this information, the analysis cannot even begin. By organizing sample attributes and defining the axis of comparison, you establish a clear and reliable analysis framework.

Once the Series is loaded, objects such as the Measurement List and DataSet will appear in the Series panel on the left.
In the Analysis Browser, a Scatter Plot is displayed in the upper pane by default, while the Setup Series tab opens in the lower pane.

Start by clicking the "Edit Parameters" button.

In most cases, importing information from the "Sample Information" is the fastest approach.
In this dataset, there are four conditions (control and three types of siRNA treatment), each with two replicates.

Next, move to the Setup DataSet tab to define and organize your DataSets.
This step determines how samples are grouped and compared, which directly impacts the interpretation of the results.
In this tutorial, we create two DataSets.

You can also record additional information in the Sample Info tab and attach related files for reference.

RNA-Seq Tutorial (02) - Creating and Setting Up a Series

Back to TOC

________________________________________

3. Normalization and Pre-processing

Key Takeaway: Normalization is an essential step to correct systematic errors caused by differences in total read counts, but it should not be blindly trusted (See Case Study 379). Instead of relying solely on presets, develop an eye for judging whether data distortions are correctly resolved by visually checking pre- and post-correction distributions with histograms.

Normalization and preprocessing are critical steps, as their settings can significantly influence downstream analysis results.
If the characteristics of the data are not properly understood and handled accordingly, this may lead to incorrect conclusions.

In this tutorial, we start from the preset scenario "RNA-Seq (Counts)" and adjust it based on the data.
Through this process, you will learn how to interpret the data—what to look at (and where), and how to make informed decisions.

The settings introduced here are only examples and are not universally applicable to all datasets.
In real analyses, it is essential to adjust parameters according to the specific characteristics of your data.
If you are unsure, consider using our data analysis service.

Low Signal Cutoff Setting

Setting this value higher reduces unstable genes but increases the risk of missing potentially interesting ones (false negatives).
Conversely, setting it lower increases unstable genes but reduces the risk of missing relevant signals.

In other words, this represents a trade-off between false positives and false negatives.

It is often said that minimizing false positives is important, but that applies mainly to the final stage of research.
In many cases, omics analysis is used for discovery, where avoiding false negatives is often more critical.

If you want to avoid missing potentially interesting genes—especially those in low-expression regions—it is better to set the Low Signal Cutoff to a lower value.
On the other hand, if your goal is to identify highly reliable biomarker candidates, a higher cutoff is more appropriate.

Fill Missing Value Setting

For this parameter, enter a value slightly lower than the threshold set for the Low Signal Cutoff.
Note that the input is interpreted on a logarithmic scale. For example, entering “4” corresponds to 2^4 = 16.
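To see the relationship between the entered value and the raw signal, the conversion can be written out explicitly (a generic sketch; the function name is ours, not part of Subio Platform):

```python
def log2_entry_to_linear(entry: float) -> float:
    """Convert a value entered on the log2 scale to the linear signal scale."""
    return 2 ** entry

# Entering "4" on the log2 scale corresponds to a raw signal of 16.
print(log2_entry_to_linear(4))    # 16
# A slightly lower entry, e.g. 3.5, stays just below that cutoff.
print(log2_entry_to_linear(3.5))
```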

In this way, the ability to flexibly adjust parameters according to the purpose of the analysis is one of the key strengths of using Subio Platform.

While intentionally setting a lower Low Signal Cutoff can be effective, it requires careful interpretation of the results.
To properly understand the outcome, it is important not only to focus on final p-values, but also to understand the overall analysis workflow.

For related topics, please also refer to the following:

RNA-Seq Tutorial (03) - Normalization and Pre-Processing

▶ Get your data analyzed now (data analysis service)
▶ Master RNA-Seq analysis using your own data (online training)

Back to TOC

________________________________________

4. Filtering (Quality Control)

Key Takeaway: “Garbage in, garbage out.” By removing low-expression, unreliable genes and uninformative noise, you can focus on signals that are biologically meaningful.

As discussed in the normalization step, not all measured values are equally reliable.
Before proceeding with the analysis, it is essential to identify which values are suitable for analysis and ensure that unreliable values do not interfere with the results.

For this purpose, we use the Filter tool included in the Basic Plug-in.
If you do not have the plug-in, you can try it with a 5-day free trial.

Fundamental Concept of Filtering

Filtering is not about “selecting reliable genes.”
It is about removing genes that are not suitable for analysis.

This distinction is critical.

If you focus only on selecting “reliable” genes, you may miss genes that are not expressed in the control condition but become expressed after treatment—potentially important biological signals.
Try using the Filter tool to observe this difference in practice.

Filtering Should Be Performed in Two Steps

In most cases, filtering is performed in two stages:

  • Remove genes with extremely low expression
  • Remove genes with little or no variation across samples
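To make the two stages concrete, here is a minimal sketch in Python over a made-up count matrix; the thresholds used (10 counts, a 1.0 log2 range) are illustrative only, not recommendations:

```python
import numpy as np

# Hypothetical matrix: rows = genes, columns = samples.
counts = np.array([
    [1,    0,    2,    1],     # extremely low expression -> removed in step 1
    [500,  510,  495,  505],   # expressed but flat -> removed in step 2
    [100,  105,  900,  880],   # expressed and variable -> kept
])

# Step 1: remove genes with extremely low expression.
expressed = counts.max(axis=1) >= 10

# Step 2: among expressed genes, remove those with little or no variation.
log_counts = np.log2(counts + 1)
variable = (log_counts.max(axis=1) - log_counts.min(axis=1)) >= 1.0

keep = expressed & variable
print(keep)   # only the third gene survives both filters
```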

Are Common Thresholds Always Valid?

For example, when learning RNA-Seq analysis with DESeq2, you may often encounter code like this:

dds <- dds[rowSums(counts(dds)) >= 10,]

Have you ever thought:

“Is this threshold really appropriate for any dataset?”

Even if you have doubts, it is easy to copy and paste such code without fully understanding it.

Thresholds Are Not Universal Rules

These thresholds are widely used as rules of thumb, but they are not appropriate for all datasets.

The key point is that such thresholds are not universal rules.
They are simple pre-filters designed to remove extremely low-expression genes, and the appropriate cutoff depends on the characteristics of the data and the purpose of the analysis.

How Should You Determine the Threshold?

The answer is simple:
Look at the data.

You need to examine distributions, plots, and overall patterns to make informed decisions.

Practical Tip:
For TPM/FPKM data or low-input expression datasets, the boundary between noise and signal can be ambiguous, making it difficult to define a simple threshold based on raw signal values (e.g., Ch1 Raw Signal).
In such cases, filtering based on the coefficient of variation (CV) within Sample Groups defined in the Setup DataSet step can be effective.

Practical Tip:
For small datasets, it is often most reliable to iteratively inspect the data and adjust filtering as you proceed. For larger datasets, a practical approach is to randomly select around 100 samples, perform exploratory analysis to determine appropriate thresholds, and then apply them to the full dataset.
Using sample code without examining the data may lead to misleading or suboptimal results.
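A group-wise CV filter of this kind can be sketched as follows (a generic Python illustration with made-up values and an arbitrary 0.5 cutoff; in Subio Platform this is done with the Filter tool, not with code):

```python
import numpy as np

# Hypothetical expression values (e.g. TPM): rows = genes, columns = samples.
# Assume two sample groups of two replicates each, as defined in Setup DataSet.
values = np.array([
    [10.0, 11.0, 50.0, 52.0],   # reproducible within each group
    [10.0, 90.0,  5.0, 60.0],   # noisy within both groups
])
groups = [[0, 1], [2, 3]]       # column indices per sample group

def max_group_cv(row: np.ndarray) -> float:
    """Largest coefficient of variation (SD / mean) across the sample groups."""
    return max(row[g].std(ddof=1) / row[g].mean() for g in groups)

cv_cutoff = 0.5
keep = np.array([max_group_cv(row) <= cv_cutoff for row in values])
print(keep)   # the reproducible gene passes; the noisy one does not
```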

RNA-Seq Tutorial (04) - Filtering (Quality Control)

【Start your 5-day free trial】
▶ Remove noisy genes with filtering

Back to TOC

________________________________________

5. PCA and Clustering

Key Takeaway: The movement of tens of thousands of genes exceeds human comprehension; we must make it understandable. Gain a bird's-eye view to intuitively grasp sample variation (experimental quality) and similarity between groups to determine "what to investigate next."

At this point, the data is ready for analysis.
We now move on to extracting meaningful insights from the data.

Note that the analysis should be performed not on all genes, but only on those that have passed Quality Control through filtering.

First, it is important to obtain an overview of the expression profiles and understand the overall structure of the data.
The most suitable method for this purpose is Principal Component Analysis (PCA).

A key advantage of PCA is that not only distances between samples, but also directions, have clear meaning.

How to Read PCA

When interpreting PCA results, the following three points are essential:

1. Distance
If two points are close, their expression profiles are similar; if they are far apart, they are substantially different.

2. Direction
Samples located in the same direction (from the origin) indicate variation driven by similar groups of genes.
If samples are positioned in opposite directions across the origin, this indicates that their expression changes are in opposite directions.
In addition, the farther a point is from the origin, the larger the magnitude of the variation.

3. Principal Components
Each axis (principal component) represents variation driven by different groups of genes.
Components with higher explained variance capture the overall structure of the data more strongly, but they are not necessarily biologically more important.
In some cases, variation in a small number of genes—captured by components with lower explained variance—may be more biologically meaningful.

Based on these principles, let us examine the PCA results for this dataset.

The distances between replicates (within-group) are clearly smaller than the distances between siRNA treatments (between-group), indicating that the data is of high quality with low variability.

Compared to the control, all siRNA-treated samples show a shared shift in the same direction (bottom-right), suggesting a common pattern of variation.
At the same time, each treatment exhibits distinct variation: siE2 shifts primarily to the right (PC1), while siE3 shifts downward (PC2).
In addition, although not visible in this 2D PCA plot, another principal component (PC3) captures the characteristic variation of siE1.

How to Read Hierarchical Clustering

Next, we apply hierarchical clustering.

From the heatmap, we can observe that many genes are commonly upregulated (red) across all siRNA treatments.
In contrast, downregulated genes (blue) differ between individual siRNA conditions.

Integrating PCA and Clustering Interpretations

The key point is not to create figures and feel satisfied, nor simply to look at them.
The essence of analysis lies in recognizing these patterns and differences, and interpreting what is happening within the cells.

PCA and hierarchical clustering are essentially different ways of viewing the same underlying data structure.
By comparing both results, you can deepen your understanding of the data.

Note
In the normalization and preprocessing step, log ratio transformation was applied. The meaning of the PCA origin depends on what is used as the reference (denominator) in this transformation.
When centering is used, the origin represents the average expression pattern of each gene. When using Ratio to Control Samples, the origin represents the average expression of the designated control group.
In this case, the siC-treated samples appear near the origin because they serve as the reference.

Methods such as UMAP are also widely used today, but they primarily represent structure based on distances. In contrast, PCA provides meaningful information about both distance and direction, making it particularly useful for interpreting the underlying sources of variation.

RNA-Seq Tutorial (05) - PCA and Clustering

【Start your 5-day free trial】
Explore global patterns with PCA and clustering

Back to TOC

________________________________________

6. Identifying Differentially Expressed Genes

Key Takeaway: Rather than relying solely on P-value cutoffs, use approaches such as Venn diagrams to narrow down genes from multiple perspectives. By distinguishing between genes that change consistently across conditions and those that are condition-specific, you gain insight into what to investigate next and how to validate it.

6-a. Exploring Differential Expression through Visualization (Subio Platform)

In this case, as observed in the clustering results from the previous section, there appears to be a bias: genes with increased expression are commonly shared across the three siRNA conditions, whereas genes with decreased expression show very little overlap.

After extracting differentially expressed genes using the Basic Plug-in tools, use the Venn Diagram tool to examine how these gene sets overlap.

It is important to pay attention to such inconsistencies. Why is there so little overlap among downregulated genes? Why do upregulated genes tend to be shared across conditions? What mechanisms might exist at the boundary between “common” and “condition-specific” responses? Could time-course data reveal these patterns?

Rather than focusing solely on P-values, combine these observations with clustering and PCA results from previous sections to generate hypotheses. The essence of data analysis lies in asking what hypotheses can be formed and what experiments are needed to test them.

A small P-value does not necessarily mean biological importance.
More importantly, you should not focus only on genes with small P-values. Even genes that are not statistically significant may still carry important biological insights if they show consistent patterns in expression profiles or clustering (Case Study No.199).

Practical Tip:
This may not be directly related to this specific case, but when you want to capture expression changes from a small subset of cells within a sampled tissue, it can be useful to combine multiple criteria such as:
  • Low expression levels
  • Low variability
  • Small changes in expression
By combining these conditions (e.g., using multiple filters), you may be able to detect signals that would otherwise be overlooked in standard analyses.

What matters is to imagine how the genes of interest would appear in the data, and to design ways to capture them.
These are precisely the skills required of analysts in the age of AI.

RNA-Seq Tutorial (06) - Extracting DEGs

【Start your 5-day free trial】
Identify differentially expressed genes (DEGs)

Back to TOC

________________________________________

6-b. RNA-Seq Differential Expression Analysis with edgeR & DESeq2 (Using ChatGPT and R)

Up to this point, you have explored the data visually and gained an understanding of its distributions and expression patterns.

Next, we evaluate differential expression quantitatively using statistical methods such as edgeR and DESeq2.

NOTE:
edgeR and DESeq2 are widely used statistical tools for RNA-Seq analysis that evaluate differential expression from count data. In this tutorial, rather than focusing on statistical theory or coding itself, we emphasize how to interpret the results.

The key point is that these methods do not provide definitive answers, but rather offer evidence to support interpretation.

By leveraging ChatGPT, you can focus on the core of your analysis without spending time writing scripts or debugging errors.

However, to generate accurate code, it is essential to provide the necessary information clearly and without ambiguity.

Rather than typing ideas directly into the chat, we recommend drafting your instructions in a text editor first, then copying the finalized version into ChatGPT.
This simple step reduces ambiguity and helps generate code that better matches your intent.

In particular, clearly specifying the following information will significantly improve the quality of the generated code:

  • Input file specifications
    What files are used, what format they are in, and what information they contain.
  • Processing specifications
    What analysis or processing should be performed using the input data.
  • Output specifications
    Ensure the output is a tab-delimited text file that can be imported into Subio Platform, and specify what information and columns it should include.

If an error message appears, copy and paste it into ChatGPT exactly as it is, without omitting any part.

In this way, rather than trying to write code directly, clearly defining the input, processing, and output is the key to effectively using ChatGPT for data analysis.
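As a concrete illustration of an output specification, the snippet below writes a tab-delimited results file with a header row, the kind of format Subio Platform can import. It is a generic Python sketch with hypothetical gene IDs and column names; in the video, the equivalent step is part of the ChatGPT-generated R script.

```python
import csv

# Hypothetical per-gene results; in practice these would come from the
# statistical analysis (e.g. edgeR or DESeq2 output tables).
results = [
    {"gene_id": "GENE1", "logFC": 2.1, "p_value": 0.001},
    {"gene_id": "GENE2", "logFC": -0.3, "p_value": 0.42},
]

# Output specification: tab-delimited text with a header row, so the
# file can be imported into Subio Platform as gene-level data.
with open("deg_results.txt", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["gene_id", "logFC", "p_value"],
                            delimiter="\t")
    writer.writeheader()
    writer.writerows(results)
```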

By comparing the results from edgeR and DESeq2, you can also observe how different methods produce different outcomes.

In practice, you may encounter situations where “one method shows significance while another does not.” However, such differences are often due to threshold settings or differences in statistical models, and do not directly indicate biological meaning (see Case Study No.355).

Watch the video below to see a practical workflow for performing edgeR and DESeq2 analysis using R and ChatGPT. This approach can also be applied to other R libraries and Bioconductor tools.

Calculate edgeR & DESeq2 P-values Using ChatGPT-Generated R Scripts | Subio Platform Workflow

As demonstrated in this video, most of the processing is handled within Subio Platform, while R scripts are used only where necessary. This approach keeps the scripts minimal and significantly reduces long-term maintenance costs.

Finally, the script can be saved as a text file and stored as an attachment. By preserving not only the results but also the analytical process (the script), the analysis can be reproduced, validated, and shared later. This is a crucial step in transforming analysis from a one-time task into a reusable and scalable analytical asset.

Back to TOC

________________________________________

7. Enrichment Analysis

Key Takeaway: Translating a “list of numbers” into “biological language” provides clues for interpreting the results of your analysis.

The next step after statistical analysis is to translate these results into biological meaning.
You can use the Enrichment Analysis tool included in the Advanced Plug-in of Subio Platform, or various freely available web-based tools. These tools allow you to interpret gene lists in terms of biological meaning.

Terminology can be confusing, but many approaches referred to as Gene Ontology (GO) analysis or pathway analysis are, in practice, forms of enrichment analysis.
Gene Set Enrichment Analysis (GSEA), DAVID Functional Annotation, and Metascape are among the most widely used tools for performing enrichment analysis.
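Under the hood, most over-representation approaches boil down to a hypergeometric (Fisher's exact) test: given how many genes in the universe carry an annotation, how surprising is the count observed in your gene list? A minimal sketch with made-up numbers:

```python
from math import comb

def hypergeom_pvalue(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k): probability of drawing k or more annotated genes when
    picking n genes from a universe of N genes, K of which carry the
    annotation (one-sided over-representation test)."""
    total = sum(comb(K, i) * comb(N - K, n - i)
                for i in range(k, min(n, K) + 1))
    return total / comb(N, n)

# Made-up numbers: 40 of 200 DEGs carry a GO term that annotates
# 500 of 10,000 genes in the universe (expected by chance: 10).
p = hypergeom_pvalue(k=40, n=200, K=500, N=10000)
print(p < 0.05)   # True: strongly over-represented
```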

Enrichment analysis does not provide definitive conclusions; rather, it offers clues for interpretation. 

Because enrichment analysis relies on existing literature and databases, several limitations should be considered:

  • Results tend to be biased toward well-studied areas.
  • When only a small number of genes are associated with a given phenomenon, it can be difficult to determine whether this reflects a true biological property or simply a lack of prior research.
  • Phenomena observed in specific tissues or cell types may be presented as if they were general.

It is not uncommon to feel that the analysis is complete once enrichment results are obtained.
However, this is only the starting point for interpretation.

RNA-Seq Tutorial (07) - Gene Annotation and Enrichment Analysis

【Start your 5-day free trial】
▶ Discover GO terms and pathways from your gene list

Back to TOC

________________________________________

8. Genomic Location-Specific Regulation and Motif Search

Key Takeaway: Approach new regulatory mechanisms from physical positional relationships. Infer the "cause" of expression variation by examining changes dependent on specific genomic regions or associations with transcription factor "motif sequences."

Once you have extracted DEGs or genes with specific expression patterns through clustering, you might want to check if these genes are distributed unevenly across the genome. If a cluster of genes in a specific region shows a synchronized increase or decrease in expression, it could potentially hint at localized epigenetic shifts or structural variations in the chromosome. 

If no such bias is observed, it may be more likely that the expression is regulated by transcription factors. You can identify genes having motif sequences near their TSS. Exploring these spatial relationships may provide additional layers of insight into your data. For your reference, the motif sequence search supports IUPAC nucleotide codes.
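Because the search supports IUPAC codes, a degenerate motif is effectively a small regular expression. The sketch below shows the idea in Python; the motif and sequence are illustrative examples only:

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[GC]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(motif: str) -> str:
    """Expand an IUPAC degenerate motif into a regular expression."""
    return "".join(IUPAC[base] for base in motif.upper())

# Example degenerate motif searched against a made-up promoter sequence.
pattern = re.compile(iupac_to_regex("TNAAGGTCA"))
sequence = "GGCTCAAGGTCATTGC"
match = pattern.search(sequence)
print(match.group() if match else "no match")   # TCAAGGTCA
```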

In this specific case, there appears to be no expression control biased toward genomic locations. However, we were able to find the binding motif sequence for ERR alpha (which was knocked down by siRNA) on Wikipedia and extract genes possessing this motif near their TSS. By checking the overlap between these genes and the list of genes downregulated by all three types of siRNA, we arrived at a single gene. Nevertheless, as mentioned earlier, it may not be necessary for a gene to be suppressed by all three siRNAs, or the threshold for differential expression analysis might have been too stringent. By relaxing these conditions, the number of candidate genes directly regulated by ERR alpha will likely expand.

RNA-Seq Tutorial (08) - Genomic Location Specific Regulation & Motif Seq

【Start your 5-day free trial】
▶ Explore genes with consensus motifs near TSS

Back to TOC

________________________________________

9. Conclusion: From Using Tools to Becoming a Decision-Maker

This tutorial has focused on introducing methods. However, in real-world data analysis, it is not the tools but the analyst who must decide which thresholds are appropriate and how to handle issues such as batch effects. It is important to recognize the role of subjective, informed decision-making beyond objective numbers.

In this tutorial, we have walked through the standard workflow of RNA-Seq data analysis. In practice, however, data analysis rarely follows a single linear path. The “correct” approach depends on the characteristics of the data and the goals of the study.

Once you have learned the methods, the next and most essential step is to evolve into an analyst who can make appropriate decisions based on informed judgment.

Real-world data is never perfect. This is precisely why analysts must complement incomplete information with their own judgment—grounded in both the data and the research objective.

This is what defines a resilient analyst—one whose value remains intact even in the age of AI.

Subio Platform provides the analytical foundation for this transition—from a user who simply operates tools to a researcher who interprets data and makes decisions. In addition to the software itself, we support this process through data analysis services and online training.

Now, apply this approach to your own data and experience the difference firsthand.

If you are just starting to learn RNA-Seq data analysis, Subio offers two approaches depending on your goals.

■ If you would like to first obtain analysis results and focus on learning through validation and interpretation, consider starting with our data analysis service.
→ Learn more about the Data Analysis Service

■ If you would like to understand and make decisions at each step—from preprocessing to analysis—and become able to carry out the entire workflow on your own, we recommend our online training.
→ Learn more about the Online Training

Back to TOC