What Are Best Practices for RNA-Seq Data Analysis? - In the AI Era, Start by Looking at Similar Data

When you start learning RNA-Seq data analysis, you may find yourself searching for information such as:
“RNA-Seq data analysis best practices”
“RNA-Seq analysis workflow”
“Should I use edgeR or DESeq2?”

Of course, RNA-Seq analysis does have a broadly used basic workflow. FASTQ files are processed, Gene Counts or TPM values are calculated, and the analysis proceeds to normalization, PCA, clustering, differential expression analysis, and GO or pathway analysis. This kind of overall workflow is common to many datasets.
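As a rough illustration of the middle part of that workflow, the following Python sketch goes from a count matrix to log-scaled CPM values and a PCA of the samples. It assumes only NumPy. The function names `log_cpm` and `pca_scores`, the pseudocount of 1, and the simulated counts are illustrative choices, and CPM simply stands in for whichever normalization actually suits the data.

```python
import numpy as np

def log_cpm(counts, pseudocount=1.0):
    """Convert a genes x samples count matrix to log2(CPM + pseudocount)."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)          # total counts per sample
    cpm = counts / lib_sizes * 1e6          # counts per million
    return np.log2(cpm + pseudocount)

def pca_scores(log_expr, n_components=2):
    """PCA of samples via SVD; returns sample scores and explained variance ratios."""
    x = log_expr.T                          # samples x genes
    x = x - x.mean(axis=0)                  # center each gene
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s ** 2 / np.sum(s ** 2)
    return scores, explained[:n_components]

# Simulated toy data (an assumption for illustration): 200 genes, 6 samples,
# with the first 20 genes raised in samples 4 to 6.
rng = np.random.default_rng(0)
base = rng.gamma(shape=2.0, scale=50.0, size=200)
counts = rng.poisson(np.outer(base, np.ones(6)))
counts[:20, 3:] *= 4

scores, explained = pca_scores(log_cpm(counts))
print(scores.round(2))
print(explained.round(3))
```

Whether the samples separate in the score plot, and along which component, is exactly the kind of thing that has to be checked on the real data rather than assumed.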

However, there is one important point to keep in mind. There is no universal best practice for RNA-Seq data analysis that can be applied to every dataset as-is.

Even if a certain workflow is presented as a “best practice,” you cannot judge from the workflow name or the tools used alone whether it is really suitable for the specific data you want to analyze.

The characteristics of RNA-Seq data vary greatly depending on the conditions

Real RNA-Seq data are not as uniform as one might imagine from textbook explanations or statistical models. This is because experimental and measurement conditions have a major impact on data quality.

For example, the appearance of the data can vary greatly depending on factors such as:

  • read depth
  • tissue or cell type
  • cellular composition of the sample
  • RNA quality
  • sample preparation method
  • library preparation protocol
  • kit used
  • sequencing facility or timing
  • number of samples
  • whether the design is paired or not
  • how many low-expression genes are included
  • presence or absence of batch effects

Even among RNA-Seq datasets, data from an in vitro experiment using cultured cells and biopsy data from patients can differ greatly in variability, in how low-expression regions should be handled, and in the relationships among samples. In addition, for low-input RNA-Seq, where the amount of starting RNA is small, the assumptions normally made for bulk RNA-Seq may not apply as-is. (Case Study No.432)

Whether a workflow is suitable for the data in front of you must be judged by actually looking at the data.

Best practices are not fixed procedures

There are general principles in RNA-Seq analysis that are meaningful. However, this does not mean that the same settings should be applied to every dataset.

How strongly should low-expression genes be filtered? Which normalization method is appropriate? How should separation in a PCA plot be interpreted? How should differences among edgeR, DESeq2, and t-test results be understood? Is a P-value sufficient, or should FDR be used? How should fold change criteria be combined? Should batch effects be corrected, or is the experimental design one in which correction should not be applied?

These questions cannot be answered without looking at the actual data, and the answers also depend on the purpose of the analysis.
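To make the P-value versus FDR question concrete, here is a minimal sketch of the Benjamini-Hochberg adjustment written with NumPy. The function name and the five P-values are made up for illustration. In practice the adjusted values would usually come from the testing tool itself.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)                              # ascending P-values
    ranked = p[order] * n / np.arange(1, n + 1)        # p * n / rank
    # Enforce monotonicity, working down from the largest P-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.60]             # hypothetical values
adj = benjamini_hochberg(pvals)
print(adj)  # approximately [0.005, 0.02, 0.05125, 0.05125, 0.6]
```

In this toy case four of the five raw P-values are below 0.05, but only two remain below 0.05 after adjustment. With tens of thousands of genes the gap is typically far larger, which is why the choice depends on the purpose of the analysis.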

In other words, best practices in RNA-Seq data analysis do not mean applying a fixed workflow as-is. They mean checking the data distribution, visualizing relationships among samples, examining the effects of low-count regions and batch effects, and judging whether the analysis workflow is suitable for that dataset.
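One way to examine the effect of the low-count region is to see how many genes survive under different filtering thresholds. The sketch below assumes NumPy and simulated counts. The function `keep_expressed`, its default thresholds, and the list of CPM cutoffs are hypothetical examples, not recommended settings.

```python
import numpy as np

def keep_expressed(counts, min_cpm=1.0, min_samples=2):
    """Boolean mask of genes with CPM >= min_cpm in at least min_samples samples."""
    counts = np.asarray(counts, dtype=float)
    cpm = counts / counts.sum(axis=0) * 1e6
    return (cpm >= min_cpm).sum(axis=1) >= min_samples

# Simulated counts (an assumption): 2000 genes with a long tail of low expression.
rng = np.random.default_rng(1)
means = rng.lognormal(mean=4.0, sigma=2.0, size=2000)
counts = rng.poisson(np.outer(means, np.ones(6)))

thresholds = (0.5, 1.0, 5.0, 10.0)
kept = [int(keep_expressed(counts, min_cpm=t).sum()) for t in thresholds]
for t, k in zip(thresholds, kept):
    print(f"min_cpm={t}: {k} of {counts.shape[0]} genes kept")
```

If the number of retained genes, or the downstream results, change sharply between neighboring thresholds, that is a sign the low-count region deserves a closer look in that dataset.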

Start by looking at GEO Series with similar conditions

So, where should you start when thinking about what kind of analysis is suitable for your data?

One practical approach is to look for public datasets with conditions similar to the data you want to analyze, and actually inspect them.

GEO contains RNA-Seq data from many different tissues, cells, diseases, drug treatments, gene knockdown experiments, and experimental designs. By finding Series close to your own research and checking the distribution of Gene Counts or TPM, sample-to-sample variability, PCA, clustering, and the behavior of low-expression regions, you can begin to see what tends to happen in that type of data.
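Once a count matrix from a candidate Series has been loaded, a first look at these points could be sketched as follows. This assumes NumPy and a genes x samples array. The function `sample_summary` and the small matrix are invented for illustration, and loading the data from GEO is left out.

```python
import numpy as np

def sample_summary(counts):
    """Per-sample library size, fraction of zero counts, and sample correlation."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)
    zero_frac = (counts == 0).mean(axis=0)
    log_expr = np.log2(counts / lib_sizes * 1e6 + 1.0)
    corr = np.corrcoef(log_expr, rowvar=False)   # samples x samples (Pearson)
    return lib_sizes, zero_frac, corr

# Tiny made-up matrix: 5 genes x 4 samples, where the third sample is much shallower.
toy = np.array([
    [10, 12, 0, 11],
    [0, 0, 0, 1],
    [200, 180, 20, 210],
    [50, 55, 5, 48],
    [5, 7, 0, 6],
])
lib_sizes, zero_frac, corr = sample_summary(toy)
print(lib_sizes)      # [265. 254.  25. 276.]
print(zero_frac)      # [0.2 0.2 0.6 0. ]
print(corr.round(3))
```

A sample with an unusually small library size or a high fraction of zeros, like the third one here, is worth noting before interpreting any PCA or clustering result.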

For example, if you are going to analyze bulk RNA-Seq data from human tissue, you can look for GEO Series involving similar tissues or diseases. If you are planning an experiment using cultured cells, you can look for data with similar cell types or treatment conditions. If you are planning low-input RNA-Seq, you can look for cases with similar cell types or experimental conditions.

In this way, looking at real data under similar conditions gives you much more concrete information than searching for abstract best practices.

AI has made it easier to find similar public datasets

In the past, finding GEO Series similar to your own data was not easy. You had to use clues such as disease name, tissue name, cell type, treatment condition, measurement platform, and sample size, and then examine the details one by one.

For this reason, it was not always realistic to recommend that beginners in RNA-Seq analysis “first find similar public datasets and look at them.”

Today, however, AI has changed this situation significantly.

For example, you can ask in natural language:

“Please find GEO Series comparing disease and control groups in human XX tissue. RNA-Seq is preferred, but microarray data are also acceptable. Please choose datasets with about 20 to 100 samples in the disease group.”

“I am looking for RNA-Seq data from liver-derived cultured cells in which a specific gene was knocked down. If possible, I would prefer a cell line that is not derived from cancer cells.”

“Please find examples of low-input RNA-Seq or bulk RNA-Seq data derived from a small number of neuronal cells. I want to understand the degree of variability in such data, so please list as many candidates as possible without restricting the protocol too much.”

In this way, AI can greatly reduce the burden of searching for public datasets. Of course, each candidate suggested by AI still needs to be checked individually. Even so, the fact that the initial candidate search has become much easier than before is a major change.

In other words, in the AI era, RNA-Seq data analysis no longer has to start with a search for abstract best practices. It is becoming realistic even for beginners to search for public datasets close to their own purpose and then actually visualize those data.

Looking at similar public datasets also helps experimental planning

Finding and inspecting GEO Series close to your own data is useful not only for thinking about analysis methods, but also for planning experiments.

In omics experiments, including RNA-Seq, many factors affect the characteristics of the data: read depth, tissue or cell type, sample preparation, library preparation protocol, kit used, sequencing facility, batch effects, and more.

Therefore, before starting an experiment, it is worth checking similar public datasets to see what level of variability or bias may occur. This information helps when deciding on budget, sample size, and experimental design.

In omics experiments, systematic errors and nonlinear bias cannot be completely avoided. For this reason, it is important to anticipate in advance what kinds of problems may actually occur. If possible, incorporating measures against those potential problems into the experimental design can help manage the risk to some extent.

For more on this issue, please also see “Omics Experiments Cannot Escape the Problem of Nonlinear Bias” and “Good Research Starts with a Superior Experimental Design.”

In addition, before starting an experiment, checking the realistic performance of the measurement system you plan to use is highly effective for avoiding wasted time and budget. Please also see “Before You Fail with RNA-Seq or Microarray: Why Pre-Experiment Data Assessment Saves Time and Budget.”

Be careful when trying to integrate multiple GEO Series

If you are thinking of integrating GEO data or analyzing them together with your own data, a different kind of caution is needed.

Using public datasets with similar conditions as a reference is not the same as simply combining multiple GEO Series and directly comparing them.

RNA-Seq Gene Counts are not absolute amounts of expression that can be directly compared across different Series. They should be understood as values that may be comparable only when viewed as ratios relative to a common control group. The insight gained from this point can be especially important for those planning large prospective studies.
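The idea of viewing values as ratios relative to a control group within each Series can be sketched in Python as follows. This is an idealized toy example assuming NumPy: the two "Series" are invented, and the second differs from the first only by a gene-specific multiplicative bias, which is a simplifying assumption. Real differences between Series can be nonlinear, so the clean agreement shown here should not be expected in general.

```python
import numpy as np

def log_cpm(counts, pseudocount=1.0):
    """log2(CPM + pseudocount) for a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    return np.log2(counts / counts.sum(axis=0) * 1e6 + pseudocount)

def log_ratio_vs_control(counts, is_control):
    """Log2 ratio of every sample to the mean of the control samples in the same Series."""
    log_expr = log_cpm(counts)
    is_control = np.asarray(is_control, dtype=bool)
    control_mean = log_expr[:, is_control].mean(axis=1, keepdims=True)
    return log_expr - control_mean

# Hypothetical Series A: 3 genes x 4 samples (2 control, 2 treated).
series_a = np.array([
    [100, 100, 400, 400],
    [500, 500, 500, 500],
    [400, 400, 100, 100],
])
# Hypothetical Series B: same biology, but with a gene-specific measurement bias.
bias = np.array([[2.0], [1.0], [2.0]])
series_b = series_a * bias
is_control = [True, True, False, False]

ratio_a = log_ratio_vs_control(series_a, is_control)
ratio_b = log_ratio_vs_control(series_b, is_control)
print(log_cpm(series_a)[:, 0].round(2), log_cpm(series_b)[:, 0].round(2))
print(ratio_a.round(2))
print(ratio_b.round(2))
```

In this toy case the log-CPM values of the control samples differ between the two Series, while the log ratios to each Series' own control group agree closely. The ratios are only meaningful when each Series contains a comparable control group.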

This point is discussed in more detail in “RNA-Seq Gene Counts Are Not Directly Comparable Absolute Values | GEO Dataset Integration and How to Handle Batch Effects.”

The way we learn RNA-Seq analysis is also changing

In this way, in the AI era, RNA-Seq data analysis can now start from practice: rather than first memorizing the “correct workflow,” it is becoming possible to begin by looking at real data and understanding why each step is needed.

If you find a similar GEO Series but have trouble importing the data, or if you are unsure how to interpret the data or choose an analysis strategy, you can also consult Subio’s data analysis service for assessment. Estimates are free, so if you have a GEO Series you would like to discuss, please contact us with the GSE number.

Summary: Instead of searching for best practices, look at the data

There is no universal best practice for RNA-Seq data analysis.

This is because the characteristics of RNA-Seq data vary greatly depending on read depth, tissue or cell type, sample preparation, protocol, kit, measurement conditions, experimental design, and other factors.

Therefore, whether a generally recommended workflow is suitable for your data cannot be judged without actually looking at the data.

If you are starting to learn RNA-Seq data analysis, we recommend first using AI to find GEO Series close to your purpose, and then actually visualizing those data.

Best practices are not fixed procedures. They are a way of thinking: look at the data, check it, and make informed judgments.
