GEO (Gene Expression Omnibus) hosts a large number of publicly available RNA-Seq datasets.
Some of these datasets provide not only FASTQ files, but also Gene Count data as
Supplementary Files, which can be used directly for analysis.
To obtain GEO-calculated Gene Count data, you can use the following URL.
Replace [GSE_accession_number] with the target GSE accession number:
https://www.ncbi.nlm.nih.gov/geo/download/?acc=[GSE_accession_number]
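The substitution above can be sketched in a few lines of Python. This is only an illustration: the function name geo_download_url is made up for this example, GSE12345 is a placeholder accession, and the check that an accession is "GSE" followed by digits is an assumption about the usual accession format.

```python
import re
from urllib.parse import urlencode

GEO_DOWNLOAD_BASE = "https://www.ncbi.nlm.nih.gov/geo/download/"


def geo_download_url(gse_accession: str) -> str:
    """Return the GEO download page URL for a Series accession such as 'GSE12345'."""
    accession = gse_accession.strip().upper()
    # Assumption: a Series accession is the letters "GSE" followed by digits.
    if not re.fullmatch(r"GSE\d+", accession):
        raise ValueError(f"Not a GSE accession: {gse_accession!r}")
    return GEO_DOWNLOAD_BASE + "?" + urlencode({"acc": accession})


# GSE12345 is a placeholder, not a recommendation of a specific dataset.
print(geo_download_url("gse12345"))
# https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE12345
```

Opening the printed URL in a browser shows whether GEO offers calculated Gene Count data for that Series.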
However, Gene Count data alone are not enough to start RNA-Seq data analysis.
You also need to organize gene annotations and sample information,
and link them properly to the expression data.
This video shows how to obtain RNA-Seq Gene Count data from GEO,
combine it with gene annotations and sample information, including experimental parameters,
and prepare the dataset for analysis in Subio Platform.
If a Gene Count table is available, you can proceed to downstream analyses
such as normalization, filtering, PCA, and differential expression analysis
without performing FASTQ preprocessing.
Two Ways to Obtain Gene Count Data: Supplementary Files and GEO-calculated Data
This page covers two types of Gene Count data: Supplementary File versions and GEO-calculated versions.
It explains how to obtain them and prepare them in a format that can be used for analysis in Subio Platform.
This video covers the workflow for cases where the sample name (Sample_title)
can be used directly as an experimental parameter.
If you need to extract experimental parameters from Sample_characteristics_ch1,
please also see this article.
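As a rough illustration of what extracting parameters from Sample_characteristics_ch1 involves, the sketch below parses a small, made-up excerpt laid out like a GEO series matrix file. It assumes tab-separated lines whose values are quoted "key: value" strings. The sample titles, accessions, and parameters are hypothetical, and this is not the procedure used inside Subio Platform.

```python
import csv
import io

# Hypothetical excerpt in the layout of a GEO series matrix file (tab-separated).
matrix_text = (
    '!Sample_title\t"ctrl_rep1"\t"drug_rep1"\n'
    '!Sample_geo_accession\t"GSM0000001"\t"GSM0000002"\n'
    '!Sample_characteristics_ch1\t"tissue: liver"\t"tissue: liver"\n'
    '!Sample_characteristics_ch1\t"treatment: control"\t"treatment: drug"\n'
)


def parse_sample_parameters(text):
    """Return one dict per sample with its title and 'key: value' characteristics."""
    samples = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        # Only per-sample rows are relevant here; skip everything else.
        if not row or not row[0].startswith("!Sample_"):
            continue
        field, values = row[0], row[1:]
        if not samples:
            samples = [{} for _ in values]
        if field == "!Sample_title":
            for sample, value in zip(samples, values):
                sample["Sample_title"] = value
        elif field == "!Sample_characteristics_ch1":
            for sample, value in zip(samples, values):
                # Split "tissue: liver" into the parameter name and its value.
                key, sep, val = value.partition(":")
                if sep:
                    sample[key.strip()] = val.strip()
    return samples


for sample in parse_sample_parameters(matrix_text):
    print(sample)
```

Each resulting dict holds one sample's title together with its experimental parameters, which is the shape needed to attach parameters to the columns of the Gene Count table.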
Advantages and Limitations of Supplementary File Versions
In many RNA-Seq datasets registered in GEO, Gene Count text files are provided as
Supplementary Files.
Supplementary File versions have the advantage of being easier to relate to the results reported in the paper.
However, because FASTQ processing methods differ from study to study,
they may not be suitable for integrated analysis across different Series.
In Supplementary File versions, Ensembl Gene IDs are often used as the gene identifiers.
In that case, gene annotations corresponding to Ensembl Gene IDs are preinstalled in Subio Platform,
so you can start the analysis smoothly without separately obtaining annotation files.
Shared Ensembl Gene IDs can also make it easier to compare analysis results
between datasets at the gene level.
Even so, because experimental conditions and preprocessing methods differ across Series,
such data cannot always be combined directly for integrated analysis.
One point to watch when using Supplementary File versions is that the sample names in the Gene Count table
may not match the Sample Titles registered in GEO.
In such cases, you need to match the sample columns in the Gene Count table
with the sample information in GEO.
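One way to picture this matching step is a small lookup that pairs each column of the Gene Count table with a GEO Sample_title and reports the columns that still need attention. The sketch below is a minimal example under stated assumptions: the function match_columns, the column names, and the manual mapping are all hypothetical, and the manual mapping has to be written by you after comparing the table with the GEO records.

```python
def match_columns(count_columns, sample_titles, manual_map=None):
    """Pair count-table column names with GEO Sample_title values.

    manual_map holds hand-written corrections (column name -> Sample_title)
    for columns whose names differ from the titles registered in GEO.
    Returns the matched pairs and the list of columns left unmatched.
    """
    manual_map = manual_map or {}
    titles = set(sample_titles)
    matched, unmatched = {}, []
    for column in count_columns:
        title = manual_map.get(column, column)
        if title in titles:
            matched[column] = title
        else:
            unmatched.append(column)
    return matched, unmatched


# Hypothetical names: the count table calls the second sample "D1".
columns = ["ctrl_rep1", "D1"]
titles = ["ctrl_rep1", "drug_rep1"]

print(match_columns(columns, titles))
# ({'ctrl_rep1': 'ctrl_rep1'}, ['D1'])
print(match_columns(columns, titles, {"D1": "drug_rep1"}))
# ({'ctrl_rep1': 'ctrl_rep1', 'D1': 'drug_rep1'}, [])
```

Keeping the unmatched list explicit avoids silently pairing a column with the wrong sample.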
Advantages and Limitations of GEO-calculated Versions
Even when Supplementary Files are not available,
GEO may provide Gene Count data calculated using a standardized method.
In that case, the data can be obtained from the following link:
https://www.ncbi.nlm.nih.gov/geo/download/?acc=[GSE_accession_number]
Because GEO calculates these Gene Counts with the same method for every dataset,
multiple datasets registered in GEO can be handled under the same preprocessing conditions,
which makes integrated reanalysis easier to consider.
However, whether the data can actually be used for integrated analysis
must be judged by checking data distributions, batch effects, and other factors.
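A first, very rough look at data distributions can be taken by summarizing each sample, for example its total count (library size) and the median of log2(count + 1). The sketch below uses a made-up toy table and hypothetical sample names. It is only a starting point and not a replacement for proper checks such as PCA or batch-effect assessment.

```python
import math
import statistics

# Hypothetical toy table: gene ID -> raw counts, one value per sample.
samples = ["S1", "S2"]
counts = {
    "geneA": [7, 70],
    "geneB": [1, 15],
    "geneC": [3, 31],
}


def sample_summaries(sample_names, table):
    """Return library size and median log2(count + 1) for every sample."""
    summaries = {}
    # zip(*table.values()) turns gene rows into per-sample columns.
    for name, column in zip(sample_names, zip(*table.values())):
        log_values = [math.log2(c + 1) for c in column]
        summaries[name] = {
            "library_size": sum(column),
            "median_log2": statistics.median(log_values),
        }
    return summaries


for name, summary in sample_summaries(samples, counts).items():
    print(name, summary)
```

Samples or Series whose summaries differ strongly from the rest are candidates for closer inspection before any integrated analysis.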
Note that GEO-calculated Gene Count data use NCBI Gene IDs.
Therefore, this video also shows how to obtain gene annotations corresponding to NCBI Gene IDs
and integrate them with the Gene Count data.
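Conceptually, this integration is a join on the gene identifier. The sketch below joins a tiny count table with a tiny annotation table on a shared GeneID column. The column names (GeneID, Symbol, Description), the sample accessions, and the count values are assumptions chosen for illustration, so check them against the files you actually download.

```python
import csv
import io

# Hypothetical tab-separated excerpts; real files are much larger.
counts_tsv = "GeneID\tGSM0000001\tGSM0000002\n1\t12\t30\n2\t0\t5\n"
annotation_tsv = (
    "GeneID\tSymbol\tDescription\n"
    "1\tA1BG\talpha-1-B glycoprotein\n"
    "2\tA2M\talpha-2-macroglobulin\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def annotate_counts(count_rows, annotation_rows, key="GeneID"):
    """Attach annotation columns to each count row that shares the same gene ID."""
    annotation_by_id = {row[key]: row for row in annotation_rows}
    merged = []
    for row in count_rows:
        # Genes without an annotation entry keep only their count columns.
        annotation = annotation_by_id.get(row[key], {})
        merged.append({**annotation, **row})
    return merged


for row in annotate_counts(read_tsv(counts_tsv), read_tsv(annotation_tsv)):
    print(row)
```

All values are kept as strings here. Convert the count columns to numbers only when a later step needs them.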
Many other RNA-Seq datasets, however, use Ensembl Gene IDs.
To compare your analysis results with such datasets,
you need to convert between NCBI Gene IDs and Ensembl Gene IDs.
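The conversion itself amounts to re-keying the table through a mapping between the two ID systems. The sketch below shows one cautious way to do this, setting aside IDs that have no mapping or more than one. The function convert_ids is hypothetical, the mapping is a two-gene illustrative excerpt that should be replaced by a table from a current annotation source, and 999999999 is a made-up ID standing in for an unmapped gene.

```python
# Hypothetical toy table keyed by NCBI Gene ID (as strings).
counts_by_ncbi = {
    "1": [12, 30],
    "2": [0, 5],
    "999999999": [3, 4],  # made-up ID with no mapping
}

# Illustrative excerpt only: NCBI Gene ID -> list of Ensembl Gene IDs.
ncbi_to_ensembl = {
    "1": ["ENSG00000121410"],  # A1BG
    "2": ["ENSG00000175899"],  # A2M
}


def convert_ids(table, id_map):
    """Re-key a count table, keeping only genes with exactly one target ID."""
    converted, unmapped = {}, []
    for gene_id, row in table.items():
        targets = id_map.get(gene_id, [])
        if len(targets) != 1:
            # Missing or ambiguous mappings are set aside rather than guessed.
            unmapped.append(gene_id)
            continue
        converted[targets[0]] = row
    return converted, unmapped


converted, unmapped = convert_ids(counts_by_ncbi, ncbi_to_ensembl)
print(converted)
print(unmapped)
# ['999999999']
```

This sketch does not handle the case where two NCBI Gene IDs map to the same Ensembl Gene ID. In that case the later row would overwrite the earlier one, so such cases need an explicit rule before real use.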
Next Step
Now that you have obtained Gene Count data from GEO,
you are ready to start RNA-Seq data analysis.
The following tutorial explains the next steps.
Since the data import step has already been completed,
please proceed to “2. Creating and Setting Up a Series”:
→ RNA-Seq Data Analysis Tutorial
Related Page
If Gene Count data are not available, or if you want to calculate expression values using a specific method,
you need to start the analysis from FASTQ files.
In that case, please see the following article:
How to obtain .fastq.gz files from Gene Expression Omnibus (GEO)