We introduced an automatic data import wizard for TCGA data in Subio Platform v1.20.5031, so the operation is much easier in the latest version. Please refer to the new TCGA READ case study for the data import steps.
The Cancer Genome Atlas (TCGA) is a large data repository of multi-omics and clinical data of cancers. This tutorial shows how to get RNA-seq data to analyze (part 1), how to get DNA methylation microarray data to analyze (part 2) and how to find relationships between gene expression patterns and DNA methylation patterns (part 3). I extracted a list of genes which are potentially controlled by changes in DNA methylation status in prostate cancer, e.g. TDRD1 and SLC45A2 as up-regulated, and SCGB3A1 and EVPLL as down-regulated.
This is the list of genes which are possibly controlled by DNA methylation in PRAD (22 up and 173 down).
Parts 1 and 2 are preparation, so you can skip to part 3. Part 3 covers the integration of DNA methylation and gene expression data with the "Find Correlated Regions" tool included in the Advanced Plug-in.
TCGA PRAD, RNA-Seq & DNA Methylation Integration (Part 1)
0:00 Getting RNA-Seq and clinical data from TCGA Data Portal.
2:15 Importing the RNA-Seq data into Subio Platform. (Import Samples wizard)
3:05 Editing the platform to separate Gene Symbols and Entrez Gene IDs.
3:30 Setting the Gene Symbol column.
3:45 Editing the platform to fill gene annotations using the "gene_info" file from the NCBI FTP site.
4:25 Importing the Gene2Go file to fill GO terms in the platform.
5:00 Editing sample information (from file name to TCGA barcode)
6:10 Editing sample information (from TCGA barcode to patient ID)
8:00 Editing sample information (from patient ID to clinical information)
9:10 Creating a Series. (The end of collecting and merging a variety of information.)
-- You can skip to here if you're not interested in the details of the data import process. --
9:30 Importing the SSA file, an archive file containing all information and analysis results.
9:55 Browsing DataSets, which are different definitions of sample groups.
10:10 Comparing counts of 2 samples on the scatter plot and histogram.
10:40 Browsing Measurement Lists to understand the filtering steps.
11:25 Browsing a result of hierarchical clustering of gene expression data.
11:50 Browsing a result of PCA, with an interactive view of PCA and clustering.
13:00 Importing the SSA file of RNA-Seq data of samples, which are paired with methylation data.
13:35 Browsing a result of PCA, which separates the tumor samples into 2 types.
14:15 Browsing measurement lists of DEGs.
15:45 Browsing hierarchical clustering results.
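The sample information steps at 5:00 to 8:00 (file name to TCGA barcode, barcode to patient ID, patient ID to clinical information) can also be sketched outside the GUI. The following is a minimal sketch, not what Subio Platform does internally. It assumes the standard TCGA barcode layout, where the first three fields form the patient ID and the first two digits of the fourth field are the sample type code (01-09 tumor, 10-19 normal). The barcodes and the "gleason_score" field are made-up examples, and the function names are hypothetical.

```python
def parse_tcga_barcode(barcode):
    """Split a TCGA barcode into patient ID and tissue type.

    Assumes the layout TCGA-<site>-<participant>-<sample><vial>-...
    """
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TCGA":
        raise ValueError(f"not a TCGA sample barcode: {barcode!r}")
    patient_id = "-".join(parts[:3])
    sample_code = int(parts[3][:2])  # e.g. "01A" -> 1
    if 1 <= sample_code <= 9:
        tissue = "tumor"
    elif 10 <= sample_code <= 19:
        tissue = "normal"
    else:
        tissue = "other"  # e.g. control samples
    return {
        "patient_id": patient_id,
        "sample_type_code": sample_code,
        "tissue": tissue,
    }


def attach_clinical(barcodes, clinical):
    """Join per-sample barcodes with a per-patient clinical table.

    `clinical` is a dict keyed by patient ID (hypothetical structure).
    Samples without a clinical record keep only the parsed fields.
    """
    rows = []
    for barcode in barcodes:
        row = {"barcode": barcode}
        row.update(parse_tcga_barcode(barcode))
        row.update(clinical.get(row["patient_id"], {}))
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Made-up barcodes and clinical values, for illustration only.
    samples = [
        "TCGA-AB-1234-01A-11R-A118-07",
        "TCGA-AB-1234-11A-11R-A118-07",
    ]
    clinical_table = {"TCGA-AB-1234": {"gleason_score": 7}}
    for row in attach_clinical(samples, clinical_table):
        print(row)
```

Because the tumor and the normal sample of one patient share the same patient ID, the same clinical record is attached to both, which is why the barcode is reduced to the patient ID before the clinical information is merged.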
TCGA PRAD, RNA-Seq & DNA Methylation Integration (Part 2)
TCGA's DNA methylation data of 301 prostate adenocarcinoma patients is based on Illumina's HumanMethylation450 BeadChip. A PC with 64GB RAM is not powerful enough to visualize 480k probes x 301 samples, so I'm going to show how to divide the data to make it manageable. I hope it gives an idea to those who think they can't work with a big data set on their computers.
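The divide-and-filter idea can be sketched in a few lines. This is only an illustration under my own assumptions: the probes are split into 3 chunks, each chunk is filtered separately, and only the changing probes are merged back. The filter criterion here (range of beta values of at least 0.2) and the function names are hypothetical and are not necessarily the filter used in the video. In practice each chunk would be read from its own file, so that only one chunk is in memory at a time.

```python
def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks consecutive pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def is_changing(betas, min_range=0.2):
    """True if the beta values of one probe vary by at least min_range.

    Missing values (None) are ignored. The threshold is an assumption.
    """
    values = [b for b in betas if b is not None]
    return bool(values) and max(values) - min(values) >= min_range


def filter_in_chunks(table, n_chunks=3, min_range=0.2):
    """Filter probes chunk by chunk and merge the surviving probes.

    `table` maps probe ID -> list of beta values (one per sample).
    """
    kept = {}
    for chunk in split_into_chunks(sorted(table), n_chunks):
        for probe_id in chunk:
            if is_changing(table[probe_id], min_range):
                kept[probe_id] = table[probe_id]
    return kept


if __name__ == "__main__":
    # Made-up probe IDs and beta values, for illustration only.
    toy = {
        "cg_a": [0.10, 0.10, 0.12],
        "cg_b": [0.10, 0.90, 0.50],
        "cg_c": [0.50, 0.50, 0.50],
        "cg_d": [0.20, 0.60, 0.40],
    }
    print(sorted(filter_in_chunks(toy)))
```

Since most probes do not change across samples, the merged table of changing probes is much smaller than the original one, and it can be handled as a single data set again.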
0:00 Getting clinical and DNA methylation array data from TCGA Data Portal.
0:50 Getting the annotation table of the Illumina methylation array from the GEO site.
1:00 Editing the annotation table to split it into 3 smaller tables.
3:05 Importing the separated annotation tables to create sub-platforms.
3:25 Importing experimental data of DNA methylation study.
4:10 Creating a Series and editing parameter to extract TCGA sample ID.
4:40 Filtering to remove probes whose methylation status is unchanging.
5:45 Exporting the data of changing probes only.
6:15 Merging the data of changing probes from the 3 separated Series in Excel.
7:00 Importing the changing data into the platform of whole probes.
7:45 Creating a Series and restoring the sample IDs.
8:20 Converting data into Region Lists to make them available to tools in the Advanced Plug-in.
8:40 Loading a genome of hg19 RefSeq genes.
9:20 Creating intervals to merge probes which are closely located and sharing similar values.
10:20 Exporting the intervals as BED files, and importing them again to create a platform of intervals.
11:00 Editing sample information to fill clinical information.
12:50 Creating a Series and editing parameters.
13:55 Editing DataSets to define groups of samples.
15:10 Editing the normalization scenario for pre-processing.
16:00 PCA on DNA methylation status.
17:00 Filtering to extract intervals whose methylation status changes between normal and tumor.
18:55 Applying Genomic Location Filter to extract genes having methylated or unmethylated intervals.
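The interval step at 9:20 (merging probes which are closely located and share similar values) can be illustrated with a small sketch. It is not the algorithm of the Advanced Plug-in. I assume one representative value per probe, a maximum gap of 500 bp between neighboring probes and a maximum difference of 0.1 between neighboring values. These thresholds, the function name and the coordinates are hypothetical. Positions are treated as 0-based, and the end of each interval is the last position plus 1, as in BED files.

```python
def merge_probes(probes, max_gap=500, max_diff=0.1):
    """Merge neighboring probes into intervals.

    `probes` is a list of (chrom, pos, value) tuples. A probe joins the
    current interval if it is on the same chromosome, within max_gap bp
    of the previous probe, and its value differs from the previous
    probe's value by at most max_diff (all thresholds are assumptions).
    Returns a list of (chrom, start, end, mean_value) tuples.
    """
    intervals = []
    current = None
    for chrom, pos, value in sorted(probes):
        if (
            current is not None
            and chrom == current["chrom"]
            and pos - current["last_pos"] <= max_gap
            and abs(value - current["values"][-1]) <= max_diff
        ):
            # Extend the current interval with this probe.
            current["last_pos"] = pos
            current["values"].append(value)
        else:
            # Close the current interval and start a new one.
            if current is not None:
                intervals.append(current)
            current = {
                "chrom": chrom,
                "start": pos,
                "last_pos": pos,
                "values": [value],
            }
    if current is not None:
        intervals.append(current)
    return [
        (
            iv["chrom"],
            iv["start"],
            iv["last_pos"] + 1,
            sum(iv["values"]) / len(iv["values"]),
        )
        for iv in intervals
    ]


if __name__ == "__main__":
    # Made-up coordinates and values, for illustration only.
    toy = [
        ("chr1", 100, 0.80),
        ("chr1", 300, 0.85),
        ("chr1", 5000, 0.20),
        ("chr2", 100, 0.50),
    ]
    for chrom, start, end, mean_value in merge_probes(toy):
        print(chrom, start, end, round(mean_value, 3), sep="\t")
```

Merging probes into intervals reduces the number of rows further, and the tab-separated output has the same column order as the first columns of a BED file.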
TCGA PRAD, RNA-Seq & DNA Methylation Integration (Part 3)
This demonstration shows how to visualize the relationship between gene expression patterns and DNA methylation patterns in their upstream regions, and how to extract genes whose expression patterns are anti-correlated with DNA methylation patterns.
0:00 Extracting genes with methylated or unmethylated sites, and browsing their expression patterns.
1:30 Extracting DEGs between normal and tumor.
2:10 Comparing DEGs and genes with methylated or unmethylated sites. The comparison indicates that getting methylated in cancer down-regulates their expression levels.
3:00 Loading the Series of DNA methylation data again, and converting methylation status into Region Lists.
3:50 Loading the Series of RNA-Seq data again.
4:00 Extracting genes which are anti-correlated with methylation status. ("Find Correlated Regions" tool)
5:10 Browsing gene and methylation status pairs on scatter plot.
5:40 Creating a measurement list of genes which are anti-correlated with methylation status.
5:50 Comparing DEGs and genes which are anti-correlated with their methylation status.
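The idea behind extracting anti-correlated genes can be sketched as follows. I don't claim this is how the "Find Correlated Regions" tool is implemented. It is a minimal sketch which assumes Pearson correlation across paired samples and a cutoff of r <= -0.5. The cutoff, the gene and region names and the values are all made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / math.sqrt(sxx * syy)


def anti_correlated(expression, methylation, pairs, max_r=-0.5):
    """Return (gene, region, r) for pairs with r <= max_r.

    `expression` maps gene -> expression values per sample.
    `methylation` maps region -> methylation values for the same
    samples in the same order. `pairs` lists which gene belongs to
    which upstream region. The cutoff max_r is an assumption.
    """
    hits = []
    for gene, region in pairs:
        r = pearson(expression[gene], methylation[region])
        if r <= max_r:
            hits.append((gene, region, r))
    return hits


if __name__ == "__main__":
    # Made-up values for 4 paired samples, for illustration only.
    expression = {"GENE_A": [1.0, 2.0, 3.0, 4.0], "GENE_B": [1.0, 2.0, 3.0, 4.0]}
    methylation = {
        "region_1": [0.9, 0.7, 0.5, 0.3],  # goes down as GENE_A goes up
        "region_2": [0.1, 0.3, 0.5, 0.7],  # goes up together with GENE_B
    }
    pairs = [("GENE_A", "region_1"), ("GENE_B", "region_2")]
    for gene, region, r in anti_correlated(expression, methylation, pairs):
        print(gene, region, round(r, 3))
```

The samples must be paired, that is, the expression and methylation values of one column must come from the same sample. This is why the RNA-Seq Series of samples paired with methylation data is loaded in this part.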
Analyzing a large data set requires a large amount of memory (RAM) on the computer. If memory errors happen often, you need to add physical RAM, but there are also workarounds for working with limited memory.