The Cancer Genome Atlas Stomach Adenocarcinoma (TCGA-STAD) data is available from the GDC Data Portal. You can easily import the methylation and RNA-Seq data with Subio Platform. (Watch the tutorials "how to import RNA-Seq data" or "DNA methylation data" for the import operations.) This is a practical case study of analyzing a multi-omics dataset composed of DNA methylation and gene expression data.
This multi-omics dataset is composed of 396 Illumina HumanMethylation450K BeadChip samples (2 Solid Tissue Normal, 394 Primary Tumor) and 406 RNA-Seq samples (32 Solid Tissue Normal, 374 Primary Tumor). Of these, 336 patients have both gene expression and DNA methylation data. That should be enough to see how much the DNA methylation pattern affects gene expression patterns.
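If you want to check the patient overlap outside Subio Platform, a minimal sketch is shown here. It relies on the TCGA barcode layout, where the first 12 characters identify the patient and characters 14-15 give the sample type code (01 = Primary Tumor, 11 = Solid Tissue Normal). The function names and the example barcodes are hypothetical, and restricting the overlap to tumor samples is my assumption.

```python
# Sketch: find patients that have both methylation and RNA-Seq tumor samples.
# Function names are illustrative; only the barcode layout comes from TCGA.

def patient_id(barcode: str) -> str:
    """Return the patient part of a TCGA barcode (first 12 characters)."""
    return barcode[:12]


def sample_type(barcode: str) -> str:
    """Return the two-digit sample type code (01 = tumor, 11 = normal)."""
    return barcode[13:15]


def patients_with_both(methylation_barcodes, rnaseq_barcodes, type_code="01"):
    """Patients with a sample of the given type in both data sets."""
    meth = {patient_id(b) for b in methylation_barcodes
            if sample_type(b) == type_code}
    rna = {patient_id(b) for b in rnaseq_barcodes
           if sample_type(b) == type_code}
    return meth & rna


if __name__ == "__main__":
    # Made-up barcodes, only to show the call.
    meth = ["TCGA-AA-0001-01A-11D-0000-05", "TCGA-AA-0002-01A-11D-0000-05"]
    rna = ["TCGA-AA-0001-01A-11R-0000-13"]
    print(sorted(patients_with_both(meth, rna)))
```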
Let's see the methylation data first. Subio Platform can draw a TSS plot of methylation sites.
The methylation sites designed into the array are mostly very close to a TSS (< 500bp). The scatter-plot version of the TSS plot gives you more information. For example, you can see the average beta values of the normal samples around the TSS.
Beta values of sites located farther than 2000bp from the TSS look stable and close to binary, either methylated or unmethylated. On the other hand, the statuses near the TSS look ambiguous or fragile. Let's see the average of the tumor samples.
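The idea behind the scatter plot can be sketched in a few lines: average the beta values of each site over one group of samples, then group the sites by their distance from the TSS. This is only an illustration, not how Subio Platform implements the plot. The data layout (dictionaries keyed by site ID, signed distances with negative meaning upstream) and the function names are assumptions; the 500bp and 2000bp boundaries are the ones discussed above.

```python
from statistics import mean


def mean_beta_per_site(beta_by_site):
    """Average beta per site over samples, skipping missing values (None)."""
    result = {}
    for site, values in beta_by_site.items():
        observed = [v for v in values if v is not None]
        if observed:  # drop sites with no measured value at all
            result[site] = mean(observed)
    return result


def summarize_by_distance(mean_beta, tss_distance, near=500, far=2000):
    """Group per-site mean beta values by absolute distance from the TSS."""
    bins = {"near": [], "middle": [], "far": []}
    for site, beta in mean_beta.items():
        d = abs(tss_distance[site])
        if d < near:
            bins["near"].append(beta)
        elif d > far:
            bins["far"].append(beta)
        else:
            bins["middle"].append(beta)
    return bins
```

Plotting the values in each bin separately should reproduce the impression described above: near-binary values far from the TSS and intermediate values close to it.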
The distribution of beta values looks similar. It seems to be somewhat different from Normal, but the difference is not very clear from comparing the two charts. So you can plot the difference itself on the TSS plot to visualize how different they are.
Now it is clear that the sites whose methylation status changes are limited to those near the TSS. But the magnitudes of the changes are not big. This chart represents the average of 336 tumor samples, so changes in opposite directions may cancel out and pull the average toward 0. So let's see the difference in a single tumor sample.
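A small worked example shows why the averaged difference can look smaller than the per-sample differences. The numbers are made up for illustration, and the function names are hypothetical.

```python
from statistics import mean


def per_sample_delta(tumor_betas, normal_mean):
    """Difference of each tumor sample from the normal average at one site."""
    return [b - normal_mean for b in tumor_betas]


def averaged_delta(tumor_betas, normal_mean):
    """Difference of the tumor average from the normal average at one site."""
    return mean(tumor_betas) - normal_mean


if __name__ == "__main__":
    # Toy site: one tumor gains methylation, one loses it, two are unchanged.
    tumor = [0.9, 0.1, 0.5, 0.5]
    normal = 0.5
    print(per_sample_delta(tumor, normal))  # individual changes are large
    print(averaged_delta(tumor, normal))    # but they cancel in the average
```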
It shows larger changes, as you would expect. But still, the difference is not like a shift from 0 to 1. Such ambiguous epigenetic modifications seem to be able to alter expression levels. Another piece of information from this chart is that fluctuations between -0.15 and 0.15 are likely to be noise, and changes with magnitudes larger than 0.2 look meaningful in this case. So I filtered out the sites whose fluctuations were smaller than 0.15 in 333 out of 336 tumor samples. About 2/3 of the sites remained, and I used them for the following analysis.
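The filter can be sketched as follows. This is my reading of the rule, not the exact Subio Platform operation: a site is dropped when at least 333 tumor samples show a difference from the normal average smaller than 0.15 in absolute value. The function names and the input layout (a list of per-sample differences for each site) are assumptions.

```python
def keep_site(deltas, threshold=0.15, min_quiet=333):
    """Keep a site unless at least `min_quiet` samples stay below the threshold."""
    quiet = sum(1 for d in deltas if abs(d) < threshold)
    return quiet < min_quiet


def filter_sites(delta_by_site, threshold=0.15, min_quiet=333):
    """Return only the sites that change noticeably in enough samples."""
    return {site: deltas for site, deltas in delta_by_site.items()
            if keep_site(deltas, threshold, min_quiet)}
```

With 336 tumor samples and `min_quiet=333`, a site survives only when at least four samples show a difference of 0.15 or more.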
Calculating anti-correlation between DNA methylation status and neighboring genes' expression
Now let's see how strongly the methylation and gene expression patterns are anti-correlated over the 336 patients. I removed noise from the RNA-Seq data before calculating the correlations. The following chart represents the distribution of correlation coefficients between each gene's expression pattern and the pattern of the average methylation status of the CpG island sites within 500bp of its TSS. They are biased toward negative values (anti-correlation), though most genes are in a weak relationship (around -0.3). Only some are in strong anti-correlation (around -0.7).
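For one gene, the calculation can be sketched like this: average the beta values of its nearby sites per patient, then correlate that pattern with the gene's expression over the same patients. I assume the Pearson coefficient here, since the type of correlation is not stated above; the function names and the input layout are also assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # a constant pattern has no defined correlation
    return sxy / sqrt(sxx * syy)


def average_methylation(site_betas):
    """Per-patient mean over sites; site_betas holds one list per site."""
    return [sum(column) / len(column) for column in zip(*site_betas)]


def gene_methylation_correlation(expression, site_betas):
    """Correlate one gene's expression with its average nearby methylation."""
    return pearson(expression, average_methylation(site_betas))
```

Running `gene_methylation_correlation` for every gene and drawing a histogram of the results gives the kind of distribution described above.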
In contrast, the shape of this distribution is obviously different from that of the correlation coefficients between microRNA expression and their predicted target genes' expression patterns, which looks like a normal distribution. So the histogram above indicates that DNA methylation does suppress the expression levels of downstream genes.
On the other hand, this biased shape fades if I calculate the correlation coefficients between gene expression and the average methylation of sites located 1500-2500bp upstream and outside CpG islands. So only methylation sites near the TSS (<500bp) can have a strong effect on gene expression regulation.
See the patterns of correlation coefficients in the following matrix. It looks like there are two types of effect of DNA methylation on gene expression regulation. The strong effect seems to require being close to the TSS, but not being in a CpG island. On the other hand, the weak effect is specific to island sites within 1500bp upstream of the TSS.
Only 216 genes are under the strong effect of the DNA methylation status (correlation coefficient < -0.6), very few compared to those under the weak effect. But it is remarkable that 53 out of the 216 are ZNF family genes. The following table is the result of GO enrichment analysis, and it says the list is highly enriched with transcription factors. The strong effect might bring about a global alteration in expression levels through these specific transcription factors.
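Selecting the strongly anti-correlated genes and counting the ZNF family members is simple to sketch. For the enrichment itself, a hypergeometric tail probability is one common way to score over-representation; I show it only as an illustration, since I don't describe here which test the GO enrichment tool uses. The function names are hypothetical, and identifying ZNF genes by their symbol prefix is an assumption.

```python
from math import comb


def select_strong(correlations, cutoff=-0.6):
    """Gene symbols whose correlation coefficient is below the cutoff."""
    return sorted(g for g, r in correlations.items() if r < cutoff)


def count_prefix(genes, prefix="ZNF"):
    """Number of gene symbols starting with the given prefix."""
    return sum(1 for g in genes if g.startswith(prefix))


def hypergeom_tail(k, n, K, N):
    """P(X >= k): chance of k or more annotated genes in a list of n,
    drawn from N background genes of which K carry the annotation."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)
```

To apply `hypergeom_tail` to the list above, you would pass k=53 and n=216 together with the number of ZNF genes and the total number of genes in your own background set, which are not given here.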
I applied hierarchical clustering with the transcription factors over stage 1 and 2 patients. The patients were divided into 2 clusters.
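To show what splitting the patients into two clusters means, here is a minimal sketch of agglomerative clustering. The distance metric and linkage are not stated above, so Euclidean distance and average linkage are my assumptions, and the function name is hypothetical. Each patient is a vector of expression values of the transcription factors. The naive loop is fine for a toy example but slow for large inputs.

```python
from math import dist  # Euclidean distance, Python 3.8+


def average_linkage(points, n_clusters=2):
    """Merge the closest clusters until only `n_clusters` remain.

    Returns a list of clusters, each a list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index a, index b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = len(clusters[a]) * len(clusters[b])
                d = sum(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b]) / pairs
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]                          # b > a, so a stays valid
    return clusters


if __name__ == "__main__":
    # Made-up expression vectors for four patients.
    patients = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(average_linkage(patients))
```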
Then I examined the survival curves of the two groups of early-stage patients. You can see the difference in their survival rates in the second year. So the transcription factors may affect the outcomes of early-stage stomach adenocarcinoma. In contrast, they don't seem to affect the outcomes of late-stage patients.
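A survival curve for each group can be estimated with the Kaplan-Meier method, which is the usual choice for this kind of plot; I assume it here for illustration. The function name and the input layout (follow-up time per patient, with 1 for death and 0 for censored) are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival) at event times.

    times:  follow-up time for each patient
    events: 1 if the patient died at that time, 0 if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths + censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

Calling `kaplan_meier` once per cluster and plotting both step curves on the same axes lets you compare the groups at a given time point, such as the second year.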
About the analysis tools
The tools I used for this case study are listed in the "Product" section on the right side. Please visit for more information.
Subio Platform is a free omics data browser and data-sharing platform. Please visit for more information.
If you are not familiar with analysis tools, you can order the analysis service to let us work for you.