An exploratory analysis on TCGA-BLCA

  • Gene Expression
  • High-Throughput Sequencing

Omics data analysis does not follow a single standard workflow. It is a cycle of looking, discovering, and examining. Subio Platform is software built for that cycle.


Look at the data without preconceptions. Large data sets in particular are likely to contain non-linear biases caused by technical artifacts. You cannot assume the data behaves as textbooks describe, so you have to grasp the unique characteristics of each data set.


Even if you only want to know which genes are differentially expressed between normal and tumor samples, you cannot assume that all normal samples or all tumor samples are uniform. You usually find subgroups within each. So you must decide how to handle that variation and what to extract from the data.


Extracting differentially expressed genes is easy, but interpreting the result is hard. Visualizing many sample parameters helps you find possible correlations. You then form hypotheses to be tested in subsequent experiments. This step is partly speculative, because the data does not tell you everything.

The best way to learn about omics data analysis.

The following movie shows an example using TCGA-BLCA mRNA-Seq data. It is not a recipe for making a discovery, though. It is simply a series of trials and errors on a real data set.

You can easily import many gene expression data sets for a variety of cancers from the GDC web site (TCGA and TARGET projects) into Subio Platform. These are among the best real data sets for learning omics data analysis, because you will find that real data is not what textbooks or statistical models assume. Look at the data with your own eyes, and try whatever approaches you like.

Exploratory Analysis of TCGA-BLCA RNA Seq data

00:10 - The memory setting to handle large data sets.

00:40 - Importing samples from GDC data center.

01:30 - Creating a series, and setting GDC sample annotations as parameters.

03:50 - Setting up normalization while viewing data distribution patterns.

06:20 - Filtering.

08:20 - PCA, and marking samples in a cluster.

08:50 - Visualizing parameters to help interpret the result.

13:10 - Examining data distribution patterns of artificial effects.

18:20 - Excluding some samples from the analysis.

19:00 - Defining subgroups of tumor samples.

21:10 - Extracting differentially expressed genes between the subgroups.

22:00 - Creating a new series of Normal-Tumor paired samples. 

24:30 - Making tumor/normal ratios to cancel individual differences.

27:00 - Examining "tumorization" effect on the expression profile.

27:30 - Defining 2 types of "tumorization" from a result of PCA.

29:10 - Extracting differentially expressed genes between the "tumorization" types.

30:00 - Comparing results for further analysis.
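
The movie performs these steps in the Subio Platform GUI. For readers who want to see the arithmetic behind a few of them, here is a minimal Python sketch of normalization, filtering, PCA, and paired tumor/normal ratios. It is not how Subio Platform implements them. The data is synthetic, and the function names, the median-centering normalization, and the filter threshold are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an expression matrix (NOT real TCGA-BLCA data):
# 500 genes x 12 samples; columns 0-5 are normal, 6-11 the paired tumors.
n_genes, n_patients = 500, 6
base = rng.normal(5.0, 2.0, size=(n_genes, 1))                   # gene baseline
patient = rng.normal(0.0, 1.0, size=(n_genes, n_patients))       # individual differences
tumor_shift = np.zeros((n_genes, 1))
tumor_shift[:20] = 2.0                                           # hypothetical tumor-up genes

log_normal = base + patient + rng.normal(0.0, 0.2, size=(n_genes, n_patients))
log_tumor = base + patient + tumor_shift + rng.normal(0.0, 0.2, size=(n_genes, n_patients))
values = 2.0 ** np.hstack([log_normal, log_tumor])               # back to linear scale

normal_idx = np.arange(n_patients)
tumor_idx = np.arange(n_patients, 2 * n_patients)


def normalize(x):
    """log2(x + 1), then shift each sample (column) so its median is 0."""
    logged = np.log2(x + 1.0)
    return logged - np.median(logged, axis=0, keepdims=True)


def filter_genes(expr, min_range=1.0):
    """Boolean mask of genes whose max - min across samples is >= min_range."""
    return (expr.max(axis=1) - expr.min(axis=1)) >= min_range


def pca_scores(expr, n_components=2):
    """Coordinates of each sample on the top principal components."""
    centered = expr - expr.mean(axis=1, keepdims=True)           # center each gene
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return (vt[:n_components] * s[:n_components, None]).T       # samples x components


def paired_log_ratio(expr, tumor_cols, normal_cols):
    """log2(tumor / normal) per patient; cancels the shared individual effect."""
    return expr[:, tumor_cols] - expr[:, normal_cols]


expr = normalize(values)
keep = filter_genes(expr)
scores = pca_scores(expr[keep])
ratios = paired_log_ratio(expr, tumor_idx, normal_idx)

print("genes kept by filter:", int(keep.sum()), "of", n_genes)
print("PCA score matrix shape:", scores.shape)
print("mean log ratio, shifted genes:", round(float(ratios[:20].mean()), 2))
print("mean log ratio, other genes:", round(float(ratios[20:].mean()), 2))
```

Subtracting the normal column from the paired tumor column on the log scale removes the per-patient term that both samples share, which is why the ratio series isolates the "tumorization" effect more cleanly than the raw tumor values do.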

Please take the Online Training for full instruction on this data analysis.

Related Topics

Analyzing a large data set requires a large amount of memory (RAM). If memory errors happen often, you may need to install more RAM, but there are also workarounds for working with limited memory.

How to analyze a large data set with a limited memory.