Between Statistics and Biology: Thoughts After Reading Simply Statistics
Recently, I came across Simply Statistics, a blog about statistics and data analysis, and read several of its articles. What struck me was not so much the statistical methods themselves as the way the authors think about data analysis.
I have not learned statistics or data analysis systematically as a statistician would. For me, the biological question came first. How should I look at gene expression data? Are the differences between samples really biological differences? How far can I trust a list of extracted genes? I gained experience by trying analysis tools and methods in order to deal with these questions.
For that reason, when I read discussions that focus only on analytical theory or on which method is superior, I find it hard to stay interested in that part alone. What matters to me is not how theoretically elegant a method is, but what it lets me see of the biological problem in front of me, and what it may make me overlook.
On the other hand, for people who specialize in statistics or data science, the biological background of an individual dataset may not always be the central concern. From the statistical side, the important questions may be how to extract information from data, what kinds of inference are valid, and what kinds of visualization or analysis are likely to lead to misunderstanding.
Then how can we bridge the distance between people who look at data from the statistical side and people who enter data analysis from the biological side? After reading Simply Statistics, I felt that the key may be to share how we formulate questions, how we look at data, and how we check the results.
A beautiful plot does not necessarily mean a good analysis
One of the articles that left a particularly strong impression on me was about UMAP plots.
UMAP is a very useful method for visualizing high-dimensional data such as single-cell RNA-Seq data. However, even if clear clusters appear in a two-dimensional plot, that does not mean that the distances between them or their shapes can be read directly as biological relationships, or that each cluster corresponds to a biological entity.
This issue is not limited to UMAP. It also applies broadly to RNA-Seq data analysis. Samples may separate in PCA. Groups may appear in clustering. A heatmap may show a clean pattern. These plots are useful as starting points for analysis, but they are not enough to draw conclusions by themselves.
What matters is understanding what the plot represents. Could normalization or filtering have affected it? Could variation in the low-count region or batch effects be related to the structure we see? The important thing is not simply to make plots, but to be able to read them. In this sense, I feel uncomfortable with the tendency to think that using methods such as UMAP or t-SNE means that we have understood complex data.
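As a minimal sketch of this kind of check, the following Python snippet compares two samples with and without a low-count filter. The gene names, the counts, the log2(count + 1) transform, and the threshold of 10 reads are all assumptions chosen for illustration, not recommendations. The point is only that the same pair of samples can look more or less similar depending on whether the low-count region is included.

```python
import math

# Hypothetical raw read counts for two samples: gene -> (sample_1, sample_2).
# These numbers are invented for illustration only.
TOY_COUNTS = {
    "geneA": (1200, 1150),
    "geneB": (800, 860),
    "geneC": (300, 280),
    "geneD": (95, 110),
    "geneE": (0, 7),
    "geneF": (5, 0),
    "geneG": (1, 6),
    "geneH": (4, 0),
}


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def sample_correlation(counts, min_count=0):
    """Correlate log2(count + 1) values between the two samples.

    Only genes with at least `min_count` reads in both samples are kept.
    Returns (number of genes kept, Pearson correlation).
    """
    kept = [pair for pair in counts.values() if min(pair) >= min_count]
    xs = [math.log2(a + 1) for a, _ in kept]
    ys = [math.log2(b + 1) for _, b in kept]
    return len(kept), pearson(xs, ys)


if __name__ == "__main__":
    # Compare the unfiltered view with a filtered one.
    for threshold in (0, 10):
        n_genes, r = sample_correlation(TOY_COUNTS, min_count=threshold)
        print(f"min_count={threshold}: {n_genes} genes, r={r:.3f}")
```

In this toy data the correlation is higher once the low-count genes are removed, which is a reminder that part of the structure in a PCA or clustering plot may come from the noisy low-count region rather than from biology.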
Failure in analysis can become an entrance to understanding data
Another article discusses failure in data analysis. Here, failure does not simply mean that the code does not run. When the result is different from what we expected, it often indicates that we have not yet sufficiently understood the data, the analytical method, or the underlying science.
In practical RNA-Seq data analysis, this happens often. Samples from the same condition may separate in PCA. edgeR, DESeq2, and a t-test may extract very different gene lists. The result of GO analysis may not match the biological interpretation we expected.
At that point, the important thing is not to decide immediately which method is correct. The important thing is to ask why the result appeared that way. By checking the data distribution, relationships among samples, behavior of low-expression genes, outlier samples, batch effects, and the appearance after normalization, the meaning of the analytical result gradually becomes clearer.
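One simple first step in asking why gene lists differ is to quantify how much they overlap and which genes are specific to each method. The sketch below assumes the gene lists have already been exported from each tool; the gene identifiers and list contents are invented, and the method names are used only as labels. It does not call edgeR or DESeq2.

```python
from itertools import combinations

# Hypothetical gene lists; the contents are invented for illustration.
# In practice each set would be loaded from the corresponding tool's output.
GENE_LISTS = {
    "edgeR": {"g1", "g2", "g3", "g4", "g5"},
    "DESeq2": {"g1", "g2", "g3", "g4", "g6"},
    "t-test": {"g1", "g2", "g7", "g8"},
}


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_overlap(gene_lists):
    """Jaccard index for every pair of methods, keyed by sorted method names."""
    return {
        (m1, m2): jaccard(gene_lists[m1], gene_lists[m2])
        for m1, m2 in combinations(sorted(gene_lists), 2)
    }


def shared_genes(gene_lists):
    """Genes reported by every method."""
    return set.intersection(*gene_lists.values())


def method_specific_genes(gene_lists):
    """Genes reported by exactly one method, grouped by that method."""
    specific = {}
    for method, genes in gene_lists.items():
        others = set().union(
            *(g for m, g in gene_lists.items() if m != method)
        )
        specific[method] = genes - others
    return specific


if __name__ == "__main__":
    for pair, score in pairwise_overlap(GENE_LISTS).items():
        print(pair, round(score, 2))
    print("shared by all:", sorted(shared_genes(GENE_LISTS)))
    print("method-specific:", method_specific_genes(GENE_LISTS))
```

The method-specific genes are often the most informative ones to inspect next, for example by checking whether they are low-expression genes or are driven by a single outlier sample.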
When the result is not what we expected, it can feel like a failure. However, that discomfort may point to what should be checked next. Instead of simply discarding an analysis that did not go well, asking why the result came out that way can become an entrance to understanding the data.
Reproducibility cannot be maintained simply by preserving raw data
I was also impressed by an article about reproducibility. Public databases such as GEO are based on the idea that storing and sharing raw data supports scientific reproducibility. This idea is extremely important. Without the original data, it is impossible to reanalyze the study later from a different perspective.
However, when I work with old GEO datasets or past microarray data, I often feel that making raw data public and making the analysis realistically checkable are not the same thing.
Even if FASTQ files or CEL files remain, the tools and execution environments needed to process them are gradually lost over time. In academia, there is generally an incentive to develop tools, but less incentive for long-term maintenance, and securing funds for maintenance is often difficult. In addition, if we do not know the preprocessing, filtering, or interpretive decisions made at the time, it is not easy to reproduce the meaning of the results shown in the paper.
It may be possible to redo the analysis from raw data, but if the amount of work required is too large, in practice almost no one will check that far. Even if a mechanism for reproducibility exists in a formal sense, it does not function well in practice unless the results can be checked within a realistic amount of time and cost.
In that sense, what matters in practice is not only whether we can strictly reproduce the same analysis. It is also whether we can perform the necessary level of checking within realistic limits of time and cost. It is important to preserve, in a traceable form, which data were used, how they were preprocessed, under what conditions they were visualized, and what kinds of judgments were made.
We designed the SSA file mechanism based on this idea. By preserving not only the analytical results, but also the data, visualizations, analysis settings, and interpretive process in a form that can be easily checked later, we aim to make it possible to perform the necessary level of rechecking and reinterpretation.
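The general idea of recording data, settings, and judgments together can be illustrated with a small generic sketch. This is not the SSA file format, whose layout is not described here. The field names, the setting keys, and the use of a JSON manifest with SHA-256 checksums are all assumptions made for the example.

```python
import hashlib
import json


def build_manifest(data_bytes, settings, notes):
    """Bundle a checksum of the input data with the settings and judgments.

    `settings` is a dict of analysis parameters and `notes` is a list of
    free-text interpretive decisions; both structures are hypothetical.
    """
    return {
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "settings": settings,
        "notes": notes,
    }


def manifest_fingerprint(manifest):
    """Stable fingerprint of a manifest, independent of dict key order."""
    canonical = json.dumps(manifest, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Stand-in for the contents of a count table; a real file would be
    # read in binary mode instead.
    raw_data = b"gene\tsample_1\tsample_2\ngeneA\t1200\t1150\n"
    manifest = build_manifest(
        raw_data,
        settings={"normalization": "log2(count + 1)", "min_count": 10},
        notes=["Excluded one sample as a suspected outlier after PCA."],
    )
    print(json.dumps(manifest, indent=2, sort_keys=True))
    print("fingerprint:", manifest_fingerprint(manifest))
```

If either the data or any setting changes, the fingerprint changes, so a later reader can tell whether they are looking at the same analysis without rerunning it from raw data.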
Good analysis starts with making better questions
An article about John Tukey introduces the idea that the purpose of data analysis is not to produce the correct answer from the beginning, but to develop better questions.
This is also a very important idea in omics data analysis, including RNA-Seq and microarray data analysis. When we first look at RNA-Seq data, we do not always have a clear question. Which samples should be compared? Which gene groups should be examined? Should we perform differential expression analysis, or should we first check data quality? How should we interpret the result of GO analysis?
We need to revise the question itself while looking at the data. Even if the initial question is vague, checking PCA, clustering, heatmaps, gene lists, and expression patterns gradually clarifies what should be examined next.
Data analysis is not the work of following a procedure that was fixed at the beginning. We look at the data, slightly change the question, look again, revise the interpretation, and, if necessary, check it with another method. Through this back-and-forth process, biological meaning gradually becomes visible.
What bridges statistics and biology?
People who look at data from the statistical side and people who enter data analysis from the biological side may look at the same data but focus on different things.
From the statistical side, the validity of inference, assumptions of models, limitations of visualization, reproducibility, and analytical methods that are easily misunderstood become important. From the biological side, what matters is how the difference relates to cells, tissues, pathways, or biological phenomena.
To bridge this gap, it is not necessary for statisticians to learn every biological term, nor is it necessary for biologists to understand every statistical theory. What is needed, I think, is to share how we formulate questions, how we look at data, and how we check the results.
At present, the web and books are full of discussions about statistical models and the validity of analytical methods. On the other hand, I feel that not enough is said about how the result connects to biological questions, where researchers become uncertain, and what they check as they move toward interpretation. This gap may be giving people who are just beginning to learn data analysis the biased impression that data analysis is mainly about choosing a method and applying it correctly.
To fill this missing part, neither the language of statistics alone nor the language of biology alone is enough. We need a point of contact where we can look at the data together and ask: “What does this difference mean?” “How far can we trust this result?” “What should we check next?”
From producing analysis results to being able to read them
In Subio Platform, we also place importance on understanding data while looking at it, rather than simply accepting analytical results as they are.
RNA-Seq and microarray data analysis involve many steps, including normalization, filtering, PCA, clustering, differential expression analysis, GO analysis, and pathway analysis. At each step, the way the data appears can change.
For that reason, analysis software should not only produce calculation results. It should provide an environment in which users can check data distributions, relationships among samples, gene clusters, expression patterns, and changes in results when analysis settings are modified, so that they can make their own judgments.
Reading Simply Statistics reminded me that what truly matters in data analysis is not only which method was used, but how the result was checked and how it can be explained.
I do not think the distance between statistics and biology can be bridged by one side completely adapting to the other. What matters is to look at the same data, formulate questions, check results, and share interpretations together.
I found Simply Statistics to be a very useful site for learning this way of thinking about data analysis. Even from the perspective of someone who is not a specialist in statistics, it offers an opportunity to reconsider data analysis not as the act of applying methods, but as a process of looking at data, formulating questions, and checking results.
