Chapter 101 The Relationship between Biology and Big Data

Foreword: We need to combine mathematics, computer science and biology to better guide biological research.

Biological research is undergoing a paradigm shift. Physics offers the precedent: Kepler distilled his three laws from his mentor Tycho's astronomical data; Newton abstracted them further into the law of universal gravitation; and eventually the Hamiltonian formalism followed. By constructing elaborate mathematical structures, we can explain many complex behaviors from a limited set of assumptions.

This is how physics came to be, and now it is biology's turn. But unlike the earlier route of abstracting mathematical structures, our model development is more data-driven: we use a series of attribute definitions to map observations into a high-dimensional space, where classification, clustering, dimensionality reduction, and regression analysis can then be carried out.

In other words, we use algorithms such as machine learning to make sense of complex data and to extract patterns with biological significance. Mathematics is still being applied, but we no longer explain the biology directly through it.
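
As a minimal sketch of this workflow, the snippet below clusters synthetic "cells" in a reduced gene-expression space using scikit-learn; the matrix sizes, cluster count, and data are illustrative assumptions, not a real pipeline.

```python
# Minimal sketch: cluster cells in a reduced gene-expression space.
# Synthetic data stands in for a real cells-by-genes expression matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 300 cells x 2000 genes; the first 150 cells form an artificial
# population with elevated expression of the first 50 genes.
X = rng.normal(size=(300, 2000))
X[:150, :50] += 3.0

# Reduce dimensionality first, then cluster at that level.
X_low = PCA(n_components=10).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_low)
print(labels[:5], labels[-5:])  # the two populations should separate
```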

Unlike the reductionist route of closed mathematical formulas, this is really a systems way of thinking. At the level of big data, we can further classify biological processes: decompose the biological entity of the cell into processes such as differentiation, proliferation, apoptosis, and division; find the genes, proteins, and signaling pathways specifically expressed in each; and build complex connections on top of these definable objects.

The idea from linear algebra is to treat these classifications as linearly independent basis elements, so that specific linear combinations correspond to various complex biological processes. We store them in the form of matrices, and we can then understand dynamic biological processes as matrix transformations. For example, Shinya Yamanaka transferred four transcription factors into fibroblasts so that they de-differentiated into pluripotent stem cells. Written as AX = B: the matrix representing the fibroblast gene expression profile, multiplied by a transformation representing high expression of the four transcription factors, yields a matrix with a certain similarity to the one representing embryonic stem cells, i.e., induced pluripotent stem (iPS) cells.
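
A toy numerical version of this picture, assuming a least-squares solve over made-up profile matrices (the dimensions and data are illustrative, not a real reprogramming model):

```python
# Toy AX = B: recover the transformation that maps fibroblast-like
# expression profiles onto iPS-like ones.
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_genes = 200, 50
A = rng.random((n_samples, n_genes))       # rows: cells; cols: genes ("fibroblast" state)
X_true = np.eye(n_genes) + 0.1 * rng.random((n_genes, n_genes))  # hidden transformation
B = A @ X_true                             # transformed profiles ("iPS-like" state)

# Recover the transformation in the least-squares sense: min ||AX - B||.
X_hat, *_ = np.linalg.lstsq(A, B, rcond=None)
print(np.allclose(X_hat, X_true))          # True: A has full column rank
```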

In principle, we could understand the possible mechanisms of living things in this way: finding the transformation matrix between cells at different stages, or even between different cell types, is itself the mechanism we are after.

It is just that constructing such a matrix from all genes involves too many dimensions and too much computational complexity, so we need to shrink it. In general, we restrict attention to a limited number of specific objects (such as biomarker molecules whose expression moves significantly up or down), then carry out enrichment analysis to integrate them into known signaling pathways (the KEGG and GO databases), from which we can connect to higher-level biological processes such as differentiation, proliferation, apoptosis, and division, and finally rise to the cellular level, or even the level of organ health.
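
Enrichment is commonly quantified with a hypergeometric (one-sided Fisher) test; the sketch below uses scipy with made-up counts, and the numbers are placeholders rather than real KEGG/GO statistics.

```python
# Over-representation test: are the differentially expressed genes
# enriched in a given pathway's gene set?
from scipy.stats import hypergeom

N = 20000   # genes in the background universe (assumed)
K = 150     # genes annotated to the pathway (assumed)
n = 500     # differentially expressed genes (assumed)
k = 20      # overlap between the two sets (assumed)

# P(overlap >= k) when drawing n genes at random without replacement.
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"enrichment p-value: {p_value:.3g}")
```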

This mirrors the relationship between a function and its antiderivative in the fundamental theorem of calculus: a simple operation at a high level can be equivalent to complex operations at a low level. At the cellular level, cell division looks like a single simple event, yet underneath it involves a great many signaling pathways. We can therefore add up the gene expression changes at the bottom level and lift them to high-level biological processes such as differentiation, proliferation, apoptosis, and division.
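
One simple instance of this "adding up" is a process-level score computed as a matrix-vector product over gene-level changes; the memberships and values below are made up for illustration.

```python
# Lift gene-level log-fold-changes to process-level scores by summing
# along an annotation matrix (processes x genes): a simple linear lift.
import numpy as np

log_fold_change = np.array([2.0, -1.5, 0.3, 1.1, -0.7])  # genes g1..g5

# Membership matrix: 1 if the gene participates in the process (made up).
#                 g1  g2  g3  g4  g5
M = np.array([[1,  0,  1,  1,  0],    # proliferation
              [0,  1,  0,  0,  1]])   # apoptosis

scores = M @ log_fold_change
print(dict(zip(["proliferation", "apoptosis"], scores)))  # 3.4 and -2.2
```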

We can represent these processes by constructing certain continuous functions; in essence, the changes in these matrices are mappings of those functions.

And if we assume that these processes change continuously (the molecular level sits above the quantum scale, so we need not account for quantum discreteness), we can further expand these functions as sums of series. The most classic example is the Fourier series, which decomposes a periodic function into a sum of orthogonal trigonometric functions with corresponding coefficients.
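
For reference, the standard textbook expansion for a function of period 2π:

$$
f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty}\bigl(a_n \cos nx + b_n \sin nx\bigr),
\qquad
a_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\cos nx \, dx,
\quad
b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\sin nx \, dx.
$$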

There is an implicit assumption here that the function is integrable so that the series can converge to the original function.

This is the idea of first establishing existence in principle and then finding a concrete form. We assume that the relationships between changes under infinite subdivision are fixed, such as the up- and down-regulation of expression between genes (the change is the derivative, dx = dA).

We posit that the expression relationships among these genes can be represented by some function (assuming it exists and can be written as a Fourier series); we can then determine the specific coefficients from the properties the function satisfies, and if the series converges, we can consider the relationship successfully constructed.

If we can decompose it into a sum of trigonometric functions, we have extracted a more essential property, namely frequency, and a selective combination of frequencies (the frequency domain) then serves as a transformation of the original function (the time domain).
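
Numerically, this time-domain-to-frequency-domain switch is what the discrete Fourier transform computes; a sketch on a synthetic periodic signal (the 5 Hz and 12 Hz components are arbitrary choices):

```python
# Decompose a synthetic periodic signal into its frequency components.
import numpy as np

t = np.linspace(0, 1, 1000, endpoint=False)   # 1 second at 1 kHz sampling
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

spectrum = np.fft.rfft(signal)                # time domain -> frequency domain
freqs = np.fft.rfftfreq(len(t), d=t[1] - t[0])

# The two strongest components recover the input frequencies.
top = freqs[np.argsort(np.abs(spectrum))[-2:]]
print(sorted(top))                            # [5.0, 12.0]
```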

These frequencies can correspond to signaling pathways in living things, and this decomposition can be considered a reductionist idea.

We can then make the connection between calculus and linear algebra explicit: a function expanded as a series is a linear combination of basis functions, so the series expansion is itself a linear-algebraic object.

Ideally, we would represent cells at the level of gene expression, then perform various complex transformation operations at that level to represent changes in biological processes, providing a basis for digging out biologically meaningful changes.

But this rests on many assumptions. In reality, gene expression is regulated in many ways: the structure of the gene itself (alternating introns and exons, plus regulatory regions such as CCAAT boxes, TATA boxes, promoters, and enhancers); RNA and protein processing along the central dogma (introns are spliced out so that the exons form a continuous coding sequence; ribosomal translation of mRNA yields proteins that require further modification); and chromatin remodelers and histone modifications, i.e., epigenetics (histone-free regions facilitate transcription factor binding and thereby switch on transcription). At the mathematical level, each of these layers can be understood as multiplying by one new matrix after another, applying a new transformation each time.

Gene expression networks are therefore complex, with multiple layers of regulation, such as nucleosome-level control of gene expression (histone modifications such as H3K4me3 and H3K27me3; nucleosome positioning by chromatin remodelers, the DNA sequence, and the transcription apparatus; histone variants such as H2A.Z and H3.3). These epigenetic modifications can be seen as layered matrix transformations.
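
In that picture, each regulatory layer contributes one more matrix to a product; a minimal sketch, where the layer names and per-gene factors are illustrative assumptions:

```python
# Compose regulatory layers as successive matrix transformations of a
# baseline expression vector (diagonal matrices = per-gene scaling).
import numpy as np

baseline = np.array([1.0, 1.0, 1.0])        # three genes at baseline expression

chromatin = np.diag([1.0, 0.2, 1.0])        # closed chromatin silences gene 2
histone_marks = np.diag([1.5, 1.0, 0.8])    # activating / repressive marks
splicing = np.diag([1.0, 1.0, 0.5])         # post-transcriptional attenuation

# Layered regulation = product of the transformations applied in order.
expressed = splicing @ histone_marks @ chromatin @ baseline
print(expressed)                            # [1.5 0.2 0.4]
```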

On this basis, more detailed regulatory mechanisms can be explored, such as the recognition role of certain sequences.

Generating this kind of large-scale data requires dedicated technologies, and fortunately such technologies already exist: sequencers, gene chips (microarrays), and so on.

This lets us focus more on pattern mining in the data.