Chapter 103: Big Data Analysis and Data-Driven Discovery

Big data analysis and data-driven discovery together form a comprehensive application of computer science, statistics, and mathematics.

Changes in the scientific paradigm: (1) experiments or measurements, (2) analytical theory, (3) numerical simulation, (4) data-driven discovery. Biology is still stuck between (1) and (2). The data generated by astronomical observation made Kepler's three laws and the law of gravitation possible; in principle, as long as the cost of a single data point is low enough, we can generate large amounts of data for pattern recognition. Complex phenomena require complex data to understand, and the patterns extracted approximate reality in much the way an axiomatic system is built up. My ambition here is modest: it is enough to be able to use biological information to explain life. Specific applications include the discovery of disease biomarkers, the discovery of disease-related genes, and so on.

Identify the limitations of current work and break the challenge down to a level that can be solved, propose possible solutions, and finally integrate them: this is the divide-and-conquer of computer science. Repeating this process, like iteration in mathematics, continually broadens the boundaries of human cognition. Machine learning algorithms likewise require feedback of various kinds to further adjust their parameters until they converge to an optimal solution (e.g., gradient descent).
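As a concrete illustration of this feedback loop, a minimal gradient-descent sketch (the objective function, learning rate, and stopping threshold are my own illustrative choices, not from the lecture):

```python
# Minimize f(w) = (w - 3)^2 by repeatedly stepping against the gradient.

def grad(w):
    return 2.0 * (w - 3.0)   # derivative of (w - 3)^2

w = 0.0          # initial parameter
lr = 0.1         # learning rate
for step in range(100):
    g = grad(w)
    if abs(g) < 1e-8:        # stopping criterion: gradient near zero
        break
    w -= lr * g              # feedback: move against the gradient

print(w)  # converges to ~3.0, the minimizer
```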

Scientific discovery workflow: collection, processing, management, analysis. (1) Data collection, i.e., experiments or observations. (2) Data collation: organizing the data in a well-defined form, such as a database. (3) Data mining: constructing various correlations. I personally think a mechanism like Bayesian inference can be used here to build high-probability correlations out of combinations of related objects; the construction of high-dimensional relations can be understood as the accumulation of underlying relations (as in the fundamental theorem of calculus), and the spirit of analytical mathematics is that deterministic relationships always exist within these complex objects (fixed points, as in the intermediate value theorem). (4) Data comprehension: integration into a specific context. (5) New knowledge.
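As a sketch of the Bayesian-inference idea in step (3), here is a toy update in which repeated co-occurrence observations raise the probability that two objects are genuinely correlated (all probabilities are invented for illustration):

```python
prior = 0.5            # P(correlated) before seeing data
p_obs_if_corr = 0.8    # P(co-occurrence | correlated)
p_obs_if_not = 0.3     # P(co-occurrence | not correlated)

observations = [True, True, False, True]  # co-occurrence events
p = prior
for obs in observations:
    like_corr = p_obs_if_corr if obs else 1 - p_obs_if_corr
    like_not = p_obs_if_not if obs else 1 - p_obs_if_not
    # Bayes' rule: posterior is proportional to likelihood x prior
    p = like_corr * p / (like_corr * p + like_not * (1 - p))

print(f"P(correlated | data) = {p:.3f}")
```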

Data mining methods include the major machine learning algorithms: unsupervised learning (e.g., clustering and dimensionality reduction), supervised learning (e.g., classification and regression), and other approaches such as neural networks and, going further, deep learning.
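For instance, a minimal clustering sketch with scikit-learn's KMeans, assuming synthetic two-group data (the cluster count and sample sizes are my own choices):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic groups of 2-D points around different centers
data = np.vstack([rng.normal(0, 0.5, (50, 2)),
                  rng.normal(3, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.cluster_centers_)   # recovered centers near (0,0) and (3,3)
```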

Lecture 1: Pattern Recognition

Everything has a reason: this is our working belief, which can be understood as an assumption that relationships exist. The task is then to look for these specific relationships, which is what we call pattern recognition. There are many concrete methods, such as classification and regression, with more specific implementations such as the nearest-neighbor method and KNN.
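A minimal KNN sketch with scikit-learn, on a toy one-dimensional training set of my own invention:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy training set: points on a line, labeled by which end they are near
X_train = [[0.0], [0.5], [1.0], [4.0], [4.5], [5.0]]
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
# Each query point takes the majority label of its 3 nearest neighbors
print(knn.predict([[0.8], [4.2]]))  # -> [0 1]
```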

Pattern recognition is really the construction of specific functions, such as the regression equation y = ax + b. After assuming that such a function exists, construction means determining its parameters so that the comparison with reality maintains a relatively low error (an error below a certain threshold is treated as equivalence), evaluated with various indicators such as accuracy, recall, and so on.
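A minimal sketch of this parameter determination: fitting y = ax + b to noisy synthetic data by least squares (the true a = 2, b = 1 and the noise level are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.shape)  # noisy observations

a, b = np.polyfit(x, y, deg=1)            # determine the parameters
mse = np.mean((a * x + b - y) ** 2)       # error against the real situation
print(f"a={a:.2f}, b={b:.2f}, MSE={mse:.3f}")
```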

The kernel method is, in essence, the choice of a particular function to meet a particular need: a kernel computes inner products in an implicit high-dimensional feature space, so that a linear method applied there can capture nonlinear patterns in the original space.
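A minimal sketch of one common kernel, the RBF (Gaussian) kernel, which measures similarity as if the points were mapped into an infinite-dimensional feature space, without ever computing that mapping explicitly (the example points are my own):

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

print(rbf_kernel([0, 0], [0, 0]))   # 1.0: identical points, maximal similarity
print(rbf_kernel([0, 0], [3, 4]))   # near 0: distant points, low similarity
```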

We can model real-world situations in a sufficiently high-dimensional space, where each data point has multiple attributes. In theory the space could be infinite-dimensional, modeling everything, but this is not practical: we lack the computing resources to support it, and the number and importance of the attributes we consider meaningful follow a power-law distribution, i.e., only a small number of metrics carry most of the importance. Dimensionality reduction is therefore essential. It is similar to decomposition over a linearly independent basis in linear algebra, storing the data with minimal loss; for example, PCA (principal component analysis). It can also be understood as the extraction of eigenvalues/eigenvectors.
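A minimal PCA sketch with scikit-learn, assuming synthetic data in which only two of ten dimensions carry real variance (all sizes are my own illustrative choices):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 samples in 10 dimensions, but almost all variance lives in 2 of them
data = rng.normal(0, 0.05, (100, 10))
data[:, 0] += rng.normal(0, 3.0, 100)
data[:, 1] += rng.normal(0, 1.5, 100)

pca = PCA(n_components=2).fit(data)
print(pca.explained_variance_ratio_)  # two components capture nearly all variance
reduced = pca.transform(data)          # 10-D points stored as 2-D with minimal loss
print(reduced.shape)                   # (100, 2)
```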

Running an algorithm requires selecting and computing specific indicators, which must be quantifiable so that the parameters can be updated and the algorithm can converge and halt. For example, face recognition infers which regions are most likely to be the target region by computing the relevant indicators.

Lecture 2: Introduction to Machine Learning

Machine learning has a wide range of applications. Essentially, training searches for a function or classifier that can generalize to unseen data.

Artificial intelligence as an automatic programming machine: it began with hard programming, i.e., encoding all the rules by hand; once the impossibility of that was recognized, soft programming was explored instead, learning from data. In this formulation, for a specific task T there is a performance measure P that improves continuously with experience E. Examples include spam identification, medical diagnosis, and advertising recommendation.
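A minimal sketch of spam identification in this T/P/E framing: a bag-of-words naive Bayes classifier in scikit-learn (the tiny corpus and its labels are my own toy examples, not real mail):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win free money now", "limited offer click here",
         "meeting moved to tuesday", "please review the attached report"]
labels = ["spam", "spam", "ham", "ham"]  # experience E

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)                 # task T: classify messages

print(model.predict(["free offer now", "see you at the meeting"]))
# -> ['spam' 'ham']; accuracy on held-out mail would be the performance measure P
```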

Big data, the five Vs: volume (large amount of data), velocity (speed of generation), variety, veracity, value; the aim is to extract knowledge from data.

Different problems call for different algorithms; there is no one-size-fits-all algorithm, and the trade-offs must be weighed comprehensively, although deep learning has shown some potential for generality. Our goal is to discover subtle effects in organisms, such as the interplay of histone acetylation and transcription factors among the many factors influencing transcription, and to explore possible mechanisms, such as the roles of related proteins.

Lecture 3: Data – Data Model – Database

Data - information - knowledge - principle: a pyramid structure, decreasing in quantity from bottom to top but increasing in importance.

Lecture 4: Neural Networks

Neural networks play an important role in many fields, such as speech and image recognition, recommendation systems, and social networks; we pay particular attention to their biological applications, such as the analysis of gene-expression microarray data. The combination of data, model, and computing power lets us mine statistically meaningful patterns that can be correlated with specific biological mechanisms. Probe-style inputs can be used to find related groups in a library.
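A minimal neural-network sketch on expression-like data: an MLP classifier in scikit-learn (the matrix sizes and the label rule are my own illustrative choices, not real microarray data):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_samples, n_genes = 200, 50
X = rng.normal(0, 1, (n_samples, n_genes))   # expression-like matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # label driven by two "genes"

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
mlp.fit(X[:150], y[:150])                    # train on the first 150 samples
print(mlp.score(X[150:], y[150:]))           # held-out accuracy
```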

Training neural networks requires large-scale matrix operations of high complexity, so optimization measures are needed to accelerate them: low-rank approximation, network pruning, and scalar or vector quantization.

Matrix factorization reduces the amount of computation; sparse matrices reduce storage; fixed-point arithmetic saves both storage and computation time. Each of these is an existence assumption plus a transformation, analogous to supplying a prior probability, which lets the method converge to the target optimum more quickly.
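A minimal sketch of low-rank approximation via truncated SVD, one of the acceleration measures above (the matrix size and rank are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a 100x100 matrix that is genuinely rank-5 plus a little noise
W = rng.normal(0, 1, (100, 5)) @ rng.normal(0, 1, (5, 100))
W += rng.normal(0, 0.01, W.shape)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 5
W_approx = U[:, :k] * s[:k] @ Vt[:k]   # rank-k factorization U_k S_k V_k^T

# Storage drops from 100*100 entries to 2*(100*5)+5, with tiny error
err = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
print(f"relative error: {err:.4f}")
```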

Deep learning: feature extraction -> metric learning -> classification.