Chapter 84 Machine Learning Notes for Network Implementation

The new idea is to use the computer's underlying computation to approach human intelligence and to build a growable system with local self-organizing ability, so that it can traverse up to a sufficiently high-dimensional level. This kind of probabilistic operation is consistent with the earlier development of calculus: Bayesian computation of probabilities, applied iteratively, corresponds to summing the infinitesimal quantities of calculus and thereby builds a higher-dimensional structure, which I think of as something like an infinite-dimensional network. We can therefore use the various properties of networks to guide the construction of specific algorithms and computations, so as to finally recognize large-scale data and mine the meaningful information we need. Probability and statistics are at the core, and the sequence operations we have been hoping to construct are a kind of information processing. Given the limits of individual ability, we can only hope to use existing successful experience in the real world as a seed sequence, match it against the ideal infinite-dimensional sequence, find its derived sequences, and establish a definite correspondence with concrete reality, much as the BLAST algorithm does.

(1) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernel functions, neural networks). (2) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (3) Best practices in machine learning (bias/variance theory; innovation processes in machine learning and AI).

To construct the algorithm, we need to select certain computable statistics in advance, and the relative combination patterns of the various defined statistics can correspond to differential equations, which are the objects we can finally compute. My ideal is to process medical information from a network perspective. The various performances need to be defined before they can be evaluated and improved upon; large numbers of underlying computations can then imitate that evaluation of the algorithm.

The learnability of a network: the connection of new nodes into existing relations can be treated as equivalent to a Bayesian computation of probability.

The way to realize artificial intelligence is not to simulate humans. Human intelligence is a feasible path derived from large-scale trial and error at the biological level; it has reference value for machine intelligence, but machine intelligence must take a different path. Only in this way can we make up for the hundreds of millions of years of biological evolution.

Specific applications, with good definitions, can approximate the problems we want to solve. Understanding a wide range of medical data and approximating its most critical fixed-point properties provides an excellent tool for our specific applications.

So-called learning is based on a large-scale training set (network data): at the various key special sequences (interfaces) that carry specific information in an individual, it keeps targeting them and assigning expression probabilities to specific paths, so as to provide individualized service. This is like the collapse of a wave function. The key is to eliminate and select possible paths through trial and error, and to keep traversing in this way. The annealing induced by learning, i.e., path formation, is the result we want, and it can be continuously optimized through this iteration.

We hold the belief that there are always some variables whose relationship can be expressed as a characteristic linear relation, and that the others can be seen as different positions within the distribution of that relation. The former might be the relationship between house price and floor area; the latter, the relationship between house price and distance to a hospital.

Basic assumption: there is some statistical analysis that can construct regression relationships among different variables. It can be further expressed as a multi-level coupled structure of differential equations. All of this rests on model construction and judgment based on existing large-scale data collection.
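As a minimal sketch of this basic assumption, hypothetical data relating house area to price (the example above) can be fit by ordinary least squares; the data values and variable names here are purely illustrative.

```python
import numpy as np

# Hypothetical data: house area (m^2) and price; the numbers are illustrative only.
area = np.array([50.0, 80.0, 100.0, 120.0, 150.0])
price = np.array([150.0, 240.0, 310.0, 360.0, 450.0])

# Ordinary least squares for one variable: price ~ theta0 + theta1 * area.
X = np.column_stack([np.ones_like(area), area])      # add an intercept column
theta, *_ = np.linalg.lstsq(X, price, rcond=None)    # solve min ||X theta - price||^2

print("intercept, slope:", theta)
print("predicted price for 90 m^2:", theta @ np.array([1.0, 90.0]))
```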

Multiple paths: different models can fit the same real data, which is the result of the expression of Markov sequences.

The assumption of continuity is the theoretical basis for the existence of answers to these various questions.

The construction of relationships among multiple indicators: suppose, for example, that the distributions of markers A through G each have some probabilistic connection to the occurrence of cancer, say 10%, 20%, 30%, 40%, 50%, 60%, and 70%. The specific expression sequence can be represented by an adjacency matrix. We can then observe large numbers of patients and healthy people, with + standing for expression and - for non-expression; the pattern A-G = +++++++ has the largest probability, the pattern ------- has the lowest, and the prevalence probabilities of the remaining expression patterns form a certain distribution. By mining this data pattern we hope to make a more reliable prediction for a new individual. We start with Boolean functions, and then move on to specific probability expressions and Bayesian operations.
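A minimal sketch of the Bayesian step, under the (purely illustrative) assumption that the seven markers are conditionally independent given disease status, so a naive-Bayes update turns an observed +/- pattern into a posterior probability. The 10%-70% figures from the text are reinterpreted here as P(marker expressed | cancer); the healthy-group rates and the prior are hypothetical.

```python
import numpy as np

# Hypothetical P(marker expressed | cancer) for markers A..G, from the 10%..70% example.
p_plus_given_cancer  = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7])
# Assumed background rates P(marker expressed | healthy); illustrative values only.
p_plus_given_healthy = np.array([0.05, 0.05, 0.1, 0.1, 0.1, 0.1, 0.1])
prior_cancer = 0.01   # assumed prevalence

def posterior_cancer(pattern):
    """pattern: sequence of 1 (+, expressed) / 0 (-, not expressed) for markers A..G."""
    pattern = np.asarray(pattern)
    # Naive-Bayes likelihoods: multiply per-marker terms (independence assumption).
    like_c = np.prod(np.where(pattern == 1, p_plus_given_cancer, 1 - p_plus_given_cancer))
    like_h = np.prod(np.where(pattern == 1, p_plus_given_healthy, 1 - p_plus_given_healthy))
    num = like_c * prior_cancer
    return num / (num + like_h * (1 - prior_cancer))

print(posterior_cancer([1, 1, 1, 1, 1, 1, 1]))   # all markers expressed (+++++++)
print(posterior_cancer([0, 0, 0, 0, 0, 0, 0]))   # no markers expressed (-------)
```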

For the same problem, the idea of machine learning is to classify: distribute the data by certain features and then extract some boundary function as the evaluation standard. In theory, infinitely fine classification of features could correspond to the situation of each specific person in reality, and the combined consideration of these features could yield conclusions that are true with high probability (similar to a Fourier series in linear algebra; of course we need to consider precision and speed of convergence). This is also the correspondence between sequence and specific situation that I have been thinking about. Of course, the scale and difficulty of such calculations are extremely high, which would make such an algorithm useless, so given the distributions that inevitably exist in nature we can expect clustering: some relationships will be at a shorter distance from each other and clearly distinct from the outside. Within such a cluster there are again different distributions, like the alignment of different segments of a sequence. Support vector machines are able to handle an effectively infinite number of features. (For qualitative description we need to improve things in probabilistic form: the interaction between possible features is expressed as probability, i.e., the high-dimensional structure formed by the coupling of sequences, analogous to the way the secondary structure of DNA is formed.)

This kind of clustering corresponds, in network theory, to the formation of community structure, which reflects the high-dimensional structure of the relationships between nodes and is a secondary structure above the central node, formed by the interactions among nodes. High-dimensional structures thus arise from comparing degrees of connectivity (different criteria correspond to different path formations). So, in essence, this distribution exists prior to its definition; that is what the pattern recognition we want amounts to (large-scale computational judgment based on certain rules, so determining the rules matters: the clustering of news, for instance, requires deciding the number of keywords, and so on).

On these foundations we need to establish high-dimensional, omics-style thinking and understand that large-scale data is the basis of selective expression: it is the matrix of hidden states (the relative proportions of various states) of what we think of as a hidden Markov model, and a specific expression is a probabilistic outcome of that prior state (essentially multi-possible, of course with a certain distribution, in which a few outcomes occur with high probability, a power-law distribution). Clustering algorithms make sense on this basis, and we need to construct the high-dimensional structure so that it has a more definite transformation path to and from the low-dimensional structure.

The granularity of current clustering algorithms is too coarse, and we need more refined classification; in theory the lowest-level classification has greater certainty (like the infinitesimal quantities of calculus), and on that basis, as in programming, one keeps traversing upward to form expressions, functions, loop structures, and so on. One idea to discard at present is that a difference in expression is most likely tied to one special process; we should think in a more low-level way, because there may be various periodic changes, impulse expressions, and other interfering effects, and the relationships between the nodes of the network may provide a better explanation.

For unsupervised learning to automatically find the structure of the data, we still need to extract and define quantities that can be computed, so that clusters at different levels form according to judgments on those quantities. All computation requires objects and rules. We can then expect these large-scale operations to compute different levels of classification for us, which is a way to transcend personal knowledge, just as mathematical structures can tell us more than was needed to construct them; our creations become smarter than we are, and this is also an effort to free ourselves from our own limitations.

Based on the above, we hope to perform cluster analysis on large-scale measured indicators, so as to map more characteristic changes onto the basics of our reality; for example, diabetes can be decomposed into its "three mores and one less", and so on, which requires our existing medical knowledge for this descriptive work. Then, on this basis, we keep going deeper into smaller levels, such as the expression patterns of genes and proteins, and finally integrate and map the data level onto the state of the body.

Linear algebra is used to represent probability, so as to imitate calculus; knowledge of matrices is therefore essential.

When constructing relations at the statistical level, linear relations are the most basic assumption, and the remaining complex relationships can be approximated with arbitrary precision by selective expression over some basis. The more data, that is, the more variables considered, the more accurate the theory can be in principle; but reality is the result of selective expression, that is, it may be adjusted up or down, though still within the statistical picture.

Construction and implementation of the prediction function. Statistical analysis of the error can serve as the criterion for evaluating our fit (the smaller the error, the better) and can be computed by an algorithm, because everything is then clearly comparable. The minimum of this function can be found with other algorithms. A local optimum selects a possible combination of parameters based on certain judgments and iterates continuously; each iteration re-assigns the processed variables. This updating of parameters is a kind of learning.

The gradient descent algorithm can be used to minimize any cost function.
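A minimal sketch of batch gradient descent applied to the squared-error cost mentioned above; the learning rate, iteration count, and data are illustrative and reuse the hypothetical house-price example.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """Minimize J(theta) = (1/2m) * ||X theta - y||^2 by repeated gradient steps."""
    m, n = X.shape
    theta = np.zeros(n)                      # start from zeros (or a random point)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m     # gradient of J with respect to theta
        theta -= alpha * grad                # step in the direction opposite the gradient
    return theta

# Illustrative use with the hypothetical house data from the earlier sketch.
area = np.array([50.0, 80.0, 100.0, 120.0, 150.0])
price = np.array([150.0, 240.0, 310.0, 360.0, 450.0])
X = np.column_stack([np.ones_like(area), area / 100.0])  # scale the feature so the steps stay stable
print(gradient_descent(X, price, alpha=0.1, iters=5000))
```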

Matrices can express a variety of complex network relationships, and at this level the selective representation of different matrices can approximate real network relationships with arbitrary precision. This is why linear regression models, as the underlying layer, can keep traversing upward into high-dimensional structures, and why they can be so powerful for processing large-scale data. The elements of the matrix are the objects of the operations, and its various transformations, such as singular value decomposition and eigenvalue solving, are such processes. Moreover, there are interactions between matrices, and the final eigenvalue solution can be understood as the expression process of a Markov sequence. The final interaction is represented by matrix multiplication, which at this level can form a larger pathway, i.e., a specific path. This also relates to how the multiplication and addition of probabilities form a good mathematical structure, that is, the final formation of a probability path.

This can also be understood as multivariate regression analysis, where the order of the matrix corresponds to the number of variables, and the final linear relationship can emerge through the operations within the matrix. This is actually very similar to our sequence operations, which all deal with multi-state variables and must consider interactions, because the coupling effects that may occur between variables have a certain distribution, just as the primary sequence of DNA can form a secondary structure according to the matching of sequences, i.e., A*B.

Then comes the definition of the parameters.

Then build a set of languages on the basis of the above linear algebra: variable definition, logical comparison, arithmetic and bit operations, expression formation, loop structures, and so on. Essentially, we can get a general sense of specific languages through statistics, such as frequency analysis of diseases.

Before us we already have IBM's Watson, which identifies various keywords to sift a series of treatment options out of massive amounts of data. It is very powerful and we cannot compete with it head-on, so we can only find another way: go further than it, work in more detail than it. This is what I have always expected from networks and sequences, the emergence of patterns based on computable relations, i.e., probabilistic relations between objects and probabilistic network formation. We intend to start with the most basic disease diagnosis and classify unique individuals through the measurement of multiple features; in theory, as long as there are enough features, we can match a diagnosis of a certain clustering significance with arbitrary precision (the same disease may have different manifestations, but the statistical significance is similar). On this basis we can continue the cluster analysis and keep traversing to higher dimensions.

The linear regression relationship is like the infinitesimal at the bottom; on its basis all possible relations can be traversed, though of course we need some target to keep its specific path formation within a certain range.

Infinite classification can be represented as a 1/0 sequence. Besides this direct definition, it can also be defined in terms of certain effects, such as whether something is expressed or not.

Logistic regression, as a classification algorithm, can distinguish the properties of different objects with a certain probability, just as a Markov sequence considers the whole set of possibilities. For example, if hθ(x) = 0.7 is computed from already-determined parameters for a given x, it means there is a 70% chance that y is the positive class, and correspondingly the probability that y is the negative class is 1 - 0.7 = 0.3.
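A minimal sketch of the hypothesis hθ(x) = g(θᵀx) with the sigmoid g; the parameter values below are illustrative, chosen only so that the output can be read as the probability that y = 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Logistic hypothesis: P(y = 1 | x; theta) = sigmoid(theta . x)."""
    return sigmoid(np.dot(theta, x))

theta = np.array([-1.0, 0.5, 0.85])    # illustrative, already "learned" parameters
x = np.array([1.0, 2.0, 1.0])          # first component is the intercept term
p = h(theta, x)                        # roughly 0.70 for these numbers
print("P(y=1 | x) =", p, " P(y=0 | x) =", 1 - p)
```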

The boundary function is the criterion for judging.

A good definition helps us perform a variety of calculations. The variation of the various parameters relies on iterative progression to finally find specific extreme values.

Multi-class classification uses the one-vs-all idea: extract the unique features of one class and treat all the rest as a single class, then repeat the process until the problem is decomposed into multiple unique categories, as in the sketch below. Our ideal is to continue classifying through these categories to obtain a unique classification in higher dimensions.
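A minimal sketch of one-vs-all, assuming logistic regression as the base classifier; the tiny dataset, class layout, learning rate, and iteration count are all illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, num_classes, alpha=0.1, iters=2000):
    """Train one logistic regression per class: class k versus everything else."""
    m, n = X.shape
    Theta = np.zeros((num_classes, n))
    for k in range(num_classes):
        yk = (y == k).astype(float)                       # relabel: 1 for class k, 0 for the rest
        for _ in range(iters):
            grad = X.T @ (sigmoid(X @ Theta[k]) - yk) / m
            Theta[k] -= alpha * grad
    return Theta

def predict(Theta, X):
    """Pick the class whose classifier assigns the highest probability."""
    return np.argmax(sigmoid(X @ Theta.T), axis=1)

# Tiny illustrative dataset: intercept column plus two features, three classes.
raw = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 0.0], [2.2, 0.1], [1.0, 2.0], [1.1, 2.2]])
X = np.column_stack([np.ones(len(raw)), raw])
y = np.array([0, 0, 1, 1, 2, 2])
Theta = train_one_vs_all(X, y, num_classes=3)
print(predict(Theta, X))
```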

Regularization, which retains all features but reduces the magnitude of the parameters, can mitigate overfitting problems. This can be implemented with a penalty strategy, as in the cost function below.
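For reference, a standard form of the regularized cost for linear regression; the penalty term shrinks the parameter magnitudes, and by convention the intercept $\theta_0$ is not penalized:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2\right]$$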

We have to find a balance between underfitting and overfitting so that we have good predictive ability on the data and stay compatible with newly added data. Our ideal Markov model has a rough distribution and allows a small amount of anomalous data, because this is the result of the selective expression of the network; one can also compare the coupling of different distributions in the hawk-dove game of game theory. Our boundaries are blurred.

The above concerns linear classification; next comes nonlinear, polynomial construction, with algorithms such as neural networks and support vector machines. In essence, this is an optimization of what would otherwise be an infinite number of linear calculations: certain feature recognition cuts out useless computation and devotes more resources to the path formations most likely to emerge. From the network point of view, this means paying more attention to the operation of the central nodes and letting points lead regions. (It is true that large-scale computation over ordinary nodes can also approximate the result obtained from the central nodes, but the input-output ratio is too low; we still follow the power-law distribution of the Matthew effect.)

That is, we leap from linear fitting of variables to finding distribution patterns, and the number of parameters for the possible combinations grows explosively. Suppose we have many features, say 100 variables, and we want to use those 100 features to build a nonlinear model. The result would be a staggering number of feature combinations: even if we only used pairs of features (in fact we would need more), we would have close to 5,000 combinations (see the arithmetic below). This is too many features for ordinary logistic regression to compute. We need to think about it from a high-dimensional perspective, like the dimensional operations of calculus: simple addition and subtraction of high-dimensional functions corresponds to complex accumulation of low-dimensional ones (the selective representation over a set of basis functions revealed by the Fourier series can approximate a real function with arbitrary precision, but the amount of computation is too large to be acceptable even for a computer, so we need to extract high-dimensional quantities, that is, distributions, to compute with).
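The arithmetic behind the "close to 5,000" figure: the number of distinct pairs among 100 features is

$$\binom{100}{2} = \frac{100 \times 99}{2} = 4950 \approx 5000.$$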

The convergence of polynomials: the number of feature combinations is treated like a Taylor series, considering only the first- and second-order terms and ignoring the higher orders.

Pattern recognition: select a macroscopic object, extract a corresponding mathematical model, then select the elements within it, form new relationships according to some relation (a function definition), construct an algorithm to identify its characteristics, and finally make a judgment according to some standard. Suppose we use only grayscale images, so each pixel has a single value. We could choose two pixels at two different positions in the image and train a logistic regression algorithm to use the values of those two pixels to determine whether the image is a car. If instead we use a small image of 50x50 pixels and treat every pixel as a feature, there are 2,500 features, and if we further combine features in pairs to form a polynomial model, there are about 2500^2/2 (close to 3 million) features.

Neural Networks:

Simulate the rapid convergence mechanism of the brain: extract high-dimensional feature quantities from large-scale computation in a mathematical space, and then operate at that high-dimensional level to make quick judgments, as our brain does when recognizing pictures. Machine intelligence recognizes large-scale data by operating at the bottom level, and we need to train it to simulate our way of thinking, more precisely the high-dimensional way of thinking, so as to cut unnecessary computation (I am not confident that a computer's low-level computation traversing up to the high-dimensional level resembles the human thinking process; of course this has proved to be a feasible path for mammals, and it has great reference value for the evolution of machine intelligence). I think this can only be achieved on top of existing large-scale underlying operations, which require multi-level coupling to form a high-dimensional structure that can then be selectively expressed as various specific paths. In fact, this also relies on the underlying computation of our neurons; I think probabilistic networks illustrate this well.

The idea of neural networks: through learning algorithms, construct various high-dimensional functional modules. This is low-level computation traversing up to a high-dimensional structure, and along the way we need large-scale trial and error and selection; after all, the relationship between higher and lower dimensions is like different positions in a pyramid. Moreover, although the different modules of a neural network are the result of differentiation and distribution, they retain the possibility of returning to an initial state and thereby evolving functions they did not originally have, as when the nerve from the ear to the auditory cortex is severed and the pathway is rewired in an animal's brain so that the signal from the eye along the optic nerve eventually reaches the auditory cortex.

Multi-level information transmission and computation (signal transmission, various projection fibers, neuronal connections): a new Turing machine, or an application of the Turing machine? In a neural network, the high-dimensional complex structure formed by the dynamic connections of neurons can correspond to all possibilities in the world; then, under the guidance of certain rules, it anneals and collapses into a particular path according to some standard. That is learning, and it is an application of the power-law distribution.

This brain learning algorithm is based on selective expression under the environment, forming pathways in a way that resembles natural selection, and we can mimic this process by connecting basic neurons. That is, we assume the brain is a model in which only neurons and their connections exist, and specific patterns of neuronal connection correspond to certain computational structures. Glial cells and the like can then be regarded as the environment's screening mechanism and as key targets of the computation. Perhaps we should also consider the underlying gene-protein expression network, which would give the construction of probabilistic networks more reliability; because these networks are all similar, some transformation between them is possible. New synapses and dendrite formation are the specific learning behaviors of neurons, but these are probabilistic behaviors based on the expression network. The corresponding neural network model is a network of many logical units organized into different layers, with the output variables of each layer serving as the input variables of the next.

Synaesthesia: various senses are transformed into information so that equivalences can be constructed, such as BrainPort's system that enables blind people to learn to "see" with their tongues.

The learning mechanism of the brain is essentially an integration mechanism for inserting new nodes, because the brain is actually in a dynamic state of intense change, with new interactions between levels, like the immune system, which gives it greater adaptability to the environment. We can then treat the fitness function as a defined function.

Information can be expressed as the flow of electric current; its specific direction, intensity, position, and so on all carry information.

Taking the neuron as the learning unit and an existing mathematical model as the specific mode of operation, the various parameters of expression can be regarded as weights in a linear expression. The processing of data does not necessarily follow completely large-scale linear calculation, but is filtered according to rules such as thresholds (an action potential must exceed the threshold to be transmitted), and the final computation also follows certain nonlinear processing rules. This process is like a Markov sequence, where the results are probabilistic distributions, which we regard as high-dimensional computation. Just as in evolution by natural selection, new variation is constantly introduced while the general direction is maintained, so there is naturally no need for mathematical precision. That is, the transfer of a matrix is the result of nonlinear screening of its underlying linear elements, which is a high-dimensional operation.

The distribution of weights is a linear-regression-style distribution, but the weights themselves are also obtained through a series of calculations. Therefore, the neural network is essentially a computation of high-dimensional quantities: it represents the original data (original features) in a more complex matrix form (multiple layers, each extracting from the layer below; in theory, the more layers of classification, the better the description), that is, it constructs a certain matrix according to the feature values.

Specific implementation (requires good function definitions and values): in a neural network, the computation of a single-layer neuron (without an intermediate layer) can represent logical operations such as logical AND and OR (allocating the weights differently constructs the different logical operations AND, OR, NOT), as in the sketch below.
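A minimal sketch of single-unit logical operations; the weight values (-30/20/20 and so on) are illustrative choices that push the sigmoid very close to 0 or 1 for binary inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unit(bias, weights, inputs):
    """A single logistic unit: output is pushed close to 0 or 1 for binary inputs."""
    return sigmoid(bias + np.dot(weights, inputs))

# Illustrative weight choices that realize logical operations, one unit each.
AND = lambda a, b: round(unit(-30.0, [20.0, 20.0], [a, b]))   # 1 only if a = b = 1
OR  = lambda a, b: round(unit(-10.0, [20.0, 20.0], [a, b]))   # 1 if a = 1 or b = 1
NOT = lambda a:    round(unit( 10.0, [-20.0],      [a]))      # 1 if a = 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
print("NOT 0 =", NOT(0), "NOT 1 =", NOT(1))
```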

Then build higher-dimensional results on this underlying logical structure. This is consistent with the underlying operations of our computers, which keep traversing upward to form higher-dimensional results, that is, more complex functions.

The cost function (the measurement of error, which can be iterated over continuously) is defined so that a given criterion can be compared computably; it is presumably inspired by the idea of a scoring matrix in bioinformatics, and here it serves as the criterion for classification.

The computation over N-dimensional data is represented by the interaction of matrices, but this operation does not follow matrix multiplication exactly; some processing is applied selectively at different links, such as excluding some data according to the cost function.

Pure conjecture: the formation of concrete relations, path formation/annealing-by-collapse. Combine the forward propagation and backward propagation of the neural network to select the path with the greatest probability; the selective expression of the substrate formed by their respective results corresponds to the real situation (the equilibrium reached in a game). The former is the operation of the matrices, the latter is the error computed from the last layer and carried back level by level (a minimal sketch of the forward pass follows). We know that the cost function is the criterion, and these methods are the strategies taken to satisfy it.
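Forward propagation through one hidden layer, as a minimal sketch; backpropagation (the error computed at the last layer and passed back to supply gradients for the cost function) is omitted here, and the layer sizes and weights are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """Forward propagation through one hidden layer; each layer's output feeds the next."""
    a1 = np.concatenate([[1.0], x])                       # input layer plus bias unit
    a2 = np.concatenate([[1.0], sigmoid(Theta1 @ a1)])    # hidden layer activations plus bias
    a3 = sigmoid(Theta2 @ a2)                             # output layer = hypothesis h(x)
    return a3

# Illustrative shapes: 2 inputs, 3 hidden units, 1 output.
# Random weights stand in for learned ones (random initialization also breaks symmetry when training).
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 3))   # maps (bias + 2 inputs) -> 3 hidden units
Theta2 = rng.normal(size=(1, 4))   # maps (bias + 3 hidden units) -> 1 output
print(forward(np.array([0.5, -1.2]), Theta1, Theta2))
```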

The parameters are unrolled from matrices into vectors, and dimensionality-reduction analysis reduces complex objects to the most basic one-dimensional sequence, on which the optimization analysis of the various algorithms is then carried out.

To avoid possible local optima, the parameters are randomly initialized, and this can be carried into further iterations until convergence.

On the realizability of neural network algorithms: the network structure is layered enough to approximate the real situation (hierarchical interaction). The number of units in the first layer is the number of features in the training set; the number of units in the last layer is the number of classes produced for the training set. If the number of hidden layers is greater than 1, keep the number of units in each hidden layer the same; usually, the more hidden units, the better. What we really have to decide is the number of hidden layers and the number of units in each intermediate layer.

How do we use these successful algorithms to build the algorithms we need to operate on medical data? This is something I have been thinking about. Needless to say, machine learning is an indispensable choice. Before we reach the ultimate ideal, a complete data understanding of a human (a data person in the true sense of the word), we should first develop some of the diagnostic and therapeutic platforms that are within reach. Luckily, we already have massive data on the medical side; we just need to organize it into a well-structured form. Current algorithms not only iterate on the object of the operation but also iterate on their own parameters, and this coupled mode of operation appeals to me greatly. All of the above algorithms are essentially doing one job, classification, and I think that is the underlying variety; how to build new connections on that basis is then the specific application. This way of thinking is the best my vision can grasp so far: select some computable object, build some criterion function, establish a model, compute iteratively, and optimize continuously. The theoretical limit of what could be done is to build a model of everything in the world and then, from the data input, quickly match a new patient to an exact classification (the sequences I have been emphasizing can be mapped according to the possible relationships of the five elements), and from that derive a diagnosis and treatment plan.

The implementation of any advanced function requires us to choose a particular path and set of feature quantities, which is a kind of dimensionality-reduction thinking. Then comes continuous improvement and the development of more features. Feature extraction is a kind of fixed-point search.

Identify a simple algorithm, implement it, and then decide, on the basis of a defined evaluation rather than unaided intuition, which changes will improve the performance of the algorithm, so as to quickly decide what to do next.

My consistent thinking is to take advantage of the geometric distribution and combination of nodes in the network, and to look for real-world correspondences for these nodes. In this way we can simulate, say, the movement of gas molecules in nature, and we can use a variety of distributions; computation at the level of distributions seems to me to be high-dimensional computation. Much knowledge from graph theory can also come in handy, and existing network results such as power-law distributions, the small-world model, and six degrees of separation can also be used to describe and compute networks.

Therefore, concrete realization requires us to find a good correspondence: the movement of a node can be idealized as the movement of coordinates, the construction of relationships between nodes can be understood as correlations of motion, and the formation of higher-dimensional structures can correspond to various abstract situations in reality. Then the correspondence is constructed with probabilistic networks and Markov sequences.

(Maybe I should not have tried so hard to make sense of these computations in my own way, which goes against nearly everyone, but I am more afraid that if I go down someone else's path there will be no room left to resist. However, my thinking is now also getting closer to the current mature thinking, probably always seeking what is achievable, and I do not know whether that is maturity or degeneration. Now I can only wait and see; I do not know whether I will lose my original dream in the process of learning.)

The general optimization ideas are: collect a larger sample; try reducing or increasing the number of features; add polynomial features such as x1 squared, x2 squared, and the product x1*x2 (see the sketch below); and decrease or increase the value of the regularization parameter lambda.
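A minimal sketch of adding the quadratic features mentioned above to a two-feature design matrix; the data values are illustrative.

```python
import numpy as np

def add_quadratic_features(X):
    """Augment [x1, x2] with x1^2, x2^2, and the product x1*x2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([X, x1 ** 2, x2 ** 2, x1 * x2])

X = np.array([[1.0, 2.0], [3.0, 4.0]])   # illustrative rows with features x1, x2
print(add_quadratic_features(X))
```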

The assumption behind the network model we choose is a series of network results, such as the power-law distribution, the small-world model, and six degrees of separation, which need to be well defined.

Algorithms need not only computable objects but also evaluable criteria, which are the basis for further improvement. Then there are the various statistics such as variance, mean, and so on. We always want to reach certain extrema, such as the minimum of the cost function or a zero derivative; we can borrow from the squeeze theorem in mathematics to formulate boundary functions and evaluate the overall distance. Specific optimization measures must be chosen case by case.

The higher the order of the polynomial model, the better it adapts to our training set, but the harder it is to generalize, so we should choose the equilibrium reached by the competitive game at the real level. This equilibrium can be expressed through the derivative of a payoff function; this equilibrium is the Markov sequence, and its selective expression is the concrete reality (the expression of probabilities is the basis on which the coupled situation occurs). The final trade-off is also reflected in recall and precision as evaluation measures for skewed-class problems; a relative balance between precision and recall must be maintained (see the sketch below).
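A minimal sketch of precision, recall, and their harmonic mean (F1), the usual single number used to balance the two; the counts are illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for a skewed-class problem, from a confusion-matrix tally."""
    precision = tp / (tp + fp)                      # of everything flagged positive, how much was right
    recall = tp / (tp + fn)                         # of everything truly positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 85 true positives, 15 false positives, 25 false negatives.
print(precision_recall_f1(85, 15, 25))
```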

In my opinion, the process by which doctors accumulate experience is a learning process, and we can use algorithms to simulate the resulting behavior. Experience corresponds to the amount of data: regardless of the algorithm (or the talent of the individual doctor, etc.), an increase in the amount of data can significantly enhance performance. Of course, this requires us to capture enough feature values so that expression can traverse the entire space and eventually collapse selectively into a specific path.

Support Vector Machine:

A supervised learning algorithm. Boundaries are simulated with certain simple functions (piecewise functions) to make inferences, and can be optimized against various evaluation functions. It strives to separate the samples with a maximum margin, which is why support vector machines are sometimes called large-margin classifiers.

Excluding anomalous data on the basis of its relative proportion or distance yields a better boundary function.

The combination of the original features: kernel functions are used to compute new features (using the proximity of x to pre-selected landmarks to build the new features f1, f2, f3), which is a kind of high-dimensional extraction; see the sketch below.
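A minimal sketch of the landmark construction: each new feature fi measures how close x is to a pre-selected landmark li via a Gaussian kernel (close to 1 near the landmark, close to 0 far away). The landmark coordinates and sigma below are illustrative.

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma=1.0):
    """Similarity feature f = exp(-||x - l||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
landmarks = [np.array([1.0, 2.0]), np.array([3.0, 0.0]), np.array([-1.0, 4.0])]  # pre-selected points l1, l2, l3
features = [gaussian_kernel(x, l) for l in landmarks]   # new features f1, f2, f3
print(features)
```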

Unsupervised learning algorithms divide the data into classes through data analysis alone, and with the computing power of computers they can extract features beyond what we humans can articulate. In my opinion, this is the key to machines developing in the direction of intelligence: we no longer point out the way, they find the way themselves. The network ideas I have always preferred also seem better represented in unsupervised learning, such as cluster analysis and dimensionality-reduction analysis. In network analysis, sufficiently refined classification enables individualized judgments (various recommendation algorithms); network relationship discovery can reveal relationships between different individuals in a network (such as searching for terrorists); and the reallocation of resources and re-layout of a network can optimize data centers and data communication.

The data itself has a certain distribution. Previously we explained existing objects and finally found their distribution patterns; now we go in the reverse direction and define these distributions in order to explain, and this label information has to be constructed internally, that is, to define and to classify. Among the algorithms that define these distributions, the clustering algorithm is the basic operation. It can be mapped to the definition of some characteristic quantity, though of course clustering is much more than that.

K-means is an iterative algorithm: it computes the mean of each group, moves the center point associated with the group to the position of that mean, and repeats until a steady state forms (a minimal sketch follows). In this process the iteration corresponds to a computable change operation, and the final converged structure is the steady state. The goal of optimization is to find the assignment that minimizes the cost function.
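A minimal sketch of plain K-means with random initial centroids; the two synthetic clusters are illustrative.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Assign points to the nearest centroid, move each centroid to its group mean, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every point.
        labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):   # steady state reached
            break
        centroids = new_centroids
    return centroids, labels

# Two illustrative clusters around (0, 0) and (5, 5).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
print(kmeans(X, k=2)[0])
```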

Dimensionality reduction: abstracting complex structures into more fundamental results, the way infinitesimal quantities can be superimposed, thereby constructing relationships at different levels. This kind of feature extraction compresses the data to a certain extent, and operations at this level can approximate all the results.

Reducing the number of features in the computation and reducing redundancy makes our calculations more feasible.

We can then construct relative relationships between the different dimensions, and the meaning of each newly created feature can be defined by us. The infinite-dimensional space we can base this on can correspond exactly to all cases.

Principal component analysis and singular value decomposition can rank the newly obtained "principal component" vectors by importance, keep the most important leading ones as needed, and omit the trailing dimensions, thereby reducing dimension and simplifying the model, or compressing the data, while preserving the information in the original data to the greatest extent (see the sketch below). This is itself a distribution.
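A minimal sketch of PCA via SVD of the centered data; the synthetic, nearly one-dimensional data is illustrative.

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components via SVD of the centered data."""
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]                          # the k most important directions
    Z = X_centered @ components.T                # reduced, k-dimensional representation
    X_approx = Z @ components + X.mean(axis=0)   # reconstruction from the compressed data
    return Z, X_approx, S

rng = np.random.default_rng(2)
# Illustrative 3-D data that is almost 1-dimensional (points near a line).
t = rng.normal(size=100)
X = np.column_stack([t, 2 * t, -t]) + rng.normal(scale=0.05, size=(100, 3))
Z, X_approx, S = pca(X, k=1)
print("singular values (importance ranking):", S)
```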

Pattern recognition (building a model from features and then comparing it with specific data): as long as the number of feature variables is large enough, we can avoid all kinds of possible errors with relatively high accuracy, which I think can minimize the harm that medical interventions might cause; this is the so-called anomaly detection problem. We use probability to classify, i.e., what is the probability that a particular data point falls within the modeled range; within a certain threshold (in terms of distance from the mean) it can be considered to belong to the same category.

We can also perform pattern recognition at the level of distribution.

μ is the mean and σ² is the variance. A threshold is selected as the decision boundary for pattern recognition: the parameters are fitted to the given data set, i.e., μ and σ are estimated, and then a new sample is tested to determine whether it is anomalous, as in the sketch below.
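A minimal sketch of Gaussian anomaly detection under the assumption of independent features: fit μ and σ² from (assumed normal) training data, then flag a new point as anomalous when its density falls below a threshold ε. The data, threshold, and test point are illustrative.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate per-feature mean and variance from training data assumed to be normal."""
    return X.mean(axis=0), X.var(axis=0)

def p(x, mu, var):
    """Density under the independent-feature Gaussian model: product over features."""
    return np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))

rng = np.random.default_rng(3)
X_train = rng.normal(loc=[5.0, 10.0], scale=[1.0, 2.0], size=(500, 2))   # illustrative "normal" data
mu, var = fit_gaussian(X_train)

epsilon = 1e-4                      # threshold, chosen e.g. on a validation set
x_new = np.array([9.0, 3.0])        # a point far from the training distribution
print("anomalous" if p(x_new, mu, var) < epsilon else "normal")
```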

Combining some related features (such as the ratio between features) to obtain new and better features can be regarded as capturing the coupling effects of the hierarchy. I wonder whether this can be combined in the form of differential equations.

Multivariate Gaussian distributions can construct more precise boundaries, and these correlations can be captured by constructing new features.

A specific application, the recommender system, is in my view a good illustration of sequence recognition and even sequence matching. It requires enough features to identify, and this is our sequence: a Bayesian operation that iterates over the various possible probabilities on the basis of past experience. This combination of feature vectors is a sequence (a minimal sketch follows).
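One minimal, content-based sketch (only one of many recommender designs): represent items and a user by feature vectors and rank items by similarity to the user's profile. All names, vectors, and numbers here are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical item feature vectors (e.g., symptom or content profiles); values illustrative.
items = {
    "item_a": np.array([0.9, 0.1, 0.0]),
    "item_b": np.array([0.8, 0.2, 0.1]),
    "item_c": np.array([0.0, 0.1, 0.9]),
}
user_profile = np.array([0.85, 0.15, 0.05])   # built from the user's past choices

ranked = sorted(items, key=lambda k: cosine(user_profile, items[k]), reverse=True)
print(ranked)   # items most similar to the user's past preferences come first
```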

To obtain data on the various descriptive indicators in medicine, such as redness, lack of energy, and so on, we need to construct a certain feature vector and then obtain data on its specific proportions. We can consider building such a medical diagnosis platform: formulate a classification based on large-scale data, and by simulating the diagnostic thinking process of real doctors and constructing evaluation criteria, iterate continuously to approach or even surpass doctors.