Chapter 81 Statistical Ideas of Network Science
By understanding the real world through the concept of the network, we can recognize the validity of all theories and their convergent attenuation across the hierarchy. A network cannot be like a clock: it can exhibit time inversion and only comparative certainty, because the network is a multi-dimensional coupling, and when the number of iterations exceeds a certain threshold it attenuates and converges, which is chaos and, at the same time, error. So in the end we use statistical models to understand the network and, by extension, reality.
The statistical nature of the data is based on large-scale, random data
It is almost impossible to explore the role of a single variable, because the variables in a network are coupled
Regression to the mean is the achievement of another dimension of dynamic equilibrium, and the average is an intrinsic characteristic, which maintains the overall homeostasis and avoids polarization
We substitute the formula for the raw data of a distributed set: the individual observations are always random, but the observations as a whole present a high-dimensional structure
A phenomenon is the result of an observation, and its essence is the distribution function (probability distribution) behind it: the Poisson distribution
Goodness-of-fit test: determines whether a given set of observations is consistent with a particular mathematical distribution function
Monte Carlo technique: a mathematical model is simulated repeatedly to determine the probability distribution of the relevant data (a means of traversal in which the proportions of the various partial derivatives conform to a certain distribution function)
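A minimal sketch of the Monte Carlo idea in Python: a simple model (an invented one, the total of ten exponential waiting times) is simulated repeatedly and the probability distribution of the output is read off from the accumulated results.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_total_wait(n_events=10):
    # Illustrative model: total waiting time for 10 events,
    # each with an exponentially distributed delay.
    return rng.exponential(scale=1.0, size=n_events).sum()

samples = np.array([simulate_total_wait() for _ in range(100_000)])

# The repeated simulations trace out the probability distribution
# of the quantity of interest.
print("mean            ≈", samples.mean())
print("95th percentile ≈", np.percentile(samples, 95))
```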
High-dimensional data: the sample is large enough that the parameters can be determined essentially without error. Handling random error in small samples: the ratio of the mean to the estimated standard deviation. K. Pearson's four parameters (mean, standard deviation, symmetry, and kurtosis) are estimated and matched to one of the distributions in Pearson's family of skewed distributions. The ratio of the first two parameter estimates has a probability distribution that can be tabulated, so calculating that ratio from the sample yields a statistic with a known distribution.
The basic assumption is that the raw measurements follow a normal distribution.
Complex iterative formulas are transformed into multi-dimensional geometric space forms
The statistical distribution of various parameters is a high-dimensional structure, such as the continuous change of distribution parameters is the true essence of evolution
It is hypothesized that these phenotypes are the result of interactions between genes, which in turn have different probabilities
The multivariate influence of the network: by dividing it into modules under certain constraints, we make the relationship along a particular path appear through random adjustment, and establish a large number of correlation effects among interrelated causes
Breaking down the effects of different treatments: Fisher's ANOVA, an analysis of interactions
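A hedged sketch of Fisher-style analysis of variance with an interaction term, assuming synthetic two-factor data and the statsmodels formula interface; the factor names and effect sizes are invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
# Synthetic crossed experiment: two factors with an interaction effect.
df = pd.DataFrame({
    "fertilizer": np.repeat(["A", "B"], 40),
    "watering":   np.tile(np.repeat(["low", "high"], 20), 2),
})
effect = ((df["fertilizer"] == "B").astype(float)
          + 0.5 * (df["watering"] == "high")
          + 1.0 * ((df["fertilizer"] == "B") & (df["watering"] == "high")))
df["yield_"] = effect + rng.normal(scale=1.0, size=len(df))

# Two-way ANOVA: main effects plus the interaction term.
model = smf.ols("yield_ ~ C(fertilizer) * C(watering)", data=df).fit()
print(anova_lm(model, typ=2))
```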
Degrees of freedom reconcile the varied and anomalous results observed by different authors
The distribution of extreme values determines the convergence range of the hierarchy, and knowing the relationship between the distribution of extreme values and the distribution of normal values makes it possible to predict the occurrence of extreme cases: extreme statistics
The distribution is probabilistic, and its error from reality is also probabilistic
Maximum likelihood estimators are always consistent, and if one accepts several assumptions known as "regularity conditions", then the MLE is the most efficient of all statistics. In addition, Fisher showed that even if an MLE is biased, the magnitude of the bias can be calculated and subtracted from the MLE to obtain a consistent, efficient, and unbiased corrected statistic (sequence match similarity)
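A small illustration of maximum likelihood estimation on synthetic normal data: the parameters are found by numerically minimizing the negative log-likelihood. The slight bias of the MLE variance estimate mentioned above (it divides by n rather than n-1) is visible in such fits.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=2.0, size=1000)  # synthetic observations

def neg_log_likelihood(params):
    mu, log_sigma = params
    return -norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)).sum()

# Maximum likelihood: minimize the negative log-likelihood numerically.
result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # close to the true 5.0 and 2.0
```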
Iterative algorithms constantly approach the intrinsic. The Bayesian formula is a treatment of probability and conforms to the hierarchy of the network. By applying Bayes' theorem repeatedly, we can determine the distribution of the parameters and then the distribution of the hyperparameters. In principle, we can use hyper-hyperparameters to find the distribution of the hyperparameters, and carry this analytic hierarchy to ever deeper levels, and so on. This is convergent, just as the higher-order terms of a Taylor series contribute little to the approximation
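A toy sketch of repeated Bayesian updating under an assumed conjugate Beta-Binomial model: the Beta parameters play the role of hyperparameters, and each new batch of observations pushes the posterior further toward the underlying rate. The batch numbers are invented.

```python
# Repeated Bayesian updating with a Beta-Binomial model (a toy sketch).
# The Beta parameters (a, b) act as hyperparameters; each batch of
# observations updates them, iterating toward the underlying rate.
a, b = 1.0, 1.0                                # flat prior
batches = [(7, 10), (55, 100), (480, 1000)]    # (successes, trials), illustrative

for successes, trials in batches:
    a += successes
    b += trials - successes
    posterior_mean = a / (a + b)
    print(f"after {trials} trials: posterior mean ≈ {posterior_mean:.3f}")
```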
Insects were divided into groups, kept in jars, and exposed to insecticides of different formulations and doses. In the course of these experiments, the investigator discovered a remarkable phenomenon: no matter how concentrated the insecticide he prepared, one or two insects always survived the application, and no matter how much he diluted it, even when all he used was the container that had held the insecticide, a few insects always died. (Expression of probabilistic networks, stability)
Probability unit (probit) analysis: establishes the relationship between the dose of insecticide and the probability that an insect dies at that dose. A concept analogous to half-life must be used: the "50 percent lethal dose", usually written "LD-50", is the dose at which the insecticide kills an insect with 50% probability. At the same time, it is impossible to determine the dose required to kill any particular insect used as an experimental specimen. (This is the nature of the network: we can only find a more certain relationship in the statistics of the whole)
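A hedged sketch of probit dose-response analysis with invented kill counts: a probit regression is fitted with statsmodels, and the LD-50 is read off as the dose at which the fitted probability of death crosses 0.5.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative dose-response data: dose (log scale) vs. kill counts.
log_dose  = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
n_insects = np.array([50, 50, 50, 50, 50])
n_killed  = np.array([4, 13, 27, 40, 48])

X = sm.add_constant(log_dose)
y = np.column_stack([n_killed, n_insects - n_killed])

# Probit regression: probability of death as a function of log dose.
model = sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.Probit()))
fit = model.fit()

# LD-50 is the dose at which the fitted probability of death is 0.5,
# i.e. where the linear predictor crosses zero.
b0, b1 = fit.params
print("estimated LD-50 (log dose):", -b0 / b1)
```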
The stochastic process theorem is an operation at the sequence level
Probability is the partial derivative, i.e., the relative proportion, between the different levels of the high-dimensional structure of the network
Central Limit Theorem: Everything has a distribution. The sums and differences of the various types of normal variables also obey a normal distribution. As a result, many of the statistics derived from normal random variables (variates) themselves obey normal distributions.
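A quick numerical check of the central limit theorem on synthetic data: sums of strongly skewed exponential variables come out nearly symmetric and close to normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Sums of 50 highly skewed (exponential) variables per sample.
sums = rng.exponential(scale=1.0, size=(100_000, 50)).sum(axis=1)

# The sums are approximately normal, as the central limit theorem predicts.
print("skewness of a single exponential ≈ 2.0")
print("skewness of the sums ≈", stats.skew(sums))   # close to 0
```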
Operations research, the optimal allocation of resources, which is a discrete state of equilibrium (eigen) reached by hierarchical competition, uses a normal distribution to deal with the problem.
What appears to be a purely random measurement is actually generated by a deterministic system of equations, i.e., a selective representation of the network. The problem is that the coupling of multiple differential equations makes it impossible to solve them accurately, and we still need to understand them statistically
Constructing a statistic for testing goodness of fit that obeys a known probability distribution: K. Pearson proved that the chi-square goodness-of-fit statistic obeys the same distribution regardless of the type of data used
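A minimal example of Pearson's chi-square goodness-of-fit test, assuming invented die-roll counts tested against the hypothesis of a fair die.

```python
import numpy as np
from scipy.stats import chisquare

# Observed counts from 600 rolls of a die (illustrative data).
observed = np.array([95, 104, 110, 87, 99, 105])
expected = np.full(6, observed.sum() / 6)   # fair-die hypothesis

stat, p = chisquare(observed, expected)
print("chi-square =", stat, " p-value =", p)
```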
Hypothesis testing (or significance testing) is a formal statistical method used to calculate the probability of previously observed outcomes under the assumption that the hypothesis to be tested is true
The significance test is used to draw one of three possible conclusions:
If the P-value is small (usually less than 0.01), one asserts that an effect has been demonstrated; if the P-value is large (usually greater than 0.2), one declares that even if an effect exists, it is too small for an experiment of this size to be likely to show it; and if the P-value is in between, one discusses how to design the next experiment to get a better answer.
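A sketch that turns the three-way reading of P-values above into code, using a two-sample t-test on invented samples; the 0.01 and 0.2 thresholds are the ones quoted in the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
control   = rng.normal(10.0, 2.0, size=30)
treatment = rng.normal(11.0, 2.0, size=30)   # illustrative samples

t_stat, p_value = stats.ttest_ind(treatment, control)

if p_value < 0.01:
    verdict = "an effect has been demonstrated"
elif p_value > 0.2:
    verdict = "any effect is too small for an experiment of this size to show"
else:
    verdict = "inconclusive: design a larger or better experiment"
print(f"p = {p_value:.3f}: {verdict}")
```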
Interval estimates, the probability of believing that the true value of a population parameter will fall within the estimated interval, i.e., the confidence interval
The power-law distribution of the network makes extreme values relatively likely, which significantly affects the results, yielding a smaller "Student" t statistic than would normally be obtained (in general, a large t statistic corresponds to a small p-value).
The scatterplot of the observed data needs to be compared to what would be expected from a purely random distribution: a non-parametric test that eliminates noise
Eigen: a small but sufficiently representative sample can be collected and used to estimate the characteristics of the population
The network as a whole can be divided into several relatively independent parts (these levels still share a certain similarity, that is, a coupling relationship), and further subdivision may introduce some duplication. Mathematically, input-output analysis requires that the matrix describing network activity have a unique inverse, meaning that once the matrix is obtained, a unique mathematical inverse must exist for it. The more refined the classification, the higher the probability that a unique inverse matrix exists, because the fidelity with which reality is simulated is continually enhanced
The impact of a single variable is unreliable, and a more certain correlation can only be constructed at the network level. The language of the network is probability, and a certain path requires the accumulation of probability in a sequence, which fundamentally denies causality. The influence of multiple variables, i.e., the probability of Bayesian formula calculations, can only be observed at the macroscopic scale, i.e. the frequency. The many parameters of the network can never be observed exactly, but they interact with each other and influence each other
Everything we can see and touch is in fact only a shadow of the real world, and the real things that can really be found in this universe can only be obtained through pure reason. The selective representation of a probabilistic network is a real thing
In this 5000-dimensional space, the real data are not scattered everywhere; they actually tend to lie in a lower-dimensional subspace. Imagine that points scattered in three-dimensional space all fall on the same plane or even the same line (the Riemann hypothesis?): this is what real data are like. The 5,000 observations for each patient in a clinical study are not uncorrelated and scattered, because many of these measurements are correlated with each other.
In medical research, the true "dimensions" of data usually do not exceed 5. (Six degrees of separation of the network, average distance)
Correlation between power-law distributions and hidden Markov models: determine the independent hierarchy (robustness) by finding a way to estimate the central tendency of the distribution. An experiment at Yale University in the 1950s estimated the earnings of its graduates ten years after graduation. Using the average made the income look very high, because a few graduates were already multi-millionaires; in fact, more than 80% of graduates earned less than that average.
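An illustration of the same effect with synthetic heavy-tailed "incomes": the mean is dragged upward by a few huge values, and far more than half of the observations fall below it.

```python
import numpy as np

rng = np.random.default_rng(5)
# Illustrative heavy-tailed "incomes": most are modest, a few are enormous.
incomes = rng.pareto(a=1.5, size=10_000) * 30_000

mean, median = incomes.mean(), np.median(incomes)
print("mean   :", round(mean))
print("median :", round(median))
print("share earning below the mean:", np.mean(incomes < mean))  # typically > 0.8
```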
A network view of treatment, the systemic expression of disease (congestive heart failure is not an ordinary disease. Its cause is not a simple infectious agent, nor can it be relieved by blocking the pathway of a single biochemical enzyme. Hormones delicately control the heart, regulating its rate and contractility to match the changing needs of the body, but the heart of a patient with congestive heart failure responds less and less to this regulation. The main feature is that the heart muscle gradually weakens while becoming hypertrophied and flaccid. As a result, patients develop edema in the lungs and ankles, and the slightest exertion leaves them short of breath. They may also feel drowsy and confused when blood is diverted to the stomach during meals, leaving the brain undersupplied. To maintain homeostasis, the body automatically adjusts to the heart's reduced output; in many patients, the hormones regulating the heart and other muscles settle into some steady state, even though those hormone levels would be abnormal in a healthy person. If doctors intervene with β-adrenergic agents or calcium-channel blockers, the result can complicate the patient's situation further. Pulmonary edema is an important cause of death in congestive heart failure, and modern medicine relies on diuretics to relieve the edema; but after diuretics are used, the hormonal changes triggered by the interaction between kidney function and heart function create new problems.)
When designing a study, the first question is what to measure. The measurements in such an experiment are multi-layered, so their distribution functions, and the parameters of those functions, must be estimable, and their composition must likewise be multidimensional.
Lévy's proof of the central limit theorem establishes a more universal set of necessary conditions, equivalent to having a sequence of numbers generated at random one after another: 1. Variation is bounded, so individual values cannot be infinitely large or infinitesimally small. 2. The best estimate of the next number is its previous value. Lévy called such sequences martingales, a convergence of the hidden Markov model and a manifestation of energy minimization
The patient's response is a martingale. The difference between two martingales is still a martingale: a linear system
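A small simulation, under the assumption that a symmetric random walk is an adequate stand-in for a martingale: its average increment is near zero (so the best estimate of the next value is the current one), and the difference of two such walks behaves the same way.

```python
import numpy as np

rng = np.random.default_rng(6)

# Two independent symmetric random walks: simple examples of martingales.
walk_a = rng.choice([-1, 1], size=10_000).cumsum()
walk_b = rng.choice([-1, 1], size=10_000).cumsum()
diff = walk_a - walk_b   # the difference of two martingales

# Empirically, the mean increment is near zero for both.
print("mean increment of walk_a :", np.diff(walk_a).mean())
print("mean increment of diff   :", np.diff(diff).mean())
```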
Abraham de Moivre introduced calculus into probability calculations
Glivenko-Cantelli lemma: by increasing the number of observations, the less elegant empirical distribution function (Fourier series) can be brought closer and closer to the true distribution function (Fourier series)
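A numerical illustration of the Glivenko-Cantelli behaviour on synthetic normal data: as the number of observations grows, the maximum gap between the empirical distribution function and the true distribution function shrinks.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
grid = np.linspace(-4, 4, 400)

for n in (10, 100, 10_000):
    sample = rng.normal(size=n)
    # Empirical distribution function evaluated on a grid.
    ecdf = np.searchsorted(np.sort(sample), grid, side="right") / n
    gap = np.abs(ecdf - norm.cdf(grid)).max()
    print(f"n = {n:>6}: max |ECDF - CDF| ≈ {gap:.3f}")
```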
More accurate measurements in turn make the difference between the predicted value of the model and the actual observed value larger, as in the uncertainty principle of quantum physics
A probability distribution is a low-dimensional projection of the network structure
2. Statistics is an essential tool for understanding large-scale data in our time, and its importance is self-evident: it can help us mine, from low-level and complex raw data, the high-dimensional trends that we humans can understand. The processing of experimental data of every kind also requires statistical methods, so the specific design determines the level of our cognition of the world (an emphasis on application).
The object of action: multi-level coupling can always yield a specific fixed point that can serve as an object of operation, just as quantities such as divergence and curl are defined in physics and can be built into a relationship as magnificent as Maxwell's equations. This is the basis of our quantitative analysis. We choose the sequence as the object of our quantitative analysis. Homology comparison of sequences is a concrete operation; it is a game-like behaviour, as in game theory, and the game is multi-level: not only the comparison of single sequences, but also the comparison of sequences and combinations of their different segments. It is a dynamic process that can construct multiple possible paths, and a certain equivalence relation can be drawn with the wave function of quantum physics.
The relationship between the series is determined through certain relationship construction, such as p-value and other indicators.
The correlation construction of sequences, regression analysis, is essentially multiple linear regression; the relationships among multiple variables are expressed in matrix form
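A minimal sketch of multiple linear regression in matrix form on synthetic data: the design matrix is assembled and the coefficients are obtained as the least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
X = rng.normal(size=(n, 3))                       # three explanatory variables
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Multiple regression in matrix form: solve the least-squares problem.
X_design = np.column_stack([np.ones(n), X])        # add intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("estimated coefficients:", beta_hat.round(2))
```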
With a grasp of the distribution, we can locate specific individuals within it: each person's height, weight, blood pressure, and so on. The coupling of the distributions of multiple indicators forms a sequence, on which we can perform pattern recognition and so make meaningful diagnostic judgments. The objects in these sequences may not be independent of one another, so they can form pathways. For example, alongside the normal distribution there are other statistical distributions; for the measurement data of a specific person we can construct a sequence (the first row is the concentration of various defined indicators, such as the concentrations of various molecules; the second row is the position within the corresponding statistical distribution, such as 50%):
ABCDEFGHIJ
so we can operate on such a sequence. Based on prior experience, an action sequence for a defined patient is developed; the measurement data of a new patient can then be entered, and whether the patient has the disease is judged from the homology results of the sequence alignment. This matching operation can mine patterns from large-scale data so that they can be understood from a high-dimensional perspective. In fact, this is also how I have observed experienced doctors make diagnoses: they look for a change at a specific fixed point (the ability to infer the nature of the whole from the nature of a part) to determine the macroscopic disease. The more experience, the faster the convergence, that is, the more accurate and rapid the judgment. In my view this is operation at a high-dimensional level; for example, observing Reed-Sternberg cells in a pathological section can confirm a diagnosis of Hodgkin lymphoma with some certainty, which is a good method (also pattern recognition at the macroscopic level). It is just that now, in the era of the Internet, we can use the power of computers to represent these data as sequences and find specific patterns, such as characteristic changes, through certain algorithms. Essentially it is the same thing. Of course, we would need to build a very large database to achieve this; for now it is only a thought.
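A deliberately toy sketch of the idea just described, with every name, reference distribution, and scoring rule invented for illustration: each patient's measurements are converted to positions within assumed reference distributions, the result is treated as a sequence, and a crude similarity score stands in for a real sequence alignment.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical reference distributions (mean, sd) for a few indicators.
reference = {"glucose": (5.0, 0.8), "ldl": (3.0, 0.9), "crp": (2.0, 1.5)}

def to_percentile_sequence(measurements):
    """Map raw values to percentile positions within the reference distributions."""
    return np.array([
        norm.cdf(measurements[name], loc=mu, scale=sd)
        for name, (mu, sd) in reference.items()
    ])

# A stored "pattern" for a hypothetical disease, and a new patient's data.
disease_pattern = to_percentile_sequence({"glucose": 8.5, "ldl": 4.8, "crp": 6.0})
new_patient     = to_percentile_sequence({"glucose": 8.1, "ldl": 4.5, "crp": 5.0})

# A crude similarity score in place of a real sequence alignment.
similarity = 1.0 - np.abs(disease_pattern - new_patient).mean()
print("similarity to stored pattern:", round(similarity, 3))
```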
The shorter the path between sequences, the more reliable the relationship. For example, serum triglyceride level is related to the risk of coronary heart disease: the higher the triglyceride level, the greater the risk. But in fact both are related to cholesterol and HDL. Similarly, there is a positive correlation between the number of people who drown in summer and ice-cream sales, but the essential link is temperature. A longer path A-B-C-D-E-F may decay and converge rapidly, perhaps like 0.9^n, so that we observe only a very low correlation between A and F. Therefore, what we do observe must be the result of multi-path coupling; we can refer to decoherence and wave-function collapse in quantum physics, and the path we can observe is an equilibrium formed by a multi-level game, that is, a fixed point. This is the path we want to discover in research: it is like a statistical distribution, and can then be understood through the selective expression of these pathways in specific cases.
The network is a sufficiently high-dimensional space, so the continuous traversal of the properties of trigonometric functions can explain its various properties, not only the geometric relations formed between the nodes of the set (trigonometric is the most stable), but also the approximation of the Fourier series that trigonometric functions can form as linear independent quantities to arbitrary functions, and the relationship between trigonometric functions and exponential and imaginary numbers revealed by Euler's formula is the basis for network construction.
The relationship between different nodes uses the relatively high-dimensional concept of sequence to refer to a certain node relationship, which can be based on the relationship at different levels. Our goal is to establish more precise rules, i.e., underlying relationships. First of all, we need to construct certain quantities that can be calculated, such as various proportions as a kind of mapping. And the concrete relationship construction is the collapse of the path, which is the emergence of the network. Many of these relationships can be constructed using the ideas of calculus to construct relationships between objects of different dimensions. The idea of fixed point is the key to communicating between low and high dimensions.
The formulation of an equation is an accurate description of a relationship.
Geometry represents our human figurative thinking, so we have the concept of networks, and the algebraic approach of geometry enables us to operate on it, that is, we abstract functions from the concept of geometry and perform certain operations at this level. And for the relationship of the network, we abstract the concept of probability according to different proportions.
Calculus is essentially a new mode of thinking that communicates the relationships between different dimensional levels through infinite separation, allowing it to approximate the real situation with arbitrary precision.
The equilibrium concept of mechanics, the chemical equilibrium of chemical reactions, and the equilibrium reached in game theory are all demanded by the secondary structure and sequence of the network.
Considering the practical significance of the network, based on the mathematical description of the hierarchical interaction of organisms, it can at least have a good application in medical research.
Equivalence: the law of conservation of energy, the law of buoyancy (F = ρVg: the weight of the displaced volume of liquid equals the buoyant force on the floating body), the law of conservation of momentum.
The hierarchical self-similarity of the exponential, (e^(ax))' = a·e^(ax), is an important form of solution to differential equations. The various distributions corresponding to logarithms are also an important property of the fractal structure of the network, essentially understood as a kind of convergence, like the elimination and screening of natural selection.
To correspond to networks, we need a mathematical form as great as calculus, so what matters most is its powerful ability to simulate and correspond to the real world. It must therefore be built on the work of our predecessors. The object of operation selected for our network is the sequence, that is, a combination formed by certain nodes; at this level we study various relationships, identify unknown ones, and finally find the fixed-point relationships that we can use directly. One example is the idea of drug targets. This idea originated in bioinformatics; combined with other mathematical theories such as calculus (sums at low-dimensional levels can build up high-dimensional levels), game theory, Darwin's theory of natural selection, and the collapse of wave functions in quantum physics, it is also the basis for the selective expression of networks.
Then the sequence theory of the network is the analysis of the complex relationships between various objects, and finally forms the relationship sequence of fixed points. Among them, the problem of granularity must be paid attention to. Our ultimate basic relationship is the similarity matching of the sequences. Sequences can also be derived like functions, but they are based on the assumption of continuity that may not hold true in the network, i.e., the network is discrete.
The inverse operation of different coupling levels of the network is similar to the concept of preference.
Networks can be computed using paradoxical language, which is the method by which concrete paths are formed.
The addition of series terms is the addition of different levels of the sequence to approximate the real function; the Fourier series is one application. Terms of different orders represent relationships of different dimensions.
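An illustration of how partial sums of a Fourier series approximate a target function (here a square wave): terms of higher order add progressively finer detail.

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 1000)
square_wave = np.sign(np.sin(x))

def fourier_partial_sum(x, n_terms):
    # The Fourier series of a square wave uses only odd sine harmonics.
    k = np.arange(1, 2 * n_terms, 2)
    return (4 / np.pi) * np.sum(np.sin(np.outer(k, x)) / k[:, None], axis=0)

for n in (1, 5, 50):
    err = np.abs(fourier_partial_sum(x, n) - square_wave).mean()
    print(f"{n:>3} terms: mean error ≈ {err:.3f}")
```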
We collect data for analysis to finally obtain its eigens (the law of large numbers), and hope to form a certain high-probability connection with the higher-dimensional whole through the high-dimensional states formed by multiple eigens. Essentially, it is to construct the probability relationship between the combinations of different levels, and finally we can form the relationship between different levels, which provides the basis for our relationship prediction.
Randomization design is the use of large-scale data to emerge in a pattern, i.e., a fixed-point, high-probability path.
With the help of a series of algorithmic foundations from bioinformatics, we transfer these ideas to specific medical fields. First, the hidden Markov model, in which the various symptoms of a specific patient are based on a certain high-dimensional structure, that is, the probabilistic expression of a hidden probability-distribution matrix (the specific distributions of different parts of the body differ, and together they make up the complex situation of the organism). Of course, we must also consider that the relationships formed by the underlying proportions are relatively certain at the macro level, that is, the probabilistic connections between the resulting sequences are relatively strong: these are the various kinds of experience we already have. Then there is dynamic programming, which approximates the global optimum with local optima and then finds a possible path by backtracking, as in the sketch below. After that, the BLAST algorithm quickly looks for correspondences between small sequences, identifies the objects with the greater probability of interaction, and then extends them.
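A minimal Viterbi sketch for a tiny, entirely invented hidden Markov model: dynamic programming over log-probabilities finds the most probable hidden path, and backtracking recovers it, in the spirit of the local-to-global approximation described above.

```python
import numpy as np

# A tiny, invented HMM: two hidden states, three possible observations.
states = ["healthy", "ill"]
start  = np.log(np.array([0.7, 0.3]))
trans  = np.log(np.array([[0.8, 0.2],
                          [0.3, 0.7]]))
emit   = np.log(np.array([[0.6, 0.3, 0.1],    # P(obs | healthy)
                          [0.1, 0.4, 0.5]]))  # P(obs | ill)
obs = [0, 1, 2, 2, 1]                          # observed symptom codes

# Viterbi: dynamic programming over log-probabilities, then backtracking.
V = start + emit[:, obs[0]]
back = []
for o in obs[1:]:
    scores = V[:, None] + trans              # scores[i, j]: come from i, go to j
    back.append(scores.argmax(axis=0))
    V = scores.max(axis=0) + emit[:, o]

path = [int(V.argmax())]
for ptr in reversed(back):
    path.append(int(ptr[path[-1]]))
path.reverse()
print([states[i] for i in path])
```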
Some known studies are a kind of local-optimum approximation, which can be regarded as a basic algorithmic operation; the locally optimal path can then be iterated gradually under the operation of society's great computer. We cannot deny its effectiveness; we only need to improve the algorithm (greedy algorithms, bubble sort), because even small improvements can go a long way toward reducing society's losses. Current neural networks, simulated annealing, and genetic algorithms can provide good ideas.
Specific indicators are statistics and probability, such as different structures resulting from different treatments in different groups, which can form certain sequences, and we can draw various conclusions by comparing the occurrence probability of these sequences (and their possible hidden probability distribution matrix), such as A promoting/inhibiting the expression of B. Eventually, we need to unify these conclusions, i.e., form a certain sequence of the effects of AJWJRF (different treatments) on specific diseases.
We want to break through the existing assumptions - to validate statistical models, and no longer draw conclusions through complex calculations and scale comparisons of data, which is good, but too qualitative. We want to collect the probabilistic effects of different treatments on different indicators (quantum events, like atoms) and then form a Bayesian probability network, which can calculate the possible effects of different combinations in quantitative Bayesian formulas. That is, we need to collect information and analyze data from the database.
There is a high probability of connection between the underlying layers of different sequences, and according to the distribution, a high-dimensional structure, that is, a pathway, can be formed. How to introduce competitive games to form a certain equilibrium, that is, the final expression sequence? This can refer to the hidden Markov model, equilibrium is an observable sequence, and competitive game is a change of the probability distribution matrix of high dimensions.
The operation of the series may require the use of a regression model, of course, the measurement of the indicators is also in line with a certain distribution pattern, and the linear relationship that can be formed by the indicators at different levels is a fixed point relationship emergence. This is an overall level of expectation that tends to be maintained with a certain dimensional state. Of course, this expectation is an equilibrium reached by the overall level of the game, which can change with the changes in the external environment. This distribution is a Markov process.
This linear relationship is low-level and can form part of the relational connections of other objects by traversing, which is calculated by the Bayesian formula for the probabilistic connections between different objects. At the multivariate level, the relationship may be represented as a complex functional relationship, which is the path formation of the network, which is the traversal of different levels. Least Squares is a method that uses certain statistical indicators to obtain fixed point relations.
The individual's unique deviation from the eigencase is the result of a selective expression based on the Markov model, and we can approximate its variation through multivariate exploration.
The sequence has a certain distribution, which is determined by the power-law distribution nature of the network. The relationship between different objects is constructed by the high-dimensional structure of distribution, and the linear correlation is the underlying relationship, on which different complex nonlinear relations can be traversed, and we believe that the idea of Fourier series can be used to infinitely approximate this complex relationship through the selective expression of the game base.
Graph theory analysis of data, using the knowledge of various networks for statistical analysis, so as to make more accurate predictions of various situations.
The datasets formed by the various indicators of a disease can be classified by machine learning to help diagnose new patients. This rests on the assumption that enough classes can partition all the diseases in the world, and that we can quickly classify a patient's condition by comparing it against the characteristics of these diseases (the idea of a Fourier series: a linear combination of a set of basis elements can reach every point in space, and in theory enough classes can accurately define all the basic types in the world). We first rule out a range of possibilities and then gradually converge on the basis of further data collection, which is a process of Bayesian inference: a kind of pattern recognition that finally yields a limited set of possible diagnoses (patterns). We can even assign each a weight, i.e., a probability of occurrence, e.g. A 53%, B 41%, C 36%, D 26%. A more achievable example is the identification of a specific tumor type by a combination of gene expressions, which requires large-scale measurement (gene chips) and then extraction of meaningful patterns (the expression of specific gene combinations), such as fixed-point populations (Shinya Yamanaka's four transcription factors). This is the basis for further experiments: just as dynamic programming approximates a global optimum with local optima, we rely on the power-law distribution of the network to form a specific pattern, that is, a certain feature clustering.
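A hedged sketch of the classification idea on synthetic "indicator" data, with all names and numbers invented: a classifier is trained on labelled cases and, for a new patient, reports ranked class probabilities rather than a single hard diagnosis, echoing the "A 53%, B 41%" weighting above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)

# Synthetic training set: 300 patients, 5 indicators, 3 disease classes.
X = rng.normal(size=(300, 5))
centers = rng.normal(scale=2.0, size=(3, 5))
y = rng.integers(0, 3, size=300)
X += centers[y]                       # shift each patient toward its class center

clf = LogisticRegression(max_iter=1000).fit(X, y)

# For a new patient, report ranked diagnosis probabilities rather than
# a single hard label.
new_patient = centers[1] + rng.normal(scale=0.5, size=5)
probs = clf.predict_proba([new_patient])[0]
for label, p in sorted(enumerate(probs), key=lambda t: -t[1]):
    print(f"class {label}: {p:.0%}")
```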