Chapter 106 Introduction to Deep Learning - Andrew Ng Course Notes
Deep learning, i.e. a sufficiently multi-layered neural network, is essentially the construction of a classification function: complex, high-dimensional functions are built up from simple nonlinear units such as ReLU and the sigmoid, much as calculus builds complicated functions from elementary ones, so that complex tasks such as face recognition become possible. We are particularly interested in the possibility that processing large-scale biological data can mine out latent patterns, connect them to actual biological mechanisms, and eventually support the development of technologies such as PCR and CRISPR. The relationships among the many objects of a living organism can be understood as a multivariate function (just as housing prices depend on many factors), and the combinatorial interactions among those objects explode exponentially, so the complexity of organisms makes an exact analytical solution difficult or even impossible; statistical approximation is therefore a good idea. The hidden layers of a neural network can be regarded as such an approximation, and some parameters of a trained network may provide evidence about how these biological objects interact. Behind transfer learning lies the assumption that different complex phenomena share some universal mechanism, which is why low-level image features such as edges and colors are reusable (patterns of patterns). The extracted features may therefore have biological significance: for example, we can find which objects are most strongly correlated with one another, use that as a guide for biological research, understand the various interactions at a statistical level, and make better predictions about the overall effect of changing a specific object.
Biological research may require a paradigm shift in science: big-data-driven discovery, complemented by experiments, continuously improving human understanding of life. For example, unsupervised algorithms can be used to mine the patterns behind large-scale biological data, that is, the underlying biological mechanisms.
Producing a certain output from a certain input is the kind of function mapping we pursue; it may have real application value and is one way of understanding the world, and determining the parameters of a specific model is what pins down this black-box mapping. Moreover, such a mapping is general-purpose, which can advance systems of social collaboration. Intelligent diagnosis, for example, can greatly reduce the burden on doctors and even push the technology itself forward. In medical imaging, images labeled by experts with their diagnoses are used to find relevant features through deep learning; after training, a correspondence can be drawn between the features experts rely on (the basis for their diagnoses) and the features the model has learned, so that different images can be classified with a certain accuracy. This learning process is much like how an expert teaches a novice, and as data accumulates and computing power grows, these models perform better and better, with higher accuracy and lower error rates, until they approach a theoretical limit. The accuracy of image classification in the ImageNet competition, for example, has surpassed that of humans. Troika: Data + Computing Power + Algorithms.
To improve continuously in the spirit of experimentation, one needs good controls: building on existing work with innovative ideas such as combining different algorithms or going beyond the original assumptions, and demonstrating one's view through changes in the relevant metrics. Insights from disciplines such as neuroscience and developmental biology can be transferred to machine learning, for example Hubel's discovery that individual neurons recognize only a limited set of features while functional columns of the visual cortex can recognize complex things; further, more complex constructs such as emotions might be built up through combinations of neural circuits. Development offers its own inspirations, such as mosaic development versus gradient-driven development: there is internal programming, and there is adjustment according to the environment. The working mechanisms of the brain and of evolutionary development may both be the result of some underlying principle, such as energy minimization. Physics provides inspirations as well, such as the equivalence of Heisenberg's matrix mechanics and Schrödinger's wave mechanics, and perhaps our large-scale matrix operations could likewise be realized through quantum-mechanical experiments.
In essence, we analyze complex objects by extracting their data, extracting various features, and rebuilding a model on that basis, then adjust the model according to its gap with reality (for example, parameter adjustment by the backpropagation algorithm) until it is close enough to reality to be regarded as equivalent (the idea of mathematical analysis). We need the trained model to be robust and generalizable, and the various hyperparameters are tuned so that training converges toward some extremum. Mathematics is very important for developing algorithms: calculus, linear algebra, probability theory, and statistics. So is computer-science knowledge: programming languages, data structures, algorithms, computer organization, databases, and so on.
Data needs to be well organized, i.e., structured, so that it can be computed on effectively. The representation of the data is very important: categorical values can be one-hot encoded, and a softmax mapping can be used to produce the final classification.
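As a minimal sketch (assuming NumPy; the function names `one_hot` and `softmax` are illustrative, not from the course code), one-hot encoding and the softmax mapping can look like this:

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class labels as one-hot rows."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1
    return encoded

def softmax(z):
    """Map a vector of scores to a probability distribution."""
    exp_z = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return exp_z / exp_z.sum()

print(one_hot(np.array([0, 2, 1]), num_classes=3))
print(softmax(np.array([2.0, 1.0, 0.1])))   # largest score gets the highest probability
```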
Neural Network Fundamentals:
From the simplest binary (1/0) classifiers such as logistic regression, more complex classifications are constructed by combining these classifiers.
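A minimal sketch of the logistic-regression forward pass, assuming NumPy and illustrative variable names:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    """Squash the linear score w.x + b into (0, 1)."""
    return sigmoid(np.dot(w, x) + b)

w, b = np.array([0.5, -0.3]), 0.1
x = np.array([1.0, 2.0])
p = predict(w, b, x)            # probability of the positive class
label = 1 if p > 0.5 else 0     # threshold at 0.5 for the 1/0 decision
```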
Constructing a function requires choosing a model and then determining its parameters: the former selects the model family, the latter pins down a specific model. This can be understood as a planning problem under a fixed budget of computing resources, where we hope to converge to an optimum. As with search, randomly traversing the whole space is impossible, so optimization techniques such as pruning are needed. Hence a constraint is introduced, the loss function, which can be, for example, the sum of residuals; it is essentially a construction, and a well-chosen loss lets gradient descent converge faster when determining the specific parameter values. When the cost function reaches an extremum, the corresponding parameters are the best ones, i.e., the algorithm has converged.
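One concrete choice of cost function is the average cross-entropy used for logistic regression in the course; the sketch below assumes NumPy and is only illustrative:

```python
import numpy as np

def cross_entropy_cost(y_hat, y):
    """J = -(1/m) * sum(y*log(y_hat) + (1-y)*log(1-y_hat))."""
    eps = 1e-12                              # avoid log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y     = np.array([1, 0, 1])                  # true labels
y_hat = np.array([0.9, 0.2, 0.7])            # predicted probabilities
print(cross_entropy_cost(y_hat, y))          # small value = predictions close to labels
```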
The gradient descent method requires the partial derivative of the loss function with respect to a particular variable, as well as a learning rate a, e.g. w = w - a*dJ(w)/dw. To some extent the derivative can be understood through the difference quotient (f(x+h) - f(x))/h: as long as h is small enough it approaches the limit, i.e. the derivative, which can be understood as the slope of the curve. Theoretical mathematics allows the infinitely small (continuous variation), but a computer's implementation is finite (discrete mathematics), so we can only take a value that is small enough, and as long as the error is below a certain threshold we accept it. This is consistent with the idea of a statistical p-value.
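A sketch of the update rule and the difference quotient on a toy cost J(w) = (w - 3)^2 (an assumption for illustration, not the course's loss):

```python
def J(w):
    return (w - 3.0) ** 2            # toy cost with its minimum at w = 3

def dJ_dw(w):
    return 2.0 * (w - 3.0)           # analytic derivative

def numerical_derivative(f, w, h=1e-6):
    return (f(w + h) - f(w)) / h     # difference quotient with a small but finite h

w, a = 0.0, 0.1                      # initial parameter and learning rate
for _ in range(100):
    w = w - a * dJ_dw(w)             # gradient descent step: w = w - a*dJ(w)/dw

print(w)                                     # close to 3.0
print(dJ_dw(w), numerical_derivative(J, w))  # the two derivative estimates agree
```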
The construction of a multivariable function may be a composition of different functions, such as J(a,b,c) = 3*(a + b*c), so the chain rule is needed to find partial derivatives. In general we can only observe the effect of small changes in each variable on the others, which makes it possible to infer from data how the variables are organized, i.e. the specific structure of the function. Decoupling with the idea of modularity and decomposing complex functions into module functions is the divide-and-conquer idea of computer science. With enough variables it is theoretically possible to construct any continuous function, but we must consider the cost of computation and usually can only find a local optimum. In the spirit of optimization there is, in theory, a set of parameters that gives the algorithm its best performance, and we keep approaching it, hence the many combinations of parameters and models: from the hidden layers of simple neural networks growing ever deeper, to convolutional neural networks, and so on.
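A computation-graph sketch for J(a,b,c) = 3(a + bc) makes the chain rule concrete; the values a = 5, b = 3, c = 2 are just for illustration:

```python
# Forward pass through the graph J = 3 * (a + b*c)
a, b, c = 5.0, 3.0, 2.0
u = b * c          # u = bc
v = a + u          # v = a + u
J = 3 * v          # J = 3v

# Backward pass: the chain rule propagates dJ through each node
dJ_dv = 3.0                # J = 3v
dJ_da = dJ_dv * 1.0        # v = a + u  ->  dv/da = 1
dJ_du = dJ_dv * 1.0        # v = a + u  ->  dv/du = 1
dJ_db = dJ_du * c          # u = b*c    ->  du/db = c
dJ_dc = dJ_du * b          # u = b*c    ->  du/dc = b

print(J, dJ_da, dJ_db, dJ_dc)   # 33.0, 3.0, 6.0, 9.0
```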
Define the various data with the ideas of linear algebra, for example representing a vector as a one-dimensional array, so that specific information can be mapped into a high-dimensional space and relationships explored on that basis; this can be understood in terms of representation. For example, solving a system of linear equations corresponds to a real planning problem. In actual programming, because computers are discrete, these linear-algebra objects such as vectors and matrices are used to construct the function mappings, while quantities such as the learning rate of gradient descent are constants we must define ourselves.
Abstracting real objects into matrices and applying various transformations, we can find similarities between matrices and hence between the corresponding real objects. Matrices support various operations: addition and subtraction, element-wise and matrix products, inversion, transposition, and diagonalization to find eigenvalues. A typical example is that images are stored as matrices of pixels.
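A minimal NumPy sketch of these representations and operations (the "image" here is just a small random pixel array for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])          # a vector as a one-dimensional array

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

print(A + B)                           # element-wise addition
print(A @ B)                           # matrix product
print(A.T)                             # transpose
print(np.linalg.inv(A))                # inverse
eigenvalues, eigenvectors = np.linalg.eig(A)   # diagonalization / eigenvalues

image = np.random.randint(0, 256, size=(4, 4))  # a grayscale image is just a matrix of pixels
print(image)
```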
Vectorization: using the numpy library for computation reduces computation time, because matrix operations are heavily optimized; vectorization speeds up the computation. After all, in many cases we have to deal with sparse matrices, which are computationally intensive and time-consuming, and we can only use fast approximation algorithms to obtain an acceptable result.
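The classic comparison from the course, sketched here with NumPy and the standard time module, shows why vectorized code is preferred (exact timings depend on the machine):

```python
import time
import numpy as np

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# Explicit Python loop
start = time.time()
total = 0.0
for i in range(n):
    total += a[i] * b[i]
loop_time = time.time() - start

# Vectorized dot product
start = time.time()
total_vec = np.dot(a, b)
vec_time = time.time() - start

print(total, total_vec)         # same result up to floating-point rounding
print(loop_time, vec_time)      # the vectorized call is usually far faster
```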
Visualization is one of the ways we make sense of data, and Python's Matplotlib library provides many tools for it.
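A minimal Matplotlib sketch, plotting the sigmoid curve discussed below:

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-10, 10, 200)
plt.plot(z, 1.0 / (1.0 + np.exp(-z)))   # the sigmoid curve
plt.xlabel("z")
plt.ylabel("sigmoid(z)")
plt.title("Sigmoid activation")
plt.show()
```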
Shallow neural network: the simplest model is f(x) = Σᵢ wᵢxᵢ + b. The inputs xᵢ pass through the function mapping f of the hidden layer in the middle to produce an output f(x), which can then be processed by an activation function such as softmax to obtain a specific classification. Nesting hidden layers is the same thing as constructing composite functions. Theoretically, by adjusting the parameters we can construct a function that meets arbitrary needs, such as mapping image input to specific text output, i.e. image recognition. Of course, shallow neural networks do not perform that well and more complex network structures are required, but the performance of deep learning rests on the simple behavior of shallow networks, which can be regarded as the elementary functions of calculus. The hidden layers of a neural network correspond to features of different objects, and recognition at a high-dimensional level is finally realized by adjusting parameters, corresponding to linear combinations of basis vectors in linear algebra (feature = basis). The whole thing can therefore be abstracted into large-scale matrix operations of the form z^[l] = W^[l] a^[l-1] + b^[l]. There are good reasons to represent data as vectors and matrices: it matches how computers store and operate on data, and the many optimizations of matrix computation cut down the amount of large-scale calculation, so that satisfactory results can be obtained within a bearable cost in time and space.
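A sketch of the forward pass of a one-hidden-layer network in this vectorized form; the layer sizes and the tanh/sigmoid choices are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, n_h, n_y, m = 3, 4, 1, 5                 # input size, hidden units, outputs, examples
rng = np.random.default_rng(0)

W1 = rng.standard_normal((n_h, n_x)) * 0.01   # layer-1 weights
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((n_y, n_h)) * 0.01   # layer-2 weights
b2 = np.zeros((n_y, 1))

X = rng.standard_normal((n_x, m))             # each column is one example

Z1 = W1 @ X + b1                              # z[1] = W[1] x + b[1]
A1 = np.tanh(Z1)                              # hidden-layer activation
Z2 = W2 @ A1 + b2                             # z[2] = W[2] a[1] + b[2]
A2 = sigmoid(Z2)                              # output probabilities
print(A2.shape)                               # (1, m): one prediction per example
```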
The activation function is non-linear, which lets us realize the functions we need; after all, the world is inherently complex and can only be approximated linearly, and nonlinear functions allow a better approximation of reality. Common choices are the sigmoid function 1/(1+e^-z), the ReLU function max(0, z), the tanh function (e^z - e^-z)/(e^z + e^-z), and the leaky ReLU function max(0.001z, z).
The activation function can also be differentiated, and gradient descent can then be used to find the values that minimize the loss function.
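A sketch of the activation functions above and their derivatives, as needed for gradient descent; the leaky-ReLU slope defaults to the commonly used 0.01 here, which is an assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)                  # d/dz sigmoid(z) = s(1 - s)

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2          # d/dz tanh(z) = 1 - tanh(z)^2

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z > 0).astype(float)        # 0 for z < 0, 1 for z > 0

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

def leaky_relu_prime(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)
```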
The intuition of backpropagation: according to the derivative of the loss function, the parameters of the hidden layers are modified in reverse, like a feedback loop. The initial input produces a certain value of the loss function, gradient descent on that loss adjusts the parameters of the preceding hidden layers, and repeating the cycle makes the loss as small as possible while each parameter converges to a specific value.
From batch to stochastic (random) processing, the parameter values are changed step by step, from initialization to parameter update, e.g. w = w - a*dw; the initial value of w can be random, and the learning rate a can also change over time.
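Putting backpropagation, initialization, and the update rule together for logistic regression on a synthetic batch (a sketch with assumed sizes and data, not the course's assignment code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n = 200, 2
X = rng.standard_normal((n, m))                      # features: one column per example
y = (X[0] + X[1] > 0).astype(float).reshape(1, m)    # synthetic labels

w = rng.standard_normal((n, 1)) * 0.01               # random initialization
b, a = 0.0, 0.5                                      # bias and learning rate

for _ in range(1000):
    y_hat = sigmoid(w.T @ X + b)      # forward pass over the whole batch
    dz = y_hat - y                    # gradient of the cross-entropy loss w.r.t. z
    dw = (X @ dz.T) / m               # average gradient over the batch
    db = np.mean(dz)
    w = w - a * dw                    # parameter update w = w - a*dw
    b = b - a * db

accuracy = np.mean((sigmoid(w.T @ X + b) > 0.5) == y)
print(accuracy)                       # close to 1.0 on this separable toy data
```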
Enough matrix operations can make the parameters of the model converge, i.e. train an algorithm that can actually be used. The increase in computing power brought by GPUs is therefore the basis for algorithms to play an important role, and large-scale data makes the computation more meaningful.
If we suddenly became seriously ill, how could we use technology to change our fate? We can model the disease: using various characteristics such as routine blood, urine and stool tests, other laboratory tests, medical imaging, pathological examinations, and so on, we can construct a complex neural network model and train it on structured data that has already been organized, that is, disease diagnoses paired with the corresponding matrices of indicators, so that the model can diagnose the disease. Then, on the one hand, we can continue to learn the correspondence between mature therapies and diseases; on the other hand, we can construct new therapies, for example by identifying the indicators that need to change for a particular disease and the treatments known to affect those indicators, so that after training we can obtain different therapies for each person's situation, which might save lives in moments of despair.
Deep Neural Networks: more complex models built by stacking shallow neural networks can handle more complex situations with higher accuracy. Essentially this is still constructing a function between an input and a desired output. The multiple hidden layers correspond to different features, enabling the construction of more complex functions, i.e. objects corresponding to selective combinations of features. A deep enough network can construct a great many features, and this construction of a high-dimensional space can, in theory, correspond well to the structure of reality.
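A sketch of stacking the same forward step z^[l] = W^[l] a^[l-1] + b^[l] over several layers, with ReLU in the hidden layers and sigmoid at the output; the layer sizes are arbitrary assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

layer_sizes = [5, 4, 4, 3, 1]              # input layer, three hidden layers, output layer
rng = np.random.default_rng(0)

params = {}
for l in range(1, len(layer_sizes)):
    params[f"W{l}"] = rng.standard_normal((layer_sizes[l], layer_sizes[l - 1])) * 0.01
    params[f"b{l}"] = np.zeros((layer_sizes[l], 1))

def forward(X, params, L):
    A = X
    for l in range(1, L):                  # hidden layers use ReLU
        A = relu(params[f"W{l}"] @ A + params[f"b{l}"])
    return sigmoid(params[f"W{L}"] @ A + params[f"b{L}"])   # output layer

X = rng.standard_normal((5, 10))           # 10 examples with 5 features each
print(forward(X, params, L=len(layer_sizes) - 1).shape)     # (1, 10)
```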
The adjustment and training of parameters is the focus of modeling work and also the most computationally expensive step. Finding the parameter set that makes the algorithm perform best requires a great deal of experimentation, and there is a certain element of luck. Hyperparameters: learning rate, number of hidden layers, number of neurons per layer, and so on.
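One simple way to keep these choices explicit, as a sketch with arbitrary values rather than recommendations:

```python
# Hyperparameters are set before training and tuned by experiment,
# unlike the weights and biases, which are learned by gradient descent.
hyperparams = {
    "learning_rate": 0.01,      # step size a in w = w - a*dw
    "num_hidden_layers": 3,     # depth of the network
    "units_per_layer": 64,      # neurons in each hidden layer
    "num_iterations": 1000,     # number of gradient-descent steps
}

# A small grid of candidates to try when searching for a better combination.
candidate_learning_rates = [0.001, 0.01, 0.1]
```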
A hidden layer can also be encapsulated as a module block, i.e. a specific set of features, and these trained, converged parameters can be transferred to learning in different domains.
Deep learning and the working mechanisms of the brain may share the same kind of system, which is essentially computational; constructing the intermediate function between specific inputs and outputs is a kind of deconstruction and re-implementation of the black box.