Chapter 107 Convolutional Neural Networks and Visual Recognition - Stanford CS231n Course Notes
Deep learning and neural-network algorithms for computer vision loosely mimic the architecture of the brain, which in turn may help explain how the real brain works, i.e., neurocognition. Complex, incomprehensible raw data is abstracted into information we can understand; visual data such as images and video, in particular, needs to be organized. The framework of neural networks makes the recognition problem tractable.
Visual information is stored as pixels, i.e., in the format of a matrix. The next problem is the combinatorial explosion of visual information, so we must abstract continually: tagging, classification, indexing, and so on, which is in fact the mechanism by which we understand this information. Through repeated ascent to higher levels of abstraction, various complex functions are achieved.
The formation and evolution of the senses can be understood from an evolutionary point of view. The emergence of vision may have been one trigger of the Cambrian explosion: organisms could extract far more information about the world, which helps explain the burst of diversity. Starting from the appearance of the simplest eyes, under survival pressure and genetic drift most organisms acquired this novel structure, and on that basis evolution proceeded like an arms race, patching and tinkering until the varied eyes of present-day creatures emerged. Notably, the visual cortex sits at a considerable distance from the eye.
The primary visual cortex is the first stage of visual processing and handles large amounts of visual information. Hubel and Wiesel found that the content of a static picture did not activate the neurons they recorded, while the act of changing slides (a moving edge) did, because the cortex is sensitive to edges; different groups of neurons respond to different features, i.e., our neurons respond selectively to features. The first step is therefore to extract features such as shapes, edges, and orientations, which activate neurons in specific tissue structures; by combining these features we can understand higher-level results, such as recognizing a human face in a picture (building structure from edges, ascending from low level to high, much as the fundamental theorem of calculus connects local to global). There is also the linear-algebra idea: the world can be decomposed into linearly independent bases, and everything else is a selective combination of them, a bottom-up construction. Deep learning, then, uses the complex mappings built by multi-layer neural networks to recognize complex inputs.
Vision, then, should be layered, with the first layer detecting edges and so on, until the representation is lifted to a space abstract enough to cover a variety of situations. This inspires us to build learning in levels, like the hidden layers of a neural network (feature learning); that is the modeling idea.
Segmentation and grouping (e.g., in face recognition) is the first step in understanding an image; recognition is a higher level, such as inferring the whole from features, the way seeing a tiger's stripes lets us infer the tiger. Patterns learned this way can then guide recognition in new, lower-level contexts.
Object recognition requires a common evaluation benchmark to compare the performance of different algorithms, such as PASCAL, and later Fei-Fei Li's ImageNet dataset. Training a convolutional neural network (CNN) essentially proposes a function that classifies with a certain accuracy and is then constructed by adjusting parameters (like a series expansion, a linear combination of features; more layers allow higher accuracy). This requires training on large-scale data and hardware advances such as GPUs.
Beyond image classification there is object detection, that is, understanding pictures further on the basis of classification, for example describing them in language humans can understand. Deeper understanding requires a selective combination of what we already have. Of course, a comparison must be made with the collective human mind, because the individual always has limitations. We believe classification means finding certain high-dimensional patterns that let us recognize the same object in different low-dimensional situations, such as cats in various states (sleeping, playing, being petted); the existence of such features is essentially an existence proof for the classifying function.
Convolutional neural networks: high-performance hardware (GPUs) runs larger models on large amounts of data. This enabled further development beyond LeCun's early networks, such as AlexNet, VGG, etc.
The process of building a neural-network algorithm: 1. set up the project and import the toolkits (sklearn, etc.); 2. import the dataset and convert it into a suitable data structure (e.g., flatten each n×n image array into a one-dimensional array); 3. set the network's parameters, such as the number of hidden layers, the learning rate, and the number of neurons per layer; 4. evaluate the algorithm's performance with metrics such as accuracy and recall.
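A minimal sketch of steps 1–4 above, using scikit-learn's MLPClassifier on its built-in digits dataset; the dataset choice and all hyperparameter values here are illustrative assumptions, not prescriptions:

```python
# Sketch of steps 1-4: toolkit, data, network parameters, evaluation.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, recall_score

# 2. Load the dataset; each 8x8 image comes flattened as a 64-dim vector.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 3. Set network parameters: hidden layers, neurons per layer, learning rate.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), learning_rate_init=0.01,
                    max_iter=300, random_state=0)
clf.fit(X_train, y_train)

# 4. Evaluate with accuracy and recall.
y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("macro recall:", recall_score(y_test, y_pred, average="macro"))
```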
A neural-network algorithm essentially constructs a function that accomplishes our ideal goal: we assume it exists, then pin it down through the properties we need. The hidden layers can be understood as a series expansion, and the weight of a specific neuron as the coefficient of a specific term/feature: F = ∑ᵢ wᵢxᵢ, where wᵢ is the weight and xᵢ the feature. In theory we could approximate the target function by combining enough terms, but that costs too much computation for limited accuracy; the linear-algebra idea is to compose complex mappings instead.
The backpropagation algorithm of neural networks is an application of the chain rule: first construct a complex multivariate function, then determine each simple local relationship, i.e., the gradient, by differentiating step by step. The function we finally construct is the result of these gradients converging.
Sigmoid function: σ(x) = 1/(1 + e^(−x))
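A small illustration of the sigmoid and the chain-rule bookkeeping from the previous note; the input values are arbitrary:

```python
import numpy as np

def sigmoid(x):
    """Squashing nonlinearity: sigma(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

# A convenient property used in backpropagation:
#   d sigma / dx = sigma(x) * (1 - sigma(x)),
# so the local gradient is computed from the forward output alone.
x = np.array([-2.0, 0.0, 2.0])
s = sigmoid(x)
local_grad = s * (1 - s)   # gradient of sigmoid w.r.t. its input
# By the chain rule, an upstream gradient dL/ds becomes
#   dL/dx = dL/ds * s * (1 - s).
```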
2017/8/9
Neural-network algorithms can be applied to problems in very different domains, suggesting they may be some kind of underlying computing mechanism. They form a general system, like a Turing machine, which given large-scale data inputs produces outputs that match our requirements, such as image classification. The many parameters of a neural-network algorithm correspond to fitting a function, and we believe there are particular parameter settings that are meaningful (analogous to energy minimization in biology). Learning more parameters allows recognition of higher-level features, consistent with building the global from the local as in the fundamental theorem of calculus: from low-level features such as edges and color up to the recognition of complex objects, e.g., face recognition. Determining the parameters is an optimization idea, like energy minimizing and converging to the most stable result. It can also be understood as a Taylor-series expansion: determining the parameters is determining the coefficients of the series, and once the parameters approximate the target function closely enough, we regard the optimum as reached (via gradient descent). The terms of this series are exactly the features of feature engineering.
When we train a model, we are constructing a specific function that realizes the desired correspondence between particular inputs and outputs.
Training the parameters is a large-scale matrix computation. An important problem is that, due to power-law distributions, meaningful data is in the minority, i.e., the matrices are sparse, so compression, dimensionality reduction, and the like are needed to make full use of computing resources. As in search, optimization measures such as depth-first/breadth-first strategies and pruning of branches can greatly reduce the search space. To get by with fewer parameters, as in PCA, we need to find the most relevant parameters/features and combine them linearly.
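A minimal scipy illustration of the sparse-matrix point: only the nonzero entries are stored, and operations skip the zeros (shapes and values are arbitrary):

```python
import numpy as np
from scipy import sparse

# A matrix where meaningful entries are the minority is stored efficiently
# in a sparse format: only nonzero values and their coordinates are kept.
dense = np.zeros((1000, 1000))
dense[0, 1] = 3.0
dense[5, 200] = 7.0
csr = sparse.csr_matrix(dense)
print(csr.nnz, "nonzeros out of", dense.size, "entries")

# Matrix-vector products then skip all the zero entries.
v = np.ones(1000)
result = csr @ v
```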
Transfer learning means taking the parameters of an already-trained model and applying them in a different domain, so that we need not train such large-scale parameters from scratch for a new scenario, reducing computation. The underlying mechanism may be that a large number of features are generic: the trained model has already constructed some meaningful functions.
Teamwork may also correspond to a certain computational mechanism that can achieve more ambitious goals.
From DistBelief to TensorFlow: open machine-learning systems, with more collaboration, sharing, and common infrastructure, have let people focus on generating ideas rather than on implementation (cross-platform, cross-device).
The output can also be fed back as the input for the next step, like a feedback system, giving a spiraling process that identifies features at ever higher levels. Essentially, that is learning.
Data-Driven Image Classification: Linear Classifiers (Non-Explicit Programming)
It is necessary to define specific metrics; for distance there are the Euclidean distance, the Manhattan distance, etc., which underlie the k-nearest-neighbor (KNN) classifier. (Note: the iterative procedure described next is actually k-means cluster analysis, not KNN.) 1. randomly generate n centers; 2. compute distances and assign each point to its nearest center; 3. update the centers; 4. keep recomputing distances until convergence, producing n clusters.
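A minimal numpy sketch of the four k-means steps just listed (no handling of empty clusters; the initialization and convergence test are simple illustrative choices):

```python
import numpy as np

def kmeans(X, n, iters=100, seed=0):
    """Minimal k-means: the four steps described above."""
    rng = np.random.default_rng(seed)
    # 1. Randomly choose n data points as the initial centers.
    centers = X[rng.choice(len(X), n, replace=False)]
    for _ in range(iters):
        # 2. Compute Euclidean distances; assign each point to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # 3. Update each center to the mean of its assigned points.
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(n)])
        # 4. Stop when the centers no longer move (convergence).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```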
Model-based/statistical programming constructs the function f(x, W, b) = Wx + b, where W is the weights and x the concrete object; by summing such terms we can construct a series ∑ᵢ wᵢxᵢ to approximate a specific function, thereby building a connection between the pixel matrix of an image (viewed as a high-dimensional object) and a specific noun/category. It can be understood as constructing a classification surface in the high-dimensional space of images; as long as the dimension is high enough, such a division can be achieved, in the same spirit as the fundamental theorem of calculus.
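A sketch of the linear score function f(x, W, b) = Wx + b in numpy; the CIFAR-10-like shapes are assumptions for illustration:

```python
import numpy as np

# Score function of a linear classifier: f(x, W, b) = W x + b.
# Shapes are illustrative: 10 classes, 32*32*3 = 3072 flattened pixels.
num_classes, dim = 10, 3072
W = np.random.randn(num_classes, dim) * 0.01   # one row of weights per class
b = np.zeros(num_classes)

x = np.random.rand(dim)       # a flattened image
scores = W @ x + b            # one score per class
predicted_class = scores.argmax()
```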
Loss function and optimization
A loss function (objective function) is constructed that quantitatively measures the quality of the current classifier, so that it can be further improved. As in planning problems, constraints are necessary; a high loss indicates a poor classification, and the parameters need further adjustment.
As a classifier layer, the Softmax function's scores serve on the one hand as a measure of confidence for each class, and on the other hand can be read as the probability of each class, which can feed further calculation such as Bayesian inference.
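A minimal softmax in numpy; subtracting the maximum score first is the standard trick to keep exp from overflowing:

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1.
    Subtracting the max first avoids overflow in exp (numerical stability)."""
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

p = softmax(np.array([3.0, 1.0, 0.2]))
# p[i] reads as the probability assigned to class i; the cross-entropy
# loss for the true class c is then -log(p[c]).
```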
Gradient descent method: the derivative df(x)/dx = lim(h→0) [f(x+h) − f(x)]/h can be approximated numerically; following the slope in small steps moves toward an extremum, the point where the loss function is minimized.
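A sketch of both ideas: a numerical gradient built from the difference quotient above, and a gradient-descent loop that follows it downhill (the toy loss and learning rate are illustrative):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Approximate df/dx_i with the difference quotient [f(x+h) - f(x)] / h."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x[i]
        x[i] = old + h
        fxh = f(x)
        x[i] = old
        grad[i] = (fxh - f(x)) / h
    return grad

# Gradient descent: repeatedly step against the gradient to reduce the loss.
f = lambda x: np.sum(x ** 2)        # a toy convex loss
x = np.array([3.0, -2.0])
learning_rate = 0.1
for _ in range(100):
    x -= learning_rate * numerical_gradient(f, x)
# x converges toward the minimizer [0, 0].
```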
A concrete object can be decomposed into a weighted sum of features; this is the linear-algebra idea in which a selective combination of linearly independent bases yields the concrete vector/object, i.e., the data we collect. Specific features can be constructed statistically.
Neural network training: define a function to add each neural layer. 1. Prepare the training data; 2. define the nodes that receive the data; 3. define the neural layers, i.e., the hidden layer(s) and the prediction layer; 4. define the loss function; 5. select an optimization method to minimize the loss.
Chain rule and backpropagation: the effect of intermediate variables on the loss function. Every parameter update requires one forward pass and one backward pass.
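A minimal sketch of one such cycle on a tiny two-layer network: forward pass, backward pass via the chain rule, then a parameter update; all shapes and the learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy shapes: 4 inputs -> 5 hidden units -> 1 output.
W1, b1 = rng.standard_normal((5, 4)) * 0.1, np.zeros(5)
W2, b2 = rng.standard_normal((1, 5)) * 0.1, np.zeros(1)
x, y = rng.standard_normal(4), np.array([1.0])
lr = 0.05

for step in range(100):
    # Forward pass: each layer computes sum(w_i * x_i) + b, then a nonlinearity.
    h = np.maximum(0, W1 @ x + b1)        # hidden layer with ReLU
    y_hat = W2 @ h + b2                   # prediction layer
    loss = 0.5 * np.sum((y_hat - y) ** 2) # squared-error loss

    # Backward pass: chain rule, from the loss back through each intermediate.
    d_yhat = y_hat - y                    # dL/dy_hat
    dW2 = np.outer(d_yhat, h)
    db2 = d_yhat
    dh = W2.T @ d_yhat
    dh[h <= 0] = 0                        # gradient flows only where ReLU fired
    dW1 = np.outer(dh, x)
    db1 = dh

    # Update every parameter using its gradient.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```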
Selection of hyperparameters: number of hidden layers, number of neurons per hidden layer, learning rate, step size, dropout probability (random deactivation), etc.
The multiple hidden layers correspond to intermediate functions, which in turn correspond to the final features.
Dropout: neurons are deactivated at random with a certain probability, i.e., the network is slimmed down. This random deactivation reduces the network's size and computation, and it also makes the network harder to overfit.
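A sketch of (inverted) dropout as commonly implemented; units are dropped uniformly at random, and p = 0.5 is just a common illustrative value:

```python
import numpy as np

def dropout(h, p=0.5, train=True):
    """Inverted dropout: at training time, zero each activation with
    probability p and rescale the survivors by 1/(1-p); at test time,
    pass activations through unchanged."""
    if not train:
        return h
    mask = (np.random.rand(*h.shape) >= p) / (1 - p)
    return h * mask
```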
Imagine a developmental machine-learning algorithm, where new data inputs create intelligence the way swarm intelligence or biological development does, so that an intrinsic coding mechanism is expressed according to the specific environment/data, like the creation of life.
Convolutional neural networks
The perception mechanism of the visual cortex has a hierarchical architecture; deep learning stacks multiple network layers in an analogous way. The convolution operation can be understood as a kind of data compression, which makes larger volumes of data computable. Transfer learning may amount to reusing encapsulated parameter layers, corresponding to fixed architecture in the brain (biological evolution proceeds by patch-like updates), which greatly reduces the computation needed to train on new objects.
A filter's size is itself a parameter (e.g., 5×5×3), and the convolution operation is a dot product between the filter and each image patch. A pooling layer keeps the more important information (a form of sampling). The result passes through an activation function such as ReLU (applied to the neuron's computation ∑ᵢ wᵢxᵢ), and these layers alternate to build up features. A final fully connected layer maps to the final classification.
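A naive numpy sketch of a single-filter convolution: slide the 5×5×3 filter over the image and take a dot product at each position, then apply ReLU; the sizes follow the example above:

```python
import numpy as np

def conv_single_filter(x, w, b=0.0):
    """Naive valid convolution: slide one filter over the input and take
    the dot product at every position. x: (H, W, C), w: (f, f, C)."""
    H, W, C = x.shape
    f = w.shape[0]
    out = np.zeros((H - f + 1, W - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i+f, j:j+f, :]
            out[i, j] = np.sum(patch * w) + b   # dot product of patch and filter
    return out

# Example: a 5x5x3 filter applied to a 32x32x3 image.
x = np.random.rand(32, 32, 3)
w = np.random.rand(5, 5, 3)
activation_map = np.maximum(0, conv_single_filter(x, w))  # ReLU after the conv
print(activation_map.shape)   # (28, 28)
```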
Beyond classification, detection goes further: dividing the image and enclosing objects of a particular class in boxes. This is a finer-grained computation, because the possible segmentations are nearly unlimited; prediction can be understood as search over a very large space, so a neural-network algorithm is again needed to construct this connection, i.e., localization. The idea is to find regions with higher likelihood to divide, via specially constructed matches (region proposals).
In the R-CNN algorithm, a set of candidate boxes (roughly 2000 region proposals) is constructed first; the boxes more likely to contain an object are then selected and classified by a CNN, and the results are refined iteratively. To some extent this can be seen as a neural network stacked on a neural network; so-called deep learning is the use of deeper networks to realize more complex functions.
Visualization and further understanding of convolutional neural networks
The hidden layers in the middle correspond to the recognition of certain features; training can locate the neurons activated by specific images, and the weights of these neurons correspond to a selective combination of some basis. However, these hidden layers can only be visualized at particular levels, just as pathology slides reveal lesions in the body only in specific regions where the features are strongest. These levels are, so to speak, the fixed points of the mathematics.
By extracting features statistically, we can discover knowledge we did not already have, and the combination of these features may rise to a level we can understand, such as face recognition; machine learning can also keep ascending to still higher-level features that we cannot interpret. Features correspond to certain patterns of neuronal activation, much as human stereotypes form, or a clever child's suspicions form. Such a characteristic, like the style of a picture, can be transferred (style transfer).
Recurrent Neural Networks (RNN):
In a Markov model this is a transition between states; a transition probability matrix can be estimated statistically, from which a sequence can be generated. Bioinformatics uses Markov models to distinguish coding from non-coding regions, also by computing these probabilities.
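A minimal sketch of such a chain: a hand-made transition probability matrix (the values are invented for illustration) and sampling a state sequence from it:

```python
import numpy as np

# Transition probability matrix estimated from data (values are illustrative).
# states: 0 = coding region, 1 = non-coding region
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # P[i, j] = probability of moving from state i to j

rng = np.random.default_rng(0)
state, sequence = 0, []
for _ in range(20):
    sequence.append(state)
    state = rng.choice(2, p=P[state])   # sample the next state from row `state`
# `sequence` is one sampled state sequence of the Markov chain.
```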
Text processing: e.g., read text data with numpy, build a dictionary, establish indexes and mappings between characters and integers, then organize them into a dataset. After training, an RNN can produce all kinds of meaningful text: poems, essays, code. By learning the various features and the underlying mechanism, we can generate complex mappings that bear some similarity to human output; we need to abstract these feature modules into meaningful units.
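A preprocessing sketch in the spirit of character-level RNN training: build the character dictionary and the index mappings (the file name input.txt is an assumption):

```python
# Preprocessing for a character-level RNN (in the spirit of min-char-rnn).
data = open('input.txt').read()
chars = sorted(set(data))                            # the "dictionary" of characters
char_to_ix = {ch: i for i, ch in enumerate(chars)}   # character -> index
ix_to_char = {i: ch for i, ch in enumerate(chars)}   # index -> character
encoded = [char_to_ix[ch] for ch in data]            # the text as a sequence of indices
```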
Implementation tips:
Making the most of the data: data augmentations such as horizontal flipping should be used when data is insufficient; transfer learning reduces the computation needed to train new models, since certain layers of a trained network can be extracted as feature extractors and applied in different domains; in general one does not train from scratch but fine-tunes these pre-trained models; a dropout layer helps avoid overfitting.
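A minimal numpy sketch of the horizontal-flip augmentation mentioned above; images are assumed to be (H, W, C) arrays:

```python
import numpy as np

def augment_batch(images):
    """Horizontal-flip augmentation: randomly mirror each (H, W, C) image
    left-to-right, increasing the effective variety of the training data."""
    out = images.copy()
    for i in range(len(out)):
        if np.random.rand() < 0.5:
            out[i] = out[i][:, ::-1, :]   # flip along the width axis
    return out
```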
How to perform convolution with fast computation, i.e., cut down the work of the underlying large-scale matrix operations, for example with Strassen's algorithm for matrix multiplication.
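A sketch of Strassen's algorithm itself, the example named above (the recursion cutoff of 64 is an arbitrary practical choice; n is assumed to be a power of two):

```python
import numpy as np

def strassen(A, B):
    """Strassen's algorithm: multiply two n x n matrices (n a power of two)
    using 7 recursive multiplications instead of 8."""
    n = A.shape[0]
    if n <= 64:                      # below this size, plain multiply wins
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)
    C = np.empty((n, n))
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```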
Implementation details such as GPU/CPU calculations, computational bottlenecks, distributed training
The general process of training a CNN: 1. load images and labels; 2. run them through the CNN; 3. compute the loss from the CNN's outputs and the labels; 4. backpropagate the gradient to update the CNN's parameters.
Deep learning open-source libraries
Caffe: the Blob class stores data, a Layer transforms blob data, a Net stacks multiple layers and computes gradients, and a Solver uses the gradients to update weights. Little programming is required; prototxt files store the various parameters.
1. Transform the data; 2. define the network; 3. define the solver; 4. train. Recommended for: feature extraction and fine-tuning existing models.
Torch: tensor-based. Recommended for: using pre-trained models and writing your own layers.
Theano: computes gradients. Recommended for: training CNNs.
TensorFlow: Google's open-source framework. Recommended for: training large models.
Image Segmentation and Attention Models
Feature extraction, sampling, and combining the results to define a certain boundary
Video detection vs. unsupervised learning
Principal component analysis (PCA) finds hidden patterns, e.g., through cluster analysis; the features extracted by a neural network are, in effect, statistical bases.
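A PCA sketch via the SVD of centered data; the data here is random and purely illustrative:

```python
import numpy as np

# PCA: find the directions (principal components) along which
# centered data varies most, and project onto the top k of them.
X = np.random.rand(100, 10)          # 100 samples, 10 features (illustrative)
Xc = X - X.mean(axis=0)              # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
components = Vt[:k]                  # the top-k statistical "bases"
X_reduced = Xc @ components.T        # low-dimensional representation
```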
Autoencoder: the data itself is used to learn features; the input is encoded into features, then decoded back to form the output (a reconstruction).
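A minimal numpy autoencoder sketch of this idea: a tanh encoder, a linear decoder, and gradient steps on the reconstruction error; all sizes and the learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 20, 5                          # input dim, code (feature) dim
W_enc = rng.standard_normal((H, D)) * 0.1
W_dec = rng.standard_normal((D, H)) * 0.1
X = rng.standard_normal((100, D))     # toy data
lr = 0.01

for epoch in range(200):
    code = np.tanh(X @ W_enc.T)       # encode: data -> features
    X_hat = code @ W_dec.T            # decode: features -> reconstruction
    err = X_hat - X                   # reconstruction error drives learning
    dW_dec = err.T @ code / len(X)
    dcode = err @ W_dec * (1 - code ** 2)   # backprop through tanh
    dW_enc = dcode.T @ X / len(X)
    W_dec -= lr * dW_dec
    W_enc -= lr * dW_enc
```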
Bayesian inference. Maximum likelihood method.