ResNet paper : arxiv.org/abs/1512.03385
Today, we're going to talk about ResNet, which was introduced by the Microsoft team in Deep Residual Learning for Image Recognition. Rather than a mathematically difficult idea, ResNet introduced a methodologically novel one: the residual.
A residual can be thought of simply as the error of a result, that is, what remains after subtracting X from Y. For this reason, residuals had traditionally been used only as a criterion for evaluation; nobody had thought of using them directly for learning.
Problems of Plain Networks
A plain network is a network without skip/shortcut connections, such as AlexNet or VGGNet. These networks run into problems as they get deeper and deeper: gradient vanishing and gradient exploding.
Gradient vanishing/exploding
In a neural network, the gradient of the loss with respect to each weight is obtained through error backpropagation. In this process, the partial derivatives of the activation functions are multiplied together, layer by layer, along the chain rule, so the gradient flowing back toward the earlier layers can shrink toward 0 or grow to a very large value.
When the network is deep, multiplying many small derivative values drives the gradient toward 0; this is called gradient vanishing. Conversely, multiplying many large derivatives produces an extremely large gradient; this is called gradient exploding.
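To see this concretely, here is a minimal sketch, assuming PyTorch: a toy stack of 50 small sigmoid layers (not an architecture from the paper) whose first-layer gradient is dwarfed by the last-layer gradient after a single backward pass.

```python
# Minimal sketch (assumes PyTorch): a deep stack of sigmoid layers.
# Because each sigmoid contributes a derivative < 1, the gradient that
# reaches the earliest layer is much smaller than the one near the loss.
import torch
import torch.nn as nn

depth = 50
layers = []
for _ in range(depth):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
plain = nn.Sequential(*layers)

x = torch.randn(16, 32)
loss = plain(x).pow(2).mean()
loss.backward()

first_grad = plain[0].weight.grad.norm().item()   # earliest layer
last_grad = plain[-2].weight.grad.norm().item()   # layer closest to the loss
print(f"gradient norm, first layer: {first_grad:.3e}")
print(f"gradient norm, last layer:  {last_grad:.3e}")
```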
A plain network that is too deep is therefore hard to train. Let's look at the figure below first.
The fact that the deeper 56-layer network has a higher error rate than the 20-layer network is called degradation in this paper; it is an optimization difficulty closely related to the gradient vanishing described above. The core observation behind the fix is that a deeper model should not have a higher training error than its shallower counterpart, because the extra layers could in principle simply learn identity mappings. To make such solutions easy for the solver to find, the paper presents a deep residual learning framework.
Residual Learning
If H(x) is the underlying mapping that a stack of layers in the existing neural network is supposed to learn, the goal in this paper is instead to learn the residual mapping F(x) = H(x) - x. In other words, H(x) is redefined.
This reformulation starts from the hypothesis that the residual mapping F(x) is easier to optimize than the original, unreferenced mapping H(x), especially when the optimal mapping is close to an identity.
That is, the block is defined as H(x) = F(x) + x, and it learns to drive F(x) toward 0 so that H(x) = 0 + x. A key property of this form is that when the equation is differentiated, the identity term x contributes a constant 1 to the gradient, which keeps the gradient from vanishing even through many layers.
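To make the differentiation argument explicit, the gradient of a loss L through one block can be written out in the same notation as above (L is just a generic stand-in for whatever loss is being minimized):

$$
\frac{\partial L}{\partial x}
= \frac{\partial L}{\partial H(x)} \cdot \frac{\partial H(x)}{\partial x}
= \frac{\partial L}{\partial H(x)} \left( \frac{\partial F(x)}{\partial x} + 1 \right)
$$

The constant 1 contributed by the identity path guarantees that the gradient arriving at x is never multiplied all the way down to zero by the residual branch alone, no matter how many blocks are stacked.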
With gradient vanishing handled in this way, much deeper networks can be trained without losing accuracy, which in turn yields better performance.
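As a concrete illustration, here is a minimal residual block in PyTorch in the spirit of the paper's basic block (the channel count and the BatchNorm/ReLU ordering follow common ResNet implementations; treat this as a sketch, not the original code):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions plus an identity shortcut: H(x) = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x                      # identity shortcut
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))   # this is F(x)
        out = out + residual              # H(x) = F(x) + x
        return self.relu(out)

block = BasicBlock(64)
y = block(torch.randn(1, 64, 56, 56))     # output keeps the input shape
```

Because the shortcut is a plain addition, the block adds no extra parameters compared with a plain two-layer stack.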
Comparison of structures of ResNet and other networks
Architecture of Plain Networks
The plain baseline is inspired by VGG: the convolutional layers use 3x3 filters and follow two rules (a code sketch follows the list).
- For the same output feature map size, the layers use the same number of filters.
- If the feature map size is halved, the number of filters is doubled, so that the time complexity per layer is preserved.
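The two rules can be read off a small sketch (a hypothetical helper for illustration, not the paper's code): the filter count stays fixed while the feature map size is unchanged, and doubles whenever a stride-2 convolution halves the spatial size.

```python
import torch.nn as nn

def plain_stage(in_ch, out_ch, num_layers, downsample):
    """Stack of 3x3 convs: same channel count within a stage,
    doubled channels when the feature map is halved (stride 2)."""
    layers = [nn.Conv2d(in_ch, out_ch, 3, stride=2 if downsample else 1, padding=1),
              nn.ReLU(inplace=True)]
    for _ in range(num_layers - 1):
        layers += [nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

# 56x56x64 -> 56x56x64: same map size, same filter count
stage2 = plain_stage(64, 64, 3, downsample=False)
# 56x56x64 -> 28x28x128: map size halved, filter count doubled
stage3 = plain_stage(64, 128, 4, downsample=True)
```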
Architecture of Residual Network
A shortcut connection is added on top of the plain network. A residual block needs its input and output to be the same size, and there are two ways to achieve this (sketched after the list).
- Matching the size with zero padding.
- Matching the size with a projection (1x1 convolution).
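Here is a sketch of the two options for a shortcut that has to change dimensions, assuming PyTorch (zero padding adds channels filled with zeros after spatial downsampling; the projection is a learned 1x1 convolution with stride 2):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def shortcut_zero_pad(x, out_channels, stride=2):
    """Zero-padding option: downsample spatially, then zero-pad the extra channels."""
    x = x[:, :, ::stride, ::stride]            # spatial downsampling
    pad = out_channels - x.size(1)
    return F.pad(x, (0, 0, 0, 0, 0, pad))      # pad the channel dimension with zeros

# Projection option: a learned 1x1 convolution maps the input to the new shape.
shortcut_projection = nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False)

x = torch.randn(1, 64, 56, 56)
print(shortcut_zero_pad(x, 128).shape)         # torch.Size([1, 128, 28, 28])
print(shortcut_projection(x).shape)            # torch.Size([1, 128, 28, 28])
```

Zero padding adds no parameters, while the projection adds a small number of learnable weights.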
Experiments
For the plain networks, the 34-layer model performs worse than the 18-layer model. For ResNet, on the other hand, the 34-layer model performs better than the 18-layer one.
Based on this experiment, the residual network (right graph) shows that the degradation problem of the plain network (left graph) is resolved; it also converges faster in the early stage, which shows that ResNet is easier to optimize.
Table 1 shows that the structure changes slightly from 50 layers onward: as the number of layers grows, the parameters and FLOPs increase and so does the complexity, and the bottleneck architecture is used to keep this under control.
The bottleneck design exists to reduce the amount of computation: a 1x1 convolution first reduces the number of channels (lowering the dimension, much like GoogLeNet's Inception modules), a 3x3 convolution then operates on the smaller tensor, and another 1x1 convolution restores the channel count. As shown in the figure above, this cuts the computation, and the additional non-linearities (ReLU) also help.
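A minimal bottleneck block along these lines (PyTorch sketch with 256 -> 64 -> 64 -> 256 channels; BatchNorm layers are omitted here for brevity, though the paper uses them):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 conv reduces channels, 3x3 conv works on the smaller tensor,
    1x1 conv restores the channel count; the identity shortcut is added back."""
    def __init__(self, channels=256, reduced=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1, bias=False),            # reduce: 256 -> 64
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, reduced, kernel_size=3, padding=1, bias=False),  # 3x3 on 64 channels
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, kernel_size=1, bias=False),            # restore: 64 -> 256
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.block(x) + x)   # extra non-linearity after the addition

y = Bottleneck()(torch.randn(1, 256, 28, 28))  # shape is preserved: (1, 256, 28, 28)
```

Because the expensive 3x3 convolution only ever sees the reduced 64-channel tensor, the block is much cheaper than stacking 3x3 convolutions directly on 256 channels.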