VGG paper : arxiv.org/abs/1409.1556
Abstract
The VGG paper investigates how classification accuracy changes as the depth of convolutional networks increases. The models are 16-19 weight layers deep and use small 3x3 convolution filters throughout.
In fact, this model took second place in the classification task of the ImageNet Challenge 2014 (ILSVRC-2014). As networks get deeper, problems such as overfitting and vanishing gradients arise, so let's take a look at how the authors handled the increased depth and still reached second place in the competition.
1. Introduction
Convolutional networks have been very successful in image and video recognition, thanks to large public image repositories such as ImageNet and high-performance computing systems (GPUs). In particular, the ILSVRC competition, which serves as a test bed for large-scale image classification systems, has played a major role in the progress of visual recognition architectures (AlexNet, ZFNet, etc.).
As ConvNets became more of a commodity in the field of computer vision, many attempts were made to improve on the original AlexNet architecture to achieve better accuracy. In this paper, the authors focus on the depth of the ConvNet architecture. To do this, they steadily add more convolutional layers, and they say this is feasible thanks to the use of very small 3x3 convolution filters in all layers.
As a result, they come up with considerably more accurate ConvNet architectures, which they say achieve high accuracy and are also applicable to other image recognition datasets.
2. ConvNet Configurations
2.1 Architecture
The input to the ConvNet is a fixed-size 224x224 RGB image, and the only preprocessing is subtracting the mean RGB value, computed on the training set, from each pixel.
The input image is passed through a stack of convolutional layers with 3x3 filters; in one configuration, 1x1 convolution filters are also applied to add non-linearity. The stride is fixed to 1, and padding is chosen so that the spatial resolution is preserved. Some of the conv layers are followed by max-pooling layers (size=2x2, stride=2).
After the convolutional layers there are three fully-connected (FC) layers: the first and second FCs have 4096 channels, and the third has 1000 channels and is followed by a soft-max layer. ReLU is used as the activation function for all hidden layers, and the Local Response Normalization (LRN) used in AlexNet is not applied, because it does not improve the performance of the VGG model.
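To make the layer arithmetic concrete, here is a minimal PyTorch sketch of the shallowest configuration (11 weight layers); the class and function names are mine, dropout is omitted, and the mean-RGB values in the usage line are the commonly quoted ImageNet means rather than numbers taken from this summary.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    """n_convs 3x3 convolutions (stride=1, padding=1 preserves resolution), each
    followed by ReLU, then a 2x2 max-pool with stride 2 that halves H and W."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

class VGG11Sketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 64, 1),     # 224 -> 112
            conv_block(64, 128, 1),   # 112 -> 56
            conv_block(128, 256, 2),  # 56  -> 28
            conv_block(256, 512, 2),  # 28  -> 14
            conv_block(512, 512, 2),  # 14  -> 7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),  # soft-max is applied by the loss at training time
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# 224x224 RGB input with a per-channel mean subtracted (illustrative values, not from the paper text)
mean_rgb = torch.tensor([123.68, 116.779, 103.939]).view(1, 3, 1, 1)
x = torch.rand(1, 3, 224, 224) * 255
logits = VGG11Sketch()(x - mean_rgb)
print(logits.shape)  # torch.Size([1, 1000])
```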
2.2 Configurations
Table 1 shows the configurations of models A to E with different depths. Experiments were conducted at depths ranging from 11 to 19 weight layers, and the number of channels doubles after each max-pooling layer until it reaches 512. Table 2 shows the number of parameters for each configuration.
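Table 1 is easy to read as a set of layer lists; below is a hedged sketch in the torchvision-style convention (numbers are 3x3 conv output channels, 'M' is a 2x2 max-pool), which is my own layout rather than something taken from the paper. It also roughly counts parameters to show why the totals in Table 2 are dominated by the FC layers.

```python
cfgs = {
    "A": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],           # 11 weight layers
    "B": [64, 64, "M", 128, 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],  # 13
    "D": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
          512, 512, 512, "M", 512, 512, 512, "M"],                                   # 16 (VGG-16)
    "E": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
          512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],                         # 19 (VGG-19)
}
# Config C (also 16 layers) matches D except that the last conv in the deeper stages is 1x1.
# Channel width doubles after each max-pool (64 -> 128 -> 256 -> 512) and then stays at 512.

def conv_params(cfg, in_ch=3):
    """Weights + biases of the 3x3 conv layers in one configuration."""
    total, c = 0, in_ch
    for v in cfg:
        if v == "M":
            continue
        total += 3 * 3 * c * v + v
        c = v
    return total

# The three FC layers (4096, 4096, 1000 channels) contribute most of the parameters.
fc_params = (512 * 7 * 7 * 4096 + 4096) + (4096 * 4096 + 4096) + (4096 * 1000 + 1000)
print(conv_params(cfgs["D"]) + fc_params)  # ~138 million for config D, the order reported in Table 2
```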
2.3 Discussion
The VGG model is very different from AlexNet, the ILSVRC-2012 winner, and ZFNet, the ILSVRC-2013 winner. Both used large filters in their first layers (11x11 with stride 4 and 7x7 with stride 2, respectively), while VGG uses only 3x3 filters with stride 1 throughout. And from this choice, they draw an important observation.
A stack of two 3x3 convolutional layers has an effective receptive field of 5x5, and a stack of three has an effective receptive field of 7x7. By stacking 3x3 filters instead of using one large filter, you get two or three ReLU non-linearities instead of one, and you also reduce the number of parameters: for C channels, three 3x3 layers use 3(3^2 C^2) = 27C^2 weights, while a single 7x7 layer uses 7^2 C^2 = 49C^2.
Source: training.galaxyproject.org/training-materia..
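A quick numeric check of the parameter argument above, with the channel width chosen purely for illustration:

```python
# Three stacked 3x3 convs vs. one 7x7 conv with the same 7x7 effective receptive field,
# assuming C input and C output channels throughout.
C = 512                               # illustrative channel width, not from the paper
stacked_3x3 = 3 * (3 * 3 * C * C)     # 27 * C**2 weights, with three ReLUs along the way
single_7x7 = 7 * 7 * C * C            # 49 * C**2 weights, with only one ReLU
print(stacked_3x3, single_7x7, stacked_3x3 / single_7x7)  # ratio is about 0.55
```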
The 1x1 conv layers (used in configuration C) are intended to add non-linearity. The number of input and output channels stays the same, and each 1x1 convolution is followed by a ReLU, which gives the network additional non-linearity without changing the receptive field.
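A minimal sketch of that idea: with equal input and output channels, the 1x1 conv is just a learned linear map across channels, and the extra non-linearity comes from the ReLU that follows it. The shapes and names here are my own illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)             # (batch, channels, H, W)
extra_nonlinearity = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=1),     # channel count and spatial size unchanged
    nn.ReLU(inplace=True),                  # the added non-linearity
)
print(extra_nonlinearity(x).shape)          # torch.Size([1, 256, 28, 28])
```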