YOLO V2 paper : arxiv.org/abs/1612.08242
1. Introduction
Object detection aims to find objects quickly, accurately, and across a wide range of categories. Deep learning has greatly improved the speed and accuracy of detectors, but the number of categories they can detect is still small. The reason is that detection datasets themselves are small: labeling bounding boxes is far more expensive than labeling for tasks such as classification and tagging.
To solve this problem, YOLOv2 combines classification datasets with object detection data during training. The paper proposes a joint training algorithm that learns from both detection and classification data.
2. Better
Compared to detectors of the time such as Fast R-CNN, YOLOv1 suffered from lower recall and more localization errors. YOLOv2 therefore aims to improve recall and localization.
The simplest fix would be to make the network itself larger and deeper, but that would inevitably slow it down. YOLOv2 instead wants to keep its speed and only increase accuracy, so the goal is a simpler network architecture whose representations are easier to learn.
2-1. Batch Normalization
Batch normalization is a familiar regularization method: it speeds up convergence, and it regularizes strongly enough that the dropout layers can be removed; since BatchNorm learns its own shift, the convolution bias term can be removed as well. YOLOv2 applies BatchNorm to every conv layer, which increases mAP by 2%.
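A minimal sketch of such a block, assuming PyTorch; the `conv_bn_leaky` helper name is made up here, and the leaky ReLU with slope 0.1 is Darknet's usual activation:

```python
import torch.nn as nn

# Conv -> BatchNorm -> LeakyReLU block; since BatchNorm learns its own
# shift, the convolution's bias term is redundant and can be dropped.
def conv_bn_leaky(in_ch, out_ch, kernel_size):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size,
                  padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )
```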
2-2. High Resolution Classifier
Existing detectors are mostly built by fine-tuning an image classifier pre-trained on ImageNet, typically at 224x224 resolution. YOLOv1 likewise took a classifier network pre-trained at 224x224 and jumped straight to 448x448 when fine-tuning for detection, forcing the network to learn detection and adapt to the new resolution at the same time.
YOLOv2 instead first fine-tunes the classification network on ImageNet at 448x448 resolution for 10 epochs, so the feature extractor gets used to the higher resolution before it is fine-tuned for detection at that same high resolution. This is named the High Resolution Classifier, and it increases mAP by about 4%.
2-3. Convolutional With Anchor Boxes
The existing YOLOv1 fed the features from the convolutional feature extractor into fully connected (fc) layers to regress the coordinates of the bounding boxes directly. In contrast, Faster R-CNN uses predetermined anchor boxes: its Region Proposal Network is fully convolutional, takes the feature map of the input image, and, instead of regressing box coordinates directly, predicts offsets relative to anchor boxes placed at every location of the feature map. Predicting offsets rather than raw coordinates makes the problem much simpler and easier to learn.
Therefore, YOLOv2 removes the fc layers and uses anchor boxes to predict bounding boxes. As in Figure 2, the backbone produces a 7x7x1024 output. One pooling layer is removed to obtain a higher-resolution output of 14x14x1024. In addition, the input image is shrunk to 416x416 instead of 448x448; since the network has a total downsampling factor of 32, the final output becomes 13x13x1024 (416/32 = 13). The output size was deliberately made odd so that there is a single center cell: large objects tend to occupy the center of the image, so with an odd-sized grid one center cell is responsible for them instead of four neighboring cells.
Now, applying the idea of the Region Proposal Network (RPN) from Faster R-CNN to this 13x13x1024 output, a large number of anchor boxes are proposed from the features. Where YOLOv1 predicted bounding box coordinates, objectness, and class-conditional probabilities per grid cell, YOLOv2 now predicts objectness and class probabilities for every anchor box.
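A quick sketch of the resulting output-tensor arithmetic, assuming VOC-style settings (20 classes; 5 anchors, as chosen in the next section):

```python
# Grid size: odd, so there is a single center cell.
input_size = 416             # shrunk from 448
stride = 32                  # total downsampling factor of the network
grid = input_size // stride  # 416 / 32 = 13

# Per anchor: 4 box offsets + 1 objectness + class probabilities.
num_anchors, num_classes = 5, 20
channels = num_anchors * (5 + num_classes)
print(grid, grid, channels)  # 13 13 125
```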
Using anchor boxes decreases mAP slightly, by 0.3 points, but recall increases by 7 points, from 81% to 88%.
2-4. Dimension Clusters
There is still room for improvement in how anchor boxes are used. Rather than hand-picked anchor boxes of fixed sizes, anchors fitted to the given situation (the given dataset) make the representation easier to learn. For this reason, YOLOv2 runs k-means clustering over the bounding boxes of the entire training set to choose the best anchor priors.
Looking at Figure 4, the shapes of the clustered anchor boxes differ per dataset, as expected. The choice of distance metric for k-means was also an important issue: the standard Euclidean distance produces larger errors for larger boxes, so YOLOv2 defines a distance that favors high IOU regardless of box size:

d(box, centroid) = 1 − IOU(box, centroid)
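A minimal k-means sketch using this distance, assuming boxes are given as (width, height) pairs; the centroids it returns are the anchor priors, and the function names are illustrative:

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (w, h) boxes and centroids, as if they shared a center."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
          + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to the centroid with the smallest 1 - IOU
        assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)
        new = []
        for i in range(k):
            members = boxes[assign == i]
            # keep the old centroid if a cluster happens to go empty
            new.append(members.mean(axis=0) if len(members) else centroids[i])
        centroids = np.stack(new)
    return centroids
```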
2-5. Direct location prediction
With YOLOv2 using anchor boxes, there was one more thing to improve: model instability early in training. The RPN predicts offsets (t_x, t_y) and recovers the box center (x, y) from the anchor (x_a, y_a, w_a, h_a) with the following equations:

x = (t_x * w_a) + x_a
y = (t_y * h_a) + y_a
Because this formulation puts no bounds on the offsets, the predicted center can land anywhere in the image, regardless of which location predicted the box. Early in training, with random initialization, this makes learning unstable. Constraining the predicted position to a plausible range allows more stable optimization, so YOLOv2 locks the predicted center inside its grid cell; in other words, the position is specified relative to the grid cell.
In Figure 6 below, the model predicts the black parameters t_x, t_y, t_w, t_h, and t_o, which are converted into the blue b_x, b_y, b_w, and b_h. Here c_x and c_y are the coordinates of the grid cell's top-left corner, and p_w and p_h are the width and height of the predetermined anchor box. The conversion anchors the coordinates to the grid cell, applying a logistic (sigmoid) activation σ to the offsets and adding the cell position, as in the equations in the figure:

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w * e^(t_w)
b_h = p_h * e^(t_h)
Pr(object) * IOU(b, object) = σ(t_o)
Because the sigmoid bounds σ(t_x) and σ(t_y) to values between 0 and 1, the predicted center cannot leave its grid cell, and the learning process becomes stable.
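A minimal decoding sketch of these equations; all quantities are in grid-cell units, and the function name is illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t, cell, prior):
    """t = (t_x, t_y, t_w, t_h, t_o); cell = (c_x, c_y); prior = (p_w, p_h)."""
    t_x, t_y, t_w, t_h, t_o = t
    c_x, c_y = cell
    p_w, p_h = prior
    b_x = sigmoid(t_x) + c_x   # center constrained inside the cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * np.exp(t_w)    # size scales the anchor prior
    b_h = p_h * np.exp(t_h)
    conf = sigmoid(t_o)        # Pr(object) * IOU
    return b_x, b_y, b_w, b_h, conf
```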
The key idea of YOLOv1 was to divide the image into a grid and have each grid cell predict the bounding boxes of the objects inside it. YOLOv2 keeps the grid but predicts several bounding boxes per cell using anchor boxes, with the anchor sizes determined by Dimension Clusters. Combined with direct location prediction, these changes give YOLOv2 almost 5% higher mAP than the plain anchor-box version.
2-6. Fine-Grained Features
YOLOv2's 13x13 feature map may be sufficient for detecting large objects, but it can struggle with small ones. To solve this problem, a passthrough layer brings in the 26x26-resolution feature map from an earlier layer, before the final 13x13 feature map is produced.
As shown in the figure above, the passthrough layer reshapes the 26x26x512 feature map into 13x13x2048 by stacking adjacent spatial positions into channels, and concatenates it with the original 13x13 output features. This improves performance by 1%.
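A minimal sketch of this reshaping, assuming PyTorch; it is a space-to-depth operation with stride 2:

```python
import torch

def passthrough(x, stride=2):
    # Stack each stride x stride block of spatial positions into channels.
    n, c, h, w = x.shape
    x = x.view(n, c, h // stride, stride, w // stride, stride)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(n, c * stride * stride, h // stride, w // stride)

features = torch.randn(1, 512, 26, 26)
print(passthrough(features).shape)  # torch.Size([1, 2048, 13, 13])
```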
2-7. Multi-Scale Training
Unlike the original YOLO, YOLOv2 is trained with multi-scale training, which makes it robust to a variety of input dimensions.
Multi-scale training:
Every 10 batches, the network switches to a new randomly chosen image dimension.
Since YOLOv2 downsamples by a factor of 32, the candidate image dimensions are multiples of 32: {320, 352, ..., 608}.
Such multi-scale training is possible because YOLOv2 has a fully convolutional network structure, so it can be applied to various input dimensions; the experimental results are shown in the following figure, and a minimal scheduling sketch follows.
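A minimal sketch of that schedule; the function name is illustrative:

```python
import random

# Multiples of 32 from 320 to 608, as in the paper.
SIZES = list(range(320, 608 + 1, 32))

def pick_input_size(batch_idx, current_size):
    # Every 10 batches, switch to a new randomly chosen dimension.
    if batch_idx % 10 == 0:
        return random.choice(SIZES)
    return current_size
```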
3. Faster
The authors wanted to keep the speed of the original YOLO while raising accuracy. To this end they designed their own backbone network, Darknet-19. It requires far fewer operations than the VGG-16 backbone commonly used by other detectors (5.58 billion versus 30.69 billion per image) while achieving similar accuracy.
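A hedged sketch of Darknet-19's characteristic pattern, reusing the `conv_bn_leaky` helper from the Batch Normalization section: 3x3 convolutions with 1x1 convolutions in between that compress the channel count, which is what keeps the operation count far below VGG-16's.

```python
def darknet_block(channels):
    # 3x3 conv, 1x1 channel compression, 3x3 conv back up.
    return nn.Sequential(
        conv_bn_leaky(channels // 2, channels, 3),
        conv_bn_leaky(channels, channels // 2, 1),
        conv_bn_leaky(channels // 2, channels, 3),
    )
```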
4. Stronger
YOLOv2 proposes a method for training on classification and detection datasets jointly. For detection images, the full YOLOv2 loss is backpropagated; for classification images, only the classification part of the loss is backpropagated.
A problem arises because the label sets of the two datasets differ. For example, the detection data has coarse labels such as "dog" and "boat", while the classification data has fine-grained labels such as "Norfolk terrier", "Yorkshire terrier", and "Bedlington terrier". Mixing them violates the mutually exclusive assumption of softmax, so a multi-label model is used instead.
Hierarchical classification
The labels of ImageNet come from WordNet. Because the relationships in natural language are complex, WordNet is a directed graph rather than a tree. The authors build WordTree, a hierarchical tree structure, by keeping only the shortest path from each concept to the root.
Using the hierarchical structure of WordTree, a conditional probability is predicted for each label, conditioned on its parent node. Examples of such conditional probabilities:
Pr(Norfolk terrier | terrier), Pr(Yorkshire terrier | terrier), Pr(Bedlington terrier | terrier)
To compute the absolute probability of a particular node, multiply the conditional probabilities along the path from the root down to that node.
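A minimal sketch, assuming each node stores its parent and the model's predicted conditional probability Pr(node | parent); walking up to the root and multiplying gives, e.g., Pr(Norfolk terrier) = Pr(Norfolk terrier | terrier) * Pr(terrier | hunting dog) * ... * Pr(animal | physical object), with Pr(physical object) = 1 at the root:

```python
def absolute_probability(node, parent, cond_prob):
    """parent: child -> parent map; cond_prob: node -> Pr(node | parent)."""
    p = 1.0
    while node is not None:
        p *= cond_prob.get(node, 1.0)  # root's probability is 1
        node = parent.get(node)
    return p
```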
5. Joint Classification and Detection
WordTree is used to combine the COCO detection set with the top 9,000 categories of ImageNet, yielding a combined dataset with 9,418 categories. Because ImageNet is so much larger, COCO is oversampled so that the ratio of ImageNet to COCO data is only 4:1.
During training, as mentioned above, the full loss is backpropagated for detection images, and only the classification loss for classification images. For a classification label, only the label's node and its ancestors (higher categories) receive a loss; its subcategories are not penalized. For example, for the label "dog", the higher category "animal" is included in the loss, but lower categories such as "terrier" are not.
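A minimal sketch of that label handling; for a classification image, a helper like this (names illustrative) would collect the nodes that receive a loss:

```python
def supervised_nodes(label, parent):
    """Return the label and all of its ancestors up to the root."""
    nodes = []
    while label is not None:
        nodes.append(label)
        label = parent.get(label)
    return nodes  # e.g. ["dog", "canine", ..., "animal", "physical object"]
```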
With this joint training method, YOLOv2 becomes YOLO9000, capable of detecting more than 9,000 categories, hence the paper's title, "YOLO9000".