MobileNet V1: arxiv.org/abs/1704.04861
Early CNNs were built by stacking convolution and pooling layers and simply increasing the number of channels. This structure is intuitive and easy to understand and implement. However, standard convolutions are rather heavy to run in a mobile environment, so MobileNet was proposed to dramatically reduce the number of parameters.
It does this by replacing the standard convolution operation with depthwise separable convolution, which is a depthwise convolution followed by a pointwise convolution.
Depthwise convolution separates the input into its channels and convolves each channel with its own kernel. As a result, the number of output channels is always the same as the number of input channels.
Pointwise convolution is a 1x1 convolution. It mixes information across channels and can change the number of output channels.
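To make the two steps concrete, here is a minimal pure-Python sketch of a depthwise convolution followed by a pointwise convolution. The function names and the nested-list tensor layout (channels, height, width) are assumptions made for illustration, not code from the paper. It uses stride 1 and no padding, and it leaves out the batch normalization and ReLU that the paper applies after each step.

```python
def depthwise_conv(x, kernels):
    """Convolve each input channel with its own k x k kernel.

    x: list of C channels, each an H x W list of lists.
    kernels: list of C kernels, each a k x k list of lists.
    Stride 1, no padding. Output has the same number of channels as x.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        h_out = len(channel) - k + 1
        w_out = len(channel[0]) - k + 1
        out.append([
            [
                sum(channel[i + di][j + dj] * kernel[di][dj]
                    for di in range(k) for dj in range(k))
                for j in range(w_out)
            ]
            for i in range(h_out)
        ])
    return out


def pointwise_conv(x, weights):
    """1x1 convolution: mix channels at every spatial position.

    weights: list of C_out rows, each holding C_in coefficients.
    """
    h, w = len(x[0]), len(x[0][0])
    return [
        [
            [sum(wc * x[c][i][j] for c, wc in enumerate(row))
             for j in range(w)]
            for i in range(h)
        ]
        for row in weights
    ]


def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Depthwise step first, then pointwise step."""
    return pointwise_conv(depthwise_conv(x, dw_kernels), pw_weights)


# Toy input: 2 channels of size 3x3.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
dw_kernels = [ones, ones]                      # one 3x3 kernel per channel
pw_weights = [[1, 0], [0, 1], [1, 1], [1, -1]]  # 2 channels in, 4 out

dw_out = depthwise_conv(x, dw_kernels)
y = depthwise_separable_conv(x, dw_kernels, pw_weights)
print(len(dw_out), len(y))  # channel counts after each step
```

The depthwise step keeps the channel count at 2, and only the pointwise step changes it to 4.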
Let's look at the advantage of doing this from the perspective of parameter count. For example, suppose we want to produce a 3x3x3 output from a 3x3x3 input using 3x3 kernels (biases ignored).
1) For conventional convolution,
Since there are three 3x3x3 kernels, the number of parameters is 3x3x3x3 = 81
2) For depthwise separable convolution,
Since there are three 3x3x1 kernels (depthwise) and three 1x1x3 kernels (pointwise), the number of parameters is 3x3x1x3 + 1x1x3x3 = 27 + 9 = 36
As shown above, the number of parameters drops to less than half even in this tiny example, and the saving grows as the number of channels increases.
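The same arithmetic can be checked with a short script. The helper names below are hypothetical and biases are ignored, as in the example above. The second case (256 input and output channels) is an assumed layer size chosen only to show how the ratio behaves with more channels.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise (k x k per channel) plus pointwise (1x1) pair."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# The 3x3x3 -> 3x3x3 example from the text.
standard = conv_params(3, 3, 3)
separable = depthwise_separable_params(3, 3, 3)
print(standard, separable)  # 81 36

# An assumed wider layer: 256 channels in and out.
standard_wide = conv_params(3, 256, 256)
separable_wide = depthwise_separable_params(3, 256, 256)
ratio = separable_wide / standard_wide
print(standard_wide, separable_wide, round(ratio, 3))  # 589824 67840 0.115
```

In general the ratio is 1/c_out + 1/k^2, so with 3x3 kernels it approaches 1/9 as the number of output channels grows.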
The overall structure of MobileNet is shown below.
In fact, the structure is almost the same as the VGG described above. The big differences are that the standard convolutions are replaced with depthwise separable convolutions and that the feature map size is halved with stride-2 convolutions instead of pooling.
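As a rough illustration of downsampling with strides rather than pooling, the sketch below applies the usual convolution output-size formula. The 224x224 input, 3x3 kernel, padding of 1, and five stride-2 layers are assumptions chosen to match the common MobileNet V1 configuration, and the function name is hypothetical.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


size = 224  # assumed input resolution
sizes = [size]
for _ in range(5):  # assumed number of stride-2 layers
    size = conv_output_size(size, stride=2)
    sizes.append(size)

print(sizes)  # [224, 112, 56, 28, 14, 7]
```

Each stride-2 convolution halves the height and width, which is the job a 2x2 pooling layer does in a VGG-style network.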
Network Structure