
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam (Google)

We present a class of efficient models called MobileNets for mobile and embedded vision applications. MobileNets are based on a streamlined architecture that uses depthwise separable convolutions to build lightweight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy. These hyper-parameters allow the model builder to choose the right sized model for their application based on the constraints of the problem. We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification.




We then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases including object detection, finegrain classification, face attributes and large scale geo-localization.

1. Introduction

Convolutional neural networks have become ubiquitous in computer vision ever since AlexNet [19] popularized deep convolutional neural networks by winning the ImageNet Challenge: ILSVRC 2012 [24]. The general trend has been to make deeper and more complicated networks in order to achieve higher accuracy [27, 31, 29, 8]. However, these advances to improve accuracy are not necessarily making networks more efficient with respect to size and speed. In many real world applications such as robotics, self-driving cars and augmented reality, the recognition tasks need to be carried out in a timely fashion on a computationally limited platform.

This paper describes an efficient network architecture and a set of two hyper-parameters in order to build very small, low latency models that can be easily matched to the design requirements for mobile and embedded vision applications.

Section 2 reviews prior work in building small models. Section 3 describes the MobileNet architecture and two hyper-parameters, width multiplier and resolution multiplier, to define smaller and more efficient MobileNets. Section 4 describes experiments on ImageNet as well as a variety of different applications and use cases. Section 5 closes with a summary and conclusion.

2. Prior Work

There has been rising interest in building small and efficient neural networks in the recent literature, e.g. [16, 34, 12, 36, 22]. Many different approaches can be generally categorized into either compressing pretrained networks or training small networks directly. This paper proposes a class of network architectures that allows a model developer to specifically choose a small network that matches the resource restrictions (latency, size) for their application. MobileNets primarily focus on optimizing for latency but also yield small networks.

Many papers on small networks focus only on size but do not consider speed.

MobileNets are built primarily from depthwise separable convolutions initially introduced in [26] and subsequently used in Inception models [13] to reduce the computation in the first few layers. Flattened networks [16] build a network out of fully factorized convolutions and showed the potential of extremely factorized networks. Independent of this current paper, Factorized Networks [34] introduces a similar factorized convolution as well as the use of topological connections. Subsequently, the Xception network [3] demonstrated how to scale up depthwise separable filters to outperform Inception V3 networks. Another small network is Squeezenet [12] which uses a bottleneck approach to design a very small network.

Other reduced computation networks include structured transform networks [28] and deep fried convnets [37].

A different approach for obtaining small networks is shrinking, factorizing or compressing pretrained networks. Compression based on product quantization [36], hashing [2], and pruning, vector quantization and Huffman coding [5] have been proposed in the literature.

Figure 1. MobileNet models can be applied to various recognition tasks for efficient on device intelligence. (Panels: object detection, finegrain classification, face attributes, landmark recognition. Photos by Sharon VanderKaay, Juanedc and HarshLight, CC BY; Google Doodle by Sarah Harrison. Preprint stamp: 17 Apr 2017.)

Additionally, various factorizations have been proposed to speed up pretrained networks [14, 20]. Another method for training small networks is distillation [9] which uses a larger network to teach a smaller network. It is complementary to our approach and is covered in some of our use cases in section 4. Another emerging approach is low bit networks [4, 22, 11].

3. MobileNet Architecture

In this section we first describe the core layers that MobileNet is built on, which are depthwise separable filters. We then describe the MobileNet network structure and conclude with descriptions of the two model shrinking hyper-parameters, width multiplier and resolution multiplier.

3.1. Depthwise Separable Convolution

The MobileNet model is based on depthwise separable convolutions, a form of factorized convolutions which factorize a standard convolution into a depthwise convolution and a 1×1 convolution called a pointwise convolution.

For MobileNets the depthwise convolution applies a single filter to each input channel. The pointwise convolution then applies a 1×1 convolution to combine the outputs of the depthwise convolution. A standard convolution both filters and combines inputs into a new set of outputs in one step. The depthwise separable convolution splits this into two layers, a separate layer for filtering and a separate layer for combining. This factorization has the effect of drastically reducing computation and model size. Figure 2 shows how a standard convolution 2(a) is factorized into a depthwise convolution 2(b) and a 1×1 pointwise convolution 2(c).

A standard convolutional layer takes as input a D_F × D_F × M feature map F and produces a D_F × D_F × N feature map G, where D_F is the spatial width and height of a square input feature map¹, M is the number of input channels (input depth), D_G is the spatial width and height of a square output feature map and N is the number of output channels (output depth).
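As a rough illustration of the filtering/combining split described above, here is a minimal NumPy sketch (our own construction, not the paper's code) of a depthwise step followed by a 1×1 pointwise step, assuming stride one and "same" padding:

```python
import numpy as np

def depthwise_separable_conv(F, K_dw, K_pw):
    """F: (DF, DF, M) input; K_dw: (DK, DK, M) depthwise filters;
    K_pw: (M, N) pointwise (1x1) weights. Returns (DF, DF, N)."""
    DF, _, M = F.shape
    DK = K_dw.shape[0]
    pad = DK // 2
    Fp = np.pad(F, ((pad, pad), (pad, pad), (0, 0)))
    G_dw = np.zeros_like(F, dtype=float)
    # Depthwise step: one filter per input channel, no cross-channel mixing.
    for k in range(DF):
        for l in range(DF):
            patch = Fp[k:k + DK, l:l + DK, :]        # (DK, DK, M)
            G_dw[k, l, :] = np.sum(patch * K_dw, axis=(0, 1))
    # Pointwise step: a 1x1 convolution mixes channels, M -> N.
    return G_dw @ K_pw                               # (DF, DF, N)

F = np.random.rand(8, 8, 16)       # DF = 8, M = 16
K_dw = np.random.rand(3, 3, 16)    # DK = 3
K_pw = np.random.rand(16, 32)      # N = 32
G = depthwise_separable_conv(F, K_dw, K_pw)
print(G.shape)                     # (8, 8, 32)
```

In a real MobileNet each of the two steps would additionally be followed by batchnorm and a ReLU nonlinearity, as the text notes below.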

The standard convolutional layer is parameterized by a convolution kernel K of size D_K × D_K × M × N, where D_K is the spatial dimension of the kernel, assumed to be square, M is the number of input channels and N is the number of output channels as defined previously.

The output feature map for standard convolution, assuming stride one and padding, is computed as:

G_{k,l,n} = Σ_{i,j,m} K_{i,j,m,n} · F_{k+i-1, l+j-1, m}    (1)

Standard convolutions have the computational cost of:

D_K × D_K × M × N × D_F × D_F    (2)

where the computational cost depends multiplicatively on the number of input channels M, the number of output channels N, the kernel size D_K × D_K and the feature map size D_F × D_F. MobileNet models address each of these terms and their interactions. First it uses depthwise separable convolutions to break the interaction between the number of output channels and the size of the kernel.

The standard convolution operation has the effect of filtering features based on the convolutional kernels and combining features in order to produce a new representation. The filtering and combination steps can be split into two steps via the use of factorized convolutions called depthwise separable convolutions for substantial reduction in computational cost.

Depthwise separable convolutions are made up of two layers: depthwise convolutions and pointwise convolutions. We use depthwise convolutions to apply a single filter per each input channel (input depth). Pointwise convolution, a simple 1×1 convolution, is then used to create a linear combination of the output of the depthwise layer. MobileNets use both batchnorm and ReLU nonlinearities for both layers.

Depthwise convolution with one filter per input channel (input depth) can be written as:

Ĝ_{k,l,m} = Σ_{i,j} K̂_{i,j,m} · F_{k+i-1, l+j-1, m}    (3)

where K̂ is the depthwise convolutional kernel of size D_K × D_K × M, and the m-th filter in K̂ is applied to the m-th channel in F to produce the m-th channel of the filtered output feature map Ĝ.

Depthwise convolution has a computational cost of:

D_K × D_K × M × D_F × D_F    (4)

Depthwise convolution is extremely efficient relative to standard convolution.

¹ We assume that the output feature map has the same spatial dimensions as the input and both feature maps are square. Our model shrinking results generalize to feature maps with arbitrary sizes and aspect ratios.
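The cost expressions in Eqs. (2) and (4) can be tallied directly; the following short Python sketch (the function names are ours, for illustration) evaluates both for a representative layer:

```python
def standard_conv_cost(DK, M, N, DF):
    # Standard convolution: D_K * D_K * M * N * D_F * D_F, Eq. (2).
    return DK * DK * M * N * DF * DF

def depthwise_conv_cost(DK, M, DF):
    # Depthwise convolution drops the factor of N, Eq. (4).
    return DK * DK * M * DF * DF

# Example: DK = 3, M = N = 512 channels, DF = 14 spatial resolution.
print(standard_conv_cost(3, 512, 512, 14))   # 462422016 mult-adds
print(depthwise_conv_cost(3, 512, 14))       # 903168 mult-adds
```

The depthwise step alone is hundreds of times cheaper here, but as the next paragraph explains, it must be paired with a pointwise step to combine channels.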

However it only filters input channels, it does not combine them to create new features. So an additional layer that computes a linear combination of the output of depthwise convolution via 1×1 convolution is needed in order to generate these new features.

The combination of depthwise convolution and 1×1 (pointwise) convolution is called depthwise separable convolution, which was originally introduced in [26]. Depthwise separable convolutions cost:

D_K × D_K × M × D_F × D_F + M × N × D_F × D_F    (5)

which is the sum of the depthwise and 1×1 pointwise convolutions.

By expressing convolution as a two step process of filtering and combining we get a reduction in computation of:

(D_K × D_K × M × D_F × D_F + M × N × D_F × D_F) / (D_K × D_K × M × N × D_F × D_F) = 1/N + 1/D_K²

MobileNet uses 3×3 depthwise separable convolutions which use between 8 to 9 times less computation than standard convolutions at only a small reduction in accuracy as seen in Section 4.

Additional factorization in spatial dimension such as in [16, 31] does not save much additional computation as very little computation is spent in depthwise convolutions.

Figure 2. (a) Standard convolutional filters; (b) depthwise convolutional filters; (c) 1×1 convolutional filters, called pointwise convolution in the context of depthwise separable convolution.
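The reduction ratio above is easy to check numerically; a brief sketch (our own helper names, not from the paper) evaluating Eq. (5) and the 1/N + 1/D_K² factor:

```python
def separable_conv_cost(DK, M, N, DF):
    # Depthwise filtering plus 1x1 pointwise combining, Eq. (5).
    return DK * DK * M * DF * DF + M * N * DF * DF

def cost_ratio(DK, N):
    # Separable cost divided by standard cost: 1/N + 1/DK^2.
    return 1 / N + 1 / DK**2

# With 3x3 kernels the ratio is a little over 1/9, i.e. the 8x-9x
# savings quoted in the text; the exact factor depends on N.
for N in (64, 256, 1024):
    print(N, round(1 / cost_ratio(3, N), 2))
```

Note that as N grows the speedup factor approaches D_K² = 9, which is why the paper quotes "between 8 to 9 times" for 3×3 kernels.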

