
MobileNetV2: Inverted Residuals and Linear Bottlenecks

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen
Google Inc.

Abstract

In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes. We also describe efficient ways of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3 which we call Mobile DeepLabv3.

MobileNetV2 is based on an inverted residual structure where the shortcut connections are between the thin bottleneck layers. The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity.



Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design.

Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on ImageNet [1] classification, COCO object detection [2], and VOC image segmentation [3]. We evaluate the trade-offs between accuracy, number of operations measured by multiply-adds (MAdd), actual latency, and the number of parameters.

1. Introduction

Neural networks have revolutionized many areas of machine intelligence, enabling superhuman accuracy for challenging image recognition tasks.

However, the drive to improve accuracy often comes at a cost: modern state of the art networks require high computational resources beyond the capabilities of many mobile and embedded applications.

This paper introduces a new neural network architecture that is specifically tailored for mobile and resource-constrained environments. Our network pushes the state of the art for mobile-tailored computer vision models, by significantly decreasing the number of operations and memory needed while retaining the same accuracy.

Our main contribution is a novel layer module: the inverted residual with linear bottleneck. This module takes as an input a low-dimensional compressed representation which is first expanded to high dimension and filtered with a lightweight depthwise convolution. Features are subsequently projected back to a low-dimensional representation with a linear convolution.
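The module described above can be sketched in a few lines of PyTorch. This is an illustrative sketch rather than the authors' TensorFlow-Slim implementation; the ReLU6 non-linearity, batch normalization, and the default expansion ratio of 6 are assumptions carried over from the MobileNet family, and the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of an inverted residual block with a linear bottleneck.

    Layout: 1x1 expansion (non-linear) -> 3x3 depthwise (non-linear)
    -> 1x1 projection (linear). A shortcut connects the thin
    bottleneck input and output when shapes allow.
    """

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, expand_ratio: int = 6):
        super().__init__()
        hidden = in_ch * expand_ratio
        # Residual connection only when the block keeps resolution and width.
        self.use_shortcut = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1x1 pointwise expansion to a high-dimensional space.
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution filters each channel separately.
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 linear projection back to the low-dimensional bottleneck
            # (no non-linearity here, to preserve representational power).
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        return x + out if self.use_shortcut else out
```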

The official implementation is available as part of the TensorFlow-Slim model library [4]. This module can be efficiently implemented using standard operations in any modern framework and allows our models to beat the state of the art along multiple performance points using standard benchmarks. Furthermore, this convolutional module is particularly suitable for mobile designs, because it allows us to significantly reduce the memory footprint needed during inference by never fully materializing large intermediate tensors. This reduces the need for main memory access in many embedded hardware designs that provide small amounts of very fast software-controlled cache memory.

2. Related Work

Tuning deep neural architectures to strike an optimal balance between accuracy and performance has been an area of active research for the last several years. Both manual architecture search and improvements in training algorithms, carried out by numerous teams, have led to dramatic improvements over early designs such as AlexNet [5], VGGNet [6], GoogLeNet [7], and ResNet [8].

Recently there has been lots of progress in algorithmic architecture exploration, including hyper-parameter optimization [9, 10, 11] as well as various methods of network pruning [12, 13, 14, 15, 16, 17] and connectivity learning [18, 19]. A substantial amount of work has also been dedicated to changing the connectivity structure of the internal convolutional blocks, such as in ShuffleNet [20], or introducing sparsity [21] and others [22].

Recently, [23, 24, 25, 26] opened up a new direction of bringing optimization methods, including genetic algorithms and reinforcement learning, to architectural search. However, one drawback is that the resulting networks end up very complex. In this paper, we pursue the goal of developing better intuition about how neural networks operate and use that to guide the simplest possible network design.

Our approach should be seen as complementary to the one described in [23] and related work. In this vein our approach is similar to those taken by [20, 22] and allows us to further improve the performance, while providing a glimpse of its internal operation. Our network design is based on MobileNetV1 [27]. It retains its simplicity and does not require any special operators while significantly improving its accuracy, achieving state of the art on multiple image classification and detection tasks for mobile applications.

3. Preliminaries, discussion and intuition

3.1. Depthwise Separable Convolutions

Depthwise separable convolutions are a key building block for many efficient neural network architectures [27, 28, 20] and we use them in the present work as well. The basic idea is to replace a full convolutional operator with a factorized version that splits convolution into two separate layers.

The first layer is called a depthwise convolution; it performs lightweight filtering by applying a single convolutional filter per input channel. The second layer is a $1 \times 1$ convolution, called a pointwise convolution, which is responsible for building new features through computing linear combinations of the input channels.

A standard convolution takes an $h_i \times w_i \times d_i$ input tensor $L_i$, and applies a convolutional kernel $K \in \mathbb{R}^{k \times k \times d_i \times d_j}$ to produce an $h_i \times w_i \times d_j$ output tensor $L_j$. Standard convolutional layers have the computational cost of $h_i \cdot w_i \cdot d_i \cdot d_j \cdot k \cdot k$.

Depthwise separable convolutions are a drop-in replacement for standard convolutional layers. Empirically they work almost as well as regular convolutions but only cost:

$$h_i \cdot w_i \cdot d_i (k^2 + d_j) \qquad (1)$$

which is the sum of the depthwise and $1 \times 1$ pointwise convolutions. Effectively, depthwise separable convolution reduces computation compared to traditional layers by almost a factor of $k^2$ (more precisely, by a factor of $k^2 d_j / (k^2 + d_j)$).
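To make the two cost expressions concrete, the short plain-Python check below evaluates both and their ratio; the feature-map and channel sizes are made-up illustrative values, not figures from the paper.

```python
def standard_cost(h, w, d_in, d_out, k):
    # h * w * d_in * d_out * k * k multiply-adds for a standard convolution.
    return h * w * d_in * d_out * k * k

def separable_cost(h, w, d_in, d_out, k):
    # h * w * d_in * (k^2 + d_out): depthwise plus 1x1 pointwise convolution.
    return h * w * d_in * (k * k + d_out)

# Example feature map: 56x56 spatial, 64 -> 128 channels, 3x3 kernel.
std = standard_cost(56, 56, 64, 128, 3)
sep = separable_cost(56, 56, 64, 128, 3)
print(std / sep)  # ~8.4, close to the k^2 = 9 upper bound
```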

MobileNetV2 uses $k = 3$ ($3 \times 3$ depthwise separable convolutions), so the computational cost is 8 to 9 times smaller than that of standard convolutions at only a small reduction in accuracy [27].

3.2. Linear Bottlenecks

Consider a deep neural network consisting of $n$ layers $L_i$, each of which has an activation tensor of dimensions $h_i \times w_i \times d_i$. Throughout this section we will be discussing the basic properties of these activation tensors, which we will treat as containers of $h_i \times w_i$ "pixels" with $d_i$ dimensions. Informally, for an input set of real images, we say that the set of layer activations (for any layer $L_i$) forms a "manifold of interest". It has long been assumed that manifolds of interest in neural networks could be embedded in low-dimensional subspaces. In other words, when we look at all individual $d$-channel pixels of a deep convolutional layer, the information encoded in those values actually lies in some manifold, which in turn is embeddable into a low-dimensional subspace.

At a first glance, such a fact could then be captured and exploited by simply reducing the dimensionality of a layer, thus reducing the dimensionality of the operating space.

This has been successfully exploited by MobileNetV1 [27] to effectively trade off between computation and accuracy via a width multiplier parameter, and has been incorporated into efficient model designs of other networks as well [20]. Following that intuition, the width multiplier approach allows one to reduce the dimensionality of the activation space until the manifold of interest spans this entire space. However, this intuition breaks down when we recall that deep convolutional neural networks actually have non-linear per-coordinate transformations, such as ReLU. For example, ReLU applied to a line in 1D space produces a ray, whereas in $\mathbb{R}^n$ space it generally results in a piece-wise linear curve.

It is easy to see that, in general, if a result of a layer transformation $\mathrm{ReLU}(Bx)$ has a non-zero volume $S$, the points mapped to the interior of $S$ are obtained via a linear transformation $B$ of the input, thus indicating that the part of the input space corresponding to the full-dimensional output is limited to a linear transformation. In other words, deep networks only have the power of a linear classifier on the non-zero volume part of the output domain. (Note that the dimensionality of the manifold differs from the dimensionality of a subspace that could be embedded via a linear transformation.)

Figure 1: Examples of ReLU transformations of low-dimensional manifolds embedded in higher-dimensional spaces. In these examples the initial spiral is embedded into an $n$-dimensional space using a random matrix $T$ followed by ReLU, and then projected back to the 2D space using $T^{-1}$. In the examples above, $n = 2, 3$ result in information loss where certain points of the manifold collapse into each other, while for $n = 15$ to $30$ the transformation is highly non-convex.

Figure 2: Evolution of separable convolution blocks: (a) regular, (b) separable, (c) separable with linear bottleneck, (d) bottleneck with expansion layer. The diagonally hatched texture indicates layers that do not contain non-linearities. The last (lightly colored) layer indicates the beginning of the next block. Note: 2d and 2c are equivalent blocks when stacked.
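The experiment behind Figure 1 can be approximated in a few lines of NumPy. This is a hypothetical reconstruction, not the authors' script: the spiral data, the use of the pseudo-inverse for the projection back to 2D, and the printed error metric are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2D "manifold of interest": points on a spiral.
t = np.linspace(0, 4 * np.pi, 500)
X = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)  # shape (500, 2)

for n in (2, 3, 15, 30):
    T = rng.standard_normal((2, n))     # random embedding into R^n
    Y = np.maximum(X @ T, 0.0)          # ReLU in the n-dimensional space
    X_back = Y @ np.linalg.pinv(T)      # project back to 2D via the pseudo-inverse
    err = np.mean(np.linalg.norm(X - X_back, axis=1))
    # Reconstruction error is typically large for n = 2, 3 and shrinks as
    # n grows, mirroring the information loss shown in Figure 1.
    print(n, round(err, 3))
```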

