fzhangxiangyu,zxy,linmengxiao,[email protected] arXiv ...

ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun
Megvii Inc

We introduce an extremely computation-efficient CNN architecture named ShuffleNet, which is designed specially for mobile devices with very limited computing power (e.g., 10-150 MFLOPs). The new architecture utilizes two new operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy. Experiments on ImageNet classification and MS COCO object detection demonstrate the superior performance of ShuffleNet over other structures, e.g., lower top-1 error than the recent MobileNet [12] on the ImageNet classification task under a computation budget of 40 MFLOPs.

On an ARM-based mobile device, ShuffleNet achieves 13× actual speedup over AlexNet while maintaining comparable accuracy.

Introduction

Building deeper and larger convolutional neural networks (CNNs) is a primary trend for solving major visual recognition tasks [21, 9, 33, 5, 28, 24]. The most accurate CNNs usually have hundreds of layers and thousands of channels [9, 34, 32, 40], thus requiring computation at billions of FLOPs. This report examines the opposite extreme: pursuing the best accuracy in very limited computational budgets at tens or hundreds of MFLOPs, focusing on common mobile platforms such as drones, robots, and smartphones. Note that many existing works [16, 22, 43, 42, 38, 27] focus on pruning, compressing, or low-bit representing a basic network architecture.

Here we aim to explore a highly efficient basic architecture specially designed for our desired computing ranges.

We notice that state-of-the-art basic architectures such as Xception [3] and ResNeXt [40] become less efficient in extremely small networks because of the costly dense 1 × 1 convolutions. We propose using pointwise group convolutions to reduce the computation complexity of 1 × 1 convolutions. To overcome the side effects brought by group convolutions, we come up with a novel channel shuffle operation to help information flow across feature channels. Based on the two techniques, we build a highly efficient architecture called ShuffleNet. Compared with popular structures like [30, 9, 40], for a given computation complexity budget, our ShuffleNet allows more feature map channels, which helps to encode more information and is especially critical to the performance of very small networks.

We evaluate our models on the challenging ImageNet classification [4, 29] and MS COCO object detection [23] tasks.

A series of controlled experiments shows the effectiveness of our design principles and the better performance over other structures. Compared with the state-of-the-art architecture MobileNet [12], ShuffleNet achieves superior performance by a significant margin, e.g., a lower ImageNet top-1 error in absolute terms at the level of 40 MFLOPs.

We also examine the speedup on real hardware, an off-the-shelf ARM-based computing core. The ShuffleNet model achieves 13× actual speedup (theoretical speedup is 18×) over AlexNet [21] while maintaining comparable accuracy.

Related Work

Efficient Model Designs. The last few years have seen the success of deep neural networks in computer vision tasks [21, 36, 28], in which model designs play an important role.

The increasing need to run high-quality deep neural networks on embedded devices encourages the study of efficient model designs [8]. For example, GoogLeNet [33] increases the depth of networks with much lower complexity compared to simply stacking convolution layers. SqueezeNet [14] reduces parameters and computation significantly while maintaining accuracy. ResNet [9, 10] utilizes the efficient bottleneck structure to achieve impressive performance. SENet [13] introduces an architectural unit that boosts performance at slight computation cost. Concurrent with us, a very recent work [46] employs reinforcement learning and model search to explore efficient model designs. The proposed mobile NASNet model achieves comparable ImageNet classification error to our counterpart ShuffleNet model (at 564 MFLOPs vs. 524 MFLOPs, respectively). But [46] does not report results on extremely tiny models (e.g., complexity less than 150 MFLOPs), nor does it evaluate the actual inference time on mobile devices.

Figure 1. Channel shuffle with two stacked group convolutions. GConv stands for group convolution. a) two stacked convolution layers with the same number of groups. Each output channel only relates to the input channels within the group. No cross talk; b) input and output channels are fully related when GConv2 takes data from different groups after GConv1; c) an equivalent implementation to b) using channel shuffle.

arXiv version dated 7 Dec 2017.

Group Convolution. The concept of group convolution, which was first introduced in AlexNet [21] for distributing the model over two GPUs, has been well demonstrated to be effective in ResNeXt [40].

Depthwise separable convolution proposed in Xception [3] generalizes the ideas of separable convolutions in the Inception series [34, 32]. Recently, MobileNet [12] utilizes depthwise separable convolutions and gains state-of-the-art results among lightweight models. Our work generalizes group convolution and depthwise separable convolution in a novel form.

Channel Shuffle Operation. To the best of our knowledge, the idea of a channel shuffle operation is rarely mentioned in previous work on efficient model design, although the CNN library cuda-convnet [20] supports a random sparse convolution layer, which is equivalent to a random channel shuffle followed by a group convolutional layer.
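To make the pairing of a channel shuffle with group convolutions concrete, here is a minimal NumPy sketch of the situation in Figure 1. It uses a fixed reshape-and-transpose permutation rather than a random one; that choice, the helper names `group_mask` and `channel_shuffle`, and the sizes (6 channels, 2 groups) are assumptions made for illustration and are not taken from the text above. Connectivity is tracked with 0/1 matrices instead of real convolution weights.

```python
import numpy as np

def group_mask(channels, groups):
    """0/1 (out, in) matrix: 1 where an output channel can see an input channel."""
    group_of = np.arange(channels) // (channels // groups)
    return (group_of[:, None] == group_of[None, :]).astype(int)

def channel_shuffle(x, groups):
    """Permute axis 0 so that each new group holds channels from every old group."""
    channels = x.shape[0]
    assert channels % groups == 0
    x = x.reshape(groups, channels // groups, *x.shape[1:])
    x = x.swapaxes(0, 1)  # interleave the groups
    return x.reshape(channels, *x.shape[2:])

CHANNELS, GROUPS = 6, 2  # illustrative sizes only
mask = group_mask(CHANNELS, GROUPS)

# Two stacked group convolutions, no shuffle: dependencies stay block-diagonal.
no_shuffle = (mask @ mask) > 0

# Shuffling the rows of the identity gives the permutation matrix of the shuffle.
perm = channel_shuffle(np.eye(CHANNELS, dtype=int), GROUPS)
with_shuffle = (mask @ perm @ mask) > 0

print("shuffled channel order:", channel_shuffle(np.arange(CHANNELS), GROUPS))
print("fully connected without shuffle:", no_shuffle.all())
print("fully connected with shuffle:   ", with_shuffle.all())
```

With these sizes the channel order becomes 0, 3, 1, 4, 2, 5, so every group of the second convolution receives channels from both groups of the first. Note that full connectivity after one shuffle requires at least as many channels per group as there are groups.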

The random shuffle operation in cuda-convnet has a different purpose and has seldom been exploited since. Very recently, another concurrent work [41] also adopts this idea for a two-stage convolution. However, [41] did not specifically investigate the effectiveness of channel shuffle itself and its usage in tiny model design.

Model Acceleration. This direction aims to accelerate inference while preserving the accuracy of a pre-trained model. Pruning network connections [6, 7] or channels [38] reduces redundant connections in a pre-trained model while maintaining performance. Quantization [31, 27, 39, 45, 44] and factorization [22, 16, 18, 37] are proposed in the literature to reduce redundancy in calculations to speed up inference.

Without modifying the parameters, optimized convolution algorithms implemented by FFT [25, 35] and other methods [2] decrease time consumption in practice. Distilling [11] transfers knowledge from large models into small ones, which makes training small models easier.

Channel Shuffle for Group Convolutions

Modern convolutional neural networks [30, 33, 34, 32, 9, 10] usually consist of repeated building blocks with the same structure. Among them, state-of-the-art networks such as Xception [3] and ResNeXt [40] introduce efficient depthwise separable convolutions or group convolutions into the building blocks to strike an excellent trade-off between representation capability and computational cost. However, we notice that both designs do not fully take the 1 × 1 convolutions (also called pointwise convolutions in [12]) into account, which require considerable complexity. For example, in ResNeXt [40] only 3 × 3 layers are equipped with group convolutions. As a result, for each residual unit in ResNeXt the pointwise convolutions occupy most of the multiplication-adds (cardinality = 32 as suggested in [40]). In tiny networks, expensive pointwise convolutions result in a limited number of channels to meet the complexity constraint, which might significantly damage the accuracy.

Figure 2. ShuffleNet Units. a) bottleneck unit [9] with depthwise convolution (DWConv) [3, 12]; b) ShuffleNet unit with pointwise group convolution (GConv) and channel shuffle; c) ShuffleNet unit with stride = 2.

To address the issue, a straightforward solution is to apply channel sparse connections, for example group convolutions, also on 1 × 1 layers.
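The cost argument above can be checked with a short back-of-the-envelope calculation. The sketch below counts multiply-adds per output position for a 1 × 1, 3 × 3, 1 × 1 bottleneck in which only the 3 × 3 layer is grouped. The channel sizes (256 outer, 128 inner) and the group number 4 for the 1 × 1 layers are assumptions chosen for illustration, and the function names are hypothetical; only the cardinality of 32 comes from the text.

```python
def conv_madds(c_in, c_out, kernel, groups=1):
    """Multiply-adds per output spatial position for a (group) convolution."""
    assert c_in % groups == 0 and c_out % groups == 0
    return kernel * kernel * (c_in // groups) * c_out

def bottleneck_madds(c, m, groups_3x3, groups_1x1=1):
    """Cost split of a 1x1 -> 3x3 -> 1x1 unit with c outer and m inner channels."""
    pointwise = conv_madds(c, m, 1, groups_1x1) + conv_madds(m, c, 1, groups_1x1)
    spatial = conv_madds(m, m, 3, groups_3x3)
    return pointwise, spatial

# Assumed sizes (not given in the text): 256 outer and 128 bottleneck channels.
C, M, CARDINALITY = 256, 128, 32

pointwise, spatial = bottleneck_madds(C, M, CARDINALITY)
pointwise_share = pointwise / (pointwise + spatial)
print(f"share of multiply-adds in dense 1x1 layers: {pointwise_share:.3f}")

# Grouping the 1x1 layers as well (hypothetical g = 4) divides their cost by g.
grouped_pointwise, _ = bottleneck_madds(C, M, CARDINALITY, groups_1x1=4)
print(f"1x1 multiply-adds, dense vs. grouped: {pointwise} vs. {grouped_pointwise}")
```

Under these assumed sizes the dense 1 × 1 layers account for roughly 93% of the unit's multiply-adds, and grouping them with g = 4 cuts their cost to a quarter, which is the saving that can be spent on more channels within the same budget.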
