
Abstract arXiv:1611.05431v2 [cs.CV] 11 Apr 2017

Aggregated Residual Transformations for Deep Neural Networks

Saining Xie (1), Ross Girshick (2), Piotr Dollár (2), Zhuowen Tu (1), Kaiming He (2)
(1) UC San Diego, (2) Facebook AI Research

Abstract

We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call cardinality (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy.




Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, named ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart. The code and models are publicly available.

Introduction

Research on visual recognition is undergoing a transition from feature engineering to network engineering [25, 24, 44, 34, 36, 38, 14]. In contrast to traditional hand-designed features (e.g., SIFT [29] and HOG [5]), features learned by neural networks from large-scale data [33] require minimal human involvement during training, and can be transferred to a variety of recognition tasks [7, 10, 28].

Nevertheless, human effort has been shifted to designing better network architectures for learning representations. Designing architectures becomes increasingly difficult with the growing number of hyper-parameters (width, filter sizes, strides, etc.), especially when there are many layers. The VGG-nets [36] exhibit a simple yet effective strategy of constructing very deep networks: stacking building blocks of the same shape. This strategy is inherited by ResNets [14], which stack modules of the same topology. This simple rule reduces the free choices of hyper-parameters, and depth is exposed as an essential dimension in neural networks. Moreover, we argue that the simplicity of this rule may reduce the risk of over-adapting the hyper-parameters to a specific dataset. The robustness of VGG-nets and ResNets has been proven by various visual recognition tasks [7, 10, 9, 28, 31, 14] and by non-visual tasks involving speech [42, 30] and language [4, 41, 20].

Footnote: Width refers to the number of channels in a layer.

[Figure 1. Left: A block of ResNet [14]. Right: A block of ResNeXt with cardinality = 32, with roughly the same complexity. A layer is shown as (# in channels, filter size, # out channels). Left block layers: (256, 1×1, 64), (64, 3×3, 64), (64, 1×1, 256). Right block: 32 paths, each with layers (256, 1×1, 4), (4, 3×3, 4), (4, 1×1, 256), summed and added to the 256-d input.]

Unlike VGG-nets, the family of Inception models [38, 17, 39, 37] have demonstrated that carefully designed topologies are able to achieve compelling accuracy with low theoretical complexity.
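The "roughly the same complexity" statement in the Figure 1 caption can be checked by counting convolution weights. The sketch below uses the layer shapes listed for Figure 1; the function names are hypothetical, and biases and batch-normalization parameters are ignored as a simplifying assumption.

```python
def conv_params(c_in, kernel, c_out):
    """Weights in a kernel x kernel convolution (biases and batch norm ignored)."""
    return c_in * kernel * kernel * c_out


def resnet_block_params(width=64, channels=256):
    """Bottleneck block of Figure 1 (left): 1x1 reduce, 3x3, 1x1 expand."""
    return (conv_params(channels, 1, width)
            + conv_params(width, 3, width)
            + conv_params(width, 1, channels))


def resnext_block_params(cardinality=32, path_width=4, channels=256):
    """Block of Figure 1 (right): `cardinality` identical narrow paths."""
    per_path = (conv_params(channels, 1, path_width)
                + conv_params(path_width, 3, path_width)
                + conv_params(path_width, 1, channels))
    return cardinality * per_path


print(resnet_block_params())   # 69632
print(resnext_block_params())  # 70144
```

Under these assumptions both blocks hold about 70k weights, which is consistent with the caption.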

The Inception models have evolved over time [38, 39], but an important common property is a split-transform-merge strategy. In an Inception module, the input is split into a few lower-dimensional embeddings (by 1×1 convolutions), transformed by a set of specialized filters (3×3, 5×5, etc.), and merged by concatenation. It can be shown that the solution space of this architecture is a strict subspace of the solution space of a single large layer (e.g., 5×5) operating on a high-dimensional embedding. The split-transform-merge behavior of Inception modules is expected to approach the representational power of large and dense layers, but at a considerably lower computational complexity.

Despite good accuracy, the realization of Inception models has been accompanied with a series of complicating factors: the filter numbers and sizes are tailored for each individual transformation, and the modules are customized stage-by-stage.

Although careful combinations of these components yield excellent neural network recipes, it is in general unclear how to adapt the Inception architectures to new datasets/tasks, especially when there are many factors and hyper-parameters to be designed.

In this paper, we present a simple architecture which adopts VGG/ResNets' strategy of repeating layers, while exploiting the split-transform-merge strategy in an easy, extensible way. A module in our network performs a set of transformations, each on a low-dimensional embedding, whose outputs are aggregated by summation. We pursue a simple realization of this idea: the transformations to be aggregated are all of the same topology (e.g., Fig. 1 (right)).
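The module described above (a set of same-topology transformations on low-dimensional embeddings, aggregated by summation) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: fully connected maps stand in for the convolutions, the ReLU placement and the identity shortcut are assumptions carried over from the ResNet-style block, and all names (`make_paths`, `aggregated_block`) are hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, v) for v in vector]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def make_paths(cardinality, channels, path_width, seed=0):
    """Build `cardinality` paths that all share one topology:
    reduce (channels -> path_width), transform, expand (path_width -> channels)."""
    rng = random.Random(seed)
    return [(random_matrix(path_width, channels, rng),
             random_matrix(path_width, path_width, rng),
             random_matrix(channels, path_width, rng))
            for _ in range(cardinality)]


def aggregated_block(x, paths):
    """Sum the outputs of all paths and add the input (assumed identity shortcut)."""
    out = list(x)
    for reduce, transform, expand in paths:
        h = relu(matvec(reduce, x))        # low-dimensional embedding
        h = relu(matvec(transform, h))     # per-path transformation
        y = matvec(expand, h)              # back to the input dimension
        out = [o + yi for o, yi in zip(out, y)]
    return out


x = [1.0] * 16
y = aggregated_block(x, make_paths(cardinality=4, channels=16, path_width=2))
print(len(y))  # 16
```

Because every path has the same shape, changing the number of paths only changes the `cardinality` argument; no per-path design is needed.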

This design allows us to extend to any large number of transformations without specialized designs.

Interestingly, under this simplified situation we show that our model has two other equivalent forms (Fig. 3). The reformulation in Fig. 3(b) appears similar to the Inception-ResNet module [37] in that it concatenates multiple paths; but our module differs from all existing Inception modules in that all our paths share the same topology and thus the number of paths can be easily isolated as a factor to be investigated. In a more succinct reformulation, our module can be reshaped by Krizhevsky et al.'s grouped convolutions [24] (Fig. 3(c)), which, however, had been developed as an engineering compromise.

We empirically demonstrate that our aggregated transformations outperform the original ResNet module, even under the restricted condition of maintaining computational complexity and model size: e.g., Fig. 1(right) is designed to keep the FLOPs complexity and number of parameters of Fig. 1(left). We emphasize that while it is relatively easy to increase accuracy by increasing capacity (going deeper or wider), methods that increase accuracy while maintaining (or reducing) complexity are rare in the literature.

Our method indicates that cardinality (the size of the set of transformations) is a concrete, measurable dimension that is of central importance, in addition to the dimensions of width and depth. Experiments demonstrate that increasing cardinality is a more effective way of gaining accuracy than going deeper or wider, especially when depth and width start to give diminishing returns for existing models.

Our neural networks, named ResNeXt (suggesting the next dimension), outperform ResNet-101/152 [14], ResNet-200 [15], Inception-v3 [39], and Inception-ResNet-v2 [37] on the ImageNet classification dataset. In particular, a 101-layer ResNeXt is able to achieve better accuracy than ResNet-200 [15] but has only 50% complexity.
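One reason the summation form and a concatenation form can be equivalent is plain linear algebra: summing per-path projections gives the same result as concatenating the path outputs and applying one wide projection. The sketch below checks this numerically. It assumes that Fig. 3(b), which is not reproduced here, concatenates path outputs before a single projection; the function names and sizes are illustrative only.

```python
import random


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def aggregate_by_sum(hidden, expand):
    """Project each path output separately, then sum the results."""
    out = [0.0] * len(expand[0])
    for h, w in zip(hidden, expand):
        out = [o + y for o, y in zip(out, matvec(w, h))]
    return out


def aggregate_by_concat(hidden, expand):
    """Concatenate path outputs, then apply one wide projection whose
    rows are the per-path projection rows placed side by side."""
    concat_h = [v for h in hidden for v in h]
    wide = [[w for path in expand for w in path[row]]
            for row in range(len(expand[0]))]
    return matvec(wide, concat_h)


rng = random.Random(0)
cardinality, path_width, channels = 4, 2, 6
hidden = [[rng.random() for _ in range(path_width)] for _ in range(cardinality)]
expand = [[[rng.random() for _ in range(path_width)] for _ in range(channels)]
          for _ in range(cardinality)]

a = aggregate_by_sum(hidden, expand)
b = aggregate_by_concat(hidden, expand)
print(all(abs(p - q) < 1e-9 for p, q in zip(a, b)))  # True
```

The two results agree up to floating-point rounding, since only the order of the additions differs.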

Moreover, ResNeXt exhibits considerably simpler designs than all Inception models. ResNeXt was the foundation of our submission to the ILSVRC 2016 classification task, in which we secured second place. This paper further evaluates ResNeXt on a larger ImageNet-5K set and the COCO object detection dataset [27], showing consistently better accuracy than its ResNet counterparts. We expect that ResNeXt will also generalize well to other visual (and non-visual) recognition tasks.

Related Work

Multi-branch convolutional networks. The Inception models [38, 17, 39, 37] are successful multi-branch architectures where each branch is carefully customized. ResNets [14] can be thought of as two-branch networks where one branch is the identity mapping.

Deep neural decision forests [22] are tree-patterned multi-branch networks with learned splitting functions.

Grouped convolutions. The use of grouped convolutions dates back to the AlexNet paper [24], if not earlier. The motivation given by Krizhevsky et al. [24] is for distributing the model over two GPUs. Grouped convolutions are supported by Caffe [19], Torch [3], and other libraries, mainly for compatibility with AlexNet. To the best of our knowledge, there has been little evidence on exploiting grouped convolutions to improve accuracy. A special case of grouped convolutions is channel-wise convolutions, in which the number of groups is equal to the number of channels. Channel-wise convolutions are part of the separable convolutions in [35].
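To make the grouped and channel-wise cases concrete, the following sketch counts the weights of a convolution as the number of groups grows. The helper name `grouped_conv_params` is hypothetical and the 128-channel, 3×3 layer is an illustrative choice (32 groups of 4 channels, echoing the widths in Figure 1); biases are ignored.

```python
def grouped_conv_params(c_in, c_out, kernel, groups):
    """Weights in a grouped convolution: each group connects
    c_in/groups input channels to c_out/groups output channels."""
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both channel counts")
    return groups * (c_in // groups) * (c_out // groups) * kernel * kernel


dense = grouped_conv_params(128, 128, 3, groups=1)           # ordinary convolution
grouped = grouped_conv_params(128, 128, 3, groups=32)        # 32 groups of 4 channels
channel_wise = grouped_conv_params(128, 128, 3, groups=128)  # one group per channel

print(dense, grouped, channel_wise)  # 147456 4608 1152
```

With the input and output widths fixed, the weight count falls in proportion to the number of groups, and the channel-wise case is the limit where each filter sees a single channel.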

