Selective Kernel Networks

Xiang Li 1,2, Wenhai Wang 3,2, Xiaolin Hu 4 and Jian Yang 1
1 PCALab, Nanjing University of Science and Technology
2 Momenta
3 Nanjing University
4 Tsinghua University

Abstract

In standard Convolutional Neural Networks (CNNs), the receptive fields of artificial neurons in each layer are designed to share the same size. It is well known in the neuroscience community that the receptive field size of visual cortical neurons is modulated by the stimulus, which has rarely been considered in constructing CNNs. We propose a dynamic selection mechanism in CNNs that allows each neuron to adaptively adjust its receptive field size based on multiple scales of input information. A building block called the Selective Kernel (SK) unit is designed, in which multiple branches with different kernel sizes are fused using softmax attention that is guided by the information in these branches. Different attentions on these branches yield different sizes of the effective receptive fields of neurons in the fusion layer.

Multiple SK units are stacked into a deep network termed Selective Kernel Networks (SKNets). On the ImageNet and CIFAR benchmarks, we empirically show that SKNet outperforms the existing state-of-the-art architectures with lower model complexity. Detailed analyses show that the neurons in SKNet can capture target objects with different scales, which verifies the capability of neurons for adaptively adjusting their receptive field sizes according to the input. The code and models are available.

Xiang Li and Jian Yang are with PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology, China. Xiang Li is also a visiting scholar at Momenta. Email:

Wenhai Wang is with National Key Lab for Novel Software Technology, Nanjing University.

He was a research intern at Momenta.

Xiaolin Hu is with the Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Computer Science and Technology, Tsinghua University, China. Corresponding

Introduction

The local receptive fields (RFs) of neurons in the primary visual cortex (V1) of cats [14] inspired the construction of Convolutional Neural Networks (CNNs) [26] in the last century, and they continue to inspire modern CNN structure construction. For instance, it is well known that in the visual cortex, the RF sizes of neurons in the same area (e.g., V1 region) are different, which enables the neurons to collect multi-scale spatial information in the same processing stage. This mechanism has been widely adopted in recent CNNs. A typical example is InceptionNets [42, 15, 43, 41], in which a simple concatenation is designed to aggregate multi-scale information from, e.g., 3×3, 5×5 and 7×7 convolutional kernels inside the inception building block. However, some other RF properties of cortical neurons have not been emphasized in designing CNNs, and one such property is the adaptive changing of RF size.

Considerable experimental evidence suggests that the RF sizes of neurons in the visual cortex are not fixed, but modulated by the stimulus. The classical RFs (CRFs) of neurons in the V1 region were discovered by Hubel and Wiesel [14], as determined by single oriented bars. Later, many studies (e.g., [30]) found that stimuli outside the CRF also affect the responses of neurons. The neurons are said to have non-classical RFs (nCRFs). In addition, the size of the nCRF is related to the contrast of the stimulus: the smaller the contrast, the larger the effective nCRF size [37]. Surprisingly, after the nCRF is stimulated for a period of time, the CRF of the neuron is also enlarged once these stimuli are removed [33]. All of these experiments suggest that the RF sizes of neurons are not fixed but modulated by the stimulus [38].

Unfortunately, this property has not received much attention in constructing deep learning models. Models with multi-scale information in the same layer, such as InceptionNets, have an inherent mechanism to adjust the RF size of neurons in the next convolutional layer according to the contents of the input, because the next convolutional layer linearly aggregates multi-scale information from different branches.
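The linear aggregation described above can be checked numerically. The following NumPy sketch assumes two hypothetical branch outputs and a 1×1 convolution as the next layer (all shapes and variable names are illustrative, not taken from the paper). It shows that the next layer mixes the branches with weights that are fixed once trained, independent of the input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outputs of two branches (e.g. 3x3 and 5x5 paths), shape (C, H, W).
branch_a = rng.standard_normal((4, 8, 8))
branch_b = rng.standard_normal((6, 8, 8))

# Concatenate along the channel axis, as in an Inception-style block.
concat = np.concatenate([branch_a, branch_b], axis=0)  # (10, 8, 8)

# Assumption: the next layer is a 1x1 convolution, i.e. a per-pixel
# linear map over channels with weights of shape (C_out, C_in).
w = rng.standard_normal((5, 10))
out = np.einsum("oc,chw->ohw", w, concat)

# The same output written as a sum of per-branch linear maps. The mixing
# weights w[:, :4] and w[:, 4:] do not depend on the input.
out_split = (np.einsum("oc,chw->ohw", w[:, :4], branch_a)
             + np.einsum("oc,chw->ohw", w[:, 4:], branch_b))

linear_aggregation_matches = np.allclose(out, out_split)
print(linear_aggregation_matches)
```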

But that linear aggregation approach may be insufficient to provide neurons with powerful adaptation ability.

In this paper, we present a nonlinear approach to aggregate information from multiple kernels to realize the adaptive RF sizes of neurons. We introduce a Selective Kernel (SK) convolution, which consists of a triplet of operators: Split, Fuse and Select. The Split operator generates multiple paths with various kernel sizes, which correspond to different RF sizes of neurons. The Fuse operator combines and aggregates the information from multiple paths to obtain a global and comprehensive representation for selection weights. The Select operator aggregates the feature maps of differently sized kernels according to the selection weights.

SK convolutions can be computationally lightweight and impose only a slight increase in parameter and computational cost. We show that on the ImageNet 2012 dataset [35] SKNets are superior to the previous state-of-the-art models with similar model complexity.
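The Fuse and Select operators described above can be illustrated with a minimal NumPy sketch. This section does not spell out the exact operations, so several details here are assumptions: the global descriptor is taken to be a spatial average of the summed branches, the compact representation is a ReLU-activated linear projection, and each branch gets its own projection to per-channel logits. The function name `sk_fuse_select` and the weights `w_reduce` and `w_branch` are hypothetical, and random arrays stand in for the Split outputs and for learned parameters.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sk_fuse_select(branches, w_reduce, w_branch):
    """Fuse and Select over M branch feature maps.

    branches: (M, C, H, W) outputs of the Split step (different kernel sizes).
    w_reduce: (d, C) hypothetical projection to a compact descriptor.
    w_branch: (M, C, d) hypothetical per-branch projections to logits.
    """
    fused = branches.sum(axis=0)        # Fuse: element-wise summation, (C, H, W)
    s = fused.mean(axis=(1, 2))         # assumed global descriptor, (C,)
    z = np.maximum(w_reduce @ s, 0.0)   # assumed ReLU projection, (d,)
    logits = w_branch @ z               # (M, C)
    attn = softmax(logits, axis=0)      # softmax across branches, per channel
    # Select: element-wise product with the weights, then sum over branches.
    out = (attn[:, :, None, None] * branches).sum(axis=0)
    return out, attn

rng = np.random.default_rng(0)
M, C, H, W, d = 2, 8, 5, 5, 4
branches = rng.standard_normal((M, C, H, W))  # stand-in for 3x3 / 5x5 paths
w_reduce = rng.standard_normal((d, C))
w_branch = rng.standard_normal((M, C, d))

out, attn = sk_fuse_select(branches, w_reduce, w_branch)
print(out.shape, attn.sum(axis=0))
```

Because the selection weights are computed from the input itself and sum to one across branches for every channel, the mix of kernel sizes changes with the input, unlike the fixed weights of a plain linear aggregation.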

Based on SKNet-50, we find the best settings for SK convolution and show the contribution of each component. To demonstrate their general applicability, we also provide compelling results on smaller datasets, CIFAR-10 and 100 [22], and successfully embed SK into small models (e.g., ShuffleNetV2 [27]).

To verify that the proposed model does have the ability to adjust neurons' RF sizes, we simulate the stimulus by enlarging the target object in natural images and shrinking the background to keep the image size unchanged. It is found that most neurons collect more and more information from the larger kernel path as the target object becomes larger and larger. These results suggest that the neurons in the proposed SKNet have adaptive RF sizes, which may underlie the model's superior performance in object recognition.

Related Work

Multi-branch convolutional networks. Highway networks [39] introduce bypassing paths along with gating units. The two-branch architecture eases the difficulty of training networks with hundreds of layers.

The idea is also used in ResNet [9, 10], but the bypassing path is the pure identity mapping. Besides the identity mapping, the shake-shake networks [7] and multi-residual networks [1] extend the major transformation with more identical paths. The deep neural decision forests [21] form the tree-structural multi-branch principle with learned splitting functions. FractalNets [25] and Multilevel ResNets [52] are designed in such a way that the multiple paths can be expanded fractally and recursively. The InceptionNets [42, 15, 43, 41] carefully configure each branch with customized kernel filters, in order to aggregate more informative and multifarious features. Please note that the proposed SKNets follow the idea of InceptionNets with various filters for multiple branches, but differ in at least two important aspects: 1) the schemes of SKNets are much simpler, without heavy customized design, and 2) an adaptive selection mechanism for these multiple branches is utilized to realize adaptive RF sizes of neurons.

Grouped convolutions are becoming popular due to their low computational cost.

Denote the number of groups by G; then both the number of parameters and the computational cost are divided by G, compared to the ordinary convolution. Grouped convolutions were first adopted in AlexNet [23] with the purpose of distributing the model over more GPU resources. Surprisingly, using grouped convolutions, ResNeXts [47] can also improve accuracy. This G is called "cardinality", which characterizes the model together with depth and width. Many compact models such as IGCV1 [53], IGCV2 [46] and IGCV3 [40] have been developed, based on interleaved grouped convolutions. A special case of grouped convolutions is depthwise convolution, where the number of groups is equal to the number of channels. Xception [3] and MobileNetV1 [11] introduce the depthwise separable convolution, which decomposes ordinary convolutions into depthwise convolution and pointwise convolution. The effectiveness of depthwise convolutions is validated in subsequent works such as MobileNetV2 [36] and ShuffleNet [54, 27].
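The parameter savings described above can be worked through with a short Python sketch. The helper `conv_params` and the channel counts are illustrative choices, not values from the paper; biases are ignored.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution with the given number of groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    # Each output channel only sees c_in / groups input channels.
    return (c_in // groups) * c_out * k * k

ordinary = conv_params(64, 64, 3)              # 64 * 64 * 9 = 36864
grouped = conv_params(64, 64, 3, groups=32)    # 36864 / 32 = 1152

# Depthwise convolution: the number of groups equals the number of channels.
depthwise = conv_params(64, 64, 3, groups=64)  # 576
pointwise = conv_params(64, 64, 1)             # 4096
separable = depthwise + pointwise              # 4672

print(ordinary, grouped, separable)
```

With 64 input and output channels, 32 groups cut the 3×3 weight count by exactly a factor of 32, and the depthwise separable decomposition needs roughly one eighth of the ordinary convolution's weights.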

Beyond grouped/depthwise convolutions, dilated convolutions [50, 51] support exponential expansion of the RF without loss of coverage. For example, a 3×3 convolution with dilation 2 can approximately cover the RF of a 5×5 filter, whilst consuming less than half of the computation and memory. In SK convolutions, the kernels of larger sizes (e.g., >1) are designed to be integrated with the grouped/depthwise/dilated convolutions, in order to avoid heavy overheads.

Recently, the benefits of the attention mechanism have been shown across a range of tasks, from neural machine translation [2] in natural language processing to image captioning [49] in image understanding. It biases the allocation of the most informative feature expressions [16, 17, 24, 28, 31] and simultaneously suppresses the less useful ones. Attention has been widely used in recent applications such as person re-ID [4], image recovery [55], text abstraction [34] and lip reading [48]. To boost the performance of image classification, Wang et al. [44] propose a trunk-and-mask attention between intermediate stages of a CNN. An hourglass module is introduced to achieve the global emphasis across both spatial and channel dimensions. SENet [12] brings an effective, lightweight gating mechanism to self-recalibrate the feature map via channel-wise importances. Beyond channel, BAM [32] and CBAM [45] introduce spatial attention in a similar way. In contrast, our proposed SKNets are the first to explicitly focus on the adaptive RF size of neurons by introducing attention mechanisms.

[Figure 1. Selective Kernel convolution. Diagram labels: Split (Kernel 3x3, Kernel 5x5), Fuse, Select; softmax; element-wise summation; element-wise product.]

Spatial Transformer Networks [18] learn a parametric transformation to warp the feature map, which is considered difficult to train. Dynamic Filter [20] can only adaptively modify the parameters of filters, without adjusting the kernel size. Active Convolution [19] augments the sampling locations in the convolution with offsets.
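To make the earlier dilation example concrete (a 3×3 convolution with dilation 2 approximately covering the RF of a 5×5 filter), the Python sketch below uses the standard relation k + (k - 1)(d - 1) for the spatial extent of a dilated kernel. The helper name `effective_kernel_size` is illustrative and not from the paper, and the cost ratio counts only kernel weights per output position.

```python
def effective_kernel_size(k, dilation):
    """Spatial extent covered by a k x k kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)

k, dilation = 3, 2
extent = effective_kernel_size(k, dilation)  # 3 + 2 * 1 = 5

# Weights (and multiply-adds per output position) relative to a dense
# filter of the same extent: 9 / 25 = 0.36, i.e. less than half.
cost_ratio = (k * k) / (extent * extent)

print(extent, cost_ratio)
```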
