SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks

Lingxiao Yang (1,2,3), Ru-Yuan Zhang (4,5), Lida Li (6), Xiaohua Xie (1,2,3)

Abstract

In this paper, we propose a conceptually simple but very effective attention module for Convolutional Neural Networks (ConvNets). In contrast to existing channel-wise and spatial-wise attention modules, our module instead infers 3-D attention weights for the feature map in a layer without adding parameters to the original networks. Specifically, drawing on some well-known neuroscience theories, we propose to optimize an energy function to find the importance of each neuron. We further derive a fast closed-form solution for the energy function, and show that the solution can be implemented in less than ten lines of code. Another advantage of the module is that most of the operators are selected based on the solution to the defined energy function, avoiding much effort on structure tuning. Quantitative evaluations on various visual tasks demonstrate that the proposed module is flexible and effective in improving the representation ability of many ConvNets. Our code is publicly available.

Affiliations: (1) School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; (2) Guangdong Province Key Laboratory of Information Security Technology, Sun Yat-sen University, Guangzhou, China; (3) Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, Sun Yat-sen University, Guangzhou, China; (4) Institute of Psychology and Behavioral Science, Shanghai Jiao Tong University, Shanghai, China; (5) Shanghai Key Laboratory of Psychotic Disorders, Shanghai Mental Health Center, Shanghai Jiao Tong University, Shanghai, China; (6) The Hong Kong Polytechnic University, Hong Kong, China. Correspondence to: Xiaohua Xie. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Figure 1. Visualization of feature activations obtained by different networks (ResNet-50, + SE, + SimAM) on examples labeled steel_arch_bridges, mountain_tent, lipstick, and grey_whale. All compared networks are trained on ImageNet (Russakovsky et al., 2015) under a consistent setting. The features are extracted on the validation set and shown by Grad-CAM (Selvaraju et al., 2017). Our SimAM helps the network focus on some primary regions which are close to the image labels shown above.

1. Introduction

Convolutional Neural Networks (ConvNets) trained on large-scale datasets (e.g., ImageNet (Russakovsky et al., 2015)) have greatly boosted the performance on many vision tasks, such as image classification (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; He et al., 2016b; Huang et al., 2017; Szegedy et al., 2015; Sandler et al., 2018), object detection (Ren et al., 2015; Liu et al., 2016; He et al., 2017), and video understanding (Feichtenhofer et al., 2016; Wang et al., 2018a). Multiple studies have demonstrated that a better ConvNet structure can significantly improve performance on various problems.

Therefore, constructing a strong ConvNet is an essential task in vision research. A modern ConvNet typically has multiple stages, and each stage consists of a few blocks. Such a block is constructed from several operators like convolution, pooling, activation, or some customized meta-structure (referred to as a module in this paper). Recently, instead of designing the whole architecture as in (Krizhevsky et al., 2012), many works focus on building advanced blocks to improve the representational power of ConvNets. Stacked convolutions (Simonyan & Zisserman, 2014), residual units (He et al., 2016b;a; Zagoruyko & Komodakis, 2016; Sandler et al., 2018), and dense connections (Huang et al., 2017; 2018) are the most representative ones that have been widely applied in existing architectures. However, designing those blocks requires rich expert knowledge and enormous time.

To circumvent this, many researchers turn to search strategies that automatically build architectures (Zoph & Le, 2016; Liu et al., 2018b; Dong & Yang, 2019; Tan & Le, 2019; Guo et al., 2020; Liu et al., 2019; Feichtenhofer, 2020; Tan et al., 2020). Besides designing sophisticated blocks, another line of research focuses on building plug-and-play modules (Hu et al., 2018b; Woo et al., 2018; Cao et al., 2020; Lee et al., 2019; Wang et al., 2020; Yang et al., 2020) that can refine convolutional outputs within a block and enable the whole network to learn more informative features. For example, the Squeeze-and-Excitation (SE) module (Hu et al., 2018b) allows a network to capture task-relevant features (see mountain_tent in Figure 1) and suppress many background activations (see steel_arch_bridges in Figure 1).
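To make such a channel-wise plug-and-play module concrete, the following is a minimal PyTorch sketch of an SE-style block, assuming the commonly used squeeze (global average pooling) and excitation (bottleneck MLP) layout; the reduction ratio of 16 and the exact layer arrangement are conventions assumed here rather than details given in this excerpt. Note that the resulting weights vary only along the channel dimension, which is the limitation discussed next.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel-wise attention in the spirit of Squeeze-and-Excitation
    (Hu et al., 2018b). The reduction ratio of 16 is a common choice,
    not a value taken from this paper."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pooling
        self.fc = nn.Sequential(                       # excitation: bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)  # one weight per channel
        return x * w                                   # rescales channels only, not spatial positions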

This module is independent of network architecture, and therefore can be plugged into a broad range of networks such as VGG (Simonyan & Zisserman, 2014), ResNets (He et al., 2016b), and ResNeXts (Xie et al., 2017). More recently, the SE module has been included as a component in AutoML to search for better network structures (Howard et al., 2019; Tan & Le, 2019). However, existing attention modules have two problems. First, they can only refine features along either the channel or the spatial dimension, limiting their flexibility to learn attention weights that vary across both channel and space. Second, their structures are built from a series of complex factors, e.g., the choice of pooling. We address these issues by proposing a module based on well-established neuroscience theories. Specifically, to make the network learn more discriminative neurons, we propose to directly infer 3-D weights (i.e., considering both spatial and channel dimensions) from the current neurons and then in turn refine those neurons.

To efficiently infer such 3-D weights, we define an energy function guided by knowledge from neuroscience and derive a closed-form solution. As shown in Figure 1, our module helps the network capture many valuable cues which are consistent with the image labels (see the examples of mountain_tent and grey_whale). Moreover, most of the operators used in our module are obtained from the solution to the energy function without other bells and whistles. It is worth emphasizing that we mainly focus on a small plug-and-play module rather than a new architecture beyond existing ConvNets. One previous study (Wang et al., 2017) also attempts to infer 3-D weights. Their promising results are based on a hand-crafted encoder-decoder structure. Compared to that study, our work provides an alternative and efficient way to generate 3-D weights.
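To make the description concrete, below is a hedged PyTorch sketch of what such a parameter-free 3-D attention step can look like. The closed-form expression mirrors the solution the authors report in the full paper (it is not derived in this excerpt), and the regularization value lambda_ = 1e-4 is an assumed default rather than a prescription.

import torch

def simam(x: torch.Tensor, lambda_: float = 1e-4) -> torch.Tensor:
    """Parameter-free attention: one weight per neuron, varying over both
    channels and spatial positions (a full C x H x W weight tensor).
    The exact expression and lambda_ are assumptions within this excerpt."""
    n = x.shape[2] * x.shape[3] - 1                     # neurons per channel, excluding the target one
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # squared distance to the channel mean
    v = d.sum(dim=(2, 3), keepdim=True) / n             # per-channel variance estimate
    e_inv = d / (4 * (v + lambda_)) + 0.5               # inverse energy: larger = more distinctive neuron
    return x * torch.sigmoid(e_inv)                     # refine features with the 3-D weights

# Toy usage: shapes are preserved and no learnable parameters are introduced.
x = torch.randn(8, 64, 32, 32)
y = simam(x)   # y.shape == (8, 64, 32, 32)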

Our module is more flexible and modularized, and still remains lightweight. To sum up, our main contributions are:

- Inspired by the attention mechanisms in the human brain, we propose an attention module with full 3-D weights and design an energy function to calculate the weights.
- We derive a closed-form solution of the energy function that speeds up the weight calculation and allows for a lightweight form of the whole module.
- We integrate the proposed module into some well-known networks and evaluate them on various tasks (see the sketch below). Our module performs favourably against other popular modules in terms of accuracy, model size, and speed.
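The following is a hypothetical sketch of what such an integration can look like: a ResNet-style basic block in PyTorch with a plug-and-play attention step applied to the residual branch before the shortcut addition. The class name, the placement of the attention call, and the identity default are illustrative assumptions; either of the sketches above (the SEBlock module or the simam function) could be passed in as attention.

import torch
import torch.nn as nn

class BasicBlockWithAttention(nn.Module):
    """ResNet-style basic block with an optional attention module on the
    residual branch (placing it before the addition is an assumption)."""
    def __init__(self, channels, attention=None):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.attention = attention if attention is not None else nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.attention(out)    # refine the residual features
        return self.relu(out + x)    # shortcut connection

# Toy usage: the block keeps channel count and spatial size unchanged.
block = BasicBlockWithAttention(channels=64)
y = block(torch.randn(1, 64, 56, 56))   # y.shape == (1, 64, 56, 56)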

2. Related Work

In this section, we briefly discuss representative works on network architectures and plug-and-play attention modules.

Network architectures. In 2012, a modern deep ConvNet, AlexNet (Krizhevsky et al., 2012), was released for large-scale image classification. It is a simple feedforward structure similar to the setup of LeNet (LeCun et al., 1998). After that, multiple approaches have been proposed to strengthen the power of ConvNets. Some works focus on finding the optimal filter shapes (Zeiler & Fergus, 2014; Chatfield et al., 2014), and some other methods attempt to design much deeper networks. For example, VGG (Simonyan & Zisserman, 2014) and Inception Net (Szegedy et al., 2015) use stacked convolutions to reduce the risk of gradient vanishing/exploding (Bengio et al., 1994; Glorot & Bengio, 2010). Following this up, ResNet (He et al., 2016b) and the Highway network (Srivastava et al., 2015) add shortcut connections from input to output within each block. The shortcut connections enable ConvNets to scale up to hundreds of layers.

Their results reveal that increasing network depth can substantially boost the representational power of a ConvNet. Besides network depth, some works propose to increase the number of filters (Zagoruyko & Komodakis, 2016) for wider blocks, to add more connections within each block (Huang et al., 2017), or to explore group/depth-wise convolutions (Xie et al., 2017; Chollet, 2017). More recently, a number of works use AutoML (Zoph & Le, 2016; Liu et al., 2018b;a; Tan et al., 2019; Howard et al., 2019; Wu et al., 2019) to save the manual effort in network design. Different from the works mentioned above, we aim at designing a lightweight plug-and-play module. This module can be adopted in many ConvNets to further boost their performance on various tasks without big changes in network architectures.

Feature recalibration and attention modules. Many works also design computational modules that refine feature maps.

