Transcription of Learning Deep Features for Discriminative Localization
Learning Deep Features for Discriminative Localization

Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba
Computer Science and Artificial Intelligence Laboratory

Abstract

In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks. Despite the apparent simplicity of global average pooling, we achieve a top-5 error for object localization on ILSVRC 2014 that is remarkably close to the top-5 error achieved by a fully supervised CNN approach.
We demonstrate that our network is able to localize the discriminative image regions on a variety of tasks despite not being trained for them.

Introduction

Recent work by Zhou et al. [33] has shown that the convolutional units of various layers of convolutional neural networks (CNNs) actually behave as object detectors even though no supervision on the location of the object is provided. Although the convolutional layers have this remarkable ability to localize objects, it is lost when fully-connected layers are used for classification. Recently, some popular fully-convolutional neural networks such as the Network in Network (NIN) [13] and GoogLeNet [24] have been proposed to avoid the use of fully-connected layers, minimizing the number of parameters while maintaining high performance.

In order to achieve this, [13] uses global average pooling, which acts as a structural regularizer, preventing overfitting during training.
In our experiments, we found that the advantages of this global average pooling layer extend beyond simply acting as a regularizer. In fact, with a little tweaking, the network can retain its remarkable localization ability until the final layer. This tweaking makes it easy to identify the discriminative image regions in a single forward pass for a wide variety of tasks, even those that the network was not originally trained for. As shown in Figure 1(a), a CNN trained on object categorization is successfully able to localize the discriminative regions for action classification as the objects that the humans are interacting with rather than the humans themselves.

Figure 1. A simple modification of the global average pooling layer combined with our class activation mapping (CAM) technique allows the classification-trained CNN to both classify the image and localize class-specific image regions in a single forward pass, e.g., the toothbrush for brushing teeth and the chainsaw for cutting trees. (Panel labels: Brushing teeth; Cutting trees.)

Despite the apparent simplicity of our approach, for weakly supervised object localization on the ILSVRC benchmark [20], our best network achieves a top-5 test error that is rather close to the top-5 test error achieved by fully supervised AlexNet [10]. Furthermore, we demonstrate that the localizability of the deep features in our approach can be easily transferred to other recognition datasets for tasks including generic classification and localization.

1 Our models are available at: [ ] 14 Dec 2015

Related Work

Convolutional Neural Networks (CNNs) have led to impressive performance on a variety of visual recognition tasks [10, 34, 8]. Recent work has shown that despite being trained on image-level labels, CNNs have the remarkable ability to localize objects [1, 16, 2, 15]. In this work, we show that, using the right architecture, we can generalize this ability beyond just localizing objects, to start identifying exactly which regions of an image are being used for discrimination.
Here, we discuss the two lines of work most related to this paper: weakly-supervised object localization and visualizing the internal representation of CNNs.

Weakly-supervised object localization: There have been a number of recent works exploring weakly-supervised object localization using CNNs [1, 16, 2, 15]. Bergamo et al. [1] propose a technique for self-taught object localization that masks out image regions to identify the regions causing the maximal activations in order to localize objects. Cinbis et al. [2] combine multiple-instance learning with CNN features to localize objects. Oquab et al. [15] propose a method for transferring mid-level image representations and show that some object localization can be achieved by evaluating the output of CNNs on multiple overlapping patches.
However, the authors do not actually evaluate the localization ability. Moreover, while these approaches yield promising results, they are not trained end-to-end and require multiple forward passes of a network to localize objects, making them difficult to scale to real-world datasets. Our approach is trained end-to-end and can localize objects in a single forward pass.

The most similar approach to ours is the work based on global max pooling by Oquab et al. [16]. Instead of global average pooling, they apply global max pooling to localize a point on objects. However, their localization is limited to a point lying in the boundary of the object rather than determining the full extent of the object.
We believe that while the max and average functions are rather similar, the use of average pooling encourages the network to identify the complete extent of the object. The basic intuition behind this is that the loss for average pooling benefits when the network identifies all discriminative regions of an object, as compared to max pooling. This is explained in greater detail and verified experimentally in a later section. Furthermore, unlike [16], we demonstrate that this localization ability is generic and can be observed even for problems that the network was not trained on.

We use class activation map to refer to the weighted activation maps generated for each image, as described in Section 2.
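The intuition about average versus max pooling can be illustrated with a minimal sketch. This is not the authors' code: the function names and the toy feature map are hypothetical, and plain Python is used to keep the example dependency-free. It compares how much each spatial position contributes to the pooled value, which is also how much gradient that position would receive.

```python
def global_avg_pool(fmap):
    """Mean over all spatial positions of a 2-D feature map (list of rows)."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


def global_max_pool(fmap):
    """Largest activation anywhere in the feature map."""
    return max(v for row in fmap for v in row)


def grad_avg_pool(fmap):
    """Derivative of the average w.r.t. each position: always 1/N."""
    n = sum(len(row) for row in fmap)
    return [[1.0 / n for _ in row] for row in fmap]


def grad_max_pool(fmap):
    """Derivative of the max: 1 at the first maximal position, 0 elsewhere."""
    peak = global_max_pool(fmap)
    grad = [[0.0 for _ in row] for row in fmap]
    for i, row in enumerate(fmap):
        for j, v in enumerate(row):
            if v == peak:
                grad[i][j] = 1.0
                return grad
    return grad


# Hypothetical activations: three strong responses (0.9, 0.8, 0.7) on one object.
toy_map = [[0.1, 0.9, 0.8],
           [0.0, 0.7, 0.2]]

avg_grad = grad_avg_pool(toy_map)   # every position gets an equal share
max_grad = grad_max_pool(toy_map)   # only the single peak gets any signal
```

Under average pooling every position influences the pooled score, so raising the response on any discriminative part of the object helps; under max pooling only the single strongest position does. This mirrors the stated intuition only and is not a substitute for the experimental verification referred to in the text.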
We would like to emphasize that while global average pooling is not a novel technique that we propose here, the observation that it can be applied for accurate discriminative localization is, to the best of our knowledge, unique to our work. We believe that the simplicity of this technique makes it portable, so that it can be applied to a variety of computer vision tasks for fast and accurate localization.

Visualizing CNNs: There have been a number of recent works [29, 14, 4, 33] that visualize the internal representation learned by CNNs in an attempt to better understand their properties. Zeiler et al. [29] use deconvolutional networks to visualize what patterns activate each unit.
Zhou et al. [33] show that CNNs learn object detectors while being trained to recognize scenes, and demonstrate that the same network can perform both scene recognition and object localization in a single forward pass. Both of these works only analyze the convolutional layers, ignoring the fully-connected layers, and thereby paint an incomplete picture of the full story. By removing the fully-connected layers while retaining most of the performance, we are able to understand our network from the beginning to the end.

The authors of [14] and Dosovitskiy et al. [4] analyze the visual encoding of CNNs by inverting deep features at different layers. While these approaches can invert the fully-connected layers, they only show what information is being preserved in the deep features without highlighting the relative importance of this information.
Unlike [14] and [4], our approach can highlight exactly which regions of an image are important for discrimination. Overall, our approach provides another glimpse into the soul of CNNs.

Class Activation Mapping

In this section, we describe the procedure for generating class activation maps (CAM) using global average pooling (GAP) in CNNs. A class activation map for a particular category indicates the discriminative image regions used by the CNN to identify that category (e.g., Fig. 3). The procedure for generating these maps is illustrated in an accompanying figure.

We use a network architecture similar to Network in Network [13] and GoogLeNet [24]: the network largely consists of convolutional layers, and just before the final output layer (softmax in the case of categorization), we perform global average pooling on the convolutional feature maps and use those as features for a fully-connected layer that produces the desired output (categorical or otherwise).
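The architecture described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation. It assumes that the "weighted activation maps" are formed by weighting each final convolutional feature map with the fully-connected weight that connects it to the chosen class; it also omits the bias and the softmax. All names and numbers are hypothetical, and plain Python is used to avoid dependencies.

```python
def global_average_pool(feature_maps):
    """Collapse each of the K feature maps (H x W nested lists) to its spatial mean."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def class_scores(pooled, weights):
    """Fully-connected layer without bias: weights[c][k] links unit k to class c."""
    return [sum(w * p for w, p in zip(class_w, pooled)) for class_w in weights]


def class_activation_map(feature_maps, class_w):
    """Weighted sum of the K feature maps using one class's weights (assumed form)."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for w, fmap in zip(class_w, feature_maps):
        for i in range(height):
            for j in range(width):
                cam[i][j] += w * fmap[i][j]
    return cam


# Toy input: K = 2 feature maps of size 2 x 2 and 2 classes (made-up values).
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],   # unit 0 fires in the top-left corner
    [[0.0, 0.0], [0.0, 2.0]],   # unit 1 fires in the bottom-right corner
]
weights = [
    [1.0, 0.0],   # class 0 relies on unit 0
    [0.0, 1.0],   # class 1 relies on unit 1
]

scores = class_scores(global_average_pool(feature_maps), weights)
cams = [class_activation_map(feature_maps, w) for w in weights]
```

In this toy setup the map for class 0 peaks in the top-left corner and the map for class 1 in the bottom-right corner. Because pooling and the weighted sum are both linear, the spatial mean of each map equals the corresponding class score, which is why the maps can be read as a spatial breakdown of the classification evidence under the stated assumption.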