Example: bankruptcy

Bilinear CNN Models for Fine-grained Visual Recognition

Bilinear CNN Models for Fine-grained Visual RecognitionTsung-Yu LinAruni RoyChowdhurySubhransu MajiUniversity of Massachusetts, propose Bilinear Models , a Recognition architecturethat consists of two feature extractors whose outputs aremultiplied using outer product at each location of the im-age and pooled to obtain an image descriptor. This archi-tecture can model local pairwise feature interactions in atranslationally invariant manner which is particularly use-ful for Fine-grained categorization. It also generalizes var-ious orderless texture descriptors such as the Fisher vec-tor, VLAD and O2P. We present experiments with bilinearmodels where the feature extractors are based on convolu-tional neural networks. The Bilinear form simplifies gra-dient computation and allows end-to-end training of bothnetworks using image labels only. Using networks initial-ized from the ImageNet dataset followed by domain spe-cific fine -tuning we obtain accuracy of the CUB-200-2011 dataset requiring only category labels at train-ing time.

feature extractor. The bilinear model also generalizes sev-eral widely used orderless texture descriptors such as the Bag-of-Visual-Words [8], VLAD [20], Fisher vector [28], ... stochastic gradient descent with mini-batches with weight decay and momentum as described in Sect 3.1.

Tags:

  Mini, Fine, Extractor

Information

Domain:

Source:

Link to this page:

Please notify us if you found a problem with this document:

Other abuse

Transcription of Bilinear CNN Models for Fine-grained Visual Recognition

1 Bilinear CNN Models for Fine-grained Visual RecognitionTsung-Yu LinAruni RoyChowdhurySubhransu MajiUniversity of Massachusetts, propose Bilinear Models , a Recognition architecturethat consists of two feature extractors whose outputs aremultiplied using outer product at each location of the im-age and pooled to obtain an image descriptor. This archi-tecture can model local pairwise feature interactions in atranslationally invariant manner which is particularly use-ful for Fine-grained categorization. It also generalizes var-ious orderless texture descriptors such as the Fisher vec-tor, VLAD and O2P. We present experiments with bilinearmodels where the feature extractors are based on convolu-tional neural networks. The Bilinear form simplifies gra-dient computation and allows end-to-end training of bothnetworks using image labels only. Using networks initial-ized from the ImageNet dataset followed by domain spe-cific fine -tuning we obtain accuracy of the CUB-200-2011 dataset requiring only category labels at train-ing time.

2 We present experiments and visualizations thatanalyze the effects of fine -tuning and the choice two net-works on the speed and accuracy of the Models . Resultsshow that the architecture compares favorably to the exist-ing state of the art on a number of Fine-grained datasetswhile being substantially simpler and easier to train. More-over, our most accurate model is fairly efficient runningat 8 frames/sec on a NVIDIA Tesla K40 GPU. The sourcecode for the complete system will be made available IntroductionFine-grained Recognition tasks such as identifying thespecies of a bird, or the model of an aircraft, are quitechallenging because the Visual differences between the cat-egories are small and can be easily overwhelmed by thosecaused by factors such as pose, viewpoint, or location of theobject in the image. For example, the inter-category vari-ation between Ringed-beak gull and a California gull due to the differences in the pattern on their beaks is signifi-cantly smaller than the inter-category variation on a popularfine-grained Recognition dataset for birds [37].

3 Bilinear vectorsoftmaxconvolutional + pooling layersCNN stream ACNN stream !sided!warblerFigure Bilinear CNN model for image time an image is passed through two CNNs, A and B, andtheir outputs are multiplied using outer product at each location ofthe image and pooled to obtain the Bilinear vector. This is passedthrough a classification layer to obtain common approach for robustness against these nui-sance factors is to first localize various parts of the objectand model the appearance conditioned on their detectedlocations. The parts are often defined manually and thepart detectors are trained in a supervised manner. Recentlyvariants of such Models based on convolutional neural net-works (CNNs) [2,38] have been shown to significantlyimprove over earlier work that relied on hand-crafted fea-tures [1,11,39]. A drawback of these approaches is thatannotating parts is significantly more challenging than col-lecting image labels.

4 Morevoer, manually defined parts maynot be optimal for the final Recognition approach is to use a robust image represen-tation. Traditionally these included descriptors such asVLAD [20] or Fisher vector [28] with SIFT features [25].By replacing SIFT by features extracted from convolu-tional layers of a deep network pre-trained on ImageNet [9],these Models achieve state-of-the-art results on a numberof Recognition tasks [7]. These Models capture local fea-ture interactions in a translationally invariant manner whichis particularly suitable for texture and Fine-grained recogni-tion tasks. Although these Models are easily applicable asthey don t rely on part annotations, their performance is be-low the best part-based Models , especially when objects aresmall and appear in clutter. Moreover, the effect of end-to-end training of such architectures has not been fully main contribution is a Recognition architecture thataddresses several drawbacks of both part-based and texturemodels ( ).

5 It consists of two feature ex-tractors based on CNNs whose outputs are multiplied usingthe outer product at each location of the image and pooledacross locations to obtain an image descriptor. The outerproduct captures pairwise correlations between the featurechannels and can model part-feature interactions, , if oneof the networks was a part detector and the other a localfeature extractor . The Bilinear model also generalizes sev-eral widely used orderless texture descriptors such as theBag-of- Visual -Words [8], VLAD [20], Fisher vector [28],and second-order pooling (O2P) [3]. Moreover, the archi-tecture can be easily trained end-to-end unlike these texturedescriptions leading to significant improvements in perfor-mance. Although we don t explore this connection further,our architecture is related to the two stream hypothesis ofvisual processing in the human brain [15] where there aretwo main pathways, or streams.

6 The ventral stream (or, what pathway ) is involved with object identification andrecognition. The dorsal stream (or, where pathway ) is in-volved with processing the object s spatial location relativeto the viewer. Since our model is linear in the outputs oftwo CNNs we call our approachbilinear are presented on Fine-grained datasets ofbirds, aircrafts, and cars ( ). We initialize various bi-linear architectures using Models trained on the ImageNet,in particular the M-Net of [5] and the verydeep network D-Net of [32]. Out of the box these networks do remark-ably well, , features from the penultimate layer of thesenetworks achieve and accuracy on the CUB-200-2011 dataset [37] respectively. fine -tuning improvesthe performance further to and In compari-son a fine -tuned Bilinear model consisting of a M-Net anda D-Net obtains accuracy, outperforming a numberof existing methods that additionally rely on object or partannotations ( , [21], or [2]).

7 We present ex-periments demonstrating the effect of fine -tuning on CNNbased Fisher vector Models [7], the computational and ac-curacy tradeoffs of various Bilinear CNN architectures, andways to break the symmetry in the Bilinear Models usinglow-dimensional projections. Finally, we present visualiza-tions of the Models in conclude in Related workBilinear Models were proposed by Tanenbaum and Free-man [33] to model two-factor variations, such as style and content , for images. While we also model two fac-tor variations arising out of part location and appearance,our goal is prediction. Our work is also related to bilinearclassifiers [29] that express the classifier as a product of twolow-rank matrices. However, in our model the features arebilinear, while the classifier itself is linear. Our reduced di-mensionality Models ( ) can be interpreted as bilin-ear classifiers. Two-stream architectures have been usedto analyze video where one networks Models the temporalaspect, while the other Models the spatial aspect [12,31].

8 Ours is a two-steam architecture for image number of recent techniques have proposed to useCNN features in an orderless pooling setting such as Fishervector [7], or VLAD [14]. We compare against these meth-ods. Two other contemporaneous works are of first is the hypercolumns of [17] that jointly con-siders the activations from all the convolutional layers ofa CNN allowing finer grained resolution for localizationtasks. However, they do not consider pairwise interactionsbetween these features. The second is the cross-layer pool-ing method of [24] that considers pairwise interactions be-tween features of adjacent layers of a single CNN. Our bi-linear model can be seen as a generalization of this approachusing separate CNNs simplifying gradient computation fordomain specific Bilinear Models for image classificationIn this section we introduce a general formulation of abilinear model for image classification and then describe aspecific instantiation of the model using CNNs.

9 We thenshow that various orderless pooling methods that are widelyused in computer vision can be written as Bilinear Bilinear modelBfor image classification consists of aquadrupleB=(fA,fB,P,C). HerefAandfBarefeaturefunctions,Pis apooling functionandCis aclassificationfunction. A feature function is a mappingf:L I !Rc Dthat takes an imageIand a locationLand outputs afeature of sizec D. We refer to locations generally whichcan include position and scale. The feature outputs are com-bined at each location using the matrix outer product, ,thebilinear featurecombination offAandfBat a locationlis given by Bilinear (l,I,fA,fB)=fA(l,I)TfB(l,I).Both fAandfBmust have the feature dimensionctobe compatible. The reason forc>1will become clearlater when we show that various texture descriptors can bewritten as Bilinear Models . To obtain an image descrip-tor the pooling functionPaggregates the Bilinear featureacross all locations in the image.

10 One choice of poolingis to simply sum all the Bilinear features, , (I)=Pl2 Lbilinear(l,I,fA,fB). An alternative is these ignore the location of the features and are henceorderless. IffAandfBextract features of sizeC MandC Nrespectively, then (I)is of sizeM N. Thebi-linear vectorobtained by reshaping (I)to sizeMN 1is a general purpose image descriptor that can be used witha classification functionC. Intuitively, the Bilinear form al-lows the outputs of the feature exactorsfAandfBto beconditioned on each other by considering all their pairwiseinteractions similar to a quadratic kernel Bilinear CNN modelsA natural candidate for the feature functionfis a CNNconsisting of a hierarchy of convolutional and pooling lay-ers. In our experiments we use CNNs pre-trained on theImageNet dataset [9]truncatedat a convolutional layer in-cluding non-linearities as feature extractors. By pre-trainingwe benefit from additional training data when domain spe-cific data is scarce.


Related search queries