Multiscale Vision Transformers - arXiv

Haoqi Fan*,1; Bo Xiong*,1; Karttikeya Mangalam*,1,2; Yanghao Li*,1; Zhicheng Yan 1; Jitendra Malik 1,2; Christoph Feichtenhofer*,1

1 Facebook AI Research; 2 UC Berkeley

Abstract

We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features, with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers operating on spatially coarse, but complex, high-dimensional features.

We evaluate this fundamental architectural prior for modeling the dense nature of visual signals on a variety of video recognition tasks, where it outperforms concurrent vision transformers that rely on large-scale external pre-training and are 5-10× more costly in computation and parameters. We further remove the temporal dimension and apply our model to image classification, where it outperforms prior work on vision transformers. Code is available at:

Introduction

We begin with the intellectual history of neural network models for computer vision. Based on their studies of cat and monkey visual cortex, Hubel and Wiesel [55] developed a hierarchical model of the visual pathway with neurons in lower areas such as V1 responding to features such as oriented edges and bars, and in higher areas to more specific stimuli.

Fukushima proposed the Neocognitron [32], a neural network architecture for pattern recognition explicitly motivated by Hubel and Wiesel's hierarchy. His model had alternating layers of simple cells and complex cells, thus incorporating downsampling, and shift invariance, thus incorporating convolutional structure. LeCun et al. [65] took the additional step of using backpropagation to train the weights of this network. But already the main aspects of the hierarchy of visual processing had been established: (i) reduction in spatial resolution as one goes up the processing hierarchy and (ii) increase in the number of different channels, with each channel corresponding to ever more specialized features.

In a parallel development, the computer vision community developed multiscale processing, sometimes called pyramid strategies, with Rosenfeld and Thurston [85], Burt and Adelson [8], and Koenderink [61] among the key papers. There were two motivations: (i) to decrease the computing requirements by working at lower resolutions and (ii) a better sense of context at the lower resolutions, which could then guide the processing at higher resolutions (this is a precursor to the benefit of depth in today's neural networks).

Figure 1. Multiscale Vision Transformers learn a hierarchy from dense (in space) and simple (in channels) to coarse and complex features. Several resolution-channel scale stages progressively increase the channel capacity of the intermediate latent sequence while reducing its length and thereby spatial resolution.

*Equal technical contribution.

The Transformer [98] architecture allows learning arbitrary functions defined over sets and has been scalably successful in sequence tasks such as language comprehension [26] and machine translation [7]. Fundamentally, a transformer uses blocks with two basic operations. The first is an attention operation [4] for modeling inter-element relations. The second is a multi-layer perceptron (MLP), which models relations within an element. Intertwining these operations with normalization [2] and residual connections [44] allows transformers to generalize to a wide variety of tasks. Recently, transformers have been applied to key computer vision tasks such as image classification.
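The block structure described above can be illustrated with a minimal, dependency-free sketch. It assumes a single attention head, a ReLU MLP, unparameterized layer normalization, and normalization applied before each operation; these choices and all function names are illustrative assumptions, not details given in this section.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add(a, b):
    """Element-wise sum, used for the residual connections."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def layer_norm(x, eps=1e-5):
    """Normalize each token (row) to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention: relations between tokens."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def mlp(x, w1, w2):
    """Two-layer perceptron applied to each token independently."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def transformer_block(x, params):
    """Attention then MLP, each wrapped in normalization and a residual."""
    x = add(x, attention(layer_norm(x), params["wq"], params["wk"], params["wv"]))
    x = add(x, mlp(layer_norm(x), params["w1"], params["w2"]))
    return x
```

Here `x` is a sequence of tokens (one row per token) and `params` holds the five weight matrices. With all weights set to zero, both operations contribute nothing and the residual paths return the input unchanged, which is a convenient sanity check.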

In the spirit of architectural universalism, vision transformers [25, 95] approach the performance of convolutional models across a variety of data and compute regimes. By only having a first layer that patchifies the input in the spirit of a 2D convolution, followed by a stack of transformer blocks, the vision transformer aims to showcase the power of the transformer architecture using little inductive bias.

arXiv preprint, 22 Apr 2021

In this paper, our intention is to connect the seminal idea of multiscale feature hierarchies with the transformer model. We posit that the fundamental vision principle of resolution and channel scaling can be beneficial for transformer models across a variety of visual recognition tasks.

We present Multiscale Vision Transformers (MViT), a transformer architecture for modeling visual data such as images and videos. Consider an input image as shown in Fig. 1. Unlike conventional transformers, which maintain a constant channel capacity and resolution throughout the network, Multiscale Transformers have several channel-resolution scale stages. Starting from the image resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of feature activations inside the transformer network, effectively connecting the principles of transformers with multiscale feature hierarchies. Our conceptual idea provides an effective design advantage for vision transformer models.
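The stage-wise exchange of resolution for channel capacity can be sketched as a small shape calculation. The channel doubling, the halving of each spatial side, the four stages, and the 56x56 starting grid with 96 channels are illustrative assumptions, not values stated in this section.

```python
def scale_stages(height, width, channels, num_stages, channel_mult=2, stride=2):
    """List the feature-map shape at each scale stage.

    Assumption (hypothetical schedule): every stage transition multiplies the
    channel dimension by `channel_mult` and divides each spatial side by
    `stride`, so the token sequence gets shorter as the channels grow.
    """
    stages = []
    for index in range(1, num_stages + 1):
        stages.append(
            {
                "stage": index,
                "channels": channels,
                "height": height,
                "width": width,
                "seq_len": height * width,  # tokens seen by the attention layers
            }
        )
        channels *= channel_mult
        height //= stride
        width //= stride
    return stages


if __name__ == "__main__":
    for s in scale_stages(height=56, width=56, channels=96, num_stages=4):
        print(s)
```

Under these assumed settings the sequence length falls from 3136 tokens to 49 while the channel dimension grows from 96 to 768, which is the dense-and-simple to coarse-and-complex progression the text describes. A conventional transformer would correspond to `channel_mult=1` and `stride=1`.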

The early layers of our architecture can operate at high spatial resolution to model simple low-level visual information, due to the lightweight channel capacity. In turn, the deeper layers can effectively focus on spatially coarse but complex high-level features to model visual semantics. The fundamental advantage of our multiscale transformer arises from the extremely dense nature of visual signals, a phenomenon that is even more pronounced for space-time visual signals captured in video.

A noteworthy benefit of our design is the presence of a strong implicit temporal bias in video multiscale models. We show that vision transformer models [25] trained on natural video suffer no performance decay when tested on videos with shuffled frames.

This indicates that these models are not effectively using the temporal information and instead rely heavily on appearance. In contrast, when testing our MViT models on shuffled frames, we observe significant accuracy decay, indicating strong use of temporal information.

Our focus in this paper is video recognition, and we design and evaluate MViT for video tasks (Kinetics [59, 10], Charades [86], SSv2 [38] and AVA [39]). MViT provides a significant performance gain over concurrent video transformers [78, 6, 1], without any external pre-training data. In the accompanying trade-off figure, we show the computation/accuracy trade-off for video-level inference when varying the number of temporal clips used in MViT.
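The shuffled-frame test can be illustrated with a toy sketch. Frames are reduced to single integer brightness values, and the two "models" are hypothetical stand-ins (one that ignores frame order, one that depends on it); none of this reproduces the paper's evaluation protocol.

```python
import random


def shuffle_frames(clip, seed=0):
    """Return a copy of `clip` (a list of frames) with the frame order permuted."""
    frames = list(clip)
    random.Random(seed).shuffle(frames)
    return frames


def accuracy(model, clips, labels, shuffle=False):
    """Fraction of clips classified correctly, optionally on shuffled frames."""
    correct = 0
    for index, (clip, label) in enumerate(zip(clips, labels)):
        if shuffle:
            clip = shuffle_frames(clip, seed=index)
        correct += int(model(clip) == label)
    return correct / len(clips)


def appearance_model(clip):
    """Hypothetical order-invariant model: thresholds the total brightness."""
    return int(sum(clip) > 5 * len(clip))


def motion_model(clip):
    """Hypothetical order-sensitive model: does brightness rise over time?"""
    return int(clip[-1] > clip[0])


if __name__ == "__main__":
    clips = [[1, 3, 6, 9], [9, 6, 3, 1], [2, 4, 7, 8], [8, 7, 4, 2]]
    labels = [1, 0, 1, 0]  # toy label: 1 if brightness increases over the clip
    for name, model in [("appearance", appearance_model), ("motion", motion_model)]:
        print(name, accuracy(model, clips, labels), accuracy(model, clips, labels, shuffle=True))
```

A model whose predictions depend only on order-invariant statistics scores the same with and without shuffling, which is the signature the text attributes to appearance-driven models. An accuracy drop under shuffling indicates that frame order is being used.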

The vertical axis shows accuracy on Kinetics-400 and the horizontal axis the overall inference cost in FLOPs for different models: MViT and concurrent ViT [25] video variants, namely VTN [78], TimeSformer [6], and ViViT [1]. To achieve a similar accuracy level as MViT, these models require significantly more computation and parameters (for example, ViViT-L [1] has higher FLOPs and more parameters at equal accuracy, with more analysis elsewhere in the paper) and need large-scale external pre-training on ImageNet-21K (which contains around 60× more labels than Kinetics-400).

[Figure: Kinetics top-1 val accuracy (%) versus inference cost per video in TFLOPs (# of multiply-adds × 10^12). Curves: MViT-B 16x4 and MViT-B 32x2 (without ImageNet), ViViT-L [1] (ImageNet-21K), TimeSformer [6] (ImageNet-21K), VTN [78] (ImageNet-1K / 21K). Annotation: MViT gains accuracy at 1/5 the FLOPs and 1/3 the parameters.]

Figure caption: Computation/accuracy trade-off on Kinetics-400 for varying # of inference clips per video, shown in the MViT curves. Concurrent vision-transformer based methods [78, 6, 1] require over 5× more computation and large-scale external pre-training on ImageNet-21K (IN-21K).
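The relationship between the number of inference clips and the per-video cost can be sketched as follows. Averaging per-clip class scores to obtain a video-level prediction is a common practice assumed here rather than something stated in this section, and the function names and parameters are hypothetical.

```python
def inference_cost(cost_per_view, num_clips, num_crops=1):
    """Total per-video cost: every (clip, crop) view needs one forward pass."""
    return cost_per_view * num_clips * num_crops


def video_level_prediction(clip_scores):
    """Average per-clip class scores and return (mean scores, predicted class).

    `clip_scores` is a list with one score vector per clip.
    Assumption: clips are weighted equally.
    """
    num_clips = len(clip_scores)
    num_classes = len(clip_scores[0])
    mean = [sum(scores[c] for scores in clip_scores) / num_clips for c in range(num_classes)]
    predicted = max(range(num_classes), key=lambda c: mean[c])
    return mean, predicted


if __name__ == "__main__":
    # Hypothetical per-clip scores for a two-class problem.
    scores = [[0.25, 0.75], [0.5, 0.5]]
    print(video_level_prediction(scores))
    # Cost grows linearly with the number of clips, in units of one view.
    print([inference_cost(1, n) for n in (1, 2, 5, 10)])
```

Because the cost is linear in the number of views, each additional clip moves a model to the right along the horizontal axis of the trade-off figure, and the curve shows how much accuracy that extra computation buys.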
