ViViT: A Video Vision Transformer - arXiv

Anurag Arnab*, Mostafa Dehghani*, Georg Heigold, Chen Sun, Mario Lučić†, Cordelia Schmid†
Google Research
{aarnab, dehghani, heigold, chensun, lucic, [...]}

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several efficient variants of our model which factorise the spatial and temporal dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.

We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we release code at [...]

Introduction

Approaches based on deep convolutional neural networks have advanced the state-of-the-art across many standard datasets for vision problems since AlexNet [38]. At the same time, the most prominent architecture of choice in sequence-to-sequence modelling (e.g. in natural language processing) is the transformer [68], which does not use convolutions, but is based on multi-headed self-attention.

This operation is particularly effective at modelling long-range dependencies and allows the model to attend over all elements in the input sequence. This is in stark contrast to convolutions, where the corresponding receptive field is limited and grows linearly with the depth of the network. The success of attention-based models in NLP has recently inspired approaches in computer vision to integrate transformers into CNNs [75, 7], as well as some attempts to replace convolutions completely [49, 3, 53]. However, it is only very recently, with the Vision Transformer (ViT) [18], that a pure-transformer based architecture has outperformed its convolutional counterparts in image classification.

* Equal contribution. † Equal advising.
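The contrast between a depth-limited convolutional receptive field and global self-attention can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes 1D, stride-1, undilated convolutions with a fixed kernel size, and the function names are hypothetical rather than taken from the paper or its code.

```python
def conv_receptive_field(num_layers, kernel_size=3):
    """Receptive field of num_layers stacked stride-1, undilated 1D convolutions.

    Each layer widens the field by (kernel_size - 1), so growth is linear in depth.
    """
    return 1 + num_layers * (kernel_size - 1)


def conv_layers_to_cover(sequence_length, kernel_size=3):
    """Smallest depth at which one output element sees the whole sequence."""
    # Ceiling division of (sequence_length - 1) by the per-layer growth.
    return -(-(sequence_length - 1) // (kernel_size - 1))


def attention_layers_to_cover(sequence_length):
    """One self-attention layer already attends over every input element."""
    return 1


if __name__ == "__main__":
    for n in (8, 64, 512):
        print(n, conv_layers_to_cover(n), attention_layers_to_cover(n))
```

Under these assumptions the convolutional depth needed to span a sequence grows in proportion to its length, whereas a single attention layer suffices. Strided or dilated convolutions would change the constants, which is why they are excluded from the sketch.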

Dosovitskiy et al. [18] closely followed the original transformer architecture of [68], and noticed that its main benefits were observed at large scale: as transformers lack some of the inductive biases of convolutions (such as translational equivariance), they seem to require more data [18] or stronger regularisation [64].

Inspired by ViT, and the fact that attention-based architectures are an intuitive choice for modelling long-range contextual relationships in video, we develop several transformer-based models for video classification. Currently, the most performant models are based on deep 3D convolutional architectures [8, 20, 21] which were a natural extension of image classification CNNs [27, 60].

Recently, these models were augmented by incorporating self-attention into their later layers to better capture long-range dependencies [75, 23, 79, 1].

As shown in Fig. 1, we propose pure-transformer models for video classification. The main operation performed in this architecture is self-attention, and it is computed on a sequence of spatio-temporal tokens that we extract from the input video. To effectively process the large number of spatio-temporal tokens that may be encountered in video, we present several methods of factorising our model along spatial and temporal dimensions to increase efficiency and scalability. Furthermore, to train our model effectively on smaller datasets, we show how to regularise our model during training and leverage pretrained image models.

We also note that convolutional models have been developed by the community for several years, and there are thus many best practices associated with such models. As pure-transformer models present different characteristics, we need to determine the best design choices for such architectures.
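To give a feel for why factorising along spatial and temporal dimensions helps, the following sketch counts query-key pairs for joint attention over all spatio-temporal tokens versus one possible factorisation (attention among tokens of the same time index, then among tokens of the same spatial index). The clip size, the 2x16x16 patch size and all function names are assumptions chosen for illustration, not values taken from this section.

```python
def token_grid(frames, height, width, patch=(2, 16, 16)):
    """Number of tokens along time, height and width for non-overlapping patches.

    The (2, 16, 16) patch size is a hypothetical choice for this example.
    """
    t, h, w = patch
    return frames // t, height // h, width // w


def joint_attention_pairs(n_t, n_h, n_w):
    """Every token attends to every other token: cost is quadratic in all tokens."""
    n = n_t * n_h * n_w
    return n * n


def factorised_attention_pairs(n_t, n_h, n_w):
    """Spatial attention within each time index, then temporal attention
    within each spatial index (one possible factorisation)."""
    n_s = n_h * n_w
    spatial = n_t * n_s * n_s   # n_t independent groups of n_s tokens
    temporal = n_s * n_t * n_t  # n_s independent groups of n_t tokens
    return spatial + temporal


if __name__ == "__main__":
    grid = token_grid(frames=32, height=224, width=224)
    print("token grid:", grid)
    print("joint pairs:", joint_attention_pairs(*grid))
    print("factorised pairs:", factorised_attention_pairs(*grid))
```

For this assumed input the factorised count is more than an order of magnitude smaller than the joint count. The sketch counts attention pairs only and ignores the cost of projections and MLP blocks.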

We conduct a thorough ablation analysis of tokenisation strategies, model architecture and regularisation methods. Informed by this analysis, we achieve state-of-the-art results on multiple standard video classification benchmarks, including Kinetics 400 and 600 [35], Epic Kitchens 100 [13], Something-Something v2 [26] and Moments in Time [45].

[Figure 1 diagram. Recoverable labels: Embed to tokens, Token Embedding, Transformer Encoder (Layer Norm, Multi-Head Dot-Product Attention, MLP), MLP Head, Class; variants: Factorised Encoder, Factorised Self-Attention, Factorised Dot-Product.]

Figure 1: We propose a pure-transformer architecture for video classification, inspired by the recent success of such models for images [18]. To effectively process a large number of spatio-temporal tokens, we develop several model variants which factorise different components of the transformer encoder over the spatial and temporal dimensions. As shown on the right, these factorisations correspond to different attention patterns over space and time.

Related Work

Architectures for video understanding have mirrored advances in image recognition. Early video research used hand-crafted features to encode appearance and motion information [41, 69]. The success of AlexNet on ImageNet [38, 16] initially led to the repurposing of 2D image convolutional networks (CNNs) for video as two-stream networks [34, 56, 47].

These models processed RGB frames and optical flow images independently before fusing them at the end. Availability of larger video classification datasets such as Kinetics [35] subsequently facilitated the training of spatio-temporal 3D CNNs [8, 22, 65] which have significantly more parameters and thus require larger training datasets. As 3D convolutional networks require significantly more computation than their image counterparts, many architectures factorise convolutions across spatial and temporal dimensions and/or use grouped convolutions [59, 66, 67, 81, 20]. We also leverage factorisation of the spatial and temporal dimensions of videos to increase efficiency, but in the context of transformer-based models.

Concurrently, in natural language processing (NLP), Vaswani et al. [68] achieved state-of-the-art results by replacing convolutions and recurrent networks with the transformer network that consisted only of self-attention, layer normalisation and multilayer perceptron (MLP) operations. Current state-of-the-art architectures in NLP [17, 52] remain transformer-based, and have been scaled to web-scale datasets [5]. Many variants of the transformer have also been proposed to reduce the computational cost of self-attention when processing longer sequences [10, 11, 37, 62, 63, 73] and to improve parameter efficiency [40, 14].

Although self-attention has been employed extensively in computer vision, it has, in contrast, been typically incorporated as a layer at the end or in the later stages of the network [75, 7, 32, 77, 83] or to augment residual blocks [30, 6, 9, 57] within a ResNet architecture [27].
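Since self-attention is the core operation referred to throughout this discussion, a minimal sketch may help. It implements single-head scaled dot-product self-attention in plain Python and, as a simplifying assumption, uses identity projections: the learned query, key and value matrices of a real transformer, as well as multiple heads, layer normalisation and the MLP, are omitted. Function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the tokens themselves
    (identity projections), purely to keep the example short.
    """
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        # One score per token in the sequence: every query sees every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Output is the attention-weighted average of all value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return outputs


if __name__ == "__main__":
    print(self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

The nested loop over queries and keys makes the quadratic cost in sequence length explicit, which is the cost that the efficient transformer variants cited above aim to reduce.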

Although previous works attempted to replace convolutions in vision architectures [49, 53, 55], it is only very recently that Dosovitskiy et al. [18] showed with their ViT architecture that pure-transformer networks, similar to those employed in NLP, can achieve state-of-the-art results for image classification too. The authors showed that such models are only effective at large scale, as transformers lack some of the inductive biases of convolutional networks (such as translational equivariance), and thus require datasets larger than the common ImageNet ILSVRC dataset [16] to train. ViT has inspired a large amount of follow-up work in the community, and we note that there are a number of concurrent approaches on extending it to other tasks in computer vision [71, 74, 84, 85] and improving its data-efficiency [64, 48].
