
Transformers in Vision: A Survey


Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah

Abstract—Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences, as compared to recurrent networks, e.g., Long Short-Term Memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks, and demonstrates excellent scalability to very large capacity networks and huge datasets.

These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of Transformers in vision, including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation).

We compare the respective advantages and limitations of popular techniques, both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.

Index Terms—Self-attention, Transformers, bidirectional encoders, deep neural networks, convolutional networks, self-supervision.

1 INTRODUCTION

Transformer models [1] have recently demonstrated exemplary performance on a broad range of language tasks, e.g., text classification, machine translation [2] and question answering.

Among these models, the most popular ones include BERT (Bidirectional Encoder Representations from Transformers) [3], GPT (Generative Pre-trained Transformer) v1-3 [4]-[6], RoBERTa (Robustly Optimized BERT Pre-training) [7] and T5 (Text-to-Text Transfer Transformer) [8]. The profound impact of Transformer models has become more clear with their scalability to very large capacity models [9], [10]. For example, the BERT-large [3] model with 340 million parameters was significantly outperformed by the GPT-3 [6] model with 175 billion parameters, while the latest mixture-of-experts Switch Transformer [10] scales up to a whopping trillion parameters!

The breakthroughs from Transformer networks in the Natural Language Processing (NLP) domain have sparked great interest in the computer vision community to adapt these models for vision and multi-modal learning tasks (Fig. 1).

S. Khan, M. Naseer and F. S. Khan are with the MBZ University of Artificial Intelligence, Abu Dhabi, UAE. M. Hayat is with the Faculty of IT, Monash University, Clayton VIC 3800, Australia. S. W. Zamir is with the Inception Institute of Artificial Intelligence, Abu Dhabi, UAE. S. Khan and M. Naseer are also with the CECS, Australian National University, Canberra ACT 0200, Australia.

F. S. Khan is also with the Computer Vision Laboratory, Linköping University, Sweden. M. Shah is with the Center for Research in Computer Vision, University of Central Florida, Orlando, FL 32816, United States.

However, visual data follows a typical structure (e.g., spatial and temporal coherence), thus demanding novel network designs and training schemes. As a result, Transformer models and their variants have been successfully used for image recognition [11], [12], object detection [13], [14], segmentation [15], image super-resolution [16], video understanding [17], [18], image generation [19], text-image synthesis [20] and visual question answering [21], [22], among several other use cases [23]-[26].

This survey aims to cover such recent and exciting efforts in the computer vision domain, providing a comprehensive reference to interested readers.

Transformer architectures are based on a self-attention mechanism that learns the relationships between elements of a sequence. As opposed to recurrent networks, which process sequence elements recursively and can only attend to short-term context, Transformers can attend to complete sequences, thereby learning long-range relationships. Although attention models have been extensively used in both feed-forward and recurrent networks [27], [28], Transformers are based solely on the attention mechanism and have a unique implementation (i.e., multi-head attention) optimized for parallelization.
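To make this mechanism concrete, the following is a minimal NumPy sketch of multi-head scaled dot-product self-attention in the spirit of [1]. It is an illustrative sketch only: the dimensions, weight shapes, and function names are our own assumptions, and practical details such as masking, dropout, biases, and the surrounding residual and normalization layers are omitted.

```python
# A minimal sketch of multi-head scaled dot-product self-attention.
# All shapes and names (seq_len, d_model, num_heads) are illustrative
# assumptions, not taken from any specific published implementation.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); w_*: (d_model, d_model) projection matrices."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project the same input into queries, keys, and values (self-attention).
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Split the model dimension into independent heads: (heads, seq, d_head).
    split = lambda t: t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)

    # Every position attends to every other position in one matrix product,
    # which is what enables parallel processing of the whole sequence.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, seq, seq)
    attn = softmax(scores, axis=-1)

    out = attn @ v                                         # (heads, seq, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model) # merge heads back
    return out @ w_o

# Toy usage with random weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 64))                  # 10 tokens, d_model = 64
w = [rng.normal(size=(64, 64)) * 0.1 for _ in range(4)]
y = multi_head_self_attention(x, *w, num_heads=8)
print(y.shape)  # (10, 64)
```

Note that the attention scores for all token pairs are produced by a single matrix product per head, which is precisely the property that makes the computation parallelizable across the sequence, in contrast to the step-by-step recursion of recurrent networks.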

An important feature of these models is their scalability to high-complexity models and large-scale datasets, in comparison to some of the other alternatives such as hard attention [29], which is stochastic in nature and requires Monte Carlo sampling to sample attention locations. Since Transformers assume minimal prior knowledge about the structure of the problem, as compared to their convolutional and recurrent counterparts [30]-[32], they are typically pre-trained using pretext tasks on large-scale (unlabelled) datasets [1], [3]. Such pre-training avoids costly manual annotations, thereby encoding highly expressive and generalizable representations that model rich relationships between the entities present in a given dataset. The learned representations are then fine-tuned on the downstream tasks in a supervised manner to obtain favorable results.

[Fig. 1: Statistics on the number of times keywords such as BERT, Self-Attention, and Transformers appear in the titles of peer-reviewed and arXiv papers over the past few years (in Computer Vision and Machine Learning). The plots show consistent growth in recent literature. This survey covers recent progress on Transformers in the computer vision domain.]
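As a rough illustration of this two-stage recipe, the PyTorch sketch below pre-trains a small Transformer encoder with a masked-reconstruction pretext task on unlabelled (here, random) sequences and then fine-tunes the same encoder with a supervised classification head. The architecture, masking ratio, and task heads are illustrative placeholders rather than any specific published method.

```python
# A schematic sketch of the pre-train-then-fine-tune recipe: a pretext task
# (reconstructing masked inputs) on unlabelled data, followed by supervised
# fine-tuning on a labelled downstream task. All modules and data here are
# stand-in placeholders, not any particular published model.
import torch
import torch.nn as nn

d_model, num_classes = 64, 10
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)

# --- Stage 1: self-supervised pre-training (no labels needed) ---
recon_head = nn.Linear(d_model, d_model)
opt = torch.optim.Adam(list(encoder.parameters()) + list(recon_head.parameters()))
for _ in range(3):  # a few toy steps on random "unlabelled" sequences
    x = torch.randn(8, 16, d_model)          # (batch, tokens, features)
    mask = torch.rand(8, 16, 1) < 0.5        # True = hide this token
    pred = recon_head(encoder(x * ~mask))    # reconstruct from visible tokens
    loss = ((pred - x) ** 2 * mask).mean()   # penalize only masked positions
    opt.zero_grad()
    loss.backward()
    opt.step()

# --- Stage 2: supervised fine-tuning on the downstream task ---
cls_head = nn.Linear(d_model, num_classes)
opt = torch.optim.Adam(list(encoder.parameters()) + list(cls_head.parameters()))
for _ in range(3):
    x = torch.randn(8, 16, d_model)
    y = torch.randint(0, num_classes, (8,))    # toy downstream labels
    logits = cls_head(encoder(x).mean(dim=1))  # pool tokens, then classify
    loss = nn.functional.cross_entropy(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```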

This paper provides a holistic overview of the transformer models developed for computer vision applications. We develop a taxonomy of the network design space and highlight the major strengths and shortcomings of the existing methods. Other literature reviews mainly focus on the NLP domain [33], [34] or cover generic attention-based approaches [27], [33]. By focusing on the newly emerging area of visual Transformers, we comprehensively organize the recent approaches according to the intrinsic features of self-attention and the investigated task. We first provide an introduction to the salient concepts underlying Transformer networks and then elaborate on the specifics of recent vision transformers. Wherever possible, we draw parallels between the Transformers used in the NLP domain [1] and the ones developed for vision problems, to highlight major novelties and interesting domain-specific insights. Recent approaches show that convolution operations can be fully replaced by attention-based transformer modules, and that the two can also be used jointly in a single design to encourage symbiosis between these complementary sets of operations.

