arXiv:2108.00154v2 [cs.CV] 8 Oct 2021

CROSSFORMER: A VERSATILE VISION TRANSFORMER HINGING ON CROSS-SCALE ATTENTION

Wenxiao Wang¹,², Lu Yao¹, Long Chen³, Binbin Lin⁴, Deng Cai¹, Xiaofei He¹ & Wei Liu²
¹ State Key Lab of CAD & CG, Zhejiang University
² Data Platform, Tencent
³ Columbia University
⁴ School of Software Technology, Zhejiang University

ABSTRACT

Transformers have made great progress in dealing with computer vision tasks. However, existing vision transformers do not yet possess the ability of building the interactions among features of different scales, which is perceptually important to visual inputs. The reasons are two-fold: (1) Input embeddings of each layer are equal-scale, so no cross-scale feature can be extracted; (2) to lower the computational cost, some vision transformers merge adjacent embeddings inside the self-attention module, thus sacrificing small-scale (fine-grained) features of the embeddings and also disabling the cross-scale interactions. To this end, we propose Cross-scale Embedding Layer (CEL) and Long Short Distance Attention (LSDA).

On the one hand, CEL blends each embedding with multiple patches of different scales, providing the self-attention module itself with cross-scale features. On the other hand, LSDA splits the self-attention module into a short-distance one and a long-distance counterpart, which not only reduces the computational burden but also keeps both small-scale and large-scale features in the embeddings. Through the above two designs, we achieve cross-scale attention. Besides, we put forward a dynamic position bias for vision transformers to make the popular relative position bias apply to variable-sized images. Hinging on the cross-scale attention module, we construct a versatile vision architecture, dubbed CrossFormer, which accommodates variable-sized inputs. Extensive experiments show that CrossFormer outperforms the other vision transformers on image classification, object detection, instance segmentation, and semantic segmentation tasks.

It turns out that the transformer (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020) has achieved great success in the field of natural language processing (NLP). Benefiting from its self-attention module, the transformer is born with the key ability to build long-distance dependencies. Since long-distance dependencies are also needed by a number of vision tasks (Zhang & Yang, 2021; Chu et al., 2021), a surge of research work (Dosovitskiy et al., 2021; Touvron et al., 2021; Wang et al., 2021) has been conducted to explore various transformer-based vision architectures.

A transformer requires a sequence of embeddings² (e.g., word embeddings) as input. To adapt this requirement to typical vision tasks, most existing vision transformers (Dosovitskiy et al., 2021; Touvron et al., 2021; Wang et al., 2021; Liu et al., 2021b) produce embeddings by splitting an input image into equal-sized patches. For example, a 224 × 224 image can be split into 56 × 56 patches of size 4 × 4, and these patches are projected through a linear layer to yield an embedding sequence.
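The patch-splitting step in the example above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `patch_embed`, the embedding width `dim=96`, and the random projection weights are hypothetical stand-ins for a learned linear layer, not values taken from the paper.

```python
import numpy as np

def patch_embed(image, patch=4, dim=96, seed=0):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles and
    project each flattened tile with a single linear map."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (gh * gw, patch * patch * c)
    tiles = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    tiles = tiles.reshape(gh * gw, patch * patch * c)
    # Random weights stand in for a learned projection (assumption).
    rng = np.random.default_rng(seed)
    weight = rng.standard_normal((patch * patch * c, dim)) * 0.02
    return tiles @ weight  # one embedding per patch

image = np.zeros((224, 224, 3), dtype=np.float32)
embeddings = patch_embed(image)
n = embeddings.shape[0]
print(embeddings.shape)  # (3136, 96): 56 x 56 patches
print(n, n * n)          # 3136 9834496: entries in a full pairwise attention matrix
```

The last line shows why sequence length matters for images: with N = 3136 embeddings, a full pairwise attention matrix already has close to ten million entries per head.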

Inside a certain transformer, self-attention is engaged to build the interactions between any two embeddings. Thus, the computational or memory cost of the self-attention module is O(N^2), where N is the length of an embedding sequence. Such a cost is too big for a visual input because its embedding sequence is much longer than that of NLP. Therefore, the recently proposed vision transformers (Wang et al., 2021; Liu et al., 2021b; Lin et al., 2021) develop multiple substitutes to approximate the vanilla self-attention module with a lower cost.

¹ The code has been released.
² In this paper, we also use embeddings to represent the input of each layer.

Though the aforementioned vision transformers have made some progress, they suffer from an issue that restricts their performance: they fail to build the interactions among features of different scales, whereas such an ability is vital for many vision tasks. For example, an image often contains many objects of different scales, and to fully understand the image, building the interactions among those objects is helpful.

Besides, some particular tasks such as instance segmentation need the interactions between large-scale (coarse-grained) features and small-scale (fine-grained) ones. Existing vision transformers fail to deal with the above cases due to two reasons: (1) The embeddings are generated from equal-sized patches, so they only own features of one single scale. Moreover, their scales are kept unchanged or enlarged uniformly through operations like average pooling in the following layers. Hence, embeddings in the same layer are always equal-scale. (2) Inside the self-attention module, adjacent embeddings are often grouped together and merged (Wang et al., 2021; Chu et al., 2021). Since the number of groups is smaller than that of embeddings, such behavior can reduce the computational budget of the self-attention. In this case, however, even if embeddings have both small-scale and large-scale features, merging operations will lose the small-scale (fine-grained) features of each individual embedding, thereby disabling the cross-scale interactions.

To enable the building of cross-scale interactions, we co-design a novel embedding layer and self-attention module as follows.

1) Cross-scale Embedding Layer (CEL): Following Wang et al. (2021), we also employ a pyramid structure for our transformer, which naturally splits the vision transformer model into multiple stages. CEL appears at the start of each stage; it receives the last stage's output (or an input image) as input and samples patches with multiple kernels of different scales (e.g., 4 × 4 or 8 × 8). Then, each embedding is constructed by projecting and concatenating these patches, as opposed to solely using one single-scale patch, which endows each embedding with cross-scale features.

2) Long Short Distance Attention (LSDA): We propose a substitute of the vanilla self-attention, but to preserve small-scale features, the embeddings will not be merged. In contrast, we split the self-attention module into Short Distance Attention (SDA) and Long Distance Attention (LDA). SDA builds the dependencies among neighboring embeddings, while LDA takes charge of the dependencies among embeddings far away from each other.
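As a rough illustration of the cross-scale embedding idea described in 1), the NumPy sketch below samples patches with two kernel sizes around the same centres, projects each scale separately, and concatenates the results. Several details are assumptions made for the sketch and are not stated in this section: the shared stride of 4, the zero padding used to keep the grid aligned, the per-kernel widths in `dims`, and the random projection weights. The name `cross_scale_embed` is hypothetical.

```python
import numpy as np

def cross_scale_embed(image, kernels=(4, 8), dims=(48, 48), stride=4, seed=0):
    """Build one embedding per stride x stride cell by concatenating projections
    of patches of several kernel sizes sampled at the same location."""
    h, w, c = image.shape
    gh, gw = h // stride, w // stride
    rng = np.random.default_rng(seed)
    parts = []
    for k, d in zip(kernels, dims):
        # Zero-pad so that every kernel size yields the same gh x gw grid (assumption).
        pad = (k - stride) // 2
        padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
        patches = np.stack([
            padded[i * stride:i * stride + k, j * stride:j * stride + k].ravel()
            for i in range(gh) for j in range(gw)
        ])  # (gh * gw, k * k * c)
        # Random weights stand in for a learned per-scale projection (assumption).
        weight = rng.standard_normal((k * k * c, d)) * 0.02
        parts.append(patches @ weight)
    # Each embedding now mixes features from every kernel scale.
    return np.concatenate(parts, axis=1)

image = np.random.default_rng(1).standard_normal((32, 32, 3))
embeddings = cross_scale_embed(image)
print(embeddings.shape)  # (64, 96): an 8 x 8 grid, 48 dims per scale
```

Because all kernels share one stride, the number of embeddings is the same as for single-scale patches; only the content of each embedding changes.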

The proposed LSDA can also reduce the cost of the self-attention module like previous studies (Wang et al., 2021; Chu et al., 2021), but different from them, LSDA does not undermine either small-scale or large-scale features. As a consequence, attention with cross-scale interactions is enabled.

Moreover, following prior work (Shaw et al., 2018; Liu et al., 2021b), we employ a relative position bias for embeddings' position representations. The Relative Position Bias (RPB) only supports fixed image/group size³. However, image size for many vision tasks such as object detection is variable, and so is group size for many architectures, including ours. To make the RPB more flexible, we further introduce a trainable module called Dynamic Position Bias (DPB), which receives two embeddings' relative distance as input and outputs their position bias. The DPB module is optimized end-to-end in the training phase, incurring a negligible cost but making RPB apply to variable image/group size.

All our proposed modules can be implemented with about ten lines of code.
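To make the grouping and the position bias more concrete, here is a minimal NumPy sketch of attention computed inside short-distance and long-distance groups with a bias produced from relative offsets. It is not the authors' implementation. The assumptions are: a square S × S map split into groups of G × G embeddings, long-distance groups formed by sampling at a fixed interval S / G, a two-layer ReLU MLP as the bias module, and attention without learned query/key/value projections. All function names and weight shapes are hypothetical.

```python
import numpy as np

def sda_groups(x, g):
    """Short-distance grouping: each group holds g x g adjacent embeddings."""
    s, _, d = x.shape
    return x.reshape(s // g, g, s // g, g, d).transpose(0, 2, 1, 3, 4).reshape(-1, g * g, d)

def lda_groups(x, g):
    """Long-distance grouping: each group samples embeddings at interval s // g."""
    s, _, d = x.shape
    i = s // g
    return x.reshape(g, i, g, i, d).transpose(1, 3, 0, 2, 4).reshape(-1, g * g, d)

def dynamic_position_bias(g, params):
    """Map every pairwise offset (dy, dx) inside a g x g group to a scalar bias."""
    w1, b1, w2 = params
    ys, xs = np.meshgrid(np.arange(g), np.arange(g), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)   # (g*g, 2)
    offsets = coords[:, None, :] - coords[None, :, :]     # (g*g, g*g, 2)
    hidden = np.maximum(offsets @ w1 + b1, 0.0)           # ReLU hidden layer
    return (hidden @ w2)[..., 0]                          # (g*g, g*g)

def grouped_attention(groups, bias):
    """Softmax attention inside each group; queries = keys = values for brevity."""
    d = groups.shape[-1]
    scores = groups @ groups.transpose(0, 2, 1) / np.sqrt(d) + bias
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ groups

rng = np.random.default_rng(0)
# Hypothetical bias-module weights: 2 -> 8 -> 1.
params = (rng.standard_normal((2, 8)), np.zeros(8), rng.standard_normal((8, 1)))
x = rng.standard_normal((8, 8, 32))   # an 8 x 8 map of 32-dim embeddings
g = 4
bias = dynamic_position_bias(g, params)
short = grouped_attention(sda_groups(x, g), bias)
long_ = grouped_attention(lda_groups(x, g), bias)
print(short.shape, long_.shape)       # (4, 16, 32) (4, 16, 32)
```

In this toy setting, full attention over the 64 embeddings would need 64 × 64 = 4096 pairwise scores, while four groups of 16 embeddings need 4 × 16 × 16 = 1024, and no embeddings are merged. Because the bias is computed from offsets instead of being looked up in a fixed-size table, the same `params` can be reused for another group size, for example `dynamic_position_bias(3, params)`.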

Based on them, we construct four versatile vision transformers of different sizes, dubbed CrossFormers. Other than image classification, the proposed CrossFormer can handle a variety of tasks with variable-sized inputs such as object detection. Experiments on four representative vision tasks (i.e., image classification, object detection, instance segmentation, and semantic segmentation) demonstrate that CrossFormer outperforms the other state-of-the-art vision transformers on all the tasks. Remarkably, the performance gains brought by CrossFormer are substantially significant on dense prediction tasks, e.g., object detection and instance/semantic segmentation.

It is worth highlighting our contributions as follows:

• We propose cross-scale embedding layer (CEL) and long short distance attention (LSDA), which together compensate for existing transformers' incapability of building cross-scale attention.
• The dynamic position bias module (DPB) is further proposed to make the relative position bias more flexible, i.e., accommodating variable image size or group size.
• Multiple CrossFormers with different sizes are constructed, and we corroborate their effectiveness through sufficient experiments on four representative vision tasks.

³ Some vision transformers split input embeddings into several groups. Group size means the number of embeddings in a group.

[Figure 1 diagram. (a) Stage-1: CEL (4 × 4, 8 × 8, 16 × 16, 32 × 32) and CrossFormer Block × n1; Stage-2 to Stage-4: CEL (2 × 2, 4 × 4) and CrossFormer Block × n2, × n3, × n4; then a Classification Head. Feature maps: H0/4 × W0/4 × D, H0/8 × W0/8 × 2D, H0/16 × W0/16 × 4D, H0/32 × W0/32 × 8D. (b) Block 1: LN, SDA (with DPB), LN, MLP; Block 2: LN, LDA (with DPB), LN, MLP.]

Figure 1: (a) The architecture of CrossFormer for classification. The input size is H0 × W0, and the size of feature maps in each stage is shown on the top. Stage-i consists of a CEL and n_i CrossFormer blocks. Numbers in CELs represent kernel sizes used for sampling patches. (b) The inner structure of two consecutive CrossFormer blocks. SDA and LDA appear alternately in different blocks.

Motivated by the transformers developed for NLP, researchers design specific visual transformers for vision tasks to take full advantage of their powerful attention mechanism.

In particular, ViT and DeiT transfer the original transformer (Vaswani et al., 2017) to vision tasks (Touvron et al., 2021; Dosovitskiy et al., 2021), achieving impressive performance. Later, PVT (Wang et al., 2021), HVT (Pan et al., 2021), Swin (Liu et al., 2021b), and ViTAE (Xu et al., 2021) introduce a pyramid structure into the visual transformers, greatly decreasing the number of patches in the later layers of a respective model. They also extend the visual transformers to other vision tasks like object detection and segmentation (Wang et al., 2021; Liu et al., 2021b).

Substitutes of Self-attention. As the core component of transformers, the self-attention module incurs O(N^2) computational/memory cost, where N is the length of an embedding sequence. Though such a cost may be acceptable for image classification, it is not the case for other tasks with much larger input images (e.g., object detection and segmentation). To alleviate the cost, Swin (Liu et al., 2021b) restricts the attention in a certain local region, giving up long-distance dependencies. (Wang et al.)
