
Learning Spatio-Temporal Transformer for Visual Tracking


Bin Yan1, Houwen Peng2, Jianlong Fu2, Dong Wang1, Huchuan Lu1
1 Dalian University of Technology   2 Microsoft Research Asia

Abstract

In this paper, we present a new tracking architecture with an encoder-decoder transformer as the key component. The encoder models the global spatio-temporal feature dependencies between target objects and search regions, while the decoder learns a query embedding to predict the spatial positions of the target objects. Our method casts object tracking as a direct bounding box prediction problem, without using any proposals or predefined anchors. With the encoder-decoder transformer, the prediction of objects just uses a simple fully-convolutional network, which estimates the corners of objects directly. The whole method is end-to-end and does not need any postprocessing steps such as cosine window and bounding box smoothing, thus largely simplifying existing tracking pipelines.

The proposed tracker achieves state-of-the-art performance on multiple challenging short-term and long-term benchmarks, while running at real-time speed, being 6× faster than Siam R-CNN [54]. Code and models are open-sourced.

Introduction

Visual object tracking is a fundamental yet challenging research topic in computer vision. Over the past few years, object tracking based on convolutional neural networks has achieved remarkable progress [28, 11, 54]. However, convolution kernels are not good at modeling long-range dependencies of image contents and features, because they only process a local neighborhood, either in space or time. Current prevailing trackers, including both the offline Siamese trackers and the online learning models, are almost all built upon convolutional operations [2, 44, 3, 54]. As a consequence, these methods perform well on modeling local relationships of image content, but are limited in capturing long-range global interactions. Such a deficiency may degrade the model's capacity for dealing with scenarios where global contextual information is important for localization, such as objects undergoing large-scale variations or getting in and out of view. (Work performed when Bin Yan was an intern of MSRA. Corresponding authors: Houwen Peng and Dong Wang.)

Figure 1: Comparison with state-of-the-art trackers on LaSOT [15]. We visualize the success performance with respect to the frames-per-second (fps) tracking speed. The circle size indicates a weighted sum of the tracker's speed (x-axis) and success score (y-axis); the larger, the better. Ours-ST101 and Ours-ST50 indicate the proposed trackers with ResNet-101 and ResNet-50 as backbones, respectively. Better viewed in color.

The problem of long-range interactions has been tackled in sequence modeling through the use of the transformer [53]. The transformer has enjoyed rich success in tasks such as natural language modeling [13, 46] and speech recognition [40]. Recently, transformers have been employed in discriminative computer vision models and have drawn great attention [14, 5, 41].

Inspired by the recent DEtection TRansformer (DETR) [5], we propose a new end-to-end tracking architecture with an encoder-decoder transformer to boost the performance of conventional convolution models.

Both spatial and temporal information are important for object tracking. The former contains object appearance information for target localization, while the latter includes the state changes of objects across frames. Previous Siamese trackers [28, 59, 16, 7] only exploit spatial information for tracking, while online methods [63, 66, 11, 3] use historical predictions for model updates. Although successful, these methods do not explicitly model the relationship between space and time. In this work, considering the transformer's superior capacity for modeling global dependencies, we resort to it to integrate spatial and temporal information for tracking, generating discriminative spatio-temporal features for object localization. More specifically, we propose a new spatio-temporal architecture based on the encoder-decoder transformer for visual tracking.

The new architecture contains three key components: an encoder, a decoder, and a prediction head. The encoder accepts inputs of an initial target object, the current image, and a dynamically updated template. The self-attention modules in the encoder learn the relationships between the inputs through their feature dependencies. Since the template images are updated throughout video sequences, the encoder can capture both spatial and temporal information of the target. The decoder learns a query embedding to predict the spatial positions of the target object. A corner-based prediction head is used to estimate the bounding box of the target object in the current frame; meanwhile, a score head is learned to control the updates of the dynamic template.
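
To make this three-component design concrete, here is a minimal PyTorch-style sketch of one forward pass over the input triplet. It is an illustrative sketch, not the authors' released implementation: the module names, feature dimensions, the stand-in patchify backbone, and the plain nn.Transformer and linear heads are all assumptions made for brevity (the paper's actual box head is the corner-based design discussed later).

```python
import torch
import torch.nn as nn

class SpatioTemporalTracker(nn.Module):
    """Illustrative sketch of an encoder-decoder transformer tracker.

    Inputs are a triplet: initial template, dynamic template, and the
    current search region. Backbone features are flattened along their
    spatial dimensions, concatenated into one sequence, and fed to the
    encoder; a single learned query decodes the target state.
    """

    def __init__(self, dim=256, heads=8, enc_layers=6, dec_layers=6):
        super().__init__()
        # Stand-in backbone (the paper uses ResNet-50/101 features).
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.transformer = nn.Transformer(
            d_model=dim, nhead=heads,
            num_encoder_layers=enc_layers, num_decoder_layers=dec_layers,
            batch_first=True)
        self.query = nn.Embedding(1, dim)       # a single target query
        self.box_head = nn.Linear(dim, 4)       # placeholder box head
        self.score_head = nn.Linear(dim, 1)     # gates dynamic-template updates

    def flatten(self, img):
        f = self.backbone(img)                  # (B, C, H, W)
        return f.flatten(2).transpose(1, 2)     # (B, H*W, C)

    def forward(self, init_template, dyn_template, search):
        # One sequence from the triplet, so the encoder's self-attention
        # can relate template pixels to search-region pixels globally.
        seq = torch.cat([self.flatten(init_template),
                         self.flatten(dyn_template),
                         self.flatten(search)], dim=1)
        B = seq.size(0)
        tgt = self.query.weight.unsqueeze(0).expand(B, -1, -1)
        out = self.transformer(seq, tgt)        # (B, 1, C)
        box = self.box_head(out).sigmoid()      # normalized (x1, y1, x2, y2)
        score = self.score_head(out).sigmoid()  # confidence for template update
        return box, score
```

The point the sketch mirrors is that the flattened features of both templates and the search region form a single sequence, so self-attention can relate the target's appearance across frames to every search-region location in one step.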

Extensive experiments demonstrate that our method establishes new state-of-the-art performance on both short-term [20, 43] and long-term tracking benchmarks [15, 25]. For instance, our spatio-temporal transformer tracker surpasses Siam R-CNN [54] in AO score on GOT-10K [20] and in Success on LaSOT [15]. It is also worth noting that, compared with previous long-term trackers [9, 54, 62], the framework of our method is much simpler. Specifically, previous methods usually consist of multiple components, such as base trackers [11, 57], target verification modules [23], and global detectors [47, 21]. In contrast, our method only has a single network learned in an end-to-end fashion. Moreover, our tracker can run at real-time speed, being 6× faster than Siam R-CNN (30 vs. 5 fps) on a Tesla V100 GPU, as shown in Fig. 1. Considering recent trends of over-fitting on small-scale benchmarks, we collect a new large-scale tracking benchmark called NOTU, integrating all sequences from NFS [24], OTB100 [58], TC128 [33], and UAV123 [42]. In summary, this work has four contributions.

- We propose a new transformer architecture dedicated to visual tracking. It is capable of capturing global feature dependencies of both spatial and temporal information in video sequences.
- The whole method is end-to-end and does not need any postprocessing steps such as cosine window and bounding box smoothing, thus largely simplifying existing tracking pipelines.

- The proposed trackers achieve state-of-the-art performance on five challenging short-term and long-term benchmarks, while running at real-time speed.
- We construct a new large-scale tracking benchmark to alleviate the over-fitting problem on previous small-scale benchmarks.

Related Work

Transformer in Language and Vision. The transformer was originally proposed by Vaswani et al. [53] for the machine translation task, and has become a prevailing architecture in language modeling. The transformer takes a sequence as the input, scans through each element in the sequence, and learns their dependencies. This feature makes transformers intrinsically good at capturing global information in sequential data. Recently, transformers have shown great potential in vision tasks like image classification [14], object detection [5], semantic segmentation [56], multiple object tracking [51, 41], etc.
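
The global modeling described above rests on scaled dot-product self-attention, in which every sequence element attends to every other in a single step. Below is a minimal single-head sketch; the learned query/key/value projections, masking, and multi-head machinery are omitted, so it is illustrative only.

```python
import torch

def self_attention(x):
    """Scaled dot-product self-attention over a sequence x of shape (n, d).

    Every element attends to every other element, so dependencies are
    captured in one step regardless of the distance between them
    (unlike a convolution, which only sees a local neighborhood).
    """
    d = x.size(-1)
    q, k, v = x, x, x                             # single head, no projections
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (n, n) pairwise affinities
    weights = scores.softmax(dim=-1)              # each row sums to 1
    return weights @ v                            # (n, d) context-mixed output

# Toy usage: 6 tokens with 4 features each.
tokens = torch.randn(6, 4)
print(self_attention(tokens).shape)  # torch.Size([6, 4])
```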

Our work is inspired by the recent work DETR [5], but has the following fundamental differences. (1) The studied tasks are different. DETR is designed for object detection, while this work is for object tracking. (2) The network inputs are different. DETR takes the whole image as the input, while our input is a triplet consisting of one search region and two templates; their features from the backbone are first flattened and concatenated, then sent to the encoder. (3) The query design and training strategies are different. DETR uses 100 object queries and uses the Hungarian algorithm to match predictions with ground-truths during training. In contrast, our method only uses one query and always matches it with the ground-truth, without using the Hungarian algorithm. (4) The bounding box heads are different. DETR uses a three-layer perceptron to predict boxes, whereas our network adopts a corner-based box head for higher-quality boxes.
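
Difference (4) is worth making concrete. A common way to realize a corner-based box head is to predict two probability maps (top-left and bottom-right corners) with a small fully-convolutional branch and reduce each map to coordinates with a soft-argmax, i.e. the expected location under the corner distribution. The sketch below follows that general pattern; the branch depth, channel widths, and names are assumptions, not the paper's exact head.

```python
import torch
import torch.nn as nn

class CornerHead(nn.Module):
    """Sketch of a corner-based box head: two corner probability maps,
    each reduced to coordinates by a soft-argmax (expectation over the map)."""

    def __init__(self, dim=256):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(dim, dim // 2, 3, padding=1), nn.ReLU(),
                nn.Conv2d(dim // 2, 1, 1))           # one logit per location
        self.tl_branch = branch()                    # top-left corner map
        self.br_branch = branch()                    # bottom-right corner map

    def soft_argmax(self, logits):
        B, _, H, W = logits.shape
        prob = logits.flatten(1).softmax(dim=1).view(B, H, W)
        ys = torch.arange(H, dtype=prob.dtype, device=prob.device)
        xs = torch.arange(W, dtype=prob.dtype, device=prob.device)
        # Expected coordinates under the corner distribution, scaled to [0, 1].
        y = (prob.sum(dim=2) * ys).sum(dim=1) / (H - 1)
        x = (prob.sum(dim=1) * xs).sum(dim=1) / (W - 1)
        return x, y

    def forward(self, feat):                         # feat: (B, C, H, W)
        x1, y1 = self.soft_argmax(self.tl_branch(feat))
        x2, y2 = self.soft_argmax(self.br_branch(feat))
        return torch.stack([x1, y1, x2, y2], dim=1)  # (B, 4), normalized
```

Because the soft-argmax is differentiable, such a head can be trained end-to-end with plain box regression losses, with no anchors or proposals.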

Moreover, TransTrack [51] and TrackFormer [41] are two of the most recent representative works on transformer tracking. TransTrack [51] has the following features. (1) The encoder takes the image features of both the current and the previous frame as the inputs. (2) It has two decoders, which take the learned object queries and the queries from the last frame as the inputs, respectively. With the different queries, the output sequence from the encoder is transformed into detection boxes and tracking boxes, respectively. (3) The two predicted groups of boxes are matched based on their IoUs using the Hungarian algorithm [27]. TrackFormer [41] has the following features. (1) It only takes the current frame features as the encoder inputs. (2) There is only one decoder, where the learned object queries and the track queries from the last frame interact with each other. (3) It associates tracks over time solely by attention operations, not relying on any additional matching such as motion or appearance modeling.
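
For reference, the IoU-based Hungarian matching that TransTrack relies on reduces to a linear assignment over a pairwise-IoU cost matrix. A minimal sketch, assuming axis-aligned (x1, y1, x2, y2) boxes and SciPy's assignment solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(a, b):
    """Pairwise IoU between box arrays of shape (n, 4) and (m, 4),
    with boxes given as (x1, y1, x2, y2)."""
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def match(det_boxes, track_boxes):
    """Hungarian matching on negated IoU, i.e. maximize total overlap."""
    cost = -iou_matrix(det_boxes, track_boxes)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

dets = np.array([[0, 0, 10, 10], [20, 20, 30, 30]], dtype=float)
tracks = np.array([[21, 19, 31, 29], [1, 1, 9, 9]], dtype=float)
print(match(dets, tracks))  # [(0, 1), (1, 0)]
```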

In contrast, our work has the following fundamental differences from these two methods. (1) The network inputs are different. Our input is a triplet consisting of the current search region, the initial template, and a dynamic template. (2) Our method captures the appearance changes of the tracked targets by updating the dynamic template, rather than updating object queries as in [51, 41].

Spatio-Temporal Information Exploitation. Exploiting spatial and temporal information is a core problem in the object tracking field. Existing trackers can be divided into two classes: spatial-only ones and spatio-temporal ones. Most offline Siamese trackers [2, 29, 28, 69, 34] belong to the spatial-only ones, which consider object tracking as template-matching between the initial template and the current search region. To extract the relationship between the template and the search region along the spatial dimension, most trackers adopt variants of correlation, including the naive correlation [2, 29], the depth-wise correlation [28, 69], and the point-wise correlation [34, 61]. Although achieving remarkable progress in recent years, these methods merely capture local similarity, while ignoring global information.
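
The correlation variants named above differ mainly in how template features are slid over search-region features. The sketch below contrasts naive cross-correlation (the whole template acts as one filter, yielding a single-channel response map) with depth-wise cross-correlation (one response channel per feature channel), treating template features as convolution kernels; all shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: C-channel features of a small template and a
# larger search region, as produced by a shared Siamese backbone.
C, Hz, Wz, Hx, Wx = 256, 8, 8, 20, 20
z = torch.randn(1, C, Hz, Wz)   # template features
x = torch.randn(1, C, Hx, Wx)   # search-region features

# Naive cross-correlation: the whole template is a single filter,
# producing one response map that sums over all channels.
naive = F.conv2d(x, z.view(1, C, Hz, Wz))                 # (1, 1, 13, 13)

# Depth-wise cross-correlation: each template channel is matched against
# the corresponding search channel (groups=C), keeping a per-channel
# response for richer downstream prediction heads.
depthwise = F.conv2d(x, z.view(C, 1, Hz, Wz), groups=C)   # (1, 256, 13, 13)

print(naive.shape, depthwise.shape)
```

Either way, the response at each position depends only on the local window under the template, which is exactly the locality limitation the text points out.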

