arXiv:2004.01888v6 [cs.CV] 19 Oct 2021

FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking

Yifu Zhang1, Chunyu Wang2, Xinggang Wang1, Wenjun Zeng2, Wenyu Liu1

Abstract

Multi-object tracking (MOT) is an important problem in computer vision with a wide range of applications. Formulating MOT as multi-task learning of object detection and re-ID in a single network is appealing since it allows joint optimization of the two tasks and enjoys high computation efficiency. However, we find that the two tasks tend to compete with each other, which needs to be carefully addressed.

In particular, previous works usually treat re-ID as a secondary task whose accuracy is heavily affected by the primary detection task. As a result, the network is biased to the primary detection task, which is not fair to the re-ID task. To solve the problem, we present a simple yet effective approach termed FairMOT based on the anchor-free object detection architecture CenterNet. Note that it is not a naive combination of CenterNet and re-ID. Instead, we present a number of detailed designs which are critical to achieving good tracking results, established through thorough empirical studies. The resulting approach achieves high accuracy for both detection and tracking.

The approach outperforms the state-of-the-art methods by a large margin on several public datasets. The source code and pre-trained models are publicly released.

Keywords: Multi-Object Tracking, One-Shot, Anchor-Free, Real-Time Inference

1 University of Science and Technology, Wuhan, China
2 Microsoft Research Asia, Beijing, China
Yifu Zhang and Chunyu Wang have contributed equally.

1 Introduction

Multi-Object Tracking (MOT) has been a longstanding goal in computer vision (Bewley et al., 2016; Wojke et al., 2017; Chen et al., 2018a; Yu et al., 2016). The goal is to estimate trajectories for objects of interest presented in videos. The successful resolution of the problem can immediately benefit many applications such as intelligent video analysis, human computer interaction, human activity recognition (Wang et al., 2013; Luo et al., 2017), and even social computing.

Most of the existing methods, such as (Mahmoudi et al., 2019; Zhou et al., 2018; Fang et al., 2018; Bewley et al., 2016; Wojke et al., 2017; Chen et al., 2018a; Yu et al., 2016), attempt to address the problem with two separate models: the detection model first detects objects of interest by bounding boxes in each frame, then the association model extracts re-identification (re-ID) features from the image regions corresponding to each bounding box and either links each detection to one of the existing tracks or creates a new track according to certain metrics defined on features. There has been remarkable progress on object detection (Ren et al., 2015; He et al., 2017; Zhou et al., 2019a; Redmon and Farhadi, 2018; Fu et al., 2020; Sun et al., 2021b,a) and re-ID (Zheng et al., 2017a; Chen et al., 2018a) recently, which in turn boosts the overall tracking accuracy. However, these two-step methods suffer from scalability issues. They cannot achieve real-time inference speed when there are a large number of objects in the environment because the two models do not share features and they need to apply the re-ID model to every bounding box independently in the video.

With the maturity of multi-task learning (Kokkinos, 2017; Chen et al., 2018b), one-shot trackers which estimate objects and learn re-ID features using a single network have attracted more attention (Wang et al., 2020b; Voigtlaender et al., 2019). For example, Voigtlaender et al. (2019) add a re-ID branch to Mask R-CNN to extract a re-ID feature for each proposal (He et al., 2017). It reduces inference time by re-using backbone features for the re-ID network, but the performance drops remarkably compared to the two-step models. In fact, the detection accuracy is still good but the tracking performance drops considerably. For example, the number of ID switches increases by a large margin.

The result suggests that combining the two tasks is non-trivial and should be treated carefully.

In this paper, we investigate the reasons behind the failure and present a simple yet effective solution. Three factors are identified to account for the failure. The first issue is caused by anchors. Anchors were originally designed for object detection (Ren et al., 2015). However, we show that anchors are not suitable for extracting re-ID features for two reasons. First, anchor-based one-shot trackers such as Track R-CNN (Voigtlaender et al., 2019) overlook the re-ID task because they need anchors to first detect objects (i.e., using RPN (Ren et al., 2015)) and then extract the re-ID features based on the detection results (re-ID features are useless when detection results are incorrect). So when competition occurs between the two tasks, the network will favor the detection task. Second, anchors introduce a lot of ambiguity when training the re-ID features because one anchor may correspond to multiple identities and multiple anchors may correspond to one identity, especially in crowded scenes.

The second issue is caused by feature sharing between the two tasks. Detection and re-ID are two totally different tasks and they need different features.

In general, re-ID features need more low-level features to discriminate different instances of the same class, while detection features need to be similar for different instances. The shared features in one-shot trackers will lead to feature conflict and thus reduce the performance of each task.

The third issue is caused by feature dimension. The dimension of re-ID features is usually as high as 512 (Wang et al., 2020b) or 1024 (Zheng et al., 2017a), which is much higher than that of object detection. We find that the huge difference between dimensions will harm the performance of the two tasks.

More importantly, our experiments suggest that it is a generic rule that learning low-dimensional re-ID features for joint detection and re-ID networks achieves both higher tracking accuracy and efficiency. This also reveals the difference between the MOT task and the re-ID task, which is overlooked in the field of MOT.

In this work, we present a simple approach termed FairMOT which elegantly addresses the three issues, as illustrated in Figure 1. FairMOT is built on top of CenterNet (Zhou et al., 2019a). In particular, the detection and re-ID tasks are treated equally in FairMOT, which essentially differs from the previous "detection first, re-ID secondary" framework.
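
The efficiency side of the feature-dimension argument can be made concrete with a small sketch. When detections are linked to existing tracks by comparing re-ID features, every detection/track pair needs one dot product, so the cost grows linearly with the feature dimension. The code below is only an illustration under stated assumptions: the function names are hypothetical, cosine distance is assumed as the metric, and the low dimension of 64 used in the usage note is an assumed value for comparison, not a figure taken from this section. It says nothing about the accuracy side of the claim.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def distance_matrix(detections, tracks):
    """Pairwise cosine distances: one row per detection, one column per track."""
    return [[cosine_distance(d, t) for t in tracks] for d in detections]


def dot_product_multiplies(num_detections, num_tracks, dim):
    """Multiplications spent on the pairwise dot products alone (assumed cost model)."""
    return num_detections * num_tracks * dim
```

Under this cost model, comparing 100 detections against 100 tracks takes `dot_product_multiplies(100, 100, 512)` multiplications with 512-dimensional features and `dot_product_multiplies(100, 100, 64)` with the assumed 64-dimensional ones, an eightfold difference per frame.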
