ABSTRACT arXiv:1602.00763v2 [cs.CV] 7 Jul 2017

SIMPLE ONLINE AND REALTIME TRACKING

Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, Ben Upcroft
Queensland University of Technology, University of Sydney

ABSTRACT

This paper explores a pragmatic approach to multiple object tracking where the main focus is to associate objects efficiently for online and realtime applications. To this end, detection quality is identified as a key factor influencing tracking performance, where changing the detector can improve tracking by up to [...]. Despite only using a rudimentary combination of familiar techniques such as the Kalman Filter and Hungarian algorithm for the tracking components, this approach achieves an accuracy comparable to state-of-the-art online trackers. Furthermore, due to the simplicity of our tracking method, the tracker updates at a rate of 260 Hz, which is over 20x faster than other state-of-the-art trackers.

Index Terms: Computer Vision, Multiple Object Tracking, Detection, Data Association

1. INTRODUCTION

This paper presents a lean implementation of a tracking-by-detection framework for the problem of multiple object tracking (MOT) where objects are detected each frame and represented as bounding boxes. In contrast to many batch based tracking approaches [1, 2, 3], this work is primarily targeted towards online tracking where only detections from the previous and the current frame are presented to the tracker. Additionally, a strong emphasis is placed on efficiency for facilitating realtime tracking and to promote greater uptake in applications such as pedestrian tracking for autonomous vehicles.

The MOT problem can be viewed as a data association problem where the aim is to associate detections across frames in a video sequence. To aid the data association process, trackers use various methods for modelling the motion [1, 4] and appearance [5, 3] of objects in the scene.

The methods employed by this paper were motivated through observations made on a recently established visual MOT benchmark [6]. Firstly, there is a resurgence of mature data association techniques including Multiple Hypothesis Tracking (MHT) [7, 3] and Joint Probabilistic Data Association (JPDA) [2] which occupy many of the top positions of the MOT benchmark. Secondly, the only tracker that does not use the Aggregate Channel Filter (ACF) [8] detector is also the top ranked tracker, suggesting that detection quality could be holding back the other trackers.

Thanks to ACARP for [...]

[Figure: "Accuracy vs. Speed" scatter plot; axes Speed (Hz) and Accuracy (MOTA), with a "Realtime" marker. Trackers shown: TDAM, LP2D, TBD, JPDA, NOMT, DP_NMS, SMOT, SORT, MDP, TC_ODAL, MHT_DAM.]

Fig. 1. Benchmark performance of the proposed method (SORT) in relation to several baseline trackers [6]. Each marker indicates a tracker's accuracy and speed measured in frames per second (FPS) [Hz], i.e. higher and more right is better.

Furthermore, the trade-off between accuracy and speed appears quite pronounced, since the speed of most accurate trackers is considered too slow for realtime applications (see Fig. 1). With the prominence of traditional data association techniques among the top online and batch trackers, along with the use of different detections by the top tracker, this work explores how simple MOT can be and how well it can perform.

In line with Occam's Razor, appearance features beyond the detection component are ignored in tracking and only the bounding box position and size are used for both motion estimation and data association. Furthermore, issues regarding short-term and long-term occlusion are also ignored, as they occur very rarely and their explicit treatment introduces undesirable complexity into the tracking framework. We argue that incorporating complexity in the form of object re-identification adds significant overhead into the tracking framework, potentially limiting its use in realtime applications.

This design philosophy is in contrast to many proposed visual trackers that incorporate a myriad of components to handle various edge cases and detection errors [9, 10, 11, 12].

This work instead focuses on efficient and reliable handling of the common frame-to-frame associations. Rather than aiming to be robust to detection errors, we instead exploit recent advances in visual object detection to solve the detection problem directly. This is demonstrated by comparing the common ACF pedestrian detector [8] with a recent convolutional neural network (CNN) based detector [13]. Additionally, two classical yet extremely efficient methods, Kalman filter [14] and Hungarian method [15], are employed to handle the motion prediction and data association components of the tracking problem respectively. This minimalistic formulation of tracking facilitates both efficiency and reliability for online tracking, see Fig. 1. In this paper, this approach is only applied to tracking pedestrians in various environments; however, due to the flexibility of CNN based detectors [13], it naturally can be generalized to other objects.

The main contributions of this paper are:

- We leverage the power of CNN based detection in the context of MOT.
- A pragmatic tracking approach based on the Kalman filter and the Hungarian algorithm is presented and evaluated on a recent MOT benchmark.
- Code will be open sourced to help establish a baseline method for research experimentation and uptake in collision avoidance applications.

This paper is organised as follows: Section 2 provides a short review of related literature in the area of multiple object tracking. Section 3 describes the proposed lean tracking framework before the effectiveness of the proposed framework on standard benchmark sequences is demonstrated in Section 4. Finally, Section 5 provides a summary of the learnt outcomes and discusses future improvements.

2. LITERATURE REVIEW

Traditionally, MOT has been solved using Multiple Hypothesis Tracking (MHT) [7] or the Joint Probabilistic Data Association (JPDA) filters [16, 2], which delay making difficult decisions while there is high uncertainty over the object assignments.

The combinatorial complexity of these approaches is exponential in the number of tracked objects, making them impractical for realtime applications in highly dynamic environments. Recently, Rezatofighi et al. [2] revisited the JPDA formulation [16] in visual MOT with the goal to address the combinatorial complexity issue with an efficient approximation of the JPDA by exploiting recent developments in solving integer programs. Similarly, Kim et al. [3] used an appearance model for each target to prune the MHT graph to achieve state-of-the-art performance. However, these methods still delay the decision making, which makes them unsuitable for online tracking.

Many online tracking methods aim to build appearance models of either the individual objects themselves [17, 18, 12] or a global model [19, 11, 4, 5] through online learning. In addition to appearance models, motion is often incorporated to assist associating detections to tracklets [1, 19, 4, 11].

When considering only one-to-one correspondences modelled as bipartite graph matching, globally optimal solutions such as the Hungarian algorithm [15] can be used [10, 20].

The method by Geiger et al. [20] uses the Hungarian algorithm [15] in a two stage process. First, tracklets are formed by associating detections across adjacent frames where both geometry and appearance cues are combined to form the affinity matrix. Then, the tracklets are associated to each other to bridge broken trajectories caused by occlusion, again using both geometry and appearance cues. This two step association method restricts this approach to batch computation. Our approach is inspired by the tracking component of [20], however we simplify the association to a single stage with basic cues as described in the next section.

3. METHODOLOGY

The proposed method is described by the key components of detection, propagating object states into future frames, associating current detections with existing objects, and managing the lifespan of tracked objects.

3.1. Detection

To capitalise on the rapid advancement of CNN based detection, we utilise the Faster Region CNN (FrRCNN) detection framework [13].

FrRCNN is an end-to-end framework that consists of two stages. The first stage extracts features and proposes regions for the second stage, which then classifies the object in the proposed region. The advantage of this framework is that parameters are shared between the two stages, creating an efficient framework for detection. Additionally, the network architecture itself can be swapped to any design, which enables rapid experimentation of different architectures to improve the detection performance.

Here we compare two network architectures provided with FrRCNN, namely the architecture of Zeiler and Fergus (FrRCNN(ZF)) [21] and the deeper architecture of Simonyan and Zisserman (FrRCNN(VGG16)) [22]. Throughout this work, we apply the FrRCNN with default parameters learnt for the PASCAL VOC challenge. As we are only interested in pedestrians, we ignore all other classes and only pass person detection results with output probabilities greater than 50% to the tracking framework.
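The filtering rule described above (keep only the person class, and only detections with probability above 50%) can be sketched as follows. This is a minimal illustration rather than the authors' code: the `Detection` record, its field names, and the box layout are hypothetical and do not reflect the actual FrRCNN output format.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    """Hypothetical detector output: a labelled, scored bounding box."""
    label: str    # class name, e.g. "person"
    score: float  # output probability in [0, 1]
    box: tuple    # (x1, y1, x2, y2) in pixels; this layout is an assumption


def select_person_detections(detections, min_score=0.5):
    """Keep only 'person' detections whose probability exceeds min_score."""
    return [d for d in detections if d.label == "person" and d.score > min_score]


detections = [
    Detection("person", 0.92, (10, 20, 50, 120)),
    Detection("person", 0.35, (200, 40, 230, 110)),  # below the threshold
    Detection("car", 0.88, (300, 60, 420, 140)),     # other class, ignored
    Detection("person", 0.50, (60, 30, 90, 130)),    # not strictly greater than 50%
]
kept = select_person_detections(detections)
print([d.score for d in kept])  # only the 0.92 detection survives
```

The comparison is strict (`>`), matching the wording "greater than 50%"; whether a detection at exactly 50% should pass is a detail the text does not settle.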

Table 1. Comparison of tracking performance by switching the detector component. Evaluated on Validation sequences as listed in [12].

[Table 1 layout: columns Tracker | Detector | Detection (Recall, Precision) | Tracking (ID Sw, MOTA). Surviving row labels: MDP [12], (ZF), (VGG16), (ZF), (VGG16).]

In our experiments, we found that the detection quality has a significant impact on tracking performance when comparing the FrRCNN detections to ACF detections. This is demonstrated using a validation set of sequences applied to both an existing online tracker MDP [12] and the tracker proposed here. Table 1 shows that the best detector (FrRCNN(VGG16)) leads to the best tracking accuracy for both MDP and the proposed tracker.

3.2. Estimation Model

Here we describe the object model, the representation and the motion model used to propagate a target's identity into the next frame. We approximate the inter-frame displacements of each object with a linear constant velocity model which is independent of other objects and camera motion.
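To make the linear constant velocity assumption concrete, the sketch below propagates a single object's state by one or more frames. It is a simplified illustration under stated assumptions: the state layout (box centre plus per-frame velocity) and the names `TrackState` and `predict` are hypothetical, and only the deterministic part of the prediction is shown, without the uncertainty handling that a Kalman filter adds.

```python
from dataclasses import dataclass


@dataclass
class TrackState:
    """Hypothetical per-object state: box centre plus its per-frame velocity."""
    x: float
    y: float
    vx: float = 0.0
    vy: float = 0.0


def predict(state: TrackState, frames: int = 1) -> TrackState:
    """Propagate the state forward under a linear constant velocity model.

    Position advances by velocity * frames; velocity is left unchanged.
    Each object is propagated on its own, with no camera-motion term.
    """
    return TrackState(
        x=state.x + state.vx * frames,
        y=state.y + state.vy * frames,
        vx=state.vx,
        vy=state.vy,
    )


track = TrackState(x=100.0, y=50.0, vx=2.0, vy=-1.0)
nxt = predict(track)
print(nxt)  # centre moves to (102.0, 49.0); velocity stays (2.0, -1.0)
```

Because each track carries its own state and `predict` takes no other input, the independence from other objects and from camera motion is visible directly in the function signature.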
