Transcription of YOLACT: Real-Time Instance Segmentation
YOLACT: Real-time Instance Segmentation

Daniel Bolya, Chong Zhou, Fanyi Xiao, Yong Jae Lee
University of California, Davis
{dbolya, cczhou, fyxiao, ...}

We present a simple, fully-convolutional model for real-time instance segmentation that achieves around 30 mAP on MS COCO at above 30 fps evaluated on a single Titan Xp, which is significantly faster than any previous competitive approach. Moreover, we obtain this result after training on only one GPU. We accomplish this by breaking instance segmentation into two parallel subtasks: (1) generating a set of prototype masks and (2) predicting per-instance mask coefficients. Then we produce instance masks by linearly combining the prototypes with the mask coefficients. We find that because this process doesn't depend on repooling, this approach produces very high-quality masks and exhibits temporal stability for free. Furthermore, we analyze the emergent behavior of our prototypes and show they learn to localize instances on their own in a translation variant manner, despite being fully-convolutional.
Finally, we also propose Fast NMS, a drop-in 12 ms faster replacement for standard NMS that only has a marginal performance penalty.

Introduction

"Boxes are stupid anyway though, I'm probably a true believer in masks except I can't get YOLO to learn them."
Joseph Redmon, YOLOv3 [36]

What would it take to create a real-time instance segmentation algorithm? Over the past few years, the vision community has made great strides in instance segmentation, in part by drawing on powerful parallels from the well-established domain of object detection. State-of-the-art approaches to instance segmentation like Mask R-CNN [18] and FCIS [24] directly build off of advances in object detection like Faster R-CNN [37] and R-FCN [8]. Yet, these methods focus primarily on performance over speed, leaving the scene devoid of instance segmentation parallels to real-time object detectors like SSD [30] and YOLO [35,36]. In this work, our goal is to fill that gap with a fast, one-stage instance segmentation model in the same way that SSD and YOLO fill that gap for object detection.

Figure 1: Speed-performance trade-off for various instance segmentation methods on COCO.
To our knowledge, ours is the first real-time (above 30 FPS) approach with around 30 mask mAP on COCO.

However, instance segmentation is hard, much harder than object detection. One-stage object detectors like SSD and YOLO are able to speed up existing two-stage detectors like Faster R-CNN by simply removing the second stage and making up for the lost performance in other ways. The same approach is not easily extendable, however, to instance segmentation. State-of-the-art two-stage instance segmentation methods depend heavily on feature localization to produce masks. That is, these methods re-pool features in some bounding box region (e.g., via RoI-pool/align), and then feed these now localized features to their mask predictor. This approach is inherently sequential and is therefore difficult to accelerate. One-stage methods that perform these steps in parallel like FCIS do exist, but they require significant amounts of post-processing after localization, and thus are still far from real-time.

To address these issues, we propose YOLACT, a real-time instance segmentation framework that forgoes an explicit localization step.
Instead, YOLACT (You Only Look At CoefficienTs) breaks up instance segmentation into two parallel tasks: (1) generating a dictionary of non-local prototype masks over the entire image, and (2) predicting a set of linear combination coefficients per instance. Then producing a full-image instance segmentation from these two components is simple: for each instance, linearly combine the prototypes using the corresponding predicted coefficients and then crop with a predicted bounding box. We show that by segmenting in this manner, the network learns how to localize instance masks on its own, where visually, spatially, and semantically similar instances appear different in the prototypes.

Moreover, since the number of prototype masks is independent of the number of categories (e.g., there can be more categories than prototypes), YOLACT learns a distributed representation in which each instance is segmented with a combination of prototypes that are shared across categories. This distributed representation leads to interesting emergent behavior in the prototype space: some prototypes spatially partition the image, some localize instances, some detect instance contours, some encode position-sensitive directional maps (similar to those obtained by hard-coding a position-sensitive module in FCIS [24]), and most do a combination of these tasks (see Figure 5).
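The per-instance assembly described above (linearly combine the prototypes, then crop with the predicted box) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `assemble_mask`, the (H, W, k) array layout, the pixel-coordinate box format, and the threshold value of 0 are all chosen for the example, and any nonlinearity applied before thresholding is omitted.

```python
import numpy as np


def assemble_mask(prototypes, coeffs, box, threshold=0.0):
    """Build one binary instance mask from shared prototypes.

    prototypes: float array of shape (H, W, k), assumed layout.
    coeffs:     float array of shape (k,), one coefficient per prototype.
    box:        (x1, y1, x2, y2) integer pixel coordinates, x2/y2 exclusive
                (hypothetical format chosen for this sketch).
    threshold:  illustrative cut-off applied to the combined response.
    """
    # Linear combination of the k prototypes: (H, W, k) @ (k,) -> (H, W).
    combined = prototypes @ coeffs

    # Crop: only pixels inside the predicted box may belong to the mask.
    x1, y1, x2, y2 = box
    inside = np.zeros(combined.shape, dtype=bool)
    inside[y1:y2, x1:x2] = True

    return (combined > threshold) & inside


# Toy example with k = 2 hand-made prototypes on a 4x4 image.
left = np.zeros((4, 4))
left[:, :2] = 1.0          # prototype 0 fires on the left half
top = np.zeros((4, 4))
top[:2, :] = 1.0           # prototype 1 fires on the top half
prototypes = np.stack([left, top], axis=-1)

# "Left but not top" selects the bottom-left quadrant before cropping.
mask = assemble_mask(prototypes, np.array([1.0, -1.0]), box=(0, 3, 4, 4))
print(mask.astype(int))
```

The negative coefficient shows why a linear combination is expressive even with few prototypes: one prototype can subtract a region that another one activates.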
This approach also has several practical advantages. First and foremost, it's fast: because of its parallel structure and extremely lightweight assembly process, YOLACT adds only a marginal amount of computational overhead to a one-stage backbone detector, making it easy to reach 30 fps even when using ResNet-101 [19]; in fact, the entire mask branch takes only 5 ms to evaluate. Second, masks are high-quality: since the masks use the full extent of the image space without any loss of quality from repooling, our masks for large objects are significantly higher quality than those of other methods (see Figure 7). Finally, it's general: the idea of generating prototypes and mask coefficients could be added to almost any modern object detector.

Our main contribution is the first real-time (>30 fps) instance segmentation algorithm with competitive results on the challenging MS COCO dataset [28] (see Figure 1). In addition, we analyze the emergent behavior of YOLACT's prototypes and provide experiments to study the speed vs.
performance trade-offs obtained with different backbone architectures, numbers of prototypes, and image resolutions. We also provide a novel Fast NMS approach that is 12 ms faster than traditional NMS with a negligible performance penalty. The code for YOLACT is available.

Related Work

Instance Segmentation. Given its importance, a lot of research effort has been made to push instance segmentation accuracy. Mask R-CNN [18] is a representative two-stage instance segmentation approach that first generates candidate regions of interest (RoIs) and then classifies and segments those RoIs in the second stage. Follow-up works try to improve its accuracy by, e.g., enriching the FPN features [29] or addressing the incompatibility between a mask's confidence score and its localization accuracy [20]. These two-stage methods require re-pooling features for each RoI and processing them with subsequent computations, which makes them unable to obtain real-time speeds (30 fps) even when decreasing image size (see Table 2c).
One-stage instance segmentation methods generate position-sensitive maps that are assembled into final masks with position-sensitive pooling [6,24] or combine semantic segmentation logits and direction prediction logits [4]. Though conceptually faster than two-stage methods, they still require repooling or other non-trivial computations (e.g., mask voting). This severely limits their speed, placing them far from real-time. In contrast, our assembly step is much more lightweight (only a linear combination) and can be implemented as one GPU-accelerated matrix-matrix multiplication, making our approach very fast.

Finally, some methods first perform semantic segmentation followed by boundary detection [22], pixel clustering [3,25], or learn an embedding to form instance masks [32,17,9,13]. Again, these methods have multiple stages and/or involve expensive clustering procedures, which limits their viability for real-time applications.

Real-time Instance Segmentation. While real-time object detection [30,34,35,36] and semantic segmentation [2,41,33,11,47] methods exist, few works have focused on real-time instance segmentation.
Straight to Shapes [21] and Box2Pix [42] can perform instance segmentation in real-time (30 fps on Pascal SBD 2012 [12,16] for Straight to Shapes; on Cityscapes [5] and at 35 fps on KITTI [15] for Box2Pix), but their accuracies are far from that of modern baselines. In fact, Mask R-CNN [18] remains one of the fastest instance segmentation methods on semantically challenging datasets like COCO [28] (on 550² px images; see Table 2c).

Prototypes. Learning prototypes (aka vocabulary or codebook) has been extensively explored in computer vision. Classical representations include textons [23] and visual words [40], with advances made via sparsity and locality priors [44,43,46]. Others have designed prototypes for object detection [1,45,38]. Though related, these works use prototypes to represent features, whereas we use them to assemble masks for instance segmentation. Moreover, we learn prototypes that are specific to each image, rather than global prototypes shared across the entire dataset.

YOLACT

Figure 2: YOLACT Architecture. Blue/yellow indicates low/high values in the prototypes, gray nodes indicate functions that are not trained, and k = 4 in this example. We base this architecture off of RetinaNet [27] using ResNet-101 + FPN. (Diagram labels: Feature Backbone, Feature Pyramid, Protonet, Prediction Head, NMS, Prototypes, Mask Coefficients, Assembly, Crop, Threshold; Detection 1: Person, Detection 2: Racket.)

Our goal is to add a mask branch to an existing one-stage object detection model in the same vein as Mask R-CNN [18] does to Faster R-CNN [37], but without an explicit feature localization step (e.g., feature repooling). To do this, we break up the complex task of instance segmentation into two simpler, parallel tasks that can be assembled to form the final masks. The first branch uses an FCN [31] to produce a set of image-sized prototype masks that do not depend on any one instance. The second adds an extra head to the object detection branch to predict a vector of mask coefficients for each anchor that encode an instance's representation in the prototype space. Finally, for each instance that survives NMS, we construct a mask for that instance by linearly combining the work of these two branches.

We perform instance segmentation in this way primarily because masks are spatially coherent; i.e., pixels close to each other are likely to be part of the same instance. While a convolutional (conv) layer naturally takes advantage of this coherence, a fully-connected (fc) layer does not. This poses a problem, since one-stage object detectors produce class and box coefficients for each anchor as an output of an fc layer (see footnote 2). Two-stage approaches like Mask R-CNN get around this problem by using a localization step (e.g., RoI-Align), which preserves the spatial coherence of the features while also allowing the mask to be a conv layer output.
However, doing so requires a significant portion of the model to wait for a first-stage RPN to propose localization candidates, inducing a significant speed penalty.

Thus, we break the problem into two parallel parts, making use of fc layers, which are good at producing semantic vectors, and conv layers, which are good at producing spatially coherent masks, to produce the mask coefficients and prototype masks, respectively. Then, because prototypes and mask coefficients can be computed independently, the computational overhead over that of the backbone detector comes mostly from the assembly step, which can be implemented as a single matrix multiplication. In this way, we can maintain spatial coherence in the feature space while still being one-stage and fast.

Footnote 2: To show that this is an issue, we develop an fc-mask model that produces masks for each anchor as the reshaped output of an fc layer. As our experiments in Table 2c show, simply adding masks to a one-stage model as fc outputs obtains only a low mAP and is thus very much insufficient.

Prototype Generation

The prototype generation branch (protonet) predicts a set of k prototype masks for the entire image.
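Because each mask is only a linear combination of the k prototypes, the assembly for all surviving instances can be written as one matrix product. The sketch below illustrates this under assumed conventions: prototypes stored as an (H, W, k) array, coefficients as an (n, k) array, and a hypothetical function name `assemble_all`. It omits cropping and thresholding and is not taken from the released code.

```python
import numpy as np


def assemble_all(prototypes, coeffs):
    """Combine k prototypes into n mask responses with a single matmul.

    prototypes: float array of shape (H, W, k), assumed layout.
    coeffs:     float array of shape (n, k), one row per detected instance.
    returns:    float array of shape (H, W, n), one response map per instance.
    """
    h, w, k = prototypes.shape
    flat = prototypes.reshape(h * w, k)   # every pixel becomes a k-vector
    responses = flat @ coeffs.T           # (H*W, k) @ (k, n) -> (H*W, n)
    return responses.reshape(h, w, -1)


# Check against the per-instance loop on random data.
rng = np.random.default_rng(0)
P = rng.standard_normal((5, 6, 4))   # k = 4 prototypes on a 5x6 grid
C = rng.standard_normal((3, 4))      # n = 3 instances
batched = assemble_all(P, C)
looped = np.stack([P @ c for c in C], axis=-1)
print(np.allclose(batched, looped))
```

The cost of this step grows with the number of prototypes and detections rather than with any per-instance repooling, which is consistent with the text's claim that the overhead over the backbone detector is small.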