Transcription of Multi-scale Patch Aggregation (MPA) for Simultaneous ...
1 Multi-scale Patch Aggregation (MPA) for Simultaneous Detection and Segmentation
Shu Liu, Xiaojuan Qi, Jianping Shi, Hong Zhang, Jiaya Jia. The Chinese University of Hong Kong; SenseTime Group Limited. {sliu, xjqi, hzhang, [...]
[...] at simultaneous detection and segmentation (SDS), we propose a proposal-free framework, which detects and segments object instances via mid-level patches. We design a unified trainable network on patches, which is followed by a fast and effective patch aggregation algorithm to infer object instances. Our method benefits from end-to-end training. Without object proposal generation, computation time can also be reduced. In experiments, our method [...] in terms of mAP^r on VOC2012 segmentation val and VOC2012 SDS val, which are state-of-the-art at the time of submission. We also report results on the Microsoft COCO test-std/test-dev dataset in this paper.
Introduction
Object detection and semantic segmentation have been core tasks of image understanding for a long time.
2 Object detection focuses on generating bounding boxes for objects. These boxes may not be accurate enough to localize objects. Semantic segmentation, meanwhile, predicts a more detailed pixel-level mask for different classes. It however ignores the existence of single-object instances. Simultaneous detection and segmentation (SDS) [14] has therefore become a promising direction for generating pixel-level labels for every object instance, naturally leading to the next-generation object recognition goal [24]. Accurate and efficient SDS can be used in many disciplines as a fundamental tool, where both pixel-wise labels and object instance information can help build robotics, achieve automatic driving, enhance surveillance systems, and construct intelligent homes, to name a few. SDS is more challenging than object detection and semantic segmentation separately. In this task, both instance-level information and a pixel-wise accurate mask for each object are to be estimated.
3 Nearly all previous work [14, 5, 15, 3] took bottom-up segment-based object proposals [35, 31] as input and modeled the system as classifying proposals with the help of powerful deep convolutional neural networks (DCNNs). Classified proposals are either output or refined in post-processing to produce final results.
(This work is supported by a grant from the Research Grants Council of the Hong Kong SAR (project No. 413113).)
[...] of Object Proposals in SDS. It has been noticed that systems with object-proposal input may be accompanied by a few shortcomings. First, generating segment-based proposals takes time. The high-quality proposal generator [31] that was employed in previous SDS work [14, 5, 15, 3] takes about 40 seconds to process one image. It was discussed in [5] that using faster segment-based proposals decreases performance. The newest proposal generators [30] have not yet been evaluated for SDS. Second, the overall SDS performance is bounded by the quality of proposals, since these systems only select among the provided proposals.
4 Object proposals inevitably contain noise regarding missing objects and errors inside each proposal. Last but not least, if an SDS system is independent of object proposal generation, end-to-end parameter tuning is impossible. Thus, the system loses the chance to learn feature and structure information from images directly, which however could be important to further improve the system performance with information [...].
End-to-End SDS Solution. To address these issues, we propose a systematically feasible scheme to integrate object proposal generation into the networks, enabling end-to-end training from images to pixel-level labels for instance-aware semantic segmentation. Although beautiful in concept, establishing suitable models is difficult in practice due to the various scales, aspect ratios, and deformations of objects. In our work, instead of segmenting objects directly, we propose segmenting and classifying part of or entire objects using many densely located patches.
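As a rough illustration of what "densely located patches" at several scales could look like, the sketch below enumerates square windows over an image at a few sizes with a fixed overlap. The function name dense_patches, the patch sizes, and the 50% stride are assumptions made for illustration only; the paper's own multi-scale patch generator is a component of the network and is not reproduced here.

```python
def dense_patches(width, height, sizes=(48, 96), stride_ratio=0.5):
    """Enumerate square boxes (x0, y0, x1, y1) on a dense grid at several scales.

    Hypothetical helper: sizes and stride_ratio are illustrative assumptions,
    not values taken from the paper.
    """
    boxes = []
    for size in sizes:
        if size > width or size > height:
            continue  # a patch larger than the image is skipped
        stride = max(1, int(size * stride_ratio))
        # Slide the window so that neighbouring patches overlap.
        for y0 in range(0, height - size + 1, stride):
            for x0 in range(0, width - size + 1, stride):
                boxes.append((x0, y0, x0 + size, y0 + size))
    return boxes
```

With a stride smaller than the patch size, each object pixel is covered by several patches at each scale, which is the property the later aggregation step relies on.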
5 The mask of an object is then generated by aggregating masks of the overlapping patches in a post-processing step, as shown in Fig. 1. This scheme shares the spirit of mid-level representation [1, 7, 36] and part-based models [37, 16]. It is yet different by nature in terms of system construction and [...]. In our scheme, overlapped patches gather different levels of information for final object segmentation, which makes the result more robust than prediction from only one [...]. The end-to-end trainable SDS system thus outputs semantic segment labels in [...].
Figure 1 (panels: image, densely localized patches, aggregation result). Objects overlapped with many densely localized patches. After segmenting objects in different patches, aggregation can be used to infer complete objects.
Contributions. Our framework to tackle the SDS problem makes the following main contributions. We propose a strategy to generate dense multi-scale patches for object parsing. Our unified, end-to-end trainable, proposal-free network achieves segmentation and classification simultaneously for each patch.
6 By sharing convolution in the network, computation time is reduced and good-quality results are produced. We develop an efficient algorithm to infer the segmentation mask for each object by merging information from mid-level patches. We evaluated our method on the PASCAL VOC 2012 segmentation validation and VOC 2012 SDS validation benchmark datasets. Our method yields state-of-the-art performance with reasonably short running time. We also evaluated it on Microsoft COCO test-std and test-dev data. Decent performance is achieved based on the VGG-16 network structure without network [...].
Related Work
The SDS task is closely related to object detection, semantic segmentation, and proposal generation. We briefly review them in this section.
Object Detection. Object detection has a long history in computer vision. Before DCNNs showed their great ability for image classification [21, 33], part-based models [9, 37] were popular. Recent object detection frameworks [11, 12, 17, 37, 29, 34, 23, 32, 10] are based on DCNNs [21, 33] to classify object proposals.
7 These methods either take object proposals as independent input [12, 37, 29, 34], or use the entire image and pool features for each proposal [17, 11, 32, 10]. Different from these methods, Ren et al. [32] unified proposal generation and classification with shared convolution feature maps. This saves time in generating object proposals and yields good results.
Semantic Segmentation. DCNNs [21, 33] also boost the performance of semantic segmentation [26, 5, 15, 27, 2, 20, 28, 25]. Related methods can be categorized into two streams: one utilizes DCNNs to classify segment proposals [15, 5], and the other uses fully convolutional networks [26, 2, 20, 25] for dense prediction. CRF can be applied in post-processing [2] or incorporated in the network [20, 25] to refine the segments.
SDS. SDS is a relatively new topic. Hariharan et al. [14] presented pioneer work. It took segment-based object proposals [31] as input, similar to object detection. Two networks, one for bounding boxes and one for masks, were adopted to extract features.
8 Then features from these networks were concatenated and classified by SVM [4]. Hariharan et al. [15] used hyper-column representation to refine segment masks. But updating all proposals is computationally too costly, especially when complex networks, such as VGG [33], are deployed. So the method made use of detection results [12], and a final rescore procedure was adopted. Chen et al. [3] developed an energy minimization framework incorporating top-down and bottom-up information to handle occlusion [14]. Dai et al. [5] resolved the efficiency problem by pooling segment and bounding box features from convolutional feature maps shared by all proposals. All methods rely on proposal generation [31] and conduct separate classification [...]. [...] et al. [22] proposed a proposal-free network to tackle SDS.
Figure 2. Objects consist of many different patches. This example shows many semantically meaningful human body and car regions. Part of or the entire objects can be in a patch.
9 In [22], the category-level segmentation mask is first generated by the method of [2]. Then another network is used to assign pixels to objects by predicting the location of objects for every pixel. Finally, post-processing is performed to generate the instance-level mask. It is notable that we use a completely different system. Instead of having these separate steps, we aggregate mid-level patch segment prediction results for SDS. Our unified framework is thus more [...].
Our Method
We solve the SDS problem via aggregation of local segment prediction results. We generate multi-scale dense patches, and classify and segment them in a network. We infer objects based on these patches. In the following, we first motivate our SDS network and give an [...].
Motivation
An object consists of patches corresponding to [...]. This concept was extensively explored in mid-level representation work [1, 7, 36] and found useful to extract and organize structural information.
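To make the idea of inferring objects from patches concrete, here is a minimal sketch of one possible aggregation rule: patches with the same predicted label are merged when their masks agree inside the region where the patch boxes overlap. The Patch type, the overlap_agreement score, the 0.5 threshold, and the union-find grouping are all assumptions for illustration. This is a simplified stand-in, not the paper's patch aggregation algorithm, which is not described in this part of the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    label: str        # predicted class of the patch (hypothetical field)
    box: tuple        # (x0, y0, x1, y1), half-open, image coordinates
    mask: frozenset   # foreground pixels as absolute (x, y) pairs


def overlap_agreement(a, b):
    """IoU of the two masks, restricted to the intersection of the boxes."""
    x0, y0 = max(a.box[0], b.box[0]), max(a.box[1], b.box[1])
    x1, y1 = min(a.box[2], b.box[2]), min(a.box[3], b.box[3])
    if x0 >= x1 or y0 >= y1:
        return 0.0  # boxes do not overlap

    def inside(p):
        return x0 <= p[0] < x1 and y0 <= p[1] < y1

    ma = {p for p in a.mask if inside(p)}
    mb = {p for p in b.mask if inside(p)}
    union = ma | mb
    return len(ma & mb) / len(union) if union else 0.0


def aggregate(patches, threshold=0.5):
    """Group same-label patches whose masks agree; return (label, mask) pairs."""
    parent = list(range(len(patches)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(patches)):
        for j in range(i + 1, len(patches)):
            same = patches[i].label == patches[j].label
            if same and overlap_agreement(patches[i], patches[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, p in enumerate(patches):
        groups.setdefault(find(i), []).append(p)
    # Each group becomes one instance whose mask is the union of its patch masks.
    return [(g[0].label, frozenset().union(*(p.mask for p in g)))
            for g in groups.values()]
```

The pairwise loop is quadratic in the number of patches, which is acceptable for a sketch; a practical version would only compare patches that are neighbours on the patch grid.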
10 Intuitively, by dividing objects into semantic patches, as shown in Fig. 2, it is easier to model and highlight object variation in local [...]. Different from the traditional way [9, 12, 11], which classifies sliding windows or proposals as objects, our method regards semantic patches as parts of an object. Previous proposal-classification frameworks, by contrast, are based on the assumption that most objects are already there in the proposals and what remains to do is to pick them out. They do not search for missing objects and thus greatly depend on the quality of object proposals. Our strategy to utilize patches to represent objects is more [...].
Network Structure
Our network is illustrated in Fig. 3. It jointly learns the classification label and segmentation mask on each candidate patch. The key components are the shared convolution layers, the multi-scale patch generator, the multi-class classification branch, and the segmentation branch.
Shared Convolution Layers. In our method, convolution layers are shared among the subsequent classification and segmentation branches.
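The data flow implied by this structure (features computed once per image, then reused by both branches for every patch) can be sketched with a toy example. Everything below is an assumption for illustration: a 3x3 mean filter stands in for the shared convolution layers, and simple thresholds stand in for the learned classification and segmentation branches. None of it reflects the actual layers of the network in Fig. 3.

```python
def shared_features(image):
    """Toy stand-in for the shared convolution layers: a 3x3 mean filter."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [image[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out


def crop(features, box):
    """Cut the patch region (x0, y0, x1, y1) out of the shared feature map."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in features[y0:y1]]


def classification_branch(patch_feat, threshold=0.5):
    """Toy classifier: one label per patch from the mean feature value."""
    cells = [v for row in patch_feat for v in row]
    return "object" if sum(cells) / len(cells) >= threshold else "background"


def segmentation_branch(patch_feat, threshold=0.5):
    """Toy segmenter: one binary value per cell of the patch."""
    return [[1 if v >= threshold else 0 for v in row] for row in patch_feat]


def run_network(image, boxes):
    features = shared_features(image)  # computed once, shared by all patches
    results = []
    for box in boxes:
        patch_feat = crop(features, box)
        results.append((box,
                        classification_branch(patch_feat),
                        segmentation_branch(patch_feat)))
    return results
```

The point of the sketch is only the sharing: the cost of shared_features is paid once per image, while the per-patch work is limited to a crop and two light branches.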