Multi-scale Patch Aggregation (MPA) for Simultaneous Detection and Segmentation

Shu Liu, Xiaojuan Qi, Jianping Shi, Hong Zhang, Jiaya Jia
The Chinese University of Hong Kong; SenseTime Group Limited

Abstract

Aiming at simultaneous detection and segmentation (SDS), we propose a proposal-free framework, which detects and segments object instances via mid-level patches. We design a unified trainable network on patches, which is followed by a fast and effective patch aggregation algorithm to infer object instances. Our method benefits from end-to-end training. Without object proposal generation, computation time can also be reduced. In experiments, our method yields results in terms of mAP^r on VOC2012 segmentation val and VOC2012 SDS val that were state-of-the-art at the time of submission. We also report results on the Microsoft COCO test-std/test-dev dataset in this paper.

Introduction

Object detection and semantic segmentation have been core tasks of image understanding for a long time.


Object detection focuses on generating bounding boxes for objects. These boxes may not be accurate enough to localize objects. Meanwhile, semantic segmentation predicts a more detailed pixel-level mask for different classes. It however ignores the existence of single-object instances. Simultaneous detection and segmentation (SDS) [14] becomes a promising direction to generate pixel-level labels for every object instance, naturally leading to the next-generation object recognition [24] goal. Accurate and efficient SDS can be used in many disciplines as a fundamental tool, where both pixel-wise labels and object instance information can help build robotics, achieve automatic driving, enhance surveillance systems, and construct intelligent homes, to name a few.

SDS is more challenging than object detection and semantic segmentation separately.

In this task, instance-level information and pixel-wise accurate masks for objects are to be estimated. Nearly all previous work [14, 5, 15, 3] took the bottom-up segment-based object proposals [35, 31] as input and modeled the system as classifying proposals with the help of powerful deep convolutional neural networks (DCNNs). Classified proposals are either output or refined in post-processing to produce final results.

(This work is supported by a grant from the Research Grants Council of the Hong Kong SAR, project No. 413113.)

Object Proposals in SDS. It has been noticed that systems with object-proposal input may be accompanied by a few shortcomings. First, generating segment-based proposals takes time. The high-quality proposal generator [31] that was employed in previous SDS work [14, 5, 15, 3] takes about 40 seconds to process one image. It was discussed in [5] that using previous faster segment-based proposals decreases performance.

The newest proposal generators [30] are not evaluated yet for SDS. Second, the overall SDS performance is bounded by the quality of proposals, since these systems only select among the provided proposals. Object proposals inevitably contain noise regarding missing objects and errors inside each proposal. Last but not least, if an SDS system is independent of object proposal generation, end-to-end parameter tuning is not possible. The system thus loses the chance to learn feature and structure information from images directly, which however could be important to further improve system performance.

End-to-End SDS Solution. To address these issues, we propose a systematically feasible scheme to integrate object proposal generation into the networks, enabling end-to-end training from images to pixel-level labels for instance-aware semantic segmentation. Although beautiful in concept, practically establishing suitable models is difficult due to the various scales, aspect ratios, and deformation of objects.

In our work, instead of segmenting objects directly, we propose segmenting and classifying part of or entire objects using many densely located patches. The mask of an object is then generated by aggregating masks of the overlapping patches in a post-processing step, as shown in Fig. 1. This scheme shares the spirit of mid-level representation [1, 7, 36] and part-based models [37, 16]. It is yet different by nature in terms of system construction. In our scheme, overlapped patches gather different levels of information for final object segmentation, which makes the result more robust than a single prediction. The end-to-end trainable SDS system thus outputs semantic segment labels.

Figure 1. Objects overlapped with many densely localized patches. After segmenting objects in different patches, aggregation can be used to infer complete objects. (Panels: image, densely localized patches, aggregation result.)

Contributions. Our framework to tackle the SDS problem makes the following main contributions.

- We propose the strategy to generate dense multi-scale patches for object parsing.
- Our unified end-to-end trainable proposal-free network can achieve segmentation and classification simultaneously for each patch. By sharing convolution in the network, computation time is reduced and good quality results are produced.
- We develop an efficient algorithm to infer the segmentation mask for each object by merging information from mid-level patches.

We evaluated our method on the PASCAL VOC 2012 segmentation validation and VOC 2012 SDS validation benchmark datasets. Our method yields state-of-the-art performance with reasonably short running time. We also evaluated it on Microsoft COCO test-std and test-dev data. Decent performance is achieved based on the VGG-16 network structure without network [...].

Related Work

The SDS task is closely related to object detection, semantic segmentation, and proposal generation.

We briefly review them in this section.

Object Detection. Object detection has a long history in computer vision. Before DCNNs showed their great ability for image classification [21, 33], part-based models [9, 37] were popular. Recent object detection frameworks [11, 12, 17, 37, 29, 34, 23, 32, 10] are based on DCNNs [21, 33] to classify object proposals. These methods either take object proposals as independent input [12, 37, 29, 34], or use the entire image and pool features for each proposal [17, 11, 32, 10]. Different from these methods, Ren et al. [32] unified proposal generation and classification with shared convolution feature maps. This saves time in generating object proposals and yields good results.

Semantic Segmentation. DCNNs [21, 33] also boost performance of semantic segmentation [26, 5, 15, 27, 2, 20, 28, 25]. Related methods can be categorized into two streams: one utilizes DCNNs to classify segment proposals [15, 5], and the other uses fully convolutional networks [26, 2, 20, 25] for dense prediction.

CRF can be applied in post-processing [2] or incorporated in the network [20, 25] to refine segmentation results.

SDS is a relatively new topic. Hariharan et al. [14] presented pioneering work. It took segment-based object proposals [31] as input, similar to object detection. Two networks, one for bounding boxes and one for masks, were adopted to extract features. Features from these networks were then concatenated and classified by SVM [4]. Hariharan et al. [15] used hyper-column representation to refine segment masks. But updating all proposals is computationally too costly, especially when complex networks, such as VGG [33], are deployed. So the method made use of detection results [12], and a final rescoring procedure was adopted. Chen et al. [3] developed an energy minimization framework incorporating top-down and bottom-up information to handle occlusion [14]. Dai et al. [5] resolved the efficiency problem by pooling segment and bounding box features from convolutional feature maps shared by all proposals. All these methods rely on proposal generation [31] and conduct separate classification.

Figure 2. Objects consist of many different patches. This example shows many semantically meaningful human body and car regions. Part of or the entire objects can be in a patch.

The authors of [22] proposed a proposal-free network to tackle SDS. In [22], the category-level segmentation mask is first generated by the method of [2]. Then another network is used to assign pixels to objects by predicting the location of objects for every pixel. Finally, post-processing is performed to generate the instance-level mask. It is notable that we use a completely different system.

Instead of having these separate steps, we aggregate mid-level patch segment prediction results for SDS. Our unified framework is thus more [...].

Our Method

We solve the SDS problem via aggregation of local segment prediction results. We generate multi-scale dense patches, and classify and segment them in a network. We infer objects based on these patches. In the following, we first motivate our SDS network and then give an overview.

Motivation

An object consists of patches corresponding to its parts. This concept was extensively explored in mid-level representation work [1, 7, 36] and found useful to extract and organize structural information. Intuitively, by dividing objects into semantic patches, as shown in Fig. 2, it is easier to model and highlight local object variation. Different from the traditional way [9, 12, 11], which classifies sliding windows or proposals as objects, our method regards semantic patches as part of an object.
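The two steps named above, generating dense multi-scale patches and inferring objects from overlapping patch masks, can be illustrated with a minimal sketch. It is not the authors' implementation: the window sizes, stride ratio, IoU threshold, and the helper names dense_patches, overlap_iou, and aggregate are all assumptions made for illustration. The per-patch masks are taken as given (sets of pixel coordinates) rather than predicted by a network, and the merge rule (mask agreement inside the region shared by two patches, grouped with union-find) is a simple stand-in for the aggregation algorithm that the text only outlines.

```python
from typing import Dict, List, Set, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1); x1 and y1 are exclusive
Mask = Set[Tuple[int, int]]       # foreground pixels in image coordinates


def dense_patches(width: int, height: int,
                  sizes: Tuple[int, ...] = (48, 96),
                  stride_ratio: float = 0.5) -> List[Box]:
    """Slide square windows of several sizes over the image (assumed scheme)."""
    boxes: List[Box] = []
    for size in sizes:
        if size > width or size > height:
            continue  # window does not fit in the image at this scale
        stride = max(1, int(size * stride_ratio))
        for y0 in range(0, height - size + 1, stride):
            for x0 in range(0, width - size + 1, stride):
                boxes.append((x0, y0, x0 + size, y0 + size))
    return boxes


def overlap_iou(mask_a: Mask, box_a: Box, mask_b: Mask, box_b: Box) -> float:
    """IoU of two patch masks, measured only inside the region both boxes cover."""
    x0, y0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x1, y1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    if x0 >= x1 or y0 >= y1:
        return 0.0  # the two patches do not overlap
    a = {p for p in mask_a if x0 <= p[0] < x1 and y0 <= p[1] < y1}
    b = {p for p in mask_b if x0 <= p[0] < x1 and y0 <= p[1] < y1}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def aggregate(patches: List[Tuple[Box, Mask]], threshold: float = 0.5) -> List[Mask]:
    """Group patches whose masks agree in their overlap; union each group's masks."""
    parent = list(range(len(patches)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(patches)):
        for j in range(i + 1, len(patches)):
            box_i, mask_i = patches[i]
            box_j, mask_j = patches[j]
            if overlap_iou(mask_i, box_i, mask_j, box_j) >= threshold:
                parent[find(i)] = find(j)

    groups: Dict[int, Mask] = {}
    for i, (_, mask) in enumerate(patches):
        groups.setdefault(find(i), set()).update(mask)
    return list(groups.values())
```

In this sketch, two patches that each see part of the same object produce identical masks inside their shared region, so they fall into one group and their masks are unioned into a single instance mask, while a patch elsewhere in the image stays a separate instance. The pairwise comparison is quadratic in the number of patches, which is acceptable for illustration but says nothing about the running time of the method described here.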

