IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, MARCH 2020

Deep High-Resolution Representation Learning for Visual Recognition

Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao

Abstract—High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named High-Resolution Network (HRNet), maintains high-resolution representations through the whole process.
There are two key characteristics: (i) connect the high-to-low resolution convolution streams in parallel; (ii) repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the code is available online.

Index Terms—HRNet, high-resolution representations, low-resolution representations, human pose estimation, semantic segmentation, object detection.

1 INTRODUCTION

Deep convolutional neural networks (DCNNs) have achieved state-of-the-art results in many computer vision tasks, such as image classification, object detection, semantic segmentation, human pose estimation, and so on. The strength is that DCNNs are able to learn richer representations than conventional hand-crafted representations. Most recently-developed classification networks, including AlexNet [77], VGGNet [126], GoogLeNet [133], ResNet [54], etc.
, follow the design rule of LeNet-5 [81]. The rule is depicted in Figure 1 (a): gradually reduce the spatial size of the feature maps, connect the convolutions from high resolution to low resolution in series, and lead to a low-resolution representation, which is further processed for classification.

High-resolution representations are needed for position-sensitive tasks, e.g., semantic segmentation, human pose estimation, and object detection. The previous state-of-the-art methods adopt a high-resolution recovery process to raise the representation resolution from the low-resolution representation outputted by a classification or classification-like network, as depicted in Figure 1 (b), e.g., Hourglass [105], SegNet [3], DeconvNet [107], U-Net [119], SimpleBaseline [152], and encoder-decoder [112]. In addition, dilated convolutions are used to remove some downsample layers and thus yield medium-resolution representations [19], [181].
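As a shape-level illustration of the high-to-low series design of Figure 1 (a), the following NumPy sketch halves the spatial resolution at every stage; average pooling is a simplifying stand-in for the strided convolutions of a real classification backbone, so only the resolutions, not the learned features, are faithful:

```python
import numpy as np

def downsample2x(x):
    """Halve spatial resolution with 2x2 average pooling
    (a stand-in for one stride-2 convolution stage)."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

# A high-to-low series network: each stage halves the resolution,
# so a 64x64 input ends as a coarse 4x4 low-resolution map.
x = np.random.rand(64, 64)
for _ in range(4):
    x = downsample2x(x)
print(x.shape)  # (4, 4)
```

The final coarse map is exactly the low-resolution representation that recovery-based methods must upsample again.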
We present a novel architecture, namely the High-Resolution Net (HRNet), which is able to maintain high-resolution representations through the whole process. We start from a high-resolution convolution stream, gradually add high-to-low resolution convolution streams one by one, and connect the multi-resolution streams in parallel. The resulting network consists of several (four in this paper) stages as depicted in Figure 2, and the nth stage contains n streams corresponding to n resolutions. We conduct repeated multi-resolution fusions by exchanging the information across the parallel streams over and over.

The high-resolution representations learned from HRNet are not only semantically strong but also spatially precise. This comes from two aspects. (i) Our approach connects high-to-low resolution convolution streams in parallel rather than in series.

J. Wang is with Microsoft Research, Beijing.
Thus, our approach is able to maintain the high resolution instead of recovering high resolution from low resolution, and accordingly the learned representation is potentially spatially more precise. (ii) Most existing fusion schemes aggregate high-resolution low-level and high-level representations obtained by upsampling low-resolution representations. Instead, we repeat multi-resolution fusions to boost the high-resolution representations with the help of the low-resolution representations, and vice versa. As a result, all the high-to-low resolution representations are semantically strong.

We present two versions of HRNet. The first one, named HRNetV1, only outputs the high-resolution representation computed from the high-resolution convolution stream. We apply it to human pose estimation by following the heatmap estimation framework.
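The repeated multi-resolution fusion can be sketched at the shape level as follows. Note the simplifying assumptions: the actual HRNet downsamples with strided 3x3 convolutions and upsamples with bilinear interpolation followed by 1x1 convolutions, whereas this sketch uses nearest-neighbor resampling and plain addition:

```python
import numpy as np

def up2x(x):
    # nearest-neighbor upsampling: low -> high resolution
    return x.repeat(2, axis=0).repeat(2, axis=1)

def down2x(x):
    # 2x2 average pooling: high -> low resolution
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def fuse(high, low):
    """One multi-resolution exchange between two parallel streams:
    each stream is boosted by the resampled other stream."""
    return high + up2x(low), low + down2x(high)

high = np.random.rand(8, 8)   # high-resolution stream
low = np.random.rand(4, 4)    # low-resolution stream
high, low = fuse(high, low)   # repeated over and over in HRNet
print(high.shape, low.shape)  # (8, 8) (4, 4)
```

Crucially, both streams keep their resolutions across the fusion; the high-resolution stream is never discarded and never has to be recovered.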
We empirically demonstrate the superior pose estimation performance on the COCO keypoint detection dataset [94].

The other one, named HRNetV2, combines the representations from all the high-to-low resolution parallel streams. We apply it to semantic segmentation through estimating segmentation maps from the combined high-resolution representation. The proposed approach achieves state-of-the-art results on PASCAL-Context, Cityscapes, and LIP with similar model sizes and lower computational complexity. We observe similar performance for HRNetV1 and HRNetV2 on COCO pose estimation, and the superiority of HRNetV2 over HRNetV1 in semantic segmentation.

Fig. 1. The structure of recovering high resolution from low resolution. (a) A low-resolution representation learning subnetwork (such as VGGNet [126], ResNet [54]), which is formed by connecting high-to-low convolutions in series. (b) A high-resolution representation recovering subnetwork, which is formed by connecting low-to-high convolutions in series. Representative examples include SegNet [3], DeconvNet [107], U-Net [119], Hourglass [105], encoder-decoder [112], and SimpleBaseline [152].

In addition, we construct a multi-level representation, named HRNetV2p, from the high-resolution representation output of HRNetV2, and apply it to state-of-the-art detection frameworks, including Faster R-CNN, Cascade R-CNN [12], FCOS [136], and CenterNet [36], and to state-of-the-art joint detection and instance segmentation frameworks, including Mask R-CNN [53], Cascade Mask R-CNN, and Hybrid Task Cascade [16]. The results show that our method improves detection performance, in particular dramatically for small objects.

2 RELATED WORK

We review closely related representation learning techniques developed mainly for human pose estimation [57], semantic segmentation, and object detection, from three aspects: low-resolution representation learning, high-resolution representation recovering, and high-resolution representation maintaining.
Besides, we mention some works related to multi-scale fusion.

Learning low-resolution representations. The fully convolutional network approaches [99], [124] compute low-resolution representations by removing the fully-connected layers in a classification network, and estimate their coarse segmentation maps. The estimated segmentation maps are improved by combining the fine segmentation score maps estimated from intermediate low-level medium-resolution representations [99], or by iterating the processes [76]. Similar techniques have also been applied to edge detection, e.g., holistic edge detection [157].

The fully convolutional network is extended, by replacing a few (typically two) strided convolutions and the associated convolutions with dilated convolutions, to the dilation version, leading to medium-resolution representations [18], [19], [86], [168], [181].
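The effect of replacing a strided convolution with a dilated one can be seen in a minimal 1-D NumPy sketch (the `conv1d` helper below is illustrative, not taken from any of the cited works): striding halves the output resolution, while dilation keeps the full resolution and enlarges the receptive field instead:

```python
import numpy as np

def conv1d(x, w, stride=1, dilation=1):
    """'Same'-padded 1-D convolution (cross-correlation)
    with optional stride and dilation."""
    k = len(w)
    span = dilation * (k - 1)                       # kernel footprint - 1
    xp = np.pad(x, (span // 2, span - span // 2))   # 'same' padding
    taps = [xp[i * dilation: i * dilation + len(x)] for i in range(k)]
    y = sum(wi * t for wi, t in zip(w, taps))
    return y[::stride]

x = np.random.rand(16)
w = np.array([0.25, 0.5, 0.25])
print(conv1d(x, w, stride=2).shape)    # (8,)  strided: resolution halved
print(conv1d(x, w, dilation=2).shape)  # (16,) dilated: resolution kept
```

This is why swapping the last strided layers for dilated ones yields the medium-resolution representations mentioned above: the remaining layers see the same receptive-field growth without further downsampling.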
The representations are further augmented to multi-scale contextual representations [19], [21], [181] through feature pyramids for segmenting objects at multiple scales.

Recovering high-resolution representations. An upsample process can be used to gradually recover the high-resolution representations from the low-resolution representations. The upsample subnetwork could be a symmetric version of the downsample process (e.g., VGGNet), with skip connections over some mirrored layers to transform the pooling indices, e.g., SegNet [3] and DeconvNet [107], or to copy the feature maps, e.g., U-Net [119] and Hourglass [8], [9], [27], [31], [68], [105], [134], [163], [165], encoder-decoder [112], and so on. An extension of U-Net, the full-resolution residual network [114], introduces an extra full-resolution stream that carries information at the full image resolution to replace the skip connections; each unit in the downsample and upsample subnetworks receives information from and sends information to the full-resolution stream.

The asymmetric upsample process is also widely studied.
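A U-Net-style skip connection of the kind described above, copying an encoder feature map into the upsample path, can be sketched as follows (nearest-neighbor upsampling is a simplifying stand-in for the learned transposed convolutions used in practice):

```python
import numpy as np

def up2x(x):
    # nearest-neighbor upsampling in the decoder
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

# encoder feature map saved at high resolution (channels, h, w)
encoder_feat = np.random.rand(8, 16, 16)
# decoder feature map arriving at half that resolution
decoder_feat = np.random.rand(8, 8, 8)

# skip connection: upsample the decoder features and concatenate
# the copied encoder features along the channel axis
merged = np.concatenate([up2x(decoder_feat), encoder_feat], axis=0)
print(merged.shape)  # (16, 16, 16)
```

The copied encoder features re-inject spatial detail that was lost during downsampling, which is exactly what the full-resolution stream of [114] replaces with a dedicated stream.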
RefineNet [90] improves the combination of upsampled representations and the representations of the same resolution copied from the downsample process. Other works include: light upsample processes [7], [24], [92], [152], possibly with dilated convolutions used in the backbone [63], [89], [113]; light downsample and heavy upsample processes [141] and recombinator networks [55]; improving skip connections with additional or more complicated convolutional units [64], [111], [180], as well as sending information from low-resolution skip connections to high-resolution skip connections [189] or exchanging information between them [49]; studying the details of the upsample process [147]; combining multi-scale pyramid representations [22], [154]; and stacking multiple DeconvNets/U-Nets/Hourglasses [44], [149] with dense connections [135].