IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, MARCH 2020

Deep High-Resolution Representation Learning for Visual Recognition

Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao

Abstract—High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named High-Resolution Network (HRNet), maintains high-resolution representations through the whole process.
There are two key characteristics: (i) connect the high-to-low resolution convolution streams in parallel; (ii) repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the code is available online.

Index Terms—HRNet, high-resolution representations, low-resolution representations, human pose estimation, semantic segmentation, object detection.

1 INTRODUCTION

Deep convolutional neural networks (DCNNs) have achieved state-of-the-art results in many computer vision tasks, such as image classification, object detection, semantic segmentation, human pose estimation, and so on. The strength is that DCNNs are able to learn richer representations than conventional hand-crafted representations. Most recently-developed classification networks, including AlexNet [77], VGGNet [126], GoogLeNet [133], ResNet [54], etc.
, follow the design rule of LeNet-5 [81]. The rule is depicted in Figure 1 (a): gradually reduce the spatial size of the feature maps, connect the convolutions from high resolution to low resolution in series, and lead to a low-resolution representation, which is further processed for classification.

High-resolution representations are needed for position-sensitive tasks, e.g., semantic segmentation, human pose estimation, and object detection. The previous state-of-the-art methods adopt a high-resolution recovery process to raise the representation resolution from the low-resolution representation outputted by a classification or classification-like network, as depicted in Figure 1 (b), e.g., Hourglass [105], SegNet [3], DeconvNet [107], U-Net [119], SimpleBaseline [152], and encoder-decoder [112]. In addition, dilated convolutions are used to remove some downsample layers and thus yield medium-resolution representations [19], [181].
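As a shape-level illustration of the high-to-low series design of Figure 1 (a), the following NumPy sketch halves the spatial resolution at every stage; average pooling is a simplifying stand-in for the strided convolutions of a real classification backbone, so only the resolutions, not the learned features, are faithful:

```python
import numpy as np

def downsample2x(x):
    """Halve spatial resolution with 2x2 average pooling
    (a stand-in for one stride-2 convolution stage)."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

# A high-to-low series network: each stage halves the resolution,
# so a 64x64 input ends as a coarse 4x4 low-resolution map.
x = np.random.rand(64, 64)
for _ in range(4):
    x = downsample2x(x)
print(x.shape)  # (4, 4)
```

The final coarse map is exactly the low-resolution representation that recovery-based methods must upsample again.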
We present a novel architecture, namely the High-Resolution Net (HRNet), which is able to maintain high-resolution representations through the whole process. We start from a high-resolution convolution stream, gradually add high-to-low resolution convolution streams one by one, and connect the multi-resolution streams in parallel. The resulting network consists of several (four in this paper) stages as depicted in Figure 2, and the nth stage contains n streams corresponding to n resolutions. We conduct repeated multi-resolution fusions by exchanging the information across the parallel streams over and over.

The high-resolution representations learned from HRNet are not only semantically strong but also spatially precise. This comes from two aspects. (i) Our approach connects high-to-low resolution convolution streams in parallel rather than in series.

J. Wang is with Microsoft Research, Beijing.
Thus, our approach is able to maintain the high resolution instead of recovering high resolution from low resolution, and accordingly the learned representation is potentially spatially more precise. (ii) Most existing fusion schemes aggregate high-resolution low-level and high-level representations obtained by upsampling low-resolution representations. Instead, we repeat multi-resolution fusions to boost the high-resolution representations with the help of the low-resolution representations, and vice versa. As a result, all the high-to-low resolution representations are semantically strong.

We present two versions of HRNet. The first one, named HRNetV1, only outputs the high-resolution representation computed from the high-resolution convolution stream. We apply it to human pose estimation by following the heatmap estimation framework.
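The repeated multi-resolution fusion can be sketched at the shape level as follows. Note the simplifying assumptions: the actual HRNet downsamples with strided 3x3 convolutions and upsamples with bilinear interpolation followed by 1x1 convolutions, whereas this sketch uses nearest-neighbor resampling and plain addition:

```python
import numpy as np

def up2x(x):
    # nearest-neighbor upsampling: low -> high resolution
    return x.repeat(2, axis=0).repeat(2, axis=1)

def down2x(x):
    # 2x2 average pooling: high -> low resolution
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def fuse(high, low):
    """One multi-resolution exchange between two parallel streams:
    each stream is boosted by the resampled other stream."""
    return high + up2x(low), low + down2x(high)

high = np.random.rand(8, 8)   # high-resolution stream
low = np.random.rand(4, 4)    # low-resolution stream
high, low = fuse(high, low)   # repeated over and over in HRNet
print(high.shape, low.shape)  # (8, 8) (4, 4)
```

Crucially, both streams keep their resolutions across the fusion; the high-resolution stream is never discarded and never has to be recovered.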
We empirically demonstrate the superior pose estimation performance on the COCO keypoint detection dataset [94].

The other one, named HRNetV2, combines the representations from all the high-to-low resolution parallel streams. We apply it to semantic segmentation through estimating segmentation maps from the combined high-resolution representation. The proposed approach achieves state-of-the-art results on PASCAL-Context, Cityscapes, and LIP with similar model sizes and lower computational complexity. We observe similar performance for HRNetV1 and HRNetV2 on COCO pose estimation, and the superiority of HRNetV2 over HRNetV1 in semantic segmentation.

Fig. 1. The structure of recovering high resolution from low resolution. (a) A low-resolution representation learning subnetwork (such as VGGNet [126], ResNet [54]), which is formed by connecting high-to-low convolutions in series. (b) A high-resolution representation recovering subnetwork, which is formed by connecting low-to-high convolutions in series. Representative examples include SegNet [3], DeconvNet [107], U-Net [119], Hourglass [105], encoder-decoder [112], and SimpleBaseline [152].

In addition, we construct a multi-level representation, named HRNetV2p, from the high-resolution representation output of HRNetV2, and apply it to state-of-the-art detection frameworks, including Faster R-CNN, Cascade R-CNN [12], FCOS [136], and CenterNet [36], and to state-of-the-art joint detection and instance segmentation frameworks, including Mask R-CNN [53], Cascade Mask R-CNN, and Hybrid Task Cascade [16]. The results show that our method improves detection performance, in particular dramatically for small objects.

2 RELATED WORK

We review closely related representation learning techniques developed mainly for human pose estimation [57], semantic segmentation, and object detection, from three aspects: low-resolution representation learning, high-resolution representation recovering, and high-resolution representation maintaining.
Besides, we mention some works related to multi-scale fusion.

Learning low-resolution representations. The fully convolutional network approaches [99], [124] compute low-resolution representations by removing the fully-connected layers in a classification network, and estimate their coarse segmentation maps. The estimated segmentation maps are improved by combining the fine segmentation score maps estimated from intermediate low-level medium-resolution representations [99], or by iterating the processes [76]. Similar techniques have also been applied to edge detection, e.g., holistic edge detection [157].

The fully convolutional network is extended, by replacing a few (typically two) strided convolutions and the associated convolutions with dilated convolutions, to the dilation version, leading to medium-resolution representations [18], [19], [86], [168], [181].
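The effect of replacing a strided convolution with a dilated one can be seen in a minimal 1-D NumPy sketch (the `conv1d` helper below is illustrative, not taken from any of the cited works): striding halves the output resolution, while dilation keeps the full resolution and enlarges the receptive field instead:

```python
import numpy as np

def conv1d(x, w, stride=1, dilation=1):
    """'Same'-padded 1-D convolution (cross-correlation)
    with optional stride and dilation."""
    k = len(w)
    span = dilation * (k - 1)                       # kernel footprint - 1
    xp = np.pad(x, (span // 2, span - span // 2))   # 'same' padding
    taps = [xp[i * dilation: i * dilation + len(x)] for i in range(k)]
    y = sum(wi * t for wi, t in zip(w, taps))
    return y[::stride]

x = np.random.rand(16)
w = np.array([0.25, 0.5, 0.25])
print(conv1d(x, w, stride=2).shape)    # (8,)  strided: resolution halved
print(conv1d(x, w, dilation=2).shape)  # (16,) dilated: resolution kept
```

This is why swapping the last strided layers for dilated ones yields the medium-resolution representations mentioned above: the remaining layers see the same receptive-field growth without further downsampling.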
The representations are further augmented to multi-scale contextual representations [19], [21], [181] through feature pyramids for segmenting objects at multiple scales.

Recovering high-resolution representations. An upsample process can be used to gradually recover the high-resolution representations from the low-resolution representations. The upsample subnetwork could be a symmetric version of the downsample process (e.g., VGGNet), with skip connections over some mirrored layers to transform the pooling indices, e.g., SegNet [3] and DeconvNet [107], or to copy the feature maps, e.g., U-Net [119] and Hourglass [8], [9], [27], [31], [68], [105], [134], [163], [165], encoder-decoder [112], and so on. An extension of U-Net, the full-resolution residual network [114], introduces an extra full-resolution stream that carries information at the full image resolution to replace the skip connections; each unit in the downsample and upsample subnetworks receives information from and sends information to the full-resolution stream.

The asymmetric upsample process is also widely studied.
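A U-Net-style skip connection of the kind described above, copying an encoder feature map into the upsample path, can be sketched as follows (nearest-neighbor upsampling is a simplifying stand-in for the learned transposed convolutions used in practice):

```python
import numpy as np

def up2x(x):
    # nearest-neighbor upsampling in the decoder
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

# encoder feature map saved at high resolution (channels, h, w)
encoder_feat = np.random.rand(8, 16, 16)
# decoder feature map arriving at half that resolution
decoder_feat = np.random.rand(8, 8, 8)

# skip connection: upsample the decoder features and concatenate
# the copied encoder features along the channel axis
merged = np.concatenate([up2x(decoder_feat), encoder_feat], axis=0)
print(merged.shape)  # (16, 16, 16)
```

The copied encoder features re-inject spatial detail that was lost during downsampling, which is exactly what the full-resolution stream of [114] replaces with a dedicated stream.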
RefineNet [90] improves the combination of upsampled representations and the representations of the same resolution copied from the downsample process. Other works include: light upsample processes [7], [24], [92], [152], possibly with dilated convolutions used in the backbone [63], [89], [113]; light downsample and heavy upsample processes [141] and recombinator networks [55]; improving skip connections with additional or more complicated convolutional units [64], [111], [180], as well as sending information from low-resolution skip connections to high-resolution skip connections [189] or exchanging information between them [49]; studying the details of the upsample process [147]; combining multi-scale pyramid representations [22], [154]; and stacking multiple DeconvNets/U-Nets/Hourglasses [44], [149] with dense connections [135].