Abstract arXiv:2012.09760v3 [cs.CV] 15 Jun 2021

End-to-End Human Pose and Mesh Reconstruction with Transformers

Kevin Lin, Lijuan Wang, Zicheng Liu
Microsoft
{keli, lijuanw, ...}

We present a new method, called MEsh TRansfOrmer (METRO), to reconstruct 3D human pose and mesh vertices from a single image. Our method uses a transformer encoder to jointly model vertex-vertex and vertex-joint interactions, and outputs 3D joint coordinates and mesh vertices simultaneously. Compared to existing techniques that regress pose and shape parameters, METRO does not rely on any parametric mesh models like SMPL, and can therefore be easily extended to other objects such as hands.

We further relax the mesh topology and allow the transformer self-attention mechanism to freely attend between any two vertices, making it possible to learn non-local relationships among mesh vertices and joints. With the proposed masked vertex modeling, our method is more robust and effective in handling challenging situations like partial occlusions. METRO generates new state-of-the-art results for human mesh reconstruction on public datasets, including 3DPW. Moreover, we demonstrate the generalizability of METRO to 3D hand reconstruction in the wild, outperforming existing state-of-the-art methods on FreiHAND. Code and pre-trained models are available.

Introduction

3D human pose and mesh reconstruction from a single image has attracted a lot of attention because it has many applications including virtual reality, sports motion analysis, neurodegenerative condition diagnosis, etc.

It is a challenging problem due to complex articulated motion and occlusions.

Figure 1: METRO learns non-local interactions among body joints and mesh vertices for human mesh reconstruction. Given an input image in (a), METRO predicts human mesh by taking non-local interactions into consideration. (b) illustrates the attentions between the occluded wrist joint and the mesh vertices, where brighter color indicates stronger attention. (c) is the reconstructed mesh.

Recent work in this area can be roughly divided into two categories. Methods in the first category use a parametric model like SMPL [29] and learn to predict shape and pose coefficients [14, 26, 39, 22, 24, 34, 44, 23]. Great success has been achieved with this approach. The strong prior encoded in the parametric model increases its robustness to environment variations. A drawback of this approach is that the pose and shape spaces are constrained by the limited exemplars that are used to construct the parametric model. To overcome this limitation, methods in the second category do not use any parametric models [25, 8, 32]. These methods either use a graph convolutional neural network to model neighborhood vertex-vertex interactions [25, 8], or use a 1D heatmap to regress vertex coordinates [32].

One limitation with these approaches is that they are not efficient in modeling non-local vertex-vertex interactions.

Researchers have shown that there are strong correlations between non-local vertices which may belong to different parts of the body (e.g., hand and foot) [55]. In computer graphics and robotics, inverse kinematics techniques [2] have been developed to estimate the internal joint positions of an articulated figure given the position of an end effector such as a hand tip. We believe that learning the correlations among body joints and mesh vertices, including both short-range and long-range ones, is valuable for handling challenging poses and occlusions in body shape reconstruction.
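The inefficiency of neighborhood-based aggregation for non-local interactions can be illustrated with a toy graph. The sketch below assumes a simplified model in which one graph convolution layer passes information only between directly connected vertices, so the number of layers needed for one vertex to influence another equals their graph distance. The 8-vertex chain and the helper `hops_between` are hypothetical and are not taken from the cited methods, which may shorten paths through mesh coarsening.

```python
from collections import deque


def hops_between(adjacency, source, target):
    """Breadth-first search: number of edges on the shortest path."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distance[node]
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return None  # target is unreachable from source


# Hypothetical toy "mesh": 8 vertices connected in a chain, 0 - 1 - ... - 7.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}

# Fully connected graph: the connectivity that self-attention provides.
fully_connected = {i: [j for j in range(8) if j != i] for i in range(8)}

# Layers needed before vertex 7 can influence vertex 0 (simplified model).
graph_conv_layers_needed = hops_between(chain, 0, 7)
attention_layers_needed = hops_between(fully_connected, 0, 7)

print(graph_conv_layers_needed, attention_layers_needed)
```

Under this assumption the chain needs as many layers as there are edges between the two vertices, while the fully connected graph needs a single step regardless of where the vertices sit on the mesh.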

In this paper, we propose a simple yet effective framework to model global vertex-vertex interactions. The main ingredient of our framework is a transformer. Recent studies show that the transformer [53] significantly improves the performance on various tasks in natural language processing [4, 9, 40, 41]. The success is mainly attributed to the self-attention mechanism of a transformer, which is particularly effective in modeling the dependencies (or interactions) without regard to their distance in both inputs and outputs. Given the dependencies, a transformer is able to soft-search the relevant tokens and perform prediction based on the important features [4, 53].
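The soft-search behaviour of self-attention can be sketched in a few lines. The example below is a minimal single-head scaled dot-product self-attention, assuming identity query, key, and value projections for brevity (a real transformer learns these projections and uses multiple heads). The four 3-dimensional tokens and their roles are hypothetical stand-ins for joint and vertex queries.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), which keeps the sketch short.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        # Every token is scored against every other token; position in the
        # sequence (or distance on the mesh) plays no role in the score.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(w * value[d] for w, value in zip(row, tokens)) for d in range(dim)]
        )
    return outputs, weights


# Hypothetical token features (not real joint or vertex embeddings).
tokens = [
    [1.0, 0.0, 0.0],  # stand-in for a joint query
    [0.9, 0.1, 0.0],  # stand-in for a nearby vertex
    [0.0, 1.0, 0.0],  # stand-in for an unrelated vertex
    [0.8, 0.0, 0.2],  # stand-in for a distant vertex with similar features
]
outputs, weights = self_attention(tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to one, and the first token assigns more weight to the last token than to the third because their features are more similar, even though the last token is furthest away in the sequence.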

In this work, we propose METRO, a multi-layer Transformer encoder with progressive dimensionality reduction, to reconstruct 3D body joints and mesh vertices from a given input image, simultaneously. We design the Masked Vertex Modeling objective with a transformer encoder architecture to enhance the interactions among joints and vertices. As shown in Figure 1, METRO learns to discover both short- and long-range interactions among body joints and mesh vertices, which helps to better reconstruct the 3D human body shape with large pose variations and occlusions. Experimental results on multiple public datasets demonstrate that METRO is effective in learning vertex-vertex and vertex-joint interactions, and consequently outperforms the prior works on human mesh reconstruction by a large margin.
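The two design ideas named above, masking input queries and progressively shrinking the hidden size, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the mask ratio, the all-zero placeholder, the halving factor, and the helper names `mask_queries` and `reduction_schedule` are all hypothetical.

```python
import random


def mask_queries(queries, mask_ratio, rng):
    """Replace a random subset of query vectors with an all-zero placeholder.

    Assumption: the zero vector stands in for a mask token, and
    `mask_ratio` is an illustrative fraction of queries to hide.
    """
    count = int(len(queries) * mask_ratio)
    masked_indices = set(rng.sample(range(len(queries)), count))
    masked = [
        [0.0] * len(q) if i in masked_indices else list(q)
        for i, q in enumerate(queries)
    ]
    return masked, sorted(masked_indices)


def reduction_schedule(input_dim, output_dim, factor=2):
    """Hidden sizes that shrink by `factor` per block down to `output_dim`.

    Assumption: a geometric schedule; the source only states that the
    dimensionality is reduced progressively.
    """
    dims = [input_dim]
    while dims[-1] // factor > output_dim:
        dims.append(dims[-1] // factor)
    dims.append(output_dim)
    return dims


rng = random.Random(0)  # fixed seed so the sketch is reproducible
queries = [[float(i), float(i) + 0.5] for i in range(10)]  # hypothetical queries
masked_queries, masked_indices = mask_queries(queries, 0.3, rng)

# From a hypothetical 64-dimensional feature down to 3D coordinates.
schedule = reduction_schedule(64, 3)
print(masked_indices, schedule)
```

With some queries hidden, a model trained to predict every joint and vertex would presumably have to infer the hidden ones from the visible ones, which is one plausible reading of how masking encourages interactions among joints and vertices.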

To the best of our knowledge, METRO is the first approach that leverages a transformer encoder architecture to jointly learn 3D human pose and mesh reconstruction from a single input image. Moreover, METRO is a general framework which can be easily applied to predict a different 3D mesh, for example, to reconstruct a 3D hand from an image.

In summary, we make the following contributions.

• We introduce a new transformer-based method, named METRO, for 3D human pose and mesh reconstruction from a single image.
• We design the Masked Vertex Modeling objective with a multi-layer transformer encoder to model both vertex-vertex and vertex-joint interactions for better reconstruction.
• METRO achieves new state-of-the-art performance on a large-scale benchmark and the challenging 3DPW dataset.
• METRO is a versatile framework that can be easily realized to predict a different type of 3D mesh, such as a 3D hand, as demonstrated in the experiments. METRO achieves the first place on the FreiHAND leaderboard at the time of paper submission.

Related Works

Human Mesh Reconstruction (HMR): HMR is the task of reconstructing 3D human body shape, which has been an active research topic in recent years. While pioneering works have demonstrated impressive reconstruction using various sensors, such as depth sensors [33, 48] or inertial measurement units [20, 54], researchers are exploring the use of a monocular camera setting that is more efficient and convenient.

However, HMR from a single image is difficult due to complex pose variations, occlusions, and limited 3D training data.

Prior studies propose to adopt pre-trained parametric human models, such as SMPL [29], STAR [35], MANO [43], and estimate the pose and shape coefficients of the parametric model for HMR. Since it is challenging to regress the pose and shape coefficients directly from an input image, recent works further propose to leverage various human body priors such as human skeletons [26, 39] or segmentation maps [34], and explore different optimization strategies [24, 22, 51, 14] and temporal information [23] to improve performance.

On the other hand, instead of adopting a parametric human model, researchers have also proposed approaches to directly regress 3D human body shape from an input image. For example, researchers have explored representing the human body using a 3D mesh [25, 8], a volumetric space [52]
