A Point Set Generation Network for 3D Object Reconstruction from a Single Image

Haoqiang Fan*
Institute for Interdisciplinary Information Sciences, Tsinghua University

Hao Su*, Leonidas Guibas
Computer Science Department, Stanford University

Abstract

Generation of 3D data by deep neural networks has been attracting increasing attention in the research community. The majority of extant works resort to regular representations such as volumetric grids or collections of images; however, these representations obscure the natural invariance of 3D shapes under geometric transformations, and also suffer from a number of other issues. In this paper we address the problem of 3D reconstruction from a single image, generating a straightforward form of output: point cloud coordinates. Along with this problem arises a unique and interesting issue: the ground-truth shape for an input image may be ambiguous.
Driven by this unorthodox output form and the inherent ambiguity in the ground truth, we design an architecture, loss function and learning paradigm that are novel and effective. Our final solution is a conditional shape sampler, capable of predicting multiple plausible 3D point clouds from an input image. In experiments, not only does our system outperform state-of-the-art methods on single-image 3D reconstruction benchmarks, but it also shows strong performance for 3D shape completion and promising ability in making multiple plausible predictions.

Introduction

As we try to duplicate the successes of current deep convolutional architectures in the 3D domain, we face a fundamental representational issue. Extant deep net architectures for both discriminative and generative learning in the signal domain are well-suited to data that is regularly sampled, such as images, audio, or video.
However, most common 3D geometry representations, such as 2D meshes or point clouds, are not regular structures and do not easily fit into architectures that exploit such regularity for weight sharing, etc. That is why the majority of extant works on using deep nets for 3D data resort to either volumetric grids or collections of images (2D views of the geometry). Such representations, however, lead to difficult trade-offs between sampling resolution and net efficiency. Furthermore, they enshrine quantization artifacts that obscure natural invariances of the data under rigid motions.

[Figure 1 panels: Input | Reconstructed 3D point cloud]
Figure 1. A 3D point cloud of the complete object can be reconstructed from a single image. Each point is visualized as a small sphere. The reconstruction is viewed at two viewpoints (0° and 90° along azimuth). A segmentation mask is used to indicate the scope of the object in the image.

* Equal contribution.

In this paper we address the problem of generating the 3D geometry of an object based on a single image of that object. We explore generative networks for 3D geometry based on a point cloud representation. A point cloud representation may not be as efficient in representing the underlying continuous 3D geometry as a CAD model using geometric primitives or even a simple mesh, but for our purposes it has many advantages. A point cloud is a simple, uniform structure that is easier to learn, as it does not have to encode multiple primitives or combinatorial connectivity patterns. In addition, a point cloud allows simple manipulation when it comes to geometric transformations and deformations, as connectivity does not have to be updated.
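To make the point about manipulation concrete, here is a minimal Python sketch, not taken from the paper, that applies a rigid rotation about the z axis to a point cloud stored as a list of (x, y, z) tuples. The function name `rotate_z` and the sample coordinates are hypothetical. Because the cloud carries no edges or faces, the transform is a per-point map and nothing else needs updating.

```python
import math


def rotate_z(points, degrees):
    """Rotate every (x, y, z) point about the z axis by `degrees`.

    A point cloud is only a list of coordinates, so the rigid motion is a
    per-point map; unlike a mesh, there are no edges or faces to update.
    """
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


# Hypothetical two-point cloud, used only for illustration.
cloud = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.5)]
rotated = rotate_z(cloud, 90)
# rotated is approximately [(0, 1, 0), (-2, 0, 0.5)], up to float rounding
```

A deformation would work the same way: any function from a point to a new point can be mapped over the list without re-deriving connectivity.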
Our pipeline infers the point positions in a 3D frame determined by the input image and the inferred viewpoint.

Given this unorthodox network output, one of our challenges is how to measure loss during training, as the same geometry may admit different point cloud representations at the same degree of approximation. Unlike the usual L2-type losses, we use the solution of a transportation problem based on the Earth Mover's distance (EMD), effectively solving an assignment problem. We exploit an approximation to the EMD to provide speed as well as to ensure differentiability for end-to-end training.

Our approach effectively attempts to solve the ill-posed problem of 3D structure recovery from a single projection using certain learned priors. The network has to estimate depth for the visible parts of the image and hallucinate the rest of the object geometry, assessing the plausibility of several different completions.
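The permutation issue behind the choice of loss shows up on a toy example. The sketch below assumes equal-size point sets and uses hypothetical helper names. It contrasts an index-wise squared-error loss with an exact EMD, computed as the minimum, over all one-to-one matchings, of the summed Euclidean distances between matched points. It enumerates every permutation, so it only works for a handful of points; it is not the fast, differentiable approximation the text refers to.

```python
import itertools
import math


def l2_indexwise(a, b):
    """Sum of squared distances between points that share the same index."""
    return sum(sum((p - q) ** 2 for p, q in zip(u, v)) for u, v in zip(a, b))


def euclid(u, v):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))


def emd_exact(a, b):
    """Exact EMD between equal-size point sets.

    Minimizes, over all bijections between the two sets, the summed Euclidean
    distance between matched points. Brute force over permutations: O(N!),
    so toy sizes only.
    """
    if len(a) != len(b):
        raise ValueError("EMD as defined here needs equal-size point sets")
    return min(
        sum(euclid(a[i], b[j]) for i, j in enumerate(perm))
        for perm in itertools.permutations(range(len(b)))
    )


# Hypothetical toy sets: b holds the same three points as a, in reverse order.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
b = list(reversed(a))
print(l2_indexwise(a, b))  # 2.0: penalizes a mere reordering
print(emd_exact(a, b))     # 0.0: same set, so the best matching costs nothing
```

For anything beyond toy sizes, an exact assignment would need a polynomial-time solver such as the Hungarian algorithm, and training needs a differentiable approximation, as the text notes.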
From a statistical perspective, it would be ideal if we could fully characterize the landscape of the ground-truth space, or be able to sample plausible candidates accordingly. If we view this as a regression problem, then it has a rather unique and interesting feature arising from inherent object ambiguities: there are situations where there are multiple, equally good 3D reconstructions of a 2D image. This makes our problem very different from classical regression/classification settings, where each training sample has a unique ground-truth annotation. In such settings the proper loss definition can be crucial in getting the most meaningful result.

Our final algorithm is a conditional sampler, which samples plausible 3D point clouds from the estimated ground-truth space given an input image.
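A one-dimensional toy case illustrates why the loss definition matters under ambiguity. Assume, hypothetically, that one input has two equally plausible targets, -1 and +1. Searching a grid of single predictions for the lowest average squared error lands on 0, the mean of the targets, which matches neither. This is a standard property of squared-error regression, shown here only as an illustration and not as the paper's loss.

```python
def expected_sq_loss(pred, targets):
    """Average squared error of one prediction against several targets."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)


# Hypothetical ambiguous case: two equally plausible ground truths.
targets = [-1.0, 1.0]
candidates = [k / 10 for k in range(-20, 21)]  # grid over [-2, 2]
best = min(candidates, key=lambda p: expected_sq_loss(p, targets))
print(best)                             # 0.0, the mean of the targets
print(expected_sq_loss(best, targets))  # 1.0
print(expected_sq_loss(1.0, targets))   # 2.0, although 1.0 is a valid answer
```

A predictor that outputs either target exactly scores worse under this loss than the implausible mean, which is one way to see the motivation for a sampler that can produce several plausible outputs.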
Experiments on both synthetic and real-world data verify the effectiveness of our method. Our contributions can be summarized as follows:
- We use deep learning techniques to study the point set generation problem;
- On the task of 3D reconstruction from a single image, we apply our point set generation network and significantly outperform the state of the art;
- We systematically explore issues in the architecture and loss function design for point generation networks;
- We discuss and address the ground-truth ambiguity issue for 3D reconstruction from a single image.
Code demonstrating our system can be …

Related Work

3D reconstruction from single images. While most research focuses on multi-view geometry such as SFM and SLAM [10,9], ideally one would expect 3D structure to be reconstructable from the abundant single-view images. In this setting, however, the problem is ill-posed and priors must be incorporated.
Early work such as ShapeFromX [12,1] made strong assumptions about the shape or the environment lighting conditions. [11,18] pioneered the use of learning-based approaches for simple geometric structures. Coarse correspondences in an image collection can also be used for rough 3D shape estimation [14,3]. As commodity 3D sensors have become popular, RGBD databases have been built and used to train learning-based systems [6,8]. Though great progress has been made, these methods still cannot robustly reconstruct complete, high-quality shapes from single images. Stronger shape priors are needed. Large-scale repositories of 3D CAD models, such as ShapeNet [4], have been introduced and have great potential for 3D reconstruction tasks. For example, [19,13] proposed to deform and reassemble existing shapes into a new model to fit the observed image.
These systems rely on high-quality image-shape correspondence, which is itself a challenging and ill-posed problem. The work most relevant to ours is [5]. Given a single image, they use a neural network to predict the underlying 3D object as a 3D volume. There are two key differences between our work and [5]. First, the predicted object in [5] is a 3D volume, whilst ours is a point cloud. As demonstrated and analyzed in …, the point set forms a nicer shape space for neural networks, so the predicted shapes tend to be more complete and natural. Second, we allow multiple reconstruction candidates for a single input image. This design reflects the fact that a single image cannot fully determine the reconstruction of a 3D shape.

Deep learning for geometric object synthesis. In general, how to predict geometries in an end-to-end fashion remains largely unexplored.
In particular, our output, a 3D point set, is still not a typical object in the deep learning community. A point set contains orderless samples from a metric-measure space. Therefore, equivalence classes are defined up to a permutation; in addition, the ground distance must be taken into consideration. To our knowledge, no prior deep learning system is able to predict such objects.

Problem and Notations

Our goal is to reconstruct the complete 3D shape of an object from a single 2D image (RGB or RGB-D). We represent the 3D shape in the form of an unordered point set $S = \{(x_i, y_i, z_i)\}_{i=1}^{N}$, where $N$ is a predefined constant. We observed that for most objects, using $N = 1024$ is sufficient to preserve the major structures.

[Figure: network structure; panel labels include "set union", "fully connected", "concatenation", "two prediction branch version"]

One advantage of the point set comes from its unorderedness.
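The representation can be sketched in a few lines. This is an illustrative assumption, not the paper's data pipeline: it stores a shape as a list of N = 1024 (x, y, z) tuples, uses a hypothetical toy shape (points drawn uniformly on the unit sphere), and defines a hypothetical helper `same_point_set` that treats two lists as the same point set when they agree up to a permutation.

```python
import math
import random

N = 1024  # points per shape, following the text's choice of N = 1024


def sample_unit_sphere(n, seed=0):
    """Hypothetical toy shape: n points drawn uniformly on the unit sphere,
    obtained by normalizing isotropic Gaussian samples."""
    rng = random.Random(seed)
    points = []
    while len(points) < n:
        v = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        r = math.sqrt(sum(c * c for c in v))
        if r > 1e-12:  # skip a (practically impossible) zero vector
            points.append(tuple(c / r for c in v))
    return points


def same_point_set(a, b):
    """True if the two lists hold exactly the same points in any order.
    Sorting gives a canonical order, so list order carries no meaning."""
    return sorted(a) == sorted(b)


S = sample_unit_sphere(N)
shuffled = S[:]
random.Random(1).shuffle(shuffled)
print(same_point_set(S, shuffled))  # True: a reordering is the same set
```

The exact-equality check only identifies literal reorderings. Two different samplings of the same surface would compare unequal here, which is why a ground distance such as the EMD is needed to compare point sets meaningfully.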