
arXiv:2011.12100v2 [cs.CV] 29 Apr 2021



GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields

Michael Niemeyer (1,2), Andreas Geiger (1,2)
1 Max Planck Institute for Intelligent Systems, Tübingen
2 University of Tübingen

Abstract

Deep generative models allow for photorealistic image synthesis at high resolutions. But for many applications, this is not enough: content creation also needs to be controllable. While several recent works investigate how to disentangle underlying factors of variation in the data, most of them operate in 2D and hence ignore that our world is three-dimensional. Further, only few works consider the compositional nature of scenes. Our key hypothesis is that incorporating a compositional 3D scene representation into the generative model leads to more controllable image synthesis. Representing scenes as compositional generative neural feature fields allows us to disentangle one or multiple objects from the background as well as individual objects' shapes and appearances while learning from unstructured and unposed image collections without any additional supervision.

Combining this scene representation with a neural rendering pipeline yields a fast and realistic image synthesis model. As evidenced by our experiments, our model is able to disentangle individual objects and allows for translating and rotating them in the scene as well as changing the camera pose.

1. Introduction

The ability to generate and manipulate photorealistic image content is a long-standing goal of computer vision and graphics. Modern computer graphics techniques achieve impressive results and are industry standard in gaming and movie productions. However, they are very hardware expensive and require substantial human labor for 3D content creation.

In recent years, the computer vision community has made great strides towards highly-realistic image generation. In particular, Generative Adversarial Networks (GANs) [24] emerged as a powerful class of generative models. They are able to synthesize photorealistic images at resolutions of $1024^2$ pixels and beyond [6, 14, 15, 39, 40].

[Figure 1. Diagram labels: Camera, Pose, Shape and Appearance, Implicit 3D Scene Representation, Sampled Feature Fields, Posed Feature Fields, Volume Rendering of Feature Image, Feature Image, Neural Rendering of Output Image (Decoder, 2D CNN), Output Image.]

Figure 1: We represent scenes as compositional generative neural feature fields. For a randomly sampled camera, we volume render a feature image of the scene based on individual feature fields. A 2D neural rendering network converts the feature image into an RGB image. While training only on raw image collections, at test time we are able to control the image formation process with respect to camera pose, object poses, as well as the objects' shapes and appearances. Further, our model generalizes beyond the training data, e.g. we can synthesize scenes with more objects than were present in the training images. Note that for clarity we visualize volumes in color instead of features.

Despite these successes, synthesizing realistic 2D images is not the only aspect required in applications of generative models.

The generation process should also be controllable in a simple and consistent manner. To this end, many works [9, 25, 39, 43, 44, 48, 54, 71, 74, 97, 98] investigate how disentangled representations can be learned from data without explicit supervision. Definitions of disentanglement vary [5, 53], but commonly refer to being able to control an attribute of interest, e.g. object shape, size, or pose, without changing other attributes. Most approaches, however, do not consider the compositional nature of scenes and operate in the 2D domain, ignoring that our world is three-dimensional. This often leads to entangled representations (Fig. 2) and control mechanisms are not built-in, but need to be discovered in the latent space a posteriori. These properties, however, are crucial for successful applications, e.g. a movie production where complex object trajectories need to be generated in a consistent manner.

[Figure 2. Panels: (a) Translation of Left Object (2D-based Method [71]); (b) Translation of Left Object (Ours); (c) Circular Translation (Ours); (d) Add Objects (Ours).]

Figure 2: Controllable Image Generation. While most generative models operate in 2D, we incorporate a compositional 3D scene representation into the generative model. This leads to more consistent image synthesis results, e.g. note how, in contrast to our method, translating one object might change the other when operating in 2D (Fig. 2a and 2b). It further allows us to perform complex operations like circular translations (Fig. 2c) or adding more objects at test time (Fig. 2d). Both methods are trained unsupervised on raw unposed image collections of two-object scenes.

Several recent works therefore investigate how to incorporate 3D representations, such as voxels [32, 63, 64], primitives [46], or radiance fields [77], directly into generative models. While these methods allow for impressive results with built-in control, they are mostly restricted to single-object scenes and results are less consistent for higher resolutions and more complex and realistic imagery (e.g. scenes with objects not in the center or cluttered backgrounds).

Contribution: In this work, we introduce GIRAFFE, a novel method for generating scenes in a controllable and photorealistic manner while training from raw unstructured image collections. Our key insight is twofold: First, incorporating a compositional 3D scene representation directly into the generative model leads to more controllable image synthesis.

Second, combining this explicit 3D representation with a neural rendering pipeline results in faster inference and more realistic images. To this end, we represent scenes as compositional generative neural feature fields (Fig. 1). We volume render the scene to a feature image of relatively low resolution to save time and computation. A neural renderer processes these feature images and outputs the final renderings. This way, our approach achieves high-quality images and scales to real-world scenes. We find that our method allows for controllable image synthesis of single-object as well as multi-object scenes when trained on raw unstructured image collections. Code and data are available online.

2. Related Work

GAN-based Image Synthesis: Generative Adversarial Networks (GANs) [24] have been shown to allow for photorealistic image synthesis at resolutions of $1024^2$ pixels and beyond [6, 14, 15, 39, 40]. To gain better control over the synthesis process, many works investigate how factors of variation can be disentangled without explicit supervision.

They either modify the training objective [9, 40, 71] or network architecture [39], or investigate latent spaces of well-engineered and pre-trained generative models [1, 16, 23, 27, 34, 78, 96]. All of these works, however, do not explicitly model the compositional nature of scenes. Recent works therefore investigate how the synthesis process can be controlled at the object-level [3, 4, 7, 18, 19, 26, 45, 86, 90]. While achieving photorealistic results, all aforementioned works model the image formation process in 2D, ignoring the three-dimensional structure of our world. In this work, we advocate to model the formation process directly in 3D for better disentanglement and more controllable synthesis.

Implicit Functions: Using implicit functions to represent 3D geometry has gained popularity in learning-based 3D reconstruction [11, 12, 22, 59, 60, 65, 67, 69, 76] and has been extended to scene-level reconstruction [8, 13, 35, 72, 79]. To overcome the need of 3D supervision, several works [50, 51, 66, 81, 92] propose differentiable rendering techniques.

Mildenhall et al. [61] propose Neural Radiance Fields (NeRFs) in which they combine an implicit neural model with volume rendering for novel view synthesis of complex scenes. Due to their expressiveness, we use a generative variant of NeRFs as our object-level representation. In contrast to our method, the discussed works require multi-view images with camera poses as supervision, train a single network per scene, and are not able to generate novel scenes. Instead, we learn a generative model from unstructured image collections which allows for controllable, photorealistic image synthesis of generated scenes.

3D-Aware Image Synthesis: Several works investigate how 3D representations can be incorporated as inductive bias into generative models [21, 29-32, 46, 55, 63, 64, 75, 77]. While many approaches use additional supervision [2, 10, 87, 88, 99], we focus on works which are trained on raw image collections like our approach.

Henzler et al. [32] learn voxel-based representations using differentiable rendering.

The results are 3D controllable, but show artifacts due to the limited voxel resolutions caused by their cubic memory growth. Nguyen-Phuoc et al. [63, 64] propose voxelized feature-grid representations which are rendered to 2D via a reshaping operation. While achieving impressive results, training becomes less stable and results less consistent for higher resolutions. Liao et al. [46] use abstract features in combination with primitives and differentiable rendering. While handling multi-object scenes, they require additional supervision in the form of pure background images which are hard to obtain for real-world scenes. Schwarz et al. [77] propose Generative Neural Radiance Fields (GRAF). While achieving controllable image synthesis at high resolutions, this representation is restricted to single-object scenes and results degrade on more complex, real-world imagery. In contrast, we incorporate compositional 3D scene structure into the generative model such that it naturally handles multi-object scenes.
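As a rough illustration of the cubic memory growth of voxel grids mentioned above, the following sketch computes the size of a dense grid at a few resolutions. The function name `voxel_grid_bytes`, the single-channel layout, and the 4-byte (float32) values are assumptions made for illustration only; they are not taken from any of the cited methods.

```python
def voxel_grid_bytes(resolution, channels=1, bytes_per_value=4):
    """Memory of a dense resolution^3 grid (assumed: float32, one channel)."""
    return resolution ** 3 * channels * bytes_per_value


# Doubling the resolution multiplies the memory by 2^3 = 8.
for n in (64, 128, 256):
    mib = voxel_grid_bytes(n) / 2 ** 20
    print(f"{n}^3 grid: {mib:.0f} MiB")
```

Under these assumptions, each doubling of the grid resolution costs eight times the memory, which is why dense voxel representations are hard to scale to high resolutions.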

Further, by integrating a neural rendering pipeline [20, 41, 42, 49, 62, 80, 81, 83, 84], our model scales to more complex, real-world data.

3. Method

Our goal is a controllable image synthesis pipeline which can be trained from raw image collections without additional supervision. In the following, we discuss the main components of our method. First, we model individual objects as neural feature fields (Sec. 3.1). Next, we exploit the additive property of feature fields to composite scenes from multiple individual objects (Sec. 3.2). For rendering, we explore an efficient combination of volume and neural rendering techniques (Sec. 3.3). Finally, we discuss how we train our model from raw image collections (Sec. 3.4). Fig. 3 contains an overview of our method.

3.1. Objects as Neural Feature Fields

Neural Radiance Fields: A radiance field is a continuous function $f$ which maps a 3D point $\mathbf{x} \in \mathbb{R}^3$ and a viewing direction $\mathbf{d} \in \mathbb{S}^2$ to a volume density $\sigma \in \mathbb{R}^+$ and an RGB color value $\mathbf{c} \in \mathbb{R}^3$. A key observation in [61, 82] is that the low-dimensional inputs $\mathbf{x}$ and $\mathbf{d}$ need to be mapped to higher-dimensional features to be able to represent complex signals when $f$ is parameterized with a neural network. More specifically, a pre-defined positional encoding is applied element-wise to each component of $\mathbf{x}$ and $\mathbf{d}$:

$$\gamma(t, L) = \left(\sin(2^0 t\pi), \cos(2^0 t\pi), \ldots, \sin(2^L t\pi), \cos(2^L t\pi)\right)$$
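The element-wise positional encoding described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `positional_encoding` is hypothetical, and the sketch assumes frequencies $2^0\pi$ through $2^L\pi$ inclusive, i.e. $2(L+1)$ output values per input component; released code may index the frequencies differently.

```python
import numpy as np


def positional_encoding(t, L):
    """Apply gamma(t, L) element-wise to the last axis of t.

    Assumption: frequencies 2^0*pi ... 2^L*pi inclusive, so an input of
    shape (..., D) becomes an output of shape (..., D * 2 * (L + 1)).
    """
    t = np.asarray(t, dtype=np.float64)
    freqs = (2.0 ** np.arange(L + 1)) * np.pi      # shape (L+1,)
    angles = t[..., None] * freqs                  # shape (..., D, L+1)
    # Interleave as (sin f0, cos f0, sin f1, cos f1, ...) per component.
    enc = np.stack([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*t.shape[:-1], -1)


# Encode one 3D point with L = 2: 3 components * 2 * (2 + 1) = 18 features.
x = np.array([0.5, 0.0, 1.0])
print(positional_encoding(x, 2).shape)
```

The encoded vector, rather than the raw coordinates, would then be fed to the network that parameterizes $f$; the higher-frequency terms are what allow the network to represent fine detail.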

