arXiv:1711.11585v2 [cs.CV] 20 Aug 2018

High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs

Ting-Chun Wang¹ Ming-Yu Liu¹ Jun-Yan Zhu² Andrew Tao¹ Jan Kautz¹ Bryan Catanzaro¹
¹NVIDIA Corporation ²UC Berkeley

[Figure 1 panels: (a) Synthesized result; (b) Application: Change label types; (c) Application: Edit object appearance. Comparison labels: Cascaded refinement network [5]; Our result.]

Figure 1: We propose a generative adversarial framework for synthesizing 2048 × 1024 images from semantic label maps (lower left corner in (a)). Compared to previous work [5], our results express more natural textures and details. (b) We can change labels in the original label map to create new scenes, like replacing trees with buildings. (c) Our framework also allows a user to edit the appearance of individual objects in the scene, e.g., changing the color of a car or the texture of a road. Please visit our website for more side-by-side comparisons as well as interactive editing demos.

Abstract

We present a new method for synthesizing high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks (conditional GANs).

Conditional GANs have enabled a variety of applications, but the results are often limited to low-resolution and still far from realistic. In this work, we generate 2048 × 1024 visually appealing results with a novel adversarial loss, as well as new multi-scale generator and discriminator architectures. Furthermore, we extend our framework to interactive visual manipulation with two additional features. First, we incorporate object instance segmentation information, which enables object manipulations such as removing/adding objects and changing the object category. Second, we propose a method to generate diverse results given the same input, allowing users to edit the object appearance interactively.

Human opinion studies demonstrate that our method significantly outperforms existing methods, advancing both the quality and the resolution of deep image synthesis and editing.

1. Introduction

Photo-realistic image rendering using standard graphics techniques is involved, since geometry, materials, and light transport must be simulated explicitly. Although existing graphics algorithms excel at the task, building and editing virtual environments is expensive and time-consuming. This is because we have to model every aspect of the world explicitly. If we were able to render photo-realistic images using a model learned from data, we could turn the process of graphics rendering into a model learning and inference problem.

Then, we could simplify the process of creating new virtual worlds by training models on new datasets. We could even make it easier to customize environments by allowing users to simply specify overall semantic structure rather than modeling geometry, materials, or lighting.

In this paper, we discuss a new approach that produces high-resolution images from semantic label maps. This method has a wide range of applications. For example, we can use it to create synthetic training data for training visual recognition algorithms, since it is much easier to create semantic labels for desired scenarios than to generate training images. Using semantic segmentation methods, we can transform images into a semantic label domain, edit the objects in the label domain, and then transform them back to the image domain.
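The "edit in the label domain" step can be illustrated with a minimal sketch. It assumes a semantic label map is stored as a 2D grid of integer class ids; the class ids (`TREE`, `BUILDING`, `ROAD`) and the helper `replace_label` are hypothetical names chosen for illustration and do not come from the paper. The segmentation network and the generator that would map the edited labels back to an image are not shown.

```python
# Hypothetical class ids; real datasets define their own label tables.
TREE, BUILDING, ROAD = 1, 2, 3


def replace_label(label_map, old, new):
    """Return a copy of label_map with every `old` pixel relabeled as `new`."""
    return [[new if pixel == old else pixel for pixel in row] for row in label_map]


# A tiny 3x4 semantic label map (one integer class id per pixel).
label_map = [
    [TREE, TREE, BUILDING, BUILDING],
    [TREE, TREE, BUILDING, BUILDING],
    [ROAD, ROAD, ROAD, ROAD],
]

# Replace trees with buildings; the original map is left untouched.
edited = replace_label(label_map, TREE, BUILDING)
```

In a full pipeline, `edited` would be passed to a trained label-to-image generator to render the new scene.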

This method also gives us new tools for higher-level image editing, e.g., adding objects to images or changing the appearance of existing objects.

To synthesize images from semantic labels, one can use the pix2pix method, an image-to-image translation framework [21] which leverages generative adversarial networks (GANs) [16] in a conditional setting. Recently, Chen and Koltun [5] suggest that adversarial training might be unstable and prone to failure for high-resolution image generation tasks. Instead, they adopt a modified perceptual loss [11, 13, 22] to synthesize images, which are high-resolution but often lack fine details and realistic textures.

Here we address two main issues of the above state-of-the-art methods: (1) the difficulty of generating high-resolution images with GANs [21] and (2) the lack of details and realistic textures in the previous high-resolution results [5].

We show that through a new, robust adversarial learning objective together with new multi-scale generator and discriminator architectures, we can synthesize photo-realistic images at 2048 × 1024 resolution, which are more visually appealing than those computed by previous methods [5, 21]. We first obtain our results with adversarial training only, without relying on any hand-crafted losses [44] or pre-trained networks (e.g., VGGNet [48]) for perceptual losses [11, 22] (Figs. 9c, 10b). Then we show that adding perceptual losses from pre-trained networks [48] can slightly improve the results in some circumstances (Figs. 9d, 10c), if a pre-trained network is available.
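To give a feel for what "multi-scale" means at this resolution, here is a minimal sketch of an image pyramid. The passage above does not say how the scales are formed, so the factor-of-2 steps, the 2×2 average pooling, and the choice of three scales are all assumptions made for illustration. The function names are hypothetical.

```python
def downsample_2x(image):
    """Average each non-overlapping 2x2 block; height and width must be even."""
    height, width = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


def build_pyramid(image, num_scales):
    """Return [full, half, quarter, ...] resolution versions of image."""
    pyramid = [image]
    for _ in range(num_scales - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid


def pyramid_shapes(width, height, num_scales):
    """(width, height) at each scale, without allocating any pixels."""
    return [(width >> s, height >> s) for s in range(num_scales)]


# Small grayscale stand-in for a 2048 x 1024 image (8 wide, 4 high).
image = [[float(r * 8 + c) for c in range(8)] for r in range(4)]
pyramid = build_pyramid(image, num_scales=3)

# Resolutions a 2048 x 1024 image would have under the same assumptions.
shapes = pyramid_shapes(2048, 1024, num_scales=3)
```

Under these assumptions, a network operating on the coarsest level sees the global layout cheaply, while one operating on the full-resolution level can attend to fine detail.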

Both results outperform previous works substantially in terms of image quality.

Figure 2: Example results of using our framework for translating edges to high-resolution natural photos, using CelebA-HQ [26] and internet cat images.

Furthermore, to support interactive semantic manipulation, we extend our method in two directions. First, we use instance-level object segmentation information, which can separate different object instances within the same category. This enables flexible object manipulations, such as adding/removing objects and changing object types. Second, we propose a method to generate diverse results given the same input label map, allowing the user to edit the appearance of the same object interactively.

We compare against state-of-the-art visual synthesis systems [5, 21], and show that our method outperforms these approaches regarding both quantitative evaluations and human perception studies.
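The value of instance-level information can be seen in a small sketch. Two adjacent cars form a single blob in a semantic label map, so the semantic map alone cannot tell us which pixels belong to which car. An instance map can. The data layout (two parallel integer grids), the class ids, and the helper `remove_instance` are hypothetical and only illustrate the idea; they are not the paper's implementation.

```python
CAR, ROAD = 1, 0  # hypothetical class ids

# Two cars parked side by side: the semantic map sees one CAR blob.
semantic = [
    [ROAD, CAR, CAR, CAR, CAR, ROAD],
    [ROAD, CAR, CAR, CAR, CAR, ROAD],
]
# The instance map gives each object its own id (0 = no instance).
instance = [
    [0, 1, 1, 2, 2, 0],
    [0, 1, 1, 2, 2, 0],
]


def remove_instance(semantic, instance, target_id, fill_label):
    """Relabel the pixels of one instance, leaving same-class neighbors intact."""
    new_semantic = [
        [fill_label if inst == target_id else sem for sem, inst in zip(sem_row, inst_row)]
        for sem_row, inst_row in zip(semantic, instance)
    ]
    new_instance = [
        [0 if inst == target_id else inst for inst in inst_row]
        for inst_row in instance
    ]
    return new_semantic, new_instance


# Remove only the second car; the first car keeps its pixels.
edited_semantic, edited_instance = remove_instance(
    semantic, instance, target_id=2, fill_label=ROAD
)
```

With the semantic map alone, the same edit would have to remove both cars or neither.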

We also perform an ablation study regarding the training objectives and the importance of instance-level segmentation information. In addition to semantic manipulation, we test our method on edge2photo applications (Figs. 2, 13), which shows the generalizability of our approach. Code and data are available at our website.

2. Related Work

Generative adversarial networks

Generative adversarial networks (GANs) [16] aim to model the natural image distribution by forcing the generated samples to be indistinguishable from natural images. GANs enable a wide variety of applications such as image generation [1, 42, 62], representation learning [45], image manipulation [64], object detection [33], and video applications [38, 51, 54].

Various coarse-to-fine schemes [4] have been proposed [9, 19, 26, 57] to synthesize larger images (e.g., 256 × 256) in an unconditional setting. Inspired by their successes, we propose a new coarse-to-fine generator and multi-scale discriminator architectures suitable for conditional image generation at a much higher resolution.

Image-to-image translation

Many researchers have leveraged adversarial learning for image-to-image translation [21], whose goal is to translate an input image from one domain to another domain given input-output image pairs as training data. Compared to L1 loss, which often leads to blurry images [21, 22], the adversarial loss [16] has become a popular choice for many image-to-image tasks [10, 24, 25, 32, 41, 46, 55, 60, 66].

The reason is that the discriminator can learn a trainable loss function and automatically adapt to the differences between the generated and real images in the target domain. For example, the recent pix2pix framework [21] used image-conditional GANs [39] for different applications, such as transforming Google maps to satellite views and generating cats from user sketches. Various methods have also been proposed to learn an image-to-image translation in the absence of training pairs [2, 34, 35, 47, 50, 52, 56, 65].

Recently, Chen and Koltun [5] suggest that it might be hard for conditional GANs to generate high-resolution images due to the training instability and optimization issues. To avoid this difficulty, they use a direct regression objective based on a perceptual loss [11, 13, 22] and produce the first model that can synthesize 2048 × 1024 images.
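The idea of the discriminator acting as a trainable loss can be made concrete with the standard conditional GAN game, written here as a sketch. The symbols are assumptions introduced for illustration: s is the conditioning input (for example a label map), x is the corresponding real image, G is the generator, and D is the discriminator. This is the commonly used formulation of an image-conditional GAN, not necessarily the exact objective adopted in this work.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{(s, x)} \big[ \log D(s, x) \big]
  + \mathbb{E}_{s} \big[ \log \big( 1 - D(s, G(s)) \big) \big]
```

D is trained to score real pairs (s, x) high and generated pairs (s, G(s)) low, while G is trained against the current D. The penalty G receives therefore changes as D learns, in contrast to a fixed per-pixel loss such as L1.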

