
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks

Tao Xu (Lehigh University), Pengchuan Zhang (Microsoft Research), Qiuyuan Huang (Microsoft Research), Han Zhang (Rutgers University), Zhe Gan (Duke University), Xiaolei Huang (Lehigh University), Xiaodong He (Microsoft Research)



Transcription of AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks

AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks

Tao Xu¹, Pengchuan Zhang², Qiuyuan Huang², Han Zhang³, Zhe Gan⁴, Xiaolei Huang¹, Xiaodong He²
¹ Lehigh University, ² Microsoft Research, ³ Rutgers University, ⁴ Duke University

Abstract

In this paper, we propose an attentional generative adversarial network (AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation. With a novel attentional generative network, the AttnGAN can synthesize fine-grained details at different sub-regions of the image by paying attention to the relevant words in the natural language description. In addition, a deep attentional multimodal similarity model is proposed to compute a fine-grained image-text matching loss for training the generator.

The proposed AttnGAN significantly outperforms the previous state of the art, boosting the best reported inception score by 14.14% on the CUB dataset and 170.25% on the more challenging COCO dataset. A detailed analysis is also performed by visualizing the attention layers of the AttnGAN. For the first time, it shows that the layered attentional GAN is able to automatically select the condition at the word level for generating different parts of the image.

1. Introduction

Automatically generating images according to natural language descriptions is a fundamental problem in many applications, such as art generation and computer-aided design. It also drives research progress in multimodal learning and inference across vision and language, which is one of the most active research areas in recent years [20, 18, 31, 19, 4, 29, 5, 1, 30]. Most recently proposed text-to-image synthesis methods are based on generative adversarial networks (GANs) [6].

A commonly used approach is to encode the whole text description into a global sentence vector as the condition for GAN-based image generation [20, 18, 31, 32]. (Part of this work was performed when the first author was an intern with Microsoft Research.)

Figure 1. Example results of the proposed AttnGAN (input text: "this bird is red with white and has a very short beak"). The first row gives the low-to-high resolution images generated by G0, G1 and G2 of the AttnGAN; the second and third row show the top-5 most attended words by F1^attn and F2^attn of the AttnGAN, respectively. Here, images of G0 and G1 are bilinearly upsampled to have the same size as that of G2 for better visualization.

Although impressive results have been presented, conditioning GAN only on the global sentence vector lacks important fine-grained information at the word level, and prevents the generation of high quality images.

This problem becomes even more severe when generating complex scenes such as those in the COCO dataset [14]. To address this issue, we propose an attentional generative adversarial network (AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation. The overall architecture of the AttnGAN is illustrated in Figure 2. The model consists of two novel components. The first component is an attentional generative network, in which an attention mechanism is developed for the generator to draw different sub-regions of the image by focusing on words that are most relevant to the sub-region being drawn (see Figure 1). More specifically, besides encoding the natural language description into a global sentence vector, each word in the sentence is also encoded into a word vector.
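To make this dual encoding concrete, the sketch below shows a text encoder that emits both per-word features and a global sentence feature, here assumed to be a bidirectional LSTM. This is a minimal PyTorch-style illustration; the class name, dimensions, and hyperparameters are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of a bi-LSTM text encoder producing per-word features
    and a global sentence feature (hyperparameters are illustrative)."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional LSTM: concatenating both directions gives
        # 2*hidden_dim features per word.
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, captions):
        # captions: (batch, seq_len) integer word ids
        emb = self.embedding(captions)            # (batch, seq_len, embed_dim)
        word_feats, (h_n, _) = self.lstm(emb)     # (batch, seq_len, 2*hidden_dim)
        # Sentence feature: final hidden states of both directions, concatenated.
        sent_feat = torch.cat([h_n[0], h_n[1]], dim=1)  # (batch, 2*hidden_dim)
        return word_feats, sent_feat
```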

The generative network utilizes the global sentence vector to generate a low-resolution image in the first stage. In the following stages, it uses the image vector in each sub-region to query word vectors by using an attention layer to form a word-context vector. It then combines the regional image vector and the corresponding word-context vector to form a multimodal context vector, based on which the model generates new image features in the surrounding sub-regions. This effectively yields a higher resolution picture with more details at each stage.
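The following is a minimal sketch of such an attention layer: each image sub-region queries the word features, and the attention-weighted words form a word-context vector that is concatenated with the regional image feature into a multimodal context. The module name, dimensions, and plain dot-product scoring are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAttention(nn.Module):
    """Sketch of word-level attention: each image sub-region attends
    over the words and receives a word-context vector."""

    def __init__(self, word_dim, image_dim):
        super().__init__()
        # Project word features into the image feature space.
        self.project = nn.Linear(word_dim, image_dim, bias=False)

    def forward(self, image_feats, word_feats):
        # image_feats: (batch, N, image_dim), one row per sub-region
        # word_feats:  (batch, T, word_dim),  one row per word
        words = self.project(word_feats)                         # (batch, T, image_dim)
        # Dot-product scores between every sub-region and every word.
        scores = torch.bmm(image_feats, words.transpose(1, 2))   # (batch, N, T)
        attn = F.softmax(scores, dim=2)                          # normalize over words
        # Word-context vector per sub-region: attention-weighted words.
        context = torch.bmm(attn, words)                         # (batch, N, image_dim)
        # Multimodal context: regional image feature + its word context.
        return torch.cat([image_feats, context], dim=2), attn
```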

The other component in the AttnGAN is a Deep Attentional Multimodal Similarity Model (DAMSM). With an attention mechanism, the DAMSM is able to compute the similarity between the generated image and the sentence using both the global sentence level information and the fine-grained word level information. Thus, the DAMSM provides an additional fine-grained image-text matching loss for training the generator.
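A simplified sketch of a DAMSM-style matching loss is given below: within a minibatch, each caption attends over the local features of each image, word-level cosine relevances are averaged into one image-text score, and a symmetric softmax loss pushes matched pairs (the diagonal) above mismatched ones. The aggregation, the temperature gamma, and this batch-softmax form are assumptions that simplify the paper's formulation.

```python
import torch
import torch.nn.functional as F

def damsm_matching_loss(region_feats, word_feats, gamma=10.0):
    """Simplified DAMSM-style fine-grained matching loss sketch.
    region_feats: (B, N, D) local image features
    word_feats:   (B, T, D) word features
    Matched image/caption pairs share the same batch index."""
    B = region_feats.size(0)
    sims = region_feats.new_zeros(B, B)
    for i in range(B):        # image i
        for j in range(B):    # caption j
            # Each word of caption j attends over the regions of image i.
            s = word_feats[j] @ region_feats[i].t()          # (T, N)
            ctx = F.softmax(s, dim=1) @ region_feats[i]      # (T, D)
            # Word-level cosine relevance, averaged into one pair score.
            sims[i, j] = F.cosine_similarity(word_feats[j], ctx, dim=1).mean()
    targets = torch.arange(B, device=sims.device)
    # Symmetric softmax loss: match images to captions and captions to images.
    return F.cross_entropy(gamma * sims, targets) + \
           F.cross_entropy(gamma * sims.t(), targets)
```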

The contribution of our method is threefold. (i) An attentional generative adversarial network is proposed for synthesizing images from text descriptions. Specifically, two novel components are proposed in the AttnGAN, including the attentional generative network and the DAMSM. (ii) A comprehensive study is carried out to empirically evaluate the proposed AttnGAN. Experimental results show that the AttnGAN significantly outperforms previous state-of-the-art GAN models. (iii) A detailed analysis is performed through visualizing the attention layers of the AttnGAN. For the first time, it is demonstrated that the layered conditional GAN is able to automatically attend to relevant words to form the condition for image generation.

2. Related Work

Generating high resolution images from text descriptions, though very challenging, is important for many practical applications such as art generation and computer-aided design. Recently, great progress has been achieved in this direction with the emergence of deep generative models [12, 26, 6]. Mansimov et al. [15] built the alignDRAW model, extending the Deep Recurrent Attention Writer (DRAW) [7] to iteratively draw image patches while attending to the relevant words in the caption. Nguyen et al. [16] proposed an approximate Langevin approach to generate images from captions. Reed et al. [21] used conditional PixelCNN [26] to synthesize images from text with a multi-scale model structure. Compared with other deep generative models, generative adversarial networks (GANs) [6] have shown great performance for generating sharper samples [17, 3, 23, 13, 10]. Reed et al. [20] first showed that the conditional GAN was capable of synthesizing plausible images from text descriptions.

Their follow-up work [18] also demonstrated that GAN was able to generate better samples by incorporating additional conditions (e.g., object locations). Zhang et al. [31, 32] stacked several GANs for text-to-image synthesis and used different GANs to generate images of different sizes. However, all of their GANs are conditioned on the global sentence vector, missing fine-grained word level information for image generation.

The attention mechanism has recently become an integral part of sequence transduction models. It has been successfully used in modeling multi-level dependencies in image captioning [29], image question answering [30] and machine translation [2]. Vaswani et al. [27] also demonstrated that machine translation models could achieve state-of-the-art results by solely using an attention model.

Despite this progress, the attention mechanism has not been explored in GANs for text-to-image synthesis yet. It is worth mentioning that alignDRAW [15] also used LAPGAN [3] to scale the image to a higher resolution. However, the GAN in their framework was only utilized as a post-processing step without attention. To our knowledge, the proposed AttnGAN for the first time develops an attention mechanism that enables GANs to generate fine-grained high quality images via multi-level (e.g., word level and sentence level) conditioning.

3. Attentional Generative Adversarial Network

As shown in Figure 2, the proposed attentional generative adversarial network (AttnGAN) has two novel components: the attentional generative network and the deep attentional multimodal similarity model.

We will elaborate each of them in the rest of this section.

3.1. Attentional Generative Network

Current GAN-based models for text-to-image generation [20, 18, 31, 32] typically encode the whole-sentence text description into a single vector as the condition for image generation, but lack fine-grained word level information. In this section, we propose a novel attention model that enables the generative network to draw different sub-regions of the image conditioned on words that are most relevant to those sub-regions.

As shown in Figure 2, the proposed attentional generative network has m generators (G0, G1, ..., Gm-1), which take the hidden states (h0, h1, ..., hm-1) as input and generate images of small-to-large scales (x̂0, x̂1, ..., x̂m-1).

Figure 2. The architecture of the proposed AttnGAN (diagram not reproduced): a text encoder produces the sentence feature and word features; the noise z ~ N(0, I) and the sentence feature (through F^ca) drive F0 to produce h0; attention models F1^attn and F2^attn combine the word features with the hidden states to drive the following stages F1 and F2; generators G0, G1 and G2 output 64x64x3, 128x128x3 and 256x256x3 images for discriminators D0, D1 and D2; an image encoder extracts the local image features used by the Deep Attentional Multimodal Similarity Model (DAMSM). Building blocks include Conv3x3, joining, upsampling, residual blocks, and FC with reshape.
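Under the assumption that F0 builds the initial hidden state from the noise and the sentence feature, and that each later stage pairs an attention model with a refinement network, the loop below sketches how the m generators could be chained; the module collections here (F0, stages, generators) are hypothetical stand-ins for the paper's F0, Fi^attn/Fi and Gi, not their actual code.

```python
import torch

def attngan_forward(z, sent_feat, word_feats, F0, stages, generators):
    """Sketch of the multi-stage generative network: m generators
    consume hidden states h_0..h_{m-1} and emit images of
    increasing resolution.
    stages: list of (attention_module, refinement_module) pairs."""
    images = []
    h = F0(z, sent_feat)                  # h_0 from noise + sentence feature
    images.append(generators[0](h))       # low-resolution image from G_0
    for attn, refine in stages:
        # Word-context features for each sub-region of the current hidden state.
        context, _ = attn(h, word_feats)
        # Next hidden state: refine the multimodal context (upsampling inside).
        h = refine(context)
        images.append(generators[len(images)](h))
    return images                         # small-to-large scale images
```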

