Generative Adversarial Nets

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
Département d'informatique et de recherche opérationnelle
Université de Montréal
Montréal, QC H3C 3J7

Abstract

We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model $G$ that captures the data distribution, and a discriminative model $D$ that estimates the probability that a sample came from the training data rather than $G$. The training procedure for $G$ is to maximize the probability of $D$ making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions $G$ and $D$, a unique solution exists, with $G$ recovering the training data distribution and $D$ equal to $\frac{1}{2}$ everywhere. In the case where $G$ and $D$ are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples.
Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

1 Introduction

The promise of deep learning is to discover rich, hierarchical models [2] that represent probability distributions over the kinds of data encountered in artificial intelligence applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora. So far, the most striking successes in deep learning have involved discriminative models, usually those that map a high-dimensional, rich sensory input to a class label [14, 22]. These striking successes have primarily been based on the backpropagation and dropout algorithms, using piecewise linear units [19, 9, 10] which have a particularly well-behaved gradient. Deep generative models have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to the difficulty of leveraging the benefits of piecewise linear units in the generative context.
We propose a new generative model estimation procedure that sidesteps these difficulties. In the proposed adversarial nets framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.

Jean Pouget-Abadie is visiting Université de Montréal from Ecole Polytechnique.
Sherjil Ozair is visiting Université de Montréal from Indian Institute of Technology Delhi.
Yoshua Bengio is a CIFAR Senior Fellow.
All code and hyperparameters available [ ]
10 Jun 2014

This framework can yield specific training algorithms for many kinds of model and optimization algorithm.
In this article, we explore the special case when the generative model generates samples by passing random noise through a multilayer perceptron, and the discriminative model is also a multilayer perceptron. We refer to this special case as adversarial nets. In this case, we can train both models using only the highly successful backpropagation and dropout algorithms [17] and sample from the generative model using only forward propagation. No approximate inference or Markov chains are necessary.

2 Related work

An alternative to directed graphical models with latent variables is undirected graphical models with latent variables, such as restricted Boltzmann machines (RBMs) [27, 16], deep Boltzmann machines (DBMs) [26] and their numerous variants. The interactions within such models are represented as the product of unnormalized potential functions, normalized by a global summation/integration over all states of the random variables.
This quantity (the partition function) and its gradient are intractable for all but the most trivial instances, although they can be estimated by Markov chain Monte Carlo (MCMC) methods. Mixing poses a significant problem for learning algorithms that rely on MCMC [3, 5].

Deep belief networks (DBNs) [16] are hybrid models containing a single undirected layer and several directed layers. While a fast approximate layer-wise training criterion exists, DBNs incur the computational difficulties associated with both undirected and directed models.

Alternative criteria that do not approximate or bound the log-likelihood have also been proposed, such as score matching [18] and noise-contrastive estimation (NCE) [13]. Both of these require the learned probability density to be analytically specified up to a normalization constant. Note that in many interesting generative models with several layers of latent variables (such as DBNs and DBMs), it is not even possible to derive a tractable unnormalized probability density.
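To make the cost of the partition function concrete, the following is a minimal sketch that computes it by brute force for a tiny binary RBM. The energy function used here is the standard binary-RBM form, which is an assumption (the text does not write it out), and the names `rbm_energy` and `partition_function` and the layer sizes are hypothetical choices for illustration.

```python
import itertools
import numpy as np

def rbm_energy(v, h, W, b, c):
    """Energy of one binary RBM state (assumed form): E(v, h) = -v^T W h - b^T v - c^T h."""
    return -(v @ W @ h) - b @ v - c @ h

def partition_function(W, b, c):
    """Brute-force Z = sum over every (v, h) of exp(-E(v, h)).

    The double loop visits 2^(n_v + n_h) states, which is why this is only
    feasible for toy models.
    """
    n_v, n_h = W.shape
    total = 0.0
    for v_bits in itertools.product([0.0, 1.0], repeat=n_v):
        for h_bits in itertools.product([0.0, 1.0], repeat=n_h):
            v, h = np.array(v_bits), np.array(h_bits)
            total += np.exp(-rbm_energy(v, h, W, b, c))
    return total

rng = np.random.default_rng(0)
n_v, n_h = 4, 3  # tiny illustrative sizes
W = rng.normal(0.0, 0.5, size=(n_v, n_h))
b = np.zeros(n_v)
c = np.zeros(n_h)
print("states enumerated:", 2 ** (n_v + n_h))
print("Z =", partition_function(W, b, c))
```

Each additional binary unit doubles the number of terms in the sum, so exact evaluation is out of reach at realistic layer sizes; this is the regime in which MCMC estimates are used instead.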
Some models such as denoising auto-encoders [30] and contractive autoencoders have learning rules very similar to score matching applied to RBMs. In NCE, as in this work, a discriminative training criterion is employed to fit a generative model. However, rather than fitting a separate discriminative model, the generative model itself is used to discriminate generated data from samples of a fixed noise distribution. Because NCE uses a fixed noise distribution, learning slows dramatically after the model has learned even an approximately correct distribution over a small subset of the observed variables.

Finally, some techniques do not involve defining a probability distribution explicitly, but rather train a generative machine to draw samples from the desired distribution. This approach has the advantage that such machines can be designed to be trained by back-propagation. Prominent recent work in this area includes the generative stochastic network (GSN) framework [5], which extends generalized denoising auto-encoders [4]: both can be seen as defining a parameterized Markov chain, i.e., one learns the parameters of a machine that performs one step of a generative Markov chain.
Compared to GSNs, the adversarial nets framework does not require a Markov chain for sampling. Because adversarial nets do not require feedback loops during generation, they are better able to leverage piecewise linear units [19, 9, 10], which improve the performance of backpropagation but have problems with unbounded activation when used in a feedback loop. More recent examples of training a generative machine by back-propagating into it include recent work on auto-encoding variational Bayes [20] and stochastic backpropagation [24].

3 Adversarial nets

The adversarial modeling framework is most straightforward to apply when the models are both multilayer perceptrons. To learn the generator's distribution $p_g$ over data $x$, we define a prior on input noise variables $p_z(z)$, then represent a mapping to data space as $G(z; \theta_g)$, where $G$ is a differentiable function represented by a multilayer perceptron with parameters $\theta_g$.
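As a minimal sketch of the mapping $G(z; \theta_g)$ described above, the NumPy code below draws noise from a prior and pushes it through a one-hidden-layer perceptron. The layer sizes, the uniform prior on $[-1, 1]$, the ReLU hidden units, and the names `init_generator`, `generator` and `sample_generator` are all illustrative assumptions rather than the configuration used in the paper.

```python
import numpy as np

def init_generator(rng, noise_dim=4, hidden_dim=16, data_dim=2):
    """Randomly initialise theta_g for a one-hidden-layer perceptron (sizes are illustrative)."""
    return {
        "W1": rng.normal(0.0, 0.1, size=(noise_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.1, size=(hidden_dim, data_dim)),
        "b2": np.zeros(data_dim),
    }

def generator(z, theta_g):
    """G(z; theta_g): map noise z to data space with a single forward pass."""
    # Piecewise linear (ReLU) hidden layer, then a linear output layer.
    h = np.maximum(0.0, z @ theta_g["W1"] + theta_g["b1"])
    return h @ theta_g["W2"] + theta_g["b2"]

def sample_generator(rng, theta_g, n, noise_dim=4):
    """Draw n samples from p_g: sample z ~ p_z (uniform here, an assumption), then apply G."""
    z = rng.uniform(-1.0, 1.0, size=(n, noise_dim))
    return generator(z, theta_g)

rng = np.random.default_rng(0)
theta_g = init_generator(rng)
x_fake = sample_generator(rng, theta_g, n=5)
print(x_fake.shape)  # (5, 2)
```

Sampling needs only forward propagation: no Markov chain or inference step is involved, and the distribution $p_g$ is defined implicitly by the prior and the weights.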
We also define a second multilayer perceptron $D(x; \theta_d)$ that outputs a single scalar. $D(x)$ represents the probability that $x$ came from the data rather than $p_g$. We train $D$ to maximize the probability of assigning the correct label to both training examples and samples from $G$. We simultaneously train $G$ to minimize $\log(1 - D(G(z)))$. In other words, $D$ and $G$ play the following two-player minimax game with value function $V(D, G)$:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \qquad (1)$$

In the next section, we present a theoretical analysis of adversarial nets, essentially showing that the training criterion allows one to recover the data generating distribution as $G$ and $D$ are given enough capacity, i.e., in the non-parametric limit. See Figure 1 for a less formal, more pedagogical explanation of the approach. In practice, we must implement the game using an iterative, numerical approach. Optimizing $D$ to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting.
Instead, we alternate between $k$ steps of optimizing $D$ and one step of optimizing $G$. This results in $D$ being maintained near its optimal solution, so long as $G$ changes slowly enough. This strategy is analogous to the way that SML/PCD [31, 29] training maintains samples from a Markov chain from one learning step to the next in order to avoid burning in a Markov chain as part of the inner loop of learning. The procedure is formally presented in Algorithm 1.

In practice, equation 1 may not provide sufficient gradient for $G$ to learn well. Early in learning, when $G$ is poor, $D$ can reject samples with high confidence because they are clearly different from the training data. In this case, $\log(1 - D(G(z)))$ saturates. Rather than training $G$ to minimize $\log(1 - D(G(z)))$ we can train $G$ to maximize $\log D(G(z))$. This objective function results in the same fixed point of the dynamics of $G$ and $D$ but provides much stronger gradients early in learning.
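The saturation argument can be checked numerically. The sketch below assumes that $D$ ends in a logistic sigmoid applied to a scalar logit $a$, so that $D(G(z)) = \sigma(a)$; this output layer and the function names are assumptions made for illustration. Under that assumption, the derivative of $\log(1 - \sigma(a))$ with respect to $a$ is $-\sigma(a)$, while the derivative of $\log \sigma(a)$ is $1 - \sigma(a)$.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, assumed here to be the output nonlinearity of D."""
    return 1.0 / (1.0 + np.exp(-a))

def grad_saturating(a):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a); G minimizes this objective."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da log(sigmoid(a)) = 1 - sigmoid(a); G maximizes this objective."""
    return 1.0 - sigmoid(a)

# Very negative logits correspond to D confidently rejecting generated samples.
logits = np.array([-6.0, -2.0, 0.0, 2.0])
for a in logits:
    d = sigmoid(a)
    print(f"D(G(z))={d:.4f}  |grad log(1-D)|={abs(grad_saturating(a)):.4f}  "
          f"|grad log D|={abs(grad_non_saturating(a)):.4f}")
```

At $a = -6$, where $D(G(z)) \approx 0.0025$, the gradient of $\log(1 - D(G(z)))$ with respect to the logit has magnitude of about 0.0025, whereas that of $\log D(G(z))$ is about 0.9975. The signal reaching $G$ through $D$ is therefore far larger under the alternative objective exactly when $G$ is poor.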
Figure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution ($D$, blue, dashed line) so that it discriminates samples from the data generating distribution $p_x$ (black, dotted line) from those of the generative distribution $p_g$ ($G$) (green, solid line). The lower horizontal line is the domain from which $z$ is sampled, in this case uniformly. The horizontal line above is part of the domain of $x$. The upward arrows show how the mapping $x = G(z)$ imposes the non-uniform distribution $p_g$ on transformed samples. $G$ contracts in regions of high density and expands in regions of low density of $p_g$. (a) Consider an adversarial pair near convergence: $p_g$ is similar to $p_{\text{data}}$ and $D$ is a partially accurate classifier. (b) In the inner loop of the algorithm $D$ is trained to discriminate samples from data, converging to $D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$. (c) After an update to $G$, the gradient of $D$ has guided $G(z)$ to flow to regions that are more likely to be classified as data.