
Noise-contrastive estimation: A new estimation principle ...

Transcription of Noise-contrastive estimation: A new estimation principle ...

Noise-contrastive estimation: A new estimation principle for unnormalized statistical models

Michael Gutmann, Dept of Computer Science and HIIT, University of Helsinki
Aapo Hyvärinen, Dept of Mathematics & Statistics, Dept of Computer Science and HIIT, University of Helsinki

Abstract. We present a new estimation principle for parameterized statistical models. The idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise, using the model log-density function in the regression nonlinearity. We show that this leads to a consistent (convergent) estimator of the parameters, and analyze the asymptotic variance. In particular, the method is shown to directly work for unnormalized models, i.e. models where the density function does not integrate to one.

The normalization constant can be estimated just like any other parameter. For a tractable ICA model, we compare the method with other estimation methods that can be used to learn unnormalized models, including score matching, contrastive divergence, and maximum-likelihood where the normalization constant is estimated with importance sampling. Simulations show that noise-contrastive estimation offers the best trade-off between computational and statistical efficiency. The method is then applied to the modeling of natural images: We show that the method can successfully estimate a large-scale two-layer model and a Markov random field.

1 Introduction

Estimation of unnormalized parameterized statistical models is a computationally difficult problem. Here, we propose a new principle for estimating such models.

(Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.)

The method provides, at the same time, an interesting theoretical connection between unsupervised learning and supervised learning.

The basic estimation problem is formulated as follows: a sample of a random vector x ∈ ℝⁿ is observed which follows an unknown probability density function (pdf) p_d(·). The data pdf p_d(·) is modeled by a parameterized family of functions {p_m(·;α)}, where α is a vector of parameters. We assume that p_d(·) belongs to this family; in other words, p_d(·) = p_m(·;α*) for some parameter α*. The problem we consider here is how to estimate α from the observed sample by maximizing some objective function.

Any solution to this estimation problem must yield a properly normalized density p_m(·;α) with

  ∫ p_m(u;α) du = 1.   (1)

This defines essentially a constraint in the optimization problem. In principle, the constraint can always be fulfilled by redefining the pdf as

  p_m(·;α) = p⁰_m(·;α) / Z(α),   Z(α) = ∫ p⁰_m(u;α) du,   (2)

where p⁰_m(·;α) specifies the functional form of the pdf and does not need to integrate to one. The calculation of the normalization constant (partition function) Z(α) is, however, very problematic: the integral is rarely analytically tractable, and if the data is high-dimensional, numerical integration is difficult. Examples of statistical models where the normalization constraint poses a problem can be found in Markov random fields (Roth & Black, 2009; Köster et al., 2009), energy-based models (Hinton, 2002; Teh et al., 2004), and multilayer networks (Osindero et al., 2006; Köster & Hyvärinen, 2007).¹

¹ Often, this constraint is imposed on p_m(·;α) for all α, but we will see in this paper that it is actually enough to impose it on the solution obtained.
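As a concrete illustration of the bottleneck in Eq. (2), the following sketch (a toy example; the quartic model p⁰_m(u;α) = exp(−αu⁴) and all function names are assumptions, not from the paper) approximates Z(α) by one-dimensional numerical integration, which is exactly the step that stops being feasible in high dimensions:

```python
import numpy as np

# Toy 1-D unnormalized model (hypothetical, not from the paper):
#   p0_m(u; alpha) = exp(-alpha * u**4)
# Its partition function Z(alpha) from Eq. (2) is approximated here by a
# simple Riemann sum over a wide grid.
def unnormalized_pdf(u, alpha):
    return np.exp(-alpha * u ** 4)

def partition_function(alpha, lim=10.0, num=200_001):
    u, du = np.linspace(-lim, lim, num, retstep=True)
    return unnormalized_pdf(u, alpha).sum() * du   # numerical Z(alpha)

alpha = 0.5
Z = partition_function(alpha)
# The normalized model is p_m(u; alpha) = p0_m(u; alpha) / Z.  In one dimension
# this is cheap; for high-dimensional u the same integral is the intractable
# partition function that motivates estimation without normalization.
print(f"Z({alpha:.1f}) is approximately {Z:.4f}")
```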

A conceptually simple way to deal with the normalization constraint would be to consider the normalization constant Z(α) as an additional parameter of the model. This approach is, however, not possible for maximum likelihood estimation (MLE). The reason is that the likelihood can be made arbitrarily large by making Z(α) go to zero. Therefore, methods have been proposed which estimate the model directly using p⁰_m(·;α) without computation of the integral which defines the normalization constant; the most recent ones are contrastive divergence (Hinton, 2002) and score matching (Hyvärinen, 2005).

Here, we present a new estimation principle for unnormalized models which shows advantages over contrastive divergence or score matching. Both the parameter α in the unnormalized pdf p⁰_m(·;α) and the normalization constant can be estimated by maximization of the same objective function. The basic idea is to estimate the parameters by learning to discriminate between the data x and some artificially generated noise y. The estimation principle thus relies on noise with which the data is contrasted, so that we will refer to the new method as "noise-contrastive estimation".

In Section 2, we formally define noise-contrastive estimation, establish fundamental statistical properties, and make the connection to supervised learning explicit. In Section 3, we first illustrate the theory with the estimation of an ICA model, and compare the performance to other estimation methods. Then, we apply noise-contrastive estimation to the learning of a two-layer model and a Markov random field model of natural images.

Section 4 concludes the paper.

2 Noise-contrastive estimation

2.1 Definition of the estimator

For a statistical model which is specified through an unnormalized pdf p⁰_m(·;α), we include the normalization constant as another parameter c of the model. That is, we define ln p_m(·;θ) = ln p⁰_m(·;α) + c, where θ = {α, c}. Parameter c is an estimate of the negative logarithm of the normalization constant Z(α). Note that p_m(·;θ) will only integrate to one for some specific choice of the parameter c.

Denote by X = (x_1, ..., x_T) the observed data set, consisting of T observations of the data x, and by Y = (y_1, ..., y_T) an artificially generated data set of noise y with distribution p_n(·). The estimator θ̂_T is defined to be the θ which maximizes the objective function

  J_T(θ) = 1/(2T) Σ_t { ln[h(x_t;θ)] + ln[1 − h(y_t;θ)] },   (3)

where

  h(u;θ) = 1 / (1 + exp[−G(u;θ)]),   (4)
  G(u;θ) = ln p_m(u;θ) − ln p_n(u).   (5)
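As a minimal sketch of Eqs. (3)-(5), assuming generic callables for ln p_m(·;θ) and ln p_n(·) (the helper names are illustrative, not from the paper), the objective J_T(θ) can be evaluated as follows:

```python
import numpy as np

def log_logistic(z):
    # Numerically stable ln r(z) for the logistic function r(z) = 1 / (1 + exp(-z)).
    return -np.logaddexp(0.0, -z)

def nce_objective(theta, x, y, log_pm, log_pn):
    """J_T(theta) of Eq. (3) for data x and noise y (arrays of equal length T).

    log_pm(u, theta) returns ln p_m(u; theta), with the parameter c included,
    and log_pn(u) returns the known noise log-density ln p_n(u).
    """
    G_x = log_pm(x, theta) - log_pn(x)   # G(x_t; theta), Eq. (5)
    G_y = log_pm(y, theta) - log_pn(y)   # G(y_t; theta)
    # ln h(x_t; theta) = ln r(G(x_t; theta));  ln[1 - h(y_t; theta)] = ln r(-G(y_t; theta))
    T = x.shape[0]
    return (log_logistic(G_x).sum() + log_logistic(-G_y).sum()) / (2.0 * T)
```

Maximizing this function over θ, which includes the normalization parameter c, yields the estimator θ̂_T.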

Below, we will denote the logistic function by r(·), so that h(u;θ) = r(G(u;θ)).

2.2 Connection to supervised learning

The objective function in Eq. (3) occurs also in supervised learning. It is the log-likelihood in a logistic regression model which discriminates the observed data X from the noise Y. This connection to supervised learning, namely logistic regression and classification, provides us with intuition of how the proposed estimator works: by discriminating, or comparing, between data and noise, we are able to learn properties of the data in the form of a statistical model. In less mathematical terms, the idea behind noise-contrastive estimation is "learning by comparison". To make the connection explicit, we show now how the objective function in Eq. (3) is obtained in the setting of supervised learning.

Denote by U = (u_1, ..., u_{2T}) the union of the two sets X and Y, and assign to each data point u_t a binary class label C_t: C_t = 1 if u_t ∈ X and C_t = 0 if u_t ∈ Y. In logistic regression, the posterior probabilities of the classes given the data u_t are estimated. As the pdf p_d(·) of the data x is unknown, the class-conditional probability p(·|C = 1) is modeled with p_m(·;θ).² The class-conditional probability densities are thus

  p(u|C = 1; θ) = p_m(u;θ),   p(u|C = 0) = p_n(u).   (6)

Since the two class labels have equal prior probabilities, P(C = 1) = P(C = 0) = 1/2, we obtain the following posterior probabilities:

  P(C = 1|u;θ) = p_m(u;θ) / (p_m(u;θ) + p_n(u))   (7)
              = h(u;θ),   (8)
  P(C = 0|u;θ) = 1 − h(u;θ).   (9)

The class labels C_t are Bernoulli-distributed, so that the log-likelihood of the parameters θ becomes

  ℓ(θ) = Σ_t C_t ln P(C_t = 1|u_t;θ) + (1 − C_t) ln P(C_t = 0|u_t;θ)   (10)
       = Σ_t ln[h(x_t;θ)] + ln[1 − h(y_t;θ)],   (11)

which is, up to the factor 1/(2T), the same as our objective function in Eq. (3).

² Classically, p_m(·;θ) would in the context of this section be a normalized pdf. In our paper, however, the normalization constant may also be part of the parameters.
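The equivalence in Eqs. (10)-(11) is easy to verify numerically; the sketch below (the Gaussian model, the noise choice, and the parameter value are arbitrary illustrations, not from the paper) checks that the labeled Bernoulli log-likelihood equals 2T · J_T(θ) up to floating-point error:

```python
import numpy as np

# Numerical check of Eqs. (10)-(11): with labels C_t = 1 for data and C_t = 0
# for noise, the Bernoulli log-likelihood equals 2T * J_T(theta).
rng = np.random.default_rng(0)
T = 1_000
x = rng.normal(1.0, 1.0, size=T)        # "observed" data
y = rng.normal(0.0, 2.0, size=T)        # noise with known density p_n = N(0, 2^2)

def log_pn(u):
    return -0.5 * (u / 2.0) ** 2 - np.log(2.0) - 0.5 * np.log(2.0 * np.pi)

def log_pm(u, theta):
    mu, log_prec, c = theta             # c plays the role of -ln Z(alpha)
    return -0.5 * np.exp(log_prec) * (u - mu) ** 2 + c

theta = np.array([0.5, 0.0, -1.0])      # an arbitrary parameter value
u = np.concatenate([x, y])
C = np.concatenate([np.ones(T), np.zeros(T)])
G = log_pm(u, theta) - log_pn(u)        # G(u_t; theta), Eq. (5)
h = 1.0 / (1.0 + np.exp(-G))            # posterior P(C = 1 | u_t; theta), Eqs. (7)-(8)

loglik = np.sum(C * np.log(h) + (1.0 - C) * np.log(1.0 - h))              # Eq. (10)
J_T = (np.sum(np.log(h[:T])) + np.sum(np.log(1.0 - h[T:]))) / (2.0 * T)   # Eq. (3)
assert np.allclose(loglik, 2.0 * T * J_T)
```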

2.3 Properties of the estimator

We characterize here the behavior of the estimator θ̂_T when the sample size T becomes arbitrarily large. The weak law of large numbers shows that in that case, the objective function J_T(θ) converges in probability to J,

  J(θ) = 1/2 E{ ln[h(x;θ)] + ln[1 − h(y;θ)] }.   (12)

Let us denote by J̃ the objective J seen as a function of f(·) = ln p_m(·;θ),

  J̃(f) = 1/2 E{ ln[r(f(x) − ln p_n(x))] + ln[1 − r(f(y) − ln p_n(y))] }.   (13)

We start the characterization of the estimator θ̂_T with a description of the optimization landscape of J̃. The following theorem³ shows that the data pdf p_d(·) can be found by maximization of J̃, i.e. by learning a classifier under the ideal situation of an infinite amount of data.

Theorem 1 (Nonparametric estimation).
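To illustrate this large-sample behavior, the following end-to-end sketch (the Gaussian model, the broad Gaussian noise, and the use of scipy's Nelder-Mead optimizer are assumptions made for the example, not part of the paper) maximizes J_T(θ) on synthetic data; as T grows, the estimates approach the true mean and precision, and c approaches −ln Z(α):

```python
import numpy as np
from scipy.optimize import minimize

# Toy consistency check (illustrative assumptions, not from the paper):
# unnormalized Gaussian model  ln p0_m(u; mu, log_prec) = -0.5 * exp(log_prec) * (u - mu)**2,
# with c estimating -ln Z(alpha); the noise is a broad zero-mean Gaussian.
rng = np.random.default_rng(1)
T = 50_000
mu_true, prec_true = 1.0, 1.0
x = rng.normal(mu_true, 1.0 / np.sqrt(prec_true), size=T)   # observed data
y = rng.normal(0.0, 3.0, size=T)                            # contrastive noise

def log_pn(u):
    # log-density of the N(0, 3^2) noise distribution
    return -0.5 * (u / 3.0) ** 2 - np.log(3.0) - 0.5 * np.log(2.0 * np.pi)

def neg_J_T(theta):
    mu, log_prec, c = theta
    log_pm = lambda u: -0.5 * np.exp(log_prec) * (u - mu) ** 2 + c
    G_x, G_y = log_pm(x) - log_pn(x), log_pm(y) - log_pn(y)
    ln_h_x = -np.logaddexp(0.0, -G_x)      # ln h(x_t; theta)
    ln_1mh_y = -np.logaddexp(0.0, G_y)     # ln[1 - h(y_t; theta)]
    return -(ln_h_x.sum() + ln_1mh_y.sum()) / (2.0 * T)

res = minimize(neg_J_T, x0=np.zeros(3), method="Nelder-Mead")
mu_hat, log_prec_hat, c_hat = res.x
# For this Gaussian, -ln Z = 0.5 * log_prec - 0.5 * ln(2 * pi); c_hat should be close to it.
print(mu_hat, np.exp(log_prec_hat), c_hat, 0.5 * log_prec_hat - 0.5 * np.log(2.0 * np.pi))
```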

