
Noise-contrastive estimation: A new estimation principle for unnormalized statistical models

Michael Gutmann
Dept of Computer Science and HIIT, University of Helsinki

Aapo Hyvärinen
Dept of Mathematics & Statistics, Dept of Computer Science and HIIT, University of Helsinki


Abstract

We present a new estimation principle for parameterized statistical models. The idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise, using the model log-density function in the regression nonlinearity. We show that this leads to a consistent (convergent) estimator of the parameters, and analyze the asymptotic variance. In particular, the method is shown to directly work for unnormalized models, i.e. models where the density function does not integrate to one.

The normalization constant can be estimated just like any other parameter. For a tractable ICA model, we compare the method with other estimation methods that can be used to learn unnormalized models, including score matching, contrastive divergence, and maximum-likelihood where the normalization constant is estimated with importance sampling. Simulations show that noise-contrastive estimation offers the best trade-off between computational and statistical efficiency. The method is then applied to the modeling of natural images: We show that the method can successfully estimate a large-scale two-layer model and a Markov random field.

1 Introduction

Estimation of unnormalized parameterized statistical models is a computationally difficult problem.

Here, we propose a new principle for estimating such models. The method provides, at the same time, an interesting theoretical connection between unsupervised learning and supervised learning.

[Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.]

The basic estimation problem is formulated as follows. A sample of a random vector $x \in \mathbb{R}^n$ is observed which follows an unknown probability density function (pdf) $p_d(\cdot)$. The data pdf $p_d(\cdot)$ is modeled by a parameterized family of functions $\{p_m(\cdot;\alpha)\}_\alpha$, where $\alpha$ is a vector of parameters. We assume that $p_d(\cdot)$ belongs to this family. In other words, $p_d(\cdot) = p_m(\cdot;\alpha^\star)$ for some parameter $\alpha^\star$. The problem we consider here is how to estimate $\alpha$ from the observed sample by maximizing some objective function.

Any solution to this estimation problem must yield a properly normalized density $p_m(\cdot;\alpha)$ with

$$\int p_m(u;\alpha)\,du = 1. \qquad (1)$$

This defines essentially a constraint in the optimization problem. In principle, the constraint can always be fulfilled by redefining the pdf as

$$p_m(\cdot;\alpha) = \frac{p_m^0(\cdot;\alpha)}{Z(\alpha)}, \qquad Z(\alpha) = \int p_m^0(u;\alpha)\,du, \qquad (2)$$

where $p_m^0(\cdot;\alpha)$ specifies the functional form of the pdf and does not need to integrate to one. The calculation of the normalization constant (partition function) $Z(\alpha)$ is, however, very problematic: The integral is rarely analytically tractable, and if the data is high-dimensional, numerical integration is difficult.
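As a concrete illustration of Eq. (2), the following minimal NumPy sketch computes $Z(\alpha)$ by brute-force numerical integration for a toy one-dimensional unnormalized model $p_m^0(u;\alpha) = \exp(-\alpha u^2)$; the toy model is our own example, not from the paper, chosen because $Z(\alpha) = \sqrt{\pi/\alpha}$ is known in closed form and the result can be checked. A grid like this needs a number of points exponential in the dimension $n$, which is exactly why $Z(\alpha)$ becomes intractable for high-dimensional data.

```python
import numpy as np

# Toy unnormalized model (illustrative assumption, not from the paper):
# p0(u; alpha) = exp(-alpha * u^2), with exact Z(alpha) = sqrt(pi / alpha).
def p0(u, alpha):
    return np.exp(-alpha * u**2)

alpha = 2.0
u = np.linspace(-10.0, 10.0, 200_001)   # dense 1-D grid; infeasible in high dimensions
du = u[1] - u[0]

Z_numeric = np.sum(p0(u, alpha)) * du   # Riemann-sum approximation of the integral in Eq. (2)
Z_exact = np.sqrt(np.pi / alpha)

print(Z_numeric, Z_exact)               # both ~ 1.2533
```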

Examples of statistical models where the normalization constraint poses a problem can be found in Markov random fields (Roth & Black, 2009; Köster et al., 2009), energy-based models (Hinton, 2002; Teh et al., 2004), and multilayer networks (Osindero et al., 2006; Köster & Hyvärinen, 2007).¹

[¹ Often, this constraint is imposed on $p_m(\cdot;\alpha)$ for all $\alpha$, but we will see in this paper that it is actually enough to impose it on the solution obtained.]

A conceptually simple way to deal with the normalization constraint would be to consider the normalization constant $Z(\alpha)$ as an additional parameter of the model. This approach is, however, not possible for Maximum Likelihood estimation (MLE).

The reason is that the likelihood can be made arbitrarily large by making $Z(\alpha)$ go to zero. Therefore, methods have been proposed which estimate the model directly using $p_m^0(\cdot;\alpha)$ without computation of the integral which defines the normalization constant; the most recent ones are contrastive divergence (Hinton, 2002) and score matching (Hyvärinen, 2005).
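A one-line way to see this failure mode (our own spelling-out, using the $c = -\ln Z$ parameterization introduced in Section 2.1 below): with $c$ treated as a free parameter, the average log-likelihood is unbounded above,

```latex
\ell_T(\alpha, c)
  = \frac{1}{T}\sum_{t=1}^{T} \bigl[\ln p_m^0(x_t;\alpha) + c\bigr]
  = \underbrace{\frac{1}{T}\sum_{t=1}^{T} \ln p_m^0(x_t;\alpha)}_{\text{does not depend on } c} \; + \; c
  \;\longrightarrow\; \infty \quad \text{as } c \to \infty ,
```

so the maximizer pushes $c \to \infty$, i.e. $Z(\alpha) \to 0$, regardless of the data. The noise term in the objective of Eq. (3) below is what removes this degeneracy: overestimating the density also inflates $h(y_t;\theta)$ and is penalized.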

Here, we present a new estimation principle for unnormalized models which shows advantages over contrastive divergence or score matching. Both the parameter $\alpha$ in the unnormalized pdf $p_m^0(\cdot;\alpha)$ and the normalization constant can be estimated by maximization of the same objective function. The basic idea is to estimate the parameters by learning to discriminate between the data $x$ and some artificially generated noise $y$. The estimation principle thus relies on noise with which the data is contrasted, so that we will refer to the new method as Noise-contrastive estimation.

In Section 2, we formally define Noise-contrastive estimation, establish fundamental statistical properties, and make the connection to supervised learning explicit. In Section 3, we first illustrate the theory with the estimation of an ICA model, and compare the performance to other estimation methods. Then, we apply Noise-contrastive estimation to the learning of a two-layer model and a Markov random field model of natural images. Section 4 concludes the paper.

2 Noise-contrastive estimation

2.1 Definition of the estimator

For a statistical model which is specified through an unnormalized pdf $p_m^0(\cdot;\alpha)$, we include the normalization constant as another parameter $c$ of the model. That is, we define $\ln p_m(\cdot;\theta) = \ln p_m^0(\cdot;\alpha) + c$, where $\theta = \{\alpha, c\}$. Parameter $c$ is an estimate of the negative logarithm of the normalization constant $Z(\alpha)$. Note that $p_m(\cdot;\theta)$ will only integrate to one for some specific choice of the parameter $c$.

Denote by $X = (x_1, \ldots, x_T)$ the observed data set, consisting of $T$ observations of the data $x$, and by $Y = (y_1, \ldots, y_T)$ an artificially generated data set of noise $y$ with distribution $p_n(\cdot)$. The estimator $\hat{\theta}_T$ is defined to be the $\theta$ which maximizes the objective function

$$J_T(\theta) = \frac{1}{2T} \sum_t \Bigl\{ \ln[h(x_t;\theta)] + \ln[1 - h(y_t;\theta)] \Bigr\}, \qquad (3)$$

where

$$h(u;\theta) = \frac{1}{1 + \exp[-G(u;\theta)]}, \qquad (4)$$

$$G(u;\theta) = \ln p_m(u;\theta) - \ln p_n(u). \qquad (5)$$

Below, we will denote the logistic function by $r(\cdot)$, so that $h(u;\theta) = r(G(u;\theta))$.
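As a minimal sketch of Eqs. (3)–(5) in code, the following NumPy/SciPy snippet estimates a toy one-dimensional Gaussian model in its unnormalized form $\ln p_m^0(u;\alpha) = -\alpha u^2$. The model, the Gaussian noise distribution, and the sample sizes are our own illustrative assumptions, not from the paper; the point is that maximizing $J_T$ recovers both $\alpha$ and $c \approx -\ln Z(\hat\alpha) = -\tfrac{1}{2}\ln(\pi/\hat\alpha)$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Illustrative setup (not from the paper): data from N(0, 1/(2*alpha_star)),
# i.e. the true unnormalized log-density is -alpha_star * u^2 with alpha_star = 1.
T = 20_000
alpha_star = 1.0
x = rng.normal(0.0, 1.0 / np.sqrt(2.0 * alpha_star), T)  # observed data X
noise_std = 2.0
y = rng.normal(0.0, noise_std, T)                        # contrastive noise Y ~ p_n

def log_pn(u):
    # Log-density of the Gaussian noise distribution p_n = N(0, noise_std^2)
    return -0.5 * (u / noise_std) ** 2 - np.log(noise_std * np.sqrt(2.0 * np.pi))

def J(theta):
    # Eq. (3), with ln p_m(u; theta) = -alpha * u^2 + c and theta = (alpha, c)
    alpha, c = theta
    G_x = (-alpha * x**2 + c) - log_pn(x)   # G(x_t; theta), Eq. (5)
    G_y = (-alpha * y**2 + c) - log_pn(y)   # G(y_t; theta)
    # ln h = -ln(1 + e^{-G}) and ln(1 - h) = -ln(1 + e^{G}); logaddexp keeps both stable
    return 0.5 * (np.mean(-np.logaddexp(0.0, -G_x))
                  + np.mean(-np.logaddexp(0.0, G_y)))

res = minimize(lambda th: -J(th), x0=np.array([0.5, 0.0]))  # maximize J_T
alpha_hat, c_hat = res.x

print("alpha:", alpha_hat)                                  # close to alpha_star = 1.0
print("c:", c_hat)                                          # close to -ln Z(alpha_hat)
print("-ln Z(alpha_hat):", -0.5 * np.log(np.pi / alpha_hat))
```

Note the contrast with MLE discussed in the Introduction: here $c$ cannot drift to infinity, because pushing the model density up also drives $h(y_t;\theta) \to 1$ and the noise term $\ln[1 - h(y_t;\theta)]$ to $-\infty$.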

2.2 Connection to supervised learning

The objective function in Eq. (3) occurs also in supervised learning. It is the log-likelihood in a logistic regression model which discriminates the observed data $X$ from the noise $Y$. This connection to supervised learning, namely logistic regression and classification, provides us with intuition of how the proposed estimator works: By discriminating, or comparing, between data and noise, we are able to learn properties of the data in the form of a statistical model. In less mathematical terms, the idea behind noise-contrastive estimation is "learning by comparison".

To make the connection explicit, we show now how the objective function in Eq. (3) is obtained in the setting of supervised learning.

Denote by $U = (u_1, \ldots, u_{2T})$ the union of the two sets $X$ and $Y$, and assign to each data point $u_t$ a binary class label $C_t$: $C_t = 1$ if $u_t \in X$ and $C_t = 0$ if $u_t \in Y$. In logistic regression, the posterior probabilities of the classes given the data $u_t$ are estimated. As the pdf $p_d(\cdot)$ of the data $x$ is unknown, the class-conditional probability $p(\cdot \mid C=1)$ is modeled with $p_m(\cdot;\theta)$.² The class-conditional probability densities are thus

$$p(u \mid C=1;\theta) = p_m(u;\theta), \qquad p(u \mid C=0) = p_n(u). \qquad (6)$$

Since we have equal prior probabilities for the two class labels, $P(C=1) = P(C=0) = 1/2$, we obtain the following posterior probabilities

$$P(C=1 \mid u;\theta) = \frac{p_m(u;\theta)}{p_m(u;\theta) + p_n(u)} \qquad (7)$$

$$= h(u;\theta) \qquad (8)$$

$$P(C=0 \mid u;\theta) = 1 - h(u;\theta). \qquad (9)$$
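For readers who want the intermediate step, Eqs. (7)–(8) follow from Bayes' rule applied to the class-conditionals in Eq. (6), with the equal priors cancelling; a short derivation (ours, spelling out what the paper states directly):

```latex
\begin{align*}
P(C=1 \mid u;\theta)
  &= \frac{p(u \mid C=1;\theta)\,P(C=1)}
          {p(u \mid C=1;\theta)\,P(C=1) + p(u \mid C=0)\,P(C=0)} \\
  &= \frac{\tfrac{1}{2}\,p_m(u;\theta)}
          {\tfrac{1}{2}\,p_m(u;\theta) + \tfrac{1}{2}\,p_n(u)}
   = \frac{p_m(u;\theta)}{p_m(u;\theta) + p_n(u)} \\
  &= \frac{1}{1 + p_n(u)/p_m(u;\theta)}
   = \frac{1}{1 + \exp[-G(u;\theta)]} = h(u;\theta),
\end{align*}
```

where the last line divides numerator and denominator by $p_m(u;\theta)$ and uses $G(u;\theta) = \ln p_m(u;\theta) - \ln p_n(u)$ from Eq. (5).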

