Chapter 9 The exponential family: Conjugate priors


Within the Bayesian framework the parameter $\theta$ is treated as a random quantity. This requires us to specify a prior distribution $p(\theta)$, from which we can obtain the posterior distribution $p(\theta \mid x)$ via Bayes theorem:

    p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)},    (9.1)

where $p(x \mid \theta)$ is the likelihood. Most inferential conclusions obtained within the Bayesian framework are based in one way or another on averages computed under the posterior distribution, and thus for the Bayesian framework to be useful it is essential to be able to compute these integrals with some effective procedure. In particular, prediction of future data $x_{\mathrm{new}}$ is based on the predictive probability:

    p(x_{\mathrm{new}} \mid x) = \int p(x_{\mathrm{new}} \mid \theta)\, p(\theta \mid x)\, d\theta,    (9.2)

which is an integral with respect to the posterior. (We have assumed that $X_{\mathrm{new}} \perp X \mid \theta$.)
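When no closed form is available, the posterior and predictive integrals above can be approximated numerically. The following is a minimal sketch (the function names and the uniform-prior choice are ours, not the chapter's) that discretizes $\theta$ on a grid and evaluates both integrals as Riemann sums for a Bernoulli likelihood:

```python
# Illustrative sketch (not from the chapter): approximate the posterior
# p(theta | x) and the predictive probability p(x_new = 1 | x) on a grid,
# for a Bernoulli likelihood with a uniform prior p(theta) = 1.
def grid_posterior(xs, n_grid=10_000):
    thetas = [(i + 0.5) / n_grid for i in range(n_grid)]  # midpoints of (0, 1)
    ones = sum(xs)
    # Likelihood p(x | theta) = theta^ones * (1 - theta)^(N - ones)
    unnorm = [t**ones * (1 - t) ** (len(xs) - ones) for t in thetas]
    z = sum(unnorm) / n_grid  # Riemann sum for p(x) = integral of p(x|theta) p(theta)
    return thetas, [u / z for u in unnorm]

def predictive_prob(xs):
    # p(x_new = 1 | x) = integral of theta * p(theta | x), again as a Riemann sum
    thetas, post = grid_posterior(xs)
    return sum(t * p for t, p in zip(thetas, post)) / len(thetas)

print(round(predictive_prob([1, 1, 0, 1]), 4))  # → 0.6667, i.e. (3 + 1) / (4 + 2)
```

With a uniform prior this reproduces Laplace's rule of succession, (number of ones + 1) / (N + 2), which provides a useful sanity check on the quadrature.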

Note also that forming the posterior distribution itself involves computing an integral: to normalize the posterior we must compute

    p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta,    (9.3)

which is an integral with respect to the prior. In this section we introduce the idea of a conjugate prior. The basic idea is as follows. Given a likelihood $p(x \mid \theta)$, we choose a family of prior distributions such that integrals of the form Eq. (9.3) can be obtained tractably (for every prior in the family). Moreover, we choose this family such that prior-to-posterior updating yields a posterior that is also in the family. This means that integrals of the form Eq. (9.2) can also be obtained tractably for every posterior distribution in the family. In general these two goals are in conflict.

For example, the goal of invariance of prior-to-posterior updating (i.e., asking that the posterior remain in the same family of distributions as the prior) can be achieved vacuously by defining the family to be the family of all probability distributions, but this would not yield tractable integrals. At the other extreme, we could aim to obtain tractable integrals by taking the family of prior distributions to be a single distribution of a simple form (e.g., a constant), but the posterior would not generally retain this form. In the setting of the exponential family this dilemma is readily resolved. For exponential families the likelihood is a simple standardized function of the parameter, and we can define conjugate priors by mimicking the form of the likelihood. Multiplication of a likelihood and a prior that have the same exponential form yields a posterior that retains that form. Moreover, for the exponential families that are most useful in practice, these exponential forms are readily integrated.

In the remainder of this section we present examples that illustrate these computations. Conjugate priors for exponential family distributions thus have appealing computational properties, and for this reason they are widely used in practice. Indeed, for the complex models of the kind that are often constructed using the graphical model toolbox, computational considerations may be paramount, and there may be little choice but to use conjugate priors. On the other hand, there are also good reasons not to use conjugate priors, and one should not be lulled into a sense of complacency when using them. Before turning to a presentation of examples, let us briefly discuss some of the philosophical issues; we will return to this discussion in Section ?? after we have obtained a better idea of some of the options.

Recall from our earlier discussion in Section ?? the distinction between subjective Bayesian and objective Bayesian perspectives. The subjective Bayesian perspective takes the optimistic view that priors are an opportunity to express knowledge; in particular, a prior may be a posterior from a previous experiment. The objective Bayesian perspective takes the more pessimistic view that prior knowledge is often not available and that priors should be chosen to have as little impact on the analysis as possible, relative to the impact of the data. In this regard, it is important to note that conjugate priors involve making relatively strong assumptions. Indeed, in a sense to be made clear in Section ??, conjugate priors minimize the impact of the data on the posterior. From the subjective perspective, this can be viewed favorably: conjugate priors provide an opportunity to express knowledge in a relatively influential way.

From the objective perspective, however, conjugate priors are decidedly dangerous; objective priors aim to maximize the impact of the data on the posterior. The general point to be made is that one should take care with conjugate priors. Their use involves relatively strong assumptions, and thus it is particularly important to do sensitivity analysis to assess how strongly the posterior is influenced by the prior. If the answer is "not much," then one can proceed with some confidence. If the answer is "a lot," then one should either take great care to assess whether a domain expert is comfortable with these priors on subjective grounds, or one should consider other kinds of priors (such as those discussed in Section ??) and/or gather more data so as to diminish the effect of the prior.

9.1 The Bernoulli distribution and beta priors

We have stated that conjugate priors can be obtained by mimicking the form of the likelihood.

This is easily understood by considering examples. Let us begin with the Bernoulli distribution. Parameterizing the Bernoulli distribution using the mean parameter $\theta$, the likelihood takes the following form:

    p(x \mid \theta) = \theta^{x} (1-\theta)^{1-x}.    (9.4)

Under sampling, this expression retains the form of a product of powers of $\theta$ and $1-\theta$, with the exponents growing. This suggests that to obtain a conjugate prior for $\theta$, we use a distribution that is a product of powers of $\theta$ and $1-\theta$, with free parameters in the exponents:

    p(\theta \mid \alpha) \propto \theta^{\alpha_1} (1-\theta)^{\alpha_2}.    (9.5)

This expression can be normalized if $\alpha_1 > -1$ and $\alpha_2 > -1$. The resulting distribution is known as the beta distribution, another example of an exponential family distribution. The beta distribution is traditionally parameterized using $\alpha_i - 1$ instead of $\alpha_i$ in the exponents (for a reason that will become clear below), yielding the following standard form for the conjugate prior:

    p(\theta \mid \alpha) = K(\alpha)\, \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_2 - 1},    (9.6)
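As a quick numerical illustration (ours, not the text's), one can check that the i.i.d. Bernoulli likelihood really does stay a product of powers of $\theta$ and $1-\theta$, with only the exponents growing:

```python
import math

# Illustrative check (ours): under i.i.d. sampling the Bernoulli likelihood
# remains a product of powers of theta and (1 - theta) -- only the exponents grow.
def likelihood_product(theta, xs):
    prod = 1.0
    for x in xs:
        prod *= theta**x * (1 - theta) ** (1 - x)  # one Bernoulli factor per datum
    return prod

def likelihood_powers(theta, xs):
    ones = sum(xs)
    return theta**ones * (1 - theta) ** (len(xs) - ones)

theta, xs = 0.3, [1, 0, 0, 1, 1]
assert math.isclose(likelihood_product(theta, xs), likelihood_powers(theta, xs))
```

This product-of-powers form is exactly what the conjugate prior in Eq. (9.5) mimics.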

where the normalization factor $K(\alpha)$ can be obtained analytically (see Exercise ??):

    K(\alpha) = \left( \int \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_2 - 1}\, d\theta \right)^{-1}    (9.7)
              = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\, \Gamma(\alpha_2)},    (9.8)

a ratio of gamma functions. If we multiply the beta density by the Bernoulli likelihood we obtain a beta density. Consider $N$ i.i.d. Bernoulli observations, $x = (x_1, \ldots, x_N)^T$:

    p(\theta \mid x, \alpha) \propto \left( \prod_{n=1}^{N} \theta^{x_n} (1-\theta)^{1-x_n} \right) \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_2 - 1}    (9.9)
                             = \theta^{\sum_{n=1}^{N} x_n + \alpha_1 - 1}\, (1-\theta)^{N - \sum_{n=1}^{N} x_n + \alpha_2 - 1}.    (9.10)

This is a beta density with updated values of the parameters. In particular, it is a Beta($\sum_{n=1}^{N} x_n + \alpha_1$, $N - \sum_{n=1}^{N} x_n + \alpha_2$) density. Note the simple nature of the prior-to-posterior updating procedure. For each observation $x_n$ we simply add $x_n$ to the first parameter of the beta distribution and add $1 - x_n$ to the second parameter. At each step we simply retain two numbers as our representation of the posterior distribution.
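The updating procedure just described reduces to two additions per observation. A minimal sketch (the helper name and parameter names are ours):

```python
# Sketch of the Beta-Bernoulli update (names are ours): a Beta(a1, a2) prior
# plus observations x_1..x_N yields Beta(a1 + sum x_n, a2 + N - sum x_n).
def beta_update(a1, a2, xs):
    ones = sum(xs)
    return a1 + ones, a2 + len(xs) - ones

# Sequential one-at-a-time updating agrees with a single batch update:
a1, a2 = 1.0, 1.0  # Beta(1, 1), i.e. a uniform prior
for x in [1, 0, 1, 1]:
    a1, a2 = beta_update(a1, a2, [x])
print((a1, a2))  # → (4.0, 2.0), same as beta_update(1.0, 1.0, [1, 0, 1, 1])
```

That the sequential and batch updates agree reflects the fact that the posterior after each observation is again a beta density, ready to serve as the prior for the next observation.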

Note also that the form of the updating procedure provides an interpretation for the parameters $\alpha_1$ and $\alpha_2$. In particular, viewing the prior as if it were the posterior from a previous experiment, we can view $\alpha_1$ and $\alpha_2$ as "effective counts": $\alpha_1$ can be viewed as an effective number of prior observations of $X = 1$, and $\alpha_2$ can be interpreted as an effective number of prior observations of $X = 0$. (In general, however, the parameters are not restricted to integer values.) The fact that the normalization factor of the beta distribution has an analytic form allows us to compute various averages in closed form. Consider, in particular, the mean of a beta random variable:

    E[\theta \mid \alpha] = \int \theta\, K(\alpha)\, \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_2 - 1}\, d\theta    (9.11)
                          = K(\alpha) \int \theta^{\alpha_1} (1-\theta)^{\alpha_2 - 1}\, d\theta    (9.12)
                          = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\, \Gamma(\alpha_2)} \cdot \frac{\Gamma(\alpha_1 + 1)\, \Gamma(\alpha_2)}{\Gamma(\alpha_1 + 1 + \alpha_2)}    (9.13)
                          = \frac{\alpha_1}{\alpha_1 + \alpha_2},    (9.14)

using $\Gamma(a+1) = a\, \Gamma(a)$ in the final line.
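This mean calculation can be spot-checked numerically; the helper below (ours, not the chapter's) builds the normalizer from gamma functions and evaluates the integral with a midpoint Riemann sum:

```python
import math

# Numerical spot-check (ours) that the mean of a Beta(a1, a2) density
# K * t^(a1-1) * (1-t)^(a2-1) equals a1 / (a1 + a2).
def beta_mean_by_integration(a1, a2, n=100_000):
    k = math.gamma(a1 + a2) / (math.gamma(a1) * math.gamma(a2))  # normalizer K
    total = 0.0
    for i in range(n):
        t = (i + 0.5) / n  # midpoint rule on (0, 1)
        total += t * k * t ** (a1 - 1) * (1 - t) ** (a2 - 1)
    return total / n

a1, a2 = 2.5, 4.0
print(abs(beta_mean_by_integration(a1, a2) - a1 / (a1 + a2)) < 1e-6)  # → True
```

Note that the check also exercises the gamma-function normalizer itself: if Eq. (9.8) were wrong, the density would not integrate to one and the computed mean would be off.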

A similar calculation yields the variance:

    Var[\theta \mid \alpha] = \frac{\alpha_1 \alpha_2}{(\alpha_1 + \alpha_2 + 1)(\alpha_1 + \alpha_2)^2}.    (9.15)

From these results we see that the relative values of $\alpha_1$ and $\alpha_2$ determine the mean, whereas the magnitude $\alpha_1 + \alpha_2$ determines the variance. That is, for a fixed value of the mean, the variance goes to zero as $\alpha_1 + \alpha_2$ goes to infinity. Applying these results to the posterior distribution in Eq. (9.10), we can compute the posterior mean:

    E[\theta \mid x, \alpha] = \frac{\sum_{n=1}^{N} x_n + \alpha_1}{N + \alpha_1 + \alpha_2},    (9.16)

and the posterior variance:

    Var[\theta \mid x, \alpha] = \frac{\left(\sum_{n=1}^{N} x_n + \alpha_1\right)\left(N - \sum_{n=1}^{N} x_n + \alpha_2\right)}{(N + \alpha_1 + \alpha_2 + 1)(N + \alpha_1 + \alpha_2)^2}.    (9.17)

These equations yield several significant pieces of information. First, letting $N$ tend to infinity, we see that

    E[\theta \mid x, \alpha] \to \frac{1}{N} \sum_{n=1}^{N} x_n,    (9.18)

which is the maximum likelihood estimate of $\theta$. Second, we have

    Var[\theta \mid x, \alpha] \to 0,    (9.19)

showing that the posterior distribution concentrates around the maximum likelihood estimate for large $N$.
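The posterior mean and variance formulas above, and their large-N behavior, can be checked directly; a small sketch (the helper name is ours):

```python
# Sketch (helper name is ours) of the posterior mean and variance of theta
# under a Beta(a1, a2) prior and N Bernoulli observations xs:
def posterior_mean_var(xs, a1, a2):
    n, ones = len(xs), sum(xs)
    mean = (ones + a1) / (n + a1 + a2)
    var = (ones + a1) * (n - ones + a2) / ((n + a1 + a2 + 1) * (n + a1 + a2) ** 2)
    return mean, var

# As N grows (here with 70% ones), the posterior mean approaches the
# maximum likelihood estimate 0.7 and the variance shrinks toward zero:
for n in (10, 100, 1000):
    xs = [1] * (7 * n // 10) + [0] * (3 * n // 10)
    mean, var = posterior_mean_var(xs, 1.0, 1.0)
    print(n, round(mean, 4), round(var, 6))
```

Running this shows the mean drifting from the prior-influenced value toward 0.7 while the variance collapses, which is the concentration behavior described in the text.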

