Transcription of Parameter estimation for text analysis
1 Parameter estimation for text analysisGregor HeinrichTechnical Notevsonix GmbH+University of Leipzig, Parameter estimation methods common with discrete proba-bility distributions, which is of particular interest in text modeling. Starting withmaximum likelihood, a posteriori and Bayesian estimation , central concepts likeconjugate distributions and Bayesian networks are reviewed. As an application,the model of latent Dirichlet allocation (LDA) is explained in detail with a fullderivation of an approximate inference algorithm based on Gibbs sampling, in-cluding a discussion of Dirichlet hyperparameter :version 1: May 2005, version : August IntroductionThis technical note is intended to review the foundations of Bayesian Parameter esti-mation in the discrete domain, which is necessary to understand the inner workings oftopic-based text analysis approaches like probabilistic latent semantic analysis (PLSA)[Hofm99], latent Dirichlet allocation (LDA) [BNJ02] and other mixture models ofcount data.
2 Despite their general acceptance in the research community, it appears thatthere is no common book or introductory paper that fills this role: Most known texts useexamples from the Gaussian domain, where formulations appear to be rather very good introductory work on topic models ( , [StGr07]) skips details ofalgorithms and other background for clarity of therefore will systematically introduce the basic concepts of Parameter estima-tion with a couple of simple examples on binary data in Section 2. We then will in-troduce the concept of conjugacy along with a review of the most common probabilitydistributions needed in the text domain in Section 3. The joint presentation of conjugacywith associated real-world conjugate pairs directly justifies the choice of distributionsintroduced. Section 4 will introduce Bayesian networks as a graphical language to de-scribe systems via their probabilistic these basic concepts, we present the idea of latent Dirichlet allocation (LDA)in Section 5, a flexible model to estimate the properties of text.
3 On the example ofLDA, the usage of Gibbs sampling is shown as a straight-forward means of approximateinference in Bayesian networks. Two other important aspects of LDA are discussedafterwards: In Section 6, the influence of LDA hyperparameters is discussed and anestimation method proposed, and in Section 7, methods are presented to analyse LDAmodels for querying and Parameter estimation approachesWe face two inference problems, (1) to estimate values for a set of distribution param-eters that can best explain a set of observationsXand (2) to calculate the probabilityof new observations xgiven previous observations, , to findp( x|X). We will referto the former problem as the estimation problem and to the latter as the prediction orregression data setX,{xi}|X|i=1can be considered a sequence of independent and identi-cally distributed ( ) realisations of a random variable ( )X. The parameters aredependent on the distributions considered, , for a Gaussian, ={ , 2}.
4 For these data and parameters , a couple of probability functions are ubiquitous inBayesian statistics. They are best introduced as parts of Bayes rule, which is1:p( |X)=p(X| ) p( )p(X),(1)and we define the corresponding terminology:posterior=likelihood priorevidence.(2)In the next paragraphs, we will show different estimation methods that start from simplemaximisation of the likelihood, then show how prior belief on parameters can be incor-porated by maximising the posterior and finally use Bayes rule to infer a completeposterior Maximum likelihood estimationMaximum likelihood (ML) estimation tries to find parameters that maximise the likeli-hood,L( |X),p(X| )= x X{X=x| }= x Xp(x| ),(3) , the probability of the joint event thatXgenerates the dataX. Because of the productin Eq. 3, it is often simpler to use the log likelihood,L,logL. The ML estimationproblem then can be written as: ML=argmax L( |X)=argmax x Xlogp(x| ).(4)The common way to obtain the Parameter estimates is to solve the system: L( |X) k!
5 =0 k .(5)1 Derivation:p( |X) p(X)=p(X, )=p(X| ) p( ).3 The probability of a new observation xgiven the dataXcan now be found using theapproximation2:p( x|X)= p( x| )p( |X) d (6) p( x| ML)p( |X) d =p( x| ML),(7)that is, the next sample is anticipated to be distributed with the estimated parameters an example, consider a setCofNBernoulli experiments with unknown param-eterp, , realised by tossing a deformed coin. The Bernoulli density function for one experiment is:p(C=c|p)=pc(1 p)1 c,Bern(c|p)(8)where we definec=1 for heads andc=0 for an ML estimator for the parameterpcan be done by expressing the (log)likelihood as a function of the data:L=logN i=1p(C=ci|p)=N i=1logp(C=ci|p)(9)=n(1)logp(C=1|p)+n(0)l ogp(C=0|p)=n(1)logp+n(0)log(1 p)(10)wheren(c)is the number of times a Bernoulli experiment yielded eventc. Differentiatingwith respect to ( ) the parameterpyields: L p=n(1)p n(0)1 p!=0 pML=n(1)n(1)+n(0)=n(1)N,(11)which is simply the ratio of heads results to the total number of samples.
6 To put somenumbers into the example, we could imagine that our coin is strongly deformed, andafter 20 trials, we haven(1)=12 times heads andn(0)=8 times tails. This results in an MLestimation of of pML=12/20= Maximum a posteriori estimationMaximum a posteriori (MAP) estimation is very similar to ML estimation but allowsto include some a priori belief on the parameters by weighting them with a prior dis-tributionp( ). The name derives from the objective to maximise the posterior of theparameters given the data: MAP=argmax p( |X).(12)2 The ML estimate MLis considered a constant, and the integral over the parameters given thedata is the total probability that integrates to notation in Eq. 8 is somewhat peculiar because it makes use of the values ofcto filter the respective parts in the density function and additionally uses these numbers to representdisjoint using Bayes rule (Eq. 1), this can be rewritten to: MAP=argmax p(X| )p( )p(X) p(X),f( )=argmax p(X| )p( )=argmax {L( |X)+logp( )}=argmax { x Xlogp(x| )+logp( )}.
7 (13)Compared to Eq. 4, a prior distribution is added to the likelihood. In practice, the priorp( ) can be used to encode extra knowledge as well as to prevent overfitting by enforc-ing preference to simpler models, which is also called Occam s the incorporation ofp( ), MAP follows the Bayesian approach to data mod-elling where the parameters are thought of as With priors that are parametrisedthemselves, ,p( ) :=p( | ) with hyperparameters , the belief in the anticipatedvalues of can be expressed within the framework of probability5, and a hierarchy ofparameters is Parameter estimates can be found by maximising the termL( |X)+logp( ),similar to Eq. 5. Analogous to Eq. 7, the probability of a new observation, x, given thedata,X, can be approximated using:p( x|X) p( x| MAP)p( |X) d =p( x| MAP).(14)Returning to the simplistic demonstration on ML, we can give an example for theMAP estimator. Consider the above experiment, but now there are values forpthatwe believe to be more likely, , we believe that a coin usually is fair.
8 This can beexpressed as a prior distribution that has a high probability around We choose thebeta distribution:p(p| , )=1B( , )p 1(1 p) 1,Beta(p| , ),(15)with the beta function B( , )= ( ) ( ) ( + ). The function (x) is the Gamma function,which can be understood as a generalisation of the factorial to the domain of real num-bers via the identityx!= (x+1). The beta distribution supports the interval [0,1] andtherefore is useful to generate normalised probability values. For a graphical represen-tation of the beta probability density function (pdf), see Fig. 1. As can be seen, withdifferent parameters the distribution takes on quite different our example, we believe in a fair coin and set = =5, which results in adistribution with a mode (maximum) at The optimisation problem now becomes4 Pluralitas non est ponenda sine necessitate=Plurality should not be posited without s razor is also called the principle of is not identical to probability, which is one of the reasons why Bayesian approaches aredisputed by some theorists despite their practical (20, 20) Beta(1, 1) Beta(2/3, 2/3) Beta(10, 30) Beta(2, 6) Beta(1/3, 1) Beta(1, 3) Beta(2, 1) Beta(4, 4) p(p) p Fig.
9 Functions of the beta distribution with different symmetric and asym-metric parametrisations.(cf. Eq. 11): pL+logp(p)=n(1)p n(0)1 p+ 1p 11 p!=0(16) pMAP=n(1)+ 1n(1)+n(0)+ + 2=n(1)+4n(1)+n(0)+8(17)This result is interesting in two aspects. The first one is the changed behaviour of theestimate the countsn(c): their influence on the estimate is reduced by theadditive values that pull the value towards pMAP=4/8= The higher the valuesof the hyperparameters and , the more actual observations are necessary to revise thebelief expressed by them. The second interesting aspect is the exclusive appearance ofthe sumsn(1)+ 1 andn(0)+ 1: It is irrelevant whether the counts actually derivefrom actual observations or prior belief expressed as hypervariables. This is why the hy-perparameters and are often referred to as pseudo-counts. The higher pseudo-countsexist, the sharper the beta distribution is concentrated around its maximum. Again, weobserve in 20 trialsn(1)=12 times heads andn(0)=8 times tails.
10 This results in an MAPestimation of pMAP=16/28= , which in comparison to pML= shows theinfluence of the prior belief of the fairness of the Bayesian estimationBayesian estimation extends the MAP approach by allowing a distribution over the pa-rameter set instead of making a direct estimate. Not only encodes this the maximum(a posteriori) value of the data-generated parameters , but it also incorporates expec-tation as another Parameter estimate as well as variance information as a measure of6estimation quality or confidence. The main step in this approach is the calculation ofthe posterior according to Bayes rule:p( |X)=p(X| ) p( )p(X).(18)As we do not restrict the calculation to finding a maximum, it is necessary to calculatethe normalisation term, , the probability of the evidence ,p(X), in Eq. 18. Its valuecan be expressed by the total probability the parameters6:p(X)= p(X| )p( ) d .(19)As new data are observed, the posterior in Eq. 18 is automatically adjusted and caneventually be analysed for its statistics.