NLP Lunch Tutorial: Smoothing

NLP Lunch Tutorial: SmoothingBill MacCartney21 April 2005 Preface Everything is from this great paper by Stanley F. Chen and JoshuaGoodman (1998), An Empirical Study of Smoothing Techniquesfor Language Modeling , which I read yesterday. Everything is presented in the context ofn-gram language models,but Smoothing is needed in many problem contexts, and most ofthe Smoothing methods we ll look at generalize without Plan Motivation the problem an example All the Smoothing methods formula after formula intuitions for each So which one is the best? (answer: modified Kneser-Ney) Excel demo for absolute discounting and Good-Turing?2 Probabilistic modeling You have some kind of probabilistic model, which is a distributionp(e) over an event spaceE. You want to estimate the parameters of your model distributionpfrom data. In principle, you might to like to use maximum likelihood (ML)estimates, so that your model ispML(x) =c(x) ec(e) : data sparsity But, you have insufficient data: there are many eventsxsuch thatc(x) = 0, so that the ML estimate ispML(x) = 0.

In problem settings where the event spaceEis unbounded ( NLP problems), this is generally undesirable. Ex: a language model which gives probability 0 to unseen words. Just because an event has never been observed in training data doesnot mean it cannot occur in test data. So ifc(x) = 0, what shouldp(x) be? If data sparsity isn t a problem for you, your model is too simple!4 Whenever data sparsity is an issue, Smoothing can help performance,and data sparsity is almost always an issue in statistical modeling. In theextreme case where there is so much training data that all parameterscan be accurately trained without Smoothing , one can almost alwaysexpand the model, such as by moving to a highern-gram model, toachieve improved performance. With more parameters data sparsitybecomes an issue again, but with proper Smoothing the models areusually more accurate than the original models. Thus, no matter howmuch data one has, Smoothing can almost always help performace, andfor a relatively small effort.

Chen & Goodman (1998)5 Example: bigram modelJOHN READ MOBY DICKMARY READ A DIFFERENT BOOKSHE READ A BOOK BY CHERp(wi|wi 1) =c(wi 1wi) wic(wi 1wi)p(s) =l+1 i=1p(wi|wi 1)6 JOHN READ MOBY DICKMARY READ A DIFFERENT BOOKSHE READ A BOOK BY CHERp(JOHN READ A BOOK)=p(JOHN| )p(READ|JOHN)p(A|READ)p(BOOK|A)p( |BOOK)=c( JOHN) wc( w)c(JOHN READ) wc(JOHNw)c(READ A) wc(READw)c(A BOOK) wc(Aw)c(BOOK ) wc(BOOKw)=1311231212 READ MOBY DICKMARY READ A DIFFERENT BOOKSHE READ A BOOK BY CHERp(CHER READ A BOOK)=p(CHER| )p(READ|CHER)p(A|READ)p(BOOK|A)p( |BOOK)=c( CHER) wc( w)c(CHER READ) wc(CHERw)c(READ A) wc(READw)c(A BOOK) wc(Aw)c(BOOK ) wc(BOOKw)=0301231212= 08 Add-one smoothingp(wi|wi 1) =1 +c(wi 1wi) wi[1 +c(wi 1wi)]=1 +c(wi 1wi)|V|+ wic(wi 1wi) Originally due to Laplace. Typically, we assumeV={w:c(w)>0} {UNK} Add-one Smoothing is generally a horrible READ MOBY DICKMARY READ A DIFFERENT BOOKSHE READ A BOOK BY CHERp(JOHN READ A BOOK)=1+111+31+111+11+211+31+111+21+111+ 2 (CHER READ A BOOK)=1+011+31+011+11+211+31+111+21+111+ 2 methods Additive Smoothing Good-Turing estimate Jelinek-Mercer Smoothing (interpolation) Katz Smoothing (backoff) Witten-Bell Smoothing Absolute discounting Kneser-Ney smoothing11 Additive smoothingpadd(wi|wi 1i n+1) = +c(wii n+1) |V|+ wic(wii n+1) Idea: pretend we ve seen eachn-gram times more than we have.

Typically, 0< 1. Lidstone and Jeffreys advocate = 1. Gale & Church (1994) argue that this method performs estimation Idea: reallocate the probability mass ofn-grams that occurr+ 1times in the training data to then-grams that occurrtimes. In particular, reallocate the probability mass ofn-grams that wereseen once to then-grams that were never seen. For each countr, we compute an adjusted countr :r = (r+ 1)nr+1nrwherenris the number ofn-grams seen exactlyrtimes. Then we have:pGT(x:c(x) =r) =r NwhereN= r=0r nr= r= problems Problem: what ifnr+1= 0? This is common for highr: there are holes in the counts of counts. Problem: even if we re not just below a hole, for highr, thenrarequite noisy. Really, we should think ofr as:r = (r+ 1)E[nr+1]E[nr] But how do we estimate that expectation? (The original formulaamounts to using the ML estimate.) Good-Turing thus requires elaboration to be useful. It forms a foun-dation on which other Smoothing methods Smoothing (interpolation) Observation: Ifc(BURNISH THE) = 0 andc(BURNISH THOU) = 0,then under both additive Smoothing and Good-Turing:p(THE|BURNISH) =p(THOU|BURNISH) This seems wrong: we should havep(THE|BURNISH)> p(THOU|BURNISH)becauseTHEis much more common thanTHOU Solution:interpolatebetween bigram and unigram Smoothing (interpolation) Unigram ML model:pML(wi) =c(wi) wic(wi) Bigram interpolated model:pinterp(wi|wi 1) = pML(wi|wi 1) + (1 )pML(wi)16 Jelinek-Mercer Smoothing (interpolation) Recursive formulation:nth-order smoothed model is defined recur-sively as a linear interpolation between thenth-order ML model andthe (n 1)th-order smoothed (wi|wi 1i n+1) = wi 1i n+1pML(wi|wi 1i n+1) + (1 wi 1i n+1)pinterp(wi|wi 1i n+2) Can ground recursion with: 1st-order model: ML (or otherwise smoothed) unigram model 0th-order model.

Uniform modelpunif(wi) =1|V|17 Jelinek-Mercer Smoothing (interpolation) The wi 1i n+1can be estimated using EM on held-out data (held-outinterpolation) or in cross-validation fashion (deleted interpolation). The optimal wi 1i n+1depend on context: high-frequency contextsshould get high s. But, can t tune all s separately: need to bucket them. Bucket by wic(wii n+1): total count in higher-order Smoothing : bigrams As in Good-Turing, we compute adjusted counts. Bigrams with nonzero countrare discounted according todiscountratiodr, which is approximatelyr r, the discount predicted by Good-Turing. (Details below.) Count mass subtracted from nonzero counts is redistributed amongthe zero-count bigrams according to next lower-order distribution( the unigram model).19 Katz Smoothing : bigrams Katz adjusted counts:ckatz(wii 1) ={drrifr >0 (wi 1)pML(wi) ifr= 0 (wi 1) is chosen so that wickatz(wii 1) = wic(wii 1): (wi 1) =1 wi:c(wii 1)>0pkatz(wi|wi 1)1 wi:c(wii 1)>0pML(wi) Computepkatz(wi|wi 1) from corrected count by normalizing:pkatz(wi|wi 1) =ckatz(wii 1) wickatz(wii 1)20 Katz Smoothing What aboutdr?}

Large counts are taken to be reliable, sodr= 1 forr > k, where Katz suggestsk= 5. Forr We want discounts to be proportional to Good-Turing discounts:1 dr= (1 r r) We want the total count mass saved to equal the count mass whichGood-Turing assigns to zero counts:k r=1nr(1 dr)r=n1 The unique solution is:dr=r r (k+1)nk+1n11 (k+1)nk+1n121 Katz Smoothing Katz Smoothing for higher-ordern-grams is defined analogously. Like Jelinek-Mercer, can be given a recursive formulation: the Katzn-gram model is defined in terms of the Katz (n 1)-gram Smoothing An instance of Jelinek-Mercer Smoothing :pW B(wi|wi 1i n+1) = wi 1i n+1pML(wi|wi 1i n+1) + (1 wi 1i n+1)pW B(wi|wi 1i n+2) Motivation: interpret wi 1i n+1as the probability of using the higher-order model. We should use higher-order model ifn-gramwii n+1was seen intraining data, and back off to lower-order model otherwise. So 1 wi 1i n+1should be the probability that a word not seen afterwi 1i n+1in training data occurs after that history in test data.

Estimate this by the number of unique words that follow the historywi 1i n+1in the training Smoothing To compute the s, we ll need the number of unique words thatfollow the historywi 1i n+1:N1+(wi 1i n+1 ) =|{wi:c(wi 1i n+1wi)>0}| Set s such that1 wi 1i n+1=N1+(wi 1i n+1 )N1+(wi 1i n+1 ) + wic(wii n+1)24 Absolute discounting Like Jelinek-Mercer, involves interpolation of higher- and lower-ordermodels. But instead of multiplying the higher-orderpMLby a , we subtracta fixed discount [0,1] from each nonzero count:pabs(wi|wi 1i n+1) =max{c(wii n+1) ,0} wic(wii n+1)+ (1 wi 1i n+1)pabs(wi|wi 1i n+2) To make it sum to 1:1 wi 1i n+1= wic(wii n+1)N1+(wi 1i n+1 ) Choose using held-out Smoothing An extension of absolute discounting with a clever way of construct-ing the lower-order (backoff) model. Idea: the lower-order model is signficant only when count is smallor zero in the higher-order model, and so should be optimized forthat purpose.

Example: suppose San Francisco is common, but Francisco occurs only after San . Francisco will get a high unigram probability, and so absolutediscounting will give a high probability to Francisco appearingafter novel bigram histories. Better to give Francisco a low unigram probability, because theonly time it occurs is after San , in which case the bigram modelfits Smoothing Let the count assigned to each unigram be the number of differentwords that it follows. Define:N1+( wi) =|{wi 1:c(wi 1wi)>0}|N1+( ) = wiN1+( wi) Let lower-order distribution be:pKN(wi) =N1+( wi)N1+( ) Put it all together:pKN(wi|wi 1i n+1) =max{c(wii n+1) ,0} wic(wii n+1)+ wic(wii n+1)N1+(wi 1i n+1 )pKN(wi|wi 1i n+2)27 Interpolation vs. backoff Both interpolation (Jelinek-Mercer) and backoff (Katz) involve com-bining information from higher- and lower-order models. Key difference: in determining the probability ofn-grams with nonzerocounts, interpolated models use information from lower-order mod-els while backoff models do not.

(In both backoff and interpolated models, lower-order models areused in determining the probability ofn-grams with zero counts.) It turns out that it s not hard to create a backoff version of aninterpolated algorithm, and vice-versa. (Kneser-Ney was originallybackoff; Chen & Goodman made interpolated version.)28 Modified Kneser-Ney Chen and Goodman introducedmodified Kneser-Ney: Interpolation is used instead of backoff. Uses a separate discount for one- and two-counts instead of asingle discount for all counts. Estimates discounts on held-out data instead of using a formulabased on training counts. Experiments show all three modifications improve performance. Modified Kneser-Ney consistently had best The factor with the largest influence is the use of a modified backoffdistribution as in Kneser-Ney Smoothing . Jelinek-Mercer performs better on small training sets; Katz performsbetter on large training sets. Katz Smoothing performs well onn-grams with large counts; Kneser-Ney is best for small counts.

Absolute discounting is superior to linear discounting. Interpolated models are superior to backoff models for low (nonzero)counts. Adding free parameters to an algorithm and optimizing these pa-rameters on held-out data can improve from Chen & Goodman (1998)30 END31 32

NLP Lunch Tutorial: Smoothing

Tags:

Information

Advertisement

Transcription of NLP Lunch Tutorial: Smoothing

Related search queries

NLP Lunch Tutorial: Smoothing

Tags:

Information

Advertisement

Documents from same domain

Related documents

Related search queries