
CHAPTER 3: N-gram Language Models

Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright 2021. All rights reserved. Draft of December 29, 2021.

    "You are uniformly charming!" cried he, with a smile of associating and now and then I bowed and they perceived a chaise and four to wish for.
        Random sentence generated from a Jane Austen trigram model

Predicting is difficult, especially about the future, as the old quip goes. But how about predicting something that seems much easier, like the next few words someone is going to say? What word, for example, is likely to follow

    Please turn your homework ...

Hopefully, most of you concluded that a very likely word is in, or possibly over, but probably not refrigerator or the.


In the following sections we will formalize this intuition by introducing models that assign a probability to each possible next word. The same models will also serve to assign a probability to an entire sentence. Such a model, for example, could predict that the following sequence has a much higher probability of appearing in a text:

    all of a sudden I notice three guys standing on the sidewalk

than does this same set of words in a different order:

    on guys all I of notice sidewalk three a sudden standing the

Why would you want to predict upcoming words, or assign probabilities to sentences? Probabilities are essential in any task in which we have to identify words in noisy, ambiguous input, like speech recognition.

For a speech recognizer to realize that you said I will be back soonish and not I will be bassoon dish, it helps to know that back soonish is a much more probable sequence than bassoon dish. For writing tools like spelling correction or grammatical error correction, we need to find and correct errors in writing like Their are two midterms, in which There was mistyped as Their, or Everything has improve, in which improve should have been improved. The phrase There are will be much more probable than Their are, and has improved than has improve, allowing us to help users by detecting and correcting these errors.

Assigning probabilities to sequences of words is also essential in machine translation. Suppose we are translating a Chinese source sentence whose word-by-word gloss is:

    He to reporters introduced main content

As part of the process we might have built the following set of potential rough English translations:

    he introduced reporters to the main contents of the statement
    he briefed to reporters the main contents of the statement
    he briefed reporters on the main contents of the statement

A probabilistic model of word sequences could suggest that briefed reporters on is a more probable English phrase than briefed to reporters (which has an awkward to after briefed) or introduced reporters to (which uses a verb that is less fluent English in this context), allowing us to correctly select the last of the three translations above.

Probabilities are also important for augmentative and alternative communication (AAC) systems (Trnka et al. 2007, Kane et al. 2017). People often use such AAC devices if they are physically unable to speak or sign but can instead use eye gaze or other specific movements to select words from a menu to be spoken by the system; word prediction can be used to suggest likely words for the menu.

Models that assign probabilities to sequences of words are called language models or LMs.

In this chapter we introduce the simplest model that assigns probabilities to sentences and sequences of words, the n-gram. An n-gram is a sequence of n words: a 2-gram (which we'll call a bigram) is a two-word sequence of words like "please turn", "turn your", or "your homework", and a 3-gram (a trigram) is a three-word sequence of words like "please turn your" or "turn your homework". We'll see how to use n-gram models to estimate the probability of the last word of an n-gram given the previous words, and also to assign probabilities to entire sequences. In a bit of terminological ambiguity, we usually drop the word "model", and use the term n-gram (and bigram, etc.) to mean either the word sequence itself or the predictive model that assigns it a probability. While n-gram models are much simpler than state-of-the-art neural language models based on the RNNs and transformers we will introduce in Chapter 9, they are an important foundational tool for understanding the fundamental concepts of language modeling.
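
To make the definition concrete, here is a minimal sketch (ours, not from the text) of extracting the bigrams and trigrams of the example phrase "please turn your homework"; the helper name ngrams is our own choice for illustration.

    def ngrams(words, n):
        """Return the n-grams (as tuples) of a list of words."""
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    words = "please turn your homework".split()
    print(ngrams(words, 2))  # bigrams: (please, turn), (turn, your), (your, homework)
    print(ngrams(words, 3))  # trigrams: (please, turn, your), (turn, your, homework)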

3.1 N-Grams

Let's begin with the task of computing P(w|h), the probability of a word w given some history h. Suppose the history h is "its water is so transparent that" and we want to know the probability that the next word is the:

    P(the | its water is so transparent that)

One way to estimate this probability is from relative frequency counts: take a very large corpus, count the number of times we see its water is so transparent that, and count the number of times this is followed by the.

This would be answering the question "Out of the times we saw the history h, how many times was it followed by the word w?", as follows:

    P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)

With a large enough corpus, such as the web, we can compute these counts and estimate the probability from this ratio. You should pause now, go to the web, and compute this estimate for yourself.

While this method of estimating probabilities directly from counts works fine in many cases, it turns out that even the web isn't big enough to give us good estimates in most cases. This is because language is creative; new sentences are created all the time, and we won't always be able to count entire sentences.

Even simple extensions of the example sentence may have counts of zero on the web (such as "Walden Pond's water is so transparent that the"; well, used to have counts of zero).

Similarly, if we wanted to know the joint probability of an entire sequence of words like its water is so transparent, we could do it by asking "out of all possible sequences of five words, how many of them are its water is so transparent?" We would have to get the count of its water is so transparent and divide by the sum of the counts of all possible five-word sequences. That seems rather a lot to estimate! For this reason, we'll need to introduce more clever ways of estimating the probability of a word w given a history h, or the probability of an entire word sequence W.
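
As a concrete (if naive) illustration of the relative-frequency estimate described above, the sketch below counts occurrences in a toy corpus; the corpus and the helper names count_occurrences and relative_frequency are invented for illustration, and on real data most histories would have a count of zero, which is exactly the sparsity problem just discussed.

    def count_occurrences(tokens, phrase):
        """Count how often the word sequence `phrase` occurs in `tokens`."""
        n = len(phrase)
        return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase)

    def relative_frequency(tokens, history, word):
        """Estimate P(word | history) as C(history + word) / C(history)."""
        c_history = count_occurrences(tokens, history)
        c_joint = count_occurrences(tokens, history + [word])
        return c_joint / c_history if c_history > 0 else 0.0

    corpus = "its water is so transparent that the fish are visible".split()
    history = "its water is so transparent that".split()
    print(relative_frequency(corpus, history, "the"))  # 1.0 in this tiny toy corpus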

Let's start with a little formalizing of notation. To represent the probability of a particular random variable X_i taking on the value "the", or P(X_i = "the"), we will use the simplification P(the). We'll represent a sequence of n words either as w_1 ... w_n or as w_{1:n} (so the expression w_{1:n-1} means the string w_1, w_2, ..., w_{n-1}). For the joint probability of each word in a sequence having a particular value P(X_1 = w_1, X_2 = w_2, X_3 = w_3, ..., X_n = w_n) we'll use P(w_1, w_2, ..., w_n).

Now how can we compute probabilities of entire sequences like P(w_1, w_2, ..., w_n)? One thing we can do is decompose this probability using the chain rule of probability:

    P(X_1 ... X_n) = P(X_1) P(X_2 | X_1) P(X_3 | X_{1:2}) ... P(X_n | X_{1:n-1})
                   = \prod_{k=1}^{n} P(X_k | X_{1:k-1})

Applying the chain rule to words, we get

    P(w_{1:n}) = P(w_1) P(w_2 | w_1) P(w_3 | w_{1:2}) ... P(w_n | w_{1:n-1})
               = \prod_{k=1}^{n} P(w_k | w_{1:k-1})

The chain rule shows the link between computing the joint probability of a sequence and computing the conditional probability of a word given previous words. The equations above suggest that we could estimate the joint probability of an entire sequence of words by multiplying together a number of conditional probabilities. But using the chain rule doesn't really seem to help us! We don't know any way to compute the exact probability of a word given a long sequence of preceding words, P(w_n | w_{1:n-1}). As we said above, we can't just estimate by counting the number of times every word occurs following every long string, because language is creative and any particular context might have never occurred before!
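
As a small illustration of the chain-rule decomposition (ours, not from the text), the sketch below multiplies together conditional probabilities supplied by any model of P(w_k | w_{1:k-1}); the parameter name conditional_prob and the uniform toy distribution are assumptions made purely for the example.

    def sequence_probability(words, conditional_prob):
        """Chain rule: P(w_1 .. w_n) = product over k of P(w_k | w_1 .. w_{k-1})."""
        prob = 1.0
        for k, word in enumerate(words):
            history = words[:k]                # the preceding words w_1 .. w_{k-1}
            prob *= conditional_prob(word, history)
        return prob

    # Toy example: a (made-up) model that assigns every word probability 0.1
    # regardless of its history.
    print(sequence_probability("its water is so transparent".split(),
                               lambda w, h: 0.1))  # 0.1 ** 5, about 1e-05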

