Transcription of CS224n: Natural Language Processing with Deep Learning ...
CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part V
Language Models, RNN, GRU and LSTM
Course Instructors: Christopher Manning, Richard Socher
Authors: Milad Mohammadi, Rohit Mundra, Richard Socher, Lisa Wang, Amita Kamath
Winter 2019

Keyphrases: Language Models. RNN. Bi-directional RNN. Deep RNN. GRU.

Language models compute the probability of occurrence of a number of words in a particular sequence. The probability of a sequence of $m$ words $\{w_1, \ldots, w_m\}$ is denoted as $P(w_1, \ldots, w_m)$. Since the number of words coming before a word, $w_i$, varies depending on its location in the input document, $P(w_1, \ldots, w_m)$ is usually conditioned on a window of $n$ previous words rather than all previous words:

$$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n}, \ldots, w_{i-1}) \tag{1}$$

Equation 1 is especially useful for speech and translation systems when determining whether a word sequence is an accurate translation of an input sentence. In existing language translation systems, for each phrase or sentence translation, the software generates a number of alternative word sequences (e.g. {I have, I had, I has, me have, me had}) and scores them to identify the most likely translation.

In machine translation, the model chooses the best word ordering for an input phrase by assigning a goodness score to each output word sequence alternative. To do so, the model may choose between different word ordering or word choice alternatives. It would achieve this objective by running all word sequence candidates through a probability function that assigns each a score.
The sequence with the highest score is the output of the translation. For example, the machine would give a higher score to "the cat is small" compared to "small the is cat", and a higher score to "walking home after school" compared to "walking house after school".

n-gram Language Models

To compute the probabilities mentioned above, the count of each n-gram could be compared against the frequency of each word. This is called an n-gram Language Model. For instance, if the model takes bi-grams, the frequency of each bi-gram, calculated via combining a word with its previous word, would be divided by the frequency of the corresponding uni-gram.
Equations 2 and 3 show this relationship for bigram and trigram models.

$$p(w_2 \mid w_1) = \frac{\mathrm{count}(w_1, w_2)}{\mathrm{count}(w_1)} \tag{2}$$

$$p(w_3 \mid w_1, w_2) = \frac{\mathrm{count}(w_1, w_2, w_3)}{\mathrm{count}(w_1, w_2)} \tag{3}$$

The relationship in Equation 3 focuses on making predictions based on a fixed window of context (i.e. the $n$ previous words) used to predict the next word. But how long should the context be? In some cases, the window of past consecutive $n$ words may not be sufficient to capture the context. For instance, consider the sentence "As the proctor started the clock, the students opened their". If the window only conditions on the previous three words "the students opened their", the probabilities calculated based on the corpus may suggest that the next word be "books". However, if $n$ had been large enough to include the "proctor" context, the probability might have suggested "exam".
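The count ratios in Equations 2 and 3 can be tried out on a toy corpus. The sketch below is only an illustration: the helper names `ngram_counts` and `ngram_prob` and the three-sentence corpus are assumptions made up for this example, not part of the notes.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram (stored as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_prob(tokens, context, word):
    """Maximum-likelihood estimate count(context, word) / count(context)."""
    context = tuple(context)
    numerator = ngram_counts(tokens, len(context) + 1)[context + (word,)]
    denominator = ngram_counts(tokens, len(context))[context]
    if denominator == 0:
        return None  # context never seen, so the ratio is undefined
    return numerator / denominator

# Toy corpus invented for illustration only.
corpus = ("the students opened their books . "
          "the students opened their exams . "
          "the students closed their books .").split()

p_bigram = ngram_prob(corpus, ["their"], "books")             # Equation 2
p_trigram = ngram_prob(corpus, ["opened", "their"], "books")  # Equation 3
p_unseen = ngram_prob(corpus, ["proctor"], "exam")            # unseen context
```

Here "their" occurs three times and is followed by "books" twice, while "opened their" occurs twice and is followed by "books" once, which is what the two ratios reflect. The `None` returned for an unseen context corresponds to the case where the denominator count is zero.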
This leads us to two main issues with n-gram Language Models: sparsity and storage.

Sparsity problems with n-gram Language models

Sparsity problems with these models arise due to two issues.

Firstly, note the numerator of Equation 3. If $w_1$, $w_2$ and $w_3$ never appear together in the corpus, the probability of $w_3$ is 0. To solve this, a small $\delta$ could be added to the count for each word in the vocabulary. This is called smoothing.

Secondly, consider the denominator of Equation 3. If $w_1$ and $w_2$ never occurred together in the corpus, then no probability can be calculated for $w_3$. To solve this, we could condition on $w_2$ alone. This is called backoff. Increasing $n$ makes sparsity problems worse. Typically, $n$ is kept small for this reason.

Storage problems with n-gram Language models

We know that we need to store the count for all n-grams we saw in the corpus. As $n$ increases (or the corpus size increases), the model size increases as well.

Neural Language Model

The "curse of dimensionality" above was first tackled by Bengio et al in A Neural Probabilistic Language Model, which introduced the first large-scale deep learning for natural language processing model. This model learns a distributed representation of words, along with the probability function for word sequences expressed in terms of these representations.
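The two sparsity remedies discussed earlier (adding a small constant to every count, and conditioning on a shorter context when the full one was never seen) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the toy corpus and the default `delta=1.0` are hypothetical, and the backoff shown here simply shortens the context without the discounting that a complete scheme such as Katz backoff applies.

```python
from collections import Counter

def counts(tokens, n):
    """Count every contiguous n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_prob(tokens, context, word, delta=1.0):
    """Add-delta estimate: (count(context, word) + delta) / (count(context) + delta * |V|).

    Assumes a non-empty context.
    """
    vocab_size = len(set(tokens))
    context = tuple(context)
    numerator = counts(tokens, len(context) + 1)[context + (word,)] + delta
    denominator = counts(tokens, len(context))[context] + delta * vocab_size
    return numerator / denominator

def backoff_prob(tokens, context, word):
    """Drop the oldest context word until the remaining context has been seen."""
    context = tuple(context)
    while context and counts(tokens, len(context))[context] == 0:
        context = context[1:]
    if not context:
        # No usable context left: fall back to the unigram frequency.
        return counts(tokens, 1)[(word,)] / len(tokens)
    seen_together = counts(tokens, len(context) + 1)[context + (word,)]
    return seen_together / counts(tokens, len(context))[context]

# Toy corpus invented for illustration only.
corpus = ("the students opened their books . "
          "the students opened their exams .").split()

# The trigram "opened their the" never occurs, yet it gets non-zero probability.
p_smoothed_unseen = smoothed_prob(corpus, ["opened", "their"], "the")
# "proctor their" never occurs, so the estimate backs off to the context "their".
p_backoff = backoff_prob(corpus, ["proctor", "their"], "books")
```

With add-delta smoothing the estimates for a fixed context still sum to one over the vocabulary, which is a quick sanity check on the denominator.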
Figure 1 shows the corresponding neural network architecture. The input word vectors are used by both the hidden layer and the output layer. Equation 4 represents Figure 1 and shows the parameters of the softmax() function, consisting of the standard tanh() function (i.e. the hidden layer) as well as the linear function, $W^{(3)}x + b^{(3)}$, that captures all the previous $n$ input word vectors.

Figure 1: The first deep neural network architecture model for NLP presented by Bengio et al.

Figure 2: A simplified representation of Figure 1.

$$\hat{y} = \mathrm{softmax}\left(W^{(2)} \tanh\left(W^{(1)} x + b^{(1)}\right) + W^{(3)} x + b^{(3)}\right) \tag{4}$$

Note that the weight matrix $W^{(1)}$ is applied to the word vectors (solid green arrows in Figure 1), $W^{(2)}$ is applied to the hidden layer (also solid green arrow) and $W^{(3)}$ is applied to the word vectors (dashed green arrows).

A simplified version of this model can be seen in Figure 2, where the blue layer signifies concatenated word embeddings for the input words: $e = [e^{(1)}; e^{(2)}; e^{(3)}; e^{(4)}]$, the red layer signifies the hidden layer: $h = f(We + b_1)$, and the green output distribution is a softmax over the vocabulary: $\hat{y} = \mathrm{softmax}(Uh + b_2)$.
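To make Equation 4 concrete, the forward pass can be written out in a few lines of plain Python. This is a sketch under assumptions: the layer sizes, the random initialization and the helper names (`matvec`, `add`, `window_lm_forward`) are invented for illustration, and no training is shown.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]

def softmax(z):
    """Numerically stable softmax."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def window_lm_forward(x, W1, b1, W2, W3, b3):
    """Equation 4: softmax(W2 tanh(W1 x + b1) + W3 x + b3)."""
    hidden = [math.tanh(v) for v in add(matvec(W1, x), b1)]
    logits = add(matvec(W2, hidden), matvec(W3, x), b3)
    return softmax(logits)

# Hypothetical sizes: n context words of dimension d, a small hidden layer and vocabulary.
rng = random.Random(0)
d, n, hidden_size, vocab_size = 4, 3, 5, 6
x = [rng.uniform(-1, 1) for _ in range(n * d)]  # concatenated window of word vectors
W1, b1 = random_matrix(hidden_size, n * d, rng), [0.0] * hidden_size
W2 = random_matrix(vocab_size, hidden_size, rng)
W3, b3 = random_matrix(vocab_size, n * d, rng), [0.0] * vocab_size

y_hat = window_lm_forward(x, W1, b1, W2, W3, b3)  # distribution over the vocabulary
```

Note that the shapes of `W1` and `W3` depend on the window size `n`, so a longer context requires larger matrices.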
2 Recurrent Neural Networks (RNN)

Unlike the conventional translation models, where only a finite window of previous words would be considered for conditioning the language model, Recurrent Neural Networks (RNN) are capable of conditioning the model on all previous words in the corpus.

Figure 3: A Recurrent Neural Network (RNN). Three time-steps are shown.

Figure 3 shows the RNN architecture, where each vertical rectangular box is a hidden layer at a time-step, $t$. Each such layer holds a number of neurons, each of which performs a linear matrix operation on its inputs followed by a non-linear operation (such as tanh()). At each time-step, there are two inputs to the hidden layer: the output of the previous layer $h_{t-1}$, and the input at that timestep $x_t$.
The former input is multiplied by a weight matrix $W^{(hh)}$ and the latter by a weight matrix $W^{(hx)}$ to produce output features $h_t$, which are multiplied with a weight matrix $W^{(S)}$ and run through a softmax over the vocabulary to obtain a prediction output $\hat{y}$ of the next word (Equations 5 and 6). The inputs and outputs of each single neuron are illustrated in Figure 4.

$$h_t = \sigma\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right) \tag{5}$$

$$\hat{y}_t = \mathrm{softmax}\left(W^{(S)} h_t\right) \tag{6}$$

What is interesting here is that the same weights $W^{(hh)}$ and $W^{(hx)}$ are applied repeatedly at each timestep. Thus, the number of parameters the model has to learn is smaller and, most importantly, is independent of the length of the input sequence, thus defeating the curse of dimensionality!
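Equations 5 and 6, and the point about weight sharing, can be illustrated with a short plain-Python sketch. The sizes, random weights and function names (`rnn_step`, `rnn_forward`) below are assumptions chosen for illustration; the sigmoid is used as the non-linearity, and an all-zero initial hidden state is an arbitrary choice.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def softmax(z):
    """Numerically stable softmax."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(h_prev, x_t, W_hh, W_hx, W_s):
    """One time-step: Equation 5 for h_t, then Equation 6 for y_hat_t."""
    pre_activation = [a + b for a, b in zip(matvec(W_hh, h_prev), matvec(W_hx, x_t))]
    h_t = [sigmoid(v) for v in pre_activation]
    y_hat_t = softmax(matvec(W_s, h_t))
    return h_t, y_hat_t

def rnn_forward(xs, h0, W_hh, W_hx, W_s):
    """Apply the same three weight matrices at every time-step of the sequence."""
    h, outputs = h0, []
    for x_t in xs:
        h, y_hat_t = rnn_step(h, x_t, W_hh, W_hx, W_s)
        outputs.append(y_hat_t)
    return outputs, h

# Hypothetical sizes: word vectors of dimension d, hidden size D_h, vocabulary size |V|.
rng = random.Random(0)
d, hidden_size, vocab_size = 4, 5, 6

def rand(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

W_hh, W_hx, W_s = rand(hidden_size, hidden_size), rand(hidden_size, d), rand(vocab_size, hidden_size)
h0 = [0.0] * hidden_size

short_seq = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(3)]
long_seq = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(7)]
outs_short, _ = rnn_forward(short_seq, h0, W_hh, W_hx, W_s)
outs_long, h_final = rnn_forward(long_seq, h0, W_hh, W_hx, W_s)
```

The same `W_hh`, `W_hx` and `W_s` handle both the 3-step and the 7-step input, which is the sense in which the parameter count does not depend on sequence length.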
Figure 4: The inputs and outputs to a neuron of a RNN

Below are the details associated with each parameter in the network:

- $x_1, \ldots, x_{t-1}, x_t, x_{t+1}, \ldots, x_T$: the word vectors corresponding to a corpus with $T$ words.
- $h_t = \sigma(W^{(hh)} h_{t-1} + W^{(hx)} x_t)$: the relationship to compute the hidden layer output features at each time-step $t$.
  - $x_t \in \mathbb{R}^{d}$: input word vector at time $t$.
  - $W^{hx} \in \mathbb{R}^{D_h \times d}$: weights matrix used to condition the input word vector, $x_t$.
  - $W^{hh} \in \mathbb{R}^{D_h \times D_h}$: weights matrix used to condition the output of the previous time-step, $h_{t-1}$.
  - $h_{t-1} \in \mathbb{R}^{D_h}$: output of the non-linear function at the previous time-step, $t-1$. $h_0 \in \mathbb{R}^{D_h}$ is an initialization vector for the hidden layer at time-step $t = 0$.
  - $\sigma()$: the non-linearity function (sigmoid here).
- $\hat{y}_t = \mathrm{softmax}(W^{(S)} h_t)$: the output probability distribution over the vocabulary at each time-step $t$.
  Essentially, $\hat{y}_t$ is the next predicted word given the document context score so far (i.e. $h_{t-1}$) and the last observed word vector $x^{(t)}$. Here, $W^{(S)} \in \mathbb{R}^{|V| \times D_h}$ and $\hat{y} \in \mathbb{R}^{|V|}$, where $|V|$ is the size of the vocabulary.

Figure 5: An RNN Language Model

An example of an RNN Language model is shown in Figure 5. The notation in this image is slightly different: here, the equivalent of $W^{(hh)}$ is $W_h$ and $W^{(hx)}$ is $W_e$; $W^{(S)}$ is likewise renamed, and word inputs $x^{(t)}$ are converted to word embeddings $e^{(t)}$. The final softmax over the vocabulary shows us the probability of various options for token $x^{(5)}$, conditioned on all previous tokens. The input could be much longer.

Loss and Perplexity

The loss function used in RNNs is often the cross entropy error introduced in earlier notes. Equation 7 shows this function as the sum over the entire vocabulary at time-step $t$.

$$J^{(t)}(\theta) = -\sum_{j=1}^{|V|} y_{t,j} \log(\hat{y}_{t,j}) \tag{7}$$

The cross entropy error over a corpus of size $T$ is:

$$J = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{|V|} y_{t,j} \log(\hat{y}_{t,j}) \tag{8}$$

Equation 9 is called the perplexity relationship; it is 2 raised to the power of the cross entropy error $J$ shown in Equation 8, which is itself an average negative log probability.
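As a small numerical check of Equations 7 and 8 and of the perplexity relationship, the sketch below computes the loss for hand-made predictions. Several things here are assumptions rather than statements from the notes: the logarithm is taken base 2 so that raising 2 to the loss is consistent, the targets are one-hot vectors, and the function names and example distributions are invented.

```python
import math

def step_loss(y_true, y_hat):
    """Equation 7 for one time-step, using log base 2 (an assumption).

    Terms with y_true == 0 contribute nothing and are skipped, which also
    avoids evaluating log(0) for words the model gives zero probability.
    """
    return -sum(y * math.log2(p) for y, p in zip(y_true, y_hat) if y > 0)

def corpus_loss(targets, predictions):
    """Equation 8: average of the per-step losses over T time-steps."""
    return sum(step_loss(y, p) for y, p in zip(targets, predictions)) / len(targets)

def perplexity(targets, predictions):
    """2 raised to the corpus cross entropy."""
    return 2 ** corpus_loss(targets, predictions)

def one_hot(index, size):
    return [1.0 if j == index else 0.0 for j in range(size)]

# Hypothetical vocabulary of 4 words and a 3-step corpus.
targets = [one_hot(0, 4), one_hot(2, 4), one_hot(1, 4)]

# A model that always predicts the uniform distribution.
uniform_predictions = [[0.25, 0.25, 0.25, 0.25]] * 3
ppl_uniform = perplexity(targets, uniform_predictions)

# A model that puts probability 0.7 on the correct word at every step.
confident_predictions = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.7, 0.1, 0.1],
]
ppl_confident = perplexity(targets, confident_predictions)
```

A uniform model over 4 words has a perplexity of 4, matching the intuition that it is as uncertain as a fair choice among 4 options, while the more confident model scores lower.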