Statistical Machine Translation of French and German into English Using IBM Model 2 Greedy Decoding

Michael Turitzin
Department of Computer Science
Stanford University, Stanford, CA

Abstract

The job of a decoder in statistical machine translation is to find the most probable translation of a given sentence, as defined by a set of previously learned parameters. Because the search space of potential translations is essentially infinite, there is always a trade-off between accuracy and speed when designing a decoder. Germann et al. [4] recently presented a fast, greedy decoder that starts with an initial guess and then refines that guess through small mutations that produce more probable translations. The greedy decoder in [4] was designed to work with the IBM Model 4 translation model, which, while being a sophisticated model of the translation process, is also quite complex and therefore difficult to implement and fairly slow in training and decoding. We present modifications to the greedy decoder presented in [4] that allow it to work with the simpler and more efficient IBM Model 2.

We have tested our modified decoder by having it translate equivalent French and German sentences into English, and we present the results and translation accuracies that we have obtained. Because we are interested in the relative effectiveness of our decoder in translating between different languages, we discuss the discrepancies between the results we obtained when performing French-to-English and German-to-English translation, and we speculate on the factors inherent to these languages that may have contributed to these discrepancies.

1 Introduction and Related Work

The goal of statistical machine translation is to translate a sentence in one language, say French, into another language, say English. The statistical machine translation process is typically divided into three parts: a language model, a translation model, and a decoder. A language model assigns probabilities to English strings, a translation model assigns probabilities to English-French sentence pairs based on the likelihood that one is a translation of the other, and a decoder attempts to find the English sentence for which the product of the language and translation model probabilities is highest.

The details and derivation of this type of machine translation system are described in Section 2. Much research has been done on language models, but we will not discuss such work here, as it is only tangential to our work. Our language model, which is described in Section 2.1, was chosen for its favorable balance between simplicity, resource requirements, and effectiveness. Brown et al. [2] introduced a set of translation models that have since been put into use by many researchers in machine translation. These models are generative in that they model a process that generates a French sentence from an English one; to translate from French to English, it is necessary to reverse this process and determine which English sentence was most likely to have generated the French one that is to be translated. This generative process is related to the noisy channel model of translation, which postulates (falsely) that French speakers have in mind an English sentence that is garbled in some way when spoken so as to become a French sentence. We employ the second of the models presented in [2], IBM Model 2, and we have adapted the greedy decoder presented in [4] to work with this model.

Brown et al. did not include a decoding algorithm in their original paper, and their only public work to date on the subject was published in the form of a patent application [3], which describes a priority-queue (stack) based IBM Model 3 decoder. Priority-queue based decoders typically enumerate a large subset of possible decodings, forming hypotheses that are inserted into priority queues and ordered based on heuristics. Such decoders are typically extremely slow, especially for longer sentences (e.g., sentences over 15 words in length). Wang and Waibel [7] have presented a similar priority-queue based decoder that was designed to work with IBM Model 2, and Jahr [5] has presented a sophisticated priority-queue based decoder designed to work with IBM Model 4. Germann et al. [4] recently presented an entirely new type of decoder: a decoder that is greedy in the sense that it very aggressively prunes paths from the search space, only following the search direction that is locally best, as determined by a number of simple sentence mutation operations.
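
To make the greedy search concrete, the following is a minimal Python sketch of the hill-climbing loop this style of decoder uses. The helper names (initial_guess, mutations, score) are hypothetical stand-ins; the actual mutation operations of [4], and of our decoder in Section 3, are more involved.

```python
# Hypothetical helpers: initial_guess builds a first (e, a) hypothesis
# (e.g., a word-for-word gloss of f), mutations yields slightly modified
# hypotheses, and score(e, a, f) returns log P(e) + log P(a, f | e).

def greedy_decode(f, initial_guess, mutations, score):
    """Hill-climb from an initial hypothesis until no mutation improves it."""
    e, a = initial_guess(f)
    best = score(e, a, f)
    improved = True
    while improved:
        improved = False
        # Consider every small mutation of the current hypothesis and keep
        # only the single locally best one, pruning all other search paths.
        for e2, a2 in mutations(e, a, f):
            s = score(e2, a2, f)
            if s > best:
                e, a, best = e2, a2, s
                improved = True
    return e, a
```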

Germann et al. showed their greedy decoder to be extremely fast in comparison to stack-based decoders and only marginally less accurate. We have chosen to base our work on this decoder. To our knowledge, no previous work has been done in analyzing the relative effectiveness of a particular decoder in obtaining translations between different languages.

In Section 2 of our paper, we describe the machine translation process and our choices of language model and translation model. In Section 3, we describe our IBM Model 2 greedy decoder, which is a modified version of the IBM Model 4 greedy decoder presented in [4]. In Section 4, we describe several experiments we have performed to test our decoder's ability to translate French and German sentences into English, and we present the translation accuracies we were able to obtain. Finally, in Section 5, we discuss the issue of decoding errors and look at language-specific issues that may have contributed to the discrepancy in accuracy that we observed between translations from French and translations from German in our experiments.

2 Statistical Machine Translation

Say we wish to translate French sentences into English. Let $f$ be a French sentence and $e$ be a possible English translation; we therefore wish to find the most likely English translation

$$\hat{e} = \operatorname{argmax}_e P(e \mid f), \tag{1}$$

where $P(e \mid f)$ is the probability that a given French sentence $f$ was produced from an English sentence $e$ (under the noisy channel model of translation).

By Bayes' theorem, we can rewrite $P(e \mid f)$ as

$$P(e \mid f) = \frac{P(e)\,P(f \mid e)}{P(f)}. \tag{2}$$

Thus, equation (1) becomes:

$$\hat{e} = \operatorname{argmax}_e \frac{P(e)\,P(f \mid e)}{P(f)}. \tag{3}$$

We can think of $P(e)$ (or $P(f)$) as the probability that an English (or French) speaker utters the phrase $e$ (or $f$) out of all possible utterances. Because $P(f)$ is independent of $e$, we can simplify equation (3) to obtain what the authors of [2] refer to as the Fundamental Equation of Machine Translation:

$$\hat{e} = \operatorname{argmax}_e P(e)\,P(f \mid e). \tag{4}$$

Thus, in order to evaluate a particular English sentence translation candidate $e$, we must have some way of computing $P(e)$ and $P(f \mid e)$. The computation of $P(e)$, which is done by a language model, will be discussed in Section 2.1. Because we are using a word alignment model (specifically, IBM Model 2) for translation modeling, we do not explicitly compute $P(f \mid e)$. As will be discussed in Section 2.2, we instead generate an alignment $a$ along with each English sentence $e$ and compute

$$(\hat{e}, \hat{a}) = \operatorname{argmax}_{e,\,a} P(e)\,P(a, f \mid e). \tag{5}$$

The concept of an alignment between a French and English sentence will be defined in Section 2.2.

2.1 The Language Model

We have chosen to use a trigram mixture model (with interpolation) as our language model for computing $P(e)$ for a given English sentence $e$.
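
As a small illustration of how equation (5) is applied in practice, the sketch below scores candidate $(e, a)$ pairs in log space to avoid numerical underflow. The lm_logprob and tm_logprob interfaces are assumptions, standing in for the models of Sections 2.1 and 2.2.

```python
# Hypothetical model interfaces: lm_logprob(e) ~ log P(e) from the language
# model; tm_logprob(a, f, e) ~ log P(a, f | e) from the translation model.

def candidate_score(e, a, f, lm_logprob, tm_logprob):
    # log[P(e) * P(a, f | e)] = log P(e) + log P(a, f | e)
    return lm_logprob(e) + tm_logprob(a, f, e)

def best_candidate(candidates, f, lm_logprob, tm_logprob):
    # (e-hat, a-hat) = argmax over (e, a) of P(e) * P(a, f | e), per eq. (5)
    return max(candidates,
               key=lambda ea: candidate_score(ea[0], ea[1], f,
                                              lm_logprob, tm_logprob))
```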

To compute $P(e)$, we iterate through the words of $e$, multiplying together the conditional trigram probabilities $P_{CT}(w)$ for each word $w$ as we go. Say that we encounter a word $w_3$ after having seen the words $w_1$ and $w_2$ (in that order). We define the conditional trigram probability of $w_3$ to be:

$$P_{CT}(w_3) = \lambda_1 P_{abs}(w_3) + \lambda_2 P_{abs}(w_3 \mid w_2) + \lambda_3 P_{abs}(w_3 \mid w_1, w_2), \tag{6}$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are manually chosen interpolation parameters, and $P_{abs}(w_3)$, $P_{abs}(w_3 \mid w_2)$, and $P_{abs}(w_3 \mid w_1, w_2)$ are the unigram and conditional bigram and trigram probabilities of $w_3$ derived from English training sentences, using absolute discounting. In our absolute discounting smoothing scheme, all previously unseen words encountered in testing data are viewed as instances of one UNK token. Using the unigram case as an example, we compute $P_{abs}(w_3)$ as

$$P_{abs}(w_3) = \begin{cases} (r - b)/N & \text{if } r > 0 \\ |S|\,b/N & \text{otherwise,} \end{cases} \tag{7}$$

where $N$ is the total number of words seen in the training data, $S$ is the set of unique words seen in the training data, $C(w_3) = r$ is the number of times word $w_3$ appears in the training data, and $b$ is a manually adjusted discount parameter.
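
The following sketch shows one way equations (6) and (7) could be implemented. The interpolation weights and the discount value are placeholders rather than the paper's hand-tuned settings, and the bigram and trigram estimators are assumed to be built analogously to the unigram one.

```python
from collections import Counter

# Placeholder hyperparameters: these are assumptions, not the authors'
# actual hand-tuned values.
L1, L2, L3 = 0.2, 0.3, 0.5   # interpolation weights for equation (6)
B = 0.75                     # absolute-discount parameter b in equation (7)

def train_unigram(training_words):
    """Return the absolute-discounted unigram estimator of equation (7)."""
    counts = Counter(training_words)
    N = sum(counts.values())   # total word tokens in the training data
    S = len(counts)            # |S|: number of unique words seen in training
    def p_abs(w):
        r = counts.get(w, 0)
        if r > 0:
            return (r - B) / N # discount the count of every seen word
        return S * B / N       # reserved mass, assigned to the UNK token
    return p_abs

def p_ct(w3, w1, w2, p_uni, p_bi, p_tri):
    """Equation (6): interpolate unigram, bigram, and trigram estimates."""
    return L1 * p_uni(w3) + L2 * p_bi(w3, w2) + L3 * p_tri(w3, w1, w2)
```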

2.2 The Translation Model: IBM Model 2

The goal of a translation model is to compute the probability $P(f \mid e)$ of a French sentence $f$ being the translation of a given English sentence $e$ (or, under the noisy channel model, having been produced from $e$). The IBM translation models (of which there are five) rely on the concept of sentence alignment to compute translation probabilities. An alignment $a$ between a French sentence $f$ and an English sentence $e$ is the mapping between words in the two sentences; if an English word $e_i$ and a French word $f_j$ are aligned, we say that $f_j$ was produced by $e_i$ in the translation process. We define the length of $e$ to be $l$ and the length of $f$ to be $m$; thus, $i = 0, 1, 2, \ldots, l$ and $j = 1, 2, \ldots, m$. We define there to be an imaginary word called NULL at the 0th position of $e$, which is why $i$ may equal 0. If a French word is aligned to the NULL English word, that French word is said to have been spontaneously generated; that is, no word in the English sentence generated it during the translation process. An alignment $a$ corresponding to a French-English sentence pair $(f, e)$ is a vector of alignment indices $a_j$ mapping words of the French sentence to words of the English sentence (or the NULL English word).

Each $a_j$ (where $j = 1, 2, \ldots, m$) takes one of the values $0, 1, 2, \ldots, l$. Thus, if (for example) $a_2 = 3$, then the French sentence word $f_2$ is aligned to the English sentence word $e_3$. In general, the French sentence word $f_j$ is aligned to the English sentence word $e_{a_j}$. Because the IBM models deal with alignments, it is much more efficient to compute $P(a, f \mid e)$ than it is to compute $P(f \mid e)$ (which equals $\sum_a P(a, f \mid e)$). We therefore attempt to find the most probable sentence and alignment pair $(\hat{e}, \hat{a})$ for a translation, rather than just the most probable sentence $\hat{e}$ (as described in Section 2). We have chosen to use IBM Model 2 because it is quite a bit simpler and more efficient than the higher IBM models, yet it is still reasonably effective in modeling sentence translation. IBM Model 2 defines $P(a, f \mid e)$ according to the following equation:

$$P(a, f \mid e) = \prod_{j=1}^{m} P(a_j \mid j, m, l)\,P(f_j \mid e_{a_j}), \tag{8}$$

where $P(a_j \mid j, m, l)$ is an alignment probability (specifically, the probability that, for a French sentence of length $m$ and an English sentence of length $l$, the $j$th French sentence word will align to the $a_j$th English sentence word) and $P(f_j \mid e_{a_j})$ is a translation probability (specifically, the probability that a particular French word $f_j$ is the translation of a particular English word $e_{a_j}$). IBM Model 2 is trained on a set of French-English sentence translation pairs.
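
As an illustration, the sketch below evaluates equation (8) in log space for a given alignment. The align_p and trans_p tables are assumed to have been estimated beforehand (see the EM discussion that follows).

```python
import math

# Assumed inputs: align_p[(i, j, m, l)] holds P(a_j = i | j, m, l), and
# trans_p[(fj, ei)] holds P(fj | ei); both are estimated during training.

def model2_logprob(f, e, a, align_p, trans_p):
    """log P(a, f | e) per equation (8); e[0] is the NULL word."""
    m, l = len(f), len(e) - 1      # l excludes the NULL word at e[0]
    logp = 0.0
    for j in range(1, m + 1):      # j ranges over French word positions
        i = a[j - 1]               # a_j, in 0..l (0 means NULL)
        logp += math.log(align_p[(i, j, m, l)])      # alignment probability
        logp += math.log(trans_p[(f[j - 1], e[i])])  # translation probability
    return logp
```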

The expectation maximization (EM) algorithm, which works in an iterative fashion, is used to estimate the optimal value for each alignment and translation probability. Each iteration of the EM algorithm computes the expected number of co-occurrences of a particular pair of French and English words (or of a type of alignment) and compares this value to the actual number of co-occurrences (or alignments). The probabilities are then adjusted to account for the differences in these numbers. A thorough discussion of how the EM algorithm works for training IBM Model 2 is beyond the scope of this paper, and we refer the reader to [2] for a full treatment of the subject. The net effect of the IBM Model 2 translation model is that unlikely translations, such as dog translating to banane (English: banana), and unlikely alignments, such as those involving sentences of widely varying lengths, are penalized. Certain anomalies are still not penalized, however; for example, under IBM Model 2 it is just as likely for a particular English word to align with 10 French words as it is for it to align with 1 French word.
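
For concreteness, here is a rough sketch of one EM iteration for Model 2 in the spirit of the procedure described above; initialization (e.g., uniform t and q tables) and convergence checking are omitted, and the bookkeeping follows [2] only loosely.

```python
from collections import defaultdict

# Assumed tables from the previous iteration (or uniform initialization):
# t[(fw, ew)] ~ P(fw | ew); q[(i, j, m, l)] ~ P(a_j = i | j, m, l).

def em_iteration(pairs, t, q):
    """pairs: list of (f, e) with e[0] = NULL. Returns updated (t, q)."""
    count_t, total_t = defaultdict(float), defaultdict(float)
    count_q, total_q = defaultdict(float), defaultdict(float)
    for f, e in pairs:
        m, l = len(f), len(e) - 1
        for j in range(1, m + 1):
            fj = f[j - 1]
            # Posterior over which English word generated f_j; in Model 2
            # this factorizes, so each position is handled independently.
            denom = sum(q[(i, j, m, l)] * t[(fj, e[i])] for i in range(l + 1))
            for i in range(l + 1):
                post = q[(i, j, m, l)] * t[(fj, e[i])] / denom
                count_t[(fj, e[i])] += post    # expected word co-occurrences
                total_t[e[i]] += post
                count_q[(i, j, m, l)] += post  # expected alignment counts
                total_q[(j, m, l)] += post
    # M-step: renormalize the expected counts into probabilities.
    t = {fe: c / total_t[fe[1]] for fe, c in count_t.items()}
    q = {ijml: c / total_q[ijml[1:]] for ijml, c in count_q.items()}
    return t, q
```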

