
BLEU: a Method for Automatic Evaluation of Machine …

BLEU: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu
IBM T. J. [...]

[...] months to finish and involve [...] propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, [...] present this method as an automated understudy to [...] (MT) weigh many aspects of translation, including adequacy, fidelity, and fluency of the translation (Hovy, 1999; White and O'Connell, 1994). A comprehensive catalog of MT evaluation techniques and their rich literature is given by Reeder (2001).

[...] translator and a standard (poor) machine translation system using 4 reference translations for each of 127 source sentences. The average precision results are shown in Figure 1.

Figure 1: Distinguishing Human from Machine

The strong signal differentiating human (high precision) from machine (low precision) is striking.



For the most part, these various human evaluation approaches are quite expensive (Hovy, 1999). Moreover, they can take weeks or months to [...] believe that MT progress stems from evaluation and that there is a logjam of fruitful research ideas waiting to be released from [...]

Footnote 1: So we call our method the bilingual evaluation understudy, [...]

[...] automatic evaluation that is quick, language-independent, [...] machine translation is to a professional human translation, the better it [...] judge the quality of a machine translation, one measures its closeness to one or more reference human translations according to a [...], our MT evaluation system requires two [...] numerical "translation closeness" [...] corpus of good quality human reference translations.

We fashion our closeness metric after the highly successful word error rate metric used by the speech recognition community, appropriately modified for multiple reference translations and allowing for legitimate differences in word choice and word order.

The main idea is to use a [...] gives rise to a [...] have selected a [...], we evaluate the performance of BLEU. In Section 4, we describe a [...], we compare our baseline metric performance with human evaluations.

Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311-318.

2 The Baseline BLEU Metric

Typically, there are many "perfect" [...] good translation from a [...], consider these two candidate translations of a [...]

Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the [...]
Candidate 2: It is to insure the troops forever hearing the [...]

[...] appear to be on the same subject, they differ markedly in quality.

For comparison, [...]

Reference 1: It is a guide to action that ensures that the [...]
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of [...]
Reference 3: It is the practical guide for the army always to heed the directions of the [...]

[...] is clear that the good translation, Candidate 1, shares many words and phrases with these three reference translations, while Candidate 2 [...] that Candidate 1 shares "It is a guide to action" with Reference 1, "which" with Reference 2, "ensures that the military" with Reference 1, "always" with References 2 and 3, "commands" with Reference 1, and finally "of the party" with Reference 2 (all ignoring capitalization).

In contrast, Candidate 2 exhibits far fewer matches, and their extent is [...] is clear that a program can rank Candidate 1 higher than Candidate 2 [...] show that this ranking ability is a general phenomenon, and not an artifact of a few toy [...] BLEU implementor is [...]

[...] compute precision, one simply counts up the number of candidate translation words (unigrams) which occur in any [...], MT systems can overgenerate "reasonable" words, resulting in improbable, but high-precision, translations like that of example 2 below. Intuitively the problem is clear: a reference word should be considered exhausted after a matching candidate word is [...] formalize this intuition as the modified unigram precision.

To compute this, one first counts the maximum number of times a [...], one clips the total count of each candidate word by its maximum reference count (see Footnote 2), adds these clipped counts up, and divides by the total (unclipped) [...]

Example 2.
Candidate: the the the the the the the
Reference 1: The cat is on [...]
Reference 2: There is a cat on the [...]

[...], Candidate 1 achieves a modified unigram precision of 17/18; whereas Candidate 2 achieves a modified unigram precision of 8/[...]. [...], the modified unigram precision in Example 2 is 2/7, even though its standard unigram precision is 7/[...]

Footnote 2: $Count_{clip} = \min(Count, Max\_Ref\_Count)$. In other words, one truncates each word's count, if necessary, to not exceed the largest count observed in any [...]

[...] guide to the eye, we have [...]

[...] computed similarly for any n: [...], summed, [...], Candidate 1 achieves a modified bigram precision of 10/17, whereas the lower quality Candidate 2 achieves a modified bigram precision of 1/[...]. [...], the (implausible) candidate achieves a [...] aspects of translation: adequacy and fluency.
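The clipping procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `ngrams` and `modified_precision` are made up for this sketch, tokens are lowercased by hand to mirror "ignoring capitalization", and the two reference sentences of Example 2 are truncated in the text, so the ending "the mat" is an assumption.

```python
from collections import Counter
from fractions import Fraction


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    # Largest count of each n-gram observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip each candidate count by its maximum reference count, then sum.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return Fraction(clipped, total) if total else Fraction(0)


# Example 2. ASSUMPTION: the reference endings ("the mat") are not in the
# text above, which truncates both sentences.
candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]

# Standard (unclipped) unigram precision: every candidate word found in any
# reference counts as a match.
reference_vocab = {token for ref in references for token in ref}
standard = Fraction(sum(w in reference_vocab for w in candidate), len(candidate))

print(modified_precision(candidate, references, 1))  # 2/7
print(standard)                                      # 1
```

Under these assumptions the sketch reproduces the 2/7 quoted for Example 2, while the unclipped precision is a perfect score for an obviously bad candidate. Passing `n=2` applies the same clipping to bigrams.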

A translation using the same words (1-grams) [...]

[...] multi-sentence test set? Although one typically evaluates MT systems on a corpus of entire documents, our basic unit of evaluation is [...] source sentence may translate to many target sentences, in which case we abuse terminology and refer to the corresponding target sentences as a "sentence." We [...], we add the clipped n-gram counts for all the candidate sentences and divide by the number of candidate n-grams in the test corpus to compute a modified precision score, $p_n$, [...]

$$p_n = \frac{\sum_{C \in \{Candidates\}} \sum_{n\text{-gram} \in C} Count_{clip}(n\text{-gram})}{\sum_{C' \in \{Candidates\}} \sum_{n\text{-gram}' \in C'} Count(n\text{-gram}')}$$

Footnote 4: BLEU only needs to match human judgment when averaged over a test corpus; [...], a system which produces the fluent phrase "East Asian economy" is penalized heavily on the longer n-gram precisions if all the references happen to read "economy of East Asia."
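The corpus-level score pools counts over all candidate sentences before dividing, rather than averaging per-sentence precisions. A minimal sketch of that pooling follows; the function names and the two-sentence toy corpus are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_modified_precision(candidates, references_per_candidate, n):
    """Pooled clipped n-gram precision over a whole test corpus."""
    clipped_total = 0    # numerator: sum of clipped counts over all sentences
    candidate_total = 0  # denominator: all candidate n-grams in the corpus
    for cand, refs in zip(candidates, references_per_candidate):
        cand_counts = ngram_counts(cand, n)
        max_ref = Counter()
        for ref in refs:
            max_ref |= ngram_counts(ref, n)  # element-wise maximum
        # Counter intersection is the element-wise minimum, i.e. the clipping.
        clipped_total += sum((cand_counts & max_ref).values())
        candidate_total += sum(cand_counts.values())
    return clipped_total / candidate_total if candidate_total else 0.0


# Hypothetical toy corpus: two candidate sentences, each with its own references.
candidates = ["the the the the".split(), "a cat sat".split()]
references_per_candidate = [
    ["the cat".split()],
    ["a cat sat down".split(), "the cat sat".split()],
]

p1 = corpus_modified_precision(candidates, references_per_candidate, 1)
p2 = corpus_modified_precision(candidates, references_per_candidate, 2)
print(p1, p2)
```

On this toy data the pooled unigram precision is (1 + 3) / (4 + 3) = 4/7, whereas averaging the per-sentence values 1/4 and 3/3 would give 0.625. The pooled form weights each sentence by its number of n-grams.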

The key to BLEU's success is that all systems are treated similarly and multiple human translators with different styles are used, [...]

[...] verify that modified n-gram precision distinguishes between very good translations and bad translations, we computed the modified precision numbers on the output of a (good) human translator and a standard (poor) machine translation system using 4 reference translations for each of 127 source sentences. The average precision results are shown in Figure 1.

Figure 1: Distinguishing Human from Machine

The strong signal differentiating human (high precision) from machine (low precision) is striking. [...] appears that any [...] be useful, however, the metric must also reliably distinguish between translations that do not differ so greatly in quality.

Furthermore, it must distinguish between two [...] this end, we obtained a human translation by someone lacking native proficiency in both the source (Chinese) and the target language (English). For comparison, we acquired human translations of the same documents by a native English speaker. [...] "systems" (two humans and three machines) are scored against two [...]

Figure 2: Machine and Human Translations

[...] ranking: H2 (Human-2) is better than H1 (Human-1), and there is a big drop in quality between H1 and S3 (Machine/System-3). [...], this is the same rank order assigned to these "systems" by human judges, as we discuss later. While there seems to be ample signal in any single n-gram precision, it is more robust to combine all these signals into a [...]

[...] weighted linear average of the modified precisions resulted in encouraging results for the 5 [...], as can be seen in Figure 2, the modified n-gram precision decays roughly exponentially with n: the modified unigram precision is much larger than the modified bigram precision which in turn is [...] this exponential decay into account; a [...], which is [...]
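One averaging scheme that accounts for roughly exponential decay is to average the logarithms of the modified precisions, which amounts to a geometric mean. The sketch below shows only that averaging step under stated assumptions: uniform weights by default, a hypothetical function name `combine_precisions`, and made-up precision values that halve with each n-gram order. It is not a full scoring function.

```python
import math


def combine_precisions(precisions, weights=None):
    """Weighted geometric mean: exp(sum_n w_n * log p_n)."""
    if weights is None:
        # ASSUMPTION: uniform weights 1/N over the N n-gram orders.
        weights = [1.0 / len(precisions)] * len(precisions)
    # A single vanishing precision drives the geometric mean to zero.
    if any(p <= 0 for p in precisions):
        return 0.0
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))


# Hypothetical modified precisions for n = 1..4, halving at each step.
precisions = [0.8, 0.4, 0.2, 0.1]

geometric = combine_precisions(precisions)
arithmetic = sum(precisions) / len(precisions)
print(round(geometric, 3), round(arithmetic, 3))  # 0.283 0.375
```

With these made-up values the arithmetic mean is dominated by the unigram term, while the geometric mean gives each order equal say on a log scale. The explicit zero check makes visible how harshly this average treats any precision that vanishes.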

Experimentally, we obtain the best correlation with mono[...] maximum n-gram order of 4, although 3-grams and 5-grams give [...]

Footnote 5: The geometric average is harsh if any of the modified precisions vanish, but this should be an extremely rare event in test corpora of reasonable size (for $N_{max} \le 4$).

[...] some extent, [...], modified precision is penalized if a word occurs more frequently in a [...] word as many times as warranted and penalizes using a word more times than it occurs in any [...], modified n-gram precision alone fails to enforce the proper translation length, as is illustrated in the short, [...]

Candidate: of the
Reference 1: It is a guide to action that ensures that the [...]
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of [...]
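The length problem can be reproduced numerically. The sketch below is illustrative only: the references are truncated in the text above, so the code uses a single hypothetical reference sentence that happens to contain the phrase "of the", and `clipped_precision` is a name made up for this example.

```python
from collections import Counter


def clipped_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against its references."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = grams(candidate)
    max_ref = Counter()
    for ref in references:
        max_ref |= grams(ref)  # element-wise maximum over references
    total = sum(cand.values())
    # Counter intersection takes the element-wise minimum (the clipping step).
    return sum((cand & max_ref).values()) / total if total else 0.0


candidate = "of the".split()
# HYPOTHETICAL reference, not taken from the text, which truncates the real ones.
references = ["it is the guiding principle under the command of the party".split()]

unigram = clipped_precision(candidate, references, 1)
bigram = clipped_precision(candidate, references, 2)
length_ratio = len(candidate) / len(references[0])
print(unigram, bigram, round(length_ratio, 2))  # 1.0 1.0 0.18
```

Both precisions come out perfect even though the candidate covers less than a fifth of the reference length, which is why precision alone cannot enforce a proper translation length.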

