
Collocations - Stanford University


[...] rest of this chapter, we will use a stop list that excludes words whose most frequent tag is not a verb, noun or adjective.

Exercise 5-1: Add part-of-speech patterns useful for collocation discovery to Table 5.2, including patterns longer than two tags.

2. This search was performed on AltaVista on March 28, 1998.


DRAFT! © January 7, 1999 Christopher Manning & Hinrich Schütze.

A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. Or in the words of Firth (1957: 181): "Collocations of a given word are statements of the habitual or customary places of that word." Collocations include noun phrases like strong tea and weapons of mass destruction, phrasal verbs like to make up, and other stock phrases like the rich and powerful. Particularly interesting are the subtle and not-easily-explainable patterns of word usage that native speakers all know: why we say a stiff breeze but not ??a stiff wind (while either a strong breeze or a strong wind is okay), or why we speak of broad daylight (but not ?bright daylight or ??narrow darkness).

Collocations are characterized by limited compositionality.

We call a natural language expression compositional if the meaning of the expression can be predicted from the meaning of the parts. Collocations are not fully compositional in that there is usually an element of meaning added to the combination. In the case of strong tea, strong has acquired the meaning 'rich in some active agent', which is closely related to, but slightly different from, the basic sense 'having great physical strength'. Idioms are the most extreme examples of non-compositionality. Idioms like to kick the bucket or to hear it through the grapevine only have an indirect historical relationship to the meanings of the parts of the expression. We are not talking about buckets or grapevines literally when we use these idioms. Most collocations exhibit milder forms of non-compositionality, like the expression international best practice that we used as an example earlier in this book.

It is very nearly a systematic composition of its parts, but still has an element of added meaning. It usually refers to administrative efficiency and would, for example, not be used to describe a cooking technique although that meaning would be compatible with its literal meaning.

There is considerable overlap between the concept of collocation and notions like term, technical term, and terminological phrase. As these names suggest, the latter three are commonly used when collocations are extracted from technical domains (in a process called terminology extraction). The reader should be warned, though, that the word term has a different meaning in information retrieval. There, it refers to both words and phrases.

So it subsumes the narrower meaning that we will use in this chapter.

Collocations are important for a number of applications: natural language generation (to make sure that the output sounds natural and mistakes like powerful tea or to take a decision are avoided), computational lexicography (to automatically identify the important collocations to be listed in a dictionary entry), parsing (so that preference can be given to parses with natural collocations), and corpus linguistic research (for instance, the study of social phenomena like the reinforcement of cultural stereotypes through language (Stubbs 1996)).

There is much interest in collocations partly because this is an area that has been neglected in structural linguistic traditions that follow Saussure and Chomsky.

There is, however, a tradition in British linguistics, associated with the names of Firth, Halliday, and Sinclair, which pays close attention to phenomena like collocations. Structural linguistics concentrates on general abstractions about the properties of phrases and sentences. In contrast, Firth's Contextual Theory of Meaning emphasizes the importance of context: the context of the social setting (as opposed to the idealized speaker), the context of spoken and textual discourse (as opposed to the isolated sentence), and, important for collocations, the context of surrounding words (hence Firth's famous dictum that a word is characterized by the company it keeps). These contextual features easily get lost in the abstract treatment that is typical of structural linguistics.

A good example of the type of problem that is seen as important in this contextual view of language is Halliday's example of strong vs. powerful tea (Halliday 1966: 150). It is a convention in English to talk about strong tea, not powerful tea, although any speaker of English would also understand the latter unconventional expression. Arguably, there are no interesting structural properties of English that can be gleaned from this contrast. However, the contrast may tell us something interesting about attitudes towards different types of substances in our culture (why do we use powerful for drugs like heroin, but not for cigarettes, tea and coffee?) and it is obviously important to teach this contrast to students who want to learn idiomatically correct English. Social implications of language use and language teaching are just the type of problem that British linguists following a Firthian approach are interested in.

In this chapter, we will introduce the principal approaches to finding collocations: selection of collocations by frequency, selection based on mean and variance of the distance between focal word and collocating word, hypothesis testing, and mutual information.

We will then return to the question of what a collocation is and discuss in more depth different definitions that have been proposed and tests for deciding whether a phrase is a collocation or not. The chapter concludes with further readings and pointers to some of the literature that we were not able to include.

The reference corpus we will use in examples in this chapter consists of four months of the New York Times newswire: from August through November of 1990. This corpus has about 115 megabytes of text and roughly 14 million words. Each approach will be applied to this corpus to make comparison easier. For most of the chapter, the New York Times examples will only be drawn from fixed two-word phrases (or bigrams). It is important to keep in mind, however, that we chose this pool for convenience only.

In general, both fixed and variable word combinations can be collocations. Indeed, the section on mean and variance looks at the more loosely connected [...].

The simplest method for finding collocations in a text corpus is counting. If two words occur together a lot, then that is evidence that they have a special function that is not simply explained as the function that results from their combination.

Just selecting the most frequently occurring bigrams is not very interesting, as is shown in Table 5.1. The table shows the bigrams (sequences of two adjacent words) that are most frequent in the corpus and their frequency. Except for New York, all the bigrams are pairs of function words.

There is, however, a very simple heuristic that improves these results a lot (Justeson and Katz 1995b): pass the candidate phrases through a part-of-speech filter which only lets through those patterns that are likely to be phrases.
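The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `bigram_counts` and the toy sentence are assumptions made for the example, and whitespace splitting stands in for real tokenization of the New York Times corpus.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent word pairs (bigrams) in a list of tokens."""
    # zip pairs each token with its right-hand neighbour.
    return Counter(zip(tokens, tokens[1:]))


# Toy text (an assumption for illustration, not the chapter's corpus).
text = "the cat sat on the mat and the cat ate the fish in New York"
tokens = text.split()
counts = bigram_counts(tokens)

# List the most frequent bigrams with their counts, C(w1 w2).
for (w1, w2), c in counts.most_common(3):
    print(c, w1, w2)
```

Even on this toy input the most frequent bigram starts with a function word, which is the weakness of raw frequency that the text points out.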

Justeson and Katz (1995b: 17) suggest the patterns in Table 5.2. Each is followed by an example from the text that they use as a test set. In these patterns A refers to an adjective, P to a preposition, and N to a noun.

Table 5.3 shows the most highly ranked phrases after applying the filter. The results are surprisingly good. There are only 3 bigrams that we would not regard as non-compositional phrases: last year, last week, and first [...].

1. Similar ideas can be found in (Ross and Tukey 1975) and (Kupiec et al. 1995).

C(w1 w2)   w1     w2
80871      of     the
58841      in     the
26430      to     the
21842      on     the
21839      for    the
18568      and    the
16121      that   the
15630      at     the
15494      to     be
13899      in     a
13689      of     a
13361      by     the
13183      with   the
12622      from   the
11428      New    York
10007      he     said
9775       as     a
9231       is     a
8753       has    been
8573       for    a

Table 5.1 Collocations: raw frequency. C(·) is the frequency of something in the corpus.

Tag Pattern   Example
A N           linear function
N N           regression coefficients
A A N         Gaussian random variable
A N N         cumulative distribution function
N A N         mean squared error
N N N         class probability function
N P N         degrees of freedom

Table 5.2 Part of speech tag patterns for collocation filtering.

These patterns were used by Justeson and Katz to identify likely collocations among frequently occurring word sequences.

C(w1 w2)   w1          w2          tag pattern
11487      New         York        A N
7261       United      States      A N
5412       Los         Angeles     N N
3301       last        year        A N
3191       Saudi       Arabia      N N
2699       last        week        A N
2514       vice        president   A N
2378       Persian     Gulf        A N
2161       San         Francisco   N N
2106       President   Bush        N N
2001       Middle      East        A N
1942       Saddam      Hussein     N N
1867       Soviet      Union       A N
1850       White       House       A N
1633       United      Nations     A N
1337       York        City        N N
1328       oil         prices      N N
1210       next        year        A N
1074       chief       executive   A N
1073       real        estate      A N

Table 5.3 Collocations: Justeson and Katz' part-of-speech filter.

York City is an artefact of the way we have implemented the Justeson and Katz filter. The full implementation would search for the longest sequence that fits one of the part-of-speech patterns and would thus find the longer phrase New York City, which contains York City.

The twenty highest ranking phrases containing strong and powerful all have the form A N (where A is either strong or powerful).
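The difference between a bigram-only filter and the longest-match search can be sketched as follows. This is a hedged illustration rather than the authors' or Justeson and Katz's code: the function names, the single-letter tagset (with a hypothetical "X" for any other tag), and the hand-tagged toy sentence are all assumptions. Only the tag patterns are taken from Table 5.2.

```python
from collections import Counter

# Tag patterns from Table 5.2: A = adjective, N = noun, P = preposition.
PATTERNS = {"AN", "NN", "AAN", "ANN", "NAN", "NNN", "NPN"}
MAX_LEN = max(len(p) for p in PATTERNS)


def bigram_filter(tagged):
    """Count every adjacent word pair whose two tags form a pattern."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1 + t2 in PATTERNS:
            counts[w1 + " " + w2] += 1
    return counts


def longest_match_filter(tagged):
    """Greedy left-to-right scan keeping only the longest match."""
    counts = Counter()
    i = 0
    while i < len(tagged):
        step = 1
        # Try the longest window first, then shorter ones.
        for n in range(MAX_LEN, 1, -1):
            window = tagged[i:i + n]
            tags = "".join(tag for _, tag in window)
            if len(window) == n and tags in PATTERNS:
                counts[" ".join(word for word, _ in window)] += 1
                step = n  # skip past the matched phrase
                break
        i += step
    return counts


# Hand-tagged toy input (an assumption); "X" marks any other tag.
tagged = [("he", "X"), ("visited", "X"), ("New", "A"), ("York", "N"),
          ("City", "N"), ("last", "A"), ("year", "N")]

print(bigram_filter(tagged))
print(longest_match_filter(tagged))
```

On this input the bigram-only filter reports both New York and York City, while the longest-match scan reports New York City once, which mirrors the artefact discussed in the text. The greedy scan is only one possible reading of "search for the longest sequence".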

