

Developing a lexicon of word families for closely-related languages

Nuria Gala
LIF-CNRS UMR 6166
163 Av. de Luminy case 901, F-13288 Marseille cedex

[...] resources are of interest in linguistic research and its applications. However, building and enriching them is very time-consuming and expensive. In specific fields such as morphology, unsupervised and (semi-)supervised approaches that automatically discover word structure have gained in popularity in the last few years. While encouraging results have been obtained for a large variety of languages, few resources are currently available. In this paper, we describe a morphological lexicon under development for Romance languages. It is based on an initial seed set of 2,004 manually identified word families in French. Our goal is to map these families onto related languages in order to obtain a resource based on family clusters, able to provide morphological and semantic information on each family crosslingually.

Such a resource will be of help in contrastive linguistics and in different NLP and human applications, such as crosslingual information retrieval and interlingual language [...].

1. Introduction

A variety of multilingual lexical resources have been developed by different civilizations ever since the birth of writing, as a result of practical needs (learning, archiving, transmitting linguistic and other kinds of knowledge, etc.). The shape and the contents of these resources have evolved significantly over time. Technical revolutions such as printing and computerization have had a profound influence on the way lexicons are developed. From linear presentations of word lists to lexical networks, multilingual lexical resources present interlingual correspondences and often very specific linguistic information.

However, manually building and enriching such resources is very time-consuming and expensive. In recent years, collaborative and automatic approaches have emerged as a plausible alternative for building resources on a large scale, thus limiting development time.

Collaborative multilingual resources such as Papillon (Boitet et al., 2002) are based on the principle of sharing contributions, that is, anyone collaborates to enrich the database according to his/her possibilities. While the underlying philosophy is interesting, the results can easily be disappointing, as enriching a resource is a tedious task that in practice few people accept. Hence, it is hard to get the expected volume of [...].

In order to address both shortcomings (manual and contributive), automatic approaches have gained in popularity in NLP, especially when it comes to collecting specific linguistic information. In morphology, different methods exist to automatically acquire information about the internal structure of words (Lavallée and Langlais, 2010): probabilistic approaches which regroup words into paradigms by removing common affixes (Paramor (Monson et al., 2007)) or community-based detection in networks (MorphoNet (Bernhard, 2010)), unsupervised learning of word structure by decomposition (Linguistica (Goldsmith, 2001), Morfessor (Creutz and Lagus, 2005)), and supervised or semi-supervised methods using formal analogies to identify stems and morphological information (Lepage, 1998), (Hathout, 2008), (Lavallée and Langlais, 2010).

[1] For a discussion, see (Cristea et al., 2008).

These methods may differ with respect to the kind of result they obtain: word segments, complete morphemic analysis, morphological links between words, etc. Furthermore, as raw text is used for knowledge acquisition, most systems do not distinguish between inflectional and derivational morphology.

The work presented in this paper aims at building cross-linguistic morpho-phonological families. A morpho-phonological family groups lexical units sharing morphological [2] and semantic features. Such a family is usually built around a common stem. Hence, the stem terre 'earth' will induce the family made of lexical units such as terrasse 'terrace', terrestre 'terrestrial', terrien 'landowner', etc. All the words in this family share the following semantic components: 'surface', 'ground', 'area', etc. Having translated the stem into closely-related languages and using multilingual corpora and lexica, we will build the corresponding families and compare their organization across languages.
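As a rough illustration of the family notion just described, the sketch below represents a family as a stem plus its members and shared semantic components, and collects candidate members from a word list by naive prefix matching. The class `WordFamily`, the function `candidate_members`, the shortened stem form "terr", and the prefix heuristic itself are assumptions made for illustration only; they are not the procedure used to build the lexicon, whose seed families were identified manually.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class WordFamily:
    """A word family: a stem, its members, and shared semantic components."""
    stem: str
    gloss: str
    members: Dict[str, str] = field(default_factory=dict)  # word -> gloss
    components: Set[str] = field(default_factory=set)


def candidate_members(stem_forms, words):
    """Return the words that start with any of the given stem forms.

    Hypothetical heuristic: prefix matching over-generates, so the
    result is only a candidate list that still needs validation.
    """
    return sorted(w for w in words if any(w.startswith(s) for s in stem_forms))


# Family induced by the stem 'terre', with the glosses given in the text.
terre = WordFamily(
    stem="terre",
    gloss="earth",
    members={
        "terrasse": "terrace",
        "terrestre": "terrestrial",
        "terrien": "landowner",
    },
    components={"surface", "ground", "area"},
)

# "terr" is an assumed stem form; "table" is an unrelated distractor.
word_list = ["terrasse", "table", "terrestre", "terrien", "terre"]
found = candidate_members(["terr"], word_list)
print(found)
```

A purely formal filter of this kind cannot tell true family members from accidental look-alikes, which is one reason the semantic components are stored alongside the stem.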

Our aim is thus to create a resource presenting word families among closely-related languages and to check whether they can be mapped onto each other. The linguistic description provided is strictly synchronic and concerns both derivational morphology (stems and affixes) and morphosemantic links (semantic components within a word family).

This paper is structured as follows. In the next section we provide an overview of some existing mono- and multilingual resources, focusing on those containing a morphological description. Section 3 describes first experiments to map our initial resource for French to other Romance languages. We conclude the paper by discussing the achieved results and presenting some ideas concerning [...].

[2] [...] alternations are possible, [...] 'flower' and floraison 'flowering', croc/croch- in croc 'hook' and crochet 'little hook'.

2. Morphological resources: an overview

Although a significant number of existing NLP lexicons present primarily syntactic or semantic information (subcategorization (Briscoe and Carroll, 1997), concepts as in WordNet (Fellbaum, 1998), etc.), an increasing interest in morphology has led over the past few years to the development of morphological lexica. Such resources present a fine-grained and explicit description of the morphological organization of the lexicon. The resources are mainly monolingual, though some multilingual examples can be found.

Monolingual lexica

For morphologically rich languages such as Romance or Slavic languages, monolingual lexica may display morphotactics (ordering of morphemes, derivational morphology) or morphosyntactic information (word forms associated with a lemma, a part-of-speech tag, inflectional categories, subcategorization patterns, etc.).

The Digital Dictionary of Catalan Derivational Affixes (DSVC) (Bernal and DeCesaris, 2008) illustrates a derivational morphology lexicon. It has been created manually and is of limited coverage: about one hundred [...]. Lefff (Clément et al., 2004) is an example of a morphosyntactic lexicon for French verbs (about 5,000 entries).

It has been built automatically by extracting information from large raw corpora and other existing resources.

As for Slavic languages, Unimorph is a derivational morphology database with 92,970 Russian words. There is also a morphosyntactic lexicon for Polish (Sagot, 2007), which has been created using the same formalism as the one in Lefff through automatic lexical acquisition from a [...]. The formalism has also been used for other European languages (e.g. Spanish), as well as for less-resourced languages such as Kurdish and Persian (Walther and Sagot, 2010).

Multilingual lexica

Multilingual resources provide the basis for translation, that is, the mapping from one language to the other (Calzolari et al., 1999). Yet this does not always hold for all multilingual morphological [...].

A leading example is CELEX (Baayen et al., 1995), a manually-tagged morphological database for English, Dutch and German. For each language, words are analyzed morphologically and the processes of derivation are made explicit (e.g. concern [V], unconcern ((un)[N|.N],((concern)[V])[N])[N]). Unfortunately, the morphological information is not explicit crosslinguistically, that is, CELEX is a database for three languages, each independent of the others.

MuleX-FoR (Cartoni and Lefer, 2010) is a morphological database aiming to present word-formation processes in a multilingual environment. Word formation is presented as a set of multilingual rules available by affixes, rules and constructed words (e.g. by the rule above (n<a), the affixes sopre, sovra, super (Italian), sur, supra (French), supra (English) are displayed, along with some words containing such affixes in each language). Word-formation processes are thus represented in a multilingual context. Although morphological knowledge was partly automatically acquired from corpora, the coverage of MuleX-FoR is limited to one hundred [...].

[...], unsupervised learning of morphologically related words in various languages (English, German, Turkish, Finnish and Arabic) has been the main goal of systems participating in Morpho Challenge 2009, e.g. Morfessor (Creutz and Lagus, 2005), Rali-Cof (Lavallée and Langlais, 2010), MorphoNet (Bernhard, 2010), etc.

While such a competition allows the comparison of different statistical machine learning techniques (in terms of precision and recall), the challenge does not yield any available morphologically annotated [...].

Remarks

Two general observations can be made at this point: first, very few available resources present morphological links crosslinguistically, and if they do, their coverage is limited; second, morphological processes described by the existing resources mainly focus on word formation (word construction) conveyed by affixes. To our knowledge, word families, although described in the literature (Bybee, 1985), have been brought to the forefront only in psycholinguistics, to show their impact in lexical decision tasks (Schreuder and Baayen, 1997).

3. Mapping from French to other Romance languages

Considering that closely-related languages have a common origin, morphological regularities may be conveyed by means of similar constructions.

Our aim is thus to use a manually built morphological lexicon (Polymots (Gala et al., 2010), with 2,004 stems and nearly 20,000 derived words for French) and map it to other Romance languages.

Word families: definition and properties

We consider a family (cluster or paradigm) to be a set of lexical units sharing a formal and a semantic [...]. The words in a lexical cluster share:

- a stem (e.g. human, humanism, humanist, humanitarian, humanity, humanize, dehumanize, etc.);
- semantic continuity (all the words in the previous series are related to the notion of 'bipedal primate mammal').

While in some families there is a continuity of meaning (i.e. words sharing a significant number of semantic features, e.g. the human family), in others meaning is distributed, i.e. a single and precise meaning is impossible to seize among the lexical units of the cluster, as the words have evolved and the semantic components are widely [...]. In such cases, the semantic features of the common stem are to be found among the words in the family (e.g. French val- 'glen' includes features such as 'geographic area' and 'going downhill', and at least one of these notions is to be found in vallée 'valley' and avaler 'swallow').
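The contrast between continuous and distributed meaning can be made concrete with a small sketch. It assumes each word carries a set of semantic features and calls a family "continuous" when at least one stem feature is present in every word, and "distributed" when every word keeps at least one stem feature but none is common to all. The function `classify_family` and the per-word feature assignments are illustrative assumptions, not data or procedures from the lexicon.

```python
def classify_family(stem_features, word_features):
    """Classify a family as 'continuous', 'distributed', or 'unrelated'.

    stem_features: set of semantic features attached to the stem.
    word_features: dict mapping each word to its set of features.
    """
    feature_sets = list(word_features.values())
    if not feature_sets:
        return "unrelated"
    shared = set.intersection(*feature_sets)
    if shared & stem_features:
        return "continuous"   # one stem feature runs through every word
    if all(fs & stem_features for fs in feature_sets):
        return "distributed"  # each word keeps at least one stem feature
    return "unrelated"


# 'human' family: every word relates to the same notion (as in the text).
human = classify_family(
    {"bipedal primate mammal"},
    {
        "humanism": {"bipedal primate mammal"},
        "humanist": {"bipedal primate mammal"},
        "humanity": {"bipedal primate mammal"},
    },
)

# 'val-' family: which feature each word carries is an assumption here.
val = classify_family(
    {"geographic area", "going downhill"},
    {
        "vallée": {"geographic area"},
        "avaler": {"going downhill"},
    },
)

print(human, val)
```

Under these assumed feature sets the human family comes out as continuous and the val- family as distributed, mirroring the two cases distinguished above.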

