HEIDELTIME.HR: Extracting and Normalizing Temporal ...

HEIDELTIME.HR: Extracting and Normalizing Temporal Expressions in Croatian

Luka Skukan, Goran Glavaš, Jan Šnajder
University of Zagreb, Faculty of Electrical Engineering and Computing
Text Analysis and Knowledge Engineering Lab
Unska 3, 10000 Zagreb, Croatia

Temporal expression extraction and normalization are important for many NLP tasks and have been the topic of extensive research. While the majority of research on temporal expression extraction was performed for English, there has recently also been work on temporal processing for other languages. In this paper, we describe HEIDELTIME.HR, the Croatian resources for HeidelTime, a multilingual, cross-domain temporal expression tagger. HeidelTime recognizes temporal expressions in text and normalizes them according to the TIMEX3 annotation standard. We compile WikiWarsHr, a corpus of historical narratives in Croatian manually annotated for temporal expressions.

On WikiWarsHr, HEIDELTIME.HR achieves results comparable to those originally achieved by HeidelTime on English texts, with F1-scores of 0.93 and 0.86 for expression extraction and normalization, respectively.

HEIDELTIME.HR: Extraction and normalization of temporal expressions in Croatian

Extraction and normalization of temporal expressions are important for a variety of tasks in the computational processing of natural language and have been the subject of numerous studies. While most research on temporal expression extraction has been carried out for English, studies have recently also been conducted for other languages. In this paper, we describe HEIDELTIME.HR, the Croatian resources for HeidelTime, a multilingual and cross-domain tagger for temporal expressions. HeidelTime recognizes temporal expressions in text and normalizes them according to the TIMEX3 annotation standard. We build WikiWarsHr, a corpus of historical narratives in Croatian that was manually annotated for temporal expressions.

On WikiWarsHr, HEIDELTIME.HR achieves results comparable to those achieved by HeidelTime on English texts, with an F-score of 0.93 for extraction and 0.86 for normalization of temporal expressions.

1. Introduction

The ability to extract and normalize temporal expressions in natural language texts is of major importance for natural language processing tasks, such as summarization and question answering, but also for reasoning about events and time in general. Temporal expression extraction is the task of identifying temporal expressions in text. The normalization task amounts to turning extracted temporal expressions into fully specified values and formatting them according to some standard. Although a number of temporal taggers are available, mostly for English and other major languages, a temporal expression tagger for Croatian does not yet exist.

A new temporal expression tagger could be implemented, or an existing multilingual system could be adapted to work for Croatian. We chose the latter approach in this work, building on an existing and widely used tagger.

In this paper, we describe HEIDELTIME.HR, the Croatian resources for the rule-based temporal expression tagger HeidelTime (Strötgen et al., 2013). HeidelTime extracts and normalizes temporal expressions according to the TIMEX3 standard (Pustejovsky et al., 2003), and emerged as a winner in the TempEval-2 (Verhagen et al., 2010) and TempEval-3 (UzZaman et al., 2012) shared evaluation tasks. HeidelTime is a multilingual tagger, with resources having been developed for English, German (Strötgen et al., 2013), Arabic, Italian, Spanish, Vietnamese (Strötgen et al., 2014a), French (Moriceau and Tannier, 2014), Chinese (Li et al., 2014), Dutch, and Russian. We have developed Croatian resources, which will be included in the next HeidelTime release.

To develop and evaluate the tagger, we compiled WikiWarsHr, a corpus of historical narratives in Croatian manually annotated for temporal expressions. On this corpus, HEIDELTIME.HR achieves results comparable to those originally achieved by HeidelTime on English texts.

The structure of this paper is as follows. We describe the mechanisms of HeidelTime in Section 2. Section 3 describes HEIDELTIME.HR. In Section 4, we describe the WikiWarsHr corpus and present the evaluation results. Section 5 concludes the paper.

9th Language Technologies Conference, Information Society - IS 2014 (Konferenca Jezikovne tehnologije, Informacijska družba - IS 2014)

2. HeidelTime tagger

The HeidelTime tagger extracts and normalizes temporal expressions according to the TIMEX3 standard (Pustejovsky et al., 2003). In TIMEX3, each temporal expression is assigned a Type and a Value. A Type may be a Date, Time, Duration, or Set. The Value corresponds to a temporal value, partially dependent on Type (e.g., a Date 2014-10 for October of 2014).

    (DCT: June 21st 2014)
    The field of AI research was founded at a conference on the campus of Dartmouth College in the <TIMEX3 tid="t1" type="DATE" value="1956-SU">summer of 1956</TIMEX3>. <TIMEX3 tid="t2" type="DATE" value="2014">58 years later</TIMEX3>, we still haven't achieved many of the goals proposed [...], artificial intelligence has advanced and is <TIMEX3 tid="t3" type="DATE" value="2014-06-21">today</TIMEX3> a part of our daily lives without most of us knowing it.

Figure 1: Example of under-specified expression resolution.

HeidelTime features a generic, language-independent core, written in Java, and a language-dependent part, the so-called language resources. A language resource consists of three sets: (1) expression resources, (2) normalization resources, and (3) rule resources. Expression resources are regular expressions used for extracting temporal expressions from text, e.g., phrases for months, weekdays, numbers, etc. Normalization resources translate matched tokens to their canonical form, according to TIMEX3, by applying a normalization mapping to extracted patterns (e.g., "May" is mapped to "05"). Finally, the rule resources combine the previous two resources to extract and normalize temporal expressions.
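The interplay of the three resource sets can be illustrated with a minimal Python sketch. It is not HeidelTime's actual resource file format or Java implementation: the names RE_MONTH, NORM_MONTH, RULE_MONTH_YEAR, tag, and to_timex3 are hypothetical, and English month names are used instead of Croatian ones purely for readability.

```python
import re

# Expression resource (hypothetical): a regular-expression fragment for month names.
RE_MONTH = (r"(January|February|March|April|May|June|July|August|"
            r"September|October|November|December)")

# Normalization resource (hypothetical): surface form -> canonical TIMEX3 fragment.
NORM_MONTH = {
    "January": "01", "February": "02", "March": "03", "April": "04",
    "May": "05", "June": "06", "July": "07", "August": "08",
    "September": "09", "October": "10", "November": "11", "December": "12",
}

# Rule resource (hypothetical): combines the expression pattern with a
# normalization step and the TIMEX3 type of the result.
RULE_MONTH_YEAR = {
    "type": "DATE",
    "pattern": re.compile(r"\b" + RE_MONTH + r" (?:of )?(\d{4})\b"),
    "normalize": lambda m: f"{m.group(2)}-{NORM_MONTH[m.group(1)]}",
}

def tag(text, rules):
    """Return (extent, type, value) triples for every rule match in the text."""
    found = []
    for rule in rules:
        for m in rule["pattern"].finditer(text):
            found.append((m.group(0), rule["type"], rule["normalize"](m)))
    return found

def to_timex3(tid, extent, type_, value):
    """Render one match as an inline TIMEX3 tag."""
    return f'<TIMEX3 tid="t{tid}" type="{type_}" value="{value}">{extent}</TIMEX3>'

matches = tag("The tagger was evaluated in October of 2014.", [RULE_MONTH_YEAR])
print(matches)  # [('October of 2014', 'DATE', '2014-10')]
print(to_timex3(1, *matches[0]))
```

Keeping the month pattern and the month mapping separate from the rule mirrors the division described above: a new rule can reuse both without redefining them.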

These resources may be complemented with additional regular expressions to form more complex match-and-normalize rules, e.g., for discarding parts of extracted expressions or for adding a modifier ("early", "middle", etc.).

Normalization is performed both on fully specified expressions ("June 28, 1995") and relative temporal expressions ("tomorrow"). The latter are expressions that cannot be normalized without contextual information. Normalization of relative temporal expressions is performed by leaving the expressions under-specified and relying on HeidelTime's generic focus-tracking system to assign them a more specific value. For example, given a document creation time (DCT) of June 20th, 2014, the expression "tomorrow" might be resolved as "2014-06-21". This step is performed by taking into account the type of the document (narrative, news, scientific, or colloquial) and the tenses of the verbs used in the sentence containing the under-specified temporal expression.
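The arithmetic behind resolving a relative expression against a reference time can be sketched as follows. This is a simplified illustration, not HeidelTime's focus-tracking algorithm: the DAY_OFFSETS table and both function names are hypothetical, and document type and verb tense are ignored.

```python
from datetime import date, timedelta

# Hypothetical offsets, in days, for a few day-level relative expressions.
DAY_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}

def resolve_relative_day(expression, reference):
    """Resolve a day-level relative expression against a reference date (e.g. the DCT)."""
    offset = DAY_OFFSETS[expression.lower()]
    return (reference + timedelta(days=offset)).isoformat()

def resolve_years_later(years, reference_year):
    """Resolve 'N years later' against a reference year (e.g. a previously mentioned date)."""
    return str(reference_year + years)

# The example from the text: DCT of June 20th, 2014.
print(resolve_relative_day("tomorrow", date(2014, 6, 20)))  # 2014-06-21

# The expressions from Figure 1: DCT of June 21st, 2014, previous mention of 1956.
print(resolve_relative_day("today", date(2014, 6, 21)))     # 2014-06-21
print(resolve_years_later(58, 1956))                        # 2014
```

The only difference between the two Figure 1 cases is which reference is passed in: the DCT for "today" and the previously mentioned year for "58 years later".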

Either the DCT or a previously mentioned value can be used in under-specified expression normalization, depending on the document type and the normalization rule. An example of resolving under-specified dates using both DCT and current focus is shown in Fig. 1. Additionally, HeidelTime supports functionality extensions in the form of text post-processors written as Java code. These allow for more verbose expression resolution, e.g., computing the dates of lunar holidays.

The task of developing resources for the Croatian language consisted of developing the three above-mentioned sets of resources. We next describe the resources and the development process.

Preprocessing

HeidelTime requires text to be pre-annotated with token, sentence, and part-of-speech (POS) information. We used the CSTLemma lemmatiser (Jongejan and Haltrup, 2005) for token splitting and lemmatization, and the HunPos part-of-speech tagger (Halácsy et al., 2007) to obtain the POS information. To integrate this functionality with HeidelTime, we wrote a Java wrapper that allows the tagger's engine to invoke it during pre-processing. HunPos and CSTLemma were previously trained to work with Croatian texts (Agić et al., 2013).

The resources are divided into several classes. The expression and normalization resources are divided into descriptive classes, according to their common roles in temporal constructs, with each normalization resource corresponding to an expression resource. The rules are divided according to their semantics in the TIMEX3 standard into Date, Time, Duration, and Set resources. Altogether, there are 199 rule resources for Croatian: 123 for dates, 37 for times, 24 for durations, and 15 for sets.
