CHAPTER Regular Expressions, Text Normalization, Edit …

Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright 2021. All rights reserved. Draft of December 29.

User: I am unhappy.
ELIZA: DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
User: I need some help, that much seems certain.
ELIZA: WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
User: Perhaps I could learn to get along with my mother.
ELIZA: TELL ME MORE ABOUT YOUR FAMILY
User: My mother takes care of me.
ELIZA: WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU
User: My father.
ELIZA: YOUR FATHER
User: You are like my father in some ways.
Weizenbaum (1966)

The dialogue above is from ELIZA, an early natural language processing system that could carry on a limited conversation with a user by imitating the responses of a Rogerian psychotherapist (Weizenbaum, 1966).

ELIZA is a surprisingly simple program that uses pattern matching to recognize phrases like "I need X" and translate them into suitable outputs like "What would it mean to you if you got X?". This simple technique succeeds in this domain because ELIZA doesn't actually need to know anything to mimic a Rogerian psychotherapist. As Weizenbaum notes, this is one of the few dialogue genres where listeners can act as if they know nothing of the world. ELIZA's mimicry of human conversation was remarkably successful: many people who interacted with ELIZA came to believe that it really understood them and their problems, many continued to believe in ELIZA's abilities even after the program's operation was explained to them (Weizenbaum, 1976), and even today such chatbots are a fun diversion. Of course, modern conversational agents are much more than a diversion; they can answer questions, book flights, or find restaurants, functions for which they rely on a much more sophisticated understanding of the user's intent, as we will see in Chapter 24.
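The "I need X" transformation described above can be sketched with a single regular expression. This is a minimal illustration, not Weizenbaum's actual script: the rule `NEED_RULE`, the function `respond`, and the fallback reply are all hypothetical names and choices made for this example.

```python
import re

# Hypothetical rule in the spirit of the "I need X" example.
# The capture group holds X; trailing punctuation is dropped.
NEED_RULE = re.compile(r"\bI need (.+?)[.!?]*$", re.IGNORECASE)


def respond(utterance: str) -> str:
    """Return an ELIZA-style reply for a single user utterance."""
    match = NEED_RULE.search(utterance.strip())
    if match:
        # Reuse the captured phrase X inside a fixed response template.
        return f"WHAT WOULD IT MEAN TO YOU IF YOU GOT {match.group(1).upper()}"
    # Assumed fallback for utterances that the single rule does not cover.
    return "PLEASE GO ON"


print(respond("I need some help."))
# WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
```

The program needs no knowledge of what "help" means; it only copies the matched substring into the template, which is why the technique works in this narrow dialogue genre.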

Nonetheless, the simple pattern-based methods that powered ELIZA and other chatbots play a crucial role in natural language processing. We'll begin with the most important tool for describing text patterns: the regular expression. Regular expressions can be used to specify strings we might want to extract from a document, from transforming "I need X" in ELIZA above, to defining strings like $199 or $24.99 for extracting tables of prices from a document.

We'll then turn to a set of tasks collectively called text normalization, in which regular expressions play an important part. Normalizing text means converting it to a more convenient, standard form. For example, most of what we are going to do with language relies on first separating out or tokenizing words from running text, the task of tokenization.

English words are often separated from each other by whitespace, but whitespace is not always sufficient. New York and rock 'n' roll are sometimes treated as large words despite the fact that they contain spaces, while sometimes we'll need to separate I'm into the two words I and am. For processing tweets or texts we'll need to tokenize emoticons like :) or hashtags like #nlproc. Some languages, like Japanese, don't have spaces between words, so word tokenization becomes more difficult.

Another part of text normalization is lemmatization, the task of determining that two words have the same root, despite their surface differences. For example, the words sang, sung, and sings are forms of the verb sing.

The word sing is the common lemma of these words, and a lemmatizer maps from all of these to sing. Lemmatization is essential for processing morphologically complex languages. Stemming refers to a simpler version of lemmatization in which we mainly just strip suffixes from the end of the word. Text normalization also includes sentence segmentation: breaking up a text into individual sentences, using cues like periods or exclamation points.

Finally, we'll need to compare words and other strings. We'll introduce a metric called edit distance that measures how similar two strings are based on the number of edits (insertions, deletions, substitutions) it takes to change one string into the other. Edit distance is an algorithm with applications throughout language processing, from spelling correction to speech recognition to coreference resolution.

Regular Expressions

One of the unsung successes in standardization in computer science has been the regular expression (RE), a language for specifying text search strings.

This practical language is used in every computer language, word processor, and text processing tools like the Unix tools grep or Emacs. Formally, a regular expression is an algebraic notation for characterizing a set of strings. Regular expressions are particularly useful for searching in texts, when we have a pattern to search for and a corpus of texts to search through. A regular expression search function will search through the corpus, returning all texts that match the pattern. The corpus can be a single document or a collection. For example, the Unix command-line tool grep takes a regular expression and returns every line of the input document that matches the expression. A search can be designed to return every match on a line, if there are more than one, or just the first match.
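The grep-style search just described, and the difference between returning the first match and every match on a line, can be sketched with Python's standard `re` module. The helper name `grep` and the sample lines are illustrative assumptions; this is not how the Unix tool is implemented.

```python
import re


def grep(pattern: str, lines: list[str]) -> list[str]:
    """Return every line that contains at least one match of the pattern."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# A tiny illustrative corpus: one "document" made of three lines.
corpus = [
    "interesting links to woodchucks and lemurs",
    "Mary Ann stopped by Mona's",
    "how much wood would a woodchuck chuck",
]

# Line-level search: which lines match at all?
matching_lines = grep(r"wood", corpus)

# Within one line: only the first match, or every match.
first_match = re.search(r"wood", corpus[2])
all_matches = re.findall(r"wood", corpus[2])

print(matching_lines)       # the first and third lines
print(first_match.start())  # 9
print(len(all_matches))     # 2
```

`re.search` stops at the first match, while `re.findall` collects all non-overlapping matches, mirroring the two search designs mentioned above.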

In the following examples we generally underline the exact part of the pattern that matches the regular expression and show only the first match. We'll show regular expressions delimited by slashes but note that slashes are not part of the regular expressions.

Regular expressions come in many variants. We'll be describing extended regular expressions; different regular expression parsers may only recognize subsets of these, or treat some expressions slightly differently. Using an online regular expression tester is a handy way to test out your expressions and explore these variations.

Basic Regular Expression Patterns

The simplest kind of regular expression is a sequence of simple characters; putting characters in sequence is called concatenation.

To search for woodchuck, we type /woodchuck/. The expression /Buttercup/ matches any string containing the substring Buttercup; grep with that expression would return the line I'm called little Buttercup. The search string can consist of a single character (like /!/) or a sequence of characters (like /urgl/). Regular expressions are case sensitive; lower case /s/ is distinct from upper case /S/ (/s/ matches a lower case s but not an upper case S). This means that the pattern /woodchucks/ will not match the string Woodchucks.

RE             Example Patterns Matched
/woodchucks/   "interesting links to woodchucks and lemurs"
/a/            "Mary Ann stopped by Mona's"
/!/            "You've left the burglar behind again!" said Nori

Figure: Simple regex searches.
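The behavior of literal patterns and case sensitivity described above can be checked directly in Python. The helper `matches` is a hypothetical convenience wrapper around `re.search`; note that Python patterns are written as strings rather than between slashes.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """True if the pattern occurs anywhere in the text."""
    return re.search(pattern, text) is not None


# A concatenation of literal characters matches as a substring.
print(matches(r"woodchucks", "interesting links to woodchucks and lemurs"))  # True
print(matches(r"urgl", "You've left the burglar behind again!"))             # True
print(matches(r"!", "You've left the burglar behind again!"))                # True

# Matching is case sensitive by default.
print(matches(r"woodchucks", "Woodchucks"))  # False
print(matches(r"s", "S"))                    # False
```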

We can solve this problem with the use of the square braces [ and ]. The string of characters inside the braces specifies a disjunction of characters to match. For example, the figure below shows that the pattern /[wW]/ matches patterns containing either w or W.

RE               Match                    Example Patterns
/[wW]oodchuck/   Woodchuck or woodchuck   "Woodchuck"
/[abc]/          'a', 'b', or 'c'         "In uomini, in soldati"
/[1234567890]/   any digit                "plenty of 7 to 5"

Figure: The use of the brackets [] to specify a disjunction of characters.

The regular expression /[1234567890]/ specifies any single digit. While such classes of characters as digits or letters are important building blocks in expressions, they can get awkward (e.g., it's inconvenient to specify /[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/ to mean "any capital letter").

In cases where there is a well-defined sequence associated with a set of characters, the brackets can be used with the dash (-) to specify any one character in a range. The pattern /[2-5]/ specifies any one of the characters 2, 3, 4, or 5. The pattern /[b-g]/ specifies one of the characters b, c, d, e, f, or g. Some other examples are shown in the figure below.

RE        Match                  Example Patterns Matched
/[A-Z]/   an upper case letter   "we should call it 'Drenched Blossoms'"
/[a-z]/   a lower case letter    "my beans were impatient to be hoed!"
/[0-9]/   a single digit         "CHAPTER 1: Down the Rabbit Hole"

Figure: The use of the brackets [] plus the dash - to specify a range.

The square braces can also be used to specify what a single character cannot be, by use of the caret ^.
