Data-Intensive Text Processing with MapReduce


Jimmy Lin and Chris Dyer
University of Maryland, College Park

Manuscript prepared April 11, 2010. This is the pre-production manuscript of a book in the Morgan & Claypool Synthesis Lectures on Human Language Technologies. Anticipated publication date is ...

Contents

1 Introduction
   Computing in the Clouds
   Big Ideas
   Why Is This Different?
   What This Book Is Not

2 MapReduce Basics
   Functional Programming Roots
   Mappers and Reducers
   The Execution Framework
   Partitioners and Combiners
   The Distributed File System
   Hadoop Cluster Architecture

3 MapReduce Algorithm Design
   Local Aggregation (Combiners and In-Mapper Combining; Algorithmic Correctness with Local Aggregation)
   Pairs and Stripes
   Computing Relative Frequencies
   Secondary Sorting
   Relational Joins (Reduce-Side Join; Map-Side Join; Memory-Backed Join)

4 Inverted Indexing for Text Retrieval
   Web Crawling
   Inverted Indexes
   Inverted Indexing: Baseline Implementation
   Inverted Indexing: Revised Implementation
   Index Compression (Byte-Aligned and Word-Aligned Codes; Bit-Aligned Codes; Postings Compression)
   What About Retrieval?
   Summary and Additional Readings

5 Graph Algorithms
   Graph Representations
   Parallel Breadth-First Search
   PageRank
   Issues with Graph Processing
   Summary and Additional Readings

6 EM Algorithms for Text Processing
   Expectation Maximization (Maximum Likelihood Estimation; A Latent Variable Marble Game; MLE with Latent Variables; Expectation Maximization; An EM Example)
   Hidden Markov Models (Three Questions for Hidden Markov Models; The Forward Algorithm; The Viterbi Algorithm; Parameter Estimation for HMMs; Forward-Backward Training)
   EM in MapReduce (HMM Training in MapReduce)
   Case Study: Word Alignment for Statistical Machine Translation (Statistical Phrase-Based Translation; Brief Digression: Language Modeling with MapReduce; Word Alignment)
   EM-Like Algorithms (Gradient-Based Optimization and Log-Linear Models)
   Summary and Additional Readings

7 Closing Remarks
   Limitations of MapReduce
   Alternative Computing Paradigms
   MapReduce and Beyond

Chapter 1: Introduction

MapReduce [45] is a programming model for expressing distributed computations on massive amounts of data and an execution framework for large-scale data processing on clusters of commodity servers.

It was originally developed by Google and built on well-known principles in parallel and distributed processing dating back several decades. MapReduce has since enjoyed widespread adoption via an open-source implementation called Hadoop, whose development was led by Yahoo (now an Apache project). Today, a vibrant software ecosystem has sprung up around Hadoop, with significant activity in both industry and academia.

This book is about scalable approaches to processing large amounts of text with MapReduce. Given this focus, it makes sense to start with the most basic question: Why? There are many answers to this question, but we focus on two. First, big data is a fact of the world, and therefore an issue that real-world systems must grapple with. Second, across a wide range of text processing applications, more data translates into more effective algorithms, and thus it makes sense to take advantage of the plentiful amounts of data that surround us.

Modern information societies are defined by vast repositories of data, both public and private.
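To make the programming model concrete before the detailed treatment in later chapters, here is a minimal, single-machine sketch of the canonical word-count job. The mapper and reducer signatures and the run_job driver below are illustrative stand-ins, not Hadoop's actual API; in a real cluster, the grouping step is performed by the framework's distributed shuffle-and-sort phase across many machines.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit a (term, 1) pair for every term in one input record (a line of text)."""
    for term in line.split():
        yield term, 1

def reducer(term, counts):
    """Receive a term together with all counts emitted for it; emit (term, total)."""
    yield term, sum(counts)

def run_job(lines):
    """Single-process stand-in for the execution framework: run all mappers,
    group intermediate pairs by key, then run a reducer per key."""
    intermediate = [pair for line in lines for pair in mapper(line)]
    intermediate.sort(key=itemgetter(0))  # the "shuffle and sort" step
    output = []
    for term, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(term, (count for _, count in group)))
    return output

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_job(docs))
    # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```

The point of the sketch is the division of labor: the programmer supplies only the two small, stateless functions, while everything that makes the computation scale (partitioning input, moving intermediate data, grouping by key, handling failures) belongs to the framework.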

Therefore, any practical application must be able to scale up to datasets of interest. For many, this means scaling up to the web, or at least a non-trivial fraction thereof. Any organization built around gathering, analyzing, monitoring, filtering, searching, or organizing web content must tackle large-data problems: web-scale processing is practically synonymous with data-intensive processing. This observation applies not only to well-established internet companies, but to countless startups and niche players as well. Just think: how many companies do you know that start their pitch with "we're going to harvest information on the web and ..."?

Another strong area of growth is the analysis of user behavior data.

Any operator of a moderately successful website can record user activity and in a matter of weeks (or sooner) be drowning in a torrent of log data. In fact, logging user behavior generates so much data that many organizations simply can't cope with the volume, and either turn the functionality off or throw away data after some time. This represents lost opportunities, as there is a broadly-held belief that great value lies in insights derived from mining such data. Knowing what users look at, what they click on, how much time they spend on a web page, etc. leads to better business decisions and competitive advantages. Broadly, this is known as business intelligence, which encompasses a wide range of technologies including data warehousing, data mining, and analytics.

How much data are we talking about? A few examples: Google grew from processing 100 TB of data a day with MapReduce in 2004 [45] to processing 20 PB a day with MapReduce in 2008 [46]. In April 2009, a blog post was written about eBay's two enormous data warehouses: one with 2 petabytes of user data, and the other with 6.5 petabytes of user data spanning 170 trillion records and growing by 150 billion new records per day. Shortly thereafter, Facebook revealed similarly impressive numbers, boasting of 2.5 petabytes of user data, growing at about 15 terabytes per day. Petabyte datasets are rapidly becoming the norm, and the trends are clear: our ability to store data is fast overwhelming our ability to process what we store.

More distressing, increases in capacity are outpacing improvements in bandwidth such that our ability to even read back what we store is deteriorating [91]. Disk capacities have grown from tens of megabytes in the mid-1980s to about a couple of terabytes today (several orders of magnitude). On the other hand, latency and bandwidth have improved relatively little: in the case of latency, perhaps a 2× improvement during the last quarter century, and in the case of bandwidth, perhaps 50×. Given the tendency for individuals and organizations to continuously fill up whatever capacity is available, large-data problems are growing increasingly acute.

Moving beyond the commercial sphere, many have recognized the importance of data management in many scientific disciplines, where petabyte-scale datasets are also becoming increasingly common [21].
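A back-of-the-envelope calculation shows why read-back time is the real casualty of this imbalance. The capacity and bandwidth figures below are assumed, round numbers chosen only to illustrate the trend, not measurements from the text:

```python
def full_scan_seconds(capacity_mb, bandwidth_mb_per_s):
    """Time to read an entire disk sequentially at its full transfer rate."""
    return capacity_mb / bandwidth_mb_per_s

# Assumed, illustrative figures: a mid-1980s drive of ~40 MB at ~1 MB/s,
# versus a modern ~2 TB drive at ~100 MB/s sequential throughput.
old = full_scan_seconds(40, 1)                 # ~40 seconds
new = full_scan_seconds(2 * 1024 * 1024, 100)  # ~21,000 seconds

print(f"mid-1980s drive: {old:.0f} s (~{old / 60:.1f} minutes)")
print(f"modern drive:    {new:.0f} s (~{new / 3600:.1f} hours)")
```

Under these assumptions, capacity grew by roughly five orders of magnitude while bandwidth grew by only about two, so a full sequential scan went from under a minute to most of a working day. Spreading data and reads across many disks and machines is the obvious response, and it is exactly the strategy behind the distributed file systems on which MapReduce runs.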

For example:

The high-energy physics community was already describing experiences with petabyte-scale databases back in 2005 [20]. Today, the Large Hadron Collider (LHC) near Geneva is the world's largest particle accelerator, designed to probe the mysteries of the universe, including the fundamental nature of matter, by recreating conditions shortly following the Big Bang. When it becomes fully operational, the LHC will produce roughly 15 petabytes of data a year.

Astronomers have long recognized the importance of a digital observatory that would support the data needs of researchers across the globe; the Sloan Digital Sky Survey [145] is perhaps the most well known of these projects.

Looking into the future, the Large Synoptic Survey Telescope (LSST) is a wide-field instrument that is capable of observing the entire sky every few days. When the telescope comes online around 2015 in Chile, its gigapixel primary camera will produce approximately half a petabyte of archive images every month [19].

The advent of next-generation DNA sequencing technology has created a deluge of sequence data that needs to be stored, organized, and delivered to scientists for further study. Given the fundamental tenet in modern genetics that genotypes explain phenotypes, the impact of this technology is nothing less than transformative [103]. The European Bioinformatics Institute (EBI), which hosts a central repository of sequence data called EMBL-bank, has increased storage capacity from 2.5 petabytes in 2008 to 5 petabytes in 2009 [142].