Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
University of California, Berkeley

Abstract

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state.

However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.

1 Introduction

Cluster computing frameworks like MapReduce [10] and Dryad [19] have been widely adopted for large-scale data analytics. These systems let users write parallel computations using a set of high-level operators, without having to worry about work distribution and fault tolerance. While current frameworks provide numerous abstractions for accessing a cluster's computational resources, they lack abstractions for leveraging distributed memory.

This makes them inefficient for an important class of emerging applications: those that reuse intermediate results across multiple computations. Data reuse is common in many iterative machine learning and graph algorithms, including PageRank, K-means clustering, and logistic regression. Another compelling use case is interactive data mining, where a user runs multiple ad-hoc queries on the same subset of the data. Unfortunately, in most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system. This incurs substantial overheads due to data replication, disk I/O, and serialization, which can dominate application execution times. Recognizing this problem, researchers have developed specialized frameworks for some applications that require data reuse.
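The cost of this access pattern is easiest to see in a small iterative example. The toy Python sketch below (not Spark code; the dataset and hyperparameters are invented for illustration) runs logistic-regression-style gradient descent, scanning the same in-memory dataset on every iteration — exactly the reuse pattern that forces a stable-storage round trip per iteration in MapReduce-style systems:

```python
import math

# Toy illustration: an iterative algorithm that reuses the same dataset on
# every pass. Keeping `points` in memory means the load cost is paid once;
# a MapReduce-style pipeline would instead re-read it from stable storage
# (and re-serialize intermediate state) on each iteration.
points = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # (feature, label)

def train(points, iterations=100, lr=0.5):
    w = 0.0
    for _ in range(iterations):        # every iteration scans the same data
        grad = 0.0
        for x, y in points:
            pred = 1.0 / (1.0 + math.exp(-w * x))
            grad += (pred - y) * x     # logistic-loss gradient contribution
        w -= lr * grad
    return w

w = train(points)
```

The loop body is a coarse-grained operation applied uniformly to every record, which is what makes such workloads a natural fit for the abstraction developed in this paper.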

For example, Pregel [22] is a system for iterative graph computations that keeps intermediate data in memory, while HaLoop [7] offers an iterative MapReduce interface. However, these frameworks only support specific computation patterns (e.g., looping a series of MapReduce steps), and perform data sharing implicitly for these patterns. They do not provide abstractions for more general reuse, e.g., to let a user load several datasets into memory and run ad-hoc queries across them. In this paper, we propose a new abstraction called resilient distributed datasets (RDDs) that enables efficient data reuse in a broad range of applications. RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators. The main challenge in designing RDDs is defining a programming interface that can provide fault tolerance efficiently.

Existing abstractions for in-memory storage on clusters, such as distributed shared memory [24], key-value stores [25], databases, and Piccolo [27], offer an interface based on fine-grained updates to mutable state (e.g., cells in a table). With this interface, the only ways to provide fault tolerance are to replicate the data across machines or to log updates across machines. Both approaches are expensive for data-intensive workloads, as they require copying large amounts of data over the cluster network, whose bandwidth is far lower than that of RAM, and they incur substantial storage overhead. In contrast to these systems, RDDs provide an interface based on coarse-grained transformations (e.g., map, filter and join) that apply the same operation to many data items. This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data. (Checkpointing the data in some RDDs may be useful when a lineage chain grows large, however, and we discuss how to do it in §5.4.) If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just that partition.
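The lineage idea can be sketched in a few lines. The toy Python below is illustrative only (it is not Spark's API; the names `source`, `lineage`, and `build` are invented): it logs the coarse-grained transformations used to derive a dataset and rebuilds a lost partition by replaying them against stable storage, with no replica of the derived data:

```python
# Toy sketch of lineage-based recovery. A "dataset" is a list of partitions
# plus the lineage that built it: the source partitions and the
# coarse-grained transformations applied, in order.

source = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]        # stable-storage partitions
lineage = [lambda p: [x * 10 for x in p],         # map
           lambda p: [x for x in p if x > 20]]    # filter

def build(partition):
    for transform in lineage:                     # replay the logged lineage
        partition = transform(partition)
    return partition

derived = [build(p) for p in source]              # [[30], [40, 50, 60], [70, 80, 90]]

# Simulate losing one partition: recompute only that partition from its
# source via the logged transformations -- no replica of `derived` needed.
derived[1] = None                                 # partition lost
derived[1] = build(source[1])                     # recovered from lineage
```

Note that recovery is partition-granular: losing one partition never forces recomputation of the others, because each derives independently from its own source partition.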

Thus, lost data can be recovered, often quite quickly, without requiring costly replication. Although an interface based on coarse-grained transformations may at first seem limited, RDDs are a good fit for many parallel applications, because these applications naturally apply the same operation to multiple data items. Indeed, we show that RDDs can efficiently express many cluster programming models that have so far been proposed as separate systems, including MapReduce, DryadLINQ, SQL, Pregel and HaLoop, as well as new applications that these systems do not capture, like interactive data mining. The ability of RDDs to accommodate computing needs that were previously met only by introducing new frameworks is, we believe, the most credible evidence of the power of the RDD abstraction. We have implemented RDDs in a system called Spark, which is being used for research and production applications at UC Berkeley and several companies.

Spark provides a convenient language-integrated programming interface similar to DryadLINQ [31] in the Scala programming language [2]. In addition, Spark can be used interactively to query big datasets from the Scala interpreter. We believe that Spark is the first system that allows a general-purpose programming language to be used at interactive speeds for in-memory data mining on clusters. We evaluate RDDs and Spark through both microbenchmarks and measurements of user applications. We show that Spark is up to 20× faster than Hadoop for iterative applications, speeds up a real-world data analytics report by 40×, and can be used interactively to scan a 1 TB dataset with 5–7s latency. More fundamentally, to illustrate the generality of RDDs, we have implemented the Pregel and HaLoop programming models on top of Spark, including the placement optimizations they employ, as relatively small libraries (200 lines of code each).

This paper begins with an overview of RDDs (§2) and Spark (§3). We then discuss the internal representation of RDDs (§4), our implementation (§5), and experimental results (§6). Finally, we discuss how RDDs capture several existing cluster programming models (§7), survey related work (§8), and conclude (§9).

2 Resilient Distributed Datasets (RDDs)

This section provides an overview of RDDs. We first define RDDs (§2.1) and introduce their programming interface in Spark (§2.2). We then compare RDDs with finer-grained shared memory abstractions (§2.3). Finally, we discuss limitations of the RDD model (§2.4).

2.1 RDD Abstraction

Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs.

We call these operations transformations to differentiate them from other operations on RDDs. Examples of transformations include map, filter, and join. RDDs do not need to be materialized at all times. Instead, an RDD has enough information about how it was derived from other datasets (its lineage) to compute its partitions from data in stable storage. This is a powerful property: in essence, a program cannot reference an RDD that it cannot reconstruct after a failure. Finally, users can control two other aspects of RDDs: persistence and partitioning. Users can indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage). They can also ask that an RDD's elements be partitioned across machines based on a key in each record. This is useful for placement optimizations, such as ensuring that two datasets that will be joined together are hash-partitioned in the same way.

2.2 Spark Programming Interface

Spark exposes RDDs through a language-integrated API similar to DryadLINQ [31] and FlumeJava [8], where each dataset is represented as an object and transformations are invoked using methods on these objects. Programmers start by defining one or more RDDs through transformations on data in stable storage (e.g., map and filter).
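To see why matching partitioning helps, consider the toy Python sketch below (helper names and sample data are invented; this is not Spark's API): two keyed datasets hash-partitioned the same way place matching keys in the same partition, so a join can proceed partition-by-partition with no records moving between partitions:

```python
# Toy sketch of co-partitioning for joins. Both datasets are partitioned
# with the same hash function and partition count, so every key's records
# from both sides land in the same partition index.

NUM_PARTITIONS = 2

def hash_partition(records, n=NUM_PARTITIONS):
    parts = [[] for _ in range(n)]
    for key, value in records:
        parts[hash(key) % n].append((key, value))
    return parts

pages = [("a.com", "PageA"), ("b.com", "PageB"), ("c.com", "PageC")]
ranks = [("a.com", 0.5), ("b.com", 0.3), ("c.com", 0.2)]

page_parts = hash_partition(pages)
rank_parts = hash_partition(ranks)

# Each pair of co-located partitions joins independently; no cross-partition
# data movement (the analogue of a network shuffle) is required.
joined = []
for pp, rp in zip(page_parts, rank_parts):
    rank_of = dict(rp)
    joined += [(k, (v, rank_of[k])) for k, v in pp if k in rank_of]
```

If the two datasets were partitioned differently, the join would first have to re-shuffle one side so that matching keys meet — the cost that the placement control described above avoids.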

They can then use these RDDs in actions, which are operations that return a value to the application or export data to a storage system. Examples of actions include count (which returns the number of elements in the dataset), collect (which returns the elements themselves), and save (which outputs the dataset to a storage system). Like DryadLINQ, Spark computes RDDs lazily the first time they are used in an action, so that it can pipeline transformations. In addition, programmers can call a persist method to indicate which RDDs they want to reuse in future operations. Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough RAM. Users can also request other persistence strategies, such as storing the RDD only on disk or replicating it across machines, through flags to persist.
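A minimal sketch of this lazy-evaluation contract, written in toy Python rather than Spark's actual Scala API (the `ToyRDD` class and its internals are invented for illustration; note also that this sketch materializes on `persist()` for simplicity, whereas Spark defers materialization of a persisted RDD until its first action):

```python
# Toy sketch: transformations (map/filter) only record a plan; actions
# (count/collect) run it. persist() caches the result so later actions
# reuse it instead of recomputing.

class ToyRDD:
    def __init__(self, data, plan=()):
        self._data, self._plan, self._cache = data, plan, None

    def map(self, f):                  # transformation: just extends the plan
        return ToyRDD(self._data, self._plan + (("map", f),))

    def filter(self, f):               # transformation: just extends the plan
        return ToyRDD(self._data, self._plan + (("filter", f),))

    def persist(self):
        self._cache = self._compute()  # materialize once (Spark defers this
        return self                    # until the first action)

    def _compute(self):
        if self._cache is not None:
            return self._cache         # serve later actions from the cache
        out = self._data
        for kind, f in self._plan:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def collect(self):                 # action: runs the plan
        return self._compute()

    def count(self):                   # action: runs the plan
        return len(self._compute())

lines = ToyRDD(["ERROR: disk", "INFO: ok", "ERROR: net"])
errors = lines.filter(lambda s: s.startswith("ERROR")).persist()
```

After this, `errors.count()` and `errors.collect()` both read the cached result; without `persist`, each action would replay the filter over the source data.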

