Spark SQL: Relational Data Processing in Spark - MIT CSAIL

Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, Matei Zaharia
Databricks Inc. / MIT CSAIL / AMPLab, UC Berkeley

ABSTRACT

Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g., declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g., machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points.

Using Catalyst, we have built a variety of features (e.g., schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.

Categories and Subject Descriptors: [Database Management]: Systems

Keywords: Databases; Data Warehouse; Machine Learning; Spark; Hadoop

1 Introduction

Big data applications require a mix of processing techniques, data sources and storage formats. The earliest systems designed for these workloads, such as MapReduce, gave users a powerful, but low-level, procedural programming interface. Programming such systems was onerous and required manual optimization by the user to achieve high performance. As a result, multiple new systems sought to provide a more productive user experience by offering relational interfaces to big data. Systems like Pig, Hive, Dremel and Shark [29, 36, 25, 38] all take advantage of declarative queries to provide richer automatic optimizations.

While the popularity of relational systems shows that users often prefer writing declarative queries, the relational approach is insufficient for many big data applications. First, users want to perform ETL to and from various data sources that might be semi- or unstructured, requiring custom code. Second, users want to perform advanced analytics, such as machine learning and graph processing, that are challenging to express in relational systems. In practice, we have observed that most data pipelines would ideally be expressed with a combination of both relational queries and complex procedural algorithms.

Unfortunately, these two classes of systems (relational and procedural) have until now remained largely disjoint, forcing users to choose one paradigm or the other.

This paper describes our effort to combine both models in Spark SQL, a major new component in Apache Spark [39]. Spark SQL builds on our earlier SQL-on-Spark effort, called Shark. Rather than forcing users to pick between a relational or a procedural API, however, Spark SQL lets users seamlessly intermix the two.

Spark SQL bridges the gap between the two models through two contributions. First, Spark SQL provides a DataFrame API that can perform relational operations on both external data sources and Spark's built-in distributed collections. This API is similar to the widely used data frame concept in R [32], but evaluates operations lazily so that it can perform relational optimizations. Second, to support the wide range of data sources and algorithms in big data, Spark SQL introduces a novel extensible optimizer called Catalyst. Catalyst makes it easy to add data sources, optimization rules, and data types for domains such as machine learning.
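To give a sense of this mixing of paradigms, the following minimal Scala sketch (using the current Spark API; the Event class and its fields are hypothetical) turns an ordinary RDD built procedurally into a DataFrame and queries it relationally; because evaluation is lazy, the plan is optimized and run only when a result is requested.

    import org.apache.spark.sql.SparkSession

    // Hypothetical record type for the sketch.
    case class Event(level: String, message: String)

    val spark = SparkSession.builder().appName("MixedExample").getOrCreate()
    import spark.implicits._   // enables toDF() on RDDs of case classes and the $"col" syntax

    // A built-in distributed collection (RDD) created procedurally...
    val events = spark.sparkContext.parallelize(Seq(
      Event("ERROR", "disk full"), Event("INFO", "started")))

    // ...queried relationally as a DataFrame; nothing executes until count() is called.
    val errorCount = events.toDF().filter($"level" === "ERROR").count()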

The DataFrame API offers rich relational/procedural integration within Spark programs. DataFrames are collections of structured records that can be manipulated using Spark's procedural API, or using new relational APIs that allow richer optimizations. They can be created directly from Spark's built-in distributed collections of Java/Python objects, enabling relational processing in existing Spark programs. Other Spark components, such as the machine learning library, take and produce DataFrames as well. DataFrames are more convenient and more efficient than Spark's procedural API in many common situations. For example, they make it easy to compute multiple aggregates in one pass using a SQL statement, something that is difficult to express in traditional functional APIs. They also automatically store data in a columnar format that is significantly more compact than Java/Python objects. Finally, unlike existing data frame APIs in R and Python, DataFrame operations in Spark SQL go through a relational optimizer, Catalyst.
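As a rough illustration of these points, the Scala sketch below (using today's DataFrame API, which has evolved somewhat since this paper; the file path and column names are hypothetical) infers a schema from JSON input and computes several aggregates in a single pass.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{avg, count, max}

    val spark = SparkSession.builder().appName("DataFrameSketch").getOrCreate()

    // Schema is inferred from the JSON records (path and columns are hypothetical).
    val employees = spark.read.json("hdfs://.../employees.json")

    // Several aggregates are computed in one pass over the data; like RDDs,
    // DataFrames are lazy, so nothing runs until an output operation such as show().
    employees
      .groupBy("dept")
      .agg(count("name"), avg("age"), max("salary"))
      .show()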

To support a wide variety of data sources and analytics workloads in Spark SQL, we designed an extensible query optimizer called Catalyst. Catalyst uses features of the Scala programming language, such as pattern matching, to express composable rules in a Turing-complete language. It offers a general framework for transforming trees, which we use to perform analysis, planning, and runtime code generation. Through this framework, Catalyst can also be extended with new data sources, including semi-structured data such as JSON and smart data stores to which one can push filters (e.g., HBase); with user-defined functions; and with user-defined types for domains such as machine learning. Functional languages are known to be well-suited for building compilers [37], so it is perhaps no surprise that they made it easy to build an extensible optimizer. We indeed have found Catalyst effective in enabling us to quickly add capabilities to Spark SQL, and since its release we have seen external contributors easily add them as well.
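The flavor of such rules can be conveyed with a small, self-contained Scala sketch. The classes below are not Catalyst's actual classes; they only illustrate how Scala pattern matching plus a generic tree-transform function yields a composable rule, here a toy constant-folding rule over an expression tree.

    // Toy expression tree (illustrative only, not Catalyst's real node types).
    sealed trait Expr
    case class Literal(value: Int) extends Expr
    case class Attribute(name: String) extends Expr
    case class Add(left: Expr, right: Expr) extends Expr

    // Generic bottom-up transform: rewrite children first, then apply the rule
    // to the current node if the rule's pattern matches.
    def transform(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
      val withNewChildren = e match {
        case Add(l, r) => Add(transform(l)(rule), transform(r)(rule))
        case other     => other
      }
      rule.applyOrElse(withNewChildren, (x: Expr) => x)
    }

    // A rule is just a partial function written with pattern matching.
    val constantFold: PartialFunction[Expr, Expr] = {
      case Add(Literal(a), Literal(b)) => Literal(a + b)
    }

    // Add(Attribute("x"), Add(Literal(1), Literal(2))) folds to Add(Attribute("x"), Literal(3)).
    val folded = transform(Add(Attribute("x"), Add(Literal(1), Literal(2))))(constantFold)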

Spark SQL was released in May 2014, and is now one of the most actively developed components in Spark. As of this writing, Apache Spark is the most active open source project for big data processing, with over 400 contributors in the past year. Spark SQL has already been deployed in very large scale environments. For example, a large Internet company uses Spark SQL to build data pipelines and run queries on an 8000-node cluster with over 100 PB of data. Each individual query regularly operates on tens of terabytes. In addition, many users adopt Spark SQL not just for SQL queries, but in programs that combine it with procedural processing. For example, 2/3 of the customers of Databricks Cloud, a hosted service running Spark, use Spark SQL within other programming languages. Performance-wise, we find that Spark SQL is competitive with SQL-only systems on Hadoop for relational queries. It is also up to 10x faster and more memory-efficient than naive Spark code in computations expressible in SQL.

More generally, we see Spark SQL as an important evolution of the core Spark API. While Spark's original functional programming API was quite general, it offered only limited opportunities for automatic optimization. Spark SQL simultaneously makes Spark accessible to more users and improves optimizations for existing ones.

Within Spark, the community is now incorporating Spark SQL into more APIs: DataFrames are the standard data representation in a new ML pipeline API for machine learning, and we hope to expand this to other components, such as GraphX and streaming.

We start this paper with a background on Spark and the goals of Spark SQL (Section 2). We then describe the DataFrame API (Section 3), the Catalyst optimizer (Section 4), and advanced features we have built on Catalyst (Section 5). We evaluate Spark SQL in Section 6. We describe external research built on Catalyst in Section 7. Finally, Section 8 covers related work.

2 Background and Spark Overview

Apache Spark is a general-purpose cluster computing engine with APIs in Scala, Java and Python and libraries for streaming, graph processing and machine learning [6]. Released in 2010, it is to our knowledge one of the most widely-used systems with a language-integrated API similar to DryadLINQ [20], and the most active open source project for big data processing. Spark had over 400 contributors in 2014, and is packaged by multiple vendors.

Spark offers a functional programming API similar to other recent systems [20, 11], where users manipulate distributed collections called Resilient Distributed Datasets (RDDs) [39].

Each RDD is a collection of Java or Python objects partitioned across a cluster. RDDs can be manipulated through operations like map, filter, and reduce, which take functions in the programming language and ship them to nodes on the cluster. For example, the Scala code below counts lines starting with "ERROR" in a text file:

    lines = spark.textFile("hdfs://...")
    errors = lines.filter(s => s.contains("ERROR"))
    println(errors.count())

This code creates an RDD of strings called lines by reading an HDFS file, then transforms it using filter to obtain another RDD, errors. It then performs a count on this data.

RDDs are fault-tolerant, in that the system can recover lost data using the lineage graph of the RDDs (by rerunning operations such as the filter above to rebuild missing partitions). They can also explicitly be cached in memory or on disk to support iteration [39].
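Continuing the log-processing example above, explicit caching might look like the following sketch; the second filter predicate ("timeout") is just an illustrative follow-up query.

    errors.cache()   // keep the filtered RDD in memory (persist() allows other storage levels)
    println(errors.count())                                      // first action materializes and caches errors
    println(errors.filter(s => s.contains("timeout")).count())   // later actions reuse the cached partitions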

One final note about the API is that RDDs are evaluated lazily. Each RDD represents a logical plan to compute a dataset, but Spark waits until certain output operations, such as count, to launch a computation. This allows the engine to do some simple query optimization, such as pipelining operations. For instance, in the example above, Spark will pipeline reading lines from the HDFS file with applying the filter and computing a running count, so that it never needs to materialize the intermediate lines and errors results. While such optimization is extremely useful, it is also limited because the engine does not understand the structure of the data in RDDs (which is arbitrary Java/Python objects) or the semantics of user functions (which contain arbitrary code).

2.1 Previous Relational Systems on Spark

Our first effort to build a relational interface on Spark was Shark [38], which modified the Apache Hive system to run on Spark and implemented traditional RDBMS optimizations, such as columnar processing, over the Spark engine. While Shark showed good performance and good opportunities for integration with Spark programs, it had three important challenges. First, Shark could only be used to query external data stored in the Hive catalog, and was thus not useful for relational queries on data inside a Spark program (e.g., on the errors RDD created manually above).

