
Spark SQL: Relational Data Processing in Spark

Michael Armbrust†, Reynold S. Xin†, Cheng Lian†, Yin Huai†, Davies Liu†, Joseph K. Bradley†, Xiangrui Meng†, Tomer Kaftan‡, Michael J. Franklin†‡, Ali Ghodsi†, Matei Zaharia†

†Databricks Inc.   MIT CSAIL   ‡AMPLab, UC Berkeley

ABSTRACT

Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g., declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g., machine learning).

Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g., schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis.

We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.

Categories and Subject Descriptors: H.2 [Database Management]: Systems

Keywords: Databases; Data Warehouse; Machine Learning; Spark; Hadoop

1. Introduction

Big data applications require a mix of processing techniques, data sources and storage formats. The earliest systems designed for these workloads, such as MapReduce, gave users a powerful, but low-level, procedural programming interface. Programming such systems was onerous and required manual optimization by the user to achieve high performance.

As a result, multiple new systems sought to provide a more productive user experience by offering relational interfaces to big data. Systems like Pig, Hive, Dremel and Shark [29, 36, 25, 38] all take advantage of declarative queries to provide richer automatic optimizations.

While the popularity of relational systems shows that users often prefer writing declarative queries, the relational approach is insufficient for many big data applications. First, users want to perform ETL to and from various data sources that might be semi- or unstructured, requiring custom code.

Second, users want to perform advanced analytics, such as machine learning and graph processing, that are challenging to express in relational systems. In practice, we have observed that most data pipelines would ideally be expressed with a combination of both relational queries and complex procedural algorithms. Unfortunately, these two classes of systems, relational and procedural, have until now remained largely disjoint, forcing users to choose one paradigm or the other. This paper describes our effort to combine both models in Spark SQL, a major new component in Apache Spark [39].

Spark SQL builds on our earlier SQL-on-Spark effort, called Shark. Rather than forcing users to pick between a relational or a procedural API, however, Spark SQL lets users seamlessly intermix the two. Spark SQL bridges the gap between the two models through two contributions. First, Spark SQL provides a DataFrame API that can perform relational operations on both external data sources and Spark's built-in distributed collections. This API is similar to the widely used data frame concept in R [32], but evaluates operations lazily so that it can perform relational optimizations.
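To make the DataFrame API's lazy, relational character concrete, the following is a minimal sketch using the Spark 1.3-era Scala API (the file name people.json and its columns are illustrative, and sc denotes an existing SparkContext):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Load an external JSON data source as a DataFrame.
    val people = sqlContext.jsonFile("people.json")

    // Relational operations only build up a logical plan; nothing runs
    // until an output action such as collect(), so the optimizer sees
    // (and can rewrite) the whole query.
    val names = people.where(people("age") > 21).select(people("name"))
    names.collect()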

Second, to support the wide range of data sources and algorithms in big data, Spark SQL introduces a novel extensible optimizer called Catalyst. Catalyst makes it easy to add data sources, optimization rules, and data types for domains such as machine learning. The DataFrame API offers rich relational/procedural integration within Spark programs. DataFrames are collections of structured records that can be manipulated using Spark's procedural API, or using new relational APIs that allow richer optimizations. They can be created directly from Spark's built-in distributed collections of Java/Python objects, enabling relational processing in existing Spark programs.
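To illustrate the last point, here is a hedged sketch (same Spark 1.3-era Scala API; the Employee case class is invented for the example) of lifting an ordinary RDD of Scala objects into a DataFrame and mixing the two styles:

    case class Employee(name: String, dept: String, age: Int)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._ // enables rdd.toDF()

    // An existing procedural Spark collection of Scala objects...
    val rdd = sc.parallelize(Seq(Employee("alice", "eng", 35),
                                 Employee("bob", "sales", 28)))

    // ...viewed relationally, with the schema inferred from the case class.
    val employees = rdd.toDF()

    // Filter relationally, then drop back to the procedural API;
    // column 0 is name, per the case class field order.
    employees.where(employees("age") < 30)
             .rdd
             .map(row => row.getString(0).toUpperCase)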

Other Spark components, such as the machine learning library, take and produce DataFrames as well. DataFrames are more convenient and more efficient than Spark's procedural API in many common situations. For example, they make it easy to compute multiple aggregates in one pass using a SQL statement, something that is difficult to express in traditional functional APIs. They also automatically store data in a columnar format that is significantly more compact than Java/Python objects. Finally, unlike existing data frame APIs in R and Python, DataFrame operations in Spark SQL go through a relational optimizer, Catalyst. To support a wide variety of data sources and analytics workloads in Spark SQL, we designed an extensible query optimizer called Catalyst.
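For instance, the multiple-aggregates point might look like the sketch below (continuing the hypothetical employees DataFrame from the previous sketch); a single groupBy/agg call computes all three aggregates in one pass over the data:

    import org.apache.spark.sql.functions.{avg, count, max}

    // Three aggregates, one scan.
    employees.groupBy("dept")
             .agg(count("name"), avg("age"), max("age"))

    // Caching a DataFrame stores it in the compact in-memory
    // columnar format described above.
    employees.cache()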

Catalyst uses features of the Scala programming language, such as pattern-matching, to express composable rules in a Turing-complete language. It offers a general framework for transforming trees, which we use to perform analysis, planning, and runtime code generation. Through this framework, Catalyst can also be extended with new data sources, including semi-structured data such as JSON and "smart" data stores to which one can push filters (e.g., HBase); with user-defined functions; and with user-defined types for domains such as machine learning.
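To give a flavor of such tree-transforming rules, the following is a minimal constant-folding-style sketch against Catalyst's internal Scala API. The names (Rule, LogicalPlan, Add, Literal) follow Catalyst's source around the time of the paper, but the internal API is not stable across Spark versions, so treat this as illustrative only:

    import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule
    import org.apache.spark.sql.types.IntegerType

    // A composable rule: ordinary Scala pattern matching over the
    // expression tree, applied throughout the plan by the generic
    // transform framework.
    object FoldIntegerAdd extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
        case Add(Literal(a: Int, IntegerType), Literal(b: Int, IntegerType)) =>
          Literal(a + b)
      }
    }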

