Transcription of Declarative Machine Learning A Classification of Basic ...
1 Declarative Machine Learning . A Classification of Basic Properties and Types Matthias Boehm, Alexandre V. Evfimievski, Niketan Pansare, Berthold Reinwald IBM Research Almaden; San Jose, CA, USA. [ ] 19 May 2016. Table 1: Delineation of Types of Declarative ML. ABSTRACT Declarative ML Tasks , MLbase [27, 33], Declarative Machine Learning (ML) aims at the high-level (fixed task) Columbus [42], DeepDive [32]. specification of ML tasks or algorithms, and automatic gen- Declarative ML Algorithms , OptiML [35], SciDB [9, 34]. eration of optimized execution plans from these specifica- (fixed algorithm) SystemML [7, 22], SimSQL [11]. tions. The fundamental goal is to simplify the usage and/or Large-Scale ML Libraries , MLlib [30], Mahout [37], development of ML algorithms, which is especially impor- (fixed plan) MADlib [12, 23], ORE, Rev R. tant in the context of large-scale computations. However, ML systems at different abstraction levels have emerged over Declarative ML: Declarative ML aims at a high-level time and accordingly there has been a controversy about the specification of ML tasks or algorithms to simplify the usage meaning of this general definition of Declarative ML.
2 Spec- and/or development of ML algorithms by separating appli- ification alternatives range from ML algorithms expressed cation or algorithm semantics from the underlying data rep- in domain-specific languages (DSLs) with optimization for resentations and execution plans. Table 1 categorizes types performance, to ML task ( Learning problem) specifications of Declarative ML and delineates them from ML libraries. with optimization for performance and accuracy. We ar- Overall, the major benefits of Declarative ML are: gue that these different types of Declarative ML complement Simple, Analysis-Centric Specification, each other as they address different users (data scientists Physical Data Independence, and end users). This paper makes an attempt to create a Automatic Execution Plan Generation (optimization, taxonomy for Declarative ML, including a definition of essen- platform independence, data-size independence), tial Basic properties and types of Declarative ML.
3 Along the way, we provide insights into implications of these proper- Ease of Deployment (platform independence, adaptiv- ties. We also use this taxonomy to classify existing systems. ity of packaged applications), and Finally, we draw conclusions on defining appropriate bench- Separation of Concerns (skill sets of users/devs). marks and specification languages for Declarative ML. Over time, systems at different levels of abstraction have been proposed by industry and academia. Example sys- 1. INTRODUCTION tems range from UDF-centric ML extensions of data-parallel Large-scale Machine Learning (ML) leverages large data frameworks, to domain-specific languages (DSLs) for ML. collections for advanced analytics in order to find interest- tasks or ML algorithms. This broad spectrum of systems ing patterns and train robust predictive models. Tradi- aiming for Declarative ML naturally led to a controversy re- tional frameworks and tools like R, Matlab, Weka, SPSS, garding the scope of Declarative ML.
4 Not surprisingly as or SAS provide rich functionality but except for dedicated many ML algorithms are iterative the discussion centers packages struggle to provide scalable analytics. Due to around the syntax for specifying loops and control flow in the data-intensive characteristics, increasingly often data- general. Various projects adopt an R/Python-like syntax parallel frameworks like MapReduce [14], Spark [41], or [1, 15, 22, 40, 42, 43] inheriting the full flexibility of loops, Flink [3] are used for cost-effective parallelization on com- branches, and functions. Others support loops with (1) more modity hardware. However, large-scale computation inher- restrictive iteration constructs [13, 17], (2) model updates ently increases the complexity of specifying ML algorithms, with implicit convergence checks [8], or encapsulating entire especially with regard to efficient and scalable execution. algorithm classes as ML tasks [27, 33, 42]. We argue that Large-Scale ML Libraries: Large-scale ML libraries the specific language-level syntax is actually irrelevant if the like MLlib (aka SparkML) [30], Mahout [37], and MADlib ML task or algorithm specification conforms to a set of Basic [12, 23] are currently the predominant tools for large- properties required for Declarative ML.
5 Scale ML. These libraries provide algorithms with fixed dis- Contributions and Structure: The primary contribu- tributed runtime plans and often expose the underlying tion of this paper is a systematic analysis and classifica- physical data representation. Although such libraries are tion of Declarative Machine Learning . We first define in very valuable tools for end-users, it takes substantial effort an syntax-independent manner a set of Basic properties in to write new or customize existing algorithms because it Section 2, that any system for Declarative ML should sat- requires knowledge of ML algorithms, their distributed im- isfy. Subsequently, we describe the types of Declarative ML. plementation, and the underlying data-parallel framework. in Section 3. Finally, we use this taxonomy to classify exist- Similarly, improvements often require a modification of all ing systems in Section 4, and draw conclusions for defining individual algorithms to exploit these improvements.
6 Appropriate benchmarks and languages in Section 5. 1. 2. Basic PROPERTIES X = read("./X"); var X = drmFromHDFS("./X"). y = read("./y"); val y = drmFromHDFS("./y"). As a foundation for discussing types of Declarative ML, p = t(X) %*% y; var p = ( %*% y).collect we define essential, Basic properties in the three categories of w = matrix(0,ncol(X),1); var w = dense(..). data, operations, and result correctness. We also discuss the X = (256).checkpoint(). implications of the individual properties and provide exam- while(..) { while(..) {. ples of how Apache SystemML as a representative system q = t(X) %*% X %*% p; q = ( %*% X %*% p).collect .. for Declarative ML realizes these properties. } }. (a) SystemML (b) Mahout Samsara Physical Data Independence Figure 1: Examples of ML Algorithm Specifications. The most significant goal of Declarative ML is data inde- pendence because it decouples the high-level specification of ML tasks or algorithms from the underlying data represen- Operation Semantics tations and related runtime plans and operations.
7 The second major goal of Declarative ML is to specify ML. tasks or algorithms using domain-specific, high-level opera- Property 1. Independence of Data Structures: The tions with well-defined semantics to simplify algorithm usage data types of inputs, intermediate results, and outputs like or development, and enable efficient evaluation plans. matrices or scalars are exposed as abstract data types without access to the underlying physical data representations. Property 3. Analysis-Centric Operation Primitives: Basic operation primitives, common in the target analytics In the context of Declarative ML, independence of data domain, are supported. For ML algorithms, this includes structures serves two major purposes. First, abstract data linear algebra and statistical functions, whereas for ML. types like matrix hide the decision on distributed vs local tasks, this includes task-specific primitives and models. data representations. Accordingly, specified tasks or algo- rithms become independent of data size and deployment In order to allow Declarative ML with simple specifica- context ( , distributed computation vs streaming, differ- tion, there is a need for operation primitives that closely ent runtime backends).
8 Second, abstract data types also resemble a natural description of ML tasks or algorithms hide the physical data representation ( , dense/sparse ma- at a conceptual level. For ML algorithms this includes lin- trices, lossless compression), which allow internal improve- ear algebra, aggregations, and statistical functions but spe- ments of storage and operation efficiency. cific domains like deep Learning might require additional Property 2. Independence of Data Flow Properties: domain-specific operations like convolution. Similarly, for Data flow properties are not exposed, , the user has, at ML tasks this includes task-specific abstractions, operations, specification level, no explicit control over properties like and models. For example, for specifying a task like clas- partitioning, caching, and blocking configurations. sify, common Classification algorithms, loss functions, and parameters should be supported to describe candidates and The property of independence of data flow properties fur- the optimization objective.
9 The same applies for the task ther restricts the notion of abstract data types, disallowing of general-purpose optimization optimize, where one would the explicit specification of interesting data flow properties. expect a way to specify gradient/loss functions, and termi- Examples are (1) caching and checkpointing ( , local and nation conditions. Note that this property excludes systems distributed caching, with certain storage levels), (2) logical that are Declarative but unrelated to ML. and physical partitioning ( , row/column/block partition- ing and range/hash partitioning of distributed data sets), Property 4. Known Semantics of Operation Prim- (3) blocking configurations (row/column block sizes, fixed itives: The semantics of operation primitives used to specify or variable logical/physical block sizes), as well as (4) data ML tasks or algorithms are known to the system in terms of formats (text or binary cell/block formats). Note, however, knowledge of operation characteristics and equivalences.
10 That it is valid to allow the specification of ordering because We require operational semantics, where there exists at it is both a logical and physical data flow property. least one na ve evaluation plan or straightforward mapping. Example SystemML: SystemML satisfies both proper- Knowledge of operation semantics is essential for generating ties by exposing only the abstract data types frame, matrix, efficient evaluation plans for example, via rewrites and op- and scalar without their physical data structures or inter- erator selection from high-level specifications. In the con- esting data flow properties as shown in Figure 1(a), as an text of ML, operation semantics also cover characteristics example of a valid specification. The decisions on phys- like commutativity and associativity, sparse-safeness (cor- ical data flow properties are, however, crucial for perfor- rectness of processing only non-zero cells), value reposition- mance. Hence, the system automatically injects, for exam- ing ( , reorg operations like transpose, or order), sym- ple, caching and partitioning directives via rewrites.