TFX: A TensorFlow-Based Production-Scale Machine …

KDD 2017 Applied Data Science Paper
KDD'17, August 13-17, 2017, Halifax, NS, Canada

TFX: A TensorFlow-Based Production-Scale Machine Learning Platform

Denis Baylor, Eric Breck, Heng-Tze Cheng, Noah Fiedel, Chuan Yu Foo, Zakaria Haque, Salem Haykal, Mustafa Ispir, Vihan Jain, Levent Koc, Chiu Yuen Koo, Lukasz Lew, Clemens Mewald, Akshay Naresh Modi, Neoklis Polyzotis, Sukriti Ramesh, Sudip Roy, Steven Euijong Whang, Martin Wicke, Jarek Wilkiewicz, Xin Zhang, Martin Zinkevich
Google Inc.

Corresponding authors: Heng-Tze Cheng, Clemens Mewald, Neoklis Polyzotis, and Steven Euijong Whang.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). 2017 Copyright held by the owner/author(s). 978-1-4503-4887-4/17/08.

ABSTRACT

Creating and maintaining a platform for reliably producing and deploying machine learning models requires careful orchestration of many components: a learner for generating models based on training data, modules for analyzing and validating both data as well as models, and finally infrastructure for serving models in production. This becomes particularly challenging when data changes over time and fresh models need to be produced continuously. Unfortunately, such orchestration is often done ad hoc using glue code and custom scripts developed by individual teams for specific use cases, leading to duplicated effort and fragile systems with high technical debt.

We present TensorFlow Extended (TFX), a TensorFlow-based general-purpose machine learning platform implemented at Google. By integrating the aforementioned components into one platform, we were able to standardize the components, simplify the platform configuration, and reduce the time to production from the order of months to weeks, while providing platform stability that minimizes disruptions.

We present the case study of one deployment of TFX in the Google Play app store, where the machine learning models are refreshed continuously as new data arrive. Deploying TFX led to reduced custom code, faster experiment cycles, and a 2% increase in app installs resulting from improved data and model analysis.

KEYWORDS

large-scale machine learning; end-to-end platform; continuous training

1 INTRODUCTION

It is hard to overemphasize the importance of machine learning in modern computing. More and more organizations adopt machine learning as a tool to gain knowledge from data across a broad spectrum of use cases and products, ranging from recommender systems [6, 7], to clickthrough rate prediction for advertising [13, 15], and even the protection of endangered species [5].

The conceptual workflow of applying machine learning to a specific use case is simple: at the training phase, a learner takes a dataset as input and emits a learned model; at the inference phase, the model takes features as input and emits predictions. However, the actual workflow becomes more complex when machine learning needs to be deployed in production. In this case, additional components are required that, together with the learner and model, comprise a machine learning platform. The components provide automation to deal with a diverse range of failures that can happen in production and to ensure that model training and serving happen reliably. Building this type of automation is non-trivial, and it becomes even more challenging when we consider the following complications:

Building one machine learning platform for many different learning tasks: Products can have substantially different needs in terms of data representation, storage infrastructure, and machine learning tasks. The machine learning platform must be generic enough to handle the most common set of learning tasks as well as be extensible to support one-off atypical use-cases.

Continuous training and serving: The platform has to support the case of training a single model over fixed data, but also the case of generating and serving up-to-date models through continuous training over evolving data (e.g., a moving window over the latest n days of a log stream).

Human-in-the-loop: The machine learning platform needs to expose simple user interfaces to make it easy for engineers to deploy and monitor the platform with minimal configuration. Furthermore, it also needs to help users with various levels of machine-learning expertise understand and analyze their data and models.

Production-level reliability and scalability: The platform needs to be resilient to disruptions from inconsistent data, software, user configurations, and failures in the underlying execution environment. In addition, the platform must scale gracefully to the high data volume that is common in training, and also to increases in the production traffic to the serving system.

Having this type of platform enables teams to easily deploy machine learning in production for a wide range of products, ensures best practices for different components of the platform, and limits the technical debt arising from one-off implementations that cannot be reused in different contexts.

This paper presents the anatomy of end-to-end machine learning platforms and introduces TensorFlow Extended (TFX), one implementation of such a platform that we built at Google to address the aforementioned challenges. We describe the key platform components and the salient points behind their design and functionality. We also present a case study of deploying the platform in Google Play, a commercial mobile app store with over one billion active users and over one million apps, and discuss the lessons that we learned in this process. These lessons reflect best practices for machine learning platforms in a diverse set of contexts and are thus of general interest to researchers and practitioners in the field.

2 PLATFORM OVERVIEW

Background and Related Work

Prior art has addressed a subset of the challenges in deploying machine learning in production. Related work has reported that the learning algorithm is only one component of a machine learning platform that represents a small fraction of the code [19, 20]. Data and model parallelism require distributed systems and orchestration that exceed capabilities of many single-machine solutions [12, 16]. Beyond simply stitching together components, a machine learning pipeline also needs to be simple to set up [16], maybe even support automated pipeline construction [20]. Once a team can train multiple models it needs to keep track of their experiment history in a centralized database [21]. Ideally, the platform automatically surveys different machine learning techniques and suggests the best solution, allowing even non-experts access to machine learning [10]. However, putting together several disjoint components to do the job can result in significant technical debt in forms of hard-to-maintain glue code, hidden dependencies, feedback loops, etc. [19].

Platform Design and Anatomy

In this paper we expand on existing literature and address the challenges outlined in the introduction by presenting a reusable machine learning platform developed at Google. Our ...

... it also imposes requirements on all other components. Data analysis, validation, and visualization tools need to support sparse, dense, or sequence data. Model validation, evaluation, and serving tools need to support all kinds of inference types, including (among others) regression, classification, and sequences.

Continuous training. Most machine learning pipelines are set up as workflows or dependency graphs (e.g., [14, 20]) that execute specific operations or jobs in a defined sequence. If a team needs to train over new data, the same workflow or graph is executed again. However, many real-world use-cases require continuous training. TFX supports several continuation strategies that result from the interaction between data visitation and warm-starting options. Data visitation can be configured to be static or dynamic (over a rolling range of directories). Warm-starting initializes a subset of model parameters from a previous state.

Easy-to-use configuration and tools. Providing a unified configuration framework is only possible if components also share utilities that allow them to communicate and share assets. A TFX user is only exposed to one common configuration that is passed to all components and shared where necessary. Utilities that are used by all components enable enforcement of global garbage collection policies, unified debugging and status signals, etc.

Production-level reliability and scalability. Only a small fraction of a machine learning platform is the actual code implementing the training algorithm [19]. If the platform handles and encapsulates the complexity of machine learning deployment, engineers and scientists have more time to focus on the modeling tasks. Since it is difficult to predict whether a learning algorithm will behave reasonably on new data [8], model validation is critical. In turn, model validation must be coupled with data validation in order to detect corrupted training data and thus prevent bad (yet, validated) models from reaching production. To give an example, training data that accidentally includes the label will lead to a good quality model that passes validation, but would not perform well in production where the label is not available. Validating the serving infrastructure before pushing to the production environment is vital to the reliability and robustness of any machine learning platform.
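The label-leakage example above can be illustrated with a small data check. The following is a minimal sketch in plain Python and not TFX's data validation component: the function name find_label_leaks, the dictionary-based example format, and the 0.99 threshold are assumptions made for illustration only. It flags training features that are unavailable at serving time, and features whose value equals the label in nearly every example.

```python
from typing import Dict, Sequence, Set

Example = Dict[str, object]  # hypothetical in-memory example format


def find_label_leaks(
    train_examples: Sequence[Example],
    label_key: str,
    serving_features: Set[str],
    match_threshold: float = 0.99,  # assumed cut-off, not from the paper
) -> Dict[str, str]:
    """Return {feature_name: reason} for features that look like label leaks."""
    # Collect every feature name seen in training, excluding the label itself.
    feature_names: Set[str] = set()
    for example in train_examples:
        feature_names.update(example.keys())
    feature_names.discard(label_key)

    suspects: Dict[str, str] = {}
    for name in sorted(feature_names):
        # A feature the serving system cannot provide is useless in production.
        if name not in serving_features:
            suspects[name] = "present in training but not available at serving"
            continue
        # A feature that equals the label almost everywhere is a likely copy of it.
        matches = sum(
            1
            for example in train_examples
            if name in example and example[name] == example.get(label_key)
        )
        if train_examples and matches / len(train_examples) >= match_threshold:
            suspects[name] = "value equals the label in nearly all examples"
    return suspects


if __name__ == "__main__":
    train = [
        {"clicks": 3, "installed": 1, "label": 1},
        {"clicks": 0, "installed": 0, "label": 0},
        {"clicks": 1, "installed": 1, "label": 1},
    ]
    print(find_label_leaks(train, "label", {"clicks", "installed"}))
```

A check of this kind runs before training, so a leaked label is caught even though the resulting model would pass model validation.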
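The continuation strategies described under "Continuous training" combine two choices: which data directories to visit, and which parameters to carry over from a previous model. The sketch below shows one way these two choices could be expressed in plain Python. It is not TFX's configuration or API. The function names rolling_window and warm_start, the date-stamped directory names, and the name-prefix rule for selecting parameters are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def rolling_window(day_dirs: Sequence[str], n_days: int) -> List[str]:
    """Dynamic data visitation: keep only the latest n_days directories.

    Assumes directory names end in a YYYY-MM-DD stamp under a common parent,
    so that lexicographic order equals chronological order. Static visitation
    would simply use a fixed list of directories instead.
    """
    if n_days <= 0:
        raise ValueError("n_days must be positive")
    return sorted(day_dirs)[-n_days:]


def warm_start(
    new_params: Dict[str, List[float]],
    previous_params: Dict[str, List[float]],
    prefixes: Sequence[str],
) -> Dict[str, List[float]]:
    """Initialize a subset of parameters from a previous model state.

    Only parameters that exist in the new model and whose name starts with
    one of the given prefixes are copied; all others keep their fresh values.
    """
    initialized = dict(new_params)
    for name, value in previous_params.items():
        if name in initialized and name.startswith(tuple(prefixes)):
            initialized[name] = value
    return initialized


if __name__ == "__main__":
    dirs = ["logs/2017-08-11", "logs/2017-08-13", "logs/2017-08-12", "logs/2017-08-10"]
    print(rolling_window(dirs, 2))

    fresh = {"embedding/app": [0.0, 0.0], "dense/w": [0.1]}
    previous = {"embedding/app": [0.5, -0.2], "dense/w": [9.9], "old/unused": [1.0]}
    print(warm_start(fresh, previous, ["embedding/"]))
```

In this sketch, each new training run would call rolling_window on the current directory listing and then warm_start on the freshly created parameters, so the window advances as new data arrives while the selected parameters continue from their previous values.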
