One button machine for automating feature engineering in relational databases

Hoang Thanh Lam, Johann-Michael Thiebaut, Mathieu Sinn, Bei Chen, Tiep Mai and Oznur Alkan
IBM Research
Dublin, Ireland
1 Jun 2017

Feature engineering is one of the most important and time consuming tasks in predictive analytics projects. It involves understanding domain knowledge and data exploration to discover relevant hand-crafted features from raw data. In this paper, we introduce a system called One button machine, or OneBM for short, which automates feature discovery in relational databases. OneBM automatically performs a key activity of data scientists, namely, joining of database tables and applying advanced data transformations to extract useful features from data.

We validated OneBM in three Kaggle competitions, in which it performed as well as the top 16% to 24% of data scientists. More importantly, OneBM outperformed the state-of-the-art system in a Kaggle competition in terms of prediction accuracy and ranking on the Kaggle leaderboard. The results show that OneBM can be useful for both data scientists and non-experts. It helps data scientists reduce data exploration time, allowing them to try out many ideas by trial and error in a short time. On the other hand, it enables non-experts, who are not familiar with data science, to quickly extract value from their data with little effort and time.

I. INTRODUCTION

Over the last decade, data analytics has become an important trend in many industries including e-commerce, healthcare, manufacturing and more. The reasons behind the increasing interest are the availability of data, the variety of open-source machine learning tools and powerful computing. However, machine learning tools for analyzing data are still difficult for non-experts to use, since a typical data analytics project contains many tasks that have not been fully automated.

In fact, Figure 1 shows five basic steps in a predictive data analytics project. Although many automation tools exist for the last step, no tools exist to fully automate the remaining steps. Among these steps, feature engineering is one of the most important tasks because it prepares inputs to machine learning models, thus determining how well machine learning models will perform.

In general, automation of feature engineering is hard because it requires highly skilled data scientists with strong data mining and statistics backgrounds, in addition to domain knowledge, to extract useful patterns from data. Feature engineering is known as a bottleneck in any data analytics project. In fact, in recent public data science competitions, top data scientists reported that most of the time they spent on such competitions went to feature engineering, that is, to working with raw data to prepare input for machine learning models (see the Kaggle blog post "Learning from the best"). In an extreme case, the Grupo Bimbo Inventory Prediction competition, the winners reported that 95% of their time went to feature engineering and only 5% to modelling (see the Grupo Bimbo inventory prediction winner interview).

Fig. 1. Five basic steps of a data analytics project: relevant data acquisition, problem formulation, data cleaning & curation, feature engineering, model selection & tuning.

Therefore, automation of feature engineering may significantly reduce data scientists' workload, allowing them to try out many ideas by trial and error and improve prediction results with significantly less effort. Moreover, in many data science projects, companies commonly want to quickly try some simple ideas first to check whether there is any value in their datasets before investing more time, effort and money in a data analytics project. Automation helps the company make quick decisions at lower cost. Last but not least, automation addresses the shortage of data scientists, enabling non-experts to extract value from their data by themselves.

In order to build a fully automatic system for feature engineering, we need to tackle the following challenges:

- diverse basic data types: columns in tables can have different basic data types, including simple ones like numerical or categorical, or complicated ones like text, trajectories, GPS locations, images, sequences and time series
- collective data types: the complexity of data types increases when the data is the result of joining multiple tables. The joined results may correspond to a set or sequence of basic types.
- temporal information: the data might be associated with timestamps, which introduce order in the data.
- complex relational graph: the relational graph might be very complex, and the number of possible relational paths can be exponential in the number of tables in the database, which makes exhaustive data exploration intractable
- large transformation search space: there are infinitely many ways to transform joined tables into features, and which transformation is useful for a given type of problem is not known in advance without domain knowledge about the data

In this work, we propose the one button machine (OneBM), a framework that supports feature engineering from relational data, aiming at tackling the aforementioned challenges. OneBM works directly with multiple raw tables in a database. It joins the tables incrementally, following different paths on the relational graph.
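Following paths on a relational graph can be pictured as a graph traversal. The sketch below is an illustration only, not OneBM's implementation: the schema `RELATIONAL_GRAPH`, its table names, the helper `enumerate_join_paths` and the `max_depth` cut-off are all assumptions introduced here. It lists the join paths reachable from a main table with a depth-limited depth-first search, where the depth limit is one simple way to keep the number of explored paths bounded.

```python
from typing import Dict, List, Tuple

# Hypothetical toy schema: each table maps to the tables it links to through
# a foreign key. Table names are illustrative and not taken from the paper.
RELATIONAL_GRAPH: Dict[str, List[str]] = {
    "main": ["orders", "customers"],
    "orders": ["products"],
    "customers": [],
    "products": [],
}


def enumerate_join_paths(
    graph: Dict[str, List[str]], root: str, max_depth: int
) -> List[Tuple[str, ...]]:
    """Depth-first enumeration of simple join paths that start at `root`.

    `max_depth` (an assumption of this sketch) bounds the number of joins
    along a path, which keeps the enumeration finite and small.
    """
    paths: List[Tuple[str, ...]] = []

    def visit(path: List[str]) -> None:
        # Every path with at least one join is a candidate for feature extraction.
        if len(path) > 1:
            paths.append(tuple(path))
        # Stop extending once the path already contains max_depth joins.
        if len(path) - 1 >= max_depth:
            return
        for neighbour in graph.get(path[-1], []):
            if neighbour not in path:  # never revisit a table on the same path
                path.append(neighbour)
                visit(path)
                path.pop()

    visit([root])
    return paths


if __name__ == "__main__":
    for join_path in enumerate_join_paths(RELATIONAL_GRAPH, "main", max_depth=2):
        print(" -> ".join(join_path))
```

With this toy schema and `max_depth=2`, the traversal yields three paths: `main -> orders`, `main -> orders -> products` and `main -> customers`. Each path would then correspond to one incremental join starting from the main table.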

OneBM automatically identifies the data types of the joined results, including simple data types (numerical or categorical) and complex data types (set of numbers, set of categories, sequences, time series and texts), and applies the corresponding pre-defined feature engineering techniques to the given types. In doing so, new feature engineering techniques can be plugged in via an interface with OneBM's feature extractor modules to extract desired types of features in a specific domain. OneBM supports data scientists by automating the most popular feature engineering techniques on different structured and unstructured data.

In summary, the key contributions of this work are as follows:

- we proposed an efficient method based on depth-first search to explore complex relational graphs for automating feature engineering from relational databases
- we proposed methods to synthesize raw data and automatically extract advanced features from structured and unstructured data. The state-of-the-art system only supports numerical data and extracts only basic features based on simple aggregation statistics.
- OneBM, implemented in Apache Spark, is the first framework able to automate feature engineering on large datasets with 100GB of raw data.
- we demonstrate the significance of OneBM via Kaggle competitions in which OneBM competes with data scientists
- we compared our results to the state-of-the-art system via a Kaggle competition in which our system outperformed the state-of-the-art system in terms of prediction accuracy and ranking on leaderboards

II. RELATED WORK

Automation of data science is a broad topic which includes automation of the five basic steps displayed in Figure 1.

Most related work in the literature focuses on the last two steps: automation of model selection, hyper-parameter tuning and feature engineering. In the following subsections, related work regarding automation of these last two steps is discussed.

Automatic model selection and tuning

Auto-Weka [8], [11] and Auto-SkLearn [3] are among the first works trying to find the best combination of data preprocessing, hyper-parameter tuning and model selection. Both are based on Bayesian optimization [2] to avoid exhaustive grid-search parameter enumeration. These works are built on top of existing algorithms and data preprocessing techniques in Weka and Scikit-Learn, thus they are very handy for practical use.

Cognitive Automation of Data Science (CADS) [1], [6] is another system built on top of Weka, SPSS and R to automate model selection and hyper-parameter tuning. CADS is made of three basic components: a repository of analytics algorithms with meta data, a learning control strategy that determines the model and configuration for different analytics tasks, and an interactive user interface. CADS is one of the first deployed solutions.

In addition to the aforementioned works, Automatic Ensemble [12] is the most recent work, which uses stacking and meta-data to assist model selection and tuning. TPOT [10] is another system that uses genetic programming to find the best model configuration and preprocessing workflow. Automatic Statistician [9] is similar to the works just described but focuses more on time-series data and interpretation of the models in natural language.

In summary, automation of hyper-parameter tuning and model selection is a very attractive research topic with a very rich literature. The key difference between our work and these works is that, while the state of the art focuses on optimization of models given a ready set of features stored in a single table, our work focuses on preparing features as an input to these systems from relational databases with multiple tables.
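The difference between a ready single feature table and a multi-table relational database can be made concrete with a small example. The sketch below is an assumption-laden illustration, not OneBM's method: the tables `customers` and `orders`, their columns, and the helper `build_feature_table` are hypothetical, and the aggregates used (count, sum, mean) are deliberately the simplest possible. It joins a one-to-many child table onto a main table and collapses each set of joined values into one row of features, which is the single-table form that model selection and tuning systems expect as input.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List

# Hypothetical toy tables; names and values are made up for illustration.
customers = [
    {"customer_id": 1, "churned": 0},
    {"customer_id": 2, "churned": 1},
]
orders = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 5.0},
]


def build_feature_table(
    main_rows: List[Dict[str, Any]],
    child_rows: List[Dict[str, Any]],
    key: str,
    value: str,
) -> List[Dict[str, Any]]:
    """Join child rows onto main rows by `key` and aggregate `value`.

    Each main row ends up with a set of joined values; simple aggregates
    turn that set into fixed-length numerical features.
    """
    # Group the child values by join key (one-to-many relationship).
    groups: Dict[Any, List[float]] = defaultdict(list)
    for row in child_rows:
        groups[row[key]].append(row[value])

    features = []
    for row in main_rows:
        values = groups.get(row[key], [])
        features.append(
            {
                **row,
                f"{value}_count": len(values),
                f"{value}_sum": sum(values),
                # The mean is undefined for an empty set of joined rows.
                f"{value}_mean": mean(values) if values else None,
            }
        )
    return features


if __name__ == "__main__":
    for feature_row in build_feature_table(customers, orders, "customer_id", "amount"):
        print(feature_row)
```

Here the first customer's two orders collapse into `amount_count=2`, `amount_sum=40.0` and `amount_mean=20.0`. The resulting list of rows is a single flat table with one row per main entity, which could then be handed to any of the model selection systems discussed above.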
