Knowledge Discovery and Data Mining: Towards a Unifying Framework

Usama Fayyad
Microsoft Research, One Microsoft Way, Redmond, WA 98052

Gregory Piatetsky-Shapiro
GTE Laboratories, MS 44, Waltham, MA 02154

Padhraic Smyth
Information and Computer Science, University of California, Irvine, CA 92717-3425

Abstract

This paper presents a first step towards a unifying framework for Knowledge Discovery in Databases. We describe links between data mining, knowledge discovery, and other related fields. We then define the KDD process and basic data mining algorithms, discuss application issues and conclude with an analysis of challenges facing practitioners in the field.

1 Introduction

Across a wide variety of fields, data are being collected and accumulated at a dramatic pace. There is an urgent need for a new generation of computational techniques and tools to assist humans in extracting useful information (knowledge) from the rapidly growing volumes of data. These techniques and tools are the subject of the emerging field of knowledge discovery in databases (KDD).

This paper is an initial step towards a common framework that we hope will allow us to understand the variety of activities in this multidisciplinary field and how they fit together. We view the knowledge discovery process as a set of various activities for making sense of data. At the core of this process is the application of data mining methods for pattern¹ discovery. We examine how data mining is used and outline some of its methods. Finally, we look at practical application issues of KDD and enumerate challenges for future research and development.

2 KDD, Data Mining, and Relation to other Fields

Historically the notion of finding useful patterns in data has been given a variety of names including data mining, knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing.

The term data mining has been mostly used by statisticians, data analysts, and the management information systems (MIS) communities. It has also gained popularity in the database field. The term KDD was coined at the first KDD workshop in 1989 (Piatetsky-Shapiro 1991) to emphasize that "knowledge" is the end product of a data-driven discovery. It has been popularized in artificial intelligence and machine learning.

In our view KDD refers to the overall process of discovering useful knowledge from data while data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data. The distinction between the KDD process and the data mining step (within the process) is a central point of this paper.

¹ Throughout this paper we use the term "pattern" to designate a pattern or model extracted from the data.

The additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporating appropriate prior knowledge, and proper interpretation of the results of mining, are essential to ensure that useful knowledge is derived from the data. Blind application of data mining methods (rightly criticised as "data dredging" in the statistical literature) can be a dangerous activity, easily leading to discovery of meaningless patterns.

KDD has evolved, and continues to evolve, from the intersection of research fields such as machine learning, pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition for expert systems, data visualization, and high performance computing. The unifying goal is extracting high-level knowledge from low-level data in the context of large data sets.

KDD overlaps with machine learning and pattern recognition in the study of particular data mining theories and algorithms: means for modeling data and extracting patterns.

KDD focuses on aspects of finding understandable patterns that can be interpreted as useful or interesting knowledge, and puts a strong emphasis on working with large sets of real-world data. Thus, scaling properties of algorithms to large data sets are of fundamental interest.

KDD also has much in common with statistics, particularly exploratory data analysis methods. The statistical approach offers precise methods for quantifying the inherent uncertainty which results when one tries to infer general patterns from a particular sample of an overall population. KDD software systems often embed particular statistical procedures for sampling and modeling data, evaluating hypotheses, and handling noise within an overall knowledge discovery framework. In contrast to traditional approaches in statistics, KDD approaches typically employ more search in model extraction and operate in the context of larger data sets with richer data structures.

From: KDD-96 Proceedings. Copyright 1996, AAAI. All rights reserved.

In addition to its strong relation to the database field (the 2nd D in KDD), another related area is data warehousing, which refers to the popular business trend for collecting and cleaning transactional data to make them available for on-line analysis and decision support.

A popular approach for analysis of data warehouses has been called OLAP (on-line analytical processing), after a set of principles proposed by Codd (1993). OLAP tools focus on providing multi-dimensional data analysis, which is superior to SQL in computing summaries and breakdowns along many dimensions. OLAP tools are targeted towards simplifying and supporting interactive data analysis, while the KDD tool's goal is to automate as much of the process as possible.

3 Basic Definitions

We define KDD (Fayyad, Piatetsky-Shapiro, & Smyth 1996) as:

Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

Here data is a set of facts (e.g., cases in a database) and pattern is an expression in some language describing a subset of the data or a model applicable to that subset.
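The definition of a pattern as an expression describing a subset of the data can be made concrete with a minimal sketch. The `Pattern` class, the field names, and the loan-style records below are hypothetical and invented purely for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Fact = Dict[str, float]  # one case (row) in a hypothetical database


@dataclass
class Pattern:
    """A pattern: a readable expression plus the test it stands for."""
    expression: str
    matches: Callable[[Fact], bool]

    def subset(self, data: List[Fact]) -> List[Fact]:
        # The subset of the data that the expression describes.
        return [fact for fact in data if self.matches(fact)]


# Hypothetical records, invented for illustration only.
data = [
    {"income": 20.0, "debt": 15.0},
    {"income": 55.0, "debt": 5.0},
    {"income": 70.0, "debt": 30.0},
    {"income": 35.0, "debt": 2.0},
]

low_income = Pattern("income < 40", lambda fact: fact["income"] < 40)
covered = low_income.subset(data)
coverage = len(covered) / len(data)  # fraction of facts the pattern describes
```

Here the expression language is just a Python predicate; a real KDD system would typically use a restricted language (rules, trees, equations) so that the space of patterns can be searched.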

Hence, in our usage here, extracting a pattern also designates fitting a model to data, finding structure from data, or in general any high-level description of a set of data. The term process implies that KDD is comprised of many steps, which involve data preparation, search for patterns, knowledge evaluation, and refinement, all repeated in multiple iterations. By non-trivial we mean that some search or inference is involved, i.e., it is not a straightforward computation of predefined quantities like computing the average value of a set of numbers. The discovered patterns should be valid on new data with some degree of certainty. We also want patterns to be novel (at least to the system, and preferably to the user) and potentially useful, i.e., lead to some benefit to the user/task.
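The contrast between a trivial computation and a non-trivial search, and the requirement that a pattern be valid on new data, can be sketched as follows. The data values, the rule form `value > threshold`, and the helper `accuracy` are assumptions made for this illustration only.

```python
from statistics import mean

# Hypothetical (value, label) cases; label is True for the class of interest.
train = [(1.0, False), (2.0, False), (3.0, False),
         (6.0, True), (7.0, True), (8.0, True)]
new_data = [(1.5, False), (2.5, False), (5.0, True), (9.0, True)]


def accuracy(threshold, cases):
    """Fraction of cases where the rule 'value > threshold' agrees with the label."""
    return sum((value > threshold) == label for value, label in cases) / len(cases)


# Trivial: a predefined quantity, computed directly with no search.
average_value = mean(value for value, _ in train)

# Non-trivial: search a space of candidate patterns of the form 'value > t'.
candidates = sorted(value for value, _ in train)
best_threshold = max(candidates, key=lambda t: accuracy(t, train))

# Validity: the pattern should still hold on data not used during the search.
validity = accuracy(best_threshold, new_data)
```

The search space here is tiny (one threshold per observed value); the point is only that the pattern is found by searching and is then checked on cases that played no part in that search.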

Finally, the patterns should be understandable, if not immediately then after some post-processing.

The above implies that we can define quantitative measures for evaluating extracted patterns. In many cases, it is possible to define measures of certainty (e.g., estimated prediction accuracy on new data) or utility (e.g., gain, perhaps in dollars saved due to better predictions or speed-up in response time of a system). Notions such as novelty and understandability are much more subjective. In certain contexts understandability can be estimated by simplicity (e.g., the number of bits to describe a pattern). An important notion, called interestingness (e.g., see Piatetsky-Shapiro & Matheus 1994, Silberschatz & Tuzhilin 1995), is usually taken as an overall measure of pattern value, combining validity, novelty, usefulness, and simplicity.
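One way to picture an explicitly defined interestingness function is as a weighted combination of the four components named above. The sketch below is an assumption-laden illustration, not a measure proposed in the paper: the weights, the `max_bits` normalizer, the 64-symbol alphabet, and the example scores are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ScoredPattern:
    expression: str
    validity: float    # e.g. estimated prediction accuracy on new data, in [0, 1]
    novelty: float     # subjective score in [0, 1]
    usefulness: float  # e.g. normalized gain, in [0, 1]


def description_bits(expression, alphabet_size=64):
    """Crude simplicity proxy: bits needed to spell the expression symbol by symbol."""
    return len(expression) * math.log2(alphabet_size)


def interestingness(p, weights=(0.4, 0.2, 0.3, 0.1), max_bits=600.0):
    """Hypothetical weighted sum of validity, novelty, usefulness, and simplicity."""
    simplicity = max(0.0, 1.0 - description_bits(p.expression) / max_bits)
    w_valid, w_novel, w_useful, w_simple = weights
    return (w_valid * p.validity + w_novel * p.novelty
            + w_useful * p.usefulness + w_simple * simplicity)


patterns = [
    ScoredPattern("income < 40 and debt > 10 and age between 25 and 60 "
                  "and region in (north, west)", 0.6, 0.2, 0.3),
    ScoredPattern("income < 40", 0.9, 0.5, 0.8),
]

# An ordering over discovered patterns, most interesting first.
ranked = sorted(patterns, key=interestingness, reverse=True)
```

Sorting by the score yields an ordering over the discovered patterns; a system could equally expose only the ordering and keep the scoring function implicit.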

Interestingness functions can be explicitly defined or can be manifested implicitly via an ordering placed by the KDD system on the discovered patterns or models.

Data mining is a step in the KDD process consisting of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data (see Section 5 for more details).

Note that the space of patterns is often infinite, and the enumeration of patterns involves some form of search in this space. Practical computational constraints place severe limits on the subspace that can be explored by a data mining algorithm.

KDD Process is the process of using the database along with any required selection, preprocessing, subsampling, and transformations of it; to apply data mining methods (algorithms) to enumerate patterns from it; and to evaluate the products of data mining to identify the subset of the enumerated patterns deemed "knowledge".
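The stages named in this definition can be sketched as a chain of small functions. Everything below is a hypothetical illustration: the records, the field names, the rule form `income <= t -> defaulted`, the pattern budget, and the confidence and support thresholds are assumptions, not part of the paper.

```python
import random
from itertools import islice

# Hypothetical raw database, invented for illustration only.
raw = [
    {"age": 25, "income": 30.0, "defaulted": True},
    {"age": 40, "income": 80.0, "defaulted": False},
    {"age": 31, "income": None, "defaulted": False},  # missing value
    {"age": 52, "income": 95.0, "defaulted": False},
    {"age": 23, "income": 22.0, "defaulted": True},
    {"age": 37, "income": 60.0, "defaulted": False},
]


def select(rows):
    # Selection: keep only the fields relevant to the task.
    return [{"income": r["income"], "defaulted": r["defaulted"]} for r in rows]


def preprocess(rows):
    # Preprocessing (cleaning): drop cases with missing values.
    return [r for r in rows if r["income"] is not None]


def subsample(rows, k, seed=0):
    # Subsampling: a reproducible random sample of at most k cases.
    return random.Random(seed).sample(rows, min(k, len(rows)))


def transform(rows):
    # Transformation: reduce each case to a (feature, label) pair.
    return [(r["income"], r["defaulted"]) for r in rows]


def mine(cases, max_patterns=10):
    # Data mining: enumerate rules 'income <= t -> defaulted', capped by a
    # budget that stands in for practical computational limits.
    thresholds = sorted({income for income, _ in cases})
    for t in islice(thresholds, max_patterns):
        covered = [label for income, label in cases if income <= t]
        confidence = sum(covered) / len(covered)
        yield (f"income <= {t} -> defaulted", confidence, len(covered))


def evaluate(patterns, min_confidence=0.9, min_support=2):
    # Evaluation: keep the subset of enumerated patterns deemed "knowledge".
    return [p for p in patterns if p[1] >= min_confidence and p[2] >= min_support]


cases = transform(subsample(preprocess(select(raw)), k=5))
knowledge = evaluate(mine(cases))
```

In practice the chain is rarely run once front to back; the results of evaluation usually send the analyst back to an earlier stage with different choices.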

The data mining component of the KDD process is concerned with the algorithmic means by which patterns are extracted and enumerated from data. The overall KDD process (Figure 1) includes the evaluation and possible interpretation of the "mined" patterns to determine which patterns may be considered new "knowledge." The KDD process also includes all of the additional steps described in Section 4. The notion of an overall user-driven process is not unique to KDD: analogous proposals have been put forward in statistics (Hand 1994) and in machine learning (Brodley and Smyth 1996).

4 The KDD Process

The KDD process is interactive and iterative, involving numerous steps with many decisions being made by the user.

[Figure 1: the overall KDD process. Surviving labels: Transformed Data, Data.]
