Transcription of Data Mining: Practical Machine Learning Tools and ...
The Morgan Kaufmann Series in Data Management Systems
Series Editor: Jim Gray, Microsoft Research

Data Mining: Practical Machine Learning Tools and Techniques, Second Edition
    Ian H. Witten and Eibe Frank
Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration
    Earl Cox
Data Modeling Essentials, Third Edition
    Graeme C. Simsion and Graham C. Witt
Location-Based Services
    Jochen Schiller and Agnès Voisard
Database Modeling with Microsoft Visio for Enterprise Architects
    Terry Halpin, Ken Evans, Patrick Hallock, and Bill Maclean
Designing Data-Intensive Web Applications
    Stefano Ceri, Piero Fraternali, Aldo Bongio, Marco Brambilla, Sara Comai, and Maristella Matera
Mining the Web: Discovering Knowledge from Hypertext Data
    Soumen Chakrabarti
Advanced SQL:1999 - Understanding Object-Relational and Other Advanced Features
    Jim Melton
Database Tuning: Principles, Experiments, and Troubleshooting Techniques
    Dennis Shasha and Philippe Bonnet
SQL:1999 - Understanding Relational Language Components
    Jim Melton and Alan R. Simon
Information Visualization in Data Mining and Knowledge Discovery
    Edited by Usama Fayyad, Georges G. Grinstein, and Andreas Wierse
Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery
    Gerhard Weikum and Gottfried Vossen
Spatial Databases: With Application to GIS
    Philippe Rigaux, Michel Scholl, and Agnès Voisard
Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design
    Terry Halpin
Component Database Systems
    Edited by Klaus R. Dittrich and Andreas Geppert
Managing Reference Data in Enterprise Databases: Binding Corporate Data to the Wider World
    Malcolm Chisholm
Data Mining: Concepts and Techniques
    Jiawei Han and Micheline Kamber
Understanding SQL and Java Together: A Guide to SQLJ, JDBC, and Related Technologies
    Jim Melton and Andrew Eisenberg
Database: Principles, Programming, and Performance, Second Edition
    Patrick O'Neil and Elizabeth O'Neil
The Object Data Standard: ODMG 3.0
    Edited by R. G. G. Cattell, Douglas K. Barry, Mark Berler, Jeff Eastman, David Jordan, Craig Russell, Olaf Schadow, Torsten Stanienda, and Fernando Velez
Data on the Web: From Relations to Semistructured Data and XML
    Serge Abiteboul, Peter Buneman, and Dan Suciu
Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
    Ian H. Witten and Eibe Frank
Joe Celko's SQL for Smarties: Advanced SQL Programming, Second Edition
    Joe Celko
Joe Celko's Data and Databases: Concepts in Practice
    Joe Celko
Developing Time-Oriented Database Applications in SQL
    Richard T. Snodgrass
Web Farming for the Data Warehouse
    Richard D. Hackathorn
Database Modeling & Design, Third Edition
    Toby J. Teorey
Management of Heterogeneous and Autonomous Database Systems
    Edited by Ahmed Elmagarmid, Marek Rusinkiewicz, and Amit Sheth
Object-Relational DBMSs: Tracking the Next Great Wave, Second Edition
    Michael Stonebraker and Paul Brown, with Dorothy Moore
A Complete Guide to DB2 Universal Database
    Don Chamberlin
Universal Database Management: A Guide to Object/Relational Technology
    Cynthia Maro Saracco
Readings in Database Systems, Third Edition
    Edited by Michael Stonebraker and Joseph M. Hellerstein
Understanding SQL's Stored Procedures: A Complete Guide to SQL/PSM
    Jim Melton
Principles of Multimedia Database Systems
    V. S. Subrahmanian
Principles of Database Query Processing for Advanced Applications
    Clement T. Yu and Weiyi Meng
Advanced Database Systems
    Carlo Zaniolo, Stefano Ceri, Christos Faloutsos, Richard T. Snodgrass, V. S. Subrahmanian, and Roberto Zicari
Principles of Transaction Processing for the Systems Professional
    Philip A. Bernstein and Eric Newcomer
Using the New DB2: IBM's Object-Relational Database System
    Don Chamberlin
Distributed Algorithms
    Nancy A. Lynch
Active Database Systems: Triggers and Rules For Advanced Database Processing
    Edited by Jennifer Widom and Stefano Ceri
Migrating Legacy Systems: Gateways, Interfaces & the Incremental Approach
    Michael L. Brodie and Michael Stonebraker
Atomic Transactions
    Nancy Lynch, Michael Merritt, William Weihl, and Alan Fekete
Query Processing For Advanced Database Systems
    Edited by Johann Christoph Freytag, David Maier, and Gottfried Vossen
Transaction Processing: Concepts and Techniques
    Jim Gray and Andreas Reuter
Building an Object-Oriented Database System: The Story of O2
    Edited by François Bancilhon, Claude Delobel, and Paris Kanellakis
Database Transaction Models For Advanced Applications
    Edited by Ahmed K. Elmagarmid
A Guide to Developing Client/Server SQL Applications
    Setrag Khoshafian, Arvola Chan, Anna Wong, and Harry K. T. Wong
The Benchmark Handbook For Database and Transaction Processing Systems, Second Edition
    Edited by Jim Gray
Camelot and Avalon: A Distributed Transaction Facility
    Edited by Jeffrey L. Eppinger, Lily B. Mummert, and Alfred Z. Spector
Readings in Object-Oriented Database Systems
    Edited by Stanley B. Zdonik and David Maier

Data Mining
Practical Machine Learning Tools and Techniques, Second Edition

Ian H. Witten
Department of Computer Science
University of Waikato

Eibe Frank
Department of Computer Science
University of Waikato

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

MORGAN KAUFMANN PUBLISHERS IS AN IMPRINT OF ELSEVIER

Publisher: Diane Cerra
Publishing Services Manager: Simon Crump
Project Manager: Brandy Lilly
Editorial Assistant: Asma Stephan
Cover Design: Yvo Riezebos Design
Cover Image: Getty Images
Composition: SNP Best-set Typesetter Ltd., Hong Kong
Technical Illustration: Dartmouth Publishing, Inc.
Copyeditor: Graphic World
Proofreader: Graphic World
Indexer: Graphic World
Interior printer: The Maple-Vail Book Manufacturing Group
Cover printer: Phoenix Color Corp

Morgan Kaufmann Publishers is an imprint of Elsevier.
500 Sansome Street, Suite 400, San Francisco, CA 94111

This book is printed on acid-free paper.

© 2005 by Elsevier Inc. All rights reserved.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, scanning, or otherwise) without prior written permission of the publisher.

Permissions may be sought directly from Elsevier's Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333. You may also complete your request on-line via the Elsevier homepage by selecting "Customer Support" and then "Obtaining Permissions."
Library of Congress Cataloging-in-Publication Data

Witten, I. H. (Ian H.)
Data mining : practical machine learning tools and techniques / Ian H. Witten, Eibe Frank. 2nd ed.
p. cm. (Morgan Kaufmann series in data management systems)
Includes bibliographical references and index.
ISBN: 0-12-088407-0
1. Data mining. I. Frank, Eibe.
dc22 2005043385

For information on all Morgan Kaufmann publications, visit our Web site.

Printed in the United States of America
05 06 07 08 09 5 4 3 2 1

Foreword
Jim Gray, Series Editor
Microsoft Research

Technology now allows us to capture and store vast quantities of data. Finding patterns, trends, and anomalies in these datasets, and summarizing them with simple quantitative models, is one of the grand challenges of the information age: turning data into information and turning information into knowledge.

There has been stunning progress in data mining and machine learning. The synthesis of statistics, machine learning, information theory, and computing has created a solid science, with a firm mathematical base, and with very powerful tools.
Witten and Frank present much of this progress in this book and in the companion implementation of the key algorithms. As such, this is a milestone in the synthesis of data mining, data analysis, information theory, and machine learning. If you have not been following this field for the last decade, this is a great way to catch up on this exciting progress. If you have, then Witten and Frank's presentation and the companion open-source workbench, called Weka, will be a useful addition to your toolkit.

They present the basic theory of automatically extracting models from data, and then validating those models. The book does an excellent job of explaining the various models (decision trees, association rules, linear models, clustering, Bayes nets, neural nets) and how to apply them in practice. With this basis, they then walk through the steps and pitfalls of various approaches. They describe how to safely scrub datasets, how to build models, and how to evaluate a model's predictive quality.
Most of the book is tutorial, but Part II broadly describes how commercial systems work and gives a tour of the publicly available data mining workbench that the authors provide through a website. This Weka workbench has a graphical user interface that leads you through data mining tasks and has excellent data visualization tools that help understand the models. It is a great companion to the text and a useful and popular tool in its own right.

This book presents this new discipline in a very accessible form: as a text both to train the next generation of practitioners and researchers and to inform lifelong learners like myself. Witten and Frank have a passion for simple and elegant solutions. They approach each topic with this mindset, grounding all concepts in concrete examples, and urging the reader to consider the simple techniques first, and then progress to the more sophisticated ones if the simple ones prove inadequate.

If you are interested in databases, and have not been following the machine learning field, this book is a great way to catch up on this exciting progress.
If you have data that you want to analyze and understand, this book and the associated Weka toolkit are an excellent way to start.

Contents

Foreword
Preface
Updated and revised content
Acknowledgments

Part I Machine learning tools and techniques

1 What's it all about?
    Data mining and machine learning
    Describing structural patterns
    Machine learning
    Data mining
    Simple examples: The weather problem and others
    The weather problem
    Contact lenses: An idealized problem
    Irises: A classic numeric dataset
    CPU performance: Introducing numeric prediction
    Labor negotiations: A more realistic example
    Soybean classification: A classic machine learning success
    Fielded applications
    Decisions involving judgment
    Screening images
    Load forecasting
    Diagnosis
    Marketing and sales
    Other applications
    Machine learning and statistics
    Generalization as search
    Enumerating the concept space
    Data mining and ethics
    Further reading

2 Input: Concepts, instances, and attributes
    What's a concept?
    What's in an example?
    What's in an attribute?
    Preparing the input
    Gathering the data together
    ARFF format
    Sparse data
    Attribute types
    Missing values
    Inaccurate values
    Getting to know your data
    Further reading

3 Output: Knowledge representation
    Rules with exceptions
    Rules involving relations
    Trees for numeric prediction

4 Algorithms: The basic methods
    Inferring rudimentary rules
    Missing values and numeric attributes
    Statistical modeling
    Missing values and numeric attributes
    Bayesian models for document classification
    Divide-and-conquer: Constructing decision trees
    Calculating information
    Highly branching attributes
    Covering algorithms: Constructing rules
    Rules versus trees
    A simple covering algorithm
    Rules versus decision lists
    Mining association rules
    Item sets
    Association rules
    Generating rules efficiently
    Linear models
    Numeric prediction: Linear regression
    Linear classification: Logistic regression
    Linear classification using the perceptron
    Linear classification using Winnow
    Instance-based learning
    The distance function
    Finding nearest neighbors efficiently
    Iterative distance-based clustering
    Faster distance