
Data Preparation for Data Mining - TEMIDA

Data Preparation for Data Mining
Dorian Pyle

Senior Editor: Diane D. Cerra
Director of Production & Manufacturing: Yonie Overton
Production Editor: Edward Wade
Editorial Assistant: Belinda Breyer
Cover Design: Wall-To-Wall Studios
Cover Photograph: © 1999 PhotoDisc, Inc.
Text Design & Composition: Rebecca Evans & Associates
Technical Illustration: Dartmouth Publishing, Inc.
Copyeditor: Gary Morris
Proofreader: Ken DellaPenta
Indexer: Steve Rath
Printer: Courier Corp.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances where Morgan Kaufmann Publishers, Inc. is aware of a claim, the product names appear in initial capital or all capital letters.




Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

Morgan Kaufmann Publishers, Inc.
Editorial and Sales Office
340 Pine Street, Sixth Floor
San Francisco, CA 94104-3205 USA
Telephone 415-392-2665
Facsimile 415-982-2665
Email
WWW
Order toll free 800-745-7323

© 1999 by Morgan Kaufmann Publishers, Inc. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, or otherwise) without the prior written permission of the publisher.

Dedication

To my dearly beloved Pat, without whose love, encouragement, and support, this book, and very much more, would never have come to be.

Table of Contents

Data Preparation for Data Mining
Preface
Introduction
Chapter 1 - Data Exploration as a Process
Chapter 2 - The Nature of the World and Its Impact on Data Preparation
Chapter 3 - Data Preparation as a Process
Chapter 4 - Getting the Data: Basic Preparation
Chapter 5 - Sampling, Variability, and Confidence
Chapter 6 - Handling Nonnumerical Variables
Chapter 7 - Normalizing and Redistributing Variables
Chapter 8 - Replacing Missing and Empty Values
Chapter 9 - Series Variables
Chapter 10 - Preparing the Data Set
Chapter 11 - The Data Survey
Chapter 12 - Using Prepared Data
Appendix A - Using the Demonstration Code on the CD-ROM
Appendix B - Further Reading

Preface

What This Book Is About

This book is about what to do with data to get the most out of it. There is a lot more to that statement than first meets the eye. Much information is available today about data warehouses, data mining, KDD, OLTP, OLAP, and a whole alphabet soup of other acronyms that describe techniques and methods of storing, accessing, visualizing, and using data.

There are books and magazines about building models for making predictions of all types: fraud, marketing, new customers, consumer demand, economic statistics, stock movement, option prices, weather, sociological behavior, traffic demand, resource needs, and many more. In order to use the techniques, or make the predictions, industry professionals almost universally agree that one of the most important parts of any such project, and one of the most time-consuming and difficult, is data preparation. Unfortunately, data preparation has been much like the weather: as the old aphorism has it, "Everyone talks about it, but no one does anything about it." This book takes a detailed look at the problems in preparing data, the solutions, and how to use the solutions to get the most out of the data, whatever you want to use it for.

This book tells you what can be done about it, exactly how it can be done, and what it achieves, and puts a powerful kit of tools directly in your hands that allows you to do it. How important is adequate data preparation? After finding the right problem to solve, data preparation is often the key to solving the problem. It can easily be the difference between success and failure, between useable insights and incomprehensible murk, between worthwhile predictions and useless guesses. For instance, in one case data carefully prepared for warehousing proved useless for modeling. The preparation for warehousing had destroyed the useable information content for the needed mining project. Preparing the data for mining, rather than warehousing, produced a 550% improvement in model accuracy.

In another case, a commercial baker achieved a bottom-line improvement approaching $1 million by using data prepared with the techniques described in this book instead of previous approaches.

Who This Book Is For

This book is written primarily for the computer-savvy analyst or modeler who works with data on a daily basis and who wants to use data mining to get the most out of data. The type of data the analyst works with is not important. It may be financial, marketing, business, stock trading, telecommunications, healthcare, medical, epidemiological, genomic, chemical, process, meteorological, marine, aviation, physical, credit, insurance, retail, or any type of data requiring analysis. What is important is that the analyst needs to get the most information out of the data.

At a second level, this book is also intended for anyone who needs to understand the issues in data preparation, even if they are not directly involved in preparing or working with data. Reading this book will give anyone who uses analyses provided from an analyst's work a much better understanding of the results and limitations that the analyst works with, and a far deeper insight into what the analyses mean, where they can be used, and what can be reasonably expected from any analysis.

Why I Wrote It

There are many good books available today that discuss how to collect data, particularly in government and business. Simply look for titles about databases and data warehousing. There are many equally good books about data mining that discuss tools and algorithms.

But few books, if any, address what to do with the dirty data after it is collected and before exploring it with a data mining tool. Yet this part of the process is critical. I wrote this book to address that gap in the process between identifying data and building models. It will take you from the point where data has been identified in some form or other, if not assembled. It will walk you through the process of identifying an appropriate problem, relating the data back to the world from which it was collected, assembling the data into mineable form, discovering problems with the data, fixing the problems, and discovering what is in the data (that is, whether continuing with mining will deliver what you need). It walks you through the whole process, starting with data discovery, and deposits you on the very doorstep of building a data-mined model.

This is not an easy journey, but it is one that I have trodden many times in many projects. There is a beaten path, and my express purpose in writing this book is to show exactly where the path leads, why it goes where it does, and to provide tools and a map so that you can tread it again on your own when you need to.

Special Features

A CD-ROM accompanies the book. Preparing data requires manipulating it and looking at it in various ways. All of the actual data manipulation techniques that are conceptually described in the book, mainly in Chapters 5 through 8 and 10, are illustrated by C programs. For ease of understanding, each technique is illustrated, so far as possible, in a separate, well-commented C source file.

If compiled as an integrated whole, these provide an automated data preparation tool. The CD-ROM also includes demonstration versions of other tools that are mentioned in the book and useful for preparing data, including WizWhy and WizRule from WizSoft, KnowledgeSEEKER from Angoss, and Statistica from StatSoft. Throughout the book, several data sets illustrate the topics covered. They are included on the CD-ROM for reader investigation.

Acknowledgments

I am indebted beyond measure to my dearly beloved wife, Pat Thompson, for her devoted help, support, and encouragement while this book was in progress. Her reading and rereading of the manuscript helped me to clarify many difficult points. There are many points that would without doubt be far less clear but for her help.

