Transcription of R and Data Mining: Examples and Case Studies
R and Data Mining: Examples and Case Studies

Yanchang Zhao

20, 2015

© 2012-2015 Yanchang Zhao. Published by Elsevier in December 2012. All rights reserved.

Messages from the Author

Case studies: The case studies are not included in this online version. They are reserved exclusively for a book version published by Elsevier in December 2012.

Latest version: The latest online version is available at the links below. See the websites also for an R Reference Card for Data Mining. (for readers having no access to above website)

R code, data and FAQs: R code, data and FAQs are provided at the links below.

To add: topic modelling and stream graph; spatial data analysis; performance evaluation of classification/prediction models (with ROC and AUC); parallel computing and big data. Please let me know if some topics are interesting to you but not covered yet by this document.

Questions and feedback: If you have any questions or comments, or come across any problems with this document or its book version, please feel free to post them to the RDataMining group below or email them to me.
Forum: Please join our discussions on R and data mining at the RDataMining group (16,000+ members, as of October 2015) on LinkedIn.

Twitter: Follow @RDataMining on Twitter (2,200+ followers, as of October 2015).

A sister book: See a new edited book titled Data Mining Applications with R at the links below, which features 15 real-world applications on data mining with R.

Contents

List of Figures
List of Abbreviations
1 Introduction
  Data Mining
  R
  Basics
  Datasets
  Iris Dataset
  Bodyfat Dataset
2 Data Import and Export
  Save and Load R Data
  Import from and Export
  Import Data from SAS
  Import/Export via ODBC
  from Databases
  to and Input from EXCEL Files
  Read and Write EXCEL files with package xlsx
  Further Readings
3 Data Exploration and Visualization
  Have a Look at Data
  Explore Individual Variables
  Explore Multiple Variables
  More Explorations
  Save Charts into Files
  Further Readings
4 Decision Trees and Random Forest
  Decision Trees with Package party
  Decision Trees with Package rpart
  Random Forest
5 Regression
  Linear Regression
  Logistic Regression
  Generalized Linear Regression
  Non-linear Regression
6 Clustering
  The k-Means Clustering
  The k-Medoids Clustering
  Hierarchical Clustering
  Density-based Clustering
7 Outlier Detection
  Univariate Outlier Detection
  Outlier Detection with LOF
  Outlier Detection by Clustering
  Outlier Detection from Time Series
  Discussions
8 Time Series Analysis and Mining
  Time Series Data in R
  Time Series Decomposition
  Time Series Forecasting
  Time Series Clustering
  Dynamic Time Warping
  Synthetic Control Chart Time Series Data
  Clustering with Euclidean Distance
  Clustering with DTW Distance
  Time Series Classification
  Classification with Original Data
  Classification with Extracted Features
  Classification
  Discussions
  Further Readings
9 Association Rules
  Basics of Association Rules
  The Titanic Dataset
  Association Rule Mining
  Removing Redundancy
  Interpreting Rules
  Visualizing Association Rules
  Further Readings
10 Text Mining
  Retrieving Text from Twitter
  Transforming Text
  Stemming Words
  Building a Term-Document Matrix
  Frequent Terms and Associations
  Word Cloud
  Clustering Words
  Clustering Tweets
  Clustering Tweets with the k-means Algorithm
  Clustering Tweets with the k-medoids Algorithm
  Packages, Further Readings and Discussions
11 Social Network Analysis
  Network of Terms
  Network of Tweets
  Two-Mode Network
  Discussions and Further Readings
12 Case Study I: Analysis and Forecasting of House Price Indices
13 Case Study II: Customer Response Prediction and Profit Optimization
14 Case Study III: Predictive Modeling of Big Data with Limited Memory
15 Online Resources
  R Reference Cards
  R
  Data Mining
  Data Mining with R
  Classification/Prediction with R
  Time Series Analysis with R
  Association Rule Mining with R
  Spatial Data Analysis with R
  Text Mining with R
  Network Analysis with R
  Cleansing and Transformation with R
  Data and Parallel Computing with R
Bibliography
General Index
Package Index
Function Index
Appendix: Book Promotion - Data Mining Applications with R

List of Figures

  RStudio
  Histogram
  Density
  Pie Chart
  Bar Chart
  Boxplot
  Scatter Plot
  Scatter Plot with Jitter
  Smooth Scatter Plot
  A Matrix of Scatter Plots
  3D Scatter Plot
  Heat Map
  Level Plot
  Contour
  3D Surface
  Parallel Coordinates
  Parallel Coordinates with Package lattice
  Scatter Plot with Package ggplot2
  Decision Tree
  Decision Tree (Simple Style)
  Decision Tree with Package rpart
  Selected Decision Tree
  Prediction Result
  Error Rate of Random Forest
  Variable Importance
  Margin of Predictions
  Australian CPIs in Year 2008 to 2010
  Prediction with Linear Regression Model - 1
  A 3D Plot of the Fitted Model
  Prediction of CPIs in 2011 with Linear Regression Model
  Prediction with Generalized Linear Regression Model
  Results of k-Means Clustering
  Clustering with the k-medoids Algorithm - I
  Clustering with the k-medoids Algorithm - II
  Cluster Dendrogram
  Density-based Clustering - I
  Density-based Clustering - II
  Density-based Clustering - III
  Prediction with Clustering Model
  Univariate Outlier Detection with Boxplot
  Outlier Detection - I
  Outlier Detection - II
  Density of Outlier Factors
  Outliers in a Biplot of First Two Principal Components
  Outliers in a Matrix of Scatter Plots
  Outliers with k-Means Clustering
  Outliers in Time Series Data
  A Time Series of AirPassengers
  Seasonal Component
  Time Series Decomposition
  Time Series Forecast
  Alignment with Dynamic Time Warping
  Six Classes in Synthetic Control Chart Time Series
  Hierarchical Clustering with Euclidean Distance
  Hierarchical Clustering with DTW Distance
  Decision Tree
  Decision Tree with DWT
  A Scatter Plot of Association Rules
  A Balloon Plot of Association Rules
  A Graph of Association Rules
  A Graph of Items
  A Parallel Coordinates Plot of Association Rules
  Frequent Terms
  Word Cloud
  Clustering of Words
  Clusters of Tweets
  A Network of Terms - I
  A Network of Terms - II
  Cohesive Blocks
  Cliques
  Cliques
  Distribution of Degree
  A Network of Tweets - I
  A Network of Tweets - II
  A Network of Tweets - III
  Two-Mode Network of Terms and Tweets - I
  Two-Mode Network of Terms and Tweets - II

List of Abbreviations

  ARIMA     Autoregressive integrated moving average
  ARMA      Autoregressive moving average
  AVF       Attribute value frequency
  CLARA     Clustering for large applications
  CRISP-DM  Cross industry standard process for data mining
  DBSCAN    Density-based spatial clustering of applications with noise
  DTW       Dynamic time warping
  DWT       Discrete wavelet transform
  GLM       Generalized linear model
  IQR       Interquartile range, i.e., the range between the first and third quartiles
  LOF       Local outlier factor
  PAM       Partitioning around medoids
  PCA       Principal component analysis
  STL       Seasonal-trend decomposition based on Loess
  TF-IDF    Term frequency-inverse document frequency

Chapter 1 Introduction

This book introduces the use of R for data mining.
It presents many examples of various data mining functionalities in R and three case studies of real-world applications. The intended audience of this book is postgraduate students, researchers, data miners and data scientists who are interested in using R to do their data mining research and projects. We assume that readers already have a basic idea of data mining and also have some basic experience with R. We hope that this book will encourage more and more people to use R to do data mining work in their research.

This chapter introduces basic concepts and techniques for data mining, including a data mining process and popular data mining techniques. It also presents R and its packages, functions and task views for data mining. Finally, some datasets used in this book are described.

1.1 Data Mining

Data mining is the process of discovering interesting knowledge from large amounts of data [Han and Kamber, 2000]. It is an interdisciplinary field with contributions from many areas, such as statistics, machine learning, information retrieval, pattern recognition and bioinformatics.
Data mining is widely used in many domains, such as retail, finance and telecommunication.

The main techniques for data mining include classification and prediction, clustering, outlier detection, association rules, sequence analysis, time series analysis and text mining, as well as some newer techniques such as social network analysis and sentiment analysis. Detailed introductions to data mining techniques can be found in textbooks on data mining [Han and Kamber, 2000, Hand et al., 2001, Witten and Frank, 2005]. In real-world applications, a data mining process can be broken into six major phases: business understanding, data understanding, data preparation, modeling, evaluation and deployment, as defined by CRISP-DM (Cross Industry Standard Process for Data Mining). This book focuses on the modeling phase, with data exploration and model evaluation involved in some chapters. Readers who want more information on data mining are referred to the online resources in Chapter 15.

1.2 R

R [R Core Team, 2015b] is a free software environment for statistical computing and graphics. It provides a wide variety of statistical and graphical techniques.