Transcription of Data Visualization with R - GitHub Pages
Data Visualization with R
Rob Kabacoff
2018-09-03

Contents

Welcome
Preface
    How to use this book
    Prerequisites
    Setup
1 Data Preparation
    Importing data
    Cleaning data
2 Introduction to ggplot2
    A worked example
    Placing the data and mapping options
    Graphs as objects
3 Univariate Graphs
    Categorical
    Quantitative
4 Bivariate Graphs
    Categorical vs. Categorical
    Quantitative vs. Quantitative
    Categorical vs. Quantitative
5 Multivariate Graphs
    Grouping
6 Maps
    Dot density maps
    Choropleth maps
7 Time-dependent Graphs
    Time series
    Dumbbell charts
    Slope graphs
    Area charts
8 Statistical Models
    Correlation plots
    Linear regression
    Logistic regression
    Survival plots
    Mosaic plots
9 Other Graphs
    3-D Scatterplot
    Biplots
    Bubble charts
    Flow diagrams
    Heatmaps
    Radar charts
    Scatterplot matrix
    Waterfall charts
    Word clouds
10 Customizing Graphs
    Axes
    Colors
    Points & Lines
    Legends
    Labels
    Annotations
    Themes
11 Saving Graphs
    Via menus
    Via code
    File formats
    External editing
12 Interactive Graphs
    leaflet
    plotly
    rbokeh
    rCharts
    highcharter
13 Advice / Best Practices
    Labeling
    Signal to noise ratio
    Color choice
    y-Axis scaling
    Attribution
    Going further
    Final Note
A Datasets
    Academic salaries
    Starwars
    Mammal sleep
    Marriage records
    Fuel economy data
    Gapminder data
    Current Population Survey (1985)
    Houston crime data
    US economic timeseries
    Saratoga housing data
    US population by age and year
    NCCTG lung cancer data
    Titanic data
    JFK Cuban Missile speech
    UK Energy forecast data
    US Mexican American Population
B About the Author
C About the QAC

Welcome

R is an amazing platform for data analysis, capable of creating almost any type of graph.
This book helps you create the most popular visualizations - from quick and dirty plots to publication-ready graphs. The text relies heavily on the ggplot2 package for graphics, but other approaches are covered as well.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives International License.

My goal is to make this book as helpful and user-friendly as possible. Any feedback is both welcome and appreciated.

Preface

How to use this book

You don't need to read this book from start to finish in order to start building effective graphs. Feel free to jump to the section that you need and then explore others that you find useful.

Graphs are organized by

- the number of variables to be plotted
- the type of variables to be plotted
- the purpose of the visualization

Chapter      Description
Ch 1         provides a quick overview of how to get your data into R and how to prepare it for analysis.
Ch 2         provides an overview of the ggplot2 package.
Ch 3         describes graphs for visualizing the distribution of a single categorical (e.g. race) or quantitative (e.g. income) variable.
Ch 4         describes graphs that display the relationship between two variables.
Ch 5         describes graphs that display the relationships among 3 or more variables. It is helpful to read chapters 3 and 4 before this chapter.
Ch 6         provides a brief introduction to displaying data geographically.
Ch 7         describes graphs that display change over time.
Ch 8         describes graphs that can help you interpret the results of statistical models.
Ch 9         covers graphs that do not fit neatly elsewhere (every book needs a miscellaneous chapter).
Ch 10        describes how to customize the look and feel of your graphs. If you are going to share your graphs with others, be sure to skim this chapter.
Ch 11        covers how to save your graphs. Different formats are optimized for different purposes.
Ch 12        provides an introduction to interactive graphics.
Ch 13        gives advice on creating effective graphs and where to go to learn more. It's worth a look.
Appendices   describe each of the datasets used in this book, and provide a short blurb about the author and the Wesleyan Quantitative Analysis Center.

There is no one right graph for displaying data. Check out the examples, and see which type best fits your needs.

Prerequisites

It's assumed that you have some experience with the R language and that you have already installed R and RStudio.
If not, here are some resources for getting started:

- A (very) short introduction to R
- DataCamp - Introduction to R with Jonathon Cornelissen
- Quick-R
- Getting up to speed with R

Setup

In order to create the graphs in this guide, you'll need to install some optional R packages. To install all of the necessary packages, run the following code in the RStudio console window.

    pkgs <- c("ggplot2", "dplyr", "tidyr",
              "mosaicData", "carData",
              "VIM", "scales", "treemapify",
              "gapminder", "ggmap", "choroplethr",
              "choroplethrMaps", "CGPfunctions",
              "ggcorrplot", "visreg",
              "gcookbook", "forcats",
              "survival", "survminer",
              "ggalluvial", "ggridges",
              "GGally", "superheat",
              "waterfalls", "factoextra",
              "networkD3", "ggthemes",
              "hrbrthemes", "ggpol",
              "ggbeeswarm")
    install.packages(pkgs)

Alternatively, you can install a given package the first time it is needed.

For example, if you execute

    library(gapminder)

and get the message

    Error in library(gapminder) : there is no package called 'gapminder'

you know that the package has never been installed.
Simply execute install.packages("gapminder") once and library(gapminder) will work from that point on.

Chapter 1 Data Preparation

Before you can visualize your data, you have to get it into R. This involves importing the data from an external source and massaging it into a useful format.

Importing data

R can import data from almost any source, including text files, Excel spreadsheets, statistical packages, and database management systems. We'll illustrate these techniques using the Salaries dataset, containing the 9 month academic salaries of college professors at a single institution in 2008-2009.

Text files

The readr package provides functions for importing delimited text files into R data frames.

    library(readr)

    # import data from a comma delimited file
    Salaries <- read_csv(" ")

    # import data from a tab delimited file
    Salaries <- read_tsv(" ")

These functions assume that the first line of data contains the variable names, that values are separated by commas or tabs respectively, and that missing data are represented by blanks.
For example, the first few lines of the comma delimited file look like this.

    "rank","discipline"," "," ","sex","salary"
    "Prof","B",19,18,"Male",139750
    "Prof","B",20,16,"Male",173200
    "AsstProf","B",4,3,"Male",79750
    "Prof","B",45,39,"Male",115000
    "Prof","B",40,41,"Male",141500
    "AssocProf","B",6,6,"Male",97000

Options allow you to alter these assumptions. See the documentation for more details.

Excel spreadsheets

The readxl package can import data from Excel workbooks. Both xls and xlsx formats are supported.

    library(readxl)

    # import data from an Excel workbook
    Salaries <- read_excel(" ", sheet = 1)

Since workbooks can have more than one worksheet, you can specify the one you want with the sheet option. The default is sheet = 1.

Statistical packages

The haven package provides functions for importing data from a variety of statistical packages.

    library(haven)

    # import data from Stata
    Salaries <- read_dta(" ")

    # import data from SPSS
    Salaries <- read_sav(" ")

    # import data from SAS
    Salaries <- read_sas(" ")

Databases

Importing data from a database requires additional steps and is beyond the scope of this book.
Depending on the database containing the data, the following packages can help: RODBC, RMySQL, ROracle, RPostgreSQL, RSQLite, and RMongo. In the newest versions of RStudio, you can use the Connections pane to quickly access the data stored in database management systems.

Cleaning data

The process of cleaning your data can be the most time-consuming part of any data analysis. The most important steps are considered below. While there are many approaches, those using the dplyr and tidyr packages are some of the quickest and easiest to learn.

Package   Function    Use
dplyr     select      select variables/columns
dplyr     filter      select observations/rows
dplyr     mutate      transform or recode variables
dplyr     summarize   summarize data
dplyr     group_by    identify subgroups for further processing
tidyr     gather      convert wide format dataset to long format
tidyr     spread      convert long format dataset to wide format

Examples in this section will use the starwars dataset from the dplyr package.
The dataset provides descriptions of 87 characters from the Star Wars universe on 13 variables. (I actually prefer Star Trek, but we work with what we have.)

Selecting variables

The select function allows you to limit your dataset to specified variables (columns).

    library(dplyr)

    # keep the variables name, height, and gender
    newdata <- select(starwars, name, height, gender)

    # keep the variables name and all variables
    # between mass and species inclusive
    newdata <- select(starwars, name, mass:species)

    # keep all variables except birth_year and gender
    newdata <- select(starwars, -birth_year, -gender)

Selecting observations

The filter function allows you to limit your dataset to observations (rows) meeting specific criteria. Multiple criteria can be combined with the & (AND) and | (OR) symbols.

    library(dplyr)

    # select females
    newdata <- filter(starwars,
                      gender == "female")

    # select females that are from Alderaan
    newdata <- filter(starwars,
                      gender == "female" &
                      homeworld == "Alderaan")

    # select individuals that are from
    # Alderaan, Coruscant, or Endor
    newdata <- filter(starwars,
                      homeworld == "Alderaan" |
                      homeworld == "Coruscant" |
                      homeworld == "Endor")

    # this can be written more succinctly as
    newdata <- filter(starwars,
                      homeworld %in% c("Alderaan", "Coruscant", "Endor"))

Creating/Recoding variables

The mutate function allows you to create new variables or transform existing ones.
    library(dplyr)

    # convert height in centimeters to inches,
    # and mass in kilograms to pounds
    # (1 cm is about 0.394 in; 1 kg is about 2.205 lb)
    newdata <- mutate(starwars,
                      height = height * 0.394,
                      mass   = mass * 2.205)

The ifelse function (part of base R) can be used for recoding data. The format is ifelse(test, return if TRUE, return if FALSE).

    library(dplyr)

    # if height is greater than 180
    # then heightcat = "tall",
    # otherwise heightcat = "short"
    newdata <- mutate(starwars,
                      heightcat = ifelse(height > 180,
                                         "tall",
                                         "short"))

    # convert any eye color that is not
    # black, blue or brown, to other
    newdata <- mutate(starwars,
                      eye_color = ifelse(eye_color %in% c("black", "blue", "brown"),
                                         eye_color,
                                         "other"))

    # set heights greater than 200 or
    # less than 75 to missing
    newdata <- mutate(starwars,
                      height = ifelse(height < 75 | height > 200,
                                      NA,
                                      height))

Summarizing data

The summarize function can be used to reduce multiple values down to a single value (such as a mean).
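As a minimal sketch of the summarize function described above, the following computes the mean height and mass of the starwars characters, first for the whole dataset and then separately for each gender by combining summarize with group_by. The output column names mean_ht and mean_wt are illustrative choices, and na.rm = TRUE assumes that missing values should simply be skipped.

```r
library(dplyr)

# overall mean height and mass;
# na.rm = TRUE drops missing values before averaging
overall <- summarize(starwars,
                     mean_ht = mean(height, na.rm = TRUE),
                     mean_wt = mean(mass, na.rm = TRUE))

# the same means, calculated within each gender subgroup
by_gender <- starwars %>%
  group_by(gender) %>%
  summarize(mean_ht = mean(height, na.rm = TRUE),
            mean_wt = mean(mass, na.rm = TRUE))
```

Here overall is a one-row data frame, while by_gender has one row per distinct value of gender (including a row for missing values, if any are present).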