An introduction to data cleaning with R

Discussion PaperEdwin de JongeMark van der LooAn introduction to data cleaning with RThe views expressed in this paper are those of the author(s) and do not necesarily reflect the policies of Statistics Netherlands2013 | 13 PublisherStatistics NetherlandsHenri Faasdreef 312, 2492 JP The : Statistics netherlands , GrafimediaDesign: EdenspiekermannInformationTelephone +31 88 570 70 70, fax +31 70 337 59 94 Via contact form: to +31 45 570 62 68 ISSN 1572-0314 Statistics netherlands , The hague /Heerlen 2013. Reproduction is permitted, provided Statistics netherlands is quoted as the 201313- X-10-13An introduction to data cleaningwith REdwin de Jonge and Mark van der cleaning , or data preparation is an essential part of statistical analysis. In fact,in practice it is often more time-consuming than the statistical analysis itself. These lecturenotes describe a range of techniques, implemented in theRstatistical environment, that allowthe reader to build data cleaning scripts for data suffering from a wide range of errors andinconsistencies, in textual format.

These notes cover technical as well as subject-matter relatedaspects of data cleaning . Technical aspects include data reading, type conversion and stringmatching and manipulation. Subject-matter related aspects include topics like data checking,error localization and an introduction to imputation methods in R. References to relevantliterature andRpackages are provided lecture notes are based on a tutorial given by the authors at theuseR!2013conference inAlbacete, :methodology, data editing, statistical softwareAn introduction to data cleaning with R 3 ContentsNotes to the reader..61 analysis in five steps.. general background inR.. types and indexing techniques.. values..10 Exercises..112 From raw data to technically correct correct data inR.. Reading text data into aR .. its cousins.. data withreadLines.. conversion.. toR's typing system.. factors.. dates.. normalization.. string matching.. Character encoding issues.

26 Exercises..293 From technically correct data to consistent and localization of errors.. values.. values.. inconsistencies.. localization.. transformation rules.. correction.. imputation.. numeric imputation models.. deck imputation..47An introduction to data cleaning with R .. value adjustment..49 Exercises..51An introduction to data cleaning with R 5 Notes to the readerThis tutorial is aimed at users who have someRprogramming experience. That is, the reader isexpected to be familiar with concepts such as variable assignment,vector,list, ,writing simple loops, and perhaps writing simple functions. More complicated constructs, whenused, will be explained in the text. We have adopted the following conventions in this code examples in this tutorial can be executed, unless otherwise indicated. Codeexamples are shown in gray boxes, like this:1 + 1## [1] 2where output is preceded by a double hash sign##. When code, function names or argumentsoccur in the main text, these are typeset infixed widthfont, just like the code in gray we refer toRdata types, likevectorornumericthese are denoted in fixed width font the main text, variables are written in slanted format while their values (whentextual) are written in fixed-width format.

For example: theMarital small data files are used as an example. These files are printed in thedocument in fixed-width format and can easily be copied from thepdffile. Here is an example:1%% data on the Dalton Brothers2 Gratt,1861,18923 Bob,189241871,Emmet,19375% Names, birth and death datesAlternatively, the files can be found we have tips, best practices, or other remarks that are relevant but not partof the main text. These are shown in separate paragraphs as become anRmaster, you must practice every is usual inR, we use the forward slash (/) as file name separator. Under windows,one may replace each forward slash with a double backslash\\. brevity, references are numbered, occurring as superscript in the main introduction to data cleaning with R 61 IntroductionAnalysis of data is a process of inspecting, cleaning , transforming, and modelingdata with the goal of highlighting useful information, suggesting conclusions, andsupporting decision , July 2013 Most statistical theory focuses on data modeling, prediction and statistical inference while it isusually assumed that data are in the correct state for data analysis.

In practice, a data analystspends much if not most of his time on preparing the data before doing any statistical is very rare that the raw data one works with are in the correct format, are without errors, arecomplete and have all the correct labels and codes that are needed for analysis. data cleaning isthe process of transforming raw data into consistent data that can be analyzed. It is aimed atimproving the content of statistical statements based on the data as well as their cleaning may profoundly influence the statistical statements based on the data . Typicalactions like imputation or outlier handling obviously influence the results of a statisticalanalyses. For this reason, data cleaning should be considered a statistical operation, to beperformed in a reproducible manner. TheRstatistical environment provides a goodenvironment for reproducible data cleaning since all cleaning actions can be scripted andtherefore analysis in ive stepsIn this tutorial a statistical analysis is viewed as the result of a number of data processing stepswhere each step increases the ``value'' of the data *.

Raw checking, and , analyze, derive, , cleaningFigure 1: Statistical analysis value chainFigure1shows an overview of a typical dataanalysis project. Each rectangle representsdata in a certain state while each arrowrepresents the activities needed to get fromone state to the other. The first state (Rawdata) is the data as it comes in. Raw datafiles may lack headers, contain wrong datatypes ( stored as strings), wrongcategory labels, unknown or unexpectedcharacter encoding and so on. In short,reading such files into anR is either difficult or impossiblewithout some sort of this preprocessing has taken place, data can be deemedTechnically is, in this state data can be read intoanR , with correct names, typesand labels, without further trouble. However,that does not mean that the values are error-free or complete. For example, an age variablemay be reported negative, an under-aged person may be registered to possess a driver's license,or data may simply be missing.

Such inconsistencies obviously depend on the subject matter*In fact, such a value chain is an integral part of Statistics netherlands business introduction to data cleaning with R 7that the data pertains to, and they should be ironed out before valid statistical inference fromsuch data can be datais the stage where data is ready for statistical inference. It is the data that moststatistical theories use as a starting point. Ideally, such theories can still be applied withouttaking previous data cleaning steps into account. In practice however, data cleaning methodslike imputation of missing values will influence statistical results and so must be accounted for inthe following analyses or interpretation resultshave been produced they can be stored for reuse and finally, results canbeFormattedto include in statistical reports or the input data for each stage (raw, technically correct,consistent, aggregated and formatted) separately for reuse.

Each step between thestages may be performed by a separateRscript for , a statistical analysis can be separated in five stages, from raw data to formattedoutput, where the quality of the data improves in every step towards the final result. Datacleaning encompasses two of the five stages in a statistical analysis, which again emphasizes itsimportance in statistical general background inRWe assume that the reader has some proficiency inR. However, as a service to the reader, belowwe summarize a few concepts which are fundamental to working withR, especially whenworking with ``dirty data ''. types and indexing techniquesIf you had to choose to be proficient in just oneR-skill, it should be indexing. By indexing wemean all the methods and tricks inRthat allow you to select and manipulate data usinglogical,integeror named indices. Since indexing skills are important for data cleaning , wequickly reviewvectors, indexing most basic variable inRis avector.

AnRvector is a sequence of values of the same basic operations inRact on vectors (think of the element-wise arithmetic, for example). Thebasic types inRare as data (approximations of the real numbers, )integerInteger data (whole numbers, )factorCategorical data (simple classifications, likegender)orderedOrdinal data (ordered classifications, likeeducational level)characterCharacter data (strings)rawBinary dataAll basic operations inRwork element-wise on vectors where the shortest argument is recycledif necessary. This goes for arithmetic operations (addition, subtraction,..), comparisonoperators (==,<=,..), logical operators (&,|,!,..) and basic math functions likesin,cos,expand so on. If you want to brush up your basic knowledge of vector and recycling properties, youcan execute the following code and think about why it works the way it introduction to data cleaning with R 8# vectors have variables of _one_ typec(1, 2, "three")# shorter arguments are recycled(1:3) * 2(1:4) * c(1, 2)# warning!

(why?)(1:4) * (1:3)Each element of a vector can be given aname. This can be done by passing named arguments tothec()function or later with thenamesfunction. Such names can be helpful giving meaning toyour variables. For example compare the vectorx <- c("red", "green", "blue") with the one = c(huey = "red", duey = "blue", louie = "green")Obviously the second version is much more suggestive of its meaning. The names of a vectorneed not be unique, but in most applications you'll want unique names (if any).Elements of a vector can be selected or replaced using the square bracket operator[ ]. Thesquare brackets accept either a vector of names, index numbers, or a logical. In the case of alogical, the index is recycled if it is shorter than the indexed vector. In the case of numericalindices, negative indices omit, in stead of select elements. Negative and positive indices are notallowed in the same index vector. You can repeat a name or an index number, which results inmultiple instances of the same value.

An introduction to data cleaning with R

Tags:

Information

Transcription of An introduction to data cleaning with R

Related search queries

An introduction to data cleaning with R

Tags:

Information

Documents from same domain

Related documents

Related search queries