Transcription of Preparing Data for Analysis Using Microsoft Excel - …
1 TOOLS AND ISSUESP reparing data for Analysis Using MicrosoftExcelAlan C. Elliott, Linda S. Hynan, Joan S. Reisch, Janet P. SmithABSTRACTA critical component essential to good research is the accurate andefficient collection and preparation of data for Analysis . Mostmedical researchers have little or no training in data management,often causing not only excessive time spent cleaning data but alsoa risk that the data set contains collection or recording errors. Theimplementation of simple guidelines based on techniques used byprofessional data management teams will save researchers timeand money and result in a data set better suited to answer researchquestions. Because MicrosoftExcelis often used by researchers tocollect data , specific techniques that can be implemented inExcelare Words: data collection, database management system, re-search design, pilot projects, informaticsDuring a presentation at an IBM technicians meeting in1966, a programmer named Wilf Hey coined the phrase Garbage In, Garbage Out.
2 Now abbreviated GIGO, thisterm has become a catchphrase for the too common sit-uation in which inaccurate data entered into a computerare used to produce misleading or erroneous teams, ranging from a few individuals tocomplex organizations at universities, governments, andcorporations, are all involved in the planning, execution,and Analysis of research. A critical component essential toeach of these research projects is collecting data andentering results into a computer in preparation for statis-tical Analysis . Often because of a lack of funding, expe-rience, or both, these data are entered into the computerusing an ad hoc process that results in poorly codeddata, incorrectly formatted values, incomplete informa-tion, and typographical errors.
3 It is not uncommon forresearchers to discover that their data sets must be exten-sively cleaned before they can be properly for research teams who do not have the servicesof a professional data management team, this articleprovides guidance on how to develop a strategy to createa well-designed and verified data set. Following theseguidelines will save researchers time and money duringall phases of the research project and will result in datathat can be used in a statistical software program withminimal of the strategies in this article can (and should) beemployed whether the data are entered into the computerusingSAS(SAS Institute, Cary, NC),SPSS(SPSS Inc, Chi-cago, IL), Access( Microsoft Corporation, Redmond, WA), orany number of programs.
4 To illustrate the guidelines inthis article and appendix, we use the MicrosoftExcel( Microsoft Corporation) spreadsheet program. AlthoughExcelwas not designed to be a research data entry tool, itis commonly used because almost every researcher al-ready knows the basics of how to use it. This article doesnot address the use ofExcelfor data Analysis because itslimited data Analysis capabilities and sometimes confus-ing output make it suitable only for preliminary Jonathan Cryer put it, Friends don t let friends useExcel for statistics. 2 Other articles have also discussedthe problems associated with performing statistical analy-sis ,4 Furthermore,Excelis limited to spread-sheets containing less than 256 variables (columns) and65,536 records (rows).
5 In well-funded studies, a professional data manage-ment team works with investigators from the planningstage through forms development, database design, anddata collection and entry and in the preparation of data foranalysis. The characteristics of a professionally designeddata management process include a thorough descriptionof the data variables, validation of data values as they areentered into the computer, and the use of a double-entrydata process into a relational database. Such processes usespecialized programs for data entry rather ,6 Programs designed for professional data entry includetheSPSS data Entry Builder,Key Entry IIIfrom SouthernComputer Systems (Birmingham, AL)SAS/AF, of these programs require the expertise of a pro-grammer to create data entry screens, validation code, anddata verification procedures.
6 For smaller projects in whichthe use of a professional data management team is notFrom the Department of Clinical Sciences ( , , , ), Division of Biostatistics, UT Southwestern MedicalCenter, Dallas, Dallas, correspondence to: Alan C. Elliott, Department ofClinical Sciences, Division of Biostatistics, UT SouthwesternMedical Center, Dallas, Dallas, TX 75390; tel: 214-648-2712;fax: 214-648-7673; e-mail: OF INVESTIGATIVE MEDICINE volume 54 number 6 September 2006334feasible and data entry is performed usingExcel, the savvyinvestigator can still implement the good practice tech-niques described AND DESCRIPTION OF data ELEMENTSA ccurate data collection begins with planning.
7 Beforecollecting any data , an investigator should define researchquestions and determine what measurements are neededto answer them. Typically, a research data set includes atleast one outcome variable (dependent variable) and oneor more predictor (independent) variables. Other demo-graphic, covariate, or verification measures may also berecorded. It is essential that a unique key identifier beincluded for every observed subject (or entity) in a dataset. Table 1 describes types of variables collected for dataanalysis. For each project, the researcher should use thisinformation to verify that all of the variables needed toperform an Analysis are included.
8 It is disheartening torealize too late that a variable needed to complete ananalysis was not important part of a well-designed study is the doc-umentation of each variable in a table called a data dictio-nary. An example data dictionary is shown in Table 2. Thistable can be created in a word processor or spreadsheetprogram and, once created, defines the characteristics ofvariables in both the data entry spreadsheet and statis-tical brief explanation of each item of the dictionaryfollows: Variable name. Select simple variable names usingnaming conventions compatible with the program thatwill be used for Analysis . General naming conventionsthat work for most programs (such asSASandSPSS)include the following: Make variable names short and explanatory.
9 Typicalvariable names are ID, GENDER, B_DATE, COST,GROUP, IQ_SCORE, and 1 Variable Types Collected for ResearchVariable TypeDescriptionID number or codeAn identification variable that uniquely identifies a subject or entity. This could be a patient identifier, a numberassigned by the computer, or some other assigned variablesThese variables include measures such as age, gender, and ethnicity that describe the subject variable(s)(Dependent variable) This variable is the primary outcome measure. It can be discrete (eg, dead/alive, cured/notcured) or continuous (eg, time to recurrence, blood pressure, cost incurred).Predictor variables(Independent variables) These variables may include treatment (grouping) variables and other measured or observedinformation, such as weight, height, or smoking measuresCovariate measures are variables that are related to the outcome variable and may be used to adjust the mean of theresponse variable and account for variability in a statistical measuresThese variables may be used to verify the reliability of the data , such as a measure of how compliant a subjectis in taking medicine or multiple laboratory values to assess the reliability of laboratory 2 Example data DictionaryColumnVariable NameLabel (Units)
10 FormatCodes and RangesMissing ValuesASUBJECTS ubject ID numberText (4)1000 9999 Not allowedBVDATEDate client visited clinicDate (MM/DD/YYYY)None. (dot) or 11/11/1111 CAGEAge at visit dateNumeric ( )Range 0 100 9 DTEMP_FTemperature (jF)Numeric ( )None 9 EGENDERG enderText (1)F = femaleM = maleXFARRIVEMode of arrivalString (4)CarBusWalkMISSGANTIBIOWas antibiotic prescribed?Numeric ( )1 = yes0=no9 MISS = data for Analysis UsingExcel/ELLIOTT ET AL335 Because some programs (such as older versions ofSPSS) limit names to eight characters, it is best tocomply with this restriction. Begin variable names with a letter (A Z). Someprograms also allow variable names to begin withan underscore (_).