Transcription of simpleR Using R for Introductory Statistics
1 simpleR - Using R for Introductory Statistics
John Verzani

Preface

These notes are an introduction to [...] to accompany an introductory statistics book such as Kitchens "Exploring Statistics". The goals are not to show all the features of R, or to replace a standard textbook, but rather to be used with a textbook to illustrate the features of R that can be learned in a one-semester, [...] take [...]

[...] pedagogical reasons the equals sign, =, is used as an assignment operator and not the traditional arrow combination <-. This was added to R in [...] only an older version is available the reader will have to make [...]

[...] data and functions in this text that need to be installed prior to [...] easy, [...], you need to download the "zip" file, and then install from the "packages" [...], one uses the command R [...]

Some of [...] in the help [...] available as an R package from: [...] necessary, the file can be sent in [...], the individual data sets can be found online in [...]

[...] these notes [...] and were last generated on August 22, [...]. [...], you should check for the most recent version available from the CSI Math department ([...]).

Copyright (c) [...]
2
  A note on notation
Data
  Starting R
  Entering data with c
  Data is a vector
  Problems
Univariate Data
  Categorical data
  Numerical data
  Problems
Bivariate Data
  Handling bivariate categorical data
  [...]
  [...]
  Problems
Multivariate Data
  Storing multivariate data in data frames
  Accessing data in data frames
  Manipulating data frames: stack and unstack
  Using R's model formula notation
  Ways to view multivariate data
  The lattice package
  Problems
Random Data
  Random number generators in R - the "r" [...]
  Problems
Simulations
  The central limit theorem
  Problems
Exploratory Data Analysis
  Our toolbox
  Examples
  Problems
Confidence Interval Estimation
  Population proportion theory
  Proportion test
  The z-test
  The t-test
  Confidence interval for the median
  Problems
Hypothesis Testing
  Testing a population parameter
  Testing a mean
3
  Tests for the median
  Problems
Two-sample tests
  Two-sample tests of proportion
  Two-sample t-tests
  Resistant two-sample tests
  Problems
Chi Square Tests
  The chi-squared distribution
  Chi-squared goodness of fit tests
  Chi-squared tests of independence
  Chi-squared tests for homogeneity
  Problems
Regression Analysis
  Simple linear regression model
  Testing the assumptions of the model
  Statistical inference
  Problems
Multiple Linear Regression
  The model
  Problems
Analysis of Variance
  one-way analysis of variance
  Problems
Appendix: Installing R
Appendix: External Packages
Appendix: A sample R session
  A sample session involving regression
  t-tests
  A simulation example
Appendix: What happens when R starts?
Appendix: Using Functions
  The basic template
  For loops
  Conditional expressions
Appendix: Entering Data into R
  Using c
4
  using scan
  Using scan with a file
  Editing your data
  Reading in tables of data
  Fixed-width fields
  Spreadsheet data
  XML, urls
  "Foreign" formats
Appendix: Teaching Tricks
Appendix: Sources of help, documentation

Section 1: Introduction

What is R

These notes describe how to [...] to allow this fine software to be used in "lower-level" courses where often MINITAB, SPSS, Excel, [...] is expected that the reader has had at least a [...] is the hope, that students shown how to use R at this early level will better understand the statistical issues and will ultimately benefit from the more sophisticated program despite its steeper "learning curve".

The benefits of R for an introductory student are
- R is open-source and runs on UNIX, Windows and Macintosh.
- R has an excellent built-in help system.
- R has excellent graphing capabilities.
- Students can easily migrate to the commercially supported S-Plus program if commercial software is desired.
- R's language has a powerful, easy to learn syntax with many built-in statistical functions.
5
- The language is easy to extend with user-written functions.
- R is a [...] programmers it will feel more familiar than others and for new computer users, the next leap to programming will not be so [...]

[...] other software solutions?
- It has a limited graphical interface (S-Plus has a good one). This means, it can be harder to learn at the outset.
- There is no commercial support. (Although one can argue the international mailing list is even better)
- The command language is a programming language so students must learn to [...]

[...] an open-source (GPL) statistical environment modeled after S and S-Plus ([...]). The S language was developed in the late 1980s at AT&T [...] started by Robert Gentleman and Ross Ihaka of the Statistics Department of the University of Auckland in [...] has quickly gained a [...] is currently maintained by the R core-development team, a hard-working, [...] the main site for information on R. At this site are directions for obtaining the software, accompanying packages and other sources of [...]

A note on notation

A few typographical conventions are used in [...] different fonts for urls, R commands, dataset names and different typesetting for longer sequences of R [...]

Section 2: Data

Statistics is the study of [...] to start R, the first thing we need to be able to do is learn how to enter data into R and how to [...]

R is most easily used in an interactive [...] a question and R gives [...] start up R's command line you can do the following: in Windows find the R icon and double click, on Unix, from the command line type R.
6 Other operating systems may have different [...] started, you should be greeted with a command similar to

R : Copyright 2001, The R [...]
[...] (2001-12-19)

R is free software and comes with ABSOLUTELY NO [...]
[...] are welcome to redistribute it [...]
[...] `license()' or `licence()' for [...]

[...] is a [...]
[...] `contributors()' for [...]

[...] `demo()' for some demos, `help()' for on-line help, or
`[...]()' for a HTML browser interface to [...]
[...] `q()' to quit R.

[Previously saved workspace restored]

>

The > is called the prompt. In what follows below it is not typed, but is used to indicate where you are to type if you follow [...] a command is too long to fit on a line, a + is [...]

[...] small data sets is [...], [...], suppose we have the following count of the number of typos per page of these notes:

2 3 0 3 1 0 0 1

To enter this into an R session we do so with

> typos = c(2,3,0,3,1,0,0,1)
> typos
[1] 2 3 0 3 1 0 0 1

Notice a few things
- We assigned the values to a variable called typos
- The assignment operator is a =. This is valid as [...] was (and still can be) a <[...] used, although, you should learn one and stick with it.
- The value of the typos doesn't automatically print [...] does when we type just the name though as the last input line indicates
- The value of typos is prefaced with a funny looking [1].
7 This indicates that the value is a vector. [...]

[...] many implementations of R you can save yourself a lot of typing if you learn that the arrow keys can be used to retrieve [...] In particular, each command is stored in a history and the up arrow will traverse backwards along this history and the down arrow [...] arrow keys will work as [...] mouse can make it quite easy to do simple editing of [...]

[...] function

R comes with many built in functions that one can apply to data such as typos. One of them is the mean function for finding the mean or average of [...] use it is easy

> mean(typos)
[1] 1.25

[...], we could call the median, or var to find the median or [...] the same - the function name followed by parentheses to contain the argument(s):

> median(typos)
[1] 1
> var(typos)
[1] 1.642857

[...] a vector

The data is stored in R as a vector. This means simply that it keeps track of the order that the data is [...] In particular there is a first element, a second element up to a last element. This is a good thing for several reasons:
- Our simple data vector typos has a natural order - page 1, page 2 [...] wouldn't want to mix these up.
- We would like to be able to make changes to the data item by item instead of having to enter in the entire data set again.
8
- Vectors are also a mathematical [...] concepts such as addition and multiplication that make it easy to [...]

Let's see how these apply to our typos [...], suppose these are the typos for the first draft of section 1 of [...] might want to keep track of our various drafts as the typos [...] done by the following:

> [...] c(2,3,0,3,1,0,0,1)
> [...] c(0,3,0,3,1,0,0,1)

That is, the two typos on the first page were [...] different [...] many other languages, the period is only used as [...]'t use an _ (underscore) to punctuate names as you might in other programming languages so it is [...]

[...], you might say, that is a lot of work to type in the data a [...]'t I just tell R to change the first page? The answer of course is "yes". Here is how

> [...] c(2,3,0,3,1,0,0,1)
> [...] # make a copy
> [...][1] = 0 # assign the first page 0 typos

Now notice a [...], the comment character, #, is used to make [...] character is ignored (by R, hopefully not the reader). More importantly, the assignment to the first entry in [...] done by referencing the first entry in [...] done with square brackets []. It is important to keep this in mind: parentheses () are for functions, and square brackets [] are for vectors (and later arrays and lists).
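The steps described so far can be collected into one short R sketch. The typo counts are the ones given in the notes; the names draft1, draft2 and fixed are hypothetical stand-ins chosen for this illustration and are not confirmed by the text.

```r
# Typo counts per page, as entered earlier in the notes
typos = c(2, 3, 0, 3, 1, 0, 0, 1)

# Built-in summary functions take the vector as their argument
mean(typos)    # 1.25
median(typos)  # 1
var(typos)     # sample variance, about 1.642857

# draft1 and draft2 are hypothetical names used only for this sketch
draft1 = typos
draft2 = draft1   # make a copy
draft2[1] = 0     # square brackets pick out the first entry

# Vector arithmetic works element by element: typos fixed on each page
fixed = draft1 - draft2
fixed             # 2 0 0 0 0 0 0 0
```

Changing draft2 leaves draft1 untouched, which is why the subtraction shows exactly the two typos removed from page 1.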
9 In particular, we have

> [...] # print out the value
[1] 0 3 0 3 1 0 0 1
> [...][2] # print 2nd pages' value
[1] 3
> [...][4] # 4th page
[1] 3
> [...][-4] # all but the 4th page
[1] 0 3 0 1 0 0 1
> [...][c(1,2,3)] # fancy, print 1st, 2nd and 3rd.
[1] 0 3 0

Notice negative indices give [...] very important. You can take more than one value at a time by using another vector of [...]

[...], we need to work these notes into shape, let's [...], we can notice that pages 2 and 4 are a [...] do this with R in a more systematic manner?

> max([...]) # what are worst pages?
[1] 3 # 3 typos per page
> [...] == 3 # Where are they?
[1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE

Notice, the usage of double equals signs (==). [...] see if they are equal to [...] yes (TRUE) [...] this as asking R a [...] the value equal to 3? R/[...] answers all at once with a long vector of TRUE's and FALSE's [...]

[...] the question is - how can we get the indices (pages) corresponding to the TRUE values?

Footnote 1: The underscore was originally used as assignment so a name such as The_Data would actually assign the value of Data to the variable The. The underscore is being phased out and the equals sign is [...]
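The indexing rules and the comparison above can be tried in a short sketch. The name draft2 is a hypothetical stand-in for the second-draft vector, with values taken from the output shown in the notes; the helper names worst and is_worst are likewise ours.

```r
# draft2 is a hypothetical name for the second-draft typo counts
draft2 = c(0, 3, 0, 3, 1, 0, 0, 1)

draft2[2]            # value for page 2: 3
draft2[-4]           # everything except page 4: 0 3 0 1 0 0 1
draft2[c(1, 2, 3)]   # pages 1 to 3: 0 3 0

worst = max(draft2)          # largest typo count on any page: 3
is_worst = draft2 == worst   # one TRUE or FALSE per page
is_worst    # FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
sum(is_worst)   # TRUE counts as 1, so this is the number of worst pages: 2
```

The comparison with == is applied to every element at once, so the result has the same length as the data vector.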
10 Let's rephrase, which indices have 3 typos? If you guessed that the command which will work, you are on your way to R mastery:

> which([...] == 3)
[1] 2 4

Now, what if you didn't think of the command which? You are not out of luck - but you will need to [...] to create a new vector 1 2 3 .. keeping track of the page numbers, and then slicing off [...]:

> n = length([...]) # how many pages
> pages = 1:n # how we get the page numbers
> pages # pages is simply 1 to number of pages
[1] 1 2 3 4 5 6 7 8
> pages[ [...] == 3 ] # [...]
[1] 2 4

To create the vector 1 2 3 .. we could have typed this in, but this is a useful thing to [...] a:b is simply a, a+1, a+2, .., b if a, b are integers and intuitively defined if [...] more general R function is seq() which is a [...] see it [...] produce the above try seq(a,b,1).

The use of extracting elements of a vector using another vector of the same size which is comprised of TRUEs and FALSEs is referred to as extraction by a logical vector. Notice this is different from extracting by page numbers by slicing as we [...] to use slicing and logical vectors gives you the ability to easily access your data as [...], we could have done all the above at once with this command (but why?)
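As a recap of the two extraction styles just described, here is a minimal sketch. The name draft2 is again a hypothetical stand-in for the second-draft typo counts, and by_which and by_logical are names of our own; the sketch is not the single command the notes refer to.

```r
# draft2 is a hypothetical name for the second-draft typo counts
draft2 = c(0, 3, 0, 3, 1, 0, 0, 1)

# Route 1: which() returns the positions of the TRUE values
by_which = which(draft2 == 3)
by_which              # 2 4

# Route 2: build the page numbers, then extract with a logical vector
n = length(draft2)    # how many pages: 8
pages = 1:n           # 1 2 3 4 5 6 7 8
by_logical = pages[draft2 == 3]
by_logical            # 2 4

# seq() generalizes a:b; with a step of 1 it gives the same values as 1:n
seq(1, n, 1)
seq(1, n, by = 2)     # 1 3 5 7
```

Both routes give pages 2 and 4. The logical vector must line up with pages element by element, which is why it is built from a vector of the same length.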