Transcription of Practical Regression and Anova using R
Practical Regression and Anova using R
Julian J. Faraway
July 2002

Copyright © 1999, 2000, 2002 Julian J. Faraway. Permission to reproduce individual copies of this book for personal use is [...]

[...] not introductory. [...] the essentials of statistical inference like estimation, [...] is to learn what methods are available and, more importantly, when they [...] relatively less emphasis on mathematical theory, partly because some prior knowledge is [...] important because it [...] take a wider view of statistical theory. [...] statistical concepts are just as important in Statistics because these enable us to actually do it [...] principles are harder to learn because they are difficult to state precisely but they [...] are designed for different audiences and have [...] have chosen to use R ([...] (1996)).
Why do I use R? [...] also a programming language, so I am not limited by the procedures that are preprogrammed by a [...] is relatively easy to program new [...] Data analysis is [...] a time, allowing us to make [...], Macintosh, [...] SAS is the most common statistics package in general but R or S is [...] also popular for quantitative [...] that it is not an introduction to R. There is a [...] I have [...] The reader may choose to start working through this text before learning R and pick it [...]
[...] you start

Statistics starts with a problem, continues with the collection of data, [...] is a common mistake of inexperienced Statisticians to plunge into a [...]! [...] problem is [...] formulate the problem correctly, [...] learn something new rather than a [...], often you will be working with a [...] fishing expeditions - if you look hard enough, you'll almost always find something but that something may just be a [...] translated into the language of Statistics, the solution is [...] It's important to understand how the data was collected. Are the data observational or experimental? Are the data a sample of convenience or were they obtained via a designed sample survey?
How the data were collected has a crucial impact on what conclusions can be made. Is there non-response? The data you don't see may be just as important as the data you do see. Are there missing values? This is a common problem that is troublesome and time consuming to deal with. How are the data coded? In particular, how are the qualitative variables represented? What are the units of measurement? Sometimes data is collected or represented with far more digits than are necessary. Consider rounding if this will help with the interpretation or storage costs.

[...] all too common [...] almost a certainty in any real dataset of at [...] a [...] looks simple but it is vital.

Numerical summaries - means, sds, five-number summaries, correlations.
Graphical summaries
One variable - Boxplots, histograms etc.
Two variables - scatterplots.
Many variables - interactive [...]

[...], all the data will be ready to analyze but you should realize that in practice this is [...]

[...] and Kidney Diseases conducted a [...]: Number of times pregnant, Plasma glucose concentration at 2 hours in an oral glucose tolerance test, Diastolic blood pressure (mm Hg), Triceps skin fold thickness (mm), 2-Hour serum insulin (mu U/ml), Body mass index (weight in kg/(height in m)²), Diabetes pedigree function, Age (years) and a test whether the patient shows signs of diabetes (coded 0 if negative, 1 if positive). [...], before doing anything else, one should find out what the purpose of the study was and more about how [...] skip ahead to a look at [...]

> library(faraway)
> data(pima)
> pima
  pregnant glucose diastolic triceps insulin bmi diabetes age test
1        6     148        72      35       0 [...]

[...](faraway) makes the data used in this book available while data(pima) [...] too long to show it [...] dataset of this size, one can just about visually skim over the data for anything out of place but it is [...] start with some numerical summaries:

> summary(pima)
[summary(pima) output; surviving entries, glucose column: Min. 0, 1st Qu. 99, Median 117, Mean 121, 3rd Qu. 140, Max. 199]

[...]() command is a [...], we are looking for anything unusual or unexpected perhaps indicating a data entry error. For this purpose, a close look at the minimum and maximum values of each variable is [...], we see a [...], we then see that the next 5 variables have [...] not good for the health [...] look at the sorted values:

> sort(pima$diastolic)
 [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[19]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 24
[37] 30 30 38 40 44 44 44 44 46 46 48 48 48 48 48 50 50 [...]

[...] but it seems likely that the zero has been used as a [...], real investigation, one would likely be able to [...] A [...] the error was later discovered, they might then blame the researchers for using 0 as a missing value code (not a good choice since it is a valid value for some of the variables) and not mentioning it [...] size or complexity.
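The check described above, looking at extremes and at suspicious zeros, can also be done by counting rather than by reading sorted output. The following is a minimal sketch on a small hypothetical data frame `toy` (invented values, not the pima data); the column names only echo those in the text.

```r
# Hypothetical toy data (invented values) where 0 may be a missing-value code.
toy <- data.frame(pregnant  = c(0, 1, 3, 0, 5),
                  diastolic = c(72, 0, 64, 0, 80),
                  bmi       = c(33.6, 26.6, 0, 28.1, 43.1))

# Count the zeros in each column. A zero is plausible for 'pregnant'
# but not for a blood pressure or a body mass index.
zero_counts <- colSums(toy == 0)
print(zero_counts)

# Minimum and maximum of each column, the two values worth a close look.
col_ranges <- sapply(toy, range)
print(col_ranges)
```

On a real dataset the same two calls would be applied to the full data frame; which zeros are genuine remains a judgment about the subject matter, not something the code can decide.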
Set all zero values of the five variables to NA which is the missing value code used by R.

> pima$diastolic[pima$diastolic == 0] <- NA
> pima$glucose[pima$glucose == 0] <- NA
> pima$triceps[pima$triceps == 0] <- NA
> pima$insulin[pima$insulin == 0] <- NA
> pima$bmi[pima$bmi == 0] <- NA

The variable test is not quantitative [...] However, because of the numerical coding, this variable has been treated as if it [...] It's best to designate such variables as factors so that they are treated appropriately. Sometimes people forget this and compute stupid statistics such as "average zip code".

> pima$test <- factor(pima$test)
> summary(pima$test)
  0   1
500 268

We now see that 500 cases were negative [...] to use descriptive labels:

> levels(pima$test) <- c("negative","positive")
> summary(pima)
[summary(pima) output; surviving entries: glucose Min. 44, 1st Qu. 99, Median 117, Mean 122, 3rd Qu. 141, Max. 199, NA's 5; test negative 500, positive 268]

[...] that we've [...] the histogram:

> hist(pima$diastolic)

Figure [...]: First panel shows histogram of the diastolic blood pressures, the second shows a kernel density estimate of the same while the third shows an index [...]

[...] see a [...], I prefer to use Kernel Density Estimates which are essentially a smoothed version of the histogram (see Simonoff (1996) for a discussion of the relative merits of histograms and kernel estimates).
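To make the link between the recoding step and the two displays concrete, the sketch below recodes a zero missing-value code to NA and then computes a histogram and a kernel density estimate of the same values without drawing them. The vector `bp` is hypothetical (simulated, with three zeros planted as missing-value codes); it is not the diastolic data from the text.

```r
set.seed(1)                      # reproducible simulated readings (hypothetical)
bp <- round(rnorm(200, mean = 72, sd = 12))
bp[c(5, 50, 120)] <- 0           # pretend 0 was entered where the value is missing

bp[bp == 0] <- NA                # recode the missing-value code as NA

h <- hist(bp, plot = FALSE)      # hist() ignores NAs; plot = FALSE returns bin counts only
d <- density(bp, na.rm = TRUE)   # density() needs na.rm = TRUE when NAs are present

print(sum(h$counts))             # number of values that were binned
print(d$n)                       # number of values used by the density estimate
print(d$bw)                      # bandwidth chosen by the default rule
```

Both summaries are built from the same non-missing values; the histogram depends on the choice of bins, while the density estimate depends on the bandwidth reported in `d$bw`.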
> plot(density(pima$diastolic, na.rm=TRUE))

[...] see that it is [...] to simply plot the sorted data against its index:

> plot(sort(pima$diastolic), pch=".")

[...] can also see the discreteness in the measurement of blood pressure - values are rounded to the nearest even number and hence we [...] the "steps" [...] a [...]:

> plot(diabetes ~ diastolic, pima)
> plot(diabetes ~ test, pima)

Figure [...]: First panel shows scatterplot of the diastolic blood pressures against diabetes function and the second shows boxplots of diastolic blood pressure broken down by test result

First, we see the standard scatterplot showing two quantitative [...], we see a side-by-side boxplot suitable for showing a quantitative and a qualitative [...] a scatterplot matrix, not shown here, [...]:

> pairs(pima)

We will be seeing more advanced plots later but the numerical and graphical summaries presented here are sufficient for a first look at [...]

[...] used for explaining or modeling the relationship between a single variable Y, called the response, output or dependent variable, and one or more predictor, input, independent or explanatory variables, X1, ..., Xp.
When p = 1, it is called simple regression but when p > 1 it is [...] more than one Y, then it is called multivariate multiple regression which we won't [...] continuous variable but the explanatory variables can be continuous, discrete or categorical although we leave [...], a regression of diastolic and bmi on diabetes would be a multiple regression involving only quantitative variables which we shall be tackling shortly. A regression of diastolic and bmi on test would involve one predictor which is quantitative [...] which we will consider later in the chapter on Analysis of Covariance. A regression of diastolic on just test would involve just qualitative predictors, a topic called Analysis of Variance or ANOVA although this would just be a simple two [...] regression of test (the response) on diastolic and bmi (the predictors) would involve a qualitative [...], or relationship between, [...], binary responses (logistic regression analysis) and count responses (Poisson regression).
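The four situations described above can be sketched as R model formulas. This is an illustration on simulated data, not the book's analysis: the data frame `toy` and all its values are hypothetical, and only the column names echo the text. It also assumes, as the later sentences of the passage suggest, that the first-named variable (diastolic) is the response in the first three cases. `lm()` and `glm()` are base R functions.

```r
set.seed(2)  # simulated, hypothetical data; only the column names echo the text
n <- 100
toy <- data.frame(
  diastolic = rnorm(n, mean = 72, sd = 12),
  bmi       = rnorm(n, mean = 32, sd = 6),
  diabetes  = runif(n, min = 0.1, max = 2),
  test      = factor(sample(c("negative", "positive"), n, replace = TRUE))
)

# Quantitative response, quantitative predictors only: multiple regression.
m_reg    <- lm(diastolic ~ bmi + diabetes, data = toy)
# One quantitative and one qualitative predictor: analysis of covariance.
m_ancova <- lm(diastolic ~ bmi + test, data = toy)
# Qualitative predictor only: analysis of variance (here a two-level factor).
m_anova  <- lm(diastolic ~ test, data = toy)
# Qualitative (binary) response: logistic regression, fitted with glm().
m_logit  <- glm(test ~ diastolic + bmi, family = binomial, data = toy)

print(coef(m_reg))
print(coef(m_ancova))
print(coef(m_anova))
print(coef(m_logit))
```

Because the data are simulated without any built-in relationship, the fitted coefficients carry no meaning; the point is only how the type of response and predictors determines the kind of model.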