Practical Regression and Anova using R

presumed. Some linear algebra and calculus is also required. The emphasis of this text is on the practice of regression and analysis of variance. The objective is to learn what methods are available and more importantly, when they should be applied. Many examples are

Transcription of Practical Regression and Anova using R

Practical Regression and Anova using R
Julian J. Faraway
July 2002

Copyright © 1999, 2000, 2002 Julian J. Faraway. Permission to reproduce individual copies of this book for personal use is [...]

[...] not introductory. It [...] the essentials of statistical inference like estimation, [...] is to learn what methods are available and more importantly, when they [...] relatively less emphasis on mathematical theory, partly because some prior knowledge is [...] important because it [...] take a wider view of statistical theory. It is [...] statistical concepts are just as important in Statistics because these enable us to actually do it [...] principles are harder to learn because they are difficult to state precisely but they [...] are designed for different audiences and have [...]

[...] have chosen to use R ([...] (1996)). Why do I use R? [...] also a programming language, so I am not limited by the procedures that are preprogrammed by a [...] is relatively easy to program new [...] Data analysis is [...] a time, allowing us to make [...], Macintosh, [...] SAS is the most common statistics package in general but R or S is [...] also popular for quantitative [...] that it is not an introduction to R.

There is a [...] I have [...] The reader may choose to start working through this text before learning R and pick it [...]

Contents (recoverable entries only): [...] a good estimate?; [...] Theorem; [...] pair of predictors; [...] subspace; [...] known; [...] unknown; Scale Changes, [...]; Chicago Insurance Redlining - a [...]; two-level example; [...] predictors; Three-level example; Scheffé's theorem for multiple comparisons; [...] advantage of RCBD over CRD.

[...] you start

Statistics starts with a problem, continues with the collection of data, [...] is a common mistake of inexperienced Statisticians to plunge into a [...]! [...] problem is [...] formulate the problem correctly, [...] learn something new rather than a [...], often you will be working with a [...] fishing expeditions - if you look hard enough, you'll almost always find something but that something may just be a [...] translated into the language of Statistics, the solution is [...]'s important to understand how the data was collected. Are the data observational or experimental? Are the data a sample of convenience or were they obtained via a designed sample survey?

How the data were collected has a crucial impact on what conclusions can be made. Is there non-response? The data you don't see may be just as important as the data you do see. Are there missing values? This is a common problem that is troublesome and time consuming to deal with. How are the data coded? In particular, how are the qualitative variables represented? What are the units of measurement? Sometimes data is collected or represented with far more digits than are necessary. Consider rounding if this will help with the interpretation or storage costs. [...] all too common [...] almost a certainty in any real dataset of at [...] a [...] looks simple but it is vital.

Numerical summaries - means, sds, five-number summaries, correlations.
Graphical summaries:
  One variable - Boxplots, histograms etc.
  Two variables - scatterplots.
  Many variables - interactive [...]

[...], all the data will be ready to analyze but you should realize that in practice this is [...]'s [...] and Kidney Diseases conducted a [...]: Number of times pregnant, Plasma glucose concentration at 2 hours in an oral glucose tolerance test, Diastolic blood pressure (mm Hg), Triceps skin fold thickness (mm), 2-Hour serum insulin (mu U/ml), Body mass index (weight in kg/(height in m)^2), Diabetes pedigree function, Age (years) and a test whether the patient shows signs of diabetes (coded 0 if negative, 1 if positive).

[...], before doing anything else, one should find out what the purpose of the study was and more about how [...]'s skip ahead to a look at [...]

> library(faraway)
> data(pima)
> pima
  pregnant glucose diastolic triceps insulin bmi diabetes age test
1        6     148        72      35       0 [...]

[...] library(faraway) makes the data used in this book available while data(pima) [...], [...]'s too long to show it [...] dataset of this size, one can just about visually skim over the data for anything out of place but it is [...] start with some numerical summaries:

> summary(pima)
[Summary table garbled in extraction. The surviving values, which appear to belong to the glucose column, are Min. 0, 1st Qu. 99, Median 117, Mean 121, 3rd Qu. 140, Max. 199.]

[...] summary() command is a [...], we are looking for anything unusual or unexpected perhaps indicating a data entry error. For this purpose, a close look at the minimum and maximum values of each variable is [...], we see a [...], we then see that the next 5 variables have [...] not good for the health [...]'s look at the sorted values:

> sort(pima$diastolic)
 [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[19]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 24
[37] 30 30 38 40 44 44 44 44 46 46 48 48 48 48 48 50 50 [...]

[...] but it seems likely that the zero has been used as a [...], [...] real investigation, one would likely be able to [...], [...] A [...] the error was later discovered, they might then blame the researchers for using 0 as a missing value code (not a good choice since it is a valid value for some of the variables) and not mentioning it [...] size or complexity.

Set all zero values of the five variables to NA which is the missing value code used by R.

> pima$diastolic[pima$diastolic == 0] <- NA
> pima$glucose[pima$glucose == 0] <- NA
> pima$triceps[pima$triceps == 0] <- NA
> pima$insulin[pima$insulin == 0] <- NA
> pima$bmi[pima$bmi == 0] <- NA

The variable test is not quantitative [...] However, because of the numerical coding, this variable has been treated as if it [...]'s best to designate such variables as factors so that they are treated appropriately. Sometimes people forget this and compute stupid statistics such as "average zip code".

> pima$test <- factor(pima$test)
> summary(pima$test)
  0   1
500 268

We now see that 500 cases were negative [...] to use descriptive labels:

> levels(pima$test) <- c("negative","positive")
> summary(pima)
[Summary table garbled in extraction. The surviving values, which appear to belong to the glucose and test columns, are: glucose Min. 44, 1st Qu. 99, Median 117, Mean 122, 3rd Qu. 141, Max. 199, NA's 5; test negative 500, positive 268.]
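The recoding steps above can be sketched on a small made-up data frame, so that the effect is visible without the faraway package. The data frame `toy`, its values, and the helper vector `zero_as_na` are hypothetical and are not part of the pima data; the sketch assumes, as in the text, that a zero in a measurement column really means "missing".

```r
# Toy data standing in for pima (values invented for illustration only).
toy <- data.frame(
  diastolic = c(72, 66, 0, 64, 0, 74),
  glucose   = c(148, 85, 183, 0, 137, 116),
  test      = c(1, 0, 1, 0, 1, 0)
)

# Recode zeros as NA in the measurement columns in one pass,
# instead of repeating the same assignment for every column.
zero_as_na <- c("diastolic", "glucose")
toy[zero_as_na] <- lapply(toy[zero_as_na], function(v) replace(v, v == 0, NA))

# Store the 0/1 outcome as a labelled factor so it is not averaged by mistake.
toy$test <- factor(toy$test, levels = c(0, 1),
                   labels = c("negative", "positive"))

na_counts      <- colSums(is.na(toy[zero_as_na]))   # missing values per column
test_counts    <- table(toy$test)                   # cases per outcome level
mean_diastolic <- mean(toy$diastolic, na.rm = TRUE) # NAs must be dropped explicitly
```

Setting both `levels` and `labels` in the `factor()` call is one way to avoid relying on the order in which the levels happen to be sorted.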

[...] that we've [...] the histogram:

> hist(pima$diastolic)

Figure: First panel shows histogram of the diastolic blood pressures, the second shows a kernel density estimate of the same while the third shows an index [...]

[...] see a [...], I prefer to use Kernel Density Estimates which are essentially a smoothed version of the histogram (see Simonoff (1996) for a discussion of the relative merits of histograms and kernel estimates).

> plot(density(pima$diastolic, na.rm=TRUE))

[...] see that it is [...] to simply plot the sorted data against its index:

> plot(sort(pima$diastolic), pch=".")

[...] can also see the discreteness in the measurement of blood pressure - values are rounded to the nearest even number and hence we [...] the "steps" [...] a [...]:

> plot(diabetes ~ diastolic, pima)
> plot(diabetes ~ test, pima)

First, we see the standard scatterplot showing two quantitative [...], we see a side-by-side boxplot suitable for showing a quantitative and a qualitative [...] a scatterplot matrix, not shown here, [...]

Figure: First panel shows scatterplot of the diastolic blood pressures against diabetes function and the second shows boxplots of diastolic blood pressure broken down by test result.

> pairs(pima)

We will be seeing more advanced plots later but the numerical and graphical summaries presented here are sufficient for a first look at [...] used for explaining or modeling the relationship between a single variable $Y$, called the response, output or dependent variable, and one or more predictor, input, independent or explanatory variables, $X_1, \ldots, X_p$.
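As a small illustration of the kernel density estimate and of the rounding to even numbers mentioned above, the following sketch uses simulated values rather than the pima data. The vector `bp` and all its parameters are assumptions made for the example; the plotting calls are left as comments so the block runs without a graphics device.

```r
set.seed(1)

# Simulated readings rounded to the nearest even number, with a few set to NA.
bp <- 2 * round(rnorm(200, mean = 72, sd = 12) / 2)
bp[sample(200, 10)] <- NA

# density() stops with an error on missing values unless na.rm = TRUE is given.
d <- density(bp, na.rm = TRUE)
n_used <- d$n                               # observations actually used

# Every non-missing value is even, which is what produces "steps"
# when the sorted data are plotted against their index.
all_even <- all(bp %% 2 == 0, na.rm = TRUE)

# plot(d)                       # kernel density estimate
# plot(sort(bp), pch = ".")     # sorted data against index
```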

When $p = 1$, it is called simple regression but when $p > 1$ it is [...] more than one $Y$, then it is called multivariate multiple regression which we won't [...] continuous variable but the explanatory variables can be continuous, discrete or categorical although we leave [...], a regression of diastolic and bmi on diabetes would be a multiple regression involving only quantitative variables which we shall be tackling shortly. A regression of diastolic and bmi on test would involve one predictor which is quantitative [...] which we will consider later in the chapter on Analysis of Covariance. A regression of diastolic on just test would involve just qualitative predictors, a topic called Analysis of Variance or ANOVA although this would just be a simple two [...] regression of test (the response) on diastolic and bmi (the predictors) would involve a qualitative [...], or relationship between, [...], binary responses (logistic regression analysis) and count responses (Poisson regression).

Francis Galton coined the term "regression to mediocrity" in 1875 in reference to the simple regression equation in the form

$$\frac{y - \bar{y}}{SD_y} = r \, \frac{x - \bar{x}}{SD_x}$$

[...] can illustrate this effect with some data on scores from a [...], we see a [...] scale each variable to have mean 0 and SD 1 so that we are not distracted by the relative [...], this simplifies the regression equation to $y = rx$.

> data(stat500)
> stat500 <- data.frame(scale(stat500))
> plot(final ~ midterm, stat500)
> abline(0,1)

Figure: [...] shown with a dotted line while $y = x$ is shown as a [...]

[...] have added the $y = x$ (solid) [...] student scoring, say one standard deviation above [...] compute the least squares regression fit and plot the regression line (more on the details later).

We also compute the correlations.

> g <- lm(final ~ midterm, stat500)
> abline(g$coef, lty=5)
> cor(stat500)

[...] see that the student scoring 1 SD above average on the midterm is predicted to score somewhat less above average on the final (see the dotted regression line) - [...]'s above [...], a student scoring below average on the midterm might expect to do relatively better in the final although still below [...] exams managed to measure the ability of students perfectly, then provided that ability remained unchanged from midterm to final, we would expect to see a [...], it's too much to expect such a perfect exam and some variation is [...], individual effort is [...] high score on the midterm can partly be attributed to skill but also a [...] be maintained in [...] "regression to mediocrity".

Of course this applies to any $(x, y)$ situation like this - an example is the so-called sophomore jinx in sports when a rookie star has a so-so second season after a great first year. Although in the father-son example, it does predict that successive descendants will come closer to the mean, it [...] other applications of regression, the regression effect is not of interest so it is [...], [...]'s the fuel consumption of a [...] discrete but that's [...], say, [...] as a predictor would not be possible within the current development (we will see how to do this later in the course).
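The regression effect described above can be reproduced with simulated scores, which may help when the stat500 data are not at hand. This is only a sketch: the variables `ability`, `midterm`, `final` and the "skill plus independent luck" mechanism are assumptions made for the example, not the book's data.

```r
set.seed(42)
n <- 500

ability <- rnorm(n)            # hypothetical underlying skill
midterm <- ability + rnorm(n)  # each exam = skill + independent luck
final   <- ability + rnorm(n)

# Scale both scores to mean 0 and SD 1, as done for stat500 in the text.
scores <- data.frame(scale(cbind(midterm, final)))

fit   <- lm(final ~ midterm, data = scores)
slope <- unname(coef(fit)["midterm"])
r     <- cor(scores$midterm, scores$final)

# Predicted final score (in SD units) for a student 1 SD above average on the midterm.
pred_at_1sd <- unname(predict(fit, data.frame(midterm = 1)))
```

With standardized variables the least squares slope equals the correlation, so `slope` and `r` agree, and `pred_at_1sd` falls short of 1 whenever the correlation is below 1.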

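Typically the data will be available in the form of an array like this

$$\begin{array}{cccc} y_1 & x_{11} & x_{12} & x_{13} \\ y_2 & x_{21} & x_{22} & x_{23} \\ \vdots & \vdots & \vdots & \vdots \\ y_n & x_{n1} & x_{n2} & x_{n3} \end{array}$$

where $n$ is [...]

$$Y = f(X_1, X_2, X_3) + \varepsilon$$

where $f$ is some unknown function and $\varepsilon$ is the error in this representation which is additive [...]'t have enough data to try to estimate $f$ directly, we usually have to assume that it has some more restricted form, perhaps linear as in

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$$

where $\beta_i$, $i = 0, 1, 2, 3$ are unknown parameters. $\beta_0$ is [...] linear model the parameters enter linearly - the predictors do not have to be linear. For example

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 \log X_2 + \varepsilon$$

is linear but

$$Y = \beta_0 + \beta_1 X_1^{\beta_2} + \varepsilon$$

is [...] for example $y = \beta_0 x_1^{\beta} \varepsilon$ [...] but because the predictors can be transformed and combined in any way, they [...], we may write

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i, \qquad i = 1, \ldots, n$$

[...] will find it simpler both notationally and theoretically to use a [...] written as

$$y = X\beta + \varepsilon$$

where $y = (y_1, \ldots, y_n)^T$, $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$, $\beta = (\beta_0, \ldots, \beta_3)^T$ and

$$X = \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13} \\ 1 & x_{21} & x_{22} & x_{23} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} & x_{n3} \end{pmatrix}$$

[...], mean only model $y = \mu + \varepsilon$:

$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \mu + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

We can assume that $E\varepsilon = 0$ since if this were not so, we could simply absorb the non-zero expectation for the error into the mean $\mu$ to get a [...] sample problem with a treatment group having the response $y_1, \ldots, y_m$ with mean $\mu_y$ and control group having response $z_1, \ldots, z_n$ with mean $\mu_z$ we have

$$\begin{pmatrix} y_1 \\ \vdots \\ y_m \\ z_1 \\ \vdots \\ z_n \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \mu_y \\ \mu_z \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_{m+n} \end{pmatrix}$$

We have the regression equation $y = X\beta + \varepsilon$ - what estimate of $\beta$ would best separate the systematic component $X\beta$ from the random component $\varepsilon$?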
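The design matrices written out above can be inspected directly in R with `model.matrix()`. The sketch below uses a small invented data frame `dat` and an invented grouping factor `grp`; neither comes from the book, and the group sizes (3 treated, 2 control) are chosen only to keep the output small.

```r
# Hypothetical data with a response and three predictors.
dat <- data.frame(
  y  = c(1.2, 2.3, 2.9, 4.1, 5.2),
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(2, 1, 4, 3, 5),
  x3 = c(0, 1, 0, 1, 1)
)

# Design matrix for y = b0 + b1*x1 + b2*x2 + b3*x3 + error:
# a leading column of ones followed by the predictor columns.
X <- model.matrix(y ~ x1 + x2 + x3, data = dat)

# A transformed predictor still gives a model that is linear in the parameters.
X_log <- model.matrix(y ~ x1 + log(x2), data = dat)

# Two-sample problem: one indicator column per group and no intercept.
grp <- factor(rep(c("treatment", "control"), times = c(3, 2)),
              levels = c("treatment", "control"))
X2 <- model.matrix(~ grp - 1)
```

Printing `X2` shows rows of `1 0` for the treatment group followed by rows of `0 1` for the control group, matching the two-sample matrix in the text.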

Geometrically speaking, $y \in \mathbb{R}^n$ while $\beta \in \mathbb{R}^p$ where $p$ is the number of parameters (if we include the intercept then $p$ is the number of predictors plus one).

Figure: Geometric representation of the estimation of $\beta$. The data vector $Y$ is projected orthogonally onto the model space spanned by $X$. The fit is represented by projection $\hat{y} = X\hat{\beta}$ with the difference between the fit and the data represented by the residual vector $\hat{\varepsilon}$. (Labels in the figure: space spanned by $X$; fitted in $p$ dimensions; $y$ in $n$ dimensions; residual in $n - p$ dimensions.)

The problem is to find $\beta$ such that $X\beta$ is close to $Y$. The best choice of $\hat{\beta}$ is [...] is in some sense the best estimate of $\beta$ [...] $\hat{y} = X\hat{\beta}$ or $Hy$ where $H$ is [...] denoted by [...] to represent, as accurately as possible, something complex - $y$ which is $n$-dimensional - in terms of something much simpler - [...] our model is successful, the structure in the data should be captured in those $p$ dimensions, leaving just random variation in the residuals which lie in an $n$ [...] have

Data = Systematic Structure + Random Variation
$n$ dimensions = $p$ dimensions + $(n - p)$ dimensions

[...] can be considered from a non-geometric point of view. We might define the best estimate of $\beta$ as that which minimizes the sum of the squared errors, $\varepsilon^T \varepsilon$.
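The projection picture can be checked numerically. The sketch below simulates a small data set and forms the standard least squares solution $\hat{\beta} = (X^T X)^{-1} X^T y$ and the projection matrix $H = X (X^T X)^{-1} X^T$; these formulas are the usual textbook ones and are stated here as an assumption, since the derivation is cut off in the passage above. All variable names and the simulated coefficients are invented for the example.

```r
set.seed(7)
n  <- 20
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)   # made-up systematic part plus random error

X <- cbind(1, x1, x2)              # n x p design matrix with p = 3
p <- ncol(X)

# Least squares estimate: solve the normal equations (X'X) b = X'y.
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Orthogonal projection onto the space spanned by the columns of X.
H <- X %*% solve(crossprod(X)) %*% t(X)

fitted_vals <- H %*% y             # p-dimensional systematic part
resid_vec   <- y - fitted_vals     # remainder, orthogonal to the columns of X
rss         <- sum(resid_vec^2)    # the minimized sum of squared errors
```

Forming `H` explicitly is done only to mirror the geometry. It is an n by n matrix, and `lm()` itself works through a QR decomposition rather than inverting `crossprod(X)`.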

