
Practical Regression and Anova using R


A basic knowledge of data analysis is presumed. Some linear algebra and calculus is also required. The emphasis of this text is on the practice of regression and analysis of variance. The objective is to learn what methods are available and more importantly, when they should be applied. Many examples are


Transcription of Practical Regression and Anova using R

Julian J. Faraway, July 2002

Copyright © 1999, 2000, 2002 Julian J. Faraway. Permission to reproduce individual copies of this book for personal use is [...]

[...] not introductory. It [...] the essentials of statistical inference like estimation, [...] is to learn what methods are available and more importantly, when they [...] relatively less emphasis on mathematical theory, partly because some prior knowledge is [...] important because it [...] take a wider view of statistical theory. It is [...] statistical concepts are just as important in Statistics because these enable us to actually do it [...] principles are harder to learn because they are difficult to state precisely but they [...] are designed for different audiences and have [...] have chosen to use R ([...] (1996)).

Why do I use R?

[...] also a programming language, so I am not limited by the procedures that are preprogrammed by a [...] is relatively easy to program new [...] Data analysis is [...] a time, allowing us to make [...], Macintosh, [...] SAS is the most common statistics package in general but R or S is [...] also popular for quantitative [...] that it is not an introduction to R. There is a [...] I have [...] The reader may choose to start working through this text before learning R and pick it [...]

Contents (surviving fragments): a good estimate?; Theorem; pair of predictors; subspace; [...] known; [...] unknown; Scale Changes; Chicago Insurance Redlining - a [...]; two-level example; [...] predictors.

Contents (further surviving fragments): Three-level example; Scheffé's theorem for multiple comparisons; advantage of RCBD over CRD.

[...] you start

Statistics starts with a problem, continues with the collection of data, [...] is a common mistake of inexperienced Statisticians to plunge into a [...]! [...] problem is [...] formulate the problem correctly, [...] learn something new rather than a [...], often you will be working with a [...] fishing expeditions - if you look hard enough, you'll almost always find something but that something may just be a [...] translated into the language of Statistics, the solution is [...]

It's important to understand how the data was collected. Are the data observational or experimental? Are the data a sample of convenience or were they obtained via a designed sample survey?

How the data were collected has a crucial impact on what conclusions can be made. Is there non-response? The data you don't see may be just as important as the data you do see. Are there missing values? This is a common problem that is troublesome and time consuming to deal with. How are the data coded? In particular, how are the qualitative variables represented? What are the units of measurement? Sometimes data is collected or represented with far more digits than are necessary. Consider rounding if this will help with the interpretation or storage costs. [...] all too common [...] almost a certainty in any real dataset of at [...] looks simple but it is vital. Numerical summaries - means, sds, five-number summaries, correlations.

Graphical summaries: one variable - boxplots, histograms etc.; two variables - scatterplots; many variables - interactive [...]

[...], all the data will be ready to analyze but you should realize that in practice this is [...]

[...] and Kidney Diseases conducted a [...]: Number of times pregnant, Plasma glucose concentration at 2 hours in an oral glucose tolerance test, Diastolic blood pressure (mm Hg), Triceps skin fold thickness (mm), 2-Hour serum insulin (mu U/ml), Body mass index (weight in kg/(height in m)^2), Diabetes pedigree function, Age (years) and a test whether the patient shows signs of diabetes (coded 0 if negative, 1 if positive). [...], before doing anything else, one should find out what the purpose of the study was and more about how [...] skip ahead to a look at [...]

> library(faraway)
> data(pima)
> pima
  pregnant glucose diastolic triceps insulin bmi diabetes age test
1        6     148        72      35       0 [...]

[...] library(faraway) makes the data used in this book available while data(pima) [...], [...] too long to show it [...] dataset of this size, one can just about visually skim over the data for anything out of place but it is [...] start with some numerical summaries:

> summary(pima)
  pregnant glucose diastolic triceps insulin
[...]

[Output of summary(pima): most values were lost in extraction. The surviving entries, apparently from the glucose column, are Min. 0, 1st Qu. 99, Median 117, Mean 121, 3rd Qu. 140, Max. 199.]

[...]() command is a [...], we are looking for anything unusual or unexpected perhaps indicating a data entry error. For this purpose, a close look at the minimum and maximum values of each variable is [...], we see a [...], we then see that the next 5 variables have [...] not good for the health [...] look at the sorted values:

> sort(pima$diastolic)
 [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[19]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 24
[37] 30 30 38 40 44 44 44 44 46 46 48 48 48 48 48 50 50 [...]

[...] but it seems likely that the zero has been used as a [...], real investigation, one would likely be able to [...] the error was later discovered, they might then blame the researchers for using 0 as a missing value code (not a good choice since it is a valid value for some of the variables) and not mentioning it [...] size or complexity.
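The checks described above (scanning minima and maxima, then sorting a suspicious column) can be sketched on a small made-up data frame. This is only an illustration under stated assumptions: the data frame `toy` and its values are hypothetical and merely echo the variable names in the text; only base R functions are used.

```r
# Hypothetical toy data; the column names only echo those in the text.
toy <- data.frame(
  pregnant  = c(0, 2, 5, 1, 3, 0),
  diastolic = c(72, 0, 66, 80, 0, 64),
  bmi       = c(33.6, 26.6, 0, 28.1, 43.1, 25.6)
)

# Minimum and maximum of every column: the first place to look.
ranges <- sapply(toy, range)

# Five-number summaries (min, lower hinge, median, upper hinge, max).
five_num <- sapply(toy, fivenum)

# Count the zeros in each column. Whether a zero is plausible is a
# subject-matter judgement: fine for pregnant, impossible for the others.
n_zero <- colSums(toy == 0)

# Sorting pushes any block of identical suspicious values to the front.
sorted_diastolic <- sort(toy$diastolic)

print(ranges)
print(n_zero)
print(sorted_diastolic)
```

Counting zeros per column does not decide anything by itself; it only points to the columns where the coding should be checked against how the data were collected.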

Set all zero values of the five variables to NA which is the missing value code used by R.

> pima$diastolic[pima$diastolic == 0] <- NA
> pima$glucose[pima$glucose == 0] <- NA
> pima$triceps[pima$triceps == 0] <- NA
> pima$insulin[pima$insulin == 0] <- NA
> pima$bmi[pima$bmi == 0] <- NA

The variable test is not quantitative [...] However, because of the numerical coding, this variable has been treated as if it [...] best to designate such variables as factors so that they are treated appropriately. Sometimes people forget this and compute stupid statistics such as "average zip code".

> pima$test <- factor(pima$test)
> summary(pima$test)
  0   1
500 268

We now see that 500 cases were negative [...] to use descriptive labels:

> levels(pima$test) <- c("negative","positive")
> summary(pima)
  pregnant glucose diastolic triceps insulin
[...]

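[Output of summary(pima) after recoding: most values were lost in extraction. The surviving entries, apparently from the glucose column, are Min. 44, 1st Qu. 99, Median 117, Mean 122, 3rd Qu. 141, Max. 199, NA's 5; the counts 500 and 268 survive for test.]

[...] that we've [...] the histogram:

> hist(pima$diastolic)

[Figure: axis labels surviving from the panels are pima$diastolic, "N = 733 Bandwidth = [...]" and sort(pima$diastolic).]

Figure [...]: First panel shows histogram of the diastolic blood pressures, the second shows a kernel density estimate of the same while the third shows an index [...]

[...] see a [...], I prefer to use Kernel Density Estimates which are essentially a smoothed version of the histogram (see Simonoff (1996) for a discussion of the relative merits of histograms and kernel estimates).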
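The recoding steps and their effect on the histogram and kernel density estimate can be tried without the pima data. The sketch below uses simulated readings; the data frame `toy`, the seed and the planted zeros are all assumptions made for illustration, and only base R functions are used. Plotting is left commented out so the block runs without a graphics device.

```r
set.seed(1)  # simulated readings, not the pima data
toy <- data.frame(
  diastolic = round(rnorm(200, mean = 72, sd = 12) / 2) * 2,  # even values
  test      = rbinom(200, 1, 0.35)                            # 0/1 code
)
toy$diastolic[sample(200, 10)] <- 0   # plant ten zeros as a "missing" code

# Recode the sentinel zeros as NA, R's missing value code.
toy$diastolic[toy$diastolic == 0] <- NA

# Declare the 0/1 code a factor so it is not treated as a quantity.
toy$test <- factor(toy$test, levels = c(0, 1),
                   labels = c("negative", "positive"))

# hist() drops NAs itself; plot = FALSE just returns the bin counts.
h <- hist(toy$diastolic, plot = FALSE)

# density() stops with an error on NAs unless na.rm = TRUE is supplied.
d <- density(toy$diastolic, na.rm = TRUE)

# sort() removes NAs by default, so this is ready for an index plot.
s <- sort(toy$diastolic)

# Uncomment to draw the three panels:
# par(mfrow = c(1, 3)); plot(h); plot(d); plot(s, pch = ".")
```

Because the simulated values are rounded to even numbers, an index plot of `s` would show the same kind of "steps" that discrete measurement produces in real blood pressure data.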

> plot(density(pima$diastolic, na.rm=TRUE))

[...] see that it is [...] to simply plot the sorted data against its index:

> plot(sort(pima$diastolic),pch=".")

[...] can also see the discreteness in the measurement of blood pressure - values are rounded to the nearest even number and hence we see the "steps" [...]:

> plot(diabetes ~ diastolic,pima)
> plot(diabetes ~ test,pima)

Figure [...]: First panel shows scatterplot of the diastolic blood pressures against diabetes function and the second shows boxplots of the diabetes function broken down by test result

First, we see the standard scatterplot showing two quantitative [...], we see a side-by-side boxplot suitable for showing a quantitative and a qualitative [...] a scatterplot matrix, not shown here, [...]:

> pairs(pima)

We will be seeing more advanced plots later but the numerical and graphical summaries presented here are sufficient for a first look at [...]

[...] used for explaining or modeling the relationship between a single variable Y, called the response, output or dependent variable, and one or more predictor, input, independent or explanatory variables, X1, ..., Xp.

When p = 1, it is called simple regression but when p > 1 it is [...] more than one Y, then it is called multivariate multiple regression which we won't [...] continuous variable but the explanatory variables can be continuous, discrete or categorical although we leave [...], a regression of diastolic and bmi on diabetes would be a multiple regression involving only quantitative variables which we shall be tackling shortly. A regression of diastolic and bmi on test would involve one predictor which is quantitative [...] which we will consider later in the chapter on Analysis of Covariance. A regression of diastolic on just test would involve just qualitative predictors, a topic called Analysis of Variance or ANOVA although this would just be a simple two [...] regression of test (the response) on diastolic and bmi (the predictors) would involve a qualitative [...], or relationship between, [...], binary responses (logistic regression analysis) and count responses (poisson regression).
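The distinctions drawn above between kinds of regression map onto R model formulas. The following is a minimal sketch on simulated data; every variable name (`x1`, `x2`, `g`, `y`, `yb`) and the data-generating coefficients are assumptions chosen for illustration, not taken from the text.

```r
set.seed(2)  # simulated data; the variable names are illustrative only
n  <- 100
x1 <- rnorm(n)                                   # quantitative predictor
x2 <- rnorm(n)                                   # quantitative predictor
g  <- factor(sample(c("negative", "positive"), n, replace = TRUE))
y  <- 1 + 2 * x1 - x2 + (g == "positive") + rnorm(n)  # continuous response
yb <- rbinom(n, 1, plogis(x1))                   # binary response

simple   <- lm(y ~ x1)         # p = 1: simple regression
multiple <- lm(y ~ x1 + x2)    # p > 1: multiple regression
ancova   <- lm(y ~ x1 + g)     # quantitative and qualitative predictors
anova1   <- lm(y ~ g)          # qualitative predictor only (two levels)
logistic <- glm(yb ~ x1 + x2, family = binomial)  # binary response

print(coef(multiple))
print(coef(anova1))
```

In each call the response sits to the left of `~` and the predictors to the right; a qualitative predictor must be a factor for `lm` to code it appropriately rather than treat its labels as numbers.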

