
An Introduction to Applied Multivariate Analysis with R ...

Use R!
Series Editors: Robert Gentleman, Kurt Hornik, Giovanni Parmigiani
For other titles published in this series, [...]

An Introduction to Applied Multivariate Analysis with R
Brian Everitt, Torsten Hothorn

Series Editors:
Robert Gentleman, Program in Computational Biology, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue N., M2-B876, Seattle, Washington 98109, USA
Kurt Hornik, Department of Statistik and Mathematik, Wirtschaftsuniversität Wien, Augasse 2-6, A-1090 Wien, Austria
Giovanni Parmigiani, The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins University, 550 North Broadway, Baltimore, MD 21205-2011, USA

ISBN 978-1-4419-9649-7, e-ISBN 978-1-4419-9650-3
DOI [...]
Library of Congress Control Number: 2011926793
Springer New York Dordrecht Heidelberg London
Springer Science+Business Media, LLC 2011. All rights reserved. [...] software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. [...] subject to proprietary rights.
Printed on acid-free paper
Springer is part of Springer Science+Business Media ( )

In this book, we concentrate on what might be termed the "core" or "classical" multivariate methodology, although mention will be made of recent developments where these are considered relevant and useful. But there is an area of multivariate statistics that we …


Transcription of An Introduction to Applied Multivariate Analysis with R ...


This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Brian Everitt, Professor Emeritus, King's College, London, SE5 8AF, UK
Torsten Hothorn, Institut für Statistik, Ludwig-Maximilians-Universität München, Ludwigstr. 33, 80539 München, Germany

To our wives, [...]

Preface

[...] meaning that several measurements, observations, [...] archaeological artifacts, countries, [...] it may be sensible to isolate each variable and study it separately, [...] one or another method of multivariate analysis might be helpful, [...] in a general sense, [...] most analyses of multivariate data should involve the construction of appropriate graphs and diagrams, [...] flexible, and, in addition, [...] we concentrate on what might be termed the "core" or "classical" multivariate methodology, [...] and that is multivariate analysis of variance (MANOVA) and related techniques such as Fisher's linear discriminant function (LDF).

[...] we are not convinced that MANOVA is now of much more than historical interest; researchers may occasionally pay lip service to using the technique, [...] a classification technique such as LDF needs to be considered in the context of modern classification algorithms, [...] but the main concern of each chapter is the correct application of the methods so as to extract as much information as possible from the data at hand, particularly as some type of graphical representation, [...] both undergraduate and post-graduate, who have attended a good introductory course in statistics that covered hypothesis testing, confidence intervals, simple regression and correlation, analysis of variance, [...] we assume that readers will have some familiarity with R at the level of, say, Dalgaard (2002). In addition to such a student readership, [...]:

R> library("MVA")

Here, R> denotes the prompt sign from the R command line, [...] + indicates additional lines, [...] output produced by function calls is shown below the associated code:

R> rnorm(10)
[1] [...]
[8] [...]

[...] we use several R packages to access different example data sets (many of them contained in the package HSAUR2), standard functions for the general parametric analyses, [...] (CRAN), [...]

R> library("MVA")
R> demo("Ch-MVA") ### Introduction to Multivariate Analysis
R> demo("Ch-Viz") ### Visualization
R> demo("Ch-PCA") ### Principal Components Analysis
R> demo("Ch-EFA") ### Exploratory Factor Analysis
R> demo("Ch-MDS") ### Multidimensional Scaling
R> demo("Ch-CA")  ### Cluster Analysis
R> demo("Ch-SEM") ### Structural Equation Models
R> demo("Ch-LME") ### Linear Mixed-Effects Models

Thanks are due to Lisa Möst, BSc., [...]

[...] for help with data processing and LaTeX typesetting, the copy editor for many helpful corrections, and to John Kimmel, [...]

[...], London
Torsten Hothorn, München

Multivariate Data and Multivariate Analysis

Introduction

Multivariate data arise when researchers record the values of several random variables on a number of subjects or objects or perhaps one of a variety of other things (we will use the general term "units") in which they are interested, leading to a vector-valued or multidimensional observation for each. Such data are collected in a wide range of disciplines, and indeed it is probably reasonable to claim that the majority of data sets met in practice are multivariate. In some studies, the variables are chosen by design because they are known to be essential descriptors of the system under investigation. In other studies, particularly those that have been difficult or expensive to organise, many variables may be measured simply to collect as much information as possible as a matter of expediency or [...]

Multivariate data are ubiquitous as is illustrated by the following four examples:

- Psychologists and other behavioural scientists often record the values of several different cognitive variables on a number of subjects.
- Educational researchers may be interested in the examination marks obtained by students for a variety of different subjects.
- Archaeologists may make a set of measurements on artefacts of interest.
- Environmentalists might assess pollution levels of a set of cities along with noting other characteristics of the cities related to climate and [...]

[...] multivariate data sets can be represented in the same way, namely in a rectangular format known from spreadsheets, in which the elements of each row correspond to the variable values of a particular unit in the data set and the elements of the columns correspond to the values taken by a particular variable. We can write data in such a rectangular format as [...]

[...] q is the number of variables recorded on each unit, [...] n × q data matrix, [...] the theoretical entities describing the univariate distributions of each of the q variables and their joint distribution are denoted by so-called random variables X1, [...]
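The rectangular layout described above can be tried out directly in R. The following is a minimal sketch with made-up numbers: the object name `X`, the unit and variable labels, and all values are illustrative assumptions, not data from the book.

```r
# Hypothetical toy data: n = 4 units (rows) and q = 3 variables (columns).
X <- matrix(c(1.2, 3.4, 5.6,
              2.1, 4.3, 6.5,
              0.9, 3.8, 5.1,
              1.7, 4.0, 6.0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("unit", 1:4),
                            c("var1", "var2", "var3")))

n <- nrow(X)      # number of units
q <- ncol(X)      # number of variables recorded on each unit
x_23 <- X[2, 3]   # value of variable 3 for unit 2

print(dim(X))
```

Each row holds one unit and each column one variable, so `X[i, j]` picks out the value of variable j for unit i.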

[...] if each variable is analysed in isolation, [...] and any interesting "patterns" in, [...] the units cannot really be said to have been sampled from some population in any meaningful sense, [...] methods are used that allow the detection of possibly unanticipated patterns in the data, [...] this implied distinction between the exploratory and the inferential may be a red herring because the general aim of most multivariate analyses, whether implicitly exploratory or inferential is to uncover, display, or extract any "signal" [...]

[...] in the early years of the 20th century, Charles Spearman laid down the foundations of factor analysis (see Chapter 5) whilst investigating correlated intelligence quotient (IQ) [...] Spearman [...] [...]'s introduction of analysis of variance in the 1920s was soon followed by its multivariate generalisation, multivariate analysis of variance, based on work by Bartlett and Roy. (These techniques are not covered in this text for the reasons set out in the Preface.) In these early days, computational aids to take the burden of the vast amounts of arithmetic involved in the application of the multivariate methods being proposed were very limited and, consequently, developments were primarily mathematical and multivariate research was, at the time, [...] the wide availability of relatively cheap and extremely powerful personal computers and laptops allied with flexible statistical software has meant that all the methods of multivariate analysis can be applied routinely even to very large data sets such as those generated in, for example, genetics, imaging, [...] data mining, which has been defined as "the nontrivial extraction of implicit, previously unknown and potentially useful information from data."

Useful books on data mining are those of Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (1996) and Hand, Mannila, and Smyth (2001).

[...] (being Not Available); [...] the number of units (people in this case) is n = 10, with the number of variables being q = 7 and, for example, x34 = [...] (rows) or variables (columns) can be extracted via the [ subset operator; [...]

R> hypo[1:2, c("health", "weight")]
     health weight
1 Very good    150
2 Very good    160

extracts the values x15, x16 and x25, [...]

[...] the sex of the respondent, hair colour, presence or absence of depression, [...] self-perception of health (each coded from I to V, say), and educational level (no schooling, primary, secondary, or tertiary education). [...] The highest level of measurement, [...] (in Kelvin, for example), but other common ones include age (or any other time from a fixed event), weight, [...]

[...] discussion of different types of measurements is often followed by recommendations as to which statistical techniques are suitable for each type; for example, analyses on nominal data should be limited to summary statistics such as the number of cases, the mode, [...] for ordinal data, [...] (1993) make the important point that restricting the choice of statistical methods in this way may be a dangerous practice for data analysis [...] we will not agonise over treating variables such as measures of depression, anxiety, or intelligence as if they are interval-scaled, [...]

[...] namely the presence of missing values in the data; [...] observations and measurements that should have been recorded but for one reason or another [...]
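As a small illustration of the `[` subset operator and of `NA` as the marker for a missing value, here is a sketch on a stand-in data frame. The name `hypo_demo`, the `age` column, and the third row are invented for this example. Only the two printed `health` and `weight` rows are taken from the excerpt above.

```r
# Hypothetical stand-in for the book's `hypo` data frame. The first two
# rows of `health` and `weight` match the printed output in the text;
# the `age` column and the third row are made up for illustration.
hypo_demo <- data.frame(
  age    = c(21, 43, NA),
  health = c("Very good", "Very good", "Good"),
  weight = c(150, 160, 135)
)

# Rows 1 to 2, and the two columns selected by name.
sub <- hypo_demo[1:2, c("health", "weight")]
print(sub)

# NA ("Not Available") marks a missing value; count them.
n_missing <- sum(is.na(hypo_demo))
```

Row indices go before the comma and column names (or indices) after it, so the same operator extracts units, variables, or both at once.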

For example, non-response in sample surveys, dropouts in longitudinal data (see Chapter 8), [...]:

- Omitting a possibly substantial number of individuals will cause a large amount of information to be discarded and lower the effective sample size of the data, making any analyses less effective than they would have been if all the original sample had been available.
- More worrisome is that dropping the cases with missing values on one or more variables can lead to serious biases in both estimation and inference unless the discarded cases are essentially a random subsample of the observed data (the term missing completely at random is often used; see Chapter 8 and Little and Rubin (1987) for more details).

So, at the very least, complete-case analysis leads to a loss, and perhaps a substantial loss, in power by discarding data, but worse, [...] if the researcher is interested in estimating the correlation matrix ( ) of a set of multivariate data, [...] such as factor analysis (see Chapter 5) and structural equation modelling (see Chapter 7), [...] "small".

An alternative answer to the missing-data problem is to consider some form of imputation, the practice of "filling in" [...] unlike in complete-case analysis, [...] from a statistical viewpoint, careful consideration needs to be given to the method used for imputation or otherwise it may cause more problems than it solves; for example, imputing an observed variable mean for a variable's missing values preserves the observed sample means but distorts the covariance matrix ( ), [...] imputing predicted values from regression models tends to inflate observed correlations, biasing them away from zero (see Little 2005).
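The effect of complete-case analysis and of mean imputation can be checked on simulated data. The sketch below is an illustration only: the variables `x` and `y`, the sample size of 50, and the 15 deleted values are assumptions made for this example and are not taken from the book.

```r
# Simulated (hypothetical) data: two correlated variables, with 15 of
# the 50 x values deleted completely at random.
set.seed(1)
y <- rnorm(50)
x <- y + rnorm(50, sd = 0.5)
x[sample(50, 15)] <- NA
dat <- data.frame(x = x, y = y)

# Complete-case analysis: every row containing an NA is dropped.
cc <- na.omit(dat)
n_lost <- nrow(dat) - nrow(cc)

# Mean imputation: replace each missing x by the observed mean of x.
mean_obs <- mean(dat$x, na.rm = TRUE)
x_imp <- ifelse(is.na(dat$x), mean_obs, dat$x)

mean_imp <- mean(x_imp)               # equals mean_obs by construction
var_obs  <- var(dat$x, na.rm = TRUE)  # variance of the observed values
var_imp  <- var(x_imp)                # smaller: imputed points add no spread
```

The imputed values sit exactly at the mean, so they leave the sample mean unchanged but pull the variance (and hence the covariance matrix) towards zero, which is the distortion the text warns about.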

[...] (1987) [...] m > 1 simulated versions, where m is typically small (say 3-10). Each of the simulated complete data sets is analysed using the method appropriate for the investigation at hand, and the results are later combined to produce, say, [...] (1987) and more concisely in Schafer (1999). [...] And if there is a substantial proportion of individuals with large amounts of missing data, [...]

[...] waist, [...];

- Could body size and body shape be summarised in some way by combining the three measurements into a single number?
- Are there subtypes of body shapes amongst the men and amongst the women within which individuals are of similar shapes and between which body shapes differ?

The first question might be answered by principal components analysis (see Chapter 3), and the second question could be investigated using cluster analysis (see Chapter 6). (In practice, [...])

Chest, waist, and hip measurements on 20 individuals (in inches).

chest  waist  hips  gender    chest  waist  hips  gender
   34     30    32  male         36     24    35  female
   37     32    37  male         36     25    37  female
   38     30    36  male         34     24    37  female
   36     33    39  male         33     22    34  female
   38     29    33  male         36     26    38  female
   43     32    38  male         37     26    37  female
   40     33    42  male         34     25    38  female
   38     30    40  male         36     26    37  female
   40     30    37  male         38     28    40  female
   41     32    39  male         35     23    35  female

Our second set of multivariate data consists of the results of chemical analysis on Romano-British pottery made in three different regions (region 1 contains kiln 1, region 2 contains kilns 2 and 3, and region 3 contains kilns 4 and 5).
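The chest, waist, and hip table above is small enough to type into R and explore. The sketch below enters the 20 rows as printed and then runs one possible version of each analysis mentioned in the text. The object name `measure`, the use of `prcomp()`, and complete-linkage `hclust()` cut into two groups are choices made for this illustration, not necessarily those of the later chapters.

```r
# The 20 individuals from the table above (inches): 10 men, then 10 women.
measure <- data.frame(
  chest = c(34, 37, 38, 36, 38, 43, 40, 38, 40, 41,
            36, 36, 34, 33, 36, 37, 34, 36, 38, 35),
  waist = c(30, 32, 30, 33, 29, 32, 33, 30, 30, 32,
            24, 25, 24, 22, 26, 26, 25, 26, 28, 23),
  hips  = c(32, 37, 36, 39, 33, 38, 42, 40, 37, 39,
            35, 37, 37, 34, 38, 37, 38, 37, 40, 35),
  gender = factor(rep(c("male", "female"), each = 10))
)
body <- measure[, c("chest", "waist", "hips")]

# Question 1: combine the three measurements into a few summary scores.
pca <- prcomp(body)          # principal components of the centred data
print(summary(pca))

# Question 2: look for groups of similarly shaped individuals.
hc <- hclust(dist(body))     # complete linkage on Euclidean distances
groups <- cutree(hc, k = 2)  # an arbitrary two-group cut for illustration
print(table(groups, measure$gender))
```

The first column of `pca$x` gives one candidate single-number summary per individual. The cross-tabulation shows how an unsupervised grouping lines up with gender.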

The complete data set, which we shall meet in Chapter 6, consists of the chemical analysis results on 45 pots, [...]

[...]: pottery data (continued). [...] Source: Tubb, A., et al., Archaeometry, 22, 153-171, [...]

[...] perhaps "general intelligence"? The question could be investigated by using exploratory factor analysis (see Chapter 5).

[...]:

SO2: SO2 content of air in micrograms per cubic metre;
temp: average annual temperature in degrees Fahrenheit;
manu: number of manufacturing enterprises employing 20 or more workers;
popul: population size (1970 census) in thousands;
wind: average annual wind speed in miles per hour;
precip: average annual precipitation in inches;
[...]

[...]: USairpollution data (continued). [...] Source: Sokal, [...], Rohlf, [...], Biometry, [...], San Francisco, [...]

[...] how is pollution level as measured by sulphur dioxide concentration related to the six other variables? In the first instance at least, this question suggests the application of multiple linear regression, with sulphur dioxide concentration as the response variable and the remaining six variables being the independent or explanatory variables (the latter is a more acceptable label because the "independent" variables are rarely independent of one another).
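A multiple linear regression of this kind can be sketched with `lm()`. The data below are simulated placeholders, since the actual air pollution values are not part of this excerpt. The sample size, the distributions, and the coefficients used to generate `SO2` are all invented, and only the five explanatory variables named in the list above are included because the sixth is not legible here.

```r
# Simulated stand-in data (NOT the real air pollution measurements).
set.seed(42)
n <- 30
air <- data.frame(
  temp   = rnorm(n, mean = 55, sd = 7),
  manu   = rexp(n, rate = 1 / 400),
  popul  = rexp(n, rate = 1 / 600),
  wind   = rnorm(n, mean = 9, sd = 1.5),
  precip = rnorm(n, mean = 37, sd = 11)
)
# Invented relationship, used only to give the model something to fit.
air$SO2 <- 100 - 1.2 * air$temp + 0.03 * air$manu + rnorm(n, sd = 10)

# SO2 is the response; the remaining variables are explanatory.
fit <- lm(SO2 ~ temp + manu + popul + wind + precip, data = air)
coefs <- coef(fit)
print(summary(fit))
```

With the real data, the same call (extended by the sixth explanatory variable) would give the fitted coefficients that describe how sulphur dioxide concentration relates to the other variables.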

