
An Introduction to Applied Multivariate Analysis with R ...

Use R!
Series Editors: Robert Gentleman, Kurt Hornik, Giovanni Parmigiani

An Introduction to Applied Multivariate Analysis with R
Brian Everitt, Torsten Hothorn

Series Editors:
Robert Gentleman, Program in Computational Biology, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue, N. M2-B876, Seattle, Washington 98109, USA
Kurt Hornik, Department of Statistik and Mathematik, Wirtschaftsuniversität Wien, Augasse 2-6, A-1090 Wien, Austria
Giovanni Parmigiani, The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins University, 550 North Broadway, Baltimore, MD 21205-2011, USA

Printed on acid-free paper. Springer New York Dordrecht Heidelberg London. © Springer Science+Business Media, LLC 2011.

An Introduction to Applied Multivariate Analysis with R. Brian Everitt • Torsten Hothorn. Series Editors: ... one or another method of multivariate analysis might be helpful, and it is with such methods that this book is largely concerned. Multivariate ... R is a statistical computing environment that is powerful, flexible, and, in addition ...



Transcription of An Introduction to Applied Multivariate Analysis with R ...


ISBN 978-1-4419-9649-7
e-ISBN 978-1-4419-9650-3
DOI ...
Library of Congress Control Number: 2011926793

Springer is part of Springer Science+Business Media.

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Brian Everitt, Professor Emeritus, King's College, London, SE5 8AF, UK
Torsten Hothorn, Institut für Statistik, Ludwig-Maximilians-Universität München, Ludwigstr. 33, 80539 München, Germany

To our wives, ...

Preface

... meaning that several measurements, observations, ... archaeological artifacts, countries, ... it may be sensible to isolate each variable and study it separately, ... one or another method of multivariate analysis might be helpful, ... in a general sense, ... most analyses of multivariate data should involve the construction of appropriate graphs and diagrams, ... flexible, and, in addition, ...

... we concentrate on what might be termed the "core" or "classical" multivariate methodology, ... and that is multivariate analysis of variance (MANOVA) and related techniques such as Fisher's linear discriminant function (LDF). ... we are not convinced that MANOVA is now of much more than historical interest; researchers may occasionally pay lip service to using the technique, ... a classification technique such as LDF needs to be considered in the context of modern classification algorithms, ...

... but the main concern of each chapter is the correct application of the methods so as to extract as much information as possible from the data at hand, particularly as some type of graphical representation, ...

... both undergraduate and post-graduate, who have attended a good introductory course in statistics that covered hypothesis testing, confidence intervals, simple regression and correlation, analysis of variance, ... we assume that readers will have some familiarity with R at the level of, say, Dalgaard (2002).

In addition to such a student readership, ... :

    R> library("MVA")

Here, R> denotes the prompt sign from the R command line, ... + indicates additional lines, ... output produced by function calls is shown below the associated code:

    R> rnorm(10)
    [1] ...
    [8] ...

... we use several R packages to access different example data sets (many of them contained in the package HSAUR2), standard functions for the general parametric analyses, ... (CRAN), ...

    R> library("MVA")
    R> demo("Ch-MVA")   ### Introduction to Multivariate Analysis
    R> demo("Ch-Viz")   ### Visualization
    R> demo("Ch-PCA")   ### Principal Components Analysis
    R> demo("Ch-EFA")   ### Exploratory Factor Analysis
    R> demo("Ch-MDS")   ### Multidimensional Scaling
    R> demo("Ch-CA")    ### Cluster Analysis
    R> demo("Ch-SEM")   ### Structural Equation Models
    R> demo("Ch-LME")   ### Linear Mixed-Effects Models

Thanks are due to Lisa Möst, BSc., for help with data processing and LaTeX typesetting, the copy editor for many helpful corrections, and to John Kimmel, ...

..., London
Torsten Hothorn, M...

Chapter 1: Multivariate Data and Multivariate Analysis

Introduction

Multivariate data arise when researchers record the values of several random variables on a number of subjects or objects or perhaps one of a variety of other things (we will use the general term "units") in which they are interested, leading to a vector-valued or multidimensional observation for each. Such data are collected in a wide range of disciplines, and indeed it is probably reasonable to claim that the majority of data sets met in practice are multivariate. In some studies, the variables are chosen by design because they are known to be essential descriptors of the system under investigation. In other studies, particularly those that have been difficult or expensive to organise, many variables may be measured simply to collect as much information as possible as a matter of expediency or ... Multivariate data are ubiquitous, as is illustrated by the following four examples:

- Psychologists and other behavioural scientists often record the values of several different cognitive variables on a number of subjects.
- Educational researchers may be interested in the examination marks obtained by students for a variety of different subjects.
- Archaeologists may make a set of measurements on artefacts of interest.
- Environmentalists might assess pollution levels of a set of cities along with noting other characteristics of the cities related to climate and ...

... Multivariate data sets can be represented in the same way, namely in a rectangular format known from spreadsheets, in which the elements of each row correspond to the variable values of a particular unit in the data set and the elements of the columns correspond to the values taken by a particular variable. We can write data in such a rectangular format as ... q is the number of variables recorded on each unit, ... n × q data matrix, ... the theoretical entities describing the univariate distributions of each of the q variables and their joint distribution are denoted by so-called random variables X1, ..., Xq, ...

... if each variable is analysed in isolation, ... and any interesting "patterns" in, ... the units cannot really be said to have been sampled from some population in any meaningful sense, ... methods are used that allow the detection of possibly unanticipated patterns in the data, ... this implied distinction between the exploratory and the inferential may be a red herring because the general aim of most multivariate analyses, whether implicitly exploratory or inferential, is to uncover, display, or extract any "signal" ...

... in the early years of the 20th century, Charles Spearman laid down the foundations of factor analysis (see Chapter 5) whilst investigating correlated intelligence quotient (IQ) ... Spearman ... 's introduction of analysis of variance in the 1920s was soon followed by its multivariate generalisation, multivariate analysis of variance, based on work by Bartlett and Roy.

(These techniques are not covered in this text for the reasons set out in the Preface.) In these early days, computational aids to take the burden of the vast amounts of arithmetic involved in the application of the multivariate methods being proposed were very limited and, consequently, developments were primarily mathematical and multivariate research was, at the time, ... the wide availability of relatively cheap and extremely powerful personal computers and laptops allied with flexible statistical software has meant that all the methods of multivariate analysis can be applied routinely even to very large data sets such as those generated in, for example, genetics, imaging, ... data mining, which has been defined as "the nontrivial extraction of implicit, previously unknown and potentially useful information from data." Useful books on data mining are those of Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (1996) and Hand, Mannila, and Smyth (2001).

... (being Not Available); ... the number of units (people in this case) is n = 10, with the number of variables being q = 7 and, for example, x34 = ... , a ... (rows) or variables (columns) can be extracted via the [ subset operator; ...,

    R> hypo[1:2, c("health", "weight")]
         health weight
    1 Very good    150
    2 Very good    160

extracts the values x15, x16 and x25, x26 ...

... : ... the sex of the respondent, hair colour, presence or absence of depression, ... self-perception of health (each coded from I to V, say), and educational level (no schooling, primary, secondary, or tertiary education). ... : The highest level of measurement, ... (in Kelvin, for example), but other common ones include age (or any other time from a fixed event), weight, ...

... discussion of different types of measurements is often followed by recommendations as to which statistical techniques are suitable for each type; for example, analyses on nominal data should be limited to summary statistics such as the number of cases, the mode, ... for ordinal data, ... (1993) make the important point that restricting the choice of statistical methods in this way may be a dangerous practice for data analysis ... we will not agonise over treating variables such as measures of depression, anxiety, or intelligence as if they are interval-scaled, ...

... namely the presence of missing values in the data; ... observations and measurements that should have been recorded but for one reason or another ...
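The units-by-variables layout and the [ subset operator discussed above can be sketched on a small invented data frame. This is only an illustration under stated assumptions: the object name `toy`, its column names, and all of its values are hypothetical and are not the book's `hypo` data.

```r
# Hypothetical toy data: 4 units (rows) and 3 variables (columns).
# All values are invented for illustration; this is not the book's data set.
toy <- data.frame(
    age    = c(21, 43, NA, 35),        # NA marks a value that is Not Available
    weight = c(150, 160, 135, 140),
    health = factor(c("Very good", "Very good", "Average", "Good"))
)

n <- nrow(toy)   # number of units
q <- ncol(toy)   # number of variables recorded on each unit

# x_ij is the value of variable j for unit i; here unit 3, variable 2
x32 <- toy[3, 2]

# Rows 1 and 2 of two named columns, in the style of the subsetting call above
sub <- toy[1:2, c("health", "weight")]
print(sub)
```

Indexing by position (`toy[3, 2]`) matches the x_ij notation directly, while indexing by column name is usually easier to read and survives a reordering of the columns.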

For example, non-response in sample surveys, dropouts in longitudinal data (see Chapter 8), ... :

- Omitting a possibly substantial number of individuals will cause a large amount of information to be discarded and lower the effective sample size of the data, making any analyses less effective than they would have been if all the original sample had been available.
- More worrisome is that dropping the cases with missing values on one or more variables can lead to serious biases in both estimation and inference unless the discarded cases are essentially a random subsample of the observed data (the term missing completely at random is often used; see Chapter 8 and Little and Rubin (1987) for more details).

So, at the very least, complete-case analysis leads to a loss, and perhaps a substantial loss, in power by discarding data, but worse, ... if the researcher is interested in estimating the correlation matrix (...) of a set of multivariate data, ... such as factor analysis (see Chapter 5) and structural equation modelling (see Chapter 7), ... small.
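The loss of effective sample size under complete-case analysis can be made concrete with a short sketch. The data frame `d` and its values are invented for illustration, and the choice of base R's `complete.cases()` and `cor()` is an assumption of this example rather than something the text above prescribes.

```r
# Hypothetical data with scattered missing values (invented for illustration)
d <- data.frame(
    x = c(1.0, 2.0, NA, 4.0, 5.0, 6.0),
    y = c(2.1, NA, 3.2, 3.9, 5.1, 6.2),
    z = c(0.5, 1.1, 1.4, NA, 2.4, 3.1)
)

# Complete-case analysis: keep only the units with no missing value at all
cc <- d[complete.cases(d), ]

n_all <- nrow(d)    # units originally recorded
n_cc  <- nrow(cc)   # units that survive the deletion

# Correlation matrix estimated from the complete cases only
R_cc <- cor(cc)

# The same casewise deletion, requested through cor() itself
R_cc2 <- cor(d, use = "complete.obs")
```

Each of the three missing values here sits in a different row, so only three entries are missing yet half of the six units are discarded, which is the information loss described in the first point of the list above.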

