
Multivariate Data Analysis


1 Multivariate Data Analysis
Susan Holmes, Bio-X and Statistics. IMA Workshop, October.

"You do not really understand something unless you can explain it to your grandmother." -- Albert Einstein
I am your grandmother ...

What are multivariate data?
Simplest format: matrices. If we have measured 10,000 genes on hundreds of patients, and all the genes are independent, we can't do better than analyze each gene's behavior by using histograms or boxplots, looking at the means, medians, variances and other `one-dimensional statistics'. However, if some of the genes are acting together, either positively correlated or inhibiting each other, we will miss a lot of important information by slicing the data up into those column vectors and studying them separately.

2 Thus important connections between genes are only available to us if we consider the data as a whole. We start by giving a few examples of the kinds of data we will analyze:
- Athletes' performances in the decathlon (long jump, shot put, high jump, 400m, 110m hurdles, discus, pole vault, javelin, 1500m).
- Clinical measurements (diabetes data): relwt, glufast, glutest, steady, insulin, Group.
- OTU read counts.
- RNA-seq, transcriptomic counts (e.g. FBgn0000017, FBgn0000018, FBgn0000022, ... across untreated and treated samples).
- Mass spec intensities per sample.
If the data were all independent columns, then the data would have no multivariate structure and we could just do univariate statistics on each variable (column) in turn. Multivariate statistics means we are interested in how the columns covary; we can compute covariances to evaluate the dependencies. If the data were multivariate normal with p variables, all the information would be contained in the $p \times p$ covariance matrix and the mean.
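As a concrete illustration of the last point, here is a minimal numpy sketch (the data are simulated, not from any of the datasets above): for multivariate normal draws, the column means and the $p \times p$ covariance matrix estimated below capture exactly the joint structure that per-column histograms would miss.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated data: n = 100 "patients", p = 4 correlated measurements
X = rng.multivariate_normal(
    mean=[0, 0, 0, 0],
    cov=[[1.0, 0.8, 0.0, 0.0],
         [0.8, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, -0.5],
         [0.0, 0.0, -0.5, 1.0]],
    size=100)

col_means = X.mean(axis=0)       # the p-vector of means
S = np.cov(X, rowvar=False)      # the p x p covariance matrix
print(S.round(2))                # off-diagonal entries reveal the joint structure
```

Columns 1-2 covary positively and columns 3-4 negatively; univariate summaries of each column alone would show none of this.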

3 Parametric multivariate statistics versus nonparametric, multivariate exploratory analyses.
Exploratory analyses: hypothesis generating.
- Projection methods (new coordinates): Principal Component Analysis, Principal Coordinate Analysis / Multidimensional Scaling (PCO, MDS), Correspondence Analysis, Discriminant Analysis.
- Tree based methods: phylogenetic trees, clustering trees, decision trees.
Confirmatory analyses: hypothesis verification.
- Permutation tests (Monte Carlo).
- Bootstrap (Monte Carlo).
- Bayesian nonparametrics (Monte Carlo).
Modern methods: robust methods. The variance measures the variability of one continuous variable; the robust, low-tech alternatives are to take ranks, clumps, or logs, or to trim the data.
EDA (Exploratory Data Analysis): data checking, hypothesis generation by visualization.
Basic visualization tools: boxplots, barplots, scatterplots of projected data, scatterplots with a binning variable, hierarchical clustering, heatmaps, phylogenies.

4 Combination of phylogenetic trees and structuration (Tukey, 1977).
One-table methods: PCA, MDS, PCoA, CA, ... DATA -> means, variances -> choice of transformation of the data, choice of variables and observations. All are based on the principle of finding the largest axis of variation: new variables/coordinates are built from the old ones, or from distances, giving the best projection directions because they explain the most variance. The first task is often to rescale the data so that all the variables (columns of the matrix) have the same standard deviation; this puts all the variables on the same footing. We also make sure that the means of all columns are zero; this is called centering. After that we will try to simplify the data by doing what we call rank reduction; we'll explain this concept from several different perspectives.

5 A favorite tool for simplifying the data is called Principal Component Analysis (abbreviated PCA).

What is PCA? PCA is an `unsupervised learning technique' because it treats all variables as having the same status; there is no particular response variable that we are trying to predict using the other variables as explanatory predictors, as in supervised methods. PCA is primarily a visualization technique which produces maps that show the relations between the variables in a useful way.

Facts to remember:
- Each PC is defined to maximize the variance it explains.
- The new variables are made to be orthogonal; if the data are multivariate normal they will be independent.
- Always check the screeplot before deciding how many components to retain (how much signal you have).

A geometrical approach: the data are p variables measured on n observations. X has n rows (the observations) and p columns (the variables).
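The centering/scaling/rank-reduction recipe can be sketched in a few lines of numpy. This is a generic SVD-based PCA on simulated data (the helper `pca` and the toy data are illustrative, not from the slides); note how the screeplot heights come straight from the squared singular values.

```python
import numpy as np

def pca(X, scale=True):
    """Plain PCA via the SVD: center (and optionally standardize) the
    columns, then read off the directions of largest variance."""
    X = X - X.mean(axis=0)                 # centering: column means become zero
    if scale:
        X = X / X.std(axis=0)              # rescaling: same standard deviation
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    var_explained = s**2 / np.sum(s**2)    # screeplot heights, sorted decreasing
    scores = U * s                         # principal components (new coordinates)
    return scores, Vt.T, var_explained

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]          # make two columns act together
scores, axes, ve = pca(X)
# ve tells us how much signal each PC carries; here PC1 dominates
```

Because two of the three columns are nearly collinear, the screeplot shows one large component, which is precisely the "largest axis" the slide describes.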

6 An $n \times n$ matrix D of weights on the ``observations'', which is most often diagonal, and a positive definite $p \times p$ matrix Q, often

$$Q = \begin{pmatrix} \frac{1}{\sigma_1^2} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sigma_2^2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & \frac{1}{\sigma_p^2} \end{pmatrix}.$$

Euclidean spaces: these three matrices form the essential ``triplet'' $(X, Q, D)$ defining a multivariate data analysis. Q and D define geometries, or inner products, in $\mathbb{R}^p$ and $\mathbb{R}^n$ respectively, through

$$x^t Q y = \langle x, y\rangle_Q, \; x, y \in \mathbb{R}^p, \qquad x^t D y = \langle x, y\rangle_D, \; x, y \in \mathbb{R}^n.$$

An algebraic approach: Q can be seen as a linear function from $\mathbb{R}^p$ to $\mathbb{R}^{p*} = \mathcal{L}(\mathbb{R}^p)$, the space of scalar linear functions on $\mathbb{R}^p$; D can be seen as a linear function from $\mathbb{R}^n$ to $\mathbb{R}^{n*} = \mathcal{L}(\mathbb{R}^n)$. Together with X and $X^t$ they form the duality diagram:

            X
    R^{p*} ----> R^n
      ^           |
     Q|           |D
      |           v
     R^p <---- R^{n*}
           X^t

Eigenvalues of $X^tDXQ$ = eigenvalues of $XQX^tD$.  (1)

Suppose we have data and inner products defined by Q and D:

$$(x, y) \in \mathbb{R}^p \times \mathbb{R}^p \mapsto x^t Q y = \langle x, y\rangle_Q \in \mathbb{R}, \qquad (x, y) \in \mathbb{R}^n \times \mathbb{R}^n \mapsto x^t D y = \langle x, y\rangle_D \in \mathbb{R},$$

$$\|x\|_Q^2 = \langle x, x\rangle_Q = \sum_{j=1}^p q_j (x_{\cdot j})^2, \qquad \|x\|_D^2 = \langle x, x\rangle_D = \sum_{i=1}^n p_i (x_{i\cdot})^2. \qquad (2)$$

We say an operator O is B-symmetric if $\langle x, Oy\rangle_B = \langle Ox, y\rangle_B$, or equivalently $(BO)^t = BO$. The duality diagram is equivalent to the triplet $(X, Q, D)$, with X an $n \times p$ matrix.

7 Escoufier (1977) defined $XQX^tD = WD$ and $X^tDXQ = VQ$ as the characteristic operators of the diagram. (3) $V = X^tDX$ will be the variance-covariance matrix if X is centered with respect to D ($X^t D 1_n = 0$).

Transposable data: there is an important symmetry between the rows and columns of X in the diagram, and one can imagine situations where the role of observation or variable is not uniquely defined. For instance, in microarray studies the genes can be considered either as variables or as observations. This makes sense in many contemporary situations which evade the more classical notion of n observations seen as a random sample of a population. It is certainly not the case that the 9,000 species are a random sample of bacteria, since these probes try to be exhaustive.

Duality of the diagram: X, $X^t$, VQ and WD all have the same rank. For Q and D symmetric matrices, VQ and WD are diagonalisable and have the same eigenvalues.

8 $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \cdots \geq \lambda_r > 0$, the remaining eigenvalues being zero.

Eigendecomposition of the diagram: VQ is Q-symmetric, thus we can find Z such that

$$VQZ = Z\Lambda, \quad Z^tQZ = I_p, \quad \text{where } \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p). \qquad (1)$$

Practical computations: take Cholesky decompositions of Q and D (symmetric and positive definite), $H^tH = Q$ and $K^tK = D$. Use the singular value decomposition of KXH:

$$KXH = UST^t, \quad \text{with } T^tT = I_p, \; U^tU = I_n, \; S \text{ diagonal}.$$

Then $Z = (H^{-1})^t T$ satisfies $VQZ = Z\Lambda$, $Z^tQZ = I_p$, with $\Lambda = S^2$. The renormalized columns of Z, $A = ZS$, are called the principal axes and satisfy $A^tQA = \Lambda$.

Practical computations, continued: similarly, we can define $L = K^{-1}U$, which satisfies

$$WDL = L\Lambda, \quad L^tDL = I_n, \quad \text{where } \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_r, 0, \ldots, 0). \qquad (2)$$

$C = LS$ is usually called the matrix of principal components; it is normed so that $C^tDC = \Lambda$.

Transition formulae: of the four matrices Z, A, L and C we only have to compute one; all the others are obtained by the transition formulae provided by the duality property of the diagram:

$$XQZ = LS = C, \qquad X^tDL = ZS = A.$$
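The Cholesky-plus-SVD recipe is easy to check numerically. Below is a sketch on simulated data, taking Q as the diagonal matrix of inverse variances and D uniform. One caveat: numpy's `cholesky` returns a lower-triangular factor L with $LL^t = Q$, so the transposes land slightly differently than in the slide's $H^tH = Q$ notation; the defining properties $Z^tQZ = I_p$, $VQZ = Z\Lambda$ and $C^tDC = \Lambda$ are what matter.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 4
Y = rng.normal(size=(n, p))
X = Y - Y.mean(axis=0)                  # center the columns (w.r.t. uniform D)

Q = np.diag(1.0 / X.var(axis=0))        # metric on R^p: inverse variances
D = np.eye(n) / n                       # uniform weights on the n observations

# Cholesky factors of the two metrics: H @ H.T == Q, K @ K.T == D
H = np.linalg.cholesky(Q)
K = np.linalg.cholesky(D)

# SVD of the "standardized" matrix K^t X H
U, s, Tt = np.linalg.svd(K.T @ X @ H, full_matrices=False)
T = Tt.T
Lam = s**2                              # common eigenvalues of VQ and WD

V = X.T @ D @ X                         # variance-covariance matrix
Z = np.linalg.solve(H.T, T)             # satisfies VQZ = Z Lam, Z^t Q Z = I_p
C = X @ Q @ Z                           # transition formula: principal components
```

With standardized columns this triplet reproduces ordinary correlation-matrix PCA; changing Q and D gives the other one-table methods without touching the code structure.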

9 Inertia: $\mathrm{Trace}(VQ) = \mathrm{Trace}(WD)$ (inertia in the sense of Huygens' inertia formula, for instance). Huygens, C. (1657):

$$\sum_{i=1}^n p_i \, d^2(x_i, a),$$

the inertia of a cloud of $p_i$-weighted points with respect to a point a. With $Q = I_p$, $D = \frac{1}{n}I_n$, and the variables centered, the inertia is the sum of the variances of all the variables. If the variables are standardized (Q is the diagonal matrix of inverse variances), then the inertia is the number of variables, p. In Correspondence Analysis the inertia is the Chi-squared statistic.

Distance-based methods: many discrete measurements. Example: data from 2005 House of Representatives roll call votes. We further restricted our analysis to the 401 Representatives that voted on at least 90% of the roll calls (220 Republicans, 180 Democrats and 1 Independent), leading to a $401 \times 669$ matrix V of voting data:

        V1  V2  V3  V4  V5  V6  V7  V8  V9  V10
   1    -1  -1   1  -1   0   1   1   1   1   1
   2    -1  -1   1  -1   0   1   1   1   1   1
   3     1   1  -1   1  -1   1   1  -1  -1  -1
   4     1   1  -1   1  -1   1   1  -1  -1  -1
   5     1   1  -1   1  -1   1   1  -1  -1  -1
   6    -1  -1   1  -1   0   1   1   1   1   1
   7    -1  -1   1  -1  -1   1   1   1   1   1
   8    -1  -1   1  -1   0   1   1   1   1   1
   9     1   1  -1   1  -1   1   1  -1  -1  -1
  10    -1  -1   1  -1   0   1   1   0   0
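The inertia identities above can be verified numerically. A sketch on simulated data, using the $1/n$ variance convention so that $D = \frac{1}{n}I_n$ matches `np.var`:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 5
X = rng.normal(size=(n, p)) * rng.uniform(1.0, 4.0, size=p)
X = X - X.mean(axis=0)                   # centered variables

D = np.eye(n) / n                        # uniform observation weights
V = X.T @ D @ X                          # covariance matrix (1/n convention)

# Q = identity: inertia = sum of the variances of the columns
print(np.trace(V @ np.eye(p)))

# Q = inverse variances (standardized variables): inertia = p
Q = np.diag(1.0 / X.var(axis=0))
W = X @ Q @ X.T
print(np.trace(V @ Q), np.trace(W @ D))  # two equal ways to compute it
```

The equality of the two traces is just cyclicity, $\mathrm{Trace}(X^tDXQ) = \mathrm{Trace}(XQX^tD)$, but seeing the standardized inertia come out exactly to p makes the "number of variables" statement concrete.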

10 [Figure: MDS mapping of legislators based on the House of Representatives roll call votes. Color has been added to indicate the party affiliation of each legislator.]

Multidimensional Scaling: given a distance matrix (or its square), how do we find points in Euclidean space whose distances are given by this matrix? Can we always find such a map? Schoenberg (1935), but also Borchardt. Example: a set of towns whose road distances are known, for which we want to reconstruct a map.

Properties of distances: if we started with original data Y in $\mathbb{R}^p$ that are not centered, apply the centering matrix:

$$X = HY, \quad \text{with } H = \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^t\right), \quad \mathbf{1} = (1, 1, \ldots, 1)^t.$$

Call $B = XX^t$. If $D^{(2)}$ is the matrix of squared distances between rows of X in the Euclidean coordinates, we can show that

$$-\frac{1}{2} H D^{(2)} H = B.$$

We can go backwards from a matrix $D^{(2)}$ to X by taking the eigendecomposition of B, in much the same way that PCA provides the best rank r approximation of the data by taking the singular value decomposition of X, or the eigendecomposition of $XX^t$.
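The double-centering identity gives a working reconstruction directly. A sketch of classical MDS (principal coordinates) on four hypothetical towns whose pairwise distances we pretend are all we know:

```python
import numpy as np

def classical_mds(D2, k=2):
    """Recover k-dimensional coordinates from a matrix of *squared*
    distances: double-center to get B = -1/2 H D2 H, then eigendecompose."""
    n = D2.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * H @ D2 @ H                      # equals X X^t for some centered X
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]              # largest eigenvalues first
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# Hypothetical town coordinates; in practice only the distances are given
towns = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
diff = towns[:, None, :] - towns[None, :, :]
D2 = (diff**2).sum(-1)                         # squared distance matrix
X = classical_mds(D2, k=2)
# pairwise distances of X match the originals, up to rotation/reflection
```

The clamp `np.maximum(w, 0)` matters for real road distances, which need not be exactly Euclidean: negative eigenvalues of B signal how far the distance matrix is from embeddable.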

