Transcription of Chapter Multivariate Analysis Concepts - SAS Support
Chapter 1 Multivariate Analysis Concepts

Introduction
Random Vectors, Means, Variances, and Covariances
Multivariate Normal Distribution
Sampling from Multivariate Normal Populations
Some Important Sample Statistics and Their Distributions
Tests for Multivariate Normality
Random Vector and Matrix Generation

Introduction

The subject of multivariate analysis deals with the statistical analysis of data collected on more than one (response) variable. These variables may be correlated with each other, and their statistical dependence is often taken into account when analyzing such data. In fact, this consideration of statistical dependence makes multivariate analysis somewhat different in approach and considerably more complex than the corresponding univariate analysis, in which there is only one response variable under consideration.

The response variables under consideration are often described as random variables, and since their dependence is one of the things to be accounted for in the analyses, they are often described by their joint probability distribution.
This consideration makes the modeling issue relatively manageable and provides a convenient framework for scientific analysis of the data. The multivariate normal distribution is one of the most frequently made distributional assumptions for the analysis of multivariate data. However, if possible, any such consideration should ideally be dictated by the particular context. In many cases, such as when the data are collected on a nominal or ordinal scale, multivariate normality may not be an appropriate or even viable assumption.

In the real world, most data collection schemes or designed experiments will result in multivariate data. A few examples of such situations are given below.

During a survey of households, several measurements on each household are taken. These measurements, being taken on the same household, will be dependent.
For example, the education level of the head of the household and the annual income of the family are related.

During a production process, a number of different measurements, such as the tensile strength, brittleness, diameter, etc., are taken on the same unit. Collectively, such data are viewed as multivariate data.

On a sample of 100 cars, various measurements, such as the average gas mileage, number of major repairs, noise level, etc., are taken. Also, each car is followed for the first 50,000 miles and these measurements are taken after every 10,000 miles. Measurements taken on the same car at the same mileage and those taken at different mileages are going to be correlated. In fact, these data represent a very complex multivariate analysis problem.

An engineer wishes to set up a control chart to identify the instances when the production process may have gone out of control.
Since an out-of-control process may produce an excessively large number of out-of-specification items, detection at an early stage is important. In order to do so, she may wish to monitor several process characteristics on the same units. However, since these characteristics are functions of process parameters (conditions), they are likely to be correlated, leading to a set of multivariate data. Many times, it is appropriate to set up a single (or only a few) multivariate control chart(s) to detect the occurrence of any out-of-control conditions. On the other hand, if several univariate control charts are separately set up and individually monitored, one may witness too many false alarms, which is clearly an undesirable situation.

A new drug is to be compared with a control for its effectiveness.
Two different groups of patients are assigned to the two treatments, and they are observed weekly for the next two months. The periodic measurements on the same patient will exhibit dependence, and thus the basic problem is multivariate in nature. Additionally, if the measurements on various possible side effects of the drugs are also considered, the subsequent analysis will have to be done under several carefully chosen models.

In a designed experiment conducted in a research and development center, various factors are set up at desired levels and a number of response variables are measured for each of these treatment combinations. The problem is to find a combination of the levels of these factors where all the responses are at their optimum.
Since a treatment combination which optimizes one response variable may not result in the optimum for another response variable, one has a problem of conflicting objectives, especially when the problem is treated as a collection of several univariate optimization problems. Due to dependence among the responses, it may be more meaningful to analyze the response variables simultaneously.

In many situations, it is more economical to collect a large number of measurements on the same unit, but such measurements are made only on a few units. Such a situation is quite common in many remote sensing data collection plans. Obviously, it is practically impossible to collectively interpret hundreds of univariate analyses to come up with some definite conclusions.
A better approach may be to reduce the data in some meaningful way. One may eliminate some of the variables which are deemed redundant in the presence of others. Better yet, one may eliminate some of the linear combinations of all variables which contain little or no information and then concentrate only on a few important ones. Which linear combinations of the variables should be retained can be decided using certain multivariate methods such as principal component analysis. Such methods are not discussed in this book, however.

Most of the problems stated above require (at least for the convenience of modeling and for performing statistical tests) the assumption of multivariate normality. There are, however, several other aspects of multivariate analysis, such as factor analysis, cluster analysis, etc.,
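which are largely distribution free in nature. In this volume, we will only consider the problems of the former class, where the multivariate normality assumption may be needed. Therefore, in the next few sections, we will briefly review the theory of the multivariate normal and other related distributions. This theory is essential for a proper understanding of various multivariate statistical techniques, notation, and nomenclature. The material presented here is meant to be only a refresher and is far from complete. A more complete discussion of this topic can be found in Kshirsagar (1972), Seber (1984), or Rencher (1995).

Random Vectors, Means, Variances, and Covariances

Suppose $y_1, \ldots, y_p$ are $p$ possibly correlated random variables with respective means (expected values) $\mu_1, \ldots, \mu_p$. Let us arrange these random variables as a column vector denoted by $\mathbf{y}$, that is, let

$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix}.$$

We do the same for $\mu_1, \mu_2, \ldots, \mu_p$ and denote the corresponding vector by $\boldsymbol{\mu}$. Then we say that the vector $\mathbf{y}$ has the mean $\boldsymbol{\mu}$, or in notation, $E(\mathbf{y}) = \boldsymbol{\mu}$.

Let us denote the covariance between $y_i$ and $y_j$ by $\sigma_{ij}$, $i, j = 1, \ldots, p$, that is,

$$\sigma_{ij} = \operatorname{cov}(y_i, y_j) = E[(y_i - \mu_i)(y_j - \mu_j)] = E[(y_i - \mu_i)\, y_j] = E(y_i y_j) - \mu_i \mu_j,$$

and let

$$\boldsymbol{\Sigma} = (\sigma_{ij}) = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix}.$$

Since $\operatorname{cov}(y_i, y_j) = \operatorname{cov}(y_j, y_i)$, we have $\sigma_{ij} = \sigma_{ji}$. Therefore, $\boldsymbol{\Sigma}$ is symmetric, with the $(i,j)$th and $(j,i)$th elements representing the covariance between $y_i$ and $y_j$. Further, since $\operatorname{var}(y_i) = \operatorname{cov}(y_i, y_i) = \sigma_{ii}$, the $i$th diagonal place of $\boldsymbol{\Sigma}$ contains the variance of $y_i$. The matrix $\boldsymbol{\Sigma}$ is called the dispersion or the variance-covariance matrix of $\mathbf{y}$. In notation, we write this fact as $D(\mathbf{y}) = \boldsymbol{\Sigma}$.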
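As a small numerical illustration of these definitions, the following Python sketch computes the mean vector and the dispersion matrix for a made-up discrete distribution of a two-component random vector. The four support points and their probabilities are hypothetical and chosen only so that the arithmetic is easy to follow; the helper name `expectation` is likewise an assumption of this sketch. It evaluates each covariance both from the definition and from the shortcut form E(y_i y_j) minus the product of the means, and the two should agree.

```python
# Hypothetical joint distribution of y = (y1, y2): (support point, probability).
support = [
    ((0.0, 1.0), 0.25),
    ((1.0, 3.0), 0.25),
    ((2.0, 2.0), 0.25),
    ((3.0, 6.0), 0.25),
]
p = 2  # number of components of the random vector


def expectation(g):
    """Return E[g(y)] under the discrete distribution defined above."""
    return sum(prob * g(point) for point, prob in support)


# Mean vector: mu_i = E(y_i)
mu = [expectation(lambda y, i=i: y[i]) for i in range(p)]

# Dispersion matrix from the definition: E[(y_i - mu_i)(y_j - mu_j)]
sigma_def = [
    [expectation(lambda y, i=i, j=j: (y[i] - mu[i]) * (y[j] - mu[j]))
     for j in range(p)]
    for i in range(p)
]

# Dispersion matrix from the shortcut: E(y_i y_j) - mu_i mu_j
sigma_short = [
    [expectation(lambda y, i=i, j=j: y[i] * y[j]) - mu[i] * mu[j]
     for j in range(p)]
    for i in range(p)
]

print("mean vector:", mu)
print("dispersion (definition):", sigma_def)
print("dispersion (shortcut):  ", sigma_short)
```

The off-diagonal entries of the resulting matrix are equal, reflecting the symmetry of the dispersion matrix, and the diagonal entries are the variances of the individual components.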
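Various books follow alternative notations for $D(\mathbf{y})$, such as $\operatorname{cov}(\mathbf{y})$ or $\operatorname{var}(\mathbf{y})$. However, we adopt the less ambiguous notation $D(\mathbf{y})$. Thus,

$$\boldsymbol{\Sigma} = D(\mathbf{y}) = E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'] = E[(\mathbf{y} - \boldsymbol{\mu})\,\mathbf{y}'] = E(\mathbf{y}\mathbf{y}') - \boldsymbol{\mu}\boldsymbol{\mu}',$$

where for any matrix (vector) $\mathbf{A}$, the notation $\mathbf{A}'$ represents its transpose.

The quantity $\operatorname{tr}(\boldsymbol{\Sigma}) = \sum_{i=1}^{p} \sigma_{ii}$ is called the total variance, and the determinant of $\boldsymbol{\Sigma}$, denoted by $|\boldsymbol{\Sigma}|$, is often referred to as the generalized variance. The two are often taken as overall measures of the variability of the random vector $\mathbf{y}$. However, both measures suffer from certain shortcomings. For example, the total variance $\operatorname{tr}(\boldsymbol{\Sigma})$, being the sum of only the diagonal elements, essentially ignores all covariance terms. On the other hand, the generalized variance $|\boldsymbol{\Sigma}|$ can be misleading, since two very different variance-covariance structures can sometimes result in the same value of the generalized variance.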
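Both shortcomings can be seen in a small numerical sketch. The three 2 by 2 dispersion matrices below are hypothetical and were picked only to make the point: the first two share the same total variance although one has strongly correlated components and the other has none, while the first and third share the same generalized variance despite having quite different structures. The helper functions `trace` and `det` are written here in plain Python for illustration.

```python
def trace(m):
    """Total variance: the sum of the diagonal elements of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))


def det(m):
    """Generalized variance: determinant by cofactor expansion along the
    first row (adequate for the small matrices used here)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, a in enumerate(m[0]):
        # Minor obtained by deleting the first row and the j-th column
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total


# Hypothetical dispersion matrices, chosen for illustration only
correlated = [[5, 4], [4, 5]]          # variances 5, covariance 4
uncorrelated = [[5, 0], [0, 5]]        # variances 5, covariance 0
small_uncorrelated = [[3, 0], [0, 3]]  # variances 3, covariance 0

for name, sigma in [
    ("correlated", correlated),
    ("uncorrelated", uncorrelated),
    ("small_uncorrelated", small_uncorrelated),
]:
    print(name, "total variance =", trace(sigma),
          "generalized variance =", det(sigma))
```

Comparing `correlated` with `uncorrelated` shows the total variance ignoring the covariance terms, and comparing `correlated` with `small_uncorrelated` shows two different structures with an identical generalized variance.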