Transcription of Introduction to bivariate analysis - Statistics
When one measurement is made on each observation, univariate analysis is applied. If more than one measurement is made on each observation, multivariate analysis is applied. In this section, we focus on bivariate analysis, where exactly two measurements are made on each observation.

The two measurements will be called X and Y. Since X and Y are obtained for each observation, the data for one observation is the pair (X, Y).

Bivariate data can be stored in a table with two columns:

           X   Y
Obs. 1     2   1
Obs. 2     4   4
Obs. 3     3   1
Obs. 4     7   5
Obs. 5     5   6
Obs. 6     2   1
Obs. 7     4   4
Obs. 8     3   1
Obs. 9     7   5
Obs. 10    5   6

Some examples:

Height (X) and weight (Y) are measured for each individual in a sample.

Stock market valuation (X) and quarterly corporate earnings (Y) are recorded for each company in a sample.

A cell culture is treated with varying concentrations of a drug, and the growth rate (X) and drug concentration (Y) are recorded for each trial.
Temperature (X) and precipitation (Y) are measured on a given day at a set of weather stations.

Be clear about the difference between bivariate data and two sample data. In two sample data, the X and Y values are not paired, and there aren't necessarily the same number of X and Y values.

Two sample data:

Sample 1: 3, 2, 5, 1, 3, 4, 2, 3
Sample 2: 4, 4, 3, 6, 5

A bivariate simple random sample (SRS) can be written

$$(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n).$$

Each observation is a pair of values, for example $(X_3, Y_3)$ is the third observation.

In a bivariate SRS, the observations are independent of each other, but the two measurements within an observation may not be (taller individuals tend to be heavier, profitable companies tend to have higher stock market valuations, etc.).

The distribution of X and the distribution of Y can be considered individually using univariate methods. That is, we can analyze $X_1, X_2, \ldots, X_n$ or $Y_1, Y_2, \ldots, Y_n$ using CDFs, densities, quantile functions, etc.
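The distinction between paired and two sample data, and the idea of analyzing each column on its own, can be sketched in a few lines of Python. This is only an illustration: the variable names (`pairs`, `xs`, `ys`, `x_summary`, `y_summary`) are assumptions made for the example, and the numbers are the ones listed in the tables above.

```python
from statistics import mean, stdev

# Bivariate data: one (x, y) pair per observation, taken from the
# two-column table earlier in this section.
pairs = [(2, 1), (4, 4), (3, 1), (7, 5), (5, 6),
         (2, 1), (4, 4), (3, 1), (7, 5), (5, 6)]

# Two sample data: two separate lists that need not have the same length,
# and whose entries are not matched up with each other.
sample_1 = [3, 2, 5, 1, 3, 4, 2, 3]
sample_2 = [4, 4, 3, 6, 5]

# Because the data are paired, the X and Y columns always have equal length.
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]

# Univariate summaries of each column considered on its own.
x_summary = {"mean": mean(xs), "sd": stdev(xs)}
y_summary = {"mean": mean(ys), "sd": stdev(ys)}

print("X alone:", x_summary)
print("Y alone:", y_summary)
print("two sample sizes:", len(sample_1), len(sample_2))
```

Splitting the pairs into `xs` and `ys` discards the pairing, so these summaries say nothing about how X and Y vary together.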
Any property that describes the behavior of the $X_i$ values alone or the $Y_i$ values alone is called a marginal property.

For example the ECDF $\hat{F}_X(t)$ of X, the quantile function $\hat{Q}_Y(p)$ of Y, the sample standard deviation $\hat{\sigma}_Y$ of Y, and the sample mean $\bar{X}$ of X are all marginal properties.

The most interesting questions relating to bivariate data deal with X and Y simultaneously. These questions are investigated using properties that describe X and Y simultaneously. Such properties are called joint properties. For example the mean of $X - Y$, the IQR of $X/Y$, and the average of all $X_i$ such that the corresponding $Y_i$ is negative are all joint properties.

A complete summary of the statistical properties of (X, Y) is given by the joint distribution.

If the sample space is finite, the joint distribution is represented in a table, where the X sample space corresponds to the rows, and the Y sample space corresponds to the columns. For example, if we flip two coins, the joint distribution is

       H     T
H     1/4   1/4
T     1/4   1/4

The marginal distributions can always be obtained from the joint distribution by summing the rows (to get the marginal X distribution), or by summing the columns (to get the marginal Y distribution).
For this example, the marginal X and Y distributions are both {H: 1/2, T: 1/2}.

For another example, suppose we flip a fair coin three times, let X be the number of heads in the first and second flips, and let Y be the number of heads in the second and third flips. These are the possible outcomes:

HHH  HTH  HTT  TTH
HHT  THH  THT  TTT

The joint distribution (rows are values of X, columns are values of Y) is:

       0     1     2
0     1/8   1/8    0
1     1/8   1/4   1/8
2      0    1/8   1/8

The marginal X and Y distributions are both {0: 1/4, 1: 1/2, 2: 1/4}.

An important fact is that two different joint distributions can have the same X and Y marginal distributions. In other words, the joint distribution is not determined completely by the marginal distributions, so information is lost if we summarize a bivariate distribution using only the two marginal distributions. The following two joint distributions have the same marginal distributions:

       0      1
0     2/5    1/5
1     1/10   3/10

       0      1
0     3/10   3/10
1     1/5    1/5

The most important graphical summary of bivariate data is the scatterplot.
This is simply a plot of the points $(X_i, Y_i)$ in the plane. The following figures show scatterplots of June maximum temperatures against January maximum temperatures, and of January maximum temperatures against latitude.

[Figure: scatterplot of June maximum temperature against January maximum temperature]

[Figure: scatterplot of January maximum temperature against latitude]

A key feature in a scatterplot is the association, or trend, between X and Y.

Higher January temperatures tend to be paired with higher June temperatures, so these two values have a positive association.

Higher latitudes tend to be paired with lower January temperatures, so these values have a negative association.

If higher X values are paired with low or with high Y values equally often, there is no association.

Do not draw causal implications from statements about associations, unless your data come from a randomized experiment.

Just because January and June temperatures increase together does not mean that January temperatures cause June temperatures to increase (or vice versa).
The only certain way to sort out causality is to move beyond statistical analysis and talk about mechanisms.

In general, if X and Y have an association, then

(i) X could cause Y to change
(ii) Y could cause X to change
(iii) a third unmeasured (perhaps unknown) variable Z could cause both X and Y to change.

Unless your data come from a randomized experiment, statistical analysis alone is not capable of answering questions about causality.

For the association between January and June temperatures, we can try to propose some simple mechanisms:

Possible mechanism for (i): warmer or cooler air masses in January persist in the atmosphere until June, causing similar effects on the June temperature.

Possible mechanism for (ii): None, it is impossible for one event to cause another event that preceded it in time.

Possible mechanism for (iii): If Z is latitude, then latitude influences temperature because it determines the amount of atmosphere that solar energy must traverse to reach a particular point on the Earth's surface.

Mechanism (iii) is the correct one.
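Case (iii) can be illustrated with a small simulation. The sketch below is not based on the temperature data discussed above: every number in it (the range of Z, the coefficients, the noise level, the seed) is invented for the illustration, and `pearson_r` is a hypothetical helper written out by hand. X and Y are each generated from Z plus independent noise, so neither one influences the other, yet they come out clearly associated.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 500

# Z plays the role of the third variable (all values are made up).
z = [rng.uniform(25, 50) for _ in range(n)]

# Independent noise terms: the only part of X and Y not driven by Z.
noise_x = [rng.gauss(0, 5) for _ in range(n)]
noise_y = [rng.gauss(0, 5) for _ in range(n)]

# X and Y both decrease with Z; neither appears in the other's formula.
x = [100 - 1.5 * zi + e for zi, e in zip(z, noise_x)]
y = [120 - 1.0 * zi + e for zi, e in zip(z, noise_y)]

r_xy = pearson_r(x, y)                 # association induced by Z
r_noise = pearson_r(noise_x, noise_y)  # what is left once Z is removed

print("correlation of X and Y:", round(r_xy, 3))
print("correlation after removing Z:", round(r_noise, 3))
```

In real data the contribution of Z is not known exactly, so it cannot simply be subtracted off as it is here; the simulation only shows that a shared cause is enough to produce an association.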
Suppose we would like to numerically quantify the trend in a bivariate scatterplot.

The most common means of doing this is the correlation coefficient (sometimes called Pearson's correlation coefficient):

$$r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})/(n-1)}{\hat{\sigma}_X \hat{\sigma}_Y}.$$

The numerator

$$\sum_i (X_i - \bar{X})(Y_i - \bar{Y})/(n-1)$$

is called the covariance.

The correlation coefficient $r$ is a function of the data, so it really should be called the sample correlation coefficient.

The (sample) correlation coefficient $r$ estimates the population correlation coefficient $\rho$.

If either the $X_i$ or the $Y_i$ values are constant (i.e. all have the same value), then one of the sample standard deviations is zero, and therefore the correlation coefficient is not defined.

Both the sample and population correlation coefficients always fall between $-1$ and $1$.

If $r = 1$ then the $(X_i, Y_i)$ pairs fall exactly on a line with positive slope.

If $r = -1$ then the $(X_i, Y_i)$ pairs fall exactly on a line with negative slope.

If $r$ is strictly between $-1$ and $1$, then the $(X_i, Y_i)$ points do not fall exactly on any line.
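The definition of r can be written out directly as a short computation. The following is a minimal sketch in plain Python; `sample_cov` and `sample_corr` are hypothetical helper names chosen for this example, and the data are the ten (X, Y) pairs from the table at the start of this section.

```python
from math import sqrt

def sample_cov(xs, ys):
    """Sample covariance: sum of (x - xbar)(y - ybar), divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 values")
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

def sample_corr(xs, ys):
    """Sample correlation: covariance divided by both standard deviations."""
    sd_x = sqrt(sample_cov(xs, xs))  # cov(X, X) is the sample variance of X
    sd_y = sqrt(sample_cov(ys, ys))
    if sd_x == 0 or sd_y == 0:
        # Constant data: a standard deviation is zero, so r is not defined.
        raise ValueError("correlation is undefined for constant data")
    return sample_cov(xs, ys) / (sd_x * sd_y)

# The ten observations from the two-column table.
xs = [2, 4, 3, 7, 5, 2, 4, 3, 7, 5]
ys = [1, 4, 1, 5, 6, 1, 4, 1, 5, 6]

print("covariance:", sample_cov(xs, ys))
print("correlation:", sample_corr(xs, ys))

# Points lying exactly on a line give r equal to 1 or -1, up to rounding.
print(sample_corr([1, 2, 3, 4], [3, 5, 7, 9]))
print(sample_corr([1, 2, 3, 4], [-1, -2, -3, -4]))
```

Raising an error for constant data mirrors the remark above that r is not defined when a sample standard deviation is zero; returning 0 instead would hide that case.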
Consider one term in the correlation coefficient:

$$(X_i - \bar{X})(Y_i - \bar{Y}).$$

If $X_i$ and $Y_i$ both fall on the same side of their respective means,

$X_i > \bar{X}$ and $Y_i > \bar{Y}$, or $X_i < \bar{X}$ and $Y_i < \bar{Y}$,

then this term is positive. If $X_i$ and $Y_i$ fall on opposite sides of their respective means,

$X_i > \bar{X}$ and $Y_i < \bar{Y}$, or $X_i < \bar{X}$ and $Y_i > \bar{Y}$,

then this term is negative.

So $r > 0$ if $X_i$ and $Y_i$ tend to fall on the same side of their means together. If they tend to fall on opposite sides of their means, then $r$ is negative.

[Figure: the plane divided into four quadrants at the point $(\bar{X}, \bar{Y})$, labeled $X > \bar{X}, Y > \bar{Y}$; $X > \bar{X}, Y < \bar{Y}$; $X < \bar{X}, Y > \bar{Y}$; $X < \bar{X}, Y < \bar{Y}$]

[Figure: scatterplot with green and blue points] The green points contribute positively to $r$, the blue points contribute negatively to $r$. In this case the result will be $r > 0$.

[Figure: scatterplot with green and blue points] The green points contribute positively to $r$, the blue points contribute negatively to $r$. In this case the result will be $r < 0$.

Summary of the interpretation of the correlation coefficient:

Positive values of $r$ indicate a positive linear association (i.e. large $X_i$ and large $Y_i$ values tend to occur together, small $X_i$ and small $Y_i$ values tend to occur together).
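Negative values of $r$ indicate a negative linear association (i.e. large $X_i$ values tend to occur with small $Y_i$ values, small $X_i$ values tend to occur with large $Y_i$ values).

Values of $r$ close to zero indicate no linear association (i.e. large $X_i$ values are equally likely to occur with large or small $Y_i$ values).

Suppose we calculate $\bar{X}$ for $X_1, X_2, \ldots, X_n$. Construct two new data sets:

$$Y_i = X_i + b, \qquad Z_i = c X_i.$$

Then $\bar{Y} = \bar{X} + b$ and $\bar{Z} = c\bar{X}$.

The $Y_i$ are a translation of the $X_i$. The $Z_i$ are a scaling of the $X_i$.

It follows that $Y_i - \bar{Y} = X_i - \bar{X}$ and $Z_i - \bar{Z} = c(X_i - \bar{X})$.

From the previous slide, if we are calculating the sample covariance

$$C = \sum_i (X_i - \bar{X})(Y_i - \bar{Y})/(n-1),$$

it follows that if we translate the $X_i$ or the $Y_i$, $C$ does not change.

If we scale the $X_i$ by $a$ and the $Y_i$ by $b$, then $C$ is changed to $abC$.

Suppose we calculate $\hat{\sigma}_X$ for $X_1, X_2, \ldots, X_n$. Construct two new data sets:

$$Y_i = X_i + b, \qquad Z_i = c X_i.$$

Then $\hat{\sigma}_Y = \hat{\sigma}_X$ and $\hat{\sigma}_Z = |c|\,\hat{\sigma}_X$.

From the previous two slides, it follows that the sample correlation coefficient is not affected by translation.

If we scale the $X_i$ by $a$ and the $Y_i$ by $b$, then the sample covariance gets scaled by $ab$, $\hat{\sigma}_X$ is scaled by $|a|$, and $\hat{\sigma}_Y$ is scaled by $|b|$.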
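Thus the correlation $r$ is scaled by $ab/|ab|$, which is the sign of $ab$: $\mathrm{sgn}(ab)$.

Four key properties of covariance and correlation are:

$$\mathrm{cor}(X, X) = 1$$
$$\mathrm{cov}(X, X) = \mathrm{var}(X)$$
$$\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y) + 2\,\mathrm{cov}(X, Y)$$
$$\mathrm{var}(X - Y) = \mathrm{var}(X) + \mathrm{var}(Y) - 2\,\mathrm{cov}(X, Y)$$

More on the paired two sample test.

If paired data $X_i, Y_i$ are observed and we are interested in testing whether the X mean and the Y mean differ, the paired and unpaired test statistics are both built from the difference $\bar{Y} - \bar{X}$. The paired statistic standardizes it using $\hat{\sigma}_D$, the sample standard deviation of the differences $D = X - Y$, and the unpaired statistic standardizes it using $\hat{\sigma}_{XY}$, which ignores the pairing.

Using the properties given above,

$$\mathrm{var}(D) = \mathrm{cov}(X - Y, X - Y) = \mathrm{var}(X) + \mathrm{var}(Y) - 2\,\mathrm{cov}(X, Y).$$

If $\mathrm{cov}(X, Y) > 0$ then $\hat{\sigma}_D < \hat{\sigma}_{XY}$, so the paired test statistic will be larger and hence more significant.

If $\mathrm{cov}(X, Y) < 0$ then $\hat{\sigma}_D > \hat{\sigma}_{XY}$, so the paired test statistic will be less significant.

In the paired two sample test, the covariance will generally be positive, so the paired test statistic gives a more significant result.

For example, consider the typical "before and after" situation, in which a certain cancer drug kills 30% of the cells in every patient's tumor.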