
Lecture 2 Linear Regression: A Model for the Mean


Transcription of Lecture 2 Linear Regression: A Model for the Mean

Slide 1: Lecture 2, Linear Regression: A Model for the Mean. Sharyn O'Halloran, U9611, Spring 2005.

Slide 2: A closer look at:
- the linear regression model
- the least squares procedure
- inferential tools: confidence and prediction intervals
- assumptions
- robustness
- model checking
- log transformation (of Y, X, or both)

Slide 3: Linear regression: introduction.
- Data: (Y_i, X_i) for i = 1, ..., n.
- Interest is in the probability distribution of Y as a function of X.
- Linear regression model: the mean of Y is a straight-line function of X, plus an error term or residual.
- The goal is to find the best-fit line, the one that minimizes the sum of the squared error terms.

Slide 4: Estimated regression line. Steer example (see Display 7.3, p. 177). [Figure: scatterplot of pH against the observed values with the fitted line; labels mark the equation for the estimated regression line $\hat{y} = \hat\beta_0 + \hat\beta_1 x$, the fitted line, the intercept, and an error term.]

Slide 5: Create a new variable, ltime = log(time), and rerun the regression analysis. [Figure: Stata regression output.]

Slide 6: Regression terminology.
- Regression: the mean of a response variable as a function of one or more explanatory variables, $\mu\{Y \mid X\}$.
- Regression model: an ideal formula to approximate the regression.
- Simple linear regression model: $\mu\{Y \mid X\} = \beta_0 + \beta_1 X$, with intercept $\beta_0$ and slope $\beta_1$. This is the mean of Y given X, or the regression of Y on X; the coefficients are unknown parameters.

Slide 7: Y's probability distribution is to be explained by X; $\beta_0$ and $\beta_1$ are the regression coefficients (see Display, p. 180). Note: $Y = \beta_0 + \beta_1 X$ is NOT simple regression; the model describes the mean of Y, not Y itself. Terminology: X is also called the control, explanatory, or independent variable; Y the response, explained, or dependent variable.
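Slide 5's transformation step can be reproduced in Stata. A minimal sketch, assuming the steer data are loaded with variables named time and ph (hypothetical names for the example's measurements):

    * Slide 5: build the log predictor, then fit the regression.
    gen ltime = log(time)   // new variable: natural log of time
    regress ph ltime        // simple linear regression of pH on log(time)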

Slide 8: Regression terminology: estimated coefficients. The estimated line is $\hat\mu\{Y \mid X\} = \hat\beta_0 + \hat\beta_1 X$. Choose $\hat\beta_0$ and $\hat\beta_1$ to make the residuals small.

Slide 9: Regression terminology.
- Fitted value for observation i is its estimated mean: $fit_i = \hat\mu\{Y \mid X_i\} = \hat\beta_0 + \hat\beta_1 X_i$.
- Residual for observation i: $e_i = res_i = Y_i - fit_i$.
- Least squares is the statistical estimation method that finds the estimates minimizing the sum of squared residuals; the solution (from calculus) is on p. 182 of Sleuth: $\hat\beta_1 = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \,/\, \sum_{i=1}^{n}(x_i - \bar{x})^2$ and $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$.

Slide 10: Least squares procedure. The least-squares procedure obtains estimates of the linear equation coefficients $\beta_0$ and $\beta_1$ in the model $\hat{y}_i = \beta_0 + \beta_1 x_i$ by minimizing the sum of the squared residuals or errors $e_i$. This results in a procedure stated as: choose $\beta_0$ and $\beta_1$ so that
$SSE = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2 = \sum \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2$
is minimized.
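These quantities are easy to recover after a fit. A sketch, assuming hypothetical variables y and x; the SSE computed below should match the residual sum of squares, e(rss), that regress reports:

    regress y x
    predict fit, xb          // fitted values: b0 + b1*x_i
    predict res, resid       // residuals: e_i = y_i - fit_i
    gen res2 = res^2
    quietly summarize res2
    display "SSE = " r(sum)  // the quantity least squares minimizes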

Slide 11: Least squares procedure. The slope coefficient estimator is
$\hat\beta_1 = \dfrac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = r_{xy}\,\dfrac{s_y}{s_x}$,
the correlation between X and Y times the ratio of the standard deviation of Y to the standard deviation of X. The constant, or intercept, estimator is $\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}$.

Slide 12: Least squares procedure (cont.). Note that the regression line always goes through the point of means $(\bar{X}, \bar{Y})$. [Figure: trend line for yield (bushel/acre) against fertilizer (lb/acre).] That is, for any value of the independent variable there is a single most likely value for the dependent variable; think of this regression line as the expected value of Y for a given value of X.

Slide 13: Tests and confidence intervals for $\beta_0$, $\beta_1$.
- Degrees of freedom: (n - 2) = sample size - number of coefficients.
- Variance: $\hat\sigma^2$ = (sum of squared residuals)/(n - 2).
- Standard errors (p. 184).
- Ideal normal model: the sampling distributions of $\hat\beta_0$ and $\hat\beta_1$ have the shape of a t-distribution on (n - 2) degrees of freedom.
- Do t-tests and CIs as usual (df = n - 2).
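The slide 11 formulas can be verified by hand from Stata's stored results. A sketch, again with hypothetical variables y and x:

    quietly summarize x
    scalar xbar = r(mean)
    scalar sx   = r(sd)
    quietly summarize y
    scalar ybar = r(mean)
    scalar sy   = r(sd)
    quietly correlate y x
    scalar b1 = r(rho)*sy/sx    // slope: correlation times sd(Y)/sd(X)
    scalar b0 = ybar - b1*xbar  // intercept: the line passes through (xbar, ybar)
    display "slope = " b1 "  intercept = " b0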

Slide 14: Confidence intervals and p-values for $H_0\!:\ \beta = 0$. [Figure: Stata regression output with the coefficient table.]

Slide 15: Inference tools. Hypothesis test and confidence interval for the mean of Y at some X:
- Estimate the mean of Y at $X = X_0$ by $\hat\mu\{Y \mid X_0\} = \hat\beta_0 + \hat\beta_1 X_0$.
- Standard error: $SE[\hat\mu\{Y \mid X_0\}] = \hat\sigma\sqrt{\dfrac{1}{n} + \dfrac{(X_0 - \bar{X})^2}{(n-1)s_x^2}}$.
- Conduct the t-test and confidence interval in the usual way (df = n - 2).

Slide 16: Confidence bands for conditional means. The lfitci command automatically calculates and graphs the confidence bands. Confidence bands in simple regression have an hourglass shape, narrowest at the mean of X.

Slide 17: Prediction of a future Y at $X = X_0$: $\mathrm{Pred}(Y \mid X_0) = \hat\mu\{Y \mid X_0\}$. Standard error of prediction:
$SE[\mathrm{Pred}(Y \mid X_0)] = \sqrt{\hat\sigma^2 + SE[\hat\mu\{Y \mid X_0\}]^2}$
(the variability of Y about its mean plus the uncertainty in the estimated mean). 95% prediction interval: $\mathrm{Pred}(Y \mid X_0) \pm t_{df}(0.975) \times SE[\mathrm{Pred}(Y \mid X_0)]$.
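Both intervals fall out of the two standard errors that predict can generate. A sketch, assuming a model just fit with regress y x (hypothetical variable names):

    predict yhat, xb
    predict se_m, stdp                      // SE of the estimated mean (slide 15)
    predict se_p, stdf                      // SE of prediction, adds sigma^2 (slide 17)
    scalar tcrit = invttail(e(df_r), .025)  // t critical value, df = n-2
    gen ci_lo = yhat - tcrit*se_m           // 95% CI for the mean of Y at each X
    gen ci_hi = yhat + tcrit*se_m
    gen pi_lo = yhat - tcrit*se_p           // 95% prediction interval for a new Y
    gen pi_hi = yhat + tcrit*se_p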

Slide 18: Residuals vs. predicted values plot. After any regression analysis we can draw a residual-versus-fitted plot automatically, just by typing a single command. [Figure: residual-versus-fitted plot.]

Slide 19: Predicted values (yhat). After any regression, the predict command can create a new variable yhat containing the predicted Y values. [Figure: Stata output.]

Slide 20: Residuals (e). The predict command with the resid option can create a new variable e containing the residuals.

Slide 21: The residual-versus-predicted-values plot could be drawn by hand using these commands; a sketch follows below. [Figure: Stata commands and the resulting plot.]

Slide 22: A second type of confidence interval for regression prediction: the prediction band. This expresses our uncertainty in estimating the unknown value of Y for an individual observation with known X value. Command: lfitci with the stdf option. Additional note: predict can generate two kinds of standard errors for the predicted y value, which have two different applications.
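The by-hand version from slides 19-21 takes three commands. A sketch, assuming a regression of hypothetical y on x has just been run (Stata's rvfplot draws the same picture in one step):

    predict yhat2, xb     // slide 19: predicted values
    predict e2, resid     // slide 20: residuals
    scatter e2 yhat2, yline(0) ytitle("Residuals") xtitle("Fitted values")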

Slides 23-24: [Figures: distance against velocity, comparing confidence bands for conditional means (stdp) with confidence bands for individual-case predictions (stdf).]
- 95% confidence interval for $\mu\{Y \mid 1000\}$; 95% prediction interval for Y at X = 1000.
- Confidence band: a set of confidence intervals for $\mu\{Y \mid X_0\}$.
- Calibration interval: the values of X for which $Y_0$ is in a prediction interval.

Slide 25: Notes about confidence and prediction bands.
- Both are narrowest at the mean of X.
- Beware of extrapolation.
- The width of the confidence interval shrinks to zero as n grows large; this is not true of the prediction interval.

Slide 26: Review of simple linear regression.
1. Model with constant variance: $\mu\{Y \mid X\} = \beta_0 + \beta_1 X$; $\mathrm{var}\{Y \mid X\} = \sigma^2$.
2. Least squares: choose estimators $\hat\beta_0$ and $\hat\beta_1$ to minimize the sum of squared residuals:
$\hat\beta_1 = \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) \,/\, \sum_{i=1}^{n}(X_i - \bar{X})^2$, $\hat\beta_0 = \bar{Y} - \hat\beta_1\bar{X}$, $res_i = Y_i - \hat\beta_0 - \hat\beta_1 X_i$.
3. Properties of the estimators: $\hat\sigma^2 = \sum_{i=1}^{n} res_i^2/(n-2)$, $SE(\hat\beta_1) = \hat\sigma/(s_x\sqrt{n-1})$, $SE(\hat\beta_0) = \hat\sigma\sqrt{1/n + \bar{X}^2/((n-1)s_x^2)}$.
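The two band plots from slides 23-24 can be reproduced with lfitci. A sketch, assuming variables named distance and velocity as in the figures:

    twoway (lfitci distance velocity) (scatter distance velocity), ///
        legend(off) title("Confidence band for conditional means (stdp)")
    twoway (lfitci distance velocity, stdf) (scatter distance velocity), ///
        legend(off) title("Prediction band for individual cases (stdf)")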

Slide 27: Assumptions of linear regression. A linear regression model assumes:
- Linearity: $\mu\{Y \mid X\} = \beta_0 + \beta_1 X$
- Constant variance: $\mathrm{var}\{Y \mid X\} = \sigma^2$
- Normality: the distribution of the Y's at any X is normal
- Independence: given the $X_i$'s, the $Y_i$'s are independent

Slide 28: Examples of violations: non-linearity. The true relation between the independent and dependent variables may not be linear. For example, consider campaign fundraising and the probability of winning an election. [Figure: probability of winning an election, P(w), against spending.] The probability of winning increases with each additional dollar spent and then levels off after $50,000.

Slide 29: Consequences of violation of linearity. If linearity is violated, misleading conclusions may occur (however, the degree of the problem depends on the degree of non-linearity). [Figure: a straight line fit to a curved relationship.] A graphical check appears in the sketch below.

Slide 30: Examples of violations: constant variance. The homoskedasticity assumption implies that, on average, we do not expect to get larger errors in some cases than in others. Of course, due to the luck of the draw, some errors will turn out to be larger than others; but homoskedasticity is violated only when this happens in a predictable manner. Example: income and spending on certain goods. People with higher incomes have more choices about what to buy, so we would expect their consumption of certain goods to be more variable than for families with lower incomes.
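The non-linearity of slides 28-29 is usually visible in a plot. A minimal sketch, with hypothetical variables y and x: overlay the least squares line with a lowess smooth, and read systematic divergence between the two as evidence of non-linearity.

    twoway (scatter y x) (lfit y x) (lowess y x)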

Slide 31: Violation of constant variance. [Figure: income (X) against spending (Y) with the fitted line $\hat{Y} = a + bX$; as income increases, so do the errors (the vertical distances from the predicted line).] The relation between income and spending violates homoskedasticity.

Slide 32: Consequences of non-constant variance. If constant variance is violated, LS estimates are still unbiased, but SEs, tests, confidence intervals, and prediction intervals are incorrect. However, the degree of the problem depends on the degree of the violation.

Slide 33: Violation of normality. Nicotine use is characterized by a large number of people not smoking at all and another large number of people who smoke every day: an example of a bimodal distribution. [Figure: frequency of nicotine use.]

Slide 34: Consequence of non-normality. If normality is violated, LS estimates are still unbiased; tests and CIs are quite robust; PIs are not. Of all the assumptions, this is the one that we need to be least worried about violating. Why?
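Both violations show up in standard residual diagnostics. A sketch, assuming a model already fit with regress y x (hypothetical names):

    rvfplot, yline(0)  // slides 31-32: a funnel shape suggests non-constant variance
    predict r1, resid
    qnorm r1           // slides 33-34: a straight quantile plot supports normality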

Slide 35: Violation of independence. The independence assumption means that the error terms of different observations do not influence one another; technically, the residuals or error terms are uncorrelated. The most common violation occurs with data that are collected over time, i.e., time series analysis. Example: high tariff rates in one period are often associated with very high tariff rates in the next period. Example: nominal GNP and consumption. [Figure: residuals of GNP and consumption over time, highly correlated.]

Slide 36: Consequence of non-independence. If independence is violated, LS estimates are still unbiased, but everything else can be misleading. [Figure: log weight against log height, plotting code is litter (5 mice from each of 5 litters); note that mice from litters 4 and 5 have higher weight and height.]
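Non-independence can be screened informally. A sketch, assuming a hypothetical time index t for serial data and a cluster identifier litter as in the slide 36 figure:

    predict r2, resid
    line r2 t, yline(0)          // long runs above/below zero hint at serial correlation
    graph box r2, over(litter)   // level shifts across litters hint at cluster effects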

Slide 37: Robustness of least squares.
- The constant variance assumption is important.
- Normality is not too important for confidence intervals and p-values, but is important for prediction intervals.
- Long-tailed distributions and/or outliers can heavily influence the results.
- Non-independence problems: serial correlation (Ch. 15) and cluster effects (dealt with in Ch. 9-14).
- Strategy for dealing with these potential problems: plots; residual plots; consider outliers (more in Ch. 11); log transformations (Display ).

Slide 38: Tools for model checking; a Stata sketch of the first three follows below.
- Scatterplot of Y vs. X (see Display, p. 213).
- Scatterplot of residuals vs. fitted values. In both plots, look for curvature, non-constant variance, and outliers.
- Normal probability plot: sometimes useful for checking whether the distribution is symmetric or normal (mainly for PIs).
- Lack-of-fit F-test when there are replicates (Section ).
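As a wrap-up, the slide 38 checklist maps onto a few commands. A sketch with hypothetical variables y and x:

    scatter y x    // 1. Y vs. X: curvature, spread, outliers
    regress y x
    rvfplot        // 2. residuals vs. fitted values
    predict r3, resid
    qnorm r3       // 3. normal probability plot (matters most for PIs)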

