
Multiple Linear Regression

Song Ge, BSN, RN, PhD Candidate
Johns Hopkins University School of ... Biostatistics for Evidence-Based Practice

Learning Objectives
By the end of this module, you will be able to:
1. Articulate assumptions for multiple linear regression
2. Explain the primary components of multiple linear regression
3. Identify and define the variables included in the regression equation
4. Construct a multiple regression equation
5. Calculate a predicted value of a dependent variable using a multiple regression equation

Learning Objectives Cont'd
6. Distinguish between unstandardized (B) and standardized (Beta) regression coefficients
7. Distinguish between different methods for entering predictors into a regression model (simultaneous, hierarchical and stepwise)
8. Identify strategies to assess model fit
9. Interpret and report the results of multiple linear regression analysis

Review of lecture two weeks ago
- Linear regression assumes a linear relationship between the independent variable(s) and the dependent variable
- Linear regression allows us to predict an outcome based on one or several predictors
- Linear regression allows us to explain the interrelationships among variables
- Linear regression is a parametric test

How to choose X and Y?


- Y can be regressed on X
- X can be regressed on Y
- The regression is not symmetric
- The choice of which regression to perform depends on the scientific question:
  - Is X to be used to explain or predict Y?
  - Is Y to be used to explain or predict X? (Does poor health status explain high pollution level?)

Linear Regression Assumptions
1. The independent variable can be on any scale (ratio, nominal, etc.)
2. The dependent variable needs to be on a ratio/interval scale
3. The dependent variable needs to be normally distributed overall and normally distributed for each value of the independent variable
4. If the dependent variable is not normally distributed, we can transform it

Review: Normal distribution

Example of transformed data
[Figure panels: Positively skewed; Normally distributed]

Method | Math operation | Good for | Bad for
Log | ln(x), log10(x) | Right skewed data | Zero values, negative values
Square root | √x | Right skewed data | Negative values
Square | x^2 | Left skewed data | Negative values
Cube root | x^(1/3) | Right skewed data, negative values | Not as effective as log transform
Reciprocal | 1/x | Making small values bigger and big values smaller | Zero values, negative values

Linear Regression Assumptions Cont'd
5. Samples must be representative of the population
6. There is no multicollinearity. Multicollinearity means the independent variables are so strongly intercorrelated that they are indistinguishable from each other
   - If VIF lies between 1 and 10, there is no multicollinearity
   - If VIF is greater than 10, there is multicollinearity (VIF cannot be less than 1)
7. The relationship between X and Y must be linear. When the two scores are graphed, they should tend to form a straight line. If the relationship is not linear, other methods must be used
8. For every value of X, the distribution of Y scores must have approximately equal variability (homoscedasticity)

Multiple Linear Regression
- Recall the student scores example from the previous module
- What will you do if you are interested in studying the relationship between final grade and midterm (or screening) score together with other variables such as previous (undergraduate) GPA, GRE score and motivation?
- A simple linear regression (SLR) cannot handle this
- A separate SLR with each explanatory (independent) variable will provide information only in isolation
- You will need to use a multiple linear regression (MLR) method to study them together

Multiple Linear Regression
- A multiple linear regression model shows the relationship between the dependent variable and multiple (two or more) independent variables
- The overall variance explained by the model (R2) as well as the unique contribution (strength and direction) of each independent variable can be obtained
- In MLR, the fitted shape is not really a line.
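The VIF rule of thumb listed under the assumptions above can be checked numerically. The following is only a minimal sketch, assuming NumPy is available: the helper name `vif` and the simulated predictors (`age`, `height`, `age_copy`) are hypothetical and are not taken from the lecture data. Each VIF is computed as 1 / (1 - R2), where R2 comes from regressing one predictor on all the others.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X.

    X is an (n_samples, n_predictors) array WITHOUT an intercept column.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # add intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Simulated (hypothetical) predictors, for illustration only
rng = np.random.default_rng(0)
age = rng.normal(40, 10, 200)
height = rng.normal(170, 8, 200)
age_copy = age + rng.normal(0, 0.5, 200)  # almost a duplicate of age

vif_ok = vif(np.column_stack([age, height]))             # unrelated predictors
vif_bad = vif(np.column_stack([age, height, age_copy]))  # age is duplicated
print("VIF without near-duplicate:", vif_ok)
print("VIF with near-duplicate:   ", vif_bad)
```

In this simulation the two unrelated predictors stay close to a VIF of 1, while adding a near-copy of `age` pushes the VIF for `age` far above 10.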

- If there are three variables, the fitted shape of an MLR model is a plane, and if there are four or more variables, it is impossible to visualize or graph. However, by convention, we still refer to the regression equation as a regression 'line'.

MLR with Two Predictors

Linear Regression Equation
- MLR is sometimes also called multivariate linear regression
- The prediction equation is Y = a + b1X1 + b2X2 + b3X3 + ... + bkXk
- There is still one intercept constant, a, but each independent variable (X1, X2, X3, ...) has its own regression coefficient

Review: Simple Linear Regression
- Y is a linear function of X
- Y = a + bX
- a = intercept
- b = slope

Interpretation of MLR Coefficients

Group exercise: interpret B0, B1 and B2
- Data are from children aged 1 to 5 years in the ...
- Variables:
  - Y is the child's arm circumference (cm)
  - X1 is the age of the child (months)
  - X2 is the height of the child (cm)
- Does arm circumference increase with increasing child age after controlling for child height?

- Multiple linear regression model: Y = B0 + B1X1 + B2X2

Answers
- B0 = the estimated mean arm circumference when the values of age and height are zero
- B1 = the change in the estimated mean arm circumference associated with each 1-month increase in age if height is unchanged
- B2 = You do!

Multiple Linear Regression Models
We can get six critical pieces of information from an MLR:
1. The overall significance of the model
2. The variance in the dependent variable that comes from the set of independent variables in the model
3. The statistical significance of each individual independent variable (controlling for the others)
4. The direct effect (and direction of the effect) of each independent variable on the dependent variable
5. The relative strength of the independent variables
6. The regression equation, which allows us to predict values of the dependent variable given values of the independent variables

The overall piece: R2 (coefficient of determination)
- R2 provides the proportion of variability in Y explained by using X
- R2 measures the ability to predict an individual Y using its X(s)
- Statistical significance of the overall model (model F-test)
- Recall that R is the population correlation coefficient
  - It takes on values between -1 and +1
  - 0 indicates no linear association; 1 indicates a perfect positive linear relationship; -1 indicates a perfect negative linear relationship

R: population correlation
[Figure: output for R square]

The individual piece: Correlation coefficient
- F test of regression coefficient: whether the independent variable associated with it is contributing significantly to the variance accounted for in the dependent variable

Group exercise
- Propose a research question that can be answered by MLR
- State under what assumptions we use this statistical method

- State the formula and what B0, B1 and B2 stand for

Break

- We are interested in knowing if going to restaurants frequently (five or more times/week) can lead to higher cholesterol. We also know that age, gender, and race/ethnicity can affect cholesterol.
- How can we tell if going out to restaurants frequently, this factor alone, will affect cholesterol levels?
- Do age, gender, ethnicity, and going out to eat frequently all affect cholesterol levels?
- Dependent variable: cholesterol level
- Independent variables: age (years), gender (male/female), race/ethnicity (Black, White, Asian, or Hispanic), frequency of going out to eat (5+ times/week vs. less than 5 times/week)

Linear Regression Assumptions
Linear regression is a parametric method and requires that certain assumptions be met:
1. The sample must be representative of the population
2. The dependent variable must be of ratio/interval scale and normally distributed overall and normally distributed for each value of the independent variable
3. For every value of X, the distribution of Y scores must have approximately equal variability (homoscedasticity)
4. The relationship between X and Y must be linear
5. The independent variables are not very strongly inter-correlated (no multicollinearity)

Creating Dummy Variables
- Using dummy variables is a way to express a nominal independent variable with multiple categories by a series of dichotomous (binary) variables that compare one category to a different category that serves as the reference
- The number of dummy variables created will be one less than the number of categories of the variable
- One of the categories is chosen to serve as the reference category
- You then include all the dummy variables in the regression model instead of the original categorical variable

Creating Dummy Variables: Example
- Let's say we have a race/ethnicity variable with four categories (non-Hispanic White, non-Hispanic Black, non-Hispanic Asian, and Hispanic)
- If we want to use it in a multiple regression, we would need to create three variables (4 - 1) to represent the four categories
- We would put these variables into the multiple regression equation instead of the four-category race/ethnicity variable

Example Cont'd
- We would therefore create 3 (4 - 1) dummy variables and choose one category as the reference, in this case non-Hispanic White:
  - Non-Hispanic Black (1 = yes, 0 = no)
  - Non-Hispanic Asian (1 = yes, 0 = no)
  - Hispanic (1 = yes, 0 = no)
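The dummy coding described above can be written out in a few lines. This is a small sketch in plain Python; the function name `dummy_code` and the shortened category labels are hypothetical choices for illustration, with White used as the reference category as in the lecture.

```python
# Four categories; "White" is the reference and gets no dummy of its own.
CATEGORIES = ["White", "Black", "Asian", "Hispanic"]
REFERENCE = "White"

def dummy_code(value, categories=CATEGORIES, reference=REFERENCE):
    """Return the k - 1 dummy values (0/1) for one categorical value."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    # One dummy per non-reference category, in the listed order
    return [1 if value == c else 0 for c in categories if c != reference]

codes = {c: dummy_code(c) for c in CATEGORIES}
for category, dummies in codes.items():
    print(f"{category:9s} -> {dummies}")
```

The reference category is the one coded as all zeros, so each dummy coefficient in the regression compares its category with the reference.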

Say the three dummy variables are called Dummy1, Dummy2 and Dummy3:

Race/Ethnicity | Dummy1 | Dummy2 | Dummy3
Non-Hispanic Black | 1 | 0 | 0
Non-Hispanic Asian | 0 | 1 | 0
Hispanic | 0 | 0 | 1
Non-Hispanic White | 0 | 0 | 0

Information from MLR
- Overall variance explained by the model (that is, do the independent variables in the model, taken together, do a good job at predicting the dependent variable?) using the adjusted R2
- Statistical significance of the overall model (model F-test of R2)
- The strength, direction, and statistical significance of each independent variable (regression coefficients)
- The regression equation as a whole can be used to predict values of the dependent variable for a given set of values of the independent variables

MLR: Analysis Example
- We will use data on 489 NYCHANES study participants to look at a number of potential predictors of total cholesterol (mg/dL)
- The dependent variable is total cholesterol (mg/dL)
- We can see that total cholesterol is somewhat right-skewed

MLR: Analysis Example Cont'd
- To correct for this departure from normality, a transformation of the variable can be made
- In this case, we take the natural log of cholesterol.
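The effect of a natural log transformation on a right-skewed variable can be illustrated with simulated data. This is only a sketch, assuming NumPy is available: the values below are randomly generated stand-ins, not the NYCHANES cholesterol measurements, and `skewness` is a hypothetical helper that computes the standardized third moment.

```python
import numpy as np

def skewness(x):
    """Sample skewness (standardized third moment); 0 for symmetric data."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5

# Simulated right-skewed, strictly positive values (NOT real study data)
rng = np.random.default_rng(42)
chol = rng.lognormal(mean=np.log(190), sigma=0.25, size=489)

log_chol = np.log(chol)  # natural log; requires all values > 0

skew_raw = skewness(chol)
skew_log = skewness(log_chol)
print(f"skewness before log: {skew_raw:.3f}")
print(f"skewness after log:  {skew_log:.3f}")
```

Because the log requires strictly positive values, it suits a measurement such as cholesterol in mg/dL but not variables that contain zeros or negative values.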

- Taking the natural log makes the dependent variable approximately normally distributed

MLR: Analysis Example Cont'd
- We will use multivariate linear regression to look at a number of independent variables:
  - Gender (female = 1 vs. male = 0)
  - Age (continuous)
  - Frequency of eating in restaurants (frequent = 1 vs. infrequent = 0)
  - Race/ethnicity (Black, White, Asian, or Hispanic)
- Note that the race/ethnicity variable has four categories. In order to look at this variable in a regression model, we will have to create dummy variables.

MLR: Analysis Example Cont'd
- We will create 3 (4 - 1) dummy variables and use the category White as the reference. The variable coding will be:
  - Black (1 = person is non-Hispanic Black; 0 = person is any other race/ethnicity)
  - Asian (1 = person is non-Hispanic Asian; 0 = person is any other race/ethnicity)
  - Hispanic (1 = person is Hispanic; 0 = person is not Hispanic)

MLR: Analysis Example Cont'd
- We are testing a number of hypotheses, one null and one alternate hypothesis for each independent variable in the model.

- For example, one hypothesis we are testing is:
  - H0: There is no association between frequency of eating out and total cholesterol, adjusting for gender, age, and race/ethnicity (adjusted beta = 0)
  - Ha: There is an association between frequency of eating out and total cholesterol, adjusting for gender, age, and race/ethnicity (adjusted beta ≠ 0)

Analysis Example: Model Summary
- Adjusted R2 = ...
- The four independent variables explain ... of the variance in the dependent variable
[Model Summary table: columns Model, R, R2, adjusted R2, Std. Error of the Estimate. Predictors: (Constant), Hispanic, restaurant_dich, participant gender, age in years, Asian, Black]

Analysis Example: ANOVA
- The p-value for the overall model is ...
- The amount of variance explained by the model (independent variables) is statistically significant

Analysis Example: Coefficients
- Beta for gender (...), beta for age (...), beta for eating in restaurants (...), beta for Black (...), beta for Asian (...), and beta for Hispanic (...), the regression constant (...)
[Coefficients table: Unstandardized Coefficients, Standardized Coefficients, Confidence Interval for B]
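The quantities reported in the analysis example above (regression coefficients, R2, adjusted R2, the model F statistic, and a predicted value) can be reproduced in outline with ordinary least squares. The sketch below assumes NumPy and uses simulated data: the "true" coefficients, the noise level, and the new participant are all hypothetical and are not the NYCHANES results. It stops at the F statistic; turning that into a p-value would need an F distribution function from a statistics library.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 489

# Simulated predictors, coded as in the lecture (values are NOT real data)
female = rng.integers(0, 2, n)        # 1 = female, 0 = male
age = rng.uniform(20, 80, n)          # years
frequent = rng.integers(0, 2, n)      # 1 = eats out 5+ times/week
race = rng.integers(0, 4, n)          # 0 = White (reference), 1 = Black, 2 = Asian, 3 = Hispanic
black = (race == 1).astype(float)
asian = (race == 2).astype(float)
hispanic = (race == 3).astype(float)

# Hypothetical coefficients, used only to generate an outcome on the log scale
log_chol = (5.0 + 0.02 * female + 0.004 * age + 0.03 * frequent
            + 0.01 * black - 0.01 * asian + 0.0 * hispanic
            + rng.normal(0, 0.15, n))

# Design matrix: a column of ones for the constant, then the six predictors
X = np.column_stack([np.ones(n), female, age, frequent, black, asian, hispanic])
b, *_ = np.linalg.lstsq(X, log_chol, rcond=None)   # unstandardized B values

fitted = X @ b
ss_res = np.sum((log_chol - fitted) ** 2)
ss_tot = np.sum((log_chol - log_chol.mean()) ** 2)
k = X.shape[1] - 1                                  # number of predictors

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))

# Predicted value for a hypothetical 50-year-old Hispanic woman who eats out frequently
x_new = np.array([1, 1, 50, 1, 0, 0, 1])
pred_log = x_new @ b
pred_mgdl = np.exp(pred_log)   # back-transformed from the log scale

names = ["constant", "female", "age", "frequent", "black", "asian", "hispanic"]
for name, value in zip(names, b):
    print(f"{name:9s} B = {value: .4f}")
print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}, F = {f_stat:.2f}")
print(f"predicted ln(cholesterol) = {pred_log:.3f}, about {pred_mgdl:.1f} mg/dL")
```

Because the outcome was log-transformed, the prediction is first obtained on the log scale and then exponentiated, which gives a value closer to a geometric mean than an arithmetic mean.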

