
Selecting Variables in Multiple Regression

Transcription of Selecting Variables in Multiple Regression

Selecting Variables in Multiple Regression
James H. Steiger
Department of Psychology and Human Development
Vanderbilt University

1. Introduction
2. The Problem with Redundancy
   Collinearity and Variances of Beta Estimates
3. Detecting and Dealing with Redundancy
4. Classic Selection Procedures
   The Akaike Information Criterion (AIC)
   The Bayesian Information Criterion (BIC)
   Cross-Validation Based Criteria
   An Example: The Highway Data
   Forward Selection
   Backward Elimination
   Stepwise Regression
5. Computational Examples
6. Caution about Selection Methods

Introduction

One problem that can arise in exploratory multiple regression studies is deciding which predictors, from a set of potential predictor variables, should be included in the multiple regression analysis and in the ultimate prediction equation. In this module, we review some traditional and newer approaches to variable selection, pointing out some of the pitfalls involved in selecting a subset of variables to use.

The Problem with Redundancy

A fundamental problem when one has several potential predictors is that some may be largely redundant with others. One result of such redundancy is called multicollinearity, which occurs when some predictors are linear combinations of others (or nearly so), resulting in a covariance matrix of predictors that is singular, or nearly singular. One outcome of multicollinearity is that parameter estimates become subject to wild sampling fluctuations, for theoretical reasons that we investigate next.

Collinearity and Variances of Beta Estimates

Suppose we have just two predictors, and the mean function is

    E(Y | X_1 = x_1, X_2 = x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2    (1)

It can be shown that

    \mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{(1 - r_{12}^2)\, S_{X_j X_j}}    (2)

where r_{12} is the correlation between X_1 and X_2, and S_{X_j X_j} = \sum_i (x_{ij} - \bar{x}_j)^2. From the above formula, we can see that, as r_{12}^2 approaches 1, these variances are greatly inflated.

When the number of predictors exceeds 2, the previous result generalizes. Specifically, we have

    \mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{(1 - R_j^2)\, S_{X_j X_j}}    (3)

where R_j^2 is the squared multiple correlation between X_j and the other predictors. It is easy to see why the quantity 1/(1 - R_j^2) is called the j-th variance inflation factor, VIF_j.
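To make the variance inflation idea concrete, the sketch below computes VIF_j = 1/(1 - R_j^2) directly from this definition in R, by regressing each predictor on the remaining ones. The data frame and variable names are hypothetical (they do not come from the slides).

    # Sketch: variance inflation factors from the definition VIF_j = 1/(1 - R_j^2),
    # where R_j^2 is the squared multiple correlation of predictor j with the rest.
    # Hypothetical simulated predictors, with x2 nearly collinear with x1.
    set.seed(1)
    n  <- 100
    x1 <- rnorm(n)
    x2 <- 0.9 * x1 + 0.1 * rnorm(n)
    x3 <- rnorm(n)
    X  <- data.frame(x1, x2, x3)

    vif <- sapply(names(X), function(v) {
      r2 <- summary(lm(reformulate(setdiff(names(X), v), response = v),
                       data = X))$r.squared
      1 / (1 - r2)
    })
    vif   # large values (a common rule of thumb is > 10) flag near-redundant predictors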

Detecting and Dealing with Redundancy

Simple multicollinearity may be detected in several ways. For example, one might examine the correlation matrix to see if any predictors are highly correlated, and delete the offenders. The varimax-rotated principal component structure of a set of predictors will reveal more complex forms of multicollinearity, so long as the redundancy is linear. Principal component analysis will reveal uncorrelated variables that are linear combinations of the original predictors, and which account for the maximum possible variance. If there is a lot of redundancy, just a few principal components might be as effective as the full set of predictors.
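As a brief illustration of these checks (code not shown in the original slides), the correlation matrix and principal component structure of the hypothetical predictor frame X from the earlier sketch can be examined as follows; varimax() is applied here to the leading component loadings.

    # Look for pairs of highly correlated predictors.
    round(cor(X), 2)

    # Principal components of the (standardized) predictors.
    pc <- prcomp(X, scale. = TRUE)
    summary(pc)       # proportion of variance explained by each component
    pc$rotation       # loadings: which predictors define each component

    # A varimax rotation of the leading loadings can make the structure
    # easier to interpret.
    varimax(pc$rotation[, 1:2])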

In some cases, predictors may be redundant with each other, but the redundancy is nonlinear. Frank Harrell's Hmisc package includes a function redun to detect such nonlinear redundancy and suggest variables that might be candidates for removal; a brief usage sketch appears after the list below.

Classic Selection Procedures

In this section, we review the classic variable selection procedures that have dominated the social sciences literature. These procedures are usually referred to as

1. Forward Selection
2. Backward Elimination
3. Stepwise Regression
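Returning to the redun function mentioned above, here is a minimal usage sketch, assuming the Hmisc package is installed and continuing the hypothetical data frame X; the r2 cutoff shown is purely illustrative.

    # Predictors are listed on the right-hand side of a one-sided formula;
    # r2 is the R^2 cutoff above which a predictor is declared redundant
    # (possibly through a nonlinear relationship with the others).
    library(Hmisc)
    redun(~ x1 + x2 + x3, data = X, r2 = 0.9)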

Classic Selection Procedures

The goal of variable selection is to divide a set of predictors in the columns of a matrix X into active and inactive sets. The number of such partitions is 2^k, which becomes quite large very quickly as k grows. There are two fundamental issues:

1. Given a particular candidate for the active terms, what criterion should be used to compare this candidate to other possible choices?
2. How do we deal computationally with the potentially huge number of comparisons that need to be made?
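To give a feel for the size of this search space, a quick calculation (not from the slides):

    # The number of active/inactive partitions of k candidate predictors is 2^k.
    k <- c(5, 10, 20, 30)
    data.frame(k = k, candidate_subsets = 2^k)
    # Even k = 30 gives over a billion subsets, which is why the procedures
    # discussed below examine only a small, cleverly chosen sequence of them.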

Originally, the criteria for model evaluation were purely statistical. In order to be added to a model, a variable had to be significant according to the classic partial F test, either with a p-value below a certain "p to enter" value, or with an F statistic exceeding a specified "F to enter" value (as in SPSS). More recently, attention has shifted to so-called informational criteria, which appear, at least at first glance, to combine model fit with model complexity in assessing whether a variable should be added to a prediction equation.

The Akaike Information Criterion (AIC)

Criteria for comparing various candidate subsets are based on the lack of fit of a model and its complexity. Ignoring constants that are the same for every candidate subset, the AIC, or Akaike Information Criterion, for a candidate C is

    AIC_C = n \log(RSS_C / n) + 2 p_C    (4)

According to the Akaike criterion, the model with the smallest AIC is to be preferred.
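The AIC of equation (4) is easy to compute by hand for a least-squares fit. The sketch below continues the hypothetical simulated data from the earlier examples, adding a made-up response; for lm fits, R's extractAIC() uses the same expression, so the two values should agree.

    # Continue the hypothetical example: add a simulated response and fit a model.
    X$y <- X$x1 - X$x3 + rnorm(nrow(X))
    fit <- lm(y ~ x1 + x2 + x3, data = X)

    # Equation (4): AIC_C = n log(RSS_C / n) + 2 p_C, where p_C counts the
    # estimated regression coefficients, including the intercept.
    n   <- nrow(X)
    rss <- sum(residuals(fit)^2)
    p   <- length(coef(fit))
    n * log(rss / n) + 2 * p

    # For lm fits, extractAIC() returns c(edf, AIC) computed the same way.
    extractAIC(fit)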

The Bayesian Information Criterion (BIC)

The Schwarz Bayesian Information Criterion (BIC) is

    BIC_C = n \log(RSS_C / n) + p_C \log(n)    (5)
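Continuing the same hypothetical fit, the BIC of equation (5) differs from equation (4) only in the penalty multiplier, which extractAIC() accepts through its k argument. (R's BIC() function uses the full log-likelihood, so it differs from equation (5) by a constant that is the same for every candidate subset.)

    # Equation (5): BIC_C = n log(RSS_C / n) + p_C log(n).
    n * log(rss / n) + p * log(n)

    # The same quantity via extractAIC() with penalty multiplier log(n).
    extractAIC(fit, k = log(n))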

Cross-Validation Based Criteria

The major reason for employing fit indices that correct for complexity is that, for sample data, increasing the complexity of the model can never yield a higher RSS, and almost always will yield a lower RSS, even when the increase in complexity yields no gain in prediction in the population.

In genuine cross-validation, the sample is divided into two parts at random, a construction (or calibration) set and a validation set. The model is fit to the construction set, and parameter estimates are obtained. The model with those parameter estimates is then used to predict the response variable in the validation set. The resulting prediction error is used as a measure of fit, and is not corrected for complexity.

The PRESS measure is an attempt to assess the cross-validation capability of a model based on a single sample. For a particular model, for each observation, compute fitted values based on all the data other than that observation, then compute the squared difference between the response and the predicted value. These squared errors are summed up across the entire sample.
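The leave-one-out computation just described can be written as a direct (brute-force) loop; this sketch continues the hypothetical fit from the earlier examples and refits the model n times. The closed-form shortcut appears in equation (7) below.

    # PRESS by brute force: for each case i, refit the model without case i,
    # predict y_i from that fit, and accumulate the squared prediction errors.
    press <- 0
    for (i in seq_len(nrow(X))) {
      fit_i  <- lm(y ~ x1 + x2 + x3, data = X[-i, ])
      pred_i <- predict(fit_i, newdata = X[i, , drop = FALSE])
      press  <- press + (X$y[i] - pred_i)^2
    }
    press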

The resulting statistic, for model subset candidate X_C, is

    PRESS = \sum_{i=1}^{n} \left( y_i - x_{Ci}' \hat{\beta}_{C(i)} \right)^2    (6)

PRESS can be computed as

    PRESS = \sum_{i=1}^{n} \left( \frac{\hat{e}_{Ci}}{1 - h_{Cii}} \right)^2    (7)

where \hat{e}_{Ci} and h_{Cii} are, respectively, the residual and the leverage for the i-th case in the subset model. This index is relatively straightforward to compute in linear regression because of the above computational simplification, but this simplicity does not generalize to more complex models.

An Example: The Highway Data

This example employs the highway accident data from ALR. The variables (including the response, log(Rate)) are described in ALR3.

Forward Selection

Small values of AIC are preferred, so better candidate sets will have a smaller RSS and a smaller number of terms p_C.
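Two quick sketches to close the section, again on the hypothetical simulated data rather than the highway data: the first evaluates equation (7) from a single fit and should match the brute-force loop above; the second runs AIC-based forward selection with R's step() function, which is one common implementation of the procedure, not necessarily the code used in the original computational examples.

    # Equation (7): PRESS from the ordinary residuals and leverages of one fit;
    # this should equal the brute-force leave-one-out value computed earlier.
    sum((residuals(fit) / (1 - hatvalues(fit)))^2)

    # Forward selection guided by AIC: start from the intercept-only model and,
    # at each step, add the term that most reduces AIC.
    null_fit <- lm(y ~ 1, data = X)
    forward  <- step(null_fit, scope = ~ x1 + x2 + x3, direction = "forward")
    summary(forward)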

