
Goodness of Fit in Logistic Regression

David M. Rocke

April 13, 2021

Goodness of Fit for Logistic Regression

Collection of Binomial Random Variables

Suppose that we have $k$ samples of $n$ 0/1 variables, as with a binomial $\mathrm{Bin}(n, p)$, and suppose that $\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_k$ are the sample proportions. We know that

$E(\hat{p}) = p$
$V(\hat{p}) = p(1 - p)/n$

If $\bar{p} = \mathrm{Ave}(\hat{p}_i)$, then if the distribution really is binomial, the sample variance $s^2$ of the $\hat{p}_i$ should be close to $\bar{p}(1 - \bar{p})/n$. If it is not, then there is something wrong. The sample variance can be as small as 0 if all the $\hat{p}_i$ are the same, and is largest if some of the $\hat{p}_i$ are 0 and the remainder are 1.

For example, suppose that $k = 20$ and $n = 50$. If $p = 0.1$, then $\bar{p} \approx 0.1$ and $s^2 \approx p(1 - p)/n = (0.1)(0.9)/50 = 0.0018$. If 5 of the sample proportions are 1 and 45 are 0, then $\bar{p} = 0.1$ but $s^2 = [(5)(0.9)^2 + (45)(0.1)^2]/39$, which is a factor of 50 too large. If the variance is too big, then either the distribution is not binomial, or we need more predictors (we have only one in this example, the intercept).
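The variance comparison above can be tried numerically. The following R sketch is an illustration only and is not taken from the slides: it assumes a hypothetical extreme case with k = 20 proportions (2 equal to 1 and 18 equal to 0), and a simulated binomial case with an arbitrary seed. All variable names are made up for the example.

```r
# Hypothetical illustration: k = 20 groups, n = 50 trials per group.
k <- 20
n <- 50

# Extreme case (an assumption for illustration): 2 proportions are 1, 18 are 0.
p_hat <- c(rep(1, 2), rep(0, 18))
p_bar <- mean(p_hat)                      # average proportion
s2 <- var(p_hat)                          # sample variance of the proportions
binom_var <- p_bar * (1 - p_bar) / n      # variance expected under Bin(n, p)
ratio_extreme <- s2 / binom_var           # far above 1: much too variable

# Binomial case: proportions simulated from Bin(n, 0.1).
set.seed(1)
p_sim <- rbinom(k, size = n, prob = 0.1) / n
ratio_sim <- var(p_sim) / (mean(p_sim) * (1 - mean(p_sim)) / n)
```

A ratio near 1 is what the binomial model predicts; a ratio far above 1, as in the extreme case, points to a non-binomial distribution or to missing predictors.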


The deviance is

$D = 2 \sum_i \left[ y_i \ln(y_i/\hat{\mu}_i) + (n - y_i) \ln\big((n - y_i)/(n - \hat{\mu}_i)\big) \right]$

If we have $k$ groups from a single binomial distribution, then $\hat{\mu}_i = n\bar{p}$. The expression

$y_i \ln\big(y_i/(n\bar{p})\big) + (n - y_i) \ln\big((n - y_i)/(n - n\bar{p})\big)$

is like

$(\hat{p}_i - \bar{p})^2 = (y_i - n\bar{p})^2/n^2$

in that both get larger as the difference between the observed and expected gets larger.

Residual Deviance

Suppose we have $k$ groups and $n$ observations. The (residual) deviance of a model is the difference between minus twice the log likelihood of that model and that of the saturated model that fits each group with its own proportion. So we could consider the deviance of the given model as a likelihood ratio test of whether the given model is satisfactory; that is, whether it can be shown that adding more variables helps.

If our model has $q$ predictors (counting categorical variables as one less than the number of levels) and an intercept, then the difference from the saturated model is $k - q - 1$ parameters, and we could compare the deviance to a $\chi^2_{k-q-1}$, which has mean $k - q - 1$. If the deviance is too big, then something is wrong:

Omitted predictors?
Not binomial?

> summary( )
glm(formula = ~ smoking + obesity + snoring, family = binomial)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                              4e-10 **
smokingYes                                     *
snoringYes                                     *
(Dispersion parameter for binomial family taken to be 1)
Null deviance:      on 7 degrees of freedom
Residual deviance:  on 4 degrees of freedom

The residual deviance is not too large, so we don't appear to have a problem. The lower-tail probability $P(\chi^2_4 < D)$ for the observed deviance $D$ indicates that it is not too small either.

Deviance for Grouped Data

When data are entered as groups with disease/not disease, then R uses the definition of deviance comparing it to a model saturated at the group level. In the hypertension data, there are 8 groups and the deviance is relative to an 8 df model like Smoking*Obesity*Snoring.
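The deviance formula and the chi-square comparison can be checked numerically. The R sketch below is an illustration under stated assumptions, not code from the slides: it uses the grouped Evans counts from the aggregated table that appears later in these notes (CHD cases and group sizes), lets the group size vary by group (called `m` here, an assumed name), and compares the hand-computed deviance with the value reported by `glm`.

```r
# Grouped counts copied from the aggregated Evans table in these notes:
# y = CHD cases, m = group size (column V3 of that table).
grp <- data.frame(
  CAT = c(0, 1, 0, 1, 0, 1, 0, 1),
  SMK = c(0, 0, 1, 1, 0, 0, 1, 1),
  HPT = c(0, 0, 0, 0, 1, 1, 1, 1),
  y   = c(5, 1, 15, 7, 4, 7, 20, 12),
  m   = c(122, 6, 208, 18, 55, 39, 102, 59)
)

fit <- glm(cbind(y, m - y) ~ CAT + SMK + HPT, family = binomial, data = grp)

# Deviance from the formula, with group size m in place of a common n:
# 2 * sum[ y log(y/mu) + (m - y) log((m - y)/(m - mu)) ]
mu <- grp$m * fitted(fit)                 # expected cases per group
dev_by_hand <- 2 * sum(grp$y * log(grp$y / mu) +
                       (grp$m - grp$y) * log((grp$m - grp$y) / (grp$m - mu)))

# Degrees of freedom k - q - 1 and the upper-tail goodness-of-fit p-value.
k <- nrow(grp)                            # 8 groups
q <- 3                                    # 3 predictors plus an intercept
df_gof <- k - q - 1
p_gof <- pchisq(dev_by_hand, df_gof, lower.tail = FALSE)
```

The hand computation works here because every group has at least one case and one non-case; a group with `y = 0` or `y = m` would need the convention that `0 * log(0)` is 0.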

Deviance for Ungrouped Data

If the data are given in observation form with 0/1 response, then R uses a definition of deviance relative to an observation-saturated model where each response is perfectly fit. This means that the deviance is just minus twice the log likelihood. We can still use the deviance test when the analysis is ungrouped.

> <- glm(CHD~CAT+SMK+HPT,family=binomial,evans)
> <- glm(CHD~CAT*SMK*HPT,family=binomial,evans)
> anova( , ,test="Chisq")
Analysis of Deviance Table
Model 1: CHD ~ CAT + SMK + HPT
Model 2: CHD ~ CAT * SMK * HPT
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       605
2       601             4                  .
> summary( )
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                             < 2e-16 **
CAT                                             **
SMK                                             *
HPT                                             *
Null deviance:      on 608 degrees of freedom
Residual deviance:  on 605 degrees of freedom

This is a test of whether we should add all of the interactions. The result is not significant, as a test of goodness of fit. But (see below) there can still be additional predictors that are important, in this case both by significance test and AIC.

> add1( , ,test="Chisq")
Single term additions
Model:
CHD ~ CAT + SMK + HPT
        Df Deviance AIC LRT Pr(>Chi)
<none>
CAT:SMK  1
CAT:HPT  1                        **
SMK:HPT  1

Goodness of Fit for Uncategorized Data

The procedure above works only if the number of groups in which the predictors are the same is small compared to the number of observations. A commonly used procedure if there are continuous predictors is the Hosmer-Lemeshow goodness of fit test. It works poorly if there are too many ties and has low statistical power, but may be useful when almost all the observations have distinct predictor values.

Order the data by the predicted values and cut into classes of equal size, say 10. Compute the observed and expected cases in each class, then do a $\chi^2$ test as usual from $\sum (O - E)^2/E$. This can be done with hoslem.test() from the ResourceSelection package in R. It is very commonly used, but has low power, and interpretation in case of rejection can be difficult.

> library(ResourceSelection)
> <- glm(CHD~CAT+CHL+SMK+HPT,family=binomial,evans)
> hoslem.test( $y,fitted( ))
Hosmer and Lemeshow goodness of fit (GOF) test
data: $y, fitted( )
X-squared = , df = 8, p-value =

Note that the model omits interactions we know are important, but still passes the HL test.

Model Checking and Diagnostics: Linear Regression

In linear regression, the major assumptions in order of importance:

Linearity: The mean of $y$ is a linear (in the coefficients) function of the predictors.
Independence: Different observations are statistically independent.
Constant Variance: The residual variance is the same for each observation.
Normality: The error distribution is normal.
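The ordering-and-binning procedure described above can be sketched in base R. This is an illustration under assumptions, not the ResourceSelection implementation and not code from the slides: the data are simulated with an arbitrary seed, the single continuous predictor `x` and its coefficients are hypothetical, and the statistic sums $(O - E)^2/E$ over cases and non-cases in g = 10 classes with g - 2 degrees of freedom.

```r
# Simulated data (hypothetical): one continuous predictor, 0/1 response.
set.seed(42)
N <- 500
x <- rnorm(N)
y <- rbinom(N, size = 1, prob = plogis(-1 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)

# Order by predicted value and cut into g classes of equal size.
g <- 10
p_hat <- fitted(fit)
cls <- ceiling(rank(p_hat, ties.method = "first") * g / N)

# Observed and expected counts of cases and non-cases per class.
n_g  <- tapply(y, cls, length)
obs1 <- tapply(y, cls, sum)
exp1 <- tapply(p_hat, cls, sum)
obs0 <- n_g - obs1
exp0 <- n_g - exp1

# Chi-square statistic from (O - E)^2 / E, compared with chi-square on g - 2 df.
hl_stat <- sum((obs1 - exp1)^2 / exp1 + (obs0 - exp0)^2 / exp0)
hl_df <- g - 2
hl_p <- pchisq(hl_stat, hl_df, lower.tail = FALSE)
```

With g = 10 the reference distribution has 8 degrees of freedom, which matches the df = 8 shown in the test output above.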

Diagnostics: Linear Regression

Plot residuals vs. fitted values.
Plot residuals vs. predictors.
Look for influential observations with dffits and dfbeta. These are observations that have a large effect on the fit.

We can use many of these techniques in logistic regression.

Model Checking and Diagnostics: Logistic Regression

In logistic regression, the major assumptions in order of importance:

Linearity: The logit of the mean of $y$ is a linear (in the coefficients) function of the predictors.
Independence: Different observations are statistically independent.
Variance Function: The variance of an observation with mean $p$ is $p(1 - p)$.
Binomial: The error distribution is binomial.

Diagnostics for Grouped Logistic Regression

Deviance test for goodness of fit.
Plot deviance residuals vs. fitted values. In this case, there are as many residuals and fitted values as there are distinct categories.
Plot dffits vs. fitted values. This is the scaled change in the predicted value of point $i$ when point $i$ itself is removed from the fit. This has to be the whole category in this case.

Note that this works well automatically only when the data are given to R in aggregated form.

> summary( )
Call:
glm(formula = CHD ~ CAT + SMK + HPT, family = binomial, data = evans)
Deviance Residuals:
Min 1Q Median 3Q Max
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                             < 2e-16 **
CAT                                             **
SMK                                             *
HPT                                             *
(Dispersion parameter for binomial family taken to be 1)
Null deviance:      on 608 degrees of freedom
Residual deviance:  on 605 degrees of freedom
AIC:
Number of Fisher Scoring iterations: 5

> <- aggregate(cbind(CHD,1-CHD,1)~CAT+SMK+HPT,FUN=sum,data=evans)
> print( )
  CAT SMK HPT CHD  V2  V3
1   0   0   0   5 117 122
2   1   0   0   1   5   6
3   0   1   0  15 193 208
4   1   1   0   7  11  18
5   0   0   1   4  51  55
6   1   0   1   7  32  39
7   0   1   1  20  82 102
8   1   1   1  12  47  59
> res <- as.matrix( )[,4:5]
> <- glm(res~CAT+SMK+HPT,family=binomial,data = )
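The relationship between the ungrouped and aggregated fits can be reproduced from the table above alone. The R sketch below is an illustration, not the author's code: it rebuilds a one-row-per-subject data frame from the printed counts (the names `grp`, `ungrp`, `y`, and `m` are assumptions), fits the same model both ways, and extracts the per-group diagnostics that are only meaningful in aggregated form.

```r
# Grouped counts copied from the aggregated table: y = CHD cases, m = group size (V3).
grp <- data.frame(
  CAT = c(0, 1, 0, 1, 0, 1, 0, 1),
  SMK = c(0, 0, 1, 1, 0, 0, 1, 1),
  HPT = c(0, 0, 0, 0, 1, 1, 1, 1),
  y   = c(5, 1, 15, 7, 4, 7, 20, 12),
  m   = c(122, 6, 208, 18, 55, 39, 102, 59)
)

# Expand to one 0/1 row per subject (ungrouped form).
ungrp <- data.frame(
  CAT = rep(grp$CAT, grp$m),
  SMK = rep(grp$SMK, grp$m),
  HPT = rep(grp$HPT, grp$m),
  CHD = unlist(Map(function(y, m) rep(c(1, 0), c(y, m - y)), grp$y, grp$m))
)

fit_u <- glm(CHD ~ CAT + SMK + HPT, family = binomial, data = ungrp)
fit_g <- glm(cbind(y, m - y) ~ CAT + SMK + HPT, family = binomial, data = grp)

# Same coefficients, different deviance baselines and degrees of freedom.
coef_table <- cbind(ungrouped = coef(fit_u), grouped = coef(fit_g))
df_pair <- c(ungrouped = df.residual(fit_u), grouped = df.residual(fit_g))

# Per-group diagnostics from the aggregated fit: one value per category.
dev_res <- residuals(fit_g, type = "deviance")
dff <- dffits(fit_g)
```

Plotting `dev_res` and `dff` against `fitted(fit_g)` gives the two diagnostic plots listed above, with one point per category rather than one per subject.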

> summary( )
Call:
glm(formula = res ~ CAT + SMK + HPT, family = binomial, data = )
Deviance Residuals:
1 2 3 4 5 6 7 8
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                             < 2e-16 **
CAT                                             **
SMK                                             *
HPT                                             *
(Dispersion parameter for binomial family taken to be 1)
Null deviance:      on 7 degrees of freedom
Residual deviance:  on 4 degrees of freedom
AIC:
Number of Fisher Scoring iterations: 4

> summary( )
Call:
glm(formula = CHD ~ CAT + SMK + HPT, family = binomial, data = evans)
Deviance Residuals:
Min 1Q Median 3Q Max
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                             < 2e-16 **
CAT                                             **
SMK                                             *
HPT                                             *
(Dispersion parameter for binomial family taken to be 1)
Null deviance:      on 608 degrees of freedom
Residual deviance:  on 605 degrees of freedom
AIC:
Number of Fisher Scoring iterations: 5

The goodness of fit test is to compare the residual deviance of the aggregated fit with a $\chi^2_4$.

> pchisq(deviance( ),4,lower=F)
[1]

We know that the CAT:HPT interaction is significant, which is somewhat indicated by the relatively high value of the residual deviance.

> summary(glm(res~CAT+SMK+HPT+CAT:HPT,family=binomial,data= ))
Call:
glm(formula = res ~ CAT + SMK + HPT + CAT:HPT, family = binomial, data = )
Deviance Residuals:
1 2 3 4 5 6 7 8
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                             < 2e-16 **
CAT                                             **
SMK                                             *
HPT                                             **
CAT:HPT                                         **
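The comparison of the main-effects fit with the fit that adds CAT:HPT can be sketched as follows. This R code is an illustration rather than the author's script: the grouped counts come from the aggregated table in these notes, and the object names (`grp`, `fit_main`, `fit_int`) are assumptions introduced for the example.

```r
# Grouped counts copied from the aggregated table: y = CHD cases, m = group size (V3).
grp <- data.frame(
  CAT = c(0, 1, 0, 1, 0, 1, 0, 1),
  SMK = c(0, 0, 1, 1, 0, 0, 1, 1),
  HPT = c(0, 0, 0, 0, 1, 1, 1, 1),
  y   = c(5, 1, 15, 7, 4, 7, 20, 12),
  m   = c(122, 6, 208, 18, 55, 39, 102, 59)
)

fit_main <- glm(cbind(y, m - y) ~ CAT + SMK + HPT, family = binomial, data = grp)
fit_int  <- update(fit_main, . ~ . + CAT:HPT)

# Goodness of fit: residual deviance against chi-square on the residual df.
p_main <- pchisq(deviance(fit_main), df.residual(fit_main), lower.tail = FALSE)
p_int  <- pchisq(deviance(fit_int),  df.residual(fit_int),  lower.tail = FALSE)

# Likelihood ratio test for the added interaction (1 df).
lrt <- anova(fit_main, fit_int, test = "Chisq")
```

The drop in deviance reported in `lrt` is the same whether the models are fit to grouped or ungrouped data, since the saturated-model term cancels in the difference.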

