


Predicting the Probability of Being a Smoker: A Probit Analysis

Department of Economics
Florida State University
Tallahassee, FL 32306-2180

Abstract

This paper explains the probability of being a smoker, based on 23 variables, using a probit analysis model. Specifically, age, gender, marital status, location, race, risky behavior, health insurance coverage, obtaining routine medical care and highest degree obtained are the basis of the construction of the model. They are hypothesized to be significant factors. 14 variables are individually significant at the 5% and 1% level. The regressors are jointly significant at both levels. However, the probit marginal effects demonstrate that race and possessing a high school degree can affect the probability of smoking by -10% to 34%.

1 INTRODUCTION

Tobacco use in the United States is a behavior that has been studied intensely due to its perceived benefits by users and extreme externalities. A problem of interest with tobacco use, primarily smoking, is the ability to predict who is a current smoker.

Annually, the United States Department of Health and Human Services (DHHS) conducts the Medical Expenditure Panel Survey (MEPS). MEPS is a comprehensive examination of individual health and medical expenditures. It contains approximately 1,099 variables and 33,691 observations. Smoking and tobacco use do not make up a majority of this data set. However, several variables, along with a binary indicator for individuals who are current smokers, should enable an econometrician to explore this aspect of human behavior. The goal of this analysis is to predict the probability of a person being a smoker by applying limited dependent variable regression analysis, namely a probit model, to MEPS, which is compiled by the Agency for Healthcare Research and Quality (AHRQ).

Problem of Interest

The ability to predict the behavior of certain individuals is of paramount importance to statisticians, econometricians and economists. Smoking is unique in that it is a form of behavior that is extremely restricted.

Almost every aspect of smoking is restricted by government, groups and individuals, via the use of cultural rules. Giving someone the power to foretell who is a smoker based on a particular set of characteristics would create enormous benefits in the form of reduced transaction costs and greater efficiency. Improvements in the provision of healthcare and insurance could be realized. Healthcare providers would possibly be able to make informed and optimal decisions as opposed to decisions in the face of uncertainty. Another interesting aspect, or result, would be lower transaction costs in the search for personal relationships. Individuals could lower their search costs as well as make informed decisions. Hence, the power of a model that predicts the likelihood that someone is a smoker would be [...].

Application of Limited Dependent Variable Methods

Chester Ittner Bliss (1934), a biologist, first introduced the notion of a probit. Bliss was concerned with the treatment of a particular type of data.

Specifically, Bliss (1934) sought to express the percentage of organisms killed by pesticides. Maddala (1983) notes that Goldberger (1964) developed the probit analysis model. In this setting, an indicator variable, $J_i$, is observed instead of $Y_i$, which is an unobserved (latent), qualitative dependent variable.

Recall that the econometrician faces a classical regression model in which the dependent variable is observed only qualitatively. In the context of classical regression, $Y_i$ is observable in the following model: $Y_i = \alpha_i + X_i\beta + \varepsilon_i$. This is not the case with the current problem of interest. Here, $Y_i$ is not observable. An indicator, $J_i$, is observed, where $J_i = 1$ if the individual currently smokes and $J_i = 0$ otherwise. The binary choice model, $Y_i = X_i\beta - \sigma\varepsilon_i$, is needed to analyze this problem.

Estimation of the binary choice model requires the establishment of the relationship between $J_i$ and $X_i$. Recall, $J_i = 1$ if and only if $Y_i > 0$. This implies that $X_i\beta - \sigma\varepsilon_i > 0$. Solving for $\varepsilon_i$ yields $\varepsilon_i < X_i\beta/\sigma$. Writing $\gamma = \beta\sigma^{-1}$ for this ratio and $F$ for the distribution function of $\varepsilon_i$,

$$P(J_i = 1) = P(\varepsilon_i < X_i\gamma) = F(X_i\gamma) \tag{1}$$

which implies that

$$P(J_i = 0) = 1 - F(X_i\gamma). \tag{2}$$
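A quick numerical check of equations (1) and (2) may help. The sketch below assumes, purely for illustration, that $F$ is the standard normal distribution function; the regressor row, coefficients and scale are hypothetical values, not quantities from the paper. It also shows that the probability depends on $\beta$ and $\sigma$ only through their ratio.

```python
from statistics import NormalDist

def prob_smoker(x, beta, sigma, cdf=NormalDist().cdf):
    """P(J = 1) = F(x'beta / sigma) for one observation, as in equation (1)."""
    index = sum(xj * bj for xj, bj in zip(x, beta)) / sigma
    return cdf(index)

# Hypothetical regressor row: constant, one dummy, one continuous variable.
x = [1.0, 1.0, 0.35]
beta = [-0.8, 0.4, 1.2]   # hypothetical coefficients, not estimates from the paper
sigma = 2.0               # hypothetical scale parameter

p1 = prob_smoker(x, beta, sigma)   # P(J = 1), equation (1)
p0 = 1.0 - p1                      # P(J = 0), equation (2)

# Multiplying beta and sigma by the same positive constant leaves the
# probability unchanged, because only the ratio beta / sigma enters.
c = 3.0
p1_rescaled = prob_smoker(x, [c * b for b in beta], c * sigma)

print(round(p1, 4), round(p0, 4), abs(p1 - p1_rescaled) < 1e-12)
```

Because the rescaled parameters give the same probability, the data cannot distinguish $(\beta, \sigma)$ from $(c\beta, c\sigma)$.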

The indicator, $J_i$, takes the value 1 or 0. Thus, the density function for $J_i$ is:

$$f(J_i) = [F(X_i\gamma)]^{J_i}\,[1 - F(X_i\gamma)]^{1 - J_i}. \tag{3}$$

The parameters $\beta$ and $\sigma$ are not separately identified; however, $\gamma = \beta\sigma^{-1}$ is identified. Using this fact, the log-likelihood function for the binary choice model is:

$$\ln L(\gamma) = \sum_{i=1}^{n} \left\{ J_i \ln F(X_i\gamma) + (1 - J_i)\ln[1 - F(X_i\gamma)] \right\}. \tag{4}$$

The dependent variable, $\mathit{smoke}_i$, takes on discrete values; it is an indicator for individuals who currently smoke. One can infer from equations (1) and (2) that a binary choice model allows for a clear statement of the relationship between the indicator, $\mathit{smoke}_i$, and the regressors. This does not occur in the context of the classical regression model. Hence, limited dependent variable methods must be used to predict the likelihood of an individual being a smoker.

2 DISCUSSION OF MODEL

A binary choice model, specifically a probit model, is employed to derive the probability that someone smokes. Using the data, a probit model is constructed:

$$
\begin{aligned}
P(\mathit{smoke}_i = 1) = \Phi(&\alpha + \gamma_{sex}\,\mathit{sex} + \gamma_{age}\,\mathit{age} + \gamma_{race1}\,\mathit{race1} + \gamma_{race2}\,\mathit{race2} + \gamma_{race3}\,\mathit{race3} \\
&+ \gamma_{race4}\,\mathit{race4} + \gamma_{race5}\,\mathit{race5} + \gamma_{married}\,\mathit{married} + \gamma_{ged}\,\mathit{ged} + \gamma_{hidipl}\,\mathit{hidipl} \\
&+ \gamma_{bach}\,\mathit{bach} + \gamma_{mastr}\,\mathit{mastr} + \gamma_{medcare}\,\mathit{medcare} + \gamma_{hrwg}\,\mathit{hrwg} + \gamma_{hourwk}\,\mathit{hourwk} \\
&+ \gamma_{inscov}\,\mathit{inscov} + \gamma_{risk1}\,\mathit{risk1} + \gamma_{risk2}\,\mathit{risk2} + \gamma_{risk3}\,\mathit{risk3} + \gamma_{risk4}\,\mathit{risk4} \\
&+ \gamma_{region1}\,\mathit{region1} + \gamma_{region2}\,\mathit{region2} + \gamma_{region3}\,\mathit{region3})
\end{aligned}
\tag{5}
$$

The following variables, which were extracted from the MEPS panel, are purported to have explanatory power on the decision to smoke: sex (gender), age, race, marital status, education (in terms of highest degree obtained), routine medical care, hourly wage, hours worked per week, health insurance coverage, willingness to take risks and location in the [...]. Descriptive statistics are provided in Table 1.

Regressors

A discussion of the regressors and their implication in an individual's choice to smoke is in order.

Gender and age are believed to play an ambiguous role. This is due to the fact that men and women of all ages smoke. The variables race1, race2, race3, race4 and race5 are dummy variables that indicate whether a person is White, Black, American Indian/Alaska Native, Asian or Native Hawaiian/Pacific Islander, respectively. These are intriguing variables in the sense that different races and cultures accept smoking, or at least perceive it, differently.

An indicator is included in the model for marital status. Marriage is thought to be an important factor when an individual decides to smoke. Spouses can influence their significant other, especially with respect to decisions regarding health. The variable for highest degree obtained was decomposed into five binary variables. A higher degree should be associated with an individual who is more health conscious. Whether an individual obtains routine medical care and whether she currently maintains health insurance coverage are important determinants. These determinants are represented by the variables medcare and inscov, respectively.

An individual's employment environment, work week schedule and income can obviously create undue stress. Hourly wage (hrwg) and hours worked per week (hourwk) are variables that attempt to capture these aspects, or byproducts of employment, and hopefully will explain an individual's decision to smoke.

In this panel of data, AHRQ includes a variable that describes an individual's willingness to take risks. If an individual is willing to take risks, then she should be willing, to some degree, to smoke or be open to smoking (this statement is based heavily on the assumption that smoking is a risk). Analogous to the reasoning for binary variables for race, there exist binary variables for the individual's location within the [...]. A more detailed examination of these variables is conducted in the Data [...].

The probit model is the special case of the binary choice model in which the error terms are independent and identically distributed with mean 0 and variance 1, $\varepsilon_i \sim \text{iid } N(0, 1)$. Regarding the binary choice model, this assumption about the error terms implies $F(X_i\gamma) = \Phi(X_i\gamma)$, where $\Phi(\cdot)$ is the standard normal distribution function.
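With $F = \Phi$, the binary choice log-likelihood in equation (4) is straightforward to evaluate numerically. The following is a minimal sketch using only the Python standard library; the four-observation sample and the parameter vectors are made up for illustration and have nothing to do with the MEPS data.

```python
from math import log
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal distribution function

def probit_loglik(gamma, X, J):
    """Sum over i of J_i*ln(Phi(x_i'gamma)) + (1 - J_i)*ln(1 - Phi(x_i'gamma))."""
    total = 0.0
    for x_i, j_i in zip(X, J):
        p = PHI(sum(x * g for x, g in zip(x_i, gamma)))
        # Assumes 0 < p < 1; extreme indices would need clipping before log().
        total += j_i * log(p) + (1 - j_i) * log(1.0 - p)
    return total

# Made-up sample: a constant and one dummy regressor, with a 0/1 outcome.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
J = [0, 1, 0, 0]

ll_zero = probit_loglik([0.0, 0.0], X, J)   # every probability equals 0.5
ll_alt = probit_loglik([-1.0, 1.0], X, J)   # a hypothetical alternative

print(round(ll_zero, 4), round(ll_alt, 4), ll_alt > ll_zero)
```

A maximum likelihood routine would search over the parameter vector for the value that makes this function as large as possible; the sketch only evaluates it at two candidate points.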

The log-likelihood function is now simply:

$$\ln L(\gamma) = \sum_{i=1}^{n} \left\{ J_i \ln \Phi(X_i\gamma) + (1 - J_i)\ln[1 - \Phi(X_i\gamma)] \right\} \tag{6}$$

where $J_i$ is the indicator $\mathit{smoke}_i$, $X_i$ are the regressors in equation (5) and $\gamma$ is the ratio $\beta\sigma^{-1}$.

3 RESULTS

The probit model, equation (5), was estimated. Results from this analysis can be found in Table 2. Coefficients, standard errors, t-statistics and p-values are reported for the twenty-four regressors. The value of the log-likelihood function is [...].

The model estimates can be used to test the joint significance of the regression and the individual significance of the estimates. The following is the null-alternative pair for testing the significance of each coefficient estimate:

$$H_0: \gamma_i = 0 \qquad H_A: \gamma_i \neq 0 \qquad \text{for } i = sex, age, \ldots, region3 \tag{7}$$

The significance of each regressor is tested at the $\alpha = 0.05$ and $\alpha = 0.01$ levels. Because the alternative hypothesis is two-sided, a two-tailed test is required. Based on the number of observations and number of regressors, $n = 7628$ and $k = 24$, the degrees of freedom are 7624. A level of significance of $\alpha = 0.05$ yields a two-tailed critical value of [...]. Based on this, 14 of the 24 regressors are individually significant.
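The two-tailed decision rule can be sketched as follows. With several thousand degrees of freedom the t distribution is practically indistinguishable from the standard normal, so this sketch assumes a normal approximation to the critical values. The t-statistics are hypothetical placeholders, not entries from Table 2. Note that with the stated $n$ and $k$, $n - k$ evaluates to 7604 rather than the 7624 reported; at this sample size the difference does not affect the approximation.

```python
from statistics import NormalDist

n, k = 7628, 24   # values stated in the text
dof = n - k       # evaluates to 7604

def two_tailed_critical(alpha):
    """Normal approximation to the two-tailed t critical value (large dof)."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

def is_significant(t_stat, alpha):
    """Reject H0: gamma_i = 0 when |t| exceeds the two-tailed critical value."""
    return abs(t_stat) > two_tailed_critical(alpha)

crit_05 = two_tailed_critical(0.05)
crit_01 = two_tailed_critical(0.01)

# Hypothetical t-statistics for illustration only, NOT taken from Table 2.
t_stats = {"var_a": 2.10, "var_b": -3.40, "var_c": 0.85}
for name, t in t_stats.items():
    print(name, is_significant(t, 0.05), is_significant(t, 0.01))
```

A regressor such as the hypothetical var_a, whose statistic falls between the two critical values, is significant at the 5% level but not at the 1% level.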

The individually significant regressors at this level are sex, race1, race2, race4, married, ged, hidipl, medcare, hrwg, hourwk, inscov, region1, region2 and region3. Choosing $\alpha = 0.01$ yields a two-tailed critical value of [...]. At this level, all of the variables mentioned are still significant with the exception of [...].

To test the joint significance of the regressors, the log-likelihood ratio is employed. The null-alternative hypothesis pair is:

$$H_0: \gamma_{sex} = \gamma_{age} = \cdots = \gamma_{region3} = 0 \qquad H_A: \text{at least one } \gamma_i \neq 0. \tag{8}$$

Essentially, the null hypothesis states that all of the regressors have no explanatory power in the variation of the dependent variable, $\mathit{smoke}_i$. The test statistic is the log-likelihood ratio,

$$-2\left[\ln L(\tilde{\gamma}) - \ln L(\hat{\gamma})\right] \overset{A}{\sim} \chi^2_{k-1}, \tag{9}$$

where $\ln L(\tilde{\gamma})$ is the value of the constrained log-likelihood function and $\ln L(\hat{\gamma})$ is the value of the unconstrained log-likelihood function, which have respective values [...] and [...]. This yields a test statistic of $\lambda_t =$ [...], which is compared with the $\chi^2_{k-1}$ distribution, where $k - 1 = 23$. The critical value for $\chi^2_{23}$ at $\alpha = 0.05$ is [...] ($\lambda_1$) and at $\alpha = 0.01$ it is [...] ($\lambda_2$). Thus, since $\lambda_t > \lambda_1$ and $\lambda_t > \lambda_2$, the null hypothesis is rejected.
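The mechanics of the likelihood ratio test can be sketched in a few lines. The Python standard library has no chi-square quantile function, so this sketch assumes the Wilson-Hilferty approximation for the critical values, which is close but not exact. The two log-likelihood values are hypothetical placeholders chosen only to make the example run; they are not the values estimated in the paper.

```python
from statistics import NormalDist

def chi2_critical_wh(df, alpha):
    """Wilson-Hilferty approximation to the upper-alpha chi-square critical value."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    h = 2.0 / (9.0 * df)
    return df * (1.0 - h + z * h ** 0.5) ** 3

def lr_statistic(loglik_constrained, loglik_unconstrained):
    """-2 * [ln L(constrained) - ln L(unconstrained)], as in equation (9)."""
    return -2.0 * (loglik_constrained - loglik_unconstrained)

# Hypothetical placeholder log-likelihoods, NOT the values from the paper.
lnL_constrained = -4300.0
lnL_unconstrained = -4100.0
df = 23  # k - 1 restrictions under H0

lam = lr_statistic(lnL_constrained, lnL_unconstrained)
for alpha in (0.05, 0.01):
    crit = chi2_critical_wh(df, alpha)
    print(alpha, round(crit, 2), lam > crit)
```

The unconstrained log-likelihood can never be smaller than the constrained one, so the statistic is non-negative, and the null is rejected when it exceeds the critical value.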

Rejecting the null hypothesis implies that the regressors are jointly significant at the 5% and 1% levels.

Marginal effects for the probit model were then calculated. Results can be found in Table 3. Recall that marginal effects, in the context of the probit model, are the coefficient vector scaled by the standard normal density, $\phi(\cdot)$. That is,

$$\frac{\partial P(\mathit{smoke}_i = 1)}{\partial X_i^T} = \frac{\partial \Phi(X_i\gamma)}{\partial X_i^T} = \phi(X_i\gamma)\,\gamma \tag{10}$$

Because $\phi(X_i\gamma)$ depends on $X_i$, the scale factor varies across observations $i$. The coefficients are scaled differently, but still proportionately. The marginal effects allow for a more appropriate analysis when determining the specific effect of a one unit change in $X_i$ on the probability that $\mathit{smoke}_i = 1$.

Table 3 indicates that race2, race4, ged and hidipl have a -10%, -15%, 31% and 19% effect on $\mathit{smoke}_i$. Specifically, if the value of race4 changes from 0 to 1, then this implies that there is a 15% decrease in the probability that the individual is a smoker. Analogously, if race2 were to change in value from 0 to 1, there would be a 10% decrease in the probability of a person being a smoker. On the other hand, possessing a general equivalence diploma (ged) or a high school diploma (hidipl) causes the probability of a person being a smoker to increase by 31% and 19%, respectively.
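Equation (10) can be illustrated with a short sketch. The regressor row and coefficients below are hypothetical, not the estimates behind Table 3. The sketch also computes, as an additional assumption not taken from the paper, the discrete change in probability when a dummy variable switches from 0 to 1, which is a common alternative to the derivative for binary regressors.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def index(x, gamma):
    """Linear index x'gamma for one observation."""
    return sum(xj * gj for xj, gj in zip(x, gamma))

def marginal_effects(x, gamma):
    """phi(x'gamma) * gamma, the derivative in equation (10)."""
    scale = STD_NORMAL.pdf(index(x, gamma))
    return [scale * g for g in gamma]

def discrete_change(x, gamma, j):
    """Phi(x'gamma) with regressor j set to 1, minus the same with it set to 0."""
    x1 = list(x)
    x1[j] = 1.0
    x0 = list(x)
    x0[j] = 0.0
    return STD_NORMAL.cdf(index(x1, gamma)) - STD_NORMAL.cdf(index(x0, gamma))

# Hypothetical row (constant, dummy, continuous) and hypothetical coefficients.
x = [1.0, 0.0, 0.4]
gamma = [-0.5, -0.6, 0.8]

me = marginal_effects(x, gamma)
dc = discrete_change(x, gamma, 1)

print([round(v, 4) for v in me], round(dc, 4))
```

Every coefficient is multiplied by the same density value at a given observation, so the ratios between marginal effects equal the ratios between coefficients, which is the proportionality noted above.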

