Transcription of Bayesian Causal Inference: A Tutorial
Bayesian Causal Inference: A Tutorial
Fan Li
Department of Statistical Science, Duke University
June 2, 2019
Bayesian Causal Inference Workshop, Ohio State University

Causation
- Relevant questions about causation:
  - the philosophical meaningfulness of the notion of causation
  - deducing the causes of a given effect
  - understanding the details of a causal mechanism
- Here we focus on measuring the effects of causes, where statistics arguably can contribute most
- Several statistical frameworks:
  - graphical models (S Wright, J Pearl)
  - structural equations (S Wright, T Haavelmo, J Heckman)
  - potential outcomes (J Neyman, DB Rubin)

Potential Outcome Framework
- The potential outcome framework: the most widely used framework across many disciplines
- Brief history:
  - Randomized experiments: Fisher (1918, 1925), Neyman (1923)
  - Formulation (assignment mechanism and Bayesian model): Rubin (1974, 1977, 1978)
  - Observational studies and propensity scores: Rosenbaum and Rubin (1983)
  - Heterogeneous treatment effects and machine learning: Athey and Imbens (2015), many others

Potential Outcome Framework: Key Components
- No causation without manipulation: a cause must be (hypothetically) manipulable, e.g., an intervention or a treatment
- Goal: estimate the effects of causes, not the causes of effects
- Three integral components (Rubin, 1978):
  - potential outcomes
    corresponding to the various levels of a treatment
  - assignment mechanisms
  - a (Bayesian) model for the science (the potential outcomes and covariates)
- Causal effects: a comparison of the potential outcomes under treatment and control for the same set of units

Basic Setup
- Data: a random sample of $N$ units from a target population
- A treatment with two levels: $w = 0, 1$
- For each unit $i$, we observe the (binary) treatment status $W_i$, a vector of covariates $X_i$, and an outcome $Y_i^{obs}$
- For each unit $i$, there are two potential outcomes $Y_i(0), Y_i(1)$; this notation implicitly invokes the Stable Unit Treatment Value Assumption (SUTVA)
- Bold font denotes matrices or vectors consisting of the corresponding variables for the $N$ units, for example $\mathbf{X} = (X_1, \ldots, X_N)$, $\mathbf{W} = (W_1, \ldots, W_N)$

Causal Estimands (Parameters of Interest)
- Population average treatment effect (PATE): $\tau^{PATE} = E[Y_i(1) - Y_i(0)]$
- Sample average treatment effect (SATE): $\tau^{SATE} = \frac{1}{N} \sum_{i=1}^{N} [Y_i(1) - Y_i(0)]$
- Average treatment effect for the treated (ATT): $\tau^{ATT} = E[Y_i(1) - Y_i(0) \mid W_i = 1]$
- Conditional average treatment effect (CATE): $\tau(x) = E[Y_i(1) - Y_i(0) \mid X_i = x]$

The Fundamental Problem of Causal Inference (Holland, 1986)
- For each unit, we can observe at most one of the two potential outcomes; the other is missing (counterfactual)
- Potential outcomes and assignments jointly determine the values of the observed and missing outcomes: $Y_i^{obs} \equiv Y_i(W_i) = W_i \cdot Y_i(1) + (1 - W_i) \cdot Y_i(0)$
- Causal inference under the potential outcome framework is essentially a missing data problem
- To identify causal effects from observed data, one must make additional (structural and/or stochastic) assumptions

Perfect Doctor
  Potential outcomes | Observed data
  Y(0)  Y(1)         | W  Y(0)  Y(1)
  13    14           | 1  ?     14
  6     0            | 0  6     ?
  4     1            | 0  4     ?
  5     2            | 0  5     ?
  6     3            | 0  6     ?
  6     1            | 0  6     ?
  8     10           | 1  ?     10
  8     9            | 1  ?     9

Assignment Mechanism
- A key identifying assumption concerns the assignment mechanism: the probabilistic rule that decides which unit gets assigned to which treatment, $\Pr(W_i = 1 \mid X_i, Y_i(0), Y_i(1))$
- In randomized experiments, the assignment mechanism is usually known and controlled by the investigators
- In observational studies, the assignment mechanism is usually unknown and uncontrolled

Positivity (or overlap)
- Assumption 1, positivity (or overlap): $0 < \Pr(W_i = 1 \mid X_i, Y_i(0), Y_i(1)) < 1$ for all $i$
- This requires that, in large samples, for all possible values of the covariates there are both treated and control units
- ... from observed data

Ignorability (or unconfoundedness)
- Assumption 2, ignorability (or unconfoundedness): $\Pr(W_i = 1 \mid X_i, Y_i(0), Y_i(1)) = \Pr(W_i = 1 \mid X_i)$
- Often also written as $\{Y_i(0), Y_i(1)\} \perp W_i \mid X_i$
- Assumes that, within subpopulations defined by values of the observed covariates, the treatment assignment is random
- Rules out unmeasured confounders
- $e_i(x) \equiv \Pr(W_i = 1 \mid X_i = x)$ is called the propensity score (Rosenbaum and Rubin, 1983)
- Unconfoundedness and positivity jointly define "strong ignorability"

Identifying Causal Effects under Unconfoundedness
- Under unconfoundedness, for $w = 0, 1$: $\Pr(Y(w) \mid X) = \Pr(Y^{obs} \mid X, W = w)$
- Thus the ATE can be estimated from observed data:
  $\tau^{PATE} = E_X[E(Y^{obs} \mid X = x, W = 1) - E(Y^{obs} \mid X = x, W = 0)]$
- Randomized experiments satisfy unconfoundedness
- Unconfoundedness is untestable and likely violated to some degree, but it is invoked in most observational studies
- Sensitivity to violations of unconfoundedness is routinely checked (Cornfield, 1959; Rosenbaum and Rubin, 1983b)

Classification of assignment mechanisms
- Randomized experiments:
  - strong ignorability automatically holds
  - good balance is (in large samples) guaranteed
- Ignorable (or unconfounded) observational studies:
  - strong ignorability is assumed, conditional on covariates
  - balance needs to be achieved
- Quasi-experiments: looking for "natural" experiments (under assumptions)

Classification of ignorable assignment mechanisms
We will focus on ignorable assignment mechanisms and extensions:
- Standard ignorable assignment mechanism: one-time treatment, conditional on covariates
- Sequentially ignorable: time-varying treatment
- Latent ignorable: post-treatment variables, principal stratification
- Locally ignorable:
  regression discontinuity
- Weakly ignorable: multi-valued and continuous treatment
- Interference: when SUTVA is violated

Methods and Modes of Inference
- Two overarching methods:
  - Imputation: impute the missing potential outcomes (model-based or matching-based)
  - Weighting: weight the observed data (often by a function of the propensity scores) to represent a target population
- Three modes of inference:
  - Frequentist: imputation, weighting, motivated by consistency, asymptotic normality, (semiparametric) efficiency, etc.
  - Bayesian: modeling and imputing missing potential outcomes based on their posterior distributions
  - Fisherian randomization: combine randomization tests with Bayesian methods; unique to randomized experiments

Bayesian Inference of Causal Effects
- Four quantities are associated with each sampled unit: $Y_i(0), Y_i(1), W_i, X_i$
- Three are observed, $W_i$, $Y_i^{obs} = Y_i(W_i)$, and $X_i$; one is missing, $Y_i^{mis} = Y_i(1 - W_i)$
- Given $W_i$, there is a one-to-one map between $(Y_i^{obs}, Y_i^{mis})$ and $(Y_i(0), Y_i(1))$: $Y_i^{obs} = Y_i(1) W_i + Y_i(0)(1 - W_i)$
- Thus causal estimands $\tau = \tau(\mathbf{Y}(0), \mathbf{Y}(1))$ can be represented as functions $\tau = \tau(\mathbf{Y}^{obs}, \mathbf{Y}^{mis}, \mathbf{W})$

General Structure (I)
Rubin, 1978, Ann.
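- Bayesian inference considers the observed values of the four quantities to be realizations of random variables and the unobserved values to be unobserved random variables
- $\Pr(\mathbf{Y}(0), \mathbf{Y}(1), \mathbf{W}, \mathbf{X})$: joint probability density function of these random variables for all units
- Assuming unit exchangeability, there exists an unknown parameter vector $\theta$ with a prior distribution $p(\theta)$ such that (de Finetti, 1963):
  $\Pr(\mathbf{Y}(0), \mathbf{Y}(1), \mathbf{W}, \mathbf{X}) = \int \prod_i \Pr(Y_i(0), Y_i(1), W_i, X_i \mid \theta) \, p(\theta) \, d\theta$

General Structure (II)
- Bayesian inference of the estimand $\tau = \tau(\mathbf{Y}^{obs}, \mathbf{Y}^{mis}, \mathbf{W})$: obtain the joint posterior (predictive) distribution of $(\mathbf{Y}^{mis}, \theta)$, and thus of $\mathbf{Y}^{mis}$, and thus of $\tau$
- Factorization of the joint distribution:
  $\Pr(Y_i(0), Y_i(1), W_i, X_i \mid \theta) = \Pr(W_i \mid Y_i(0), Y_i(1), X_i, \theta_W) \cdot \Pr(Y_i(0), Y_i(1) \mid X_i, \theta_Y) \cdot \Pr(X_i \mid \theta_X)$
- Usually we do not want to model $\Pr(X_i)$; rather, we condition on $\mathbf{X}$
- We make two assumptions:
  - a priori distinct and independent parameters $\theta_W$ and $\theta_Y$
  - ignorable assignment mechanism: $\Pr(W_i \mid Y_i(0), Y_i(1), X_i) = \Pr(W_i \mid X_i)$

General Structure (III)
- Under the two assumptions, the joint posterior distribution of $(\mathbf{Y}^{mis}, \theta_Y)$ is
  $\Pr(\mathbf{Y}^{mis}, \theta_Y \mid \mathbf{Y}^{obs}, \mathbf{W}, \mathbf{X}) \propto p(\theta_Y) p(\theta_W) p(\theta_X) \prod_{i=1}^{N} \Pr(W_i \mid Y_i(0), Y_i(1), X_i, \theta_W) \Pr(Y_i(0), Y_i(1) \mid X_i, \theta_Y) \Pr(X_i \mid \theta_X)$
  $\propto p(\theta_Y) \prod_{i=1}^{N} \Pr(Y_i(0), Y_i(1) \mid X_i, \theta_Y)$
- Above, the terms $\Pr(W_i \mid X_i, \theta_W)$ and $\Pr(X_i \mid \theta_X)$ drop out of the likelihood: they are not informative about $\theta_Y$ or $\mathbf{Y}^{mis}$
- Need to specify the model for the science: $\Pr(Y_i(0), Y_i(1) \mid X_i)$
- Two different strategies to simulate $\mathbf{Y}^{mis}$

Strategy 1: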
Data Augmentation (Gibbs sampling)
- Iteratively simulate $\mathbf{Y}^{mis}$ and $\theta$ from $\Pr(\mathbf{Y}^{mis} \mid \mathbf{Y}^{obs}, \mathbf{W}, \mathbf{X}, \theta)$ and $\Pr(\theta \mid \mathbf{Y}^{mis}, \mathbf{Y}^{obs}, \mathbf{W}, \mathbf{X})$
- Posterior predictive distribution of $\mathbf{Y}^{mis}$:
  $\Pr(\mathbf{Y}^{mis} \mid \mathbf{Y}^{obs}, \mathbf{W}, \mathbf{X}, \theta) \propto \prod_{i: W_i = 1} \Pr(Y_i(0) \mid Y_i(1), X_i, \theta_Y) \prod_{i: W_i = 0} \Pr(Y_i(1) \mid Y_i(0), X_i, \theta_Y)$
- Impute the missing potential outcomes:
  - For treated units, impute the missing $Y_i(0)$ from $\Pr(Y_i(0) \mid Y_i(1), X_i, \theta_{Y|X})$
  - For control units, impute the missing $Y_i(1)$ from $\Pr(Y_i(1) \mid Y_i(0), X_i, \theta_{Y|X})$

Strategy 1: Data Augmentation (Gibbs sampling)
- Imputation crucially depends on the model for the science: $\Pr(Y_i(1), Y_i(0) \mid X_i)$
- But $Y_i(1)$ and $Y_i(0)$ are never jointly observed, so the data carry no information at all about the association between $Y_i(1)$ and $Y_i(0)$. For the association, posterior = prior, and the posterior of the estimand will be sensitive to its prior

Strategy 1: Problems
- Proposed by Rubin (1978), widely used
- Problem: the observed data contain information on the marginal distributions of the potential outcomes, but little or no information on their association
- No clear separation of identified and non-identified parameters
- What does identifiability mean?
  - Frequentist: the parameter can be expressed as a function of the observed data distribution
  - Dogmatic Bayesian: with a proper prior, all parameters are identifiable (Lindley, 1972)
  - Gustafson (2015): sensitivity of the posterior to the prior (weak identifiability)

Strategy 2: Transparent Parameterization
- Richardson, Evans, and Robins (2010): transparent parameterization
- Separate identifiable and non-identifiable parameters
- Based on the definition of conditional probability ($O^{obs} = (\mathbf{X}, \mathbf{Y}^{obs}, \mathbf{W})$ is the observed data):
  $\Pr(\mathbf{Y}^{mis}, \theta \mid O^{obs}) = \Pr(\theta \mid O^{obs}) \Pr(\mathbf{Y}^{mis} \mid \theta, O^{obs})$
- First simulate $\theta$ given $O^{obs}$ from $\Pr(\theta \mid O^{obs})$, then simulate $\mathbf{Y}^{mis}$ given $\theta$ and $O^{obs}$ from $\Pr(\mathbf{Y}^{mis} \mid \theta, O^{obs})$
- Partition the parameter ($\theta^m$) that governs the marginal distributions of $Y_i(1)$ and $Y_i(0)$ from the parameter ($\theta^a$) that governs the association between them
- Assume $\theta^m$ and $\theta^a$ are a priori independent

Strategy 2: Transparent Parameterization
- Posterior of $\theta$:
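  $\Pr(\theta \mid O^{obs}) \propto p(\theta^a_{Y|X}) \, p(\theta^m_{Y|X}) \prod_{W_i = 1} \Pr(Y_i(1) \mid X_i, \theta^m_{Y|X}) \prod_{W_i = 0} \Pr(Y_i(0) \mid X_i, \theta^m_{Y|X})$
- The posterior of $\theta^m_{Y|X}$ is updated by the likelihood, but that of $\theta^a_{Y|X}$ is not (it stays equal to the prior)
- Given a posterior draw of $\theta^m_{Y|X}$, we can impute $\mathbf{Y}^{mis}$ as in Strategy 1
- Repeat the analysis varying $\theta^a_{Y|X}$ (from 0 to 1) as a sensitivity analysis (Ding and Dasgupta, 2016)

Example of Strategy 2: Regression Adjustment
- Completely randomized experiment with a continuous outcome
- Assume a bivariate normal model for the joint potential outcomes:
  $\begin{pmatrix} Y_i(1) \\ Y_i(0) \end{pmatrix} \mid (X_i, \theta_{Y|X}) \sim N\left( \begin{pmatrix} \beta_1' X_i \\ \beta_0' X_i \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_0 \\ \rho \sigma_1 \sigma_0 & \sigma_0^2 \end{pmatrix} \right)$
- Strategy 2: $\theta^m_{Y|X} = (\beta_1, \beta_0, \sigma_1^2, \sigma_0^2)$, $\theta^a_{Y|X} = \rho$
- $\{(X_i, Y_i^{obs}) : W_i = 1\}$ contribute to the likelihood of $\{\beta_1, \sigma_1^2\}$
- $\{(X_i, Y_i^{obs}) : W_i = 0\}$ contribute to the likelihood of $\{\beta_0, \sigma_0^2\}$
- The observed likelihood does not depend on $\rho$: posterior = prior

Example: Regression Adjustment
- Impose standard conjugate normal-inverse-$\chi^2$ priors on $(\beta_1, \sigma_1^2)$ and $(\beta_0, \sigma_0^2)$
- For a fixed $\rho$ and given each draw of $(\beta_1, \beta_0, \sigma_1^2, \sigma_0^2)$, we impute the missing potential outcomes:
  - For treated units ($W_i = 1$), draw $Y_i(0) \mid \cdot \sim N\left(\beta_0' X_i + \rho \frac{\sigma_0}{\sigma_1} (Y_i^{obs} - \beta_1' X_i), \; \sigma_0^2 (1 - \rho^2)\right)$
  - For control units ($W_i = 0$), draw $Y_i(1) \mid \cdot \sim N\left(\beta_1' X_i + \rho \frac{\sigma_1}{\sigma_0} (Y_i^{obs} - \beta_0' X_i), \; \sigma_1^2 (1 - \rho^2)\right)$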
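
The imputation step for a fixed rho can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tutorial's own code: the function names `impute_missing` and `sate_draw`, the toy data, and the parameter values are all hypothetical, and a single parameter draw stands in for what would be a fresh posterior draw of (beta_1, beta_0, sigma_1, sigma_0) at every iteration. Only the standard library is used.

```python
import math
import random


def impute_missing(x, y_obs, w, beta1, beta0, sigma1, sigma0, rho, rng):
    """Return (Y(0), Y(1)) for one unit: keep the observed outcome, draw the missing one.

    Uses the conditional normal implied by the bivariate normal model,
    for a fixed association parameter rho. Inputs are illustrative.
    """
    mu1 = sum(b * xi for b, xi in zip(beta1, x))  # beta_1' X_i
    mu0 = sum(b * xi for b, xi in zip(beta0, x))  # beta_0' X_i
    shrink = math.sqrt(max(0.0, 1.0 - rho ** 2))  # sqrt(1 - rho^2)
    if w == 1:
        # Treated unit: Y(1) is observed, Y(0) is missing.
        mean = mu0 + rho * (sigma0 / sigma1) * (y_obs - mu1)
        return rng.gauss(mean, sigma0 * shrink), y_obs
    # Control unit: Y(0) is observed, Y(1) is missing.
    mean = mu1 + rho * (sigma1 / sigma0) * (y_obs - mu0)
    return y_obs, rng.gauss(mean, sigma1 * shrink)


def sate_draw(X, Y_obs, W, beta1, beta0, sigma1, sigma0, rho, rng):
    """One draw of the sample average treatment effect after imputation."""
    completed = [
        impute_missing(x, y, w, beta1, beta0, sigma1, sigma0, rho, rng)
        for x, y, w in zip(X, Y_obs, W)
    ]
    return sum(y1 - y0 for y0, y1 in completed) / len(completed)


if __name__ == "__main__":
    rng = random.Random(2019)
    # Hypothetical data: an intercept plus one covariate, four units.
    X = [[1.0, 0.2], [1.0, -0.5], [1.0, 1.3], [1.0, 0.0]]
    W = [1, 0, 1, 0]
    Y_obs = [2.1, 0.4, 3.0, 0.9]
    # Hypothetical parameter draw; a full analysis would redraw it from
    # the posterior at every iteration instead of holding it fixed.
    beta1, beta0, sigma1, sigma0 = [2.0, 0.8], [1.0, 0.5], 1.0, 1.2
    for rho in (0.0, 0.5, 1.0):  # sensitivity analysis over the non-identified rho
        draws = [
            sate_draw(X, Y_obs, W, beta1, beta0, sigma1, sigma0, rho, rng)
            for _ in range(1000)
        ]
        print(rho, sum(draws) / len(draws))
```

With rho = 1 the conditional variance is zero, so the missing outcome is a deterministic function of the observed one; with rho = 0 it is drawn from its marginal distribution and the observed outcome is ignored. Comparing results across values of rho is the sensitivity analysis described above.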
