A Practitioner’s Guide to Cluster-Robust Inference

A. Colin Cameron and Douglas L. Miller

Abstract

We consider statistical inference for regression when data are grouped into clusters, with regression model errors independent across clusters but correlated within clusters. Examples include data on individuals with clustering on village or region or other category such as industry, and state-year differences-in-differences studies with clustering on state. In such settings default standard errors can greatly overstate estimator precision. Instead, if the number of clusters is large, statistical inference after OLS should be based on cluster-robust standard errors. We outline the basic method as well as many complications that can arise in practice. These include cluster-specific fixed effects, few clusters, multi-way clustering, and estimators other than OLS.

Colin Cameron is a Professor in the Department of Economics at UC-Davis. Doug Miller is an Associate Professor in the Department of Economics at UC-Davis. They thank four referees and the journal editor for very helpful comments and for guidance, participants at the 2013 California Econometrics Conference, a workshop sponsored by the Programme Evaluation for Policy Analysis, seminars at University of Southern California and at University of Uppsala, and the many people who over time have sent them cluster-related puzzles (the solutions to some of which appear in this paper).

Doug Miller acknowledges financial support from the Center for Health and Wellbeing at the Woodrow Wilson School of Public Policy at Princeton University.

I. Introduction

In an empiricist's day-to-day practice, most effort is spent on getting unbiased or consistent point estimates. That is, a lot of attention focuses on the parameters ($\hat{\beta}$). In this paper we focus on getting accurate statistical inference, a fundamental component of which is obtaining accurate standard errors ($se$, the estimated standard deviation of $\hat{\beta}$).

We begin with the basic reminder that empirical researchers should also really care about getting this part right. An asymptotic 95% confidence interval is $\hat{\beta} \pm 1.96 \times se$, and hypothesis testing is typically based on the Wald t-statistic $w = (\hat{\beta} - \beta_0)/se$. Both $\hat{\beta}$ and $se$ are critical ingredients for statistical inference, and we should be paying as much attention to getting a good $se$ as we do to obtaining $\hat{\beta}$.

In this paper, we consider statistical inference in regression models where observations can be grouped into clusters, with model errors uncorrelated across clusters but correlated within cluster.
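The confidence interval and Wald statistic described above can be sketched in a few lines of Python. The function name `wald_summary` and the numbers used are hypothetical, chosen only for illustration; the sketch assumes the usual asymptotic normal critical value of 1.96.

```python
Z_95 = 1.96  # two-sided 5% critical value of the standard normal


def wald_summary(beta_hat, se, beta_null=0.0, z=Z_95):
    """Return (ci_low, ci_high, w) for a point estimate and its standard error.

    w is the Wald t-statistic (beta_hat - beta_null) / se.
    """
    if se <= 0:
        raise ValueError("se must be positive")
    ci_low = beta_hat - z * se
    ci_high = beta_hat + z * se
    w = (beta_hat - beta_null) / se
    return ci_low, ci_high, w


# Hypothetical estimate with a (possibly too small) default standard error.
low, high, w = wald_summary(beta_hat=0.5, se=0.1)

# Same point estimate, but with a standard error three times larger.
low3, high3, w3 = wald_summary(beta_hat=0.5, se=0.3)

print(f"se=0.1: CI=({low:.3f}, {high:.3f}), w={w:.2f}")
print(f"se=0.3: CI=({low3:.3f}, {high3:.3f}), w={w3:.2f}")
```

The point estimate is identical in both calls; only the standard error changes, and with it the width of the interval and the size of the test statistic.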

One leading example of clustered errors is individual-level cross-section data with clustering on geographical region, such as village or state. Then model errors for individuals in the same region may be correlated, while model errors for individuals in different regions are assumed to be uncorrelated. A second leading example is panel data. Then model errors in different time periods for a given individual (e.g., person or firm or region) may be correlated, while model errors for different individuals are assumed to be uncorrelated.

Failure to control for within-cluster error correlation can lead to very misleadingly small standard errors, and consequently to misleadingly narrow confidence intervals, large t-statistics and low p-values. It is not unusual to have applications where standard errors that control for within-cluster correlation are several times larger than default standard errors that ignore such correlation. As shown below, the need for such control increases not only with the size of the within-cluster error correlation, but also with the size of the within-cluster correlation of regressors and with the number of observations within a cluster.
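As a rough numerical illustration of this dependence, the sketch below uses a widely cited rule-of-thumb approximation, often called the Moulton factor, for how much the default OLS variance understates the true variance. The formula, the equal-cluster-size assumption, and the function names are assumptions introduced here for illustration, not taken from this section.

```python
import math


def moulton_variance_inflation(rho_x, rho_u, cluster_size):
    """Approximate ratio of the true variance to the default OLS variance.

    Assumed rule of thumb: 1 + rho_x * rho_u * (cluster_size - 1), where
    rho_x is the within-cluster correlation of the regressor, rho_u is the
    within-cluster correlation of the errors, and all clusters are taken to
    have the same number of observations.
    """
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    inflation = 1.0 + rho_x * rho_u * (cluster_size - 1)
    if inflation < 0:
        raise ValueError("approximation gives a negative variance ratio")
    return inflation


def moulton_se_inflation(rho_x, rho_u, cluster_size):
    """Square root of the variance inflation: the factor applied to the SE."""
    return math.sqrt(moulton_variance_inflation(rho_x, rho_u, cluster_size))


# Hypothetical case: a regressor that is constant within each cluster
# (rho_x = 1), modest error correlation (rho_u = 0.1), 81 observations each.
var_ratio = moulton_variance_inflation(rho_x=1.0, rho_u=0.1, cluster_size=81)
se_ratio = moulton_se_inflation(rho_x=1.0, rho_u=0.1, cluster_size=81)
print(f"variance inflation: {var_ratio:.2f}, SE inflation: {se_ratio:.2f}")
```

Under this approximation, even a small error correlation matters once clusters are large and the regressor varies little within a cluster.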

A leading example, highlighted by Moulton (1986, 1990), is when interest lies in measuring the effect of a policy variable, or other aggregated regressor, that takes the same value for all observations within a cluster.

One way to control for clustered errors in a linear regression model is to additionally specify a model for the within-cluster error correlation, consistently estimate the parameters of this error correlation model, and then estimate the original model by feasible generalized least squares (FGLS) rather than ordinary least squares (OLS).

Examples include random effects estimators and, more generally, random coefficient and hierarchical models. If all goes well this provides valid statistical inference, as well as estimates of the parameters of the original regression model that are more efficient than OLS. However, these desirable properties hold only under the very strong assumption that the model for within-cluster error correlation is correctly specified.

A more recent method to control for clustered errors is to estimate the regression model with limited or no control for within-cluster error correlation, and then post-estimation obtain cluster-robust standard errors. These were proposed by White (1984) for OLS with a multivariate dependent variable (directly applicable to balanced clusters); by Liang and Zeger (1986) for linear and nonlinear models; and by Arellano (1987) for the fixed effects estimator in linear panel models.

These cluster-robust standard errors do not require specification of a model for within-cluster error correlation, but do require the additional assumption that the number of clusters, rather than just the number of observations, goes to infinity. Cluster-robust standard errors are now widely used, popularized in part by Rogers (1993) who incorporated the method in Stata, and by Bertrand, Duflo and Mullainathan (2004) who pointed out that many differences-in-differences studies failed to control for clustered errors, and those that did often clustered at the wrong level.
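The idea behind cluster-robust standard errors can be sketched for the simplest possible case: OLS with a single regressor and no intercept. This is a minimal illustration under stated assumptions, not the implementation used in Stata or any other package: it applies no finite-sample correction, the function names are hypothetical, and the toy data (only four clusters) are far too few for the large-number-of-clusters asymptotics to apply.

```python
import math
from collections import defaultdict


def ols_slope(x, y):
    """OLS slope for the no-intercept model y = beta * x + u."""
    sxx = sum(xi * xi for xi in x)
    return sum(xi * yi for xi, yi in zip(x, y)) / sxx


def default_se(x, y):
    """Default OLS standard error, which assumes i.i.d. errors."""
    n = len(x)
    b = ols_slope(x, y)
    sxx = sum(xi * xi for xi in x)
    ssr = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
    s2 = ssr / (n - 1)  # one estimated parameter
    return math.sqrt(s2 / sxx)


def cluster_robust_se(x, y, cluster_ids):
    """Cluster-robust SE: sum x*residual within clusters, then square."""
    b = ols_slope(x, y)
    sxx = sum(xi * xi for xi in x)
    scores = defaultdict(float)
    for xi, yi, g in zip(x, y, cluster_ids):
        scores[g] += xi * (yi - b * xi)
    meat = sum(s * s for s in scores.values())
    return math.sqrt(meat) / sxx


# Toy data: four clusters of two observations; the regressor and the error
# are both constant within each cluster (perfect within-cluster correlation).
x = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0]
y = [1.5, 1.5, 1.5, 1.5, 2.5, 2.5, 4.5, 4.5]
g = ["A", "A", "B", "B", "C", "C", "D", "D"]

print("slope:", ols_slope(x, y))
print("default SE:", default_se(x, y))
print("cluster-robust SE:", cluster_robust_se(x, y, g))
```

Passing a distinct cluster identifier for every observation reduces `cluster_robust_se` to a heteroskedasticity-robust standard error, which makes clear that the only change clustering introduces is summing the products of regressor and residual within each cluster before squaring.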

Cameron and Miller (2011) and Wooldridge (2003, 2006) provide surveys, and lengthy expositions are given in Angrist and Pischke (2009) and Wooldridge (2010). One goal of this paper is to provide the practitioner with the methods to implement cluster-robust inference. To this end we include in the paper reference to relevant Stata commands (for version 13), since Stata is the computer package most often used in applied microeconometrics research. And we will post on our websites more expansive Stata code and the datasets used in this paper.
