
Much Ado About Nothing: A Comparison of Missing Data Methods and Software to Fit Incomplete Data Regression Models

Nicholas J. Horton and Ken P. Kleinman


Nicholas J. Horton is Assistant Professor, Department of Mathematics and Statistics, Clark Science Center, Smith College, Northampton, MA 01063-0001. Ken P. Kleinman is Associate Professor, Department of Ambulatory Care and Prevention, Harvard Medical School and Harvard Pilgrim Health Care, Boston, MA. Helpful comments and assistance regarding details of the software implementations were provided by Frank Harrell, Jr., Krista Kilmer, Gary King, Patrick Royston, Pralay Senchaudhuri, Stef van Buuren, and Yang Yuan. We thank Suzanne Switzer for assistance with the review of missing data methods in the New England Journal of Medicine and James Carpenter, Amy Herring, as well as Owen Thomas for useful comments. We are grateful for the support provided by NIMH grant R01-MH54693 and the Smith College Picker Program. © 2007 American Statistical Association. The American Statistician, February 2007, Vol. 61, No. 1, p. 79.

Missing data are a recurring problem that can cause bias or lead to inefficient analyses. Statistical methods to address missingness have been actively pursued in recent years, including imputation, likelihood, and weighting approaches. Each approach is more complicated when there are many patterns of missing values, or when both categorical and continuous random variables are involved. Implementations of routines to incorporate observations with incomplete variables in regression models are now widely available. We review these routines in the context of a motivating example from a large health services research dataset. While there are still limitations to the current implementations, and additional efforts are required of the analyst, it is feasible to incorporate partially observed values, and these methods should be used in practice.

KEY WORDS: Conditional Gaussian; Health services research; Maximum likelihood; Multiple imputation; Psychiatric epidemiology.

1. INTRODUCTION

Missing data are a frequent complication of any real-world study. The causes of missingness are often numerous, some due to design, and some to chance. Some variables may not be collected from all subjects, some subjects may decline to provide values, and some information may be purposely excised, for example to protect confidentiality. While the use of complete case methods that drop subjects missing any observations are commonly seen in practice, this approach has the disadvantage of being inefficient as well as potentially biased.

The development of methods for analysis of data with incomplete values has been an active area of research. Models that incorporate partially observed predictors are of particular interest in many real-world settings, since missingness of just a few percent on each of a number of covariates may lead to a large number of observations with some missing information. The excellent textbooks by Little and Rubin (2002), Schafer (1997), and Allison (2002) provide a comprehensive overview of methods in this setting, focused primarily on multiple imputation. Although somewhat dated, Little (1992) describes a hierarchy of approaches to account for missing predictors, including the maximum likelihood approach of Ibrahim (1990). Publications by Meng (2000) and Raghunathan (2004) provide a general introduction, and the paper by Ibrahim et al. (2005) reviews recent developments in a comprehensive fashion, though their application (cancer dataset) features incompleteness on only one variable. A useful online annotated bibliography provides a comprehensive reading list (Carpenter 2006a).

In this article, we update the prior review of Horton and Lipsitz (2001) and apply methods described by Ibrahim et al. (2005) to the logistic regression analysis of a dataset with incompleteness on four variables (both categorical and continuous) using a variety of software packages. We discuss modeling assumptions, approaches, and compromises required for estimation within current implementations. In Section 2, we briefly review methods for incorporating incomplete observations in regression models then summarize findings of two surveys of how missing data methods are used in practice in Section 3. We describe our motivating example (which features a large dataset with high proportion of missing values with a nonmonotone pattern) in Section 4, detail support for missing data software in Section 5, then apply these methods to the motivating dataset in Section 6. We contrast the strengths and limitations of these packages in practice, and suggest improvements for the future.

We focus on methods to incorporate partially observed predictors; different issues arise when some outcomes are not fully observed. We also do not consider longitudinal or clustered outcomes, for which other complications arise (Laird 1988; Robins, Rotnitzky, and Zhao 1995; and Jansen et al. 2006).

2. INCOMPLETE DATA REGRESSION METHODS

Notation and Nomenclature

We begin by introducing notation that will be used throughout, assuming that data are collected on a sample of $n$ subjects and that primary interest relates to the parameters governing the conditional distribution $f(Y_i \mid X_i, \beta)$. To simplify exposition, we suppress the subject indicator. For a given subject we can partition $X$ into components denoting observed variables ($X_{\mathrm{obs}}$) and those that are missing for that subject ($X_{\mathrm{mis}}$). We denote by $R$ a set of response indicators (i.e., $R_j = 1$ if the $j$th element of $X$ is observed, and equals 0 otherwise), governed by parameters $\phi$.

Little and Rubin (2002) introduced a nomenclature for missingness in terms of probability models for $R$. The missing completely at random (MCAR) assumption is defined as

$$P(R \mid Y, X) = P(R \mid Y, X_{\mathrm{obs}}, X_{\mathrm{mis}}) = P(R \mid \phi),$$

where in addition $\phi$ and $\beta$ are presumed distinct. Note that this depends on an assumed relationship between $R$ and the unobserved and unknowable $X_{\mathrm{mis}}$. Heuristically, this assumption states that missingness is not related to any factor, known or unknown, in the study.

It may be more plausible to posit that missingness is missing at random (MAR), which assumes that

$$P(R \mid Y, X) = P(R \mid Y, X_{\mathrm{obs}}, \phi).$$

This assumption states that the missingness depends only on observed quantities, which may include outcomes and predictors (in which case the missingness is sometimes labeled covariate dependent missingness (CDM)). The assumption regarding the lack of association with unobserved quantities ($X_{\mathrm{mis}}$) remains. We note that at first glance, the meaning of the term missing at random can be confusing, since missingness can actually be predicted (but is random after controlling for missingness due to observed quantities).

[...] the CC estimator is that if there are many different variables with missing values, then a large fraction of the observations may be dropped. For the example datasets in Figure 1, only data from pattern 1 is included, though partial information is available from the other patterns (i.e., the joint distribution of $f(Y, X_1, X_2)$ can [...]

           Hypothetical Monotone      Hypothetical Non-monotone
Pattern    Y     X1    X2    X3       Y     X1    X2    X3
1          Obs   Obs   Obs   Obs      Obs   Obs   Obs   Obs
2          Obs   Obs   Obs   M        Obs   Obs   Obs   M
3          Obs   Obs   M     M        Obs   Obs   M     M
4          Obs   M     M     M        Obs   Obs   M     Obs
5                                     Obs   M     Obs   Obs
6                                     Obs   M     M     Obs

Figure 1. Monotone and nonmonotone patterns of missingness (Obs = observed, M = missing).
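As a rough illustration of two ideas in this section (the monotone versus nonmonotone patterns of Figure 1, and the cost of dropping incomplete cases), the following Python sketch encodes the Figure 1 patterns and checks them. It is not taken from the article: the names `is_monotone` and `complete_case_fraction` are hypothetical, monotonicity is checked only for the column order (Y, X1, X2, X3) as printed, and the fraction calculation assumes that each covariate is missing independently with the same probability, which is a deliberate simplification.

```python
from typing import Sequence, Tuple

# Patterns transcribed from Figure 1; True = observed, False = missing.
# Column order is (Y, X1, X2, X3).
O, M = True, False
MONOTONE = [
    (O, O, O, O),
    (O, O, O, M),
    (O, O, M, M),
    (O, M, M, M),
]
NONMONOTONE = [
    (O, O, O, O),
    (O, O, O, M),
    (O, O, M, M),
    (O, O, M, O),
    (O, M, O, O),
    (O, M, M, O),
]


def is_monotone(patterns: Sequence[Tuple[bool, ...]]) -> bool:
    """Check monotonicity for the given column order only (an assumption).

    A set of patterns passes if, in every row, no observed column
    appears to the right of a missing one.
    """
    for row in patterns:
        seen_missing = False
        for observed in row:
            if not observed:
                seen_missing = True
            elif seen_missing:
                return False
    return True


def complete_case_fraction(p_missing: float, n_covariates: int) -> float:
    """Expected share of fully observed subjects.

    Assumes each covariate is missing independently with probability
    p_missing (a simplification made for this sketch).
    """
    return (1.0 - p_missing) ** n_covariates


if __name__ == "__main__":
    print("monotone set is monotone:", is_monotone(MONOTONE))
    print("nonmonotone set is monotone:", is_monotone(NONMONOTONE))
    print("complete cases, 5% missing on 10 covariates:",
          round(complete_case_fraction(0.05, 10), 3))
```

Under these assumptions, 5% missingness on each of 10 covariates would leave roughly 60% of subjects as complete cases, which is the sense in which missingness of a few percent per covariate can discard a large share of the sample.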

