Transcription of Stata: Bivariate Statistics
Topics: Chi-square test, t-test, Pearson's R correlation coefficient

There are three situations during survey data analysis in which bivariate statistics are commonly used.

1. Compare two groups

First, bivariate statistics are used to compare two study groups to see if they are similar, for example, to compare two groups at baseline before an intervention is implemented, or to compare participants who are lost to follow-up with those who remained in the study. When comparing groups, we want to provide strong evidence of any group differences, so we use a conservative threshold of p< to determine statistical significance.
In this course, we are learning to analyze research questions with binary outcomes. Bivariate statistics can be used to summarize and compare characteristics across groups. For example, were there differences in the socio-demographic characteristics of women who did and did not experience intimate partner violence in the last 12 months?

2. Identify covariates for general explanatory model

When a characteristic like age differs between people who did and did not experience the outcome, we say that the characteristic is associated with the outcome. This is because the characteristic helps to explain variance in the outcome. In cross-sectional data analysis, we cannot draw causal conclusions.
We are not talking about causal mechanisms that predict the outcome. Although a woman's age group might be associated with whether or not she experienced intimate partner violence in the last 12 months, the biological process of aging does not cause her partner to act violently toward her. Rather, we are saying that a characteristic (like older age) tends to be present when the outcome is present.

When we are developing a general explanatory model, that is, when the research question is "Which factors are associated with [the outcome]?", we use bivariate statistics to identify potential covariates that are worth testing in a multivariable model. If a variable is independently associated with the outcome, it might continue to explain the outcome once other factors are taken into account.
In this case, when bivariate statistics are used to filter potential covariates for multivariable analysis, we use a generous threshold of p< to determine statistical significance, to ensure that we do not drop any potentially useful variables from the analysis. Note that the statistical test used to compare two groups (usually the chi-square test in logistic regression) is the same test, with the same output, that we use here to filter variables. The only difference is the purpose of the test, and therefore our interpretation of its results is different.

3. Chi-square test

The chi-square test is a common bivariate statistic used to test whether the distribution of a categorical variable is statistically different in two or more groups.
The chi-square test gives a yes/no answer: a p-value less than the threshold means, yes, there are differences between the groups. In a manuscript, if you see a p-value next to a categorical variable (with data summarized as percentages), this is usually the p-value from a chi-square test. The chi-square p-value is easy to interpret after you have set a threshold for statistical significance: either the distributions are, or are not, the same.

The chi-square test is a global statistic; it tells you if there are any differences across cells, though it does not tell you which cell(s) are different. You can often tell which cells are different qualitatively based on the percentages, though additional or different testing might be performed to isolate whether certain cells are statistically different from the rest.
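As a rough illustration of the chi-square test described above, the following sketch runs a survey-weighted cross-tabulation in Stata. It is an assumption-laden example, not part of the course materials: it uses the NHANES II example dataset that ships via webuse (which requires internet access), and the variables agegrp, sex, race, rural and diabetes are simply stand-ins for your own covariates and binary outcome.

```stata
* Minimal sketch; assumes internet access for -webuse- and uses the
* NHANES II example data as a stand-in for your own survey data.
webuse nhanes2, clear

* Declare the survey design (PSU, sampling weight, strata).
svyset psuid [pweight=finalwgt], strata(stratid)

* Cross-tabulate age group by diabetes status with row percentages.
* The output footer reports the uncorrected Pearson chi-square and
* the design-based F-statistic with its p-value.
svy: tabulate agegrp diabetes, row percent

* Screen several candidate covariates against the same outcome and
* list each design-based p-value for comparison with your threshold.
foreach v of varlist agegrp sex race rural {
    quietly svy: tabulate `v' diabetes
    display "`v'" _col(12) "p = " %6.4f e(p_Pear)
}
```

The loop only prints p-values; which threshold you compare them against depends on whether you are comparing groups or filtering covariates.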
You should not use the chi-square test statistic if one or more cells in the cross-tabulation has fewer than five observations, though this is incredibly rare in survey data analysis when tens of thousands of respondents are interviewed. If we have a response category with fewer than five observations, then we should combine it with another category.

The chi-square test statistic is simple to implement in Stata. In fact, we have been doing it all along! Each time we use the tabulate command with survey data (by starting with svy:), we are producing a Pearson's chi-square statistic (reported as a design-based F-statistic) and p-value.

Source: Manzi, A., et al. (2014) BMC Pregnancy and Childbirth

4. T-test

A t-test is used to test whether the distribution of a continuous variable is statistically different across groups: a p-value less than the threshold means, yes, there are differences. Do NOT use a t-test when the distribution of outcomes within groups is not normal, or when the variance is not the same across groups. In these situations, consider transforming the variable (we do not discuss this further in this course), or categorize the continuous values and test the result as a categorical variable.

You can produce t-test statistics for a continuous variable across two or more groups with survey data by specifying a linear regression and testing for differences in the outcome across group categories.
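The regression approach to the t-test described above might look like the following sketch. It assumes the NHANES II example dataset from webuse (internet access required); bmi, diabetes and race are illustrative stand-ins for your own continuous variable and grouping variables, not variables from the course data.

```stata
* Minimal sketch; NHANES II example data stand in for your survey data.
webuse nhanes2, clear
svyset psuid [pweight=finalwgt], strata(stratid)

* Two groups: the t-statistic and p-value on the 1.diabetes
* coefficient test whether mean BMI differs by diabetes status.
svy: regress bmi i.diabetes

* More than two groups: fit the regression on the categorical
* variable, then jointly test all of its category coefficients.
svy: regress bmi i.race
testparm i.race
```

With two groups, the coefficient's p-value plays the role of the t-test p-value. With three or more groups, testparm gives a single joint test of whether any group mean differs.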
5. Test for collinearity between two covariates

Before fitting any kind of multivariable model, whether a general explanatory model or a hypothesis test model, you should test for collinearity. Collinearity occurs when two covariates in a multivariable model are highly related; usually this is because the two variables represent the same thing (the same concept, or they happen simultaneously). For example, in a society where husbands and wives tend to have the same level of education, the woman's education status and the man's education status represent the same construct within households. The wife's education might do a good job of explaining variance in the outcome, leaving little variance left over to be explained by the husband's education.
As a result, the model becomes unstable. To produce parsimonious (efficient) multivariable models, and to prevent strange, unstable results, we test for strong associations among covariates and remove any collinear covariates from the analysis.

The Pearson's R correlation coefficient is used to identify binary, ordinal, and continuous covariates that are correlated. Correlations of r> are often considered collinear in the social sciences. When two or more covariates are found to be collinear, we keep the one variable that is most strongly associated with the outcome, unless there is a conceptual reason to keep one over the other. For nominal variables (variables with non-ordered categories), say marriage type, you cannot use the Pearson's R correlation coefficient.
If you want to be rigorous, you might test one or more binary definitions of the variable, for example, married (yes/no) or separated (yes/no), rather than a four-category definition of marital status. In practice, you might only do this step if you were concerned about collinearity for conceptual reasons.

6. Pearson's R correlation coefficient

The reason we only use the Pearson's R correlation coefficient for binary, ordinal, and continuous data is that it is a measure of the strength of linear association between two variables. The Pearson's R correlation answers the question: how strongly are two variables associated, on a scale from zero to an absolute value of one? The Pearson's R correlation statistic is related to linear regression; it tries to draw a line of best fit through the data of two variables.
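A collinearity check along the lines described above could be sketched in Stata as follows. This is an illustration under stated assumptions: it uses the NHANES II example dataset from webuse (internet access required), the covariates age, bmi and highbp are arbitrary stand-ins, race_black is a hypothetical variable created only for this example, and applying the sampling weight as an analytic weight is one possible way to account for weighting, not necessarily the course's method.

```stata
* Minimal sketch; NHANES II example data stand in for your survey data.
webuse nhanes2, clear

* A nominal variable cannot be correlated directly, so build a binary
* (yes/no) definition of one category first. race_black is a
* hypothetical name; the sketch assumes code 2 means Black
* (check with: label list).
generate byte race_black = (race == 2) if !missing(race)

* Pairwise Pearson correlations among continuous and binary
* covariates, with p-values, using the sampling weight as an
* analytic weight.
pwcorr age bmi highbp race_black [aweight=finalwgt], sig

* The same coefficients as a plain correlation matrix
* (listwise deletion of missing values).
correlate age bmi highbp race_black [aweight=finalwgt]
```

Compare the absolute value of each coefficient with your collinearity threshold, and for any pair that exceeds it, keep the covariate more strongly associated with the outcome unless there is a conceptual reason to prefer the other.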