Abandon Statistical Significance

Blakeley B. McShane (Northwestern University), David Gal (University of Illinois at Chicago), Andrew Gelman (Columbia University), Christian Robert (Université Paris-Dauphine), and Jennifer L. Tackett (Northwestern University)

21 Sep 2017

We thank Monya Baker and two anonymous reviewers for helpful comments. We also thank the National Science Foundation, Institute for Education Sciences, Office of Naval Research, and Defense Advanced Research Projects Agency for partial support of this work.

Abstract

In science publishing and many areas of research, the status quo is a lexicographic decision rule in which any result is first required to have a p-value that surpasses the 0.05 threshold and only then is consideration (often scant) given to such factors as prior and related evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain. There have been recent proposals to change the p-value threshold, but instead we recommend abandoning the null hypothesis significance testing paradigm entirely, leaving p-values as just one of many pieces of information with no privileged role in scientific publication and decision making. We argue that this radical approach is both practical and sensible.

1. Introduction: The status quo and two alternatives

The biomedical and social sciences are facing a widespread crisis, with published findings failing to replicate at an alarming rate. Often, such failures to replicate are associated with claims of huge effects from tiny, sometimes preposterous, interventions. Further, the primary evidence adduced for these claims is one or more comparisons that are anointed "statistically significant", typically defined as comparisons with p-values less than the 0.05 threshold relative to a sharp point null hypothesis of zero effect and zero systematic error. Indeed, the status quo is that p < 0.05 is deemed strong evidence in favor of a scientific theory and is required not only for a result to be published but even for it to be taken seriously.

Statistical significance serves as a lexicographic decision rule whereby any result is first required to have a p-value that surpasses the 0.05 threshold and only then is consideration (often scant) given to such factors as prior and related evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain (in the sequel, we refer to these collectively as the neglected factors).

Traditionally, the p < 0.05 rule has been considered a safeguard against noise-chasing and thus a guarantor of replicability. However, in recent years, a series of well-publicized examples such as Carney et al. [2010] and Bem [2011], coupled with theoretical work, has made it clear that so-called researcher degrees of freedom [Simmons et al., 2011] are abundant enough for statistical significance to be easily obtained from pure noise. Consequently, low replication rates are to be expected given existing scientific practices [Ioannidis, 2005, Smaldino and McElreath, 2016], and calls for reform, which are not new (see, for example, Meehl [1978]), have become increasingly frequent.
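As a concrete illustration of how researcher degrees of freedom can manufacture significance from noise, consider the following minimal simulation. It is our sketch, not code from the paper; the specific flexibilities modeled (two outcome measures, their average, and one optional look at additional data) are hypothetical choices in the spirit of Simmons et al. [2011]:

```python
# Two conditions with NO true effect; the analyst may pick either of two
# outcomes or their average, and may collect more data once if the first
# look is not significant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def flexible_study(n=20, n_extra=10):
    """True if any analysis choice reaches p < 0.05 on pure noise."""
    a = rng.normal(size=(n + n_extra, 2))  # condition A, two noise outcomes
    b = rng.normal(size=(n + n_extra, 2))  # condition B, two noise outcomes
    for n_used in (n, n + n_extra):        # first look, then one optional peek
        xs, ys = a[:n_used], b[:n_used]
        analyses = [
            (xs[:, 0], ys[:, 0]),                # report outcome 1 only
            (xs[:, 1], ys[:, 1]),                # report outcome 2 only
            (xs.mean(axis=1), ys.mean(axis=1)),  # report the average
        ]
        if any(stats.ttest_ind(x, y).pvalue < 0.05 for x, y in analyses):
            return True
    return False

n_sims = 10_000
rate = sum(flexible_study() for _ in range(n_sims)) / n_sims
print(f"False-positive rate with flexibility: {rate:.3f} (nominal: 0.05)")
```

Even this modest flexibility typically yields a false-positive rate well above the nominal 5%, despite every dataset being pure noise.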

One alternative, suggested by Daniel Benjamin and seventy-one coauthors including distinguished scholars from a wide variety of fields, is to change the default p-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005 [Benjamin et al., 2017]. We believe this proposal is insufficient to overcome current difficulties with replication and, perhaps curiously, we expect those authors would agree, given that they "restrict [their] recommendation to claims of discovery of new effects" and recognize that the choice of any particular threshold is arbitrary and "should depend on the prior odds that the null hypothesis is true, the number of hypotheses tested, the study design, the relative cost of Type I versus Type II errors, and other factors that vary by research topic." Indeed, "many of [the authors] agree that there are better approaches to statistical analyses than null hypothesis significance testing" (NHST). In particular, we disagree with their emphasis on a particular immediate action: changing the p-value threshold is "simple, aligns with the training undertaken by many researchers, and might quickly achieve broad acceptance."

We are not convinced this step would be helpful. In the short term, a more stringent threshold could reduce the flow of low quality work that is currently polluting even top journals. In the medium term, it could motivate researchers to perform higher-quality work that is more likely to crack the 0.005 barrier. On the other hand, a steeper cutoff could lead to even more overconfidence in results that do get published as well as greater exaggeration of the effect sizes associated with such results. It could also lead to the discounting of important findings that happen not to reach it. In sum, we have no idea whether implementation of the proposed 0.005 threshold would improve or degrade the state of science, as we can envision both positive and negative outcomes resulting from it.

Ultimately, while this question may be interesting if difficult to answer, we view it as outside our purview because we believe that p-value thresholds (as well as those based on other statistical measures) are a bad idea in general, and consequently, we propose another alternative, which is to abandon statistical significance. In particular, rather than propose a quick fix ("a dam to contain the flood until we make sure we have the more permanent fixes," in the words of a prominent member of the seventy-two [Resnick, 2017]), we recommend dropping the NHST paradigm and the p-value thresholds associated with it as the default statistical paradigm for research, publication, and discovery in the biomedical and social sciences. Specifically, rather than allowing statistical significance as determined by p < 0.05 (or some other statistical threshold) to serve as a lexicographic decision rule in scientific publication and statistical decision making more broadly, as per the status quo, we propose that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with the neglected factors as just one among many pieces of evidence.
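To make the contrast concrete, here is a purely illustrative sketch (ours, not the paper's; the factor names, 0-1 scores, and 0.5 cutoff are all hypothetical) of the status-quo lexicographic rule versus treating the p-value continuously as one piece of evidence among the neglected factors:

```python
# Hypothetical toy: how the two decision procedures differ structurally.
FACTORS = ("prior_evidence", "mechanism_plausibility", "design_quality",
           "data_quality", "real_world_value", "novelty")

def lexicographic_rule(p, factors):
    """Status quo: p < 0.05 is a hard gate; factors matter only afterwards."""
    if p >= 0.05:
        return False                                   # gated out, full stop
    return sum(factors.values()) / len(factors) > 0.5  # late, often scant check

def holistic_assessment(p, factors):
    """Proposal: the p-value, treated continuously, is just one more input."""
    scores = list(factors.values()) + [1.0 - p]        # no privileged role
    return sum(scores) / len(scores) > 0.5

strong_study = dict.fromkeys(FACTORS, 0.9)  # careful design, plausible effect
print(lexicographic_rule(0.06, strong_study))   # False: blocked at p = 0.06
print(holistic_assessment(0.06, strong_study))  # True: evidence still strong
```

The point is structural: under the lexicographic rule the neglected factors cannot rescue a study at p = 0.06 however strong they are, whereas under the holistic view no single number acts as a gate.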

We make this recommendation for three broad reasons. First, in the biomedical and social sciences, the sharp point null hypothesis of zero effect and zero systematic error used in the overwhelming majority of applications is generally not of interest because it is generally implausible. Second, the standard use of NHST, taking the rejection of this straw man sharp point null hypothesis as positive or even definitive evidence in favor of some preferred alternative hypothesis, is a logical fallacy that routinely results in erroneous scientific reasoning even by experienced scientists and statisticians. Third, p-value and other statistical thresholds encourage researchers to study and report single comparisons rather than focusing on the totality of their data and results. Before elaborating on our own suggestions for improving replicability, we discuss general problems with NHST that remain unresolved by the Benjamin et al. [2017] proposal, as well as problems specific to the proposal.

We then discuss the implications of abandoning statistical significance for the scientific publication process as well as for statistical decision making more broadly.

2. Problems with null hypothesis significance testing

In the biomedical and social sciences, effects are typically small and vary considerably across people and contexts. In addition, measurements can be highly variable and are often only indirectly related to underlying constructs of interest, so that even when sample sizes are large, the possibilities of systematic variation and bias result in the equivalent of small or unrepresentative samples. Consequently, estimates from any single study are themselves generally noisy. This, in combination with the fact that the single study is typically the fundamental unit of analysis, poses problems for the NHST paradigm.

Given that effects are small and variable and measurements are noisy, the sharp point null hypothesis of zero effect and zero systematic error used in the overwhelming majority of applications is itself implausible [Berkson, 1938, Edwards et al., 1963, Bakan, 1966, Tukey, 1991, Cohen, 1994, Gelman et al., 2014, McShane and Böckenholt, 2014, Gelman, 2015].
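To see why the sharp point null is so fragile, consider a minimal simulation (our sketch, not from the paper; the 0.02 standard-deviation systematic bias is a hypothetical stand-in for a small measurement artifact): even with zero true effect, an arbitrarily small systematic error guarantees rejection once the sample is large enough.

```python
# Zero true effect, but a tiny hypothetical systematic error: the sharp
# point null is rejected with near certainty once n is large enough.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
bias = 0.02  # no real effect; only a small systematic error

for n in (100, 10_000, 1_000_000):
    x = rng.normal(loc=bias, scale=1.0, size=n)
    p = stats.ttest_1samp(x, popmean=0.0).pvalue
    print(f"n = {n:>9,}: p = {p:.2g}")
```

With enough data, the test detects the bias rather than any effect of scientific interest, which is why rejection of the point null is rarely informative on its own.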

Consequently, Cohen [1994] derided this null hypothesis as the "nil hypothesis" and lampooned it as "always false," and Tukey [1991] noted that two treatments are "always different." Indeed, even were an effect truly zero, experimental realities dictate that the effect would not be exactly zero in any study designed to test it.

In addition, noisy estimates in combination with a publication process that screens for statistical significance result in published estimates that are biased upwards (potentially to a large degree) and often of the wrong sign [Gelman and Carlin, 2014]. Indeed, the screening of estimates for statistical significance by the publication process provides an indirect incentive for researchers to conduct many small noisy studies, resulting in estimates that can be made to yield one or more statistically significant results [Simmons et al., 2011].
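The magnitude of this screening bias is easy to see in a minimal simulation (our sketch of the phenomenon Gelman and Carlin [2014] analyze; the true effect of 0.1 and standard error of 0.5 are hypothetical values representing a small effect studied noisily):

```python
# A small true effect (0.1) estimated with a large standard error (0.5);
# only estimates passing |estimate| > 1.96 * se get "published".
import numpy as np

rng = np.random.default_rng(2)
true_effect, se = 0.1, 0.5
estimates = rng.normal(true_effect, se, size=100_000)  # many replications

published = estimates[np.abs(estimates) > 1.96 * se]   # significance filter
print(f"share published:            {published.size / estimates.size:.1%}")
print(f"mean |published| vs. truth: {np.abs(published).mean() / true_effect:.1f}x")
print(f"wrong sign among published: {(published < 0).mean():.1%}")
```

Conditioning on significance here publishes only the most extreme estimates: they overstate the true effect many times over, and a nontrivial share of them point in the wrong direction.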

