## Abstract

We suggest twenty immediately actionable steps to reduce widespread inferential errors related to “statistical significance testing.” Our propositions refer to the theoretical preconditions for using *p*-values. They furthermore include wording guidelines as well as structural and operative advice on how to present results, especially in research based on multiple regression analysis, the workhorse of empirical economists. Our propositions aim at fostering the logical consistency of inferential arguments by avoiding false categorical reasoning. They are not aimed at dispensing with *p*-values or completely replacing frequentist approaches by Bayesian statistics.

### 1 Introduction

Widely scattered over time and disciplines, a vast amount of criticism regarding the misuses and misinterpretations of the *p*-value (the don’ts) as well as a large number of suggestions for reform (the do’s) have accumulated. One might think that enough has been said on this subject in the meanwhile, both before and after the ASA-statement (Wasserstein and Lazar 2016) that prominently red-flagged misuses and inferential errors. But the problems seem to be here to stay, particularly in disciplines such as economics that heavily rely on multiple regression analysis. Two features of the frequentist null hypothesis significance testing (NHST) framework are at the origin of most errors: first, the dichotomization of results depending on whether the *p*-value is below or above some arbitrary threshold (usually 0.05); second, the associated terminology that speaks of “hypothesis testing” and “statistically significant” as opposed to “statistically non-significant” results. Dichotomization in conjunction with misleading terminology propagates cognitive biases that seduce researchers into making logically inconsistent and overconfident inferences, both when *p* is below and when it is above the “significance” threshold. The following errors seem to be particularly widespread:

- use of *p*-values when there is neither random sampling nor randomization
- confusion of statistical and practical significance or complete neglect of effect size
- unwarranted binary statements of there being an effect as opposed to no effect, coming along with misinterpretations of *p*-values below 0.05 as posterior probabilities of the null hypothesis
- mixing up of estimating and testing and misinterpretation of “significant” results as evidence confirming the coefficients/effect sizes estimated from a single sample
- treatment of “statistically non-significant” effects as being zero (confirmation of the null)
- inflation of evidence caused by unconsidered multiple comparisons and *p*-hacking
- inflation of effect sizes caused by considering “significant” results only

The ASA-statement highlights that the *p*-value does not provide a good measure of evidence regarding a hypothesis. In other words, it does not provide a clear rationale or even calculus for statistical inference (Goodman 2008). While Berry (2017: 896) might be pushing too hard when claiming that a *p*-value “as such has no inferential content,” one must recognize that it is no more than a *graded* measure of the strength of evidence against the null, and only in the sense that small *p*-values will occur more often if there is an effect than if there is none (Hirschauer et al. 2018). Joining Berry (2017), Gelman and Carlin (2017), Greenland (2017), McShane and Gal (2017), McShane et al. (2017), Trafimow et al. (2018), and many others, we believe that degrading the *p*-value’s continuous message into binary “significance” declarations (“bright line rules”) is at the heart of the problem. Since the *p*-value is deeply anchored in the minds of most scientists including economists, we believe that demanding drastic changes, such as renouncing *p*-values or replacing frequentist approaches by Bayesian methods, is not the most promising way to guard researchers from the inferential errors that we see today. Dispensing with the dichotomy of significance testing but retaining the *p*-value and adopting small and manageable steps towards improvement seems more promising (Amrhein et al. 2017). Such steps will have to account for the idiosyncrasies of each scientific discipline. For example, requirements in the medical sciences, which often focus on mean differences between randomized experimental treatments, will at least partly differ from those in economics, which frequently resorts to multiple regression analysis of observational data.

Even if one is aware of the fundamental pitfalls of NHST, it is difficult to escape the categorical reasoning that is so entrancingly suggested by its dichotomous “significance” declarations.
Econometricians regularly face the challenge of providing an interpretative evaluation of numerous regression coefficients. Imagine a regression with several focal variables (predictors), a set of controls, and possibly even some secondary covariates (interaction terms, higher-order polynomials, etc.) introduced in the process of model specification. How should we evaluate and comment on the “evidence” as represented in the large number of regression coefficients and their associated *p*-values? We know that we should dispense with the extremely convenient but misleading dichotomous interpretation, but we still lack harmonized wordings that describe the inferential content of a *p*-value appropriately.

Our difficulties are due to the fact that the *p*-value is not only a highly non-linear but also noisy summary statistic of the data at hand (Hirschauer et al. 2018). This implies that a difference between, let’s say, a *p*-value of 0.20 and 0.19 does not indicate the same increase of the strength of evidence against the null as a difference between 0.04 and 0.03. It also requires realizing that all sample estimates, including *p*-values, may vary considerably over random replications. The *p*-value’s inconclusive inferential content precludes per se making probability statements about hypotheses (Gelman 2016; Hirschauer et al. 2016). We must therefore avoid wordings that invite confusion with the Bayesian posterior probability, i. e. the epistemic probability that a scientific proposition about the world is true *given* the evidence in the data. Unfortunately, spontaneous interpretations of the *p*-value are often not correct as people, especially when confronted with “significance” language, seem to be prone to the “inverse probability error.” That is, they often confuse the “conditional probability of data given a hypothesis” (*p*-value) with the “conditional probability of a hypothesis given the data” (posterior probability). Cohen (1994: 997) coined the term “inverse probability error” to highlight that the *p*-value “does not tell us what we want to know [i. e. the posterior probability], and [that] we so much want to know what we want to know that out of desperation, we nevertheless believe that it does.”
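
The non-linearity can be made concrete with a small calculation. One way to do so, which we add purely for illustration and which is not part of the suggestions in this paper, is to re-express *p*-values as S-values, −log₂(*p*), a measure of evidence against the null in bits:

```python
import math

def s_value(p):
    """Surprisal (S-value) in bits: -log2(p). Equal steps in bits
    correspond to equal multiplicative changes in the p-value."""
    return -math.log2(p)

# The same absolute change in p carries very different evidential weight:
for p_hi, p_lo in [(0.20, 0.19), (0.04, 0.03)]:
    gain = s_value(p_lo) - s_value(p_hi)
    print(f"p {p_hi} -> {p_lo}: gain of {gain:.2f} bits of evidence")
```

On this scale, moving from *p* = 0.20 to 0.19 adds only about 0.07 bits of evidence against the null, whereas moving from 0.04 to 0.03 adds about 0.41 bits, which makes the unequal steps visible at a glance.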

On the one hand, we know that for a two-sided test “any *p*-value less than 1 implies that the test [null] hypothesis is not the hypothesis most compatible with the data, because any other hypothesis with a larger *p*-value would be even more compatible with the data” (Greenland et al. 2016: 341). Along the same lines but with a focus on experimental data, Goodman (2008: 136) notes that “the effect best supported by the data from a given experiment is always the observed effect, regardless of its significance.” On the other hand, we commonly interpret the *p*-value as a “first defense line” against being fooled by the randomness of sampling (Benjamini 2016) when generalizing from our findings to the population. We should meet this defense-line interpretation with caution, however, because the *p*-value itself is but a statistic of a noisy random sample. In plausible constellations of noise and sample size, the *p*-value exhibits wide sample-to-sample variability (Halsey et al. 2015). This is paralleled by the variability of estimated coefficients over replications. We may easily find a large coefficient in one random sample (overestimation) and a small one in another (underestimation). Unbiased estimators estimate correctly on average (Hirschauer et al. 2018). We would therefore need *all* estimates from frequent replications – *irrespective* of their *p*-value and their being large or small – to obtain a good idea of the population effect size. Based on a single sample, we have no way of identifying the *p*-value below (above) which the associated effect size estimate is too large (too small), but we are very likely to overestimate effect sizes when taking “significant” results at face value (Hirschauer et al. 2018). 
Even when finding a “highly significant” result (with, let’s say, a *p*-value of 0.001), which ironically would be a highly appreciated case in conventional NHST, we cannot make a direct inference and assume the estimated effect to accurately reflect the population effect size (Bancroft 1944; Danilov and Magnus 2004). Quite on the contrary:

> Under reasonable sample sizes and reasonable population effect sizes, it is the abnormally large sample effect sizes that result in *p*-values that meet the 0.05 level, or the 0.005 level, or any other alpha level, as is obvious from the standpoint of statistical regression (Trafimow et al. 2018).

Hence, even seemingly neutral representations such as “the retail prices of product A exceed the retail prices of product B by 20 % on average (*p*<0.001)” may be misleading because they insinuate that the evidence against the null can be translated into evidence in favor of the concrete effect that we happened to find in a single sample (Amrhein et al. 2017).
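
This selection effect can be illustrated with a small simulation (our own sketch; the true effect of 0.2, the noise level, and the sample size are arbitrary choices made for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_diff, sd, n, reps = 0.2, 1.0, 50, 10_000  # small true effect, modest n

all_est, sig_est = [], []
for _ in range(reps):
    a = rng.normal(0.0, sd, n)            # "control" sample
    b = rng.normal(true_diff, sd, n)      # "treatment" sample
    t_stat, p = stats.ttest_ind(b, a)
    est = b.mean() - a.mean()             # estimated effect size
    all_est.append(est)
    if p < 0.05:                          # keep only "significant" findings
        sig_est.append(est)

print(f"true effect:                 {true_diff}")
print(f"mean of all estimates:       {np.mean(all_est):.2f}")
print(f"mean of 'significant' ones:  {np.mean(sig_est):.2f}")
```

In runs of this sketch, the mean of *all* estimates is close to the true value of 0.2 (the estimator is unbiased), whereas the mean of the “significant” estimates is markedly inflated, which is exactly the overestimation described above.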

The problem of correctly interpreting *p*-values is exacerbated by the fact that the disregard of multiple comparisons, which are pervasive in econometric analyses, inflates the strength of evidence against the null and makes the *p*-value essentially uninterpretable. A *p*-value is a summary statistic that tells us how incompatible the data are with the specified statistical model including the null hypothesis. In a pre-specified *single* regression, the *p*-value represents the conditional probability of finding the observed effect (or even a larger one) in random replications *if* the null hypothesis were true. While there is no multiple testing problem if one focuses a priori on one hypothesis and model, a multiple testing problem arises whenever researchers independently perform and interpret more than one test on one data set. Disregarding the multiple testing problem is quite common in multiple regression analysis. For one thing, this is due to the fact that it is widespread practice to subject several hypotheses, which are implicitly presumed to be independent, to significance testing one by one, and then to search the list of *p*-values for “significant” results. Imagine for illustration’s sake a set of ten hypotheses and assume the ten corresponding regressors to be independent and completely non-predictive. Despite the completely random probabilistic structure, there is a 40 % chance (1–0.95^{10}) of finding at least one coefficient with *p* ≤ 0.05 if we perform ten independent tests (Altman and Krzywinski 2017). Furthermore, we must not forget that econometricians commonly fit regression models to pre-existing data and retain one model as the final (“best”) model after multiple models have been tried out and evaluated by using some measure of model fit (e. g. likelihood ratio test or Akaike Information Criterion), a practice that obfuscates the dividing line between confirmatory and exploratory research. 
We arrive at overconfident conclusions if we assess the strength of evidence in only the “best” model even though multiple analytical alternatives had been tried out before (Danilov and Magnus 2004; Forstmeier et al. 2016). The same applies, to an even greater extent, when researchers self-interestedly explore multiple analytical variants and selectively choose one variant that “works” in terms of producing statistical significance (*p*-hacking; Simmons et al. 2011).
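
The 40 % figure can be verified analytically and reproduced in a regression setting (our own sketch; the sample size of 200 and the number of replications are arbitrary):

```python
import numpy as np
from scipy import stats

# Analytic chance of at least one p <= 0.05 in ten independent null tests:
print(f"1 - 0.95**10 = {1 - 0.95**10:.3f}")

# The same inflation in an OLS regression with ten non-predictive regressors:
rng = np.random.default_rng(1)
n, k, reps = 200, 10, 4000
hits = 0
for _ in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
    y = rng.normal(size=n)                       # unrelated to all regressors
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    df = n - k - 1
    cov = (resid @ resid / df) * np.linalg.inv(X.T @ X)
    t_vals = beta[1:] / np.sqrt(np.diag(cov))[1:]
    p_vals = 2 * stats.t.sf(np.abs(t_vals), df)  # two-sided coefficient tests
    hits += p_vals.min() <= 0.05
print(f"simulated share with at least one p <= 0.05: {hits / reps:.2f}")
```

Although the ten coefficient tests within one regression are not exactly independent, the simulated share comes out close to the analytic 0.40.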

### 2 Suggestions for the use and interpretation of *p*-values

While some disciplines, especially those that traditionally use experimental designs such as the medical sciences, have adopted substantial reforms to abate inferential errors related to the *p*-value, economists at large do not seem to play a very active part in the debate and the reform efforts. We believe that this is not so much due to economists not recognizing the problems associated with significance testing, but rather due to their not knowing what to do instead as long as no disciplinary consensus has been reached. With a view to the *p*-value’s deep entrenchment in the current research practice and the apparent need for both guidance and (some degree of) consensus, the objective of this paper is to contribute to the debate by systematically compiling suggestions – none of them new and none of them our own – that jointly seem to represent the most promising set of concrete and immediately actionable steps to reduce inferential errors.
Being economists, we focus on suggestions that are relevant for correctly interpreting the results of multiple regression analysis, which is the workhorse in econometric research. Being pragmatic, we focus on suggestions that are concerned with the analysis of single-sample data, even though we are aware of the advantages of multiple-study designs, meta-analysis, and Bayesian approaches for making valid inferences.

Contenting ourselves for the time being with compiling small incremental steps must not be understood as opposition towards more substantial change in the future. Quite on the contrary. We hope that our suggestions will help prepare the field for better study designs and inferential tools, and especially more pre-registration, replication, and meta-analytical thinking, i.e. tools that take systematic account of the fact that all estimates (standard errors, *p*-values, effect sizes) vary over random replications (Gelman 2016) and that our best unbiased estimators only estimate correctly *on average* (Hirschauer et al. 2018). More immediately, however, we hope that they will serve as a discussion base or even tool kit that is directly helpful, for example, to editors of economics journals who aim at revising their editorial policies and guidelines to increase the quality of published research. In brief, we address the question of how a typical econometric study, which for the time being refrains from *Bayesian statistics* and continues to use *frequentist statistics*, should proceed to avoid the inferential errors that are so pervasive at present. It is important to note that some suggestions, such as displaying standard errors, could be criticized as asking for redundant information. Readers of a research paper could in principle compute standard errors when effect sizes and *p*-values are provided. Mathematical redundancy is not a good argument, however. Instead, the question is how we should present information to avoid cognitive biases and foster the logical consistency of inferential arguments through good intuition.

Our suggestions are best preceded by a quote by Vogt et al. (2014: 242; 244) who note that the classical tools for statistical inference (including *p*-values) are inherently based on a random process of data generation:

> in research not employing random assignment or random sampling, the classical approach to inferential statistics is inappropriate. […] In the case of a random sample, the *p*-value addresses the following question: ‘If the null hypothesis were true of the population, how likely would we have been to obtain a sample statistic this large or larger in a sample of this size?’ […] In the case of random assignment, the *p*-value targets the following question: ‘If the null hypothesis were true about the difference between treated and untreated groups, how likely is it that we would have obtained a difference between them this big (or bigger) when studying treatment and comparison groups of this size?’ […] If the experimental and control groups have not been assigned using probability techniques, or if the cases have not been sampled from a population using probability methods, inferential statistics are not applicable. They are routinely applied in inapplicable situations, but an error is no less erroneous for being widespread.

#### Fundamental prerequisites for using the p-value

**Suggestion 1:** Use neither *p*-values nor other inferential tools such as standard errors or confidence intervals if you already have data for the whole population of interest. In this case, no generalization (inference) from the sample to the population is necessary and you can directly describe the population properties. Do not use *p*-values either if you have a non-random sample that you chose for convenience reasons instead of using probability methods. Being inherently based on probability theory and a random process of data generation, *p*-values are not interpretable for non-random samples.

**Suggestion 2:** Be clear that the function of the *p*-value is different depending on whether the data generating process is random sampling or random assignment. In the random sampling case, you are concerned with generalizing from the sample to the population (external validity). In the random assignment case, you are concerned with the internal validity of an experiment in which you randomly assign treatments to subjects. In an experiment, the *p*-value is a continuous measure of the strength of evidence against the null hypothesis of there being no treatment effect. It is *no help* for generalizing to a target population when the experimental subjects have not been recruited through random sampling from a defined parent population.
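
In the random assignment case, the logic described in suggestion 2 can be made explicit with a randomization (permutation) test, which re-computes the group difference under re-shufflings of the treatment labels, i. e. under the null of no treatment effect. The numbers below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical outcomes after random assignment of a treatment:
treated = np.array([14.2, 15.1, 13.8, 16.0, 15.5])
control = np.array([13.0, 13.9, 12.7, 14.1, 13.5])
obs = treated.mean() - control.mean()

pooled = np.concatenate([treated, control])
n_t = len(treated)
reps = 20_000
count = 0
for _ in range(reps):
    perm = rng.permutation(pooled)        # re-assign labels at random
    diff = perm[:n_t].mean() - perm[n_t:].mean()
    if abs(diff) >= abs(obs) - 1e-9:      # two-sided; tolerance for fp ties
        count += 1
p_value = count / reps
print(f"observed difference: {obs:.2f}, permutation p-value: {p_value:.3f}")
```

The resulting *p*-value quantifies how often random re-assignment alone produces a difference as large as the observed one; it says nothing about any population beyond the ten experimental subjects.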

**Suggestion 3:** When using *p*-values as a tool that is to help generalize from a sample to a population, provide convincing arguments that your sample represents at least *approximately* a random sample. To avoid misunderstandings, transparently state how and from which population the random sample was drawn and, consequently, to which population you want to generalize.

#### Wording guidelines for avoiding misunderstandings

**Suggestion 4:** Use wordings that ensure that the *p*-value is understood as a *graded* measure of the strength of evidence against the null. Make sure that readers realize that no particular information is associated with a *p*-value being below or above some threshold such as 0.05 (see also suggestion 19).

**Suggestion 5:** Avoid wordings that insinuate that the *p*-value denotes an epistemic (posterior) probability that you can attach to a scientific hypothesis (the null) *given* the evidence you found in your data. Stating that you found an effect with an “error probability” of *p* is misleading, for example. It suggests the false interpretation that the *p*-value is the probability of the null – and therefore the probability of being “in error” when rejecting it. Consequently, avoid the term “error probability.”

**Suggestion 6:** Avoid wordings that insinuate that a low *p*-value indicates a large or even practically or economically relevant size of the estimate, and vice versa. Use wordings such as “large” or “relevant” but refrain from using “significant” when discussing the effect size – at least as long as threshold thinking and dichotomous interpretations of *p*-values associated with the term “statistical significance” linger on in the scientific community (see also suggestion 19).

**Suggestion 7:** Do not suggest that high *p*-values can be interpreted as an indication of no effect (“evidence of absence”) even though in the NHST-approach “non-significance” leads to non-rejection of the null hypothesis of no effect. Do not even suggest that high *p*-values can be interpreted as “absence of evidence.” Doing so would negate the evident effects that you observed in the data.

**Suggestion 8:** Avoid formulations and representations that could suggest that *p*-values below 0.05 can be interpreted as evidence in favor of the just-estimated coefficient. Formulations claiming that you found a “statistically significant effect of z” should be avoided, for example, because they mix up estimating and testing procedures. The strength of evidence against the null cannot be translated into evidence in favor of the concrete estimate that one happened to find in a sample.

**Suggestion 9:** Avoid using the terms “hypothesis *testing*” and “*confirmatory* analysis” or at least put them into proper perspective and communicate that it is logically impossible to infer from the *p*-value whether the null hypothesis or an alternative hypothesis is true. We cannot even derive probabilities for hypotheses based on what has delusively become known as “hypothesis *testing*.” In the usual sense of the word, a *p*-value cannot “test” or “confirm” a hypothesis, but only describe data frequencies under a certain statistical model including the null hypothesis.

**Suggestion 10:** Restrict the use of the word “evidence” to the concrete findings in your data and clearly distinguish this evidence from your inferential conclusions, i. e. the generalizations you make based on your study and all other available evidence (see also suggestion 14).

#### Things to do and discuss explicitly

**Suggestion 11:** Do explicitly state whether your study is *exploratory* and thus aimed at generating new research questions/hypotheses (“ex post hypotheses”), which might be termed “hypothesizing after results are known,” or whether you aim at producing new evidence with regard to *pre-specified* research questions/hypotheses (“ex ante hypotheses”). If your paper contains both types of study, explicitly communicate *where* you change from the study of pre-specified issues to exploratory search.

**Suggestion 12:** In *exploratory* search for potentially interesting associations, never use the term “*hypothesis* testing” because you have no testable ex ante hypotheses. But large effect sizes in conjunction with low *p*-values may be useful as a flagging device to identify ex post hypotheses that might be worth investigating with new data in the future. To prevent overhasty generalizations from such an unconstrained “search for discoveries” in a sample, it might be worthwhile considering Berry’s (2017: 897) recommendation to use the following warning: “Our study is exploratory and we make no claims for generalizability. Statistical calculations such as *p*-values and confidence intervals are descriptive only and have no inferential content.”

**Suggestion 13:** If your study is (what would be traditionally called) “confirmatory” (see suggestion 9), i. e. aimed at producing evidence regarding *pre-specified* research questions/hypotheses, exactly report in your paper the list of questions/hypotheses that you drafted as well as the model you specified regarding structure and variable set *before* seeing the data. In the results section, clearly relate findings to these ex ante questions or hypotheses. While the study of ex ante specified hypotheses is conventionally termed “confirmatory analysis” and “hypotheses testing,” these terms should be avoided or at least put into proper perspective. They might mislead people to expect categorical yes/no answers that we cannot give (see also suggestion 9).

**Suggestion 14:** When studying pre-specified questions or hypotheses, clearly distinguish two parts in the analysis: (i) the description of the empirical *evidence* (estimated effect sizes) that you happened to find in your single study (What is the evidence in the data?); (ii) the *inferential reasoning* that you base on this evidence under consideration of the study design, *p*-values, confidence intervals, and external evidence (What should one reasonably believe after seeing the data?). If applicable, a third part should outline the recommendations or *decisions* that you would make all things considered including the weights attributed to type I and type II errors (What should one do after seeing the data?).

**Suggestion 15:** If you fit your model to the data even though you are concerned with pre-specified hypotheses, explicitly demonstrate that your data-contingent model specification does *not* constitute “hypothesizing after the results are known.” When using *p*-values as a tool to support inferences, *explicitly* consider and comment on multiple comparisons. Doing so, distinguish between (i) the multiple comparisons that you make when you perform more than one test on one data set in your final multiple regression model, and (ii) the multiple comparisons that you make if you tried multiple models before retaining one model as the “best” model. If appropriate, use robustness checks to show whether your findings are substantially stable over a reasonable range of analytical variants.

**Suggestion 16:** Explicitly distinguish between statistical and scientific inference. In the random sampling case, for example, *statistical inference* is concerned with the fact that even a random sample does not exactly reflect the properties of the population (sampling error). Generalizing from a random sample to its population is only the first step of *scientific inference*, which is the totality of reasoned judgments (inductive generalizations) that we make in the light of our own study and the available body of external evidence. We might want to know, for example, what we can learn from a random sample of a country’s agricultural students for its student population, or even people in general. Be clear that a *p*-value can do *nothing* to assess the generalizability of results beyond the parent population (here: the country’s *agricultural* students) from which the random sample has been drawn.

#### Operative rules

**Suggestion 17:** Provide information regarding the size of your estimate (point estimate). In many regression models, a meaningful representation of magnitudes will require going beyond coefficient estimates and displaying marginal effects or other measures of effect size.

**Suggestion 18:** Do not use asterisks (or the like) to denote different levels of “statistical significance.” Doing so could instigate erroneous categorical reasoning.

**Suggestion 19:** Provide *p*-values if you use the graded strength of evidence against the null as an argument (amongst others) to make inferences. However, do not classify results as being “statistically significant” or not. That is, avoid using the terms “statistically significant” and “statistically non-significant” altogether. Dispensing with these two categorical labels would enable you for the first time to use “relevant” and “significant” as interchangeable terms. However, to avoid confusion, it might be better to steer clear of the term “significant” altogether.

**Suggestion 20:** Provide standard errors for all effect size estimates. Additionally, provide confidence intervals for the focal variables of interest associated with your pre-specified research questions/hypotheses.

We believe that our suggestions represent practically significant steps towards improvement, but we do not expect that all empirical economists will endorse all of them at once. Some suggestions, such as providing effect size measures and displaying standard errors, are likely to cause little controversy. Others, such as renouncing dichotomous significance declarations and giving up the term “statistical significance” altogether, will possibly be questioned. Opposition against giving up the conventional and “neat” yes/no declarations is likely to be fueled by the fact that no consensus has yet been reached as to which formulations are appropriate and foolproof to avoid cognitive biases and communicate the correct meaning of frequentist concepts such as *p*-values and confidence intervals.

First of all, we need intuitive and correct formulations that ensure that the *p*-value is understood as a *graded* measure of the strength of evidence *against the null*. Which wording is appropriate to convey the information of a *p*-value of, let’s say, 0.37 as opposed to 0.12 or 0.06 or 0.005 – for large and small effects, respectively? Our troubles do not come as a surprise since the difficulties of translating the *p*-value concept adequately into natural language are at the heart of the problem. Berry (2017: 896) puts it in a nutshell:

> I forgive nonstatisticians who cannot provide a correct interpretation of *p* < 0.05. *p*-Values are fundamentally un-understandable. I cannot forgive statisticians who give understandable—and therefore wrong—definitions of *p*-values to their nonstatistician colleagues. But I have some sympathy for their tack. If they provide a correct definition, then they will end up having to disagree with an unending sequence of ‘in other words’. And the colleague will come away confused […].

While this statement may seem overly pessimistic, we agree with the problem description. The only way out is to *find* and *agree* on formulations that convey the limited but existing informational content of the *p*-value in both a *correct* and *meaningful* way; otherwise, we had better abandon its use altogether.

We also need formulations that provide an intuitive and correct interpretation of confidence intervals (CIs). Imagine observing a mean difference of 10 g in daily weight gains between two groups of animals that were randomly assigned to different dietary treatments. Finding a 95 % CI of [8, 12], it would not be correct to say that the difference is between 8 g and 12 g with 95 % probability. Not much change for the better is obtained by replacing “probable” by “plausible” or “confident.” Stating, for example, that “we can be 95 % confident that the difference is between 8 g and 12 g” is extremely deceptive. While such statements sound like uncertainty statements, they promise too much certainty. In the words of Gelman (2016), they are to be qualified as “uncertainty laundering” because they neglect the inherent uncertainty of the CI itself. A correct interpretation requires realizing that CIs (analogous to *p*-values) are noisy and vary from one random sample to the other. A 95 % CI only means that 95 % of the CIs computed for repeatedly drawn random samples will capture the “true” value (Greenland et al. 2016). Most formulations in empirical papers and even textbooks, however, seem to communicate in one way or the other that a CI provides the probability that the “true” effect size (population parameter) is within the stated interval. They thus insinuate that we could make epistemic probability statements regarding population effect sizes based on the results of a single study. Such statements must be reserved for Bayesian analysis (here: the Bayesian posterior probability interval), however.
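
The replication interpretation of the 95 % CI can be checked by simulation (our own sketch; it reuses the 10 g difference from the example above, while the baseline, noise level, and sample size are arbitrary assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_diff, sd, n, reps = 10.0, 4.0, 30, 5_000
covered = 0
for _ in range(reps):
    a = rng.normal(500.0, sd, n)               # daily gains, diet A
    b = rng.normal(500.0 + true_diff, sd, n)   # daily gains, diet B
    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    t_crit = stats.t.ppf(0.975, df=2 * n - 2)  # two-sample t interval
    lo, hi = diff - t_crit * se, diff + t_crit * se
    covered += lo <= true_diff <= hi
print(f"share of 95 % CIs covering the true 10 g difference: {covered / reps:.3f}")
```

The coverage statement holds over the ensemble of replications (about 95 % of the intervals contain the true difference); it is not a probability statement about any single computed interval.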

### 3 Reforms under way and outlook

Both the accumulation of knowledge and technological developments in computing continuously shift what constitutes best methodological practice in statistical analysis. Given these dynamics, persistent traditions as well as tight but rarely scrutinized journal guidelines may slow down or even prevent necessary change. While overly rigid formal rules in the process of publication are detrimental with respect to dynamic adjustments, guidelines can also be a pertinent means to disseminate new best-practice procedures and induce overdue change in inert disciplinary traditions.

Trying to get an impression of the reforms in publishing economic research, we asked the editors of 100 leading economics journals (Scimago Journal & Country Rank) about policy changes with respect to the use of *p*-values.
Overall, journals seem still a long way from translating a significant portion of recent reform suggestions into concrete journal policies. Despite the prominence of the current *p*-value debate, a substantial share of editors believe that their reviewing systems are sufficiently effective to prevent inferential errors. Consequently, they do not see a need to bring about formal change. Some journal editors, however, seem to be seriously worried about the misuses and misinterpretations of the *p*-value. What is more, some editorial boards are deliberating concrete steps or have already started reforming their guidelines to prevent misleading practices and inferential errors. It was interesting to learn that these reforms represent a subset of our suggestions. For example, leading journals, such as the *American Economic Review*, *Econometrica*, and the four AEJs (*Applied Economics, Economic Policy, Macroeconomics, Microeconomics*), now request authors not to use asterisks or other symbols to denote “statistical significance.” While this seems a small change, it breaks with a convention that many economists considered to be set in stone. The basic idea behind banning asterisks is to prevent overconfident yes/no conclusions. The *American Economic Review* and *Econometrica* furthermore request authors to explicitly report effect sizes and display standard errors (or even confidence intervals) *instead of* *p*-values in results tables, but they do not explicitly ban *p*-values. This is consistent with a suggestion put forward by many critical voices in the recent debate – to demote *p*-values from their pedestal and consider them as a tool amongst many that help make appropriate inferences (cf., e. g. Amrhein et al. 2017; McShane et al. 2017; Trafimow et al. 2018). 
The fact that some leading economics journals, which are widely considered beacons for best practice, have initiated modest but sensible changes with regard to the use of the *p*-value is a promising signal. Goodman (2017: 559) notes that “norms are established within communities partly through methodological mimicry.” If a field’s flagship journals, opinion leaders, and professional associations take the lead, they may be able to set a trend. “Once the process starts, it could be self-enforcing. Scientists will follow practices they see in publications; peer reviewers will demand what other reviewers demand of them.”

Besides avoiding mistakes within a single study, making appropriate inferences requires considering the body of evidence, including all prior knowledge in the field under research. Several reforms beyond the realm of the single study have been suggested. Meta-analyses that systematically consolidate the body of evidence and Bayesian methods that formally consider prior knowledge are prominent examples. Furthermore, two reforms on the institutional level are practically important in some disciplines but only nascent in others. First, many leading journals now oblige authors to provide their raw data and analytical protocols in the appendix to facilitate *replication* studies scrutinizing a study’s findings. While compulsory sharing of raw data and analytical protocols seems to be slowly trickling down to more and more journals, institutionalized efforts to actually promote replication studies are weak in economics compared to other fields. According to Duvendack et al. (2015, 2017), most of the 333 economics Web-of-Science journals still give low priority to replication. The same holds for initiatives targeted at counteracting publication bias. While a global initiative, *All Trials Registered/All Results Reported*, was launched in 2013 in the medical sciences, similar efforts are rare in economics. Among the few exceptions are *The Replication Network* and *Replication in Economics*. Both platforms aim at fostering the scrutiny of scientific claims and at counteracting publication bias by providing not only databases for replications but also equal opportunities for publishing positive and negative results. Two recent activities indicate that problem awareness in economics is growing: the *American Economic Review* published eight short papers about *replication* in its 5th issue of 2017, and *Economics* published a special issue, *The Practice of Replication*, in 2018.

Another important reform on the institutional level is *pre-registration*. Pre-registration goes beyond the mere appeal to honestly report *pre-specified* hypotheses and analysis plans. Instead, it obliges researchers to disclose their hypotheses, data, and analytical approach *before* running the analysis. Analyses that deviate from the pre-analysis plan must be justified in the final paper. Pre-registration aims at preventing covert multiple comparisons (*p*-hacking) and at providing equal chances of being published independent of which results are eventually found. In other words, it is meant to prevent not only selective reporting but also selective publishing and thus the bias towards “statistically significant” findings (Rosenthal 1979), which seems to be widespread even in economics flagship journals (Brodeur et al. 2016). In contrast to fields such as clinical drug trials, for which pre-registration is standard (*http://www.who.int/ictrp/network/primary/en/*), it is as yet rare in the social sciences. However, two innovative initiatives have been launched recently. The *American Economic Association* started pre-registering randomized controlled trials on its AEA RCT platform in 2013. Since 2017, study designs and analysis plans have had to meet some formal criteria to be published before the data are collected. An even further-reaching pilot program was started by the *Journal of Development Economics* in 2018 to test how pre-results peer reviews (“blind reviews”) can contribute to better science in economics. The pilot gives researchers who intend to use data that are yet to be collected “the opportunity to have their prospective empirical projects reviewed and approved for publication *before* the results are known.”
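The inflation of false positives through covert multiple comparisons, which pre-registration is meant to prevent, can be illustrated with a small simulation. The following stdlib-only Python sketch is our own illustration, not part of the paper’s argument; the function names `corr_pvalue` and `simulate` are hypothetical. Each simulated “study” regresses pure noise on pure noise: an honest researcher runs one pre-specified test, whereas a *p*-hacking researcher tries ten specifications and reports the best one.

```python
import math
import random

def corr_pvalue(x, y):
    """Two-sided p-value for a Pearson correlation test, using the
    normal approximation to the t distribution (adequate for n ~ 100)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return math.erfc(abs(t) / math.sqrt(2))  # ~ 2 * (1 - Phi(|t|))

def simulate(n_studies=2000, n_obs=100, n_specs=10, seed=1):
    """Fraction of pure-noise studies declared 'significant' at 0.05:
    once for a single pre-specified test, once for the best of n_specs."""
    random.seed(seed)
    hits_single = hits_hacked = 0
    for _ in range(n_studies):
        x = [random.gauss(0, 1) for _ in range(n_obs)]
        # n_specs independent pure-noise "specifications" of the outcome
        pvals = [corr_pvalue(x, [random.gauss(0, 1) for _ in range(n_obs)])
                 for _ in range(n_specs)]
        hits_single += pvals[0] < 0.05    # honest: one pre-specified test
        hits_hacked += min(pvals) < 0.05  # hacked: report the best of n_specs
    return hits_single / n_studies, hits_hacked / n_studies

single, hacked = simulate()
print(f"false-positive rate, one pre-specified test:     {single:.3f}")
print(f"false-positive rate, best of 10 specifications:  {hacked:.3f}")
```

With ten independent tries, the theoretical family-wise error rate is 1 − 0.95¹⁰ ≈ 0.40 rather than the nominal 0.05, and the simulation reproduces roughly this gap. Pre-registered analysis plans make such undisclosed specification searches visible.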

The poor development of replication, meta-analysis, and pre-registration in economics stems from the discipline’s culture and its focus on observational studies, data-driven model specification, and multiple regression analysis. In other words, there are questions to be answered before approaches from other fields can be transplanted to (non-experimental) economic research. It is not clear, for example, how pre-registration or replication should work within a culture in which it is not only common but highly appreciated to specify regression models *after* seeing the data (“model fitting”). If one accepts the model-fitting exercise, one would have to pre-register full decision trees. Furthermore, pre-registration of studies based on pre-existing data that are already available to the research community before registration does not seem to make much sense. In addition, the question arises of how to carry out quantitative meta-analysis and consolidate the body of evidence when, even within a narrow field of research, there are often as many data-dependent model specifications as studies. The fact that economic research is mainly a bottom-up exercise is responsible for the lack of comparability across studies. Non-programmed bottom-up research produces a large quantity of empirical results on topical issues but is plagued by an enormous heterogeneity of empirical measures and model specifications. Besides differing measures for the focal variables of interest, data-based models are regularly populated by differing interaction terms, transformed variables, lagged variables, higher-order polynomials, and control variables. Given the heterogeneity of econometric models, applied economists need consensus regarding the legitimacy and meaning of specification search as well as regarding best practices for replication and meta-analysis.

## Acknowledgments

We owe a special debt to Andrew Gelman (Columbia University) who gave us helpful comments and criticism on our suggestions. Any remaining errors are our own. We would like to thank the German Research Foundation for financial support.

## References

Altman, N., M. Krzywinski (2017), Points of Significance: P Values and the Search for Significance. Nature Methods 14 (1): 3–4. doi:10.1038/nmeth.4120.

Amrhein, V., F. Korner-Nievergelt, T. Roth (2017), The Earth Is Flat (p > 0.05): Significance Thresholds and the Crisis of Unreplicable Research. https://peerj.com/preprints/2921.pdf. doi:10.7287/peerj.preprints.2921v2.

Bancroft, T.A. (1944), On Biases in Estimation Due to the Use of Preliminary Tests of Significance. Annals of Mathematical Statistics 15 (2): 190–204. doi:10.1214/aoms/1177731284.

Benjamini, Y. (2016), It’s Not the *P*-Values’ Fault. The American Statistician 70 (2): Supplemental Material to the ASA Statement on P-Values and Statistical Significance. http://amstat.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108/suppl_file/utas_a_1154108_sm5354.pdf.

Berry, D. (2017), A p-Value to Die For. Journal of the American Statistical Association 112 (519): 895–897. doi:10.1080/01621459.2017.1316279.

Brodeur, A., M. Lé, M. Sangnier, Y. Zylberberg (2016), Star Wars: The Empirics Strike Back. American Economic Journal: Applied Economics 8 (1): 1–32. doi:10.1257/app.20150044.

Cohen, J. (1994), The Earth Is Round (p < .05). American Psychologist 49 (12): 997–1003. doi:10.1037/0003-066X.49.12.997.

Colquhoun, D. (2014), An Investigation of the False Discovery Rate and the Misinterpretation of P-Values. Royal Society Open Science 1: 140216. doi:10.1098/rsos.140216.

Danilov, D., J.R. Magnus (2004), On the Harm that Ignoring Pretesting Can Cause. Journal of Econometrics 122 (1): 27–46. doi:10.1016/j.jeconom.2003.10.018.

Denton, F.T. (1988), The Significance of Significance: Rhetorical Aspects of Statistical Hypothesis Testing in Economics. 163–193 in: A. Klamer, D.N. McCloskey, R.M. Solow (eds.), The Consequences of Economic Rhetoric. Cambridge: Cambridge University Press. doi:10.1017/CBO9780511759284.013.

Duvendack, M., R. Palmer-Jones, W.R. Reed (2015), Replications in Economics: A Progress Report. Econ Journal Watch 12 (2): 164–191.

Duvendack, M., R. Palmer-Jones, W.R. Reed (2017), What Is Meant by “Replication” and Why Does It Encounter Resistance in Economics? American Economic Review 107 (5): 46–51. doi:10.1257/aer.p20171031.

Fisher, R.A. (1925), Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd.

Forstmeier, W., E.-J. Wagenmakers, T.H. Parker (2016), Detecting and Avoiding Likely False-Positive Findings – A Practical Guide. Biological Reviews of the Cambridge Philosophical Society 92 (4): 1941–1968. doi:10.1111/brv.12315.

Gelman, A. (2016), The Problems with P-Values Are Not Just with P-Values. The American Statistician 70 (2): Supplemental Material to the ASA Statement on P-Values and Statistical Significance.

Gelman, A., J. Carlin (2017), Some Natural Solutions to the *P*-Value Communication Problem – And Why They Won’t Work. Journal of the American Statistical Association 112 (519): 899–901. doi:10.1080/01621459.2017.1311263.

Gelman, A., E. Loken (2014), The Statistical Crisis in Science. American Scientist 102: 460–465. doi:10.1511/2014.111.460.

Gelman, A., H. Stern (2006), The Difference between “Significant” and “Not Significant” Is Not Itself Statistically Significant. The American Statistician 60 (4): 328–331. doi:10.1198/000313006X152649.

Gigerenzer, G., J.N. Marewski (2015), Surrogate Science: The Idol of a Universal Method for Statistical Inference. Journal of Management 41 (2): 421–440. doi:10.1177/0149206314547522.

Goodman, S. (2008), A Dirty Dozen: Twelve P-Value Misconceptions. Seminars in Hematology 45: 135–140. doi:10.1053/j.seminhematol.2008.04.003.

Goodman, S.N. (2017), Change Norms from Within. Nature 551: 559.

Greenland, S. (2017), Invited Commentary: The Need for Cognitive Science in Methodology. American Journal of Epidemiology 186 (6): 639–645. doi:10.1093/aje/kwx259.

Greenland, S., S.J. Senn, K.J. Rothman, J.B. Carlin, C. Poole, S.N. Goodman, D.G. Altman (2016), Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations. European Journal of Epidemiology 31 (4): 337–350. doi:10.1007/s10654-016-0149-3.

Halsey, L.G., D. Curran-Everett, S.L. Vowler, B. Drummond (2015), The Fickle P Value Generates Irreproducible Results. Nature Methods 12 (3): 179–185. doi:10.1038/nmeth.3288.

Hirschauer, N., S. Grüner, O. Mußhoff, C. Becker (2018), Pitfalls of Significance Testing and *p*-Value Variability: An Econometrics Perspective. Statistics Surveys 12: 136–172. doi:10.1214/18-SS122.

Hirschauer, N., O. Mußhoff, S. Grüner, U. Frey, I. Theesfeld, P. Wagner (2016), Die Interpretation des *p*-Wertes – Grundsätzliche Missverständnisse. Journal of Economics and Statistics 236 (5): 557–575. doi:10.1515/jbnst-2015-1030.

Ioannidis, J., C. Doucouliagos (2013), What’s to Know about the Credibility of Empirical Economics? Journal of Economic Surveys 27 (5): 997–1004. doi:10.1111/joes.12032.

Ioannidis, J.P.A. (2005), Why Most Published Research Findings Are False. PLoS Medicine 2 (8): e124.

Kline, R.B. (2013), Beyond Significance Testing: Statistics Reform in the Behavioral Sciences. Washington, DC: American Psychological Association (Second edition). doi:10.1037/14136-000.

Krämer, W. (2011), The Cult of Statistical Significance – What Economists Should and Should Not Do to Make Their Data Talk. Schmollers Jahrbuch 131 (3): 455–468. doi:10.3790/schm.131.3.455.

Leamer, E.E. (1978), Specification Searches: Ad Hoc Inference with Nonexperimental Data. New York: Wiley.

Lehmann, E.L., J.P. Romano (2010), Testing Statistical Hypotheses. New York: Springer (Third edition).

McCloskey, D.N., S.T. Ziliak (1996), The Standard Error of Regressions. Journal of Economic Literature 34 (1): 97–114.

McShane, B., D. Gal (2017), Statistical Significance and the Dichotomization of Evidence. Journal of the American Statistical Association 112 (519): 885–908. doi:10.1080/01621459.2017.1289846.

McShane, B., D. Gal, A. Gelman, C. Robert, J.L. Tackett (2017), Abandon Statistical Significance. https://arxiv.org/pdf/1709.07588.pdf.

Motulsky, H.J. (2014), Common Misconceptions about Data Analysis and Statistics. The Journal of Pharmacology and Experimental Therapeutics 351 (8): 200–205. doi:10.1124/jpet.114.219170.

Neyman, J., E.S. Pearson (1933), On the Problem of the Most Efficient Tests of Statistical Hypotheses. Philosophical Transactions of the Royal Society of London A 231: 289–337. doi:10.1098/rsta.1933.0009.

Rosenthal, R. (1979), The File Drawer Problem and Tolerance for Null Results. Psychological Bulletin 86 (3): 638–641. doi:10.1037/0033-2909.86.3.638.

Sellke, T., M.J. Bayarri, J.O. Berger (2001), Calibration of *p*-Values for Testing Precise Null Hypotheses. The American Statistician 55 (1): 61–71.

Simmons, J.P., L.D. Nelson, U. Simonsohn (2011), False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science 22 (11): 1359–1366. doi:10.1177/0956797611417632.

Trafimow, D., et al. (2018), Manipulating the Alpha Level Cannot Cure Significance Testing. Frontiers in Psychology 9. https://www.frontiersin.org/articles/10.3389/fpsyg.2018.00699/full. doi:10.3389/fpsyg.2018.00699.

Vogt, W.P., E.R. Vogt, D.C. Gardner, L.M. Haeffele (2014), Selecting the Right Analyses for Your Data: Quantitative, Qualitative, and Mixed Methods. New York: The Guilford Press.

Wasserstein, R.L., N.A. Lazar (2016), The ASA’s Statement on P-Values: Context, Process, and Purpose. The American Statistician 70 (2): 129–133. doi:10.1080/00031305.2016.1154108.

Ziliak, S.T. (2016), Statistical Significance and Scientific Misconduct: Improving the Style of the Published Research Paper. Review of Social Economy 74 (1): 83–97. doi:10.1080/00346764.2016.1150730.

Ziliak, S.T., D.N. McCloskey (2008), The Cult of Statistical Significance. How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor: The University of Michigan Press. doi:10.3998/mpub.186351.

**Received:** 2018-09-14

**Revised:** 2018-11-20

**Accepted:** 2018-12-20

**Published Online:** 2019-03-20

**Published in Print:** 2019-07-26

© 2019 Oldenbourg Wissenschaftsverlag GmbH, Published by De Gruyter Oldenbourg, Berlin/Boston
