## Abstract

Clinical studies and economic experiments are often conducted as randomized controlled trials. In clinical studies, power calculations are carried out as a standard. But what about economic experiments? After describing the basic idea of the calculation procedure in a brief tutorial, I examine the practice of sample size calculations in experimental economics by reviewing the publications of five economics journals in the period 2000–2018. These are two top-ranked economics journals (*Quarterly Journal of Economics* and *American Economic Review*), the leading field journals in experimental economics (*Experimental Economics*) and behavioral sciences (*Journal of Economic Behavior and Organization*), and a leading field journal in environmental economics (*Environmental and Resource Economics*). In contrast to clinical drug trials, sample size calculations have rarely been carried out by experimental economists. However, the number of power calculations has increased slightly in recent years, especially in the top-ranked economics journals. This can be partly explained by the fact that field experiments (in which scholars currently pay more attention to power analyses than in lab experiments) play an important role in these journals.

## Acknowledgements

I would like to thank two anonymous reviewers for their comments, ideas, and criticism, and the editor, Professor Peter Winker. In addition, I am grateful for the financial support provided by the German Research Foundation (DFG) – 388911356.

## References

Aguiar-Conraria, L., P.C. Magalhães, C.A. Vanberg (2016), Experimental Evidence that Quorum Rules Discourage Turnout and Promote Election Boycotts. Experimental Economics 19 (4): 886–909. doi: 10.1007/s10683-015-9473-9.

Amrhein, V., F. Korner-Nievergelt, T. Roth (2017), The Earth Is Flat (p>0.05): Significance Thresholds and the Crisis of Unreplicable Research. PeerJ 5: e3544. doi: 10.7717/peerj.3544.

Baird, S., C. McIntosh, B. Özler (2011), Cash or Condition? Evidence from a Cash Transfer Experiment. The Quarterly Journal of Economics 126 (4): 1709–1753. doi: 10.1093/qje/qjr032.

Barham, B.L., J.-P. Chavas, D. Fitz, V.R. Salas, L. Schechter (2014), The Roles of Risk and Ambiguity in Technology Adoption. Journal of Economic Behavior & Organization 97: 204–218. doi: 10.1016/j.jebo.2013.06.014.

Benati, L., P. Surico (2009), VAR Analysis and the Great Moderation. The American Economic Review 99 (4): 1636–1652. doi: 10.1257/aer.99.4.1636.

Benjamin, D.J. et al. (2017), Redefine Statistical Significance. Nature Human Behaviour. https://www.nature.com/articles/s41562-017-0189-z.

Berry, D.A. (2016), P-Values are Not What They’re Cracked up to Be. Online Discussion: ASA Statement on Statistical Significance and P-values. The American Statistician 70 (2): 1–2.

Bettinger, E.P., B.T. Long, P. Oreopoulos, L. Sanbonmatsu (2012), The Role of Application Assistance and Information in College Decisions: Results from the H&R Block FAFSA Experiment. The Quarterly Journal of Economics 127 (3): 1205–1242. doi: 10.1093/qje/qjs017.

Brandon, A., J.A. List (2015), Markets for Replication. Proceedings of the National Academy of Sciences of the United States of America 112 (50): 15267–15268. doi: 10.1073/pnas.1521417112.

Cadsby, C.B., M. Servátka, F. Song (2010), Gender and Generosity: Does Degree of Anonymity or Group Gender Composition Matter? Experimental Economics 13 (3): 299–308. doi: 10.1007/s10683-010-9242-8.

Callen, M., M. Isaqzadeh, J.D. Long, C. Sprenger (2014), Violence and Risk Preference: Experimental Evidence from Afghanistan. The American Economic Review 104 (1): 123–148. doi: 10.1257/aer.104.1.123.

Camerer, C., et al. (2016), Evaluating Replicability of Laboratory Experiments in Economics. Science 351: 1433–1436. doi: 10.1126/science.aaf0918.

Candelo, N., R.T.A. Croson, C. Eckel (2018), Transmission of Information within Transnational Social Networks: A Field Experiment. Experimental Economics 21 (4): 905–923. doi: 10.1007/s10683-017-9557-9.

Casari, M., J.C. Ham, J.H. Kagel (2007), Selection Bias, Demographic Effects, and Ability Effects in Common Value Auction Experiments. American Economic Review 97 (4): 1278–1304. doi: 10.1257/aer.97.4.1278.

Christensen, E. (2007), Methodology of Superiority vs. Equivalence Trials and Non-inferiority Trials. Journal of Hepatology 46 (5): 947–954. doi: 10.1016/j.jhep.2007.02.015.

Cooper, D.J., J.H. Kagel, W. Lo, L.Q. Gu (1999), Gaming against Managers in Incentive Systems: Experiments with Chinese Students and Chinese Managers. American Economic Review 89 (4): 781–804. doi: 10.1257/aer.89.4.781.

Cummings, R.G., J. Martinez-Vazquez, M. McKee, B. Torgler (2009), Tax Morale Affects Tax Compliance: Evidence from Surveys and an Artefactual Field Experiment. Journal of Economic Behavior & Organization 70 (3): 447–457. doi: 10.1016/j.jebo.2008.02.010.

Deming, D.J., N. Yuchtman, A. Abulafi, C. Goldin, L.F. Katz (2016), The Value of Postsecondary Credentials in the Labor Market: An Experimental Study. American Economic Review 106 (3): 778–806. doi: 10.3386/w20528.

Dickhaut, J., D. Houser, J.A. Aimone, D. Tila, C. Johnson (2013), High Stakes Behavior with Low Payoffs: Inducing Preferences with Holt–Laury Gambles. Journal of Economic Behavior & Organization 94: 183–189. doi: 10.1016/j.jebo.2013.03.036.

Dreber, A., E. von Essen, E. Ranehill (2011), Outrunning the Gender Gap – Boys and Girls Compete Equally. Experimental Economics 14 (4): 567–582. doi: 10.1007/s10683-011-9282-8.

Dunning, T. (2012), Natural Experiments in the Social Sciences: A Design-based Approach. Cambridge: Cambridge University Press. doi: 10.1017/CBO9781139084444.

Fehr, E., S. Gächter (2000), Cooperation and Punishment in Public Goods Experiments. American Economic Review 90 (4): 980–994. doi: 10.1257/aer.90.4.980.

Filiz-Ozbay, E., J.C. Ham, J.H. Kagel, E.Y. Ozbay (2018), The Role of Cognitive Ability and Personality Traits for Men and Women in Gift Exchange Outcomes. Experimental Economics 21 (3): 650–672. doi: 10.1007/s10683-016-9503-2.

Flurkey, K., J.M. Currer, D.E. Harrison (2007), Mouse Models in Aging Research. 637–672 in: J.G. Fox, et al. (eds.), The Mouse in Biomedical Research, vol. 3. Amsterdam: Elsevier. doi: 10.1016/B978-012369454-6/50074-1.

Galiani, S., P. Gertler, M. Romero (2017), Incentives for Replication in Economics. Tech. Rept. National Bureau of Economic Research. https://www.nber.org/papers/w23576.pdf. doi: 10.3386/w23576.

Giné, X., J. Goldberg, D. Yang (2012), Credit Market Consequences of Improved Personal Identification: Field Experimental Evidence from Malawi. The American Economic Review 102 (6): 2923–2954. doi: 10.3386/w17449.

Gjedrem, W.G., M. Rege (2017), The Effect of Less Autonomy on Performance in Retail: Evidence from a Quasi-natural Field Experiment. Journal of Economic Behavior & Organization 136: 76–90. doi: 10.1016/j.jebo.2017.02.008.

Guala, F. (2005), The Methodology of Experimental Economics. Cambridge: Cambridge University Press. doi: 10.1017/CBO9780511614651.

Güth, W., R. Schmittberger, B. Schwarze (1982), An Experimental Analysis of Ultimatum Bargaining. Journal of Economic Behavior & Organization 3 (4): 367–388. doi: 10.1016/0167-2681(82)90011-7.

Hickey, G.L., S.W. Grant, J. Dunning, M. Siepe (2018), Statistical Primer: Sample Size and Power Calculations – Why, When and How? European Journal of Cardio-Thoracic Surgery 54 (1): 4–9. doi: 10.1093/ejcts/ezy169.

Higuchi, Y., V.H. Nam, T. Sonobe (2015), Sustained Impacts of Kaizen Training. Journal of Economic Behavior & Organization 120: 189–206. doi: 10.1016/j.jebo.2015.10.009.

Hirschauer, N., S. Grüner, O. Mußhoff, C. Becker (2019), Twenty Steps Towards an Adequate Inferential Interpretation of P-values in Econometrics. Journal of Economics and Statistics 239 (4): 703–721. doi: 10.1515/jbnst-2018-0069.

Holm, S. (1979), A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics 6 (2): 65–70.

Huber, C. (2019), https://blog.stata.com/2019/01/10/calculating-power-using-monte-carlo-simulations-part-1-the-basics/#disqus_thread.

Jacquemet, N., S. Luchini, J.F. Shogren, A. Zylbersztejn (2018), Coordination with Communication under Oath. Experimental Economics 21 (3): 627–649. doi: 10.1007/s10683-016-9508-x.

Julious, S.A. (2004), Tutorial in Biostatistics. Sample Sizes for Clinical Trials with Normal Data. Statistics in Medicine 23 (12): 1921–1986. doi: 10.1002/sim.1783.

Julious, S.A., M.J. Campbell (2012), Tutorial in Biostatistics: Sample Sizes for Parallel Group Clinical Trials with Binary Data. Statistics in Medicine 31 (24): 2904–2936. doi: 10.1002/sim.5381.

Levitt, S.D., J.A. List (2007), What Do Laboratory Experiments Measuring Social Preferences Reveal about the Real World? The Journal of Economic Perspectives 21 (2): 153–174. doi: 10.1093/acprof:oso/9780195328325.003.0015.

List, J.A., S. Sadoff, M. Wagner (2011), So You Want to Run an Experiment, Now What? Some Simple Rules of Thumb for Optimal Experimental Design. Experimental Economics 14: 439–457. doi: 10.3386/w15701.

List, J.A., A.M. Shaikh, Y. Xu (2019), Multiple Hypothesis Testing in Experimental Economics. Experimental Economics 22: 773–793. doi: 10.3386/w21875.

Maniadis, Z., F. Tufano, J. List (2014), One Swallow Doesn’t Make a Summer: New Evidence on Anchoring Effects. American Economic Review 104 (1): 277–290. doi: 10.1257/aer.104.1.277.

Morgan, J., D. Ong, Z.Z. Zhong (2018), Location Still Matters: Evidence from an Online Shopping Field Experiment. Journal of Economic Behavior & Organization 146: 43–54. doi: 10.1016/j.jebo.2017.11.021.

Nguyen, T.-L., P. Landais (2017), Randomized Controlled Trials: Significant Results – Fragile, Though. Kidney International 92 (6): 1319–1320. doi: 10.1016/j.kint.2017.06.021.

Noordzij, M., G. Tripepi, F.W. Dekker, C. Zoccali, M.W. Tanck, K.J. Jager (2010), Sample Size Calculations: Basic Principles and Common Pitfalls. Nephrology Dialysis Transplantation 25: 1388–1393. doi: 10.1093/ndt/gfp732.

Persson, E. (2018), Testing the Impact of Frustration and Anger When Responsibility is Low. Journal of Economic Behavior & Organization 145: 435–448. doi: 10.1016/j.jebo.2017.12.001.

Resnick, P., R. Zeckhauser, J. Swanson, K. Lockwood (2006), The Value of Reputation on eBay: A Controlled Experiment. Experimental Economics 9 (2): 79–101. doi: 10.1007/s10683-006-4309-2.

Riyanto, Y.E., Y.X.W. Jonathan (2018), Directed Trust and Trustworthiness in a Social Network: An Experimental Investigation. Journal of Economic Behavior & Organization 151: 234–253. doi: 10.1016/j.jebo.2018.04.005.

Roe, B.E., D.R. Just (2009), Internal and External Validity in Economics Research: Tradeoffs between Experiments, Field Experiments, Natural Experiments, and Field Data. American Journal of Agricultural Economics 91 (5): 1266–1271. doi: 10.1111/j.1467-8276.2009.01295.x.

Roux, C., C. Thöni (2015), Do Control Questions Influence Behavior in Experiments? Experimental Economics 18 (2): 185–194. doi: 10.1007/s10683-014-9395-y.

Schechter, L. (2007), Traditional Trust Measurement and the Risk Confound: An Experiment in Rural Paraguay. Journal of Economic Behavior & Organization 62 (2): 272–292. doi: 10.1016/j.jebo.2005.03.006.

Senn, S.J. (2002), Power is Indeed Irrelevant in Interpreting Completed Studies. British Medical Journal 325 (7375): 1304. doi: 10.1136/bmj.325.7375.1304.

Snedecor, G.W., G.W. Cochran (1980), Statistical Methods. Ames: The Iowa State University Press.

Wasserstein, R.L., A.L. Schirm, N.A. Lazar (2019), Moving to a World beyond “P<0.05”. The American Statistician 73: 1–19. doi: 10.1080/00031305.2019.1583913.

Whitley, E., J. Ball (2002), Statistics Review 4: Sample Size Calculations. Critical Care 6 (334): 335–341. doi: 10.1186/cc1521.

Zelmer, J. (2003), Linear Public Goods Experiments: A Meta-Analysis. Experimental Economics 6 (3): 299–310. doi: 10.1023/A:1026277420119.

Ziegelmeyer, A., K. Schmelz, M. Ploner (2012), Hidden Costs of Control: Four Repetitions and an Extension. Experimental Economics 15 (2): 323–340. doi: 10.1007/s10683-011-9302-8.

# Appendix: Power/sample size calculations

## Appendix 1 Power/sample size calculations in the QJE

Authors | Type of experiment | Evidence |
---|---|---|
Bertrand et al. (2010) | field | “Standard power calculations show that identifying a content feature effect […] would require over 300,000 observations.” |
Fryer (2011) | field | “An important potential limitation in our set of field experiments is that they were constructed to detect effects of 0.15 standard deviations or more with 80 % power.” |
Bettinger et al. (2012) | field | “With a control mean of 0.2, the sample size gives us about 80 % statistical power to detect a 3 percentage point difference in FAFSA submission rates at the 5 % significance level.” |
Kroft et al. (2013) | field | “Our power calculations called for 12,000 résumé submissions.” |
Lewis & Rao (2015) | field | “The ‘experiment multiplier’ tells us how much larger an experiment would have to be in terms of new, independent individuals to achieve adequate power, which we define as an expected t-statistic of 3, as this produces power of 91 % with a one-sided test size of 5 %.” |
Atkin et al. (2017) | field | “We chose the number of firms to be treated based on calculations for statistical power to pick up spillover effects.” |
de Ree et al. (2018) | field | “These zero effects are precisely estimated; the small standard errors of 0.025σ provide us adequate power to detect effects as low as 0.05σ at the 5 % level.” |
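
The Lewis & Rao (2015) quote links an expected t-statistic of 3 to a power of 91 % for a one-sided test at the 5 % level. A minimal sketch of that arithmetic follows, assuming the test statistic is approximately normal with unit variance. The function name `power_from_expected_t` is hypothetical and the normal approximation is an assumption of this sketch, not something stated in the quoted paper.

```python
from statistics import NormalDist


def power_from_expected_t(expected_t: float, alpha: float = 0.05,
                          one_sided: bool = True) -> float:
    """Approximate power when the test statistic is N(expected_t, 1).

    Assumption: a z-test (normal) approximation to the t-test.
    """
    z = NormalDist()
    if one_sided:
        critical = z.inv_cdf(1 - alpha)
        # Probability that the statistic exceeds the critical value.
        return 1 - z.cdf(critical - expected_t)
    critical = z.inv_cdf(1 - alpha / 2)
    # Two-sided: rejection can occur in either tail.
    return (1 - z.cdf(critical - expected_t)) + z.cdf(-critical - expected_t)


if __name__ == "__main__":
    # Expected t-statistic of 3, one-sided test at the 5 % level.
    print(f"{power_from_expected_t(3.0):.2f}")  # prints 0.91
```

Under these assumptions the result agrees with the 91 % given in the quote.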

**References**

Bertrand, M., D. Karlan, S. Mullainathan, E. Shafir, J. Zinman (2010), What’s Advertising Content Worth? Evidence from a Consumer Credit Marketing Field Experiment. The Quarterly Journal of Economics 125(1): 263–306.

Fryer, R.G. (2011), Financial Incentives and Student Achievement: Evidence from Randomized Trials. The Quarterly Journal of Economics 126(4): 1755–1798.

Bettinger, E.P., B.T. Long, P. Oreopoulos, L. Sanbonmatsu (2012), The Role of Application Assistance and Information in College Decisions: Results from the H&R Block FAFSA Experiment. The Quarterly Journal of Economics 127(3): 1205–1242.

Kroft, K., F. Lange, M.J. Notowidigdo (2013), Duration Dependence and Labor Market Conditions: Evidence from a Field Experiment. The Quarterly Journal of Economics 128(3): 1123–1167.

Lewis, R.A., J.M. Rao (2015), The Unfavorable Economics of Measuring the Returns to Advertising. The Quarterly Journal of Economics 130(4): 1941–1973.

Atkin, D., A. Chaudhry, S. Chaudry, A.K. Khandelwal, E. Verhoogen (2017), Organizational Barriers to Technology Adoption: Evidence from Soccer-Ball Producers in Pakistan. The Quarterly Journal of Economics 132(3): 1101–1164.

de Ree, J., K. Muralidharan, M. Pradhan, H. Rogers (2018), Double for Nothing? Experimental Evidence on an Unconditional Teacher Salary Increase in Indonesia. The Quarterly Journal of Economics 133(2): 993–1039.

## Appendix 2 Power/sample size calculations in the AER

Authors | Type of experiment | Evidence |
---|---|---|
Plott & Zeiler (2007) | lab | “The proportions are not statistically significantly different (p=0.31; power=0.0675).” |
Choi et al. (2007) | lab | “The power of the experiment is very sensitive to the number of observations for each subject. To illustrate this point, we simulated the choices of random subjects […]” |
Caplin et al. (2011) | lab | “[…] for 76,000 simulated subjects with the same number of switches as our subjects but who choose at random – a measure of the power of our test […].” |
Voors et al. (2012) | field | “[…] the probability of a type II error is now less than 0.14 (power is 0.86), which is below the 0.20 (above 0.80) threshold routinely assumed in empirical analysis.” & “We used G*Power3 software to conduct the power tests; see Faul et al. (2007).” |
Giné et al. (2012) | field | “Given the number of farmers in our study, it was infeasible to implement this design because power calculations suggested we could have at best two groups.” |
Maniadis et al. (2014) | lab | “Two other considerations – the statistical power of the test and the fraction of tested hypotheses that are true associations – are key factors to consider when making appropriate inference.” |
Eriksson & Rooth (2014) | field | “The fractions were chosen to ensure that we should be able to estimate any economically significant effects of these attributes (i. e. based on power calculations).” |
Bhargava & Manoli (2015) | field | “Treatments were randomized with equal sample weights with three exceptions: The control notice was over-sampled (× 4) to heighten the statistical power for pair-wise comparisons; the benefit display notices were over-sampled (× 3) to power tests of differentiation across listed benefit amounts […].” |
Carvalho et al. (2016) | lab | “A power analysis of Study 2 indicates that with a sample size of 2,700 participants and a 98 % compliance rate (i. e. the fraction of the before-payday group who completed the follow-up survey before payday) we can detect a before-versus-after-payday difference of 0.11 of a standard deviation in a two-sided test with power 0.8 and a significance level of 5 %.” |
Deming et al. (2016) | field | “We sent out health resumes from April to July 2014 as well. But the much smaller number of health job postings (Observations = 1,460 through July 2014) did not provide us with adequate statistical power.” |
Blattman et al. (2017) | field | “Based on the pilot, we estimated that the Minimum Detectible Effect for the full 1,000 (with a quarter for each treatment) would be a 0.12 standard deviation change in a standardized dependent variable for a two-tail hypothesis test with statistical significance of 0.05, statistical power of 0.80, an intra-cluster correlation of 0.25, and the proportion of individual variance explained by covariates as 0.10.” (→ Online-Appendix) |
Friebel et al. (2017) | field | “Power calculations reveal that the sample size is more than sufficient: on the basis of 27 months of observations pretreatment (January 2012 to March 2014) and three months of observations post treatment (April to June 2014), we would need 70 shops in each group to detect a 3 % treatment effect at a 5 % significance level with the probability 0.9.” |
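
Several of the AER entries report a minimum detectable effect (MDE) in standard-deviation units. As a rough illustration, the sketch below computes the MDE of a two-sided two-sample comparison under a normal approximation. The even split of the 2,700 participants from the Carvalho et al. (2016) quote into two groups of 1,350 is an assumption made here for illustration, and the sketch ignores compliance, clustering, and covariates. The function name `minimum_detectable_effect` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_treat: int, n_control: int,
                              alpha: float = 0.05,
                              power: float = 0.80) -> float:
    """MDE in standard-deviation units for a two-sided two-sample z-test.

    Assumptions: equal outcome variance in both groups, independent
    observations, normal approximation.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return multiplier * sqrt(1 / n_treat + 1 / n_control)


if __name__ == "__main__":
    # Hypothetical even split of 2,700 participants.
    mde = minimum_detectable_effect(1350, 1350)
    print(f"{mde:.3f}")  # prints 0.108
```

Under these simplifying assumptions the result is close to the 0.11 standard deviations quoted for a total sample of 2,700. The quoted paper may have used a different procedure.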

**References**

Plott, C., K. Zeiler (2007), Exchange Asymmetries Incorrectly Interpreted as Evidence of Endowment Effect Theory and Prospect Theory? The American Economic Review 97(4): 1449–1466.

Choi, S., R. Fisman, D. Gale, S. Kariv (2007), Consistency and Heterogeneity of Individual Behavior under Uncertainty. American Economic Review 97(5): 1921–1938.

Caplin, A., M. Dean, D. Martin (2011), Search and Satisficing. The American Economic Review 101(7): 2899–2922.

Voors, M.J., E.E.M. Nillesen, P. Verwimp, E.H. Bulte, R. Lensink, D. van Soest (2012), Violent Conflict and Behavior: A Field Experiment in Burundi. The American Economic Review 102(2): 941–964.

Giné, X., J. Goldberg, D. Yang (2012), Credit Market Consequences of Improved Personal Identification: Field Experimental Evidence from Malawi. The American Economic Review 102(6): 2923–2954.

Maniadis, Z., F. Tufano, J. List (2014), One Swallow Doesn’t Make a Summer: New Evidence on Anchoring Effects. The American Economic Review 104(1): 277–290.

Eriksson, S., D.-O. Rooth (2014), Do Employers Use Unemployment as a Sorting Criterion When Hiring? Evidence from a Field Experiment. The American Economic Review 104(3): 1014–1039.

Bhargava, S., D. Manoli (2015), Psychological Frictions and the Incomplete Take-Up of Social Benefits: Evidence from an IRS Field Experiment. American Economic Review 105(11): 3489–3529.

Carvalho, L.S., S. Meier, S.W. Wang (2016), Poverty and Economic Decision-Making: Evidence from Changes in Financial Resources at Payday. American Economic Review 106(2): 260–284.

Deming, D.J., N. Yuchtman, A. Abulafi, C. Goldin, L.F. Katz (2016), The Value of Postsecondary Credentials in the Labor Market: An Experimental Study. American Economic Review 106(3): 778–806.

Blattman, C., J.C. Jamison, M. Sheridan (2017), Reducing Crime and Violence: Experimental Evidence from Cognitive Behavioral Therapy in Liberia. American Economic Review 107(4): 1165–1206. [→ Online Appendix]

Friebel, G., M. Heinz, M. Krueger, N. Zubanov (2017), Team Incentives and Performance: Evidence from a Retail Chain. American Economic Review 107(8): 2168–2203.

## Appendix 3 Power/sample size calculations in EE

Authors | Type of experiment | Evidence |
---|---|---|
Dreber et al. (2011) | field | “A sample size analysis indicates that 1411, 965 and 38,407 observations would be needed to obtain a significant result for the performance change in running, jumping and dancing respectively. The basis for the power calculation is a significance level of 5 % and a power of 80 %.” |
List et al. (2011) | [Theoretical] | “In calculating optimal sample sizes an experimenter must consider three key elements: (1) the significance level, (2) the power of the subsequent hypothesis test, and (3) the minimum detectable effect size.” & “A simple rule of thumb to maximize power given a fixed experimental budget naturally follows: the ratio of the sample sizes is equal to the ratio of the standard deviations of outcomes.” |
Dreber et al. (2014) | lab | “A sample size analysis indicates that 2037 observations would be needed to obtain a significant result for the gender gap in competition choice in word search. The basis for the power calculation is a significance level of 5 % and a power of 80 %.” |
Stoop (2014) | lab & field | “Because the main result is based on the non-rejection of the null hypothesis, power analyses were carried out. A priori, a sample of 80 allows an effect size of at least 40 % to be detected with a significance level of 5 % and a power of 80 % (Faul et al. 2007). The differences in behavior reported in Figure 1 are so small that even if they were statistically significant, they would be economically insignificant. The StuLab and CitLab experiments show a difference with CitField of approximately 7.5 percent points. To make such a difference statistically significant (with a significance level of 5 % and a power of 80 %), the number of observations in each experiment would have to be 577.” |
Jordan et al. (2016) | lab | “We also conduct a power analysis to assess the smallest effects of our endowment and strategy method manipulations that we could have detected with 80 % probability in Experiment 1.” |
Andersson et al. (2017) | lab | “To estimate the statistical power and sample size needed for the main study, we used the observed standard deviation of 4.85 from the final pilot experiment with n = 44 (i. e. the standard deviation of the average donation across the different charities). We wanted to have a sufficient sample size to be well powered to detect a medium-sized effect (i. e. Cohen’s d = 0.5; Cohen 1992) when testing Hypothesis 2 (the test in the sample with a universalism value above the median). We decided to include approximately 300 subjects in total, which implies a sample size of 150 for testing Hypothesis 2. This provides us with 86 % power to detect a medium-sized effect for Hypothesis 2; for Hypothesis 1, where we include the total sample, the power is 99 % of detecting a medium-sized effect (but if the universalism prime only affects donations in individuals with high universalism, this would decrease the expected effect size and consequently the power of the test of Hypothesis 1).” |
Koppel et al. (2017) | lab | “A sample-size calculation based on means and standard deviations from a previous study from our lab (Kirchler et al. in press) and with 70 % power showed that 50 participants were needed in each condition.” |
Filiz-Ozbay et al. (2018) | lab | “A referee raised the question of getting some idea of the increases in sample size that would be needed to determine if some of the Big 5 characteristics that had large, but statistically insignificant values, were in fact significant.” & “If we triple the sample to 138 women, this increases to 0.25 chance of finding that it is significantly different from zero, and if we increase the sample size by a factor of five, it increases to 0.26. These results imply that the insignificant coefficient on openness in column (2) of Table 2 may reflect a lack of power, but that very large increases in the sample size are needed to increase power substantially. One can do power calculations for other variables with large but insignificant coefficients in an analogous manner.” |
Jacquemet et al. (2018) | lab | “Conventional power calculation (based on a standard one-sided proportion test supporting the significance of the effect of the oath on the truth-telling ratio, with α = 5 %) for the data reported in Figure 4b yields the power of 0.357. The same test performed on the data reported in Figure 4a has a power of 0.838. We thank a referee for pointing this out.” |
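
The Andersson et al. (2017) entry reports 86 % power for a sample of 150 and 99 % for a sample of 300 when the effect is medium-sized (Cohen's d = 0.5). The sketch below reproduces this kind of calculation under assumptions that are not stated in the quote: two equally sized groups (75 or 150 subjects each), a two-sided test at the 5 % level, and a normal approximation in place of the t-distribution. The function name `two_sample_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(d: float, n_per_group: int,
                     alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-sample test.

    d is the standardized effect size (Cohen's d). Assumptions: equal
    group sizes, equal variances, normal approximation.
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative.
    shift = d * sqrt(n_per_group / 2)
    return (1 - z.cdf(critical - shift)) + z.cdf(-critical - shift)


if __name__ == "__main__":
    for n in (75, 150):
        print(n, f"{two_sample_power(0.5, n):.2f}")
    # prints:
    # 75 0.86
    # 150 0.99
```

With these assumptions the approximation lands on the two power levels given in the quote. An exact t-based calculation can differ in the third decimal place.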

**References**

Dreber, A., E. von Essen, E. Ranehill (2011), Outrunning the gender gap – boys and girls compete equally. Experimental Economics 14(4): 567–582.

List, J.A., S. Sadoff, M. Wagner (2011), So you want to run an experiment, now what? Some simple rules of thumb for optimal experimental design. Experimental Economics 14(4): 439–457.

Dreber, A., E. von Essen, E. Ranehill (2014), Gender and competition in adolescence: task matters. Experimental Economics 17(1): 154–172.

Stoop, J. (2014), From the lab to the field: envelopes, dictators and manners. Experimental Economics 17(2): 304–313.

Jordan, J., K. McAuliffe, D. Rand (2016), The effects of endowment size and strategy method on third party punishment. Experimental Economics 19(4): 741–763.

Andersson, O., T. Miettinen, K. Hytönen, M. Johannesson, U. Stephan (2017), Subliminal influence on generosity. Experimental Economics 20(3): 531–555.

Koppel, L., D. Andersson, I. Morrison, K. Posadzy, D. Västfjäll, G. Tinghög (2017), The effect of acute pain on risky and intertemporal choice. Experimental Economics 20(4): 878–893.

Filiz-Ozbay, E., J.C. Ham, J.H. Kagel, E.Y. Ozbay (2018), The role of cognitive ability and personality traits for men and women in gift exchange outcomes. Experimental Economics 21(3): 650–672.

Jacquemet, N., S. Luchini, J.F. Shogren, A. Zylbersztejn (2018), Coordination with communication under oath. Experimental Economics 21(3): 627–649.

## Appendix 4 Power/sample size calculations in JEBO

Authors | Type of experiment | Evidence |
---|---|---|
Mattei (2000) | lab | “[…] the power of the test is just 0.432.” |
Norman et al. (2003) | lab | “For all results except Group 9 the power of the test is > 0.8. With our worst case scenario for the coefficient of a modified bubble sort, the power of the test, shown in above Table, is still greater than 0.7 for these cases except Groups 6 and 7. For Group 6 combined with Group 7, the worse case scenario power of the test is > 0.9.” |
Haruvy & Stahl (2007) | lab | “This likelihood ratio test has asymptotic power of 1, and Monte Carlo simulations show that for our sample size, it has a power of 0.9 or higher […]” |
Schechter (2007) | lab | “In order to approach a power level of 0.70, we would have needed 1535 observations in the gender regression, 5430 in the education regression, and 570 in the language regression. While the sample size here does not permit such subsample analysis, it leaves an interesting area for further examination.” |
Martin & Randal (2008) | field | “Thus, failure to reject a null hypothesis that busyness has no influence on behaviour could be a result of one of the following possibilities: busyness is in fact not relevant, busyness is important but the effects cancel out, or simply a lack of statistical power.” |
Ismayilov & Potters (2013) | lab | “If we hypothesize that the deception rate under no disclosure is about 0.44 (based on the two closest treatments in Gneezy, 2005) and that it increases by 50 % to 0.66 with disclosure, then the power of our test for the effect of disclosure is almost 80 % (two-sided test, no continuity correction).” |
Bronchetti et al. (2015) | field | “Our power calculations suggest that the sample size (and the size of each experimental arm) is sufficient to detect a 3.2-percentage point increase in flu vaccination, relative to the baseline vaccination rate for the control group of 9 %, after correcting for multiple comparisons.” |
de Haan & van Veldhuizen (2015) | lab | “Finally, as we show in Appendix A, our null result does not seem to be due to our study having a low power. […] For example, for the attraction effect task assuming a true effect size similar to the effect observed by Pocheptsova et al. (2009), our power is estimated to be in the 0.8–1 range, which is considered to be high to very high.” |
Hong et al. (2015) | field | “According to the rules of thumb provided by List et al. (2011), in our experimental design, our sample size is sufficient to detect about a 1/3 difference of winning probability and about 0.7-standard deviation difference of productivity, with power of 0.8 and significance level of 0.05.” |
Beltramo et al. (2015) | lab | “Based on power calculations and the minimum detectable effect for experimentally testing marketing messages, a total of 36 parishes were selected.” |
Karlan et al. (2015) | field | “We focus our discussion of statistical power on the profit results. The estimate for the effect of the consulting only treatment on stated income is an increase of 0.9 (se = 21.4) cedis over a control group mean of 146 (Column 1 of Table 6). Thus, the upper bound of the 95 % confidence interval is 41 cedis, a 28 % increase over the control group mean.” |
Katuščák et al. (2015) | lab | “Note that, given the number of subjects we use in 4 H and 4 HR, if the effect of loser vs. minimal feedback identified by FO were real and if their data provides a good description of the idiosyncratic variance in bidding behavior, a two-tailed test would have a power of over 0.95 against the null hypothesis of no effect. If we instead use the variance implied by our data, the power of the test ranges from 0.73 to 0.95, depending on the protocol and the outcome measure.” |
Higuchi et al. (2015) | field | “Given the large variance in business performance at baseline, the number of samples needed to detect 10 % change in business performance with 90 % statistical power is over one thousand, which is far beyond our training and survey budget.” |
Barton et al. (2016) | field | “We performed initial power calculations for the districts pooled, which indicated that we would detect a 5 % difference in turnout rates at p = 0.05 between the negative and positive groups with a power of 0.85. When looking at each district individually, however, the effect size needs to be roughly ten percentage points to have an 80 % chance of finding it at p = 0.05.” |
Ericson & Kessler (2016) | lab | “In the first and second wave, some participants were also assigned to a subsidy, or status quo condition. These conditions were discontinued as a result of a power calculation and because the status quo eventually became a mandate.” |
Czermak et al. (2016) | lab | “Second, we calculated the power of our test and found that a power of 0.80 for a significance level of α = 0.05 could be achieved with our current sample size of N = 191 already if the range of the relative frequency of choosing Nash is less than 9 percentage points (assuming a mean of 45 %).” |
Kriss et al. (2016) | lab | “In the former case, this gives us 94 % power to detect a proportion that differs from 50 % by ± 25 % at p < 0.05 in a two-sided test; in the latter case we have 97 % power (Chow et al., 2008).” |
Lindeboom et al. (2016) | field | “The original proposal for the field experiment including power analysis is available at http://personal.vu.nl/b.vander.klaauw/OpzetCIZOnderzoek.pdf [in Dutch].” |
Bhanot (2017) | field | “The sample size for this study was determined with the loan repayment rate outcome in mind, based on an estimated ex-ante repayment rate of 80 %, power of 0.80, and an estimated effect size of five percentage points (informed primarily by the findings in Shu et al. (2012) and Karlan et al. (2010)).” |
Galeotti et al. (2017) | lab | “To estimate the sample size necessary to uncover these hypothesized effects, we conduct an a priori power analysis. This type of analysis would be of limited value without a reasonable estimate of the treatment differences we would expect in our experiment. Such estimates are typically based on the results of previous similar studies. The studies which are most similar to ours are Gino and Pierce (2009, 2010).” |
Persson (2018) | lab | “One limitation of the experiment is the relatively low power of the tests. For example, pooling the three treatments against the Control treatment and hypothesizing that the punishment probability is 0.3 in the former group (where N = 82) and 0.1 in the latter group (where N = 24), using a significance level of 0.05 the power of a one-sided Fisher’s exact test is 0.58; for comparisons across the treatments it is lower.” |

Andersen et al. (2018) | lab | “A power calculation using a two-sided t-test for simplicity shows that we would need a sample size of 470 individuals per treatment cell to detect a significant effect if power is set to 80 %, significance level is set to 5 %, and given the responses of our subjects.” |

Prasada & Bose (2018) | lab | “We make the standard assumptions of confidence level of 0.95 for type 1 error, 80 % for power of the test. We specify a conservative desired difference of 0.1, which is about 25 % of the standard deviation of conflict effort (3.7). Then according to Altman (1981), the minimum required number of independent observations is 214.” |

Heggedal et al. (2018) | lab | “Moreover, given a 5 % significance level and the observed variances of the D and N treatments, this test has a power of 99 %.” & “The power computation was carried out using the simulation routine of Bellemare et al. (2016).” |

Ding et al. (2018) | lab | “We report the power-analysis results in the last two columns of the table below. […] Our results indicate that our study has more than 80 % power to detect differences in Amplitude, APD, and RAD, regardless of whether we set the difference at 30 % or 50 %.” |

Brookins et al. (2018) | lab | “In view of the null result for part (a), we also conducted an ex post power calculation. The effect size in our data is 0.17, which is conventionally considered small. With this effect size, at 80 % power it would take 208 independent clusters (matches) per treatment – a sample size well beyond feasible – to obtain a significant difference in total output between […].” |

Rahwan et al. (2018) | lab | “We simulated 1,000 random samples for differently-sized groups, and determined the proportion of samples for each group that found a significant positive correlation between the number of coin flips reported as matched and negative moral affect using a one-sided Spearman test at the 5 % significance level. Using this procedure, we determined that a sample size of at least 460 participants per condition was required to achieve 80 % power in the main experiment. We decided to round up to 500 participants per condition to further increase power.” |
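Many of the statements quoted above fix a significance level and a target power and then report either a required sample size or a detectable effect. The following is a minimal sketch of that logic under the textbook normal approximation for a two-sided comparison of two group means. It is not a reconstruction of any listed study's calculation; the function names `n_per_group` and `detectable_effect` and the effect sizes in the usage lines are illustrative assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means (normal approximation).

    d is the standardized effect size: difference in means / common SD.
    """
    z_alpha = Z.inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_power = Z.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


def detectable_effect(n, alpha=0.05, power=0.80):
    """Inverse problem: smallest standardized effect that n observations
    per group detect with the given power (same approximation)."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_power = Z.inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2 / n)


# Hypothetical effect sizes, chosen only for illustration
print(n_per_group(0.5))   # medium effect -> 63 per group
print(n_per_group(0.2))   # small effect  -> 393 per group
print(round(detectable_effect(470), 3))
```

Because the required sample size scales with 1/d², halving the assumed effect size roughly quadruples the number of observations needed. Exact t-based calculations give slightly larger numbers than this approximation.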

**References**

Mattei, A. (2000), Full-scale real tests of consumer behavior using experimental data. Journal of Economic Behavior & Organization 43(4): 487–497.

Norman, A., M. Ahmed, J. Chou, K. Fortson, C. Kurz, H. Lee, L. Linden, K. Meythaler, R. Rando, K. Sheppard, N. Tantzen, I. White, M. Ziegler (2003), An ordering experiment. Journal of Economic Behavior & Organization 50(2): 249–262.

Haruvy, E., D.O. Stahl (2007), Equilibrium selection and bounded rationality in symmetric normal-form games. Journal of Economic Behavior & Organization 62(1): 98–119.

Schechter, L. (2007), Traditional trust measurement and the risk confound: An experiment in rural Paraguay. Journal of Economic Behavior & Organization 62(2): 272–292.

Martin, R., J. Randal (2008), How is donation behaviour affected by the donations of others? Journal of Economic Behavior & Organization 67(1): 228–238.

Ismayilov, H., J. Potters (2013), Disclosing advisor’s interests neither hurts nor helps. Journal of Economic Behavior & Organization 93: 314–320.

Bronchetti, E.T., D.B. Huffman, E. Magenheim (2015), Attention, intentions, and follow-through in preventive health behavior: Field experimental evidence on flu vaccination. Journal of Economic Behavior & Organization 116: 270–291.

de Haan, T., R. van Veldhuizen (2015), Willpower depletion and framing effects. Journal of Economic Behavior & Organization 117: 47–61.

Hong, F., T. Hossain, J.A. List (2015), Framing manipulations in contests: A natural field experiment. Journal of Economic Behavior & Organization 118: 372–382.

Beltramo, T., G. Blalock, D.I. Levine, A.M. Simons (2015), The effect of marketing messages and payment over time on willingness to pay for fuel-efficient cookstoves. Journal of Economic Behavior & Organization 118: 333–345.

Karlan, D., R. Knight, C. Udry (2015), Consulting and capital experiments with microenterprise tailors in Ghana. Journal of Economic Behavior & Organization 118: 281–302.

Katuščák, P., F. Michelucci, M. Zajíček (2015), Does feedback really matter in one-shot first-price auctions? Journal of Economic Behavior & Organization 119: 139–152.

Higuchi, Y., V.H. Nam, T. Sonobe (2015), Sustained impacts of Kaizen training. Journal of Economic Behavior & Organization 120: 189–206.

Barton, J., M. Castillo, R. Petrie (2016), Negative campaigning, fundraising, and voter turnout: A field experiment. Journal of Economic Behavior & Organization 121: 99–113.

Ericson, K.M., J.B. Kessler (2016), The articulation of government policy: Health insurance mandates versus taxes. Journal of Economic Behavior & Organization 124: 43–54.

Czermak, S., F. Feri, D. Glätzle-Rützler, M. Sutter (2016), How strategic are children and adolescents? Experimental evidence from normal-form games. Journal of Economic Behavior & Organization 128: 265–285.

Kriss, P.H., R.A. Weber, E. Xiao (2016), Turning a blind eye, but not the other cheek: On the robustness of costly punishment. Journal of Economic Behavior & Organization 128: 159–177.

Lindeboom, M., B. van der Klaauw, S. Vriend (2016), Audit rates and compliance: A field experiment in care provision. Journal of Economic Behavior & Organization 131, Part B: 160–173.

Bhanot, S.P. (2017), Cheap promises: Evidence from loan repayment pledges in an online experiment. Journal of Economic Behavior & Organization 140: 246–266.

Galeotti, F., R. Kline, R. Orsini (2017), When foul play seems fair: Exploring the link between just deserts and honesty. Journal of Economic Behavior & Organization 142: 451–467.

Persson, E. (2018), Testing the impact of frustration and anger when responsibility is low. Journal of Economic Behavior & Organization 145: 435–448.

Andersen, S., U. Gneezy, A. Kajackaite, J. Marx (2018), Allowing for reflection time does not change behavior in dictator and cheating games. Journal of Economic Behavior & Organization 145: 24–33.

Prasada, D.V.P., G. Bose (2018), Rational conflict and pre-commitment to peace. Journal of Economic Behavior & Organization 149: 215–238.

Heggedal, T.-R., L. Helland, K.-E.N. Joslin (2018), Should I Stay or should I Go? Bandwagons in the lab. Journal of Economic Behavior & Organization 150: 86–97.

Ding, S., V. Lugovskyy, D. Puzzello, S. Tucker, A. Williams (2018), Cash versus extra-credit incentives in experimental asset markets. Journal of Economic Behavior & Organization 150: 19–27.

Brookins, P., J.P. Lightle, D. Ryvkin (2018), Sorting and communication in weak-link group contests. Journal of Economic Behavior & Organization 152: 64–80.

Rahwan, Z., O.P. Hauser, E. Kochanowska, B. Fasolo (2018), High stakes: A little more cheating, a lot less charity. Journal of Economic Behavior & Organization 152: 276–295.

## Appendix 5 Power/sample size calculations in ERE

Authors | Type of experiment | Evidence |
---|---|---|
Burton et al. (2007) | lab | “Pre-experiment power analyses indicated a power of 0.8 using very conservative assumptions about differences between treatments. As differences between treatments become greater, the power approaches one rapidly.” |
Mitani & Flores (2009) | lab | “The calculated power for a gender effect in Real Payment model is improved from 0.378 in Mitani and Flores (2007) to 0.9 in the current analysis […].” |
Araña & León (2013) | field | “For the largest price of 60 € the proportion of an acceptance response is 0.36 for the opt-in treatment and 0.38 for the opt-out treatment, which leads us to interpret that we can not reject the null hypothesis of equal proportions among samples with a significance level of 0.69 and a power of the contrast of 0.84.” |
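The first quotation in the table notes that power approaches one rapidly as the difference between treatments grows. This relationship can be sketched with the normal approximation for a two-sided comparison of two group means. The group size of 50 and the effect sizes below are hypothetical values chosen for illustration and are not taken from any of the listed studies.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def approx_power(d, n, alpha=0.05):
    """Approximate power of a two-sided, two-sample test of means.

    d: standardized effect size (difference in means / common SD)
    n: observations per group (assumed equal across groups)
    """
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n / 2)  # expected value of the test statistic
    # Probability of rejecting in either tail
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)


# Hypothetical design: 50 observations per treatment
for d in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(f"d = {d:.1f}: power = {approx_power(d, n=50):.3f}")
```

At d = 0 the function returns the significance level itself, since rejections then occur only by chance. The power then rises monotonically with the effect size.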

**References**

Burton, A.C., K.S. Carson, S.M. Chilton, W.G. Hutchinson (2007), Resolving questions about bias in real and hypothetical referenda. Environmental and Resource Economics 38(4): 513–525.

Mitani, Y., N.E. Flores (2009), Demand Revelation, Hypothetical Bias, and Threshold Public Goods Provision. Environmental and Resource Economics 44(2): 231–243.

Araña, J.E., C.J. León (2013), Can Defaults Save the Climate? Evidence from a Field Experiment on Carbon Offsetting Programs. Environmental and Resource Economics 54(4): 613–626.

**Received:** 2019-03-15

**Revised:** 2019-12-13

**Accepted:** 2019-12-16

**Published Online:** 2020-02-27

© 2020 Oldenbourg Wissenschaftsverlag GmbH, Published by De Gruyter Oldenbourg, Berlin/Boston