Regression(s) discontinuity: Using bootstrap aggregation to yield estimates of RD treatment effects

Abstract: Following Efron (2014), we propose an algorithm for estimating treatment effects for use by researchers employing a regression-discontinuity (RD) design. This algorithm generates a set of estimates of the treatment effect from bootstrapped samples: the polynomial-selection algorithm developed by Pei, Lee, Card, and Weber (2021) is applied to each sample, and the average of the resulting RD treatment effect (RDTE) estimates serves as the overall estimate of the RDTE. Effectively, this procedure estimates a set of plausible RD estimates and weights them by their likelihood of being the best estimate to form a weighted-average estimate. We discuss why this procedure may lower the estimate's root mean squared error (RMSE). In simulation results, we show that this better performance is achieved, yielding up to a 5% reduction in RMSE relative to PLCW's method and a 16% reduction in RMSE relative to Calonico, Cattaneo, and Titiunik's (2014) method for bandwidth selection (with default settings).


Introduction
This article applies the insights of Efron [1] and builds on the methods of Imbens and Kalyanaraman (IK, [2]), Calonico, Cattaneo, and Titiunik (CCT, [3]), and Pei, Lee, Card, and Weber (PLCW, [4]), each of which develops data-driven methods for estimating treatment effects in the context of a regression-discontinuity (RD) design. The RD design for estimating treatment effects was first used by Thistlethwaite and Campbell [5]. Hahn et al. evaluate the conditions that are necessary to identify the treatment effect in an RD design [6]. Cattaneo and Titiunik [7] provide a review of contemporary methods for estimating treatment effects in RD designs; an earlier review can be found in the study by Imbens and Lemieux [8]. Cattaneo et al. [9,10] provide "an accessible and practical guide for the analysis and interpretation of RD designs" (abstract). We review the IK, CCT, and PLCW methods to foreground our proposed approach (which we label as "LR").
IK derive an optimal bandwidth assuming that the user is employing a local linear regression estimator. IK define this optimal bandwidth as "the bandwidth that minimizes a first-order approximation to MSE(h), what we call the asymptotic mean squared error or AMSE(h)" (p. 4). They outline a feasible, data-driven estimate for this optimal bandwidth, culminating in their equation (17), which we reproduce here as equation (1):

ĥ_opt = C_K · [ (σ̂²₋(c) + σ̂²₊(c)) / ( f̂(c) · { (m̂⁽²⁾₊(c) − m̂⁽²⁾₋(c))² + (r̂₊ + r̂₋) } ) ]^(1/5) · N^(−1/5)   (1)

C_K is a constant whose magnitude depends on the type of kernel weight used in the regression. σ̂²₋(c) + σ̂²₊(c) is the sum of estimates of the variance of the outcome conditional on the forcing variable, x, just below (−) and above (+) the cutoff for assigning treatment, c. f̂(c) is an estimate of the density of the forcing variable at the cutoff.
(m̂⁽²⁾₊(c) − m̂⁽²⁾₋(c))² is an estimate of bias in the estimate of the RD treatment effect (RDTE) that arises from using a linear regression. More specifically, this term captures the difference in estimates of the second derivatives, which give the extent of difference in curvature near the cutoff. These second derivatives are estimated by a quadratic regression using observations near the cutoff. r̂₊ + r̂₋ is the sum of estimated regularization parameters, which are included to account for error in the estimation of the true curvature and to prevent the estimate of the optimal bandwidth from becoming infinite in cases where the difference in the curvatures at the threshold is estimated to be zero.
Note that this optimal bandwidth contains a series of estimated components. As such, the optimal bandwidth is not known with certainty. Perhaps in response, software by Nichols [11] to implement the IK bandwidth presents the user with estimates of the RDTE using the estimated IK bandwidth, that bandwidth doubled, and that bandwidth cut in half. This lack of certainty about the optimal bandwidth plays a key role in motivating our proposed approach.
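To make the plug-in nature of equation (1) concrete, the following sketch (in Python, though the software discussed in this paper is written for Stata) shows how the estimated components combine into the IK bandwidth. All input values below are hypothetical placeholders, not estimates from any real dataset.

```python
def ik_bandwidth(C_K, sig2_minus, sig2_plus, f_c, m2_minus, m2_plus,
                 r_minus, r_plus, N):
    """Plug-in IK optimal bandwidth, equation (1).

    Every argument is itself an estimate (conditional variances, density,
    second derivatives, regularization terms), which is why the resulting
    bandwidth is not known with certainty.
    """
    numerator = sig2_minus + sig2_plus
    denominator = f_c * ((m2_plus - m2_minus) ** 2 + r_plus + r_minus)
    return C_K * (numerator / denominator) ** 0.2 * N ** (-0.2)

# Hypothetical component estimates (illustrative numbers only):
h_hat = ik_bandwidth(C_K=3.44, sig2_minus=0.02, sig2_plus=0.02, f_c=0.9,
                     m2_minus=1.5, m2_plus=4.0, r_minus=0.2, r_plus=0.2, N=500)
```

The regularization terms r̂₋ and r̂₊ keep the denominator away from zero when the two estimated curvatures nearly coincide, which is what prevents the bandwidth from diverging.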
CCT [3] build on the IK method in several respects. They develop robust confidence interval estimators for average treatment effects that are centered around a bias-corrected RD estimator. The "bias" in the conventional RD estimate arises because, for any given polynomial order choice (e.g., linear), there may be remaining higher-order curvature that causes the point estimate to be biased. Note that the IK bandwidth in equation (1) trades off the increase in bias that results from a larger bandwidth against the concurrent decrease in variance achieved by widening the bandwidth. Such optimal bandwidth algorithms produce bandwidths that are "too wide" in the sense that they permit bias to creep into the RDTE estimate. CCT generate a bias-corrected RDTE estimate that adjusts the conventional RDTE estimate with an estimate of the bias. They estimate the unknown derivatives in the leading bias term by using a higher-order local polynomial in order to construct this estimate of the bias [12]. CCT's confidence intervals are "robust" in that they add to the estimated conventional variance the variance that comes from having to estimate the leading bias. This additional source of variance "could be interpreted as the contribution of the bias correction to the variability of the bias-corrected estimator" (CCT, 2014b, p. 921). Their robust bias-corrected confidence intervals are as follows (CCT, 2014b, p. 914):

[ (τ̂_p(h_n) − b̂_n) ± 1.96 · √(v̂_n^bc) ]

where τ̂_p(h_n) is the conventional estimate of the RDTE and is a function of the user-selected polynomial order, p, and the bandwidth that is used, h_n; b̂_n is the estimated bias, and thus τ̂_p(h_n) − b̂_n is the bias-corrected estimate of the RDTE; and v̂_n^bc is an estimate of the variance of the bias-corrected estimate of the RDTE, which includes the conventional variance and the additional variance coming from estimating the bias correction. Finally, CCT provide improvements on IK's method of deriving the optimal bandwidth.
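In code, the robust bias-corrected interval is a one-liner: center the interval at the bias-corrected point estimate and widen it using the variance that includes the bias-correction contribution. The sketch below uses hypothetical inputs, not estimates from any dataset in this paper.

```python
import math

def robust_bc_ci(tau_hat, b_hat, v_bc, z=1.96):
    """CCT-style robust 95% CI: (tau_hat - b_hat) +/- z * sqrt(v_bc),
    where v_bc includes the extra variance from estimating the bias b_hat."""
    center = tau_hat - b_hat
    half_width = z * math.sqrt(v_bc)
    return center - half_width, center + half_width

# Hypothetical estimates: conventional RDTE 0.10, estimated bias 0.02,
# bias-corrected variance 0.0025 (i.e., a standard error of 0.05).
lo, hi = robust_bc_ci(0.10, 0.02, 0.0025)
```

Note that ignoring b̂_n shifts the center, while ignoring the bias-correction variance shrinks the width; CCT's simulations show that both omissions hurt coverage.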
CCT [3] conduct simulations using three data-generating processes (DGPs). Assuming a linear polynomial specification, they show that using the IK optimal bandwidth to generate conventional RDTE estimates with conventional standard errors yields confidence interval coverage rates that are too low. The 95% confidence interval around the estimated RDTE contained the true RDTE in only 82.3, 30.3, and 84.2% of cases for these three DGPs, respectively (CCT [3], third column of Table 1, pp. 2316-7). In contrast, the 95% confidence intervals around the bias-corrected estimated RDTEs with bias-corrected variance contained the true RDTE in 91.6, 93.2, and 93.3% of cases for these three DGPs (CCT [3], ninth column of Table 1, pp. 2316-7), much closer to the expected 95% rate. Although these coverage rates are a clear improvement, it should be noted that they are still modestly below the 95% target. While these robust confidence intervals add variance to account for the estimation of the bias correction, they do not account for uncertainty in the width of the bandwidth. [Notes to Table 1: "Bandwidth" is the bandwidth used to identify the estimate of the RDTE (as opposed to the bandwidth used to estimate the bias). For our approach (LR), the "Bandwidth," "Effective Number...," and "Conventional Estimated RDTE" columns show the average values across bootstrapped samples. A uniform kernel weight is used for all three methods.]
In contrast, our approach does account for this uncertainty and thus is more able to produce coverage rates close to 95%.¹ Using a local linear regression applied to data within either the IK or CCT bandwidths is the dominant method used in RD studies at this time.² Yet, it is also still popular for scholars to use higher-order polynomials applied to the whole span of the observed data. There are strong arguments against this approach. While higher-order polynomials may have a stronger fit to the whole of the data on a given side of the threshold, their out-of-sample predictive power may be weak and standard errors on their estimated RDTEs may be quite large. Gelman and Imbens [17] note that "we do not have good methods for choosing that order in a way that is optimal for the objective of a good estimator for the causal effect of interest" and "optimizing some global goodness-of-fit measure ... is not closely related to the research objective of causal inference" (p. 447).
This backdrop sets the stage for PLCW's method for selecting the optimal polynomial order. PLCW's method is to compare the estimated asymptotic mean squared error (AMSE) across a variety of polynomial orders, with the optimal bandwidth for each polynomial chosen using CCT's method, and to select the order that minimizes this estimated AMSE. In simulations that test and compare polynomials of order 0 (i.e., a kernel-weighted average) to order 4 (i.e., a kernel-weighted quartic regression), PLCW show that using this polynomial-order-selection method can produce lower mean squared errors than the common default approach of using a local linear regression. Furthermore, they show that for a given DGP, there may exist a threshold sample size above which the optimal polynomial switches from lower order (e.g., linear) to higher order (e.g., quadratic).
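The selection step itself reduces to an argmin over estimated AMSEs. The sketch below assumes the per-order AMSE estimates (each computed with its own CCT-optimal bandwidth, as PLCW's rdmse command would do) are already in hand; the numbers are hypothetical.

```python
def select_polynomial_order(amse_by_order):
    """PLCW-style selection: return the polynomial order whose estimated
    AMSE (computed under its own CCT-optimal bandwidth) is smallest."""
    return min(amse_by_order, key=amse_by_order.get)

# Hypothetical estimated AMSEs for orders 0 through 4:
amse = {0: 0.0121, 1: 0.0098, 2: 0.0134, 3: 0.0187, 4: 0.0260}
best_order = select_polynomial_order(amse)  # -> 1
```

Because each AMSE in the dictionary is itself estimated, the argmin is a noisy decision, which is precisely the uncertainty the bootstrap aggregation proposed below is designed to average over.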
We believe that the PLCW approach (which wraps around the CCT approach) is attractive and likely to be widely used. However, this approach does raise questions. First, note that at the sample size threshold at which the optimal polynomial order switches from linear to quadratic, one should be agnostic about whether the RDTE estimate from the linear specification is better or worse than the RDTE estimate from the quadratic specification. If these two RDTE estimates differ from one another, then an average of these two RDTE estimates is likely to yield better performance (i.e., a lower mean squared error) than one obtains from either of these separate estimates taken by itself. Second, note that the PLCW method produces an estimate of the AMSE for each polynomial order, and each order will generate a regression estimated using data within an estimated optimal bandwidth. Since the optimal bandwidth and resulting AMSE are not known with certainty, the optimal polynomial order cannot be known with certainty.
Efron [1] considers problems of this type. He notes that data-based model selection may yield "jumpy" estimates. Efron builds on the bootstrap aggregation (i.e., "bagging") technique commonly used in machine learning to combine regression or classification trees [18]. Using this bootstrapped smoothing approach, which in the RD context would reduce the variability in the treatment effect estimation, Efron demonstrates how to compute appropriate standard errors and confidence intervals that take into account the model-selection procedure. This is a frequentist model averaging approach. Steel [19] notes that frequentist model averaging estimators can be described as weighted averages of parameters chosen from different models and notes that the key challenge is finding the appropriate weights. For further discussion of model averaging strategies and the additional uncertainty caused by model selection methods, see [20].

¹ Calonico et al. show that the mean-square-error-optimal bandwidth does not produce optimal confidence intervals and suggest alternate inference-optimal bandwidth choices [13]. Their associated Stata command, rdrobust, provides these coverage-error-rate-optimal bandwidths (Calonico et al.) [14]. In our subsequent empirical analysis, we use the default setting of the rdrobust command, which provides mean-squared-error-optimal bandwidths. Thus, the reader should be aware that we are being somewhat unfair in our conclusion that our approach yields better coverage rates. However, as practitioners will typically use default settings in their empirical work, this comparison of results with default settings is useful. For more discussion of coverage accuracy, see Calonico et al. [15,16].
² These papers have 2,576 and 2,440 citations, respectively, per Google Scholar as of December 2021. Among the 30 "recent" RD papers that we discuss below, 23 use one of these two methods for computing the optimal bandwidth. Calonico et al.'s user-written rdrobust Stata command uses a linear specification, the CCT bandwidth, and triangular kernel weights as its default settings, but they provide the user with the option of changing each of these defaults, including the possibility of using higher-order polynomial specifications [14]. This command can be installed by entering the following command into Stata: "ssc install rdrobust, replace."

We follow Efron [1] by applying bootstrap aggregation to produce smoothed estimates of the treatment effect in RD contexts. Our approach, described in the next section, attempts to produce a weighted-average RDTE, which is composed of a set of RDTEs estimated using various polynomial orders and various bandwidths, with each estimate weighted by the specification's probability of yielding the lowest AMSE. In the third section, we provide an example to illustrate the method and simulation evidence. Across five DGPs, we show that our approach yields lower mean squared errors and better coverage rates than (a) CCT's [3] method applied using a local linear regression for both the conventional and bias-corrected RDTEs and (b) PLCW's [4] method for polynomial order selection followed by CCT's [3] method for bandwidth selection for a given polynomial order.

Our proposed approach
From the complete set of observations of sample size N, we draw a random sample with replacement, creating a new sample of size N. We apply PLCW's [4] polynomial-order-selection method to this sample, following their lead by testing polynomials of order 0 through 4.³ Given the resulting optimal polynomial, we use CCT's [3] method to find the optimal bandwidth and produce an estimate of the RDTE.⁴ We save the estimated RDTE. We then repeat this operation many times. In the simulations shown below, we generate 200 RDTE estimates from 200 bootstrapped samples.⁵ We then compute the average RDTE across the 200 samples, which is our estimate of the RDTE.
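The loop structure of this algorithm can be sketched as follows. This is a deliberately simplified stand-in (Python rather than the Stata implementation discussed in this paper): the order-selection and bandwidth-selection lines are placeholders where PLCW's AMSE criterion and CCT's optimal bandwidth would be applied to each bootstrapped sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def rdte_estimate(x, y, cutoff, order, h):
    """Local polynomial RDTE: separate fits on each side of the cutoff,
    using only observations within bandwidth h (uniform kernel)."""
    left = (x >= cutoff - h) & (x < cutoff)
    right = (x >= cutoff) & (x <= cutoff + h)
    fit_l = np.polyfit(x[left] - cutoff, y[left], order)
    fit_r = np.polyfit(x[right] - cutoff, y[right], order)
    return np.polyval(fit_r, 0.0) - np.polyval(fit_l, 0.0)

def bagged_rdte(x, y, cutoff, B=200):
    """Bootstrap-aggregated RDTE: average of B per-sample estimates."""
    n = len(x)
    estimates = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # resample with replacement
        xb, yb = x[idx], y[idx]
        order = rng.choice([0, 1])  # placeholder for PLCW's AMSE-minimizing order
        h = 0.25                    # placeholder for CCT's optimal bandwidth
        estimates.append(rdte_estimate(xb, yb, cutoff, order, h))
    estimates = np.array(estimates)
    return estimates.mean(), estimates.std(ddof=1), estimates
```

With real order and bandwidth selection substituted in for the placeholders, the mean over the B estimates is the LR point estimate, and the saved vector of per-sample estimates is what the inference procedures use.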
This data-driven algorithm has desirable features. First, the process produces the frequencies by which each polynomial order is selected by PLCW's method across the 200 bootstrapped samples. If there is little uncertainty about the optimal order, then the 200 bootstrapped samples will be likely to select the same polynomial order. In contrast, if there is uncertainty, the algorithm will effectively weight the various polynomial orders by their respective likelihood of being the "best" (i.e., having the lowest estimated AMSE). Second, uncertainty about the size of the optimal bandwidth will be captured by variation in the bandwidths across the 200 samples. If there is little uncertainty in the optimal bandwidth, then the widths will be close to the same across the 200 samples. Yet, as we show below, there is considerable uncertainty in both the optimal polynomial order and the optimal bandwidths for given orders for the DGPs we examine.
Regarding inference, we consider two approaches. The first approach estimates the standard error simply from the standard deviation of the estimated RDTEs (call this "ŝd_B"). By bootstrapping the entire process of optimal order selection and optimal bandwidth selection for a given order, the standard deviation of the RDTEs captures (a) conventional variance in the parameters of the selected regression specification, (b) variance arising from the components of the optimal bandwidth being estimated, (c) variance in the optimal polynomial, which relies on an estimated AMSE, and (d) variance in the estimate of the bias when a bias-corrected RDTE is estimated, which was the subject of CCT's "robust confidence intervals." By capturing all of these sources of variance, we produce more robust confidence intervals.
The second approach comes from Efron [1], who shows that the first approach generates confidence intervals that are too wide. Efron produces a "smoothed interval" of μ̃ ± 1.96 · s̃d_B, "where μ̃ ... is the bootstrap smoothed estimate" and s̃d_B is the "standard deviation ... for the smoothed bootstrap estimate" (p. 995). Efron shows how to compute s̃d_B exactly in the "ideal case" (i.e., when the number of bootstrap samples equals N^N, with each possible combination of the N observations chosen once), provides a method to estimate s̃d_B for a smaller number of bootstrapped samples than N^N, and uses 4,000 samples in his empirical demonstration.

³ We use PLCW's user-written Stata command rdmse, which can be installed by entering the following command into Stata: "net install rdmse, from(https://raw.githubusercontent.com/peizhuan/rdmse/master) replace."
⁴ Our procedure can be used with either the conventional RDTE or the bias-corrected RDTE. We show results with both approaches in the simulations below and find that they perform roughly the same for the DGPs that we examine. Furthermore, the user can use either triangular or uniform kernel weights, and we show the performance of each below. Other features of Calonico et al.'s (2018) rdrobust command are set at their default settings.
⁵ A higher number of bootstrapped samples would produce more precise results, but comes at the cost of increased computer processing time.
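For readers wanting the smoothed standard deviation, a compact sketch of our reading of Efron's finite-B estimator is given below: for each observation j, compute the bootstrap covariance between the number of times j is resampled and the per-sample statistic, then take the root of the sum of squared covariances. This is a sketch of the formula as we understand it, not a tested reproduction of Efron's code.

```python
import numpy as np

def efron_sd_tilde(counts, t):
    """Efron's (2014) smoothed bootstrap standard deviation, s~d_B.

    counts : (B, n) array; counts[i, j] = number of times observation j
             appears in bootstrap sample i.
    t      : (B,) array of per-sample statistics t*_i (here, the RDTEs).
    """
    # cov_j = (1/B) * sum_i (N*_ij - Nbar_j) * (t*_i - tbar)
    cov_j = ((counts - counts.mean(axis=0)) * (t - t.mean())[:, None]).mean(axis=0)
    return float(np.sqrt((cov_j ** 2).sum()))
```

As noted above, with only B = 200 samples this estimator is noisy; Efron used 4,000 samples in his demonstration, and our software offers it as an option rather than the default.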
While the second approach is superior, we use the first approach (i.e., ŝd_B) in the empirical demonstration below for pragmatic reasons. In our simulations, we generate 15,000 treatment effect estimates, each of which uses 200 bootstrapped samples (i.e., 3 million samples). Efron's s̃d_B is much noisier than ŝd_B when the number of bootstrapped samples is modest (e.g., 200) and may, in fact, incorrectly generate wider confidence interval estimates than those produced using ŝd_B. Yet, using 4,000 bootstrapped estimates proved to be too time prohibitive; a test of a single treatment estimate emerging from 4,000 bootstrapped samples took 189 minutes. In the software we have developed, discussed below, we provide the user with the option to generate Efron's s̃d_B, which may be attractive when estimating a particular treatment effect.⁶

Our approach is related to, but different from, a simple bootstrapping approach [24]. The typical use of bootstrapping is to construct a confidence interval around a sample statistic by conducting the same procedure (e.g., computation of a sample median or of the parameters of a particular regression) on repeated random samples with replacement. In such bootstrapping applications, a confidence interval is drawn around the statistic that is estimated on the full sample. It is generally not advised to use the mean of the bootstrapped samples' estimates of the statistic as one's estimate of the population parameter, as "the bootstrapped statistic is one further abstraction away from your population parameter" and is "an estimate of an estimate" (Dima) [25]. Each bootstrapped sample, by itself, produces an unbiased but less efficient estimate of the population parameter than can be obtained from the full-sample statistic. Yet, this efficiency loss may be less than the efficiency gain that can be achieved by allowing for variation in the bandwidth and polynomial order across the bootstrapped samples. By allowing for this specification variation across samples, we are not, in fact, estimating the same sample parameters for each bootstrapped sample, and thus our approach differs from standard bootstrapping applications.
It thus becomes an empirical question whether our approach introduces net efficiency gains. In the simulation results we show below, the mean squared error, averaged across various DGPs and various sample sizes, is slightly smaller using our approach than PLCW's method and substantially smaller than using an assumed linear specification with CCT's [3] optimal bandwidth, suggesting efficiency gains can be had in using our approach. Furthermore, even in cases where our approach produces a higher mean squared error, the differences are usually small. That is, if there is an efficiency loss in a given setting, it appears likely to be small. Finally, there is the question of what kernel weight to use in the regressions. Triangular kernels, which place more weight on observations near the threshold, with a steadily declining weight throughout the range of x on which the regression is run, are often used in RD estimations. This choice is supported by the findings of Cheng et al. [26], who conclude that for local polynomial estimation, the triangular weight function is best for any particular order. However, note that since our procedure yields varying bandwidths across the 200 samples, and since we average the 200 RDTEs, the observations closest to the cutoff are more likely to be included than observations further from the cutoff, as the latter may fall within some of the bandwidths but not others. Thus, even with a uniform kernel weight, our procedure generates declining weight on observations further from the cutoff. This phenomenon is illustrated in the example below. A uniform kernel for individual samples produces an effective kernel that is somewhat triangular when aggregated across the 200 samples, whereas the aggregated triangular kernel has a shape similar to a Laplace (double exponential) distribution. Consequently, Cheng et al.'s [26] recommendation of the triangular kernel may not hold in our application. Empirically, as shown below, we find that the uniform kernel produces smaller mean squared error for the DGPs we investigate, and thus, we have set uniform as the default in our software (with triangular kernels as a user option).

⁶ Our paper is not the first to use bootstrap methods for inference in RD designs. Otávio et al. use bootstrapping to produce confidence intervals for sharp RD designs [21]. Chiang et al. use a multiplier bootstrap to provide inference in the context of quantile treatment effects for a fuzzy RD design [22]. Additionally, see Chiang and Sasaki for a discussion of inference in the context of quantile treatment effects for a sharp regression kink design [23].
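The mechanism is easy to see numerically: averaging uniform (indicator) kernels whose half-widths vary across bootstrapped samples yields an effective weight that is flat inside the smallest bandwidth and declines beyond it. The bandwidth distribution below is hypothetical, chosen only to illustrate the shape.

```python
import numpy as np

rng = np.random.default_rng(1)

grid = np.linspace(-0.5, 0.5, 201)              # values of the forcing variable
bandwidths = rng.uniform(0.05, 0.30, size=200)  # hypothetical per-sample bandwidths

# Effective weight at each x: the share of the 200 samples whose uniform
# kernel (indicator of |x| <= h) includes that x.
effective = (np.abs(grid)[None, :] <= bandwidths[:, None]).mean(axis=0)
```

Here `effective` equals 1 for |x| below the smallest drawn bandwidth and falls to 0 as |x| passes the largest, giving a roughly triangular aggregate shape even though each individual sample used a uniform kernel.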

Example and simulation evidence
We test the efficacy of our approach using five DGPs, which we label L1, LM1, L2, LM2, and J1. The first two DGPs (L1 and LM1) have been used in the simulation evidence provided in IK [2], CCT [3], and PLCW [4], and are derived from fifth-order polynomials fit on data from, respectively, Lee [27] and Ludwig and Miller [28]. The third and fourth DGPs (L2 and LM2) are identical to the first and second, except that they multiply the standard deviation of ε by 10. We included these DGPs because the variance in the first two DGPs seemed small (based on visual inspection) relative to many examples that we have seen in the RD literature, and we wanted to test the effectiveness of the CCT, PLCW, and LR algorithms given greater variance in y. The fifth DGP (J1) is derived from quadratic regressions fit on data from the study by Jacob et al. [29]. These data include the seventh- and eighth-grade math assessments from a balanced panel of 2,767 students, to which Jacob et al. applied a 10-point pseudo treatment effect to the eighth-grade scores of those with seventh-grade scores below 215 to evaluate RD methods. Perhaps not surprisingly, the relationship between seventh- and eighth-grade math scores is fairly linear. We included this DGP because we wanted a DGP for which a linear specification would be expected to perform well, which would seem to minimize the advantages of both PLCW's method for optimal polynomial selection and our approach, which wraps around PLCW's method. Appendix Figure A.1 illustrates these five DGPs with scatterplots showing 5,000 simulated observations and lines showing the portion of y_i that is a function of x_i.
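As a stylized illustration of what such a DGP looks like, one can simulate a smooth conditional mean with a discontinuity of 0.04 at the cutoff. The polynomial coefficients and noise level below are hypothetical stand-ins, not the published L1 coefficients (which are fifth-order polynomial fits to the Lee data); only the jump of 0.04 and the sharp-RD structure mirror the description above.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_sharp_rd(n, jump=0.04, sigma=0.13, cutoff=0.0):
    """Stylized sharp-RD DGP: a smooth cubic in x plus a jump at the cutoff.
    The cubic and noise level are illustrative, not the L1 DGP's."""
    x = rng.uniform(-1.0, 1.0, n)
    m = 0.45 + 0.8 * x - 0.5 * x ** 2 + 0.3 * x ** 3  # hypothetical smooth component
    y = m + jump * (x >= cutoff) + rng.normal(0.0, sigma, n)
    return x, y

x, y = simulate_sharp_rd(5000)
```

Scattering y against x for such a draw (as in Appendix Figure A.1) shows why the jump is hard to see by eye: the discontinuity is small relative to both the noise and the curvature of the conditional mean.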
We assume that all observations to the right of the threshold are treated. That is, we are considering a "sharp" RD design.⁸ To illustrate our approach, we begin by creating one sample with 500 observations drawn from the L1 DGP. Table 1 presents the results with the CCT and PLCW methods and our approach applied to this sample.

⁷ This software can be installed by entering the following command into Stata: "net install rdwa, from(https://raw.githubusercontent.com/mlmarklong/rdwa/master) replace."
⁸ Although not yet available in our rdwa command, a useful extension would be to add an option for estimating "fuzzy" RD estimates, where the treatment probability increases at a threshold, and for "regression kink designs," where the slope of the treatment has a kink (i.e., a change in slope) and the researcher evaluates whether there is a corresponding kink in the outcome at the treatment kink [30]. Additional extensions would be to allow the researcher to include covariates in the analysis, which are permitted in both the CCT and PLCW approaches and in their associated rdrobust and rdmse commands.
The first row shows the results using Calonico et al.'s (2018) rdrobust command with default settings, including the default setting of a linear polynomial, but with a uniform kernel. This method produces a conventional (non-bias-corrected) RDTE estimate of 0.081. Note that this RDTE is double the true RDTE of 0.04 for the L1 DGP. The confidence interval for this estimate [−0.028 to 0.190] includes (i.e., "covers") the true RDTE of 0.04. In the second row, we show the results applying PLCW's method to select the optimal polynomial order, assuming one uses CCT's method with default rdrobust settings for each considered polynomial order. PLCW's method concludes that using a zero-order polynomial (i.e., a kernel-weighted average) will yield the lowest AMSE. Using this order, which produces a smaller bandwidth than the linear polynomial, yields a conventional RDTE estimate of 0.107 and a confidence interval of [0.028 to 0.196]. This estimate again covers the true RDTE of 0.04, but is a little worse than the CCT result as the RDTE estimate is further from the true value (i.e., |0.107 − 0.04| > |0.081 − 0.04|). The bottom six rows of Table 1 show our approach. We find that the polynomial order 0 is selected in 151 of our 200 bootstrapped samples, order 1 is selected for 45 samples, order 2 is selected for 4 samples, and orders 3 and 4 are never selected. Thus, our approach suggests that roughly three-quarters of the weight should be placed on the zero-order polynomial specifications, and most of the rest of the weight should be placed on linear specifications. Furthermore, note that within each of the polynomial orders, we find substantial variation in the optimal bandwidth. For example, for the zero-order polynomials, we find a mean bandwidth of 0.065 (very similar to the bandwidth selected by the PLCW method, 0.068), yet this bandwidth varies across the 151 samples from a low of 0.039 to a high of 0.110. The "Conventional Estimated RDTE" column shows that the polynomial orders produce fairly different estimated RDTEs: 0.105 for order 0, 0.074 for order 1, and −0.029 for order 2. The bottom row of Table 1 shows the overall results for our approach. Our estimate of the conventional RDTE is 0.095 with a standard error of 0.050 and a 95% confidence interval spanning from −0.0004 to 0.186, which covers the true RDTE, 0.04. Note that these results fall in between the CCT and PLCW results, which is not surprising given that our results are a weighted average with most of the weight on the zero-order polynomial specifications. Figures 1-3 provide greater clarity on the results using our approach. Figure 1 shows the scatterplot of the 500 observations drawn from the L1 DGP, regression lines drawn using PLCW's preferred specification (purple and green), and 200 regression lines applying PLCW's method to 200 bootstrapped samples (red and blue). This figure shows that the zero-order polynomial is the dominant choice, but there is a sizable contingent of linear regression lines included and a faint hint of quadratic specifications. Furthermore, this figure shows that bandwidths vary considerably and are larger for the higher-order specifications.
Figure 2 shows a histogram of the estimated RDTEs across the 200 samples. The overall estimate of the RDTE is given by the vertical maroon line at 0.095. Two 95% confidence intervals are illustrated. The long dashed lines give a "percentile-based" confidence interval suggested by the histogram itself, whereby there are five estimated RDTEs to the left and right of these long dashed lines and 190 estimated RDTEs (i.e., 95%) in between. The short dashed lines give a "normal-based" confidence interval stemming from assuming that the RDTEs are distributed N(0.095, 0.050²). The green curve shows this normal distribution. If the number of samples were increased above 200, the difference between these confidence intervals would shrink. This histogram illustrates the trade-off between the precision that could be had by taking more bootstrapped samples and processing time. The choice of 200 was made to balance these concerns, but the user of our approach might want more precision if they can accept a longer processing time.
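Both intervals are cheap to compute from the saved vector of bootstrapped RDTEs. The sketch below shows the two constructions; the input vector here is simulated from a normal distribution as a stand-in, not the actual 200 estimates behind Figure 2.

```python
import numpy as np

def bootstrap_cis(estimates, alpha=0.05, z=1.96):
    """Percentile-based and normal-based CIs from bootstrapped RDTE estimates."""
    pct_lo, pct_hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    mean, sd = estimates.mean(), estimates.std(ddof=1)
    return (pct_lo, pct_hi), (mean - z * sd, mean + z * sd)

# Simulated stand-in for the 200 bootstrapped RDTEs:
rng = np.random.default_rng(2)
ests = rng.normal(0.095, 0.050, size=200)
(pct_lo, pct_hi), (norm_lo, norm_hi) = bootstrap_cis(ests)
```

With 200 draws the two intervals differ slightly; as the number of bootstrapped samples grows, the percentile interval converges toward the normal interval whenever the estimates are approximately normally distributed, which is the convergence noted above.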
Figure 3 shows the effective weight placed on observations as a function of the value of x. In panel (a), the uniform kernel shows a flat peak, reflective of the fact that all observations across all of the 200 samples that are within the minimum bandwidth range [−0.039 to 0.039] obtain a weight of 1. Outside of that range, the effective weight declines as a function of bandwidth variation. None of the observations outside of the maximum bandwidth range [−0.270 to 0.270] are given any weight in producing the RDTE estimate. In panel (b), the triangular kernel has a pointed peak, with an effective triangular weight for observations within the minimum bandwidth range. For observations outside this range, the effective weight is characterized by a negative exponential. Broadly speaking, both effective weighting schemes are approximately triangular, but neither is precisely triangular.
Next, we show how these methods compare when applied repeatedly across samples drawn from the five DGPs using three sample sizes: 500, 5,000, and 50,000. These sample sizes were chosen as they are broadly representative of the RD literature. To assess this contention, we conducted two literature searches on scholar.google.com. First, we chose "Show top recommendations" and searched for "Regression Discontinuity." We compiled the first 30 articles that were using RD methods to estimate a causal impact and excluded literature review articles or articles that principally developed RD methods. From these 30 articles, we identified the full sample size from the first RD estimate shown in a table in the article.⁹ Most of these "top" articles were more than a decade old. We then repeated this search by choosing "Show all recommendations" and restricted the search to articles published "Since 2017."¹⁰ The median sample size was 2,917 for "top" articles and 3,005.5 for "recent" articles. Of these 60 articles, seven had sample sizes below 500 and eight had sample sizes above 50,000. Appendix Figure A2 shows the distribution of these sample sizes, and Appendix Table A1 gives details on the 60 studies.¹¹

Table 2 shows the results for 500 draws from the L1 DGP. We present the root mean squared errors (RMSEs) and coverage rates for both the conventional and bias-corrected estimated RDTEs. Single check marks denote the specification that has the lowest RMSE or the coverage rate closest to 0.95 within sample size and kernel across the three methods, and double check marks denote the best performance within sample size across the six method × kernel combinations.
With a sample size of 500, the lowest RMSE is found for the CCT-triangular-kernel specification for the conventional RDTE (0.062) and the PLCW-triangular-kernel specification for the bias-corrected RDTE (0.061), yet these specifications provide coverage rates that are low (0.910 and 0.906, respectively). These RMSEs are narrowly better than those found using the LR-uniform specification (0.062 and 0.062, respectively), while LR-uniform yields better coverage rates (0.918 and 0.946, respectively).
In the next panel, we show the results for the 5,000 sample size. Our specification produces the lowest RMSE and best coverage rates for the conventional RDTE using a triangular kernel and for the bias-corrected RDTE using a uniform kernel. For sample size 50,000, the lowest RMSEs are 0.009, which are achieved by both the LR and PLCW specifications. Yet, our approach again produces better coverage rates for these estimates.
In the bottom panel of Table 2, we average the results from the three panels above. We find that our approach with a uniform kernel produces the best results across the board. For users preferring the conventional RDTE, our approach produces the lowest RMSE and best coverage rates. The same is true if the user prefers a bias-corrected RDTE estimate. The final five rows of Table 2 show the relative performance of these specifications. For this analysis, we assume that users of CCT's and PLCW's methods are prone to use triangular kernel weights and bias-corrected RDTE estimates. We find that PLCW's method yields a 9% reduction in RMSE relative to CCT's method when using bias-corrected RDTE estimates, thus replicating PLCW's findings. In the subsequent rows, we compare CCT's and PLCW's methods to our approach. The bottom two rows show the comparisons assuming the user of our approach uses the conventional RDTE with a uniform kernel (which has the best overall performance). With these assumptions, our approach produces a 7% reduction in RMSE relative to CCT's method, but a 2% higher RMSE than PLCW's. This is the only case, across the five data generating processes, in which our approach underperforms PLCW's method. Appendix Tables A2-A5 repeat the results in Table 2 but for the LM1, L2, LM2, and J1 DGPs. Across the bottom rows of these tables, we find that our approach reduces the RMSE by 8, 14, 1, and 7%, respectively, relative to PLCW's method. Rather than walking the reader through these individual DGP results one at a time, we focus instead on the aggregation of the results from these five DGPs, as shown in Table 3.
Table 3 presents the central findings of our article. We present the average of the RMSEs, each divided by the standard deviation of ε for its DGP (i.e., by 0.1295 for L1 and LM1, 1.295 for L2 and LM2, and 9.5 for J1). This normalization places the DGPs on more even footing, although we found similar results without it. We also present the average coverage rate across the five DGPs. Focusing on the conventional RDTE columns, we find that our approach with a uniform kernel gives the lowest normalized RMSE and best coverage rates at each sample size and when averaged across sample sizes. In the bias-corrected RDTE columns, we find that PLCW with a uniform kernel yields a slightly lower normalized RMSE than our approach for the 500 and 50,000 sample sizes, while our approach with a uniform kernel is superior for the 5,000 sample size and when averaged across sample sizes. Moreover, our approach with a triangular kernel yields the best coverage rates for each sample size.

Notes: The CCT rows show the results assuming a linear regression using CCT's optimal bandwidth selection method applied to a linear specification. PLCW rows use PLCW's method for polynomial order selection assuming CCT's optimal bandwidth. LR rows use our approach of bootstrap aggregation of the PLCW method. The coverage rates are computed assuming conventional standard errors for the conventional RDTE estimates and robust standard errors for the bias-corrected results. Single check marks denote the specification that has the lowest RMSE or the coverage rate closest to 0.95 within sample size and kernel across the three methods, and double check marks denote the best performance within sample size across the six method × kernel combinations. The bottom five rows show the ratios of the RMSEs (averaged across the results for the three sample sizes), with "c" denoting conventional, "b" denoting bias-corrected, "t" denoting a triangular kernel, and "u" denoting a uniform kernel.
The final five rows of Table 3 show the relative performance of the three methods. We can confirm PLCW's finding that their method yields lower squared errors than CCT's bandwidth-selection method paired with a triangular-kernel-weighted linear regression. PLCW's normalized RMSE is 8 and 11% below CCT's results for the conventional and bias-corrected RDTE, respectively. Our approach outperforms both of these methods. Using the conventional RDTE and a uniform kernel, our approach produces normalized RMSEs that are 16 and 5% lower, respectively, than CCT's and PLCW's methods applied to the bias-corrected RDTE with a triangular kernel.
Finally, note that the conventional and bias-corrected RDTEs yield comparable performance in terms of normalized RMSE; the ratio of the normalized RMSE for the bias-corrected RDTE to that for the conventional RDTE is 1.011 when averaged across the 24 rows in Table 3.

Conclusion
In this article, we provide an approach for estimating the causal impact in an RD study. Our approach can produce a lower mean squared error than leading methods in the field, i.e., CCT [3] and PLCW [4]. Furthermore, our approach may yield coverage rates for the true (typically unknown) local average treatment effects that are closer to the expected level of 95% when using a 95% confidence interval. Our approach consists simply of taking repeated bootstrapped samples; for each sample, identifying the optimal polynomial order using PLCW's method and applying CCT's bandwidth selection; estimating the RDTE with that specification; and finally averaging the estimated RDTEs across the samples to generate an overall estimate of the RDTE.
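The steps above can be sketched schematically as follows. This is not our `rdwa` implementation; `select_poly_order` and `estimate_rdte` are placeholder callables standing in for PLCW's polynomial-order selection and an RD estimator using CCT's bandwidth, which we do not reimplement here.

```python
import numpy as np

def bagged_rdte(x, y, cutoff, select_poly_order, estimate_rdte,
                n_boot=200, seed=0):
    """Bootstrap-aggregated RD treatment effect (schematic sketch).

    `select_poly_order(x, y, cutoff)` and `estimate_rdte(x, y, cutoff, p)`
    are placeholders for PLCW's order selection and an RD estimator with
    CCT's bandwidth; real applications would wrap existing routines.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample with replacement
        xb, yb = x[idx], y[idx]
        p = select_poly_order(xb, yb, cutoff)  # order chosen per sample
        estimates.append(estimate_rdte(xb, yb, cutoff, p))
    estimates = np.asarray(estimates)
    # The average of the per-sample estimates is the overall RDTE.
    return estimates.mean(), estimates
```

Because the polynomial order and bandwidth are re-selected within each bootstrapped sample, specifications that are more likely to be chosen contribute more estimates to the average, which is the weighting intuition described above.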
Leading methods in the field estimate optimal bandwidths and optimal polynomial specifications and then generate a single estimate of the treatment effect with the apparently optimal bandwidth and specification. However, since the optimal bandwidth and optimal polynomial specification are not known with certainty, and since other choices of the bandwidth and specification (which may be nearly equal in likelihood of being "optimal") can produce very different estimates of the treatment effect, there is a strong argument for a weighted-average "consensus" estimate over the estimate from a particular bandwidth/specification choice. It is this argument that suggests scope for a reduced mean squared error. However, our bootstrap approach for deriving the weights comes at a cost: each bootstrap sample is less efficient in generating the treatment effect estimate than the full, original sample would be. Whether the gains from using the weighted-average consensus estimate outweigh the efficiency loss from using bootstrapped samples to generate treatment effect estimates is an empirical question. Our simulation results suggest (but do not prove) potential gains from this approach. Our approach is more likely to be superior when (a) there is more uncertainty about the optimal bandwidth and/or the optimal polynomial specification and (b) the resulting estimates of the treatment effect are sensitive to these choices.
A drawback of our approach is that implementing this technique requires substantial computer processing time. Our procedure takes roughly 200 times as long to run as PLCW's procedure (assuming the user implements the default setting of 200 samples). For example, in our test of a sample with N = 5,000 observations, averaged across ten implementations, our rdwa command (with options set to not print the results for each bootstrapped sample and not to produce the associated graphs) took 547.6 s to complete, compared to 2.9 s to implement PLCW's rdmse command wrapped around Calonico et al.'s (2018) rdrobust command to assess polynomials of order 0-4. Researchers will need to consider whether the added processing time is worth the potential improved performance of lower squared errors and better coverage rates.
Notes: The full sample consists of 500 observations with 406 (94) to the left (right) of the treatment cutoff.

Figure 1: Example illustrating our approach: Specification choice and regression lines.

Figures 1-3 provide greater clarity on the results using our approach. Figure 1 shows the scatterplot of the 500 observations drawn from the L1 DGP, regression lines drawn using PLCW's preferred specification (purple and green), and 200 regression lines applying PLCW's method to 200 bootstrapped samples (red and blue). This figure shows that the zero-order polynomial is the dominant choice, but a sizable contingent of linear regression lines is included, along with a faint hint of quadratic specifications. Furthermore, this figure shows that bandwidths vary considerably and are larger for the higher-order specifications. Figure 2 shows a histogram of the estimated RDTEs across the 200 samples. The overall estimate of the RDTE is given by the vertical maroon line at 0.095. Two 95% confidence intervals are illustrated. The long dashed lines give a "percentile-based" confidence interval suggested by the histogram itself, whereby there are five estimated RDTEs to the left and right of these long dashed lines and 190 estimated RDTEs (i.e., 95%) in between. The short dashed lines give a "normal-based" confidence interval stemming from assuming that the RDTEs are distributed N(0.095, 0.050²). The green curve shows this normal distribution. If the number of samples were increased above 200, the difference between these confidence intervals would shrink. This histogram illustrates the trade-off between the precision that could be had by taking more bootstrapped samples and processing time. The choice of 200 was made to balance these concerns, but users of our approach might want more precision if they can accept longer processing times. Figure 3 shows the effective weight placed on observations as a function of the value of x. In panel (a), the uniform kernel shows a flat peak, reflecting the fact that all observations across all of the 200 samples that fall within the bandwidth receive equal weight.
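The two confidence intervals described above can be computed directly from the vector of bootstrap estimates. The following is a minimal sketch using simulated stand-ins with the mean (0.095) and standard deviation (0.050) from our example, not the actual bootstrap draws.

```python
import numpy as np

# Hypothetical vector of 200 bootstrap RDTE estimates; in the paper's
# example these have mean 0.095 and standard deviation 0.050.
rng = np.random.default_rng(0)
est = rng.normal(0.095, 0.050, size=200)

# "Percentile-based" 95% CI: roughly 5 estimates fall outside each tail,
# leaving about 190 of the 200 estimates (95%) in between.
pct_lo, pct_hi = np.quantile(est, [0.025, 0.975])

# "Normal-based" 95% CI from the mean and standard deviation.
m, s = est.mean(), est.std(ddof=1)
norm_lo, norm_hi = m - 1.96 * s, m + 1.96 * s
```

As the number of bootstrap samples grows, the percentile endpoints converge to the normal-based endpoints when the estimates are approximately normally distributed.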

Figure 2: Example illustrating our approach: Histogram of the estimated RDTEs.

Figure 3: Example illustrating our approach: Effective weights as a function of the kernel weight applied to each of 200 samples. (a) Uniform kernels and (b) triangular kernels.

Table 1: Example illustrating our approach.

Table 2: Simulation results for the L1 DGP.

Table A1: Sample sizes for 30 top and 30 recent RD papers.