Testing for Treatment Effect Twice Using Internal and External Controls in Clinical Trials

Leveraging external controls -- relevant individual patient data under control from external trials or real-world data -- has the potential to reduce the cost of randomized controlled trials (RCTs) while increasing the proportion of trial patients given access to novel treatments. However, due to lack of randomization, RCT patients and external controls may differ with respect to covariates that may or may not have been measured. Hence, after controlling for measured covariates, for instance by matching, testing for treatment effect using external controls may still be subject to unmeasured biases. In this paper, we propose a sensitivity analysis approach to quantify the magnitude of unmeasured bias that would be needed to alter a study conclusion reached under the presumption that employing external controls introduces no unmeasured biases. Whether leveraging external controls increases power depends on the interplay between the sample sizes and the magnitudes of the treatment effect and unmeasured biases, which may be difficult to anticipate. This motivates a combined testing procedure that performs two highly correlated analyses, one with and one without external controls, with a small correction for multiple testing using the joint distribution of the two test statistics. The combined test provides a new method of sensitivity analysis designed for data fusion problems, which anchors at the unbiased analysis based on the RCT only and spends a small proportion of the type I error to also test using the external controls. In this way, if leveraging external controls increases power, the power gain compared to the analysis based on the RCT only can be substantial; if not, the power loss is small. The proposed method is evaluated in theory and power calculations, and applied to a real trial.


Introduction
1.1 Use of external controls in randomized controlled trials

Randomized controlled trials (RCTs) are the gold standard for generating high-quality causal evidence of new treatments and have long been recognized as the standard method to support key decisions in the drug development process (Jones and Podolsky, 2015; Bothwell and Podolsky, 2016). However, despite its clear advantages, the traditional paradigm of conducting RCTs has been increasingly criticized for failing to meet contemporary needs. In certain settings, for example, in HIV prevention (Janes et al., 2019; Sugarman et al., 2021), oncology (Rahman et al., 2021), and neurology (Mintzer et al., 2015), randomizing patients to placebo may be difficult for ethical or feasibility reasons. Moreover, adequately powered RCTs are becoming more and more impractical as a growing number of new treatments are targeted toward rare diseases or biomarker-defined subgroups of patients in the era of precision medicine (Eichler et al., 2021).
Meanwhile, a plethora of real-world data (RWD) have been curated for administrative or research purposes and are becoming accessible to researchers in the form of disease registries, administrative claims databases, and electronic health records. These rich data sources can produce valuable insights, i.e., real-world evidence (RWE), into the effect of treatments in routine, daily practice. However, researchers almost ubiquitously caution against possible bias from unmeasured confounding when using RWD.
Being well aware of the limitations of using either RCTs or RWD alone, the idea of using RWD to supplement RCTs has gained growing interest in recent years. As forcefully argued in Eichler et al. (2021), "the future is not about RCTs vs. RWE but RCTs and RWE." There are numerous opportunities for the integration of RCTs and RWD to achieve fruitful results that using either alone cannot (Colnet et al., 2020; Degtiar and Rose, 2021; Shi et al., 2021). Among those, an important theme is augmenting the RCT with RWD to increase efficiency (Yang et al., 2020a,b; Gagnon-Bartsch et al., 2021; Chen et al., 2021; Cheng and Cai, 2021; Li and Luedtke, 2021), and particularly, constructing an externally augmented control arm in the analysis of RCTs (Li et al., 2020; Harton et al., 2021; Gao et al., 2021; Liu et al., 2022). Leveraging external controls -- relevant individual patient data under control from external trials or real-world data -- has the potential to reduce the cost of RCTs while increasing the proportion of trial patients given access to novel treatments.
Using external controls is not an entirely new idea. Criteria for evaluating what constitutes an acceptable external control arm were proposed in Pocock (1976). The idea was discussed twenty years ago by the International Council for Harmonisation (ICH) (2000, E10 Section 2.5), and also recognized by the European Medicines Agency (EMA) (2006), the US Food and Drug Administration (FDA) (2018), and the National Cancer Institute (Sharpless and Doroshow, 2019) as one direction to modernize clinical trials.
In fact, properly selected external controls (e.g., using propensity score matching) have shown early promise, and several drugs have already been approved based on external control groups (Carrigan et al., 2020; Schmidli et al., 2020; Thorlund et al., 2020).
Using external controls typically requires the exchangeability condition, i.e., all patient characteristics that affect the potential outcome under control and differ between the trial population and the external control population are measured (Stuart et al., 2011). While careful adjustment for observed covariates can plausibly render the exchangeability assumption approximately correct, the analysis may still be biased due to unmeasured covariates related to "difficulties in reliably selecting a comparable population because of potential changes in medical practice, lack of standardized diagnostic criteria or equivalent outcome measures, and variability in follow-up procedures" (US Food and Drug Administration (FDA), 2018). To reduce the potential biases from using external controls, an intuitive frequentist approach is "test-then-pool," which first tests for the comparability of the external controls and internal controls before leveraging external controls (Liu et al., 2022). Bayesian methods that rely on power priors have also been popular; they use the likelihood of the external data raised to a specified power as the prior distribution (Chen and Ibrahim, 2000; Nikolakopoulos et al., 2018). As such, one can use power priors to adjust the weight allocated to the external information according to the level of comparability between the external controls and the internal data. However, these methods lack formal statistical theory on how unmeasured biases might affect the validity and efficiency of the resulting procedures.
In this article, we take a different perspective on this problem and propose a sensitivity analysis approach to quantify the magnitude of unmeasured bias that would be needed to alter a study conclusion reached under the presumption that employing external controls introduces no unmeasured biases (International Council for Harmonisation (ICH), 2019). With the unbiased RCT-only test as the benchmark, whether leveraging external controls increases power depends on the interplay between the sample sizes and the magnitudes of the treatment effect and unmeasured biases, which may be difficult to anticipate. This motivates a combined testing procedure that performs both tests, one with and one without external controls, correcting for multiple testing using the joint distribution of the two test statistics. Because the two tests are highly correlated, this correction for multiple testing is small. Interestingly, the proposed combined testing procedure can be viewed as a new method of sensitivity analysis designed for data fusion problems that anchors at the unbiased analysis based on the RCT only and "spends" a small proportion of the type I error (i.e., the cost of multiple testing) to also test using the pooled controls. In this way, if leveraging external controls increases power, the power gain compared to the RCT-only test can be substantial; if not, the power loss is small. Before introducing technical details, it is useful to consider a motivating example.

Example: a randomized controlled trial in patients with type-2 diabetes
Consider a non-inferiority, phase 3 RCT (referred to as the internal trial; ClinicalTrials.gov number NCT01894568) comparing a new basal insulin, insulin peglispro, to insulin glargine as the control in Asian insulin-naïve patients with type-2 diabetes using a noninferiority margin of 0.4% (Hirose et al., 2018). The primary endpoint is the change in hemoglobin A1c (HbA1c) from baseline to 26 weeks of treatment. HbA1c is a continuous-valued measure of average blood glucose over the past three months.
Before this trial, a phase 3 RCT of similar design (referred to as the external trial; ClinicalTrials.gov number NCT01435616) had been conducted in North America and Europe (Davies et al., 2016), whose control arm will be used as the source of external controls.
We focus on the overweight and obese populations, which are respectively defined as 23 ≤ Body Mass Index (BMI) < 25 and BMI ≥ 25 for the internal trial according to the Asia-Pacific guidelines, and 25 ≤ BMI < 30 and BMI ≥ 30 for the external trial according to the World Health Organization classifications (Lim et al., 2017). There are in total 159 patients under treatment and 150 patients under control in the internal RCT, and 486 patients under control in the external trial. We match 159 similar external controls to the 159 treated patients in the internal RCT using optimal matching based on a robust Mahalanobis distance and a caliper on the propensity score. See Rosenbaum (2020, Part II) for discussion of these matching techniques. Table 1 describes covariate balance in the 159 matched pairs. All variables have standardized differences less than 0.13 and are considered sufficiently balanced (Rosenbaum, 2002).
Using only the internal RCT, 159 patients under treatment and 150 under control, we conduct a Z-test with the noninferiority margin of 0.4% and obtain a one-sided p-value of 7.92 × 10^-7. In this analysis, the evidence that the new insulin treatment is noninferior to insulin glargine is strong even when using only the internal controls. On the other hand, under the exchangeability assumption, which implies that the 159 matched external controls are comparable to patients in the internal RCT, we construct an augmented control arm of 309 patients in total and obtain a one-sided p-value of 1.88 × 10^-7.
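As a quick numerical check, one-sided p-values of this kind follow from the standard normal survival function applied to the Z statistics; a sketch using SciPy, where only the Z values 4.80 and 5.08 are taken from the text:

```python
# One-sided p-values from the reported Z statistics (4.80 for the
# RCT-only test, 5.08 for the test with the augmented control arm).
from scipy.stats import norm

p_internal = norm.sf(4.80)   # RCT-only Z-test, approx 7.9e-07
p_augmented = norm.sf(5.08)  # Z-test with matched external controls, approx 1.9e-07

print(p_internal, p_augmented)
```

Both values agree with the p-values reported above up to rounding of the Z statistics.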
Again, we find strong evidence of noninferiority; however, an investigator may doubt the exchangeability assumption due to the influence of region on the outcome. A natural question is then: could the one-sided p-value of 1.88 × 10^-7 be due to regions rather than the effect of treatment? If the study conclusion from using external controls can be altered by a plausible effect of regions, and because the RCT-only test is already powerful enough, the RCT-only test would be the better choice.
However, it would be difficult to know this before examining the data. Motivated by the advice of performing multiple analyses with an appropriate correction for multiple testing given by Rosenbaum (2012), we propose a combined testing procedure that performs both analyses, correcting for multiple testing using the joint distribution of the two test statistics. In this article, we will demonstrate that the combined test avoids making an inapt choice about whether to use external controls, and only has a small loss of power compared to knowing a priori which is the better choice.

Outline
Section 2 presents a test that uses only the internal controls and another test that also leverages the external controls, and discusses controlling the type I error and comparing power without the exchangeability assumption. Section 3 proposes a combined test that performs both tests and studies its statistical properties in detail. Section 4 presents power calculations. Section 5 returns to the real data application. Section 6 concludes with a discussion.
2 Testing Using Internal and External Controls

Testing Under Exchangeability
There is a randomized controlled trial (RCT) denoted as D = 1. Let A be a binary treatment, where A = 1 denotes treatment and A = 0 denotes control, X a vector of observed baseline covariates, and Y(a) the potential outcome under A = a, for a = 0, 1. Throughout the article, we assume consistency and the Stable Unit Treatment Value Assumption (SUTVA) so that the observed outcome satisfies Y = AY(1) + (1 − A)Y(0) (Rubin, 1980). Our estimand of interest is the average treatment effect in the RCT, θ⋆ = E{Y(1) − Y(0) | D = 1}. In particular, we consider testing a one-sided hypothesis, H0 : θ⋆ ≤ θ0 versus HA : θ⋆ > θ0. The other direction can be considered in the same way. Combining both one-sided tests and applying a Bonferroni correction gives a two-sided test (Cox et al., 1977, Section 4.2), and, by inversion, a confidence interval.
Write the RCT sample as (Yi, Xi, Ai, Di = 1), i = 1, . . . , nr, which is assumed to be independent and identically distributed according to the joint law of (Y, X, A) | D = 1. Let Ȳa and Sa² respectively be the sample mean and sample variance of the responses Yi from RCT subjects under treatment a, for a = 0, 1. Hence, the null hypothesis H0 can be tested using a simple Z-statistic,

T1 = (Ȳ1 − Ȳ0 − θ0) / {S1²/n1 + S0²/n0}^{1/2},

where n1 and n0 are respectively the numbers of RCT patients under treatment and control. Based on asymptotic normality, rejecting H0 when T1 ≥ z1−α, where z1−α is the (1 − α)-quantile of the standard normal distribution, gives an asymptotically size-α test. To supplement the RCT using external controls, one approach is to first extract external data for patients under control based on the inclusion/exclusion criteria of the RCT and then proceed by matching these external patients to the RCT patients based on their similarity in observed baseline information X (Schmidli et al., 2020). Let D = 0 denote the matched external controls, whose sample we write as (Yi, Xi, Ai = 0, Di = 0), i = 1, . . . , ne, assumed to be independent and identically distributed according to the joint law of (Y(0), X) | A = 0, D = 0. Suppose that matching has rendered the observed baseline covariates comparable between the RCT and external controls, i.e., D ⊥ X, and that these baseline covariates X explain all differences between the RCT and external controls, i.e., the exchangeability assumption Y(0) ⊥ D | X. Let Ȳe be the sample mean of the responses Yi from the external controls, and wȲ0 + (1 − w)Ȳe be a weighted average of mean responses for the two control groups, where w ∈ [0, 1] is a pre-specified weight, which could reflect the proportion of the internal controls among the two control groups combined. Therefore, the null hypothesis H0 can also be tested borrowing information from the external controls using

T2(w) = {Ȳ1 − wȲ0 − (1 − w)Ȳe − θ0} / {S1²/n1 + w²S0²/n0 + (1 − w)²Se²/ne}^{1/2},

where Se² is the sample variance of the responses Yi from the external controls. We make two remarks about T2(w). First, T2(w) is constructed assuming independence between the RCT and external controls, which means that T2(w) may be conservative due to correlation induced by matching (Austin and Small, 2014), but usually to a small extent as the correlation is typically small (Schafer and Kang, 2008).
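The two statistics are simple functions of group means and variances. A minimal sketch, assuming the standard difference-in-means forms described in this section; all data below are simulated for illustration only:

```python
# Sketch of the two Z statistics, assuming
#   T1    = (mean(Y1) - mean(Y0) - theta0) / sqrt(S1^2/n1 + S0^2/n0)
#   T2(w) = (mean(Y1) - w*mean(Y0) - (1-w)*mean(Ye) - theta0)
#           / sqrt(S1^2/n1 + w^2*S0^2/n0 + (1-w)^2*Se^2/ne)
import numpy as np

def t1(y1, y0, theta0=0.0):
    n1, n0 = len(y1), len(y0)
    se = np.sqrt(y1.var(ddof=1) / n1 + y0.var(ddof=1) / n0)
    return (y1.mean() - y0.mean() - theta0) / se

def t2(y1, y0, ye, w, theta0=0.0):
    n1, n0, ne = len(y1), len(y0), len(ye)
    num = y1.mean() - w * y0.mean() - (1 - w) * ye.mean() - theta0
    se = np.sqrt(y1.var(ddof=1) / n1
                 + w**2 * y0.var(ddof=1) / n0
                 + (1 - w)**2 * ye.var(ddof=1) / ne)
    return num / se

rng = np.random.default_rng(0)
y1 = rng.normal(0.3, 1.0, 159)  # simulated treated arm of the internal RCT
y0 = rng.normal(0.0, 1.0, 150)  # simulated internal controls
ye = rng.normal(0.0, 1.0, 159)  # simulated matched external controls

w = len(y0) / (len(y0) + len(ye))  # weight proportional to control sample sizes
print(t1(y1, y0), t2(y1, y0, ye, w))
```

Note that T2(1) coincides with T1, consistent with the family-of-statistics remark below.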
Second, T2(w), w ∈ [0, 1], defines a family of statistics that includes T2(1) = T1 as a special case. Among these, under the exchangeability assumption, the optimal w that maximizes the efficiency of T2(w) is proportional to the internal control sample size, i.e., the optimal w equals (nrπ0)/(nrπ0 + ne), where π0 denotes the proportion of RCT patients under control. One can also choose other values of w to reflect the weights allocated to the two control groups.

Controlling Type I Error Without Exchangeability
The aforementioned approach of leveraging external controls relies on the exchangeability assumption, which may not hold because the RCT patients and external controls may differ with respect to covariates that may not have been measured. Without exchangeability, Ȳ1 − {wȲ0 + (1 − w)Ȳe} is not necessarily centered at θ0 under H0, and rejecting the null hypothesis when T2(w) ≥ z1−α may inflate the type I error.
Define the bias ∆⋆ = E{Y(0) | A = 0, D = 1} − E{Y(0) | A = 0, D = 0}, which may be nonzero when exchangeability does not hold. This could occur, for example, if an important prognostic variable is unobserved and left uncontrolled, or if a variable that differs in distribution between D = 0 and D = 1 (such as region) cannot be matched. The correct rejection region for a size-α test based on T2(w) is T2(w) ≥ z1−α + (1 − w)∆⋆/SE(w), where SE(w) = {S1²/n1 + w²S0²/n0 + (1 − w)²Se²/ne}^{1/2} denotes the denominator of T2(w), which is infeasible because ∆⋆ is unknown. To deal with this issue, a tempting choice is to estimate ∆⋆ by Ȳ0 − Ȳe and adjust the numerator of T2(w) to make it mean zero. Nonetheless, this "de-biasing" step introduces additional variation and the resulting test statistic becomes equivalent to T1, the test statistic without using any external controls.
In order to borrow information from external controls while still controlling type I error, we consider departures from the exchangeability through the lens of a sensitivity analysis (Rosenbaum, 2020).
Specifically, we consider a sensitivity parameter ∆0 that bounds the magnitude of the bias ∆⋆, i.e., ∆⋆ ≤ ∆0, and define the adjusted statistic T2,∆0(w) = T2(w) − (1 − w)∆0/SE(w), with SE(w) = {S1²/n1 + w²S0²/n0 + (1 − w)²Se²/ne}^{1/2} the denominator of T2(w). Because ∆⋆ ≤ ∆0, the rejection region T2,∆0(w) ≥ z1−α controls the type I error at level α. As a special case, when ∆⋆ ≤ ∆0 holds with ∆0 = 0 (e.g., under exchangeability), the rejection region reduces to T2(w) ≥ z1−α, the rejection region under exchangeability. As ∆0 increases, there is greater uncertainty about how exchangeability might be violated, leading to a more stringent rejection criterion to control the type I error.
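A sketch of the sensitivity-adjusted statistic, assuming the adjustment subtracts the worst-case bias (1 − w)∆0 from the numerator of T2(w); the summary statistics below are made-up values for illustration:

```python
# Sensitivity-adjusted statistic, assuming
#   T2_Delta0(w) = (ybar1 - w*ybar0 - (1-w)*ybare - theta0 - (1-w)*Delta0) / SE(w)
import numpy as np

def t2_sens(mean1, mean0, meane, s1, s0, s_e, n1, n0, ne, w, theta0, delta0):
    se = np.sqrt(s1**2 / n1 + w**2 * s0**2 / n0 + (1 - w)**2 * s_e**2 / ne)
    num = mean1 - w * mean0 - (1 - w) * meane - theta0 - (1 - w) * delta0
    return num / se

# The statistic decreases as Delta0 grows: the larger the allowed
# unmeasured bias, the more stringent the rejection criterion.
stats = [t2_sens(0.5, 0.0, 0.0, 1, 1, 1, 100, 50, 150, 0.25, 0.0, d)
         for d in (0.0, 0.2, 0.4)]
print(stats)
```

Setting delta0 = 0 recovers the unadjusted T2(w).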
The rejection region T2,∆0(w) ≥ z1−α is sharp under ∆⋆ ≤ ∆0 in the sense that it is of size α when ∆⋆ = ∆0, so it cannot be improved unless further information is provided.

Power Comparison Without Exchangeability
Write σa² = Var{Y(a) | D = 1}, for a = 0, 1, and σe² = Var{Y(0) | D = 0}. Under the alternative hypothesis HA : θ⋆ > θ0, the power of T1 is the probability of the event T1 ≥ z1−α, which is asymptotically

Power1 = Φ((θ⋆ − θ0)/{σ1²/n1 + σ0²/n0}^{1/2} − z1−α),    (1)

where Φ(·) is the standard normal cumulative distribution function. In parallel, the power of T2,∆0(w) is asymptotically

Power2 = Φ({θ⋆ − θ0 + (1 − w)(∆⋆ − ∆0)}/{σ1²/n1 + w²σ0²/n0 + (1 − w)²σe²/ne}^{1/2} − z1−α).    (2)

Several remarks are in order based on the above power formulas. First, the power of T2,∆0(w) is larger than that of T1 if and only if the argument of Φ in (2) exceeds that in (1). For instance, when ∆0 = ∆⋆, i.e., the specified upper bound for ∆⋆ is tight, and σ0² = σe², i.e., the variances of Y for the two control groups are equal, simple algebra reveals that the power of T2,∆0(w) is always larger than that of T1 for any w satisfying max(0, (nrπ0 − ne)/(nrπ0 + ne)) ≤ w < 1.
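These asymptotic power formulas are straightforward to compute. A sketch assuming the drift forms stated above; the sample sizes and variances below are arbitrary illustrations:

```python
# Asymptotic power formulas, assuming
#   Power1 = Phi((theta - theta0)/SE1 - z_{1-alpha})
#   Power2 = Phi((theta - theta0 + (1-w)*(Delta_star - Delta0))/SE2 - z_{1-alpha})
import numpy as np
from scipy.stats import norm

def power_t1(theta, theta0, s1, s0, n1, n0, alpha=0.025):
    se1 = np.sqrt(s1**2 / n1 + s0**2 / n0)
    return norm.cdf((theta - theta0) / se1 - norm.ppf(1 - alpha))

def power_t2(theta, theta0, delta_star, delta0, s1, s0, s_e, n1, n0, ne,
             w, alpha=0.025):
    se2 = np.sqrt(s1**2 / n1 + w**2 * s0**2 / n0 + (1 - w)**2 * s_e**2 / ne)
    shift = theta - theta0 + (1 - w) * (delta_star - delta0)
    return norm.cdf(shift / se2 - norm.ppf(1 - alpha))

# Illustration of the remark: with a tight bound (Delta0 = Delta_star)
# and equal control variances, T2 gains power over T1.
p1 = power_t1(0.2, 0.0, 1, 1, 200, 100)
p2 = power_t2(0.2, 0.0, 0.2, 0.2, 1, 1, 1, 200, 100, 300, w=0.25)
print(p1, p2)
```

Here ne exceeds nr*pi0, so the condition max(0, (nr*pi0 − ne)/(nr*pi0 + ne)) ≤ w < 1 holds for any w in [0, 1).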
Second, we can derive the oracle w that maximizes the power of T2,∆0(w). The optimal w, denoted wopt, takes a piecewise form given in the supplementary material: in the first case, when ∆0 is specified too large, the power of T2,∆0(w) is maximized at w = 1, which means that using the external controls does not lead to an efficiency gain. Under another special case, when ∆⋆ = ∆0 and σ1 = σ0 = σe, wopt becomes (nrπ0)/(nrπ0 + ne), which agrees with the optimal w under exchangeability discussed in Section 2.1. The proof of (3) is given in the supplementary material.

A Combined Test
Should we leverage external controls? In other words, is it better to use the test statistic T1 constructed solely from the RCT, or the test statistic T2,∆0(w) that additionally leverages the external controls?
We know from the above theory and analysis that the answer depends upon the context, specifically upon the nature and size of the treatment effect and the specification of w and ∆0, which may be difficult to anticipate prior to examining the data. As motivated in Section 1, we propose a combined testing procedure that performs both T1 and T2,∆0(w), correcting for multiple testing using the joint distribution of the two test statistics.
Let ρ denote the asymptotic correlation between T1 and T2,∆0(w); under independence of the RCT and the external controls,

ρ = (σ1²/n1 + wσ0²/n0) / [{σ1²/n1 + σ0²/n0}{σ1²/n1 + w²σ0²/n0 + (1 − w)²σe²/ne}]^{1/2}.

For illustration, consider the special case σ1 = σ0 = σe and n1 = n0 = ne; then ρ increases as w increases from 0 to 1, and thus ρ ranges between 0.5 and 1.
Consider the testing procedure that, for any specified ∆0 and w, rejects H0 when

max(T1, T2,∆0(w)) ≥ c1−α;ρ,    (4)

where c1−α;ρ satisfies Φ2,ρ(c1−α;ρ) = 1 − α, Φ2,ρ(x, y) is the probability of the 2-dimensional lower orthant (−∞, x] × (−∞, y] for a bivariate normal distribution with expectation (0, 0)^T, unit variances, and correlation coefficient ρ, and we write Φ2,ρ(x) = Φ2,ρ(x, x). This combined testing procedure controls the type I error for any ∆⋆ ∈ [−∞, ∆0] because T2,∆0(w) ≤ T2,∆⋆(w) when ∆⋆ ≤ ∆0, and (T1, T2,∆⋆(w)) is asymptotically bivariate normal with unit variances and correlation ρ under H0, so the rejection probability is at most 1 − Φ2,ρ(c1−α;ρ) = α asymptotically. In what follows, we establish several attractive features of the combined test. Note that under the alternative hypothesis, the power of the combined test -- the probability of the event (4) -- is

Powerc ≈ 1 − Φ2,ρ(B1, B2),    (5)

where ≈ means asymptotic approximation, B1 = c1−α;ρ − (θ⋆ − θ0)/{σ1²/n1 + σ0²/n0}^{1/2}, and B2 = c1−α;ρ − {θ⋆ − θ0 + (1 − w)(∆⋆ − ∆0)}/{σ1²/n1 + w²σ0²/n0 + (1 − w)²σe²/ne}^{1/2}. This leads to the first observation that the power of the combined test is generally larger than the worst of the two component tests, i.e., Powerc ≥ min(Power1, Power2), where Power1, Power2, Powerc are respectively the asymptotic powers of T1, T2,∆0(w), and the combined test. This can be seen by noting that Powerc ≈ 1 − Φ2,ρ(B1, B2) ≥ max{1 − Φ(B1), 1 − Φ(B2)} ≥ min(Power1, Power2), where the second inequality holds when the powers of the two component tests are not too similar, so that the gap between them exceeds the small loss from replacing z1−α with the slightly larger c1−α;ρ.
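The critical value c1−α;ρ can be computed by solving Φ2,ρ(c, c) = 1 − α numerically; a sketch using SciPy's bivariate normal CDF and a root finder:

```python
# Solve Phi_{2,rho}(c, c) = 1 - alpha for the combined test's critical value.
from scipy.optimize import brentq
from scipy.stats import multivariate_normal, norm

def critical_value(rho, alpha=0.025):
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    # c lies between the unadjusted and Bonferroni-adjusted critical values
    lo, hi = norm.ppf(1 - alpha), norm.ppf(1 - alpha / 2)
    return brentq(lambda c: mvn.cdf([c, c]) - (1 - alpha), lo, hi)

c_05 = critical_value(0.5)  # about 2.21
c_07 = critical_value(0.7)  # about 2.18
print(c_05, c_07)
```

Higher correlation yields a smaller critical value, reflecting a smaller price for testing twice.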
Moreover, not only is the power of the combined test better than the worst of the two component tests, it is also close to the better of the two in finite samples, and equal to the better of the two in the limit. To see this, we bound the difference in power as

max(Power1, Power2) − Powerc ≤ 1 − 2Φ((z1−α − c1−α;ρ)/2).

It is helpful to anchor several values of c1−α;ρ and this upper bound in terms of different α and ρ. When α = 0.025 and for ρ = 0.5, 0.7, 1, the critical values are c1−α;0.5 = 2.21, c1−α;0.7 = 2.18, c1−α;1 = 1.96, and, correspondingly, the upper bounds are 0.100, 0.088, and 0. This means that because of the high correlation between T1 and T2,∆0(w), the price paid for multiple testing is generally small. With regard to the limiting power, it is also easy to see that for fixed θ0 and θ⋆ > θ0, B1 → −∞ as the sample size nr increases. Hence, the combined test always has its power approaching 1 as nr → ∞, just like the test T1 that only uses RCT data, which is not the case for T2,∆0(w) as discussed in Section 2.3. This further shows the advantage of the combined test.
For implementation of the sensitivity analysis (either T2,∆0(w) or the combined test), practitioners are not required to specify the value of the sensitivity parameter ∆0. Following the pioneering work of Cornfield et al. (1959) and the sensitivity analysis literature (Rosenbaum, 2020), results from the combined test can be summarized by the "tipping point" -- the magnitude of ∆0 that would be needed such that the null hypothesis can no longer be rejected. If such a value of ∆0 is deemed implausible, then we still have evidence to reject the null hypothesis based on the combined test. In Section 5, we illustrate the method using a real example.
Table 2 summarizes the power of T1, T2,∆0(w) and the combined test Tc,∆0(w) = max(T1, T2,∆0(w)), calculated respectively using (1), (2), and (5). For T2,∆0(w) and Tc,∆0(w), we consider two choices of w: the oracle w in (3) that maximizes the power (denoted as wopt), and its value under exchangeability, n0/(n0 + ne) = 1/4. In the supplementary material, we check the powers by simulation, finding good agreement. In the supplementary material, we also include a check of the type I error, which is close to or below the nominal level for all tests, indicating their validity. In contrast, a naive combined test without correcting for multiple testing cannot control the type I error.
The following is a summary of results in Table 2.
1. Across all scenarios, the power of the combined test Tc,∆0(1/4) is larger than the worst of the powers of T1 and T2,∆0(1/4), and close to the best of the two. This supports our theory in Section 3.
2. For T1, its power is not affected by ∆0. For T2,∆0(1/4), its power is mostly larger than that of T1 when ∆0 = 0.2, 0.3, but quickly diminishes as ∆0 increases and becomes substantially smaller than that of T1 when ∆0 = 0.4, 0.6 across most scenarios. In comparison, when θ⋆ = 0.2, 0.3, the sensitivity parameter ∆0 can be as large as 0.3 before the combined test Tc,∆0(1/4) starts to lose power compared to T1; when θ⋆ = 0.4, the sensitivity parameter ∆0 can be as large as 0.4. If a ∆0 larger than 0.3 or 0.4 is deemed implausible by practitioners, the combined test Tc,∆0(1/4) will have a power gain compared to T1. On the other hand, because the combined test Tc,∆0(1/4) still performs T1 as one of its components (i.e., anchors at T1), with only a small adjustment for testing twice, the potential power loss compared to T1 is never too large. This clearly demonstrates the key advantage of the combined test.
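Power entries of this kind can be reproduced from the bivariate normal distribution; a sketch assuming Powerc = 1 − Φ2,ρ(c1−α;ρ − μ1, c1−α;ρ − μ2), where μ1 and μ2 are the component drifts under the alternative (the drift and correlation values below are arbitrary illustrations):

```python
# Asymptotic power of the combined test via the bivariate normal CDF.
from scipy.optimize import brentq
from scipy.stats import multivariate_normal, norm

def combined_power(mu1, mu2, rho, alpha=0.025):
    mvn = multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]])
    c = brentq(lambda x: mvn.cdf([x, x]) - (1 - alpha),
               norm.ppf(1 - alpha), norm.ppf(1 - alpha / 2))
    return 1.0 - mvn.cdf([c - mu1, c - mu2])

# Component powers use the unadjusted critical value z_{1-alpha}.
p1 = norm.cdf(2.0 - norm.ppf(0.975))  # power of T1 at drift mu1 = 2
p2 = norm.cdf(3.0 - norm.ppf(0.975))  # power of T2 at drift mu2 = 3
pc = combined_power(2.0, 3.0, rho=0.7)
print(p1, p2, pc)
```

In this illustration the combined power exceeds the worse component power and stays within the stated multiple-testing bound of the better one.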

Application
We revisit the example introduced in Section 1.2 and illustrate how the proposed methods can be applied. Formally, we test the hypothesis H0 : θ⋆ = θ0 versus HA : θ⋆ < θ0, with θ0 = 0.4, which can be equivalently implemented using the tests described in Sections 2-3 with the Yi's replaced by −Yi's and θ0 replaced by −θ0. We set the significance level α = 0.025.
Using only the internal RCT, T1 = 4.80 with p-value 7.92 × 10^-7, based on which we reject the null hypothesis H0. This result is based solely on internal controls and thus is invariant to the value of ∆0.
Leveraging external controls with w = n0/(n0 + ne) = 0.485, we obtain T2(w) = 5.08 with p-value 1.88 × 10^-7 when ∆0 = 0. Therefore, under the exchangeability assumption, we can also reject the null hypothesis H0. To gauge the robustness of this conclusion to violation of exchangeability, we apply the proposed sensitivity analysis. As discussed at the end of Section 3, results of our sensitivity analysis can be summarized by the "tipping point" -- the magnitude of ∆0 that would be needed such that the null hypothesis can no longer be rejected. In this example, as ∆0 increases, the adjusted p-value associated with T2,∆0(w) increases but remains below α = 0.025 for any ∆0 ≤ 0.62. Namely, two patients with the same observed characteristics (as listed in Table 1), one in the internal RCT and the other in the external trial, may differ in their expected potential outcome under control by up to 0.62, and the adjusted p-value would still be below the significance level α. This means that the significant effect we observe cannot be explained away by unmeasured biases of magnitude up to ∆0 = 0.62. If such a large unmeasured bias is deemed implausible, then there is no real doubt that the rejection based on T2,∆0(w) provides evidence of noninferiority.
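A sketch of the tipping-point calculation, assuming the adjusted statistic equals T2(w) − (1 − w)∆0/SE, so that the tipping point solves T2(w) − (1 − w)∆0/SE = z1−α. The standard error value below is hypothetical, for illustration only; the text reports a tipping point of 0.62 for the real data:

```python
# Tipping point: largest Delta0 at which H0 is still rejected at level alpha,
# assuming T2_Delta0(w) = T2(w) - (1-w)*Delta0/se.
from scipy.stats import norm

def tipping_point(t2_stat, se, w, alpha=0.025):
    return (t2_stat - norm.ppf(1 - alpha)) * se / (1 - w)

# t2_stat and w are from the text; se = 0.10 is a made-up value.
tip = tipping_point(t2_stat=5.08, se=0.10, w=0.485)
print(tip)
```

The tipping point grows with the margin between the unadjusted statistic and z1−α, and shrinks as w approaches 1 matters less because the external weight (1 − w) scales the bias adjustment.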
Finally, using the combined test, max(T1, T2,∆0(w)) = 5.08 with adjusted p-value 3.41 × 10^-7 when ∆0 = 0. As ∆0 increases, the adjusted p-value for the combined test increases but plateaus at 1.41 × 10^-6 when T1 ≥ T2,∆0(w). This means that rejection based on the combined test is insensitive to any value of ∆0; i.e., similar to T1, which only uses the internal RCT, rejection based on the combined test is insensitive to any violation of the exchangeability assumption.
It is also interesting to examine the relative performance of T1, T2,∆0(w), and Tc,∆0(w) when the internal RCT is underpowered, in which case the combined test may be more useful. For this purpose, we randomly sample with replacement 100 patients from the internal RCT, with a target ratio of 4/5 from the treated arm and 1/5 from the control arm. Then T1 is computed using this subsample from the RCT, while T2,∆0(w) and Tc,∆0(w) additionally use the external controls that were matched to the sampled treated patients, with w = n0/(n0 + ne) calculated using the subsample. This procedure is repeated 1000 times.
Hence, the sensitivity parameter ∆0 can be as large as 0.25 before the combined test starts to lose power compared to T1. In comparison, T2,∆0(w) performs worse, with power equal to 80.4%, 64.8%, and 45.7% when ∆0 = 0.1, 0.2, 0.25, respectively. Taking a closer look at the results, we note that if T1 is larger than c1−α;ρ defined in (4), then both T1 and the combined test reject H0 regardless of the value of ∆0. If T1 < z1−α, then T1 cannot reject H0, while the combined test can still reject in 27.7% of these cases at ∆0 = 0.2. The potential loss from using the combined test arises when T1 is between z1−α and c1−α;ρ, in which case T1 alone rejects H0 but the combined test is sensitive to certain values of ∆0. However, this scenario is relatively rare, occurring in 8.4% of the repetitions; furthermore, even in this scenario, the combined test can still reject H0 at ∆0 = 0.2 around half the time.
The last step of a sensitivity analysis is to reason about whether a value of ∆0 = 0.2 is plausible given that we have already controlled for the baseline covariates listed in Table 1. For this task, an intuitive strategy is to judge the plausibility of ∆0 in reference to some observed covariates (Imbens, 2003).
Specifically, we can omit observed covariates one at a time during matching and calculate Ȳ0 − Ȳe using the resulting matched external controls. This procedure estimates the amount of bias from not matching on one of the observed covariates, which benchmarks the plausibility of ∆0, the amount of bias from not being able to match on the region variable. The results show that omitting the baseline HbA1c leads to the largest Ȳ0 − Ȳe, equal to 0.14, while omitting any other observed variable in Table 1 leads to Ȳ0 − Ȳe ranging from −0.05 to 0.04. Based on the prior knowledge in Home et al. (2014) that the baseline HbA1c explains most of the variability in the change in HbA1c, particularly in comparison to the geographical region, we view ∆0 = 0.2 as implausible.
In summary, before looking at the data, the choice between T1 and T2,∆0(w) would be difficult to make or justify on the basis of a priori considerations. In some cases, T1 may not be powerful enough due to the small sample size of the internal RCT, while leveraging external controls leads to a more powerful test. In other cases, T2,∆0(w) may be sensitive to unmeasured biases while T1 is already powerful enough. Under these circumstances, the combined test Tc,∆0(w) is often preferable, as it performs both tests with a small correction for multiple testing that takes into account the high correlation of the two test statistics.

Discussion
We propose a sensitivity analysis approach for using external controls in clinical trials to examine the robustness of the study conclusion to unmeasured bias remaining after controlling for measured covariates.
Results from the sensitivity analysis can be summarized by the "tipping point" -- the magnitude of ∆0 that would be needed such that the null hypothesis can no longer be rejected. If this ∆0 is deemed plausible (or implausible), the conclusion based on using external controls is sensitive (or robust) to unmeasured bias.
When in doubt about whether the use of external controls increases power, we propose a combined testing procedure that performs both tests, one using only the internal controls and one additionally using the external controls, correcting for multiple testing using the joint distribution of the two test statistics. Because the two test statistics are highly correlated, this correction for multiple testing is small, and thus the combined test only has a small loss of power compared to knowing a priori which test is best. Moreover, the combined test provides a new method of sensitivity analysis designed for data fusion problems, which anchors at the unbiased RCT-only analysis and spends a small proportion of the type I error to also test using the external controls. In this way, if leveraging external controls increases power, the power gain compared to the RCT-only analysis can be substantial; if not, the power loss is small.
Our work is motivated by the sensitivity analysis literature, in which testing a hypothesis multiple times has been shown to be useful in enhancing robustness to unmeasured bias (Rosenbaum, 2012; Small et al., 2013; Rosenbaum and Small, 2017; Ye and Small, 2021). Nonetheless, we focus on a distinct context and have shown that testing multiple times using both a known unbiased test and potentially biased tests can be particularly attractive for data fusion problems. We have also developed various properties of the combined procedure that have not appeared in the existing literature.
Finally, a remaining question is how to choose w for the combined test. The power of the combined test depends on w in a complicated way, as w affects not only the definition of T2,∆0(w) but also the correlation ρ, which makes finding the optimal w a cumbersome task. In practice, a reasonable choice is w = π0nr/(ne + π0nr), which minimizes the variance of wȲ0 + (1 − w)Ȳe when Var{Y(0) | D = 1} = Var{Y(0) | D = 0}. Another way is to pre-specify several values of w, calculate the corresponding test statistics, and combine all the test statistics using their joint null distribution. Because of the high correlation between these test statistics, the price paid for multiple testing will generally be small.

Table 2: Theoretical power (in %) for T1, T2,∆0(w) and the combined test Tc,∆0(w) with w = 1/4 or wopt, where θ0 = 0, ∆⋆ = 0.2, n1 : n0 : ne = 2 : 1 : 3, σ1 = σ0 = σe = 1, and α = 2.5%. In the table, we omit the ∆0 subscript for notational simplicity.

Table S1: Empirical and theoretical type I error (in %) for T1, T2,∆0(w), the combined test Tc,∆0(w), and the naive combined test with w = 1/4, where θ0 = 0, ∆⋆ = 0.2, n1 : n0 : ne = 2 : 1 : 3, σ1 = σ0 = σe = 1, and α = 2.5%. The empirical version is based on 10,000 repetitions. In the table, we omit the ∆0 subscript for notational simplicity.

Table 1 :
Covariate balance after matching in 159 matched pairs of one treated patient in the RCT and one external control patient.