Principal Surrogates in Context of High Vaccine Efficacy

Background: The use of correlates of protection (CoPs) in vaccination trials offers significant advantages as useful clinical endpoint substitutes. Vaccines with very high vaccine efficacy (VE), 95% or above, are documented in the literature. Callegaro and Tibaldi (2019) showed that the rare infections observed in the vaccinated groups of these trials pose challenges when applying conventionally used statistical methods for CoP assessment, such as the Prentice criteria and meta-analysis. Methods: In this paper, we extended the simulation study of Callegaro and Tibaldi (2019) by evaluating the impact of high VE on the principal stratification approach. Results: As in the Prentice framework, we showed that the power decreases when the VE grows. It follows that it can be challenging to validate a principal surrogate (and a statistical surrogate) when rare infections are observed in the vaccinated groups.


Background
An important factor influencing the duration and complexity of clinical trials is the choice of the endpoint used to assess vaccine efficacy. It would be extremely convenient to replace a late, costly or rare true endpoint by an immunological surrogate, which is measured earlier, more cheaply, or more frequently. However, from a regulatory perspective, a surrogate endpoint (sometimes called a surrogate of protection or correlate of protection) is not considered acceptable for the determination of efficacy unless it has been validated, i.e. shown to predict clinical benefit. Prentice (1989) [1] introduced a formal definition of surrogacy based on the concept of mediation in a single-trial setting. Although appealing, Prentice's definition and criteria received criticism [2,3]. In subsequent decades, many statistical methods have been proposed for the evaluation of surrogate endpoints, most of them framed within the causal inference [2,4,5] and meta-analytic [6,7,8] paradigms.
Although not common, vaccines with very high efficacy are documented in the literature [9,10,11,12,13,14]. These include the Salmonella Typhi Vi conjugate vaccine [9] and the combined measles-mumps-rubella-varicella immunisation [14]. Assessing CoPs in the context of high VE using classical statistical methods is problematic: the very small number of cases/infections in the vaccinated groups can cause considerable issues for such statistical models. There is therefore a need to evaluate the statistical methods for CoP assessment in the context of high-efficacy vaccines. Callegaro and Tibaldi (2019) [15] showed that the validation of a surrogate endpoint using the Prentice criteria and meta-analytic frameworks (by randomized subgroups in a single-trial setting) can be problematic in the case of high VE because of the rare events available in the vaccine group. The aim of this paper is to evaluate the performance of the causal framework, specifically the Principal Surrogate approach [4,5], in the case of high VE.

The Prentice Criteria
The following notation will be used throughout the manuscript: $Y_j$ and $S_j$ are random variables denoting the observed (binary) clinical endpoint and the surrogate endpoint for subject $j = 1, \ldots, n$, and $Z_j$ is a binary treatment indicator.
Key concepts, including the hypothesis-testing approach to the validation of substitute endpoints using randomised clinical trial data, were introduced by Prentice in 1989 [1]. Prentice's four criteria for the validation of a surrogate endpoint can be evaluated using four regression models. In this paper we mainly focus on criterion 4, which is assessed in the model for the clinical endpoint given both treatment and surrogate. This criterion is met if the null hypothesis $H_{01}: \gamma_Z = 0$ (no effect of the surrogate) is rejected (p-value(s) $< \alpha$) and the null hypothesis $H_{02}: \beta_S = 0$ (no effect of treatment) is not rejected (p-value(z) $\geq \alpha$).
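The decision rule for criterion 4 can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the p-values are assumed to be two-sided Wald p-values computed from a coefficient estimate and its estimated variance.

```python
import math


def wald_p_value(estimate: float, variance: float) -> float:
    """Two-sided Wald p-value, 2 * Phi(-|estimate / sqrt(variance)|)."""
    z = abs(estimate) / math.sqrt(variance)
    # For the standard normal CDF Phi, 2 * Phi(-z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


def meets_criterion4(p_surrogate: float, p_treatment: float,
                     alpha: float = 0.05) -> bool:
    """Criterion 4 is met when the surrogate effect is significant
    and the treatment effect, adjusted for the surrogate, is not."""
    return p_surrogate < alpha and p_treatment >= alpha


if __name__ == "__main__":
    # p-values reported for the case study later in the paper.
    print(meets_criterion4(p_surrogate=0.019, p_treatment=0.078))
```

Note that the rule treats a non-significant treatment effect as evidence of full mediation, which is one reason the criterion depends heavily on the number of observed events.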

Principal Surrogate Framework
Many causal inference methods have been published in the literature. In what follows, we describe the vaccine efficacy framework of Follmann and Gilbert [4,5]. Since $S_i$ can be affected by treatment, there are two naturally occurring counterfactual values of $S_i$: $S_i(1)$ under treatment and $S_i(0)$ under control. The observed (binary) clinical endpoint is denoted by $Y_i$, and its counterfactual values are $Y_i(1)$ under treatment and $Y_i(0)$ under control. Criteria for $S$ to be a good surrogate are based on risk estimands that condition on the potential surrogate responses [5]:

$$risk_{(z)}(s(1), s(0)) = P(Y_i(z) = 1 \mid S_i(1) = s(1), S_i(0) = s(0)), \quad z = 0, 1.$$

A contrast in $risk_{(1)}(s(1), s(0))$ and $risk_{(0)}(s(1), s(0))$ is a causal effect on the clinical endpoint. A classical contrast used in vaccines is the vaccine efficacy (VE):

$$VE(s(1), s(0)) = 1 - \frac{risk_{(1)}(s(1), s(0))}{risk_{(0)}(s(1), s(0))}.$$

A Principal Surrogate (PS) is a biomarker satisfying two conditions: causal necessity, $VE(s(1), s(0)) = 0$ for all $s(1) = s(0)$, and Wide Effect Modification (WEM), which means that $VE(s(1), s(0))$ varies widely across values of $(s(1), s(0))$. WEM is similar in spirit to the Individual Causal Association (ICA) [16], which is the correlation between the individual causal effect on the endpoint and on the surrogate.
In this paper we focus only on WEM. Recent work [5,17,18,19,20] suggests that the WEM criterion is of primary importance for a biomarker to be a PS. Furthermore, [16] showed that the average causal necessity definition may be extremely restrictive.

Estimating VE
Assumptions A1-A3 (A1: stable unit treatment value assumption; A2: ignorable treatment assignment; A3: equal individual clinical risk up to the time of surrogate measurement) imply that $risk_{(z)}(s(1), s(0))$ would be identified if we knew the potential surrogate values $S_i(z)$ of subjects assigned the opposite treatment $1 - z$:

$$risk_{(z)}(s(1), s(0)) = P(Y_i = 1 \mid Z_i = z, S_i(1) = s(1), S_i(0) = s(0)).$$

It follows that it is necessary to impute (or integrate out) the missing potential biomarkers. The risk can be modeled using the following logistic model:

$$\operatorname{logit} P(Y_i = 1 \mid z_i, s_{1i}, s_{0i}) = \beta_0 + \beta_z z_i + \beta_{s(1)} s_{1i} + \beta_{s(1)z} s_{1i} z_i + \beta_{s(0)} s_{0i} + \beta_{s(0)z} s_{0i} z_i. \qquad (1)$$

The model can be simplified in the case of a constant biomarker ($S_i(0) = c$), where the VE curve

$$VE(s(1)) = 1 - \frac{risk_{(1)}(s(1), c)}{risk_{(0)}(s(1), c)}$$

is used. The constant biomarker assumption is reasonable when subjects have been selected to have no meaningful exposure to the pathogen, so that $S(0) = 0$. Examples include HIV or varicella vaccine trials. This assumption is also reasonable for populations exposed to the pathogen when the biomarker $S_i$ is the log10 fold-increase from baseline ($FI_i$), which is the difference between the log10 post-vaccination value ($A_i$) and the log10 baseline value ($B_i$): $FI_i = A_i - B_i$.
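To make the VE curve concrete, the sketch below evaluates it from the coefficients of the logistic risk model under the constant biomarker assumption. It assumes that, with $S_i(0)$ constant, the $s_{0i}$ terms are absorbed into the intercept and the treatment main effect, leaving four coefficients. The function names and the coefficient values in the usage lines are hypothetical and chosen only for illustration.

```python
import math


def expit(x: float) -> float:
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def ve_curve(s1: float, beta0: float, beta_z: float,
             beta_s1: float, beta_s1z: float) -> float:
    """VE(s1) = 1 - risk_(1)(s1) / risk_(0)(s1) under a logistic risk model
    with a constant biomarker under placebo (assumed simplification)."""
    risk_placebo = expit(beta0 + beta_s1 * s1)
    risk_vaccine = expit(beta0 + beta_z + (beta_s1 + beta_s1z) * s1)
    return 1.0 - risk_vaccine / risk_placebo


if __name__ == "__main__":
    # Hypothetical coefficients: no treatment main effect, negative interaction.
    for s1 in (0.0, 0.5, 1.0, 1.5):
        print(s1, round(ve_curve(s1, -3.0, 0.0, -0.5, -2.0), 3))
```

In this parametrization, a curve that changes with `s1` requires a non-zero interaction coefficient `beta_s1z`, which is why the WEM test targets the s(1)z term.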

Missing values imputation/integration
The key challenge in estimating these risk estimands is conditioning on counterfactual values that are not observable. This involves integrating out (or imputing) missing values based on some models, under a set of assumptions and/or trial augmentations. [5] and [4] proposed to use estimated maximum likelihood followed by the bootstrap. [21] suggested a pseudoscore estimation procedure that has a closed-form variance estimator. [22] used a multiple imputation approach. In this paper we fit model (1) using the method implemented in the R package pseval [23], with a Baseline Immunogenicity Predictor (BIP): parameters are estimated by estimated maximum likelihood (missing information is integrated out) and the variance is estimated by bootstrap. R code is provided in the Appendix. This approach is similar in spirit to the method used in Follmann [4].
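The idea of integrating out the missing $S_i(1)$ of a placebo recipient can be illustrated with a small Monte Carlo sketch. This is not the pseval implementation: the Gaussian linear working model for $S(1)$ given the baseline predictor $B$, the function name, and all parameter names are assumptions made for illustration.

```python
import math
import random


def expit(x: float) -> float:
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def integrated_placebo_risk(b: float, beta0: float, beta_s1: float,
                            intercept: float, slope: float, resid_sd: float,
                            n_draws: int = 2000, seed: int = 1) -> float:
    """Average the placebo-arm risk over draws of the unobserved S(1).

    Assumed working model: S(1) | B = b ~ Normal(intercept + slope * b, resid_sd).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        s1 = rng.gauss(intercept + slope * b, resid_sd)
        total += expit(beta0 + beta_s1 * s1)
    return total / n_draws


if __name__ == "__main__":
    # Hypothetical values: baseline 3.0, working model S(1) = 1.5 - 0.3 * B + noise.
    print(integrated_placebo_risk(3.0, -3.0, -0.5, 1.5, -0.3, 0.2))
```

The weaker the baseline predictor, the larger `resid_sd` and the more the placebo contribution to the likelihood is smoothed, which is one source of extra variability in the PS estimates.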

Simulations of Callegaro and Tibaldi (2019)
To evaluate the impact of high vaccine efficacy on PS validation, we repeated the simulations of Callegaro and Tibaldi [15]. The Dunning regression model [24] (Equation 2) was used to simulate the data in an ideal CoP setting, where the treatment effect is fully explained by the post-vaccination values ($A_i$). In this model, $\pi$ can be interpreted as the probability of being exposed to the disease. Simulations were run using the following parameter assumptions: total sample size $n = 5000$, 1:1 randomization, $\pi = 0.1$, $\mu = 8.3$, $\gamma = \log(1 - 0.95)$; the immune response post vaccination is normally distributed, $A \mid Z = 0 \sim N(3, 0.2)$ in the placebo group and $A \mid Z = 1 \sim N(3 + \Delta, 0.2)$ in the vaccine group, where $\Delta = 0.33, 0.75, 1, 1.5$. The value of the immune response at baseline is generated as $B \sim N(3, 0.2)$, with correlation between $A$ and $B$ of 0.90 in the placebo group and 0.50 in the vaccine group. For each scenario, 1000 clinical trials were simulated.
We fit Prentice model 4 on the simulated data with fold-increase ($S_i = FI_i = A_i - B_i$) as surrogate, adjusting for the baseline ($B_i$), using logistic regression and the scaled logistic model [24]. Note that the scaled logistic model is consistent with the model used to generate the data (Equation 2), with a slightly different parametrization. The power to meet Prentice criterion 4 (PC4) was measured as the proportion of simulated trials with p-value(s) $= 2\Phi(-|\hat\gamma_Z / \sqrt{Var(\hat\gamma_Z)}|) < \alpha$ and p-value(z) $= 2\Phi(-|\hat\beta_S / \sqrt{Var(\hat\beta_S)}|) \geq \alpha$, with $\alpha = 0.05$.
Furthermore, we applied the Principal Surrogate approach to the vaccine-induced fold-increase ($S_i(1) = FI_i(1)$), where missing information is integrated out using the baseline surrogate measurement ($B_i$). The power of the WEM approach was measured as the proportion of simulated trials with a significant Wald statistic for the s(1)z coefficient of model (1) (p-value(s(1)z) $< \alpha$, $\alpha = 0.05$). The Appendix contains the R code used to apply the PS approach.
Table 1 shows that the power of both PC4 and WEM decreases when the VE increases. This is because there is less information (fewer events) as the VE increases. Note that the power of the Prentice approach is higher than in Callegaro and Tibaldi [15] because of the inclusion of the baseline surrogate as a covariate. Simulation results suggest similar power for the PC4 and WEM approaches.
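A single simulated trial under a scaled logit model can be sketched as follows. Since the data-generating equation is not reproduced in the text, the functional form $P(Y = 1 \mid A) = \pi / (1 + \exp(-(\mu + \gamma A)))$ is an assumption, as is reading 0.2 as a standard deviation. The baseline value $B$ is omitted because it does not enter this assumed outcome model, and all function names are hypothetical.

```python
import math
import random


def simulate_trial(delta: float, n: int = 5000, pi: float = 0.1,
                   mu: float = 8.3, gamma: float = math.log(1 - 0.95),
                   sd: float = 0.2, seed: int = 0):
    """Simulate one 1:1 randomized trial; return event counts and arm sizes."""
    rng = random.Random(seed)
    events = {0: 0, 1: 0}
    sizes = {0: 0, 1: 0}
    for j in range(n):
        z = j % 2  # alternate arms to get an exact 1:1 allocation
        a = rng.gauss(3.0 + delta * z, sd)  # post-vaccination value A
        # Assumed scaled logit: exposure probability times a logistic term in A.
        p = pi / (1.0 + math.exp(-(mu + gamma * a)))
        events[z] += rng.random() < p
        sizes[z] += 1
    return events, sizes


def estimated_ve(events, sizes) -> float:
    """VE estimate as one minus the ratio of attack rates."""
    return 1.0 - (events[1] / sizes[1]) / (events[0] / sizes[0])


if __name__ == "__main__":
    for delta in (0.33, 0.75, 1.0, 1.5):
        ev, sz = simulate_trial(delta, seed=1)
        print(delta, ev, round(estimated_ve(ev, sz), 3))
```

Printing the event counts makes the core difficulty visible: as `delta` grows, the vaccine arm contributes only a handful of events to any model fitted afterwards.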

Simulations with constant biomarker under placebo
In the previous simulations the fold-increase was not constant in the placebo group (it was normally distributed). To evaluate the performance of the Prentice and PS approaches in the case of a constant biomarker under placebo, which mimics vaccine trials in a naive population, we simulated data using the model described above. However, in the inferential models, we replaced FI by FI*, which is constant in the placebo group. FI* is defined using a threshold $c$, the 99% quantile of the distribution of FI in the placebo group. Table 2 shows some loss of power of the PS approach when the VE increases. Even though the use of the Prentice framework is not justified in this context, Table 2 also shows the results for Prentice criterion 4 (PC4, logistic model). Results from the PC4 scaled logistic model are not shown because the model did not converge. We observe a dramatic loss of power of Prentice criterion 4 when the VE is high.
Note that Table 2 shows simulation results where the inferential models do not agree with the data-generating mechanism, so it represents a situation of model misspecification.
To disentangle the problem of model misspecification from the constant biomarker problem, we generated additional constant biomarker data using a model consistent with the inferential model used to fit the data. We simulated data using a Dunning regression model with $\pi = 0.1$ and the other parameters chosen to mimic Table 1. Table 3 shows that the loss of power of the Prentice approach shown in Table 2 was mainly due to model misspecification.

Simulations with low/moderate VE
For comparison, we considered simulations with low VE. We simulated data as described above with $\mu_1 = E(A \mid Z = 1) = 3, 3.075, 3.15, 3.23$, corresponding to estimated VE of about 0%, 10%, 20% and 30%, respectively. Note that Prentice criterion 1 will not be met in this situation. For simplicity, we focused only on Prentice criterion 4. Table 4 shows that both approaches (PC4 and WEM) are powerful in the case of low/moderate VE. Prentice criterion 4 seems to be slightly more powerful than PS.

Simulations using random intercept logistic (correlated potential outcomes)
Finally, we generated data in a different way, more aligned with the causal inference setting (potential outcomes). We generated correlated post-vaccination values $(A(0), A(1))$ using a bivariate normal distribution, and the clinical outcomes $Y(0), Y(1)$ from a logistic model sharing a subject-specific random intercept $b$. The variables $Y(0), Y(1)$ are conditionally independent given $b$ but unconditionally (averaged over $b$) correlated. The extent of correlation depends on the variance of the random effect ($var(b)$). We generated a bridge-distributed random intercept (using the R package bridgedist [25]) such that the resulting marginal distribution follows a logistic regression model [26]. In fact, the marginal logistic regression model is $\operatorname{logit} P(Y(z) = 1 \mid A(z)) = \mu/c + A(z)\gamma/c$ for $z = 0, 1$, with $c = \sqrt{1 + 3\,var(b)/\pi^2}$ (here $\pi$ denotes the mathematical constant, not the exposure probability). We simulated data with the following parameters: $var(b) = 10$ (scale = 0.5), $\mu = 3.6$ and $\gamma = -3.8$. In this way, $p_0 = P(Y = 1 \mid Z = 0) = 0.05$ and the estimated VE is about 0.45, 0.75, 0.85 and 0.95.
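The link between the random-intercept variance and the marginal coefficients can be checked numerically. The sketch below assumes the attenuation constant is $c = \sqrt{1 + 3\,var(b)/\pi^2}$, which is consistent with the reported scale of 0.5 for $var(b) = 10$. The function names are hypothetical.

```python
import math


def attenuation_constant(var_b: float) -> float:
    """c = sqrt(1 + 3 * var(b) / pi^2) for a bridge-distributed random intercept."""
    return math.sqrt(1.0 + 3.0 * var_b / math.pi ** 2)


def marginal_coefficients(mu: float, gamma: float, var_b: float):
    """Marginal (population-averaged) intercept and slope: conditional values / c."""
    c = attenuation_constant(var_b)
    return mu / c, gamma / c


if __name__ == "__main__":
    c = attenuation_constant(10.0)
    print("scale 1/c:", round(1.0 / c, 3))  # close to the reported scale of 0.5
    print("marginal (mu, gamma):", marginal_coefficients(3.6, -3.8, 10.0))
```

With these values the marginal slope is roughly half the conditional one, so the population-level association between $A(z)$ and the outcome is considerably weaker than the subject-level association.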
Table 5 shows that Prentice criterion 4 is more powerful than WEM. The Prentice framework is more powerful than PS for two reasons: i) PS tests for an interaction, which is less powerful than a test for the main effect; ii) the covariate S (observed surrogate in vaccinated and placebo) has a greater range in Prentice model 4 than the covariate S(1) in the PS model. It is easier to estimate a slope for a covariate with a bigger range. Figure 1 illustrates these differences.
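The range argument can be made concrete with the familiar simple linear regression analogue. It is offered only as an illustrative sketch, not as the variance formula of either logistic model used here; $\sigma^2$ denotes the residual variance and $x_i$ the covariate.

```latex
\operatorname{Var}(\hat\beta_1)
  = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\quad\Longrightarrow\quad
\operatorname{SE}(\hat\beta_1) \approx \frac{\sigma}{\sqrt{n}\,\operatorname{sd}(x)} .
```

Doubling the spread of the covariate therefore roughly halves the standard error of its slope. In logistic regression the same sum appears in the Fisher information with each term weighted by $p_i(1 - p_i)$, so a narrow covariate range combined with few events is doubly penalized.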

Case study: Analysis of a simulated data-set with large VE
In this section we analyze one simulated dataset from the scenario with the largest VE in Table 1. The sample size is N = 5000, with 1:1 randomization. The numbers of events observed in the two groups are 3 and 90, with an estimated VE of 96% (95% CI, 89%-98%). Figure 2 shows that the vaccine and placebo groups had similar log10 titer distributions at baseline, while there is only a small overlap in the distributions post vaccination. Antibody responses clearly increased from baseline to post-dose in vaccine recipients but not in placebo recipients. Figure 3 shows the Spearman correlation between baseline and post-vaccination values (left panel) and between baseline and fold-increase (right panel). First, we examine the interaction between the surrogate and the treatment. Table 6 shows that there is no interaction (p-value = 0.49).
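The VE estimate can be reproduced from the event counts with a short calculation. The sketch assumes exactly 2500 subjects per arm, 3 events in the vaccine arm, and a Wald interval on the log relative risk. The method used for the reported interval is not stated, so the bounds need not match exactly, and the function name is hypothetical.

```python
import math


def ve_with_wald_ci(cases_vaccine: int, n_vaccine: int,
                    cases_placebo: int, n_placebo: int,
                    z_crit: float = 1.959964):
    """VE = 1 - RR with a Wald confidence interval computed on the log(RR) scale."""
    rr = (cases_vaccine / n_vaccine) / (cases_placebo / n_placebo)
    se_log_rr = math.sqrt(1 / cases_vaccine - 1 / n_vaccine
                          + 1 / cases_placebo - 1 / n_placebo)
    rr_low = math.exp(math.log(rr) - z_crit * se_log_rr)
    rr_high = math.exp(math.log(rr) + z_crit * se_log_rr)
    # A high relative risk corresponds to a low VE, so the bounds swap.
    return 1 - rr, 1 - rr_high, 1 - rr_low


if __name__ == "__main__":
    ve, ve_low, ve_high = ve_with_wald_ci(3, 2500, 90, 2500)
    print(f"VE = {ve:.3f}, 95% CI ({ve_low:.3f}, {ve_high:.3f})")
```

With only 3 events in the vaccine arm, the standard error of the log relative risk is dominated by the term 1/3, which illustrates how little information the vaccinated group contributes.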
Secondly, we assess the four Prentice criteria. Table 7 shows that all criteria are met. In particular, the last 4 rows show the results related to criterion 4. We can see that the effect of the surrogate is significant (p-value(s) = 0.019), while the treatment effect is not significant but is close to the 5% level (p-value(z) = 0.078).
Slightly better results are obtained if the Dunning model is used (see Table 8).
In summary, there is suggestive though not strong evidence that the Fold-Increase is a Statistical Surrogate.

Principal Surrogate framework
Table 9 shows the results from the R package pseval with 50 bootstrap replicates (R code is provided in the Appendix). We can see that the interaction between the treatment group and FI(1) (the test for wide effect modification) is borderline (p-value = 0.053). Figure 4 shows the estimated VE curve for fold-increase. The estimated VE curve is an increasing function of FI(1); however, we can see large variability for small values of FI(1) and negative VEs for vaccine recipients with no rise.
In summary, there is suggestive though not strong evidence that the Fold-Increase is a Principal Surrogate.

Conclusions
Although not common, vaccines with very high efficacy (95% or above) are documented in the literature [9,10,11,12,13,14]. These trials raise the problem of assessing CoPs when only a small number of cases/infections is available in the vaccinated groups.
Callegaro and Tibaldi (2019) [15] showed that the validation of a surrogate endpoint using the Prentice criteria and meta-analytic frameworks (by randomized subgroups in single trial setting) can be problematic in case of high VE. In this paper, we evaluate the performance of the causal framework, specifically the Principal Surrogate (PS) approach [4,5] in case of high VE.
First, we replicated the simulation study of Callegaro and Tibaldi [15], where the clinical outcome was simulated using Prentice model 3 (assuming full mediation) and using the Dunning model [24]. These simulation results show that adjustments for important covariates (such as the baseline surrogate) considerably improve the power of the Prentice approach in the case of high VE, even if the model is misspecified. Furthermore, these simulation results show similar power for the Prentice and PS frameworks. The power of both approaches decreases when VE grows.
Second, we slightly changed the Callegaro and Tibaldi scenario to consider the case of a constant biomarker under placebo and the case of small/moderate VE. Simulation results show that i) PS is more powerful than Prentice in the case of a constant biomarker when the inferential model is misspecified, otherwise Prentice is more powerful; ii) Prentice criterion 4 and the PS framework are powerful when the VE is small (see Table 4). However, in this case Prentice criterion 1 is not met, so the two approaches give different conclusions.
Finally, we simulated correlated potential outcome data using a bivariate (random intercept) logistic regression. In this case the Prentice framework is more powerful than the PS approach. This can be due to the following reasons: i) Prentice model 4 corresponds to the model used to generate the data, so there is no lack of fit in the Prentice framework; ii) PS tests for an interaction, which is less powerful than a test for the main effect; iii) the covariate S (observed surrogate in vaccinated and placebo) has a greater range in Prentice model 4 than the covariate S(1) in the PS model, and it is easier to estimate a slope for a covariate with a bigger range (see Figure 1); iv) principal stratification has to impute S(1) for placebo participants, which increases the variability of the estimates relative to knowing S(1). In contrast, S is observed for all participants under the Prentice criterion.
It is important to highlight that the power comparison between the two approaches should be interpreted with care. In fact, the two approaches measure two different things: the Prentice framework evaluates whether the surrogate is a "statistical surrogate", while the PS framework evaluates whether the surrogate is a "principal surrogate" (see [27] for more details).
For illustration, we analyzed one data-set simulated with full mediation (Dunning model 3) and with high VE (VE = 96%). Results showed suggestive though not strong evidence that the FI is a Statistical Surrogate (Prentice criteria) or a PS. These results are due to the lack of power of these approaches in the case of high VE. An interesting topic for future research is the implementation of the two approaches in a Bayesian framework with weakly informative priors (WIP). In fact, [15] showed that WIP can considerably increase the power of the meta-analytical approach in the case of high VE.
In conclusion, we evaluated by simulation the impact of high VE on the PS approach. As in the Prentice framework, we showed that the power decreases when the VE grows. It follows that it can be challenging to validate a principal surrogate (and a statistical surrogate) when rare infections are observed in the vaccinated groups.

Competing interests
AC and FT are employees of the GSK group of companies and hold shares in the GSK group of companies. DF declares no conflict of interest.

Funding
GlaxoSmithKline Biologicals SA was the funding source and was involved in all stages of the study conduct and analysis. GlaxoSmithKline Biologicals SA also took responsibility for all costs associated with the development and publishing of the present manuscript.

Authors' contributions
AC, FT and DF contributed equally to all steps of the manuscript's development and approved its final version.

Tables
Table 1. Simulation results of data generated using scaled logit Prentice model 3. Power (α = 0.05) to assess Prentice criterion 4 (PC4) using logistic and scaled logistic models, and power to assess the Wide Effect Modification (WEM) of a Principal Surrogate using the logistic model (p-value of the interaction s(1)z).
Table 3. Simulation results with constant biomarker (inferential models agree with the data-generating mechanism). Power (α = 0.05) to assess Prentice criterion 4 (PC4) using the logistic model, and power to assess the Wide Effect Modification (WEM) of a Principal Surrogate using the logistic model (p-value of the interaction s(1)z).