# An Approach to Evaluating and Comparing Biomarkers for Patient Treatment Selection

Holly Janes, Marshall D. Brown, Ying Huang and Margaret S. Pepe

# Abstract

Despite heightened interest in developing biomarkers that predict treatment response and can be used to optimize patient treatment decisions, there has been relatively little development of statistical methodology to evaluate these markers. There is currently no unified statistical framework for marker evaluation. This paper proposes a suite of descriptive and inferential methods designed to evaluate individual markers and to compare candidate markers. An R software package has been developed which implements these methods. Their utility is illustrated in the breast cancer treatment context, where candidate markers are evaluated for their ability to identify a subset of women who do not benefit from adjuvant chemotherapy and can therefore avoid its toxicity.

## 1 Introduction

There is an enormous amount of research effort being devoted to discovering and evaluating markers that can predict a patient’s chance of responding to treatment. A December, 2013 PubMed search identified 8,198 papers evaluating such markers over the last 2 years alone. Treatment selection markers, sometimes called “predictive” [1] or “prescriptive” [2] markers, have the potential to improve patient outcomes and reduce medical costs by allowing treatment provision to be restricted to those subjects most likely to benefit, and avoiding treatment in those only likely to suffer its side effects and other costs.

Methods for evaluating treatment selection markers are much less well developed than for markers used to diagnose disease or predict risk under a single treatment. In the medical literature, the most common approach to marker evaluation is to test for a statistical interaction between the marker and treatment in the context of a randomized controlled trial (RCT; see Coates et al. [3], Busch et al. [4], and Malmstrom et al. and NCBTSG [5] for some recent examples). However, this approach has limitations: it does not provide a clinically relevant measure of the benefit of using the marker to select treatment, and it does not facilitate comparing candidate markers [6]. Moreover, the scale and magnitude of the interaction coefficient will depend on the form of the regression model used to test for interaction and on the other covariates included in this model [7].

There is a growing literature on statistical methods for evaluating treatment selection markers. A number of papers have focused on descriptive analysis, specifically on modeling the treatment effect as a function of marker [8–12]. In general, these approaches are not well-suited to the task of comparing candidate markers. Other papers have proposed individual measures for evaluating markers [6, 7, 13–16], some of which we adopt as part of our analytic approach as described below. Still others have focused on the specific problem of optimizing marker combinations for treatment selection [17–22]. A complete framework for marker evaluation, on a par with those developed for evaluating markers for classification [23, 24] or risk prediction [25], is still forthcoming.

In this paper, we lay out a comprehensive approach to evaluating markers for treatment selection. We propose tools for descriptive analysis and summary measures for formal evaluation and comparison of markers. The descriptives are conceptually similar to those of Bonetti and Gelber [8], Royston and Sauerbrei [9], and Cai et al. [10], but we scale markers to the percentile scale to facilitate making comparisons. Our preferred global summary measure is the same as or closely related to that advocated by Song and Pepe [13], Brinkley et al. [16], Janes et al. [6], Gunter et al. [19], Qian and Murphy [20], McKeague and Qian [21], and Zhang et al. [22], a component of which was described by Zhao et al. [12] and Baker and Kramer [14]. We also propose several novel measures of treatment selection performance, motivated by existing methodology for evaluating markers for predicting outcome under a single treatment, i.e. for risk prediction. We develop methods for estimation and inference that apply to data from a randomized controlled trial comparing two treatment options where the marker is measured at baseline on all or a stratified case–control sample of trial participants. For illustration, we consider the breast cancer treatment context where candidate markers are evaluated for their utility in identifying a subset of women who do not benefit from adjuvant chemotherapy. Appendices include the results of a small-scale simulation study that evaluates the performance of the methods in finite samples and a description of the R package we have written that implements these methods.

## 2 Setting and notation

Suppose that the task is to decide between two treatment options, referred to as “treatment” (T=1) and “no treatment” (T=0). The clinical outcome of interest, D, is a binary indicator of a bad event within a specific time-frame following treatment provision; we refer to this outcome as an “adverse event” or “event”. The outcome D is thought to capture all potential impacts of treatment, so that any decrease in the rate of events justifies treatment; a generalization is discussed in Section 6.1. To achieve this, D may be chosen to represent a composite outcome such as an indicator of treatment-associated toxicity or death. We assume that the marginal treatment effect ρ0 − ρ1 ≡ P(D=1 | T=0) − P(D=1 | T=1) is positive, so that the default approach is to treat all subjects. The question is whether a marker, Y, if measured prior to treatment provision, is useful for identifying a subset of subjects who can avoid treatment. Note that the scenario where the marginal treatment effect is negative (or zero) and Y identifies a subset of subjects who benefit from treatment can be handled by simply reversing the treatment labels.

We focus on the ideal setting for evaluating treatment efficacy, an RCT comparing T=1 with T=0. By necessity, this must be a relatively large trial; it is well-known that large sample sizes are generally needed to detect statistical interactions. We assume to begin that Y is continuous and measured at baseline on all trial participants. We generalize our methods to case–control sampling from within an RCT in Section 6.2.

## 3 Motivating context

We illustrate our methods in the breast cancer treatment context. Women diagnosed with estrogen-receptor-positive and node-positive breast cancer are typically treated with both hormone therapy (e.g. tamoxifen) and adjuvant chemotherapy following surgery. This is despite the fact that it is generally well-accepted in the clinical community that only a subset of these women actually benefit from the adjuvant chemotherapy, and the remaining women suffer its toxic side effects, not to mention the burden and cost of unnecessary treatment [26]. A high public health priority is to identify biomarkers that can be used to predict which women are and are not likely to benefit from the adjuvant chemotherapy [27]. The Oncotype DX recurrence score is an example of a biomarker that is currently being used in clinical practice for this purpose. This marker is a proprietary combination of 21 genes whose expression levels are measured in the tumor tissue obtained at surgery [28–30]. The marker has been shown to have value for identifying a subset of women who are unlikely to benefit from chemotherapy [28–30].

To illustrate our methods, we simulated a marker, Y1, with the same performance as Oncotype DX in the SWOG S8814 trial, which evaluated adjuvant chemotherapy (cyclophosphamide, doxorubicin, and fluorouracil) given before tamoxifen for treating post-menopausal women with estrogen-receptor-positive, node-positive breast cancer [28, 31]. We also simulated another marker, Y2, which we will demonstrate is a much stronger marker. Both markers Y1 and Y2 are measured at baseline for 1,000 participants randomized with equal probability to tamoxifen alone (T=0) or tamoxifen plus chemotherapy (T=1). The outcome, D, is breast cancer recurrence or death within 5 years of randomization, and the marginal treatment effect is ρ0 − ρ1 = 0.24 − 0.21 = 0.03, as seen in S8814. Marker Y1 is simulated to mimic the Oncotype DX distribution; Y1 is normally distributed with mean 4.8 and standard deviation 1.8. Marker Y2 is standard normal. Each marker is related to D via a linear logistic model, logit P(D=1 | T, Y) = β0 + β1 T + β2 Y + β3 Y T, where for Y1 the model coefficients are chosen to mimic the performance of the Oncotype DX recurrence score [28]. Methods for simulating the data are described in the Appendix.
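As a concrete sketch of this data-generating setup, the snippet below simulates a trial of the same shape. The coefficient values passed to the hypothetical `risk` helper are illustrative placeholders, not the values calibrated to Oncotype DX in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000

T = rng.binomial(1, 0.5, N)       # randomized 1:1 to tamoxifen +/- chemotherapy
Y1 = rng.normal(4.8, 1.8, N)      # Oncotype-DX-like marker distribution
Y2 = rng.normal(0.0, 1.0, N)      # strong marker, standard normal

def risk(T, Y, b0, b1, b2, b3):
    """Linear logistic risk model: logit P(D=1|T,Y) = b0 + b1*T + b2*Y + b3*Y*T."""
    lp = b0 + b1 * T + b2 * Y + b3 * Y * T
    return 1.0 / (1.0 + np.exp(-lp))

# Illustrative (made-up) coefficients, applied here to the strong marker Y2.
p = risk(T, Y2, b0=-1.6, b1=-0.4, b2=0.2, b3=1.0)
D = rng.binomial(1, p)            # 5-year recurrence or death indicator

rho0 = D[T == 0].mean()           # observed event rate without chemotherapy
rho1 = D[T == 1].mean()           # observed event rate with chemotherapy
```

With the paper's (unpublished here) coefficients, the analogous simulation reproduces ρ0 ≈ 0.24 and ρ1 ≈ 0.21.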

## 4 Methods for evaluating individual markers

### 4.1 Treatment rule

Given that the task is to decide between treatment and no treatment for each individual subject, it is sensible and common to define a binary rule for assigning treatment on the basis of marker value. Let Δ(Y) = P(D=1 | T=0, Y) − P(D=1 | T=1, Y) denote the absolute treatment effect given marker value Y. The rule

do not treat if Δ(Y) < 0

can be shown to be optimal in the sense that it minimizes the population event rate [16, 22, 32]. Some of the marker performance measures we consider evaluate the properties of this rule; other performance measures do not depend on specification of a treatment rule. We refer to subjects with Δ(Y) < 0 as “marker-negatives” and subjects with Δ(Y) > 0 as “marker-positives”. More general treatment rules are considered in Section 6.1.
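A minimal sketch of this rule, with hypothetical risk values standing in for the modeled risks under each treatment:

```python
def treatment_effect(risk0, risk1):
    """Delta(Y) = P(D=1|T=0,Y) - P(D=1|T=1,Y): risk reduction due to treatment."""
    return risk0 - risk1

def recommend_treatment(risk0, risk1):
    """Treat unless Delta(Y) < 0, i.e. unless the subject is marker-negative."""
    return treatment_effect(risk0, risk1) >= 0

# A marker-positive patient: treatment lowers risk from 0.30 to 0.20.
treat_pos = recommend_treatment(0.30, 0.20)   # True
# A marker-negative patient: treatment raises risk from 0.10 to 0.15.
treat_neg = recommend_treatment(0.10, 0.15)   # False
```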

### 4.2 Descriptives

For descriptive analysis, it is useful to display the distribution of risk of the event as a function of the marker under each treatment. We plot “risk curves” P(D=1 | T=1, Y) and P(D=1 | T=0, Y) vs the marker percentile F(Y), where F is the cumulative distribution function (CDF) of Y [6]. Figure 1 shows the risk curves for the Oncotype-DX-like marker, Y1, and the much better marker, Y2. From these one can visually assess the variability in response on each treatment as a function of marker value. One can also determine the proportion of subjects with negative treatment effects who can avoid chemotherapy: 46% for Y1 vs 38% for Y2.

### Figure 1

Risk of 5-year breast cancer recurrence or death as a function of treatment assignment and marker percentile, for Y1, the Oncotype-DX-like marker (top), and the strong marker, Y2 (bottom). Horizontal pointwise 95% confidence intervals (CIs) are shown. Forty-six percent of women have negative treatment effects according to Y1 vs 38% with Y2; these women can avoid adjuvant chemotherapy

Another informative display is the distribution of the treatment effect, as summarized by Δ(Y) vs FΔ(Δ(Y)), where FΔ is the CDF of Δ(Y) [7]. The example shown in Figure 2 reveals that Y2 has much greater variation in marker-specific treatment effect than does Y1. For Y2 a greater proportion of marker-specific treatment effects are extreme, whereas for Y1 the range is smaller and most treatment effects are near the average of ρ0 − ρ1 = 0.03.

### Figure 2

Distribution of the treatment effect, as measured by the difference in the 5-year breast cancer recurrence or death rate without vs with treatment, Δ(Y) = P(D=1 | T=0, Y) − P(D=1 | T=1, Y), for the Oncotype-DX-like marker (Y1) and the strong marker (Y2). Horizontal pointwise 95% CIs are shown

### 4.3 Summary measures

The following are useful measures for summarizing marker performance that depend on specification of the treatment rule:

1. Average benefit of no treatment among marker-negatives,

   Bneg = P(D=1 | T=1, Δ(Y)<0) − P(D=1 | T=0, Δ(Y)<0) = E(−Δ(Y) | Δ(Y)<0)

2. Average benefit of treatment among marker-positives,

   Bpos = P(D=1 | T=0, Δ(Y)>0) − P(D=1 | T=1, Δ(Y)>0) = E(Δ(Y) | Δ(Y)>0)

3. Proportion marker-negative, Pneg = P(Δ(Y)<0)

4. Decrease in population event rate under marker-based treatment,

   Θ = P(D=1 | T=1) − [P(D=1 | T=1, Δ(Y)>0) · P(Δ(Y)>0) + P(D=1 | T=0, Δ(Y)<0) · P(Δ(Y)<0)]
     = [P(D=1 | T=1, Δ(Y)<0) − P(D=1 | T=0, Δ(Y)<0)] · P(Δ(Y)<0)
     = Bneg · Pneg,

where we define P(D=1 | T, Δ(Y)<0) = 0 if P(Δ(Y)<0) = 0. The measure Θ, or a variation on it, has been advocated by many as a global measure of marker performance [2, 6, 13, 16, 21, 22]. Θ varies between 0 and ρ1. The minimum value of 0 corresponds to an entirely useless marker with constant marker-specific treatment effect, Δ(Y) = ρ0 − ρ1 > 0 for all Y. For such a marker, Θ = ρ1 − [ρ1 · 1 + 0 · 0] = 0. The maximum value of Θ is achieved when P(D=1 | T=1, Δ(Y)>0) = P(D=1 | T=0, Δ(Y)<0) = 0, so that Θ = ρ1 − [0 · P(Δ(Y)>0) + 0 · P(Δ(Y)<0)] = ρ1.

The constituents of Θ, namely Bneg and Pneg, are helpful for dissecting the impact of the marker. The measures Bneg and Bpos describe the average benefit of the treatment policies recommended to marker-negatives and marker-positives, respectively. Bneg itself has been advocated by some as a measure of marker performance [12, 14], but it clearly cannot be interpreted in isolation: it can be made arbitrarily large by making the marker-negative subgroup more extreme, so the size of that subgroup (Pneg) is also relevant.
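To make the decomposition Θ = Bneg · Pneg concrete, the toy calculation below evaluates both pieces for a hypothetical marker whose treatment effect Δ(Y) is normally distributed (the mean and standard deviation are illustrative values, not estimates from the paper), and checks the Monte Carlo answer against the closed form available for a normal Δ(Y).

```python
import numpy as np
from scipy.stats import norm

# Hypothetical treatment-effect distribution: Delta(Y) ~ N(mu, sigma^2),
# with mu = rho0 - rho1 > 0. Values are illustrative only.
mu, sigma = 0.03, 0.10

rng = np.random.default_rng(7)
delta = rng.normal(mu, sigma, 1_000_000)   # Monte Carlo draws of Delta(Y)

p_neg = (delta < 0).mean()                 # Pneg: proportion marker-negative
b_neg = -delta[delta < 0].mean()           # Bneg: avg benefit of no treatment
theta_mc = b_neg * p_neg                   # Theta = Bneg * Pneg

# Closed form: Theta = -E[Delta; Delta < 0]
#            = sigma * phi(mu/sigma) - mu * Phi(-mu/sigma) for a normal Delta.
theta_exact = sigma * norm.pdf(mu / sigma) - mu * norm.cdf(-mu / sigma)
```

Note that Θ is driven entirely by the spread of Δ(Y): as σ shrinks to 0 the marker becomes useless and Θ → 0.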

We also consider two marker performance measures that do not depend on specification of a treatment rule:

1. Variance in treatment effect, VΔ = Var(Δ(Y)) = ∫ (Δ(Y) − (ρ0 − ρ1))² dFΔ

2. Total gain, the area between the treatment effect curve and the marginal treatment effect, TG = ∫ |Δ(Y) − (ρ0 − ρ1)| dFΔ.

The VΔ and TG measures suffer from a lack of clinical interpretability, but have the advantages of being independent of the treatment rule and of potentially forming the basis for more efficient comparisons of markers. They extend measures used to evaluate markers for predicting risk of the event under a single treatment, applied here to the treatment effect rather than to the risk itself.

Table 1 contains estimates of these performance measures for markers Y1 and Y2 in the breast cancer example. Focusing on Y2, we see that the population impact of Y2-based treatment is a 10% reduction in the 5-year recurrence or death rate; this is a consequence of 38% of subjects avoiding adjuvant chemotherapy and a 26% reduction in the event rate in this subgroup due to avoiding chemotherapy. Among marker-positives, chemotherapy decreases the event rate by 21% on average. Less interpretable, but somewhat useful for global marker comparisons, are the values of VΔ = 0.08 and TG = 0.22.

### Table 1

Estimates of various measures of marker performance for the Oncotype-DX-like marker (Y1) and the strong marker (Y2) in the breast cancer example

| Measure | Estimator | Marker Y1 estimate (95% CI) | Marker Y2 estimate (95% CI) | Y1 vs Y2: estimated diff. (95% CI) | Y1 vs Y2: P value for diff. |
|---|---|---|---|---|---|
| Θ | Θˆe | 0.013 (−0.010, 0.044) | 0.090 (0.060, 0.122) | −0.076 (−0.111, −0.042) | <0.002 |
| | Θˆm | 0.010 (0.000, 0.037) | 0.099 (0.071, 0.129) | −0.088 (−0.115, −0.061) | <0.002 |
| Bneg | Bˆnege | 0.029 (−0.106, 0.082) | 0.238 (0.170, 0.309) | −0.209 (−0.342, −0.129) | <0.002 |
| | Bˆnegm | 0.023 (0.000, 0.057) | 0.262 (0.209, 0.310) | −0.239 (−0.294, −0.178) | <0.002 |
| Bpos | Bˆpose | 0.089 (0.020, 0.157) | 0.203 (0.157, 0.263) | −0.114 (−0.193, −0.043) | <0.002 |
| | Bˆposm | 0.098 (0.035, 0.162) | 0.211 (0.176, 0.258) | −0.113 (−0.184, −0.052) | <0.002 |
| Pneg | Pˆneg | 0.461 (0.000, 0.700) | 0.377 (0.304, 0.470) | 0.084 (−0.358, 0.236) | 0.768 |
| VΔ | VˆΔ | 0.007 (0.001, 0.019) | 0.080 (0.057, 0.109) | −0.073 (−0.103, −0.046) | <0.002 |
| TG | TGˆ | 0.066 (0.024, 0.110) | 0.224 (0.187, 0.263) | −0.158 (−0.221, −0.102) | <0.002 |

### 4.4 Estimation and inference

Our proposed estimation and inference methods build on methodology developed for risk prediction [33–35]. This section gives an overview of these approaches; their performance is evaluated in a small-scale simulation study described in the Appendix. An R software package that implements these methods is also described in the Appendix.

#### 4.4.1 Estimation

Given data consisting of i.i.d. copies of (Yi, Ti, Di), i = 1, …, N, the first step in estimation is to fit a model for risk as a function of T and Y. We use a generalized linear regression risk model with an interaction between T and Y,

(1) g(P(D=1 | T, Y)) = β0 + β1 T + β2 Y + β3 Y T.

Typically we let g be the logit function because of its advantages with case–control data (see Section 6.2) and because we have found logistic regression to be remarkably robust to model mis-specification. We note that the generalized linear model (1) is flexible in that the marker Y can itself be a transformed marker value. The risk and treatment effect estimates that result from fitting this model are written Pˆ(D=1 | T=0, Y) = Riskˆ0(Y) = g⁻¹(βˆ0 + βˆ2 Y), Pˆ(D=1 | T=1, Y) = Riskˆ1(Y) = g⁻¹(βˆ0 + βˆ1 + βˆ2 Y + βˆ3 Y), and Δˆ(Y) = Riskˆ0(Y) − Riskˆ1(Y). We estimate the marker and treatment effect distributions empirically and denote these by Fˆ and FˆΔ. The estimated risk curves are Riskˆ0(Y) and Riskˆ1(Y) vs Fˆ(Y). Pointwise α-level horizontal CIs inform about the variability in the proportion of participants at or below a given risk level; we obtain these using the percentile bootstrap method. The estimated treatment effect curve is Δˆ(Y) vs FˆΔ. Here pointwise horizontal CIs capture the variability in the estimated proportion of individuals with treatment effects below a certain value.
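The fitting step can be sketched as follows: a minimal Newton–Raphson (IRLS) fit of the logistic risk model (1), applied to data simulated from known, illustrative coefficients so the recovered estimates can be sanity-checked. This is a stand-in sketch, not the R package's fitting routine.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 4000
T = rng.binomial(1, 0.5, N)
Y = rng.normal(0.0, 1.0, N)
beta_true = np.array([-1.5, -0.3, 0.2, 1.0])     # illustrative b0, b1, b2, b3

X = np.column_stack([np.ones(N), T, Y, Y * T])   # design matrix for model (1)
p_true = 1.0 / (1.0 + np.exp(-X @ beta_true))
D = rng.binomial(1, p_true)

# Newton-Raphson / IRLS for logistic regression.
beta = np.zeros(4)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    W = p * (1.0 - p)                            # observation weights
    grad = X.T @ (D - p)                         # score vector
    hess = X.T @ (X * W[:, None])                # observed information
    beta = beta + np.linalg.solve(hess, grad)

def risk_hat(t, y):
    """Fitted risk g^{-1}(b0 + b1*t + b2*y + b3*y*t) with g the logit link."""
    lp = beta[0] + beta[1] * t + beta[2] * y + beta[3] * y * t
    return 1.0 / (1.0 + np.exp(-lp))

delta_hat = risk_hat(0, Y) - risk_hat(1, Y)      # Delta_hat(Y) for each subject
```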

For the summary measures that depend on treatment rule, we consider both “empirical” and “model-based” estimators. An empirical estimator uses the estimated risk model (1) to classify individuals as marker-positive or marker-negative, and the performance of this rule is estimated empirically. For a model-based estimator, the risk model is used both to classify each individual and to estimate the performance of the classification rule. For example, the empirical estimator of Bpos estimates the average treatment effect empirically given Δˆ(Y)>0, and the model-based estimator averages Δˆ(Y) for this subgroup. The empirical estimators are less efficient but less reliant on risk model assumptions than model-based estimators (see Appendix). The estimators are listed below, where e and m superscripts indicate empirical and model-based estimators, Pˆ denotes an empirical probability estimate and Eˆ denotes an empirical mean:

Bˆnege = Pˆ(D=1 | T=1, Δˆ(Y)<0) − Pˆ(D=1 | T=0, Δˆ(Y)<0)
Bˆnegm = Eˆ(−Δˆ(Y) | Δˆ(Y)<0)
Bˆpose = Pˆ(D=1 | T=0, Δˆ(Y)>0) − Pˆ(D=1 | T=1, Δˆ(Y)>0)
Bˆposm = Eˆ(Δˆ(Y) | Δˆ(Y)>0)
Pˆneg = Pˆ(Δˆ(Y)<0)
Θˆe = Bˆnege · Pˆneg
Θˆm = Bˆnegm · Pˆneg = −∫ Δˆ(Y) I(Δˆ(Y)<0) dFˆΔ
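The distinction between the empirical and model-based estimators of Θ can be sketched as follows. To isolate the estimator definitions from the fitting step, the risk model is treated as known here (an assumption made purely for illustration; in practice Δˆ(Y) comes from the fitted model), and the coefficients are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(11)
N = 100_000
T = rng.binomial(1, 0.5, N)
Y = rng.normal(0.0, 1.0, N)

def risk(t, y):
    """Known (illustrative) logistic risk model playing the role of the fit."""
    lp = -1.5 - 0.3 * t + 0.2 * y + 1.0 * y * t
    return 1.0 / (1.0 + np.exp(-lp))

D = rng.binomial(1, risk(T, Y))
delta = risk(0, Y) - risk(1, Y)     # Delta(Y), standing in for Delta_hat(Y)
neg = delta < 0                     # marker-negatives: recommend no treatment

p_neg = neg.mean()                  # P_hat_neg

# Empirical: observed event rates among marker-negatives, by randomized arm.
b_neg_e = D[(T == 1) & neg].mean() - D[(T == 0) & neg].mean()
theta_e = b_neg_e * p_neg           # Theta_hat^e

# Model-based: average the modeled treatment effect over marker-negatives.
b_neg_m = -delta[neg].mean()
theta_m = b_neg_m * p_neg           # Theta_hat^m
```

With a correctly specified model and a large sample, the two estimates agree closely; the empirical version trades efficiency for robustness, as noted above.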

The treatment-rule-independent summary measures are estimated using the following model-based estimators:

VˆΔ = ∫ (Δˆ(Y) − (ρˆ0 − ρˆ1))² dFˆΔ
TGˆ = ∫ |Δˆ(Y) − (ρˆ0 − ρˆ1)| dFˆΔ,

where ρˆ0 and ρˆ1 are empirical estimates of P(D=1T=0) and P(D=1T=1), respectively. CIs for each summary measure are obtained using the percentile bootstrap.
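A minimal percentile-bootstrap sketch for Θˆe is given below, with the simplifying assumption that the marker-negative rule is held fixed across bootstrap samples; in the methods above, the risk model (and hence the rule) would be re-estimated in each bootstrap sample. All data and thresholds are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 2000
T = rng.binomial(1, 0.5, N)
Y = rng.normal(0.0, 1.0, N)
p = 1.0 / (1.0 + np.exp(-(-1.5 - 0.3 * T + 0.2 * Y + 1.0 * Y * T)))
D = rng.binomial(1, p)
neg = Y > 0.3                        # fixed marker-negative rule (illustrative)

def theta_emp(D, T, neg):
    """Empirical Theta: (event rate diff among marker-negatives) * P(neg)."""
    if neg.sum() == 0:
        return 0.0
    b_neg = D[(T == 1) & neg].mean() - D[(T == 0) & neg].mean()
    return b_neg * neg.mean()

est = theta_emp(D, T, neg)

boot = []
for _ in range(500):
    idx = rng.integers(0, N, N)      # resample subjects with replacement
    boot.append(theta_emp(D[idx], T[idx], neg[idx]))

ci_lo, ci_hi = np.percentile(boot, [2.5, 97.5])   # percentile-bootstrap 95% CI
```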

#### 4.4.2 Hypothesis testing

Testing whether a marker has any performance for treatment selection is of interest for two reasons. First, this is a logical first step in marker evaluation. Second, the performance measures described above may have poor statistical properties at and near the null of no marker performance. This is similar to problems that have been identified with measures of risk prediction model performance [36–41]; Section 7 includes further discussion of this point. Therefore, we advocate a simple pre-testing approach, whereby the marker performance measures are only estimated if the null hypothesis H0: Θ = 0 corresponding to no marker performance is rejected.

For an unbounded marker, under risk model (1), H0 is equivalent to H01: β3 = 0, where β3 is the interaction coefficient in the risk model. Therefore, H0 can be tested using a (most powerful) likelihood ratio (LR) test for β3. However, if Y is bounded, H01 implies H0 but the reverse does not hold; it is possible that β3 ≠ 0 but Θ = 0. Therefore we perform two hypothesis tests: H01: β3 = 0 is tested using an LR test, and H02: −β1/β3 ∉ (Ymin, Ymax) using a Wald or percentile-bootstrap-based test, where −β1/β3 is the marker value at which Δ(Y) = 0 under model (1), and Ymin and Ymax are the known lower and upper limits for Y. That is, testing H02 assesses whether the marker positivity threshold, the value of Y such that Δ(Y) = 0, lies within the range of possible marker values. We declare H0 to be rejected if both H01 and H02 are rejected. Using two-sided α-level tests for H01 and H02 controls the overall type-I error rate at α, since P(reject H01 and reject H02 | Θ=0) ≤ min(P(reject H01 | Θ=0), P(reject H02 | Θ=0)) ≤ α. Other approaches have been proposed for testing the null of no marker performance [42, 43]; optimizing this test is not our focus.

For the unbounded markers Y1 and Y2 in our breast cancer example, H0 is rejected with p=0.005 and p<0.0001, respectively.

### 4.5 Calibration assessment

Assessing model calibration is a fundamental step in marker evaluation. We rely on standard methods for visualizing and testing goodness of fit for the risk model (1) and extend these methods to assess calibration of the treatment effect model.

Since patients are provided risk estimates under both treatment options, we first assess the fit of the risk model separately in the two treatment groups. Specifically, we define a well-calibrated model to be one for which P(D=1 | T=0, Risk0(Y)=r) = r and P(D=1 | T=1, Risk1(Y)=r) = r. To assess this, we split each treatment group t = 0, 1 into G equally-sized groups where the observations in each group have similar Riskˆt(Y). Commonly G = 10 and the groups are based on quantiles of Riskˆt(Y). In each group, we calculate the average predicted risk, Risk¯tg(Y), and the observed risk, Pˆ(D=1 | T=t, G=g). Following Huang and Pepe [33], we plot the distribution of Riskˆ0(Y) and Riskˆ1(Y), overlaying the G observed risk values on the plot, as shown in Figure 3.

### Figure 3

Plots assessing calibration of the risk and treatment effect models, for the Oncotype-DX-like marker (left) and the strong marker (right)

To formally assess model calibration, a traditional Hosmer–Lemeshow goodness of fit test [44] can be applied separately to the two treatment groups. Specifically, for group T=t the test statistic

HLt = Σ_{g=1}^{G} Ntg · [Pˆ(D=1 | T=t, G=g) − Risk¯tg(Y)]² / {Risk¯tg(Y) · [1 − Risk¯tg(Y)]},

where Ntg is the number of participants in the gth group for T=t, is compared to a χ² distribution with G − 2 degrees of freedom.

Another aspect of calibration is the extent to which the treatment effect model fits well. We want to ensure that P(D=1 | T=0, Δ(Y)=δ) − P(D=1 | T=1, Δ(Y)=δ) = δ. Following the approach above, we split the data into G evenly-sized groups based on Δˆ(Y) and calculate the average predicted treatment effect, Δ¯g(Y), and the observed treatment effect, Pˆ(D=1 | T=0, G=g) − Pˆ(D=1 | T=1, G=g), in each group. We plot the treatment effect curve and overlay the G observed treatment effect values, as shown in Figure 3. Based on Figure 3, we see that the risk and treatment effect models for Y1 and Y2 in the breast cancer example are well-calibrated; the Hosmer–Lemeshow test statistics are 4.5 (p=0.81) and 8.9 (p=0.35) given T=0, and 5.0 (p=0.76) and 2.9 (p=0.94) given T=1. The Risk0(Y2) and Δ(Y1) curves suggest some evidence of poor calibration, which in our simulated data setting is attributable to sampling variability in the observed risks, each calculated from 50 observations.
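The grouping and test statistic can be sketched as follows for a single treatment arm, with predicted risks taken from the same hypothetical model that generates the outcomes (so the model is correctly calibrated by construction; all coefficients are illustrative).

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(9)
n = 5000
Y = rng.normal(0.0, 1.0, n)
pred = 1.0 / (1.0 + np.exp(-(-1.5 + 0.8 * Y)))   # predicted risks in arm T=t
D = rng.binomial(1, pred)                        # outcomes from the same model

G = 10
order = np.argsort(pred)
groups = np.array_split(order, G)                # G equally-sized risk groups

hl = 0.0
for g in groups:
    rbar = pred[g].mean()                        # average predicted risk
    obs = D[g].mean()                            # observed event rate
    hl += len(g) * (obs - rbar) ** 2 / (rbar * (1.0 - rbar))

p_value = chi2.sf(hl, df=G - 2)                  # compare to chi^2_{G-2}
```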

## 5 Comparing markers

The descriptives and summary measures proposed herein form the basis for comparing candidate markers. We assume that the two markers, Y1 and Y2, are measured on the same subjects, i.e. that the data are paired. With unpaired data, the analyses described above can be applied to each individual dataset and inferences can be drawn easily given that the estimated summary measures are statistically independent.

For drawing inference about the relative performance of two markers given paired data, CIs for the differences in performance measures and hypothesis tests of whether these differ from zero are informative. We fit separate models for P(D=1T,Y1) and P(D=1T,Y2), use these to estimate performance measures for Y1 and Y2, respectively, and bootstrap the differences in estimated performance measures. While global measures of marker performance such as Θ, VΔ, and TG are appropriate as the basis for formal marker comparisons, differences in the other summary measures inform about the nature of the difference between markers. We advocate Θ as the primary measure on which to base marker comparisons, given its clear clinical relevance and interpretation.

The results of the comparative analysis for the breast cancer example are shown in Table 1. We can see clearly that Y2 has uniformly better performance than Y1, with an estimated 10% vs 1% reduction in the 5-year recurrence or death rate. Despite the fact that there are estimated to be fewer marker-negative subjects based on Y2 (38% vs 46%), there is a much greater estimated benefit of no chemotherapy among Y2-marker-negatives (26% vs 2% reduction in the 5-year recurrence or death rate). In general the variation in treatment effect is larger for Y2.

## 6 Extensions

### 6.1 General treatment rules

In some settings there may be additional consequences of treatment that are not captured in the outcome, for example treatment-associated toxicities. This means that a treatment effect somewhat above zero may still warrant no treatment because it is offset by the other consequences of treatment. In these settings the optimal treatment rule can be shown to be

do not treat if Δ(Y) < δ,

where δ>0 is equal to the burden of treatment relative to that of the adverse event [15, 32]. The performance measures described above generalize naturally to this treatment rule:

Bneg(δ) = P(D=1 | T=1, Δ(Y)<δ) − P(D=1 | T=0, Δ(Y)<δ)
Bpos(δ) = P(D=1 | T=0, Δ(Y)>δ) − P(D=1 | T=1, Δ(Y)>δ)
Pneg(δ) = P(Δ(Y)<δ)
Θ(δ) = P(D=1 | T=1) − [P(D=1 | T=1, Δ(Y)>δ) · P(Δ(Y)>δ) + P(D=1 | T=0, Δ(Y)<δ) · P(Δ(Y)<δ)]
     = Bneg(δ) · Pneg(δ).

Note that the VΔ and TG measures do not require specification of a treatment rule. Further generalization to the setting where the burden of treatment and/or of the adverse event varies between individuals, perhaps as a function of Y, is described by Janes et al. [32].

### 6.2 Case–control sampling

The methods described above apply to the setting where the marker is measured at baseline on all RCT participants. However when the outcome D is rare, case–control sampling from within the RCT is a well-known efficient alternative that recovers much of the information contained in the entire trial population. This section extends the methods to the setting where the data consist of a case–control sample from the RCT, or a case–control sample stratified on treatment assignment, T. We consider case–control designs that sample all or a fixed proportion of the cases in the RCT, as well as a number of controls (perhaps stratified on T) that is a fixed multiple of the number of cases sampled.

Consider first unstratified case–control sampling. Suppose ND cases and NDˉ controls occur in the trial “cohort” (N = ND + NDˉ). The case–control sample consists of nD = f·ND cases and nDˉ = k·nD controls, where f ∈ (0, 1] and the control:case ratio k is an integer. Commonly all the cases are sampled (f = 1), and 1–5 controls are sampled per case. Alternatively, f may be set to a value less than 1 for a common event or when budget concerns or sample availability limit the number of cases that can be sampled; in these instances we assume that selection into the case–control sample is completely random conditional on D = 1.

Let S = 1 be an indicator of selection into the case–control sample. Given the case–control data, the task is to correct the estimates of P(D=1 | T, Y, S=1) and P(Δ(Y)<δ | S=1) for the case–control sampling. Suppose that an estimate of P(D=1) is available from the cohort. Using Bayes’ Theorem and the case–control sampling assumption that P(S=1 | T, Y, D) = P(S=1 | D), we obtain the following identity, which is used to correct the estimates of P(D=1 | T, Y, S=1) for the case–control sampling:

logit P(D=1 | T, Y, S=1) = log { P(S=1 | T, Y, D=1) · P(D=1 | T, Y) / [P(S=1 | T, Y, D=0) · P(D=0 | T, Y)] }
= log { P(S=1 | D=1) / P(S=1 | D=0) } + logit P(D=1 | T, Y)
= log { P(D=1 | S=1) · P(S=1) · P(D=0) / [P(D=0 | S=1) · P(S=1) · P(D=1)] } + logit P(D=1 | T, Y)
= logit P(D=1 | S=1) − logit P(D=1) + logit P(D=1 | T, Y).

This result was originally cited by Prentice and Pyke [45] as the rationale for using logistic regression to model risk with case–control data. Note that P(D=1 | T, Y, S=1) can be estimated using the logistic regression risk model (1) fit to the case–control data, P(D=1) can be estimated from the trial cohort, and P(D=1 | S=1) can be estimated from the case–control data.
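The practical content of the identity is that only the intercept changes between the case–control and cohort scales. A minimal sketch, with illustrative numbers (the helper names are hypothetical, not from the R package):

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def cohort_risk(lp_cc, p_case_cc, p_case_cohort):
    """Map a case-control linear predictor to the cohort risk scale.

    lp_cc:         logit P(D=1|T,Y,S=1) from the model fit to case-control data
    p_case_cc:     P(D=1|S=1), the case fraction in the case-control sample
    p_case_cohort: P(D=1), the event rate in the trial cohort
    """
    # Rearranging the identity: logit P(D=1|T,Y) = lp_cc - logit P(D=1|S=1)
    #                                              + logit P(D=1)
    lp = lp_cc - logit(p_case_cc) + logit(p_case_cohort)
    return 1.0 / (1.0 + math.exp(-lp))

# 1:1 case-control sampling (50% cases) from a cohort with a 10% event rate:
# a subject at 50% risk in the case-control sample has cohort risk 0.10.
r = cohort_risk(lp_cc=0.0, p_case_cc=0.5, p_case_cohort=0.10)
```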

The distribution of Δ(Y), or equivalently of Y itself, can be estimated in the cases and controls in the case–control data and corrected to the cohort distribution via

FˆΔ(Y) = FˆΔ,Dˉcc(Y) · Pˆ(D=0) + FˆΔ,Dcc(Y) · Pˆ(D=1),

where the superscript cc denotes estimation in the case–control sample and the D and Dˉ subscripts denote the case and control subsets.

We use a modified bootstrapping procedure for case–control data. To reproduce the variability in the cohort from which the case–control study is sampled, we first sample ND ∼ Bin(N, Pˆ(D=1)) and set NDˉ = N − ND. Next we sample nD = f·ND cases and nDˉ = k·nD controls from the subjects in the case–control study. The estimation procedure is then performed in each bootstrap sample, and quantiles of the bootstrap distribution are used to characterize uncertainty.

Case–control sampling stratified on treatment assignment can also be accommodated. Here we assume a cohort with (ND0, NDˉ0, ND1, NDˉ1) subjects in the four D × T strata. The case–control sample consists of nD0 = f0·ND0 and nD1 = f1·ND1 cases, for fixed proportions f0 and f1 in the two treatment groups, and nDˉ0 = k0·nD0 and nDˉ1 = k1·nD1 controls, for fixed control:case ratios k0 and k1. Assume that estimates of P(D=1 | T=0), P(D=1 | T=1), and P(T=1) are available from the cohort. A similar identity can be exploited for estimation:

logit P(D=1 | T, Y) = logit P(D=1 | T, Y, S=1) + logit P(D=1 | T) − logit P(D=1 | T, S=1).

The distribution of Δ(Y) combines empirical CDFs from the four D×T strata:

FˆΔ(Y) = FˆΔ,Dˉ0cc(Y) · Pˆ(D=0, T=0) + FˆΔ,Dˉ1cc(Y) · Pˆ(D=0, T=1) + FˆΔ,D0cc(Y) · Pˆ(D=1, T=0) + FˆΔ,D1cc(Y) · Pˆ(D=1, T=1).

Bootstrapping is implemented by first sampling N1 ∼ Bin(N, Pˆ(T=1)) and setting N0 = N − N1, then sampling ND0 ∼ Bin(N0, Pˆ(D=1 | T=0)) and ND1 ∼ Bin(N1, Pˆ(D=1 | T=1)), where Nt = NDt + NDˉt. The stratified case–control sample is then drawn from the case and control subsets.

For calibration assessment, we plot observed and predicted risks and treatment effects as described in Section 4.5, where all are corrected for the biased sampling as described above. We also implement a variation on the Hosmer–Lemeshow test applied to case–control data (expression (7) of Huang and Pepe [33]).

## 7 Discussion

This paper proposes a statistical framework for evaluating a candidate treatment selection marker and for comparing two markers. Estimation and inference techniques are described for the setting where the marker or markers are measured on all or a treatment-stratified case–control sample of participants in a randomized, controlled trial. An R software package was developed which implements these methods. Developing a solid framework for evaluating and comparing markers is fundamental for accomplishing more sophisticated tasks such as combining markers, accounting for covariates, and assessing the improvement in performance associated with adding a new marker to a set of established markers.

Our approach to marker evaluation also applies when the marker is discrete. In addition, it can be applied when there are multiple markers and interest lies in evaluating their combination; Δ(Y) = P(D=1 | T=0, Y) − P(D=1 | T=1, Y) is the combination of interest, and the measures described here can be used to summarize the performance of this combination.

This work extends existing approaches for evaluating markers for risk prediction [25, 35, 46]. It also unifies existing methodology for evaluating treatment selection markers. In particular, our preferred marker performance measure has been advocated by Song and Pepe [13], Brinkley et al. [16], Janes et al. [6], Gunter et al. [19], Qian and Murphy [20], McKeague and Qian [21], and Zhang et al. [22].

There are challenges with making inference about the performance measures we propose, similar to problems that have been identified with measures of risk prediction model performance including the area under the ROC curve [36, 38–40], the integrated discrimination index [37], and the net reclassification index [41]. The problems may arise when the sample size is modest and marker performance is weak. In particular, for the Oncotype DX example, given that the marker is weak and the primary study evaluating its performance by Albain et al. [28] included just 367 women, our simulation results suggest that the resultant estimate of Θ is likely an over-estimate and that the CI may be conservative. For this reason, we propose testing for non-null marker performance prior to estimating the magnitude of performance. This approach performed reasonably well in our simulation studies, but improved approaches to inference, for the treatment selection as well as the risk prediction problem, merit investigation.

The methods described here can and should be extended to accommodate other types of outcomes. Extension to continuous or count outcomes is straightforward. Specifically, after replacing P(D=1) with E(D) and using a risk model appropriate to the scale of the outcome, e.g. a linear or log-linear model for a continuous outcome, the analysis proceeds as above. The conceptual framework also applies to time-to-event outcomes, with the task being to predict risk of the outcome by a specified landmark time.
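As a sketch of the continuous-outcome extension (illustrative Python with invented coefficients, not the package's implementation), one can fit a linear model with a treatment-by-marker interaction and read off the treatment-effect curve on the mean scale:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
y = rng.normal(0.0, 1.0, n)   # hypothetical marker
t = rng.integers(0, 2, n)     # randomized treatment indicator
# Invented generating model: treatment lowers the mean outcome only for
# marker values below 0.5 (a qualitative interaction).
d = 2.0 - 1.0 * t + 0.3 * y + 2.0 * t * y + rng.normal(0.0, 1.0, n)

# Ordinary least squares for E(D | T, Y) = a0 + a1*T + a2*Y + a3*T*Y.
X = np.column_stack([np.ones(n), t, y, t * y])
a, *_ = np.linalg.lstsq(X, d, rcond=None)

# Treatment-effect curve on the mean scale:
# delta(y) = E(D | T=0, y) - E(D | T=1, y) = -(a1 + a3*y).
def delta(yval):
    return -(a[1] + a[3] * yval)

# Under the invented coefficients delta(y) = 1 - 2y, so treatment is
# beneficial (delta > 0, assuming D is an adverse outcome) for y < 0.5.
print(round(delta(0.0), 2), round(delta(1.0), 2))
```

The summary measures then average this curve over the relevant marker subgroups, exactly as in the binary-outcome case.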

The methods may also be generalized to an observational study setting, or to a setting where data on the two treatments come from two different studies – perhaps historical data are paired with a single-arm trial of T=1. However, the usual concerns about measured and unmeasured confounding in estimating the treatment effect apply. In this setting, an analyst would be well-advised to stratify on variables that are potentially associated with treatment provision and outcome. More generally, methods for adjusting for covariates in the evaluation of marker performance warrant further research.

## Appendix

### Simulation studies

This section describes a small-scale simulation study that was performed to evaluate the statistical performance of our methods. Data were simulated to reflect the breast cancer RCT example, with T an indicator of chemotherapy in addition to tamoxifen, randomly assigned to half of study participants. Rates of 5-year breast cancer recurrence or death (D) were set to 21% and 24% with and without chemotherapy, respectively, as in SWOG S8814 [28]. We explored the performance of the methods for a weak marker and a strong marker, both of which relate to D via the linear logistic model (1). The weak marker, Y1, mimics the performance of the Oncotype DX recurrence score as seen in Albain et al. [28]; Y1 is normally distributed with mean 4.8 and standard deviation 1.8. The strong marker, Y2, follows a standard normal distribution. We used the following procedure to simulate data. We denote the potential outcomes with and without treatment by D(1) and D(0). We specified models for P(D(0), D(1) | Yj), j = 1, 2, that induce marginal linear logistic models as in eq. (1). Specifically, we assumed

logit P(D(0) = 1 | Yj) = β0 + β2 Yj
logit P(D(1) = 1 | Yj) = β0 + β1 + (β2 + β3) Yj
(1a) P(D(0) = 0, D(1) = 1 | Y2) = P(D(0) = 0 | Y2) · P(D(1) = 1 | Y2)
P(D(0) = 0, D(1) = 1 | Y1) = min(k · P(D(0) = 1 | Y1) · P(D(1) = 1 | Y1), P(D(0) = 1 | Y1), P(D(1) = 1 | Y1)),

where k ensures that

∫ P(D(0) = 0, D(1) = 1 | Y1) dF(Y1) = ∫ P(D(0) = 0, D(1) = 1 | Y2) dF(Y2)

and

(2a) Y1 ∼ N(4.8, 1.8) and Y2 ∼ N(0, 1).

These models fully specify the joint distribution of (D(0), D(1)) given Yj, j = 1, 2. Note that our measures and estimators depend only on the marginal distributions P(D(0) = 1 | Yj) and P(D(1) = 1 | Yj), j = 1, 2, and hence are invariant to the choice of joint distribution. Next, we used Bayes’ theorem to calculate P(Yj | D(0), D(1)), j = 1, 2. We first simulated pairs of potential outcomes (D(0), D(1)) from the multinomial distribution induced by eqs (1a) and (2a). Then we simulated independent Y1 and Y2 from P(Yj | D(0), D(1)), j = 1, 2. Treatment assignment, T, was generated independent of potential outcomes and marker values. The observed outcome D was defined as D = D(1)T + D(0)(1 − T). True values for the marker performance measures were calculated as the average parameter estimates, using the true risk function (1), across 10 very large datasets (N = 20,000,000).
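For the strong marker, where eq. (1a) makes the potential outcomes conditionally independent given Y2, an equivalent forward simulation (draw Y2 first, then the potential outcomes given Y2) can be sketched as follows. This Python sketch is purely illustrative; the logistic coefficients are borrowed from the fitted Y2 model in the Software section, not the paper's true simulation values.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Illustrative coefficients (from the fitted Y2 model shown in the Software
# section): logit P(D(t) = 1 | Y2) = b0 + b1*t + (b2 + b3*t)*Y2.
b0, b1, b2, b3 = -1.2108, -0.5169, 0.5779, -2.0455

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

y = rng.normal(0.0, 1.0, n)             # Y2 ~ N(0, 1)
p0 = expit(b0 + b2 * y)                 # P(D(0) = 1 | Y2)
p1 = expit(b0 + b1 + (b2 + b3) * y)     # P(D(1) = 1 | Y2)

# Conditional independence of the potential outcomes given Y2 (eq. (1a))
# lets us draw D(0) and D(1) separately.
d0 = rng.random(n) < p0
d1 = rng.random(n) < p1

t = rng.integers(0, 2, n)               # randomized treatment, P(T = 1) = 1/2
d = np.where(t == 1, d1, d0)            # observed outcome D = D(1)T + D(0)(1 - T)

print(round(d[t == 0].mean(), 3), round(d[t == 1].mean(), 3))
```

The observed arm-specific event rates then approximate the marginal rates implied by the logistic model averaged over the marker distribution.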

We explore the bias and variance of the parameter estimates and the false-coverage probabilities of the bootstrap percentile CIs for sample sizes ranging from N = 250 to N = 5,000. A total of 5,000 simulations were performed for each sample size. To explore the impact of our proposed pre-testing strategy, whereby the parameters are not estimated if H0: Θ = 0 is not rejected, we evaluate the parameter estimates and CIs marginally and conditionally. Marginal means of parameter estimates include all estimates regardless of H0 rejection, and conditional means are computed only among datasets where H0 is rejected. The following probabilities of false coverage of nominal 95% CIs are evaluated:

1. the marginal probability of false coverage, where CIs are calculated regardless of H0 rejection;

2. the conditional probability of false coverage, computed only among datasets where H0 is rejected; and

3. the probability of rejecting H0 and the CI not covering the true value, termed the “false conclusion probability” [47].
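The resampling machinery behind these evaluations can be sketched generically as follows (illustrative Python; the statistic here is a simple mean standing in for Θ, and the pre-test is a crude bootstrap z-test rather than the interaction-based test used in the paper):

```python
import numpy as np

rng = np.random.default_rng(4)

def percentile_ci_with_pretest(data, stat, B=500, alpha=0.05, rng=rng):
    """Bootstrap percentile CI, reported only if a pre-test rejects stat = 0.

    The pre-test is a rough z-test (estimate / bootstrap SE), standing in
    for the paper's test of H0: Theta = 0.
    """
    n = len(data)
    est = stat(data)
    # Resample rows with replacement and recompute the statistic B times.
    boots = np.array([stat(data[rng.integers(0, n, n)]) for _ in range(B)])
    z = est / boots.std(ddof=1)
    if abs(z) < 1.96:            # fail to reject H0: do not report an estimate
        return None
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return est, (lo, hi)

# Toy data whose "performance measure" (the mean) is truly 0.5.
data = rng.normal(0.5, 1.0, 2_000)
result = percentile_ci_with_pretest(data, np.mean)
est, (lo, hi) = result
print(round(est, 2), round(lo, 2), round(hi, 2))
```

Conditioning on rejection, as in the simulation summaries below, amounts to discarding the `None` outcomes before averaging estimates or tallying CI coverage.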

#### Strong marker

The results for the strong marker are contained in Tables A.1–A.3. For this marker, the estimates and CIs have uniformly good performance. Marginal bias is small and false coverage is near nominal; the pre-testing has no impact because power to reject H0 is 100% for this marker. There is minimal increase in variance from using empirical rather than model-based estimators.

### Table A.1

Mean parameter estimates for the strong marker

True values are given in parentheses: Θ (0.110), Pneg (0.379), Bneg (0.291), Bpos (0.228), VΔ (0.094), TG (0.245). “Mod.” and “Emp.” denote model-based and empirical estimators.

| | N | Prob. reject H0 | Θ Mod. | Θ Emp. | Pneg | Bneg Mod. | Bneg Emp. | Bpos Mod. | Bpos Emp. | VΔ | TG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Marginal | 250 | 1 | 0.113 | 0.112 | 0.380 | 0.295 | 0.293 | 0.230 | 0.229 | 0.097 | 0.246 |
| | 500 | 1 | 0.112 | 0.112 | 0.380 | 0.293 | 0.293 | 0.230 | 0.230 | 0.096 | 0.246 |
| | 1,000 | 1 | 0.111 | 0.111 | 0.379 | 0.292 | 0.292 | 0.229 | 0.229 | 0.095 | 0.246 |
| | 5,000 | 1 | 0.110 | 0.110 | 0.379 | 0.291 | 0.291 | 0.228 | 0.228 | 0.094 | 0.246 |
| Conditional | 250 | 1 | 0.113 | 0.112 | 0.380 | 0.295 | 0.293 | 0.230 | 0.229 | 0.097 | 0.246 |
| | 500 | 1 | 0.112 | 0.112 | 0.380 | 0.293 | 0.293 | 0.230 | 0.230 | 0.096 | 0.246 |
| | 1,000 | 1 | 0.111 | 0.111 | 0.379 | 0.292 | 0.292 | 0.229 | 0.229 | 0.095 | 0.246 |
| | 5,000 | 1 | 0.110 | 0.110 | 0.379 | 0.291 | 0.291 | 0.228 | 0.228 | 0.094 | 0.246 |

### Table A.2

False-coverage results for the strong marker

“Mod.” and “Emp.” denote model-based and empirical estimators.

| | N | Prob. reject H0 | Θ Mod. | Θ Emp. | Pneg | Bneg Mod. | Bneg Emp. | Bpos Mod. | Bpos Emp. | VΔ | TG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Marg. false cov. | 250 | 1 | 0.059 | 0.045 | 0.052 | 0.056 | 0.030 | 0.051 | 0.034 | 0.056 | 0.056 |
| | 500 | 1 | 0.054 | 0.043 | 0.053 | 0.050 | 0.031 | 0.050 | 0.038 | 0.054 | 0.053 |
| | 1,000 | 1 | 0.056 | 0.055 | 0.051 | 0.049 | 0.044 | 0.047 | 0.044 | 0.055 | 0.055 |
| | 5,000 | 1 | 0.055 | 0.051 | 0.055 | 0.056 | 0.048 | 0.052 | 0.049 | 0.053 | 0.056 |
| Cond. false cov. | 250 | 1 | 0.059 | 0.045 | 0.052 | 0.056 | 0.030 | 0.051 | 0.034 | 0.056 | 0.056 |
| | 500 | 1 | 0.054 | 0.043 | 0.053 | 0.050 | 0.031 | 0.050 | 0.038 | 0.054 | 0.053 |
| | 1,000 | 1 | 0.056 | 0.055 | 0.051 | 0.049 | 0.044 | 0.047 | 0.044 | 0.055 | 0.055 |
| | 5,000 | 1 | 0.055 | 0.051 | 0.055 | 0.056 | 0.048 | 0.052 | 0.049 | 0.053 | 0.056 |
| False concl. | 250 | 1 | 0.059 | 0.045 | 0.052 | 0.056 | 0.030 | 0.051 | 0.034 | 0.056 | 0.056 |
| | 500 | 1 | 0.054 | 0.043 | 0.053 | 0.050 | 0.031 | 0.050 | 0.038 | 0.054 | 0.053 |
| | 1,000 | 1 | 0.056 | 0.055 | 0.051 | 0.049 | 0.044 | 0.047 | 0.044 | 0.055 | 0.055 |
| | 5,000 | 1 | 0.055 | 0.051 | 0.055 | 0.056 | 0.048 | 0.052 | 0.049 | 0.053 | 0.056 |

### Table A.3

Empirical standard deviations of parameter estimates for the strong marker

True values are given in parentheses: Θ (0.110), Pneg (0.379), Bneg (0.291), Bpos (0.228), VΔ (0.094), TG (0.245). “Mod.” and “Emp.” denote model-based and empirical estimators.

| N | Θ Mod. | Θ Emp. | Pneg | Bneg Mod. | Bneg Emp. | Bpos Mod. | Bpos Emp. | VΔ | TG |
|---|---|---|---|---|---|---|---|---|---|
| 250 | 0.031 | 0.035 | 0.072 | 0.057 | 0.072 | 0.045 | 0.050 | 0.030 | 0.042 |
| 500 | 0.022 | 0.024 | 0.051 | 0.039 | 0.048 | 0.031 | 0.036 | 0.021 | 0.029 |
| 1,000 | 0.016 | 0.018 | 0.036 | 0.028 | 0.035 | 0.022 | 0.025 | 0.015 | 0.021 |
| 5,000 | 0.007 | 0.008 | 0.016 | 0.012 | 0.015 | 0.010 | 0.011 | 0.007 | 0.009 |

#### Weak marker

The results for the weak marker are contained in Tables A.4–A.6. With N = 250 or 500, conditional on rejecting H0 the bias in parameter estimates and the false coverage of CIs can be substantial; however, rejecting H0 is unlikely, with power of only 21% or 36%. Marginally, mean parameter estimates are substantially closer to their true values and false-coverage probabilities are generally near nominal. False conclusion probabilities are less than nominal but sometimes substantially below 0.05, indicating over-conservatism. With N = 1,000 or 5,000, conditional and marginal bias is generally small and false-coverage probabilities are near or below nominal. False conclusion probabilities continue to be less than nominal. This example demonstrates that, for markers with near-null performance, substantial sample sizes are required for accurate inference. We also see that with smaller sample sizes there can be a substantial increase in variability associated with use of empirical rather than model-based estimators.

### Table A.4

Mean parameter estimates for the weak marker

True values are given in parentheses: Θ (0.0095), Pneg (0.439), Bneg (0.022), Bpos (0.073), VΔ (0.005), TG (0.050). “Mod.” and “Emp.” denote model-based and empirical estimators.

| | N | Prob. reject H0 | Θ Mod. | Θ Emp. | Pneg | Bneg Mod. | Bneg Emp. | Bpos Mod. | Bpos Emp. | VΔ | TG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Marginal | 250 | 0.217 | 0.022 | 0.022 | 0.423 | 0.036 | 0.036 | 0.090 | 0.090 | 0.009 | 0.060 |
| | 500 | 0.364 | 0.016 | 0.015 | 0.410 | 0.027 | 0.026 | 0.080 | 0.080 | 0.007 | 0.055 |
| | 1,000 | 0.630 | 0.013 | 0.013 | 0.405 | 0.024 | 0.024 | 0.076 | 0.076 | 0.006 | 0.054 |
| | 5,000 | 0.999 | 0.010 | 0.010 | 0.426 | 0.022 | 0.022 | 0.073 | 0.073 | 0.005 | 0.053 |
| Conditional | 250 | 0.217 | 0.042 | 0.041 | 0.547 | 0.071 | 0.069 | 0.159 | 0.154 | 0.022 | 0.112 |
| | 500 | 0.364 | 0.026 | 0.025 | 0.509 | 0.046 | 0.044 | 0.117 | 0.117 | 0.013 | 0.084 |
| | 1,000 | 0.630 | 0.017 | 0.017 | 0.473 | 0.032 | 0.032 | 0.091 | 0.090 | 0.008 | 0.066 |
| | 5,000 | 0.999 | 0.010 | 0.010 | 0.426 | 0.022 | 0.022 | 0.073 | 0.073 | 0.005 | 0.053 |

### Table A.5

False-coverage results for the weak marker

“Mod.” and “Emp.” denote model-based and empirical estimators.

| | N | Prob. reject H0 | Θ Mod. | Θ Emp. | Pneg | Bneg Mod. | Bneg Emp. | Bpos Mod. | Bpos Emp. | VΔ | TG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Marg. false cov. | 250 | 0.217 | 0.054 | 0.030 | 0.059 | 0.053 | 0.023 | 0.063 | 0.026 | 0.034 | 0.035 |
| | 500 | 0.364 | 0.043 | 0.021 | 0.052 | 0.034 | 0.014 | 0.048 | 0.022 | 0.029 | 0.030 |
| | 1,000 | 0.630 | 0.050 | 0.026 | 0.055 | 0.034 | 0.015 | 0.047 | 0.018 | 0.052 | 0.051 |
| | 5,000 | 0.999 | 0.057 | 0.037 | 0.058 | 0.058 | 0.020 | 0.058 | 0.036 | 0.060 | 0.058 |
| Cond. false cov. | 250 | 0.217 | 0.162 | 0.089 | 0.102 | 0.190 | 0.074 | 0.248 | 0.088 | 0.158 | 0.161 |
| | 500 | 0.364 | 0.083 | 0.043 | 0.065 | 0.086 | 0.032 | 0.121 | 0.057 | 0.081 | 0.083 |
| | 1,000 | 0.630 | 0.045 | 0.028 | 0.043 | 0.047 | 0.023 | 0.061 | 0.028 | 0.046 | 0.044 |
| | 5,000 | 0.999 | 0.056 | 0.037 | 0.058 | 0.057 | 0.020 | 0.057 | 0.035 | 0.058 | 0.057 |
| Marg. false concl. | 250 | 0.217 | 0.035 | 0.019 | 0.022 | 0.041 | 0.016 | 0.054 | 0.019 | 0.034 | 0.035 |
| | 500 | 0.364 | 0.030 | 0.016 | 0.024 | 0.031 | 0.012 | 0.044 | 0.021 | 0.029 | 0.030 |
| | 1,000 | 0.630 | 0.028 | 0.018 | 0.027 | 0.030 | 0.014 | 0.038 | 0.018 | 0.029 | 0.028 |
| | 5,000 | 0.999 | 0.056 | 0.037 | 0.058 | 0.057 | 0.020 | 0.057 | 0.035 | 0.058 | 0.057 |

### Table A.6

Empirical standard deviations of parameter estimates for the weak marker

True values are given in parentheses: Θ (0.0095), Pneg (0.439), Bneg (0.022), Bpos (0.073), VΔ (0.005), TG (0.050). “Mod.” and “Emp.” denote model-based and empirical estimators.

| N | Θ Mod. | Θ Emp. | Pneg | Bneg Mod. | Bneg Emp. | Bpos Mod. | Bpos Emp. | VΔ | TG |
|---|---|---|---|---|---|---|---|---|---|
| 250 | 0.025 | 0.029 | 0.309 | 0.032 | 0.090 | 0.051 | 0.083 | 0.009 | 0.037 |
| 500 | 0.017 | 0.020 | 0.270 | 0.023 | 0.061 | 0.037 | 0.065 | 0.006 | 0.028 |
| 1,000 | 0.012 | 0.014 | 0.220 | 0.017 | 0.045 | 0.027 | 0.033 | 0.004 | 0.021 |
| 5,000 | 0.005 | 0.007 | 0.106 | 0.008 | 0.013 | 0.013 | 0.015 | 0.002 | 0.010 |

### Software

We developed a package in the open-source software R, called TreatmentSelection, that implements our methods for evaluating individual markers and for comparing markers. The package is available at http://labs.fhcrc.org/janes/index.html. The following functions are included.

1. trtsel creates a treatment selection object

2. eval.trtsel evaluates a treatment selection object, producing estimates and CIs for the summary measures described in Section 4.3

3. plot.trtsel plots a treatment selection object, producing risk curves and the treatment effect curve described in Section 4.2

4. calibrate.trtsel assesses the calibration of a fitted risk model and treatment effect model using methods described in Section 4.5

5. compare.trtsel compares two markers using methods described in Section 5

Case–control and treatment-stratified case–control samplings are accommodated.

Here we illustrate use of the code by showing how the results in Figures 1–3 and Table 1 of the main text are produced. First we load the data using the following commands.

> library(TreatmentSelection)
> data(tsdata)
> tsdata[1:10, ]
   trt event      Y1      Y2
1    1     1 39.9120 -0.8535
2    1     0  6.6820  0.2905
3    1     0  6.5820  0.0800
4    0     0  1.3581  1.1925
5    0     0  7.6820 -0.2070
6    0     0 41.1720 -0.0880
7    1     0 19.4920  0.1670
8    1     1 20.8220 -1.0485
9    0     0  6.9620 -0.2435
10   0     0  2.5020  0.2030

Treatment selection objects are created and displayed for Y1 and Y2 using the commands

> trtsel.Y1 <- trtsel(event = "event", trt = "trt", marker = "Y1",
+                     data = tsdata, study.design = "randomized cohort")
> trtsel.Y1
Study design: randomized cohort

Model Fit:
 Link function: logit
 Coefficients:
                Estimate   Std. Error     z value     Pr(>|z|)
(Intercept) -2.51814383  0.235642511  -10.686288  1.179991e-26
trt          0.48938620  0.311762857    1.569739  1.164759e-01
marker       0.04760056  0.006453791    7.375597  1.636104e-13
trt:marker  -0.02318881  0.008324063   -2.785756  5.340300e-03

Derived Data: (first ten rows)
   event trt  marker fittedrisk.t0 fittedrisk.t1    trt.effect marker.neg
1      1   1 39.9120    0.35016583     0.2583742  0.0917916549          0
2      0   1  6.6820    0.09974358     0.1340472 -0.0343036269          1
3      0   1  6.5820    0.09931697     0.1337641 -0.0344471266          1
4      0   0  1.3581    0.07918316     0.1196652 -0.0404820847          1
5      0   0  7.6820    0.10410005     0.1369063 -0.0328062456          1
6      0   0 41.1720    0.36393311     0.2643117  0.0996213622          0
7      0   1 19.4920    0.16933976     0.1746644 -0.0053246137          1
8      1   1 20.8220    0.17843231     0.1793943 -0.0009620341          1
9      0   0  6.9620    0.10094678     0.1348426 -0.0338958439          1
10     0   0  2.5020    0.08324538     0.1226384 -0.0393929781          1

> trtsel.Y2 <- trtsel(event = "event", trt = "trt", marker = "Y2",
+                     data = tsdata, study.design = "randomized cohort")
> trtsel.Y2
Study design: randomized cohort

Model Fit:
 Link function: logit
 Coefficients:
              Estimate  Std. Error     z value     Pr(>|z|)
(Intercept) -1.2107912   0.1131642  -10.699416  1.024216e-26
trt         -0.5169008   0.1863643   -2.773604  5.543912e-03
marker       0.5779172   0.1148643    5.031305  4.871514e-07
trt:marker  -2.0455033   0.2064547   -9.907756  3.851994e-23

Derived Data: (first ten rows)
   event trt  marker fittedrisk.t0 fittedrisk.t1    trt.effect marker.neg
1      1   1 -0.8535     0.1539379    0.38340813 -0.229470242          1
2      0   1  0.2905     0.2605896    0.10395563  0.156633982          0
3      0   1  0.0800     0.2378401    0.13644937  0.101390712          0
4      0   0  1.1925     0.3724723    0.02995087  0.342521474          0
5      0   0 -0.2070     0.2090899    0.19405065  0.015039232          0
6      0   0 -0.0880     0.2206903    0.16818515  0.052505186          0
7      0   1  0.1670     0.2470740    0.12209072  0.124983277          0
8      1   1 -1.0485     0.1398258    0.45290799 -0.313082172          1
9      0   0 -0.2435     0.2056229    0.20256576  0.003057187          0
10     0   0  0.2030     0.2509647    0.11653995  0.134424710          0

The descriptives shown in Figure 1 are produced using

> plot.trtsel(trtsel.Y1, main = "Y1: Oncotype-DX-like marker", bootstraps = 500, trt.names = c("chemo.", "no chemo."))

> plot.trtsel(trtsel.Y2, main = "Y2: Strong marker", bootstraps = 500, trt.names = c("chemo.", "no chemo."))

Calibration is assessed and displayed as shown in Figure 3 using

> cali.Y1 <- calibrate.trtsel(trtsel.Y1)
> cali.Y1
Hosmer-Lemeshow test for model calibration
------------------------------------------
No Treatment (trt = 0):
   Test Statistic = 4.496, DF = 8, p value = 0.8098813
Treated (trt = 1):
   Test Statistic = 4.986, DF = 8, p value = 0.7591213

> cali.Y2 <- calibrate.trtsel(trtsel.Y2)
> cali.Y2
Hosmer-Lemeshow test for model calibration
------------------------------------------
No Treatment (trt = 0):
   Test Statistic = 8.896, DF = 8, p value = 0.3511235
Treated (trt = 1):
   Test Statistic = 2.868, DF = 8, p value = 0.9423597

> calibrate.trtsel(trtsel.Y1, plot.type = "risk.t0")
> calibrate.trtsel(trtsel.Y2, plot.type = "risk.t0")
> calibrate.trtsel(trtsel.Y1, plot.type = "risk.t1")
> calibrate.trtsel(trtsel.Y2, plot.type = "risk.t1")
> calibrate.trtsel(trtsel.Y1, plot.type = "treatment effect")
> calibrate.trtsel(trtsel.Y2, plot.type = "treatment effect")

The summary measure estimates and CIs shown in Table 1 are obtained by

> eval.Y1 <- eval.trtsel(trtsel.Y1, bootstraps = 500)
> eval.Y1
Hypothesis test:
----------------
H0: No marker-by-treatment interaction
P value = 0.00534
Z statistic = -2.786

Summary Measure Estimates (with 95% confidence intervals)
---------------------------------------------------------
Decrease in event rate under marker-based treatment (Theta)
   Empirical:    0.013 (-0.01, 0.044)
   Model Based:  0.01 (0, 0.038)
Proportion marker-negative:
   0.461 (0, 0.717)
Proportion marker-positive:
   0.539 (0.283, 1)
Average benefit of no treatment among marker-negatives (B.neg)
   Empirical:    0.029 (-0.07, 0.075)
   Model Based:  0.023 (0, 0.059)
Average benefit of treatment among marker-positives (B.pos)
   Empirical:    0.089 (0.014, 0.15)
   Model Based:  0.098 (0.04, 0.146)
Variance in estimated treatment effect:
   0.007 (0.001, 0.017)
Total Gain:
   0.066 (0.026, 0.1)
Marker positivity threshold: 21.082

Event Rates:
------------
              Treat none     Treat all      Marker-based Treatment
Empirical:    0.251          0.217          0.204
              (0.210,0.291)  (0.182,0.251)  (0.171,0.241)
Model Based:  0.257          0.214          0.204
              (0.217,0.295)  (0.179,0.248)  (0.169,0.232)

> eval.Y2 <- eval.trtsel(trtsel.Y2, bootstraps = 500)
> eval.Y2
Hypothesis test:
----------------
H0: No marker-by-treatment interaction
P value = 0
Z statistic = -9.908

Summary Measure Estimates (with 95% confidence intervals)
---------------------------------------------------------
Decrease in event rate under marker-based treatment (Theta)
   Empirical:    0.09 (0.064, 0.122)
   Model Based:  0.099 (0.074, 0.128)
Proportion marker-negative:
   0.377 (0.306, 0.467)
Proportion marker-positive:
   0.623 (0.533, 0.694)
Average benefit of no treatment among marker-negatives (B.neg)
   Empirical:    0.238 (0.173, 0.304)
   Model Based:  0.262 (0.211, 0.315)
Average benefit of treatment among marker-positives (B.pos)
   Empirical:    0.203 (0.157, 0.266)
   Model Based:  0.211 (0.171, 0.259)
Variance in estimated treatment effect:
   0.08 (0.057, 0.108)
Total Gain:
   0.224 (0.187, 0.262)
Marker positivity threshold: -0.258

Event Rates:
------------
              Treat none     Treat all      Marker-based Treatment
Empirical:    0.251          0.217          0.128
              (0.215,0.290)  (0.186,0.252)  (0.096,0.155)
Model Based:  0.245          0.212          0.113
              (0.210,0.282)  (0.180,0.245)  (0.090,0.135)

The markers are compared based on summary measures, and visually (as in Figure 2) using

> mycompare <- compare.trtsel(trtsel1 = trtsel.Y1, trtsel2 = trtsel.Y2,
+                             bootstraps = 500, plot = TRUE, main = "",
+                             marker.names = c("Y1", "Y2"))
> mycompare
Summary Measure Estimates (with 95% confidence intervals)
                  marker 1        | marker 2        | difference (p-value)
--------------------------------------------------------------------------
Decrease in event rate under marker-based treatment (Theta)
   Empirical:     0.013           | 0.090           | -0.076 (< 0.002)
                  (-0.007,0.049)  | (0.062,0.124)   | (-0.111,-0.043)
   Model Based:   0.010           | 0.099           | -0.088 (< 0.002)
                  (0.000,0.039)   | (0.070,0.130)   | (-0.112,-0.058)
Proportion marker negative:
                  0.461           | 0.377           | 0.084 (0.664)
                  (0.000,0.707)   | (0.305,0.471)   | (-0.360,0.258)
Average benefit of no treatment among marker-negatives (B.neg)
   Empirical:     0.029           | 0.238           | -0.209 (< 0.002)
                  (-0.071,0.082)  | (0.176,0.307)   | (-0.331,-0.132)
   Model Based:   0.023           | 0.262           | -0.239 (< 0.002)
                  (0.000,0.061)   | (0.205,0.308)   | (-0.288,-0.169)
Average benefit of treatment among marker-positives (B.pos)
   Empirical:     0.089           | 0.203           | -0.114 (0.002)
                  (0.007,0.161)   | (0.162,0.266)   | (-0.201,-0.035)
   Model Based:   0.098           | 0.211           | -0.113 (< 0.002)
                  (0.042,0.162)   | (0.175,0.259)   | (-0.177,-0.038)
Variance in estimated treatment effect:
                  0.007           | 0.080           | -0.073 (< 0.002)
                  (0.001,0.021)   | (0.054,0.110)   | (-0.103,-0.043)
Total Gain:
                  0.066           | 0.224           | -0.158 (< 0.002)
                  (0.027,0.111)   | (0.181,0.266)   | (-0.214,-0.091)

If instead the dataset with D, Y1, and T measurements consisted of a case–control sample from within an RCT, given estimates of P(T=1) and P(D=1) from the trial cohort (call these “Rand.frac” and “Risk.cohort”, respectively) and the size of the trial cohort, N, the only modification would be in creating the treatment selection object:

> cctrtsel.Y1 <- trtsel(event = "event", trt = "trt", marker = "Y1",
+                       data = tsdata, cohort.attributes = c(N, Rand.frac, Risk.cohort),
+                       study.design = "nested case-control")

### References

1. Simon R. Lost in translation: problems and pitfalls in translating laboratory observations to clinical utility. Eur J Cancer 2008;44:2707–13.

2. Gunter L, Zhu J, Murphy S. Variable selection for optimal decision making. In: Proceedings of the 11th Conference on Artificial Intelligence in Medicine. Springer Verlag, 2007.

3. Coates AA, Miller EK, O’Toole SA, Molloy TJ, Viale G, Goldhirsch A, et al. Prognostic interaction between expression of p53 and estrogen receptor in patients with node-negative breast cancer: results from IBCSG trials VIII and IX. Breast Cancer Res 2012;14:R143.

4. Busch S, Ryden L, Stal O, Jirstrom K, Landberg G. Low ERK phosphorylation in cancer-associated fibroblasts is associated with tamoxifen resistance in pre-menopausal breast cancer. PLoS One 2012;7:e45669.

5. Malmstrom A, Gronberg BH, Marosi C, Stupp R, Frappaz D, Schultz H, et al., and the Nordic Clinical Brain Tumour Study Group (NCBTSG). Temozolomide versus standard 6-week radiotherapy versus hypofractionated radiotherapy in patients older than 60 years with glioblastoma: the Nordic randomised, phase 3 trial. Lancet Oncol 2012;13:916–26.

6. Janes H, Pepe MS, Bossuyt PM, Barlow WE. Measuring the performance of markers for guiding treatment decisions. Ann Intern Med 2011;154:253–9.

7. Huang Y, Gilbert PB, Janes H. Assessing treatment-selection markers using a potential outcomes framework. Biometrics 2012;68:687–96.

8. Bonetti M, Gelber RD. Patterns of treatment effects in subsets of patients in clinical trials. Biostatistics 2004;5:465–81.

9. Royston P, Sauerbrei W. A new approach to modelling interactions between treatment and continuous covariates in clinical trials by using fractional polynomials. Stat Med 2004;23:2509–25.

10. Cai T, Tian L, Wong P, Wei L. Analysis of randomized comparative clinical trial data for personalized treatment selections. Biostatistics 2011;12:270–82.

11. Claggett B, Zhao L, Tian L, Castagno D, Wei LJ. Estimating subject-specific treatment differences for risk-benefit assessment with competing risk event-time data. Harvard University Biostatistics Working Paper Series 125, 2011.

12. Zhao L, Tian L, Cai T, Claggett B, Wei LJ. Effectively selecting a target population for a future comparative study. Harvard University Biostatistics Working Paper Series, 2011.

13. Song X, Pepe MS. Evaluating markers for selecting a patient’s treatment. Biometrics 2004;60:874–83.

14. Baker S, Kramer B. Statistics for weighing benefits and harms in a proposed genetic substudy of a randomized cancer prevention trial. J Royal Stat Soc Ser C (Appl Stat) 2005;54:941–54.

15. Vickers AJ, Kattan MW, Sargent D. Method for evaluating prediction models that apply the results of randomized trials to individual patients. Trials 2007;8:14.

16. Brinkley J, Tsiatis AA, Anstrom KJ. A generalized estimator of the attributable benefit of an optimal treatment regime. Biometrics 2010;66:512–22.

17. Lu W, Zhang HH, Zeng D. Variable selection for optimal treatment decision. Stat Meth Med Res 2012;22(5):493–504.

18. Foster JC, Taylor JM, Ruberg SJ. Subgroup identification from randomized clinical trial data. Stat Med 2011;30:2867–80.

19. Gunter L, Zhu J, Murphy S. Variable selection for qualitative interactions in personalized medicine while controlling the family-wise error rate. J Biopharm Stat 2011;21:1063–78.

20. Qian M, Murphy S. Performance guarantees for individualized treatment rules. Ann Stat 2011;39:1180–210.

21. McKeague IW, Qian M. Evaluation of treatment policies via sparse functional linear regression. Stat Sin 2013.

22. Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics 2012;68(4):1010–8.

23. Pepe MS. The statistical evaluation of medical tests for classification and prediction. Oxford, UK: Oxford University Press, 2003.

24. Zhou X-H, McClish DK, Obuchowski NA. Statistical methods in diagnostic medicine. New York: Wiley, 2002.

25. Pepe MS, Janes H. Methods for evaluating prediction performance of biomarkers and tests. UW Biostatistics Working Paper Series 384, 2012.

26. Early Breast Cancer Trialists’ Collaborative Group. Effects of chemotherapy and hormonal therapy for early breast cancer on recurrence and 15-year survival: an overview of the randomized trials. Lancet 2005;365:1687–717.

27. Dowsett M, Goldhirsch A, Hayes DF, Senn HJ, Wood W, Viale G. International web-based consultation on priorities for translational breast cancer research. Breast Cancer Res 2007;9:R81.

28. Albain KS, Barlow WE, Shak S, Hortobagyi GN, Livingston RB, Yeh IT. Prognostic and predictive value of the 21-gene recurrence score assay in postmenopausal women with node-positive, oestrogen-receptor-positive breast cancer on chemotherapy: a retrospective analysis of a randomized trial. Lancet Oncol 2010;11:55–65.

29. Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. New Engl J Med 2004;351:2817–26.

30. Paik S, Tang G, Shak S, Chungyeul K, Baker J, Kim W, et al. Gene expression and benefit of chemotherapy in women with node-negative, estrogen-receptor-positive breast cancer. J Clin Oncol 2006;24:3726–34.

31. Albain KS, Barlow WE, Ravdin PM, Farrar WB, Burton GV, Ketchel SJ, et al., and the Breast Cancer Intergroup of North America. Adjuvant chemotherapy and timing of tamoxifen in postmenopausal patients with endocrine-responsive, node-positive breast cancer: a phase 3, open-label, randomized controlled trial. Lancet 2009;374:2055–63.

32. Janes H, Pepe MS, Huang Y. A framework for evaluating markers used to select patient treatment. Med Decis Making 2014;34(2):159–67.

33. Huang Y, Pepe M. Semiparametric methods for evaluating the covariate-specific predictiveness of continuous markers in matched case-control studies. J R Stat Soc Ser C 2010;59:437–56.

34. Huang Y, Pepe MS. Assessing risk prediction models in case-control studies using semiparametric and nonparametric methods. Stat Med 2010;29:1391–410.

35. Huang Y, Sullivan Pepe M, Feng Z. Evaluating the predictiveness of a continuous marker. Biometrics 2007;63:1181–8.

36. Vickers AJ, Cronin AM, Begg CB. One statistical test is sufficient for assessing new predictive markers. BMC Med Res Methodol 2011;11:13.

37. Kerr KF, McClelland RL, Brown ER, Lumley T. Evaluating the incremental value of new biomarkers with integrated discrimination improvement. Am J Epidemiol 2011;174:364–74.

38. Pepe M, Kerr K, Longton G, Wang Z. Testing for improvement in prediction model performance. UW Biostatistics Working Paper Series 379, 2011.

39. Seshan VE, Gonen M, Begg CB. Comparing ROC curves derived from regression models. Stat Med 2013;32(9):1483–93.

40. Demler OV, Pencina MJ, D’Agostino RB. Misuse of DeLong test to compare AUCs for nested models. Stat Med 2012;31:2577–87.

41. Kerr KF, Wang Z, Janes H, McClelland RL, Psaty BM, Pepe MS. Net reclassification indices for evaluating risk prediction instruments: a critical review. Epidemiology 2014;25(1):114–21.

42. Gail M, Simon R. Testing for qualitative interactions between treatment effects and patient subsets. Biometrics 1985;41:361–72.

43. Shuster J, van Eys J. Interaction between prognostic factors and treatment. Control Clin Trials 1983;4:209–14.

44. Lemeshow S, Hosmer DJ. A review of goodness of fit statistics for use in the development of logistic regression models. Am J Epidemiol 1982;115:92–106.

45. Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika 1979;66:403–11.

46. Gu W, Pepe M. Measures to summarize and compare the predictive capacity of markers. Int J Biostat 2009;5:Article 27. Available at: http://dx.doi.org/10.2202/1557-4679.1188.

47. Benjamini Y, Yekutieli D. False discovery rate-adjusted multiple confidence intervals for selected parameters. J Am Stat Assoc 2005;100:71–81.

Published Online: 2014-4-2
Published in Print: 2014-5-1

© 2014 by Walter de Gruyter Berlin / Boston