1 Introduction
There is an enormous amount of research effort devoted to discovering and evaluating markers that can predict a patient’s chance of responding to treatment. A December 2013 PubMed search identified 8,198 papers evaluating such markers over the last 2 years alone. Treatment selection markers, sometimes called “predictive” [1] or “prescriptive” [2] markers, have the potential to improve patient outcomes and reduce medical costs by allowing treatment provision to be restricted to those subjects most likely to benefit, and avoiding treatment in those likely only to suffer its side effects and other costs.
Methods for evaluating treatment selection markers are much less well developed than those for markers used to diagnose disease or predict risk under a single treatment. In the medical literature, the most common approach to marker evaluation is to test for a statistical interaction between the marker and treatment in the context of a randomized and controlled trial (RCT; see Coates et al. [3], Busch et al. [4], and Malmstrom et al. and NCBTSG [5] for some recent examples). However, this approach has limitations: it does not provide a clinically relevant measure of the benefit of using the marker to select treatment, and it does not facilitate comparing candidate markers [6]. Moreover, the scale and magnitude of the interaction coefficient depend on the form of the regression model used to test for interaction and on the other covariates included in this model [7].
There is a growing literature on statistical methods for evaluating treatment selection markers. A number of papers have focused on descriptive analysis, specifically on modeling the treatment effect as a function of marker [8–12]. In general, these approaches are not well-suited to the task of comparing candidate markers. Other papers have proposed individual measures for evaluating markers [6, 7, 13–16], some of which we adopt as part of our analytic approach as described below. Still others have focused on the specific problem of optimizing marker combinations for treatment selection [17–22]. A complete framework for marker evaluation, on a par with those developed for evaluating markers for classification [23, 24] or risk prediction [25], is still forthcoming.
In this paper, we lay out a comprehensive approach to evaluate markers for treatment selection. We propose tools for descriptive analysis and summary measures for formal evaluation and comparison of markers. The descriptives are conceptually similar to those of Bonetti and Gelber [8], Royston and Sauerbrei [9], Cai et al. [10], but we scale markers to the percentile scale to facilitate making comparisons. Our preferred global summary measure is the same as or closely related to that advocated by Song and Pepe [13], Brinkley et al. [16], Janes et al. [6], Gunter et al. [19], Qian and Murphy [20], McKeague and Qian [21], and Zhang et al. [22], a component of which was described by Zhao et al. [12] and Baker and Kramer [14]. We also propose several novel measures of treatment selection performance, motivated by existing methodology for evaluating markers for predicting outcome under a single treatment, i.e. for risk prediction. We develop methods for estimation and inference that apply to data from a randomized controlled trial comparing two treatment options where the marker is measured at baseline on all or a stratified case–control sample of trial participants. For illustration, we consider the breast cancer treatment context where candidate markers are evaluated for their utility in identifying a subset of women who do not benefit from adjuvant chemotherapy. Appendices include the results of a small-scale simulation study that evaluates the performance of the methods in finite samples and a description of the R package we have written that implements these methods.
2 Setting and notation
Suppose that the task is to decide between two treatment options, referred to as “treatment” (T = 1) and “no treatment” (T = 0), for each member of a population. Let Y denote the candidate treatment selection marker measured at baseline, and let D denote the binary indicator of the adverse event that treatment is intended to prevent.
We focus on the ideal setting for evaluating treatment efficacy, an RCT comparing the two treatment options, with the marker measured at baseline on all or a subset of trial participants.
3 Motivating context
We illustrate our methods in the breast cancer treatment context. Women diagnosed with estrogen-receptor-positive and node-positive breast cancer are typically treated with both hormone therapy (e.g. tamoxifen) and adjuvant chemotherapy following surgery. This is despite the fact that it is generally well-accepted in the clinical community that only a subset of these women actually benefit from the adjuvant chemotherapy, while the remaining women suffer its toxic side effects, not to mention the burden and cost of unnecessary treatment [26]. A high public health priority is to identify biomarkers that can be used to predict which women are and are not likely to benefit from the adjuvant chemotherapy [27]. The Oncotype DX recurrence score is an example of a biomarker that is currently being used in clinical practice for this purpose. This marker is a proprietary combination of 21 genes whose expression levels are measured in the tumor tissue obtained at surgery [28–30]. The marker has been shown to have value for identifying a subset of women who are unlikely to benefit from chemotherapy [28–30].
To illustrate our methods, we simulated a marker, Y1, designed to mimic the performance of Oncotype DX, together with a second, stronger marker, Y2, in a hypothetical RCT.
4 Methods for evaluating individual markers
4.1 Treatment rule
Given that the task is to decide between treatment and no treatment for each individual subject, it is sensible and common to define a binary rule for assigning treatment on the basis of marker value. Let Δ(y) = P(D = 1 | Y = y, T = 0) − P(D = 1 | Y = y, T = 1) denote the treatment effect at marker value y, i.e. the difference in risk without versus with treatment. The rule that assigns treatment if and only if Δ(Y) ≥ 0
can be shown to be optimal in the sense that it minimizes the population event rate [16, 22, 32]. Some of the marker performance measures we consider evaluate the properties of this rule; other performance measures do not depend on specification of a treatment rule. We refer to subjects with Δ(Y) < 0 as marker-negatives and subjects with Δ(Y) ≥ 0 as marker-positives.
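In code, this rule reduces to thresholding the fitted risk difference at zero. A minimal sketch (illustrative only; the function name and toy risks are ours, not part of the TreatmentSelection package):

```python
import numpy as np

def treatment_rule(risk_t0, risk_t1):
    """Recommend treatment (1) iff the treatment effect
    Delta = risk without treatment - risk with treatment is >= 0."""
    delta = np.asarray(risk_t0) - np.asarray(risk_t1)
    return (delta >= 0).astype(int)

# two hypothetical subjects: treatment lowers risk for the first only
rule = treatment_rule([0.30, 0.10], [0.20, 0.15])
print(rule)  # -> [1 0]
```

Subjects with rule equal to 0 are marker-negative and, under the optimal rule, forgo treatment.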
4.2 Descriptives
For descriptive analysis, it is useful to display the distribution of risk of the event as a function of the marker under each treatment. We plot “risk curves,” P(D = 1 | Y = y, T = t) for t = 0, 1, with the marker value y on the horizontal axis transformed to the percentile scale to facilitate comparisons across markers.

Risk of 5-year breast cancer recurrence or death as a function of treatment assignment and marker percentile, for the Oncotype-DX-like marker (Y1, left) and the strong marker (Y2, right)
Citation: The International Journal of Biostatistics 10, 1; 10.1515/ijb-2012-0052
Another informative display is the distribution of the treatment effect, as summarized by the “treatment effect curve,” which plots Δ(y) against the marker percentile.

Distribution of the treatment effect, as measured by the difference in the 5-year breast cancer recurrence or death rate without vs with treatment, Δ(Y), for the Oncotype-DX-like marker (Y1) and the strong marker (Y2)
4.3 Summary measures
The following are useful measures for summarizing marker performance that depend on specification of the treatment rule:
- –Average benefit of no treatment among marker-negatives, B.neg = E[−Δ(Y) | Δ(Y) < 0],
- –Average benefit of treatment among marker-positives, B.pos = E[Δ(Y) | Δ(Y) ≥ 0],
- –Proportion marker-negative, P.neg = P(Δ(Y) < 0),
- –Decrease in population event rate under marker-based treatment relative to treating everyone, Θ = B.neg × P.neg,
where we define Δ(y) = P(D = 1 | Y = y, T = 0) − P(D = 1 | Y = y, T = 1), the treatment effect at marker value y.
The constituents of Θ, namely B.neg and P.neg, are themselves informative and are reported alongside Θ.
We also consider two marker performance measures that do not depend on specification of a treatment rule:
- –Variance in treatment effect, V(Δ) = Var(Δ(Y)),
- –Total gain, the area between the treatment effect curve and the marginal treatment effect, TG = E|Δ(Y) − δ|, where δ = E[Δ(Y)] is the marginal treatment effect.
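Given fitted risks under each treatment, all of these summary measures are simple functionals of the treatment effect Δ(Y). A minimal numpy sketch under the definitions above (function and variable names are ours, not the package’s):

```python
import numpy as np

def summary_measures(risk_t0, risk_t1):
    """Model-based summary measures from fitted risks under no
    treatment (risk_t0) and treatment (risk_t1); names follow the text."""
    delta = np.asarray(risk_t0) - np.asarray(risk_t1)   # treatment effect
    neg = delta < 0                                      # marker-negatives
    p_neg = neg.mean()                                   # proportion marker-negative
    b_neg = -delta[neg].mean() if neg.any() else 0.0     # benefit of no treatment
    b_pos = delta[~neg].mean() if (~neg).any() else 0.0  # benefit of treatment
    theta = b_neg * p_neg        # decrease in event rate vs treating everyone
    v_delta = delta.var()        # variance in treatment effect
    tg = np.abs(delta - delta.mean()).mean()  # total gain
    return dict(Theta=theta, B_neg=b_neg, B_pos=b_pos,
                P_neg=p_neg, V_Delta=v_delta, TG=tg)

# four hypothetical subjects, one of whom is harmed by treatment
m = summary_measures([0.30, 0.10, 0.25, 0.40], [0.20, 0.20, 0.15, 0.10])
```

Here Θ = B.neg × P.neg = 0.1 × 0.25 = 0.025: treating by marker rather than treating everyone spares the one harmed subject.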
These two measures quantify the variability in the treatment effect across marker values; larger values indicate a greater capacity to separate subjects who benefit from treatment from those who do not.
Table 1 contains estimates of these summary measures for the two markers in the breast cancer example.
Estimates of various measures of marker performance for the Oncotype-DX-like marker (Y1) and the strong marker (Y2), with 95% confidence intervals
| Measure | Estimator | Marker Y1 (95% CI) | Marker Y2 (95% CI) | Estimated diff. (95% CI) | P value for diff. |
| Θ | Empirical | 0.013 (−0.010, 0.044) | 0.090 (0.060, 0.122) | −0.076 (−0.111, −0.042) | |
| | Model-based | 0.010 (0.000, 0.037) | 0.099 (0.071, 0.129) | −0.088 (−0.115, −0.061) | |
| B.neg | Empirical | 0.029 (−0.106, 0.082) | 0.238 (0.170, 0.309) | −0.209 (−0.342, −0.129) | |
| | Model-based | 0.023 (0.000, 0.057) | 0.262 (0.209, 0.310) | −0.239 (−0.294, −0.178) | |
| B.pos | Empirical | 0.089 (0.020, 0.157) | 0.203 (0.157, 0.263) | −0.114 (−0.193, −0.043) | |
| | Model-based | 0.098 (0.035, 0.162) | 0.211 (0.176, 0.258) | −0.113 (−0.184, −0.052) | |
| P.neg | | 0.461 (0.000, 0.700) | 0.377 (0.304, 0.470) | 0.084 (−0.358, 0.236) | 0.768 |
| V(Δ) | | 0.007 (0.001, 0.019) | 0.080 (0.057, 0.109) | −0.073 (−0.103, −0.046) | |
| TG | | 0.066 (0.024, 0.110) | 0.224 (0.187, 0.263) | −0.158 (−0.221, −0.102) | |
4.4 Estimation and inference
Our proposed estimation and inference methods build on methodology developed for risk prediction [33–35]. This section provides an overview of these approaches; their finite-sample performance is evaluated in a small-scale simulation study described in the Appendix. An R software package that implements these methods is also described in the Appendix.
4.4.1 Estimation
Given data consisting of i.i.d. copies of (D, T, Y), we estimate the risks under each treatment by fitting a generalized linear model with a marker-by-treatment interaction,
g(P(D = 1 | T, Y)) = β0 + β1 T + β2 Y + β3 T × Y. (1)
Typically we let g be the logit function because of its advantages with case–control data (see Section 6.2) and because we have found logistic regression to be remarkably robust to model mis-specification. We note that the generalized linear model (1) is flexible in that the marker Y can itself be a transformed marker value. The risk and treatment effect estimates that result from fitting this model are denoted with hats, e.g. Δ̂(y) for the estimated treatment effect.
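For concreteness, model (1) with a logit link can be fit by Newton–Raphson (iteratively reweighted least squares). The short numpy sketch below recovers the coefficients of a simulated marker-by-treatment interaction model; the data-generating coefficients here are arbitrary illustrations, not those of the paper, and this is not the package’s fitting code:

```python
import numpy as np

def fit_logistic(X, d, iters=25):
    """Fit logit P(D = 1 | X) = X @ beta by Newton-Raphson (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        W = p * (1.0 - p)                     # IRLS weights
        beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (d - p))
    return beta

# simulate an RCT-like dataset from model (1) with illustrative coefficients
rng = np.random.default_rng(0)
n = 20000
t = rng.integers(0, 2, n).astype(float)       # randomized treatment
y = rng.normal(size=n)                        # baseline marker
lin = -1.0 + 0.5 * t + 0.6 * y - 1.2 * t * y  # true linear predictor
d = (rng.random(n) < 1.0 / (1.0 + np.exp(-lin))).astype(float)
X = np.column_stack([np.ones(n), t, y, t * y])  # design: 1, T, Y, T*Y
beta_hat = fit_logistic(X, d)
print(beta_hat.round(2))
```

With n = 20,000 the estimates land close to the generating values (−1.0, 0.5, 0.6, −1.2), mirroring the coefficient tables produced by the R package.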
For the summary measures that depend on the treatment rule, we consider both “empirical” and “model-based” estimators. An empirical estimator uses the estimated risk model (1) only to classify individuals as marker-positive or marker-negative; the performance of this rule is then estimated empirically. For a model-based estimator, the risk model is used both to classify each individual and to estimate the performance of the classification rule. For example, the empirical estimator of B.neg is the observed event rate among treated marker-negative subjects minus that among untreated marker-negative subjects, whereas the model-based estimator averages −Δ̂(Y) over the marker-negative subjects.
The treatment-rule-independent summary measures are estimated using the following model-based estimators: V̂(Δ) = (1/n) Σ_i (Δ̂(Y_i) − δ̂)² and T̂G = (1/n) Σ_i |Δ̂(Y_i) − δ̂|, where δ̂ = (1/n) Σ_i Δ̂(Y_i) is the estimated marginal treatment effect.
4.4.2 Hypothesis testing
Testing whether a marker has any performance for treatment selection is of interest for two reasons. First, this is a logical first step in marker evaluation. Second, the performance measures described above may have poor statistical properties at and near the null of no marker performance. This is similar to problems that have been identified with measures of risk prediction model performance [36–41]; Section 7 includes further discussion of this point. Therefore, we advocate a simple pre-testing approach, whereby the marker performance measures are only estimated if the null hypothesis of no marker-by-treatment interaction is rejected.
For an unbounded marker, under risk model (1), the marker has no treatment selection performance if and only if the interaction coefficient β3 is zero; the null hypothesis H0: β3 = 0 can therefore be assessed with a standard Wald test.
For the unbounded markers Y1 and Y2 in the breast cancer example, the Wald test yields p = 0.005 and p < 0.001, respectively, so we proceed to estimate the performance measures for both markers.
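The Wald statistic can be recomputed directly from the interaction coefficient estimate and its standard error; the numbers below are taken from the Y1 model fit shown in the software appendix:

```python
import math

# trt:marker estimate and standard error from the fitted Y1 model
beta3_hat, se3 = -0.02318881, 0.008324063
z = beta3_hat / se3                     # Wald z statistic
p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value under N(0, 1)
print(round(z, 3), round(p, 5))  # -> -2.786 0.00534
```

These match the z statistic and p-value reported by the software for Y1.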
4.5 Calibration assessment
Assessing model calibration is a fundamental step in marker evaluation. We rely on standard methods for visualizing and testing goodness of fit for the risk model (1) and extend these methods to assess calibration of the treatment effect model.
Since patients are provided risk estimates under both treatment options, first we assess the fit of the risk model separately in the two treatment groups. Specifically, we define a well-calibrated model to be one for which the average fitted risks agree with the observed event rates within groups of subjects defined by quantiles of fitted risk, in each treatment arm.

Plots assessing calibration of the risk and treatment effect models, for the Oncotype-DX-like marker (left) and the strong marker (right)
To formally assess model calibration, a traditional Hosmer–Lemeshow goodness of fit test [44] can be applied separately to the two treatment groups. Specifically, for treatment group t, subjects are partitioned into G groups based on quantiles of fitted risk, and the statistic
X²_t = Σ_{g=1}^{G} (O_tg − n_tg p̄_tg)² / [n_tg p̄_tg (1 − p̄_tg)],
where O_tg is the observed number of events, n_tg the number of subjects, and p̄_tg the average fitted risk in group g of treatment arm t, is compared to a chi-squared distribution with G − 2 degrees of freedom.
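The grouped statistic is straightforward to compute from outcomes and fitted risks within one treatment arm. A sketch of the standard statistic (our own illustration; the default G = 10 matches the 8 degrees of freedom in the software output):

```python
import numpy as np

def hosmer_lemeshow(d, p_hat, groups=10):
    """Partition subjects into quantile groups of fitted risk and sum
    (observed - expected)^2 / [n * pbar * (1 - pbar)] over groups."""
    order = np.argsort(p_hat)
    d = np.asarray(d, dtype=float)[order]
    p_hat = np.asarray(p_hat, dtype=float)[order]
    stat = 0.0
    for d_g, p_g in zip(np.array_split(d, groups),
                        np.array_split(p_hat, groups)):
        n_g, obs, pbar = len(d_g), d_g.sum(), p_g.mean()
        stat += (obs - n_g * pbar) ** 2 / (n_g * pbar * (1 - pbar))
    return stat  # refer to chi-squared with groups - 2 degrees of freedom

# tiny example with two perfectly separated risk groups
stat = hosmer_lemeshow([0, 0, 1, 1], [0.2, 0.2, 0.8, 0.8], groups=2)
print(round(stat, 6))  # -> 1.0
```

The same computation, applied within each treatment arm, gives the per-arm test statistics reported by the software.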
Another aspect of calibration is the extent to which the treatment effect model fits well. We want to ensure that the average model-fitted treatment effects agree with the observed treatment effects within groups of subjects defined by quantiles of Δ̂(Y).
5 Comparing markers
The descriptives and summary measures proposed herein form the basis for comparing candidate markers. We assume that the two markers, Y1 and Y2, are measured on the same individuals, i.e. that the data are paired.
For drawing inference about the relative performance of two markers given paired data, CIs for the differences in performance measures and hypothesis tests of whether these differ from zero are informative. We fit separate models for the two markers and bootstrap individuals, refitting both models in each bootstrap sample so that the pairing is preserved.
The results of the comparative analysis for the breast cancer example are shown in Table 1. We can see clearly that Y2 outperforms Y1: it yields a larger decrease in the population event rate and larger average benefits in both marker-defined subgroups, with all differences except the proportion marker-negative statistically significant.
6 Extensions
6.1 General treatment rules
In some settings there may be additional consequences of treatment that are not captured in the outcome, for example treatment-associated toxicities. This means that a treatment effect somewhat above zero may still warrant no treatment because it is offset by the other consequences of treatment. In these settings the optimal treatment rule can be shown to be: assign treatment if and only if Δ(Y) > δ,
where δ > 0 is the minimum treatment benefit that outweighs the other costs and harms of treatment.
6.2 Case–control sampling
The methods described above apply to the setting where the marker is measured at baseline on all RCT participants. However when the outcome D is rare, case–control sampling from within the RCT is a well-known efficient alternative that recovers much of the information contained in the entire trial population. This section extends the methods to the setting where the data consist of a case–control sample from the RCT, or a case–control sample stratified on treatment assignment, T. We consider case–control designs that sample all or a fixed proportion of the cases in the RCT, as well as a number of controls (perhaps stratified on T) that is a fixed multiple of the number of cases sampled.
Consider first unstratified case–control sampling. Suppose the marker is measured on all (or a fixed proportion of) the cases in the trial and on a random sample of controls whose size is a fixed multiple of the number of cases sampled.
A key result is that if the logistic risk model (1) holds in the cohort, it also holds in the case–control sample with the same coefficients except for the intercept, which is shifted by the log ratio of the case and control sampling odds.
This result was originally cited by Prentice and Pyke [45] as the rationale for using logistic regression to model risk with case–control data. Note that, given the sampling fractions or the cohort event rate, the intercept can be corrected, so that absolute risks, and hence the performance measures, can be estimated.
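For example, with 1:1 case–control sampling from a cohort with a 10% event rate, the intercept can be shifted back to the cohort scale by subtracting the log ratio of sample to cohort case odds. A small arithmetic sketch with hypothetical numbers (not the package’s code):

```python
import math

def corrected_intercept(b0_cc, n_case, n_control, p_cohort):
    """Cohort-scale intercept: the case-control intercept minus
    log[(n_case / n_control) / (p_cohort / (1 - p_cohort))]."""
    sample_odds = n_case / n_control         # case odds in the sample
    cohort_odds = p_cohort / (1 - p_cohort)  # case odds in the cohort
    return b0_cc - math.log(sample_odds / cohort_odds)

# 1:1 sampling (500 cases, 500 controls), 10% cohort event rate
b0 = corrected_intercept(-0.5, 500, 500, 0.10)
print(round(b0, 4))  # -> -2.6972
```

The slope coefficients, including the marker-by-treatment interaction, require no adjustment.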
The distribution of (D, T, Y) in the underlying cohort can be recovered from the case–control sample by weighting subjects inversely by their sampling fractions; the empirical estimators of the summary measures are computed with these weights (we use a superscript to distinguish case–control sample quantities from their cohort counterparts).
We use a modified bootstrapping procedure for case–control data. To reproduce the variability in the cohort from which the case–control study is sampled, we first sample the number of cases in a bootstrap cohort of the original cohort size, and then draw the corresponding numbers of cases and controls with replacement from the observed case–control sample according to the original sampling design.
Case–control sampling stratified on treatment assignment can also be accommodated. Here we assume a cohort with known numbers of subjects in each treatment arm, from which all or a fixed proportion of the cases, and a fixed multiple of controls, are sampled within each arm.
Bootstrapping is implemented by first sampling the numbers of cases in each treatment arm of the bootstrap cohort, and then resampling with replacement within each of the four strata defined by case–control status and treatment arm.
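The within-stratum resampling step can be sketched as follows (our own illustration of the scheme, with toy data; fixed-stratum-size resampling is one reasonable variant):

```python
import numpy as np

def cc_bootstrap_indices(d, t, rng):
    """Resample a treatment-stratified case-control dataset: draw, with
    replacement, the original number of subjects within each of the four
    strata defined by case status (d) and treatment arm (t)."""
    d, t = np.asarray(d), np.asarray(t)
    idx = []
    for dv in (0, 1):
        for tv in (0, 1):
            stratum = np.flatnonzero((d == dv) & (t == tv))
            idx.append(rng.choice(stratum, size=stratum.size, replace=True))
    return np.concatenate(idx)

rng = np.random.default_rng(1)
d = np.array([0, 0, 1, 1, 0, 1])  # case indicator (toy data)
t = np.array([0, 1, 0, 1, 0, 1])  # treatment arm
boot = cc_bootstrap_indices(d, t, rng)  # stratum sizes are preserved
```

By construction every bootstrap replicate retains the original numbers of cases and controls within each treatment arm, matching the sampling design.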
For calibration assessment, we plot observed and predicted risks and treatment effects as described in Section 4.5, where all are corrected for the biased sampling as described above. We also implement a variation on the Hosmer–Lemeshow test applied to case–control data (expression (7) of Huang and Pepe [33]).
7 Discussion
This paper proposes a statistical framework for evaluating a candidate treatment selection marker and for comparing two markers. Estimation and inference techniques are described for the setting where the marker or markers are measured on all or a treatment-stratified case–control sample of participants in a randomized, controlled trial. An R software package was developed which implements these methods. Developing a solid framework for evaluating and comparing markers is fundamental for accomplishing more sophisticated tasks such as combining markers, accounting for covariates, and assessing the improvement in performance associated with adding a new marker to a set of established markers.
Our approach to marker evaluation also applies when the marker is discrete. In addition, it can be applied when there are multiple markers and interest lies in evaluating their combination; the fitted combination is simply evaluated as a single marker.
This work extends existing approaches for evaluating markers for risk prediction [25, 35, 46]. It also unifies existing methodology for evaluating treatment selection markers. In particular, our preferred marker performance measure has been advocated by Song and Pepe [13], Brinkley et al. [16], Janes et al. [6], Gunter et al. [19], Qian and Murphy [20], McKeague and Qian [21], and Zhang et al. [22].
There are challenges with making inference about the performance measures we propose, similar to problems that have been identified with measures of risk prediction model performance including the area under the ROC curve [36, 38–40], the integrated discrimination index [37], and the net reclassification index [41]. The problems may arise when the sample size is modest and marker performance is weak. In particular for the Oncotype DX example, given that the marker is weak and the primary study evaluating its performance by Albain et al. [28] included just 367 women, our simulation results suggest that the resultant estimates of marker performance may be biased and their CIs unreliable.
The methods described here can and should be extended to accommodate other types of outcomes. Extension to continuous or count outcomes is straightforward. Specifically, after replacing the event probability with the expected outcome under each treatment, the treatment effect Δ(y) and the summary measures are defined exactly as before.
The methods may also be generalized to an observational study setting, or to a setting where data on the two treatments come from two different studies – perhaps historical data are paired with a single-arm trial of the treatment.
Simulation studies
This section describes a small-scale simulation study that was performed to evaluate the statistical performance of our methods. Data were simulated to reflect the breast cancer RCT example, with T an indicator of chemotherapy in addition to tamoxifen, randomly assigned to half of study participants. Rates of 5-year breast cancer recurrence or death (D) were set to 21% and 24% with and without chemotherapy, respectively, as in SWOG S8814 [28]. We explored the performance of the methods for a weak marker and a strong marker, both of which relate to D via the linear logistic model (1). The weak marker, Y1, was designed to mimic the performance of Oncotype DX, and the strong marker, Y2, to have substantially stronger performance,
where k ensures that the marginal event rates match those observed in SWOG S8814. These models fully specify the joint distribution of (D, T, Y).
We explore the bias and variance of the parameter estimates and false-coverage probabilities of the bootstrap percentile CIs for sample sizes ranging from N = 250 to N = 5,000.
Strong marker
The results for the strong marker are contained in Tables A.1–A.3. For this marker, we see that the estimates and CIs have uniformly good performance. Marginal bias is small and false coverage is near nominal; the pre-testing has no impact because of the 100% power to reject H0.
Mean parameter estimates for the strong marker
| | N | Prob. reject | Θ Mod. | Θ Emp. | P.neg | B.neg Mod. | B.neg Emp. | B.pos Mod. | B.pos Emp. | V(Δ) | TG (0.245) |
| Marginal | 250 | 1 | 0.113 | 0.112 | 0.380 | 0.295 | 0.293 | 0.230 | 0.229 | 0.097 | 0.246 |
| 500 | 1 | 0.112 | 0.112 | 0.380 | 0.293 | 0.293 | 0.230 | 0.230 | 0.096 | 0.246 | |
| 1,000 | 1 | 0.111 | 0.111 | 0.379 | 0.292 | 0.292 | 0.229 | 0.229 | 0.095 | 0.246 | |
| 5,000 | 1 | 0.110 | 0.110 | 0.379 | 0.291 | 0.291 | 0.228 | 0.228 | 0.094 | 0.246 | |
| Conditional | 250 | 1 | 0.113 | 0.112 | 0.380 | 0.295 | 0.293 | 0.230 | 0.229 | 0.097 | 0.246 |
| 500 | 1 | 0.112 | 0.112 | 0.380 | 0.293 | 0.293 | 0.230 | 0.230 | 0.096 | 0.246 | |
| 1,000 | 1 | 0.111 | 0.111 | 0.379 | 0.292 | 0.292 | 0.229 | 0.229 | 0.095 | 0.246 | |
| 5,000 | 1 | 0.110 | 0.110 | 0.379 | 0.291 | 0.291 | 0.228 | 0.228 | 0.094 | 0.246 | |
Notes: “Mod.” and “Emp.” denote the model-based and empirical estimators of each measure; the true value of TG appears in parentheses in the heading.
False-coverage results for the strong marker
| | N | Prob. reject | Θ Mod. | Θ Emp. | P.neg | B.neg Mod. | B.neg Emp. | B.pos Mod. | B.pos Emp. | V(Δ) | TG |
| Marg. false cov. | 250 | 1 | 0.059 | 0.045 | 0.052 | 0.056 | 0.030 | 0.051 | 0.034 | 0.056 | 0.056 |
| 500 | 1 | 0.054 | 0.043 | 0.053 | 0.050 | 0.031 | 0.050 | 0.038 | 0.054 | 0.053 | |
| 1,000 | 1 | 0.056 | 0.055 | 0.051 | 0.049 | 0.044 | 0.047 | 0.044 | 0.055 | 0.055 | |
| 5,000 | 1 | 0.055 | 0.051 | 0.055 | 0.056 | 0.048 | 0.052 | 0.049 | 0.053 | 0.056 | |
| Cond. false cov. | 250 | 1 | 0.059 | 0.045 | 0.052 | 0.056 | 0.030 | 0.051 | 0.034 | 0.056 | 0.056 |
| 500 | 1 | 0.054 | 0.043 | 0.053 | 0.050 | 0.031 | 0.050 | 0.038 | 0.054 | 0.053 | |
| 1,000 | 1 | 0.056 | 0.055 | 0.051 | 0.049 | 0.044 | 0.047 | 0.044 | 0.055 | 0.055 | |
| 5,000 | 1 | 0.055 | 0.051 | 0.055 | 0.056 | 0.048 | 0.052 | 0.049 | 0.053 | 0.056 | |
| False concl. | 250 | 1 | 0.059 | 0.045 | 0.052 | 0.056 | 0.030 | 0.051 | 0.034 | 0.056 | 0.056 |
| 500 | 1 | 0.054 | 0.043 | 0.053 | 0.050 | 0.031 | 0.050 | 0.038 | 0.054 | 0.053 | |
| 1,000 | 1 | 0.056 | 0.055 | 0.051 | 0.049 | 0.044 | 0.047 | 0.044 | 0.055 | 0.055 | |
| 5,000 | 1 | 0.055 | 0.051 | 0.055 | 0.056 | 0.048 | 0.052 | 0.049 | 0.053 | 0.056 | |
Notes: “Mod.” and “Emp.” denote the model-based and empirical estimators of each measure.
Empirical standard deviations of parameter estimates for the strong marker
| N | Θ Mod. | Θ Emp. | P.neg | B.neg Mod. | B.neg Emp. | B.pos Mod. | B.pos Emp. | V(Δ) | TG (0.245) |
| 250 | 0.031 | 0.035 | 0.072 | 0.057 | 0.072 | 0.045 | 0.05 | 0.03 | 0.042 |
| 500 | 0.022 | 0.024 | 0.051 | 0.039 | 0.048 | 0.031 | 0.036 | 0.021 | 0.029 |
| 1,000 | 0.016 | 0.018 | 0.036 | 0.028 | 0.035 | 0.022 | 0.025 | 0.015 | 0.021 |
| 5,000 | 0.007 | 0.008 | 0.016 | 0.012 | 0.015 | 0.01 | 0.011 | 0.007 | 0.009 |
Notes: “Mod.” and “Emp.” denote the model-based and empirical estimators of each measure; the true value of TG appears in parentheses in the heading.
Weak marker
The results for the weak marker are contained in Tables A.4–A.6. With this marker, the power to reject H0 is modest at small sample sizes. Conditional on rejecting H0, the estimates are biased away from the null and conditional false coverage exceeds the nominal level at N = 250; both problems diminish as the sample size, and hence the power, increases.
Mean parameter estimates for the weak marker
| | N | Prob. reject | Θ Mod. | Θ Emp. | P.neg | B.neg Mod. | B.neg Emp. | B.pos Mod. | B.pos Emp. | V(Δ) | TG (0.050) |
| Marginal | 250 | 0.217 | 0.022 | 0.022 | 0.423 | 0.036 | 0.036 | 0.090 | 0.090 | 0.009 | 0.060 |
| 500 | 0.364 | 0.016 | 0.015 | 0.410 | 0.027 | 0.026 | 0.080 | 0.080 | 0.007 | 0.055 | |
| 1,000 | 0.63 | 0.013 | 0.013 | 0.405 | 0.024 | 0.024 | 0.076 | 0.076 | 0.006 | 0.054 | |
| 5,000 | 0.999 | 0.010 | 0.010 | 0.426 | 0.022 | 0.022 | 0.073 | 0.073 | 0.005 | 0.053 | |
| Conditional | 250 | 0.217 | 0.042 | 0.041 | 0.547 | 0.071 | 0.069 | 0.159 | 0.154 | 0.022 | 0.112 |
| 500 | 0.364 | 0.026 | 0.025 | 0.509 | 0.046 | 0.044 | 0.117 | 0.117 | 0.013 | 0.084 | |
| 1,000 | 0.630 | 0.017 | 0.017 | 0.473 | 0.032 | 0.032 | 0.091 | 0.090 | 0.008 | 0.066 | |
| 5,000 | 0.999 | 0.010 | 0.010 | 0.426 | 0.022 | 0.022 | 0.073 | 0.073 | 0.005 | 0.053 | |
Notes: “Mod.” and “Emp.” denote the model-based and empirical estimators of each measure; the true value of TG appears in parentheses in the heading.
False-coverage results for the weak marker
| | N | Prob. reject | Θ Mod. | Θ Emp. | P.neg | B.neg Mod. | B.neg Emp. | B.pos Mod. | B.pos Emp. | V(Δ) | TG |
| Marg. false cov. | 250 | 0.217 | 0.054 | 0.030 | 0.059 | 0.053 | 0.023 | 0.063 | 0.026 | 0.034 | 0.035 |
| 500 | 0.364 | 0.043 | 0.021 | 0.052 | 0.034 | 0.014 | 0.048 | 0.022 | 0.029 | 0.030 | |
| 1,000 | 0.630 | 0.050 | 0.026 | 0.055 | 0.034 | 0.015 | 0.047 | 0.018 | 0.052 | 0.051 | |
| 5,000 | 0.999 | 0.057 | 0.037 | 0.058 | 0.058 | 0.020 | 0.058 | 0.036 | 0.060 | 0.058 | |
| Cond. false cov. | 250 | 0.217 | 0.162 | 0.089 | 0.102 | 0.190 | 0.074 | 0.248 | 0.088 | 0.158 | 0.161 |
| 500 | 0.364 | 0.083 | 0.043 | 0.065 | 0.086 | 0.032 | 0.121 | 0.057 | 0.081 | 0.083 | |
| 1,000 | 0.630 | 0.045 | 0.028 | 0.043 | 0.047 | 0.023 | 0.061 | 0.028 | 0.046 | 0.044 | |
| 5,000 | 0.999 | 0.056 | 0.037 | 0.058 | 0.057 | 0.020 | 0.057 | 0.035 | 0.058 | 0.057 | |
| Marg. false concl. | 250 | 0.217 | 0.035 | 0.019 | 0.022 | 0.041 | 0.016 | 0.054 | 0.019 | 0.034 | 0.035 |
| 500 | 0.364 | 0.030 | 0.016 | 0.024 | 0.031 | 0.012 | 0.044 | 0.021 | 0.029 | 0.030 | |
| 1,000 | 0.630 | 0.028 | 0.018 | 0.027 | 0.030 | 0.014 | 0.038 | 0.018 | 0.029 | 0.028 | |
| 5,000 | 0.999 | 0.056 | 0.037 | 0.058 | 0.057 | 0.020 | 0.057 | 0.035 | 0.058 | 0.057 | |
Notes: “Mod.” and “Emp.” denote the model-based and empirical estimators of each measure.
Empirical standard deviations of parameter estimates for the weak marker
| N | Θ Mod. | Θ Emp. | P.neg | B.neg Mod. | B.neg Emp. | B.pos Mod. | B.pos Emp. | V(Δ) | TG (0.050) |
| 250 | 0.025 | 0.029 | 0.309 | 0.032 | 0.09 | 0.051 | 0.083 | 0.009 | 0.037 |
| 500 | 0.017 | 0.02 | 0.27 | 0.023 | 0.061 | 0.037 | 0.065 | 0.006 | 0.028 |
| 1,000 | 0.012 | 0.014 | 0.22 | 0.017 | 0.045 | 0.027 | 0.033 | 0.004 | 0.021 |
| 5,000 | 0.005 | 0.007 | 0.106 | 0.008 | 0.013 | 0.013 | 0.015 | 0.002 | 0.01 |
Notes: “Mod.” and “Emp.” denote the model-based and empirical estimators of each measure; the true value of TG appears in parentheses in the heading.
Software
We developed a package in the open-source software R called TreatmentSelection that implements our methods for evaluating individual markers and for comparing markers. The software is available at http://labs.fhcrc.org/janes/index.html. The following functions are included.
- –trtsel creates a treatment selection object
- –eval.trtsel evaluates a treatment selection object, producing estimates and CIs for the summary measures described in Section 4.3
- –plot.trtsel plots a treatment selection object, producing risk curves and the treatment effect curve described in Section 4.2
- –calibrate.trtsel assesses the calibration of a fitted risk model and treatment effect model using methods described in Section 4.5
- –compare.trtsel compares two markers using methods described in Section 5
Here we illustrate use of the code by showing how the results shown in Figures 1–3 and Table 1 of the main text are produced. First we load the data using the following commands.
    > library(TreatmentSelection)
    > data(tsdata)
    > tsdata[1:10, ]
       trt event      Y1      Y2
    1    1     1 39.9120 -0.8535
    2    1     0  6.6820  0.2905
    3    1     0  6.5820  0.0800
    4    0     0  1.3581  1.1925
    5    0     0  7.6820 -0.2070
    6    0     0 41.1720 -0.0880
    7    1     0 19.4920  0.1670
    8    1     1 20.8220 -1.0485
    9    0     0  6.9620 -0.2435
    10   0     0  2.5020  0.2030
Treatment selection objects are created and displayed for Y1 and Y2 using the commands
    > trtsel.Y1 <- trtsel(event = "event", trt = "trt", marker = "Y1",
    +                     data = tsdata, study.design = "randomized cohort")
    > trtsel.Y1
    Study design: randomized cohort

    Model Fit:
     Link function: logit

     Coefficients:
                    Estimate  Std. Error    z value     Pr(>|z|)
    (Intercept) -2.51814383 0.235642511 -10.686288 1.179991e-26
    trt          0.48938620 0.311762857   1.569739 1.164759e-01
    marker       0.04760056 0.006453791   7.375597 1.636104e-13
    trt:marker  -0.02318881 0.008324063  -2.785756 5.340300e-03

    Derived Data: (first ten rows)
       event trt  marker fittedrisk.t0 fittedrisk.t1    trt.effect marker.neg
    1      1   1 39.9120    0.35016583     0.2583742  0.0917916549          0
    2      0   1  6.6820    0.09974358     0.1340472 -0.0343036269          1
    3      0   1  6.5820    0.09931697     0.1337641 -0.0344471266          1
    4      0   0  1.3581    0.07918316     0.1196652 -0.0404820847          1
    5      0   0  7.6820    0.10410005     0.1369063 -0.0328062456          1
    6      0   0 41.1720    0.36393311     0.2643117  0.0996213622          0
    7      0   1 19.4920    0.16933976     0.1746644 -0.0053246137          1
    8      1   1 20.8220    0.17843231     0.1793943 -0.0009620341          1
    9      0   0  6.9620    0.10094678     0.1348426 -0.0338958439          1
    10     0   0  2.5020    0.08324538     0.1226384 -0.0393929781          1

    > trtsel.Y2 <- trtsel(event = "event", trt = "trt", marker = "Y2",
    +                     data = tsdata, study.design = "randomized cohort")
    > trtsel.Y2
    Study design: randomized cohort

    Model Fit:
     Link function: logit

     Coefficients:
                  Estimate Std. Error    z value     Pr(>|z|)
    (Intercept) -1.2107912  0.1131642 -10.699416 1.024216e-26
    trt         -0.5169008  0.1863643  -2.773604 5.543912e-03
    marker       0.5779172  0.1148643   5.031305 4.871514e-07
    trt:marker  -2.0455033  0.2064547  -9.907756 3.851994e-23

    Derived Data: (first ten rows)
       event trt  marker fittedrisk.t0 fittedrisk.t1   trt.effect marker.neg
    1      1   1 -0.8535     0.1539379    0.38340813 -0.229470242          1
    2      0   1  0.2905     0.2605896    0.10395563  0.156633982          0
    3      0   1  0.0800     0.2378401    0.13644937  0.101390712          0
    4      0   0  1.1925     0.3724723    0.02995087  0.342521474          0
    5      0   0 -0.2070     0.2090899    0.19405065  0.015039232          0
    6      0   0 -0.0880     0.2206903    0.16818515  0.052505186          0
    7      0   1  0.1670     0.2470740    0.12209072  0.124983277          0
    8      1   1 -1.0485     0.1398258    0.45290799 -0.313082172          1
    9      0   0 -0.2435     0.2056229    0.20256576  0.003057187          0
    10     0   0  0.2030     0.2509647    0.11653995  0.134424710          0
The descriptives shown in Figure 1 are produced using
    > plot.trtsel(trtsel.Y1, main = "Y1: Oncotype-DX-like marker",
    +             bootstraps = 500, trt.names = c("chemo.", "no chemo."))
    > plot.trtsel(trtsel.Y2, main = "Y2: Strong marker",
    +             bootstraps = 500, trt.names = c("chemo.", "no chemo."))
Calibration is assessed and displayed as shown in Figure 3 using
    > cali.Y1 <- calibrate.trtsel(trtsel.Y1)
    > cali.Y1

    Hosmer-Lemeshow test for model calibration
    ------------------------------------------
    No Treatment (trt = 0):
     Test Statistic = 4.496, DF = 8, p value = 0.8098813
    Treated (trt = 1):
     Test Statistic = 4.986, DF = 8, p value = 0.7591213

    > cali.Y2 <- calibrate.trtsel(trtsel.Y2)
    > cali.Y2

    Hosmer-Lemeshow test for model calibration
    ------------------------------------------
    No Treatment (trt = 0):
     Test Statistic = 8.896, DF = 8, p value = 0.3511235
    Treated (trt = 1):
     Test Statistic = 2.868, DF = 8, p value = 0.9423597

    > calibrate.trtsel(trtsel.Y1, plot.type = "risk.t0")
    > calibrate.trtsel(trtsel.Y2, plot.type = "risk.t0")
    > calibrate.trtsel(trtsel.Y1, plot.type = "risk.t1")
    > calibrate.trtsel(trtsel.Y2, plot.type = "risk.t1")
    > calibrate.trtsel(trtsel.Y1, plot.type = "treatment effect")
    > calibrate.trtsel(trtsel.Y2, plot.type = "treatment effect")
The summary measure estimates and CIs shown in Table 1 are obtained by
    > eval.Y1 <- eval.trtsel(trtsel.Y1, bootstraps = 500)
    > eval.Y1

    Hypothesis test:
    ----------------
     H0: No marker-by-treatment interaction
        P value = 0.00534
        Z statistic = -2.786

    Summary Measure Estimates (with 95% confidence intervals)
    ---------------------------------------------------------
     Decrease in event rate under marker-based treatment (Theta)
      Empirical:    0.013 (-0.01, 0.044)
      Model Based:  0.01 (0, 0.038)

     Proportion marker-negative: 0.461 (0, 0.717)
     Proportion marker-positive: 0.539 (0.283, 1)

     Average benefit of no treatment among marker-negatives (B.neg)
      Empirical:    0.029 (-0.07, 0.075)
      Model Based:  0.023 (0, 0.059)

     Average benefit of treatment among marker-positives (B.pos)
      Empirical:    0.089 (0.014, 0.15)
      Model Based:  0.098 (0.04, 0.146)

     Variance in estimated treatment effect: 0.007 (0.001, 0.017)
     Total Gain: 0.066 (0.026, 0.1)

     Marker positivity threshold: 21.082

    Event Rates:
    ------------
                  Treat none     Treat all    Marker-based Treatment
     Empirical:        0.251         0.217         0.204
                (0.210,0.291) (0.182,0.251) (0.171,0.241)
     Model Based:      0.257         0.214         0.204
                (0.217,0.295) (0.179,0.248) (0.169,0.232)

    > eval.Y2 <- eval.trtsel(trtsel.Y2, bootstraps = 500)
    > eval.Y2

    Hypothesis test:
    ----------------
     H0: No marker-by-treatment interaction
        P value = 0
        Z statistic = -9.908

    Summary Measure Estimates (with 95% confidence intervals)
    ---------------------------------------------------------
     Decrease in event rate under marker-based treatment (Theta)
      Empirical:    0.09 (0.064, 0.122)
      Model Based:  0.099 (0.074, 0.128)

     Proportion marker-negative: 0.377 (0.306, 0.467)
     Proportion marker-positive: 0.623 (0.533, 0.694)

     Average benefit of no treatment among marker-negatives (B.neg)
      Empirical:    0.238 (0.173, 0.304)
      Model Based:  0.262 (0.211, 0.315)

     Average benefit of treatment among marker-positives (B.pos)
      Empirical:    0.203 (0.157, 0.266)
      Model Based:  0.211 (0.171, 0.259)

     Variance in estimated treatment effect: 0.08 (0.057, 0.108)
     Total Gain: 0.224 (0.187, 0.262)

     Marker positivity threshold: -0.258

    Event Rates:
    ------------
                  Treat none     Treat all    Marker-based Treatment
     Empirical:        0.251         0.217         0.128
                (0.215,0.290) (0.186,0.252) (0.096,0.155)
     Model Based:      0.245         0.212         0.113
                (0.210,0.282) (0.180,0.245) (0.090,0.135)
The markers are compared based on summary measures, and visually (as in Figure 2) using
    > mycompare <- compare.trtsel(trtsel1 = trtsel.Y1, trtsel2 = trtsel.Y2,
    +                             bootstraps = 500, plot = TRUE, main = "",
    +                             marker.names = c("Y1", "Y2"))
    > mycompare

    Summary Measure Estimates (with 95% confidence intervals)
                       marker 1     |    marker 2     |  difference (p-value)
    -----------------------------------------------------------------------------
    Decrease in event rate under marker-based treatment (Theta)
     Empirical:         0.013       |     0.090       |  -0.076 (< 0.002)
                  (-0.007,0.049)    | (0.062,0.124)   | (-0.111,-0.043)
     Model Based:       0.010       |     0.099       |  -0.088 (< 0.002)
                  (0.000,0.039)     | (0.070,0.130)   | (-0.112,-0.058)

    Proportion marker negative:
                        0.461       |     0.377       |   0.084 (0.664)
                  (0.000,0.707)     | (0.305,0.471)   | (-0.360,0.258)

    Average benefit of no treatment among marker-negatives (B.neg)
     Empirical:         0.029       |     0.238       |  -0.209 (< 0.002)
                  (-0.071,0.082)    | (0.176,0.307)   | (-0.331,-0.132)
     Model Based:       0.023       |     0.262       |  -0.239 (< 0.002)
                  (0.000,0.061)     | (0.205,0.308)   | (-0.288,-0.169)

    Average benefit of treatment among marker-positives (B.pos)
     Empirical:         0.089       |     0.203       |  -0.114 (0.002)
                  (0.007,0.161)     | (0.162,0.266)   | (-0.201,-0.035)
     Model Based:       0.098       |     0.211       |  -0.113 (< 0.002)
                  (0.042,0.162)     | (0.175,0.259)   | (-0.177,-0.038)

    Variance in estimated treatment effect:
                        0.007       |     0.080       |  -0.073 (< 0.002)
                  (0.001,0.021)     | (0.054,0.110)   | (-0.103,-0.043)

    Total Gain:         0.066       |     0.224       |  -0.158 (< 0.002)
                  (0.027,0.111)     | (0.181,0.266)   | (-0.214,-0.091)
If instead the dataset contains D, T, and Y for a case–control sample from the trial, the attributes of the parent cohort are supplied when creating the treatment selection object:
    > cctrtsel.Y1 <- trtsel(event = "event", trt = "trt", marker = "Y1",
    +                       data = tsdata,
    +                       cohort.attributes = c(N, Rand.frac, Risk.cohort),
    +                       study.design = "nested case-control")
References
- 1. Simon R. Lost in translation: problems and pitfalls in translating laboratory observations to clinical utility. Eur J Cancer 2008;44:2707–13.
- 2. Gunter L, Zhu J, Murphy S. Variable selection for optimal decision making. In: Proceedings of the 11th Conference on Artificial Intelligence in Medicine. Springer Verlag, 2007.
- 3. Coates AS, Millar EK, O'Toole SA, Molloy TJ, Viale G, Goldhirsch A, et al. Prognostic interaction between expression of p53 and estrogen receptor in patients with node-negative breast cancer: results from IBCSG Trials VIII and IX. Breast Cancer Res 2012;14:R143.
- 4. Busch S, Ryden L, Stal O, Jirstrom K, Landberg G. Low ERK phosphorylation in cancer-associated fibroblasts is associated with tamoxifen resistance in pre-menopausal breast cancer. PLoS One 2012;7:e45669.
- 5. Malmstrom A, Gronberg BH, Marosi C, Stupp R, Frappaz D, Schultz H, et al., and the Nordic Clinical Brain Tumour Study Group (NCBTSG). Temozolomide versus standard 6-week radiotherapy versus hypofractionated radiotherapy in patients older than 60 years with glioblastoma: the Nordic randomised, phase 3 trial. Lancet Oncol 2012;13:916–26.
- 6. Janes H, Pepe MS, Bossuyt PM, Barlow WE. Measuring the performance of markers for guiding treatment decisions. Ann Intern Med 2011;154:253–9.
- 7. Huang Y, Gilbert PB, Janes H. Assessing treatment-selection markers using a potential outcomes framework. Biometrics 2012;68:687–96.
- 8. Bonetti M, Gelber RD. Patterns of treatment effects in subsets of patients in clinical trials. Biostatistics 2004;5:465–81.
- 9. Royston P, Sauerbrei W. A new approach to modelling interactions between treatment and continuous covariates in clinical trials by using fractional polynomials. Stat Med 2004;23:2509–25.
- 10. Cai T, Tian L, Wong P, Wei L. Analysis of randomized comparative clinical trial data for personalized treatment selections. Biostatistics 2011;12:270–82.
- 11. Claggett B, Zhao L, Tian L, Castagno D, Wei LJ. Estimating subject-specific treatment differences for risk-benefit assessment with competing risk event-time data. Harvard University Biostatistics Working Paper Series 125, 2011.
- 12. Zhao L, Tian L, Cai T, Claggett B, Wei LJ. Effectively selecting a target population for a future comparative study. Harvard University Biostatistics Working Paper Series, 2011.
- 13. Song X, Pepe MS. Evaluating markers for selecting a patient's treatment. Biometrics 2004;60:874–83.
- 14. Baker S, Kramer B. Statistics for weighing benefits and harms in a proposed genetic substudy of a randomized cancer prevention trial. J R Stat Soc Ser C (Appl Stat) 2005;54:941–54.
- 15. Vickers AJ, Kattan MW, Sargent D. Method for evaluating prediction models that apply the results of randomized trials to individual patients. Trials 2007;8:14.
- 16. Brinkley J, Tsiatis AA, Anstrom KJ. A generalized estimator of the attributable benefit of an optimal treatment regime. Biometrics 2010;66:512–22.
- 17. Lu W, Zhang HH, Zeng D. Variable selection for optimal treatment decision. Stat Methods Med Res 2012;22(5):493–504.
- 18. Foster JC, Taylor JM, Ruberg SJ. Subgroup identification from randomized clinical trial data. Stat Med 2011;30:2867–80.
- 19. Gunter L, Zhu J, Murphy S. Variable selection for qualitative interactions in personalized medicine while controlling the family-wise error rate. J Biopharm Stat 2011;21:1063–78.
- 20. Qian M, Murphy S. Performance guarantees for individualized treatment rules. Ann Stat 2011;39:1180–210.
- 21. McKeague IW, Qian M. Evaluation of treatment policies via sparse functional linear regression. Stat Sin 2013.
- 22. Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics 2012;68(4):1010–8.
- 23. Pepe MS. The statistical evaluation of medical tests for classification and prediction. Oxford, UK: Oxford University Press, 2003.
- 24. Zhou X-H, McClish DK, Obuchowski NA. Statistical methods in diagnostic medicine. New York: Wiley, 2002.
- 25. Pepe MS, Janes H. Methods for evaluating prediction performance of biomarkers and tests. UW Biostatistics Working Paper Series 384, 2012.
- 26. Early Breast Cancer Trialists' Collaborative Group. Effects of chemotherapy and hormonal therapy for early breast cancer on recurrence and 15-year survival: an overview of the randomized trials. Lancet 2005;365:1687–717.
- 27. Dowsett M, Goldhirsch A, Hayes DF, Senn HJ, Wood W, Viale G. International web-based consultation on priorities for translational breast cancer research. Breast Cancer Res 2007;9:R81.
- 28. Albain KS, Barlow WE, Shak S, Hortobagyi GN, Livingston RB, Yeh IT. Prognostic and predictive value of the 21-gene recurrence score assay in postmenopausal women with node-positive, oestrogen-receptor-positive breast cancer on chemotherapy: a retrospective analysis of a randomized trial. Lancet Oncol 2010;11:55–65.
- 29. Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med 2004;351:2817–26.
- 30. Paik S, Tang G, Shak S, Kim C, Baker J, Kim W, et al. Gene expression and benefit of chemotherapy in women with node-negative, estrogen-receptor-positive breast cancer. J Clin Oncol 2006;24:3726–34.
- 31. Albain KS, Barlow WE, Ravdin PM, Farrar WB, Burton GV, Ketchel SJ, et al., and the Breast Cancer Intergroup of North America. Adjuvant chemotherapy and timing of tamoxifen in postmenopausal patients with endocrine-responsive, node-positive breast cancer: a phase 3, open-label, randomized controlled trial. Lancet 2009;374:2055–63.
- 32. Janes H, Pepe MS, Huang Y. A framework for evaluating markers used to select patient treatment. Med Decis Making 2014;34(2):159–67.
- 33. Huang Y, Pepe M. Semiparametric methods for evaluating the covariate-specific predictiveness of continuous markers in matched case-control studies. J R Stat Soc Ser C (Appl Stat) 2010;59:437–56.
- 34. Huang Y, Pepe MS. Assessing risk prediction models in case-control studies using semiparametric and nonparametric methods. Stat Med 2010;29:1391–410.
- 35. Huang Y, Sullivan Pepe M, Feng Z. Evaluating the predictiveness of a continuous marker. Biometrics 2007;63:1181–8.
- 36. Vickers AJ, Cronin AM, Begg CB. One statistical test is sufficient for assessing new predictive markers. BMC Med Res Methodol 2011;11:13.
- 37. Kerr KF, McClelland RL, Brown ER, Lumley T. Evaluating the incremental value of new biomarkers with integrated discrimination improvement. Am J Epidemiol 2011;174:364–74.
- 38. Pepe M, Kerr K, Longton G, Wang Z. Testing for improvement in prediction model performance. UW Biostatistics Working Paper Series 379, 2011.
- 39. Seshan VE, Gonen M, Begg CB. Comparing ROC curves derived from regression models. Stat Med 2013;32(9):1483–93.
- 40. Demler OV, Pencina MJ, D'Agostino RB. Misuse of DeLong test to compare AUCs for nested models. Stat Med 2012;31:2577–87.
- 41. Kerr KF, Wang Z, Janes H, McClelland RL, Psaty BM, Pepe MS. Net reclassification indices for evaluating risk prediction instruments: a critical review. Epidemiology 2014;25(1):114–21.
- 42. Gail M, Simon R. Testing for qualitative interactions between treatment effects and patient subsets. Biometrics 1985;41:361–72.
- 43. Shuster J, van Eys J. Interaction between prognostic factors and treatment. Control Clin Trials 1983;4:209–14.
- 44. Lemeshow S, Hosmer DW. A review of goodness of fit statistics for use in the development of logistic regression models. Am J Epidemiol 1982;115:92–106.
- 45. Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika 1979;66:403–11.
- 46. Gu W, Pepe M. Measures to summarize and compare the predictive capacity of markers. Int J Biostat 2009;5:Article 27. Available at: http://dx.doi.org/10.2202/1557-4679.1188
- 47. Benjamini Y, Yekutieli D. False discovery rate-adjusted multiple confidence intervals for selected parameters. J Am Stat Assoc 2005;100:71–81.



