Patients often differ in their response to treatment, and characterizing this variation is crucial for the development of evidence-based, personalized treatment plans. In practice, treatments may be costly or may pose harm to patients (e.g. through adverse side effects or drug toxicity) and clinicians must balance treatment recommendations with each patient’s probability of response. Thus, there is considerable interest in the development and refinement of statistical methods capable of identifying patients with high versus low average treatment effect. For example, a recent randomized controlled trial in psychiatry evaluated the efficacy of citalopram for reducing agitation in patients with probable Alzheimer’s disease . Although the estimated average treatment effect in the trial was positive, an adverse cardiac event occurred in a small proportion of people, and the treatment was associated with slight cognitive worsening. Additionally, only 40% of participants assigned to citalopram had a moderate or marked response compared to 26% of those assigned to placebo, and thus it would clearly be desirable to identify strong predictors of response. In this setting, the preferred clinical goal is to target the treatment to patients who are predicted to experience a large clinical benefit. In addition to providing practical recommendations regarding who should be targeted for treatment, identifying patients whose response to citalopram is large could help clarify the biological mechanisms for citalopram’s action in this population.
Several approaches have been employed to estimate heterogeneity in treatment effects in the setting of randomized controlled trials. One general approach is to posit outcome regression models in which the effect of treatment assignment on response can differ depending on baseline covariates. A major limitation of this approach is that the posited outcome regression model may be misspecified. Zhang et al. (see also Zhao et al., Rubin and van der Laan) adapt this regression framework and develop a robust method for identifying an optimal treatment regime, which, when followed, maximizes the empirical treatment effect in the study population. However, this optimal treatment regime does not necessarily identify highly benefited patients; indeed, it assigns treatment to a patient even when their expected treatment effect is small, as long as it is positive. In addition, one cannot directly adapt Zhang et al.’s method to identify highly respondent subgroups of patients, for the following reason. That method maximizes the empirical treatment effect in the entire study population. If instead the goal is to maximize the treatment effect over particular subsets of patients, there will almost always be some small subsets that appear to achieve a treatment effect higher than any chosen threshold. Parameter estimation in this setting is therefore ill-defined, because it reduces to selecting the subgroup with the highest estimated treatment effect, regardless of the size of this subgroup. This issue illustrates that a balance must be struck between the magnitude of the treatment effect in a particular subgroup and the number of patients in that subgroup.
Cai et al. proposed an alternative method for estimating heterogeneity in the treatment effect. In a two-stage approach, the first stage posits a working regression model (fitted by maximum likelihood, for example) and estimates each subject’s model-based expected response under each treatment arm; the subject’s model-based effect is then estimated as the difference between the two estimates. In the second stage, the approach uses the model-based effect estimate as a scalar index score for grouping patients. Then, a local likelihood approach is used to obtain non-parametric estimates of the treatment effect within each stratum of the index score. This approach produces consistent estimates of the treatment effect within strata defined by the estimated regression model. However, because the working model in the first stage of the procedure may be misspecified, maximum likelihood or ordinary least squares estimators of the model parameters may not be the best way (even in large samples) to characterize the largest possible subgroup whose empirical treatment effect is greater than some pre-specified threshold.
In this paper we propose a method that characterizes large subgroups who experience a large treatment effect. Section 2 formulates the goal and further reviews the existing approaches. Section 3 develops the new approach. The essence of this approach is that it connects the estimation of parameters from the working model directly to the clinical goal – to identify large subgroups that experience a large empirical treatment effect. We show theoretically, and also by application to the CitAD trial throughout, that the proposed approach characterizes different highly benefited groups that can be much larger than those characterized by the existing approach. Section 4 concludes with remarks.
2 Goal and motivating background
2.1 Problem and limitations of existing methods
For the general framework, consider a study of a random sample of individuals from a population and for each of whom we can measure a vector of covariates , which we assume have finite although possibly many levels. Each individual can be assigned a standard treatment , in which case we would measure a potential outcome , or a new treatment , in which case we would measure a potential outcome . Actual assignment is assigned at random, that is, is independent of , and then the outcome corresponding to the actual assignment is observed. Based on the information of the study, the overall population average potential outcome can be estimated without further assumptions by the sample analogue of the average observed outcomes among those assigned .
Even if the new treatment is the best (on average, or for a particular patient, Zhang et al. ), its effect may be small and its administration associated with burden or adverse effects. Then, for subsequent clinical practice, physicians may wish to only give the new treatment to patients for whom the above study suggests the effect is large enough. To do this, for example, in the psychiatric trial we discuss in Section 2.2, the physicians wanted to characterize a subgroup of patients based on covariates, for whom the treatment effect is, on average, greater than a chosen clinically important value, say . Taking here the absolute difference as the causal effect of interest, the physicians’ goal is as follows:(1)
If it is possible to estimate well the conditional for all without further assumptions, then the goal eq. (1) is easily addressable. To see this, consider, for any indicator function , the quantity . We prove the following result in the Appendix.
Result 1. Among all indicator functions such that the indicator that maximizes the size is of the form
where is a constant determined by , provided that such a exists.
In other words, the largest group satisfying eq. (1) is and is obtained if we start including in the group patients from the larger down to the smaller values of the conditional , and stop when including the covariate with the next smallest value of in would first produce an average effect .
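The cumulation described in Result 1 can be illustrated with a minimal Python sketch. It assumes, hypothetically, that the probability and the true conditional effect of each covariate level are known; the function name `largest_benefited_set`, the `(label, probability, effect)` input layout, and the toy numbers are illustrative assumptions, not part of the original analysis.

```python
def largest_benefited_set(levels, delta):
    """Greedy construction of the largest set with average effect >= delta.

    levels: list of (label, probability, conditional_effect) tuples,
            one per covariate level (hypothetical input layout).
    Returns (chosen_labels, set_probability, average_effect_in_set).
    """
    # Visit levels from the largest to the smallest conditional effect.
    ordered = sorted(
        (lv for lv in levels if lv[1] > 0), key=lambda lv: lv[2], reverse=True
    )
    chosen, mass, weighted = [], 0.0, 0.0
    for label, prob, effect in ordered:
        new_mass = mass + prob
        new_weighted = weighted + prob * effect
        # Stop as soon as adding the next level would pull the
        # average effect in the set below the threshold delta.
        if new_weighted / new_mass < delta:
            break
        chosen.append(label)
        mass, weighted = new_mass, new_weighted
    average = weighted / mass if mass > 0 else None
    return chosen, mass, average


if __name__ == "__main__":
    toy = [("a", 0.2, 0.5), ("b", 0.3, 0.3), ("c", 0.3, 0.1), ("d", 0.2, -0.1)]
    print(largest_benefited_set(toy, delta=0.3))
```

In the toy input, levels "a" and "b" are included (average effect about 0.38 on half of the population), and adding "c" would drop the average below 0.3, so the cumulation stops there.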
More realistically, when the levels of are many, the conditional effects are not estimable without further assumptions, and the above direct approach is not feasible. An existing approach  mirrors the theoretical approach using a working model (see Figure 1, first two columns). Specifically, here the existing approach in a first stage fits a parametric working model (which may not be correct): , by random assignment), by the MLE or a solution to another standard estimating equation. Based on this fit, the approach obtains an initial, model-based estimate of the effect using(2)
This approach can attempt to approximate goal eq. (1) by mimicking the theoretical solution given above, as follows: first, sort the covariates by the values of estimated effects, ; then, start creating the set by cumulating from larger to smaller values of and close the set when the empirical (non-parametric) estimated effect (difference in sample averages of treated minus control) in that set would stop being . This gives(3)
such that the empirical treatment effect in the set is at least . By largest-fraction set we mean a set that has the largest probability based on the empirical distribution of in the study.
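The second stage of this existing approach can be sketched as follows. This is a minimal illustration under stated assumptions: the data are a list of hypothetical `(x, z, y)` records (covariate level, assignment 0/1, outcome), the first-stage model-based effect estimates are supplied as a dictionary `score`, and the sketch reads "largest-fraction set" as the largest of the nested sets `{x : score[x] >= c}` whose empirical effect is at least the threshold. Stopping at the first cutoff that fails the threshold is an alternative reading of the description above.

```python
def empirical_effect(records, members):
    """Treated-minus-control difference in mean outcome among records whose
    covariate level is in `members`; None if either arm is empty."""
    treated = [y for x, z, y in records if x in members and z == 1]
    control = [y for x, z, y in records if x in members and z == 0]
    if not treated or not control:
        return None
    return sum(treated) / len(treated) - sum(control) / len(control)


def largest_set_by_score(records, score, delta):
    """Scan the nested sets {x : score[x] >= c} over the observed score
    values c and return the largest one whose empirical effect is >= delta.

    Returns (members, fraction_of_sample, empirical_effect);
    (set(), 0.0, None) if no cutoff qualifies.
    """
    n = len(records)
    levels = {x for x, _, _ in records}
    cutoffs = sorted({score[x] for x in levels}, reverse=True)
    best = (set(), 0.0, None)
    for c in cutoffs:
        members = {x for x in levels if score[x] >= c}
        effect = empirical_effect(records, members)
        if effect is None or effect < delta:
            continue
        fraction = sum(1 for x, _, _ in records if x in members) / n
        if fraction > best[1]:
            best = (members, fraction, effect)
    return best
```

Because the effect inside the set is estimated non-parametrically at this stage, the score only determines the order in which covariate levels enter the set, which is why a misspecified working model can still yield a set with the desired effect.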
A useful property of this approach, resulting from the empirical estimation at the second stage, is that the effect among the estimated highly benefited set in eq. (3) is approximately the desired clinical effect , even if the working model is incorrect. Specifically,  show that, allowing for the working model to be incorrect, the estimator will converge to a value, say , and the set will converge to
such that the effect within the set is at least . Therefore, the empirical , defined as the difference between the empirical averages of the highly benefited set assigned versus those assigned , converges to at least the nominal effect . The above assumes that is not constant in ; if it is, then the convergence may not hold, for example, because the sets may be empty.
For a trial with small to moderate sample size, the set of patients may have a true effect that is smaller than the limit. For this reason, we can use a modified set , that uses a resampling method to calibrate its effect to the nominal (Appendix B).
A problem with the above approach, however, is that it still uses the estimate (e.g., MLE) of the working model as if the model were correct. In Section 3, we show that, by using a different estimation of the same working model, a different highly benefited group can be identified, which can be much larger than the one identified by the existing approach. First, however, we illustrate the existing approach using data from the Citalopram for Agitation in Alzheimer Disease Study (CitAD) .
2.2 A motivating example
CitAD was a randomized placebo-controlled trial designed to evaluate the efficacy of citalopram in reducing agitation in patients with probable Alzheimer’s disease . The estimated average treatment effect was a 13.6% (se=7.1%) reduction in the probability of agitation symptoms in the citalopram versus the placebo group, as measured by the modified Alzheimer Disease Cooperative Study-Clinical Global Impression of Change Score (hereafter, mADCS-CGIC, Schneider et al. , Drye et al. ).
As agitation in Alzheimer’s disease (AD) is a heterogeneous clinical syndrome that encompasses many underlying pathologies, a secondary aim of the study was to characterize which patients were more likely to respond to citalopram, potentially elucidating which dysfunctional pathways might respond to citalopram. Characterizing heterogeneity in citalopram’s effect is also important because its use is associated with adverse effects (a cardiac complication, long QT syndrome, and cognitive worsening), and a preferred clinical goal would be to target highly respondent patients for treatment. We hypothesized that agitation in AD might involve disturbances in affective and/or executive control, which might further reflect different disturbances in underlying brain circuits. One hypothesized type of agitation reflects affective disturbance, manifested by mood lability, irritability, anxiety, dysphoria, and/or other affective/mood symptoms. Another hypothesized type reflects agitation from loss of inhibitory control resulting in disinhibition, disorganization, apathy, or other clinical manifestations of loss of executive control. Given the substantial evidence for the involvement of serotonergic deficits in affective dysregulation in mood disorders, we hypothesized that participants with a primarily affective type of agitation would respond better to citalopram treatment. To this end, one of the authors (CGL) derived two categorical scales, the affective dysregulation scale (ADS, ranging from 0 to 7) and the executive dyscontrol scale (EDS, ranging from 0 to 6), where higher values indicate more dysfunction. These scales were derived by examining the CitAD dataset for items that appeared to be a priori associated with affective or executive dysregulation (see Appendix C for the detailed derivation). Table 1 is a cross-tabulation of the number of patients in each arm of the study with different combinations of ADS and EDS scores at baseline.
Our goal here is to assess if there exist patient profiles, based on the ADS and EDS covariates, that experience a high citalopram versus placebo effect , examining this question for and (by comparison the overall average was estimated at 13.6%). Table 1 shows that each cell is populated by a relatively small number (if any) of patients, so direct implementation of the theoretical approach described in Section 2.1 is not feasible.
To address the goal, consider first the approach of positing a working model, also described in Section 2.1. In particular, consider the logistic regression working models for the binary outcome , with value 1 signifying a reduction in agitation symptoms:
In this first approach, the parameters, , were estimated by the MLE , and in eq. (2) was estimated by . The latter takes 41 unique values, each corresponding to a non-empty cell in Table 1 (provided no two elements of are the same). Next, patients were ranked by their values , and for each of the three values of and 40%, first we identified the uncalibrated set, say , of the highly benefited patients based on the description in Section 2.1.
We evaluated the properties of these sets, by conducting a simulation as described in Appendix B. First, we found that the true effects experienced by the uncalibrated sets were approximately 5% lower than their corresponding three nominal values. Then, for each nominal value, we searched for the value that the empirical effect should have in order that the simulated true effects be equal to the nominal. These resulting values were and , respectively, and the corresponding sets, which we call in Appendix B, are shown on the top three panels in Figure 2.
For example, the set of patients who experience an average effect of 30% are the patients with or with . This group is estimated to form 34% of the study population.
3 Proposed approach
The proposed approach is motivated by re-examining the parallelism that a better estimation approach should try to draw to the theoretical solution. In the theoretical solution (left column of Figure 1), the largest set is achieved by cumulatively including covariates based on the order of the true conditional effects . The model-based approach of Section 2.1 tries to parallel this by, first, estimating the conditional effects based on the MLE of a model , and then cumulating these ordered effects, , as in eq. (3).
While the above set of patients does experience the desired effect in large samples, this is not, of course, the largest such set if the working model is incorrect. In fact, it is not even the largest achievable set when using the same working model. This is because, if the model is incorrect, the member of the model that maximizes the (incorrect) likelihood does not necessarily have the invariance property with respect to the truth, and so it is not necessarily the same as the member of the model that achieves the largest set.
The proposed approach is to find the largest such set that can be achieved. To do this, the model should be left free at the first stage, so that one can consider all values of the parameter , that can predict by . Then,
for each value of the parameter, find (4) and such that the empirical effect within the set is at least ; then
find (5) where is as obtained in eq. (4).
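The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used for CitAD: the data are hypothetical `(x, z, y)` records with a scalar covariate `x`, the working model is an assumed logistic model with a treatment-by-covariate interaction (`model_effect`), and the maximization over the parameter is done by exhaustive search over a user-supplied list of candidate values, standing in for a stochastic search such as simulated annealing.

```python
import math


def expit(t):
    """Numerically stable inverse logit."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def model_effect(x, beta):
    """Effect implied by an assumed working model
    logit P(Y=1 | x, z) = b0 + b1*x + z*(b2 + b3*x)."""
    b0, b1, b2, b3 = beta
    return expit(b0 + b1 * x + b2 + b3 * x) - expit(b0 + b1 * x)


def empirical_effect(records, members):
    """Treated-minus-control difference in mean outcome within `members`;
    None if either arm is empty."""
    treated = [y for x, z, y in records if x in members and z == 1]
    control = [y for x, z, y in records if x in members and z == 0]
    if not treated or not control:
        return None
    return sum(treated) / len(treated) - sum(control) / len(control)


def largest_set_for_beta(records, beta, delta, effect_fn=model_effect):
    """First step: for a fixed parameter value, the largest set of the form
    {x : effect_fn(x, beta) >= c} whose empirical effect is >= delta."""
    levels = sorted({x for x, _, _ in records})
    score = {x: effect_fn(x, beta) for x in levels}
    best = (frozenset(), 0.0)
    for c in sorted(set(score.values()), reverse=True):
        members = frozenset(x for x in levels if score[x] >= c)
        effect = empirical_effect(records, members)
        if effect is not None and effect >= delta:
            fraction = sum(1 for x, _, _ in records if x in members) / len(records)
            if fraction > best[1]:
                best = (members, fraction)
    return best


def best_beta(records, candidates, delta, effect_fn=model_effect):
    """Second step: the candidate parameter value whose set is largest.
    Returns (beta, members, fraction); beta is None if no set qualifies."""
    best = (None, frozenset(), 0.0)
    for beta in candidates:
        members, fraction = largest_set_for_beta(records, beta, delta, effect_fn)
        if fraction > best[2]:
            best = (beta, members, fraction)
    return best
```

If the candidate list includes the standard estimate (for example, the MLE), the set returned by `best_beta` is by construction at least as large as the one the standard estimate produces, which mirrors the comparison between eq. (5) and eq. (3).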
By construction in eq. (5), the proposed set is the largest possible set of the type in eq. (4) that can be achieved by using the working model, and so it is also at least as large as the one obtained in eq. (3) by the standard approach. Also by construction, the set will converge to
such that the effect within the set is at least , where is the maximizer of the right-hand-side of the last expression. Thus we have:
Moreover, with finitely many levels of , the empirical effect, say on the new highly benefited set converges, in large samples, to at least the nominal effect , and the empirical proportion, say converges to the probability . A formal proof of this result would be more involved, due in part to having to deal with the estimators of parameters within functions (such as empirical estimates of probabilities and effects), and also due to the appearance of non-smooth indicator functions in both the probability statement and the effect function. Nonetheless, this heuristic argument seems to suggest that, under some regularity conditions and in sufficiently large samples, the new method will correctly produce a larger set of highly benefited patients than the standard method.
In small to moderate samples, and as with empirical maximization of other objective functions (e.g., sum of squares), the above convergence happens, by construction, from values of the effect that can be larger than the nominal one. For this reason, it is better to consider a modified set that uses the resampling approach to calibrate to the nominal minimal effect (see Appendix B).
We evaluated the properties of this new method by an analogous simulation to that for the standard method of Section 2 and as described in detail in Appendix B. We found that the true effects experienced by the uncalibrated sets of the new method were approximately 10% lower than their corresponding three nominal values. Then, for each nominal value, we searched for the value that the empirical effect should have in order that the simulated true effects be equal to the nominal. These three values were approximately and , respectively, and these resulting sets, which we call in Appendix B, are shown on the bottom three panels in Figure 2.
For example, the set of patients that experiences an average effect of 30% are the patients with and the following (EDS,ADS) cells: (3,3), (4,3), (5,4), (6,4), as shown within the black contour of the bottom left panel of Figure 2. This group is estimated to form 56% of the study population. Therefore, even after adjusting for overfitting, the new method is estimated to characterize substantially larger groups of patients with high benefit.
4 Discussion
We have illustrated a new method of characterizing groups of patients with high benefit. We believe the new method can have important clinical implications regarding which patients are targeted for treatment, as well as important methodological implications for characterizing such groups in observational studies.
The example of CitAD illustrates the potential of these methods. The ADS and EDS covariates are indeed predictive of effect regardless of whether standard methods or the new methods presented above are used, but the proportion of participants characterized as highly benefited is much higher with the new method. For example, using a 30% effect size as the minimum difference of clinical significance, 34% of participants fall into ADS/EDS categories with clinically significant effects using standard methods compared to 56% with the new method. Thus, using ADS/EDS categories, a clinician could identify roughly 22 percentage points more of the patients with AD and agitation as predicted to have a clinically significant response to citalopram, an undoubtedly clinically meaningful difference. Given the potential toxicity of medications (for example, QTc prolongation observed with citalopram treatment in CitAD), identifying patients most likely to respond to the drug represents a substantial improvement in maximizing benefit over risk. It is particularly impressive that ADS/EDS categories are so useful for predicting response because these subscales were derived from first principles, i.e., by examining instruments at the item level and deriving the scales pre hoc, independently of results, not as the result of cluster analytic techniques. This suggests the potential utility of applying these methods to other trials to improve clinicians’ ability to predict response to drug treatment.
A number of areas regarding the proposed method warrant further exploration. First, it is possible that the largest subgroup that, on average, has an effect larger than a constant may include finer subgroups with a negative effect. This is difficult to know, however, because a method that would search for this would also be subject to the difficulty of fitting effects given the high dimensional . Perhaps an expert’s opinion on whether the finer parts of the subgroup make sense would be useful. Second, making the clinical objective the same as the statistical objective function to maximize, while scientifically desirable, is prone to overfitting. Here, we addressed this in part by calibration through simulation. Additional work is needed to develop accessible inference methods for confidence intervals, and to determine if and how a semiparametric efficient estimator can be achieved for the set , for example using the theory of van der Laan and Rubin , van der Laan and Rose . Further, one can build additional parsimony into the estimation by regularizing the objective function, for example by adding a condition that restricts the magnitude of the coefficients. Thus, the contribution of the proposed method is not in competition with regularization, but is, instead, to emphasize the change of the core objective function, from a statistical one (e.g., least squares or likelihood) to a clinically meaningful one such as the proportion of highly benefited patients. Working with this objective function analytically is not straightforward because its complexity suggests it may not be convex. In practice we searched for maxima using simulated annealing.
Usefully, the new method can also be applied to characterize highly benefited groups in observational studies. Specifically, if treatment assignment is ignorable  and the propensity score  is reliably estimable, then, in principle, methods similar to those presented here can be applied to the population of potential outcomes after adjusting through the propensity score. This would provide an alternative way of fitting, for example, a structural mean model [13, 14], where the coefficients are chosen to maximize the group of patients that are benefited beyond a minimum effect desired by physicians and patients.
References
Porsteinsson A, Drye L, Pollock B, Devanand D, Frangakis C, Ismail Z, et al. Effect of citalopram on agitation in Alzheimer disease: the CitAD randomized clinical trial. J Am Med Assoc 2014;311:682–91.
Schneider L, Olin J, Doody R, Clark C, Morris J, Reisberg B, et al. Validity and reliability of the Alzheimer’s Disease Cooperative Study-Clinical Global Impression of Change. The Alzheimer’s Disease Cooperative Study. Alzheimer Dis Assoc Disord 1997;11(Suppl 2):S22–S32.
Drye L, Ismail Z, Porsteinsson A, Weintraub D, Marana C, Pelton D, et al. Citalopram for agitation in Alzheimer’s disease: design and methods. Alzheimers Dement 2012;8:121–30.
Drye L, Spragg D, Devanand D, Frangakis C, Marano C, Meinert C, et al. Changes in QTc interval in the citalopram for agitation in Alzheimer’s disease (CitAD) randomized trial. PLoS One 2014;9:e98426.
van der Laan MJ, Rubin DB. Targeted maximum likelihood learning. Int J Biostat 2006;2:Article 11. doi:10.2202/1557-4679.1043.
van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. New York: Springer, 2011.
Robins JM. The analysis of randomized and non-randomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. In: Sechrest L, Freeman H, Mulley A, editors. Health Service Research Methodology: A Focus on AIDS. Washington, DC, 1989:113–159.
Levin H, High W, Goethe K, Sisson R, Overall J, Rhoades H, et al. The neurobehavioural rating scale: assessment of the behavioural sequelae of head injury by the clinician. J Neurol Neurosurg Psychiatry 1987;50:183–93.
Cummings J, Mega M, Gray K, Rosenberg-Thompson S, Carusi D, Gornbein J. The neuropsychiatric inventory: comprehensive assessment of psychopathology in dementia. Neurology 1994;44:2308–14.
Cohen-Mansfield J. Conceptualization of agitation: results based on the Cohen-Mansfield Agitation Inventory and the agitation behavior mapping instrument (with discussion). Int Psychogeriatrics 1996;8:309–15.
Appendix A: Characterization of the largest highly benefited subgroup
We prove the result for the case where has finite though possibly many levels. Consider the indicator and the constant defined in Result 1; and consider any other indicator whose subgroup size is strictly larger than that of , i.e., suppose . Then it is useful to consider the quantity
where is as defined in Section 2.1 and . Specifically, is non-negative because if , both of the first two terms are non-negative; and if , both of the first two terms are non-positive. Moreover, is strictly positive with positive probability because, when (and ), then the first two terms are strictly positive regardless of . Now, if is summed over , we get
where and are the effects and , respectively, within the subgroups defined by the indicators. Thus, if we must have . By assumption, , and thus the maximum size is attained at by .
Appendix B: Evaluation and calibration of highly benefited sets through simulation
We sought to evaluate the properties of estimated highly benefited subgroups derived through fitting data from a trial, utilizing both the standard and proposed methods. To do so, we applied the estimated sets to the target population from which the data are sampled. In order to do this, for example, for the proposed method and for a nominal minimum effect equal to, say , we did the following.
(1) Treat as the target source population, and obtain a bootstrap data sample, with replacement.
(2) For , derive in order to reach a minimum empirical effect on data , as described in Section 3 (here, the explicit notation for the empirically achieved minimum effect and for the data is important).
(3) Apply back to the target source population , and find the true effect on these patients , which, based on the notation of Section 2.1, is .
(4) Repeat steps (1)–(3) and find the true average effect, averaged over the simulated data sets given .
(5) If the true effect as verified in step (4) is different from the nominal , then search, using a bisection method, for the value that we should require the empirical effect in step (2) to be, so that the true effect in step (4) is equal to the nominal. Call that empirical effect (this function can be different between the proposed method and the standard method).
(6) For the data define the calibrated highly benefited group for the nominal effect, as
We used the same approach to evaluate and produce calibration also for the standard method.
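The calibration loop described in this appendix can be sketched in Python as follows. The sketch is generic and makes several assumptions that are not in the original: `fit_set(sample, d)` is a hypothetical callable that derives a highly benefited set from a sample so that its empirical effect is at least `d` (by either method), `effect_on(data, members)` is a hypothetical callable that evaluates the effect of that set on the source data, the averaged true effect is assumed to be non-decreasing in `d`, and the interval `[lo, hi]` is assumed to bracket the solution. The same random seed is reused for every value of `d` so that the bisection works on a deterministic function.

```python
import random


def mean_true_effect(data, d, fit_set, effect_on, n_boot=200, seed=0):
    """Average, over bootstrap samples, of the effect that a set fitted
    to reach empirical effect d attains on the source data."""
    rng = random.Random(seed)  # same seed for every d (common random numbers)
    effects = []
    for _ in range(n_boot):
        boot = [rng.choice(data) for _ in data]  # bootstrap sample, with replacement
        members = fit_set(boot, d)               # set fitted on the bootstrap sample
        effect = effect_on(data, members)        # effect of that set on the source data
        if effect is not None:                   # skip replicates with an undefined effect
            effects.append(effect)
    return sum(effects) / len(effects) if effects else None


def calibrate(data, nominal, fit_set, effect_on, lo, hi, tol=1e-3, **kwargs):
    """Bisection for the required empirical effect d in [lo, hi] such that
    mean_true_effect(d) is approximately equal to the nominal effect."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        g = mean_true_effect(data, mid, fit_set, effect_on, **kwargs)
        # An undefined average means no set could be formed: d is too demanding.
        if g is None or g >= nominal:
            hi = mid
        else:
            lo = mid
    return hi
```

The calibrated group for the observed data would then be obtained as `fit_set(data, d_star)`, where `d_star` is the value returned by `calibrate`. Because the averaged true effect is a step function of `d` in finite samples, the bisection can only locate the crossing point up to the tolerance `tol`.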
Appendix C: Derivation of the affective and executive scales
Items were derived from medical/psychiatric history and from neuropsychiatric instruments including the Cornell Scale for Depression in Dementia (CSDD, Alexopoulos et al. ), Neurobehavioral Rating Scale (NBRS, Levin et al. ), Neuropsychiatric Inventory (NPI, Cummings et al. ), and Cohen-Mansfield Agitation Inventory (CMAI, Cohen-Mansfield ). The ADS consisted of 7 items: (1) family history of mood disorder; (2) personal history of mood disorder; (3) Depression defined as CSDD score 6 or NBRS depression item 3 or NPI Depression score 4; (4) Mood lability defined as NBRS mood lability item 3; (5) Anxiety defined as NBRS anxiety 3 or NPI Anxiety 4; (6) Irritability defined as NPI Irritability 4; and (7) Somatic defined as NBRS somatic symptoms item 3. Each ADS item was scored as 0 or 1 and summed for a total range of 0 to 7. The EDS consisted of 6 items: (1) Inattention defined as NBRS inattention item 3; (2) Aberrant Motor Behavior defined as NPI Aberrant Motor Behavior 4 or CMAI aberrant motor behavior item 4; (3) Disinhibition defined as NPI Disinhibition 4 or CMAI disinhibition 4; (4) Apathy defined as NPI Apathy 4 or NBRS apathy item ; (5) Poor planning defined as NBRS poor planning item 3; (6) Disorganization defined as NBRS disorganization item . Each EDS item was scored as 0 or 1 and summed for a total range of 0 to 6.
About the article
Published Online: 2017-05-20
The authors thank the Johns Hopkins CitAD group, NIH grant R01 AI102710-01A1, a collaboration between Johns Hopkins Department of Biostatistics and Medimmune, and Mark van der Laan, Marco Carone, and anonymous referees for helpful discussions. Any opinions expressed in the paper are solely the authors’.