As stated by Peto et al., “There is simply no serious scientific alternative to the generation of large-scale randomized evidence. If trials can be vastly simplified, …, and thereby made vastly larger, then they have a central role to play in the development of rational criteria for the planning of health care throughout the world.” Recruitment of a large number of eligible patients from a general population is both a major strength and a major weakness of large pragmatic trials. Deliberately broad and sometimes ill-defined entry criteria mean that the overall result can be difficult to apply to particular groups. In modern medical practice, individualized medicine is often advocated, and clinicians frequently need to decide how best to use the results of randomized clinical trials (RCTs) and systematic reviews to maximize the wellbeing of their patients. Therefore, subgroup analyses have become increasingly necessary when heterogeneity of treatment effects is likely to occur. However, views differ among the scientific and clinical communities about the proper justification for, and conduct of, this kind of analysis.
Some statisticians and non-clinical epidemiologists have warned about the dangers of subgroup analyses because of concerns about multiplicity, data dredging, false positive subgroup treatment effects, and the rarity of qualitative heterogeneity of relative treatment effect (e.g. see the discussions and references in Wang et al. and the simulation study by Brookes et al.). On the other hand, groups of practicing clinicians and statisticians have warned of the dangers of applying the overall results of large trials to individual patients without consideration of patho-physiology or other determinants of individual response [4–6]. Rothwell put it this way:
“The main potential of subgroup analysis is not in the identification of groups that differ in their response to treatment for reasons of patho-physiology, but is in answering practical questions about how treatments should be used most effectively. Subgroup analyses related to questions of the practical application of interventions can be vital to effective clinical practice.”
In this article, without adding to the debates, we focus on better methodological conduct of subgroup data analyses. In practice, even with stratified randomization specifically for the subgroups of interest, the subgroup data may still show some degree of imbalance in some covariates (the accidental bias described by Efron and Lachin), which may bias the data analysis. In general, smaller trials are more likely to have covariate imbalance; however, imbalance can also occur in larger trials. For example, in a post hoc analysis, Wright et al. compared the outcomes in hypertensive black and non-black patients treated with chlorthalidone, amlodipine, and lisinopril in the trial conducted by the ALLHAT Collaborative Research Group. Even with sample sizes of several thousand in each stratum, they still found significant imbalance in several covariates among the strata. Therefore, analyzing subgroup data without proper adjustment for covariate imbalance may be problematic and may even produce misleading results. Unfortunately, this is not uncommon in medical journals. Our intention in this article is to bring matching methods, which have been used quite often in other scientific disciplines, into clinical trial subgroup data analysis as a tool to better adjust for covariate imbalance. Our simulation study shows some remarkable merits of the matching methods for this purpose.
The organization of this manuscript is as follows. In Section 2, we briefly describe the statistical background of propensity score and the methods of matching to adjust the potential covariate imbalance, followed by comparisons of their performance in a simulation. In Section 3, we discuss the estimation of confidence interval of treatment effect using stochastic approximation. An example is provided in Section 4 to illustrate the procedures discussed herein using a data set from a recent clinical trial. Further discussions are presented in Section 5.
2 Statistical method and notations
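The concept of propensity scores is thoroughly discussed by Rosenbaum and Rubin as well as by other authors. In the following, we describe a few key points for analytical purposes. Let $Y_i(1)$ denote the response to the experimental treatment of subject $i$ and $Y_i(0)$ denote the response to the control treatment of subject $i$. Let $X_i$ denote the vector of covariates associated with subject $i$, and let $Z_i = 1\,(0)$ if subject $i$ receives the experimental (control) treatment. The observed outcome for subject $i$ is then
$$Y_i = Z_i\,Y_i(1) + (1 - Z_i)\,Y_i(0).$$
If the subjects were appropriately randomized between experimental treatment and control groups, then
$$E[Y_i(1) \mid Z_i = 1] = E[Y_i(1) \mid Z_i = 0] \quad \text{and} \quad E[Y_i(0) \mid Z_i = 1] = E[Y_i(0) \mid Z_i = 0],$$
even though $E[Y_i(0) \mid Z_i = 1]$ of the experimental treatment group and $E[Y_i(1) \mid Z_i = 0]$ of the control group cannot be estimated from the data, since each subject can receive either the experimental or the control treatment, but not both.
If the data were appropriately randomized, the estimand of the average treatment effect, which can be estimated empirically using the observed data, can be written as
$$\tau = E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0] = E[Y_i(1) \mid Z_i = 1] - E[Y_i(0) \mid Z_i = 0],$$
which can be further re-expressed as
$$\tau = w_1 \tau_1 + w_0 \tau_0,$$
where the weights $w_1$ and $w_0$ are positive with $w_1 + w_0 = 1$, with
$$\tau_1 = E[Y_i(1) - Y_i(0) \mid Z_i = 1], \qquad \tau_0 = E[Y_i(1) - Y_i(0) \mid Z_i = 0]$$
being the (unobserved) treatment effects from the experimental treatment and control groups, respectively.
When covariate imbalance occurs, proper matching of subjects to better balance the covariates is usually recommended in order to obtain a more appropriate estimate of the treatment effect. Given the covariates $X_i$ and following the results of Rubin [12, 13], one can show that
$$E[Y_i(0) \mid Z_i = 1, X_i] = E[Y_i(0) \mid Z_i = 0, X_i] = E[Y_i \mid Z_i = 0, X_i].$$
Therefore, the treatment effect of the experimental treatment group
$$\tau_1 = E\big\{ E[Y_i \mid Z_i = 1, X_i] - E[Y_i \mid Z_i = 0, X_i] \,\big|\, Z_i = 1 \big\},$$
where the outer expectation is taken over the distribution of $X_i$ in the experimental treatment group, can be estimated.
Define the propensity score as
$$e(X_i) = P(Z_i = 1 \mid X_i),$$
namely, the probability of patient $i$ being assigned to the experimental treatment given the covariates. Assume
$$Y_i(z) \perp Z_i \mid X_i \quad \text{and} \quad 0 < e(X_i) < 1,$$
where $z = 0$ or 1. Rosenbaum and Rubin showed that
$$\tau_1 = E\big\{ E[Y_i \mid Z_i = 1, e(X_i)] - E[Y_i \mid Z_i = 0, e(X_i)] \,\big|\, Z_i = 1 \big\},$$
where the outer expectation is taken over the distribution of $e(X_i)$ in the experimental treatment group, and $\tau_0$ can be expressed similarly. Therefore, the average treatment effect can be derived from the estimates of $\tau_1$ and $\tau_0$. More details about the propensity score can be found in Rosenbaum in addition to the articles mentioned herein.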
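Let $e(X) = P(Z = 1 \mid X)$ and let $X = (X_1, \ldots, X_p)$ be the vector of covariates. A common method to estimate $e(X)$ is via the logit function, i.e.
$$\log \frac{e(X)}{1 - e(X)} = \alpha + \beta^{T} f(X) + \gamma^{T} g(X), \tag{1}$$
where $f$ and $g$ are known functions and
$$\beta^{T} f(X) \quad \text{and} \quad \gamma^{T} g(X)$$
represent the main effects and interactions, respectively. The parameters in eq. (1) can be estimated using MLE-based methods. Goodness-of-fit can be checked graphically via, for example, Landwehr et al. or Tsai.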
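As a minimal sketch of the logit-based estimation described above, the following pure-Python code fits a logistic model by gradient ascent on the average log-likelihood. The function names (`fit_propensity`, `propensity_scores`, `simulate`), the learning rate, the iteration count, the simulated data, and the use of plain gradient ascent in place of a packaged MLE routine are all assumptions made for illustration and are not taken from the original analysis. Main effects and interactions simply enter as columns of each design row.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_propensity(rows, z, lr=0.5, n_iter=500):
    """Fit logit P(Z=1|X) = b0 + b'x by gradient ascent (illustrative only).

    rows: list of feature lists (main effects and any interaction columns)
    z:    list of 0/1 treatment indicators
    """
    n, p = len(rows), len(rows[0])
    beta = [0.0] * (p + 1)  # beta[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for x, zi in zip(rows, z):
            eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
            resid = zi - sigmoid(eta)  # score contribution of one subject
            grad[0] += resid
            for j, v in enumerate(x):
                grad[j + 1] += resid * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def propensity_scores(rows, beta):
    """Estimated P(Z=1|X) for every design row."""
    return [sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))
            for x in rows]


def simulate(n=400, seed=1):
    """Hypothetical data: two covariates; assignment depends on the first."""
    rng = random.Random(seed)
    rows, z = [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append([x1, x2, x1 * x2])  # two main effects + one interaction
        z.append(1 if rng.random() < sigmoid(-0.3 + 1.5 * x1) else 0)
    return rows, z
```

Calling `fit_propensity(*simulate())` returns four coefficients (intercept, two main effects, one interaction); the fitted scores can then be passed to any of the matching schemes discussed in this section.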
According to Rosenbaum and Rubin, it is advantageous to sub-classify or match not only on the propensity score $e(X)$ but also on other functions of $X$. In particular, such a refined procedure may be used to obtain estimates of the average treatment effect in a subpopulation defined by the components of $X$, for example, gender or different disease classifications.
In addition to matching by the propensity score defined above, other matching schemes exist. Two of the more commonly used methods are Mahalanobis and Genetic matching. Given the covariate vectors of two subjects, $X_i$ and $X_j$, the distances between them used in Mahalanobis and Genetic matching are defined as
$$d_M(X_i, X_j) = \big\{ (X_i - X_j)^{T} S^{-1} (X_i - X_j) \big\}^{1/2} \quad \text{and} \quad d_G(X_i, X_j) = \big\{ (X_i - X_j)^{T} (S^{-1/2})^{T} W S^{-1/2} (X_i - X_j) \big\}^{1/2},$$
respectively, where $S^{1/2}$ is the Cholesky decomposition of the covariance matrix $S$ of $X$, i.e. $S = S^{1/2} (S^{1/2})^{T}$, and $W$ is a diagonal positive definite weight matrix. The elements of $W$ can be chosen objectively to simultaneously minimize the distributional difference and location difference of covariates between the experimental treatment and control groups based on the Kolmogorov–Smirnov test and t-test, respectively. On the other hand, $W$ can also be chosen somewhat subjectively depending on the relative importance of the matched variables. When certain variables are considered more important and a higher degree of balance is desired for them, one can assign higher weights to those variables during the matching process.
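The two distances can be sketched in a few lines of pure Python. The sketch below assumes that the covariance matrix is supplied as a list of lists, that the weight matrix is diagonal and passed as a list of its diagonal entries, and that the Cholesky factor is lower triangular; the function names are hypothetical. With all weights equal to 1, the Genetic distance reduces to the Mahalanobis distance.

```python
import math


def cholesky(s):
    """Lower-triangular L with s = L L^T (s symmetric positive definite)."""
    p = len(s)
    low = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            acc = s[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low


def whiten(low, diff):
    """Solve L u = diff by forward substitution, i.e. u = L^{-1} diff."""
    u = []
    for i, d in enumerate(diff):
        u.append((d - sum(low[i][k] * u[k] for k in range(i))) / low[i][i])
    return u


def genetic_distance(x, y, cov, weights):
    """sqrt( (x-y)' (L^{-1})' W L^{-1} (x-y) ) with W = diag(weights)."""
    u = whiten(cholesky(cov), [a - b for a, b in zip(x, y)])
    return math.sqrt(sum(w * v * v for w, v in zip(weights, u)))


def mahalanobis_distance(x, y, cov):
    """Special case of the weighted distance with unit weights."""
    return genetic_distance(x, y, cov, [1.0] * len(x))
```

Raising the weight of one covariate increases the penalty for mismatching on it, which is how an investigator's preference for tighter balance on selected variables would enter the matching.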
The matching can be performed with either pair-matching or full-matching, depending on the distributions of the data. The treatment effect can then be estimated from the matched data in the control and experimental groups. The overall treatment effect can be estimated by a weighted average over the individual matched groups. In the example below, we used full-matching to estimate the treatment difference.
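To make the matching step concrete, here is a minimal sketch of nearest-neighbor pair-matching with replacement, followed by an equally weighted average of the within-pair differences. It is a simplification chosen for brevity: it does not implement full-matching, and the function name `pair_match_att`, the data layout, and the equal weights are assumptions. Any distance function (propensity score difference, Mahalanobis, or a weighted variant) can be plugged in.

```python
def pair_match_att(treated, controls, distance):
    """Match each treated subject to its nearest control (with replacement).

    treated, controls: lists of (covariates, outcome) tuples
    distance:          function(cov_a, cov_b) -> non-negative float
    Returns the mean within-pair outcome difference and the matched index pairs.
    """
    diffs, matches = [], []
    for i, (x_t, y_t) in enumerate(treated):
        # index of the control subject closest to this treated subject
        j = min(range(len(controls)),
                key=lambda k: distance(x_t, controls[k][0]))
        matches.append((i, j))
        diffs.append(y_t - controls[j][1])
    return sum(diffs) / len(diffs), matches


# Toy usage: the "covariate" is a scalar score, distance is absolute difference.
treated = [(0.2, 3.0), (0.8, 5.0)]
controls = [(0.1, 1.0), (0.25, 2.0), (0.75, 4.0), (0.9, 6.0)]
effect, pairs = pair_match_att(treated, controls, lambda a, b: abs(a - b))
```

In this toy data set each treated subject is paired with the control whose score is closest, and the estimated effect is the average of the two outcome differences.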
The conventional test of covariate balance between groups based on the t-test focuses only on location and can miss distributional differences between the covariates. On the other hand, the Kolmogorov–Smirnov test compares distributional differences and can miss differences in locations. By combining these two tests, matching can often be better assessed.
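The complementary roles of the two tests can be illustrated with a small pure-Python sketch. It assumes a large-sample normal approximation for the location test (instead of an exact t distribution) and the usual asymptotic series for the two-sample Kolmogorov–Smirnov p-value; the function names are hypothetical. In the usage example, the two samples have identical means but different spreads, so the location test sees nothing while the Kolmogorov–Smirnov test flags the difference.

```python
import math
from statistics import NormalDist, mean, variance


def location_pvalue(x, y):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    t = (mean(x) - mean(y)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def ks_statistic(x, y):
    """Largest absolute gap between the two empirical distribution functions."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] <= v:  # step past ties in x
            i += 1
        while j < m and ys[j] <= v:  # step past ties in y
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic two-sample KS p-value (series approximation)."""
    ne = n * m / (n + m)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:  # the series is unreliable here; the p-value is ~1
        return 1.0
    total = 0.0
    for k in range(1, 101):
        term = 2.0 * (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-12:
            break
    return min(1.0, max(0.0, total))


# Same mean, different spread: the location test misses it, KS does not.
x = [i / 100 for i in range(-100, 101)]
y = [3 * v for v in x]
p_location = location_pvalue(x, y)
p_ks = ks_pvalue(ks_statistic(x, y), len(x), len(y))
```

Reporting both p-values for each covariate, before and after matching, gives a compact summary of balance in location and in distribution.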
2.1 Comparison of matching methods via simulation
To further investigate the performance of various matching methods, a simulation with 5,000 iterations was conducted under various scenarios. Specifically, the simulation plan was designed as follows.
Sample size: two sets of sample sizes were used in the simulation. The first set assumes equal sample sizes () for both the experimental treatment and control groups. The second set assumes the same sample sizes for the experimental treatment group, with the control group about 20% smaller, to mimic the unequal allocations used in many RCTs and to study how these methods behave when the two samples differ in size.
Assume three covariates ($X_1$, $X_2$, and $X_3$) will be matched between the experimental treatment and control groups. The covariates were assumed to have somewhat different distributions between the experimental treatment and control groups. Four different distributions are assumed. They consist of normal distributions with possibly different means and variances, or contaminated normal distributions with either symmetric or asymmetric contamination from either tail. The list of distributions is shown in Table 1. In a separate simulation, we also include a binary covariate as suggested by a reviewer.
The response variable (Y) was assumed to follow one of two models. The first model contains only the main effects of the covariates, and the second model additionally includes interactions among the covariates. The treatment effect difference between the experimental treatment and control groups is assumed to be a constant, e.g. 1. The purpose of assuming two different models is to compare these methods when the model is incorrectly specified. These two models were modified accordingly when the binary covariate was included.
The statistical methods to be compared are:
Empirical mean difference,
Least squares (LS) fit (assuming the first model is correct),
LS fit (assuming the second model is correct),
Matching using the propensity score based on $X_1$, $X_2$, and $X_3$,
Matching on $X_1$, $X_2$, and $X_3$ using all available data,
Matching on $X_1$, $X_2$, $X_3$, and the propensity score using all available data,
Matching on $X_1$, $X_2$, and $X_3$ but excluding data in either tail outside of two times MAD (the median absolute deviation, defined as $\mathrm{median}_i\,|x_i - \mathrm{median}_j\,x_j|$) from the median for each covariate (to mimic Tukey’s robust trimmed estimate),
Matching on $X_1$, $X_2$, $X_3$, and the propensity score but excluding data in either tail outside of two times MAD from the median for each covariate.
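The trimming rule used in the last two methods can be sketched as follows. This is an illustration under stated assumptions: the MAD is the unscaled median absolute deviation (no normal-consistency constant), a row is dropped if any one of its covariates falls outside the median plus or minus two MADs, and the names `mad` and `trim_rows` are hypothetical.

```python
from statistics import median


def mad(values):
    """Unscaled median absolute deviation from the median."""
    values = list(values)
    m = median(values)
    return median(abs(v - m) for v in values)


def trim_rows(rows, k=2.0):
    """Keep rows whose every covariate lies within median +/- k * MAD."""
    p = len(rows[0])
    bounds = []
    for j in range(p):
        col = [r[j] for r in rows]
        m, s = median(col), mad(col)
        bounds.append((m - k * s, m + k * s))
    return [r for r in rows
            if all(lo <= r[j] <= hi for j, (lo, hi) in enumerate(bounds))]


# The last row has an outlying first covariate and is removed.
rows = [(1, 10), (2, 11), (3, 12), (4, 13), (100, 12)]
kept = trim_rows(rows)
```

Matching would then be run on `kept` only, so that a few extreme covariate values do not dominate the distance calculations.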
Two criteria for comparisons are examined:
The estimates of the true treatment effect and the variation of the estimates,
Balancing the covariates between experimental treatment and control groups. This will be assessed by examining the minimum p-value of the Kolmogorov–Smirnov test for the distributional equality of each covariate between experimental treatment and control groups before and after matching. Large p-values indicate greater comparability of the experimental treatment and control groups in terms of the covariates and hence reflect better covariate balance between groups.
2.2 Summary of simulation results
By examining the median, the inter-quartile distance, and the overall range of the box plots of the estimated treatment effects, we draw the following conclusions:
The simple observed treatment difference can be a very poor estimate when the covariate distributions are different and deviate from standard normal distributions as shown in Panels 2 and 4 of Figures 1 and 2.
For the main effect model, the LS fit (when the model is correctly specified or even over-fitted with interaction terms) is generally better than other methods in estimating the treatment effect. But the LS fit with main effect only can perform poorly, if the true model includes interactions; however, the LS fit with interactions (correct model) outperforms other methods. This finding seems to be quite consistent among various sample sizes.
Matching purely based on propensity scores usually performs worse than Genetic matching either with all available data or with the trimmed dataset in estimating the true treatment effect. The trimmed estimate using Genetic matching to match both covariates and propensity scores performs almost uniformly better than any other method regardless of model specification and sample size, except for the LS fit when the model is correctly specified as discussed in eq. (2).
When the covariates of the experimental treatment and control groups have identical normal distributions, the LS method outperforms all other methods since there is no need for matching, and additional effort to match seems to be redundant. Propensity score matching seems to make the covariate balance worse more often than not. However, Genetic matching seems to perform reasonably well, especially when the outliers were trimmed away (Panel 1 of Figures 3 and 4).
However, when the covariate distributions are different between experimental treatment and control groups and deviate from the standard normal, the effect of matching from all methods becomes very visible. This can be seen in Panels 2–4 of Figures 3 and 4. Genetic matching with trimmed outliers tends to outperform all other methods either matched only on all covariates or with propensity score included. This is true for all distributions and sample sizes tested here.
The sample size in Figures 1 and 2 was 450, and in Figures 3 and 4, it was 400. The patterns for other sample sizes are similar, with somewhat higher variation for smaller sample sizes, and are therefore not shown here.
As discussed above, when the model is correctly specified, the simple LS method outperforms other methods as expected. However, as in most data analyses, one rarely knows the correct model or the distribution from which the data were generated. Therefore, the performance of the LS method can sometimes be expected to diminish in the analysis of real data. On the other hand, the performance of Genetic matching seems to be almost always comparable to that of the LS method when the model is correctly specified, and it performs much better when the model is mis-specified, as shown in Panels 1, 2, and 4 of Figure 2. Therefore, Genetic matching seems to serve as a “model mis-specification proof” tool for general data analysis. It is interesting to note that Diamond et al. similarly concluded that Genetic matching is preferred over other matching methods, because it is more efficient (smaller MSE) and less biased.
3 Estimation of confidence interval of treatment effect
As discussed by Lachin and other researchers, in most clinical trials, participants are more of a convenience sample than a truly random sample from the intended population with a specific disease for treatment. After a group of study subjects has been recruited, trialists then make their best efforts to randomly assign subjects to treatments. That is one of the primary reasons these researchers prefer the randomization model to the population model for statistical inference. Since subgroups, defined either pre- or post-randomization, also inherit these properties, the randomization model seems to be a natural choice for inference.
It is common statistical practice to accompany the point estimate of the treatment effect with the corresponding confidence interval so that the magnitude of the effect can be better judged by clinical practitioners. Lachin suggested using the randomization model to estimate the treatment effect and invoking the concept of the population model to estimate the confidence interval. As an alternative, one can use the stochastic approximation proposed by Robbins and Monro and implemented by Garthwaite to estimate the confidence interval of the treatment effect. Briefly, for treatment effect $\theta$, a randomization test is performed to test the hypothesis $H_0: \theta = \theta_0$ against both one-sided alternatives $\theta > \theta_0$ and $\theta < \theta_0$. A separate search is performed for each endpoint of the corresponding confidence interval. The upper and lower endpoints of the confidence interval are updated according to an algorithm after every randomization test. The asymptotic property of the search process is discussed by Garthwaite and Buckland. In addition, under weak regularity conditions, the estimates converge in probability to the correct confidence limits.
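The search for the two endpoints can be sketched as below for a two-sample shift model with the difference in means as the test statistic. This is a hedged illustration, not the authors' R program: the starting values (estimate plus or minus two standard errors), the step-length constant proportional to the current distance from the point estimate, the starting step index, and the function name `rm_confidence_interval` are assumptions modeled on the general Robbins–Monro scheme. At each step the candidate limit is removed from the treated responses, the labels are re-randomized once, and the limit moves by a small step whose direction depends on whether the re-randomized estimate falls above or below the observed one.

```python
import math
import random
from statistics import NormalDist, stdev


def mean_diff(a, b):
    return sum(a) / len(a) - sum(b) / len(b)


def rm_confidence_interval(treated, control, alpha=0.025, n_steps=5000, seed=0):
    """Robbins-Monro style search for a 100(1 - 2*alpha)% randomization interval."""
    rng = random.Random(seed)
    n1 = len(treated)
    theta_hat = mean_diff(treated, control)
    se = math.sqrt(stdev(treated) ** 2 / n1 + stdev(control) ** 2 / len(control))
    z = NormalDist().inv_cdf(1.0 - alpha)
    k = 2.0 / (z * NormalDist().pdf(z))              # assumed step-length scaling
    m = min(50, math.ceil(0.3 * (2.0 - alpha) / alpha))  # assumed first step index

    def search(start, is_upper):
        limit = start
        for i in range(m, m + n_steps):
            # Under H0: theta = limit, shifted treated and control are exchangeable.
            pooled = [y - limit for y in treated] + list(control)
            rng.shuffle(pooled)
            theta_star = mean_diff(pooled[:n1], pooled[n1:]) + limit
            c = k * abs(limit - theta_hat)
            if is_upper:
                if theta_star > theta_hat:
                    limit -= c * alpha / i
                else:
                    limit += c * (1.0 - alpha) / i
            else:
                if theta_star < theta_hat:
                    limit += c * alpha / i
                else:
                    limit -= c * (1.0 - alpha) / i
        return limit

    lower = search(theta_hat - 2.0 * se, is_upper=False)
    upper = search(theta_hat + 2.0 * se, is_upper=True)
    return theta_hat, lower, upper
```

Because the two endpoints are searched separately, nothing forces the resulting interval to be symmetric about the point estimate. Plotting the successive values of each limit against the step index is a simple way to judge when the search has stabilized.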
4 Example
A phase III, multi-national, randomized, double-blind, placebo-controlled clinical trial was conducted by a pharmaceutical company to compare the treatment effect of drug A and drug B to placebo in controlling disease activity in subjects with rheumatoid arthritis having an inadequate clinical response to methotrexate. (Due to the restriction of the data provider, the names of drug A and drug B are not revealed.) The study was not originally designed to compare drug A and drug B directly. However, a post hoc analysis to compare these two drugs in a subgroup of countries of the original study is of clinical interest. A total of 156 and 165 patients were randomized to drugs A and B in these countries, respectively. The primary endpoint of the study was the disease activity score based on 28 joints (DAS28).
Comparisons of several baseline covariates using the t-test did not show particular imbalance between the two treatment groups. However, a more in-depth investigation of the baseline distributions by quantile–quantile (Q–Q) plots showed some deviations between the two populations. The objective of this analysis is to properly estimate the treatment difference in the presence of baseline imbalance.
The first step in this analysis is to match the patients from drugs A and B. Both the propensity score and the Genetic matching methods were applied with covariates including age, baseline pain score, baseline CRP, and other components of DAS28, so that we can compare the relative performance of these two matching methods. As an example, the baseline pain scores between the treatment groups are compared and shown in Figure 5. The original Q–Q plot of pain scores between drug A and drug B is shown in Panel 1. The Q–Q plots of this covariate using propensity score matching and Genetic matching are shown in Panels 2 and 3, respectively. Genetic matching clearly yields substantially better covariate balance than propensity score matching.
Permutation distributions of the treatment effect before and after Genetic matching were also generated and are shown in Figure 6. The observed treatment difference prior to matching is about –0.19. However, the magnitude of the treatment difference was reduced substantially to –0.048 after matching. This indicates the importance of properly matching the patients in the two treatment groups. Without this step, the treatment difference may be over-estimated. Even though the permutation test did not show a significant treatment difference either before or after matching, the test prior to matching came closer to significance than the test after matching.
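A permutation distribution of this kind can be generated with a few lines of code. The sketch below uses the difference in group means as the statistic and a Monte Carlo sample of label permutations; the function name `permutation_test`, the two-sided p-value convention with the add-one correction, and the toy data are assumptions for illustration and do not reproduce the trial data.

```python
import random
from statistics import mean


def permutation_test(treated, control, n_perm=5000, seed=0):
    """Monte Carlo permutation test for a difference in means.

    Returns the observed difference, a two-sided p-value, and the list of
    permuted differences (the permutation distribution).
    """
    rng = random.Random(seed)
    n1 = len(treated)
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    stats = []
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # re-randomize the treatment labels
        s = mean(pooled[:n1]) - mean(pooled[n1:])
        stats.append(s)
        if abs(s) >= abs(observed) - 1e-12:
            extreme += 1
    # add-one correction keeps the Monte Carlo p-value strictly positive
    return observed, (extreme + 1) / (n_perm + 1), stats


# Toy usage with clearly separated groups.
obs, p_value, dist = permutation_test([5, 6, 7, 8], [1, 2, 3, 4], n_perm=2000, seed=1)
```

Running the same function on the unmatched and on the matched data gives the two permutation distributions that can be compared side by side.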
The 95% confidence interval of the treatment effect difference was estimated using the procedure described previously. A total of 5,000 randomized samples were generated and analyzed. The estimates fluctuate substantially in the beginning of the approximation process. The process began to stabilize after about 2,500 randomizations. Figure 7 shows the stochastic approximation for the upper and lower limits of the confidence interval. The resulting 95% confidence interval is (–0.110, 0.4858).
5 Discussion
Subgroup data analysis is common practice in medical research in order to better understand or explore treatment effects in different groups of patients. This is an important step toward individualized medicine. However, doing it properly, so as to obtain a more or less unbiased treatment effect (since no one knows what the truth is), is a difficult task, especially when the data are not appropriately randomized or are only observational. Researchers have proposed several classes of methods to analyze this kind of data (e.g. see  and the discussions therein), and matching methods based on the propensity score, Mahalanobis matching, Genetic matching, and their variants are among the important tools for this purpose. In particular, Genetic matching provides the extra flexibility of weighting the selected covariates for the desired matching, so that the treatment effect estimate can also reflect the preferences of the investigators. With good matching between the experimental treatment and control subjects and a higher degree of association between the covariates and the response, the treatment effect can be estimated more accurately.
In this manuscript, we conducted a simulation to examine the performance of these methods under various scenarios with respect to their ability to better balance the covariates between the experimental treatment and control groups and to produce an unbiased estimate of the treatment effect. The methods we compared ranged from the usual linear regression and conventional matching techniques with all available data to more robust alternatives that exclude possible outliers. In general, Genetic matching is preferred to other methods under various data distributions of the covariates and various sample sizes.
The selection of variables to be used in these procedures is an important point to consider. Several authors have proposed various approaches to incorporating covariates when estimating the propensity score [24–26]. The general finding is that one should incorporate variables that are thought to be related to outcomes, and variables thought to be confounded with both treatment assignment and outcomes. The model that incorporates as many covariates as possible and the model that includes only obvious covariates such as age, gender, and race do not always seem to perform well. One should note that pre-randomization variables will not be confounded with treatment assignment in an RCT, and a successful randomization process is likely to correct for both the known and the unknown confounders. However, under the scenario of possible missing confounding variables, known or unknown, compounded with possible covariate imbalance, the performance of matching methods relative to other approaches is still not well understood and will be the subject of future research.
It is generally recommended that careful examination of covariate balance between treatment groups be conducted prior to statistical inference. Besides the formal test procedures, graphical methods can quite often reveal subtle details of the data that are not easily detected by test procedures. In addition to the treatment effect estimate, it is more informative to provide readers with a confidence interval that brackets the estimate, which can help clinicians gauge the clinical significance of the treatment effect. Toward this purpose, among other approaches, one can use the Robbins–Monro stochastic approximation to estimate the confidence interval of the treatment effect difference (an R program is available from the authors). One may notice that the confidence interval is asymmetric about the estimate in our data example. The stochastic approximation estimates the upper and lower bounds of the confidence interval separately by comparing the randomization test statistics using the original and re-randomized data, which is different from the population model approach. Hence, the asymmetry may be an intrinsic property of the combination of the randomization test and stochastic approximation; however, a more detailed investigation of this phenomenon seems worthwhile.
Toward the goal of individualized medicine, medical research and practice often use multi-stage therapeutic strategies, for example, dynamic treatment regimes (DTRs), in which the dose or treatment is modified at each stage according to a patient’s current history, disease status, and response to the most recent treatment, in the testing of experimental treatments for serious diseases such as cancer or psychiatric problems. Statistical research on the design and analysis of studies aimed at evaluating the effects of these strategies also has an active history, for example, Zelen and Wei and Durham on the play-the-winner rule; Lavori and Dawson [29, 30], Thall et al., Murphy, and Oetting et al. on the designs of randomized trials that aim at the evaluation of DTRs; Robins [34–37] on the g-estimation of structural nested models; and Murphy, Robins, and Moodie et al. on optimal treatment regime estimation. As explained by Moodie et al., the inferences in Robins and Murphy are based on the difference between the empirical and the counterfactual observations, which is also utilized in our proposed method. Given the number of treatments in DTRs and the usually moderate number of subjects, their methods used parametric or semi-parametric approaches for more efficient estimation and modeling. Even though they could have employed subject matching as we have done here, the moderate trial size may pose a severe limitation. Alternatively, Zhao et al. proposed a non-parametric individualized treatment rule using outcome weighted learning, which circumvents the need for conditional mean modeling and the counterfactual assumptions, essentially turning the optimal treatment selection into a weighted classification problem solved with SVM techniques. Their individualized treatment rule assigns treatments to each subject based only on the subject’s prognostic information. Presumably, they could have modified their weighting schemes to increase the flexibility, as provided by Genetic matching, to adjust the weights preferred by the investigators beyond the prognostic factors.
The ultimate goal of this research is to use the results of the trials as a basis for generating hypotheses and planning a future, larger-scale confirmatory trial, and to tailor specific treatments to patient subgroups with specific disease characteristics as well as genome and biomarker profiles, so that potentially better responses can be achieved. As more trials with more subjects are conducted based on the best knowledge accumulated from these experimental findings, additional questions will inevitably be raised to further identify and compare the patient subgroups with respect to treatment efficacy and adverse effects. The subgroup analysis method proposed in this article can become an important and handy tool for performing rigorous post hoc analysis to further understand the inner workings of new treatments. Therefore, the integration of multi-stage therapeutic strategies and careful post hoc subgroup analysis can become an important effort for medical advancement.
Even though large-scale randomized controlled trials are generally considered the gold standard for generating convincing clinical information, one should not underestimate the importance of subgroup analysis. One cannot ignore that concerns exist in conducting and reporting subgroup analysis, and problems persist even with the CONSORT and ICH guidance; however, when properly planned, reported, and interpreted, subgroup analysis can provide valuable information. In some clinical trial settings, subgroup analysis can also be among the primary objectives. For example, the FDA granted marketing approval for Pemetrexed plus Cisplatin [44, 45] to treat non-small cell lung cancer patients with non-squamous histology even though the entire study did not show significance in overall survival.
The authors would like to thank the Editor and all the reviewers for their insightful comments. Their input substantially improved the quality of this manuscript.
References
Brookes ST, Whitley E, Egger M, Smith GD, Mulheran PA, Peters TJ. Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test. J Clin Epidemiol 2004;57:229–36.
Tukey JW. The future of data analysis. Ann Math Stat 1962;33:13–14.
Wright J, Dunn J, Cutler J, Davis B, Cushman W, Ford C, et al. Outcomes in hypertensive black and non-black patients treated with chlorthalidone, amlodipine, and lisinopril. JAMA 2005;293:1595–608.
Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983;70:41–55.
Rubin DB. Assignment to a treatment group on the basis of a covariate. J Educ Stat 1977;2:1–26.
Rosenbaum PR. Observational studies. New York: Springer-Verlag, 1995.
Tsai KT. Assessing regression modeling with ordinal responses. Presentation at the Joint Statistical Meetings of the American Statistical Association, 2008.
Sekhon JS. Alternative balance metrics for bias reduction in matching methods for causal inference. Working paper, 2006. Available at: http://sekhon.berkeley.edu/papers/SekhonBalanceMetrics.pdf
Diamond A, Sekhon JS. Genetic matching for estimating causal effects: a general multivariate matching method for achieving balance in observational studies. Berkeley, CA: Institute of Governmental Studies, University of California, 1996. Available at: http://escholarship.org/uc/item/8gx4v5qt
Blum JR. Approximation methods that converge with probability one. Ann Math Stat 1954;25:390–4.
Shadish WR, Clark MH, Steiner PM. Can non-randomized experiments yield accurate answers? A randomized experiment comparing random and nonrandomized assignments. J Am Stat Assoc 2008;103:1334–56.
Oetting A, Levy R, Weiss S, Murphy S. Statistical methodology for a SMART design in the development of adaptive treatment strategies. In: Shrout PE, editor. Causality and psychopathology: finding the determinants of disorders and their cures. Arlington, VA: American Psychiatric Publishing, 2011:175–205.
Robins J. A new approach to causal inference in mortality studies with sustained exposure periods – application to control of the healthy survivor effect. Math Model 1986;7:1393–512.
Robins J. The analysis of randomized and non-randomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. In: Sechrest L, Freeman H, Mulley A, editors. Health service research methodology: a focus on AIDS. Rockville, MD: U.S. Public Health Service, 1989:113–59.
Robins J. Analytic methods for estimating HIV treatment and cofactor effects. In: Ostrow G, Kessler R, editors. Methodological issues of AIDS mental health research. New York: Plenum Publishing, 1993:213–90.
Robins J. Causal inference from complex longitudinal data. In: Berkane M, editor. Latent variable modeling and applications to causality. New York: Springer-Verlag, 1997:69–117.
Robins J. Optimal structural nested models for optimal sequential decisions. In: Lin DY, Heagerty P, editors. Proceedings of the second Seattle symposium on biostatistics. New York: Springer, 2004:189–326.
Moher D, Schulz KF, Altman DG, et al. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. Available at: http://www.consort-statement.org/. Accessed: 1 Nov 2007.
International Conference on Harmonisation (ICH). Guidance for industry: E9 statistical principles for clinical trials. Rockville, MD: Food and Drug Administration, September 1998. Available at: http://www.fda.gov/cder/guidance/ICH-E9-fnl.PDF. Accessed: 1 Nov 2007.
Scagliotti G, Parikh P, von Pawel J, Biesma B, Vansteenkiste J, Manegold C, et al. Phase III study comparing Cisplatin plus Gemcitabine with Cisplatin plus pemetrexed in chemotherapy-naive patients with advanced-stage non-small cell lung cancer. J Clin Oncol 2008;26:3543–51.
HemOncToday. FDA approved pemetrexed plus cisplatin for nonsquamous NSCLC. 2008.