Semiparametric methods have been developed to increase efficiency of inferences in randomized trials by incorporating baseline covariates. Locally efficient estimators of marginal treatment effects, which achieve minimum variance under an assumed model, are available for settings in which outcomes are independent. The value of the pursuit of locally efficient estimators in other settings, such as when outcomes are multivariate, is often debated. We derive and evaluate semiparametric locally efficient estimators of marginal mean treatment effects when outcomes are correlated; such outcomes occur in randomized studies with clustered or repeated-measures responses. The resulting estimating equations modify existing generalized estimating equations (GEE) by identifying the efficient score under a mean model for marginal effects when data contain baseline covariates. Locally efficient estimators are implemented for longitudinal data with continuous outcomes and clustered data with binary outcomes. Methods are illustrated through application to AIDS Clinical Trial Group Study 398, a longitudinal randomized clinical trial that compared the effects of various protease inhibitors in HIV-positive subjects who had experienced antiretroviral therapy failure. In addition, extensive simulation studies characterize settings in which locally efficient estimators result in efficiency gains over suboptimal estimators and assess their feasibility in practice.
Semiparametric estimators are appealing because of their robustness to distributional assumptions and model misspecification. In the analysis of randomized trials, semiparametric theory has been used to develop estimators of treatment effects that improve efficiency of inferences by incorporating baseline covariates, where “baseline” describes data measured prior to randomization. In this paper, we present a semiparametric locally efficient estimator to improve efficiency of inferences in randomized experiments with correlated outcomes when baseline covariates are available. We begin with a review of current estimators for multivariate outcomes and then introduce our semiparametric locally efficient estimator.
Correlated outcomes are often observed in medical research studies, such as those that randomize clusters of subjects or that randomize individual subjects but collect repeated measures of the response. We denote the outcome for the ith independent randomized unit, in such studies by the -dimensional response vector , which may represent longitudinal measurements from a single subject or a set of responses from subjects within a cluster defined by a family, hospital, or class. Considering the substantial costs incurred by such studies, it is of interest to maximize efficiency in the estimation of treatment effects using all available data.
In general, studies collect data on i.i.d. observations , where denotes a scalar treatment assignment to 1 of K possible treatments, and is a matrix of baseline covariates. Throughout we allow to be fixed or random and assume ignorability when is random. Longitudinal data also include a time variable denoting time points at which outcomes are measured. As in the case of unit size , we allow to be either fixed or random but ignorable. When repeated measures are taken on the same subject, baseline covariates are measured at ; thus for all , resulting in a single level of baseline covariate information. Clustered data, however, may include pre-treatment covariates at the level of the group or the individual, creating two layers of auxiliary data. In the longitudinal context, we refer to the vector as the subject, or independent unit and as observation- or measurement-level data. For clustered data, we refer to as cluster-level and as individual-level observations.
Semiparametric estimation often involves specifying a restricted mean model. When estimating marginal treatment effects, a model for the expected outcomes given treatment assignment is usually assumed. Consequently, only data on the treatment and outcome are used in estimation. For example, in longitudinal studies, the marginal effect of treatment over time may be measured by assuming the restricted mean model
where is a function of . The main effect , which measures imbalance in at baseline, is expected to be zero when randomization successfully balances covariate profiles across treatment arms. The post-baseline effect of treatment is measured by . Parameters and may be vector-valued, as the function describing the effect of time on expected outcomes may be of some polynomial form. Similarly, for clustered data, the semiparametric model
may be assumed, with treatment effects determined by inference on .
Estimating equations are determined by geometric arguments that distinguish parameters of interest, such as the treatment-outcome association () in the context of randomized studies, from other components needed to fully specify the data-generating distribution, which are represented by . For parameters of interest, we aim to derive regular asymptotically linear (RAL) estimators, where an asymptotically linear estimator is one for which there exists a function such that
Regularity conditions ensure that variance bounds are well-defined and exclude superefficient estimators that have undesirable properties under local alternatives . The function is called the influence function of and determines its limiting distribution. As eq. (3) suggests, any RAL estimator may be obtained by solving an influence function equation. To derive the class of estimating functions under an assumed model , one first defines the nuisance scores for the data-generating distribution ; one then determines the subspace defined by the closed linear span of all scores of smooth parametric submodels in model . This nuisance tangent space is denoted by . The orthogonal complement of , defines the set , indexed by h, which contains the set of influence functions of all RAL estimators [2, 3]. For correlated outcomes, the geometric arguments of semiparametric theory may be viewed as a generalization of the quasilikelihood approach of Liang and Zeger  in deriving generalized estimating equations (GEE). We denote as the set of distributions of with known treatment process satisfying eq. (1). Under model , defines the estimating equations
for estimating the p-dimensional vector . The index or weight is a function of a random treatment variable and time , and . We use bold to denote the vector-valued mean function and to represent its scalar components.
A locally efficient estimator of a semiparametric model is defined as an estimator that achieves the semiparametric efficiency bound (minimum asymptotic variance among all RAL estimators) at a given submodel for the data-generating law, but remains consistent outside the data-generating submodel , provided that the marginal model is correct. More explicitly, semiparametric models parametrize specific components of a data-generating process and leave others unspecified. Estimation may require working models of unspecified components; a locally efficient estimator achieves the semiparametric efficiency bound when such working models are correctly specified, but also remains consistent when only the parametric component is correctly specified. A semiparametric locally efficient estimator is determined by finding the optimal estimating function, referred to as the efficient score, which for GEE requires finding the optimal . When no baseline covariates are observed, Chamberlain  showed that the efficient score of is obtained by setting , where is the variance–covariance matrix of , and . The estimator remains consistent, however, when a working covariance other than the true covariance is substituted into the estimating equations, thereby demonstrating that consistency is achievable outside of the data-generating law.
For model , defined by observations , marginal model (1), and known treatment process, the set of influence functions was derived by Robins et al.  and arises by augmenting the influence functions of under model . Augmented estimators are constructed by subtracting the orthogonal projection of the standard estimating function onto the span of the scores of the treatment mechanism from the standard estimating function [6, 7]. For correlated outcomes, , and augmented GEE are
where for K-level treatment , . Fixing , the most efficient estimating function sets [6, 8–10]. The augmentation therefore involves estimation of the conditional mean outcome regression model , which may be related the marginal parameter of interest in the binary treatment setting by under an identity link and under the logit link. When baseline covariates are predictive of the outcome augmentation reduces variability in estimated treatment effects, irrespective of the outcome distribution. For the longitudinal marginal model (1), if outcomes are restricted to post-baseline measurements, the baseline measurement may be utilized as a baseline covariate and included in . The term is then no longer required to assess a post-baseline effect of treatment and may be removed from the model, leaving to capture the marginal treatment effect. The interaction term may still be required for correct model specification even when the baseline outcome is included as a covariate if the treatment effect varies in time.
Semiparametric locally efficient estimators of parameters in restricted mean models of marginal treatment effects have been implemented for univariate data in the presence of baseline covariates by Robins , Bang and Robins , van der Laan and Rubin , Tsiatis et al. , Zhang et al. , and Moore and van der Laan [14, 15]. In these developments, the choice of has no impact on the resulting asymptotic variance and is therefore not considered for deriving efficient estimators. For a univariate outcome, the model defined by a unique parameter for each treatment level is saturated, and the choice of is inconsequential. When is multivariate, is not saturated because a single parameter is shared across components of the vector . As a result, provides a larger set of estimating functions than the orthogonal complement of the nuisance tangent space of the corresponding restricted mean model for a univariate outcome. Each element in is indexed by ; the choice of impacts efficiency, and the optimal must be found to achieve minimum variance. A similar discussion of , model saturation, and estimating functions was included in Neugebauer and van der Laan .
The efficient score in model does not generally have the same optimal index as the efficient score in model . When incorporating auxiliary covariates in the estimation of marginal treatment effects via augmented GEE, the choice , while resulting in a consistent estimator, is no longer optimal in model . The efficient score is determined by optimizing over all index functions [6, 7, 9]. Robins  established general theory for deriving the efficient score of treatment effects in marginal structural models (MSMs) of time-dependent exposures, including the case of multivariate outcomes. Application of the Robins  theory to establish locally efficient estimators in specific settings, such as for randomized trials with correlated outcomes, requires further derivation. Additionally, the locally efficient estimators of Robins  were neither implemented nor evaluated for practical use. Models (1) and (2) may be viewed as examples of MSMs for a point exposure; the Robins  theory therefore equally applies. Although the efficient score may be obtained theoretically, it is often computationally intensive to calculate. Consequently, inefficient estimators are typically used. The suboptimal estimator based on augmenting GEE with was shown to improve efficiency by Zhang et al.  within the context of the linear mixed model and Stephens et al.  for general continuous and binary outcomes. In subsequent text, we refer to unaugmented GEE (4) under model with the index function as standard GEE, and the suboptimal estimator obtained by augmenting standard GEE is referred to as simple augmented GEE. Here we show how to further improve on simple augmented GEE by deriving the corresponding semiparametric locally efficient estimator for model . We then evaluate the feasibility of achieving such improvement in practice.
The following section presents the efficient score and derives a locally efficient estimator of marginal treatment effects in randomized trials with correlated outcomes when auxiliary data are available as in model . We also discuss an implementation procedure detailing how to appropriately estimate each component of the efficient score. In Sections 3 and 4, we compare the derived semiparametric locally efficient estimator to standard and simple augmented GEE through a simulation study and application to the AIDS Clinical Trial Group Study 398, a randomized longitudinal HIV intervention trial.
2.1 The efficient score
We consider the setting of longitudinal data and note that results follow analogously for clustered data by omitting . Before presenting the main result, some additional notation is required. Conditioning on , the matrix takes K possible values, which may be denoted by K constant matrices . For binary treatment, we have and , which denote the index functions under treatment and control , respectively. Let , the -dimensional vector of the difference in the conditional and marginal mean outcomes given time. Using this construction, let , the complete index matrix of dimension . Using a result from Newey and McFadden , we show in the supplementary material that the optimal index and resulting efficient score may be determined by solving a generalized information equality. Here, we present our main result:
Proposition 1. The efficient score for modelis
As shown above, is of dimension and may be decomposed into the difference , where is a block diagonal matrix with diagonal components . The block diagonal of contains the matrices , and off-diagonal block components are determined by .
When treatment is binary, simplifies to
where . Inverting analytically and letting ,
Expressing the optimal index as in eq. (7) demonstrates that incorporates information on the treatment assignment and auxiliary covariates through , while the standard index does not. The matrix is by definition the covariance of and , the expected outcomes given baseline covariates and treatment assignment to a and , respectively. The optimal index therefore boosts efficiency by incorporating information on the covariance in expected outcomes when weighting the residuals in the marginal model estimating equations. To implement locally efficient GEE for model , estimates of , , and for all unique pairs of treatment levels , including , are needed. The next section details an estimation procedure for each component of when is continuous and is the identity link, or is binary and is the inverse logit link.
2.2 Estimation of
The semiparametric locally efficient estimator requires estimates of three additional parameters:
These quantities may be linked by the law of total variance, . For the ith independent unit, the -dimensional vector determines the matrix and ultimately impacts the form of the marginal variance matrix . Observing the relationship among each of these parameters provides guidance for estimation. For example, the working marginal covariance selected must be compatible with the working model chosen for . More generally, the models for each component of must be specified so that the model selected for one component does not preclude the choice of model chosen for another. One approach that ensures compatibility is to start by estimating through an appropriate regression technique to provide an estimate . The conditional mean outcome may be modeled by
where represents the collection of covariates for the jth measurement in the ith unit. The next step is to estimate the conditional expectation by noting how the model of impacts the form of the matrix . The final step is estimation of by summing the estimates of and .
2.2.1 General estimation of and .
Since is a covariance matrix, it may generally be estimated in a similar fashion to estimating the correlation parameters in standard GEE. Let , where is a diagonal matrix with the jth diagonal component , the covariance of the predicted outcomes of element j under treatments a and , and is a correlation matrix with and for a function of correlation parameter that denotes the correlation in the predicted outcomes of element j under treatment a and element under treatment . The parameters , which may be a vector, and characterize the covariance in conditional mean outcomes under treatments a and . Letting , where is an initial estimate of obtained, for example, by maximum likelihood inference for Generalized Linear Models, may be estimated by
where is the dimension of the outcome regression parameter . The correlation parameter is then estimated by the moment equations
For , we obtain an estimate of .
As an alternative approach, one may also derive an expression of , the element of , which depends on and the covariance in baseline covariates. An empirical estimate of may then be substituted into this expression.
After estimating , the conditional variance of , , may be estimated using the correlation parameters from GEE based on the conditional mean model (8). Under homoscedasticity for all i. To ensure compatibility of all parameters, the marginal variance is then estimated by , where and are estimates of and , respectively.
2.2.2 Estimation of for clustered data or longitudinal data with
For clustered data and longitudinal data with in eq. (8), calculating is straightforward. When data are clustered, , leaving . In this setting, is calculated as . If auxiliary covariates are equally correlated among subjects within a cluster for all . This holds for all link functions . For longitudinal data when (i.e. the effects of baseline covariates on the conditional mean outcome do not vary over time) . If is the identity link, this reduces to for all . For clustered data, is a constant that depends on , , and , whereas for longitudinal data, is replaced by since for all j.
2.2.3 Estimation of under a compatible standard form
In some special cases where summing and results in a marginal covariance matrix with a standard form, e.g., exchangeable, may be estimated directly while maintaining compatibility with and . As stated above, if individual-level covariates are equally correlated among subjects within the ith cluster, the model imposes compound symmetry on , where diagonal components depend on and off-diagonal components are determined by . If the conditional variance is also exchangeable, is the sum of two exchangeable matrices and therefore also has an exchangeable structure. The optimal index may then be calculated by estimating directly as in standard or simple augmented GEE and using the above procedure to estimate .
A consistent estimator of the asymptotic variance of , the solution to the augmented estimating equations (5) evaluated under eq. (6), may be calculated using the sandwich variance formula of Huber .
3 Simulation study
Semiparametric locally efficient GEE for model were compared to standard and simple augmented GEE through a simulation study. Simulations were completed for clustered data with continuous and binary outcomes and longitudinal data with continuous outcomes. Results are based on 1,000 Monte Carlo datasets.
3.1 Continuous outcomes
3.1.1 Clustered data
Data for clusters were generated, with = 2, 4, 6, 8, 10, 12 with equal probability for the first set of simulations and = 10, 20, 30, 40, 50 in the second set. Auxiliary covariates , , and were each generated from a multivariate normal distribution with , , and . Correlation was induced among individual-level covariates within the same cluster by setting , , and . Covariance terms and were varied from 0.5 to 2 and 1.5 to 6, respectively, to evaluate the effect of auxiliary covariate correlation on the performance of locally efficient augmented GEE. At and , covariates were weakly correlated among individuals in the same cluster, while at and , covariates were perfectly correlated, thereby becoming cluster-level. The exact values considered for and were (0.5, 1, 1.5, 2) and (1.5, 3, 4.5, 6), for simulation sets 1–4 at each set of cluster sizes. Within the jth individual in the ith cluster, auxiliary covariates were independent. The treatment variable was drawn from the Bernoulli distribution with p = 1/2. Clustered responses were generated from the following model, with individual-level error terms and cluster-level effects . The proportion of variability in explained by auxiliary covariates was held fixed at roughly 25%. Simulations were completed with and , representing the case in which covariates account for all between-cluster heterogeneity ( independent) and the alternative of some intracluster correlation caused by an unmeasured variable ( exchangeable), respectively.
For each dataset, the marginal effect of treatment was estimated by fitting model (2) through standard, simple augmented, and locally efficient GEE for . The impact of misspecification on the locally efficient estimator and its efficiency relative to simple augmented and standard GEE was evaluated by fitting various models to estimate . The correct model for , denoted by “C” in tables and figures, was , and two additional models: Model 1, “M1” = , and Model 2, “M2” = . “Model 1” evaluated the impact of misspecifying the functional form of , while “Model 2” examined the effect of adding noise to the outcome regression model. All working covariance matrices were fit under exchangeable structure.
Efficiency comparisons relative to standard GEE are summarized in Figure 1(a) and 1(b), while the Monte Carlo Relative Efficiency (MCRE) of the locally efficient estimator for to simple augmented GEE is given in Table 1. Small cluster figures are included in the supplementary material. Across all levels of correlation, augmented estimators resulted in increased efficiency compared to the unaugmented estimator (MCRE 1.25–11.6). For low correlation among simple augmented and locally efficient augmented estimators performed similarly. Simple augmented GEE and locally efficient GEE for also resulted in similar efficiency when the conditional mean model did not include the data-generating quadratic term , or the true conditional variance was exchangeable (MCRE locally efficient to simple augmented GEE 0.99–1.01). When correlation was increased among within a cluster, the assumed conditional mean model included all important covariates in the correct functional form, and baseline covariates accounted for all within-subject correlation, locally efficient GEE for gained in efficiency over the simple augmented GEE (MCRE locally efficient to simple augmented GEE 1.04–1.22). Increased covariance among auxiliary covariates also resulted in greater efficiency gains for any augmented GEE relative to the standard estimator. Trends were more pronounced for large average cluster size (average = 30 vs average = 7).
|Cluster size = 2, 4, 6, 8, 10, 12|
|Cluster size = 10, 20, 30, 40, 50|
3.1.2 Longitudinal responses
For each Monte Carlo dataset, m = 500 longitudinal response vectors were generated from the model , where , and had an AR1 structure with correlation parameter or 0.5 for different sets of simulations. Covariates and were normally distributed with mean 0 and variance and , respectively. Variance parameters , , and were varied so that baseline covariates accounted for 10–60% of the variability in in increments of 10%. Subjects were randomly assigned to treatment ( = 1) with probability 1/2. For each subject , where varied from 1 to 8, as might be the case in a longitudinal study with staggered entry.
Standard GEE, simple augmented GEE, and locally efficient GEE for were applied to each Monte Carlo dataset to estimate marginal treatment effects. All GEE were fit based on the marginal mean model with inferences on the treatment effect completed through . Standard and simple augmented GEE were applied to each Monte Carlo dataset with AR1, exchangeable, and true working covariance structures, with the true structure under the marginal model being a summation of AR1 and exchangeable matrices as described in Section 2. Locally efficient GEE for were fit under the true covariance structure and a misspecified marginal AR1 working covariance. Baseline covariates were incorporated fitting several outcome regression models. We use “C” to denote the correct model , which corresponds to the true data-generating mechanism; “M1” indicates the model , omitting the exponent on ; and “M2” is the model that includes a noisy covariate , such that .
Efficiency comparisons are summarized in Figure 2 and Table 2. Additional figures are given in the supplementary material. For well-specified variance components and conditional mean models, the locally efficient GEE for was more efficient than the simple augmented GEE, with the difference in efficiency increasing with the percent variability explained by (MCRE of locally efficient to simple augmented GEE 1.0–1.27). Similarly, all augmented estimators were more efficient than standard GEE, with efficiency gains from augmenting increasing with correlation in and (MCRE of augmented GEE to standard GEE 1.36–5.28). For poorly specified conditional mean models, locally efficient GEE for and simple augmented GEE were nearly equally efficient (MCRE of locally efficient to simple augmented GEE 0.97–1.0), but when the marginal variance was also misspecified locally efficient GEE were less efficient than simple augmented GEE (MCRE 0.88–0.99). This demonstrates that the locally efficient GEE for is a bit more sensitive to working marginal covariance misspecification than simple augmented GEE. Among the simple augmented estimators, the estimator with the incorrect marginal AR1 working covariance resulted in the estimate with the lowest variability. This illustrates an important distinction between locally efficient and suboptimal estimating functions. Considering estimators using a suboptimal index, misspecified models for parameters in the index may result in more efficient inferences than correctly specified models. For the locally efficient estimator, semiparametric asymptotic efficiency is achieved only in the absence of model misspecification of all parameters in .
|Correlation between Y and X|
3.2 Clustered binary data
As for continuous outcomes, data for m = 500 clusters of variable size were generated with = 2, 4, 6, 8, 10, 12 for small cluster settings and = 10, 20, 30, 40, 50 for the large cluster scenario. The binary treatment variable was simulated from the Bernoulli(1/2) distribution. Individual-level covariates , , and were each generated from a multivariate normal distribution with , = 2, , and , inducing marginal correlation among individuals within the same cluster. Covariance parameters were varied to evaluate the impact of covariance in auxiliary covariates on the performance of augmented estimators, with and for different sets of simulations. For low levels of , covariates were weakly correlated, while for , covariates were cluster-level. Binary outcomes were simulated from the model , where was drawn from the bridge distribution for the logit link  with scale parameter . Simulations were completed with two values of the bridge distribution scale parameter, and , representing settings in which all sources of between-cluster heterogeneity are measured through auxiliary covariates, or when unmeasured sources of between-cluster heterogeneity are present. A total of 16 sets of simulations were done, varying cluster size, correlation in , and .
Standard, simple augmented, and locally efficient GEE for were applied to each dataset and compared for efficiency. For each estimator, the restricted mean model of interest was Model 2 with the inverse logit link and measuring the marginal effect of treatment. Among augmented estimators, four outcome regression models were considered: (1) “C” – Correct, ; (2) “M1” – Model 1, ; (3) “M2” – Model 2, ; and (4) “M1 OLS” – Model 1 OLS, . With the exception of Model 4, which was fit through ordinary least squares (OLS), all outcome regression models were fit by logistic regression. All estimators were fit with exchangeable working covariances.
Large cluster results are shown in Figure 3(a) and 3(b) and Table 3, while small cluster results are included in the supplementary material. Conclusions are similar to those obtained for continuous outcomes. Efficiency improvement with augmented estimators relative to standard GEE increased with correlation in auxiliary covariates (MCRE 1.10–10.54), as did the additional efficiency gains for the locally efficient GEE for over simple augmented GEE (MCRE 1.0–1.23). Simple and locally efficient augmented estimators were equally efficient for or when conditional mean models left out important transformations, but differences in efficiency favoring the optimal estimator were observed for and well-specified covariate-adjusted models.
|Correlation between Y and X|
3.3 Simulation study summary
Results from the various simulation settings provide insight into the performance of the locally efficient GEE for model and its practical value. The locally efficient estimator theoretically achieves minimum asymptotic variance when all components of and the augmentation are correctly specified. The results show that achieving the efficiency bound is not robust to model misspecification of working covariances and conditional means; the locally efficient GEE for was only more efficient than simple augmented GEE when all mean models included important covariates in the correct polynomial form, and the correct structure was specified for working covariances. Even under well-specified models, the locally efficient GEE only improved over the simple augmented GEE when the data-generating mechanism was such that the underlying conditional variance, had a sparse structure, such as AR1 or independence. The difficulty of correctly specifying models for nuisance parameters, particularly covariances, as well as measuring all sources of correlation so that is sparse present challenges for successfully implementing locally efficient estimators in real-world analysis. This challenge is further illustrated in the following section with application to AIDS Clinical Trial Group Study 398.
The semiparametric locally efficient estimator of marginal treatment effects for correlated outcomes was applied to data from AIDS Clinical Trial Group Study 398 (ACTG 398) . ACTG 398 was a multicenter, double-blind trial, in which 481 HIV-infected patients were randomized to one of four arms, (A) saquinavir, (B) indinavir, (C) nelfinavir, or (D) placebo based on their past protease inhibitor (PI) treatment. Patients were only randomized to drugs to which they had no prior exposure. Randomized treatments were added to a common antiviral regimen for all subjects. Subjects’ CD4 counts were measured at weeks 0 (baseline), 4, 8, and every 8 weeks thereafter until 48 weeks or dropout. Here, we apply the GEE estimators to compare the nelfinavir and placebo arms among patients who were eligible for both according to the stratified randomization scheme. Additional baseline covariates were age, sex, past PI use, past non-nucleoside reverse transcriptase inhibitor exposure, weight, Karnofsky score, intravenous drug use, and race/ethnicity. Weeks 4–32 of follow-up were included for analysis, with CD4 measurements at week 4 and beyond included as outcomes and week 0 CD4 included as a baseline covariate. Data were ~90% complete through week 32. In evaluating the effect of treatment on CD4, the marginal model was , where indicates the week of the jth measurement on the ith individual, and was an indicator for the placebo arm. This model was chosen by minimizing the prediction error from 10-fold cross validation of several candidate parametric models that included categorical time, quadratic time, or an interaction of time and treatment. Since only follow-up measurements were modeled as outcomes, and no interaction was detected between treatment and time, the effect of treatment was captured by .
Standard, simple augmented, and locally efficient GEE for were applied to estimate . Several candidate outcome regression models for augmented GEE were identified through model selection procedures. Cross validation was used to select the final model, , where CD4 is baseline CD4. The Quasi-likelihood under the independence model criterion (QIC) goodness-of-fit statistic  was compared among GEE fit to unaugmented marginal and conditional models to guide the choice of working covariance structures. To enforce compatibility of the marginal variance, conditional variance, and outcome regression in fitting locally efficient augmented GEE, the additive estimate of the marginal covariance was used. The working conditional variance was chosen by selecting the covariance structure resulting in the lowest QIC when fitting GEE on the conditional mean model. Simple augmented GEE were computed under all possible working marginal covariance structures, including the additive estimator motivated by the locally efficient GEE.
Results are shown in Table 4. Regarding covariance selection, unstructured working covariance resulted in the lowest QIC for the conditional model (supplementary material), suggesting the semiparametric locally efficient estimator should be fit assuming an unstructured form of . Several other covariance structures were also implemented for the locally efficient estimator to explore variance misspecification. Among simple augmented estimators, the additive marginal covariance obtained by summing the unstructured and induced by the chosen conditional mean model resulted in lower variability than estimators using standard marginal covariance structures. Estimated treatment effects exhibited variability across estimators but fell within a range of one standard deviation within the class of estimator considered (standard, simple augmented, or semiparametric locally efficient). Comparing standard GEE with different working covariance models, the estimated difference in average CD4 for the placebo arm versus nelfinavir ranged from 9.9 to 20.17. The direction of the effect was reversed for estimators that incorporated baseline covariates, with average CD4 on the placebo arm 0.07–8.11 units lower than the nelfinavir arm. Treatment did not have a significant impact on CD4 at the 0.05 level for any of the estimators considered.
|Simple Aug. (Ind)||−8.110||9.203||4.797|
|Simple Aug. (Exch)||−6.385||8.904||5.124|
|Simple Aug. (AR1)||−3.059||9.244||4.754|
|Simple Aug. (Un)||−0.079||9.411||4.587|
|Simple Aug. (Exch/Un)||−5.972||8.571||5.530|
|Simple Aug. (Exch/AR1)||−5.048||8.920||5.106|
|Loc. Eff. (Ind)||−8.110||9.203||4.797|
|Loc. Eff. (Exch)||−6.821||8.953||5.068|
|Loc. Eff. (Exch/AR1)||−5.715||9.073||4.936|
|Loc. Eff. (Exch/Un)||−6.277||8.601||5.492|
Estimators that incorporated baseline covariates greatly increased efficiency, with for standard GEE and among augmented estimators (relative efficiency augmented to standard GEE 5.0). Simple augmented and locally efficient GEE for resulted in similar efficiency – a result that may be explained by several factors: (1) Subjects had the same number of follow-up visits. For GEE, the index impacts efficiency most when the number of observations per unit is variable, (2) the unstructured conditional variance is not sparse, and (3) the components of may be misspecified. As a benchmark for efficiency, we also fit unaugmented GEE assuming the conditional mean model with an unstructured working covariance. This estimator represents the most efficient estimator of that may be obtained using , which requires assuming that the more restrictive conditional mean model is correct. From this estimator, we can determine that for this particular case, there is little additional efficiency to be gained by locally efficient GEE if simple augmented GEE are fit under the best working covariance (Table 4).
We derived and implemented a closed-form expression of the efficient score and a semiparametric locally efficient estimator in model for correlated outcomes. This estimator requires correct specification of the marginal mean model, but is consistent and asymptotically normal under misspecification of the variance or conditional mean working models. To avoid misspecification of the marginal mean model, we recommend several modeling approaches. In cluster randomized studies, misspecification of the marginal mean model may be avoided by including a unique parameter for each treatment level. Under exchangeable correlation structures, which are typically assumed, the marginal mean model is then correct. In longitudinal studies with a small number of visits, fixed effects for each visit time and an interaction of time and treatment may be used to avoid misspecification. When continuous time measurement prevents such a strategy, nonparametric methods may be used to suggest appropriate functional forms. Through simulation, we demonstrated that the semiparametric locally efficient estimator is more efficient than corresponding suboptimal estimators in certain settings, particularly when randomized units vary in size, baseline covariates account for a large portion of the within-unit correlation, and baseline covariates are at least moderately predictive of the outcome. In longitudinal studies, variable size may occur when studies have staggered entry or as subjects are lost to follow-up. The estimator derived is only semiparametric locally efficient in the first case, as the locally efficient estimator for incomplete data incorporates information on the missingness process.
There are several challenges to achieve semiparametric local efficiency, some of which stem from the parametric nature of the model of the marginal treatment effect parameter. Assuming the mean model is correct, accounting for correlation through measured covariates and correctly specifying the form of correlation can be difficult in practice. This challenge may be addressed through the use of scientific knowledge and covariance structure diagnostic tools, but is still likely to make local efficiency unachievable in most practical settings, rendering the simple augmented GEE the more useful option. When the marginal mean model is not correct, the semiparametric locally efficient estimator does not yield a consistent treatment parameter, regardless of the index used. The need for correct specification is yet another challenge to the presented estimator, but this challenge is common across all GEE estimators. Although theoretically possible, the prize of implementing semiparametric local efficiency for restricted mean models of marginal treatment effects with baseline covariates in the context of correlated outcomes is typically not worth the chase. The correlated outcome setting gives rise to the possibility of multiple estimators; the semiparametric locally efficient estimator does not offer much practical gain compared to the augmented estimator using the index function from standard GEE after taking into account the possibility of model misspecification of nuisance parameters.
There are several alternatives to the semiparametric locally efficient and suboptimal augmented estimators we consider. A nonparametric approach to modeling marginal treatment effects in longitudinal designs was shown in Neugebauer and van der Laan . Compared to our semiparametric locally efficient estimator the nonparametric strategy has the advantage of providing an interpretable causal parameter when the marginal mean model is possibly misspecified, but it does not utilize information on the correlation in outcomes and therefore would not be locally efficient in the setting of a time-fixed treatment with a correctly specified marginal model for the mean. Another semiparametric approach considers estimation of the conditional mean model followed by marginalization (the G-formula), but this relies on correct specification of the conditional mean model, whereas augmented estimators do not. Despite the shortcomings of the optimal augmented estimator, large efficiency gains were shown for longitudinal analysis when the baseline level of the outcome was incorporated in estimation as an auxiliary covariate. Baseline levels of outcomes can be highly predictive of follow-up levels, suggesting that in the analysis of data from longitudinal studies, failing to incorporate baseline covariates can be highly inefficient. These results suggest the value of incorporating baseline covariates in both interim and final analyses of data from randomized clinical trials.
6 Supplementary material
The reader is referred to the on-line supplementary materials for the derivation of the efficient score and additional simulation and data analysis results.
This research was supported by NIH grants R37 51164, T32AI007358, R01AI104459-01A1, and 1R21ES019712-01.
1. NeweyWK. Semiparametric efficiency bounds. J Appl Econometrics1990;5:99–135.Search in Google Scholar
2. BickelPJ, KlassenCA, RitovY, WellnerJA. Efficient and adaptive estimation for semiparametric models. Baltimore, MD: The Johns Hopkins University Press, 1993.Search in Google Scholar
3. van der VaartAW. Asymptotic statistics. Cambridge, UK: Cambridge University Press, 1998.Search in Google Scholar
4. LiangKY, ZegerSL. Longitudinal data analysis for discrete and continuous outcomes. Biometrics1986;42:121–30.Search in Google Scholar
5. ChamberlainG. Asymptotic efficiency in semi-parametric models with censoring. J Econometrics1986;32:189–218.Search in Google Scholar
6. RobinsJM, RotnitzkyA, ZhaoLP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc1994;89:846–66.Search in Google Scholar
7. RobinsJM. Marginal structural models versus structural nested models as tools for causal inference. In: BerryD,HalloranME, editors. Statistical models in epidemiology: the environment and clinical trials, vol. 116. New York: Springer-Verlag, 1999:95–134.Search in Google Scholar
8. RobinsJ. Robust estimation in sequentially ignorable missing data and causal inference models. In: Proceedings of the American Statistical Association Section on Bayesian Statistical Science 1999, 2000:6–10.Search in Google Scholar
9. van der LaanMJ, RobinsJM. Unified methods for censored longitudinal data and causality. New York: Springer-Verlag, 2003.Search in Google Scholar
10. ZhangM, TsiatisAA, DavidianM. Improving efficiency of inferences in randomized clinical trials using auxiliary covariates. Biometrics2008;64:707–15.Search in Google Scholar
11. BangH, RobinsJM. Doubly robust estimation in missing data and causal inference models. Biometrics2005;61:962–72.Search in Google Scholar
12. van der LaanMJ, RubinD. Targeted maximum likelihood learning. Int J Biostat2006;2:1–40.Search in Google Scholar
13. TsiatisAA, DavidianM, ZhangM, LuX. Covariate adjustment for two-sample treatment comparisons for randomized clinical trials: a principled yet flexible approach. Stat Med2008;27:4658–77.Search in Google Scholar
14. MooreKL, van der LaanMJ. Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation. Stat Med2009;28:39–64.Search in Google Scholar
15. MooreKL, van der LaanMJ. Application of time-to-event methods in the assessment of safety in clinical trials. In: PeaceKE, editor. Design, summarization, analysis & interpretation of clinical trials with time-to-event endpoints. New York: Chapman & Hall, 2009.Search in Google Scholar
16. NeugebauerR, van der LaanMJ. Nonparametric causal effects based on marginal structural models. J Stat Plann Inference2007;137:419–34.Search in Google Scholar
17. StephensAJ, Tchetgen TchetgenEJ, De GruttolaV. Augmented gee for improving efficiency of inferences in cluster randomized trials by leveraging cluster and individual-level covariates. Stat Med2011;31:915–30.Search in Google Scholar
18. NeweyWK, McFaddenD. Large sample estimation and hypothesis testing. Handbook of Econometrics 1994;4:2111–245. Available at: http://www.sciencedirect.com/science/article/pii/S1573441205800054Search in Google Scholar
19. HuberPJ. Robust estimation of a location parameter. Ann Math Stat1964;35:73–101.Search in Google Scholar
20. WangA, LouisTA. Matching conditional and marginal shapes in binary random intercept models using a bridge distribution function. Biometrika2003;90:765–75.Search in Google Scholar
21. HammerSM, VaidaF, BennettK, HolohanMK, SheinerL, EronJ, et al. Dual vs. single protease inhibitor therapy following antiretroviral treatment failure. J Am Med Assoc2002;288:169–80.Search in Google Scholar
22. PanW. Akaike’s information criterion in generalized estimating equations. Biometrics2001;57:120–25.Search in Google Scholar
© 2014 by Walter de Gruyter Berlin / Boston