In many observational studies, logistical or other constraints may render recruitment of disease-free patients for follow-up studies infeasible. In such cases, subjects who already experienced the initiation of the disease prior to recruitment (i.e. prevalent cases) are sampled. It is well known that subjects so recruited do not form a representative sample from the target population because subjects with longer survival times have a greater chance of being recruited into the study. When the disease has stationary incidence, the induced bias in sampling is called length bias [1, 2]. This bias in sampling can lead to bias in the estimation of an exposure effect of interest.
Length-biased sampling can affect the sampling distribution of the covariates, such that covariates associated with the longer survivors have a higher chance of being selected. Recently, Bergeron et al., Shen et al., Qin and Shen, and Ning et al. studied analysis of covariates under biased sampling. Studies on length-biased sampling can be traced as far back as Wicksell, Fisher, Neyman, Cox and Lewis, Zelen and Feinleib, and Patil and Rao. An updated review of the subject can be found in Asgharian et al.
A second source of potential bias in estimation of treatment or exposure effects encountered in observational studies is confounding. In the simple case of binary exposure, when exposure is influenced by other predictors, individuals in each exposure group may have different characteristics, yielding imbalanced covariate distributions across the groups. If the predictors also influence the outcome (say, survival time), this may lead to bias in the estimated exposure effect. Under an assumption of no unmeasured confounding, a consistent exposure effect estimator can be obtained by two well-known methods: inverse probability of treatment weighting (IPTW) and propensity score regression (PSR). Weighted proportional hazards (PH) models for right-censored data were introduced by Binder and Lin in the survey sampling literature. Pugh, Robins, Lipsitz and Harrington also presented a weighted PH estimating equation to adjust for missing covariates [15–18].
In a recent article, Ertefaie et al.  developed a method for estimating the propensity score in the presence of length-biased sampling. In this paper, we address estimation of total causal effects in the presence of both length-biased sampling and confounding, which we term the double-bias problem, in the analysis of survival data. Specifically, we develop augmented estimating equations based on PH and accelerated failure time (AFT) models that can be used to estimate the exposure effect. In both cases, the augmentation spaces are formed using the censoring mechanism to improve the efficiency.
The rest of this paper is organized as follows. In Section 2, we introduce concepts and notation used in the manuscript. Section 3 presents our proposed estimating equation for estimating the propensity score when data are subject to length-biased sampling. In Sections 4 and 5, we present our estimating equations to deal with length-biased sampling and confounding under PH and AFT modeling assumptions, respectively, together with the large sample properties of the resulting estimators. We examine the performance of the proposed approach via simulation in Section 6 and, in Section 7, apply our method to a set of length-biased right-censored survival data collected as part of the Canadian Study of Health and Aging (CSHA), investigating the effect of institutionalization on survival; see Wolfson et al.
2 Length-biased sampling
Our notation is similar to that of Ertefaie et al. . Our data comprise n i.i.d samples of where D and are the binary treatment variable and the vector of covariates, respectively. A is the time from the onset of the disease to the recruitment time and R covers the time from the recruitment time to the event (residual life time). Accordingly, the observed lifetime is defined as . In the presence of right censoring, C is the censoring time measured from the recruitment to the loss to follow up. The observed survival time is . The variables with superscript pop represent the population variables; variables without pop denote the observed truncated variables. Figure 1 illustrates the different random quantities introduced in this section. The symbols and denote a censored lifetime and an observed failure, respectively.
Let F and f be the distribution and density of , respectively. If the onset times are generated by a stationary Poisson process (the so-called stationarity assumption), then (1)if has a corresponding absolutely continuous density , where is the mean survival time under F. Equation (1) is derived under a uniform truncation assumption.
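For reference, the standard relations implied by the stationarity assumption can be written out explicitly. The block below is a sketch in assumed notation: the symbols $T^{pop}$ for the population lifetime, $f_{A,T}$ for the joint density of the observed truncation time and lifetime, and $f_T$ for the marginal density of the observed lifetime are introduced here for illustration and need not coincide with the exact form of eq. (1).

```latex
% Assumed notation: T^{pop} has density f and mean \mu; (A, T) is observed
% only when T^{pop} exceeds the uniformly distributed truncation time.
\begin{align*}
  f_{A,T}(a,t) &= \frac{f(t)}{\mu}\,\mathbf{1}\{0 < a < t\}, \\
  f_T(t) &= \int_0^t \frac{f(t)}{\mu}\,da \;=\; \frac{t\,f(t)}{\mu},
  \qquad \mu = \int_0^\infty u\,f(u)\,du .
\end{align*}
```

Under these relations the truncation time is uniform on $(0, t)$ given an observed lifetime $t$, and longer lifetimes are over-represented in proportion to their length.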
For , we define the process by where is the censoring indicator ( indicating failure). We use small letters to refer to the possible values of the corresponding capital letter random variable. Throughout the manuscript, we make the following standard assumptions:
A1. The variable is independent of the calendar time of the onset of the disease.
A2. The disease has stationary incidence, i.e. the disease incidence occurs at a constant rate.
A3. The censoring time C is independent of .
2.2 Counterfactual outcomes
We define the causal effect of interest using the counterfactual framework introduced by Rubin . The counterfactual values are representing the backward, forward recurrence times, and observed survival time, respectively, if . Similarly, represents the counterfactual response. The observed response, , is defined as .
Positivity: where is the conditional probability of receiving treatment d given .
3 Propensity score estimation under length-biased sampling
Rosenbaum and Rubin  adjust for differences between exposed and unexposed groups using a scalar function of the measured covariates, the propensity score, which removes the bias induced by differences between these two groups of units. The propensity score, , for binary exposure D is defined by , where is a p-dimensional vector of covariates.
In general, the propensity score is unknown and needs to be estimated; it has also been shown that even if the propensity score is known, one may gain efficiency in estimating the average treatment effect (ATE) by estimating using the data available . However, estimating the propensity score using a length-biased sample does not lead to a balancing score or create the desired pseudo-population in which the exposure is independent of covariates; indeed, it may induce even more bias than leaving the confounders unadjusted .
Assuming a logit model for the propensity score in the target population, we have (2)where is a vector of parameters. Cheng and Wang  develop a method that consistently estimates the parameters of the propensity score from prevalent survival data. Their method requires correct specification of the conditional hazard model given the treatment and covariates. Ertefaie et al.  show that under assumptions A1–A3 this requirement can be removed, and propose the following estimating equation (3)where and is the Kaplan–Meier estimator of the survivor function of the residual censoring variable C. Note that the censored individuals contribute to this estimating equation through . Ertefaie et al.  show that where is the augmentation element . In this manuscript, we use eq. (3) to estimate the parameters of the propensity score. The term augments the failure time of the censored subjects using the observed failure times. We present the form of this augmentation term in Appendix B.
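The weighting idea behind eq. (3) can be illustrated in its simplest special case. The sketch below assumes there is no censoring, so that each sampled subject can be weighted by the reciprocal of the observed lifetime; it omits the Kaplan–Meier censoring adjustment and the augmentation term of eq. (3). The function name `fit_weighted_logit`, the Newton-Raphson solver and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def fit_weighted_logit(x, d, w, n_iter=50, tol=1e-10):
    """Solve sum_i w_i * x_i * (d_i - expit(x_i' alpha)) = 0 by Newton-Raphson."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    w = np.asarray(w, dtype=float)
    alpha = np.zeros(x.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(x @ alpha)))
        score = x.T @ (w * (d - p))                       # weighted score
        info = (x * (w * p * (1.0 - p))[:, None]).T @ x   # weighted information
        step = np.linalg.solve(info, score)
        alpha = alpha + step
        if np.max(np.abs(step)) < tol:
            break
    return alpha


# Toy data (hypothetical): observed lifetimes t, exposure d, intercept-only design.
t = np.array([1.0, 2.0, 4.0, 0.5, 1.0])
d = np.array([1, 0, 1, 1, 0])
x = np.ones((t.size, 1))

# Assumption: no censoring, so the length-bias weight is simply 1 / t.
alpha_hat = fit_weighted_logit(x, d, w=1.0 / t)
ps_hat = 1.0 / (1.0 + np.exp(-alpha_hat[0]))
```

With an intercept-only design the fitted probability reduces to the weighted proportion of exposed subjects, which makes the effect of down-weighting long survivors easy to check by hand.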
4 Cox PH models
The hazard ratio (HR) is defined as the ratio of hazards in the exposed and unexposed groups. Qin and Shen introduce a set of estimating equations to assess the effect of covariates on the survival time in the presence of length-biased sampling. Our proposed estimating equation adapts the one introduced by Qin and Shen (under the PH model) so that it adjusts for confounding as well as length-biased sampling. We derive an estimating equation that estimates the marginal treatment effect without the need to estimate the effects of other covariates on the survival time.
Under A1–A3 and identifiability assumptions, the density of a counterfactual failure time observed in the study under exposure d can be expressed as where and are the counterfactual densities of the survival time if all the individuals would have received the exposure d. The second equality follows as where and [3–5].
Assuming the PH model for the counterfactual survival time, we have , and parameter can be interpreted as a causal HR for the total effect of the treatment D. We propose the following estimating equation for , (4)In Appendix D, we show that eq. (4) corresponds to a score function of a pseudo-partial likelihood which can be presented as where is the counterfactual survival function of the survival time if all the individuals would have received the exposure d. The dependence of the estimating eq. (4) on the parametrization for the propensity score is shown by defining (5)In the proof of Theorem 1 in the Appendix, we show that can also be written as (6)where The stochastic process can be estimated by replacing the and by their estimates, and , respectively. In the proof of Theorem 1 given in the Appendix B, we show that this stochastic process has mean zero.
The following theorem addresses the asymptotic properties of the estimator obtained by the estimating eq. (6) when and are replaced by their estimated values and , respectively. The parameters of the propensity score can be estimated using the estimating equation given in eq. (3). Define where is the cumulative hazard function of the censoring variable. The stochastic process can be estimated by replacing the by its estimate, . The stochastic process has mean zero, where is the survival function of the residual life time.
Theorem 1 Let be the exposure effect estimator obtained as the root of (7)Then under regularity conditions , and listed in Appendix A, where is defined in Appendix B. Also, the estimating function converges in probability to (8)where and for are defined in Appendix B.
Proof See Appendix B.
In the absence of length-biased sampling, augmented partial likelihood estimators have been proposed in Robins et al.  and van der Laan and Robins . The function in eq. (8) generalizes this idea to length-biased sampling settings. Note the second part of the summation in is the augmentation element.
Remark: Parameter measures the marginal association between the exposure and the hazard, which is not necessarily equal to the conditional association due to non-collapsibility.
5 Accelerated failure time models
Inspired by the AFT models introduced by Cox and Oakes , we consider a general form of AFT models, where we do not assume a known error distribution. Assuming the AFT model for the counterfactual survival time, we have and the parameter can be interpreted as a total treatment effect. Under causal identifiability assumptions and by the balancing property of the propensity score, the above model can be written in terms of the observed data as follows (9)We refer to this model as the AFT propensity score regression (AFTPSR) model . Higher order and interaction terms can also be included in the model if needed. While AFT models may suffer from lack of robustness with respect to the log transformation, they are often more interpretable .
5.1 AFT-weighted estimating equations
Another approach for correcting the bias induced by non-random assignment was suggested by Horvitz and Thompson  and Hájek and Dupač  who introduced estimators which weight the observed outcomes. The IPTW estimator adjusts for confounding by assigning a weight to each individual proportional to their chance of receiving the exposure they actually received [34, 35].
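The weighting described above can be sketched in a few lines. This is the ordinary IPTW construction for an uncensored outcome, without any length-bias correction; the function names `iptw_weights` and `iptw_mean_difference` and the toy numbers are hypothetical, and the normalized (Hájek-type) contrast is one common choice rather than the estimator used in this paper.

```python
import numpy as np


def iptw_weights(d, ps):
    """Reciprocal of the probability of the exposure actually received."""
    d = np.asarray(d, dtype=float)
    ps = np.asarray(ps, dtype=float)
    return d / ps + (1.0 - d) / (1.0 - ps)


def iptw_mean_difference(y, d, ps):
    """Normalized (Hajek-type) weighted contrast between exposure groups."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    w = iptw_weights(d, ps)
    mu1 = np.sum(w * d * y) / np.sum(w * d)
    mu0 = np.sum(w * (1.0 - d) * y) / np.sum(w * (1.0 - d))
    return mu1 - mu0


# Toy data (hypothetical): outcomes, exposures and known propensity scores.
y = np.array([3.0, 1.0, 2.0, 0.0])
d = np.array([1, 1, 0, 0])
ps = np.array([0.5, 0.25, 0.5, 0.5])

weights = iptw_weights(d, ps)
contrast = iptw_mean_difference(y, d, ps)
```

Subjects who were unlikely to receive the exposure they actually received get larger weights, which is what rebalances the covariate distributions across groups.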
We generalize the IPTW estimator to account for length-biased sampling. In our setting, the weights are the reciprocal of the probability of being in the exposure group to which each individual is observed to belong. The estimating equation corresponding to IPTW is given by (10)where is the empirical average. This is a version of the complete case influence function introduced by Tsiatis  modified to take into account the censoring weight .
Augmented IPTW (AIPTW), which is a more efficient version of IPTW, was introduced by Scharfstein et al.  and Lipsitz et al. . Let for . The corresponding estimating equation is given by (11)where and The causal effect estimator corresponding to an influence function in is called a double robust (DR) estimator in the sense that the estimator is consistent if either the propensity score model or the conditional response mean model is correctly specified [28, 36, 39, 40]. The influence function (11) is a member of the class of AIPTW influence functions and it has been shown that it is more efficient than eq. (10) . In the proof of Theorem 2, we show how has been derived.
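The double robustness property can be seen in the textbook AIPTW estimator of a mean difference. The sketch below assumes an uncensored outcome and no length-biased sampling, so it is not eq. (11); the function name `aiptw_mean_difference` and the toy data are hypothetical.

```python
import numpy as np


def aiptw_mean_difference(y, d, ps, m1, m0):
    """Standard AIPTW contrast: outcome-model prediction plus weighted residual.

    ps is the propensity score; m1 and m0 are predicted mean outcomes under
    exposure and no exposure, respectively.
    """
    y, d, ps, m1, m0 = (np.asarray(v, dtype=float) for v in (y, d, ps, m1, m0))
    mu1 = np.mean(m1 + d * (y - m1) / ps)
    mu0 = np.mean(m0 + (1.0 - d) * (y - m0) / (1.0 - ps))
    return mu1 - mu0


# Toy data (hypothetical): the outcome model is exactly right, the propensity
# score is deliberately wrong (a constant).
x = np.array([0.0, 1.0, 2.0, 3.0])
d = np.array([1, 0, 1, 0])
m1_true, m0_true = 1.0 + x, x
y = d * m1_true + (1 - d) * m0_true
wrong_ps = np.full(x.size, 0.3)

effect_wrong_ps = aiptw_mean_difference(y, d, wrong_ps, m1_true, m0_true)

# With a null outcome model the estimator reduces to the Horvitz-Thompson form.
ps = np.array([0.6, 0.4, 0.7, 0.5])
zeros = np.zeros(x.size)
effect_null_model = aiptw_mean_difference(y, d, ps, zeros, zeros)
```

When the outcome model is correct the weighted residual term has mean zero whatever propensity score is plugged in, and when the propensity score is correct the residual term removes the bias of a wrong outcome model.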
5.2 Asymptotic properties of the WEE estimator
Theorem 2 presents the asymptotic properties of the DR treatment effect estimator obtained by eq. (11) in the presence of length-biased sampling using the AFT models when both the treatment assignment and are replaced by their estimated values.
Theorem 2 Let be a DR estimator corresponding to . Then under regularity conditions and , where is defined in the Appendix B.
Proof See Appendix B.
6 Simulation studies
We examine the performance of the proposed estimating equations for the Cox and the AFT models. In both cases, we simulate 1,000 datasets consisting of 200, 400 and 800 observations to study the performance of the proposed estimating equations for estimating the unmediated causal effect. Here, the censoring variable C is generated from a uniform distribution in the interval where the parameter is set such that it results in a desired censoring proportion. To create length-biased samples, we generate a variable A from a uniform distribution and ignore those whose generated unbiased failure time is less than A.
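The sampling scheme just described can be sketched as a rejection sampler. The population law used here (a Gamma distribution with shape 2), the bounds `a_max` and `c_max`, and the function name `sample_prevalent_cohort` are hypothetical choices for illustration and are not the simulation settings of this paper.

```python
import numpy as np


def sample_prevalent_cohort(n, a_max, c_max, rng):
    """Draw n length-biased, right-censored observations by rejection sampling."""
    a_kept, t_kept = [], []
    total = 0
    while total < n:
        # Hypothetical population failure-time law; a_max should exceed
        # essentially all of its mass for the sample to be length-biased.
        t = rng.gamma(shape=2.0, scale=1.0, size=10 * n)
        a = rng.uniform(0.0, a_max, size=t.size)
        keep = t >= a                      # discard subjects who fail before recruitment
        a_kept.append(a[keep])
        t_kept.append(t[keep])
        total += int(keep.sum())
    a = np.concatenate(a_kept)[:n]
    t = np.concatenate(t_kept)[:n]
    c = rng.uniform(0.0, c_max, size=n)    # residual censoring time
    r = t - a                              # residual lifetime
    y = a + np.minimum(r, c)               # observed survival time
    delta = (r <= c).astype(int)           # 1 indicates an observed failure
    return a, t, y, delta


rng = np.random.default_rng(0)
a, t, y, delta = sample_prevalent_cohort(20000, a_max=30.0, c_max=5.0, rng=rng)
```

For this population law the mean lifetime is 2, whereas the retained lifetimes have mean close to 3, the ratio of the second to the first population moment, which is the signature of length bias. Changing `c_max` changes the censoring proportion.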
6.1 Cox model
We generated the population failure times from the hazard model , where with uniformly distributed on (0,1), , where . The true marginal treatment effect computed by Monte Carlo is . We consider three different unadjusted scenarios: Unadjusted is an estimator for which neither the length-biased nor the confounding is adjusted, Unadjusted is obtained by adjusting for the length-biased sampling but leaving the confounding unadjusted, and Unadjusted is carried out by adjusting for the confounding while the length-biased sampling is left unadjusted. The estimating equations for these unadjusted cases are listed in Appendix E.
Table 1 summarizes the marginal estimated treatment effects and their standard errors. Our simulation results confirm that the proposed estimating eq. (7) under Cox model assumption adjusts for both confounding and the length-biased sampling and results in smaller MSE across different sample sizes.
6.2 AFT model estimation
We consider a nonlinear failure time model and include the exposure effect modifier by adding the interaction term between the treatment and a confounder () as follows, where is uniformly distributed on (–1,1), is uniformly distributed on (0,1), , and . The estimated treatment effects and their standard errors are listed in Table 2. Similar to the previous section, we consider three different unadjusted scenarios. We have used a correct conditional mean model in the DR estimating equation. The DR estimator dominates the two other estimators in terms of the standard deviation and the MSE. Increasing the censoring proportion slightly increases the finite-sample bias of the PSR, IPTW and DR estimators, but all three remain approximately unbiased. All the unadjusted estimators are biased and in our parameter setting it seems that the failure to account for the length-biased sampling leads to a more biased estimator compared to the Unadjusted. The estimating equations for the unadjusted cases are listed in Appendix E.
7 Real data analysis: the Canadian study of health and aging
The CSHA, initiated in 1989, is a nationwide study on aging in Canada. One of the objectives of CSHA was to study dementia. The CSHA included three phases in 1991, 1996 and 2001. In the first phase, 10,263 individuals aged 65 or over were sampled at random across Canada, from both rural and urban areas, from communities and institutions for the elderly. Among the participants, 1,132 people were diagnosed with dementia. The ages of dementia onset were assessed from each individual’s medical history. We analyze the data collected during the first phase of the study, which began in 1991 by sampling prevalent cases and examining the types of dementia: probable Alzheimer’s disease, possible Alzheimer’s disease and vascular dementia. The age at death or censoring was recorded for each subject from the time of screening, while the age at onset was ascertained retrospectively from caregivers using CAMDEX (Wolfson et al.). Gender, level of education and the type of dementia are available as baseline covariates. The timescale for survival is set in years.
7.1 Exposure of interest: institutionalization
One of the collected covariates is the dichotomous institutionalization (exposure) indicator, which takes the value one if the subject is institutionalized at the time of sampling, and zero otherwise. We are interested in comparing survival of institutionalized subjects with dementia and subjects recruited from the community.
Since some covariates confound the effect of the exposure on the survival time, the crude difference estimator will be biased. We estimate the effect of institutionalization, with confounding and length-biased sampling as two sources of estimation bias, using Cox PH and semiparametric AFT models. Our data include 818 subjects (after excluding patients with missing information), of whom 180 were right censored. The validity of the stationarity assumption has been shown to be reasonable by Addona and Wolfson and Asgharian et al.
In order to estimate the causal effect of institutionalization, we need to exclude individuals whose date of institutionalization is after the onset of their disease. However, this information was not recorded in the dataset. We address this limitation using a multiple imputation approach to generate synthetic data on which the estimating equations can be used. Using an informed model, we generate a binary variable, Z, conditional on age at onset and gender, that attempts to reveal whether institutionalization occurred prior to onset. Specifically, we used a logistic model to generate Z, and then excluded patients whose imputed Z indicates that institutionalization occurred after their onset time. We parametrized the model such that older patients and females are more likely to be institutionalized before the onset of dementia. The values of the parameters are extracted from Carrière and Pelletier, who estimate the relationship between sociodemographic characteristics and institutionalization of citizens of Canada. One limitation of our logistic model for Z is that we do not have all the covariates used in Carrière and Pelletier, such as income and marital status. We use the above model to fill in the missing variable repeatedly and create a collection of 20 imputed data sets.
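Estimates obtained from several imputed data sets are commonly combined with Rubin's rules. The text does not state how the 20 sets of results were pooled, so the sketch below is an assumption about a typical workflow rather than a description of the analysis; the function name `pool_rubin` and the toy numbers are hypothetical.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Pool per-imputation point estimates and variances with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    pooled = q.mean()                      # pooled point estimate
    within = u.mean()                      # average within-imputation variance
    between = q.var(ddof=1)                # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Toy numbers (hypothetical): three imputations instead of 20.
pooled_est, pooled_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term inflates the variance to reflect uncertainty about the imputed variable, which matters here because Z is generated from an external model.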
7.2 Semiparametric AFT models
Table 3 presents the estimated institutionalization effect on survival time obtained using the semiparametric estimating equations proposed in Section 5 under the AFT model. PSR is the estimator based on eq. (9), AWE is the weighted estimator based on eq. (10) and DR is the estimator based on eq. (11) described in Theorem 2. As in Section 6, we consider three different unadjusted scenarios. The results reveal that institutionalization has a significant positive effect on the survival time when estimated using AWE and PSR, while the DR estimator also yields a positive effect. The unadjusted estimator shows a small negative effect. In other words, without adjusting for either the length-biased sampling or the confounding, we might incorrectly conclude that institutionalized subjects tend to have a shorter survival time.
7.3 Cox PH model
Although the residual analysis shows that AFT is a suitable model for this data set (Bergeron et al.), we have also estimated the marginal institutionalization effect using the weighted estimating equation proposed for Cox models (Table 4). The proposed estimating equation for Cox models can be fitted using standard software, and is equivalent to a weighted Cox regression command in R fitted to the observed subset of the data, in which the weights wpi and hatwy are built from the estimated propensity score and from the Kaplan–Meier estimate of the distribution of the censoring variable, respectively. In this parameterization, the estimated coefficients indicate the increase/decrease in the hazard, whereas in the AFT model the coefficients indicate a decrease/increase in the survival time; hence coefficients of opposite sign in the AFT and PH models have the same interpretation. To determine whether the fitted Cox model adequately describes the data, we examined the scaled Schoenfeld residuals plot (Figure 2). There appears to be a trend in the scaled Schoenfeld residuals for the institution indicator variable, which indicates a violation of the PH assumption.
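A weight of the kind referred to as hatwy can be computed directly from the data. The sketch below assumes, following the approach of Qin and Shen, that the weight is the integral from 0 to y of the Kaplan–Meier estimate of the survivor function of the residual censoring time; that form is an assumption, since the expression is not reproduced here. The function names `km_survival` and `integrated_survival` and the toy data are hypothetical.

```python
import numpy as np


def km_survival(times, events):
    """Kaplan-Meier estimate: returns jump times and survival just after each."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    jump_times = np.unique(times[events == 1])
    surv = np.empty(jump_times.size)
    s = 1.0
    for k, tk in enumerate(jump_times):
        at_risk = np.sum(times >= tk)
        n_events = np.sum((times == tk) & (events == 1))
        s *= 1.0 - n_events / at_risk
        surv[k] = s
    return jump_times, surv


def integrated_survival(y, jump_times, surv):
    """Integral over [0, y] of the right-continuous step function S."""
    knots = np.concatenate(([0.0], jump_times))          # left ends of the steps
    values = np.concatenate(([1.0], surv))               # height of each step
    upper = np.minimum(np.concatenate((jump_times, [np.inf])), y)
    widths = np.clip(upper - knots, 0.0, None)           # step length inside [0, y]
    return float(np.sum(values * widths))


# Toy data (hypothetical): residual follow-up times and censoring indicators,
# where 1 marks a subject whose follow-up ended by censoring.
residual_times = np.array([1.0, 2.0, 3.0])
censored = np.array([1, 1, 1])
jumps, s_c = km_survival(residual_times, censored)
w_full = integrated_survival(3.0, jumps, s_c)
w_partial = integrated_survival(1.5, jumps, s_c)
```

When no subject is censored the estimated survivor function is identically one and the weight reduces to y itself, matching the uncensored length-bias weight.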
7.4 Survival curves
We compute adjusted and unadjusted survival curves to compare survival with dementia over time between the exposure groups (Figure 3). Several methods have been proposed to adjust for the length-biased sampling, such as the nonparametric maximum likelihood estimator [45–47], the truncation product-limit estimator and the maximum pseudo-partial likelihood estimator. Here, we use the method introduced by Huang and Qin, which incorporates the information from the marginal distribution of the truncation time from disease onset to recruitment time. The bias induced by confounding can be adjusted by creating a pseudo-population using the inverse probability of being in the group to which the individuals actually belong [51–53]. The adjusted survival curves show that the institutionalized patients tend to live longer, while the survival curves cross when unadjusted (Figure 3). Moreover, leaving the length-biased sampling unadjusted may lead to overestimation of the survival times, as shown in Figure 4. This figure shows that the survival curve of the institutionalized individuals is always higher than that of those recruited from the community.
8 Concluding remarks
We have presented two different approaches to estimate the exposure effect from right-censored length-biased samples. The estimating equations adjust for two different types of bias at the same time. Our simulation and real data analysis results highlight the importance of adjusting for the two sources of bias; failure to adjust for either the length-biased sampling or the confounding may lead to misleading results.
We have focused on the stationary case. It would, however, be of interest to extend the method to the general left truncation where the left truncation distribution is unknown. This latter approach is robust against departure from stationarity, though it is less efficient when the stationarity assumption holds [45, 46, 54].
Here, we present the assumptions and proofs of the main and other auxiliary results.
Appendix A: Regularity conditions
The regularity conditions required for the Cox and the weighted AFT models are as follows:
C.1 for is a twice continuously differentiable function.
C.2 is bounded away from zero and one ( where ).
Appendix B: Proofs of Theorems 1 and 2
Proof of Theorem 1
In Section 5, we have shown that the estimating equation given by eq. (13) is unbiased. We need to show that the two representations (12) and (13) are equal. By the definition of the stochastic process , we have where We have the following estimating equation when is replaced by its estimate, , where The estimating equation can be written as Using the strong consistency of to , we have (12)and following the martingale integral representation introduced by Shen et al.  and Qin and Shen , we can show that where with where is independent of and identically distributed to Y. Now we can derive the asymptotic variance of our proposed estimator when is replaced by in the propensity score model. Note (13)where Hence, using the Taylor expansion and Theorem 1 in Pugh et al. , where
Note: The asymptotic variance may be estimated consistently by replacing the expectations in the expressions for , and with expectations with respect to the empirical measure.
Proof of Theorem 2
Following eq. (12) and using the martingale integral representation , we have where and Hence, the generic elements of the class of influence functions can be written as where with In order to show that results in an unbiased estimator, we need to show that . For the first expectation, we have (14)The first expectation on the RHS of eq. (14) is where and . The second equality follows from and where is the propensity score estimated from the length-biased sample and is the true propensity score. The second expectation of the RHS of eq. (14) is also equal to zero since . Similarly, we can show that .
It can be shown that and are uncorrelated. Hence the asymptotic variance of the estimator is given by where and
Appendix C: Misspecified propensity score or mean model
In this appendix, we study the performance of the DR AFT estimator (11) when either the propensity score or the mean model is misspecified. We use the same simulation model as in Section 6.2 with only changing the treatment assignment model to . Our misspecified propensity score ignores the confounder . Table 5 shows results based on 500 data sets of sizes 200 and 800 with 0, 20 and 30 percent censoring. The superscript and represent the propensity score and mean model misspecifications, respectively. The misspecified propensity score ignores the variable and the misspecified mean model ignores the interaction term (see Section 6.2). The results confirm that our estimator is doubly robust.
Appendix D: Derivation of the score function
The score function is derived from the following pseudo-partial likelihood after adjusting the risk sets for the confounding and the length-biased sampling where represents the adjusted risk set for both length-biased sampling and confounding. Following Shen et al. and Qin and Shen, we estimate the denominator by where the focus is on the uncensored subjects and the risk set is inversely weighted by . Note that, under assumptions A1–A3, we have which justifies the form of the score function .
Appendix E: Cox and AFT estimating equations when either the confounding or the length-biased sampling is ignored
Estimating equation for the Cox model when the length-biased sampling is left unadjusted:
Estimating equation for the Cox model when the confounding is left unadjusted:
Estimating equation for the AFT model when the length-biased sampling is left unadjusted:
Estimating equation for the AFT model when the confounding is left unadjusted:
References
Cox DR, Lewis P. The statistical analysis of series of events. Monographs on applied probability and statistics. London: Chapman and Hall, 1966.
Wicksell SD. The corpuscle problem: a mathematical study of a biometric problem. Biometrika 1925;17:84–99.
Fisher RA. The effect of methods of ascertainment upon the estimation of frequencies. Ann Hum Genet 1934;6:13–25.
Pugh M, Robins J, Lipsitz S, Harrington D. Inference in the Cox proportional hazards model with missing covariate data. Technical report, Harvard School of Public Health, Dept. of Biostatistics, 1993.
Rotnitzky A, Robins JM. Inverse probability weighting in survival analysis. In: Armitage P, Colton T, editors. Encyclopedia of biostatistics, 2nd ed. New York: Wiley, 2005.
Wolfson C, Wolfson DB, Asgharian M, M’Lan CE, Østbye T, Rockwood K, et al. A reevaluation of the duration of survival after the onset of dementia. N Engl J Med 2001;344:1111–16.
Robins JM. Causal inference from complex longitudinal data. In: Berkane M, editor. Latent variable modeling and applications to causality. New York: Springer, 1997:69–117.
van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. New York: Springer Science & Business Media, 2003.
Cox DR, Oakes D. Analysis of survival data. Chapman & Hall/CRC, 1984.
Robins JM, Mark SD, Newey WK. Estimating exposure effects by modelling the expectation of exposure conditional on confounders. Biometrics 1992;48:479–95. URL http://www.jstor.org/stable/2532304.
Kalbfleisch JD, Prentice RL. The statistical analysis of failure time data. 2nd ed. New York: Wiley, 2002.
Hájek J, Dupač V. Sampling from a finite population. New York: Marcel Dekker, 1981.
Tsiatis AA. Semiparametric theory and missing data. New York: Springer Verlag, 2006.
Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. Proc Am Stat Assoc Sec Bayesian Stat Sci 1999;6–10.
Little R, Rubin DB. Statistical analysis with missing data. Vol. 539. New York: Wiley, 1987.
Asgharian M, Wolfson DB. Asymptotic behavior of the unconditional NPMLE of the length-biased survivor function from right censored prevalent cohort data. Ann Stat 2005;33:2109–31. URL http://www.jstor.org/stable/3448636.
Pepe MS, Fleming TR. Weighted Kaplan-Meier statistics: large sample and optimality considerations. J R Stat Soc Ser B (Methodol) 1991;53(2):341–52.
About the article
Published Online: 2015-03-21
Published in Print: 2015-05-01
Funding: The data reported in this article were collected as part of the Canadian Study of Health and Aging. The core study was funded by the Seniors’ Independence Research Program, through the National Health Research and Development Program (NHRDP) of Health Canada Project 6606-3954-MC(S). Additional funding was provided by Pfizer Canada Incorporated through the Medical Research Council/Pharmaceutical Manufacturers Association of Canada Health Activity Program, NHRDP Project 6603-1417-302(R), Bayer Incorporated, and the British Columbia Health Research Foundation Projects 38 (93-2) and 34 (96-1). The study was coordinated through the University of Ottawa and the Division of Aging and Seniors, Health Canada. The authors would like to thank Professor Christina Wolfson for providing the data. This work was supported in part by NIDA grant P50 DA010075 and NSF grant SES-1260782. The second and third authors acknowledge the support of Discovery Grants from the Natural Sciences and Engineering Research Council (NSERC) of Canada. Part of the work on this article was completed while the second author was on sabbatical leave at Universite de Bordeaux. He would like to thank the Equipe de Biostatistique, and in particular Daniel Commenges and Helene Jacqmin-Gadda, for their hospitality.