The International Journal of Biostatistics

Ed. by Chambaz, Antoine / Hubbard, Alan E. / van der Laan, Mark J.


Semiparametric Regression Analysis of Clustered Interval-Censored Failure Time Data with Informative Cluster Size

Xinyan Zhang (corresponding author; School of Public Health, Harvard University, 651 Huntington Ave. FXB 502, Boston, MA 02115, USA)
Jianguo Sun
Published Online: 2013-08-13 | DOI: https://doi.org/10.1515/ijb-2012-0047


Clustered interval-censored failure time data are commonly encountered in many medical settings. In such situations, one issue that often arises in practice is that the cluster size is related to the risk for the outcome of interest. It is well-known that ignoring the informativeness of the cluster size can result in biased parameter estimates. In this article, we consider regression analysis of clustered interval-censored data with informative cluster size with the focus on semiparametric methods. For the problem, two approaches are presented and investigated. One is a within-cluster resampling procedure and the other is a weighted estimating equation approach. Unlike previously published methods, the new approaches take into account cluster sizes and heterogeneous correlation structures without imposing strong parametric assumptions. A simulation experiment is carried out to evaluate the performance of the proposed approaches and indicates that they perform well for practical situations. The approaches are applied to a lymphatic filariasis study that motivated this study.

Keywords: estimating equation; informative cluster size; interval-censoring; proportional hazards model

1 Introduction

Interval-censored failure time data arise in numerous medical settings. In such situations, the time to an event of interest cannot be observed exactly but is known only to lie within a certain time interval. For example, in clinical trials, patients are often examined at pre-scheduled visits or, more generally, at a sequence of patient-specific time points. In this case, the exact time of a clinical event of interest cannot be observed; the event is known only to have occurred between two visits, thus yielding interval-censored data [1]. Such data become clustered interval-censored failure time data if there are two or more associated failure events of interest involved. In dental or family disease studies, for example, all teeth of the same person or all members of the same family are naturally clustered together. The cluster size, defined as the number of subunits within each cluster, could be fixed or may be a random variable related to the failure times of interest. The latter case is often referred to as clustered data with informative cluster size.

Data with informative cluster size arise when cluster size may change and contain relevant information about the risk for the outcome of interest. For example, in a dental study, the effects of behavioral factors such as cigarette smoking and hygiene status may predict tooth survival for the patients with chronic periodontitis. The patients with more teeth may have higher molar survival probability, and in this case, the number of teeth at initiation is likely informative about tooth survival. For such data, it is obvious that one may obtain biased results if applying the standard methods developed for the analysis of clustered interval-censored data that treat cluster sizes as non-informative constants.

A more specific example of clustered interval-censored failure time data with informative cluster size, and the one that motivated this study, is a lymphatic filariasis (LF) study conducted in Brazil, which compared the co-administration of diethylcarbamazine and albendazole (DEC/ALB, the new treatment) with DEC alone (the standard treatment) for the treatment of LF [2]. LF is caused by Wuchereria bancrofti, a thread-like worm. W. bancrofti larvae can be transmitted by infected mosquitoes to the lymphatic vessels and develop into adult filarial worms residing in nests in blood vessels. The live/death status of the adult worms can be monitored by ultrasound. In the study, 47 participants, 25 on DEC treatment and 22 on DEC/ALB treatment, were followed for 1 year after their treatment and periodically examined by ultrasound to check whether the worms were still alive. For the times to the clearance of the worms in each nest, the variable of interest, only clustered interval-censored data are available, with each study subject as a cluster and the cluster size being the number of nests of adult filarial worms in that subject. The number of nests per subject ranges from 1 to 5, and it was shown that the time to clearance was positively related to the number of nests (i.e. cluster size) that a subject had [2]. The average percentage of nests cleared during the 1-year follow-up was 82% for the participants with one nest but only 33% for the participants with four or five nests. More details about the study are given in Section 4.

There exists an extensive literature on regression analysis of clustered right-censored failure time data where the cluster size carries no relevant information about the failure times of interest [3–5]. At the same time, some methods for clustered data have also been developed for the situation where the cluster size is informative. For example, Dunson et al. [6] developed a Bayesian procedure that models the relationship between the failure times of interest and the cluster size through a latent variable. Williamson et al. [7] proposed a modified generalized estimating equation (GEE) method [8] in which the estimating equation is weighted by the inverse of the cluster size. Cong et al. [9] and Williamson et al. [2] presented a weighted score function approach and a within-cluster resampling (WCR) procedure for right-censored data. The former is essentially a weighted estimating equation (WEE) approach, whereas the latter samples a single subject from each cluster and transforms the data to usual univariate failure time data. In the WCR procedure, one observation is randomly sampled with replacement from each of the N clusters. The resampled dataset of size N can then be analyzed by a generalized linear model, since the observations in the resampled data are independent. If this process is repeated a large number of times, say Q, where each of the Q analyses provides a consistent estimator of the parameter of interest, one can obtain the WCR estimate by taking the average of the Q resample-based estimates. The key idea behind the WCR procedure is to avoid the correlation problem among the subjects within the same cluster. On the other hand, the method can be computationally intensive.

Although the literature on regression analysis of interval-censored failure time data is relatively limited, many authors have discussed the problem [1, 10, 11]. However, there exists little research on clustered interval-censored data with informative cluster size except Zhang and Sun [12], who generalized the parametric approach given in Williamson et al. [2] to clustered interval-censored failure time data. In addition, Kim [13] proposed a joint modeling approach to address both the association among failure times from the same cluster and the dependency of failure times on cluster size based on interval-censored data. However, in that article, the cluster size was incorporated as a covariate in the joint model, which may not be appropriate in some situations. When the cluster size is taken as a covariate, the covariate effect refers to a randomly selected individual given the number of individuals in the cluster. Our inferences in this article focus on the covariate effect for a randomly selected cluster with a random number of subunits, with the informative cluster size properly adjusted for.

Specifically, we present a semiparametric approach under the proportional hazards model. A semiparametric procedure is preferred over a parametric one, since the former is more flexible in terms of model assumptions. To estimate the unknown parameters, we apply the maximum likelihood approach to maximize the likelihood function, which involves a vector of finite-dimensional regression parameters and an infinite-dimensional nuisance parameter function [1]. Both the regression parameters and the nuisance function need to be estimated together, which could be very difficult. For this, we adopt the sieve maximum likelihood method discussed by Huang and Rossini [14] and establish estimating equations for the regression parameters and the baseline hazard simultaneously. The method approximates the infinite-dimensional nuisance function by piecewise linear functions and usually converges faster than the maximum likelihood approach.

In the remainder of this article, we first introduce in Section 2 some notation and then present two procedures for estimating covariate effects on clustered interval-censored failure time data with informative cluster size. In both procedures, we assume that the failure times of interest follow the proportional hazards model marginally and leave both the relationship among correlated failure times and the relationship between the failure times and the cluster sizes arbitrary. The first procedure is a WCR-based approach, and its main advantage is that it can be implemented relatively easily. The second procedure is a WEE-based approach and can be regarded as a generalization of the method given in Williamson et al. [2]. Section 3 gives some results from a simulation study conducted to evaluate the performance of the proposed estimation procedures, and the results indicate that the procedures perform well in practical situations. In Section 4, we apply the proposed methodology to the LF data discussed above, and Section 5 contains discussion and concluding remarks.

2 Estimation procedures

Consider a survival study that includes N independent clusters and in which each cluster includes some possibly related subunits. Let $T_{ij}$ denote the failure time of interest for the jth subunit in the ith cluster with an associated vector of covariates $X_{ij}$, $j = 1, \ldots, n_i$, $i = 1, \ldots, N$, where $n_i$ is the cluster size of the ith cluster. The covariates could be cluster-specific or subunit-specific. Under the scenario that the survival time of a randomly chosen subunit from a randomly chosen cluster is independent of the cluster sizes $n_i$, we assume that it follows the proportional hazards model [18] with the hazard function

$$\lambda(t \mid X_{ij}) = \lambda_0(t) \exp(\beta' X_{ij}) \tag{1}$$

given $X_{ij}$. Of note, the real failure times $T_{ij}$ in clustered interval-censored data with informative cluster size are correlated with the cluster size $n_i$. Here, $\lambda_0(t)$ denotes an unknown baseline hazard function, and $\beta$ is the vector of regression parameters. Under model (1), the marginal survival function of $T_{ij}$ has the form

$$S(t \mid X) = \exp\{ -\Lambda_0(t) \exp(\beta' X) \} \tag{2}$$

where $\Lambda_0(t) = \int_0^t \lambda_0(s)\, ds$ denotes the baseline cumulative hazard function and X is the vector of covariates. In the following, we assume that the failure times of interest $T_{ij}$ may be related to the cluster sizes $n_i$, and the main goal is to estimate the regression parameters $\beta$.

Suppose only interval-censored data are available and given by $\{ (L_{ij}, R_{ij}], X_{ij}, n_i;\ j = 1, \ldots, n_i,\ i = 1, \ldots, N \}$. Here $(L_{ij}, R_{ij}]$ denotes the interval within which subject j in cluster i is observed to fail. That is, $L_{ij} < T_{ij} \le R_{ij}$. We assume that $L_{ij}$ and $R_{ij}$ are independent of $T_{ij}$ given the covariates $X_{ij}$. Thus, the likelihood contribution from subject j in cluster i is given by

$$\mathcal{L}_{ij}(\beta, \Lambda_0) = \exp\{ -\Lambda_0(L_{ij}) \exp(\beta' X_{ij}) \} - \exp\{ -\Lambda_0(R_{ij}) \exp(\beta' X_{ij}) \}$$

where $\Lambda_0(L_{ij})$ and $\Lambda_0(R_{ij})$ are the baseline cumulative hazard function evaluated at times $L_{ij}$ and $R_{ij}$.
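As a minimal sketch of the likelihood contribution described above, the following Python function evaluates the difference of two survival probabilities under the proportional hazards model. The function and argument names are illustrative and do not come from the article; the baseline cumulative hazard values at the two interval endpoints are assumed to be supplied by the caller, and an infinite right-endpoint value is used here to represent a right-censored subunit.

```python
import math


def interval_likelihood(cum_haz_left, cum_haz_right, linear_predictor):
    """Likelihood contribution of one interval-censored subunit.

    cum_haz_left, cum_haz_right: baseline cumulative hazard at the left and
        right endpoints of the observed interval (math.inf on the right
        represents a right-censored subunit).
    linear_predictor: the value of beta' X for this subunit.
    """
    risk = math.exp(linear_predictor)
    surv_left = math.exp(-cum_haz_left * risk)
    # The survival probability at an infinite cumulative hazard is zero.
    if math.isinf(cum_haz_right):
        surv_right = 0.0
    else:
        surv_right = math.exp(-cum_haz_right * risk)
    return surv_left - surv_right


# Example: baseline cumulative hazard 0.5 and 1.0 at the endpoints, beta' X = 0.
print(interval_likelihood(0.5, 1.0, 0.0))
```

The contribution is simply the probability that the failure time falls in the observed interval, so it always lies between 0 and 1.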

To estimate $\beta$, we first present a resampling-based estimation procedure and then an estimating equation-based one.

2.1 A resampling-based estimation procedure

To estimate $\beta$, a commonly used method is to apply the maximum likelihood approach, which would require the specification of the joint distribution of $(T_{i1}, \ldots, T_{i n_i})$. However, this is not easy for correlated data. To avoid this, one can use the WCR approach to construct new data sets that each consist of one observation from each cluster, so that every new data set is an independent sample.

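Another complicated issue in fitting model (1) is the estimation of the infinite-dimensional function $\Lambda_0(t)$. For this, following Huang and Rossini [14], we propose to employ the sieve estimation approach that approximates $\Lambda_0(t)$ by piecewise linear functions. Specifically, let $0 = t_0 < t_1 < \cdots < t_m = \tau$ denote a set of partition time points, where $\tau$ denotes the longest follow-up time. Assume the baseline cumulative hazard function can be approximated by the piecewise linear function

$$\Lambda_0(t) = \gamma_{k-1} + \frac{t - t_{k-1}}{t_k - t_{k-1}} (\gamma_k - \gamma_{k-1}), \quad t_{k-1} < t \le t_k, \quad k = 1, \ldots, m,$$

where $\gamma_0 = 0$ and $\gamma_1 \le \cdots \le \gamma_m$ are unknown parameters. Instead of dealing with the monotone increasing $\gamma_k$, we could define the parameters $\alpha_k$ with $\gamma_1 = \exp(\alpha_1)$ and $\gamma_k - \gamma_{k-1} = \exp(\alpha_k)$, $k = 2, \ldots, m$, for ease of calculation.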
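The sieve approximation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cumulative hazard is taken to be zero at the first knot, the unconstrained parameters `alpha` are assumed to be the logarithms of the increments between consecutive knots (one way to enforce monotonicity), and the hazard is held flat beyond the last knot. The function names are hypothetical.

```python
import math
from bisect import bisect_left


def knot_values(alpha):
    """Map unconstrained parameters to increasing cumulative hazards at the knots.

    The value at the first knot is 0 and each later value adds exp(alpha_k),
    so the sequence is increasing for any real-valued alpha.
    """
    values = [0.0]
    for a in alpha:
        values.append(values[-1] + math.exp(a))
    return values


def cumulative_hazard(t, knots, alpha):
    """Piecewise linear baseline cumulative hazard evaluated at time t.

    knots: increasing partition points [t_0, ..., t_m] with t_0 = 0.
    alpha: m unconstrained parameters (log-increments, an assumption here).
    """
    values = knot_values(alpha)
    if t <= knots[0]:
        return 0.0
    if t >= knots[-1]:
        return values[-1]  # held flat beyond the last knot (a design choice)
    k = bisect_left(knots, t)  # knots[k - 1] < t <= knots[k]
    frac = (t - knots[k - 1]) / (knots[k] - knots[k - 1])
    return values[k - 1] + frac * (values[k] - values[k - 1])


knots = [0.0, 1.0, 2.0]
print(cumulative_hazard(1.5, knots, [0.0, 0.0]))
```

Working with log-increments means a generic unconstrained optimizer can be used without imposing order restrictions on the knot values.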

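Let $\theta = (\beta', \alpha')'$ denote all unknown parameters, with $\alpha$ being the parameters of the piecewise linear approximation to $\Lambda_0(t)$, and let Q be a positive integer. For each $q = 1, \ldots, Q$, one can randomly select a data point from each cluster. Let $\{ j_i^{(q)};\ i = 1, \ldots, N \}$ denote the set of indices of the selected data points. Define $\hat\theta_q$ to be the maximum likelihood estimate of $\theta$ given by the likelihood function

$$L_q(\theta) = \prod_{i=1}^{N} \mathcal{L}_{i j_i^{(q)}}(\theta),$$

where $\mathcal{L}_{ij}$ denotes the likelihood contribution from subject j in cluster i. It is apparent that $L_q(\theta)$ is the observed likelihood function if $n_i = 1$ for all i. By following Hoffman et al. [15], one can estimate $\theta$ by

$$\hat\theta_{wcr} = \frac{1}{Q} \sum_{q=1}^{Q} \hat\theta_q$$

and approximate the distribution of $\hat\theta_{wcr}$ by the multivariate normal distribution with the covariance matrix

$$\hat\Sigma = \frac{1}{Q} \sum_{q=1}^{Q} \hat\Sigma_q - \frac{1}{Q} \sum_{q=1}^{Q} (\hat\theta_q - \hat\theta_{wcr})(\hat\theta_q - \hat\theta_{wcr})'.$$

In the above, $\hat\Sigma_q$ denotes the estimate of the covariance matrix of $\hat\theta_q$ given by the inverse of the observed Fisher information matrix from $L_q(\theta)$.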
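The resampling scheme can be sketched generically in Python as follows. This is a minimal illustration, not the authors' implementation: the per-resample estimator is passed in as a hypothetical function `fit` returning an estimate and its estimated variance for a scalar parameter, and a sample mean stands in for the sieve maximum likelihood fit. The variance combines the average within-resample variance with the spread of the Q estimates, following the form described by Hoffman et al. [15].

```python
import random
import statistics


def within_cluster_resampling(clusters, fit, n_resamples, seed=0):
    """Average a scalar estimator over resamples of one observation per cluster.

    clusters: list of clusters, each a non-empty list of observations.
    fit: function mapping a list of N independent observations to a pair
        (estimate, estimated variance); a stand-in for the per-resample fit.
    """
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(n_resamples):
        # One randomly chosen observation from every cluster.
        sample = [rng.choice(cluster) for cluster in clusters]
        est, var = fit(sample)
        estimates.append(est)
        variances.append(var)
    wcr_estimate = statistics.fmean(estimates)
    mean_variance = statistics.fmean(variances)
    between = statistics.fmean([(e - wcr_estimate) ** 2 for e in estimates])
    # Average within-resample variance minus the between-resample variance.
    return wcr_estimate, mean_variance - between


def fit_mean(sample):
    """Stand-in estimator: sample mean and its estimated variance."""
    return statistics.fmean(sample), statistics.variance(sample) / len(sample)


clusters = [[1.0], [2.0, 2.5], [3.0, 2.0, 4.0], [4.0]]
print(within_cluster_resampling(clusters, fit_mean, n_resamples=200))
```

Because the variance estimate is a difference of two terms, it can turn out negative when the number of resamples is small, which is one practical reason to use a large Q.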

2.2 An estimating equation-based procedure

The main advantage of the WCR-based estimation procedure described above is its simplicity and flexibility in its implementation, as one can apply the algorithms developed for independent interval-censored failure time data [10] and it remains valid when cluster size is informative.

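As mentioned above, to use the maximum likelihood approach, one needs to specify the joint distribution of $(T_{i1}, \ldots, T_{i n_i})$. If one is willing to treat the $T_{ij}$ as independent, then the following pseudo-likelihood function

$$L_p(\beta, \alpha) = \prod_{i=1}^{N} \prod_{j=1}^{n_i} \mathcal{L}_{ij}(\beta, \alpha) \tag{3}$$

could be used for estimation of $\beta$ and $\alpha$ conditional on the cluster sizes $n_i$, where $\mathcal{L}_{ij}$ denotes the likelihood contribution from subject j in cluster i and $\alpha$ denotes the parameters of the piecewise linear approximation to $\Lambda_0(t)$. This would yield the following score estimating equation

$$\sum_{i=1}^{N} \sum_{j=1}^{n_i} \frac{\partial \log \mathcal{L}_{ij}(\theta)}{\partial \theta} = 0 \tag{4}$$

with $\theta = (\beta', \alpha')'$. Motivated by this, we propose to estimate $\beta$ and $\alpha$ by using the following estimating equations

$$U_\beta(\beta, \alpha) = \sum_{i=1}^{N} w_i \sum_{j=1}^{n_i} \frac{\partial \log \mathcal{L}_{ij}(\beta, \alpha)}{\partial \beta} = 0 \tag{5}$$

and

$$U_\alpha(\beta, \alpha) = \sum_{i=1}^{N} w_i \sum_{j=1}^{n_i} \frac{\partial \log \mathcal{L}_{ij}(\beta, \alpha)}{\partial \alpha} = 0 \tag{6}$$

with weights $w_i = 1/n_i$. It is easy to see that both $U_\beta$ and $U_\alpha$ are weighted score functions. Similar ideas have been used by Wei et al. [16].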

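Let $\hat\beta$ and $\hat\alpha$ denote the estimates of $\beta$ and $\alpha$ given by the estimating eqs (5) and (6), which can be solved by, for example, the Newton–Raphson algorithm. It can be shown that they are consistent and their joint distribution can be approximated by the multivariate normal distribution with the covariance matrix that can be consistently estimated by $\hat A^{-1} \hat B \hat A^{-1}$, where

$$\hat A = -\sum_{i=1}^{N} \frac{1}{n_i} \sum_{j=1}^{n_i} \frac{\partial^2 \log \mathcal{L}_{ij}(\hat\theta)}{\partial \theta\, \partial \theta'}, \qquad \hat B = \sum_{i=1}^{N} \left\{ \frac{1}{n_i} \sum_{j=1}^{n_i} \frac{\partial \log \mathcal{L}_{ij}(\hat\theta)}{\partial \theta} \right\} \left\{ \frac{1}{n_i} \sum_{j=1}^{n_i} \frac{\partial \log \mathcal{L}_{ij}(\hat\theta)}{\partial \theta} \right\}',$$

with $\theta = (\beta', \alpha')'$ and $\mathcal{L}_{ij}$ denoting the likelihood contribution from subject j in cluster i.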
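To illustrate how weighting each cluster by the inverse of its size changes the solution of a score equation, the following Python sketch solves a weighted estimating equation for a deliberately simplified model. The simplifications are assumptions made for illustration only: there are no covariates, the baseline hazard is a single constant `rate`, and the root is found by bisection rather than Newton–Raphson. All function names are hypothetical.

```python
import math


def subunit_score(rate, left, right):
    """Derivative in rate of log[exp(-rate*left) - exp(-rate*right)]."""
    s_left = math.exp(-rate * left)
    if math.isinf(right):  # right-censored subunit
        s_right, d_right = 0.0, 0.0
    else:
        s_right = math.exp(-rate * right)
        d_right = right * s_right
    return (-left * s_left + d_right) / (s_left - s_right)


def cluster_weighted_score(rate, clusters, weighted=True):
    """Sum of subunit scores, each cluster weighted by 1/size if requested."""
    total = 0.0
    for cluster in clusters:
        weight = 1.0 / len(cluster) if weighted else 1.0
        total += weight * sum(subunit_score(rate, l, r) for l, r in cluster)
    return total


def solve_rate(clusters, weighted=True, lo=1e-8, hi=50.0, tol=1e-12):
    """Bisection root of the score; assumes it is positive at lo, negative at hi."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if cluster_weighted_score(mid, clusters, weighted) > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# One cluster of size 1 that failed in (0, 1], and one cluster of size 4 in
# which a single subunit failed in (0, 1] and three were right-censored at 1.
clusters = [
    [(0.0, 1.0)],
    [(0.0, 1.0), (1.0, math.inf), (1.0, math.inf), (1.0, math.inf)],
]
print(solve_rate(clusters, weighted=True), solve_rate(clusters, weighted=False))
```

In this toy data set the larger cluster has the lower failure proportion, so the unweighted equation, which lets the large cluster dominate, yields a smaller rate than the weighted one, which treats the two clusters equally.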


3 A simulation study

We conducted a simulation study to evaluate the performance of the two estimation procedures presented in the previous section, with the focus on estimation of the regression parameter $\beta$. In the simulation, it was assumed that there existed two covariates $X_1$ and $X_2$, with $X_1$ being the cluster-specific covariate taking value 0 or 1 with probability 0.5 and $X_2$ being the subject-specific covariate generated from a uniform distribution over (0, 70). To generate the clustered failure times $T_{ij}$, we assumed that there existed a random sample of latent variables $b_i$ and that, given the $b_i$, the $T_{ij}$ were independent and followed the exponential distribution with hazard $b_i \lambda \exp(\beta_1 X_1 + \beta_2 X_2)$ for given constants $\lambda$, $\beta_1$, and $\beta_2$.

It was assumed that the $b_i$ follow the positive stable distribution defined by $b = \{A(\theta)/W\}^{(1-a)/a}$ [17], where a is a positive constant ($0 < a < 1$) [19], W is a random number from the exponential distribution with mean one, and $A(\theta) = \{\sin(a\theta)/\sin(\theta)\}^{1/(1-a)} \sin\{(1-a)\theta\}/\sin(a\theta)$, with $\theta$ being equal to $\pi$ times a random number from the uniform distribution over (0,1). The constant a above represents the correlation among the failure times within a cluster, with $a = 1$ corresponding to the independent case and $a \to 0$ meaning that they are completely determined by each other. It can be easily proved by Laplace transformation that for given $\beta_1$, $\beta_2$, and a in a conditional model, the parameters in the marginal survival function (2) will be $a\beta_1$ and $a\beta_2$. It was assumed that the informative cluster sizes $n_i$ were related to the $b_i$. If $b_i$ was less than or equal to the median of the positive stable distribution, $n_i$ was assumed to follow one binomial distribution and otherwise to follow a different binomial distribution. In both cases, the sizes 0 and 7 were discarded.
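The data-generating mechanism described above can be sketched in Python as follows. The positive stable draw uses a standard trigonometric representation (often attributed to Kanter) whose Laplace transform is exp(-s^a), which is the property behind the attenuation of the conditional regression parameters by the factor a in the marginal model. Everything else is an assumption made for illustration: the binomial settings (`trials=7`, `p_low`, `p_high`) are placeholders rather than the values used in the study, and the function names are hypothetical.

```python
import math
import random


def positive_stable(a, rng):
    """Draw b with Laplace transform E[exp(-s*b)] = exp(-s**a), 0 < a < 1."""
    u = rng.random()
    while u == 0.0:  # avoid theta = 0
        u = rng.random()
    theta = math.pi * u
    w = rng.expovariate(1.0)
    while w == 0.0:  # avoid division by zero
        w = rng.expovariate(1.0)
    ratio = math.sin(a * theta) / math.sin(theta)
    big_a = ratio ** (1.0 / (1.0 - a)) * math.sin((1.0 - a) * theta) / math.sin(a * theta)
    return (big_a / w) ** ((1.0 - a) / a)


def draw_cluster_size(b, median_b, rng, trials=7, p_low=0.3, p_high=0.7):
    """Informative cluster size: the binomial used depends on b (placeholder values)."""
    p = p_low if b <= median_b else p_high
    while True:
        size = sum(rng.random() < p for _ in range(trials))
        if 0 < size < trials:  # discard sizes 0 and `trials`
            return size


def simulate_cluster(a, lam, beta1, beta2, median_b, rng):
    """One cluster of (failure time, x1, x2) triples sharing a latent variable b."""
    b = positive_stable(a, rng)
    x1 = rng.randint(0, 1)  # cluster-specific binary covariate
    cluster = []
    for _ in range(draw_cluster_size(b, median_b, rng)):
        x2 = rng.uniform(0.0, 70.0)  # subject-specific covariate
        rate = b * lam * math.exp(beta1 * x1 + beta2 * x2)
        cluster.append((rng.expovariate(rate), x1, x2))
    return cluster


rng = random.Random(1)
draws = [positive_stable(0.5, rng) for _ in range(20000)]
median_b = sorted(draws)[len(draws) // 2]  # empirical median of b
print(len(simulate_cluster(0.5, 0.01, 0.5, 0.01, median_b, rng)))
```

The censoring intervals are not generated here; they would be drawn separately around each failure time according to the chosen examination scheme.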

For the generation of censoring intervals, we assumed that $L_{ij}$ and $R_{ij}$ followed a uniform distribution. With respect to the estimation of $\Lambda_0(t)$, the results given below are based on seven equally spaced time intervals. We tried different m and obtained similar results. In this simulation, we took a = 0.2, 0.5, and 0.8, corresponding to different correlation levels.

Table 1 presents the simulation results on estimation of $\beta_1$ and $\beta_2$ with N = 50 and a being 0.2, 0.5, or 0.8, based on 1,000 replications. The results include the means of the estimates given by each of the two procedures (MEAN), the averages of the estimated standard errors of the estimates (ESE), and the sample standard deviations of the estimates based on the simulated data (STD). In addition to the two estimates of $\beta$ proposed in Section 2, for comparison, we also calculated and include in the table the estimates obtained by using the estimating eqs (5) and (6) with all weights set to one, thus ignoring the informativeness of cluster size. The resulting estimates are referred to as the unweighted estimating equation estimates (UWEE) in the table, while the two estimates proposed in Section 2 are referred to as the WEE estimates (WEEE) and the WCR estimates (WCRE), respectively. For WCRE, the results are based on a fixed number of resamples Q.

Table 1

Estimation of regression parameters with N = 50.

Table 2

Estimation of regression parameters with N = 500.

The results above indicate that both proposed estimation procedures seem to work well, especially for cluster-specific covariates. For the unit-specific covariates, there seem to exist some biases in the WEE-based estimates with the number of clusters N = 50. They disappeared when we increased the number of clusters to N = 500, for which the results are given in Table 2. One can also see from Tables 1 and 2 that the UWEE seems to be biased. In other words, the strategy of weighting each cluster by the inverse of the cluster size indeed seems to effectively reduce biases that could be introduced by the informative cluster size.

4 Analysis of the lymphatic filariasis study

In this section, we apply the inference procedures proposed in Section 2 to the LF study discussed earlier, which examined the effect of the co-administration of DEC and ALB against DEC alone for the LF treatment. ALB is an anti-parasitic drug that is commonly used to treat intestinal worm infections. When co-administered with DEC, it helps break the cycle of LF transmission between mosquitoes and humans. By using ultrasound, the doctor can detect the movement of the living adult worms. As mentioned above, the study consists of 47 men followed for 1 year and periodically examined by ultrasound to determine whether any of the worms were still alive in the nests. The variable of interest is the time to the clearance of the worms in each nest. It took much longer to clear a nest in the men with more nests than in the men with fewer nests. In other words, the cluster size seems to be informative.

Among the 47 study subjects, 22 received the co-administration of DEC and ALB, while the others were given DEC alone. In total, 78 adult worm nests were detected by ultrasound and the cluster size, $n_i$, ranged from 1 to 5. Two covariates were of particular interest to the investigators. One is of course the treatment indicator, and the other is the age of the subject in years at baseline, ranging from 16 to 66. The observation time ranges from 0 to 360 days. In the analysis below, we chose 18 equally spaced time intervals and set the longest follow-up time to 360 days. For subject i, define $X_{i1}$ to be 0 if subject i was given the co-administration of DEC and ALB and 1 otherwise, and let $X_{i2}$ be the age of the subject, $i = 1, \ldots, 47$. Note that here we only have cluster-specific covariates.

Table 3

Estimation of covariate effects for the LF study.

Table 3 presents the estimated covariate effects given by the two estimation procedures proposed in the previous sections. The results include the estimated treatment and age effects on the time to the clearance of the worms along with the p-values for testing whether the effects equal zero. They suggest that there seems to be no significant difference between the co-administration of DEC and ALB and the use of DEC alone for the LF treatment. On the other hand, the patient age seems to be a significant factor related to the clearance of the worms. These results are similar to those given in Zhang and Sun [12] based on the parametric analysis.

In Table 3, we present several estimated treatment and age effects given by WCRE with the use of different values for Q. As shown in the simulation, the estimates seem to be robust for Q greater than 100. For the data here, we also performed the analysis using different numbers of knots m in the sieve approximation of the baseline cumulative hazard function, and the results are given in Table 4. We chose equally spaced knots between 0 and 360. It can be seen that all estimates are comparable.

Table 4

Estimation of covariate effects with different m.

5 Discussion and concluding remarks

This article discussed regression analysis of clustered interval-censored failure time data with informative cluster sizes. Two issues need to be addressed to conduct a valid analysis: one is to take into account the correlation among the failure times within the same cluster and the other is to deal with the informative cluster size. As shown in the simulation study, the analysis would give biased estimates if the informativeness of the cluster size is ignored. Interval-censoring makes the analysis very difficult, as it introduces incompleteness in the observed data and limits the amount of relevant information observed. Considering these issues, two semiparametric procedures, a WCR-based procedure and a WEE-based procedure, were developed. In the first procedure, to deal with the estimation of the baseline cumulative hazard function, we used the sieve maximum likelihood method. The simulation study conducted indicates that both procedures work well for practical situations.

As mentioned before, the main advantage of the WCR-based procedure is that it can be easily implemented. On the other hand, it could be computationally intensive. To implement it, one needs to choose the number of resamples Q, the number of knots m, and the location of the knots. The simulation results suggest that the resulting estimates of regression parameters seem to be robust to these choices. In the WEE-based procedure, we use the inverse of the cluster size as the weight to adjust for the informative cluster size, but this is not the only choice. One could apply more general weights, such as other functions of the cluster sizes. One may also want to include the information about the censoring intervals $(L_{ij}, R_{ij}]$. It would be useful to investigate such generalized estimation procedures and determine some optimal weights.

Note that both estimation procedures proposed here are marginal approaches, which have the advantage that they leave the correlation among failure times within the same cluster arbitrary. GEE-based approaches are easy to compute and robust to the misspecification of higher-order correlations among cluster members. However, they may suffer from a loss of efficiency. An alternative is to develop joint modeling approaches for clustered interval-censored data with informative cluster size using a shared frailty through a full likelihood procedure. Such an approach would be attractive if one is interested in the determination of the correlation and dependence structure. Another direction for future research is to develop formal methods that can be used to detect or test the informativeness of cluster sizes in clustered failure time data. A simple and informal approach is to divide study subjects or clusters into groups based on cluster sizes and to compare some characteristics. It would also be interesting to develop procedures for model checking or goodness-of-fit tests of the marginal model (1) [20].
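The informal check mentioned above, grouping clusters by size and comparing a simple characteristic, can be sketched as follows. The characteristic used here, the pooled proportion of subunits with an observed event, is one hypothetical choice among many, and the function name and toy data are illustrative only.

```python
from collections import defaultdict


def event_rate_by_cluster_size(clusters):
    """Pooled proportion of subunits with an event, grouped by cluster size.

    clusters: list of clusters, each a list of 0/1 event indicators.
    Returns a dict mapping cluster size to the pooled event proportion.
    """
    totals = defaultdict(lambda: [0, 0])  # size -> [events, subunits]
    for events in clusters:
        size = len(events)
        totals[size][0] += sum(events)
        totals[size][1] += size
    return {size: e / n for size, (e, n) in sorted(totals.items())}


toy_clusters = [[1], [1], [0], [1, 0, 0, 0], [0, 1, 0, 0]]
print(event_rate_by_cluster_size(toy_clusters))
```

A clear trend in the event proportion across sizes would suggest, informally, that the cluster size carries information about the outcome.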


We thank the editor and two referees for their very helpful comments and suggestions that greatly improved the paper. This work was partly supported by an NCI R01 grant to the second author.


References

1. Sun J. The statistical analysis of interval-censored failure time data. New York: Springer, 2006.

2. Williamson J, Kim HY, Manathuga A, Addiss DG. Modeling survival data with informative cluster size. Stat Med 2007;27:543–55.

3. Cai J, Prentice RL. Estimating equations for hazard ratio parameters based on correlated failure time data. Biometrika 1995;82:151–64.

4. Clayton DG, Cuzick J. Multivariate generalizations of the proportional hazards model. J Royal Stat Soc Ser A 1985;148:82–117.

5. Hougaard P. Modelling multivariate survival. Scand J Stat 1987;14:291–304.

6. Dunson DB, Chen Z, Harry J. Bayesian joint models of cluster size and subunit-specific outcomes. Biometrics 2003;59:521–30.

7. Williamson JM, Datta S, Satten GA. Marginal analyses of clustered data when cluster size is informative. Biometrics 2003;59:36–42.

8. Liang K, Zeger S. Longitudinal data analysis using generalized linear models. Biometrika 1986;73:13–22.

9. Cong X, Yin G, Shen Y. Marginal analysis of correlated failure time data with informative cluster size. Biometrics 2007;63:663–72.

10. Finkelstein DM, Wolfe RA. Isotonic regression for interval censored survival data using an E-M algorithm. Communications Stat Theory Methods 1986;15:2493–505.

11. Li L, Pu Z. Rank estimation of log-linear regression with interval censored data. Lifetime Data Analysis 2003;9:57–70.

12. Zhang X, Sun J. Regression analysis of clustered interval-censored failure time data with informative cluster size. Computational Statistics and Data Analysis 2010;54:1817–23.

13. Kim Y-J. Regression analysis of clustered interval-censored data with informative cluster size. Stat Med 2010;29:2956–62.

14. Huang J, Rossini AJ. Sieve estimation for the proportional odds model with interval-censoring. J Am Stat Assoc 1997;92:960–7.

15. Hoffman EB, Sen PK, Weinberg CR. Within cluster resampling. Biometrika 2001;88:1121–34.

16. Wei LJ, Lin DY, Weissfeld L. Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. J Am Stat Assoc 1989;84:1065–73.

17. Lee EW, Wei LJ, Amato DA. Cox-type regression analysis for large numbers of small groups of correlated failure time observations. In: Klein JP, Goel PK, editors. Survival analysis: state of the art. Dordrecht, the Netherlands: Kluwer Academic; 1992:237–47.

18. Cox DR. Partial likelihood. Biometrika 1975;62:269–76.

19. Hougaard P. Analysis of multivariate survival data. New York: Springer, 2000.

20. Wang L, Sun L, Sun J. A goodness-of-fit test for the marginal cox model for correlated interval-censored failure time data. Biometrical J 2006;5:1–9.

About the article

Published Online: 2013-08-13

Published in Print: 2013-11-01

Citation Information: The International Journal of Biostatistics, Volume 9, Issue 2, Pages 205–214, ISSN (Online) 1557-4679, ISSN (Print) 2194-573X, DOI: https://doi.org/10.1515/ijb-2012-0047.

© 2013 by De Gruyter.