Interval-censored failure time data arise in numerous medical settings. In such situations, the time to an event of interest cannot be observed exactly but is known only to lie within some time interval. For example, in clinical trials, patients are often examined at pre-scheduled visits or, more generally, at a sequence of patient-specific time points. In this case, the exact time of a clinical event of interest cannot be observed; the event is known only to have occurred between two visits, thus yielding interval-censored data [1]. Such data become clustered interval-censored failure time data if there are two or more associated failure events of interest involved. In dental or family disease studies, for example, all teeth of the same person or all members of the same family are naturally clustered together. The cluster size, defined as the number of subunits within each cluster, could be fixed or may be a random variable related to the failure times of interest. The latter case is often referred to as clustered data with informative cluster size.

Data with informative cluster size arise when the cluster size varies and carries relevant information about the risk of the outcome of interest. For example, in a dental study, behavioral factors such as cigarette smoking and hygiene status may predict tooth survival for patients with chronic periodontitis. Patients with more teeth may have higher molar survival probability, and in this case, the number of teeth at initiation is likely informative about tooth survival. For such data, one may obtain biased results by applying standard methods for clustered interval-censored data, which treat cluster sizes as non-informative constants.

A more specific example of clustered interval-censored failure time data with informative cluster size, and the one that motivated this study, is given by a lymphatic filariasis (LF) study conducted in Brazil, which compared the co-administration of diethylcarbamazine and albendazole (DEC/ALB, the new treatment) with DEC alone (the standard treatment) for the treatment of LF (Williamson et al. 2007). The cause of LF is *Wuchereria bancrofti*, a thread-like worm. *W. bancrofti* larvae can be transmitted by infected mosquitoes to the lymphatic vessels and develop into adult filarial worms residing in nests in blood vessels. The live/dead status of the adult worms can be monitored by ultrasound. In the study, 47 participants, 25 on DEC treatment and 22 on DEC/ALB treatment, were followed for 1 year after their treatment and periodically examined by ultrasound to check whether the worms were still alive. For the times to the clearance of the worms in each nest, the variable of interest, only clustered interval-censored data are available, with each study subject as a cluster and the cluster size being the number of nests of adult filarial worms in that subject. The number of nests per subject ranges from 1 to 5, and it was shown that the time to clearance was positively related to the number of nests (i.e. the cluster size) that a subject had [2]. The average percentage of nests cleared during the 1-year follow-up was 82% for the participants with one nest but only 33% for the participants with four or five nests. More details about the study are given in Section 4.

There exists an extensive literature on regression analysis of clustered right-censored failure time data where the cluster size carries no relevant information about the failure times of interest [3–5]. At the same time, some methods for clustered data have also been developed for the situation where the cluster size is informative. For example, Dunson et al. [6] developed a Bayesian procedure that models the relationship between the failure times of interest and the cluster size through a latent variable. Williamson et al. [7] proposed a modified generalized estimating equation (GEE) method [8], in which the estimating equation is weighted by the inverse of the cluster size. Cong et al. [9] and Williamson et al. (2007) presented a weighted score function approach and a within-cluster resampling (WCR) procedure for right-censored data. The former is essentially a weighted estimating equation (WEE) approach, whereas the latter samples a single subject from each cluster and thereby reduces the data to the usual univariate failure time data. In the WCR procedure, one observation is randomly sampled with replacement from each of the *N* clusters. The resampled dataset of size *N* can then be analyzed by a generalized linear model, since the observations in the resampled data are independent. If this process is repeated a large number of times, say *Q*, where each of the *Q* analyses provides a consistent estimator of the parameter of interest, one can obtain the WCR estimate by taking the average of the *Q* resample-based estimates. The key idea behind the WCR procedure is to avoid the correlation among the subjects within the same cluster. On the other hand, the method can be computationally intensive.
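The WCR steps described above can be sketched in a few lines. This is only an illustrative sketch, not the procedure of any cited article: the function name `wcr_estimate`, the toy data, and the use of a plain sample mean as the per-resample estimator (in place of a regression model) are all assumptions made for the example.

```python
import numpy as np

def wcr_estimate(values, cluster_ids, estimator, n_resamples=2000, seed=0):
    """Average an estimator over Q = n_resamples within-cluster resamples."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    # Index positions of the members of each of the N clusters.
    members = [np.flatnonzero(cluster_ids == c) for c in np.unique(cluster_ids)]
    estimates = np.empty(n_resamples)
    for q in range(n_resamples):
        # Draw one member per cluster, giving N independent observations.
        picked = np.array([rng.choice(idx) for idx in members])
        estimates[q] = estimator(values[picked])
    # The WCR estimate is the average of the Q resample-based estimates.
    return float(estimates.mean())

# Hypothetical toy data with informative cluster size:
# larger clusters tend to have larger outcomes.
values = [1.0, 2.0, 4.0, 4.0, 5.0, 6.0]
cluster_ids = [0, 1, 1, 2, 2, 2]

wcr_mean = wcr_estimate(values, cluster_ids, np.mean)
pooled_mean = float(np.mean(values))  # ignores clustering; large clusters dominate
```

In this toy example the pooled mean is 22/6, roughly 3.67, because the largest cluster contributes three observations. The WCR average should instead be close to 3, the mean of the three cluster means, since every cluster contributes exactly one observation to each resample. Weighting each observation by the inverse of its cluster size targets the same quantity without resampling, which is the intuition behind the WEE approach.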

Although the literature on regression analysis of interval-censored failure time data is relatively limited, many authors have discussed the problem [1, 10, 11]. However, there exists little research on clustered interval-censored data with informative cluster size. One exception is Zhang and Sun [12], who generalized the parametric approach given in Williamson et al. (2007) to clustered interval-censored failure time data. In addition, Kim [13] proposed a joint modeling approach for interval-censored data that addresses both the association among failure times from the same cluster and the dependency of failure times on cluster size. However, in that article, the cluster size was incorporated as a covariate in the joint model, which may not be appropriate in some situations. When the cluster size is taken as a covariate, the covariate effect refers to a randomly selected individual from a cluster of a given size. Our inferences in this article focus on the covariate effect for a randomly selected cluster with a random number of subunits, with the informative cluster size properly adjusted for.

Specifically, we present a semiparametric approach under the proportional hazards model. A semiparametric procedure is preferred over a parametric one, since the former is more flexible in terms of model assumptions. To estimate the unknown parameters, we maximize the likelihood function, which involves a finite-dimensional vector of regression parameters and an infinite-dimensional nuisance function [1]. The regression parameters and the nuisance function need to be estimated together, which can be very difficult. For this, we adopt the sieve maximum likelihood method discussed by Huang and Rossini [14] and establish estimating equations for the regression parameters and the baseline hazard simultaneously. The sieve method approximates the infinite-dimensional nuisance function by piecewise linear functions and usually converges faster than the full maximum likelihood approach.
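To make the idea of a piecewise linear sieve concrete, the following minimal sketch evaluates an interval-censored log-likelihood under a proportional hazards model. It rests on several assumptions that are not taken from the article: the function approximated is taken to be the cumulative baseline hazard, the knots and heights are hypothetical, the observations are treated as independent, and the names `cumulative_hazard` and `log_likelihood` are illustrative. Each failure time is assumed known only to lie in the interval (left, right], with an infinite right endpoint denoting right censoring.

```python
import numpy as np

def cumulative_hazard(t, knots, heights):
    """Piecewise linear approximation of the cumulative baseline hazard.

    `heights` holds the (nondecreasing) values at `knots`; values of t
    outside the knot range are clamped to the end values by np.interp.
    """
    return np.interp(t, knots, heights)

def log_likelihood(beta, heights, knots, left, right, z):
    """Interval-censored log-likelihood under proportional hazards.

    Each subject contributes S(left | z) - S(right | z), where
    S(t | z) = exp(-Lambda0(t) * exp(z @ beta)).
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    z = np.asarray(z, dtype=float)
    risk = np.exp(z @ np.asarray(beta, dtype=float))
    surv_left = np.exp(-cumulative_hazard(left, knots, heights) * risk)
    # Right-censored subjects (right = inf) have S(right) = 0.
    finite = np.isfinite(right)
    surv_right = np.zeros_like(surv_left)
    surv_right[finite] = np.exp(
        -cumulative_hazard(right[finite], knots, heights) * risk[finite]
    )
    return float(np.sum(np.log(surv_left - surv_right)))

# Hypothetical sieve with three knots and two subjects.
knots = np.array([0.0, 1.0, 2.0])
heights = np.array([0.0, 0.5, 1.0])
left = [0.0, 1.0]
right = [1.0, np.inf]       # second subject is right-censored at t = 1
z = [[0.0], [1.0]]          # one covariate per subject
ll = log_likelihood([0.0], heights, knots, left, right, z)
```

In a sieve approach of this kind, the heights at the knots become ordinary parameters, so the regression coefficients and the nuisance function can be handed jointly to a standard numerical optimizer, subject to the monotonicity constraint on the heights.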

In the remainder of this article, we first introduce in Section 2 some notation and then present two procedures for estimating covariate effects on clustered interval-censored failure time data with informative cluster size. In both procedures, we assume that the failure times of interest follow the proportional hazards model marginally and leave both the relationship among correlated failure times and the relationship between the failure times and the cluster sizes arbitrary. The first procedure is a WCR-based approach, and its main advantage is that it can be relatively easily implemented. The second procedure is a WEE-based approach and can be regarded as a generalization of the method given in Williamson et al. (2007). Section 3 gives some results from a simulation study conducted for evaluating the performance of the proposed estimation procedures, and the results indicate that the procedures perform well in practical situations. In Section 4, we apply the proposed methodology to the LF data discussed above, and Section 5 contains discussion and concluding remarks.
