Dependence measure for length-biased survival data using copulas

Abstract The linear correlation coefficient of Bravais-Pearson is considered a powerful indicator when the dependency relationship is linear and the error variate is normally distributed. Unfortunately, in finance and in survival analysis the dependency relationship may not be linear. In such cases, the use of rank-based measures of dependence, like Kendall's tau or Spearman's rho, is recommended. In this direction, under length-biased sampling, measures of the degree of dependence between the survival time and the covariates appear to have not received much attention in the literature. Our goal in this paper is to provide an alternative measure of dependence, based on the concept of information gain, using parametric copulas. In particular, an extension of Kent's [18] dependence measure to length-biased survival data is proposed. The performance of the proposed method is demonstrated through simulation studies.


Introduction
Survival data occur in many areas such as medicine, epidemiology, biology, economics and manufacturing. The principal goal in survival analysis is the study of the occurrence of a specific event. Most of the literature on length-biased sampled data concentrates on statistical methods for the survival function (e.g., [7]; [32]), estimating the density function (e.g., [4]; [17]), kernel smoothing [33], proportional hazards models [35] and covariate bias induced by length-biased sampling of failure times (e.g., [3]). The phenomenon of length-biased sampling appears naturally in many areas of research; see for instance [24] in land economics, [36] in screening and early detection of disease, and [34] in epidemiology and geriatric medicine. There are many situations where length-biased data arise without censoring, for example quality control problems for estimating fibre length distribution [7], shopping center sampling and mall intercept surveys [25]. For further examples of length-biased sampling, see [26].
The analogue of Kent's measure for length-biased survival data (see, e.g., [18]) has not received much attention in the literature. In this context, for example, it is of interest to know whether there exists any correlation between survival times with dementia and associated covariates such as age at onset, sex and years of education. In this sense, for more general regression models used in survival analysis, a measure of dependence can be defined using the concept of information gain (see, e.g., [18]; [19]). This concept generalizes more common measures such as the multiple correlation coefficient. The purpose of this paper is to extend the dependence measure of [18] under length-biased sampling. More specifically, we propose a new measure of dependence between survival time and one continuous covariate without censoring. The main idea consists in expressing the extended dependence measure in terms of the underlying copula under length-biased sampling.
The remainder of the paper is organized as follows. In Section 2, we introduce notation and present some preliminaries. In Section 3, we derive the dependence measure for length-biased data without censoring for the case of one continuous covariate, and we develop an estimation procedure for the proposed measure based on parametric copulas and a bootstrap technique. Section 4 presents a simulation study that investigates the performance of the proposed method.

Notations and preliminaries
In this section, we first introduce the concept of information gain and then, under length-biased sampling, we review distributions for length-biased data and present some general notions of copulas.

2.1. Concept of information gain
Let $(X, Y)$ be a random vector with true joint density $g(x, y)$ modelled by a parametric family $f(x, y; \theta)$, $\theta \in \Theta_1$. Suppose that $X$ and $Y$ are modelled as independent random variables under $\Theta_0 \subset \Theta_1$. For the comparison between the best fitting models under $\Theta_0$ and $\Theta_1$, [18] used Fraser information [10] to extend the work of [21] and provided the joint information gain to be

$$\Gamma(\theta_1 : \theta_0) = 2\{\Phi(\theta_1) - \Phi(\theta_0)\},$$

where $\Phi(\theta) = \iint \log f(x, y; \theta)\, g(x, y)\, dx\, dy$ and $\theta_i$ maximizes $\Phi(\theta)$ over $\Theta_i$, $i = 0, 1$. As the information gain increases, the model under $\Theta_1$ gets closer to the true density $g(x, y)$ compared with the model under $\Theta_0$. [18] proposed

$$\rho_J^2 = 1 - \exp\{-\Gamma(\theta_1 : \theta_0)\}$$

as a measure of dependence between $X$ and $Y$. On the other hand, if $X$ is modelled conditionally on $Y$ by a parametric family $f(x|y; \theta)$, $\theta \in \Theta_1$, [18] used conditional Fraser information [10] on the expected conditional log-likelihood $\Phi_C(\theta) = \iint \log f(x|y; \theta)\, g(x, y)\, dx\, dy$ in order to adapt the joint information gain to a conditional information gain, defined as

$$\Gamma_C(\theta_1 : \theta_0) = 2\{\Phi_C(\theta_1) - \Phi_C(\theta_0)\},$$

and the conditional dependence measure of [18] is

$$\rho_C^2 = 1 - \exp\{-\Gamma_C(\theta_1 : \theta_0)\}.$$

Note that, if $g(x, y) = f(x, y; \theta^*)$ for some $\theta^* \in \Theta_1$, then the information gain with respect to [18] reduces to twice the [20] information gain (see, e.g., [18]; [19]). When the concept of information is used, we need to assume that $\Gamma < \infty$ ($\Phi(\theta_i) < \infty$) and $\Gamma_C < \infty$ ($\Phi_C(\theta_i) < \infty$). Furthermore, since $\Theta_0 \subset \Theta_1$, $\Gamma$ and $\Gamma_C$ are always nonnegative. The measures $\rho_J^2$ and $\rho_C^2$ have the following properties (see [18]):
• if $X$ and $Y$ are two independent random variables, then $\rho_J^2 = 0$ ($\rho_C^2 = 0$ in conditional models);
• $0 \le \rho_J^2 < 1$ in continuous models. This is also true for $\rho_C^2$;
• under normal models, $\rho_J^2$ reduces to the squared product-moment correlation and $\rho_C^2$ is the squared multiple correlation coefficient.
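As a small numerical illustration of the map from information gain to dependence measure, the sketch below assumes a bivariate normal model with correlation r, for which twice the Kullback-Leibler divergence between the joint density and the product of its margins has the standard closed form -log(1 - r^2). The function names are hypothetical and serve only this example.

```python
import math


def information_gain_normal(r: float) -> float:
    """Joint information gain under a bivariate normal model with correlation r.

    Assumption: the gain is twice the Kullback-Leibler divergence between the
    joint normal density and the product of its margins, i.e. -log(1 - r^2).
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return -math.log(1.0 - r * r)


def dependence_measure(gain: float) -> float:
    """Map a nonnegative information gain to [0, 1) via 1 - exp(-gain)."""
    if gain < 0.0:
        raise ValueError("information gain must be nonnegative")
    return 1.0 - math.exp(-gain)


for r in (0.0, 0.3, 0.6, 0.9):
    gain = information_gain_normal(r)
    print(f"r = {r:.1f}  gain = {gain:.4f}  measure = {dependence_measure(gain):.4f}")
```

Under these assumptions the measure coincides with the squared correlation, in line with the behaviour under normal models noted above.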

2.2. Length-biased sampling and length-biased distributions
Length-biased sampling occurs when one naturally collects samples from a given population, but the sampling distribution is different from that of the target population. In such a case, not every unit in the population has an equal chance of being sampled when the natural sampling plan is adopted. For example, suppose that in a boys' school, data are collected on the number of brothers and sisters in the family of each boy. Since this is a boys' school and each family has at least one boy, the collected data are clearly a biased representation of the target population. We will give examples of length-biased distributions derived from discrete and continuous distributions. To do this, let $X$ be a discrete random variable representing the size of some group from a target population with probability mass function

$$f(k) = P(X = k), \quad k = 1, 2, \ldots$$

Suppose that a group from this target population is observed only when at least one of the individuals in the group is sighted and each individual has an independent probability $p$ of being observed. From [27], the probability that the observed group has $X = k$ individuals is given as

$$P(X = k \mid \text{group observed}) = \frac{\{1 - (1 - p)^k\}\, f(k)}{\sum_{j \ge 1} \{1 - (1 - p)^j\}\, f(j)}.$$

When $p$ is small, $1 - (1 - p)^k \approx kp$, and the distribution of the observed group size is

$$f_{LB}(k) = \frac{k f(k)}{E(X)}, \quad k = 1, 2, \ldots$$

This distribution is called the length-biased distribution derived from $f(k)$. Next we give an example of a length-biased distribution derived from a continuous distribution (see [1]). Let $U$ be a continuous random variable taking values in $(0, c)$, with density function $f_U(u)$. Let $T$ be the left truncation time, with density function $g(t)$, and independent of $U$. Suppose that a unit of size $u$ in the population is recorded only if $U \ge T$. Then, the joint density of $(U, T)$ given $U \ge T$ can be expressed as

$$f(u, t \mid U \ge T) = \frac{f_U(u)\, g(t)}{P(U \ge T)}$$

if $u \ge t$ and 0 otherwise. Now, if the onset times follow a stationary Poisson process, the truncation times are uniformly distributed over the interval $(0, c)$ and $P(U \ge c) = 0$, see [35]. It follows that

$$P(U \ge T) = \int_0^c \frac{1}{c}\, P(U \ge t)\, dt = \frac{\mu}{c},$$

where $\mu$ is the mean failure time. Therefore,

$$f(u, t \mid U \ge T) = \frac{f_U(u)}{\mu}, \quad 0 < t \le u < c.$$

The density function of $U$ conditional on $U \ge T$ is then

$$f(u \mid U \ge T) = \int_0^u \frac{f_U(u)}{\mu}\, dt = \frac{u f_U(u)}{\mu}.$$

Note that $f(u \mid U \ge T)$ is a length-biased density derived from $f_U(u)$.
[7] discussed several procedures used in sampling of textile fibres. One procedure is called length-biased and occurs when the chance of selection is proportional to fibre length. From [7], the length-biased density of a positive random variable (r.v.) $U$, which denotes the failure lifetime or survival time, is defined by

$$f_{LB}(u) = \frac{u f_U(u)}{\mu}, \quad u > 0, \tag{1}$$

where $f_U(u)$ is the unbiased density and $\mu = \int_0^\infty u f_U(u)\, du < \infty$. According to (1), we define the length-biased density of $U$ conditional on the covariate $Z = z$ as

$$f_{LB}(u|z) = \frac{u f_U(u|z)}{\mu(z)}, \tag{2}$$

where $\mu(z) = \int_0^\infty u f_U(u|z)\, du < \infty$ and $f_U(u|z)$ denotes the unbiased density corresponding to $f_{LB}(u|z)$. Under length-biased sampling, the covariate associated with the survival time follows a biased density

$$f_B(z) = \frac{\mu(z) f_Z(z)}{\mu}, \tag{3}$$

where $f_Z(z)$ is the unbiased density of the covariate (see [3]).
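To make definition (1) concrete, the following sketch applies it to an exponential unbiased density. The exponential choice, the rate value and the function names are illustrative assumptions; in that case the length-biased density should coincide with a Gamma density with shape 2 and the same rate.

```python
import math


def length_biased_density(unbiased_pdf, mean):
    """Return the function u -> u * f_U(u) / mu, the length-biased version of f_U."""
    def f_lb(u):
        return u * unbiased_pdf(u) / mean
    return f_lb


rate = 2.0                                   # hypothetical exponential rate
f_u = lambda u: rate * math.exp(-rate * u)   # unbiased density f_U, mean 1 / rate
f_lb = length_biased_density(f_u, 1.0 / rate)

# Gamma(shape 2, rate) density, the expected length-biased exponential density.
gamma2 = lambda u: rate ** 2 * u * math.exp(-rate * u)

# Midpoint Riemann sum on (0, 20) to check that f_LB integrates to about one.
step = 0.001
total = sum(f_lb((i + 0.5) * step) * step for i in range(20000))
print(abs(f_lb(1.3) - gamma2(1.3)) < 1e-12, round(total, 4))
```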

2.3. Some general notions of copulas and goodness-of-fit procedures
In this section, we recall some basic definitions and properties of copulas. Also, we provide some examples of parametric copulas and we discuss goodness-of-fit procedures.
In several research areas such as finance, medicine and biology, researchers are constantly striving to understand the dependence structure between two or more random variables. The relationship is described by the joint cumulative distribution function (CDF). However, determining this joint CDF can be a very tedious task. The concept of copulas is an innovative tool for modelling this dependence structure. Indeed, knowledge of this concept is essential to understanding many areas of application, in particular survival analysis. Thus, whenever it is necessary to model the dependence structure, we can use copulas.
Let $H$ be a joint distribution function of a random pair $(X, Y)$ and let $F$ and $G$ be the marginal distributions of $X$ and $Y$, respectively. The copula $C$ is simply the distribution corresponding to the random vector $(U, V)$ with uniform margins defined by $U = F(X) \sim U[0, 1]$ and $V = G(Y) \sim U[0, 1]$. Note that [31] provides an important link between the joint CDF $H$, the marginal distributions $F$ and $G$, and the copula $C$, described by the following representation:

$$H(x, y) = C(F(x), G(y)), \quad x, y \in \mathbb{R}. \tag{4}$$

If $F$ and $G$ are continuous then $C$ is unique; otherwise, $C$ is uniquely determined on $\mathrm{Ran}F \times \mathrm{Ran}G$, where $\mathrm{Ran}F$ is the range of $F$. Moreover, if a copula $C$ is twice differentiable then it admits a density defined by

$$c(u, v) = \frac{\partial^2 C(u, v)}{\partial u\, \partial v}. \tag{5}$$

From Sklar's theorem [31] with the representation in (4), we can see that the copula $C$ is independent of the marginal distributions. In addition, $C$ is considered as the dependence function associated with the random vector $(X, Y)$. In practice, Sklar's theorem is very useful because it allows $F$, $G$ and the dependence structure to be modelled separately. The following two examples illustrate some applications of this theorem [23].

Example 2.1. (Construction of bivariate distribution): Consider the following copula which is given in
If the marginal distributions of the random variables $X$ and $Y$ are given by $F(x) = G(x) = 1 - e^{-x}$, $x \ge 0$, then from (4) we get the following joint distribution of the random vector $(X, Y)$ [16], given by

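Example 2.2. (Extraction of copula from a given joint distribution). Let $H_\theta(x, y)$ be the joint distribution function of Gumbel's bivariate exponential distribution

$$H_\theta(x, y) = 1 - e^{-x} - e^{-y} + e^{-(x + y + \theta x y)}, \quad x \ge 0,\ y \ge 0,$$

and $H_\theta(x, y) = 0$ otherwise, where $\theta$ is a parameter in $[0, 1]$. Clearly, the marginals are exponentially distributed: $F(x) = 1 - e^{-x}$ and $G(y) = 1 - e^{-y}$. Hence the corresponding copula is

$$C_\theta(u, v) = u + v - 1 + (1 - u)(1 - v) \exp\{-\theta \log(1 - u) \log(1 - v)\}.$$

The parameter $\theta \in [0, 1]$ of the copula $C_\theta$ can be viewed as a dependence parameter.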
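The extraction in Example 2.2 can be checked numerically through representation (4). The sketch below assumes unit exponential margins and the standard form of Gumbel's bivariate exponential distribution; the function names and the test points are hypothetical.

```python
import math


def gumbel_biv_exp_cdf(x, y, theta):
    """Joint CDF of Gumbel's bivariate exponential distribution for x, y >= 0."""
    return 1.0 - math.exp(-x) - math.exp(-y) + math.exp(-(x + y + theta * x * y))


def gumbel_biv_exp_copula(u, v, theta):
    """Copula obtained by plugging the exponential quantiles into H_theta (u, v < 1)."""
    lu, lv = math.log(1.0 - u), math.log(1.0 - v)
    return u + v - 1.0 + (1.0 - u) * (1.0 - v) * math.exp(-theta * lu * lv)


theta = 0.5
for x, y in [(0.2, 1.5), (1.0, 1.0), (2.5, 0.3)]:
    u, v = 1.0 - math.exp(-x), 1.0 - math.exp(-y)   # margins F(x) and G(y)
    # Both columns should agree, since H(x, y) = C(F(x), G(y)).
    print(gumbel_biv_exp_cdf(x, y, theta), gumbel_biv_exp_copula(u, v, theta))
```

With theta = 0 the copula reduces to the product uv, the independence copula.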
An important property of copulas comes from the fact that copulas are invariant under strictly monotone transformations of the random variables. In other words, if $f$ and $g$ are strictly increasing transformations on $\mathrm{Ran}X$ and $\mathrm{Ran}Y$, respectively, then the random vectors $(X, Y)$ and $(f(X), g(Y))$ have the same copula. Next, we discuss an important class of copulas known as Archimedean copulas, defined in [11]. These copulas find a wide range of applications in practice for a number of reasons: the ease with which they can be constructed; the great variety of families of copulas which belong to this class; the many nice properties possessed by the members of this class. Furthermore, the dependency structure depends on a single parameter of the generator function $\phi$ defined below. Formally, an Archimedean copula is defined by the relation

$$C_\phi(u, v) = \phi^{-1}(\phi(u) + \phi(v)), \quad (u, v) \in [0, 1]^2,$$

where $\phi$ denotes a continuous, strictly decreasing convex function defined from $[0, 1]$ to $[0, \infty)$ such that $\phi(1) = 0$. The function $\phi^{-1}$ represents the inverse of $\phi$. The mapping $\phi$ is called the generator of the copula $C_\phi$. In what follows, we present some relevant examples of copulas widely used in practice. These parametric copulas will be utilized, in the simulation part, to illustrate the proposed estimation method.
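As an illustration of the Archimedean construction, the sketch below builds a copula from a generator and compares it with a closed form. It assumes the Clayton generator phi(t) = (t^(-alpha) - 1) / alpha with alpha > 0, one common parametrization; the function names are hypothetical.

```python
def clayton_generator(t, alpha):
    """Generator phi(t) = (t^(-alpha) - 1) / alpha, alpha > 0, with phi(1) = 0."""
    return (t ** (-alpha) - 1.0) / alpha


def clayton_generator_inverse(s, alpha):
    """Inverse generator phi^{-1}(s) = (1 + alpha * s)^(-1 / alpha), s >= 0."""
    return (1.0 + alpha * s) ** (-1.0 / alpha)


def archimedean_copula(u, v, alpha):
    """C(u, v) = phi^{-1}(phi(u) + phi(v)) for the Clayton generator."""
    s = clayton_generator(u, alpha) + clayton_generator(v, alpha)
    return clayton_generator_inverse(s, alpha)


def clayton_closed_form(u, v, alpha):
    """Closed form (u^(-alpha) + v^(-alpha) - 1)^(-1 / alpha) of the same copula."""
    return (u ** (-alpha) + v ** (-alpha) - 1.0) ** (-1.0 / alpha)


# The generator-based value and the closed form should agree.
print(archimedean_copula(0.3, 0.8, 2.0), clayton_closed_form(0.3, 0.8, 2.0))
```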
Let us now discuss goodness-of-fit (GOF) procedures for copulas. The concept of copulas, particularly Archimedean copulas, is frequently used as a good tool for describing the dependence between two random variables $X$ and $Y$ with continuous marginal distributions $F$ and $G$, respectively. Given a random sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ with joint CDF $H(x, y)$, the most frequent question is: which copula family is associated with $H(x, y)$? The GOF procedures for copulas, which we explain below, can be considered as a good practical answer to this question. Consider a continuous random vector $(X, Y)$ with margins $F$, $G$ and bivariate CDF $H$. Assume further that the copula $C$ of $(X, Y)$ belongs to a class of parametric copulas $\mathcal{C}_0 = \{C_\theta : \theta \in \Theta\}$, and let $(X_i, Y_i)$, $i = 1, \ldots, n$, denote independent copies of $(X, Y)$. Suppose one wants to choose between the null and alternative hypotheses of belonging or not to a given parametric family, namely

$$H_0 : C \in \mathcal{C}_0 \quad \text{versus} \quad H_1 : C \notin \mathcal{C}_0.$$

Several goodness-of-fit procedures for testing $H_0$ versus $H_1$ have been developed in the literature, e.g. [12], [29], [5], [8], [13], [28], [22], [15] and [2]. The formal GOF tests are rank-based. In other words, instead of using the observations $(X_i, Y_i)$, they use the pseudo-observations

$$(U_i, V_i) = \frac{n}{n + 1}\,\bigl(F_n(X_i), G_n(Y_i)\bigr), \quad i = 1, \ldots, n. \tag{11}$$

Here, $F_n$ and $G_n$ denote the empirical CDFs of $X$ and $Y$, respectively. Note that the pseudo-observations can be expressed as $U_i = R_i/(n + 1)$ and $V_i = S_i/(n + 1)$, where $R_i$ is the rank of $X_i$ among $X_1, \ldots, X_n$ and $S_i$ is the rank of $Y_i$ among $Y_1, \ldots, Y_n$, and considered as a sample from the copula $C$. In addition, they are not mutually independent and their components are only approximately uniform on $(0, 1)$. We note that the factor $n/(n + 1)$ in (11) is introduced to avoid problems with the density $c_\theta$ blowing up at the boundary of $[0, 1]^2$. The idea behind using the pseudo-observations is that the copula $C$ of a random vector is invariant under continuous, strictly increasing transformations of its components.
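The rank-based pseudo-observations can be computed as in the following sketch, which assumes a sample without ties. The function name is hypothetical.

```python
def pseudo_observations(xs, ys):
    """Return the pairs (R_i / (n + 1), S_i / (n + 1)) built from componentwise ranks.

    Assumption: no ties within xs or within ys, so ranks are well defined.
    """
    n = len(xs)

    def scaled_ranks(values):
        order = sorted(range(n), key=lambda i: values[i])
        ranks = [0] * n
        for rank, idx in enumerate(order, start=1):
            ranks[idx] = rank
        return [r / (n + 1) for r in ranks]

    return list(zip(scaled_ranks(xs), scaled_ranks(ys)))


xs = [3.0, 1.0, 2.0]
ys = [10.0, 30.0, 20.0]
print(pseudo_observations(xs, ys))
# Ranks, and hence pseudo-observations, do not change under increasing transformations.
print(pseudo_observations([x ** 3 for x in xs], ys) == pseudo_observations(xs, ys))
```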
The study of some GOF tests for copulas and their implementation using the copula package leads us to describe one method which is very useful in survival analysis. This approach gave the best results overall, as mentioned by [15]; later, [2] confirmed this remark after examining and comparing several GOF tests. In what follows, we describe a copula GOF test based on the empirical copula. For testing $H_0 : C \in \mathcal{C}_0$, [15] used the pseudo-observations $\mathbf{U}_1, \ldots, \mathbf{U}_n$, with $\mathbf{U}_i = (U_i, V_i)$, and proposed to work with a consistent estimator of the unknown copula $C$, in particular the empirical copula

$$C_n(u, v) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(U_i \le u,\ V_i \le v), \quad (u, v) \in [0, 1]^2. \tag{12}$$

[9] showed under various conditions that $C_n$ is a consistent estimator of the true underlying copula $C$. The idea in this approach is to compare $C_n$ with an estimator $C_{\theta_n}$ of $C$ obtained under $H_0 : C \in \mathcal{C}_0$. In a goodness-of-fit setting, [15] suggested to use the empirical process

$$\mathbb{C}_n = \sqrt{n}\,(C_n - C_{\theta_n}) \tag{13}$$

and the rank-based Cramér-von Mises statistic

$$S_n^{(E)} = \int_{[0,1]^2} \mathbb{C}_n(u, v)^2\, dC_n(u, v) = \sum_{i=1}^{n} \{C_n(U_i, V_i) - C_{\theta_n}(U_i, V_i)\}^2. \tag{14}$$

[14] established the convergence of (13), and showed that the test based on $S_n^{(E)}$ is consistent. A specific parametric bootstrap procedure, developed by [15], can be used to approximate the P-value for this statistic. The validity of this method has been shown in [14].
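The empirical copula and the resulting distance between it and a hypothesized copula can be sketched as below. The sketch assumes that pseudo-observations are already available and that the fitted copula is passed as a plain function; it computes only the statistic, not the parametric bootstrap P-value, and the function names are hypothetical.

```python
def empirical_copula(pseudo_obs):
    """Return C_n(u, v), the proportion of pseudo-observations with U_i <= u and V_i <= v."""
    n = len(pseudo_obs)

    def c_n(u, v):
        return sum(1 for ui, vi in pseudo_obs if ui <= u and vi <= v) / n

    return c_n


def cvm_statistic(pseudo_obs, fitted_copula):
    """Sum over the sample of {C_n(U_i, V_i) - C_fitted(U_i, V_i)}^2."""
    c_n = empirical_copula(pseudo_obs)
    return sum((c_n(u, v) - fitted_copula(u, v)) ** 2 for u, v in pseudo_obs)


# Toy comonotone pseudo-sample: it should sit closer to min(u, v) than to u * v.
sample = [(0.25, 0.25), (0.5, 0.5), (0.75, 0.75)]
print(cvm_statistic(sample, lambda u, v: min(u, v)))
print(cvm_statistic(sample, lambda u, v: u * v))
```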

Information gain under length-biased sampling based on copulas
The objective of this section is to exploit the concept of information gain, based on the parametric copula method, to derive the joint and conditional dependence measures under length-biased sampling without censoring for the case of one continuous covariate, and to provide an estimation method for these measures. To do this, let $U$ denote the length-biased survival time with CDF $G_{LB}(u; \lambda)$ and PDF $g_{LB}(u; \lambda)$, while $Z$ represents a continuous covariate with CDF $F_B(z; \psi)$ and PDF $f_B(z; \psi)$. Suppose that the random vector $(U, Z)$ has a parametric copula $C_\alpha$. Using Sklar's theorem, the joint length-biased CDF of $(U, Z)$ is

$$F_{LB}(u, z; \theta) = C_\alpha\bigl(G_{LB}(u; \lambda), F_B(z; \psi)\bigr), \tag{15}$$

where $\theta = (\alpha, \lambda, \psi)$ is the parameter vector of the model. The corresponding joint length-biased density of $(U, Z)$ is given as

$$f_{LB}(u, z; \theta) = c_\alpha\bigl(G_{LB}(u; \lambda), F_B(z; \psi)\bigr)\, g_{LB}(u; \lambda)\, f_B(z; \psi), \tag{16}$$

where $c_\alpha$ is the parametric copula density given in (5). Consequently, the conditional density of $U$ given $Z = z$ can be expressed in terms of the parametric copula density as

$$g_{LB}(u|z; \theta) = \frac{f_{LB}(u, z; \theta)}{f_B(z; \psi)} = c_\alpha\bigl(G_{LB}(u; \lambda), F_B(z; \psi)\bigr)\, g_{LB}(u; \lambda). \tag{17}$$

Most copula families $C_\alpha$, $\alpha \in \Theta$, contain the independence copula, that is, $C_{\alpha_0}$ coincides with the independence copula for some $\alpha_0 \in \Theta$. In this case the r.v.'s $U$ and $Z$ are independent, which implies that $f_B(z; \psi_0) = f_Z(z; \psi_0)$ and $F_B(z; \psi_0) = F_Z(z; \psi_0)$, where $F_Z(z; \psi_0)$ and $f_Z(z; \psi_0)$ are, respectively, the CDF and PDF of the unbiased covariate under the independence model. Therefore, if the covariate sample from the incident cases is available, one can estimate $\psi_0$ by the MLE $\hat\psi_0$. In this case, the parameter of the independence model becomes $\theta_0 = (\alpha_0, \lambda_0)$, which leads to
• $C_{\alpha_0}\bigl(G_{LB}(u; \lambda_0), F_Z(z; \hat\psi_0)\bigr) = G_{LB}(u; \lambda_0)\, F_Z(z; \hat\psi_0)$.
When the covariate sample from the incident cases is not available, one can use bootstrap techniques to obtain a new sample $Z_1^*, \ldots, Z_n^*$ following approximately $f_Z(z)$. First, consider a random sample $(U_i, Z_i)$, $i = 1, \ldots, n$, from $f_{LB}(u, z)$; in particular, $U = (U_1, \ldots, U_n)$ is a sample from $f_{LB}(u)$. Then, resample with replacement from the original sample $U$ to obtain a new sample $U^* = (U_1^*, \ldots, U_n^*)$ following approximately $f_U(u)$. The idea is that $U_i$ is chosen to be included in the new sample $U^*$ with probability $p_i$. For $j = 1, \ldots, n$, the probability $p_i$, $i = 1, \ldots, n$, can be found using (1) as

$$p_i = P(U_j^* = U_i) = \frac{\hat\mu}{n U_i} = \frac{U_i^{-1}}{\sum_{k=1}^{n} U_k^{-1}}. \tag{18}$$

Here, $\hat\mu = n \bigl(\sum_{i=1}^{n} U_i^{-1}\bigr)^{-1}$ is an estimator of $\mu$ in (1) (see [7]). Note that the bootstrap technique described above converges, as shown by [7]. Now, from $(U_i, Z_i)$, $i = 1, \ldots, n$, find $Z^* = (Z_1^*, \ldots, Z_n^*)$ associated with $U^* = (U_1^*, \ldots, U_n^*)$. Therefore, given this new sample $Z^*$, one can use the standard kernel density estimator in order to estimate the unbiased PDF $f_Z(z)$:

$$\hat f_Z(z) = \frac{1}{n} \sum_{i=1}^{n} K_h(z - Z_i^*), \tag{19}$$

where $K_h(s) = h^{-1} K(h^{-1} s)$, $h$ is the bandwidth of the estimator and $K : \mathbb{R} \to \mathbb{R}$ is any smooth function satisfying Assumptions 3.1. A practical estimator of the optimal bandwidth was proposed by Silverman (1986) as $\hat h_{opt} = 0.9\, \hat\sigma\, n^{-1/5}$, where $\hat\sigma = \min(s, R/1.34)$. Here, $s$ and $R$ are the standard deviation and interquartile range of the data, respectively. Note that the standard normal density is a very useful kernel function satisfying Assumptions 3.1.
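The resampling step and the kernel estimate can be sketched as follows, using only the Python standard library. The sketch assumes resampling weights proportional to 1/U_i, a Gaussian kernel and the Silverman bandwidth with constants 0.9 and 1.34; the function names and the toy data are hypothetical.

```python
import math
import random
import statistics


def inverse_length_weights(u):
    """Resampling probabilities p_i proportional to 1 / U_i (they sum to one)."""
    inv = [1.0 / ui for ui in u]
    total = sum(inv)
    return [w / total for w in inv]


def unbiased_resample(pairs, rng):
    """Draw n pairs (U*, Z*) with replacement from the pairs (U_i, Z_i) using p_i."""
    weights = inverse_length_weights([u for u, _ in pairs])
    return rng.choices(pairs, weights=weights, k=len(pairs))


def silverman_bandwidth(data):
    """Rule-of-thumb bandwidth 0.9 * min(s, R / 1.34) * n^(-1/5)."""
    s = statistics.stdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 0.9 * min(s, (q3 - q1) / 1.34) * len(data) ** (-0.2)


def gaussian_kde(data, h):
    """Kernel density estimate (1/n) * sum K_h(z - Z_i) with a standard normal kernel."""
    n = len(data)

    def f_hat(z):
        kernel_sum = sum(math.exp(-0.5 * ((z - zi) / h) ** 2) for zi in data)
        return kernel_sum / (n * h * math.sqrt(2.0 * math.pi))

    return f_hat


rng = random.Random(0)
pairs = [(1.0, 52.0), (2.0, 61.0), (4.0, 70.0), (0.5, 48.0)]   # toy (U_i, Z_i)
z_star = [z for _, z in unbiased_resample(pairs, rng)]
z_all = [z for _, z in pairs]
f_hat = gaussian_kde(z_all, silverman_bandwidth(z_all))
print(z_star, f_hat(55.0))
```

Shorter survival times receive larger weights, which compensates for their under-representation in a length-biased sample.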

3.1. Copula-based modelling of conditional information gain under length-biased sampling
Hereafter, we express the conditional information gain under length-biased sampling in terms of the underlying copula of (U, Z). The resulting formula is used to estimate the conditional measure of dependence.

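Proposition 3.2. Let $(U, Z)$ be a pair of random variables, possibly dependent, with true density $f_{LB}(u, z; \theta_1)$ given in (16). Under length-biased sampling, the conditional information gain, based on the parametric copula density, can be expressed as

$$\Gamma_C(\theta_1 : \theta_0) = 2 \iint \log\left\{ \frac{c_{\alpha_1}\bigl(G_{LB}(u; \lambda_1), F_B(z; \psi_1)\bigr)\, g_{LB}(u; \lambda_1)}{g_{LB}(u; \lambda_0)} \right\} f_{LB}(u, z; \theta_1)\, du\, dz. \tag{20}$$

Proof. By testing the two hypotheses $H_0 : \alpha = \alpha_0$ versus $H_1 : \alpha \ne \alpha_0$, twice the Kullback-Leibler (1951) information gain is

$$\Gamma_C(\theta_1 : \theta_0) = 2 \iint \log\left\{ \frac{g_{LB}(u|z; \theta_1)}{g_{LB}(u|z; \theta_0)} \right\} f_{LB}(u, z; \theta_1)\, du\, dz,$$

where $g_{LB}(u|z; \theta_1)$ is given by (17) and we used the fact that under the independence model $g_{LB}(u|z; \theta_0) = g_{LB}(u; \theta_0) = g_{LB}(u; \lambda_0)$.
Consequently, from Proposition 3.2, the conditional dependence measure with respect to the work of [18] is

$$\rho_C^2 = 1 - \exp\{-\Gamma_C(\theta_1 : \theta_0)\}.$$

In order to estimate the conditional information gain and the conditional dependence measure, let $(U_i, Z_i)$, $i = 1, \ldots, n$, be a random sample from $f_{LB}(u, z; \theta_1)$ given in (16). Based on Proposition 3.2, the conditional information gain can be formulated as

$$\Gamma_C = 2\, E_{\theta_1}\left[ \log\left\{ \frac{c_{\alpha_1}\bigl(G_{LB}(U; \lambda_1), F_B(Z; \psi_1)\bigr)\, g_{LB}(U; \lambda_1)}{g_{LB}(U; \lambda_0)} \right\} \right].$$

An estimator of $\Gamma_C$ is

$$\hat\Gamma_C = \frac{2}{n} \sum_{i=1}^{n} \log\left\{ \frac{c_{\hat\alpha_1}\bigl(G_{LB}(U_i; \hat\lambda_1), F_B(Z_i; \hat\psi_1)\bigr)\, g_{LB}(U_i; \hat\lambda_1)}{g_{LB}(U_i; \hat\lambda_0)} \right\}, \tag{23}$$

where $\hat\theta_1 = (\hat\alpha_1, \hat\lambda_1, \hat\psi_1)$ and $\hat\theta_0 = (\hat\alpha_0, \hat\lambda_0)$ are the parameter values that maximize, respectively, the observed log-likelihoods under the dependence model and under the independence model. Therefore, an estimator of the conditional measure of dependence is

$$\hat\rho_C^2 = 1 - \exp(-\hat\Gamma_C),$$

where $\hat\Gamma_C$ is given by (23).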
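A plug-in version of the conditional information gain estimator can be sketched as below. The sketch assumes that the fitted margins and copula density are supplied as plain functions (the maximum likelihood fitting itself is not shown), uses a Clayton density as one possible copula, and truncates a negative finite-sample gain at zero; all names and numerical values are hypothetical.

```python
import math


def clayton_density(u, v, alpha):
    """Clayton copula density for alpha > 0 and 0 < u, v < 1."""
    base = u ** (-alpha) + v ** (-alpha) - 1.0
    return (1.0 + alpha) * (u * v) ** (-alpha - 1.0) * base ** (-2.0 - 1.0 / alpha)


def conditional_gain_estimate(sample, copula_density, G_lb, F_b, g_lb_dep, g_lb_ind):
    """(2/n) * sum of log{ c(G_LB(U_i), F_B(Z_i)) * g_dep(U_i) / g_ind(U_i) }."""
    total = 0.0
    for u, z in sample:
        total += math.log(copula_density(G_lb(u), F_b(z)) * g_lb_dep(u) / g_lb_ind(u))
    return 2.0 * total / len(sample)


def conditional_dependence(gain):
    """1 - exp(-gain); a negative finite-sample gain is truncated at zero."""
    return 1.0 - math.exp(-max(gain, 0.0))


# Hypothetical fitted pieces: Gamma(2, 1) length-biased margin, exponential covariate CDF.
G_lb = lambda u: 1.0 - (1.0 + u) * math.exp(-u)
g_lb = lambda u: u * math.exp(-u)
F_b = lambda z: 1.0 - math.exp(-z)
sample = [(0.5, 0.4), (1.2, 1.0), (2.5, 2.2)]

gain = conditional_gain_estimate(
    sample, lambda a, b: clayton_density(a, b, 2.0), G_lb, F_b, g_lb, g_lb)
print(gain, conditional_dependence(gain))
```

When the copula density is identically one and the two marginal fits coincide, the estimated gain is exactly zero, which matches the independence case.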

3.2. Copula-based modelling of joint information gain under length-biased sampling
In this section, we provide a new way to estimate the joint information gain under length-biased sampling. To this end, we first establish an expression of twice the [20] information gain in terms of the parametric copula density.

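Proposition 3.3. Let $(U, Z)$ be a pair of random variables, possibly dependent, with true density $f_{LB}(u, z; \theta_1)$ given in (16). Under length-biased sampling, the joint information gain, based on the parametric copula density, is

$$\Gamma(\theta_1 : \theta_0) = 2 \iint \log\left\{ \frac{c_{\alpha_1}\bigl(G_{LB}(u; \lambda_1), F_B(z; \psi_1)\bigr)\, g_{LB}(u; \lambda_1)\, f_B(z; \psi_1)}{g_{LB}(u; \lambda_0)\, f_Z(z; \psi_0)} \right\} f_{LB}(u, z; \theta_1)\, du\, dz. \tag{25}$$

Proof. Twice the [20] information gain, by testing $H_0 : \alpha = \alpha_0$ versus $H_1 : \alpha \ne \alpha_0$, would be

$$\Gamma(\theta_1 : \theta_0) = 2 \iint \log\left\{ \frac{f_{LB}(u, z; \theta_1)}{f_{LB}(u, z; \theta_0)} \right\} f_{LB}(u, z; \theta_1)\, du\, dz,$$

where $f_{LB}(u, z; \theta_1)$ is given by (16) and we used the fact that under the independence model $f_{LB}(u, z; \theta_0) = g_{LB}(u; \lambda_0)\, f_Z(z; \psi_0)$.
From Proposition 3.3, the joint dependence measure with respect to the work of [18] is

$$\rho_J^2 = 1 - \exp\{-\Gamma(\theta_1 : \theta_0)\}. \tag{26}$$

It can be shown from Proposition 3.3 that one can write

$$\Gamma(\theta_1 : \theta_0) = \Gamma_C(\theta_1 : \theta_0) + \Gamma_B, \tag{27}$$

where $\Gamma_C$ is given by (20) and

$$\Gamma_B = 2 \int \log\left\{ \frac{f_B(z; \psi_1)}{f_Z(z; \psi_0)} \right\} f_B(z; \psi_1)\, dz$$

is the information gain obtained through knowledge of the bias of the covariate. In order to estimate the joint information gain and the joint dependence measure, let $(U_i, Z_i)$, $i = 1, \ldots, n$, be a random sample from $f_{LB}(u, z; \theta_1)$ given in (16). There exist two ways of estimating the joint information gain. The first method is based on (25). An estimator of $\Gamma$ is

$$\hat\Gamma = \frac{2}{n} \sum_{i=1}^{n} \log\left\{ \frac{c_{\hat\alpha_1}\bigl(G_{LB}(U_i; \hat\lambda_1), F_B(Z_i; \hat\psi_1)\bigr)\, g_{LB}(U_i; \hat\lambda_1)\, f_B(Z_i; \hat\psi_1)}{g_{LB}(U_i; \hat\lambda_0)\, f_Z(Z_i; \hat\psi_0)} \right\},$$

where $\hat\theta_1 = (\hat\alpha_1, \hat\lambda_1, \hat\psi_1)$ and $\hat\theta_0 = (\hat\alpha_0, \hat\lambda_0)$ are the parameter values that maximize, respectively, the observed log-likelihoods under the dependence model and under the independence model. The second method is based on (27). In this direction, an estimator of the joint information gain is

$$\hat\Gamma = \hat\Gamma_C + \hat\Gamma_B,$$

where $\hat\Gamma_C$ is given by (23) and the estimator of $\Gamma_B$ is

$$\hat\Gamma_B = \frac{2}{n} \sum_{i=1}^{n} \bigl\{ \log f_B(Z_i; \hat\psi_1) - \log f_Z(Z_i; \hat\psi_0) \bigr\}.$$

Hence, an estimator of the joint measure of dependence is

$$\hat\rho_J^2 = 1 - \exp(-\hat\Gamma).$$

We note that, in the case where the covariate sample from the incident cases is not available, a natural estimator of the unbiased density of the covariate, $f_Z$, is given by (19) as

$$\hat f_Z(z) = \frac{1}{n} \sum_{i=1}^{n} K_h(z - Z_i^*).$$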
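The second estimation route, which adds a covariate-bias term to the conditional gain, can be sketched as follows. The densities are passed as plain functions and the toy biased density 2z on (0, 1) is a hypothetical choice made only so that the result can be checked by hand; a negative finite-sample total is truncated at zero, which is an assumption of the sketch.

```python
import math


def covariate_bias_gain(z_sample, f_b, f_z):
    """(2/n) * sum of { log f_B(Z_i) - log f_Z(Z_i) } over the covariate sample."""
    n = len(z_sample)
    return 2.0 * sum(math.log(f_b(z)) - math.log(f_z(z)) for z in z_sample) / n


def joint_dependence(gain_c, gain_b):
    """Return the total gain gain_c + gain_b and the measure 1 - exp(-total)."""
    total = gain_c + gain_b
    return total, 1.0 - math.exp(-max(total, 0.0))


# Toy check: unbiased covariate uniform on (0, 1), biased density 2z (hypothetical).
f_z = lambda z: 1.0
f_b = lambda z: 2.0 * z
gain_b = covariate_bias_gain([0.5, 1.0], f_b, f_z)
print(gain_b, joint_dependence(0.0, gain_b))
```

When the biased and unbiased covariate densities coincide, the bias term vanishes and the joint and conditional measures agree.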

Simulation study
In this section, we develop some useful algorithms for simulating length-biased survival times with one continuous covariate using the parametric copula method. We also investigate the performance of this method through several simulations assessing the behaviour of the estimated information gain and dependence measure given length-biased data.

4.1. Simulation algorithms
The following algorithm allows us to simulate length-biased survival times from the length-biased distribution, if the CDF $G_{LB}$ and its inverse $G_{LB}^{-1}$ admit a closed form.
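When the inverse CDF is available in closed form, simulation can proceed by inverse-transform sampling, as in the sketch below. The sketch assumes an unbiased Pareto density a * u^(-a-1) on u >= 1 with a > 1, chosen only because its length-biased version is again Pareto with shape a - 1 and therefore has an explicit quantile function; the function names and parameter values are hypothetical.

```python
import random


def pareto_lb_cdf(u, a):
    """G_LB(u) = 1 - u^(-(a - 1)) for u >= 1 (length-biased Pareto, a > 1)."""
    return 1.0 - u ** (-(a - 1.0))


def pareto_lb_quantile(v, a):
    """Closed-form inverse G_LB^{-1}(v) = (1 - v)^(-1 / (a - 1)) for 0 <= v < 1."""
    return (1.0 - v) ** (-1.0 / (a - 1.0))


def sample_length_biased(n, a, rng):
    """Inverse-transform sampling: U_i = G_LB^{-1}(V_i) with V_i uniform on [0, 1)."""
    return [pareto_lb_quantile(rng.random(), a) for _ in range(n)]


rng = random.Random(2024)
draws = sample_length_biased(5, a=3.0, rng=rng)
print(draws)
```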
Often, it is difficult to simulate length-biased data directly from the length-biased distribution because, in general, the CDF $G_{LB}(u)$ and its inverse $G_{LB}^{-1}(u)$ may not have a closed form. In this case, we can use the following algorithm, which is based on bootstrap techniques. We note that the probability $p_i$ given in the latter algorithm can be obtained in the same way as for (18). Now, we develop some useful algorithms for the parametric copula method. On the one hand, we develop a method for simulating data from the joint unbiased density through its underlying copula. On the other hand, we derive a practical approach, based on bootstrap techniques, for simulating length-biased data from the joint length-biased density. Specifically, let $f_U(u; \lambda)$ and $F_U(u; \lambda)$ denote the unbiased PDF and unbiased CDF of the continuous r.v. $U$ (survival time). Also, let $f_Z(z; \psi)$ and $F_Z(z; \psi)$ be the unbiased PDF and unbiased CDF of the continuous r.v. $Z$ (covariate). From Sklar's theorem (4), the joint unbiased CDF of the random vector $(U, Z)$ can be written as a function of a parametric copula as follows:

$$F_U(u, z; \theta) = C_\alpha\bigl(F_U(u; \lambda), F_Z(z; \psi)\bigr),$$

where $\theta = (\alpha, \lambda, \psi)$. The joint unbiased density of the random vector $(U, Z)$, denoted by $f_U(u, z; \theta)$, can be derived from (4) provided $\partial^2 C_\alpha(u, v)/\partial u\, \partial v$ exists. Algorithm 4.3 can be used to simulate a random sample $(U_i^*, Z_i^*)$, $i = 1, \ldots, N$, directly from the joint unbiased density $f_U(u, z; \theta)$. However, as we will show, we cannot simulate a random sample $(U_i, Z_i)$, $i = 1, \ldots, n$, directly from the joint length-biased density $f_{LB}(u, z; \theta)$. A bootstrap technique will be proposed as a simple solution to this simulation problem. Using the fact that $\mu(z) = \int u f_U(u|z)\, du = f_Z(z)^{-1} \int u f_U(u, z)\, du$, the biased density (3) of the covariate can be written as

$$f_B(z) = \frac{1}{\mu} \int u f_U(u, z)\, du. \tag{34}$$

The length-biased density of $U$ conditional on $Z = z$ becomes

$$g_{LB}(u|z) = \frac{u f_U(u, z)}{\int s f_U(s, z)\, ds}. \tag{35}$$

Even for a given parametric copula associated with some known unbiased CDF $F_U(u, z)$, equations (34) and (35) cannot be used to simulate, directly, a random sample $(U_1, Z_1), \ldots, (U_n, Z_n)$ from $f_{LB}(u, z)$, because in most cases there are no closed forms for $f_B(z)$, $F_B(z)$, $F_B^{-1}(z)$, $g_{LB}(u|z)$, $G_{LB}(u|z)$ and $G_{LB}^{-1}(u|z)$. Hereafter, we describe an alternative way, based on bootstrap techniques, to simulate length-biased data. Given length-biased data $(U_i, Z_i)$, $i = 1, \ldots, n$, generated from Algorithm 4.4, the question is: which copula family is associated with the joint CDF $F_{LB}(u, z)$? A practical answer to this question is to use the goodness-of-fit procedures for copulas to find the parametric copula associated with the length-biased data. In such a case, we suggest using the goodness-of-fit statistic computed from the empirical copula process, $S_n^{(E)}$, given in (14). Table 1 leads to the conclusion that the test based on $S_n^{(E)}$ confirms that the Clayton copula family associated with the unbiased CDF $F_U(u, z)$ is the same as for the length-biased CDF $F_{LB}(u, z)$, but with different estimated values of the dependence parameter, denoted by $\hat\alpha_{LB}$, as shown in Table 2.

Table 2: Av. estimated dependence parameters $\hat\alpha$ and $\hat\alpha_{LB}$, based on 1000 replicates, for the Clayton copula associated with the CDFs $F_U(u, z)$ and $F_{LB}(u, z)$, respectively, for N = , n = m = , r = and p = .
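The two-step scheme described above, simulating from the joint unbiased density through the copula and then resampling towards the length-biased density, can be sketched as follows. The sketch assumes a Clayton copula sampled by conditional inversion, an exponential margin for the survival time, a uniform margin on (20, 80) for the covariate, and resampling weights proportional to the simulated survival times; these margins, the function names and the parameter values are illustrative assumptions and are not the settings used in the tables.

```python
import math
import random


def open_uniform(rng):
    """Uniform draw on the open interval (0, 1)."""
    x = rng.random()
    while x == 0.0:
        x = rng.random()
    return x


def clayton_pair(alpha, rng):
    """One draw (v1, v2) from the Clayton copula (alpha > 0) by conditional inversion."""
    v1 = open_uniform(rng)
    w = open_uniform(rng)
    v2 = ((w ** (-alpha / (1.0 + alpha)) - 1.0) * v1 ** (-alpha) + 1.0) ** (-1.0 / alpha)
    return v1, v2


def simulate_unbiased(n_big, alpha, rate, rng):
    """Simulate (U*, Z*) from the joint unbiased model via the copula and the margins."""
    pairs = []
    for _ in range(n_big):
        v1, v2 = clayton_pair(alpha, rng)
        u = -math.log(1.0 - v1) / rate    # exponential quantile (assumed margin)
        z = 20.0 + 60.0 * v2              # uniform(20, 80) quantile (assumed margin)
        pairs.append((u, z))
    return pairs


def length_biased_resample(pairs, n, rng):
    """Resample n pairs with probabilities proportional to the survival times U*_i."""
    weights = [u for u, _ in pairs]
    return rng.choices(pairs, weights=weights, k=n)


rng = random.Random(1)
unbiased = simulate_unbiased(2000, alpha=2.0, rate=1.0, rng=rng)
biased = length_biased_resample(unbiased, 500, rng)
print(sum(u for u, _ in unbiased) / len(unbiased), sum(u for u, _ in biased) / len(biased))
```

The resampled survival times are larger on average than the unbiased ones, as expected when selection is proportional to length.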
Now, based on the next algorithm, our principal objective is to examine, for different values of $\alpha$ given in Table 2, the behaviour of the information gain and dependence measure estimators. Recall that the copula family under length-biased sampling is the Clayton copula with dependence parameter denoted by $\alpha_{LB}$, and the length-biased density of the survival time $g_{LB}(u)$ is GG(r, p, k), where $k = 1 + r^{-1}$. For simplicity, the unbiased density $f_Z(z)$ and the biased density $f_B(z)$ are estimated with the kernel density estimator of the form (19). Let $\theta_{LB} = (\alpha_{LB}, r, p, k)$ denote the parameter of the model under length-biased sampling.
To estimate the joint dependence measure, Algorithm 4.6 can be used provided $\hat\Gamma_B = (2/n) \bigl\{ \sum_{i=1}^{n} \log f_B(Z_i) - \sum_{i=1}^{n} \log f_Z(Z_i) \bigr\}$, $\hat\Gamma = \hat\Gamma_C + \hat\Gamma_B$ and $\hat\rho_J^2(U, Z) = 1 - e^{-\hat\Gamma}$. Table 3 indicates the average maximum likelihood estimators of $\theta_{LB}$ under hypotheses $H_0$ and $H_1$, using the parametric copula method.

The most important remark from Table 4 is that the estimated conditional and joint dependence measures are slightly different due to the small values of $\hat\Gamma_B$ for all estimated values of $\hat\alpha_{LB}$. This can be explained, simply, by the initial choice of the model parameters. In particular, if the shape parameter r = . , the parametric copula associated with the CDF $F_{LB}(u, z)$ is always the Clayton copula.

Table 5 indicates that the new value of the shape r = . influences $\hat\Gamma_B$ and $\hat\Gamma_C$ considerably. Consequently, from Table 6, the difference between the estimated conditional and joint dependence measures is very significant, and hence we can conclude that given length-biased data we cannot ignore the potential effect of the covariate on the survival time.

Table 6: Av. estimated information gain and dependence measure given simulated length-biased data, using parametric copula method, for N = , n = m = , r = . and p = .

Discussion
This paper provides a measure of dependence for length-biased survival data by extending the dependence measure of [18] under length-biased sampling. More specifically, we looked at a measure of dependence between survival time (without censoring) and one continuous covariate. In this direction, we developed a parametric copula method based on information gain. It would be interesting to adapt this approach to several continuous covariates, especially under censoring, and to consider other types of covariates in the model. This can be done using the concept of copulas which takes into account censored data.