A kernel- and optimal transport- based test of independence between covariates and right-censored lifetimes

David Rindt , Dino Sejdinovic and David Steinsaltz

Abstract

We propose a nonparametric test of independence, termed optHSIC, between a covariate and a right-censored lifetime. Because the presence of censoring creates a challenge in applying the standard permutation-based testing approaches, we use optimal transport to transform the censored dataset into an uncensored one, while preserving the relevant dependencies. We then apply a permutation test using the kernel-based dependence measure as a statistic to the transformed dataset. The type 1 error is proven to be correct in the case where censoring is independent of the covariate. Experiments indicate that optHSIC has power against a much wider class of alternatives than Cox proportional hazards regression and that it has the correct type 1 control even in the challenging cases where censoring strongly depends on the covariate.

1 Introduction

We propose a nonparametric test of independence between a possibly multidimensional covariate and a right-censored lifetime. Existing approaches to this problem suffer from several limitations: if we cluster the continuous covariate into groups, and then test for equality of lifetime distributions among the groups, the results will depend on the arbitrary choice of boundaries between the groups, while the spread of covariates within groups reduces power. Alternatively, one might fit a (semi-)parametric regression model, and test whether the regression coefficient corresponding to the covariate differs significantly from zero. The most commonly used such method is the Cox proportional hazards (CPH) model, which makes two assumptions [1]: first, the hazard function must factorize into a function of time and a function of the covariate (the proportional hazards or relative risk condition); second, the effect of a covariate on the logarithm of the hazard function must be linear. Although this is a flexible model, in some cases these assumptions are violated. More complicated hazards are found for example when studying the relationship between body mass index and mortality – [2] reports U- or V-shaped hazards – or between diastolic blood pressure and various health outcomes [3].

Since distance- and kernel-based approaches have been used successfully for independence testing on uncensored data [4], [5], it is natural to investigate whether these methods can be extended to the case of right-censored lifetimes. To this end we propose applying optimal transport to transform the censored dataset into an uncensored dataset in such a way that, (1) the new uncensored dataset preserves the dependencies of the original dataset, and (2) we can apply a standard permutation test to the new dataset with test statistic given by Distance Covariance (DCOV) [5] or, equivalently, the Hilbert–Schmidt Independence Criterion (HSIC) [4].

Progress in kernel-based independence testing for censored data is further motivated by the fact that in the simpler context of uncensored data the corresponding methods have been further developed into tests of conditional independence, mutual independence, and have been applied to causal inference [6], [7], and in particular detection of confounders. While the present work does not propose methods for testing conditional or mutual independence, we believe independence testing is a first step towards those ends. Additionally, since our method allows for multidimensional covariates, one can first test for a dependency based on the full multidimensional covariate, and then test whether the dependency remains when certain sub-dimensions are omitted from the covariate.

Section 2 overviews relevant concepts in survival analysis, distance- and kernel-based independence testing, and optimal transport. Section 3 proposes a transformation of the data based on optimal transport. Section 4 introduces our testing procedure named optHSIC. Although we have not yet been able to prove control of the type 1 error rate in full generality, we do show the type 1 error rate to be correct in the case where censoring is independent of the covariate. Furthermore we obtain very promising results in simulation studies, showing correct type 1 error control even under censoring that depends strongly on the covariate. Section 5 explores alternative kernel-based approaches under the additional assumption that censoring is independent of the covariate. These methods serve as benchmarks for the power performance of optHSIC. Section 6 compares the power and type 1 error of all tests and CPH regression in simulated data.

2 Background material

Let T 0 be a lifetime subject to right-censoring, so that we do not observe T directly, but instead observe Z min { T , C } for some censoring time C 0 , as well as the indicator Δ 1 { C > T } . We further observe a covariate vector X d , where d is equipped with the Borel sigma algebra. In total, for an i.i.d. sample of size n, we thus obtain the data D ( ( x i , z i , δ i ) ) i = 1 n ( d × 0 × { 0 , 1 } ) n .

The main goal of this paper is developing a test of H 0 : X T versus H 1 : X T based on the sample D. Throughout this paper we will make the following assumptions.

Assumption 1:

We assume that conditional on { X i } i = 1 n , the random variables { ( T i , C i ) } i = 1 n are mutually independent.

Assumption 2:

We assume that C T | X .

Denote F T | X ( t | x ) = P ( T t | X = x ) and F C | X ( t | x ) = P ( C t | X = x ) . We assume F T | X has a density f T | X . Let S T | X ( t | x ) = 1 F T | X ( t | X ) . We define the hazard rate of an individual with covariate x to be λ T | X ( t | x ) = f T | X ( t | x ) / S T | X ( t | x ) . The CPH assumes that the hazard rate can be written as λ T | X ( t | x ) = λ ( t ) exp ( β T x ) for some baseline hazard λ ( t ) and a vector β d . This model enables estimation of β and testing the significance of the difference of entries of β from zero. The CPH model is the most commonly used regression method in survival analysis.

A last important concept is that of the ‘individuals at risk at a time t’. By this we mean the set { i : Z i t } . We also use the notation [ a 1 , , a k ] for a multiset (a set with potentially repeated elements) with elements a 1 , , a k . We refer, for example, to the multiset AR t = [ x i : i { 1 , , n } , Z i t ] , the multiset of ‘covariates at risk at time t’.

2.2 Independence testing using kernels

Kernel methods have been successfully used for nonparametric independence- and two-sample testing [4], [8]. We now give some of the relevant background in kernel methods.

Definition 2.1:

(Reproducing Kernel Hilbert Space) [9] Let X be a non-empty set and H a Hilbert space of functions f : X endowed with dot product , . Then H is called a reproducing kernel Hilbert (RKHS) space if there exists a function k : X × X with the following properties.

1. k satisfies the reproducing property

(1) f , k ( x , ) = f ( x ) f o r a l l f H , x X .

2. k spans H, that is, H = span { k ( x , ) | x X } where the bar denotes the completion of the space.

Let X together with a sigma-algebra be a measurable space and let H X be an RKHS on X with reproducing kernel k. Let P be a probability measure on X . If E P k ( X , X ) < , then there exists an element μ P H X such that E P f ( X ) = f , μ P for all f H X [8], where the notation E P f ( X ) is defined to be f ( x ) P ( d x ) . The element μ P is called the mean embedding of P in H X . Given a second distribution Q on X , for which a mean embedding exists, we can measure the dissimilarity of P and Q by the distance between their mean embeddings in H X :

MMD ( P , Q ) μ P μ Q H X ,

which is also called the Maximum Mean Discrepancy (MMD). The name comes from the following equality [8],

μ P μ Q H X = sup f H X : | | f | | 1 E P f ( X ) E Q f ( X )

showing that MMD is an integral probability metric. Given a sample { x i } i = 1 n and the empirical distribution, i = 1 n δ ( x i ) / n , where δ ( x ) denotes the Dirac measure at x, the corresponding mean embedding is given by i = 1 n k ( x i , ) / n .

Suppose now that Y together with some sigma algebra is a second measurable space, and let H Y be an RKHS on Y with kernel l. Let X be a random variable in X with law P X and similarly let Y be a random variable in Y with law P Y . Finally let P X Y denote the joint distribution on X × Y equipped with the product sigma-algebra. We let H denote the RKHS on X × Y with kernel

K ( ( x , y ) , ( x , y ) ) k ( x , x ) l ( y , y ) .

In [4] it was proposed that the dependence of X and Y could be quantified by the following measure:

Definition 2.2:

The Hilbert–Schmidt independence criterion (HSIC) of X and Y is defined by

HSIC ( X , Y ) μ P X Y μ P X P Y H 2

where P X P Y denotes the product measure of P X and P Y .

Let ( X , Y ) , ( X , Y ) and ( X , Y ) be three mutually independent copies of the same random variable with law P X Y . Using the reproducing property and the definition of mean embeddings, it can be shown that

HSIC ( X , Y ) = E X Y E X Y k ( X , X ) l ( Y , Y ) + E X X k ( X , X ) E Y Y l ( Y , Y ) 2 E X Y E X Y k ( X , X ) l ( Y , Y ) .

Now assume we are given a sample D = ( ( x i , y i ) ) i = 1 n of independent observations of the random pair ( X , Y ) . An empirical estimate of HSIC ( X , Y ) can be obtained by measuring the distance between the embedding of the empirical distribution of the data and the embedding of the product of the marginal empirical distributions. That is, we define HSIC ( D ) by

HSIC ( D ) 1 n i = 1 n K ( ( x i , y i ) , ) 1 n 2 i = 1 n j = 1 n K ( ( x i , y j ) , ) H 2 .

Using the reproducing property of the kernel and the definition of K in terms of k and l, HSIC ( D ) can be shown to equal

HSIC ( D ) = 1 n 2 i , j = 1 n k ( x i , x j ) l ( y i , y j ) + 1 n 2 i , j = 1 k ( x i , x j ) 1 n 2 q , r = 1 n l ( y q , y r ) 2 n 3 i , j , r = 1 n k ( x i , x j ) l ( y i , y r ) .

While the biased HSIC ( D ) defined above is the most commonly used estimator in the literature [4], [5], unbiased estimators of HSIC ( X , Y ) exist too [10]. The bias of HSIC ( D ) is O ( n 1 ) [4], and for appropriate choices of kernels, the permutation test with the biased statistic and the permutation test with the unbiased statistic are consistent tests – see section 2.2.1 for details. Both tests also have correct type 1 error rate. Because the biased HSIC ( D ) is more commonly encountered in the literature and has a slightly easier analytic form, we use HSIC ( D ) throughout our paper. Algorithm 1 in Figure 1 shows how HSIC is commonly combined with a permutation test for independence testing.

Figure 1:

The algorithm for independence testing using HSIC and a permutation test, resulting in a p-value and a rejection decision.

2.2.1 Choice of kernel

Throughout this paper we assume the covariates take values in the Euclidean space X = d for some d . Note that in our case Y = 0 . We let both k and l be instances of the covariance kernel of Brownian motion. That is,

k ( x , x ) = ( x + x x x ) ,

and

l ( y , y ) = ( | y | + | y | | y y | ) ,

where denotes the Euclidean norm. See [11] for a discussion of this kernel.

The reason for this choice is three-fold. Firstly, under this choice of kernels, HSIC coincides with Distance Covariance (DCOV) [5]. The equivalence between HSIC and DCOV was proved in [11]. DCOV is a well studied measure of dependence and if X , Y are random variables with compact support, then DCOV ( X , Y ) = 0 if and only if X Y . Furthermore, the permutation test with test statistic DCOV is consistent in the sense that, for each distribution P X Y with compact support and such that X Y , the probability that the permutation test rejects the null hypothesis converges to 1, as the sample size converges to infinity [12]. Secondly, unlike other potential kernels, such as the Gaussian kernel, the covariance kernel of Brownian motion spares us the need to select a bandwidth. Thirdly, our test relies on optimal transport to transform the original data into a transformed dataset. The optimal transport procedure, discussed in the next section, requires a metric with respect to which similarity is preserved. It appears natural to measure the dependency in the transformed dataset using the same metric as the one underlying the transformation.

2.3 Optimal transport

Let A = [ a 1 , , a | A | ] and B = [ b 1 , , b | B | ] be multisets of length | A | and | B | with a i , b j d . Let v A be a distribution vector on A, by which we mean: v A = ( v 1 A , , v | A | A ) | A | so that v i A 0 and i = 1 | A | v i A = 1 . Let v B be a distribution vector on B. Let m ( a , b ) be a metric on d . Then the earth mover’s distance (EMD) between v A and v B is defined as

(2) min P i j | A | × | B | i = 1 | A | j = 1 | B | P i j m ( a i , b j )

subject to

P i j 0 i = 1 , , | A | , j = 1 , , | B | , j = 1 | B | P i j = v i A i = 1 , , | A | , i = 1 | A | P i j = v j B j = 1 , , | B | .

Let U be a random variable taking values in A with distribution v A , and U a random variable taking values in B with distribution v B . In the next section we will write ‘let P be the joint distribution that solves the optimal transport problem between U and U ’. By that we mean P ( U = a i , U = b j ) P i j for i = 1 , , | A | , j = 1 , , | B | , where P i j is the matrix that minimizes Eq. (2). Note that P is a valid joint distribution of U and U . Furthermore, for any i with v i A > 0 it holds that P ( U = b j | U = a i ) = P i j / v i A .

3 Data transformation based on optimal transport

3.1 Objective of the algorithm

To use HSIC for an independence test of X and T, one requires the explicit values l ( T i , T j ) for i , j = 1 , , n . Since for right-censored data T i may be known only in terms of a lower bound, there is no straightforward way to apply these methods to right-censored data. In this section we propose a transformation of the original, censored dataset, into a synthetic dataset consisting of n observed events. The algorithm uses optimal transport and its goal is twofold: first, it should return a dataset to which we can apply a permutation test with test-statistic HSIC, and obtain correct p-values under the null hypothesis; second, it should return a dataset in which the dependence between X and T is similar to the dependence in the original dataset. Indeed, the transformation of the data is of independent interest: we use the standard permutation test with test statistic HSIC/DCOV, but other statistics could be considered. We do believe that the main value of the transformation lies in how it can be straightforwardly combined with permutation testing on the transformed data and we expect it may be difficult to use the transformation for other purposes.

3.2 Definition of the transformation

To define the algorithm, we use the following notation. Let A = [ a 1 , , a | A | ] be a multiset where a i d for i = 1 , , | A | . We define Uniform ( A ) = i = 1 | A | δ ( a i ) / | A | , to be the uniform distribution over the elements, where δ ( a ) is the Dirac measure at a. By B = Remove ( a , A ) we set B to be the multiset that remains when one instance of a is removed from A, and by B = Add ( a , A ) we set B to be the multiset that consists of the elements of A, with an added element a. We further assume that we are given a sample D = ( ( x i , z i , δ i ) ) i = 1 n so that z 1 < z 2 < < z n : that is, we have n distinct event times and they are labeled in increasing order. For a variable a we use a b to change the value of a to b. In computing the optimal transport coupling as defined in Section 2.3, we use the Euclidean metric | | x y | | = i = 1 d ( x i y i ) 2 for x d . The proposed transformation is defined in Algorithm 2 in Figure 2. Figure 3 visualizes a dataset with 1-dimensional covariates before and after applying the transformation.

Figure 2:

The optHSIC algorithm, resulting in a transformed dataset D ~ .

Figure 3:

Above: The dataset as originally observed. Below: the synthetic dataset after applying the OPT-based transformation. The labels indicate which individual the observation corresponds to in each dataset. The data is sampled from a parabolic relationship between covariates and lifetimes.

3.3 Comments on the transformation algorithm

We now give a verbal explanation of Algorithm 2.

Initialization: The input is simply D = ( ( x i , z i , δ i ) ) i = 1 n where z 1 < < z n . We initialize an empty transformed dataset D ˜ , to which we will add n observations of the form ( x i , t j ) for some i , j { 1 , , n } in the following way. We will loop over the times z i from i = 1 , , n . At each time, the multiset AR lists the covariates at risk in the dataset D and AR is initialized as the multiset containing all covariates. The multiset L will list all covariates that have not been added to D ˜ yet. Indeed, as D ˜ is initialized empty, L initially contains all n covariates. The variable d will count the number of observed δ = 1 events and is initialized 0.

First loop from i = 1 , , n 1 : At time z i we distinguish two cases: if δ i = 0 we leave L and D ˜ unchanged, and simply remove one instance of x i from AR. If δ i = 1 we add 1 to d (to count the number of observed events). We also select a covariate x ˜ from L as follows: First a joint distribution of Uniform ( AR ) and Uniform ( L ) is computed using optimal transport. This is a matrix P of size | AR | × | L | . This distribution is then conditioned on the event that Uniform ( AR ) = x i , yielding a distribution over L. We sample from this distribution to obtain the covariate x ˜ . The pair ( x ˜ , z i ) is then added to D ˜ , and the covariate x ˜ is removed from L. There are now d observations in D ˜ and there are n d covariates left in L. We also remove x i from AR.

When we finish the first loop: It is now the case that AR = [ x n ] (note the previous loop went up to i = n 1 ) and the multiset L contains n d covariates. In the second loop, that runs from d + 1 to n, each remaining covariate in L is combined with the time z n and added to D ˜ . After this loop, L is empty, and D ˜ contains all n covariates x 1 , , x n , each associated with a time. The transformed dataset D ˜ is returned.

Consider the special case in which there is no censoring and δ i = 1 for i = 1 , , n . Then it is easy to verify that in the first loop AR = L at each step i, so that the optimal transport algorithm returns a scaled identity matrix. Consequently, at step i the covariate x ˜ is equal to x i with probability 1, and the final output is D ˜ = D . A nontrivial example is worked out in Section A.1.

3.4 Intuition behind the transformation

Before we prove properties of the proposed transformation, we briefly comment on the intuition behind the transformation. To this end, first consider a permutation test in the absence of censoring, when we simply observe D = ( ( x i , t i ) ) i = 1 n . The permuted datasets can be generated as follows: loop through the events in order of time, and to each time, associate a covariate that you have not associated to any earlier time. Comparing the original dataset with the datasets obtained in this manner is justified for the following reason: Under the null hypothesis a sample can be generated by firstly sampling t i for i = 1 , n i.i.d. and independently sampling x i for i = 1 , n i.i.d., and secondly looping through the times in order, and associating to each time a covariate that has not yet been chosen at a previous time. The original dataset and the permuted datasets are thus equal in distribution: intuitively, the permutation test checks whether the dataset looks as if, at each time, a covariate is picked uniformly from those not chosen before.

It is not obvious how to translate this to censored data. Due to censoring, it may not be true that the ith event covariate is chosen uniformly from the covariates that have not had an observed event time before. For this reason survival analysis often compares the i–th event covariate with the covariates at risk (not failed and not censored) just before the i–th event. If the null hypothesis holds, then intuitively it holds that the i–th event covariate is chosen uniformly from the covariates at risk at time z i . That is, if AR denotes the multiset of covariates at risk and x i is the event covariate, we would like to test if x i was chosen uniformly from AR.

To do so using a permutation test, our algorithm couples Uniform ( AR ) to a uniform choice from those that have not yet been chosen in the synthetic dataset: Uniform ( L ) (see Algorithm 2). In this manner, when the null hypothesis is true, (intuitively) it holds that in the synthetic dataset the covariates are chosen uniformly from those not chosen before. But this last statement is exactly our intuition behind a permutation test: namely, we use a permutation test to see if the dataset looks as if, at each time, a covariate is picked uniformly from those not chosen before.

The intuition discussed thus far related to the workings of our transformation under the null hypothesis, and had at its core that Uniform ( AR ) is coupled to Uniform ( L ) in Algorithm 2. However there are many couplings between these two distributions. The reason we choose the coupling that solves the optimal transport problem is the following: assume the alternative hypothesis holds and that given AR i , certain covariates have a higher hazard rate than others. Then we would like this bias towards these covariates to be visible in D ˜ . In other words we would like x ˜ to be close to x ˜ i in Algorithm 2. That is precisely what the optimal transport coupling achieves.

Finally, including the remaining covariates in the transformed dataset at the last time ensures the permuted datasets correctly reflect all alternative choices that can be made when covariates are chosen at random from those not chosen before. The association of these remaining covariates to the last event time reflects they have not been selected at each of the earlier times, which may indicate they have had less risk of having an event up to that point.

The goal of the transformation is thus twofold: firstly, ensuring that under the null hypothesis a permutation test is appropriate on the transformed dataset, which motivates the coupling of Uniform ( AR ) and Uniform ( L ) ; secondly, given the first goal, ensure covariates in D ˜ associated to times z i for δ i = 1 are as close as possible to the covariates associated to z i in D, which motivates the chosen coupling to be the coupling that solves the optimal transport problem defined in Section 2.3.

An interesting alternative approach would be to compare x i with Uniform ( AR i ) directly, without first transforming the data. We do not see, however, how to combine that approach with a permutation test, or with another procedure to test for significance.

4 Applying HSIC to the transformed dataset: optHSIC

We have thus far described how to transform the dataset. For the hypothesis test of H 0 : X T , we propose to apply a permutation test with test statistic DCOV/HSIC to the transformed dataset. This approach is summarized in Algorithm 3 presented in Figure 4.

Figure 4:

The optHSIC algorithm, resulting in a p-value and rejection decision of H 0 : X T .

4.1 Computational cost of optHSIC

The algorithm optHSIC consists of two parts: First, the dataset is transformed. Second, a permutation test with test statistic HSIC/DCOV is performed. The computational cost of the transformation may be high: the solution of the earth mover’s distance has n 3 log ( n ) complexity [13]. Since the earth mover’s distance is computed for each i = 1 , , n such that δ i = 1 , complexity of the transformation is excessively high. Luckily, fast approximations of the earth mover’s distance exist that are linear in time (e.g. [13]), making the transformation O ( n 2 ) . Also in the case where the covariates are 1-dimensional, the optimal transport problem has a simple solution [14]. Different algorithms and/or approximations also exist for the EMD with other metrics [13]. The second step, computation of HSIC is also an O ( n 2 ) -operation for which large-scale approximations can be made [15]. Furthermore, both the computation of HSIC and the permutation test computations can be easily parallelized. In our simulations we did not use approximations of the earth mover’s distance nor of HSIC, as for samples up to about 500 the optHSIC test can be performed in about a second on an ordinary PC.

4.2 Theoretical results on optHSIC

We say a test with p-value p has correct type 1 error rate if under the null hypothesis P H 0 ( p α ) α . The main theoretical result we obtained for optHSIC is that the type 1 error rate is correct when C X . Unfortunately, we have not been able to prove other important results such as (asymptotically) correct type 1 error rate when C X , or consistency (power converging to one for each alternative hypothesis). However, as Section 6 details, extensive simulations demonstrate that optHSIC achieves correct type 1 error rate also when C X . Simulations further show that optHSIC is able to detect a wide range of dependencies between X and T without losing much power, relative to the CPH likelihood ratio test, even when the CPH model assumptions holds. Obtaining further theoretical results is an important future challenge. In the uncensored case a permutation test with covariance kernel of Brownian motion is consistent for distributions with compact support and has correct type 1 error rate [12]. The difficulty of extending these proofs to optHSIC is that optHSIC requires sequential analysis of the optimal transport distributions, breaking independence between observations in the transformed dataset.

We first prove an auxiliary result: namely, although we propose to permute the transformed dataset, this is equivalent to permuting the original dataset, and then transforming the permuted datasets.

Lemma 4.1.

Let π 1 , , π B be independent uniform random permutations, and let T , T 1 , , T B be independent optHSIC transformations.

[ T ( D ) , π 1 ( T ( D ) ) , , π B ( T ( D ) ) ] = d [ T ( D ) , T 1 ( π 1 D ) , , T B ( π B D ) ]

Proof.

See Appendix. □

Note that this implies that in the definition of optHSIC we could have equally permuted D first, and then applied the transformation to each of the permuted datasets. However this is computationally more expensive. Lemma 4.1 enables us to show that optHSIC has correct type 1 error rate when C X .

Theorem 4.1.

Let D = ( ( X i , Z i , Δ i ) ) i = 1 n be an i.i.d sample. Assume that C X . Let p be the p-value resulting from applying optHSIC (Algorithm 3) to D with B permutations and level α [ 0 , 1 ] . If the null hypothesis holds, i.e. T X , then p U n i f o r m [ 1 B + 1 , , B + 1 B + 1 ] and in particular it holds that

P H 0 ( p α ) α .

Proof.

See Appendix. □

Table 1 summarizes our findings of the type 1 error of optHSIC.

Table 1:

An overview of the type 1 error results obtained for optHSIC in the case when C X and when C X . B denotes the number of permutations. The p-value is defined in Algorithm 3.

Correct type 1 error rate in theory: P H 0 ( p α ) α . Distribution p under H 0 Correct type 1 error rate in simulations
C X Yes (Theorem 4.1) Uniform [ 1 B + 1 , , B + 1 B + 1 ] (Theorem 4.1) Yes (Sec 6.1)
C X Unknown Unkown, but experiments suggest approximately Uniform [ 1 B + 1 , , B + 1 B + 1 ] (Figure 11 Appendix) Yes (Sec 6.1)

4.3 Using multiple transformations

Algorithm 2 defines the transformation used in optHSIC. In line 7, a covariate is sampled from the multiset L with distribution v. The sampled element may differ for different iterations of the algorithm. Consequently, given a fixed dataset D, the transformed dataset will look different for different iterations of the algorithm. The optHSIC algorithm proposes to use only one transformation, resulting in a single p-value.

Say that instead of applying the optHSIC algorithm once, we repeat the optHSIC algorithm m times, resulting in m p-values p 1 , , p m . An example of the distribution of p-values for a fixed dataset is given in Figure 5.

Figure 5:

A histogram of the p-values obtained from optHSIC for a fixed dataset D sampled from D.1 of Table 2. The sample was of size n = 300 and 61 % of the individuals was observed ( δ = 1 ) . The variability in p comes from the fact that the transformation in optHSIC is random.

Figure 5 shows that in the used dataset, where 39 % of observations is censored and n = 300 , the variance in the obtained p-values is not excessively high, and in particular all p-values would have led to the decision to reject the null-hypothesis.

We also studied several techniques one may use to combine p-values, typically at the cost of being more conservative when there is little variance among p-values, like in the example above. This is discussed in Section A.9.

5 Alternative approaches when censoring is independent of the covariate

We are not aware of fully nonparametric methods to test independence between right-censored times and continuous covariates. When C is independent of X the challenge is mitigated, since in that case any statistic can be combined with a permutation test that permutes the covariates (see Theorem 5.2). Using that approach, this Section proposes some alternative tests in the case when C X that will serve as benchmarks for studying power performance of optHSIC in addition to the Cox proportional hazards likelihood ratio test. In Section 6 we compare the performance of these methods with optHSIC.

Before we propose alternative test statistics, we state formally why standard permutation tests can be used when C X . We first state a Theorem on the correctness of the type 1 error rate of permutation tests without censoring. We include the proof following [16] in the Appendix for completeness.

Theorem 5.1.

Let D = ( X i , Y i ) i = 1 n be an i.i.d. sample of size n from distribution P X Y on X × Y . Denote π D = ( X π ( i ) , Y i ) i = 1 n . Let π 1 , π B be permutations sampled uniformly and independently from S n . Let H be any statistic of the data. Let R be the rank of the first coordinate in the vector:

( H ( D ) , H ( π 1 D ) , , H ( π B D ) )

when ties are broken at random and where R = 1 is the rank of the largest element and R = B + 1 the rank of the smallest element. Define p R / ( B + 1 ) . Then under the null hypothesis H 0 : X T it holds that: p U n i f o r m [ 1 B + 1 , , B + 1 B + 1 ] . So in particular for α [ 0 , 1 ]

P H 0 ( p α ) α .

Proof.

See Appendix. □

In survival analysis we apply the above, setting Y = ( Z , Δ ) , where Z = min { T , C } and Δ = 1 { T C } .

Theorem 5.2.

Assume C X and T C | X . Write D = ( X i , Z i , Δ i ) i = 1 n , and let π 1 , , π B be sampled i.i.d. uniformly from S n . Write π D ( X π ( i ) , Z i , Δ i ) i = 1 n . Let H ( D ) be any statistic of the data. Let R be the rank of the first coordinate in the vector:

( H ( D ) , H ( π 1 D ) , , H ( π B D ) )

where ties are broken at random and R = 1 is the rank of the largest element and R = B + 1 is the rank of the smallest element. Define p R / ( B + 1 ) . Then under the null hypothesis H 0 : X T it holds that p U n i f o r m [ 1 B + 1 , , B + 1 B + 1 ] . So in particular for α [ 0 , 1 ]

P H 0 ( p α ) α .

Proof.

See Appendix. □

Having shown that we can use a permutation test on any test statistic when C X , the next two sections explore meaningful measures of dependency on right-censored data. Indeed such methods may lead to type 1 errors when C X .

5.1 wHSIC

HSIC relies on estimating the mean embedding of the joint distribution P X T . The estimated embedding is then compared to the embedding of the product of marginal distributions. In the uncensored case the empirical distribution equals i = 1 n δ ( x i , t i ) / n which has corresponding mean embedding i = 1 n K ( ( x i , t i ) ) / n .

Since we do not observe the t i ’s we could consider replacing the empirical distribution by a weighted version i = 1 n w i δ ( x i , z i ) / n where we try to find weights w i such that i = 1 n w i f ( x i , z i ) E f ( X , T ) for every measurable function f such that the expectation exists. A natural idea is to give an observed point ( x , z , δ ) a weight of zero if it is censored, and a weight equal to the inverse probability of being uncensored otherwise. This can be motivated by the following lemma, that applies also if C X . We first define

g ( t , x ) = P ( C > t | X = x ) .

The following lemma proposes a weight function W in terms of g.

Lemma 5.1.

Let f be an ( X , T ) - measurable function such that E f ( X , T ) < and define W by

W = { 0 i f Δ = 0 1 g ( Z , X ) i f Δ = 1.

Then E W f ( X , Z ) = E f ( X , T ) .

Proof.

See Appendix. □

We would thus like to use weights w i = 0 if δ i = 0 and 1 / n g ( x i , z i ) if δ = 1. As we will be working under the assumption C X we may estimate 1 / P ( C > t | X ) = 1 / P ( C > t ) using Kaplan–Meier weights, which we define now. Assume that there are no ties in the data and that z i < z i + 1 for i = 1 , , n . Then the Kaplan–Meier survival curve is given by:

S ˆ T ( z k ) = i = 1 k ( n i n i + 1 ) δ i .

Kaplan–Meier weights are then defined by

w k = P ˆ ( T = z k ) = S ˆ T ( z k 1 ) S ˆ T ( z k ) .

One can also estimate the survival probability of the censoring time using Kaplan–Meier weights (simply replace δ i by 1 δ i in the above formula). The following lemma shows that the weights w k defined above, correspond up to a constant of 1 / n , to the inverse of the estimated probability of being uncensored.

Lemma 5.2.

Let S ˆ C ( z k ) = P ˆ ( C > z k ) denote the Kaplan–Meier estimate of the survival probability of the censoring distribution. Then:

w k = 1 n 1 P ˆ ( C > z k )

Proof.

See Appendix. □

We use these weights to define the statistic wHSIC as the RKHS distance between the embedding of i = 1 n w i δ ( x i , z i ) and the embedding of the product of the marginals, i = 1 n j = 1 n w i w j δ x i δ z j . That is:

wHSIC ( D ) i = 1 n w i K ( ( x i , z i ) , ) i = 1 n j = 1 n w i w j K ( ( x i , y j ) , ) H 2 .

Here K ( ( x , y ) , ( x , y ) ) = k ( x , x ) l ( y , y ) . The following theorem shows how to compute wHSIC efficiently.

Theorem 5.3.

(Computation of wHSIC) Given a dataset D = ( ( x i , y i ) ) i = 1 n with a weight vector w n ,

wHSIC ( D ) = tr ( H w K H w L ) ,

where K i j = k ( x i , x j ) and L i j = l ( y i , y j ) and H w = D w w w , where D w = d i a g ( w ) , a diagonal matrix with ( D w ) i i = w i .

Proof.

See Appendix. □

It is not difficult to see wHSIC has the same computational time as (uncensored) HSIC. Algorithm 4 in Figure 6 shows how wHSIC is combined with a permutation test.

Figure 6:

The wHSIC algorithm, resulting in a p-value and rejection decision of H 0 : X T .

5.2 zHSIC

If C X , then any dependence between X and Z must be due to dependence between X and T. Hence we may estimate HSIC ( X , Z ) to measure the strength of the dependence. This motivates the test proposed in Algorithm 5 labeled zHSIC (see Figure 7). This test is, indeed, expected to yield false rejections if C X fails to hold, and to lack power when a large portion of the events is censored. We include this test mainly to see how the power of optHSIC and wHSIC compare with this approach.

Figure 7:

The zHSIC algorithm, resulting in a p-value and rejection decision of H 0 : X T .

6 Numerical evaluation of the methods

We generate data from various distributions of X, T and C to compare the power and type 1 error rate of optHSIC, wHSIC, zHSIC and CPH. CPH stands for the Cox proportional hazards likelihood ratio test. In each scenario, we let the sample size range from n = 40 to n = 400 in intervals of 40. To obtain p-values in the three HSIC-based methods we use a permutation test with 1999 permutations. We reject the null hypothesis if our obtained p-value is less than 0.05. In the main paper, we present results of rejection rates under distributions that are chosen such that approximately 60 % of the observations are observed ( δ = 1 ). We investigate the rejection rates under varying censoring regimes in Section A.10.2 of the Appendix.

6.1 Type 1 error rate

We begin by investigating the type 1 error rate. The distributions in which we test the type 1 error rate are found in Table 3. In these distributions the null hypothesis holds, i.e. X T and we consider both cases where C X and C X , as well as the case of multidimensional covariates. We estimate the rejection rates by sampling 5000 times from each distribution for each sample size and applying the different tests to the samples. The obtained rejection rates of optHSIC are found in Table 4. The type 1 error rate of the remaining methods is found in Section A.10.1. The most important finding is that optHSIC is found to have correct type 1 error rate of α = 0.05 both when C X and when C X . In contrast, wHSIC and zHSIC yield many false rejections when C X , as expected, but have the correct type 1 error rate when C X . Investigation of the p-values of optHSIC furthermore showed that p-values are distributed approximately according to Uniform [ 1 B + 1 , , B + 1 B + 1 ] under the null hypothesis, even when C X (see Figure 11 in Section A.10.1).

Table 2:

The eight scenarios in which we test the power in the main text. N 10 ( 0 , Σ ) is a 10-dimensional Gaussian random variable, with 0 denoting a vector of zeros of length 10, and Σ = M M T where M is a 10 × 10 matrix of i.i.d N ( 0 , 1 ) entries. In these distributions about 60 % of the individuals is observed. Distributions with dependent censoring and varying censoring percentages are discussed in Section A.10.2.

D. X T | X C | X
1 N ( 0 , 1 ) exp ( mean = exp ( X / 6 ) ) exp ( mean = 1.5 )
2 N ( 0 , 1 ) exp ( mean = exp ( X 2 / 5 ) ) exp ( mean = 2.25 )
3 Unif [ 1 , 1 ] Weib ( shape = 1.75 X + 3.25 ) exp ( mean = 1.75 )
4 Unif [ 1 , 1 ] N ( 100 X , 2 X + 5.5 ) 82 + exp ( mean = 35 )
5 N 10 ( 0 , Σ ) exp ( mean = exp ( 1 T X / 30 ) ) exp ( mean = 1.5 )
6 N 10 ( 0 , Σ ) exp ( mean = exp ( X 1 / 8 ) ) exp ( mean = 1.5 )
7 N 10 ( 0 , Σ ) exp ( mean = exp ( X 1 2 / 4 ) ) exp ( mean = 10 )
8 N 10 ( 0 , Σ ) exp ( mean = exp ( X 1 2 / 4 + X 2 / 7 ) ) exp ( mean = 8 )
Table 3:

The eight scenarios in which we simulate the type 1 error rate. N 10 ( 0 , Σ ) is a 10-dimensional Gaussian random variable, with 0 denoting a vector of zeros of length 10, and Σ = M M T where M is a 10 × 10 matrix of i.i.d N ( 0 , 1 ) entries. Note that C depends on X in D.3–8.

D. X T | X C | X % δ = 1
1 Unif [ 1,1 ] Unif [ 0,1 ] Unif [ 0,1.5 ] 66%
2 Unif [ 1,1 ] exp ( mean = 5 / 2 ) exp ( mean = 5 / 3 ) 40%
3 Unif [ 1,1 ] exp ( mean = 2 / 3 ) exp ( mean = exp ( X ) ) 60%
4 Unif [ 1,1 ] exp ( mean = 1.6 ) exp ( mean = exp ( 9 X 2 ) ) 60%
5 Unif [ 1,1 ] exp ( mean = 0.9 ) Weib ( shape = 1.75 X + 3.25 ) 60%
6 Unif [ 1,1 ] exp ( mean = 0.9 ) 1 + X 60%
7 N 10 ( 0 , Σ ) exp ( mean = 0.6 ) exp ( mean = exp ( 1 T X ) ) 60%
8 N 10 ( 0 , Σ ) exp ( mean = 0.6 ) exp ( mean = exp ( X 1 / 8 ) ) 60%
Table 4:

The rejection rate of optHSIC for the distributions D.1–8 of Table 3. Note the type 1 error rate is very close to the level α = 0.05 as desired, also in the scenarios where C X .

n = 40 80 120 160 200 240 280 320 360 400
D.1 0.047 0.053 0.051 0.050 0.054 0.051 0.045 0.056 0.049 0.048
D.2 0.049 0.051 0.057 0.044 0.053 0.050 0.047 0.049 0.052 0.048
D.3 0.052 0.052 0.048 0.049 0.044 0.050 0.048 0.054 0.054 0.048
D.4 0.050 0.054 0.049 0.054 0.054 0.054 0.050 0.053 0.050 0.046
D.5 0.050 0.056 0.056 0.051 0.053 0.051 0.046 0.055 0.050 0.049
D.6 0.050 0.049 0.048 0.047 0.048 0.057 0.050 0.050 0.048 0.050
D.7 0.050 0.055 0.047 0.048 0.056 0.054 0.054 0.052 0.051 0.048
D.8 0.041 0.050 0.051 0.048 0.056 0.049 0.054 0.050 0.054 0.052

6.2 Comparison of power

To compare the power of the four methods we consider a number of distributions in which T X . For each distribution we let the sample size range from n = 40 to n = 400 in steps of 40. At each sample size we take 1000 samples to estimate the rejection rate. In these distributions censoring is independent of the covariate such that the methods wHSIC and zHSIC do not have inflated rejection rate due to dependencies between C and X. Distributions with varying censoring rates and dependent censoring are investigated in Section A.10.2. Parameters are chosen so that that 60 % of the observations is uncensored ( 60 % has δ i = 1 ), and rejection rates take a range of values (i.e. to exclude trivial distributions so that each method rejects with probability one for each sample size).

6.2.1 Power for one-dimensional covariates

In distributions 1–4 of Table 2 the covariate is one-dimensional. Scatterplots of the samples and rejection rates are displayed in Figure 8. Note how in D.1 the CPH assumption holds, so the CPH method suits this example very well. We find however that the rejection rate of optHSIC is very similar to that of the CPH likelihood ratio test (first row, right of Figure 8). D.2 features a case in which hazard is highest in the middle, and lower for extreme values of the covariate. The CPH likelihood ratio test does not have power to detect this relationship, and optHSIC is the most powerful method. In D.3 and D.4 the hazard functions of different covariates cross each other. In this case, wHSIC is the top-performing method, but optHSIC is also able to detect the more complicated relationship, while CPH is not able to do so.

Figure 8:

Scatterplots and rejection rates for D.1 (top row) – D.4 (bottom row) of Table 2.

6.2.2 Power for multidimensional covariates

In D.5–8 of Table 2 the covariates are multidimensional. Figure 9 shows the rejection rates of the four methods, both as the dimension is fixed at 10, and the sample size increases (left column) and when the sample size is fixed at 200, and the dimension increases (right column). In D.5 and D.6 the CPH assumption holds. Again we see that the power of optHSIC is relatively close to the power of the CPH likelihood ratio test, which is specifically designed for these assumptions. This holds both when the dependence is on a single sub-dimension of the covariate (D.6) and when the dependence is on all covariates (D.5). In D.7 and D.8 a non-linear term is present in the hazard rate. Together with zHSIC, optHSIC is now the best performing method. Note that as dimension increases, the dependence in D.6–D.8 becomes harder to detect, whereas the dependence in D.5 becomes easier to detect since in the latter T depends on all covariates.

Figure 9:

Rejection rates of the four methods for D.5 (top row) – D.8 (bottom row) of Table 2. Left column: dimension fixed at 10, and the sample size increases. Right column: sample size fixed at 200, and the dimension increases.

We also investigated the power when C X and as the censoring percentage varies from 0 % to 80 % . Distributions and rejection rates are presented in Section A.10.2. The main observations are that, although all methods lose power as the censoring rate increases, the relative performance of the four methods remains similar to the results presented in the main text.

We thus find that for continuous covariates optHSIC is able to detect a wider range of dependencies than the CPH likelihood ratio test, while not losing much power when the CPH assumptions hold. This is true both when the covariate is 1-dimensional and when the covariate is multidimensional, even when the dependence arises from a lower-dimensional subspace of the covariate. Importantly, unlike the methods wHSIC and zHSIC, the type 1 error rate of optHSIC is found to be of the correct level both when C X and when C X .

6.3 Binary covariates

Lastly, we did experiments with binary covariates, in which case independence testing is equivalent to two-sample testing. For two-sample testing there are other alternative approaches, including [17], [18]. Furthermore, the wHSIC approach can be modified firstly to allow for the computation of weights within groups, and secondly a permutation test can be constructed that is valid also when censoring differs between groups. This permutation test was proposed by [19]. This special case of wHSIC coincides with the approach discussed in [18] for two-sample testing.

Our main finding is that optHSIC has a decent all-round performance for two-sample testing. However, especially when the distributions meet additional assumptions used in semi-parametric tests, these semi-parametric tests outperform optHSIC. We also note that, as optHSIC relies on choosing ‘similar covariates’ during the transformation of Algorithm 2, it is not ideally suited for the two-sample case, as there is a chance of choosing the opposite covariate.

We also provide a ‘failure mode’ of optHSIC (we thank a reviewer for this example): a scenario in which optHSIC performs much worse than the CPH test. In this example, covariates are binary (i.e. there are two groups), group sizes are very unequal, 90 % of observations is censored, censoring appears only in the large group, and survival in the two groups is identical until beyond the censoring has occurred. See A.12 for a detailed description.

A full presentation of the experiments and methods used for the case of binary covariates is given in Section A.11. We conclude that the main value of optHSIC lies in independence testing with continuous covariates.

7 Discussion

The main contribution of this paper is the proposal of optHSIC, combining a novel way of using optimal transport to transform right-censored datasets, with nonparametric permutation-based independence testing using HSIC/DCOV. We have shown optHSIC has power against a wider range of alternatives than the commonly used CPH model, while forfeiting little power when the CPH assumptions are satisfied exactly. Extensive numerical simulations suggest that the approach is well calibrated, yielding reliable p-values even when censoring strongly depends on the covariate. Under the assumption that censoring does not depend on the covariate, we have proven correct type 1 error rate of optHSIC. Theoretical guarantees for the type 1 error under dependent censoring are a topic of future work. We furthermore proposed reweighting the original dataset, and measuring the distance between the resulting weighted mean embeddings. While these methods showed some promise, they do rely on the very strong assumption that C and X are independent. An interesting future challenge is to develop nonparametric tests for covariates and right-censored times for mutual independence testing and conditional independence testing, two methods used in causal inference, and which are relevant to many problems in biostatistics.

8 Supplementary Materials and code

Supplementary materials contain a worked out example of the transformation proposed in Algorithm 2, proofs of all results and details on additional experiments. They also contain methods of combining p-values from multiple transformation of the same dataset. Code to replicate the experiments and of the tests is available on www.github.com/davidrindt/opthsic.

Corresponding author: David Rindt, Department of Statistics, University of Oxford, Oxford, UK, E-mail:

Acknowledgment

1. Author contribution: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.

2. Research funding: David Rindt is funded by the Engineering and Physical Sciences Research Council (EPSRC). David Steinsaltz is funded by BBSRC grant BB/S001824/1.

3. Conflict of interest statement: The authors declare no conflicts of interest regarding this article.

References

1. Cox, DR. Regression models and life tables. J Roy Stat Soc B 1972;34:187–202. https://doi.org/10.1111/j.2517-6161.1972.tb00899.x.Search in Google Scholar

2. Zajacova, A, Burgard, SA. Shape of the BMI-mortality association by cause of death, using generalized additive models: NHIS 1986–2006. J Aging Health 2012;24:191–211. https://doi.org/10.1177/0898264311406268.Search in Google Scholar PubMed PubMed Central

3. Lip, S, Tan, LE, Jeemon, P, McCallum, L, Dominiczak, AF, Padmanabhan, S. Diastolic blood pressure j-curve phenomenon in a tertiary-care hypertension clinic. Hypertension 2019;74:767–75. https://doi.org/10.1161/hypertensionaha.119.12787.Search in Google Scholar PubMed PubMed Central

4. Gretton, A, Fukumizu, K, Teo, C, Song, L, Schölkopf, B, Smola, A. A kernel statistical test of independence. Adv Neural Inf Process Syst 2008;20:585–92.Search in Google Scholar

5. Székely, G, Rizzo, M. Brownian distance covariance. Ann Appl Stat 2009;3:1236–65. https://doi.org/10.1214/09-aoas312.Search in Google Scholar PubMed PubMed Central

6. Pfister, N, Bühlmann, P, Schülkopf, B, Peters, J. Kernel-based tests for joint independence. J Roy Stat Soc B 2018;80:5–31. Available from: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12235.10.1111/rssb.12235Search in Google Scholar

7. Zhang, K, Peters, J, Janzing, D, Schölkopf, B. Kernel-based conditional independence test and application in causal discovery. In: Proceedings of the twenty-seventh conference on uncertainty in artificial intelligence. Corvallis, OR, USA: AUAI Press; 2011:804–13 pp.Search in Google Scholar

8. Gretton, A, Borgwardt, K, Rasch, M, Schölkopf, B, Smola, A. A kernel two–sample test. J Mach Learn Res 2012;12:723–73.Search in Google Scholar

9. Schölkopf, B, Smola, AJ. Learning with kernels: support vector machines, regularization, optimization, and beyond, 1st ed. Massachusetts: MIT Press; 2001.Search in Google Scholar

10. Song, L, Smola, A, Gretton, A, Bedo, J, Borgwardt, K. Feature selection via dependence maximization. J Mach Learn Res 2012;13:1393–434.Search in Google Scholar

11. Sejdinovic, D, Sriperumbudur, B, Gretton, A, Fukumizu, K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann Stat 2012;41:2263–91.10.1214/13-AOS1140Search in Google Scholar

12. Rindt, D, Sejdinovic, D, Steinsaltz, D. Consistency of permutation tests for HSIC and dHSIC. arXiv preprint arXiv:2005.06573; 2020.Search in Google Scholar

13. Shirdhonkar, S, Jacobs, DW. Approximate earth mover’s distance in linear time. In: 2008 IEEE conference on computer vision and pattern recognition. Anchorage, AK, USA: IEEE; 2008:1–8 pp.10.1109/CVPR.2008.4587662Search in Google Scholar

14. Cohen, S. Finding color and shape patterns in images. [Ph.D. thesis] Stanford University; 1999.Search in Google Scholar

15. Zhang, Q, Filippi, S, Gretton, A, Sejdinovic, D. Large-scale kernel methods for independence testing. Stat Comput 2018;28:113–30. https://doi.org/10.1007/s11222-016-9721-7.Search in Google Scholar

16. Berrett, TB, Samworth, RJ. Nonparametric independence testing via mutual information. Biometrika 2019;106:547–66. https://doi.org/10.1093/biomet/asz024.Search in Google Scholar

17. Ditzhaus, M, Friedrich, S. More powerful logrank permutation tests for two-sample survival data. J Stat Comput Simulat 2020;90:2209–27. https://doi.org/10.1080/00949655.2020.1773463.Search in Google Scholar

18. Matabuena, M. Energy distance and kernel mean embedding for two sample survival test. arXiv preprint arXiv:1901.00833; 2019.Search in Google Scholar

19. Wang, R, Lagakos, S, Gray, R. Testing and interval estimation for two-sample survival comparisons with small sample sizes and unequal censoring. Biostatistics 2010;11:676–92. https://doi.org/10.1093/biostatistics/kxq021.Search in Google Scholar PubMed PubMed Central

Supplementary Material

Accepted: 2020-10-27
Published Online: 2020-12-02

© 2020 Walter de Gruyter GmbH, Berlin/Boston