Approximate Kernel-Based Conditional Independence Tests for Fast Non-Parametric Causal Discovery

Eric V. Strobl, Kun Zhang and Shyam Visweswaran

Abstract

Constraint-based causal discovery (CCD) algorithms require fast and accurate conditional independence (CI) testing. The Kernel Conditional Independence Test (KCIT) is currently one of the most popular CI tests in the non-parametric setting, but many investigators cannot use KCIT with large datasets because the test scales at least quadratically with sample size. We therefore devise two relaxations called the Randomized Conditional Independence Test (RCIT) and the Randomized conditional Correlation Test (RCoT) which both approximate KCIT by utilizing random Fourier features. In practice, both of the proposed tests scale linearly with sample size and return accurate p-values much faster than KCIT in the large sample size context. CCD algorithms run with RCIT or RCoT also return graphs at least as accurate as the same algorithms run with KCIT but with large reductions in run time.

1 The problem

Constraint-based causal discovery (CCD) algorithms such as Peter-Clark (PC) and Fast Causal Inference (FCI) infer causal relations from observational data by combining the results of many conditional independence (CI) tests [1]. In practice, a CCD algorithm can easily request p-values from thousands of CI tests even with a sparse underlying graph. Developing fast and accurate CI tests is therefore critical for maximizing the usability of CCD algorithms across a wide variety of datasets.

Investigators have developed many fast parametric methods for testing CI. For example, we can use partial correlation to test for CI under the assumption of Gaussian variables [2], [3]. We can also consider testing for unconditional independence XY|Z=z for each constant value z when Z is discrete and P(Z=z)>0. The chi-squared test for instance utilizes this strategy when both X and Y are also discrete [4]. Another permutation-based test generalizes the same strategy even when X and Y are not necessarily discrete [5].

Testing for CI in the non-parametric setting generally demands a more sophisticated approach. One strategy involves discretizing continuous conditioning variables Z as Z˘ in some optimal fashion and assessing unconditional independence Z˘=z˘ [6], [7]. Discretization however suffers severely from the curse of dimensionality because consistency arguments demand smaller bins with increasing sample size, but the number of cells in the associated contingency table increases exponentially with the conditioning set size. A second method involves measuring the distance between estimates of the conditional densities f(X|Y,Z) and f(X|Z), or their associated characteristic functions, by observing that f(X|Y,Z)=f(X|Z) when XY|Z [8], [9]. However, the power of these tests also deteriorates quickly with increases in the dimensionality of Z.

Several investigators have since proposed reproducing kernel-based CI tests in order to tame the curse of dimensionality. Indeed, kernel-based methods in general are known for their strong empirical performance in the high dimensional setting. The Kernel Conditional Independence Test (KCIT) for example assesses CI by capitalizing on a characterization of CI in reproducing kernel Hilbert spaces (RKHSs; [10]). Intuitively, KCIT works by testing for vanishing regression residuals among functions in RKHSs. Another kernel-based CI test called the Permutation Conditional Independence Test (PCIT) reduces CI testing to two-sample kernel-based testing via a carefully chosen permutation found at the solution of a convex optimization problem [11].

The aforementioned kernel-based CI tests unfortunately suffer from an important drawback: both tests scale at least quadratically with sample size and therefore take too long to return a p-value in the large sample size setting. In particular, KCIT’s bottleneck lies in the eigendecomposition as well as the inversion of large kernel matrices [10], and PCIT must solve a linear program that scales cubicly with sample size in order to obtain its required permutation [11]. As a general rule, it is difficult to develop exact kernel-based methods which scale sub-quadratically with sample size, since the computation of kernel matrices themselves scales at least quadratically.

Many investigators have nonetheless utilized random Fourier features in order to quickly approximate kernel methods in linear time with respect to the number of Fourier features. For example, Lopez-Paz and colleagues developed an unconditional independence test using statistics obtained from canonical correlation analysis with random Fourier features [12]. Zhang and colleagues have also utilized random Fourier features for unconditional independence testing but took a different approach by approximating the kernel cross-covariance operator [13]. The authors further analyzed block and Nyström-based kernel approximations to the unconditional independence testing problem. The authors ultimately concluded that the random Fourier feature and the Nyström-based approaches both outperformed the block-based approach on average. Others have analyzed the use of random Fourier features for predictive modeling (e. g., [14], [15]) or dimensionality reduction [16]. In practice, investigators have observed that methods which utilize random Fourier features often scale linearly with sample size and achieve comparable accuracy to exact kernel methods [12], [13], [14], [15], [16].

In this paper, we also use random Fourier features to design two fast tests called the Randomized Conditional Independence Test (RCIT) and the Randomized conditional Correlation Test (RCoT) which approximate the solution of KCIT. Simulations show that RCIT, RCoT and KCIT have comparable accuracy, but both RCIT and RCoT scale linearly with sample size in practice. As a result, RCIT and RCoT return p-values several orders of magnitude faster than KCIT in the large sample size context. Moreover, experiments demonstrate that the causal structures returned by CCD algorithms using either RCIT, RCoT or KCIT have nearly identical accuracy.

2 High-level summary

We now provide a high-level summary for investigators who simply wish to perform CCD with RCIT or RCoT. We want to test whether we have XY|Z in a fast and accurate manner without resorting to parametric assumptions. Previously, Zhang and colleagues introduced a non-parametric CI test called KCIT which can test whether we have XY|Z in an accurate but not fast manner by analyzing the partial cross-covariance (operator) using the kernel method [10]. Kernels however are expensive to compute because they scale at least quadratically with sample size. Another line of work has fortunately shown that we can approximate kernels by averaging over a variety of non-linear transformations called random Fourier features (e. g., [14], [15], [17]). We therefore propose to approximate KCIT by utilizing random Fourier features, specifically by analyzing the partial cross-covariance matrix of {X,Z} and Y (RCIT) or X and Y (RCoT) after subjecting the variable sets to the non-linear transformations and then non-linearly regressing out the effect of Z. Simulations show that RCIT and RCoT return p-values in a much shorter time frame while also matching or outperforming KCIT in approximating the null distribution. RCoT in particular also returns the most accurate p-values when the conditioning set size Z is large (≥4).

3 Preliminaries on kernels

We will deal with kernels in this paper, so we briefly review the corresponding theory; see [18] for a more extensive discussion of similar concepts. Capital letters X,Y,Z denote sets of random variables with codomains X,Y,Z, respectively. Let HX correspond to a Hilbert space of functions mapping X to R. We say that HX is more specifically a reproducing kernel Hilbert space, if the Dirac delta operator δx:HXR (which maps fHX to f(x)R) is a bounded linear functional [19]. We can associate HX with a unique positive definite kernelkX:X×XR in which the feature mapϕ(x):XHX satisfies the reproducing propertyf,ϕ(x)HX=f(x), fHX, xX by the Moore-Aronszajn Theorem. We will require that HX be separable (i. e., it must have a complete orthonormal system) throughout this paper [18]. Hein and Bousquet [20] showed that any continuous kernel on a separable space X (e. g., Rd) induces a separable RKHS. We will likewise consider other kernels such as kY on the separable space Y.

We have the following norm defined on linear operators between RKHSs:

Definition.

[21] Denote byΣ:HXHYa linear operator. Then, provided that the sum converges, the Hilbert Schmidt (HS) norm of Σ is defined as:

(1)ΣHS2=i,jui,ΣvjHY2,
whereuiandvjare orthonormal bases ofHYandHX, respectively.

We denote the probability distribution of X as PX and that of Y as PY. We may then define the mean elementsE[ϕ(X)]=μXHX and E[ψ(Y)]=μYHY with respect to the probability distributions; here, ϕ denotes the feature map from X to HX, and ψ likewise denotes the feature map from Y to HY. We also define the quantity μXHX2 by applying the expectation twice:

(2)μXHX2=EXX[ϕ(X),ϕ(X)HX]=EXX[kX(X,X)],

where X is an independent copy of X which follows the same distribution. The mean elements μX and μY exist so long as their respective norms in HX or HY are bounded; this is true so long as the kernels kX and kY are also bounded so that EXX[kX(X,X)]< and EYY[kY(Y,Y)]< [18].

The cross-covariance operator associated with the joint probability distribution PXY over (X,Y) is a linear operator ΣXY:HYHX defined as follows:

(3)f,ΣXYgHX=EXY[f(X)g(Y)]EX[f(X)]EY[g(Y)],

for all fHX and gHY. Gretton et al. [18] showed that the HS-norm of the cross covariance operator, ΣXYHS2, can be written in terms of kernels as follows:

(4)ΣXYHS2=EXXYY[kX(X,X)kY(Y,Y)],

provided that XY, and kX as well as kY are centered (i. e., EXX[kX(X,X)]=0 and EYY[kY(Y,Y)]=0). We can therefore consider the following empirical estimate of the ΣXYHS2 using n i. i. d. samples x,y as follows, if we assume that XY [10], [18]:

(5)Txy=1n2i=1nj=1n[K˜X]ij[K˜Y]ij=1n2tr[K˜XK˜Y],

where TxypΣXYHS2 [18]; the dependent case can be found in Lemma 1 of the same paper for the interested reader. The notations K˜X and K˜Y correspond to centralized kernel matrices such that:

(6)K˜X=HKXH,

and likewise for K˜Y. Here, H=I1n11T, I denotes an n×n identity matrix, and 1 denotes a vector of ones. The notation KX denotes a kernel matrix such as the RBF kernel where [KX]ij=exp(xixj22σ) with xi,xjx. The transformation in Equation 6 ensures that 1n2i=1nj=1n[K˜X]ij=0 similar to the centered kernels kX and kY in Equation 4.

4 Characterizations of conditional independence

We denote the probability distribution of X as PX and the joint probability distribution of (X,Z) as PXZ. Let LX2 denote the space of square integrable functions of X, and LXZ2 that of (X,Z). Here, LX2={s(X)EX(|s|2)<} and likewise for LXZ2. Next consider a dataset of n i. i. d. samples drawn according to PXYZ.

We use the notation XY|Z when X and Y are conditionally independent given Z. Perhaps the simplest characterization of CI reads as follows: XY|Z if and only if PXY|Z=PX|ZPY|Z. Equivalently, we have PX|YZ=PX|Z and PY|XZ=PY|Z.

4.1 Characterization by RKHSs

A second characterization of CI is given in terms of the cross-covariance operator ΣXY on RKHSs [22]. Recall the cross-covariance operator from HY to HX in Equation 3. We may then define the partial cross-covariance operator of (X,Y) given Z by:

(7)ΣXY·Z=ΣXYΣXZΣZZ1ΣZY,

where we use the right inverse instead of the inverse, if ΣZZ is not invertible (see Corollary 3 in Fukumizu et al. [22]). Notice the similarity of the partial cross-covariance operator to the linear partial cross-covariance matrix (as well as the conditional cross-covariance matrix in the Gaussian case).[1] We can interpret the above equation as the partial covariance between {f(X),fHX} and {g(Y),gHY} given {h(Z),hHZ} (i. e., the partial covariance of X and Y given Z after passing these three variable sets through the functions in the RKHSs HX,HY and HZ).

A kernel kX is characteristic if EXPX[f(X)]=EXQX[f(X)], fHX implies PX=QX, where PX and QX are two probability distributions of X [23]; alternatively, a kernel is characteristic if equality in the mean elements under the two distributions μPX=μQX implies equality of the distributions. Two examples of characteristic kernels include the Gaussian kernel and the Laplacian kernel. Now if we use characteristic kernels in (7), then the partial cross-covariance operator is related to the CI relation via the following conclusion:

Proposition 1.

[22] , [23] LetX¨=(X,Z)andkX¨=kXkZ. Also letHX¨represent the RKHS corresponding tokX¨. AssumeE[kX(X,X)]<andE[kY(Y,Y)]<.[2]Further assume thatkX¨kYis a characteristic kernel on(X×Y)×Z, and thatHZ+R(the direct sum of the two RKHSs) is dense inLZ2. Then

(8)ΣX¨Y·Z=0XY|Z.

Here, ΣX¨Y·Z=0 means that f,ΣX¨Y·ZgHX¨=0 for all fHX¨ and all gHY. Recall further that ΣX¨Y·Z=0ΣX¨Y·ZHS2=0 because ΣX¨Y·ZHS2=i,jui,ΣX¨Y·ZvjHX¨2, where ui and vj are orthonormal bases of HX¨ and HY, respectively, by Definition 1.

4.2 Characterization by L2 spaces

We also consider a different characterization of CI which is intuitively more appealing because it allows to use to directly view CI as the uncorrelatedness of functions in certain spaces rather than a norm of the partial cross-covariance operator. In particular, consider the following constrained L2 spaces:

(9)FXZ{f˜LXZ2E(f˜|Z)=0},FYZ{g˜LYZ2E(g˜|Z)=0},FY·Z{h˜h˜=h(Y)E(h|Z),hLY2}.

Here, we write E(f˜|Z=z) as shorthand for E(f˜(·,z)) when f˜LXZ2, and likewise for g˜LYZ2. Notice that we can construct the three spaces listed above using nonlinear regression. For instance, for any fLXZ2 in FXZ, we have:

(10)f˜(X¨)=f(X¨)E(f|Z)=f(X¨)h(Z),

where h(Z)LZ2 is the regression function of f(X¨) on Z.

We then have the following result:

Proposition 2.

[24] The following conditions are equivalent:

1. 1.

XY|Z,

2. 2.

E(f˜g˜)=0,f˜FXZandg˜FYZ,

3. 3.

E(f˜g)=0,f˜FXZandgLYZ2,

4. 4.

E(f˜h˜)=0,f˜FXZandh˜FY·Z,

5. 5.

E(f˜g)=0,f˜FXZandgLY2.

The second condition means that any “residual” function of (X,Z) given Z is uncorrelated with that of (Y,Z) given Z. The equivalence also represents a generalization of the case when (X,Y,Z) is jointly Gaussian; here, XY|Z if and only if any residual function of X given Z is uncorrelated with that of Y given Z; i. e., the linear partial correlation coefficient ρXY·Z is zero.

We also encourage the reader to observe the close relationship between Proposition 1 and claim 4 of Proposition 2. Notice that claim 4 of Proposition 2 implies that we have ΣX¨Y·Z=0 in Proposition 1. Moreover, Proposition 1 only considers functions in RKHSs, while claim 4 of Proposition 2 considers functions in L2 spaces. We find Proposition 1 more useful than claim 4 of Proposition 2 because the RKHS of a characteristic kernel might be much smaller than the corresponding L2 space.

5 Kernel conditional independence test

We now consider the following hypotheses:

(11)H0:XY|Z,H1:X⫫̸Y|Z.

We may equivalently rewrite the above null and alternative more explicitly using Proposition 1 as follows, provided that the kernels are chosen such that the premises of that proposition are satisfied (e. g., when the kernels are Gaussian or Laplacian [22]):

(12)H0:ΣX¨Y·ZHS2=0,H1:ΣX¨Y·ZHS2>0.

The above hypothesis implies that we can test for conditional independence by testing for uncorrelatedness between functions in reproducing kernel Hilbert spaces.

Zhang et al. [ 10] exploited the equivalence between 11 and 12 in the Kernel Conditional Independence Test (KCIT), which we now describe. Consider the functional spaces fHX¨, gHY, and hHZ. We can compute the corresponding centered kernel matrices K˜X¨, K˜Y and K˜Z from n i. i. d. samples x,y and z as in 6. We can then use these matrices to perform kernel ridge regression in order to estimate the function hHZ in 10 as follows: hˆ(z)=K˜Z(K˜Z+λI)1f(x¨), where λ denotes the ridge regularization parameter and fHX¨. Consequently, we can estimate the residual function f˜ as f˜(x¨)=f(x¨)hˆ(z)=RZf(x¨) where:

(13)RZ=IK˜Z(K˜Z+λI)1=λ(K˜Z+λI)1.

Next consider the eigenvalue decomposition K˜X¨=VX¨ΛX¨VX¨. Let ϕX¨=VX¨ΛX¨1/2. Then the empirical kernel map of HX¨|Z is given by ϕ˜X¨=RZϕX¨ because K˜X¨|Z=ϕ˜X¨ϕ˜X¨T. We may similarly write K˜Y|Z=ϕ˜Yϕ˜YT. We can thus write the centralized kernel matrices corresponding to f˜ and g˜ as follows:

(14)K˜X¨|Z=RZK˜X¨RZ,K˜Y|Z=RZK˜YRZ.

We can use the above two centered kernel matrices to compute an empirical estimate of the HS norm of the partial cross covariance operator similar to how we computed the quantity TXY in Equation 5:

(15)Tx¨y·z=1n2tr(K˜X¨|ZK˜Y|Z).

Note that Tx¨y·z denotes an empirical estimate of ΣX¨Y·ZHS2. KCIT uses the statistic SK=nTx¨y·z in order to determine whether or not to reject H0; we multiply the empirical estimate of the HS norm by n in order to ensure convergence to a non-degenerate distribution.

6 Proposed test statistic & its asymptotic distribution

Observe that computing SK requires the inversion of large kernel matrices which scales strictly greater than quadratically with sample size; the exact complexity depends on the algorithm used to compute the matrix inverse (e. g., O(n2.376) if we use the Coppersmith-Winograd algorithm [25]). CCD algorithms run with KCIT therefore take too long to complete. In this report, we will propose an inexpensive hypothesis test that almost always rejects or fails to reject the null whenever conditional dependence or independence holds, respectively, and even outperforms 12 as illustrated in our experimental results in Section 7. We will use a strategy that has already been successfully adopted in the context of unconditional dependence testing in doing so [12], [13].

We will in particular also take advantage of the characterization of CI presented in Proposition 1. Recall that the Frobenius norm corresponds to the Hilbert-Schmidt norm in Euclidean space. We therefore consider the following hypotheses as approximations to those in 12 and 11:

(16)H0:CA¨B·ZF2=0,H1:CA¨B·ZF2>0,

where CA¨B·Z=E[(A¨iE(A¨|Z))(BiE(B|Z))T] corresponds to the ordinary partial cross covariance matrix, where E(A¨|Z) and E(B|Z) may be non-linear functions of Z. We also have A¨=f(X¨){f1(X¨),,fm(X¨)} with fj(X¨)GX¨, j. Similarly, B=h(Y){h1(Y),,hq(Y)} with hk(Y)GY, k. The terms GX¨ and GY denote two spaces of functions, which we set to be the support of the process 2cos(WT·+B), WPW, BUniform([0,2π]). In other words, we select m functions from GX¨ and q functions from GY. We select these specific spaces because we can use them to approximate continuous shift-invariant kernels. A kernel k is said to be shift-invariant if and only if, for any aRp, we have k(xa,ya)=k(x,y), (x,y)Rp×Rp; examples of shift-invariant kernels include the Gaussian kernel frequently used in KCIT or the Laplacian kernel. The following result allows us to perform the approximation using the proposed spaces:

Proposition 3.

[14] For a continuous shift-invariant kernelk(x,y)onRp, we have:

(17)k(x,y)=RpeiwT(xy)dFw=E[ζ(x)ζ(y)],
whereFWrepresents the CDF ofPWandζ(x)=2cos(WTx+B)withWPWandBUniform([0,2π]).

The precise form of PW depends on the type of shift-invariant kernel one would like to approximate (see Figure 1 of Rahimi and Recht [14] for a list). Since investigators most frequently implement KCIT using the Gaussian kernel kσ(x,y)=exp(xy2/σ) with hyperparameter σ, we choose to approximate the Gaussian kernel by setting PW to a centered Gaussian with standard deviation σ/2.

We will use the squared Frobenius norm of the empirical partial cross-covariance matrix as the statistic for RCIT:

(18)S=nCˆA¨B·ZF2,

where CˆA¨B·Z=1n1i=1n[(A¨iEˆ(A¨|Z))(BiEˆ(B|Z))T]. Recall however that E(A¨|Z) and E(B|Z) may be non-linear functions of Z and therefore difficult to estimate. We would instead like to approximate E(A¨|Z) and E(B|Z) with linear functions. We therefore let C=g(Z){g1(Z),,gd(Z)} with gl(Z)GZ,l, where we also set GZ to be the support of the process 2cos(WT·+B), WPW, BUniform([0,2π]). We will approximate CA¨B·Z with CˆA¨B·C=CˆA¨BCˆA¨C(ΣˆCC+γI)1CˆCB similar to 7, where γ denotes a small ridge parameter; recall that this is equivalent to computing the cross-covariance matrix across the residuals of A¨ and B given C using linear ridge regression. We can justify this procedure because the particular choice of C allows us to approximate both E(A¨|Z) and E(B|Z) with linear functions of C as described below.

Let fj=fjE(fj|Z). Then E(fj|Z)=0, so fjFXZ. Moreover, hkE(hk|Z)FY·Z. Note that we can estimate E(fj|Z) with the linear ridge regression solution uˆjTg(Z) under mild conditions because we can guarantee the following:

Proposition 4.

(Section 3.1 of Sutherland and Schneider [15]) Consider performing kernel ridge regression offjon Z. Assume that (1)i=1nfj,i=0and (2) the empirical kernel matrix of Z,kZ, only has finite entries (i. e.,kZ<). Further assume that the range of Z,ZRdZ, is compact. We then have:

(19)P[|E˘(fj|Z)uˆjTg(Z)|ε]c0ε2edε2c1,
whereE˘(fj|Z)denotes the estimate ofE(fj|Z)by kernel ridge regression, andc0andc1are both constants that do not depend on n or d.

The exponential rate with respect to d in the above proposition suggests we can approximate the output of kernel ridge regression with a small number of random Fourier features, a hypothesis which we verify empirically in Section 7. Moreover, we can estimate E(hk|Z) with uˆkTg(Z), because we can similarly guarantee that P[|E˘(hk|Z)uˆkTg(Z)|ε]0 for any fixed ε>0 at an exponential rate with respect to d.

We can therefore consider the following spaces for S which are similar to the L2 spaces used in claim 4 of Proposition 2:

(20)GˆX¨{ffj=fjE(fj|Z),fjGX¨},GˆY·Z{hhk=hkE(hk|Z),hkGY}.

We then approximate CI with S in the following sense:

1. 1.

We always have XY|ZE(fh)=0, fGˆX¨andhGˆY·Z.

2. 2.

The reverse direction may hold for a larger subset of all possible joint distributions as the values of m and q increase; this is because at least one entry of CA¨B·C will likely be greater than zero for any given distribution as the values of m,q increase.

Note the second point refers to the population CA¨B·C as opposed to its finite sample estimate. In this paper, we only deal with the classical low dimensional scenario where m,q are fixed and the sample size n. This is reasonable because nearly all CB algorithms only test for CI when X and Y each contain a single variable. We find that the second point held in all of the cases we tested in Section 7 with only m,q=5, since we were always able to reject the null H0:CA¨B·CF2=0 by generating enough samples with m,q=5 when X⫫̸Y|Z.

6.1 Null distribution

We now consider the asymptotic distribution of S under the null. Let Π refer to a positive definite covariance matrix of the vectorization of (A¨E(A¨|C))(BE(B|C))T. We may denote an arbitrary entry in Π as follows:

(21)ΠA¨iBj,A¨kBl=E[(A¨iE(A¨i|C))(BjE(Bj|C))(A¨kE(Ak|C))(BlE(Bl|C))].

We have the following result:

Theorem 1.

Consider n i. i. d. samples fromPXYZ. Let{z1,,zL}denote i. i. d. standard Gaussian variables (thus{z12,,zL2}denotes i.i.dχ12variables) and λ the eigenvalues of Π. We then have the following asymptotic distribution under the null in16:

(22)nCˆA¨B·CF2di=1Lλizi2,
where L refers to the number of elements inCˆA¨B·C.

Proof.

We may first write:

(23)nCˆA¨B·CF2=ntr(CˆA¨B·CCˆA¨B·CT)=nv(CˆA¨B·C)Tv(CˆA¨B·C),=[nv(CˆA¨B·C)]T[nv(CˆA¨B·C)],

where v(CˆA¨B·C) stands for the vectorization of CˆA¨B·C. By CLT of the sample covariance matrix (see Lemma 1 in the Appendix A) combined with the continuous mapping theorem and the null, we know that nv(CˆA¨B·C)dN(0,Π). Here, we write an arbitrary entry ΠA¨iBj,A¨kBl under the null as follows:

(24)ΠA¨iBj,A¨kBl=Cov[(A¨iE(A¨i|C))(BjE(Bj|C)),(A¨kE(A¨k|C))(BlE(Bl|C))]=E[(A¨iE(A¨i|C))(BjE(Bj|C))(A¨kE(A¨k|C))(BlE(Bl|C))].

Now consider the eigendecomposition of Π written as Π=EΛET. Then, we have ET[nv(CˆA¨B·C)]dN(0,Λ) by the continuous mapping theorem. Note that:

(25)[nv(CˆA¨B·C)]T[nv(CˆA¨B·C)]=(ET[nv(CˆA¨B·C)])T(ET[nv(CˆA¨B·C)])di=1Lλizi2.

□

We conclude that the null distribution of the test statistic is a positively weighted sum of i. i. d. χ12 random variables.

Note that multiple methods exist for estimating the conditional expectations in S and Π in the above theorem. In this report, we will obtain consistent estimates of the conditional expectations by using kernel ridge regressions with the RBF kernel; here, consistency holds so long as the conditional expectations are continuous because the RBF kernel is dense in the space of continuous functions mapping Z to R [26]. We therefore have C˘A¨B·CpCA¨B·C, where C˘A¨B·C=1n1i=1n[(A¨iE˘(A¨|C))(BiE˘(B|C))T], by the continuous mapping theorem and weak law of large numbers assuming continuity of the conditional expectations. We can similarly approximate any arbitrary entry in Π because we may write the following:

(26)1nr=1n(A¨i,rE˘(A¨i|C))(Bj,rE˘(Bj|C))(A¨k,rE˘(A¨k|C))(Bl,rE˜(Bl|C))pE[(A¨iE(A¨i|C))(BjE(Bj|C))(A¨kE(A¨k|C))(BlE(Bl|C))].

Kernel ridge regressions however scale (strictly greater than) quadratically with sample size due to the inversion of the kernel matrix, so they may not be practical in the large sample size regime. Fortunately, we will not need to perform the kernel ridge regressions directly, because we can approximate the output of kernel ridge regression to within an arbitrary degree of accuracy for any fixed sample size n using linear ridge regression with enough random Fourier features as highlighted previously in Proposition 4. In particular, Proposition 4 implies that uˆTg(Z)E˜(A¨|C)pE˘(A¨|C) at rate exponential in d for any fixed n. We can also conclude that Σ˜A¨B·CpΣ˘A¨B·C as d for any fixed n. We can finally approximate the kernel ridge regression estimate of Π because we may write the following for an arbitrary entry in Π as d:

(27)1nr=1n(A¨i,rE˜(A¨i|C))(Bj,rE˜(Bj|C))(A¨k,rE˜(A¨k|C))(Bl,rE˜(Bl|C))p1nr=1n(A¨i,rE˘(A¨i|C))(Bj,rE˘(Bj|C))(A¨k,rE˘(A¨k|C))(Bl,rE˘(Bl|C)).

We conclude that, for a dataset of fixed sample size n, we can substitute the conditional expectation estimates of kernel ridge regression with those of linear regression with random Fourier features when estimating S as well as Π for applying Theorem 1.

Unfortunately though, a closed form CDF of a positively weighted sum of chi-squared random variables does not exist in general for applying Theorem 1. We can approximate the CDF by Imhof’s method which inverts the characteristic function numerically [27]. We should consider Imhof’s method as exact, since it provides error bounds and can be used to compute the distribution at a fixed point to within a desired precision [28], [29]. However, Imhof’s method is too computationally intensive for our purposes. We can nonetheless utilize several fast methods which approximate the null by moment matching.

6.2 Approximating the null distribution by moment matching

We write the cumulants of a positively weighted sum of i. i. d. χ12 random variables as follows:

(28)cr=2r1(r1)!i=1Lλir,

where λ={λ1,,λL} denotes the weights. We may for example derive the first three cumulants:

(29)m1=i=1Lλi,m2=2i=1Lλi2,m3=8i=1Lλi3.

We then recover the moments from the cumulants as follows:

(30)mr=cr+i=1r1r1i1cimri,r=2,3,

Now the Satterthwaite-Welch method [30], [31], [32] represents perhaps the simplest and earliest moment matching method. The method matches the first two moments of the sum with a gamma distribution Γ(gˆ,θˆ). Zhang and colleagues adopted a similar strategy in their paper introducing KCIT [10]. Here, we have:

(31)gˆ=12c12/c2,θˆ=c2/c1.

We however find the above gamma approximation rather crude. We therefore also consider applying more modern methods to estimating the distribution of a sum of positively weighted chi-squares. Improved methods such as the Hall-Buckley-Eagleson [33], [34] and the Wood F [35] methods match the first three moments of the sum to other distributions in a similar fashion. On the other hand, the Lindsay-Pilla-Basak method [36] matches the first 2L moments to a mixture distribution.

We will focus on the Lindsay-Pilla-Basak method in this paper, since Bodenham & Adams have already determined that the Lindsay-Pilla-Basak method performs the best through extensive experimentation [37], [38]. We therefore choose to use the method as the default method for RCIT. Briefly, the method approximates the CDF under the null FH0 using a finite mixture of L Gamma CDFs FΓ(g,θi):

(32)FH0=i=1LπiFΓ(g,θi),

where π0, i=1Lπi=1, and we seek to determine the 2L+1 parameters g, θ1,,θL, and π1,,πL. The Lindsay-Pilla-Basak method computes these parameters by a specific sequence of steps that makes use of results concerning moment matrices (see Appendix II in Uspensky [39]). The sequence is complicated and beyond the scope of this paper, but we refer the reader to [36] for details.

6.3 Testing for conditional un-correlatedness

Strictly speaking, we must consider the extended variable set X¨ to test for conditional independence according to Proposition 1. However, we have two observations: (1) we can substitute a test for non-linear conditional uncorrelatedness with tests for conditional independence in almost all cases encountered in practice because most conditionally dependent variables are correlated after some functional transformations, and (2) using the extended variable set X¨ makes estimating the null distribution more difficult compared to using the unextended variable set X. The first observation coincides with the observations of others who have noticed that Fisher’s z-test performs well (but not perfectly) in ruling out conditional independencies with non-Gaussian data. We can also justify the first observation with the following result using the cross-covariance operator ΣXY·Z:

Proposition 5.

[22] , [23] AssumeE[kX(X,X)]<andE[kY(Y,Y)]<. Further assume thatkXkYis a characteristic kernel onX×Y, and thatHZ+R(the direct sum of the two RKHSs) is dense inLZ2. Then

(33)ΣXY·Z=0EZ[PXY|Z]=EZ[PX|ZPY|Z].

In other words, we have:

(34)ΣXY·Z=0PXY=PX|ZPY|ZdPZ,ΣXY·Z=0EZ[PX|ZPY|Z]=EZ[PXY|Z]XY|Z.

Notice that ΣXY·Z=0 is almost equivalent to CI, in the sense that ΣXY·Z=0 just misses those rather contrived distributions where PXY=PXY|ZdPZ=PX|ZPY|ZdPZ when X⫫̸Y|Z. In other words, if PXYPX|ZPY|ZdPZ when X⫫̸Y|Z, then we have ΣXY·Z=0ΣX¨Y·Z=0 (under the corresponding additional assumptions of Propositions 1 and 5).

Let us now consider an example of a situation where PXY|ZdPZPX|ZPY|ZdPZ when X⫫̸Y|Z. Take three binary variables X,Y,Z{0,1}. Let PZ=0=0.2 and PZ=1=0.8. Also consider the four probability tables in Table 1. Here, we have chosen the probabilities in the tables carefully by satisfying the following equation:

(35)PXY|ZdPZ=PX|ZPY|ZdPZPZ=0(PX|Z=0PY|Z=0)+PZ=1(PX|Z=1PY|Z=1)=PZ=0PXY|Z=0+PZ=1PXY|Z=1.

Of course, the equality holds when we have conditional independence PXY|Z=PX|ZPY|Z. We are however interested in the case when conditional dependence holds. We therefore instantiated the values of Tables 1a and 1b as well as the second column in Table 1c (PXY|Z=0) such that PXY|Z=0PX|Z=0PY|Z=0. We then solved for PXY|Z=1 using Equation 35 in order to complete Table 1c. This ultimately yielded Table 1d.

Table 1

Example of a situation where PXY|ZdPZ=PX|ZPY|ZdPZ when X⫫̸Y|Z using binary variables.

Notice that we obtain a unique value for PXY|Z=1 by solving Equation 35. Hence, PXY|Z=1 has Lebesgue measure zero on the interval [0,1], once we have defined all of the other variables in the equation. Thus, ΣXY·Z=0 is not always equivalent to XY|Z, but satisfying the condition PXY|ZdPZ=PX|ZPY|ZdPZ when X⫫̸Y|Z requires a very particular setup which is probably rarely encountered in practice.

The aforementioned argument motivates us to also consider a different empirical estimate of the squared Hilbert-Schmidt norm of the partial cross covariance operator:

(36)SK=nTxy·z=1ntr(K˜X|ZK˜Y|Z),

where we have replaced X¨ with X. We can approximate the null distribution of SK by utilizing the strategies presented in Sections 3.3 and 3.4 of Zhang et al. [10]. Here, we utilize the hypotheses:

(37)H0:ΣXY·ZHS2=0,H1:ΣXY·ZHS2>0.

We similarly consider a corresponding finite dimensional partial cross-covariance matrix:

(38)S=nCˆAB·CF2,

where we have replaced A¨ with A. The above statistic is a generalization of linear partial correlation, because we consider uncorrelatedness of the residuals of non-linear functional transformations after performing non-linear regression. The asymptotic distribution for S in Theorem 1 also holds for S, when we replace A¨ with A. Here, we use the hypotheses:

(39)H0:CAB·CF2=0,H1:CAB·CF2>0.

In practice, the test which uses S, which we now call the Randomized conditional Correlation Test (RCoT), usually rivals or outperforms RCIT and KCIT, because (1) nearly all conditionally dependent variables encountered in practice are also conditionally correlated after at least one functional transformation, and (2) we can easily calibrate the null distribution of the test using S even when Z has large cardinality. We will therefore find this test useful for replacing RCIT when we have large conditioning set sizes (≥4).

6.4 Time complexity

We now show that RCIT and RCoT have linear time complexity with respect to the sample size n. Let d denote the number of random Fourier features in C.

Proposition 6.

If we haven>d, then RCIT and RCoT have time complexityO(d2n).

Proof.

Wlog, we will prove the claim for RCIT (the proof for RCoT will follow analogously). Note that it suffices to show that all sub-procedures of RCIT have time complexity O(d2n) or o(d2n).

The first step of RCIT computes the random Fourier features on the support of the process 2cos(WT·+B), WPW, BUniform([0,2π]). Let aX denote the number of dimensions in X, and likewise for aY and aZ. Let m, q and d denote the number of random Fourier features in A¨,B and C, respectively. Note that computing the samples of A¨ requires O((aX+aZ)mn) time because W is a (aX+aZ)×m matrix. As a result, computing the samples of A¨ requires O(n) time with m, aZ and aX fixed. Similarly, computing the samples of B requires O(n) time with aY,q fixed and that of C requires O(dn) time with only aZ fixed.

The second step of RCIT estimates Eˆ(A¨|C) and Eˆ(B|C) using linear ridge regressions. Recall that linear ridge regression scales O(d2n) in time, where d corresponds to the number of features (assuming n>d). Now RCIT requires m+q linear ridge regressions in order to compute Eˆ(A¨|C) and Eˆ(B|C). We therefore conclude that estimating Eˆ(A¨|C) and Eˆ(B|C) can be done in O(d2n) time with m and q fixed.

Next, computing the covariance ΣˆA¨B·C requires O(mqn) time. Finally, the time complexity of all of the methods used to approximate the null distribution do not depend on d or n once given ΣˆA¨B·C; we thus conclude that all of the null distribution approximation methods have time complexity O(1) when m and q are fixed.

We have shown that all sub-procedures of RCIT scale O(d2n) or o(d2n). We therefore conclude that RCIT has time complexity O(d2n).  □

The above proposition implies that RCIT and RCoT have time complexity O(n) because d is set to a fixed number regardfless of sample size. Recall that d is fixed because we have an exponential convergence rate with respect to d as highlighted previously in Proposition 4; in other words, a fixed d is reasonable so long as n does not become extremely large. In practice, we have found that setting m=5, q=5 and p=25 works well for a variety of sample sizes and dimensions of Z as highlighted in the next section. The statement holds so long as we choose a very small regularization constant λ (e. g., 1E10); note that this is different from the standard prediction regime, where we must carefully tune λ to prevent overfitting. We can utilize a small regularization constant in the proposed CI test setting because the entries of CˆA¨B·C are never seen during training time.

7 Experiments

We carried out experiments to compare the empirical performance of the following tests:

1. RCIT: uses S with the Lindsay-Pilla-Basak approximation,

2. RCoT: uses S with the Lindsay-Pilla-Basak approximation,

3. KCIT: uses SK with a simulated null by bootstrap.

4. KCoT: uses SK with a simulated null by bootstrap.

We estimated the conditional expectations for S and S using linear ridge regressions with random Fourier features. We also compared RCIT and RCoT against permutation tests and list the results in the Appendix A. We present the results of KCoT in Appendix A.3 as well. Note that KCIT with the gamma approximation performs slightly faster than KCIT with bootstrap,[3] but the bootstrap results in a significantly better calibrated null distribution. We focus on large sample size (≥500) scenarios because we can just apply KCIT with bootstrap otherwise. We ran all experiments using the R programming language (Microsoft R Open) on a laptop with 2.60 GHz of CPU and 16 GB of RAM.

7.1 Hyperparameters

We used the same hyperparameters for RCIT and RCoT. Namely, we used the median Euclidean distance heuristic across the first 500 samples of X¨, X, Y and Z for choosing the σX¨, σX, σY, and σZ hyperparameters for the Gaussian kernels kσ(x,y)=exp(xy2/σ), respectively[4] [16] , [40]. We also fixed the number of Fourier features for X¨, X and Y to 5 and the number of Fourier features for Z to 25. We standardized all original and Fourier variables to mean zero unit variance in order to help ensure numerically stable computations. Finally, we set γ to 1E10 in order to keep bias minimal. These parameters are reasonable because we designed both RCIT and RCoT for the purposes of causal discovery, where the variable set Z is small with a sparse underlying causal graph. We can therefore utilize a relatively small number of random Fourier features as compared to the sample size n. Authors who wish to apply RCIT or RCoT in the high dimensional scenario should consider utilizing more Fourier features for Z and choosing the lambda values carefully (e. g., through cross-validation or information criteria).

With KCIT, we set σ to the squared median Euclidean distance between (X,Y) using the first 500 samples times double the conditioning set size; the hyperparameters as described in the original paper, the hyperparameters in the author-provided MATLAB implementation and the hyperparameters of RCIT/RCoT all gave worse performance.

7.2 Type I error

We analyzed the Type I error rates of the three CI tests as a function of sample size and conditioning set size. We evaluated the algorithms using the Kolmogorov-Smirnov (KS) test statistic. Recall that the KS test uses the following statistic:

(40)K=supxR|Fˆ(x)F(x)|=FˆXFX,

where FˆX denotes the empirical CDF, and FX some comparison CDF. If the sample comes from PX, then K converges to 0 almost surely as n by the Glivenko-Cantelli theorem.

Now a good CI test controls the Type I error rate at any α value, when we have a uniform sampling distribution of the p-values over [0,1]. Therefore, a good CI test should have a small KS statistic value, when we set FX to the uniform distribution over [0,1].

To compute the KS statistic values, we generated data from 1000 post non-linear models [10], [11]. We can describe each post non-linear model as follows: X=g1(Z+ε1), Y=g2(Z+ε2), where Z,ε1,ε2 have jointly independent standard Gaussian distributions, and g1,g2 denote smooth functions. We always chose g1,g2 uniformly from the following set of functions: {(·),(·)2,(·)3,tanh(·),exp(·2)}. Thus, we have XY|Z in any case. Notice also that this situation is more general than the additive noise models proposed in Ramsey [41], where we have X=g1(Z)+ε1, Y=g2(Z)+ε2. The post non-linear models allow us to simulate heteroskedastic noise which is commonly encountered in real scenarios but not captured with additive noise models.

7.2.1 Sample size

We first assess the Type I error rate as a function of sample size. We used sample sizes of 500, 1000, 2000, 5000, ten thousand, one hundred thousand and one million. A good CI test should control the Type I error rate across all α values at any sample size. Figure 1a summarizes the KS statistic values for the three different CI tests. Observe that all tests have similar KS statistic values across different sample sizes. We conclude that all three tests perform comparably in controlling the Type I error rate with a single conditioning variable at different sample sizes.

The run time results however tell a markedly different story. Both RCIT and RCoT output a p-value much more quickly than KCIT at different sample sizes (Figure 1b). Moreover, KCIT ran out of memory at 5000 samples while RCIT and RCoT handled one million samples in a little over 6 seconds. RCIT and RCoT also completed more than two orders of magnitude faster than KCIT on average at a sample size of 2000 (Figure 1c). We conclude that RCIT and RCoT are more scalable than KCIT. Moreover, the experimental results agree with standard matrix complexity theory; RCIT and RCoT scale linearly with sample size (see Proposition 6), while KCIT scales strictly greater than quadratically with sample size.

7.2.2 Conditioning set size

CCD algorithms request p-values from CI tests using large conditioning set sizes. In fact, algorithms which do not assume causal sufficiency, such as FCI, often demand very large conditioning set sizes (>5). We should however also realize that CCD algorithms search for minimal conditioning sets in order to establish ancestral relations. This means that we must focus on testing for cases where X⫫̸Y|Z, but we have either XY|ZA or X⫫̸Y|ZA, where |A|=1.

We therefore evaluated the Type I error rates of the CI tests as a function of conditioning set size by fixing the sample size at 1000 and then adding 1 to 10 standard Gaussian variables into the conditioning set so that X=g1(1kj=1kZj+ϵ1), Y=g2(1kj=1kZj+ϵ2), k={1,,10} in 1000 models. Note that this situation corresponds to 1 to 10 common causes.

Figure 1d summarizes the KS statistic values in the aforementioned scenario. We see that the KS statistic values for RCoT remain the smallest for nearly all conditioning set sizes, followed by RCIT and then KCIT. This implies that RCoT best approximates the null distribution out of the three CI tests. We also provide the histograms of the p-values across the 1000 post non-linear models at a conditioning set size of 10 for KCIT, RCIT, and RCoT in Figures 1e–1g. Notice that the histograms become progressively more similar to a uniform distribution. We conclude that RCoT controls its Type I error rate the best even with large conditioning set sizes while KCIT controls its rate the worst.

Now the run times of all three tests only increased very slightly with the conditioning set size (Figure 1h). However, both RCIT and RCoT still completed 40.91 times faster than KCIT on average (95 % confidence interval: ±0.44). These results agree with standard matrix complexity theory, as we expect all tests to scale linearly with dimensionality.

Figure 1

Experimental results of RCIT, RCoT and KCIT when conditional independence holds. (a) All tests have comparable KS statistic values as a function of sample size. (b) However, both RCIT and RCoT complete much faster than KCIT. (c) The relative difference in speed between RCIT vs. KCIT and RCoT vs. KCIT grows with increasing sample size. (d) RCoT maintains the lowest KS statistic value with increases in dimensionality. (e–g) Histograms with a conditioning set size of 10 show that KCIT, RCIT and RCoT obtain progressively more uniform null distributions. (h) Run times of all three tests scale linearly with dimensionality of the conditioning set.

7.3 Power

We next evaluated test power (i. e., 1-(Type II error rate)) by computing the area under the power curve (AUPC). The AUPC corresponds to the area under the empirical CDF of the p-values returned by a CI test when the null does not hold. A CI test has higher power when its AUPC is closer to one. For example, observe that if a CI test always returns a p-value of 0 in the perfect case, then its AUPC corresponds to 1.

We examined the AUPC by adding the same small error εbN(0,1/16) to both X and Y in 1000 post non-linear models as follows: X=g1(εb+ε1), Y=g2(εb+ε2), ZN(0,1). Here, we do not allow the CI tests to condition on εb, so we always have X⫫̸Y|Z; this situation therefore corresponds to a hidden common cause.

7.3.1 Sample size

We first examine power as a function of sample size. We again tested sample sizes of 500, 1000, 2000, 5000, ten thousand, one hundred thousand, and one million. We have summarized the results in Figure 2a. Both RCIT and RCoT have comparable AUPC values to KCIT with sample sizes of 500, 1000 and 2000. At larger sample sizes, KCIT again did not scale due to insufficient memory, but the AUPC of both RCIT and RCoT continued to increase at similar values. We conclude that all three tests have similar power.

The run time results mimic those of Section 7.2.1; RCIT and RCoT completed orders of magnitude faster than KCIT (Figures 2b and 2c).

7.3.2 Conditioning set size

We next examined power as a function of conditioning set size. To do this, we fixed the sample size at 1000 and set Z=(Z1,,Zk) with ZN(0,Ik), k={1,,10} in the 1000 post non-linear models. We therefore examined how well the CI tests reject the null under an increasing conditioning set size with uninformative variables. A good CI test should either (1) maintain its power or, more realistically, (2) suffer a graceful decline in power with an increasing conditioning set size because none of the variables in the conditioning set are informative for rendering conditional independence by design.

We have summarized the results in Figure 2d. Notice that all tests have comparable AUPC values with small conditioning set sizes (between 1 and 3), but the AUPC value of KCIT gradually increases with increasing conditioning set sizes; the AUPC value should not increase under the current setup with a well-calibrated null because the extra variables are uninformative. To determine the cause of the unexpected increase in power, we permuted the values of X in each run in order to assess the calibration of the null distribution. Figure 2f summarizes the results, where we can see that only KCIT’s KS statistic grows with an increasing conditioning set size. We can therefore claim that the increasing AUPC value of KCIT holds because of a badly calibrated null distribution with larger conditioning set sizes. We conclude that both RCIT and RCoT maintain steady power under an increasing conditioning set size while KCIT does not.

The run times in Figures 2e and 2g again mimic those in Section 7.2.2 with RCIT and RCoT completing in a much shorter time frame than KCIT.

Figure 2

Experimental results with RCIT, RCoT and KCIT as a function of sample size and conditioning set size when conditional dependence holds. (a) All tests have comparable AUPC values as a function of sample size with a conditioning set size of one. (b–c) Both RCIT and RCoT again complete much faster than KCIT. (d) KCIT’s AUPC value unexpectedly increases with the dimensionality of the conditioning set. Associated run times for (d) in (e). (f) The cause of KCIT’s AUPC increase lies in a badly calibrated null distribution; here we see that only KCIT’s KS statistic value increases under the null. Associated run times for (f) in (g).

7.4 Causal structure discovery

We next examine the accuracy of graphical structures as recovered by PC [1], FCI [42] and RFCI [43] when run using RCIT, RCoT or KCIT.

We used the following procedure in Colombo et al. [43] to generate 250 different Gaussian DAGs with an expected neighborhood size E(N)=2 and v=20 vertices. First, we generated a random adjacency matrix A with independent realizations of Bernoulli(E(N)/(v1)) random variables in the lower triangle of the matrix and zeroes in the remaining entries. Next, we replaced the ones in A by independent realizations of a Uniform([1,0.1][0.1,1]) random variable. We interpret a nonzero entry Aij as an edge from Xi to Xj with coefficient Aij in the following linear model:

(41)X1=ε1,Xi=r=1v1AirXr+εi.

for i=2,,v where ε1,,εv are mutually independent standard Gaussian random variables. The variables {X1,,Xv}=X then have a multivariate Gaussian distribution with mean 0 and covariance matrix Σ=(IvA)1(IvA)T, where Iv is the v×v identity matrix. To introduce non-linearities, we passed each variable in X through a non-linear function g again chosen uniformly from the set {(·),(·)2,(·)3,tanh(·),exp(·2)}.

For FCI and RFCI, we introduced latent and selection variables using the following procedure. For each DAG, we first randomly selected a set of 0–3 latent common causes L. From the set XL, we then selected a set of 0–3 colliders as selection variables S. For each selection variable in S, we subsequently eliminated the bottom q percentile of samples, where we drew q according to independent realizations of a Uniform([0.1,0.5]) random variable. We finally eliminated all of the latent variables from the dataset.

We ultimately created 250 different 500 sample datasets for PC, FCI and RFCI. We then ran the sample versions of PC, FCI and RFCI using RCIT, RCoT, KCIT and Fisher’s z-test (FZT) at α=0.05. We also obtained the oracle graphs by running the oracle versions of PC, FCI and RFCI using the ground truth.

We have summarized the results as structural Hamming distances (SHDs) from the oracle graphs in Figure 3a. PC run with RCIT and PC run with RCoT both outperformed PC run with KCIT by a large margin according to paired t-tests (PC RCIT vs. KCIT, t=14.76, p<2.2E16; PC RCoT vs. KCIT, t=12.87, p<2.2E16). We found similar results with FCI and RFCI, although by only a small margin; 3 of the 4 comparisons fell below the Bonferonni corrected threshold of 0.05/6 and the other comparison fell below the uncorrected threshold of 0.05 (FCI RCIT vs. KCIT, t=2.00, p=0.047; FCI RCoT vs. KCIT t=2.96, p=0.0034; RFCI RCIT vs. KCIT, t=3.56, p=4.5E4; RFCI RCoT vs. KCIT, t=2.80, p=0.0055). All algorithms with any of the kernel-based tests outperformed the same algorithms with FZT by a large margin (p<7E14 in all cases). Finally, the run time results in Figure 3b show that the CCD algorithms run with RCIT and RCoT complete at least 13 times faster on average than those run with KCIT. We conclude that both RCIT and RCoT help CCD algorithms at least match the performance of the same algorithms run with KCIT, but RCIT and RCoT do so within a much shorter time frame than KCIT.

Figure 3

Results of CCD algorithms as evaluated by mean (a) SHD and (b) run times. The CCD algorithms run with KCIT perform comparably (or even slightly worse) to those run with RCIT and RCoT in (a). Run times in (b) show that the CCD algorithms run with RCIT and RCoT complete at least 13 times faster on average than those with KCIT. Error bars denote 95 % confidence intervals of the mean.

7.5 Real data

We finally ran PC, FCI and RFCI using RCIT, RCoT, KCIT and FZT at α=0.05 on a publicly available longitudinal dataset from the Cognition and Aging USA (CogUSA) study [44], where scientists measured the cognition of men and women above 50 years of age. The dataset contains 815 samples, 18 variables and two waves (thus 18/2=9 variables in each wave) separated by two years after some data cleaning.[5] Note that we do not have access to a gold standard solution set in this case. However, we can utilize the time information in the dataset to detect false positive ancestral relations directed backwards in time.

We ran the CCD algorithms on 30 bootstrapped datasets. Results are summarized in Figure 4. Comparisons with PC did not reach the Bonferonni level among the kernel-based tests, although PC run with either RCIT or RCoT yielded fewer false positive ancestral relations on average than PC run with KCIT near an uncorrected level of 0.05 (PC RCIT vs. KCIT, t=2.76, p=9.85E3; PC RCoT vs. KCIT, t=1.99, p=0.056). However, FCI and RFCI run with either RCIT or RCoT performed better than those run with KCIT at a Bonferroni corrected level of 0.05/6 (FCI RCIT vs. KCIT, t=29.57, p<2.2E16; FCI RCoT vs. KCIT, t=17.41, p<2.2E16; RFCI RCIT vs. KCIT, t=6.50, p=4.13E7; RFCI RCoT vs. KCIT, t=7.39, p=3.85E8). The CCD algorithms run with FZT also gave inconsistent results; PC run with FZT performed the best on average, but FCI and RFCI run with FZT also performed second from the worst. Here, we should trust the outputs of FCI and RFCI more strongly than those of PC, since both FCI and RFCI allow latent common causes and selection bias which often exist in real data. Next, CCD algorithms run with RCIT performed comparably to those run with RCoT (PC RCIT vs. RCoT, t=1.05, p=0.301; FCI RCIT vs. RCoT, t=1.54, p=0.134; RFCI RCIT vs. RCoT, t=0.89, p=0.380). We finally report that the CCD algorithms run with RCIT and RCoT complete at least 40 times faster on average than those run with KCIT (Figure 4b). We conclude that CCD algorithms run with either RCIT or RCoT perform at least as well as those run with KCIT on this real dataset but with large reductions run time.

Figure 4

Results of CCD algorithms as evaluated on real longitudinal data. Part (a) displays mean counts of the number of ancestral relations directed backwards time. We do not display 95 % confidence intervals when we computed a standard error of zero. Part (b) summarizes the mean run times.

8 Conclusion

We developed two statistical tests called RCIT and RCoT for fast non-parametric CI testing. Both RCIT and RCoT approximate KCIT by sampling Fourier features. Moreover, the proposed tests return p-values orders of magnitude faster than KCIT in the large sample size setting. RCoT in particular also has a better calibrated null distribution than KCIT especially with larger conditioning set sizes. In causal graph discovery, RCIT and RCoT help CCD algorithms recover graphical structures at least as accurately as KCIT but, most importantly, also allow the algorithms to complete in a much shorter time frame. We believe that the speedups provided by RCIT and RCoT will make non-parametric causal discovery more accessible to scientists who wish to apply CCD algorithms to their datasets.

Note that RCIT and RCoT may estimate the null distribution more accurately than KCIT for multiple reasons. First, RCIT and RCoT both assess for non-vanishing covariances across a smaller set of functions than KCIT. This in turn allows the two tests to more easily estimate the null distribution than KCIT because KCIT must deal with all of the functions in the associated RKHS. Second, both RCIT and RCoT utilize more advanced methods of estimating of the null distribution than KCIT. RCIT and RCoT more specifically utilize the Lindsay-Pilla-Basak method as explained in Section 6.2 as opposed to matching the first two moments of a gamma distribution. Third, RCoT in particular utilizes the variable set X rather than X¨ which allows for lower dimensional inferences as explained in Section 6.3. RCIT and RCoT thus take advantage of new technologies and additional structure inherent within real data in order to achieve better control of the Type I error rate.

Funding source: National Human Genome Research Institute

Award Identifier / Grant number: U54HG008540

Funding source: U.S. National Library of Medicine

Award Identifier / Grant number: T15LM007059

Award Identifier / Grant number: R01LM012095

Funding statement: Research reported in this publication was supported by grant U54HG008540 awarded by the National Human Genome Research Institute through funds provided by the trans-NIH Big Data to Knowledge initiative. The research was also supported by the National Library of Medicine of the National Institutes of Health under award numbers T15LM007059 and R01LM012095.

Acknowledgment

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Appendix A

A.1 CLT for sample covariance

We will prove the central limit theorem (CLT) for the sample covariance matrix. We first have the following sample covariance matrices with known and unknown expectation vector, respectively:

(42)C¨=1ni=1n[XiE(X)][XiE(X)]T,Cˆ=1n1i=1n[XiEˆ(X)][XiEˆ(X)]T.

Now observe that we may write:

(43)(n1)Cˆ=i=1n[XiE(X)(Eˆ(X)E(X))][XiE(X)(Eˆ(X)E(X))]T=i=1n(XiE(X))(XiE(X))T+n(Eˆ(X)E(X))(Eˆ(X)E(X))T2(Eˆ(X)E(X))i=1n(XiE(X))T=nC¨n(Eˆ(X)E(X))(Eˆ(X)E(X))T

It follows that:

(44)n(CˆC)=n(n1n1CˆC)=n(nn1C¨nn1(Eˆ(X)E(X))(Eˆ(X)E(X))TC)=nnn1C¨nnn1(Eˆ(X)E(X))(Eˆ(X)E(X))TnC=nnn1C¨nnn1(Eˆ(X)E(X))(Eˆ(X)E(X))Tn1n1nC=nnn1(C¨C)nnn1(Eˆ(X)E(X))(Eˆ(X)E(X))T+nn1C

We are now ready to state the result:

Lemma 1.

LetX1,,Xnrefer to a sequence of i. i. d. random k-vectors. Denote the expectation vector and covariance matrix ofX1asμ1andC1, respectively. Assume thatC˘1=Cov[vu((X1μ1)(X1μ1)T)]is positive definite, wherevu(M)denotes the vectorization of the upper triangular portion of a real symmetric matrix M. Then, we have:

(45)n(vu(Cˆ)vu(C1))dN(0,C˘1).

Proof.

Consider the quantity aT[n(vu(Cˆ)vu(C1))]=n(aTvu(Cˆ)aTvu(C1)) where aRk(k+1)/2{0}. Note that aTvu[(X1μ1)(X1μ1)T], …, aTvu[(Xnμ1)(Xnμ1)T] is a sequence of i. i. d. random variables with expectation aTvu(C1) and variance aTC˘1a. Moreover observe that C˘1< because C˘1 is positive definite. We can therefore apply the univariate central limit theorem to conclude that:

(46)n(aTvu(C¨1)aTvu(C1))dN(0,aTC˘1a),

where C¨1=1ni=1n(Xiμ1)(Xiμ1)T. We would however like to claim that:

(47)n(aTvu(Cˆ)aTvu(C1))dN(0,aTC˘1a).

In order to prove this, we use 44 and set:

(48)n(aTvu(Cˆ)aTvu(C1))=aTAn+aTBn,

where we have:

(49)An=nn1n(vu(C¨1)vu(C)),Bn=nn1vu(C1)nnn1vu[(Eˆ(X)μ1)(Eˆ(X)μ1)T].

We already know from 46 that:

(50)n(aTvu(C¨1)aTvu(C1))dN(0,aTC˘1a).

Therefore, so does aTAn by Slutsky’s lemma, when we view the sequence of constants nn1 as a sequence of random variables. For aTBn, we know that:

(51)n(aTEˆ(X)aTμ1)dN(0,aTC1a),

by viewing aTX1,,aTXn as a sequence of random variables, noting that E(X1X1T)< because C˘1 is positive definite and then applying the univariate central limit theorem. We thus have aTBnp0. We may then invoke Slutsky’s lemma again for aTAn+aTBn and claim that:

(52)n(aTvu(Cˆ)aTvu(C1))dN(0,aTC˘1a).

We conclude the lemma by invoking the Cramer-Wold device.  □

A.2 Kernel conditional correlation test (KCoT) results

We compared KCoT against RCIT, RCoT and KCIT. We report the results in Figure 5. All tests perform comparably as a function of sample size (Figures 5a and 5c). However, KCoT performs better than KCIT and underperforms RCoT and RCIT as a function of conditioning set size. In particular, RCIT and RCoT obtain smaller KS-statistic values as a function of the conditioning set size (Figure 5b). KCIT and KCoT also obtain larger AUPC values with an increasing conditioning set size because they fail to maintain a uniform distribution under the null (Figure 5d; we again permuted the values of X for Figure 5e). We therefore conclude that RCoT and RCIT control their Type I error rates better than KCIT and KCoT even with large conditioning set sizes while maintaining power.

Figure 5

Results comparing against KCoT. Sub-figures (a) and (b) correspond to the KS-statistic values as a function of sample size and dimensions, respectively. Notice that RCIT and RCoT have progressively smaller KS-statistic values than both KCIT and KCoT with increasing dimensions. Next, sub-figures (c) and (d) correspond to the AUPC values also as a function of sample size and dimensions, respectively. KCIT and KCoT claim the largest AUPC values because both tests fail to control the Type I error rate well, as summarized by the large KS-statistic values in sub-figure (e) obtained after permuting the values of X.

Figure 6

Results of the same KS statistic experiments as in Section 7.2 except comparing RCIT and RCoT against S-Perm and S-Perm. Subfigures (a) and (b) again vary as a function of sample size, while subfigures (c) and (d) vary as a function of conditioning set size. RCIT and RCoT are faster than the permutation tests but yield comparable average KS statistic values.

Figure 7

Results of the same AUPC experiments as in Section 7.3 except comparing RCIT and RCoT against S-Perm and S-Perm. Subfigures (a) and (b) again vary as a function of sample size, while subfigures (c) and (d) vary as a function of conditioning set size. RCIT and RCoT are much faster than the permutation tests but yield comparable accuracy.

A.3 Comparisons against permutation

We also compared RCIT and RCoT against permutation CI tests. Here, we estimated the null distribution of S and S with permutations and call the resultant CI tests S-Perm and S-Perm, respectively. The permutation tests specifically involve permuting the residuals of the random Fourier features of Y one thousand times in order to estimate the null distribution. We have summarized the Type I error rate results in Figure 6 and the AUPC results in Figure 7. We ran the sample size experiments up to only ten thousand samples due to the long run times of S-Perm and S-Perm. We found that RCIT and RCoT perform similarly to the permutation tests but with significantly reduced average run time.

References

1. Spirtes P, Glymour C, Scheines R. Causation, prediction, and search. 2nd ed. MIT press; 2000.10.7551/mitpress/1754.001.0001Search in Google Scholar

2. Fisher RA. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika. 1915;10(4):507–21. https://doi.org/10.2307/2331838. ISSN 00063444.10.2307/2331838Search in Google Scholar

3. Fisher RA. On the probable error of a coefficient of correlation deduced from a small sample. Metron. 1921;1:3–32.Search in Google Scholar

4. Pearson K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos Mag Ser 5. 1900;50:157–75.10.1007/978-1-4612-4380-9_2Search in Google Scholar

5. Tsamardinos I, Borboudakis G. Permutation testing improves bayesian network learning. Berlin, Heidelberg: Springer; 2010. p. 322–37. http://dx.doi.org/10.1007/978-3-642-15939-8_21. ISBN 978-3-642-15939-8.10.1007/978-3-642-15939-8_21Search in Google Scholar

6. Margaritis D. Distribution-free learning of bayesian network structure in continuous domains. In: Proceedings, the twentieth national conference on artificial intelligence and the seventeenth innovative applications of artificial intelligence conference. July 9–13, 2005, Pittsburgh, Pennsylvania, USA. 2005. p. 825–30. http://www.aaai.org/Library/AAAI/2005/aaai05-130.php.Search in Google Scholar

7. Huang T-M. Testing conditional independence using maximal nonlinear conditional correlation. Ann Stat. 2010;38(4):2047–91.10.1214/09-AOS770Search in Google Scholar

8. Su L, White H. A consistent characteristic function-based test for conditional independence. J Econom. 2007;141(2):807–34. https://doi.org/10.1016/j.jeconom.2006.11.006. http://www.sciencedirect.com/science/article/B6VC0-4MT59DD-4/2/267e7fc8dd979b6148fc4123998e94ee. ISSN 0304-4076.10.1016/j.jeconom.2006.11.006Search in Google Scholar

9. Su L, White H. A nonparametric hellinger metric test for conditional independence. Econom Theory. 2008;24(4):829–64. http://www.jstor.org/stable/20142523. ISSN 02664666, 14694360.10.1017/S0266466608080341Search in Google Scholar

10. Zhang K, Peters J, Janzing D, Schlkopf B. Kernel-based conditional independence test and application in causal discovery. In: Uncertainty in artificial intelligence. AUAI Press; 2011. p. 804–13. http://dblp.uni-trier.de/db/conf/uai/uai2011.html#ZhangPJS11. ISBN 978-0-9749039-7-2.Search in Google Scholar

11. Doran G, Muandet K, Zhang K, Schölkopf B. A permutation-based kernel conditional independence test. In: Proceedings of the 30th conference on uncertainty in artificial intelligence (UAI2014). Oregon: AUAI Press Corvallis; 2014. p. 132–41.Search in Google Scholar

12. Lopez-Paz D, Hennig P, Schölkopf B. The randomized dependence coefficient. In: Advances in neural information processing systems 26. 2013. p. 1–9.Search in Google Scholar

13. Zhang Q, Filippi S, Gretton A, Sejdinovic D. Large-scale kernel methods for independence testing. Stat Comput. 2017; 1–18. https://doi.org/10.1007/s11222-016-9721-7. ISSN 1573-1375.10.1007/s11222-016-9721-7Search in Google Scholar

14. Rahimi A, Recht B. Random features for large-scale kernel machines. In: Neural information processing systems. 2007.Search in Google Scholar

15. Sutherland DJ, Schneider JG. On the error of random fourier features. In: Meila M, Heskes T, editors. UAI. AUAI Press; 2015. p. 862–71. http://dblp.uni-trier.de/db/conf/uai/uai2015.html#SutherlandS15. ISBN 978-0-9966431-0-8.Search in Google Scholar

16. Lopez-Paz D, Sra S, Smola A, Ghahramani Z, Schölkopf B. Randomized nonlinear component analysis. In: Proceedings of the 31st international conference on machine learning, W&CP 32 (1). JMLR; 2014. p. 1359–67.Search in Google Scholar

17. Rahimi A, Recht B. Uniform approximation of functions with random bases. IEEE; 2008. https://doi.org/10.1109/ALLERTON.2008.4797607.10.1109/ALLERTON.2008.4797607Search in Google Scholar

18. Gretton A, Bousquet O, Smola A, Schlkopf B. Measuring statistical dependence with hilbert-schmidt norms. In: Proceedings in algorithmic learning theory. Springer-Verlag; 2005. p. 63–77.10.1007/11564089_7Search in Google Scholar

19. Berlinet A, Thomas-Agnan C. Reproducing kernel Hilbert spaces in probability and statistics. US: Springer; 2003. https://books.google.com/books?id=v79sBNG34coC. ISBN 9781402076794.10.1007/978-1-4419-9096-9Search in Google Scholar

20. Hein M, Bousquet O. Kernels, associated structures and generalizations. Technical Report 127. Tübingen, Germany: Max Planck Institute for Biological Cybernetics; 2004.Search in Google Scholar

22. Fukumizu K, Bach FR, Jordan MI. Dimensionality reduction for supervised learning with reproducing kernel hilbert spaces. J Mach Learn Res. 2004;5:73–99. http://dl.acm.org/citation.cfm?id=1005332.1005335. ISSN 1532-4435.10.21236/ADA446572Search in Google Scholar

23. Fukumizu K, Gretton A, Sun X, Schölkopf B. Kernel measures of conditional dependence. In: Advances in neural information processing systems. Red Hook, NY, USA: Curran; 2008. p. 489–96. Max-Planck-Gesellschaft. http://papers.nips.cc/paper/3340-kernel-measures-of-conditional-dependence-supplemental.zip.Search in Google Scholar

24. Daudin JJ. Partial association measures and an application to qualitative regression. 1980;67(3):581–590. https://doi.org/10.1093/biomet/67.3.581. https://doi.org/10.2307/2335127. http://www.jstor.org/stable/2335127. ISSN 0006-3444 (print), 1464-3510 (electronic).Search in Google Scholar

25. Coppersmith D, Winograd S. Matrix multiplication via arithmetic progressions. J Symb Comput. 1990;9(3):251–80. https://doi.org/10.1016/S0747-7171(08)80013-2. ISSN 0747-7171.10.1145/28395.28396Search in Google Scholar

26. Micchelli CA, Xu Y, Zhang H. Universal kernels. J Mach Learn Res. 2006;6:2651–67. http://dblp.uni-trier.de/db/journals/jmlr/jmlr7.html#MicchelliXZ06.Search in Google Scholar

27. Imhof JP. Computing the distribution of quadratic forms in normal variables. Biometrika. 1961;48(3/4):419–26. https://doi.org/10.2307/2332763.10.1093/biomet/48.3-4.419Search in Google Scholar

28. Solomon H, Stephens MA. Distribution of a sum of weighted chi-square variables. J Am Stat Assoc. 1977;72:881–5.Search in Google Scholar

29. Johnson NL, Kotz S, Balakrishnan N. Continuous multivariate distributions. 3rd ed. Wiley; 2002.Search in Google Scholar

30. Welch BL. The significance of the difference between two means when the population variances are unequal. Biometrika. 1938;29(3–4):350–62. https://doi.org/10.1093/biomet/29.3-4.350.10.1093/biomet/29.3-4.350Search in Google Scholar

31. Satterthwaite F. An approximate distribution of estimates of variance components. Biom Bull. 1946;2(6):110–4.10.2307/3002019Search in Google Scholar

32. Fairfield-Smith H. The problem of comparing the results of two experiments with unequal errors. J Counc Sci Ind Res. 1936;9:211–2.Search in Google Scholar

33. Hall P. Chi squared approximations to the distribution of a sum of independent random variables. Ann Probab. 1983;11(4):1028–36. https://doi.org/10.1214/aop/1176993451. 11.10.1214/aop/1176993451Search in Google Scholar

34. Buckley MJ, Eagleson GK. An approximation to the distribution of quadratic forms in normal random variables. Aust N Z J Stat. 1988;30(1):150–9.10.1111/j.1467-842X.1988.tb00471.xSearch in Google Scholar

35. Wood ATA. An f approximation to the distribution of a linear combination of chi-squared variables. Commun Stat, Simul Comput. 1989;18:1439–56.10.1080/03610918908812833Search in Google Scholar

36. Lindsay B, Pilla R, Basak P. Moment-based approximations of distributions using mixtures: Theory and applications. Ann Inst Stat Math. 2000;52(2):215–30. http://EconPapers.repec.org/RePEc:spr:aistmt:v:52:y:2000:i:2:p:215-230.10.1023/A:1004105603806Search in Google Scholar

37. Bodenham D, Adams N. A comparison of efficient approximations for a weighted sum of chi-squared random variables. Stat Comput. 2016;26:917–28. https://doi.org/10.1007/s11222-015-9583-4.10.1007/s11222-015-9583-4Search in Google Scholar

38. Bodenham D. momentchi2. 2015. http://cran.r-project.org/web/packages/momentchi2/.Search in Google Scholar

39. Uspensky JVJV. Introduction to mathematical probability. 1st ed. New York, London: McGraw-Hill; 1937. “Problems for solution” with answers at end of each chapter.Search in Google Scholar

40. Gretton A, Fukumizu K, Teo C, Song L, Schölkopf B, Smola A. A kernel statistical test of independence. In: Advances in neural information processing systems 20. Red Hook, NY, USA: Curran; 2008. p. 585–92. Max-Planck-Gesellschaft.Search in Google Scholar

41. Ramsey JD. A scalable conditional independence test for nonlinear, non-gaussian data. CoRR. 2014;abs/1401.5031. http://arxiv.org/abs/1401.5031.Search in Google Scholar

42. Zhang J. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artif Intell. 2008;172(16–17):1873–96. https://doi.org/10.1016/j.artint.2008.08.001. ISSN 0004-3702.10.1016/j.artint.2008.08.001Search in Google Scholar

43. Colombo D, Maathius M, Kalisch M, Richardson T. Learning high-dimensional directed acyclic graphs with latent and selection variables. Ann Stat. 2012;40(1):294–321. 10.1214/11-AOS940. http://projecteuclid.org/euclid.aos/1333567191.Search in Google Scholar

44. McArdle J, Rodgers W, Willis R. Cognition and aging in the usa (cogusa), 2007–2009, 2015.Search in Google Scholar