Constraint-based causal discovery (CCD) algorithms require fast and accurate conditional independence (CI) testing. The Kernel Conditional Independence Test (KCIT) is currently one of the most popular CI tests in the non-parametric setting, but many investigators cannot use KCIT with large datasets because the test scales at least quadratically with sample size. We therefore devise two relaxations called the Randomized Conditional Independence Test (RCIT) and the Randomized conditional Correlation Test (RCoT) which both approximate KCIT by utilizing random Fourier features. In practice, both of the proposed tests scale linearly with sample size and return accurate p-values much faster than KCIT in the large sample size context. CCD algorithms run with RCIT or RCoT also return graphs at least as accurate as the same algorithms run with KCIT but with large reductions in run time.
1 The problem
Constraint-based causal discovery (CCD) algorithms such as Peter-Clark (PC) and Fast Causal Inference (FCI) infer causal relations from observational data by combining the results of many conditional independence (CI) tests . In practice, a CCD algorithm can easily request p-values from thousands of CI tests even with a sparse underlying graph. Developing fast and accurate CI tests is therefore critical for maximizing the usability of CCD algorithms across a wide variety of datasets.
Investigators have developed many fast parametric methods for testing CI. For example, we can use partial correlation to test for CI under the assumption of Gaussian variables. We can also test for unconditional independence at each constant value z when Z is discrete. The chi-squared test for instance utilizes this strategy when both X and Y are also discrete. Another permutation-based test generalizes the same strategy even when X and Y are not necessarily discrete.
Testing for CI in the non-parametric setting generally demands a more sophisticated approach. One strategy involves discretizing the continuous conditioning variables Z in some optimal fashion and then assessing unconditional independence. Discretization however suffers severely from the curse of dimensionality: consistency arguments demand smaller bins with increasing sample size, but the number of cells in the associated contingency table increases exponentially with the conditioning set size. A second method involves measuring the distance between estimates of the conditional densities, or of their associated characteristic functions, which coincide under CI. However, the power of these tests also deteriorates quickly as the dimensionality of Z increases.
Several investigators have since proposed reproducing kernel-based CI tests in order to tame the curse of dimensionality. Indeed, kernel-based methods in general are known for their strong empirical performance in the high-dimensional setting. The Kernel Conditional Independence Test (KCIT), for example, assesses CI by capitalizing on a characterization of CI in reproducing kernel Hilbert spaces (RKHSs). Intuitively, KCIT works by testing for vanishing regression residuals among functions in RKHSs. Another kernel-based CI test called the Permutation Conditional Independence Test (PCIT) reduces CI testing to two-sample kernel-based testing via a carefully chosen permutation found at the solution of a convex optimization problem.
The aforementioned kernel-based CI tests unfortunately suffer from an important drawback: both tests scale at least quadratically with sample size and therefore take too long to return a p-value in the large sample size setting. In particular, KCIT's bottleneck lies in the eigendecomposition and inversion of large kernel matrices, and PCIT must solve a linear program that scales cubically with sample size in order to obtain its required permutation. As a general rule, it is difficult to develop exact kernel-based methods which scale sub-quadratically with sample size, since the computation of the kernel matrices themselves already scales at least quadratically.
Many investigators have nonetheless utilized random Fourier features in order to quickly approximate kernel methods in linear time with respect to the number of Fourier features. For example, Lopez-Paz and colleagues developed an unconditional independence test using statistics obtained from canonical correlation analysis with random Fourier features. Zhang and colleagues have also utilized random Fourier features for unconditional independence testing but took a different approach by approximating the kernel cross-covariance operator. The same authors further analyzed block and Nyström-based kernel approximations for the unconditional independence testing problem, and ultimately concluded that the random Fourier feature and Nyström-based approaches both outperform the block-based approach on average. Others have analyzed the use of random Fourier features for predictive modeling or dimensionality reduction. In practice, investigators have observed that methods which utilize random Fourier features often scale linearly with sample size and achieve accuracy comparable to exact kernel methods.
In this paper, we also use random Fourier features to design two fast tests called the Randomized Conditional Independence Test (RCIT) and the Randomized conditional Correlation Test (RCoT) which approximate the solution of KCIT. Simulations show that RCIT, RCoT and KCIT have comparable accuracy, but both RCIT and RCoT scale linearly with sample size in practice. As a result, RCIT and RCoT return p-values several orders of magnitude faster than KCIT in the large sample size context. Moreover, experiments demonstrate that the causal structures returned by CCD algorithms using either RCIT, RCoT or KCIT have nearly identical accuracy.
2 High-level summary
We now provide a high-level summary for investigators who simply wish to perform CCD with RCIT or RCoT. We want to test whether X is conditionally independent of Y given Z in a fast and accurate manner without resorting to parametric assumptions. Previously, Zhang and colleagues introduced a non-parametric CI test called KCIT which performs this test in an accurate but not fast manner by analyzing the partial cross-covariance (operator) using the kernel method. Kernels however are expensive to compute because they scale at least quadratically with sample size. Another line of work has fortunately shown that we can approximate kernels by averaging over a variety of non-linear transformations called random Fourier features. We therefore propose to approximate KCIT by analyzing the partial cross-covariance matrix of the augmented set (X, Z) and Y (RCIT), or of X and Y (RCoT), after subjecting the variable sets to the non-linear transformations and then non-linearly regressing out the effect of Z. Simulations show that RCIT and RCoT return p-values in a much shorter time frame while also matching or outperforming KCIT in approximating the null distribution. RCoT in particular also returns the most accurate p-values when the conditioning set Z is large (size ≥ 4).
3 Preliminaries on kernels
We will deal with kernels in this paper, so we briefly review the corresponding theory; see the cited references for a more extensive discussion of similar concepts. Capital letters denote sets of random variables with their respective codomains. Let a Hilbert space of functions be given. We say that the space is more specifically a reproducing kernel Hilbert space if the Dirac delta operator is a bounded linear functional. We can associate the space with a unique positive definite kernel in which the feature map satisfies the reproducing property by the Moore-Aronszajn theorem. We will require that the RKHS be separable (i.e., it must have a complete orthonormal system) throughout this paper. Hein and Bousquet showed that any continuous kernel on a separable space induces a separable RKHS. We will likewise consider other kernels on separable spaces.
We have the following norm defined on linear operators between RKHSs:
Denote by Σ a linear operator between RKHSs. Then, provided that the sum converges, the Hilbert-Schmidt (HS) norm of Σ is defined as:
We denote the probability distribution of X as and that of Y as . We may then define the mean elements and with respect to the probability distributions; here, ϕ denotes the feature map from to , and ψ likewise denotes the feature map from to . We also define the quantity by applying the expectation twice:
where is an independent copy of X which follows the same distribution. The mean elements and exist so long as their respective norms in or are bounded; this is true so long as the kernels and are also bounded so that and .
The cross-covariance operator associated with the joint probability distribution over is a linear operator defined as follows:
for all and . Gretton et al.  showed that the HS-norm of the cross covariance operator, , can be written in terms of kernels as follows:
where ; the dependent case can be found in Lemma 1 of the same paper for the interested reader. The notations and correspond to centralized kernel matrices such that:
and likewise for . Here, , I denotes an identity matrix, and 1 denotes a vector of ones. The notation denotes a kernel matrix such as the RBF kernel where with . The transformation in Equation 6 ensures that similar to the centered kernels and in Equation 4.
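The kernel computation and the centering transformation described above can be sketched in a few lines of numpy. This is a minimal illustration under our own naming, assuming a Gaussian (RBF) kernel:

```python
import numpy as np

def rbf_kernel(X, sigma):
    """Gaussian (RBF) kernel matrix: K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def center_kernel(K):
    """Centered kernel matrix H K H with H = I - (1/n) 1 1^T,
    so every row and column of the result sums to zero."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H
```

The centering by H on both sides corresponds to subtracting the feature-space mean before forming inner products, which is what makes the trace formulas later in the paper estimate covariances rather than raw cross-moments.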
4 Characterizations of conditional independence
We denote the probability distribution of X as and the joint probability distribution of as . Let denote the space of square integrable functions of X, and that of . Here, and likewise for . Next consider a dataset of n i.i.d. samples drawn according to the joint distribution.
We use the notation when X and Y are conditionally independent given Z. Perhaps the simplest characterization of CI reads as follows: if and only if . Equivalently, we have and .
4.1 Characterization by RKHSs
A second characterization of CI is given in terms of the cross-covariance operator on RKHSs . Recall the cross-covariance operator from to in Equation 3. We may then define the partial cross-covariance operator of given Z by:
where we use the right inverse instead of the inverse if is not invertible (see Corollary 3 in Fukumizu et al.). Notice the similarity of the partial cross-covariance operator to the linear partial cross-covariance matrix (as well as to the conditional cross-covariance matrix in the Gaussian case). We can interpret the above equation as the partial covariance between and given (i.e., the partial covariance of X and Y given Z after passing these three variable sets through the functions in the corresponding RKHSs).
A kernel is characteristic if , implies , where and are two probability distributions of X ; alternatively, a kernel is characteristic if equality in the mean elements under the two distributions implies equality of the distributions. Two examples of characteristic kernels include the Gaussian kernel and the Laplacian kernel. Now if we use characteristic kernels in (7), then the partial cross-covariance operator is related to the CI relation via the following conclusion:
Here, means that for all and all . Recall further that because , where and are orthonormal bases of and , respectively, by Definition 1.
4.2 Characterization by spaces
We also consider a different characterization of CI which is intuitively more appealing, because it allows us to view CI directly as the uncorrelatedness of functions in certain spaces rather than as a norm of the partial cross-covariance operator. In particular, consider the following constrained spaces:
Here, we write as shorthand for when , and likewise for . Notice that we can construct the three spaces listed above using nonlinear regression. For instance, for any in , we have:
where is the regression function of on Z.
We then have the following result:
 The following conditions are equivalent:
The second condition means that any “residual” function of given Z is uncorrelated with that of given Z. The equivalence also represents a generalization of the case when is jointly Gaussian; there, CI holds if and only if any residual function of X given Z is uncorrelated with that of Y given Z, i.e., the linear partial correlation coefficient is zero.
We also encourage the reader to observe the close relationship between Proposition 1 and claim 4 of Proposition 2. Notice that claim 4 of Proposition 2 implies that we have in Proposition 1. Moreover, Proposition 1 only considers functions in RKHSs, while claim 4 of Proposition 2 considers functions in spaces. We find Proposition 1 more useful than claim 4 of Proposition 2 because the RKHS of a characteristic kernel might be much smaller than the corresponding space.
5 Kernel conditional independence test
We now consider the following hypotheses:
We may equivalently rewrite the above null and alternative more explicitly using Proposition 1 as follows, provided that the kernels are chosen such that the premises of that proposition are satisfied (e. g., when the kernels are Gaussian or Laplacian ):
The above hypothesis implies that we can test for conditional independence by testing for uncorrelatedness between functions in reproducing kernel Hilbert spaces.
Zhang et al. [10] exploited the equivalence between 11 and 12 in the Kernel Conditional Independence Test (KCIT), which we now describe. Consider the functional spaces , and . We can compute the corresponding centered kernel matrices , and from n i.i.d. samples as in Equation 6. We can then use these matrices to perform kernel ridge regression in order to estimate the function in Equation 10 as follows: , where λ denotes the ridge regularization parameter and . Consequently, we can estimate the residual function as where:
Next consider the eigenvalue decomposition . Let . Then the empirical kernel map of is given by because . We may similarly write . We can thus write the centralized kernel matrices corresponding to and as follows:
We can use the above two centered kernel matrices to compute an empirical estimate of the HS norm of the partial cross covariance operator similar to how we computed the quantity in Equation 5:
Note that denotes an empirical estimate of . KCIT uses the statistic in order to determine whether or not to reject ; we multiply the empirical estimate of the HS norm by n in order to ensure convergence to a non-degenerate distribution.
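The computation described above can be sketched in numpy, assuming the centered kernel matrices are already given; the function name, the regularization value, and the variable names are ours:

```python
import numpy as np

def kcit_statistic(Kx, Ky, Kz, lam=1e-3):
    """Sketch of the KCIT statistic from centered kernel matrices.

    Rz = lam * (Kz + lam I)^{-1} maps function evaluations to their
    kernel ridge regression residuals on Z, so Rz @ K @ Rz plays the
    role of the kernel matrix of the residual features.
    """
    n = Kx.shape[0]
    Rz = lam * np.linalg.inv(Kz + lam * np.eye(n))
    Kx_res = Rz @ Kx @ Rz  # residual kernel of the first variable set given Z
    Ky_res = Rz @ Ky @ Rz  # residual kernel of Y given Z
    # n times the empirical squared HS norm reduces to a trace of a product
    return np.trace(Kx_res @ Ky_res) / n
```

The statistic is always nonnegative because it is a trace of a product of two positive semi-definite matrices; large values provide evidence against the null.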
6 Proposed test statistic & its asymptotic distribution
Observe that computing requires the inversion of large kernel matrices, which scales strictly greater than quadratically with sample size; the exact complexity depends on the algorithm used to compute the matrix inverse (e.g., the Coppersmith-Winograd algorithm). CCD algorithms run with KCIT therefore take too long to complete. In this report, we propose an inexpensive hypothesis test that almost always rejects or fails to reject the null whenever conditional dependence or independence holds, respectively, and even outperforms the exact test as illustrated in our experimental results in Section 7. In doing so, we use a strategy that has already been successfully adopted in the context of unconditional dependence testing.
We will in particular also take advantage of the characterization of CI presented in Proposition 1. Recall that the Frobenius norm corresponds to the Hilbert-Schmidt norm in Euclidean space. We therefore consider the following hypotheses as approximations to those in 12 and 11:
where corresponds to the ordinary partial cross covariance matrix, where and may be non-linear functions of Z. We also have with , . Similarly, with , . The terms and denote two spaces of functions, which we set to be the support of the process , , . In other words, we select m functions from and q functions from . We select these specific spaces because we can use them to approximate continuous shift-invariant kernels. A kernel k is said to be shift-invariant if and only if, for any , we have , ; examples of shift-invariant kernels include the Gaussian kernel frequently used in KCIT or the Laplacian kernel. The following result allows us to perform the approximation using the proposed spaces:
For a continuous shift-invariant kernel on its domain, we have:
The precise form of the sampling distribution depends on the type of shift-invariant kernel one would like to approximate (see Figure 1 of Rahimi and Recht for a list). Since investigators most frequently implement KCIT using the Gaussian kernel with hyperparameter σ, we choose to approximate the Gaussian kernel by setting the sampling distribution to a centered Gaussian with standard deviation 1/σ.
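The result above can be illustrated directly: sampling frequencies from a centered Gaussian yields random cosine features whose inner products approximate the Gaussian kernel. A sketch under our own naming, following the Rahimi-Recht construction:

```python
import numpy as np

def fourier_features(X, num_features, sigma, rng):
    """Random Fourier features approximating the Gaussian kernel
    exp(-||x - y||^2 / (2 sigma^2)): frequencies W ~ N(0, 1/sigma^2),
    phases b ~ Uniform[0, 2 pi], feature sqrt(2/D) * cos(W x + b)."""
    W = rng.normal(scale=1.0 / sigma, size=(num_features, X.shape[1]))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W.T + b)
```

With D features, each kernel entry is approximated by an average of D bounded terms, so the entrywise error shrinks at the Monte Carlo rate of 1/sqrt(D).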
We will use the squared Frobenius norm of the empirical partial cross-covariance matrix as the statistic for RCIT:
where . Recall however that and may be non-linear functions of Z and therefore difficult to estimate. We would instead like to approximate and with linear functions. We therefore let with , where we also set to be the support of the process , , . We will approximate with similar to 7, where γ denotes a small ridge parameter; recall that this is equivalent to computing the cross-covariance matrix across the residuals of and B given C using linear ridge regression. We can justify this procedure because the particular choice of C allows us to approximate both and with linear functions of C as described below.
Let . Then , so . Moreover, . Note that we can estimate with the linear ridge regression solution under mild conditions because we can guarantee the following:
(Section 3.1 of Sutherland and Schneider) Consider performing kernel ridge regression on Z. Assume that (1) and (2) the empirical kernel matrix of Z only has finite entries (i.e.,). Further assume that the range of Z is compact. We then have:
The exponential rate with respect to d in the above proposition suggests we can approximate the output of kernel ridge regression with a small number of random Fourier features, a hypothesis which we verify empirically in Section 7. Moreover, we can estimate with , because we can similarly guarantee that for any fixed at an exponential rate with respect to d.
We can therefore consider the following spaces for which are similar to the spaces used in claim 4 of Proposition 2:
We then approximate CI with in the following sense:
We always have , .
The reverse direction may hold for a larger subset of all possible joint distributions as the values of m and q increase; this is because at least one entry of will likely be greater than zero for any given distribution as the values of increase.
6.1 Null distribution
We now consider the asymptotic distribution of under the null. Let Π refer to a positive definite covariance matrix of the vectorization of . We may denote an arbitrary entry in Π as follows:
We have the following result:
Consider n i.i.d. samples from the joint distribution. Let the z's denote i.i.d. standard Gaussian variables (thus their squares denote i.i.d. chi-squared variables) and λ the eigenvalues of Π. We then have the following asymptotic distribution under the null in 16:
We may first write:
where stands for the vectorization of . By the CLT for the sample covariance matrix (see Lemma 1 in Appendix A) combined with the continuous mapping theorem and the null, we know that . Here, we write an arbitrary entry under the null as follows:
Now consider the eigendecomposition of Π written as . Then, we have by the continuous mapping theorem. Note that:
We conclude that the null distribution of the test statistic is a positively weighted sum of i.i.d. chi-squared random variables.
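The limiting law in Theorem 1 is easy to simulate once the eigenvalues of Π are in hand. The following sketch (our naming) draws from the weighted chi-squared sum; its mean and variance equal the first two cumulants, the sum of the weights and twice the sum of their squares:

```python
import numpy as np

def sample_weighted_chi2(eigenvalues, num_draws, rng):
    """Draws from sum_i lambda_i * z_i^2 with z_i i.i.d. standard normal,
    i.e., a positively weighted sum of chi-squared(1) variables."""
    z = rng.normal(size=(num_draws, len(eigenvalues)))
    return (z**2) @ np.asarray(eigenvalues, dtype=float)
```

Monte Carlo sampling like this is a simple, if slower, alternative to the moment-matching approximations discussed in the next subsection.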
Note that multiple methods exist for estimating the conditional expectations in and Π in the above theorem. In this report, we will obtain consistent estimates of the conditional expectations by using kernel ridge regressions with the RBF kernel; here, consistency holds so long as the conditional expectations are continuous because the RBF kernel is dense in the space of continuous functions mapping to . We therefore have , where , by the continuous mapping theorem and weak law of large numbers assuming continuity of the conditional expectations. We can similarly approximate any arbitrary entry in Π because we may write the following:
Kernel ridge regressions however scale (strictly greater than) quadratically with sample size due to the inversion of the kernel matrix, so they may not be practical in the large sample size regime. Fortunately, we will not need to perform the kernel ridge regressions directly, because we can approximate the output of kernel ridge regression to within an arbitrary degree of accuracy for any fixed sample size n using linear ridge regression with enough random Fourier features as highlighted previously in Proposition 4. In particular, Proposition 4 implies that at rate exponential in d for any fixed n. We can also conclude that as for any fixed n. We can finally approximate the kernel ridge regression estimate of Π because we may write the following for an arbitrary entry in Π as :
We conclude that, for a dataset of fixed sample size n, we can substitute the conditional expectation estimates of kernel ridge regression with those of linear regression with random Fourier features when estimating as well as Π for applying Theorem 1.
Unfortunately though, a closed-form CDF of a positively weighted sum of chi-squared random variables, as needed to apply Theorem 1, does not exist in general. We can approximate the CDF by Imhof's method, which numerically inverts the characteristic function. We may consider Imhof's method exact, since it provides error bounds and can compute the distribution at a fixed point to within a desired precision. However, Imhof's method is too computationally intensive for our purposes. We can nonetheless utilize several fast methods which approximate the null by moment matching.
6.2 Approximating the null distribution by moment matching
We write the cumulants of a positively weighted sum of i.i.d. chi-squared random variables as follows:
where denotes the weights. We may for example derive the first three cumulants:
We then recover the moments from the cumulants as follows:
Now the Satterthwaite-Welch method represents perhaps the simplest and earliest moment matching method. The method matches the first two moments of the sum with a gamma distribution. Zhang and colleagues adopted a similar strategy in their paper introducing KCIT. Here, we have:
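Under our own naming, the Satterthwaite-Welch step reduces to matching the first two cumulants, c1 (the sum of the weights) and c2 (twice the sum of their squares), to a gamma distribution with shape c1²/c2 and scale c2/c1. A sketch of the parameter computation, omitting the CDF evaluation itself:

```python
import numpy as np

def satterthwaite_gamma_params(eigenvalues):
    """Shape and scale of the gamma distribution matching the first two
    moments of sum_i lambda_i * chi2_1."""
    w = np.asarray(eigenvalues, dtype=float)
    c1 = np.sum(w)           # first cumulant (mean)
    c2 = 2.0 * np.sum(w**2)  # second cumulant (variance)
    return c1**2 / c2, c2 / c1
```

For equal weights the approximation is exact: n unit weights give shape n/2 and scale 2, i.e., the chi-squared distribution with n degrees of freedom.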
We however find the above gamma approximation rather crude. We therefore also consider applying more modern methods to estimate the distribution of a sum of positively weighted chi-squares. Improved methods such as the Hall-Buckley-Eagleson and Wood F methods match the first three moments of the sum to other distributions in a similar fashion. The Lindsay-Pilla-Basak method, on the other hand, matches a larger number of initial moments to a mixture distribution.
We will focus on the Lindsay-Pilla-Basak method in this paper, since Bodenham and Adams have already determined through extensive experimentation that the Lindsay-Pilla-Basak method performs the best. We therefore choose it as the default method for RCIT. Briefly, the method approximates the CDF under the null using a finite mixture of L gamma CDFs:
where , , and we seek to determine the parameters g, , and . The Lindsay-Pilla-Basak method computes these parameters by a specific sequence of steps that makes use of results concerning moment matrices (see Appendix II in Uspensky). The sequence is complicated and beyond the scope of this paper, but we refer the reader to the original publication for details.
6.3 Testing for conditional un-correlatedness
Strictly speaking, we must consider the extended variable set to test for conditional independence according to Proposition 1. However, we make two observations: (1) in almost all cases encountered in practice, we can substitute tests for conditional independence with a test for non-linear conditional uncorrelatedness, because most conditionally dependent variables are correlated after some functional transformation; and (2) using the extended variable set makes estimating the null distribution more difficult compared to using the unextended variable set X. The first observation coincides with the observations of others who have noticed that Fisher's z-test performs well (but not perfectly) in ruling out conditional independencies with non-Gaussian data. We can also justify the first observation with the following result using the cross-covariance operator:
In other words, we have:
Notice that conditional uncorrelatedness is almost equivalent to CI, in the sense that it just misses those rather contrived distributions where the residual functions remain uncorrelated even though conditional dependence holds. In other words, if the residual functions are correlated, then we have conditional dependence (under the corresponding additional assumptions of Propositions 1 and 5).
Let us now consider an example of a situation where when . Take three binary variables . Let and . Also consider the four probability tables in Table 1. Here, we have chosen the probabilities in the tables carefully by satisfying the following equation:
Of course, the equality holds when we have conditional independence. We are however interested in the case when conditional dependence holds. We therefore instantiated the values of Tables 1a and 1b as well as the second column in Table 1c such that conditional dependence holds. We then solved for the remaining entry using Equation 35 in order to complete Table 1c. This ultimately yielded Table 1d.
Notice that we obtain a unique value by solving Equation 35. Hence, the set of such values has Lebesgue measure zero on the interval [0,1] once we have defined all of the other variables in the equation. Thus, conditional uncorrelatedness is not always equivalent to CI, but violating the equivalence requires a very particular setup which is probably rarely encountered in practice.
The aforementioned argument motivates us to also consider a different empirical estimate of the squared Hilbert-Schmidt norm of the partial cross covariance operator:
where we have replaced with X. We can approximate the null distribution of by utilizing the strategies presented in Sections 3.3 and 3.4 of Zhang et al. Here, we utilize the hypotheses:
We similarly consider a corresponding finite dimensional partial cross-covariance matrix:
where we have replaced with A. The above statistic is a generalization of linear partial correlation, because we consider uncorrelatedness of the residuals of non-linear functional transformations after performing non-linear regression. The asymptotic distribution for in Theorem 1 also holds for , when we replace with A. Here, we use the hypotheses:
In practice, the test which uses , which we now call the Randomized conditional Correlation Test (RCoT), usually rivals or outperforms RCIT and KCIT, because (1) nearly all conditionally dependent variables encountered in practice are also conditionally correlated after at least one functional transformation, and (2) we can easily calibrate the null distribution of the test using even when Z has large cardinality. We will therefore find this test useful for replacing RCIT when we have large conditioning set sizes (≥4).
6.4 Time complexity
We now show that RCIT and RCoT have linear time complexity with respect to the sample size n. Let d denote the number of random Fourier features in C.
If we have, then RCIT and RCoT have time complexity.
Without loss of generality, we will prove the claim for RCIT (the proof for RCoT follows analogously). Note that it suffices to show that all sub-procedures of RCIT have time complexity or .
The first step of RCIT computes the random Fourier features on the support of the process , , . Let denote the number of dimensions in X, and likewise for and . Let m, q and d denote the number of random Fourier features in and C, respectively. Note that computing the samples of requires time because W is a matrix. As a result, computing the samples of requires time with m, and fixed. Similarly, computing the samples of B requires time with fixed and that of C requires time with only fixed.
The second step of RCIT estimates and using linear ridge regressions. Recall that linear ridge regression scales in time, where d corresponds to the number of features (assuming ). Now RCIT requires linear ridge regressions in order to compute and . We therefore conclude that estimating and can be done in time with m and q fixed.
Next, computing the covariance requires time. Finally, the time complexity of all of the methods used to approximate the null distribution do not depend on d or n once given ; we thus conclude that all of the null distribution approximation methods have time complexity when m and q are fixed.
We have shown that all sub-procedures of RCIT scale or . We therefore conclude that RCIT has time complexity . □
The above proposition implies that RCIT and RCoT have time complexity because d is set to a fixed number regardless of sample size. Recall that d is fixed because we have an exponential convergence rate with respect to d as highlighted previously in Proposition 4; in other words, a fixed d is reasonable so long as n does not become extremely large. In practice, we have found that setting , and works well for a variety of sample sizes and dimensions of Z as highlighted in the next section. The statement holds so long as we choose a very small regularization constant λ; note that this is different from the standard prediction regime, where we must carefully tune λ to prevent overfitting. We can utilize a small regularization constant in the proposed CI test setting because the entries of are never seen during training time.
7 Experiments
We carried out experiments to compare the empirical performance of the following tests:
RCIT: uses with the Lindsay-Pilla-Basak approximation,
RCoT: uses with the Lindsay-Pilla-Basak approximation,
KCIT: uses with a simulated null by bootstrap.
KCoT: uses with a simulated null by bootstrap.
We used the same hyperparameters for RCIT and RCoT. Namely, we used the median Euclidean distance heuristic across the first 500 samples of , X, Y and Z for choosing the hyperparameters of the respective Gaussian kernels. We also fixed the number of Fourier features for , X and Y to 5 and the number of Fourier features for Z to 25. We standardized all original and Fourier variables to zero mean and unit variance in order to help ensure numerically stable computations. Finally, we set γ to a very small value in order to keep bias minimal. These parameters are reasonable because we designed both RCIT and RCoT for the purposes of causal discovery, where the variable set Z is small with a sparse underlying causal graph. We can therefore utilize a relatively small number of random Fourier features compared to the sample size n. Authors who wish to apply RCIT or RCoT in the high-dimensional scenario should consider utilizing more Fourier features for Z and choosing the λ values carefully (e.g., through cross-validation or information criteria).
With KCIT, we set σ to the squared median Euclidean distance computed on the first 500 samples, multiplied by double the conditioning set size; the hyperparameters described in the original paper, the hyperparameters in the author-provided MATLAB implementation, and the hyperparameters of RCIT/RCoT all gave worse performance.
7.2 Type I error
We analyzed the Type I error rates of the three CI tests as a function of sample size and conditioning set size. We evaluated the algorithms using the Kolmogorov-Smirnov (KS) test statistic. Recall that the KS test uses the following statistic:
where denotes the empirical CDF, and some comparison CDF. If the sample comes from , then converges to 0 almost surely as by the Glivenko-Cantelli theorem.
Now a good CI test controls the Type I error rate at any α value, when we have a uniform sampling distribution of the p-values over [0,1]. Therefore, a good CI test should have a small KS statistic value, when we set to the uniform distribution over [0,1].
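The KS statistic against the Uniform[0,1] reference can be computed directly from the sorted p-values. A sketch under our own naming:

```python
import numpy as np

def ks_uniform(pvalues):
    """sup_x |F_n(x) - x|, where F_n is the empirical CDF of the p-values
    and the reference CDF is that of Uniform[0, 1]."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = len(p)
    upper = np.arange(1, n + 1) / n - p  # F_n jumps above the diagonal
    lower = p - np.arange(0, n) / n      # diagonal above F_n just before jumps
    return max(upper.max(), lower.max())
```

Well-calibrated p-values yield a statistic near zero, while a test whose p-values pile up near a single value (e.g., all zero) yields a statistic near one.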
To compute the KS statistic values, we generated data from 1000 post non-linear models. We can describe each post non-linear model as follows: , , where the noise variables have jointly independent standard Gaussian distributions and the remaining symbols denote smooth functions. We always chose the smooth functions uniformly from the following set: . Thus, X and Y are conditionally independent given Z in every case. Notice also that this situation is more general than the additive noise models proposed in Ramsey, where we have , . The post non-linear models allow us to simulate heteroskedastic noise, which is commonly encountered in real scenarios but not captured with additive noise models.
7.2.1 Sample size
We first assess the Type I error rate as a function of sample size. We used sample sizes of 500, 1000, 2000, 5000, 10,000, 100,000 and 1,000,000. A good CI test should control the Type I error rate across all α values at any sample size. Figure 1a summarizes the KS statistic values for the three different CI tests. Observe that all tests have similar KS statistic values across different sample sizes. We conclude that all three tests perform comparably in controlling the Type I error rate with a single conditioning variable at different sample sizes.
The run time results however tell a markedly different story. Both RCIT and RCoT output a p-value much more quickly than KCIT at different sample sizes (Figure 1b). Moreover, KCIT ran out of memory at 5000 samples while RCIT and RCoT handled one million samples in a little over 6 seconds. RCIT and RCoT also completed more than two orders of magnitude faster than KCIT on average at a sample size of 2000 (Figure 1c). We conclude that RCIT and RCoT are more scalable than KCIT. Moreover, the experimental results agree with standard matrix complexity theory; RCIT and RCoT scale linearly with sample size (see Proposition 6), while KCIT scales strictly greater than quadratically with sample size.
7.2.2 Conditioning set size
CCD algorithms request p-values from CI tests using large conditioning set sizes. In fact, algorithms which do not assume causal sufficiency, such as FCI, often demand very large conditioning set sizes (>5). We should however also realize that CCD algorithms search for minimal conditioning sets in order to establish ancestral relations. This means that we must focus on testing cases where X and Y are conditionally independent given the full set Z, but conditionally dependent given any proper subset of Z (or unconditionally dependent).
We therefore evaluated the Type I error rates of the CI tests as a function of conditioning set size by fixing the sample size at 1000 and then adding 1 to 10 standard Gaussian variables into the conditioning set in 1000 models. Note that this situation corresponds to 1 to 10 common causes.
Figure 1d summarizes the KS statistic values in the aforementioned scenario. We see that the KS statistic values for RCoT remain the smallest for nearly all conditioning set sizes, followed by RCIT and then KCIT. This implies that RCoT best approximates the null distribution out of the three CI tests. We also provide the histograms of the p-values across the 1000 post non-linear models at a conditioning set size of 10 for KCIT, RCIT, and RCoT in Figures 1e–1g. Notice that the histograms become progressively more similar to a uniform distribution. We conclude that RCoT controls its Type I error rate the best even with large conditioning set sizes while KCIT controls its rate the worst.
The run times of all three tests increased only slightly with the conditioning set size (Figure 1h). However, both RCIT and RCoT still completed 40.91 times faster than KCIT on average (95% confidence interval: ±0.44). These results agree with standard matrix complexity theory, as we expect all tests to scale linearly with dimensionality.
7.3 Power
We next evaluated test power (i.e., 1 − the Type II error rate) by computing the area under the power curve (AUPC). The AUPC corresponds to the area under the empirical CDF of the p-values returned by a CI test when the null does not hold. A CI test has higher power when its AUPC is closer to one. For example, in the perfect case where a CI test always returns a p-value of 0, its AUPC equals 1.
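The AUPC can be computed exactly by integrating the empirical CDF step function over [0,1]; a short sketch, with the function name being illustrative:

```python
import numpy as np

def aupc(pvals):
    """Area under the empirical CDF of p-values over [0, 1].
    Equals 1 when every p-value is 0; roughly 0.5 for uniform p-values."""
    p = np.sort(np.asarray(pvals, dtype=float))
    n = len(p)
    # The ECDF is a step function jumping to i/n at the i-th sorted p-value;
    # integrate it exactly over [0, 1].
    heights = np.arange(1, n + 1) / n
    widths = np.diff(np.append(p, 1.0))
    return float(np.sum(heights * widths))
```

Note that uniform p-values (the null case) give an AUPC near 0.5, so values approaching 1 indicate genuine power rather than miscalibration.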
We examined the AUPC by adding the same small error term to the generating equations of both X and Y in 1000 post non-linear models. Here, we do not allow the CI tests to condition on this shared error term, so X and Y always remain conditionally dependent; this situation therefore corresponds to a hidden common cause.
7.3.1 Sample size
We first examine power as a function of sample size. We again tested sample sizes of 500, 1000, 2000, 5000, 10,000, 100,000 and 1,000,000. We have summarized the results in Figure 2a. Both RCIT and RCoT have comparable AUPC values to KCIT with sample sizes of 500, 1000 and 2000. At larger sample sizes, KCIT again did not scale due to insufficient memory, but the AUPC of both RCIT and RCoT continued to increase at similar values. We conclude that all three tests have similar power.
7.3.2 Conditioning set size
We next examined power as a function of conditioning set size. To do this, we fixed the sample size at 1000 and again added 1 to 10 standard Gaussian variables into the conditioning set in the 1000 post non-linear models. We therefore examined how well the CI tests reject the null under an increasing conditioning set size with uninformative variables. A good CI test should either (1) maintain its power or, more realistically, (2) suffer a graceful decline in power with an increasing conditioning set size, because none of the variables in the conditioning set are informative for rendering conditional independence by design.
We have summarized the results in Figure 2d. Notice that all tests have comparable AUPC values with small conditioning set sizes (between 1 and 3), but the AUPC value of KCIT gradually increases with increasing conditioning set sizes; the AUPC value should not increase under the current setup with a well-calibrated null because the extra variables are uninformative. To determine the cause of the unexpected increase in power, we permuted the values of X in each run in order to assess the calibration of the null distribution. Figure 2f summarizes the results, where we can see that only KCIT’s KS statistic grows with an increasing conditioning set size. We can therefore claim that the increasing AUPC value of KCIT holds because of a badly calibrated null distribution with larger conditioning set sizes. We conclude that both RCIT and RCoT maintain steady power under an increasing conditioning set size while KCIT does not.
7.4 Causal structure discovery
We used the procedure in Colombo et al. to generate 250 different Gaussian DAGs with a fixed expected neighborhood size and number of vertices. First, we generated a random adjacency matrix A with independent Bernoulli realizations in the lower triangle of the matrix and zeroes in the remaining entries. Next, we replaced the ones in A by independent realizations of a uniform random variable. We interpret a nonzero entry of A as an edge between the corresponding variables with that entry as its coefficient in the following linear model:
X_i = ∑_{r=1}^{i−1} A_{ir} X_r + ε_i for i = 1, …, p, where ε_1, …, ε_p are mutually independent standard Gaussian random variables. The variables X_1, …, X_p then have a multivariate Gaussian distribution with mean 0 and covariance matrix (I − A)^{−1}(I − A)^{−T}, where I is the identity matrix. To introduce non-linearities, we passed each variable in X through a non-linear function g again chosen uniformly from the same set of smooth functions.
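A hedged sketch of the simulation just described; the edge probability and coefficient range below are illustrative placeholders, not the paper's exact values.

```python
import numpy as np

def sample_dag_data(p, n, rng, edge_prob=0.2, coef_low=0.1, coef_high=1.0):
    """Simulate X = A X + eps from a random strictly lower-triangular DAG.
    edge_prob and the coefficient range are illustrative choices here."""
    # Bernoulli edges in the strict lower triangle, then uniform coefficients.
    a = np.tril(rng.random((p, p)) < edge_prob, k=-1).astype(float)
    a *= rng.uniform(coef_low, coef_high, size=(p, p))
    eps = rng.standard_normal((n, p))
    # Solve (I - A) X^T = eps^T; exact since A is strictly lower triangular.
    x = np.linalg.solve(np.eye(p) - a, eps.T).T
    return a, x

a, x = sample_dag_data(p=5, n=10000, rng=np.random.default_rng(0))
```

The resulting samples have covariance (I − A)^{−1}(I − A)^{−T}, matching the linear model above; a non-linearity g would then be applied columnwise.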
For FCI and RFCI, we introduced latent and selection variables using the following procedure. For each DAG, we first randomly selected a set of 0–3 latent common causes L. From the remaining variables, we then selected a set of 0–3 colliders as selection variables S. For each selection variable in S, we subsequently eliminated the bottom q percentile of samples, where we drew q independently at random for each selection variable. We finally eliminated all of the latent variables from the dataset.
We ultimately created 250 different 500-sample datasets for PC, FCI and RFCI. We then ran the sample versions of PC, FCI and RFCI using RCIT, RCoT, KCIT and Fisher's z-test (FZT) at a fixed α level. We also obtained the oracle graphs by running the oracle versions of PC, FCI and RFCI using the ground truth.
We have summarized the results as structural Hamming distances (SHDs) from the oracle graphs in Figure 3a. PC run with RCIT and PC run with RCoT both outperformed PC run with KCIT by a large margin according to paired t-tests (PC RCIT vs. KCIT, , ; PC RCoT vs. KCIT, , ). We found similar results with FCI and RFCI, although by only a small margin; 3 of the 4 comparisons fell below the Bonferroni-corrected threshold of 0.05/6 and the other comparison fell below the uncorrected threshold of 0.05 (FCI RCIT vs. KCIT, , ; FCI RCoT vs. KCIT , ; RFCI RCIT vs. KCIT, , ; RFCI RCoT vs. KCIT, , ). All algorithms with any of the kernel-based tests outperformed the same algorithms with FZT by a large margin ( in all cases). Finally, the run time results in Figure 3b show that the CCD algorithms run with RCIT and RCoT complete at least 13 times faster on average than those run with KCIT. We conclude that both RCIT and RCoT help CCD algorithms at least match the performance of the same algorithms run with KCIT, but RCIT and RCoT do so within a much shorter time frame than KCIT.
7.5 Real data
We finally ran PC, FCI and RFCI using RCIT, RCoT, KCIT and FZT on a publicly available longitudinal dataset from the Cognition and Aging USA (CogUSA) study, in which scientists measured the cognition of men and women above 50 years of age. After some data cleaning, the dataset contains 815 samples, 18 variables and two waves (thus 18/2 = 9 variables in each wave) separated by two years. Note that we do not have access to a gold standard solution set in this case. However, we can utilize the time information in the dataset to detect false positive ancestral relations directed backwards in time.
We ran the CCD algorithms on 30 bootstrapped datasets. Results are summarized in Figure 4. Comparisons with PC did not reach the Bonferroni level among the kernel-based tests, although PC run with either RCIT or RCoT yielded fewer false positive ancestral relations on average than PC run with KCIT near an uncorrected level of 0.05 (PC RCIT vs. KCIT, , ; PC RCoT vs. KCIT, , ). However, FCI and RFCI run with either RCIT or RCoT performed better than those run with KCIT at a Bonferroni-corrected level of 0.05/6 (FCI RCIT vs. KCIT, , ; FCI RCoT vs. KCIT, , ; RFCI RCIT vs. KCIT, , ; RFCI RCoT vs. KCIT, , ). The CCD algorithms run with FZT also gave inconsistent results; PC run with FZT performed the best on average, but FCI and RFCI run with FZT also performed second from the worst. Here, we should trust the outputs of FCI and RFCI more strongly than those of PC, since both FCI and RFCI allow latent common causes and selection bias which often exist in real data. Next, CCD algorithms run with RCIT performed comparably to those run with RCoT (PC RCIT vs. RCoT, , ; FCI RCIT vs. RCoT, , ; RFCI RCIT vs. RCoT, , ). We finally report that the CCD algorithms run with RCIT and RCoT complete at least 40 times faster on average than those run with KCIT (Figure 4b). We conclude that CCD algorithms run with either RCIT or RCoT perform at least as well as those run with KCIT on this real dataset but with large reductions in run time.
8 Conclusion
We developed two statistical tests called RCIT and RCoT for fast non-parametric CI testing. Both RCIT and RCoT approximate KCIT by sampling Fourier features. Moreover, the proposed tests return p-values orders of magnitude faster than KCIT in the large sample size setting. RCoT in particular also has a better calibrated null distribution than KCIT, especially with larger conditioning set sizes. In causal graph discovery, RCIT and RCoT help CCD algorithms recover graphical structures at least as accurately as KCIT but, most importantly, also allow the algorithms to complete in a much shorter time frame. We believe that the speedups provided by RCIT and RCoT will make non-parametric causal discovery more accessible to scientists who wish to apply CCD algorithms to their datasets.
Note that RCIT and RCoT may estimate the null distribution more accurately than KCIT for multiple reasons. First, RCIT and RCoT both assess for non-vanishing covariances across a smaller set of functions than KCIT. This in turn allows the two tests to more easily estimate the null distribution than KCIT because KCIT must deal with all of the functions in the associated RKHS. Second, both RCIT and RCoT utilize more advanced methods of estimating the null distribution than KCIT. RCIT and RCoT more specifically utilize the Lindsay-Pilla-Basak method as explained in Section 6.2, as opposed to matching the first two moments of a gamma distribution. Third, RCoT in particular utilizes the variable set X rather than the augmented variable set, which allows for lower dimensional inferences as explained in Section 6.3. RCIT and RCoT thus take advantage of new technologies and additional structure inherent within real data in order to achieve better control of the Type I error rate.
Funding source: National Human Genome Research Institute
Award Identifier / Grant number: U54HG008540
Funding source: U.S. National Library of Medicine
Award Identifier / Grant number: T15LM007059
Award Identifier / Grant number: R01LM012095
Funding statement: Research reported in this publication was supported by grant U54HG008540 awarded by the National Human Genome Research Institute through funds provided by the trans-NIH Big Data to Knowledge initiative. The research was also supported by the National Library of Medicine of the National Institutes of Health under award numbers T15LM007059 and R01LM012095.
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
A.1 CLT for sample covariance
We will prove the central limit theorem (CLT) for the sample covariance matrix. We first have the following sample covariance matrices with known and unknown expectation vector, respectively:

S_n = (1/n) ∑_{i=1}^{n} (X_i − μ)(X_i − μ)^T,    Ŝ_n = (1/n) ∑_{i=1}^{n} (X_i − X̄_n)(X_i − X̄_n)^T,

where X̄_n = (1/n) ∑_{i=1}^{n} X_i.
Now observe that we may write:
It follows that:
We are now ready to state the result:
Let X_1, X_2, … refer to a sequence of i.i.d. random k-vectors. Denote the expectation vector and covariance matrix of X_i as μ and Σ, respectively. Assume that V = Cov(vech((X_1 − μ)(X_1 − μ)^T)) is positive definite, where vech(M) denotes the vectorization of the upper triangular portion of a real symmetric matrix M. Then, we have:

√n [vech(S_n) − vech(Σ)] →_d N(0, V).
Consider the quantity where . Note that , …, is a sequence of i. i. d. random variables with expectation and variance . Moreover observe that because is positive definite. We can therefore apply the univariate central limit theorem to conclude that:
where . We would however like to claim that:
In order to prove this, we use 44 and set:
where we have:
We already know from 46 that:
Therefore, so does by Slutsky’s lemma, when we view the sequence of constants as a sequence of random variables. For , we know that:
by viewing as a sequence of random variables, noting that because is positive definite and then applying the univariate central limit theorem. We thus have . We may then invoke Slutsky’s lemma again for and claim that:
We conclude the lemma by invoking the Cramér-Wold device. □
A.2 Kernel conditional correlation test (KCoT) results
We compared KCoT against RCIT, RCoT and KCIT. We report the results in Figure 5. All tests perform comparably as a function of sample size (Figures 5a and 5c). However, KCoT performs better than KCIT and underperforms RCoT and RCIT as a function of conditioning set size. In particular, RCIT and RCoT obtain smaller KS-statistic values as a function of the conditioning set size (Figure 5b). KCIT and KCoT also obtain larger AUPC values with an increasing conditioning set size because they fail to maintain a uniform distribution under the null (Figure 5d; we again permuted the values of X for Figure 5e). We therefore conclude that RCoT and RCIT control their Type I error rates better than KCIT and KCoT even with large conditioning set sizes while maintaining power.
A.3 Comparisons against permutation
We also compared RCIT and RCoT against permutation CI tests. Here, we estimated the null distributions of the two test statistics with permutations and called the resultant CI tests S-Perm and -Perm, respectively. The permutation tests specifically involve permuting the residuals of the random Fourier features of Y one thousand times in order to estimate the null distribution. We have summarized the Type I error rate results in Figure 6 and the AUPC results in Figure 7. We ran the sample size experiments up to only ten thousand samples due to the long run times of S-Perm and -Perm. We found that RCIT and RCoT perform similarly to the permutation tests but with significantly reduced average run times.
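A sketch of this style of permutation test, using a simple covariance-based statistic as a stand-in for the actual test statistics; all names and parameters here are illustrative.

```python
import numpy as np

def perm_pvalue(stat_fn, fx, fy, n_perm, rng):
    """Permutation p-value: shuffle the rows of fy to simulate the null of
    vanishing cross-covariance while preserving the marginals."""
    observed = stat_fn(fx, fy)
    null = [stat_fn(fx, fy[rng.permutation(len(fy))]) for _ in range(n_perm)]
    return (1 + sum(s >= observed for s in null)) / (1 + n_perm)

def cov_stat(fx, fy):
    """n times the squared Frobenius norm of the sample cross-covariance."""
    n = len(fx)
    c = (fx - fx.mean(0)).T @ (fy - fy.mean(0)) / n
    return n * float((c ** 2).sum())

rng = np.random.default_rng(0)
fx = rng.standard_normal((300, 5))
fy_dep = fx + 0.1 * rng.standard_normal((300, 5))  # strongly dependent residuals
p_dep = perm_pvalue(cov_stat, fx, fy_dep, n_perm=200, rng=rng)
```

With strongly dependent residual features, the observed statistic dwarfs every permuted statistic and the p-value hits its lower bound of 1/(n_perm + 1), which is why permutation tests with few permutations cannot produce very small p-values.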
2. Fisher RA. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika. 1915;10(4):507–21. https://doi.org/10.2307/2331838.
3. Fisher RA. On the probable error of a coefficient of correlation deduced from a small sample. Metron. 1921;1:3–32.
4. Pearson K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos Mag Ser 5. 1900;50:157–75.
5. Tsamardinos I, Borboudakis G. Permutation testing improves Bayesian network learning. Berlin, Heidelberg: Springer; 2010. p. 322–37. https://doi.org/10.1007/978-3-642-15939-8_21.
6. Margaritis D. Distribution-free learning of Bayesian network structure in continuous domains. In: Proceedings of the twentieth national conference on artificial intelligence (AAAI-05), July 9–13, 2005, Pittsburgh, PA, USA; 2005. p. 825–30.
8. Su L, White H. A consistent characteristic function-based test for conditional independence. J Econom. 2007;141(2):807–34. https://doi.org/10.1016/j.jeconom.2006.11.006.
9. Su L, White H. A nonparametric Hellinger metric test for conditional independence. Econom Theory. 2008;24(4):829–64. https://doi.org/10.1017/S0266466608080341.
10. Zhang K, Peters J, Janzing D, Schölkopf B. Kernel-based conditional independence test and application in causal discovery. In: Uncertainty in artificial intelligence. AUAI Press; 2011. p. 804–13.
11. Doran G, Muandet K, Zhang K, Schölkopf B. A permutation-based kernel conditional independence test. In: Proceedings of the 30th conference on uncertainty in artificial intelligence (UAI 2014). Corvallis, OR: AUAI Press; 2014. p. 132–41.
12. Lopez-Paz D, Hennig P, Schölkopf B. The randomized dependence coefficient. In: Advances in neural information processing systems 26; 2013. p. 1–9.
13. Zhang Q, Filippi S, Gretton A, Sejdinovic D. Large-scale kernel methods for independence testing. Stat Comput. 2017:1–18. https://doi.org/10.1007/s11222-016-9721-7.
14. Rahimi A, Recht B. Random features for large-scale kernel machines. In: Neural information processing systems; 2007.
15. Sutherland DJ, Schneider JG. On the error of random Fourier features. In: Meila M, Heskes T, editors. UAI. AUAI Press; 2015. p. 862–71.
16. Lopez-Paz D, Sra S, Smola A, Ghahramani Z, Schölkopf B. Randomized nonlinear component analysis. In: Proceedings of the 31st international conference on machine learning, W&CP 32(1). JMLR; 2014. p. 1359–67.
18. Gretton A, Bousquet O, Smola A, Schölkopf B. Measuring statistical dependence with Hilbert-Schmidt norms. In: Proceedings in algorithmic learning theory. Springer-Verlag; 2005. p. 63–77.
19. Berlinet A, Thomas-Agnan C. Reproducing kernel Hilbert spaces in probability and statistics. US: Springer; 2003.
20. Hein M, Bousquet O. Kernels, associated structures and generalizations. Technical Report 127. Tübingen, Germany: Max Planck Institute for Biological Cybernetics; 2004.
21. Murphy G. C*-algebras and operator theory. Academic Press; 1990.
22. Fukumizu K, Bach FR, Jordan MI. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J Mach Learn Res. 2004;5:73–99.
23. Fukumizu K, Gretton A, Sun X, Schölkopf B. Kernel measures of conditional dependence. In: Advances in neural information processing systems. Red Hook, NY: Curran; 2008. p. 489–96.
24. Daudin JJ. Partial association measures and an application to qualitative regression. Biometrika. 1980;67(3):581–90. https://doi.org/10.1093/biomet/67.3.581.
25. Coppersmith D, Winograd S. Matrix multiplication via arithmetic progressions. J Symb Comput. 1990;9(3):251–80. https://doi.org/10.1016/S0747-7171(08)80013-2.
26. Micchelli CA, Xu Y, Zhang H. Universal kernels. J Mach Learn Res. 2006;7:2651–67.
28. Solomon H, Stephens MA. Distribution of a sum of weighted chi-square variables. J Am Stat Assoc. 1977;72:881–5.
29. Johnson NL, Kotz S, Balakrishnan N. Continuous multivariate distributions. 3rd ed. Wiley; 2002.
30. Welch BL. The significance of the difference between two means when the population variances are unequal. Biometrika. 1938;29(3–4):350–62. https://doi.org/10.1093/biomet/29.3-4.350.
32. Fairfield-Smith H. The problem of comparing the results of two experiments with unequal errors. J Counc Sci Ind Res. 1936;9:211–2.
33. Hall P. Chi squared approximations to the distribution of a sum of independent random variables. Ann Probab. 1983;11(4):1028–36. https://doi.org/10.1214/aop/1176993451.
34. Buckley MJ, Eagleson GK. An approximation to the distribution of quadratic forms in normal random variables. Aust N Z J Stat. 1988;30(1):150–9.
36. Lindsay B, Pilla R, Basak P. Moment-based approximations of distributions using mixtures: theory and applications. Ann Inst Stat Math. 2000;52(2):215–30. https://doi.org/10.1023/A:1004105603806.
37. Bodenham D, Adams N. A comparison of efficient approximations for a weighted sum of chi-squared random variables. Stat Comput. 2016;26:917–28. https://doi.org/10.1007/s11222-015-9583-4.
38. Bodenham D. momentchi2. R package; 2015. http://cran.r-project.org/web/packages/momentchi2/.
39. Uspensky JV. Introduction to mathematical probability. 1st ed. New York, London: McGraw-Hill; 1937.
40. Gretton A, Fukumizu K, Teo C, Song L, Schölkopf B, Smola A. A kernel statistical test of independence. In: Advances in neural information processing systems 20. Red Hook, NY: Curran; 2008. p. 585–92.
41. Ramsey JD. A scalable conditional independence test for nonlinear, non-Gaussian data. CoRR. 2014;abs/1401.5031. http://arxiv.org/abs/1401.5031.
42. Zhang J. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artif Intell. 2008;172(16–17):1873–96. https://doi.org/10.1016/j.artint.2008.08.001.
43. Colombo D, Maathuis M, Kalisch M, Richardson T. Learning high-dimensional directed acyclic graphs with latent and selection variables. Ann Stat. 2012;40(1):294–321. https://doi.org/10.1214/11-AOS940.
44. McArdle J, Rodgers W, Willis R. Cognition and aging in the USA (CogUSA), 2007–2009; 2015.
© 2019 Walter de Gruyter GmbH, Berlin/Boston
This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.