BY-NC-ND 3.0 license · Open Access · Published by De Gruyter, October 28, 2017

Detecting Confounding in Multivariate Linear Models via Spectral Analysis

  • Dominik Janzing and Bernhard Schölkopf

Abstract

We study a model where one target variable $Y$ is correlated with a vector $\mathbf{X} := (X_1, \dots, X_d)$ of predictor variables being potential causes of $Y$. We describe a method that infers to what extent the statistical dependences between $\mathbf{X}$ and $Y$ are due to the influence of $\mathbf{X}$ on $Y$ and to what extent due to a hidden common cause (confounder) of $\mathbf{X}$ and $Y$. The method relies on concentration of measure results for large dimensions $d$ and an independence assumption stating that, in the absence of confounding, the vector of regression coefficients describing the influence of each $X_j$ on $Y$ typically has 'generic orientation' relative to the eigenspaces of the covariance matrix of $\mathbf{X}$. For the special case of a scalar confounder we show that confounding typically spoils this generic orientation in a characteristic way that can be used to quantitatively estimate the amount of confounding (subject to our idealized model assumptions).

1 Introduction and general model

Estimating the causal influence of some variables $X_1, \dots, X_d$ on a target variable $Y$ is among the most important goals in statistical data analysis. However, drawing causal conclusions from observational data alone without intervening on the system is difficult. This is because the observed statistical dependences between $Y$ and each $X_j$ need not be due to an influence of $X_j$ on $Y$. Instead, by Reichenbach's Principle of Common Cause [1], $Y$ may also be the cause of $X_j$, or there may be a common cause $Z$ influencing both. In many applications, time order or other prior information excludes that $Y$ influences $X_j$. For instance, if $Y$ describes the health condition of a patient at time $t$ and $X_j$ some treatments at an earlier time, we 'only' need to decide to what extent the dependences between $X_j$ and $Y$ are due to $X_j$ influencing $Y$ and to what extent they are due to common causes ('confounders'). Here we are not interested in the reason for dependences among the variables $X_j$ themselves; we therefore merge them into a vector-valued variable $\mathbf{X}$. Moreover, we restrict attention to the case where there is only one real-valued confounder $Z$. In the case of linear relations, the structural equations then read:

(1) $\mathbf{X} = \mathbf{b} Z + \mathbf{E}$
(2) $Y = \langle \mathbf{a}, \mathbf{X} \rangle + cZ + F,$

where $\mathbf{E}$ is a random vector with values in $\mathbb{R}^d$ and $Z, F$ are scalar random variables. Here, $\mathbf{E}, Z, F$ are jointly independent, while the components of $\mathbf{E}$ may be dependent. Further, $\mathbf{a} \in \mathbb{R}^d$ is the vector of structure coefficients determining the influence of the $d$-dimensional variable $\mathbf{X}$ on the scalar target variable $Y$. Likewise, $\mathbf{b} \in \mathbb{R}^d$ is the vector determining the influence of $Z$ on $\mathbf{X}$ and the scalar $c \in \mathbb{R}$ is the structure coefficient determining the influence of $Z$ on $Y$. By rescaling $\mathbf{b}$ and $c$, we may assume $Z$ to have unit variance without loss of generality. The corresponding DAG is shown in Figure 1.

Figure 1: Most general DAG considered in this paper. All scenarios discussed later refer to this DAG and differ only by the parameters $\mathbf{a}, \mathbf{b}, \sigma_F, \Sigma_{\mathbf{E}\mathbf{E}}, c$.

If all variables are centered Gaussian, the remaining model parameters are the vectors $\mathbf{a}, \mathbf{b}$, the covariance matrix $\Sigma_{\mathbf{E}\mathbf{E}}$, and the scalars $c, \sigma_F$, where $c$ describes the strength of the influence of $Z$ on $Y$ and $\sigma_F$ the standard deviation of $F$. Since this paper will be based on second-order statistics alone, we will treat these parameters as the only relevant ones.

The following special cases can be obtained by appropriate choices of these parameters:

Purely causal: The case $\mathbf{X} \to Y$ with no confounding can easily be obtained by setting $\mathbf{b} = 0$ or $c = 0$. In the first case, $Z$ takes the role of an additional independent noise term (apart from $F$), see Figure 2(a).

Figure 2: (a) Purely causal case: setting $\mathbf{b} = 0$ renders $Z$ independent of $\mathbf{X}$. Thus, we can consider $\tilde{F} := cZ + F$ formally as noise term and then obtain the simple DAG with nodes $\mathbf{X}, Y$ only. (b) Purely confounded case: by setting $\mathbf{a} = 0$ the correlations between $\mathbf{X}$ and $Y$ are only due to the confounder $Z$.

The structural equation then reads

(3) $Y = \langle \mathbf{a}, \mathbf{X} \rangle + cZ + F = \langle \mathbf{a}, \mathbf{X} \rangle + \tilde{F},$

with $\mathbf{E}, \tilde{F}$ being jointly independent. For fixed $c$, the limit of a deterministic influence of $\mathbf{X}$ on $Y$ can be obtained by letting at least one component of the vector $\mathbf{a}$ grow to infinity. Then $Y$ is dominated by the term $\langle \mathbf{a}, \mathbf{X} \rangle$.

Purely confounded: Setting $\mathbf{a} = 0$ turns off the influence of $\mathbf{X}$ on $Y$. Then the relation between $\mathbf{X}$ and $Y$ is generated by the confounder $Z$ only, see Figure 2(b). Depending on the remaining parameters $\Sigma_{\mathbf{E}\mathbf{E}}$, $\sigma_F$ and $\mathbf{b}$, we can also obtain a scenario where $\mathbf{X}$ provides perfect knowledge about $Z$ (when $\|\mathbf{b}\| \to \infty$ or $\Sigma_{\mathbf{E}\mathbf{E}} \to 0$) or a scenario where $Y$ provides perfect knowledge of $Z$ ($\sigma_F \to 0$).

Purely anticausal: We have actually excluded a scenario where $Y$ is the cause of $\mathbf{X}$. Nevertheless, if $\mathbf{a} = 0$ and $\sigma_F = 0$, we have $Y = cZ$ almost surely, and $Z$ is the cause of $\mathbf{X}$. Hence, the scenario becomes indistinguishable from an 'anticausal' scenario where $Y$ is the cause, as in Figure 3(b), although performing interventions on $Y$ would still tell us that it is not the cause.

We now ask how to distinguish between these cases given joint observations of $\mathbf{X}$ and $Y$ that are drawn i.i.d. from $P_{\mathbf{X},Y}$. Conditional statistical independences, which are usually employed for causal inference [2, 3], may not help to distinguish between the above cases if there are no conditional independences in $P_{\mathbf{X},Y}$. Moreover, we assume that there are no observed causes of $\mathbf{X}$ that could act as so-called instrumental variables [4], which would enable the distinction between 'causal' and 'confounded'.

Figure 3: (a) A special case of the purely confounded case is obtained by setting the noise $F$ of $Y$ to $0$ and $c = 1$. Then $Y$ is an exact copy of $Z$, i.e., the cause of $\mathbf{X}$. Since $P_{\mathbf{X},Y} = P_{\mathbf{X},Z}$, no statistical method relying purely on observational data is able to distinguish this case from the scenario in (b), although the difference certainly matters when interventions on $Y$ are made.

To see that the parameters $\mathbf{a}, \mathbf{b}, \Sigma_{\mathbf{E}\mathbf{E}}, \sigma_F$ are heavily underdetermined in linear Gaussian models, just note that any multivariate Gaussian $P_{\mathbf{X},Y}$ can be explained by $\mathbf{X}$ being an unconfounded cause of $Y$ according to the structural eq. (3) by replacing $\mathbf{a}$ with

(4) $\hat{\mathbf{a}} := \Sigma_{\mathbf{X}\mathbf{X}}^{-1} \Sigma_{\mathbf{X}Y}.$

The vector $\hat{\mathbf{a}}$ is the vector of regression coefficients obtained by regressing $Y$ on $\mathbf{X}$ without caring about the true causal structure. Here we use the symbol $\hat{\mathbf{a}}$ instead of $\mathbf{a}$ to indicate that it differs from the vector $\mathbf{a}$ that appears in the structural eq. (2), which correctly describes the causal relation between $\mathbf{X}$ and $Y$. This way, we obtain a model that correctly describes the observed correlations, but not the causal relations, since the impact of interventions is not modelled correctly. Note that identifying $\mathbf{a}$ is typically the main goal of causal data analysis since $a_j$ directly describes how changing $X_j$ changes $Y$. Confusing $\mathbf{a}$ with $\hat{\mathbf{a}}$ amounts to the common fallacy of naively attributing all dependences to the causal influence of $X_j$ on $Y$. To see the relation between $\hat{\mathbf{a}}$ and $\mathbf{a}$, we first find

$\Sigma_{\mathbf{X}Y} = \mathrm{Cov}(\mathbf{X}, Y) = \mathrm{Cov}(\mathbf{E} + \mathbf{b}Z, \, \langle \mathbf{a}, \mathbf{X} \rangle + cZ + F) = \mathrm{Cov}(\mathbf{E} + \mathbf{b}Z, \, \langle \mathbf{a}, \mathbf{E} + \mathbf{b}Z \rangle + cZ + F) = (\Sigma_{\mathbf{E}\mathbf{E}} + \mathbf{b}\mathbf{b}^T)\mathbf{a} + c\mathbf{b},$

where we have used the joint independence of $\mathbf{E}, Z, F$ and that $Z$ is normalized. Likewise,

$\Sigma_{\mathbf{X}\mathbf{X}} = \mathrm{Cov}(\mathbf{X}, \mathbf{X}) = \mathrm{Cov}(\mathbf{E} + \mathbf{b}Z, \, \mathbf{E} + \mathbf{b}Z) = \Sigma_{\mathbf{E}\mathbf{E}} + \mathbf{b}\mathbf{b}^T.$

Due to eq. (4) we thus obtain

(5) $\hat{\mathbf{a}} = \mathbf{a} + c\, (\Sigma_{\mathbf{E}\mathbf{E}} + \mathbf{b}\mathbf{b}^T)^{-1} \mathbf{b}.$

Equation (5) shows that the vector $\hat{\mathbf{a}}$ obtained by standard regression is the sum of $\mathbf{a}$ (which defines the causal influence of $\mathbf{X}$ on $Y$) and a term that is due to confounding.
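As a quick numerical sanity check, the following sketch (ours; all variable names are illustrative, assuming numpy) simulates the structural eqs. (1) and (2) and confirms that the population regression vector from eq. (4) matches the decomposition in eq. (5):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, c = 10, 200_000, 0.8

# model parameters: random a, b and a random positive definite Sigma_EE
a = rng.standard_normal(d)
b = rng.standard_normal(d)
G = rng.standard_normal((d, d))
Sigma_EE = G @ G.T

# generate data according to the structural equations (1) and (2)
Z = rng.standard_normal(n)                                  # unit-variance confounder
E = rng.multivariate_normal(np.zeros(d), Sigma_EE, size=n)
F = rng.standard_normal(n)
X = E + np.outer(Z, b)                                      # X = b Z + E
Y = X @ a + c * Z + F                                       # Y = <a, X> + c Z + F

# regression vector from the empirical covariances, eq. (4)
Sigma_XX = np.cov(X, rowvar=False)
Sigma_XY = np.cov(np.column_stack([X, Y]), rowvar=False)[:d, d]
a_hat = np.linalg.solve(Sigma_XX, Sigma_XY)

# theoretical prediction, eq. (5)
a_hat_theory = a + c * np.linalg.solve(Sigma_EE + np.outer(b, b), b)
print(np.max(np.abs(a_hat - a_hat_theory)))   # small for large n
```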

It is known that confounding can be detected in linear models with non-Gaussian variables [5]. This is because describing data that have been generated by the model eqs. (1) and (2) via the structural equation

$Y = \langle \hat{\mathbf{a}}, \mathbf{X} \rangle + F,$

as in eq. (3) yields, in the generic case, a noise variable $F$ that is not statistically independent of $\mathbf{X}$, although it is uncorrelated. Other recent proposals to detect confounding using information beyond conditional statistical dependences rely on different model assumptions. Ref. [6], for instance, assumes non-linear relations with additive noise, while Ref. [7] assumes a discrete confounder attaining only a few values.

Here we propose a method for distinguishing the purely causal from the confounded case in the linear scenario above that relies only on second-order statistics and thus does not require non-Gaussianity of noise variables or higher-order statistical independence tests as in [6], for instance. In other words, the relation must be linear, but the distributions may be Gaussian or not. The paper is structured as follows. Section 2 defines the strength of confounding, which is the crucial target quantity to be estimated. Section 3 describes the idea of the underlying principle, defines it formally in terms of a spectral measure, and justifies it by a toy model where parameters are randomly generated. Section 4 describes the method for estimating the confounding strength and justifies it first by intuitive arguments and then by theoretical results which are rigorously shown in Section 5. Section 6 reports results for simulated causal structures following the model assumptions, while Section 7 describes results from an experiment where the data have been generated by an electronic and optical setup with known causal structure. Section 8 reports some studies with observational data where the causal ground truth is only partially known.

2 Characterizing confounding by two parameters

2.1 Strength of confounding

We now define a strength of confounding that measures how much $\hat{\mathbf{a}}$ deviates from $\mathbf{a}$ (relative to the sum of the squared lengths of these vectors). The goal of this paper is to estimate this quantity from $P_{\mathbf{X},Y}$ alone.

Definition 1

(structural strength of confounding) The structural strength of confounding is defined by

(6) $\beta := \frac{\|c\, \Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}\|^2}{\|\mathbf{a}\|^2 + \|c\, \Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}\|^2} \in [0, 1].$

We will later see that, subject to the genericity assumptions stated in Section 3, the denominator can be approximated via

(7) $\|\mathbf{a}\|^2 + \|c\, \Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}\|^2 \approx \|\mathbf{a} + c\, \Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}\|^2 = \|\hat{\mathbf{a}}\|^2.$

It should be emphasized that $\beta$ does not measure to what extent the confounder changes $\Sigma_{\mathbf{X}Y}$. This could be quantified by a parameter that we call 'correlative strength of confounding', denoted by $\gamma$, and defined via

(8) $\gamma := \frac{\|c\mathbf{b}\|^2}{\|\Sigma_{\mathbf{X}\mathbf{X}}\mathbf{a}\|^2 + \|c\mathbf{b}\|^2} \in [0, 1].$

It is non-trivially related to $\beta$. Remarkably, $\beta$ and $\gamma$ can differ by orders of magnitude[1]. Without claiming that $\beta$ would be the better measure[2], we focus on $\beta$ because it is more relevant for causal statements: whenever $\beta$ is large, identifying $\hat{\mathbf{a}}$ with $\mathbf{a}$ yields significantly wrong causal conclusions, even when $\gamma$ is small. We have introduced $\gamma$ only to show that whether confounding is negligible or not depends highly on how it is quantified.

2.2 A second parameter characterizing confounding

The contribution of $Z$ to the covariance between $\mathbf{X}$ and $Y$ is given by the product $c\mathbf{b}$. Accordingly, rescaling $\mathbf{b}$ by some factor while rescaling $c$ by the inverse factor preserves the correlative strength of confounding. Although the structural confounding strength is affected by rescaling $\mathbf{b}$ and $c$ in a more sophisticated way, $\beta$ can also be left unchanged by rescaling both $\mathbf{b}$ and $c$ in an appropriate way. The regimes with small $\mathbf{b}$ and large $c$ versus large $\mathbf{b}$ and small $c$ have simple interpretations: in the first case, the uncertainty of $\mathbf{X}$ is hardly reduced by knowing $Z$, while in the second case, knowing $Z$ reduces most of the uncertainty of $\mathbf{X}$.

To distinguish between these different regimes of confounding we introduce the second parameter η, which measures the explanatory power of Z for X:

$\eta := \mathrm{tr}(\Sigma_{\mathbf{X}\mathbf{X}}) - \mathrm{tr}(\Sigma_{\mathbf{X}\mathbf{X}|Z}) = \mathrm{tr}(\Sigma_{\mathbf{X}\mathbf{X}}) - \mathrm{tr}(\Sigma_{\mathbf{E}\mathbf{E}}) = \|\mathbf{b}\|^2.$

Throughout Section 2 it is important to keep in mind that we have always assumed $Z$ to have unit variance.
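To make the definitions concrete, here is a minimal sketch (our own naming, assuming numpy) computing the population values of $\beta$, $\gamma$, and $\eta$ from given model parameters:

```python
import numpy as np

def confounding_parameters(a, b, c, Sigma_EE):
    """Population values of beta (eq. (6)), gamma (eq. (8)) and eta,
    assuming the confounder Z has unit variance."""
    Sigma_XX = Sigma_EE + np.outer(b, b)
    shift = c * np.linalg.solve(Sigma_XX, b)      # c * Sigma_XX^{-1} b
    beta = (shift @ shift) / (a @ a + shift @ shift)
    cb = c * b
    gamma = (cb @ cb) / ((Sigma_XX @ a) @ (Sigma_XX @ a) + cb @ cb)
    eta = b @ b                                   # explanatory power of Z for X
    return beta, gamma, eta
```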

3 Detecting confounders by the principle of generic orientation

3.1 Intuitive idea and background

The idea of our method is based on the recently stated Principle of Independent Conditionals [9, 10] in the context of causal inference. To introduce it, let $G$ be a directed acyclic graph (DAG) formalizing the hypothetical causal relations among the random variables $Z_1, \dots, Z_n$. The distributions compatible with this causal structure factorize into

$P_{Z_1, \dots, Z_n} = \prod_{j=1}^n P_{Z_j | PA_j},$

where each $P_{Z_j | PA_j}$ denotes the conditional distribution of $Z_j$, given its parents [2]. Informally speaking, the Principle of Independent Conditionals states that, usually, each $P_{Z_j | PA_j}$ describes an independent mechanism of nature and that these objects are therefore 'independent' and contain 'no information' about each other. [9, 10] formalized 'no information' by postulating that the description length of one $P_{Z_j | PA_j}$ does not get shorter when the descriptions of the other $P_{Z_i | PA_i}$ for $i \neq j$ are given. Here, description length is defined via Kolmogorov complexity, which is, unfortunately, uncomputable [11]. To deal with this caveat, one can either approximate Kolmogorov complexity or, as shown in [12], indirectly use the principle as a justification for new inference methods rather than as an inference method itself.

However, there are also other options for giving a definite meaning to the term 'independence'. To see this, consider some parametric model where each $P_{Z_j|PA_j}$ is taken from a set of possible conditionals $P_{Z_j|PA_j}^{\theta_j}$, where $\theta_j$ is taken from some parameter space $\Theta_j$. Assume that, for a given distribution $P_{Z_1,\dots,Z_n}$, the parameters $\theta_1, \dots, \theta_n$ are related by an equation (i.e., one is a function of the others) that is not satisfied by generic $n$-tuples. One can then consider this as a hint that the mechanisms corresponding to the $P_{Z_j|PA_j}$ have not been generated independently and become skeptical about the causal hypothesis[3]. This philosophical argument is also the basis for Causal Faithfulness [3], that is, the principle of rejecting a causal DAG for which the joint distribution satisfies conditional independences that do not hold for generic vectors $(\theta_1, \dots, \theta_n)$ because they require the vector to lie in a lower dimensional manifold. In its informal version, the Principle of Independent Conditionals generalizes this idea by also excluding other 'non-generic' relations between the parameter vectors $\theta_j$. More precisely, the assumption of faithfulness can be justified by generating models according to which unfaithful distributions occur with probability zero [13]. For practical purposes one needs to also exclude almost unfaithful distributions and assume that they occur only with small probability (although this is problematic [14]). Here we describe relations between parameters that are also non-generic in the sense that they occur with small probability according to our generating model.

After these general remarks, let us come back to the scenario where the causal hypothesis reads $\mathbf{X} \to Y$ (without confounding and within a linear model). Recalling that we consider $\Sigma_{\mathbf{X}\mathbf{X}}$ the crucial model parameter for $P_{\mathbf{X}}$ and the regression vector $\mathbf{a}$ the crucial parameter for $P_{Y|\mathbf{X}}$, we therefore postulate that $\mathbf{a}$ lies in a 'generic' orientation relative to $\Sigma_{\mathbf{X}\mathbf{X}}$, in a sense to be described in Subsection 3.2. To approach this idea by intuition first, note, for instance, that it is unlikely that $\mathbf{a}$ is close to being aligned with the eigenvector of $\Sigma_{\mathbf{X}\mathbf{X}}$ corresponding to the largest eigenvalue (i.e., the first principal component), given that $\mathbf{a}$ has been chosen 'without knowing' $\Sigma_{\mathbf{X}\mathbf{X}}$. Likewise, it is unlikely that it is approximately aligned with the last principal component.

For the more general DAG shown in Figure 1 we again assume that $\mathbf{a}$ has generic orientation with respect to the eigenspaces of $\Sigma_{\mathbf{X}\mathbf{X}}$ and, in addition, that $\mathbf{b}$ has generic orientation with respect to the eigenspaces of $\Sigma_{\mathbf{E}\mathbf{E}}$.

To provide a first intuition about why the orientation of the resulting vector $\hat{\mathbf{a}}$ of regression coefficients is no longer 'generic' relative to $\Sigma_{\mathbf{X}\mathbf{X}}$ as a result of confounding, we show two opposite extreme cases where $\hat{\mathbf{a}}$ gets aligned with the first and the last principal component of $\mathbf{X}$, respectively. To this end, we consider the purely confounded case where $\mathbf{a} = 0$ and thus $\hat{\mathbf{a}} = c\, \Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}$.

$\hat{\mathbf{a}}$ aligned with the first eigenvector of $\Sigma_{\mathbf{X}\mathbf{X}}$: Let $\Sigma_{\mathbf{E}\mathbf{E}}$ be the identity matrix $I$. Since $\Sigma_{\mathbf{E}\mathbf{E}}$ is then rotation invariant, $\mathbf{b}$ certainly has 'generic orientation' relative to $\Sigma_{\mathbf{E}\mathbf{E}}$, according to any reasonable sense of 'generic orientation'. However, $\mathbf{b}$ does not have generic orientation relative to $\Sigma_{\mathbf{X}\mathbf{X}} = I + \mathbf{b}\mathbf{b}^T$, because it is the unique eigenvector of the latter with maximal eigenvalue. In other words, $\mathbf{b}$ is aligned with the first principal component of $\mathbf{X}$. Then, $\hat{\mathbf{a}}$ is also aligned with the same principal component since it is a multiple of $\mathbf{b}$ due to $\hat{\mathbf{a}} = c\, \Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b} = c\, (I + \mathbf{b}\mathbf{b}^T)^{-1}\mathbf{b}$. Note that one also gets close to this scenario when the spectral gaps of $\Sigma_{\mathbf{E}\mathbf{E}}$ are small[4] compared to the norm of $\mathbf{b}\mathbf{b}^T$ and to the eigenvalues of $\Sigma_{\mathbf{E}\mathbf{E}}$.

$\hat{\mathbf{a}}$ close to the last eigenvector of $\Sigma_{\mathbf{X}\mathbf{X}}$: Let the spectral gaps between adjacent eigenvalues of $\Sigma_{\mathbf{E}\mathbf{E}}$ be much larger than the norm of $\mathbf{b}\mathbf{b}^T$. Then adding $\mathbf{b}\mathbf{b}^T$ changes the eigenspaces only slightly [15]. Hence, if $\mathbf{b}$ has a generic orientation relative to $\Sigma_{\mathbf{E}\mathbf{E}}$, it is still generic relative to $\Sigma_{\mathbf{X}\mathbf{X}}$. Multiplying $\mathbf{b}$ with $\Sigma_{\mathbf{X}\mathbf{X}}^{-1}$ then generates a vector that has stronger coefficients in the eigenspaces corresponding to small eigenvalues of $\Sigma_{\mathbf{X}\mathbf{X}}$. If the smallest eigenvalue of $\Sigma_{\mathbf{X}\mathbf{X}}$ is much smaller than the others, the orientation of $\hat{\mathbf{a}}$ gets arbitrarily close to the corresponding eigenvector.

For general $\Sigma_{\mathbf{E}\mathbf{E}}$, where the gaps between the eigenvalues are neither tiny nor huge compared to the norm of $\mathbf{b}\mathbf{b}^T$, the orientation of $\hat{\mathbf{a}}$ changes in a more sophisticated way that heavily depends on the structure of the spectrum of $\Sigma_{\mathbf{E}\mathbf{E}}$. This will be analyzed in Section 4.

3.2 Defining ‘generic orientation’ via the induced spectral measure

We start with some notation and terminology and formally introduce two measures which have quite simple intuitive meanings.

For $d \times d$ matrices, we introduce the renormalized trace[5]

$\tau(A) := \frac{1}{d} \mathrm{tr}(A).$

For notational convenience, we will assume that the spectra of all matrices are non-degenerate throughout the paper, i.e., all eigenvalues are different. Every symmetric matrix A thus admits a unique decomposition

(9) $A = \sum_{j=1}^d \lambda_j \, \phi_j \phi_j^T,$

where $\lambda_1 > \dots > \lambda_d$ and $\phi_1, \dots, \phi_d$ denote the corresponding eigenvectors of unit length. Every $A$ uniquely defines a measure on $\mathbb{R}$, namely the distribution of eigenvalues, formally given as follows:

Definition 2

(tracial spectral measure) Let $A$ be a real symmetric matrix with non-degenerate spectrum. Then the tracial spectral measure $\mu_A^\tau$ of $A$ is the discrete measure on $\mathbb{R}$ given by the uniform distribution over its eigenvalues $\lambda_1, \dots, \lambda_d$, i.e.,

$\mu_A^\tau := \frac{1}{d} \sum_{j=1}^d \delta_{\lambda_j},$

where $\delta_s$ denotes the point measure on $s$ for $s \in \mathbb{R}$.

By elementary spectral theory of symmetric matrices [17], we have:

Lemma 1

(expectation for tracial spectral measure) The expectation of any function $f: \{\lambda_1, \dots, \lambda_d\} \to \mathbb{R}$ with respect to the tracial measure is given as follows:

$\int f(s) \, d\mu_A^\tau(s) = \tau(f(A)).$

While the spectral measure is a property of a matrix alone, the following measure describes the relation between a matrix and a vector:

Definition 3

(vector-induced spectral measure) Let $A$ be a symmetric matrix and $\lambda_j, \phi_j$ be defined by eq. (9). For arbitrary $\psi \in \mathbb{R}^d$, the (unnormalized) spectral measure induced by $\psi$ on $A$, denoted by $\mu_{A,\psi}$, is given by

$\mu_{A,\psi}(S) = \sum_{j \,:\, \lambda_j \in S} \langle \psi, \phi_j \rangle^2,$

for any measurable set $S \subseteq \mathbb{R}$.

For each set of eigenvalues of A, the measure describes the squared length of the component of ψ that lies in the respective eigenspace of A. Accordingly, we have the following normalization condition:

(10) $\mu_{A,\psi}(\mathbb{R}) = \|\psi\|^2.$

In analogy to Lemma 1 we obtain:

Lemma 2

(expectations for vector-induced spectral measure) The expectation of any function $f$ on the spectrum of $A$ with respect to $\mu_{A,\psi}$ is given as follows:

$\int f(s) \, d\mu_{A,\psi}(s) = \langle \psi, f(A)\psi \rangle.$

To deal with the above measures in numerical computations, each measure will be represented by two vectors: first, one vector $\lambda := (\lambda_1, \dots, \lambda_d)$ listing its support (with the convention $\lambda_1 \geq \dots \geq \lambda_d$) and second, the vector $w := (w_1, \dots, w_d)$ listing the corresponding weights. For tracial spectral measures, $\lambda$ is just the list of eigenvalues and $w = (1/d, \dots, 1/d)$ is the uniform distribution on these $d$ points. For spectral measures induced by a vector, $\lambda$ is still the list of eigenvalues, but now $w$ describes the squared coefficients of the vector with respect to the eigenvector decomposition.
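In code, this representation could look as follows (a sketch of ours, assuming numpy; note that numpy's `eigh` returns eigenvalues in ascending order, so we reorder):

```python
import numpy as np

def spectral_weights(A, psi=None):
    """Represent a spectral measure by (lam, w): eigenvalues of A in
    decreasing order and the corresponding weights, either of the
    tracial measure (psi=None) or of the measure induced by psi."""
    lam, Phi = np.linalg.eigh(A)            # ascending order
    lam, Phi = lam[::-1], Phi[:, ::-1]      # reorder: lambda_1 >= ... >= lambda_d
    if psi is None:
        w = np.full(len(lam), 1.0 / len(lam))   # uniform weights 1/d
    else:
        w = (Phi.T @ psi) ** 2                  # squared coefficients <psi, phi_j>^2
    return lam, w
```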

To understand our algorithm described later, it is helpful to note that using the eigenvectors $\phi_j$ of a matrix $A$ one can easily construct a vector $\psi$ that induces the tracial measure:

$\psi := \frac{1}{\sqrt{d}} \sum_{j=1}^d \phi_j.$

Then we have

(11) $\mu_{A,\psi} = \mu_A^\tau.$

The crucial idea behind our detection of confounding is the insight that $\mu_{A,\psi}$ is close to $\|\psi\|^2 \mu_A^\tau$ for 'typical' choices of $\psi$, which is made precise by the following result shown in Section 5.

Lemma 3

Let $(A_d)_{d \in \mathbb{N}}$ with $\|A_d\| \leq a$ be a sequence of symmetric matrices whose tracial spectral measures converge weakly to some probability measure $\mu$, i.e.,

(12) $\mu_{A_d}^\tau \to \mu.$

Let $(\mathbf{c}_d)_{d \in \mathbb{N}}$ with $\mathbf{c}_d \in \mathbb{R}^d$ be randomly drawn from the sphere of radius $r$. Then

$\mu_{A_d, \mathbf{c}_d} \to r^2 \mu,$

weakly in probability.
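A quick numerical illustration of Lemma 3 (our own sketch, assuming numpy): for a large diagonal matrix, the moments of the measure induced by a random unit vector nearly coincide with those of the tracial measure:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 2000
lam = rng.uniform(1.0, 3.0, size=d)     # spectrum of a diagonal matrix A_d

c = rng.standard_normal(d)
c /= np.linalg.norm(c)                  # uniform point on the unit sphere
w = c ** 2                              # weights of mu_{A_d, c} for diagonal A_d

for k in (1, 2, 3):
    print(k, np.sum(lam ** k * w), np.mean(lam ** k))   # moments nearly agree
```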

The idea of this paper is that we expect $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \hat{\mathbf{a}}}$ to be close to $\|\hat{\mathbf{a}}\|^2 \mu_{\Sigma_{\mathbf{X}\mathbf{X}}}^\tau$ in the absence of confounding. Moreover, we show that confounding perturbs $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \hat{\mathbf{a}}}$ in a characteristic way, given that the model parameters are 'typical'. To study properties of $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \hat{\mathbf{a}}}$ for 'typical' choices of the model parameters $\mathbf{a}, \mathbf{b}$, we now define the following sequence of models for increasing dimension $d$:

Covariance matrix of the noise of $\mathbf{X}$: Let $(\Sigma_{\mathbf{E}\mathbf{E}}^d)_{d \in \mathbb{N}}$ be a uniformly bounded sequence of positive semi-definite $d \times d$ matrices such that their tracial spectral measures converge weakly to some probability measure $\mu$ (describing the asymptotic distribution of eigenvalues).

Vector of causal structure coefficients $\mathbf{a}$: Let $(\mathbf{a}_d)_{d \in \mathbb{N}}$ be a sequence of vectors in $\mathbb{R}^d$ drawn uniformly at random from a sphere of fixed radius $r_a$.

Vector of confounding structure coefficients $\mathbf{b}$: Let $(\mathbf{b}_d)_{d \in \mathbb{N}}$ be a sequence of vectors in $\mathbb{R}^d$ drawn uniformly at random (independently of $\mathbf{a}_d$) from a sphere of fixed radius $r_b$. Let $c$ be fixed for all $d$.

Then $\Sigma_{\mathbf{X}\mathbf{X}}^d = \Sigma_{\mathbf{E}\mathbf{E}}^d + \mathbf{b}_d\mathbf{b}_d^T$ and $\hat{\mathbf{a}}_d = \mathbf{a}_d + c\, (\Sigma_{\mathbf{X}\mathbf{X}}^d)^{-1}\mathbf{b}_d$, and we have the following result, which will be shown in Section 5:

Theorem 1

(asymptotic spectral measure for rotation-invariant generating model) For the above generating model, the approximations eqs. (16), (17), and (18) below are asymptotically justified in the sense that

(13) $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d, \mathbf{a}_d} \to r_a^2\, \mu$ (weakly in probability)
(14) $\mu_{\Sigma_{\mathbf{E}\mathbf{E}}^d, \mathbf{b}_d} \to r_b^2\, \mu$ (weakly in probability)
(15) $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d, \, \mathbf{a}_d + c(\Sigma_{\mathbf{X}\mathbf{X}}^d)^{-1}\mathbf{b}_d} - \left( \mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d, \mathbf{a}_d} + \mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d, \, c(\Sigma_{\mathbf{X}\mathbf{X}}^d)^{-1}\mathbf{b}_d} \right) \to 0$ (weakly in probability)

Intuitively, eq. (13) just states that the causal part $\mathbf{a}$ induces, together with $\Sigma_{\mathbf{X}\mathbf{X}}$, a spectral measure that is close to the tracial measure up to a normalization factor, because $\mathbf{a}$ has 'generic' orientation relative to $\Sigma_{\mathbf{X}\mathbf{X}} = \Sigma_{\mathbf{E}\mathbf{E}} + \mathbf{b}\mathbf{b}^T$, since $\mathbf{a}$ is chosen independently of $\Sigma_{\mathbf{E}\mathbf{E}}$ and $\mathbf{b}$. On the other hand, $\mathbf{b}$ is chosen independently of $\Sigma_{\mathbf{E}\mathbf{E}}$, and thus the spectral measure induced by $\Sigma_{\mathbf{E}\mathbf{E}}$ and $\mathbf{b}$ is close to the corresponding tracial measure up to normalization, as shown by eq. (14). Equation (15) states that $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \, \mathbf{a} + c\Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}}$ almost decomposes into $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \mathbf{a}}$ plus $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \, c\Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}}$, that is, the spectral measures induced by the causal vector $\mathbf{a}$ and by its perturbation $c\Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}$. This is useful because eq. (13) describes the asymptotic behavior of $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \mathbf{a}}$. To understand the asymptotics of $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \, c\Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}}$, on the other hand, we will later use eq. (14).

Assuming a rotation invariant generating model may appear to be too strong an assumption for practical purposes. It should therefore be noted that much weaker assumptions would probably yield the same approximate identities for high dimensions. Possible quantitative versions of Lemma 3 for finite dimensions could show that the overwhelming majority of vectors on the surface of the unit sphere induce spectral measures that are close to the tracial one. Such quantitative versions could provide some insight into the robustness of our method regarding violations of the strong model assumptions (since a large class of non-rotation-invariant priors would then yield the same conclusions), but this goes beyond the scope of this paper. After all, the performance on real data cannot be predicted from purely theoretical arguments.

Motivated by the above asymptotic results, we assume $d$ to be large enough that the limits in eqs. (13) to (15) are approximately attained:

Postulate 1 (generic orientation of vectors) If eqs. (1) and (2) are structural equations corresponding to the causal DAG in Figure 1, and $d$ is large, then:

(I) The vector $\mathbf{a}$ has generic orientation relative to $\Sigma_{\mathbf{X}\mathbf{X}}$ in the sense that

(16) $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \mathbf{a}} \approx \|\mathbf{a}\|^2 \, \mu_{\Sigma_{\mathbf{X}\mathbf{X}}}^\tau.$

(II) The vector $\mathbf{b}$ has generic orientation relative to $\Sigma_{\mathbf{E}\mathbf{E}}$ in the sense that

(17) $\mu_{\Sigma_{\mathbf{E}\mathbf{E}}, \mathbf{b}} \approx \|\mathbf{b}\|^2 \, \mu_{\Sigma_{\mathbf{E}\mathbf{E}}}^\tau.$

(III) The vector $\mathbf{a}$ is generic relative to $\mathbf{b}, \Sigma_{\mathbf{E}\mathbf{E}}$ in the sense that

(18) $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \, \mathbf{a} + c\Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}} \approx \mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \mathbf{a}} + \mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \, c\Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}}.$

We therefore build our algorithm, described in Section 4, upon these postulates only, instead of directly using the generating model.

Note that eqs. (16), (17), and (18) only hold if the $\approx$-sign is interpreted in a sufficiently loose sense. After all, the differences between both sides of the above equations converge weakly to zero. They are not close, for instance, with respect to total variation distance. Hence, the measures are similar in the same sense as two empirical distributions with large sample size are similar when they are independently sampled from the same distribution.

Related work: We first describe the relation of the above ideas to those underlying the so-called Trace Method [18, 19], which is, to the best of our knowledge, the work in the literature closest to the present one. Let $\mathbf{X}$ and $\mathbf{Y}$ be vector-valued variables with values in $\mathbb{R}^d$ and $\mathbb{R}^m$, respectively. Assume $\mathbf{X}$ influences $\mathbf{Y}$ via the linear model

$\mathbf{Y} = A\mathbf{X} + \mathbf{F},$

where $A$ is an $m \times d$ matrix and $\mathbf{F}$ a noise variable of dimension $m$. Then $A$ and $\Sigma_{\mathbf{X}\mathbf{X}}$ satisfy the trace condition

(19) $\frac{1}{m} \mathrm{tr}(A \Sigma_{\mathbf{X}\mathbf{X}} A^T) \approx \frac{1}{d} \mathrm{tr}(\Sigma_{\mathbf{X}\mathbf{X}}) \cdot \frac{1}{m} \mathrm{tr}(A A^T).$

For $m = 1$, the $1 \times d$ matrix $A$ is given by the row vector $\mathbf{a}^T$. Then $A\mathbf{X} = \langle \mathbf{a}, \mathbf{X} \rangle$ and eq. (19) turns into

(20) $\langle \mathbf{a}, \Sigma_{\mathbf{X}\mathbf{X}} \mathbf{a} \rangle \approx \frac{1}{d} \mathrm{tr}(\Sigma_{\mathbf{X}\mathbf{X}}) \, \langle \mathbf{a}, \mathbf{a} \rangle.$

In terms of the spectral measures, eq. (20) reads

$\int s \, d\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \mathbf{a}}(s) \approx \int s \, d\mu_{\Sigma_{\mathbf{X}\mathbf{X}}}^\tau(s) \cdot \|\mathbf{a}\|^2.$

Hence, eq. (20) states that the first moments of the two measures on the left and right of eq. (16) almost coincide, while our method also accounts for higher-order moments, which the Trace Condition ignores. As already sketched in [19], the Trace Condition (19) is closely related to the concept of free independence in free probability theory [20]. In the appendix we will explain why eqs. (16), (17), (18) are also related to free independence in spirit, although there is no straightforward way to apply those concepts here.
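The first-moment statement of eq. (20) is also easy to check numerically (a sketch of ours, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 1000
G = rng.standard_normal((d, d)) / np.sqrt(d)
Sigma = G @ G.T                       # a generic covariance matrix
a = rng.standard_normal(d)            # a 'generic' vector

print(a @ Sigma @ a)                  # <a, Sigma a>
print(np.trace(Sigma) / d * (a @ a))  # tau(Sigma) <a, a>; nearly the same
```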

We should also mention [21], which is remotely related to this paper. The authors study a large number $d$ of observed variables that are related by a sparse graphical model as well as by some latent variables influencing a large subset of the observed variables, and they provide conditions under which the latent part and the graphical-model part can be distinguished. On a high level, this is related to our work because the method also employs high-dimensional effects to detect latent variables.

4 Description of the method

4.1 Constructing typical spectral measures for given parameter values

The main result of this section states that asymptotically, $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \hat{\mathbf{a}}}$ depends only on three parameters when $\Sigma_{\mathbf{X}\mathbf{X}}$ is given: $\|\hat{\mathbf{a}}\|^2$, $\beta$, and $\eta$. The first one is directly observable; hence we define a two-parametric family of normalized (i.e. probability) measures $\nu_{\beta,\eta}$ such that for large $d$, with high probability,

$\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \hat{\mathbf{a}}} \approx \|\hat{\mathbf{a}}\|^2 \, \nu_{\beta,\eta},$

for appropriate choices of $\beta, \eta$. We first describe the construction of $\nu_{\beta,\eta}$:

  1. Causal part: this part describes the spectral measure that would be obtained in the absence of confounding. It is induced by $\mathbf{a}$ and $\Sigma_{\mathbf{X}\mathbf{X}}$. According to eq. (16), it is approximated by the uniform distribution over the spectrum of $\Sigma_{\mathbf{X}\mathbf{X}}$, i.e., the tracial measure introduced in Definition 2. We therefore define:

    $\nu^{\mathrm{causal}} := \mu_{\Sigma_{\mathbf{X}\mathbf{X}}}^\tau.$
  2. Confounding part: we now approximate the spectral measure induced by the vector $\Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}$ and $\Sigma_{\mathbf{X}\mathbf{X}}$. We will justify the construction later, after we have described all its steps. We first define the matrix $M_X := \mathrm{diag}(v_1^X, \dots, v_d^X)$, where $v_j^X$ are the eigenvalues of $\Sigma_{\mathbf{X}\mathbf{X}}$ in decreasing order. Then we define a rank-one perturbation of $M_X$ by

    (21) $T := M_X + \eta \, g g^T,$

    where $g$ is the vector $g := (1, \dots, 1)^T / \sqrt{d}$. We then compute the spectral measure induced by the vector $T^{-1}g$ and $T$ and define

    (22) $\nu_\eta^{\mathrm{confounded}} := \frac{1}{\|T^{-1}g\|^2} \, \mu_{T, T^{-1}g}.$
  3. Mixing both contributions: Finally, $\nu_{\beta,\eta}$ is a convex sum of the causal and the confounded part, where the mixing weight of the latter is given by the confounding strength:

    $\nu_{\beta,\eta} := (1 - \beta)\, \nu^{\mathrm{causal}} + \beta\, \nu_\eta^{\mathrm{confounded}}.$

We now explain Step 2 in the above construction. According to Postulate 1, $\mathbf{b}$ has generic orientation relative to $\Sigma_{\mathbf{E}\mathbf{E}}$ in the sense of eq. (17). With respect to its eigenbasis, $\Sigma_{\mathbf{E}\mathbf{E}}$ reads $M_E := \mathrm{diag}(v_1^E, \dots, v_d^E)$, where $v_j^E$ are the eigenvalues of $\Sigma_{\mathbf{E}\mathbf{E}}$. Of course, we know neither the vector $\mathbf{b}$ nor its coordinates with respect to this basis. Remarkably, it turns out that knowing that $\mathbf{b}$ is generic relative to $\Sigma_{\mathbf{E}\mathbf{E}}$ is enough, because then we can replace $\mathbf{b}$ with a vector that is 'particularly generic', namely $g$. This vector satisfies

(23) $\mu_{M_E, g} = \mu_{M_E}^\tau,$

(see eq. (11)), while asymptotically the overwhelming majority of unit vectors satisfy eq. (23) approximately. Therefore, an appropriate multiple of $g$ nicely mimics the behavior of generic vectors. Accordingly, we can approximate the spectral measure induced by the vector $(\Sigma_{\mathbf{E}\mathbf{E}} + \mathbf{b}\mathbf{b}^T)^{-1}\mathbf{b}$ and the matrix $\Sigma_{\mathbf{E}\mathbf{E}} + \mathbf{b}\mathbf{b}^T$ by the spectral measure induced by $(M_E + \eta g g^T)^{-1} \sqrt{\eta}\, g$ and $M_E + \eta g g^T$. Unfortunately, this construction would require the eigenvalues of $\Sigma_{\mathbf{E}\mathbf{E}}$, which cannot be computed from observing $\mathbf{X}, Y$ alone. Asymptotically, however, the difference between the spectra of $\Sigma_{\mathbf{X}\mathbf{X}}$ and $\Sigma_{\mathbf{E}\mathbf{E}}$ does not matter and we can replace $M_E$ with $M_X$. This step will be justified later using the fact that for large $d$ we have

(24) $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}}^\tau \approx \mu_{\Sigma_{\mathbf{E}\mathbf{E}}}^\tau,$

which is made more precise by the following result:

Lemma 4

(tracial measures are close) For any interval $[l, r]$ we have

$\left| \mu_{\Sigma_{\mathbf{X}\mathbf{X}}}^\tau([l, r]) - \mu_{\Sigma_{\mathbf{E}\mathbf{E}}}^\tau([l, r]) \right| \leq \frac{1}{d}.$

Proof

If $v_1^E > \dots > v_d^E$ denote the eigenvalues of $\Sigma_{\mathbf{E}\mathbf{E}}$, then the eigenvalues $v_j^X$ of $\Sigma_{\mathbf{X}\mathbf{X}} = \Sigma_{\mathbf{E}\mathbf{E}} + \mathbf{b}\mathbf{b}^T$ satisfy

$v_j^X \in [v_j^E, v_{j-1}^E] \quad \text{for all } j \geq 2,$

by Theorem 10.2 in [22]. Hence the number of eigenvalues in a given interval can differ by at most 1.
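This interlacing property is easy to check numerically (our own sketch, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
G = rng.standard_normal((d, d))
Sigma_EE = G @ G.T
b = rng.standard_normal(d)

v_E = np.sort(np.linalg.eigvalsh(Sigma_EE))[::-1]                    # v_1^E >= ...
v_X = np.sort(np.linalg.eigvalsh(Sigma_EE + np.outer(b, b)))[::-1]   # v_1^X >= ...

# v_j^X lies in [v_j^E, v_{j-1}^E] for all j >= 2 (0-based: j >= 1)
print(all(v_E[j] <= v_X[j] <= v_E[j - 1] for j in range(1, d)))      # True
```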

We now describe the main theoretical result of this article:

Theorem 2

(convergence to two-parametric family) Let $(\Sigma_{\mathbf{X}\mathbf{X}}^d)$ be a sequence of covariance matrices whose distribution of eigenvalues (i.e. the tracial spectral measure) converges weakly to some probability measure $\mu$ supported by a compact interval in $\mathbb{R}_0^+$. Assume, moreover, we are given sequences of vectors $(\mathbf{a}_d)$ and $(\mathbf{b}_d)$ with $\|\mathbf{a}_d\| = r_a$ and $\|\mathbf{b}_d\| = r_b$ (with $r_a, r_b$ fixed) and fixed $c$, such that eqs. (13)–(15) hold (recall $\Sigma_{\mathbf{E}\mathbf{E}}^d = \Sigma_{\mathbf{X}\mathbf{X}}^d - \mathbf{b}_d\mathbf{b}_d^T$). Then $\nu_{\beta,\eta}$ approximates $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \hat{\mathbf{a}}}$ up to normalization in the sense that

$\frac{1}{\|\hat{\mathbf{a}}_d\|^2}\, \mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d, \hat{\mathbf{a}}_d} - \nu_{\beta,\eta}^d \to 0 \quad \text{(weakly in probability)},$

where $\eta := r_b^2$ and the confounding strength $\beta$ can be derived from $\mu$, $r_a$, $r_b$, and $c$ by equations specified later.

Hence, the theorem states that whenever Postulate 1 holds with sufficient accuracy and the dimension is sufficiently high, then $\nu_{\beta,\eta}$ is a good approximation for the induced spectral measure. We will prove Theorem 2 in Section 5.

Apart from this weak convergence result, we also know that the measures $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d, \hat{\mathbf{a}}_d}$ and $\nu_{\beta,\eta}^d$ have precisely the same support for any $d$ because, by construction, $\nu_{\beta,\eta}^d$ is also supported by the spectrum of $\Sigma_{\mathbf{X}\mathbf{X}}$. This enables us to conveniently represent both measures by vectors whose entries describe the weights of the corresponding eigenvalues.

Two limiting cases as examples: To get further intuition about $\nu_{\beta,\eta}$, let us describe it for the two motivating limiting cases in Subsection 3.1, both with pure confounding, i.e., $\beta = 1$. Following Theorem 2, we have to choose $\eta = \|\mathbf{b}\|^2$ to mimic $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \hat{\mathbf{a}}}$ in the limit $d \to \infty$. The case where $\hat{\mathbf{a}}$ is almost aligned with the first eigenvector corresponds to $\nu_{\beta,\eta}$ having the majority of its weight on the first eigenvalue, while the second case corresponds to a $\nu_{\beta,\eta}$ for which the weights of the small eigenvalues dominate, as we will see now.

For the first case, $\Sigma_{\mathbf{E}\mathbf{E}} = I$, the spectrum of $\Sigma_{\mathbf{X}\mathbf{X}} = I + \mathbf{b}\mathbf{b}^T$ reads $1 + \eta, 1, \dots, 1$, that is, $M_X = \mathrm{diag}(1 + \eta, 1, \dots, 1)$. We thus have

$T = I + \eta\, (e_1 e_1^T + g g^T),$

where $e_1 := (1, 0, \dots, 0)^T$. The subspace spanned by $e_1$ and $g$ is $T$-invariant. The restriction $T_2$ of $T$ to this 2-dimensional subspace is asymptotically (for $d \to \infty$) given by $(\eta + 1) I$ because $e_1$ and $g$ become orthogonal. Therefore,

$\nu_{\beta,\eta} = \frac{1}{\|T^{-1}g\|^2}\, \mu_{T, T^{-1}g} = \frac{1}{\|T_2^{-1}g\|^2}\, \mu_{T_2, T_2^{-1}g} \;\xrightarrow{d \to \infty}\; \delta_{\eta+1},$

that is, the weights of $T^{-1}g$ tend to zero for all eigenvectors except the one corresponding to $v_1^X$, in agreement with our expectation when we recall the remarks in Subsection 3.1.

In the second case, let all eigenvalues $v_1^X, \dots, v_d^X$ of $\Sigma_{\mathbf{X}\mathbf{X}}$ be distinct and $\eta$ be small compared to the spectral gaps $v_j^X - v_{j+1}^X$. The perturbation of $M_X = \mathrm{diag}(v_1^X, \dots, v_d^X)$ with $\eta g g^T$ changes the eigenvectors and eigenvalues only slightly, that is, $T \approx M_X$ and thus

(25) $\mu_{T, T^{-1}g} \approx \mu_{M_X, M_X^{-1}g} = \frac{1}{d} \sum_{j=1}^d (v_j^X)^{-2}\, \delta_{v_j^X}.$

Accounting for the normalization factor then yields

$\nu_{\beta,\eta} = \frac{1}{\|T^{-1}g\|^2}\, \mu_{T, T^{-1}g} \approx \frac{1}{\sum_{j=1}^d (v_j^X)^{-2}} \sum_{j=1}^d (v_j^X)^{-2}\, \delta_{v_j^X}.$

Hence, as expected due to the remarks in Subsection 3.1, small eigenvalues get higher weights than the larger ones (due to the weighting factor $(v_j^X)^{-2}$).

4.2 Description of the algorithm

To estimate the confounding parameters we just take the element in the family $(\nu_{\beta,\eta})_{\beta \in [0,1],\, \eta \in [0, v_1^X]}$ that is closest to $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \hat{\mathbf{a}}}$, but we have to choose an appropriate distance measure. Since Theorem 2 only guarantees weak convergence, $\ell^1$ or $\ell^2$ distances between the weight vectors would be inappropriate. Instead, we have decided to smoothen the measures by a Gaussian kernel and then compare them via the $\ell^1$ distance. As kernel bandwidth we have worked with $\sigma := 0.2\, (v_1^X - v_d^X)$. This heuristic choice is motivated by the idea that the kernel should smoothen over a significant fraction of the eigenvalues. For dimension $d = 10$, for instance, the bandwidth is thus about twice the average distance between adjacent eigenvalues.

Accordingly, we define a distance between two weight vectors $w$ and $w'$ by

(26) $D(w, w') := \|K(w - w')\|_1,$

where $K$ denotes the kernel smoothing matrix with entries

$K(i, j) := e^{-\frac{(v_i^X - v_j^X)^2}{2\sigma^2}}.$
Algorithm 1 Estimating the strength of confounding
1: Input: I.i.d. samples from $P_{\mathbf{X},Y}$.
2: Compute the empirical covariance matrices $\Sigma_{\mathbf{X}\mathbf{X}}$ and $\Sigma_{\mathbf{X}Y}$
3: Compute the regression vector $\hat{\mathbf{a}} := \Sigma_{\mathbf{X}\mathbf{X}}^{-1}\Sigma_{\mathbf{X}Y}$
4: PHASE 1: Compute the spectral measure $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \hat{\mathbf{a}}}$
5: Compute eigenvalues $v_1^X > \dots > v_d^X$ and the corresponding eigenvectors $\phi_1, \dots, \phi_d$ of $\Sigma_{\mathbf{X}\mathbf{X}}$
6: Compute the weights $\tilde{w}_j = \langle \hat{\mathbf{a}}, \phi_j \rangle^2$ and then the normalized weights $w_j := \tilde{w}_j / \sum_j \tilde{w}_j$.
7: PHASE 2: Find the parameter values $\hat{\beta}, \hat{\eta}$ that minimize the distance $D(w, w^{\beta,\eta})$ with $D$ defined
by eq. (26), where $w^{\beta,\eta}$ denotes the weight vector of the measure $\nu_{\beta,\eta}$.
8: Output: Estimated confounding strength $\hat{\beta}$

Based on these findings, we describe how to estimate $\beta$ in Algorithm 1. The code, as well as all the data sets used in this paper and instructions on how to reproduce the results, can be found at http://webdav.tuebingen.mpg.de/causality/.

Since the pseudocode does not describe how to compute the weight vector $w^{\beta,\eta}$, we provide this missing detail now. First compute the matrix $T$ as defined by eq. (21) and compute its eigenvectors $\psi_1, \dots, \psi_d$. Then we compute the vector $v := T^{-1}g / \|T^{-1}g\|$. The squared coefficients of $v$ with respect to the eigenvector basis $\psi_1, \dots, \psi_d$ describe the weights of $\nu_\eta^{\mathrm{confounded}}$, see eq. (22). To obtain the weights of $\nu_{\beta,\eta}$ we need to add the contribution of $\nu^{\mathrm{causal}}$ and finally obtain the weights

$w_j^{\beta,\eta} := \frac{1}{d}(1 - \beta) + \beta\, \langle v, \psi_j \rangle^2.$
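For completeness, here is a compact end-to-end sketch of Algorithm 1 (our own illustration, not the reference implementation linked above; the function names and the brute-force grid resolution are our choices):

```python
import numpy as np
from itertools import product

def nu_weights(beta, eta, v):
    """Weight vector of nu_{beta,eta} for eigenvalues v of Sigma_XX
    (in decreasing order), following eqs. (21), (22) and the formula above."""
    d = len(v)
    g = np.ones(d) / np.sqrt(d)
    T = np.diag(v) + eta * np.outer(g, g)     # eq. (21)
    u = np.linalg.solve(T, g)
    u /= np.linalg.norm(u)                    # the vector called v in the text
    _, Psi = np.linalg.eigh(T)
    Psi = Psi[:, ::-1]                        # eigenvectors, decreasing eigenvalues
    w_conf = (Psi.T @ u) ** 2                 # squared coefficients <v, psi_j>^2
    return (1.0 - beta) / d + beta * w_conf

def estimate_beta(X, Y, n_grid=40):
    """Sketch of Algorithm 1: estimate the confounding strength beta_hat."""
    d = X.shape[1]
    Sigma_XX = np.cov(X, rowvar=False)
    Sigma_XY = np.cov(np.column_stack([X, Y]), rowvar=False)[:d, d]
    a_hat = np.linalg.solve(Sigma_XX, Sigma_XY)

    lam, Phi = np.linalg.eigh(Sigma_XX)
    lam, Phi = lam[::-1], Phi[:, ::-1]        # decreasing order
    w = (Phi.T @ a_hat) ** 2
    w /= w.sum()                              # normalized spectral weights

    sigma = 0.2 * (lam[0] - lam[-1])          # kernel bandwidth
    K = np.exp(-((lam[:, None] - lam[None, :]) ** 2) / (2 * sigma ** 2))

    best_D, best_beta = np.inf, 0.0
    for beta, eta in product(np.linspace(0, 1, n_grid),
                             np.linspace(1e-6, lam[0], n_grid)):
        D = np.abs(K @ (w - nu_weights(beta, eta, lam))).sum()   # eq. (26)
        if D < best_D:
            best_D, best_beta = D, beta
    return best_beta
```

A finer grid or a continuous optimizer could of course replace the brute-force search over $(\beta, \eta)$.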

4.3 Remark on normalization

So far we have ignored the case where the variables $X_1, \dots, X_d$ refer to quantities that are measured in different units. If, for instance, $X_1$ denotes the temperature and $X_2$ the traffic density (which both influence the NOx concentration in the air), the relative scale of their numeric values depends on the units one chooses. A related issue arises when all $X_j$ refer to the same unit, but the variance of one variable is overwhelmingly larger than the variances of the others, which results in a covariance matrix whose rank is essentially one. A pragmatic and straightforward solution for both issues is to normalize all variables $X_j$ as a preprocessing step. Admittedly, this is in conflict with the justification of the method, because normalization jointly changes $\Sigma_{\mathbf{X}\mathbf{X}}$ and $\mathbf{a}$, which spoils the idea of 'independence'. In our limited simulation studies, however, the results turned out to be reasonably robust with respect to normalizing all $X_j$. Here, robustness is only meant in the sense that the performance over a large number of runs looked almost the same; for every single experiment, the estimated values $\hat{\beta}$ can still differ significantly, by the amount of uncertainty that is inherent to our method anyway. Due to the lack of theoretical justification, we recommend avoiding normalization if possible and remaining skeptical about results obtained with normalized data.

5 Proofs of asymptotic statements

5.1 Proof of Lemma 3

It is sufficient to show the statement for $r = 1$ because the measure obviously scales quadratically in $r$. Since the support of $\mu_{A_d, \mathbf{c}_d}$ is contained in the compact interval $[-a, a]$, it is sufficient to show convergence of all moments, i.e., that for every $k \in \mathbb{N}$

$\int s^k \, d\mu_{A_d, \mathbf{c}_d}(s) \to \int s^k \, d\mu(s),$

in probability. To this end, we drop most indices $d$ and consider a fixed dimension $d$. To generate a random unit vector $\mathbf{c}$, we first take $d$ independent Gaussian random variables $C_j \sim \mathcal{N}(0, 1/d)$ and define the $j$th coefficient of $\mathbf{c}$ by $C_j / \sqrt{\sum_{i=1}^d C_i^2}$. Let $A = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$ without loss of generality. Then the $k$th moment reads:

(27) $\int s^k \, d\mu_{A, \mathbf{c}}(s) = \frac{\sum_{j=1}^d \lambda_j^k C_j^2}{\sum_{j=1}^d C_j^2} =: \frac{\Lambda_d}{\Gamma_d}.$

One easily checks

$\mathbb{E}[\Lambda_d] = \sum_{j=1}^d \frac{\lambda_j^k}{d} = \int s^k \, d\mu_{A_d}^\tau(s).$

Moreover, since all $C_j$ are independent and because squared standard Gaussians have variance 2, we have

$\mathrm{Var}[\Lambda_d] = \sum_{j=1}^d \lambda_j^{2k}\, \mathrm{Var}[C_j^2] = \frac{2}{d^2} \sum_{j=1}^d \lambda_j^{2k} \leq \frac{2}{d}\, a^{2k},$

where we used $|\lambda_j| \leq a$. By Chebyshev's inequality, the probability of large deviations from the mean can be bounded by

(28) $\Pr\left\{ \left| \Lambda_d - \int s^k \, d\mu_{A_d}^\tau(s) \right| \geq \epsilon \right\} \leq \frac{2 a^{2k}}{d \epsilon^2}.$

Then we get

(29) $\int s^k \, d\mu_{A_d, \mathbf{c}_d}(s) - \int s^k \, d\mu(s) = \frac{\Lambda_d}{\Gamma_d} - \int s^k \, d\mu(s) = \frac{\Lambda_d - \int s^k \, d\mu_{A_d}^\tau(s)}{\Gamma_d} + \left( \frac{\int s^k \, d\mu_{A_d}^\tau(s)}{\Gamma_d} - \int s^k \, d\mu(s) \right).$

We have $\Gamma_d \to 1$ almost surely by the strong law of large numbers. Due to eq. (28), the first quotient in expression (29) converges to zero in probability. Thus, expression (29) converges to zero in probability due to the assumption $\int s^k \, d\mu_{A_d}^\tau(s) \to \int s^k \, d\mu(s)$.

5.2 Proof of Theorem 1

The difference between the left and the right hand sides of eq. (16) converges weakly to zero in probability due to Lemma 3, because $\mathbf{a}_d$ is drawn independently of $\Sigma_{\mathbf{X}\mathbf{X}}^d = \Sigma_{\mathbf{E}\mathbf{E}}^d + \mathbf{b}_d\mathbf{b}_d^T$, since $\mathbf{a}_d$ and $\mathbf{b}_d$ are drawn independently by assumption. Statement (17) follows even more directly from Lemma 3 because $\mathbf{b}_d$ is drawn independently of $\Sigma_{\mathbf{E}\mathbf{E}}^d$.

To show that the difference between the left and the right hand sides of eq. (18) converges weakly to zero in probability, we recall that it is sufficient to show that expectations of bounded continuous functions converge. For any measurable function $g: \mathbb{R} \to \mathbb{R}$, the difference of expectations reads:

$\int g \, d\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \, \mathbf{a} + c\Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}} - \int g \, d\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \mathbf{a}} - \int g \, d\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \, c\Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}} = \langle \mathbf{a} + c\Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}, \, g(\Sigma_{\mathbf{X}\mathbf{X}})(\mathbf{a} + c\Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}) \rangle - \langle \mathbf{a}, g(\Sigma_{\mathbf{X}\mathbf{X}})\mathbf{a} \rangle - \langle c\Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}, \, g(\Sigma_{\mathbf{X}\mathbf{X}})\, c\Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b} \rangle = 2c\, \langle \mathbf{a}, \, g(\Sigma_{\mathbf{X}\mathbf{X}})\Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b} \rangle.$

Hence, we only have to show that

(30) $\langle \mathbf{a}, \, g(\Sigma_{\mathbf{X}\mathbf{X}})\Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b} \rangle \to 0$

in probability. Note that this already follows from the fact that $\mathbf{a}$ is chosen independently of the second vector in the inner product in eq. (30), due to the following elementary result:

Lemma 5

(asymptotic orthogonality) Let $\mathbf{v}_d \in \mathbb{R}^d$ be a sequence of vectors of uniformly bounded length. Let $\mathbf{c}_d \in \mathbb{R}^d$ be drawn uniformly at random from the unit sphere. Then,

$\langle \mathbf{v}_d, \mathbf{c}_d \rangle^2 \to 0,$

almost surely.

Proof

Without loss of generality, assume $\mathbf{v}_d = (v, 0, \dots, 0)^T$ with $v \in \mathbb{R}$. Generate the entries $c_j$ of $\mathbf{c}_d$ by first taking independent standard Gaussians $C_j$ and renormalizing afterwards. Then

$\lim_{d \to \infty} \langle \mathbf{v}_d, \mathbf{c}_d \rangle^2 = v^2 \lim_{d \to \infty} \frac{\frac{1}{d} C_1^2}{\frac{1}{d} \sum_{j=1}^d C_j^2} = 0,$

because $\frac{1}{d}\sum_{j=1}^d C_j^2$ converges to $\mathbb{E}[C_j^2] = 1$ almost surely due to the law of large numbers.
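A tiny simulation (ours, assuming numpy) illustrates the roughly $1/d$ decay of the squared inner product:

```python
import numpy as np

rng = np.random.default_rng(4)
for d in (10, 100, 1000, 10_000):
    c = rng.standard_normal(d)
    c /= np.linalg.norm(c)          # uniform on the unit sphere
    v = np.zeros(d); v[0] = 1.0     # a fixed unit vector
    print(d, (v @ c) ** 2)          # decays roughly like 1/d
```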

5.3 Proof of Theorem 2

We first need some definitions and tools. The following one generalizes Definition 3 to infinite-dimensional Hilbert spaces (see [17] for spectral theory of self-adjoint operators):

Definition 4

(vector-induced spectral measure) Let $\mathcal{H}$ be a Hilbert space and $A: \mathcal{H} \to \mathcal{H}$ a self-adjoint operator with spectral decomposition $A = \int \lambda \, dE_\lambda$, where $(E_\lambda)_{\lambda \in \mathbb{R}}$ denotes the spectral family of $A$ (that is, $E_\lambda$ projects onto the spectral subspace corresponding to all spectral values not larger than $\lambda$). For any $\psi \in \mathcal{H}$, let $\mu_{A,\psi}$ be defined by

$\int f \, d\mu_{A,\psi} = \int f(\lambda) \, d\langle \psi, E_\lambda \psi \rangle,$

for all integrable functions $f: \mathbb{R} \to \mathbb{R}$.

We then define a map on the space of measures that will be a convenient tool for the proof:

Definition 5

(rank one perturbation for general measures) Let $\nu$ be a (not necessarily normalized) finite Borel measure on $\mathbb{R}$. Let $L^2(\nu, \mathbb{R})$ denote the Hilbert space of real square integrable functions on $\mathbb{R}$. Define an operator $B_\nu$ on $L^2(\nu, \mathbb{R})$ by

(31) $B_\nu := \mathrm{id} + \mathbf{1}\langle \mathbf{1}, \cdot \rangle,$

where $\mathrm{id}$ denotes the identical map $s \mapsto s$ and $\mathbf{1}$ the constant function with value $1$. Then we define the rank one perturbation of $\nu$ by

$R(\nu) := \mu_{B_\nu, \mathbf{1}}.$

The name 'rank one perturbation' is justified because $R$ describes how the spectral measure induced by an operator $A$ and a vector $\psi$ changes when $A$ is replaced with its rank-one perturbation $A + \psi\langle\psi, \cdot\rangle$. To see this, let $\nu = \mu_{A,\psi}$. Assume, without loss of generality, that $\psi$ is a cyclic vector for $A$, i.e., that the span of $\{A^k \psi\}_{k \in \mathbb{N}}$ is dense in $\mathcal{H}$ (otherwise we restrict $A$ to the completion of this span). By standard spectral theory of operators [17], there is a unitary map $U: \mathcal{H} \to L^2(\mu_{A,\psi}, \mathbb{R})$ 'diagonalizing' $A$ in the sense that $A = U^{-1}\, \mathrm{id}\, U$ and $U\psi = \mathbf{1}$. Therefore,

$\mu_{\mathrm{id} + \mathbf{1}\langle\mathbf{1},\cdot\rangle, \, \mathbf{1}} = \mu_{A + \psi\langle\psi,\cdot\rangle, \, \psi}.$

We do not have a more explicit description of R, but the relation between the Cauchy transforms of ν and R(ν) is remarkably simple. To describe the relation, we first introduce Cauchy transforms [23]:

Definition 6

(Cauchy transform) Let $\nu$ be a not necessarily normalized measure. Then the Cauchy transform of $\nu$ is defined[6] as the complex-valued function from $\mathbb{C}^+$ (that is, the set of complex numbers with positive imaginary part) to $\mathbb{C}$ given by

$F_\nu(z) := \int (z - t)^{-1} \, d\nu(t).$

Then we find:

Lemma 6

(spectral measure for rank-one perturbation) Let $A$ be a self-adjoint operator on some Hilbert space $\mathcal{H}$ and let $\mathbf{c} \in \mathcal{H}$ be some vector. Define the rank one perturbation $A_{\mathbf{c}} := A + \mathbf{c}\langle \mathbf{c}, \cdot \rangle$. Set $\nu := \mu_{A_{\mathbf{c}}, \mathbf{c}}$. Then

(32) $F_\nu = \frac{F_{\mu_{A,\mathbf{c}}}}{1 - F_{\mu_{A,\mathbf{c}}}}.$

Proof

Equation (32) is a special case of the so-called Aronszajn-Krein formula [24, 25, 26, 27, 28]. It can easily be seen as follows. Set $A_z := A - zI$. Moreover, by slightly abusing notation, define the linear form $\mathbf{c}^T := \langle \mathbf{c}, \cdot \rangle$. Using the Sherman-Morrison formula [29]

$(A_z + \mathbf{c}\mathbf{c}^T)^{-1} = A_z^{-1} - \frac{A_z^{-1}\mathbf{c}\mathbf{c}^T A_z^{-1}}{1 + \langle \mathbf{c}, A_z^{-1}\mathbf{c} \rangle},$

one easily obtains

$\langle \mathbf{c}, (A_z + \mathbf{c}\mathbf{c}^T)^{-1}\mathbf{c} \rangle = \frac{\langle \mathbf{c}, A_z^{-1}\mathbf{c} \rangle}{1 + \langle \mathbf{c}, A_z^{-1}\mathbf{c} \rangle}.$

Then the statement follows using

$F_{\mu_{A,\mathbf{c}}}(z) = -\langle \mathbf{c}, A_z^{-1}\mathbf{c} \rangle \qquad \text{and} \qquad F_{\mu_{A+\mathbf{c}\mathbf{c}^T, \mathbf{c}}}(z) = -\langle \mathbf{c}, (A_z + \mathbf{c}\mathbf{c}^T)^{-1}\mathbf{c} \rangle.$

By applying Lemma 6 to the operator $B_\nu$ in eq. (31), we obtain:

Corollary 1

(Cauchy transform of rank one perturbation) For any finite Borel measure $\nu$ on $\mathbb{R}$, the Cauchy transforms of $\nu$ and $R(\nu)$ are related by

$F_{R(\nu)} = \frac{F_\nu}{1 - F_\nu}.$
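One can verify eq. (32), and hence Corollary 1, numerically for a finite-dimensional example (our own sketch, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 6
A = np.diag(rng.uniform(1.0, 2.0, size=d))
c = rng.standard_normal(d)
z = 3.0 + 1.0j                                         # a point in C^+

res = lambda M: np.linalg.inv(z * np.eye(d) - M)       # resolvent (z - M)^{-1}
F = c @ res(A) @ c                                     # F_{mu_{A,c}}(z)
F_pert = c @ res(A + np.outer(c, c)) @ c               # Cauchy transform of nu

print(F_pert, F / (1 - F))                             # the two values agree
```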

Moreover, we will need the following map:

Definition 7

(multiplication map) If $\nu$ denotes a Borel measure on $\mathbb{R}$, we define $M(\nu)$ by

$\int f(\lambda)\, dM(\nu)(\lambda) = \int f(\lambda)\, \lambda^{-2}\, d\nu(\lambda),$

for every measurable function $f$ on $\mathbb{R}$.

The transformation $M$ describes how the spectral measure induced by any matrix $A$ and any vector $\mathbf{c}$ changes when $\mathbf{c}$ is replaced with $A^{-1}\mathbf{c}$, that is,

$M(\mu_{A,\mathbf{c}}) = \mu_{A, A^{-1}\mathbf{c}}.$

This is also easily verified by diagonalizing $A$ to a multiplication operator on $L^2(\mu_{A,\mathbf{c}}, \mathbb{R})$, as in the remarks after Definition 5.

Using the transformations M and R, we obtain the following concise form for the spectral measure induced by the confounding vector:

$\mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \, c\Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}} = c^2\, \mu_{\Sigma_{\mathbf{X}\mathbf{X}}, \, \Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}} = c^2\, \mu_{(\Sigma_{\mathbf{E}\mathbf{E}} + \mathbf{b}\mathbf{b}^T), \, (\Sigma_{\mathbf{E}\mathbf{E}} + \mathbf{b}\mathbf{b}^T)^{-1}\mathbf{b}} = c^2\, M[R[\mu_{\Sigma_{\mathbf{E}\mathbf{E}}, \mathbf{b}}]].$
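In finite dimensions this identity is easy to verify directly, since $R$ maps $\mu_{\Sigma_{\mathbf{E}\mathbf{E}},\mathbf{b}}$ to $\mu_{\Sigma_{\mathbf{X}\mathbf{X}},\mathbf{b}}$ and $M$ rescales the weight at eigenvalue $\lambda_j$ by $\lambda_j^{-2}$ (our own sketch, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 5
G = rng.standard_normal((d, d))
Sigma_EE = G @ G.T
b = rng.standard_normal(d)
Sigma_XX = Sigma_EE + np.outer(b, b)

lam, Phi = np.linalg.eigh(Sigma_XX)

# left-hand side: weights of the measure induced by Sigma_XX^{-1} b on Sigma_XX
lhs = (Phi.T @ np.linalg.solve(Sigma_XX, b)) ** 2

# right-hand side: R maps mu_{Sigma_EE,b} to mu_{Sigma_XX,b} (same eigenvalues),
# and M multiplies the weight at eigenvalue lambda_j by lambda_j^{-2}
rhs = (Phi.T @ b) ** 2 / lam ** 2

print(np.allclose(lhs, rhs))    # True (here for c = 1)
```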

We will also need the following result:

Lemma 7

(weak continuity of $R$ and $M$) Let $\nu_d$ be a sequence of measures with common support $[l, r]$ with $r > l > 0$. If $\nu_d \to \nu$ weakly, then $R(\nu_d) \to R(\nu)$ and $M(\nu_d) \to M(\nu)$ weakly.

Proof

Since $\nu_d$ converges weakly to $\nu$, $F_{\nu_d}$ converges pointwise to $F_\nu$. Thus, $F_{R(\nu_d)}$ converges pointwise to $F_{R(\nu)}$ for all $z \in \mathbb{C}^+$. Due to Section 3, Theorem 13 in [30], $F_{R(\nu)}$ is the Cauchy transform of the limit of $R(\nu_d)$; see also [31], Section 5.

Weak continuity of $M$ is immediate since $M$ is multiplication by a function that is bounded between $1/r^2$ and $1/l^2$.

Since we observe $\Sigma_{\mathbf{X}\mathbf{X}}$ and not $\Sigma_{\mathbf{E}\mathbf{E}}$, it is important for our purpose that the tracial measures of both matrices asymptotically coincide. The asymptotic version of Lemma 4 reads:

Lemma 8

(tracial measures coincide) If $\mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d}^\tau \to \mu$ weakly, then $\mu_{\Sigma_{\mathbf{E}\mathbf{E}}^d}^\tau \to \mu$ weakly, too.

Proof

We have for every interval $[l, r]$:

$\lim_{d} \left| \mu_{\Sigma_{\mathbf{E}\mathbf{E}}^d}^\tau([l,r]) - \mu([l,r]) \right| \leq \lim_{d} \left| \mu_{\Sigma_{\mathbf{E}\mathbf{E}}^d}^\tau([l,r]) - \mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d}^\tau([l,r]) \right| + \lim_{d} \left| \mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d}^\tau([l,r]) - \mu([l,r]) \right|.$

The first term is zero due to Lemma 4 and the second one by assumption. Since the intervals generate the entire Borel sigma algebra the statement follows.

For each $d$, the confounding strength $\beta_d$ depends on the length of $\Sigma_{\mathbf{X}\mathbf{X}}^{-1}\mathbf{b}$, which, in turn, depends on the specific random choice of $\mathbf{b}_d$. We will see, however, that asymptotically one always obtains

$\beta = \frac{c^2\, M[R[r_b^2 \mu]](\mathbb{R})}{r_a^2 + c^2\, M[R[r_b^2 \mu]](\mathbb{R})},$

since the proof below turns out to hold for this value of $\beta$.

We are now prepared to prove Theorem 2. Due to Theorem 1 we have

$\lim_d \mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d, \, \mathbf{a}_d + c(\Sigma_{\mathbf{X}\mathbf{X}}^d)^{-1}\mathbf{b}_d} = \lim_d \mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d, \mathbf{a}_d} + \lim_d \mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d, \, c(\Sigma_{\mathbf{X}\mathbf{X}}^d)^{-1}\mathbf{b}_d} = r_a^2 \mu + c^2 \lim_d \mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d, \, (\Sigma_{\mathbf{X}\mathbf{X}}^d)^{-1}\mathbf{b}_d} = r_a^2 \mu + c^2 \lim_d M[R[\mu_{\Sigma_{\mathbf{E}\mathbf{E}}^d, \mathbf{b}_d}]],$

where the last step is due to

$\mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d, \, (\Sigma_{\mathbf{X}\mathbf{X}}^d)^{-1}\mathbf{b}_d} = \mu_{\Sigma_{\mathbf{E}\mathbf{E}}^d + \mathbf{b}_d\mathbf{b}_d^T, \, (\Sigma_{\mathbf{E}\mathbf{E}}^d + \mathbf{b}_d\mathbf{b}_d^T)^{-1}\mathbf{b}_d} = M[R[\mu_{\Sigma_{\mathbf{E}\mathbf{E}}^d, \mathbf{b}_d}]],$

by definition of R and M.

Due to Lemma 7 and Theorem 1 we have

$\lim_d M[R[\mu_{\Sigma_{\mathbf{E}\mathbf{E}}^d, \mathbf{b}_d}]] = M[R[r_b^2 \mu]].$

Hence we obtain

(33) $\lim_d \frac{\mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d, \hat{\mathbf{a}}_d}}{\|\hat{\mathbf{a}}_d\|^2} = \frac{r_a^2 \mu + c^2\, M[R[r_b^2 \mu]]}{\lim_d \|\hat{\mathbf{a}}_d\|^2}.$

To evaluate the denominator on the right hand side, we employ eq. (10) and obtain:

$\lim_d \|\hat{\mathbf{a}}_d\|^2 = \lim_d \mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d, \, \mathbf{a}_d + c(\Sigma_{\mathbf{X}\mathbf{X}}^d)^{-1}\mathbf{b}_d}(\mathbb{R}).$

Using the asymptotic decomposition of measures in eq. (15) in Theorem 1, we thus have

$\lim_d \|\hat{\mathbf{a}}_d\|^2 = \lim_d \mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d, \mathbf{a}_d}(\mathbb{R}) + \lim_d \mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d, \, c(\Sigma_{\mathbf{X}\mathbf{X}}^d)^{-1}\mathbf{b}_d}(\mathbb{R})$
(35) $= r_a^2\, \mu(\mathbb{R}) + \lim_d c^2\, M[R[\mu_{\Sigma_{\mathbf{E}\mathbf{E}}^d, \mathbf{b}_d}]](\mathbb{R}) = r_a^2 + c^2\, M[R[r_b^2\mu]](\mathbb{R}),$

where the second equality is due to eq. (13) and the definition of $R$ and $M$. The third equality uses the normalization $\mu(\mathbb{R}) = 1$ (since $\mu$ is a probability measure) and eq. (14). Inserting eq. (35) into eq. (33) yields:

(36) $\lim_d \frac{\mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d, \hat{\mathbf{a}}_d}}{\|\hat{\mathbf{a}}_d\|^2} = \frac{r_a^2}{r_a^2 + c^2 M[R[r_b^2\mu]](\mathbb{R})}\, \mu + \frac{c^2}{r_a^2 + c^2 M[R[r_b^2\mu]](\mathbb{R})}\, M[R[r_b^2\mu]] = (1 - \beta)\, \mu + \beta\, \frac{M[R[r_b^2\mu]]}{M[R[r_b^2\mu]](\mathbb{R})} = (1 - \beta)\, \mu + \beta\, \frac{M[R[\eta\mu]]}{M[R[\eta\mu]](\mathbb{R})}.$

On the other hand, recalling the construction of $\nu_{\beta,\eta}$ in Section 4, for fixed $d$ (whose index we drop for the moment) we have

$\nu_{\beta,\eta} = (1-\beta)\, \mu_{\Sigma_{\mathbf{X}\mathbf{X}}}^\tau + \beta\, \frac{\mu_{T, \, T^{-1}(\sqrt{\eta}\, g)}}{\|T^{-1}(\sqrt{\eta}\, g)\|^2} = (1-\beta)\, \mu_{\Sigma_{\mathbf{X}\mathbf{X}}}^\tau + \beta\, \frac{M[\mu_{T, \sqrt{\eta}\, g}]}{M[\mu_{T, \sqrt{\eta}\, g}](\mathbb{R})} = (1-\beta)\, \mu_{\Sigma_{\mathbf{X}\mathbf{X}}}^\tau + \beta\, \frac{M[R[\mu_{M_X, \sqrt{\eta}\, g}]]}{M[R[\mu_{M_X, \sqrt{\eta}\, g}]](\mathbb{R})} = (1-\beta)\, \mu_{\Sigma_{\mathbf{X}\mathbf{X}}}^\tau + \beta\, \frac{M[R[\eta\, \mu_{\Sigma_{\mathbf{X}\mathbf{X}}}^\tau]]}{M[R[\eta\, \mu_{\Sigma_{\mathbf{X}\mathbf{X}}}^\tau]](\mathbb{R})},$

where we have used $\mu_{M_X, \sqrt{\eta}\, g} = \eta\, \mu_{M_X, g} = \eta\, \mu_{\Sigma_{\mathbf{X}\mathbf{X}}}^\tau$ by construction of $g$. Hence we obtain:

$\lim_d \nu_{\beta,\eta}^d = \lim_d (1-\beta)\, \mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d}^\tau + \lim_d \beta\, \frac{M[R[\eta\, \mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d}^\tau]]}{M[R[\eta\, \mu_{\Sigma_{\mathbf{X}\mathbf{X}}^d}^\tau]](\mathbb{R})} = (1-\beta)\, \mu + \beta\, \frac{M[R[\eta\mu]]}{M[R[\eta\mu]](\mathbb{R})},$

which coincides with eq. (36).

6 Experiments with simulated data

6.1 Estimation of strength of confounding

We first ran experiments where the data has been generated according to our model assumptions: First, both the influence of $\mathbf{X}$ on $Y$ and the influence of $Z$ on $\mathbf{X}$ and $Y$ are linear. Second, the vectors $\mathbf{a}$ and $\mathbf{b}$ are randomly drawn from a uniform distribution over a sphere whose radius is randomly chosen. More specifically, the data generating process reads as follows:

  1. Generate $\mathbf{E}$: first generate $n$ samples of a $d$-dimensional vector-valued Gaussian random variable $\tilde{\mathbf{E}}$ with mean zero and covariance matrix $I$. Then generate a $d \times d$ random matrix $G$ whose entries are independent standard Gaussians and set $\mathbf{E} := G\tilde{\mathbf{E}}$.

  2. Generate scalar random variables Z and F by drawing n samples of each independently from a standard Gaussian distribution.

  3. Draw scalar model parameters $c, r_a, r_b$ by independent draws from the uniform distribution on the unit interval.

  4. Draw vectors $\mathbf{a}, \mathbf{b}$ independently from spheres of radius $r_a$ and $r_b$, respectively.

  5. Compute $\mathbf{X}$ and $Y$ via the structural eqs. (1) and (2).

Note that for the above generating process, the computation of the true confounding strength $\beta$ involves only model parameters that are exactly known; even $\Sigma_{\mathbf{X}\mathbf{X}}$ need not be estimated from the data matrix because it is simply given by $GG^T + \mathbf{b}\mathbf{b}^T$.
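The following sketch (ours; the function name is our choice, assuming numpy) implements steps 1–5 together with the exact confounding strength from eq. (6):

```python
import numpy as np

def generate_dataset(d, n, rng):
    """Sample one data set according to steps 1-5 above and return the
    exact confounding strength computed from the model parameters."""
    G = rng.standard_normal((d, d))
    E = rng.standard_normal((n, d)) @ G.T          # E = G E~ with E~ ~ N(0, I)
    Z = rng.standard_normal(n)                     # confounder
    F = rng.standard_normal(n)                     # noise of Y
    c, r_a, r_b = rng.uniform(size=3)              # scalar parameters
    a = rng.standard_normal(d); a *= r_a / np.linalg.norm(a)
    b = rng.standard_normal(d); b *= r_b / np.linalg.norm(b)
    X = E + np.outer(Z, b)                         # eq. (1)
    Y = X @ a + c * Z + F                          # eq. (2)
    Sigma_XX = G @ G.T + np.outer(b, b)            # exact covariance of X
    shift = c * np.linalg.solve(Sigma_XX, b)
    beta = (shift @ shift) / (a @ a + shift @ shift)   # eq. (6)
    return X, Y, beta
```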

Figure 4: Simulation results: true value $\beta$ versus estimated value $\hat{\beta}$ for different dimensions $d$ and sample sizes $n$. The performance increases with higher dimensions provided that the sample size is large enough.

Figure 4 shows the results for dimensions $d = 5, 10, 20$ and sample sizes $n = 100,\ 1000,\ 10{,}000,\ 100{,}000$. They indicate that the sample size is even more critical for the performance than the dimension. It seems that the required sample sizes grow so quickly with the dimension that data sets with sample sizes below 1000 should only be considered if the dimension is not larger than about 10.

This observation for moderate dimensions is somewhat in contrast to the asymptotic statement that the sample size required for non-regularized estimation of covariance matrices (given some fixed error bound in terms of the operator norm) grows only as $O(d \log d)$ [32, 33] for any distribution with finite second moment, a result that may be relevant in the 'big data regime' with huge $n$ and $d$.

Although the results for dimension 5 look quite bad, it should be noted that $\beta$ and $\hat{\beta}$ are already significantly correlated: the correlation coefficients varied between 0.65 and 0.8 for the different sample sizes, with p-values below $10^{-10}$.

We found the true and the estimated values of $\eta$ to be rather uncorrelated; it seems that $\eta$ is hard to estimate using our method. Since our focus is on the confounding strength, we will not explore this any further.

7 Experiments with real data under controlled conditions

It is hard to find real data where the strength of confounding is known. This is because there are usually unknown confounders in addition to the ones that are obvious for observers with some domain knowledge. For this reason, we have designed an experiment where the variables are observables of technical devices among which the causal structure is known by construction of the experimental setup.

7.1 Setup for a confounded causal influence

To obtain a causal relation where $\mathbf{X}$ influences $Y$ and there is, at the same time, a confounder $Z$ influencing both $\mathbf{X}$ and $Y$, we have chosen the setup shown in Figure 5.[7]

Figure 5: Setup for the generic causal relation where $\mathbf{X}$ influences $Y$ and $Z$ influences both $\mathbf{X}$ and $Y$, see text. Note that we used the symbol for light bulbs to represent the LEDs in order to simplify the drawing.

The cause $\mathbf{X}$ is a 9-dimensional pixel vector generated by extremely reducing the resolution of an image taken by a webcam to $3 \times 3$ pixels. The effect $Y$ is the intensity measured at a light sensor in front of a laptop screen that displays the $3 \times 3$ image, enlarged to a size of about $10 \times 10$ centimeters. The sensor is located at a distance of about 10 centimeters from the screen. To confound the causal relation by a common cause $Z$, we have generated an independent random voltage that controls the brightness of two LEDs: one influencing $\mathbf{X}$ because it is placed in front of the webcam, and one placed in front of the light sensor. To ensure that $\mathbf{X}$ is not entirely determined by the LED, we have placed the webcam in front of a TV. This way, the image taken by the webcam is influenced by both the LED and the TV signal; the latter plays the role of $\mathbf{E}$ in our structural eq. (1). To avoid fluctuations of daylight acting as an additional confounder, we have covered the sensor and the laptop screen with a towel. Since we have measured $Z$ (the random value of the voltage which determines the brightness of the LEDs), we are able to compute the strength $\beta$ of confounding up to the extent to which the estimates of $\Sigma_{\mathbf{X}\mathbf{X}}, \mathbf{b}, \mathbf{a}$ from empirical data coincide with their true counterparts. We will denote this value by $\beta'$ to emphasize that it may still deviate from the true value $\beta$ when the sample size is not sufficient. Figure 6 shows $\beta'$ and $\hat{\beta}$ for experiments with sample size 1000.

Figure 6: Confounding strength $\beta'$ estimated using observations of $Z$ vs. confounding strength $\hat{\beta}$ estimated by our algorithm, for sample size 1000.

For this setup, the algorithm tends to underestimate confounding, but shows qualitatively the right tendency since $\beta'$ and $\hat{\beta}$ clearly correlate.

After inspecting some spectral measures for the above scenario, we believe that the algorithm underestimates confounding for the following reason: the vector $\mathbf{a}$ describing the influence of the images on the sensor is not in generic orientation relative to $\Sigma_{\mathbf{X}\mathbf{X}}$. This is because the pixels are usually positively correlated and each pixel has positive influence on the total intensity measured by the sensor. This way, $\mathbf{a}$ has stronger weights for high eigenvalues. Since confounding often increases the weights of low eigenvalues (note, however, that this also depends on our second confounding parameter $\eta$), the 'non-genericity' of the vector $\mathbf{a}$ tends to compensate the effect of confounding on the spectral measure induced by $\hat{\mathbf{a}}$. It is likely that such an underestimation of confounding occurs in many other scenarios as well, because it is not uncommon that all variables in a vector are positively correlated[8] and that they all have a positive influence on some target variable.

The above setting contained the purely confounded and the purely causal scenario as limiting cases: the confounded one by putting the LEDs close to the sensor and the webcam, respectively, and setting the voltage to the maximal values, and the purely causal one by setting the voltage to zero.

To get further support for the hypothesis that the algorithm tends to behave qualitatively in the right way even when the estimated strength of confounding deviates from its true value, we also tested modifications of the above scenario that are purely confounded or purely causal by construction and not only by variations of parameters. This is described in Sections 7.2 and 7.3.

7.2 Purely causal scenarios

To build a scenario without confounding, $\mathbf{X}$ is the pixel vector of grey values of an image section (consisting of $3 \times 3$ pixels) randomly drawn from a fixed image. The image sections are displayed on the screen of a laptop after rescaling them to a size of about $10 \times 10$ cm. A light sensor, placed in front of the screen at a distance of about 10 cm, measures the light intensity, which is our variable $Y$. Clearly, $\mathbf{X}$ influences $Y$ in an unconfounded way because the selection of the images is perfectly randomized. Fluctuations of the brightness caused by the environment certainly influence $Y$, but count as independent noise since they do not influence $\mathbf{X}$.

We tried this experiment with sample size 1000, where the estimated confounding strength was $\hat{\beta} = 0$. We should also mention that we obtained $\beta' = 6.8 \cdot 10^{-7}$, in agreement with our statement that the experimental setup is unconfounded because the image sections are drawn randomly. The extremely low value of $\beta'$ also shows that it is indeed very close to the true value $\beta$, which justifies identifying them.

7.3 Purely confounded scenario

The setup in Figure 5 can easily be modified to a causal structure where the relation between the pixel vector $\mathbf{X}$ and the light intensity $Y$ is purely confounded by a one-dimensional variable $Z$: we just need to put the light sensor at a place where it sees neither the TV nor the screen of the laptop. If we, again, ensure that the light sensor is not influenced by the same fluctuations of daylight as the webcam (e.g. by covering the sensor with a towel), the statistical dependence between $\mathbf{X}$ and $Y$ is due to $Z$, that is, to the fluctuations of the random light signal from the LEDs alone.

We have performed this experiment with sample size 1000 and obtained $\hat{\beta} = 0.68$ and $\beta' = 0.998$, which again is consistent with our previous observation that confounding is underestimated.

8 Experiments with real data with partially known causal structure

The experiments in this section refer to real data where the causal structure is not known with certainty. For each data set, however, we will briefly discuss the plausibility of the results in light of our limited domain knowledge. The main purpose of this section is to show that the estimated values of confounding strength indeed spread over the whole interval $[0, 1]$. A priori, we could not be sure whether empirical data follow probability distributions that are so different from our model assumptions that only small or only large values of confounding would be estimated.

8.1 Taste of wine

This dataset [36] describes the dependence between Y, the taste score of red wine between 0 and 10 (given by human subjects), and 11 different ingredients: X1: fixed acidity, X2: volatile acidity, X3: citric acid, X4: residual sugar, X5: chlorides, X6: free sulfur dioxide, X7: total sulfur dioxide, X8: density, X9: pH, X10: sulphates, X11: alcohol. It turned out that the largest eigenvalue of the covariance matrix is larger than the others by orders of magnitude. We therefore normalized the Xj to unit variance and obtained the covariance matrix

$$\Sigma_{XX} = \begin{pmatrix}
1 & -0.26 & 0.67 & 0.11 & 0.09 & -0.15 & -0.11 & 0.67 & -0.68 & 0.18 & -0.06 \\
-0.26 & 1 & -0.55 & 0.00 & 0.06 & -0.01 & 0.08 & 0.02 & 0.23 & -0.26 & -0.20 \\
0.67 & -0.55 & 1 & 0.14 & 0.20 & -0.06 & 0.04 & 0.36 & -0.54 & 0.31 & 0.11 \\
0.11 & 0.00 & 0.14 & 1 & 0.06 & 0.19 & 0.20 & 0.36 & -0.09 & 0.01 & 0.04 \\
0.09 & 0.06 & 0.20 & 0.06 & 1 & 0.01 & 0.05 & 0.20 & -0.27 & 0.37 & -0.22 \\
-0.15 & -0.01 & -0.06 & 0.19 & 0.01 & 1 & 0.67 & -0.02 & 0.07 & 0.05 & -0.07 \\
-0.11 & 0.08 & 0.04 & 0.20 & 0.05 & 0.67 & 1 & 0.07 & -0.07 & 0.04 & -0.21 \\
0.67 & 0.02 & 0.36 & 0.36 & 0.20 & -0.02 & 0.07 & 1 & -0.34 & 0.15 & -0.50 \\
-0.68 & 0.23 & -0.54 & -0.09 & -0.27 & 0.07 & -0.07 & -0.34 & 1 & -0.20 & 0.21 \\
0.18 & -0.26 & 0.31 & 0.01 & 0.37 & 0.05 & 0.04 & 0.15 & -0.20 & 1 & 0.09 \\
-0.06 & -0.20 & 0.11 & 0.04 & -0.22 & -0.07 & -0.21 & -0.50 & 0.21 & 0.09 & 1
\end{pmatrix}$$

We observe several correlation coefficients around ±0.5 and ±0.6, hence ΣXX significantly differs from the identity, which is important because ΣXX=I would render the method pointless. The vector of regression coefficients reads:

â = (0.044, −0.194, −0.036, 0.023, −0.088, 0.046, −0.107, −0.034, −0.064, 0.155, 0.294),

showing that alcohol has by far the strongest association with taste. According to common experience, alcohol indeed has a significant influence on the taste. The other associations are also likely to be mostly causal and not due to a confounder. We estimated confounding for this data set and obtained β̂ = 0.
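For readers who want to reproduce such numbers, the following sketch shows the preprocessing and the spectral measure of the regression vector. The variables `X` (an n×11 matrix of ingredient values) and `y` (the taste scores) are our notation; the confounding estimator itself is the one defined in the earlier sections and is not repeated here.

    # Sketch: normalize the predictors, regress y on X, and compute the
    # spectral measure of the regression vector w.r.t. Sigma_XX.
    import numpy as np

    def spectral_measure(X, y):
        X = (X - X.mean(axis=0)) / X.std(axis=0)   # unit-variance predictors
        y = y - y.mean()
        Sigma = np.cov(X, rowvar=False)            # estimate of Sigma_XX
        cov_Xy = X.T @ y / len(y)                  # estimate of Cov(X, Y)
        a_hat = np.linalg.solve(Sigma, cov_Xy)     # OLS regression vector
        eigvals, eigvecs = np.linalg.eigh(Sigma)
        weights = (eigvecs.T @ a_hat) ** 2         # mass attached to each eigenvalue
        return a_hat, eigvals, weights / weights.sum()

Plotting `weights` against `eigvals` yields plots like those in Figure 7.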

Figure 7: Spectral measure μ_{ΣXX, â} for the taste of wine for the case where all ingredients are included (left) versus the case where X11 (alcohol) has been dropped (right). In the latter case, the weights decrease for the larger eigenvalues.

Since the above experiments suggest that the ingredients influence the target variable taste in an essentially unconfounded way, we now explore what happens when we exclude one of the variables from X := (X1,…,X11). Since the excluded variable Xj will typically be correlated with the remaining ones and, at the same time, influences Y, dropping it will typically confound the causal relation between X∖Xj and Y. For each j ∈ {1,…,11} we therefore estimated the structural confounding strength (a sketch of this loop is given below) and obtained the following results: for all j < 11 the algorithm returned confounding strength 0.0 (the lowest possible value), while it returned 0.55 for j = 11. Since X11 has the strongest influence on Y, this result is remarkable: it is plausible that dropping X11 corresponds to strong confounding.
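A minimal version of this leave-one-out loop, assuming the same `X` and `y` as in the sketch above and writing `estimate_confounding` as a stand-in for the spectral estimator of β̂ defined in the earlier sections (the function name is ours):

    # Leave-one-out experiment: dropping ingredient j makes it a potential
    # confounder of the remaining ten ingredients and the taste score.
    import numpy as np

    for j in range(11):
        X_reduced = np.delete(X, j, axis=1)            # drop column j from X
        beta_hat = estimate_confounding(X_reduced, y)  # stand-in for our estimator
        print(f"without X_{j + 1}: estimated beta = {beta_hat:.2f}")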

Figure 7, left, visualizes the weights of the spectral measure for the case where all 11 variables are included, and compares them to the weights obtained when alcohol is dropped (right). In the latter case, one clearly sees that the weights decrease towards the large eigenvalues, which may indicate confounding.

8.2 Chicago crime data

This dataset [37] reports[9] the number of crimes for each of 77 community areas in Chicago, USA, together with some potential factors influencing the crime rate. Here, Y denotes the number of assaults (homicides) and X consists of the following 6 features: X1: below poverty level, X2: crowded housing, X3: dependency, X4: no high-school diploma, X5: per capita income, X6: unemployment. After normalization we obtain the following estimated vector of structure coefficients:

â = (3.3, 3.5, 2.8, −7.7, 2.6, 9.7).

It seems reasonable that the unemployment rate has the strongest influence. It is, however, surprising that 'no high-school diploma' should have a negative influence on the number of crimes. This is probably due to a confounding effect. Nevertheless, the estimated confounding strength reads β̂ = 0.0.

8.3 Compressive strength and ingredients of concrete

This experiment considers the data set 'Concrete Compressive Strength' [38, 39] from the UCI machine learning repository. Y is the compressive strength in megapascals, and X1 to X7 are the following components, measured in kg/m³: X1: cement, X2: blast furnace slag, X3: fly ash, X4: water, X5: superplasticizer, X6: coarse aggregate, X7: fine aggregate. X8 is the age in days. After normalization, the estimated vector of structure coefficients reads

â = (12.5, 9.0, 5.6, 3.2, 1.7, 1.4, 1.6, 7.2).

Cement seems to have the strongest influence, followed by blast furnace slag and age. The estimated confounding strength reads β̂ = 0.36, but it is hard to speculate about possible confounders here.

9 Discussion

We have described a method that estimates the strength of a potential one-dimensional common cause Z that confounds the causal relation between a d-dimensional cause X (with 'large' d) and a one-dimensional effect Y. The presence of Z can, to some extent, be detected from the joint statistics of X and Y when the vector a of regression coefficients (obtained by regressing Y on X) is decomposed with respect to the eigenvectors of the covariance matrix ΣXX of X. This is because, generically, without confounding, the weights of this decomposition are roughly uniformly spread over the eigenvalues of ΣXX, while the presence of Z will typically modify the weights in a way that is characteristic of the corresponding confounding scenario.

The method is based on the assumption that the vector a has, in a certain sense, a 'generic orientation' with respect to ΣXX. The justification of our method relies on a highly idealized model where the vectors of model parameters are randomly generated from a rotation invariant prior. This yields several concentration of measure phenomena for high dimensions d, which the method employs. There is some hope that empirical data show similar concentration of measure phenomena even though our model assumptions are probably significantly violated.
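A simple simulation illustrates the kind of concentration we rely on (our illustration; any covariance matrix with a non-degenerate spectrum will do): for a unit vector drawn from a rotation invariant distribution, no single eigendirection of ΣXX receives more than a vanishing share of the spectral weight.

    # Sketch: a rotation invariant unit vector spreads its spectral weight
    # almost uniformly over the d eigendirections of Sigma.
    import numpy as np

    rng = np.random.default_rng(1)
    d = 500
    A = rng.standard_normal((d, d))
    Sigma = A @ A.T / d                    # some covariance matrix
    a = rng.standard_normal(d)
    a /= np.linalg.norm(a)                 # uniform on the unit sphere

    eigvals, eigvecs = np.linalg.eigh(Sigma)
    weights = (eigvecs.T @ a) ** 2
    print("largest single weight:", weights.max())  # of order log(d)/d
    print("uniform reference 1/d:", 1 / d)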

Given the difficulty of inferring causal relations from observational data, one should not expect any method to be able to detect the presence of confounders with certainty. Measured against this modest goal, the results can be considered encouraging; after all, the joint distribution of X and Y does seem to contain some hints on whether or not their causal relation is confounded.

Although the theoretical justification of the method (which is asymptotic in the dimension d) suggests that it should only be applied in high dimensions, it should be emphasized that we have so far computed the regression without regularization, which quickly requires prohibitively large sample sizes. Future work may apply the method after regularized regression, but then one has to make sure that the regularizer does not spoil the method by violating our symmetry assumptions; the sketch below illustrates the concern for ridge regression.
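To make the concern concrete: ridge regression acts diagonally in the eigenbasis of ΣXX and multiplies the component of the regression vector along the eigenvector with eigenvalue λᵢ by λᵢ/(λᵢ + r), where r is the regularization constant. A minimal sketch (our notation, not part of the method) of the resulting distortion of the spectral weights:

    # Sketch: ridge regression systematically shrinks the spectral weights at
    # small eigenvalues, which violates the generic-orientation assumption.
    import numpy as np

    def ridge_spectral_weights(eigvals, a_coeffs, r):
        """a_coeffs: coefficients of the true a in the eigenbasis of Sigma_XX."""
        shrunk = a_coeffs * eigvals / (eigvals + r)  # ridge solution in the eigenbasis
        weights = shrunk ** 2
        return weights / weights.sum()

Weights at eigenvalues λᵢ ≪ r are suppressed by a factor of about (λᵢ/r)², so the estimated spectral measure is tilted towards large eigenvalues even in the unconfounded case.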

To generalize the ideas of this paper to non-linear causal relations, one may consider kernel methods [40]. The covariance operator in feature space then also captures non-linear relations [41] between the components Xj, and the regression vector also captures non-linear relations between X and Y. It is unclear, however, how the idea of a 'generic relative orientation' translates to this setting, given that a rotation invariant prior does not exist on an infinite-dimensional space.

Regardless of whether there is a sufficient number of real-world problems to which the method developed here is applicable, it may help in understanding three things: first, it shows a possible formalization of the informal idea of 'independence' between Pcause and Peffect|cause for the common scenario of multivariate linear models. Second, it shows what kind of sophisticated dependences are generated by confounding in such a scenario. Third, it shows that employing these dependences for confounder detection may be challenging even when the idealized model assumptions hold. In this way, the scenario discussed in the present paper may help in developing novel practical methods for confounder detection.

10 Appendix: Relation to free independence

Free probability theory [20, 30] defines a notion of independence that is asymptotically satisfied by independently chosen high-dimensional random matrices. To sketch the idea, we start with a model for generating independent random matrices considered in [42]:

Let $(F_d)_{d\in\mathbb{N}}$ and $(G_d)_{d\in\mathbb{N}}$ be sequences of d×d matrices whose tracial spectral measures converge weakly, and let $(U_d)_{d\in\mathbb{N}}$ be random orthogonal matrices drawn from O(d) according to the Haar measure. We start with the simplest of the independence conditions: $F_d$ and $\tilde{G}_d := U_d G_d U_d^T$ asymptotically satisfy the Trace Condition [18, 19], i.e.,

$$\big|\tau(F_d \tilde{G}_d) - \tau(F_d)\,\tau(\tilde{G}_d)\big| \to 0$$

in probability, where τ again denotes the renormalized trace. It is then convenient to introduce limit objects F, G in a C*-algebra [16] together with a functional φ expressing the limit of the renormalized traces, for which the Trace Condition then holds exactly:

$$\phi(FG) = \phi(F)\,\phi(G). \tag{37}$$

However, eq. (37) is only the simplest of an infinite family of independence statements. First, one obtains statements on higher moments like

$$\phi(F^k G^l) = \phi(F^k)\,\phi(G^l),$$

which is analogous to $E[X^k Y^l] = E[X^k]\,E[Y^l]$ for independent random variables X and Y. But the model above also yields independence statements like $\phi(FGFG) = 0$ whenever $\phi(F) = \phi(G) = 0$, which have no counterpart for classical random variables. F and G are therefore also considered 'non-abelian random variables', and free independence is a stronger version of ordinary statistical independence that can only hold because the variables do not commute.
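The Trace Condition is easy to verify numerically. In the following sketch (our illustration), a Haar-distributed orthogonal matrix is obtained from the QR decomposition of a Gaussian matrix:

    # Sketch: tau(F U G U^T) concentrates around tau(F) tau(G) for large d.
    import numpy as np

    rng = np.random.default_rng(3)
    d = 400
    F = np.diag(rng.uniform(0, 1, d))
    G = np.diag(rng.uniform(0, 1, d))
    Z = rng.standard_normal((d, d))
    Q, R = np.linalg.qr(Z)
    U = Q * np.sign(np.diag(R))          # Haar-distributed orthogonal matrix

    tau = lambda M: np.trace(M) / d      # renormalized trace
    print(tau(F @ U @ G @ U.T))          # approximately equal to ...
    print(tau(F) * tau(G))               # ... the product of the traces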

The following difficulty arises when we try to apply the above ideas to the generating model used in Theorem 1. Our sequence $(\Sigma^d_{XX})_{d\in\mathbb{N}}$ may take the role of $(F_d)_{d\in\mathbb{N}}$. To draw $a^d$ uniformly from the unit sphere, we may fix an arbitrary sequence $(\tilde{a}^d)$ of unit vectors and set $a^d := U_d \tilde{a}^d$, with $U_d$ a random rotation as above. One could then naively argue that $a^d (a^d)^T$ takes the role of $\tilde{G}_d$ and also expect

$$\tau\big((\Sigma^d_{XX})^k\, a^d (a^d)^T\big) \approx \tau\big((\Sigma^d_{XX})^k\big)\,\tau\big(a^d (a^d)^T\big),$$

which is equivalent to

$$\big\langle a^d, (\Sigma^d_{XX})^k a^d \big\rangle \approx \tau\big((\Sigma^d_{XX})^k\big)\,\|a^d\|^2.$$

Hence,

$$\int s^k \, d\mu_{\Sigma^d_{XX},\,a^d}(s) \approx \int s^k \, d\mu^{\tau}_{\Sigma^d_{XX}}(s)$$

for all $k\in\mathbb{N}$ and large d. Thus, all moments of $\mu_{\Sigma_{XX},a}$ almost coincide with the corresponding moments of $\mu^{\tau}_{\Sigma_{XX}}$ multiplied by $\|a\|^2$. This argument, however, blurs the fact that $\tau(a^d (a^d)^T)$ converges to zero while $\mathrm{tr}(a^d (a^d)^T)$ is constant; that is, we need to consider the asymptotics of $d\,\tau(F_d \tilde{G}_d)$ instead of $\tau(F_d \tilde{G}_d)$, which, to the best of our knowledge, is not covered by free probability theory. Theorem 1 is thus close to the above statements, although we do not see any straightforward way to derive it from existing work.
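The distinction between the two scalings is easy to see numerically (a sketch under the idealized model; by rotation invariance we may work in the eigenbasis of ΣXX, so it is taken diagonal here):

    # Sketch: <a, Sigma^k a> is close to tau(Sigma^k) for a uniform unit
    # vector a, while tau(a a^T) = 1/d itself vanishes as d grows.
    import numpy as np

    rng = np.random.default_rng(2)
    d = 2000
    lam = rng.uniform(0.0, 2.0, size=d)    # eigenvalues of Sigma_XX
    a = rng.standard_normal(d)
    a /= np.linalg.norm(a)                 # uniform unit vector

    for k in (1, 2, 3):
        # <a, Sigma^k a>  versus  tau(Sigma^k): nearly equal for large d ...
        print(k, (a**2) @ lam**k, np.mean(lam**k))
    # ... although the renormalized trace of a a^T itself tends to zero:
    print("tau(a a^T) =", 1.0 / d)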

Acknowledgements:

We would like to thank Uli Wannek for helping us with the implementation of the video experiment. Many thanks also to Roland Speicher and his group in Saarbrücken for helpful discussions about free probability theory and to Steffen Lauritzen for pointing out that real data sometimes show the MTP2 property, which may be an issue here.

References

[1] Reichenbach H. The direction of time. Berkeley: University of California Press, 1956.

[2] Pearl J. Causality: models, reasoning, and inference. Cambridge: Cambridge University Press, 2000.

[3] Spirtes P, Glymour C, Scheines R. Causation, prediction, and search (Lecture Notes in Statistics). New York, NY: Springer-Verlag, 1993.

[4] Bowden R, Turkington D. Instrumental variables. Cambridge: Cambridge University Press, 1984.

[5] Hoyer P, Shimizu S, Kerminen A, Palviainen M. Estimation of causal effects using linear non-Gaussian causal models with hidden variables. Int J Approx Reason. 2008;49:362–378.

[6] Janzing D, Peters J, Mooij J, Schölkopf B. Identifying latent confounders using additive noise models. In: Ng A, Bilmes J, editors. Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009). Corvallis, OR, USA: AUAI Press, 2009:249–257.

[7] Janzing D, Sgouritsa E, Stegle O, Peters J, Schölkopf B. Detecting low-complexity unobserved causes. In: Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011). Available at: http://uai.sis.pitt.edu/papers/11/p383-janzing.pdf.

[8] Janzing D, Balduzzi D, Grosse-Wentrup M, Schölkopf B. Quantifying causal influences. Ann Stat. 2013;41:2324–2358.

[9] Janzing D, Schölkopf B. Causal inference using the algorithmic Markov condition. IEEE Trans Inf Theory. 2010;56:5168–5194.

[10] Lemeire J, Janzing D. Replacing causal faithfulness with algorithmic independence of conditionals. Minds Mach. 2012;23:227–249.

[11] Li M, Vitányi P. An introduction to Kolmogorov complexity and its applications. New York: Springer, 1997 (3rd edition: 2008).

[12] Janzing D, Steudel B. Justifying additive-noise-based causal discovery via algorithmic information theory. Open Syst Inf Dyn. 2010;17:189–212.

[13] Meek C. Strong completeness and faithfulness in Bayesian networks. In: Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI 1995). Montreal, Canada: Morgan Kaufmann, 1995:411–418.

[14] Uhler C, Raskutti G, Bühlmann P, Yu B. Geometry of the faithfulness assumption in causal inference. Ann Stat. 2013;41:436–463.

[15] Kato T. Perturbation theory for linear operators. Berlin: Springer, 1996.

[16] Murphy G. C*-algebras and operator theory. Boston: Academic Press, 1990.

[17] Reed M, Simon B. Functional analysis. San Diego, CA: Academic Press, 1980.

[18] Janzing D, Hoyer P, Schölkopf B. Telling cause from effect based on high-dimensional observations. In: Proceedings of the 27th International Conference on Machine Learning (ICML 2010). Haifa, Israel, 2010:479–486.

[19] Zscheischler J, Janzing D, Zhang K. Testing whether linear equations are causal: a free probability theory approach. In: Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), 2011. Available at: http://uai.sis.pitt.edu/papers/11/p839-zscheischler.pdf.

[20] Voiculescu D, editor. Free probability theory. Fields Institute Communications, vol. 12. American Mathematical Society, 1997.

[21] Chandrasekaran V, Parrilo P, Willsky A. Latent variable graphical model selection via convex optimization. Ann Stat. 2012;40:1935–1967.

[22] Datta BN. Numerical linear algebra and applications. Philadelphia, PA: Society for Industrial and Applied Mathematics, 2010.

[23] Cima J, Matheson A, Ross W. The Cauchy transform. Mathematical Surveys and Monographs 125. American Mathematical Society, 2006.

[24] Simon B. Spectral analysis of rank one perturbations and applications. Lecture given at the Vancouver Summer School in Mathematical Physics (1993). Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.9138, 1994.

[25] Simon B. Trace ideals and their applications. Providence, RI: American Mathematical Society, 2005.

[26] Kiselev A, Simon B. Rank one perturbations with infinitesimal coupling. J Funct Anal. 1995;130:345–356.

[27] Albeverio S, Konstantinov A, Koshmanenko V. The Aronszajn–Donoghue theory for rank one perturbations of the H−2-class. Integral Equ Oper Theory. 2004;50:1–8.

[28] Albeverio S, Kurasov P. Rank one perturbations, approximations, and selfadjoint extensions. J Funct Anal. 1997;148:152–169.

[29] Bartlett MS. An inverse matrix adjustment arising in discriminant analysis. Ann Math Stat. 1951;22:107–111.

[30] Mingo J, Speicher R. Free probability and random matrices. New York: Springer, 2017.

[31] Bercovici H, Voiculescu D. Free convolution of measures with unbounded supports. Indiana Univ Math J. 1993;42:733–773.

[32] Rudelson M. Random vectors in the isotropic position. J Funct Anal. 1999;164:60–72.

[33] Vershynin R. How close is the sample covariance matrix to the actual covariance matrix? J Theor Probab. 2012;25:655–686.

[34] Karlin S, Rinott Y. Classes of orderings of measures and related correlation inequalities. I. Multivariate totally positive distributions. J Multivar Anal. 1980;10:467–498.

[35] Fallat S, Lauritzen S, Sadeghi K, Uhler C, Wermuth N, Zwiernik P. Total positivity in Markov structures. To appear in Annals of Statistics, 2016.

[36] Lichman M. UCI machine learning repository. Available at: http://archive.ics.uci.edu/ml, 2013.

[37] City of Chicago. Data portal: Chicago poverty and crime. Available at: https://data.cityofchicago.org/Health-Human-Services/Chicago-poverty-and-crime/fwns-pcmk.

[38] Yeh I-C. Concrete compressive strength data set. Available at: https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength.

[39] Yeh I-C. Modeling of strength of high performance concrete using artificial neural networks. Cement Concrete Res. 1998.

[40] Schölkopf B, Smola A. Learning with kernels. Cambridge, MA: MIT Press, 2002.

[41] Gretton A, Herbrich R, Smola A, Bousquet O, Schölkopf B. Kernel methods for measuring independence. J Mach Learn Res. 2005;6:2075–2129.

[42] Speicher R. Free probability theory and non-crossing partitions. Sém Lothar Combin. 1997;39.

Received: 2017-4-5
Revised: 2017-6-14
Accepted: 2017-9-21
Published Online: 2017-10-28
Published in Print: 2018-3-26

© 2018 Walter de Gruyter GmbH, Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
