We study a model where one target variable $Y$ is correlated with a vector $X = (X_1, \ldots, X_d)$ of predictor variables being potential causes of $Y$. We describe a method that infers to what extent the statistical dependences between $X$ and $Y$ are due to the influence of $X$ on $Y$ and to what extent due to a hidden common cause (confounder) $Z$ of $X$ and $Y$. The method relies on concentration of measure results for large dimensions and an independence assumption stating that, in the absence of confounding, the vector of regression coefficients describing the influence of each $X_j$ on $Y$ typically has 'generic orientation' relative to the eigenspaces of the covariance matrix of $X$. For the special case of a scalar confounder we show that confounding typically spoils this generic orientation in a characteristic way that can be used to quantitatively estimate the amount of confounding (subject to our idealized model assumptions).
1 Introduction and general model
Estimating the causal influence of some variables $X_1, \ldots, X_d$ on a target variable $Y$ is among the most important goals in statistical data analysis. However, drawing causal conclusions from observational data alone without intervening on the system is difficult. This is because the observed statistical dependences between $Y$ and each $X_j$ need not be due to an influence of $X_j$ on $Y$. Instead, due to Reichenbach's Principle of Common Cause, $Y$ may also be the cause of $X_j$ or there may be a common cause influencing both. In many applications, time order or other prior information excludes that $Y$ influences $X_j$. For instance, if $Y$ describes the health condition of a patient at some time and the $X_j$ some treatments at an earlier time, we 'only' need to decide to what extent the dependences between the $X_j$ and $Y$ are due to the $X_j$ influencing $Y$ and to what extent they are due to common causes ('confounders'). Here we are not interested in the reason for dependences between the variables $X_j$ themselves; we therefore merge them to a vector-valued variable $X := (X_1, \ldots, X_d)$. Moreover, we restrict the attention to the case where there is only one real-valued confounder $Z$. In the case of linear relations, the structural equations then read:

$$X := b Z + E, \qquad (1)$$

$$Y := a^T X + c Z + F, \qquad (2)$$

where $E$ is a random vector with values in $\mathbb{R}^d$ and $Z, F$ are scalar random variables. Here, $Z, E, F$ are jointly independent, while the components of $E$ may be dependent. Further, $a \in \mathbb{R}^d$ is the vector of structure coefficients determining the influence of the $d$-dimensional variable $X$ on the scalar target variable $Y$. Likewise, $b \in \mathbb{R}^d$ is the vector determining the influence of $Z$ on $X$, and the scalar $c$ is the structure coefficient determining the influence of $Z$ on $Y$. By rescaling $b$ and $c$, we may assume $Z$ to have unit variance without loss of generality. The corresponding DAG is shown in Figure 1.
If all variables are centered Gaussian, the remaining model parameters are the vectors $a, b$, the covariance matrix $\Sigma_{EE}$ and the scalars $c, \sigma_F$, where $c$ describes the strength of the influence of $Z$ on $Y$ and $\sigma_F$ the standard deviation of $F$. Since this paper will be based on second-order statistics alone, we will treat these parameters as the only relevant ones.
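To make the model concrete, the following standalone sketch (our illustration, not code from the paper; the dimension, sample size, and parameter values are arbitrary assumptions) samples data from the structural equations (1) and (2):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 100_000

M = rng.standard_normal((d, d))                    # mixing matrix; Sigma_EE = M @ M.T
a = rng.standard_normal(d); a /= np.linalg.norm(a) # causal vector, ||a|| = 1
b = rng.standard_normal(d); b /= np.linalg.norm(b) # confounding vector, ||b|| = 1
c, sigma_F = 1.0, 0.5                              # Z -> Y strength and noise level of Y

Z = rng.standard_normal(n)                         # confounder, unit variance
E = rng.standard_normal((n, d)) @ M.T              # noise of X with covariance M @ M.T
F = sigma_F * rng.standard_normal(n)               # noise of Y

X = Z[:, None] * b[None, :] + E                    # structural equation (1): X = b Z + E
Y = X @ a + c * Z + F                              # structural equation (2): Y = a^T X + c Z + F
```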
The following special cases can be obtained by appropriate choices of these parameters:
Purely causal: The case with no confounding can easily be obtained by setting $c = 0$ or $b = 0$. In the first case, $Z$ takes the role of an additional independent noise term for $X$ (apart from $E$), see Figure 2(a).
The structural equation for $Y$ then reads

$$Y := a^T X + F, \qquad (3)$$

with $X$ and $F$ being statistically independent. For fixed $\sigma_F$, the limit of a deterministic influence of $X$ on $Y$ can be obtained by letting at least one component of the vector $a$ grow to infinity. Then $Y$ is dominated by the term $a^T X$.
Purely confounded: Setting $a = 0$ turns the influence of $X$ on $Y$ off. Then the relation between $X$ and $Y$ is generated by the confounder $Z$ only, see Figure 2(b). Depending on the remaining parameters $b$, $c$, and $\sigma_F$, we can also obtain a scenario where $X$ provides perfect knowledge about $Z$ (when $E = 0$ or in the limit $\|b\| \to \infty$) or a scenario where $Y$ provides perfect knowledge of $Z$ ($\sigma_F = 0$).
Purely anticausal: We have actually excluded a scenario where $Y$ is the cause of $X$. Nevertheless, if $a = 0$ and $\sigma_F = 0$, we have $Y = cZ$ almost surely, so that $Z = Y/c$ and $Z$ is the cause of $X$. Hence, the scenario becomes indistinguishable from an 'anticausal' scenario where $Y$ is the cause of $X$ as in Figure 3(b), although performing interventions on $Y$ would still tell us that it is not the cause.
We now ask how to distinguish between these cases given joint observations of $X$ and $Y$ that are i.i.d. drawn from $P(X, Y)$. Conditional statistical independences, which are usually employed for causal inference [2, 3], may not help to distinguish between the above cases if there are no conditional independences in $P(X, Y)$. Moreover, we assume that there are no observed causes of $X$ that could act as so-called instrumental variables, which would enable the distinction between 'causal' and 'confounded'.
To see that the parameters are heavily underdetermined in linear Gaussian models, just note that any multivariate Gaussian $P(X, Y)$ can be explained by $X$ being an unconfounded cause of $Y$ according to the structural eq. (3) by replacing $a$ with

$$\hat a := \Sigma_{XX}^{-1} \Sigma_{XY}. \qquad (4)$$

The vector $\hat a$ is the vector of regression coefficients obtained by regressing $Y$ on $X$ without caring about the true causal structure. Here we use the symbol $\hat a$ instead of $a$ to indicate that it differs from the vector $a$ that appears in the structural eq. (2), which correctly describes the causal relation between $X$ and $Y$. This way, we obtain a model that correctly describes the observed correlations, but not the causal relations, since the impact of interventions is not modelled correctly. Note that identifying $a$ is typically the main goal of causal data analysis since $a$ directly describes how changing $X$ changes $Y$. Confusing $\hat a$ with $a$ would be the common fallacy of naively attributing all dependences to the causal influence of $X$ on $Y$. To see the relation between $\hat a$ and $a$ we first find

$$\Sigma_{XY} = \Sigma_{XX}\, a + c\, b,$$

where we have used the joint independence of $Z, E, F$ and that $Z$ is normalized. Likewise,

$$\Sigma_{XX} = \Sigma_{EE} + b b^T.$$

Due to eq. (4) we thus obtain

$$\hat a = a + c\, \Sigma_{XX}^{-1} b. \qquad (5)$$

Equation (5) shows that the vector $\hat a$ obtained by standard regression consists of $a$ (which defines the causal influence of $X$ on $Y$) and a term $c\, \Sigma_{XX}^{-1} b$ that is due to confounding.
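As a quick numerical sanity check of eq. (5), the following standalone sketch (ours, with arbitrary parameter choices) builds the population covariances and verifies the decomposition:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 20
M = rng.standard_normal((d, d))
Sigma_EE = M @ M.T
a = rng.standard_normal(d)
b = rng.standard_normal(d)
c = 1.0

Sigma_XX = Sigma_EE + np.outer(b, b)          # covariance of X
Sigma_XY = Sigma_XX @ a + c * b               # cross-covariance of X and Y
a_hat = np.linalg.solve(Sigma_XX, Sigma_XY)   # regression vector, eq. (4)

# eq. (5): a_hat = causal part a + confounding term c * Sigma_XX^{-1} b
print(np.allclose(a_hat, a + c * np.linalg.solve(Sigma_XX, b)))   # True
```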
It is known that confounding can be detected in linear models with non-Gaussian variables. This is because describing data that has been generated by the model eqs. (1) and (2) by a structural equation of the form (3), i.e.,

$$Y = \hat a^T X + \hat F,$$

yields in the generic case a noise variable $\hat F$ that is not statistically independent of $X$, although it is uncorrelated. Other recent proposals to detect confounding using information beyond conditional statistical dependences rely on different model assumptions: one, for instance, assumes non-linear relations with additive noise, while another assumes a discrete confounder attaining only a few values.
Here we propose a method for distinguishing the purely causal from the confounded case in the linear scenario above that relies only on second-order statistics and thus requires neither non-Gaussianity of noise variables nor higher-order statistical independence tests such as kernel-based tests. In other words, the relation must be linear, but the distributions may be Gaussian or not. The paper is structured as follows. Section 2 defines the strength of confounding, which is the crucial target quantity to be estimated. Section 3 describes the idea of the underlying principle, defines it formally in terms of a spectral measure, and justifies it by a toy model where parameters are randomly generated. Section 4 describes the method to estimate the confounding strength and justifies it by intuitive arguments first and then by theoretical results which are rigorously shown in Section 5. Section 6 reports results with simulated causal structures according to the model assumptions, while Section 7 describes results from an experiment where the data has been generated by an electronic and optical setup with known causal structure. Section 8 reports some studies with observational data where the causal ground truth is only partially known.
2 Characterizing confounding by two parameters
2.1 Strength of confounding
We now define a strength of confounding that measures how much $\hat a$ deviates from $a$ (relative to the sum of the squared lengths of these vectors). The goal of this paper is to estimate this quantity from $P(X, Y)$ alone.

Definition 1 (structural strength of confounding). The structural strength of confounding is defined by

$$\beta := \frac{\|c\, \Sigma_{XX}^{-1} b\|^2}{\|a\|^2 + \|c\, \Sigma_{XX}^{-1} b\|^2}.$$

We will later see that, subject to the genericity assumptions stated in Section 3, the denominator can be approximated via

$$\|a\|^2 + \|c\, \Sigma_{XX}^{-1} b\|^2 \approx \|\hat a\|^2.$$

It should be emphasized that $\beta$ does not measure to what extent the confounder contributes to the observed covariance of $X$ and $Y$. This could be quantified by a parameter that we call 'correlative strength of confounding', denoted by $\beta_{corr}$, and defined via

$$\beta_{corr} := \frac{\|c\, b\|^2}{\|\Sigma_{XX}\, a\|^2 + \|c\, b\|^2}.$$

It is non-trivially related to $\beta$. Remarkably, $\beta$ and $\beta_{corr}$ can differ by orders of magnitude. Without claiming that $\beta$ would be the better measure, we focus on $\beta$ because it is more relevant for causal statements: whenever $\beta$ is large, identifying $\hat a$ with $a$ yields significantly wrong causal conclusions even when $\beta_{corr}$ is small. We have introduced $\beta_{corr}$ only to show that whether confounding is negligible or not highly depends on how it is quantified.
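The structural strength of Definition 1 is straightforward to compute when the ground-truth parameters are known; the following standalone sketch (our illustration with arbitrary parameters, not the authors' code) does exactly that:

```python
import numpy as np

def structural_confounding_strength(a, b, c, Sigma_XX):
    """Definition 1: beta = ||c Sigma_XX^{-1} b||^2
    / (||a||^2 + ||c Sigma_XX^{-1} b||^2)."""
    t = c * np.linalg.solve(Sigma_XX, b)   # the confounding term a_hat - a
    return (t @ t) / (a @ a + t @ t)

rng = np.random.default_rng(2)
d = 20
M = rng.standard_normal((d, d))
a = rng.standard_normal(d)
b = rng.standard_normal(d)
Sigma_XX = M @ M.T + np.outer(b, b)
print(structural_confounding_strength(a, b, 1.0, Sigma_XX))  # value in [0, 1]
```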
2.2 A second parameter characterizing confounding
The contribution of $Z$ to the covariance $\Sigma_{XY}$ between $X$ and $Y$ is given by the product $c\, b$. Accordingly, rescaling $b$ with some factor while rescaling $c$ with the inverse factor preserves the correlative strength of confounding. Although the structural confounding strength is affected by rescaling $b$ and $c$ in a more sophisticated way, $\beta$ can also be unaffected by rescaling both $b$ and $c$ in an appropriate way. The regimes with small $\|b\|$ and large $c$ versus the one with large $\|b\|$ and small $c$ have simple interpretations: in the first case, the uncertainty of $X$ is hardly reduced by knowing $Z$, while in the second case, knowing $Z$ reduces most of the uncertainty of $X$.
To distinguish between these different regimes of confounding we introduce the second parameter $\eta$, which measures the explanatory power of $Z$ for $X$:

$$\eta := \|b\|^2.$$

For the entire Section 2 it is important to keep in mind that we always referred to the case where $Z$ has unit variance.
3 Detecting confounders by the principle of generic orientation
3.1 Intuitive idea and background
The idea of our method is based on the recently stated Principle of Independent Conditionals [9, 10] in the context of causal inference. To introduce it, let $G$ be a directed acyclic graph (DAG) formalizing the hypothetical causal relations among the random variables $X_1, \ldots, X_n$. The distributions compatible with this causal structure factorize into

$$P(X_1, \ldots, X_n) = \prod_{j=1}^n P(X_j \mid PA_j),$$
where each $P(X_j \mid PA_j)$ denotes the conditional distribution of $X_j$, given its parents $PA_j$. Informally speaking, the Principle of Independent Conditionals states that, usually, each $P(X_j \mid PA_j)$ describes an independent mechanism of nature and therefore these objects are 'independent' and contain 'no information' about each other. Refs. [9, 10] formalized 'no information' by postulating that the description length of one $P(X_j \mid PA_j)$ does not get shorter when the descriptions of the others $P(X_i \mid PA_i)$ for $i \neq j$ are given. Here, description length is defined via Kolmogorov complexity, which is, unfortunately, uncomputable. To deal with this caveat, one can either approximate Kolmogorov complexity, or indirectly use the principle as a justification for new inference methods rather than as an inference method itself.
However, there are also other options to give a definite meaning to the term 'independence'. To see this, consider some parametric model where each $P(X_j \mid PA_j)$ is taken from a set of possible conditionals parameterized by $\theta_j$ from some parameter space $\Theta_j$. Assume that, for a given distribution, the parameters $\theta_1, \ldots, \theta_n$ are related by an equation (i.e., one is a function of the others) that is not satisfied by generic tuples. One can then consider this as a hint that the mechanisms corresponding to the $P(X_j \mid PA_j)$ have not been generated independently and become skeptical about the causal hypothesis. This philosophical argument is also the basis for Causal Faithfulness, that is, the principle of rejecting a causal DAG for which the joint distribution satisfies conditional independences that do not hold for generic parameter vectors, because such independences require the vector to lie in a lower dimensional manifold. In its informal version, the Principle of Independent Conditionals generalizes this idea by excluding also other 'non-generic' relations between the parameter vectors $\theta_j$. More precisely, the assumption of faithfulness can be justified by generating models according to which unfaithful distributions occur with probability zero. For practical purposes one needs to also exclude almost unfaithful distributions and assume that they occur only with small probability (although this is problematic). Here we describe relations between parameters that are also non-generic in the sense that they occur with small probability according to our generating model.
After these general remarks, let us come back to the scenario where the causal hypothesis reads $X \to Y$ (without confounding and within a linear model). Recalling that we consider $\Sigma_{XX}$ the crucial model parameter for $P(X)$ and the regression vector $\hat a$ the crucial parameter for $P(Y \mid X)$, we therefore postulate that $\hat a$ lies in a 'generic' orientation relative to $\Sigma_{XX}$ in a sense to be described in Subsection 3.2. To approach this idea first by intuition, note, for instance, that it is unlikely that $\hat a$ is close to being aligned with the eigenvector of $\Sigma_{XX}$ corresponding to the largest eigenvalue (i.e., the first principal component), given that $\hat a$ has been chosen 'without knowing' $\Sigma_{XX}$. Likewise, it is unlikely that it is approximately aligned with the last principal component.
For the more general DAG shown in Figure 1 we again assume that $a$ has generic orientation with respect to the eigenspaces of $\Sigma_{XX}$ and, in addition, that $b$ has generic orientation with respect to the eigenspaces of $\Sigma_{EE}$.
To provide a first intuition about why the orientation of the resulting vector $\hat a$ of regression coefficients is no longer 'generic' relative to $\Sigma_{XX}$ as a result of confounding, we show two somewhat opposite extreme cases where $\hat a$ gets aligned with the first and the last principal component of $\Sigma_{XX}$, respectively. To this end, we consider the purely confounded case where $a = 0$ and thus $\hat a = c\, \Sigma_{XX}^{-1} b$.
$\hat a$ aligned with the first eigenvector of $\Sigma_{XX}$: Let $\Sigma_{EE}$ be the identity matrix $I$. Since $P(E)$ is then rotation invariant, $b$ certainly has 'generic orientation' relative to $\Sigma_{EE}$, according to any reasonable sense of 'generic orientation'. Then $b$ does not have generic orientation relative to $\Sigma_{XX} = I + b b^T$, because it is the unique eigenvector of the latter with maximal eigenvalue. In other words, $b$ is aligned with the first principal component of $X$. Then, $\hat a$ is also aligned with the same principal component since it is a multiple of $b$ due to $\Sigma_{XX}^{-1} b = b/(1 + \|b\|^2)$. Note that one also gets close to this scenario when the spectral gaps of $\Sigma_{EE}$ are small compared to the norm of $b$ and to the eigenvalues of $\Sigma_{EE}$.
$\hat a$ close to the last eigenvector of $\Sigma_{XX}$: Let the spectral gaps between adjacent eigenvalues of $\Sigma_{EE}$ be much larger than the norm of $b$. Then adding $b b^T$ changes the eigenspaces only slightly. Hence, if $b$ has a generic orientation relative to $\Sigma_{EE}$, it is still generic relative to $\Sigma_{XX}$. Multiplying with $\Sigma_{XX}^{-1}$ then generates a vector whose coefficients are amplified in the eigenspaces with small eigenvalues of $\Sigma_{XX}$. If the smallest eigenvalue of $\Sigma_{XX}$ is much smaller than the others, the orientation of $\hat a$ gets arbitrarily close to the corresponding eigenvector.
For general $\Sigma_{EE}$, where the gaps between the eigenvalues are neither tiny nor huge compared to the norm of $b$, the orientation of $\hat a$ changes in a more sophisticated way that heavily depends on the structure of the spectrum of $\Sigma_{XX}$. This will be analyzed in Section 4.
3.2 Defining ‘generic orientation’ via the induced spectral measure
We start with some notation and terminology and formally introduce two measures which have quite simple intuitive meanings.
For $d \times d$ matrices $A$, we introduce the renormalized trace

$$\tau(A) := \frac{1}{d}\, \mathrm{tr}(A).$$

For notational convenience, we will assume that the spectra of all matrices are non-degenerate throughout the paper, i.e., all eigenvalues are distinct. Every symmetric matrix $\Sigma$ thus admits a unique decomposition

$$\Sigma = \sum_{j=1}^d \lambda_j\, \psi_j \psi_j^T \quad \text{with } \lambda_1 > \lambda_2 > \cdots > \lambda_d, \qquad (9)$$

where the $\lambda_j$ and $\psi_j$ denote the eigenvalues and the corresponding eigenvectors of unit length. Every $\Sigma$ uniquely defines a measure on $\mathbb{R}$, namely the distribution of eigenvalues, formally given as follows:
Definition 2 (tracial spectral measure). Let $\Sigma$ be a real symmetric matrix with non-degenerate spectrum. Then the tracial spectral measure of $\Sigma$ is the discrete measure $\mu^{tr}_\Sigma$ on $\mathbb{R}$ given by the uniform distribution over its eigenvalues $\lambda_1, \ldots, \lambda_d$, i.e.,

$$\mu^{tr}_\Sigma := \frac{1}{d} \sum_{j=1}^d \delta_{\lambda_j},$$

where $\delta_\lambda$ denotes the point measure on $\lambda$ for $\lambda \in \mathbb{R}$.
By elementary spectral theory of symmetric matrices, we have:

Lemma 1 (expectation for tracial spectral measure). The expectation of any function $f: \mathbb{R} \to \mathbb{R}$ with respect to the tracial measure is given as follows:

$$\int f\, d\mu^{tr}_\Sigma = \tau(f(\Sigma)).$$
While the spectral measure is a property of a matrix alone, the following measure describes the relation between a matrix and a vector:

Definition 3 (vector-induced spectral measure). Let $\Sigma$ be a symmetric matrix and let $\lambda_j, \psi_j$ be defined by eq. (9). For arbitrary $v \in \mathbb{R}^d$, the (unnormalized) spectral measure induced by $v$ on $\Sigma$, denoted by $\mu_{\Sigma, v}$, is given by

$$\mu_{\Sigma, v}(S) := \sum_{j:\, \lambda_j \in S} \langle \psi_j, v \rangle^2$$

for any measurable set $S \subseteq \mathbb{R}$.
For each set $S$ of eigenvalues of $\Sigma$, the measure $\mu_{\Sigma, v}$ describes the squared length of the component of $v$ that lies in the respective eigenspace of $\Sigma$. Accordingly, we have the following normalization condition:

$$\mu_{\Sigma, v}(\mathbb{R}) = \|v\|^2.$$
In analogy to Lemma 1 we obtain:

Lemma 2 (expectations for vector-induced spectral measure). The expectation of any function $f$ on the spectrum of $\Sigma$ with respect to $\mu_{\Sigma, v}$ is given as follows:

$$\int f\, d\mu_{\Sigma, v} = \langle v, f(\Sigma)\, v \rangle. \qquad (10)$$

To deal with the above measures in numerical computations, each measure will be represented by two vectors: first, one vector $\lambda := (\lambda_1, \ldots, \lambda_d)$ listing its support (with the convention $\lambda_1 > \lambda_2 > \cdots > \lambda_d$) and, second, the vector $w$ listing the corresponding weights. For the tracial spectral measures, $\lambda$ is just the list of eigenvalues and $w$ is the uniform distribution on these points. For spectral measures induced by a vector, $\lambda$ is still the list of eigenvalues, but now $w$ describes the squared coefficients of the vector with respect to the eigenvector decomposition.
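In code, both representations amount to a single eigendecomposition. The following standalone sketch (our illustration; the helper name is ours, not from the paper's code) returns the support together with both weight vectors:

```python
import numpy as np

def spectral_measures(Sigma, v):
    """Support (eigenvalues, decreasing) plus the weight vectors of the
    tracial measure (Definition 2) and of the measure induced by v
    (Definition 3)."""
    lam, Psi = np.linalg.eigh(Sigma)               # ascending eigenvalues
    lam, Psi = lam[::-1], Psi[:, ::-1]             # decreasing order, as in the text
    w_tracial = np.full(lam.size, 1.0 / lam.size)  # uniform over the eigenvalues
    w_induced = (Psi.T @ v) ** 2                   # squared coefficients of v
    return lam, w_tracial, w_induced               # w_induced sums to ||v||^2

# Moments can be read off the weights (Lemmas 1 and 2): for instance,
# sum(w_induced * lam**k) equals <v, Sigma^k v>.
```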
To understand our algorithm described later, it is helpful to note that using the eigenvectors $\psi_j$ of a matrix $\Sigma$ one can easily construct a vector that induces the tracial measure:

$$v := \frac{1}{\sqrt{d}} \sum_{j=1}^d \psi_j. \qquad (11)$$

Then we have

$$\mu_{\Sigma, v} = \mu^{tr}_\Sigma.$$
The crucial idea behind our detection of confounding is the insight that $\mu_{\Sigma, v}$ is close to $\|v\|^2\, \mu^{tr}_\Sigma$ for any 'typical' choice of $v$, which is made precise by the following result shown in Section 5.

Lemma 3 (spectral measures induced by random vectors). Let $\Sigma_d$ with $d \in \mathbb{N}$ be a sequence of symmetric $d \times d$ matrices with uniformly bounded spectra whose spectral measures converge weakly to some probability measure $\mu$, i.e.,

$$\mu^{tr}_{\Sigma_d} \to \mu.$$

Let $v_d$ with $d \in \mathbb{N}$ be randomly drawn from the sphere of radius $r$ in $\mathbb{R}^d$. Then

$$\mu_{\Sigma_d, v_d} \to r^2 \mu$$

weakly in probability.
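The following standalone sketch illustrates the concentration that Lemma 3 asserts; the particular spectrum is an arbitrary assumption of ours:

```python
import numpy as np

rng = np.random.default_rng(3)
for d in (10, 100, 1000):
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random eigenbasis (columns of Q)
    lam = rng.uniform(0.5, 2.0, d)                    # bounded spectrum
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)                            # uniform on the unit sphere
    w = (Q.T @ v) ** 2                                # weights of the induced measure
    # second moment of the induced vs. the tracial measure, close for large d:
    print(d, (w * lam**2).sum(), (lam**2).mean())
```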
The idea of this paper is that we expect $\mu_{\Sigma_{XX}, \hat a}$ to be close to $\|\hat a\|^2\, \mu^{tr}_{\Sigma_{XX}}$ in the absence of confounding. Moreover, we show that confounding perturbs $\mu_{\Sigma_{XX}, \hat a}$ in a characteristic way, given that the model parameters are 'typical'. To study properties of $\mu_{\Sigma_{XX}, \hat a}$ for 'typical' choices of the model parameters $a, b, \Sigma_{EE}, c$, we now define the following sequence of models for increasing dimension $d$:
Covariance matrix of the noise of $X$: Let $\Sigma_{EE}^{(d)}$ be a uniformly bounded sequence of positive semi-definite $d \times d$-matrices such that their tracial spectral measures converge weakly to some probability measure $\mu$ (describing the asymptotic distribution of eigenvalues).
Vector of causal structure coefficients: Let $a_d$ be a sequence of vectors in $\mathbb{R}^d$ drawn uniformly at random from a sphere of fixed radius $\alpha$.
Vector of confounding structure coefficients: Let $b_d$ be a sequence of vectors in $\mathbb{R}^d$ drawn uniformly at random (independently of $a_d$) from a sphere of fixed radius $\gamma$. Let $c$ be fixed for all $d$.
Then $\Sigma_{XX}^{(d)} = \Sigma_{EE}^{(d)} + b_d b_d^T$ and $\hat a_d = a_d + c\, (\Sigma_{XX}^{(d)})^{-1} b_d$, and we have the following result that will be shown in Section 5:

Theorem 1 (asymptotically generic orientations). For the above sequence of models, the following differences of measures converge weakly to zero in probability:

$$\mu_{\Sigma_{XX}^{(d)}, a_d} - \alpha^2\, \mu^{tr}_{\Sigma_{XX}^{(d)}}, \qquad (13)$$

$$\mu_{\Sigma_{EE}^{(d)}, b_d} - \gamma^2\, \mu^{tr}_{\Sigma_{EE}^{(d)}}, \qquad (14)$$

$$\mu_{\Sigma_{XX}^{(d)}, \hat a_d} - \left(\mu_{\Sigma_{XX}^{(d)}, a_d} + \mu_{\Sigma_{XX}^{(d)}, c (\Sigma_{XX}^{(d)})^{-1} b_d}\right). \qquad (15)$$
Intuitively, eq. (13) just states that the causal part $a$ of $\hat a$ induces, together with $\Sigma_{XX}$, a spectral measure that is close to the tracial measure up to the normalization factor $\alpha^2$, because $a$ has 'generic' orientation relative to $\Sigma_{XX}$ since $a$ is chosen independently of $\Sigma_{EE}$ and $b$. On the other hand, $b$ is chosen independently of $\Sigma_{EE}$, and thus the spectral measure induced by $b$ and $\Sigma_{EE}$ is close to the corresponding tracial measure up to normalization, as shown by eq. (14). Equation (15) states that $\mu_{\Sigma_{XX}, \hat a}$ almost decomposes into $\mu_{\Sigma_{XX}, a}$ plus $\mu_{\Sigma_{XX}, c \Sigma_{XX}^{-1} b}$, that is, the spectral measure induced by the causal vector $a$ and the one induced by its perturbation $c\, \Sigma_{XX}^{-1} b$. This is useful because eq. (13) describes the asymptotic behavior of the former. To understand the asymptotics of the latter, on the other hand, we will later use eq. (14).
Assuming a rotation invariant generating model may appear to be too strong an assumption for practical purposes. It should therefore be noted that much weaker assumptions would probably also yield the same approximate identities in high dimensions. Possible quantitative versions of Lemma 3 for finite dimensions could show that the overwhelming majority of vectors on the surface of the unit sphere induce spectral measures that are close to the tracial one. Such quantitative versions could provide some insights on the robustness of our method regarding violations of the strong model assumptions (since a large class of non-rotation-invariant priors would then yield the same conclusions), but this goes beyond the scope of this paper. After all, the performance on real data cannot be predicted by purely theoretical arguments.
For our method we therefore only postulate the following properties, which hold for typical parameter choices in high dimensions:

(I) The vector $a$ has generic orientation relative to $\Sigma_{XX}$ in the sense that

$$\mu_{\Sigma_{XX}, a} \approx \|a\|^2\, \mu^{tr}_{\Sigma_{XX}}. \qquad (16)$$

(II) The vector $b$ has generic orientation relative to $\Sigma_{EE}$ in the sense that

$$\mu_{\Sigma_{EE}, b} \approx \|b\|^2\, \mu^{tr}_{\Sigma_{EE}}. \qquad (17)$$

(III) The vector $a$ is generic relative to $c\, \Sigma_{XX}^{-1} b$ in the sense that

$$\mu_{\Sigma_{XX}, \hat a} \approx \mu_{\Sigma_{XX}, a} + \mu_{\Sigma_{XX}, c \Sigma_{XX}^{-1} b}. \qquad (18)$$
We therefore build our algorithm in Section 4 upon these postulates only, instead of directly using the generating model.
Note that eqs. (16), (17), and (18) only hold if the $\approx$-sign is interpreted in a sufficiently loose sense. After all, the differences between both sides of the above equations converge weakly to zero; they are not close, for instance, with respect to total variation distance. Hence, the measures are similar in the same sense as two empirical distributions with large sample size are similar when they are independently sampled from the same distribution.
Related work: We first describe the relation of the above ideas to those underlying the so-called Trace Method [18, 19], which is, to the best of our knowledge, the work in the literature that is closest to the present one. Let $X$ and $Y$ be vector-valued variables with values in $\mathbb{R}^d$ and $\mathbb{R}^m$, respectively. Assume $X$ influences $Y$ via the linear model

$$Y = A X + E,$$

where $A$ is an $m \times d$ matrix and $E$ a noise variable of dimension $m$. Then the Trace Method postulates that $A$ and $\Sigma_{XX}$ approximately satisfy the trace condition

$$\tau(A \Sigma_{XX} A^T) = \tau(A A^T)\, \tau(\Sigma_{XX}). \qquad (19)$$

For $m = 1$, the matrix $A$ is given by the row vector $\hat a^T$. Then $\tau(A A^T) = \|\hat a\|^2$ and eq. (19) turns into

$$\hat a^T \Sigma_{XX}\, \hat a = \|\hat a\|^2\, \tau(\Sigma_{XX}). \qquad (20)$$

In terms of the spectral measures, eq. (20) reads

$$\int \lambda\, d\mu_{\Sigma_{XX}, \hat a}(\lambda) = \|\hat a\|^2 \int \lambda\, d\mu^{tr}_{\Sigma_{XX}}(\lambda).$$

Hence, eq. (19) postulates that the first moments of the two measures on the left and the right of eq. (16) almost coincide, while our method also accounts for higher order moments, which the Trace Condition ignores. As sketched in earlier work, the Trace Condition (19) is closely related to the concept of free independence in free probability theory. In the appendix we will explain why eqs. (16), (17), (18) are also related to free independence in spirit, although there is no straightforward way to apply those concepts here.
We should also mention further work that is remotely related to this paper: there, the authors study a large number of observed variables which are related by a sparse graphical model as well as by some latent variables influencing a large subset of the observed variables, and they provide conditions under which the latent part and the graphical model part can be distinguished. On a high level, this is related to our work because the method also employs effects in high dimensions to detect latent variables.
4 Description of the method
4.1 Constructing typical spectral measures for given parameter values
The main result of this section states that asymptotically, $\mu_{\Sigma_{XX}, \hat a}$ depends only on three parameters when the normalization $\|\hat a\|$ is given: $\Sigma_{XX}$, $\beta$, and $\eta$. The first one is directly observable, hence we define a two-parametric family of normalized (i.e. probability) measures $\nu_{\beta, \eta}$ such that for large $d$ with high probability

$$\tilde\mu_{\Sigma_{XX}, \hat a} \approx \nu_{\beta, \eta}$$

for appropriate choices of $\beta, \eta$, where $\tilde\mu_{\Sigma_{XX}, \hat a} := \mu_{\Sigma_{XX}, \hat a}/\|\hat a\|^2$ denotes the normalized induced spectral measure. We first describe the construction of $\nu_{\beta, \eta}$:
Causal part: this part describes the spectral measure that would be obtained in the absence of confounding. It is induced by $a$ and $\Sigma_{XX}$. According to eq. (16), it is approximated, after normalization, by the uniform distribution over the spectrum of $\Sigma_{XX}$, i.e., the tracial measure introduced in Definition 2. We therefore define:

$$\nu^{causal} := \mu^{tr}_{\Sigma_{XX}}.$$
Confounding part: we now approximate the spectral measure induced by the vector $c\, \Sigma_{XX}^{-1} b$ and $\Sigma_{XX}$. We will justify the construction later after we have described all its steps. We first define the matrix $\Lambda := \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$, where the $\lambda_j$ are the eigenvalues of $\Sigma_{XX}$ in decreasing order. Then we define a rank-one perturbation of $\Lambda$ by

$$\Lambda_\eta := \Lambda + \eta\, u u^T, \qquad (21)$$

where $u$ is the vector $(1/\sqrt{d})(1, \ldots, 1)^T$. We then compute the spectral measure induced by the vector $\Lambda_\eta^{-1} u$ and $\Lambda_\eta$ and define

$$\nu^{conf}_\eta := \frac{\mu_{\Lambda_\eta, \Lambda_\eta^{-1} u}}{\|\Lambda_\eta^{-1} u\|^2}. \qquad (22)$$
Mixing both contributions: finally, $\nu_{\beta, \eta}$ is a convex sum of the causal and the confounding part where the mixing weight of the latter is given by the confounding strength:

$$\nu_{\beta, \eta} := (1 - \beta)\, \nu^{causal} + \beta\, \nu^{conf}_\eta.$$
We now explain Step 2 in the above construction. According to Postulate (II) in Subsection 3.2, $b$ has generic orientation relative to $\Sigma_{EE}$ in the sense of eq. (17). With respect to its eigenbasis, $\Sigma_{EE}$ reads $\mathrm{diag}(\lambda'_1, \ldots, \lambda'_d)$, where the $\lambda'_j$ are the eigenvalues of $\Sigma_{EE}$. Of course we don't know the vector $b$, neither do we know its coordinates with respect to this basis. Remarkably, it turns out that knowing that $b$ is generic relative to $\Sigma_{EE}$ is enough because then we can replace $b$ with a vector that is 'particularly generic', namely $\sqrt{\eta}\, u$. This vector satisfies

$$\mu_{\Sigma_{EE}, \sqrt{\eta}\, u} = \eta\, \mu^{tr}_{\Sigma_{EE}} \qquad (23)$$

exactly (see eq. (11)), while asymptotically the overwhelming majority of vectors of squared length $\eta$ satisfy eq. (23) approximately. Therefore, an appropriate multiple of $u$ nicely mimics the behavior of generic vectors. Accordingly, we can approximate the spectral measure induced by the vector $c\, \Sigma_{XX}^{-1} b$ and the matrix $\Sigma_{XX}$ by the spectral measure induced by the corresponding multiple of $u$ and the rank-one perturbation of $\mathrm{diag}(\lambda'_1, \ldots, \lambda'_d)$. Unfortunately, this construction would need the eigenvalues $\lambda'_j$ of $\Sigma_{EE}$, which cannot be computed from observing $X$ alone. Asymptotically, however, the difference between the spectra of $\Sigma_{XX}$ and $\Sigma_{EE}$ does not matter and we can replace the $\lambda'_j$ with the eigenvalues $\lambda_j$ of $\Sigma_{XX}$. This step will be justified later using the fact that for large $d$ we have

$$\mu^{tr}_{\Sigma_{EE}} \approx \mu^{tr}_{\Sigma_{XX}},$$
which is made more precise by the following result:

Lemma 4 (tracial measures are close). For any interval $I \subseteq \mathbb{R}$ we have

$$\left|\mu^{tr}_{\Sigma_{XX}}(I) - \mu^{tr}_{\Sigma_{EE}}(I)\right| \leq \frac{1}{d}.$$

Proof. If $\lambda_1 \geq \lambda_2 \geq \cdots$ denote the eigenvalues of $\Sigma_{XX}$, then the eigenvalues $\lambda'_j$ of $\Sigma_{EE}$ satisfy

$$\lambda_j \geq \lambda'_j \geq \lambda_{j+1}$$

by the interlacing theorem for rank-one perturbations. Hence the numbers of eigenvalues of the two matrices in a given interval can differ by one at most.
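A quick standalone sanity check of this interlacing (our illustration, with arbitrary parameters):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 30
M = rng.standard_normal((d, d))
Sigma_EE = M @ M.T
b = rng.standard_normal(d)
lam_E = np.sort(np.linalg.eigvalsh(Sigma_EE))[::-1]                  # decreasing
lam_X = np.sort(np.linalg.eigvalsh(Sigma_EE + np.outer(b, b)))[::-1] # decreasing
# interlacing: lam_X[j] >= lam_E[j] >= lam_X[j+1] for all j
print(np.all(lam_X >= lam_E - 1e-9), np.all(lam_E[:-1] >= lam_X[1:] - 1e-9))
```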
We now describe the main theoretical result of this article:
Theorem 2 (convergence to two-parametric family). Let $\Sigma_{EE}^{(d)}$ be a sequence of covariance matrices whose distribution of eigenvalues (i.e. the tracial spectral measure) converges weakly to some probability measure $\mu$ supported by a compact interval in $(0, \infty)$. Assume, moreover, we are given sequences of vectors $a_d$ and $b_d$ with $\|a_d\| = \alpha$ and $\|b_d\| = \gamma$ (with $\alpha, \gamma$ fixed) and fixed $c$, such that eqs. (13)–(15) hold (recall $\Sigma_{XX}^{(d)} = \Sigma_{EE}^{(d)} + b_d b_d^T$). Then $\nu_{\beta, \eta}$ approximates $\mu_{\Sigma_{XX}, \hat a}$ up to normalization in the sense that

$$\tilde\mu_{\Sigma_{XX}^{(d)}, \hat a_d} - \nu_{\beta, \eta} \to 0$$

weakly in probability, where $\eta = \gamma^2$ and the confounding strength $\beta$ can be derived from $\alpha$, $\gamma$, and $c$ by equations specified later.
Hence, the theorem states that whenever the postulates of Subsection 3.2 hold with sufficient accuracy and the dimension is sufficiently high, $\nu_{\beta, \eta}$ is a good approximation for the normalized induced spectral measure. We will prove Theorem 2 in Section 5.
Apart from this weak convergence result we also know that the measures $\nu_{\beta, \eta}$ and $\tilde\mu_{\Sigma_{XX}, \hat a}$ have precisely the same support for any $\beta, \eta$ because, by construction, $\nu_{\beta, \eta}$ is also supported by the spectrum of $\Sigma_{XX}$. This enables us to conveniently represent both measures by vectors whose entries describe the weight of the corresponding eigenvalue.
Two limiting cases as examples: To get further intuition about $\nu_{\beta, \eta}$, let it describe $\tilde\mu_{\Sigma_{XX}, \hat a}$ for the two motivating limiting cases in Subsection 3.1, both with pure confounding, i.e., $\beta = 1$. Following Theorem 2, we have to choose $\eta = \|b\|^2$ to mimic $\tilde\mu_{\Sigma_{XX}, \hat a}$ in the limit of $d \to \infty$. The case where $\hat a$ is almost aligned with the first eigenvector corresponds to $\nu_{1, \eta}$ having the majority of its weight on the first eigenvalue, while the second case corresponds to small $\eta$, for which the weights for small eigenvalues are more dominant, as we will see now.
For the first case $\Sigma_{EE} = I$, the spectrum of $\Sigma_{XX}$ reads $1 + \eta, 1, \ldots, 1$, that is, $\Lambda = \mathrm{diag}(1 + \eta, 1, \ldots, 1)$. We thus have

$$\Lambda_\eta = \mathrm{diag}(1 + \eta, 1, \ldots, 1) + \eta\, u u^T,$$

where $u = (1/\sqrt{d})(1, \ldots, 1)^T$. The subspace spanned by $e_1$ and $u$ is $\Lambda_\eta$-invariant. The restriction of $\Lambda_\eta$ to this $2$-dimensional subspace is asymptotically (for $d$ to infinity) given by $(1 + \eta)$ times the identity because $e_1$ and $u$ become orthogonal. Therefore,

$$\nu_{1, \eta}(\{1 + \eta\}) \to 1,$$

that is, the weights of $\nu_{1, \eta}$ tend to zero for all eigenvectors other than those with maximal eigenvalue, in agreement with our expectation when we recall the remarks in Subsection 3.1.
In the second case, let all eigenvalues $\lambda_j$ of $\Sigma_{XX}$ be distinct and let $\eta$ be small compared to the spectral gaps $\lambda_j - \lambda_{j+1}$. The perturbation of $\Lambda$ with $\eta\, u u^T$ then changes the eigenvectors and eigenvalues only slightly, that is, $\Lambda_\eta \approx \Lambda$ and thus

$$\Lambda_\eta^{-1} u \approx \Lambda^{-1} u = \frac{1}{\sqrt{d}}\left(\lambda_1^{-1}, \ldots, \lambda_d^{-1}\right)^T.$$

Accounting for the normalization factor then yields

$$\nu_{1, \eta}(\{\lambda_j\}) \approx \frac{\lambda_j^{-2}}{\sum_k \lambda_k^{-2}}.$$

Hence, as expected due to the remarks in Subsection 3.1, small eigenvalues get higher weights than the large ones (due to the weighting factor $\lambda^{-2}$).
4.2 Description of the algorithm
To estimate the confounding parameters $\beta, \eta$ we just take the element in the family $(\nu_{\beta, \eta})$ that is closest to $\tilde\mu_{\Sigma_{XX}, \hat a}$, but we have to choose an appropriate distance measure. Since Theorem 2 only guarantees weak convergence, $\ell_1$ or $\ell_2$ distances between the weight vectors would be inappropriate. Instead, we have decided to smoothen the measures by a Gaussian kernel and then compare the distance. As kernel bandwidth we have worked with

$$h := \frac{2 (\lambda_1 - \lambda_d)}{d}.$$

This heuristic choice is motivated by the idea that the kernel should smoothen over a significant fraction of the eigenvalues: for dimension $d$, the bandwidth thus is two times the average distance between adjacent eigenvalues.
Accordingly, we define a distance between two weight vectors $w$ and $w'$ by

$$D(w, w') := \|K (w - w')\|, \qquad (26)$$

where $K$ denotes the kernel smoothing matrix with entries

$$K_{ij} := \exp\left(-\frac{(\lambda_i - \lambda_j)^2}{2 h^2}\right).$$
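A minimal sketch of this distance under our reconstruction of eq. (26) (the row normalization of the kernel matrix is our assumption, not stated in the text):

```python
import numpy as np

def smoothed_distance(lam, w1, w2):
    """Euclidean distance between kernel-smoothed weight vectors."""
    h = 2.0 * (lam.max() - lam.min()) / lam.size         # bandwidth heuristic above
    K = np.exp(-((lam[:, None] - lam[None, :]) ** 2) / (2.0 * h ** 2))
    K /= K.sum(axis=1, keepdims=True)                    # row normalization (assumption)
    return np.linalg.norm(K @ (w1 - w2))
```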
Algorithm 1 Estimating the strength of confounding
1: Input: i.i.d. samples from $P(X, Y)$
2: Compute the empirical covariance matrices $\Sigma_{XX}$ and $\Sigma_{XY}$
3: Compute the regression vector $\hat a := \Sigma_{XX}^{-1} \Sigma_{XY}$
4: PHASE 1: compute the spectral measure $\mu_{\Sigma_{XX}, \hat a}$:
5: Compute the eigenvalues $\lambda_j$ and the corresponding eigenvectors $\psi_j$ of $\Sigma_{XX}$
6: Compute the weights $w_j := \langle \psi_j, \hat a \rangle^2$ and then the normalized weights $\tilde w_j := w_j / \|\hat a\|^2$
7: PHASE 2: find the parameter values $(\beta, \eta)$ that minimize the distance $D(\tilde w, w^{\beta, \eta})$ with $D$ defined by eq. (26), where $w^{\beta, \eta}$ denotes the weight vector of the measure $\nu_{\beta, \eta}$
8: Output: Estimated confounding strength $\hat\beta$
Based on these findings, we describe how to estimate $\beta$ in Algorithm 1. The code as well as all the data sets used in this paper and instructions on how to reproduce the results can be found at http://webdav.tuebingen.mpg.de/causality/.
Since the pseudocode does not describe how to compute the weight vector $w^{\beta, \eta}$, we provide this missing detail now. First compute the matrix $\Lambda_\eta$ as defined by eq. (21) and compute its eigenvectors $\psi'_j$. Then we compute the vector $g := \Lambda_\eta^{-1} u$. The squared coefficients of $g$ with respect to the eigenvector basis describe the weights of $\nu^{conf}_\eta$ after normalization, see eq. (22); the weight belonging to the $j$th eigenvector of $\Lambda_\eta$ is assigned to the eigenvalue $\lambda_j$ of $\Sigma_{XX}$. To obtain the weights of $\nu_{\beta, \eta}$ we need to add the contribution of the causal part and finally obtain the weights

$$w^{\beta, \eta}_j = \frac{1 - \beta}{d} + \beta\, \frac{\langle \psi'_j, g \rangle^2}{\|g\|^2}.$$
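Putting the steps together, a sketch of the weight-vector computation might look as follows (our reconstruction of eqs. (21)–(22) and the mixing step; the authors' implementation may differ in details):

```python
import numpy as np

def nu_weights(lam, beta, eta):
    """Weight vector of nu_{beta, eta} on the eigenvalues lam
    (decreasing order, assumed positive)."""
    d = lam.size
    u = np.full(d, 1.0 / np.sqrt(d))                   # 'maximally generic' unit vector
    Lambda_eta = np.diag(lam) + eta * np.outer(u, u)   # rank-one perturbation, eq. (21)
    _, Psi = np.linalg.eigh(Lambda_eta)                # eigenvectors, ascending eigenvalues
    g = np.linalg.solve(Lambda_eta, u)                 # mimics Sigma_XX^{-1} b
    w_conf = (Psi.T @ g) ** 2
    w_conf = w_conf[::-1] / w_conf.sum()               # normalize; align with decreasing lam
    return (1.0 - beta) / d + beta * w_conf            # convex mixture with tracial part
```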
4.3 Remark on normalization
So far we have ignored the case where the variables $X_j$ refer to quantities that are measured in different units. If, for instance, $X_1$ denotes the temperature and $X_2$ the traffic density (which both influence the ozone concentration $Y$ in the air), the relative scale of their numeric values depends on the units one chooses. Another related issue arises when all $X_j$ refer to the same unit, but the variance of one of the variables is overwhelmingly larger than the variances of the others, which results in a covariance matrix whose rank is essentially one. A pragmatic and straightforward solution for both issues is to normalize all variables as a preprocessing step. Admittedly, this is in conflict with the justification of the method, because normalization jointly changes $\Sigma_{XX}$ and $\hat a$, which spoils the idea of 'independence'. In our limited simulation studies, however, the results turned out to be reasonably robust with respect to normalizing all $X_j$. Here, robustness is only meant in the sense that the performance over a large number of runs looked almost the same. For every single experiment, however, the estimated values can differ by the amount of uncertainty that is inherent to our method anyway. Due to the lack of theoretical justification, we recommend avoiding normalization if possible and remaining skeptical about results obtained with normalized data.
5 Proofs of asymptotic statements
5.1 Proof of Lemma 3
It is sufficient to show the statement for $r = 1$ because the measure obviously scales quadratically in $r$. Since the support of all $\mu^{tr}_{\Sigma_d}$ is contained in a compact interval $[-C, C]$, it is sufficient to show convergence of all moments, i.e., that for every $k \in \mathbb{N}$

$$\int \lambda^k\, d\mu_{\Sigma_d, v_d}(\lambda) \to \int \lambda^k\, d\mu(\lambda) \qquad (27)$$

in probability. To this end, we drop most indices and consider a fixed dimension $d$. To generate a random unit vector $v$, we first take independent standard Gaussian random variables $Z_1, \ldots, Z_d$ and define the $j$th coefficient of $v$ with respect to the eigenbasis of $\Sigma$ by $v_j := Z_j\, (\sum_i Z_i^2)^{-1/2}$. Let $\Sigma = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ without loss of generality. Then the $k$th moment reads:

$$\int \lambda^k\, d\mu_{\Sigma, v}(\lambda) = \sum_j \lambda_j^k\, v_j^2 = \frac{\sum_j \lambda_j^k Z_j^2}{\sum_j Z_j^2}.$$

One easily checks

$$E\left[\frac{1}{d} \sum_j \lambda_j^k Z_j^2\right] = \frac{1}{d} \sum_j \lambda_j^k = \int \lambda^k\, d\mu^{tr}_\Sigma(\lambda).$$

Moreover, since all $Z_j$ are independent and because squared standard Gaussians have variance $2$, we have

$$\mathrm{Var}\left[\frac{1}{d} \sum_j \lambda_j^k Z_j^2\right] = \frac{2}{d^2} \sum_j \lambda_j^{2k} \leq \frac{2 C^{2k}}{d},$$

where we used $|\lambda_j| \leq C$. By Chebyshev's inequality, the probability for large deviations from the mean can be bounded by

$$P\left\{\left|\frac{1}{d} \sum_j \lambda_j^k Z_j^2 - \frac{1}{d} \sum_j \lambda_j^k\right| \geq \epsilon\right\} \leq \frac{2 C^{2k}}{d \epsilon^2}. \qquad (28)$$

Then we get

$$\int \lambda^k\, d\mu_{\Sigma_d, v_d}(\lambda) = \frac{\frac{1}{d} \sum_j \lambda_j^k Z_j^2}{\frac{1}{d} \sum_j Z_j^2}. \qquad (29)$$

We have $\frac{1}{d} \sum_j Z_j^2 \to 1$ almost surely by the strong law of large numbers. Due to eq. (28), the numerator in expression (29) gets arbitrarily close to $\int \lambda^k d\mu^{tr}_{\Sigma_d}$ in probability. Thus, expression (29) converges to $\int \lambda^k d\mu(\lambda)$ in probability due to the assumption $\mu^{tr}_{\Sigma_d} \to \mu$.
5.2 Proof of Theorem 1
The difference between the left and the right hand sides of eq. (16) converges weakly to zero in probability due to Lemma 3, because $a$ is drawn independently of $\Sigma_{XX}$ (since $a$ on the one hand and $\Sigma_{EE}$ and $b$ on the other are drawn independently by assumption). Statement (17) follows even more directly from Lemma 3 because $b$ is drawn independently of $\Sigma_{EE}$.
To show that the difference between the left and the right hand side of eq. (18) converges weakly to zero in probability, we recall that it is sufficient to show that expectations of bounded continuous functions converge. For any measurable function $f$ the difference of expectations reads:

$$\int f\, d\mu_{\Sigma_{XX}, \hat a} - \int f\, d\mu_{\Sigma_{XX}, a} - \int f\, d\mu_{\Sigma_{XX}, c \Sigma_{XX}^{-1} b} = 2 c\, \langle a, f(\Sigma_{XX})\, \Sigma_{XX}^{-1} b \rangle.$$

Hence, we only have to show that

$$\langle a, f(\Sigma_{XX})\, \Sigma_{XX}^{-1} b \rangle \to 0 \qquad (30)$$

in probability. Note that this already follows from the fact that $a$ is chosen independently of the vector on the right hand side of the inner product in eq. (30), due to the following elementary result:
Lemma 5 (asymptotic orthogonality). Let $w_d \in \mathbb{R}^d$ be a sequence of vectors with uniformly bounded norm. Let $v_d$ be drawn uniformly at random from the unit sphere. Then

$$\langle v_d, w_d \rangle \to 0 \quad \text{in probability.}$$

Proof. Without loss of generality, assume $w_d = \|w_d\|\, e_1$ with $e_1$ denoting the first canonical basis vector. Generate the entries of $v_d$ by first taking $d$ independent standard Gaussians $Z_1, \ldots, Z_d$ and renormalizing afterwards. Then

$$\langle v_d, w_d \rangle = \|w_d\|\, \frac{Z_1}{\sqrt{\sum_j Z_j^2}} \to 0,$$

because $\frac{1}{d} \sum_j Z_j^2$ converges to $1$ almost surely due to the law of large numbers.
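A toy numerical check of Lemma 5 (our illustration): the inner product between a fixed unit vector and a random point on the unit sphere concentrates around zero as $d$ grows.

```python
import numpy as np

rng = np.random.default_rng(5)
for d in (10, 1000, 100_000):
    w = np.zeros(d)
    w[0] = 1.0                          # fixed unit vector
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)              # uniform on the unit sphere
    print(d, abs(v @ w))                # decays roughly like 1/sqrt(d)
```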
5.3 Proof of Theorem 2
We first need some definitions and tools. The following one generalizes Definition 3 to infinite-dimensional Hilbert spaces (see standard references for spectral theory of self-adjoint operators):

Definition 5 (vector-induced spectral measure). Let $\mathcal{H}$ be a Hilbert space and $A$ a self-adjoint operator with spectral decomposition $A = \int \lambda\, dE_\lambda$, where $(E_\lambda)$ denotes the spectral family of $A$ (that is, $E_\lambda$ projects onto the spectral subspace corresponding to all spectral values not larger than $\lambda$). For any $v \in \mathcal{H}$, let $\mu_{A, v}$ be defined by

$$\int f\, d\mu_{A, v} := \langle v, f(A)\, v \rangle$$

for all integrable functions $f$.
We then define a map on the space of measures that will be a convenient tool for the proof:

Definition 6 (rank one perturbation for general measures). Let $\mu$ be a (not necessarily normalized) finite Borel measure on $\mathbb{R}$. Let $L^2(\mu)$ denote the Hilbert space of real square integrable functions on $\mathbb{R}$ with respect to $\mu$. Define an operator $A_\mu$ on $L^2(\mu)$ by

$$A_\mu := X + \langle \mathbb{1}, \cdot \rangle\, \mathbb{1}, \qquad (31)$$

where $X$ denotes the identical map $(X g)(\lambda) := \lambda\, g(\lambda)$ and $\mathbb{1}$ the constant function attaining always the value $1$. Then we define the rank one perturbation of $\mu$ by

$$\rho(\mu) := \mu_{A_\mu, \mathbb{1}}.$$

The name 'rank one perturbation' is justified because $\rho$ describes how the spectral measure induced by an operator $A$ and a vector $v$ changes by replacing $A$ with its rank-one perturbation $A + \langle v, \cdot \rangle v$. To see this, let $\mu := \mu_{A, v}$. Assume, without loss of generality, that $v$ is a cyclic vector for $A$, i.e., that the span of $\{A^n v\}$ is dense in $\mathcal{H}$ (otherwise we restrict $A$ to the completion of this span). By standard spectral theory of operators, there is a unitary map $U: \mathcal{H} \to L^2(\mu)$ 'diagonalizing' $A$ in the sense that $U A U^{-1} = X$ and $U v = \mathbb{1}$. Therefore,

$$\mu_{A + \langle v, \cdot \rangle v,\, v} = \mu_{A_\mu, \mathbb{1}} = \rho(\mu_{A, v}).$$
We do not have a more explicit description of $\rho$, but the relation between the Cauchy transforms of $\mu$ and $\rho(\mu)$ is remarkably simple. To describe the relation, we first introduce Cauchy transforms:

Definition 7 (Cauchy transform). Let $\mu$ be a not necessarily normalized measure. Then the Cauchy transform of $\mu$ is defined as the complex-valued function from $\mathbb{C}_+$ (that is, the set of complex numbers with positive imaginary part) to $\mathbb{C}_+$ given by

$$G_\mu(z) := \int \frac{1}{\lambda - z}\, d\mu(\lambda).$$
Then we find:

Lemma 6 (spectral measure for rank-one perturbation). Let $A$ be a self-adjoint operator on some Hilbert space $\mathcal{H}$ and let $v \in \mathcal{H}$ be some vector. Define the rank one perturbation $A_1 := A + \langle v, \cdot \rangle v$. Set $\mu := \mu_{A, v}$ and $\mu_1 := \mu_{A_1, v}$. Then

$$G_{\mu_1}(z) = \frac{G_\mu(z)}{1 + G_\mu(z)}. \qquad (32)$$

Equation (32) is a special case of the so-called Aronszajn–Krein formula [24, 25, 26, 27, 28]. It can easily be seen as follows. Set $R_z := (A - z)^{-1}$. Moreover, by slightly abusing notation, define the linear form $v^\dagger: w \mapsto \langle v, w \rangle$. Using the Sherman–Morrison formula

$$(A + v v^\dagger - z)^{-1} = R_z - \frac{R_z\, v v^\dagger R_z}{1 + v^\dagger R_z v},$$

one easily obtains

$$\langle v, (A_1 - z)^{-1} v \rangle = G_\mu(z) - \frac{G_\mu(z)^2}{1 + G_\mu(z)} = \frac{G_\mu(z)}{1 + G_\mu(z)}.$$

Then the statement follows using $G_{\mu_1}(z) = \langle v, (A_1 - z)^{-1} v \rangle$.
By applying Lemma 6 to the operator $A_\mu$ in eq. (31) we obtain:

Corollary 1 (Cauchy transform of rank one perturbation). For any finite Borel measure $\mu$ on $\mathbb{R}$, the Cauchy transforms of $\mu$ and $\rho(\mu)$ are related by

$$G_{\rho(\mu)}(z) = \frac{G_\mu(z)}{1 + G_\mu(z)}.$$
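A standalone numerical check of eq. (32) for discrete measures (our illustration; the Cauchy transform of $\mu_{\Sigma, v}$ is $\sum_j \langle \psi_j, v \rangle^2/(\lambda_j - z)$):

```python
import numpy as np

def cauchy(Sigma, v, z):
    """Cauchy transform of the spectral measure induced by v on Sigma."""
    lam, Psi = np.linalg.eigh(Sigma)
    return np.sum((Psi.T @ v) ** 2 / (lam - z))

rng = np.random.default_rng(6)
d = 15
A = rng.standard_normal((d, d))
A = (A + A.T) / 2                       # symmetric matrix
v = rng.standard_normal(d)

z = 0.3 + 1.0j                          # point in the upper half plane
G = cauchy(A, v, z)
G1 = cauchy(A + np.outer(v, v), v, z)   # rank-one perturbed operator
print(np.isclose(G1, G / (1 + G)))      # True, eq. (32)
```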
Moreover, we will need the following map:

Definition 8 (multiplication map). If $\mu$ denotes a Borel measure on $\mathbb{R}$, we define $M(\mu)$ by

$$\int f\, dM(\mu) := \int f(\lambda)\, \lambda^{-2}\, d\mu(\lambda)$$

for every measurable function $f$ on $\mathbb{R}$.
The transformation $M$ describes how the spectral measure induced by any matrix $\Sigma$ and any vector $v$ changes when $v$ is replaced with $\Sigma^{-1} v$, that is,

$$\mu_{\Sigma, \Sigma^{-1} v} = M(\mu_{\Sigma, v}).$$

This is also easily verified by diagonalizing $\Sigma$ to a multiplication operator on $L^2(\mu)$ as in the remarks after Definition 5.
Using the transformations $\rho$ and $M$, we obtain the following concise form for the spectral measure induced by the confounding vector:

$$\mu_{\Sigma_{XX}, c \Sigma_{XX}^{-1} b} = c^2\, M(\rho(\mu_{\Sigma_{EE}, b})).$$
We will also need the following result:

Lemma 7 (weak continuity of $\rho$ and $M$). Let $\mu_d$ be a sequence of measures with common compact support $S \subseteq [\lambda_{min}, \lambda_{max}]$ with $\lambda_{min} > 0$. If $\mu_d \to \mu$ weakly then $M(\mu_d) \to M(\mu)$ and $\rho(\mu_d) \to \rho(\mu)$ weakly.

Proof. Weak continuity of $M$ is immediate since $M$ is the multiplication with the function $\lambda \mapsto \lambda^{-2}$, which is continuous on $S$ and bounded between $\lambda_{max}^{-2}$ and $\lambda_{min}^{-2}$. Weak continuity of $\rho$ follows via Corollary 1, since for measures with uniformly bounded support, weak convergence is equivalent to pointwise convergence of the Cauchy transforms on $\mathbb{C}_+$.
Since we observe $\Sigma_{XX}$ and not $\Sigma_{EE}$, it is important for our purpose that the tracial measures of both matrices asymptotically coincide. The asymptotic version of Lemma 4 reads:

Lemma 8 (tracial measures coincide). If $\mu^{tr}_{\Sigma_{EE}^{(d)}} \to \mu$ weakly then $\mu^{tr}_{\Sigma_{XX}^{(d)}} \to \mu$ weakly, too.

Proof. We have for every interval $I$:

$$\left|\mu^{tr}_{\Sigma_{XX}}(I) - \mu(I)\right| \leq \left|\mu^{tr}_{\Sigma_{XX}}(I) - \mu^{tr}_{\Sigma_{EE}}(I)\right| + \left|\mu^{tr}_{\Sigma_{EE}}(I) - \mu(I)\right|.$$

The first term converges to zero due to Lemma 4 and the second one by assumption. Since the intervals generate the entire Borel sigma algebra, the statement follows.
For each $d$, the confounding strength $\beta_d$ depends on the length of $c\, \Sigma_{XX}^{-1} b_d$, which, in turn, depends on the specific random choice of $b_d$. We will see, however, that asymptotically one always obtains

$$\beta = \frac{c^2\, M(\rho(\gamma^2 \mu))(\mathbb{R})}{\alpha^2 + c^2\, M(\rho(\gamma^2 \mu))(\mathbb{R})},$$

since it turns out that the proof below holds for this value of $\beta$.
We are now prepared to prove Theorem 2. Due to Theorem 1 we have

$$\mu_{\Sigma_{XX}, \hat a} \approx \mu_{\Sigma_{XX}, a} + \mu_{\Sigma_{XX}, c \Sigma_{XX}^{-1} b} \approx \alpha^2\, \mu^{tr}_{\Sigma_{XX}} + c^2\, M(\rho(\mu_{\Sigma_{EE}, b})),$$

where the last step is due to

$$\mu_{\Sigma_{XX}, c \Sigma_{XX}^{-1} b} = c^2\, M(\rho(\mu_{\Sigma_{EE}, b}))$$

by definition of $M$ and $\rho$.
Due to Lemma 7 and Theorem 1 we have

$$M(\rho(\mu_{\Sigma_{EE}, b})) \approx M(\rho(\gamma^2 \mu^{tr}_{\Sigma_{EE}})) \approx M(\rho(\gamma^2 \mu)).$$

Hence we obtain

$$\tilde\mu_{\Sigma_{XX}, \hat a} \approx \frac{\alpha^2 \mu + c^2\, M(\rho(\gamma^2 \mu))}{\|\hat a\|^2}. \qquad (33)$$

To evaluate the denominator on the right hand side, we employ eq. (10) with $f \equiv 1$ and obtain:

$$\|\hat a\|^2 = \mu_{\Sigma_{XX}, \hat a}(\mathbb{R}). \qquad (34)$$

Using the asymptotic decomposition of measures in eq. (15) in Theorem 1, we thus have

$$\|\hat a\|^2 \approx \mu_{\Sigma_{XX}, a}(\mathbb{R}) + c^2\, M(\rho(\mu_{\Sigma_{EE}, b}))(\mathbb{R}) = \alpha^2 + c^2\, M(\rho(\mu_{\Sigma_{EE}, b}))(\mathbb{R}) \approx \alpha^2 + c^2\, M(\rho(\gamma^2 \mu))(\mathbb{R}), \qquad (35)$$

where the second equality is due to eq. (13) and the definition of $M$ and $\rho$. The third equality uses the normalization $\mu(\mathbb{R}) = 1$ (since $\mu$ is a probability measure) and eq. (14). Inserting eq. (35) into eq. (33) yields:

$$\tilde\mu_{\Sigma_{XX}, \hat a} \approx \frac{\alpha^2 \mu + c^2\, M(\rho(\gamma^2 \mu))}{\alpha^2 + c^2\, M(\rho(\gamma^2 \mu))(\mathbb{R})}. \qquad (36)$$

On the other hand, recalling the construction of $\nu_{\beta, \eta}$ in Section 4, for fixed $d$ (whose index we drop) we have

$$\nu_{\beta, \eta} = (1 - \beta)\, \mu^{tr}_{\Sigma_{XX}} + \beta\, \frac{M(\rho(\eta\, \mu^{tr}_{\Sigma_{XX}}))}{M(\rho(\eta\, \mu^{tr}_{\Sigma_{XX}}))(\mathbb{R})} \approx (1 - \beta)\, \mu + \beta\, \frac{M(\rho(\eta \mu))}{M(\rho(\eta \mu))(\mathbb{R})},$$

where we have used $\mu_{\Lambda_\eta, \sqrt{\eta}\, \Lambda_\eta^{-1} u} = M(\rho(\eta\, \mu^{tr}_{\Sigma_{XX}}))$ by construction of $\Lambda_\eta$ and $u$. Hence we obtain, with $\eta := \gamma^2$ and $\beta$ as defined above:

$$\nu_{\beta, \eta} \approx \frac{\alpha^2 \mu + c^2\, M(\rho(\gamma^2 \mu))}{\alpha^2 + c^2\, M(\rho(\gamma^2 \mu))(\mathbb{R})},$$

which coincides with eq. (36).
6 Experiments with simulated data
6.1 Estimation of strength of confounding
We first ran experiments where the data has been generated according to our model assumptions: First, both the influence of $X$ on $Y$ and the influence of $Z$ on $X$ and $Y$ are linear. Second, the vectors $a$ and $b$ are randomly drawn from a uniform distribution over a sphere whose radius is randomly chosen. More specifically, the data generating process reads as follows:
Generate the noise $E$: first generate samples of a $d$-dimensional vector-valued Gaussian random variable with mean zero and covariance matrix $I$. Then generate a random $d \times d$ matrix $M$ whose entries are independent standard Gaussians, and multiply the samples by $M$ (so that $\Sigma_{EE} = M M^T$).
Generate scalar random variables $Z$ and $F$ by drawing samples of each independently from a standard Gaussian distribution.
Draw the scalar model parameters $c$ and $\sigma_F$ by independent draws from the uniform distribution on the unit interval.
Draw the vectors $a$ and $b$ independently from spheres of radius $r_a$ and $r_b$, respectively, where the radii are chosen at random.
Note that for the above generating process the computation of the true confounding strength $\beta$ involves only model parameters that are exactly known; even $\Sigma_{XX}$ need not be estimated from the data matrix because it is simply given by $\Sigma_{XX} = M M^T + b b^T$.
Figure 4 shows the results for the dimensions and sample sizes considered. They indicate that the sample size is even more critical for the performance than the dimension. It seems that the required sample size grows so quickly with the dimension that data sets with moderate sample sizes should only be considered if the dimension is not too large.
This observation for moderate dimensions is somewhat in contrast to the asymptotic statement that the sample size required for non-regularized estimation of covariance matrices (given some fixed error bound in terms of the operator norm) grows only linearly with the dimension [32, 33] for any distribution with finite second moment, a result that may be relevant in the 'big data regime' with huge $d$ and $n$.
Although the results for the smallest dimension look quite bad, it should be noted that the true and the estimated values of $\beta$ are already significantly correlated for all sample sizes, with small $p$-values.
We found the true and estimated values of $\eta$ to be quite uncorrelated; it seems that $\eta$ is hard to estimate using our method. Since our focus is on the confounding strength $\beta$, we will not explore this any further.
7 Experiments with real data under controlled conditions
It is hard to find real data where the strength of confounding is known. This is because there are usually unknown confounders in addition to the ones that are obvious for observers with some domain knowledge. For this reason, we have designed an experiment where the variables are observables of technical devices among which the causal structure is known by construction of the experimental setup.
7.1 Setup for a confounded causal influence
The cause $X$ is a pixel vector generated by extremely reducing the resolution of an image taken by a webcam. The effect $Y$ is the intensity measured at a light sensor placed in front of a laptop screen that displays the (amplified) image; the sensor is located a few centimeters from the screen. To confound the causal relation by a common cause $Z$, we have generated an independent random voltage that controls the brightness of two LEDs: one influencing $X$ because it is placed in front of the webcam, and one that is placed in front of the light sensor. To ensure that $X$ is not entirely determined by the LED, we have placed the webcam in front of a TV. This way, the image taken by the webcam is influenced by both the LED and the TV signal; the latter plays the role of $E$ in our structural eq. (1). To avoid that fluctuations of daylight act as an additional confounder, we have covered the pair of sensor and laptop screen with a towel. Since we have measured $Z$ (the random value of the voltage which determines the brightness of the LEDs), we are able to compute the strength of confounding up to the accuracy with which the estimations from empirical data coincide with their true counterparts. We will denote this value by $\tilde\beta$ to emphasize that it may still deviate from the true value $\beta$ when the sample size is not sufficient. Figure 6 shows $\tilde\beta$ and the estimated values $\hat\beta$ for the experiments.
For this setup, the algorithm tends to underestimate confounding, but shows qualitatively the right tendency since $\hat\beta$ and $\tilde\beta$ clearly correlate.
After inspecting some spectral measures for the above scenario, we believe that the algorithm underestimates confounding for the following reason: the vector $a$ describing the influence of the image pixels on the sensor is not in generic orientation relative to $\Sigma_{XX}$. This is because the pixels are usually positively correlated and each pixel has a positive influence on the total intensity measured by the sensor. This way, $a$ has stronger weights for high eigenvalues. Since confounding often increases the weights of low eigenvalues (note, however, that this also depends on our second confounding parameter $\eta$), the 'non-genericness' of the vector $a$ tends to compensate the effect of confounding on the spectral measure induced by $\hat a$. It is likely that such an underestimation of confounding occurs for many other scenarios as well. This is because it is not uncommon that all variables in a vector are positively correlated and that they all have a positive influence on some target variable.
The above setting contained the purely confounded and the purely causal scenario as limiting cases: the confounded one by putting the LEDs close to the sensor and the webcam, respectively, and setting the voltage to the maximal values, and the purely causal one by setting the voltage to zero.
To get further support for the hypothesis that the algorithm tends to behave qualitatively in the right way even when the estimated strength of confounding deviates from its true value, we also tested modifications of the above scenario that are purely confounded or purely causal by construction and not only by variations of parameters. This is described in Sections 7.2 and 7.3.
7.2 Purely causal scenarios
To build a scenario without confounding, $X$ is the pixel vector of grey values of an image section randomly drawn from a fixed image. The image sections are displayed on the screen of a laptop after rescaling them to a size of a few centimeters. A light sensor, placed in front of the screen at a distance of a few centimeters, measures the light intensity, which is our variable $Y$. Clearly, $X$ influences $Y$ in an unconfounded way because the selection of the image sections is perfectly randomized. Fluctuations of the brightness caused by the environment certainly influence $Y$, but count as independent noise since they do not influence $X$.
We tried this experiment and found the estimated confounding strength $\hat\beta$ to be close to zero. We should also mention that the value $\tilde\beta$ computed from the measured voltage was close to zero as well, in agreement with our statement that the experimental setup is unconfounded because the image sections are drawn randomly. The extremely low value of $\tilde\beta$ also shows that it is indeed very close to the true value $\beta$, which justifies identifying them.
7.3 Purely confounded scenario
The setup in Figure 5 can easily be modified to a causal structure where the relation between the pixel vector $X$ and the light intensity $Y$ is purely confounded by a one-dimensional variable $Z$: we just need to put the light sensor in a place where it neither sees the TV nor the screen of the laptop. If we, again, ensure that the light sensor is not influenced by the same fluctuations of daylight as the webcam (e.g. by covering the sensor with a towel), the statistical dependence between $X$ and $Y$ is due to $Z$, that is, the fluctuations of the random light signal from the LEDs, alone.
We have performed this experiment and obtained an estimate $\hat\beta$ below the value $\tilde\beta$ computed from the measured voltage, which again is consistent with our previous observation that confounding is underestimated.
8 Experiments with real data with partially known causal structure
The experiments in this section refer to real data where the causal structure is not known with certainty. For each data set, however, we will briefly discuss the plausibility of the results in light of our limited domain knowledge. The main purpose of the section is to show that the estimated values of confounding strength indeed spread over the whole interval $[0, 1]$. A priori, we could not be sure whether empirical data follow probability distributions that are so different from our model assumptions that only small or only large values of confounding would be estimated.
8.1 Taste of wine
This dataset describes the dependence between $Y$, the taste score between 0 and 10 (given by human subjects) of red wine, and 11 different ingredients: $X_1$: fixed acidity, $X_2$: volatile acidity, $X_3$: citric acid, $X_4$: residual sugar, $X_5$: chlorides, $X_6$: free sulfur dioxide, $X_7$: total sulfur dioxide, $X_8$: density, $X_9$: pH, $X_{10}$: sulphates, $X_{11}$: alcohol. It turned out that the largest eigenvalue of the covariance matrix is by orders of magnitude larger than the others. We therefore normalized the $X_j$ to unit variance. In the resulting correlation matrix we observe several correlation coefficients of considerable size, hence $\Sigma_{XX}$ significantly differs from the identity, which is important because $\Sigma_{XX} = I$ would render the method pointless. The vector $\hat a$ of regression coefficients shows that alcohol ($X_{11}$) has by far the strongest association with taste. According to common experience, alcohol indeed has a significant influence on the taste. Also the other associations are likely to be mostly causal and not due to a confounder. We estimated confounding for this data set and obtained a low value of $\hat\beta$.
Since the above experiments suggest that the set of ingredients influences the target variable taste essentially in an unconfounded way, we now explore what happens when we exclude one of the variables $X_j$. Since this variable will typically be correlated with the remaining ones and since it, at the same time, influences $Y$, excluding it will typically confound the causal relation between the remaining variables and $Y$. For each $j$ we have therefore estimated the structural confounding strength and obtained the following results: for all $j \neq 11$ the algorithm estimated the confounding strength $0$ (the lowest possible value), while it estimated a clearly positive value for $j = 11$. Since $X_{11}$ (alcohol) has the strongest influence on $Y$, this result is remarkable because it is plausible that dropping it corresponds to strong confounding.
Figure 7, left, visualizes the weights of the spectral measure for the case where all variables are included and compares it to the one obtained when alcohol is dropped (right). In the latter case, one clearly sees that the weights decrease towards large eigenvalues, which may indicate confounding.
8.2 Chicago crime data
This dataset reports the number of crimes for each of the community areas in Chicago, USA, and some potential factors influencing the crime rate. Here, $Y$ denotes the number of assaults (homicides) and $X$ consists of the following features: $X_1$: below poverty level, $X_2$: crowded housing, $X_3$: dependency, $X_4$: no high-school diploma, $X_5$: per capita income, $X_6$: unemployment. After normalization we obtain an estimated vector of structure coefficients in which unemployment has the strongest positive coefficient, while 'no high-school diploma' obtains a negative one.
It seems reasonable that the unemployment rate has the strongest influence. It is, however, surprising that 'no high-school diploma' should have a negative influence on the number of crimes. This is probably due to a confounding effect, and indeed the algorithm estimated a considerable confounding strength for this data set.
8.3 Compressive strength and ingredients of concrete
This experiment considers the data set 'Concrete Compressive Strength' [38, 39] from the UCI machine learning repository. $Y$ is the compressive strength in megapascals, and $X_1$ to $X_7$ are the following components, measured in kg per cubic meter of mixture: $X_1$: cement, $X_2$: blast furnace slag, $X_3$: fly ash, $X_4$: water, $X_5$: superplasticizer, $X_6$: coarse aggregate, $X_7$: fine aggregate. $X_8$ is the age in days. After normalization, the estimated vector of structure coefficients suggests that the amount of superplasticizer has the strongest influence, followed by cement. The algorithm estimated a moderate confounding strength, but it is hard to speculate about possible confounders here.
9 Conclusions

We have described a method that estimates the strength of a potential one-dimensional common cause $Z$ that confounds the causal relation between a $d$-dimensional cause $X$ (with 'large' $d$) and a one-dimensional effect $Y$. The presence of $Z$ can, to some extent, be detected from the joint statistics of $X$ and $Y$ when the vector $\hat a$ of regression coefficients (obtained by regressing $Y$ on $X$) is decomposed into the eigenvectors of the covariance matrix $\Sigma_{XX}$ of $X$. This is because generically, without confounding, the weights of this decomposition are roughly uniformly spread over the eigenvalues of $\Sigma_{XX}$, while the presence of $Z$ will typically modify the weights in a way that is characteristic for the corresponding confounding scenario.
The method is based on the assumption that the vector $a$ has, in a certain sense, 'generic orientation' with respect to $\Sigma_{XX}$. The justification of our method relies on a highly idealized model where the vectors of model parameters are randomly generated from a rotation invariant prior. This yields several concentration of measure phenomena in high dimensions which the method employs. There is some hope that empirical data show similar concentration of measure phenomena, although our model assumptions are probably significantly violated.
Given the difficulty of the enterprise of inferring causal relations from observational data, one should not expect that any method is able to detect the presence of confounders with certainty. Following this modest attitude, the results can be considered encouraging; after all, the joint distribution of $X$ and $Y$ seems to contain some hints on whether or not their causal relation is confounded.
Although the theoretical justification of the method (using asymptotics for the dimension tending to infinity) suggests that the method should only be applied in high dimensions, it should be emphasized that we have so far computed the regression vector without regularization, which quickly requires prohibitively high sample sizes. Future work may apply the method after regularized regression, but then one has to make sure that the regularizer does not spoil the method by violating our symmetry assumptions.
To generalize the ideas of this paper to non-linear causal relations, one may consider kernel methods. Then the covariance matrix in feature space also captures non-linear relations among the components $X_j$, and the regression vector also captures non-linear relations between $X$ and $Y$. It is unclear, however, how the idea of 'generic relative orientation' translates into this setting, given that a rotation invariant prior is impossible in an infinite-dimensional space.
Regardless of whether there is a sufficient number of real-world problems for which the method developed here is applicable, it may help to understand three things: first, it shows a possible formalization of the informal idea of 'independence' between $P(X)$ and $P(Y \mid X)$ for the common scenario of multivariate linear models. Second, it shows what kind of sophisticated dependences between these objects are generated by confounding in such a scenario. And third, it shows that employing these dependences for confounder detection may be challenging even when the idealized model assumptions hold. This way, the scenario discussed in the present paper may help develop novel practical methods for confounder detection.
10 Appendix: Relation to free independence
Free probability theory [20, 30] defines a notion of independence that is asymptotically satisfied by independently chosen high-dimensional random matrices. To sketch the idea, we start with a standard model of generating independent random matrices:
Let $A_d$ and $B_d$ be sequences of matrices whose tracial spectral measures converge weakly. Let $U_d$ be random orthogonal matrices drawn from $O(d)$ according to the Haar measure. We start with the simplest case of the independence conditions: $A_d$ and $U_d B_d U_d^T$ satisfy asymptotically the Trace Condition [18, 19], i.e.,

$$\tau(A_d\, U_d B_d U_d^T) - \tau(A_d)\, \tau(B_d) \to 0 \qquad (37)$$

in probability, where $\tau$ denotes again the renormalized trace. It is then convenient to introduce limit objects $a, b$ in a $C^*$-algebra and a functional $\phi$, expressing the limit of renormalized traces, for which the Trace Condition then holds exactly:

$$\phi(a b) = \phi(a)\, \phi(b).$$
However, eq. (37) is only the simplest one of an infinity of independence statements. First, one obtains statements on higher moments like

$$\phi(a b a b) = \phi(a^2)\, \phi(b)^2 + \phi(a)^2\, \phi(b^2) - \phi(a)^2\, \phi(b)^2,$$

which is the analog of $E[X Y X Y] = E[X^2]\, E[Y^2]$ for independent (commuting) random variables $X$ and $Y$. But the model above also yields independence statements like $\phi(a_1 b_1 a_2 b_2 \cdots) = 0$ whenever $\phi(a_i) = \phi(b_i) = 0$ for polynomials $a_i$ in $a$ and $b_i$ in $b$, which have no counterpart with classical random variables. Thus $a$ and $b$ are also considered 'non-abelian random variables' and free independence a stronger version of usual statistical independence which can only hold because the variables do not commute.
The following difficulty arises when we try to apply the above ideas to our generating model used in Theorem 1. Our sequence $\Sigma_{EE}^{(d)}$ may take the role of $B_d$. To draw $b_d$ uniformly from the sphere of radius $\gamma$, we may define an arbitrary sequence of unit vectors $w_d$ and set $b_d := \gamma\, U_d w_d$ with $U_d$ being a random rotation as above. Then one could naively argue that $b_d b_d^T$ takes the role of $A_d$ and one also expects

$$\tau(b_d b_d^T\, \Sigma_{EE}^k) - \tau(b_d b_d^T)\, \tau(\Sigma_{EE}^k) \to 0,$$

which is equivalent to

$$b_d^T\, \Sigma_{EE}^k\, b_d \approx \gamma^2\, \tau(\Sigma_{EE}^k)$$

for all $k$ for large $d$. Thus, all moments of $\mu_{\Sigma_{EE}, b_d}$ coincide almost with those of $\gamma^2\, \mu^{tr}_{\Sigma_{EE}}$. This argument, however, blurs the fact that $\tau(b_d b_d^T) = \gamma^2/d$ converges to zero while $\gamma^2$ is constant, i.e., we need to consider the asymptotics of $d\, \tau(\cdot)$ instead of $\tau(\cdot)$, which is not covered by free probability theory to the best of our knowledge. Theorem 1 is thus close to the above statements, although we do not see any straightforward way to derive it from existing work.
We would like to thank Uli Wannek for helping us with the implementation of the video experiment. Many thanks also to Roland Speicher and his group in Saarbrücken for helpful discussions about free probability theory and to Steffen Lauritzen for pointing out that real data sometimes show the MTP2 property, which may be an issue here.
Hoyer P, Shimizu S, Kerminen A, Palviainen M. Estimation of causal effects using linear non-Gaussian causal models with hidden variables. Int J Approx Reason. 2008;49:362–378. doi:10.1016/j.ijar.2008.02.006
Janzing D, Peters J, Mooij J, Schölkopf B. Identifying latent confounders using additive noise models. In: Ng A, Bilmes J, editors. Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009). Corvallis, OR, USA: AUAI Press, 2009:249–257.
Janzing D, Sgouritsa E, Stegle O, Peters J, Schölkopf B. Detecting low-complexity unobserved causes. In: Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011). Available at: http://uai.sis.pitt.edu/papers/11/p383-janzing.pdf.
Meek C. Strong completeness and faithfulness in Bayesian networks. In: Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI). Montreal, Canada: Morgan Kaufmann, 1995:411–418.
Murphy G. C*-algebras and operator theory. Boston: Academic Press, 1990.
Reed M, Simon B. Functional analysis. San Diego, CA: Academic Press, 1980.
Janzing D, Hoyer P, Schölkopf B. Telling cause from effect based on high-dimensional observations. In: Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Haifa, Israel, 2010:479–486.
Zscheischler J, Janzing D, Zhang K. Testing whether linear equations are causal: a free probability theory approach. In: Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011). Available at: http://uai.sis.pitt.edu/papers/11/p839-zscheischler.pdf.
Voiculescu D, editor. Free probability theory. Volume 12 of Fields Institute Communications. American Mathematical Society, 1997.
Simon B. Spectral analysis of rank one perturbations and applications. Lecture given at the Vancouver Summer School in Mathematical Physics (1993). Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.9138, 1994.
Simon B. Trace ideals and their applications. Providence, RI: American Mathematical Society, 2005.
Albeverio S, Konstantinov A, Koshmanenko V. The Aronszajn–Donoghue theory for rank one perturbations of the H₋₂-class. Integral Equ Oper Theory. 2004;50:1–8. doi:10.1007/s00020-002-1219-3
Karlin S, Rinott Y. Classes of orderings of measures and related correlation inequalities. I. Multivariate totally positive distributions. J Multivar Anal. 1980;10:467–498. doi:10.1016/0047-259X(80)90065-2
Lichman M. UCI machine learning repository. Available at: http://archive.ics.uci.edu/ml, 2013.
City of Chicago. Data portal: Chicago poverty and crime. Available at: https://data.cityofchicago.org/Health-Human-Services/Chicago-poverty-and-crime/fwns-pcmk.
Yeh I-C. Concrete compressive strength data set. Available at: https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength.
Schölkopf B, Smola A. Learning with kernels. Cambridge, MA: MIT Press, 2002.
Gretton A, Herbrich R, Smola A, Bousquet O, Schölkopf B. Kernel methods for measuring independence. J Mach Learn Res. 2005;6:2075–2129.
Speicher R. Free probability theory and non-crossing partitions. Sém Lothar Combin. 1997;39.