1 Introduction and general model
where
If all variables are centered Gaussian, the remaining model parameters are the vectors
The following special cases can be obtained by appropriate choices of these parameters:
Purely causal: The case
with
Purely confounded: Setting
Purely anticausal: We have actually excluded a scenario where
We now ask how to distinguish between these cases given joint observations of
Equation (5) shows that the vector
as in eq. (3), yields in the generic case a noise variable
Here we propose a method for distinguishing the purely causal from the confounded case in the linear scenario above that relies only on second-order statistics; it therefore does not require non-Gaussianity of the noise variables or higher-order statistical independence tests like [6], for instance. In other words, the relation must be linear, but the distributions may be Gaussian or not. The paper is structured as follows. Section 2 defines the strength of confounding, which is the crucial target quantity to be estimated. Section 3 describes the idea of the underlying principle, defines it formally in terms of a spectral measure, and justifies it by a toy model where parameters are randomly generated. Section 4 describes the method for estimating the strength and justifies it first by intuitive arguments and then by theoretical results that are rigorously proven in Section 5. Section 6 reports results with simulated causal structures generated according to the model assumptions, while Section 7 describes results from an experiment where the data has been generated by an electronic and optical setup with known causal structure. Section 8 reports some studies with observational data where the causal ground truth is only partially known.
2 Characterizing confounding by two parameters
2.1 Strength of confounding
We now define a strength of confounding that measures how much
It is nontrivially related to
2.2 A second parameter characterizing confounding
The contribution of
For the entire Section 2 it is important to keep in mind that we always referred to the case where
3 Detecting confounders by the principle of generic orientation
3.1 Intuitive idea and background
where each
However, there are also other options to give a definite meaning to the term ‘independence’. To see this, consider some parametric model where each
After these general remarks, let us come back to the scenario where the causal hypothesis reads
For the more general DAG shown in Figure 1 we again assume that
To provide a first intuition about why the orientation of the resulting vector
For general
3.2 Defining ‘generic orientation’ via the induced spectral measure
We start with some notation and terminology and formally introduce two measures which have quite simple intuitive meanings.
where
where
By elementary spectral theory of symmetric matrices [17], we have:
While the spectral measure is a property of a matrix alone, the following measure describes the relation between a matrix and a vector:
for any measurable set
For each set of eigenvalues of
In analogy to Lemma 1 we obtain:
To deal with the above measures in numerical computations, each measure will be represented by two vectors: first, one vector
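Numerically, both representations are straightforward to compute. The following sketch (our own illustration, assuming the conventions above: the tracial measure puts weight 1/d on each eigenvalue, while the vector-induced measure puts weight ⟨φ_j, v⟩² on eigenvalue λ_j, with φ_j the corresponding eigenvector) returns each measure as a vector of support points and a vector of weights:

```python
import numpy as np

def tracial_measure(A):
    """Tracial spectral measure of a symmetric matrix A:
    every eigenvalue carries the same weight 1/d."""
    eigvals = np.linalg.eigvalsh(A)
    d = len(eigvals)
    return eigvals, np.full(d, 1.0 / d)

def vector_induced_measure(A, v):
    """Spectral measure induced by (A, v): eigenvalue lambda_j
    carries weight <phi_j, v>^2, where phi_j is the j-th eigenvector."""
    eigvals, eigvecs = np.linalg.eigh(A)
    weights = (eigvecs.T @ v) ** 2
    return eigvals, weights
```

Note that the weights of the vector-induced measure sum to ‖v‖², and its first moment equals vᵀAv, which is used repeatedly below.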
The crucial idea behind our detection of confounding is the insight that
weakly in probability.
The idea of this paper is that we expect
Covariance matrix of the noise of
Vector of causal structure coefficients
Vector of confounding structure coefficients
Then
Intuitively, eq. (13) just states that the causal part of
Assuming a rotation invariant generating model may appear too strong an assumption for practical purposes. It should therefore be noted that much weaker assumptions would probably also yield the same approximate identities for high dimensions. Possible quantitative versions of Lemma 3 for finite dimensions could show that the overwhelming majority of vectors on the surface of the unit sphere induce spectral measures that are close to the tracial one. Such quantitative versions could provide some insight into the robustness of our method against violations of the strong model assumptions (since a large class of non-rotation-invariant priors would then yield the same conclusions), but this goes beyond the scope of this paper. After all, the performance on real data cannot be predicted from purely theoretical arguments.
Motivated by the above asymptotic results, we assume to be
Postulate 1 (generic orientation of vectors). If eqs. (1) and (2) are structural equations corresponding to the causal DAG in Figure 1, and
We therefore build our algorithm in Section 5 on the postulates alone instead of using the generating model directly.
Note that eqs. (16), (17), and (18) only hold if the
Hence, eq. (19) postulates that the first moments of the two measures on the left and the right of eq. (16) almost coincide, while our method also accounts for higher-order moments, which the Trace Condition ignores. As already sketched in [19], the Trace Condition (19) is closely related to the concept of free independence in free probability theory [20]. In the appendix we will explain why eqs. (16), (17), (18) are also related to free independence in spirit, although there is no straightforward way to apply those concepts here.
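To get a numerical feel for the genericity argument behind the Trace Condition, one can check that for a random unit vector drawn independently of a covariance matrix, the first moment of the induced spectral measure, aᵀΣa, concentrates around the first moment of the tracial measure, tr(Σ)/d, in high dimensions. This is only an illustrative sketch of the principle, not the paper's statistical test:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 500
# a random covariance-type (symmetric positive semidefinite) matrix
G = rng.standard_normal((d, d))
Sigma = G @ G.T / d
# a random unit vector, drawn independently of Sigma
a = rng.standard_normal(d)
a /= np.linalg.norm(a)
lhs = a @ Sigma @ a        # first moment of the measure induced by (Sigma, a)
rhs = np.trace(Sigma) / d  # first moment of the tracial measure of Sigma
```

For d = 500 the two quantities typically agree up to a few percent; the relative deviation shrinks as the dimension grows.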
We should also mention [21], which is remotely related to this paper. The authors study a large number
4 Description of the method
4.1 Constructing typical spectral measures for given parameter values
for appropriate choices of
– Causal part: this part describes the spectral measure that would be obtained in the absence of confounding. It is induced by $\Sigma_{XX}$ and $\mathbf{a}$. According to eq. (16), it is approximated by the uniform distribution over the spectrum of $\Sigma_{XX}$, i.e., the tracial measure introduced in Definition 2. We therefore define
$$\nu^{\mathrm{causal}} := \mu^{\tau}_{\Sigma_{XX}}.$$
– Confounding part: we now approximate the spectral measure induced by the vector $\Sigma_{XX}^{-1}\mathbf{b}$ and $\Sigma_{XX}$. We will justify the construction later, after we have described all its steps. We first define the matrix $M_X := \mathrm{diag}(v_1^X,\dots,v_d^X)$, where $v_j^X$ are the eigenvalues of $\Sigma_{XX}$ in decreasing order. Then we define a rank-one perturbation of $M_X$ by
$$T := M_X + \eta\,\mathbf{g}\mathbf{g}^T,$$
where $\mathbf{g}$ is the vector $\mathbf{g} := (1,\dots,1)^T/\sqrt{d}$. We then compute the spectral measure induced by the vector $T^{-1}\mathbf{g}$ and $T$, and define
$$\nu_{\eta}^{\mathrm{confounded}} := \frac{1}{\|T^{-1}\mathbf{g}\|^2}\,\mu_{T,\,T^{-1}\mathbf{g}}.$$
– Mixing both contributions: finally, $\nu_{\beta,\eta}$ is a convex sum of the causal and the confounded part, where the mixing weight of the latter is given by the confounding strength:
$$\nu_{\beta,\eta} := (1-\beta)\,\nu^{\mathrm{causal}} + \beta\,\nu_{\eta}^{\mathrm{confounded}}.$$
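The three steps above can be sketched in code as follows (a minimal sketch under the conventions of Definitions 2 and 3; the function name is ours). The result is the family ν_{β,η}, represented as support points and weights:

```python
import numpy as np

def family_of_measures(Sigma_XX, beta, eta):
    """Construct nu_{beta,eta}: a convex mixture of the tracial
    (causal) part and the rank-one-perturbation (confounded) part."""
    d = Sigma_XX.shape[0]
    # eigenvalues of Sigma_XX in decreasing order
    v = np.sort(np.linalg.eigvalsh(Sigma_XX))[::-1]
    # causal part: tracial measure, weight 1/d per eigenvalue
    support_causal, w_causal = v, np.full(d, 1.0 / d)
    # confounded part: measure induced by T^{-1}g and T = M_X + eta g g^T
    g = np.ones(d) / np.sqrt(d)
    T = np.diag(v) + eta * np.outer(g, g)
    lam, phi = np.linalg.eigh(T)
    u = np.linalg.solve(T, g)
    w_conf = (phi.T @ u) ** 2
    w_conf /= w_conf.sum()  # normalization by ||T^{-1} g||^2
    # convex combination with mixing weight beta on the confounded part
    support = np.concatenate([support_causal, lam])
    weights = np.concatenate([(1 - beta) * w_causal, beta * w_conf])
    return support, weights
```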
which is made more precise by the following result:
by Theorem 10.2 in [22]. Hence the number of eigenvalues in a given interval can differ by at most one
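The interlacing property behind this eigenvalue-count argument is easy to verify numerically. The following sketch (dimension and perturbation strength chosen arbitrarily) checks that the spectra of M_X and of the rank-one perturbation T = M_X + η ggᵀ interlace for η > 0:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
eta = 0.5
# diagonal matrix with positive eigenvalues, as in the construction of T
lam_M = np.sort(rng.uniform(0.1, 1.0, d))
M = np.diag(lam_M)
g = np.ones(d) / np.sqrt(d)
lam_T = np.sort(np.linalg.eigvalsh(M + eta * np.outer(g, g)))
# positive rank-one update: lam_M[j] <= lam_T[j] <= lam_M[j+1]
interlaces = (np.all(lam_T >= lam_M - 1e-12)
              and np.all(lam_T[:-1] <= lam_M[1:] + 1e-12))
```

Since each eigenvalue of T lies between two consecutive eigenvalues of M_X, any interval contains nearly the same number of eigenvalues of both matrices.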
We now describe the main theoretical result of this article:
where
Hence, the theorem states that whenever Postulate 3.2 holds with sufficient accuracy and for sufficiently high dimension, then
Apart from this weak convergence result we also know that the measures
Two limiting cases as examples: To get a further intuition about
that is, the weights of
Hence, as expected from the remarks in Subsection 3.1, small eigenvalues get higher weights than the large ones (due to the weighting factor
4.2 Description of the algorithm
To estimate the confounding parameters we just take the element in the family
Algorithm 1: Estimating the strength of confounding
1: Input: i.i.d. samples from
2: Compute the empirical covariance matrices
3: Compute the regression vector
4: PHASE 1: Compute the spectral measure
5: Compute eigenvalues
6: Compute the weights
7: PHASE 2: Find the parameter values by eq. (26), where
8: Output: Estimated confounding strength
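A compact sketch of the whole pipeline might look as follows. Both phases are included; the grid over (β, η) and the moment-based distance between measures are our simplifications for illustration, since the paper's algorithm specifies its own distance in eq. (26):

```python
import numpy as np

def spectral_measure(A, v):
    """Eigenvalues of A with normalized weights <phi_j, v>^2."""
    lam, phi = np.linalg.eigh(A)
    w = (phi.T @ v) ** 2
    return lam, w / w.sum()

def model_measure(v_eigs, beta, eta):
    """nu_{beta,eta}: convex mix of the tracial measure and the
    rank-one-perturbation construction of Section 4.1."""
    d = len(v_eigs)
    g = np.ones(d) / np.sqrt(d)
    T = np.diag(v_eigs) + eta * np.outer(g, g)
    lam_c, w_c = spectral_measure(T, np.linalg.solve(T, g))
    support = np.concatenate([v_eigs, lam_c])
    weights = np.concatenate([np.full(d, (1 - beta) / d), beta * w_c])
    return support, weights

def moments(support, weights, kmax=4):
    """First kmax moments of a discrete measure."""
    return np.array([np.sum(weights * support ** k)
                     for k in range(1, kmax + 1)])

def estimate_confounding(X, Y):
    """Sketch of Algorithm 1: fit (beta, eta) so that nu_{beta,eta}
    best matches the observed spectral measure."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean()
    Sigma_XX = Xc.T @ Xc / n                      # empirical covariances
    Sigma_XY = Xc.T @ Yc / n
    a_hat = np.linalg.solve(Sigma_XX, Sigma_XY)   # regression vector
    # PHASE 1: spectral measure induced by (Sigma_XX, a_hat)
    obs = moments(*spectral_measure(Sigma_XX, a_hat / np.linalg.norm(a_hat)))
    v_eigs = np.sort(np.linalg.eigvalsh(Sigma_XX))[::-1]
    # PHASE 2: grid search over the family nu_{beta,eta}
    best_beta, best_dist = 0.0, np.inf
    for beta in np.linspace(0.0, 1.0, 21):
        for eta in np.logspace(-2, 2, 21):
            dist = np.abs(moments(*model_measure(v_eigs, beta, eta)) - obs).sum()
            if dist < best_dist:
                best_beta, best_dist = beta, dist
    return best_beta
```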
Based on these findings, we describe how to estimate
4.3 Remark on normalization
So far we have ignored the case where the variables
5 Proofs of asymptotic statements
5.1 Proof of Lemma 3
We have
5.2 Proof of Theorem 1
The difference between the left and the right hand sides of eq. (16) converges weakly to zero in probability due to Lemma 3 because
in probability. Note that this already follows from the fact that
because
5.3 Proof of Theorem 2
We first need some definitions and tools. The following one generalizes Definition 3 to infinite-dimensional Hilbert spaces (see [17] for spectral theory of self-adjoint operators):
We then define a map on the space of measures that will be a convenient tool for the proof:
We do not have a more explicit description of
Then we find:
By applying Lemma 6 to the operator
Moreover, we will need the following map:
for every measurable function
This is also easily verified by diagonalizing
We will also need the following result:
(weak continuity of
Since
Weak continuity of
Since we observe
(tracial measures coincide)If
The first term is zero due to Lemma 4 and the second one by assumption. Since the intervals generate the entire Borel sigma-algebra, the statement follows.
since it turns out that the proof below holds for this value of
by definition of
which coincides with eq. (36).
6 Experiments with simulated data
6.1 Estimation of strength of confounding
We first ran experiments where the data was generated according to our model assumptions: first, both the influence of
– Generate $\mathbf{E}$: first generate $n$ samples of a $d$-dimensional vector-valued Gaussian random variable $\tilde{\mathbf{E}}$ with mean zero and covariance matrix $\mathbf{I}$. Then generate a $d\times d$ random matrix $G$ whose entries are independent standard Gaussians and set $\mathbf{E} := G\tilde{\mathbf{E}}$.
– Generate scalar random variables $Z$ and $F$ by drawing $n$ samples of each independently from a standard Gaussian distribution.
– Draw scalar model parameters $c, r_{\mathbf{a}}, r_{\mathbf{b}}$ by independent draws from the uniform distribution on the unit interval.
– Draw vectors $\mathbf{a}, \mathbf{b}$ independently from a sphere of radius $r_{\mathbf{a}}$ and $r_{\mathbf{b}}$, respectively.
– Compute $\mathbf{X}$ and $Y$ via the structural eqs. (1) and (2).
Note that for the above generating process the computation of the true confounding strength
Figure 4 shows the results for dimensions
This observation for moderate dimensions is somewhat in contrast to the asymptotic statement that the sample size required for non-regularized estimation of covariance matrices (given some fixed error bound in terms of the operator norm) grows only with
Although the results for dimension
We found the true and estimated value of
7 Experiments with real data under controlled conditions
It is hard to find real data where the strength of confounding is known. This is because there are usually unknown confounders in addition to the ones that are obvious for observers with some domain knowledge. For this reason, we have designed an experiment where the variables are observables of technical devices among which the causal structure is known by construction of the experimental setup.
7.1 Setup for a confounded causal influence
To obtain a causal relation where
The cause
For this setup, the algorithm tends to underestimate confounding, but shows qualitatively the right tendency since
After inspecting some spectral measures for the above scenario we believe that the algorithm underestimates confounding for the following reason: The vector
The above setting contained the purely confounded and the purely causal scenario as limiting cases: the confounded one by putting the LEDs close to the sensor and the webcam, respectively, and setting the voltage to the maximal values, and the purely causal one by setting the voltage to zero.
To get further support for the hypothesis that the algorithm tends to behave qualitatively in the right way even when the estimated strength of confounding deviates from its true value, we also tested modifications of the above scenario that are purely confounded or purely causal by construction and not only by variation of parameters. This is described in Sections 7.2 and 7.3.
7.2 Purely causal scenarios
To build a scenario without confounding,
We tried this experiment with sample size
7.3 Purely confounded scenario
The setup in Figure 5 can be easily modified to a causal structure where the relation between the pixel vector
We have performed this experiment with sample size
8 Experiments with real data with partially known causal structure
The experiments in this section refer to real data where the causal structure is not known with certainty. For each data set, however, we will briefly discuss the plausibility of the results in light of our limited domain knowledge. The main purpose of the section is to show that the estimated values of confounding strength indeed spread over the whole interval
8.1 Taste of wine
showing that alcohol has by far the strongest association with taste. According to common experience, alcohol indeed has a significant influence on the taste. Also the other associations are likely to be mostly causal and not due to a confounder. We estimated confounding for this data set and obtained
Since the above experiments suggest that the set of ingredients influence the target variable taste essentially in an unconfounded way, we now explore what happens when we exclude one of the variables
Figure 7, left visualizes the weights of the spectral measure for the case where all
8.2 Chicago crime data
It seems reasonable that the unemployment rate has the strongest influence. It is, however, surprising that ‘no highschool diploma’ should have a negative influence on the number of crimes. This is probably due to a confounding effect. The estimated confounding strength reads
8.3 Compressive strength and ingredients of concrete
The amount of superplasticizer seems to have the strongest influence, followed by cement. The estimated confounding strength reads
9 Discussion
We have described a method that estimates the strength of a potential onedimensional common cause
The method is based on the assumption that the vector
Given the difficulty of the enterprise of inferring causal relations from observational data, one should not expect any method to detect the presence of confounders with certainty. Following this modest attitude, the results can be considered encouraging; after all, the joint distribution of
Although the theoretical justification of the method (using asymptotics for the dimension tending to infinity) suggests that the method should only be applied in high dimensions, it should be emphasized that we have so far computed the regression without regularization, which quickly requires prohibitively large sample sizes. Future work may apply the method after regularized regression, but then one has to make sure that the regularizer does not spoil the method by violating our symmetry assumptions.
To generalize the ideas of this paper to nonlinear causal relations one may consider kernel methods [40]. Then the covariance matrix in feature space captures also nonlinear relations [41] between the components
Regardless of whether there is a sufficient number of real-world problems for which the method developed here is applicable, it may help in understanding three things: first, it shows a possible formalization of the informal idea of ‘independence’ between
10 Appendix: Relation to free independence
Free probability theory [20, 30] defines a notion of independence that is asymptotically satisfied for independently chosen highdimensional random matrices. To sketch the idea, we start with a model of generating independent random matrices considered in [42]:
which is analogous to
for all
We would like to thank Uli Wannek for helping us with the implementation of the video experiment. Many thanks also to Roland Speicher and his group in Saarbrücken for helpful discussions about free probability theory and to Steffen Lauritzen for pointing out that real data sometimes show the MTP2 property, which may be an issue here.
References
 [3]↑
Spirtes P, Glymour C, Scheines R. Causation, Prediction, and Search (Lecture notes in statistics). New York, NY: SpringerVerlag, 1993.
 [5]↑
Hoyer P, Shimizu S, Kerminen A, Palviainen M. Estimation of causal effects using linear non-Gaussian causal models with hidden variables. Int J Approx Reason. 2008;49:362–378.
 [6]↑
Janzing D, Peters J, Mooij J, Schölkopf B. Identifying latent confounders using additive noise models. In: Ng A, Bilmes J, editors. Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009). Corvallis, OR, USA: AUAI Press, 2009:249–257.
 [7]↑
Janzing D, Sgouritsa E, Stegle O, Peters P, Schölkopf B. Detecting lowcomplexity unobserved causes. In: Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011). Available at: http://uai.sis.pitt.edu/papers/11/p383janzing.pdf.
 [8]↑
Janzing D, Balduzzi D, GrosseWentrup M, Schölkopf B. Quantifying causal influences. Ann Stat. 2013;41:2324–2358.
 [9]↑
Janzing D, Schölkopf B. Causal inference using the algorithmic Markov condition. IEEE Trans Inf Theo. 2010;56:5168–5194.
 [10]↑
Lemeire J, Janzing D. Replacing causal faithfulness with algorithmic independence of conditionals. Minds Mach. 2012;23:227–249.
 [11]↑
Li M, Vitányi P. An Introduction to Kolmogorov Complexity and its Applications. New York: Springer, 1997 (3rd edition: 2008).
 [12]↑
Janzing D, Steudel B. Justifying additivenoisebased causal discovery via algorithmic information theory. Open Syst Inf Dynam. 2010;17:189–212.
 [13]↑
Meek C. Strong completeness and faithfulness in Bayesian networks. In: Proceedings of 11th Uncertainty in Artificial Intelligence (UAI). Montreal, Canada: Morgan Kaufmann, 1995:411–418.
 [14]↑
Uhler C, Raskutti G, Bühlmann P, Yu B. Geometry of the faithfulness assumption in causal inference. Ann Stat. 2013;41:436–463.
 [18]↑
Janzing D, Hoyer P, Schölkopf B. Telling cause from effect based on high-dimensional observations. In: Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Haifa, Israel, 2010:479–486.
 [19]↑
Zscheischler J, Janzing D, Zhang K. Testing whether linear equations are causal: A free probability theory approach. In: Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), 2011. Available at: http://uai.sis.pitt.edu/papers/11/p839zscheischler.pdf.
 [20]↑
Voiculescu D, editor. Free probability theory, volume 12 of Fields Institute Communications. American Mathematical Society, 1997.
 [21]↑
Chandrasekaran V, Parrilo P, Willsky A. Latent variable graphical model selection via convex optimization. Ann Stat. 2012;40:1935–1967.
 [22]↑
Datta BN. Numerical Linear Algebra and Applications. Philadelphia, USA: Society for Industrial and Applied Mathematics, 2010.
 [23]↑
Cima J, Matheson A, Ross W. The Cauchy Transform. Mathematical Surveys and Monographs 125. American Mathematical Society, 2006.
 [24]↑
Simon B. Spectral analysis of rank one perturbations and applications. Lectures given at the Vancouver Summer School in Mathematical Physics, 1993.
 [25]↑
Simon B. Trace ideals and their applications. Providence, RI: American Mathematical Society, 2005.
 [26]↑
Kiselev A, Simon B. Rank one perturbations with infinitesimal coupling. J Funct Anal. 1995;130:345–356.
 [27]↑
Albeverio S, Konstantinov A, Koshmanenko V. The Aronszajn–Donoghue theory for rank one perturbations of the $H_{-2}$-class. Integral Equ Operat Theo. 2004;50:1–8.
 [28]↑
Albeverio S, Kurasov P. Rank one perturbations, approximations, and selfadjoint extensions. J Func Anal. 1997;148:152–169.
 [29]↑
Bartlett MS. An inverse matrix adjustment arising in discriminant analysis. Ann. Math. Statist. 1951;22:107–111.
 [31]↑
Bercovici H, Voiculescu D. Free convolution of measures with unbounded supports. Ind Univ Math J. 1993;42:733–773.
 [33]↑
Vershynin R.. How close is the sample covariance matrix to the actual covariance matrix? J Theo Probab. 2012;25:655–686.
 [34]↑
Karlin S, Rinott Y. Classes of orderings of measures and related correlation inequalities. I. multivariate totally positive distributions. J Multiv Anal. 1980;10:467–498.
 [35]↑
Fallat S, Lauritzen S, Sadeghi K, Uhler C, Wermuth N, Zwiernik P. Total positivity in Markov structures. To appear in Annals of Statistics, 2016.
 [37]↑
City of Chicago. Data portal: Chicago poverty and crime. Available at: https://data.cityofchicago.org/HealthHumanServices/Chicagopovertyandcrime/fwnspcmk.
 [38]↑
Yeh C. Concrete compressive strength data set. Available at: https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength.
 [39]↑
Yeh IC. Modeling of strength of high performance concrete using artificial neural networks. Cement Concrete Res. 1998.
 [41]↑
Gretton A, Herbrich R, Smola A, Bousquet O, Schölkopf B. Kernel methods for measuring independence. J Mach Learn Res. 2005;6:2075–2129.
Footnotes
This is an interesting phenomenon in high dimensions: a confounder may generate almost no observable covariance between
Note that quantifying causal influence in causal Bayesian networks is nontrivial and there exists no generally accepted measure [8].
See [10], Theorem 3, for a detailed discussion of the conditions under which one should trust this argument.
In this limit, the first eigenvector of
Note that
Note that some authors define the Cauchy transform as the negative of the below definition.
The dataset and all others used in the paper can be found at http://webdav.tuebingen.mpg.de/causality/.
Note also the concept of multivariate total positivity of order two (MTP2) [34], which implies positive correlations between all variables and occurs in many applications [35] such as Markov random fields that consist only of attractive interaction terms.
This site provides applications using data that has been modified for use from its original source, www.cityofchicago.org, the official website of the City of Chicago. The City of Chicago makes no claims as to the content, accuracy, timeliness, or completeness of any of the data provided at this site. The data provided at this site is subject to change at any time. It is understood that the data provided at this site is being used at one's own risk.