# On the Intersection Property of Conditional Independence and its Application to Causal Discovery

Jonas Peters 1
• 1 ETH Zürich, Switzerland

## Abstract

This work investigates the intersection property of conditional independence. It states that for random variables $A,B,C$ and X we have that $X⊥⊥A|B,C$ and $X⊥⊥B|A,C$ implies $X⊥⊥(A,B)|C$. Here, “$⊥⊥$” stands for statistical independence. Under the assumption that the joint distribution has a density that is continuous in $A,B$ and C, we provide necessary and sufficient conditions under which the intersection property holds. The result has direct applications to causal inference: it leads to strictly weaker conditions under which the graphical structure becomes identifiable from the joint distribution of an additive noise model.

## 1 Introduction

This paper investigates the intersection property of conditional independence. For continuous random variables $A,B,C$ and X this property states that $X⊥⊥A|B,C$ and $X⊥⊥B|A,C$ implies $X⊥⊥(A,B)|C$. Here, “$⊥⊥$” stands for statistical independence and “$⊥⊥/$” for statistical dependence (see Section 1.2 for precise definitions). The intersection property does not necessarily hold if the joint distribution does not have a density (e.g. Dawid ). Dawid  provides measure-theoretic necessary and sufficient conditions for the intersection property. In this work we assume the existence of a density (A0), see below.

It is well known that the intersection property holds if the joint distribution has a strictly positive density (e.g. Pearl , 1.1.5). Proposition 1 shows that if the density is not strictly positive, a weaker condition than the intersection property still holds. Corollary 1 states necessary and sufficient conditions for the intersection property. The result about strictly positive densities is contained as a special case. Drton et al. (, exercise 6.6) and Fink  develop analogous results for the discrete case.

In the remainder of this introduction we discuss the paper’s main contribution (Section 1.1) and introduce the required notation (Section 1.2).

### 1.1 Main contributions

In Section 3 we provide a sufficient and necessary condition on the density for the intersection property to hold (Corollary 1). This result is of interest in itself since the developed condition is weaker than strict positivity.

Studying the intersection property has direct applications to causal inference. Inferring causal relationships is a major challenge in science. In the last decades considerable effort has been made in order to learn causal statements from observational data. As a first step, causal discovery methods therefore estimate graphs from observational data and attach a causal meaning to these graphs (the terminology of causal inference is introduced in Section 4.1). Some causal discovery methods based on structural equation models (SEMs) require the intersection property for identification; they therefore rely on the strict positivity of the density. This is satisfied if the noise variables have full support, for example. Using the new characterization of the intersection property we can now replace the condition of strict positivity. In fact, we show in Section 4 that noise variables with a path-connected support are sufficient for identifiability of the graph (Proposition 3). This is already known for linear SEMs  but not for non-linear models. As an alternative, we provide a condition that excludes a specific kind of constant functions and leads to identifiability, too (Proposition 4).

In Section 2, we provide an example of an SEM that violates the intersection property. Its corresponding graph is not identifiable from the joint distribution. In correspondence to the theoretical results of this work, some noise densities in the example do not have a path-connected support and the functions are partially constant. We are not aware of any causal discovery method that is able to infer the correct graph or the correct Markov equivalence class; the example therefore shows current limits of causal inference techniques. It is non-generic in the sense that it violates all sufficient assumptions mentioned in Section 4.

All proofs are provided in Appendix A.

### 1.2 Conditional independence and the intersection property

We now formally introduce the concept of conditional independence in the presence of densities and the intersection property. To this end, let $A,B,C$ and $X$ be (possibly multi-dimensional) random variables that take values in metric spaces $\mathcal{A},\mathcal{B},\mathcal{C}$ and $\mathcal{X}$, respectively. We first introduce assumptions regarding the existence of a density and some of its properties that appear in different parts of this paper.

1. (A0) The distribution is absolutely continuous with respect to a product measure of a metric space. We denote the density by $p(\cdot)$. This can be a probability mass function or a probability density function, for example.
2. (A1) The density $(a,b,c)\mapsto p(a,b,c)$ is continuous. If there is no variable $C$ (or $C$ is deterministic), then $(a,b)\mapsto p(a,b)$ is continuous.
3. (A2) For each $c$ with $p(c)>0$ the set $\mathrm{supp}_c(A,B):=\{(a,b):p(a,b,c)>0\}$ contains only one path-connected component (see Section 3).
4. (A2′) The density $p(\cdot)$ is strictly positive.

Condition (A2′) implies (A2). We assume (A0) throughout the whole work.

In this paper we work with the following definition of conditional independence.

Definition 1 (Conditional (In)dependence). We call $X$ independent of $A$ conditional on $B$ and write $X⊥⊥A|B$ if and only if

$p(x,a|b)=p(x|b)\,p(a|b)$

for all $x,a,b$ such that $p(b)>0$. Otherwise, $X$ and $A$ are dependent conditional on $B$ and we write $X⊥⊥/A|B$.

The intersection property of conditional independence is defined as follows (e.g. Pearl , 1.1.5).

Definition 2 (Intersection Property). We say that the joint distribution of $X,A,B,C$ satisfies the intersection property if

$X⊥⊥A|B,C \quad\text{and}\quad X⊥⊥B|A,C \;\Rightarrow\; X⊥⊥(A,B)|C. \qquad (2)$

The intersection property (2) has been proven to hold for strictly positive densities (e.g. Pearl , 1.1.5). The other direction “$⇐$” is known as the “weak union” of conditional independence .
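As a small illustration of Definitions 1 and 2, the following sketch checks conditional independence exactly for a probability mass function, which (A0) allows as a density. The toy distribution, the dictionary encoding and the helper names `marginal` and `cond_indep` are assumptions made for this illustration and do not appear in the paper. There is no variable $C$; $A$ and $B$ are perfectly coupled, so the density is not strictly positive, and $X$ is a copy of $A$.

```python
from fractions import Fraction
from itertools import product

def marginal(pmf, idx):
    """Sum the joint pmf (dict: outcome tuple -> probability) onto coordinates idx."""
    out = {}
    for outcome, prob in pmf.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, Fraction(0)) + prob
    return out

def cond_indep(pmf, xs, ys, zs):
    """Definition 1, checked in the equivalent product form
    p(x,y,z) p(z) = p(x,z) p(y,z) for all x, y and all z with p(z) > 0."""
    p_xyz = marginal(pmf, xs + ys + zs)
    p_xz = marginal(pmf, xs + zs)
    p_yz = marginal(pmf, ys + zs)
    p_z = marginal(pmf, zs)
    x_vals = set(marginal(pmf, xs))
    y_vals = set(marginal(pmf, ys))
    for z, pz in p_z.items():
        for x, y in product(x_vals, y_vals):
            lhs = p_xyz.get(x + y + z, Fraction(0)) * pz
            rhs = p_xz.get(x + z, Fraction(0)) * p_yz.get(y + z, Fraction(0))
            if lhs != rhs:
                return False
    return True

# Hypothetical toy distribution over (X, A, B): mass 1/2 on (0,0,0) and on (1,1,1).
X, A, B = (0,), (1,), (2,)
pmf = {(0, 0, 0): Fraction(1, 2), (1, 1, 1): Fraction(1, 2)}

print(cond_indep(pmf, X, A, B))       # X independent of A given B?
print(cond_indep(pmf, X, B, A))       # X independent of B given A?
print(cond_indep(pmf, X, A + B, ()))  # X independent of (A, B)?
```

The three calls print `True`, `True` and `False`: both premises of the intersection property hold, but its conclusion fails, because the support $\{(0,0),(1,1)\}$ of $(A,B)$ leaves $p$ equal to zero at $(0,1)$ and $(1,0)$.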

## 2 Counterexample

We now give an example of a distribution that does not satisfy the intersection property (2). Since the joint distribution has a continuous density, the example shows that the intersection property requires further restrictions on the density apart from its existence. We will later use the same idea to prove Proposition 2 that shows the necessity of our new condition.

Example 1. Consider a so-called additive noise model (ANM; see Section 4.1) for random variables $X,A,B$:
$A=N_A,\quad B=A+N_B,\quad X=f(B)+N_X, \qquad (3)$
where $N_A,N_B,N_X$ are jointly independent, have continuous densities and satisfy $\mathrm{supp}(N_A):=\{n:p_{N_A}(n)>0\}=(-2;-1)\cup(1;2)$ and $\mathrm{supp}(N_B)=\mathrm{supp}(N_X)=(-0.3;0.3)$. Let the function $f$ be of the form
$f(b)=\begin{cases}10 & \text{if } b>0.5,\\ 0 & \text{if } b<-0.5,\\ g(b) & \text{else},\end{cases}$
where the function $g$ can be chosen to make $f$ arbitrarily smooth. Some parts of this structural equation model (SEM) are summarized in Figure 1. The distribution satisfies $X⊥⊥A|B$ and $X⊥⊥B|A$ but $X⊥⊥/A$ and $X⊥⊥/B$. The (intuitive) reason for this is as follows: we see $X⊥⊥A|B$ from eq. (3). Further, if we know that $A$ (or $B$) is positive, $X$ has to take values close to ten and thus $X⊥⊥/A$ ($X⊥⊥/B$); but when knowing that $A$ is positive, the knowledge of $B$ does not provide any additional information about $X$ ($X⊥⊥B|A$). This means that the intersection property is violated. A formal proof is provided in the more general setting of Proposition 2. Within each component, however, that is, if we consider the areas $A,B>0$ and $A,B<0$ separately, we do have the independence statement $X⊥⊥(A,B)$; therefore the intersection property holds “locally”. This observation will be formalized as the weak intersection property in Proposition 1.

It will turn out to be important that the two path-connected components of the support of A and B cannot be connected by an axis-parallel line. This motivates the notation introduced in Section 3. Remark 1 in Section 4 discusses the causal interpretation of Example 1.
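Example 1 can be simulated to make the two separated components visible. The sketch below is an illustration under stated assumptions: the noise densities are rescaled Beta(2,2) densities (continuous, with the supports required by the example), and $g$ is taken to be the linear interpolant $g(b)=5+10b$, which makes $f$ continuous but not smoother. Neither choice is prescribed by the example, and the names `f`, `scaled_beta` and `sample_example1` are hypothetical.

```python
import random

def f(b):
    """Piecewise function of Example 1 with the assumed interpolant g(b) = 5 + 10 b."""
    if b > 0.5:
        return 10.0
    if b < -0.5:
        return 0.0
    return 5.0 + 10.0 * b  # g(b), continuous at b = -0.5 and b = 0.5

def scaled_beta(rng, lo, hi):
    """Draw from a Beta(2, 2) density rescaled to (lo, hi); it vanishes at both ends."""
    return lo + (hi - lo) * rng.betavariate(2, 2)

def sample_example1(n, seed=0):
    """Draw n samples (a, b, x) from the SEM A = N_A, B = A + N_B, X = f(B) + N_X."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # N_A has support (-2, -1) u (1, 2): random sign times a draw from (1, 2).
        n_a = rng.choice([-1.0, 1.0]) * scaled_beta(rng, 1.0, 2.0)
        n_b = scaled_beta(rng, -0.3, 0.3)
        n_x = scaled_beta(rng, -0.3, 0.3)
        a = n_a
        b = a + n_b
        x = f(b) + n_x
        samples.append((a, b, x))
    return samples

samples = sample_example1(2000)
```

In every draw the sign of $B$ agrees with the sign of $A$, and $X$ lies within 0.3 of ten when $A>0$ and within 0.3 of zero when $A<0$. The region $|b|\le 0.5$, where $g$ matters, is never reached, which is why the choice of $g$ does not affect the distribution.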

## 3 Necessary and sufficient condition for the intersection property

This section characterizes the intersection property in terms of the joint density over the corresponding random variables. In particular, we state a weak intersection property (Proposition 1) that leads to a necessary and sufficient condition for the classical intersection property, see Corollary 1.

We will see that the intersection property fails in Example 1 because of the two “separated” components in Figure 1. In order to formulate our results we first require the notion of path-connectedness. A continuous mapping $\lambda:[0,1]\to\mathcal{X}$ into a metric space $\mathcal{X}$ is called a path between $\lambda(0)$ and $\lambda(1)$ in $\mathcal{X}$. A subset $\mathcal{S}\subseteq\mathcal{X}$ is called path-connected if every pair of points in $\mathcal{S}$ can be connected by a path in $\mathcal{S}$. We can always decompose $\mathcal{X}$ into its (disjoint) path-connected components (see footnote 1). The following definition provides a formalization of the intuition that the two components in Figure 1 are “separated”.

Definition 3. (i) For each $c$ with $p(c)>0$ we define the (not necessarily closed) support of $A$ and $B$ as
$\mathrm{supp}_c(A,B):=\{(a,b):p(a,b,c)>0\}.$
We further write for all sets $M\subseteq\mathcal{A}\times\mathcal{B}$ (subsets of the product of the spaces in which $A$ and $B$ take values)
$\mathrm{proj}_A(M):=\{a\in\mathcal{A}:\exists\, b \text{ with } (a,b)\in M\}\quad\text{and}$
$\mathrm{proj}_B(M):=\{b\in\mathcal{B}:\exists\, a \text{ with } (a,b)\in M\}.$
(ii) We denote the path-connected components of $\mathrm{supp}_c(A,B)$ by $Z_i^c$, $i\in I_Z^c$, with some index set $I_Z^c$. Two path-connected components $Z_{i_1}^c$ and $Z_{i_2}^c$ are said to be coordinate-wise connected if
$\mathrm{proj}_A(Z_{i_1}^c)\cap\mathrm{proj}_A(Z_{i_2}^c)\neq\emptyset\quad\text{or}$
$\mathrm{proj}_B(Z_{i_1}^c)\cap\mathrm{proj}_B(Z_{i_2}^c)\neq\emptyset.$
(The intuition is that we can draw an axis-parallel line from $Z_{i_1}^c$ to $Z_{i_2}^c$.) We then say that $Z_i^c$ and $Z_j^c$ are equivalent if and only if there exists a sequence $Z_i^c=Z_{i_1}^c,\dots,Z_{i_m}^c=Z_j^c$ with all neighbours $Z_{i_k}^c$ and $Z_{i_{k+1}}^c$ $(k=1,\dots,m-1)$ being coordinate-wise connected. We represent each equivalence class by the union of all its members. These unions we denote by $U_i^c$, $i\in I_U^c$.
We further introduce a deterministic function $U^c$ of the variables $A$ and $B$. We set
$U^c:=u^c(A,B):=\begin{cases} i & \text{if } (A,B)\in U_i^c,\\ 0 & \text{if } p(A,B,c)=0.\end{cases}$
We have that $U^c=i$ if and only if $A\in\mathrm{proj}_A(U_i^c)$ if and only if $B\in\mathrm{proj}_B(U_i^c)$. Furthermore, the projections $\mathrm{proj}_A(U_i^c)$ are disjoint for different $i$; the same holds for $\mathrm{proj}_B(U_i^c)$.

(iii) The case where there is no variable $C$ can be treated as if $C$ was deterministic: $p(c)=1$ for some $c$.

In Example 1 there is no variable $C$. Figure 1 shows the support $\mathrm{supp}_c(A,B)$ in black. It contains two path-connected components $Z_1^c$ and $Z_2^c$. Since they cannot be connected by axis-parallel lines, they are not equivalent; thus, one of them corresponds to the equivalence class $U_1^c$ and the other to $U_2^c$. Figure 2 shows another example that contains three equivalence classes of path-connected components; again, there is no variable $C$; we formally introduce a deterministic variable $C$ that always takes the value $c$.
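The grouping of components into equivalence classes in Definition 3(ii) only depends on the projections of each component, so it can be sketched with a small union-find routine. The sketch assumes real-valued $A$ and $B$ and that each component is described by its two projections, given as open intervals `(lo, hi)`; this encoding and the names `overlaps` and `equivalence_classes` are hypothetical and not part of the paper.

```python
def overlaps(i1, i2):
    """Two open intervals (lo, hi) share a point iff each starts before the other ends."""
    return i1[0] < i2[1] and i2[0] < i1[1]

def equivalence_classes(projections):
    """Label components by the transitive closure of coordinate-wise connectedness.

    projections: list of (proj_A, proj_B) pairs, one per component Z_i,
                 each projection being an open interval (lo, hi).
    Returns a list u with u[i] = label (1, 2, ...) of the class containing Z_i.
    """
    parent = list(range(len(projections)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(projections)):
        for j in range(i + 1, len(projections)):
            (a_i, b_i), (a_j, b_j) = projections[i], projections[j]
            # Coordinate-wise connected: the A-projections or the B-projections intersect.
            if overlaps(a_i, a_j) or overlaps(b_i, b_j):
                parent[find(i)] = find(j)

    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1) for i in range(len(projections))]

# The two components of Example 1: A in (-2,-1) or (1,2), B = A + N_B with |N_B| < 0.3.
example1 = [((-2.0, -1.0), (-2.3, -0.7)), ((1.0, 2.0), (0.7, 2.3))]
print(equivalence_classes(example1))
```

For Example 1 the routine returns two different labels, in line with the two classes $U_1^c$ and $U_2^c$ described above. A single label for all components corresponds to the situation in which only one set $U_1^c$ exists.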

Using Definition 3 we are now able to state the two main results, Propositions 1 and 2. As a direct consequence we obtain Corollary 1 which generalizes the condition of strictly positive densities.

Proposition 1 (Weak Intersection Property). Assume (A0), (A1) and that $X⊥⊥A|B,C$ and $X⊥⊥B|A,C$. Consider now $c$ with $p(c)>0$ and the variable $U^c$ as defined in Definition 3(ii). We then have the weak intersection property:

$X⊥⊥(A,B)|C=c,U^c.$

This means that

$p(x|a,b,c)=p(x|c,u^c(a,b)) \qquad (5)$

for all $x,a,b$ with $p(a,b,c)>0$. The values of $A,B$ do not provide additional information if we already know $U^c=u^c(A,B)$.

We call this property the weak intersection property for the following reason: if $X⊥⊥(A,B)|C$ holds, then, since by definition $u^c(a,b)=i$ if and only if $(a,b)\in U_i^c$, we have
$p(x|a,b,c)=p(x|c)=p\big(x|c,(a,b)\in U^c_{u^c(a,b)}\big)=p(x|c,u^c(a,b)).$
In this sense, eq. (5) is strictly weaker than $X⊥⊥(A,B)|C$.

Furthermore, Proposition 1 includes the intersection property for positive densities as a special case. If the density is indeed strictly positive, then there is only a single path-connected component $Z_1^c$ and a single equivalence class $U_1^c$. Therefore, $U^c$ is constant and it follows from eq. (5) and Lemma 1 (see “Proof of Proposition 1” in Appendix A) that $X⊥⊥(A,B)|C$.

Proposition 2 (Failure of Intersection Property). Assume (A0), (A1) and that there exist two different sets $U_1^{c^*}\neq U_2^{c^*}$ for some $c^*$ with $p(c^*)>0$. Then there exists a random variable $X$ such that the intersection property (2) does not hold for the joint distribution of $X,A,B,C$.

As a direct corollary from these two propositions we obtain a characterization of the intersection property in the case of continuous densities.

Corollary 1 (Intersection Property). Assume (A0) and (A1).

The intersection property (2) holds for all variables $X$ if and only if all components $Z_i^c$ are equivalent, i.e. there is only one set $U_1^c$.

In particular, this is the case if (A2) holds (there is only one path-connected component) or (A2′) holds (the density is strictly positive).

## 4 Application to causal discovery

We will first introduce some graph notation that we use for formulating the application to causal inference.

### 4.1 Notation and prerequisites

Standard graph definitions can be found in Lauritzen, Spirtes et al. and many others. We follow the presentation of Section 1.1 in Peters et al. A graph $G=(V,E)$ contains nodes $V=\{1,\dots,p\}$ (often identified with random variables $X_1,\dots,X_p$) and edges $E\subset V\times V$ between nodes. A graph $G_1=(V_1,E_1)$ is called a proper subgraph of $G$ if $V_1=V$ and $E_1\subset E$ with $E_1\neq E$. A node $i$ is called a parent of $j$ if $(i,j)\in E$ and a child if $(j,i)\in E$. The set of parents of $j$ is denoted by $\mathbf{PA}_j^G$, the set of its children by $\mathbf{CH}_j^G$. Two nodes $i$ and $j$ are adjacent if either $(i,j)\in E$ or $(j,i)\in E$. We say that there is an undirected edge between two adjacent nodes $i$ and $j$ if $(i,j)\in E$ and $(j,i)\in E$. An edge between two adjacent nodes is directed if it is not undirected. We then write $i\to j$ for $(i,j)\in E$. Three nodes are called a v-structure if one node is a child of the two others that themselves are not adjacent.

A path in $G$ is a sequence of (at least two) distinct vertices $i_1,\dots,i_n$, such that there is an edge between $i_k$ and $i_{k+1}$ for all $k=1,\dots,n-1$. If $i_k\to i_{k+1}$ for all $k$ we speak of a directed path from $i_1$ to $i_n$ and call $i_n$ a descendant of $i_1$. We denote all descendants of $i$ by $\mathbf{DE}_i^G$ and all non-descendants of $i$, excluding $i$, by $\mathbf{ND}_i^G$. In this work, $i$ is neither a descendant nor a non-descendant of itself. $G$ is called a directed acyclic graph (DAG) if all edges are directed and there is no pair of nodes $(j,k)$ such that there are directed paths from $j$ to $k$ and from $k$ to $j$. In a DAG, a path between $i_1$ and $i_n$ is blocked by a set $S$ (with neither $i_1$ nor $i_n$ in this set) whenever there is a node $i_k$ such that one of the following two possibilities holds: (1) $i_k\in S$ and $i_{k-1}\to i_k\to i_{k+1}$ or $i_{k-1}\leftarrow i_k\leftarrow i_{k+1}$ or $i_{k-1}\leftarrow i_k\to i_{k+1}$, or (2) $i_{k-1}\to i_k\leftarrow i_{k+1}$ and neither $i_k$ nor any of its descendants is in $S$.

We say that two disjoint subsets of vertices $A$ and $B$ are d-separated by a third (also disjoint) subset $S$ if every path between nodes in $A$ and $B$ is blocked by $S$. A joint distribution is said to be Markov with respect to the DAG $G$ if $A,B \text{ d-sep. by } C \;\Rightarrow\; A⊥⊥B|C$ for all disjoint sets $A,B,C$. It is said to be faithful to the DAG $G$ if $A,B \text{ d-sep. by } C \;\Leftarrow\; A⊥⊥B|C$ for all disjoint sets $A,B,C$. Finally, a distribution satisfies causal minimality with respect to $G$ if it is Markov with respect to $G$, but not to any proper subgraph of $G$.

In order to infer graphs from distributions, one requires assumptions that relate the joint distribution with properties of the graph, which is often assumed to be a DAG. Constraint-based or independence-based methods [3, 8] and some score-based methods [10, 11] assume the Markov condition and faithfulness. These two assumptions make the Markov equivalence class of the correct graph identifiable from the joint distribution, i.e. the skeleton and the v-structures of the graph can be inferred from the joint distribution .

Alternatively [6, 9, 13], we can assume additive noise models (ANMs). In these models, the joint distribution over $X_1,\dots,X_p$ is generated by an SEM
$X_i=f_i(X_{\mathbf{PA}_i})+N_i,\qquad i=1,\dots,p, \qquad (6)$
with continuous, non-constant functions $f_i$, additive and jointly independent noise variables $N_i$ with mean zero and sets $\mathbf{PA}_i$ that are the parents of $i$ in a DAG $G$. To simplify notation, we have identified variable $X_i$ with its index (or node) $i$. These models can be shown to satisfy the Markov condition (Pearl, theorem 1.4.1); the functions $f_i$ being non-constant corresponds to causal minimality (Peters et al., proposition 17), which is strictly weaker than faithfulness. We now define what we mean by identifiability of the DAG in continuous ANMs. Consider a certain class of SEMs and suppose that the distribution $P=P(X_1,\dots,X_p)$ is generated from such an SEM. We say that $G$ is identifiable from $P$ if $P$ cannot be generated by an SEM from the same class but with a different graph $H\neq G$.
Loosely speaking, Peters et al. (theorem 28) prove that
$(∗)\quad \text{The identifiability of model classes extends from DAGs with two nodes to DAGs with an arbitrary number of variables.}$

### 4.2 Intersection property and causal discovery

We first revisit Example 1 and interpret it from a causal point of view.

Remark 1 (Example 1 continued). Example 1 has the following important implication for causal inference. The distribution can be generated by two different DAGs, namely $A\to B\to X$ and $X\leftarrow A\to B$, see Figure 1. The SEM (3) corresponds to the former DAG. A slightly modified version of eq. (3), where $X=\tilde f(A)+N_X$ replaces the last equation in eq. (3), corresponds to the latter DAG. The distribution satisfies causal minimality with respect to both DAGs. Since it violates faithfulness and the intersection property, we are not aware of any causal inference method that is able to recover the correct graph structure based on observational data only. Recall that Peters et al. assume strictly positive densities in order to assure the intersection property. More precisely, Example 1 shows that lemma 38 in Peters et al. (see Appendix B) does not hold anymore when the positivity is violated.

In order to prove $(∗)$, Peters et al. require a strictly positive density. This is because the key result used in the proof is proposition 29, which is proved using lemma 38, which itself relies on the intersection property (proposition 29 and lemma 38 are provided in Appendix B). But since Corollary 1 provides weaker assumptions for the intersection property, we are now able to obtain new identifiability results.

Proposition 3. Assume that a joint distribution over $X_1,\dots,X_p$ is generated by an ANM (6). Assume further that the noise variables have continuous densities and that the support of each noise variable $N_i$, $i=1,\dots,p$, is path-connected. Then, statement $(∗)$ holds.

Example 1 violates the assumption of Proposition 3 since the support of A is not path-connected. It satisfies another important property, too: the function f is constant on some intervals. The following proposition shows that this is necessary to violate identifiability.

Proposition 4. Assume that a joint distribution over $X_1,\dots,X_p$ is generated by an ANM (6) with graph $G$. Let us denote the non-descendants of $X_i$ by $\mathbf{ND}_i^G$. Assume that the structural equations are non-constant in the following way: for all $X_i$, for all its parents $X_j\in\mathbf{PA}_i$ and for all $X_C\subseteq\mathbf{ND}_i^G\setminus\{X_j\}$, there are $(x_j,x_j',x_k,x_c)$ such that $f_i(x_j,x_k)\neq f_i(x_j',x_k)$ and $p(x_j,x_k,x_c)>0$ and $p(x_j',x_k,x_c)>0$. Here, $x_k$ represents the value of all parents of $X_i$ except $X_j$. Then for any $\mathbf{PA}_i\setminus\{j\}\subseteq S\subseteq\mathbf{ND}_i^G\setminus\{j\}$, it holds that $X_i⊥⊥/X_j|S$. Therefore, statement $(∗)$ follows.

Proposition 4 provides an alternative way to prove identifiability. The results are summarized in Table 1.

Table 1

This table shows conditions for continuous additive noise models (ANMs) that lead to identifiability of the directed acyclic graph from the joint distribution. Using the characterization of the intersection property we could weaken the condition of a strictly positive density.

| Additional assumption on continuous ANMs | Identifiability of graph, see $(∗)$ |
| --- | --- |
| Noise variables with full support | ✓ Peters et al. |
| Noise variables with path-connected support | ✓ Proposition 3 |
| Non-constant functions, see Proposition 4 | ✓ Proposition 4 |
| None of the above satisfied | ✕ Example 1 |

## 5 Conclusions

It is possible to prove the intersection property of conditional independence for variables whose distributions do not have a strictly positive density. A necessary and sufficient condition for the intersection property is that all path-connected components of the support of the density are equivalent, that is, they can be connected by axis-parallel lines. In particular, this condition is satisfied for densities whose support is path-connected. In the general case, the intersection property still holds after conditioning on an equivalence class of path-connected components; we call this the weak intersection property. We believe that the assumption of a density that is continuous in A, $B$ and C can be weakened even further.

This insight has a direct application in causal inference (which is rather of theoretical nature than having implications for practical methods). In the context of continuous ANMs, we relax important conditions for identifiability of the graph from the joint distribution. Furthermore, there is some interest in uniform consistency in causal inference. For linear Gaussian SEMs, for example, the PC algorithm  exploits conditional independences, that is, vanishing partial correlations. Zhang and Spirtes  prove uniform consistency under the assumption that non-vanishing partial correlations cannot be arbitrarily close to zero (this condition is referred to as “strong faithfulness”). Our work suggests that in order to prove uniform consistency for continuous ANMs, one may need to be “bounded away” from Example 1.

Acknowledgements

The author thanks the anonymous reviewers for their insightful and constructive comments. He further thanks Thomas Kahle and Mathias Drton for pointing out the link to algebraic statistics for discrete variables. The research leading to these results has received funding from the People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no 326496.

# Appendix A

## Proofs

### Proof of Proposition 1

We require the following well-known lemma (e.g. Dawid ).

Lemma 1. We have $X⊥⊥A|B$ if and only if
$p(x|a,b)=p(x|b)$
for all $x,a,b$ such that $p(a,b)>0$.
Proof (of Proposition 1). To simplify notation we write $u^c:=u^c(a,b)$. We have by Lemma 1
$p(x|b,c)=p(x|a,b,c)=p(x|a,c) \qquad (7)$
for all $x,a,b,c$ with $p(a,b,c)>0$. As the main argument we show that
$p(x|b,c)=p(x|\tilde b,c) \qquad (8)$
for all $x,b,\tilde b,c$ with $b,\tilde b\in\mathrm{proj}_B(U_i^c)$ for the same $i$.
Step 1, we prove eq. (8) for $b,\tilde b\in\mathrm{proj}_B(Z_i^c)$. We first show that there is a path $\lambda:t\mapsto(a(t),b(t))$, such that $p(a(t),b(t),c)>0$ for all $0\le t\le 1$, and $b(0)=b$ and $b(1)=\tilde b$. Since the interval $[0,1]$ is compact and $\lambda$ is continuous, the path $\{(a(t),b(t)):0\le t\le 1\}$ is compact, too (for notational simplicity we identify the path $\lambda$ with its image). Define for each point $(a(t),b(t))$ on the path an open ball with radius small enough such that all $(a,b)$ in the ball satisfy $p(a,b,c)>0$ (this is possible because $(a,b,c)\mapsto p(a,b,c)$ is assumed to be continuous). Because these balls are path-connected, they also lie in $Z_i^c$. They form an open cover of the path $\{(a(t),b(t)):0\le t\le 1\}$, and we can thus choose a finite subset of balls, of size $n$ say, that still provides an open cover of the path. Without loss of generality let $(a(0),b(0))$ be the centre of ball 1 and $(a(1),b(1))$ be the centre of ball $n$. It suffices to show that eq. (8) holds for the centres of two neighbouring balls, say $(a_1,b_1)$ and $(a_2,b_2)$. Choose a point $(a^*,b^*)$ from the non-empty intersection of those two balls. Since $d((a_1,b_1),(a^*,b_1))\le d((a_1,b_1),(a^*,b^*))$ and $d((a_2,b_2),(a_2,b^*))\le d((a_2,b_2),(a^*,b^*))$ for the Euclidean metric $d$, the points $(a^*,b_1)$ and $(a_2,b^*)$ lie in the first and the second ball, respectively. We therefore have that $p(a_1,b_1,c)$, $p(a^*,b_1,c)$, $p(a^*,b^*,c)$, $p(a_2,b^*,c)$ and $p(a_2,b_2,c)$ are all greater than zero. Therefore, using eq. (7) several times,
$p(x|b_1,c)=p(x|a_1,c)=p(x|a^*,c)=p(x|b^*,c)=p(x|a_2,c)=p(x|b_2,c).$
This shows eq. (8) for $b,\tilde b\in\mathrm{proj}_B(Z_i^c)$.
Step 2, we prove eq. (8) for $b\in\mathrm{proj}_B(Z_i^c)$ and $\tilde b\in\mathrm{proj}_B(Z_{i+1}^c)$, where $Z_i^c$ and $Z_{i+1}^c$ are coordinate-wise connected (and thus equivalent). If $b^*\in\mathrm{proj}_B(Z_i^c)\cap\mathrm{proj}_B(Z_{i+1}^c)$, we know that
$p(x|b,c)=p(x|b^*,c)=p(x|\tilde b,c)$
from the argument given in step 1. If $a^*\in\mathrm{proj}_A(Z_i^c)\cap\mathrm{proj}_A(Z_{i+1}^c)$, then there are $b_i,b_{i+1}$ such that $(a^*,b_i)\in Z_i^c$ and $(a^*,b_{i+1})\in Z_{i+1}^c$. By eq. (7) and the argument from step 1 we have
$p(x|b,c)=p(x|b_i,c)=p(x|b_{i+1},c)=p(x|\tilde b,c).$
We can now combine these two steps in order to prove the original claim from eq. (8). If $b,\tilde b\in\mathrm{proj}_B(U_i^c)$ then $b\in\mathrm{proj}_B(Z_1^c)$ and $\tilde b\in\mathrm{proj}_B(Z_n^c)$, say. Further, there is a sequence $Z_1^c,\dots,Z_n^c$ with $Z_k^c$ and $Z_{k+1}^c$ being coordinate-wise connected for $k=1,\dots,n-1$. Combining steps 1 and 2 proves eq. (8).

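Consider now $x,b,c$ such that $p(b,c)>0$ (which implies $p(c)>0$) and consider $u^c=i$, say. Observe further that $p(a,c)>0$ for $a\in\mathrm{proj}_A(U_i^c)$. We thus have
$\begin{aligned} p(x,u^c|c) &= \int_a p(x,a,u^c|c)\,da = \int_{a\in\mathrm{proj}_A(U_i^c)} p(x,a|c)\,da \\ &= \int_{a\in\mathrm{proj}_A(U_i^c)} \frac{p(x,a,c)\,p(a,c)}{p(c)\,p(a,c)}\,da = \int_{a\in\mathrm{proj}_A(U_i^c)} p(x|a,c)\,p(a|c)\,da \\ &= \int_{a\in\mathrm{proj}_A(U_i^c),\,p(a,b,c)>0} p(x|a,c)\,p(a|c)\,da + \int_{a\in\mathrm{proj}_A(U_i^c),\,p(a,b,c)=0} p(x|a,c)\,p(a|c)\,da \\ &\overset{(7)}{=} p(x|b,c)\int_{a\in\mathrm{proj}_A(U_i^c),\,p(a,b,c)>0} p(a|c)\,da + \int_{\mathcal{A}^b} p(x|a,c)\,p(a|c)\,da =: (\#) \end{aligned}$
with $\mathcal{A}^b=\{a\in\mathrm{proj}_A(U_i^c):p(a,b,c)=0\}$. It is the case, however, that for all $a\in\mathcal{A}^b$ there is a $\tilde b(a)\in\mathrm{proj}_B(U_i^c)$ with $p(a,\tilde b(a),c)>0$. But since also $b\in\mathrm{proj}_B(U_i^c)$ we have $p(x|\tilde b(a),c)=p(x|b,c)$ by eq. (8). Ergo,
$\begin{aligned} (\#) &= p(x|b,c)\int_{a\in\mathrm{proj}_A(U_i^c),\,p(a,b,c)>0} p(a|c)\,da + \int_{\mathcal{A}^b} p(x|a,\tilde b(a),c)\,p(a|c)\,da \\ &= p(x|b,c)\int_{a\in\mathrm{proj}_A(U_i^c),\,p(a,b,c)>0} p(a|c)\,da + p(x|b,c)\int_{\mathcal{A}^b} p(a|c)\,da \\ &= p(x|b,c)\int_{a\in\mathrm{proj}_A(U_i^c)} p(a|c)\,da = p(x|b,c)\,p(u^c|c). \end{aligned}$
This implies
$p(x|c,u^c)=p(x|b,c).$
Together with eq. (7) this leads to
$p(x|a,b,c,u^c)=p(x|a,b,c)=p(x|c,u^c).$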

### Proof of Proposition 2

Proof. Define $X$ according to
$X=g(C,U^C)+N_X,$
where $N_X\sim\mathcal{U}([-0.1,0.1])$ is uniformly distributed with $N_X$ independent of $(A,B,C)$. Define $g$ according to
$g(c,u^c)=\begin{cases}10 & \text{if } c=c^* \text{ and } u^{c^*}=1,\\ 0 & \text{otherwise.}\end{cases}$
Fix a value $c$ with $p(c)>0$. We then have for all $a,b$ with $p(a,b,c)>0$ that
$p(x|a,b,c)=p(x|c,u^c)=p(x|a,c)=p(x|b,c)$
because $U^c$ can be written as a function of $A$ or of $B$. We therefore have that $X⊥⊥A|B,C$ and $X⊥⊥B|A,C$. Depending on whether $b$ is in $\mathrm{proj}_B(U_1^{c^*})$ or not we have $p(x=0|b,c^*)=0$ or $p(x=10|b,c^*)=0$, respectively. Thus,
$p(x=10|b,c^*)\cdot p(x=0|b,c^*)=0,\quad\text{whereas}$
$p(x=10|c^*)\cdot p(x=0|c^*)\neq 0.$
This shows that $X⊥⊥/B|C=c^*$. Note that $(x,a,b,c)\mapsto p(x,a,b,c)$ is not necessarily continuous, see (A1). □

### Proof of Proposition 3

Proof. Since the true structure corresponds to a DAG, we can find a causal ordering, i.e. a permutation $\pi:\{1,\dots,p\}\to\{1,\dots,p\}$ such that
$\mathbf{PA}_{\pi(i)}\subseteq\{\pi(1),\dots,\pi(i-1)\}.$
In this ordering, $\pi(1)$ is a source node and $\pi(p)$ is a sink node. We can then rewrite the structural equation model in eq. (6) as
$X_{\pi(i)}=\tilde f_{\pi(i)}\big(X_{\pi(1)},\dots,X_{\pi(i-1)}\big)+N_{\pi(i)},$

where the functions $\tilde f_i$ are the same as $f_i$ except they are constant in the additional input arguments.
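A causal ordering of this kind can be computed with a standard topological sort. The following sketch uses Kahn's algorithm; it is an illustration rather than part of the proof, and the function name `causal_ordering` and the encoding of the DAG as a dictionary from each node to its parent set are assumptions made here. Every parent is assumed to appear as a key of the dictionary.

```python
from collections import deque

def causal_ordering(parents):
    """Return the nodes ordered so that every node appears after all of its parents.

    parents: dict mapping each node to the set of its parents in a DAG.
    Raises ValueError if the graph contains a directed cycle.
    """
    remaining = {node: set(pa) for node, pa in parents.items()}
    children = {node: [] for node in parents}
    for node, pa in parents.items():
        for q in pa:
            children[q].append(node)

    # Start with the source nodes, i.e. nodes without parents.
    queue = deque(sorted(n for n, pa in remaining.items() if not pa))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            remaining[child].discard(node)
            if not remaining[child]:  # all parents of child are already placed
                queue.append(child)

    if len(order) != len(parents):
        raise ValueError("graph is not acyclic")
    return order

# The two DAGs from Remark 1: A -> B -> X and X <- A -> B.
chain = {"A": set(), "B": {"A"}, "X": {"B"}}
fork = {"A": set(), "B": {"A"}, "X": {"A"}}
print(causal_ordering(chain))
print(causal_ordering(fork))
```

In both graphs the source node $A$ comes first. The ordering is in general not unique; any ordering returned satisfies the parent condition stated above.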

The density of the random vector $(X_1,\dots,X_p)$ has path-connected support by the following argument: consider a one-dimensional random variable $N$ with mean zero and a (possibly multivariate) random vector $X$ both with path-connected support and a continuous function $f$. Then, the support of the random vector $(X,f(X)+N)$ is path-connected, too. Indeed, consider two points $(x_0,y_0)$ and $(x_1,y_1)$ from the support of $(X,f(X)+N)$. The path can then be constructed by concatenating three sub-paths: (1) the path between $(x_0,y_0)$ and $(x_0,f(x_0))$ ($N$'s support is path-connected), (2) the path between $(x_0,f(x_0))$ and $(x_1,f(x_1))$ on the graph of $f$ (which is path-connected due to the continuity of $f$) and (3) the path between $(x_1,f(x_1))$ and $(x_1,y_1)$, analogously to (1).

Therefore, the intersection property (2) holds for any disjoint sets of variables $X,A,B,C\subseteq\{X_1,\dots,X_p\}$ by Proposition 1. Thus, the statements of lemma 38 and thus proposition 29 from Peters et al. remain correct, which proves $(∗)$ for noise variables with continuous densities and path-connected support. □

### Proof of Proposition 4

Proof. The proof is immediate. Since $p(x_i|x_j,x_k,x_c)\neq p(x_i|x_j',x_k,x_c)$ (the means are not the same), the statement follows from Lemma 1.

In this case, lemma 38 might not hold but, more importantly, proposition 29 does (both from Peters et al.). This proves $(∗)$. □

# Appendix B

## Technical results for identifiability in additive noise models

We provide the two key results required for proving property $(∗)$ in Section 4.1. The intersection property is used to prove the “only if” part of lemma 38, which itself is used to prove proposition 29.

Lemma 38. Consider the random vector $X$ and assume that the joint distribution has a (strictly) positive density. Then the joint distribution over $X$ satisfies causal minimality with respect to a DAG $G$ if and only if $\forall B\in X\ \forall A\in\mathbf{PA}_B^G$ and $\forall S\subset X$ with $\mathbf{PA}_B^G\setminus\{A\}\subseteq S\subseteq\mathbf{ND}_B^G\setminus\{A\}$ we have that
$B⊥⊥/A|S.$
Proposition 29. Let $G$ and $G'$ be two different DAGs over variables $X$. Assume that the joint distribution over $X$ has a strictly positive density and satisfies the Markov condition and causal minimality with respect to $G$ and $G'$. Then there are variables $L,Y\in X$ such that for the sets $Q:=\mathbf{PA}_L^G\setminus\{Y\}$, $R:=\mathbf{PA}_Y^{G'}\setminus\{L\}$ and $S:=Q\cup R$ we have
• $Y\to L$ in $G$ and $L\to Y$ in $G'$
• $S\subseteq\mathbf{ND}_L^G\setminus\{Y\}$ and $S\subseteq\mathbf{ND}_Y^{G'}\setminus\{L\}$

## References

1. Dawid AP. Some misleading arguments involving conditional independence. J R Stat Soc Ser B 1979;41:249-52.
2. Dawid AP. Conditional independence for statistical operations. Ann Stat 1980;8:598-617.
3. Pearl J. Causality: models, reasoning, and inference, 2nd ed. New York, NY: Cambridge University Press, 2009.
4. Drton M, Sturmfels B, Sullivant S. Lectures on algebraic statistics. Volume 39 of Oberwolfach Seminars. Basel: Birkhäuser Verlag, 2009.
5. Fink A. The binomial ideal of the intersection axiom for conditional probabilities. J Algebraic Combinatorics 2011;33:455-63.
6. Shimizu S, Hoyer PO, Hyvärinen A, Kerminen AJ. A linear non-Gaussian acyclic model for causal discovery. J Mach Learn Res 2006;7:2003-30.
7. Lauritzen S. Graphical models. New York, NY: Oxford University Press, 1996.
8. Spirtes P, Glymour C, Scheines R. Causation, prediction, and search, 2nd ed. Cambridge, MA: MIT Press, 2000.
9. Peters J, Mooij JM, Janzing D, Schölkopf B. Causal discovery with continuous additive noise models. J Mach Learn Res 2014;15:2009-53.
10. Chickering DM. Optimal structure identification with greedy search. J Mach Learn Res 2002;3:507-54.
11. Heckerman D, Meek C, Cooper G. A Bayesian approach to causal discovery. In: Glymour C, Cooper G, editors. Computation, causation, and discovery. Cambridge, MA: MIT Press, 1999:141-65.
12. Verma T, Pearl J. Equivalence and synthesis of causal models. In: Bonissone PB, Henrion M, Kanal LN, Lemmer JF, editors. Proceedings of the 6th annual conference on uncertainty in artificial intelligence (UAI). San Francisco, CA: Morgan Kaufmann, 1991:255-70.
13. Hoyer PO, Janzing D, Mooij JM, Peters J, Schölkopf B. Nonlinear causal discovery with additive noise models. In: Koller D, Schuurmans D, Bengio Y, Bottou L, editors. Advances in neural information processing systems 21 (NIPS). Red Hook, NY: Curran Associates, Inc., 2009:689-696.
14. Zhang J, Spirtes P. Strong faithfulness and uniform consistency in causal inference. In: Meek C, Kjærulff U, editors. Proceedings of the 19th annual conference on uncertainty in artificial intelligence (UAI). San Francisco, CA: Morgan Kaufmann, 2003:632-9.
15. Dawid AP. Conditional independence in statistical theory. J R Stat Soc Ser B 1979;41:1-31.

## Footnotes

1. Formally, path-connected components are equivalence classes of points, where two points are equivalent if there exists a path in $\mathcal{X}$ connecting them. This equivalence should not be confused with the equivalence appearing in Definition 3(ii).


Figure 1: Example 1. The plot on the left-hand side shows the support of variables $A$ and $B$ in black. The function $f$ takes values ten and zero in the areas filled with dark grey and light grey, respectively. The ANM (3) corresponds to the top graph on the right-hand side, but the distribution can also be generated by an ANM with the bottom graph; this is explained in Remark 1.

Figure 2: Each block represents one path-connected component $Z_i^c$ of the support of $p(a,b)$. All blocks with the same filling are equivalent since they can be connected by axis-parallel lines (see Definition 3). There are three different fillings corresponding to the equivalence classes $U_1^c$, $U_2^c$ and $U_3^c$.