1 Introduction
This paper investigates the intersection property of conditional independence. For random variables X, A, B and C, the intersection property states that X ⊥ A | (B, C) and X ⊥ B | (A, C) together imply X ⊥ (A, B) | C.
It is well known that the intersection property holds if the joint distribution has a strictly positive density (e.g. Pearl [3], 1.1.5). Proposition 1 shows that if the density is not strictly positive, a weaker condition than the intersection property still holds. Corollary 1 states necessary and sufficient conditions for the intersection property. The result about strictly positive densities is contained as a special case. Drton et al. ([4], exercise 6.6) and Fink [5] develop analogous results for the discrete case.
In the remainder of this introduction we discuss the paper’s main contribution (Section 1.1) and introduce the required notation (Section 1.2).
1.1 Main contributions
In Section 3 we provide a necessary and sufficient condition on the density for the intersection property to hold (Corollary 1). This result is of interest in itself, since the developed condition is weaker than strict positivity.
Studying the intersection property has direct applications to causal inference. Inferring causal relationships is a major challenge in science, and in recent decades considerable effort has been devoted to learning causal statements from observational data. As a first step, causal discovery methods estimate graphs from observational data and attach a causal meaning to these graphs (the terminology of causal inference is introduced in Section 4.1). Some causal discovery methods based on structural equation models (SEMs) require the intersection property for identification; they therefore rely on strict positivity of the density, which is satisfied, for example, if the noise variables have full support. Using the new characterization of the intersection property, we can now replace the condition of strict positivity. In fact, we show in Section 4 that noise variables with path-connected support suffice for identifiability of the graph (Proposition 3). This was already known for linear SEMs [6] but not for non-linear models. As an alternative, we provide a condition that excludes a specific kind of constant function and also leads to identifiability (Proposition 4).
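To make the setting concrete, the following minimal sketch simulates data from a hypothetical two-variable additive noise model. The model, seed and sample size are our own illustration, not taken from the paper; the noise has full, hence path-connected, support, and the mechanism is continuous and nowhere constant, i.e. it satisfies the sufficient conditions discussed in Section 4.

```python
import math
import random

random.seed(0)

# Hypothetical ANM  A -> B  with  B = tanh(3A) + N_B  (our own toy choice):
#   - the noise N_B is Gaussian, so its support is all of R (path-connected),
#   - the mechanism a |-> tanh(3a) is continuous and nowhere constant.
n = 5_000
A = [random.gauss(0.0, 1.0) for _ in range(n)]
B = [math.tanh(3.0 * a) + random.gauss(0.0, 0.5) for a in A]

def corr(u, v):
    """Pearson correlation of two equal-length samples."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    su = math.sqrt(sum((x - mu) ** 2 for x in u))
    sv = math.sqrt(sum((y - mv) ** 2 for y in v))
    return cov / (su * sv)
```

The dependence between A and B is plainly visible in such samples; identifiability concerns the harder question of recovering the direction A → B from the joint distribution alone.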
In Section 2, we provide an example of an SEM that violates the intersection property. Its corresponding graph is not identifiable from the joint distribution. In line with the theoretical results of this work, some noise densities in the example do not have path-connected support and the functions are partially constant. We are not aware of any causal discovery method that is able to infer the correct graph or the correct Markov equivalence class; the example therefore shows current limits of causal inference techniques. It is non-generic in the sense that it violates all sufficient assumptions mentioned in Section 4.
All proofs are provided in Appendix A.
1.2 Conditional independence and the intersection property
We now formally introduce the concept of conditional independence in the presence of densities, together with the intersection property. Let therefore X, A, B and C be random variables, and consider the following conditions:
- (A0) The joint distribution of (X, A, B, C) is absolutely continuous with respect to a product measure of a metric space. We denote the density by p; this can be a probability mass function or a probability density function, for example.
- (A1) For each fixed c, the density (a, b) ↦ p(a, b, c) is continuous. If there is no variable C (or C is deterministic), then (a, b) ↦ p(a, b) is continuous.
- (A2) For each c with p(c) > 0, the set {(a, b) : p(a, b | c) > 0} contains only one path-connected component (see Section 3).
- (A2′) The density p is strictly positive.
Condition (A2′) implies (A2). We assume (A0) throughout the whole work.
In this paper we work with the following definition of conditional independence.
Definition 1 (Conditional (In)dependence). We call X independent of A conditional on B and write X ⊥ A | B if p(x, a | b) = p(x | b) · p(a | b) for all x, a and all b with p(b) > 0; otherwise X and A are dependent conditional on B.
The intersection property of conditional independence is defined as follows (e.g. Pearl [3], 1.1.5).
Definition 2 (Intersection Property). We say that the joint distribution of (X, A, B, C) satisfies the intersection property if X ⊥ A | (B, C) and X ⊥ B | (A, C) together imply X ⊥ (A, B) | C. (2)
The intersection property (2) has been proven to hold for strictly positive densities (e.g. Pearl [3], 1.1.5). The other direction “(2) implies strict positivity” does not hold: Corollary 1 below shows that strict positivity is sufficient but not necessary.
2 Counterexample
We now give an example of a distribution that does not satisfy the intersection property (2). Since the joint distribution has a continuous density, the example shows that the intersection property requires further restrictions on the density apart from its existence. We will later use the same idea to prove Proposition 2, which shows the necessity of our new condition.
It will turn out to be important that the two path-connected components of the support of A and B cannot be connected by an axis-parallel line. This motivates the notation introduced in Section 3. Remark 1 in Section 4 discusses the causal interpretation of Example 1.

Example 1. The plot on the left-hand side shows the support of the variables A and B in black. The function f takes the values ten and zero in the areas filled with dark grey and light grey, respectively. The ANM (3) corresponds to the top graph on the right-hand side, but the distribution can also be generated by an ANM with the bottom graph; this is explained in Remark 1.
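The example itself is continuous and tied to the figure, but the same mechanism can be checked exactly in a standard discrete analogue (our own illustration, in the spirit of the discrete results of Drton et al. [4] and Fink [5]): take A = B uniform on {0, 1} and X = A. The support of (A, B) then consists of the two components {(0, 0)} and {(1, 1)}, which no axis-parallel line connects, and the intersection property fails:

```python
from itertools import product

# Joint pmf of (X, A, B) with A = B uniform on {0, 1} and X = A.
pmf = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

def marg(keep):
    """Marginal pmf over the coordinates listed in `keep`."""
    out = {}
    for xab, p in pmf.items():
        key = tuple(xab[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def ci(i, j, k):
    """Check V_i _||_ V_j | V_k via p(i,j,k) * p(k) == p(i,k) * p(j,k)."""
    p_ijk, p_k = marg((i, j, k)), marg((k,))
    p_ik, p_jk = marg((i, k)), marg((j, k))
    for v in product((0, 1), repeat=3):
        lhs = p_ijk.get(v, 0.0) * p_k.get((v[2],), 0.0)
        rhs = p_ik.get((v[0], v[2]), 0.0) * p_jk.get((v[1], v[2]), 0.0)
        if abs(lhs - rhs) > 1e-12:
            return False
    return True

def x_indep_ab():
    """Check the conclusion X _||_ (A, B), i.e. p(x,a,b) == p(x) * p(a,b)."""
    p_x, p_ab = marg((0,)), marg((1, 2))
    return all(
        abs(pmf.get(v, 0.0) - p_x.get((v[0],), 0.0) * p_ab.get(v[1:], 0.0)) <= 1e-12
        for v in product((0, 1), repeat=3)
    )
```

Here ci(0, 1, 2) and ci(0, 2, 1) both return True (X ⊥ A | B and X ⊥ B | A), yet x_indep_ab() returns False: the conclusion X ⊥ (A, B) fails because X = A.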
Citation: Journal of Causal Inference 3, 1; 10.1515/jci-2014-0015
3 Necessary and sufficient condition for the intersection property
This section characterizes the intersection property in terms of the joint density over the corresponding random variables. In particular, we state a weak intersection property (Proposition 1) that leads to a necessary and sufficient condition for the classical intersection property, see Corollary 1.
We will see that the intersection property fails in Example 1 because of the two “separated” components in Figure 1. In order to formulate our results we first require the notion of path-connectedness. A continuous mapping γ : [0, 1] → S with γ(0) = s₀ and γ(1) = s₁ is called a path in S from s₀ to s₁; a set S is called path-connected if any two of its points can be joined by such a path.
(iii) The case where there is no variable C can be treated as if C were deterministic.

Each block represents one path-connected component.
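The equivalence relation underlying Definition 3 can be sketched computationally on a discretized support (a minimal illustration with our own helper names; the grid discretization is our assumption, not the paper's construction): first find the path-connected components of a boolean grid by flood fill, then merge any two components that meet the same row or the same column, i.e. that can be connected by an axis-parallel line.

```python
from collections import deque

def components(grid):
    """Label the 4-connected components of a boolean grid; return (labels, count)."""
    rows, cols = len(grid), len(grid[0])
    label = [[-1] * cols for _ in range(rows)]
    count = 0
    for i in range(rows):
        for j in range(cols):
            if grid[i][j] and label[i][j] < 0:
                queue = deque([(i, j)])
                label[i][j] = count
                while queue:  # flood fill one component
                    x, y = queue.popleft()
                    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                        if 0 <= nx < rows and 0 <= ny < cols \
                                and grid[nx][ny] and label[nx][ny] < 0:
                            label[nx][ny] = count
                            queue.append((nx, ny))
                count += 1
    return label, count

def equivalence_classes(grid):
    """Number of classes after merging components linked by axis-parallel lines."""
    label, count = components(grid)
    parent = list(range(count))

    def find(c):  # union-find with path compression
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    rows, cols = len(grid), len(grid[0])
    lines = [[label[i][j] for j in range(cols) if grid[i][j]] for i in range(rows)]
    lines += [[label[i][j] for i in range(rows) if grid[i][j]] for j in range(cols)]
    for comps in lines:  # components sharing a row or a column are merged
        for c in comps[1:]:
            parent[find(c)] = find(comps[0])
    return len({find(c) for c in range(count)})
```

For two diagonal blocks, as in Example 1, this yields two equivalence classes (so the intersection property fails by Corollary 1); adding a cell that shares a row with one block and a column with the other merges them into a single class.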
Using Definition 3 we are now able to state the two main results, Propositions 1 and 2. As a direct consequence we obtain Corollary 1 which generalizes the condition of strictly positive densities.
Proposition 1 (Weak Intersection Property). Assume (A0), (A1) and that X ⊥ A | (B, C) and X ⊥ B | (A, C). Then X is independent of (A, B) given C and the equivalence class of path-connected components of the support containing (A, B). This means that the conditional density p(x | a, b, c) takes the same value for all (a, b) with p(a, b, c) > 0 that lie in the same equivalence class.
Furthermore, Proposition 1 includes the intersection property for positive densities as a special case: if the density is indeed strictly positive, then there is only a single path-connected component, the conditioning on the equivalence class becomes vacuous and the classical intersection property (2) follows.
Proposition 2 (Failure of Intersection Property). Assume (A0), (A1) and that there exist two different equivalence classes of path-connected components of the support. Then there exists a random variable X satisfying X ⊥ A | (B, C) and X ⊥ B | (A, C) for which the intersection property (2) fails.
As a direct corollary from these two propositions we obtain a characterization of the intersection property in the case of continuous densities.
Corollary 1 (Intersection Property). Assume (A0) and (A1). The intersection property (2) holds for all variables X if and only if all path-connected components of the support belong to a single equivalence class, that is, if they can be connected by axis-parallel lines. In particular, this is the case if (A2) holds (there is only one path-connected component) or (A2′) holds (the density is strictly positive).
4 Application to causal discovery
We first introduce the graph notation required to formulate the application to causal inference.
4.1 Notation and prerequisites
Standard graph definitions can be found in Lauritzen [7], Spirtes et al. [8] and many others. We follow the presentation of Section 1.1 in Peters et al. [9]. A graph G = (V, E) consists of a finite set of nodes V, one for each random variable, and a set of edges E between them; a directed acyclic graph (DAG) contains only directed edges and no directed cycles.
In order to infer graphs from distributions, one requires assumptions that relate the joint distribution with properties of the graph, which is often assumed to be a DAG. Constraint-based or independence-based methods [3, 8] and some score-based methods [10, 11] assume the Markov condition and faithfulness. These two assumptions make the Markov equivalence class of the correct graph identifiable from the joint distribution, i.e. the skeleton and the v-structures of the graph can be inferred from the joint distribution [12].
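The identifiability statement above, that skeleton and v-structures characterize the Markov equivalence class [12], can be sketched as follows (a minimal illustration with our own helper names; DAGs are given as lists of (parent, child) pairs):

```python
def skeleton(edges):
    """Undirected version of a DAG given as a list of (parent, child) pairs."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """Colliders a -> c <- b whose endpoints a and b are non-adjacent."""
    adj = skeleton(edges)
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    vs = set()
    for c, pa in parents.items():
        for a in pa:
            for b in pa:
                if a < b and frozenset((a, b)) not in adj:
                    vs.add((a, c, b))
    return vs

def markov_equivalent(edges1, edges2):
    """Two DAGs are Markov equivalent iff skeleton and v-structures agree [12]."""
    return skeleton(edges1) == skeleton(edges2) and \
        v_structures(edges1) == v_structures(edges2)
```

For instance, the chains A → B → C and A ← B ← C are Markov equivalent, whereas the collider A → B ← C is not equivalent to either, since it has the v-structure (A, B, C).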
4.2 Intersection property and causal discovery
We first revisit Example 1 and interpret it from a causal point of view.
Remark 1 (Example 1 continued). Example 1 has the following important implication for causal inference. The distribution can be generated by two different DAGs, namely the two graphs shown on the right-hand side of the figure in Example 1; the graph is therefore not identifiable from the joint distribution.
In order to prove identifiability under weaker conditions than strict positivity, we make use of the characterization developed in Section 3.
Proposition 3. Assume that a joint distribution over the variables is generated by a continuous ANM in which all noise variables have path-connected support. Then the corresponding graph is identifiable from the joint distribution.
Example 1 violates the assumption of Proposition 3 since the support of A is not path-connected. It satisfies another important property, too: the function f is constant on some intervals. The following proposition shows that this is necessary to violate identifiability.
Proposition 4. Assume that a joint distribution over the variables is generated by a continuous ANM whose functions are not constant on any suitable open subset of their domain. Then the corresponding graph is identifiable from the joint distribution.
Proposition 4 provides an alternative way to prove identifiability. The results are summarized in Table 1.
This table shows conditions for continuous additive noise models (ANMs) that lead to identifiability of the directed acyclic graph from the joint distribution. Using the characterization of the intersection property we could weaken the condition of a strictly positive density.
Additional assumption on continuous ANMs | Identifiability of graph, see (*) |
Noise variables with full support | ✓ Peters et al. [9] |
Noise variables with path-connected support | ✓ Proposition 3 |
Non-constant functions, see Proposition 4 | ✓ Proposition 4 |
None of the above satisfied | ✕ Example 1 |
5 Conclusions
It is possible to prove the intersection property of conditional independence for variables whose distributions do not have a strictly positive density. A necessary and sufficient condition for the intersection property is that all path-connected components of the support of the density are equivalent, that is, they can be connected by axis-parallel lines. In particular, this condition is satisfied for densities whose support is path-connected. In the general case, the intersection property still holds after conditioning on an equivalence class of path-connected components; we call this the weak intersection property. We believe that the assumption (A1) of a density that is continuous in A and B can be weakened further.
This insight has a direct application in causal inference (one that is of a theoretical rather than practical nature). In the context of continuous ANMs, we relax important conditions for identifiability of the graph from the joint distribution. Furthermore, there is some interest in uniform consistency in causal inference. For linear Gaussian SEMs, for example, the PC algorithm [8] exploits conditional independences, that is, vanishing partial correlations. Zhang and Spirtes [14] prove uniform consistency under the assumption that non-vanishing partial correlations cannot be arbitrarily close to zero (a condition referred to as “strong faithfulness”). Our work suggests that in order to prove uniform consistency for continuous ANMs, one may need to be “bounded away” from Example 1.
The author thanks the anonymous reviewers for their insightful and constructive comments. He further thanks Thomas Kahle and Mathias Drton for pointing out the link to algebraic statistics for discrete variables. The research leading to these results has received funding from the People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no 326496.
Appendix A
Proofs
Proof of Proposition 1
We require the following well-known lemma (e.g. Dawid [15]).
Proof of Proposition 2
Proof of Proposition 3
where the functions
The density of the random vector
Therefore, the intersection property (2) holds for any disjoint sets of variables.
Proof of Proposition 4
Proof. The proof is immediate. Since
In this case, Lemma 38 might not hold, but more importantly Proposition 29 does (both from Peters et al. [9]). This proves the claim.
Appendix B
Technical results for identifiability in additive noise models
We provide the two key results required for proving property (*), see Table 1.
References
- [1] Dawid AP. Some misleading arguments involving conditional independence. J R Stat Soc Ser B 1979;41:249–52.
- [3] Pearl J. Causality: models, reasoning, and inference, 2nd ed. New York, NY: Cambridge University Press, 2009.
- [4] Drton M, Sturmfels B, Sullivant S. Lectures on algebraic statistics. Volume 39 of Oberwolfach Seminars. Basel: Birkhäuser Verlag, 2009.
- [5] Fink A. The binomial ideal of the intersection axiom for conditional probabilities. J Algebraic Combinatorics 2011;33:455–63.
- [6] Shimizu S, Hoyer PO, Hyvärinen A, Kerminen AJ. A linear non-Gaussian acyclic model for causal discovery. J Mach Learn Res 2006;7:2003–30.
- [8] Spirtes P, Glymour C, Scheines R. Causation, prediction, and search, 2nd ed. Cambridge, MA: MIT Press, 2000.
- [9] Peters J, Mooij JM, Janzing D, Schölkopf B. Causal discovery with continuous additive noise models. J Mach Learn Res 2014;15:2009–53.
- [10] Chickering DM. Optimal structure identification with greedy search. J Mach Learn Res 2002;3:507–54.
- [11] Heckerman D, Meek C, Cooper G. A Bayesian approach to causal discovery. In: Glymour C, Cooper G, editors. Computation, causation, and discovery. Cambridge, MA: MIT Press, 1999:141–65.
- [12] Verma T, Pearl J. Equivalence and synthesis of causal models. In: Bonissone PB, Henrion M, Kanal LN, Lemmer JF, editors. Proceedings of the 6th Annual Conference on Uncertainty in Artificial Intelligence (UAI). San Francisco, CA: Morgan Kaufmann, 1991:255–70.
- [13] Hoyer PO, Janzing D, Mooij JM, Peters J, Schölkopf B. Nonlinear causal discovery with additive noise models. In: Koller D, Schuurmans D, Bengio Y, Bottou L, editors. Advances in Neural Information Processing Systems 21 (NIPS). Red Hook, NY: Curran Associates, 2009:689–96.
- [14] Zhang J, Spirtes P. Strong faithfulness and uniform consistency in causal inference. In: Meek C, Kjærulff U, editors. Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence (UAI). San Francisco, CA: Morgan Kaufmann, 2003:632–9.
Footnotes
Formally, path-connected components are equivalence classes of points, where two points are equivalent if there exists a path in the support connecting them.