Unifying Gaussian LWF and AMP Chain Graphs to Model Interference

An intervention may have an effect on units other than those to which it was administered. This phenomenon is called interference and it usually goes unmodeled. In this paper, we propose to combine Lauritzen-Wermuth-Frydenberg and Andersson-Madigan-Perlman chain graphs to create a new class of causal models that can represent both interference and non-interference relationships. Specifically, we define the new class of models, introduce global, local and pairwise Markov properties for them, and prove their equivalence. We also propose an algorithm for maximum likelihood parameter estimation for the new models, and report experimental results. Finally, we adapt Pearl's do-calculus for causal effect identification in the new models.


Motivation
Graphical models are among the most studied and used formalisms for causal inference. Some of the reasons for their success are that they make explicit and testable causal assumptions, and that there exist algorithms for causal effect identification, counterfactual reasoning, mediation analysis, and model identification from non-experimental data (Pearl, 2009; Peters et al., 2017; Spirtes et al., 2000). However, in causal inference in general and in graphical models in particular, one assumes more often than not that there is no interference, i.e. an intervention has no effect on units other than those to which the intervention was administered (VanderWeele et al., 2012). This may be unrealistic in some domains. For instance, vaccinating the mother against a disease may have a protective effect on her child, and vice versa. A notable exception is the work by Ogburn and VanderWeele (2014), who distinguish three causal mechanisms that may give rise to interference and show how to model them with directed and acyclic graphs (DAGs). In this paper, we focus on what the authors call interference by contagion: one individual's treatment does not affect another individual's outcome directly but via the first individual's outcome. Ogburn and VanderWeele (2014) also argue that interference by contagion typically involves feedback among different individuals' outcomes over time and, thus, it may be modeled by a DAG over the random variables of interest instantiated at different time points.

Figure 1. Models of the mother-child example: (a, b) LWF CGs, and (c) UCG.

Sometimes, however, the variables are observed at one time point only, e.g. at the end of a season. The observed variables may then be modeled by a Lauritzen-Wermuth-Frydenberg chain graph, or LWF CG for short (Frydenberg, 1990; Lauritzen, 1996): directed edges represent direct causal effects, and undirected edges represent causal effects due to interference. Some notable papers on LWF CGs for modeling interference by contagion are Ogburn et al. (2018), Shpitser (2015), and Shpitser et al. (2017). For instance, the previous mother-child example may be modeled with the LWF CG in Figure 1 (a), where V1 and V2 represent the dose of the vaccine administered to the mother and the child, and D1 and D2 represent the severity of the disease. This paper only deals with Gaussian distributions, which implies that the relations between the random variables are linear. Note that the edge D1 − D2 represents a symmetric relationship but has some features of a causal relationship, in the sense that a change in the severity of the disease for the mother causes a change in severity for the child and vice versa.
In some domains, we may need to model both interference and non-interference. This is the problem that we address in this paper. In our previous example, for instance, the mother and the child may have a gene that makes them healthy carriers, i.e. the higher the expression level of the gene, the less severe the disease, but the load of the disease agent (e.g., a virus) remains unaltered and, thus, so does the risk of infecting the other. We may model this situation with the LWF CG in Figure 1 (b), where G1 and G2 represent the expression level of the healthy carrier gene in the mother and the child. However, this model is not correct: in reality, the mother's healthy carrier gene protects her but has no protective effect on the child, and vice versa. This non-interference relation is not represented by the model. In other words, LWF CGs are not expressive enough to model both interference relations (e.g., intervening on V1 must have an effect on D2) and non-interference relations (e.g., intervening on G1 must have no effect on D2). To remedy this problem, we propose to combine LWF CGs with Andersson-Madigan-Perlman chain graphs, or AMP CGs for short (Andersson et al., 2001). We call these new models unified chain graphs (UCGs). As we will show, it is possible to describe how UCGs exactly model interference in the Gaussian case. Works such as Ogburn et al. (2018), Shpitser (2015), and Shpitser et al. (2017) use LWF CGs to model interference and compute some causal effect of interest. However, they do not describe how interference is exactly modeled.
The rest of the paper is organized as follows. Section 2 introduces some notation and definitions. Section 3 defines UCGs formally. Sections 4 and 5 define global, local and pairwise Markov properties for UCGs and prove their equivalence. Section 6 proposes an algorithm for maximum likelihood parameter estimation for UCGs, and reports experimental results. Section 7 adapts Pearl's do-calculus to UCGs for causal effect identification. Section 8 discusses identifiability of LWF and AMP CGs. Section 9 closes the paper with some discussion. The formal proofs of all the results are contained in an appendix at the end of the paper.

Preliminaries
Unless otherwise stated, the graphs in this paper are defined over a finite set of nodes V . Each node represents a random variable. We assume that the random variables are jointly normally distributed. The graphs contain at most one edge between any pair of nodes. The edge may be undirected or directed. We consider two types of directed edges: Solid (→) and dashed (⇢).
The parents of a set of nodes X in a graph G are the set Pa(X) = {A | A → B or A ⇢ B is in G with B ∈ X}, and the neighbors of X are the set Ne(X) = {A | A − B is in G with B ∈ X}. Moreover, the subgraph of G induced by X is denoted by G_X. A route between a node V1 and a node Vn in G is a sequence of (not necessarily distinct) nodes V1, ..., Vn such that G contains an edge between Vi and Vi+1 for all 1 ≤ i < n. A route of distinct nodes is called a path. A chain graph (CG) is a graph with (possibly) directed and undirected edges, and without semidirected cycles, i.e. without routes V1, ..., Vn = V1 such that V1 → V2 or V1 ⇢ V2 is in G, and Vi → Vi+1, Vi ⇢ Vi+1 or Vi − Vi+1 is in G for all 1 < i < n. A set of nodes of a CG G is connected if there exists an undirected route in G between every pair of nodes in the set. A chain component of G is a maximal connected set. Note that the chain components of G can be sorted topologically, i.e. for every edge A → B or A ⇢ B in G, the component containing A precedes the component containing B. The chain components of G are denoted by Cc(G). CGs without dashed directed edges are known as LWF CGs, whereas CGs without solid directed edges are known as AMP CGs.
We now recall the interpretation of LWF CGs. Given a route ρ in a LWF CG G, a section of ρ is a maximal undirected subroute of ρ. A section V1 − ... − Vn of ρ is a collider section of ρ if it appears in ρ as A → V1 − ... − Vn ← B. We say that ρ is Z-open with Z ⊆ V when (i) all the collider sections in ρ have some node in Z, and (ii) all the nodes that are outside the collider sections in ρ are outside Z. We now recall the interpretation of AMP CGs. Given a route ρ in an AMP CG G, C is a collider node in ρ if ρ has a subroute A ⇢ C ⇠ B or A ⇢ C − B. We say that ρ is Z-open with Z ⊆ V when (i) all the collider nodes in ρ are in Z, and (ii) all the non-collider nodes in ρ are outside Z. Let X, Y and Z denote three disjoint subsets of V. When there is no Z-open route in a LWF or AMP CG G between a node in X and a node in Y, we say that X is separated from Y given Z in G and denote it as X ⊥_G Y | Z. Moreover, we represent by X ⊥_p Y | Z that X and Y are conditionally independent given Z in a Gaussian distribution p. We say that p satisfies the global Markov property with respect to G when X ⊥_G Y | Z implies X ⊥_p Y | Z for all disjoint subsets X, Y and Z of V. Two LWF CGs (respectively, two AMP CGs) are Markov equivalent if every distribution that satisfies the global Markov property with respect to one of them also satisfies it with respect to the other.

Theorem 1. Two LWF CGs (respectively, two AMP CGs) G and H are Markov equivalent if and only if they represent the same separations.

Finally, let X, Y, W and Z be disjoint subsets of V. Every Gaussian distribution p satisfies the following six properties (Studený, 2005, Lemma 2.1, Proposition 2.1 and Corollary 2.4): symmetry, i.e. X ⊥_p Y | Z implies Y ⊥_p X | Z; decomposition, i.e. X ⊥_p Y ∪ W | Z implies X ⊥_p Y | Z; weak union, i.e. X ⊥_p Y ∪ W | Z implies X ⊥_p Y | Z ∪ W; contraction, i.e. X ⊥_p Y | Z ∪ W and X ⊥_p W | Z imply X ⊥_p Y ∪ W | Z; intersection, i.e. X ⊥_p Y | Z ∪ W and X ⊥_p W | Z ∪ Y imply X ⊥_p Y ∪ W | Z; and composition, i.e. X ⊥_p Y | Z and X ⊥_p W | Z imply X ⊥_p Y ∪ W | Z.

Unified Chain Graphs
In this section, we introduce our causal models to represent both interference and non-interference relationships. Let p denote a Gaussian distribution that satisfies the global Markov property with respect to a LWF or AMP CG G. Then p factorizes as

p(V) = ∏_{K ∈ Cc(G)} p(K | Pa(K))    (1)

Let K ∈ Cc(G), and let K1, ..., Kn denote the chain components of G that precede K in an arbitrary topological ordering of the components; note that Pa(K) ⊆ K1 ∪ ... ∪ Kn. Assume without loss of generality that p has zero mean vector. Moreover, let Σ and Ω denote respectively the covariance and precision matrices of p(K, Pa(K)). Each matrix is then of dimension (|K| + |Pa(K)|) × (|K| + |Pa(K)|). Bishop (2006, Section 2.3.1) shows that the conditional distribution p(K | Pa(K)) is Gaussian, with covariance matrix Λ_K and a mean vector that is a linear function of Pa(K) with coefficients β_K, i.e.

K | Pa(K) ∼ N(β_K Pa(K), Λ_K)    (2)

Then, β_K is of dimension |K| × |Pa(K)|, and Λ_K is of dimension |K| × |K|. Moreover, β_K and Λ_K can be expressed in terms of Σ and Ω as follows:

β_K = Σ_{K,Pa(K)} Σ_{Pa(K),Pa(K)}^{-1} = −Ω_{K,K}^{-1} Ω_{K,Pa(K)}    (3)

and

Λ_K = Σ_{K|Pa(K)} = Σ_{K,K} − Σ_{K,Pa(K)} Σ_{Pa(K),Pa(K)}^{-1} Σ_{Pa(K),K} = Ω_{K,K}^{-1}    (4)

where Σ_{K,Pa(K)} denotes the submatrix of Σ with rows K and columns Pa(K), and Σ_{Pa(K),Pa(K)}^{-1} denotes the inverse of Σ_{Pa(K),Pa(K)}. The last equality in Equation 4 follows from Lauritzen (1996, Proposition 5.2).
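Equations 3 and 4 are easy to verify numerically. The following sketch uses an arbitrary positive-definite covariance matrix (not a parameterization from the paper) and checks that the Σ-based and Ω-based expressions agree:

```python
import numpy as np

# Hypothetical 4-variable example: K = {2, 3} (children), Pa(K) = {0, 1} (parents).
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 4 * np.eye(4)          # positive definite by construction
Omega = np.linalg.inv(Sigma)             # precision matrix

K, Pa = [2, 3], [0, 1]
# Equation 3: two equivalent expressions for the regression coefficients beta_K.
beta_sigma = Sigma[np.ix_(K, Pa)] @ np.linalg.inv(Sigma[np.ix_(Pa, Pa)])
beta_omega = -np.linalg.inv(Omega[np.ix_(K, K)]) @ Omega[np.ix_(K, Pa)]
# Equation 4: two equivalent expressions for the conditional covariance Lambda_K.
lam_sigma = (Sigma[np.ix_(K, K)]
             - Sigma[np.ix_(K, Pa)]
             @ np.linalg.inv(Sigma[np.ix_(Pa, Pa)])
             @ Sigma[np.ix_(Pa, K)])
lam_omega = np.linalg.inv(Omega[np.ix_(K, K)])

assert np.allclose(beta_sigma, beta_omega)
assert np.allclose(lam_sigma, lam_omega)
```

Both identities are standard Schur-complement facts about the multivariate Gaussian, so the check passes for any positive-definite Σ.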
Furthermore, if G is an AMP CG, then (β_K)_{i,j} = 0 for all i ∈ K and j ∈ Pa(K) ∖ Pa(i), because i ⊥_G Pa(K) ∖ Pa(i) | Pa(i). Therefore, the directed edges of an AMP CG are suitable for representing non-interference. On the other hand, the directed edges of a LWF CG are suitable for representing interference. To see it, note that if G is a LWF CG, then Ω_{i,j} = 0 for all i ∈ K and j ∈ Pa(K) ∖ Pa(i), because i ⊥_G j | K ∖ {i} ∪ Pa(K) ∖ {j}. However, (β_K)_{i,j} is not identically zero. For instance, if G = {1 → 3 − 4 ← 2} and K = {3, 4}, then

(β_K)_{4,1} = −(Ω_{K,K}^{-1})_{4,3} (Ω_{K,Pa(K)})_{3,1} − (Ω_{K,K}^{-1})_{4,4} (Ω_{K,Pa(K)})_{4,1} = −(Ω_{K,K}^{-1})_{4,3} (Ω_{K,Pa(K)})_{3,1}

because (Ω_{K,Pa(K)})_{4,1} = 0 as 1 → 4 is not in G. In other words, if there is a path in a LWF CG G from j to i through nodes in K, then (β_K)_{i,j} is not identically zero. Actually, (β_K)_{i,j} can be written as a sum of path weights over all such paths. This follows directly from Equation 3 and the result by Jones and West (2005, Theorem 1), who show that (Ω_{K,K}^{-1})_{i,l} can be written as a sum of path weights over all the paths in G between l and i through nodes in K. Specifically,

(Ω_{K,K}^{-1})_{i,l} = ∑_{ρ ∈ ρ_{i,l}} (−1)^{|ρ|+1} det((Ω_{K,K})_{∖ρ}) / det(Ω_{K,K}) ∏_{n=1}^{|ρ|−1} (Ω_{K,K})_{ρ_n,ρ_{n+1}}    (5)

where ρ_{i,l} denotes the set of paths in G between i and l through nodes in K, |ρ| denotes the number of nodes in a path ρ, ρ_n denotes the n-th node in ρ, and (Ω_{K,K})_{∖ρ} is the matrix with the rows and columns corresponding to the nodes in ρ omitted. Moreover, the determinant of a zero-dimensional matrix is taken to be 1. As a consequence, a LWF CG G does not impose zero restrictions on β_K, because K is connected by definition of chain component. Works such as Ogburn et al. (2018), Shpitser (2015), and Shpitser et al. (2017) use LWF CGs to model interference and compute some causal effect of interest. However, they do not describe how interference is exactly modeled. In the case of Gaussian LWF CGs, Equations 3 and 5 shed some light on this question. For instance, consider extending the mother-child example in Section 1 to a group of friends.
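The example G = {1 → 3 − 4 ← 2} can be replayed numerically. The precision matrix below is hypothetical, chosen only to respect the zero pattern imposed by G; the check confirms that (β_K)_{4,1} reduces to the single term contributed by the path 1 → 3 − 4 and is non-zero:

```python
import numpy as np

# Nodes 1, 2, 3, 4 at indices 0, 1, 2, 3; K = {3, 4}, Pa(K) = {1, 2}.
# Zero pattern from G = {1 -> 3 - 4 <- 2}: no edge 1 -> 4 and no edge 2 -> 3.
Omega = np.array([[2.0, 0.0, 0.7, 0.0],
                  [0.0, 2.0, 0.0, 0.5],
                  [0.7, 0.0, 2.0, 0.9],
                  [0.0, 0.5, 0.9, 2.0]])
assert np.all(np.linalg.eigvalsh(Omega) > 0)    # positive definite

K, Pa = [2, 3], [0, 1]
W = np.linalg.inv(Omega[np.ix_(K, K)])          # (Omega_{K,K})^{-1}
beta = -W @ Omega[np.ix_(K, Pa)]                # Equation 3

# (beta_K)_{4,1} collapses to one term because (Omega_{K,Pa(K)})_{4,1} = 0.
assert np.isclose(beta[1, 0], -W[1, 0] * Omega[2, 0])
# Interference: vaccinating node 1 still affects node 4, via node 3.
assert abs(beta[1, 0]) > 0.1
```

The lack of the edge 1 → 4 zeroes an entry of Ω but, as the last assertion shows, not the corresponding regression coefficient.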
The vaccine for individual j has a protective effect on individual j developing the disease, which in turn has a protective effect on her friends, which in turn has a protective effect on her friends' friends, and so on. Moreover, the protective effect also works in the reverse direction, i.e. the vaccine for the latter has a protective effect on individual j. In other words, the protective effect of the vaccine for individual j on individual i is the result of all the paths for the vaccine's effect to reach individual i through the network of friends. This is exactly what Equations 3 and 5 tell us (assuming that the neighbors of an individual's disease node in G are her friends' disease nodes). The discussion above suggests a way to model both interference and non-interference by unifying LWF and AMP CGs, simply considering two types of directed edges, ⇢ and →. We call these new models unified chain graphs (UCGs). The lack of an edge ⇢ in an UCG imposes a zero restriction on the elements of β_K (as in AMP CGs), whereas the lack of an edge → imposes a zero restriction on the elements of Ω_{K,Pa(K)} but not on the elements of β_K (as in LWF CGs). Therefore, edges ⇢ may be used to model non-interference, whereas edges → may be used to model interference. For instance, the mother-child example in Section 1 may be modeled with the UCG in Figure 1 (c).
Every UCG is a CG, but the opposite is not true due to the following requirement. We divide the parents of a set of nodes X in an UCG into fathers and mothers as follows: the fathers of X are Fa(X) = {A | A → B is in G with B ∈ X}, and the mothers of X are Mo(X) = {A | A ⇢ B is in G with B ∈ X}. We require that Fa(K) ∩ Mo(K) = ∅ for all K ∈ Cc(G). Therefore, the lack of an edge ⇢ imposes a zero restriction on the elements of β_K corresponding to the mothers, and the lack of an edge → imposes a zero restriction on the elements of Ω_{K,Pa(K)} corresponding to the fathers. The reason why we require that Fa(K) ∩ Mo(K) = ∅ is to avoid imposing contradictory zero restrictions on β_K: if j → i is in G, then j ⇢ i is not (a CG has at most one edge per pair of nodes), so treating j as a mother of K would make (β_K)_{i,j} identically zero, whereas j → i implies the opposite. In other words, without this constraint, UCGs would reduce to AMP CGs. Lastly, note that the zero restrictions associated with an UCG induce a Gaussian distribution by Equation 1 and Bishop (2006, Section 2.3.3).
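To illustrate the two kinds of zero restrictions side by side, the following sketch parameterizes a UCG with the structure of Figure 1 (c); all numbers are hypothetical:

```python
import numpy as np

# UCG like Figure 1(c): fathers Fa(K) = {V1, V2}, mothers Mo(K) = {G1, G2},
# children K = {D1, D2}; edges V1 -> D1, V2 -> D2, G1 ~> D1, G2 ~> D2, D1 - D2.
Omega_KK = np.array([[2.0, 0.8],
                     [0.8, 2.0]])                 # precision over {D1, D2}
Omega_KFa = np.array([[0.6, 0.0],                 # no edge V2 -> D1
                      [0.0, 0.6]])                # no edge V1 -> D2
beta_Mo = np.array([[-0.5, 0.0],                  # no edge G2 ~> D1
                    [0.0, -0.5]])                 # no edge G1 ~> D2

beta_Fa = -np.linalg.inv(Omega_KK) @ Omega_KFa    # Equation 3, fathers' block

# Non-interference: the mother G1 has no effect on the other child D2.
assert beta_Mo[1, 0] == 0
# Interference: the father V1 does affect D2, through the path V1 -> D1 - D2.
assert abs(beta_Fa[1, 0]) > 1e-9
```

The mothers' block of β_K carries the zero restriction directly, while the fathers' block inherits a non-zero entry from the undirected connection between the children.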

Global Markov Property
In this section, we present a separation criterion for UCGs. Given a route ρ in an UCG G, C is a collider node in ρ if ρ has a subroute A ⇢ C ⇠ B or A ⇢ C − B. A section of ρ is a maximal undirected subroute of ρ, and it is a collider section of ρ if it appears in ρ as A → V1 − ... − Vn ← B. We say that ρ is Z-open with Z ⊆ V when (i) all the collider nodes in ρ are in Z, (ii) all the collider sections in ρ have some node in Z, and (iii) all the nodes that are outside the collider sections in ρ and are not collider nodes in ρ are outside Z. Let X, Y and Z denote three disjoint subsets of V. When there is no Z-open route in G between a node in X and a node in Y, we say that X is separated from Y given Z in G and denote it as X ⊥_G Y | Z. Note that this separation criterion unifies the criteria for LWF and AMP CGs reviewed in Section 2. Finally, we say that a Gaussian distribution p satisfies the global Markov property with respect to an UCG G when X ⊥_G Y | Z implies X ⊥_p Y | Z for all disjoint subsets X, Y and Z of V. The next theorem states the equivalence between the global Markov property and the zero restrictions associated with an UCG.

Theorem 2. A Gaussian distribution p satisfies Equation 1 and the zero restrictions associated with an UCG G if and only if it satisfies the global Markov property with respect to G.
We have mentioned before that Jones and West (2005) prove that the covariance between two nodes i and j can be written as a sum of path weights over the paths between i and j in a certain undirected graph (recall Equation 5 for the details). Wright (1921) proves a similar result for DAGs. The following theorem generalizes both results.
Theorem 3. Given a Gaussian distribution that satisfies the global Markov property with respect to an UCG G, the covariance between two nodes i and j of G can be written as a sum of products of weights over the edges of the open paths between i and j in G.
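A minimal numerical check of the path-weight expansion, on the toy undirected Gaussian model 1 − 2 − 3 (not an example from the paper):

```python
import numpy as np

# Undirected model 1 - 2 - 3: the precision matrix has a zero between 1 and 3.
Omega = np.array([[2.0, 0.5, 0.0],
                  [0.5, 2.0, 0.5],
                  [0.0, 0.5, 2.0]])
Sigma = np.linalg.inv(Omega)

# The only path between nodes 1 and 3 is 1 - 2 - 3, with three nodes, so the
# expansion has one term: (-1)^(3+1) * Omega_12 * Omega_23
#                         * det(Omega with nodes 1, 2, 3 removed) / det(Omega),
# where the determinant of the resulting zero-dimensional matrix is 1.
weight = Omega[0, 1] * Omega[1, 2] * 1.0 / np.linalg.det(Omega)
assert np.isclose(Sigma[0, 2], weight)
```

The covariance between the two endpoints is exactly the weight of the single connecting path, as the Jones and West (2005) expansion predicts.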

Block-Recursive, Pairwise and Local Markov Properties
In this section, we present block-recursive, pairwise and local Markov properties for UCGs, and prove their equivalence. Equivalence means that every Gaussian distribution that satisfies any of these properties with respect to an UCG also satisfies the global Markov property with respect to the UCG, and vice versa. The relevance of these results stems from the fact that checking whether a distribution satisfies the global Markov property can now be performed more efficiently: we do not need to check that every global separation corresponds to a statistical independence in the distribution, it suffices to do so for the local (or pairwise or block-recursive) separations, which are typically considerably fewer. This is the approach taken by most learning algorithms based on hypothesis tests, such as the PC algorithm for DAGs (Spirtes et al., 2000), and its subsequent extensions to LWF CGs (Studený, 1997) and AMP CGs (Peña, 2012). The results in this section pave the way for developing a similar learning algorithm for UCGs, something that we postpone to a future article.
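In the Gaussian case, the independence tests that such algorithms rely on reduce to checking a zero entry in the precision matrix of a marginal distribution. A minimal sketch (the function name and the toy covariance are ours, not from the paper):

```python
import numpy as np

def gaussian_ci(Sigma, i, j, Z, tol=1e-9):
    """True iff i is independent of j given Z in the Gaussian with covariance Sigma:
    the (i, j) entry of the precision of the marginal over {i, j} union Z is zero."""
    idx = [i, j] + list(Z)
    P = np.linalg.inv(Sigma[np.ix_(idx, idx)])
    return abs(P[0, 1]) < tol

# Toy check on the chain 0 -> 1 -> 2 with unit-variance errors:
# V0 ~ N(0,1), V1 = V0 + e1, V2 = V1 + e2.
Sigma = np.array([[1.0, 1.0, 1.0],
                  [1.0, 2.0, 2.0],
                  [1.0, 2.0, 3.0]])
assert not gaussian_ci(Sigma, 0, 2, [])     # marginally dependent
assert gaussian_ci(Sigma, 0, 2, [1])        # independent given the middle node
```

With finite data, the same check becomes a hypothesis test on the corresponding partial correlation rather than an exact zero.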
We say that a Gaussian distribution p satisfies the block-recursive Markov property with respect to an UCG G if for any chain component K ∈ Cc(G) and vertex i ∈ K, (B1) i ⊥_p Mo(K) ∖ Mo(i) | Fa(K) ∪ Mo(i), (B2) i ⊥_p j | (K ∪ Pa(K)) ∖ {i, j} for all j ∈ Fa(K) ∖ Fa(i), and (B3) p(K | Pa(K)) satisfies the global Markov property with respect to G_K.

Theorem 4. A Gaussian distribution p satisfies Equation 1 and the zero restrictions associated with an UCG G if and only if it satisfies the block-recursive Markov property with respect to G.
We say that a Gaussian distribution p satisfies the pairwise Markov property with respect to an UCG G if for any non-adjacent vertices i and j of G with i ∈ K and K ∈ Cc(G), (P1) i ⊥_p j | (K ∪ Pa(K)) ∖ {i, j} if j ∈ K ∪ Fa(K), and (P2) i ⊥_p j | Fa(K) ∪ Mo(i) if j ∈ Mo(K).

Theorem 5. The pairwise and block-recursive Markov properties are equivalent.
We say that a Gaussian distribution p satisfies the local Markov property with respect to an UCG G if for any vertex i of G with i ∈ K and K ∈ Cc(G), (L1) i ⊥_p Mo(K) ∖ Mo(i) | Fa(K) ∪ Mo(i), and (L2) i ⊥_p (K ∪ Fa(K)) ∖ ({i} ∪ Ne(i) ∪ Fa(i)) | Ne(i) ∪ Fa(i) ∪ Mo(K).

Theorem 6. The local and pairwise Markov properties are equivalent.

Maximum Likelihood Parameter Estimation
In this section, we introduce a procedure for computing maximum likelihood estimates (MLEs) of the parameters of an UCG G, i.e. the MLEs of the non-zero entries of (β_K)_{K,Mo(K)}, Ω_{K,K} and Ω_{K,Fa(K)} for every chain component K of G. Specifically, we adapt the procedure proposed by Drton and Eichler (2006) for computing MLEs of the parameters of an AMP CG. The procedure combines generalized least squares and iterative proportional fitting. Suppose that we have access to a data matrix D whose column vectors are the data instances and whose rows are indexed by V. Moreover, let D_X denote the data over the variables X ⊆ V, i.e. the rows of D corresponding to X. Note that thanks to Equation 1, the MLEs of the parameters corresponding to each chain component K of G can be obtained separately. Moreover, recall that K | Pa(K) ∼ N(β_K Pa(K), Ω_{K,K}^{-1}). Therefore, we may compute the MLEs of the parameters corresponding to the component K by iterating between computing the MLEs of Ω_{K,K} and Ω_{K,Fa(K)}, and thus of (β_K)_{K,Fa(K)}, while fixing (β_K)_{K,Mo(K)}, and computing the MLEs of (β_K)_{K,Mo(K)} while fixing (β_K)_{K,Fa(K)}. Specifically, we initialize (β_K)_{K,Mo(K)} to zero and then iterate through the following steps until convergence:

• Compute the MLEs of Ω_{K,K} and Ω_{K,Fa(K)} from the data D_{Fa(K)} and the residuals D_K − (β_K)_{K,Mo(K)} D_{Mo(K)}.
• Compute (β_K)_{K,Fa(K)} = −Ω_{K,K}^{-1} Ω_{K,Fa(K)}.
• Compute the MLEs of (β_K)_{K,Mo(K)} from the data D_{Mo(K)} and the residuals D_K − (β_K)_{K,Fa(K)} D_{Fa(K)}.

The first step corresponds to computing the MLEs of the parameters of a LWF CG and, thus, it is solved by running the iterative proportional fitting procedure as indicated in Lauritzen (1996). Equation 3 solves the second step. The third step corresponds to computing the MLEs of the parameters of an AMP CG and, thus, it is solved by Equation 13 in Drton and Eichler (2006). Next, we experimentally evaluate the procedure proposed.
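The iteration can be sketched as follows. This is a heavily simplified illustration, not the full procedure: we assume that every possible edge within the component and from its parents is present, so the iterative proportional fitting step reduces to inverting an empirical covariance matrix and the generalized least squares step reduces to ordinary least squares; the dimensions and ground-truth values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50000
# Ground-truth parameters (all edges present in this simplified sketch).
B_mo_true = np.array([[1.0, 0.0], [0.0, -1.0]])
B_fa_true = np.array([[0.5, 0.2], [0.0, 0.5]])
X_mo = rng.normal(size=(2, n))                    # mothers' data D_{Mo(K)}
X_fa = rng.normal(size=(2, n))                    # fathers' data D_{Fa(K)}
X_k = B_mo_true @ X_mo + B_fa_true @ X_fa + 0.3 * rng.normal(size=(2, n))

B_mo = np.zeros((2, 2))                           # initialize mothers' block to zero
for _ in range(20):
    # Step 1: joint precision of the mother-adjusted residuals and the fathers.
    R = X_k - B_mo @ X_mo
    J = np.vstack([R, X_fa])
    Omega = np.linalg.inv(J @ J.T / n)
    Omega_KK, Omega_KFa = Omega[:2, :2], Omega[:2, 2:]
    # Step 2: fathers' coefficients via Equation 3.
    B_fa = -np.linalg.inv(Omega_KK) @ Omega_KFa
    # Step 3: regress the father-adjusted residuals on the mothers (OLS here).
    R2 = X_k - B_fa @ X_fa
    B_mo = R2 @ X_mo.T @ np.linalg.inv(X_mo @ X_mo.T)

assert np.allclose(B_mo, B_mo_true, atol=0.05)
assert np.allclose(B_fa, B_fa_true, atol=0.05)
```

With zero restrictions present, step 1 would instead run iterative proportional fitting over the component's undirected structure and step 3 would use the restricted generalized least squares update of Drton and Eichler (2006).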
6.1. Experimental Evaluation. First, we generate 1000 UCGs as follows. Each UCG consists of five mothers, five fathers and 10 children. The edges are sampled independently and with probability 0.2 each. If the edges sampled do not satisfy the following two constraints, the sampling process is repeated: (i) there must be an edge from every parent to some child, i.e. the parents are real parents, and (ii) the children must be connected by an undirected path, i.e. they form a chain component.
Then, we parameterize each of the 1000 UCGs generated above as follows. The 10 children are denoted by K, and the 10 parents by Pa(K). The non-zero elements of β_K corresponding to the mothers are sampled uniformly from the interval [-3,3]. The non-zero elements of Ω_{K,K} and Ω_{K,Fa(K)} are sampled uniformly from the interval [-3,3], with the exception of the diagonal elements of Ω_{K,K}, which are sampled uniformly from the interval [0,30]. If the sampled Ω_{K,K} is not positive definite, then the sampling process is repeated. The reason why the diagonal elements are sampled from a wider interval is to avoid having to repeat the sampling process too many times.
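The rejection step can be sketched as follows; sample_precision is a hypothetical helper name, and we use a 5 × 5 matrix to keep the acceptance probability high:

```python
import numpy as np

def sample_precision(k, rng, offdiag=(-3, 3), diag=(0, 30)):
    """Sample entries uniformly and retry until positive definite."""
    while True:
        M = rng.uniform(*offdiag, size=(k, k))
        M = (M + M.T) / 2                       # symmetrize the off-diagonal entries
        M[np.diag_indices(k)] = rng.uniform(*diag, size=k)
        if np.all(np.linalg.eigvalsh(M) > 0):   # accept only positive-definite draws
            return M

rng = np.random.default_rng(2)
Omega = sample_precision(5, rng)
assert np.all(np.linalg.eigvalsh(Omega) > 0)
```

Sampling the diagonal from a wider interval, as in the experiments, raises the acceptance probability because diagonally dominant symmetric matrices are positive definite.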
Finally, note that each of the 1000 parameterized UCGs generated above specifies one probability distribution p(K | Pa(K)). The goal of our experiments is to evaluate the accuracy of the algorithm presented above at estimating the parameter values corresponding to p(K | Pa(K)) from a finite sample of p(K, Pa(K)). To generate these learning data, we sample first p(Pa(K)) and then p(K | Pa(K)). Each of the 1000 probability distributions p(Pa(K)) is constructed as follows. The off-diagonal elements of Ω_{Pa(K),Pa(K)} are sampled uniformly from the interval [-3,3], whereas the diagonal elements are sampled uniformly from the interval [0,30]. As before, we repeat the sampling process if the resulting matrix is not positive definite. Note that no element of Ω_{Pa(K),Pa(K)} is identically zero. For the experiments, we consider samples of size 500, 2500 and 5000.
For each of the UCGs and corresponding samples generated above, we run the parameter estimation procedure until the MLEs do not change in two consecutive iterations or 100 iterations are performed. We then compute the relative difference between each true parameter value θ and the MLE θ̂, which we define as abs((θ − θ̂)/θ). We also compute the absolute difference of the residuals, which we define as the residual with the parameter estimates minus the residual with the true parameter values. The procedure is implemented in R and the code is available at https://www.dropbox.com/s/b9vmqgf99da3qxm/UCGs.R?dl=0.
The results of the experiments are reported in Table 1. The difference between the quartiles Q1 and Q3 is not too big, which suggests that the column Median is a reliable summary of most of the runs. We therefore focus on this column for the rest of the analysis. Note however that some runs are exceptionally good and some others exceptionally bad, as indicated by the columns Min and Max. To avoid a bad run, one may consider a more sophisticated initialization of the parameter estimation procedure, e.g. multiple restarts.
The sample size has a clear impact on the accuracy of the MLEs, as indicated by the decrease in relative difference. For instance, half of the MLEs are less than 27 %, 12 % and 8 % away from the true values for the sample sizes 500, 2500 and 5000, respectively. The effect of the sample size on the accuracy of the MLEs can also be appreciated from the fact that the absolute difference of the residuals does not grow with the sample size. The parameters (β_K)_{K,Mo(K)} seem easier to estimate than (β_K)_{K,Fa(K)}, as indicated by the smaller relative difference of the former. This is not surprising, since the latter may accumulate the errors in the estimation of Ω_{K,K} and Ω_{K,Fa(K)} (recall Equation 3). Next, we repeat the experiments with an edge probability of 0.5 instead of 0.2, in order to consider denser UCGs. The results of these experiments are reported in Table 2. They lead to essentially the same conclusions as before. When comparing the two tables, we can see that the results for the denser UCGs are slightly worse, as expected. All in all, both tables show that the quartile Q3 of the absolute difference of the residuals is negative, which indicates that the MLEs induce a better fit of the data than the true parameter values. We therefore conclude that the parameter estimation procedure proposed works satisfactorily.
Finally, we conduct a sanity check aimed at evaluating the behavior of the parameter estimation procedure proposed when the UCG contains spurious or superfluous edges, i.e. edges whose associated parameters are zero and thus may be removed from the UCG. To this end, we repeat the previous experiments with a slight modification. Like before, the elements of (β_K)_{K,Mo(K)}, Ω_{K,Fa(K)} and Ω_{K,K} associated with the edges in the UCG are sampled uniformly from the interval [-3,3]. However, we now set 25 % of these parameters to zero. We expect the estimates for these zeroed parameters to get closer to zero as the sample size grows. The results of these experiments with an edge probability of 0.5 are reported in Table 3. The results for an edge probability of 0.2 are similar. The first three rows of the table report the number of edges. Each edge has one associated parameter, and 25 % of these parameters have been zeroed. In the next rows, the table reports the absolute difference between the zeroed parameters and their estimates, i.e. the absolute values of the estimates. As expected, the larger the sample size is, the closer to zero the MLEs of the zeroed parameters become. For instance, 75 % of the MLEs of the zeroed parameters take a value smaller than 0.76, 0.33 and 0.24 for the sample sizes 500, 2500 and 5000, respectively. Note that these numbers are much lower for the zeroed (β_K)_{K,Mo(K)} parameters (specifically 0.06, 0.03 and 0.02) because, recall, our parameter estimation procedure initializes (β_K)_{K,Mo(K)} to zero. Table 3 also shows the absolute difference of the residuals, which is comparable to that in Table 2, suggesting that the existence of spurious edges does not hinder fitting the data. We therefore conclude that the parameter estimation procedure proposed behaves as it should in this sanity check.

Causal Effect Identification
In this section, we adapt Pearl's do-calculus to UCGs for causal effect identification. Intervening on a set of nodes X aims to modify the natural causal mechanism of X, as opposed to (passively) observing X. We represent an intervention on X as do(X). For simplicity, we only consider interventions that set X to fixed values. Specifically, given an UCG G and a chain component K of G, consider the term p(K | Pa(K)) in Equation 1. Since all the open paths from Pa(K) to K in G are causal (with or without interference), observing and intervening on Pa(K) has the same effect on K. In other words, p(K | Pa(K)) = p(K | do(Pa(K))). Consider now a node A ∈ K. Since all the Pa(K)-open paths from A to K ∖ {A} are undirected, they represent interference or contagion relationships. Despite being symmetric, these relationships have some features of causal relationships (Shpitser, 2015). Therefore, observing and intervening upon A has the same effect on K ∖ {A} given Pa(K). In other words, p(K ∖ {A} | Pa(K) ∪ {A}) = p(K ∖ {A} | Pa(K), do(A)). This leads to the following formula for an intervention on a set X ⊆ V:

p(V ∖ X | do(X)) = ∏_{K ∈ Cc(G)} p(K ∖ X | Pa(K) ∪ (K ∩ X))

which satisfies the global Markov property with respect to the UCG G_{V∖X}. In other words, the result of intervening on X can be represented by the UCG resulting from removing X from G. For instance, the UCG in Figure 1 (c) indicates that, as expected, {V1, G1} and {V2, G2, D2} are independent after intervening on D1. It is worth mentioning that our treatment of interventions in UCGs is a generalization of that proposed by Lauritzen and Richardson (2002) for interventions in LWF CGs. This may come as a surprise because undirected edges in UCGs represent interference relationships whereas, in the work by Lauritzen and Richardson (2002), they represent dependencies in the equilibrium distribution of a dynamic system with feedback loops.
However, it has been suggested that interference is nothing but dependencies in an equilibrium distribution (Ogburn and VanderWeele, 2014; Ogburn et al., 2018; Shpitser, 2015). Pearl's do-calculus consists of three rules whose repeated application, together with standard probability manipulations, aims to transform a causal effect of interest into an expression that only involves observational quantities. In other words, the calculus aims to remove the do operator from the expression of interest. For instance, the calculus aims to conclude that p(D2 | do(G1)) = p(D2) and p(D2 | do(V1)) = p(D2 | V1) in the UCG in Figure 1 (c), in which case the causal effects can be estimated from observational data. Pearl's do-calculus was originally devised for so-called acyclic directed mixed graphs. However, it carries over into UCGs, as we show next. Let G′ denote an UCG G augmented with an intervention random variable F_A and an edge F_A → A for every A ∈ V, such that the domain of F_A is the same as that of A plus a state labeled idle: F_A = a corresponds to an intervention that sets A = a, whereas F_A = idle represents that A is observed rather than intervened upon. See Pearl (1995, p. 686) for further details. Moreover, let X, Y, Z and W be disjoint subsets of V, and let the statement X ⊥_{G′} Y | Z, do(W) hold if the separation X ⊥ Y | Z holds in the UCG resulting from intervening on W in G′, i.e. the result of removing W from G′. The calculus consists of the following three rules:

• Rule 1 (insertion/deletion of observations): p(Y | Z ∪ W, do(X)) = p(Y | W, do(X)) if Y ⊥_{G′} Z | W, do(X).
• Rule 2 (observation/action exchange): p(Y | Z ∪ W, do(X)) = p(Y | W, do(X), do(Z)) if Y ⊥_{G′} F_Z | Z ∪ W, do(X).
• Rule 3 (insertion/deletion of actions): p(Y | W, do(X), do(Z)) = p(Y | W, do(X)) if Y ⊥_{G′} F_Z | W, do(X).

Theorem 7. Rules 1-3 are sound for UCGs.

Identifiability of LWF and AMP CGs
This section proves that LWF and AMP CGs are identifiable when the error variables have equal variance. Given a probability distribution p that is faithful to a LWF or AMP CG G, we can identify the Markov equivalence class of G from p by, for instance, running the learning algorithm developed by Studený (1997) for LWF CGs or by Peña (2012) for AMP CGs. We prove below that we can actually identify G itself from p if the error variables have equal variance. We discuss this assumption at the end of the section. Our result generalizes a similar result reported by Peters and Bühlmann (2014) for DAGs.
Specifically, let G denote a LWF or AMP CG. Equation 2 can be written as

V_i = ∑_{j ∈ Pa(K)} β_{i,j} V_j + ε_i    (8)

in distribution, where K denotes the chain component of G that contains V_i, and ε_i denotes an error variable representing the unmodelled causes of V_i. All such error variables are jointly denoted by ε, which is distributed according to N(0, Λ), where Λ is a block diagonal matrix whose blocks correspond to the covariance matrices Λ_K for all K ∈ Cc(G). Moreover, we assume that the non-zero coefficients β_{i,j} and the non-zero entries of Λ^{-1} have been selected at random. This implies that p is faithful to G with probability almost 1, by Peña (2011, Theorems 1 and 2) and Levitz et al. (2001, Theorem 6.1). We therefore assume faithfulness hereinafter. We also assume that p has been preprocessed by rescaling ε_i, multiplying it by λ/λ_i for all i, where λ_i² = Λ_{i,i} and λ² is an arbitrary positive value. This implies that we assume that the errors ε_i have equal variance, namely λ². The following lemma proves that, after the rescaling, the error covariance matrix is still positive definite and keeps all the previous (in)dependencies, which implies that, after the rescaling, p is still faithful to G with probability almost 1.
Lemma 8. Consider rescaling ε_i in Equation 8 by multiplying it by λ/λ_i for all i. Then, the error covariance matrix represents the same independences before and after the rescaling. Moreover, the error covariance matrix is positive definite after the rescaling if and only if it was so before the rescaling.
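Lemma 8 is easy to check numerically: the rescaling maps the error covariance Λ to DΛD with D = diag(λ/λ_i), so the precision becomes D⁻¹Λ⁻¹D⁻¹ and its zero pattern is preserved. Below, the error precision matrix is hypothetical, with one zero off-diagonal entry representing an independence, and λ = 1:

```python
import numpy as np

P = np.array([[4.0, 1.0, 0.0],
              [1.0, 9.0, 2.0],
              [0.0, 2.0, 16.0]])              # error precision, zero at (0, 2)
Lam = np.linalg.inv(P)                        # error covariance before rescaling
D = np.diag(1.0 / np.sqrt(np.diag(Lam)))      # lambda / lambda_i with lambda = 1
Lam2 = D @ Lam @ D                            # covariance after rescaling
P2 = np.linalg.inv(Lam2)                      # precision after rescaling

assert np.allclose(np.diag(Lam2), 1.0)            # equal error variances
assert np.all(np.linalg.eigvalsh(Lam2) > 0)       # still positive definite
assert np.isclose(P2[0, 2], 0.0, atol=1e-9)       # the independence is preserved
```

Since D is diagonal and non-singular, congruence by D preserves both positive definiteness and every off-diagonal zero of the precision matrix, which is the content of the lemma.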
We are now ready to state formally the main result of this section.

Theorem 9. Let p be a Gaussian distribution generated by Equation 8 with equal error variances. Then, G is identifiable from p.
Note that the theorem above implies that two LWF or AMP CGs that represent the same separations are not Markov equivalent under the constraint of equal error variances, i.e. Theorem 1 does not hold under this constraint. The suitability of the equal error variances assumption should be assessed on a per domain basis. However, it may not be unrealistic to assume that it holds when the variables correspond to a relatively homogeneous set of individuals. For instance, in the case of a contagious disease, the error variable represents the unmodelled causes of an individual developing the disease, e.g. environmental factors. We may assume that these factors are the same for all the individuals, given their homogeneity. Therefore, we may assume equal error variances. We conjecture that the theorem above also holds for UCGs. However, a formal proof of this result requires first a characterization of Markov equivalence for UCGs, something that we postpone to a future article.

Discussion
LWF and AMP CGs are typically used to represent independence models. However, they can also be used to represent causal models. For instance, LWF CGs have been shown to be suitable for representing the equilibrium distributions of dynamic systems with feedback loops (Lauritzen and Richardson, 2002). AMP CGs have been shown to be suitable for representing causal linear models with additive Gaussian noise (Peña, 2016). LWF CGs have been extended into segregated graphs, which have been shown to be suitable for representing causal models with interference (Shpitser, 2015). In this paper, we have shown how to combine LWF and AMP CGs to represent causal models of domains with both interference and non-interference relationships. Moreover, we have defined global, local and pairwise Markov properties for the new models, which we have coined unified chain graphs (UCGs), and shown that these properties are equivalent (Theorems 2, 4, 5 and 6). We have also proposed and evaluated an algorithm for computing MLEs of the parameters of an UCG. Finally, we have adapted Pearl's do-calculus to UCGs for causal effect identification. In the future, we would like to (i) study Markov equivalence of UCGs, (ii) develop structure learning algorithms based on the local and pairwise Markov properties, (iii) extend UCGs to model confounding via bidirected edges, and (iv) make use of the linearity of the relations for causal effect identification in UCGs along the lines in Pearl (2009, Chapter 5).

Appendix: Proofs
Proof of Theorem 1. The result has been proven before for LWF CGs (Peña, 2011, Corollary 1). We prove it here for AMP CGs. The if part is trivial. To see the only if part, note that there are Gaussian distributions p and q that are faithful respectively to G and H (Levitz et al., 2001, Theorem 6.1). Moreover, p satisfies the global Markov property with respect to H, because G and H are Markov equivalent. Likewise for q and G. Therefore, G and H must represent the same separations.
Given an UCG G, let G_U denote the subgraph of G resulting from the following process: Start with the empty graph over U and add to it the edge A ⊸ B if and only if G has a route of the form A ⊸ B ⊸ ⋯ ⊸ C where (i) C ∈ U, (ii) ⊸ means ⇢, → or −, and (iii) the route has no subroute R ⇢ S − T. The next lemma shows that if X ⊥_G Y | Z for an UCG G, then X and Y can be extended into two sets X′ and Y′ that cover the whole of V(G_{X∪Y∪Z}) ∖ Z while remaining separated by Z.

Lemma 10. Given an UCG G and three disjoint subsets X, Y and Z of V such that X ⊥_G Y | Z, let Y′ denote the set of nodes in V(G_{X∪Y∪Z}) ∖ Z that have no Z-open route to X in G_{X∪Y∪Z}, and let X′ = V(G_{X∪Y∪Z}) ∖ Y′ ∖ Z. Then, X′ ⊥_{G_{X∪Y∪Z}} Y′ | Z.

Proof. Assume to the contrary that there exists a Z-open route π between some A′ ∈ X′ and some B′ ∈ Y′ in G_{X∪Y∪Z}. Moreover, there is a Z-open route σ between A′ and some A ∈ X in G_{X∪Y∪Z}, because A′ ∉ Y′. Let ρ = σ ∪ π, i.e. the route resulting from concatenating σ and π. Note that if A′ is neither a collider node in ρ nor in a collider section of ρ, then ρ is Z-open because A′ ∉ Z. This contradicts that B′ ∈ Y′ and, thus, A′ must be (i) in a collider section of ρ, or (ii) a collider node in ρ. Note that in case (i), the collider section is of the form V_1 → V_2 − ⋯ − A′ − ⋯ − V_{n−1} ← V_n where V_2, …, V_{n−1} ∉ Z because, otherwise, π or σ are not Z-open. This together with the fact that A′ ∉ Z implies that ρ is not Z-open in either case (i) or (ii). However, it can be modified into a Z-open route between A and B′ as follows, which contradicts that B′ ∈ Y′.
• Assume that case (i) holds. Recall that A′ ∈ V(G_{X∪Y∪Z}) and consider the following cases. First, assume that A′ ∈ V(G_Z). Then, G_{X∪Y∪Z} has a route ϱ from A′ to some C ∈ Z that only contains edges −, → and ⇢ and no subroute R ⇢ S − T. Assume without loss of generality that C is the only node in ϱ that belongs to Z. Let ϱ′ denote the route resulting from traversing ϱ from C to A′. Then, the route σ ∪ ϱ ∪ ϱ′ ∪ π is Z-open, which contradicts that B′ ∈ Y′. Second, assume that A′ ∈ V(G_X). Then, G_{X∪Y∪Z} has a route ϱ from some A′′ ∈ X to A′ that only contains edges −, ← and ⇠ and no subroute R − S ⇠ T. Then, we can reach a contradiction much in the same way as in the previous case.

• Assume that case (ii) holds. We can prove this case much in the same way as case (i). Simply note that we can now choose the route ϱ in case (i) so that it does not start with an edge −. To see this, note that if A′ is a collider node in ρ, then an edge A′′ ⇢ A′ is in ρ and, thus, in G_{X∪Y∪Z}.

A pure collider route is a route all of whose intermediate nodes are collider nodes or in a collider section of the route.
Lemma 11. Given an UCG G and three disjoint subsets X, Y and Z of V such that X ⊥_G Y | Z, there is no pure collider route in G_{X∪Y∪Z} between any vertices A′ ∈ X′ and B′ ∈ Y′, with X′ and Y′ as in Lemma 10.

Proof. Assume to the contrary that there is a pure collider route between some A′ ∈ X′ and some B′ ∈ Y′ in G_{X∪Y∪Z}, and let C_1 and C_2 denote the nodes on the route that are adjacent to A′ and B′, respectively. Then, C_1 ∈ X′ or C_1 ∈ Z. For the same reason, C_2 ∈ Y′ or C_2 ∈ Z. However, any of the four combinations implies that some node in X′ is adjacent to some node in Y′, which contradicts that X′ ⊥_{G_{X∪Y∪Z}} Y′ | Z.
Proof. We prove by induction on the number of chain components n that (Σ^{-1})_{i,j} = 0 if there are no pure collider routes in G between i and j, where Σ denotes the covariance matrix of K_1 ∪ ⋯ ∪ K_n. The result clearly holds for n = 1, because the only possible pure collider route in G between i and j is of the form i − j, and the zero restrictions associated with G imply that (Σ^{-1})_{i,j} = 0 when i and j are nonadjacent in G. Assume as induction hypothesis that the result holds for all n < r. We now prove it for n = r. Let U = K_1 ∪ ⋯ ∪ K_{r−1} and L = K_r. By the induction hypothesis, U ∼ N(0, Λ_U). Moreover, let L ∼ N(β_L Pa(L), Λ_L). Bishop (2006, Section 2.3.3) shows that (U, L)^T ∼ N(0, Σ), where

Σ^{-1} = [ Λ_U^{-1} + β_L^T Λ_L^{-1} β_L , −β_L^T Λ_L^{-1} ; −Λ_L^{-1} β_L , Λ_L^{-1} ].   (9)

Assume that i, j ∈ U. Then, Equation 9 implies that (Σ^{-1})_{i,j} = 0 if (Λ_U^{-1})_{i,j} = 0 and (β_L^T Λ_L^{-1} β_L)_{i,j} = 0. We show below that the first (respectively second) condition holds if there are no pure collider routes in G between i and j through nodes in U (respectively L).
• By the induction hypothesis, (Λ_U^{-1})_{i,j} = 0 if there are no pure collider routes in G between i and j through nodes in U.
• Moreover, (β_L^T Λ_L^{-1} β_L)_{i,j} = Σ_{k,l∈L} (β_L)_{k,i} (Λ_L^{-1})_{k,l} (β_L)_{l,j}, which is zero if, for all k, l ∈ L, (β_L)_{k,i} = 0, (Λ_L^{-1})_{k,l} = 0 or (β_L)_{l,j} = 0. These conditions rule out the existence of pure collider routes in G between i and j through nodes in L.

Now, assume that i ∈ U and j ∈ L. Then, Equation 9 implies that (Σ^{-1})_{i,j} = 0 if (β_L)_{k,i} = 0 or (Λ_L^{-1})_{k,j} = 0 for all k ∈ L. Either case holds if there are no pure collider routes in G between i and j.
Finally, assume that i, j ∈ L. Then, Equation 9 implies that (Σ^{-1})_{i,j} = (Λ_L^{-1})_{i,j}, which is zero if there are no pure collider routes in G between i and j.

Proof of Theorem 2. We prove first the if part. For any chain component K ∈ Cc(G), clearly K ⊥_G Nd(K) ∖ K ∖ Pa(K) | Pa(K) and thus K ⊥_p Nd(K) ∖ K ∖ Pa(K) | Pa(K), which implies Equation 1. For any vertex i ∈ K, clearly i ⊥_G Mo(K) ∖ Mo(i) | Fa(K) ∪ Mo(i) and thus i ⊥_p Mo(K) ∖ Mo(i) | Fa(K) ∪ Mo(i), which implies the zero restrictions on β_K. For any vertices i ∈ K and j ∈ Fa(K) ∖ Fa(i), clearly i ⊥_G j | K ∖ {i} ∪ Pa(K) ∖ {j} and thus i ⊥_p j | K ∖ {i} ∪ Pa(K) ∖ {j}, which implies the zero restrictions on Ω_{K,Pa(K)}. Finally, for any nonadjacent vertices i, j ∈ K, clearly i ⊥_G j | K ∖ {i, j} ∪ Pa(K) and thus i ⊥_p j | K ∖ {i, j} ∪ Pa(K), which implies the zero restrictions on Ω_{K,K}.
We now prove the only if part. Consider three disjoint subsets X, Y and Z of V such that X ⊥_G Y | Z. Let K_1, …, K_n denote the topologically sorted chain components of G_{X∪Y∪Z}. Note that G_{K_1∪⋯∪K_n} and G_{X∪Y∪Z} only differ in that the latter may not have all the edges ⇢ in the former. Note also that K_1, …, K_n are chain components of G, too. Consider a topological ordering of the chain components of G, and let Q_1, …, Q_m denote the components that precede K_n in the ordering, besides K_1, …, K_{n−1}. Note that the edges in G from any Q_i to any K_j must be of the type ⇢ because, otherwise, Q_i would be a component of G_{X∪Y∪Z}. Therefore, G_{Q_1∪⋯∪Q_m∪K_1∪⋯∪K_n} and G_{Q_1∪⋯∪Q_m} ∪ G_{X∪Y∪Z} only differ in that the latter may not have all the edges ⇢ in the former. In other words, the latter may impose additional zero restrictions on the elements of some β_{K_j} corresponding to the mothers. Consider adding such additional restrictions to the marginal distribution p(Q_1, …, Q_m, K_1, …, K_n) obtained from p via Equations 1 and 2, i.e. consider setting the corresponding elements of β_{K_j} to zero. Call the resulting distribution p′(Q_1, …, Q_m, K_1, …, K_n).
Finally, recall again that G_{Q_1∪⋯∪Q_m∪K_1∪⋯∪K_n} and G_{Q_1∪⋯∪Q_m} ∪ G_{X∪Y∪Z} only differ in that the latter may not have all the edges ⇢ in the former. Note also that every node in X ∪ Y ∪ Z has the same mothers in G_{Q_1∪⋯∪Q_m∪K_1∪⋯∪K_n} and G_{Q_1∪⋯∪Q_m} ∪ G_{X∪Y∪Z}. Then, p(X ∪ Y ∪ Z) = p′(X ∪ Y ∪ Z) by Lemma 12 and, thus, X ⊥_p Y | Z if and only if X ⊥_{p′} Y | Z. Now, recall from Lemma 11 that X ⊥_G Y | Z implies that there is no pure collider route in G_{X∪Y∪Z} between any vertices A′ ∈ X′ and B′ ∈ Y′. Then, there is no such route either in G_{Q_1∪⋯∪Q_m} ∪ G_{X∪Y∪Z}, because this UCG has no edges between the nodes in V(G_{Q_1∪⋯∪Q_m}) and the nodes in V(G_{X∪Y∪Z}). This implies that A′ ⊥_{p′} B′ | X′ ∖ {A′} ∪ Y′ ∖ {B′} ∪ Z ∪ Q_1 ∪ ⋯ ∪ Q_m by Lemma 13 and Lauritzen (1996, Proposition 5.2), which implies X′ ⊥_{p′} Y′ | Z ∪ Q_1 ∪ ⋯ ∪ Q_m by repeated application of intersection. Moreover, X′ ⊥_{p′} Q_1 ∪ ⋯ ∪ Q_m | Z because G_{Q_1∪⋯∪Q_m} ∪ G_{X∪Y∪Z} has no edges between the nodes in V(G_{Q_1∪⋯∪Q_m}) and the nodes in V(G_{X∪Y∪Z}). This implies X ⊥_{p′} Y | Z by contraction and decomposition which, as shown, implies X ⊥_p Y | Z.
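For reference, the properties of independence models invoked in this and the neighbouring proofs are the standard (semi)graphoid properties; for pairwise disjoint sets of variables, every Gaussian distribution with a positive definite covariance matrix satisfies all of them (intersection and composition do not hold for arbitrary distributions):

```latex
\begin{align*}
\text{decomposition:} \quad & X \perp Y \cup W \mid Z \;\Rightarrow\; X \perp Y \mid Z\\
\text{weak union:}    \quad & X \perp Y \cup W \mid Z \;\Rightarrow\; X \perp Y \mid Z \cup W\\
\text{contraction:}   \quad & X \perp Y \mid Z \cup W \,\wedge\, X \perp W \mid Z \;\Rightarrow\; X \perp Y \cup W \mid Z\\
\text{intersection:}  \quad & X \perp Y \mid Z \cup W \,\wedge\, X \perp W \mid Z \cup Y \;\Rightarrow\; X \perp Y \cup W \mid Z\\
\text{composition:}   \quad & X \perp Y \mid Z \,\wedge\, X \perp W \mid Z \;\Rightarrow\; X \perp Y \cup W \mid Z
\end{align*}
```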
Proof of Theorem 3. We prove the result by induction on the number of chain components n. The result holds for n = 1 by Equation 5 (Jones and West, 2005, Theorem 1). Assume as induction hypothesis that the result holds for all n < m. We now prove it for n = m. Let U = K_1 ∪ ⋯ ∪ K_{m−1} and L = K_m. Let U ∼ N(0, Λ_U) and L ∼ N(β_L U, Λ_L) and, thus, (U, L)^T ∼ N(0, Σ). Bishop (2006, Section 2.3.3) proves that

Σ = [ Λ_U , Λ_U β_L^T ; β_L Λ_U , Λ_L + β_L Λ_U β_L^T ].

Assume that i, j ∈ U. Then, (Σ)_{i,j} = (Λ_U)_{i,j} and the result follows by the induction hypothesis. Now, assume that i ∈ L and j ∈ U, so that (Σ)_{i,j} = Σ_k (β_L)_{i,k} (Λ_U)_{k,j}. By the induction hypothesis, (Λ_U)_{k,j} can be written as a sum of products of weights over the edges of the open paths between k and j in G. Moreover, as discussed in Section 3 in relation to Equation 5, (β_L)_{i,k} can be written as a sum of products of weights over the edges of the paths from k to i through nodes in L. Since the latter paths all start with a directed edge out of k, the previous observations together imply the desired result. Finally, assume that i, j ∈ L, so that (Σ)_{i,j} = (Λ_L)_{i,j} + Σ_{k,l} (β_L)_{i,k} (Λ_U)_{k,l} (β_L)_{j,l}. As discussed in Section 3, (Λ_L)_{i,j} can be written as a sum of products of weights over the edges of the paths between i and j through nodes in L. By the induction hypothesis, (Λ_U)_{k,l} can be written as a sum of products of weights over the edges of the open paths between k and l in G. Again, as discussed in Section 3, (β_L)_{i,k} (respectively, (β_L)_{j,l}) can be written as a sum of products of weights over the edges of the paths from k to i (respectively, from l to j) through nodes in L. Since the latter paths all start with a directed edge out of k (respectively, out of l), the previous observations together imply the desired result.
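The block covariance and precision identities from Bishop (2006, Section 2.3.3) used in the proofs above can be checked numerically. Below is a minimal sketch with made-up dimensions and parameters (all variable names are ours): for U ∼ N(0, Λ_U) and L = β_L U + ǫ_L, it verifies that the stated block matrices are indeed the joint covariance and its inverse.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions |U| = 3, |L| = 2 with made-up parameters.
nU, nL = 3, 2
M = rng.standard_normal((nU, nU))
Lam_U = M @ M.T + nU * np.eye(nU)   # positive definite
N = rng.standard_normal((nL, nL))
Lam_L = N @ N.T + nL * np.eye(nL)   # positive definite
beta = rng.standard_normal((nL, nU))

# Joint covariance of (U, L) with U ~ N(0, Lam_U), L = beta U + eps_L.
Sigma = np.block([[Lam_U,        Lam_U @ beta.T],
                  [beta @ Lam_U, Lam_L + beta @ Lam_U @ beta.T]])

# The corresponding block precision matrix.
Lam_L_inv = np.linalg.inv(Lam_L)
Prec = np.block([[np.linalg.inv(Lam_U) + beta.T @ Lam_L_inv @ beta,
                  -beta.T @ Lam_L_inv],
                 [-Lam_L_inv @ beta, Lam_L_inv]])

# The two block expressions are inverses of each other.
assert np.allclose(Sigma @ Prec, np.eye(nU + nL))
```

The identity follows from writing Σ = A diag(Λ_U, Λ_L) A^T with A the unit lower block-triangular matrix containing β_L, so that Σ^{-1} = A^{-T} diag(Λ_U^{-1}, Λ_L^{-1}) A^{-1}.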
Proof of Theorem 4. We prove first the if part. The property B1 together with weak union and composition imply K ⊥_p Nd(K) ∖ K ∖ Pa(K) | Pa(K) and thus Equation 1. Moreover, B1 implies i ⊥_p Mo(K) ∖ Mo(i) | Fa(K) ∪ Mo(i) by decomposition and, thus, the zero restrictions on β_K. Moreover, B2 implies i ⊥_p Fa(K) ∖ Fa(i) | K ∖ {i} ∪ Fa(i) ∪ Mo(K) by decomposition, which implies i ⊥_p j | K ∖ {i} ∪ Pa(K) ∖ {j} for all j ∈ Fa(K) ∖ Fa(i) by weak union, which implies the zero restrictions on Ω_{K,Pa(K)}. Finally, B3 implies the zero restrictions on Ω_{K,K}. We now prove the only if part. Note that if p satisfies Equation 1 and the zero restrictions associated with G, then it satisfies the global Markov property with respect to G by Theorem 2. Now, it is easy to verify that the block-recursive property holds.
Proof of Theorem 5. Note that Nd(i) = Nd(K). Then, the properties P1 and P2 imply respectively B1 and B2 by repeated application of intersection. Similarly, P2 implies i ⊥_p Nd(K) ∖ {i} ∖ Ne(i) ∖ Pa(K) | Pa(K) ∪ Ne(i) by repeated application of intersection, which implies i ⊥_p K ∖ {i} ∖ Ne(i) | Pa(K) ∪ Ne(i) by decomposition, which implies B3 by Lauritzen (1996, Theorem 3.7). Finally, the property B1 implies P1 by weak union. Likewise, B2 implies P2 by weak union if j ∉ K. Assume now that j ∈ K and note that B2 implies i ⊥_p Nd(K) ∖ K ∖ Pa(K) | K ∖ {i} ∪ Pa(K) by weak union, and B3 implies i ⊥_p K ∖ {i} ∖ Ne(i) | Pa(K) ∪ Ne(i). Then, i ⊥_p Nd(K) ∖ {i} ∖ Ne(i) ∖ Pa(K) | Pa(K) ∪ Ne(i) by contraction, which implies P2 by weak union.
Proof of Theorem 6. The properties L1 and L2 imply respectively P1 and P2 by weak union. Note also that the pairwise Markov property implies the global property by Theorems 2, 4 and 5. Now, it is easy to verify that the local property holds.
Proof of Theorem 7. Rule 1 holds because the antecedent implies that Y ⊥ Z | W holds in G_{V∖X}, and p(V ∖ X | do(X)) satisfies the global Markov property with respect to G_{V∖X} due to Equation 6. Rule 2 holds because the antecedent implies that observing Z cannot be distinguished from intervening on Z. Rule 3 holds because the antecedent implies that no W-open path in G′ after intervening on X is of the form Z ⊸ ⋯ ⊸ A with A ∈ Y, where ⊸ means →, ⇢ or −. For the rest of the W-open paths that reach Z, intervening on Z is irrelevant.
Proof of Lemma 8. Let Λ and Λ̃ denote the error covariance matrices before and after the rescaling, respectively. Note that we only need to consider independences between singletons. In particular, ǫ_i is independent of ǫ_j conditioned on ǫ_L if and only if (Λ_{ijL,ijL}^{-1})_{i,j} = 0 (Lauritzen, 1996, Proposition 5.2) and, thus, if and only if the (j, i) minor of Λ_{ijL,ijL} is zero, i.e. det(M) = 0 where M is the result of removing the row j and column i from Λ_{ijL,ijL}. Note that det(M) = Σ_{π∈S} sign(π) Π_k M_{k,π(k)}, where S denotes all the permutations over the number of rows or columns of M. Let M̃ denote the corresponding minor of Λ̃_{ijL,ijL}. Then, det(M̃) = det(M) Π_k λ²/λ_k² and, thus, det(M) = 0 if and only if det(M̃) = 0.
A symmetric matrix is positive definite if and only if the determinants of all its upper-left submatrices are positive. Therefore, it follows from the previous paragraph that Λ and Λ̃ are either both positive definite or neither is.
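Lemma 8 can also be checked numerically on a small made-up example (our names, not the paper's): rescaling the errors to a common variance preserves both the zero pattern of the precision matrix, and hence the represented independences, and positive definiteness.

```python
import numpy as np

# Made-up error precision with a conditional independence:
# Prec[0, 2] = 0, i.e. eps_0 independent of eps_2 given eps_1.
Prec = np.array([[2.0, 0.6, 0.0],
                 [0.6, 3.0, 0.4],
                 [0.0, 0.4, 1.5]])  # symmetric, diagonally dominant
Lam = np.linalg.inv(Prec)

# Rescale eps_i by lambda / lambda_i with lambda_i^2 = Lam[i, i] and
# lambda^2 = 1, i.e. a congruence with a positive diagonal matrix.
D = np.diag(1.0 / np.sqrt(np.diag(Lam)))
Lam_t = D @ Lam @ D

# Equal variances after rescaling, the same zero pattern in the
# precision matrix, and positive definiteness is preserved.
assert np.allclose(np.diag(Lam_t), 1.0)
assert abs(np.linalg.inv(Lam_t)[0, 2]) < 1e-10
assert np.all(np.linalg.eigvalsh(Lam_t) > 0)
```

Since Λ̃ = DΛD with D diagonal and positive, Λ̃^{-1} = D^{-1}Λ^{-1}D^{-1}, which makes both claims of the lemma immediate in this matrix form.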
Proof of Theorem 9. Under the faithfulness assumption, we can identify the Markov equivalence class of G from p by, for instance, running the learning algorithm developed by Studený (1997) for LWF CGs or by Peña (2012) for AMP CGs. Assume to the contrary that there is a second CG G′ in the equivalence class that generates p via Equation 8. Let N and N′ denote the nodes without parents in G and G′, respectively. Assume that N ≠ N′. Then, there exists some node L ∈ V that is in N but not in N′, or vice versa. Assume the latter without loss of generality. Then, there is an edge Y → L that is in G but not in G′. Let Q = Pa_G(L) ∖ {Y}. Then, we have from G that L = β_Q Q + β_Y Y + ǫ_L by Equation 8. Note that β_Y ≠ 0 by the faithfulness assumption. Define L* = L_{Q=q} and Y* = Y_{Q=q} in distribution. Since ǫ_L ⊥_p Q ∪ {Y} by construction, we have from G that L* = β_Q q + β_Y Y* + ǫ_L with ǫ_L ⊥_p Y* (Peters et al., 2011, Lemma 2) and, thus, var(L*) > var(ǫ_L) = λ². However, we have from G′ that var(L*) ≤ λ² (Peters and Bühlmann, 2014, Lemma A1). This is a contradiction and, thus, N = N′. Note that the undirected edges between the nodes in N must be the same in G and G′, because Markov equivalent CGs have the same adjacencies, as shown by Frydenberg (1990, Theorem 5.6) for LWF CGs and by Andersson et al. (2001, Theorem 5) for AMP CGs. For the same reason, the directed edges from N to V ∖ N must be the same in G and G′. Since G_{V∖N} and G′_{V∖N} generate p(V ∖ N | N = n) via Equation 8, we can repeat the reasoning above replacing G, G′ and p by G_{V∖N}, G′_{V∖N} and p(V ∖ N | N = n). This allows us to conclude that the nodes with no parents in G_{V∖N} and their corresponding edges coincide with those in G′_{V∖N}. By continuing this process, we can conclude that G = G′.
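The variance comparison at the heart of this proof can be illustrated with a minimal two-node sketch (hypothetical coefficient beta_Y, our notation): under equal error variances lam2, a node with a parent has marginal variance strictly greater than lam2, while a source node has variance exactly lam2, which is what distinguishes N from V ∖ N.

```python
# Two-node example Y -> L under equal error variances lam2.
lam2 = 1.0
beta_Y = 0.7  # hypothetical non-zero coefficient (faithfulness)

# Y = eps_Y and L = beta_Y * Y + eps_L, with var(eps_Y) = var(eps_L)
# = lam2 and eps_L independent of Y, so the marginal variances are:
var_Y = lam2
var_L = beta_Y**2 * var_Y + lam2

# A source node attains variance lam2; a node with a parent exceeds it.
assert var_Y == lam2
assert var_L > lam2
```

This mirrors the contradiction in the proof: a model in which L has no parents would force var(L*) ≤ λ², while the generating model forces var(L*) > λ².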