Potential Outcome and Decision Theoretic Foundations for Statistical Causality

In a recent paper published in the Journal of Causal Inference, Philip Dawid has described a graphical causal model based on decision diagrams. This article describes how single-world intervention graphs (SWIGs) relate to these diagrams. In this way, a correspondence is established between Dawid's approach and those based on potential outcomes such as Robins' Finest Fully Randomized Causally Interpreted Structured Tree Graphs. In more detail, a reformulation of Dawid's theory is given that is essentially equivalent to his proposal and isomorphic to SWIGs.


Introduction
In his recent article, Decision Theoretic Foundations for Causality, Philip Dawid elaborates on a theory that he advanced previously [1]. We welcome Dawid's efforts to build a foundation for causal models that aims to develop a graphical framework, while placing an emphasis on making assumptions that are both transparent and testable. Similar concerns have also motivated much of our previous work on potential outcome models represented in terms of finest fully randomized causally interpreted structured tree graphs (FFRCISTGs) [2] and single-world intervention graphs (SWIGs) [3].
Indeed, like Dawid, we have argued that, in contrast, the assumption of independent errors that is typically adopted by users of Pearl's non-parametric structural equations (also called structural causal models) is untestable and also imposes (superexponentially) many assumptions that are unnecessary for most purposes; furthermore, the independent error assumptions allow the identification of causal quantities that cannot be identified via any randomized experiment on the observed variables [4]. Thus, this assumption contradicts the dictum "no causation without manipulation" and severs the connection between experimentation and causal inference that has been central to much of the conceptual progress during the last century. We also note that [5] cites the move to specifying causal models using potential outcomes rather than error terms as underpinning the "credibility revolution" in Econometrics.
In our view, Dawid's updated theory represents a marked advance on his earlier proposal in that it requires stronger ontological commitments, specifically, the existence of an "intent-to-treat" (ITT) variable, before a model may be called causal. ITT variables are necessary and important in order to encode the notion of ignorability and the effect of treatment on the treated.
In addition, as noted by Dawid, the ITT variables make it possible to connect his approach to that based on potential outcomes¹ and SWIGs. The connection between the two approaches may help to illuminate the strengths and weaknesses of each formalism. We also present a reformulation of Dawid's theory that is essentially equivalent to his proposal and isomorphic to SWIGs.
We thank Philip Dawid for helpful feedback on our article; in particular, for pointing out a significant omission regarding our proposed definition of distributional consistency for SWIGs. We also thank him for his patience regarding the completion of this manuscript.

Relating observational and experimental worlds
At a high level, every approach to causal inference relates a model describing a factual passively observed world and models describing hypothetical "interventional" worlds in which a treatment (or exposure) variable takes on a specific value.
In both the current and previous decision-theoretic conceptions advocated by Dawid, these worlds "exist" at least hypothetically, as different distributions. The relation is then created by the assertion of equalities linking different parts of these distributions. In Dawid's formalism, the set of distributions is represented using a single kernel object in which non-random regime indicators (also called "policy variables" by [6]) index the different distributions; there is no requirement that these distributions live on the same probability space. Dawid encodes the equalities between the observational and interventional worlds via extended conditional independence (ECI) relations, including independence from (and conditional on) regime indicators.
In the standard presentation of the potential outcome approach, random variables corresponding to the outcomes for an individual under all possible interventions² are assumed to exist, living on a common probability space. The consistency assumption then serves to construct the factual variables as a deterministic function of the potential outcomes. Owing to the fundamental problem of causal inference, the resulting factual distribution is consistent with many different intervention distributions. However, under additional Markov restrictions on the joint distribution of the potential outcomes, the interventional distribution becomes identified from the joint distribution of the factuals under a positivity assumption. Notwithstanding this, often in practice, data are obtained on a subset of the factual variables, in which case some or even all interventional distributions become only partially identified from the available (i.e., the observed) data.
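The role of the consistency assumption, and the fact that one factual distribution is compatible with several interventional distributions, can be illustrated with a small enumeration. This is a minimal sketch under stated assumptions: the two toy models, the binary variables, and the function names `factual` and `interventional` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from fractions import Fraction as F


def factual(joint):
    """Map a joint law over (t, y0, y1) to the factual law over (t, y).

    Consistency is applied at the level of random variables: Y = Y(T).
    """
    out = defaultdict(F)
    for (t, y0, y1), p in joint.items():
        y = y1 if t == 1 else y0
        out[(t, y)] += p
    return dict(out)


def interventional(joint, t):
    """Marginal law of the potential outcome Y(t)."""
    out = defaultdict(F)
    for (_, y0, y1), p in joint.items():
        out[y1 if t == 1 else y0] += p
    return dict(out)


# Hypothetical model A: treatment is independent of (Y(0), Y(1)) and has an effect.
model_a = {(0, 0, 1): F(1, 2), (1, 0, 1): F(1, 2)}
# Hypothetical model B: no individual-level effect; Y(0) = Y(1) tracks the natural treatment.
model_b = {(0, 0, 0): F(1, 2), (1, 1, 1): F(1, 2)}

print(factual(model_a) == factual(model_b))   # same factual distribution
print(interventional(model_a, 1))             # law of Y(1) under model A
print(interventional(model_b, 1))             # law of Y(1) under model B
```

Both models generate the same law for the factual pair (T, Y), yet they disagree about the law of Y(1); only additional restrictions on the joint distribution of the potential outcomes can separate them.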

SWIGs
The SWIG approach is designed to provide a simple way to relate graphs representing joint distributions over the observed variables and those representing joint distributions over potential outcomes. The approach is "single world" in that each of the constraints defining the model concerns a set of potential outcomes corresponding to a single joint intervention on the target variables.³ Following [2,3], we will assume throughout that there is a set of variables indexed by $V = \{1, \ldots, p\}$ and that a pre-specified (possibly strict) subset $A \subseteq V$ of these variables are targets for intervention. Often, we will, with a slight abuse of notation, also refer to the corresponding sets of random variables as V and A, respectively.
However, for proofs and formal statements, it is sometimes necessary to distinguish between the random variables and the sets that index them. For this purpose, we introduce the following notation: we define $X_B \equiv \{X_i : i \in B\}$ for $B \subseteq V$, so that the complete set of factual variables is $X_V$ and the subset that are targets for intervention are $X_A$. We use $\mathfrak{X}_i$ as the state space for the variable $X_i$, and we will let $\mathfrak{X}_V \equiv \times_{i \in V} \mathfrak{X}_i$ and $\mathfrak{X}_A \equiv \times_{i \in A} \mathfrak{X}_i$ be the state spaces for the variables with indices in V and A, respectively. Similarly, given an assignment $x_V$ to the variables (with indices) in V, we let $x_i$ and $x_B$ refer to the values assigned to $X_i$ and to the set $X_B$. We also make use of the usual shorthand, using, for example, $A_i$ to refer to $X_{A_i}$, A for $X_A$, and $a_i$ to denote $x_{a_i}$.
¹ Though Dawid and others distinguish between potential outcomes and counterfactuals on philosophical grounds, we do not do so here; we think that this distinction, though of interest, is a separate issue from those under discussion here.
² This does not mean that it is assumed that all variables can be intervened on.
³ Consequently, though SWIGs define a potential outcome model, they do not impose "cross-world" assumptions such as strong ignorability ( )
Definition 1. Given a directed acyclic graph (DAG) $\mathcal{G}$ with vertex set V, the SWIG $\mathcal{G}(a)$ corresponding to an intervention that sets the variables in A to $a \in \mathfrak{X}_A$ is constructed as follows:
(1) Every vertex $A_i \in A$ is split into two halves, a "random half" and a "fixed half."
(2) The random half contains $A_i$ and inherits all of the incoming edges directed into $A_i$ in the original graph.
(3) The fixed half inherits all of the outgoing edges directed out of $A_i$ in the original graph and is labeled with the value $a_i$.
(4) The random vertices in the graph are then relabeled according to one of the schemes described below.
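Steps (1) to (3) of the node-splitting construction can be sketched in a few lines. This is only an illustration under stated assumptions: the edge-list representation, the function names `swig` and `parents`, and the convention of naming the fixed half by lowercasing the vertex name (so vertex names are assumed to be uppercase) are hypothetical, and the relabeling of random vertices in step (4) is deliberately omitted.

```python
def swig(edges, targets):
    """Return the edge list of the split graph for a DAG given as (parent, child) pairs.

    Random halves keep the original vertex name and all incoming edges.
    Fixed halves are named by lowercasing (an illustrative convention) and
    take over all outgoing edges of the vertex that was split.
    """
    def fixed(vertex):
        return vertex.lower()

    return [(fixed(p) if p in targets else p, c) for p, c in edges]


def parents(edge_list, vertex):
    """Parents (random and fixed) of a vertex in an edge list."""
    return sorted(p for p, c in edge_list if c == vertex)


# Hypothetical DAG: L -> A, L -> Y, A -> Y, with A the only intervention target.
dag = [("L", "A"), ("L", "Y"), ("A", "Y")]
g_a = swig(dag, {"A"})

print(g_a)                 # edges of the split graph
print(parents(g_a, "A"))   # the random half of A keeps its incoming edge from L
print(parents(g_a, "Y"))   # Y now has the fixed node a as a parent instead of A
```

Because the fixed half has no incoming edges, the random half of A is no longer an ancestor of Y in the split graph.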
There are three labeling schemes that may be employed in step (4): Temporal labeling: Given a total ordering of the vertices on the original graph, each random vertex with the values corresponding to those vertices , where a Y an a corresponds to those fixed vertices a i that are still ancestors of Y after splitting the nodes in A.
Temporal labeling may be seen as encoding the assumption that interventions in the future do not affect outcomes in the past.Thus, the potential outcome for a variable ( ) … Y a a , , k 1 , in a world in which there is an intervention on . This is the natural labeling scheme to apply in the context , because after intervention on B, there is no directed path from A to C. This labeling corresponds to the interpretation of missing edges in the graph in terms of the absence of individual-level direct effects so that, for example, [3, §7] also discuss more general schemes that assume a time order, but also allow some missing edges to be interpreted at the individual level and others at the population (or distribution) level; in that article ancestral labeling is termed "minimal labeling." Uniform labeling corresponds to the absence of any assumption regarding equality of potential outcomes (as random variables) across different interventions.⁴In the potential outcome framework, this would often appear somewhat unnatural.However, in this article, we will use this labeling to show that although we may wish to adopt the additional equalities between potential outcomes that are implied by the temporal and/or causal relationships, our results do not require these equalities.In addition, SWIGs with this labeling scheme are essentially isomorphic to the augmented decision diagrams proposed in [8].
In particular, note that under the uniform labeling scheme, the sets of random variables appearing in two SWIGs $\mathcal{G}(a)$ and $\mathcal{G}(a^*)$, where $a, a^* \in \mathfrak{X}_A$ are distinct, have no overlap; this will continue to hold when, in section 3.7, we consider SWIGs $\mathcal{G}(b)$ where we intervene on a (possibly empty) subset $B \subseteq A$.

Distributional consistency for SWIGs
In order to relate passively observed distributions to those under intervention, we introduce a consistency assumption relating sets of counterfactual distributions. For this purpose, we introduce the following notation:
$\mathbb{P}_A \equiv \{p(V(a)) : a \in \mathfrak{X}_A\}$, $\mathbb{P}_{\subseteq A} \equiv \bigcup_{D \subseteq A} \mathbb{P}_D$.
Thus, $\mathbb{P}_A$ is the set of counterfactual distributions over V that arise from all possible joint interventions setting the variables in A to a value $a \in \mathfrak{X}_A$. Likewise, $\mathbb{P}_{\subseteq A}$ is the set of counterfactual distributions over V resulting from all possible joint interventions on subsets D of A; this includes the case $D = \emptyset$, corresponding to the observed distribution, so $p(V) \in \mathbb{P}_{\subseteq A}$. We make the following consistency assumption.⁵

Definition 2. (Distributional consistency for SWIGs) The set of distributions $\mathbb{P}_{\subseteq A}$ will be said to obey distributional consistency if, given $B_i \in A$ and $C \subseteq A \setminus \{B_i\}$, where C may be empty, for all y, b, c: where Equalities (3) and (4) simply state that the probability of the event , where $B_i$ is the "natural" or (in Dawid's terminology) ITT variable, remains the same whether or not there is (subsequently) an intervention that targets $B_i$ and sets it to b.
. This has the interpretation that an intervention on $B_i$ setting it to b is "ideal" in the sense that for the remaining variables Y, the intervention does not change the distribution of Y given can be seen as following from the fact that $B_i$ and $B_i(b)$ represent, respectively, the natural value taken by $B_i$ in the absence of an intervention and the natural value of $B_i$ immediately prior to an intervention.
Under a standard potential outcome model that includes equalities between random variables, (3) follows directly from the consistency assumption and recursive substitution: As with the discussion of labeling earlier, in a potential outcome theory, it is natural to assume consistency at the level of random variables. Our motivation here for formulating consistency via (3) as a relation between distributions is solely to make clear that we do not require the stronger assumption for our results. However, proceeding in this way makes the notation more cumbersome since every potential outcome variable is labeled with every intervention.
Distributional consistency may also be formulated in terms of a dynamic regime. Let $g_i^*$ denote the dynamic regime⁷ on B which "intervenes" to set the intervention target to the "natural" value that the variable $B_i$ would take in the absence of an intervention. Let $V(g_i^*, c)$ be the set of potential outcomes that would arise under $g_i^*$ in conjunction with an intervention setting C to c. We may then re-express (3) as: In words, in the context of an intervention setting C to c, a dynamic regime that intervenes to set $B_i$ to the value that it would have taken anyway has no effect on the distribution of V.⁸ Though the distributional consistency assumption involves a single variable $B_i$, repeated applications imply the same conclusion for a set B.
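The verbal content of distributional consistency, namely that the probability of observing natural treatment value b together with outcome y is unchanged by a subsequent intervention setting the treatment to b, can be checked numerically in a toy potential outcome model. The sketch below is an assumption-laden illustration: the hidden variables H and U, the structural functions, the probabilities, and the function names are all hypothetical, and the model builds in consistency at the level of random variables.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical exogenous laws for a hidden confounder H and a noise term U.
P_H = {0: F(1, 2), 1: F(1, 2)}
P_U = {0: F(3, 4), 1: F(1, 4)}


def natural_b(h, u):
    """Natural (ITT) value of the treatment; it never depends on the intervention."""
    return h ^ u


def outcome(b_applied, h):
    """Outcome as a function of the treatment actually applied and of H."""
    return b_applied ^ h


def law(intervene_b=None):
    """Joint law of (natural value of B, outcome Y).

    With intervene_b=None the world is passively observed; otherwise the
    treatment is set to intervene_b after its natural value has been realised.
    """
    out = {}
    for (h, ph), (u, pu) in product(P_H.items(), P_U.items()):
        b_nat = natural_b(h, u)
        b_applied = b_nat if intervene_b is None else intervene_b
        key = (b_nat, outcome(b_applied, h))
        out[key] = out.get(key, F(0)) + ph * pu
    return out


observed = law()
for b, y in product((0, 1), repeat=2):
    # Events in which the natural value already equals b are unaffected by setting B to b.
    assert observed.get((b, y), F(0)) == law(b).get((b, y), F(0))

# Events in which the natural value differs from the imposed value can change.
print(observed[(1, 0)], law(0)[(1, 0)])
```

The equality holds here by construction, which mirrors the remark that in a standard potential outcome model the distributional statement follows from consistency between random variables; the final line shows that no such equality is available when the natural value differs from the value imposed.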

Lemma 3. If $\mathbb{P}_{\subseteq A}$ obeys distributional consistency, B and C are disjoint subsets of A, where C may be empty, then for all y, b, c: where $Y = V \setminus B$.
Proof. We prove this by induction on the size of B. The base case follows by definition of distributional consistency. Let $B_i$ be a variable in B, and let
Here, the second equality applies distributional consistency, taking "C" to be $B_{-i} \cup C$; the third applies the induction hypothesis, taking "Y" to be $Y \cup \{B_i\}$ and "B" to be $B_{-i}$. □
⁶ In a standard potential outcome model, it would follow by definition of $B_i$, as the "natural value" of treatment, that $B_i(b) = B_i$. Indeed, in the standard potential outcome approach, there will not be a need to write $B_i(b)$. Readers familiar with the do-operator [9] should be aware that whereas in that theory, intervention on a variable precludes observing the natural value, in the potential outcome theory, at least conceptually, we are supposing that the natural value could be observed and then, an instant later, we could intervene upon it without the natural value having any downstream causal effects. Also, note that the SWIG local Markov property (Definition 7) will imply the stronger condition that ( ( ) , B (see Lemma 8).
⁷ A dynamic regime is an intervention in which the value to which the variable is set is a function of the values taken by earlier variables. With $g_i^*$, the earlier variable is the natural value that the variable would take on (see footnote 6).
⁸ Note that for (5) to be equivalent to (3), we require that when ( . This will hold if we assume: (i) that the values taken by variables that occur prior to the intervention on $B_i$ are unaffected by this intervention, and (ii) for variables that arise after the intervention, it makes no difference whether the value b is imposed due to the dynamic regime $g_i^*$ and natural value ( .
The next lemma relates equality of conditional distributions with and without an intervention on B.
Lemma 4. Suppose ⊆ A obeys distributional consistency.Let B and C be disjoint subsets of A, where C may be empty, and let Y and W be disjoint subsets of ⧹ V B. It then follows that: Proof.This follows by applying Lemma 3 to ( ( , , , .□ In addition, we have the following: Lemma 5. Suppose ⊆ A obeys distributional consistency, and let B and C be disjoint subsets of A, where C may be empty.
, is not a function of b, then it follows from distributional consistency that ( ( Proof. ( ) p M b may still be functions of b, in which case there is no way to apply (3) to relate them to distributions in which B is not intervened on.
However, when the conditioning set contains B, we have the following: Lemma 6. Suppose ⊆ A obeys distributional consistency, with B and C disjoint subsets of A, where C may be empty.Further, let Y and W be disjoint sets with Proof.
Here, the second equality uses the fact that ( ( is not a function of b, while the third follows from distributional consistency via Lemma 4. □

Local Markov property defining the SWIG model
Although we derive a SWIG graphically from the original DAG by node splitting, we will define the model by associating a local Markov property with the SWIG and the potential outcome distribution. The resulting model corresponds to the FFRCISTG model of [2] (see [3, Appendix C]). We will then derive the Markov property for the original DAG and the observed distribution from these by applying distributional consistency.
Given a DAG with vertices to indicate the (index) set of variables that are the parents of W i in the original DAG , and let ( ) 1 , the predecessors of i under a total ordering ≺ that is consistent with the edges in .We will drop the subscript when the DAG or ordering is clear from context.
The SWIG local Markov property is defined on the set of distributions $\mathbb{P}_A$, where $A \subseteq V$ is the maximal set of variables that may be intervened on (see [10, §1.2.4] and [11]).

Definition 7. A set of potential outcome distributions
is a function only of In words, (9) states that after intervening on A, the distribution of ( ) X a i given its predecessors depends solely on the values taken by intervention targets in A that are parents of i, and by any other (random) variables that are parents of i but that are not intervened on, and hence are not in A.⁹ Though the function of the local property is to define and characterize the potential outcome model, intuition may be gained by observing that the local property follows from d-separation applied to the SWIG ( ) a .¹⁰Specifically, the condition (9) corresponds to two sets of d-separations.d-separation from fixed nodes: a by the d-separation of ( ) X a i from fixed nodes a j that correspond to vertices A j that are not the parents of X i in given the parents of ( ) X a i in ( ) a , both random and fixed (see [7,10,12]).Specifically, we have: where here we used ⊥ ⊥ d to indicate d-separation¹¹ in the SWIG ( ) a and use lowercase letters, e.g.,

( ) ⧹
a A i pa , to refer to fixed nodes.We may further decompose the set of fixed nodes The set of fixed nodes in are screened off by the random and fixed nodes that are parents of ( ) In [3, §8, Def.44], a weaker Markov property was stated that did not require that (9) is not a function of . This weaker condition does not imply the Markov property for the observed distribution (unless = A V ).Consequently, Propositions 45 and 46 and Theorem 65(c) in [3] are incorrect.Correct reformulations are given in Theorems 10, 11, and 12. 10 This is as to be expected since d-separation encodes the global property that is implied by the local property.11 Note that in other articles [3,7,10] d-connection for SWIGs is defined such that fixed nodes may never occur as non-endpoint vertices on d-connecting paths.In those articles, we never formally condition on fixed nodes.Here, in (10) for the purpose of formulating the local property, we formally include the fixed parents of ( ) X a i in the set that is (graphically) conditioned on.This is solely in order to make the development similar to the decision diagram approach we consider subsequently.
conditioning on the parents of $X_i(a)$ in $\mathcal{G}(a)$, both random and fixed: The random vertices X a may be further decomposed: The d-separation of $X_i(a)$ from nodes representing the natural value of variables that are in A and parents of $X_i$ in $\mathcal{G}$ corresponds to ignorability. On the other hand, the d-separation of $X_i(a)$ from variables that are predecessors, but not parents, of $X_i$ in $\mathcal{G}$ can be regarded as an associational Markov property.

Example
The d-separations given by ( 11) and (13) can obviously be stated as a single graphical condition for each random vertex V a i ( ) in a ( ).In Tables 1 and 2, we give the SWIG local Markov property corresponding to the SWIG depends corresponds exactly to the number of parents (random and fixed) of the corresponding random variable in x ( ) in Figure 2(b): zero for H x ( ) and X x 0 ( ) and two for Z x ( ), X x 1 ( ), and Y x ( ).This is also the number of terms listed to the right of the conditioning bar in Table 2. Here, as elsewhere in this article, we use the uniform labeling because we wish to emphasize that our results do not require any equalities between random variables.

Consequences of the local Markov property
Under distributional consistency, it follows from the SWIG local Markov property that whether or not interventions in the future occur has no effect on the distribution of prior variables.

( ) via factorization terms
on which this term does not depend are colored red.
Note that the arguments in V on which the term depends, correspond to the parents of V x i ( ) in x ( ); these are written in black.For example, for the term corre- sponding to V Y i = , the arguments are x 1 and Z x ( ), and these are the parents of Y x ( ) in x ( ).
 12 [8, Figure 15] gives the corresponding SWIG under ancestral labeling.In Dawid's discussion of this example [8, Figure 15], two conditions are stated as supporting g-computation.The first of these is correct, but the second should be Y x x X , 0 1 0 Lemma 8.If ⊆ A obeys distributional consistency and A obeys the SWIG ordered local Markov property for DAG under ≺, then for all ∈ k V and , , .
Proof.First observe that since Here, x 1 and x 2 refer to the fixed nodes, and d ⊥ ⊥ indicates d-separation in the SWIG (see also footnote 11 regarding the formal inclusion of fixed nodes on the RHS of the conditioning bar).
We now prove the claim by reverse induction on the ordering of the vertices in V .For the base case, suppose k is the maximal vertex in V .If ∉ k A, then (14) holds trivially since does not depend on a k , and thus, by Lemma 5, ( ( ) , , . Our inductive hypothesis is that (14) holds for = + k j 1, so that , , .
. If ∈ j A, then note that we have already established earlier that the left-hand side (LHS) of ( 15) is not a function of a j .Consequently, the right-hand side (RHS) is also not a function of a j .It then follows from Lemma 5 that , , .
This completes the proof.□ The next lemma gives a simple characterization of the consequences of the SWIG local Markov property in conjunction with distributional consistency.

Lemma 9. If ⊆
A obeys distributional consistency and A obeys the SWIG ordered local Markov property for DAG under ≺, then: Since the SWIG local Markov property (9) states that (16) is not a function of , the equality of ( 16) and (18) may appear to follow immediately.However, as noted in the discussion prior to Lemma 6, the fact that a counterfactual conditional distribution ( ( )| ( )) p Y a W a j j does not depend on the specific value, a j , of an intervention on A j does not imply that Proof.Here, (17) follows since by Lemma 8 does not depend on ( ) . □

Markov property for the observed distribution
We now show that distributional consistency together with the SWIG local Markov property implies the usual local Markov property [13] for the observed distribution.
Proof. Let Here, the first equality follows from distributional consistency via Lemma 4. The second follows directly from the equality of (17) and (20) in Lemma 9. Since the last line is not a function of
Dawid takes the reverse approach to ours: he proposes additional extended Markovian conditions that, when added to the usual Markov property for the observable law, will imply the Markov property for his extended graph. However, as we describe in detail below, our approach appears to be simpler in that, given distributional consistency, it requires only one property per variable, giving $|V|$ constraints in total; in contrast, Dawid requires one property for every observed variable in V, together with two additional properties for each intervention target in A, for a total of $|V| + 2|A|$. In addition, our approach captures context-specific independences, corresponding to "dashed" edges in Dawid's diagrams; furthermore, these are not captured directly in Dawid's A+B formulation. We show that by restating the SWIG local property in Dawid's notation, we are able to provide a characterization of the (extended) Markov properties for the augmented graph and the original graph that also requires only one constraint per variable, plus distributional consistency.
It is the case that Dawid incorporates distributional consistency into his defining independences, whereas we state it as a separate property that precedes the definition of the model.However, as we have shown earlier, distributional consistency may be seen as a tautologous property, the truth of which is implicit in the notion of an ideal intervention: distributional consistency states that if B would naturally take the value b, then an ideal intervention that would set B to b has no effect on the distribution of (all) the other variables.For this reason, we believe it is natural to distinguish consistency from the other properties being used to define the model.
However, in the spirit of Dawid's approach, in Appendix A.1, we show that if $\mathbb{P}_{\subseteq A}$ obeys distributional consistency, then $\mathbb{P}_A$ will obey the SWIG local Markov property corresponding to $\mathcal{G}$ if: (i) $p(V)$ is positive and obeys the (ordinary) local Markov property for the graph $\mathcal{G}$; and (ii) $\mathbb{P}_A$ obeys the SWIG local Markov property corresponding to $\bar{\mathcal{G}}$, a complete supergraph of $\mathcal{G}$. This formulation requires $2|V|$ restrictions.

Identification of the potential outcome distribution p V a ( ( )) from p V ( )
We show that it follows from the SWIG local Markov property that $p(V(a))$ is identified given the distribution over the observables provided that the relevant conditional distributions are identified from the distribution of the observables.

Theorem 11. Suppose that $\mathbb{P}_{\subseteq A}$ obeys distributional consistency and $\mathbb{P}_A$ obeys the SWIG ordered local Markov property for $\mathcal{G}$ and ≺. Let $a \in \mathfrak{X}_A$ be an assignment to the intervention targets in A, and let Then, for all i: Consequently, $p(V(a))$ is identified from $p(V)$ and obeys d-separation in the SWIG $\mathcal{G}(a)$, whenever the conditional distributions on the RHS of (22) are identified by $p(V)$.
The equality (22) here corresponds to the property referred to as "modularity" in [3]; this is also an instance of the extended g-formula of [2,14].
Here, the first equality follows from the equality of (16) and (19); the second follows from the equality of (19) and (20); the third follows from distributional consistency via (7). □
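The idea that an interventional distribution is computed from conditional distributions of the observables can be illustrated in the simplest confounded setting. The sketch below is not a reproduction of equation (22); it is the familiar g-formula for a hypothetical DAG with edges L → A, L → Y, and A → Y and a single target A, with made-up conditional probability tables and function names. The function `truth` assumes that the tables are the causal mechanisms, which is an assumption of the illustration.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical conditional probability tables for binary L, A, Y.
P_L1 = F(1, 3)                                        # p(L = 1)
P_A1_GIVEN_L = {0: F(1, 4), 1: F(3, 4)}               # p(A = 1 | L = l)
P_Y1_GIVEN_AL = {(0, 0): F(1, 10), (0, 1): F(1, 2),
                 (1, 0): F(2, 5), (1, 1): F(9, 10)}   # p(Y = 1 | A = a, L = l)


def bern(p_one, value):
    """Probability that a binary variable with p(1) = p_one takes the given value."""
    return p_one if value == 1 else 1 - p_one


def observed_joint():
    """Observed law p(l, a, y) implied by the tables above."""
    return {(l, a, y): bern(P_L1, l) * bern(P_A1_GIVEN_L[l], a) * bern(P_Y1_GIVEN_AL[(a, l)], y)
            for l, a, y in product((0, 1), repeat=3)}


def g_formula(joint, a, y):
    """sum_l p(y | a, l) p(l), computed from the observed joint alone."""
    total = F(0)
    for l in (0, 1):
        p_l = sum(p for (l_, _, _), p in joint.items() if l_ == l)
        p_la = sum(p for (l_, a_, _), p in joint.items() if (l_, a_) == (l, a))
        total += joint[(l, a, y)] / p_la * p_l   # requires positivity: p_la > 0
    return total


def truth(a, y):
    """p(Y(a) = y) obtained by replacing the mechanism for A with the constant a."""
    return sum(bern(P_L1, l) * bern(P_Y1_GIVEN_AL[(a, l)], y) for l in (0, 1))


def naive(joint, a, y):
    """The conditional p(y | a), which ignores confounding by L."""
    p_a = sum(p for (_, a_, _), p in joint.items() if a_ == a)
    p_ay = sum(p for (_, a_, y_), p in joint.items() if (a_, y_) == (a, y))
    return p_ay / p_a


joint = observed_joint()
print(g_formula(joint, 1, 1), truth(1, 1), naive(joint, 1, 1))
```

In this toy model the g-formula recovers the interventional probability exactly, whereas the unadjusted conditional probability does not.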

Distributions resulting from fewer interventions
Finally, we show that if $p(V(a))$ obeys the SWIG local Markov property for $\mathcal{G}$ and distributional consistency, then if we intervene on $B \subset A$, the resulting distribution $p(V(b))$ will obey the SWIG local Markov property for $\mathcal{G}$ with respect to this reduced set of intervention targets. The two previous theorems can be seen as the special case in which $B = \emptyset$.
Theorem 12. Suppose that $\mathbb{P}_{\subseteq A}$ obeys distributional consistency and $\mathbb{P}_A$ obeys the SWIG ordered local Markov property for $\mathcal{G}$ and ≺. Let b be an assignment to the intervention targets in $B \subseteq A$, and let Then for all i: Consequently, every $p(V(b)) \in \mathbb{P}_B$ obeys the Markov property for the SWIG $\mathcal{G}(b)$ and is identified whenever the conditional distributions on the RHS of (24) are identified by $p(V)$.
Proof. Here, the first equality is by Lemma 5; the second is distributional consistency via Lemma 4; the third follows from Theorem 11 applied to $\mathcal{G}(a)$; the fourth is a simplification. □

Critique of Dawid's proposal
We have four main issues, which we describe in detail below:
(1) The inclusion of ITT variables within Dawid's theory appears necessary in order to distinguish causal relationships from happenstance agreement between observational and ("fat hand") intervention distributions. However, including all three of T (the "actual" treatment), $T^*$ (the ITT variable), and $F_T$ (the regime indicator) introduces deterministically related variables and thereby obscures the content of Dawid's defining conditional independences A and B.
(2) Related to the previous point, d-separation is no longer a complete criterion for determining conditional independence on a graph in which there are definitional deterministic relationships between the variables.¹³
(3) Dawid's ITT augmented diagrams incorporate context-specific independence (via dashed edges), but his results do not establish that the resulting distribution obeys all of the implied context-specific independences; these are not implied by his defining conditional independences A+B; these independences will not hold without additional information concerning the relation of T to $T^*$ and $F_T$ that is not captured in A+B.
(4) Dawid makes use of what he terms "fictitious" independence relations, but he argues that these are assumptions that can be made without loss of generality. This is not the case in general, though, as we show, in the context of his arguments, the resulting logical "gap" can be filled.
¹³ This is also an issue for the twin network approach developed in [15].
We show that all of these issues may be avoided by re-formulating his theory in two simple ways:
(I) Marginalizing out the post-intervention treatment variable T while keeping the ITT variable $T^*$.¹⁴
(II) Formulating the defining extended independence relations in terms of distributional consistency and the augmented ITT diagram (after marginalizing T) and intervening on all the variables in A; the local Markov property for the original variables is then implied.
The resulting theory is formally isomorphic to the SWIG theory described earlier; the augmented ITT graph can be viewed as containing the union of the nodes and edges in the original DAG and the SWIG $\mathcal{G}(a)$, with the fixed nodes in the SWIG corresponding to the (non-idle) regime indicators in the augmented DAG.
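The deterministic relationship among the applied treatment, the ITT variable, and the regime indicator raised in issues (1) to (3) can be made concrete with a small sketch. It rests on an assumption that is not spelled out in the text here: the applied treatment equals the ITT value in the idle regime and equals the imposed value otherwise. The probabilities and the function names are hypothetical.

```python
from fractions import Fraction as F

IDLE = None  # stands for the idle (observational) regime


def applied_treatment(t_star, f_t):
    """Treatment actually applied: the ITT value unless an intervention overrides it (assumed rule)."""
    return t_star if f_t is IDLE else f_t


def joint_t_star_t(p_t_star, f_t):
    """Joint law of (ITT value, applied treatment) within regime f_t."""
    out = {}
    for t_star, p in p_t_star.items():
        key = (t_star, applied_treatment(t_star, f_t))
        out[key] = out.get(key, F(0)) + p
    return out


def independent(joint):
    """Check whether the two coordinates of a bivariate law are independent."""
    left, right = {}, {}
    for (a, b), p in joint.items():
        left[a] = left.get(a, F(0)) + p
        right[b] = right.get(b, F(0)) + p
    return all(joint.get((a, b), F(0)) == left[a] * right[b] for a in left for b in right)


p_t_star = {0: F(2, 5), 1: F(3, 5)}   # hypothetical law of the ITT variable

print(independent(joint_t_star_t(p_t_star, IDLE)))  # idle regime: applied treatment copies the ITT value
print(independent(joint_t_star_t(p_t_star, 1)))     # interventional regime: applied treatment is constant
```

Under this rule the applied treatment is independent of the ITT variable in every interventional regime but not in the idle regime, which is a context-specific independence that depends on the functional relationship rather than on graph structure alone.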

The simplest setting
Consider the setting in which there is a single exposure T and an outcome Y; suppose that T takes a finite set of states $\mathcal{T}$. Dawid's augmented causal graph with the intention-to-treat variable $T^*$ is shown in Figure 3(d). Here, $T^*$ represents the natural value of treatment which an individual is "selected to receive" [8, p. 52] in the absence of an intervention that would override this. This is distinct from T, the "treatment actually applied" [8, p. 54, Def. 1]; $F_T$ is a regime indicator taking values in $\mathcal{T} \cup \{\emptyset\}$. Under Dawid's proposal the graph in Figure 3 p T Y t , ; as suggested by the graphical structures, there is a close correspondence between these approaches when ITT variables are included in the decision theory graph. In what follows, we will show that in fact, the two theories can be shown to be isomorphic up to labeling of variables (Table 3).
Although Dawid includes ITT variables in the development here, they were absent in [16] and ultimately his goal is to remove the ITT variables, leaving the DAG shown in Figure 3(c). Given this, one may ask why it is necessary to introduce the ITT variables into the theory in the first place. One issue that arises is that without the ITT variables, the decision theoretic approach lacks the language to describe concepts such as the effect of treatment on the treated. In addition, the approach lacks the concepts necessary to distinguish different scenarios where there is equality between distributions in the observed and interventional worlds: those scenarios where the equality reflects agreement between an observational study and a randomized experiment due to the absence of confounding, versus those where the equality is purely "contingent" or spurious.

…, an intervention setting $T$ to $t$, so $F_T = t \neq \emptyset$; and (g) the latent projection of the graph in (d) after marginalizing $T$. Note that in (a), (b) we use $T^*$ (rather than $T$) for the natural value of treatment in order to highlight the correspondence to the ITT variables in Dawid's proposal. The graph in (g) corresponds to (a) and (b), under the correspondence …

Table 3: Correspondence between the potential outcome/SWIG approach and the decision theoretic approach
Potential outcome Decision theoretic
Graph for observed data … ITT DAG, …
Here, in the potential outcome approach we use $T^*$ (rather than $T$) to denote the natural value of treatment so as to make the correspondence more self-evident.

To illustrate this, consider the following story. Suppose that a manufacturer of dietary supplements carries out an observational study. They find that those who regularly consume the supplement ($T = 1$) have lower levels of "bad" cholesterol ($Y$) than those who do not ($T = 0$). Buoyed by these results, the manufacturer hires a company to perform a randomized trial. The results of the previous study are given to the company; it is made clear that the manufacturer would like these results confirmed and that repeat business depends on the firm achieving this. In order to comply, the testing company carries out a non-blinded study and also modifies the software in the cholesterol-measuring system to ensure that the results agree with those in the observational study (see Figure 4). To be clear, the critique here is not that someone who was unaware of the presence of confounding and the devious activities of the company running the trial would infer the wrong causal effect. Rather, it is that without the ITT variables, the decision-theoretic approach lacks the conceptual apparatus necessary to distinguish the situations in Figure 4

$Y \perp\!\!\!\perp T^* \mid F_T = t$, which will fail to hold if there is unobserved confounding between $T^*$ and $Y$. Note that this latter condition is essentially equivalent to the ignorability condition $Y(t) \perp\!\!\!\perp T^*$ in the potential outcome framework; we return to this point below.
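The supplement story can be made concrete with a small exact calculation. The sketch below is illustrative only: the binary confounder `H`, every probability, and the additive `SHIFT` used to model the rigged measurement are assumptions chosen for this example, not values from the source. It constructs a case in which the observational and reported trial distributions of $Y$ given $T$ coincide, while $Y$ still depends on $T^*$ in the interventional regime.

```python
from fractions import Fraction as Fr

# Hypothetical numbers (assumptions for illustration only).
P_H = {0: Fr(1, 2), 1: Fr(1, 2)}               # P(H = h), unobserved confounder
P_TSTAR1_GIVEN_H = {0: Fr(1, 5), 1: Fr(4, 5)}  # P(T* = 1 | H = h)
P_Y1_GIVEN_H = {0: Fr(7, 10), 1: Fr(3, 10)}    # P(Y = 1 | H = h); T has no effect
# Rigged measurement: under F_T = t the reported P(Y = 1 | H) is shifted by SHIFT[t].
SHIFT = {0: Fr(3, 25), 1: Fr(-3, 25)}


def p_tstar(t_star, h):
    """P(T* = t_star | H = h); the same in every regime."""
    p1 = P_TSTAR1_GIVEN_H[h]
    return p1 if t_star == 1 else 1 - p1


def p_y1(h, regime):
    """P(Y = 1 | H = h, F_T = regime); regime None is the idle regime."""
    return P_Y1_GIVEN_H[h] + (0 if regime is None else SHIFT[regime])


def p_y1_given_tstar(t_star, regime):
    """P(Y = 1 | T* = t_star, F_T = regime), marginalising over H."""
    weights = {h: P_H[h] * p_tstar(t_star, h) for h in P_H}
    total = sum(weights.values())
    return sum(weights[h] * p_y1(h, regime) for h in P_H) / total


def p_y1_trial(t):
    """P(Y = 1 | T = t, F_T = t): everyone receives t, so average over H."""
    return sum(P_H[h] * p_y1(h, t) for h in P_H)


for t in (0, 1):
    observational = p_y1_given_tstar(t, None)  # T = T* in the idle regime
    print(t, observational, p_y1_trial(t),
          p_y1_given_tstar(0, t), p_y1_given_tstar(1, t))
```

With these assumed numbers, the observational and trial values of $P(Y = 1 \mid T = t)$ agree for both $t$, so a check based only on $Y$, $T$ and $F_T$ cannot detect the problem; the last two printed columns differ, so the condition involving $T^*$ fails.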

Dawid's defining ECI relations
Under Dawid's formalism, the augmented graph with ITT variables, shown in Figure 3(d), defines a causal model via the following ECI relations:

$$T^* \perp\!\!\!\perp F_T, \tag{25}$$
$$Y \perp\!\!\!\perp (T^*, F_T) \mid T \tag{26}$$

(see [8, Eq. (62), (63)]).

15 … T t, it need not hold that their observed outcome $Y$ is the same as the outcome they would have had, had they been in an experiment and assigned to $t$, namely $Y(t)$.
16 Here, we are assuming that there is no information available regarding the nature or identity of the possible confounding variables $H$.

Dawid's independence A
The first independence (25) states that whether or not there is an intervention on $T$ has no effect on the (distribution of the) ITT value $T^*$. Indeed, Dawid states: "Now, $T^*$ is determined prior to any (actual or hypothetical) treatment application, and behaves as a covariate [...] this distribution is then the same in all regimes" [8, Section 8, p. 54].
Similarly, in the potential outcome framework, it is assumed that intervention on a treatment variable does not affect variables whose values are realized prior to that intervention, including the natural value of that treatment variable, $T^*$, so that $T^*(t) = T^*$. However, Dawid's reference to $T^*$ being a covariate that is determined prior to an actual or hypothetical treatment application is perhaps surprising: if the value taken by $T^*$ is determined prior to the decision regarding the regime $F_T$, then this would appear to imply that, in fact, the random variables in the distributions … must live on a common probability space. But in this case, it is hard to see why the random variables in the distributions … should not also live on a common probability space! The primary obstacle to so doing appears to be the use of $Y$ and $T$ to indicate what are distinct random variables (corresponding to different regimes) that are defined on the same space. This problem can obviously be overcome by simply using … to refer to the random variables under the idle and intervention regimes, respectively; following Definition 1 in [8], this would imply that $T = T^*$ (under the idle regime) and $T(t) = t$ (under an intervention). An analyst who adopted this notation is not obligated to impose any additional equalities relating these random variables, such as those implied by consistency, should they not wish to do so. As we did earlier in Section 3, one might choose instead to follow Dawid by merely imposing distributional consistency (see also further discussion below). However, from the perspective of the potential outcome framework, this leads to an unnecessary multiplicity of random variables and more cumbersome notation. For example, in the simple case of a binary treatment, this approach requires three random variables $\{Y, Y(0), Y(1)\}$ corresponding to the response, rather than just two $\{Y(0), Y(1)\}$ with consistency at the level of random variables.¹⁷ It is unclear what is gained by assuming consistency at the level of distributions rather than individuals.

Dawid's independence B
The fact that $T$ is a deterministic function of $T^*$ and $F_T$ means that the number of non-trivial conditional independence statements in (26) is not self-evident. A casual reader might imagine that in (26) the pair … different values for each value of the conditioning variable $T$. However, given $T = t$, there are only … (28)
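The counting issue can be checked by brute-force enumeration. The sketch below assumes only the deterministic relationship described above ($T = T^*$ under the idle regime and $T = t$ under an intervention setting $F_T = t$); the three-state treatment and the helper names are hypothetical, and the counts are derived by the code rather than quoted from the source.

```python
from itertools import product

IDLE = None  # stands for the idle regime, written as the empty set in the text


def applied_treatment(t_star, f_t):
    """T as a deterministic function of the ITT value T* and the regime F_T."""
    return t_star if f_t is IDLE else f_t


def compatible_pairs(states, t):
    """All pairs (T*, F_T) that are consistent with the event T = t."""
    regimes = [IDLE, *states]
    return [(t_star, f_t)
            for t_star, f_t in product(states, regimes)
            if applied_treatment(t_star, f_t) == t]


states = [0, 1, 2]                              # hypothetical treatment states
naive_count = len(states) * (len(states) + 1)   # every (T*, F_T) combination
pairs = compatible_pairs(states, 1)
print(naive_count, len(pairs), pairs)
```

For each fixed value of $T$, only one of the compatible pairs has an idle regime indicator; all the others share the same intervention and differ only in $T^*$.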

Distributional consistency in B
Equation (27) corresponds to distributional consistency, which [8, eq. (14)] defines as: … In other words, with $Y$ given by a deterministic function, so … Dawid notes that this implies: … (see [8, Lemma 1]). However, this formulation also somewhat obscures the actual number of constraints: if $T$ … so that the statement becomes trivial, while if $T^* = T = t$, then $F_T$ only takes two possible values, $\emptyset$ and $t$. Given this, it becomes clear that (30) may be reformulated by defining a dynamic regime $g^*$ that "intervenes" to set $T$ to be $T^*$. By defining a special regime indicator, denoted $F_T^*$, that takes only two values, $\emptyset$ or $g^*$, we can re-express (30) as: … Note that in so doing, we do not need to refer to $T$¹⁸ (see Figure 5(d) for a graphical depiction).
In terms of potential outcomes, the independence (31) may be expressed as: … which corresponds to distributional consistency (see (4)).

Ignorability in B
Equation (28) expresses the property of ignorability, which Dawid [8, eq. (20)] expresses as: … However, as Dawid himself notes, given $T = t$, either $F_T = \emptyset$, in which case $T^* = t$ (and the independence holds trivially), or $F_T = t$, so that this constraint is identical to:

$$Y \perp\!\!\!\perp T^* \mid F_T = t.$$

Equivalently, in terms of potential outcomes,

$$Y(t) \perp\!\!\!\perp T^*$$

(see Figure 3(b)). Again, we note that $T$ is not required for the purpose of expressing this condition.

18 Along similar lines, in his equation (14) [8] notes that the LHS of (29) is equivalent to …

Simplification
From a graphical perspective, it is perhaps natural to wish to express the invariance of the distribution of $Y$ given $T$ across observational and interventional distributions by examining whether a regime indicator $F_T$ is d-separated from $Y$ given $T$. However, as we have seen, it is necessary to include what Dawid calls the ITT variable (aka the natural value of treatment) $T^*$ in order to rule out cases of spurious invariance. Furthermore, $T^*$ plays a central role in certain notions, such as the effect of treatment on the treated, that are widely used in many studies that apply the potential outcome framework.
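For a binary treatment, one standard way of writing the effect of treatment on the treated makes the role of $T^*$ explicit. The display below is a sketch in the notation of this section, assuming distributional consistency; the symbol $\mathrm{ETT}$ is introduced here for illustration and does not appear in the source.

$$\mathrm{ETT} = E[Y(1) - Y(0) \mid T^* = 1] = E[Y \mid T^* = 1] - E[Y(0) \mid T^* = 1].$$

The first term on the right is a functional of the observed distribution, while the second concerns the outcome under an intervention setting treatment to $0$ among those whose natural value of treatment is $1$. Under the correspondence discussed in this article, the second term would be written $E[Y \mid T^* = 1, F_T = 0]$ in the decision-theoretic notation, which cannot be expressed once $T^*$ has been marginalized.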
As shown previously, there is no need to condition on $T$ when describing the defining independences, and in fact doing so arguably obscures the nature of the specific assumption being made. This suggests that $T$ should be marginalized from the ITT augmented graph, rather than $T^*$ as Dawid proposes. Note that, if we distinguish the cases $F_T = \emptyset$ and $F_T = t$, the resulting graphs (modulo labeling) are isomorphic to those used in the SWIG framework (compare Figure 3(a) to (e), and (b) to (f)).
We carry out this reformulation in full generality in the next section.

Reformulation of decision graphs
Our proposed reformulation of decision graphs follows a strategy similar to that used for SWIGs. In contrast, Dawid aims to give ECI relations that, together with the usual independence relations over the observed variables, will yield the Markov property for the augmented decision graph with ITT variables. As an alternative, we begin by defining a Markov property associated with the augmented decision graph, and then, using distributional consistency, we derive the usual observed conditional independences.¹⁹ It should be noted that Dawid's independences do not actually imply the full Markov property for the ITT graph because, as indicated by the presence of a dashed edge, there are context-specific independences implied by the graph.²⁰ However, these are not implied by the independence relations A and B. (To see this, note that the conditions A and B would also hold for a decision DAG with the same structure, but in which $T$ was not a deterministic function of $T^*$ and $F_T$, in which case the context-specific independence relations would not hold.)

…; (b) graph illustrating that, if desired, the "applied treatment" variable $T$ may be added to $\mathcal{G}^*$ since it is a deterministic function of $T^*$ and $F_T$. Note that although it may seem counterintuitive that $T$ is not a parent of $Y$ in this graph, this is formally correct.


Note also that these extra ECI relations are not restricted solely to those involving $T$. Consider, for example, the front-door graph shown in Figure 7(a). Since Dawid's augmented decision diagram, shown in Figure 7(c), includes a dashed edge from $T^*$ to $T$, indicating that this edge should be removed conditional on $F_T = t$, the diagram implies that $Y$ will be d-separated from $F_T$ given $M$ and $F_T \neq \emptyset$. However, even though it is encoded in the augmented graph, the corresponding ECI:

$$Y \perp\!\!\!\perp F_T \mid M, F_T \neq \emptyset \tag{36}$$

does not follow from the independences A+B. In the potential outcome framework, the constraint (36) corresponds to: … This constraint is naturally encoded by the d-separation of $Y(t)$ from the fixed variable $t$ given $M(t)$ on the SWIG $\mathcal{G}(t)$ shown in Figure 7(b) (see [7,10,12]). As these examples suggest, in order to capture the full Markov structure of the augmented decision diagram, including those constraints corresponding to dashed edges, it is natural to use the constraints implied by the decision diagram when no regime indicators are idle, which we express in shorthand as $F_A \neq \emptyset$; graphically, this corresponds to removing (temporarily) all of the dashed edges. We show below that the independences encoded then imply, via distributional consistency, the Markov property for the observed data that is encoded in the original graph.
Another advantage of this approach is that we will only require the ITT variables $T^*$; the "applied treatment," which Dawid [8] denotes "$T$," will not be required.²¹ Specifically, consider a set of variables $V_1, \ldots, V_p$, and let $A$ be the (index) set of the targets of intervention. If $i \in A$, then let $V_i$ be the corresponding ITT variable (which Dawid denotes by $X_i^*$). Thus, the set $V_1, \ldots, V_p$ consists of ITT variables as well as variables that are not in $A$ and hence not targets of intervention.²² Thus, under the regime where every intervention target has been intervened upon, so that $F_i \neq \emptyset$ for all $i \in A$, the variables in $V_1, \ldots, V_p$ correspond to the random variables in the SWIG $\mathcal{G}(a)$. For every intervention target $i \in A$, let $g_i^*$ denote the dynamic regime that "intervenes" to set the intervention target to its natural value $V_i$. Let $F_i^*$ be a regime indicator taking the states $\emptyset$ or $g_i^*$.
where we use the shorthand $F_C \neq \emptyset$ to indicate that for all $j \in C$, $F_j \neq \emptyset$.
Note that, taking $Y = V \setminus \{B_i\}$, (38) is equivalent to the following equality, which corresponds exactly to (3): … Here, the second equality follows from (38), while the first and third equalities are via the definition of $F^*$ and $\emptyset$.
As observed by Dawid, in place of Definition 13, we could instead have defined distributional consistency, without reference to the dynamic regime $g_i^*$, by simply equating (39) and (40). We have chosen to make use of $g_i^*$ in order to emphasize what we see as the tautological nature of distributional consistency, while also formulating condition (38) as a conditional independence.

21 Since $T$ is a deterministic function of $T^*$ and $F_T$, it is possible to add back in these variables if we wish (see Figure 6).
22 In this formulation, we will not use the post-intervention target variables, which Dawid denotes $X_i$.
The following four lemmas are reformulations of Lemmas 3-6 in the decision diagram framework. Though the proofs are largely translations of those lemmas, we include them here for completeness.

Lemma 14. If $p(V \mid F_A)$ obeys distributional consistency, $B$ and $C$ are disjoint subsets of $A$, where $C$ may be empty, then for all $y$, $b$, and $c$: … where …
Proof. This follows by induction on the size of $B$. □

Lemma 15. Let $B$ and $C$ be disjoint subsets of $A$, where $C$ may be empty, and let $Y$ and $W$ be disjoint subsets of $V \setminus B$, then distributional consistency implies: …
Proof. This follows by applying (extended) graphoid axioms to (41). □

Lemma 16. Let $B$ and $C$ be disjoint subsets of $A$, where $C$ may be empty. If $B \subseteq W$, then under distributional consistency: …
Proof. … Now: … The second equality uses the premise of (43), the third is by definition of $g_B^*$, and the fourth is distributional consistency via Lemma 14. □

Lemma 17. Let $B$ and $C$ be disjoint subsets of $A$, where $C$ may be empty. Let $Y$ and $W$ be disjoint sets with $B \subseteq W$, then under distributional consistency: …
Proof. Similar to the proof of Lemma 16, given the premise in (44), it suffices to show that … As in the previous proof, the second equality uses the premise of (44), the third is by definition of $g_B^*$, and the fourth is distributional consistency via Lemma 14. □

Reformulated augmented decision diagrams
Let $\mathcal{G}$ be a DAG with a topologically ordered vertex set … where …

In this section of this article, when we wish to distinguish random variables from index sets, we use $W_i$, rather than $X_i$. This is because in Dawid's development, $X$ is reserved to denote intervention targets. However, we depart from Dawid's notation in that we will not use an asterisk to indicate ITT variables; this is because we will only include the ITT variables associated with intervention targets on the reformulated diagram.
24 Note that the vertex set for $\mathcal{G}^*$ corresponds to the sets of variables that in Dawid's notation would be written …; in other words, it consists of domain variables and ITT variables associated with intervention targets. We will have no need to include what Dawid calls "the intervention targets," which he denotes $X_i$, in our reformulated decision diagram, though they may be added (see Figure 6).

Note that this property follows from d-separation applied to the graph in which we intervene on every vertex in $A$. We will show that under distributional consistency, this property implies factorization of the observed distribution with respect to the original graph.
However, it is useful first to further decompose the sets on the RHS of the independence. Specifically, we divide the regime indicators that are not the parents of $i$ into those that occur after $i$ and those that are prior to $i$: … Similarly, we divide the set of random variables that are prior to $i$ and either in $A$ or not parents of $i$ into those that are not parents and those that are the parents that are in $A$: … Thus, independence (45) becomes: … Consequently, independence (46) captures the following:
• Later interventions have no effect on earlier distributions (time order).
• Given intervention on all earlier targets, the specific value of an intervention does not affect the distribution of a variable given its non-intervened parents unless the intervened-on variable is itself a parent (causal Markov property).
• Independence from earlier random variables given non-intervened parents (associational Markov property).
• An intervention on a parent of a variable renders that variable independent of the natural value of the intervention target conditional on its other non-intervened parents (ignorability).

Example
In Table 4, we show the reformulated decision diagram Markov property corresponding to the augmented DAG $\mathcal{G}^*$, as shown in Figure 2(c). Note that the local property here corresponds naturally to the graph $\mathcal{G}^*$ under the regime … In particular, note that for each random vertex, the size of the conditioning set in the defining independence (ignoring the term $F_{01} \neq \emptyset$) is equal to the number of parents that the vertex has in Figure 2(d).

Consequences of the local Markov property
Proof. Here, (48) follows from Lemma 16 since by Definition 18, … Similarly, (49) follows from Lemma 17 since by the local Markov property: … Finally, (50) and (51) again follow from the local Markov property since … □

Markov property for the observed distribution
The following result shows that the reformulated local Markov property implies, via distributional consistency, the ordinary local Markov property for the observed distribution.This result corresponds to Theorem 10.

Identifiability
The next result shows that the reformulated local Markov property implies that the kernel $p(V \mid F_A)$ will be identified from the distribution of the observables provided that the relevant conditional distributions are identified (from the distribution of the observables). This result corresponds to Theorem 11. Let $a$ be an assignment to the intervention targets in $A$, and let $v$ be an assignment to $W_V$. Then, for every $i$: …

Here, the first equality is by Lemma 16; the second is distributional consistency; the third follows from Theorem 21 applied to $\mathcal{G}^*$; the fourth is a simplification. □

The role of "fictitious" independence in Dawid's development

Dawid in [8] uses what he terms a "fictitious" independence in his proofs that the distribution of the kernels that condition on the regime indicators $F_i$ obeys the Markov property for the augmented DAG with ITT variables. Specifically, in his proof of Lemma 4, though not the statement, he makes the formal assumption that … and similarly in the proof of Theorem 1, he assumes that all the regime indicators are mutually independent [8, p. 76, eqn. (82)].

Validity of the conclusion for regime indicators
That the implications used in Dawid's proofs of Lemma 4 and Theorem 1 do not hold, without conditions on the joint space for the regime indicators, may at first seem to call into question Dawid's conclusions. However, at least in causal theories making use of DAG representations and involving multiple treatments, the decisions as to whether to intervene, and if so, which value to enforce, are unconstrained. Consequently, variation independence will hold, and hence, the conclusion will be valid. However, there are situations in which interventions may be constrained. For example, suppose that there are two strategies for a medical condition; each treatment involves two separate stages ($A_1$ and $A_2$). At time $t = 1$, the doctor must decide between strategies "1" and "2." It is easy to imagine situations in which, if treatment was commenced at time 1, the treatment at time 2 involves "completing" the treatment that was started at time 1, for example, removing surgical stitches from the specific operation performed at time 1. In this case, the treatment options available at time 2 are constrained by the decision at time 1.
Reflecting this, there have been causal decision theories proposed in which variables do not live in a product space (see [22]).Likewise, in the potential outcome framework, the formulation of causally interpreted structured tree graphs given by [2] also allows for this possibility.
However, even in this case, Dawid's implication will still hold, provided that the following condition obtains. … In words, this states that for any possible setting $f$ of the regime indicators in which they are not all "idle," there exists some intervention target $A_i$ that is intervened upon under $f$, which could have not been intervened upon, such that the resulting vector $(\emptyset, f_{-i})$ is still a valid value for $F_A$. This condition may still hold in settings in which, if a later target is intervened upon, the regime under which an earlier target is set to "idle" is not well defined. For example, an intervention on $A_2$ setting $F_2 = 1$ may only be well defined if $F_1 = 1$, but not $F_1 = \emptyset$. In the aforementioned treatment completion example this would be the case if, in the absence of an intervention on $A_1$, some patients would receive treatment 2 at time 1, so that the subsequent intervention … .²⁶ The condition (64) will always hold provided that treatment decisions follow a time order, and that, regardless of the decisions that have occurred previously, it is always possible to decide to replace the "last" intervention with the idle regime.

25 For recent related work giving general conditions under which this implication holds for ordinary (not extended) conditional independence, see [19], [20, Ch. 4], [21].
It is easy to see that under the condition (64), for any $f \in \mathcal{F}_A$, there will exist a sequence … such that for $j = 1, \ldots, q$, $f_j \in \mathcal{F}_A$, and $f_j$ contains one more idle regime indicator than $f_{j-1}$. It then follows under this condition that: … where here the conditional independence statements implicitly quantify over all the assignments to $F_A$ that are in $\mathcal{F}_A$, and hence valid.
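The existence of such a sequence can be illustrated constructively. The sketch below follows the verbal description of the condition given above (some intervened indicator can always be reset to idle without leaving the valid state space); the encoding of regimes as tuples, the use of `None` for the idle regime, the helper names, and the example state space loosely modelled on the treatment-completion story are all assumptions made for illustration.

```python
IDLE = None  # the idle regime


def intervened(f):
    """Indices of the regime indicators in f that are not idle."""
    return [i for i, value in enumerate(f) if value is not IDLE]


def idle_at(f, i):
    """Copy of f in which indicator i is reset to the idle regime."""
    return f[:i] + (IDLE,) + f[i + 1:]


def satisfies_condition(space):
    """Each non-idle f has an intervened i such that idle_at(f, i) is still valid."""
    return all(any(idle_at(f, i) in space for i in intervened(f))
               for f in space if intervened(f))


def chain_to_idle(f, space):
    """Sequence starting at f in which each step adds exactly one idle indicator."""
    chain = [f]
    while intervened(chain[-1]):
        current = chain[-1]
        candidates = (idle_at(current, i) for i in intervened(current))
        step = next((g for g in candidates if g in space), None)
        if step is None:
            raise ValueError("condition fails at %r" % (current,))
        chain.append(step)
    return chain


# Hypothetical constrained space: F_2 = k is only valid when F_1 = k.
space = {(IDLE, IDLE), (1, IDLE), (2, IDLE), (1, 1), (2, 2)}
print(satisfies_condition(space), chain_to_idle((1, 1), space))
```

In this example the only way to leave `(1, 1)` is to reset the later indicator first, which mirrors the remark that it should always be possible to replace the "last" intervention with the idle regime.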

Figure 1: Illustration of SWIG labeling schemes. (a) DAG $\mathcal{G}$ representing the observed joint distribution $p(A, B, C)$; (b) SWIG $\mathcal{G}(a, b)$ with uniform labeling; (c) SWIG $\mathcal{G}(a, b)$ with temporal labeling; and (d) SWIG $\mathcal{G}(a, b)$ with ancestral labeling. (These and other figures were created using the swigs TikZ package, available on CTAN.)

… function of $b$ in the second equality and distributional consistency via Lemma 3 in the third. □
Note that distributional consistency (3) does not imply the analogous result for conditional distributions. In particular, it is possible to have … $p(Y(b) \mid M(b))$ not be a function of $b$ and yet …

Theorem 10. If … obeys distributional consistency and … obeys the SWIG ordered local Markov property for $\mathcal{G}$ and $\prec$, then $p(V)$ obeys the usual DAG ordered local Markov property w.r.t. $\mathcal{G}$ and $\prec$.

… ordered local Markov property for the DAG holds. □

Discussion of relation to Dawid

… regime in which case $T = T^*$ (see Figure 3(e)) where we have used a colored edge, $T^* \to T$, to indicate the deterministic relationship between $T$ and $T^*$. Similarly, … represented by the dashed edge from $T^*$ to $T$ in Figure 3(d) and by the absence of the edge between $T^*$ and $T$ in Figure 3(e). For comparison, Figure 3(a) and (b), respectively, show the representations of the observed distribution $p(T^*, Y)$ and the joint distribution $p(T^*, Y(t))$ …

…) containing only the original variables and the treatment indicators (see bottom of [8, p. 65]). Dawid states that the augmented graphs without ITT variables are sufficient for reasoning about point interventions.

 14
Dawid instead  proposes to marginalizes out the ITT variables.

Figure 3: The simplest case of a single treatment $T$ and outcome $Y$ in the absence of confounding. (a) DAG $\mathcal{G}$ representing the observed joint distribution $p(T^*, Y)$; (b) SWIG $\mathcal{G}(t)$ representing $p(T^*, Y(t))$; (c) Dawid's augmented DAG representing the set of kernels $p(Y, T \mid F_T)$, where $F_T$ is a regime indicator; (d) Dawid's augmented DAG with ITT variables, representing the kernels $p(Y, T, T^* \mid F_T)$, where $F_T$ is a regime indicator; the dashed edge indicates that the edge between $T^*$ and $T$ is absent in the interventional regime, while the red edges indicate deterministic relationships; (e) the ITT augmented graph representing the observational regime $p(T, T^*, Y \mid F_T = \emptyset) = p(T, T^*, Y)$ …

In Figure 4(b), $H$ represents unobserved confounding and the edge $F_T \to Y$ indicates the compromised measurement process.¹⁵ Since the experimental and observational distributions agree, it will hold that $Y \perp\!\!\!\perp F_T \mid T$, as implied by the decision-theoretic graph in Figure 4(a).
… (a) and (b).¹⁶ In contrast, if the ITT variables $T^*$ are included, then no such difficulty arises: the corresponding augmented DAG, shown in Figure 4(c), now additionally requires that $Y \perp\!\!\!\perp T^* \mid F_T = t$

Figure 4: Illustration of the necessity of the ITT (aka "natural value of treatment") variable $T^*$ in Dawid's proposal. (a) An augmented DAG (without ITT nodes) corresponding to an observational study without confounding and a perfect intervention on $T$. (b) An augmented DAG (without ITT nodes) representing an observational study with confounding ($H$) and a mis-targeted ("fat-hand") intervention affecting both $T$ and $Y$. If the mis-targeted intervention matches the effect of confounding, then there will be equality of the observational and interventional distributions, $p(Y \mid T = t, F_T = \emptyset) = p(Y \mid T = t, F_T = t)$, so that the extended independence $Y \perp\!\!\!\perp F_T \mid T$ will hold, and hence the causal diagram shown in (a) cannot be refuted. The inclusion of $T^*$ resolves this. (c) The DAG with ITT variables corresponding to the study without confounding; this implies $Y \perp\!\!\!\perp (F_T, T^*) \mid T$, which is not …

Figure 5: Encoding distributional consistency via a special dynamic regime in the setting of a reformulated decision diagram (having marginalized the intervention target "$T$"). (a) Reformulated augmented graph $\mathcal{G}^*$ representing the observed joint distribution $p(T^*, Y \mid F_T)$ …

Definition 13. (Distributional consistency for decision diagrams) The kernel $p(V \mid F_A)$ is said to obey distributional consistency if, given $i \in$ …, where $C$ may be empty, …

Figure 7: (a) Front-door graph $\mathcal{G}$; (b) the SWIG $\mathcal{G}(t)$ (with ancestral labeling); (c) the augmented decision diagram $\mathcal{G}^*$; and (d) the augmented decision diagram given $F_T = t$, in which the dashed edge from $T^*$ to $T$ is removed. Note that in (d) $F_T$ is d-separated from $Y$ given $M$. However, the corresponding extended independence, $Y \perp\!\!\!\perp F_T \mid M, F_T \neq \emptyset$, is not implied by Dawid's conditions A+B.

This formulation captures the Markov property necessary for the augmented diagram, including the context-specific independences that arise from interventions (that are not captured directly in Dawid's A+B formulation).

Lemma 19. If the kernel $p(W_V \mid F_A)$ obeys distributional consistency and the augmented DAG local Markov property w.r.t. $\mathcal{G}^*$, then: …

Theorem 20. If the kernel $p(W_V \mid F_A)$ obeys distributional consistency and the augmented DAG local Markov property w.r.t. $\mathcal{G}^*$, then $p(W_V)$ obeys the usual local Markov property w.r.t. $\mathcal{G}$.
… first equality follows by distributional consistency. The second follows directly from the equality of (48) and (51) in Lemma 19. Since the last line only depends on $w_{\mathrm{pa}_{\mathcal{G}^*}(i)}$, the ordered local Markov property for the DAG holds. □

Theorem 21. Suppose the kernel $p(W_V \mid F_A)$ obeys distributional consistency and the augmented DAG local Markov property w.r.t. $\mathcal{G}^*$. …

… $F_2 = 1$ would not be well defined. If the same holds for …

… Table 1 in terms of factorization; Table 2 via d-separation. Note that for each $V_i$, the number of arguments on which $p$ …

Table 1: Defining properties for the SWIG $\mathcal{G}(x)$ in Figure 2(b), expressed via factorization

Table 2: d-separation relations corresponding to the SWIG local Markov property in the SWIG …

Table 4: Defining …
… indicate the (constrained) state-space for the set of regime indicators $F_A$.