BY 4.0 license Open Access Published by De Gruyter May 11, 2021

Decision-theoretic foundations for statistical causality

Philip Dawid

Abstract

We develop a mathematical and interpretative foundation for the enterprise of decision-theoretic (DT) statistical causality, which is a straightforward way of representing and addressing causal questions. DT reframes causal inference as “assisted decision-making” and aims to understand when, and how, I can make use of external data, typically observational, to help me solve a decision problem by taking advantage of assumed relationships between the data and my problem. The relationships embodied in any representation of a causal problem require deeper justification, which is necessarily context-dependent. Here we clarify the considerations needed to support applications of the DT methodology. Exchangeability considerations are used to structure the required relationships, and a distinction drawn between intention to treat and intervention to treat forms the basis for the enabling condition of “ignorability.” We also show how the DT perspective unifies and sheds light on other popular formalisations of statistical causality, including potential responses and directed acyclic graphs.

MSC 2010: 62A01; 62C99

1 Introduction

The decision-theoretic (DT) approach to statistical causality has been described and developed in a series of articles [1–14]; for a general overview, see refs. [15,16]. It has been shown to be a more straightforward approach, both philosophically and for use in applications, than other popular frameworks for statistical causality based, e.g., on potential responses or directed acyclic graphs (DAGs). In particular, and unlike those other approaches, it handles causality using only familiar tools of statistics (especially decision analysis) and probability (especially conditional independence). It has no need of additional ingredients such as do-operators, distinct potential versions of a variable, mysterious “error” variables, deterministic relationships, etc. And its application generally streamlines proofs.

From the standpoint of DT, “causal inference” is something of a misnomer for the great preponderance[1] of the methodological and applied contributions that normally go by this description. A better characterisation of the field would be “assisted decision making.” Thus, the DT approach focuses on how we might make use of external – typically observational – data to help inform a decision-maker how best to act; it aims to characterise conditions allowing this and to develop ways in which it can be achieved.

In common with other frameworks for causal inference, work to date has concentrated on the nuts and bolts of showing how this particular approach can be applied to a variety of problems, while largely avoiding detailed consideration of how the conditions enabling such application might be justified in terms of still more fundamental assumptions. The main purpose of the present article is to conduct just such a careful and rigorous analysis, to serve as a foundational “prequel” to the DT enterprise. We develop, in detail, the basic structures and assumptions that, when appropriate, would justify the use of a DT model in a given context – a step largely taken for granted in earlier work. We emphasise important distinctions, such as that between cause and effect variables, and that between intended and applied treatment, both of which are reflected in the formal language; another important distinction is that between post-treatment and pre-treatment exchangeability. The rigorous development is based on the algebraic theory of extended conditional independence, which admits both stochastic and non-stochastic variables [21,22,23], and its graphical representation [2].

We also consider the relationships between DT and alternative current formulations of statistical causality, including potential outcomes [24,25], Pearlian DAGs [26], and single-world intervention graphs [27,28]. We develop DT analogues of concepts that have been considered fundamental in these alternative approaches, including consistency, ignorability, and the Stable Unit-Treatment Value Assumption. In view of these connexions, we hope that this foundational analysis of DT causality will also be of interest and value to those who would seek a deeper understanding of their own preferred causal framework, and in particular of the conditions that need to be satisfied to justify their models and analyses.

1.1 Plan of article

Section 2 describes, with simple examples, the basics of the DT approach to modelling problems of “statistical causality,” noting in particular the usefulness of introducing a non-stochastic variable that allows us to distinguish between the different regimes – observational and interventional – of interest. It shows how assumed relationships between these regimes, intended to support causal inference, may be fruitfully expressed using the language and notation of extended conditional independence, and represented graphically by means of an augmented DAG.

In Sections 3 and 4 we describe and illustrate the standard approach to modelling a decision problem, as represented by a decision tree. The distinction between cause and effect is reflected by regarding a cause as a non-stochastic decision variable, under the external control of the decision-maker, while an effect is a stochastic variable, that cannot be directly controlled in this way. We introduce the concept of the “hypothetical distribution” for an effect variable, were a certain action to be taken, and point out that all we need, to solve the decision problem, is the collection of all such hypothetical distributions.

Section 5 frames the purpose of “causal inference” as taking advantage of external data to help me solve my decision problem, by allowing me to update my hypothetical distributions appropriately. This is elaborated in Section 6, where we relate the external data to my own problem by means of the concept of exchangeability. We distinguish between post-treatment exchangeability, which allows straightforward use of the data, and pre-treatment exchangeability, which cannot so use the data without making further assumptions. These assumptions – especially, ignorability – are developed in Section 7, in terms of a clear formal distinction between intention to treat and intervention to treat. In Section 8, we develop this formalism further, introducing the non-stochastic regime indicator that is central to the DT formulation. Section 9 generalises this by introducing additional covariate information, while Section 10 generalises still further to problems represented by a DAG. In Section 11, we highlight similarities and differences between the DT approach to statistical causality and other formalisms, including potential outcomes, Pearlian DAGs, and single-world intervention graphs. These comparisons and contrasts are explored further in Section 12, by application to a specific problem, and it is shown how the DT approach brings harmony to the babel of different voices. Section 13 rounds off with a general discussion and suggestions for further developments. Some technical proofs are relegated to Appendix A.

2 The DT approach

Here we give a brief overview of the DT perspective on modelling problems of statistical causality.

A fundamental feature of the DT approach is its consideration of the relationships between the various probability distributions that govern different regimes of interest. As a very simple example, suppose that we have a binary treatment variable $T$ and a response variable $Y$. We consider three different regimes, indexed by the values of a non-stochastic regime indicator variable $F_T$:[2]

$F_T = 1$: This is the regime in which the active treatment is administered to the patient.

$F_T = 0$: This is the regime in which the control treatment is administered to the patient.

$F_T = \emptyset$: This is a regime in which the choice of treatment is left to some uncontrolled external source.

The first two regimes may be described as interventional, and the last as observational. In each regime $F_T = j$ there will be a joint distribution $P_j$ for the treatment and response variables, $T$ and $Y$. The distribution of $T$ will be degenerate under an interventional regime (with $T = 1$ almost surely under $P_1$, and $T = 0$ almost surely under $P_0$); but $T$ will typically be non-degenerate under the observational distribution $P_\emptyset$.

It will often be the case that I have access to data collected in the observational regime $F_T = \emptyset$; but for my own decision-making purposes I am interested in comparing and choosing between two interventions available to me, $F_T = 1$ and $F_T = 0$, for which I do not have direct access to relevant data. I can only use the observational data to address my decision problem if I can make, and justify, appropriate assumptions relating the distributions associated with the different regimes.

The simplest such assumption (which, however, will often not be easy to justify) is that the distribution of $Y$ in the interventional active treatment regime $F_T = 1$ is the same as the conditional distribution of $Y$, given $T = 1$, in the observational regime $F_T = \emptyset$; and likewise the distribution of $Y$ under regime $F_T = 0$ is the same as the conditional distribution of $Y$ given $T = 0$ in the regime $F_T = \emptyset$. This assumption can be expressed, in the conditional independence notation of ref. [21], as:

(1) $Y \perp\!\!\!\perp F_T \mid T$,

(read: “$Y$ is independent of $F_T$, given $T$”), which asserts that the conditional distribution of the response $Y$, given the administered treatment $T$, does not further depend on $F_T$ (i.e. on whether that treatment arose naturally, in the observational regime, or by an imposed intervention), and so can be chosen to be the same in all three regimes.

Note, importantly, that the conditional independence assertion (1) makes perfect intuitive sense, even though the variable $F_T$ that occurs in it is non-stochastic. The intuitive content of (1) is made fully rigorous by the theory of extended conditional independence (ECI) [22], which shows that such expressions can, with care,[3] be manipulated in exactly the same way as when all variables are stochastic.

Property (1) can also be expressed graphically, by the augmented DAG [2] of Figure 1. Again, we can include both stochastic variables (represented by round nodes) and non-stochastic variables (square nodes) in such a graph, which encodes ECI by means of the d-separation criterion [32] or the equivalent moralisation criterion [33]. In Figure 1, it is the absence of an arrow from $F_T$ to $Y$ that encodes property (1).

Figure 1: A simple augmented DAG.

The identity, expressed by (1), of the conditional distribution of $Y$ given $T$, across all the regimes described by the values of the regime indicator $F_T$, can be understood as expressing the invariance or stability [34] of a probabilistic ingredient – the conditional distribution of $Y$, given $T$ – across the different regimes. This is thus being regarded as a modular component, unchanged wherever it appears in any of the regimes. When it can be justified, the stability property represented by (1) or Figure 1 permits transfer [35] of relevant information between the regimes: we can use the (available, but not directly interesting) observational data to estimate the distributions of response $Y$ given treatment $T$ in regime $F_T = \emptyset$; and then regard these observational conditional distributions as also supplying the desired interventional distributions of $Y$ (of interest, but not directly available) in the hypothetical regimes $F_T = 1$ and $F_T = 0$ relevant to my decision problem.[4] Characterising, justifying, and capitalising on such modularity properties are core features of the DT approach to causality.
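As a hedged numerical illustration of this transfer, the following Python sketch builds one joint distribution of the treatment and response per regime from a single shared conditional distribution of the response given treatment, so that property (1) holds by construction. All probabilities, the names `P_Y_GIVEN_T`, `P_T`, `joint`, `marginal_y` and `conditional_y`, and the label `"obs"` for the observational regime are assumptions made purely for illustration.

```python
from fractions import Fraction

# Hypothetical modular ingredient: pr(Y = y | T = t), shared by every regime.
P_Y_GIVEN_T = {
    0: {0: Fraction(7, 10), 1: Fraction(3, 10)},
    1: {0: Fraction(2, 5), 1: Fraction(3, 5)},
}

# Distribution of T in each regime. Under an intervention T is degenerate;
# in the observational regime ("obs") it is an arbitrary, assumed split.
P_T = {
    0: {0: Fraction(1), 1: Fraction(0)},
    1: {0: Fraction(0), 1: Fraction(1)},
    "obs": {0: Fraction(3, 4), 1: Fraction(1, 4)},
}


def joint(regime):
    """Joint distribution of (T, Y) in a regime, built so that (1) holds."""
    return {
        (t, y): P_T[regime][t] * P_Y_GIVEN_T[t][y]
        for t in (0, 1)
        for y in (0, 1)
    }


def marginal_y(regime):
    """Distribution of Y in a regime, summing the joint over T."""
    j = joint(regime)
    return {y: j[(0, y)] + j[(1, y)] for y in (0, 1)}


def conditional_y(regime, t):
    """pr(Y = . | T = t) recovered from the joint distribution of a regime."""
    j = joint(regime)
    total = j[(t, 0)] + j[(t, 1)]
    return {y: j[(t, y)] / total for y in (0, 1)}


# Transfer: the observational conditional given T = t matches the
# distribution of Y under the corresponding intervention.
transfer_holds = all(conditional_y("obs", t) == marginal_y(t) for t in (0, 1))
print(transfer_holds)
```

Note that the observational marginal distribution of the response in this sketch differs from both interventional distributions; it is only the conditional distribution given treatment that is shared across regimes.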

A more complex example is given by the DAG of Figure 2, which represents a problem where $Z$ is an instrumental variable for the effect of a binary exposure variable $X$ on an outcome variable $Y$, in the presence of unobserved “confounding variables” $U$. Note again the inclusion of the regime indicator $F_X$, with values 0, 1, and $\emptyset$. As before, $F_X = \emptyset$ labels the observational regime in which data are actually obtained, while $F_X = 1$ [resp., 0] labels the regime where we hypothesise intervening to force $X$ to take the value 1 [resp., 0].

Figure 2: Instrumental variable with regimes.

The figure is nothing more nor less than the graphical representation of the following (extended) conditional independence properties (which it embodies by means of d-separation):

(2) $(Z, U) \perp\!\!\!\perp F_X$,

(3) $U \perp\!\!\!\perp Z \mid F_X$,

(4) $Y \perp\!\!\!\perp Z \mid (X, U, F_X)$,

(5) $Y \perp\!\!\!\perp F_X \mid (X, U)$.

In words, (2) asserts that the joint distribution of $Z$ and $U$ is a modular component, the same in all three regimes, while (3) further requires that, in this (common) joint distribution, we have independence between $U$ and $Z$. Next, (4) says that, in any regime, the response $Y$ is independent of the instrument $Z$, conditionally on exposure $X$ and confounders $U$ (the “exclusion restriction”); while (5) further requires that the conditional distribution for $Y$, given $X$ and $U$ (which, by (4), is unaffected by further conditioning on $Z$) be the same in all regimes.

We emphasise that properties (2)–(5) comprise the full extent of the causal assumptions made. In particular – and in contrast to other common interpretations of a “causal graph” [36] – no further causal conclusions should be drawn from the directions of the arrows in Figure 2. For instance, the arrow from $Z$ to $X$ should not be interpreted as implying a causal effect of $Z$ on $X$: indeed, the figure is fully consistent with alternative causal assumptions, for example that $Z$ and $X$ are merely associated by sharing a common cause [36]. Our restriction of regime indicators to nodes where interventions are meaningful and relevant is in contrast with, for example, the approach of Pearl [26], where it is assumed that it is (at least in principle) possible to consider interventions at every node in a DAG: while this allows one to interpret every arrow as “causal,” that may not be an appropriate representation of the actual problem.

In general, the causal content of any augmented DAG is to be understood as fully comprised by the extended conditional independencies that it embodies by d-separation. This gives a precise and clear semantics to our “causal DAGs.”

To the extent that the assumptions embodied in Figure 2 imply restrictions on the observational distribution of the data, namely,

(6) $U \perp\!\!\!\perp Z$,

(7) $Y \perp\!\!\!\perp Z \mid (X, U)$,

they tally with the standard assumptions made in instrumental variable analysis [37]. And these assumed properties can be testable from observational data: for example, when $X$, $Y$, and $Z$ are discrete, the conditional independence properties (6) and (7) of the observational regime imply that the distributions of $(X, Y)$ given $Z$ satisfy the testable “instrumental inequality” [26, Section 8.4]:[5]

(8) $\max_x \sum_y \max_z \mathrm{pr}(X = x, Y = y \mid Z = z) \leq 1$.
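The check in (8) is mechanical once the conditional distributions of the exposure and outcome given the instrument are tabulated. The Python sketch below is a minimal illustration under stated assumptions: the function names `instrumental_inequality_lhs` and `iv_model_table`, the table layout `p[z][(x, y)]`, and every probability used are hypothetical. One table is generated from a binary model satisfying (6) and (7); the other is a hand-made table that no such model can produce.

```python
def instrumental_inequality_lhs(p):
    """Left-hand side of (8): max over x of sum over y of max over z.

    `p[z][(x, y)]` is assumed to hold pr(X = x, Y = y | Z = z).
    """
    xs = sorted({x for table in p.values() for (x, _) in table})
    ys = sorted({y for table in p.values() for (_, y) in table})
    return max(
        sum(max(table.get((x, y), 0.0) for table in p.values()) for y in ys)
        for x in xs
    )


def iv_model_table(pr_u, pr_x1_given_zu, pr_y1_given_xu, z):
    """pr(X = x, Y = y | Z = z) for binary X, Y with U independent of Z
    and Y independent of Z given (X, U)."""
    table = {}
    for x in (0, 1):
        for y in (0, 1):
            total = 0.0
            for u, pu in pr_u.items():
                px1 = pr_x1_given_zu[(z, u)]
                py1 = pr_y1_given_xu[(x, u)]
                px = px1 if x == 1 else 1.0 - px1
                py = py1 if y == 1 else 1.0 - py1
                total += pu * px * py
            table[(x, y)] = total
    return table


# Hypothetical ingredients of an instrumental variable model.
PR_U = {0: 0.5, 1: 0.5}
PR_X1_GIVEN_ZU = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.7, (1, 1): 0.9}
PR_Y1_GIVEN_XU = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.8}

compatible = {
    z: iv_model_table(PR_U, PR_X1_GIVEN_ZU, PR_Y1_GIVEN_XU, z) for z in (0, 1)
}
lhs_compatible = instrumental_inequality_lhs(compatible)

# A table in which X is always 0 but Y flips with Z: this violates (8).
incompatible = {0: {(0, 0): 1.0}, 1: {(0, 1): 1.0}}
lhs_incompatible = instrumental_inequality_lhs(incompatible)

print(lhs_compatible <= 1.0, lhs_incompatible <= 1.0)
```

A value above 1 refutes the observational properties (6) and (7); a value at or below 1 does not confirm them, and says nothing about the stability assumptions linking regimes.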

However, even when valid, the purely observational properties (6) and (7) are not enough to justify a causal interpretation. Without the additional stitching together of behaviours under the observational regime and the desired, but unobserved, interventional regimes, it is not possible to use the observational data to make causal inferences. When, and only when, these additional stability assumptions can be made, can we justify application of the usual methods of instrumental variable analysis.

In previous work, we have used the above formulation in terms of extended conditional independencies, involving both stochastic variables and non-stochastic regime indicators, as the starting point for analysis and discussion of statistical causality, both in general terms and in particular applications. In this work, we aim to dig a little deeper into the foundations, and in particular to understand why, when, and how we might justify the specific ECI properties previously simply assumed.

3 Causality, agency, and decision

There is a very wide variety of philosophical understandings and interpretations of the concept of “causality.” Our own approach is closely aligned with the “agency,” or “interventionist,” interpretation [38,39,40, 41,42], whereby a “cause” is understood as something that can (at least in principle) be externally manipulated – this notion being an undefined primitive, whose intended meaning is easy enough to comprehend intuitively in spite of being philosophically contentious [43]. This is not to deny the value of other interpretations of causality, based for example on mechanisms [44,45], simplicity [46], probabilistic independence [47,48] or invariant processes [34], or starting from different primitive notions, such as common cause or direct effect [29], or one variable “listening to” another [49]. The present work, however, has the very limited aim of explicating the agency-based DT approach and makes no pretence to address all issues that might dwell under a broad umbrella view of causal reasoning [50]. In particular, we do not address cases where it is desired to ascribe causal status to a variable that is non-manipulable, or for which a corresponding intervention is not well-defined [51,52].

The basic idea is that an agent (“I,” say) has free choice among a set of available actions, and that performing an action will, in some sense, tend to bring about some outcome. Indeed, whenever I seriously contemplate performing some action, my purpose is to bring about some desired outcome; and that aim will inform my choice between the different actions that may be available. We may consider my action as a putative “cause” of my outcome. This approach makes a clear distinction between cause and effect: the former is represented as an action, subject to my free choice, while the latter is represented as an outcome variable, over which I have no direct control. Correspondingly, we will need different formal representations for cause and effect variables: only the latter will be treated as stochastic random variables.

Now by my action I generally will not be able to determine the outcome exactly, since it will also be affected by many circumstances beyond my control, which we might ascribe to the vagaries of “Nature.” So I will have uncertainty about the eventual outcome that would ensue from my action. We shall take it for granted that it is appropriate to represent my uncertainty by a probability distribution. Then, for any contemplated but not yet executed action $a$, there will be a joint probability distribution $P_a$ over all the ensuing variables in the problem,[6] representing my current uncertainty (conditioned on whatever knowledge I currently have, prior to choosing my action) about how those variables might turn out, were I to perform action $a$. We shall term the well-defined distribution $P_a$ hypothetical only because it is premised on the hypothesis that I perform action $a$.[7]

There will be a collection $\mathcal{A}$ of actions available to me, and correspondingly an associated collection $\{P_a : a \in \mathcal{A}\}$ of my hypothetical distributions – each contingent on just one of the actions I might take. My task is to rank my preferences among these different hypothetical distributions over future outcomes and perform that action corresponding to the distribution $P_a$ I like best. I can do this ranking in terms of any feature of the distributions that interests me.

One such way, concordant with Bayesian statistical decision theory [53,54], is to construct a real-valued loss function $L$, such that $L(y, a)$ measures the dissatisfaction I will suffer if I take action $a$ and the value of some associated outcome variable $Y$ later turns out to be $y$. This is represented in the decision tree of Figure 3.[8]

Figure 3: Decision tree.

The square at node $\nu$ indicates that it is a decision node, where I can choose my action, $a$. The round node $\nu_a$ indicates the generation of the stochastic outcome variable, $Y$, whose hypothetical distribution $P_a$ will typically depend on the contemplated action $a$.

Since, at node $\nu_a$, $Y \sim P_a$, the (negative) value of taking action $a$, and thus getting to $\nu_a$, is measured by the expected loss $L(a) := E_{Y \sim P_a}\{L(Y, a)\}$. The principles of statistical decision analysis now require that, at the decision node $\nu$, I should choose an action $a$ minimising $L(a)$.

Note particularly that, whatever loss function is used, this solution will only require knowledge of the collection $\{P_a\}$ of hypothetical distributions for the outcome variable $Y$.

There are decision problems where explicit inclusion of the action $a$ as an argument of the loss function is natural. For example, I might have a choice between taking my umbrella ($a = 1$) when I go out, or leaving it at home ($a = 0$). For either action, the relevant binary outcome variable $Y$ indicates whether it rains ($Y = 1$) or not ($Y = 0$). The loss is 1 if I get wet, 0 otherwise, so that $L(0, 0) = L(0, 1) = L(1, 1) = 0$, $L(1, 0) = 1$. In this case, my action presumably has no effect on the outcome $Y$, so that I might take $P_1$ and $P_0$ to be identical; but it enters non-trivially into the loss function. However, it is arguable whether such a problem, where the only effect of my action is on the loss, can properly be described as one of causality. In typical causal applications, the loss function will depend only on the value $y$ of $Y$, and not further on my action – so that $L(y, a)$ simplifies to $L(y)$. The only thing depending on $a$ will then be my hypothetical distribution $P_a$ for $Y$, subsequent to (“caused by”) my taking action $a$. Then $L(a) = E_{Y \sim P_a}\{L(Y)\}$, and my choice of action effectively becomes a choice between the different hypothetical distributions $P_a$ for $Y$ associated with my available actions $a$: I prefer that distribution giving the smallest expectation for $L(Y)$. This specialisation will be assumed throughout this work.
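The two situations just described can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names `expected_loss` and `best_action`, the rain probability, and the two-point hypothetical distributions in the second example are assumptions, not quantities taken from the text.

```python
def expected_loss(dist, loss, action):
    """Expected loss of an action: sum over y of pr_a(Y = y) * L(y, a)."""
    return sum(prob * loss(y, action) for y, prob in dist.items())


def best_action(hypothetical, loss):
    """Return the action minimising expected loss, plus all expected losses.

    `hypothetical[a]` is the hypothetical distribution P_a, given as a
    dict mapping each outcome y to its probability.
    """
    losses = {a: expected_loss(p, loss, a) for a, p in hypothetical.items()}
    return min(losses, key=losses.get), losses


# Umbrella example: the action does not change the distribution of Y
# (P_0 = P_1) but does enter the loss. The rain probability is assumed.
pr_rain = 0.3
rain = {1: pr_rain, 0: 1.0 - pr_rain}
umbrella_choice, umbrella_losses = best_action(
    {0: rain, 1: rain},
    lambda y, a: 1.0 if (y == 1 and a == 0) else 0.0,
)

# Typical causal case: the loss depends on y only, and the action matters
# solely through the hypothetical distribution P_a (values assumed).
causal_choice, causal_losses = best_action(
    {0: {0: 0.6, 1: 0.4}, 1: {0: 0.2, 1: 0.8}},
    lambda y, a: float(y),
)

print(umbrella_choice, causal_choice)
```

In both cases the only inputs are the collection of hypothetical distributions and the loss function, which is the point made above about what is needed to solve the decision problem.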

4 A simple causal decision problem

As a simple specific example, we consider the following stylised decision problem.

Example 1

I have a headache and am considering whether or not I should take two aspirin tablets. Will taking the aspirins cause my headache to disappear?

Let the binary decision variable $F_X$ denote whether I take the aspirin ($F_X = 1$) or not ($F_X = 0$), and let $Z$ denote the time it takes for my headache to go away. For convenience only, we focus on $Y := \log Z$, which can take both positive and negative values.

I myself will choose the value of $F_X$: it is a decision variable and does not have a probability distribution. Nevertheless, it is still meaningful to consider my conditional distribution, $P_x$ say, for how the eventual response $Y$ might turn out, were I to take decision $F_X = x$ ($x = 0, 1$). For the moment, we assume the distributions $P_0$, $P_1$ to be known – this will be relaxed in Section 5. Where we need to be definite, we shall, purely for simplicity, take $P_x$ to be the normal distribution $N(\mu_x, \sigma^2)$, with probability density function:

(9) $p_x(y) := p(y \mid F_X = x) = (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{(y - \mu_x)^2}{2\sigma^2}\right\}$,

having mean $\mu_0$ or $\mu_1$ according as $x = 0$ or 1, and variance $\sigma^2$ in either case.

The distribution $P_1$ [resp., $P_0$] expresses my uncertainty about how $Y$ would turn out, if, hypothetically, I were to decide to take the aspirin, i.e. under $F_X = 1$ [resp., if I were to decide not to take the aspirin, $F_X = 0$]. It can incorporate various sources and types of uncertainty, including stochastic effects of external influences arising or acting between the point of treatment application and the eventual response. My task is to compare the two hypothetical distributions $P_1$ and $P_0$ and decide which one I prefer. If I prefer $P_1$ to $P_0$, then my decision should be to take the aspirin; otherwise, not. Whatever criterion I use, all I need to put it into effect, and so solve my decision problem, is the pair of hypothetical distributions $\{P_0, P_1\}$ for the outcome $Y$, under each of my hypothesised actions.

One possible comparison of $P_1$ and $P_0$ might be in terms of their respective means, $\mu_1$ and $\mu_0$, for $Y$; the “effect” of taking aspirin, rather than nothing, might then be quantified by means of the change in the expected response, $\delta := \mu_1 - \mu_0$. This is termed the average causal effect, ACE (in terms of the outcome variable $Y$ – so more specifically denoted by $\mathrm{ACE}_Y$, if required). Alternatively, we might look at the average causal effect in terms of $Z = e^Y$: $\mathrm{ACE}_Z = E_{P_1}(Z) - E_{P_0}(Z) = e^{\sigma^2/2}(e^{\mu_1} - e^{\mu_0})$, or make this comparison as a ratio, $E_{P_1}(Z)/E_{P_0}(Z) = e^{\mu_1 - \mu_0}$. Or, we could consider and compare the variance of $Z$, $\mathrm{var}_x(Z) = e^{2\mu_x}(e^{2\sigma^2} - e^{\sigma^2})$ under $P_x$ ($x = 0, 1$). In full generality, any comparison of an appropriately chosen feature of the two hypothetical distributions, $P_0$ and $P_1$, of $Y$ can be regarded as a partial summary of the causal effect of taking aspirin (as against taking nothing).
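The closed forms above follow from the standard log-normal moments, and a short Python sketch can evaluate them side by side. The parameter values, the function names `causal_summaries` and `mean_z_numeric`, and the midpoint-rule check are assumptions introduced for illustration only.

```python
import math


def causal_summaries(mu0, mu1, sigma2):
    """Several partial summaries of the causal effect when Y ~ N(mu_x, sigma2)
    under action x and Z = exp(Y)."""

    def mean_z(mu):
        return math.exp(mu + sigma2 / 2.0)

    def var_z(mu):
        return math.exp(2.0 * mu) * (math.exp(2.0 * sigma2) - math.exp(sigma2))

    return {
        "ACE_Y": mu1 - mu0,
        "ACE_Z": mean_z(mu1) - mean_z(mu0),
        "ratio_Z": mean_z(mu1) / mean_z(mu0),
        "var_Z": {0: var_z(mu0), 1: var_z(mu1)},
    }


def mean_z_numeric(mu, sigma2, steps=20000, width=12.0):
    """Midpoint-rule approximation to E(exp(Y)) for Y ~ N(mu, sigma2),
    used only as an independent check of the closed form."""
    sigma = math.sqrt(sigma2)
    lo = mu - width * sigma
    h = 2.0 * width * sigma / steps
    total = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * h
        dens = math.exp(-((y - mu) ** 2) / (2.0 * sigma2))
        dens /= math.sqrt(2.0 * math.pi * sigma2)
        total += math.exp(y) * dens * h
    return total


# Hypothetical parameter values (log-minutes).
MU0, MU1, SIGMA2 = 3.0, 2.5, 0.25
summaries = causal_summaries(MU0, MU1, SIGMA2)
closed_form_mean = math.exp(MU1 + SIGMA2 / 2.0)
numeric_mean = mean_z_numeric(MU1, SIGMA2)

print(summaries["ACE_Y"], summaries["ratio_Z"])
```

The different summaries need not agree in magnitude, which is why each is best read as a partial description of how the two hypothetical distributions differ.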

A fully decision-theoretic formulation is represented by the decision tree of Figure 4.

Suppose (for example) that I were to measure the loss that I will suffer if my headache lasts $z = e^y$ minutes by means of the real-valued loss function $L(z) = \log z = y$. If I were to take the aspirin ($F_X = 1$), my expected loss would be $E_{Y \sim P_1}(Y) = \mu_1$; if not ($F_X = 0$), it would be $\mu_0$. The principles of statistical decision analysis now direct me to choose the action leading to the smaller expected loss. The “effect of taking aspirin” might be measured by the increase in expected loss, which in this case is just $\mathrm{ACE}_Y$; and the correct decision will be to take aspirin when this is negative.

Although there is no uniquely appropriate measure of “the effect of treatment,” in the rest of our discussion we shall, purely for simplicity and with no real loss of generality, focus on the difference of the means of the two hypothetical distributions for the outcome variable $Y$:

(10) $\mathrm{ACE} = E_{P_1}(Y) - E_{P_0}(Y)$.

Figure 4: Decision tree.

5 Populating the decision tree

The above formulation is fine so long as I know all the ingredients in the decision tree, in particular the two hypothetical distributions $P_0$ and $P_1$. Suppose, however, that I am uncertain about the parameters $\mu_1$ and $\mu_0$ of the relevant hypothetical distributions $P_1$ and $P_0$ (purely for simplicity we shall continue to regard $\sigma^2$ as known). To make explicit the dependence of the hypothetical distributions on the parameters, we now write them as $P_{1,\mu_1}$, $P_{0,\mu_0}$ and denote the associated density functions by $p_1(y \mid \mu_1)$, $p_0(y \mid \mu_0)$.

5.1 No-data decision problem

Being now uncertain about the parameter-pair $\mu = (\mu_1, \mu_0)$, I should assess my personalist prior probability distribution, $\Pi$ say, for $\mu$ (in the light of whatever information I currently have). Let this have density $\pi(\mu_1, \mu_0)$. To solve my decision problem, I would then substitute, for the unknown hypothetical distribution $P_{1,\mu_1}(y)$, my “prior predictive” hypothetical distribution $P_1$ for $Y$, with density

$p_1(y) = \iint p_1(y \mid \mu_1)\,\pi(\mu_1, \mu_0)\,d\mu_1\,d\mu_0 = \int p_1(y \mid \mu_1)\,\pi_1(\mu_1)\,d\mu_1$,

where $\pi_1(\mu_1)$ is my marginal prior density for $\mu_1$:

$\pi_1(\mu_1) = \int \pi(\mu_1, \mu_0)\,d\mu_0$.

Similarly, I would replace $P_{0,\mu_0}(y)$ by $P_0$, having density $p_0(y) = \int p_0(y \mid \mu_0)\,\pi_0(\mu_0)\,d\mu_0$, where $\pi_0(\mu_0) = \int \pi(\mu_1, \mu_0)\,d\mu_1$ is my marginal prior density for $\mu_0$. We remark that, in parallel to the property that, with full information, I only need to specify the two hypothetical distributions $P_1$ and $P_0$, when I have only partial information I only need to specify, separately, my marginal uncertainties about the unknown parameters of each of these distributions. In particular, once these margins have been specified, any further dependence structure in my joint personal probability distribution $\Pi$ for $(\mu_1, \mu_0)$ is irrelevant to my decision problem.
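The irrelevance of the dependence structure can be checked on a toy case. The Python sketch below replaces the continuous prior by a discrete one, which is an assumption made only to keep the computation exact; the support points, the weights, and the names `normal_pdf` and `predictive_density` are likewise hypothetical. Two joint priors with identical margins, one independent and one perfectly dependent, yield the same prior predictive densities.

```python
import math


def normal_pdf(y, mean, var):
    """Density of N(mean, var) at y."""
    return math.exp(-((y - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def predictive_density(y, action, joint_prior, sigma2=1.0):
    """Prior predictive density of Y under an action: the sampling density
    averaged over a discrete joint prior on (mu1, mu0)."""
    return sum(
        weight * normal_pdf(y, mu1 if action == 1 else mu0, sigma2)
        for (mu1, mu0), weight in joint_prior.items()
    )


# Two hypothetical joint priors for (mu1, mu0) with the same margins:
# mu1 is 2 or 3 with probability 1/2 each, mu0 is 3 or 4 with probability 1/2 each.
independent_prior = {
    (2.0, 3.0): 0.25,
    (2.0, 4.0): 0.25,
    (3.0, 3.0): 0.25,
    (3.0, 4.0): 0.25,
}
dependent_prior = {(2.0, 3.0): 0.5, (3.0, 4.0): 0.5}

grid = [1.0, 2.5, 3.0, 4.2]
max_gap = max(
    abs(
        predictive_density(y, a, independent_prior)
        - predictive_density(y, a, dependent_prior)
    )
    for y in grid
    for a in (0, 1)
)
print(max_gap < 1e-12)
```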

5.2 Data

When in a state of uncertainty, that uncertainty can often be reduced by gathering data. Bayesian statistical decision theory [53] shows that, for any decision problem, the expected reduction in loss by using additional data (“the expected value of sample information”) is always non-negative. The effect of obtaining data $D$ is to replace all the distributions entering in Section 5.1 by their versions obtained by further conditioning on $D$.

Suppose then that I wish to reduce my uncertainty about $\mu_1$, the parameter of my hypothetical distribution $P_1$, by utilising relevant data. What data should I collect, and how should I use them?

What I might, ideally, want to do is gather together a “treatment group” $\mathcal{T}$ of individuals whom I can regard, in an intuitive sense, as similar to myself, with headaches similar to my own. We call such individuals exchangeable (both with each other and with me) – this intuitive concept is treated more formally in Section 6. I then give them each two aspirins and observe their responses (how long until their headaches go away). Conditionally on the parameter $\mu_1$ of $P_1 = P_{1,\mu_1}$, I could reasonably[9] model these responses as being independently and identically distributed, with the same distribution, $P_{1,\mu_1}$, that would describe my own uncertainty about my own outcome, $Y$, were I, hypothetically, to take the aspirins, and thus put myself into the same situation as the individuals in my sample. Conditionally on $\mu_1$, I would further regard my own outcome as independent of those in the sample. We shall not here be concerned with issues of sampling variability in finite datasets. So we consider the case that the treatment group $\mathcal{T}$ is very large. Then I can essentially identify $\mu_1$ as the observed sample mean $\hat{\mu}_1$, and so take my updated $P_1$ to be $N(\hat{\mu}_1, \sigma^2)$.[10] For any non-dogmatic prior, this will be a close approximation to my Bayesian “posterior predictive distribution” for $Y$, given the data $D$ (conditionally on my taking the aspirins), and also has a clear frequentist justification.

The above was relevant to my hypothetical distribution $P_1$, were I to take the aspirins. But of course an entirely parallel argument can be applied to estimating $P_0$, the distribution of my response $Y$ were I not to take the aspirins. I would gather another large group (the “control group,” $\mathcal{C}$) of individuals similar to myself, with headaches similar to my own, but this time withhold the aspirins from them. I would then use the empirically estimated distribution of the response in this group as my own distribution $P_0$.

Let $\mathcal{D} = \mathcal{T} \cup \mathcal{C}$ be the set of “data individuals.” Using the responses of $\mathcal{D}$, I have been able to populate my own decision problem with the relevant hypothetical distributions, $P_1$ and $P_0$. I can now solve it, and so choose the optimal decision for me.
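A small simulation can make the procedure concrete. The Python sketch below is hedged throughout: the “true” means in `TRUE_MU`, the group size, the random seed, and the helper `simulate_group` are assumptions standing in for real treatment and control data, and the loss is taken to be the log-duration itself, as in Section 4.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible

SIGMA = 1.0                   # response standard deviation, regarded as known
TRUE_MU = {1: 2.5, 0: 3.0}    # hypothetical values, unknown to the decision-maker
GROUP_SIZE = 50_000           # "very large" groups, so sampling error is small


def simulate_group(action, size):
    """Simulated log-durations for individuals who all received `action`."""
    return [random.gauss(TRUE_MU[action], SIGMA) for _ in range(size)]


treatment_responses = simulate_group(1, GROUP_SIZE)
control_responses = simulate_group(0, GROUP_SIZE)

# Plug-in estimates of the two hypothetical means.
mu_hat = {
    1: statistics.fmean(treatment_responses),
    0: statistics.fmean(control_responses),
}

# With loss L(y) = y the expected loss of action x is mu_x, so the
# estimated ACE decides: take the aspirin (1) when it is negative.
ace_hat = mu_hat[1] - mu_hat[0]
decision = 1 if ace_hat < 0 else 0

print(round(ace_hat, 2), decision)
```

The sketch takes for granted that the simulated individuals are exchangeable with the decision-maker after treatment, which is exactly the assumption examined in Section 6.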

6 Exchangeability

Here, we delve more deeply into the justification for some of the intuitive arguments made above (and below).

In Section 5.2, in the context first of estimating my hypothetical distribution $P_1$, we discussed constructing, as the treatment group $\mathcal{T}$,

a group of individuals whom I can regard, in an intuitive sense, as similar to myself, with headaches similar to my own.

The identical requirement was imposed on the control group $\mathcal{C}$. The formal definition and theory of exchangeability [56,57] seek to put this intuitive conception on a more formal footing.

We consider a collection of individuals, on each of which we can measure a number of generic variables. One such is the generic response variable $Y$, having a specific instance, $Y_i$, for individual $i$ – that is, $Y_i$ denotes the response of individual $i$. We suppose all individuals considered are included in this collection. In particular, the groups $\mathcal{T}$ and $\mathcal{C}$ are subsets of it, and I myself am included in it, with label 0, say.

6.1 Post-treatment exchangeability

What we are essentially requiring of T , in the description quoted above, is twofold:

  1. (i)

    My joint personalist distribution for the responses in the treatment group, i.e. the ordered set $(Y_i : i \in \mathcal{T})$, is exchangeable – that is to say, I regard the re-ordered set $(Y_{\rho(i)} : i \in \mathcal{T})$ as having the same joint distribution as $(Y_i : i \in \mathcal{T})$, where $\rho$ is an arbitrary permutation (re-ordering) of the treated individuals.

  2. (ii)

    If, moreover, I were to take the aspirins, then the above exchangeability would extend to the set $\mathcal{T}^+ := \mathcal{T} \cup \{0\}$, in which I too am included.

Parallel exchangeability assumptions would be made for the control group $\mathcal{C}$, from whom the aspirin is withheld: in (i) and (ii) we just replace “treatment” by “control,” $\mathcal{T}$ by $\mathcal{C}$ (and $\mathcal{T}^+$ by $\mathcal{C}^+$), and “were to take” by “were not to take.” We shall denote these variant versions by (i)′ and (ii)′.

Since the aforementioned exchangeability assumptions relate to the responses of individuals after they have (actually or hypothetically) received treatment, we refer to them as post-treatment exchangeability.

Applying de Finetti’s representation theorem [56] to (i), I can regard the responses $(Y_i : i \in \mathcal{T})$ in the treatment group as independently and identically distributed, from some unknown distribution.[11] This distribution can then be consistently estimated from the response data in the treatment group. On account of (ii), this same distribution would govern my own response, $Y_0$, were I to take the aspirins. It can thus be identified with my own hypothetical distribution $P_1$. Taken together, (i) and (ii) thus justify my estimating $P_1$ from the treatment group data, and using this to populate the treatment branch of my decision tree.[12] Similarly, using (i)′ and (ii)′, I can use the data from the control group to populate my own control branch. My decision problem can now be solved.[13]
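The link between exchangeability and “independent and identically distributed from an unknown distribution” can be checked numerically in the easy direction: a mixture of i.i.d. models is exchangeable, yet the responses are not marginally independent, which is exactly why observing others is informative about me. The sketch below assumes, purely for illustration, a binary response (say, “headache gone within the hour”) and a two-point prior on the unknown success probability; these choices and the function names are hypothetical, and the code does not prove de Finetti’s theorem itself.

```python
from fractions import Fraction
from itertools import permutations, product

# Hypothetical two-point prior over the unknown success probability theta.
PRIOR = {Fraction(1, 5): Fraction(1, 2), Fraction(4, 5): Fraction(1, 2)}

def joint_pmf(ys):
    """Probability of the binary sequence ys under the mixture of i.i.d. models."""
    total = Fraction(0)
    for theta, weight in PRIOR.items():
        likelihood = Fraction(1)
        for y in ys:
            likelihood *= theta if y == 1 else 1 - theta
        total += weight * likelihood
    return total

def is_exchangeable(n):
    """Check that every re-ordering of every length-n sequence has the same probability."""
    return all(
        joint_pmf(ys) == joint_pmf(perm)
        for ys in product((0, 1), repeat=n)
        for perm in permutations(ys)
    )

p_one = joint_pmf((1,))      # pr(Y_1 = 1)
p_both = joint_pmf((1, 1))   # pr(Y_1 = 1, Y_2 = 1)
```

Under these assumptions `is_exchangeable(3)` holds, while `p_both` exceeds `p_one ** 2`: the responses are positively dependent until the unknown distribution is learned.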

Some comments

  1. (1)

    Whether or not the exchangeability assumption (i) can be regarded as reasonable will be highly dependent on the background information informing my personal probability assessments. For example, I might know, or suspect, that evening headaches tend to be more long-lasting than morning headaches. If I were also to know which of the headaches in T were evening, and which morning, headaches, then I would not wish to impose exchangeability. I might know that individual 1 had a morning headache, and individual 2 an evening headache. Then it would not be reasonable for me to give the re-ordered pair ( Y 2 , Y 1 ) the same joint distribution as ( Y 1 , Y 2 ) – in particular, my marginal distribution for Y 2 would likely not be the same as that for Y 1 . However, in the absence of specific knowledge about who had what type of headache – “equality of ignorance” – the exchangeability condition (i) could still be reasonable.

  2. (2)

    There may be more than one way of embedding my own response, Y 0 , into a set of exchangeable variables. For example, instead of considering other individuals, I could consider all my own previous headache episodes. (In the language of experimental design, the experimental unit – the headache episode – is nested within the individual.) Then I might use the estimated distribution of my response, among those past headache episodes of my own that I had treated with aspirin, to populate the treatment branch of my current decision problem. This might well yield a different (and arguably more relevant) distribution from that based on observing headaches in other treated individuals. In this sense there is no “objective” distribution P 1 waiting to be uncovered: P 1 is itself an artefact of the overall structure in which I have embedded my problem, and the data that I have observed.

  3. (3)

    Exchangeability must also be considered in relation to my own current circumstances. The exchangeability judgment (i) may not be extendible as required by (ii) if, for example, my current headache is particularly severe. To reinstate exchangeability I might then need to restrict attention to those headache episodes (in other individuals, or in my own past) that had a similar level of severity to mine. Alternatively, I might build a more complex statistical model, allowing for different degrees of severity, and use this to extrapolate from the observed data to my own case.

  4. (4)

    We do not in principle exclude complicated scenarios such as “herd immunity” in vaccination programmes, where an individual’s response might be affected in part by the treatments that are assigned to other individuals. Assuming appropriate symmetry in (my knowledge of) the interactions between individuals, this need not negate the appropriateness of the exchangeability assumptions, and hence the validity of the above analysis – though in this case it would be difficult to give the underlying distributions P 0 and P 1 , conjured into existence by de Finetti’s theorem, a clear frequentist interpretation. However, in such a problem it would usually be more appropriate to enter into a more detailed modelling of the situation.

Exchangeability, while an enormously simplifying assumption, is in any case inessential for the more general analysis of Section 5.2: at that level of generality, I have to assess my conditional distribution for my own response Y 0 (in the hypothetical situation that I decide to take the aspirins), given whatever data D I have available. But modelling and implementing an unstructured prediction problem can be extremely challenging, as well as hard to justify as genuinely empirically based, unless we can make good arguments. When appropriate, judgments of exchangeability constitute an excellent basis for such arguments.

6.2 Pre-treatment exchangeability

The post-treatment exchangeability conditions (i) and (ii), and (i)′ and (ii)′, are what is needed to let me populate my decision tree with the requisite hypothetical distributions and so solve my decision problem.

Here we consider another interpretation of the expression “a group of individuals whom I can regard, in an intuitive sense, as similar to myself, with headaches similar to my own.” This description has been supposed equally applicable to the treatment group $\mathcal{T}$ and the control group $\mathcal{C}$. But this being the case, then – applying Euclid’s first axiom, “Things which are equal to the same thing are also equal to one another” – the two groups, $\mathcal{T}$ and $\mathcal{C}$ (and their headaches), both being similar to me, must be regarded (again in an intuitive sense) as similar to each other – I must be “comparing like with like.” But how are we to formalise this intuitive property of the two groups being similar to each other? We cannot simply impose full exchangeability of all the responses $(Y_i : i \in \mathcal{D})$, since I typically would not expect the responses of the treated individuals to be exchangeable with those of the untreated individuals.

One way of formalising this intuition is to consider all the individuals in the treatment and control groups before they were given their treatments. Just as I myself can hypothesise taking either one of the treatments, and in either case consider my hypothetical distribution for my ensuing response $Y_0$, so can I hypothesise various ways in which treatments might be applied to all the individuals in the collection $\mathcal{I}$.

Let the binary decision variable T ˇ i indicate which treatment is hypothesised to be applied to individual i .

We first introduce the following Stable Unit-Treatment Distribution Assumption (SUTDA):

Condition 1.

(SUTDA) For any $A \subseteq \mathcal{I}$, the joint distribution of $Y_A := (Y_i : i \in A)$, given hypothesised treatment applications $(\check{T}_i = t_i : i \in \mathcal{I})$, depends only on $(t_i : i \in A)$. In particular, for any individual $i$, the distribution of the associated response $Y_i$ depends only on the treatment $t_i$ applied to that individual.

As discussed further in Section 11.1, SUTDA bears a close resemblance to the Stable Unit-Treatment Value Assumption (SUTVA), typically made in the Rubin potential outcome framework; but – as reflected in its name – differs in the important respect of referring to distributions, rather than values, of variables. It is a weaker requirement than SUTVA, but is as powerful as required for applications.

Note that SUTDA is a genuinely restrictive hypothesis, now excluding cases such as the vaccine example (4) of Section 6.1. However, we will henceforth assume it holds.

In more complex problems, there will be other generic variables of interest besides $Y$ – we term these (including the response variable $Y$) domain variables. Then we extend SUTDA to apply to all domain variables, considered jointly. An important special case is that of a domain variable $X$ such that the joint distribution of $(X_i : i \in \mathcal{I})$, given $\check{T}_i = t_i$ ($i \in \mathcal{I}$), does not depend in any way on the applied treatments $(t_i)$. Such a variable, unaffected by the treatment, is a concomitant. It will typically be reasonable to treat as a concomitant any variable whose value is fully determined before the treatment decision has to be made: such a variable is termed a covariate. Other concomitants might include, for example, the weather after the treatment decision is made.

Let $V$ be a (possibly multivariate) generic variable. I now hypothesise giving all individuals in $\mathcal{I}$ (including myself) the aspirins, and consider my corresponding hypothetical[14] joint distribution for the individual instances $(V_i : i \in \mathcal{I})$. It would often be reasonable to impose full exchangeability on this joint distribution, since all members of $\mathcal{I}$ would have been treated the same. A similar assumption can be made for the case that the aspirins are, hypothetically, withheld from all individuals. We term the conjunction of these two hypothetical exchangeability properties pre-treatment exchangeability (of $V$, over $\mathcal{I}$).

When I can assume this, then under uniform application of aspirin, by de Finetti’s theorem I can regard all the $(V_i)$ as independent and identically distributed from some distribution $Q_1$ (initially unknown, but estimable from data on uniformly treated individuals). Similarly, under hypothetical uniform withholding of aspirin, there will be an associated distribution $Q_0$. When moreover SUTDA applies, we can conclude that, under any hypothesised application of treatments, $\check{T}_i = t_i$ ($i \in \mathcal{I}$), we can regard the $V_i$ as independent, with $V_i \sim Q_{t_i}$. We can thus confine attention to the generic variable $V$, with distribution $Q_1$ [resp., $Q_0$] under applied treatment $\check{T} = 1$ [resp., $\check{T} = 0$].

Pre-treatment exchangeability appears, superficially, to be a stronger requirement than post-treatment exchangeability: one could argue that (taken together with SUTDA) pre-treatment exchangeability implies the post-treatment exchangeability properties (i), (ii), (i)′, and (ii)′, which would permit me to populate both the treatment and the control branches of my decision tree, and so solve my decision problem. This would indeed be so if the individuals forming the treatment and control groups were identified in advance, and then subjected to their appointed interventions. However, it need not be so in the more general case that we do not have direct control over who gets which treatment. Much of the rest of this article is concerned with addressing such cases, considering further conditions – in particular, ignorability of the treatment assignment process, as described in Section 7.1 – which allow us to bridge the gap between pre- and post-treatment exchangeability.

6.3 Internal and external validity

We might be willing to accept pre-treatment exchangeability, but only over the restricted set $\mathcal{D}$ of data individuals, excluding myself – a property we term internal exchangeability. When I can extend this to pre-treatment exchangeability over the set $\mathcal{D}^+ := \mathcal{D} \cup \{0\}$, including myself, we have external exchangeability. In the latter case, there is at least a chance that the data $\mathcal{D}$ could help me solve my decision problem – the case of external validity of the data.[15] However, when we have internal but not external exchangeability, this conclusion could, at best, be regarded as holding for a new, possibly fictitious, individual who could be regarded as exchangeable with those in the data – this is the case of internal validity. In practice that can be problematic. For example, a clinical trial might have tightly restricted enrolment criteria, perhaps restricting entry to, say, men aged between 25 and 49 years with mild headache. Even if the study has good internal validity, and shows a clear advantage to aspirin for curing the headache, it is not clear that this message would be relevant to a 20-year-old female with a severe headache. And indeed, it may not be. Arguments for external validity will generally be somewhat speculative, and not easy to support with empirical evidence.

7 Treatment assignment and application

In Section 5.2 we talked of identifying, quite separately, two groups of individuals, in each case supposed suitably exchangeable (both internally, and with me), where one of the groups is made to take, and the other made not to take, the aspirins. But typically the process is reversed: a single group of individuals, D say, is gathered, some of whom are then chosen to receive active treatment – thus forming the treatment group T – with the remainder forming the control group C .

In this case, the treatment process has the following three stages:

  1. (1)

    First, the data subjects D are identified by some process.

  2. (2)

    Second, certain individuals in D are somehow selected to receive active treatment, the others to receive control.[16]

  3. (3)

    Finally, the assigned treatments are actually administered.

The operation of stage (1) will be crucial for issues of external validity – if the data are to be at all relevant for me, I would want the data subjects to be somehow like me. However, from this point on we shall naïvely assume this has been done satisfactorily – alternatively, we consider “me” to be a possibly fictitious individual who can be regarded as similar to those in the data. We shall thus consider all data subjects, together with myself, as pre-treatment exchangeable. I can then confine attention to the joint distributions P 1 and P 0 over generic variables, under hypothesised application of treatment 1 or 0, respectively.

For further analysis, it will prove important to keep stages (2) and (3) clearly distinct in the notation and the analysis.

We denote by $T^*$ the generic intention to treat (ITT) variable, generated at stage (2), where $T^*_i = 1$ if individual $i \in \mathcal{D}$ is selected to receive active treatment, and $T^*_i = 0$ if not (this is relevant only for the external data $\mathcal{D}$: my own value $T^*_0$ need not be defined). Note that $T^*$ is a stochastic variable. In contrast, we also consider (at stage (3)) the binary non-stochastic generic decision/regime variable $\check{T}$: $\check{T}_i = 1$ [resp., $\check{T}_i = 0$] denotes the (typically hypothetical) situation in which individual $i$ is made to take [resp., prevented from taking] the aspirins.[17] My own decision variable $\check{T}_0$ (though not yet its value) is well-defined – indeed, is the very focus of my decision problem.

Note that when below we talk of “domain variables” we will exclude $T^*$ and $\check{T}$ from this description.

If all goes to plan, for $i \in \mathcal{D}$ we shall have $\check{T}_i = T^*_i$. However, there is no bar to considering, between stages (2) and (3), what might happen to an individual, fingered to receive the treatment (so having $T^*_i = 1$), who, contrary to plan, is prevented from taking it (so that $\check{T}_i = 0$)[18] – indeed, we have already made use of such considerations when introducing pre-treatment exchangeability. So we can meaningfully consider a quantity such as $E(Y \mid T^* = 1, \check{T} = 0)$. And indeed it will prove useful to divorce treatment selection (intention to treat), $T^*$, from (actual or hypothetical) treatment application, $\check{T}$, in this way. For example, what is usually termed the effect of treatment on the treated [62] is more properly expressed as the effect of treatment on those selected for treatment, which can be represented formally as $E(Y \mid T^* = 1, \check{T} = 1) - E(Y \mid T^* = 1, \check{T} = 0)$ [10,60].

Since the selection process is made before any application of treatment, it is appropriate to treat $T^*$ as a covariate, with the same distribution in both regimes.

We suppose internal exchangeability, in the sense of Section 6.3, for the pair of generic variables $(T^*, Y)$. In particular, we shall have internal exchangeability, marginally, for the response variable $Y$ – and, to make a link to my own decision problem, we assume this extends to external exchangeability for $Y$ (we here omit $T^*$, since that might not even be meaningfully defined for me). However, even internal exchangeability for $Y$ need no longer hold after we condition on the selection variable $T^*$ – this is the problem of confounding. For example, suppose that, although I myself do not know which of the headaches in $\mathcal{D}$ are the (generally milder) morning and which the (generally more long-lasting) evening headaches, I know or suspect that the aspirins have been assigned preferentially to the evening headaches. Then simply knowing that an individual was selected (perhaps self-selected) to take the aspirins ($T^* = 1$) will suggest that his headache is more likely to be an evening headache, and so change my uncertainty about his response $Y$ (whichever treatment were to be taken). I might thus expect, e.g., $E(Y \mid T^* = 1, \check{T} = t) > E(Y \mid T^* = 0, \check{T} = t)$, both for $t = 0$ and for $t = 1$. In such a case, even under a hypothetical uniform application of treatment, I could not reasonably assume exchangeability between the group selected to receive active treatment (and thus more likely to have long-lasting evening headaches) and the group selected for control (who are more likely to have short-lived morning headaches). Post-treatment exchangeability is absent, since I would no longer be comparing like with like. This in turn renders external validity impossible, since (even under uniform treatment) I could not now be exchangeable, simultaneously, both with those selected for treatment and with those selected for control, since these are not even exchangeable with each other. This means I can no longer use the data (at any rate, not in the simple way considered thus far) to fully populate, and thus solve, my decision problem.
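The morning/evening example can be made concrete with a small exact calculation. In the Python sketch below every number is a hypothetical assumption chosen for illustration: half of all headaches are evening headaches, aspirin is assigned to 80% of evening and 20% of morning headaches, and the mean durations in hours are given by `MEAN_Y`. The code computes $E(Y \mid T^* = s, \check{T} = t)$, where $T^*$ is the selection (intention to treat) variable and $\check{T}$ the applied treatment, and compares it with the marginal mean under $\check{T} = t$.

```python
from fractions import Fraction as F

TYPES = ("morning", "evening")
P_EVENING = F(1, 2)                                    # pr(evening headache), assumed
P_SELECT = {"evening": F(4, 5), "morning": F(1, 5)}    # pr(T* = 1 | type), assumed
MEAN_Y = {("morning", 0): F(2), ("morning", 1): F(1),  # E(Y | type, applied t), assumed
          ("evening", 0): F(4), ("evening", 1): F(3)}

def p_type(h):
    """Marginal probability of headache type h."""
    return P_EVENING if h == "evening" else 1 - P_EVENING

def p_type_given_selection(h, s):
    """pr(type = h | T* = s), by Bayes' rule."""
    def joint(k):
        p_sel = P_SELECT[k] if s == 1 else 1 - P_SELECT[k]
        return p_type(k) * p_sel
    return joint(h) / sum(joint(k) for k in TYPES)

def mean_given_selection(s, t):
    """E(Y | T* = s, applied treatment = t)."""
    return sum(p_type_given_selection(h, s) * MEAN_Y[(h, t)] for h in TYPES)

def mean_marginal(t):
    """E(Y | applied treatment = t), averaging over both headache types."""
    return sum(p_type(h) * MEAN_Y[(h, t)] for h in TYPES)
```

With these numbers the treatment-group mean `mean_given_selection(1, 1)` is 13/5 hours, whereas the target `mean_marginal(1)` is 2 hours. The naive contrast between the two observed groups, `mean_given_selection(1, 1) - mean_given_selection(0, 0)`, is positive even though aspirin shortens every type of headache by one hour in this assumed model.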

As explained in Section 6.2, assuming internal exchangeability and SUTDA, I can just consider the joint distribution, $Q_t$, for the bivariate generic variable $(T^*, Y)$, given $\check{T} = t$. Since we are treating the selection indicator $T^*$ as a covariate, its marginal distribution will not depend on which hypothetical treatment application is under consideration, and so will be the same under both $Q_1$ and $Q_0$. We can express this as the extended independence property

(11) $T^* \perp\!\!\!\perp \check{T}$,

which says that the (stochastic) selection variable $T^*$ is independent of the (non-stochastic) decision variable $\check{T}$. We denote this common distribution of $T^*$ in both regimes by $P$.

By the assumed external exchangeability of $Y$, the marginal distribution of $Y$ under $Q_t$ is my desired hypothetical response distribution, $P_t$. However, in the absence of actual uniform application of treatment $t$ to the data subjects (which in any case is not simultaneously possible for both values of $t$), I may not be able to estimate this marginal distribution. In the data, the treatment will have been applied in accordance with the selection process, so that $\check{T} = T^*$, and the only observations I will have under regime $\check{T} = 1$ (say) are those for which $T^* = 1$. From these I can estimate the conditional distribution of $Y$, given $T^* = 1$ under $Q_1$ – but this need not agree with the desired marginal distribution $P_1$ of $Y$ under $Q_1$.[19]

7.1 Ignorability

The above complication will be avoided when I judge that, both for $t = 1$ and for $t = 0$, if I intervene to apply treatment $\check{T} = t$ on an individual, the ensuing response $Y$ will not depend on the intended treatment $T^*$ for that individual, i.e. we have independence of $Y$ and $T^*$ under each $Q_t$. This can be expressed as the ECI property

(12) $Y \perp\!\!\!\perp T^* \mid \check{T}$.

When (12) can be assumed to hold, we term the assignment process ignorable. In that case, my desired distribution for $Y$, under hypothesised active treatment assignment $\check{T} = 1$, is the same as the conditional distribution of $Y$ given $T^* = 1$ under $\check{T} = 1$ – which is estimable as the distribution of $Y$ in the treatment group data. Likewise, my distribution for $Y$ under hypothesised control treatment is estimable from the data in the control group.

The ignorability condition (12) requires that the distribution of an individual’s response $Y$, under either applied treatment, will not be affected by knowledge of which treatment the individual had been fingered to receive – a property that would likely fail if, for example, treatment selection $T^*$ was related to the overall health of the patient. Note that ignorability is not testable from the available data, in which $\check{T} = T^*$. For we would need to test, in particular, that, for an individual taking actual treatment $\check{T} = 1$, the distribution of $Y$ given $T^* = 1$ is the same as that given $T^* = 0$. But for all such individuals in the data we never have $T^* = 0$, so cannot make the comparison. Hence, any assumption of ignorability can only be justified on the basis of non-empirical considerations. The most common, and most convincing, basis for such a justification is when I know that the treatment assignment process has been carried out by a randomising device, which can be assumed to be entirely unrelated to anything that could affect the responses; but I might be able to make non-empirical arguments for ignorability in some other contexts also. Indeed, it would be rash simply to assume ignorability without having a good argument to back it up.
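The non-testability of ignorability can be seen in a toy calculation. The Python sketch below assumes a binary response and writes `q[(s, t)]` for pr(Y = 1 | selected for s, actually given t); the two tables and the selection probability are hypothetical. Observational data, in which the applied treatment always equals the selected one, only ever reveal the “diagonal” entries `q[(s, s)]`, so an ignorable and a non-ignorable model that agree on the diagonal produce exactly the same data distribution.

```python
from fractions import Fraction as F

def observational_law(p_selected, q):
    """Joint law of (selection s, response y) when applied treatment = selection.

    p_selected = pr(selected for treatment); q[(s, t)] = pr(Y = 1 | selected s, given t).
    Only the diagonal entries q[(s, s)] enter the result.
    """
    law = {}
    for s, p_s in ((1, p_selected), (0, 1 - p_selected)):
        law[(s, 1)] = p_s * q[(s, s)]
        law[(s, 0)] = p_s * (1 - q[(s, s)])
    return law

def is_ignorable(q):
    """Ignorability: under each applied treatment t, the response law ignores s."""
    return all(q[(0, t)] == q[(1, t)] for t in (0, 1))

# Two hypothetical models sharing the same diagonal entries.
Q_IGNORABLE = {(0, 0): F(1, 2), (1, 0): F(1, 2), (0, 1): F(3, 4), (1, 1): F(3, 4)}
Q_CONFOUNDED = {(0, 0): F(1, 2), (1, 0): F(9, 10), (0, 1): F(2, 5), (1, 1): F(3, 4)}

same_data = observational_law(F(3, 10), Q_IGNORABLE) == observational_law(F(3, 10), Q_CONFOUNDED)
```

In this sketch `same_data` is true although only `Q_IGNORABLE` satisfies `is_ignorable`, so no analysis of the observational data alone could tell the two models apart.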

8 The idle regime

As a useful extension of the above analysis, we expand the range of the regime indicator $\check{T}$ to encompass a further value, which we term “idle,” and denote by $\emptyset$ – this indicates the observational regime, where treatments are applied according to plan. (This is relevant only for the data individuals, in $\mathcal{D}$: I myself care only about the two interventions I am considering). We denote this three-valued regime indicator by $F_T$.

Now $T^*$ is determined prior to any (actual or hypothetical) treatment application, and behaves as a covariate. It is thus reasonable to assume that, under the observational regime $F_T = \emptyset$, $T^*$ retains its fixed covariate distribution $P$. And since this distribution is then the same in all three regimes, we thus have

(13) $T^* \perp\!\!\!\perp F_T$.

This extends (11) to include also the idle regime. We henceforth assume (13) holds.

We now introduce a new stochastic domain variable $T$, representing the treatment actually applied when following the relevant regime. This is fully determined by the pair $(F_T, T^*)$ as follows:

Definition 1

(Applied Treatment, T )

  1. (1)

    If $F_T = 0$ or 1, then $T = F_T$.

  2. (2)

    If $F_T = \emptyset$, then $T = T^*$.

In particular, $T \sim P$ under $F_T = \emptyset$, while $T$ has a degenerate distribution at $t$ under $F_T = t$ ($t = 0$ or 1).
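Definition 1 is a deterministic rule, so it can be written down directly. The following Python sketch is one way to do so; the string label `IDLE` for the idle regime and the function names are assumptions of the example, not notation from the text.

```python
from fractions import Fraction

IDLE = "idle"  # stands for the idle (observational) regime

def applied_treatment(regime, intended):
    """Definition 1: T = F_T under an intervention, T = T* under the idle regime."""
    if regime in (0, 1):
        return regime
    if regime == IDLE:
        return intended
    raise ValueError(f"unknown regime: {regime!r}")

def applied_distribution(regime, p_selected):
    """Distribution of the applied treatment T, where p_selected = pr(T* = 1)."""
    dist = {0: Fraction(0), 1: Fraction(0)}
    for intended, prob in ((1, p_selected), (0, 1 - p_selected)):
        dist[applied_treatment(regime, intended)] += prob
    return dist
```

For instance, `applied_distribution(IDLE, Fraction(3, 10))` reproduces the selection distribution, while `applied_distribution(1, Fraction(3, 10))` puts all its mass on 1, matching the degenerate distribution described above.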

In each of the three regimes we can observe both $T$ and $Y$. In the observational regime ($F_T = \emptyset$) we can also recover $T^*$, since $T^* = T$. However, $T^*$ is typically unobservable in the interventional regimes, and may not even be defined for myself, the case of interest.

To complete the distributional specification of the idle regime we argue as follows. Under $F_T = \emptyset$, the information conveyed by learning $T = t$ is twofold, conveying both that the individual was initially fingered to receive treatment $t$, i.e. $T^* = t$, and that treatment $t$ was indeed applied. Hence for any domain variable $V$, the conditional distribution of $V$ given $T = t$ (equivalently, given $T^* = t$), under $F_T = \emptyset$, should be the same as that of $V$ given $T^* = t$, under the (real or hypothetical) applied treatment $F_T = t$. We express this property formally as:

Definition 2

(Distributional consistency) For any domain variable, or set of domain variables, V ,[20]

(14) $V \mid (T^* = t, F_T = \emptyset) \; [\, = V \mid (T = t, F_T = \emptyset) \,] \; \approx \; V \mid (T^* = t, F_T = t) \quad (t = 0, 1)$,

where $\approx$ denotes “has the same distribution as.”

Distributional consistency is the fundamental property linking the observational and interventional regimes. It is our, weaker, version of the (functional) consistency property usually invoked in the potential outcome approach to causality – see Section 11.1. In the sequel we shall take (14) for granted.

Lemma 1

Under distributional consistency, for any domain variable V

(15) $V \perp\!\!\!\perp F_T \mid (T, T^*)$.

Proof

We have to show that, for $t, t^* \in \{0, 1\}$, it is possible to define a conditional distribution for $V$, given $T = t$, $T^* = t^*$, that applies in all three regimes.

Let $\Pi_{t,t^*}$ denote the conditional distribution of $V$ given $T^* = t^*$ in the interventional regime $F_T = t$. This is well-defined in the usual case that the event $T^* = t^*$ has positive probability – if not, we make an arbitrary choice for this distribution.

Consider first the case $t = 1$.

  1. (1)

    Since $T$ is non-random with value 1 in regime $F_T = 1$, $\Pi_{1,t^*}$ is also, trivially, the distribution of $V$ given $T = 1$, $T^* = t^*$ in regime $F_T = 1$.

  2. (2)

    Under regime $F_T = 0$, the event $T = 1$, $T^* = t^*$ has probability 0, so we are free to define the distribution of $V$ conditional on this event arbitrarily; in particular, we can take it to be $\Pi_{1,t^*}$.

  3. (3)

    Under regime $F_T = \emptyset$, the event $T = 1$, $T^* = 0$ has probability 0, so we are free to define the distribution of $V$ conditional on this event as $\Pi_{1,0}$.

  4. (4)

    It remains to show that the distribution of $V$ given $T = T^* = 1$ in regime $F_T = \emptyset$ is $\Pi_{1,1}$. Since, under $F_T = \emptyset$, $T = T^*$, we need only condition on $T^* = 1$. The result now follows from distributional consistency (14).

Since a parallel argument holds for the case $t = 0$, we have shown that $\Pi_{t,t^*}$ serves as the conditional distribution for $V$ given $(T = t, T^* = t^*)$ in all three regimes, and (15) is thus proved.□

8.1 Graphical representation

The properties (13) and (15) are represented graphically (using $d$-separation) by the absence of arrows from $F_T$ to $T^*$ and to $Y$, respectively, in the ITT (intention to treat) DAG of Figure 5, where again, a round node represents a stochastic variable, and a square node a non-stochastic regime indicator. In addition, we have included further optional annotations:

  • The outline of $T^*$ is dotted to indicate that $T^*$ is not directly observed.

  • The heavy outline of $T$ indicates that the value of $T$ is functionally determined by those of its parents $F_T$ and $T^*$.

  • The dashed arrow from $T^*$ to $T$ indicates that this arrow can be removed (there is then no dependence of $T$ on $T^*$) under either of the interventional settings $F_T = 0$ or 1.

Figure 5

DAG representing $T^* \perp\!\!\!\perp F_T$ and $Y \perp\!\!\!\perp F_T \mid (T, T^*)$.

Remark 1

Note that, on further taking into account the functional relationship of Definition 1, Figure 5 already incorporates the distributional consistency property of Definition 2, for $V \equiv Y$. For we have

(16) $Y \mid (T^* = t, F_T = \emptyset) = Y \mid (T = t, T^* = t, F_T = \emptyset)$

(17) $\approx Y \mid (T = t, T^* = t, F_T = t)$

(18) $= Y \mid (T^* = t, F_T = t)$.

Here (16) follows from (2) of Definition 1; (17) from Lemma 1 with $V \equiv Y$, i.e., $Y \perp\!\!\!\perp F_T \mid (T, T^*)$, which is represented in Figure 5; and (18) from (1) of Definition 1.

Now the ITT variable $T^*$, while crucial to understanding the relationship between the different regimes, is not itself directly observable. If we confine attention to relationships between $F_T$, $T$ and $Y$, we find no non-trivial ECI properties. So without further assumptions there is no useful structure of which to avail ourselves.

8.2 Ignorability

Suppose now we impose the additional ignorability property (12). Noting that $\check{T} = t$ is identical with $F_T = t$, this is equivalent to

(19) $Y \perp\!\!\!\perp T^* \mid F_T = t \quad (t = 0, 1)$.

Equivalently, since $T$ is non-random in an interventional regime,

$Y \perp\!\!\!\perp T^* \mid (T, F_T = t) \quad (t = 0, 1)$.

Moreover, since in the idle regime, $T^*$ is identical with $T$, so non-random when $T$ is given, we trivially have

$Y \perp\!\!\!\perp T^* \mid (T, F_T = \emptyset)$.

We thus see that ignorability can be expressed as:

(20) $Y \perp\!\!\!\perp T^* \mid (T, F_T)$.

Lemma 2

If ignorability holds, then

(21) $Y \perp\!\!\!\perp F_T \mid T$.

Proof

We first dispose of the trivial case that $T^*$ has a one-point distribution. In that case, the conditioning on $T^*$ in (15) is redundant and we immediately obtain (21).

Otherwise, $0 < \mathrm{pr}(T^* = 1) < 1$. We then have

(22) $Y \mid (T = 1, F_T = \emptyset) \approx Y \mid (T^* = 1, F_T = 1)$

(23) $\approx Y \mid F_T = 1$

(24) $\approx Y \mid (T = 1, F_T = 1)$.

Note that all conditioning events have positive probability in their respective regimes. Here (22) holds by distributional consistency (14), (23) by ignorability (19), and (24) because, under $F_T = 1$, $T = 1$ with probability 1. So we have a common well-defined distribution, $\Delta_1$ say, for $Y$ given $T = 1$ in both regimes $F_T = \emptyset$ and $F_T = 1$. Furthermore, since under $F_T = 0$ the event $T = 1$ has probability 0, we are free to define the conditional distribution of $Y$ given $T = 1$ in regime $F_T = 0$ as $\Delta_1$ also, so making $\Delta_1$ the common distribution of $Y$ given $T = 1$ in all three regimes, showing that $Y \perp\!\!\!\perp F_T \mid T = 1$. Since a similar argument holds for conditioning on $T = 0$ the result follows.□
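Lemma 2 can be sanity-checked on a small discrete model. The Python sketch below assumes a binary response and takes `q[(s, t)]` = pr(Y = 1 | selection s, applied treatment t) to be the same in all three regimes, which is property (15); the numerical tables, the label `IDLE` and the function names are hypothetical. For each applied treatment it computes pr(Y = 1 | T = t) in every regime where the event T = t has positive probability and checks whether the values agree, as (21) requires.

```python
from fractions import Fraction as F

IDLE = "idle"  # label for the idle (observational) regime

def response_given_applied(regime, t, p_selected, q):
    """pr(Y = 1 | T = t) in `regime`, or None if T = t has probability 0 there."""
    num = den = F(0)
    for s, p_s in ((1, p_selected), (0, 1 - p_selected)):
        applied = s if regime == IDLE else regime   # Definition 1
        if applied == t:
            num += p_s * q[(s, t)]
            den += p_s
    return num / den if den > 0 else None

def satisfies_21(p_selected, q):
    """Check that the law of Y given T = t is the same wherever it is defined."""
    for t in (0, 1):
        values = {response_given_applied(r, t, p_selected, q) for r in (IDLE, 0, 1)}
        values.discard(None)
        if len(values) > 1:
            return False
    return True

# q does not depend on the selection s: ignorability holds.
Q_IGNORABLE = {(0, 0): F(1, 2), (1, 0): F(1, 2), (0, 1): F(3, 4), (1, 1): F(3, 4)}
# q depends on s: ignorability fails.
Q_CONFOUNDED = {(0, 0): F(2, 5), (1, 0): F(3, 5), (0, 1): F(7, 10), (1, 1): F(9, 10)}
```

With selection probability 3/10, `satisfies_21` holds for `Q_IGNORABLE` and fails for `Q_CONFOUNDED`: in the latter, pr(Y = 1 | T = 1) is 9/10 in the idle regime but 19/25 under the intervention $F_T = 1$.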

Remark 2

An apparently simpler alternative proof of Lemma 2 is as follows. By Lemma 1, the conditional distribution of $Y$, given $(F_T, T^*, T)$, does not depend on $F_T$, while by (20) this conditional distribution does not depend on $T^*$. So (it appears), it must follow that it depends only on $T$, whence $Y \perp\!\!\!\perp (F_T, T^*) \mid T$, implying the desired result. This is a special case of a more general argument: that $X \perp\!\!\!\perp Y \mid (Z, W)$ and $X \perp\!\!\!\perp Z \mid (Y, W)$ together imply $X \perp\!\!\!\perp (Y, Z) \mid W$. However, this argument is invalid in general [63]. To justify it in this case we have needed, in our proof of Lemma 2, to call on structural properties (in particular, distributional consistency, and the way in which $T$ is determined by $F_T$ and $T^*$) in addition to conditional independence properties.
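The failure of the general argument can be seen in a minimal example, sketched here in Python with a distribution of our own choosing (it is not taken from [63]): let $W$ be trivial and let $X = Y = Z$ be a single fair coin. The helper name `cond_indep` is hypothetical; it tests conditional independence exactly by comparing $p(a, b, c)\,p(c)$ with $p(a, c)\,p(b, c)$ over all value combinations.

```python
from fractions import Fraction
from itertools import product


def cond_indep(joint, a, b, c=()):
    """Exact test of A independent of B given C for a discrete joint pmf.

    joint maps outcome tuples to probabilities; a, b, c are tuples of
    coordinate positions. An empty c gives marginal independence.
    """
    def marg(idx):
        out = {}
        for w, p in joint.items():
            key = tuple(w[i] for i in idx)
            out[key] = out.get(key, 0) + p
        return out

    pa, pb, pc = marg(a), marg(b), marg(c)
    pabc, pac, pbc = marg(a + b + c), marg(a + c), marg(b + c)
    return all(
        pabc.get(va + vb + vc, 0) * pc[vc]
        == pac.get(va + vc, 0) * pbc.get(vb + vc, 0)
        for va, vb, vc in product(pa, pb, pc)
    )


# Coordinates: 0 = X, 1 = Y, 2 = Z, with X = Y = Z a fair coin.
HALF = Fraction(1, 2)
COIN = {(0, 0, 0): HALF, (1, 1, 1): HALF}

print(cond_indep(COIN, (0,), (1,), (2,)))   # X indep. of Y given Z
print(cond_indep(COIN, (0,), (2,), (1,)))   # X indep. of Z given Y
print(cond_indep(COIN, (0,), (1, 2)))       # X indep. of (Y, Z)
```

Both premises hold, because conditioning on either copy of the coin makes $X$ degenerate, yet the conclusion fails since $X$ is completely determined by $(Y, Z)$.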

Corollary 1

Ignorability holds if and only if

(25) $Y \perp\!\!\!\perp (T^*, F_T) \mid T.$

Proof

  1. If:

    Further conditioning (25) on $F_T$ yields (20).

  2. Only if:

    By Lemma 2, ignorability (20) implies (21), and property (25) is equivalent to the conjunction of (20) and (21).□

8.2.1 Graphical representation

The DAG representing (13) and (25) is shown in Figure 6. Compared with Figure 5, we see that the arrow from $T^*$ to $Y$ has been removed.

Figure 6

Modification of Figure 5 representing ignorability.

Remark 3

We might try to make the deletion of the arrow from $T^*$ to $Y$ in Figure 5 into a graphically based argument for Lemma 2, for it appears to impose just the additional conditional independence property (20) representing ignorability, and to imply the desired result (21). However, this is again a misleading argument: inference from such surgery on a DAG can only be justified when it has a basis in the algebraic theory of conditional independence [21,23], which here it does not, on account of the fallacious argument identified in Remark 2.

Figure 7 results on “eliminating $T^*$” from Figure 6: that is to say, the conditional independencies represented in Figure 7 are exactly those of Figure 6 that do not involve $T^*$. In this case, the only such property is (21).

Figure 7

Collapsed DAG under ignorability, representing $Y \perp\!\!\!\perp F_T \mid T$.

The ECI property (21), and the DAG of Figure 7, are the basic (respectively, algebraic and graphical) representations of “no confounding” in the DT approach, which has been treated as a primitive in earlier work. The above analysis supplies deeper understanding of these representations. Although on getting to this point we have been able to eliminate explicit consideration of the treatment selection variable $T^*$, our more detailed analysis, which takes it into account, makes clear just what needs to be argued in order to justify (21): namely, the property of ignorability expressed algebraically by (19) or (20) and graphically by Figure 6, and further described in Section 7.1.

9 Covariates

The ignorability assumption (12) will often be untenable. If, for example, those fingered for treatment (so with $T^* = 1$) are sicker than those fingered for control ($T^* = 0$) – as might well be the case in a non-randomised study – then (under either treatment application $\check{T} = t$, $t = 0, 1$) we would expect a worse outcome $Y$ when knowing $T^* = 1$ than when knowing $T^* = 0$. However, we might be able to reinstate (12) after further conditioning on a suitable variable $X$ measuring how sick an individual is. That is, we might be able to make a case that, after restricting attention to those individuals having a specified degree $X = x$ of sickness, the further information that an individual had been fingered for treatment would make no difference to the assessment of the individual’s response (under either treatment application). This would of course require that, after taking sickness into account, the treatment assignment process was not further related to other possible indicators of outcome (e.g., sex, age, …). If it is, these would need to be included as components of the (typically multivariate) variable $X$. We assume that the appropriate variable $X$ is (in principle at least) fully measurable, both for the individuals in the study and (unlike $T^*$) for myself. We assume internal exchangeability of $(X, T^*, Y)$, extending this to external exchangeability for $(X, Y)$.[21]

If and when such a variable X can be identified, we will be able to justify an assumption of conditional ignorability:

(26) $Y \perp\!\!\!\perp T^* \mid (X, \check{T}).$

Furthermore, to be of any use in addressing my own decision problem, such a variable must be a covariate, available prior to treatment application, and so, in particular, must (jointly with $T^*$, at least for the study individuals, for whom $T^*$ is defined) have the same distribution under either hypothetical treatment application. This is expressed as

(27) $(X, T^*) \perp\!\!\!\perp \check{T}.$

In particular, there will be a common marginal distribution, $P_X$ say, for $X$, in both interventional regimes.

When both (26) and (27) are satisfied, we call $X$ a sufficient covariate. These properties are represented by the DAG of Figure 8.

Figure 8

DAG representing sufficient covariate $X$: $(X, T^*) \perp\!\!\!\perp \check{T}$ and $Y \perp\!\!\!\perp T^* \mid (X, \check{T})$.

9.1 Idle regime

As in Section 8, we introduce the regime indicator $F_T$, allowing for consideration of the “idle” observational regime $F_T = \emptyset$, in addition to the interventional regimes $F_T = t$ ($t = 0, 1$); and the constructed “applied treatment” variable $T$ of Definition 1. Arguing as for (20), (26) implies

(28) $Y \perp\!\!\!\perp T^* \mid (X, T, F_T).$

Lemma 3

Let $X$ be a sufficient covariate. Then

(29) $(X, T^*) \perp\!\!\!\perp F_T$

(30) $Y \perp\!\!\!\perp (T^*, F_T) \mid (X, T).$

Proof

By distributional consistency (14),

$X \mid (T^* = 1, F_T = \emptyset) \;\approx\; X \mid (T^* = 1, F_T = 1) \;\approx\; X \mid (T^* = 1, F_T = 0)$

by (27). Hence, $X \perp\!\!\!\perp F_T \mid T^* = 1$. A parallel argument shows $X \perp\!\!\!\perp F_T \mid T^* = 0$, so that $X \perp\!\!\!\perp F_T \mid T^*$. On combining this with (13) we obtain (29).

As for (30), this is equivalent to the conjunction of (28) and $Y \perp\!\!\!\perp F_T \mid (T^*, X)$. The argument for the latter (again, requiring distributional consistency) parallels that for (21), after further conditioning on $X$ throughout.□

The properties (29) and (30) are embodied in the DAG of Figure 9. This implies, on eliminating the unobserved variable $T^*$:

(31) $X \perp\!\!\!\perp F_T$

(32) $Y \perp\!\!\!\perp F_T \mid (X, T),$

as represented by Figure 10.

Figure 9

Full DAG with sufficient covariate $X$ and regime indicator.

Figure 10

Reduced DAG with sufficient covariate $X$ and regime indicator.

Properties (31) and (32), as embodied in Figure 10, are the basic DT representations of a sufficient covariate. Assuming $X$, $T$, and $Y$ are all observed, this is what is commonly referred to as “no unmeasured confounding.”
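A standard consequence of (31) and (32), given enough overlap that the conditional probabilities are defined, is that the interventional distribution of $Y$ can be recovered from the idle regime by averaging over $X$: $\mathrm{pr}(Y = y \mid F_T = t) = \sum_x \mathrm{pr}(X = x \mid F_T = \emptyset)\, \mathrm{pr}(Y = y \mid X = x, T = t, F_T = \emptyset)$. The Python sketch below illustrates this on a hypothetical binary model in which sicker individuals ($X = 1$) are more often selected for treatment. All names and numbers are assumptions made for illustration only.

```python
from fractions import Fraction as Fr

# Hypothetical model ingredients (X = 1 means "sicker").
P_X = {0: Fr(1, 2), 1: Fr(1, 2)}              # pr(X = x)
P_ITT = {0: Fr(1, 5), 1: Fr(4, 5)}            # pr(T* = 1 | X = x)
P_Y = {(0, 0): Fr(4, 5), (0, 1): Fr(9, 10),   # pr(Y = 1 | X = x, T = t)
       (1, 0): Fr(1, 5), (1, 1): Fr(1, 2)}


def observational_joint():
    """Joint pmf of (X, T, Y) in the idle regime, where T = T*."""
    joint = {}
    for x, px in P_X.items():
        for t in (0, 1):
            pt = P_ITT[x] if t == 1 else 1 - P_ITT[x]
            for y in (0, 1):
                py = P_Y[(x, t)] if y == 1 else 1 - P_Y[(x, t)]
                joint[(x, t, y)] = px * pt * py
    return joint


def pr(joint, event):
    """Probability of the set of outcomes (x, t, y) satisfying event."""
    return sum(p for w, p in joint.items() if event(*w))


def naive(joint, t):
    """pr(Y = 1 | T = t) in the idle regime, ignoring X."""
    return (pr(joint, lambda x, s, y: s == t and y == 1)
            / pr(joint, lambda x, s, y: s == t))


def adjusted(joint, t):
    """Sum over x of pr(X = x) pr(Y = 1 | X = x, T = t), idle regime only."""
    total = Fr(0)
    for x0 in P_X:
        p_x = pr(joint, lambda x, s, y: x == x0)
        p_xt = pr(joint, lambda x, s, y: x == x0 and s == t)
        p_xty = pr(joint, lambda x, s, y: x == x0 and s == t and y == 1)
        total += p_x * p_xty / p_xt
    return total


def interventional(t):
    """pr(Y = 1 | F_T = t), computed directly from the model."""
    return sum(P_X[x] * P_Y[(x, t)] for x in P_X)


JOINT = observational_joint()
for t in (0, 1):
    print(t, naive(JOINT, t), adjusted(JOINT, t), interventional(t))
```

The adjusted quantity uses only the observational joint distribution, yet matches the interventional probability, whereas the unadjusted conditional probability does not.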

10 More complex DAG models

10.1 An example

Consider the following story. In an observational setting, variable $X_0$ represents the initial treatment received by a patient; this is supposed to be applied independently of an (unobserved) characteristic $H$ of the patient. The variable $Z$ is an observed response depending, probabilistically, on both the applied treatment $X_0$ and the patient characteristic $H$. A subsequent treatment, $X_1$, can depend probabilistically on both $Z$ and $H$, but not further on $X_0$. Finally, the distribution of the response $Y$, given all other variables, depends only on $X_1$ and $Z$. Figure 11 is a DAG representing this story by means of $d$-separation.

Figure 11

Observational DAG.

In addition to the observational regime, we want to consider possible interventions to set values for $X_0$ and $X_1$. We thus have two non-stochastic regime indicators, $F_0$ and $F_1$: $F_i = x_i$ indicates that $X_i$ is externally set to $x_i$, while $F_i = \emptyset$ allows $X_i$ to develop “naturally.” The overall regime is thus determined by the pair $(F_0, F_1)$.

Figure 12 augments Figure 11, in a seemingly natural way, to include these regime indicators. It represents, by $d$-separation, ways in which the domain variables are supposed to respond to interventions. For example, it implies $Y \perp\!\!\!\perp (X_0, H, F_0, F_1) \mid (Z, X_1)$: once we know $Z$ and $X_1$, not only are $X_0$ and $H$ irrelevant for probabilistic prediction of $Y$ but so too is the information as to whether either or both of $X_0$, $X_1$ arose naturally, or were set by intervention. In particular, the conditional distribution of $Y$ given $(Z, X_1)$, under intervention at $X_1$, is supposed to be the same as in the observational regime modelled by Figure 11.
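The $d$-separation claim just quoted can be checked mechanically. The Python sketch below uses the standard moralisation criterion (restrict to the ancestral set, marry parents, drop directions, then test ordinary graph separation). The edge list `FIG12` is our reading of the story of Section 10.1 together with the two regime nodes, and the function name `d_separated` is hypothetical.

```python
def d_separated(edges, a, b, c):
    """True if node sets a and b are d-separated by c in the DAG `edges`.

    edges is a list of (parent, child) pairs; a, b, c are sets of nodes.
    """
    parents = {}
    for u, v in edges:
        parents.setdefault(u, set())
        parents.setdefault(v, set()).add(u)

    # Smallest ancestral set containing a, b and c.
    anc, stack = set(), list(a | b | c)
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents[v])

    # Moral graph on the ancestral set: link child-parent and co-parents.
    nbrs = {v: set() for v in anc}
    for v in anc:
        ps = list(parents[v])
        for i, p in enumerate(ps):
            nbrs[v].add(p)
            nbrs[p].add(v)
            for q in ps[i + 1:]:
                nbrs[p].add(q)
                nbrs[q].add(p)

    # Undirected search from a that never enters c.
    seen, stack = set(), [v for v in a if v not in c]
    while stack:
        v = stack.pop()
        if v in b:
            return False
        if v in seen:
            continue
        seen.add(v)
        stack.extend(n for n in nbrs[v] if n not in c and n not in seen)
    return True


# Assumed arrows of the augmented DAG (Figure 12), read off from the text.
FIG12 = [("F0", "X0"), ("X0", "Z"), ("H", "Z"), ("Z", "X1"),
         ("H", "X1"), ("F1", "X1"), ("X1", "Y"), ("Z", "Y")]

print(d_separated(FIG12, {"Y"}, {"X0", "H", "F0", "F1"}, {"Z", "X1"}))
print(d_separated(FIG12, {"Y"}, {"H"}, {"X1"}))
```

The first query reproduces the property stated in the text. The second shows that conditioning on $X_1$ alone is not enough, since $H$ still reaches $Y$ through $Z$.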

10.1.1 From observational to augmented DAG

It does not follow, merely from the fact that we can model the observational conditional independencies between the domain variables by Figure 11, that their behaviour under the entirely different circumstance of intervention must be as modelled by Figure 12. Strong additional assumptions are required to bridge this logical gap. These we now elaborate.

We again introduce ITT variables, $X_0^*$ and $X_1^*$,[22] the realised $X_0$ and $X_1$, in any regime, being given by

(33) $X_i = \begin{cases} X_i^* & \text{if } F_i = \emptyset \\ F_i & \text{if } F_i \neq \emptyset. \end{cases}$

Since, in the observational regime, $X_i = X_i^*$, Figure 11 would still be observationally valid on replacing each $X_i$ by $X_i^*$.

The different regimes are supposed linked together by the following assumptions, which we first present and then motivate:

(34) $X_0^* \perp\!\!\!\perp (F_0, F_1)$

(35) $(H, Z, X_1^*, Y) \perp\!\!\!\perp (F_0, X_0^*) \mid (F_1, X_0)$

(36) $(X_0^*, H, Z, X_1^*) \perp\!\!\!\perp F_1 \mid F_0$

(37) $Y \perp\!\!\!\perp (F_1, X_1^*) \mid (F_0, X_0, H, Z, X_1).$

Note that, since $X_i$ is determined by $(F_i, X_i^*)$, (35) and (36) are equivalent to:

(38) $(H, Z, X_1^*, X_1, Y) \perp\!\!\!\perp (F_0, X_0^*) \mid (F_1, X_0)$

(39) $(X_0^*, X_0, H, Z, X_1^*) \perp\!\!\!\perp F_1 \mid F_0.$

Comments on the assumptions. In order to understand the above assumptions, we should consider Figure 11 as describing, not only the conditional independencies between variables, but also a partial order in which the variables are generated: it is supposed that, in any regime, the value of a parent variable is determined before that of its child. In particular, it is assumed that an intervention on a variable cannot affect that variable’s non-descendants – including their ITT variables and its own; but may affect its descendants – including their associated ITT variables.

  (i) Similar to (13), (34) expresses the property that an ITT variable, here $X_0^*$, should behave as a covariate for $X_0$, and so be independent of which regime, here $F_0$, is operating on $X_0$. Moreover, $X_0^*$ should not be affected by a subsequent intervention (or none), $F_1$, at $X_1$.

  (ii) Assumption (35) is a version of the ignorability property (25). It says that an intervention on $X_0$ should be ignorable in its effect on all other variables. Moreover, this should apply conditional on $F_1$, i.e., whether or not there is an intervention at $X_1$.

Remark 4

As previously discussed, ignorability is a strong assumption, requiring strong justification. Also note that, as shown by Corollary 1, (35) is implicitly assuming the distributional consistency property (Definition 2), in addition to ignorability.

  (iii) Assumption (36) expresses the requirement that $(X_0^*, H, Z, X_1^*)$, being generated prior to $X_1$, should not be affected by intervention $F_1$ at $X_1$. (However, they might depend on which regime, $F_0$, operates on $X_0$.)

  (iv) Similar to (ii), (37) says that, conditional on all the domain variables, $(X_0, H, Z)$, generated prior to $X_1$, the effect of intervention $F_1$ at $X_1$ is ignorable for its effect on $Y$; moreover, this should hold whether or not there is intervention $F_0$ at $X_0$. Informally, taken together with (39), this requires that $(X_0, H, Z)$ form a sufficient covariate for the effect of $X_1$ on $Y$.

In the following, we make extensive (but largely implicit) use of the axiomatic properties of (extended) conditional independence [21,64]:

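  P1 (Symmetry): $X \perp\!\!\!\perp Y \mid Z \;\Rightarrow\; Y \perp\!\!\!\perp X \mid Z$.

  P2: $X \perp\!\!\!\perp Y \mid Y$.

  P3 (Decomposition): $X \perp\!\!\!\perp Y \mid Z$ and $W$ a function of $Y$ $\;\Rightarrow\; X \perp\!\!\!\perp W \mid Z$.

  P4 (Weak union): $X \perp\!\!\!\perp Y \mid Z$ and $W$ a function of $Y$ $\;\Rightarrow\; X \perp\!\!\!\perp Y \mid (W, Z)$.

  P5 (Contraction): $X \perp\!\!\!\perp Y \mid Z$ and $X \perp\!\!\!\perp W \mid (Y, Z)$ $\;\Rightarrow\; X \perp\!\!\!\perp (Y, W) \mid Z$.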

Lemma 4

Suppose that the observational conditional independencies are represented by Figure 11, and that Assumptions (34)–(37) apply. Then the extended conditional independencies between domain variables, ITT variables, and regime indicators are represented by Figure 13.

Figure 12

Augmented DAG.

Figure 13

ITT DAG.

Remark 5

A further property apparently represented in Figure 13 is the independence of $F_0$ and $F_1$:

(40) $F_0 \perp\!\!\!\perp F_1.$

Now so far we have been able to meaningfully interpret an ECI assertion only when the left-hand term involves stochastic variables only – which seems to render (40) meaningless. Nevertheless, as a purely instrumental device, it is helpful to extend our understanding by considering the regime indicators as random variables also.[23] So long as all our assumptions and conclusions are in the form described in footnote 3, any proof that uses this extended understanding only internally will remain valid for the actual case of non-stochastic regime variables, as may be seen by conditioning on these.[24]

In the light of Remark 5, we shall in the sequel treat $F_0$ and $F_1$ as stochastic variables, having the independence property (40).

Proof of Lemma 4

It is straightforward to check that (34)–(37) are all represented by $d$-separation in Figure 13. We have to show that all the $d$-separation properties of Figure 13 are implied by these (together with the definitional relationship (33), and the purely instrumental assumption (40)).

Taking the variables in the order $F_0, F_1, X_0^*, X_0, H, Z, X_1^*, X_1, Y$, we thus need to show the following series of properties, where each asserts the independence of a variable from its predecessors, conditional on its parents in the graph.

(41) $F_1 \perp\!\!\!\perp F_0$

(42) $X_0^* \perp\!\!\!\perp (F_0, F_1)$

(43) $X_0 \perp\!\!\!\perp F_1 \mid (X_0^*, F_0)$

(44) $H \perp\!\!\!\perp (F_0, F_1, X_0^*, X_0)$

(45) $Z \perp\!\!\!\perp (F_0, F_1, X_0^*) \mid (X_0, H)$

(46) $X_1^* \perp\!\!\!\perp (F_0, F_1, X_0^*, X_0) \mid (H, Z)$

(47) $X_1 \perp\!\!\!\perp (F_0, X_0^*, X_0, H, Z) \mid (X_1^*, F_1)$

(48) $Y \perp\!\!\!\perp (F_0, F_1, X_0^*, X_0, H, X_1^*) \mid (Z, X_1).$

On excluding (41), these conclusions will comprise the desired result.

  1. For (41):

    By assumption (40).

  2. For (42):

    By (34).

  3. For (43):

    Follows trivially since $X_0$, being functionally determined by $(X_0^*, F_0)$, has a conditional one-point distribution, and so is independent of anything else.

  4. For (44)–(46):

    From (38) we have

    (49) $(H, Z, X_1^*) \perp\!\!\!\perp F_0 \mid (F_1, X_0)$

    while from (39) we have

    (50) $(H, Z, X_1^*) \perp\!\!\!\perp F_1 \mid (F_0, X_0).$

    We now wish to show that (49) and (50) imply

    (51) $(H, Z, X_1^*) \perp\!\!\!\perp (F_0, F_1) \mid X_0.$

    This requires some caution, on account of Remark 2. To proceed we use the fictitious independence property (40).

    From (39) we have $X_0 \perp\!\!\!\perp F_1 \mid F_0$, which together with (40) yields $F_1 \perp\!\!\!\perp (F_0, X_0)$, so that

    (52) $F_1 \perp\!\!\!\perp F_0 \mid X_0.$

    Combining (49) and (52) yields $(F_1, H, Z, X_1^*) \perp\!\!\!\perp F_0 \mid X_0$, whence

    (53) $(H, Z, X_1^*) \perp\!\!\!\perp F_0 \mid X_0.$

    Finally, combining (53) and (50) yields (51).

    Now (51) asserts that the conditional distribution of $(H, Z, X_1^*)$ given $X_0$ is the same in all regimes. In particular (noting that $X_1^* = X_1$ in the observational regime), that conditional distribution inherits the independencies of Figure 11. Properties (44)–(46) follow (on noting that $X_0$, being a function of $F_0$ and $X_0^*$, is redundant in (44) and (46)).

  5. For (47):

    Trivial since $X_1$ is functionally determined by $(F_1, X_1^*)$.

  6. For (48):

    From (38) we derive both

    (54) $Y \perp\!\!\!\perp F_0 \mid (F_1, X_0, H, Z, X_1)$

    (55) $Y \perp\!\!\!\perp X_0^* \mid (F_0, F_1, X_0, H, Z, X_1^*, X_1),$

    while from (37) we have

    (56) $Y \perp\!\!\!\perp F_1 \mid (F_0, X_0, H, Z, X_1),$

    (57) $Y \perp\!\!\!\perp X_1^* \mid (F_0, F_1, X_0, H, Z, X_1).$

    We first want to show that (54) and (56) are together equivalent to

    (58) $Y \perp\!\!\!\perp (F_0, F_1) \mid (X_0, H, Z, X_1).$

    To work towards this, we note that, by (38), $(H, Z, X_1) \perp\!\!\!\perp F_0 \mid (F_1, X_0)$, which together with (52) gives $(F_1, H, Z, X_1) \perp\!\!\!\perp F_0 \mid X_0$, whence

    (59) $F_0 \perp\!\!\!\perp F_1 \mid (X_0, H, Z, X_1).$

    Then (58) follows from (54), (56), and (59) in parallel to the argument above from (49), (50), and (52) to (51).

    Now in the observational regime, $Y \perp\!\!\!\perp (X_0, H) \mid (Z, X_1)$. By (58), this must hold in all regimes. This gives

    (60) $Y \perp\!\!\!\perp (F_0, F_1, X_0, H) \mid (Z, X_1).$

    Properties (57) and (60) are together equivalent to

    (61) $Y \perp\!\!\!\perp (F_0, F_1, X_0, X_1^*, H) \mid (Z, X_1).$

    Combining (61) with (55) now yields (48).□

Augmented DAG. Finally, having derived Figure 13 from Assumptions (34)–(37), we can eliminate $X_0^*$ and $X_1^*$ from it. The relationships between the domain and regime variables are then represented by the augmented DAG of Figure 12, which can now be used to express and manipulate causal properties of the system, without further explicit consideration of the ITT variables – such consideration only having been required to make the argument to justify this use.

10.2 General DAG

The case of a general DAG follows by extension of the arguments of Section 10.1. Consider a set of domain variables, with observational independencies represented by a DAG $\mathcal{D}$. We consider the variables in some total ordering consistent with the partial order of the DAG.

Some of the variables, say (in order) $(X_i : i = 1, \ldots, k)$, will be potential targets for intervention, with associated ITT variables ($X_i^*$) and intervention indicator variables ($F_i$). Let $V_i$ denote the set of all the domain variables coming between $X_{i-1}$ and $X_i$ in the order. We thus have an ordered list $L = (V_1, X_1, \ldots, V_k, X_k, V_{k+1})$ of domain variables, some of which are possible targets for intervention.

Let $\mathrm{pre}_i$ denote the set of all predecessors of $X_i$ in $L$, including $X_i$, and $\mathrm{suc}_i$ the set of all successors of $X_i$, excluding $X_i$. By $\mathrm{pre}_i^*$ we understand the set where all action variables in $\mathrm{pre}_i$ are replaced by their associated ITT variables, and similarly for $\mathrm{suc}_i^*$. Also $F_{i:j}$ will denote $(F_i, \ldots, F_j)$, and similarly for other variables.

Generalising (34) with (35), or (36) with (37), and with similar motivation, we introduce the following assumptions (noting that $B_i$ expresses a strong ignorability property for the effects of all the variables $(X_1, \ldots, X_i)$ on later variables – which would need correspondingly strong justification in any specific application):

(62) $A_i : \mathrm{pre}_i^* \perp\!\!\!\perp F_{i:k} \mid F_{1:i-1},$

(63) $B_i : \mathrm{suc}_i^* \perp\!\!\!\perp (F_{1:i}, X^*_{1:i}) \mid (F_{i+1:k}, \mathrm{pre}_i).$

Taking account of the fact that $X_i$ is determined by $(F_i, X_i^*)$, these are equivalent to:

(64) $A_i : (V_{1:i}, X^*_{1:i}, X_{1:i-1}) \perp\!\!\!\perp F_{i:k} \mid F_{1:i-1},$

(65) $B_i : (V_{i+1:k}, X^*_{i+1:k}, X_{i+1:k}) \perp\!\!\!\perp (F_{1:i}, X^*_{1:i}) \mid (F_{i+1:k}, V_{1:i}, X_{1:i}).$

Theorem 1

Suppose the observational conditional independencies are represented by a DAG $\mathcal{D}$, and that assumptions $A_i$ and $B_i$ ($i = 1, \ldots, k$) hold. Then the extended conditional independencies between domain variables, ITT variables, and regime variables (conditional on the regime variables) are represented by the ITT DAG $\mathcal{D}^*$, constructed by modifying $\mathcal{D}$ as follows:

  • Each action variable $X_i$ is replaced by the trio of variables $F_i$, $X_i^*$, and $X_i$, with arrows from $F_i$ and $X_i^*$ to $X_i$. It is assumed that (33) holds.

  • $F_i$ is a founder node.

  • $X_i^*$ inherits all the original incoming arrows of $X_i$.

  • $X_i$ loses its original incoming arrows, but retains its original outgoing arrows.

Proof

See Appendix A.□

Finally, on eliminating the ITT nodes ($X_i^*$) from the ITT DAG, the relationships between the domain variables and regime variables are represented by the augmented DAG, constructed from $\mathcal{D}$ by adding, for each $X_i$, $F_i$ as a founder node, with an arrow from $F_i$ to $X_i$. As described in Section 2, such an augmented DAG is all we need to represent and manipulate causal properties defined in terms of point interventions. The above argument shows what needs to be assumed – and, more important, justified – to validate its use.[25]
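The two graph constructions just described are simple enough to sketch in Python. The functions `itt_dag` and `augmented_dag`, and the node-naming scheme (`"X1*"` for an ITT variable, `"F_X1"` for a regime indicator), are hypothetical choices made for illustration. The edge list `FIG11` is our reading of the story of Section 10.1.

```python
def itt_dag(edges, actions):
    """Edges of the ITT DAG obtained from `edges` by the rules of Theorem 1.

    edges is a list of (parent, child) pairs of the observational DAG;
    actions lists the variables that are possible targets of intervention.
    """
    out = []
    for u, v in edges:
        # Incoming arrows of an action variable go to its ITT variable;
        # outgoing arrows of every variable are kept as they are.
        out.append((u, v + "*" if v in actions else v))
    for x in sorted(actions):
        # The realised variable has exactly two parents: F and the ITT node.
        out.append(("F_" + x, x))
        out.append((x + "*", x))
    return out


def augmented_dag(edges, actions):
    """Original DAG plus a founder regime node pointing at each action."""
    return list(edges) + [("F_" + x, x) for x in sorted(actions)]


# Assumed arrows of the observational DAG (Figure 11).
FIG11 = [("X0", "Z"), ("H", "Z"), ("Z", "X1"),
         ("H", "X1"), ("X1", "Y"), ("Z", "Y")]

for edge in itt_dag(FIG11, ["X0", "X1"]):
    print(edge)
```

For this example the parents of each node in the output agree with the conditioning sets in (42)–(48), which is one way to sanity-check the construction against Figure 13.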

11 Comparison with other approaches

In this section, we explore some of the similarities and differences between the DT approach to statistical causality, considered above, and other currently popular approaches.

11.1 Potential outcomes

In the potential outcomes (PO) formulation of statistical causality [24,25], the conception is that (for a generic individual) there exist, simultaneously and before the application of any treatment, two variables, $Y(0)$ and $Y(1)$: $Y(t)$ represents the individual’s potential response to the (actual or hypothetical) application of treatment $t$. If treatment 1 (resp., 0) is in fact applied, the corresponding potential outcome $Y(1)$ (resp., $Y(0)$) will be uncovered and so rendered actual, the observed response then being $Y = Y(1)$ (resp., $Y = Y(0)$); however, the alternative, now counterfactual,[26] potential outcome $Y(0)$ (resp., $Y(1)$) will remain forever unobserved – a feature which Holland [17] has termed the fundamental problem of causal inference, although it is not truly fundamental, but rather an artefact of the unnecessarily complicated PO approach.

The pair $(Y(1), Y(0))$ is supposed to have (jointly with the other variables in the problem) a bivariate distribution, common for all individuals – this might be regarded as generated from an assumption of exchangeability of the pairs $(Y_i(1), Y_i(0))$ across all individuals $i$. The marginal distribution of $Y(t)$ can be identified with our hypothetical distribution $P_t$ for the response variable $Y$ under hypothesised application of treatment $t$, and is thus estimable from suitable experimental data. However, on account of the fundamental problem of causal inference, no empirical information is obtainable about the dependence between $Y(0)$ and $Y(1)$, which can never be simultaneously observed.
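To make the last point concrete, the Python sketch below exhibits two hypothetical joint distributions for a binary pair of potential outcomes. The names `INDEPENDENT`, `COMONOTONE`, `marginal` and `difference_dist`, and the numbers, are ours, chosen for illustration. The two joints have identical marginals, so no data on one potential outcome per individual could tell them apart, yet they give different distributions for the difference $Y(1) - Y(0)$.

```python
from fractions import Fraction as Fr

HALF, QUARTER = Fr(1, 2), Fr(1, 4)

# Keys are pairs (y0, y1) of values of (Y(0), Y(1)).
INDEPENDENT = {(y0, y1): QUARTER for y0 in (0, 1) for y1 in (0, 1)}
COMONOTONE = {(0, 0): HALF, (1, 1): HALF}


def marginal(joint, k):
    """Marginal pmf of coordinate k (0 for Y(0), 1 for Y(1))."""
    out = {}
    for pair, p in joint.items():
        out[pair[k]] = out.get(pair[k], 0) + p
    return out


def difference_dist(joint):
    """Pmf of the difference Y(1) - Y(0) under the given joint."""
    out = {}
    for (y0, y1), p in joint.items():
        out[y1 - y0] = out.get(y1 - y0, 0) + p
    return out


for name, joint in (("independent", INDEPENDENT), ("comonotone", COMONOTONE)):
    print(name, marginal(joint, 0), marginal(joint, 1), difference_dist(joint))
```

Under the second joint the difference is always zero, while under the first it is non-zero half the time; the marginals, which are all that experimental data can identify, are the same in both cases.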

11.1.1 Causal effect

If I (individual 0) consider taking treatment 1 [resp., 0], I would then be looking forward to obtaining response $Y_0(1)$ [resp., $Y_0(0)$]. Causal interest, and inference, will thus centre on a suitable comparison between the two potential responses. The PO approach typically regards as basic the “individual causal effect,” $\mathrm{ICE} := Y(1) - Y(0)$. However, again on account of the “fundamental problem of causal inference,” ICE is never directly observable, and even its distribution cannot be estimated from data except by making arbitrary and untestable assumptions (e.g., that $Y(1)$ and