Causal inference in AI education: A primer

Abstract: The study of causal inference has seen recent momentum in machine learning and artificial intelligence (AI), particularly in the domains of transfer learning, reinforcement learning, automated diagnostics, and explainability (among others). Yet, despite its increasing application to address many of the boundaries in modern AI, causal topics remain absent in most AI curricula. This work seeks to bridge this gap by providing classroom-ready introductions that integrate into traditional topics in AI, suggesting intuitive graphical tools for application to both new and traditional lessons in probabilistic and causal reasoning, and presenting avenues for instructors to impress the merit of climbing the "causal hierarchy" to address problems at the levels of associational, interventional, and counterfactual inference. Finally, this study shares anecdotal instructor experiences, successes, and challenges integrating these lessons at multiple levels of education.


Introduction
The study of causality seeks to model and reason about systems using a formal language of cause and effect, and has undertaken a number of important endeavors across a diverse set of disciplines, including causal diagrams to inform empirical research [1], structural equation models for econometric analysis [2], systems thinking in the philosophy of science [3][4][5], modeling elements of human cognition and learning [6][7][8], and many others [9].
Yet, for its long history in other disciplines, causal inference has only recently begun to penetrate traditional topics in machine learning (ML) and the design of artificial agents. Perhaps overshadowed by the impressive advances from deep learning, the artificial intelligence (AI) community is turning to causality to address many of its boundaries, such as to avoid overfitting and to transfer learning [10,11], reasoning beyond observed examples as through counterfactual inference [12], providing meta-cognitive avenues for reinforcement learners in confounded decision-making scenarios [13,14], improving medical diagnostics beyond mere association of symptoms [15,16], reducing bias in ML models through formalizations of fairness [17,18], among others [19,20].
Despite these clarion calls for causality from many prominent researchers and practitioners [21][22][23], it remains a missing topic in the majority of traditional artificial intelligence (AI) curricula. This lag can be explained by a number of factors, including the recency of causal developments in the domain, the lack of a bridge between the topics of causality that statisticians and empirical scientists care about and those that computer scientists do, and the lack of template lesson plans for integration into such curricula; even causality textbooks oriented at undergraduate introduction lack direct examples relating to AI [24]. Although efforts do exist in the literature toward bridging causality in AI [25,26], this work serves as a motivator, primer, and introductory handbook for educators to bring causality into the AI classroom, and focuses especially on the tools of graphical causality to intuitively introduce its topics to novices. Specifically, it provides motivated, detailed, and numerical examples of causal topics as they apply to AI, discusses common pitfalls in the course of student learning experiences, and gives a number of other tools ready to be deployed by instructors teaching topics in AI and ML at the high-school and college levels.
As such, the main contributions of the present work are as follows:
1. Provides brief, classroom-ready introductions to the three tiers of data and queries that compose the causal hierarchy: associations, interventions, and counterfactuals.
2. Suggests intuitive graphical depictions of core lessons in probabilistic and causal reasoning that enable multi-modal instruction.
3. Demonstrates and motivates examples wherein causal concepts can be easily integrated into typical lessons in AI, alongside novel, interactive learning tools to help make select topics concrete.
4. Shares anecdotal successes, challenges, and instructor experiences from causally motivated lessons deployed at both undergraduate and high-school levels.

Forward
Before embarking on this journey in causality, it is important to contextualize this work with respect to its intended audience, target domains of application, and source experiences from which anecdotal student experiences are shared.
Intended audience. This work represents an invitation to instructors to concert topics in AI and causality, and is thus appropriate for the following readers:
1. Instructors with a background in causality who are looking to incorporate more AI/ML examples and assignments into their courses, either by extending existing lessons in AI/ML with causal topics or by incorporating AI/ML examples into courses primarily on causal inference.
2. Instructors teaching AI/ML courses who are looking for entry points/motivations to introduce causality but who may be unfamiliar with causal formalisms or procedures.
For readers of category 1, we include brief refreshers on the formal lessons of causality alongside intuitive examples that are classroom-ready supplements both for the foundational concepts and those marketed specifically for AI/ML applications. For readers of category 2, please note that this work is not intended as an in-depth primer for all causal topics (the textbooks referenced in the introduction are better suited for this), but instead contains examples and problems that we hope compel the integration of causality in many avenues of traditional AI/ML.
Source experiences and target domains. Given these specifications, we include several example syllabi and suggested entry points for causal topics in traditional AI curricula within Appendix A. Of these, two are from the authors' deployments (a) at the high-school level in a course entirely on causality and (b) at the undergraduate level in a course on causal reinforcement learning (mingling the two topics in depth). In the shared experiences in teaching these courses that are mentioned throughout this work, note that the authors enjoyed classes high in student engagement and enjoyment, but for which only anecdotal evidence is available; we thus make no objective claims, which must instead be studied empirically. With a wider adoption of causal topics in AI curricula, we invite future study to examine its potential benefits at a more robust population level.

Background
Etiology has been core to scientific discovery and philosophical concerns since humans first started asking why things are the way they are. Humans possess a natural ability to learn cause and effect that allows us to understand, deduce, and reason about the data we take in through our senses [27]. Modern tools for inferring causes allow us to systematically interpret these causal connections at a more fundamental level with increased confidence, less data, and fewer assumptions. With this deeper causal knowledge, causal inference serves to make accurate predictions, estimate the effects of interventions, and decipher imagined scenarios. As with statistical models, the benefits of causal inference depend on the accuracy and completeness of the assumed causal model.
The distinction between these tasks, their underlying types of data, and the inferences possible given assumptions about the system are delineated in the Pearlian causal hierarchy (PCH) [28]. The PCH is organized into three tiers/layers of information, each building upon the expressiveness of the last:
$\mathcal{L}_1$ Associations: Observing evidence and assessing changes in belief about some variables, e.g., determining the probability of having some disease given presentation of certain symptoms.
$\mathcal{L}_2$ Interventions: Assessing the probability of some causal effects under manipulation, e.g., determining the efficacy of a drug in treating some condition.
$\mathcal{L}_3$ Counterfactuals: Determining the probability of some outcomes under hypothetical manipulation that is contrary to what happened in reality, e.g., determining whether a headache would have persisted had one not taken aspirin.
The ability to traverse the different layers of the PCH often demands causal assumptions to be stated in a mathematical language that clearly disambiguates between them. As will be demonstrated in the following sections, certain interventional ($\mathcal{L}_2$) and counterfactual ($\mathcal{L}_3$) queries of interest cannot be answered using data and traditional observational statistics alone, but can be enabled by an explanation of the system under scrutiny as through a structural causal model (SCM).
An SCM $M$ is a 4-tuple $\langle U, V, F, P(u) \rangle$, where:
1. $U$ is a set of background variables (also called exogenous), whose values are determined by factors outside the model.
2. $V = \{V_1, V_2, \ldots, V_n\}$ is a set of endogenous variables, whose values are each determined by other variables in $U \cup V$.
3. $F$ is a set of functions $\{f_1, f_2, \ldots, f_n\}$ such that each $f_i$ is a mapping from (the respective domains of) $U_i \cup PA_i$ to $V_i$, where $U_i \subseteq U$ and $PA_i \subseteq V \setminus V_i$, and the entire set $F$ forms a mapping from $U$ to $V$. In other words, each $f_i$ in $F$ assigns a value to $V_i$ that depends on (the values of) a select set of variables.
4. $P(u)$ is a probability density defined on the domain of $U$.
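To make the four components of an SCM concrete, the following is a minimal sketch in plain Python. The model itself is hypothetical and not taken from this work: a binary chain X -> Z -> Y in which each variable copies its parent unless an exogenous "flip" occurs. The names `sample_scm`, `p_u`, and `f`, and all probabilities, are illustrative assumptions.

```python
import random

def sample_scm(n, seed=0):
    """Draw n samples from a toy SCM M = <U, V, F, P(u)>.

    Hypothetical model (an assumption for illustration): a binary chain
    X -> Z -> Y, where each endogenous variable is a function of its
    endogenous parents and its own exogenous noise term.
    """
    rng = random.Random(seed)
    # P(u): independent Bernoulli noise terms (illustrative parameters).
    p_u = {"U_x": 0.5, "U_z": 0.1, "U_y": 0.2}
    # F: one structural function per endogenous variable.
    f = {
        "X": lambda v, u: u["U_x"],
        "Z": lambda v, u: v["X"] ^ u["U_z"],  # Z copies X unless U_z flips it
        "Y": lambda v, u: v["Z"] ^ u["U_y"],  # Y copies Z unless U_y flips it
    }
    order = ["X", "Z", "Y"]  # a topological order of the induced DAG
    samples = []
    for _ in range(n):
        u = {name: int(rng.random() < p) for name, p in p_u.items()}
        v = {}
        for name in order:
            v[name] = f[name](v, u)
        samples.append(v)
    return samples

data = sample_scm(10_000)
# Associational estimate of P(Y = 1 | X = 1) from the sampled data.
p_y_given_x1 = sum(s["Y"] for s in data if s["X"] == 1) / sum(s["X"] for s in data)
print(round(p_y_given_x1, 2))
```

Under these assumed parameters, the estimate should approach 0.9 * 0.8 + 0.1 * 0.2 = 0.74 as the sample grows. Only the arguments of each function in `f` determine the edges of the induced diagram.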
The inputs to the functions in F within an SCM induce a causal diagram in the form of a directed graph. We will only consider SCMs that induce a directed acyclic graph (DAG) in this introductory work, as shown in Figure 1. A DAG alone is therefore a partial causal model in itself. This nonparametric causal model can come from expert knowledge and is often the only portion of the SCM to which we have access. A causal diagram comprises the following components:
1. The set of endogenous variables $V$, represented as solid nodes (vertices).
2. The set of exogenous variables $U$, represented as hollow nodes (sometimes omitted for brevity).
3. The functional relationships, $F$, between variables, represented by directed edges that connect two variables, $V_a \to V_b$, if $V_a$ appears as an input to $f_{V_b}$.
4. Spurious correlations between variables, represented by a bidirected, dashed edge connecting two variables, $V_a \leftrightarrow V_b$, if their corresponding exogenous parents $U_a$ and $U_b$ are dependent, or if $f_{V_a}$ and $f_{V_b}$ share an exogenous variable $U_i$ as a parameter to their functions.
Intuitively, causal diagrams characterize cause-effect relationships between variables in a system of functional relationships. Equivalently, a DAG provides extra-data information to answer many causal queries that rely on the structure of these relationships, even with (parts of) the data generating process hidden. Consequently, DAGs allow for the computation of causal effects, real or counterfactual, despite the absence of experimental data.

Motivating causal inference
Causal inference is thus the umbrella label for tools used to compute queries (generally, those at $\mathcal{L}_2$ and $\mathcal{L}_3$) from an existing causal model. There are many different types of causal queries at these higher tiers of inference, and before formalizing any of them, students will appreciate some intuition surrounding why they are interesting from a data-scientific perspective.
Example 2.1. Pharmacological observations to policies. Consider some observational data collected on medical records relating whether patients took some over-the-counter drug X (e.g., aspirin), presented with some condition Y (e.g., heart disease), and accounting for some pretreatment covariates Z (e.g., age) and some posttreatment covariates M (e.g., blood pressure). There are many interesting causal queries that could be posed to such a system, e.g.:
1. How much of the causal effect of aspirin on heart disease is explained by the aspirin itself vs. its indirect effect on blood pressure?
2. In total, to what degree does taking aspirin help or hurt incidence of heart disease?
3. On average across age groups, is aspirin harmful or helpful? What about for patients of specific ages?
Note how answers to each of the questions in Example 2.1 have implications for medical policy, e.g., it may be helpful within certain age ranges but harmful in others, it may only be helpful due to its influence on blood pressure (in which case, other prescriptions may more directly help to prevent heart disease), and so on. However, from observational data alone (i.e., outside of the laboratory randomized clinical trial), associations between these variables can complicate the answers to causal questions. SCMs provide formal characterizations and procedures for answering each of these causal questions by applying a structured explanation of the system of causes and effects, which serves as a lens through which one can view the data's associations. Consider an example characterization of this system as diagrammed in Figure 2. With the extra explanatory power of an SCM layered atop the data it fits, we can intuitively define several different types of causal questions based on its structure, some of the recipes for which will be formalized later.

Definition 2.3. (Causal effects [intuition]) [24]
Given any SCM M and its associated causal diagram G, the measurement of causal effects can be informally defined as the controlled influence of some variable X on some outcome Y through so-called causal pathways that are descendant of the intervention X in G. More specific ways of dissecting these causal pathways (referencing Figure 2.1) are as follows:
- The direct effect of X on Y is its unmediated influence on the outcome, as through the path $X \to Y$, indicating the direct effect of aspirin on heart disease.
- The indirect effect of X on Y is the sum of its mediated influences on the outcome, as through the path $X \to M \to Y$, indicating the effect of aspirin on heart disease as mediated through its influence on blood pressure.¹
- The total effect of X on Y is the sum of its direct and indirect effects.
- The average causal effect (ACE) of X on Y is its total effect averaged across subpopulations (e.g., the total effect of aspirin on heart disease averaged across controlled age ranges) [31].
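As a worked illustration of how these effects relate, consider a hypothetical linear SCM over X, M, and Y only (ignoring Z). The coefficients $a$, $b$, and $c$ and the noise terms $U_M$, $U_Y$ are assumptions introduced here for illustration; the additive decomposition shown is a property of linear models and need not hold in general.

```latex
\begin{align*}
M &= c\,X + U_M, \qquad Y = a\,X + b\,M + U_Y \\
Y &= a\,X + b\,(c\,X + U_M) + U_Y = (a + b\,c)\,X + b\,U_M + U_Y \\
\text{direct effect } (X \to Y) &= a \\
\text{indirect effect } (X \to M \to Y) &= b\,c \\
\text{total effect} &= a + b\,c
\end{align*}
```

For instance, with $a = -0.2$, $b = 0.5$, and $c = -0.4$, the direct and indirect effects would be equal ($-0.2$ each), for a total effect of $-0.4$ per unit change in X.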
The aforementioned represents only a sample of the many causal effects of interest for practitioners wishing to translate their data into actionable policy (e.g., see refs. [32][33][34]). However brief, this list of example causal queries serves as a simple motivation for causal inference, and for intuiting the SCMs that can be used to address them. That said, introductory lessons in causality are dominated by such examples in medicine, econometrics, and others; a major deliverable of this work is to shift the same lessons motivated by these in the empirical sciences to settings that AI scientists will find applicable.

Motivating causal discovery
Given the capabilities of SCMs to answer interesting causal queries like those in Example 2.1, students will likely be curious to learn about the sources of these models. Thus, as an oft close companion to causal inference, causal discovery techniques focus on the construction or learning of SCMs from data. Causal discovery supports the assembly of DAGs, or parts of DAGs, largely by examining independence relations among variables (potentially conditioned on other variables), to offer a mechanism to uncover their causal relationships. In this sense, data alone are sometimes enough for causal inference, but when they are not, a partial DAG (also known as a pattern or equivalence class) can inform practitioners of what else is required to disambiguate. Children instinctively comprehend this and employ playful manipulation to better grasp their environment when information from their senses is insufficient [7]; adults and scientists also perform experiments to confirm their causal hypotheses. Causal discovery with DAGs may provide a systematic way for machines to better understand causal situations beyond the traditional ML task of prediction [23].
All DAGs, regardless of complexity, can be constructed from paths of the three basic structures depicted in Figure 1. The chain in Figure 1(a) consists of X causing Z, followed by Z causing Y. The fork in Figure 1(b) consists of Z having a causal influence on both X and Y. In this case, even though X has no causal effect on Y, knowing the value of X does help predict the value of Y: quintessential correlation without causation. In both the chain and the fork, X is independent of Y if and only if conditioning on Z: $P(Y = y \mid X = x, Z = z) = P(Y = y \mid Z = z)$. Colliders, as illustrated in Figure 1(c), behave opposite to chains and forks with regard to independence. Specifically, $X \perp\!\!\!\perp Y$ without conditioning on Z: $P(Y = y \mid X = x) = P(Y = y)$. However, X and Y notably become dependent when conditioning on Z or any of its descendants: $P(Y = y \mid X = x, Z = z) \neq P(Y = y \mid Z = z)$. By holding the common effect Z to a particular value, any change to X would be compensated by a change to Y.
Anecdotally, students have appreciated causal stories to explain these rules of dependence in a causal graph, which may also serve as mnemonics.For each of the following examples, a fruitful exercise can be to have students provide a graphical explanation for the story, which then motivates the rules of independence expected of any graphs with the same patterns.
Example 2.2. Mediation: smoking, tar, and lung cancer. In medical records, smoking cigarettes, X, has been shown to be positively correlated with the incidence of lung cancer, Y. It is known that smoking causes deposits of tar, Z, in the lungs, which leads to cancer Y. However, knowing whether a patient has lung tar Z makes its source (e.g., whether or not they smoked, X) independent from their propensity for lung cancer, Y. Z is thus known as a mediator between X and Y, making the causal structure a chain, $X \to Z \to Y$.
Example 2.3. Confounding: heat, crime, and ice cream. Data reveal that sales of ice cream, X, are positively correlated with crime rates, Y, yielding the amusing possibilities that criminals enjoy a post-crime ice cream or that ice cream leads people to commit crime. However, the two become independent after controlling for a confounder, temperature, Z, that is responsible for both (and could not be affected by either). Z is known as a confounder that "explains away" the noncausal relationship between X and Y, making the causal structure a fork, $X \leftarrow Z \to Y$.
Example 2.4. Colliders: coin flips and coffee. You and your roommates have a game that decides when you will break for coffee: if two of you flip fair coins X and Y, and they both come up heads or both tails, then you will ring a bell Z to summon your dorm to get coffee, C. Alone, the coin flip outcomes X and Y are independent of one another; however, if you hear a bell ring, and know that $X = \text{heads}$, you know also that $Y = \text{heads}$. The same is true if, instead of hearing the bell, you witness your dorm leave to get coffee. This relationship is thus a collider structure with $X \to Z \leftarrow Y$, and demonstrates the effects of conditioning upon the descendant of a collider, $Z \to C$.
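The collider story in Example 2.4 is small enough to verify by exhaustive enumeration, which may make a useful in-class exercise. The sketch below is an illustration rather than part of the original lesson: it assumes the dorm leaves for coffee exactly when the bell rings (C = Z), and the helper name `prob` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Enumerate the four equally likely outcomes of two fair coins.
# Z = 1 (bell rings) when the coins match; C = Z is an assumption that
# the dorm leaves for coffee exactly when the bell rings.
worlds = []
for x, y in product(["heads", "tails"], repeat=2):
    z = int(x == y)
    c = z
    worlds.append({"X": x, "Y": y, "Z": z, "C": c})

def prob(event, given=lambda w: True):
    """P(event | given) under the uniform distribution over worlds."""
    kept = [w for w in worlds if given(w)]
    return Fraction(sum(1 for w in kept if event(w)), len(kept))

y_heads = lambda w: w["Y"] == "heads"

p_marginal = prob(y_heads)
p_given_x = prob(y_heads, lambda w: w["X"] == "heads")
p_given_x_z = prob(y_heads, lambda w: w["X"] == "heads" and w["Z"] == 1)
p_given_x_c = prob(y_heads, lambda w: w["X"] == "heads" and w["C"] == 1)

print(p_marginal, p_given_x)     # 1/2 1/2  (X tells us nothing about Y)
print(p_given_x_z, p_given_x_c)  # 1 1      (given Z or C, X determines Y)
```

Exact fractions are used so that students can compare the output directly against hand calculations.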
The graphical nature of these types of exercises can engender high engagement among students compared to typical probability syntax alone. Causal intuition and probabilistic understanding in this puzzle-like context are thus concerted and enhanced. Building upon these intuitions, we can establish independence or isolate effects in more complex graphs by blocking paths from one node to another through a structural criterion called d-separation (directional separation) [35]; d-separation is already taught alongside traditional AI coverage of Bayesian Networks and succinctly stated as follows. If all paths between X and Y are blocked given Z, they are said to be "d-separated" and thus $X \perp\!\!\!\perp Y \mid Z$.
With causal models being core to causal inference, d-separation provides us with an important testing mechanism. Because a DAG demonstrates which variables are independent of each other given a subset of the remaining variables to condition on, probabilities can be estimated from data to confirm these conditional independencies. The fitness of a causal model can therefore be validated (to a degree of confidence), and debugging simplified from global fitness tests to d-separation's ability to pinpoint error localities. Unfortunately, it is not possible to test every causal relationship between nodes in a DAG, meaning that causal discovery does not always yield the complete DAG, nor are these validity measures a guarantee that a recovered graph represents the true reality [21].
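For assignments, d-separation can also be checked programmatically. The following is a minimal sketch, not a method prescribed by this work: it uses the ancestral moral graph criterion, a standard equivalent of path blocking. The function name `d_separated` and the dictionary-of-parents graph encoding are assumptions made for illustration.

```python
from itertools import combinations

def d_separated(parents, xs, ys, zs):
    """Return True if xs and ys are d-separated given zs in a DAG.

    `parents` maps each node to the set of its parents (nodes without
    parents may be omitted). Uses the ancestral moral graph criterion.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)
    # 1. Keep only the nodes in xs, ys, zs and their ancestors.
    keep, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))
    # 2. Moralize: link each node to its parents and "marry" co-parents.
    adj = {n: set() for n in keep}
    for child in keep:
        pas = list(parents.get(child, ()))
        for p in pas:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(pas, 2):
            adj[p].add(q)
            adj[q].add(p)
    # 3. Delete the conditioning set, then test undirected reachability.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        node = stack.pop()
        if node in seen or node in zs:
            continue
        seen.add(node)
        stack.extend(adj[node] - zs)
    return not (seen & ys)

chain = {"Z": {"X"}, "Y": {"Z"}}                # X -> Z -> Y
fork = {"X": {"Z"}, "Y": {"Z"}}                 # X <- Z -> Y
collider = {"Z": {"X", "Y"}, "C": {"Z"}}        # X -> Z <- Y, Z -> C

print(d_separated(chain, {"X"}, {"Y"}, {"Z"}))     # True
print(d_separated(fork, {"X"}, {"Y"}, set()))      # False
print(d_separated(collider, {"X"}, {"Y"}, set()))  # True
print(d_separated(collider, {"X"}, {"Y"}, {"C"}))  # False
```

The three test graphs mirror the chain, fork, and collider structures discussed above, including conditioning on a collider's descendant.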
Still, certain structural hints provide hope of recovering causal localities. For instance, a v-structure is defined as a pair of nonadjacent nodes, such as X and Y in Figure 1(c), with a common effect (Z in the same figure). These v-structures are often embedded throughout larger causal graphs. An example of a testable implication is to check that Z is not included in the set of nodes that render $X \perp\!\!\!\perp Y$. A simple approach to causal discovery is to find every possible DAG compatible with a set of variables and their independence relationships in a dataset. In general, better approaches require further assumptions, but this is an active area of research [36][37][38]. The set of compatible DAGs is called an equivalence class, which, for some causal queries, can be sufficient for identifying causal effects even with partial structures. If further experimentation is necessary, an equivalence class can help target those variables on which experiments need to be performed to discover the true structure [39,40].
The inductive causation (IC) algorithm³ [9, p. 204] is a simple approach to causal discovery:
1. For each pair of variables a and b in V, search for a set $S_{ab}$ such that $(a \perp\!\!\!\perp b \mid S_{ab})$ holds in $\hat{P}$ (stable distribution of V). Construct an undirected graph G such that vertices a and b are connected with an edge if and only if no set $S_{ab}$ can be found.
2. For each pair of nonadjacent variables a and b with a common neighbor c, check if $c \in S_{ab}$. If it is not, then add arrowheads $a \to c \leftarrow b$.
3. In the partially directed graph that results, orient as many of the undirected edges as possible subject to two conditions: (i) any alternative orientation would yield a new v-structure and (ii) any alternative orientation would yield a directed cycle.
The first step constructs a complete skeleton. While not all arrowheads in the second and third steps can always be discovered from data alone, systems can also prompt humans for clarity on parts of nonparametric causal models to resolve ambiguity. Robotic algorithms can even perform necessary experiments to disambiguate certain localities of the causal structure.
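The first two steps of the IC algorithm can be sketched compactly if conditional independence is supplied by an oracle. In the sketch below, `independent(a, b, s)` is a hypothetical stand-in for a statistical test on data, and the example oracle hard-codes the independencies of an assumed faithful distribution over the DAG X -> Z <- Y, Z -> C. Step 3 (orienting the remaining edges) is deliberately omitted.

```python
from itertools import combinations

def ic_skeleton_and_colliders(variables, independent):
    """Steps 1 and 2 of the IC algorithm, given an independence oracle.

    `independent(a, b, s)` is a hypothetical oracle returning True when
    a and b are independent given the set s.
    """
    sepset, edges = {}, set()
    # Step 1: a and b are adjacent iff no separating set exists.
    for a, b in combinations(variables, 2):
        others = [v for v in variables if v not in (a, b)]
        found = None
        for k in range(len(others) + 1):
            for s in combinations(others, k):
                if independent(a, b, set(s)):
                    found = set(s)
                    break
            if found is not None:
                break
        if found is None:
            edges.add(frozenset((a, b)))
        else:
            sepset[frozenset((a, b))] = found
    # Step 2: orient a -> c <- b when c is a common neighbor of the
    # nonadjacent pair (a, b) and c is not in their separating set.
    arrows = set()
    for pair, s in sepset.items():
        a, b = sorted(pair)
        for c in variables:
            if c in pair:
                continue
            if (frozenset((a, c)) in edges and frozenset((b, c)) in edges
                    and c not in s):
                arrows.add((a, c))
                arrows.add((b, c))
    return edges, arrows

def oracle(a, b, s):
    """Assumed independencies for the DAG X -> Z <- Y, Z -> C."""
    pair = frozenset((a, b))
    if pair == frozenset(("X", "Y")):
        return not (s & {"Z", "C"})  # independent unless Z or C is given
    if pair in (frozenset(("X", "C")), frozenset(("Y", "C"))):
        return "Z" in s              # Z blocks the only path to C
    return False

edges, arrows = ic_skeleton_and_colliders(["X", "Y", "Z", "C"], oracle)
print(sorted(tuple(sorted(e)) for e in edges))
print(sorted(arrows))
```

In this example, the edge between Z and C is left undirected after step 2; applying step 3 would orient it away from Z, since the alternative would create a new v-structure.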
The whole process of constructing a causal model can be challenging for students not familiar with modeling.The following example demonstrates a simple workflow in which students can engage.
Example 2.5. Workflow: causal model construction. A workflow might consist of using the aforementioned IC algorithm to generate a partial DAG. A probability distribution drawn from pharmacological data from Example 2.1 is presented in Table 1.⁴ The IC algorithm will generate the graph in Figure 3, leaving three edges undirected: $X - Z$, $X - Y$, and $X - M$. To determine the directions of those edges, three techniques can be employed:
- New or existing experiment. A randomized controlled trial (RCT) was previously performed, and the proportion of individuals with condition Y in the treatment (X) group differed from the proportion in the control group. This provides evidence for directed edge $X \to Y$. If this RCT did not exist, a new RCT could be conducted.
- Expert knowledge. Consulting a researcher provides evidence that covariate Z (age) affects decisions to take this drug. Therefore, edge $Z \to X$ is now directed.
- Re-evaluation. The only edge remaining undirected is $X - M$. The direction $X \leftarrow M$ would create a v-structure (nonadjacent parents Z and M to X). V-structures are detected in the IC algorithm's step 2. Since IC did not detect this, the directed edge must be $X \to M$.
Finally, the constructed DAG ends up equivalent to the DAG in Figure 2. This work introduces a companion causal inference learning system⁵ to help students practice and absorb concepts in causal discovery. As depicted in Figure 4, a teacher simply writes the structural functions and data generating processes of the exogenous variables, and students are presented with the resulting probability distribution and nodes of the equivalence class to connect. Causal discovery exercises such as these provide engaging exploration into the etiology of data generation missing in many statistically focused curricula.

Assumptions
This background on chains, forks, colliders, and d-separation affords us basic building blocks for powerful causal inference tools. As important caveats to be discussed with students, the power of a causal model depends on having a correct representation of the system. There are some criteria for assessing whether a model is a fair representation of the underlying data or data generating process. However, some assumptions must sometimes simply be asserted. The first that we have assumed is infinite data, leaving the statistical analysis of quantifying uncertainty with finite samples to be dealt with separately. The second is a property known as stability/faithfulness: we assume independencies remain invariant when $P(U)$ changes. This means that the conditional independencies in the underlying probability distribution are reflected in the DAG. As a basic violation of this, imagine a child who only eats vegetables (Y) if their parents convince them (X). The child's parents are always trying to convince them, so that $P(Y = y \mid X = x) = P(Y = y)$. However, that equality only holds for this particular $P(U)$ and would not survive a change in it.
The remainder of this work focuses on the potential of causal inference to both elucidate traditional topics in AI and ML and to inspire new avenues for students to explore. Using the preliminaries outlined in this section, students will be equipped to understand the challenges and opportunities at each tier of the PCH.

Instructor reflections
Intuiting the motivations for causal inference is a challenge to instill in students who have dealt little with real data and the many complex questions that data may or may not be equipped to answer alone.Leading any introductory causal lesson with the intuitions presented in Example 2.1 and Definition 2.3 can spark the important questions that motivated the PCH; questions as simple as "Does obesity shorten life, or is it the soda?" [33] are enough to elicit lively classroom discussion just to introduce the distinctions of types of causal effects and how these are difficult to disentangle from mere associations without the aid of a model.
From these observations, anecdotally, students tend to treat exercises involving the design and interpretation of compounded conditional independence graphs as puzzles rather than monotonous calculations that may lack translation to purpose. This has elicited classroom enjoyment, which feeds into participatory graph modifications in the spirit of causal discovery. The discussions and debates that ensue develop intuition through active engagement, which can be especially important at the high-school level for engendering intuition before formalism.
For assignments, instructors may find it useful to generate mock datasets (like that in Example 2.5) to help students understand crucial lessons in causal discovery, d-separation, and challenges like observational equivalence and unobserved confounding. Various software packages exist for this endeavor, though Tetrad and Causal Fusion have been popular choices that students can pick up without large amounts of tutorial.⁶ It is likewise important to instill that causal discovery is a difficult exercise that is far from a magic wand to be waved over a dataset to produce a trustworthy model; as an ongoing field of inquiry, even in ideal situations, extracting the causal graph can be difficult and many times rests on extra-data sources of information to implement properly. Despite these challenges, there are a myriad of reasons and scenarios in which to pursue their solution, many of which we highlight in the coming sections.

Associations
SCMs are capable of answering a wide swath of queries, the most fundamental being the associational. Queries at this first tier, or layer $\mathcal{L}_1$, consist of predictions based on what has been observed. For instance, after observing many labeled CT scans with and without tumors, an ML algorithm can predict the presence of a tumor in a previously unseen scan. Traditional supervised learning algorithms have excelled in their ability to answer $\mathcal{L}_1$ queries, typically trained on data consisting of large feature vectors along with their associated label. If X is an n-dimensional feature vector with $X_1, X_2, \ldots, X_n$ as the individual features, and Y is the output variable, a model such as a trained neural network will calculate $P(Y \mid X_1, X_2, \ldots, X_n)$. However, this predictive capacity can be stretched thin when faced with important queries that are not associational; indeed, many pains of modern ML techniques can be blamed on their inability to move beyond this tier, as demonstrated over the following examples.
⁶ Tetrad can be found at https://www.ccd.pitt.edu/tools/ and Causal Fusion at causalfusion.net.

Simpson's paradox
Example 3.1. AdBot. Consider an online advertising agent attempting to maximize clickthroughs on studying assistance applications catered differently to college and graduate vs. high-school and primary students, with $X \in \{0, 1\}$ the ad shown, $Y \in \{0, 1\}$ whether it was clicked upon, and $Z \in \{0, 1\}$ whether the viewer is younger than 18 ($Z = 0$, typically pre-college age) or older ($Z = 1$, typically undergraduate or professional studies). A marketing team collects data on purchases following ads shown to focus groups to be used by AdBot. In aggregate, the data show $P(Y = 1 \mid X = 1) = 0.81 > P(Y = 1 \mid X = 0) = 0.75$, which may lead AdBot to conclude that Ad 1 is always more effective. However, the same data also show within age-specific strata that $P(Y = 1 \mid X = 0, Z = z) > P(Y = 1 \mid X = 1, Z = z)$ for each $z$, indicating that Ad 0 is better. AdBot thus faces a dilemma: if the age of a viewer is not known, which ad is the best choice? This conflict is known as Simpson's paradox, which long haunted practitioners using only $\mathcal{L}_1$ tools without causal considerations. Its solution, and those to many other problems, can be found in the next tier.
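The reversal at the heart of Simpson's paradox is easy to reproduce with a small table of counts. The counts below are hypothetical and are NOT the figures from Example 3.1; they are chosen only so that Ad 1 wins in aggregate while Ad 0 wins within each age group. The name `click_rate` is likewise an illustrative assumption.

```python
from fractions import Fraction

# Hypothetical counts, keyed by (ad X, age group Z) and holding
# (clicks, impressions). Not the data from Example 3.1.
counts = {
    (0, 0): (81, 87),
    (0, 1): (192, 263),
    (1, 0): (234, 270),
    (1, 1): (55, 80),
}

def click_rate(x, z=None):
    """Estimate P(Y=1 | X=x) or, if z is given, P(Y=1 | X=x, Z=z)."""
    cells = [(c, n) for (xi, zi), (c, n) in counts.items()
             if xi == x and (z is None or zi == z)]
    return Fraction(sum(c for c, _ in cells), sum(n for _, n in cells))

for x in (0, 1):
    print(f"Ad {x}: overall {float(click_rate(x)):.3f}, "
          f"Z=0 {float(click_rate(x, 0)):.3f}, "
          f"Z=1 {float(click_rate(x, 1)):.3f}")
```

The reversal arises here because the two ads were shown to the age groups in very different proportions, so the aggregate rates mix unlike populations.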

Linear regression
Linear regression is a common topic in introductory statistics and ML courses. This is due, in part, to linear regression's interpretability, limited overfitting, and simplicity. As shown later, a linear model's coefficients explain the impact each variable has on the outcome. This provides intuition behind how causal structure affects learned parameters. Linear regression also provides a base from which to launch more complex ML models and algorithms, and topics like parameters, degrees of freedom, and nonlinearity can be added incrementally. The simplicity of linear regression makes for an ideal starting point for introducing causality to ML. Although this simplicity will seldom yield highly predictive algorithms with real-world data, linear regression can clearly illustrate the value of causal constructs through coding exercises. Student discussion can be fostered through debate about linearity assumptions among exercises and examples.
Other work has provided examples for inferring causal effects from associational multivariate linear regression [41], which we adapt herein as useful exercises for ML students to start examining problems from different tiers of the causal hierarchy. A first exercise corresponds to the chain DAG shown in Figure 1(a).
Example 3.2. Athletic performance. Consider an athletic sport where the goal is to predict an athlete's performance. An ML model uses features X and Z, corresponding to training intensity and skill. The next step is to train an ML model that lacks nonlinear activation functions. The weights of the model can then be analyzed:

    model = train_model(features, y)    # train 1-layer model on features X, Z and outcome Y
    weights, bias = model.parameters()  # retrieve weights and bias for the neural network
    print(weights.tolist())             # print the weights for X and Z to the console
    # [[-0.00918455421924591, 2.9990761280059814]]
    print(bias.item())                  # print the bias to the console
    # -0.004577863961458206

The weight on X has a negligible⁸ impact on the result. This also makes intuitive sense as the model was trained on both X and Z, while Y only "listens to" Z (i.e., since Y is a function of Z, $f_y(z, u_y)$). Looking only at the weights, it would seem that training intensity is irrelevant to athletic performance. If an analyst wanted to predict the performance of someone with increased training intensity, using this model they would observe no difference in performance. On the other hand, if the model had been trained only on X:

    model = train_model(x, y)           # train model only on X instead of both X and Z
    weights, bias = model.parameters()
    print(weights.tolist())
    # [[6.0043745040893555]]
    print(bias.item())
    # 0.0020016487687826157

Here, X clearly plays a major role in predicting performance. This time, making a prediction using this model with increased training intensity will yield increased athletic performance.
Which feature vector do we use for our ML model? This decision is not clear because predicting athletic performance when changing only training intensity is an intervention. Thus, this is a causal question requiring tools from $\mathcal{L}_2$, covered in the following section.
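The contrast between the two feature vectors in the chain example can be reproduced without any ML library. The following is a minimal sketch, not the authors' code: it assumes a hypothetical data-generating process (Z = 2X + noise, Y = 3Z + noise, chosen only because it is consistent with the weights reported above) and fits ordinary least squares through a small hand-written solver named `ols`.

```python
import random

def ols(rows, y):
    """Least-squares weights and bias for the feature rows (list of lists)."""
    # Build the normal equations (A^T A) w = A^T y with a trailing bias column.
    a = [r + [1.0] for r in rows]
    k = len(a[0])
    m = [[sum(r[i] * r[j] for r in a) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(a, y))] for i in range(k)]
    # Solve by Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(k):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    sol = [m[i][k] / m[i][i] for i in range(k)]
    return sol[:-1], sol[-1]

# ASSUMED chain X -> Z -> Y; the coefficients 2 and 3 are illustrative only.
random.seed(0)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
z = [2 * xi + random.gauss(0, 0.1) for xi in x]
y = [3 * zi + random.gauss(0, 0.1) for zi in z]

w_xz, b_xz = ols([[xi, zi] for xi, zi in zip(x, z)], y)  # features <X, Z>
w_x, b_x = ols([[xi] for xi in x], y)                    # features <X>
print("weights with <X, Z>:", w_xz)
print("weight with <X>:", w_x)
```

Under these assumptions, the weight on X is close to zero when Z is included and close to 6 (the product of the two path coefficients) when Z is excluded, which mirrors the pattern in Example 3.2.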
Example 3.3. Competitiveness. How an athlete fares in a competition against others depends, among other things, on their athletic ability and preparation. Unfortunately, The Tortoise and the Hare taught us that high performers often suffer from overconfidence, which reduces their preparation time and effort. To predict an athlete's level of competitiveness, Y, an ML model uses features X and Z, corresponding to preparation and athletic performance. Example data are generated accordingly in PyTorch. With both X and Z in the feature vector, the weight on X indicates a positive impact on the outcome. Predicting the level of competitiveness of someone with increased preparation time would yield an increased level of competitiveness. This makes sense given how the example data were generated: Y was calculated with a positive multiple of X (1, to be precise). Next, the model is trained on a singleton feature vector of X. This time, the weight on X is negative, indicating a negative impact on the outcome. It would seem that increasing preparation in this model decreases competitiveness.
These two models have very different weights on X. Which model is correct? The answer depends on the quantity of interest. A causal question, such as, "What is the effect of preparation on competitiveness?" requires an analysis in $\mathcal{L}_2$.
Example 3.4. Money. How much money does an athlete earn? This depends, among other things, on their previous athletic performance and their ability to negotiate. Can an ML model predict an athlete's ... This time, the weight on X is negligible. We know, from the code that generated the example data, that negotiating skill and athletic performance are uncorrelated. So, this appears to be a better model for understanding the causal effect of negotiating skill on athletic performance (a null causal effect). In addition to the two previous examples, this is another example where $\mathcal{L}_2$ tools are necessary to know which variables to include in the feature vector.
Examining the weights helps to foster an intuition for why feature selection is critical in understanding causal relationships and queries. As students investigate more expressive, nonlinear models (for which libraries like PyTorch provide a number of tools), weights become less interpretable despite what may be an increase in accuracy. Still, these causal intuitions about feature selection, their relationship to SCMs, and how they may bias queries remain.

Instructor reflections
Within the courses in which these causal concepts have been tested, students have exhibited surprise when first exposed to Simpson's paradox.This revelation is their first hint that the story behind the data is crucial for thorough and valid interpretations of the results.This is a prime opportunity for active learning.By using DAGs as a discussion source [42], students review and debate both the diagrams and the need to be careful about which features to train their ML models on and how to utilize their results.
For many students, learning the mathematics of probability and statistics may feel mechanical, thus missing the forest (the ability to use these as tools to inform decisions, automated or otherwise) for the trees (the rote computation) [43,44]. Examples like those introduced in 3.1-3.4 break the mold of this script and ask students to make a defendable choice with the data and assumptions at hand, because such choices answer causal questions that are often unanswerable by the data alone.
The causal "solutions" to these problems have intuitive, graphical criteria that students tend to find more appealing than reasoning over the symbolic or numerical parameters of each system alone. What follows is an overview of these approaches that can enhance student understanding of both the traditional tools in $\mathcal{L}_1$ and their limits: when and how to seek solutions to questions at higher tiers of the causal hierarchy.

Interventions
The second tier in the causal hierarchy is the interventional layer, $\mathcal{L}_2$. Queries of this nature ask what happens when we intervene and change an input, as opposed to merely observing the input as in the associational layer. Analyzing Table 2 in the AdBot example, the question of what outcome we can predict based on which ad was shown is answered by seeing that Ad 1 received more clicks. However, the causal question of which ad causes more clicks is a different question, predicated on determining the effect of changing the ad that was seen despite its natural causes.
The RCT was invented [45] to isolate these causal effects, free of the so-called "confounding bias" that can make spurious correlation masquerade as the causal effect. Unfortunately, experiments are not always feasible, affordable, or ethical. Consider an experiment to discern the effects of smoking on lung cancer: although there are valuable techniques for dealing with imperfect compliance [9,46], a study that forced certain groups to smoke and others to abstain would not be ethically sound.

Resolving Simpson's paradox
As such, practitioners are often left with causal questions but only observational data, like in Example 3.1. Herein, we witness an instance of Simpson's paradox, in which a better outcome is predicted for one treatment versus another, but the reverse is true when calculating treatment effects for each subgroup.
Resolving Simpson's paradox demands that we understand the underlying data-generating causal system, which in general may cause confusion through only the associational lens. Examining Figure 5, these two observationally equivalent causal models of the data in Example 3.1 tell two different interventional stories. In (a), Z is a confounder whose influence in the observational data must be controlled to isolate the causal effect of $X \to Y$. In (b), Z is only spuriously correlated with $\{X, Y\}$, and so controlling for Z in this setting will actually enable confounding bias (by the rules of d-separation, since $U_1 \to Z \leftarrow U_2$ forms a collider). Practically, this means that if (a) is our explanation of the observed data, then AdBot should consult the age-specific clickthrough rates and display Ad 0; if (b) is our explanation, then we consult the aggregate data and display Ad 1. In this specific scenario, model (a) is the more defendable since there cannot be latent confounders that affect someone's age as in model (b).
Generalizing the intuitions earlier, the foundational tool from the interventional tier is known as do-calculus [47], which allows analysts to take both observational data and a causal model, and answer interventional queries.
Definition 4.1. (Intervention) An intervention represents an external force that fixes a variable to a constant value (akin to random assignment in an experiment) and is denoted $do(X = x)$, meaning that X is fixed to the value x. This amounts to replacing the structural equation for the intervened variable with its fixed constant such that $f_X = x$ (eliciting the "mutilated submodel" $M_x$). This operation is also represented graphically by severing all inbound edges to X in G, resulting in an "interventional subgraph" $G_x$.
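The graphical side of an intervention is simple enough to sketch in a few lines. The snippet below is illustrative only: it assumes a DAG stored as a dictionary from each node to the set of its parents, and the helper name `intervene` is hypothetical.

```python
def intervene(graph, target):
    """Return the interventional subgraph: all edges into `target` are removed.

    `graph` maps each node to the set of its parents; the input is not modified.
    """
    return {node: (set() if node == target else set(parents))
            for node, parents in graph.items()}

# Confounded structure in the style of Figure 5(a): Z -> X, Z -> Y, X -> Y.
g = {"Z": set(), "X": {"Z"}, "Y": {"X", "Z"}}
g_x = intervene(g, "X")
print(g_x["X"])  # X has no parents after do(X = x)
print(g_x["Y"])  # Y keeps both of its parents
```

Only the edges pointing into the intervened variable are cut, so every other mechanism in the model is left untouched.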
To compare quantities at associational ($\mathcal{L}_1$) and interventional ($\mathcal{L}_2$) tiers, the probability of event Y happening given that variable X was observed to be x is denoted by $P(Y \mid X = x)$. The probability of event Y happening given that variable X was intervened upon and made to be x is denoted by $P(Y \mid do(X = x))$. For instance, in Figure 5(a), the effect of intervention $do(X = x)$ would be to sever the edge $Z \to X$. Formally, to compute the ACE (Def. 2.3) of an ad on clickthroughs in Example 3.1, and assuming our setting conforms to the model in Figure 5(a), we must compute X's influence on Y in homogeneous conditions of Z, weighted by the likelihood of each condition $Z = z$. This adjustment is accomplished through the graphical recipe specified by the backdoor criterion, which a set of variables Z satisfies relative to X and Y if:
1. No node in Z is a descendant of X.
2. Z blocks every path between X and Y that contains an arrow into X.
The backdoor adjustment formula for computing causal effects ($\mathcal{L}_2$) from observational data ($\mathcal{L}_1$) is thus:
$$P(Y \mid do(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z)$$
From this adjustment, we confirm that displaying Ad $X = 0$ has the highest ACE on clickthrough rates. In summary, we arrive at this conclusion through the following steps, which are beneficial to highlight for students applying this recipe in general:
1. Example 3.1 demanded that we compute an ACE of ad choice X on clickthrough rates Y in cases where the viewer's age Z is unknown; this is an $\mathcal{L}_2$ query of the format $P(Y = 1 \mid do(X))$ whose computation can suffer from Simpson's paradox, given that the inclusion or exclusion of Z as a control delivers different answers for the optimal ad choice.
2. Resolving this "paradox" and computing the ACE requires assumptions about the causal structure to determine which, if any, spurious pathways demand control. We encoded these assumptions in the SCM with graphical structure from Figure 5(a).
3. Given this structure, we applied the backdoor adjustment criteria to find $P(Y = 1 \mid do(X = x))\ \forall x \in X$, controlling for the backdoor-admissible variable Z, and concluded that the highest likelihood action $X = 0$ was the best for maximizing clickthroughs.
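The recipe above can be sketched numerically. Because Table 2 is not reproduced in this section, the counts below are hypothetical stand-ins (they are not the paper's data) chosen to exhibit the same reversal: Ad 1 looks better in aggregate, while Ad 0 is better within each age group and after backdoor adjustment on Z.

```python
# counts[(x, z)] = (impressions, clicks); HYPOTHETICAL numbers, not Table 2.
counts = {
    (0, 0): (87, 81),   (0, 1): (263, 192),   # Ad 0 shown to age groups 0 and 1
    (1, 0): (270, 234), (1, 1): (80, 55),     # Ad 1 shown to age groups 0 and 1
}

def naive_rate(x):
    """Associational estimate P(Y=1 | X=x), pooling over Z."""
    shown = sum(n for (xi, _), (n, _) in counts.items() if xi == x)
    clicks = sum(c for (xi, _), (_, c) in counts.items() if xi == x)
    return clicks / shown

def backdoor_rate(x):
    """Interventional estimate: sum_z P(Y=1 | X=x, Z=z) * P(Z=z)."""
    total = sum(n for n, _ in counts.values())
    estimate = 0.0
    for z in (0, 1):
        p_z = sum(n for (_, zi), (n, _) in counts.items() if zi == z) / total
        shown, clicks = counts[(x, z)]
        estimate += (clicks / shown) * p_z
    return estimate

for ad in (0, 1):
    print(f"Ad {ad}: naive={naive_rate(ad):.3f} adjusted={backdoor_rate(ad):.3f}")
```

The adjusted estimate is only a causal effect if Z really satisfies the backdoor criterion, as in Figure 5(a); under the structure of Figure 5(b), the pooled estimate would be the appropriate one.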

Causal recipes for feature selection
The power of do-calculus means ML algorithms can utilize causal effects without having to perform experiments or be trained on experimental data.⁹ This has implications for ML feature selection: bias may be introduced if the causal structure is not consulted. For example, a collider might be conditioned on without conditioning on noncolliders along the path from action X to outcome Y. Consider the M-graph of Figure 5(b): variables $U_1$ and $U_2$ cannot be included in the feature vector of an ML model because they are unobserved, and if Z is included in the feature vector, this model will produce correlative ($\mathcal{L}_1$), but not causal ($\mathcal{L}_{2+}$), predictions. Notably, if the requested query is indeed correlative, the criteria for feature selection are different than if it were causal, and the addition of features that provide information about the outcome can aid in accuracy without causal considerations. However, queries at tiers above the first must be careful with controlled covariates lest they inadvertently bias the outcome. Reflecting on the pharmacology Example 2.1, we can conceive of queries at different tiers:
$\mathcal{L}_1$: What is the incidence of heart disease among those who take aspirin?
$\mathcal{L}_2$: What is the ACE of aspirin on incidence of heart disease?
In the $\mathcal{L}_2$ query, and assuming the causal graph in Figure 2, we would intuitively wish to include Z (age) as a feature to block the backdoor path $X \leftarrow Z \to Y$, and avoid including M (blood pressure) as a feature lest we intercept part of aspirin's effect on heart disease mediated through blood pressure. These intuitions are formalized in the backdoor criterion.
Concretely, revisiting the three linear regression examples of Section 3.2, Example 3.2 poses a decision to use a feature vector consisting of $\langle X, Z \rangle$ or just $\langle X \rangle$. Since the data-generating process makes Z a function of X, and Y a function of Z, the DAG of Figure 1(a) corresponds to this model. The DAG makes it clear that by including Z in the feature vector, we are conditioning on a mediator, thus blocking X's influence on Y and preventing the correct calculation of the causal effect of X on Y. This can be seen from the fact that, in a chain, conditioning on Z renders X and Y independent, $(X \perp\!\!\!\perp Y \mid Z)$. Since there are no backdoor paths from X to Y, the causal effect can be predicted by not including Z in the feature vector. Students are then left to debate the linearity assumption. Does every additional level of training intensity, within a reasonable range, yield the same increase in athletic performance? This application of $\mathcal{L}_2$ tools to get the causal effect of interest by including only X in the feature vector does not depend on linearity. So, the linearity discussions can aid intuition and lead to the generalization of dropping the linearity assumption.
Example 3.3 showcases the same feature vector decision, $\langle X, Z \rangle$ or $\langle X \rangle$. This time the corresponding DAG is Figure 5(a), which was used to explain Simpson's paradox. The backdoor path $X \leftarrow Z \to Y$ must be blocked to have a model that predicts the causal effect of X, preparation, on Y, competitiveness. Blocking this backdoor between X and Y is accomplished by including Z in the feature vector.
Example 3.4 is a collider scenario depicted in the DAG of Figure 1(c).Here, attention must be paid to including the collider Z in the feature vector.By including Z, predictions will be far more accurate (in fact, excluding Z will make predictions simply the mean of Y ).However, doing so opens a spurious pathway between X and Y , making the causal effect of X on Y naïvely appear to be nonzero, but the DAG makes it clear that the causal effect should be null.Therefore, we must exclude Z from the feature vector if the ML model is to determine the causal effect of X, athletic performance, on Y , negotiating skill.
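The confounder and collider cases can be contrasted in a small simulation. This is a sketch under assumed linear-Gaussian mechanisms: the coefficients below are hypothetical and are not the values used to generate the data for Examples 3.3 and 3.4. Regression coefficients are computed from sample covariances, so no ML library is needed.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

def slope(x, y):
    """Coefficient on x when regressing y on x alone."""
    return cov(x, y) / cov(x, x)

def slope_given(x, z, y):
    """Coefficient on x when regressing y on both x and z."""
    det = cov(x, x) * cov(z, z) - cov(x, z) ** 2
    return (cov(x, y) * cov(z, z) - cov(z, y) * cov(x, z)) / det

def noise():
    return random.gauss(0, 0.5)

random.seed(1)
n = 5000

# Confounder pattern (as in Example 3.3): Z -> X, Z -> Y, and X -> Y with true effect 1.
z = [random.gauss(0, 1) for _ in range(n)]
x = [-zi + noise() for zi in z]
y = [xi + 3 * zi + noise() for xi, zi in zip(x, z)]
fork_x_only = slope(x, y)            # biased by the open backdoor path
fork_x_and_z = slope_given(x, z, y)  # backdoor blocked by including Z

# Collider pattern (as in Example 3.4): X -> Z <- Y, with no effect of X on Y.
x2 = [random.gauss(0, 1) for _ in range(n)]
y2 = [random.gauss(0, 1) for _ in range(n)]
z2 = [a + b + noise() for a, b in zip(x2, y2)]
coll_x_only = slope(x2, y2)              # close to the true null effect
coll_x_and_z = slope_given(x2, z2, y2)   # spurious association opened by Z

print(fork_x_only, fork_x_and_z, coll_x_only, coll_x_and_z)
```

In this sketch, including Z recovers the true coefficient in the confounded case but manufactures a nonzero coefficient in the collider case, which is the asymmetry the DAGs predict.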
Students can extend the insights gained from the above (which are useful in eliciting insights distinguishing $\mathcal{L}_1$ and $\mathcal{L}_2$ in simple settings) to more complex models like the following, which demands a synthesis of these modular lessons.¹⁰ ... $P(Y \mid do(X), Z)$.
 10 Additional modular examples of "good and bad controls" using regression and SCMs can be found in ref. [48].
In Figure 6, conventional wisdom allows for the inclusion of all covariates, but the causal quantity requires more selectivity. To control for all noncausal pathways requires that

Transportability and data fusion
Although much of traditional ML education focuses on the ability or suitability of models to fit a particular dataset, there are several adjacent discussions that are commonly omitted, including the qualitative differences between observational ($\mathcal{L}_1$) and experimental ($\mathcal{L}_2$) data, how these datasets can often be "fused" to support certain inference tasks, and how to take data collected at some tier in one environment/population and transport it to another. This transportability problem [49,50] has long been studied in the empirical sciences under the heading of external validity [51,52] and has received attention from the AI and ML communities under a variety of related tasks like transfer learning [53,54] and model generalization [55][56][57]. Many modern techniques have focused on the ability to take a model trained in one environment and then to adapt it to a new setting that may differ in key respects. This capability is particularly palatable to fields that train agents in simulation settings to be later deployed in the real world, often because it is too risky, expensive, or otherwise impractical to perform the bulk of training in reality [58][59][60]. In general, when the training domain differs from the deployment domain (even slightly), predictions are biased, sometimes with significant model degradation. This often occurs when data from the deployment environment is limited; otherwise, the ML model could have been trained on deployment data. To illustrate the utility of causal tools for this task, we provide a simple example in the domain of recommender systems that motivates distinctions in environments with heterogeneous data.
... $P^{*}(X, Y, Z)$) and (due to your budget) cannot conduct an experiment in this domain to determine the best diets to recommend to its population. Your task: Without having to collect more data, determine the best policy your agent should adopt in $\pi^{*}$ for maximizing the likelihood of users' health, i.e., find: $\arg\max_{x} P^{*}(Y = 1 \mid do(X = x))$.
The training and deployment causal diagrams of Example 4.2 are depicted in Figure 7. Notably, because we conducted an experiment (i.e., performed an intervention) in environment $\pi$ (Figure 7(a), represented by the interventional subgraph $G_x$), the intervention $do(X)$ severs any of the would-be inbound edges to X in the observational setting that we see in the target environment $\pi^{*}$ (Figure 7(b), representing the unintervened graph G). Graphically, the challenge in the target environment becomes clear: we wish to estimate $P^{*}(Y \mid do(X))$, but this causal effect is not identifiable because it is impossible to control for all backdoor paths between X and Y due to the presence of unobserved confounders indicated by the bidirected arcs. Yet, by assumption, the only difference between the two environments is the difference in age distributions such that $P(Z) \neq P^{*}(Z)$, so insights from the experiment conducted in $\pi$ (in which the direct effect of $X \to Y$ has been isolated) may yet transport into $\pi^{*}$. To encode these assumptions of where structural differences occur between environments and thus to determine if and how to transport, we can make use of another graphical tool known as a selection diagram. Importantly, selection diagrams encode both the differences in causal mechanisms between environments (via the presence of a selection node) and the similarities, with the assumption that any absence of a selection node represents the same local causal mechanisms between environments at that variable. In Example 4.2, the selection diagram requires only a single addition to G (Figure 7(a)): a selection node S representing the difference in age distributions at Z. Notationally, this also allows us to represent distributions in terms of the S variable, such that $S = s^{*}$ indicates that the population under consideration is the target $\pi^{*}$. Similarly, we can rewrite distributions that are sensitive to selection like $P^{*}(Z) = P(Z \mid S = s^{*})$.
Doing so provides us a starting point for adjustment, similar to the backdoor adjustment from Example 3.1, wherein (using the rules of do-calculus) if we are able to find a sequence of rules to transform the target causal effect into an expression where the do-operator is independent from the selection variables, transportability is possible [62]. In the present DietBot example, the goal is thus to phrase the target effect $P^{*}(Y = 1 \mid do(X = x))$ in terms of the available experimental data from $\pi$ and the observational $P^{*}(X, Y, Z)$. Such a derivation is as follows:
$$
\begin{aligned}
P^{*}(Y = 1 \mid do(X = x)) &= P(Y = 1 \mid do(X = x), S = s^{*}) \\
&= \sum_{z} P(Y = 1, Z = z \mid do(X = x), S = s^{*}) && (1) \\
&= \sum_{z} P(Y = 1 \mid do(X = x), Z = z, S = s^{*})\, P(Z = z \mid do(X = x), S = s^{*}) && (2) \\
&= \sum_{z} P(Y = 1 \mid do(X = x), Z = z)\, P(Z = z \mid do(X = x), S = s^{*}) && (3) \\
&= \sum_{z} P(Y = 1 \mid do(X = x), Z = z)\, P(Z = z \mid S = s^{*}) && (4) \\
&= \sum_{z} P(Y = 1 \mid do(X = x), Z = z)\, P^{*}(Z = z) && (5)
\end{aligned}
$$
Equation (1) follows from the law of total probability, (2) from the product rule, (3) from d-separation (since $Y \perp\!\!\!\perp S \mid Z, X$ in $G_x$), (4) from do-calculus (because, examining $G_x$, $Z \perp\!\!\!\perp do(X = x)$),¹¹ and (5) is simply a notational equivalence for the distribution of Z belonging to $\pi^{*}$. While many theoretical lessons may end at the derivation of the transport formula concluding in equation (5), including the numerical walkthrough using the parameters of Table 3 serves as an effective dramatization for why transportability has important implications for heterogeneous data and policy formation. Consider the scenario wherein the agent designer did not perform a transport adjustment between source and target domains, using only the model that would have been fit during training. In this risky setting, the agent would maximize the source environment's $P(Y = 1 \mid do(X = x))$, for which the optimal choice in the training environment is $X = 0$. However, by properly applying the transport formula, we find that the opposite is true in the deployment environment:
$$
\begin{aligned}
P^{*}(Y = 1 \mid do(X = 0)) &= (0.3)(0.9) + (0.7)(0.1) = 0.34 \\
P^{*}(Y = 1 \mid do(X = 1)) &= (0.4)(0.9) + (0.6)(0.1) = 0.42
\end{aligned}
$$
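The numerical walkthrough can be checked with a few lines of code. In the sketch below, the z-specific experimental rates and the target age distribution are the values appearing in the computation above, under the assumption that the first term of each sum corresponds to Z = 0. The source age distribution `p_z_source` is hypothetical, since Table 3 is not reproduced here; it is chosen only so that X = 0 is optimal in the source environment, as the text states.

```python
# P(Y=1 | do(X=x), Z=z), keyed by (x, z); estimated from the experiment in pi.
p_y_do = {(0, 0): 0.3, (0, 1): 0.7, (1, 0): 0.4, (1, 1): 0.6}

p_z_target = {0: 0.9, 1: 0.1}  # P*(Z) in the deployment environment pi*
p_z_source = {0: 0.1, 1: 0.9}  # HYPOTHETICAL P(Z) in pi; not taken from Table 3

def effect(x, p_z):
    """Transport formula: sum_z P(Y=1 | do(X=x), Z=z) * P(Z=z)."""
    return sum(p_y_do[(x, z)] * p_z[z] for z in p_z)

def best_action(p_z):
    """Action maximizing the (re-weighted) interventional recovery rate."""
    return max((0, 1), key=lambda x: effect(x, p_z))

print("source optimum:", best_action(p_z_source))
print("target optimum:", best_action(p_z_target))
print("target effects:", effect(0, p_z_target), effect(1, p_z_target))
```

Only the weights on the age strata change between environments; the z-specific causal effects are reused unchanged, which is exactly what the absence of a selection node on Y licenses.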
The DietBot example provides a host of important lessons at $\mathcal{L}_2$ of the causal hierarchy, juxtaposing different causal inferences that would be obtained in different environments, demonstrating the utility of graphical models and do-calculus, and the dangers of unobserved confounding. Although these theoretical premises are typically taught in the study of causality in the empirical sciences, their practical utility in AI and ML can be driven home by casting transportability in terms of "training and deployment" environments, and by showing the surprise of opposite inferences that would be drawn with and without adjustment. As learning data scientists, students also obtain insights into the risks and opportunities of heterogeneous data, and how their fusion can overcome an otherwise difficult task of training and deployment environment differences. Plainly, in practice, adjustment formulae would not necessarily be computed by hand like in the above, but the experience of the demonstration is valuable for students; a fuller treatment of automated tools used in transportability can be found in refs. [10,50,62].
 11 This rule is more formally stated in the specific rules of do-calculus as Rule 3: deletion of actions, whose full coverage may be a diversion from topics in traditional AI courses, though would feature prominently in a course with a focus on causal inference. See ref. [1].

Instructor reflections
Within previous offerings of these lessons, the surprise that students experience in the associational exercises and questions of Section 3 continues with the interventional exercises for feature selection, transportability, and data fusion. More than just discussions arising from the revelations $\mathcal{L}_2$ brings, high-school students have shown a keen interest in immediately using $\mathcal{L}_2$ tools to explain everyday experiences and then learning how to encode those using a formal vocabulary. Instructors of introductory courses in AI have expressed frustrations discussing probabilistic models like Bayesian networks as ad hoc or supporting topics that lack an impactful conclusion. However, examining these graphical models through the causal lens yields a fruitful experience for students to move beyond the probability calculus and the mantra that "correlation does not equal causation." Though this mantra is indeed true in general, there is a lesson to be learned in its dual: causation does bestow some structure to observed correlations, and this structure can be harnessed in support of many tasks that lead beyond the data alone.
By using the intuitions of d-separation as the structure of independence relationships in Bayesian networks, this strict graphical explanation of the data serves as an effective stepping stone into causal Bayesian networks and SCMs; by completing this transition, instructors can more fully develop students' understanding of how probability leads to policy.This insight is clearly illustrated by the use of graphical ), and in causal discovery exercises for which an equivalence class of observationally equivalent models may explain some dataset, only some of which may follow a defendable causal explanation.
Along this path, students may struggle to understand the notion of latent variables and unobserved confounding unless the following are explained in unison: (1) the graphical depiction provides a causal explanation for where latent, outside influences may be present, and (2) how these influences outside of the model yield differences in the causal ($\mathcal{L}_2$) and noncausal ($\mathcal{L}_1$) inferences that the data can provide.

Counterfactuals
The counterfactual layer of the hierarchy, $\mathcal{L}_3$, both subsumes and expands upon the previous two, newly allowing for an expression of queries akin to asking: "What if an event had happened differently than it did in reality?" Humans compute such queries often and with ease (as can be elicited from a classroom), especially through the experience of regret, which envisions a better outcome to an unchosen action. Regret is of great utility for dynamic agents, as it informs policy changes for future actions made in similar circumstances (e.g., the utterance of "Had I only exited the freeway earlier, I would not have gotten stuck in traffic" may bias future trips along the route to take side streets instead).
Counterfactual expressions are valuable to reasoning agents for a number of reasons, including that: (1) they allow for insights beyond the observed data, as it is not possible to rewind time and observe the outcome of a different event than what happened; (2) they can be used to establish precedent of necessary and sufficient causes, important for agents needing to understand how actions affect their environment (e.g., "Would the patient have recovered had they not taken the drug?"); and (3) they can be used to quantify an agent's regret, which can be used for specific kinds of policy iteration in even confounded decision-making scenarios.

Structural counterfactuals
Despite the expressive and creative potential of counterfactuals, the common student's initial exposure to them risks being overly formal and notationally heavy, often beginning with the following definition: $Y_x(u) = y$, where $Y_x(u)$ encodes the solution for Y in the mutilated structural system $M_x$, where for every $V_i \in X$, the equation $f_i$ is replaced with the constant x. Alternatively, we can write: $Y_x(u) = Y_{M_x}(u)$. Ostensibly, a counterfactual appears similar to the definition of an intervention. However, although the do-operator expresses a population-level intervention across all possible situations $\forall u \in U$, a counterfactual computes an intervention for a particular unit/individual/situation $U = u$. This new syntax allows us to write queries of the format $P(Y_{X=x} = y \mid X = x')$, which computes the likelihood that the query Y attains value y in the world where $X = x$ (the hypothetical antecedent), given that $X = x'$ was observed in reality. The clash between the observed evidence $X = x'$ and hypothetical antecedent $X = x$ motivates the need for the new subscript syntax and demonstrates how the previous tiers of the hierarchy cannot express such a query.
These expressions are often a source of syntactic and semantic confusion for beginners; an anecdotally better strategy is to instead begin with a discrete, largely deterministic, simple motivating example with a plain-English counterfactual query, and then to work backward to the formalisms.
Example 5.1. MediBot. An automated medical assistant, MediBot, is used to prescribe treatments for simple ailments, one of which has a policy designed around the following SCM containing Boolean variables to represent the presence of an ailment A, its symptom S, prescription of two treatments X, W, and the recovery status of the patient R. The system abides by the SCM in Figure 8.
In addition, we are aware that the ailment's prevalence in the population is $P(A = 1) = 0.1$. Suppose we observe that MediBot prescribed treatment X (i.e., $X = 1$) to a particular patient u. Determine the likelihood that the patient would recover from their ailment had it not prescribed this treatment (i.e., hypothesizing $X = 0$).
To address this counterfactual query, intuitions best begin with the causal graph, whose observational state is depicted in Figure 8. Second, it is instructive to show how the previous layers' notations break down with the query of interest, as we cannot make sense of the contrasting evidence and hypothesis using the do-operator alone (i.e., the expression $P(R = 1 \mid do(X = 0), X = 1)$ is syntactically invalid, having set X to two separate values in the same world). Instead, the query of interest focuses upon the recovery state in the

Step 2: Action. With the exogenous distribution updated on the evidence in Step 1, $P(u) \to P(u \mid e)$, we can effectively discard/ignore the observational model M and shift to the hypothetical twin $M_x$, forcing $X = x$ per the counterfactual antecedent, which in our example means severing all inbound edges to X in $M_x$ and forcing its value to $X = 0$. Let $M^{*}$ be the modified model following steps 1 and 2.
Step 3: Prediction. Finally, we perform standard belief propagation within the modified $M^{*}$ to solve for our query variable, $R_x$, and find that the patient would indeed still have recovered (i.e., $P(R_{X=0} = 1 \mid X = 1) = 1$) because MediBot would still have also administered the other effective treatment, W.
This simple example not only demonstrates the mechanics and potential of structural counterfactuals, but also serves as a launchpad for more intricate and challenging applications. Worthwhile follow-on exercises include the addition of noisy exogenous variables to the system in Example 5.1 (e.g., nondeterministic patient recovery), and analogies to linear SCMs in which the three-step process is repeated through application of conditional expectation. Moreover, the example leads into questions of necessity and sufficiency [64] of the medical treatments, which can segue into other, more applied and data-driven, counterfactuals.
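The three steps (abduction, action, prediction) can be traced in code for a deterministic system. The structural equations of Figure 8 are not reproduced in this section, so the mechanisms in `solve` are assumptions made for illustration: the symptom mirrors the ailment, both treatments are prescribed whenever the symptom is present, and the patient recovers if either treatment is given (or if there was no ailment to begin with).

```python
def solve(a, do_x=None):
    """Evaluate the ASSUMED mechanisms for exogenous ailment state A = a.

    Passing do_x replaces the equation for X with a constant (the action step).
    """
    s = a                             # symptom present iff the ailment is present
    x = s if do_x is None else do_x   # treatment X follows the symptom unless forced
    w = s                             # treatment W follows the symptom
    r = int(x or w or not a)          # recovery via either treatment, or no ailment
    return {"A": a, "S": s, "X": x, "W": w, "R": r}

def counterfactual_recovery(p_a1, observed_x, hypothetical_x):
    """P(R_{X=hypothetical_x} = 1 | X = observed_x) by enumeration over A."""
    prior = {0: 1 - p_a1, 1: p_a1}
    # Step 1 (abduction): keep exogenous states consistent with the evidence.
    posterior = {a: p for a, p in prior.items() if solve(a)["X"] == observed_x}
    norm = sum(posterior.values())
    # Step 2 (action) and Step 3 (prediction): force X, then re-solve for R.
    return sum(p / norm * solve(a, do_x=hypothetical_x)["R"]
               for a, p in posterior.items())

result = counterfactual_recovery(0.1, observed_x=1, hypothetical_x=0)
print(result)
```

Under these assumed mechanisms, the evidence X = 1 pins down A = 1, and the patient still recovers in the hypothetical world because W is still administered, matching the conclusion of the example.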

Counterfactuals for metacognitive agents
In some more adventurous explorations in AI oriented at crafting self-improving and reflective artificial agents, counterfactuals in 3 may prove to be a useful tool for metacognitive agents [65,66].Related to the transportability problem with DietBot in Example 4.2, agents may find the need to evolve their policies learned earlier in their lifespan or in environments that change over time to optimize their performance.This need complements a growing area of reinforcement learning that incorporates causal concepts, especially with respect to meta-learning [13,67,68].To demonstrate such a scenario, we reconsider MediBot in a setting wherein its current policy's decisions are confounded, damaging its performance and requiring it to perform some measure of metacognition to improve that is analogous to the human experience of regret.
... where $Y = 1$ indicates recovery). As such, patients are given the option to choose between the two treatments for the final prescription given. Seemingly innocuous, this patient choice is actually problematic given the following wrinkles:
1. The patient's treatment request is actually affected by an unobserved confounder (UC), linking the treatment and recovery through an uncontrolled backdoor path (Figure 10a). This unobserved, exogenous variable U is unrecorded in the data and could potentially be anything, like the influence of direct-to-consumer advertising of drug treatments that are primarily observed by different treatment-sensitive subpopulations (like a drug that is only advertised on sports radio with a primarily exercise-friendly audience).
2. Because of this confounding influence, MediBot's observed recovery rates are actually less than the FDA's reported ones (Table 4). Worse is that the observed ($\mathcal{L}_1$) and experimental ($\mathcal{L}_2$) recovery rates look equivalent within each respective tier, making it a challenge to determine whether a superior, individualized treatment exists.
 12 The simplicity of Example 5.2 should not undermine the prevalence of confounded decision-making scenarios that are found in many adversarial settings with traditional ML [69] and a myriad of human-decider-AI-recommender scenarios [15].
The data in Table 4 demonstrate the tell-tale sign of unobserved confounding wherein the observed and experimental treatment effects differ, implicating an uncontrolled latent factor that explains the difference. Surprisingly, despite the unknown identity of the confounder, a better treatment policy than MediBot's current one does indeed exist in this context, and is derived from a counterfactual quantity known as the effect of treatment on the treated (ETT) [70]. The ETT traditionally computes the difference between the effect of an alternate treatment $X = x$ and that of the one actually given to an individual, $X = x'$, the counterfactual component of which can be expressed in this context as $P(Y_{X=x} = 1 \mid X = x')$. With only the partially specified model, and the observational and experimental recovery rates, it is possible to compute the ETT for binary treatments (assuming, in this setting, that the patient-requested treatments are observed in equal proportion, $P(X = 0) = P(X = 1) = 0.5$), as in the following derivation that is true for any treatment $X = x$ and its alternative $X = x'$:

$$
\begin{aligned}
P(Y_{X=x} = 1) &= P(Y_{X=x} = 1 \mid X = x)\,P(X = x) + P(Y_{X=x} = 1 \mid X = x')\,P(X = x') \\
P(Y = 1 \mid do(X = x)) &= P(Y = 1 \mid X = x)\,P(X = x) + P(Y_{X=x} = 1 \mid X = x')\,P(X = x') \\
P(Y_{X=x} = 1 \mid X = x') &= \frac{P(Y = 1 \mid do(X = x)) - P(Y = 1 \mid X = x)\,P(X = x)}{P(X = x')}
\end{aligned}
$$

This algebraic trick (using only the law of total probability, together with the fact that the counterfactual under the treatment actually received coincides with the observed outcome) allows us to derive $\mathcal{L}_3$ quantities of interest from a combination of $\mathcal{L}_1$ and $\mathcal{L}_2$ data (though only for binary treatment) and tells an important tale about the system: MediBot is presently in a state of inevitable regret [71] in which the likelihood of recovery for those given treatment under its policy, $P(Y_{X=x} = 1 \mid X = x)$, is 40% less than had those same patients been treated differently, $P(Y_{X=x'} = 1 \mid X = x)$. The "inevitable" part of this regret is also instructive for distinguishing $\mathcal{L}_1$ (what happens in reality/nature) from $\mathcal{L}_3$ (what could have happened differently) quantities because it seems that no matter what decision MediBot makes in reality, there is always a better one that it could have made instead! Ostensibly, this computation yields only bleak retrospect, but it also leads to a surprising remedy for online agents. Two insights contribute to the solution, known as intent-specific decision-making [13]: (1) the formation of the confounded agent's observational/naturally decided action (i.e., its intent) can be separated from the ultimately chosen one, and (2) this intended action choice serves as a backdoor-admissible proxy for the state of the UC (see Figure 10(b) with agent intent I), since, observationally, the final choice always follows the intended one.

When intent is explicitly modeled in a confounded decision-making scenario (Figure 10(b)), the ETT (previously, a retrospective $\mathcal{L}_3$ quantity) can be measured empirically before a decision is made by using do-calculus conversions to an $\mathcal{L}_2$ quantity through a process known as intent-specific decision-making.

Definition 5.4. (Intent-specific decision-making (ISDM)) [14,15,72] In the context of a confounded decision-making scenario with decision X, intent of that decision I, and desired outcome $Y = 1$, the counterfactual component of the ETT can be estimated empirically from intent-specific interventions:

$$
P(Y_{X=x} = 1 \mid X = x') = P(Y = 1 \mid do(X = x), I = x')
$$

In brief, ISDM labels the agent's observational ($\mathcal{L}_1$) decision as intent, which is treated as an observed context satisfying the backdoor criterion, enabling conversion of the counterfactual ETT ($\mathcal{L}_3$) to an empirically estimable causal ($\mathcal{L}_2$) query. The confounded agent can thus choose the action that maximizes the counterfactual ETT to develop a meta-policy that will always act equally, or more, effectively than its original policy's intended action. This technique is known as the regret decision criterion (RDC) [13] and can be expressed (for action X, intent I, and desired outcome $Y = 1$) as follows:

$$
x^{*} = \arg\max_{x} P(Y_{X=x} = 1 \mid X = x') = \arg\max_{x} P(Y = 1 \mid do(X = x), I = x')
$$

For Example 5.2, $P(Y = 1 \mid do(X = 1), I = 0) > P(Y = 1 \mid do(X = 0), I = 0)$, meaning that in settings wherein MediBot intends to treat with $X = 0$, it is better off choosing $X = 1$. The full intent-specific distribution of expected recovery rates is shown in Table 5.
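The ETT computation for binary treatments can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `counterfactual_recovery` is hypothetical, and the observational (0.5) and experimental (0.7) recovery rates are assumed values chosen to be consistent with the equal treatment proportions and the 40% regret described in the text, not values copied from Table 4.

```python
def counterfactual_recovery(p_do_x, p_obs_x, p_x):
    """Return P(Y_{X=x} = 1 | X = x') for a binary treatment X.

    p_do_x  : experimental rate  P(Y = 1 | do(X = x))
    p_obs_x : observational rate P(Y = 1 | X = x)
    p_x     : treatment share    P(X = x), so that P(X = x') = 1 - p_x
    """
    p_x_alt = 1.0 - p_x
    # Law of total probability, solved for the counterfactual term.
    return (p_do_x - p_obs_x * p_x) / p_x_alt


# Assumed illustrative rates (hypothetical, not taken from Table 4).
P_X = {0: 0.5, 1: 0.5}     # P(X = x)
P_OBS = {0: 0.5, 1: 0.5}   # P(Y = 1 | X = x)
P_DO = {0: 0.7, 1: 0.7}    # P(Y = 1 | do(X = x))

regret = {}
for given in (0, 1):
    alt = 1 - given
    # Recovery of patients who received `given`, had they received `alt` instead.
    cf = counterfactual_recovery(P_DO[alt], P_OBS[alt], P_X[alt])
    regret[given] = cf - P_OBS[given]
    print(f"given X={given}: factual={P_OBS[given]:.2f}, "
          f"counterfactual={cf:.2f}, regret={regret[given]:.2f}")
```

Under these assumed rates, both treated groups would have fared better under the alternative, which is the "inevitable regret" pattern. When the observational and experimental rates agree, the same function returns the observational rate, so no regret is detected.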
The RDC is useful because (1) it allows a confounded agent to make strictly better decisions as a function of a confounding-sensitive existing policy (in Example 5.2, by prescribing the treatment opposite the one first intended), even in complete naivety of the confounding factors, (2) it provides an empirical means of sampling a counterfactual datapoint (surprising given the mechanics of counterfactuals) [15], and (3) it can be intuitively rooted for students in the familiar experience of beginning to do something once regretted, stopping, and then choosing differently. A useful analogy of this is the practice of breaking a habit: intent signals a desire that is autonomous, reactionary to the environment or one's state (e.g., desiring a strong drink), which is then suspended by imagining the benefits of an alternative choice.

Intent-specific recovery rates of Table 5, with rows giving the final treatment X and columns the intent I:

         I = 0    I = 1
X = 0     0.5      0.9
X = 1     0.9      0.5
In autonomous systems, the analogy of "habit breaking" can be a useful one for policy improvement in which a maladaptive policy may be improved once a counterfactual predicts a better choice than the one that the current policy would choose.Example 5.2 thus addresses a number of learning outcomes, including the clear distinctions of quantities at all three tiers of the causal hierarchy, how UCs can account for these differences, and how to design agents that either exploit or are resilient to them.
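The contextual bandit version of this exercise can be simulated with the standard library alone. The sketch below is a hypothetical setup, not the assignment used by the authors: it assumes a binary confounder U that fully determines intent, and an assumed recovery mechanism (`recovery_prob`) in which the intended treatment succeeds 50% of the time and the opposite one 90% of the time. The agent first gathers intent-specific experimental data, then applies the RDC.

```python
import random


def recovery_prob(x, u):
    # Assumed mechanism: treatment matching the confounder recovers less often.
    return 0.5 if x == u else 0.9


def run(policy, n, rng):
    """Treat n patients; return (recovery rate, {(intent, action): rate})."""
    counts, wins, total = {}, {}, 0
    for _ in range(n):
        u = rng.randint(0, 1)        # unobserved confounder
        intent = u                   # observational choice driven by U
        x = policy(intent, rng)      # final action, possibly overriding intent
        y = int(rng.random() < recovery_prob(x, u))
        counts[(intent, x)] = counts.get((intent, x), 0) + 1
        wins[(intent, x)] = wins.get((intent, x), 0) + y
        total += y
    table = {key: wins[key] / counts[key] for key in counts}
    return total / n, table


rng = random.Random(0)
N = 20000

# L1: follow intent. L2: randomize the action while still recording intent.
observational_rate, _ = run(lambda i, r: i, N, rng)
randomized_rate, intent_table = run(lambda i, r: r.randint(0, 1), N, rng)

# RDC: for each intent, pick the action with the best intent-specific rate.
rdc_choice = {i: max((0, 1), key=lambda x: intent_table[(i, x)]) for i in (0, 1)}
rdc_rate, _ = run(lambda i, r: rdc_choice[i], N, rng)

print("follow intent:", round(observational_rate, 3))
print("randomized   :", round(randomized_rate, 3))
print("RDC policy   :", round(rdc_rate, 3), rdc_choice)
```

Under these assumptions the RDC policy learns to act against its own intent and outperforms both the intent-following and the randomized policies, without ever observing U.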

Instructor reflections
Motivating the utility of counterfactual inference can begin with active learning through a Socratic dialogue, rooting the capacity for human counterfactual reasoning in experiences like regret. "Why do we not return to restaurants that gave us food poisoning a single time? How do we place this blame of food poisoning on the restaurant? Would we have gotten food poisoning had we not eaten there?" Transitioning from these intuitions to why artificial agents can benefit from the ability to answer similar questions can make for an enjoyable classroom discussion. In classes or levels with more room for debate, discussions on counterfactuals as the origins of human creativity may also yield fruitful explorations. More broadly, the ability of counterfactuals to "escape from the data" can offer inspiration; students have enjoyed the mention of the Lion Man of Ur (an ice-age sculpture depicting a humanoid figure that is half-lion), which demonstrates one of the earliest instances of the human ability to conceive of ideas without a bearing in reality [73].
More formally, situating counterfactuals in the PCH can provide a bridge to other courses or contexts in which the term is used, such as in Rubin's potential outcomes framework [74] or in philosophical and logical discourse [75]. By proceduralizing counterfactual computation in the structural three-step approach, students not only appreciate the reasoning mechanics underlying these other approaches but also receive hints on future applications in the domain of AI.

Conclusion
In this work, we have endeavored to not only impress the importance of causal topics to the future of AI and ML but have also provided instructor-ready content to supplement the existing AI curricula. Through this earlier exposure to causal concepts, we invite a new generation of data scientists, ML practitioners, and designers of autonomous agents to employ and extend these tools to address problems beyond the empirical sciences. Although this work provides only a cursory exposure to the many possible avenues of synthesis for causality and AI, students familiar with its contents will more deeply understand their data, models, and the types of questions that each is capable of answering. Likewise, instructors may find topics in causality to distinguish and enhance their AI courses, give students unique perspectives, and inspire novel avenues of research. As curricular causal integration becomes more widespread, we likewise invite educational researchers to investigate its impact on the student experience and scholarship. The demands of artificial agents continue to extend beyond only associations, so practitioners familiar with causal concepts will be equipped to address the needs of tomorrow apart from only the data of today.

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()  # mean squared error between predictions and targets


def train_model(inputs, y, epochs=1000):
    # Model is not defined in this listing; it is built from the number of input features.
    model = Model(inputs.shape[1])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(epochs):
        optimizer.zero_grad()       # clear gradients from the previous step
        yhat = model(inputs)        # forward pass
        loss = criterion(yhat, y)   # training loss
        loss.backward()             # backpropagate
        optimizer.step()            # update the parameters
    return model
```

Definition 2.2.
(Causal diagram) Given any SCM M, its associated causal diagram G is a DAG that encodes:

Figure 2: Possible causal graph explaining the relationship between variables in Example 2.1.

 7
Full source code is at: https://github.com/CausalEd/exercises. 8 A traditional linear model can provide a confidence interval that will very likely contain 0. z = torch.randn(n,1) # athletic performance for n individuals x = -2 * z + torch.randn(n, 1) # preparation level for n individuals features = torch.cat([x,z], 1) # feature vector with preparation and performance y = x + 3 * z + torch.randn(n, 1) # competitiveness level for n individuals The DAG of Figure 5(a) corresponds to this scenario.Similar to Example 3.2, an ML model can be trained on features X and Z or just on X.First, a feature vector consisting of both X and Z produces the following weights and bias: model = train_model(features, y) weights, bias = model.parameters()print(weights.tolist())# [[1.0000419616699219, 3.0182747840881348]] print(bias.item())# -0.0009028307977132499
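For the second case mentioned above (training on X alone), the following sketch contrasts the two fits on data drawn from the same generating equations. It is an illustration under stated substitutions: it uses closed-form least squares in plain Python instead of the SGD-trained PyTorch model, and the helper names `cov` and `ols_two_features` are hypothetical. With X = -2Z + noise and Y = X + 3Z + noise, the population slope of Y on X alone is cov(X, Y) / var(X) = -1/5, so the sign of the association flips when Z is omitted.

```python
import random


def cov(a, b):
    """Sample covariance (population form) of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)


def ols_two_features(f1, f2, y):
    """Least-squares weights of y on (f1, f2) via the 2x2 normal equations."""
    s11, s22, s12 = cov(f1, f1), cov(f2, f2), cov(f1, f2)
    s1y, s2y = cov(f1, y), cov(f2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


rng = random.Random(0)
n = 50000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [-2 * zi + rng.gauss(0, 1) for zi in z]
y = [xi + 3 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

w_x, w_z = ols_two_features(x, z, y)     # adjusts for the confounder Z
slope_x_only = cov(x, y) / cov(x, x)     # ignores Z entirely

print("weights on (X, Z):", round(w_x, 3), round(w_z, 3))
print("weight on X alone:", round(slope_x_only, 3))
```

The two-feature fit recovers weights near the structural coefficients 1 and 3, while the X-only fit yields a small negative slope, which is the regression analogue of Simpson's paradox in Figure 5(a).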

Figure 5: Potential models explaining Simpson's paradox. (a) Observed confounder Z between X and Y. (b) M-graph with unobserved confounders $U_1$ and $U_2$ between X, Z and Z, Y, respectively.

Definition 4.2.
(Backdoor criterion) [24, p. 61] Given an ordered pair of variables $(X, Y)$ in a DAG G, a set of variables Z satisfies the backdoor criterion relative to $(X, Y)$ if no node in Z is a descendant of X, and Z blocks every path between X and Y that contains an arrow into X.
By employing the backdoor criterion, we control for the spurious correlative pathway $X \leftarrow Z \rightarrow Y$ to isolate the desired causal pathway $X \rightarrow Y$ in estimation of $P(Y \mid do(X))$. Numerically applied to the AdBot Example 3.1 (with backdoor-admissible covariate $Z = \{0, 1\}$), and assuming the model in Figure 5(a):
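The adjustment that the backdoor criterion licenses, $P(Y \mid do(X = x)) = \sum_z P(Y \mid X = x, Z = z)\,P(Z = z)$, can be sketched generically as follows. The clickthrough numbers below are hypothetical placeholders chosen to produce a Simpson's reversal; they are not the values of Table 2, and the function names are illustrative.

```python
def backdoor_adjust(p_y_given_xz, p_z, x):
    """P(Y = 1 | do(X = x)): average the z-specific rates over the marginal P(Z)."""
    return sum(p_y_given_xz[(x, z)] * pz for z, pz in p_z.items())


def observational(p_y_given_xz, p_z_given_x, x):
    """P(Y = 1 | X = x): average the z-specific rates over P(Z | X = x)."""
    return sum(p_y_given_xz[(x, z)] * pz for z, pz in p_z_given_x[x].items())


# Hypothetical distributions (assumptions for illustration only).
p_z = {0: 0.5, 1: 0.5}                                        # P(Z = z)
p_z_given_x = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.8, 1: 0.2}}      # P(Z = z | X = x)
p_y_given_xz = {(0, 0): 0.2, (0, 1): 0.6,                     # P(Y = 1 | X = x, Z = z)
                (1, 0): 0.3, (1, 1): 0.7}

for x in (0, 1):
    print(f"X={x}: observed={observational(p_y_given_xz, p_z_given_x, x):.2f}, "
          f"adjusted={backdoor_adjust(p_y_given_xz, p_z, x):.2f}")
```

With these placeholder values, X = 0 looks better observationally while X = 1 is better under intervention, because the observational estimate weights each stratum by how often it co-occurs with the ad rather than by its share of the population.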

Example 4.1.
Feature selection playground. Consider the SCM in Figure 6 with treatment X, outcome Y, and covariates $\{R, T, W, V\}$. Determine which of the covariates should be included in addition to X in the feature vector Z to provide: (1) the most precise observational estimate of Y, $P(Y \mid X, Z)$, and (2) an unbiased estimate of the causal effect of X on Y, $P(Y \mid do(X))$.
For the causal estimate: (1) controlling for T opens the backdoor path $X \leftrightarrow T \leftrightarrow Y$, (2) controlling for W blocks the chain $X \rightarrow W \rightarrow Y$, and (3) controlling for V opens a spurious pathway at the collider $X \rightarrow V \leftarrow Y$. Thus, $Z = \{R\}$ serves as a backdoor-admissible set to allow for estimation of $P(Y \mid do(X))$ via adjustment as in Example 3.1.
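Point (3), the bias introduced by controlling for a collider, can be illustrated numerically. The sketch below does not reproduce Figure 6: it assumes a hypothetical three-variable linear SCM (X causes Y with effect 1, and V is a common effect of X and Y) and uses closed-form least squares with illustrative helper names.

```python
import random


def cov(a, b):
    """Sample covariance (population form) of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)


def ols_two_features(f1, f2, y):
    """Least-squares weights of y on (f1, f2) via the 2x2 normal equations."""
    s11, s22, s12 = cov(f1, f1), cov(f2, f2), cov(f1, f2)
    s1y, s2y = cov(f1, y), cov(f2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


rng = random.Random(1)
n = 50000
# Assumed SCM: X -> Y with unit effect, and collider X -> V <- Y.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]
v = [xi + yi + rng.gauss(0, 1) for xi, yi in zip(x, y)]

effect_without_v = cov(x, y) / cov(x, x)          # features: X only
effect_with_v, w_v = ols_two_features(x, v, y)    # features: X and collider V

print("estimated effect of X, V excluded:", round(effect_without_v, 3))
print("estimated effect of X, V included:", round(effect_with_v, 3))
```

In this assumed model the X-only fit recovers the true effect of about 1, whereas including V drives the weight on X toward 0, even though V improves the purely predictive fit. This mirrors the contrast between objectives (1) and (2) in the exercise.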

Figure 6: Feature selection playground on a causal diagram with treatment X, outcome Y, and other covariates.

Definition 4.3.
(Selection diagram) [61] Let $\langle M, M^* \rangle$ be a pair of SCMs, for a source and a target environment, sharing a causal diagram G. By introducing selection nodes (boxed variables representing causes of variables that differ between source and target environment), $\langle M, M^* \rangle$ is said to induce a selection diagram D if D is constructed as follows:
1. Every edge in G is also an edge in D.
2. D contains an extra edge $S_i \rightarrow V_i$ (i.e., between a selection node and some other variable) whenever there might exist a discrepancy $f_i \neq f_i^*$ or $P(U_i) \neq P^*(U_i)$ between $M$ and $M^*$.

Figure 7: Causal and selection diagrams for data collected in different environments but with the same causal graph G. (a) Target/deployment environment $\pi^*$, causal graph G, eliciting $P^*(X, Y, Z)$. (b) Source/training environment $\pi$, submodel $G_x$, eliciting $P(Z, Y \mid do(X))$. (c) Selection diagram D constructed from shared graph G.
models in which observations and interventions can disagree (as in Example 3.1, $P(Y \mid X) \neq P(Y \mid do(X))$), how the environments and circumstances of data collection powerfully matter (as in Example 4.2, (

Definition 5.1.
(Counterfactual) [9, p. 204] In an SCM M, let X and Y be two subsets of endogenous variables such that $\{X, Y\} \in V$. The counterfactual sentence "Y would be y (in situation/instantiation of exogenous variables $U = u$), had X been x" is interpreted as the equality $Y_x(u) = y$.
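The equality $Y_x(u) = y$ can be evaluated with the three-step procedure of abduction, action, and prediction. The sketch below assumes a hypothetical one-equation SCM, Y = 2X + U_y, that is not taken from Example 5.1; the function names are illustrative.

```python
def f_y(x, u_y):
    # Hypothetical structural equation for Y (an assumption for illustration).
    return 2 * x + u_y


def counterfactual_y(x_obs, y_obs, x_alt):
    """Value Y would have taken had X been x_alt, given the observed (x_obs, y_obs)."""
    # Step 1, abduction: recover the exogenous term consistent with the evidence.
    u_y = y_obs - 2 * x_obs
    # Step 2, action: replace the observed treatment with the hypothetical one.
    x = x_alt
    # Step 3, prediction: re-evaluate the structural equation for the same unit.
    return f_y(x, u_y)


# A unit observed with X = 1 and Y = 3: what would Y have been had X been 0?
y_cf = counterfactual_y(x_obs=1, y_obs=3, x_alt=0)
print("Y had X been 0:", y_cf)
```

Because the exogenous term is held fixed across the factual and hypothetical worlds, the answer is specific to the observed unit rather than an average over the population.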

Figure 8: SCM M and its associated graph G pertaining to Example 5.1.

Example 5.2.
Confounded MediBot.¹² MediBot is back assigning treatment for a separate condition in which two treatments $X \in \{0, 1\}$ have been shown to be equally effective remedies by the Food and Drug Administration (FDA) randomized clinical trial (i.e., (

Figure 10: SCM associated with Example 5.2 with treatment X, recovery Y, UC U, and intent I. (a) Observational model. (b) The same system, but with intent explicitly modeled.

Definition 5.3.
(Intent) [13] In a confounded decision-making scenario with desired outcome $Y = 1$, final agent choice X, unobserved confounder(s) $U_c$, and structural equation $X \leftarrow f_x(U_c)$, SCMs modeling the agent's intent I represent its pre-choice $\mathcal{L}_1$ response to $U_c = u_c$ such that I adopts the structural equation of X, $I \leftarrow f_x(U_c)$, and the structural equation for X indicates that, observationally, the final choice always follows the intended one.

For Example 5.2, students could find (either analytically through Table 4 or experientially through a contextual multi-armed bandit assignment) that (

Table 5: Intent-specific recovery rates for the confounded MediBot Example 5.2.

Table 1: Probability distribution of pharmacological data of Example 2.1. Conditional probability tables from the model in Figure 3 are given, but truncated only to necessary probabilities.
3 More advanced algorithms increase efficiency and accommodate latent structures [9, pp. 50-54].
4 Tetrad https://www.ccd.pitt.edu/tools was used to generate data and aid in the DAG reassembly process. The complete dataset can be downloaded at https://learn.ci/data/causal_education.csv.

Table 2: Clickthroughs in the AdBot setting stratified by the ad shown to participants in a focus group, and the age partition of the viewer.
level, respectively. The outcome, Y, is the level of athletic performance. The following PyTorch code⁷ generates example data:

Table 4: MediBot's observed treatment recovery rates vs those reported by the FDA's randomized clinical trials.