Confidence in Causal Inference under Structure Uncertainty in Linear Causal Models with Equal Variances

Inferring the effect of interventions within complex systems is a fundamental problem of statistics. A widely studied approach employs structural causal models that postulate noisy functional relations among a set of interacting variables. The underlying causal structure is then naturally represented by a directed graph whose edges indicate direct causal dependencies. In a recent line of work, additional assumptions on the causal models have been shown to render this causal graph identifiable from observational data alone. One example is the assumption of linear causal relations with equal error variances that we will take up in this work. When the graph structure is known, classical methods may be used for calculating estimates and confidence intervals for causal effects. However, in many applications, expert knowledge that provides an a priori valid causal structure is not available. Lacking alternatives, a commonly used two-step approach first learns a graph and then treats the graph as known in inference. This, however, yields confidence intervals that are overly optimistic and fail to account for the data-driven model choice. We argue that to draw reliable conclusions, it is necessary to incorporate the remaining uncertainty about the underlying causal structure in confidence statements about causal effects. To address this issue, we present a framework based on test inversion that allows us to give confidence regions for total causal effects that capture both sources of uncertainty: causal structure and numerical size of nonzero effects.


Introduction
Questions about causal relations are at the heart of many research problems.Researchers across various fields, ranging from economics to biology and medicine, are interested in distinguishing causes from effects to predict the outcome of interventions in complex systems.While controlled experiments provide an established method for inferring causal relations, in many applied settings interventional studies are not feasible for ethical or cost considerations.Thus, inferring causal relationships based solely on observational data is an important task that the field of causal discovery addresses.
In the last decades, causal discovery has gained popularity, with much fundamental research being done on the topics of identification and estimation [12,18].Identifiability results clarify when it is theoretically possible to go beyond mere correlations.More specifically, they clarify under which circumstances assumptions lead to unique identification of interventional distributions from a given joint probability distribution.Moreover, numerous structure learning algorithms have been proposed that infer the underlying causal structure based on observed data.Knowledge of the underlying causal structure then allows researchers to reason about the effect of interventions in complex systems and given a causal structure, standard statistical methods may be applied to estimate the involved causal effects and provide an uncertainty assessment for the numerical size of the effects.However, in the typical applied setting, the exact causal structure is (at least partially) unknown.In order to draw reliable conclusions and make calibrated confidence statements, the remaining uncertainty about the underlying causal structure needs to be incorporated in causal inference.
Without prior expert knowledge about the underlying causal structure, a commonly employed "naïve" approach would involve splitting the task.In the first step, a causal learning algorithm estimates the underlying causal structure.In the second step, within the inferred model, confidence intervals for the causal parameters can be calculated using classical statistical inference methods.However, the resulting intervals are not valid if the causal learning algorithm infers the wrong causal structure.Such a two-step method conditions away the uncertainty in the causal structure arising from the data-driven model choice and the classical 'double-dipping' problem occurs since both causal structure and the effect are estimated from one data set.Thus, the approach is overly optimistic in its conclusion about the strength and existence of causal effects, and its confidence regions do not achieve the desired coverage probability, especially under high uncertainty with respect to the underlying structure.
Despite the popularity and growing interest in causal discovery, we are unaware of prior attempts to rigorously account for uncertainty in the causal structure when providing confidence statements about causal effects.Here, we use the term confidence in the technical sense for a set with a given desired frequentist coverage probability.A completely different approach for uncertainty quantification that we do not consider here is to form Bayesian credible sets.In fact, a number of authors have used Bayesian approaches for uncertainty quantification in causal discovery, but primarily with an emphasis on the uncertainty in the graphical structure as opposed to causal effects; see [10], [5], and [2] for three selected examples.
In this article, we want to shed light on the problem by highlighting the challenges of the task of handling structure uncertainty and by proposing a solution for constructing confidence intervals for total causal effects that capture both types of uncertainty, the uncertainty regarding the causal structure resulting from a data-driven model selection and the uncertainty about the numerical size of the effect.Our solution is feasible for problems of small to moderate scale-e.g., our simulation experiments will consider problems with up to 12 variables.However, already at this scale, rigorously accounting for structure uncertainty means to account for more than 10 26 possible structures, reflecting the fast superexponential growth with the number of nodes in the causal graph.
In Section 2 of the paper, we review the considered framework of structural causal models [14,11].More specifically, in order to ensure identifiability from observational data alone, we consider a restricted class of causal models, namely linear structural equation models with Gaussian errors and equal variances [13] and review the definition of a total causal effect, which is the target of interest in this study.We then generalize and improve the ideas introduced for bivariate problems in Strieder et al. [20] and present a general framework for constructing confidence intervals for total causal effects in multivariate data (Section 3 and 4).Furthermore, we present computational shortcuts and analyze the performance of the introduced framework in numerical experiments (Section 5).In the concluding Section 6 we discuss extensions of the framework to partially quantify the uncertainty over causal structures.

Background
This section reviews linear structural equation models and the total causal effect which is the target of interest in this study.

Linear Structural Equation Models
Structural causal models constitute a commonly employed tool to study causal relationships and represent noisy causal relations among a set of interacting variables.Each variable is modelled as a function of a subset of other variables, its causes, and a stochastic error term.The causal perspective emerges from viewing those relations as making assignments rather than representing mathematical equations.The left-hand side, the effect, is assigned the value specified on the right-hand side, the causes.Externally varying the values on the right-hand side results in a change on the left-hand side, but not vice versa.This reflects the asymmetry inherent in cause-effect relations.
We assume access to observational data in the form of a sample of independent copies of a random vector X = (X 1 , . . ., X d ); without loss of generality, we assume the random vector to have zero mean.Without any further assumptions on the structural causal model, e.g. the involved functions, identification of the underlying causal structure is only possible up to a Markov equivalence class.Therefore, to ensure unique identifiability, following a line of research initiated by [13], we focus on linear relations and normal distributed errors with equal variances by assuming that X solves the equation system Here, B := [β j,i ] d j,i=1 are unknown parameters that represent the direct causal effects between the variables, and ε j are independent, normally distributed error terms ∼ N (0, σ 2 ), j = 1, . . ., d, with common, unknown variance σ 2 > 0. Each specific Linear Structural Equation Model (LSEM) restricts a subset of the direct effects β j,i to be zero.Thus, LSEMs are naturally represented by a (minimal) directed graph G(B) with vertex set V = {1, . . ., d}.In the associated directed graph, the edge set equals the support of B, thus, a missing edge i → j indicates that β j,i = 0.As in related work, we assume the underlying directed graph to be acyclic (DAG), which entails that B is permutation similar to a triangular matrix and the system (1) admits the unique solution X = (I d − B) −1 ε, with I d denoting the d × d identity matrix.Incorporating the equal variance assumption, the covariance matrix of X is seen to be In the sequel, we will use the following graphical concepts.Each DAG admits a (not necessarily unique) topological ordering of its vertices, that is, a permutation π of {1, . . ., d} such that the existence of a directed path from node π(i) to node π(j) implies i < j.We write i < G j if node i precedes node j in such a causal ordering of the graph.If the DAG contains an edge from node i to node j, then node i is a parent of node j, and we denote the set of all parents of node j with p(j).It has been shown that under the above modelling assumptions with equal error variances, i.e., homoscedasticity across all interacting variables, the causal ordering is identifiable and corresponds to an ordering of conditional variances [4,9].Hence, causal effects can also be uniquely identified.

Causal Effects
Informally put, we are interested in estimating how much the value of X j changes if we externally intervene in the system and set the value of X i to x i .Such an intervention represents a change in the true data-generating process and, thus, a modification in our probabilistic model.In the structural equation framework, the effect of such an external intervention can be expressed by replacing the i-th equation with X i = x i and is denoted as do(X i = x i ).
In this work, we seek to construct confidence intervals for the total causal effect C(i → j) in LSEMs.This effect is formally defined as the unit change in the expectation of X j with respect to an intervention in X i , that is, Remark 2.1.Note that the total causal effect in an LSEM (B, σ 2 ) can be intuitively expressed based on the respective underlying DAG G(B).The effect C(i → j) equals the sum of all products of edge weights along all directed paths from i to j.Furthermore, the effect is zero if there does not exist a directed path from i to j.This follows from the path properties of the matrix (I d − B) −1 that computes the solution X of the structural equations from the independent errors.
Example 2.2.Consider the following LSEM and the corresponding (minimal) DAG: The total causal effect is given by the sum of all products of edge weights along all directed paths.For example, we obtain Our aim is to construct a confidence interval for the total causal effect C(i → j) when the underlying causal structure is not known and has to be learned.We stress that, under the identifiability assumption of an underlying Gaussian LSEM with equal error variances, this is a well-defined problem.In fact, every admissible distribution can be represented by a single (minimal) DAG.This DAG then uniquely defines the total causal effect.
A "naïve" two-step approach that first learns a causal structure and then calculates confidence intervals in the inferred model does not incorporate the uncertainty regarding the underlying causal structure.To respect this uncertainty in the calculations of a valid confidence interval, it is necessary to take all possible structures, i.e. graphs, into account.As a result, the target quantity has a composite structure and, thus, obtaining finite sample distributions of estimators is difficult.
In many situations where the sampling distribution of estimators is unobtainable, bootstrapping and other resampling methods can offer an appealing framework to construct confidence intervals.At first sight, resampling may appear to provide a simple solution to the problem under discussion.This solution relies on computing for each resampled data set a causal effect estimate.To obtain this estimate one may compose a consistent model selection method that learns the causal graph with a consistent causal effect estimator in the learned model.Note that this composition is well-defined due to the identifiability of LSEMs with equal error variances.However, as a statistic, this composition lacks smoothness and, thus, bootstrapping procedures may fail severely [1,8].We emphasize that Drton and Williams [8] demonstrate that even in low dimensions, the asymptotic behavior of confidence intervals and bootstrap tests is difficult to predict for complex composite settings as encountered here.The subtleties stem from the fact that the set of covariance matrices associated with at least one possible DAG form a union of smooth manifolds.This union contains singularities at the intersections of the manifolds, which is highlighted in the bivariate case by Strieder et al. [20].They demonstrate that singular points arise at the intersections of the two models.Consequently, bootstrapping cannot be used to capture model uncertainty correctly.This failure of the bootstrapping methods can also be observed in our simulations in Section 5.3.
In conclusion, there is a need to develop new frameworks for constructing confidence intervals that circumvent the problems that originate from the non-smooth nature of the target of interest.

Confidence Intervals via Test Inversion
In the remainder of this paper, we develop a framework that circumvents the problems posed by the non-smooth nature of the total causal effect and provides valid confidence intervals for causal inference under structure uncertainty.In this section, we introduce our main ansatz, the test inversion approach, and derive the concrete hypothesis that subsequently needs to be tested.

Inversion of Tests
Our main idea is to leverage the classical duality between statistical hypothesis tests and confidence regions in the following way.If we perform valid hypothesis tests for all attainable causal effects, then all values that we cannot reject form a confidence region for the total causal effect.More specifically, let α ∈ (0, 1) be a fixed significance level.We perform (asymptotically) valid level α tests for the hypothesis of a fixed total causal effect ψ for all attainable values of the causal effect, that is, for all ψ ∈ R, we test with a test whose acceptance region we denoted by A(ψ).Then a (asymptotically) valid (1 − α)-confidence region for the total causal effect C(1 → 2) for the data X is given by A proof of this classical result can be found, e.g., in [3,Theorem 9.2.2].
With the above perspective, our problem is shifted to developing suitable hypothesis tests for all attainable causal effects.Due to the lack of concrete knowledge about the model, all possible causal graphs must be respected.Thus, we have to construct hypothesis tests in a composite setting that remains challenging.In the next section, we explicitly describe the hypotheses that need to be tested to construct confidence regions for the total causal effect.

Hypothesis of a Fixed Causal Effect
Recalling the assumptions of an underlying LSEM with homoscedastic errors across all interacting variables, we consider the following task.We assume that the given data consists of random vectors X (1) , . . ., X (n) drawn independently from a centered multivariate normal distribution N (0, Σ 0 ).The distribution's density p(x|Σ 0 ) factorizes according to an unknown DAG while satisfying the assumption of equal error variances.To account for the uncertainty in the causal structure, we need to test at level α whether any multivariate centered normal distribution that factorizes to any DAG under the equal variance assumption allows for a total causal effect of size ψ, and repeat that test for all values ψ ∈ R. In light of the normality assumption, it is natural to parameterize our statistical model in terms of the covariance matrix Σ.Further, the additional assumption of homoscedasticity across all interacting variables restricts the set of possible covariance matrices further as we detail now.
Given a fixed DAG G, under the assumption of equal error variances, all possible normal distributions with a density that factorizes according to G are defined by the family where Considering and combining all possible DAG models, our overall model corresponds to the following union of sets of covariance matrices: where G(d) is the set of all DAGs with d nodes.
Hence, in the definition of M, we only need to consider the union over all complete DAGs with d nodes, i.e. all possible orderings.
The following Proposition shows that we can identify the model M, which is a subset of all positive definite matrices satisfying the assumption of equal error variances, solely based on polynomial constraints in the entries of the covariance matrix.Here, and in the remainder of this article, we write Σ i,j|p(i) for the conditional covariance matrix, that is, Proposition 3.2.For a covariance matrix Σ ∈ PD(d), the following statements are equivalent: (i) Σ ∈ M, (ii) There exists a complete DAG G and σ 2 > 0 such that for all k = 1, . . ., d Proof." =⇒ " Let Σ ∈ M. Then there exists a DAG G such that N (0, Σ) ∈ N (G).Considering Remark 3.1, we can assume without loss of generality that G is complete.Straightforward calculations with the multivariate normal density then imply that the variances of each node conditioned on its parents are equal to σ 2 ; or see a combinatorial argument as in [7,Equation (7.4)].Thus, for all k = 1, . . ., d, " ⇐= " Since G is complete, the multivariate normal distribution N (0, Σ) factorizes according to the graph G.This implies the existence of a matrix B ∈ R G and a positive definite is equal for all k = 1, . . ., d and the claim follows.
Remark 3.3.Note the identifiability result under homoscedasticity across all interacting variables, e.g., Theorem 1 in [4].Each Σ ∈ M corresponds to a unique (B, σ 2 ) and, thus, a unique minimal DAG G(B).As explained in Remark 3.1, the union in the definition of M (3) is not necessarily disjoint and analogously the (complete) graph satisfying Condition 2 in Proposition 3.2 is not necessarily unique.Nevertheless, since B is unique, all graphs satisfying Condition 2 in Proposition 3.2 share the same edge set with non-zero edge coefficients and correspond to all valid causal orderings of the minimal DAG.Thus, every 'additional' parent of node i in a complete DAG G from Condition 2 compared to the unique minimal DAG G(B) is conditionally independent of node i given p G(B) (i).This follows by d-separation since every (undirected) path between node i and an additional parent either traverses the parents p G(B) (i) in the minimal DAG or contains a collider c with i < G(B) c and thus this collider has no descendent in p G(B) (i).
Our testing problem is the following classical statistical problem.We are concerned with a parametric model M, parameterized by the covariance matrix, which is an intricate union of subsets of the cone of positive definite matrices given by polynomial constraints.Within that statistical model, we need to test for all attainable values of the total causal effect ψ.In the following Proposition, we show that the parameter of interest is a function of the covariance matrix.Thus, the considered statistical testing problem can be fully parameterized in terms of the covariance matrix.Proposition 3.4.Let Σ ∈ M. Then the total causal effect C(i → j) is given by Proof.The case j < G i is trivially true, since in this case, the total causal effect C(i → j) is zero and, considering G is complete, the parent set p(i) contains the node j, which yields Σ j,i|p(i) = 0.
Assume i < G j. Since the set p(i) satisfies the back-door criterion with respect to i and j, we have It follows that the total causal effect C(i → j) is the regression coefficient of X i when regressing X j on (X i , X p(i) ), which is given by Σ j,i|p(i) (Σ i,i|p(i) ) −1 .We emphasize that this expression does not depend on the specific choice of the DAG G.All graphs satisfying Condition 2 in Proposition 3.2 share the same egde set with nonzero edge weights and thus the parent set p(i) always satisfies the back-door criterion in the corresponding unique minimal DAG, see Remark 3.3.
Using Proposition 3.4, we can define the hypothesis space M ψ ⊂ M, given by all covariance matrices Σ ∈ M that allow for a total causal effect of size ψ via Put together we have to solve the statistical testing problem for all attainable ψ ∈ R. Then we invert the tests to construct valid confidence intervals for the total causal effect.Similar to the union M, we can write the hypothesis space as a union of single hypotheses over all (complete) DAGs, that is Remark 3.5.It is clear that for ψ ̸ = 0 we only need to consider the union over all graphs with i < G j.In contrast, for a zero effect we have to consider all graphs, since the effects of multiple directed paths might 'cancel' each other.Further, for graphs with j < G i the constraint given by the size zero effect is not informative, since j ∈ p(i) and consequently Σ j,i|p(i) = 0. Therefore, the corresponding hypothesis space has higher dimension.
Example 3.6.Consider the following LSEM and the corresponding (minimal) DAG: There exist two directed paths from node 1 to node 2 in the corresponding (minimal) DAG and thus, 1 < G 2 in all valid causal orderings.However, the effect of the two path 'cancel' each other, such that C(1 → 2) = 0.25 − 0.5 2 = 0.

Testing for Total Causal Effects
For constructing confidence intervals for the total causal effect, we invert tests of the hypothesis (5).In this section, we present two concrete tests based on likelihood ratio theory: (i) a classical constrained likelihood ratio test [17], and (ii) an approach based on the recently proposed split likelihood ratio test [21].

Constrained Likelihood Ratio Test
The first option we consider is to form a classical likelihood ratio statistic for the testing problem (5).Writing ℓ n for the Gaussian log-likelihood function based on the sample X (1) , . . ., X (n) , the likelihood ratio statistic for ( 5) is In regular settings the statistic may be compared to a chi-square quantile for an asymptotic test.Here, however, asymptotic distribution theory is difficult because both the null hypothesis and the alternative are intricate unions and do not satisfy the usual regularity assumptions.In the simplified bivariate setting, Strieder et al. [20] avoid the difficulties posed by the existence of singularities by testing modified problems in two different approaches.In the first approach, the hypothesis and alternative are relaxed to test an ordering of (conditional) variances.However, it is not feasible to optimize this modified testing problem in problems beyond the bivariate case.Further, the mixture weights of the arising limit distribution, which is a mixture of chi-square distributions, are difficult to determine in higher dimensions.The second approach relaxes the alternative to an unconstrained Gaussian model.In this modification, the statistic ( 7) is changed to a larger statistic for which the optimization of M is replaced by an optimization over the positive definite cone PD(d).This approach, however, may cause problems in the test inversion approach, as the union of tested null hypotheses no longer coincides with the alternative.In particular, small confidence regions may arise due to model misspecification.
In the present paper we will derive our confidence region for the total causal effect from tests that reject for too large values of the original statistic λ(ψ) n from (7).However, we will consider relaxing the alternative M and employ the theory of intersection union tests to obtain a simple yet effective upper bound on the distribution of λ(ψ) n .To obtain this upper bound, first note that the hypothesis we need to test is an intricate union of single hypotheses given by G∈G(d) M ψ (G).While we will later show that every single hypothesis defines a smooth submanifold, their union is not necessarily a smooth submanifold again.Nevertheless, the original test statistic (7) can then be written as λ(ψ) Thus, the original test statistic can be upper bounded by every single test statistic λ(ψ) n (G) and a valid level α test of the union of single hypotheses can be constructed by testing every single hypothesis.We then reject the union null hypothesis if we reject all single hypotheses.
A formal proof of this classical theory of intersection union tests can be found, e.g., in [3,Theorem 8.3.23].
In the second step, we relax the alternative to an unconstrained Gaussian model to obtain the following larger test statistic Relaxing the alternative to an optimization over the entire cone of positive definite matrices PD(d) potentially increases the achieved likelihood and consequently upper bounds each single test statistic λ(ψ) n (G).Combining both approximations yields an upper bound of the original test statistic λ(ψ) n and thus, a simple way to construct asymptotically conservative hypothesis tests.In the following, we state our main result by turning them into conservative confidence intervals for the total causal effect C(i → j) via test inversion.Theorem 4.1.Let α ∈ (0, 1).Then an asymptotic (1 − α)-confidence set for the causal effect C(i → j) is given by Note that we only need to consider complete DAGs, which drastically reduces the number of single hypothesis tests that need to be evaluated and thus the computational burden.Further, we need to respect graphs with j < G i only for zero-sized effects.Therefore, our proposed confidence regions may contain a non-zero interval and an isolated zero.In contrast to the bivariate case, however, zero effects can also correspond to orderings with i < G j due to cancellation, see also Remark 3.5.
Proof.Let ψ ∈ R and Σ ∈ M ψ .Since M ψ = G∈G(d) M ψ (G) there exists a complete graph G, such that Σ ∈ M ψ (G).If i < G j, it follows from the introduced upper bound and Lemma 4.3 We obtain an analogous result for j < G i with Lemma 4.4.Employing the test inversion approach yields the claim.
In the following, we derive the asymptotic distribution of the upper bound λ (ψ) n (G) of our test statistic under every single hypothesis M ψ (G).As previously indicated, the cases i < G j versus j < G i define different dimensional hypothesis spaces, thus we consider both instances separately.
First, we consider the case i < G j and assume without loss of generality that (1, 2, . . ., d) is the causal ordering.In this case, the single hypothesis can be written as The relaxed model space of all positive definite matrices PD(d) can be identified with an open subspace in R , since the Jacobian of f ψ has full rank d.To see that, note its (nonzero) triangular structure for derivatives for (Σ k,k ) k=2,...,d in all rows except the first row.The first row, however, has a nonzero derivative for Σ j,i , and zero derivatives for (Σ k,k ) k=j,...,d .
Using the fact that this single hypothesis defines a smooth submanifold, we determine the limit distribution of the upper bound λ Proof.Since the hypothesis space ) and the claim follows.For details on this classical result, we refer to [6].
In the case j < G i the causal effect from i to j can only be zero.Therefore, we solely need to test the single hypothesis H (0) 0 (G).We assume again without loss of generality that (1, . . ., d) is the underlying causal order.The polynomial constraint in the hypothesis (6) corresponding to the fixed size effect is not informative, as the node j is included in p(i) and thus Σ j,i|p(i) = 0. Therefore, the hypothesis H Thus, similarly to the previous case, we obtain a chi-square distribution as the limit distribution, however with fewer degrees of freedom.
Proof.Analogously to Lemma 4.3.Note that the Jacobian has full rank d − 1 due to its triangular structure.
With Theorem 4.1 we defined an asymptotically valid confidence interval for the total causal effect that is able to capture both types of uncertainty, the uncertainty about the causal structure and the uncertainty about the numerical size of the effect.Besides the theoretical guarantees for the asymptotic coverage of the proposed confidence set, we demonstrate in the simulation section, that even in finite samples, the confidence set achieves the desired coverage probability.We emphasize again that our new approach uses the original test statistic (7) with an upper bound on its distribution.Thus, we avoid the issues that may arise under model misspecification in the (bivariate) method by Strieder et al. [20] due to contrasting a hypothesis restricted to equal error variances with a general alternative.
In the following we propose an alternative framework for constructing hypothesis tests for (5) based on the theory of universal inference [21].While this framework leads to more conservative intervals, it comes with a finite sample guarantee.

Split Likelihood Ratio Test
Likelihood ratio tests provide a powerful tool for constructing hypothesis tests that are calibrated via asymptotic distribution theory.However, in the present context the needed distribution-theoretic insights are difficult to obtain and we adopted stochastic upper bounds.An interesting alternative is to conduct the needed tests in the framework of universal inference that was introduced by Wasserman, Ramdas and Balakrishnan [21].Their method employs a modification of the likelihood ratio statistic, called split likelihood ratio, based on a data splitting approach.Due to the independence of the split data, one can apply a universal critical value, instead of relying on asymptotic distribution theory.Type-I error control is then guaranteed by an application of Markov's inequality.In the following, we employ their methodology for constructing conservative confidence regions for total causal effects with finite sample guarantee.
To construct a split likelihood ratio test for the test problem (5), we proceed as follows.First, we split the data into two subsets D 0 = {X (1) , . . ., X (k) } and D 1 = {X (k+1) , . . ., X (n) } and define the log-likelihood functions ℓ j (Σ) := X (i) ∈D j log p(X (i) |Σ), j = 0, 1, based on D 0 and D 1 , respectively.Let Σ1 := argmax Σ∈M ℓ 1 (Σ) be the maximum likelihood estimator of Σ restricted to LSEMs with equal error variances based on D 1 .Then the split likelihood ratio test statistic is defined as The insight of Wasserman, Ramdas and Balakrishnan [21] is that for any fixed Σ * ∈ PD(d) it holds that and, thus, Markov's inequality implies that for any α ∈ (0, 1), under Thus, the decision rule reject constitutes a valid level α test.This test holds level in finite samples and without any regularity conditions.Thus, we can define the following confidence region for the total causal effect based on the split likelihood ratio test.
Remark 4.6.Similar to the asymptotic confidence set given by Theorem 4.1, we need to consider all graphs when testing for zero effects.Thus, the confidence region may consist of a non-zero interval and a disconnected zero that reflects the remaining uncertainty about the existence of an effect.
Proof.This result follows directly from the test inversion approach and (9).Note that for non-zero effects we only need to consider graphs with i < G j.

Computational Details and Numerical Experiments
In this section we present computational details as well as shortcuts in our proposed algorithm that significantly improve the computation time.We also present a simulation study in which we analyze and compare the coverage probabilities and performance of the different proposed testing procedures.

Algorithm and Computational Shortcuts
We focus on the algorithm for calculating the likelihood ratio based confidence regions introduced in Theorem 4.1, denoted with LRT.By adjusting the critical values in the algorithm and incorporating the data splitting procedure, the split likelihood ratio based confidence regions can be computed analogously.As the number of possible underlying causal structures, that is, the number of DAGs, grows superexponentially with the number of nodes, we need to carefully use combinatorial shortcuts in order to be able to compute our proposed confidence intervals for the total causal effect.Our algorithm, presented as Algorithm 1, consists of two subroutines.The first routine possible.order(Algorithm 2) immediately rejects all implausible causal orderings and, thus, reduces the set of possible causal orderings in which we subsequently test for all attainable effects with the second routine testeffect (Algorithm 3).Carefully reducing the space of plausible causal orderings drastically decreases the computation time and allows us to compute our proposed confidence intervals, which take uncertainty over all possible DAGs into account, in a reasonable time for a moderate (but far from trivial) number of involved variables.Precise computation times can be found in Subsection 5.3, which presents the results of a simulation study.

Algorithm 1 returns LRT confidence region for C(i → j)
Input: data, level α, stepsize s (orderings, likelihood, effect) ← possible.order(data,α) ▷ all plausible orderings L1 ← likelihood [1] ▷ maximum likelihood alternative startvalues ←unique(effect) while startvalues is not empty do ▷ start with effects from unrestricted MLEs (leftbound, rightbound) ← min(startvalues) ( In order to reduce the space of plausible graphical structures as much as possible, we note again that we do not need to consider all possible DAGs but merely complete ones, represented by all possible causal orderings.This immensely decreases the worst-case number of single hypotheses we need to perform to reject an effect of size ψ to d factorial.Our subroutine possible.order(Algorithm 2) further reduces the number of plausible causal orderings as follows.
First, we quickly approximate the maximum likelihood under the alternative with the maximum likelihood L1 obtained under the ordering obtained from the method of [4] that sorts conditional variances.Subsequently, we calculate the maximum likelihood of all possible orderings without restrictions on the causal effect and immediately reject implausible orderings by employing the (largest) critical value χ 2 d,1−α .Note that we can search through the space of orderings recursively starting from the first node and immediately reject (partial) orderings if the corresponding (partial) maximum likelihood already exceeds the threshold.Further, for the subsequent testing procedure, the specific ordering of the parent set p(i) is not relevant.Thus, for all orderings that only differ in the specific ordering of p(i), we subsequently just test with the parent order that achieves the maximum likelihood.Additionally, we only need to test for a zero effect in the maximum likelihood ordering out of all orderings with j < G i. Finally, we include an attainable effect in our confidence set if it can not be rejected for at least one of the possible causal orderings using the subroutine testeffect, Algorithm 3. Therefore, we do not need to continue testing for a specific value of the effect if we find an ordering such that we cannot reject the value.Thus, first sorting the causal orderings by their maximum likelihood further improves the computation time.
We start testing within the plausible causal orderings with the minimal causal effect that corresponds to the maximum likelihood estimates.Then, we step-wise decrease the value until we reject in all orderings.The last value we did not reject is a lower bound for our confidence region.We repeat the procedure with increasing steps until we reject in all orderings and B is permutation similar to a lower triangular matrix.Straightforward calculations yield that the maximum of ℓ n (B, σ 2 ) over the equal variance σ 2 > 0 is given by sup Maximizing equation ( 10) over B ∈ R G is then equivalent to minimizing tr(( Thus, in the unrestricted case without constraints for a fixed size causal effect, the problem is equivalent to solving d separate linear least squares problems with minimizer and corresponding minimum Σk,k|p(k) .
Further, in this case, we can calculate the maximum likelihood of any partial ordering that starts from the source node, since it collapses into subproblems.Thus, we can reject all orderings that start with a given partial ordering if the corresponding partial maximum likelihood already exceeds the threshold.
In addition, we need to calculate the maximum likelihood estimates under each single hypothesis H (ψ) 0 (G).For j < G i we only test the hypothesis H (0) 0 (G).However, in that case, the constraint of a size zero causal effect C(i → j) = 0 is not restrictive and, thus, the minimizer is similarly given by (12).Therefore, we assume in the following i < G j. Analogously to the previous calculations, maximizing the Gaussian likelihood for a given causal ordering and a fixed total causal effect can be solved by minimizing the linear least squares problem (11).However, the constraint of a fixed total causal effect C(i → j) = ψ links a subset of the regression parameters and we can not solve all d linear least squares problems separately.
Recalling the path properties of the total causal effect, see Remark 2.1, the assumption This follows by observing that and further inductively (B k ) j,i = p∈p(j)\p(i) For all other k / ∈ p(j) ∪ {j} \ p(i) ∪ {i}, the corresponding part of the optimization problem ( 11) can be solved separately with the (unrestricted) minimizer given by ( 12).However, it remains to minimize the following non-trivial joint least squares problem Note that we can reduce the constraint to the submatrix B s , which is formed by selecting all rows and columns corresponding to the set p(j) ∪ {j} \ p(i), since (B m ) j,i = (B m s ) j,i .Then, the constraint yields β j,i as a function of the remaining β k,p(k) for k ∈ p(j) ∪ {j} \ p(i) ∪ {i} and we solve the remaining least squares problem numerically with quasi-Newton methods.

Simulations
In this section, we present the results of a simulation study that compares the coverage frequencies and the widths of the proposed confidence regions as well as their computing times.Our simulation experiments are designed as follows.We generate synthetic data sets based on LSEMs with Gaussian errors and equal variances associated to randomly selected directed acyclic graphs in a three-step procedure.First, we randomly select a permutation of d nodes, representing the causal ordering of (X 1 , . . ., X d ).Second, we generate edge weights for all edges in the complete graph that corresponds to the chosen causal ordering according to a normal distribution N (β, 0.1).In that way, β reflects the average strength of the direct effects in the graph, and thus, smaller values indicate higher uncertainty about the underlying causal structure.In the next step, we prune the complete graph, that is, we include an edge with a probability of 0.9 for dense and 0.5 for sparse graphs.Finally, we generate n samples according to the LSEM, represented by the chosen graph, with standard normal distributed errors.For a range of edge weights β, sample sizes n and dimensions d, we generated multiple independent data sets and determined the confidence sets for the total causal effect of X 1 on X 2 with confidence level α = 0.05 using the different proposed methods (LRT given by Theorem 4.1 and SLRT given by Theorem 4.5).We repeated this procedure twice, for the case of a true non-zero effect and the case of no effect.We chose the tuning parameter of the SLRT interval, the splitting ratio, optimal in the sense of Strieder and Drton [19].LRT 100 1.00 1.00 0.99 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 500 0.99 0.99 0.99 0.99 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1000 0.98 0.98 1.00 0.99 1.00 0.99 1.00 1.00 1.00 1.00 1.00 1.00 SLRT 100 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 500 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1000 0.99 0.99 1.00 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 Bootstrap 100 0.70 0.75 0.97 0.75 0.80 0.97 1.00 1.00 1.00 1.00 1.00 1.00 500 0.72 0.77 0.97 0.78 0.85 0.98 1.00 1.00 1.00 1.00 0.99 1.00 1000 0.76 0.80 0.95 0.79 0.88 0.98 0.99 0.99 1.00 1.00 1.00 1.00 We report the observed empirical coverage probabilities for 6-dimensional graphs in Table 1.All proposed methods achieve the desired coverage frequency of 0.95 and seem to be conservative.This is expected for the split likelihood ratio method since it is based solely on Markov's inequality.However, it is also not surprising that the LRT method is conservative, considering the use of intersection union testing and the fact that we set critical values based on the relaxed alternative.For comparison, we also included classical bootstrapping results in our simulation study.The bootstrapping method employs GDS, an established causal discovery algorithm for LSEMs with equal variance, proposed by Peters and Bühlmann [13].For each resampled data set, the GDS method uses a greedy search algorithm to maximize the likelihood and estimate the total causal effect.The results validate the theoretical findings in [8,20].Bootstrapping methods do not correctly account for uncertainty in the causal structure and do not achieve the required coverage frequency.
In addition to the coverage frequencies, we investigated how conclusive the confidence regions are, that is, how informative are intervals for practitioners in deciding about the existence and size of the causal effects, whilst still providing a valid uncertainty assessment.To this we show in Figure 1 the average width of the non-zero part of the confidence sets.We report the results for dimension d = 6, where the data was generated according to random DAGs with expected edge weight β = 0.5, against the sample size.In contrast to the bootstrapping method, our proposed LRT method constitutes a valid confidence region.Nevertheless, even in the considered setting with edge weight β = 0.5, where all variants achieved the desired coverage, our method significantly outperforms bootstrapping in the dense setting, while performing similarly in the sparse setting.In contrast, the split likelihood ratio tests yield wider confidence intervals.We also note that the confidence intervals are generally wider in the dense setting, that is, the remaining uncertainty about the numerical size of the effects is higher for dense DAGs.Our intuition here is that in dense DAGs more directed paths contribute to the total causal effect, thus, more edge weights are involved with remaining uncertainty.Figure 1 shows that our proposed methods successfully help to locate the numerical size of existing causal effects while correctly quantifying remaining uncertainty.
In addition to locating the numerical size of existing effects, an informative confidence interval should help in deciding whether there exists remaining uncertainty about the existence of an effect.Therefore, to investigate whether the proposed methods pick up on the causal direction of the effect between two variables, we plotted the proportions of times zero is included in the confidence sets when there is a true nonzero effect.Figure 2 shows the results for dimensions d = 6 with expected edge weight β = 0.5 against the sample size and for the sample size n = 500 against the expected edge weight.The LRT method eliminates the remaining uncertainty about the direction of the causal effect the quickest with increasing sample size, significantly outperforming the bootstrapping method and the split likelihood ratio tests.As intuitively expected, for higher average edge weights there is less uncertainty about the underlying causal structure and thus, the methods more often exclude the possibility of no causal effect.
Furthermore, one observes that in contrast to the bootstrapping method, the confidence regions of our proposed methods contain the zero less often in the dense setting compared to the sparse setting.This stems from the fact that there exists less uncertainty about the causal ordering for dense DAGs, and thus, the remaining uncertainty about the direction of the involved effects is lower than in the sparse setting.While there might be less uncertainty about the numerical size of the effect and thus, narrower confidence regions in the sparse setting, the uncertainty about the causal structure is higher.We obtain similar results for larger causal structures with 12 involved variables.Figure 3 shows the percentages of times our proposed LRT-confidence regions contain zero as well as the mean width of the non-zero part.The data was generated with an expected edge weight β = 0.5 and increasing sample sizes.In the case of no true effect, the confidence intervals always correctly contain zero.The average width of the non-zero part is close to zero, even for low sample sizes.If there is a non-zero total causal effect, the remaining uncertainty about the direction of the effect is already eliminated in samples of size around 2000.We highlight again that the remaining uncertainty about the causal structure (reflected by the percentage of times zero is contained in the confidence interval) is higher in the sparse setting.In contrast, the remaining uncertainty about the numerical size of the effect, reflected by the average width of the non-zero part of the confidence intervals, is lower.
In summary, the proposed framework for constructing confidence intervals for total causal effects successfully detects the direction and the numerical size of causal effects and correctly quantifies the remaining uncertainty in causal structure and effect size.
Figure 4 compares the average computation times over 100 randomly drawn DAGs.As the number of possible causal structures grows superexponentially with the dimension, the computation times for our proposed confidence intervals naturally increase.Since the intervals take uncertainty over all possible structures into account, the computation times similarly increase superexponentially.However, Figure 4 shows that it is feasible to compute this complete uncertainty over all causal structures for moderate dimensions with lower computation times than the bootstrapping method.We plotted the computation times for an expected edge weight β = 0.5 and sample size n = 1000 against the dimension (up to d = 12).Contrasting the bootstrapping procedure based on the greedy search algorithm GDS, the computation times for our proposed confidence intervals are lower in the dense compared to the sparse setting.This is not unexpected, as dense DAGs generally coincide with less uncertainty in the causal structure and thus lead to fewer plausible causal orderings passed on to the testing procedure.Similarly, if there is no true effect, computation times are lower since we need to test within fewer plausible causal orderings.In general, the computation times are mainly determined by reducing the set of plausible orderings with Algorithm 2 and the uncertainty in the causal structure within the data.
Further, Figure 4 also includes the average computation times for increasing sample sizes with expected edge weight β = 0.5 and dimension d = 6.In contrast to the bootstrapping procedure, increasing the sample size reduces the computation times for our proposed confidence intervals.More available data reduces the uncertainty about the underlying structure, which reduces the computation time.
Our framework is based on the assumption of an underlying Gaussian LSEM with equal error variances.We emphasize that if the true data-generating mechanism follows a Gaussian LSEM with differing error variances, the causal structure and, consequently, the true causal effects are generally not identifiable.Therefore, our task of constructing a confidence interval for this causal effect is not well-defined.However, in this setting, our method targets the causal effect in an approximation of the data-generating mechanism under the equal error variances assumption.With practical applications in mind, we investigated this behavior and, thus, the sensitivity of our method towards deviations from equal error variances.We generated data according to the same procedure as described above (for d = 6, β = 0.1 and n = 500), but with error variances sampled uniformly from [1 − 0.5v, 1 + 0.5v].In that way, v quantifies the degree of deviation from homoscedasticity across the interacting variables, ranging from 0 to 1.8.The empirical coverage probabilities in Figure 5 show that our framework is rather robust to small deviations from equal error variances.For small deviations, the targeted causal effect in the approximated equal error variance model is close to the true causal effect, and thus, our confidence intervals nevertheless cover the (unidentifiable) true effect.
In conclusion, the likelihood ratio method performs best in our simulation studies and yields the most informative confidence intervals.Nonetheless, while the split likelihood ratio tests are more conservative, we emphasize that it is the only method with theoretical finite sample guarantees.

Real Data Example
To illustrate the importance of correctly accounting for uncertainty in the causal structure, we consider the frequently studied Sachs protein data [15].This collection of 14 data sets comprises expression levels for 11 proteins and phospholipids in human T-cells obtained under different experimental conditions.In applications involving variables from similar domains and measurements obtained by similar techniques, the assumption of equal error variances offers a simple way to conduct exploratory analyses of cause-effect relations, also in light of the robustness of our method towards small deviations from equal variance that was seen in our simulations.In our data analysis, we compared our LRT method against naive bootstrapping with the GDS method [13] as well as bootstrapping with the LiNGAM method [16].The LiNGAM method explicitly assumes non-Gaussian errors to ensure identifiability.We considered the first of the 14 data sets with a sample size of 853 and focused on a subset of 10 proteins (excluding PKA) that have measurements on a similar scale.The protein PKA is a highly variable source node in the ground truth causal DAG reported in [15].To account for its influence, we first performed linear regressions of each protein on PKA and continued our analysis using the residuals.
Figure 6 shows the calculated 95%-confidence intervals for 6 exemplary causal effects.The regression coefficient of X i when regressing X j on (X i , X p(i) ), where the conventionally accepted ground truth structure defines the parent set, is seen as the true target effect C(i → j) and marked in red.In total, our proposed confidence intervals cover 84 of the 90 pairwise causal effects among the 10 involved variables.In contrast, naive bootstrapping with GDS and with LiNGAM cover only 66 and 72, respectively.Our confidence intervals indicate that, even within our strict modeling assumptions, there is substantial structure uncertainty.Out of the 90 calculated intervals, 67 contain zero as well as non-zero effects, thus reflecting uncertainty about the direction and existence of effects.In comparison, the naive methods with GDS and with LiNGAM contain both directions in only 35 and 55 cases, respectively.Thus, while the bootstrapping methods yield narrower intervals on average, they do not correctly account for structure uncertainty and tend to miss more minor target effects.

Discussion
In this paper, we raise the problem of rigorously quantifying uncertainty in causal inference when the causal structure is unknown.We demonstrate the intricacies of this task and propose a precise solution for constructing confidence regions for total causal effects, which capture both types of uncertainty: numerical size of effects and causal structure.By employing a test inversion approach with intersection union tests and exploiting combinatorial shortcuts, our proposed framework mathematically rigorously quantifies the remaining uncertainty in the concrete setting of LSEMs with homoscedastic Gaussian errors across all interacting variables.Our proposal of leveraging test inversions of joint tests for causal structure and size of effect to obtain confidence regions for causal effect estimates under structure uncertainty has a general nature and with suitable statistical tests could be extended to other modelling assumptions; see [20] for results from a heuristic for the example of bivariate LSEMs with non-Gaussian errors.Moreover, our framework could be adjusted to target different (causal) quantities by reformulating the corresponding constraint.
Assuming LSEMs with equal error variances ensures identifiability.Thus, a confidence region corresponds to a set of covariance matrices, which all uniquely entail a corresponding total causal effect.Without equal error variances (or rather identifiability in general), this unique correspondence does not hold.While our computational framework of focusing on causal orders and complete graphs might be adapted to consider a set of plausible covariance matrices, every covariance matrix then entails a set of possible causal effects that always contains zero.Thus, the corresponding confidence regions are not conclusive about the direction of effects.In contrast, we demonstrated that under the assumption of equal error variances our proposed confidence regions not only correctly quantify the remaining uncertainty in causal structure and numerical effect size, but also provide conclusive information which can help practitioners to decide about existence and numerical size of effects.
The computational shortcuts we adopt allow us to quantify structure uncertainty over all possible DAGs among a moderate number of nodes.For higher dimensional systems it is not feasible to consider all possible DAGs given that there already exist more DAGs on 22 nodes (10 87 ) than estimated atoms in the universe.However, our method can still be used by practitioners as a valuable tool for analysing higher dimensional causal systems by partially quantifying the uncertainty in the causal structure.
In the status quo scenario, practitioners would either confer with experts that provide the underlying causal structure or employ structure learning algorithms to estimate the DAG from available data.Starting from this inferred model, classical statistical methods can then be used to calculate confidence intervals for the involved causal parameters.These confidence intervals quantify uncertainty in the numerical size of the causal effect conditional on the inferred model.However, they do not incorporate uncertainty about the underlying causal structure and the (data-driven) model choice.While in higher dimensional systems it is not computationally tractable to quantify structure uncertainty by considering all DAGs, our framework can still be used as a compromise in between.Rather than solely relying on the full structure provided by experts, our method can be used to consider a set of plausible causal structures.This way practitioners can incorporate varying degrees of belief of experts in different parts of the structure by fixing parts with high confidence but still considering the remaining uncertainty over the rest.Along similar lines, instead of focusing on a single output of causal learning algorithms, practitioners can employ our method to a set of (most likely) causal structures and incorporate some degree of structure uncertainty in causal effect estimates.
In summary, we consider our study of total causal effect in linear causal models with equal error variances as the beginning of a new line of research on confidence sets under structure uncertainty.We anticipate fruitful generalizations in many directions including not only other model settings such as the linear non-Gaussian case, or additive noise models, but also different causal parameters other than total effects.From this perspective, while this paper develops our ideas in a very concrete setting, it sets the scene for various follow-up work in providing rigorously justified confidence regions in causal effect estimation when the causal structure itself is learned from data.

Lemma 4 . 3 .
Let G ∈ G(d) be a DAG with i < G j and ψ ∈ R.Under the hypothesis H (ψ) 0 (G) the likelihood ratio test statistic λ ψ n (G) satisfies is solely defined by d − 1 polynomial constraints corresponding to the assumption of equal error variances, that is, in this case we have M 0 (G) = {Σ ∈ PD(d) : f 0 (Σ|G) = 0}, where

Fig 3 :
Fig 3: Mean width and percentages of times zero contained in 95%-confidence intervals for the total causal effect of X1 on X2 in randomly generated 12-dim.DAGs (100 replications)

Fig 5 :
Fig 5: Empirical coverage of 95%-confidence intervals for the total causal effect of X1 on X2 in randomly generated 6-dim.DAGs under departure from equal error variances (1000 replications).

Fig 6 :
Fig 6: 95%-confidence intervals for the total causal effect between proteins from real world expression data.

Table 1
Empirical coverage of 95%-confidence intervals for the total causal effect of X1 on X2 in randomly generated 6-dim.DAGs (1000 replications).