Causal versions of Maximum Entropy and Principle of Insufficient Reason

The Principle of Insufficient Reason (PIR) assigns equal probabilities to each alternative of a random experiment whenever there is no reason to prefer one over the other. The Maximum Entropy Principle (MaxEnt) generalizes PIR to the case where statistical information such as expectations is given. It is known that both principles result in paradoxical probability updates for joint distributions of cause and effect. This is because constraints on the conditional P(effect|cause) result in changes of P(cause) that assign higher probability to those values of the cause that offer more options for the effect, suggesting "intentional behaviour". Earlier work therefore suggested sequentially maximizing (conditional) entropy according to the causal order, but without further justification apart from plausibility on toy examples. We justify causal modifications of PIR and MaxEnt by separating constraints into restrictions for the cause and restrictions for the mechanism that generates the effect from the cause. We further sketch why Causal PIR also entails "Information Geometric Causal Inference". We briefly discuss problems of generalizing the causal version of MaxEnt to arbitrary causal DAGs.


Introduction
Understanding asymmetries between cause and effect has attracted researchers from the field of causal discovery, particularly over the past two decades. One challenging problem motivated by the goal of understanding these asymmetries is to distinguish cause and effect from their bivariate distribution. This task cannot be solved by causal discovery methods that rely on conditional independences only (Spirtes et al., 1993; Pearl, 2000), but new approaches employ statistical properties other than conditional independences. They rely, for instance, on the additive noise assumption (Kano and Shimizu, 2003; Hoyer et al., 2009; Mooij et al., 2016) or a generalization of the latter (Zhang and Hyvärinen, 2009), or on asymmetries with respect to some notion of description complexity (Janzing and Schölkopf, 2010; Marx and Vreeken, 2017; Kocaoglu et al., 2017), or differences regarding regression error (Blöbaum et al., 2017). For an overview see also Peters et al. (2017) and Guyon et al. (2019), but also Janzing (2019) for a critical discussion of some ideas. Although distinguishing cause and effect from purely observational data is still challenging, these approaches have stimulated discussions in various directions regarding inferential asymmetries of cause and effect. On the one hand, the relation to the arrow of time in physics has been described by Allahverdyan and Janzing (2008) and . On the other hand, it has been argued that the asymmetries entail implications for machine learning for scenarios where the causal direction is known (Bengio et al., 2019).
Here we describe an asymmetry between cause and effect with respect to how we assign priors to a set of possible outcomes of an experiment. Among the most prominent principles to assign priors are the 'Principle of Insufficient Reason' (PIR) and the Principle of Maximum Entropy (MaxEnt) (Jaynes, 2003). PIR assigns uniform probabilities to a set of possible outcomes whenever the knowledge about the outcomes is invariant under permutations. MaxEnt, which generalizes PIR, chooses a prior that maximizes entropy subject to the known constraints. For the case where the causal direction is known, Sun et al. (2006) have argued that MaxEnt can result in implausible distributions and that more natural joint distributions result from a sequential maximization: first maximize the entropy of the cause subject to all constraints relevant for the latter, and then the conditional entropy of the effect, given the cause, subject to all remaining constraints. However, the arguments of Sun et al. (2006) were merely based on intuition without further justification.
On a related note, Ziebart et al. (2013) propose a 'maximum causal entropy principle' for a scenario with two interacting processes $X_t$, $Y_t$, where $X_t$ is known and $Y_t$ is inferred from its own past and from $X_t$ and its past via a sequential maximization of conditional entropy. Ziebart et al. (2013) justify the sequential update by arguing that constraints that involve future observations should be ignored at that respective point in time. In the appendix we argue that this justification is not sufficient for our purpose.
The goal of this paper is to derive the sequential maximum entropy update rule proposed by Sun et al. (2006) from principles that we consider slightly more basic. To this end, Section 2 discusses a simple scenario suggesting that PIR also requires the same modification as MaxEnt. Section 3 tries to justify Causal PIR from a deeper principle of independent mechanisms, but also raises questions that remain open in this regard. Section 4 derives the causal version of MaxEnt by Sun et al. (2006) by applying Causal PIR to empirical distributions. Section 5 describes some problems of generalizing Causal MaxEnt to arbitrary causal DAGs. Section 6 shows that Information Geometric Causal Inference (Daniusis et al., 2010) can be derived from Causal PIR in a similar way to Causal MaxEnt.
Proposing new practical inference rules is beyond the scope of this paper. Instead, it aims at better understanding relations between asymmetries of cause versus effect described earlier.

Standard PIR
The 'Principle of Insufficient Reason (PIR)', also called 'Laplace's Principle of Insufficient Reason' or 'Principle of Indifference' (Jaynes, 2003), states that in the absence of any relevant evidence, agents should distribute their credence (or 'degrees of belief') equally among all the possible outcomes under consideration. More explicitly, PIR advises considering all possible alternatives in a random experiment equally likely. For the simple example where we know that one of n urns contains a ball, PIR considers each of the urns as an equally likely location and assigns P(j) = 1/n to each case j = 1, ..., n. For a discussion of justifications of PIR we refer to Uffink (1995), where the relation to MaxEnt is also discussed in detail.
For our purpose, it is also instructive to rephrase PIR by stating that it advises the uniform prior whenever there is no evidence that breaks the symmetry between the alternative outcomes. In a way, PIR then gets a circular structure because any argument against the uniform prior implicitly raises doubts about the symmetry of the problem (the uniform distribution is the only one that is symmetric under permutation of the alternatives). One insight of our discussion below will be that a reasonable use of PIR is not symmetric with respect to interchanging cause and effect. We are agnostic about whether one should consider this merely as advice on how to properly apply PIR in a cause-effect scenario or as a causal modification of PIR.

Motivating Causal PIR for a simple mechanical device
Consider the mechanical device depicted in Figure 1. It consists of a system of channels with three different entries (top of the figure) and three exits (bottom). The first entry splits into two different channels, while the second and the third entry lead to the same exit. Let us label the three entries with the variable X attaining the values 1, 2, 3, while Y labels the exits 1, 2, 3. Assume we know that a ball enters one of the three entrances at the top. In the absence of any further information, we would consider all three options as equally likely, that is, P(X = 1) = P(X = 2) = P(X = 3) = 1/3, in agreement with PIR. When rolling through the channel, the ball will take one of the three exits. Whenever it entered at entrance 2 or 3, it can only take the exit Y = 1 due to the topology of the channels. In case it entered at entrance 1, it has two options later, namely exits Y = 2 or Y = 3. We now apply PIR to the conditional distribution of Y given X and assume that both alternatives are equally likely. The scenario thus yields the joint probabilities shown in the table in Figure 2, left. This distribution is clearly asymmetric with respect to X and Y although the topology of the channels is symmetric. Assuming that the ball enters from the bottom, that is, Y labels the entrances and X the exits, thus induces the joint distribution in the table in Figure 2, right, which is obtained by swapping the roles of X and Y. Led by our intuition, we have applied PIR twice: first for X, and then for Y, given X. However, the simplicity of the scenario blurs a non-trivial step in this way of reasoning, namely that the experiment is not symmetric with respect to time inversion, or, which is equivalent here, with respect to swapping cause and effect. Here, our asymmetry of reasoning is implicitly based on a belief about the difference between cause vs. effect and past vs. future.
Figure 1 (entries X = 1, 2, 3 at the top; exits Y = 3, 2, 1 at the bottom): A ball enters our mechanical device from the top. Without additional information, we would consider all three options (X = 1, X = 2, X = 3) equally likely, that is, assign the probability 1/3 to them. This results in probability 2/3 for Y = 1 and probability 1/6 each for Y = 2 and Y = 3.
Note that our mechanical toy example does not describe the typical scenario of cause-effect inference, since it is uncommon to know the mechanism that relates cause and effect while only the direction is unknown. Typically, we are given observations from X, Y instead of knowledge about the mechanisms. Yet the example is helpful to motivate Causal PIR, which is later used to motivate Causal MaxEnt, which, in turn, is relevant for more realistic inference scenarios.
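The two-step reasoning for the device can be reproduced numerically. The following Python sketch is only an illustration: the set `S` encodes the four admissible (x, y)-pairs as described in the text, and the helper name `sequential_pir` is our own choice rather than established notation.

```python
from fractions import Fraction

# Admissible (x, y)-pairs of the device: entry 1 branches to exits 2 and 3,
# entries 2 and 3 both lead to exit 1.
S = {(1, 2), (1, 3), (2, 1), (3, 1)}

def sequential_pir(pairs):
    """Uniform prior on the first coordinate, then uniform on the
    admissible second coordinates given the first."""
    firsts = sorted({a for a, _ in pairs})
    joint = {}
    for a in firsts:
        seconds = sorted(b for a2, b in pairs if a2 == a)
        for b in seconds:
            joint[(a, b)] = Fraction(1, len(firsts)) * Fraction(1, len(seconds))
    return joint

# Ball enters from the top: X is the cause.
top = sequential_pir(S)

# Ball enters from the bottom: Y is the cause. Swap roles, apply the same
# rule, and swap back so that keys are again (x, y).
swapped = sequential_pir({(y, x) for x, y in S})
bottom = {(x, y): p for (y, x), p in swapped.items()}

for pair in sorted(S):
    print(pair, "top:", top[pair], "bottom:", bottom[pair])
```

Both dictionaries are supported on the same four pairs but assign different probabilities to them, which is the asymmetry between the two causal directions discussed above.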

Fallback to standard PIR when causal direction is unknown
To elaborate on this, note that the topology of the channel allows 4 different (x, y)-pairs. Without knowing whether the ball enters from the top or the bottom, PIR lets us assign equal probabilities to each of them since the device is symmetric once the knowledge of the direction of the motion is lost. The symmetry of the problem now results in the distribution shown in the table in Figure 2, middle. Note that this distribution may not only be natural when we are agnostic about the causal direction, but also if neither of the causal directions is true and the relation between X and Y is due to a common cause. Although the following scenario may seem less natural than the first two with X or Y as cause, we mention it to cover the common cause scenario as well. Assume that the ball drops from the sky into one of the channels and comes to rest there at some point. If it lies in the regions X = 2, 3 or Y = 2, 3, its position already defines a unique (x, y)-pair since these values can only occur together with a unique value of Y or X, respectively. In the case where it lies in the regions X = 1 or Y = 1, we push it towards the branching point to generate the corresponding random value of Y or X, respectively. This way, we have again generated a scenario in which we have no reason to prefer any of the 4 possible (x, y)-pairs over the others. One can argue that the causal structure of this scenario is the DAG shown in Figure 2, middle, where some 'big' unobserved variable Z affects both X and Y, where Z contains position and momentum of the ball and the noise which determines the branching process.
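For comparison, the symmetric assignment can be written down directly. The sketch below is a minimal illustration (the helper `marginal` is a hypothetical name of ours); it places the uniform distribution on the four admissible pairs and computes both marginals.

```python
from fractions import Fraction

# The four admissible (x, y)-pairs of the device.
S = [(1, 2), (1, 3), (2, 1), (3, 1)]

# Standard PIR: every admissible pair is equally likely.
joint = {pair: Fraction(1, len(S)) for pair in S}

def marginal(joint, axis):
    """Sum the joint over the other coordinate (axis 0 is X, axis 1 is Y)."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], Fraction(0)) + p
    return out

p_x = marginal(joint, 0)
p_y = marginal(joint, 1)
print("P(X):", p_x)
print("P(Y):", p_y)
```

Under this assignment the value X = 1, which admits two values of Y, receives twice the probability of X = 2 or X = 3, and the same holds for Y = 1 by symmetry.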

Paradoxes with standard PIR
As the table in Figure 2, middle, shows, assigning equal probabilities to all 4 possible cases results in higher probabilities for those values of the cause that admit more options for the effect, which suggests 'intentional behaviour'. Note, however, that the latter interpretation is prone to confusing ontic and epistemic perspectives: whenever the restriction to these 4 alternatives comes from our knowledge about the underlying mechanism connecting the cause X with the effect Y, it is indeed irrational to consider x-values more likely for which there exists a larger number of possible y-values later. However, if we know, for some other reason, that (x, y) is one of the above 4 cases (e.g. because someone told us without telling us (x, y)), there is nothing wrong with updating our subjective prior for X in the way resulting from the uniform distribution over the 4 possible pairs. After all, this Bayesian update entails no statement on the underlying causal mechanism. This distinction will be further discussed in Section 3, where we also mention open problems regarding ontic versus epistemic interpretation of constraints.
Similar paradoxes with standard PIR have been described by Hunter (1986, 1989) in a critical discussion of MaxEnt. He described a scenario which he called 'Pearl's Puzzle'¹, which we briefly sketch. Assume three individuals A, B, C are invited to a party but we don't know who will be joining. Further assume we consider, a priori, all 8 possible combinations equally likely. In addition, we know that A, B decide independently of each other and of C whether they join, but C will call the host to ask whether both A and B have accepted the invitation and stay at home in this case to avoid seeing both of them together. After accounting for this extra information (excluding the case where A, B, and C all attend), we are left with 7 remaining combinations, to which we would assign equal probabilities.
According to such an update, the joint distribution of A, B has changed after accounting for the information that C's decision depends on A and B. Phrasing it in causal terms, the puzzle reads as follows: A, B are the causes and C's behaviour their effect. Learning about how C's decision depends on A and B actually changes the belief about the mechanism according to which the effect depends on its causes. It is disturbing that an update on this mechanism affects the distribution of the causes (one can also show, which Pearl describes as the main puzzle, that A and B even become dependent by this update).
We will later elaborate on this in the context of the so-called Principle of Independent Mechanisms (Peters et al., 2017), since Hunter's and Pearl's discussions are already led by such an independence assumption.
To conclude with 'Pearl's puzzle' we briefly sketch how it gets resolved by a sequential use of PIR: since A and B are the causes, we assign a uniform prior over all 4 possible truth values. Afterwards, we assign a uniform prior over all remaining options for C: whenever A and B are coming, C stays at home with probability 1; in all other cases he decides to come with probability 1/2. By construction, whether or not C is coming is irrelevant for A and B.
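Both updates in the puzzle are small enough to enumerate. The following Python sketch is an illustration under the encoding assumed here (a triple of booleans, `True` meaning "attends"); the helper names `prob` and `ab_independent` are ours.

```python
from fractions import Fraction
from itertools import product

# Each outcome is a triple (a, b, c); the case where all three attend is excluded.
outcomes = [o for o in product([False, True], repeat=3) if o != (True, True, True)]

# Standard PIR: uniform over the 7 remaining combinations.
standard = {o: Fraction(1, len(outcomes)) for o in outcomes}

# Sequential PIR: uniform over the causes (A, B), then uniform over the
# options for C that remain admissible given (a, b).
causal = {}
for a, b in product([False, True], repeat=2):
    options = [c for c in (False, True) if (a, b, c) in outcomes]
    for c in options:
        causal[(a, b, c)] = Fraction(1, 4) * Fraction(1, len(options))

def prob(dist, event):
    """Probability of the set of outcomes for which event(o) is true."""
    return sum(p for o, p in dist.items() if event(o))

def ab_independent(dist):
    """For binary A and B, independence is equivalent to P(A, B) = P(A) P(B)."""
    p_a = prob(dist, lambda o: o[0])
    p_b = prob(dist, lambda o: o[1])
    p_ab = prob(dist, lambda o: o[0] and o[1])
    return p_ab == p_a * p_b

print("standard PIR keeps A, B independent:", ab_independent(standard))
print("sequential PIR keeps A, B independent:", ab_independent(causal))
```

Under the uniform distribution over the 7 combinations, P(A) = P(B) = 3/7 while P(A, B) = 1/7, so A and B become dependent; under the sequential assignment the marginal of (A, B) stays uniform.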

General definition of Causal PIR
The way we defined the joint distribution for the mechanical device can be described by the following principle, which also solved the above 'puzzle':
Definition 1 (Causal PIR). Let X and Y be cause and effect with values in finite sets $\mathcal{X}$ and $\mathcal{Y}$, respectively. If the only knowledge about an observation (x, y) is that it lies in some subset $S \subset \mathcal{X} \times \mathcal{Y}$, Causal PIR assigns the uniform distribution to all possible x for which there exists a y such that (x, y) ∈ S. Then Causal PIR assigns the uniform prior over all remaining options for y, given x (that is, all y for which (x, y) ∈ S).
A priori, we have introduced Causal PIR only as a principle for constructing a prior when the causal direction is known. Conversely, one can certainly use its asymmetry to infer the causal direction by preferring the one with larger likelihood:
Definition 2 (Causal PIR based cause-effect inference). Let an observation (x, y) be given that was generated by either the causal structure X → Y or Y → X. Infer that the true causal direction is the one for which (x, y) has larger likelihood according to the Causal PIR prior.
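Definitions 1 and 2 translate into a few lines of code. The sketch below is a hedged illustration: the function names are hypothetical, the constraint set is the one of the mechanical device, and returning 'undecided' for equal likelihoods is our own addition, since Definition 2 does not address ties.

```python
from fractions import Fraction

def causal_pir_likelihood(pairs, cause, effect):
    """Likelihood of (cause, effect) under Causal PIR, where `pairs` lists
    the admissible (cause, effect) combinations."""
    if (cause, effect) not in pairs:
        return Fraction(0)
    causes = {c for c, _ in pairs}
    effects_given_cause = [e for c, e in pairs if c == cause]
    return Fraction(1, len(causes)) * Fraction(1, len(effects_given_cause))

def infer_direction(S, x, y):
    """Compare the Causal PIR likelihoods of (x, y) for X -> Y and Y -> X."""
    forward = causal_pir_likelihood(S, x, y)
    backward = causal_pir_likelihood({(b, a) for a, b in S}, y, x)
    if forward > backward:
        return "X -> Y"
    if backward > forward:
        return "Y -> X"
    return "undecided"  # tie handling is an assumption, not part of Definition 2

# Admissible (x, y)-pairs of the mechanical device.
S = {(1, 2), (1, 3), (2, 1), (3, 1)}
print(infer_direction(S, 2, 1))  # X -> Y
print(infer_direction(S, 1, 2))  # Y -> X
```

For the pair (2, 1) the likelihoods are 1/3 for X → Y and 1/6 for Y → X; for (1, 2) the roles are reversed.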
Observing, for instance, the path in Figure 3, we thus infer that the ball entered from the top rather than from the bottom: we obtain likelihood 1/3 for the former versus 1/6 for the latter. In a more informal way, we state Causal PIR as follows:

Independent mechanism update
This section and Section 4 repeatedly refer to the Principle of Independent Mechanisms (IM), which we briefly introduce for the special case of a cause-effect pair. A priori, IM is an informal principle stating that, for an unconfounded cause-effect relation, there should be two independent mechanisms in place, one that generates the cause and one that generates the effect from the cause (see Peters et al. (2017), Section 2.1, for an overview and discussion of its different aspects). IM has been used as a foundational justification for cause-effect inference. One formalization of IM in the literature is the Algorithmic Independence of Conditionals (AIC) by Janzing and Schölkopf (2010) and Lemeire and Janzing (2012), stating that the shortest description of $P_{X,Y}$ is given by separate descriptions of $P_X$ and $P_{Y|X}$.² This version will be relevant in Section 4, while the following subsection will interpret IM in the sense of decomposing the constraint S into one part that refers to the cause and one that refers to the relation between cause and effect.

Constraints on the cause and constraints on functions
We will now describe a principle that justifies Causal PIR in Definition 1. To this end, let $\mathcal{F} := \mathcal{Y}^{\mathcal{X}}$ denote the set of functions $f : \mathcal{X} \to \mathcal{Y}$ and define the formal random variable $F$ attaining values $f \in \mathcal{F}$. Note that $\mathcal{F}$ can be represented as the k-fold cartesian product of $\mathcal{Y}$ with $k := |\mathcal{X}|$, whose components are indexed by $x \in \mathcal{X}$. In other words, a function f is represented by the k-tuple
$$(f(x_1), \dots, f(x_k)).$$
Accordingly, a distribution on $\mathcal{F}$ is a joint distribution on this cartesian product. Its marginal distribution on component x describes the conditional probability $P_{Y|X=x}$ (see, for instance, Peters et al. (2017), Section 3.4). Let $P_F$ be the uniform distribution³ on $\mathcal{F}$. Fortunately, the uniform distribution has product structure over the components of the cartesian product, which renders $P_F$ particularly easy to deal with. Further, for every input x, every output y is equally likely. In other words, the uniform prior over all functions induces conditional distributions $P_{Y|X=x}$ that are uniform for all x. After also assuming a uniform prior $P_X$, we have thus obtained a uniform prior $P_{X,Y} = P_X P_{Y|X}$ over all $|\mathcal{X}||\mathcal{Y}|$ combinations, which are 9 = 3 · 3 in the above example.
After having defined our prior on X and F, let us obtain the additional information that the entire device only generates (x, y)-pairs in some set $S \subset \mathcal{X} \times \mathcal{Y}$. For the example above, these are the 4 combinations shown in the tables in Figure 2. We now assume that the constraint S is the result of two independent mechanisms, one for X and one for F:
Postulate 2 (separation of constraints). Given the constraint (x, y) ∈ S for a cause-effect pair (X, Y), we assume, by default, that this constraint is enforced by two separate mechanisms. First, there is a mechanism that enforces all x to be in the set
$$S_X := \{x \in \mathcal{X} \mid \exists\, y \in \mathcal{Y} \text{ with } (x, y) \in S\}.$$
Second, there is a mechanism that enforces functions to be in the set
$$S_F := \{f \in \mathcal{F} \mid (x, f(x)) \in S \text{ for all } x \in S_X\}.$$
To better understand the postulate it helps to say what kind of mechanisms it excludes: Imagine an agent who chooses functions f that violate (x, f(x)) ∈ S for some inputs $x \in S_X$, but always makes sure that these functions are only combined with inputs x for which the constraints are satisfied. In other words, the agent ensures (x, f(x)) ∈ S by combining x and f in a smart way. In this case, we would say that the mechanism choosing x and the mechanism choosing f are dependent. In Subsection 3.2 we will discuss in what sense this would violate the Principle of Independent Mechanisms.
³ Note that Hunter's solution of 'Pearl's puzzle' (Hunter, 1989) also uses an update of a distribution over functions ('probability measures over counterfactuals'), but is based on the assumption that the constraint is known to refer to the function only, while our scenario describes a constraint for (x, y) for which it is not a priori known what it tells us about the function.
The above restrictions for X and F together generate the restriction S. Note that the above separation of S into $(S_X, S_F)$ entails minimal commitment on both components x and f (while still preserving independence) in the following sense. First, it is obvious that no proper superset $\tilde{S}_X \supset S_X$ guarantees that (x, y) ∈ S, regardless of the constraints for the functions. Second, no larger set $\tilde{S}_F \supset S_F$ guarantees (x, y) ∈ S unless we require x-values and functions f to respect joint constraints.
For our mechanical device above, the constraint for X reads that there are 3 possible entries X = 1, 2, 3. While this constraint is trivial since our set $\mathcal{X}$ contains only these 3 values, one could also think of a set $\mathcal{X}$ that is a priori larger until our information on S restricts the options for X to the subset $S_X$ consisting of these 3 values. The constraints on F that we conclude from the joint constraint S consist in excluding all functions that map X = 1 to y-values other than Y = 2, 3 and X = 2, 3 to values other than 1. Nevertheless, Postulate 2 is less innocent than our toy example suggests. We will therefore further discuss its justification in Subsection 3.2.
We now obtain the following technically simple result, which we phrase as a theorem since it considers Causal PIR as an implication of the more basic Postulate 2:
Theorem 1 (Causal PIR from independent mechanism update). Let $P_X$ and $P_F$ be uniform distributions on $\mathcal{X}$ and $\mathcal{F}$, respectively. Then the conditional distribution $P^S_X$ is the uniform distribution over $S_X$. Further, for every $x \in S_X$, the conditional $P^S_{Y|X=x}$ resulting from the conditional distribution $P^S_F$ is the uniform distribution over all y for which (x, y) ∈ S.
Proof. The first statement is obvious. To show that the conditional is also uniform over all remaining options, we represent each function f as the k-tuple of y-values $(f(x_1), \dots, f(x_k))$, where $x_1, \dots, x_k$ denote the elements of $S_X$. The uniform prior over all functions thus amounts to the uniform prior over all k-tuples $(y_1, \dots, y_k) \in \mathcal{Y}^k$. Since the uniform distribution over a cartesian product factorizes over its components, we can perform the update independently for each $x_j$ and obtain a uniform distribution over all y for which $(x_j, y) \in S$.
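The statement of Theorem 1 can be checked by brute force for the mechanical device. The following Python sketch is illustrative only: it enumerates all functions as tuples, restricts them to the set of functions compatible with S, and reads off the induced conditionals; the variable and function names are our own.

```python
from fractions import Fraction
from itertools import product

X = [1, 2, 3]
Y = [1, 2, 3]
S = {(1, 2), (1, 3), (2, 1), (3, 1)}

# Constraint on the cause: all x that occur in some admissible pair.
S_X = [x for x in X if any((x, y) in S for y in Y)]

# All functions f: X -> Y, each represented as the tuple (f(x_1), ..., f(x_k)).
functions = list(product(Y, repeat=len(X)))

# Constraint on the mechanism: f must map every x in S_X to an admissible y.
S_F = [f for f in functions
       if all((x, f[X.index(x)]) in S for x in S_X)]

def conditional(x):
    """P(Y = y | X = x) induced by the uniform distribution on S_F."""
    counts = {}
    for f in S_F:
        y = f[X.index(x)]
        counts[y] = counts.get(y, 0) + 1
    return {y: Fraction(n, len(S_F)) for y, n in counts.items()}

print(len(functions), "functions in total,", len(S_F), "compatible with S")
for x in S_X:
    print("x =", x, "->", conditional(x))
```

Of the 27 functions only 2 survive the constraint, and the induced conditionals are uniform over the admissible y-values for each x, in agreement with Definition 1.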

Justifying separation of constraints
The remainder of this subsection is devoted to the justification of Postulate 2. It needs to be informal because it is a discussion of beliefs about the world rather than insights from statistics or any other branch of mathematics. Further, it can be seen as 'abstract physics' in which the 'hardware' of the underlying processes is unspecified. We also briefly mention relations to the thermodynamic arrow of time and thus reach a domain that goes beyond the scope of this paper. Accordingly, further justification of Postulate 2 could also raise questions of theoretical physics. However, the main focus of this subsection is the question of which implicit further assumptions are made when Postulate 2 is said to be entailed by the Principle of Independent Mechanisms.
Constraints from knowledge versus constraints from mechanisms The first distinction we need to make regarding the constraint (x, y) ∈ S is whether we assume that there is a mechanism that generates pairs in S or whether we know that a particular experiment resulted in an (x, y)-pair in S by chance (recall our remarks regarding ontic versus epistemic perspectives in Subsection 2.4). In the second case, Postulate 2 does not make sense: if (x, y) ∈ S is not the result of a mechanism that enforces (x, y) to lie in S, it is pointless to postulate separate mechanisms. In this case, the further argument resulting in Theorem 1 breaks down: There is, a priori, no reason why our knowledge about X and F could not render them dependent,⁴ although the Principle of Independent Mechanisms states that the true mechanisms contain no information about each other. Note that Janzing and Schölkopf (2010) formalize independence via algorithmic information, which is an ontic perspective since it relies on the description length of known mechanisms.
One can certainly also justify an epistemic 'Principle of Independent Mechanisms' stating that our prior about cause-effect pairs should factorize between the mechanism generating the cause and the mechanism generating the effect from the cause, but the factorization breaks down after joint observations from cause and effect are available. Although this raises doubts about Postulate 2, we now discuss what kind of inductive bias provides further support.
Bias for ontic interpretation Let us now describe a scenario in which knowing (x, y) ∈ S does provide evidence for the presence of a mechanism that enforces or at least supports outcomes in S. To this end, assume that the sets $\mathcal{X}$ and $\mathcal{Y}$ are huge (e.g. binary words of length n with n ≥ 100). Further, assume that S is a strong constraint in the sense that it allows only a small fraction of possible outcomes, that is, $|S| \le (|\mathcal{X}| \times |\mathcal{Y}|)/k$ for some huge number k. Assuming, a priori, a uniform distribution $P_{X,Y}$ on $\mathcal{X} \times \mathcal{Y}$, we have $P\{(X, Y) \in S\} \le 1/k$. Given some fixed S with this property, we would certainly argue that an observation in S is unlikely without a mechanism that increases the probability of outcomes in S. At first glance, it seems that such a conclusion is only possible if S has been defined prior to the experiment. However, it still holds when we can identify a set S after observing (x, y), provided that S has low description length (here we formalize description length via Kolmogorov complexity (Li and Vitányi, 1997), that is, let K(S) denote the length of the shortest self-delimiting program that decides whether any pair (x, y) is in S). With these assumptions, it is unlikely to obtain an outcome (x, y) in any such tiny set S with low complexity. To see this, let U be the union of all sets S with K(S) ≤ ℓ and |S| ≤ k. Since the number of programs of length at most ℓ is at most $2^\ell$ (Li and Vitányi, 1997), the probability of obtaining a result in any of these low complexity sets can be bounded from above as follows:
$$P\{(X, Y) \in U\} \le \frac{|U|}{|\mathcal{X}||\mathcal{Y}|} \le \frac{2^\ell\, k}{|\mathcal{X}||\mathcal{Y}|}. \tag{1}$$
In case the right hand side of (1) is still significantly smaller than 1, we assume that observing (x, y) ∈ S indicates the presence of a mechanism that increases the probability of S (compared to the uniform distribution we started with). We phrase this insight as an informal postulate:
Postulate 3 (bias towards mechanisms vs state of knowledge). Given a system with finite 'state space' $\mathcal{W}$, the information that the actual state w lies in some set S with low complexity (in the sense that $|S| \cdot 2^{K(S)} \ll |\mathcal{W}|$) is considered as strong evidence for the presence of a mechanism that increases the likelihood of states in S.
The constraints we will discuss later in the context of MaxEnt will typically be of this type: constraints that describe empirical means of simple functions like polynomials of low order have low complexity (provided that the constants involved have short descriptions), and restrict the combinations of outcomes by huge factors. For the same reasons, typical constraints in thermodynamics also result from mechanisms: observing that all particles of a gas are located within a certain volume V can only be explained by a mechanism (e.g. a wall) that confines them to V, rather than being just a coincidence.⁵
⁵ In general, constraints on macroscopic variables have negligible description length compared to the typical complexity of the microscopic state of a many-particle system, as also argued by Zurek (1989).
Is the constraint S tight? Together with the bias for an ontic interpretation of constraints, we are now getting slightly closer to deriving Postulate 2 from IM (beyond the few comments made right after stating it). We now assume, for simplicity, that the constraint (x, y) ∈ S is due to a mechanism that forces all pairs to lie in S (although Postulate 3 is weaker in the sense that it only assumes a mechanism that increases the likelihood of S). The question we are facing is whether S is tight in the sense that all pairs in S will occur after sufficiently many repetitions of the same experiment. We then need to assume that S originates from separate constraints for X and F because otherwise we would need a mechanism that controls X and F jointly by varying them in a way that enforces (x, f(x)) ∈ S, in contradiction to the independence of mechanisms, as sketched after Postulate 2.
For the case where S is not tight and the mechanism generates only (x, y) pairs in $S' \subset S$ (but we don't know $S'$) we still choose the update according to Postulate 2 because this is the only possible choice for constraints on X and F that doesn't commit beyond the information we have.
We summarize that assuming that a constraint S arises from independent constraints for X and F is our inductive bias, which can be justified under appropriate conditions.

Wallis' argument for MaxEnt
Inferring underdetermined probability distributions by maximizing entropy subject to the available information is a well-established principle in machine learning and statistics, see e.g. Frogner and Poggio (2019); Levy and Delic (1994); Myung et al. (1996). The usual formal setting reads:
Accounting for linear constraints Let us, for simplicity, assume that X is a variable that attains values in some finite set $\mathcal{X}$. Assume the only information available on $P_X$ is given by the expectations
$$\mathbb{E}[f_j(X)] = c_j, \tag{2}$$
where $f_j$ are measurable functions. According to MaxEnt we would then choose the unique distribution maximizing the Shannon entropy⁶ subject to the constraints (2), which yields
$$p(x) = \exp\Big(\sum_j \lambda_j f_j(x) + \mu\Big) \tag{3}$$
with appropriate Lagrange multipliers $\lambda_j$, $\mu$. While distributions that result from MaxEnt often appear intuitively 'natural' or 'smooth'⁷, there is an ongoing debate about how to justify (3) as a rational guess (Jaynes, 1957; Palmieri and Domenico, 2013; Uffink, 1996). Shore and Johnson (1978) stated postulates that 'consistent' rules for updating a distribution after new information comes in should satisfy; Uffink (1996) criticized the approach as suffering from hidden implicit assumptions that go beyond what Uffink (1996) would call 'consistency' requirements. We will therefore prefer the so-called Wallis derivation (see Jaynes (2003), Section 11.4), which we briefly sketch: Consider an experiment with n draws from the finite probability space $\mathcal{X} = \{x_1, \dots, x_k\}$, and let $n_1, \dots, n_k$ with $\sum_j n_j = n$ denote the numbers of occurrences of the $x_j$. By elementary combinatorics, the number of combinations for these frequencies reads
$$\#(n_1, \dots, n_k) = \frac{n!}{n_1!\, n_2! \cdots n_k!}. \tag{4}$$
Using Stirling's approximation one can easily show that
$$\frac{1}{n} \log \#(n_1, \dots, n_k) \approx -\sum_j \frac{n_j}{n} \log \frac{n_j}{n}. \tag{5}$$
Hence, the number of realizations can be estimated via the entropy of the relative frequencies. Thus, the MaxEnt distribution is the distribution for which the corresponding relative frequencies maximize the number of realizations in the limit of n → ∞.
⁶ For continuous variables, one typically replaces Shannon entropy with differential Shannon entropy (Cover and Thomas, 1991), $H(X) := -\int p(x) \log p(x)\, dx$. Since the latter is not invariant with respect to re-parametrization, one should then rather consider minimization of relative entropy to a given prior distribution.
⁷ Since any distribution q maximizes the entropy subject to appropriate constraints (just choose $f(x) := \log q(x)$ with appropriate constant c), this is certainly a result of the type of constraints that typically occur in applications, e.g., if only first and second moments of a distribution are known.
Further, one can show that for large enough n, the overwhelming majority of n-tuples satisfying the constraints show empirical distributions that are close to the MaxEnt distribution. Hence, a prior on $\mathcal{X}^n$ that assigns equal probability to each n-tuple results, after accounting for the constraints, in a posterior that is essentially supported by empirical distributions close to the unique MaxEnt distribution. In this sense, MaxEnt can also be seen as an implication of PIR (when applied to empirical distributions), although MaxEnt is more general from the formal point of view.
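The relation between the number of realizations and the entropy of the relative frequencies can be checked numerically. The sketch below is an illustration with arbitrarily chosen frequencies (0.5, 0.3, 0.2), which are an assumption of this example; it compares the normalized logarithm of the multinomial coefficient with the Shannon entropy (natural logarithm) for growing n.

```python
import math

def log_multiplicity(counts):
    """log of n! / (n_1! ... n_k!), computed via log-gamma to avoid overflow."""
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)

def entropy(freqs):
    """Shannon entropy (natural logarithm) of relative frequencies."""
    return -sum(p * math.log(p) for p in freqs if p > 0)

for n in (10, 100, 10_000, 1_000_000):
    counts = [n // 2, 3 * n // 10, n // 5]   # relative frequencies (0.5, 0.3, 0.2)
    freqs = [c / n for c in counts]
    print(n, log_multiplicity(counts) / n, entropy(freqs))
```

The first printed quantity approaches the second from below as n grows, with a gap that shrinks roughly like (log n)/n, which is the correction term dropped in Stirling's approximation.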

Causal MaxEnt from Causal PIR
We start by motivating Causal MaxEnt in the same way as Sun et al. (2006). Assume we are given a continuous variable $X$ as cause and a binary variable $Y$ as effect. Let the only information about the joint distribution $P_{X,Y}$ be given by the first and second moments. One can easily verify that the MaxEnt distribution is a bivariate mixture of Gaussians, where the cases $Y = 0, 1$ correspond to the two mixture components. Sun et al. (2006) argue that this distribution would be plausible if $Y$ was the cause and $X$ the effect, while it is not plausible that the cause becomes bimodal just because it has an influence on a binary variable. If one, instead, first maximizes $H(X)$ subject to the constraints on $E[X]$ and $E[X^2]$, and then $H(Y|X)$ subject to the remaining constraints, the marginal distribution $P_X$ becomes a single Gaussian and $P(Y = 1|X)$ a sigmoid function where the probability for $Y = 1$ smoothly increases or decreases with $X$, which Sun et al. (2006) consider plausible for the causal direction $X \to Y$. Formally, they have postulated the following principle:

Definition 3 (Causal MaxEnt). Given some linear constraints for $P_{X,Y}$ for the cause-effect pair $(X, Y)$, infer the bivariate distribution by first maximizing $H(X)$ subject to all constraints for $P_X$ (entailed by the joint constraints). Then, maximize $H(Y|X)$ subject to the joint constraints.

 show that usual MaxEnt violates the algorithmic independence of $P_X$ and $P_{Y|X}$.
The proof is based on the observation that MaxEnt can result in a joint distribution whose marginal $P_X$ cannot be defined by a separate constraint with a simple description. Instead, its simplest description may be 'the marginal distribution resulting from MaxEnt for the joint constraint'. For the example above with continuous $X$ and binary $Y$ with second order constraints, $P_X$ is a mixture of two Gaussians, and thus already contains the full information about the joint distribution $P_{X,Y}$. Despite describing this problem of MaxEnt,  do not show that Causal MaxEnt is the right replacement for MaxEnt.
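The contrast between the two marginals in the example above can be illustrated on a grid. This is only a sketch: the multiplier values a, b, c, d are hypothetical and chosen so that the two mixture components are unit-variance Gaussians centred at -3 and 3 with equal weight, and the exponential form exp(a x + b x^2 + c y + d x y) assumes constraints on the moments of X, Y and their product. The single Gaussian used for comparison has the same mean (0) and second moment (10) as that mixture.

```python
import math

def joint_maxent_marginal(x, a, b, c, d):
    """Unnormalized marginal of X when p(x, y) is proportional to
    exp(a*x + b*x**2 + c*y + d*x*y) with y in {0, 1}."""
    return math.exp(a * x + b * x * x) * (1.0 + math.exp(c + d * x))

def gaussian_shape(x, mean, var):
    """Unnormalized Gaussian density."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var))

def count_modes(values):
    """Number of strict interior local maxima of a sampled curve."""
    return sum(1 for i in range(1, len(values) - 1)
               if values[i - 1] < values[i] > values[i + 1])

grid = [i / 100 for i in range(-800, 801)]
a, b, c, d = -3.0, -0.5, 0.0, 6.0    # hypothetical multipliers

# Joint MaxEnt: the marginal of the cause is a two-component mixture.
joint_marginal = [joint_maxent_marginal(x, a, b, c, d) for x in grid]
# Sequential maximization: the cause gets a single Gaussian with the
# same first and second moments (mean 0, second moment 10).
causal_marginal = [gaussian_shape(x, 0.0, 10.0) for x in grid]

print(count_modes(joint_marginal), count_modes(causal_marginal))
```

Both curves satisfy the same moment constraints on X; only the joint maximization makes the cause bimodal.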
We now show that this sequential probability update is a result of Causal PIR when applied to empirical distributions. 8 Assume we are given $\ell$ constraints of the form
$$E[f_j(X, Y)] = c_j, \quad j = 1, \dots, \ell. \qquad (6)$$
Let us now interpret (6) as constraints for the empirical distribution after $n$ draws. For each pair $(x, y)$ of $n$-tuples $x := (x_1, \dots, x_n)$ and $y := (y_1, \dots, y_n)$ we denote by $E_{(x,y)}$ the expectation induced by the corresponding empirical distribution of $(X, Y)$. Finally, we define
$$S := \{(x, y) : |E_{(x,y)}[f_j(X, Y)] - c_j| \le \epsilon, \; j = 1, \dots, \ell\} \qquad (7)$$
with some arbitrarily small $\epsilon > 0$, which defines a relaxation of (6) to ensure feasibility for sufficiently large $n$. Following our separation of constraints in Postulate 2, we now define $S_X$ as the set of $n$-tuples $x$ for which there exists an $n$-tuple $y$ such that $(x, y) \in S$. Again, Causal PIR tells us to put a uniform prior on $S_X$. Following Subsection 4.1, the overwhelming majority of $n$-tuples $x$ in $S_X$ have empirical distributions close to the distribution $P_X$ that maximizes $H(X)$ subject to (6) being feasible for $Y$. For any $x \in S_X$ let $S_x$ denote the set of $n$-tuples $y$ such that $(x, y) \in S$. According to Causal PIR, we put a uniform prior on $S_x$. We will again use (5) to derive the conditional empirical distribution that is induced by the majority of the $y \in S_x$.
To this end, for any $x \in S_X$ let $n^x_1, \dots, n^x_k$ denote the numbers of occurrences of the $k$ different elements of $\mathcal{X}$. Further, for any $y \in S_x$, let $n_{ij}$ be the number of occurrences of the element $(i, j)$ in $\mathcal{X} \times \mathcal{Y}$. For any collection $(n_{ij})$ and any fixed $x$, the number of different $y$ is given by
$$\#(n_{ij}) = \prod_{i=1}^{k} \frac{n^x_i!}{\prod_j n_{ij}!}, \qquad (8)$$
since we need to apply (4) for each element $i$ of $\mathcal{X}$ with sample size $n^x_i$. Using the same arguments as for the derivation of (5) we obtain
$$\frac{1}{n} \log \#(n_{ij}) \approx -\sum_{i,j} \frac{n_{ij}}{n} \log \frac{n_{ij}}{n^x_i}. \qquad (9)$$
Recalling that the conditional entropy of $Y$ given $X$ for any probability mass function $p(x, y)$ reads (Cover and Thomas, 1991)
$$H(Y|X) = -\sum_{x,y} p(x, y) \log p(y|x),$$
we observe that (9) is the conditional entropy of the empirical distribution. Accordingly, we conclude that, for any fixed $x$, the overwhelming majority of $n$-tuples $y$ in $S_x$ are those whose empirical distributions are close to the distribution maximizing conditional entropy subject to (6). The above arguments show that first putting a uniform prior on $S_X$ and then, for fixed $x$, a uniform prior on $S_x$ yields a joint distribution on $\mathcal{X}^n \times \mathcal{Y}^n$ that is strongly concentrated on the set of $(x, y)$-pairs whose empirical distribution is given by Causal MaxEnt. In contrast, classical MaxEnt would provide the most likely empirical distribution only if we put a uniform prior on $S$, that is, if we use standard PIR.
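The counting step behind the conditional-entropy result can be checked in the same way as (5). The sketch below assumes a hypothetical 2 x 2 table of joint counts, scaled up to increase n, and takes the number of effect tuples for a fixed cause tuple to be the product over cause values of the multinomial coefficient from (4). Function names are illustrative.

```python
import math

def log_num_effect_tuples(table):
    """Log of the number of effect tuples y compatible with a fixed cause
    tuple x: product over cause values i of n_i! / prod_j n_ij!,
    where table[i][j] = n_ij and n_i is the row sum."""
    total = 0.0
    for row in table:
        total += math.lgamma(sum(row) + 1)
        total -= sum(math.lgamma(c + 1) for c in row)
    return total

def empirical_conditional_entropy(table):
    """H(Y|X) in nats for the empirical distribution given by the counts."""
    n = sum(sum(row) for row in table)
    h = 0.0
    for row in table:
        n_i = sum(row)
        for c in row:
            if c > 0:
                h -= (c / n) * math.log(c / n_i)
    return h

def conditional_gap(scale, base=((3, 1), (2, 4))):
    """H(Y|X) minus (1/n) * log-count for the scaled hypothetical table."""
    table = [[c * scale for c in row] for row in base]
    n = sum(sum(row) for row in table)
    return empirical_conditional_entropy(table) - log_num_effect_tuples(table) / n

# The gap shrinks with growing n = 10 * scale.
for scale in (1, 100, 10_000, 100_000):
    print(10 * scale, conditional_gap(scale))
```

The gap is nonnegative for every table, since each multinomial coefficient is bounded by the exponential of the corresponding row entropy times the row count.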

Generalization of Causal MaxEnt to arbitrary DAGs
Given a causally sufficient set of $N$ variables $X_1, \dots, X_N$, causally linked by the directed acyclic graph (DAG) $G$, the causal Markov condition (Spirtes et al., 1993; Pearl, 2000) implies that the joint distribution factorizes according to
$$P_{X_1, \dots, X_N} = \prod_{j=1}^{N} P_{X_j | PA_j}, \qquad (10)$$
where $P_{X_j | PA_j}$ denotes the conditional distribution of $X_j$, given its parents in $G$. If we are given multivariate constraints of the form
$$E[f_i(X_1, \dots, X_N)] = c_i, \quad i = 1, \dots, \ell, \qquad (11)$$
the arguments from Section 4 suggest obtaining the conditionals $P_{X_j | PA_j}$ by sequentially maximizing the conditional entropy $H(X_j | PA_j)$ according to an ordering that is consistent with $G$, a procedure already proposed by Sun et al. (2006). Since we construct the joint distribution as the product of the conditionals $P_{X_j | PA_j}$, it is Markov relative to $G$ by construction. This seems to overcome a problem with classical MaxEnt: maximizing the joint entropy subject to (11) does not necessarily result in a Markovian distribution, while maximizing entropy subject to (11) and (10) is not a convex optimization problem and thus need not have a unique solution (as shown below for a toy example).

Before describing problems with Causal MaxEnt for general DAGs, let us first consider an example where it makes sense. Let $X_j$ be binary variables connected by the causal structure
$$X_1 \to X_2 \to \cdots \to X_N. \qquad (12)$$
Assume now we are given a constraint saying '$X_j = 0$ implies $X_{j+1} = 0$' for $j = 1, \dots, N - 1$. Intuitively, this corresponds to a mechanism that appends 0 or 1 to any binary word ending with 1, but appends only 0 to words ending with 0. In other words, it rules out any binary word $(x_1, \dots, x_N)$ containing the substring 01. Classical MaxEnt would thus result in the uniform distribution over the $N + 1$ binary words $0 \dots 0$, $10 \dots 0$, $110 \dots 0$, $\dots$, $11 \dots 1$. Causal MaxEnt yields $X_1 = 1$ with probability 1/2, and all other $X_j$ attain 1 with probability 1/2 if their predecessor is 1. Thus, the binary words occur with probabilities $1/2^1, 1/2^2, \dots, 1/2^N, 1/2^N$, a distribution with much lower entropy. In this sense, Causal MaxEnt is more conclusive since it results in smaller uncertainty about the resulting joint distribution after leveraging the causal information.

However, sequential entropy maximization raises the following two problems (ignored by Sun et al. (2006)) in case the DAG is not complete: 9 First, the ordering of nodes is not necessarily unique. Second, sequentially maximizing entropy may render the constraints (11) infeasible, as shown by the following toy example with a DAG $G$ with two variables and no edge. Consider the binary variables $X_1, X_2$ with values $\pm 1$. The Markov condition implies the factorization
$$P_{X_1, X_2} = P_{X_1} P_{X_2}. \qquad (13)$$
Assume we are given the constraint
$$E[X_1 X_2] = 1. \qquad (14)$$
To implement Causal MaxEnt, let us choose the ordering $X_1, X_2$. We observe that (14) entails no restriction for the marginal distribution of $X_1$, and thus maximizing $H(X_1)$ yields $P(X_1 = 1) = 1/2$. Then Causal MaxEnt advises us to maximize the entropy of $X_2$, given its parents in $G$ (which is the empty set), subject to (14). However, there is no $P_{X_2}$ such that $P_{X_1} P_{X_2}$ satisfies (14) after we have already maximized the entropy of $X_1$. To satisfy the constraint, we need $X_2$ to depend on $X_1$, which violates the Markov condition. The only joint distributions satisfying constraint (14) and Markov condition (13) are the point measures on $(1, 1)$ and $(-1, -1)$. These are the two solutions of the non-convex problem of maximizing entropy subject to (14) and (13). By deciding for one of the solutions we would commit beyond the known constraint (14). If (14) results from independent mechanisms for $X_1$ and $X_2$, it could be that there are either two independent mechanisms generating only the value 1 for both variables, or independent mechanisms generating only $-1$ for both; we just do not know which scenario is the true one.
In other words, (14) represents our knowledge on the mechanisms, while the mechanisms themselves respect tighter constraints, namely (X 1 , X 2 ) = (1, 1) or (X 1 , X 2 ) = (−1, −1), depending on the scenario. Hence we have an example for the case where constraints are not 'tight' in the sense of our discussion in Subsection 3.2.
More generally speaking, the example shows that decomposing constraints like (11) into independent constraints for each of the mechanisms P Xj |P Aj may not be possible. The bivariate example suggests that the requirement to obtain a distribution that factorizes according to the DAG structure prohibits using the constraints entirely, given that we must not commit to any information that is not entailed by the constraints (as we would do by choosing either (X 1 , X 2 ) = (1, 1) or (X 1 , X 2 ) = (−1, −1)).
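Both toy examples of this section can be checked by brute force. The following sketch uses a hypothetical chain length N = 8 for the binary-word example. For the bivariate example it assumes that constraint (14) reads E[X1 X2] = 1, which is one reading consistent with the point-measure solutions described above; the grid resolution is likewise an illustrative choice.

```python
import math
from itertools import product

def allowed_words(n):
    """Binary words of length n that do not contain the substring 01."""
    return [w for w in product((0, 1), repeat=n)
            if all((a, b) != (0, 1) for a, b in zip(w, w[1:]))]

def causal_maxent_prob(word):
    """Sequential entropy maximization along the chain: X1 is uniform;
    X_{j+1} is uniform if X_j = 1 and forced to 0 if X_j = 0."""
    prob = 0.5
    for prev, cur in zip(word, word[1:]):
        if prev == 1:
            prob *= 0.5
        elif cur == 1:
            return 0.0   # word violates 'X_j = 0 implies X_{j+1} = 0'
    return prob

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

N = 8   # hypothetical chain length
words = allowed_words(N)
classical_entropy = entropy_bits([1 / len(words)] * len(words))
causal_entropy = entropy_bits([causal_maxent_prob(w) for w in words])
print(len(words), classical_entropy, causal_entropy)

def product_correlation(p1, p2):
    """E[X1 * X2] for independent X1, X2 in {-1, +1} with P(X_i = 1) = p_i."""
    return (2 * p1 - 1) * (2 * p2 - 1)

grid = [i / 100 for i in range(101)]
# After maximizing H(X1), P(X1 = 1) = 1/2 and no choice of P(X2) helps.
best_after_first_step = max(product_correlation(0.5, p2) for p2 in grid)
# Product distributions that meet the assumed constraint E[X1 X2] = 1.
feasible = [(p1, p2) for p1 in grid for p2 in grid
            if abs(product_correlation(p1, p2) - 1.0) < 1e-12]
print(best_after_first_step, feasible)
```

In the second part, the pair (0.0, 0.0) corresponds to the point measure on (-1, -1) and (1.0, 1.0) to the point measure on (1, 1).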

Deriving Information Geometric Causal Inference from Causal PIR
Information Geometric Causal Inference (IGCI) (Daniusis et al., 2010;) is a method for causal discovery that infers whether $X$ causes $Y$ or $Y$ causes $X$ from the bivariate distribution $P_{X,Y}$ for the case of an invertible deterministic causal relation, i.e., $Y = f(X)$ and $X = f^{-1}(Y)$. Although IGCI is more general, we sketch the idea for variables with range $[0, 1]$ and strictly monotonically increasing $f$, as shown in Figure 4, left. The intuitive idea is that, for the causal relation $X \to Y$, 'generic choices' of $P_X$ (chosen independently of $f$) 10 result in distributions $P_Y$ that tend to have higher density in regions where the derivative $(f^{-1})'(y)$ is large. To exploit this asymmetry for inferring the direction, one infers $X \to Y$ iff points accumulate in regions of small $f'$ rather than small $(f^{-1})'$. Formally, IGCI amounts to inferring the direction $X \to Y$ iff 11
$$\sum_{j=1}^{n} \log f'(x_j) < 0.$$
IGCI can be obtained as the deterministic and continuous limit of Causal PIR in the following sense. Note that our derivation is close in spirit to the justification of IGCI provided by , which relies on counting arguments in the space of discrete functions. However, we want to directly derive it from Causal PIR.

10 Formalized by the condition $\int_0^1 \log f'(x)\, p(x)\, dx \approx \int_0^1 \log f'(x)\, dx$.
11 Note the symmetry $\sum_{j=1}^{n} \log f'(x_j) = -\sum_{j=1}^{n} \log (f^{-1})'(y_j)$.

Assume we draw the function $f$ with a fat pen, as shown in Figure 4, right. After discretizing $X$ and $Y$ to get a grid with $\ell \times \ell$ points, define $R \subset \mathcal{X} \times \mathcal{Y}$ as the set of grid points $(x, y)$ lying on the fat stripe. For each $x$, let $N_X(x)$ denote the number of possible $y$-values for which $(x, y) \in R$. Define $N_Y(y)$ similarly. For each observed point $(x_j, y_j)$ we have
$$\frac{N_X(x_j)}{N_Y(y_j)} \approx f'(x_j). \qquad (15)$$
Then, $\prod_{j=1}^{n} N_X(x_j)$ is the number of possible $n$-tuples $y$ for the observed $x$. Likewise, $\prod_{j=1}^{n} N_Y(y_j)$ is the number of possible $n$-tuples $x$ for the observed $y$.
One checks easily that inferring the causal direction via the Causal PIR based cause-effect inference in Definition 2 thus amounts to comparing $\sum_{j=1}^{n} \log N_Y(y_j)$ to $\sum_{j=1}^{n} \log N_X(x_j)$, which, after using (15), amounts to checking the sign of $\sum_{j=1}^{n} \log f'(x_j)$. We have thus shown that another non-trivial causal inference method (apart from Causal MaxEnt) also follows from applying Causal PIR to the $n$-fold Cartesian product of the underlying probability space.
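The sign criterion can be tried on a simple case. The sketch below is an illustration only: the function f(x) = x^2 on [0, 1] is a hypothetical choice, its derivative is supplied analytically rather than estimated from data, and the inputs are spread evenly over (0, 1) to mimic a 'generic' input distribution chosen independently of f.

```python
import math

def igci_score(points, derivative):
    """Average of log derivative(point) over the sample. Under the rule
    above, a negative score for f' at the x_j favors X -> Y."""
    return sum(math.log(derivative(p)) for p in points) / len(points)

def f(x):
    """Hypothetical strictly increasing map of [0, 1] onto [0, 1]."""
    return x * x

def f_prime(x):
    return 2.0 * x

def f_inverse_prime(y):
    """Derivative of the inverse map y -> sqrt(y)."""
    return 1.0 / (2.0 * math.sqrt(y))

n = 1000
xs = [(j + 0.5) / n for j in range(n)]   # evenly spread inputs in (0, 1)
ys = [f(x) for x in xs]

score_x_to_y = igci_score(xs, f_prime)
score_y_to_x = igci_score(ys, f_inverse_prime)
print(score_x_to_y, score_y_to_x)
```

The two scores differ only in sign, in line with the symmetry noted for the sums of log-derivatives, so a single score suffices to decide the direction.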

Conclusions
Using a simple mechanical device as toy example, we have argued that our common sense replaces PIR with Causal PIR for bivariate distributions of cause and effect whenever we account for knowledge on the mechanism connecting cause and effect. We have further justified Causal PIR and Causal MaxEnt by assuming that constraints on joint distributions arise from two separate constraints: constraints on the cause and constraints on the cause-effect relation (in the sense of possible functions). Earlier work has solved paradoxes with usual MaxEnt by updating priors over functions, too. We have argued, however, that knowledge on the bivariate distribution is not a priori divided into information on the cause and information on the functional relation between cause and effect. We have therefore proposed a way to divide it into these two components in a way that entails minimal commitment for both of them.