# Learning by Ignoring the Most Wrong

## Abstract

Imprecise probabilities (IP) are an increasingly popular way of reasoning about rational credence. However, they are subject to an apparent failure to display convincing inductive learning. This paper demonstrates that a small modification to the update rule for IP allows us to overcome this problem, albeit at the cost of satisfying only a weaker concept of coherence.

## 1 Imprecise Probabilities

Consider representing your rational state of belief – your credal state – by a set of probability functions, ℙ, with a common algebra of events Α. Call ℙ a credal set.[1] We might interpret this as a set of possible sharpenings of a somewhat indeterminate credal state, for example.[2] Define ℙ(X) = {pr(X) : pr ∈ ℙ}, the set-valued function defined over the same algebra as the members of ℙ; and define P̄(X) = sup ℙ(X), the “upper probability” that takes the largest value assigned to an event X by members of ℙ. We can also define a “lower probability”: P̲(X) = inf ℙ(X). This is a simple imprecise probability (IP) theory of rational belief.
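For a finite credal set, the upper and lower probabilities just defined are simply the largest and smallest values the members of ℙ assign to an event. A minimal sketch in Python (the function names and the three-member credal set are illustrative, not from the text):

```python
# A credal set over a two-event algebra, represented as a list of
# probability functions (dicts from event labels to probabilities).

def upper_probability(credal_set, event):
    """sup of pr(event) over the credal set."""
    return max(pr[event] for pr in credal_set)

def lower_probability(credal_set, event):
    """inf of pr(event) over the credal set."""
    return min(pr[event] for pr in credal_set)

# Three sharpenings of an indeterminate credence in event A.
credal_set = [{"A": 0.2, "not A": 0.8},
              {"A": 0.5, "not A": 0.5},
              {"A": 0.7, "not A": 0.3}]

print(upper_probability(credal_set, "A"))  # 0.7
print(lower_probability(credal_set, "A"))  # 0.2
```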

We’re going to follow the standard practice of exploring the rationality of a credal state by considering what consequences it has for betting behaviour. Let’s make the assumption that if your credal state is ℙ, then P̲ serves as your “maximum buying price” for gambles, meaning that P̲ fills the role of q below. Call the bet “win 1 − x if A, lose x otherwise” a “unit bet on A (with a price of x)”.

One-Sided Betting

A unit bet on A with price x is acceptable if q (A) > x.

If you accept a unit bet on A, then the bookie has accepted a unit bet on ¬A with price 1 − x. Call the function q that plays the role of determining the acceptability of bets your “betting quotient”. The standard practice we are following here is to identify your betting quotient with your credence, but they are conceptually distinct.

If your credences inform your betting behaviour in this way, then it seems a minimal requirement of rationality is that your betting behaviour should be such that you are not subject to a sure loss. It is a bad thing to be such that, whatever happens, you lose money.

Avoid Sure Loss

Your prices for gambles in a one-sided betting scenario should be such that no combination of acceptable bets is guaranteed to yield a loss in every state.

There is a further property that we might also find desirable, namely that any collection of bets that always wins you money is acceptable to you. This is what Hájek (2008) calls a “Czech book”. So we might want to add in this extra clause.

Coherence

Your prices for gambles in a one-sided betting scenario should be such that no combination of acceptable bets is guaranteed to yield a loss in every state and every combination of bets that yield a gain in every state is acceptable.

Coherence is strictly stronger than avoiding sure loss. We then have the following result: you are coherent if and only if your buying betting quotients are the lower probabilities for some credal set (Troffaes and de Cooman 2014, Theorem 4.38).[3]

Note that the standard precise probability theory is recovered if we require more from your betting quotient:

Two-Sided Betting

1. A unit bet on A with price x is acceptable if q(A) > x.

2. If the above bet is not acceptable, then a unit bet on ¬A with price 1 − x is acceptable.

Satisfying coherence when your betting quotient function also determines what bets you sell requires that P̄(X) = P̲(X) for all X; in other words, you have to have precise probabilistic prices. This is what is standardly called the “Dutch book argument”. Getting from here to an argument that your rational credences ought to have a certain form requires some kind of premise that connects your rational prices for bets to your credences. Since my main focus is elsewhere, I’m just going to stipulate that we have such a principle, and I’ll talk about prices and credences interchangeably.

## 2 Belief Inertia

We can use the same betting methodology that we used to provide structure to rational credences to constrain the rational response to evidence. Let’s start with the precise probabilities case. Coherence should not just constrain attitudes at a time, but also attitudes over time. Consider your attitudes before updating on some proposition E, and after having done the update. Call these pr and pr′ respectively. It is not enough for pr and pr′ to both satisfy the axioms of probability individually: they should also be such that you can’t incur a sure loss in moving from one to the other on learning E. In short, the two credences should be jointly coherent and not merely separately coherent. In practice, what this means is that pr′(X) = pr(X | E) = pr(X ∧ E)/pr(E).[6] Now the question is: does updating in this way mean that we acquire better credences as we accrue evidence?[7] As we’ll see below, the answer is: “sometimes”.
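Conditionalisation on a finite state space is easy to sketch in code. The following minimal illustration is mine (the two-toss example and the function name are not from the text):

```python
# Sketch of precise conditionalisation pr'(X) = pr(X & E) / pr(E)
# on a finite space of states.

def conditionalise(pr, E):
    """Return pr(. | E): pr is a dict from states to probabilities,
    E a set of states (the learnt proposition)."""
    pr_E = sum(p for s, p in pr.items() if s in E)
    return {s: (p / pr_E if s in E else 0.0) for s, p in pr.items()}

# Two fair coin tosses; learn that the first toss landed heads.
pr = {"HH": 0.25, "HT": 0.25, "TH": 0.25, "TT": 0.25}
posterior = conditionalise(pr, {"HH", "HT"})
print(posterior["HH"])  # 0.5
```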

In this paper, we’re going to focus on a particular class of credence functions that do seem to exhibit reasonable inductive learning behaviour. Consider a precise credence function that is determined by two parameters, γ ∈ (0, 1) and λ > 0, where the credence that the next toss of the coin will land heads is pr(H) = γ, and learning that there were h heads and t tails in the last n = h + t tosses of the coin (call this evidence Sₙ = h) gives you an updated credence of pr(H | Sₙ = h) = (γλ + h)/(λ + n).[8] Any such credence is jointly coherent for all learning propositions Sₙ = h. We shall call such probability functions “learning priors”.[9] Note that Carnap’s continuum of inductive methods (for one predicate) is the special case where γ = 1/2.

Each pr is identified by two parameters γ and λ that capture a prior attitude to the bias of the coin or the chance of heads, and also how that prior responds to evidence. We can see this by rearranging the above expression to obtain:

(1) pr(H | Sₙ = h) = (n/(n + λ)) · (Sₙ/n) + (λ/(n + λ)) · γ

γ determines the prior attitude to the next toss’ landing heads, and λ governs how much your attitude changes in response to evidence about previous tosses. So if we keep λ fixed and vary γ, we have a range of priors that start at different points, but learn at roughly the same rate (Figure 1).[10] On the other hand, if we fix γ and vary λ, we have a range of priors that start with the same prior credence for H, but move different amounts on the basis of the same evidence (Figure 2). Note also that as these priors acquire evidence, they become less prone to change on the basis of further evidence.
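The learning rule and the two comparisons just described (fixed λ, varying γ; fixed γ, varying λ) can be sketched directly. The particular parameter values below are illustrative choices of mine:

```python
def learning_prior(gamma, lam, h, n):
    """Posterior credence in heads after h heads in n tosses:
    (gamma*lam + h) / (lam + n), as in equation (1)."""
    return (gamma * lam + h) / (lam + n)

# Fixed lam, varying gamma: different starting points, similar learning rate.
print(learning_prior(0.2, 2, 7, 10))  # ~0.617
print(learning_prior(0.8, 2, 7, 10))  # ~0.717

# Fixed gamma, varying lam: same prior credence for H, different responsiveness.
print(learning_prior(0.5, 1, 7, 10))    # ~0.682: moves a lot towards 7/10
print(learning_prior(0.5, 100, 7, 10))  # ~0.518: barely moves
```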

Figure 1:

Learning priors for fixed λ.

Figure 2:

Learning priors for fixed γ.

Given enough evidence, any such learning prior pr will converge on the true value of the chance.[11] By the (strong) law of large numbers, as we increase the number of coin tosses (i.e. as n increases), the ratio Sₙ/n will tend to the true chance; and (γλ + Sₙ)/(λ + n) gets closer to Sₙ/n as n gets larger, since the right-hand term of (1) will get smaller as n increases. So each learning prior will converge on the true chance of heads as evidence increases. This seems like a good result. At least for this class of priors, we have sensible inductive inference behaviour.
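A quick simulation illustrates the convergence claim. The assumed true chance (0.3), the seed, and the deliberately stubborn, badly-centred prior are all illustrative choices of mine:

```python
import random

def learning_prior(gamma, lam, h, n):
    return (gamma * lam + h) / (lam + n)

random.seed(0)
p_true = 0.3  # assumed "true chance", for illustration
heads = 0
n_tosses = 10000
for _ in range(n_tosses):
    heads += random.random() < p_true  # simulate one toss

# Even a stubborn prior (lam = 50) far from the truth (gamma = 0.9)
# ends up close to p_true once the evidence swamps it.
post = learning_prior(0.9, 50, heads, n_tosses)
print(post)  # close to p_true, by the law of large numbers
```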

To be clear, the property that priors of this kind have is the following:

Convergence to the True Chance

For a sequence of i.i.d. binary random variables Hᵢ ∈ {0, 1}, generated by a probability pr* with pr*(H) = p, let Sₙ = ∑ᵢ₌₁ⁿ Hᵢ. For almost all (pr*) sequences of outcomes, pr(H | Sₙ) → p as n → ∞.

That is, you are almost certain to receive a sequence of outcomes such that you converge to having correct beliefs about the next coin toss.[12] The chance that you receive evidence that doesn’t prompt you to converge on the truth is zero. More generally, if the desideratum is to approach some expert probability pr*, we want to converge (almost surely) to pr*(H | Sₙ). Since coin tosses are independent of past tosses, pr*(H) = pr*(H | Sₙ) for the chance function.

Let’s turn to IP now. Joint coherence of an unconditional and a conditional lower probability is a little trickier than the precise case,[13] but for our purposes it is enough to say that generalised conditioning is the most informative coherent way to update. Define a conditional credal set: ℙ(· | E) = {pr(· | E) : pr ∈ ℙ, pr(E) > 0}, the set obtained by conditioning each member of ℙ on the same evidence. Note the caveat that we throw out any pr that assigns zero probability to the evidence received. We shall return to that feature shortly.

Let’s return to the coin-tossing problem to see how credal sets get stuck. Let B be the set of all learning priors. That is, B is a set containing priors, all of which can learn on the basis of evidence.

Suppose we’ve gathered evidence Sₙ = h. The conditional probabilities of H given Sₙ = h are all the numbers of the form (γλ + h)/(λ + n). Hold h and n fixed; for any λ > 0 and γ ∈ (0, 1), that fraction is a possible probability value for heads. As λ gets bigger, the fraction tends to γ, and since γ is unconstrained, the range of values for the probability of heads, given Sₙ = h, covers all of (0, 1) regardless of what evidence you acquire. The problem is that, even though each prior is learning, there are always more recalcitrant priors in B that cover the whole range of possible chances. The bigger λ is, the less effect the evidence has on moving that prior towards the true chance.
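The recalcitrance of large-λ priors is easy to exhibit numerically. With the evidence held fixed at seven heads in ten tosses (an illustrative choice of mine), suitable choices of γ and λ put the GC posterior wherever we like in (0, 1):

```python
# With h and n fixed, the GC posterior (gamma*lam + h)/(lam + n) can be
# pushed arbitrarily close to any value in (0, 1) by choice of gamma and lam.

def posterior(gamma, lam, h, n):
    return (gamma * lam + h) / (lam + n)

h, n = 7, 10
# A huge lam drags the posterior towards gamma, whatever the evidence says:
print(posterior(0.01, 1e6, h, n))  # ~0.01, despite 70% observed heads
print(posterior(0.99, 1e6, h, n))  # ~0.99, despite 30% observed tails
```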

There are a number of ways we could diagnose the problem that this example highlights. One problem is that ℙ(H | Sₙ) fails to converge to pr*(H). Indeed, in the example just given, ℙ(H | Sₙ) = (0, 1) for every possible (finite) sequence of evidence about coin flips. Alternatively, we could point to the fact that P̄(H | Sₙ) − P̲(H | Sₙ) does not shrink as n increases. Call the former failure to converge and the latter failure to narrow. Convergence and narrowing are both desirable features of IP updating.

There are a number of ways IP updating can go that appear pathological. A credal set might be such that P̲(H | Sₙ) converges to something other than pr*(H | Sₙ). Or P̲(H | Sₙ) might just fail to converge at all. Or P̄ might converge while P̲ does not, or vice versa. A credal set might fail to narrow by remaining stuck at ℙ(H | Sₙ) = (0, 1) for all n (as in the above example), or P̄(H | Sₙ) − P̲(H | Sₙ) might shrink somewhat to some non-trivial interval. Call the former complete failure to narrow and the latter partial failure to narrow. In this latter case, we might further want to distinguish cases where the interval contains pr* and cases where it does not, with the former case seeming less bad somehow. These are not all logically distinct, obviously: complete failure to narrow precludes any sort of convergence, for example.

“Belief inertia” is used to refer to failure to narrow, but it’s often unclear or unspecified whether partial failure to narrow counts as inertia. It is also worth emphasising that solving failure to narrow is pretty easy;[14] the tricky thing is doing so in a way that permits convergence. The problem for IP is not belief inertia per se, but rather a failure to learn inductively. Solving belief inertia won’t necessarily solve this problem: there are bad ways to make credal sets narrow. “Belief inertia” appears sometimes to be used as a synonym for “failure to learn inductively”, but I hope the above comments are enough to make clear that this is not correct.

So, even in quite friendly circumstances – every prior in the set can learn – certain IP prior credal sets are unable to learn from evidence. Levi (1980) noticed this problem (chapter 13), and Walley (1991) was also well aware of the issue. Philosophical interest in the topic has picked up recently, with a number of papers discussing it (Castro and Hart 2019; Joyce 2010; Lassiter 2020; Lyon 2017; Moss n.d.; Rinard 2013; Vallinder 2018). Belief inertia is normally brought up by these authors as a reason to be sceptical that IP can provide a reasonable theory of rational belief, or at least as a serious problem that a theory of IP ought to overcome. Since sensible inductive inference seems like an absolutely central requirement for such a theory, we are owed an explanation of how to deal with belief inertia. I propose we address this problem by modifying the update rule for IP slightly.

## 3 Learning by Ignoring the Most Wrong

Recall that conditionalisation for a precise credence is defined, as is standard, by the following equation:

pr(X | E) = pr(X ∧ E)/pr(E).

Standard precise conditioning typically comes with the caveat that it only works for pr(E) > 0. That is, if the evidence you receive has zero prior probability, conditioning doesn’t work.[15] There is a mathematical reason for this: you can’t divide by zero. There is also, however, a conceptual reason to rule out credences that assign zero probability to the actual evidence. It seems that actually observing something that you initially took to be impossible[16] might give you reason to reassess whether you had sensible prior beliefs. Surely it would have been better to give some credence to the event that actually occurred? This motivates the idea that prior probability functions ought to be regular, that is, they ought to assign zero probability to logical falsehoods only. Now, regularity is commonly known to be an impossible property to satisfy when your space of events is infinitely large, but closer inspection reveals that only a restricted (and satisfiable) form of regularity is motivated by the above. Any evidence proposition you might actually observe is a member of some finite partition over which you ought to have non-zero probabilities. I don’t know about you, but I have never observed an infinitely thin dart being thrown at the unit interval; I have never bought a ticket to a lottery with countably many tickets; I’ve never observed all of an infinite sequence of coin tosses. Even in cases where the full space of possibilities is infinite (for example, arbitrarily long sequences of coin flips), I only ever actually observe some element of a finite partition of the evidence (some initial segment of an infinite sequence, for example). We have to be somewhat careful here.
It is trivial that any proposition A is a member of some finite partition ({A, ¬A} for instance), but what I am proposing here is that every evidence proposition we care about is a member of a finite partition such that it is sensible to require that your prior probability in every partition element is non-zero. For example, any proposition of the form “observe h heads in n tosses of the coin” has this property: the finite partition it belongs to is the ((n + 1)-membered) partition of all possible numbers of heads in n tosses. And if we focus our attention on your (conditional) credences about other possible evidence propositions (other members of finite evidence partitions), then it is perfectly reasonable and possible to require all of them to be non-zero. Note that all the learning priors from the previous section are regular in this restricted sense.

A note on terminology. If we have observed evidence E, it is standard to call the function that assigns to each probability function pr the value pr (E), the “likelihood” function. This function is very commonly used as a method of determining how “good” a probability function is, given some evidence. Sometimes this is the basis of the whole approach to statistics (e.g. Pawitan 2001), and even when it is not, the likelihood is an important part of the statistical toolkit. So the remarks of the previous paragraph could be rephrased like so: there’s a conceptual reason to consider probability functions that have zero likelihood to be defective.

Recall that we defined “Generalised Conditioning” (GC) as:

ℙ(X | E) = {pr(X | E) : pr ∈ ℙ, pr(E) > 0}

You will note that the last condition builds in the caveat we just discussed: pr(E) > 0. But note that such an update rule is one of many taking a simple form:

ℙ(X | E) = {pr(X | E) : pr ∈ ℚ}

We recover generalised conditioning by setting ℚ = {pr ∈ ℙ : pr(E) > 0}. Note that such a ℚ depends on both E and ℙ. We might explore alternatives to GC by considering alternative choices of ℚ. Some alternatives are discussed by, for example, Bradley and Steele (2014b), Gilboa and Schmeidler (1993), and Grove and Halpern (1998). Here are four candidates for ℚ.

1. ℚ = {pr ∈ ℙ : pr(E) > 0}

2. ℚ = {pr ∈ ℙ : pr(E) > τ}, for some fixed threshold 0 ≤ τ ≤ 1.

3. ℚ = {pr ∈ ℙ : pr(E) = P̄(E), pr(E) > 0}

4. ℚ = {pr ∈ ℙ : pr(E) ≥ αP̄(E), pr(E) > 0}, for α ∈ (0, 1)

The first of these is Generalised Conditioning. While the second seems to get something right – priors should pass some test of “reasonableness” as determined by the evidence received – it has the disadvantage of sometimes being empty (when P̄(E) ≤ τ). The third is the “Maximum likelihood” rule discussed by Gilboa and Schmeidler (1993). The name makes sense, since the rule proposes that you should update your credal set to the set of probability functions that attain the maximum likelihood value among the admissible priors. It yields an empty set in fewer circumstances,[17] but it is perhaps too hasty in jumping to conclusions. The last option is what Cattaneo (2014) calls “α-cut conditioning”, and we shall denote it ℙ(X |α E). This update rule is the one I want to focus on. Note that if you make α very small, this will be very similar to generalised conditioning. And if you make α very close to one, then α-cut conditioning will be very close to maximum likelihood.[18] The α parameter could be thought of as encapsulating a kind of epistemic “conservativity” (although not the same kind of conservativity that Konek (n.d.) discusses). The bigger your α, the quicker you jump to conclusions.
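Since the learning priors correspond to Beta(γλ, (1 − γ)λ) priors over the chance, the likelihood pr(Sₙ = h) is beta-binomial, and α-cut conditioning can be sketched directly. The grid of priors and the evidence below are illustrative choices of mine:

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_likelihood(gamma, lam, h, n):
    # log pr(S_n = h) for the learning prior (gamma, lam): beta-binomial
    # with parameters a = gamma*lam, b = (1 - gamma)*lam.
    a, b = gamma * lam, (1 - gamma) * lam
    log_choose = math.lgamma(n + 1) - math.lgamma(h + 1) - math.lgamma(n - h + 1)
    return log_choose + log_beta(a + h, b + n - h) - log_beta(a, b)

def alpha_cut(priors, h, n, alpha):
    """Keep the priors whose likelihood is at least alpha times the maximum."""
    logs = {p: log_likelihood(*p, h, n) for p in priors}
    best = max(logs.values())
    return [p for p, l in logs.items() if l >= best + math.log(alpha)]

priors = [(g / 10, lam) for g in range(1, 10) for lam in (1, 10, 100)]
survivors = alpha_cut(priors, h=3, n=10, alpha=0.5)
# Priors whose gamma is far from the observed frequency 0.3 and confident
# about it (large lam) are excised; those near 0.3 survive.
```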

Several authors have suggested equipping a set of probabilities with some kind of measure of reliability (Bradley 2017; Gärdenfors and Sahlin 1982; Hill 2013; Lyon 2017). This proposal is interesting, as far as it goes, but these authors don’t really provide much explanation of where this index of reliability is supposed to come from. As Cattaneo (2008) points out, the likelihood pr(E) gives you a measure of how reliable each probability in ℙ is, given the actual evidence. Thus, α-cut conditioning can be seen as part of a likelihood-based approach to enriching sets of probabilities with a well-defined and objective measure of reliability. Hill (n.d.) proposes an updating procedure which is similar to α-cut conditioning, but somewhat more complicated, since the basic object is a nested set of sets of probabilities rather than a single set of probabilities.

One can motivate α-cut conditioning in a very similar way to the more conceptual motivation for ruling out zero-probability conditioning in the precise case. Those pr ∈ ℙ that gave relatively small probability to E seem somewhat defective, and it is reasonable to excise them from your credal state. How small the prior probability of the evidence has to be is determined by the α parameter and by P̄(E), which is a kind of normalising factor. This normalising factor is necessary since, as your evidence gets more informative, even the most likely possibilities become less likely in absolute terms. For example, imagine that you know the chance of the coin landing heads is 0.3. The probability you assign to the evidence of one head in one toss is 0.3. This is the less likely of the two possibilities (one tail in one toss has probability 0.7). But consider the evidence of five heads in 16 tosses. This is the most likely outcome of 16 tosses of a coin with chance 0.3 of landing heads, and yet the probability of such an outcome is around 0.21: less likely (in absolute terms) than one head in one toss. Note that this normalising process means that α-cut conditioning is not a pointwise constraint in the terms of Moss (n.d.).
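The arithmetic of this example is easy to check; the code below is just that check:

```python
from math import comb

chance = 0.3
one_head_in_one_toss = chance  # probability of the evidence "one head in one toss"

# Binomial probability of exactly five heads in 16 tosses.
five_heads_in_16 = comb(16, 5) * chance**5 * (1 - chance)**11

# Five heads is the single most likely number of heads in 16 tosses...
most_likely = max(range(17),
                  key=lambda h: comb(16, h) * chance**h * (1 - chance)**(16 - h))
print(most_likely)                  # 5
# ...and yet it is less likely, in absolute terms, than one head in one toss.
print(round(five_heads_in_16, 2))  # 0.21
```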

Sceptics of the value of regularity (even in this restricted form) can still grant that there is something plausible about considering probabilities in the credal set with extremely low relative likelihood to be defective.

We can also see the plausibility of this method of updating by reflecting on the “credal committee” metaphor (Joyce 2010). Imagine if some committee member was such that they considered what actually happened to be significantly more unlikely than the other members of the committee did. This seems to be a kind of epistemic failing, and one that might lead to that committee member being ignored or thrown off the committee. The α parameter governs how widely you consult among your credal committee. A high α means you only listen to those who were close to being the most reliable, and thus you tend to jump to conclusions, whereas a lower α means you pay attention to more of your committee, even if they were not as reliable; this builds in a kind of conservativity.

One might want to pursue the option of more sophisticated methods for restricting the set of conditionalised probabilities, for example by using some form of accuracy-based scoring rule approach (Pettigrew 2016); or by letting α depend on E. Given our understanding of α as a measure of epistemic conservativity, it only makes sense to consider varying α when we think it is sensible to allow varying rates of conservativity within the same agent. α might vary by subject matter, for instance. I won’t pursue this thought further here because it seems plausible to me that an agent ought to have similar levels of conservativity to all propositions about coin tosses that come up in the examples we discuss.

Let’s return to the coin tossing example from the earlier section. We’re going to show that α-cut conditioning does appear to learn in simulations. Before we do that, a brief word on iterating α-cut conditionalisation. What happens if you update by E and then update by F? Does that give the same result as learning F and then E? It depends on how you update. Let ℙ′(·) = ℙ(· |α E). When you then learn F, do you move to ℙ(· |α E ∧ F) or to ℙ′(· |α F)? Call the former of these “total evidence” update, and the latter “iterative” update. In what follows, we’ll focus on total evidence update; for total evidence update, the order in which you learn propositions does not matter. See Section 5.2 for more on iterative update.

Analytic results for our coin tossing example are hard to come by, but if we simulate α-cut conditioning, we see that it converges (Figure 3).[19] The grey lines represent what generalised conditioning looks like for the probabilities in the set, and the red lines are those that make it through the α-cut process.[20] So we can see that α-cut conditioning does appear to converge on the true value of heads. If we graph the spread of probability values for heads over time (Figure 4), we see that it appears to shrink at a rate of about 1/√n for total evidence update (see also Wilson 2001). The fact that the spread for GC appears to shrink at all is an artefact of only simulating finitely many priors: there are more stubborn priors, not simulated, that prevent GC from shrinking at all. Figure 5 demonstrates that this is not a one-off: on average, the spread P̄(H | Sₙ) − P̲(H | Sₙ) shrinks as evidence grows, and the interval [P̲(H | Sₙ), P̄(H | Sₙ)] almost always covers the true chance pr*(H).[21]

Figure 3:

Alpha cut learning.

Figure 4:

P̄(H) − P̲(H) as evidence accrues.

Figure 5:

Average improvement for alpha cut.
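The kind of simulation behind these figures can be sketched as follows. The prior grid, the seed, the true chance, and α = 0.5 are all illustrative choices of mine; the point is only that the posterior interval under total evidence α-cut update narrows as evidence accrues:

```python
import math
import random

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_likelihood(gamma, lam, h, n):
    # Beta-binomial log-probability of h heads in n tosses for the
    # learning prior (gamma, lam).
    a, b = gamma * lam, (1 - gamma) * lam
    return (math.lgamma(n + 1) - math.lgamma(h + 1) - math.lgamma(n - h + 1)
            + log_beta(a + h, b + n - h) - log_beta(a, b))

def posterior(gamma, lam, h, n):
    return (gamma * lam + h) / (lam + n)

def alpha_cut_interval(priors, h, n, alpha):
    # Total evidence update: always alpha-cut the *original* prior set.
    logs = [(p, log_likelihood(*p, h, n)) for p in priors]
    best = max(l for _, l in logs)
    kept = [posterior(*p, h, n) for p, l in logs if l >= best + math.log(alpha)]
    return min(kept), max(kept)

random.seed(1)
p_true = 0.3
priors = [(g / 10, lam) for g in range(1, 10) for lam in (0.5, 2, 10, 50, 200)]

heads = 0
intervals = {}
for n in range(1, 501):
    heads += random.random() < p_true
    if n in (20, 500):
        intervals[n] = alpha_cut_interval(priors, heads, n, alpha=0.5)

widths = {n: hi - lo for n, (lo, hi) in intervals.items()}
print(intervals)  # the n = 500 interval should be narrower, and near p_true
```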

So our question becomes: “under what circumstances does α-cut allow you to learn H on the basis of evidence?” I don’t have any firm results here, but here are some gestures toward future work. If your credal set has some kind of natural “metric” on it, such that it makes sense to talk about how close one function is to another (for example, if the functions are all parameterised by some compact subset of ℝⁿ), then call a likelihood function “peaked around pr” if only those probabilities close to pr have high relative likelihood. In circumstances amenable to learning, you are likely to receive evidence such that the likelihood is peaked around a probability function near to the truth pr*.[22] If pr*(H) = r, then Sₙ/n is likely to be near r because of the law of large numbers, for example. If it is further the case that |pr(H | Eₙ) − pr*(H)| is small when pr is close to pr*, then ℙ will exhibit sensible learning behaviour. Obviously, much work will need to be done to tighten up what kind of “closeness” is appropriate for which learning problems, and how strong the bounds can be made. I take it that the above gestures towards a formal result at least suggest that this approach is worth exploring further.

## 4 Coherence

What coherence property does IP with α-cut conditioning satisfy? In general terms, the answer is that α-cut conditioning doesn’t even satisfy Avoid Sure Loss, as the following example demonstrates. Lottery A is a fair lottery with n tickets. You know nothing about lottery B, except that it also has n tickets, so you have a near-ignorance prior for outcomes from lottery B. I’m going to flip a fair coin to decide whether to draw a lottery A ticket (heads) or a lottery B ticket (tails). I’m also going to offer you bets denominated in some units such that your utility is linear with respect to those units. Before I flip the coin, I’ll offer you a bet at even odds that the coin lands heads; i.e. you pay 1/2 a unit to win one unit if the coin lands heads. Before revealing the outcome of the coin toss, I’m then going to tell you the outcome of the lottery draw. I then offer you a bet against heads at worse than even odds (pay 1 − 1/(2nα) units to win one unit if tails). Unless α is small enough (i.e. less than 1/n), these bets lose you money, whichever outcome I announce.[23] Why does this pathological behaviour manifest? What’s going on is that once you learn that ticket i was drawn from some lottery, the highest-likelihood probabilities are those that take lottery B to be extremely biased towards i. This, plus the fact that the chance set-up makes the announcement of the lottery outcome dilate the coin toss probabilities (Pedersen and Wheeler 2014; Seidenfeld and Wasserman 1993), leads to some weird behaviour.
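Taking the second bet’s price to be 1 − 1/(2nα) (a reconstruction of the garbled original), the combined book can be checked mechanically: the net payoff is the same whichever way the coin lands, and it is negative exactly when α > 1/n. A sketch, with illustrative values:

```python
def combined_payoff(n, alpha):
    """Net payoff from accepting both bets: a unit bet on heads at price 1/2,
    then a unit bet on tails at the (reconstructed) price 1 - 1/(2*n*alpha).
    The payoff is the same whichever way the coin lands."""
    price_tails = 1 - 1 / (2 * n * alpha)
    payoff_if_heads = (1 - 0.5) + (0 - price_tails)
    payoff_if_tails = (0 - 0.5) + (1 - price_tails)
    assert abs(payoff_if_heads - payoff_if_tails) < 1e-12  # state-independent
    return payoff_if_heads

n = 100
print(combined_payoff(n, alpha=0.5) < 0)    # True: sure loss, since alpha > 1/n
print(combined_payoff(n, alpha=0.001) > 0)  # True: no sure loss, since alpha < 1/n
```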

In our mystery coin example, however, α-cut conditioning does Avoid Sure Loss. According to Zaffalon and Miranda (2013, Theorem 1), you avoid sure loss if and only if, for every f, your unconditional upper prevision for f is greater than the lower prevision you would have for f if you were to learn B, for at least some B. α-cut conditioning satisfies that condition in the mystery coin example, but not in general. So at least for sets of learning priors and coin-toss evidence, α-cut conditioning Avoids Sure Loss. If you know you’ll learn one element of some partition ℬ, and you know you’ll be offered the chance to accept or reject some finite menu of gambles, then it’s always possible to set α so as not to be subject to sure loss. The intuition here is that as α gets smaller, your posterior credal set more closely resembles the outcome of GC, at least for finite partitions.[24] This solution is theoretically unappealing though, since it requires setting α to some particular value so as to avoid sure loss, whereas α was supposed to be a parameter determined by your taste for epistemic risk.

However, α-cut conditioning cannot satisfy coherence, unless it coincides with GC. Generalised Conditioning is the most informative possible jointly coherent conditional prevision (Miranda 2009, Lemma 2). What this means is that any conditional lower probability that is jointly coherent with the prior ℙ must be such that its lower conditional probabilities are no higher than those that GC yields. Since the whole point of α-cut conditioning was to provide more informative conditional beliefs than those obtained by GC, we cannot require that α-cut satisfy this stronger condition.

So α-cut conditionalisers will sometimes be such that they don’t find a set of bets that guarantees a net gain to be acceptable. To some extent, this is an issue that imprecise probabilists will have to address anyway, since a similar kind of apparent problem arises for decisions made over time, even without any updating (Elga 2010): in both cases the issue is that combining judgements of acceptability across time yields unintuitive results. This sort of sequential choice incoherence – or “dynamic inconsistency” – is discussed in a number of places. In philosophy, several people have responded directly to Elga’s challenge (Moss 2015a; Rinard 2015, for example); in economics and psychology the idea is often tied up with discussions of “ambiguity aversion” (Al-Najjar and Weinstein 2009; Trautmann and Van der Kuijlen 2016). A full discussion of this topic is beyond the scope of the current paper, but note that the advocate of α-cut conditionalisation has many resources to bring to bear on the issue of dynamic inconsistency. One approach – that of Bradley and Steele (2014a) and Rinard (2015), for example – is to deny the “package principle” that governs how the acceptability of a series of bets is determined by the acceptability of its members.

Whether α-cut conditioning avoids sure loss depends on the kind of propositions one might learn. It is an open question whether it’s always possible to set α low enough to avoid sure loss in a given learning scenario, so exactly what weakened kind of coherence we can salvage is currently unknown. It would also be interesting to systematically explore how incoherent α-cut conditioning is in the sense of Schervish, Seidenfeld, and Kadane (1997) and Staffel (2020). A natural hypothesis is that the smaller α is, the less incoherent α-cut conditioning is.

## 5 Conclusions

My conclusions are modest. For the case of learning the bias of a coin from evidence of repeated tosses, α-cut conditioning appears to converge on the truth.[25] This does mean, however, giving up on coherence and making do with just avoiding sure loss. This should provide some small amount of comfort to the imprecise probabilist: belief inertia is not an insurmountable challenge. Beyond this simple case of coin tossing, results are more mixed. I suspect there will be a more general convergence result, but without more clarity on when this approach to update avoids sure loss in general, such a learning result is of limited interest. I believe that this approach to update should be explored in more detail in the future.

Corresponding author: Seamus Bradley, University of Leeds, Woodhouse Lane, Leeds LS2 9JT, UK.

Many thanks to the audience in Munich (online) for their comments. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No 792292.


## Appendix A: Merging of Probabilities

Schervish and Seidenfeld (1990) offer some results that might appear to make belief inertia a phenomenon that can only occur under relatively circumscribed conditions. We first state a result that follows from Schervish and Seidenfeld (1990), Corollary 1.

Merging of Opinion


If ℙ is contained in a closed, convex set of (countably additive) probabilities ℚ such that:

• for all pr, pr′ ∈ ℚ and all X in Α, we have pr(X) = 0 iff pr′(X) = 0 (i.e. all the elements of ℚ are mutually absolutely continuous)

• ℚ is the convex hull of some finite subset of its elements (i.e. ℚ has finitely many extreme points)

then for almost all sequences of possible evidence, the probabilities in ℙ converge, in the sense that for any pr, pr′ ∈ ℙ and any X ∈ Α:

|pr(X|E_n) − pr′(X|E_n)| → 0 as n → ∞

That is, if the probabilities agree on the measure zero events, then, with increasing evidence, they will get closer together on the probability for every event. So merging guarantees narrowing, but not necessarily convergence. This merging of opinion only constitutes learning if, among the pr ∈ ℙ, there is some probability such that being close to it is epistemically good. We earlier called this the “expert probability”. This result secures that probabilities meeting the above conditions will merge (i.e. get closer together) and, if the expert probability is in the set, then we can call this merging “learning”. “Belief inertia” is, then, a failure to merge even in the presence of such expert probabilities. So, to the extent that it’s reasonable to expect (a) that your prior credal set meets the above conditions, and (b) that it contains suitable expert probabilities, belief inertia is no problem.

As we saw, belief inertia does happen for the learning priors discussed in the main text. Which of (a) or (b) is false in this example? Let’s start with (a): the conditions for the application of the Schervish and Seidenfeld (1990) result don’t hold. Consider any two probability functions pr and q that make the coin tosses independent and identically distributed. That is, these probability functions are candidates to be the chance function. If you like, functions of this kind are the possible infinite limits of the process of learning the outcomes of coin tosses.[26] They are infinitely stubborn in the sense that no amount of evidence will change their probability for the next toss landing heads. No two distinct such probability functions are mutually absolutely continuous, since if pr(H_i) = p ≠ q = q(H_i), then pr gives probability one to the set of sequences of coin tosses with limiting frequency p, and q gives that set probability zero (and vice versa). So, even though the beta priors are all in the convex hull of the set of these chance distributions (this is a consequence of de Finetti’s representation theorem, since the coin tosses are exchangeable), these extreme points are not mutually absolutely continuous, and thus the above result does not apply.
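The contrast between merging and stubborn priors can be checked with a small calculation. The sketch below is illustrative only (the priors and toss counts are not from the paper): two beta priors, being mutually absolutely continuous, merge in their predictive probabilities as evidence accumulates, while two of the i.i.d. chance hypotheses never move at all.

```python
from fractions import Fraction

def beta_posterior_mean(a, b, heads, tails):
    # Predictive probability of heads under a Beta(a, b) prior,
    # after observing the given numbers of heads and tails.
    return Fraction(a + heads, a + b + heads + tails)

# Two beta priors (mutually absolutely continuous): their predictive
# probabilities approach one another as evidence accumulates.
heads, tails = 70, 30  # illustrative counts
p1 = beta_posterior_mean(1, 1, heads, tails)  # uniform prior
p2 = beta_posterior_mean(5, 5, heads, tails)  # more concentrated prior
gap = abs(p1 - p2)

# Two "stubborn" i.i.d. chance functions, say biases 0.7 and 0.3:
# no evidence moves them, so their gap never closes.
stubborn_gap = abs(0.7 - 0.3)

print(float(p1), float(p2), float(gap), stubborn_gap)
```

After 100 tosses the two beta predictions are already close, and feeding in more evidence shrinks the gap further; the stubborn pair stays 0.4 apart forever.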

Beyond the realm of learning i.i.d. evidence, the above result highlights circumstances under which narrowing is guaranteed. If you are sure that the expert probability is among those with which you are merging, then successful inductive inference is assured. I have no answer to the question of how often these conditions are satisfied in real examples.

## Appendix B: Commutativity

We are going to need a few definitions, so I collect them here:

• ℙ_α[E, ℙ] = { pr ∈ ℙ : pr(E) ≥ α·P̄(E) }

• ℙ_E^α = ℙ(·|_α E) = { pr(·|E) : pr ∈ ℙ, pr(E) ≥ α·P̄(E) }

(where P̄(E) is the upper probability of E)

Consider the case of learning E, then learning F. In Figure 6, iterative conditioning amounts to moving diagonally down and left to ℙ_E^α, and then down and left again to ℙ_E^α(·|_α F). Total evidence conditioning also amounts to moving diagonally down and left to ℙ_E^α, but then learning F amounts to α-cut conditioning the prior ℙ on the total evidence E ∧ F: moving to the bottom right-hand corner ℙ(·|_α E ∧ F). These two procedures are distinct: you don’t necessarily end up in the same place.[27]

Figure 6:

α-cut conditioning as GC plus restriction.
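The two routes can be made concrete with a minimal sketch. It assumes, as in the coin case, a credal set of i.i.d. bias hypotheses, for which conditioning leaves each hypothesis’ bias unchanged, so only the α-cut restriction step matters; the grid of biases, the value of α, and the toss counts are invented for illustration.

```python
def alpha_cut(priors, likelihood, alpha):
    # Keep the priors assigning the evidence at least alpha times the
    # highest probability any member of the set assigns it.
    upper = max(likelihood(p) for p in priors)
    return [p for p in priors if likelihood(p) >= alpha * upper]

def lik(heads, n):
    # Likelihood of a sequence with `heads` heads in n i.i.d. tosses.
    return lambda p: p ** heads * (1 - p) ** (n - heads)

priors = [i / 100 for i in range(5, 96)]  # hypothetical bias grid
alpha = 0.5

# Iterative route: cut on E (7 heads in 10), then cut the survivors on F.
survivors = alpha_cut(priors, lik(7, 10), alpha)
iterative = alpha_cut(survivors, lik(6, 10), alpha)

# Total evidence route: cut the ORIGINAL set on E and F together.
total = alpha_cut(priors, lik(13, 20), alpha)

print(min(iterative), max(iterative))
print(min(total), max(total))
```

The two resulting sets need not coincide, which is the divergence between the iterated and total-evidence paths pictured in Figure 6.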

Figure 7 illustrates the non-commutativity of iterative α-cut update. As is clear from the image, the two credal sets (red and blue)[28] do differ, but on the whole they have a reasonable amount of overlap and they both tend towards the same interval around the true chance. So, at least in this case, the failure of commutativity is benign.

Figure 7:

The same prior under two permutations of the same sequence of evidence.

Here’s an even simpler example of a failure of commutativity. Consider ℙ = {P, Q} where P(A) = 0.9, P(B) = 0.2, P(C) = 1 and Q(A) = 0.2, Q(B) = 0.9, Q(C) = 0, and let α = 0.5. ℙ(·|_α A) = {P}, so ℙ_A^α(C|_α B) = 1, whereas ℙ(·|_α B) = {Q}, so ℙ_B^α(C|_α A) = 0. That is, depending on whether A is learned before B or vice versa, you either come to fully believe C or fully disbelieve C. So in toy examples we see some pathological failures of commutativity.[29]
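This toy calculation is easy to verify mechanically. The sketch below encodes P and Q as plain dictionaries of unconditional probabilities; since P(C) = 1 and Q(C) = 0, conditioning cannot move either function’s probability for C, so the α-cut restriction alone settles the outcome.

```python
def alpha_cut(credal_set, event, alpha):
    # Keep the functions assigning the event at least alpha times
    # the set's upper probability for it.
    upper = max(pr[event] for pr in credal_set)
    return [pr for pr in credal_set if pr[event] >= alpha * upper]

P = {"A": 0.9, "B": 0.2, "C": 1.0}
Q = {"A": 0.2, "B": 0.9, "C": 0.0}
credal = [P, Q]
alpha = 0.5

# Learn A first: Q(A) = 0.2 < 0.5 * 0.9, so only P survives.
after_A = alpha_cut(credal, "A", alpha)
# Learn B first: P(B) = 0.2 < 0.5 * 0.9, so only Q survives.
after_B = alpha_cut(credal, "B", alpha)

print([pr["C"] for pr in after_A], [pr["C"] for pr in after_B])
```

Learning A first leaves only P, which is certain of C; learning B first leaves only Q, which is certain of ¬C, exactly as in the text.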

I don’t consider this failure of commutativity to be an insurmountable problem since, if the priors in the credal set are suitably well behaved (i.e. satisfy the conditions for some form of convergence theorem), then, whatever order the evidence comes in, the outcome will be a set of probability functions each getting closer to the truth. However, coupled with the fact that iterative update doesn’t necessarily converge on the truth, I think we have reason to prefer total evidence update. Given that total evidence update appears to be what time-slice epistemologists like Moss (2015b) and Hedden (2015) advocate as the right way to conceive of update, one might think that this serves as an argument in favour of their view. I don’t pursue this line of thought since I think there are good reasons to think there really are diachronic norms for decision making, but that discussion is beyond the scope of this paper.

Let’s now consider whether iterative α-cut conditioning converges. If you end up in a situation where α·P̄(E) ≤ P̲(E), then nothing is thrown out. Consider updating (iteratively) after every toss of a coin. After a few tosses, you’ve thrown out the extreme priors, but there’s still a range of priors available. So your priors shrink for a bit, but then stop shrinking. You can see this happening in Figure 4: the line for iterative update starts dropping as fast as the total evidence spread does, but after a certain time, it levels off and starts to drop at about the rate GC does. This doesn’t happen in the total evidence updating case because the “amount of evidence” you update on keeps growing. As Figure 8 illustrates, iterative update eventually stops converging much, whereas total evidence update continues to converge. (Indeed, that iterative update appears to converge at all in the right-hand portion of the graph is an artefact of simulating only a finite number of priors.)

Figure 8:

Iterative and total evidence update compared.
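The levelling-off is easy to reproduce in simulation. The sketch below (the bias, grid, α, sample size, and seed are all invented for illustration) α-cut updates a set of i.i.d. coin-bias hypotheses toss by toss, and compares the surviving interval of biases with a single total-evidence α-cut on the whole sequence.

```python
import random

def alpha_cut(priors, likelihood, alpha):
    # Keep the priors assigning the evidence at least alpha times the
    # highest probability any member of the set assigns it.
    upper = max(likelihood(p) for p in priors)
    return [p for p in priors if likelihood(p) >= alpha * upper]

random.seed(0)
priors = [i / 1000 for i in range(1, 1000)]  # hypothetical bias grid
alpha, true_bias, n = 0.5, 0.7, 200
tosses = [random.random() < true_bias for _ in range(n)]

# Iterative update: alpha-cut the surviving set on each toss as it arrives.
iterative = priors
for heads in tosses:
    single_toss_lik = (lambda p: p) if heads else (lambda p: 1 - p)
    iterative = alpha_cut(iterative, single_toss_lik, alpha)

# Total evidence update: alpha-cut the ORIGINAL set on the whole sequence.
h = sum(tosses)
total = alpha_cut(priors, lambda p: p ** h * (1 - p) ** (n - h), alpha)

print("iterative width:", max(iterative) - min(iterative))
print("total width:", max(total) - min(total))
```

In runs like this the iterative interval soon reaches a state where α·P̄(E) ≤ P̲(E) for each subsequent toss and stops shrinking, while the total-evidence interval keeps narrowing around the sample frequency.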

## References

Al-Najjar, N. I., and J. Weinstein. 2009. “The Ambiguity Aversion Literature: A Critical Assessment.” Economics and Philosophy 25: 249–84. https://doi.org/10.1017/s026626710999023x.

Augustin, T., F. P. A. Coolen, G. de Cooman, and M. C. M. Troffaes, eds. 2014. Introduction to Imprecise Probabilities. New York: John Wiley and Sons. https://doi.org/10.1002/9781118763117.

Bradley, R. 2017. Decision Theory with a Human Face. Cambridge: Cambridge University Press. https://doi.org/10.1017/9780511760105.

Bradley, S. 2015. “How to Choose Among Choice Functions.” In ISIPTA 2015 Proceedings, 57–66.

Bradley, S. 2019. “Imprecise Probabilities.” In The Stanford Encyclopedia of Philosophy, edited by E. N. Zalta. The Metaphysics Research Lab.

Bradley, S., and K. Steele. 2014a. “Should Subjective Probabilities Be Sharp?” Episteme 11: 277–89. https://doi.org/10.1017/epi.2014.8.

Bradley, S., and K. Steele. 2014b. “Uncertainty, Learning and the ‘Problem’ of Dilation.” Erkenntnis 79: 1287–303. https://doi.org/10.1007/s10670-013-9529-1.

Buchak, L. 2013. Risk and Rationality. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199672165.001.0001.

Castro, C., and C. Hart. 2019. “The Imprecise Impermissivist’s Dilemma.” Synthese 196 (4): 1623–40. https://doi.org/10.1007/s11229-017-1530-9.

Cattaneo, M. 2008. “Fuzzy Probabilities Based on the Likelihood Function.” In Soft Methods for Handling Variability and Imprecision, edited by D. Dubois, M. A. Lubiano, H. Prade, M. Á. Gil, P. Grzegorzewski, and O. Hryniewicz, 43–50. Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-540-85027-4_6.

Cattaneo, M. 2014. “A Continuous Updating Rule for Imprecise Probabilities.” In Information Processing and Management of Uncertainty in Knowledge Based Systems, edited by A. Laurent, O. Strauss, B. Bouchon-Meunier, and R. R. Yager, et al., 426–35. Springer. https://doi.org/10.1007/978-3-319-08852-5_44.

Elga, A. 2010. “Subjective Probabilities Should Be Sharp.” Philosophers’ Imprint 10: 1–11.

Gärdenfors, P., and N.-E. Sahlin. 1982. “Unreliable Probabilities, Risk Taking and Decision Making.” Synthese 53: 361–86. https://doi.org/10.1007/BF00486156.

Gilboa, I., and D. Schmeidler. 1993. “Updating Ambiguous Beliefs.” Journal of Economic Theory 59: 33–49. https://doi.org/10.1006/jeth.1993.1003.

Grove, A., and J. Y. Halpern. 1998. “Updating Sets of Probabilities.” In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 173–82.

Hájek, A. 2003. “What Conditional Probabilities Could Not Be.” Synthese 137: 273–323. https://doi.org/10.1023/B:SYNT.0000004904.91112.16.

Hájek, A. 2008. “Arguments for—or Against—Probabilism?” The British Journal for the Philosophy of Science 59: 793–819.

Hájek, A. n.d. Staying Regular?

Hedden, B. 2015. Reasons without Persons. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780198732594.001.0001.

Hill, B. n.d. Updating Confidence in Beliefs. Also available at https://ssrn.com/abstract=3385116.

Hill, B. 2013. “Confidence and Decision.” Games and Economic Behavior 82: 675–92. https://doi.org/10.1016/j.geb.2013.09.009.

Joyce, J. M. 2005. “How Probabilities Reflect Evidence.” Philosophical Perspectives 19: 153–78. https://doi.org/10.1111/j.1520-8583.2005.00058.x.

Joyce, J. M. 2010. “A Defense of Imprecise Credence in Inference and Decision.” Philosophical Perspectives 24: 281–323. https://doi.org/10.1111/j.1520-8583.2010.00194.x.

Konek, J. n.d. “Epistemic Conservativity and Imprecise Credence.” Philosophy and Phenomenological Research.

Lassiter, D. 2020. “Representing Credal Imprecision: from Sets of Measures to Hierarchical Bayesian Models.” Philosophical Studies 177: 1463–85. https://doi.org/10.1007/s11098-019-01262-8.

Levi, I. 1974. “On Indeterminate Probabilities.” Journal of Philosophy 71: 391–418. https://doi.org/10.2307/2025161.

Levi, I. 1980. The Enterprise of Knowledge. Cambridge: The MIT Press.

Lyon, A. 2014. “From Kolmogorov to Popper to Rényi: There’s No Escaping Humphrey’s Paradox (When Generalized).” In Chance and Temporal Asymmetry, edited by T. Handfield, and A. Wilson, 112–25. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199673421.003.0006.

Lyon, A. 2017. “Vague Credence.” Synthese 194 (10): 3931–54. https://doi.org/10.1007/s11229-015-0782-5.

Miranda, E. 2009. “Updating Coherent Previsions on Finite Spaces.” Fuzzy Sets and Systems 160: 1286–307. https://doi.org/10.1016/j.fss.2008.10.005.

Moss, S. 2015a. “Credal Dilemmas.” Noûs 49: 665–83. https://doi.org/10.1111/nous.12073.

Moss, S. 2015b. “Time-Slice Epistemology and Action Under Indeterminacy.” In Oxford Studies in Epistemology, 172–94. https://doi.org/10.1093/acprof:oso/9780198722762.003.0006.

Moss, S. n.d. “Global Constraints on Imprecise Credences: Solving Reflection Violations, Belief Inertia and Other Principles.” Philosophy and Phenomenological Research. https://doi.org/10.1111/phpr.12703.

Pawitan, Y. 2001. In All Likelihood: Statistical Modelling and Inference Using Likelihood. Clarendon Press.

Pedersen, A. P., and G. Wheeler. 2014. “Demystifying Dilation.” Erkenntnis 79: 1305–42. https://doi.org/10.1007/s10670-013-9531-7.

Pettigrew, R. 2016. Accuracy and the Laws of Credence. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780198732716.001.0001.

Rinard, S. 2013. “Against Radical Credal Imprecision.” Thought 2: 157–65. https://doi.org/10.1002/tht3.84.

Rinard, S. 2015. “A Decision Theory for Imprecise Probabilities.” Philosophers’ Imprint 15: 1–16.

Schervish, M. J., and T. Seidenfeld. 1990. “An Approach to Consensus and Certainty with Increasing Evidence.” Journal of Statistical Planning and Inference 25: 401–14. https://doi.org/10.1016/0378-3758(90)90084-8.

Schervish, M. J., T. Seidenfeld, and J. B. Kadane. 1997. Two Measures of Incoherence: How Not to Gamble if You Must, Technical Report. Pittsburgh: Department of Statistics, Carnegie Mellon University.

Seidenfeld, T., and L. Wasserman. 1993. “Dilation for Sets of Probabilities.” Annals of Statistics 21: 1139–54. https://doi.org/10.1214/aos/1176349254.

Staffel, J. 2020. Unsettled Thoughts. Oxford: Oxford University Press. https://doi.org/10.1093/oso/9780198833710.001.0001.

Trautmann, S., and G. Van der Kuijlen. 2016. “Ambiguity Attitudes.” In Blackwell Handbook of Judgement and Decision-Making, edited by G. Keren, and G. Wu, 89–116. Blackwell. https://doi.org/10.1002/9781118468333.ch3.

Troffaes, M., and G. de Cooman. 2014. Lower Previsions. Wiley. https://doi.org/10.1002/9781118762622.

Vallinder, A. 2018. “Imprecise Bayesianism and Global Belief Inertia.” The British Journal for the Philosophy of Science 69 (4): 1205–30. https://doi.org/10.1093/bjps/axx033.

Walley, P. 1991. Statistical Reasoning with Imprecise Probabilities. Monographs on Statistics and Applied Probability, Vol. 42. Chapman and Hall. https://doi.org/10.1007/978-1-4899-3472-7.

Walley, P. 1996. “Inferences from Multinomial Data: Learning about a Bag of Marbles.” Journal of the Royal Statistical Society B 58 (1): 3–57. https://doi.org/10.1111/j.2517-6161.1996.tb02065.x.

Wilson, N. 2001. “Modified Upper and Lower Probabilities Based on Imprecise Likelihoods.” In Proceedings of the 2nd International Symposium on Imprecise Probabilities and their Applications.

Zaffalon, M., and E. Miranda. 2013. “Probability and Time.” Artificial Intelligence 198: 1–51. https://doi.org/10.1016/j.artint.2013.02.005.

Published Online: 2021-12-24