Law invariant risk measures and information divergences

A one-to-one correspondence is drawn between law invariant risk measures and divergences, which we define as functionals of pairs of probability measures on arbitrary standard Borel spaces satisfying a few natural properties. Divergences include many classical information divergences, such as relative entropy and $f$-divergences. Several properties of divergences and their duality with law invariant risk measures are developed, most notably relating their chain rules or additivity properties to certain notions of time consistency for dynamic law invariant risk measures known as acceptance and rejection consistency. These properties are also linked to a peculiar property of the acceptance sets on the level of distributions, analogous to results of Weber on weak acceptance and rejection consistency. Finally, the examples of shortfall risk measures and optimized certainty equivalents are discussed in some detail, and it is shown that the relative entropy is essentially the only divergence satisfying the chain rule.


Introduction
This paper deepens the analysis of law invariant risk measures and their connection to divergence-type functionals of probability measures. Throughout the paper, a nonatomic standard Borel space (Ω, F, P) is fixed, and a risk measure is defined to be a convex functional ρ : L^∞ := L^∞(Ω, F, P) → R satisfying:
(1) Monotonicity: If X, Y ∈ L^∞ and X ≤ Y a.s., then ρ(X) ≤ ρ(Y).
(2) Cash additivity: ρ(X + c) = ρ(X) + c for all X ∈ L^∞ and c ∈ R.
(3) Normalization: ρ(0) = 0.
The functional X ↦ ρ(−X) is more traditionally called a normalized convex risk measure; some authors use the term acceptability measure [31] for what we have chosen to call a risk measure. Convex risk measures first appeared in [14,17,21], extending the class of coherent risk measures introduced in the seminal paper of Artzner et al. [4] (see also [10]). A risk measure ρ is law invariant if ρ(X) = ρ(Y) whenever X and Y have the same law. Three standard examples will guide us throughout the paper. The first is the well known entropic risk measure
ρ(X) = η^{-1} log E[e^{ηX}], η > 0.
Second, given a nondecreasing convex function ℓ : R → [0, ∞) with ℓ(0) = 1, the corresponding shortfall risk measure (introduced by Föllmer and Schied in [14]) is
ρ(X) = inf{m ∈ R : E[ℓ(X − m)] ≤ 1}.
Lastly, given a nondecreasing convex function φ : R → R with φ^*(1) = sup_{x∈R}(x − φ(x)) = 0, the corresponding optimized certainty equivalent (introduced by Ben-Tal and Teboulle in [5,6]) is
ρ(X) = inf_{m∈R} ( m + E[φ(X − m)] ).
We construct divergences as follows: Fix a law invariant risk measure ρ. Given a Polish space E, let P(E) denote the set of Borel probability measures on E. For any Polish space (or any standard Borel space) E and any µ ∈ P(E), we may define a new law invariant risk measure ρ_µ : L^∞(E, µ) → R by ρ_µ(f) := ρ(f(X)), where X is any E-valued random variable on Ω with law P • X^{-1} = µ. Indeed, such an X exists because Ω is nonatomic, and this definition is independent of the choice of X thanks to law invariance. This family of risk measures satisfies a consistency property, namely
ρ_µ(f) = ρ_ν(g), whenever µ • f^{-1} = ν • g^{-1}. (1.1)
Let α(·|µ) denote the minimal penalty function associated to ρ_µ, i.e., the restriction to P(E) of the convex conjugate of ρ_µ:
α(ν|µ) := sup_{f ∈ B(E)} ( ∫_E f dν − ρ_µ(f) ).
We call α the divergence induced by ρ. In summary, the functional α(·|·) is defined for pairs of probability measures on any Polish space (or standard Borel space), much like the classical relative entropy and other information divergences, such as f-divergences.
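For concreteness, the three guiding examples can be compared numerically on a finite probability space. The following Python sketch is our own illustration, not part of the paper: the distribution, the loss function ℓ = exp, and φ(x) = e^x − 1 are arbitrary choices, made so that both the shortfall risk measure and the optimized certainty equivalent reduce to the entropic risk measure with η = 1.

```python
import math

# Finite sample space: outcomes xs with probabilities ps (a toy stand-in
# for the nonatomic space (Omega, F, P); all numbers are illustrative).
xs = [-1.0, 0.0, 0.5, 2.0]
ps = [0.1, 0.4, 0.3, 0.2]

def expect(f):
    return sum(p * f(x) for p, x in zip(ps, xs))

def entropic(eta=1.0):
    # Entropic risk measure: rho(X) = (1/eta) log E[exp(eta X)].
    return math.log(expect(lambda x: math.exp(eta * x))) / eta

def shortfall(l, lo=-50.0, hi=50.0, tol=1e-12):
    # Shortfall risk for a loss function l with l(0) = 1:
    # rho(X) = inf{m : E[l(X - m)] <= 1}. The map m -> E[l(X - m)] is
    # nonincreasing, so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expect(lambda x: l(x - mid)) <= 1.0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def oce(phi):
    # Optimized certainty equivalent: rho(X) = inf_m (m + E[phi(X - m)]),
    # minimized here by brute force over a fine grid.
    grid = (i * 1e-4 for i in range(-20000, 40001))
    return min(m + expect(lambda x: phi(x - m)) for m in grid)

rho_ent = entropic(1.0)
rho_sf = shortfall(math.exp)                  # l(x) = e^x, so l(0) = 1
rho_oce = oce(lambda x: math.exp(x) - 1.0)    # phi(x) = e^x - 1, phi*(1) = 0
```

For these exponential choices all three values coincide; for other ℓ and φ they differ in general.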
Indeed, when ρ is the entropic risk measure, α is nothing but the usual relative entropy (also known as the Kullback–Leibler divergence),
α(ν|µ) = ∫ log(dν/dµ) dν for ν ≪ µ, and α(ν|µ) = ∞ otherwise.
When ρ is a shortfall risk measure corresponding to a function ℓ, the induced divergence is
α(ν|µ) = inf_{t>0} (1/t) ( 1 + ∫ ℓ^*( t dν/dµ ) dµ ), for ν ≪ µ,
where ℓ^*(t) = sup_{s∈R}(st − ℓ(s)) is the convex conjugate of ℓ. Finally, when ρ is the optimized certainty equivalent corresponding to a function φ, the induced divergence is the φ^*-divergence
α(ν|µ) = ∫ φ^*( dν/dµ ) dµ, for ν ≪ µ, and α(ν|µ) = ∞ otherwise.
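On a finite space the φ^*-divergence is a simple sum, and for the exponential choice φ(x) = e^x − 1 one has φ^*(y) = y log y − y + 1, so the induced divergence coincides with relative entropy. A small Python check (the two distributions below are arbitrary illustrative choices):

```python
import math

mu = [0.2, 0.3, 0.5]
nu = [0.4, 0.4, 0.2]  # nu << mu automatically, since mu charges every point

def kl(nu, mu):
    # Relative entropy: sum_i nu_i log(nu_i / mu_i), with 0 log 0 = 0.
    return sum(n * math.log(n / m) for n, m in zip(nu, mu) if n > 0)

def f_divergence(g, nu, mu):
    # phi*-divergence: integral of g(d nu / d mu) with respect to mu.
    return sum(m * g(n / m) for n, m in zip(nu, mu))

def phi_star(y):
    # Convex conjugate of phi(x) = e^x - 1 is phi*(y) = y log y - y + 1 (y > 0),
    # with phi*(0) = 1.
    return y * math.log(y) - y + 1.0 if y > 0 else 1.0

d1 = kl(nu, mu)
d2 = f_divergence(phi_star, nu, mu)
```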
We call such a functional a divergence, and we show that to any divergence there corresponds a unique law invariant risk measure defined on the original space (Ω, F, P); we prove this by showing the definitions
ρ(f(X)) := ρ_µ(f) := sup_{ν ∈ P(E)} ( ∫_E f dν − α(ν|µ) )
to be consistent in the sense of (1.1), where E is a Polish space, f ∈ B(E), and µ = P • X^{-1} for some X : Ω → E. The property (4) corresponds exactly to the consistency property (1.1) and is known as the data processing inequality in information theory, at least when α is the usual relative entropy. The primary focus of the paper is on the characterization of properties of divergences related to the well known chain rule for relative entropy, which reads
H( ν(dx)K^ν_x(dy) | µ(dx)K^µ_x(dy) ) = H(ν|µ) + ∫ ν(dx) H(K^ν_x | K^µ_x)
and holds for all (disintegrated) probability measures µ(dx)K^µ_x(dy) and ν(dx)K^ν_x(dy) on the product of two Polish spaces. More generally, we say a divergence α is superadditive if
α( ν(dx)K^ν_x(dy) | µ(dx)K^µ_x(dy) ) ≥ α(ν|µ) + ∫ ν(dx) α(K^ν_x | K^µ_x), (1.2)
and we say α is subadditive if the reverse inequality holds. We characterize this in terms of various properties of the corresponding risk measure ρ.
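The chain rule for relative entropy is easy to verify on a finite product space. The following Python sketch (the marginals and kernels are arbitrary illustrative choices) checks it, and in particular exhibits the superadditivity inequality (1.2) as an equality:

```python
import math

def kl(a, b):
    # Relative entropy of finite distributions a << b.
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

mu, nu = [0.6, 0.4], [0.3, 0.7]   # first marginals
Kmu = [[0.5, 0.5], [0.2, 0.8]]    # kernel under the reference measure mu
Knu = [[0.7, 0.3], [0.4, 0.6]]    # kernel under nu (all values illustrative)

# Disintegrated joint laws mu(dx)K^mu_x(dy) and nu(dx)K^nu_x(dy).
joint_mu = [mu[x] * Kmu[x][y] for x in range(2) for y in range(2)]
joint_nu = [nu[x] * Knu[x][y] for x in range(2) for y in range(2)]

lhs = kl(joint_nu, joint_mu)
rhs = kl(nu, mu) + sum(nu[x] * kl(Knu[x], Kmu[x]) for x in range(2))
```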
The original motivation for this study comes from an ongoing investigation into the tensorization properties of concentration inequalities of the form
ρ(λX) ≤ γ(λ), for all λ ≥ 0, (1.3)
where γ : [0, ∞) → [0, ∞]. In a follow-up paper [28], we study concentration inequalities (1.3) in connection with liquidity risk. When ρ is the entropic risk measure, the inequality (1.3) is simply a bound on the moment generating function of X. Tensorization in this context roughly means bounding (ρ(λh(X, Y)))_{λ≥0} in terms of bounds on (ρ(λf(X)))_{λ≥0} and (ρ(λg(Y)))_{λ≥0}, for two given (typically independent) random variables X and Y and various (classes of) functions f, g, h. Tensorization properties are typically proven using the chain rule (see [20] for details, particularly Proposition 1 thereof), so we seek alternatives to the chain rule in order to understand how to extend these ideas to general concentration inequalities of the form (1.3). It turns out that the dual form of superadditivity (1.2) is a so-called time-consistency property of the corresponding risk measure ρ, which we describe by building on a construction of Weber [35]: Define a functional ρ̂ on P(R) by ρ̂(P • X^{-1}) = ρ(X), which is of course well defined thanks to law invariance. For any σ-field G ⊂ F in Ω and any X ∈ L^∞, consider the G-measurable random variable
ρ(X|G) := ρ̂( P(X ∈ · | G) ),
where P(X ∈ · | G) denotes a regular conditional law of X given G. We say ρ is acceptance consistent if ρ(X) ≤ ρ(ρ(X|G)) for every X ∈ L^∞ and any σ-field G ⊂ F. If the reverse inequality holds, we say ρ is rejection consistent. If ρ is both acceptance and rejection consistent, we say it is time consistent.
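For the entropic risk measure, the inequality defining acceptance consistency is in fact an equality (a tower property). A Python sketch on a two-step finite model, with G generated by the first coordinate; all numbers are illustrative assumptions:

```python
import math

eta = 2.0
px = [0.5, 0.5]                        # law of the first coordinate (generates G)
py_given_x = [[0.3, 0.7], [0.6, 0.4]]  # conditional law of the second coordinate
f = [[1.0, -0.5], [0.2, 2.0]]          # X = f(first, second)

def ent(vals, probs):
    # Entropic risk of a finite law.
    return math.log(sum(p * math.exp(eta * v) for v, p in zip(vals, probs))) / eta

# rho(X|G): apply rho-hat to the conditional law of X given the first coordinate.
cond = [ent(f[x], py_given_x[x]) for x in range(2)]

# rho(X) on the joint space.
joint_vals = [f[x][y] for x in range(2) for y in range(2)]
joint_probs = [px[x] * py_given_x[x][y] for x in range(2) for y in range(2)]
rho_X = ent(joint_vals, joint_probs)

# Tower property: rho(rho(X|G)) equals rho(X) for the entropic risk measure.
rho_tower = ent(cond, px)
```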
We show that acceptance consistency of ρ is essentially equivalent to the superadditivity of the induced divergence α, and we provide an additional characterization in terms of a property of the measure acceptance set
A := {P • X^{-1} : X ∈ L^∞, ρ(X) ≤ 0}.
These various characterizations are put to use to find those functions ℓ and φ for which the corresponding shortfall risk measures and optimized certainty equivalents are acceptance consistent. It follows from the results of Kupper and Schachermayer [26] that the entropic risk measure is essentially the only time consistent risk measure, and as a corollary we find that the relative entropy is the only divergence (up to a scalar multiple) satisfying the chain rule.^1 Ultimately, we find that not many law invariant risk measures are acceptance consistent (or rejection consistent) other than the entropic one, or modest perturbations thereof. In other words, not many divergences beyond relative entropy are superadditive. Although our results are somewhat negative in this sense, the construction and characterization of divergences induced by risk measures is interesting in its own right, and they appear to be useful tools in the study of law invariant risk measures. Moreover, we find some value in understanding the limitations of our divergences in the applications discussed above.
We also briefly revisit the related results of Weber [35]. Say that ρ is weakly acceptance consistent if ρ(X) ≤ 0 whenever ρ(X|G) ≤ 0 a.s., for X ∈ L^∞ and σ-fields G ⊂ F. Weber showed that this is essentially equivalent to the convexity of the measure acceptance set A. We show that weak acceptance consistency is also equivalent to an inequality weaker than superadditivity. Time consistency properties of dynamic risk measures have by now been studied thoroughly [30,11,18,13,34,8]. The nice survey of Acciaio and Penner [1] will be a useful reference, although we will mostly work with the type of dynamic law invariant risk measures constructed by Weber in [35]. With this rich literature in mind, the most novel of our results on time consistency is the characterization of acceptance consistency in terms of the shift-convexity of the measure acceptance set, which nicely complements Weber's result on weak acceptance consistency. In principle, our characterization in terms of superadditivity (1.2) could be deduced from results in [1], but this is non-trivial: The key difference is that previous papers on the subject (including [1]) use essential suprema to define the minimal penalty function of a conditional risk measure. We work purely with pointwise definitions, and while this distinction is largely technical, there is a non-trivial gap between the two stemming from a delicate measurable selection argument. See Section 3.4 for details.
The above results must be qualified: the equivalence of superadditivity and acceptance consistency is only proven under the additional assumption that the divergence α is simplified, in the sense that
α(ν|µ) = sup_{f ∈ C([0,1])} ( ∫_{[0,1]} f dν − ρ_µ(f) )
for each µ, ν ∈ P([0, 1]). This additional assumption is admittedly somewhat of a nuisance, and it is unclear if the main result on superadditivity holds without it. While we did not identify a nice dual characterization for this condition, we have identified a common stronger condition: Namely, α is simplified as soon as ρ is Lebesgue continuous, in the sense that ρ(X_n) → ρ(X) whenever the X_n are uniformly bounded and X_n → X a.s. This is a strong assumption, but it indeed holds for our main examples of shortfall risk measures and optimized certainty equivalents. Moreover, we show that Lebesgue continuity of ρ is actually equivalent to joint lower semicontinuity of the induced divergence α (with respect to weak convergence).

^1 We make no attempt to reconcile our characterization of relative entropy with the many already present in the literature (see the survey of Csiszár [9]), but we can at least say with confidence that the techniques by which we obtained it are new, notably avoiding functional equations.
Finally, in Section 4 we study miscellaneous properties of divergences. First, joint convexity of α is shown to be equivalent to the concavity of ρ on the level of probability measures (i.e., concavity of ρ̂ defined above), a property studied in some detail by Acciaio and Svindland [2] which holds for every optimized certainty equivalent. Second, as an interesting decision-theoretic consequence of the defining property (4) of divergences, note that if T : E → F is measurable then α(ν • T^{-1} | µ • T^{-1}) ≤ α(ν|µ); we show that equality holds if T is a sufficient statistic for {µ, ν}. Lastly, we show that simplified divergences can be approximated in a sense by their projections on finite sets.
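The data processing inequality is easy to observe numerically for relative entropy: merging states can only decrease the divergence. A Python sketch with an arbitrary three-point example and a non-injective map T:

```python
import math

def kl(a, b):
    # Relative entropy of finite distributions a << b.
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

mu = [0.2, 0.3, 0.5]
nu = [0.5, 0.3, 0.2]   # illustrative choices
T = [0, 0, 1]          # a measurable map merging the first two points

def pushforward(p):
    # Image measure p . T^{-1} on the two-point target space.
    out = [0.0, 0.0]
    for i, t in enumerate(T):
        out[t] += p[i]
    return out

d_pushed = kl(pushforward(nu), pushforward(mu))
d_full = kl(nu, mu)
```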
The paper is organized as follows. Section 2 reviews the basic definitions and duality results of law invariant risk measures before introducing divergences and studying their most essential properties. The main characterization of divergences in terms of law invariant risk measures is given by Proposition 2.6 and Theorem 2.7. Section 2.2 then introduces the concept of a simplified divergence, and we clarify the connection between continuity properties of a risk measure and joint lower semicontinuity of the induced divergence, which provides an important class of examples. This is a useful preparatory step for Section 3, which turns to time consistency and superadditivity. The main Theorem 3.5 characterizes time consistency properties of a law invariant risk measure in terms of both the additivity properties of the induced divergence and what we call the shift-convexity of its measure acceptance set. Section 4 studies additional results pertaining to convexity and some more information-theoretic uses for divergences; it should be noted that this section is completely independent of Section 3. Finally, Section 5 applies the theory to the examples of shortfall risk measures and optimized certainty equivalents. The short appendix is devoted to the proof of a technical lemma.

Risk measures and divergences
First, let us fix some notation. Throughout the paper, (Ω, F, P) is a fixed probability space, which we assume is a nonatomic standard Borel space. Abbreviate L^p = L^p(Ω, F, P) as usual for the set of (equivalence classes of) p-integrable real-valued measurable functions on Ω. Let P(Ω) denote the set of probability measures on (Ω, F), and let P_P(Ω) denote the subset consisting of those measures which are absolutely continuous with respect to P. As stated in the introduction, a risk measure to us is a convex nondecreasing (with respect to a.s. order) functional ρ : L^∞ → R satisfying ρ(0) = 0 and ρ(X + c) = ρ(X) + c for all X ∈ L^∞, c ∈ R. Note again that this is somewhat different from the standard definition, in which ρ is instead nonincreasing [16]. Law invariant risk measures possess some nice additional structure, highlighted in particular by the results of [22] and [12], though we will not need the latter.
Theorem 2.1 (Theorem 2.1 of [22]). Every law invariant risk measure ρ satisfies the Fatou property, which means that whenever X_n ∈ L^∞ are uniformly bounded and converge a.s. to X ∈ L^∞, then ρ(X) ≤ lim inf_{n→∞} ρ(X_n).
Let us recall a classical duality result, though the details of our presentation are somewhat unusual: We say a function α : P_P(Ω) → [0, ∞] is a penalty function for ρ if
ρ(X) = sup_{Q ∈ P_P(Ω)} ( E_Q[X] − α(Q) ), for all X ∈ L^∞. (2.1)
(Note that the supremum involves only countably additive measures, and we will make no mention of finite additivity.) Here E_Q denotes expectation with respect to the probability Q. Expectation under the reference measure P is simply denoted E, and integrals on spaces other than Ω are written in a more explicit measure-theoretic notation.
Theorem 2.2. If ρ is a risk measure with the Fatou property, then
α(Q) := sup_{X ∈ L^∞} ( E_Q[X] − ρ(X) ), Q ∈ P_P(Ω), (2.2)
is a penalty function for ρ. In fact, it is the minimal penalty function, in the sense that any other penalty function α′ for ρ satisfies α ≤ α′.
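On a finite probability space the minimal penalty function of the entropic risk measure can be computed explicitly: the supremum in the dual representation is attained by a Gibbs change of measure, and α(Q) = η^{-1} H(Q|P). A Python sketch (the space, η, and X below are our own illustrative choices):

```python
import math, random

random.seed(0)
eta = 1.5
P = [0.25, 0.25, 0.5]   # reference measure
X = [0.3, -1.0, 1.2]    # a bounded random variable

def entropic():
    return math.log(sum(p * math.exp(eta * x) for p, x in zip(P, X))) / eta

def kl(Q):
    # Relative entropy H(Q|P).
    return sum(q * math.log(q / p) for q, p in zip(Q, P) if q > 0)

def dual_value(Q):
    # E_Q[X] - alpha(Q), with alpha(Q) = (1/eta) H(Q|P) for the entropic case.
    return sum(q * x for q, x in zip(Q, X)) - kl(Q) / eta

# The supremum over Q is attained at the Gibbs measure q_i ~ p_i exp(eta x_i).
Z = sum(p * math.exp(eta * x) for p, x in zip(P, X))
gibbs = [p * math.exp(eta * x) / Z for p, x in zip(P, X)]
rho = entropic()
```

At the Gibbs measure the dual value equals ρ(X); any other Q gives a smaller value.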
Note that the dual representation (2.2) implies that the minimal penalty function is convex and lower semicontinuous with respect to the total variation topology, as well as the weaker topology σ(P_P(Ω), L^∞).^2 There is an alternative dual representation more specific to law invariant risk measures, due to Kusuoka [27] and extended in [19,22], but we will make no use of it.
Remark 2.3. Note that we may afford to be lazy about the fact that ρ is to be evaluated at equivalence classes, i.e., elements of L^∞, as opposed to specific measurable functions. For a risk measure ρ, we may define ρ(X) := ρ([X]) in the obvious way for a measurable function X : Ω → R by finding the equivalence class [X] ∈ L^∞ to which X belongs. With this in mind, we may then define α(Q) := ∞ for Q ∈ P(Ω) which are not absolutely continuous with respect to P, and then the dual formula (2.1) may be rewritten for bounded measurable functions X : Ω → R.
As with many properties of convex risk measures, law invariance may be alternatively characterized by a property of the minimal penalty function, and this will be a building block for a more general discussion in the next section. This characterization appears to be new, although a very similar result appeared in [32, Proposition 2]; see also [22, Lemma A.4].
Proposition 2.4. Suppose a risk measure ρ on L ∞ has the Fatou property (see Theorem 2.1). Then ρ is law-invariant if and only if it has a penalty function α satisfying α(Q • T −1 ) ≤ α(Q) for every Q ∈ P P (Ω) and for every measurable T : Ω → Ω satisfying P • T −1 = P .
Proof. First, assume ρ is law invariant, and let α be its minimal penalty function provided by Theorem 2.2. Let T : Ω → Ω be a measurable map satisfying P • T^{-1} = P. Then X • T and X have the same law, and thus ρ(X) = ρ(X • T) for every X ∈ L^∞. Hence
α(Q • T^{-1}) = sup_{X ∈ L^∞} ( E_{Q•T^{-1}}[X] − ρ(X) ) = sup_{X ∈ L^∞} ( E_Q[X • T] − ρ(X • T) ) ≤ sup_{Y ∈ L^∞} ( E_Q[Y] − ρ(Y) ) = α(Q).
To prove the converse, fix X, Y ∈ L^∞ with the same law. By [23, Corollary 6.11] (since the probability space is nonatomic) we may find a measurable map T : Ω → Ω such that P • T^{-1} = P and P(X = Y • T) = 1. Then
ρ(X) = sup_{Q ∈ P_P(Ω)} ( E_Q[Y • T] − α(Q) ) = sup_{Q} ( E_{Q•T^{-1}}[Y] − α(Q) ) ≤ sup_{Q} ( E_{Q•T^{-1}}[Y] − α(Q • T^{-1}) ) ≤ sup_{Q ∈ P_P(Ω)} ( E_Q[Y] − α(Q) ) = ρ(Y).

^2 As usual, when F is a set of real-valued functions on a set E, the notation σ(E, F) refers to the coarsest topology on E rendering the elements of F continuous.
Reversing the roles of X and Y completes the proof.
Remark 2.5. From the proof of Proposition 2.4, it should be clear that the assumption that ρ has the Fatou property is not needed. We state only this simpler form in order to avoid introducing additional terminology, and to avoid dwelling on details involving finitely additive measures.
2.1. Divergences and their characterization. Let us now exploit law invariance to construct a corresponding family of risk measures and what we refer to as divergences. Fix for the rest of this section a law-invariant risk measure ρ. Given a Polish space E, let P(E) denote the set of Borel probability measures on E. We write ν ≪ µ to mean ν is absolutely continuous with respect to µ. Given any µ ∈ P(E), write P µ (E) := {ν ∈ P(E) : ν ≪ µ}. Define also C(E), C b (E), and B(E) to be the sets of continuous, bounded continuous, and bounded measurable functions on E, respectively. The space P(E) is endowed with the σ-field generated by the maps µ → µ(A), where A ⊂ E is Borel; this equals the Borel σ-field generated by the topology of weak convergence, i.e., σ(P(E), C b (E)).
Given a Polish space E and µ ∈ P(E), we may find (because Ω is nonatomic) a measurable function X : Ω → E such that P • X^{-1} = µ. We may then define a (law invariant) risk measure ρ_µ on L^∞(E, µ) by
ρ_µ(f) := ρ(f(X)).
Note that by law invariance this definition does not depend on the choice of X, as long as P • X^{-1} = µ. We call (ρ_µ)_{µ,E} the family of risk measures induced by ρ. This family of risk measures satisfies a consistency property, namely
ρ_µ(f) = ρ_ν(g) whenever µ • f^{-1} = ν • g^{-1}. (2.3)
In particular, for any measurable map T from one Polish space E to another F, we have ρ_{µ•T^{-1}}(f) = ρ_µ(f • T) for f ∈ B(F). The same construction is valid when E is any standard Borel space, but for simplicity we stick with Polish spaces. The minimal penalty function of ρ_µ is denoted α(·|µ):
α(ν|µ) := sup_{f ∈ B(E)} ( ∫_E f dν − ρ_µ(f) ), ν ∈ P_µ(E).
Extend α(·|µ) to all of P(E) by setting α(ν|µ) = ∞ whenever ν is not absolutely continuous with respect to µ. Then, for f ∈ B(E),
ρ_µ(f) = sup_{ν ∈ P(E)} ( ∫_E f dν − α(ν|µ) ), (2.4)
and it is easy to check that (2.4) remains valid when the supremum includes ν ∈ P(E)\P_µ(E). (As in Remark 2.3, let us not be overly careful about distinguishing between measurable functions and equivalence classes thereof.) We refer to α(·|·) as the divergence induced by ρ. Note that α(·|·) is defined for pairs of probability measures on any Polish space. Additionally, α(·|µ) is always convex and lower semicontinuous with respect to total variation, and also with respect to the topology σ(P(E), B(E)). An alternative expression for the divergence induced by ρ is through the measure acceptance set
A := {P • X^{-1} : X ∈ L^∞, ρ(X) ≤ 0}.
Indeed, we may then write
α(ν|µ) = sup{ ∫_E f dν : f ∈ B(E), µ • f^{-1} ∈ A }.
Divergences satisfy a consistency property related to (2.3), the statement of which requires some notation involving kernels. Given Polish spaces E and F, a kernel from E to F is a measurable function E ∋ x ↦ K_x ∈ P(F). Given µ ∈ P(E), write µK := ∫_E µ(dx) K_x(·) for the mean measure in P(F), i.e., µK(B) = ∫_E µ(dx) K_x(B) for Borel sets B ⊂ F. For f ∈ B(F), define Kf ∈ B(E) by Kf(x) := ∫_F f dK_x.
Proposition 2.6. Let ρ be a law invariant risk measure with divergence α. If E and F are Polish spaces and K is a kernel from E to F, then
α(νK | µK) ≤ α(ν|µ), for all µ, ν ∈ P(E). (2.5)
In particular, α(ν • T^{-1} | µ • T^{-1}) ≤ α(ν|µ) for any measurable T : E → F, and equality holds if T is bijective with measurable inverse.
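The kernel notation, and the comparison ρ_{µK}(f) ≥ ρ_µ(Kf) used in the proof of Proposition 2.6, can be illustrated for the entropic risk measure on finite spaces (all numbers below are arbitrary illustrative choices):

```python
import math

eta = 1.0
mu = [0.4, 0.6]               # mu in P(E), with E a two-point space
K = [[0.5, 0.5], [0.1, 0.9]]  # kernel from E to a two-point space F
f = [0.0, 2.0]                # f in B(F)

def ent(vals, probs):
    # Entropic risk of a finite law.
    return math.log(sum(p * math.exp(eta * v) for p, v in zip(probs, vals))) / eta

# Mean measure muK in P(F): (muK)(y) = sum_x mu(x) K_x(y).
muK = [sum(mu[x] * K[x][y] for x in range(2)) for y in range(2)]

# Kf in B(E): Kf(x) = integral of f against K_x.
Kf = [sum(K[x][y] * f[y] for y in range(2)) for x in range(2)]

lhs = ent(f, muK)   # rho_{muK}(f)
rhs = ent(Kf, mu)   # rho_mu(Kf); Jensen's inequality forces rhs <= lhs
```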
Proof. Note that the second claim follows from the first by setting K(x, dy) = δ_{T(x)}(dy). Jensen's inequality shows easily that µ • (Kf)^{-1} is dominated by (µK) • f^{-1} in convex order. It is well known that (normalized) law invariant risk measures are increasing with respect to convex order, e.g. by [16, Corollary 4.65], and thus ρ_{µK}(f) ≥ ρ_µ(Kf). Then
α(νK | µK) = sup_{f ∈ B(F)} ( ∫_F f d(νK) − ρ_{µK}(f) ) ≤ sup_{f ∈ B(F)} ( ∫_E Kf dν − ρ_µ(Kf) ) ≤ α(ν|µ).
In fact, the inequality (2.5) is enough to reconstruct from α the original family of risk measures. This is made precise in the following:
Theorem 2.7. Suppose we are given a family of functions P(E) ∋ ν ↦ α(ν|µ) ∈ [0, ∞], for each Polish space E and each µ ∈ P(E), and suppose the following conditions hold:
(1) α(µ|µ) = 0 for every µ.
(2) α(ν|µ) = ∞ whenever ν is not absolutely continuous with respect to µ.
(3) α(νK | µK) ≤ α(ν|µ) for every kernel K between Polish spaces.
For each Polish space E and each µ ∈ P(E), define
ρ_µ(f) := sup_{ν ∈ P(E)} ( ∫_E f dν − α(ν|µ) ), f ∈ B(E). (2.6)
Then each ρ_µ is a law invariant risk measure. Moreover, for any Polish spaces F and G, any µ ∈ P(F) and ν ∈ P(G), and any f ∈ B(F) and g ∈ B(G) with µ • f^{-1} = ν • g^{-1}, we have ρ_µ(f) = ρ_ν(g).
Proof. It is immediate from the definition that ρ_µ is a risk measure. Indeed, since α(µ|µ) = 0 and α(ν|µ) ≥ 0 for all ν, we have ρ_µ(0) = 0. Theorem 4.33 of [16] shows that ρ_µ satisfies the Fatou property, since the supremum in its definition includes only countably additive measures. For a fixed µ, we deduce from property (3) and Proposition 2.4 that ρ_µ is law invariant.
It remains to prove the last claim. Suppose for the moment that we can find a kernel K from F to G such that µK = ν and µ(Kg = f) = 1. Then
ρ_µ(f) = sup_{η ∈ P(F)} ( ∫_G g d(ηK) − α(η|µ) ) ≤ sup_{η ∈ P(F)} ( ρ_ν(g) + α(ηK|ν) − α(η|µ) ) ≤ ρ_ν(g).
Indeed, the second inequality follows from the assumption (3). Reversing the roles of f and g completes the proof. To prove the existence of such a kernel, we appeal to a famous theorem of Strassen [33, Theorem 3]: If S(x) is nonempty for each x ∈ F, and if h_φ is measurable, then Strassen's theorem says that there exists a kernel K from F to G satisfying both µK = ν and µ(Kg = f) = 1. Suppose for the moment that S(x) is nonempty for each x ∈ F and that h_φ is measurable, so that we can apply this theorem. Define a new function h_φ on R with the usual convention sup ∅ = −∞. It remains to check the technical points left out above. First note that S(x) is nonempty for µ-almost every x ∈ F. Modify S(x) on a null set in such a way that it is never empty. Next note that h_φ is universally measurable because the graph of S(x) is analytic [7, Proposition 7.47], so we may apply Strassen's theorem by simply replacing the Borel σ-field on F with its universal completion.
With the previous result in mind, it is natural to make the following definition:
Definition 2.8. A divergence is a family of convex lower semicontinuous (with respect to total variation) functions P(E) ∋ ν ↦ α(ν|µ) ∈ [0, ∞], for each Polish space E and each µ ∈ P(E), satisfying properties (1-3) of Theorem 2.7. Given a divergence α, the corresponding (or induced) family of risk measures is the family (ρ_µ)_{µ,E} defined by (2.6). The corresponding (or induced) risk measure is the risk measure ρ̂ defined on L^∞ = L^∞(Ω, F, P) by ρ̂(X) := ρ_{P•X^{-1}}(id), where id denotes the identity map on R. Thanks to Theorem 2.7, ρ̂ is well defined. It is straightforward to check that its induced divergence is exactly α, and also that ρ̂_µ = ρ_µ for each Polish space E and µ ∈ P(E).

2.2. Simplified divergences. An important property of relative entropy is that its dual formula can be reduced to a supremum over continuous functions: For a Polish space E and for µ, ν ∈ P(E),
H(ν|µ) = sup_{f ∈ C_b(E)} ( ∫_E f dν − log ∫_E e^f dµ ).
For our characterization of superadditivity of divergences in Section 3, it will be important for us to have a similar result for general divergences. Such a simplification is not always possible, so we make a definition:
Definition 2.9. The divergence α induced by a law invariant risk measure ρ is said to be simplified if for every µ, ν ∈ P([0, 1]) we have
α(ν|µ) = sup_{f ∈ C([0,1])} ( ∫_{[0,1]} f dν − ρ_µ(f) ).
Equivalently, for each µ ∈ P([0, 1]), the map α(·|µ) is weakly lower semicontinuous, where "weakly" refers to the usual weak convergence topology σ(P(E), C_b(E)).
An important reason for this definition is the following measurability result, which we are unfortunately unable to prove without the additional assumption.
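For relative entropy, the displayed dual formula is the Donsker–Varadhan variational formula, whose supremum is attained at f = log(dν/dµ). A quick finite-space Python check (the two distributions are arbitrary illustrative choices):

```python
import math, random

random.seed(1)
mu = [0.3, 0.3, 0.4]
nu = [0.5, 0.2, 0.3]

# Relative entropy H(nu|mu).
H = sum(n * math.log(n / m) for n, m in zip(nu, mu))

def dv(f):
    # Donsker-Varadhan functional: int f dnu - log int e^f dmu.
    return (sum(n * x for n, x in zip(nu, f))
            - math.log(sum(m * math.exp(x) for m, x in zip(mu, f))))

# The supremum over continuous f is attained at f* = log(d nu / d mu).
f_star = [math.log(n / m) for n, m in zip(nu, mu)]
```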
Lemma 2.10. Every simplified divergence α is jointly measurable, in the sense that for any fixed Polish space E the function α(·|·) is jointly measurable on P(E) × P(E) (with respect to the Borel σ-field generated by the topology of weak convergence).
2.3. Semicontinuity of divergences. The rest of the section studies lower semicontinuity properties of α, in part for their intrinsic interest, and in part to obtain a tractable condition that will allow us to verify that all of the examples of divergences we discuss in Section 5 are indeed simplified. We did not find a good characterization of simplified divergences on the dual side, i.e., in terms of ρ, but the following partial results shed some light on the condition nonetheless. We know that for any divergence α, the map α(·|µ) is lower semicontinuous with respect to the topology σ(P(E), B(E)) for any fixed µ and E. We will see later in Proposition 2.15 that in fact α(·|·) is jointly lower semicontinuous with respect to the same topology. On the other hand, relative entropy is known to be jointly lower semicontinuous with respect to weak convergence, and we first characterize those divergences which share this property. For this it helps to make two definitions, the second of which is well known:
Definition 2.11. We say a divergence α is jointly weakly lower semicontinuous if, for each Polish space E, the map α(·|·) is lower semicontinuous on P(E) × P(E) with respect to the topology of weak convergence, i.e., equipping P(E) with the topology σ(P(E), C_b(E)).
Definition 2.12. We say a risk measure ρ is Lebesgue continuous if whenever X_n ∈ L^∞ is a uniformly bounded sequence with X_n → X a.s. we have ρ(X_n) → ρ(X). This is equivalent to the seemingly weaker condition that whenever X_n, X ∈ L^∞ with X_n ↓ X a.s. we have ρ(X_n) ↓ ρ(X).
The main result of this section is the following, and the proof is preceded by a preparatory lemma:
Theorem 2.13. Let ρ be a law invariant risk measure with induced divergence α. The following are equivalent: (1) α is jointly weakly lower semicontinuous.
If these conditions hold, then in particular α is simplified.
Lemma 2.14. Let ρ be a law invariant risk measure. Fix a Polish space E and a function f ∈ B(E). Then the map P(E) ∋ µ ↦ ρ_µ(f) is measurable. If moreover f is bounded and continuous, then this map is lower semicontinuous with respect to weak convergence.
Proof. First we prove the second claim. Let µ_n → µ in P(E). By Skorohod representation, we may find E-valued random variables X, X_n defined on Ω with P • X^{-1} = µ, P • X_n^{-1} = µ_n, and X_n → X almost surely. Then f(X_n) → f(X) almost surely since f is continuous, and the sequence f(X_n) is uniformly bounded. Thus, the Fatou property (Theorem 2.1) implies ρ_µ(f) = ρ(f(X)) ≤ lim inf_{n→∞} ρ(f(X_n)) = lim inf_{n→∞} ρ_{µ_n}(f).
Proof of Theorem 2.13. (1 ⇒ 2) Suppose first that E is compact. Let µ_n → µ in P(E). We know from Lemma 2.14 that ρ_µ(f) ≤ lim inf_{n→∞} ρ_{µ_n}(f), so we show upper semicontinuity. Let ǫ > 0, and find for each n some ν_n ∈ P(E) satisfying
∫_E f dν_n − α(ν_n|µ_n) ≥ ρ_{µ_n}(f) − ǫ.
Since E is compact, every subsequence admits a further subsequence {n_k} such that ν_{n_k} → ν for some ν ∈ P(E), and lower semicontinuity of α implies
lim sup_{k→∞} ρ_{µ_{n_k}}(f) ≤ ǫ + ∫_E f dν − α(ν|µ) ≤ ǫ + ρ_µ(f).
This shows lim sup_{n→∞} ρ_{µ_n}(f) ≤ ρ_µ(f). Finally, if E is not necessarily compact, find M > 0 such that µ_n(|f| ≤ M) = 1 for each n. Since [−M, M] is compact, the previous result applies.
(3 ⇒ 4) Let X, X_n ∈ L^∞ be uniformly bounded with X_n → X a.s. Find M > 0 such that |X_n| ≤ M a.s. for all n. Let E denote the (complete separable) metric space of convergent sequences with values in [−M, M], endowed with the supremum metric, and define µ on E as the law of the sequence (X_1, X_2, . . .). Let f_n(x) = x_n denote the coordinate maps, and let f(x) = lim_{n→∞} x_n, for x = (x_1, x_2, . . .) ∈ E. Then f and the f_n are uniformly bounded and continuous, with f_n → f pointwise by construction.
(4 ⇒ 5) Clearly one inequality holds. To show the reverse inequality, fix ǫ > 0 and find f ∈ B(E) such that
∫_E f dν − ρ_µ(f) ≥ α(ν|µ) − ǫ.
Find a bounded sequence f_n of continuous functions with f_n → f a.s. Then, using (3) and the bounded convergence theorem, we get the desired inequality. Let λ denote Lebesgue measure on [0, 1], and let q_n and q denote the quantile functions corresponding to µ_n • f^{-1} and µ • f^{-1}, respectively, so that µ_n • f^{-1} = λ • q_n^{-1} and µ • f^{-1} = λ • q^{-1}. Then the q_n are uniformly essentially bounded with q_n → q λ-a.s., and law invariance yields ρ_{µ_n}(f) = ρ_λ(q_n) → ρ_λ(q) = ρ_µ(f).
As announced before, there are some additional continuity properties of potential interest, although we shall not use these in the sequel. Note that Lemma 2.10 does not follow from the following Proposition 2.15, because the Borel σ-field of σ(P(E), B(E)) is typically strictly larger than the Borel σ-field of the topology of weak convergence. Proposition 2.15. Suppose ρ is a law invariant risk measure with induced divergence α. If P(E) is endowed with the topology σ(P(E), B(E)), then the map µ → ρ µ (f ) is continuous for every f ∈ B(E), and α(·|·) is lower semicontinuous with respect to the product topology on P(E) × P(E).

Acceptance consistency and superadditivity
As was first observed by Weber [35], a law invariant risk measure naturally gives rise to a dynamic risk measure on any (nice enough) filtered probability space. We will use the same construction: Define ρ̂ again by ρ̂(P • X^{-1}) = ρ(X), which makes sense thanks to law invariance. Using our previous notation, note that ρ̂(m) = ρ_m(id), where id denotes the identity map on R. We may then define, for any σ-field G ⊂ F in Ω and any X ∈ L^∞,
ρ(X|G) := ρ̂( P(X ∈ · | G) ).
Note that a regular conditional law of X given G exists because Ω is standard. Lemma 2.14 ensures that ρ(X|G) is a G-measurable random variable, defined uniquely up to a.s. equality. Similarly, for a random variable Y, write ρ(X|Y) := ρ(X|σ(Y)). Note that if Y ∈ L^∞ is G-measurable then ρ(X + Y|G) = ρ(X|G) + Y a.s. for any X ∈ L^∞. If X and Y are independent, then it is straightforward to check that ρ(f(X, Y)|Y) = g(Y) a.s., where g(y) := ρ(f(X, y)), for bounded measurable f. We are nearly ready to define the type of time-consistency we investigate.
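These facts can be combined for the entropic risk measure: conditioning X + Y on an independent Y gives ρ(X) + Y, and applying ρ again recovers ρ(X + Y), since the entropic risk measure is time consistent and additive over independent sums. A Python sketch (the laws below are arbitrary illustrative choices, with η = 1):

```python
import math

eta = 1.0
px, xs = [0.5, 0.5], [0.0, 1.0]    # law of X
py, ys = [0.3, 0.7], [-1.0, 2.0]   # law of Y, independent of X

def ent(vals, probs):
    # Entropic risk of a finite law.
    return math.log(sum(p * math.exp(eta * v) for p, v in zip(probs, vals))) / eta

# By independence and cash additivity, rho(X + Y | Y) = rho(X) + Y.
rho_X = ent(xs, px)
cond = [rho_X + y for y in ys]

lhs = ent(cond, py)  # rho(rho(X + Y | Y))
vals = [x + y for x in xs for y in ys]
probs = [p * q for p in px for q in py]
rhs = ent(vals, probs)  # rho(X + Y) computed on the product space
```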
Definition 3.1. We say ρ is acceptance consistent if
ρ(X) ≤ ρ(ρ(X|G))
for all sub-σ-fields G ⊂ F and all X ∈ L^∞. If the inequality is reversed, we say ρ is rejection consistent. We say ρ is time consistent if it is both acceptance and rejection consistent.

Remark 3.2.
This definition begins to look more like the one appearing in the literature (see [1]) once it is applied inductively. Let (F_t)_{t≥0} denote any filtration on Ω, with F_t ⊂ F for all t. Indeed, (ρ(·|F_t))_{t≥0} is a dynamic risk measure in the sense of [1]. If ρ is acceptance consistent and X ∈ L^∞, then it is straightforward to check that ρ(X|F_s) ≤ ρ(ρ(X|F_t)|F_s) a.s. for 0 ≤ s ≤ t.
3.1. Superadditivity and shift-convexity. Let us give names to certain divergence inequalities resembling the chain rule of classical relative entropy. Henceforth we will need to assume that our divergences are simplified, as in Definition 2.9. As far as the following definition of superadditivity is concerned, this assumption is merely to ensure that the divergence α(·|·) is jointly measurable, so that the integrals make sense. Later, a technical point in the proof of the main Theorem 3.5 will depend crucially on the divergence being simplified, but the question of whether or not Theorem 3.5 holds in more generality remains open.
Definition 3.3. We say that a divergence α is partially superadditive (resp. partially subadditive) if
α( ν(dx)K^ν_x(dy) | µ_1 × µ_2 ) ≥ α(ν|µ_1) + ∫ ν(dx) α(K^ν_x | µ_2), (resp. ≤)
whenever ν(dx)K^ν_x(dy) and µ_1 × µ_2 are probability measures on the product of two Polish spaces; note that the latter is required to be a product measure. We say a simplified divergence α is (fully) superadditive (resp. subadditive) if
α( ν(dx)K^ν_x(dy) | µ(dx)K^µ_x(dy) ) ≥ α(ν|µ) + ∫ ν(dx) α(K^ν_x | K^µ_x), (resp. ≤)
whenever ν(dx)K^ν_x(dy) and µ(dx)K^µ_x(dy) are probability measures on the product of two Polish spaces.
Note that partial superadditivity, as opposed to full superadditivity, only requires the inequality to hold when the reference measure is a product. It turns out that these conditions are equivalent, although we have only an indirect proof of this fact. As was discussed in the introduction, additivity properties of a divergence α are linked with time consistency and sub-level set properties of its induced risk measure, which we now describe.
(1) The measure acceptance set A of a law invariant risk measure ρ is defined by A := {P • X^{-1} : X ∈ L^∞, ρ(X) ≤ 0}. In words, this is the set of laws of random variables X satisfying ρ(X) ≤ 0.
(2) A set A ⊂ P(R) is shift-convex if for every µ ∈ A, every M > 0, and every measurable map R ∋ x ↦ K_x ∈ A ∩ P[−M, M], the shifted mixture ∫ µ(dx) K_x(· − x) belongs to A.

As was discussed by Weber, the convexity of a measure acceptance set A admits a natural interpretation in terms of so-called compound lotteries: if two outcomes X and Y are acceptable, then convexity of A means that the outcome with law tP ∘ X^{−1} + (1 − t)P ∘ Y^{−1} is also acceptable, for any t ∈ (0, 1). Shift-convexity is open to interpretation on similar grounds: suppose X is an acceptable outcome, and that Y is conditionally acceptable given X. Then shift-convexity means that X + Y is itself acceptable. To see this, in the definition of shift-convexity take µ to be the law of X and K_x to be the conditional law of Y given X = x. In Section 3.6 we will elaborate on interpretations and reformulations of this unusual property. We can now state the main result of this section.
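For a concrete instance of this interpretation, consider the entropic risk measure ρ(X) = η^{−1} log E[e^{ηX}] from the introduction, whose measure acceptance set consists of the laws of bounded X with E[e^{ηX}] ≤ 1. The following short computation (our own, using only the tower property) verifies the shift-convexity heuristic directly:

```latex
% Entropic risk measure: acceptance means E[e^{\eta X}] \le 1.
% Suppose X is acceptable and Y is conditionally acceptable given X,
% i.e. E[e^{\eta Y} \mid X] \le 1 almost surely. Then, by the tower property,
\[
  E\big[e^{\eta(X+Y)}\big]
  = E\big[e^{\eta X}\, E[e^{\eta Y}\mid X]\big]
  \le E\big[e^{\eta X}\big] \le 1,
\]
% so X + Y is acceptable: the measure \int \mu(dx)\, K_x(\cdot - x), with
% \mu the law of X and K_x the conditional law of Y given X = x, lies in A.
```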
Theorem 3.5. Suppose α is a simplified divergence induced by a law invariant risk measure ρ with acceptance set A. The following are equivalent:
(1) ρ is acceptance consistent.
(2) A is shift-convex.
(3) α is superadditive.
(4) α is partially superadditive.
Similarly, the same equivalences hold when "acceptance" is changed to "rejection", "superadditive" is changed to "subadditive", and A is changed to A^c. The equivalence of (1) and (2) holds without the assumption that α is simplified.

3.2. Properties of time consistency. The following lemma shows that acceptance consistency is equivalent to a seemingly weaker statement, which will be easier to connect with shift-convexity.

Lemma 3.7. Let ρ be a law invariant risk measure. Then ρ is acceptance consistent if and only if the following holds: for every pair of independent random variables X, Y with values in some Polish spaces E, F, and for every f ∈ B(E × F), we have ρ(f(X, Y)) ≤ ρ(ρ(f(X, Y) | X)).

Proof. The "only if" direction is immediate. To prove the converse, fix Z ∈ L∞ and a σ-field G ⊂ F. Find Y ∈ L∞ which generates G, for example Y = Σ_{n=1}^∞ 2^{−n} 1_{B_n}, where {B_n} is a countable family of generators of G (recall that our ambient probability space is standard). By [23, Theorem 5.10], we may find independent random variables Ỹ and U as well as a measurable function f such that (Ỹ, f(Ỹ, U)) has the same law as (Y, Z). Then the hypothesis and law invariance imply ρ(Z) = ρ(f(Ỹ, U)) ≤ ρ(ρ(f(Ỹ, U) | Ỹ)). But the conditional law of f(Ỹ, U) given Ỹ is the same as the conditional law of Z given Y, and thus law invariance of ρ implies that ρ(f(Ỹ, U) | Ỹ) and ρ(Z|Y) = ρ(Z|G) have the same law. Using law invariance once more, we conclude that ρ(ρ(f(Ỹ, U) | Ỹ)) = ρ(ρ(Z|G)).
The next lemma rephrases acceptance consistency, in a more measure-theoretic notation which will be useful later.
Proposition 3.8. For a law invariant risk measure ρ, the following are equivalent: (1) ρ is acceptance consistent.
The same equivalences hold for rejection consistency, but with the inequalities reversed.
This alternative description of acceptance consistency will serve us especially well when addressing additivity. For now, we will use it in establishing the connection between acceptance consistency and shift-convexity.

Proof. Let ρ be a law invariant risk measure with measure acceptance set A. First, assume ρ is acceptance consistent. Fix µ ∈ A, M > 0, and a measurable map R ∋ x ↦ K_x ∈ A ∩ P[−M, M]. Set μ̄ = ∫ µ(dx) K_x(dy − x). Letting λ denote Lebesgue measure on [0, 1], we may find (e.g. by [23, Theorem 5.10]) a measurable function f : R × [0, 1] → R such that, if f̄(x, y) := (x, f(x, y)), then λ ∘ f(x, ·)^{−1} = K_x(· − x) for each x. Note that since µ has compact support and K_x ∈ P[−M, M] for all x, it follows that f is essentially bounded with respect to µ × λ. Since also µ ∘ g^{−1} = µ ∈ A, i.e., ρ_µ(g) ≤ 0, acceptance consistency (Proposition 3.8(2)) implies that ρ_{µ×λ}(f) ≤ 0. In other words, (µ × λ) ∘ f^{−1} ∈ A. But this completes the proof of shift-convexity, since (µ × λ) ∘ f^{−1} = ∫ µ(dx) K_x(· − x).

Conversely, assume now that A is shift-convex. Let E and F be Polish spaces, and fix µ_1 ∈ P(E), µ_2 ∈ P(F), f ∈ B(E × F), and g ∈ B(E) with ρ_{µ_1}(g) ≤ 0. Suppose also that In light of Proposition 3.8(4), we must check that ρ_{µ_1×µ_2}(f) ≤ 0, or equivalently that (µ_1 × µ_2) ∘ f^{−1} ∈ A. Set ν := µ_1 ∘ g^{−1}, and note that ν ∈ A. For x ∈ R, define also (The choice of δ_0 is arbitrary, and any other element of A would do.) Then K_x ∈ A for each x, and shift-convexity implies

Finally, before we turn to the proof of Theorem 3.5, we compute a penalty function for the risk measure X ↦ ρ(ρ(X|G)), under no time consistency assumptions. This is related to some results in [1] and [8], to name but a few, but different in the sense that our conditional penalty functions are defined as pointwise suprema as opposed to essential suprema; see Section 3.4 for a discussion of this point.

Proposition 3.10. Let ρ be a law invariant risk measure with induced divergence α, which we assume is simplified.
Let E and F be Polish spaces, and let μ̄ = ∫ µ(dx) K^µ_x(dy) ∈ P(E × F). Let f ∈ B(E × F), and let X denote the identity map on E. Then

Proof. We first compute the relevant conjugate, and then complete the proof by using a well known measurable selection argument [7, Proposition 7.50] to deduce the stated formula, where the supremum is over all kernels from E to F.

3.3. Proof of Theorem 3.5. We saw in Proposition 3.9 that acceptance consistency and shift-convexity are equivalent. We will prove that acceptance consistency implies superadditivity and that partial superadditivity implies acceptance consistency. This is enough, since clearly superadditivity implies partial superadditivity. Fix throughout two Polish spaces E and F and a function f ∈ B(E × F).

First assume α is partially superadditive. Fix μ̄ = µ_1 × µ_2 ∈ P(E × F). Use Proposition 3.10 followed by partial superadditivity to get a chain of (in)equalities ending in ρ_μ̄(f).
Conclude from Proposition 3.8(5) that ρ is acceptance consistent.

Now suppose ρ is acceptance consistent. Let μ̄ = ∫ µ(dx) K^µ_x(dy) ∈ P(E × F). Use Proposition 3.10 followed by Proposition 3.8(3) to obtain the corresponding inequality. On the other hand, according to Lemma 3.11, proven below, and using the definition of α, we get the dual expression. Indeed, in the second line we replaced g(x) by g(x) + ρ_{K^µ_x}(f(x, ·)), and in the final step we replaced f by f + g. This shows that the function of ν̄ = ∫ ν(dx) K^ν_x(dy) ∈ P(E × F) given by

ν̄ ↦ α(ν|µ) + ∫ ν(dx) α(K^ν_x | K^µ_x)

is precisely the minimal penalty function of the risk measure given by (3.1). Since ν̄ ↦ α(ν̄|μ̄) is the minimal penalty function of ρ_μ̄ (see Theorem 2.2), it follows from the order-reversing property of convex conjugation that α(·|μ̄) dominates the minimal penalty function of the risk measure given by (3.1). That is,

α(ν̄|μ̄) ≥ α(ν|µ) + ∫ ν(dx) α(K^ν_x | K^µ_x)

for every ν̄ = ∫ ν(dx) K^ν_x(dy) ∈ P(E × F).

Lemma 3.11 (Nearly Lemma 4 of [1]). For any ν̄ = ∫ ν(dx) K^ν_x(dy) ∈ P(E × F), we have

3.4. Essential suprema. Let us briefly discuss how to connect our results with a more common dual characterization of acceptance consistency in terms of penalty functions, as can be found in [1]. Assume throughout that our divergence α is simplified. Let μ̄ = ∫ µ(dx) K^µ_x(dy) ∈ P(E × F) for Polish spaces E and F. Let (X, Y) denote the identity map (i.e., the coordinate maps) on E × F, and define the filtration (F_0, F_1, F_2) on E × F by letting F_0 be the trivial σ-field, letting F_1 = σ(X), and letting F_2 be the Borel σ-field. Define a dynamic risk measure (ρ_0, ρ_1) on E × F by the analogous formulas. That is, ρ_1 maps F_2-measurable random variables to F_1-measurable random variables. Alternatively, we could see ρ_1 as mapping from L∞(E × F, μ̄) to L∞(E, µ). In this notation, acceptance consistency simply means ρ_0(f) ≤ ρ_0(ρ_1(f)) for all f ∈ B(E × F). According to Theorem 27 of [1], acceptance consistency is equivalent to a penalty-function inequality. Here E_ν̄ denotes integration with respect to ν̄.
In other words, acceptance consistency is equivalent to the inequality holding for every ν̄ = ∫ ν(dx) K^ν_x(dy) ∈ P_μ̄(E × F). This differs from our definition of superadditivity only in the term E_ν̄[α_1(ν̄)]. According to Lemma 4 of [1], this term can be rewritten; but according to Lemma 3.11, it is in turn equal to the integral appearing in our definition. In other words, Lemma 3.11 bridges our characterization of acceptance consistency with that of [1, Theorem 27], which we now see are equivalent.

3.5. Weak time consistency.
A related notion of time consistency was studied by Weber in [35]. Namely, we say a law invariant risk measure ρ is weakly acceptance consistent if ρ(X|G) ≤ 0 a.s. implies ρ(X) ≤ 0, for every X ∈ L∞ and every σ-field G ⊂ F. Similarly, ρ is weakly rejection consistent if ρ(X|G) > 0 a.s. implies ρ(X) > 0. The following result, due in large part to Weber [35], characterizes weak time consistency in terms of measure acceptance sets as well as divergences. Let us say that a set A ⊂ P(R) is locally measure convex if, for each M > 0 and each Q ∈ P(P[−M, M]) with Q(A ∩ P[−M, M]) = 1, the mean measure ∫ Q(dm) m(·) belongs to A.

Theorem 3.12. Suppose α is a simplified divergence induced by a law invariant risk measure ρ with acceptance set A. The following are equivalent:
(1) ρ is weakly acceptance consistent.
(2) A is locally measure convex.
(3) For Polish spaces E and F, and measures ∫ µ(dx) K^µ_x(dy) and ∫ ν(dx) K^ν_x(dy) in P(E × F), we have
(4) For Polish spaces E and F, and measures µ_1 × µ_2 and ∫ ν(dx) K^ν_x(dy) in P(E × F), we have
Similarly, the same equivalences hold when "acceptance" is changed to "rejection", "sub" is changed to "super", and A is changed to A^c. The equivalence of (1) and (2) holds without the assumption that α is simplified.
Proof. The implication (1) ⇔ (2) was first noticed by Weber [35], and the rest is proven along the same lines as Theorem 3.5, but we provide a sketch. Suppose first that (1) holds. Fix Polish spaces E and F and measures ∫ µ(dx) K^µ_x(dy) and ∫ ν(dx) K^ν_x(dy) in P(E × F). It is easy to see (similarly to Proposition 3.8) that weak acceptance consistency is equivalent to the following inequality for f ∈ B(E × F). This proves (1) ⇒ (3). Since clearly (3) implies (4), let us finally show that (4) implies (1). Fix Polish spaces E and F and µ_1 × µ_2 ∈ P(E × F). As in the proof of Theorem 3.5, the inequality of (4), combined with Lemma 3.11 and the order-reversing property of convex conjugation, implies the corresponding set inclusion. Again, it is easy to see (similarly to Proposition 3.8) that this implies weak acceptance consistency.
Remark 3.13. In fact, for a measure acceptance set A, local measure convexity is equivalent to (ordinary) convexity. Indeed, the set A ∩ P[−M, M] is weakly closed for each M > 0, which can be proven easily using the Fatou property 2.1 and the Skorohod representation of weak convergence. It is well known that closed convex sets are measure convex; see e.g. [36, Corollary 1.2.4]. The same equivalence may not hold for the complement A^c, which is not closed.
3.6. More on shift-convexity. Let us recall our first interpretation of a shift-convex acceptance set: suppose X and Y are risks, and X is acceptable. Suppose that Y is conditionally acceptable given X. Then shift-convexity means that X + Y is itself acceptable. Proposition 3.14 below shows that this interpretation can be sharpened somewhat: suppose X and Y are risks, X is G-measurable for some σ-field G, X is acceptable, and the risk Y is conditionally acceptable given G. Then shift-convexity implies that the risk X + Y is acceptable as well.
is in A, which shows that A is shift-convex. Conversely, suppose A is shift-convex. Then the corresponding law invariant risk measure ρ is acceptance consistent. A fortiori, ρ is weakly acceptance consistent, and thus A is locally measure convex by Theorem 3.12. For each x define K_x ∈ P(R) to be the mean measure of Q_x, i.e. K_x(·) = ∫ Q_x(dm) m(·).
Since Q_x(A ∩ P[−M, M]) = 1 for µ-a.e. x, and since A is locally measure convex, it holds that K_x ∈ A for µ-a.e. x. From partial shift-convexity we conclude that the measure ∫ µ(dx) K_x(· − x) is in A, which proves property (S).

4. Further properties of divergences
While every divergence is convex in its first argument by definition, it is well known that relative entropy and, more generally, f-divergences are jointly convex. It turns out that joint convexity of a divergence is equivalent to concavity of the corresponding law invariant risk measure on the level of distributions. To be clear, for a law invariant risk measure ρ, define the function ρ̃ on the set of compactly supported probability measures on R by setting ρ̃(P ∘ X^{−1}) = ρ(X), for X ∈ L∞. The concavity of ρ̃ was studied recently by Acciaio and Svindland [2], who make a compelling case that concavity is much more common in spite of the convexity of ρ on the level of random variables. Indeed, they show that ρ(X) = EX is the only law invariant risk measure for which ρ̃ is convex. The entropic risk measure, for example, clearly has ρ̃ concave. Moreover, if ρ is the optimized certainty equivalent corresponding to a function φ, then the formula

ρ̃(m) = inf_{c∈R} ( c + ∫_R φ(x − c) m(dx) ),

exhibiting ρ̃ as an infimum of affine functionals of m, shows that ρ̃ is concave.

Proposition 4.1. Let ρ be a law invariant risk measure with induced divergence α. The following are equivalent:
(1) α is jointly convex, in the sense that α(·|·) is convex on P(E) × P(E) for each Polish space E.
(2) For each Polish space E and each f ∈ B(E), the map µ ↦ ρ_µ(f) is concave.
(3) The map ρ̃ is concave.
Proof. (1 ⇒ 2): Let E be a Polish space and f ∈ B(E). Fix t ∈ (0, 1) and µ_1, µ_2 ∈ P(E). Then (1) implies
(2 ⇒ 1): On the other hand, if ν_1, ν_2 ∈ P(E), then (2) implies
(3 ⇒ 2): This is immediate from the identity
(2 ⇒ 3): This is almost immediate from the above identity. Assume (2). Let m_1, m_2 ∈ P(R) have compact support, and let t ∈ (0, 1). Then, letting id denote the identity map on R,

Divergences are actually uniquely determined by their values on finite spaces E, as is formalized in the following proposition. Building on the characterization of relative entropy in Corollary 3.6 below, we could derive an even simpler characterization akin to those surveyed by Csiszár [9], but this would lead us too far astray.
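Joint convexity can be tested numerically in the simplest case, α = relative entropy. The sketch below (our own illustration) checks the inequality D(tν_1 + (1−t)ν_2 ‖ tµ_1 + (1−t)µ_2) ≤ t D(ν_1‖µ_1) + (1−t) D(ν_2‖µ_2) on a two-point space:

```python
import math

def kl(nu, mu):
    """Relative entropy D(nu || mu) for finite distributions given as lists."""
    return sum(n * math.log(n / m) for n, m in zip(nu, mu) if n > 0)

nu1, mu1 = [0.2, 0.8], [0.5, 0.5]
nu2, mu2 = [0.7, 0.3], [0.4, 0.6]
t = 0.3

mix = lambda a, b: [t * x + (1 - t) * y for x, y in zip(a, b)]

lhs = kl(mix(nu1, nu2), mix(mu1, mu2))           # divergence of the mixtures
rhs = t * kl(nu1, mu1) + (1 - t) * kl(nu2, mu2)  # mixture of the divergences
print(lhs, rhs)   # lhs <= rhs: joint convexity of relative entropy
```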
Proposition 4.2. Suppose α is a simplified divergence. For any Polish space E and any µ, ν ∈ P(E), we have  Indeed, this is true because every measurable map T : E → F can be written as Hence, we need only to prove (4.1).
Since [0, 1] is compact, for each n we may find a measurable map T n : [0, 1] → [0, 1] with finite range such that |x − T n (x)| ≤ 1/n for all x ∈ [0, 1]. Then T n converges uniformly to the identity. Since α is simplified, for a given ǫ > 0 we may find a continuous function f on [0, 1] such that Since ρ µ is continuous in the supremum norm, and since f • T n → f uniformly, we conclude that ρ µ (f ) = lim n ρ µ (f • T n ). Thus This is enough to complete the proof.
Proposition 4.3. Let α be a divergence, let E and F be Polish spaces, and let µ, ν ∈ P(E). If a measurable map T : E → F is sufficient for (µ, ν), then α(ν ∘ T^{−1} | µ ∘ T^{−1}) = α(ν|µ).

Proof. We use a more probabilistic notation for this proof, since we deal with conditional expectations. By definition of a divergence, α(ν ∘ T^{−1} | µ ∘ T^{−1}) ≤ α(ν|µ), so we must only prove the reverse inequality. As is well known, sufficiency of T easily implies that Because of Corollary 4.65 of [16], we have ρ_µ(f) ≥ ρ_µ(E_µ[f | T]). Thus Every T-measurable function on E may be written as g ∘ T for some measurable function g, and thus

Remark 4.4. Proposition 4.3 raises a natural question: does the converse hold? That is, does the equality α(ν ∘ T^{−1} | µ ∘ T^{−1}) = α(ν|µ) imply that T is sufficient for (µ, ν)? This does not hold for all divergences, but it does when α is relative entropy, as was observed first by Kullback and Leibler [25]. Liese and Vajda [29] show that this converse holds for many (but not all) f-divergences. This characterization leads to useful tests for sufficiency, as is explained in both of those papers [25, 29].
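The inequality α(ν ∘ T^{−1} | µ ∘ T^{−1}) ≤ α(ν|µ) used at the start of the proof is the data-processing inequality. For relative entropy it is easy to check numerically, and the inequality is typically strict when T is not sufficient, as in the following small sketch (our own illustration):

```python
import math

def kl(nu, mu):
    """Relative entropy D(nu || mu) for finite distributions given as lists."""
    return sum(n * math.log(n / m) for n, m in zip(nu, mu) if n > 0)

nu = [0.1, 0.3, 0.6]
mu = [0.2, 0.4, 0.4]

# A non-injective map T: {0,1,2} -> {0,1}, collapsing states 1 and 2.
T = [0, 1, 1]

def pushforward(m):
    out = [0.0, 0.0]
    for i, t in enumerate(T):
        out[t] += m[i]
    return out

full = kl(nu, mu)
coarse = kl(pushforward(nu), pushforward(mu))
print(coarse, full)   # coarse <= full: the data-processing inequality
```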

5. Examples
Before we discuss some common law invariant risk measures, recall that our sign convention is not the usual one: ρ is increasing, not decreasing. More precisely, if ρ is a risk measure according to our definition, the map X ↦ ρ(−X) is what is more often called a risk measure, as in [16].

5.1. Shortfall risk measures. Shortfall risk measures, introduced by Föllmer and Schied [14], are of the form

ρ(X) = inf{m ∈ R : E[ℓ(X − m)] ≤ 1},

where ℓ is a loss function, defined as follows:

Definition 5.1. A loss function is a convex and nondecreasing function ℓ : R → R satisfying ℓ(0) = 1 < ℓ(x) for all x > 0.
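On a finite space the shortfall risk measure is straightforward to compute. The following sketch (helper names our own, assuming the representation ρ(X) = inf{m : E[ℓ(X − m)] ≤ 1} in the paper's increasing sign convention) approximates the infimum by bisection, which is justified since m ↦ E[ℓ(X − m)] is nonincreasing. With the exponential loss ℓ(x) = e^x, the result is the entropic risk measure log E[e^X]:

```python
import math

def shortfall_rho(values, probs, loss, lo=-50.0, hi=50.0, tol=1e-10):
    """Shortfall risk: the smallest m with E[loss(X - m)] <= 1, via bisection.
    Assumes loss is nondecreasing, so E[loss(X - m)] is nonincreasing in m."""
    def excess(m):
        return sum(p * loss(x - m) for x, p in zip(values, probs)) - 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

vals, probs = [1.0, -0.5, 2.0], [0.2, 0.5, 0.3]

# With l(x) = e^x, the defining condition E[exp(X - m)] <= 1 is equivalent
# to m >= log E[exp(X)], i.e. the entropic risk measure with eta = 1.
m = shortfall_rho(vals, probs, math.exp)
entropic = math.log(sum(p * math.exp(x) for x, p in zip(vals, probs)))
print(m, entropic)
```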
Of course, the induced family of risk measures is ρ_µ(f) = inf{m ∈ R : ∫ ℓ(f − m) dµ ≤ 1}. Note that by continuity of ℓ and monotone convergence, the infimum is always attained. In particular, ρ(X) ≤ 0 if and only if E[ℓ(X)] ≤ 1. According to [16, Theorem 4.115], the induced divergence is

α(ν|µ) = inf_{t>0} (1/t) ( 1 + ∫ ℓ*(t dν/dµ) dµ ), for ν ≪ µ. (5.2)

Proposition 5.2. Let ℓ be a loss function and α the corresponding divergence defined in (5.2). If ℓ is log-subadditive (resp. log-superadditive), then α is superadditive (resp. subadditive), or equivalently, ρ is acceptance consistent (resp. rejection consistent).

Proof. Assume ℓ is log-subadditive. With Theorem 3.5 in mind, we will show that the following set is shift-convex: A = {µ ∈ P(R) compactly supported : ∫_R ℓ dµ ≤ 1}. Fix µ ∈ A, M > 0, and a measurable map R ∋ x ↦ K_x ∈ A ∩ P[−M, M]. Then ∫_R ℓ dK_x ≤ 1 for all x and also ∫_R ℓ dµ ≤ 1. Thus, using log-subadditivity in the form ℓ(x + y) ≤ ℓ(x)ℓ(y),

∫ µ(dx) ∫ ℓ(x + y) K_x(dy) ≤ ∫ µ(dx) ℓ(x) ∫ ℓ dK_x ≤ ∫_R ℓ dµ ≤ 1,

and it follows that ∫_R µ(dx) K_x(· − x) is in A.

5.2. Optimized certainty equivalents. Optimized certainty equivalents, introduced by Ben-Tal and Teboulle [5,6], are of the form

ρ(X) = inf_{m∈R} ( m + E[φ(X − m)] ),

where φ : R → R is convex and nondecreasing, with φ*(1) = sup_{x∈R}(x − φ(x)) = 0. Of course, the induced family of risk measures is ρ_µ(f) = inf_{m∈R} ( m + ∫ φ(f − m) dµ ). The corresponding divergence is the φ*-divergence, α(ν|µ) = ∫ φ*(dν/dµ) dµ, for ν ≪ µ.
As we saw in the discussion preceding Proposition 4.1, an optimized certainty equivalent always satisfies the concavity condition of Proposition 4.1, and this provides an alternative proof of the well known joint convexity of α. It is also known that α is jointly weakly lower semicontinuous, which we confirm using Theorem 2.13 before addressing time consistency and additivity properties.
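As a quick sanity check (our own sketch, with helper names of our own, assuming the convention ρ(X) = inf_m { m + E[φ(X − m)] }), one can verify numerically that the optimized certainty equivalent with φ(x) = e^{x−1}, whose conjugate is φ*(x) = x log x, recovers the entropic risk measure log E[e^X]:

```python
import math

def oce(values, probs, phi, grid=None):
    """Optimized certainty equivalent rho(X) = inf_m { m + E[phi(X - m)] },
    approximated by minimizing over a fine grid of shift values m."""
    if grid is None:
        grid = [i / 1000.0 for i in range(-5000, 5001)]  # m in [-5, 5]
    return min(m + sum(p * phi(x - m) for x, p in zip(values, probs))
               for m in grid)

vals, probs = [1.0, -0.5, 2.0], [0.2, 0.5, 0.3]

# phi(x) = exp(x - 1) satisfies phi*(1) = 0; the minimizer is
# m* = log E[exp(X)] - 1, and the optimal value is log E[exp(X)].
val = oce(vals, probs, lambda x: math.exp(x - 1.0))
entropic = math.log(sum(p * math.exp(x) for x, p in zip(vals, probs)))
print(val, entropic)
```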
Proposition 5.4. Every optimized certainty equivalent is weakly lower semicontinuous. In particular, α is simplified.
Proposition 5.5. If φ*(xy) ≥ xφ*(y) + yφ*(x) for all x, y ≥ 0 (resp. ≤), then α is superadditive (resp. subadditive).

Proof. We treat the superadditive case, as the subadditive case is proven similarly. Suppose E and F are Polish spaces, and let μ̄ = ∫ µ(dx) K^µ_x(dy) and ν̄ = ∫ ν(dx) K^ν_x(dy) be probability measures on E × F. Simply use the definition of α:

Remark 5.6. Of course, the relationship φ*(xy) = xφ*(y) + yφ*(x) is satisfied by φ*(x) = x log x, the conjugate of which (assuming φ* = +∞ on the negative half-line) is φ(x) = e^{x−1}. More generally, suppose ℓ is a strictly increasing log-subadditive loss function. Then ℓ^{−1}(xy) ≥ ℓ^{−1}(x) + ℓ^{−1}(y) for x, y > 0, and so φ*(x) := xℓ^{−1}(x) satisfies φ*(xy) ≥ xφ*(y) + yφ*(x). As was discussed in Remark 5.3, there are not many such functions.

Now note that for each f ∈ C([0, 1]), the function of x inside the supremum on the right-hand side above is measurable. Indeed, we showed in Lemma 2.14 that µ ↦ ρ_µ(f) is measurable (actually lower semicontinuous) when f is continuous and bounded. Since C([0, 1]) equipped with the supremum norm is a Polish space, for a fixed ǫ > 0 we may find (by [7, Proposition 7.50]) a universally measurable map E ∋ x ↦ g_x ∈ C([0, 1]) such that, for each x ∈ E, We may find a Borel measurable map which agrees ν-a.e. with x ↦ g_x, and we abuse notation by denoting this again by x ↦ g_x. By Lusin's theorem, for each δ > 0 there exists a compact set S_δ ⊂ E such that ν(S^c_δ) ≤ δ and the restriction S_δ ∋ x ↦ g_x ∈ C([0, 1]) is continuous. Without loss of generality, assume that S_δ ⊃ S_{δ′} whenever δ < δ′. Define g^δ_x := g_x for x ∈ S_δ and g^δ_x := 0 (the zero function) for x ∉ S_δ, and note that g^δ is a bounded measurable function from E to C([0, 1]). Define g^δ : E × [0, 1] → R by g^δ(x, y) = g^δ_x(y), and note that g^δ is jointly measurable (thanks to Theorem 4.55 and Lemma 4.51 of [3]) and bounded.