Skip to content
BY 4.0 license Open Access Published by De Gruyter Open Access January 11, 2023

Joint lifetime modeling with matrix distributions

  • Hansjörg Albrecher , Martin Bladt and Alaric J. A. Müller EMAIL logo
From the journal Dependence Modeling

Abstract

Acyclic phase-type (PH) distributions have been a popular tool in survival analysis, thanks to their natural interpretation in terms of aging toward its inevitable absorption. In this article, we consider an extension to the bivariate setting for the modeling of joint lifetimes. In contrast to previous models in the literature that were based on a separate estimation of the marginal behavior and the dependence structure through a copula, we propose a new time-inhomogeneous version of a multivariate PH (mIPH) class that leads to a model for joint lifetimes without that separation. We study properties of mIPH class members and provide an adapted estimation procedure that allows for right-censoring and covariate information. We show that initial distribution vectors in our construction can be tailored to reflect the dependence of the random variables and use multinomial regression to determine the influence of covariates on starting probabilities. Moreover, we highlight the flexibility and parsimony, in terms of needed phases, introduced by the time inhomogeneity. Numerical illustrations are given for the data set of joint lifetimes of Frees et al., where 10 phases turn out to be sufficient for a reasonable fitting performance. As a by-product, the proposed approach enables a natural causal interpretation of the association in the aging mechanism of joint lifetimes that goes beyond a statistical fit.

MSC 2010: 62N02

1 Introduction

When studying insurance products on multiple lives, it is natural to assume that individuals who are exposed to very similar life conditions may have somewhat correlated lifetimes. This is especially true for married couples, since, once married, the spouses typically share to a large extent a similar lifestyle. Indeed, the simplistic assumption of independence of lifetimes of partners has been shown to be inappropriate in various papers. For example, Frees et al. [15] used a bivariate Frank copula model to assess the effect of dependency between husband and wife on insurance annuities, illustrating their approach on a by now classical data set of a large insurer. The same data were used in Carriere [10], where multiple bivariate copula models were studied. The Linear-Mixing Frailty copula was found to be the best-suited model to describe the data. Shemyakin and Youn [22] introduced a general conditional Bayesian copula for the joint last survivor analysis. They allow entry ages of spouses to have a selection effect on his/her mortality, as well as on the other spouse’s mortality. For that same data set, Luciano et al. [19] captured the dependence between survival times of spouses by an Archimedean copula, whose marginals were estimated according to a stochastic intensity approach. In a similar spirit, Dufresne et al. [13] allow for the Archimedean copula parameter to depend on the age difference of partners at issue of the policy, to describe the dependence of the remaining lifetime of a couple. In the the study by Gobbi et al. [16], extended Marshall-Olkin models were employed for that same data set, where the continuous copula approach is extended by allowing for fatal events that affect both marginal lives.

In this article, we propose an alternative to copula-based methods for the modeling of joint remaining lifetimes in a couple based on multivariate phase-type distributions. Phase-type (PH) distributions are interesting candidates since they broaden favorable properties of exponential random variables to scenarios where the latter alone would not be appropriate. In particular, the denseness of PH distributions among all distributions on the positive half-line in the sense of weak convergence, which extends to the multivariate setup, is a major advantage when one wants to approximate a distribution. For more details on PH distributions, we refer readers to Bladt and Nielsen [8]. As opposed to copula-based methods, a PH distribution can give rise to a natural interpretation when used to approximate lifetime distributions. One can view the path of a Markov jump process as the life of an individual, which goes through several different states (for instance, biological markers) before reaching the inevitable absorption (death) state. Along this interpretation, acyclic PH distributions have been the first choice for modeling the aging process of a human life, since they only allow forward transitions or direct exits to the absorption state. This characteristic makes them an appropriate tool for describing lifetimes ended by natural aging or accidents. In the study by Lin and Liu [18], a PH distribution with the Coxian structure was used to explain the physical aging process of marginal lifetimes. This approach was extended in the study by Asmussen et al. [5] to generalized Coxian distributions for the purpose of pricing equity-linked products. The first contribution to lift the PH approach to bivariate lifetime models was Ji et al. [17], where a Markovian multistate model and a semi-Markov model are used to describe the dependence between the lifetimes of husbands and wives. Spreeuw and Owadally [23] also use a Markovian multistate model, with more attention given on how to tie the bereavement effect to forces of mortality. Moutanabbir and Abdelrahman [20] then used a bivariate Sarmanov distribution with PH marginals to model joint lifetimes. Both articles focused on the pricing of multiple-life insurance contracts.

Recently, Albrecher et al. [2] introduced time-inhomogeneous PH (IPH) distributions for the purpose of lifetime modeling, which leads to a considerable reduction of necessary phases for a satisfactory fit of given data, since the introduced inhomogeneity can more efficiently accommodate nonexponential shapes than an augmentation of the phase dimension. In particular, [2] applied regression on the intensity functions of the IPH distributions to associate lifetimes of different cohorts and populations.

In this article, we propose a different route for using available information in the data set, namely, to incorporate multinomial logistic regressions in the estimation procedure of multivariate PH distributions. In particular, the regression is applied to the initial distribution vectors of each IPH component, which adapts an approach presented in Bladt and Yslas [9] to the multivariate case. The resulting dependence structure allows for explicit formulas alongside an intuitive “aging” interpretation and, beyond the theoretical contribution, for a satisfactory fit to the bivariate spouses’ lifetime data.

The remainder of this article is structured as follows. Section 2 introduces the class of multivariate PH distributions that we will use to describe joint lifetimes of couples; we also provide some additional properties. In Section 3, an estimation method for this multivariate PH distribution is introduced, which allows for right censoring and covariate information. Section 4 then applies and illustrates the procedure on the classical spouses’ lifetime data set from [15] and interprets the results. Section 5 concludes this article.

2 Multivariate phase-type distributions

We first recall the mPH class, which was introduced in the study by Bladt [7].

2.1 mPH distributions

Let { J t ( i ) } t 0 , i = 1 , , d , denote separate homogeneous Markov pure-jump processes on the common state space E = { 1 , , p , p + 1 } , with states 1 , , p being transient and p + 1 absorbing. Defining transition probabilities as follows:

p j l ( i ) ( s , t ) = P ( J t ( i ) = l J s ( i ) = j ) , 0 j , l p + 1 , 0 < i d ,

we may write

P i ( s , t ) = exp ( Λ i ( t s ) ) = exp ( T i ( t s ) ) e exp ( T i ( t s ) ) e 0 1 R ( p + 1 ) × ( p + 1 ) ,

for s < t , 0 i d , where Λ i ( t ) are intensity matrices. In the following, we write e k for the k -th canonical basis vector in R p , e = j = 1 p e j , and

T i = { t k s ( i ) } k , s = 1 , , p , t i = T i e = ( t 1 ( i ) , , t p ( i ) ) T , k = 1 , , p .

The crucial property of this class of PH distributions is now its dependence structure. Concretely, the assumption is that all jump processes start in the same state at time t = 0 , but proceed independently thereafter until absorption. That is, dependence is introduced solely through the shared initial state, which leads to a particularly tractable yet flexible model class. More formally,

(2.1) J 0 ( i ) = J 0 ( l ) , { J t ( i ) } t 0 J 0 ( 1 ) { J t ( l ) } t 0 , l i , i , l { 1 , , d } .

We will use J 0 J 0 ( i ) to simplify notation. Let P ( J 0 = j ) = π j , j = 1 , , p and π = ( π 1 , , π p ) denote the distribution vector of the shared initial state. The random variables

(2.2) X i = inf { t > 0 : J t ( i ) = p + 1 } , i = 1 , , d ,

are then all univariate PH distributed. We say that the random vector X = X 1 , , X d R + d has a multivariate phase-type distribution (mPH) if each marginal variable X i , i = 1 , 2 , , d is given by (2.2) and pairwise dependence is defined by (2.1). We use the notation

X mPH ( π , T ) , with T = { T 1 , , T d } .

The joint cumulative distribution function of X is given by

F X ( x ) = P ( X 1 x 1 , X 2 x 2 , , X d x d ) = j = 1 p P ( X 1 x 1 , X 2 x 2 , , X d x d J 0 = j ) P ( J 0 = j ) = j = 1 p π j i = 1 d ( 1 e j T exp ( T i x i ) e ) , x R + d .

Furthermore, the survival function is

S X ( x ) = P ( X 1 > x 1 , X 2 > x 2 , , X d > x d ) = j = 1 p π j i = 1 d e j T exp ( T i x i ) e ,

and the probability density function is given by

f X ( x ) = j = 1 p π j i = 1 d e j T exp ( T i x i ) t i .

For more details, compare with [7].

2.2 mIPH distributions

The particular focus in this article will now be on an inhomogeneous extension of the mPH distribution (briefly mentioned in [7, Section 6.1]). When considering time-inhomogeneous Markov pure jump processes on the common state-space E , it follows from Albrecher and Bladt [1] that the transition matrices are modified to

P ( s , t ) = s t ( I + Λ ( u ) d u ) I + k = 1 s t s u k s u 2 Λ ( u 1 ) Λ ( u k ) d u 1 d u k ,

with sub-intensity matrix

Λ ( t ) = T ( t ) t ( t ) 0 0 R ( p + 1 ) × ( p + 1 ) , t 0 .

The random variables Y i = inf { t > 0 : J t ( i ) = p + 1 } , i = 1 , , d , then follow univariate inhomogeneous phase-type (IPH) distributions, and readers can compare with [1] for more details.

Here, we focus on the particularly tractable case T i ( t ) = λ i ( t ) T . A random vector Y = Y 1 , , Y d is said to have an inhomogeneous multivariate PH (mIPH) distribution if all marginals follow IPH distributions, and the dependence structure is defined by (2.1). We write

Y mIPH ( π , T , ) , where T = { T 1 , , T d } , = { λ 1 , , λ d } .

With

g i 1 ( y ) 0 y λ i ( u ) d u , i = 1 , , d ,

the cumulative distribution function, survival function, and density of Y are given by

F Y ( y ) = j = 1 p π j i = 1 d ( 1 e j T exp ( T i g i 1 ( y i ) ) e ) , y R + d , S Y ( y ) = j = 1 p π j i = 1 d e j T exp ( T i g i 1 ( y i ) ) e , y R + d ,

and

f Y ( y ) = j = 1 p π j i = 1 d e j T exp ( T i g i 1 ( y i ) ) t i λ i ( y i ) , y R + d ,

respectively. Note that one can view each IPH random variable as a transformation of a PH random variable (and correspondingly the absorption time of a time-transformed formerly time-homogeneous Markov jump process), with X PH ( π , T ) and g ( X ) IPH ( π , T , λ ) .

The construction of mIPH ( π , T , ) allows different sub-intensity matrices and inhomogeneity functions for each marginal, as long as they share the same state-space. This leads to a considerable model flexibility. In particular, when compared to the homogeneous case, time-inhomogeneity allows for substantially smaller state-spaces for appropriate fits of data with potentially nonexponential tails (compare with [1]), and the mIPH class inherits this feature.

When we condition a mIPH distribution on one or more marginals, we obtain another mIPH distribution with a new initial distribution vector and smaller dimension: Let Y mIPH ( π , T , ) and condition on the value of Y l , l d . The conditional density is

f Y Y l ( y y l ) = j = 1 p π j e j T exp ( T l g l 1 ( y l ) ) t l λ l ( y l ) π exp ( T l g l 1 ( y l ) ) t l λ l ( y l ) i l e j T exp ( T i g l 1 ( y i ) ) t i λ i ( y i ) .

That is,

(2.3) Y Y l = y l mIPH ( α , T T l , λ l ) ,

with initial distribution vector

α = π j × e j T exp ( T l g l 1 ( y l ) ) t l λ l ( y l ) π exp ( T l g l 1 ( y l ) ) t l λ l ( y l ) j = 1 , , p .

The same reasoning can be applied to obtain

(2.4) Y Y l y l mIPH ( ν , T T l , λ l ) ,

with

ν = π j × e j T exp ( T l g l 1 ( y l ) ) e π exp ( T l g l 1 ( y l ) ) e j = 1 , , p .

Remark 2.1

One might be tempted to argue that Assumption (2.1) necessarily leads to positive dependence of the resulting random variables (in our case lifetimes). However, sharing the initial state is not sufficient to obtain positive dependence, as the different intensity matrices may introduce counter-effects. For instance, after starting in the same state, we could have a very small expected holding time and direct absorption for one marginal, while the second has to pass through the entire state space before absorption happens, leading to a very large survival time. This behavior could be very well reversed when starting in another (but common) state. Consequently, certain combinations of individual intensity matrices may give rise to negative dependence as well.

3 Parameter estimation for right-censored data and covariate information

In the following, we first introduce the components needed to estimate the parameters of mIPH distributions, when right-censored data is present. Second, we present how to estimate initial distribution vectors considering covariate information. Finally, we propose an adapted expectation maximization (EM) algorithm, which we name ERMI algorithm.

3.1 EM algorithm for right-censored data

Taking inspiration from Asmussen et al. [6] and Olsson [21], we now derive conditional expectations needed in the EM algorithm for mPH distributions, where absorption times are allowed to be right censored. Since the eventually targeted mIPH distributions are transformed mPH distributions, after transformation of the data, the E-Step and M-Step of the algorithm are the same as for the time-homogeneous case.

Let X = ( X 1 , , X d ) be the collection of random variables we are interested in. Let x = x 1 ( m ) , , x d ( m ) be the observations of absorption times assumed to be generated from X mPH ( π , T ) , where x i ( m ) R + n for i = 1 , , d . We assume that the censoring mechanism is independent of the size of the random variables. The marginals X i ( m ) = min ( x i ( m ) , R i ( m ) ) follow PH ( π , T i ) distributions, where R i ( m ) is a random censoring point for the m -th observation. The realization of random right-censoring indicators can be found in Δ = δ 1 ( m ) , , δ d ( m ) , where elements δ i ( m ) R + n , i = 1 , , d , are equal to 1 if the absorption time x i ( m ) is fully observed and 0 if x i ( m ) R i ( m ) is right-censored.

The sample X is associated with the latent sample paths { J t ( i , m ) } t 0 , i = 1 , , d , m = 1 , , n , which are not observed. To face this issue, we make the following definitions. Let

B k = i = 1 d m = 1 n 1 { J 0 ( i , m ) = k } , k = 1 , , p , N k s ( i ) = m = 1 n t 0 1 { J t ( i , m ) = k , J t ( i , m ) = s } , k , s = 1 , , p , i = 1 , , d , N k ( i ) = m = 1 n t 0 1 { J t ( i , m ) = k , J t ( i , m ) = p + 1 } , k = 1 , , p , i = 1 , , d , Z k ( i ) = m = 1 n 0 1 { J t ( i , m ) = k } d t , k = 1 , , p , i = 1 , , d .

B k is the number of times marginal jump processes start in State k , N k s ( i ) is the number of transitions from State k to s for jump process i , and N k ( i ) is the number of absorptions from State k for jump process i . Finally, Z k ( i ) is the time spent in state k before absorption of jump process i . These statistics are not observable, but are sufficient to describe the dynamics of the underlying Markov process. Moreover, they are essential to construct an effective EM-like algorithm. Then, the completely observed likelihood can be expressed using the sufficient statistics defined earlier, as

(3.1) c ( π , T ; x ) = k = 1 p π k B k i = 1 d k = 1 p s k t k s ( i ) N k s ( i ) e t k s ( i ) Z k ( i ) k = 1 p t k ( i ) N k ( i ) e t k ( i ) Z k ( i ) ,

which is seen to conveniently fall into the exponential family of distributions and thus has explicit maximum likelihood estimators. The multiplications on the right-hand side of (3.1) can be interpreted as follows: The first product gives the probability of starting in each transient state. The second consists of the likelihood of transitions between transient states, associated with respective sojourn times, such that the absorbing state is not considered in the multiplication. Finally, the last product gives the likelihood of transitions from transient states to the absorbing state, along with time spent in respective states before absorption.

With these assumptions, the derivation of B k , Z k ( i ) , N k s ( i ) , and N k ( i ) , for k , s = 1 , , p and i = 1 , , d , is analogous to the fully uncensored case (see [7]). The only difference from the fully observed case is that marginals may have right-censored absorption times. To see how this affects the expectation step of the EM algorithm, we give a detailed derivation of E ( B k X = x ) .

For the m -th row of Y , let i u n ( m ) denote the collection of indices of marginals that are uncensored and similarly i r c ( m ) for right-censored marginals. Naturally i u n ( m ) + i r c ( m ) = d . Then, the conditional expectation of B k under right-censoring is

E ( B k X = x ) = i = 1 d m = 1 n E ( 1 { J 0 ( i , m ) = k } X = x ) = d × m = 1 n P ( J 0 ( m ) = k X = x ) = d × m = 1 n P ( J 0 ( m ) = k ) P ( X j d x j ( m ) , X l x l ( m ) ; j i u n ( m ) , l i r c ( m ) J 0 ( m ) = k ) P ( X j d x j ( m ) , X l x l ( m ) ; j i u n ( m ) , l i r c ( m ) ) = d × m = 1 n π k ( m ) j i u n ( m ) e k T exp ( T j x j ( m ) ) t j l i r c ( m ) e k T exp ( T l x l ( m ) ) e s = 1 p π s ( m ) j i u n ( m ) e s T exp ( T j x j ( m ) ) t j l i r c ( m ) e s T exp ( T l x l ( m ) ) e ,

where we see a mix of marginal densities and survival functions appearing in both the numerator and denominator.

Note that we introduce the notation J 0 ( i , m ) since it is possible to observe m different mIPH distributions which only differ in terms of initial distribution vectors, as explained in Section 3.2. With this in mind, J 0 ( m ) indicates the starting state of the Markov jump process of the m -th observation, which is coupled with initial distribution vector π ( m ) .

This expectation can also be expressed, using the Δ notation, as follows:

E ( B k X = x ) = d × m = 1 n π k ( m ) i = 1 d ( e k exp ( T i x i ( m ) ) t i ) δ i ( m ) ( e k exp ( T i x i ( m ) ) e ) 1 δ i ( m ) j = 1 p π j ( m ) i = 1 d ( e j exp ( T i x i ( m ) ) t i ) δ i ( m ) ( e j exp ( T i x i ( m ) ) e ) 1 δ i ( m ) ,

and we shall use this style in the following. The other needed conditional expectations are obtained in a similar way, reading

E ( Z k ( i ) X = x ) = m = 1 n j = 1 p π j ( m ) l i ( e j exp ( T l x l ( m ) ) t l ) δ l ( m ) ( e j exp ( T l x l ( m ) ) e ) 1 δ l ( m ) j = 1 p π j ( m ) i = 1 d ( e j exp ( T i x i ( m ) ) t i ) δ i ( m ) ( e j exp ( T i x i ( m ) ) e ) 1 δ i ( m ) × e k 0 x i ( m ) exp ( T i ( x i ( m ) t ) ) t i e j exp ( T i t ) d t e k δ i ( m ) × e k 0 x i ( m ) exp ( T i ( x i ( m ) t ) ) e e j exp ( T i t ) d t e k 1 δ i ( m ) ,

E ( N k s ( i ) X = x ) = t k s ( i ) × m = 1 n j = 1 p π j ( m ) l i ( e j exp ( T l x l ( m ) ) t l ) δ l ( m ) ( e j exp ( T l x l ( m ) ) e ) 1 δ l ( m ) j = 1 p π j ( m ) i = 1 d ( e j exp ( T i x i ( m ) ) t i ) δ i ( m ) ( e j exp ( T i x i ( m ) ) e ) 1 δ i ( m ) × e s 0 x i ( m ) exp ( T i ( x i ( m ) t ) ) t i e j exp ( T i t ) d t e k δ i ( m ) × e s 0 x i ( m ) exp ( T i ( x i ( m ) t ) ) e e j exp ( T i t ) d t e k 1 δ i ( m ) ,

and finally,

E ( N k ( i ) X = x ) = t k ( i ) × m = 1 n j = 1 p π j ( m ) e j exp ( T i x i ( m ) ) e k δ i ( m ) × l i ( e j exp ( T l x l ( m ) ) t l ) δ l ( m ) ( e j exp ( T l x l ( m ) ) e ) 1 δ l ( m ) j = 1 p π j ( m ) i = 1 d ( e j exp ( T i x i ( m ) ) t i ) δ i ( m ) ( e j exp ( T i x i ( m ) ) e ) 1 δ i ( m ) .

3.2 Initial distribution vectors

Adapting an idea developed in [9], we apply the regression component to the initial distribution, i.e., we estimate a “personalized” initial distribution vector as a function of covariates, which increases the flexibility of the model. To that end, we use multinomial logistic regressions, where we consider the initial probabilities as response variables that depend on covariate information found in A ( m ) T R g , with g being the number of explanatory variables, m = 1 , , n , and regression coefficients in γ R p × g . Concretely, the initial distribution probabilities are then given as follows:

π k ( m ) = exp ( A ( m ) γ k ) j = 1 p exp ( A ( m ) γ j ) ,

with γ k R g for k = 1 , , p .

In every iteration of the expectation-maximization (EM) algorithm to be described later, we use the conditional expectation of the number of times that the underlying process starts in a specific state as weights for the regression coefficients in γ . Let us consider the information carried by E ( B k X = x ) separately. For a row of observations m , let B k ( m ) be the number of times that the marginal jump process { J t ( m ) } t 0 starts in state k . The conditional expectation is given by

E ( B k ( m ) X = x ) = d × π k ( m ) i = 1 d ( e k exp ( T i x i ( m ) ) t i ) δ i ( m ) ( e k exp ( T i x i ( m ) ) e ) 1 δ i ( m ) j = 1 p π j ( m ) i = 1 d ( e j exp ( T i x i ( m ) ) t i ) δ i ( m ) ( e j exp ( T i x i ( m ) ) e ) 1 δ i ( m ) ,

for k = 1 , , p . We then solve the optimization problem

γ ˆ = argmax γ m = 1 n k = 1 p E ( B k ( m ) X = x ) log ( π k ( m ) ( A ( m ) ; γ ) )

and set

(3.2) π ˆ k ( m ) = π k ( m ) ( A ( m ) ; γ ˆ ) = exp ( A ( m ) γ ˆ k ) j = 1 p exp ( A ( m ) γ ˆ j )

in every iteration. The initial distribution hence depends on covariate information. Recall that all marginal processes { J t ( i ) } t 0 are assumed to start in the same state (drawn from the initial distribution with probabilities (3.2)), but afterward, transit independently to other states according to their specific sub-intensity matrices (and the latter do not depend on covariate information).

3.3 ERMI algorithm

Consider now a multivariate sample of right-censored absorption times y = y 1 ( m ) , , y d ( m ) , which we assume to originate from Y ( m ) mIPH ( π ( m ) , T , ) . The associated inhomogeneity functions depend on parameters β i , i = 1 , , d , and the right-censoring indicators are collected in Δ . The resulting EM algorithm with covariate information is depicted in Algorithm 1. As in the study by Albrecher et al. [4], we first take care of time-inhomogeneity. By using the relation g i 1 ( y i ( m ) ) = 0 y i ( m ) λ i ( u ; β i ) d u , i = 1 , , d , we obtain a time-homogeneous random sample ( x 1 ( m ) , , x d ( m ) ) , for which we know how to evaluate conditional expectations of sufficient statistics (E step). By using these expectations (as given above), we estimate the marginal sub-intensity matrices (M step), while the initial distribution vectors are predicted by multinomial logistic regressions (R step). Once we have estimated both, we need to find optimal inhomogeneity parameters β i , i = 1 , , d , that maximize the joint likelihood of the time-inhomogeneous sample (I step). Concretely, we solve

β ˆ = argmax β m = 1 n log j = 1 p π ˆ j ( m ) i = 1 d ( e j T exp ( T ˆ i x i ( m ) ) t ˆ i λ i ( y i ( m ) , β i ) ) δ i ( m ) ( e j T exp ( T ˆ i x i ( m ) ) e ) 1 δ i ( m ) ,

where x i ( m ) = g i 1 ( y i ( m ) ) . We repeat the procedure until a stopping rule is satisfied, and finally obtain the estimated distribution mIPH ( π ˆ , T ˆ , ˆ ) . Here, π ˆ is a matrix, where each row is a distribution vector π ˆ ( m ) , which is shared by marginals with the same covariates.

In contrast to copula-based methods, this approach does not separate the estimation of marginals and multivariate parameters. This may be considered preferable as the implied multivariate distribution has a natural and causal interpretation and is intimately connected to the marginal behavior of the risks, whereas choosing a concrete copula family on given (and possibly already fitted) marginal risks is often a somewhat more arbitrary choice for the modeling of multivariate phenomena.

Algorithm 1. Adapted expectation maximization (ERMI) algorithm for mIPH distributions
Input: Observed absorption times y R + n × d , right-censoring indicators Δ R n × d and arbitrary initial parameters for ( π , T , ) .
(1) For each marginal, transform the data in x i ( m ) = g i 1 ( y i ( m ) ; β i ) , i = 1 , 2 , , d and m = 1 , 2 , , n
(2) E-step: Calculate
E ( B k ( m ) Y = y ) k = 1 , , p , m = 1 , , n
E ( Z k ( i ) Y = y ) k = 1 , , p , i = 1 , , d
E ( N k s ( i ) Y = y ) k , s = 1 , , p , i = 1 , , d
E ( N k ( i ) Y = y ) k = 1 , , p , i = 1 , , d
(3) R-step: Perform a multinomial regression with weights given by E ( B k ( m ) Y = y ) and predict π ˆ k ( m ) for k = 1 , , p and m = 1 , , n .
(4) M-step: Let
t ˆ k s ( i ) = E ( N k s ( i ) Y = y ) E ( Z k ( i ) Y = y ) , k , s = 1 , , p , i = 1 , , d
t ˆ k ( i ) = E ( N k ( i ) Y = y ) E ( Z k ( i ) Y = y ) k = 1 , , p , i = 1 , , d
t ˆ k k ( i ) = s k t ˆ k s ( i ) t ˆ k ( i ) k , s = 1 , , p , i = 1 , , d
Let π ˆ = ( π ˆ 1 ( m ) , , π ˆ p ( m ) ) , T i ˆ = { t ˆ k s ( i ) } k , s = 1 , 2 , , p , and t i ˆ = t ˆ 1 ( i ) t ˆ p ( i ) .
(5) I-step: Compute
β ˆ = argmax β m = 1 n log ( f Y ( y ( m ) ; π ˆ , T ˆ , β , Δ ) )
= argmax β m = 1 n log j = 1 p π ˆ j ( m ) i = 1 d ( e j T exp ( T ˆ i x i ( m ) ) t ˆ i λ i ( y i ( m ) , β i ) ) δ i ( m ) ( e j T exp ( T ˆ i x i ( m ) ) e ) 1 δ i ( m )
(6) Assign π = π ˆ , T i = T i ˆ and β i = β ˆ i then repeat from Step 1 until a stopping rule is satisfied.
Output: Fitted representations ( π ˆ , T ˆ , ˆ ) , for m = 1 , , n .

4 Modeling joint excess lifetimes of couples

In this section, we present an application of Algorithm 1 to the well-known data set of joint lives used in [15]. In addition to survival times, this data set provides information on individuals’ ages at issue of an insurance policy. This leads to left-truncated data in this particular case, and one might indeed consider the different entry ages with a left-truncated likelihood in the estimation process, which is, however, quite inefficient. Instead, we propose here to use entry age as a covariate information and use multinomial logistic regression to deal with the different ages at issue. Entry ages were also considered as relevant factors for the dependence modeling in [22]. Initial distribution vectors obtained via regression will then incorporate the fact that an old couple is expected to survive less long than a young one and that the bereavement effect may be different for different age dynamics in a couple. In PH terms, when using an acyclical distribution, the older the couple, the larger the starting probabilities should be for states closer to the absorbing state, so that fewer states will be visited until absorption. Passing through fewer states translates into less time spent in the state-space before exiting, which results in smaller remaining lifetimes.

4.1 Description of the data set

The data set at hand provides information about 14947 insurance products on joint lives, which were observed from December 29, 1988, until December 31, 1993. We consider January 1, 1994, as the right censoring limit. For the purpose of this article, we only consider birthdays, sexes (and potential death dates) of policyholders, given that we are only interested in mortality, i.e., we do not make use of the monetary details of each contract.

After removing same-sex couples and multiple entries (due to several contracts of the same couple), we compute the remaining lifetime that any person lived from the start of the observation period until the right-censoring date 01.01.1994, given that they are at least 40 years old at the issue of the policy. Note that the terms “remaining” and “excess” are used interchangeably in the following. Doing so leads to 8,834 different joint excess survival times, with 155 cases where both individuals died, 1,057 where only one individual died, and 7,622 where neither died. Consequently, less than 2% of joint remaining survival times are fully observed. Concerning the 1057 cases where only one person in the couple died, in 820 of them this was the man, and in the complementary 237 cases the woman died. Hereafter, we refer to the start of the observation period as the “issue of the policy,” although the actual issue date may be later than 29.12.1988.

To prepare the data for the multinomial regression, we construct the covariate matrix A = ( 1 age y 1 age y 2 interact ) , A R n × 4 . The column vector age y 1 is the collection of all ages of men, at issue of a policy, while age y 2 contains all ages of women. We also consider an interaction term in i n t e r a c t , which gathers element-wise multiplication of ages in a couple. Finally, to perform multinomial regressions via neural networks (R package nnet), we divide the data by 100, such that an absorption time of 0.01 given by estimated distributions actually corresponds to 1 year.

4.2 Fitting the mIPH distribution: marginal behavior

We assume that the remaining lifetimes in a couple, after issue of an insurance policy, follow mIPH distributions. We use information previously discussed as covariates to link starting probabilities to ages of individuals in couples, to reflect different aging dynamics in our model.

Let Y i ( m ) = g i ( X i ( m ) ) = log ( β i X i ( m ) + 1 ) / β i be the IPH-distributed marginal remaining lifetimes, with X i ( m ) = ( exp ( β i Y i ( m ) ) 1 ) / β i PH ( π ( m ) , T i ) and β i > 0 , where i = 1 corresponds to men and i = 2 to women. Then, the distribution with covariate information we work with is given by marginal Matrix-Gompertz distributions, and we consider general Coxian sub-intensity matrices T i of dimension p = 10 (compared with [2]). The order p was found by fitting models with various respective choices, until a satisfactory approximation of marginals was reached. Finding the optimal dimension p is still an open problem, and the literature on solving the identifiability issue of PH distributions is quite narrow (see, for instance, [3,14]).

According to the dependence structure defined in (2.1), after both underlying processes { J t ( 1 ) } t 0 and { J t ( 2 ) } t 0 start in the same state they evolve independently until absorption. With a general Coxian sub-intensity matrix, each marginal jump process is only allowed to transit to the next state or directly to the absorption state. This stochastic structure has a nice interpretation in terms of aging. Indeed, we can think of forward transitions as natural aging steps, given that each time the process jumps to a certain state, the time of absorption obtains closer. Moreover, premature exits can be interpreted as deaths due to causes not related to aging. Indeed, given that exit rates are positive in each state, the absorption of a process may be caused by a transition to State p + 1 from a state smaller than p . Finally, granting the underlying processes to start in different states allows heterogeneity of health statuses for individuals of the same age (this philosophy was already underlying the construction in [18], but with the present inhomogeneity, much fewer states are needed to describe the data satisfactorily).

Algorithm 1 is now applied for 1,000 iterations. In principle, this estimation procedure provides n = 8,834 different initial distribution vectors, one for each couple depending on the ages at issue. Figure 1 depicts the age combinations in the data, with a majority of couples having a small age difference at policy issue. For illustration purposes, we depict below the estimated initial distributions for four different age combinations (age man, age woman): (63, 63), (68, 63), (63, 68) and (73, 63), which were chosen arbitrarily, but to represent different aging dynamics at issue of the policy. Let Y c = ( Y 1 c , Y 2 c ) mIPH ( π ˆ c , T ˆ , ˆ ) denote the bivariate distribution of excess lifetimes for these four couples, where c = 1 , 2 , 3 , 4 , T ˆ = { T ˆ 1 , T ˆ 2 } and ˆ = { λ 1 ( , β ˆ 1 ) , λ 2 ( , β ˆ 2 ) } . The common estimated sex-specific sub-intensity matrices are as follows:

(4.1) T ˆ 1 = 0.049 1.7 / 1 0 7 0 0 0 0 0 0 0 0 0 3.662 2.877 0 0 0 0 0 0 0 0 0 1.8 / 1 0 7 1.8 / 1 0 7 0 0 0 0 0 0 0 0 0 1.9 / 1 0 4 1.9 / 1 0 4 0 0 0 0 0 0 0 0 0 0.611 0.611 0 0 0 0 0 0 0 0 0 0.002 0.002 0 0 0 0 0 0 0 0 0 9.778 5.73 0 0 0 0 0 0 0 0 0 0.36 0.225 0 0 0 0 0 0 0 0 0 1.852 1.099 0 0 0 0 0 0 0 0 0 0.023

for men and

(4.2) T ˆ 2 = 0.196 0.196 0 0 0 0 0 0 0 0 0 0.291 0.291 0 0 0 0 0 0 0 0 0 0.763 0.763 0 0 0 0 0 0 0 0 0 2.8 / 1 0 8 2.8 / 1 0 8 0 0 0 0 0 0 0 0 0 0.001 0.001 0 0 0 0 0 0 0 0 0 0.003 0.003 0 0 0 0 0 0 0 0 0 3.182 1.165 0 0 0 0 0 0 0 0 0 0.172 2 / 1 0 7 0 0 0 0 0 0 0 0 0 0.008 2.3 / 1 0 10 0 0 0 0 0 0 0 0 0 3 / 1 0 6

for women. The optimal inhomogeneity parameters are β ˆ 1 = 43.101 and β ˆ 2 = 47.474 , and Table 1 presents the estimated coefficients of the last multinomial regression performed in Algorithm 1. Except for States 8 and 9, all coefficients in γ ˆ are significant. Note that having nonsignificant coefficients for States 8 and 9 does not necessarily indicate that the state space is too large. One has to separate the PH interpretation of the state space from the regression framework. Table 1 is claiming that the covariates used in the regression are not significant in explaining starting probabilities of States 8 and 9. Nevertheless, both states have their role in the description of random variables we are interested in.

Figure 1 
                  Ages at issue of policies, for all couples. 
                        
                           
                           
                              x
                           
                           x
                        
                      and 
                        
                           
                           
                              y
                           
                           y
                        
                      correspond to 
                        
                           
                           
                              
                                 
                                    age
                                 
                                 
                                    
                                       
                                          y
                                       
                                       
                                          1
                                       
                                    
                                 
                              
                           
                           {{\rm{age}}}_{{y}_{1}}
                        
                      and 
                        
                           
                           
                              
                                 
                                    age
                                 
                                 
                                    
                                       
                                          y
                                       
                                       
                                          1
                                       
                                    
                                 
                              
                           
                           {{\rm{age}}}_{{y}_{1}}
                        
                     , respectively.
Figure 1

Ages at issue of policies, for all couples. x and y correspond to age y 1 and age y 1 , respectively.

Table 1

Coefficients of multinomial regression with associated standard errors in brackets

p Intercept age y 1 age y 2 age y 1 age y 2
2 20.96 3 43.73 3 43.02 1 84.04 9
(7.157) (11.238) (12.400) (19.311)
3 24.82 6 24.63 0 39.25 6 38.45 3
(6.677) (10.476) (11.639) (18.073)
4 51.03 6 57.44 2 90.06 2 104.23 3
(9.894) (15.348) (15.820) (24.374)
5 42.46 9 56.80 4 70.27 3 89.55 6
(6.873) (10.638) (11.443) (17.559)
6 14.85 0 * 41.37 7 39.43 8 89.68 7
(8.004) (12.150) (12.824) (19.317)
7 54.15 7 97.61 8 98.44 5 173.55 3
(6.617) (10.263) (11.010) (16.817)
8 14.363 5.608 22.990 8.794
(9.419) (14.387) (14.885) (22.584)
9 11.589 12.732 4.892 18.559
(7.224) (11.046) (11.889) (18.050)
10 21.47 4 31.06 8 54.90 7 84.08 0
(6.504) (10.054) (10.892) (16.670)

For a two sided statistical test symbols , and correspond to significance levels of 1, 5, and 10%, respectively.

Initial distribution vector estimates for these four couples are as follows:

π ˆ 1 = 0.0526 0.0734 0.0448 0.0886 0.4065 0.0330 0.0326 0.0569 0.1077 0.1039 , π ˆ 2 = 0.0356 0.0313 0.0297 0.0398 0.2805 0.0476 0.0396 0.0384 0.2472 0.2102 , π ˆ 3 = 0.0285 0.0242 0.0114 0.1625 0.4399 0.0419 0.0304 0.1282 0.0819 0.0510 , π ˆ 4 = 0.0172 0.0095 0.0140 0.0127 0.1378 0.0489 0.0343 0.0184 0.4041 0.3030 .

At first glance, it might seem odd that survival times of spouses with large age difference start in the same state of the distribution, but personalized starting probabilities and sex-specific transition intensities account for this. For example, consider Couple 4. If both excess survival times start in State 1, we see that the man’s underlying jump process is much more likely to reach the absorbing state directly from State 1, while the woman’s will at least advance to State 7 before absorption becomes possible. Thus, marginal intensities mixed with age-dependent initial distribution vectors compensate the initialization in a shared state. In Figure 2, we depict the resulting marginal densities of the fitted remaining lifetime distributions for each of the four couples. As expected, the densities for women (solid black lines) allocate more mass to larger values than the male counterpart, and for older individuals, there is more probability mass for shorter survival times. Comparing Couples 1 and 4, we see that the density of the 63-year-old man has a major mode at y 1 = 22 , while for the 73-year-old man, the major mode is at y 1 = 17 . Note that despite the fact that the women in Couples 1 and 4 are of comparable age, their densities have substantially different modes. This is due to the fact that the estimation of the marginal distributions is not separated from the estimation of the joint distributions, and the age of their spouse at the time of policy issue evidently plays an important role for the distribution of the remaining lifetime, seen at the time of policy issue.

Figure 2 
                  Marginal densities of remaining life times (in years). (a) Couple 1. (b) Couple 2. (c) Couple 3. (d) Couple 4.
Figure 2

Marginal densities of remaining life times (in years). (a) Couple 1. (b) Couple 2. (c) Couple 3. (d) Couple 4.

The multimodality we observe in all marginal densities may be due to having different cohorts in the data. We decided not to manipulate the data set further, to avoid restricting our analysis to specific couples. Although spouses with small age difference can be thought of as belonging to the same cohort, we would also need to restrain the ages at issue to instances where enough data points are available for a meaningful estimation procedure (in particular uncensored data points). Doing so would lead to analyzing only couples where both spouses are aged around 65 years.

To assess the precision of our estimated marginals Y i c IPH ( π ˆ , T ˆ i , β ˆ i ) , i = 1 , 2 and c = 1 , 2 , 3 , 4 , we use the conditional Kaplan–Meier (K–M in the following) estimator, also known as Beran estimator. For a sample X 1 , X 2 , , X n and covariate matrix A , the conditional K–M estimator we use is

(4.3) P ˆ ( X t A = a ) = F ˆ X ( t a ) = 1 i : X i : n t 1 δ i : n K a A i : n b n j = i n K a A j : n b n ,

where X i : n are order statistics of the sample, δ i : n is the respective right-censoring indicators, K ( ) is a kernel function, and b n is a band sequence. In our instance, the kernel function is a multivariate Gaussian density and b n = 0.001 . For more details on the conditional K–M estimator, we refer the reader to the study by Dabrowska [12]. Figures 3 and 4 compare the survival probabilities obtained by the conditional K–M estimators with the one of the fitted distributions. One sees that in all cases the fit is in fact quite satisfactory.

Figure 3 
                  Conditional K-M estimators vs the fitted distribution. (a) Couple 1. (b) Couple 2.
Figure 3

Conditional K-M estimators vs the fitted distribution. (a) Couple 1. (b) Couple 2.

Figure 4 
                  Conditional K-M estimators vs the fitted distribution. (a) Couple 3. (b) Couple 4.
Figure 4

Conditional K-M estimators vs the fitted distribution. (a) Couple 3. (b) Couple 4.

4.3 Joint behavior

Let us now consider the resulting bivariate densities. In Figure 5, we find the bivariate densities for the four specified couples. The joint density of Couple 1 shows that large differences in their survival times are quite unlikely (the majority of the joint mass being located close the identity line). Also, the most likely survival times are close to 23 years. There is also a considerable probability mass above the identity line, where the woman survives longer than the man. For instance, we have F ¯ Y 1 ( 12 , 30 ) = 32 % , while F ¯ Y 1 ( 30 , 12 ) = 11.79 % .

Figure 5 
                  Contour plots of the bivariate lifetime densities for the four couples. (a) Couple 1, (b) Couple 2, (c) Couple 3, and (d) Couple 4.
Figure 5

Contour plots of the bivariate lifetime densities for the four couples. (a) Couple 1, (b) Couple 2, (c) Couple 3, and (d) Couple 4.

For Couple 2, the situation is different (top right panel of Figure 5): with the man being already 68 years, and the woman being 63 years old, the remaining lifetimes are shorter, with the major mode of the joint distribution being located near ( y 1 , y 2 ) = ( 23 , 24 ) , the second largest close to ( y 1 , y 2 ) = ( 17 , 35 ) , and the next ones in the neighborhood of ( y 1 , y 2 ) = ( 17 , 19 ) and ( y 1 , y 2 ) = ( 9 , 19 ) , respectively. There is now a much higher probability for the husband to die sooner.

The joint density for Couple 3 resembles the one of Couple 1, with survival times beyond 40 years being more unlikely. Despite the man being the same age as the man in Couple 1, there is a larger probability for the couple to have survival times close to ( y 1 , y 2 ) = ( 29 , 45 ) than for Couple 1.

Finally, the joint density of Couple 4 is close to the one of Couple 2. The spike around ( y 1 , y 2 ) = ( 17 , 35 ) is more pronounced than the analog of Couple 2, while the spike close to ( y 1 , y 2 ) = ( 23 , 24 ) is less important than its counterpart in Couple 2.

One may be tempted to conclude from these four distributions that for the same age, having a younger partner leads to longer survival times, which would signal that the bereavement effect is weaker when spouses have