Estimation of Semi-Markov Multi-state Models: A Comparison of the Sojourn Times and Transition Intensities Approaches

Semi-Markov models are widely used for survival analysis and reliability analysis. In general, there are two competing parameterizations and each entails its own interpretation and inference properties. On the one hand, a semi-Markov process can be defined based on the distribution of sojourn times, often via hazard rates, together with transition probabilities of an embedded Markov chain. On the other hand, intensity transition functions may be used, often referred to as the hazard rates of the semi-Markov process. We summarize and contrast these two parameterizations both from a probabilistic and an inference perspective, and we highlight relationships between the two approaches. In general, the intensity transition based approach allows the likelihood to be split into likelihoods of two-state models having fewer parameters, allowing efficient computation and usage of many survival analysis tools. Nevertheless, in certain cases the sojourn time based approach is natural. In contrasting the two approaches and contemporary relevant R packages used for inference, we use two real datasets highlighting the probabilistic and inference properties of each approach. This analysis is accompanied by an R vignette.


Introduction
In biostatistics, many models for survival and reliability analysis are two-state stochastic processes which lead to a particular event such as death, or the outcome of a particular drug treatment. However, applying a multi-state stochastic process allows the modeller to provide a richer and more accurate model by adding more details. These details allow to capture alternative paths to the event of interest, specify all the intermediate events, and also allow to understand partial failure modes in a progressive disease.
In our context, a multi-state stochastic process is a process X(t) for t ≥ 0, where X(t) can take a finite number of values 1, 2, . . . , p. This process can be considered as a family of random variables X(t) indexed by t. The quantities of interest are often the probability of being in a state at a given time and the distribution of first passage times (the time until the process reaches a given state for the first time from a particular starting state).
In some applications of multi-state stochastic processes, the dependence on the history of the process is negligible. Therefore, for the sake of mathematical tractability, assuming the Markov property (where future transitions between states depend only upon the current state) is convenient. For instance, continuous time Markov chains, which we refer to here as Markov processes, are widely used in modelling the movements of patients between units of a hospital or between stages of a disease, see for instance [13,44], or in the prediction of the incidence rate of diseases, see [1].
However, in certain cases, the Markov assumption is unrealistic. For instance, in the study of human sleep stages, the sleep stages usually do not follow an exponential distribution (constant hazard rate), but can rather have other forms such as a Weibull distribution, see [49]. Further, some aspects of systems' behavior can not be captured by Markov processes. For instance, the risk of chronic diseases such as AIDS essentially depends on the time since infection, see [26]. For these cases, applying the class of semi-Markov processes (SMP), as an extension of Markov processes, is fruitful. Here, future probability transitions depend on the sojourn time (the time spent in the current state), and the clock of each sojourn time is reset to zero after each transition into a new state. SMPs have a variety of applications in healthcare. For instance, for predicting a disease progression [21], health care manpower supply prediction [48], and recovery progress of patients with a specific type of disease [37].
For biomedical applications, especially those concerned with characterizing an individual's progression through various stages of a disease, a three-state semi-Markov process known as the illness-death model is very popular (see for instance [26,9]). The illness-death model may also be applied for modelling the trajectory of patients in intensive care units (ICUs) of hospitals (see [33,14]).
Here, our main focus is on the statistical methodology of semi-Markov processes and more precisely on parametric models of SMPs. We compare and contrast two approaches for defining SMPs which we denote via, I -sojourn times, and II -intensity transition functions.
Approach I is based on the specification of the sojourn time distribution, together with a transition probability matrix of a discrete time Markov chain, which we call the embedded chain. Approach II, is based on intensity transition functions which when multiplied by an infinitesimal quantity, specify the instantaneous probability of transition between states. Note that in the literature, these are sometimes also called hazard rate functions of the SMP, however they should not be confused with hazard functions of probability distributions (such as for example the sojourn time distributions).
While from a probabilistic perspective, both Approach I and Approach II are equivalent ways for describing an SMP, from a statistical perspective there are differences. In this paper, we highlight that when it comes to parameter estimation, Approach II has significant advantages over Approach I. Intrinsically this is because Approach II can be expressed by using fewer parameters. Further, we can show that the likelihood function of Approach II can be written as the product of likelihoods of two-state models. This is very helpful for reducing the computational effort for likelihood-based parameter estimation. Nevertheless, depending on the application at hand, using either approach may be useful. We highlight the associated tradeoffs in the paper.
The remainder of this paper is structured as follows. In Section 2, we introduce semi-Markov processes (SMP) via both Approach I and Approach II. We also present the probabilistic relationships between the approaches. We continue with Section 3 focusing on inference and specify the likelihood function for both approaches. In Section 4, we illustrate inference on two available datasets and present how to implement both approaches (for both datasets) using several contemporary R software packages. For both datasets, we compare results of fully parametric models based on both approaches and highlight the implications of each modelling choice. Section 5 presents some concluding remarks. All the numerical results (tables and figures) from the paper are freely reproducible via R code in a comprehensive detailed vignette available in [32].

Semi-Markov Processes
A multi-state model is a continuous time stochastic process with values in a discrete set that is often applied for longitudinal medical studies, where the patients may experience several events and their related information is collected over time. The complexity of a model greatly depends on the number of states and also on the possible transitions. Figure 1 demonstrates a general multi-state model. See [3] as a classical reference for multi-state models. A specific class of examples of multi-state models is the class of Markov processes where the state evolution jumps between levels and the process adheres to the Markov property. However, in many real-world applications, we need a stochastic process that exhibits dependence between jump times. For instance, in biostatistics, where the trajectory of patients in a hospital is considered, using a Markov process as a multi-state model imposes a stringent limitation on the sojourn time distribution in each state. This is because a key consequence of using Markov processes is the fact that state sojourn times follow the exponential distribution and is "nonageing" with a constant hazard rate. In contrast, in Semi-Markov processes this assumption can be relaxed. Hence with Markov processes one is often lacking the desired degrees of freedom for modeling the dependence between jump times.
Extending the modelling from Markov processes to the class of SMPs removes the restriction of memoryless (exponential) state sojourn times and at the same time, preserves the usefulness of treating the data as jump processes in continuous time. In fact, for semi-Markov processes we consider a relaxation of the Markov property for sojourn times and only the embedded chain of states is required to follow the Markov property. For this reason, SMPs are applied for modelling a variety of phenomena in different areas such as economics, reliability, and health care, see [25].
To define an SMP, consider a homogeneous Markov chain {J n } n≥0 on states {1, 2, . . . , } where the probability of n-th (n ≥ 1) jump from state i to state j for i = j is p ij . That is, Based on the directed graph associated with these probabilities (where arc i → j exists only if p ij > 0) states can be classified as either transient or recurrent of which absorbing is a special case.
Denote the increasing sequence of jump times by T 0 = 0 < T 1 < T 2 < T 3 < . . . and assume that N (t) = max {n : T n ≤ t} for t ≥ 0. This is a count of the number of transitions up to time t. The stochastic process X(t) := J N (t) is said to be a semi-Markov process (SMP) if whenever the process enters state i, the next state is j with probability p ij and given that the next state to be entered is j, the time until the transition from i to j is a random variable with cumulative distribution function F ij (t): where τ n = T n − T n−1 , see Chapter 4 of [40]. Hence in general, SMPs are not Markov processes as they do not posses the Markov property. Further, a semi-Markov process allows arbitrarily distributed sojourn times in any state but retains the Markov property for the embedded (discrete time) Markov chain, {J n } n≥0 . In many applications of SMPs in healthcare, a very popular three state semi-Markov process known as the illness-death model is applied, see for instance [37].

Illness-death model
The illness-death model is the most common model in epidemiology (see Figure 2), where it is often applied in studying chronic diseases. In this model, we have three states "Health", "Illness' and "Death', denoted by 1, 2 and 3, respectively. There are three kinds of transitions: 1 → 2, 2 → 3 and 1 → 3 and state 3 is absorbing. Since this model is often used to describe severe illnesses, there is no possibility of recovery, and therefore the model is irreversible 1 . In cases where treatments may yield remission, it is more appropriate to construct a model with an additional state "Remission" rather than to consider that there is a possibility of moving back to the "Healthy" state. Here, we consider the general multi-state model with all possible transitions as illustrated in Figure 1 which includes both illness-death model and its reversible version (see Figure 3). We now arrive at our key point dealing with SMPs. Here we contrast the two approaches which we denote via, I -sojourn times, and II -intensity transition functions. Approach I was already used to define SMPs above. We spell out further details of Approach I and then continue to introduce Approach II.

Approach I: Sojourn Times
As defined above, an SMP, X(t), can be constructed via the sequence {(J n , T n )} n≥0 , of states and jump times respectively. The underlying parameters of this construction involve the transition probabilities of the embedded chain, p ij presented in (1), as well as the distributions of the sojourn times for each transition i → j, presented in (2).
The transition probabilities of the embedded chain are often organized in a stochastic matrix, P = [p ij ] (where we set p ii = 0). Further, assuming the sojourn time distributions are continuous, they are often represented in different forms including the probability density function, the survival function or the hazard rate function. We now present the details.
The probability density function of the sojourn time is The corresponding survival function is (Note that S ij (t) is a decreasing function, that is S ij (0) = 1 and lim t→+∞ S ij (t) = 0). The hazard function which is often thought of as the probability that a jump occurs in a specified interval (t, t + t) given no jump before time t, is Here, note that by definition of conditional probability we have It is also useful to define the probability of staying in a current state i for at least t time units. We call this the survival function of the waiting/holding time in state i, and denote it via

Approach II, Intensity Transition Functions
The first approach required specification of the parameters using two types of objects, transition probabilities of the embedded Markov chain, and the distribution of sojourn times given a transition i → j. The second approach which we present now is more succinct in that it only requires one type of object: intensity transition functions. These functions are defined viaα and are similar in nature to hazard rates. However, they should not be treated as hazard rate functions of a probability distribution. They rather indicate the instantaneous probability of making a transition from state i to state j after spending t time units in state i since the last transition. However summing up over all target states j, we do obtain a hazard rate of a probability distribution which we denote viaα i (t) = j =iα ij (t). This is the hazard rate of the waiting/holding time in state i. We can use this approach to obtain an alternative expression to S i (t) of (7).
Using Approach II, more formally, the meaning of the intensity transition functions is where using the notation defined previously, τ N (t)+1 is the time elapsed since the last transition, measured at time t. The conditional probability in (10) is conditional on the fact that the last transition was at time t into state i and up to time t + u there have not been any further transitions. Given this condition, it indicates the instantaneous probability of: (i) making a transition at time t + u. (ii) Making the transition into state j.
It can be shown that specification of the intensity transition functionsα ij (·) completely specify the probability law of an SMP. See [11] for a formal description where one can also use the notation H(t − ) to define the history of the process just before time t, i.e. up to t − . This can formally be defined as the sigma algebra in a filteration associated with the stochastic process. Related notation often used in the literature is the (conditional on history) transition probability to be in state j at time t + u, after being in state i in time t. This is often denoted

Relations Between the Two Approaches
We now wish to show how for the same SMP, one can use either the parameters of Approach I or the parameters of Approach II and interchange between them. A key relationship is the followingα It is established using the conditional probabilities, defined in (1), (4), (7), and (5). This key relationship also yields an interpretation of the cumulative incidence function for the i → j transition which we denote via CIF ij (·). This is a common measure used in the field of competing risks, see for example [27], that determines the probability of transitioning into j before or at time t. By rearranging and integrating both sides of (11) we obtain the following representation of the cumulative incidence function.
We can now convert between the parameterizations of both approaches as follows.
This follows directly from (11). Approach II → Approach I. Givenα ij (·) obtain p ij and α ij (·): First we have, This can also be obtained from (12) by taking t → ∞. Once p ij values are at hand we again use (11) and isolate f ij (t) to obtain, from which we can obtain α ij (t) using (6) in the standard manner.

Examples
We now present three examples illustrating the relationship between the two approaches.
It can be parameterized in several ways, two of which are reminiscent of Approach I, and Approach II above: • CTMC Approach I: Define a × stochastic transition probability matrix P (non-negative entries and rows sum to 1) with 0 entries in the diagonal. Denote the entries via P ij . This is often called the transition probability matrix of the embedded discrete time Markov chain. Then define a vector of rates of length ,˘, where λ −1 i denotes the mean holding time of state i. Then from the theory of CTMCs, see for example [11], the process evolves as follows. If at time t the process is in state i, an exponential random variable with parameter λ i is generated to determine a holding duration. After that duration passes a transition from state i to state j occurs with probability P ij . The durations and the transition choice are independent. The process then repeats. It is now evident that such a description of a CTMC is a special case of Approach I for SMPs, where Observe that α ij (t) is independent of t and independent of the target state j.
• CTMC Approach II: Define a × generator matrix Q (non-negative entries on the offdiagonal, non-positive entries on the diagonal, and rows sum to 0). Denote the entries via q ij . Then treat the process as an SMP using Approach II with Observe that α ij (t) is independent of t.
It is well known from the theory of CTMCs the two parameterizations interplay. CTMC Approach I → CTMC Approach II: CTMC Approach II → CTMC Approach I: We now verify that these transformations directly agree with the relationship between Approach I of the SMP and Approach II of the SMP. Using (13) and (16) we have, Going in the other direction using (14) and remembering that since Q is a generator matrix, Now using (15) we have, This is an exponential density and hence it has a constant hazard rate α ij (t) = λ i .
Example 2: Exponential Sojourn Times The example above illustrates that an SMP parameterized via Approach I where α ij (t) is constant in time and also independent of j yields constant (Approach II) transition intensity functions,α ij (t). That is, constant (Approach I) hazard rates yield constant transition intensity functions, but under the condition that α ij (t) is the same for all target states j.
We now show that this condition is necessary for having constant transition intensity functions. To see this, use (13) with α ij (t) = λ ij where for some j 1 and j 2 , Since λ ij 1 = λ ij 2 we are not able to cancel out the exponent as was done in (18). Interestingly, the converse does not hold. The previous example showed that time independent transition intensity functions always yield a CTMC and hence constant (Approach I) hazard rates.

Example 3: Weibull Distributions
The Weibull continuous univariate distribution of a non-negative random variable is parameterized by a shape parameter η > 0 and a scale parameter µ > 0. It has density, survival function, and hazard rate functions respectively given via, It is appealing in survival analysis and reliability analysis because if η < 1 the hazard rate is monotonically decreasing; if η = 1 the hazard rate is constant (an exponential distribution) and if η > 1 the hazard rate is monotonically increasing. Say now that we are using Approach I and parameterize all of the sojourn time distributions as Weibull distributions, where for transition i → j we have respective scale and shape parameters µ ij and η ij . Now, if we were to consider the Approach II representation, then from (13) the transition intensity functions arẽ where we use the notation K ij (t) to represent all of the components ofα ij (t) excluding t η ij −1 . Now the form of K ij (t) can determine if the Approach II representation is of a Weibull type or not. A Weibull type representation will follow if K ij (t) is independent of t, otherwise it is not.
We now see that if for a fixed i and every j 1 and j 2 with j 1 = j 2 we have that then, K ij (t) is independent of t and we obtain a Weibull type Approach II representation with, and hence the scale parameter is modified to p −1/η ij ij µ ij and the shape parameter keeps the same form. Otherwise (if there exists j 1 and j 2 such that (21) does not hold), thenα ij (t) cannot be of the Weibull type as in (22).
Going the other way, assume we are using Approach II with Weibull type transition intensity functions,α Then by integrating (14) we can in principle obtain p ij . It turns out that if for a given state i, the shape parameters of all the transition intensity functions is the same (denote it via η i ) then the integration yields an explicit form, otherwise, there is not an explicit solution. In such a case where η ij = η i for all j, we also have from equation (15), Further, if µ ij = µ i for all j then p ij of (23) reduces to −1 i where i < is the number of possible target states from state i. Note that this form is not specific to the Weibull type intensity transition function, but will hold whenever the intensity transition functions out of state i are the same.
However, with the Weibull case we can continue further and (24) can be represented as, which denotes a Weibull distribution with same shape parameter as the transition intensity function and a scale parameter equal to

Likelihood and Inference
Here we focus on inference for the fully-parametric semi-Markov model defined by the two above mentioned approaches. Survival analysis usually applies for cohort or clinical studies, hence the data is usually gathered from the same subjects repeatedly during a time interval [0, T ]. A key issue in survival analysis and event history analysis is the occurrence of incomplete or sparse observations. For instance, in the case of chronic diseases, when the event of study is death, time of occurrence of this event is not observed for the subjects still alive at time T . This type of incomplete observation is called right-censoring. There are other kinds of incomplete data like left-censoring and interval censoring, see [4,15]. [6] present different incomplete data forms which can be handled in multi-state frameworks. Our focus in this paper is right censored data corresponding to when at the end of the observation period, not all individuals under study have reached an absorbing state.
As our focus is on the fully-parametric case we now present the likelihood functions for both Approach I and Approach II. We assume there are n subjects denoted via h = 1, . . . , n, for which data is collected and we assume independence between subjects. For each subject, we distinguish between two cases depending on if the subject is in an absorbing state at time T or not. This is recorded via δ h where δ (h) = 1 implies no right censoring (subject is in an absorbing state at time T ), and δ (h) = 0 implies right censoring. Further, we record the state evolution denoted via the sequence {J} and the sojourn times denoted via the sequence {τ }. For simplicity, we assume that all subjects start at the same fixed and known state, denoted via J 0 . This can be easily generalized.
For subject h during [0, T ], we use N (h) to denote the number of state transitions up to time T . The data of the subject is represented via Note that if δ (h) = 0 and there is right censoring then we are also interested in the time duration U (h) after the last state is visited. This quantity is represented via, As our subjects are independent we can consider the likelihood contribution of each subject h in isolation, and after denoting it via L (h) , the full likelihood is The specific form of L (h) now depends on if we are using Approach I or Approach II and as we show below, in Approach II, it can further be decomposed as in (30).

Likelihood for Approach I
The form of L (h) based on H (h) is, where for brevity we omit the (h) superscripts for the state and sojourn time sequences {J} and {τ }. To further understand (27) consider a recursive construction where each transition without censoring based on J k−1 → J k with a sojourn time of τ k has a likelihood contribution Further, in case of censoring the last censored transition has likelihood contribution S J N (h) (U (h) ) based on the survival function of the holding time (7), corresponding to the last observed state J N (h) .

Likelihood for Approach II
We refer to the key relationship (11) and replace p ij f ij (u) in (27) withα ij (u)S i (u) to obtain, where for brevity we omit all superscripts (h). Further, we manipulate the likelihood contribution of each subject as follows: In the first line we use the representation in (9). For the second line, by increasing the summation from N to N + 1 we are able to remove δ by defining τ N +1 = U , J N +1 ≡ −1, and using the fact thatα ij (u) ≡ 0 in cases where i is an absorbing state and 0 0 ≡ 1. Now, we are able to separate L (h) into the form following similar ideas as those initially presented in [23]. To achieve such a separation, it is useful to define the indicators We may now manipulate (29) to obtain The form of (26), (30), and (31) allows us to treat each transition separately as if each transition intensity transition functionα ij (·) has its own set of parameters.

Parametric forms and Covariate Information
In carrying out inference, we assume a parametric form for α ij (·) in Approach I orα ij (·) in Approach II. We also allow for covariate information where the natural way to introduce covariates in a multi-state model is a Cox-like proportional hazard model which can be defined by using either the hazard function of the sojourn times or by intensity transitions, see [35]. Hence for a vector of covariates Z, we have for Approach I. Further we haveα for Approach II.
Here α ij,0 (·) are the baseline functions for the hazard rate and transition intensity functions respectively. They each follow a parametric form, determined via θ ij andθ ij respectively. Further, β ij andβ ij are the regression parameters for transition i → j associated with the covariates of the subject. Observe that there are major differences in the interpretation of the regression coefficients β ij of Approach I andβ ij of Approach II. The former, determine the hazard ratio dealing with sojourn times and bear no effect on the actual transitions of the SMP. The latter, determine the hazard ratios dealing with risks and also affect the instantaneous rate of transitioning.
When estimating the parameters for such a model, with Approach I, we also need to estimate p ij , whereas with Approach II this is not needed as p ij is implicitly determined via (14). Interestingly, via Approach II, we have that the covariates implicitly affect p ij ,whereas with Approach I, we may wish to set p ij (Z) using multinomial regression, however it isn't common practice.

Contrasting Inference for Approach I and Approach II
The key difference in the likelihood expressions between Approach I (27) and Approach II (30) and (31) is in computational tractability. As noted by [28] the optimization of the likelihood for Approach I could be challenging when the number of covariates and parameters is large due to a complex form of the sojourn time distribution. This is especially the case when the number of variables is large compared to the size of the data. It should further be noted that modeling using Approach I requires the embedded chain parameters in addition to the parameters involved in (32). This means that for a model with states, there may be up to 2 − more parameters to estimate when using Approach I, in comparison to Approach II, where all the parameters appear in (33). With Approach II, the decoupling in (30) allows to optimize the likelihood for each transition i → j separately if each hazard intensity transition function (33) has its own set of parameters θ ij andβ ij . In such a framework, the full likelihood of the SMP can be maximized separately by maximizing the likelihood of the different separate two-state models, see [35]. Also, the inference of the SMP can be easily handled by using survival models like the Cox model as the separate likelihood arising in survival analysis.
Hence Approach II enables fitting SMP models using survival analysis methods and software, see for example [46,17,24]. This also applies to the case where extensions to the proportional hazard assumption are needed. The forms in (32) and (33) are clearly based on a proportional hazard assumption with fixed covariates where there is only a multiplicative effect on the baseline hazard functions independent of time. An extension is to consider time-dependent covariates Z(t) as in [5,6]. Further, in both proportional models the effect of variables is assumed to have a linear (or log-linear) functional form. A more flexible model is to consider non-linear effects using smooth functions of the covariates such as the Generalized Additive Model (GAM) presented by [22]. With such extensions, using Approach II is much more straightforward than using Approach I due to the decoupling and because there is ample statistical software available for survival analysis.
One may also wish to incorporate random effects in the models using frailty models. In this case, both (32) and (33) could also include random effect to take into account the correlation between observations/subjects. This has been handled in [33] for Approach II by exploiting the decoupling of the likelihood. Such random effects may also be incorporated in Approach I, however the computational effort will be greater.

Semi-Markov Application in Practice
In this section, we illustrate the inference of the semi-Markov model on two real datasets. For both datasets, we compare results of full parametric models based on Approach I where we estimate sojourn time distributions and transition probabilities of the embedded chain, and based on Approach II where we estimate transition intensity functions. All the numerical results including tables and figures, are freely reproducible through R codes in a comprehensible detail vignette available in [32].
To carry out estimation using Approach I we use the semiMarkov package, see [29]. This package implements inference using the likelihood (26) with (27) while allowing to use different parametric forms for the sojourn time distribution including exponential, Weibull, and exponentiated Weibull, see [20]. It also supports covariates with standard inference for the covariate coefficients β ij , including the Wald test. It allows the possibility to include different covariates for each transition. While the focus of the inference with this package is Approach I, as output, the package can automatically provide the intensity transition functionsα ij (·), referred to as the "hazard function of the SMP".
For Approach II, the decoupling in (30) with (31) allows us to use any package that implements inference for survival analysis since for each transition i → j we can maximize, separately if each hazard intensity transition function (33) is assumed to have its own set of parametersθ ij andβ ij . For this we can use different R packages including flexsurv ( [24]), survival ( [45]), and eha ( [12]). In the examples here, we mainly use flexsurv which offers an easy way to run different parametric forms forα ij (·) including exponential, Weibull, gamma, and generalized gamma (see [38]) among many other forms. It also supports covariates with standard inference for the covariate coefficientsβ ij .
We note that the purpose of this section is not to be an extensive survey of all of the R packages that can be used for inference in multi-state models. For a comprehensive software survey, see the survival task view from R ( [7]). Our focus here is to illustrate how the aforementioned packages can be used to easily carry out inference, while we also illustrate a few key points.

Progressive Three-State Model for the Stanford Heart Transplant data
As an illustrative example, we revisit the analysis of the Stanford Heart Transplant data, freely available through the survival package. A full description of the data can be found in [18]. The dataset presents the survival of patients on the waiting list for the Stanford heart transplant program. We analyze the data similarly to [36] which use their R package p3state and propose an illness-death model similar to Figure 2. Their states represent "alive without transplant", "alive with transplant", and "dead".
We consider the same dataset, where patients' records from October 1967 (start of the program) until April 1974 includes the information of 103 patients where 69 received heart transplants, and 75 died. Three fixed covariates for each patient are available. These include the age at acceptance (age), the year of acceptance (year), and previous surgery (surgery: coded as 1 = yes; 0 = no).
We first use the data without taking into account any covariate effects to show that an exponential distribution on the baseline hazard rate of the sojourn time, α ij (·) does not produce constant intensity transition functions,α ij (·). This point was shown theoretically in Example 2, of Section 2.4 and we now illustrate it using data and the semiMarkov package. Using the package we obtain both α ij (·) andα ij (·) for (i, j) = (1, 2) and (i, j) = (1, 3). The results are in Figure 4 where we see that while α ij (·) are constant,α ij (·) are clearly not constant across time. These curves in fact follow a form like (19). We now continue to analyze the data fully with Approach I using the semiMarkov package where we compare different parametric forms and covariates in the model. Table 1 presents the inference results of 6 different models investigated depending on the distribution and covariates chosen for each transition. For each model we present theβ ij estimates and their estimated standard error. We also point out that the semiMarkov package yields p-values for the associated Wald tests. We also present the estimates of the transition probabilities of the embedded chain,p ij . It is evident that these are outputs from likelihood estimation as they don't fully agree across models (they are not proportion estimates). For the final two models, Weibull + Select and Weibull/Exp + Select are based on model selection of the significant covariates with more details in the vignette, see [32]. Observe that the final model, Weibull/Exp + Select, incorporates two different distributions. Table 1: Estimation of the different semi-Markov process modelled by specifying proportional hazard models on the sojourn times -Stanford Heart Transplant data-Approach I

Surgery
Age For choosing between the different proposed models, we use the Expected Kullbak-Leibler (EKL) risk as in [33]. This is done via the Akaike Information Criterion (AIC) which is an estimate of the EKL risk in a parametric framework using maximum likelihood method, see [31,16]. For each of the 6 models described above, we obtain the AIC using output from semiMarkov and present it in the first part of Table 2.
The best model according to the AIC is given by the Exponentiated Weibull model. However, due to convergence issues during the optimization (see the vignette [32]), we select the second best result given by the Weibull/Exp + Select model as the most appropriate model for this dataset according to AIC using Approach I.
Moving onto inference using Approach II, we investigate and estimate different models by specifying different proportional intensity transition models as defined in equation (33). We use flexsurv to estimateα ij (·) for several models which we present in the second part of Table 2. The models Exponential, Weibull, Gamma, and Generalized Gamma are all estimated with all three covariates in the model. The additional two models, Weibull+Select, and Generalized Gamma + Select, have a reduced set of covariates with a procedure consistent with that used for Approach I above. More details are in the vignette (see [32]). The AIC based performance of each of these models is presented in the second part of Table 2.
The best model according to the AIC is Generalized Gamma + select. In this model only the age covariate is the used for transition 1 → 2, only the year covariate is used for transition 1 → 3, and both the surgery and age covariates are used for transition 2 → 3. For this dataset, the best model based on Approach I presents a better AIC value (1639.73) in comparison to the best model based on Approach II (1701.03).
As discussed in Section 3.3 the regression coefficients of Approach I and those of Approach II do not have the same interpretation. Indeed for the hazard rate of the sojourn time, the regression coefficients can be interpreted in terms of relative risk on the waiting time (i.e. given an i → j transition). While the hazard rate of the SMP based on transition intensity functions can be interpreted as the subject's risk of passing from state i to state j.
Since a key object of a Semi-Markov process is the transition intensity function, generally called the hazard rate of the SMP, it is also useful to visualize estimates of these functions based on both Approach I and Approach II. Further insight may be gained by plotting these against non-parametric estimators. For this, we used the Breslow estimator [10], which yields an estimate for the baseline cumulative risk function of a cox-regression type model of each transition. Figure 5 plots the cumulative baseline (all covariates are set to 0) estimated transition intensity functions, estimated via Approach I and Approach II, against the Breslow estimator. These cumulative functions are based on the parameter estimates for the best model of Approach I (Weibull/Exp + Select) vs. the best model of Approach II (Generalized Gamma + Select). In this case, the plots in Figure 5 hint at a better fit (closer to the non-parametric estimation) of the SMP using Approach I in accordance with the results of the AIC values discussed above.

Reversible Semi-Markov Model for the Asthma Control Data
We revisit the analysis of the asthma control data which has been used to illustrate the SemiMarkov R package in [29]. The data consists of the follow-up of severe asthmatic patients consulting at varied times according to their perceived needs. At each visit, several covariates were recorded and asthma was evaluated. The available data (asthma) from the SemiMarkov package presents a random selection of the 371 patients with at least two visits. In this illustration, we only use the BMI (Body Mass Index) covariate coded: 1 if BMI ≥ 25 and 0 otherwise. Similarly to the analysis in [29], we use a reversible three-state model as presented in Figure 6. This investigates the evolution of asthma based on, "optimal control" (State 1), "sub-optimal control" (State 2) and "unacceptable control" (State 3). More details about concepts of control scores can be found in [42]. Observe that in this model, none of the states are absorbing and hence the model is said to be reversible.
For illustration purposes, we concentrate on a Weibull model as proposed in [29]. We perform a first full parametric proportional Weibull model for the sojourn times (Approach I) with the BMI covariate for each transition. This yields significant results of BMI covariate for the transitions 1 → 3 and 3 → 1. Then, we decide to run a sparse model including BMI only for these transitions to facilitate the convergence of the model which can be difficult for a model with too many parameters as reported in [29]. For this new model, BMI regression coefficients remain significant for transitions 1 → 3 and 3 → 1 with β 13 = −0.88 and β 31 = −0.448, and respective p-values 0.012, and 0.023. The fact that hazard ratio of the sojourn time associated with these covariates is less than unity (estimated coefficients are negative), indicates that BMI ≥ 25 generally lengthens the duration of the sojourn time in state 1 when making a 1 → 3 transition and generally lengthens the duration of the sojourn time in state 3 when making a 3 → 1 transition. This can also be interpreted as a decrease of the risk of leaving "optimal control" state to "unacceptable control" as well as a decrease of the risk of leaving "unacceptable control" state to "optimal control". However, the magnitude of the estimated coefficients cannot be used to evaluate the change in the hazard ratio on the risk (recall equations (32) and (33) and the differences between β ij andβ ij ). We may also visualize the effect of BMI covariate on the associated risks by plotting the hazard rate of the semi-Markov model (intensity transition functions), deduced from the sojourn times. See Figure 7 where we also plot estimated transition intensity functions using Approach II estimation, which we describe now.
Approach II presented in this paper offer a direct way to model the intensity transition functions and has the advantage of a split likelihood which facilitates an efficient optimization procedure. In the vignette available in [32], we present the results of the full Weibull propor- tional semi-Markov model using this Approach. For illustration, we do this in two ways: (i) Optimizing the full likelihood. (ii) Splitting the likelihood for each transition and optimizing individually. As expected, both ways yield the same likelihood estimates. This highlight the advantage of the intensity transition based approach for overcoming potential computational issues in the optimization. Moreover, in some applications, some transitions of interest could then be further explored with any sophisticated survival model of choice as the estimation for each transition is performed separately. See for example [46,41,33,24]. We augment the plots in Figure 7 with the estimated transition intensity functions using Approach II. Note that similarly to Approach I, the BMI effects are in the same direction. However, the interpretation of the regression coefficients is different. For the intensity transition based estimation (Approach II), the BMI regression coefficient estimates areβ 13 = −0.5 and β 31 = −0.67. Here, in contrast to the Approach I estimates, the exact magnitude of the coefficients may be interpreted as the change in the hazard ratios associated with BMI ≥ 25.

Concluding Remarks
We have surveyed the two main approaches for modeling and inference using semi-Markov processes in a parametric setting. These are Approach I based on sojourn time distributions and the embedded Markov chain, and Approach II based on transition intensity functions. Each approach has its advantages and drawbacks when carrying out inference and analysis and these were described in the paper. We further summarized the formulation of these two approaches and showed relations between them, where we focused on inference for each of the approaches.
In general, the intensity transition based approach (Approach II) allows the likelihood to be split when each transition has its own set of parameters. Such separation into two-state models facilitates more efficient optimization as well as usage of additional modeling tools. The large scale impact of such separation on bigger datasets and/or simulated data as part of a simulation study remains to be carried out in future work. However, even putting the computational aspects aside, there is great value in such separation as it allows to use many survival analysis R packages directly within the context of semi-Markov processes.
Such packages include flexsurv [24] which allows to fit various shapes of transition intensity functions including the Royston-Parmar spline model [41]. Further, in a semi-parametric framework the mstate package [19] is popular for running multi-state models including the SMP. By exploiting the separation of the likelihood, mstate provides the estimation of covariate effects using Cox regression models. In addition, non-linear effects of the covariates could be investigated through smooth functions using the mgcv package which enables to run GAM models with survival data, see [50]. Further, with Approach II, we can easily handle random effects using the frailtypack package [39], as has been exploited by [33]. Further, in presence of high dimensional data, elastic-net penalty [43] can be used via the package glmnet to fit Cox regression models. Such benefits of using the intensity transition based approach have previously been reported and exploited for general Markov multi-state models as in [23,6] and [35]. Our purpose here was to survey these benefits and compare to Approach I as is popularly used with the semiMarkov package providing a flexible tool for inference of semi-Markov models based on sojourn time hazard rates in a parametric framework [29].
Of further interest, we highlight the fact that PH (Phase Type) distributions may fit naturally in an SMP framework. Such approaches have been previously proposed in [30], [34], [47] and [2]. Interestingly, as PH distributions provide a dense semi-parameteric approximation of any distribution of a non-negative random variable, it may be of interest to approximate semi-Markov processes with continuous time Markov chains constructed via PH distributions, where there are potentially more than states in the approximating Markov chain. To the best of our knowledge, such a semi-parametric inference setup has still not been explored. Such a setup may work well with the EM algorithm for parameter estimation of this type of PH distributions, see [8]. We leave this for future work.