In order to express the model likelihood, we first need the distribution of the response variables given the latent variables. For each subject *i* we observe the sequence **b**_{i} = (*b*_{i1}, …, *b*_{il_i})′; we also observe **y**_{i,obs}, which corresponds to all or part of the sequence **y**_{i} = (*y*_{i1}, …, *y*_{il_i})′. In particular, if all elements of **b**_{i} are equal to 0, then **y**_{i,obs} and **y**_{i} coincide; if some elements of **b**_{i} are equal to 1 or 2, then **y**_{i,obs} is a subvector of **y**_{i}.

Based on the assumptions formulated in the previous section, the distribution of interest has the following density function:

$$f(\mathbf{b}_i, \mathbf{y}_{i,\mathrm{obs}} \mid U_i = u) = \left[ \prod_{l=1}^{l_i} p(b_{il} \mid U_i = u) \right] \left[ \prod_{l=1:\, b_{il}=0}^{l_i} \phi(y_{il} \mid U_i = u) \right], \quad u = 1, \dots, k,$$

where *p*(*b*_{il}|*U*_{i} = *u*) is defined in (1), the second product extends over all observed elements of **y**_{i}, and *ϕ*(*y*_{il}|*U*_{i} = *u*) denotes the density of the normal distribution defined according to assumption (2). As in a standard finite mixture model, the density of the *manifest distribution* may be obtained as

$$f(\mathbf{b}_i, \mathbf{y}_{i,\mathrm{obs}}) = \sum_{u=1}^{k} \pi_{iu}\, f(\mathbf{b}_i, \mathbf{y}_{i,\mathrm{obs}} \mid U_i = u).$$

This is the basis for the model log-likelihood, which has expression

$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \log f(\mathbf{b}_i, \mathbf{y}_{i,\mathrm{obs}}),$$

where **θ** is a vector containing all model parameters, that is, **β**_{u}, **γ**_{1u}, **γ**_{2u}, **δ**_{u}, for *u* = 1, …, *k*, and *σ*^{2}.
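The manifest density and the resulting log-likelihood above can be sketched in code. The following is a minimal NumPy illustration, not the authors' implementation: it assumes a balanced design in which every runner has `L` scheduled laps, that the status probabilities *p*(*b*_{il}|*U*_{i} = *u*) are supplied precomputed in `Pb`, and that the lap means *μ*_{lu} are supplied in `Mu` (all array names are illustrative).

```python
import numpy as np

def normal_pdf(y, mu, sigma2):
    # density of N(mu, sigma2) evaluated at y
    return np.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def log_likelihood(B, Y, pi, Pb, Mu, sigma2):
    """Manifest log-likelihood l(theta), balanced-design sketch.

    B  : (n, L) lap-status indicators b_il (0 = regularly completed lap)
    Y  : (n, L) lap times; entries with B != 0 are ignored
    pi : (n, k) prior class probabilities pi_iu
    Pb : (k, n, L) probabilities p(b_il | U_i = u), already evaluated
    Mu : (k, L) lap-specific normal means mu_lu
    """
    n, L = B.shape
    k = pi.shape[1]
    ll = 0.0
    for i in range(n):
        obs = B[i] == 0                      # laps entering the normal part
        manifest = 0.0
        for u in range(k):
            # f(b_i, y_{i,obs} | U_i = u): status probs times normal densities
            comp = np.prod(Pb[u, i]) * np.prod(
                normal_pdf(Y[i, obs], Mu[u, obs], sigma2))
            manifest += pi[i, u] * comp      # sum_u pi_iu f(. | U_i = u)
        ll += np.log(manifest)
    return ll
```

In practice one would accumulate log-densities rather than raw products to avoid numerical underflow with long sequences; the raw form is kept here to mirror the formulas.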

In order to maximize ℓ(**θ**) with respect to **θ**, we rely on the Expectation-Maximization (EM) algorithm (Dempster, Laird, and Rubin 1977). This algorithm has been used extensively for fitting mixture models (see McLachlan and Krishnan 1997; McLachlan and Peel 2000; Fraley and Raftery 2002) in the maximum likelihood framework.

The EM algorithm is based on alternating the following two steps until convergence in the target function:

**E-step**: it consists of computing the conditional expected value, given the observed data and the current value of the parameters, of the *complete data* log-likelihood, which is defined as follows:

$$\ell^{*}(\boldsymbol{\theta}) = \sum_{i=1}^{n} \sum_{u=1}^{k} z_{iu} \log\left[ \pi_{iu}\, f(\mathbf{b}_i, \mathbf{y}_{i,\mathrm{obs}} \mid U_i = u) \right].$$

In the above expression, *z*_{iu} is an indicator variable equal to 1 if subject *i* belongs to cluster *u* (i.e. *U*_{i} = *u*), and to 0 otherwise.

**M-step**: the expected value resulting from the E-step is maximized with respect to *θ* and, in this way, this parameter vector is updated.

In practice, the E-step reduces to computing the (conditional) expected value of each indicator variable *z*_{iu}, denoted by $\hat{z}_{iu}$, by the following simple rule based on the current value of the parameters:

$$\hat{z}_{iu} = \frac{\pi_{iu}\, f(\mathbf{b}_i, \mathbf{y}_{i,\mathrm{obs}} \mid U_i = u)}{f(\mathbf{b}_i, \mathbf{y}_{i,\mathrm{obs}})}.$$
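This rule is a direct application of Bayes' theorem and vectorizes naturally. A minimal sketch, assuming the component densities *f*(**b**_{i}, **y**_{i,obs}|*U*_{i} = *u*) have already been evaluated into an (n, k) array (names are illustrative):

```python
import numpy as np

def e_step(pi, comp_dens):
    """Posterior membership probabilities z_hat_iu.

    pi        : (n, k) prior class probabilities pi_iu
    comp_dens : (n, k) component densities f(b_i, y_{i,obs} | U_i = u)
    """
    num = pi * comp_dens                         # pi_iu * f(. | U_i = u)
    # dividing by the row sum divides by the manifest density f(b_i, y_{i,obs})
    return num / num.sum(axis=1, keepdims=True)
```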

Regarding the M-step, we can use explicit solutions for the parameter vectors **β**_{u} and for the common variance *σ*^{2}:

$$\begin{array}{c} \boldsymbol{\beta}_u = \left( \displaystyle\sum_{i=1}^{n} \hat{z}_{iu} \sum_{l=1:\, b_{il}=0}^{l_i} \mathbf{x}_l \mathbf{x}_l' \right)^{-1} \displaystyle\sum_{i=1}^{n} \hat{z}_{iu} \sum_{l=1:\, b_{il}=0}^{l_i} y_{il}\, \mathbf{x}_l, \quad u = 1, \dots, k, \\[2ex] \sigma^2 = \dfrac{\sum_{i=1}^{n} \sum_{u=1}^{k} \hat{z}_{iu} \sum_{l=1:\, b_{il}=0}^{l_i} (y_{il} - \mu_{lu})^2}{\sum_{i=1}^{n} o_i}, \end{array}$$

where *o*_{i} is the dimension of **y**_{i,obs}, that is, the number of regularly completed laps by runner *i*. On the other hand, updating the remaining parameters **γ**_{1u} and **γ**_{2u} in (1) requires an iterative algorithm of Newton-Raphson type. This algorithm is nonetheless simple, since the objective function being maximized has the same form as that of a standard multinomial logit model with weights, fitted by maximum likelihood. The same Newton-Raphson algorithm is also applied to update the parameters **δ**_{u} in (3), which affect the distribution of each latent variable *U*_{i} on the basis of the individual covariates. When the probabilities *π*_{iu} are assumed equal across subjects (i.e. *π*_{iu} = *π*_{u}), the maximization of the expected complete-data log-likelihood with respect to the *π*_{u} probabilities has the explicit solution

$$\pi_u = \frac{1}{n} \sum_{i=1}^{n} \hat{z}_{iu}, \quad u = 1, \dots, k.$$
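The closed-form part of the M-step (the updates for **β**_{u}, *σ*^{2}, and, in the covariate-free case, *π*_{u}) can be sketched as follows. This is an illustrative NumPy version under the same balanced-design assumption as before, with a lap-level design matrix `X` whose rows are the **x**_{l}; the update for **β**_{u} is a ẑ-weighted least-squares fit restricted to laps with *b*_{il} = 0.

```python
import numpy as np

def m_step_closed_form(B, Y, X, z_hat):
    """Closed-form M-step updates for beta_u, sigma^2 and pi_u.

    B, Y  : (n, L) lap statuses and lap times
    X     : (L, p) lap-level design matrix (rows x_l)
    z_hat : (n, k) posterior probabilities from the E-step
    """
    n, L = B.shape
    k = z_hat.shape[1]
    p = X.shape[1]
    obs = B == 0                               # observed laps; o_i = obs[i].sum()
    beta = np.zeros((k, p))
    for u in range(k):
        A = np.zeros((p, p))
        c = np.zeros(p)
        for i in range(n):
            Xi = X[obs[i]]                     # rows x_l with b_il = 0
            A += z_hat[i, u] * Xi.T @ Xi       # sum_i z_iu sum_l x_l x_l'
            c += z_hat[i, u] * Xi.T @ Y[i, obs[i]]
        beta[u] = np.linalg.solve(A, c)        # weighted least squares
    # residual sum of squares over clusters, weighted by z_hat
    rss = 0.0
    for u in range(k):
        Mu = X @ beta[u]                       # mu_lu for each lap l
        for i in range(n):
            rss += z_hat[i, u] * np.sum((Y[i, obs[i]] - Mu[obs[i]]) ** 2)
    sigma2 = rss / obs.sum()                   # divide by sum_i o_i
    pi = z_hat.mean(axis=0)                    # covariate-free case pi_u
    return beta, sigma2, pi
```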

As for any other iterative algorithm, it is important that the EM algorithm described above is suitably initialized; this amounts to guessing starting values for the parameters in **θ**. We suggest using both a simple deterministic rule, which provides sensible values for these parameters, and a random rule, which allows us to properly explore the parameter space. For instance, under the first rule we set the starting values of the mass probabilities *π*_{iu} to 1/*k* for *u* = 1, …, *k*, which is equivalent to fixing the same size for all clusters. The corresponding random rule instead draws each *π*_{iu} from a uniform distribution between 0 and 1 and then normalizes these values.
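The two initialization rules for the mass probabilities can be sketched as (function and argument names are illustrative):

```python
import numpy as np

def init_pi(n, k, rule="deterministic", rng=None):
    """Starting values for the class probabilities pi_iu, shape (n, k).

    rule = "deterministic": pi_iu = 1/k (same size for all clusters)
    rule = "random"       : uniform draws on (0, 1), normalized per row
    """
    if rule == "deterministic":
        return np.full((n, k), 1.0 / k)
    rng = np.random.default_rng() if rng is None else rng
    raw = rng.uniform(size=(n, k))
    return raw / raw.sum(axis=1, keepdims=True)  # each row sums to 1
```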

We recall that trying different starting values for the EM algorithm is important to deal with the multimodality of the likelihood function that may arise in finite mixture models; combining deterministic and random initialization rules is an effective strategy in this regard.
