Accessible Published by De Gruyter August 3, 2013

Latent Conditional Individual-Level Models for Infectious Disease Modeling

Lorna E. Deeth and Rob Deardon

Abstract

Individual-level models (ILMs) have previously been used to model the spatiotemporal spread of infectious diseases. These models can incorporate individual-level covariate information, to account for population heterogeneity. However, incomplete or unreliable data are a common problem in infectious disease modeling, and models that are explicitly dependent on such information may not be robust to these inherent uncertainties. In this investigation, we assess an adaptation to a spatial ILM that incorporates a latent grouping structure based on some trait heterogeneous in the population. The resulting latent conditional ILM is then only dependent upon a discrete latent grouping variable, rather than precise covariate information. The posterior predictive ability of this proposed model is tested through a simulation study, in which the model is fitted to epidemic data simulated from a true model that utilizes explicit covariate information. In addition, the posterior predictive ability of the proposed ILM is also compared to that of an ILM that assumes population homogeneity. The application of these models to data from the 2001 UK foot-and-mouth disease epidemic is also explored. This study demonstrates that the use of a discrete latent grouping variable can be an effective alternative to utilizing covariate information, particularly when such information may be unreliable.

1 Introduction

Infectious disease modeling has historically been an area of great interest and importance. Even within the last decade or so, outbreaks of infectious diseases on the local scale, as well as pandemics on the global scale, have had significant impacts on human and animal health and led to major economic losses, reinforcing the necessity to understand, predict, and control such diseases. For example, the 2001 foot-and-mouth disease (FMD) in the UK lasted just over seven months, but in that time resulted in the culling of over 6 million farm animals and billions of dollars in economic loss [1]. An understanding of the spatial and spatiotemporal dynamics of infectious diseases is often pivotal to the development and implementation of effective prevention policies and control strategies. Examples of infectious disease modeling, and their use in disease control and prevention, can be found in Keeling et al. [2, 3], McBryde et al. [4], Tildesley et al. [5], Cook et al. [6], Savill et al. [7], Chis Ster et al. [8], Jewell et al. [9], and Deardon et al. [10]. Additionally, a recent review by O’Neill [11] highlights several key areas in infectious disease modeling, and provides a brief introduction to the nature of epidemic data and various model types.

Individual-level models (ILMs), sometimes referred to as individual-based models, are a class of models that describe the spatiotemporal spread of infectious diseases. They are referred to as ILMs, because these models can incorporate covariates at the level of the individual person, plant or animal, or at the level of an individual structure, such as a barn or house. Often, models of this nature are parameterized within a Bayesian framework, using Markov chain Monte Carlo (MCMC) techniques [12]. These models are advantageous in that they can account for population heterogeneity, such as the geographical location of each individual, or additional individual-level covariate information, when such information is available. However, a common problem in infectious disease modeling is the unavailability of epidemiological data that are free from significant information gaps, or that do not suffer from inherent uncertainties.

In situations where covariate information is either incomplete or unreliable, it may be difficult to accurately account for population heterogeneity. For example, in the UK 2001 FMD epidemic, covariate information for each infected farm was recorded. Information on non-infected farms, however, was not recorded during the epidemic and instead comes from census data collected in 2000. As such, there is a recognized uncertainty associated with these covariates, as changes in the covariate information will almost certainly have occurred between the time of data collection and the time of the epidemic [2]. However, if the covariate information is ignored (either because it is unreliable or incomplete), and a homogeneous population is assumed, a given model may not provide an adequate description of the disease spread. Therefore, it would be beneficial to be able to model an epidemic in such a way that still incorporates population heterogeneity, yet the epidemic model itself does not require the direct input of these potentially unreliable covariate data. One way in which this could be accomplished is to use any available covariate information in a more generalized fashion, by introducing a discrete latent grouping structure into the population wherein individuals could be classified into a particular subgroup, based on the heterogeneous characteristic. In the case of the FMD, for example, rather than simply ignoring unreliable covariate information, it can be used in a less detailed manner to classify individuals into population subgroups. In this modeling structure, only a relative knowledge of the covariate information is used, which could alleviate the need to directly depend on the covariate data. Furthermore, the use of a discrete latent grouping variable instead of the covariate data has certain advantages, such as ease of use and interpretation; these and other advantages are described in Section 6.

To investigate the effectiveness of these new latent conditional ILMs (LC-ILMs), we consider a simulation study in which the underlying true model depends on some continuous, but potentially unknown, covariates. The proposed LC-ILM is then fitted, along with a simple ILM that ignores the covariate effect completely, for comparison with the true underlying model. The newly proposed LC-ILMs are then applied to data from the UK 2001 FMD epidemic.

2 Methodology

2.1 Individual-level models

Deardon et al. [10] described an ILM framework in which each individual in the population is considered to be a discrete unit in time and space. The model can be used to estimate the probability of a susceptible individual becoming infected in a given period of time, by allowing information on disease susceptibility and transmissibility factors (e.g. the number and/or types of animals on a farm, environmental factors, etc.), and the effect of the distance between infected and susceptible individuals to be incorporated in the models.

The ILM framework described here is placed within a standard SEIR or SIR compartmental modeling framework. That is, at any given time, each individual within the population can be categorized into one of three (for SIR) or four (for SEIR) possible states: susceptible (S), exposed (E), infectious (I), or removed (R). In the susceptible state, individuals are not yet exposed to the disease, whereas in the exposed state, individuals have been infected with the disease agent, but are not yet capable of spreading it. In the infectious state, individuals are capable of transmitting the disease virus to other members of the population. The removed state refers to individuals who have been removed from the population, either through recovery, immunity, or death. Over time, individuals in the population move between states, starting in the susceptible state and ending at the removed state. The ILM, denoted as P(i, t), determines the rate at which individuals move from susceptible to exposed (or to infectious in the SIR model), while the rate at which individuals move from exposed to infectious and/or infectious to removed can be determined in various ways. The simplest is to assume a fixed latent () and/or infectious () period. The transition of individuals through each compartment can be visualized as:

or for the SIR model:

2.2 General model

For the purposes of this study, initially we focus on the SIR model, however, the model will later be extended to the SEIR model. As mentioned previously, individuals are classified as being one of susceptible, infectious, or removed and are subsequently identified as being in one of the sets S(t), I(t), or R(t), respectively. The time variable, t, is discrete and represents the continuous time interval [t, t 1). Therefore, an individual’s status (susceptible, infectious, or removed) over the continuous time period [t, t 1) is simply considered its status at the discrete time point t.

The probability that a susceptible individual, i, will become infected at time t is denoted by the general ILM:

[1]
[1]

where and represent susceptibility factors for individual i and transmissibility factors for individual j, respectively. The infection kernel, , represents risk factors involving both individuals and is often a function of a distance metric between individuals i and j. The sparks term, , can be included to account for infections that are not well explained by the other terms in the model. This term can either be purely random or a function of time and/or disease-related factors.

The probability statement in eq. [1] is used to determine the likelihood of the observed infection pattern at time t, . Here, observed infection pattern refers to all new infections occurring at time and all susceptible individuals at time t remaining uninfected at time . The likelihood of the observed infection pattern is, therefore, defined as:

[2]
[2]

The likelihood of the data, given the parameter set , is calculated as the product of the likelihood of observed infection patterns over the entire time period () of the epidemic as:

[3]
[3]

The ILMs are fitted within a Bayesian framework, where the posterior density of , given the data, is proportional to the likelihood function described in eq. [3] and a prior distribution, :

[4]
[4]

2.2.1 Spatial ILM

Here, we first consider a simple version of eq. [1], in which the susceptibility and transmissibility factors are set equal to constants, such that and . Further, a geometric infection kernel is used, where , the distance metric, , is the Euclidean distance between a susceptible individual i and an infectious individual j, and is the geometric rate of decay. For simplicity, the sparks term is set to 0. The resulting model is referred to as a spatial ILM (SILM):

[5]
[5]

Under the SILM, the parameter set, , contains only the model parameters and , which are assumed to be the same for every individual in the population.

2.2.2 Latent conditional ILM (LC-ILM)

Here, it is assumed that a grouping structure exists, such that the population is heterogeneous with respect to the susceptibility parameter, ; for simplicity, is assumed to still be common among all individuals. A different susceptibility level is thus defined for each subgroup within the population, so that if there are k subgroups, then there are k susceptibility levels: . The order constraint on the susceptibility parameters is included for identifiability [13]. In order to identify the susceptibility level for each individual, an indicator variable is used. Thus, the SILM in eq. [5] is extended to

[6]
[6]

Therefore, the probability of an individual i becoming infected at time t is dependent upon the group allocation, , and the corresponding susceptibility level, . Because it is conditional on the value of the latent variable , it is necessary to either know or be able to estimate, the group allocation for each individual, in order to find .

Full and partial classification. If the group allocations are known for every individual, the model is considered fully classified, and only the model parameters need to be estimated (i.e. ). This would be the case if, for every individual, there is sufficient prior knowledge about the individual to be able to correctly classify it into one of the k subgroups. If such information is available for some, but not all, individuals, the model is considered partially classified, and group allocations must be estimated for those individuals for which there is no classification information. In the Bayesian MCMC framework, this can be done via data augmentation, in which the missing data are incorporated into the parameter set, , where n is the number of individuals with unknown group allocations.

3 Simulation study

The motivation for this study comes from the 2001 FMD epidemic in the UK. As previously stated, information such as geographical location, infection, and removal times and number and type of species for each infected farm were recorded during the epidemic, however information on non-infected farms comes from census data of 2000. Therefore, there is a level of uncertainty associated with these data, though many models fitted to these data, e.g. Keeling et al. [2], Chis Ster et al. [8], Deardon et al. [10], incorporate a continuous susceptibility parameter that is dependent upon knowing the number and type of animals on each farm. The purpose of this simulation study is to determine whether we can allow for such uncertainty by incorporating a discrete latent grouping structure on which to model the epidemic, rather than utilizing possibly unreliable covariate information. This simulation study is carried out to assess the differences in posterior predictive ability of the SILM versus the LC-ILM, when they are fitted to epidemic data generated across a heterogeneous population via a covariate-dependent ILM. A population with two subgroups is created, over which multiple epidemics are simulated. Here, individuals in the population are taken to be the individual farms, in which each contains various species and numbers of animals.

3.1 Epidemic simulation

A population of 1,000 individuals is created, with 500 individuals assigned to each subgroup. Geographical locations for each individual are randomly drawn from an arbitrarily selected multivariate normal distribution, resulting in a population in which the subgroups are not spatially segregated. In order to simulate a “true” epidemic across the population, it is necessary to generate counts of the number of sheep and cattle at each location. These counts are based on random values drawn from independent, arbitrary Gamma distributions, which are then rounded to the nearest whole number. The number of sheep for the small farms is drawn from a Gamma(0.9, 200) distribution, while that for the large farms is drawn from a Gamma(4.0, 200) 2,500 distribution. The number of cattle for small and large farms are drawn from Gamma(0.9, 80) and Gamma(4.0, 80) 700 distributions, respectively. These distributions are arbitrarily selected to result in farms with noticeable differences in counts of sheep and cattle. Differentiation between the subgroups themselves is based on the relative size of each farm; farms that are specifically generated to have large numbers of sheep and cattle ( 2,500 and 700, respectively) are considered to be large farms, while those with relatively few sheep and cattle ( 2,500 and 700, respectively) are considered to be small farms. When the covariate information is assumed to be known for a particular farm, these classification boundaries are used to assign the appropriate value. The complete data set, therefore, consists of geographical location (x/y Cartesian coordinates), number of cattle, and number of sheep for each individual.

To simulate epidemics across the population, the model framework of Keeling et al. [2], a special case of the general ILM found in eq. [1], is used. The susceptibility term, , is defined as an additive function of the number of sheep () and number of cattle () on farm i, and a corresponding susceptibility factor ( and ) for each. The transmissibility term is set to 1, and a geometric infection kernel is used, resulting in the model:

[7]
[7]

It is clear from this model that farms with larger numbers of sheep and cattle will have a higher probability of infection. Values of , , and are selected, as these values produce epidemics that spread at a “reasonable rate”; i.e. not all individuals are infected immediately, but a sufficient number of individuals are infected so that the epidemic tends to propagate. This allows for the resulting epidemic data to contain a “reasonable” amount of information as opposed to, say, epidemics that spread too quickly, or simply do not propagate at all. These parameter values are arbitrarily selected for the purposes of this study; parameter values for similar models found in the literature are highly dependent on the specific epidemic data being investigated, and therefore are not likely comparable to the current simulated data. A total of ten epidemics are simulated across the population; this value is limited due to the computational burden associated with model fitting within each epidemic. For each simulated epidemic, three randomly selected individuals are assigned to begin their infectious periods at time . A fixed infectious period of three time units is assumed, and each epidemic is simulated over an epidemic length of time units. These ten replications, or realizations, represent the true epidemics, generated across the population using the complete data set, and are referred to as the true epidemic replications.

3.1.1 Overlapping and misspecified number of subgroups simulation study

The simulation study described above is repeated, this time on a population in which there is no obvious difference between subgroups. A population of 1,000 individuals is generated, in which there exists two population subgroups (500 individuals per subgroup). An arbitrary multivariate normal distribution is used to simulate geographical locations for each individual, while the counts for number of sheep are based on rounded values drawn from a Gamma(0.9, 200) distribution for small farms and a Gamma(4.0, 200) distribution for large farms. The counts for the number of cattle on small farms are rounded values randomly drawn from a Gamma(0.9, 80) distribution, while those for large farms are drawn from a Gamma(4.0, 80) distribution. This results in farm sizes for which there is no obvious difference between small and large farms.

Since there is no clear difference between population subgroups, arbitrary cut-off values are required in order to assign each individual in the population a value. Here, two scenarios are considered: first, when two population subgroups are assumed, and cut-off values of 500 sheep and 200 cattle for small farms () and 500 sheep and/or 200 cattle for large farms () are selected. Second, the effect of misspecifying the number of population subgroups to be three is investigated, with small farms () as those with 500 sheep and 200 cattle, medium farms () are those with 500–1,000 sheep and/or 200–400 cattle, and large farms () have 1,000 sheep and 400 cattle.

As before, eq. [7] is used to simulate ten replications (representing ten true epidemics) across the population. Model parameter values of , , and are used, however, the number of initially infected individuals, the length of the infectious period, and the number of time units over which the epidemic is simulated are the same as previously described.

3.2 Model fitting

Each individual farm in the population is assigned a true value, based on the number of each animal species and the subgroup boundaries described previously. Individual farms that are classified as small are assigned , while large farms have . For each epidemic, a total of seven different scenarios are considered. Five consist of fitting the LC-ILM with differing proportions of the population having known values, and one consists of fitting the SILM. In the seventh scenario, the model described in eq. [7], used to simulate the epidemics, is refitted to the data. This model is referred to as the full ILM. The LC-ILM scenarios consist of cases where 100%, 75%, 50%, 25%, and 0% of the population group allocations are known. When 100% of the group allocations are known, it is assumed that the covariate information is available, and the true values are used in fitting the LC-ILM. For the cases when less than 100% of the group allocations are known, an appropriate number of individuals are randomly selected from the population and are considered to have unknown group allocations (i.e. their covariate information is unavailable, and, therefore, must be estimated). The ILMs are then fitted to the data using a Metropolis-Hastings MCMC algorithm with random walk proposals for the model parameters and an independence sampler proposal for the data augmented group allocations. To account for correlation between the model parameters, a block update is used, and a multivariate normal distribution, centered at 0 with covariance matrix , serves as the proposal distribution, where is the estimated covariance matrix of the posterior distribution. To increase chain efficiency, the covariance matrix for the proposal distribution is iteratively estimated until satisfactory mixing is achieved. For estimating the group allocations, a single parameter update is performed for each unknown , using a Bernoulli (0.5) proposal distribution.

Vague, independent prior distributions are assumed for all the model parameters. For the model parameters, the prior distributions are chosen to be positive, half-normal distributions with mode 0, and an arbitrarily large variance of 104. A Bernoulli (0.5) prior distribution is assumed for each unknown value. A chain totaling 25,000 MCMC iterations is obtained for each model, and convergence is visually ascertained.

This model fitting procedure is also applied to the simulated epidemic data described in Section 3.1.1. Both the SILM and the full ILM are fitted to the epidemic data, as well as two versions of the LC-ILM: an LC-ILM that allows for two population subgroups (k = 2, representing low-/high-risk farms) and one that allows for three population subgroups (k = 3, representing low-/medium-/high-risk farms). Under both LC-ILMs, scenarios of 100%, 75%, 50%, 25%, and 0% of the group allocations known are all considered.

3.3 Posterior predictive ability

The posterior predictive ability of each fitted model is examined as a means to assess model fit [14, 15]. In general, this involves calculating a test statistic based on the observed data set and comparing this value to the distribution of that obtained when the test statistic is calculated from (simulated) replicated data [14, 15]. Adapting the notation of Rubin [14], is defined as the observed data (here, the true epidemic replication), D refers to the posterior predictive replication data, and is the test statistic. The posterior predictive distribution of is then obtained by simulating s replications of the data and calculating , for i = 1,...s. Highest posterior density intervals (HPDIs) can then be obtained from the posterior predictive distribution of , to which can be compared. In this study, two test statistics are defined: the epidemic curve and the total epidemic size, denoted and , respectively, and given by:

[8]
[8]
[9]
[9]

where is the number of newly infectious individuals at time .

To obtain the posterior predictive distributions for and , the following procedure is used. For each fitted model, a random sample of 100 model parameters and, if applicable, estimates are drawn from the joint posterior distribution. Each random sample is then used to simulate a posterior predictive epidemic replication across the population, using the corresponding LC-ILM or SILM, and under the same initial sources of infection, infectious period, and as the true epidemic replication. For each of these posterior predictive epidemic replications, and are recorded. At each time point, a 95% HPDI is obtained for the epidemic curve, as well as a 95% HPDI for the total epidemic size. This procedure is repeated for each of the ten true epidemic replications, resulting in a 95% HPDI for and being obtained for each true epidemic replication. The mean true epidemic curve, denoted , is obtained by averaging the number of new infections at each time point, across all ten true epidemic replications. Similarly, the true total epidemic size is averaged across all ten true epidemic replications to determine the mean true total epidemic size, . The values for and are then compared to each of their respective 95% HPDIs obtained from the posterior predictive replications, as well as the mean 95% HPDIs for and , which are determined by averaging the 95% HPDIs obtained from the posterior predictive replications across all ten of the true epidemic replications. The number of time points in each true epidemic replication at which the HPDIs fail to capture the mean true values is summarized. The epidemic curves, total epidemic size, and respective 95% HPDIs are determined for the population overall, as well as within each subgroup.

4 Results

MCMC convergence is successfully reached under the LC-ILM with 100%, 75%, 50%, 25%, and 0% of the values known, as well as for the SILM and the full ILM. The results of the posterior predictive distributions of the test statistics, for each true epidemic replication, are presented in the supplementary material.

4.1 Epidemic curve analysis

A plot of the mean true epidemic curve and mean 95% HPDI for the SILM, the LC-ILM with 100%, 50%, and 0% of the values known, and full ILM is shown in Figure 1. Similar plots are seen for the LC-ILM with 75% and 25% of the group allocations known (results not shown). When averaged across all ten true epidemic replications, the mean 95% HPDI for every proportion of the group allocations known contains the mean true epidemic curve at all time points, as is also seen with the mean 95% HPDI generated under the full model, suggesting that the LC-ILM performs equally well to the full ILM in predicting the simulated epidemic. Under the SILM, the mean 95% HPDI is successful in capturing the mean true epidemic curve at only three time points. Furthermore, it is clear that the maximum value of the mean 95% HPDI under the SILM occurs at a later time point (approximately ) than the maximum value of the mean true epidemic curve (approximately ), indicating that the SILM is a poor predictor of the simulated disease dynamics.

Figure 1 Mean 95% HPDI of the posterior predictive distribution of  under the SILM (orange), LC-ILM at 0% (purple), 50% (blue), and 100% (red) of  known, and the full ILM (green);  is superimposed (black).

Figure 1

Mean 95% HPDI of the posterior predictive distribution of under the SILM (orange), LC-ILM at 0% (purple), 50% (blue), and 100% (red) of known, and the full ILM (green); is superimposed (black).

Figure 2 displays the results of the mean true epidemic curve and the mean 95% HPDIs within each subgroup. Plots for the SILM, full ILM, and LC-ILM with 100%, 50%, and 0% of the group allocations known are displayed, while similar plots for the LC-ILM with 75% and 25% of the values known are observed (results not shown). For both the low susceptibility and the high susceptibility subgroups, the mean 95% HPDI under the LC-ILM captures the mean true epidemic curve at every time point, for all proportions of values known, as is also the case for the mean 95% HPDI generated under the full ILM. Under the SILM, the mean 95% HPDI fails to capture the mean true epidemic curve at 8 time points in the low susceptibility subgroup and at 15 time points in the high susceptibility subgroup (Figure 2). It is also clear that in both subgroups, under the SILM, the mean 95% HPDI reaches a maximum value at a later time point (approximately in both subgroups) than for the mean true epidemic curve (approximately in the low susceptibility group and in the high susceptibility group). In the high susceptibility subgroup, this maximum value is also clearly lower than that of the mean true epidemic curve ( under the SILM versus for the true epidemic). These results again suggest that for the simulated epidemic, the LC-ILMs are well able to predict the simulated disease dynamics within each subgroup, while the SILM fails to predict both the speed and the extent of infection.

Figure 2 Mean 95% HPDI of the posterior predictive distribution of  within each subgroup, under the SILM (orange), LC-ILM at 0% (purple), 50% (blue), and 100% (red) of  known, and the full ILM (green);  is superimposed (black). Left: ; right: .

Figure 2

Mean 95% HPDI of the posterior predictive distribution of within each subgroup, under the SILM (orange), LC-ILM at 0% (purple), 50% (blue), and 100% (red) of known, and the full ILM (green); is superimposed (black). Left: ; right: .

4.2 Total epidemic size analysis

The mean true total epidemic size, , is determined to be 633.2 individuals. When averaged across all ten true epidemic replications, the mean 95% HPDI captures under all the fitted models (Figure 3), however, it is clear that the width of the mean HPDI is noticeably wider for the SILM than that for the LC-ILM and full ILM, indicating a decrease in precision of the point estimates. Similarly, the width of the mean HPDIs generated under the various LC-ILMs is slightly wider than that of the full ILM, suggesting a small increase in variability of the point estimates of when compared to the full ILM.

Figure 3 Mean 95% HPDI of the posterior predictive distribution of , generated under the SILM, LC-ILM, and the full ILM ().

Figure 3

Mean 95% HPDI of the posterior predictive distribution of , generated under the SILM, LC-ILM, and the full ILM ().

The mean true total epidemic size within each subgroup is determined to be and , for the low susceptibility and high susceptibility subgroups, respectively. When averaged across all ten true epidemic replications, the mean 95% HPDI contains the mean true total epidemic size for each subgroup, under the LC-ILM for all proportions of values known and for the full ILM (Figure 4). Conversely, for both the low and the high susceptibility subgroups, the mean 95% HPDIs under the SILM fail to capture their respective mean true total epidemic size (Figure 4). It is also clear that the mean 95% HPDI under the SILM overestimates and underestimates . Under the SILM, the mean 95% HPDIs for both subgroups are wider than the corresponding HPDIs generated under the LC-ILM and full ILM. Furthermore, within each subgroup, the width of the mean 95% HPDIs under the LC-ILMs increases as the proportion of known values decreases, while the narrowest HPDI is that generated under the full model. This suggests a decrease in precision of the point estimates of and between the full ILM, the various LC-ILMs, and the SILM.

Figure 4 Mean 95% HPDI of the posterior predictive distribution for , generated under the SILM, LC-ILM, and the full ILM (, green ; , red ).

Figure 4

Mean 95% HPDI of the posterior predictive distribution for , generated under the SILM, LC-ILM, and the full ILM (, green ; , red ).

4.3 Overlapping and misspecified number of subgroups simulation study

MCMC convergence is attained for the SILM, full ILM, and both LC-ILMs with 100%, 75%, 50%, 25%, and 0% of the group allocations known. To assess model fit, the posterior predictive ability, as described in Section 3.3, is examined for each fitted model. The results of the epidemic curve analysis indicate that for the overall population, the mean 95% HPDIs generated under the LC-ILMs2 contain the mean true epidemic curve at more time points than the mean 95% HPDI generated under the SILM, regardless of the proportion of group allocations known (Figure 5). When examining the results within each subgroup, the 95% HPDIs generated under both the LC-ILM2 and the SILM for the low susceptibility subgroup contain the mean true epidemic curve at all time points, however the upper bound of the 95% HPDI generated under the SILM is noticeably higher than that generated under the LC-ILM2. In the high susceptibility subgroup, the 95% HPDI generated under the SILM fails to contain the mean true epidemic curve at seven time points, whereas the 95% HPDIs generated under the LC-ILMs2, for all proportions of group allocations known, contain the mean true epidemic curve for all time points (Figure 6). As is seen in the previous simulation study, the LC-ILM2 and full ILM perform approximately the same, for all proportions of group allocations known.

Figure 5 Mean 95% HPDI of the posterior predictive distribution of  under the SILM (orange), LC-ILM2 at 0% (purple), 50% (blue), and 100% (red) of  known, and the full ILM (green);  is superimposed (black).

Figure 5

Mean 95% HPDI of the posterior predictive distribution of under the SILM (orange), LC-ILM2 at 0% (purple), 50% (blue), and 100% (red) of known, and the full ILM (green); is superimposed (black).

Figure 6 Mean 95% HPDI of the posterior predictive distribution of 
 under the SILM (orange), LC-ILM2 at 0% (purple), 50% (blue), and 100% (red) of 
 known, and the full ILM (green); 
 is superimposed (black). Left: 
; right:

Figure 6

Mean 95% HPDI of the posterior predictive distribution of under the SILM (orange), LC-ILM2 at 0% (purple), 50% (blue), and 100% (red) of known, and the full ILM (green); is superimposed (black). Left: ; right:

The results of the total epidemic size analysis for the LC-ILM2 are similar to those previously observed: the mean true total epidemic size is captured with the 95% HPDIs generated under all models, for the population overall (Figure 7). However, within each subgroup, the 95% HPDIs generated under the SILM fail to capture the mean true total epidemic size for both the low and the high susceptibility subgroups. As was previously observed, the 95% HPDIs generated under the LC-ILMs2 are slightly wider than the 95% HPDI generated under the full ILM, for both the population overall and within each subgroup (Figure 8).

Figure 7 Mean 95% HPDI of the posterior predictive distribution of , generated under the SILM, LC-ILM2, and the full ILM ().

Figure 7

Mean 95% HPDI of the posterior predictive distribution of , generated under the SILM, LC-ILM2, and the full ILM ().

Figure 8 Mean 95% HPDI of the posterior predictive distribution for , generated under the SILM, LC-ILM2, and the full ILM (, green ; , red ).

Figure 8

Mean 95% HPDI of the posterior predictive distribution for , generated under the SILM, LC-ILM2, and the full ILM (, green ; , red ).

When comparing the epidemic curve results between the LC-ILM3, the SILM and full ILM, a similar pattern of results is seen for the overall population (Figure 9). Here, it is clear that the 95% HPDIs generated under the SILM fail to contain the mean true epidemic curve at more time points than the other two models, which perform approximately the same. However, when comparing within each subgroup (Figure 10), there is a noticeably different pattern of results. In the low susceptibility group, the 95% HPDI generated under the SILM contains the mean true epidemic curve at all time points, however the upper bound of the interval is noticeably higher than those generated under the full ILM and LC-ILMs3. In the medium and high susceptibility subgroups, the 95% HPDIs generated under the SILM fail to contain the mean true epidemic curves at more time points than the 95% HPDIs generated by the other models. When comparing the full ILM and the LC-ILM3, it is only in the medium susceptibility group that the two models perform approximately the same. In the low susceptibility subgroup, as the proportion of known group allocations decreases, there is an increased tendency for the LC-ILM3 to underestimate the mean true epidemic curve. In fact, in this subgroup, the LC-ILM3 with 0% of the group allocations known appears to perform worse than the SILM. Conversely, in the high susceptibility subgroup, the LC-ILMs3 tend to overestimate the mean true epidemic curve, as the proportion of known group allocations decreases (Figure 10).

Figure 9 Mean 95% HPDI of the posterior predictive distribution of  under the SILM (orange), LC-ILM3 at 0% (purple), 50% (blue), and 100% (red) of  known, and the full ILM (green);  is superimposed (black).

Figure 9

Mean 95% HPDI of the posterior predictive distribution of under the SILM (orange), LC-ILM3 at 0% (purple), 50% (blue), and 100% (red) of known, and the full ILM (green); is superimposed (black).

Figure 10 Mean 95% HPDI of the posterior predictive distribution of  under the SILM (orange), LC-ILM3 at 0% (purple), 50% (blue), and 100% (red) of  known, and the full ILM (green);  is superimposed (black). Left: ; center: ; right: .

Figure 10

Mean 95% HPDI of the posterior predictive distribution of under the SILM (orange), LC-ILM3 at 0% (purple), 50% (blue), and 100% (red) of known, and the full ILM (green); is superimposed (black). Left: ; center: ; right: .

Finally, the results of the total epidemic size analysis, for the population overall, indicate that while all models are able to capture the mean true total epidemic size, the variability in the point estimates is slightly higher under the SILM than under the LC-ILM3 or the full ILM (Figure 11). However, when comparing the results within each subgroup, the 95% HPDIs generated under the SILM fail to capture the mean true total epidemic size within all three subgroups. Furthermore, as the proportion of known group allocations decreases, the 95% HPDIs generated under the LC-ILM3 perform increasingly worse, as the variability of the point estimates not only appears to increase but the credible intervals fail to capture the mean true total epidemic size (Figure 12). The increasingly poor predictive performance of the LC-ILM3 as the proportion of known values decreases, of both the total epidemic size and epidemic curve, could be the result of a greater number of low-risk () individuals being misclassified as high-risk () individuals. As the amount of available information decreases, low-risk individuals who, for example, become infected early in the epidemic may be mistakenly classified as high-risk individuals, thus leading to skewed posterior predictions.

Figure 11 Mean 95% HPDI of the posterior predictive distribution of , generated under the SILM, LC-ILM3, and the full ILM ().

Figure 11

Mean 95% HPDI of the posterior predictive distribution of , generated under the SILM, LC-ILM3, and the full ILM ().

Figure 12 Mean 95% HPDI of the posterior predictive distribution for , generated under the SILM, LC-ILM3, and the full ILM (, green ; , red ; , blue ).

Figure 12

Mean 95% HPDI of the posterior predictive distribution for , generated under the SILM, LC-ILM3, and the full ILM (, green ; , red ; , blue ).

5 Application

The model fitting methods described in Section 3 are applied to data from the UK 2001 FMD outbreak. The primary objective of this application is to explore the difference in posterior predictive ability of the LC-ILM and the SILM, when these models are fitted to real-life epidemic data. A subset of observations are selected from the Cumbria region, in the north-west of England, based on an area of high farm density and a relatively high rate of disease occurrence. The resulting data set contains 1,177 farms with varying populations of sheep and cattle, as well as information on the geographical location (in the form of x/y Cartesian coordinates) and infection and removal times for each farm (as described at the beginning of Section 3). These farms are observed between days 30 and 70 of the UK epidemic, a period in which most of the observed infection in this region occurred. As such, here refers to day 30 and refers to day 70.

5.1 Model fitting

A plot of number of sheep versus number of cattle is used in an attempt to identify subgroups within the population, but resulted in no definitive cut-off between low and high susceptibility farms. Therefore, an arbitrary cut-off of 50 or fewer cattle and 200 or fewer sheep is used to designate farms as low susceptibility, while farms that had more than 50 cattle and/or more than 200 sheep are classified as high susceptibility. Using this criterion, 442 farms are classified as low susceptibility, and 735 farms are classified as high susceptibility.

Due to the nature of FMD, an SEIR model framework is used. This allows for a latent period during which individuals have been exposed to the disease but are not yet capable of spreading it. Individuals in their latent period of infection are said to be exposed, and the set of exposed individuals at time t is given as . Incorporating a latent period into the model framework changes the likelihood of infection to

which subsequently modifies the likelihood function to

Following the method of Keeling et al. [2], a fixed latent period of time units is assumed for all individuals, followed by a fixed infectious period of time units.

Six different models are fitted to the data. Five of the models consist of the LC-ILM with one of 100%, 75%, 50%, 25%, or 0% of the values known, while the sixth model is the SILM. Similar to the simulation study, if it is assumed that the covariate information for a particular farm is available, then the true value for that farm is used. Therefore, when 0% of the values are known, it is assumed that no covariate information is available, and the group allocations for all farms must be estimated. The ILMs are fitted to the data following the procedure outlined in Section 3.

5.1.1 Posterior predictive ability

The posterior predictive ability of each model is assessed using an analogous procedure to that outlined in Section 3, with the exception that , the epidemic curve, is instead calculated as the number of newly exposed individuals at each given point in time, . Furthermore, the true epidemic curve, , is determined simply by counting the number of new exposures at each time point in the FMD epidemic data, and these values are compared to a single 95% HPDI of the posterior predictive distribution of obtained from each fitted model. Similarly, , the total epidemic size, is determined as . A single 95% HPDI for the posterior predictive distribution of is obtained for each fitted model.

5.2 Results

The model successfully converges under the LC-ILM with 100%, 75%, 50%, and 25% of the values known, as well as for the SILM. However, MCMC convergence is not obtained in a reasonable time frame under the LC-ILM with 0% of the group allocations known, in spite of increased attempts to optimize the covariance matrix of the proposal distribution. Consequently, parameter estimates from the joint posterior distribution of this model cannot be obtained, and the posterior predictive ability is not assessed. A summary of the posterior predictive distribution of the epidemic curve is presented in the supplementary material.

5.2.1 Epidemic curve analysis

A plot of the true epidemic curve for the overall population and corresponding 95% HPDIs generated by all fitted LC-ILMs and by the SILM is seen in Figure 13. It is clear that the 95% HPDIs for all five models are similar, with approximately the same number of time points of the true epidemic curve falling within each of the 95% HPDIs. This result is slightly different than what is seen in the simulation study, where the LC-ILMs performed noticeably better than the SILM. However, a difference between the LC-ILM and the SILM is seen when considering the results within each subgroup.

Figure 13 95% HPDI of the posterior predictive distribution of  under the SILM (orange), LC-ILM at 25% (purple), 50% (blue), 75% (green), and 100% (red) of  known;  is superimposed (black).

Figure 13

95% HPDI of the posterior predictive distribution of under the SILM (orange), LC-ILM at 25% (purple), 50% (blue), 75% (green), and 100% (red) of known; is superimposed (black).

The true epidemic curve and 95% HPDIs within each subgroup are displayed in Figure 14, where plots for all fitted LC-ILMs and for the SILM are shown. Within the low susceptibility subgroup, the 95% HPDIs generated under all the fitted models capture the true epidemic curve at almost every time point. However, it is clear that the 95% HPDIs increase in width as the proportion of known values decreases and that the 95% HPDI generated under the SILM is noticeably wider than any of the intervals generated under the different LC-ILMs, suggesting an increased tendency to overestimate the number of new exposures at each time point. In the high susceptibility subgroup, the 95% HPDI generated under the SILM is the narrowest interval, however, it fails to capture the true epidemic curve at more time points than any of the 95% HPDIs generated under the LC-ILMs at all proportions of values known. This result is similar to that seen in the simulation study, where in the high susceptibility group there was a large number of time points at which the 95% HPDI generated under the SILM failed to capture the mean true epidemic curve, and indicates a pattern of underestimating the number of new exposures at each time point.

Figure 14 95% HPDI of the posterior predictive distribution of  within each subgroup, under the SILM (orange), LC-ILM at 25% (purple), 50% (blue), 75% (green), and 100% (red) of  known;  is superimposed (black). Left: ; right: .

Figure 14

95% HPDI of the posterior predictive distribution of within each subgroup, under the SILM (orange), LC-ILM at 25% (purple), 50% (blue), 75% (green), and 100% (red) of known; is superimposed (black). Left: ; right: .

5.2.2 Total epidemic size analysis

The true total epidemic size for the overall population is found to be individuals. In Figure 15, it can be seen that the 95% HPDIs generated under the LC-ILM (at every proportion of values known) and the SILM all capture , a result similar to that seen in the simulation study. Here, there is no noticeable difference in the width of the 95% HPDIs of the posterior predictive distribution of generated under the LC-ILMs versus the SILM. This similarity could be attributed to the fact that the distinction between subgroups is arbitrary, as discussed in the next section.

Figure 15 95% HPDI of the posterior predictive distribution of , generated under the SILM and the LC-ILM (.

Figure 15

95% HPDI of the posterior predictive distribution of , generated under the SILM and the LC-ILM (.

When examining the true total epidemic size within each subgroup, the low susceptibility group has a total epidemic size of and individuals in the high susceptibility subgroup. Within the high susceptibility group, the true total epidemic size is captured within the 95% HPDI generated under each of the five models. However, under the SILM the true total epidemic size is barely within the upper HPDI limit, indicating a tendency to underestimate the size of the total epidemic within this subgroup. Additionally, there is no discernible difference in the widths of the 95% HPDIs generated under the various LC-ILMs, and all of these intervals were relatively wide compared to the corresponding interval in the low susceptibility group. This could be due to the fact that the variability of farm sizes in the high susceptibility group is quite large, with farms ranging in size from those that are just barely above the arbitrary cut-off value of 50 cattle and 200 sheep, to those whose total animal count is in the thousands. In the low susceptibility group, the 95% HPDI generated under the SILM fails to capture , as the value falls below the lower limit of the HPDI and is indicative of a tendency for the SILM to overestimate the size of the epidemic within this group. These results are consistent with those seen in the simulation study. In addition, while the 95% HPDIs generated under the LC-ILMs all capture the true total epidemic size, as the proportion of known values decreases, the width of the 95% HPDI increases, as was also seen in the simulation study. Finally, within the low susceptibility subgroup as the proportion of known values decreases, the 95% HPDIs generated under the LC-ILMs become less centered around the true total epidemic size, indicating that the point estimates of become increasingly worse.

Figure 16 95% HPDI of the posterior predictive distribution of  within each subgroup, generated under the SILM and the LC-ILM (, green ; , red ).

Figure 16

95% HPDI of the posterior predictive distribution of within each subgroup, generated under the SILM and the LC-ILM (, green ; , red ).

6 Discussion

This article has demonstrated that when modeling an epidemic across a heterogeneous population, the use of a LC-ILM may result in better certain posterior predictive abilities than the use of a SILM when covariate information is incomplete or unreliable. The results of the simulation study appear to confirm this. Furthermore, they also suggest that in a simple spatial epidemic, the LC-ILM can perform comparatively well to the full ILM in terms of tested posterior predictive abilities, indicating a relatively small amount of information regarding these posterior predictive distributions is lost when a discrete latent grouping variable is used in place of the true covariate information. These results are also seen in the second simulation study, where the distinction between population subgroups is not obvious, and an arbitrary boundary for group allocation was defined. When an LC-ILM3 is fitted to epidemic datum for a two subgroup population, it appears that it still has better posterior predictive abilities than a SILM, fitted to the same datum. However, as the proportion of known group allocations decreases, the LC-ILM3 performs increasingly worse compared with the full ILM. The application of these models to the FMD epidemic data yields a consistent pattern of results, in which the posterior predictive abilities of the LC-ILM are either better than or approximately the same as those of the SILM.

The lack of sensitivity of the epidemic curve results to changes in the proportion of group allocations known is an unexpected result, however, it may be due to the relatively insensitive nature of the statistic used. In the calculation of the epidemic curve statistic, only a count of the number of newly infectious individuals is used, not which individuals become infected. Therefore, in each epidemic, and within each replication, there may be enough variability in which individuals are becoming infected to mask any differences that may exist between the LC-ILMs at differing proportions of known values. However, a difference between the LC-ILMs for different proportions of known values is seen under the total epidemic size analysis, as this statistic may be more sensitive to any cumulative effects. The difference in results between the simulation study and the FMD case study could be due to a number of factors. In the simulation study, there was a clear distinction in the farm sizes between those farms classified as low susceptibility and those classified as high susceptibility, allowing for an obvious application of a discrete grouping variable. Furthermore, the epidemics in the simulation study were relatively simple spatial epidemics, depending only upon the number and type of animals on each farm, and the Euclidean distance between susceptible and infectious individuals. The combination of these two factors could allow for the LC-ILM to be as effective as the full ILM in predicting disease dynamics. In the FMD case study, however, the range of farm sizes was relatively continuous, resulting in the use of an arbitrary cut-off value between subgroups. Therefore, it may be more difficult to distinguish between the low and high susceptibility individuals, since the boundaries between the subgroups are not as well defined. In addition, the overall fit of both the LC-ILM and the SILM to the FMD epidemic data was not as good as that in the simulation study, which may make it more difficult to obtain an accurate portrayal of the posterior predictive abilities of these models.

There are certain advantages in applying a discrete latent grouping variable in place of the covariate information, beyond the case in which the covariate information is unreliable or incomplete. There may exist situations in which covariates that contribute to disease spread may be unobserved completely. In this case, a discrete latent grouping variable may be able to encompass the “hidden” covariates, to better account for disease spread than a model which only accounts for observed covariates. Consider, for example, an infectious disease that is dependent upon farm size and type, as well as some measure of biosecurity. If the latter covariate is not observed, then a model that only incorporates the former covariate may not be as effective in modeling disease spread as a LC-ILM, wherein “high-risk” farms could include those of a certain size and type, as well as a certain level of biosecurity. Another advantage of the LC-ILM over a model that relies on continuous covariate information is that a discrete latent grouping variable has a relatively simple, intuitive interpretation, and also requires few assumptions. In many, if not most, disease epidemics, the underlying model is unknown, and it is, therefore, unclear whether or not continuous covariate information is appropriate. Finally, while there is a computational burden associated with imputing missing discrete group allocation variables, it is likely more computationally efficient to estimate one missing grouping variable, rather than multiple missing covariate values for each individual.

Alternatively, it may have been possible to use a continuous random effect variable (for example, a continuous “risk” term) for each individual, rather than a discrete grouping variable. A continuous variable may not have the same intuitive interpretation as the discrete grouping variable does, for example, it might be easier to categorize a farm as having low or high biosecurity, rather than interpret some continuous variable representing biosecurity. Therefore, it may be harder to determine a prior distribution for a continuous random effect variable. However, in some circumstances, a continuous variable may better account for population heterogeneity and is a topic on which the authors intend to work.

While the use of the LC-ILM in populations where the covariate information is either unreliable or incomplete has been shown to be effective, it is not without its limitations. One challenge of using the LC-ILM in place of the full ILM is the increase in computational power required, as the proportion of known values decreases. Even with relatively small populations and the limited epidemic time frame used in both the simulation study and the case study, the computational time required was significant. In larger populations, or for epidemics that occur over a longer time period, the use of the LC-ILM could be limited by the availability of sufficient computing resources.

In addition to the computational requirements of this method, the identification and/or definition of population subgroups can be difficult. In the primary study presented here, the population was assumed to contain only two subgroups; however, depending on the characteristic(s) on which the population heterogeneity is based, there may in fact be multiple subgroups. For example, with respect to FMD, population heterogeneity based on contact networks or biosecurity issues may also be present. It may also be difficult to define the population subgroups, as was the case in the second simulation study. While the LC-ILM still proved to be more effective than the SILM in describing the disease dynamics, these results may be sensitive to the definition of group boundaries. Furthermore, if a LC-ILM is fitted which allows for more subgroups than may actually exist, the resulting posterior predictive abilities may be skewed. Questions on how to define the number of subgroups, as well as the boundaries for each subgroup, naturally arise. Investigations into model selection techniques, to identify a preferred LC-ILMk, are areas that require further attention. One may opt to have the number and definition of the subgroups to be user-defined, however, a sensitivity analysis on the resulting posterior predictive distribution of test statistics of interest may then be warranted. It may also be possible to use cluster analysis techniques to identify population subgroups. Alternately, methods such as reversible-jump Metropolis-Hastings could be employed, to allow not only the group classifications but also the number of subgroups to be determined by the data. Efficient ways in which to identify potential population subgroups, as well as the effect of varying subgroup definitions on the posterior predictive abilities of the resulting LC-ILMs, are areas that warrant future consideration.

References

1. Ward N, Donaldson A. Policy framing and learning the lessons from the UK’s foot and mouth disease crisis. Environ Plann C: Gov Policy 2004;22:291–306.Search in Google Scholar

2. Keeling MJ, Woolhouse MEJ, Shaw DJ, Matthews L, Chase-Topping M, Haydon DT, et al Dynamics of the 2001 UK foot and mouth epidemic: stochastic dispersal in a heterogeneous landscape. Science 2001;294:813–17.Search in Google Scholar

3. Keeling MJ, Woolhouse MEJ, May RM, Davies G, Grenfell BT. Modelling vaccination strategies against foot-and-mouth disease. Nature 2003;421:136–42.Search in Google Scholar

4. McBryde E, Gibson G, Pettitt A, Zhang Y, Zhao B, McElwain D. Bayesian modelling of an epidemic of severe acute respiratory syndrome. Bull Math Biol 2006;68:889–917.Search in Google Scholar

5. Tildesley MJ, Savill NJ, Shaw DJ, Deardon R, Brooks SP, Woolhouse MEJ, et al. Optimal reactive vaccination strategies for a foot-and-mouth outbreak in the UK. Nature 2006;440:83–6.Search in Google Scholar

6. Cook AR, Otten W, Marion G, Gibson GJ, Gilligan CA. Estimation of multiple transmission rates for epidemics in heterogeneous populations. PNAS 2007;104:20392–7.Search in Google Scholar

7. Savill NJ, Shaw DJ, Deardon R, Tildesley MJ, Keeling MJ, Woolhouse MEJ, et al. Effect of data quality on estimates of farm infectiousness trends in the UK 2001 foot-and-mouth disease epidemic. J R Soc Interface 2007;4:235–41.Search in Google Scholar

8. Chis Ster I, Singh BK, Ferguson NM. Epidemiological inference for partially observed epidemics: the example of the 2001 foot and mouth epidemic in Great Britain. Epidemics 2009;1:21–34.Search in Google Scholar

9. Jewell CP, Kypraios T, Neal P, Roberts GO. Bayesian analysis for emerging infectious diseases. Bayesian Anal 2009;4:191–222.Search in Google Scholar

10. Deardon R, Brooks SP, Grenfell BT, Keeling MJ, Tildesley MJ, Savill NJ, et al. Inference for individual-level models of infectious diseases in large populations. Stat Sin 2010;20:239–61.Search in Google Scholar

11. O’Neill PD. Introduction and snapshot review: relating infectious disease transmission models to data. Stat Med 2010;29:2069–77.Search in Google Scholar

12. Robert CP, Casella G. Monte Carlo statistical methods, 2nd ed. New York: Springer, 2004.Search in Google Scholar

13. Richardson S, Green PJ. On Bayesian analysis of mixtures with an unknown number of components. J R Stat Soc Ser B 1997;59:731–92.Search in Google Scholar

14. Rubin DB. Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann Stat 1984;12:1151–72.Search in Google Scholar

15. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian data analysis, 2nd ed. London: Chapman and Hall/CRC, 2004.Search in Google Scholar

Published Online: 2013-08-03

©2013 by Walter de Gruyter Berlin / Boston