
The International Journal of Biostatistics

Ed. by Chambaz, Antoine / Hubbard, Alan E. / van der Laan, Mark J.


Distance-Based Mapping of Disease Risk

Caroline Jeffery (corresponding author), Liverpool School of Tropical Medicine, Department of International Public Health, Monitoring and Evaluation Technical assistance and Research group, Liverpool L3 5QA, United Kingdom / Al Ozonoff / Laura Forsberg White / Marcello Pagano
Published Online: 2013-05-07 | DOI: https://doi.org/10.1515/ijb-2012-0024

Abstract: In this article, we consider the problem of comparing the distribution of observations in a planar region to a pre-specified null distribution. Our motivation is a surveillance setting where we map locations of incident disease, aiming to monitor these data over time, to locate potential areas of high/low incidence so as to direct public health actions.

We propose a non-parametric approach to distance-based disease risk mapping inspired by tomographic imaging. We consider several one-dimensional projections via the observed distribution of distances to a chosen fixed point; we then compare this distribution to that expected under the null and average these comparisons across projections to compute a relative-risk-like score at each point in the region. The null distribution can be established from historical data. Scores are displayed on the map using a color scale.

In addition, we give a detailed description of the method along with some desirable theoretical properties. To assess the performance of this method further, we compare it to the widely used log ratio of kernel density estimates. As a performance metric, we evaluate the accuracy with which each method locates simulated spatial clusters superimposed on a uniform distribution in the unit disk. Results suggest that both methods can adequately locate this increased risk, but each relies on an appropriate choice of parameters. Our proposed method, distance-based mapping (DBM), can also generalize to arbitrary metric spaces and/or high-dimensional data.

1 Introduction

Geography is an inherent component of public health. Whether the activity concerns policy development, prevention, or disease surveillance, the practice of public health usually focuses on a specific region of the world, such as countries, administrative regions within countries, or cities, along with other characteristics of the population of interest.

In this paper, we consider methods that quantify a variable across a region. Examples in public health are measures of disease occurrence (incidence, prevalence, risk), environmental exposure (carbon monoxide concentration), access to care (number of nursing homes per senior inhabitants), and socio-economic characteristics (income, education, ethnic background). Knowing how the quantity varies throughout the region allows us to better hypothesize on reasons for these patterns or any effect they might have, or to use them to develop targeted decisions and interventions. This is of particular importance in disease surveillance [1] or syndromic surveillance [2], where potential relationships between an individual’s location and the acquisition of disease can be used to prevent further spread of disease when cases have already occurred, or to guide (re)allocation of resources to improve access to care. Research in these areas promotes the development of spatial methods for measures of disease occurrence, for which spatial data on cases is a key source of information [3, 4].

Typically, we wish to determine whether observed spatial patterns of disease or some other outcome conform to “expected” patterns. This requires some estimate of what expected behavior would be, and thus these problems are typically framed in the context of comparing two distributions [5]. This expected distribution accounts for non-homogeneous patterns in demography and geography. Examples of this include the comparison of observed locations of cancer cases to a representative sample of the population at risk [6–8]. Alternatively, the locations of cases of a given disease can be compared to those of another disease [5]. In an ongoing syndromic surveillance system, locations of cases reported during one day/week are traditionally compared to those reported during previous days/weeks [4, 9–11]. A similar comparison of interest might be between the spatial distribution of cases during a fixed time period and the distribution of cases of the same disease before an intervention or an event, like the installation of an industrial site, the reinforcement of a health care system, a war, or an earthquake.

Two approaches have been devised for this problem. The first is concerned with the determination of whether there is a deviation between observed and expected spatial patterns (see, for example, [12–15]). An alternative approach is disease mapping, wherein one maps some measure of risk or comparative risk. This latter method potentially provides a richer and more complete picture of the problem at hand by empowering the researcher to observe patterns throughout the entire region of interest, rather than just being alerted to areas of concern. In what follows, we focus our attention on this mapping approach.

When spatial data are available in point form, kernel methods have emerged as a natural approach to estimate risk. Assuming that the data consist of two independent random samples arising from two Poisson processes, the simplest approach is to take a ratio of kernel density estimates [5, 16–18]. The kernel functions and the two-dimensional bandwidths are traditionally chosen to be the same for both estimates [17]. One can show that, up to an additive constant, this risk estimate is the same as the odds estimated from a binary regression modeling the probability of being a case conditional on location x [18]. A third approach is a generalized additive model (GAM), a natural extension of the previous model when covariates are available [18]. The method models the probability that an individual at location x with covariates z is a case as

logit p(x, z) = z′β + g(x),

where g is a smooth function representing the unexplained spatial variation.

Kernel methods are useful for estimating densities since they can provide consistent and asymptotically normal estimates (see review in [19]). Applications are seen throughout the environmental and neuroimaging literature [20–23]. However, in higher dimensions, they are known to suffer from the empty space phenomenon or curse of dimensionality. In this case, the bias-variance trade-off is poor without a large sample size, and large smoothing parameters lead to small variance but high bias, thus masking local variation [24]. Recognizing these limitations, some have considered projection to a lower-dimensional subspace before performing density estimation. For example, a principal component transformation can highlight which dimensions contain most of the characteristics of the density to be estimated [25]. Successive many-to-one projections applied with an optimizing criterion can further sharpen the estimation of complex density patterns [26].

In this work, we propose a new non-parametric approach to disease risk mapping [4, 27–30], based on the concept of dimension reduction, and illustrate its key advantages in the disease mapping context. We show that in the two-dimensional setting, it is comparable in performance to kernel density methods and further illustrate the capacity to extend it to higher dimensions. Thus, this method represents not only an acceptable alternative to kernel density methods but also a more flexible framework for problems that require higher dimensions than kernel density methods allow.

In Section 2, we present the method and its key properties. In Section 3, we provide results from simulations that compare our method with kernel methods in a variety of settings. We propose a metric for comparing these methods. Finally, we discuss the relative merits of this method, examples of where it would be applied to higher dimensional problems, and its utility in public health practice.

2 Methods

After introducing a notational framework, we present our mapping method for point data and its theoretical properties.

2.1 Notation and definitions

Consider a study region R, a bounded subset of the Euclidean space ℝ². Since we are mapping a risk of disease on a two-dimensional region, we limit ourselves to this special case for now. However, the framework described below is applicable to higher dimensions. Furthermore, we use the Euclidean distance measure, denoted by d(·, ·). We assume the region is the support for both a case’s location and the domain of the mapping function. In practice, the latter is represented by a finite set of grid points in R, which are chosen by superimposing a lattice over the study region. Hence, we define the mapping function on this finite set, denoted R′.

Definition 2.1. Let R′ be a finite collection of points in the Euclidean space. We call any real-valued function M defined on R′ a (disease) mapping function, and we call the range of M, i.e. the set of function values M(R′), the set of (disease) scores.

The type of observed data further refines the interpretation of M. If we let Ω represent the sample space of the data, then an observed sample will be an element of the class of finite subsets of Ω. Then, we can give a broad definition for an estimator of M.

Definition 2.2. Let R′ be a finite collection of points in the Euclidean space and M a mapping function on R′. Let ℱ(R′) be the set of real-valued functions defined on R′; then M is an element of ℱ(R′). We call an estimator of M any function mapping the class of finite subsets of Ω into ℱ(R′). To simplify notation, we denote by M̂ the estimated image function rather than the estimator itself.

Hence, given a set of observations sampled in Ω, an estimator of M produces a real-valued function M̂ defined on R′.

In the context of mapping a risk of disease, we can define the sample space and the mapping function more precisely. Assume that the spatial data on each case consist of a single location in R; then this location is a random point in R. In practice, since the population from which cases arise is finite, the sample space Ω should be a finite collection of points in R. However, to allow for a flexible framework, such as defining probability density functions (PDFs) and possibly extending the framework to incorporate the mobility of individuals over time, we let Ω = R.

Consider now that X is the bivariate random vector representing the location of a case in Ω = R, and let F be the corresponding bivariate cumulative distribution function (CDF). We frame a disease risk mapping function as a comparison between F and a reference function F0 throughout the region R, where the two functions represent the ‘observed’ and ‘expected’ spatial distribution of cases, respectively.

One way to compare F and F0 is to estimate a risk difference or risk ratio of the corresponding PDFs f and f0. For example, the log-risk ratio [16, 17] is defined on R as:

ρ(y) = log{ f(y)/f0(y) }.   [1]

This expression simplifies to zero for all y when f = f0, which corresponds to a flat map.
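As an illustration of eq. [1], the log ratio of two kernel density estimates can be sketched in a few lines. This is a minimal sketch of the comparator method, not the paper's implementation: it assumes Gaussian kernels via scipy's `gaussian_kde`, an arbitrary common bandwidth, and simulated normal samples standing in for cases and controls.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
cases = rng.normal(0.0, 0.5, size=(2, 200))     # observed case locations (2 x n)
controls = rng.normal(0.0, 0.5, size=(2, 500))  # sample from the reference distribution

# A common bandwidth for both estimates, as recommended for the log ratio.
bw = 0.3
f_hat = gaussian_kde(cases, bw_method=bw)
f0_hat = gaussian_kde(controls, bw_method=bw)

# Evaluate rho(y) = log f(y) - log f0(y) on a 21 x 21 grid.
grid = np.mgrid[-1:1:21j, -1:1:21j].reshape(2, -1)
rho = np.log(f_hat(grid)) - np.log(f0_hat(grid))
print(rho.shape)  # one score per grid point: (441,)
```

Since both samples are drawn here from the same distribution, the resulting scores fluctuate around zero, the flat-map case noted above.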

The method we now describe proposes to estimate a risk-like function, yet it is not a direct comparison of f and f0 as in eq. [1], but rather a comparison of transformations of F and F0.

2.2 Distance-based mapping (DBM)

The general motivation and strategy behind tomographic mapping is the reduction of dimensionality [31]. Many one-dimensional methods are difficult to extend to higher dimensions. Multi-dimensional data can always be reduced in dimension via projection to a hyperspace, but this reduction comes at the price of loss of information, determined by the particular hyperspace chosen. However, several projections can be considered simultaneously, and thus we can describe a dimension reduction strategy schematically as using the following steps.

  • Step 1: Select a projection of the data to a chosen one-dimensional subspace.

  • Step 2: Make a one-dimensional comparison.

  • Step 3: Repeat steps 1 and 2 for a representative sample of subspaces.

  • Step 4: Aggregate the one-dimensional comparisons for a composite multivariate comparison.

We follow this strategy to compare F and F0 at each point y ∈ R′. As each of the four steps requires prior decisions, such as the choice of a projection in Step 1, we describe them in detail in the following four sections. We assume for now that F0 is known, but consider dropping this assumption later.

To illustrate the different steps, we provide several pictures in Figure 1 and refer to them when necessary.

Figure 1: (a) Project the data in one dimension. (b) Make a comparison between two one-dimensional distributions. (c) Several one-dimensional comparisons. (d) Averaging over all five circle points.

Step 1: Select a projection of the data in one dimension. To reduce the multi-dimensionality of the data to one dimension, we consider the distribution of distances to a chosen fixed point. More specifically, let ci be one of several chosen fixed points, indexed by i. Then, as one-dimensional counterparts of F and F0, consider the CDFs of distances from ci:

Fi(t) = PF( d(X, ci) ≤ t )  and  F0,i(t) = PF0( d(X, ci) ≤ t ),

each written as the expectation of the indicator 1{d(X, ci) ≤ t}, which equals 1 when its argument is true, and 0 otherwise. This first step is illustrated in the top left panel of Figure 1. The study region R is shaped as a disk. The point y is where we want to make the local comparison of F and F0. The point ci is chosen outside of R. The next step focuses on comparing Fi and F0,i at t = d(ci, y).

Step 2: Make a one-dimensional comparison. As a comparison we define the function:

γi(y) = ψ( PF( d(X, ci) ∈ Ii(y) ), PF0( d(X, ci) ∈ Ii(y) ) ),

where ψ maps [0, 1] × [0, 1] to the real line and Ii(y) is a subset of the real line containing t = d(ci, y). The function ψ is a comparison measure, for example, a difference or a ratio, while Ii(y) is a neighborhood of t on which to make the comparison. To make this common support unique with regard to y, we choose it of the form Ii(y) = [t − h, t + h], where the choice of the half-width h is described in the following step.

The top right panel of Figure 1 illustrates this second step. The segment joining ci to y is of length t. The annulus delimited by the two dashed lines illustrates the interval Ii(y). Points whose distance from ci falls in Ii(y) are represented by the shaded intersection of the annulus and R. We call that intersection Ai(y).

Step 3: Repeat steps 1 and 2 for a representative sample of subspaces. The projection described in Step 1 simplifies the data to one dimension, which results in a loss of information. Our one-dimensional comparison cannot discern changes in the two distributions for points within the same distance of ci. To compensate, we reproduce steps 1 and 2 using different locations for ci. We choose to place these points on a circle surrounding the region R. Let C denote the circle circumscribed around R, centered at xC with radius rC. Choose xC as the center of gravity of R under F0. We divide the circumference of C into N equal arcs described by the points c1, …, cN on the circumference.

To ensure all projections play an equal role, we can either fix the length of Ii(y) or fix the proportion of the reference population it covers. We choose the second option, that is, given a fixed proportion p0, we determine the unique half-width h = h(t, i) that guarantees

PF0( d(X, ci) ∈ Ii(y) ) = p0.   [2]
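Eq. [2] pins down the half-width as a quantile: since P( d(X, ci) ∈ [t − h, t + h] ) = P( |d(X, ci) − t| ≤ h ), the required h is the p0-quantile of |d(X, ci) − t| under F0. The following sketch (our own illustration, with uniform distances standing in for a resampled reference) checks this identity numerically.

```python
import numpy as np

def halfwidth(ref_dists, t, p0):
    """h(t, i) such that [t - h, t + h] captures a proportion p0 of the
    reference distances: the p0-quantile of |d - t|."""
    return np.quantile(np.abs(ref_dists - t), p0)

# Check: with reference distances uniform on [0, 1] and t = 0.5, the
# interval [0.5 - h, 0.5 + h] has probability 2h, so h should be p0 / 2.
rng = np.random.default_rng(0)
d = rng.random(500_000)
h = halfwidth(d, 0.5, 0.2)
print(round(h, 3))  # close to 0.1
```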
By keeping the parameter p0 fixed across all projections, the value h(t, i) will vary with i and t. Each comparison from Step 2 can then be rewritten as

γi(y) = ψ( PF( X ∈ Ai(y) ), p0 ).   [3]

The parameter p0 plays the role of a ‘bandwidth’ parameter defining the area covered by each Ai(y) around y. We usually use p0 = 0.1, the value adopted in the illustrations of Section 3.

Finally, we introduce a further notation to simplify expression [3]: we write γi(y) = ψ( EF[1{X ∈ Ai(y)}], p0 ), where EF denotes the expected value relative to F. We also refer to the points c1, …, cN as “circle points”. For each of them we can then calculate the corresponding Ai(y). In the bottom left panel of Figure 1, four other circle points are placed around R, and their corresponding Ai(y) are delimited by dotted lines.

Step 4: Composite multivariate comparison. In the final step, we define the distance-based mapping (DBM) as a real-valued function Γ, defined for any point y ∈ R′ as the average of the one-dimensional comparisons:

Γ(y) = (1/N) Σi γi(y).

To interpret this quantity further, we consider the following property for ψ.

Definition 2.3. Let ψ be a real-valued function on [0, 1] × [0, 1]. We say ψ is semi-linear if and only if for any N, any a1, …, aN in [0, 1], and any b in [0, 1] such that ψ(·, b) is defined, the following holds:

(1/N) Σi ψ(ai, b) = ψ( (1/N) Σi ai, b ).
If ψ is semi-linear, for example, when it is a ratio or a difference, using eq. [3] for all i, Γ simplifies to:

Γ(y) = ψ( (1/N) Σi PF( X ∈ Ai(y) ), p0 ).   [4]

From this expression [4], that is, when ψ is semi-linear, we can interpret Γ as a function making a local comparison of transformations of F and F0. More precisely, given a distribution G and a point y, the transformation maps y to (1/N) Σi PG( X ∈ Ai(y) ). When G = F0, the transformation reduces to the single scalar p0. A good feature of Γ is that, similarly to the log-risk ratio from eq. [1], if the two functions to be compared are equal (i.e. F = F0), then Γ is a scalar, equal to 0 or 1 when ψ is defined as a difference or a ratio, respectively.
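The semi-linearity of the difference and the ratio, and the failure of, say, a log-ratio comparison to satisfy it, can be verified numerically. This check is our own illustration, with random numbers standing in for the probabilities PF( X ∈ Ai(y) ).

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(10)   # stand-ins for the N probabilities entering eq. [3]
b = 0.1              # the common reference value p0

# Difference: mean_i (a_i - b) equals (mean_i a_i) - b.
assert np.isclose(np.mean(a - b), np.mean(a) - b)

# Ratio: mean_i (a_i / b) equals (mean_i a_i) / b.
assert np.isclose(np.mean(a / b), np.mean(a) / b)

# Non-example: averaging log-ratios differs from the log-ratio of the
# average (Jensen's inequality), so the log-ratio is not semi-linear.
assert not np.isclose(np.mean(np.log(a / b)), np.log(np.mean(a) / b))
```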

The bottom right panel of Figure 1 depicts graphically the calculation of the composite score Γ. One hundred observations from which F can be estimated have also been plotted. All the intersections Ai(y) overlap around y, the point of interest. Observations falling in that overlapping area will contribute the most to the composite score. This contribution decreases in a discrete fashion as the observations get further away from y. Yet if there are more observations than expected in some of the Ai(y), the corresponding one-dimensional comparisons will be large, leading to a high value for Γ(y).
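The four steps of Section 2.2 can be sketched end-to-end. This is a minimal illustration under our own simplifying assumptions (ψ taken as the difference, half-widths found by bisection on an empirical reference sample); function names such as `dbm_score` and `interval_halfwidth` are ours, not the paper's.

```python
import numpy as np

def interval_halfwidth(d_ref, t, p0):
    # Smallest h such that [t - h, t + h] captures a proportion p0 of
    # the reference distances (empirical version of eq. [2]), by bisection.
    lo, hi = 0.0, d_ref.max() + t
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if np.mean(np.abs(d_ref - t) <= mid) < p0:
            lo = mid
        else:
            hi = mid
    return hi

def dbm_score(y, cases, ref, centers, p0=0.1):
    # Steps 1-4 for a single grid point y; psi is the difference.
    comparisons = []
    for c in centers:
        d_cases = np.linalg.norm(cases - c, axis=1)   # Step 1: project
        d_ref = np.linalg.norm(ref - c, axis=1)
        t = np.linalg.norm(y - c)
        h = interval_halfwidth(d_ref, t, p0)
        p_obs = np.mean(np.abs(d_cases - t) <= h)     # Step 2: compare
        comparisons.append(p_obs - p0)
    return np.mean(comparisons)                       # Steps 3-4: average

# Usage: null data (cases drawn from the reference distribution) and
# ten circle points on a circle of radius 2 around the unit disk.
rng = np.random.default_rng(1)
def unit_disk(n):
    r, th = np.sqrt(rng.random(n)), 2 * np.pi * rng.random(n)
    return np.column_stack([r * np.cos(th), r * np.sin(th)])

angles = 2 * np.pi * np.arange(10) / 10
centers = 2.0 * np.column_stack([np.cos(angles), np.sin(angles)])
score = dbm_score(np.array([0.0, 0.0]), unit_disk(100), unit_disk(5000), centers)
```

Since the cases here are a genuine sample from the reference distribution, the score hovers near zero; evaluating it over a lattice of grid points would produce the roughly flat maps of Section 3.1.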

2.3 Estimation and properties

Our proposed method of disease mapping is defined for any proper CDFs F and F0. While a parametric assumption on either function might simplify the form of Γ, we describe here a few of its properties when F and F0 remain in their most general form. We assume that F0 remains known and that the function ψ is defined as a difference, that is ψ(a, b) = a − b. A few comments are made for when ψ is defined as a ratio in Section 2.3.4. The results for the general case when ψ is semi-linear are given in the Appendix (Section 5.1).

First, we consider an estimator of Γ. Let X1, …, Xn be independent random variables identically distributed according to F. For fixed t, we can estimate PF( X ∈ Ai(y) ) with the empirical proportion (1/n) Σj 1{Xj ∈ Ai(y)} and define an estimator for Γ:

Γ̂(y) = (1/N) Σi [ (1/n) Σj 1{Xj ∈ Ai(y)} − p0 ].
We now describe some properties of our estimator. The details of the calculations are given in the Appendix (Section 5.1).

2.3.1 Mean and variance

The mean and variance of Γ̂ at a given point y are derived in the Appendix; in particular, E[Γ̂(y)] = Γ(y). A few comments can be made from these expressions. First, both mean and variance vary with the location y through the probabilities PF( X ∈ Ai(y) ). Secondly, we see that our estimator is unbiased for Γ(y) and that the variance approaches zero as the sample size n approaches infinity. These results guarantee that for a fixed location y, our estimator consistently estimates Γ(y). Finally, we can obtain a bound for the variance (eq. [5]) that decreases in both n and N. The bound asymptotes quite quickly in N, and its values are similar when N = 10 or N = ∞. Furthermore, the actual relationship between N and the variance could be quite different from that between N and this bound.

2.3.2 Mean and variance under H0

As mentioned in Step 4 of Section 2.2, when F = F0, Γ(y) = 0 for all y, which corresponds to a flat map. The function Γ shares this feature with the log-risk ratio. Under the null, PF( X ∈ Ai(y) ) = p0 for all i and y, so the mean of our estimator is zero and its variance simplifies accordingly (eq. [6]). Now the mean is independent of the location y, and the variance, using the fact that PF0( X ∈ Ai(y) ) = p0 for all y, can be bounded by a value (eq. [7]) sharper than the bound in eq. [5], regardless of the values of p0 and N.

2.3.3 Mean and variance under H1

Section 2.3.2 shows that Γ̂ has desirable features under H0. Yet, it is also vital that it performs well when F and F0 differ locally. Here, we investigate a very particular alternative, where a high excess of risk occurs only at a single location in the region. For that, fix a point a in R and consider the Dirac delta function (or impulse symbol, [32]) δa, defined on R so that it concentrates all its mass at a. Now let f and f0 be the probability density functions associated with F and F0, respectively, and suppose that f = q f0 + (1 − q) δa for some fixed value q between 0 and 1. Here, q and 1 − q represent the proportions of cases distributed according to f0 and δa, respectively. This alternative aims to represent the simplest case of a locally increased risk.

First, we look at the behavior of the log density ratio and our density mapping method under this alternative. The log ratio dichotomizes the region into two distinct areas:

log{ f(x)/f0(x) } = log q  for x ≠ a,  and  log{ f(x)/f0(x) } → ∞  at x = a.

Since q is between 0 and 1, we see that the log ratio takes a negative fixed value throughout the region except at a, where it jumps to ∞.

Our density mapping function responds to this alternative differently. Under H1, we have

Γ(y) = (1 − q) ( Na(y)/N − p0 ),

where Na(y) is the number of intersections Ai(y) (defined in Step 2) that include a. In the above expression, the value of Γ(y) reduces to a comparison between Na(y)/N and p0, ranging between −(1 − q) p0 and (1 − q)(1 − p0). Thus, although the true excess of risk is only at the point a, the value of Γ(y) is related to the proximity between y and a, the highest value being achieved at points y included in all N intersections Ai(a), for example y = a.

The variance of the estimator Γ̂ is given in the Appendix. One can verify that for q = 1, both mean and variance equal those determined under H0. The variance decomposes into one term similar to the variance in eq. [6] under H0, and a second term resulting from the high excess of risk at point a. In particular, one can show that for fixed y, the variance admits a bound that is sharper than the one given in eq. [5], regardless of the values of q, p0, and N.

2.3.4 Comments for when ψ is a ratio

Suppose ψ(a, b) = a/b; then our score function takes the form

Γ(y) = (1/(N p0)) Σi PF( X ∈ Ai(y) ).

Consequently, the mean of our corresponding estimator is equal to 1 under the null H0. Furthermore, all the variance formulas and bounds given earlier are rescaled by 1/p0². Finally, under H1, our mapping function simplifies to Γ(y) = q + (1 − q) Na(y)/(N p0), where Na(y) is the number of intersections Ai(y) that include a, ranging between q and q + (1 − q)/p0. In particular, the maximum is attained at points y included in all N intersections Ai(a), for example y = a. If p0 is chosen small relative to 1 − q, the resulting large value of Γ(y) for y close to a can reflect the high excess of risk at a.

3 Evaluation

We now investigate the performance of DBM through simulations. We first provide an illustration of the method and of the log ratio of kernel density estimates (log KDR), as presented by Kelsall and Diggle [17].

3.1 Illustration

The function Γ̂ described earlier is defined for any point y in the study region R, but we restricted it to the discrete set R′ of r grid points, where r quantifies the resolution of the map. The resulting values can then be displayed on a map using a color scale. Figure 2 illustrates how such a map can be obtained with DBM or log KDR. There the region is the unit disk, overlaid with a lattice of r grid points. The spatial distribution of cases F and the reference distribution F0 are both uniform in the unit disk, that is, there is no discrepancy between them. We chose a sample size of n = 100, and implemented DBM and log KDR with {N = 10, p0 = 0.1} and {bandwidth = 0.3, normal kernel}, respectively. For each approach, the color scale is determined empirically, by resampling from the historical data F0. More precisely, we determine four color cutoffs by repeating the following 100 times for DBM or log KDR: sample n = 100 points from F0, calculate the disease score at each of the r grid points, and store the minimum and maximum values. The four cutoffs (dark blue, light blue, orange, red) are defined as the 5th percentile and highest of the 100 minima, and the lowest and 95th percentile of the 100 maxima. The color transitions for the intermediary values are created using the built-in R function rainbow. Since this example maps a genuine random sample from the historical data, the ‘true’ map is flat, i.e. it displays no spatial variation.
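The empirical calibration of the color scale can be sketched as follows. Here `score_map` is a hypothetical stand-in for either mapping method (a simple count-based score, for illustration only), and the resampling loop mirrors the 100-repetition procedure described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_map(sample, grid):
    # Placeholder for a mapping method (DBM or log KDR): the fraction of
    # the sample within distance 0.3 of each grid point.
    d = np.linalg.norm(grid[:, None, :] - sample[None, :, :], axis=2)
    return (d < 0.3).mean(axis=1)

def sample_null(n):
    # Stand-in for resampling n points from the historical data F0
    # (uniform on the unit disk here).
    r, th = np.sqrt(rng.random(n)), 2 * np.pi * rng.random(n)
    return np.column_stack([r * np.cos(th), r * np.sin(th)])

grid = sample_null(200)          # any fixed set of grid points works
minima, maxima = [], []
for _ in range(100):             # 100 resamples of size n = 100 from F0
    s = score_map(sample_null(100), grid)
    minima.append(s.min())
    maxima.append(s.max())

# Four color cutoffs: 5th percentile and highest of the minima,
# lowest and 95th percentile of the maxima.
cuts = (np.quantile(minima, 0.05), max(minima),
        min(maxima), np.quantile(maxima, 0.95))
```

Scores below the first cutoff or above the last are then colored as extreme (dark blue or red), with the intermediate colors interpolated between.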

Figure 2: Mapping of simulated data consisting of 100 points uniformly distributed in the unit disk by DBM ((a) N = 10, p0 = 0.1) and log KDR ((b) bandwidth = 0.3, normal kernel).

In Figure 3, we consider the two mapping approaches keeping the same color scheme, but now the density of cases has been increased in a small square region around the origin. The excess of cases is identified by both methods, but not with complete accuracy. Some grid points labeled with higher risk fall outside the square, while not all grid points within the square are labeled with higher risk.

Figure 3: Mapping of simulated data consisting of 10 points centered at the origin and 90 points uniformly distributed in the unit disk by DBM ((a) N = 10, p0 = 0.1) and log KDR ((b) bandwidth = 0.3, normal kernel). The square at the origin delimits the subregion with increased density.

3.2 Simulation study

We simulate spatial data on the unit disk and measure the ability of DBM to locate added clusters of cases. As a comparison procedure, we map the same data using log KDR. The clusters, which we refer to casually as ‘hot spots’, are created by increasing the risk of disease in a particular area of the region by a fixed value. The following sections describe the details of the simulation scheme depicted in Figure 4 along with the results.

Figure 4: Simulation scheme for DBM and log KDR.

3.2.1 Reference population (F0)

We first generate a large collection of points distributed uniformly in the unit disk. This population corresponds to the reference population, distributed according to F0. In a surveillance setting, these generated data represent an underlying historical record of past disease (i.e. cases previously recorded by an existing surveillance system).

3.2.2 Simulations of outbreaks (F)

Our simulated clusters are of different shapes (Figure 5). Each simulation consists of 100 ‘incident’ cases, i.e. cases recorded by the surveillance system in the most recent time period. The majority of the incident cases are distributed identically to the cases drawn from the reference population. However, 10 of 100 incident cases are repositioned uniformly in a cluster region. Six of the cluster regions are square-shaped, defined by a center and a diameter, where the diameter refers to the edge length of the square. We choose two positions for the cluster center, at the origin and at the boundary of the disk, (0, 0.9), and three values for the cluster diameter: 0.1, 0.2, and 0.5, referred to as ‘small’, ‘medium’, and ‘large’ respectively. A square is an idealized version of disease clusters that have occurred in the past [33–35]. The remaining three cluster regions are a rectangle centered at the origin of width 0.2 and height 1.2, a “road” system, and two squares centered at (0, 0) and (0, 0.9) with a diameter value of 0.2.

Figure 5: (a) Six square-shaped cluster regions. Only a portion of the largest square at the boundary is included in the unit disk. (b) Rectangular cluster and “road” system (shaded gray region). (c) Two-square cluster.

The geographic extent of the cluster region and the number of incident cases determine the relative risk of disease between samples containing an outbreak and the reference population [36]. More precisely, the relative risk can be expressed as the ratio between the probabilities of falling in the cluster region S under the alternative and under the null:

RR = PH1( X ∈ S ) / PH0( X ∈ S ) = ( 0.9 A(S)/π + 0.1 ) / ( A(S)/π ),

where A(S) is the area of S and π is the area of the unit disk.

The relative risks for our simulated clusters are displayed in Table 1, along with the percentage of the unit disk’s area that each covers. Since we are using the uniform distribution for the reference population F0, for a fixed diameter, the relative risk of single square clusters remains the same regardless of the location of the cluster center. However, for the largest diameter, the cluster region at the boundary is only the intersection of the square and the unit disk, so its area, and hence its relative risk, differs.
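With 10 of 100 incident cases repositioned uniformly in a square S fully inside the unit disk, the relative-risk formula above can be checked by a quick computation (our own check); for the medium square of diameter 0.2 it reproduces the value of 8.8 quoted in Section 3.2.5, and outside the cluster region the relative risk is simply 0.9.

```python
from math import pi

def relative_risk(area_S, q=0.9):
    # RR = P_H1(X in S) / P_H0(X in S), with a proportion 1 - q of the
    # cases repositioned uniformly in S and the rest uniform on the
    # unit disk (area pi).
    p_null = area_S / pi
    p_alt = q * p_null + (1 - q)
    return p_alt / p_null

rr = relative_risk(0.2 ** 2)  # square cluster with edge length 0.2
print(round(rr, 1))  # 8.8
```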

Table 1

Characteristics of cluster regions.

3.2.3 Implementation of log KDR (Figure 4, box 1b)

In this two-dimensional setting, the bandwidth for the kernel approach is two-dimensional, but since we are simulating our data in the unit disk, a single scalar bandwidth is used in both dimensions. We consider five parameter settings for log KDR. First, we use a normal kernel function with two bandwidth choices, originally selected from a larger set of bandwidths (see Appendix 5.2). These settings do not control for edge effects, although edge effects are a common concern with kernel density methods [37, 38]. There exists a built-in function in R (splancs library) that estimates a kernel density ratio with a quartic kernel incorporating an edge correction described elsewhere [39, 40]. We use this kernel function with three bandwidth choices, also selected from a wider set of values. We designate log KDR mapping with normal and quartic kernels by KDRn and KDRq, respectively.

3.2.4 Implementation of DBM (Figure 4, box 1a)

For the one-dimensional comparison measure, we choose the difference, ψ(a, b) = a − b. The circle points are equally spaced along a circle centered at the origin with radius 2. We consider four parameter settings for the number of circle points N and the bandwidth p0. With N = 10, we set the parameter p0 to one of three values, also selected from a larger set (see Appendix 5.2). With the other choice of N, we set p0 to 0.1. Given a parameter setting (N, p0), we resample from F0 (500,000 iterations) to determine the widths of the intervals Ii(y) which guarantee eq. [2] (Figure 4, box 2a).

3.2.5 Metric for evaluation (Figure 4, oval 4)

The simulated outbreaks described in Section 3.2.2 dichotomize the expected relative risk throughout the region. For example, when the diameter is 0.2, the expected relative risk is 8.8 in the cluster region, while it remains at 0.9 outside the cluster region. According to these metrics, a strong performance from DBM should exhibit values that clearly dichotomize according to a threshold τ. We then partition R′ into four regions, according to whether a grid point lies inside or outside the cluster region and whether its score exceeds τ. From these, a measure of accuracy for the mapping can be represented by the following pair of metrics: first, the proportion of the cluster region correctly identified as having higher incidence of disease, and second, the proportion of the map outside the cluster region correctly identified as having incidence similar to the reference data, where each proportion is computed with the counting measure over the grid points inside and outside the cluster region (see Section 3.1). These proportions are analogous to traditional definitions of sensitivity and specificity; hence, from now on, we will refer to them using these two terms (Figure 4, oval 4).
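These two proportions can be computed directly from the grid scores. A minimal sketch with a toy grid (the function name and the data are ours):

```python
import numpy as np

def sens_spec(scores, in_cluster, tau):
    """Proportion of cluster grid points flagged (score > tau), and
    proportion of non-cluster grid points not flagged."""
    flagged = scores > tau
    sensitivity = flagged[in_cluster].mean()
    specificity = (~flagged[~in_cluster]).mean()
    return sensitivity, specificity

# Toy example: 6 grid points, the first two inside the cluster region.
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.1, 0.3])
in_cluster = np.array([True, True, False, False, False, False])
print(sens_spec(scores, in_cluster, tau=0.5))  # (1.0, 0.75)
```

Here both cluster points are flagged (sensitivity 1.0), while one of the four non-cluster points is also flagged (specificity 0.75), mirroring the behavior discussed for Figure 6 below.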

For the 100 points sampled in Figure 3, Figure 6 shows the grid points that receive a high score under both mapping methods. Both cover the entire higher-risk square region, hence the sensitivities are both one. Some grid points outside the gray region are also given a high score under both mapping approaches, especially the ones along the borders of the square. Hence, both specificities are strictly less than one.

Figure 6: Points in gray highlight the grid points with high score ((a) DBM and (b) log KDR) after mapping the data from Figure 3 (n = 100). Sensitivity and specificity are provided below each panel.

The values of sensitivity and specificity depend on the choice of threshold . To circumvent this, we allow the threshold to vary over the range of observed scores, so as to build a receiver operating characteristic (ROC) curve and calculate the corresponding area under the curve (AUC).
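Sweeping the threshold over the observed scores and accumulating the trapezoidal area under the resulting ROC curve can be sketched as follows; this is an illustrative implementation, not the authors' code.

```python
import numpy as np

def auc_from_scores(scores, in_cluster):
    """Area under the ROC curve obtained by letting the threshold vary
    over the range of observed scores (sketch)."""
    thresholds = np.sort(np.unique(scores))[::-1]   # sweep from high to low
    tpr, fpr = [0.0], [0.0]
    n_in = np.sum(in_cluster)
    n_out = np.sum(~in_cluster)
    for c in thresholds:
        flagged = scores >= c
        tpr.append(np.sum(flagged & in_cluster) / n_in)    # sensitivity
        fpr.append(np.sum(flagged & ~in_cluster) / n_out)  # 1 - specificity
    # trapezoidal rule over the (fpr, tpr) points
    area = 0.0
    for k in range(1, len(fpr)):
        area += (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
    return float(area)
```

A score that ranks every cluster grid point above every non-cluster grid point yields an AUC of 1; an uninformative score gives roughly 0.5.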

3.2.6 Results

Results of the simulations are displayed in Tables 2 and 3, where we report the mean and the 2.5th and 97.5th percentiles of the 1,000 AUC values calculated for each of the nine cluster regions (rows) and nine mapping approaches (columns). As described in Sections 3.2.3 and 3.2.4, the nine mapping approaches are derived from the three methods: DBM , KDRq and KDRn . We say that a mapping approach performs well when the mean AUC is high and the 95% confidence interval (CI) is narrow. However, the two features might not always be observed simultaneously. We use the term “bandwidth” to represent p0 or h depending on the method.

Table 2

Distribution of 1,000 AUC values for nine cluster regions for DBM (mean and 95% CI).

Table 3

Distribution of 1,000 AUC values for nine cluster regions for log KDR (mean and 95% CI).

We first look at the performance of each mapping approach individually under the different alternatives. For each approach, the cluster regions with the smallest area (geographical extent) are most successfully identified, i.e. with the highest mean AUC and tightest CI. Additionally, the performance with the 2-square cluster is comparable to the performance with the largest 1-square cluster at the origin (three times its area) when using DBM , KDRq , and KDRn , while it is comparable to the performance with the largest 1-square cluster at the boundary (twice its area) when using DBM , KDRq and KDRn . With and KDRq , the 2-square cluster is not identified as well as either of the two larger 1-square clusters. Finally, although the areas of the rectangle and the larger cluster at the boundary are practically equal, each mapping approach performs better with the square than with the rectangle.

We now consider results for each cluster region separately. For the smallest and medium-sized 1-square clusters, both at the origin and at the boundary, all nine mapping approaches are largely indistinguishable and produce AUC distributions equal or close to 1, except for DBM and KDRn , which have wider CIs. For the largest 1-square cluster, CIs are wide, but the three mapping approaches with the highest bandwidths have the highest means. Confidence intervals are also wider for the rectangle and the roads, for which the nine mappings give similar AUC distributions; however, for the roads cluster, the CIs appear wider with DBM mappings than with KDR mappings. For the 2-square cluster, CIs are tighter again; both KDRq and KDRn with the smallest bandwidth give the best results, while DBM also produces a high mean AUC but with a slightly wider CI.

Finally, we consider the choice of parameters for each method. When using DBM with , the value is preferred overall, except for the largest 1-square cluster, where p0 = 0.3 gives better results. Additionally, for the roads cluster, the mean AUC increases with p0 as the CI widens. In general, DBM with outperforms any of the other three DBM mappings with , except for the larger 1-square cluster at the origin and the 2-square cluster. With KDRq, the middle bandwidth value, , is usually preferred. However, the three mappings tend to give very similar results for the rectangle and the roads. Additionally, the largest value gives the highest mean for the largest 1-square cluster both at the origin and at the boundary, while the smallest value is best with the 2-square cluster. With KDRn, the smallest bandwidth value gives the best results, except with the larger 1-square cluster, where the highest bandwidth value performs best. For all three methods, there appears to be an optimal bandwidth value, which is not highly dependent on the shape or area of the cluster region, except in the case of a large single-square cluster.

4 Discussion

This paper presents a new approach to disease mapping for point level data, inspired by tomography and distances. It can be described in four main steps: first, reduce the multi-dimensionality of the data into one dimension by considering the distances between the data points and a fixed point placed outside the region; second, make a comparison between the observed and expected distributions of distances; third, repeat the first two steps with other fixed points placed evenly around the region; fourth and last, take the mean of the comparisons over all the projections. This method reformulates the problem of comparing two multi-dimensional distributions as making N comparisons of two one-dimensional distributions. While this reduction via a single projection would result in a drastic loss of information, we take advantage of the fact that we can project onto a variety of planes to reconstruct much of what was lost.
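The four steps can be sketched in code. This is a minimal illustration, not the authors' implementation: a fixed annulus width `width` stands in for the interval widths derived from p0, the comparison ψ is taken to be the difference, and the null distance distribution is estimated empirically from a reference sample.

```python
import numpy as np

def dbm_score(y, data, null_sample, n_circle_points=10, radius=2.0, width=0.2):
    """Distance-based mapping score at a grid point y (sketch).

    data        : (n, 2) observed case locations
    null_sample : (m, 2) draw from the null (e.g. historical) distribution
    width       : annulus width; stands in for the interval chosen via p0
    """
    angles = 2 * np.pi * np.arange(n_circle_points) / n_circle_points
    circle_pts = radius * np.column_stack([np.cos(angles), np.sin(angles)])
    score = 0.0
    for c in circle_pts:
        d_y = np.linalg.norm(y - c)                      # projection of y onto distances to c
        d_obs = np.linalg.norm(data - c, axis=1)         # observed distance distribution
        d_null = np.linalg.norm(null_sample - c, axis=1)
        lo, hi = d_y - width / 2, d_y + width / 2
        p_obs = np.mean((d_obs >= lo) & (d_obs <= hi))   # observed annulus probability
        p_null = np.mean((d_null >= lo) & (d_null <= hi))
        score += p_obs - p_null                          # psi(a, b) = a - b
    return score / n_circle_points                       # average over projections
```

Grid points inside a cluster accumulate excess annulus probability across most projections, while points elsewhere are matched by the annulus only for the few circle points that happen to be near-equidistant from both locations.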

Section 2.3 presents some of the theoretical properties of our estimator by considering the pointwise mean and variance under several scenarios. Aside from showing that the estimator of Γ is consistent for fixed y, we also demonstrate that its variance can be bounded by a term positively related to the number of circle points N. However, the relative increase flattens out as N increases, which shows that adding circle points has limited impact on the variance. Sharper bounds, independent of N, are available under the two particular hypotheses, H0 and H1.

The second hypothesis gives further insight into our method. When the two functions differ immensely at a single point only, Γ locally smooths out the very large change. At a given point, the comparison is governed by ψ, from which two observations can be made when ψ is specifically defined as a difference or a ratio. First, the smoothing effect of Γ is controlled slightly by N, but more importantly by p0, where a small value results in little smoothing. Second, the range of preferred values can guide the choice of ψ: these values are limited to the interval [-1, 1] when ψ is a difference, while the ratio allows for any positive real value, which in this specific alternative permits it to reflect the very localized, large increase in risk. Other functions can also be considered, for example , where the range would be limited to . In general, ψ should satisfy a reasonable ordering property, such as for all .
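The two comparison functions discussed above can be written out explicitly; here `a` and `b` stand for the observed and expected annulus probabilities, and the small `eps` guard against division by zero is our addition, not part of the paper.

```python
def psi_difference(a, b):
    """Difference comparison: for probabilities a, b the range is
    limited to [-1, 1], so a huge localized increase is capped."""
    return a - b

def psi_ratio(a, b, eps=1e-12):
    """Ratio comparison: any positive real value, so a very localized
    large increase in risk can be reflected without a ceiling."""
    return a / max(b, eps)
```

For a point where the observed probability is 0.9 against an expected 0.1, the difference saturates near its maximum while the ratio reports a nine-fold excess.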

We also chose to investigate the properties of our method in two dimensions through simulations, by measuring its performance in locating an increased relative risk placed in different regions of the unit disk. The simulations show that our method accurately locates where the increase occurs. We also see that in this two-dimensional setting, it performs comparably to the standard method using the log of a ratio of kernel density estimates.

To measure the performance in identifying a hot spot cluster, we use two metrics reflecting the accuracy of location: “sensitivity” and “specificity”. Both measures need to be high to properly assess the location of a high risk of disease and make efficient use of public health resources. If the former is high while the latter is low, some low-risk areas will be flagged for further investigation. On the other hand, low sensitivity with high specificity results in missed high-risk areas. As an alternative to simply visualizing maps, similar techniques have been proposed to measure this accuracy with cluster detection tests [41, 42]. Here, we have explored varying the underlying threshold to define a wider set of these two metrics, so as to measure performance with the area under the ROC curve. However, our proposed metrics depend upon dichotomizing the risk of disease (hot spot cluster). If the risk at x is inversely related to the distance of x from the focus (clinal cluster), one can determine theoretically how it affects and see whether this relationship is reflected in the values of .

There are several limitations to mention, relative both to the methods used and to the work presented here. First, the implementation of both methods requires parameters to be selected a priori. The kernel smoothing technique requires the choice of a kernel function and a bandwidth. The kernel choice is known to have little impact on the resulting estimate of the density, but the choice of the bandwidth is more crucial, even if a single value can be used for both numerator and denominator. Our results confirm that performance varies when we change this parameter. In practice, one does not know which value will be most appropriate. A wrong choice can result in a high number of false negatives or false positives, leading, respectively, to missed sources of cases or costly misuse of public health action.

The DBM method requires choosing where to place the circumscribed circle, the number of circle points, and the value of p0. In our simulations, we chose to investigate the effect of increasing the number of circle points and varying the parameter p0. The results show that increasing the number of circle points usually improves the accuracy of localizing the cluster region. Intuitively, the influence of the parameter p0 should depend on the geographical extent of the cluster: a small p0 would catch small spatial clusters; yet if p0 is too small, the annuli overlap on a very small region and might miss some of the observations, which draws the scores to zero. The simulations suggest setting p0 to 10% for most scenarios. Finally, for the placement of the circumscribed circle, we recommend centering the circle at the center of gravity of the region under study and choosing a diameter equal to the largest possible distance between two points in the region.
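The placement recommendation can be sketched as follows. This is an illustrative helper: it approximates the region's center of gravity by the centroid of a set of points covering the region (e.g. grid points or sampled locations), which is only exact when the region is covered evenly, and uses a quadratic pairwise scan for the diameter.

```python
import numpy as np
from itertools import combinations

def circumscribed_circle(points):
    """Center and diameter of the observation circle, as recommended:
    centered at the centre of gravity of the region, with diameter equal
    to the largest distance between two points (sketch; O(n^2) scan)."""
    pts = np.asarray(points, dtype=float)
    center = pts.mean(axis=0)  # centroid as a stand-in for the centre of gravity
    diameter = max(np.linalg.norm(p - q) for p, q in combinations(pts, 2))
    return center, diameter
```

For the unit square, this returns the midpoint (0.5, 0.5) and the diagonal length as the diameter.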

In this study, we have limited our simulations to a uniform distribution in the unit disk. This is highly unlikely in real data, where the region would have an uneven shape and the intensity of cases would vary across the region. Both methods can easily accommodate heterogeneous populations. The log KDR approach has been applied several times to cancer data [16, 17] and neurological disease [43], while we have previously applied our method to syndromic surveillance data in Cape Cod, Massachusetts [4, 30, 44] and tuberculosis in Lima, Peru [27]. For an uneven region, the kernel method requires choosing a two-dimensional bandwidth, while the distance-based approach does not require additional parameter decisions.

Our simulations present the use of our method in two dimensions, yet we have described it in the most general framework in Section 2.2, with the intent to apply it to more than two dimensions. In three dimensions, the observation points would be placed on a sphere enclosing the 3D region, and the annuli would become slices of tori. The choice of parameters would remain the same: center and diameter of the sphere, number of circle points, and p0.
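The paper does not prescribe how to space the observation points on the sphere; a Fibonacci lattice is one common choice for roughly even spacing and serves here as an illustrative stand-in for the evenly spaced circle points of the 2D case.

```python
import numpy as np

def sphere_points(n, center, radius):
    """Place n roughly evenly spaced observation points on a sphere
    enclosing a 3D region (Fibonacci lattice; illustrative choice)."""
    i = np.arange(n)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i   # golden-angle longitude increments
    z = 1.0 - 2.0 * (i + 0.5) / n            # evenly spaced heights in (-1, 1)
    r = np.sqrt(1.0 - z * z)                 # radius of each horizontal slice
    pts = np.column_stack([r * np.cos(phi), r * np.sin(phi), z])
    return np.asarray(center) + radius * pts
```

Each returned point lies exactly at distance `radius` from `center`, so the same distance-to-observer projection used in two dimensions applies unchanged.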

Finally, it is worth noting that our method does not necessarily rely on Euclidean distances. One can specify other metrics or measures of dissimilarity. Genetic data again come as a natural example, where one might want to measure a “distance” between two genotypic sequences [45]. This led to a novel distance-based analysis of HIV genotypes, where the goal was to identify patterns of mutations associated with ARV resistance [46]. Although the present work was motivated by problems in spatial disease mapping, the methods extend naturally to more complex spatial or other high-dimensional settings.

5 Appendix

5.1 Properties of Γ

In this section, we give detailed calculations for determining the mean, variance, and variance bounds of our estimator of Γ under the general case and the two hypotheses, H0 and H1, proposed in Section 2.3. We first consider any semi-linear function ψ, then simplify the expressions for the difference and the ratio. First, the estimator defined in Section 2.3 can be rewritten as , where is a Bernoulli random variable with parameter . Then, if ψ is semi-linear, using for all , the estimator can be rewritten as:

For semi-linear

The -transformed random variable equals with mean and variance:

Thus, the mean of our estimate is

For a given and are not independent. We decompose the variance in the following manner:

Let , where is the annulus generated by y and ci. We can simplify the expected value of the product as:

Then the final variance formula is


Case when

Under the null , we have and . So, the mean and variance of our estimator can be simplified as:

Furthermore, we can bound the variance using the fact that

Case when

Under this alternative, the distribution function of distances to ci is determined with the following steps:

Then, we have

where . This indicator function takes value 1 if a is within the subset defined by and (see Step 2). Let be the number of that include a. Then, the general form of Γ can be determined:

For the variance in eq. [9], we use the following facts:



The general variance formula from eq. [9] can be simplified to

To obtain the bound in eq. [5], we define and and rewrite . We can also use the fact that so that:

Using the fact that and , a coarser bound can be obtained as:

Case when

Under , the variance simplifies to:

Case when

Under , the variance simplifies to:

When and for all i. Then, the second term in the above expression simplifies to . Using the fact that for the first term, the variance of at a can further be bounded by

To explore the behavior of this variance when , consider the total number of possible pairs of chosen from . Let N0 be the number of such pairs in which neither includes a. Let N1 be the number of pairs in which exactly one contains a. Let N2 be the number of pairs in which both contain a. These three parameters vary with a and y, but we keep the simpler notation. We have . When we include , and , the variance can be rewritten as:

5.2 Algorithm to select optimal h and p0

To narrow down the number of settings for each mapping, we select a smaller set of values for h and p0 from a larger selection S using the algorithm described below. We set , and for , KDRn, and KDRq, respectively.

  • 1.

    Select one of the nine cluster regions.

  • 2.

    For each value for h or p0 from set S, and each simulated set of data (1,000 in total), calculate the sensitivities and specificities for a range of thresholds, and the AUC corresponding to the ROC curve. Take the mean of all 1,000 AUC values.

  • 3.

    Determine and .

  • 4.

    Repeat steps 2 and 3 for the other cluster regions.

  • 5.

    The values from step 3 form the subset of S presented in Sections 3.2.3 and 3.2.4, that is for DBM, for , and for .
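The selection steps above can be sketched as follows; `mean_auc` is a hypothetical helper wrapping step 2 (it returns the mean of the 1,000 AUC values for one parameter value and one cluster region), and all names are illustrative.

```python
def select_bandwidths(param_grid, regions, mean_auc):
    """For each cluster region, keep the parameter value (h or p0) with
    the highest mean AUC; the union over regions is the reduced set
    (sketch of steps 1-5; `mean_auc` is a hypothetical helper)."""
    best = set()
    for region in regions:
        scores = {p: mean_auc(p, region) for p in param_grid}
        best.add(max(scores, key=scores.get))   # step 3: arg-max over S
    return sorted(best)                         # step 5: subset of S
```

Running this over all nine cluster regions yields the reduced parameter sets reported in Sections 3.2.3 and 3.2.4.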


The research in this paper was funded by a grant from the National Institutes of Health R01 EB0006195 and CDC grant R01 PH000021–01.


  • [1]

    Teutsch S, Churchill R. Principles and practice of public health surveillance, 2nd ed. New York: Oxford University Press.

  • [2]

    Henning K. What is syndromic surveillance? MMWR 2004; 53S:7–11.

  • [3]

    Forsberg L, Bonetti M, Jeffery C, Ozonoff A, Pagano M. Distance based methods for spatial and spatio-temporal surveillance. Wiley, 2005:133–52.

  • [4]

    Forsberg L, Jeffery C, Ozonoff A, Pagano M. A spatio-temporal analysis of syndromic data for biosurveillance. Springer-Verlag, 2006:173–93.

  • [5]

    Kelsall J, Diggle P. Kernel estimation of relative risk. Bernoulli 1995a; 1:3–16.

  • [6]

    Turnbull B, Iwano E, Burnett W, Howe H, Clark L. Monitoring for clustering of disease: application to leukemia incidence in upstate New York. Am J Epidemiol (suppl.) 1990; 132:S136–43.

  • [7]

    Kulldorff M, Feuer E, Miller B, Freedman L. Breast cancer clusters in the Northeast United States: a geographic analysis. Am J Epidemiol 1997; 146:161–70.

  • [8]

    Wheeler D. A comparison of spatial clustering and cluster detection techniques for childhood leukemia incidence in Ohio, 1996–2003. Int J Health Geographics 2007; 6:13.

  • [9]

    Sonesson C, Bock D. A review and discussion of prospective statistical surveillance in public health. J Royal Stat Soc: Series A (Stat Soc) 2003; 166:5–21.

  • [10]

    Kleinman K, Lazarus R, Platt R. A generalized linear mixed models approach for detecting incident clusters of disease in small areas, with an application to biological terrorism. Am J Epidemiol 2004; 159:217.

  • [11]

    Olson K, Grannis S, Mandl K. Privacy protection versus cluster detection in spatial epidemiology. Am J Public Health 2006; 96:2002.

  • [12]

    Cuzick J, Edwards R. Spatial clustering for inhomogeneous populations. J Royal Stat Soc Series B (Methodological) 1990; 52:73–104.

  • [13]

    Diggle P, Chetwynd A. Second-order analysis of spatial clustering for inhomogeneous populations. Biometrics 1991; 47:1155–63.

  • [14]

    Tango T. A test for spatial disease clustering adjusted for multiple testing. Stat Med 2000; 19:191–204.

  • [15]

    Bonetti M, Pagano M. The interpoint distance distribution as a descriptor of point patterns, with an application to spatial disease clustering. Stat Med 2005; 24:753–73.

  • [16]

    Bithell J. An application of density estimation to geographical epidemiology. Stat Med 1990; 9:691–701.

  • [17]

    Kelsall J, Diggle P. Non-parametric estimation of spatial variation in relative risk. Stat Med 1995b; 14:2335–42.

  • [18]

    Kelsall J, Diggle P. Spatial variation in risk of disease: a non-parametric binary regression approach. Appl Stat 1998; 47:559–73.

  • [19]

    Izenman A. Recent developments in nonparametric density estimation. J Am Stat Assoc 1991; 86:205–24.

  • [20]

    Kammann E, Wand M. Geoadditive models. J Royal Stat Soc: Series C (Appl Stat) 2003; 52:1–18.

  • [21]

    Paciorek C, Schervish M. Nonstationary covariance functions for Gaussian process regression. Adv Neural Inf Process Systems 2004; 16:273–80.

  • [22]

    Cressie N. Statistics for spatial data. New York: John Wiley & Sons Inc, 1993.

  • [23]

    Wikle C. A kernel-based spectral model for non-Gaussian spatio-temporal processes. Stat Modell 2002; 2:299–314.

  • [24]

    Wand MP, Jones MC. Kernel smoothing. London: Chapman & Hall, 1995.

  • [25]

    Scott DW. Multivariate density estimation. New York: Wiley, 1992.

  • [26]

    Friedman J, Stuetzle W, Schroeder A. Projection pursuit density estimation. J Am Stat Assoc 1984; 79:599–608.

  • [27]

    Manjourides J, Lin H, Shin S, Jeffery C, Contreras C, Santa Cruz J, Jave O, Yagui M, Asencios L, Pagano M, Cohen T. Identifying multidrug resistant tuberculosis transmission hotspots using routinely collected data. Tuberculosis 2012.

  • [28]

    Jeffery C. Disease mapping and statistical issues in public health surveillance, PhD Thesis, Harvard University, Cambridge, MA, 2010.

  • [29]

    Manjourides J, Jeffery C, Ozonoff A, Pagano M. The use of distances in surveillance. In: Proceedings of the American Statistical Association, Statistical Computing Section [CD-ROM]. ASA, 2007.

  • [30]

    Ozonoff A, Bonetti M, Forsberg L, Jeffery C, Pagano M. The distribution of interpoint distances, cluster detection, and syndromic surveillance. In: Proceedings of the American Statistical Association, Biometrics Section [CD-ROM]. ASA, 2004.

  • [31]

    Epstein C. Introduction to the mathematics of medical imaging, 2nd ed. Philadelphia: Society for Industrial and Applied Mathematics, 2007.

  • [32]

    Bracewell R. The Fourier transform and its applications, Electrical Engineering. New York: McGraw-Hill, 1999.

  • [33]

    Meselson M, Guillemin J, Hugh-Jones M, Langmuir A, Popova I, Shelokov A, Yampolskaya O. The Sverdlovsk anthrax outbreak of 1979. Science 1994; 266:1202.

  • [34]

    Lagakos S, Wessen B, Zelen M. An analysis of contaminated well water and health effects in Woburn, Massachusetts. J Am Stat Assoc 1986; 81(395):583–96.

  • [35]

    Naumova E, Egorov A, Morris R, Griffiths J. The elderly and waterborne Cryptosporidium infection: gastroenteritis hospitalizations before and during the 1993 Milwaukee outbreak. Emerg Inf Dis 2003; 9:418–25.

  • [36]

    Jeffery C, Ozonoff A, White L, Nuno M, Pagano M. Power to detect spatial disturbances under different levels of geographic aggregation. J Am Med Inform Assoc 2009; 16:847–54.

  • [37]

    Lawson A. Disease mapping and risk assessment for public health. New York: Wiley, 1999.

  • [38]

    Vidal Rodeiro C, Lawson A. An evaluation of the edge effects in disease map modelling. Comput Stat Data Anal 2005; 49:45–62.

  • [39]

    Diggle P. A kernel method for smoothing point process data. Appl Stat 1985; 34:138–47.

  • [40]

    Berman M, Diggle P. Estimating weighted integrals of the second-order intensity of a spatial point process. J Royal Stat Soc Series B (Methodological) 1989; 51:81–92.

  • [41]

    Takahashi K, Tango T. An extended power of cluster detection tests. Stat Med 2006; 25:841–52.

  • [42]

    Ozonoff A, Jeffery C, Manjourides J, White L, Pagano M. Effect of spatial resolution on cluster detection: a simulation study. Int J Health Geographics 2007; 6:52.

  • [43]

    Sabel C, Gatrell A, Loytonen M, Maasilta P, Jokelainen M. Modelling exposure opportunities: estimating relative risk for motor neurone disease in Finland. Social Sci Med 2000; 50:1121–37.

  • [44]

    Ozonoff A, Jeffery C, Pagano M. Multivariate disease mapping. In: Proceedings of the American Statistical Association, Biometrics Section [CD-ROM]. ASA, 2009.

  • [45]

    Kowalski J, Pagano M, Degruttola V. A nonparametric test of gene region heterogeneity associated with phenotype. J Am Stat Assoc 2002; 97:398–409.

  • [46]

    Graham D. Statistical methods for the analysis of HIV drug-resistance data, DSc Thesis, Harvard University, Cambridge, MA, 2005.

About the article

Published Online: 2013-05-07

Published in Print: 2013-11-01

Citation Information: The International Journal of Biostatistics, Volume 9, Issue 2, Pages 265–290, ISSN (Online) 1557-4679, ISSN (Print) 2194-573X, DOI: https://doi.org/10.1515/ijb-2012-0024.

© 2013 by De Gruyter.