Accessible Published by De Gruyter June 15, 2018

A Fast and Robust Way to Estimate Overlap of Niches, and Draw Inference

Judith H. Parkinson, Raoul Kutil, Jonas Kuppler ORCID logo, Robert R. Junker, Wolfgang Trutschnig and Arne C. Bathke

Abstract

The problem of quantifying the overlap of Hutchinsonian niches has received much attention lately, in particular in quantitative ecology, from where it also originates. However, the niche concept has the potential to also be useful in many other application areas, as for example in economics. We are presenting a fully nonparametric, robust solution to this problem, along with exact shortcut formulas based on rank-statistics, and with a rather intuitive probabilistic interpretation. Furthermore, by deriving the asymptotic sampling distribution of the estimators, we are proposing the first asymptotically valid inference method, providing confidence intervals for the niche overlap. The theoretical considerations are supplemented by simulation studies and a real data example.

1 Introduction

Recent methodological progress has reignited interest (see, e.g., [1, 2, 3, 4, 5, 6]) in the problem of quantifying the overlap of d-dimensional Hutchinsonian niches [7], in the following simply referred to as H-niches. These H-niches represent multi-dimensional objects modeled as subsets of an d-dimensional space, where each axis represents an individual trait or environmental factor of interest. In biology, these objects have long been considered to conceptualize and define ecological niches representing the requirements of species to establish and survive in a given habitat. Here, each dimension represents a limiting factor such as temperatur, or the availability of resources. Alternatively, H-niches can also be utilized to characterize the “impact niche” of a species where each dimension represents a functional trait. These impact niches or “trait spaces” are being used to work on a broad range of ecological questions. For instance, by comparing the volume and overlap of trait spaces of species within communities, hypotheses how climate or other factors influence species diversity and assembly of communities can be tested. For example, species tend to have similar “trait spaces if abiotic factors play an important role, while competition between species should limit this similarity [8, 9]. Or, by defining species traits as dimensions that influence resource utilization, it can be explored how species are specialized on certain resources and how several species partition resources that are locally available [10]. While these applications usually focus on species or communities, H-niches or “hypervolumes” can also be utilized to define the states of ecosystems and assess their shifts after environmental change [11]. Besides this ecological focused research, H-niches are becoming increasingly used in other areas of biology, such as animal behavior science or evolution [12, 13] and could be easily be extended to other disciplines, as for example, economics, where one may be interested in the market niche of a company or a product. We refer to the excellent review paper by [14] regarding further details on the theoretical foundation of H-niches, and fields of application.

The geometrical interpretation of the H-niche differs throughout the literature. While Hutchinson’s original concept is a hyper-cube, other authors have interpreted H-niches to be ellipses or minimal convex sets. Each of these interpretations has its benefits and shortcomings, as discussed briefly in [14].

In order to evaluate and compare H-niches, their size and their reciprocal overlap are valuable indicators. However, there are (at least) two challenges considering a useful and interpretable quantification of H-niche overlap. First, how to measure overlap of H-niches in one dimension and second, how to extend the concept to multi-dimensional situations? Our goal is to provide such a quantification, focusing mostly on the first question, and without imposing a particular parametric model or distribution family. That is, the approach presented here is fully nonparametric, it can even be applied when some of the measured variables are only ordinal, and it does not depend on the type of scale chosen (e.g., linear, or logarithmic). Further, we aim at devising an inferential tool to not only estimate overlap of hypervolumes, but also to provide a confidence interval for the true (and unknown) overlap. So far, in the vast literature following, adapting, and generalizing Hutchinson’s work, no inference procedures can be found yet. The present manuscript attempts to close this gap by developing the appropriate techniques needed to draw inferential conclusions from samples of multivariate data in ecology, economy and beyond.

Let us start with the one-dimensional situation and the goal of quantifying H-niche overlap between two, say, objects. That is, there is one (real-valued) variable of interest, and for each subject under consideration, an observation of this variable can, at least theoretically, be obtained. Also, each subject can be classified into one of exactly two groups. The theoretical distributions of values for the two groups are described by their cumulative distribution functions (cdf), denoted by F and G, respectively.

Language convention: for simplicity of reading, we will not distinguish in the text between the distributions and their cdfs. The respective meaning will be clear from the context. The “range of a distribution F” will be meant to be the interval between inf{tR:F(t)>0} and sup{tR:F(t)<1}. Further, even though Hutchinson has established the term d-dimensional hypervolume, which has caught on in the ecological and biological literature, we prefer to use here H-niche instead, due to the connotations evoked by the former term among mathematicians and statisticians. Further, we will assume that F and G are the true underlying population distributions from which samples are drawing representatively. Thus, we will not consider or discuss scenarios where the true distribution and the sampled distribution differ substantially due to bias in the sampling process, or other reasons. We will not explicitly define what the true underlying distribution function is – it may depend on the information of interest. For further discussion on the topic of underlying distributions we refer to [15].

From a statistical point of view, the problem can generally be formulated as quantifying the proportion of values taken by one of the distributions, say F, that is covered by the range of values taken by another distribution, G. It is immediately clear that it would not make sense to limit oneself to the full support of the latter distribution, that is, the full range of values. For one thing, many common families of distributions have unbounded support, so one would basically be restricted to distributions whose densities have compact support. This may not seem crucial, for example in ecology, where one might assume that observations “live” on compact support sets. However, generality of the mathematical model may be an advantage when transferring the method to other fields of application. An arguably more severe problem in using the full support (or its empirical version, the data range) in practice is presented by the fact that the range is a very unrobust, albeit sometimes still not very informative functional [[16], p. 29]. Indeed, the range is so sensitive to outliers that relying on it alone would cause unstable estimation – individual observations can unduly influence it. On the other hand, the situation may also occur where two distributions have the exact same support, but still one of them is concentrated in the center of this support, while the other is more uniformly distributed, indicating that it would be advantageous to incorporate information regarding the full distribution, not only the smallest and largest observation, as in the range.

The recently published nonparametric solution approach by [5] consists of dynamic range boxes where for each quantile α[0,1], it is assessed how much of the central (1α) proportion of F, loosely speaking, is covered by the range of the central (1α) proportion of G. Integrating over α leads to a rather robust estimator of overlap between the two distributions. As the coverage for each α is between 0 and 1, the integral also takes values between 0 and 1. Outliers only contribute to the calculation for those α-values that are close to 1. When integrating, their influence becomes reduced. The most extreme cases possible are the following. (a) For each α, the range of the central (1α) proportion of F is fully included in the corresponding range of G, resulting in an integrated overlap value of 1. (b) None of the central F ranges are included in the corresponding G ranges, resulting in an integrated overlap of 0.

While their method has the advantages of being robust, it also suffers from three disadvantages which are remedied by the approach proposed here.

  1. Dynamic range boxes [5] require the calculation of mutual coverages for each α individually, resulting in either computational burden or resorting to an only approximate solution.

  2. At each step, the relative proportion of the covered range is calculated as a fraction of the respective interval lengths. This assumes that the measured variables are metric, and the proportion is not invariant under nonlinear transformations of the variables. For example, sometimes researchers prefer to transform observed values onto a logarithmic scale – a change between linear and logarithmic scales would however change the results obtained using the above method.

  3. The sampling distribution of the integral above is not known. Therefore inference (confidence intervals) can not be provided. As mentioned in different guidelines, see [17, 18, 19], confidence intervals should be provided along with the estimator, whenever possible.

Figure 1: Graphical illustration of the differences in derivation of the two methods, present paper (left), and Junker et. al. (right). This plot shows exemplarily two underlying normal distributions, and evaluation of niche overlap at α=5%$\alpha = 5\%$. Both methods integrate over all values of α$\alpha$ between 0 and 1.

Figure 1:

Graphical illustration of the differences in derivation of the two methods, present paper (left), and Junker et. al. (right). This plot shows exemplarily two underlying normal distributions, and evaluation of niche overlap at α=5%. Both methods integrate over all values of α between 0 and 1.

Figure 1 illustrates the difference between the approach proposed in this paper and the one proposed by [5], which is the method most closely resembling the newly proposed one. Please note that the methods were simplified for the visualization. The red arrows indicate the quantities measured at each α, in order to calculate the niche overlap. The new method then integrates over all α, the precise derivation of this calculation is given later. The method by [5] measures the overlap of the two individual α intervals and averages across the resulting values.

2 New measures of H-niche containment and overlap

We propose the following approach for quantifying H-niche containment and overlap. For simplicity we will start with one-dimensional H-niche containment and overlap before shortly explaining how to expand the method to the d-dimensional case in Section 2.5.

For each quantile α, we consider the central (1α) portion of F, as above. However, we now calculate the mass that G assigns to exactly this region. This defines an H-niche containment function h2:[0,1][0,1],

h2(α)=G[F1(1α/2)]G[F1(α/2)],

with F1 the quantile function of F. With h1 we will denote the analogous function where the roles of F and G are switched. Note that hi(α),i=1,2, are monotonically decreasing in α[0,1].

Based on independent and respectively i.i.d. samples X1,,XnF and Y1,,YmG, canonical estimators for hi(α),i=1,2, are given by

hˆ2(α)=Gˆm[Fˆn1(1α/2)]Gˆm[Fˆn1(α/2)],

and analogously defined hˆ1, where Fˆn and Gˆm are the respective empirical distribution functions for F and G. The empirical distribution functions Fˆn and Gˆm are known to be strongly consistent, uniformly over tR, and unbiased at each t.

Clearly, for each α(0,1), hˆ1(α) and hˆ2(α) are invariant under strictly monotone (isotone or antitone) transformations of the observations. That is, if X1,,Xn,Y1,,Ym are replaced by τ(X1),,τ(Xn),τ(Y1),,τ(Ym) for strictly monotone τ, the quantities hˆ1(α) and hˆ2(α) remain unchanged. Thus, these functions are well equipped to serve as the foundation for a fully nonparametric quantification of H-niche containment and overlap.

Recall that the receiver operating characteristic (ROC) curve for two distribution functions F and G is defined as ROC(t)=1G[F1(1t)]. The process

n(Gˆm[Fˆn1(t)]G[F1(t)]),t(0,1),

is called the empirical ROC process, for which the following result holds.

Lemma 2.1

LetXandYbe independently distributed according toFandG, with continuous density functionsfandg, respectively. Ifnmconverges to aλ>0forn, wherenandm=m(n)denote the sample sizes, and if the slope ofG[F1(t)], that isg[F1(t)]/f[F1(t)], is bounded on any interval(a,b)with0<a<b<1, then the empirical ROC process converges weakly to the sum of two tight Gaussian processes. More explicitly, it holds that

(1)n[GˆmFˆn1(t)GF1(t)]=λB1(n)[GF1(t)]+g[F1(t)]f[F1(t)]B2(n)(t)+o[n1/2(logn)2]

for t[0,1] almost surely, uniformly on [a,b], where {B1(n),B2(n),0t1} are two sequences of independent Brownian Bridges.

Proof

The Lemma is equivalent to Theorem 2.2. [20].

We will denote the limit process of the empirical ROC process by L(t). The similarity between our H-niche containment function and the ROC process will be exploited in Section 2.2 for the derivation of asymptotic properties of the proposed estimators.

2.1 Asymmetric H-niche containment and its estimation

In this section we will focus on developing an estimator for, roughly speaking, how much of the range of F is contained in the range of G. As a robust measure of H-niche containment, let us now consider the statistical functional

I2=01h2(α)dα=01G[F1(1α/2)]dα01G[F1(α/2)]dα=2(F1(1/2)G(t)dF(t)F1(1/2)G(t)dF(t)),

as well as I1, with the roles of F and G switched. These integrals basically average over H-niche overlap at the center, as well as at the margins of the individual H-niches – and everywhere in between. They take into account the full distribution information of F, all the while set into corresponding relation with G.

For a (not necessarily continuous) distribution function F let F1 denote the pseudoinverse (quantile function). The following identities are straightforward to verify.

limt0+F1(t)=inf{tR:F(t)>0}=:t_Flimt1F1(t)=sup{tR:F(t)<1}=:tF

In the sequel we will refer to Ran(F):=[t_F,tF] as the range of F. Letting μG denote the probability measure corresponding to the distribution function G, the quantities I2 and I1 can be expressed as

(2)I2=(0,1)μG((F1(α2),F1(1α2)])dα,I1=(0,1)μF((G1(α2),G1(1α2)])dα.

The subsequent lemma gathers the most important properties of I1 and I2.

Lemma 2.2

Suppose thatFandGare distribution functions. Then the following properties hold:

  1. IfGis continuous andF=G, thenI1=I2=12.

  2. I2=0if and only ifμG((t_F,tF))=0. If Gis continuous thenI2=0is equivalent toμG(Ran(F))=0.

  3. ForI2=1it holdsμG([F1(12),F1(12+)])=1. If Gis continuous thenI2=1is equivalent toμG((F1(12),F1(12+)))=1.

  4. I1andI2are invariant under strictly monotone, continuous transformations, where the same transformationϕis being applied toXFand toYG.

Proof

  1. If F=G and G is continuous, considering that F1F(t)=t holds for all t(0,1), the identity I1=I2=(0,1)(1α)dα=12 follows immediately.

  2. If I2=0 then, using the fact that αμG((F1(α2),F1(1α2)]) is (not necessarily strictly) decreasing, it follows that μG((F1(α2),F1(1α2)])=0 holds for arbitrarily small α>0, implying μG((t_F,tF))=0. The other implication is obvious. In case G is continuous μG((t_F,tF))=0 is equivalent to μG([t_F,tF)]=0, which completes the proof of the second assertion.

  3. If I2=1, again using monotonicity together with left-continuity of F1 on (0,1) and the fact that α(0,1)(F1(α2),F1(1α2)][F1(12),F1(12+)], the desired property μG([F1(12),F1(12+)])=1 follows immediately. In case G is continuous, I2=1 implies μG((F1(12),F1(12+)])=1 Since the reverse implication is trivial the proof of the third assertion is complete.

  4. Let ϕ be a strictly monotone transformation function and denote the new distribution functions of X and Y with Fϕ and Gϕ. There exists an inverse ϕ1 of ϕ. Now Fϕ(t)=Fϕ1(t) and the inverse Fϕ1(t)=ϕF1(t), analogue for G.

Then

μGϕ((Fϕ1(α2),Fϕ1(1α2)])=GϕϕF1(1α2)GϕϕF1(α2)=Gϕ1ϕF1(1α2)Gϕ1ϕF1(α2)=μG((F1(α2),F1(1α2)])

holds for all α[0,1].

The following examples illustrate that we can not get sharper versions of assertions (i)-(iv) in Lemma 2.2.

Example 2.3

In case of F=G=1[0,) we have (t_F,tF)= and μG()=0. It then follows that I1=I2=0.

Example 2.4

Consider G=1[1,) and let F denote the distribution function of a uniform distribution on [0,1]. Then Ran(F)=[0,1], so μG(Ran(F))=1 although μG([0,1))=0.

Example 2.5

Consider G=1[0,) and let F denote the distribution function of a discrete uniform distribution on {0,1}. Then [F1(12),F1(12+)]=[0,1], from which we get μG([F1(12),F1(12+)])=1>0=μG((F1(α2),F1(α2+)]).

Intuitively, the expressions I1 and I2 can be interpreted as follows. If the range of F is well contained within the range of G then the values of the H-niche containment function h2 are small, and the integral I2 is close to zero. In these cases, h1 takes larger values, and I1 is closer to one. That is, I1 can be interpreted as a measure of “how much is F (i.e., its range) contained in G”, while I2 quantifies “how much is G contained in F”. This will be made more precise in Lemma 2.8.

Remark 2.6

In the following, we will always assume XiF and YjG to be independent, absolutely continuous random variables, even if not explicitly mentioned. Further requirements for the upcoming results are stated directly in the individual lemmas and theorems.

Denote the distribution of F below its median as F1 and above its median as F2, such that

(3)F1(t)={2F(t),t<F1(1/2),1,tF1(1/2),F2(t)={0,t<F1(1/2),2F(t)1,tF1(1/2).

Remark 2.7

The random variables X(1)F1 and X(2)F2 can be constructed from X by conditioning on XF1(0.5) and X<F1(0.5), respectively.

Lemma 2.8

LetX(1)F1X(2)F2, andYGbe three independent, absolutely continuous random variables. Then, I2can be written as

(4)I2=P(Y<X(2))P(Y<X(1))=P(X(1)<Y<X(2)),
using the fact that by constructionX(1)<X(2)

Proof

Compare Chapter 1.4 of [20].

Remark 2.9

When reformulating Lemma 2.8 for general, not necessarily continuous random variables, the normalized version of the distribution function, F(x)=12[P(Xx)+P(X<x)] (as given in Definition 1.3 of [21]), should be used, resulting in

I2=P(Y<X(2))P(Y<X(1))+12P(Y=X(2))12P(Y=X(1)).

In analogy to F1 and F2, define distribution functions G1 and G2 based on G. Let Y(1)G1, Y(2)G2 be random variables, independent of X(1),X(2). Using these definitions, we can rearrange I2 to

(5)12[P(Y(1)<X(2))+P(Y(2)<X(2))P(Y(1)<X(1))P(Y(2)<X(1))].

Lemma 2.10

IfFandGare continuous distribution functions thenI1+I2[0,1]

Proof

Using eq. (5) and the fact that i=1,2, for continuous random variables, I1 and I2 add up to

1/2[P(Y(1)<X(2))+P(Y(2)<X(2))P(Y(1)<X(1))P(Y(2)<X(1))+P(X(1)<Y(2))+P(X(2)<Y(2))P(X(1)<Y(1))P(X(2)<Y(1))]=1/2[P(Y(1)<X(2))+P(X(1)<Y(2))P(Y(2)<X(1))P(X(2)<Y(1))]=1/2[2P(Y(1)<X(2))1+2P(X(1)<Y(2))1]=P(X(1)<Y(2))+P(Y(1)<X(2))1.

Due to the way the random variables are defined, one of the two probabilities will always be 1, while the other ranges between [0,1].

Expressions (4) and (5) provide a rather intuitive probabilistic interpretation of I2 (and analogously for I1).

Using the canonical estimators Fˆn and Gˆm for the cdfs, we can construct the following consistent plugin estimators for I2 (and I1).

(6)Iˆ2=2(Fˆn1(1/2)Gˆm(t)dFˆn(t)Fˆn1(1/2)Gˆm(t)dFˆn(t))

The functional I1 is estimated analogously with the roles of Fˆn and Gˆm switched. Although the estimator in eq. (6) looks rather complex at first sight, there exists a surprisingly simple shortcut formula in order to evaluate the integral. This formula is based on rank statistics, as follows.

Rank all m+n observations, in case of ties use midranks. Sort the X-sample, and without loss of generality, assume that X1X2Xn (to simplify notation). Then, all the X-observations below their median are X1,,XK where K is the largest integer less than or equal to (n+1)/2. Denote their ranks by R1X<,,RKX<, and denote the ranks of the remaining X-sample by RK+1X>,,RnX>. Finally, let

RX<=i=1KRiX<,RX>=i=K+1nRiX>.

Lemma 2.11

ForRX>andRX<defined as above, Iˆ2can be rearranged to

(7)2mn(RX>RX<)+12c,

withc=n/mforneven andcn/mfornodd, andnandmlarge.

Proof

The proof of this is given in the Appendix, Proof A.

Remark 2.12

Note, that the expression of c in the Lemma above only holds for samples without ties. The result for samples with ties is given in the Appendix, eq. (8) after Proof A.

The expression in (7) allows for a quick and direct estimation of I2, and, analogously, of I1. This is a major computational advantage, potentially also improving numerical stability, compared to the approach by [5], which required evaluations of an H-niche containment function similar to h1 and h2, but for several different values of α separately.

2.2 Asymptotic sampling distribution, and inference

In this section, we will establish the sampling distribution of the estimator defined in eq. (6), using the relation to the empirical ROC process mentioned in the beginning of Section 2. To this end, let D:=D[0,1] be the Skorokhod space, that is, the space of all càdlàg functions defined on [0,1] (see Chapter VI of [21]), and E:=[0,1]. Define the map

φ:DE,φ([f(t)]t[0,1])=01f1α2dα01fα2dα.

Note that φ is a continuous linear map. The topology here is the Skorokhod topology, as given in Theorem 1.14 of Chapter VI of [21].

Theorem 2.13

Let the mapφbe defined as above. Then it holds thatn(φ(GˆmFˆn1)φ(GF1))converges weakly toφGF1(L), where Lis the limit process of Lemma 2.1. Further, φGF1(L)is asymptotically normally distributed with expectation zero.

Proof

The proof can be found in the appendix, Proof B.

Remark 2.14

In general GˆmFˆn1 is not a càdlàg function. However it is a monotone decreasing step function that can simply be transformed into a càdlàg function with identical values for ϕ.

Remark 2.15

An alternative way to derive the normality of Iˆ2 in eq. (6) could be pursued by rewriting it as a finite sum of random variables, namely

n2Iˆ2=i=K+1nGˆm(Xi)i=1KGˆm(Xi),

with slight modification in case of ties among the Xi. After establishing an asymptotic equivalence result with respect to the similarly defined sum

i=K+1nG(Xi)i=1KG(Xi),

asymptotic normality can be established using the Lyapunov central limit theorem.

We have decided to elaborate more on the approach that uses the similarity to the ROC process, as we think that this relation may open up the path for further methodological developments. Thus, we intended to demonstrate its potential usefulness. But the next steps also show some limits when using the empirical process theory in practice.

Inference requires a valid variance estimation for the sampling distribution. However, considering the weak convergence of the empirical ROC process to the sum of two Brownian Bridges, we easily realize that calculating the resulting variance based on result (1) will be rather complex. Deriving a consistent empirical estimator for the variance based on this result alone would be even more complicated, as it would require density estimation for both distributions. Yet, as our goal is to construct confidence intervals, we will need an alternative approach for finding a consistent estimator of the variance. In our case the bootstrap supplies a feasible and justified alternative.

Theorem 2.16

LetGˆmandFˆnbe the empirical distribution functions of the bootstrapped sample andIˆ2the estimator ofI2based on the bootstrapped sample. Then it holds thatn[φ(GˆmFˆn1)φ(GˆmFˆn1)]is asymptotically normally distributed with identical limit distribution asn[φ(GˆmFˆn1)φ(GF1)].

Proof

The proof is given in the Appendix, Proof C.

Now, we can give a confidence interval based solely on the empirical quantiles of the bootstrap process. Through simulation of Iˆ2 we can obtain an approximate α-quantile of n(Iˆ2Iˆ2), denoted by zˆα.

Theorem 2.17

The asymptotic bootstrap confidence interval at level1αfor the above defined estimatorIˆ2is given by

[Iˆ2+1nzˆα2,Iˆ2+1nzˆ1α2].

Further, as our process converges to a normal distribution, we can construct asymptotic confidence intervals by applying Slutzky’s Lemma. We therefore can derive a symmetric confidence interval directly from the variance of the bootstrapped data.

Theorem 2.18

A second confidence interval at level1αfor the estimatorIˆ2is given by

[Iˆ21nz[Var(Iˆ2)]1/2,Iˆ2+1nz[Var(Iˆ2)]1/2],

wherezdenotes the(1α/2)-quantile of the standard normal distribution.

Obviously, the same properties hold for I1. Observe that the first confidence interval is not necessarily symmetric as the absolute values of the two bootstrap quantiles may differ, in contrast to the quantiles of the normal distribution. Note that in cases where the estimator is close to 0 or 1 this might yield better results, as the boundaries of the confidence interval based on the empirical quantiles never exceed the interval [0,1], while the approach using the variance estimator could exceed it.

For certain special cases, a simplified direct variance estimator can be constructed.

Lemma 2.19

Let F and G be continuous distribution functions. Reconsider the random variablesX(1),X(2),Y(1),Y(2)defined in Section2.1. AssumeG1(12)=F1(12)holds true, that is, the distributionsFandGare assumed to have the same median. In that case, I2, as given in eq. (5), further simplifies to

I2=12[1+P(X(1)<Y(1))P(X(2)<Y(2))],

and analogously forI1, which immediately yieldsI1+I2=1.

Proof

By definition of the random variables every realization of X(1) and Y(1) is contained in the interval (,F1(12)]. Similarly, X(2),Y(2) can only take values in (F1(12),). It then follows directly that P(X(1)<Y(2))=P(Y(1)<X(2))=1 and P(X(2)<Y(1))=P(Y(2)<X(1))=0. Entering these results into eq. (5) leads directly to the statement above.

As the random variables involved in estimating P(X(1)<Y(1)) and P(X(2)<Y(2)) are independent under the assumption of equal medians, we can construct a variance estimator for these cases using a simpler approach. We only require variance estimators for the canonical estimators of these two probabilities, namely for Fˆ1(t)dGˆ1(t) and Fˆ2(t)dGˆ2(t). To this end, we can use the procedure developed by [22]. Without loss of generality, assume again, as in the description before eq. (7), that the X-observations below their median are X1,,XK, while those above their median are XK+1,,Xn. Analogously, sort the Y-sample to Y1,,YL and YL+1,,Ym. Now, the samples of X and Y observations which are below (above) their respective medians are compared with each other. For i=1,,K, denote by RiX(<) the rank of Xi among the combined sample X1,,XK,Y1,,YL,

and by RˉX(<)=1Ki=1KRiX(<) their average. Analogously, define RˉY(<),RˉX(>), and RˉY(>). Finally, the internal ranks, where observations are ranked only within each of these four groups, are denoted by Ri(X<), Ri(Y<)Ri(X>), and Ri(Y>), respectively.

Lemma 2.20

Define

sX12=1L2(K1)i=1K(RiX(<)Ri(X<)RˉX(<)+K+12)2,sX22=1(mL)2(nK1)j=K+1n(RjX(>)Rj(X>)RˉX(>)+nK+12)2,
andsY12andsY22analogously. Further the same requirements as in Lemma 2.19 shall hold. A consistent variance estimator forIˆ2, the estimator of the simplifiedI2, is given by
s22=L+K2(sX12K+sY12L)+n+mLK2(sX22nK+sY22mL).

Proof

This Lemma follows directly from the results of [22], as well as the decomposition of Iˆ2 given in eq. (5).

Note that this is the same variance estimator as for Iˆ1.

Theorem 2.21

The confidence interval at level1α for the simplified estimator Iˆ2 is then simply given by

[Iˆ21nz1α/2s2;Iˆ2+1nz1α/2s2],

where z1α/2 is again the (1α/2)-quantile of the standard normal distribution.

Remark 2.22

Clearly, in practice, one may not know whether the assumption of equal medians is fulfilled, thus, the calculations for the simplified method may have limited application. So, why mention it then? Apart from the intriguing mathematical fact, a variance estimation in case of equal medians could be used in the construction of a hypothesis test. Additionally, we hope that the techniques used in the simplified variance estimation may be useful to other situations, in particular since simulations (see below) showed that one in general obtains narrower confidence intervals with the simplified method.

2.3 Symmetric H-niche overlap and its estimation

The concept of H-niche overlap describes the common share of two H-niches, or of their respective distribution ranges. The asymmetric containment index derived in the previous section provides a basis for defining a straightforward symmetric H-niche overlap index, namely as O(F,G)=4I1I2, where the factor 4 arises from our desire to standardize the total overlap to 1 for two identical distributions. Using notation as in the intuitive interpretation above we can write

O(F,G)=4P(X(1)<Y<X(2))P(Y(1)<X<Y(2)). Thus, the symmetric H-niche overlap index O(F,G) has the following properties, now formulated in terms of H-niches.

Similar to Lemma 2.2 we will give the most important properties of O(F,G) in the subsequent lemma.

Lemma 2.23

Suppose that F and G are distribution functions, not necessarily continuous. Then the following properties hold:

  1. In case of identical H-niches, F=G, andGcontinuous we haveO(F,G)=1.

  2. O(F,G)=0if and only ifμG((t_F,tF))=0orμF((t_G,tG))=0. IfFandGare continuous thenO(F,G)=0is equivalent toμG(Ran(F))=0, and simultaneouslyμF(Ran(G))=0.

  3. The overlap indexO(F,G)is invariant under strictly monotone transformations, where the same transformationϕis being applied toXFand toYG.

Proof

(i) This follows directly from Lemma 2.2.

(ii) If O(F,G)=0 then either I1 or I2 must be zero. By Lemma 2.2 this is the case if either μG((t_F,tF))=0 or μF((t_G,tG))=0. Obviously the other implication holds, too. The proof for the continuous case is equivalent. Note that, due to continuity, if one of the asymetric overlaps is zero, the other will be zero as well.

(iii) This follows again directly from Lemma 2.2.

Example 2.24

Consider F=1[0,) and let G=U[0,2]. Then (t_F,tF)=, from which we get μG((t_F,tF))=0. On the other hand, μF((t_G,tG))=μF((0,2))=1=μF([G1(12),G1(12+)]).

As stated in the previous lemma, identical H-niches yield a symmetric H-niche overlap of 1. On the other hand, if one H-niche is well contained within the range of the other, then one of the factors, and thus the product O(F,G) will be close to zero. By “well contained”, we mean that most of the weight of the one niche is centered in a small area around the median of the other, as, for example, in the scenario of Example 2.24. A canonical and consistent estimator is given by Oˆ(F,G)=4Iˆ1Iˆ2, where the shortcut formula (7) for Iˆ2, and its analog for Iˆ1, can be used.

2.4 Asymptotic distribution

In deriving confidence intervals for the overlap, we can not assume independence of the two random variables Iˆ1 and Iˆ2. Therefore, the properties of the individual convergences do not suffice here. As a consequence, we can only provide a conservative confidence interval for the general case, since estimation of the joint asymptotic distribution is rather complex. A conservative (1α)-interval can be constructed based on individual (1α/2)-intervals for I1 and I2, which yields a simultaneous (1α)-region for (I1,I2), and thus a straightforward (1α)-interval for 4I1I2. We will restrict the confidence interval obtained by this method to the interval [0,1], as the true overlap only ranges in this interval.

Note that for an important special case, namely under the assumption of equal medians, G1(12)=F1(12), a confidence interval for O(F,G) can simply be constructed using a confidence interval for I2 and the fact that under the given assumption, O(F,G)=4I2(1I2)=:ξ(I2). That is, the confidence interval for I2, namely I2=[I2L,I2U], is simply transformed to ξ(I2). If I2 does not contain 1/2, the interval for O(F,G) is directly obtained by transforming the limits I2L,I2U. If 1/2 is contained in the confidence interval, the upper limit of the confidence interval for O(F,G) is set to 1.

2.5 Extension to the d-dimensional case

In most applications of H-niche overlap, we not only have one trait, but many. Therefore, it is indispensable to extend our approach to the multi-dimensional case. In order to maintain certain properties for our d-dimensional H-niche overlap, we propose to define the d-dimensional asymmetric H-niche containment similar to [5], as

I2(d)=i=1dI2,id

where I2,i, i=1,,d, is the one-dimensional asymmetric H-niche containment of component i. The asymmetric H-niche containment I1 is defined analogously.

It then follows that the d-dimensional symmetric H-niche containment is given by

O(d)(F,G)=4I1(d)I2(d)=4i=1dI1,idi=1dI2,id=i=1dO(F,G)id,

where O(F,G)i, i=1,,d, is the one-dimensional symmetric H-niche containment of component i. Here, F and G are d-dimensional distribution functions. For the d-dimensional H-niche containment O(d)(F,G) the following properties hold.

Lemma 2.25

Suppose thatFandGare distribution functions, not necessarily continuous. Then the following properties hold

  1. In case of identical H-niches, F=G, and ifFandGare continuous, we haveO(d)(F,G)=1.

  2. O(d)(F,G)=0if and only if for at least one componenti{1,,d}eitherμGi((t_Fi,tFi))=0orμFi((t_Gi,tGi))=0.

Consistent estimators are given through the estimators in the one-dimensional case. Confidence intervals can be obtained by transforming the boundaries of the one-dimensional intervals, similar to the transformations of the actual estimators.

Remark 2.26

If the correlation matrices between the traits are rather dissimilar between the two samples, one may be able to construct examples, where the d-dimensional version of the H-niche will fail to describe the “true” relation between the two samples. For example, consider two bivariate samples where one has positively correlated traits, and the other has negatively correlated ones, while the marginal distributions are identical. In such a case, we would obtain an overlap close to one, while an intuitively “true” value should be closer to zero. This problem will require further research.

3 Simulation

In order to investigate the small sample properties of the estimator and its confidence intervals, we have performed simulation studies in R (R version 3.2.3, R Core Team, 2017). The first subsection considers asymmetric H-niche containment, while the second subsection focuses on the symmetric overlap.

3.1 Asymmetric H-Niche Containment

Table 1:

Simulated ECP of the simplified rank-based confidence interval (upper row) and bootstrap confidence interval (lower row) for I2 with different combinations of distributions F and G with equal medians and sample sizes 10, 20, …, 90.

sample sizes102030405060708090
N(0,1)0.9240.9540.9450.9360.9400.9460.9550.9480.942
x N(0,2)0.9310.9510.9610.9410.9510.9390.9640.9440.935
N(0,1)0.9480.9730.9700.9590.9660.9560.9630.9610.946
x N(0,5)0.8470.9340.9440.9390.9290.9400.9500.9440.942
N(0,1)0.8800.9260.9330.9420.9430.9430.9470.9360.940
x U(-0.5,0.5)0.9430.9510.9520.9550.9440.9590.9570.9580.956
U(-0.5,0.5)0.9210.9240.9350.9380.9490.9350.9450.9410.946
x N(0,1)0.9320.9380.9550.9590.9480.9500.9400.9440.956
U(-0.5,0.5)0.9340.9550.9320.9440.9380.9450.9450.9420.948
x U(-0.5,0.5)0.9040.9480.9360.9450.9530.9360.9400.9490.935
B(2,3)0.8860.9120.9320.9200.9370.9240.9350.9390.930
x B(3,2)-0.20.9390.9590.9560.9640.9460.9470.9580.9630.949
B(2,3)0.8660.9100.8970.9240.9260.9300.9130.9140.934
x B(4,1)-0.40.9360.9510.9370.9390.9420.9470.9610.9520.955
N(0.615,1)0.7070.7660.8200.8060.8270.8290.8520.8490.841
x B(3,2)0.8960.9010.8980.9090.9040.8910.9130.9000.926
B(3,2)0.9570.9640.9660.9520.9540.9380.9720.9560.951
x N(0.615,1)0.8690.9110.9260.9330.9470.9370.9430.9350.942
U(0,1)0.8660.8970.9050.9200.9310.9370.9400.9410.936
x B(3,2)-0.1130.9240.9310.9440.9490.9440.9450.9500.9480.949
B(3,2)-0.1130.9250.9290.9350.9390.9380.9390.9570.9510.957
x U(0,1)0.9300.9490.9500.9600.9460.9500.9360.9460.947

The simulations are split into two parts, where one shows the appropriateness of the confidence intervals with respect to the considered sample size, and the other demonstrates when the simplified version of the confidence interval can or cannot be used.

To begin with, combinations of distributions F,G were analyzed, where both distributions had the same median, and each was normal, uniform, or Beta. Note that the simplified rank-based confidence interval formula requires equal medians. The sample sizes for each of the distributions were increased from 10 to 100 in steps of 10 and from 100 to 1,000 in steps of 100. For each combination of distributions 1,000 data sets were generated. The number of bootstrap runs per data set was set to 1,000.

Table 2:

Simulated ECP of the simplified rank-based confidence interval (upper row) and bootstrap confidence interval (lower row) for I2 with different combinations of distributions F and G with equal medians and sample sizes 100,200,,1000.

sample size1002003004005006007008009001000
N(0,1)0.950.940.950.960.940.950.950.960.950.94
x N(0,2)0.950.960.940.940.960.940.940.940.950.96
N(0,1)0.960.950.960.950.940.950.940.960.950.93
x N(0,5)0.940.940.950.930.950.950.940.950.940.95
N(0,1)0.950.950.950.950.940.960.950.940.950.96
x U(-0.5,0.5)0.950.950.950.950.950.960.950.950.950.95
U(-0.5,0.5)0.940.950.960.950.950.940.950.930.950.95
x N(0,1)0.950.950.950.960.950.940.950.960.940.94
U(-0.5,0.5)0.950.940.950.940.950.950.940.940.960.95
x U(-0.5,0.5)0.950.960.960.960.960.930.960.960.950.94
B(2,3)0.940.940.950.940.950.950.940.940.950.95
x B(3,2)-0.2270.950.960.950.950.940.950.950.940.940.96
B(2,3)0.940.930.930.920.920.940.940.920.930.94
x B(4,1)-0.4530.960.960.950.950.940.950.950.950.950.94
N(0.615,1)0.880.900.900.930.930.920.940.930.930.95
x B(3,2)0.930.930.930.940.940.950.950.950.950.93
B(3,2)0.960.950.950.950.950.950.950.950.940.94
x N(0.615,1)0.940.950.950.940.940.940.950.940.960.94
U(0,1)0.930.940.940.960.950.940.950.940.950.95
x B(3,2) -0.1130.950.940.950.940.950.950.950.950.950.95
B(3,2) - 0.1130.930.950.930.950.930.950.950.950.960.94
x U(0,1)0.940.940.940.940.950.960.950.950.950.95

Simulated expected coverage probabilities (ECP) are shown in Table 1 and Table 2, and average confidence interval lengths in Table 3.

The true overlaps of I2 (and I1) were calulated by the following R-function, here given exemplarily for a Beta- and a Normal-distribution.

#N(a1,b1)
a1 = 0.6 1 5; b1 = 1
#B(a2,b2)
a2 = 3; b2 = 2
True_I_2 <function(a1, b1, a2, b2){
integ < function(x){pbeta(x, a2, b2)dnorm(x, a1, b1)}
int_gtm = integrate(integ, qnorm(1/2, a1, b1), 1)
int_stm = integrate(integ, 0, qnorm(1/2, a1, b1))
overlap = 2 (int_gtm$value - int_stm$value)
return(overlap)
}

Both methods (bootstrap and simplified) generally maintained the nominal confidence level well for sample sizes of at least 40. For small sample sizes, it depended on the underlying distribution which of the methods yielded results that were closer to the nominal level. For combinations of symmetric distributions, both methods performed similarly.

Table 3:

Average length of the simulated rank-based confidence interval (upper row) and bootstrap confidence interval (lower row) for I2 with different combinations of distributions F and G with equal medians and sample sizes 10,20,,90.

sample size102030405060708090
N(0,1)0.4580.3330.2690.2320.2060.1880.1730.1620.153
x N(0,2)0.4580.3260.2670.2300.2050.1860.1730.1620.153
N(0,1)0.3150.2460.2050.1770.1580.1440.1310.1230.115
x N(0,5)0.2830.2190.1840.1580.1410.1290.1210.1140.106
N(0,1)0.5050.3530.2860.2480.2210.2010.1860.1740.164
x U(-0.5,0.5)0.6130.4070.3190.2710.2380.2140.1970.1830.172
U(-0.5,0.5)0.4970.3540.2870.2470.2210.2020.1860.1750.164
x N(0,1)0.5050.3550.2910.2510.2230.2030.1880.1750.165
U(-0.5,0.5)0.4260.3190.2580.2230.1990.1810.1680.1560.147
x U(-0.5,0.5)0.4200.3080.2510.2180.1970.1780.1640.1540.145
B(2,3)0.5110.3630.2940.2540.2260.2060.1910.1790.168
x B(3,2)-0.20.5660.3920.3120.2670.2380.2160.1990.1850.174
B(2,3)0.4970.3520.2830.2460.2200.1990.1850.1730.163
x B(4,1)-0.40.6170.4200.3290.2810.2490.2260.2080.1930.180
N(0.615,1)0.3930.2770.2180.1840.1620.1470.1340.1260.117
x B(3,2)0.7300.4930.3910.3260.2810.2510.2260.2030.189
B(3,2)0.3200.2550.2100.1830.1610.1460.1350.1250.117
x N(0.615,1)0.2940.2240.1850.1610.1460.1350.1240.1170.110
U(0,1)0.4940.3400.2770.2390.2130.1940.1800.1680.158
x B(3,2)-0.1130.6290.4140.3260.2750.2420.2160.1970.1830.172
B(3,2)-0.1130.4770.3410.2770.2380.2130.1940.1800.1680.158
x U(0,1)0.4870.3430.2790.2420.2150.1950.1810.1690.159
Figure 2: Simulated performance of estimation and confidence interval methods. True values, estimates, and average confidence interval boundaries (left), average confidence interval length (middle, blue - simplified calculation, red - bootstrap), and simulated expected coverage probability (right) for the two distributions F=N(0.615,1)$F = N(0.615,1)$ and G=B(3,2)$G = B(3,2)$, with respect to sample size.

Figure 2:

Simulated performance of estimation and confidence interval methods. True values, estimates, and average confidence interval boundaries (left), average confidence interval length (middle, blue - simplified calculation, red - bootstrap), and simulated expected coverage probability (right) for the two distributions F=N(0.615,1) and G=B(3,2), with respect to sample size.

When using two skewed distributions, such as the Beta-distribution, we noticed slight undercoverage of the simplilfied method, in particular for small to moderate sample sizes. The bootstrap generally maintained the nominal level well in these simulations. When F was skewed and G symmetric, both methods performed similarly, with small sample advantages of the simplified version, whereas for F symmetric and G skewed, the simplified method exhibited some undercoverage for small to moderate sample sizes.

Figure 3: Simulated performance of estimation and confidence interval methods. True values, estimates, and average confidence interval boundaries (left), average confidence interval length (middle, blue - simplified calculation, red - bootstrap), and simulated expected coverage probability (right) for the two distributions F=B(3,2)$F = B(3,2)$ and G=N(0.615,1)$G = N(0.615,1)$, with respect to sample size.

Figure 3:

Simulated performance of estimation and confidence interval methods. True values, estimates, and average confidence interval boundaries (left), average confidence interval length (middle, blue - simplified calculation, red - bootstrap), and simulated expected coverage probability (right) for the two distributions F=B(3,2) and G=N(0.615,1), with respect to sample size.

These similarities and differences are also illustrated in Figure 2 and Figure 3 for combinations of a skewed and a symmetric distribution. The plot of the boundaries in Figure 3 shows the similarities of the two confidence interval methods, with a slightly narrower bootstrap confidence interval, but at the expense of lower coverage than the rank-based interval in this situation, for sample sizes below 100.

Figure 4: Simulated performance of estimation and confidence interval methods. True values, estimates, and average confidence interval boundaries (left), average confidence interval length (middle, blue - simplified calculation, red - bootstrap), and simulated expected coverage probability (right) for the two distributions F=N(0,1)$F = N(0,1)$ and G=N(m,1$G = N(m,1$), with respect to increasing difference m$m$ between means.

Figure 4:

Simulated performance of estimation and confidence interval methods. True values, estimates, and average confidence interval boundaries (left), average confidence interval length (middle, blue - simplified calculation, red - bootstrap), and simulated expected coverage probability (right) for the two distributions F=N(0,1) and G=N(m,1), with respect to increasing difference m between means.

As to the second part, we have already mentioned in Section 2.2 that the simplified variance estimator is only valid under the assumption of equal medians. The authors conjectured that it might still be used for small deviations from this assumption. In order to investigate the conditions under which it might become inappropriate, we conducted simulations using increasing location shifts between both distributions. All results are based on 1,000 simulation runs for each configuration, a sample size of 1,000, and, in the comparative analysis of the bootstrap interval performance, 1,000 bootstrap runs.

Figure 4, and Table 4 demonstrate the effect on ECP when the equal median assumption for the simplified confidence interval method is violated. Here, two normal distributions with different variances were used. The distribution function F was always standard normal, while G was a normal distribution with means m=0,0.1,,4.9,5 and variances 1, 2, and 5. The simplified method exhibited increasing undercoverage, in particular for smaller variances (σ2=1), as the difference in the true medians between both distributions increased. For a difference of 0.4 between the means, coverage was still about 90%, however, it already dropped to 70% for a difference of 1 between the means, dropping further as the mean differences increased. This is caused by the variance estimator underestimating the true variance, leading to confidence intervals which are too short. For larger variances, the performance of the simplified version was better. In general, it would be a useful research task to devise a bias correction for the simplified rank-based variance estimator.

Figure 5: Plot of the confidence intervals with the true value and the estimator (left), average length of the confidence interval (middle, blue - simplified calculation, red - bootstrap), and simulated expected coverage probability (right) for the two distributions F=U(−0.5,0.5)$F = U(-0.5,0.5)$ and G=U(m−0.5,m+0.5)$G = U(m-0.5,m+0.5)$ w.r.t. the increase of difference m$m$ between the means.

Figure 5:

Plot of the confidence intervals with the true value and the estimator (left), average length of the confidence interval (middle, blue - simplified calculation, red - bootstrap), and simulated expected coverage probability (right) for the two distributions F=U(0.5,0.5) and G=U(m0.5,m+0.5) w.r.t. the increase of difference m between the means.

While it seems that in distributions with infinite support and higher variance, the rank-based variance estimator can cope with rather small deviations from the equal median assumption, as seen in Table 4, the same does not seem to hold when working with distributions with finite support, as illustrated in Figure 5. Here, two uniform distribution were considered, each with a range of 1. In an almost linear manner the ECP of the simplified calculation dropped to zero up to m=0.5, as for m0.5 the range of G can not possibly include the median of F anymore, and vice versa. Thus, the probabilities P(X1<Y1) and P(X2<Y2) are both equal to 1 in this scenario, and the resulting variance estimator is simply zero, rendering the simplified method not usable. However, up to m=0.2, the simulated ECP was still higher than 80%. These results, together with the true values for I2, are also contained in Table 5. The simulated ECP dropped to zero for combinations of U(0.5,0.5) and U(m0.5,m+0.5) when we reached m=0.5. In comparison, the bootstrap confidence interval performed well until m reached 1, where the true overlap equals zero and thus also the variance estimator becomes zero. As in general one does not know whether the assumption of equal medians is valid or not, and one may only have the results of a hypothesis test for equal medians, it is important to note the influence that minor deviations from the assumptions have on the results.

Simulating data from a normal (with infinite support) and a uniform distribution (with finite support), we additionally observed the following. The simulated ECP depended on which of the two asymmetric H-niche estimators we were calculating. This has implications for estimating the symmetric overlap with confidence, as only one of the two individual confidence intervals needs to fail, in order to render their product estimator unreliable. On the other hand, the bootstrap confidence interval obtained roughly the same ECP no matter whether Iˆ1 or Iˆ2 was estimated, as can be seen in Table 6. Confidence intervals for the symmetric overlap were specifically assessed in additional simulations discussed below.

Table 4:

True value of I2 and simulated ECP of rank-based and bootstrap confidence intervals for normal distributions, with respect to increasing difference m between means.

N(0,1)xN(m,1)N(0,1)xN(m,2)N(0,1)xN(m,5)
mI2rankbootI2rankbootI2rankboot
0.00.50000.950.960.29520.950.960.12570.950.96
0.20.49370.940.960.29390.940.960.12560.950.96
0.40.47520.900.960.29010.950.960.12530.960.94
0.60.44600.860.960.28390.940.950.12480.960.94
0.80.40820.790.960.27550.930.960.12410.970.95
1.00.36450.710.950.26500.920.930.12320.970.94
1.30.29390.590.960.24600.900.940.12160.970.95
1.70.20300.440.950.21620.870.940.11880.980.95
2.00.14490.330.950.19180.850.950.11620.990.96
2.30.09850.330.940.16700.820.950.11340.990.94
2.70.05470.200.940.13470.790.940.10900.990.96
3.00.03330.140.940.11210.760.940.10550.990.94
3.30.01940.090.930.09150.700.950.10160.990.95
3.70.00880.040.940.06780.660.950.09620.990.95
4.00.00470.020.910.05290.630.960.09200.990.94
4.30.00240.000.900.04050.600.940.08770.990.95
4.70.00090.000.880.02760.570.950.08170.990.95
5.00.00040.000.840.02020.510.940.07720.990.95

In situations where the overlap between both distributions was small, the bootstrap also showed undercoverage due to underestimation of the estimator’s variance. This was the case for true values of I1 or I2 close to zero, rougly less than 2/sample size. However, here the absolute difference between estimates and true values was still small, but the confidence intervals did not capture the true parameter due to their minuscule lengths. This is not surprising. Indeed, when the overlap between two distributions is close to zero, the bootstrap may yield two resamples which are fully separated in the sense that the largest observation in one sample is smaller than then smallest observation in the other. All those resamples would produce the same estimated zero for the overlap, with overall very little variation among the bootstrap samples, eventually yielding a variance estimator very close to zero.

Table 5:

True value of I2 and simulated ECP of rank-based and bootstrap confidence intervals for uniform distributions, with respect to increasing difference m between means.

U(-0.5,0.5)xU(m-0.5,m+0.5)U(-0.5,0.5)xU(m-1,m+1)
mI2rankbootI2rankboot
0.00.5000.9390.9460.2500.9570.955
0.10.4900.9070.9440.2500.9360.952
0.20.4600.7940.9420.2500.9180.937
0.30.4100.5790.9490.2500.9260.954
0.40.3400.2330.9310.2500.8940.943
0.50.2500.0120.9520.2500.8360.945
0.60.1600.0000.9560.2450.7160.946
0.70.0900.0000.9370.2300.5470.953
0.80.0400.0000.9510.2050.3380.949
0.90.0100.0000.9460.1700.1530.953
1.00.0000.0000.0000.1250.0060.955
1.10.0000.0000.0000.0800.0000.950
Table 6:

True value of I1 and I2 and simulated ECP of rank-based and bootstrap confidence intervals for F=U(0.5,0.5) and G=N(m,1), with respect to increasing difference m between means.

mI1rankbootI2rankboot
0.00.80460.940.950.19540.960.95
0.20.77650.840.940.19170.950.95
0.40.69300.650.950.18100.960.96
0.60.56460.580.950.16440.960.95
0.80.44250.580.940.14370.950.95
1.00.33700.570.960.12090.950.95
1.30.21190.590.940.08690.920.95
1.70.10250.560.950.04880.880.94
2.00.05360.540.950.02870.830.94
2.30.02700.530.920.01540.780.94
2.70.00940.470.920.00590.630.93
3.00.00390.360.860.00260.460.88
3.20.00210.240.790.00140.350.83
3.40.00100.130.660.00080.200.70
3.60.00050.030.540.00040.100.55
3.80.00030.010.330.00020.030.37
4.00.00010.000.200.00010.010.21

Concluding, we can say that our estimators and their confidence intervals appear to perform quite well. The simplified rank-based confidence interval should only be used when deviation from its equal median assumption is not strong. And, for highly skewed data, larger sample sizes than for symmetric data are needed to obtain fitting ECPs.

3.2 Symmetric overlap

In simulating the performance of the proposed confidence interval methods for the symmetric H-niche overlap, we used three combinations of distributions, namely N(0,1) and N(0,2), N(0,1) and U(1,1), as well as N(0.615,1) and B(3,2). For these distributions and sample sizes 10,20,,150 we simulated 1,000 data sets each. For each data set, we calculated the estimators Iˆ1 and Iˆ2 along with their corresponding confidence intervals, and the resulting symmetric H-niche overlap estimator and its confidence intervals. The bootstrap confidence intervals are again based on 1,000 bootstrap runs each.

We used individual asymptotic 95% confidence intervals for I1 and I2. As can be seen in Table 7, we obtained high ECPs even for sample sizes as small as 10. However the average length of the confidence intervals was too high as to actually provide any useful information about the true parameter value. Even for sample sizes of 150 each, we still had an average length of at least 1/3 for the three combinations. This would imply a need for larger sample sizes to actually obtain informatively precise confidence intervals, but such large samples may not always be available in practice.

Even though the length of the rank based confidence intervals is lower than for the bootstrap, without any significant loss of ECP, it still does not seem reasonable to use the confidence interval for small sample sizes as the contained information content is too low.

Table 7:

Simulated ECP and average length of the confidence intervals for the symmetric H-niche overlap based on 95%-confidence intervals for I1 and I2

N(0,1) x N(0,2)102030405060708090100150
Rank ECP0.930.980.970.980.990.990.990.990.991.000.99
Length0.860.710.610.550.510.480.460.440.420.410.36
Boot. ECP0.970.991.000.991.001.001.001.001.001.001.00
Length0.950.820.710.650.600.560.530.510.480.470.41
N(0,1) x U(-1,1)102030405060708090100150
Rank ECP0.940.980.990.991.001.001.001.001.001.001.00
Length0.840.650.550.490.450.410.390.370.350.330.29
Boot. ECP0.980.991.001.001.001.001.001.001.001.001.00
Length0.940.780.660.590.530.500.460.440.420.400.33
N(0.615,1) x B(3,2)102030405060708090100150
Rank ECP0.940.970.960.960.950.960.970.960.960.960.97
Length0.830.760.660.580.520.470.440.410.380.360.29
Boot. ECP0.910.940.950.970.970.970.970.980.980.980.98
Length0.840.800.710.640.590.540.510.480.450.430.35

4 Application to ecological data

To visualize our results on a real data set we chose the data set Finch2 from the R-package dynRB ([5]). This package implements the method developed by [5]. The data set contains nine morphological different traits for two species, Geospiza fuliginosa pavula and Geospiza fortis fortis that describe the similarity of both species. These two finch species are endemic to the Galápagos Islands. They are more commonly known as Darwin’s finches, along with some other species. There are 81 observations for G. fuliginosa pavula and 22 for G. fortis fortis. Out of the nine traits, we chose four, one of which only had a small overlap (tarsus length), two had a more pronounced overlap (body length and wing length), and one with complete containment of one species’ observed distribution within the other’s (tail length). Figure 6 depicts the trait distributions for both species.

Figure 6: Four of the nine traits of two species of finches, Geospiza fuliginosa pavula (red) and Geospiza fortis fortis (blue).

Figure 6:

Four of the nine traits of two species of finches, Geospiza fuliginosa pavula (red) and Geospiza fortis fortis (blue).

In this example, we analyzed “impact niches” (see Introduction). One of the main features of the ecological system of the Galápagos Islands is that it is solitary. Thus, analyzing traits of two species living in this ecosystem and having similar abiotic factors is of interest in ecology.

For each of the chosen traits we calculated both H-niche containment estimators Iˆ1 and Iˆ2, as well as their 95% confidence intervals. We used both, the simplified rank-based method, and the bootstrap, for the confidence interval calculations even though the equal median assumption does not appear to hold for any of the traits. However, as we saw in the simulations of the previous section, it may still yield reasonable results. Further we calculated the values corresponding closely to I1 and I2 given by [5] using the R-package dynRB. Table 8 shows the estimates and the confidence intervals. Here, in all cases, the rank-based confidence interval is well contained in the respective bootstrap confidence interval.

Table 8:

Estimators of I1 and I2 for four traits of the finch data, along with their confidence intervals, and closely corresponding estimators given by the R-package dynRB.

Iˆ1Rank CIBoot. CIdynRBIˆ2Rank CIBoot. CIdynRB
Body L.0.17(0.12, 0.22)(0.09, 0.26)0.1310.18(0.13, 0.23)(0.05, 0.32)0.092
Wing L.0.04(0.02, 0.05)(0.00, 0.09)0.0160.04(0.02, 0.05)(0.01, 0.08)0.013
Tail L.0.47(0.37, 0.57)(0.32, 0.60)0.4870.38(0.28, 0.49)(0.25, 0.52)0.352
Tarsus L.0.05(0.03, 0.07)(0.01, 0.10)0.0200.06(0.04, 0.08)(0.02, 0.12)0.018

While for three of the traits the two asymmetric H-niche containment estimators were almost the same, indicating a symmetric overlap of the data, for tail length, there was a difference of 0.09 between both directions. As shown in Figure 6, the observed support interval for G. fortis fortis completely covers the corresponding interval of G. fuliginosa pavula. And as I1 is the probability that a randomly chosen observation from the species G. fuliginosa pavula lies between two randomly chosen observations of G. fortis fortis, it makes sense that it is clearly higher than I2, since the support of G. fuliginosa pavula does not cover the support of G. fortis fortis. Comparing our results with those of dynRB, we see similar relations and magnitudes, and would reach the same qualitative conclusions regarding the two species being compared. However, both methods do not provide the same numerical answers. Considering that both use different definitions of asymmetric and symmetric H-niche overlap, with different mathematical interpretations, this is not a surprise. The new approach features, for example, the advantages of a direct and intuitive probabilistic interpretation, and a confidence statement.

Table 9:

One-dimensional H-niche overlap of four traits and the corresponding confidence intervals for the finch data.

sym. NORank CIBoot. CI
Body L.0.118(0.059, 0.196)(0.023, 0.309)
Wing L.0.006(0.002, 0.012)(0.000, 0.031)
Tail L.0.720(0.411, 1.000)(0.334, 1.000)
Tarsus L.0.014(0.005, 0.028)(0.002, 0.044)
Figure 7: Scatterplot of four of the nine traits for two species of finches, Geospiza fuliginosa parvula (red) and Geospiza fortis fortis (blue).

Figure 7:

Scatterplot of four of the nine traits for two species of finches, Geospiza fuliginosa parvula (red) and Geospiza fortis fortis (blue).

Table 10:

2-dimensional symmetric H-niche overlap for pairwise traits.

Wing L.Tail L.Tarsus L.
Body L.0.0270.2910.041
Wing L.-0.0670.009
Tail L.--0.101

In this data example, the 4-dimensional symmetric H-niche overlap is given by 5.22%, which seems reasonable as the one-dimensional symmetric overlap of two of the traits is rather low, as seen in Table 9. The confidence interval is (0.0233,0.0918) using the simplified method, and (0.0187,0.1575) using the bootstrap method. The 2-dimensional symmetric H-niche overlaps are given in Table 10. The resulting values appear reasonable in light of the scatterplots shown in Figure 7.

5 Conclusion

There has been much renewed interest in quantifying overlap between Hutchinsonian niches [1, 2, 3, 4, 5, 6], driven in large part by ecologists who have been studying conditions for species to establish and survive in a given habitat, with applications, for example, in the study of native versus invasive species [23]. However, 60 years after the seminal work by [7], no global, unique definition of H-niche overlap has existed in a mathematical sense. Different authors have defined this concept differently, motivated by particular research questions (see [24, 25, 26]).

In addition, as of now, none of the established methods allows statistically valid inferential statements (confidence intervals) yet, even though it has become a requirement in many fields to supply confidence statements along with effect size estimates, whenever possible. The present manuscript attempts to close this gap. This is based on a definition of H-niche overlap that can be expressed in simple terms using intuitive probabilities, but also has connections to the ROC process and to rank statistics. These connections are advantageous in deriving asymptotic results and in providing shortcuts for the calculations. The proposed method is rather robust, and due to its invariance properties, it can be used for symmetric, skewed and heavy-tailed distributions alike. For ordinal and binary variables, the authors conjecture that the method will also work, but a different mathematical proof needs to be provided. Nominal variables are outside of the scope of this approach, since it is not possible to assign ranks to nominal responses, while the proposed method utilizes ranks.

Specifically, two confidence interval methods are proposed. One is based on ranks, it can be calculated quickly using an explicit formula, but it requires an assumption on the distributions being compared. The other is based on the bootstrap, it can be used in general, based on the asymptotic theory shown, but it is obviously numerically more demanding. Simulations have demonstrated good performance of the proposed methods under their respective assumptions. Particularly for symmetric distributions, they work well also for small to moderate sample sizes. Thus, the proposed methods are also suitable for real data sets with low to moderate numbers of observations and thus well applicable for analyses of niches in ecological and other data sets.

Comparing the newly proposed method with previously proposed approaches, some similarities exist with the idea of [6] whose approach however was designed for data following multivariate normal distributions, while the presently proposed method does not make any distributional assumptions. Similar to our interpretation of Hutchinsonian niche overlap, the interpretation of their overlap factor is the probability that an individual from species A is found in the niche of B, which [6] defined to be the central 95%-region of the distribution of species B. They do allow for different choices of α. However, depending on the choice of α, it is then possible to obtain any value between 0 and 1 as an overlap factor, supplemented by heuristic confidence intervals based on Monte-Carlo simulations. Our new method, as well as [5], uses every possible α and combines them into an aggregated index, making the method also statistically robust.

The closest resemblance of the newly proposed method is indeed with the approach by [5], and the new method can be seen as a direct improvement of this approach, rendering a more robust estimation process, with simpler formulas, and supplementing the possibility to provide valid inferential conclusions, in addition to estimates for asymmetric and symmetric Hutchinsonian niche overlap.

Funding source: Austrian Science Fund

Award Identifier / Grant number: 10.13039/501100002428

Funding statement: This article was funded by the Austrian Science Fund (10.13039/501100002428), Grant Number: I 2697-N31.

Appendix

A Proof

Let X and Y be continuously distributed random variables. As in Section 2.1, assume for simplicity of notation that all the X-observations below or identical to their median Fˆn1(1/2) have indices 1,,K. Note that K=n/2 for n even and K=(n+1)/2 for n odd. In general, we will use the notations of Section 2.1 and 2.2.

First, we can rewrite the integral

Fˆn1(1/2)GˆmdFˆn

as a sum

1ni=1KGˆm(Xi)=1ni=1K1mj=1m1{Yj<Xi}=1mni=1K[j=1m1{Yj<Xi}+k=1n1{Xk<Xi}]1mni=1Kk=1n1{Xk<Xi}.

Obviously, the sum k=1n1{Xk<Xi} is just the internal rank of Xi within the sample X1,,Xn, denoted by RX,i(X), minus one half. Analogously, j=1m1{Yj<Xi}+k=1n1{Xk<Xi} is the rank of Xi in all observations , namely RX,i.

Consequently, we obtain

Fˆn1(1/2)GˆmdFˆn=1mni=1K(RX,i)1mni=1K(RX,i(X))=1mn(i=1KRX,ii=1KRX,i(X)).

The second sum over the ranks adds up to K(K+1)2 if we do not have any ties in the data set. Therefore, we have

Fˆn1(1/2)GˆmdFˆn=1mn[RX<K(K+1)2].

Similarly, we obtain

Fˆn1(1/2)GˆmdFˆn=1mni=K+1nRX,i1mni=K+1nRX,i(X)=1mn[RX>n(n+1)2+K(K+1)2].

Substituting these into eq. (6) yields

Iˆ2=2(Fˆn1(1/2)Gˆm(t)dFˆn(t)Fˆn1(1/2)Gˆm(t)dFˆn(t))=2mn(RX>RX<)+2mn(n(n+1)2+2K(K+1)2)=:2mn(RX>RX<)+12c,

where c is given by n/m for n even and by (n+1)/m+3(n+1)/(nm) for n odd. For large m and large odd n, c is approximately n/m.

A.1 Proof A adapted for data sets with ties

For data sets with ties, the sum over the ranks of observations below the median does not necessarily add up to K(K+1)/2 when ties occur around the median. We therefore can not simplify the terms i=1KRX,i(X) in Proof A. However, it holds as before that i=1nRX,i(X)=n(n+1)/2. It follows that in case of ties, Iˆ2 can be written as

(8)2mn((RX>RX<)n(n+1)2+2i=1KRX,i(X)).

B Proof

In this proof, the notation, if not specified otherwise, is the same as in [27]. As GˆmFˆn1GF1 and φ a continuous linear map at every point of D, we can apply the continuous mapping theorem, Theorem 1.3.6 of [27]. It follows that

n(φ(GˆmFˆn1)φ(GF1))φGF1(L).

As the limit process L is a tight Borel measurable map and φ is a continuous linear map, it follows with Theorem 3.9.8 of [27] that φGF1(L) is normally distributed.

We still need to show that the expectation of our normal distributed process φGF1(L) is zero. Let B1 and B2 be two independent Brownian bridges. Consider

E[φGF1(L)]=E01L(1α2)L(α2)dα=E01λB1GF11α2λB1GF1α2dα+E01g(F1(1α/2))f(F1(1α/2))B21α2g(F1(α/2))f(F1(α/2))B2α2dα.

To calculate the expectation of the integral, we will apply the Fubini-Tonelli Theorem. First we however show that the requirements are fulfilled. As the proof is similar for both summands, we will only show it explicitly for one of them. We can limit λB1(GF1(1α/2))λB1(GF1(α/2)) by λsup0t1B1(t)λinf0t1B1(t). Now this term has the same distribution as λsup0t1e(t), where e(t) is a Brownian excursion, see [28]. It therefore holds that

01EλB1GF11α2λB1GF1α2dα01λEsup0t1e(t)dα=λEsup0t1e(t).

As given in Corollary 3.2 of [29], the expectation of a Brownian excursion is given by π/2. As further λ(0,1) we can conclude that the term above is finite and that we can apply the Fubini-Tonelli Theorem. For the second summand, recall that the slope of G(F1(t)) is bounded. Switching expectation and integral, we obtain that the expectation of the process is given by

01λEB1GF11α2λEB1GF1α2dα+01g(F1(1α/2))f(F1(1α/2))EB21α2g(F1(α/2))f(F1(α/2))EB2α2dα.

Since B1 and B2 are Brownian Bridges, they have expectation zero for all time points. It follows that EφGF1L is zero.

Alternatively, the result can also be concluded from the fact that each of the components of I2 in eq. (5) can be estimated in an unbiased manner by a corresponding component of Iˆ2 in eq. (6).

C Proof

In order to apply Theorem 3.9.11 of [27], define Y=(Y1,,Ym)Rm, X=(X1,,Xn)Rn, and m=m(n). Further, let Pf=G(F1), YjGj=1,,m, and XiF, i=1,,n. To show that the requirements of Theorem of 3.9.11 of [27] are fulfilled, we define the class F as the set of functions

fn(X,Y,t,i)=GˆmXi1FˆnXitj=1,jin11FˆnXjt1Xj>Xi

with t[0,1], GˆmXi=1mj=1m1YjXiFˆnXi=1nj=1n1XjXi, and i=1,,n. Observe that this is simply the empirical counterpart to G evaluated at the t-quantile of the empirical counterpart to F. As shown in Example 2.5.4 of [27], the set of all indicator functions of cells in R is a Donsker class. From this we can directly conclude that the individual sets of functions 1Xj>Xi, GˆmXiFˆnXi, and 1Fˆn(Xi)t are Donsker classes. Theorem 2.10.8 of [27] states that the product of two Donsker classes is again a Donsker class. Further, it can easily be seen that the set of functions given by 11Fˆn(Xj)t1Xj>Xi is also a Donsker class. Altogether we can conclude that F=fn(X,Y,t,i):XRn,YRm,t[0,1],i=1,,n is a Donsker class and hence

nPnfnPf(t)=n1ni=1nfn(X,Y,t,i)GF1(t)G

a tight random element, as seen in equation (2.1.1) of [27]. As we now have a Donsker class, we can apply Theorem 3.6.1 and obtain that the upper part of equation 3.9.9 of [27] holds in outer probability. Additionally, the lower equation holds as well, since the empirical process and the bootstrap are measurable. The Hadamard-differentiable function is given by φ defined in the beginning of Section 2.2. With this, we have shown that the requirements are indeed fulfilled, and we can apply Theorem 3.9.11 of [27], which completes the proof.

References

[1] Blonder B. Pushing past boundaries for trait hypervolumes: a response to carmona et al. Trends Ecol Evol. 2016;31:665–667.2745335210.1016/j.tree.2016.07.001 Search in Google Scholar

[2] Blonder B, Lamanna C, Violle C, and Enquist BJ. The n-dimensional hypervolume. Global Ecol Biogeogr. 2014;23:595–609.10.1111/geb.12146 Search in Google Scholar

[3] Carmona CP, de Bello F, Mason NW, Lepš J. The density awakens: a reply to blonder. Trends Ecol Evol. 2016;31:667–669.2747295810.1016/j.tree.2016.07.003 Search in Google Scholar

[4] Carmona CP, de Bello F, Mason NW, Lepš J. Traits without borders: integrating functional diversity across scales. Trends Ecol Evol. 2016;31:382–394.10.1016/j.tree.2016.02.00326924737 Search in Google Scholar

[5] Junker RR, Kuppler J, Bathke AC, Schreyer M, Trutschnig W. Dynamic range boxes a robust nonparametric approach to quantify size and overlap of n dimensional hypervolumes. Meth Ecol Evol. 2016;7:1503–1513. Search in Google Scholar

[6] Swanson HK, Lysy M, Power M, Stasko AD, Johnson JD, Reist JD. A new probabilistic method for quantifying n-dimensional ecological niches and niche overlap. Ecology 2015;96:318–324.10.1890/14-0235.126240852 Search in Google Scholar

[7] Hutchinson G. Concluding remarks. Cold Spring Harbor Symposia on Quantitative Biology, 1957. Search in Google Scholar

[8] Lamanna C, Blonder B, Violle C, Kraft NJ, Sandel B, Šímová I, Donoghue JC, Svenning JC, McGill BJ, Boyle B, Buzzard V. Functional trait space and the latitudinal diversity gradient. Proc Nat Acad Sci USA. 2014;111:13745–13750.10.1073/pnas.1317722111 Search in Google Scholar

[9] Li Y, Shipley B, Price JN, Dantas VD, Tamme R, Westoby M, Siefert A, Schamp BS, Spasojevic MJ, Jung V, Laughlin DC. Habitat filtering determines the functional niche occupancy of plant communities worldwide. J Ecol, 2017. doi: 10.1111/1365-2745.12802. Search in Google Scholar

[10] Kuppler J, Hoefers M, Trutschnig W, Bathke A, Eiben J, Daehler C, Junker R. Exotic flower visitors exploit large floral trait spaces resulting in asymmetric resource partitioning with native visitors. Func Ecol. 2017. doi: 10.1111/1365-2435.12932. Search in Google Scholar

[11] Barros C, Thuiller W, Georges D, Boulangeat I, and Münkemüller T. N-dimensional hypervolumes to study stability of complex ecosystems. Ecol Lett. 2016;19:729–742.2728231410.1111/ele.12617 Search in Google Scholar

[12] Lichenstein J, Wright C, McEwen B, Pinter-Wollman N, Pruitt J. The multidimensional behavioural hypervolumes to two interactions species predict their space use and survivial. Anim Behav. 2017;132:129–136.10.1016/j.anbehav.2017.08.010 Search in Google Scholar

[13] Nunes L, Pearson R. A null biogeographical test for assessing ecological niche evolution. J Biogeogr. 2017;44:1331–1343.10.1111/jbi.12910 Search in Google Scholar

[14] Blonder B. Hypervolume concepts in niche- and trait-based ecology. Ecography 2017;40:001–013. Search in Google Scholar

[15] Drake JM. Range bagging: a new method for ecological niche modelling from presence-only data. J R Soc Interface. 2015;12:20150086.2594861210.1098/rsif.2015.0086 Search in Google Scholar

[16] Isotalo J. . Basics of Statistics. CreateSpace Independent Publishing Platform, 2014. Search in Google Scholar

[17] ICH Expert Working Group. Statistical principles for clinical trials. ICH Harmonised Tripartite Guideline, 1998. Search in Google Scholar

[18] Schulz KF, Altman DG, Moher D. Consort 2010 statement: updated guidelines for reporting parallel group randomised trials. BMC Med. 2010;8:18. Search in Google Scholar

[19] Vandenbroucke JP, Von Elm E, Altman DG, Gøtzsche PC, Mulrow CD, Pocock SJ, et al. Strengthening the reporting of observational studies in epidemiology (strobe): explanation and elaboration. PLoS Med. 2007;4:e297.10.1371/journal.pmed.0040297 Search in Google Scholar

[20] Hsieh FY, Turnball BW. Nonparametric and semiparametric estimation of the receiver operating characteristic curve. Ann Stat. 1996;24:25–40.10.1214/aos/1033066197 Search in Google Scholar

[21] Brunner E, Munzel U. Nichtparametrische Datenanalyse. Berlin Heidelberg: Springer Spektrum, 2013. Search in Google Scholar

[22] Jean J, Shiryaev A. Limit theorems for stochastic processes, 2nd ed. Berlin Heidelberg New York: Springer-Verlag, 2003. Search in Google Scholar

[23] Konietschke F. Simultane konfidenzintervalle für nicht parametrische relative kontrasteffekte. Ph.D. thesis, University of Göttingen, 2009. Search in Google Scholar

[24] Junker RR, Daehler CC, Dötterl S, Keller A, Blüthgen N. Hawaiian ant-flower networks: nectar-thieving ants prefer undefended native over introduces plants with floral defenses. Ecol Monogr. 2011;81 (2): 295–311.10.1890/10-1367.1 Search in Google Scholar

[25] McInerny GJ, Etienne RS. Ditch the niche–is the niche a useful concept in ecology or species distribution modelling?. J Biogeogr. 2012;39:2096–2102.10.1111/jbi.12033 Search in Google Scholar

[26] McInerny GJ, Etienne RS. Pitch the niche–taking responsibility for the concepts we use in ecology and species distribution modelling. J Biogeogr. 2012;39:2112–2118.10.1111/jbi.12031 Search in Google Scholar

[27] McInerny GJ, Etienne RS. Stitch the niche–a practical philosophy and visual schematic for the niche concept. J Biogeogr. 2012;39:2103–2111.10.1111/jbi.12032 Search in Google Scholar

[28] van der Vaart W, Wellner JA. Weak convergence and empirical processes. New York: Springer Verlag, 1996. Search in Google Scholar

[29] Vervaart W. A relation between brownian bridge and brownian excursion. Ann Probab. 1979;7:143–149.10.1214/aop/1176995155 Search in Google Scholar

[30] Durrett R, Iglehart D. Functionals of brownian meander and brownian excursion. Ann Probab. 1977;5:130–135.10.1214/aop/1176995896 Search in Google Scholar

[31] Billingsley P. Convergence of probability measures. New York: John Wiley and Sons Inc, 1999. Search in Google Scholar

Received: 2017-03-31
Revised: 2018-02-16
Accepted: 2018-05-23
Published Online: 2018-06-15

© 2018 Walter de Gruyter GmbH, Berlin/Boston