Measuring association via lack of co-monotonicity: the LOC index and a problem of educational assessment

Abstract Measuring association, or the lack of it, between variables plays an important role in a variety of research areas, including education, which is of primary interest in this paper. Given, for example, student marks on several study subjects, we may for a number of reasons be interested in measuring the lack of co-monotonicity (LOC) between the marks, which rarely follow monotone, let alone linear, patterns. For this purpose, in this paper we explore a novel approach based on a LOC index, which is related to, yet substantially different from, Eckhard Liebscher's recently suggested coefficient of monotonically increasing dependence. To illustrate the new technique, we analyze a data-set of student marks on Mathematics, Reading and Spelling.


Introduction
Measuring association, or lack of it, between variables has fascinated researchers for centuries. A considerable impetus to this research was given by Sir Francis Galton, who in 1885 published his empirical and theoretical developments on regression and correlation.
Inspired by that work, a decade later Karl Pearson proposed a coefficient to measure correlation between variables, which is nowadays known as the Pearson correlation coefficient. For detailed historical notes on this coefficient and many of its interpretations, we refer to Rodgers and Nicewander (1988) and Stanton (2001).
The mathematical simplicity, and thus interpretability, of the Pearson correlation coefficient has encouraged researchers to use it in a variety of areas where measuring association between variables is of interest. But just like any other synthetic measure, the Pearson correlation coefficient has a number of limitations, such as high sensitivity to outliers (cf., e.g., Abdullah, 1990; Shevlyakov and Smirnov, 2011), its reliance on linearity (cf., e.g., Rodgers and Nicewander, 1988), and so on.
In many practical situations, however, we encounter problems that are poorly described by linear relationships, and thus measuring association (or lack of it) using the Pearson correlation coefficient may not be prudent. Hence, a number of alternative ways have emerged in the literature, including the rank correlation coefficients of Kendall (1938) and Spearman (1904) (cf., e.g., Kendall and Stuart, 1961; Nelsen, 2006), and the correlation coefficients of Gini (cf., e.g., Gupta, 1999; Nelsen, 2006) and Blomqvist (1950).
For measuring co-monotonicity between pairs of variables, none of the aforementioned coefficients can truly serve the purpose, as noted by Liebscher (2014), who suggested an index for determining whether co-movements of random variables follow an increasing pattern. Specifically, given two random variables X and Y, whose cdf's we denote by F and G, respectively, Liebscher's (2014) coefficient of monotonically increasing dependence is

ζ_{X,Y} = 1 − (1/c_ψ) E[ψ(F(X) − G(Y))],

where c_ψ = 2 ∫_0^1 (1 − u)ψ(u) du is the normalizing constant, and ψ can be any non-negative function on the interval [−1, 1] that is symmetric around 0 and such that ψ(0) = 0. Various properties and extensions of this index have been discussed by Liebscher (2014), from which we see that, to a certain degree, the index can be used for tackling the problem of the current paper. Yet, due to a different goal set out by Liebscher (2014), his index does not truly suit the goal of the present paper, as we shall explain in Section 6.
We have organized the rest of the paper as follows. In Section 2 we describe a classical data-set of Thorndike and Thorndike-Christ (2010), which is of our primary interest, and then visualize the data using scatterplots with superimposed classical least-squares regression lines. In Section 3 we fit curves to bivariate data using several powerful methods available in the literature, which is a precursor to our use of an index for measuring lack of co-monotonicity (LOC). The definition and properties of the LOC index are discussed in Section 4, where we also provide a convenient computational formula for the index. In Section 5 we utilize the LOC index to analyze the data-set of Thorndike and Thorndike-Christ (2010). In Section 6 we discuss the difference between the LOC index and that of Liebscher (2014). Section 7 concludes the paper. Some proofs are relegated to Appendix A.

Data
To facilitate full transparency of our reasoning and adopted methodology, we use a publicly available data-set of Thorndike and Thorndike-Christ (2010). The set consists of marks of 52 sixth-grade students on three study subjects: Mathematics, Reading, and Spelling.
The students belonged to two classes, taught by two teachers, who administered tests on the three subjects. For each student and for each study subject, the teachers reported the number of correct answers and used them to assess each student's achievement on each of the three subjects.
For our analysis, we first normalize the marks to the unit interval [0, 1] by dividing the number of correct answers by the total number of items (i.e., questions or problems) on the tests: 65 items for Mathematics, 45 for Reading, and 80 for Spelling. Hence, throughout the paper we deal with functions h : [0, 1] → [0, 1] that model association between pairs of study subjects, which we denote by X and Y, connected via the hypothetical equation y = h(x) with h estimated from data (the topic of Section 3). Summary statistics and histograms of the normalized marks are reported in Table 1.
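As a concrete illustration of the normalization step, consider the following minimal Python sketch. The raw marks below are hypothetical (not taken from the data-set); the item totals 65, 45 and 80 are those of the actual tests.

```python
# Item totals for the three tests (Mathematics, Reading, Spelling).
totals = {"Mathematics": 65, "Reading": 45, "Spelling": 80}

# Hypothetical numbers of correct answers for three students.
raw = {"Mathematics": [52, 40, 61], "Reading": [30, 44, 12], "Spelling": [70, 55, 80]}

# Normalize each mark to the unit interval [0, 1].
normalized = {subject: [count / totals[subject] for count in raw[subject]]
              for subject in totals}

# Every normalized mark now lies in [0, 1].
assert all(0.0 <= mark <= 1.0 for marks in normalized.values() for mark in marks)
```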

Curve fitting
Here we discuss fitting curves to scatterplots (we have six of them; see Figure 2), which is a precursor to calculating the LOC index, the topic of Section 4 below.
A number of approaches have been developed for fitting curves to bivariate data.
The parametric approach is one of them, and it includes popular models such as linear, generalized linear, nonlinear, parametric growth curve, and many other ones (cf., e.g., Seber and Wild, 1989; Pan and Fang, 2002). The disadvantage of this approach, especially in the context of the present paper, is that the shape and form of the functions to be fitted are difficult to guess, and guessing them involves an element of subjectivity that we want to avoid. Hence, we opt for the non-parametric approach, which is sometimes referred to as scatterplot smoothing (cf., e.g., Ruppert et al., 1995).
In general, there are two broad non-parametric approaches for fitting curves to bivariate data: one is based on the conditional mean and the other on the conditional quantile.
Both methods have their own advantages and disadvantages, and we shall illustrate both of them. We note at the outset that in the case of the conditional quantile, we shall restrict our attention to the conditional median, which serves as a natural alternative to the mean when data are skewed. Some further details and references on the two methods will be provided in Section 3.1 below, with their actual use for analyzing the data of Thorndike and Thorndike-Christ (2010) exhibited in Section 5.

Constructing h
The conditional-mean approach is based on the assumption that a good model for h is given by the conditional mean, and thus

h(x) = E[Y | X = x].    (2)

Given a scatterplot consisting of n pairs (x_i, y_i), the local linear estimate (which is our choice among many other ones available in the literature) of h(x) is ĥ(x) = β̂_0, where β̂_0 is a solution to the minimization problem

min_{(β_0, β_1)} Σ_{i=1}^n L(y_i − β_0 − β_1(x_i − x)) K((x_i − x)/b).    (3)

Throughout this paper we work with the standard normal kernel K. Details and references on the selection of the bandwidth b will be provided in Section 3.2 below. As to the loss function L, we choose the quadratic loss function L(x) = x^2 when using the conditional-mean approach, and another function, to be specified later, in the case of the conditional-median approach.
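Under the quadratic loss, the minimization problem above has a closed-form solution, and the local linear estimate can be sketched in a few lines of Python. This is only an illustrative implementation with the normal kernel written out explicitly; the paper's actual computations use the R package KernSmooth.

```python
import math

def local_linear(x, xs, ys, b):
    """Local linear estimate of h(x): minimize, over (b0, b1), the kernel-weighted
    sum of (y_i - b0 - b1*(x_i - x))^2 with weights K((x_i - x)/b), where K is the
    standard normal kernel, and return the fitted intercept b0."""
    w = [math.exp(-0.5 * ((xi - x) / b) ** 2) for xi in xs]
    d = [xi - x for xi in xs]
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    t0 = sum(wi * yi for wi, yi in zip(w, ys))
    t1 = sum(wi * di * yi for wi, di, yi in zip(w, d, ys))
    # Solve the 2x2 weighted normal equations for (b0, b1); b0 is the estimate.
    det = s0 * s2 - s1 * s1
    return (s2 * t0 - s1 * t1) / det
```

Because the local model is itself linear, the estimate reproduces an exactly linear relationship for any bandwidth, which is a convenient sanity check.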
We note in passing that this estimate naturally arises from the fact (recall the local constant regression method of the Nadaraya-Watson model) that h(x) defined by equation (2) minimizes the expected quadratic loss E[(Y − θ)^2 | X = x] over constants θ; the linear term β_1(x_i − x) in the minimization problem (3) is included to diminish the asymptotic bias of the estimate, if compared to the bias arising from the Nadaraya-Watson method (cf., e.g., Fan, 1992). For further properties of the local linear estimate, we refer to Simonoff (1996), Wand and Jones (1995), and references therein.
It is also natural to use the conditional-quantile approach (Koenker, 2005), which is based on the assumption that a good model for h(x) is given by the conditional quantile F^{-1}_{Y|X=x}(τ) for some τ ∈ (0, 1). Throughout this paper we set τ = 0.5 and thus work with the conditional median. An estimate ĥ(x) of h(x) stems from the minimization problem of (3), but now using the loss function L(x) = x(τ − 1{x < 0}), where the indicator 1{x < 0} is equal to 1 when x < 0 and 0 otherwise.
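To see why the check loss with τ = 1/2 produces a median, note that for a local constant fit the minimizer of the kernel-weighted absolute loss is a weighted median of the responses. The sketch below is a simplified local constant version (not the local linear fit used in the paper's computations) intended only to make this point concrete:

```python
import math

def weighted_median(ys, ws):
    """A minimizer of sum_i w_i * |y_i - b| over b: a weighted median of the y_i."""
    pairs = sorted(zip(ys, ws))
    half = sum(ws) / 2.0
    cum = 0.0
    for y, w in pairs:
        cum += w
        if cum >= half:
            return y

def local_median(x, xs, ys, b):
    """Local constant conditional-median estimate at x with the normal kernel."""
    ws = [math.exp(-0.5 * ((xi - x) / b) ** 2) for xi in xs]
    return weighted_median(ys, ws)
```

Note how the estimate is unaffected by the outlying response 100 in the test below, which illustrates the robustness of the median to skewed or contaminated data.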

Bandwidth selection
The construction of the bandwidth b is based on how good the resulting estimator ĥ(x) of h(x) is, and for this task it is customary to use the mean integrated squared error (MISE) with some weight function w that ensures convergence of the integral (e.g., Ruppert and Wand, 1994). Specifically, the bandwidth is chosen so that it asymptotically minimizes the MISE. There are of course other good ways to choose the bandwidth, but we shall not delve deeply into this subject here and just note some of the facts that we shall utilize in our data-driven computations.
Namely, we follow Ruppert and Wand (1994), Ruppert et al. (1995), and Fan and Gijbels (2000) when using the conditional-mean approach. We start out with the asymptotically optimal bandwidth given by formula (3.21) in Fan and Gijbels (1996, page 68). To facilitate its practical implementation, we use the direct plug-in method proposed by Ruppert et al. (1995, pages 1262-1263). In the latter reference, the resulting bandwidth is denoted by h_DPI, which in the present paper is denoted by b to avoid possible notational confusion with the estimate ĥ of h.
When using the conditional-median approach, we follow Yu and Jones (1997), who show that the optimal bandwidth in this case is equal to the estimate b from the conditional-mean approach multiplied by [τ(1 − τ)/φ(Φ^{-1}(τ))^2]^{1/5}, where τ = 1/2 due to our median-based approach. The φ in the above quantity is the standard normal density, and Φ^{-1} is the standard normal quantile function. Hence, in summary, the optimal bandwidth under the conditional-median approach is b(π/2)^{1/5}.
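The adjustment factor is easy to verify numerically. The sketch below inverts Φ by bisection (via math.erf, so no statistical library is needed) and checks that at τ = 1/2 the multiplier reduces to (π/2)^{1/5}:

```python
import math

def norm_ppf(tau):
    """Standard normal quantile Phi^{-1}(tau), computed by bisection on the cdf."""
    cdf = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if cdf(mid) < tau:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def bandwidth_factor(tau):
    """Multiplier [tau*(1 - tau) / phi(Phi^{-1}(tau))^2]^{1/5} for the median bandwidth."""
    z = norm_ppf(tau)
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (tau * (1.0 - tau) / phi ** 2) ** 0.2

# At tau = 1/2 the multiplier equals (pi/2)^{1/5}, as stated above.
assert abs(bandwidth_factor(0.5) - (math.pi / 2.0) ** 0.2) < 1e-9
```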
In view of the above discussion, we can now assume that for any given scatterplot we have constructed a well-fitting function h : [0, 1] → [0, 1]: if it happens to be increasing, then we say that the random variables X and Y have co-monotonic movements, but if not, then we want to assess how much the function deviates from the increasing pattern. This we accomplish using an index that takes the value 0 when h is increasing and some positive value otherwise: the more the function deviates from the increasing pattern, the larger the value. The index, which we call the lack of co-monotonicity (LOC) index, is discussed next.

The LOC index
We start with the well-known notion of the increasing rearrangement (cf., e.g., Hardy et al., 1952; Denneberg, 1994; Korenovskii, 2007; and references therein), which for our function h : [0, 1] → [0, 1] is defined by I_h(t) = inf{x ∈ [0, 1] : G_h(x) ≥ t}, where G_h(x) = λ{s ∈ [0, 1] : h(s) ≤ x} and λ is the Lebesgue measure.
Note 4.1 If we interpret the function h as a random variable on the probability space {[0, 1], B, λ}, then statisticians would immediately recognize that G_h(x) is the cumulative distribution function of h, and I_h(t) is the quantile function of h. We find these interpretations useful for working out good intuition on the subject.
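For a step function taking constant values on m equal-length subintervals of [0, 1], both G_h and I_h reduce to counting and sorting. A minimal sketch under that assumption:

```python
import math

def G_h(values, x):
    """Distribution function of the step function: the Lebesgue measure of
    {s in [0,1] : h(s) <= x}, i.e. the fraction of subintervals with value <= x."""
    return sum(v <= x for v in values) / len(values)

def I_h(values, t):
    """Increasing rearrangement (quantile function): smallest x with G_h(x) >= t."""
    srt = sorted(values)
    return srt[max(0, math.ceil(t * len(values)) - 1)]

values = [0.3, 0.9, 0.1]          # h equals these values on thirds of [0, 1]
assert G_h(values, 0.3) == 2 / 3  # two of the three values are <= 0.3
assert I_h(values, 0.5) == 0.3    # the middle value of the sorted arrangement
```

In line with Note 4.1, I_h simply reads off the sorted values, exactly as a sample quantile function does.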
Hence, to construct an index for measuring the lack of co-monotonicity, henceforth called the LOC index, between pairs of variables, we need to choose an appropriate functional that couples I_h and h in such a way that the resulting quantity is zero if and only if the functions I_h and h coincide, that is, if and only if the fitted function h is increasing (to be more precise, non-decreasing). Among such candidates are the maximal distance between I_h and h, called the sup-norm, as well as the integrated distance between the two functions, called the L_1-norm. Though mathematically appealing, the two choices are not good candidates for the purpose due to the lack of so-called co-monotonic additivity (to be explained in a moment), as has been pointed out by Qoyyimi and Zitikis (2014) in the L_1 case.

Qoyyimi and Zitikis (2014) argue that a suitable candidate for an LOC index is

L(ĥ) = ∫_0^1 (I_ĥ(t) − ĥ(t)) t dt.

The integral is always non-negative, equal to 0 for every increasing function, and takes (strictly) positive values for all other functions. Furthermore, L(ĥ + d) = L(ĥ) for every real constant d, and L(cĥ) = cL(ĥ) for every non-negative constant c. If ĝ is a function co-monotonic with ĥ, which means that both ĝ and ĥ increase and decrease on the same intervals of their joint domain of definition, then L(ĝ + ĥ) = L(ĝ) + L(ĥ). We view this co-monotonicity property as important for every LOC index to satisfy, and this is the reason we have abandoned the use of the aforementioned sup- and L_1-norms.

Computational formula
Given the properties noted above, the LOC index L is attractive, but it is highly unwieldy from the computational point of view, even when h has a simple mathematical expression, let alone when it arises non-parametrically from a scatterplot. Hence, we need a simple computational method for the index that works even when h does not have an explicit mathematical expression.
To this end, we first partition the interval [0, 1] into m subintervals using the points i/m, i = 1, 2, . . ., m. Then for each i = 1, 2, . . ., m we choose any point t_i ∈ ((i − 1)/m, i/m] for which the value τ_i := ĥ(t_i) is available. Hence, we have the step-wise function D_m that takes the value τ_i on the subinterval ((i − 1)/m, i/m], and whose LOC index has a very simple computational formula (proof in Appendix A):

L(D_m) = (1/m^2) Σ_{i=1}^m i (τ_{(i)} − τ_i),    (7)

where τ_{(1)} ≤ τ_{(2)} ≤ . . . ≤ τ_{(m)} denote the values τ_1, . . ., τ_m arranged in the increasing order. Hence, to calculate L(ĥ) numerically, we calculate L(D_m), which approximates L(ĥ) as precisely as desired provided that m is sufficiently large, that is,

L(D_m) → L(ĥ) as m → ∞.    (8)
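A direct transcription of the step-function computation into Python, assuming the ordered-values form of the formula stated above; the assertions check the properties listed in Section 4 (zero for increasing sequences, positivity otherwise, positive homogeneity, shift invariance):

```python
def loc_index(tau):
    """LOC index of the step function with values tau_1, ..., tau_m on equal
    subintervals: (1/m^2) * sum_i i * (tau_(i) - tau_i), where tau_(i) are the
    values sorted in increasing order."""
    m = len(tau)
    srt = sorted(tau)
    return sum((i + 1) * (srt[i] - tau[i]) for i in range(m)) / m ** 2

increasing = [0.1, 0.4, 0.7, 0.9]
wiggly = [0.1, 0.7, 0.4, 0.9]

assert loc_index(increasing) == 0.0   # increasing function => index 0
assert loc_index(wiggly) > 0.0        # non-monotone function => positive index
assert abs(loc_index([2 * t for t in wiggly]) - 2 * loc_index(wiggly)) < 1e-12  # L(ch)=cL(h)
assert abs(loc_index([t + 5 for t in wiggly]) - loc_index(wiggly)) < 1e-12      # L(h+d)=L(h)
```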

Data analysis and findings
We work with six scatterplots, and to each of them we fit two curves: one using the conditional-mean approach and the other using the conditional-median approach. In the case of the conditional-mean approach, we use the local linear kernel regression method as discussed in Section 3.1. To aid us with computations, we use the R package KernSmooth (Wand and Ripley, 2014), with the function dpill used for selecting the optimal bandwidth and the function locpoly (with degree=1) for curve fitting. We set the grid size to 1,000.
In the case of the conditional-median approach, we use the R package quantreg (Koenker, 2015), with the function lprq used to obtain ĥ with τ = 1/2 and m = 1,000.
We see from the six panels of Figure 3 that all the estimates ĥ are more jiggly than those arising from the conditional-mean approach. We could certainly improve them with more work and a more sophisticated tuning of the parameters, but this would defeat our purpose of showing that we can easily calculate the LOC index irrespective of how irregular the function is.
Based on our visual assessment, no function in Figure 3 appears to be increasing over its entire domain of definition. Nevertheless, we may argue that some of them are closer to increasing than others. To substantiate this claim, we employ the LOC index discussed in Section 4. The following terminology is useful.
Following the guidelines of Section 4.2, we produce the step-wise approximation D_m of the function ĥ. Then we calculate the index L(D_m) according to formula (7). Findings in the form of 'LOC matrices' are presented in Tables 3 and 4, whose entries are the values of the LOC index: the larger the value, the more the corresponding pairs deviate from the co-monotonic pattern.
The LOC matrix is, naturally, asymmetric. For example, the entry 0.231814 in Table 3 is the value (multiplied by 1,000) of the LOC index for Mathematics-Reading, whereas 0.007202 is the value (multiplied by 1,000) of the LOC index for Reading-Mathematics. We have multiplied all the original LOC-index values by 1,000 to avoid recording too many decimal zeros in the tables.
Naturally, we may wish to know how much a given study subject might influence the other ones. Hence, we now gently touch upon causality, being well aware of the many pitfalls in the area (cf., e.g., Cheng, 1997; Pearl, 2009; and references therein). As an example, consider Mathematics as an 'explanatory' subject. In this case the LOC index for Spelling is larger than for Reading, and so we may conclude that Mathematics supports Reading more than Spelling. The opposite conclusion holds for Reading: it supports Mathematics more than Spelling. Indeed, we see from Tables 3 and 4 that in the case of 'explanatory' Reading, the LOC index is larger for Spelling than for Mathematics. The scatterplots in Figure 3 also support this finding: higher marks in Reading correspond to somewhat lower marks in Spelling. Furthermore, we glean from the tables that Spelling tends to support Reading more than Mathematics.
To conclude this section, we note that even though the corresponding entries of Tables 3 and 4 are different, the conclusions that we infer from them do not contradict each other. This may not always be the case, especially if data are considerably skewed.
In the case of the data that we are exploring, however, the descriptive statistics and histograms in Section 2 suggest fairly symmetric distributions of all the three study subjects.

Next we work with a scatterplot (x_i, y_i), i = 1, . . ., n, which we view as our 'population.' To avoid computational complications that inevitably arise when dealing with ranks when some of the x_i's or y_i's are equal, throughout the rest of this section we work under the assumption

x_i ≠ x_j and y_i ≠ y_j whenever i ≠ j.    (9)
Note 6.2 Assumption (9) is violated by the data-set of Thorndike and Thorndike-Christ (2010). However, this is not an issue because we can always add negligible noise (e.g., independent and identically distributed normal random variables with means 0 and very small standard deviations, say 10^{-5}) and make all the marks unequal without practically changing their numerical values.
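The tie-breaking device of Note 6.2 is easy to carry out in practice. A sketch with hypothetical tied marks (the values below are illustrative, not taken from the data-set):

```python
import random

random.seed(0)  # for reproducibility of the sketch

marks = [30 / 65, 40 / 65, 30 / 65, 52 / 65, 40 / 65]  # hypothetical marks with ties
jittered = [m + random.gauss(0.0, 1e-5) for m in marks]

assert len(set(marks)) < len(marks)           # the original marks contain ties
assert len(set(jittered)) == len(jittered)    # the jittered marks are all distinct
assert all(abs(a - b) < 1e-3 for a, b in zip(marks, jittered))  # values barely change
```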
Let F_n and G_n be the marginal empirical cdf's defined by F_n(x) = (1/n) Σ_{i=1}^n 1{x_i ≤ x} and G_n(x) = (1/n) Σ_{i=1}^n 1{y_i ≤ x}. Under this 'finite population' scenario, the quantity I_{X,Y} becomes

I_{X,Y} = (1/(2n)) Σ_{i=1}^n (F_n(x_i) − G_n(y_i))^2.

Let x_{1:n} < . . . < x_{n:n} be the ordered values of x_1, . . ., x_n, and let y_(1), . . ., y_(n) be the corresponding induced ordered values. In other words, the original pairs (x_i, y_i) have been ordered according to their first coordinates, and the resulting pairs are now (x_{i:n}, y_(i)).
With the notation r_i := n G_n(y_(i)), that is, the rank of y_(i) among y_1, . . ., y_n, we have

I_{X,Y} = (1/(2n^3)) Σ_{i=1}^n (i − r_i)^2,    (11)

where we used the equation F_n(x_{i:n}) = i/n. Next we construct a function h_n^0 : [0, 1] → [0, 1] such that L(h_n^0) is equal to the right-hand side of equation (11) or, in other words, such that

L(h_n^0) = I_{X,Y}.    (12)

Namely, for every i = 1, . . ., n, let h_n^0(t) = r_i/n for all t ∈ ((i − 1)/n, i/n]. The LOC index of the function h_n^0 is

L(h_n^0) = (1/n^3) Σ_{i=1}^n i(i − r_i) = (1/(2n^3)) Σ_{i=1}^n (i − r_i)^2,

where we used the equations Σ_{i=1}^n i = Σ_{i=1}^n r_i and Σ_{i=1}^n i^2 = Σ_{i=1}^n r_i^2. This establishes equation (12) and helps us to connect the LOC index L with Liebscher's ζ.
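As a numerical sanity check, the identity L(h_n^0) = I_{X,Y} can be verified on a simulated scatterplot; the sketch below assumes the finite-population expressions for I_{X,Y} and L(h_n^0) stated in this section.

```python
import random

random.seed(1)
n = 25
xs = [random.random() for _ in range(n)]
ys = [random.random() for _ in range(n)]

# Order the pairs by their x-coordinates; y_(i) are the induced ordered values.
induced = [y for _, y in sorted(zip(xs, ys))]

# r_i = n * G_n(y_(i)): the rank of y_(i) among y_1, ..., y_n.
order = sorted(range(n), key=lambda i: induced[i])
r = [0] * n
for rank, idx in enumerate(order, start=1):
    r[idx] = rank

# Finite-population I_{X,Y} = (1/(2 n^3)) * sum_i (i - r_i)^2.
I = sum((i + 1 - r[i]) ** 2 for i in range(n)) / (2.0 * n ** 3)

# LOC index of h_n^0 (step function with values r_i/n): (1/n^3) * sum_i i*(i - r_i).
L = sum((i + 1) * (i + 1 - r[i]) for i in range(n)) / float(n ** 3)

assert abs(I - L) < 1e-12  # the two quantities coincide
```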
For this, we first observe that the set of equations h_n^0(i/n) = r_i/n, i = 1, . . ., n, is equivalent to the set h_n^0(F_n(x_{i:n})) = G_n(y_(i)), i = 1, . . ., n, which is in turn equivalent to the set of equations h_n^0(F_n(x_i)) = G_n(y_i), i = 1, . . ., n. This implies that Liebscher's ζ is the LOC index L of the step-wise function h_n^0, which originates from the rank-based scatterplot (F_n(x_i), G_n(y_i)) and not from the original scatterplot (x_{i:n}, y_(i)). This also explains a considerable difference between the meanings of the two indices. To support our conclusions, we have depicted the two scenarios in Figure 4, where we have used Mathematics (with added noise) as the 'explanatory' variable and Reading (with added noise) as the 'response.' Consequently, in order to decide whether the problem at hand would be better served by the LOC index L or Liebscher's ζ, we need to decide whether the solution should be based on the ranks of the marks or on the marks themselves.

Conclusions
The herein proposed index for measuring the lack of co-monotonicity (LOC) between pairs of variables is capable of measuring the extent to which the variables deviate from co-monotonic patterns. The LOC index is designed to work with all relationships, including non-linear and non-monotonic ones. The performance of the index has been illustrated using the Thorndike and Thorndike-Christ (2010) data-set consisting of student marks on three study subjects.

A Appendix: proofs
Proof of equation (7). We check that

L(D_m) = ∫_0^1 (I_{D_m}(t) − D_m(t)) t dt = Σ_{i=1}^m (τ_{(i)} − τ_i) ∫_{(i−1)/m}^{i/m} t dt = Σ_{i=1}^m (τ_{(i)} − τ_i) (2i − 1)/(2m^2) = (1/m^2) Σ_{i=1}^m i (τ_{(i)} − τ_i),

where the last equality holds because Σ_{i=1}^m τ_{(i)} = Σ_{i=1}^m τ_i. This concludes the proof of equation (7).
Proof of statement (8). The increasing rearrangement is a contraction in the L_1-norm (e.g., Crandall and Tartar, 1980), and so for any two integrable functions u, v : [0, 1] → R we have the bound

|L(u) − L(v)| ≤ ∫_0^1 |I_u(t) − I_v(t)| dt + ∫_0^1 |u(t) − v(t)| dt ≤ 2 ∫_0^1 |u(t) − v(t)| dt.    (15)

Setting u = ĥ and v = D_m: since the function ĥ is finite and integrable, the right-hand side of bound (15) converges to zero when m → ∞. This finishes the proof of statement (8).

Comparing the LOC index with Liebscher's ζ
Here we compare the LOC index L with Liebscher's (2014) coefficient ζ_{X,Y} of monotonically increasing dependence. Naturally, to understand ζ_{X,Y} we only need to understand its expectation-based part, which under the quadratic function ψ(x) = x^2/2 is equal to

I_{X,Y} = (1/2) E[(F(X) − G(Y))^2].

The quantity I_{X,Y} is closely related to the Spearman rank correlation coefficient, denoted here by S_{X,Y}, which is, by definition, equal to the Pearson correlation coefficient between F(X) and G(Y). Hence, we easily check the equation S_{X,Y} = 1 − 12 I_{X,Y}.
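The relation S_{X,Y} = 1 − 12 I_{X,Y} holds at the population level; in finite samples the classical rank-based Spearman formula agrees with 1 − 12 I_{X,Y} only up to a correction of order 1/n^2, so the simulated sketch below checks approximate equality:

```python
import random

random.seed(2)
n = 300
xs = [random.random() for _ in range(n)]
ys = [x + random.gauss(0.0, 0.5) for x in xs]  # a dependent pair (X, Y)

def ranks(v):
    """Ranks 1..n of the entries of v (assumed distinct)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for k, idx in enumerate(order, start=1):
        r[idx] = k
    return r

rx, ry = ranks(xs), ranks(ys)

# Empirical analogue of I_{X,Y} = (1/2) E[(F(X) - G(Y))^2], using F_n and G_n.
I = sum(((a - b) / n) ** 2 for a, b in zip(rx, ry)) / (2.0 * n)

# Classical Spearman coefficient from the rank differences.
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
S = 1.0 - 6.0 * d2 / (n * (n * n - 1))

# S equals 1 - 12*I up to a finite-sample correction bounded by 2/n^2.
assert abs(S - (1.0 - 12.0 * I)) <= 2.0 / n ** 2
```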