Latent-variable Approaches Utilizing Both Item Scores and Response Times To Detect Test Fraud


There is a growing interest in approaches based on latent-variable models for detecting fraudulent behavior on educational tests. Wollack and Schoenig (2018) noted the presence of five types of statistical/psychometric approaches to detect the three broad types of test fraud that occur in educational tests. This paper first briefly reviews the five types of statistical/psychometric approaches mentioned by Wollack and Schoenig (2018), and then reviews in more detail the recent approaches for detecting test fraud using both item scores and response times, all of which are based on latent-variable models. A real data example demonstrates the use of two of the approaches.

There is a growing interest in statistical/psychometric methods for detecting fraudulent behavior on tests, which is evident from the numerous publications on the topic (e.g., Drasgow, Levine, & Zickar, 1996; Ferrara, 2017; Kingston & Clark, 2014; McLeod, Lewis, & Thissen, 2003; Sinharay, 2017, 2018; Sinharay & Johnson, 2020; Sinharay, Duong, & Wood, 2017; van der Linden & Guo, 2008; van der Linden & Lewis, 2015; van der Linden & Sotaridona, 2006; Wang, Liu, & Hambleton, 2017; Wollack, 1997; Wollack, Cohen, & Eckerly, 2015; Wollack & Fremer, 2013). While most research on detecting test fraud is based only on item scores, the increasing popularity of computerized assessments has allowed the recording of response times and, subsequently, the detection of test fraud using both item scores and response times (e.g., Sinharay & Johnson, 2020). Researchers such as Boughton, Smith, and Ren (2017), Fox and Marianti (2017), Lee and Wollack (2020), Man, Harring, and Sinharay (2019), Sinharay and Johnson (2020), van der Linden and Guo (2008), and Wang, Xu, Shang, and Kuncel (2018) have suggested a variety of approaches that can be used to detect test fraud using both item scores and response times, and all of these approaches are based on latent variable models (e.g., Bartholomew, Knott, & Moustaki, 2011). The goal of this paper is to provide a review of these approaches. The next section includes brief reviews of the five types of statistical/psychometric approaches to detect test fraud and the most popular methods under each type of approach. The following section includes a review of the approaches based on both item scores and response times. The Real Data section includes a real data example. The last section includes some conclusions and recommendations.

Wollack and Schoenig (2018) noted the occurrence of the following three broad types of test fraud in educational assessments: (a) answer-copying and collusion, (b) preknowledge, and (c) test tampering.
They also stated that there exist five types of statistical/psychometric approaches to detect these three types of fraud. Interestingly, almost all of these approaches are based on item response theory (IRT) models (e.g., Hambleton & Swaminathan, 1985), which are latent variable models (e.g., Bartholomew et al., 2011) and are widely used in the reporting of scores to takers of educational and psychological tests. Table 1 shows the three types of test fraud (the column names) and the five types of statistical/psychometric approaches (row names). A check mark for a cell indicates that the approach in the heading of the corresponding row can be applied to detect the type of fraud corresponding to that column. For example, a check mark for score differencing and preknowledge indicates that score differencing can be used to detect preknowledge.

The Five Types of Statistical/Psychometric Approaches to Detect Test Fraud
Brief descriptions of the three types of test fraud mentioned by Wollack and Schoenig (2018) are given below:
-Answer-copying and collusion: Examinees often copy answers from another examinee, most often one sitting close to them. Collusion involves two or more examinees working together (when they are not allowed to) to answer the test items.
-Preknowledge: Examinees benefit from item preknowledge when a "source" (such as a teacher, a test preparation company, or a website) shares test questions and/or their answers and then several beneficiaries memorize the assessment questions and/or answers. The items that are shared are usually referred to as "compromised" items.
-Test tampering: In its most common form, classroom teachers or administrators change student answers after the tests are complete (e.g., Wollack et al., 2015). Individual students may also work on their own or with the help of a companion to change their answers. While less common, test tampering can take place on computerized tests as well (see, e.g., Sinharay et al., 2017).
Note that test fraud may occur at the level of individual examinees (as documented in Sinharay et al., 2017, p. 202) and at the level of groups (such as classrooms or schools) of examinees (e.g., Maynes, 2013). Brief descriptions of the five types of statistical/psychometric approaches that are used to detect the three abovementioned types of test fraud are given below:
-Score Differencing: Score differencing involves a test of the null hypothesis of equal performance of an examinee over two sets of items; the alternative hypothesis is that the performance is better, presumably due to test fraud, on one of the item sets. Score differencing can be used to detect several types of test fraud including item preknowledge (e.g., Sinharay, 2017) and test tampering (e.g., Sinharay et al., 2017). Among the several approaches for score differencing, those that have been found to be most powerful are a variation of the optimal index (Drasgow et al., 1996) and the L_s statistic (Sinharay, 2017), a variation of the likelihood ratio test statistic.
-Analysis of Erasures and Answer Changes: While erasures are found on answer sheets irrespective of whether test fraud occurred (e.g., Sinharay et al., 2017), fraudulent erasures arise when teachers or school administrators correct students' wrong answers on their answer sheets after the test, often to hide underachievement by their class or school (van der Linden & Lewis, 2015). In analysis of erasures, an investigator attempts to find out whether the erasures found on answer sheets are fraudulent. Popular statistics for analysis of erasures include (a) the erasure detection index (EDI; Wollack et al., 2015), (b) the L-index, (c) the EDI for groups of examinees (Wollack & Eckerly, 2017), and (d) two extensions of the EDI for groups of examinees (Sinharay, 2018).
While the EDI and the L-index are intended to detect individual examinees, the EDI for groups of examinees and its extensions are intended to detect groups (such as classrooms or schools) of examinees.
-Copying-detection and Similarity Analysis: Copying-detection and similarity analyses are intended to detect potential collusion among a pair or group of examinees by investigating the similarity of their answers and flagging the examinees who produce an observed number of answer matches that significantly exceeds the number of matches predicted under the model. These analyses are performed using answer-copying indices and similarity indices. While both types of indices quantify the similarity of the answers of pairs of examinees, they differ in the way they are computed.
In computing the answer-copying indices for a pair of examinees, one first specifies a source and a copier in the pair and computes the expected number of answer matches conditional on the estimated ability of the copier and the answer string of the source. In computing the similarity indices for a pair of examinees, one does not specify a source or a copier and computes the expected number of answer matches conditional on the estimated abilities of the pair of examinees. The answer-copying indices that are most popular include the K-index (e.g., Holland, 1996), the closely related probability of matching incorrect responses (PMIR; e.g., Lewis & Thayer, 1998), the ω index (Wollack, 1997), and the generalized binomial model approach (van der Linden & Sotaridona, 2006). The popular similarity indices include the M4 index (Maynes, 2014) for detecting pairs of examinees one or more of whom may have been involved in test fraud, the index of Wollack and Maynes (2017) that employs cluster-analysis methods (e.g., Everitt, Landau, Leese, & Stahl, 2011) in conjunction with the M4 index (Maynes, 2014) to detect groups of test-takers engaged in collusion, and the analysis for matching of response patterns based on a specialized IRT model (Haberman & Lee, 2017).
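To make the shared logic of these indices concrete, the following Python sketch (with hypothetical nominal-response-model item parameters) illustrates the common core: given model-implied probabilities that one examinee would independently choose the options selected by another, the number of answer matches is a sum of independent Bernoulli variables, and an observed match count far above its expectation is flagged. The normal approximation used here is a simplification; the actual indices (e.g., the ω index or the generalized binomial approach) differ in the exact reference distribution.

```python
import numpy as np
from scipy import stats

def match_probabilities(copier_theta, source_options, zeta, lam):
    """Per-item probability that the copier, responding independently,
    picks the option the source picked, under NRM option probabilities."""
    probs = []
    for i, k in enumerate(source_options):
        logits = zeta[i] + lam[i] * copier_theta     # one logit per option
        p = np.exp(logits - logits.max())
        p /= p.sum()
        probs.append(p[k])
    return np.array(probs)

def copying_pvalue(n_matches, match_probs):
    """Upper-tail p-value for the observed number of answer matches,
    using a normal approximation to the sum of Bernoulli matches."""
    mu = match_probs.sum()
    sd = np.sqrt((match_probs * (1 - match_probs)).sum())
    return stats.norm.sf((n_matches - mu) / sd)
```

For example, with 40 items and per-item match probabilities of 0.25, an observed count of 20 matches yields a very small p-value, whereas 10 matches (the expected count) yields 0.5.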
-Person-fit Analysis: In person-fit analysis, an investigator tries to determine whether the item scores of an examinee follow an item response theory (IRT) model. In the presence of cheating, the item scores usually do not follow an IRT model because, for example, an examinee who cheats may answer more difficult items correctly than what an IRT model predicts. There exist several indices for performing person-fit analysis; these are referred to as person-fit statistics. Meijer and Sijtsma (2001) provided a review of several of these indices. Popular person-fit statistics include the caution indices (Tatsuoka, 1984), the H^T statistic (Sijtsma, 1986), and the l_z* statistic (Snijders, 2001; Sinharay, 2016).
-Use of Response-time Models: Response-time models or RTMs (e.g., van der Linden, 2016) are latent variable models that are fitted to data on examinees' response times on educational or psychological tests. While such models have a long history of use in psychology (e.g., Luce, 1986), they have only recently become popular tools in educational measurement, primarily propelled by the increasing popularity of computerized tests that allow the recording of response times. There is also an upswing in research on the use of RTMs to detect test fraud. Researchers such as van der Linden and Guo (2008), Marianti, Fox, Avetisyan, Veldkamp, and Tijmstra (2014), Fox and Marianti (2017), Sinharay and Johnson (2020), and Wang et al. (2018) suggested various approaches to detect test fraud using RTMs.
Note that Olson and Fremer (2013) also provided a list of five types of statistical approaches, but the classification provided by Wollack and Schoenig (2018) is slightly different from, and a refinement of, that of Olson and Fremer (2013). While most of the abovementioned approaches are designed to detect individual examinees who may have committed test fraud, several approaches have been suggested for detecting test fraud at the group level (e.g., Sinharay, 2018; Wollack & Eckerly, 2017; Wollack & Maynes, 2017). Finally, some fraud-detection methods such as biometric learner verification (www.ets.org/accelerate/ai-portfolio/biometric-learner-verification/) are not covered by any of the above types of approaches.

The Models
This section is intended to provide brief descriptions of the statistical/psychometric models that form the basis of the approaches for detecting test fraud that are discussed in this paper. All of these models are latent variable models and include latent variables that measure the examinees' ability or speed.

Item Response Theory Models
Item response theory (IRT) models (e.g., Hambleton, 1989; Yen & Fitzpatrick, 2006) are a family of statistical/psychometric models used to analyze item response data. These models are employed in many educational measurement applications including test construction, adaptive test administration, test scoring, and score reporting. The primary reason for the popularity of IRT models is that they allow for the estimation of individual item locations (difficulties) and examinee abilities separately, but on the same scale.
The core of an IRT model is the mathematical expression for P_ic(θ) = P(x_i = c | θ, δ_i), the probability of observing a particular item score given an examinee's ability and the item parameters, where θ denotes the examinee ability,¹ x_i is the score of the examinee on item i, c denotes a specific value of x_i, and δ_i is the collection of parameters of item i.

¹ θ is assumed to be a scalar quantity in this paper. There exist IRT models that involve vector-valued examinee ability, but those models will not be considered in this paper.

For example, for the 3-parameter logistic model (3PLM; Birnbaum, 1968, p. 405), x_i denotes the binary score on item i, and

P_i1(θ) = c_i + (1 − c_i) exp[a_i(θ − b_i)] / {1 + exp[a_i(θ − b_i)]},

where a_i, b_i, and c_i respectively are the slope, difficulty, and guessing parameters of item i. The Rasch model (Rasch, 1960) is a special case of the 3PLM with the c_i's being equal to 0 and the a_i's being the same over all the items. The 2-parameter logistic model (2PLM; Birnbaum, 1968, p. 400) is a special case of the 3PLM with the c_i's being equal to 0. Because θ is a latent variable, IRT models are a type of latent variable model (e.g., Bartholomew et al., 2011).

One popular IRT model in detecting test fraud is the nominal response model (NRM; Bock, 1972), which has been heavily used in erasure analysis, copying detection, and similarity analyses. Under the NRM, examinees do not receive any scores on the items but choose various response options on them, and it is assumed that P_ik(θ), the probability that an examinee of ability θ chooses response option k on item i, is given by

P_ik(θ) = exp(ζ_ik + λ_ik θ) / Σ_m exp(ζ_im + λ_im θ),

where ζ_im and λ_im respectively are the intercept and slope parameters for response option m of item i.

The common IRT models involve the assumption of conditional independence, which implies that, conditional on θ, the x_i, i = 1, 2, ..., I, are independent, where I is the number of items on a test. Because of the assumption, the likelihood of an examinee on an I-item test under the 2PLM is given by

L(x|θ) = ∏_{i=1}^{I} [P_i1(θ)]^{x_i} [1 − P_i1(θ)]^{1 − x_i},

where x = (x_1, x_2, ..., x_I). The marginal likelihood of an examinee is the weighted average of L(x|θ) with the weight proportional to a population distribution p(θ) and is typically defined as

∫ L(x|θ) p(θ) dθ.

The marginal likelihood of a sample of examinees is defined as the product of the marginal likelihood over all examinees. The item parameters are typically estimated by maximizing the marginal likelihood of a sample of examinees. The details of such estimation, which typically involves the EM algorithm (e.g., Dempster, Laird, & Rubin, 1977), can be found in, for example, Baker and Kim (2004). In detection of test fraud using IRT models, investigators typically have access to the item parameter estimates computed from a previous estimation/calibration (mostly not affected by test fraud) and perform the required computations using the examinee likelihood L(x|θ), so that θ is treated as the only unknown parameter (e.g., Snijders, 2001). Descriptions of approaches for the frequentist estimation of θ given the item parameters, which involve an iterative algorithm such as the EM algorithm or the Newton-Raphson algorithm (e.g., Thisted, 1988, p. 164), can be found in, for example, Baker and Kim (2004). One can also compute the posterior mean of θ, ∫ θ L(x|θ) p(θ) dθ / ∫ L(x|θ) p(θ) dθ, using numerical integration. Several publicly available software packages, including mirt (Chalmers, 2012) and ltm (Rizopoulos, 2006), can be used to estimate the IRT item parameters and the examinee abilities given the item parameters.
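As an illustration of these computations, the following Python sketch (with hypothetical item parameters, using scipy for optimization and quadrature) evaluates the 2PLM likelihood, computes the MLE of θ given known item parameters, and computes the posterior mean of θ under a standard normal population distribution.

```python
import numpy as np
from scipy import integrate, optimize, stats

def p_correct(theta, a, b):
    """2PLM probability of a correct response to each item."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def log_lik(theta, x, a, b):
    """log L(x|theta) under conditional independence."""
    p = p_correct(theta, a, b)
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def theta_mle(x, a, b):
    """MLE of theta given known (previously calibrated) item parameters."""
    res = optimize.minimize_scalar(lambda t: -log_lik(t, x, a, b),
                                   bounds=(-6, 6), method="bounded")
    return res.x

def theta_posterior_mean(x, a, b):
    """Posterior mean of theta with a standard normal population prior,
    computed by numerical integration."""
    num = integrate.quad(lambda t: t * np.exp(log_lik(t, x, a, b))
                         * stats.norm.pdf(t), -8, 8)[0]
    den = integrate.quad(lambda t: np.exp(log_lik(t, x, a, b))
                         * stats.norm.pdf(t), -8, 8)[0]
    return num / den
```

For instance, for 20 items all of difficulty 0 and slope 1, an examinee with 10 correct answers has an MLE (and posterior mean) of θ equal to 0.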

The Lognormal Model for Response Times
Let t_i denote the response time of a randomly chosen examinee on item i, where i = 1, 2, ..., I, and let y_i = log(t_i). Under the lognormal model for response times (LNMRT; van der Linden, 2006), the y_i, i = 1, 2, ..., I, are independent given τ and

y_i ~ N(β_i − τ, α_i^{-2}),   (1)

where N(µ, σ²) denotes the normal distribution with mean µ and variance σ². That is, the probability density function of y_i is given by

f(y_i; τ, α_i, β_i) = [α_i / √(2π)] exp{−α_i²[y_i − (β_i − τ)]² / 2}.

The parameter τ is the examinee's speed parameter; a larger value of the parameter results in smaller expected response times on all the items for the examinee. The parameter β_i is the time-intensity parameter for item i; a larger value of the parameter results in larger expected response times for all examinees on the item. The parameter α_i is the discrimination parameter for item i; a larger value of the parameter leads to more information on, and hence a smaller standard error of, the examinee speed parameter. To estimate the item parameters of the LNMRT using a marginal maximum likelihood approach, or to perform a Bayesian inference on the examinee speed, one assumes a population distribution g(τ) on τ. The distribution g(τ) is assumed to be the normal distribution with mean 0 and variance σ² in most applications of the LNMRT (see, for example, van der Linden & Guo, 2008).
The assumption of conditional independence is made, which leads to the following expression for the likelihood of the logarithms of the response times of an examinee:

L(y|τ) = ∏_{i=1}^{I} f(y_i; τ, α_i, β_i), where y = (y_1, y_2, ..., y_I).

The LNMRT is arguably one of the most popular RTMs. The model was considered, either to analyze only the response times or to analyze the response times and item scores, by, for example, Boughton et al. (2017), Glas and van der Linden (2010), Qian, Staniewska, Reckase, and Woo (2016), van der Linden (2007, 2009, 2016), and van der Linden and Guo (2008). Bolsinova and Tijmstra (2018, p. 13) commented that the LNMRT is used in most applications of RTMs.
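Under these assumptions, an examinee's response-time log-likelihood is simply a sum of normal log-densities; a minimal Python sketch (parameter values hypothetical):

```python
import numpy as np

def loglik_tau(tau, y, alpha, beta):
    """Log-likelihood of the log response times y under the LNMRT:
    y_i ~ N(beta_i - tau, alpha_i**-2), independent given tau."""
    z = alpha * (y - (beta - tau))  # standardized residuals
    return np.sum(np.log(alpha) - 0.5 * np.log(2.0 * np.pi) - 0.5 * z**2)
```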
A Gibbs sampler (e.g., Gelman et al., 2014, p. 276) was suggested by van der Linden (2006) to estimate the item parameters of the LNMRT. That approach has been used in most applications of the model, and the R package LNIRT (Fox, Klein Entink, & Klotzke, 2017) can be used to implement the Gibbs sampler. Glas and van der Linden (2010) suggested an approach to compute the marginal maximum likelihood estimates (MMLEs) of the item parameters when the LNMRT is used along with the three-parameter logistic model (3PLM) to jointly analyze both response times and item scores. Finger and Chee (2009) showed how one can use factor analysis to obtain the MMLEs of the item parameters of the LNMRT when it is used as a stand-alone model, as in van der Linden (2006). The R package lavaan (Rosseel, 2012), which is used to perform factor analysis and structural equation modeling (SEM), can be used to estimate the item parameters of the LNMRT, as shown by Sinharay and van Rijn (2020).
It was shown by van der Linden (2006) that, given the α_i's and β_i's, the MLE of the person speed parameter τ for the LNMRT can be obtained as

τ̂ = Σ_i α_i²(β_i − y_i) / Σ_i α_i².

An expression for the posterior mean of τ given the α_i's, β_i's, and y_i's can be found in van der Linden (2007).
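This closed-form estimate is a precision-weighted average of the quantities β_i − y_i; a quick numerical sketch (Python, hypothetical values):

```python
import numpy as np

def tau_mle(y, alpha, beta):
    """Closed-form MLE of the LNMRT speed parameter: the
    precision-weighted average of beta_i - y_i, with weights alpha_i**2."""
    w = alpha**2
    return np.sum(w * (beta - y)) / np.sum(w)
```

With all α_i equal, τ̂ reduces to the simple mean of the β_i − y_i; in general, the estimate zeroes the derivative of the quadratic log-likelihood in τ.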

The Hierarchical Model for Item Scores and Response Times
The hierarchical/joint modeling approach of van der Linden (2007) involves the application of a model for response times in combination with an IRT model for the item scores. In the approach, one typically assumes that
-the response times follow the lognormal model (LNMRT; van der Linden, 2006) defined above;
-the item scores follow an IRT model; the flexibility of the approach allows the use of any IRT model, such as the 2PLM described above.
One also assumes a suitable population distribution, typically a bivariate normal distribution with means equal to 0, for the vector of examinee parameters, (τ, θ) t . Given that both the ability parameter θ and the speed parameter τ are latent variables, the hierarchical model for item scores and response times is a latent variable model that includes two latent variables.
Klein Entink, Fox, and van der Linden (2009) and van der Linden (2007) suggested Bayesian approaches to estimate the item parameters of the hierarchical model using the Markov chain Monte Carlo (MCMC) algorithm (e.g., Gelman et al., 2014). The estimation approach is implemented in the R package LNIRT. Glas and van der Linden (2010) suggested an approach for computing the MMLEs of the item parameters. Once the item parameters have been estimated, the joint distribution of the item scores and the response times of an examinee given the item parameters can be expressed as the product of the distribution of the item scores and that of the response times (e.g., van der Linden, 2007). Therefore, to compute the joint maximum likelihood estimates (MLEs) of τ and θ for a person given the item parameters, one can compute the MLEs separately: the one for τ only using the response times and the LNMRT, and the one for θ only using the item scores and the IRT model.
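To fix ideas, the following Python sketch simulates data from a hierarchical model of this form under assumed item parameters and a bivariate normal person distribution (all parameter values are hypothetical); it is a simulation sketch, not the estimation procedure of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examinees, n_items = 1000, 30

# Person parameters: (theta, tau) drawn from a bivariate normal
# population distribution with means 0 and correlation rho.
rho = 0.4
cov = np.array([[1.0, rho], [rho, 1.0]])
theta, tau = rng.multivariate_normal([0.0, 0.0], cov, size=n_examinees).T

# Item parameters (assumed values, for illustration only).
a = rng.uniform(0.8, 1.6, n_items)      # 2PLM slopes
b = rng.normal(0.0, 1.0, n_items)       # 2PLM difficulties
alpha = rng.uniform(1.5, 2.5, n_items)  # LNMRT discriminations
beta = rng.normal(4.0, 0.3, n_items)    # LNMRT time intensities

# Item scores from the 2PLM and log response times from the LNMRT.
p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
scores = (rng.uniform(size=p.shape) < p).astype(int)
log_times = rng.normal(beta - tau[:, None], 1.0 / alpha)
```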
Note that there exist other joint models for item scores and response times some of which are closely related to those used by cognitive psychologists. For example, the diffusion model (e.g., Ratcliff, 1978) and the race model (e.g., Townsend & Ashby, 1978) are popular in cognitive psychology for modeling cognitive processes based on response time and response accuracy; van der Maas, Molenaar, Maris, Kievit, and Borsboom (2011) and Ranger, Kuhn, and Gaviria (2014) respectively developed versions of these two models for use with item scores and response times. However, these models will not be considered in this paper.

The Approaches Based on Latent Variable Models Using Both Item Scores and Response Times to Detect Test Fraud
The approaches for detecting test fraud that are based on latent-variable models and both item scores and response times include: (a) a Bayesian person-fit approach (Fox & Marianti, 2017), (b) the use of standardized residuals (van der Linden & Guo, 2008), (c) the use of mixture models (Wang et al., 2018), (d) the use of data-mining methods (Man, Harring, & Sinharay, 2019), and (e) the use of the χ̄² statistic (Sinharay & Johnson, 2020). Given that these approaches are based on RTMs, they all belong to the last type of method listed in Table 1. In addition, the first approach belongs to the fourth type and the last approach belongs to the first type of method listed in Table 1.

A Bayesian Person-fit Approach

Marianti et al. (2014) suggested a person-fit statistic that is based on response times and the LNMRT (van der Linden, 2006) and is given by

l_t = Σ_{i=1}^{I} α_i²[y_i − (β_i − τ)]².

The form of the statistic implies that l_t is a sum of squares of standardized residuals, so large values of the statistic indicate an aberrant response-time pattern, often caused by preknowledge of some items. Marianti et al. (2014) also described a Bayesian approach that involves first the use of an MCMC algorithm to fit the abovementioned hierarchical model to the item scores and response times, followed by the computation of l_t in each iteration of the algorithm. The proportion of values of l_t that are larger than the 95th percentile of the χ² distribution is an estimate of the posterior probability of an aberrant response-time pattern, often due to test fraud.

Fox and Marianti (2017) suggested a Bayesian approach to estimate the posterior probability of an aberrant item-score pattern using the l_z statistic (Drasgow, Levine, & Williams, 1985), which is given by

l_z = [ℓ(x|θ) − E(ℓ(x|θ))] / √Var(ℓ(x|θ)), where ℓ(x|θ) = log(L(x|θ)).

The statistic l_z is the standardized log-likelihood of the item scores of an examinee, so an extreme value of the statistic indicates an aberrant score pattern caused by item preknowledge, guessing, etc. Finally, Fox and Marianti (2017) suggested a Bayesian approach using both l_t and l_z to estimate the posterior probability of both an aberrant item-score pattern and an aberrant response-time pattern. The posterior probability is expected to be large (for example, larger than 0.95 or 0.99) for aberrant response patterns. The underlying model in the approach of Fox and Marianti (2017) is the hierarchical model for item scores and response times (van der Linden, 2007). The approach of Fox and Marianti (2017) is designed to detect a variety of aberrant responses including preknowledge. However, the power of person-fit approaches to detect test fraud has been found to be small (e.g., Sinharay & Johnson, 2020). Further, the approach of Fox and Marianti (2017) was found to have an inflated Type I error rate and small power in detecting test fraud in the simulations of Sinharay and Johnson (2020).
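A minimal sketch of the l_t computation (Python, hypothetical parameter values), flagging against the 95th percentile of the χ² distribution with I degrees of freedom, as implied by a sum of I squared standardized residuals:

```python
import numpy as np
from scipy import stats

def l_t(y, alpha, beta, tau):
    """Sum of squared standardized LNMRT residuals (Marianti et al., 2014)."""
    z = alpha * (y - (beta - tau))
    return np.sum(z**2)

def flag_response_times(y, alpha, beta, tau, level=0.95):
    """Flag the response-time pattern if l_t exceeds the chosen
    chi-square percentile with I = len(y) degrees of freedom."""
    return l_t(y, alpha, beta, tau) > stats.chi2.ppf(level, df=len(y))
```

A pattern of response times exactly equal to their model-implied expectations gives l_t = 0, whereas uniformly much-faster-than-expected times are flagged.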

The Use of Standardized Residuals
A Bayesian approach for detecting aberrant response times was suggested by van der Linden and Guo (2008). They showed that the posterior distribution of the predicted value of an examinee's log response time on item i, conditional on y_{−i} = (y_1, y_2, ..., y_{i−1}, y_{i+1}, ..., y_I) and also on the examinee's item scores, can be used to compute a standardized residual e_i for the response time on item i. The response time for the examinee on item i is concluded to be aberrant at the α% level if the absolute value of e_i is larger than the 100(1 − α)th percentile of the standard normal distribution. This approach is designed to detect a variety of aberrant responses. For example, the approach can be used to flag, for possible item preknowledge, the examinees with several statistically significant and negative e_i's, as was performed in Boughton et al. (2017, p. 181), Qian et al. (2016), and van der Linden and Guo (2008).

The Use of Mixture Models

Lee and Wollack (2020) and Wang et al. (2018) suggested using mixture hierarchical IRT models, which are fitted using Bayesian estimation approaches, to detect aberrant item scores and response times. These models include a 0/1 aberrance indicator ∆_ij (that is unknown and is estimated from the data) for each examinee-item combination, with ∆_ij = 1 indicating aberrant behavior on item j by examinee i. If examinee i does not show aberrant behavior on item j, then the item score and response time for that examinee-item combination are obtained from the hierarchical model of van der Linden (2007). If examinee i shows aberrant behavior on item j, then the probability of a correct answer for that examinee-item combination does not depend on the examinee ability and is the same (d_j) for all examinees on the item, and the response time for that examinee-item combination follows a lognormal distribution with a mean (µ_C) and variance (σ²_C), each of which is constant over all items and all examinees.
The mixture model is fitted to the data using an MCMC algorithm, and the estimated posterior probability of ∆_ij being equal to 1 can be used to determine whether the response time and item score on item j by examinee i are aberrant. An examinee with too many aberrant item-examinee combinations may indicate possible test fraud.

The Use of Data-Mining Methods

Man et al. (2019) suggested two approaches based on data-mining methods (e.g., Hastie, Tibshirani, & Friedman, 2009) and latent-variable models to detect test fraud. In both of these approaches, one first fits an IRT model and an RTM to the item scores and response times to compute the values of several person-fit statistics using the item scores and the response times. Then, in the first approach, one applies an unsupervised learning method such as the K-means clustering method (e.g., Hastie et al., 2009, p. 460), using the person-fit statistics, item scores, response times, and any other features that may be related to test fraud, to divide the examinees into multiple clusters, and attempts to interpret the clusters. If the features are powerful predictors of test fraud, then the clusters will correspond to those committing various types of test fraud and those not committing test fraud. In the second approach, which is applicable only when the investigator has available a set of examinees who are known to have committed test fraud, one applies a supervised learning method such as random forests (e.g., Hastie et al., 2009, p. 587) using the abovementioned features to predict the examinees who committed test fraud. If the features are powerful predictors of test fraud, then the second approach will lead to a powerful (nonparametric) prediction model. Man et al. (2019) applied both approaches to a data set from Cizek and Wollack (2017) that involved real test fraud and found the second approach (based on supervised learning) to be more powerful in detecting test fraud.
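As an illustration of the unsupervised variant, the following Python sketch applies scikit-learn's KMeans to simulated, hypothetical person-fit features (the feature set and the degree of separation are invented for illustration) and treats the small, well-separated cluster as the candidate fraud group.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Hypothetical features per examinee: e.g., a score-based person-fit
# statistic, a response-time person-fit statistic, standardized raw
# score, and standardized total testing time.
honest = rng.normal(0.0, 1.0, size=(950, 4))
aberrant = rng.normal([4.0, 4.0, 2.0, -2.0], 1.0, size=(50, 4))
features = np.vstack([honest, aberrant])

# Unsupervised step: partition the examinees and inspect the minority cluster.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
suspect = np.argmin(np.bincount(labels))  # the smaller cluster
```

In this simulated setting the minority cluster corresponds closely to the aberrant examinees; with real data, the clusters require substantive interpretation and follow-up investigation.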

The Use of the χ̄² Statistic

Sinharay and Johnson (2020) suggested an approach that is based on the hierarchical model of van der Linden (2007) and can be applied when the investigator is interested in the types of test fraud that lead to better and/or faster performance on a known set of items. For example, real data sets have been used to demonstrate that, in cases of item compromise and preknowledge, the examinees who benefit from item preknowledge perform better (e.g., Sinharay, 2017; Smith & Davis-Becker, 2011) and faster (e.g., Kasli & Zopluoglu, 2018; Smith & Davis-Becker, 2011) on the compromised items than on the noncompromised items.
Let C and C respectively denote the set of compromised items and non-compromised items. Let x, x C , and x C respectively denote the collection of the scores of a randomly chosen examinee on all items, on the items in C and on the items in C, respectively. Thus, for example, x C = {x i , i ∈ C}. Similarly, let y, y C , and y C respectively denote the collection of the logarithms of response times of the examinee on all items, the items in C and the items in C .
For the examinee, let us denote the true ability based on the whole test, C, and C as θ, θ C , and θ C , respectively.
Let θ̂, θ̂_C, and θ̂_C̄ denote the corresponding maximum likelihood estimates (MLEs). Let τ_C and τ_C̄ denote the examinee's true speed parameters on the compromised and non-compromised items, respectively, and let τ̂_C and τ̂_C̄ denote their MLEs. Let τ̂ denote the MLE of the examinee's speed parameter based on all I items on the test. Sinharay and Johnson (2020) suggested that one can detect whether an examinee performed better and/or faster on the compromised items by testing the null hypothesis H₀: τ_C = τ_C̄ and θ_C = θ_C̄ against the alternative hypothesis H₁: τ_C > τ_C̄ or θ_C > θ_C̄. Rejection of the null hypothesis may indicate item preknowledge. For this hypothesis-testing problem, Sinharay and Johnson (2020) recommended the use of the constrained likelihood ratio test statistic

Λ_ST = Λ_S + Λ_T, where

Λ_S = 2[ℓ(x_C | θ̂_C) + ℓ(x_C̄ | θ̂_C̄) − ℓ(x | θ̂)] I(θ̂_C ≥ θ̂_C̄),
Λ_T = 2[ℓ(y_C | τ̂_C) + ℓ(y_C̄ | τ̂_C̄) − ℓ(y | τ̂)] I(τ̂_C ≥ τ̂_C̄),

ℓ(x|θ) = log L(x|θ) denotes the corresponding log-likelihood, and I(θ̂_C ≥ θ̂_C̄) is the indicator function taking the value 1 if θ̂_C ≥ θ̂_C̄ and 0 otherwise. Sinharay and Johnson (2020) also proved that the cumulative distribution function (CDF) of Λ_ST for large C and C̄, under the null hypothesis of no item preknowledge, is given by

P(Λ_ST ≤ λ) = 1/4 + (1/2) P(χ²₁ ≤ λ) + (1/4) P(χ²₂ ≤ λ),

where, for example, χ²₂ denotes a χ² random variable with two degrees of freedom. This CDF corresponds to a distribution that is referred to as the chi-bar-square (χ̄²) distribution (e.g., Dykstra, 1991; Silvapulle & Sen, 2001), which is popular in constrained statistical inference. Consequently, for λ > 0, the p-value for the test statistic Λ_ST is calculated as

p = (1/2) P(χ²₁ > λ) + (1/4) P(χ²₂ > λ).

The critical values of Λ_ST for significance levels of 0.001, 0.01, and 0.05 are 11.76, 7.289, and 4.231, respectively. If the alternative hypothesis is true, which corresponds to an examinee having item preknowledge, one or both of Λ_S and Λ_T will have a large positive value and, consequently, Λ_ST will be large and positive.
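The chi-bar-square p-value and critical values quoted above are easy to compute because the χ² tail probabilities for one and two degrees of freedom have closed forms. A minimal Python sketch (the function names are ours, not from the original paper):

```python
import math

def chi_bar_sq_pvalue(lam):
    """p-value of Lambda_ST under its chi-bar-square null distribution:
    p = (1/2) P(chi2_1 > lam) + (1/4) P(chi2_2 > lam) for lam > 0."""
    if lam <= 0:
        return 1.0
    sf1 = math.erfc(math.sqrt(lam / 2.0))  # P(chi2_1 > lam)
    sf2 = math.exp(-lam / 2.0)             # P(chi2_2 > lam)
    return 0.5 * sf1 + 0.25 * sf2

def chi_bar_sq_critical(alpha, lo=0.0, hi=50.0):
    """Critical value of Lambda_ST at significance level alpha, by bisection."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi_bar_sq_pvalue(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For example, `chi_bar_sq_critical(0.001)` returns approximately 11.76, matching the critical value quoted above.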

Real Data Example
Item scores and response times on one form of a nonadaptive information technology certification examination were available for 1,992 examinees. The examinees are sorted in chronological order, so that the first examinee in the data set took the test before any other examinee. The data set was discussed in Eckerly (2020) and Eckerly, Smith, and Lee (2018). The examination comprises 60 multiple-choice, dichotomously scored items. All 60 items, along with their answer keys, were found on an internet website, but the answer keys provided on the website were correct for only 24 items and incorrect for the other 36. Eckerly et al. (2018) computed the values of several test-fraud statistics, all based on item scores, for these data and found evidence of various types of test fraud, including potential collusion between groups of examinees who were conjectured to have developed their own answer keys.
The top left panel of Figure 1 shows a scatter plot of the raw score (along the X-axis) versus the total time (in minutes; Y-axis) of the examinees. The correlation coefficient between raw score and total time is -0.20, which indicates a slight tendency for more able examinees to answer faster. Data sets showing a similar pattern include one from the computerized uniform CPA examination (e.g., van der Linden, 2009).
The item parameters for the hierarchical model of van der Linden (2007) were estimated from the data set under the assumption that the 2PLM fits the item scores and the LNMRT fits the response times. The values of Λ_ST were then computed using the estimated item parameters, with the set of 24 items with correct answer keys treated as the compromised items (the set C). The residuals suggested by van der Linden and Guo (2008) were also computed. The appendix includes R code for fitting the hierarchical model, computing Λ_ST, and computing the residuals suggested by van der Linden and Guo (2008). This data set presents a somewhat unique case of preknowledge that differs from that in other data sets: whereas some of the items in the data sets analyzed by Sinharay and Johnson (2020) were compromised and some were not, all the items in this data set were compromised, albeit some with incorrect answer keys. Examinees who trust the answer keys and do not spend any effort themselves are expected to perform much better on the items with correct answer keys than on those with incorrect answer keys. It is also possible that some examinees noticed that some answer keys were incorrect, spent some time on those items, and yet answered them incorrectly, causing them to perform worse and slower on the items with incorrect answer keys. While analyzing the same data set, Eckerly (2020) found a cluster of examinees whose responses corresponded closely to the key posted on the website, whereas examinees in the other clusters seemed to recognize that the posted key was incorrect and deviated from it to achieve a higher score.
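To illustrate the response-time part of this computation: under the LNMRT with known item parameters α_i (time discrimination) and β_i (time intensity), the log response times satisfy log t_i ~ N(β_i − τ, 1/α_i²), so the speed MLE and the statistic Λ_T have closed forms. The following Python sketch computes Λ_T for a single examinee; the function names and the item-parameter values in the usage note are hypothetical, and the appendix's R code remains the authoritative implementation.

```python
import math

def tau_mle(y, alpha, beta):
    """MLE of speed tau under the lognormal model log t_i ~ N(beta_i - tau, 1/alpha_i^2):
    a precision-weighted average of beta_i - y_i, where y_i = log t_i."""
    num = sum(a * a * (b - yi) for yi, a, b in zip(y, alpha, beta))
    den = sum(a * a for a in alpha)
    return num / den

def loglik(y, alpha, beta, tau):
    """Log-likelihood of the log response times y given speed tau."""
    return sum(
        math.log(a) - 0.5 * math.log(2 * math.pi)
        - 0.5 * (a * (yi - (b - tau))) ** 2
        for yi, a, b in zip(y, alpha, beta)
    )

def lambda_T(y, alpha, beta, compromised):
    """Constrained LRT statistic Lambda_T for 'faster on the compromised items'."""
    C = [i for i in range(len(y)) if i in compromised]
    Cbar = [i for i in range(len(y)) if i not in compromised]
    sub = lambda idx, v: [v[i] for i in idx]
    tC = tau_mle(sub(C, y), sub(C, alpha), sub(C, beta))
    tCbar = tau_mle(sub(Cbar, y), sub(Cbar, alpha), sub(Cbar, beta))
    if tC < tCbar:  # indicator I(tau_C >= tau_Cbar) is 0: statistic is 0
        return 0.0
    t_all = tau_mle(y, alpha, beta)
    return 2.0 * (loglik(sub(C, y), sub(C, alpha), sub(C, beta), tC)
                  + loglik(sub(Cbar, y), sub(Cbar, alpha), sub(Cbar, beta), tCbar)
                  - loglik(y, alpha, beta, t_all))
```

For an examinee whose log times equal β_i − 0.8 on the compromised items but only β_i − 0.2 elsewhere, τ̂_C = 0.8 > τ̂_C̄ = 0.2 and Λ_T is positive; if the examinee's speed is identical on both sets, Λ_T is zero.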
According to Eckerly (2020), the testing vendors had reason to believe that several response patterns among Examinees 601-1992 were contaminated by preknowledge of the examination items. Figure 1 shows plots of the raw score (top right panel), total time in minutes (bottom left panel), and Λ_ST (bottom right panel) for all the examinees in the data set; the examinee number is shown along the X-axis in each of these panels. Two of these panels include solid lines showing the lowess-smoothed mean (e.g., Cleveland, 1981) of the raw score and total time, respectively; these lines were created using the R function loess (e.g., R Core Team, 2019). The figure shows a tendency of the raw score to increase and the total time to decrease with increasing examinee number, which agrees with the assertion of Eckerly (2020) regarding possible preknowledge among Examinees 601-1992. The bottom right panel includes a dashed horizontal line showing the critical value (11.76) of Λ_ST at a significance level of 0.001. A conservative level of 0.001 is used here because several experts (e.g., Wollack et al., 2015) recommended conservative hypothesis testing in the detection of test fraud. The bottom right panel indicates that Λ_ST is significant at level 0.001 for only two of the first 600 examinees (0.3%) but for 19 of the last 1,392 examinees (1.4%, which is considerably larger than the nominal level of 0.1%), and thus points to potential fraud by several of the latter examinees.
There exist approaches for detection of preknowledge based only on item scores (e.g., Sinharay, 2017) and based only on response times (e.g., Sinharay, 2020); both of these types of approaches were found to flag fewer examinees than Λ_ST. Figure 2 compares the response times of three examinees to the average response times over the whole sample: two examinees (referred to as Examinees 1 and 2) for whom Λ_ST was very large and significant at the 0.001 level, and one additional randomly chosen examinee (Examinee 3) for whom Λ_ST was not significant. In each panel, the logarithms of the response times (in minutes) of an examinee on the test items are shown along the Y-axis and the logarithms of the average response times (in minutes) of the items over the whole sample are shown along the X-axis. A plus sign and a minus sign respectively correspond to items that the examinee answered correctly and incorrectly. The black plus or minus signs correspond to the items with correct answer keys (on the website), while the gray plus or minus signs correspond to the items with incorrect answer keys. A diagonal line is provided in each panel for convenience. The logarithms are plotted to prevent outliers from leaving too much vacant space in the plots. Examinee 1 correctly answered all items in C (but incorrectly answered several items in C̄), as is clear from the black symbols all being plus signs, and answered the former type of items faster than the latter type. Examinee 2 correctly answered all but four items in C but incorrectly answered all but two items in C̄; this examinee did not perform faster on the items in C. Thus, the item scores and response times of Examinee 2 look like those of examinees who trusted the keys on the website and did not try to solve the items themselves (which is why almost all gray symbols are minuses).
Examinee 3 did not perform any better or faster on the items in C; most of the points (both black and gray) for the examinee fall below the diagonal line, which indicates that the examinee is faster than the other examinees on average.
The two panels of Figure 3 show the standardized residuals e_i of the response times for Examinees 1 and 2, with the items sorted so that the first 24 items (to the left of the vertical dashed line) are those with correct answer keys (C) and the last 36 are those with incorrect answer keys (C̄). Positive e_i's are indicated by a bar above the 0-line and negative e_i's by a bar below the 0-line. A bar is gray if the corresponding answer is incorrect and black if the corresponding answer is correct. Horizontal (dotted) lines are shown at 2 and -2 for convenience in evaluating whether a specific e_i is statistically significant. For Examinee 1, all answers in C are correct (hence all the bars to the left of the vertical dashed line are black) and, among the items in C, all but six residuals are negative (one of them significantly so). The pattern is different for the items in C̄ for this examinee: there are nine incorrect answers and a mix of positive and negative residuals (eight of which are statistically significant). It is no surprise, then, that the value of Λ_ST was quite large (17.4) for this examinee. Only one residual is statistically significant for Examinee 2 (so the response times are not aberrant for this examinee), who performed much better on the items in C (only four incorrect answers) than on those in C̄ (only two correct answers). Figure 3 shows how graphical plots of standardized residuals can provide useful information in addition to other test statistics.

Conclusions and Recommendations
This paper begins with a brief review of the five types of statistical/psychometric approaches (identified by Wollack & Schoenig, 2018) to detect test fraud. More detailed reviews of the recent approaches for detecting test fraud using both item scores and response times are then included; these approaches are all based on latent variable models (e.g., Bartholomew et al., 2011). The approaches increase the chance of detecting test fraud by utilizing response times in addition to item scores, as was demonstrated by, for example, Sinharay and Johnson (2020). As more information such as keystroke logs (e.g., Leijten & Van Waes, 2013) and eye-tracking data become available for computerized tests, interest is bound to grow in more approaches based on latent variable models to detect test fraud. While the existence of the abovementioned approaches is encouraging, existing approaches have several limitations and, consequently, several questions remain, mainly on how these approaches (and approaches for detection of test fraud in general) should be used in practice. For example, there are no clear guidelines on how many approaches, and which ones, should be used for a specific data set. Wollack and Cizek (2017, p. 397) recommended the collection of evidence of cheating from multiple distinctly different indices. They also recommended controlling for false positives; in the context of test fraud, several statistical tests are typically performed on many examinees, which could lead to tens of thousands of statistical tests for a data set as large as the real data set discussed earlier. One way to limit the number of false positives is to choose a critical value that adjusts for multiple comparisons by controlling the family-wise error rate (using, for example, a Bonferroni correction) or the false discovery rate (using the procedure of Benjamini & Hochberg, 1995).
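As a concrete illustration of the false-discovery-rate option just mentioned, a minimal Python sketch of the Benjamini-Hochberg step-up procedure (the p-values in the usage note are hypothetical):

```python
def benjamini_hochberg(pvalues, q):
    """Return the indices of hypotheses rejected by the Benjamini-Hochberg
    step-up procedure at false discovery rate q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:  # step-up comparison p_(k) <= k*q/m
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values
    return sorted(order[:k_max])
```

For example, with p-values [0.001, 0.009, 0.02, 0.04, 0.2] and q = 0.05, the four smallest p-values are rejected: the threshold k·q/m is 0.01, 0.02, 0.03, 0.04, and 0.05 for ranks 1-5, and only the p-value 0.2 exceeds its threshold.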
Experts such as Buss and Novick (1980), Holland (1996), and Hanson, Harris, and Brennan (1987) recommended that statistical methods of detection should generally not form the sole basis for a judgment that an examinee cheated (or that an examinee's scores and/or response times are sufficiently questionable to justify non-reporting) in the absence of corroboration from other types of evidence.
There are several topics that could benefit from further research. First, more applications of the abovementioned approaches to real data sets, especially those that involve actual test fraud, would be helpful. Second, there is a lack of research on the use of keystroke logs (e.g., Leijten & Van Waes, 2013) and eye-tracking data, possibly in addition to item scores and response times, to detect test fraud. Third, there is a severe lack of approaches utilizing item scores and response times for detecting test fraud at an aggregate level (that is, at the level of the classes or schools that the examinees belong to), so more research in this area would be useful. Fourth, while the hierarchical model of van der Linden (2007) was used in most of the approaches discussed in this paper, research on the use of other models for item scores and response times (for example, those of Maris & van der Maas, 2012, and van der Maas et al., 2011) to detect test fraud would be useful. Fifth, the item parameters are assumed known in the application of the approaches discussed here; it would be worthwhile to explore approaches, possibly Bayesian, that account for the uncertainty in the item parameters. Finally, more research examining the consequences of misfit of the latent-variable model on the properties of the abovementioned approaches would be helpful.