Sequential change-point detection in a multinomial logistic regression model

Abstract Change-point detection in categorical time series has recently gained attention as statistical models incorporating change-points are common in practice, especially in the area of biomedicine. In this article, we propose a sequential change-point detection procedure based on the partial likelihood score process for the detection of changes in the coefficients of a multinomial logistic regression model. The asymptotic results are presented under both the null of no change and the alternative of changes in coefficients. We carry out a Monte Carlo experiment to evaluate the empirical size of the proposed procedure as well as its average run length, and we illustrate the method using data on a DNA sequence. The Monte Carlo experiments and the real data analysis demonstrate the effectiveness of the proposed procedure.


Introduction
Structural change is of central importance in many fields of research and data analysis. In the past few decades, studies of change-point detection have attracted a great deal of attention in fields such as statistics, engineering, economics, climatology and bioscience. For surveys, we refer to the studies by Csörgö and Horváth [1], Perron [2], Gombay [3], Chen and Tian [4], Na et al. [5], Zou et al. [6], Ross [7], Robbins et al. [8], Li et al. [9], Cao et al. [10] and Chen [11]. Most of these studies deal with the classical linear regression model, either with independent and identically distributed errors or assuming some type of dependence. However, research on change-point detection in generalized linear models has received increased attention in recent years. Antoch et al. [12] presented some results on testing for changes in generalized linear models based on an overall maximum-type test statistic. Zhou and Liang [13] proposed a procedure for estimating the change-point along with the other regression coefficients under the generalized linear model framework.
Categorical time series data are frequently encountered in biomedicine, social science and genetics. As generalized linear regression models for categorical time series allow for parsimonious modeling and incorporation of random time-dependent covariates, Fokianos and Kedem [14] suggested the generalized linear model for categorical time series modeling. For change-point detection in categorical time series, there are retrospective procedures which deal with the detection of a structural change within an observed data set of fixed size, whereas sequential procedures check the stability hypothesis each time a new observation is available (see Horváth et al. [15]). Fokianos et al. [16] provided a new statistical procedure based on the partial likelihood score process for the retrospective detection of change in binary time series. Gombay et al. [17] discussed the retrospective detection of change in categorical time series using the same method. Hudecová [18] studied the change-point problem within the framework of autoregressive models for binary time series. Wang et al. [19] proposed a novel change-point detection procedure motivated by high-dimensional homogeneity tests to estimate the locations of multiple change-points in multinomial data with a large number of categories.
The studies above concern retrospective change-point detection. In sequential change-point detection (also called on-line monitoring, sequential testing or a priori testing), the data are not observed at once but arrive one by one. In this setting, Xia et al. [20] introduced two procedures to sequentially detect structural change in generalized linear models, assuming independence. Höhle [21] proposed a CUSUM control chart method based on the generalized likelihood ratio statistic for sequential change-point detection in regression models for categorical time series.
The goal of this article is to propose a sequential change-point detection procedure based on the partial likelihood score process for the coefficients of a multinomial logistic regression model for categorical time series. Score tests for the sequential detection of changes in time series models have been studied by many authors; related work can be found in the studies by Gombay and Serban [22] and Gombay et al. [23]. Compared with the CUSUM detection method, the proposed method does not require the value of the parameter under the alternative to be estimated, which is an advantage because parameter estimation in the present model is complex. In addition, the probability of false alarms of the proposed score test is lower.
In this article, we propose a sequential test statistic based on the partial likelihood score process for the coefficients of a multinomial logistic regression model. The asymptotic distribution of the test statistic is derived under the null hypothesis, and consistency is proven under the alternative hypothesis. We perform Monte Carlo simulations to explore the finite sample performance of the proposed test statistic in terms of empirical size and average run length. The results show that the alarms raised by the proposed test are reliable signals of genuine change, at the price of an increased average run length in detection. A real data example is also provided to examine the efficiency of the proposed procedure.
The rest of the article is organized as follows. Section 2 describes the multinomial logistic regression model. Section 3 contains the proposed procedure, the necessary assumptions and the asymptotic results. In Section 4, Monte Carlo simulations and a real data analysis are conducted to investigate the performance of the proposed procedure. Section 5 concludes the article. All proofs of the theorems are gathered in the Appendix.

Multinomial logistic regression model
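For every $t$, the $\sigma$-field $\mathcal{F}_{t-1}$ is generated by previous observations and covariates, that is,
$$\mathcal{F}_{t-1} = \sigma\{Y_{t-1}, Y_{t-2}, \ldots, Z_{t-1}, Z_{t-2}, \ldots\}.$$
Following the definition of generalized linear models, the vector of conditional probabilities $\pi_t(\beta) = (\pi_{t1}(\beta), \ldots, \pi_{tq}(\beta))^{\top}$ is linked to the covariate process through the following equation:
$$\pi_t(\beta) = h(Z_{t-1}^{\top}\beta),$$
with $Z_{t-1}$ a $p \times q$ matrix of covariates and $\beta$ a $p$-dimensional vector of parameters. The inverse link function $h$ is defined on $\mathbb{R}^q$ and takes values in $\mathbb{R}^q$ as well.
In this article, we investigate the multinomial logistic regression model, which is frequently employed in the analysis of nominal time series (Agresti [24], Section 9.2):
$$\pi_{tj}(\beta) = \frac{\exp(\beta_j^{\top} z_{t-1})}{1 + \sum_{l=1}^{q} \exp(\beta_l^{\top} z_{t-1})}, \quad j = 1, \ldots, q, \tag{1}$$
where $\beta = (\beta_1^{\top}, \ldots, \beta_q^{\top})^{\top}$ and $z_{t-1}$ is the corresponding $d$-dimensional vector of stochastic time-dependent covariates independent of $j$. Obviously,
$$\pi_{t,q+1}(\beta) = 1 - \sum_{j=1}^{q} \pi_{tj}(\beta) = \frac{1}{1 + \sum_{l=1}^{q} \exp(\beta_l^{\top} z_{t-1})}.$$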
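As a minimal sketch of how the category probabilities of a multinomial logistic model can be evaluated, the following assumes the usual baseline-category form in which the last category, $q + 1$, is the reference. The function name `multinomial_logit_probs`, the layout of `beta` as a list of $q$ coefficient vectors and the numerical values are illustrative assumptions, not part of the article.

```python
import math

def multinomial_logit_probs(beta, z):
    """Return (pi_1, ..., pi_q, pi_{q+1}) for one covariate vector z.

    beta: list of q coefficient vectors of length d (hypothetical layout).
    z:    covariate vector of length d.
    The last category is assumed to be the reference category.
    """
    # Linear predictors eta_j = beta_j' z for the q non-reference categories.
    etas = [sum(b * x for b, x in zip(beta_j, z)) for beta_j in beta]
    # Shift by the largest predictor (the reference has predictor 0)
    # so that the exponentials cannot overflow.
    shift = max(0.0, max(etas))
    expo = [math.exp(e - shift) for e in etas]
    ref = math.exp(-shift)
    denom = ref + sum(expo)
    return [e / denom for e in expo] + [ref / denom]

# Illustrative values only: q = 2 categories plus a reference, d = 2 covariates.
probs = multinomial_logit_probs([[0.3, 0.5], [-0.2, 0.1]], [1.0, 2.0])
```

The returned probabilities always sum to one, and a category whose linear predictor is zero receives the same probability as the reference category.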

Sequential change-point detection
In this section, we propose a sequential change-point detection procedure for the coefficients of the multinomial logistic regression model. Following the general paradigm of Chu et al. [25], sequential change-point detection uses an initial time period of length $m$ to estimate the model, and the procedure is designed so that the probability of detection approaches $\alpha$ under the null of no change and one under the alternative of a change in the parameters after the initial time period. We first assume that there is no change in the regression parameter during the first $m$ observations, i.e.,
$$\beta_t = \beta_0, \quad t = 1, \ldots, m.$$
We are interested in testing the hypothesis
$$H_0: \beta_t = \beta_0 \quad \text{for all } t = m+1, m+2, \ldots, \kappa m,$$
against the alternative
$$H_A: \beta_t = \beta_0 \ \text{ for } t = m+1, \ldots, m+k^*-1, \quad \beta_t \neq \beta_0 \ \text{ for } t = m+k^*, \ldots, \kappa m,$$
where $\beta_0$ is the true, unknown value of the parameter vector $\beta$, $k^*$ is the unknown time at which the change occurs in some of the regression coefficients, and $\kappa$ is some fixed positive integer larger than 1. The sequential change-point detection procedure is based on a detecting statistic $\Gamma(m, k)$ and a boundary $g(m, k)$: we stop and reject $H_0$ at the first $k$ for which the detector exceeds the boundary, that is, at
$$\tau(m) = \inf\{k \geq 1 : \Gamma(m, k) > g(m, k)\},$$
with monitoring truncated at time $\kappa m$; this is also called the "closed-end" procedure [26]. The detector and boundary must be chosen such that
$$\lim_{m \to \infty} P(\tau(m) < \infty \mid H_0) = \alpha \tag{2}$$
and
$$\lim_{m \to \infty} P(\tau(m) < \infty \mid H_A) = 1. \tag{3}$$
Condition (2) ensures that the probability of a false alarm is asymptotically bounded by $\alpha$, while condition (3) means that a change-point is detected with probability approaching 1.
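The stopping rule of a closed-end monitoring scheme can be sketched as a simple loop. The sketch below assumes that the detector and the boundary are supplied as functions of $(m, k)$ and that monitoring covers the observations $m + 1, \ldots, \kappa m$; the function name `monitor` and the toy detector are hypothetical and stand in for the score-based statistic of the article.

```python
def monitor(detector, boundary, m, kappa):
    """Return the first k at which the detector exceeds the boundary.

    detector, boundary: callables taking (m, k); placeholders for the
    detecting statistic and its boundary function.
    Returns None if no alarm is raised before the truncation point,
    here assumed to be after (kappa - 1) * m monitored observations.
    """
    horizon = (kappa - 1) * m
    for k in range(1, horizon + 1):
        if detector(m, k) > boundary(m, k):
            return k  # stop and reject the null of no change
    return None  # monitoring ended without a signal

# Toy example: a detector that grows linearly in k and a constant boundary.
stop = monitor(lambda m, k: k / m, lambda m, k: 0.5, m=10, kappa=3)
no_stop = monitor(lambda m, k: k / m, lambda m, k: 5.0, m=10, kappa=3)
```

Returning `None` rather than a sentinel time keeps runs that never signal distinguishable from runs that signal at the last monitored observation.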
Next, we state the assumptions on the covariate vector $z_t$ and the parameter vector $\beta$. We denote the $i$th component of a vector $X$ by $X^{(i)}$.
Assumption 1. The process $\{z_t\}$ is ergodic and stationary in the sense that for all $t$ and all $l \geq 0$, 1, 2, , , and
The estimation of the parameter vector $\beta$ in model (1) follows the partial likelihood methodology described in Kedem and Fokianos [27].
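The partial likelihood function based on the first $k$ observations is
$$PL_k(\beta) = \prod_{t=1}^{k} \prod_{j=1}^{q+1} \pi_{tj}(\beta)^{Y_{tj}},$$
so that the partial log-likelihood function is given by
$$l_k(\beta) = \sum_{t=1}^{k} \sum_{j=1}^{q+1} Y_{tj} \log \pi_{tj}(\beta),$$
where $Y_{tj} = 1$ if the $j$th category is observed at time $t$ and $Y_{tj} = 0$ otherwise. With $Y_t = (Y_{t1}, \ldots, Y_{tq})^{\top}$, the partial score process is defined through the partial sum
$$S_k(\beta) = \sum_{t=1}^{k} \left(Y_t - \pi_t(\beta)\right) \otimes z_{t-1}. \tag{4}$$
The maximum partial likelihood estimator, denoted by $\hat{\beta}_m$, is a solution of the score equations $S_m(\beta) = 0$, and its asymptotic properties have been studied by Kedem and Fokianos [27].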
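To make the partial score concrete, the sketch below accumulates the residuals $Y_{tj} - \pi_{tj}(\beta)$ multiplied by the covariates, which is the gradient of the multinomial logit log-likelihood under the baseline-category form. It assumes that each observation is stored as an indicator vector of length $q$ (all zeros for the reference category) paired with its covariate vector; the function names and the toy data are illustrative, not taken from the article.

```python
import math

def category_probs(beta, z):
    """Probabilities of the q non-reference categories (baseline-category logit)."""
    etas = [sum(b * x for b, x in zip(beta_j, z)) for beta_j in beta]
    denom = 1.0 + sum(math.exp(e) for e in etas)
    return [math.exp(e) / denom for e in etas]

def partial_score(beta, ys, zs):
    """Return the partial score as a flat list of length q * d.

    ys[t]: indicator vector of length q (hypothetical storage layout).
    zs[t]: covariate vector of length d paired with ys[t].
    """
    q, d = len(beta), len(beta[0])
    score = [0.0] * (q * d)
    for y, z in zip(ys, zs):
        pi = category_probs(beta, z)
        for j in range(q):
            resid = y[j] - pi[j]  # observed indicator minus fitted probability
            for i in range(d):
                score[j * d + i] += resid * z[i]
    return score

# Toy data: three categories observed once each, intercept-only covariate.
beta0 = [[0.0], [0.0]]
ys = [[1, 0], [0, 1], [0, 0]]
zs = [[1.0], [1.0], [1.0]]
score_at_zero = partial_score(beta0, ys, zs)
```

In this toy example every category is observed equally often, so the score vanishes at $\beta = 0$, which is therefore a solution of the score equations for these data.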
The following result shows that the score vector behaves, approximately, as a vector Wiener process.
Lemma 3.1. Under Assumptions 1-3, there exists a Wiener process $(W(t))_{t \geq 0}$ with covariance matrix $T$ such that, if $\beta$ is the true vector of coefficients, then the score vector admits the following approximation:
is defined by (4), and $T$ is the limit of the sample information matrix
Let $\hat{\beta}_m$ be the maximum partial likelihood estimator of the true value of the parameter $\beta$.
where $I = I_{p \times p}$ is the identity matrix, $p = qd$, and $T$ is given by (5).
Next, we consider the asymptotics under the null hypothesis and alternative hypothesis.
In some applications this is not the case. It is possible that some parameters are of special concern, and the others are nuisance parameters. In this simulation, we focused on $\beta_{10}$ and $\beta_{20}$, which means $p = 2$. Table 1 reports the simulated empirical sizes of the proposed procedure based on the detecting statistic $\Gamma(m, k)$ at the 5% significance level. The empirical sizes are close to the nominal 5% level in most cases.
For comparison, we also consider the CUSUM detection method (Höhle [21]). The threshold of the CUSUM detection method can be obtained by simulating the run length: with the in-control average run length set to 370, the threshold is selected as 1.7. Monte Carlo simulations were performed in R, and all simulations were based on 2,000 replications. Tables 2-5 show the conditional power, the first and third quartiles $(Q_1, Q_3)$, the median $Q_2$, the mean (ARL) and the maximum of the distribution of the run lengths, and the probability of false alarms $P_\tau$. The conditional power is defined as the proportion of signals occurring after $m + k^*$ but before the truncation point $\kappa m$. The probability of false alarms, denoted by $P_\tau$, is obtained as the proportion of signals occurring before the change point $m + k^*$. Let $k = m + k^*$; then $\overline{\mathrm{ARL}}_{\geq k}$ is the average run length for runs stopping on or after $k$.
From Tables 2-5, it can be seen that the conditional power of the score test increases with the sample size and decreases as the change-point occurs later. This agrees with the general relation between power and change-point location in hypothesis testing: a change-point that occurs after a longer period of monitoring tends to be detected with lower power. For the boundaries $g(s) \equiv c$ and $g(s) = cs$, when $m = 100$, the conditional power of the former is higher than that of the latter, and when $m = 200$, the conditional power for both boundaries is almost 1.
Next, we investigate the numerical characteristics of the run lengths. $Q_1$, $Q_2$, ARL, $Q_3$ and the maximum of the distribution of run lengths all increase with the sample size. With the boundary $g(s) \equiv c$, these characteristics are only slightly influenced by the change-point location; for instance, for $m = 100$ and $m = 200$, the ARLs are around 265 and 357, respectively. With the boundary $g(s) = cs$, however, they increase as the change-point occurs later; for instance, when $m = 100$ and $k^* = 0.01N, 0.3N$, the ARLs are 258.65 and 330.75, respectively. In addition, compared with the boundary $g(s) \equiv c$, the ARLs of the test statistic with $g(s) = cs$ are shorter when the change-point is at the beginning of the series being monitored, but longer when the change-point occurs late. Finally, the probability of false alarms $P_\tau$ is almost zero in most cases for both boundaries.
Table 6 reports the conditional power, $Q_1$, $Q_2$, ARL, $Q_3$, the maximum of the distribution of run lengths and $P_\tau$ for the CUSUM detection method. It can be seen from Table 6 that the conditional power and the ARLs decrease as the change-point occurs later. When the change-point is at the beginning of the series being monitored, the conditional power is high, the ARLs are short, and the probability of false alarms is low. However, when the change-point occurs after a period of monitoring, the conditional power decreases rapidly, and the probability of false alarms grows rapidly. Because monitoring may stop before the change-point $m + k^*$, the ARLs are negative when the change-point occurs late. In comparison with the proposed score test, the ARLs are shorter, but the conditional power is lower and the probability of false alarms is higher.
To summarize, the score test has good control over its type I error (empirical size), at the price of a longer ARL. The CUSUM detection method has a shorter ARL, but its probability of false alarms grows rapidly when the change-point occurs late.
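The run-length summaries discussed in this section can be computed from the simulated stopping times of each replication. The sketch below is one possible reading of the definitions: it assumes that the conditional power is taken over the runs that did not stop before the change, and that run lengths are measured from the change time $m + k^*$. Both conventions, the function name `run_length_summary` and the toy stopping times are assumptions made for illustration.

```python
import statistics

def run_length_summary(stops, change_time):
    """Summarize simulated stopping times.

    stops: one entry per replication, the alarm time or None if the
           procedure reached the truncation point without a signal.
    change_time: the time m + k* at which the change occurs.
    """
    n = len(stops)
    false_alarms = [s for s in stops if s is not None and s < change_time]
    detections = [s for s in stops if s is not None and s >= change_time]
    # Assumed convention: power is conditional on not having stopped early.
    at_risk = n - len(false_alarms)
    # Assumed convention: run length is the delay after the change time.
    delays = [s - change_time for s in detections]
    nan = float("nan")
    return {
        "false_alarm_prob": len(false_alarms) / n,
        "conditional_power": len(detections) / at_risk if at_risk else nan,
        "arl": statistics.mean(delays) if delays else nan,
        "median": statistics.median(delays) if delays else nan,
        "max": max(delays) if delays else nan,
    }

# Toy stopping times from five hypothetical replications.
summary = run_length_summary([50, 120, 130, 140, None], change_time=100)
```

Under a different convention, for example measuring run lengths from the start of monitoring, only the `delays` line would need to change.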

Application to real data
In this section, we apply the proposed procedure to a DNA sequence obtained from the gene BNRF1 of the Epstein-Barr virus (Shumway and Stoffer [28]), which contains $n = 1{,}197$ observations. Let A, G, C and T denote the adenine, guanine, cytosine and thymine nucleotides, respectively; a sequence of the letters A, G, C and T can then be viewed as a nominal categorical time series. Fokianos and Kedem (2002) [27] argued in detail that the most appropriate model for these data is the multinomial logistic model with covariate vector where $\beta$ has 21 components. We take the first 400 observations as the historical sample and then monitor the remaining observations one by one (401-1,197). The proposed score test statistic was computed and is displayed in Figure 1. The results show that there exists a structural change, which is signaled at observation 738 for $\beta_{34}$.
To evaluate the validity of the above result, we analyzed the data with the retrospective change-point detection procedure of Gombay et al. [17], which indicates a structural change at 701. The sequential procedure therefore signals the change with a certain delay (738 versus 701), but the proposed score test is effective.
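The data preparation described in this section can be sketched as follows: the letters are coded as indicator vectors and the series is split into a historical sample and a monitoring sample. The category order, the choice of T as the reference category and the function names are assumptions made for illustration; the article does not state them here.

```python
# Assumed category order; the last letter is treated as the reference.
CATEGORIES = ["A", "G", "C", "T"]

def encode(sequence):
    """Map each nucleotide to an indicator vector of length q = 3.

    The reference category is coded as the all-zero vector.
    """
    q = len(CATEGORIES) - 1
    encoded = []
    for letter in sequence:
        idx = CATEGORIES.index(letter)  # raises ValueError for unknown letters
        encoded.append([1 if j == idx else 0 for j in range(q)])
    return encoded

def split_history(series, m):
    """Split a series into the first m observations and the monitored rest."""
    return series[:m], series[m:]

codes = encode("AGCT")
# Placeholder series with the same length as the BNRF1 sequence.
history, monitored = split_history(list(range(1, 1198)), m=400)
```

With $n = 1{,}197$ observations and a historical sample of 400, the monitoring sample contains the 797 observations numbered 401 to 1,197.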

Concluding remarks
In this article, we propose a sequential change-point detection procedure based on the partial likelihood score process to detect changes in the coefficients of a multinomial logistic regression model. The asymptotic properties of the score test statistic are derived under both the null of no change and the alternative of changes in coefficients. To evaluate the finite sample performance of the proposed score test statistic, we conduct a Monte Carlo simulation. The simulation results show that the score test can signal a genuine change reliably, at the price of an increased average run length in detection.
In the Monte Carlo simulation, we find that the proposed score test is suitable when the amount of structural change is relatively large, and that it is not sensitive to small deviations. For this reason, in our simulation we let $\beta_{10}$ change from 0.3 to 1.5 and $\beta_{20}$ change from −0.2 to −1.5 under the alternative hypothesis. When the amount of structural change is small, the conditional power is lower and the ARL is longer, so establishing a more effective test statistic is a subject for future research.

Acknowledgement(s):
The authors are grateful to the editors and referees for useful suggestions and helpful comments for improving the article. This work was supported by the National Nature Science