Disentangling the impact of mean reversion in estimating policy response with dynamic panels

Abstract: This article accounts for multivariate dependence of the variable of policy interest in dynamic panel data models by disentangling the two sources of intertemporal dependence: one from the effect of the policy variable and the other from mean reversion. In a situation where intensity of the policy varies over time, we estimate the unconditional mean in the autoregressive process as a function of the agent's characteristics and the policy intensity. Comparison of the fitted values of the unconditional mean under different values of the policy intensity enables identification of the policy effect cleared of mean reversion. The approach is relevant for measuring the effect of reforms which use an intertemporal incentive where intensity of the reform varies over time. The empirical part of the article assesses the effect of hospital financing reform based on incentive contracts, related to the observed quality of services at Medicare hospitals in 2013–2019. We find a direct association between prior quality and quality improvement owing to the reform. Our result reassesses a stylized fact in the literature, which asserts that a pay-for-performance incentive leads to greater improvements at hospitals with lower baseline quality.


Introduction
The phenomenon of regression toward the mean (mean reversion) is observed in longitudinal observations of a variable that is susceptible to random variation. In this case, exceptionally low or high values of the variable in the initial measurement tend to be closer to the center of the distribution in subsequent measurements [24]. In short, mean reversion is an inherent part of a stationary process and implies the return of the process to its mean value [25,31].¹ Historically, the term "mean reversion" is associated with the seminal works of Galton, who discovered an inverse relationship between the heights of parents and children [30] and hence framed the term "regression" as the tendency of the dependent variable to revert to the mean value. Recent examples of the analysis of processes which exhibit mean reversion in various fields of economics include the current account of countries [81] and their productivity [29], profitability of banks [48], housing prices [31], tax avoidance by companies [3], blood pressure and cholesterol levels of patients [5], and birthweight of children in successive pregnancies of the same mother [79].
Mean reversion contaminates judgment about the time profile of the dependent variable in groupwise estimations. If the value of the dependent variable for a certain observation is lower than average in period t, it is likely to be higher in period t + 1 than in period t. Similarly, observations with high values in period t tend to be followed by lower values in period t + 1. Accordingly, mean reversion leads to an increase in the expected value of the dependent variable in the group of observations belonging to the lower percentiles of y, and to a decrease in the expected value in the higher percentiles of y. Therefore, the impact of mean reversion needs to be excluded in econometric analysis which evaluates the longitudinal impact of policy interventions on groups of economic agents.
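This groupwise pattern can be made concrete with a short simulation. The sketch below uses a simple AR(1) process with hypothetical parameters (not estimates from this article): observations in the bottom quartile of period t move up toward the mean in period t + 1, while the top quartile moves down.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative AR(1): y_t = mu + lam*(y_{t-1} - mu) + eps_t (hypothetical values)
mu, lam, sigma, n = 50.0, 0.6, 10.0, 100_000

# Draw period-t values from the stationary distribution, then simulate period t+1
y_prev = mu + sigma / np.sqrt(1 - lam**2) * rng.standard_normal(n)
y_next = mu + lam * (y_prev - mu) + sigma * rng.standard_normal(n)

low = y_prev < np.quantile(y_prev, 0.25)   # bottom quartile in period t
high = y_prev > np.quantile(y_prev, 0.75)  # top quartile in period t

# Regression toward the mean: group means move toward mu in period t+1
print(y_next[low].mean() - y_prev[low].mean())   # positive for the low group
print(y_next[high].mean() - y_prev[high].mean()) # negative for the high group
```

The expected change for each group equals (λ − 1) times the group's deviation from the mean, which is why the sign flips between low and high percentiles.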
The purpose of this article is to model multivariate dependence of the variable of policy interest by disentangling the two sources of intertemporal dependence: one from the effect of the policy of interest per se and the other from mean reversion. Specifically, we show a way of separating the effect of mean reversion from the policy effect when evaluating the impact of an incentive scheme with intertemporal stimuli and intertemporal variation of the parameter of reform intensity.
Although mean reversion is inherent to any stationary process, it is most often noted in the analysis of dynamic panels. The dynamic panel data model is a generalization of the panel data fixed effect regression when the dynamic structure of the process needs to be introduced. In our article we use the example of Medicare's incentive contract applied to the observed quality of services, which has to be described as an autoregressive process. Hence, in evaluating the effect of this incentive scheme on hospital quality, we follow a handful of articles which deal with mean reversion in dynamic panels [25,31,48,81].
We focus on the pay-for-performance mechanism: an innovative method of remuneration which originally emerged in corporate finance and managerial economics, and has since been widely used in the public sector (civil service, education, social work, and healthcare). In order to quantify the unobserved quality of work, the incentive scheme computes the performance level using imprecisely measured proxies for various dimensions of quality. Next, the regulator imposes an incentive contract which relates remuneration to performance, so that agents with higher performance in the current period receive higher payment for their services in future periods than agents with lower performance. The reform intensity parameter in this context is the share of the agent's income which is "at risk" under the incentive contract.
Assuming a direct association between demand for services and quality of work, higher payment to agents with high performance incentivizes agents to improve their level of quality in order to raise demand for their services. In such a setting, if the unobserved quality could be measured precisely, each agent would have sustained their fixed level of performance.
However, performance is in fact a noisy signal. First, there is imprecision in measuring performance, since it is only a proxy for true quality. Second, in the case of healthcare, the unobserved true quality of services is itself subject to random variation, due, for instance, to patient non-compliance with medical treatment [62]. So it is plausible to assume that performance contains a random error. Hence, performance may unexpectedly be valued as having improved in period t due to this random error, and then the payment in period t + 1 (which is a function of current performance) will increase. Accordingly, the incentive to improve quality in the future period becomes stronger for agents with higher performance. So the performance of these agents in period t + 1 will be on average higher than their performance in period t. The reverse argument applies in case of an unexpected lowering of performance valuation in period t.
What therefore happens is that performance of the economic agent becomes a process with serial correlation. So the evolution of the variable of policy interest when such incentives are applied can be viewed as an autoregressive process. In a situation where the policy variable changes over time, we estimate the unconditional mean in the autoregressive process as a function of the agent's characteristics and of policy intensity. Comparison of the fitted values of the unconditional mean under different values of the reform intensity enables us to identify the reform effect cleared of mean reversion. For instance, we contrast the unconditional means estimated under the values of the policy variable in two consecutive time periods. Alternatively, we compare the fitted value of the unconditional mean in period t with its counterfactual analogue: the unconditional mean at zero value of policy intensity. The setting closest to our latter approach is Medicare's reform, which reduced the base payment⁴ to each hospital by a factor α_t that equaled 0.01 in 2013. The amount of the reduction was increased annually by 0.0025 in 2014-2017 and has remained flat at 0.02 since 2017. Note that α_t is the parameter of the reform intensity, varying over time, and α = 0 would correspond to a counterfactual setting with the absence of the reform.
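The comparison of unconditional means can be sketched numerically. For an AR(1) with a policy shifter, y_t = c + λ y_{t−1} + γ α s + ε_t, the unconditional mean is μ(α) = (c + γ α s)/(1 − λ), so the policy effect cleared of mean reversion is μ(α_t) − μ(0). The coefficients below are hypothetical; only the α_t schedule comes from the text.

```python
# Sketch: unconditional mean of an AR(1) with a policy-intensity shifter.
def unconditional_mean(c: float, lam: float, gamma: float, alpha: float, s: float) -> float:
    # mu(alpha) = (c + gamma * alpha * s) / (1 - lam)
    return (c + gamma * alpha * s) / (1.0 - lam)

c, lam, gamma, s = 20.0, 0.6, 400.0, 0.4   # hypothetical coefficients; s = Medicare share
alphas = {2013: 0.0100, 2014: 0.0125, 2015: 0.0150, 2016: 0.0175, 2017: 0.0200}

for year, a in alphas.items():
    # Policy effect cleared of mean reversion: mu(alpha_t) - mu(0)
    effect = unconditional_mean(c, lam, gamma, a, s) - unconditional_mean(c, lam, gamma, 0.0, s)
    print(year, round(effect, 2))
```

With these illustrative numbers the cleared effect scales linearly with α_t, doubling as α_t rises from 0.01 to 0.02.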
The accumulated saving from the reduction in base payment is redistributed across hospitals according to an adjustment coefficient, which is computed as a linear function of the composite quality measure: c_it = κ_t m_it / 100, where i is the index of a hospital, t indicates the year, and m_it is the hospital's total performance score (TPS), 0 ≤ m_it ≤ 100. A hospital is rewarded in period t + 2 if the adjustment coefficient based on m_it is above one and is penalized otherwise. The quality incentive scheme is budget-neutral and the value of the slope κ_t is chosen to ensure budget neutrality, so that hospitals with a value of TPS above the empirical mean gained under the reform. In the first years of the reform κ_t was close to 2, so hospitals with a value of the composite quality measure above 50 were winners from the incentive scheme.
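The budget-neutrality logic can be illustrated with a small numerical sketch. It assumes, consistent with the description above, that the adjustment coefficient is proportional to TPS; budget neutrality then pins down the slope κ so that total quality payments equal total withheld base payments. All data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
m = rng.uniform(20, 90, size=500)        # hypothetical TPS values (0..100 scale)
base = rng.uniform(5.0, 50.0, size=500)  # hypothetical base payments

# Assumed linear adjustment c_i = kappa * m_i / 100; budget neutrality requires
# sum(base_i * c_i) = sum(base_i)  =>  kappa = 100 * sum(base) / sum(base * m)
kappa = 100 * base.sum() / (base * m).sum()
adj = kappa * m / 100

print(round(float((base * adj).sum() / base.sum()), 6))  # 1.0: budget neutral

# A hospital gains iff adj > 1, i.e., m > 100 / kappa
# (the payment-weighted mean TPS, close to 50 when kappa is close to 2)
threshold = 100 / kappa
```

The break-even TPS equals 100/κ, which matches the article's observation that with κ_t close to 2 the winners were hospitals scoring above 50.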
The TPS is a weighted sum of scores for measures in several domains: timely implementation of recommended medical interventions (clinical process of care), quality of healthcare as perceived by patients (patient experience of care), survival rates for AMI, heart failure and pneumonia patients and other proxies for outcome of care, healthcare-associated infections and other measures of safety of care, and spending per beneficiary as a measure of efficiency of care.⁵ A hospital's intertemporal incentive in Medicare's scheme is based on the expectation that the quality payments will continue over a long term, so the hospital's executives and physicians realize that demand is proportionate to quality and that their current policies toward quality of care will influence future reimbursement [46,73].

Autoregressive process and quality convergence
The evolution of the measured quality constitutes a process with serial correlation. If the process for the measured quality is stationary, then it may be treated as an autoregressive process AR(p), m_t = μ(θ) + ρ_1(m_{t−1} − μ(θ)) + … + ρ_p(m_{t−p} − μ(θ)) + ε_t, where μ(θ) denotes the mean value of the measured quality for a hospital of type θ. As the absolute values of the reciprocals of the roots of the characteristic equation of an AR(p) process are less than one, the maximum absolute value of these reciprocals (denoted λ) may be used as the measure of persistence of the process of measured quality [74]. Using definitions in [29], we can disentangle a permanent component in m_t, which is related to the economic impact of pay-for-performance, from a transient component (a pure dynamic effect), which may be referred to as "mean reversion" or "regression toward the mean" [30].
The reason for the phenomenon of mean reversion is the existence of the random error ε_t in the measured quality m_t. Indeed, in the absence of ε_t the process quickly converges to its mean μ(θ) and does not exhibit mean reversion because it always sits at the mean. The random error in the measured quality is largely attributed to imprecision in quality measurement: it is hard to reveal true quality using observable proxies. Another reason is random variation in true quality, which may be explained by the fact that patients do not always comply with the prescribed treatment [62]. Combined with the fact that hospitals make an intertemporal decision in respect of the quality-based reimbursement, the random error leads to the autoregressive form of measured quality m_t.
⁴ The base payment is linked to each diagnosis-related group.
⁵ The domain score is the sum of the scores for its measures. A higher score for a measure reflects a higher position of the hospital in the empirical distribution of the quality measure in a given year or a greater improvement of the quality measure relative to the baseline period. Specifically, achievement points are computed for each measure evaluating a hospital's performance relative to other hospitals in a given year, and improvement points for each measure are computed to assess change in the hospital's own performance in the given year relative to the baseline period. Then, for each measure, the higher of the two (achievement points or improvement points) is used as the hospital's score for that measure.
The autoregressive specification can be taken as equivalent to convergence of the measured quality toward the value μ(θ), and λ is associated with the speed of quality convergence. The persistence parameter λ essentially describes how quickly the effect of any unexpected shock to the value of the dependent variable fades over time. For example, consider a simple AR(1) process with 0 < λ < 1 and the conditional mean E[m_t | m_{t−1}] = μ(θ) + λ(m_{t−1} − μ(θ)). Here the expected value of current measured quality E[m_t] is closer to the mean value μ(θ) than is the value of the measured quality in the previous period, m_{t−1}. The analysis becomes more complicated for AR(p) processes with p > 1, but λ can still be used as a measure of persistence of the process.
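The role of λ as the speed of convergence can be seen in a two-line iteration of the conditional mean above (the numbers are purely illustrative): the expected gap to μ(θ) shrinks by a factor λ each period.

```python
# Sketch: with 0 < lam < 1, E[m_t] - mu = lam * (m_{t-1} - mu), so the expected
# gap to the mean decays geometrically. Hypothetical values of mu and lam.
mu, lam = 60.0, 0.7
m = 90.0                            # exceptionally high observed quality in period t
gaps = []
for _ in range(5):
    m = mu + lam * (m - mu)         # conditional expectation one period ahead
    gaps.append(round(m - mu, 3))

print(gaps)  # geometric decay: 21.0, 14.7, 10.29, ...
```

A larger λ (slower decay) corresponds to a more persistent process and hence a weaker mean-reversion effect, which is the channel exploited in Hypothesis H1b below.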
The hospital receives higher profits for improvement of performance under higher values of α than under lower values of α. This, combined with the serial correlation between performance in consecutive periods, implies a direct association between the persistence parameter λ and α. Higher values of λ imply a lower rate of convergence of quality and hence a weaker effect of mean reversion. The payment schedule makes the hospital adjustment coefficient a linear function of TPS, so each hospital has an incentive to raise the value of the observed composite quality measure. Hence, the introduction of pay-for-performance is expected to have a positive effect on the mean value of the composite quality measure. Indeed, the mean level of hospital performance improved even in the case of a continuous reward function applied to hospitals above the threshold values of quality indicators (Medicare's pilot program, Phase I) [18,34,37,52,68]. Specifically, the value of the composite performance score at Medicare's pay-for-performance hospitals was higher than in the control group of hospitals [52,78]. Moreover, sociological evidence points to the fact that hospitals participating in incentive schemes are likely to improve performance as they implement a larger number of quality-improving activities that non-incentivized hospitals do not carry out [41].
The higher the value of α, the higher may be the hospital's loss under the reform in case of insufficient value of TPS. Indeed, the empirical evidence points to larger incentives being more effective than smaller ones in such reforms [8,15,60].
Accordingly, the expected mean effects of the reform may be formulated as follows: Hypothesis H1a: The introduction of pay-for-performance and the increase of parameter α in the context of pay-for-performance lead to a positive mean effect on observed quality.
Hypothesis H1a implies that hospitals can be treated as agents which take their future payments into account. The intertemporal stimuli result in mean reversion with respect to observed quality. However, the strength of mean reversion is interrelated with parameter α as follows: Hypothesis H1b: The increase in the share of hospital funds at risk in pay-for-performance weakens the effect of convergence of the measured quality to the mean value.

Groupwise effects of the reform
We assume that the effect of Medicare's reform will be larger at hospitals with higher quality, based on findings in the health policy literature that emphasis on quality improvement in incentive schemes is greater at high-quality hospitals or among high-quality physicians in comparison with low-quality hospitals and physicians [21,37,69,77,78].
For instance, [77] conducted structural surveys at hospitals in the top two and bottom two deciles of the performance measure in Medicare's pilot program and discovered stronger involvement in quality-improving activities among top-performing hospitals. Statistically significant differences between top- and bottom-performing hospitals were observed for the numerical values assigned to the following components of quality improvement: organizational culture, multidisciplinary teams, "adequate human resources for projects to increase adherence to quality indicators" and "new activities or policies related to quality improvement" (Tables 3 and 4 on pp. 836-837).
Interviews with the leaders of California physician organizations [21] similarly discovered that physicians with high performance placed higher emphasis on the support that "the organization dedicates to addressing quality issues" than medium- and low-performing physicians (Exhibit 3, p. 521).
Moreover, papers that use policy evaluation techniques applied to assessment of the effect of the pilot pay-for-performance program at Medicare hospitals report that hospitals in the top two deciles of quality measures showed the fastest improvement, while hospitals in the lowest deciles raised their quality to a much lesser extent or may even have failed to improve [69,78].
To sum up, the hypothesis on groupwise effects of pay-for-performance is as follows: Hypothesis H2: The introduction of pay-for-performance leads to a larger boost of measured quality at high-quality hospitals than at low-quality hospitals.

Net total effect over time at groups of hospitals
Consider the multivariate dependence of the variable of interest on two sources of intertemporal dependence: the policy reform and mean reversion. The effect of mean reversion implies a differential time profile of measured quality: measured quality increases at hospitals in low percentiles of the quality distribution and decreases at hospitals in high percentiles. Combined with the positive effect of pay-for-performance on the mean value of measured quality (Hypothesis H1a), mean reversion is likely to result in a heterogeneous net total effect of change in measured quality over time.
Hypothesis H3a: High-quality hospitals experience a decrease of measured quality owing to regression toward the mean. However, the introduction of pay-for-performance and the increase of the share of hospital funds at risk in pay-for-performance lead to improvements in measured quality at these hospitals. The net total effect may vary.
Hypothesis H3b: Low-quality hospitals increase their measured quality owing to regression toward the mean. The introduction of pay-for-performance and the increase of the share of hospital funds at risk in pay-for-performance also cause a rise in measured quality, so the net total effect at these hospitals is positive.
If α is gradually raised in the course of implementation of the incentive scheme, then, according to H1b, convergence of measured quality weakens over time. The net total effect at high-quality hospitals is the sum of the positive effect of the quality incentive and the negative effect of quality convergence. With an increase in α, the number of hospitals where the positive effect outweighs the negative becomes larger.
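The sign logic of this decomposition can be sketched with illustrative numbers: the one-period net effect at a high-quality hospital is a negative reversion pull, (λ − 1) times the gap above the mean, plus a positive reform lift. All values below are hypothetical.

```python
# Sketch of the net total effect at a hospital above its mean quality:
# a negative mean-reversion pull plus a positive reform lift.
def net_effect(gap_above_mean: float, lam: float, reform_lift: float) -> float:
    reversion = (lam - 1.0) * gap_above_mean   # negative when quality is above the mean
    return reversion + reform_lift

# Raising alpha both increases the lift and (per H1b) raises persistence lam,
# so at more high-quality hospitals the net effect turns positive.
print(round(net_effect(10.0, 0.6, 2.0), 2))  # -2.0: reversion dominates
print(round(net_effect(10.0, 0.8, 3.0), 2))  # 1.0: reform dominates
```

This is exactly the mechanism behind H3c: a higher α weakens convergence (larger λ), shrinking the set of high-quality hospitals with a negative net total effect.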
Hypothesis H3c: The increase of hospital funds at risk under pay-for-performance weakens the effect of convergence of measured quality, so the number of high-quality hospitals with a negative net total effect decreases.
Empirical specification
We estimate the following dynamic panel data model for the TPS of hospital i in year t:

y_it = ϕ_0 + ϕ_1 y_{i,t−1} + ϕ_2 y_{i,t−2} + γ α_t s_it + β′z_it + δ′d_t + u_i + ε_it, (1)

where z_it are hospital time-varying characteristics, u_i are individual hospital effects (in particular, they incorporate the altruistic effects), the size of quality incentives α_t varies across years and enters the equation multiplied by the share of Medicare discharges s_it, which indicates that the quality incentives apply only to treatment of Medicare patients, and d_t is a set of dummy variables which capture external time effects (effects unrelated to hospital decisions). The following restrictions are used to identify the constant term ϕ_0: the sum of the coefficients for components of d_t is normalized to zero, and the expected value E(u_i) = 0. Hospital time-varying characteristics are the disproportionate share index, casemix index, number of hospital beds,⁷ physician-to-bed ratio, and nurse-to-bed ratio. The posterior analysis of the effect of quality incentives deals with hospital grouping according to time-invariant characteristics, which could not be incorporated in the empirical specification with fixed effects: geographic region where the hospital is located, public ownership, urban location, and teaching status.
We use two hospital control variables which affect quality improvement and allow us to mitigate potential biases, which might occur if the pay-for-performance effect is identified based only on the variation of α in time. The HRRP penalty captures the impact of a simultaneously adopted incentive program with similar incentives. Moreover, the readmission reduction program targets improvement of quality measures which are components of TPS.⁸ The binary variable for successful attestation of meaningful usage of EHR accounts for the effect of another compulsory program, which provides bonuses to attested hospitals. The variable controls for the fixed cost incurred by a hospital to improve its quality through installing and using health information technology systems.
Eq. (1) can be estimated using the generalized method of moments: the [2] and [12] methodology for dynamic panel data. Examples of use of the methodology in health economics include analysis of the quality of care at Medicare's hospitals in [56], study of the length of stay at Japanese hospitals in [10], investigation of labor supply by Norwegian physicians in [4], and of health status of individuals in the US in [57].
The first set of moment conditions in GMM comes from the approach of [2] and [12]. We take the first difference of the right-hand side and left-hand side of Eq. (1):

Δy_it = ϕ_1 Δy_{i,t−1} + ϕ_2 Δy_{i,t−2} + γ Δ(α_t s_it) + β′Δz_it + δ′Δd_t + Δε_it. (2)

Since ε_it cannot be predicted using the information available at period t − 1, ε_it is uncorrelated with any variable known at time t − 1, t − 2, etc. Therefore, Δε_it is uncorrelated with any variable known at time t − 2, t − 3, etc. Hence, the following set of moment conditions can be imposed to estimate the model parameters in Eq. (2), see [2] and [12]: E[Z_{i,t−2} Δe_it] = 0, where e_it is the regression residual and Z_it is any variable known at time t.⁹ Another set of moment conditions comes from [12] for the level Eq. (1): u_i + ε_it has to be uncorrelated with the lagged differences ΔZ_{i,t−1}, i.e., E[ΔZ_{i,t−1}(u_i + e_it)] = 0.
⁷ As the distribution of the number of hospital beds is extremely skewed, we take the log of hospital beds. This approach is in line with [22] and makes it possible to account for a nonlinear effect of hospital beds. It is less restrictive than the alternative approach, employing a list of dummies based on ranges of hospital beds (e.g., fewer than 99, 100-199, etc.). Use of a list of dummies condemns the effect to be piece-wise, prohibiting variation within the category of hospitals with a given range of beds.
⁸ 30-day unplanned readmission rates for acute myocardial infarction, heart failure, and pneumonia.
⁹ For instance, y_it may serve as Z_it.
So Z_it includes lagged values of predetermined and endogenous variables (the first set of moment conditions) and differenced predetermined and endogenous variables (the second set of moment conditions). All moment conditions are formulated separately for different years, so the number of observations for asymptotics equals the number of hospitals.¹⁰ More specifically, the lagged value of TPS and other hospital control variables in z_it (beds, physician-to-bed and nurse-to-bed ratios, HRRP penalty, and the binary variable for hospital EHR attestation) are taken as predetermined and do not require the use of instruments in estimations. Casemix and the disproportionate share index are assumed to be endogenous: we rely on empirical evidence of manipulation by hospitals of patient diagnoses (i.e., of casemix) and reluctance to admit low-income patients under quality-incentive schemes [17,23,28]. We assume that the Medicare share is endogenous too: the fact may be explained by demand-side response of Medicare patients to publicly reported hospital quality [44,53,72].
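The logic behind the difference-GMM orthogonality condition can be checked in a simulation sketch. The model below is a deliberately simplified AR(1) with fixed effects and hypothetical parameters: it illustrates why a twice-lagged level is a valid instrument for the differenced equation while a once-lagged level is not.

```python
import numpy as np

# Simulate y_it = phi * y_{i,t-1} + u_i + eps_it and verify the moment condition
# E[y_{i,t-2} * (eps_it - eps_{i,t-1})] = 0 used in difference GMM.
rng = np.random.default_rng(2)
n, T, phi = 200_000, 4, 0.5

u = rng.standard_normal(n)
eps = rng.standard_normal((n, T))
y = np.zeros((n, T))
y[:, 0] = (u + eps[:, 0]) / (1 - phi)          # rough stationary-style start
for t in range(1, T):
    y[:, t] = phi * y[:, t - 1] + u + eps[:, t]

d_eps = eps[:, 3] - eps[:, 2]                  # differenced error at t = 3

print(float(np.mean(y[:, 1] * d_eps)))  # close to zero: y_{t-2} is a valid instrument
print(float(np.mean(y[:, 2] * d_eps)))  # clearly negative: y_{t-1} is correlated
                                        # with eps_{t-1} and hence invalid
```

The second moment is about minus the variance of ε, which is why instruments for the differenced equation must be dated t − 2 or earlier.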
It should be noted that the use of dynamic panel data methodology requires justification on economic grounds. This is because the approach uses lags and lagged differences as instruments, and there are potential problems with using lags as instruments even when they pass the Arellano-Bond tests. Specifically, lags may prove to be weak and invalid [7]: weakness may occur when lags are distant [59], and invalidity happens due to overfitting of the endogenous variable under large T [66]. However, neither of these problems (weakness and invalidity) is likely to be present in our analysis, since we restrict our instruments to the first appropriate lag only.
The validity of instruments is assessed through statistics of the Arellano-Bond test. We employ [80] robust standard errors for estimation.¹¹ But formal tests are insufficient for establishing the causal relationship in models, which use an instrumental variable approach [1,7]. Accordingly, it is necessary to provide an economic justification for the assumption of the exclusion restriction of the instruments, i.e., that the instruments are exogenous and impact the dependent variable through no channels other than the endogenous variable and, possibly, also through exogenous covariates. An example of such justification on theoretical grounds can be found in [6], who uses lags of GDP and lags of the inflation rate as instruments for GDP and inflation. Another way of arguing for the exclusion restriction is given in [38], which estimates per capita output in various countries as a function of social infrastructure. Owing to endogeneity of social infrastructure, variables related to exposure to Western culture are used as instruments, and there is a discussion of the absence of any direct channels through which these variables could impact a country's per capita output.
We follow the latter approach to provide an economic justification for the validity of instruments in the dynamic panel data model for the composite quality measure at Medicare hospitals. Our arguments below, which advocate the applicability of lagged first differences as instruments for the level Eq. (1) and first lagged levels as instruments for the difference Eq. (2), are based on the plausible assumption of a short adjustment period in the values of the dependent variable. Specifically, we assume that hospital managers take prompt action upon learning the TPS in year t, so that adjustment is observed in the next period and is not delayed until a more remote future. This assumption is supported by interviews with hospital managers [21,37,46,55,73,77], which show real-time assessment of performance of hospital personnel and immediate feedback initiatives aimed at correcting possible lack of quality. For instance, at Medicare hospitals which participated in the pilot pay-for-performance program, "progress reports were routinely delivered to hospital leadership and regional boards" ([37], p. 45S). Hospital-specific and physician-specific compliance reports were collected at least every 1.5 months on average, and the results of these reports were delivered to individual physicians once in 5 months on average at both top-performing and bottom-performing hospitals ([77], Table 4, p. 837). As regards nationwide implementation of pay-for-performance at Medicare hospitals, the TPS is calculated annually, but values for the quality dimensions of the TPS are made publicly available on a quarterly basis.¹² Frequent announcements of quality scores make it possible to expedite quality adjustment at each hospital and improve the value of the TPS within a year. For instance, a survey of hospital CEOs, physicians, nurses, and board members showed that, since implementation of the value-based purchasing program, "data were shared with their board and discussed at least quarterly with senior leadership" ([55], p. 435).
¹⁰ The separate formulation of moment conditions for different years makes it impossible to apply the exclusion restrictions which are commonly used in the instrumental variables approach.
¹¹ The Sargan statistic may be used in dynamic panels for assessing validity of instruments under the homoskedasticity assumption. But it is not applicable to our specification with robust standard errors.
As regards our formal analysis, Eq. (1) has TPS as the dependent variable and its first and second lags as explanatory variables. Δy_{t−1} is used as an instrument for y_{t−1}. We assume that the change in TPS from period t − 2 to t − 1, i.e., Δy_{t−1}, which is observed at a hospital at t − 1, is immediately followed by the hospital's action in period t − 1. So the instrument Δy_{t−1} affects the dependent variable y_t through the endogenous variable y_{t−1}, i.e., through improved quality in period t − 1 (and potentially also through the predetermined variable y_{t−2}, i.e., quality adjustment may start as early as period t − 2) but not through other channels. Without the short adjustment period, these other channels might have included some postponed effects which only come into effect in period t. Note that the equation has hospital control variables, and we follow the empirical literature on the US Medicare reform by treating some of them as endogenous. One such variable, the share of Medicare patients, reflects the desire of the regulator to sign contracts with the hospital to treat Medicare patients, and it is a function of the hospital's quality-enhancing efforts [46]. Our empirical strategy relies on the fact that Δx_{t−1} is an excludable instrument for x_t. It is, indeed, plausible to assume that an increase in quality efforts from period t − 2 to period t − 1 results in a positive value of Δs_{t−1} (where s_t denotes the share of Medicare patients) and impacts the value of the TPS in period t. A similar argument applies to another endogenous control variable, casemix, which reflects the share of patients with complicated diagnoses. If we ignore potential dumping of patients by hospitals, hospitals are interested in treating patients with complicated diagnoses, since compensation in the system of diagnosis-related groups is higher for severe cases.
But patient demand responds to public reports on hospital quality [20,42,44], so the share of Medicare cases becomes a function of hospital quality.
The other equation is (2), which models first differences, i.e., changes in quality. The dependent variable is Δy_t and it is a function of the endogenous variable Δy_{t−1}, a predetermined variable Δy_{t−2}, and the difference in the values of hospital control variables Δx_t. The instrument for Δy_{t−1} is y_{t−2} and the instrument for each endogenous hospital control variable is x_{t−2}. Following the above logic about prompt response of TPS to its values in the previous period, we presume that y_{t−2} will affect the change in the value of the TPS from period t − 2 to period t − 1. So y_{t−2} impacts y_t through Δy_{t−1} (and potentially even through the predetermined variable Δy_{t−2}) but not through other channels (i.e., not through processes that occur as late as period t). Similarly, upon learning the value of x_{t−2}, hospitals speedily adjust their quality so as to change Δx_t, which affects Δy_t. Note that [56] used similar arguments in discussing applicability of the dynamic panel data model to analysis of in-hospital mortality and the complication rate, which are used as measures of hospital quality at US Medicare hospitals. They write: "We believe our approach is appropriate because (i) changes to in-hospital mortality and complications should be immediately affected by changes in staffing levels, not after a long adjustment period, and (ii) the influence of the past is incorporated through the lagged value of the dependent variable." (p. 296, Footnote 3).
A related study applying dynamic panel data models to hospital performance indicators deals with average length of stay at Japanese acute-care hospitals that plan to introduce a prospective payment system [10]. The variable is treated in Japan as a proxy for hospital efficiency. It is regularly monitored and analyzed by the regulator and by hospital management, with feedback actions by hospital personnel in response to annual updates on levels of the variable [9,10,43,45,75]. Accordingly, the assumption of a short adjustment period for the length of stay is likely to hold at Japanese hospitals, and the use of lagged levels and lagged differences as instruments is justified. Note that potential violations of the exclusion restriction may occur in instances where the quality measure requires long periods to adjust. In such instances, a causal impact of the Medicare reform on the quality of care cannot be established [1,7].
¹² Exceptions are one measure in the clinical process of care domain (influenza immunization), one measure in the safety domain (PSI-90), and a measure in the efficiency domain, Medicare spending per beneficiary, which are updated annually. See measure dates in the quarterly data archives available at https://data.cms.gov/provider-data/archived-data/hospitals.
We note other limitations of our approach. First, the analysis deals with the composite quality measure. While the quality-related efforts of a hospital and the TPS composite quality measure are multi-dimensional, we do not touch upon multi-tasking in the empirical estimations. Our approach considers a one-dimensional effort, a one-dimensional true quality, and its measurable proxy.¹³ Second, we do not touch on the rules for computing the scores of each dimension of the composite measure or on aggregation of dimension scores. It is important to note that Medicare uses whichever is higher, improvement points or achievement points, as the score for each dimension. The choice between achievement and improvement points stimulates low-performing hospitals, and the uniform formula assumes that all groups of hospitals have equal margin for improvement. A minor exception is protection of hospitals above the benchmark value of the 95th percentile of a corresponding measure score: these hospitals receive 10 points for their achievement on a [0, 10] scale, while the maximum number of points for improvement by any hospital is 9.¹⁴ Third, weighting of scores across domains is another feature of the design of the incentive mechanism which is not analyzed in our article. So the dichotomous variables for annual periods in the empirical specification capture time effects unrelated to Medicare's value-based purchasing as well as time effects not associated with the size of incentives but potentially linked to changes in other elements of the reform design (i.e., changes in weights).
Finally, conventional policy evaluation using a control group of hospitals is not possible because quality measures for non-Medicare hospitals are not available.¹⁵ The empirical part of the article therefore focuses solely on pay-for-performance hospitals and identifies the effect of quality incentives based on variation in α_t and the share of Medicare patients in the hospital, s_it. Variation in α_t plays the role of the dummy for treatment/pre-treatment periods, and variation in s_it acts similarly to the dummy for the treatment/control groups.

Calculation of the mean in the autoregressive process
We interpret the second-order dynamic panel (1) as a second-order autoregressive process. The coefficients for the first and the second lags of y_it in this AR(2) process are equal to ϕ₁ + ϕ₄α_t s_it and ϕ₂ + ϕ₅α_t s_it, respectively. Note that both coefficients are linear functions of α_t. While the standard form of the AR(2) process contains only the lags of the dependent variable, the right-hand side of our empirical equation includes various hospital characteristics and control variables.

13 Note that in case of Medicare's formula, the true multi-dimensional quality of hospitals (and hence quality-related efforts) is transformed into measured quality (i.e., TPS) in a non-linear manner, owing to the step-wise scale used for computing the points for each measure. We might nonetheless assume that quality is transformed into TPS monotonically and can be linearized in the empirical part of the article. Several arguments can be listed to support the conjecture. First, the data for Medicare hospitals show that no hospital has the highest possible step-wise values for all its measures. So even the best hospitals have an incentive to work to increase at least one of their measure scores in order to improve TPS. Second, we can neglect disincentives within the step-wise scale used for aggregating measure scores, which may cause deterioration of quality for hospitals that are already positioned at the highest step. Such hospitals could afford only a slight decrease of their quality (due to slackened efforts) while remaining at this step, and the impact on the value of TPS of a fall in quality in only one quality measure would be negligible. Third, interviews with executives of hospitals using value-based purchasing show that a hospital rarely gives special attention to a given subset of measures or shifts its administrative and other efforts across measures. All dimensions of TPS are monitored and actions to improve each dimension are implemented [46,73].

14 The approach used by Medicare is in contrast with the methodology used in France, where all providers are stimulated according to improvement while only providers with quality above the mean value are also rewarded for their achievement [14].

15 The TPS and all its components are only available for hospitals in the Hospital Compare database. The Hospital Compare database does include a small group of non-incentivized hospitals together with value-based purchasing hospitals: children's hospitals and critical-access hospitals. But both groups offer a special type of healthcare and are not comparable with acute-care hospitals. Moreover, critical-access hospitals usually have no more than 20 beds, which makes it impossible to find a close match with acute-care hospitals. See [70] for an attempt at matching acute-care and critical-access hospitals.
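The dependence of the lag coefficients on α_t s_it, and the stationarity condition it must respect, can be checked numerically. The coefficient values and the mean Medicare share below are placeholders for illustration, not the paper's estimates.

```python
import numpy as np

# Hypothetical coefficient values; the article estimates these from data.
phi1, phi2 = 0.45, 0.15    # baseline lag coefficients
phi4, phi5 = 3.0, 1.0      # interactions with alpha_t * s_it
s = 0.4                    # assumed mean share of Medicare cases

def lag_coefficients(alpha, s=s):
    """Effective AR(2) lag coefficients, linear in alpha * s."""
    return phi1 + phi4 * alpha * s, phi2 + phi5 * alpha * s

def is_stationary(alpha, s=s):
    """AR(2) is stationary iff the roots of 1 - a1*z - a2*z^2 = 0
    lie outside the unit circle."""
    a1, a2 = lag_coefficients(alpha, s)
    roots = np.roots([-a2, -a1, 1.0])   # polynomial 1 - a1 z - a2 z^2
    return bool(np.all(np.abs(roots) > 1.0))

for alpha in (0.0, 0.01, 0.02):         # alpha on the scale of VBP shares
    print(alpha, lag_coefficients(alpha), is_stationary(alpha))
```

For small α both lag coefficients rise slightly and the process remains stationary; a sufficiently large α s would push the sum of the coefficients past one and destroy stationarity.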
To test the hypotheses which concern the mean value of the measured quality, μ, we compute the mean fitted value of y_it as follows. For a fixed value of α we take the unconditional expected values of both sides of (1) and denote the resulting mean μ(α); the time effects drop out because of the normalization of the coefficients δ₃ in (1). After collecting the terms with μ and rearranging them, we obtain an expression for μ(α). Since α differs across t, we use sample means across the hospitals for fixed t to obtain estimates of the expectations.
The estimate of μ(α) is constructed by replacing the expected values and covariances with the corresponding sample means and sample covariances. Note that the expression for μ(α) does not contain the time effects d_t′δ₃, as they represent shifts in quality which are common to all the hospitals and are caused by external circumstances.

The first null hypothesis states that the reform has no effect on the mean, and it is tested against the positive alternative. Equivalently, we compute the difference between μ(α) and μ(0) and test the null hypothesis that this difference is zero against the positive alternative. Note that μ(0) represents the mean value in the pre-reform years when α = 0 and is obtained analytically by plugging α = 0 into the expression for μ(α).

The persistence parameter λ(α) describes how quickly the effect of a random shock to quality fades over time. For a second-order autoregressive process, the conditional expected value of y_it converges to the mean exponentially at a rate equal to the reciprocal of the smallest root of the characteristic equation of the AR(2) process, evaluated at s̄, the mean value of the share of Medicare cases for a given year. An alternative approach takes the value of the autocorrelation function ACF(1), the correlation coefficient between y_it and y_{i,t−1}, as the persistence parameter λ, computed from the coefficients of the second-order autoregressive process (1).

Testing H3a and H3b involves computing the predicted values of TPS_it at the mean value of each covariate for different quintiles of the lagged TPS_it and examining whether in 2013–2019 the predicted changes move from positive in the lowest quintiles to negative or insignificant in the highest quintiles. The average difference between predicted TPS and lagged TPS shows the expected change in quality in consecutive years (the net total effect), which is the sum of the effect of pay-for-performance and the impact of mean reversion.
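The quantities defined above can be sketched numerically. The coefficients below are hypothetical (the paper's equation (1) also includes controls, which are absorbed into the intercept here): μ(α) is the implied unconditional mean, λ(α) the reciprocal of the smallest characteristic root, and ACF(1) the alternative persistence measure.

```python
import numpy as np

# Hypothetical AR(2) coefficients with policy interactions (not the
# article's estimates); the intercept also shifts with alpha * s.
phi1, phi2, phi4, phi5 = 0.45, 0.15, 3.0, 1.0
c0, c_alpha = 10.0, 40.0
s = 0.4                                   # assumed mean Medicare share

def mu(alpha):
    """Unconditional (long-term) mean of the AR(2) process at policy
    intensity alpha: mu = c(alpha) / (1 - a1 - a2)."""
    a1 = phi1 + phi4 * alpha * s
    a2 = phi2 + phi5 * alpha * s
    return (c0 + c_alpha * alpha * s) / (1 - a1 - a2)

def persistence(alpha):
    """lambda(alpha): reciprocal of the smallest root (in modulus) of
    the characteristic equation 1 - a1 z - a2 z^2 = 0."""
    a1 = phi1 + phi4 * alpha * s
    a2 = phi2 + phi5 * alpha * s
    roots = np.roots([-a2, -a1, 1.0])
    return 1.0 / np.min(np.abs(roots))

def acf1(alpha):
    """Alternative persistence measure: first autocorrelation of an
    AR(2), rho_1 = a1 / (1 - a2) by the Yule-Walker equations."""
    a1 = phi1 + phi4 * alpha * s
    a2 = phi2 + phi5 * alpha * s
    return a1 / (1 - a2)

# reform effect cleared of mean reversion: mu(alpha) - mu(0)
effect = mu(0.02) - mu(0.0)
print(round(effect, 3), round(persistence(0.02), 3), round(acf1(0.02), 3))
```

With these numbers, μ(α) exceeds μ(0) and λ(α) rises with α while staying below one, mirroring the pattern the hypotheses test for.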

Data sources and variables
The analysis uses data for Medicare hospitals in 2011-2019 from several sources. We use Hospital Compare data archives (January 2021 update) for quality measures, hospital ownership, and geographic location. The medical school affiliation of a hospital, the number of hospital beds, nurses, and physicians come from Provider of Service files. Other hospital control variables are taken from the Final Rules, which are Medicare's annual documents on reimbursement rates in the inpatient prospective payment system. Specifically, we use information from the Impact Files, which accompany the Final Rules and estimate the impact of the reimbursement mechanism on hospital characteristics. The variables taken from the Impact Files are the share of Medicare's discharges, ownership, and urban location.
Patient characteristics are also taken from the Impact Files. The casemix variable reflects the relative weight of each DRG in financial terms and is adjusted for transfers of patients between hospitals.¹⁶ Casemix makes it possible to control for the composition of patient cases taking account of the objective link between severity of illness and hospital resources. The disproportionate-share index accounts for the share of low-income patients and makes it possible to proxy a patient's income.
To account for other major channels of quality improvement at Medicare hospitals over the observed time period, we use data for two programs run by the Centers for Medicare and Medicaid Services. One of them is the HRRP, which has applied to Medicare hospitals since fiscal year 2013 and penalizes them for excess readmissions. Specifically, a payment reduction of between 0 and 3% is applied to the hospital's Medicare remuneration; higher penalty percentages correspond to more excess readmissions at the hospital. Using the HRRP Supplemental Data Files, which accompany the annual Final Rules on the acute inpatient PPS (June 2020 update), we find the HRRP penalty for 2013–2019 and use it as one of the control variables in the empirical analysis.
We also consider the EHR Incentive Program, which has been in force since 2011. The program establishes hospital attestation on the use of EHR. The adoption of quality-improving information technology requires substantial fixed cost, so the binary variable for hospital attestation within the EHR program makes it possible to control for this fixed cost in the empirical analysis. The EHR promotion program consists of three stages (sequentially introduced in 2011, 2014, and 2017). Using data from the Eligible Hospitals Public Use Files on the EHR Incentive Program (February 2020 update), we set the EHR attestation dummy equal to one if the hospital passed its attestation for the given year at any stage. Owing to non-availability of data on the third stage of the program, we extend the second-stage data from year 2016 to years 2017–2019. Owing to the small size of the non-EHR group (only 8–10% of the sample), we do not analyze whether quality rises faster in the group of attested hospitals (for instance, we do not interact the attestation dummy with α).

Sample
The non-anonymous character of the data sources allows us to merge them by year and hospital name. Our analysis focuses on acute-care Medicare hospitals, as the pay-for-performance incentive contract applies exclusively to this group. We restrict the sample by considering only hospitals with a share of Medicare cases greater than 5% (Table 1).
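The merge-by-year-and-name step can be sketched with pandas. The frames and column names below are purely illustrative stand-ins for the Hospital Compare and Impact File extracts, not the actual file headers.

```python
import pandas as pd

# Hypothetical extracts; column names are illustrative assumptions.
hc = pd.DataFrame({
    "hospital_name": ["MERCY HOSPITAL", "ST LUKE MEDICAL CENTER"],
    "year": [2013, 2013],
    "tps": [37.5, 52.1],
})
impact = pd.DataFrame({
    "hospital_name": ["MERCY HOSPITAL", "ST LUKE MEDICAL CENTER"],
    "year": [2013, 2013],
    "medicare_share": [0.42, 0.03],
})

# merge the sources on the (hospital name, year) key
panel = hc.merge(impact, on=["hospital_name", "year"], how="inner",
                 validate="one_to_one")

# keep hospitals with a share of Medicare cases above 5%
panel = panel[panel["medicare_share"] > 0.05]
print(len(panel))
```

The `validate="one_to_one"` argument guards against duplicate hospital-year keys, which is a common failure mode when merging by name rather than by a provider ID.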

Flow of quality and evidence of mean reversion
Descriptive analysis of the values of TPS offers suggestive evidence in support of some of the main hypotheses generated by the model. Specifically, we focus on the flow of hospitals between quintiles of TPS in different years. The Sankey diagrams in Figures 1 and 2 use the width of arrows to represent the intensity of flows and demonstrate how hospitals change their position in quintiles of the composite quality measure after the introduction of pay-for-performance (e.g., from 2012 to 2013). As can be inferred from Figure 1, there is considerable movement of hospitals between quintiles. For instance, consider hospitals which in 2012 belonged to the fifth quintile of TPS (the quintile with the highest performance). Fewer than half of these hospitals remained in the fifth quintile of TPS in 2013, and the rest saw a decline of their position relative to other hospitals by moving to quintiles one through four. Similar tendencies are observed for hospitals in any other given quintile of TPS in 2012: only a small share of hospitals continue to belong to the same quintile in the subsequent year. This can be viewed as graphic support for the phenomenon of mean reversion, since hospitals would rarely change their quintile from year to year in the absence of mean reversion.
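The quintile flows behind such a diagram reduce to a transition matrix. The sketch below uses simulated data with a mean-reverting measurement component (all parameters are assumptions, not Medicare data) and reproduces the qualitative pattern: well under half of top-quintile units stay in the top quintile the next year.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Simulated TPS for two adjacent years: persistent true quality plus
# year-specific noise (illustrative, not Medicare data).
n = 1000
quality = rng.normal(size=n)                        # persistent component
tps_2012 = quality + rng.normal(scale=1.0, size=n)  # noisy measurement
tps_2013 = quality + rng.normal(scale=1.0, size=n)

q12 = pd.qcut(tps_2012, 5, labels=[1, 2, 3, 4, 5])
q13 = pd.qcut(tps_2013, 5, labels=[1, 2, 3, 4, 5])

# transition matrix between quintiles (rows: 2012, columns: 2013),
# normalized so each row sums to one
flows = pd.crosstab(q12, q13, normalize="index")
print(flows.round(2))

# share of top-quintile hospitals that remain in the top quintile
stay_top = flows.loc[5, 5]
```

Because the noise pushes extreme measurements back toward the center, `stay_top` falls well below one half, exactly the mean-reversion signature visible in the Sankey diagram.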
It is plausible to assume that mean reversion becomes weaker when there is an increase of α.

Empirical results
The first set of our results is reported in Table 2 and concerns the mean effect of pay-for-performance at Medicare hospitals. Note that the mean value of μ(α_t) increases in α_t, which supports our supposition that hospital managers take account of future benefits from improving current values of hospital quality.

Table 3 shows the second set of results, on heterogeneity of hospital response to pay-for-performance. The persistence parameter λ(α_t) is estimated as the inverse of the smaller root of the AR(2) characteristic equation or as ACF(1); the latter is denoted "alternative." The values are significant and less than one under both approaches. This points to mean reversion: quality decreases toward the mean at high-quality hospitals and goes up toward the mean at hospitals with low quality. The values of λ rise with an increase in the size of incentives α, which implies that the persistence of the dynamic process increases and mean reversion weakens. Since the values of λ(α_t) are well below 1, we can conclude that the estimated AR(2) processes are indeed stationary for each α_t.

Groupwise results are reported in Tables 4, 5, and 6, where hospitals are divided into quintiles according to the values of their TPS. Note that the change in hospital quality is a function of the regression coefficient and the mean values of covariates. So its standard error consists of two parts: the error of the estimated regression coefficient and the error of the mean values of covariates. Only the second part of this error depends on sample size and should go up approximately 5 times due to analysis by quintiles. However, the weight of this second part proves to be relatively small in case of our data, so the standard errors in Tables 4–6 are only slightly larger than the standard errors in Table 2.
The estimates of the effect of pay-for-performance, in terms of the difference in fitted long-term means, show that the higher the quintile of the quality distribution in the previous year, the larger is the impact of the reform (Tables 4 and 5). Statistically significant differences in the effect of pay-for-performance across consecutive quintiles of lagged TPS are observed in many years, for instance, in 4 years out of 7 for quintiles 1–2 in Table 5. So pay-for-performance stimulates quality increase in all groups of Medicare hospitals, and the impact of pay-for-performance is greater at higher-quality hospitals. Notes: quintile 1 denotes the lowest quality and quintile 5 the highest. The table reports the effect at each corresponding quintile and the differences in the effects at consecutive quintiles. *, **, and *** show significance at levels of 0.1, 0.05, and 0.01, respectively.
Standard errors (calculated using the delta-method for the difference of the reform effects across the corresponding two categories of each time-invariant hospital characteristic) are in parentheses.
There are two sources of errors in the estimates shown in the table: the error of the regression coefficient and the error of the mean values of covariates. The first part of the error does not vary across the result tables, while the second part depends on the group size and is approximately 5 times larger than its counterpart in Table 2. However, the errors of the regression coefficient are considerably bigger than those of the mean values of covariates, so the increase in the standard errors in this table and the two subsequent tables relative to the standard error in Table 2 is only minor. Notes: quintile 1 denotes the lowest quality and quintile 5 the highest. The table reports the effect at each corresponding quintile and the differences in the effects at consecutive quintiles. *, **, and *** show significance at levels of 0.1, 0.05, and 0.01, respectively.
Standard errors calculated using the delta-method are in parentheses. Table 6 gives estimates of the net total effect, i.e., the expected change in hospital quality over time, measured as the difference between the predicted TPS and the lagged TPS. The net total effect is the sum of the impact of mean reversion and the effect of pay-for-performance.
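The delta-method computation behind such standard errors can be sketched as follows. The coefficient vector, covariance matrix, and the function g are illustrative assumptions; the general recipe is Var[g(θ̂)] ≈ ∇g′ V ∇g.

```python
import numpy as np

# Illustrative estimates: two lag coefficients and their covariance
# matrix (hypothetical numbers, not the article's estimates).
theta = np.array([0.45, 0.15])
V = np.array([[4e-4, 1e-4],
              [1e-4, 9e-4]])

def g(t):
    # example smooth function of the coefficients: their sum
    # (a crude overall-persistence summary)
    return t[0] + t[1]

def delta_se(V, grad):
    """Delta-method SE of g(theta): sqrt(grad' V grad)."""
    return float(np.sqrt(grad @ V @ grad))

grad = np.array([1.0, 1.0])   # gradient of g evaluated at theta
se = delta_se(V, grad)
print(round(g(theta), 2), round(se, 4))
```

For nonlinear functions such as μ(α), the gradient is evaluated at the estimated coefficients, and the covariance of the covariate means enters as a second, additive variance term, which is the decomposition discussed in the notes above.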
Note that the estimation of the fitted value of TPS includes time effects which account both for time trend and for important changes in the incentive mechanism not captured by variation in α. An example of such change occurred in 2015 and temporarily decreased the value of TPS for each hospital.¹⁷ Accordingly, Table 6 shows that the values of predicted TPS minus lagged TPS go down in 2015 for each quintile.
The values of the net total effect reveal an increase of quality in the groups of low-quality hospitals, while quality deteriorates in the high-quality groups. Negative total effect is less prevalent or smaller in absolute terms at high-quality hospitals in 2016–2017. The result can be attributed to the weakening of mean reversion with the increase in α. Yet, when α becomes constant in 2018–2019, the prevalence of negative total effect and the absolute value of the negative effect return to their 2015 levels.
Finally, we focus on the effect of pay-for-performance for groups of Medicare hospitals according to their ownership, teaching status, urban location, and geographic region. The mean effect increases in α for public and private hospitals, for urban and rural hospitals, for teaching and non-teaching hospitals, and for hospitals in each geographic region (Tables 7 and 8).
The effect of pay-for-performance is greater for private hospitals than for public hospitals, which corresponds to findings in [13] and [78]. The result can be explained by a greater emphasis on financial incentives at these healthcare institutions: profit considerations, combined with the altruistic character of healthcare services, induce more effective quality competition at non-public hospitals [16]. The difference in the effect for private and public hospitals is statistically significant in most years.
As for teaching status, quality improvement owing to the incentive scheme is often higher at non-teaching hospitals, which may be because they can devote all of their labor resources to patient treatment, while teaching hospitals lose some efficiency due to their educational activities [64]. Also, teaching hospitals may be treating more difficult cases. This complexity may not be fully captured by the casemix variable in our analysis and may cause a downward bias of the estimated effect at teaching hospitals, explaining the lower value of the effect at teaching than at non-teaching hospitals. Yet, the difference in the values at teaching and non-teaching hospitals is statistically insignificant in each year. Statistically significant differences in the effect of pay-for-performance for urban and rural hospitals are observed only in the last 2 years: the effect is larger at rural hospitals.
As regards geographic location, there is practically no variation in the effect across groups of hospitals in the early years of pay-for-performance. The differences appear mainly in the later years: for instance, the mean effect of pay-for-performance is greater in New England than in the Mid-Atlantic region in 2016–2019 and greater than in the East North Central and West South Central regions in 2017–2019. Notes: Standard errors (calculated using the delta-method for the difference of the reform effects across New England hospitals and hospitals in each corresponding geographic region) are in parentheses. *, **, and *** show significance at levels of 0.1, 0.05, and 0.01, respectively.
In this article, we focused on exclusion of mean reversion in evaluating the response of TPS at Medicare hospitals to an incentive contract. Since TPS under this contract becomes an autoregressive process, our analysis deals with dynamic panels. It should be noted that dynamic panel data models are prevalent in various fields of economics. Examples in macroeconomics include the analysis of a country's growth [11,50] or its current account [81]. Applications in corporate finance deal with the study of such firm-level variables as size [33,61], profit [54], and leverage [32,36], and such proxies of firm performance as return on assets and Tobin's Q [49,65]. In the banking sphere, dynamic panels are applied to ROE and profitability [35,48], while in finance they are used for housing prices [31] and fuel prices [71]. Papers in the economics of labor, health, and welfare employ dynamic panel data models to analyze physician labor supply [4], hospital staffing intensity [82], wealth of households and health status of individuals [57], and quality and efficiency of hospitals (e.g., mortality ratio in [56] and average length of stay in [10]).
The approach used in our study estimates the unconditional mean of the dependent variable in the dynamic panel data model and employs it for policy evaluation. Specifically, the comparison of the fitted values of the unconditional mean at different values of policy intensity offers a measure of the effect of reform. The advantages of the approach are twofold. First, it excludes the impact of mean reversion in groupwise estimations (e.g., in lower and in higher quantiles of hospitals according to their TPS). Second, the approach may also be used in the analysis of the mean effect of the reform if we focus on effects in the long run. Indeed, the unconditional mean in dynamic panel data analysis is sometimes called the long-term mean as it reflects the mean value in the long run. It should be noted that an alternative approach that uses the estimated coefficient for the policy variable as a measure of the mean effect of reform does not suffer from the problem of mean reversion. But in dynamic panel data models it evaluates only the short-term impact of policy.
As regards exclusion of mean reversion in dynamic panels, we note a limitation on the character of mean reversion, imposed by the nature of the dynamic panel model where the unconditional mean is the long-term mean. Mean reversion is not instantaneous: if a deviation from the mean is observed in period t, the return to the mean occurs not in period t + 1 but only in later periods. It may be noted that our approach is similar to difference-in-differences estimation. The long-run effect of reform under our approach is the difference between the fitted value of the long-term mean under the value α_t and under the counterfactual value of zero (similar to [48]). Alternatively, we can take the difference between the fitted values of the long-term means under α_t and α_{t−1}. To summarize: in focusing on the long-run impact of the reform in dynamic panel data models, the estimation of either the mean effect or the groupwise effects requires the unconditional mean. The approach also excludes mean reversion, which contaminates policy evaluation in the case of groupwise estimations.
As regards policy evaluation based on panel data fixed effects methodology, our approach of computing the unconditional mean as a function of the policy variable α produces the conventional linear prediction of the dependent variable. The mean effect of reform in the static panel is either the coefficient for the reform variable or the difference in the fitted value of y under α t and the fitted value of y under 0 (counterfactual).
Finally, we note the prerequisites for identification of the unconditional mean which are similar to the assumptions in difference-in-difference estimations. Two requirements apply both to the static and dynamic panels. First, time variation in the policy variable is required for identification of the coefficient for the policy variable in the unconditional mean function. Second, if there is only time variation in the policy variable α (and no cross-section variation in α t at a given value of t, i.e., no control group), the reform effect cannot be distinguished from other time effects. So cross-section variation in another variable, which is correlated with the policy variable, is required. In our case, this variable is the Medicare share: the higher the share of Medicare patients in the hospital, the stronger the impact of α (the share of hospital funds at risk under the Medicare program becomes more important for total revenues of the hospital). The use of dynamic panel data models requires a third assumption: the unconditional mean must be defined, and for this reason the process y has to be stationary.
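The second identification requirement, that cross-section variation in a variable correlated with the policy is needed, can be demonstrated with a design-matrix rank check. All dimensions and values below are illustrative assumptions.

```python
import numpy as np

# With only time variation in alpha, the policy regressor is a linear
# combination of the time dummies; cross-section variation in s
# restores identification.
T, N = 5, 100
alpha = np.array([0.0, 0.01, 0.0125, 0.015, 0.0175])  # policy intensity

time_dummies = np.kron(np.eye(T), np.ones((N, 1)))    # (N*T, T)
alpha_long = np.kron(alpha, np.ones(N))               # alpha_t per row

# case 1: identical Medicare share for all hospitals (no control group)
s_const = np.full(N, 0.4)
x1 = (alpha_long * np.tile(s_const, T))[:, None]
r1 = np.linalg.matrix_rank(np.hstack([time_dummies, x1]))

# case 2: Medicare shares vary across hospitals
rng = np.random.default_rng(2)
s_var = rng.uniform(0.05, 0.8, size=N)
x2 = (alpha_long * np.tile(s_var, T))[:, None]
r2 = np.linalg.matrix_rank(np.hstack([time_dummies, x2]))

print(r1, r2)   # rank T (collinear) vs rank T + 1 (identified)
```

In case 1 the policy column is absorbed by the time dummies, so its coefficient cannot be separated from other time effects; in case 2 the within-period variation in α_t s_i adds an independent column.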

Conclusion
Studies of incentive contracts usually focus on the mean tendency and give scant attention to potentially heterogeneous responses to the policy of interest by agents at different percentiles of the distribution of the dependent variable. But insufficient analysis of such heterogeneity may lead to speculation about ceiling effects and to a belief among agents with better values of the variable of interest that further improvement offers no additional financial gain.
This article highlights the fact that there is multivariate dependence of the variable of interest in such incentive contracts. Specifically, a part of the intertemporal dependence can be attributed to the policy reform and a part to mean reversion. So the article proposes a method to help model such multivariate dependence by excluding the impact of mean reversion. As mean reversion contaminates judgment regarding the time profile of the dependent variable, and this contamination differs for agents in lower and higher percentiles of the variable of interest, clearing the reform effect of mean reversion makes the method suitable for assessing heterogeneity of responses to incentive schemes.
In an application to the longitudinal data for Medicare's acute-care hospitals taking part in the nationwide quality incentive mechanism ("value-based purchasing"), we find that the higher the quintile of quality in the prior period, the larger the increase in the composite quality measure owing to the reform. Quality improvement in each quintile increases with the increase in size of the quality incentive.
Our results reveal that the increase in the quality measure owing to pay-for-performance is greater at hospitals with higher levels of quality. The finding suggests a stronger emphasis on quality activities at high-quality hospitals, and this is indeed discovered in a number of works. For instance, top-performing hospitals in the US pilot program paid more attention to quality enhancement than bottom-performing hospitals [77]. Under the proportional pay-for-performance mechanism in California, high-quality physicians similarly placed more emphasis on an organizational culture of quality and demonstrated stronger dedication to addressing quality issues than low-quality physicians [21]. The desire of high-quality hospitals that have reached the top deciles of hospital performance to pursue quality improvement by means additional to those proposed by the policy regulator is further evidence in support of our findings [37].
Directions for future work in health economics applications may include analysis of heterogeneous hospital response to quality incentives by considering different dimensions of the composite quality measure. A related field of research is the study of potential sacrifice of quality of non-incentivized measures in favor of measures incentivized by pay-for-performance. This has been analyzed at the mean level [27,47] and may be expanded to account for different behavior by high-quality and low-quality hospitals.

A Estimation with the dynamic panel

B Data sources
Total performance scores and other Hospital Compare data were downloaded from https://data.medicare.gov/data/hospital-compare (Table A2).
Impact Files data are taken from https://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/AcuteInpatientPPS (Table A5). The Sargan statistic is not applicable to the specification with robust standard errors.