Peng Ding

# Abstract

A result from a standard linear model course is that the variance of the ordinary least squares (OLS) coefficient of a variable never decreases when additional covariates are included in the regression. The variance inflation factor (VIF) measures this increase in variance. Another result from a standard linear model or experimental design course is that including additional covariates in a linear model of the outcome on the treatment indicator never increases the variance of the OLS coefficient of the treatment, at least asymptotically. This technique, the analysis of covariance (ANCOVA), is often used to improve the efficiency of treatment effect estimation. So we have two seemingly paradoxical results: adding covariates never decreases the variance in the first result but never increases the variance in the second. In fact, these two results are derived under different assumptions. More precisely, the VIF result conditions on the treatment indicators while the ANCOVA result averages over them. Comparing the estimators with and without adjustment for additional covariates in a completely randomized experiment, I show that the former has the smaller variance averaging over the treatment indicators, and the latter has the smaller variance, at the cost of a larger bias, conditioning on the treatment indicators. Therefore, there is no real paradox.

MSC 2010: 62-01; 62A01; 62J10

## 1 Variance inflation factor

Consider the following linear regression:

$$y_i = \alpha + \tau z_i + \beta' x_i + \varepsilon_i, \quad (i = 1, \ldots, n) \tag{1}$$

where $z_i$ is a scalar and $x_i$ is a scalar or vector. Without loss of generality, we center the $x_i$'s so that $\bar{x} = n^{-1}\sum_{i=1}^n x_i = 0$. In a leading example, $z_i$ is the treatment variable and $x_i$ contains all the pre-treatment covariates. Using a standard result in linear models, we can write the OLS estimator for $\tau$ as

$$\hat{\tau}_a = \frac{\sum_{i=1}^n \check{z}_i y_i}{\sum_{i=1}^n \check{z}_i^2},$$

where $\check{z}_i$ is the residual from the OLS fit of $z_i$ on $(1, x_i)$. This result is also called the Frisch–Waugh–Lovell theorem in econometrics; see [1] for a recent review. If the regressors $(z_i, x_i)$'s are all fixed and the $\varepsilon_i$'s are independent and identically distributed (IID) with mean 0 and variance $\sigma^2$ as in the classic linear model, then the variance of $\hat{\tau}_a$ equals

$$\mathrm{var}(\hat{\tau}_a) = \frac{\sum_{i=1}^n \check{z}_i^2 \, \mathrm{var}(y_i)}{\left(\sum_{i=1}^n \check{z}_i^2\right)^2} = \frac{\sigma^2}{\sum_{i=1}^n \check{z}_i^2} = \frac{\sigma^2}{\sum_{i=1}^n (z_i - \bar{z})^2} \times \frac{\sum_{i=1}^n (z_i - \bar{z})^2}{\sum_{i=1}^n \check{z}_i^2}. \tag{2}$$

The first factor of (2) equals the variance of

$$\hat{\tau} = \frac{\sum_{i=1}^n (z_i - \bar{z}) y_i}{\sum_{i=1}^n (z_i - \bar{z})^2}, \tag{3}$$

i.e., the coefficient of $z_i$ in the OLS fit of $y_i$ on $(1, z_i)$ without adjusting for $x_i$. The second factor of (2) is the VIF, which is no smaller than 1 because it is the total sum of squares divided by the residual sum of squares in the OLS fit of $z_i$ on $(1, x_i)$. The VIF can be equivalently written as $(1 - R^2_{z|x})^{-1}$, where $R^2_{z|x}$ is the sample $R^2$ between $z_i$ and $x_i$. So the VIF result can also be written as

$$\mathrm{var}(\hat{\tau}_a) = \mathrm{var}(\hat{\tau}) \times (1 - R^2_{z|x})^{-1}.$$

It highlights the bias–variance tradeoff: with more covariates included, the model is closer to the truth and thus leads to a smaller bias in estimating $\tau$, but at the same time, it results in a larger variance of $\hat{\tau}_a$. See [2], [3], [4] and [5] for textbook discussions.
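The residual-regression representation of $\hat{\tau}_a$ above can be verified numerically. The following is a minimal sketch assuming `numpy` is available; the sample size and coefficients are made up for illustration.

```python
import numpy as np

# Check the Frisch-Waugh-Lovell identity: the coefficient of z in the full
# OLS fit of y on (1, z, x) equals the residual regression estimator.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=(n, 2))
x = x - x.mean(axis=0)                      # center the covariates
z = 0.5 * x[:, 0] + rng.normal(size=n)      # z correlated with x
y = 1.0 + 2.0 * z + x @ np.array([1.5, -0.7]) + rng.normal(size=n)

# coefficient of z in the OLS fit of y on (1, z, x)
tau_a = np.linalg.lstsq(np.column_stack([np.ones(n), z, x]), y, rcond=None)[0][1]

# residual of z from the OLS fit of z on (1, x), then the ratio formula
X1x = np.column_stack([np.ones(n), x])
z_check = z - X1x @ np.linalg.lstsq(X1x, z, rcond=None)[0]
tau_a_fwl = np.sum(z_check * y) / np.sum(z_check ** 2)

print(abs(tau_a - tau_a_fwl) < 1e-10)  # True: the two computations coincide
```

The agreement holds for any data, not only this simulated draw, because the identity is algebraic.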

Thus, from (2), the variance $\mathrm{var}(\hat{\tau}_a)$ will never decrease with more covariates in (1), because the residual sum of squares $\sum_{i=1}^n \check{z}_i^2$ will decrease while the total sum of squares $\sum_{i=1}^n (z_i - \bar{z})^2$ stays constant. An immediate result is

$$\mathrm{var}(\hat{\tau}_a) \geq \mathrm{var}(\hat{\tau}),$$

and the equality holds when $X'Z = 0$, where $Z = (z_1, \ldots, z_n)'$ is the vector formed by the regressors $z_i$'s and $X = (x_1, \ldots, x_n)'$ is the matrix formed by the regressors $x_i$'s. The orthogonality of the regressors (i.e., $X'Z = 0$) ensures that $R^2_{z|x} = 0$ and $\hat{\tau} = \hat{\tau}_a$.
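The two equivalent forms of the VIF in this section can also be checked directly. A short sketch, again assuming `numpy` with made-up data:

```python
import numpy as np

# Compute the VIF two equivalent ways: as TSS/RSS from the OLS fit of z on
# (1, x), and as 1/(1 - R^2) with R^2 the sample R-squared between z and x.
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=(n, 3))
x = x - x.mean(axis=0)
z = 0.8 * x[:, 0] + rng.normal(size=n)

X1x = np.column_stack([np.ones(n), x])
z_check = z - X1x @ np.linalg.lstsq(X1x, z, rcond=None)[0]

tss = np.sum((z - z.mean()) ** 2)   # total sum of squares of z
rss = np.sum(z_check ** 2)          # residual sum of squares
vif = tss / rss
r2 = 1.0 - rss / tss
print(vif >= 1.0, abs(vif - 1.0 / (1.0 - r2)) < 1e-10)  # True True
```

Because the residual sum of squares can only fall as covariates are added, `vif` is never below 1, matching the inequality above.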

## 2 Analysis of covariance

Now we consider a special case of (1): the $x_i$'s are pre-treatment covariates, the $z_i$'s are the binary treatment indicators (1 for the treatment and 0 for the control), and the $y_i$'s are the outcomes of interest. Then (1) is the standard ANCOVA model [6], and the parameter of interest $\tau$ is the treatment effect. Let $n_1 = \sum_{i=1}^n z_i$ and $n_0 = \sum_{i=1}^n (1 - z_i)$ be the numbers of units under the treatment and control, respectively. As in Section 1, we assume that the $\varepsilon_i$'s are IID with mean 0 and variance $\sigma^2$. In a completely randomized experiment, we further assume that the $z_i$'s come from a random permutation of $n_1$ 1's and $n_0$ 0's.

Because $z_i$ is binary, we can define

$$\hat{\delta}_x = n_1^{-1}\sum_{i=1}^n z_i x_i - n_0^{-1}\sum_{i=1}^n (1 - z_i) x_i, \qquad \hat{\delta}_\varepsilon = n_1^{-1}\sum_{i=1}^n z_i \varepsilon_i - n_0^{-1}\sum_{i=1}^n (1 - z_i) \varepsilon_i$$

as the differences in means of the $x_i$'s and $\varepsilon_i$'s, respectively. Further let $\hat{\beta}$ be the OLS estimator for $\beta$ in (1). The estimator $\hat{\tau}$ in (3) without adjusting for covariates simplifies to the difference in means of the outcome:

$$\hat{\tau} = n_1^{-1}\sum_{i=1}^n z_i y_i - n_0^{-1}\sum_{i=1}^n (1 - z_i) y_i,$$

which further simplifies to

$$\hat{\tau} = n_1^{-1}\sum_{i=1}^n z_i(\alpha + \tau z_i + \beta' x_i + \varepsilon_i) - n_0^{-1}\sum_{i=1}^n (1 - z_i)(\alpha + \tau z_i + \beta' x_i + \varepsilon_i) = \tau + \beta'\hat{\delta}_x + \hat{\delta}_\varepsilon \tag{4}$$

under (1). Based on the OLS fit of (1), the estimator $\hat{\tau}_a$ adjusting for the covariates simplifies to

$$\hat{\tau}_a = n_1^{-1}\sum_{i=1}^n z_i(y_i - \hat{\beta}' x_i) - n_0^{-1}\sum_{i=1}^n (1 - z_i)(y_i - \hat{\beta}' x_i) = \hat{\tau} - \hat{\beta}'\hat{\delta}_x = \tau + (\beta - \hat{\beta})'\hat{\delta}_x + \hat{\delta}_\varepsilon.$$

With large samples, we can ignore the term $(\beta - \hat{\beta})'\hat{\delta}_x$ above to obtain

$$\hat{\tau}_a \approx \tau + \hat{\delta}_\varepsilon,$$

because $(\beta - \hat{\beta})'\hat{\delta}_x = O_P(n^{-1})$ is of higher order due to $\hat{\beta} - \beta = O_P(n^{-1/2})$ and $\hat{\delta}_x = O_P(n^{-1/2})$, both justified by central limit theorems under certain moment conditions. See [7] for technical details.

Under complete randomization, we can show that

$$E(\hat{\delta}_\varepsilon) = 0, \qquad E(\hat{\delta}_x) = 0$$

based on a standard result for the differences in means [8],

$$\mathrm{var}(\hat{\delta}_\varepsilon) = \frac{n}{n_1 n_0}\,\sigma^2, \qquad \mathrm{var}(\hat{\delta}_x) = \frac{n}{n_1 n_0}\,S_x^2, \tag{5}$$

where $S_x^2 = (n-1)^{-1}\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})'$ is the finite population covariance of the $x_i$'s [9], and moreover, the uncorrelatedness of the two differences in means [7]:

$$\mathrm{cov}(\hat{\delta}_\varepsilon, \hat{\delta}_x) = 0.$$

Then $E(\hat{\tau}) = \tau$ and $E(\hat{\tau}_a) \approx \tau$, i.e., $\hat{\tau}$ is unbiased and $\hat{\tau}_a$ is consistent for $\tau$. Their variances satisfy

$$\mathrm{var}(\hat{\tau}) - \mathrm{var}(\hat{\tau}_a) \approx \mathrm{var}(\beta'\hat{\delta}_x) = \frac{n}{n_1 n_0}\,\beta' S_x^2 \beta \geq 0.$$

Thus, if β ≠ 0 then ANCOVA improves estimation efficiency, at least asymptotically. See [10], [11] and [12] for textbook discussions.
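The efficiency gain from ANCOVA can be seen in a small Monte Carlo study. This is a sketch assuming `numpy`; all constants ($n$, $\tau$, $\beta$, $\sigma$) are made up for illustration.

```python
import numpy as np

# Monte Carlo comparison of the unadjusted and adjusted estimators
# under complete randomization with a predictive scalar covariate.
rng = np.random.default_rng(2)
n, n1, tau, beta, sigma = 100, 50, 1.0, 2.0, 1.0
x = rng.normal(size=n)
x = x - x.mean()                             # fixed, centered covariate

taus, taus_a = [], []
for _ in range(2000):
    z = np.zeros(n)
    z[rng.permutation(n)[:n1]] = 1.0         # random permutation of n1 1's, n0 0's
    y = 0.5 + tau * z + beta * x + sigma * rng.normal(size=n)
    taus.append(y[z == 1].mean() - y[z == 0].mean())          # tau_hat
    X = np.column_stack([np.ones(n), z, x])
    taus_a.append(np.linalg.lstsq(X, y, rcond=None)[0][1])    # tau_hat_a
print(np.var(taus_a) < np.var(taus))  # True: adjustment helps since beta != 0
```

With $\beta = 0$ the two Monte Carlo variances would be close, consistent with the displayed difference $\mathrm{var}(\beta'\hat{\delta}_x)$.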

## 3 From conflict to unification

### 3.1 A unified data generating process

From the VIF result, we see that adding more covariates never decreases the variance of an OLS coefficient. In contrast, from the ANCOVA result, we see that adding more covariates never increases the variance of an OLS coefficient, at least asymptotically. These two results are both standard in textbooks on linear models and experimental design. However, they seem to give opposite conclusions. Both results are derived under the linear model (1), and therefore, these two conflicting results seem paradoxical.

If we go back and check the derivations above carefully, we will find that Section 1 assumes that the $z_i$'s and $x_i$'s are both fixed, but Section 2 assumes that the $z_i$'s are random and the $x_i$'s are fixed. Therefore, the VIF and ANCOVA results hold under different assumptions on the treatment indicators. This vaguely explains the paradox.

Technically, the settings for VIF and ANCOVA are slightly different. For example, $z_i$ can be general and may have an arbitrary correlation structure with $x_i$ in the VIF result, but it is binary and arises from complete randomization in the ANCOVA result. The data generating process below comes from the intersection of the settings for the two results, which allows for a more unified discussion of them.

Consider the following data generating process: for $i = 1, \ldots, n$,

(a) fix the $x_i$'s and center them at $\bar{x} = n^{-1}\sum_{i=1}^n x_i = 0$;

(b) generate the potential outcomes under control as
$$y_i(0) = \alpha + \beta' x_i + \varepsilon_i,$$
where the components of $\mathcal{E} = (\varepsilon_1, \ldots, \varepsilon_n)'$ are IID with mean 0 and variance $\sigma^2$;

(c) generate the potential outcomes under treatment as
$$y_i(1) = y_i(0) + \tau,$$
i.e., the individual treatment effect $y_i(1) - y_i(0)$ is constant;

(d) generate $Z = (z_1, \ldots, z_n)'$ from a random permutation of $n_1$ 1's and $n_0$ 0's;

(e) obtain the observed outcome as
$$y_i = z_i y_i(1) + (1 - z_i) y_i(0) = \tau z_i + y_i(0) = \alpha + \tau z_i + \beta' x_i + \varepsilon_i. \tag{6}$$

In (b) and (c), I use the potential outcomes notation due to [8]. Readers who are uncomfortable with the notation $y_i(1)$ and $y_i(0)$ can ignore steps (b) and (c) and view (6) as the data generating process with random $\varepsilon_i$'s and $z_i$'s.

In the above data generating process, τ represents the individual and thus the average treatment effect. It is the parameter of interest.
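Steps (a)–(e) translate directly into code. The following is a minimal sketch assuming `numpy`; the function name `draw` and all constants are illustrative, not from the paper.

```python
import numpy as np

def draw(x, n1, alpha, tau, beta, sigma, rng):
    """One draw from steps (b)-(e), holding the centered covariates x fixed."""
    n = len(x)
    eps = sigma * rng.normal(size=n)
    y0 = alpha + beta * x + eps            # (b) potential outcomes under control
    y1 = y0 + tau                          # (c) constant individual effect
    z = np.zeros(n)                        # (d) complete randomization:
    z[rng.permutation(n)[:n1]] = 1.0       #     a random permutation of n1 1's, n0 0's
    y = z * y1 + (1 - z) * y0              # (e) observed outcome, as in (6)
    return z, y

rng = np.random.default_rng(3)
x = rng.normal(size=100)
x = x - x.mean()                           # (a) fix and center the covariates
z, y = draw(x, 50, 0.5, 1.0, 2.0, 1.0, rng)
print(int(z.sum()))  # 50: exactly n1 treated units in every draw
```

Holding `x` fixed across repeated calls of `draw` mirrors the assumption that the covariates are fixed while $\mathcal{E}$ and $Z$ are random.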

### 3.2 Comparing the variances

Conditional on $Z$, (6) is a linear model with fixed $(z_i, x_i)$'s and homoskedastic errors $\varepsilon_i$'s. The discussion in Section 1 applies in this case. Then from the VIF result, we know that

$$\mathrm{var}(\hat{\tau}_a \mid Z) \geq \mathrm{var}(\hat{\tau} \mid Z), \tag{7}$$

i.e., the estimator adjusting for the covariates $x_i$'s has the larger variance. However, $\hat{\tau}_a$ is unbiased while $\hat{\tau}$ is biased. From classic OLS theory,

$$E(\hat{\tau}_a \mid Z) = \tau,$$

and from (4), the bias of $\hat{\tau}$ is

$$E(\hat{\tau} \mid Z) - \tau = \beta'\hat{\delta}_x.$$

Therefore, the smaller conditional variance of $\hat{\tau}$ comes at the cost of a larger conditional bias. The conditional bias of $\hat{\tau}$ vanishes only in the special cases with $\beta = 0$ or $\hat{\delta}_x = 0$, that is, when the covariates are unrelated to the outcome, or the covariates are perfectly balanced in means across the treatment and control groups.

Averaging over $Z$, we have random potential outcomes and random treatment indicators. The discussion in Section 2 applies in this case. We have shown that

$$E(\hat{\tau}) = \tau, \qquad E(\hat{\tau}_a) \approx \tau,$$

and moreover, asymptotically (ignoring higher order terms),

$$\mathrm{var}(\hat{\tau}) \geq \mathrm{var}(\hat{\tau}_a). \tag{8}$$

Mathematically, the efficiency reversal in (7) and (8) does not lead to a contradiction given the explicitly specified conditioning sets. Statistically, however, it forms a paradox similar to the classic Simpson's paradox of effect reversal due to different conditioning sets. I give a simple explanation of this paradox using the following decompositions based on the law of total variance:

$$\mathrm{var}(\hat{\tau}_a) = E\{\mathrm{var}(\hat{\tau}_a \mid Z)\} + \mathrm{var}\{E(\hat{\tau}_a \mid Z)\} = E\{\mathrm{var}(\hat{\tau}_a \mid Z)\} + \mathrm{var}(\tau) = E\{\mathrm{var}(\hat{\tau}_a \mid Z)\}$$

and

$$\mathrm{var}(\hat{\tau}) = E\{\mathrm{var}(\hat{\tau} \mid Z)\} + \mathrm{var}\{E(\hat{\tau} \mid Z)\} = E\{\mathrm{var}(\hat{\tau} \mid Z)\} + \mathrm{var}(\tau + \beta'\hat{\delta}_x) = E\{\mathrm{var}(\hat{\tau} \mid Z)\} + \mathrm{var}(\beta'\hat{\delta}_x).$$

Based on the VIF result, $E\{\mathrm{var}(\hat{\tau}_a \mid Z)\} \geq E\{\mathrm{var}(\hat{\tau} \mid Z)\}$, but their difference is small because the $R^2_{z|x}$ between $z_i$ and $x_i$ is close to zero under complete randomization. We can ignore their difference in the asymptotic analysis. More importantly, the unadjusted estimator has an additional variance term, $\mathrm{var}(\beta'\hat{\delta}_x)$, due to the conditional bias, which reverses the ordering of $\mathrm{var}(\hat{\tau}_a)$ and $\mathrm{var}(\hat{\tau})$.
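The two terms of the law of total variance for the unadjusted estimator can be estimated with a nested Monte Carlo: an outer loop over $Z$ and an inner loop over $\mathcal{E}$ given $Z$. This is a sketch assuming `numpy`, with made-up constants chosen so that the covariate is strongly predictive.

```python
import numpy as np

# Nested Monte Carlo for var(tau_hat) = E{var(tau_hat|Z)} + var{E(tau_hat|Z)}.
rng = np.random.default_rng(6)
n, n1, alpha, tau, beta, sigma = 60, 30, 0.5, 1.0, 2.0, 1.0
x = rng.normal(size=n)
x = x - x.mean()

cond_vars, cond_means = [], []
for _ in range(200):                       # outer loop: draws of Z
    z = np.zeros(n)
    z[rng.permutation(n)[:n1]] = 1.0
    ests = []
    for _ in range(200):                   # inner loop: epsilon, holding Z fixed
        y = alpha + tau * z + beta * x + sigma * rng.normal(size=n)
        ests.append(y[z == 1].mean() - y[z == 0].mean())
    cond_vars.append(np.var(ests))
    cond_means.append(np.mean(ests))

ev = np.mean(cond_vars)   # estimates E{var(tau_hat | Z)}
ve = np.var(cond_means)   # estimates var{E(tau_hat | Z)} = var(beta' delta_x)
print(ve > ev)  # True with these constants: the bias term dominates
```

With these constants the extra term $\mathrm{var}(\beta'\hat{\delta}_x)$ is the larger of the two, which is exactly why the unconditional ordering in (8) can reverse the conditional ordering in (7).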

### 3.3 Comparing the estimated variances

Sections 1 and 2 compare the variances of $\hat{\tau}_a$ and $\hat{\tau}$, which are theoretical quantities depending on the unknown true data generating process. In practice, standard statistical software packages report the estimated variances based on OLS:

$$\widehat{\mathrm{var}}(\hat{\tau}_a) = \frac{\hat{\sigma}^2_{y|z,x}}{\sum_{i=1}^n (z_i - \bar{z})^2} \times \frac{1}{1 - R^2_{z|x}}$$

and

$$\widehat{\mathrm{var}}(\hat{\tau}) = \frac{\hat{\sigma}^2_{y|z}}{\sum_{i=1}^n (z_i - \bar{z})^2},$$

where $\hat{\sigma}^2_{y|z,x}$ equals the residual sum of squares divided by $n - 2 - \dim(x)$ in the OLS fit of $y_i$ on $(1, z_i, x_i)$, and $\hat{\sigma}^2_{y|z}$ equals the residual sum of squares divided by $n - 2$ in the OLS fit of $y_i$ on $(1, z_i)$. The ratio of these two estimated variances depends on $\hat{\sigma}^2_{y|z,x}/\hat{\sigma}^2_{y|z}$ and $R^2_{z|x}$, and can be larger or smaller than 1. Importantly, this is a numeric result regardless of whether or not we condition on $Z$.

In fact, under the data generating process in Section 3.1, $\widehat{\mathrm{var}}(\hat{\tau}_a)$ is often smaller than $\widehat{\mathrm{var}}(\hat{\tau})$ as long as the covariates are predictive of the outcome. This is true due to two basic facts: first, $R^2_{z|x}$ is close to 0 under complete randomization, so the VIF can be ignored asymptotically; second, $\hat{\sigma}^2_{y|z,x}$ is often smaller than $\hat{\sigma}^2_{y|z}$ because the residual sum of squares decreases with an additional predictive covariate. This argument ignores the opposite impact of the degrees of freedom correction, which is reasonable when the sample size is large and the dimension of the covariates is small. See [13] for a discussion of $R^2$ with high dimensional covariates.
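The two reported variance estimates can be written out from the formulas above. A sketch assuming `numpy`, with simulated data and made-up constants where the covariate is strongly predictive:

```python
import numpy as np

# OLS-reported variance estimates for tau_hat and tau_hat_a.
rng = np.random.default_rng(4)
n, n1 = 200, 100
x = rng.normal(size=n)
x = x - x.mean()
z = np.zeros(n)
z[rng.permutation(n)[:n1]] = 1.0
y = 0.5 + 1.0 * z + 2.0 * x + rng.normal(size=n)
ssz = np.sum((z - z.mean()) ** 2)

# unadjusted: sigma^2_{y|z} uses n - 2 degrees of freedom
X1 = np.column_stack([np.ones(n), z])
res1 = y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
vhat = (res1 @ res1 / (n - 2)) / ssz

# adjusted: sigma^2_{y|z,x} uses n - 2 - dim(x) degrees of freedom, times the VIF
X2 = np.column_stack([np.ones(n), z, x])
res2 = y - X2 @ np.linalg.lstsq(X2, y, rcond=None)[0]
X1x = np.column_stack([np.ones(n), x])
z_check = z - X1x @ np.linalg.lstsq(X1x, z, rcond=None)[0]
vif = ssz / np.sum(z_check ** 2)
vhat_a = (res2 @ res2 / (n - 3)) / ssz * vif

print(vhat_a < vhat)  # True here: the covariate is strongly predictive
```

Here the drop in the residual sum of squares dwarfs the VIF, which stays near 1 under complete randomization.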

The above heuristic comparison of $\widehat{\mathrm{var}}(\hat{\tau}_a)$ and $\widehat{\mathrm{var}}(\hat{\tau})$ does not contradict the VIF result, which concerns the true variances conditional on $Z$. The estimated variances can differ from the true variances, especially when the linear model of $y_i$ on $(1, z_i)$ is misspecified.

## 4 Connection with randomization inference

### 4.1 Variances conditional on the error terms

Another conditioning scheme leads to the following discussion beyond Sections 1 and 2. Conditional on the error terms $\mathcal{E}$, we have fixed potential outcomes and a completely randomized $Z$. Statistical inference under this regime is called randomization inference, or design-based inference. The classic results from randomization inference are $E(\hat{\tau} \mid \mathcal{E}) = \tau$ [8], $E(\hat{\tau}_a \mid \mathcal{E}) \approx \tau$ [14, 15], and $\mathrm{var}(\hat{\tau} \mid \mathcal{E}) \geq \mathrm{var}(\hat{\tau}_a \mid \mathcal{E})$ asymptotically [14, 15]. This relative efficiency is coherent with $\mathrm{var}(\hat{\tau}) \geq \mathrm{var}(\hat{\tau}_a)$ asymptotically. Again we can explain the coherence using the following decompositions based on the law of total variance:

$$\mathrm{var}(\hat{\tau}_a) = E\{\mathrm{var}(\hat{\tau}_a \mid \mathcal{E})\} + \mathrm{var}\{E(\hat{\tau}_a \mid \mathcal{E})\} \approx E\{\mathrm{var}(\hat{\tau}_a \mid \mathcal{E})\} + \mathrm{var}(\tau) = E\{\mathrm{var}(\hat{\tau}_a \mid \mathcal{E})\}$$

and

$$\mathrm{var}(\hat{\tau}) = E\{\mathrm{var}(\hat{\tau} \mid \mathcal{E})\} + \mathrm{var}\{E(\hat{\tau} \mid \mathcal{E})\} = E\{\mathrm{var}(\hat{\tau} \mid \mathcal{E})\} + \mathrm{var}(\tau) = E\{\mathrm{var}(\hat{\tau} \mid \mathcal{E})\},$$

where both $\hat{\tau}_a$ and $\hat{\tau}$ are unbiased for $\tau$, at least asymptotically. These results are all in favor of ANCOVA, which improves the estimation efficiency. I summarize the results in Table 1.

### Table 1

Comparison under the data generating process in (a)–(e)

| | mean of $\hat{\tau}_a$ | mean of $\hat{\tau}$ | variance comparison |
|---|---|---|---|
| unconditional | $E(\hat{\tau}_a) = \tau$ | $E(\hat{\tau}) = \tau$ | $\mathrm{var}(\hat{\tau}_a) \leq \mathrm{var}(\hat{\tau})$ asymptotically |
| conditional on $Z$ | $E(\hat{\tau}_a \mid Z) = \tau$ | $E(\hat{\tau} \mid Z) = \tau + \beta'\hat{\delta}_x$ | $\mathrm{var}(\hat{\tau}_a \mid Z) \geq \mathrm{var}(\hat{\tau} \mid Z)$ |
| conditional on $\mathcal{E}$ | $E(\hat{\tau}_a \mid \mathcal{E}) \approx \tau$ | $E(\hat{\tau} \mid \mathcal{E}) = \tau$ | $\mathrm{var}(\hat{\tau}_a \mid \mathcal{E}) \leq \mathrm{var}(\hat{\tau} \mid \mathcal{E})$ asymptotically |

### 4.2 More general potential outcomes model

The data generating process in (a)–(e) assumes a constant treatment effect and homoskedastic errors. It yields the standard ANCOVA model (1) or (6). The literature on randomization-based causal inference often deals with more general potential outcomes, without requiring these linear model assumptions [7, 8, 9, 14, 15]. In general cases with possibly misspecified linear models, [14] criticized ANCOVA by showing that it might increase or decrease the efficiency compared to $\hat{\tau}$. In response to this critique, [15] proposed a modified ANCOVA estimator that also includes the interaction terms $z_i \times x_i$ in the OLS fit, and showed that this estimator is at least as efficient as $\hat{\tau}$ asymptotically. See [16] and [17] for related discussions.

### 4.3 A design issue

From the above discussion, $\hat{\delta}_x$ is the key quantity that causes the paradox. If it is zero, then $\hat{\tau}_a = \hat{\tau}$ and the paradox disappears. Complete randomization ensures $E(\hat{\delta}_x) = 0$, but in a particular allocation, $\hat{\delta}_x$ can differ from zero. From the experimental design perspective, $\hat{\delta}_x$ measures the covariate balance across the treatment and control groups. Complete randomization ensures covariate balance on average, but a particular allocation may still have covariate imbalance. [18] and [19] proposed rerandomization to improve the data generating process in (d) by forcing the treatment indicators $Z$ to satisfy

$$\hat{\delta}_x' \{\mathrm{cov}(\hat{\delta}_x)\}^{-1} \hat{\delta}_x = \hat{\delta}_x' \left( \frac{n}{n_1 n_0} S_x^2 \right)^{-1} \hat{\delta}_x \leq c_0,$$

where $c_0 > 0$ is a predetermined threshold. Under the randomization inference framework, [20] showed that this new experimental design improves the efficiency of $\hat{\tau}$, which is, in fact, close to the efficiency of $\hat{\tau}_a$ for small $c_0 \approx 0$. From our earlier discussion, rerandomization also reduces the conditional bias of $\hat{\tau}$ given $Z$ because it forces $\hat{\delta}_x$ to be small for any realized value of $Z$. Therefore, rerandomization can mitigate the paradox through experimental design. See [21] for a more unified discussion.
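For a scalar covariate, rerandomization amounts to redrawing the complete randomization until the balance criterion above is met. A minimal sketch assuming `numpy`; the function name `rerandomize`, the threshold, and the data are illustrative, with $\mathrm{cov}(\hat{\delta}_x)$ taken from (5).

```python
import numpy as np

def rerandomize(x, n1, c0, rng):
    """Redraw a complete randomization until the Mahalanobis balance
    criterion delta' cov(delta)^{-1} delta <= c0 holds (scalar covariate)."""
    n = len(x)
    n0 = n - n1
    s2 = x.var(ddof=1)                       # finite population variance S_x^2
    var_delta = n / (n1 * n0) * s2           # cov(delta_x) from (5), scalar case
    while True:
        z = np.zeros(n)
        z[rng.permutation(n)[:n1]] = 1.0     # one complete randomization
        delta = x[z == 1].mean() - x[z == 0].mean()
        if delta ** 2 / var_delta <= c0:     # accept only balanced allocations
            return z, delta

rng = np.random.default_rng(5)
x = rng.normal(size=100)
x = x - x.mean()
z, delta = rerandomize(x, 50, 0.1, rng)
print(int(z.sum()))  # 50: still a complete randomization, now with small delta
```

The smaller the threshold `c0`, the better the forced balance, at the cost of a lower acceptance rate in the rejection loop.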

## 5 Final remarks

I have shown that the seemingly paradoxical results on VIF and ANCOVA are due to different statistical assumptions. The key difference is whether or not statistical inference is conditional on the treatment indicators $Z$. Conditioning on $Z$, the unadjusted estimator has a smaller variance but a larger bias. Averaging over $Z$, both the unadjusted and adjusted estimators are consistent for $\tau$, but the variance of the adjusted estimator is no larger than that of the unadjusted estimator. In randomized experiments, we recommend using ANCOVA under a constant treatment effect model, or its modified version in general settings [15, 21].

I end this note with two minor technical issues. First, I assume that the $x_i$'s are fixed throughout the paper. With random covariates, we can condition on them to obtain the same results. The key in the discussion is whether or not to condition on $Z$. Second, if the $z_i$'s are IID Bernoulli random variables as in a Bernoulli experiment, we can condition on $(n_1, n_0)$ to reduce the discussion to complete randomization.

# Acknowledgement

The author thanks Ugur Yildirim for raising this question in his class of “Stat 230A Linear Models” at the University of California Berkeley, Luke Miratrix for the inspiring discussion of estimated variances in Section 3.3, and Anqi Zhao, Xinran Li, Liyun Chen, Zhichao Jiang, Jason Wu, and Yuting Ye for helpful suggestions. A reviewer made many constructive comments which helped to improve the paper significantly. This research was partially supported by the U. S. National Science Foundation (grant # 1945136).

Conflict of Interest: Prof. Peng Ding is a member of the Editorial Board of the Journal of Causal Inference but had no involvement in the final decision.

### References

[1] P. Ding. The Frisch–Waugh–Lovell theorem for standard errors. Statistics and Probability Letters, 168:108945, 2021.

[2] G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning. New York: Springer, 2013.

[3] J. J. Faraway. Linear Models with R. Boca Raton: Chapman and Hall/CRC, 2016.

[4] J. Fox. Applied Regression Analysis and Generalized Linear Models. Newbury Park, CA: Sage Publications, 2015.

[5] A. Agresti. Foundations of Linear and Generalized Linear Models. New York: John Wiley & Sons, 2015.

[6] R. A. Fisher. Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd, 1st edition, 1925.

[7] X. Li and P. Ding. General forms of finite population central limit theorems with applications to causal inference. Journal of the American Statistical Association, 112:1759–1769, 2017.

[8] J. Neyman. On the application of probability theory to agricultural experiments: Essay on principles, Section 9. Masters Thesis. Portions translated into English by D. Dabrowska and T. Speed (1990). Statistical Science, 5:465–472, 1923.

[9] G. W. Imbens and D. B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. New York: Cambridge University Press, 2015.

[10] O. Kempthorne. The Design and Analysis of Experiments. New York: Wiley, 1952.

[11] K. Hinkelmann and O. Kempthorne. Design and Analysis of Experiments, Volume 1, Introduction to Experimental Design, 2nd Edition. New York: John Wiley & Sons, 2007.

[12] D. R. Cox and N. Reid. The Theory of the Design of Experiments. Boca Raton: Chapman and Hall/CRC, 2000.

[13] D. A. Freedman. A note on screening regression equations. The American Statistician, 37:152–155, 1983.

[14] D. A. Freedman. On regression adjustments to experimental data. Advances in Applied Mathematics, 40:180–193, 2008.

[15] W. Lin. Agnostic notes on regression adjustments to experimental data: Reexamining Freedman's critique. Annals of Applied Statistics, 7:295–318, 2013.

[16] A. A. Tsiatis, M. Davidian, M. Zhang, and X. Lu. Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: A principled yet flexible approach. Statistics in Medicine, 27:4658–4677, 2008.

[17] A. Negi and J. M. Wooldridge. Revisiting regression adjustment in experiments with heterogeneous treatment effects. Econometric Reviews, in press, 2020.

[18] D. R. Cox. Randomization and concomitant variables in the design of experiments. In G. Kallianpur, P. R. Krishnaiah, and J. K. Ghosh, editors, Statistics and Probability: Essays in Honor of C. R. Rao, pages 197–202. North-Holland, Amsterdam, 1982.

[19] K. L. Morgan and D. B. Rubin. Rerandomization to improve covariate balance in experiments. The Annals of Statistics, 40:1263–1282, 2012.

[20] X. Li, P. Ding, and D. B. Rubin. Asymptotic theory of rerandomization in treatment-control experiments. Proceedings of the National Academy of Sciences of the United States of America, 115:9157–9162, 2018.

[21] X. Li and P. Ding. Rerandomization and regression adjustment. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82:241–268, 2020.