Two paradoxical results in linear models: the variance inflation factor and the analysis of covariance

A result from a standard linear model course is that the variance of the ordinary least squares (OLS) coefficient of a variable will never decrease if we add additional covariates. The variance inflation factor (VIF) measures the increase of the variance. Another result from a standard linear model or experimental design course is that including additional covariates in a linear model of the outcome on the treatment indicator will never increase the variance of the OLS coefficient of the treatment at least asymptotically. This technique is called the analysis of covariance (ANCOVA), which is often used to improve the efficiency of treatment effect estimation. So we have two paradoxical results: adding covariates never decreases the variance in the first result but never increases the variance in the second result. In fact, these two results are derived under different assumptions. More precisely, the VIF result conditions on the treatment indicators but the ANCOVA result requires random treatment indicators. In a completely randomized experiment, the estimator without adjusting for additional covariates has smaller conditional variance at the cost of a larger conditional bias, compared to the estimator adjusting for additional covariates. Thus, there is no real paradox.


Variance inflation factor
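Consider the following linear regression:

$$y_i = \tau z_i + \beta' x_i + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{1}$$

where the regressor $z_i$ is a scalar and $x_i$ is a vector containing 1. Using the Frisch-Waugh Theorem, we can write the OLS estimator for $\tau$ as

$$\hat\tau_a = \frac{\sum_{i=1}^n \check z_i y_i}{\sum_{i=1}^n \check z_i^2},$$

where $\check z_i$ is the residual from the OLS fit of $z_i$ on $x_i$. If the regressors $(z_i, x_i)$'s are all fixed and the $\varepsilon_i$'s are IID with mean 0 and variance $\sigma^2$, then we can express the variance of $\hat\tau_a$ as

$$\mathrm{var}(\hat\tau_a) = \frac{\sigma^2}{\sum_{i=1}^n \check z_i^2} = \frac{\sigma^2}{\sum_{i=1}^n (z_i - \bar z)^2} \times \frac{\sum_{i=1}^n (z_i - \bar z)^2}{\sum_{i=1}^n \check z_i^2}. \tag{2}$$

The first term of (2) is the variance of $\hat\tau$, i.e., the coefficient of $z_i$ in the OLS fit of $y_i$ on $(z_i, 1)$. The second term of (2) is the VIF, no smaller than 1, because it is the total sum of squares divided by the residual sum of squares in the OLS fit of $z_i$ on $x_i$. See Faraway (2016), Fox (2015) and Agresti (2015) for textbook discussions.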
Thus, from (2), $\mathrm{var}(\hat\tau_a)$ will never decrease with more covariates in (1), because the residual sum of squares $\sum_{i=1}^n \check z_i^2$ can only decrease or stay the same while the total sum of squares $\sum_{i=1}^n (z_i - \bar z)^2$ is fixed. An immediate result is that $\mathrm{var}(\hat\tau_a) \ge \mathrm{var}(\hat\tau)$.
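The following is a minimal numerical sketch of the two facts used above: the Frisch-Waugh representation of the coefficient of $z_i$, and the VIF as a ratio of the total to the residual sum of squares. The design, the coefficient values, and the helper name `residualize` are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical design (not from the text): intercept plus two covariates,
# and a scalar regressor z that is correlated with the first covariate.
x = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
z = 0.8 * x[:, 1] + rng.normal(size=n)
y = 1.5 * z + x @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)


def residualize(z, x):
    """Residual from the OLS fit of z on the columns of x."""
    gamma, *_ = np.linalg.lstsq(x, z, rcond=None)
    return z - x @ gamma


# Frisch-Waugh: the ratio based on residualized z equals the
# coefficient of z in the full OLS fit of y on (z, x).
z_check = residualize(z, x)
tau_fw = z_check @ y / (z_check @ z_check)
coef, *_ = np.linalg.lstsq(np.column_stack([z, x]), y, rcond=None)
tau_full = coef[0]

# VIF = total sum of squares / residual sum of squares in the fit of z on x.
tss = np.sum((z - z.mean()) ** 2)
vif_full = tss / (z_check @ z_check)

# With one covariate fewer, the residual sum of squares cannot be smaller,
# so the VIF cannot be larger.
z_check_small = residualize(z, x[:, :2])
vif_small = tss / (z_check_small @ z_check_small)

print(f"tau (Frisch-Waugh) = {tau_fw:.4f}, tau (full OLS) = {tau_full:.4f}")
print(f"VIF with one covariate = {vif_small:.4f}, with two = {vif_full:.4f}")
```

Both VIF values are at least 1 because the intercept is among the regressors, and the VIF with two covariates is at least the VIF with one.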

Analysis of covariance
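Now we view (1) in a slightly different way: the $x_i$'s are pretreatment covariates, the $z_i$'s are the binary treatment indicators, and the $y_i$'s are the outcomes of interest. Then (1) is the standard ANCOVA model, and the parameter $\tau$ is the treatment effect of interest. Let $n_1 = \sum_{i=1}^n z_i$ and $n_0 = \sum_{i=1}^n (1 - z_i)$. Because $z_i$ is binary, we can simplify the expression of $\hat\tau$ to

$$\hat\tau = n_1^{-1} \sum_{i=1}^n z_i y_i - n_0^{-1} \sum_{i=1}^n (1 - z_i) y_i = \tau + \beta' \hat\delta_x + \hat\delta_\varepsilon \tag{3}$$

and the expression of $\hat\tau_a$ to

$$\hat\tau_a = \hat\tau - \hat\beta' \hat\delta_x = \tau + \hat\delta_\varepsilon - (\hat\beta - \beta)' \hat\delta_x, \tag{4}$$

where

$$\hat\delta_x = n_1^{-1} \sum_{i=1}^n z_i x_i - n_0^{-1} \sum_{i=1}^n (1 - z_i) x_i, \qquad \hat\delta_\varepsilon = n_1^{-1} \sum_{i=1}^n z_i \varepsilon_i - n_0^{-1} \sum_{i=1}^n (1 - z_i) \varepsilon_i$$

are the differences-in-means of $x$ and $\varepsilon$, and $\hat\beta$ is the OLS estimator for $\beta$ in (1). We ignore the last term of (4) because with large samples, $\hat\beta \to \beta$ in probability.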
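As a sanity check on the simplification for binary $z_i$, the sketch below verifies two exact algebraic identities on one simulated data set, under my reading of (3) and (4): $\hat\tau = \tau + \beta'\hat\delta_x + \hat\delta_\varepsilon$ and $\hat\tau_a = \tau + \hat\delta_\varepsilon - (\hat\beta - \beta)'\hat\delta_x$. The parameter values and the helper `diff_in_means` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau = 100, 2.0
beta = np.array([1.0, 0.5, -1.5])  # first entry multiplies the intercept

# Hypothetical data from model (1) with a binary z.
x = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
z = rng.binomial(1, 0.5, size=n)
eps = rng.normal(size=n)
y = tau * z + x @ beta + eps


def diff_in_means(v, z):
    """Treated-group mean minus control-group mean of v."""
    return v[z == 1].mean(axis=0) - v[z == 0].mean(axis=0)


tau_hat = diff_in_means(y, z)      # unadjusted estimator
delta_x = diff_in_means(x, z)      # its intercept entry is exactly 0
delta_eps = diff_in_means(eps, z)

# Adjusted estimator: coefficient of z in the OLS fit of y on (z, x).
coef, *_ = np.linalg.lstsq(np.column_stack([z, x]), y, rcond=None)
tau_hat_a, beta_hat = coef[0], coef[1:]

rhs_unadjusted = tau + beta @ delta_x + delta_eps
rhs_adjusted = tau + delta_eps - (beta_hat - beta) @ delta_x

print(f"tau_hat   = {tau_hat:.6f}, decomposition = {rhs_unadjusted:.6f}")
print(f"tau_hat_a = {tau_hat_a:.6f}, decomposition = {rhs_adjusted:.6f}")
```

The second identity holds exactly because the OLS residuals are orthogonal to both $z_i$ and the intercept, so they average to zero within each treatment group.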
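As in Section 1, we assume that the $\varepsilon_i$'s are IID with mean 0 and variance $\sigma^2$. We further assume that the $z_i$'s are IID Bernoulli($\pi$), so that if we condition on $(n_1, n_0)$, then $(z_1, \ldots, z_n)$ is a random permutation of $n_1$ 1's and $n_0$ 0's. We can show that $E(\hat\delta_\varepsilon) = 0$, $E(\hat\delta_x) = 0$, and

$$\mathrm{var}(\hat\delta_\varepsilon) = \frac{n}{n_1 n_0} \sigma^2, \qquad \mathrm{cov}(\hat\delta_x) = \frac{n}{n_1 n_0} S_x^2, \qquad \mathrm{cov}(\hat\delta_\varepsilon, \hat\delta_x) = 0, \tag{5}$$

where $S_x^2 = (n-1)^{-1} \sum_{i=1}^n (x_i - \bar x)(x_i - \bar x)'$ is the finite-population covariance matrix of the covariates. The first variance and the third covariance in (5) follow from standard variance and covariance calculations by first conditioning on all $z_i$'s, and the second variance in (5) follows from standard results on complete randomization (e.g., Li and Ding 2017). Then $E(\hat\tau) = \tau$ and $E(\hat\tau_a) \approx \tau$, i.e., $\hat\tau$ is unbiased and $\hat\tau_a$ is consistent for $\tau$. Moreover, (5) implies $\mathrm{var}(\hat\tau) = \frac{n}{n_1 n_0} (\sigma^2 + \beta' S_x^2 \beta)$, whereas after ignoring the term involving $\hat\beta - \beta$, $\mathrm{var}(\hat\tau_a) \approx \frac{n}{n_1 n_0} \sigma^2$, so asymptotically $\mathrm{var}(\hat\tau) \ge \mathrm{var}(\hat\tau_a)$.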
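The moment calculations can be checked by simulation. The sketch below is a rough Monte Carlo check, assuming a single scalar covariate, a fixed split $(n_1, n_0)$, and that the moments in (5) take the standard forms $\mathrm{var}(\hat\delta_\varepsilon) = n\sigma^2/(n_1 n_0)$ and $\mathrm{var}(\hat\delta_x) = n S_x^2/(n_1 n_0)$ with $S_x^2$ computed with divisor $n - 1$. All numerical settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n1, sigma, reps = 60, 20, 1.5, 5000
n0 = n - n1

x = rng.normal(size=n)   # one fixed, hypothetical covariate
s2_x = x.var(ddof=1)     # finite-population variance with divisor n - 1

z_template = np.r_[np.ones(n1), np.zeros(n0)].astype(bool)
delta_x = np.empty(reps)
delta_eps = np.empty(reps)
for r in range(reps):
    z = rng.permutation(z_template)           # complete randomization
    eps = rng.normal(scale=sigma, size=n)     # fresh IID errors
    delta_x[r] = x[z].mean() - x[~z].mean()
    delta_eps[r] = eps[z].mean() - eps[~z].mean()

scale = n / (n1 * n0)
print(f"mean of delta_x   = {delta_x.mean():.4f}, mean of delta_eps = {delta_eps.mean():.4f}")
print(f"var of delta_eps  = {delta_eps.var():.4f}, theory = {scale * sigma**2:.4f}")
print(f"var of delta_x    = {delta_x.var():.4f}, theory = {scale * s2_x:.4f}")
print(f"corr(delta_x, delta_eps) = {np.corrcoef(delta_x, delta_eps)[0, 1]:.4f}")
```

The simulated means and correlation should be close to zero, and the simulated variances close to their theoretical counterparts, up to Monte Carlo error.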

From conflict to unification
From the VIF result, we see that adding more covariates will never decrease the variance of an OLS coefficient. In contrast, from the ANCOVA result, we see that adding more covariates will never increase the variance of an OLS coefficient, at least asymptotically. These two results are both standard in textbooks of linear models or experimental designs. However, they seem to give opposite conclusions. Both results are derived under the linear model (1), and therefore, these two conflicting results seem paradoxical.
If we go back to the derivations above carefully, we will find that Section 1 assumes that the $z_i$'s and $x_i$'s are both fixed, but Section 2 assumes that the $z_i$'s are random and the $x_i$'s are fixed. Therefore, the VIF and the ANCOVA results hold under different model assumptions. This vaguely explains the paradox. Below, we give a more unified discussion.
Consider the following data generating process: for $i = 1, \ldots, n$, (a) the $x_i$'s are fixed constants with the first component being 1; (b) generate the potential outcomes under control as $y_i(0) = \beta' x_i + \varepsilon_i$, where $\mathcal{E} = (\varepsilon_1, \ldots, \varepsilon_n)$ are IID with mean 0 and variance $\sigma^2$; (c) generate the potential outcomes under treatment as $y_i(1) = y_i(0) + \tau$, i.e., the individual treatment effect $y_i(1) - y_i(0)$ is constant $\tau$; (d) generate $Z = (z_1, \ldots, z_n)$ IID from Bernoulli($\pi$); (e) the observed outcome is

$$y_i = z_i y_i(1) + (1 - z_i) y_i(0) = \tau z_i + \beta' x_i + \varepsilon_i. \tag{6}$$

In (b) and (c), I use the potential outcomes notation (Neyman 1923). Readers who are uncomfortable with $y_i(1)$ and $y_i(0)$ can ignore (b) and (c) and view (6) as the data generating process with random $\varepsilon_i$'s and $z_i$'s. Then $\tau$ is the average treatment effect parameter of interest.
Conditional on $Z$, (6) is a linear model with fixed $(z_i, x_i)$'s and homoskedastic errors $\varepsilon_i$'s. The discussion in Section 1 applies in this case. Then from the VIF result, we know that $\mathrm{var}(\hat\tau_a \mid Z) \ge \mathrm{var}(\hat\tau \mid Z)$, i.e., the estimator adjusting for covariates $x_i$'s has variance at least as large. However, $\hat\tau_a$ is an unbiased estimator, but $\hat\tau$ is not. From the classic OLS theory, $E(\hat\tau_a \mid Z) = \tau$, and from (3), the bias of $\hat\tau$ is $E(\hat\tau \mid Z) - \tau = \beta' \hat\delta_x$, where $\hat\delta_x$ is the difference-in-means of the covariates; this bias is generally nonzero for a given $Z$. Therefore, the smaller conditional variance of $\hat\tau$ comes at the cost of having a larger conditional bias.
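The conditional bias-variance trade-off can be illustrated by holding one realized $Z$ fixed and redrawing only the errors. The sketch below is a minimal Monte Carlo illustration under hypothetical parameter values; it compares the simulated conditional moments with $\beta'\hat\delta_x$ for the bias of $\hat\tau$, $\sigma^2 n/(n_1 n_0)$ for its variance, and $\sigma^2/\sum_i \check z_i^2$ for the variance of $\hat\tau_a$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, tau, sigma, reps = 100, 2.0, 1.0, 20000
beta = np.array([1.0, 2.0])  # hypothetical coefficients (intercept, covariate)

x = np.column_stack([np.ones(n), rng.normal(size=n)])
# Z is drawn once and then held fixed across all replications.
z = rng.permutation(np.r_[np.ones(50), np.zeros(50)])
treated = z == 1
n1, n0 = treated.sum(), (~treated).sum()

# Residual of z on x, used for the adjusted estimator (Frisch-Waugh).
gamma, *_ = np.linalg.lstsq(x, z, rcond=None)
z_check = z - x @ gamma

# Redraw only the errors: one row of y per replication.
eps = rng.normal(scale=sigma, size=(reps, n))
y = tau * z + x @ beta + eps

tau_hat = y[:, treated].mean(axis=1) - y[:, ~treated].mean(axis=1)
tau_hat_a = y @ z_check / (z_check @ z_check)

delta_x = x[treated].mean(axis=0) - x[~treated].mean(axis=0)
bias_theory = beta @ delta_x
var_unadj_theory = sigma**2 * n / (n1 * n0)
var_adj_theory = sigma**2 / (z_check @ z_check)

print(f"bias of tau_hat:   simulated {tau_hat.mean() - tau:.4f}, theory {bias_theory:.4f}")
print(f"bias of tau_hat_a: simulated {tau_hat_a.mean() - tau:.4f}, theory 0")
print(f"var of tau_hat:    simulated {tau_hat.var():.5f}, theory {var_unadj_theory:.5f}")
print(f"var of tau_hat_a:  simulated {tau_hat_a.var():.5f}, theory {var_adj_theory:.5f}")
```

Because $Z$ is independent of the covariates here, the VIF is close to 1, so the variance gap is small while the conditional bias of the unadjusted estimator can be sizable.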
Conditional on the errors $\mathcal{E} = (\varepsilon_1, \ldots, \varepsilon_n)$ and $(n_1, n_0)$, we have fixed potential outcomes and completely randomized $Z$.
Thus, under a constant treatment effect model, ANCOVA improves efficiency asymptotically.
Conditional only on $(n_1, n_0)$, we have random potential outcomes and random treatment indicators. The discussion in Section 2 applies in this case. We have shown that $E(\hat\tau \mid n_1, n_0) = \tau$ and $E(\hat\tau_a \mid n_1, n_0) \approx \tau$, and moreover, asymptotically, $\mathrm{var}(\hat\tau \mid n_1, n_0)$ is at least as large as $\mathrm{var}(\hat\tau_a \mid n_1, n_0)$.
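A small simulation can illustrate this case, in which both the treatment assignment and the errors are redrawn in every replication with $(n_1, n_0)$ held fixed. This is only a sketch under hypothetical parameter values, with one covariate that is strongly predictive of the outcome.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n1, tau, sigma, reps = 100, 50, 2.0, 1.0, 4000
beta = np.array([1.0, 2.0])  # hypothetical coefficients (intercept, covariate)

x = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed covariates
z_template = np.r_[np.ones(n1), np.zeros(n - n1)]

tau_hat = np.empty(reps)
tau_hat_a = np.empty(reps)
for r in range(reps):
    z = rng.permutation(z_template)  # complete randomization given (n1, n0)
    y = tau * z + x @ beta + rng.normal(scale=sigma, size=n)
    # Unadjusted estimator: difference in means.
    tau_hat[r] = y[z == 1].mean() - y[z == 0].mean()
    # Adjusted estimator: coefficient of z in the OLS fit of y on (z, x).
    coef, *_ = np.linalg.lstsq(np.column_stack([z, x]), y, rcond=None)
    tau_hat_a[r] = coef[0]

print(f"unadjusted: mean {tau_hat.mean():.4f}, variance {tau_hat.var():.4f}")
print(f"adjusted:   mean {tau_hat_a.mean():.4f}, variance {tau_hat_a.var():.4f}")
```

Both estimators should be centered near the hypothetical $\tau = 2$, with the adjusted estimator showing the smaller variance, in contrast with the ordering obtained when $Z$ is held fixed.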

Some final remarks
I have shown that the seemingly paradoxical results of VIF and ANCOVA are due to different statistical assumptions. The key difference is whether the treatment indicators $Z$ are random or not. In a model with fixed $Z$, the unadjusted estimator has smaller variance but larger bias. In a model with random $Z$, both the unadjusted and adjusted estimators are consistent for $\tau$, but the variance of the adjusted estimator is asymptotically no larger than the variance of the unadjusted estimator. In randomized experiments, we therefore still prefer using ANCOVA.
In (a), I fix the $x_i$'s. With random covariates, we can condition on them and obtain the same results. Again, the key is whether $Z$ is random or not. I do not focus on the conditions for asymptotic analyses. See Freedman (2008), Lin (2013) and Li and Ding (2017) for more details.
In this short note, I do not focus on more general potential outcomes models. The data generating process in (a)-(e) assumes a constant treatment effect. It yields the standard ANCOVA model (1) or (6). The literature on randomization-based causal inference often does not assume a constant treatment effect (Neyman 1923; Freedman 2008; Lin 2013; Li and Ding 2017). In those general cases, ANCOVA may increase or decrease the efficiency (Freedman 2008), but simply adding the interaction term $z_i \times x_i$ with centered $x_i$'s gives an estimator that is asymptotically at least as efficient as the unadjusted estimator (Lin 2013).
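For concreteness, the interacted regression mentioned above can be sketched as follows. The function name `lin_estimator`, the simulated data, and the effect sizes are hypothetical; the sketch only shows how the regression is formed, namely the coefficient of $z_i$ in the OLS fit of $y_i$ on $(1, z_i, x_i - \bar x, z_i (x_i - \bar x))$, and says nothing about its standard error.

```python
import numpy as np


def lin_estimator(y, z, x):
    """Coefficient of z in the OLS fit of y on (1, z, xc, z * xc),
    where xc = x - mean(x) and x holds covariates WITHOUT an intercept column."""
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    xc = x - x.mean(axis=0)  # center covariates over the full sample
    design = np.column_stack([np.ones(len(y)), z, xc, z[:, None] * xc])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=(n, 2))
z = rng.permutation(np.r_[np.ones(n // 2), np.zeros(n - n // 2)])
xc = x - x.mean(axis=0)

# Hypothetical heterogeneous effects: unit-level effect 2 + xc @ (1, -1),
# which averages to exactly 2 over the sample because xc is centered.
y0 = x @ np.array([1.0, 0.5]) + rng.normal(size=n)
y1 = y0 + 2.0 + xc @ np.array([1.0, -1.0])
y = z * y1 + (1 - z) * y0

estimate = lin_estimator(y, z, x)
print(f"interacted-regression estimate of the average effect: {estimate:.4f}")
```

Centering the covariates is what makes the coefficient of $z_i$ interpretable as an average effect; without centering, that coefficient would instead correspond to the effect at $x_i = 0$.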