In a fully parametric setup, when the distributional specification is available, one may be interested in whether the mean regression takes a particular restricted functional form. While the unrestricted regression may be inferred from the specified distribution and estimated from the data, it is likely to allow a rich variety of shapes.1 In such a case, it is often of interest whether the shape of the mean regression reduces to some functional form implied by economic theory, tradition in the literature, or visual inspection; at the same time, it may be problematic to test directly for parametric restrictions embedded in the hypothesized shape. As an example, Figure 2 from our illustrative application, based on a complex mixed continuous/discrete distribution, presents two regressions of a wage variable on a variable representing education and on a variable representing age, derived from the estimator of the fully parametric (i.e. joint distributional) model. One of these regressions looks quite linear to the naked eye, and is widely assumed to be linear in the literature, but is it truly linear? The other may seem to be cubic or quartic, but is it truly such? Are the observable deviations from a low-order polynomial due merely to sampling error, or do they provide evidence against these simple forms of the conditional mean?
In this paper, we develop a test for a parametric functional form of a mean regression when the full parametric model for all variables is estimated.2 A natural test statistic is the average squared deviation of the regression function implied by the estimated parametric model from the hypothesized functional form. We derive the asymptotic distribution of the test statistic, which turns out to be a weighted sum of χ2 distributions with one degree of freedom. Even though the test statistic is non-pivotizable (except possibly in some special cases when the distribution collapses to a single scaled χ2 distribution with one degree of freedom), the test is easy to implement: the weights can be estimated using numerical derivatives of the true and hypothesized regression functions and the score function. We demonstrate good size and power properties of the test in finite samples using two simple stylized models – one based on bivariate normality, the other on mixed continuous/discrete marginals linked by a copula. Finally, we illustrate the test using Card’s (1995) data on wage, education and age of a few thousand US young men. Although the regressions may look linear to the naked eye, the test decidedly rejects linearity of the regressions of log-wage on education, log-wage on age, and log-wage on both education and age, as well as low-order (quartic) polynomial analogs of these.
There exists a variety of tests for a parametric form of a mean regression against nonparametric alternatives; see, for instance, Härdle and Mammen (1990) and Horowitz and Spokoiny (2001). This is also a valid approach to testing a regression’s parametric specification. However, when the whole framework is parametric, it is more natural to utilize it and perform testing within the parametric distributional model. In addition, from the technical standpoint, the nonparametric tests usually involve kernel estimation of the mean regression and bootstrapping of the test statistic, so their implementation is more involved than that of the test proposed here.
The paper is structured as follows. In Section 2 the setup is described, the assumptions are laid out, and properties of auxiliary estimates are derived. In Section 3 the test statistic and its asymptotic properties are presented, and the implementation of the test is described. Section 4 contains two illustrative examples, accompanied by simulation evidence. In Section 5, we illustrate how the test works using labor market data. Finally, Section 6 concludes. All proofs and tedious derivations are relegated to the Appendix. Notes on notation: denotes the L2 norm of a matrix; by we denote the dimensionality of a vector, and by the rank of a matrix.
2 Setup and Estimation
Suppose there is a parametric density3 model , θ ∈ Θ, for a pair of random variables (u, v), u being scalar and v possibly multidimensional, and let θ0 be the true value of the parameter θ. The implied mean regression for u given v is the value at θ0 of the conditional expectation function
is the conditional density of u given v, and
is the marginal density of v. The estimated implied regression is then
where is the maximum likelihood estimator of θ0:
We would like to compare the implied regression (1) to the parametric functional form , where
The estimator of the (pseudo)true value of the parameter β0 is based on least squares:4
Denote and . Because the test to be developed will need information on the asymptotic correlation between and , we frame the two estimation problems inside one joint optimization problem5
and the asymptotic variance estimate for can be obtained numerically from this optimization problem. The factor is added for convenience in computing the derivatives; its presence (or that of any other positive factor) affects neither the estimator nor its properties.
Let us have a closer look at the structure of . Because is an extremum estimator, it has the sandwich form , where H is a Hessian matrix and Ω is the variance matrix of first derivatives. Because of the additive structure of in θ and β, H has a block-diagonal form
and the northwest corner is occupied by −Hf because of the information matrix equality (recall that is an ML estimate). Putting the pieces together,
Note that while H is necessarily of full rank, the matrix Ω may well be singular. In one of the examples in Section 4, rk(Ω) = 4 while . The matrices Hf, , and can easily be estimated using numerical derivatives and the parameter estimate already obtained.
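For concreteness, the numerical-derivative estimation of the sandwich components can be sketched as follows. This is an illustrative sketch rather than the paper's code: the per-observation objective `q` and the helper names `num_grad`, `num_hess` and `sandwich_blocks` are hypothetical stand-ins for the joint objective described above.

```python
import numpy as np

def num_grad(f, x, h=1e-5):
    """Two-sided finite-difference gradient of a scalar function f at x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        g[j] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

def num_hess(f, x, h=1e-4):
    """Finite-difference Hessian via differences of the numerical gradient."""
    x = np.asarray(x, dtype=float)
    k = x.size
    H = np.zeros((k, k))
    for j in range(k):
        e = np.zeros(k)
        e[j] = h
        H[:, j] = (num_grad(f, x + e, h) - num_grad(f, x - e, h)) / (2.0 * h)
    return (H + H.T) / 2.0  # symmetrize against rounding error

def sandwich_blocks(q, theta_hat, data):
    """Estimate H (average Hessian of the objective) and Omega (average outer
    product of per-observation first derivatives) at the estimate theta_hat."""
    n = len(data)
    scores = np.array([num_grad(lambda t: q(t, z), theta_hat) for z in data])
    Omega_hat = scores.T @ scores / n
    H_hat = num_hess(lambda t: np.mean([q(t, z) for z in data]), theta_hat)
    return H_hat, Omega_hat
```

For instance, with the least-squares part of the objective, `q` is the per-observation squared deviation, while the ML block uses the negative log-density; the block-diagonal structure of H then emerges automatically.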
We make a number of assumptions that guarantee existence of the above moments and ensure joint consistency and asymptotic normality of ML and LS estimates and .
The following about data generation holds:
the data is a random sample from a population with probability density and finite ;
the parameter set Θ is a compact subset of , and θ0 is in the interior of Θ;
for any θ ∈ Θ such that θ ≠ θ0, it holds that ;
is continuous in θ on Θ and twice continuously differentiable in θ in a neighborhood 𝔑θ of Θ;
the following moments are finite: and the following functions are integrable:
the matrix Hf is non-singular.
The following about the hypothesized regression function holds:
the parameter set B is a compact subset of and β0 is in the interior of B;
for any β ∈ B such that β ≠ β0, it holds that ;
ψ(v, β) is continuously differentiable in β on B and twice continuously differentiable in β in a neighborhood 𝔑β of B;
the following moments are finite: , , , ;
the matrix is non-singular.
Suppose assumptions 1–2 hold. Then .
For future use, define
A natural estimator of Δ is
For simplicity, we assume that this evaluation occurs without computational error.6 We make additional technical assumptions that ensure finiteness of Δ and consistency of .
The following moments exist and are finite: , .
Suppose assumptions 1–3 hold. Then .
Note that because of the convenient partitioning of ϑ into θ and β and the dependence of only on θ and of only on β, differentiation inside Δ also separates out, and one can rewrite
While the bottom entry can be computed analytically, for the top entry one can use the machinery of numerical derivatives in a straightforward way.
3 Test and Asymptotics
Suppose that is specified so that it may be equal, almost surely, to derived from a fully parametric model for some combination of θ and β. The null hypothesis to be tested is
The test of the null H0 is based on a comparison, at the data points, of the regression values implied by the full parametric model and by the hypothesized regression function. The sample squared-deviations statistic is
which is a sample analog to
which is zero under H0 and nonzero otherwise.
The following theorem provides the asymptotic distribution of under the null, which turns out to be a weighted sum of χ2 distributions.
Suppose assumptions 1–2 hold and . Then, under H0,
where are eigenvalues of Λ, and .
To implement the test, one computes , constructs consistent estimates Ĥ and of H and Δ and finds eigenvalues of . Then one simulates the distribution of
and reads off its relevant right quantile to use as critical values for .7
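The simulation step just described can be sketched as follows, assuming the eigenvalues of the estimated Λ have already been computed; the function names are illustrative, not from the paper.

```python
import numpy as np

def weighted_chi2_draws(eigenvalues, S=100_000, seed=0):
    """Simulate S draws from sum_j lambda_j * chi2(1), the limiting null
    distribution with weights given by the (non-zero) eigenvalues."""
    lam = np.asarray(eigenvalues, dtype=float)
    z = np.random.default_rng(seed).standard_normal((S, lam.size))
    return (z ** 2) @ lam  # each row: sum of lam_j * (standard normal)^2

def critical_value(eigenvalues, alpha=0.05, S=100_000, seed=0):
    """Right (1 - alpha) quantile of the simulated null distribution."""
    return np.quantile(weighted_chi2_draws(eigenvalues, S, seed), 1.0 - alpha)
```

With a single unit eigenvalue the critical value reproduces the χ2 quantile with one degree of freedom (about 3.84 at the 5% level), matching the scaled-χ2 special case mentioned earlier.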
Note that Λ may well be of reduced rank, and its rank may be even lower than and/or . In one of the examples in Section 4, while and ; in the second example, rk(Λ) = 3 while and . This happens because typically there is a great deal of collinearity between the derivative of the true regression, the derivative of the hypothesized regression, and the score, at least under the null. This phenomenon does not, however, pose any difficulties in implementation when rk(Λ) is a priori unknown (which is typically the case), as the other eigenvalues of Λ are zero.
Consider now the situation when the null hypothesis does not hold. More precisely, the null fails when there is no parameter value for which the functional form ψ(v, β) coincides with the true regression almost surely. Note that in this case β0 is interpreted as a pseudotrue value of β, as a true value does not exist. The following theorem says that under any alternative, the test statistic diverges.
Suppose assumptions 1–2 hold, and for any β ∈ B. Then
Theorem 2 implies that the test is consistent against any deviation from the true specification, i.e. when the regression function does not equal the hypothesized specification ψ(v, β), for any β ∈ B, on a set of positive measure, however small. The power of the test is expected to be greater the larger the set on which the two functions (evaluated at the true and pseudotrue parameter values, respectively) deviate from each other, and/or the larger those deviations are.
4 Illustrations and Simulations
In this section we elaborate on two examples of data generating processes to illustrate the construction of the test and verify its finite sample performance.
The aim of our first experiment is to analyze the size of the test in the simplest setup and, even more importantly, to see whether the use of numerical derivatives delivers enough precision to control the size of the test. Here all variables are continuous, the regression function has a known form, and the matrices related to first and second derivatives are computable in closed form. Namely, we use a jointly normal model for the two variables
where Due to joint normality, the regression function is linear: We use this fact to verify performance of the test in finite samples in terms of size properties, setting , where . Notice that there is a priori no doubt that the tested regression functional form is true. The total dimensionality of the parameter vector is .
In Appendix B we derive that
We rule out the cases as these values sit on the boundary of the parameter set for ρ. In the formulation of Theorem 1, we also rule out the case ρ = 0, which leads to Λ being a zero matrix with . The test will not work properly when ρ = 0.
Provided that ρ ≠ 0 and , the rank of Λ is unity no matter what the parameter values are, and the only non-zero eigenvalue is . Note that even though rk(H) = 5, rk(Ω) = 4 and rk(Δ) = 2, yet rk(Λ) = 1. Because rk(Λ) = 1, the limiting distribution in fact simplifies to λρ times a χ2 distribution with one degree of freedom. Thus, the critical values can be computed simply as times an appropriate quantile of the tabulated χ2 distribution, where is λρ with the ML estimate plugged in place of ρ.
This result will be used as an ‘analytic’ benchmark when one uses analytical derivatives. To that end, we set the limiting distribution as described in the previous paragraph. The other, ‘numerical’ value for Λ is obtained as , where Ĥ, and are estimates of H, Ω and Δ using numerical derivatives.8
The pairs are drawn from the bivariate normal distribution with means , unit variances and correlation ρ0 = 0.5. The following simulation results are based on 2000 simulations; the rejection rates are expressed in percentages.
The size control is excellent even for small samples when analytical derivatives are used. When one computes numerical derivatives instead, there are expectedly some size distortions, which go away quickly as the sample size grows. For samples of a few thousand, the size control is of no concern, at least for low-dimensional setups.
In our second experiment, we analyze the size and power of the test in a more realistic setup. Here the data are mixed continuous and discrete. The continuous u has a logistic distribution, the discrete v is drawn from a three-point distribution, and the dependence is induced by the Farlie–Gumbel–Morgenstern (FGM) copula. These choices are due to the availability of the joint PDF/PMF and CDF/CMF in closed form, the simplicity of the form of the mean regression, the ease of tuning the parameters so that the regression function is linear or non-linear, and, finally, conceptual similarity to our illustrative empirical application.
The continuous marginal has the density
and cumulative distribution function
We set the true value of μ to be zero in order to obtain symmetry. The three-point distribution of the discrete marginal is with marginal PMF g(v) represented by the corresponding collection of probabilities with CMF . The dependence is induced by the FGM copula
where and ρ > 0 implies positive, although at most moderate, dependence.
Let . It is shown in Appendix B that the joint density/mass is
If, in addition to , we set , then, due to symmetry around the origin, the regression function will be linear: , where λ depends on . If we set , the symmetry ceases to hold, and the regression function is no longer linear. We again set , where , and study size properties when and power properties when . The total dimensionality of the parameter vector is .
The variables are drawn from the standard logistic distribution (i.e. with and ). We set implying the correlation coefficient of . Then, for a given i and given pair , we compute and and use these to generate the variables from the three-point distribution with corresponding probabilities . We set the pair to three values, one of which implies a linear regression, while the other two imply non-linear ones (see Figure 1).
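The data-generating scheme just described can be sketched in code, using the fact that the conditional CDF of v given u under the FGM copula is the partial derivative ∂C(a, b)/∂a = b(1 + ρ(1 − 2a)(1 − b)) evaluated at a = F(u). The function names are ours, and the snippet is a sketch of the scheme rather than the authors' implementation.

```python
import numpy as np

def fgm_conditional_pmf(F_u, probs, rho):
    """PMF of the discrete v given u under the FGM copula, obtained by
    differencing the conditional CDF  b*(1 + rho*(1 - 2a)*(1 - b))  at a = F(u)."""
    G = np.cumsum(probs)  # marginal CMF of v
    cond_cdf = G * (1.0 + rho * (1.0 - 2.0 * F_u) * (1.0 - G))
    return np.diff(np.concatenate(([0.0], cond_cdf)))

def draw_fgm_sample(n, support, probs, rho, seed=0):
    """Draw (u_i, v_i): u from the standard logistic, then v | u from the
    three-point distribution with the corresponding conditional probabilities."""
    rng = np.random.default_rng(seed)
    u = rng.logistic(size=n)
    F_u = 1.0 / (1.0 + np.exp(-u))  # standard logistic CDF
    v = np.empty(n)
    for i in range(n):
        v[i] = rng.choice(support, p=fgm_conditional_pmf(F_u[i], probs, rho))
    return u, v
```

With ρ > 0 this produces positive but moderate dependence between u and v, consistent with the limited range of association attainable under the FGM copula.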
The following table contains simulation results for samples of small size n = 100, moderate size n = 500, and large size n = 2000. The results are based on 2000 simulations; the rejection rates are expressed in percentages.9
Except for small samples, the size and power figures are favorable. The actual rejection rates shown in the first line are quite close to the nominal test sizes. The power figures are impressive, especially for large samples; even though the true regression line does not deviate much from a linear form, the test detects this quite often in a sample of moderate size. With small samples, the null rejection rates fall a bit short of the nominal rates, and the test has a hard time detecting small deviations from the null. While a hundred observations are clearly not sufficient for the test to work properly, increasing the sample size severalfold straightens out the rejection rates and makes the properties of the test very attractive.
5 Illustrative Application
In this section we illustrate the test using the labor market data from Card (1995). These data contain, in particular, wage, education and age for a sample of n = 3010 US men taken in 1976. The main variable is the logarithm of wages (lwage76), and the regressors are education (ed76) and age (age76). We run bivariate and trivariate fully parametric models for the pairs (lwage76,ed76), (lwage76,age76) and the triple (lwage76,ed76,age76), compute the implied regressions of log wages on one or two regressors, and test them for linearity using the test developed in this paper.10
Because the regressand is a continuous variable while both regressors are discrete, we construct the joint distribution using copula machinery. The marginal density for the continuously distributed log wages is chosen to be the skew-normal distribution (Azzalini 1985):
where μ is a location parameter, σ is a scale parameter,
and11 γ is a shape parameter that indexes the degree of skewness; the distribution reduces to the regular normal when γ = 0. In total, the skew-normal density and its CDF are characterized by three parameters in . Azzalini, Dal Cappello, and Kotz (2003) argue that this distribution (among others) approximates real log income data well. Below are the results of fitting the marginal skew-normal density to the variable lwage76.
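A direct implementation of this density, written with our own (hypothetical) function name under the parameterization above:

```python
import math
import numpy as np

def skew_normal_pdf(x, mu, sigma, gamma):
    """Azzalini (1985) skew-normal density (2/sigma) * phi(z) * Phi(gamma*z),
    with z = (x - mu)/sigma; reduces to the normal density when gamma = 0."""
    x = np.asarray(x, dtype=float)
    z = (x - mu) / sigma
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)  # standard normal pdf
    Phi = 0.5 * (1.0 + np.vectorize(math.erf)(gamma * z / math.sqrt(2.0)))  # cdf
    return 2.0 / sigma * phi * Phi
```

Setting γ = 0 recovers the standard normal density, consistent with the reduction noted above; γ > 0 skews the distribution to the right.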
The Kolmogorov–Smirnov statistic (the maximal difference between the empirical distribution function and the estimated CDF) equals 0.0168 and, normalized by , equals 0.921, which is well below the critical value even at the 20% significance level (e.g. Massey 1951).
The marginal distributions of the variables ed76 and age76 are categorical, with k1 = 18 categories for the former and k2 = 11 for the latter,12 and with categorical probabilities , subject to . Let us denote the CMF of this distribution by . The estimates are shown in the following tables.
Because the two/three components include both discrete and continuous variables, we extend the method of Anatolyev and Gospodinov (2010) for constructing a joint distribution from mixed marginals to the case of multiple values in the discrete marginal’s support13 using copula machinery. We employ the Gaussian copula because it is simple and convenient, easily interpretable, and allows a natural extension to higher dimensions with a reasonable increase in the degree of parameterization. When there is only one discrete regressor, the Gaussian copula has only one correlation parameter ϱ. It is derived in Appendix C that the joint density is
is ‘distorted’ categorical probability, and collects all 21 or 14 parameters.
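Assuming the 'distorted' categorical probabilities are the conditional probabilities of the discrete component given the continuous one, obtained from the partial derivative of the Gaussian copula (a standard construction; the exact expression is given in Appendix C), they can be computed as in this hypothetical sketch:

```python
import math
import numpy as np
from statistics import NormalDist

_N = NormalDist()  # standard normal: .cdf and .inv_cdf

def gauss_copula_cond_cdf(a, b, rho):
    """dC(a, b)/da for the Gaussian copula:
    Phi((Phi^{-1}(b) - rho * Phi^{-1}(a)) / sqrt(1 - rho^2)), for 0 < a < 1."""
    if b >= 1.0:
        return 1.0
    if b <= 0.0:
        return 0.0
    return _N.cdf((_N.inv_cdf(b) - rho * _N.inv_cdf(a)) / math.sqrt(1.0 - rho ** 2))

def distorted_probs(probs, F_u, rho):
    """Conditional ('distorted') categorical probabilities of the discrete
    component, given that the continuous component has CDF value F_u."""
    G = np.cumsum(probs)  # categorical CMF
    cond = np.array([gauss_copula_cond_cdf(F_u, g, rho) for g in G])
    return np.diff(np.concatenate(([0.0], cond)))
```

With ϱ = 0 the Gaussian copula is the independence copula, and the distorted probabilities collapse back to the marginal categorical probabilities.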
Maximization of the joint (log)likelihood yields estimates of the parameters of the marginals that are very close to the figures reported above but have lower standard errors, and estimates of the copula parameters as in the following table:
One can see that the estimates of bivariate degrees of dependence are highly statistically significant and moderately large in value.
Figure 2 shows the estimated mean regressions. In the case of ed76, it may appear that the true functional form is linear, which is what the corresponding literature tends to focus on. In the case of age76, linearity does not seem to hold, but a low-order polynomial like a cubic form may be appropriate. To verify whether these conjectures hold, we first perform the test for a linear mean regression:
The test results are in the following table.
The hypothesis of a linear regression form is decidedly rejected for both regressors at any conventional significance level; in fact, the test statistic exceeds the critical value by a huge margin. We conclude that the form of the actual mean regression differs from what is usually assumed in regressions of wages on their determinants.
Labor econometricians often add to their linear regressions the square of a variable related to duration (e.g. work experience14); Murphy and Welch (1990) show that even fourth powers may be needed. Therefore, we have also run the test with low-order polynomial hypothesized regression forms: and . These functional forms are also rejected at any conventional significance level.
When there are two discrete regressors, the Gaussian copula has a 3 × 3 correlation matrix
with three distinct parameters ϱ0, ϱ1, ϱ2. It is derived in Appendix C that the joint density is
for are ‘distorted’ bivariate categorical probabilities, where
and collects all 33 parameters.
Maximization of the joint (log)likelihood yields estimates of the parameters of the marginals that are very close to the figures reported above but have lower standard errors, and estimates of the copula parameters as in the following table:
One can see that the estimates of the bivariate degrees of dependence ϱ1 and ϱ2 are very close to those from the bivariate models, with similar standard errors. The degree of dependence between the two regressors, ϱ0, is estimated to be quite modest but significantly different from zero.
Figure 3 shows the surface of the estimated mean regression which is arguably close to a plane. We perform the test for a linear mean regression:
The test results are:
The hypothesis of a linear regression form is decidedly rejected for both regressors at any conventional significance level. We also repeat this exercise for the form quadratic in both regressors , as well as, motivated by the study of Murphy and Welch (1990), for the form linear in education and quartic in age, , and for the same form with age v2 replaced by potential experience, which equals .15
These functional forms are also decidedly rejected at any conventional significance level. Evidently, the observable “bumps” in the curves/surface in Figure 2 and Figure 3 are not due to sampling error alone, but rather are built-in attributes of the shapes of the regressions. The overall results imply that the true mean regressions are unlikely to reduce to low-order polynomials in the conditioning variables, but rather take more complex functional forms, which contradicts popular empirical practice.16
6 Conclusion
We have developed a test for a restricted functional form of a mean regression function when a parametric distribution for all variables is specified and estimated. The test is based on a mean-square comparison of the estimated regression implied by the joint density and the estimated hypothesized functional form. The test statistic is asymptotically distributed as a weighted sum of χ2 distributions, with the coefficients computable from the true and hypothesized regression functions and the score function. The size and power properties are favorable for commonly used sample sizes. A possible direction of future research is the extension of the test to causal regressions estimated by instrumental variables.
I am grateful to the Co-Editor and two anonymous referees for useful suggestions that significantly improved the presentation. I also thank Nikolay Kudrin for excellent research assistance.
Proof of Lemma 1
Proof of Lemma 2
which follows from Assumption 3. Next,
Proof of Theorem 1
Take a second-order stochastic expansion of n around the true parameter value ϑ0:
and is the asymptotic distribution of . Under H0, the leading term is zero. Next, under H0,
by the law of large numbers (see the proof of Lemma 2) and because
Summarizing, we have that under H0,
Now, using Lemma 3.2 from Vuong (1989), we get that
where are eigenvalues of , and . □
Proof of Theorem 2
It follows from the proof of Theorem 1 that
Because almost surely, we have that tends to +∞ as n → ∞ as it is positive by construction. □
Details on Simulation Experiments
Consider the setup of the first experiment. Because and , we compute that
Note that there are only two non-collinear elements. Hence,
which, expectedly, has a rank of 2.
The logdensity is
and its derivatives are
The derivatives of the hypothesized regression function are
So, the (minus) inverted Hessian is
Next we compute
Hence, the matrix of expected cross-products of the elements of the score vector is
Then the asymptotic variance matrix is
For the second experiment, we extend the method of Anatolyev and Gospodinov (2010) for constructing a joint distribution from mixed discrete and continuous marginals to cases in which the cardinality of the discrete marginal’s support exceeds two. The joint CDF/CMF is
so the PDF/PMF is a derivative with respect to the continuous argument and a difference with respect to the discrete one:
where the second term is
For the FGM copula,
implying the distorted success probabilities
The joint density/mass is
and the result follows.
Details on Empirical Illustration
We omit the parameters during the derivations. In the case of only one discrete component, the joint PDF/PMF is
where the last term is
The Gaussian copula is , where Φ2 is the CDF of the standard bivariate normal, and Φ−1 is the inverse of the standard normal CDF. Note the important property:
This leads to
Note that because Φ2 is bivariate standard normal with correlation coefficient ϱ, we have, by normality of the conditional distributions under joint normality, that
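The conditional-normal property invoked here is the standard partial derivative of the Gaussian copula with respect to its first argument; written out in this notation, it reads

$$
\frac{\partial}{\partial a}\,\Phi_2\!\left(\Phi^{-1}(a),\,\Phi^{-1}(b);\,\varrho\right)
=\Phi\!\left(\frac{\Phi^{-1}(b)-\varrho\,\Phi^{-1}(a)}{\sqrt{1-\varrho^{2}}}\right),
$$

which follows because, under joint standard normality with correlation ϱ, the conditional distribution of the second component given the first is normal with mean ϱ times the conditioning value and variance 1 − ϱ².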
In the case of two discrete components, the joint PDF/PMF is
where the last term is
Consider the three-dimensional Gaussian copula
Note the property
which leads to
As a computational matter, we use the fact that
Azzalini, A. 1985. “A Class of Distributions which Includes the Normal Ones.” Scandinavian Journal of Statistics 12: 171–178.
Azzalini, A., T. Dal Cappello, and S. Kotz. 2003. “Log-Skew-Normal and Log-Skew-t Distributions as Models for Family Income Data.” Journal of Income Distribution 11 (3–4): 12–20.
Card, D. 1995. “Using Geographic Variation in College Proximity to Estimate the Return to Schooling.” In Aspects of Labor Market Behaviour: Essays in Honour of John Vanderkamp, edited by L. N. Christofides, E. K. Grant, and R. Swidinsky. Toronto: University of Toronto Press.
Härdle, W., and E. Mammen. 1990. “Comparing Nonparametric versus Parametric Regression Fits.” Annals of Statistics 21: 1926–1947.
Horowitz, J. L., and V. G. Spokoiny. 2001. “An Adaptive, Rate-Optimal Test of a Parametric Mean-Regression Model against a Nonparametric Alternative.” Econometrica 69 (3): 599–631.
Judd, K. 1998. Numerical Methods in Economics. Cambridge, Massachusetts: MIT Press.
Newey, W. K., and D. McFadden. 1994. “Large Sample Estimation and Hypothesis Testing.” In Handbook of Econometrics, edited by R. F. Engle and D. McFadden, Vol. 4, 2111–2245. Amsterdam: North-Holland.
Another appropriate context is a semiparametric one where a conditional distribution is specified and estimated in the first place. However, this case is less practical, because specifying a conditional distribution typically entails specifying the conditional mean as part of the modeling strategy.
We simply call ‘density’ what may in fact be a mass function when discrete variables are considered, or a density/mass when mixed continuous/discrete variables are considered. The integrals considered from now on are then redefined accordingly.
Even in the most likely case when is linear in β and the solution for is known in closed form, the ML estimator is likely not, so one still has to solve a nonlinear optimization problem. The closed form of can be conveniently used as a starting (and final) point for β during optimization.
There are several sources of computational error: the software’s round-off error, error in evaluation of integrals on a finite domain, error from neglecting the tails of functions being integrated, and error in evaluation of derivatives. See Judd (1998) for information about the orders of some of these approximation errors. For example, two-sided differences in evaluation of first derivatives lead to errors of order , where h is a step size and ϵ is an error in computation of the function being integrated (which may exceed the round-off error) (Judd 1998, section 7.7); numerical integration on a bounded interval using Gauss–Chebyshev quadrature causes errors of order , where m is the number of quadrature nodes (Judd 1998, section 7.2). We assume that the total computational error is sufficiently controllable so that it does not affect the test statistic to the precision used to compute it.
As a practical matter, simulation of the null distribution can be implemented very easily given the collection of eigenvalues. For example, in GAUSS, the vector of simulated values can be computed using the statement sumc(lambda.*rndn(d,S)^2);. Here, the vector lambda contains the eigenvalues, d is the dimension of ϑ, and S is the number of simulations.
The derivatives are computed using two-sided differences with the step of componentwise, where h = 10−5. The integrals involved in evaluation of expectations are computed via Gauss–Chebychev quadrature with m = 100 quadrature nodes on [−8, 8]. Such precision is more than sufficient not to worry about the error ϵ of computation of the function being integrated; see the previous footnote.
We do not make any attempt to interpret these regressions as any sort of causal relationships. A causal approach when ed76 is involved requires an acknowledgement of its endogeneity and needs instrumental variables for consistent estimation; see Card (1995) and the rest of the returns to schooling literature.
In Anatolyev and Gospodinov (2010), the discrete marginal is Bernoulli.
While the regressions we have considered here are not causal (see footnote 10), the rejections obtained indirectly indicate probable misspecification of similar causal relationships used in the returns to schooling literature.