
# Journal of Econometric Methods

Ed. by Giacomini, Raffaella / Li, Tong

Online
ISSN
2156-6674
Volume 8, Issue 1

# Testing for a Functional Form of Mean Regression in a Fully Parametric Environment

Stanislav Anatolyev
• Corresponding author
• CERGE-EI, Politických vězňů 7, 11121 Prague 1, Czech Republic
• New Economic School, Skolkovskoe shosse 45, Moscow, 121353, Russia
Published Online: 2018-02-10 | DOI: https://doi.org/10.1515/jem-2016-0013

## Abstract

We develop a test for a restricted functional form of a mean regression when a complex distributional model for all variables is estimated. The test statistic is an average squared deviation from the estimated hypothesized function of the form implied by the estimated parametric model, and is asymptotically distributed as a mixture of χ2 distributions. The test is easy to implement using numerical derivatives, and it performs well in samples of typical size. We illustrate the test using data on labor market characteristics of US young men.

JEL Classification: C12; C21

## 1 Introduction

In a fully parametric setup where the distributional specification is available, one may be interested in whether the mean regression takes a particular restricted functional form. While the unrestricted regression may be inferred from the specified distribution and estimated from the data, it is likely to allow a rich variety of shapes.1 In such a case, it is often of interest whether the shape of the mean regression reduces to some functional form implied by economic theory, tradition in the literature, or visual inspection; at the same time, it may be problematic to test directly for parametric restrictions embedded in the hypothesized shape. As an example, Figure 2 from our illustrative application, based on a complex mixed continuous/discrete distribution, presents two regressions of a wage variable on a variable representing education and on a variable representing age, derived from the estimator of the fully parametric (i.e. joint distributional) model. One of these regressions looks quite linear to the naked eye, and is widely assumed to be linear in the literature, but is it truly linear? The other may seem to be cubic or quartic, but is it truly such? Do the observable deviations from a low-order polynomial owe merely to sampling error, or do they constitute evidence against these simple forms of the conditional mean?

In this paper, we develop a test for a parametric functional form of a mean regression when the full parametric model for all variables is estimated.2 A natural test statistic is the average squared deviation of the regression function implied by the estimated parametric model from the hypothesized functional form. We derive the asymptotic distribution of the test statistic, which turns out to be a weighted sum of χ2 distributions with one degree of freedom. Even though the test statistic is non-pivotizable (except possibly in some special cases when the distribution collapses to a single scaled χ2 distribution with one degree of freedom), the test is easy to implement: the weights are estimated using numerical derivatives of the true and hypothesized regression functions and the score function. We demonstrate good size and power properties of the test in finite samples using two simple stylized models – one based on bivariate normality, and the other on mixed continuous/discrete marginals linked by a copula. Finally, we illustrate the test using Card’s (1995) data on wage, education and age of a few thousand US young men. Although the regressions may look linear to the naked eye, the test decidedly rejects linearity of regressions of log-wage on education, log-wage on age, and log-wage on both education and age, as well as of low-order (quartic) polynomial analogs of these.

There exists a variety of tests for a parametric form of a mean regression against non-parametric alternatives; see, for instance, Härdle and Mammen (1990) and Horowitz and Spokoiny (2001). This is also a valid approach to testing for a regression parametric specification. However, when the whole framework is parametric, it is more natural to utilize it and perform the testing within the parametric distributional model. In addition, from the technical standpoint, the nonparametric tests usually involve kernel estimation of the mean regression and bootstrapping of the test statistic, so their implementation is more involved than that of the test proposed here.

The paper is structured as follows. In Section 2 the setup is described, the assumptions are laid out, and properties of auxiliary estimates are derived. In Section 3 the test statistic and its asymptotic properties are presented, and implementation of the test is described. Section 4 contains two illustrative examples, accompanied by simulation evidence. In Section 5, we illustrate how the test works using labor market data. Finally, Section 6 concludes. All proofs and tedious derivations are relegated to the Appendix. Notes on notation: $‖\cdot ‖$ denotes the L2 norm of a matrix, by $\mathrm{dim}\left(\cdot \right)$ we denote dimensionality of a vector, by $\mathrm{r}\mathrm{k}\left(\cdot \right)$ – the rank of a matrix.

## 2 Setup and Estimation

Suppose there is a parametric density3 model $f\left(u,v|\theta \right)$, θ ∈ Θ for a pair of random variables (u, v), u being scalar and v being possibly multidimensional, and let θ0 be the true value of parameter θ. The implied mean regression for u given v is the value at θ0 of the conditional expectation function

$E(u|v,\theta)=\int_{-\infty}^{+\infty}u\,f(u|v,\theta)\,du,$

where

$f(u|v,\theta)=\frac{f(u,v|\theta)}{g(v|\theta)}$

is the conditional density of u given v, and

$g(v|\theta)=\int_{-\infty}^{+\infty}f(u,v|\theta)\,du$

is the marginal density of v. The estimated implied regression is then

$E(u|v,\hat{\theta})=\frac{1}{g(v|\hat{\theta})}\int_{-\infty}^{+\infty}u\,f(u,v|\hat{\theta})\,du,$(1)

where $\stackrel{^}{\theta }$ is the maximum likelihood estimator of θ0:

$\hat{\theta}=\arg\max_{\theta\in\Theta}\,\sum_{i=1}^{n}\ln f(u_i,v_i|\theta).$

We would like to compare the implied regression (1) to the parametric functional form $\psi \left(v,{\beta }^{0}\right)$, where

$\beta^0=\arg\min_{\beta\in B}\,E\left[\left(u-\psi(v,\beta)\right)^2\right].$

The estimator $\stackrel{^}{\beta }$ of the (pseudo)true value of the parameter β0 is based on least squares:4

$\hat{\beta}=\arg\min_{\beta\in B}\,\sum_{i=1}^{n}\left(u_i-\psi(v_i,\beta)\right)^2.$

Denote $\vartheta ={\left({\theta }^{\mathrm{\prime }},{\beta }^{\mathrm{\prime }}\right)}^{\mathrm{\prime }}$ and ${\vartheta }^{0}={\left({\theta }^{0\mathrm{\prime }},{\beta }^{0\mathrm{\prime }}\right)}^{\mathrm{\prime }}$. Because the test to be developed will need to use information on asymptotic correlatedness between $\stackrel{^}{\theta }$ and $\stackrel{^}{\beta }$, we frame the two estimation problems inside one joint optimization problem5

$\hat{\vartheta}\equiv(\hat{\theta}',\hat{\beta}')'=\arg\max_{\theta\in\Theta,\,\beta\in B}\,\frac{1}{n}\sum_{i=1}^{n}\left\{Q(u_i,v_i|\vartheta)\equiv\ln f(u_i,v_i|\theta)-\frac{1}{2}\left(u_i-\psi(v_i,\beta)\right)^2\right\},$

and the asymptotic variance estimate ${\stackrel{^}{V}}_{\vartheta }$ for $\stackrel{^}{\vartheta }$ can be obtained numerically from this optimization problem. The factor $\frac{1}{2}$ is added for convenience when computing the derivatives; its presence (or that of any other positive factor) does not affect the estimator or its properties.
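As a sketch of how this joint optimization and the numerical sandwich variance might be implemented, consider a bivariate normal model with a linear hypothesized form (as in Section 4). This is a hypothetical illustration: the data-generating values, the tanh reparameterization of ρ (our device to keep the correlation inside (−1, 1) during unconstrained optimization), starting values, and step sizes are all our choices, not prescriptions from the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n = 2000
data = rng.multivariate_normal([1.0, 1.0], [[1.0, 0.5], [0.5, 1.0]], size=n)
u, v = data[:, 0], data[:, 1]

def Q(vartheta):
    # per-observation objective: Q = ln f(u,v|theta) - (1/2)(u - psi(v,beta))^2;
    # rho = tanh(r) keeps the correlation strictly inside (-1, 1)
    mu_u, mu_v, r, a, b = vartheta
    rho = np.tanh(r)
    logf = multivariate_normal.logpdf(np.column_stack([u, v]),
                                      mean=[mu_u, mu_v],
                                      cov=[[1.0, rho], [rho, 1.0]])
    return logf - 0.5 * (u - a - b * v) ** 2

# maximize the average of Q jointly over (theta, beta)
res = minimize(lambda th: -Q(th).mean(), x0=[0.0, 0.0, 0.1, 0.0, 0.0],
               method="BFGS")
vartheta_hat = res.x   # (mu_u, mu_v, r, a, b)

def num_jac(fun, x, h=1e-5):
    # central-difference Jacobian of a vector-valued function, one column
    # per parameter
    cols = []
    for j in range(len(x)):
        e = np.zeros(len(x)); e[j] = h
        cols.append((fun(x + e) - fun(x - e)) / (2 * h))
    return np.column_stack(cols)

scores = num_jac(Q, vartheta_hat)            # n x dim(vartheta) score matrix
Omega_hat = scores.T @ scores / n
H_hat = num_jac(lambda th: num_jac(Q, th).mean(axis=0), vartheta_hat, h=1e-4)
V_hat = np.linalg.inv(H_hat) @ Omega_hat @ np.linalg.inv(H_hat) / n
```

Under this design the implied slope equals ρ = 0.5, so the least-squares part of the estimate should land near it.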

Let us have a closer look at the structure of ${\stackrel{^}{V}}_{\vartheta }$. Because $\stackrel{^}{\vartheta }$ is an extremum estimator, its variance has a sandwich form ${H}^{-1}\mathrm{\Omega }{H}^{-1}$, where H is a Hessian matrix, and Ω is a variance matrix of first derivatives. Because of the additive structure of $Q\left(u,v|\vartheta \right)$ in θ and β, H has the block-diagonal form

$H=\begin{bmatrix}H_f & 0\\ 0 & -M_{\psi\psi}\end{bmatrix},$

where

$H_f=E\left[\frac{\partial^2\ln f(u,v|\theta^0)}{\partial\theta\,\partial\theta'}\right]=E\left[\frac{\partial^2 Q(u,v|\theta^0,\beta^0)}{\partial\theta\,\partial\theta'}\right]$

and

$M_{\psi\psi}=E\left[\frac{\partial\psi(v,\beta^0)}{\partial\beta}\frac{\partial\psi(v,\beta^0)}{\partial\beta'}\right]=-E\left[\frac{\partial^2 Q(u,v|\theta^0,\beta^0)}{\partial\beta\,\partial\beta'}\right].$

Next,

$\Omega=\begin{bmatrix}-H_f & M_{uf\psi}\\ M_{uf\psi}' & M_{u^2\psi\psi}\end{bmatrix},$

where

$M_{uf\psi}=E\left[(u-\psi(v,\beta^0))\,\frac{\partial\ln f(u,v|\theta^0)}{\partial\theta}\,\frac{\partial\psi(v,\beta^0)}{\partial\beta'}\right],\qquad M_{u^2\psi\psi}=E\left[(u-\psi(v,\beta^0))^2\,\frac{\partial\psi(v,\beta^0)}{\partial\beta}\,\frac{\partial\psi(v,\beta^0)}{\partial\beta'}\right],$

and the northwest corner is occupied by −Hf because of the information matrix equality (recall that $\stackrel{^}{\theta }$ is an ML estimate). Putting the pieces together,

$\hat{V}_{\vartheta}=H^{-1}\Omega H^{-1}=\begin{bmatrix}-H_f^{-1} & -H_f^{-1}M_{uf\psi}M_{\psi\psi}^{-1}\\ -M_{\psi\psi}^{-1}M_{uf\psi}'H_f^{-1} & M_{\psi\psi}^{-1}M_{u^2\psi\psi}M_{\psi\psi}^{-1}\end{bmatrix}.$

Note that while H is necessarily of full rank, the matrix Ω may well be singular. In one of the examples in Section 4, rk(Ω) = 4 while $\mathrm{dim}\left(\vartheta \right)=5$. The matrices Hf, ${M}_{\psi \psi }$, ${M}_{u\psi \psi }$ and ${M}_{{u}^{2}\psi \psi }$ can easily be estimated using numerical derivatives and the parameter estimate $\stackrel{^}{\vartheta }$ already obtained.

We make a number of assumptions that guarantee existence of the above moments and ensure joint consistency and asymptotic normality of ML and LS estimates $\stackrel{^}{\theta }$ and $\stackrel{^}{\beta }$.

#### Assumption 1

The following about data generation holds:

1. the data ${\left\{\left({u}_{i},{v}_{i}\right)\right\}}_{i=1}^{n}$ is a random sample from a population with probability density $f\left(u,v|{\theta }^{0}\right)$ and finite $E\left[{u}^{2}\right]$;

2. the parameter set Θ is a compact subset of ${\mathbb{R}}^{\mathrm{dim}\left(\theta \right)}$, and θ0 is in the interior of Θ;

3. for any θ ∈ Θ such that θθ0, it holds that $f\left(u,v|\theta \right)\ne f\left(u,v|{\theta }^{0}\right)$;

4. $f\left(u,v|\theta \right)$ is continuous in θ on Θ and twice continuously differentiable in θ in a neighborhood 𝔑θ of Θ;

5. the following moments are finite: $E\left[\underset{\theta \in \mathrm{\Theta }}{sup}|f\left(u,v|\theta \right)|\right],$ $E\left[\underset{\theta \in {\mathfrak{N}}_{\theta }}{sup}‖{\mathrm{\partial }}^{2}\mathrm{ln}f\left(u,v|\theta \right)/\mathrm{\partial }\theta \mathrm{\partial }{\theta }^{\mathrm{\prime }}‖\right],$ and the following functions are integrable: $\underset{\theta \in {\mathfrak{N}}_{\theta }}{sup}‖\mathrm{\partial }f\left(u,v|\theta \right)/\mathrm{\partial }\theta ‖,$ $\underset{\theta \in {\mathfrak{N}}_{\theta }}{sup}‖{\mathrm{\partial }}^{2}f\left(u,v|\theta \right)/\mathrm{\partial }\theta \mathrm{\partial }{\theta }^{\mathrm{\prime }}‖;$

6. the matrix Hf is non-singular.

#### Assumption 2

The following about the hypothesized regression function holds:

1. the parameter set B is a compact subset of ${\mathbb{R}}^{\mathrm{dim}\left(\beta \right)},$ and β0 is in the interior of B;

2. for any βB such that β ≠ β0, it holds that $\psi \left(v,\beta \right)\ne \psi \left(v,{\beta }^{0}\right)$;

3. ψ(v, β) is continuously differentiable in β on B and twice continuously differentiable in β in a neighborhood 𝔑β of B;

4. the following moments are finite: $E\left[\underset{\beta \in \mathrm{B}}{sup}\psi {\left(v,\beta \right)}^{2}\right]$, $E\left[\underset{\beta \in \mathrm{B}}{sup}{‖\mathrm{\partial }\psi \left(v,\beta \right)/\mathrm{\partial }\beta ‖}^{2}\right]$,$E\left[\underset{\beta \in {\mathfrak{N}}_{\beta }}{sup}‖\mathrm{\partial }\psi \left(v,\beta \right)/\mathrm{\partial }\beta \cdot \mathrm{\partial }\psi \left(v,\beta \right)/\mathrm{\partial }{\beta }^{\mathrm{\prime }}‖\right]$, $E\left[\underset{\beta \in {\mathfrak{N}}_{\beta }}{sup}{‖{\mathrm{\partial }}^{2}\psi \left(v,\beta \right)/\mathrm{\partial }\beta \mathrm{\partial }{\beta }^{\mathrm{\prime }}‖}^{2}\right]$;

5. the matrix ${M}_{\psi \psi }$ is non-singular.

#### Lemma 1

Suppose assumptions 1–2 hold. Then $\stackrel{^}{\vartheta }\stackrel{p}{\to }{\vartheta }^{0}$.

For future use, define

$\Delta=E\left[\left(\frac{\partial E(u|v,\theta^0)}{\partial\vartheta}-\frac{\partial\psi(v,\beta^0)}{\partial\vartheta}\right)\left(\frac{\partial E(u|v,\theta^0)}{\partial\vartheta}-\frac{\partial\psi(v,\beta^0)}{\partial\vartheta}\right)'\right].$

A natural estimator of Δ is

$\hat{\Delta}=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial E(u|v_i,\hat{\theta})}{\partial\vartheta}-\frac{\partial\psi(v_i,\hat{\beta})}{\partial\vartheta}\right)\left(\frac{\partial E(u|v_i,\hat{\theta})}{\partial\vartheta}-\frac{\partial\psi(v_i,\hat{\beta})}{\partial\vartheta}\right)'.$

For simplicity, we assume that this evaluation occurs without computational error.6 We make additional technical assumptions that ensure finiteness of Δ and consistency of $\stackrel{^}{\mathrm{\Delta }}$.

#### Assumption 3

The following moments exist and are finite: $E\left[\underset{\theta \in \mathrm{\Theta }}{sup}{‖\mathrm{\partial }\mathrm{ln}f\left(u|v,\theta \right)/\mathrm{\partial }\theta ‖}^{2}\right]$, $E\left[\underset{\theta \in {\mathfrak{N}}_{\theta }}{sup}{‖E\left[u\cdot \mathrm{\partial }\mathrm{ln}f\left(u|v,\theta \right)/\mathrm{\partial }\theta |v\right]‖}^{2}\right]$.

#### Lemma 2

Suppose assumptions 1–3 hold. Then $\stackrel{^}{\mathrm{\Delta }}\stackrel{p}{\to }\mathrm{\Delta }$.

Note that because of the convenient partitioning of ϑ into θ and β, and because $E\left(u|v,\theta \right)$ depends only on θ and $\psi \left(v,\beta \right)$ only on β, the differentiation inside Δ also separates, and one can rewrite

$\frac{\partial E(u|v,\theta^0)}{\partial\vartheta}-\frac{\partial\psi(v,\beta^0)}{\partial\vartheta}=\begin{bmatrix}\dfrac{\partial E(u|v,\theta^0)}{\partial\theta}\\[1ex] -\dfrac{\partial\psi(v,\beta^0)}{\partial\beta}\end{bmatrix}.$(2)

While the bottom entry can be computed analytically, for the top entry one can use the machinery of numerical derivatives in a straightforward way.
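The numerical-derivative route for the top entry can be sketched as follows, again using the bivariate normal model of Section 4, where $E(u|v,\theta)$ is computed by quadrature and differentiated by central differences. Under normality the derivative is known in closed form, which provides a check; the integration limits and step size h are our choices.

```python
import numpy as np
from scipy import integrate
from scipy.stats import multivariate_normal, norm

def implied_regression(v, theta):
    # E(u|v,theta) = (1/g(v|theta)) * \int u f(u,v|theta) du, by quadrature
    mu_u, mu_v, rho = theta
    joint = lambda u: multivariate_normal.pdf(
        [u, v], mean=[mu_u, mu_v], cov=[[1.0, rho], [rho, 1.0]])
    num, _ = integrate.quad(lambda u: u * joint(u), -12.0, 12.0)
    den = norm.pdf(v, loc=mu_v, scale=1.0)   # marginal of v is N(mu_v, 1)
    return num / den

def d_regression_dtheta(v, theta, h=1e-4):
    # central-difference derivative of the implied regression w.r.t. theta
    theta = np.asarray(theta, dtype=float)
    grad = np.empty(theta.size)
    for j in range(theta.size):
        e = np.zeros(theta.size); e[j] = h
        grad[j] = (implied_regression(v, theta + e)
                   - implied_regression(v, theta - e)) / (2 * h)
    return grad

# under joint normality E(u|v,theta) = mu_u + rho (v - mu_v), so the
# derivative w.r.t. (mu_u, mu_v, rho) is (1, -rho, v - mu_v)
grad = d_regression_dtheta(2.0, [1.0, 1.0, 0.5])
```

The same scheme applies verbatim when no closed form exists; only `implied_regression` changes with the model.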

## 3 Test and Asymptotics

Suppose that $\psi \left(v,\beta \right)$ is specified so that it may be equal, almost surely, to $E\left(u|v,\theta \right)$ derived from a fully parametric model $f\left(u,v|\theta \right)$ for some combination of θ and β. The null hypothesis to be tested is

$H_0:\ E(u|v,\theta^0)=\psi(v,\beta^0)\quad \text{a.s.}$

Denote

$\Lambda=H^{-1}\Omega H^{-1}\Delta.$

The test of the null H0 is based on the comparison at data points of regression values implied by the full parametric model and by the hypothesized regression function. The sample squared deviations statistic is

$\hat{D}=\frac{1}{n}\sum_{i=1}^{n}\left(E(u|v_i,\hat{\theta})-\psi(v_i,\hat{\beta})\right)^2,$

which is a sample analog to

$D=E\left[\left(E(u|v,\theta^0)-\psi(v,\beta^0)\right)^2\right],$

which is zero under H0 and nonzero otherwise.

The following theorem provides the asymptotic distribution of $\stackrel{^}{D}$ under the null, which turns out to be a weighted sum of χ2 distributions.

#### Theorem 1

Suppose assumptions 1–2 hold and $\mathrm{r}\mathrm{k}\left(\mathrm{\Lambda }\right)\ne 0$. Then, under H0,

$n\hat{D}\ \overset{d}{\to}\ \mathcal{D},$

where

$\mathcal{D}\overset{d}{=}\sum_{j=1}^{\dim(\vartheta)}\lambda_j\zeta_j^2,$

and $\left\{{\lambda }_{j}{\right\}}_{j=1}^{\mathrm{dim}\left(\vartheta \right)}$ are the eigenvalues of Λ, while $\left\{{\zeta }_{j}^{2}{\right\}}_{j=1}^{\mathrm{dim}\left(\vartheta \right)}\sim IID\text{ }{\chi }_{\left(1\right)}^{2}$.

To implement the test, one computes $\stackrel{^}{D}$, constructs consistent estimates Ĥ, $\stackrel{^}{\mathrm{\Omega }}$ and $\stackrel{^}{\mathrm{\Delta }}$ of H, Ω and Δ, and finds the eigenvalues $\left\{{\stackrel{^}{\lambda }}_{j}{\right\}}_{j=1}^{\mathrm{dim}\left(\vartheta \right)}$ of $\stackrel{^}{\mathrm{\Lambda }}={\stackrel{^}{H}}^{-1}\stackrel{^}{\mathrm{\Omega }}{\stackrel{^}{H}}^{-1}\stackrel{^}{\mathrm{\Delta }}$. Then one simulates the distribution of

$\sum_{j=1}^{\dim(\vartheta)}\hat{\lambda}_j\zeta_j^2,$

and reads off its relevant right quantile to use as a critical value for $n\stackrel{^}{D}$.7
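The quantile-simulation step takes only a few lines; the following is a minimal sketch (the function name and simulation size are ours). As a sanity check, with a single non-zero eigenvalue the simulated quantile must reduce to that eigenvalue times a χ2(1) quantile.

```python
import numpy as np
from scipy.stats import chi2

def mixture_chi2_quantile(lam_hats, level=0.95, n_sim=200_000, seed=0):
    # simulate sum_j lam_j * zeta_j^2 with zeta_j ~ iid N(0,1), i.e. a
    # weighted sum of chi2(1) draws, and read off the requested quantile
    rng = np.random.default_rng(seed)
    lam = np.asarray(lam_hats, dtype=float)
    draws = (lam * rng.standard_normal((n_sim, lam.size)) ** 2).sum(axis=1)
    return np.quantile(draws, level)

cv = mixture_chi2_quantile([0.3, 0.0, 0.0, 0.0, 0.0])
exact = 0.3 * chi2.ppf(0.95, df=1)   # single-eigenvalue benchmark
```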

Note that Λ may well be of reduced rank, and its rank may be even lower than $\mathrm{dim}\left(\theta \right)$ and/or $\mathrm{dim}\left(\beta \right)$. In one of the examples in Section 4, $\mathrm{r}\mathrm{k}\left(\mathrm{\Lambda }\right)=1$ while $\mathrm{dim}\left(\theta \right)=3$ and $\mathrm{dim}\left(\beta \right)=2$; in the second example, rk(Λ) = 3 while $\mathrm{dim}\left(\theta \right)=5$ and $\mathrm{dim}\left(\beta \right)=2.$ This happens because there is typically a great deal of collinearity among the derivative of the true regression, the derivative of the hypothesized regression, and the score, at least under the null. This phenomenon does not, however, pose any difficulties in implementation when rk(Λ) is a priori unknown (which is typically the case), as the remaining $\mathrm{dim}\left(\vartheta \right)-\mathrm{r}\mathrm{k}\left(\mathrm{\Lambda }\right)$ eigenvalues of Λ are zeros.

Consider now the situation when the null hypothesis does not hold. More precisely, the null fails if there is no parameter value for which the functional form ψ(v, β) coincides with the true regression almost surely. Note that in this case β0 is interpreted as a pseudotrue value of β, as a true value does not exist. The following theorem says that under any alternative, the test statistic diverges.

#### Theorem 2

Suppose assumptions 1–2 hold, and $Pr\left\{E\left(u|v,{\theta }^{0}\right)\ne \psi \left(v,\beta \right)\right\}>0$ for any β ∈ B. Then

$n\hat{D}\ \overset{p}{\to}\ +\infty.$

Theorem 2 implies that the test is consistent against any deviation from the true specification, i.e. whenever the regression function differs from the hypothesized specification ψ(v, β), for any β ∈ B, on a set of positive measure, however small. The power of the test is expected to be greater the larger the set on which the two functions (evaluated at the true and pseudotrue parameter values, respectively) deviate from each other, and/or the larger those deviations are.

## 4 Illustrations and Simulations

In this section we elaborate on two examples of data generating processes to illustrate the construction of the test and verify its finite sample performance.

The aim of our first experiment is to analyze the size of the test in the simplest setup and, even more importantly, to see whether the use of numerical derivatives delivers enough precision in controlling the size of the test. Here all variables are continuous, the regression function has a known form, and the matrices related to first and second derivatives are computable in closed form. Namely, we use a jointly normal model for the two variables

$f(u,v|\theta)=\frac{1}{2\pi\sqrt{1-\rho^2}}\exp\left(-\frac{(u-\mu_u)^2-2\rho(u-\mu_u)(v-\mu_v)+(v-\mu_v)^2}{2(1-\rho^2)}\right),$

where $\theta ={\left({\mu }_{u},{\mu }_{v},\rho \right)}^{\mathrm{\prime }}.$ Due to joint normality, the regression function is linear: $E\left(u|v,\theta \right)={\mu }_{u}+\rho \left(v-{\mu }_{v}\right).$ We use this fact to verify performance of the test in finite samples in terms of size properties, setting $\psi \left(v,\beta \right)=a+bv$, where $\beta ={\left(a,b\right)}^{\mathrm{\prime }}$. Notice that there is a priori no doubt that the tested regression functional form is true. The total dimensionality of the parameter vector is $\mathrm{dim}\left(\vartheta \right)=5$.

In Appendix B we derive that

$\Lambda=\frac{2\rho^2(1-\rho^2)}{1+\rho^2}\begin{bmatrix}0&0&0&0&0\\0&0&0&0&0\\0&0&0&0&0\\0&0&\mu_v&0&-\mu_v\\0&0&-1&0&1\end{bmatrix}.$(3)

We rule out the cases $\rho =±1$ as these values sit on the boundary of the parameter set $\left[-1,1\right]$ for ρ. In the formulation of Theorem 1, we also rule out the case ρ = 0 which leads to Λ being a zero matrix with $\mathrm{r}\mathrm{k}\left(\mathrm{\Lambda }\right)=0$. The test will not work properly when ρ = 0.

Provided that ρ ≠ 0 and $\rho \ne ±1$, the rank of Λ is unity no matter what the parameter values are, and the only non-zero eigenvalue is ${\lambda }_{\rho }=2{\rho }^{2}\frac{1-{\rho }^{2}}{1+{\rho }^{2}}$. Note that even though $\mathrm{dim}\left(\vartheta \right)=5,$ we have rk(H) = 5, rk(Ω) = 4, rk(Δ) = 2, yet rk(Λ) = 1. Because rk(Λ) = 1, the limiting distribution in fact simplifies to λρ times a ${\chi }_{\left(1\right)}^{2}$ distribution. Thus, the critical values can be computed simply as ${\stackrel{^}{\lambda }}_{\rho }$ times an appropriate quantile of the tabulated ${\chi }_{\left(1\right)}^{2}$ distribution, where ${\stackrel{^}{\lambda }}_{\rho }$ is λρ with the ML estimate $\stackrel{^}{\rho }$ plugged in place of ρ.
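In this case the critical value is available in closed form; a two-line sketch (the helper names are ours):

```python
from scipy.stats import chi2

def lambda_rho(rho):
    # the only non-zero eigenvalue of Lambda in the bivariate normal example
    return 2 * rho**2 * (1 - rho**2) / (1 + rho**2)

def analytic_critical_value(rho_hat, alpha=0.05):
    # critical value for n*D_hat: lambda_rho(rho_hat) times a chi2(1) quantile
    return lambda_rho(rho_hat) * chi2.ppf(1 - alpha, df=1)

cv = analytic_critical_value(0.5)   # lambda_rho(0.5) = 0.3
```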

This result will be used as an ‘analytic’ benchmark, in which analytical derivatives are used: we set the limiting distribution as described in the previous paragraph. The other, ‘numerical’ value for Λ is obtained as ${\stackrel{^}{H}}^{-1}\stackrel{^}{\mathrm{\Omega }}{\stackrel{^}{H}}^{-1}\stackrel{^}{\mathrm{\Delta }}$, where Ĥ, $\stackrel{^}{\mathrm{\Omega }}$ and $\stackrel{^}{\mathrm{\Delta }}$ are estimates of H, Ω and Δ using numerical derivatives.8

The pairs ${\left\{\left({u}_{i},{v}_{i}\right)\right\}}_{i=1}^{n}$ are drawn from the bivariate normal distribution with means ${\mu }_{u}^{0}={\mu }_{v}^{0}=1$, unit variances and correlation ρ0 = 0.5. The following simulation results are based on 2000 simulations; the rejection rates are expressed in percentages.

Actual rejection rates in the first illustrative example

The size control is excellent even for small samples when analytical derivatives are used. When one computes numerical derivatives instead, there are expectedly some size distortions, which go away quickly as the sample size grows. For samples of a few thousand, the size control is of no concern, at least for low-dimensional setups.

In our second experiment, we will analyze the size and power of the test in a more realistic setup. Here the data are mixed continuous and discrete. The continuous u has a logistic distribution, the discrete v is drawn from a three-point distribution, and the dependence is induced by the Farlie–Gumbel–Morgenstern (FGM) copula. These choices are due to availability of the joint PDF/PMF and CDF/CMF in a closed form, simplicity of the form of mean regression, simplicity of tuning the parameters so that the regression function is linear or non-linear, and, finally, conceptual similarity to our illustrative empirical application.

The continuous marginal has the density

$f_u(u|\mu,\gamma)=\frac{\exp\left(-\gamma^{-1}(u-\mu)\right)}{\gamma\left[1+\exp\left(-\gamma^{-1}(u-\mu)\right)\right]^2}$

and cumulative distribution function

$F_u(u|\mu,\gamma)=\frac{1}{1+\exp\left(-\gamma^{-1}(u-\mu)\right)}.$

We set the true value of μ to be zero in order to obtain symmetry. The three-point distribution of the discrete marginal is $v\in \left\{-1,0,+1\right\}$ with marginal PMF g(v) given by the corresponding collection of probabilities $q\in \left\{{q}_{-1},1-{q}_{-1}-{q}_{+1},{q}_{+1}\right\}$ and CMF $G(v)=q_{-1}1_{\{v\ge -1\}}+(1-q_{-1}-q_{+1})1_{\{v\ge 0\}}+q_{+1}1_{\{v\ge 1\}}$. The dependence is induced by the FGM copula

$C(w_1,w_2)=w_1w_2\left(1+\rho(1-w_1)(1-w_2)\right),$

where $\rho \in \left[-1,+1\right]$ and ρ > 0 implies positive, although moderate at most, dependence.

Let $\theta ={\left(\mu ,\gamma ,{q}_{-1},{q}_{+1},\rho \right)}^{\mathrm{\prime }}$. It is shown in Appendix B that the joint density/mass is

$f(u,v|\theta)=\gamma^{-1}\omega(u)\left(1-\omega(u)\right)\,\varphi_{-1}(u)^{1\{v=-1\}}\left(1-\varphi_{-1}(u)-\varphi_{+1}(u)\right)^{1\{v=0\}}\varphi_{+1}(u)^{1\{v=+1\}},$

where

$\omega(u)=\frac{\exp\left(-\gamma^{-1}(u-\mu)\right)}{1+\exp\left(-\gamma^{-1}(u-\mu)\right)}$

and

$\varphi_{-1}(u)=q_{-1}+\rho\left(1-2(1-\omega(u))\right)q_{-1}(1-q_{-1}),\qquad \varphi_{+1}(u)=q_{+1}-\rho\left(1-2(1-\omega(u))\right)q_{+1}(1-q_{+1}).$

If, in addition to ${\mu }^{0}=0$, we set ${q}_{-1}^{0}={q}_{+1}^{0}$, then, due to symmetry around the origin, the regression function is linear: $E\left(u|v\right)=\lambda v$, where λ depends on ${q}_{-1}^{0}$. If we set ${q}_{-1}^{0}\ne {q}_{+1}^{0}$, the symmetry no longer holds, and the regression function is no longer linear. We again set $\psi \left(v,\beta \right)=a+bv$, where $\beta ={\left(a,b\right)}^{\mathrm{\prime }}$, and study size properties when ${q}_{-1}^{0}={q}_{+1}^{0}$ and power properties when ${q}_{-1}^{0}\ne {q}_{+1}^{0}$. The total dimensionality of the parameter vector is $\mathrm{dim}\left(\vartheta \right)=7$.

Figure 1:

Regressions for the Second Experiment, Linear and Non-Linear.

The variables ${\left\{{u}_{i}\right\}}_{i=1}^{n}$ are drawn from the standard logistic distribution (i.e. with ${\mu }^{0}=0$ and ${\gamma }^{0}=1$). We set ${\rho }^{0}=1$, implying a correlation coefficient of $\frac{1}{3}$. Then, for a given i and a given pair $\left({q}_{-1}^{0},{q}_{+1}^{0}\right)$, we compute ${\varphi }_{-1}\left({u}_{i}\right)$ and ${\varphi }_{+1}\left({u}_{i}\right)$ and use these to generate the variables ${\left\{{v}_{i}\right\}}_{i=1}^{n}$ from the three-point distribution on $\left\{-1,0,+1\right\}$ with corresponding probabilities $\left\{{\varphi }_{-1}\left({u}_{i}\right),1-{\varphi }_{-1}\left({u}_{i}\right)-{\varphi }_{+1}\left({u}_{i}\right),{\varphi }_{+1}\left({u}_{i}\right)\right\}$. We set the pair $\left({q}_{-1}^{0},{q}_{+1}^{0}\right)$ to three values, one of which implies a linear regression, while the other two imply non-linear ones (see Figure 1).
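The data-generating step just described can be sketched as follows (a hypothetical implementation; the function name and defaults are our choices):

```python
import numpy as np

def simulate_fgm_mixed(n, q_m, q_p, rho=1.0, mu=0.0, gamma=1.0, seed=0):
    # draw u from the logistic(mu, gamma) marginal, then draw v in {-1,0,+1}
    # with the 'distorted' probabilities phi_{-1}(u) and phi_{+1}(u)
    rng = np.random.default_rng(seed)
    u = rng.logistic(loc=mu, scale=gamma, size=n)
    omega = 1.0 / (1.0 + np.exp((u - mu) / gamma))  # = exp(-x)/(1+exp(-x))
    tilt = rho * (1.0 - 2.0 * (1.0 - omega))
    p_m = q_m + tilt * q_m * (1.0 - q_m)            # phi_{-1}(u)
    p_p = q_p - tilt * q_p * (1.0 - q_p)            # phi_{+1}(u)
    w = rng.uniform(size=n)
    # w < p_m -> v = -1;  p_m <= w < 1 - p_p -> v = 0;  otherwise v = +1
    v = np.where(w < p_m, -1, np.where(w < 1.0 - p_p, 0, 1))
    return u, v

u, v = simulate_fgm_mixed(20_000, q_m=0.3, q_p=0.3)  # symmetric: linear case
```

Since E[ω(u)] = 1/2, the unconditional category probabilities remain (q−1, 1 − q−1 − q+1, q+1), while ρ > 0 induces positive dependence between u and v.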

The following table contains simulation results for samples of small size n = 100, moderate size n = 500, and large size n = 2000. The results are based on 2000 simulations; the rejection rates are expressed in percentages.9

Actual rejection rates in the second illustrative example

Except for small samples, the size and power figures are favorable. The actual rejection rates shown in the first line are quite close to the nominal test sizes. The power figures are impressive, especially for large samples; even though the true regression line does not deviate much from a linear form, the test detects the deviation quite often in a sample of moderate size. With small samples, the null rejection rates fall somewhat short of the nominal rates, and the test has a hard time detecting small deviations from the null. While a hundred observations are clearly not sufficient for the test to work properly, increasing the sample size severalfold straightens out the rejection rates and makes the properties of the test very attractive.

## 5 Illustrative Application

In this section we illustrate the test using the labor market data from Card (1995). These data contain, in particular, wage, education and age for a sample of n = 3010 US men taken in 1976. The main variable is the logarithm of wages (lwage76), and the regressors are education (ed76) and age (age76). We run bivariate and trivariate full parametric models for the pairs (lwage76,ed76), (lwage76,age76) and the triple (lwage76,ed76,age76), compute implied regressions of log wages on one or two regressors, and test them for linearity using the test developed in this paper.10

Because the regressand is a continuous variable while both regressors are discrete, we construct the joint distribution using copula machinery. The marginal density for the continuously distributed log wages is chosen to be the skew-normal distribution (Azzalini 1985):

$u=\mu+\sigma w,$

where μ is a location parameter, σ is a scale parameter,

$f_w(w|\gamma)=2\phi(w)\Phi(\gamma w),$

and11 γ is a shape parameter that indexes the degree of skewness; the distribution reduces to the regular normal when γ = 0. In total, the skew-normal density ${f}_{u}\left(u|{\theta }_{u}\right)$ and its CDF ${F}_{u}\left(u|{\theta }_{u}\right)$ are characterized by the three parameters in ${\theta }_{u}={\left(\mu ,\sigma ,\gamma \right)}^{\mathrm{\prime }}$. Azzalini, Dal Cappello, and Kotz (2003) argue that this distribution (among others) approximates real log income data well. Below are the results of fitting the marginal skew-normal density to the variable lwage76.

Estimates of the marginal log-wage distribution

The Kolmogorov–Smirnov statistic (the maximal difference between the empirical distribution function and the estimated CDF) equals 0.0168, and, normalized by $\sqrt{n}$, equals 0.921, which is well below the critical value even at the 20% significance level (e.g. Massey 1951).
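This check is easy to reproduce; since the Card (1995) series is not included here, the following sketch fits the skew-normal to a simulated stand-in sample of the same size (the parameter values are placeholders, not estimates from the paper):

```python
import numpy as np
from scipy import stats

# stand-in for lwage76: a skew-normal sample of the same size as in the paper
x = stats.skewnorm.rvs(a=2.0, loc=6.0, scale=0.5, size=3010,
                       random_state=np.random.default_rng(0))

a_hat, loc_hat, scale_hat = stats.skewnorm.fit(x)   # ML fit of (gamma, mu, sigma)
ks = stats.kstest(x, stats.skewnorm(a_hat, loc_hat, scale_hat).cdf)
d_n = ks.statistic                     # max |EDF - fitted CDF|
normalized = np.sqrt(len(x)) * d_n     # compare with Kolmogorov quantiles
```

One caveat: because the parameters are estimated from the same sample, the tabulated Kolmogorov critical values are conservative for this comparison.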

The marginal distributions of the variables ed76 and age76 are categorical, with the number of categories being k1 = 18 for the former and k2 = 11 for the latter,12 and with categorical probabilities ${q}_{\ell }={\left({q}_{j}\right)}_{j=1}^{{k}_{\ell }}$, $\ell =1,2$, subject to $\sum _{j=1}^{{k}_{\ell }}{q}_{j}=1$. Let us denote the CMF of this distribution by ${G}_{v}\left(v|q\right)=\sum _{j=1}^{⌊v⌋}{q}_{j}$. The estimates are shown in the following tables.

Estimates of the marginal education distribution

Estimates of the marginal age distribution

Because the two or three variables mix continuous and discrete components, we extend the method of Anatolyev and Gospodinov (2010) for constructing a joint distribution from mixed marginals to the case of multiple values in the discrete marginals’ support,13 using copula machinery. We employ the Gaussian copula because it is simple and convenient, easily interpretable, and allows a natural extension to higher dimensions with a reasonable increase in the degree of parameterization. When there is only one discrete regressor, the Gaussian copula has only one correlation parameter ϱ. It is derived in Appendix C that the joint density is

$f(u,v|θ)=fu(u|θu)fC(u,v|θ),$

where

$f_C(u,v|\theta)=\Phi\left(\frac{\Phi^{-1}\left(G(v|q)\right)-\varrho\,\Phi^{-1}\left(F_u(u|\theta_u)\right)}{\sqrt{1-\varrho^2}}\right)-\Phi\left(\frac{\Phi^{-1}\left(G(v-1|q)\right)-\varrho\,\Phi^{-1}\left(F_u(u|\theta_u)\right)}{\sqrt{1-\varrho^2}}\right)$

is the ‘distorted’ categorical probability, and $\theta ={\left({\theta }_{u}^{\mathrm{\prime }},\varrho ,{q}^{\mathrm{\prime }}\right)}^{\mathrm{\prime }}$ collects all 21 (for ed76) or 14 (for age76) parameters.
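The ‘distorted’ probability above is straightforward to evaluate; here is a sketch with a toy three-category CMF (the helper function and its arguments are ours, not from the paper). With ϱ = 0 it reduces to the plain category probability, and the probabilities sum to one for any ϱ because the expression telescopes over categories.

```python
import numpy as np
from scipy.stats import norm

def distorted_prob(v, cmf, F_u, rho):
    # Gaussian-copula 'distorted' probability of category v, given the value
    # F_u = F_u(u|theta_u); cmf(v) is the marginal CMF G(v|q)
    z = norm.ppf(F_u)
    s = np.sqrt(1.0 - rho**2)
    hi = norm.cdf((norm.ppf(cmf(v)) - rho * z) / s)
    lo = norm.cdf((norm.ppf(cmf(v - 1)) - rho * z) / s)
    return hi - lo

# toy CMF over categories {1, 2, 3} with probabilities (0.2, 0.5, 0.3)
G = {0: 0.0, 1: 0.2, 2: 0.7, 3: 1.0}.get
p2 = distorted_prob(2, G, F_u=0.5, rho=0.0)    # independence: p2 = 0.5
total = sum(distorted_prob(v, G, F_u=0.8, rho=0.5) for v in (1, 2, 3))
```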

Maximization of the joint (log) likelihood yields estimates of parameters of the marginals very close to figures reported above but with lower standard errors, and the estimates of the copula as in the following table:

Estimates of the bivariate copula

One can see that the estimates of the bivariate degrees of dependence are highly statistically significant and moderately large in value.

Figure 2:

Estimated Mean Regression with Regressor ed76 (Top Panel) or age76 (Bottom Panel).

Figure 2 shows the estimated mean regressions. In the case of ed76, it may appear that the true functional form is linear, which is what the corresponding literature tends to focus on. In the case of age76, linearity does not seem to hold, but a low-order polynomial like a cubic form may be appropriate. To verify whether these conjectures hold, we first perform the test for a linear mean regression:

$\psi(v,\beta)=a+bv.$

The test results are in the following table.

Results of testing for linearity in the bivariate case

The hypothesis of a linear regression form is decidedly rejected for both regressors at any conventional significance level; in fact, the test statistics exceed the critical values by a huge margin. We conclude that the form of the actual mean regression differs from what is usually assumed in regressions of wages on its determinants.

Labor econometricians often add to their linear regressions the square of a variable related to duration (e.g. work experience14); Murphy and Welch (1990) show that even fourth powers may be needed. Therefore, we have also run the test with low-order polynomial hypothesized regression forms: ${\psi }_{2}\left(v,\beta \right)=a+bv+c{v}^{2}$ and ${\psi }_{4}\left(v,\beta \right)=a+bv+c{v}^{2}+d{v}^{3}+f{v}^{4}$. These functional forms are also rejected at any conventional significance level.

When there are two discrete regressors, the Gaussian copula has a 3 × 3 correlation matrix

$R=\begin{bmatrix}1&\varrho_0&\varrho_1\\ \varrho_0&1&\varrho_2\\ \varrho_1&\varrho_2&1\end{bmatrix}$

with three distinct parameters ϱ0, ϱ1, ϱ2. It is derived in Appendix C that the joint density is

$f(u,v_1,v_2|\theta)=f_u(u|\theta_u)\,f_C(u,v_1,v_2|\theta),$

where

$f_C(u,v_1,v_2)=\Phi_2\left(\varphi_1(v_1),\varphi_2(v_2)\,|\,\varphi_u(u)\right)-\Phi_2\left(\varphi_1(v_1-1),\varphi_2(v_2)\,|\,\varphi_u(u)\right)-\Phi_2\left(\varphi_1(v_1),\varphi_2(v_2-1)\,|\,\varphi_u(u)\right)+\Phi_2\left(\varphi_1(v_1-1),\varphi_2(v_2-1)\,|\,\varphi_u(u)\right)$

for ${v}_{1},{v}_{2}\in \left\{0,1\right\}$ are ‘distorted’ bivariate categorical probabilities, where

$\varphi_\ell(v)=\Phi^{-1}\left(G_\ell(v)\right),\ \ell=1,2,\qquad \varphi_u(u)=\Phi^{-1}\left(F_u(u)\right),$

and $\theta ={\left({\theta }_{u}^{\mathrm{\prime }},{\varrho }_{0},{\varrho }_{1},{\varrho }_{2},{q}_{1}^{\mathrm{\prime }},{q}_{2}^{\mathrm{\prime }}\right)}^{\mathrm{\prime }}$ collects all 33 parameters.

Maximization of the joint (log) likelihood yields estimates of parameters of the marginals very close to figures reported above but with lower standard errors, and the estimates of the copula as in the following table:

Estimates of the trivariate copula

One can see that the estimates of the bivariate degrees of dependence ϱ1 and ϱ2 are very close to those from the bivariate models, with similar standard errors. The degree of dependence between the two regressors, ϱ0, is estimated to be quite modest but significantly different from zero.

Figure 3:

Estimated Mean Regression with Regressors ed76 and age76.

Figure 3 shows the surface of the estimated mean regression which is arguably close to a plane. We perform the test for a linear mean regression:

$\psi(v_1,v_2,\beta)=a+b_1v_1+b_2v_2.$

The test results are:

Results of testing for linearity in the trivariate case

The hypothesis of a linear regression form is decidedly rejected at any conventional significance level. We also repeat this exercise for the form quadratic in both regressors, ${\psi }_{22}\left({v}_{1},{v}_{2},\beta \right)=a+{b}_{1}{v}_{1}+{b}_{2}{v}_{2}+{c}_{1}{v}_{1}^{2}+{c}_{12}{v}_{1}{v}_{2}+{c}_{2}{v}_{2}^{2}$, and, motivated by the study of Murphy and Welch (1990), for the form linear in education and quartic in age, ${\psi }_{14}\left({v}_{1},{v}_{2},\beta \right)=a+{b}_{1}{v}_{1}+{b}_{2}{v}_{2}+{c}_{2}{v}_{2}^{2}+{c}_{12}{v}_{1}{v}_{2}+{d}_{2}{v}_{2}^{3}+{d}_{12}{v}_{1}{v}_{2}^{2}+{f}_{2}{v}_{2}^{4}+{f}_{12}{v}_{1}{v}_{2}^{3}$, as well as for the same form with age v2 replaced by potential experience, which equals ${v}_{2}+17-{v}_{1}$.15

These functional forms are also decidedly rejected at any conventional significance level. Evidently, the visible “bumps” in the curves/surface in Figure 2 and Figure 3 are not due to sampling error alone, but rather are built-in attributes of the shapes of the regressions. The overall results imply that the true mean regressions are unlikely to reduce to low-order polynomials in the conditioning variables and instead take more complex functional forms, which contradicts popular empirical practice.16

## 6 Conclusion

We have developed a test for a restricted functional form of a mean regression when a parametric distribution for all variables is specified and estimated. The test is based on a mean-square comparison of the estimated regression implied by the joint density with the estimated hypothesized functional form. The test statistic is asymptotically distributed as a mixture of χ2 distributions, with coefficients computable from the true and hypothesized regression functions and the score function. The size and power properties are favorable for sample sizes usually employed. A possible direction of future research is the extension of the test to causal regressions estimated by instrumental variables.

## Acknowledgement

I am grateful to the Co-Editor and two anonymous referees for useful suggestions that significantly improved the presentation. I also thank Nikolay Kudrin for excellent research assistance.

## Proofs

#### Proof of Lemma 1

Consistency and asymptotic normality of $\stackrel{^}{\vartheta }$ follow from Newey and McFadden (1994, theorems 2.5, 2.6, 3.3 and 3.4) using Assumption 1 and Assumption 2.      □

#### Proof of Lemma 2

Note that

$$E\left[\left\|\frac{\partial E(u|v,\theta_0)}{\partial\vartheta}\right\|^2\right]=E\left[\left\|\int_{-\infty}^{+\infty}u\,\frac{\partial f(u|v,\theta_0)}{\partial\theta}\,du\right\|^2\right]=E\left[\left\|E\left[u\,\frac{\partial\ln f(u|v,\theta_0)}{\partial\theta}\,\Big|\,v\right]\right\|^2\right]<\infty,$$

which follows from Assumption 3. Next,

$$E\left[\left\|\frac{\partial E(u|v,\theta_0)}{\partial\vartheta}\frac{\partial\psi(v,\beta_0)}{\partial\vartheta'}\right\|\right]\le E\left[\left\|\frac{\partial E(u|v,\theta_0)}{\partial\vartheta}\right\|\left\|\frac{\partial\psi(v,\beta_0)}{\partial\vartheta'}\right\|\right]\le E\left[\left\|\frac{\partial E(u|v,\theta_0)}{\partial\vartheta}\right\|^2\right]^{1/2}E\left[\left\|\frac{\partial\psi(v,\beta_0)}{\partial\vartheta'}\right\|^2\right]^{1/2}<\infty,$$

which follows from the previous and Assumption 2(f). Finally, ${M}_{\psi \psi }$ is finite by Assumption 2(f). This shows finiteness of Δ.

Now,

$$\begin{aligned}
&E\left[\sup_{\theta\in N_\theta,\,\beta\in N_\beta}\left\|\left(\frac{\partial E(u|v_i,\theta)}{\partial\vartheta}-\frac{\partial\psi(v_i,\beta)}{\partial\vartheta}\right)\left(\frac{\partial E(u|v_i,\theta)}{\partial\vartheta}-\frac{\partial\psi(v_i,\beta)}{\partial\vartheta}\right)'\right\|\right]\\
&\qquad\le E\left[\sup_{\theta\in N_\theta,\,\beta\in N_\beta}\left(\left\|\frac{\partial E(u|v_i,\theta)}{\partial\vartheta}\right\|+\left\|\frac{\partial\psi(v_i,\beta)}{\partial\vartheta}\right\|\right)^2\right]\\
&\qquad\le 2E\left[\sup_{\theta\in N_\theta}\left\|E\left[u\,\frac{\partial\ln f(u|v,\theta)}{\partial\theta}\,\Big|\,v\right]\right\|^2\right]+2E\left[\sup_{\beta\in N_\beta}\left\|\frac{\partial\psi(v_i,\beta)}{\partial\beta}\right\|^2\right]<\infty
\end{aligned}$$

by Assumption 2(d) and Assumption 3. Then, by Lemma 4.3 of Newey and McFadden (1994), $\stackrel{^}{\mathrm{\Delta }}\stackrel{p}{\to }\mathrm{\Delta }$.      □

#### Proof of Theorem 1

Take a second-order stochastic expansion of $n\stackrel{^}{D}$ around the true parameter value ϑ0:

$$n\hat{D}=\sum_{i=1}^n\big(E(u|v_i,\theta_0)-\psi(v_i,\beta_0)\big)^2+\sqrt{n}\,\frac{\partial\hat{D}}{\partial\vartheta'}\bigg|_{\vartheta_0}\hat{\zeta}_\vartheta+\hat{\zeta}_\vartheta'\,\frac{1}{2}\frac{\partial^2\hat{D}}{\partial\vartheta\,\partial\vartheta'}\bigg|_{\vartheta_0}\hat{\zeta}_\vartheta+O_P\!\left(\frac{1}{\sqrt{n}}\right),$$

where

$$\hat{\zeta}_\vartheta=\sqrt{n}\,(\hat{\vartheta}-\vartheta_0)\ \xrightarrow{d}\ \zeta_\vartheta\ \overset{d}{=}\ N(0,V_\vartheta),$$

and ${V}_{\vartheta }={H}^{-1}\mathrm{\Omega }{H}^{-1}$ is the asymptotic variance matrix of $\stackrel{^}{\vartheta }$. Under H0, the leading term is zero. Next, under H0,

$$\frac{\partial\hat{D}}{\partial\vartheta'}\bigg|_{\vartheta_0}=\frac{1}{n}\sum_{i=1}^n\frac{\partial}{\partial\vartheta'}\big(E(u|v_i,\theta_0)-\psi(v_i,\beta_0)\big)^2=2\,\frac{1}{n}\sum_{i=1}^n\big(E(u|v_i,\theta_0)-\psi(v_i,\beta_0)\big)\frac{\partial\big(E(u|v_i,\theta_0)-\psi(v_i,\beta_0)\big)}{\partial\vartheta'}=0.$$

Finally,

$$\begin{aligned}
\frac{1}{2}\frac{\partial^2\hat{D}}{\partial\vartheta\,\partial\vartheta'}\bigg|_{\vartheta_0}&=\frac{1}{2}\frac{1}{n}\sum_{i=1}^n\frac{\partial^2}{\partial\vartheta\,\partial\vartheta'}\big(E(u|v_i,\theta_0)-\psi(v_i,\beta_0)\big)^2\\
&=\frac{1}{n}\sum_{i=1}^n\frac{\partial}{\partial\vartheta'}\left[\big(E(u|v_i,\theta_0)-\psi(v_i,\beta_0)\big)\left(\frac{\partial E(u|v_i,\theta_0)}{\partial\vartheta}-\frac{\partial\psi(v_i,\beta_0)}{\partial\vartheta}\right)\right]\\
&=\frac{1}{n}\sum_{i=1}^n\left(\frac{\partial E(u|v_i,\theta_0)}{\partial\vartheta}-\frac{\partial\psi(v_i,\beta_0)}{\partial\vartheta}\right)\left(\frac{\partial E(u|v_i,\theta_0)}{\partial\vartheta}-\frac{\partial\psi(v_i,\beta_0)}{\partial\vartheta}\right)'\\
&\qquad+\frac{1}{n}\sum_{i=1}^n\big(E(u|v_i,\theta_0)-\psi(v_i,\beta_0)\big)\frac{\partial}{\partial\vartheta'}\left(\frac{\partial E(u|v_i,\theta_0)}{\partial\vartheta}-\frac{\partial\psi(v_i,\beta_0)}{\partial\vartheta}\right)\\
&\overset{H_0}{=}\frac{1}{n}\sum_{i=1}^n\left(\frac{\partial E(u|v_i,\theta_0)}{\partial\vartheta}-\frac{\partial\psi(v_i,\beta_0)}{\partial\vartheta}\right)\left(\frac{\partial E(u|v_i,\theta_0)}{\partial\vartheta}-\frac{\partial\psi(v_i,\beta_0)}{\partial\vartheta}\right)'=\Delta+o_P(1)
\end{aligned}$$

by the law of large numbers (see the proof of Lemma 2) and because

$$E\left[\left\|\left(\frac{\partial E(u|v,\theta_0)}{\partial\vartheta}-\frac{\partial\psi(v,\beta_0)}{\partial\vartheta}\right)\left(\frac{\partial E(u|v,\theta_0)}{\partial\vartheta}-\frac{\partial\psi(v,\beta_0)}{\partial\vartheta}\right)'\right\|\right]\le E\left[\left\|\frac{\partial E(u|v,\theta_0)}{\partial\vartheta}-\frac{\partial\psi(v,\beta_0)}{\partial\vartheta}\right\|^2\right]\le 2E\left[\left\|\frac{\partial E(u|v,\theta_0)}{\partial\vartheta}\right\|^2+\left\|\frac{\partial\psi(v,\beta_0)}{\partial\vartheta}\right\|^2\right]<\infty.$$

Summarizing, we have that under H0,

$$n\hat{D}=\zeta_\vartheta'\Delta\zeta_\vartheta+o_P(1).$$

Now, using Lemma 3.2 from Vuong (1989), we get that

$$n\hat{D}\ \xrightarrow{d}\ \sum_{j=1}^{\dim(\vartheta)}\lambda_j\zeta_j^2,$$

where $\left\{{\lambda }_{j}{\right\}}_{j=1}^{\mathrm{dim}\left(\vartheta \right)}$ are eigenvalues of ${V}_{\vartheta }\mathrm{\Delta }=\mathrm{\Lambda }$, and $\left\{{\zeta }_{j}^{2}{\right\}}_{j=1}^{\mathrm{dim}\left(\vartheta \right)}\sim IID{\chi }_{\left(1\right)}^{2}$.      □
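Given the eigenvalues, the null distribution of the mixture of χ2 variables is easy to simulate, mirroring the GAUSS one-liner in footnote 7. The following Python sketch is illustrative only: the vector `lambdas` holds hypothetical placeholder eigenvalues of ${V}_{\vartheta }\mathrm{\Delta }$, not values from the paper.

```python
import numpy as np

def simulate_mixed_chi2(lambdas, n_sim=100_000, seed=0):
    """Draw from sum_j lambda_j * chi2_j(1), the null limit of n * D_hat."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((len(lambdas), n_sim))
    return np.asarray(lambdas) @ (z ** 2)  # each z_j**2 is chi2(1)

# hypothetical eigenvalues of V_vartheta * Delta, for illustration only
lambdas = [0.8, 0.3, 0.1, 0.0, 0.0]
draws = simulate_mixed_chi2(lambdas)
crit_5pct = np.quantile(draws, 0.95)  # simulated 5% critical value for n * D_hat
```

Critical values or p-values are then read off the empirical distribution of `draws`.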

#### Proof of Theorem 2

It follows from the proof of Theorem 1 that

$$n\hat{D}=\sum_{i=1}^n\big(E(u|v_i,\theta_0)-\psi(v_i,\beta_0)\big)^2+O_P(\sqrt{n}).$$

Because $E\left(u|v,{\theta }^{0}\right)\ne \psi \left(v,{\beta }^{0}\right)$ almost surely, the leading sum is positive and, by the law of large numbers, grows at rate n, dominating the remainder; hence $n\stackrel{^}{D}$ tends to +∞ as n → ∞.      □

## Details on Simulation Experiments

Consider the setup of the first experiment. Because $E\left(u|v\right)={\mu }_{u}+\rho \left(v-{\mu }_{v}\right)$ and $\psi \left(v,\beta \right)=a+bv$, we compute that

$$\frac{\partial E(u|v)}{\partial\vartheta}-\frac{\partial\psi(v,\beta)}{\partial\vartheta}=\begin{bmatrix}1\\-\rho\\v-\mu_v\\-1\\-v\end{bmatrix}.$$

Note that there are only two non-collinear elements. Hence,

$$\Delta=\begin{bmatrix}1&-\rho&0&-1&-\mu_v\\-\rho&\rho^2&0&\rho&\rho\mu_v\\0&0&1&0&-1\\-1&\rho&0&1&\mu_v\\-\mu_v&\rho\mu_v&-1&\mu_v&1+\mu_v^2\end{bmatrix},$$

which, as expected, has rank 2.
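The closed form of Δ and its rank are easy to verify numerically. A minimal Python sketch, with illustrative values ρ = 0.5 and μv = 1 (not estimates from the paper), estimates E[dd′] by Monte Carlo for d = ∂E(u|v)/∂ϑ − ∂ψ(v,β)/∂ϑ and compares it with the matrix above:

```python
import numpy as np

rho, mu_v = 0.5, 1.0  # illustrative parameter values, not from the paper
rng = np.random.default_rng(1)
v = rng.normal(mu_v, 1.0, size=1_000_000)

# d(v) = dE(u|v)/d(vartheta) - dpsi(v,beta)/d(vartheta), stacked over draws of v
ones = np.ones_like(v)
d = np.stack([ones, -rho * ones, v - mu_v, -ones, -v])
Delta_mc = d @ d.T / v.size  # Monte Carlo estimate of E[d d']

# closed form from the text
Delta = np.array([
    [1.0,   -rho,        0.0, -1.0,  -mu_v],
    [-rho,   rho**2,     0.0,  rho,   rho * mu_v],
    [0.0,    0.0,        1.0,  0.0,  -1.0],
    [-1.0,   rho,        0.0,  1.0,   mu_v],
    [-mu_v,  rho * mu_v, -1.0,  mu_v, 1.0 + mu_v**2],
])
rank = np.linalg.matrix_rank(Delta)  # rank 2, as the text notes
```

The rank is 2 because d(v) is an affine function of the single random variable v, so E[dd′] is spanned by two vectors.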

The log-density is

$$\ln f(u,v|\theta)=-\ln 2\pi-\frac{1}{2}\ln(1-\rho^2)-\frac{(u-\mu_u)^2-2\rho(u-\mu_u)(v-\mu_v)+(v-\mu_v)^2}{2(1-\rho^2)},$$

and its derivatives are

$$\frac{\partial\ln f(u,v|\theta)}{\partial\theta}=\begin{bmatrix}\dfrac{1}{1-\rho^2}\big((u-\mu_u)-\rho(v-\mu_v)\big)\\[2ex]\dfrac{1}{1-\rho^2}\big((v-\mu_v)-\rho(u-\mu_u)\big)\\[2ex]\dfrac{\rho}{1-\rho^2}-\dfrac{\rho}{(1-\rho^2)^2}\big((u-\mu_u)^2+(v-\mu_v)^2\big)+\dfrac{1+\rho^2}{(1-\rho^2)^2}(u-\mu_u)(v-\mu_v)\end{bmatrix}.$$

Then

$$E\left[\frac{\partial^2\ln f(u,v|\theta)}{\partial\theta\,\partial\theta'}\right]=\begin{bmatrix}-\dfrac{1}{1-\rho^2}&\dfrac{\rho}{1-\rho^2}&0\\[1.5ex]\dfrac{\rho}{1-\rho^2}&-\dfrac{1}{1-\rho^2}&0\\[1.5ex]0&0&-\dfrac{1+\rho^2}{(1-\rho^2)^2}\end{bmatrix}.$$

The derivatives of the hypothesized regression function are

$$\frac{\partial\psi(v,\beta)}{\partial\beta}=\begin{pmatrix}1\\v\end{pmatrix},$$

and hence

$$E\left[\frac{\partial\psi(v,\beta)}{\partial\beta}\frac{\partial\psi(v,\beta)}{\partial\beta'}\right]=\begin{bmatrix}1&\mu_v\\\mu_v&1+\mu_v^2\end{bmatrix}.$$

So, the (minus) inverted Hessian is

$$-H^{-1}=\begin{bmatrix}1&\rho&0&0&0\\\rho&1&0&0&0\\0&0&\dfrac{(1-\rho^2)^2}{1+\rho^2}&0&0\\0&0&0&1+\mu_v^2&-\mu_v\\0&0&0&-\mu_v&1\end{bmatrix}.$$

Next we compute

$$E\left[\frac{\partial\ln f(u,v|\theta)}{\partial\theta}\frac{\partial\ln f(u,v|\theta)}{\partial\theta'}\right]=\begin{bmatrix}\dfrac{1}{1-\rho^2}&-\dfrac{\rho}{1-\rho^2}&0\\[1.5ex]-\dfrac{\rho}{1-\rho^2}&\dfrac{1}{1-\rho^2}&0\\[1.5ex]0&0&\dfrac{1+\rho^2}{(1-\rho^2)^2}\end{bmatrix},$$

$$E\left[(u-\psi(v,\beta))^2\,\frac{\partial\psi(v,\beta)}{\partial\beta}\frac{\partial\psi(v,\beta)}{\partial\beta'}\right]=(1-\rho^2)\begin{bmatrix}1&\mu_v\\\mu_v&1+\mu_v^2\end{bmatrix},$$

$$E\left[(u-\psi(v,\beta))\,\frac{\partial\ln f(u,v|\theta)}{\partial\theta}\frac{\partial\psi(v,\beta)}{\partial\beta'}\right]=\begin{bmatrix}1&\mu_v\\-\rho&-\rho\mu_v\\0&1\end{bmatrix}.$$

Hence, the matrix of expected cross-products of the elements of the score vector is

$$\Omega=\begin{bmatrix}\dfrac{1}{1-\rho^2}&-\dfrac{\rho}{1-\rho^2}&0&1&\mu_v\\[1.5ex]-\dfrac{\rho}{1-\rho^2}&\dfrac{1}{1-\rho^2}&0&-\rho&-\rho\mu_v\\[1.5ex]0&0&\dfrac{1+\rho^2}{(1-\rho^2)^2}&0&1\\[1.5ex]1&-\rho&0&1-\rho^2&(1-\rho^2)\mu_v\\[1.5ex]\mu_v&-\rho\mu_v&1&(1-\rho^2)\mu_v&(1-\rho^2)(1+\mu_v^2)\end{bmatrix}.$$

Then the asymptotic variance matrix is

$$V_\vartheta=\begin{bmatrix}1&\rho&0&1-\rho^2&0\\\rho&1&0&0&0\\0&0&\dfrac{(1-\rho^2)^2}{1+\rho^2}&-\dfrac{\mu_v(1-\rho^2)^2}{1+\rho^2}&\dfrac{(1-\rho^2)^2}{1+\rho^2}\\1-\rho^2&0&-\dfrac{\mu_v(1-\rho^2)^2}{1+\rho^2}&(1-\rho^2)(1+\mu_v^2)&-\mu_v(1-\rho^2)\\0&0&\dfrac{(1-\rho^2)^2}{1+\rho^2}&-\mu_v(1-\rho^2)&1-\rho^2\end{bmatrix},$$

and, consequently,

$$V_\vartheta\Delta=\frac{2\rho^2(1-\rho^2)}{1+\rho^2}\begin{bmatrix}0&0&0&0&0\\0&0&0&0&0\\0&0&0&0&0\\0&0&\mu_v&0&-\mu_v\\0&0&-1&0&1\end{bmatrix}.$$
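One can verify numerically that multiplying the displayed closed forms of $V_\vartheta$ and Δ reproduces this product, and that it has a single nonzero eigenvalue, 2ρ²(1−ρ²)/(1+ρ²), so in this design the limiting mixture collapses to one scaled χ²(1) term. A Python sketch with illustrative values ρ = 0.5 and μv = 1 (not from the paper):

```python
import numpy as np

rho, mv = 0.5, 1.0          # illustrative parameter values
r2 = 1 - rho**2             # 1 - rho^2
c3 = r2**2 / (1 + rho**2)   # (1 - rho^2)^2 / (1 + rho^2)

Delta = np.array([
    [1.0,  -rho,      0.0, -1.0, -mv],
    [-rho,  rho**2,   0.0,  rho,  rho * mv],
    [0.0,   0.0,      1.0,  0.0, -1.0],
    [-1.0,  rho,      0.0,  1.0,  mv],
    [-mv,   rho * mv, -1.0, mv,   1.0 + mv**2],
])
V = np.array([
    [1.0, rho, 0.0,      r2,                0.0],
    [rho, 1.0, 0.0,      0.0,               0.0],
    [0.0, 0.0, c3,       -mv * c3,          c3],
    [r2,  0.0, -mv * c3, r2 * (1 + mv**2), -mv * r2],
    [0.0, 0.0, c3,       -mv * r2,          r2],
])

c = 2 * rho**2 * r2 / (1 + rho**2)  # the common scalar factor
M = np.zeros((5, 5))
M[3, 2], M[3, 4] = mv, -mv          # fourth row: (0, 0, mu_v, 0, -mu_v)
M[4, 2], M[4, 4] = -1.0, 1.0        # fifth row:  (0, 0, -1,   0,  1)

VD = V @ Delta                      # should equal c * M
eig = np.sort(np.linalg.eigvals(VD).real)
```

The sorted eigenvalues are four zeros and the single value c, confirming the rank-one structure of the product.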

For the second experiment, we extend the method of Anatolyev and Gospodinov (2010) for constructing a joint distribution from mixed discrete and continuous marginals to cases where the cardinality of the discrete marginal’s support exceeds two. The joint CDF/CMF is

$$F(u,v)=C\big(F_u(u),G(v)\big),$$

so the PDF/PMF is a derivative with respect to the continuous argument and a difference with respect to the discrete one:

$$f(u,v)=\frac{\partial C}{\partial u}\big(F_u(u),G(v)\big)-\frac{\partial C}{\partial u}\big(F_u(u),G(v-1)\big)=f_u(u)\,f_\partial(u,v),$$

where the second term is

$$f_\partial(u,v)=\left[\frac{\partial C}{\partial w}\big(w,G(v)\big)-\frac{\partial C}{\partial w}\big(w,G(v-1)\big)\right]_{w=F_u(u)},$$

or

$$f_\partial(u,-1)=\left[\frac{\partial C}{\partial w}(w,q_{-1})\right]_{w=F_u(u)},\qquad f_\partial(u,0)=\left[\frac{\partial C}{\partial w}(w,1-q_{+1})-\frac{\partial C}{\partial w}(w,q_{-1})\right]_{w=F_u(u)},\qquad f_\partial(u,1)=1-\left[\frac{\partial C}{\partial w}(w,1-q_{+1})\right]_{w=F_u(u)}.$$

For the FGM copula,

$$\frac{\partial C}{\partial w}(w,y)=y+\rho(1-2w)\,y(1-y),$$

implying the distorted success probabilities

$$\begin{aligned}q_{-1}^C(z)&=q_{-1}+\rho(1-2z)\,q_{-1}(1-q_{-1}),\\ q_0^C(z)&=1-q_{-1}-q_{+1}+\rho(1-2z)\big[q_{+1}(1-q_{+1})-q_{-1}(1-q_{-1})\big],\\ q_{+1}^C(z)&=q_{+1}-\rho(1-2z)\,q_{+1}(1-q_{+1}).\end{aligned}$$

The joint density/mass is

$$f(u,v)=f_u(u)\,q_{-1}^C\big(F_u(u)\big)^{\mathbf{1}\{v=-1\}}\,q_0^C\big(F_u(u)\big)^{\mathbf{1}\{v=0\}}\,q_{+1}^C\big(F_u(u)\big)^{\mathbf{1}\{v=+1\}},$$

and the result follows.
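As a quick sanity check of this construction, the distorted probabilities must be nonnegative and sum to one for every z = F_u(u) ∈ [0, 1]. A small Python sketch with hypothetical values q−1 = 0.2, q+1 = 0.3 and ρ = 0.4 (chosen for illustration, not from the experiments):

```python
import numpy as np

def distorted_probs(z, q_m, q_p, rho):
    """FGM-'distorted' category probabilities q_v^C(z) for v in {-1, 0, +1}.

    z = F_u(u) is the continuous marginal's CDF value; q_m and q_p are the
    marginal probabilities of v = -1 and v = +1 (hypothetical values below)."""
    qm_c = q_m + rho * (1 - 2 * z) * q_m * (1 - q_m)
    q0_c = (1 - q_m - q_p
            + rho * (1 - 2 * z) * (q_p * (1 - q_p) - q_m * (1 - q_m)))
    qp_c = q_p - rho * (1 - 2 * z) * q_p * (1 - q_p)
    return qm_c, q0_c, qp_c

# the three probabilities over a grid of z values in [0, 1]
z_grid = np.linspace(0.0, 1.0, 101)
probs = np.array([distorted_probs(z, 0.2, 0.3, 0.4) for z in z_grid])
```

The sum-to-one property holds identically in z because the ρ-dependent distortions cancel across the three categories.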

## Details on Empirical Illustration

We omit the parameters during the derivations. In the case of only one discrete component, the joint PDF/PMF is

$$f(u,v)=\frac{\partial C}{\partial u}\big(F_u(u),G_v(v)\big)-\frac{\partial C}{\partial u}\big(F_u(u),G_v(v-1)\big)=f_u(u)\,f_C(u,v),$$

where the last term is

$$f_C(u,v)=\left[\frac{\partial C}{\partial w}\big(w,G_v(v)\big)-\frac{\partial C}{\partial w}\big(w,G_v(v-1)\big)\right]_{w=F_u(u)}.$$

The Gaussian copula is $C\left(w,y\right)={\mathrm{\Phi }}_{2}\left({\mathrm{\Phi }}^{-1}\left(w\right),{\mathrm{\Phi }}^{-1}\left(y\right)\right)$, where Φ2 is the CDF of the standard bivariate normal, and Φ−1 is the inverse of the standard normal CDF. Note the important property:

$$\begin{aligned}\frac{\partial\Phi_2(x_1,x_2)}{\partial x_1}&=\frac{\partial}{\partial x_1}\int_{-\infty}^{x_1}\int_{-\infty}^{x_2}\phi_2(t_1,t_2)\,dt_1\,dt_2=\frac{\partial}{\partial x_1}\int_{-\infty}^{x_1}\int_{-\infty}^{x_2}\phi(t_2|t_1)\phi(t_1)\,dt_1\,dt_2\\&=\frac{\partial}{\partial x_1}\int_{-\infty}^{x_1}\phi(t_1)\left(\int_{-\infty}^{x_2}\phi(t_2|t_1)\,dt_2\right)dt_1=\frac{\partial}{\partial x_1}\int_{-\infty}^{x_1}\phi(t_1)\Phi(x_2|t_1)\,dt_1=\phi(x_1)\Phi(x_2|x_1).\end{aligned}$$

$$\begin{aligned}\frac{\partial C(w,y)}{\partial w}&=\frac{\partial\Phi_2\big(\Phi^{-1}(w),\Phi^{-1}(y)\big)}{\partial w}=\frac{\partial\Phi_2(x_1,x_2)}{\partial x_1}\bigg|_{x_1=\Phi^{-1}(w),\,x_2=\Phi^{-1}(y)}\cdot\frac{\partial\Phi^{-1}(w)}{\partial w}\\&=\phi(x_1)\Phi(x_2|x_1)\big|_{x_1=\Phi^{-1}(w),\,x_2=\Phi^{-1}(y)}\cdot\frac{1}{\phi(x_1)}\bigg|_{x_1=\Phi^{-1}(w)}=\Phi\big(\Phi^{-1}(y)\,\big|\,\Phi^{-1}(w)\big).\end{aligned}$$

Then,

$$f_C(u,v)=\Phi\big(\Phi^{-1}(G_v(v))\,\big|\,\Phi^{-1}(F_u(u))\big)-\Phi\big(\Phi^{-1}(G_v(v-1))\,\big|\,\Phi^{-1}(F_u(u))\big).$$

Note that because Φ2 is bivariate standard normal with correlation coefficient ϱ, we have, by normality of the conditional distributions under joint normality, that

$$\Phi\big(\Phi^{-1}(y)\,\big|\,\Phi^{-1}(w)\big)=\Phi\left(\frac{\Phi^{-1}(y)-\varrho\,\Phi^{-1}(w)}{\sqrt{1-\varrho^2}}\right),$$

and hence

$$f_C(u,v)=\Phi\left(\frac{\Phi^{-1}(G_v(v))-\varrho\,\Phi^{-1}(F_u(u))}{\sqrt{1-\varrho^2}}\right)-\Phi\left(\frac{\Phi^{-1}(G_v(v-1))-\varrho\,\Phi^{-1}(F_u(u))}{\sqrt{1-\varrho^2}}\right).$$
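This conditional PMF can be evaluated directly with standard-normal routines. The following Python sketch uses a hypothetical discrete marginal CDF on the support {0, 1, 2, 3} (not the empirical marginal from the paper) and checks that the resulting probabilities are nonnegative and sum to one:

```python
import numpy as np
from scipy.stats import norm

def f_C(w, G_vals, rho):
    """Conditional PMF of discrete v given F_u(u) = w under a Gaussian copula:
    consecutive differences of Phi((Phi^-1(G(v)) - rho*Phi^-1(w)) / sqrt(1-rho^2))."""
    x = norm.ppf(w)
    cdf = norm.cdf((norm.ppf(G_vals) - rho * x) / np.sqrt(1.0 - rho**2))
    return np.diff(np.concatenate(([0.0], cdf)))

# hypothetical discrete marginal CDF G over the support {0, 1, 2, 3}
G_vals = np.array([0.1, 0.4, 0.8, 1.0])
pmf = f_C(0.3, G_vals, rho=0.6)  # conditional probabilities of v = 0, 1, 2, 3
```

The last CDF value is 1, so Φ−1(1) = +∞ and the distorted CDF also ends at 1, which is what makes the probabilities sum to one.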

In the case of two discrete components, the joint PDF/PMF is

$$\begin{aligned}f(u,v_1,v_2)&=\frac{\partial C}{\partial u}\big(F_u(u),G_1(v_1),G_2(v_2)\big)-\frac{\partial C}{\partial u}\big(F_u(u),G_1(v_1-1),G_2(v_2)\big)\\&\quad-\frac{\partial C}{\partial u}\big(F_u(u),G_1(v_1),G_2(v_2-1)\big)+\frac{\partial C}{\partial u}\big(F_u(u),G_1(v_1-1),G_2(v_2-1)\big)\\&=f_u(u)\,f_C(u,v_1,v_2),\end{aligned}$$

where the last term is

$$f_C(u,v_1,v_2)=\left[\frac{\partial C}{\partial w}\big(w,G_1(v_1),G_2(v_2)\big)-\frac{\partial C}{\partial w}\big(w,G_1(v_1-1),G_2(v_2)\big)-\frac{\partial C}{\partial w}\big(w,G_1(v_1),G_2(v_2-1)\big)+\frac{\partial C}{\partial w}\big(w,G_1(v_1-1),G_2(v_2-1)\big)\right]_{w=F_u(u)}.$$

Consider the three-dimensional Gaussian copula

$$C(w,y_1,y_2)=\Phi_3\big(\Phi^{-1}(w),\Phi^{-1}(y_1),\Phi^{-1}(y_2)\big).$$

Note the property

$$\begin{aligned}\frac{\partial\Phi_3(x_1,x_2,x_3)}{\partial x_1}&=\frac{\partial}{\partial x_1}\int_{-\infty}^{x_1}\int_{-\infty}^{x_2}\int_{-\infty}^{x_3}\phi_3(t_1,t_2,t_3)\,dt_1\,dt_2\,dt_3=\int_{-\infty}^{x_2}\int_{-\infty}^{x_3}\left(\frac{\partial}{\partial x_1}\int_{-\infty}^{x_1}\phi_3(t_1,t_2,t_3)\,dt_1\right)dt_2\,dt_3\\&=\int_{-\infty}^{x_2}\int_{-\infty}^{x_3}\phi_3(x_1,t_2,t_3)\,dt_2\,dt_3=\int_{-\infty}^{x_2}\int_{-\infty}^{x_3}\phi_2(t_2,t_3|x_1)\phi(x_1)\,dt_2\,dt_3\\&=\phi(x_1)\int_{-\infty}^{x_2}\int_{-\infty}^{x_3}\phi_2(t_2,t_3|x_1)\,dt_2\,dt_3=\phi(x_1)\Phi_2(x_2,x_3|x_1),\end{aligned}$$

$$\begin{aligned}\frac{\partial C(w,y_1,y_2)}{\partial w}&=\frac{\partial\Phi_3\big(\Phi^{-1}(w),\Phi^{-1}(y_1),\Phi^{-1}(y_2)\big)}{\partial w}=\frac{\partial\Phi_3(x_1,x_2,x_3)}{\partial x_1}\bigg|_{x_1=\Phi^{-1}(w),\,x_2=\Phi^{-1}(y_1),\,x_3=\Phi^{-1}(y_2)}\cdot\frac{\partial\Phi^{-1}(w)}{\partial w}\\&=\phi(x_1)\Phi_2(x_2,x_3|x_1)\big|_{x_1=\Phi^{-1}(w),\,x_2=\Phi^{-1}(y_1),\,x_3=\Phi^{-1}(y_2)}\cdot\frac{1}{\phi(x_1)}\bigg|_{x_1=\Phi^{-1}(w)}=\Phi_2\big(\Phi^{-1}(y_1),\Phi^{-1}(y_2)\,\big|\,\Phi^{-1}(w)\big).\end{aligned}$$

Then,

$$\begin{aligned}f_C(u,v_1,v_2)&=\Phi_2\big(\Phi^{-1}(G_1(v_1)),\Phi^{-1}(G_2(v_2))\,\big|\,\Phi^{-1}(F_u(u))\big)-\Phi_2\big(\Phi^{-1}(G_1(v_1-1)),\Phi^{-1}(G_2(v_2))\,\big|\,\Phi^{-1}(F_u(u))\big)\\&\quad-\Phi_2\big(\Phi^{-1}(G_1(v_1)),\Phi^{-1}(G_2(v_2-1))\,\big|\,\Phi^{-1}(F_u(u))\big)+\Phi_2\big(\Phi^{-1}(G_1(v_1-1)),\Phi^{-1}(G_2(v_2-1))\,\big|\,\Phi^{-1}(F_u(u))\big).\end{aligned}$$

As a computational matter, we use the fact that

$$\begin{pmatrix}y_1\\y_2\end{pmatrix}\bigg|\,x\ \sim\ N\big(\mu_R\,x,\ \Omega_R\big),$$

where

$$\mu_R=\begin{pmatrix}\varrho_1\\\varrho_2\end{pmatrix},\qquad \Omega_R=\begin{bmatrix}1-\varrho_1^2&\varrho_0-\varrho_1\varrho_2\\\varrho_0-\varrho_1\varrho_2&1-\varrho_2^2\end{bmatrix},$$

and that

$$\Phi_2(y_1,y_2|x)=\frac{1}{2\pi\sqrt{\det\Omega_R}}\int_{-\infty}^{y_1}\int_{-\infty}^{y_2}\exp\left(-\frac{1}{2}\left(\begin{pmatrix}z_1\\z_2\end{pmatrix}-\mu_R\,x\right)'\Omega_R^{-1}\left(\begin{pmatrix}z_1\\z_2\end{pmatrix}-\mu_R\,x\right)\right)dz_1\,dz_2.$$
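Computationally, Φ2(y1, y2 | x) is then just a bivariate normal CDF with mean μ_R x and covariance Ω_R, for which standard routines exist. A Python sketch using `scipy.stats.multivariate_normal`, with illustrative copula parameters that are not the estimates from the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

def Phi2_cond(y1, y2, x, r0, r1, r2):
    """Bivariate conditional CDF Phi_2(y1, y2 | x): given x, the pair
    (y1, y2) is N(mu_R * x, Omega_R) with mu_R = (r1, r2)'."""
    mean = np.array([r1 * x, r2 * x])
    Omega_R = np.array([[1.0 - r1**2,   r0 - r1 * r2],
                        [r0 - r1 * r2,  1.0 - r2**2]])
    return multivariate_normal.cdf([y1, y2], mean=mean, cov=Omega_R)

# illustrative copula parameters (not the estimates from the paper)
p = Phi2_cond(0.5, -0.2, 1.0, r0=0.1, r1=0.4, r2=0.3)
```

The three calls to this function needed for each observation in the trivariate case then deliver the four Φ2 terms in the expression for f_C.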

## References

• Anatolyev, S., and N. Gospodinov. 2010. “Modeling Financial Return Dynamics via Decomposition.” Journal of Business & Economic Statistics 28: 232–245.

• Azzalini, A. 1985. “A Class of Distributions which Includes the Normal Ones.” Scandinavian Journal of Statistics 12: 171–178. Google Scholar

• Azzalini, A., T. Dal Cappello, and S. Kotz. 2003. “Log-Skew-Normal and Log-Skew-t Distributions as Models for Family Income Data.” Journal of Income Distribution 11 (3-4): 12–20. Google Scholar

• Card, D. 1995. “Using Geographic Variation in College Proximity to Estimate the Return to Schooling.” In Aspects of Labor Market Behaviour: Essays in Honour of John Vanderkamp edited by L. N. Christofides, E. K. Grant, and R. Swidinsky. Toronto: University of Toronto Press. Google Scholar

• Judd, K. 1998. Numerical Methods in Economics. Cambridge, Massachusetts: MIT Press. Google Scholar

• Härdle, W., and E. Mammen. 1993. “Comparing Nonparametric versus Parametric Regression Fits.” Annals of Statistics 21: 1926–1947. Google Scholar

• Horowitz, J. L., and V. G. Spokoiny. 2001. “An Adaptive, Rate-Optimal Test of a Parametric Mean-Regression Model against a Nonparametric Alternative.” Econometrica 69 (3): 599–631.

• Massey, F. J. 1951. “The Kolmogorov-Smirnov Test for Goodness of Fit.” Journal of the American Statistical Association 46 (253): 68–78.

• Murphy, K. M., and F. Welch. 1990. “Empirical Age-Earnings Profiles.” Journal of Labor Economics 8 (2): 202–229.

• Newey, W. K., and D. McFadden. 1994. “Large Sample Estimation and Hypothesis Testing.” In Handbook of Econometrics, edited by R. F. Engle and D. McFadden, Vol. 4, 2111–2245. Amsterdam: North-Holland. Google Scholar

• Vuong, Q. H. 1989. “Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses.” Econometrica 57 (2): 307–333.

## Footnotes

• 1

Unless the distributional specification is very simple, as, for example, joint normality, in which case the mean regression is necessarily linear.

• 2

Another appropriate context is a semiparametric one, where a conditional distribution is specified and estimated in the first place. However, this case is less practical, because specifying a conditional distribution typically entails specifying the conditional mean as part of the modeling strategy.

• 3

We say simply ‘density’ for what may in fact be a mass function when discrete variables are considered, or a density/mass when mixed continuous/discrete variables are. The integrals considered from now on are then redefined accordingly.

• 4

The use of other consistent criteria is also possible. The test can be modified in a straightforward way.

• 5

Even in the most likely case when $\psi \left(v,\beta \right)$ is linear in β and the solution for $\stackrel{^}{\beta }$ is known in a closed form, the ML estimator $\stackrel{^}{\theta }$ likely is not, so one still has to solve a nonlinear optimization problem. The closed form of $\stackrel{^}{\beta }$ can be conveniently used as a starting (and final) point for β during optimization.

• 6

There are several sources of computational errors: software’s round-off error, error in evaluation of integrals on a finite domain, error from neglecting tails of functions being integrated, and error in evaluation of derivatives. See Judd (1998) for information about orders of some of these approximation errors. For example, two-sided differences in evaluation of first derivatives lead to errors of order $O\left({h}^{2}+{h}^{-1}ϵ\right),$ where h is a step size and ϵ is an error in computation of the function being differentiated (which may exceed the round-off error) (Judd 1998, section 7.7); numerical integration on a bounded interval using the Gauss–Chebychev quadrature causes errors of order $O\left(\left({2}^{2m}\left(2m\right)!{\right)}^{-1}\right)$, where m is a number of quadrature nodes (Judd 1998, section 7.2). We assume that the total computational error is sufficiently controllable so that it does not affect the test statistic to the precision used to compute it.

• 7

As a practical matter, simulation of the null distribution can be implemented very easily given the collection of eigenvalues. For example, in GAUSS, the vector of simulated values can be computed using the statement `sumc(lambda.*rndn(d,S)^2);`. Here, the vector lambda contains the eigenvalues, d is the dimension of ϑ, and S is the number of simulations.

• 8

The derivatives are computed using two-sided differences with the step of $h\theta$ componentwise, where h = 10−5. The integrals involved in evaluation of expectations are computed via Gauss–Chebychev quadrature with m = 100 quadrature nodes on [−8, 8]. Such precision is more than sufficient not to worry about the error ϵ of computation of the function being integrated; see the previous footnote.

• 9

See the previous computational footnote, except that the domain of integration is now [−20, 20].

• 10

We do not make any attempt to interpret these regressions as any sort of causal relationships. A causal approach when ed76 is involved requires an acknowledgement of its endogeneity and needs instrumental variables for consistent estimation; see Card (1995) and the rest of the returns to schooling literature.

• 11

Note that μ and σ are not the mean and standard deviation of u.

• 12

While the minimal value of ed76 is 1, that of age76 is 24; therefore, we simply subtract 23 from age76 upfront for convenience.

• 13

In Anatolyev and Gospodinov (2010), the discrete marginal is Bernoulli.

• 14

More precisely, the potential experience is defined as age minus education less 6.

• 15

See footnotes 12 and 14.

• 16

While the regressions we have considered here are not causal (see footnote 10), the rejections obtained indirectly indicate probable misspecification of similar causal relationships used in the returns to schooling literature.

Citation Information: Journal of Econometric Methods, Volume 8, Issue 1, 20160013, ISSN (Online) 2156-6674.

©2019 Walter de Gruyter GmbH, Berlin/Boston.