Open Access (CC BY 4.0). Published by De Gruyter, March 15, 2021

Variable Selection in Regression Models Using Global Sensitivity Analysis

  • William Becker, Paolo Paruolo and Andrea Saltelli

Abstract

Global sensitivity analysis is primarily used to investigate the effects of uncertainties in the input variables of physical models on the model output. This work investigates the use of global sensitivity analysis tools in the context of variable selection in regression models. Specifically, a global sensitivity measure is applied to a criterion of model fit, hence defining a ranking of regressors by importance; a testing sequence based on the ‘Pantula-principle’ is then applied to the corresponding nested submodels, obtaining a novel model-selection method. The approach is demonstrated on a growth regression case study, and on a number of simulation experiments, and it is found competitive with existing approaches to variable selection.

JEL Classification: C52; C53

1 Introduction

Model selection in regression analysis is a central issue, both in theory and in practice. Related fields include multiple testing (Bittman et al. 2009; Romano and Wolf 2005), pre-testing (Leeb and Poetscher 2006), information criteria (Hjort and Claeskens 2003; Liu and Yang 2011), model selection based on the Lasso (Bunea 2008), model averaging (Claeskens and Hjort 2003), stepwise regression (Miller 2002), risk inflation in prediction (Foster and George 1994), and directed acyclic graphs and causality discovery (Freedman and Humphreys 1999).[1]

Model choice is also of primary concern in many areas of applied econometrics, as witnessed for example by the literature on growth regression (Sala-i-Martin 1997). Controlling for the right set of covariates is central in the analysis of policy impact evaluations; this is embodied in the assumption of unconfoundedness (Imbens and Wooldridge 2009). In economic forecasting, model selection is the main alternative to model averaging (Hjort and Claeskens 2003).

The analysis of the effects of pretesting on parameter estimation has a long tradition in econometrics (Danilov and Magnus 2004) and in this context Magnus and Durbin (1999) and co-authors proposed the weighted average least squares estimator (WALS) and compared it with model averaging for growth empirics (Magnus, Powell, and Prufer 2010).

Model selection is a major area of investigation also in time-series econometrics (Phillips 1997, 2003). The so-called London School of Economics (LSE) methodology has played a prominent role in this area, advocating the general-to-specific (GETS) approach to model selection; see Castle, Doornik, and Hendry (2011), Hendry and Krolzig (2005) and references therein. In a widely cited paper, Hoover and Perez (1999) (hereafter HP) 'mechanized' (i.e. translated) the GETS approach into an algorithm for model selection, and then tested the performance of the HP algorithm on a set of time-series regression experiments constructed along the lines of Lovell (1983).

Model selection is also related to the issue of regression coefficients’ robustness (i.e. lack of sensitivity) to the omission/inclusion of additional variables. Leamer (1983) proposed extreme bound analysis, i.e. to report the range of possible parameter estimates of the coefficient of interest when varying the additional regressors included in the analysis, as an application of sensitivity analysis to econometrics. Other applications of sensitivity analysis to econometrics include the local sensitivity to model misspecification developed in Magnus and Vasnev (2007) and Magnus (2007).[2]

On the other hand, sensitivity analysis originated in the natural sciences, and is generally defined as ‘the study of how the uncertainty in the output of a mathematical model or system (numerical or otherwise) can be apportioned to different sources of uncertainty in its inputs’, (Saltelli, Tarantola, and Campolongo 2000). The term global sensitivity analysis (GSA) is used to refer to sensitivity analysis approaches that fully explore the space of uncertainties, as opposed to ‘local’ methods which are only valid at a nominal point (Saltelli and Annoni 2010). The main tools used in GSA are based on a decomposition of the variance of the model output (Sobol’ 1993).

Despite several uses of sensitivity in econometrics, the present authors are not aware of systematic applications of the techniques of Global Sensitivity Analysis to the problem of model selection in regression. With this in mind, the present paper explores the application of variance-based measures of sensitivity to model selection.

This paper aims to answer the question: "Can GSA methods help in model selection in practice?", rather than to propose a single algorithm aimed at dominating all alternatives. To this purpose, a simple algorithm is considered as representative of a novel GSA approach; the new algorithm is found to perform rather well when compared with alternatives. This shows how GSA methods can indeed make a useful contribution to this field.

In particular, a widely-used measure in the GSA literature, called the 'total sensitivity index', is employed to rank regressors in terms of their importance in a regression model. The information on the ordering of regressors given by GSA methods appears to be somewhat complementary to the one based on the t-ratios employed in the GETS approach; this suggests considering orderings of regressors that combine the two. Based on these insights, a GSA algorithm is constructed which combines the two rankings.

The proposed GSA representative algorithm uses the ordering of the regressors via GSA or the t-ratios within a testing strategy based on the 'Pantula-principle', see Pantula (1989). For any ordering of the regressors, this amounts to a single sequence of tests against the full model, starting from the most restricted submodel and moving to the least restricted one.[3] This implies both a reduction in the number of tests for each given ordering (with an associated saving of computing time) and favorable control of the size of the testing sequence. The present application of the 'Pantula-principle' appears novel in the context of model selection.

The GSA algorithm is tested here using several case studies. A detailed investigation of the performance of the GSA algorithm is first performed using the simulation experiments of HP, who defined a set of Data Generating Processes (DGPs) based on real economic data. Simulating these DGPs, one can record how often the algorithm recovers the variables that are included in the DGP. This is compared to the results of HP’s GETS algorithm, as well as those of the Autometrics GETS package (Pretis, Reade, and Sucarrat 2018).

In order to further compare the GSA approach to a wider set of model selection procedures, the DGPs in Deckers and Hanck (2014) (hereafter DH) are also considered; this allows a direct comparison with a number of procedures. Finally, the algorithm is applied to a growth regression case study taken from the same paper.

Overall, results point to the possible usefulness of GSA methods in model selection algorithms. When comparing the optimized GSA and HP algorithms, the GSA method appears able to reduce the failure rate in recovering the underlying data generating process from approximately 5% to 1%, a fivefold reduction. When some of the regressors are weak, the recovery of the exact DGP does not appear to be improved by the use of GSA methods.

Comparing the GSA algorithm to the wider set of approaches considered in DH, the results are competitive with alternatives, in the sense that the GSA algorithm is not dominated by alternative algorithms in the Monte Carlo (MC) simulations. In the empirical application on growth regression, not surprisingly, it identifies similar variables to those found by other methods. While these results do not prove that the GSA approach dominates other existing approaches, they show that it is not dominated by any single alternative and that it has the potential to contribute to improving existing algorithms; the present study can hence hopefully pave the way for future advances.

The rest of the paper is organized as follows. Section 2 defines the problem of interest and introduces GSA and variance-based measures. Section 3 presents some theoretical properties of orderings based on the total sensitivity index, while Section 4 presents the GSA algorithm. Results are reported in Sections 5 and 6, where the former is a detailed investigation on datasets generated following the paper of Hoover and Perez (1999), and the latter is a comparison with a wide range of model selection procedures on simulated data sets and on a growth regression, following Deckers and Hanck (2014). Section 7 concludes. Three appendices report proofs of the propositions in the paper, details on the DGP design in HP and a discussion about the identifiability of DGPs. Finally, this paper follows the notational conventions in Abadir and Magnus (2002).

2 Model Selection and Global Sensitivity Analysis

This section presents the setup of the problem, and introduces global sensitivity analysis. The details of the proposed algorithm are deferred to Section 4.

2.1 Model Selection in Regression

Consider n data points in a standard multiple regression model with p regressors of the form

(1) y = X1β1 + … + Xpβp + ε = Xβ + ε

where y = (y1, …, yn)′ is n × 1, X = (X1, …, Xp) is n × p, Xi = (xi,1, …, xi,n)′ is n × 1, β = (β1, …, βp)′ is p × 1 and ε is an n × 1 random vector with distribution N(0, σ2In). The symbol ′ indicates transposition.

Equation (1) describes both the model and the DGP. In the model, the coefficients βi are parameters to be estimated given the observed data Z = (y, X). Each DGP is described by Eq. (1) with the βi set at some numerical values, here indicated as β0i and collected in the vector β0 = (β01, …, β0p)′.

Some of the true β0i may be 0, corresponding to irrelevant regressors. Let T ≡ {i ∈ J : β0,i ≠ 0} be the set of all relevant regressor indices in the DGP, with r0 elements, where J ≡ {1, …, p} indicates the set of the first p integers. Let also M ≡ J \ T indicate the set of all regressor indices for irrelevant regressors.[4] Equation (1) also formally nests dynamic specifications, as detailed in Appendix B below; in this case Xi contains lagged dependent variables, and (1) is generated recursively.

Imposing the restriction βi = 0 for some regressors i, one obtains a submodel[5] of model (1). Each submodel can be characterized by a set a, a ⊆ J, containing the indices of the included regressors. For instance, a = {1, 5, 9} indicates the submodel including regressors numbered 1, 5, 9. The model without any restriction βi = 0 is called the general unrestricted model (GUM).

Alternatively, the same information on submodel a can be represented by a p × 1 vector γa = (γ1, …, γp)′, whose j-th coordinate γj takes value 1 (respectively 0) to indicate the inclusion (respectively exclusion) of regressor j from the specification, i.e. γj = 1(j ∈ a), where 1(⋅) is the indicator function.[6] The GUM corresponds to γ equal to ı, a vector with all 1s. γT corresponds to the best selection of regressors, i.e. the same one as the DGP; in the following the notation γT = γ0 is also used.

Let Γ be the set of all p × 1 vectors of indicators γ, Γ = {0, 1}^p. Note that there are 2^p different specifications, i.e. 2^p possible γ vectors in Γ. When p = 40, as in some experiments in Section 5, the number of specifications is 2^p = 2^40 ≈ 1.0995 × 10^12, a very large number. This is why an exhaustive search of submodels is infeasible in many practical cases, and model selection techniques focus on a search over a limited set of submodels Γs ⊆ Γ.

Each submodel can be written as model (1) under the restriction

(2) β = Hγϕ,

where Hγ contains the columns of the identity matrix Ip corresponding to the elements γi equal to 1 within γ. Specification (2) is referred to as the 'γ submodel' in the following. The 'true' vector β0 also has representation β0 = H0ϕ0, where H0 is a simplified notation for H0 = HγT = Hγ0.

The least squares estimator of β in submodel γ can be written as

(3) βˆγ = Hγ (Hγ′X′XHγ)^(−1) Hγ′X′y.

The problem of interest is to retrieve T, or the corresponding γT, given the observed data Z=(y,X), i.e. to identify the DGP.[7]
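As an illustration (not part of the original paper), the submodel estimator in eq. (3) can be sketched in a few lines; the data, coefficients and seed below are hypothetical:

```python
import numpy as np

# Hypothetical toy data: n = 100 observations, p = 5 candidate regressors,
# a DGP using only regressors 1 and 4 (0-based indices 0 and 3), so T = {1, 4}.
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
beta0 = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ beta0 + rng.standard_normal(n)

def H_matrix(gamma):
    """Selection matrix H_gamma: the columns of I_p where gamma_i = 1."""
    return np.eye(len(gamma))[:, np.asarray(gamma, dtype=bool)]

def beta_hat(gamma, X, y):
    """Least squares estimator of the gamma-submodel, eq. (3):
    H (H'X'XH)^{-1} H'X'y, a p-vector with zeros for excluded regressors."""
    H = H_matrix(gamma)
    XH = X @ H
    phi = np.linalg.solve(XH.T @ XH, XH.T @ y)
    return H @ phi

print(beta_hat([1, 0, 0, 1, 0], X, y))  # estimates for the submodel a = {1, 4}
```

Note that excluded coordinates of the returned vector are exactly zero, mirroring the restriction β = Hγϕ of eq. (2).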

2.2 GSA Approach

General-to-specific (GETS) approaches such as the algorithm used by HP (described in detail in Section 5) use t-ratios to rank regressors in order of importance, which guides the selection of the set of submodels Γs. This study proposes instead to decompose the selection of models in two stages:

  1. define an ordering of regressors based on their importance;

  2. use a sequence of at most p + 1 tests that compare the GUM with the submodels containing the h most important regressors, for h = 0, 1, 2, …, ending at the first submodel for which the null hypothesis is not rejected.

In this paper the ordering in step 1 based on the t-ratios is complemented with a variance-based measure of importance from GSA, called the 'total sensitivity index'. The proposed algorithm, called the 'GSA algorithm', combines this new ranking with the ranking by t-ratios.

The testing sequence is defined based on this new ranking; a 'bottom-up' selection process is adopted, which builds candidate models by adding regressors in descending order of importance. This 'bottom-up' selection process follows the 'Pantula principle', has well-defined theoretical properties, see e.g. Paruolo (2001), and can still be interpreted as a GETS procedure.

The total sensitivity index in GSA is based on a systematic exploration of the space of the inputs to measure their influence on the system output, as is commonly practiced in mathematical modeling in the natural sciences and engineering. It provides a global measure of the influence of each input on a system.[8] Reviews of global sensitivity analysis methods used therein are given in Saltelli et al. (2012), Norton (2015), Becker and Saltelli (2015), Wei, Lu, and Song (2015).[9] The total sensitivity index is a variance-based measure of sensitivity, the analogue of the analysis of variance, see Archer, Saltelli, and Sobol (1997).[10]

Given the sample data Z = (y, X), consider the γ submodel, see eqs. (1), (2) and (3). Let q(γ) indicate the BIC measure of model fit of this submodel, q(γ) = log σˆγ2 + kγ cn, with cn ≡ log(n)/n.[11] Remark that q is a continuous random variable that depends on the discretely-valued γ. The idea is to apply the total sensitivity index using q as output, with γ as input. Although the BIC is used here as the measure of model fit q, other consistent information criteria or the maximized log-likelihood could be used instead.
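A minimal sketch of the output q(γ) just defined, assuming kγ counts the included regressors and scoring the empty submodel on the demeaned y (an assumption made for simplicity; the paper's specifications include a constant term):

```python
import numpy as np

def q_bic(gamma, X, y):
    """Penalized fit q(gamma) = log(sigma_hat^2) + k_gamma * log(n)/n,
    where sigma_hat^2 is the residual variance of the gamma-submodel and
    k_gamma is the number of included regressors."""
    n = len(y)
    gamma = np.asarray(gamma, dtype=bool)
    k = gamma.sum()
    if k == 0:
        resid = y - y.mean()          # empty submodel: score the raw variation
    else:
        Xg = X[:, gamma]
        coef, *_ = np.linalg.lstsq(Xg, y, rcond=None)
        resid = y - Xg @ coef
    sigma2 = resid @ resid / n
    return np.log(sigma2) + k * np.log(n) / n
```

With this q, including a relevant regressor lowers log σˆγ2 by more than the penalty cn it costs, while an irrelevant one does not.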

The objective is to capture both the main effect and the interaction effects of the input factors onto the output q, see Saltelli et al. (2012). The following section defines the total sensitivity index.

2.3 Sensitivity Measures

Let E indicate the empirical expectation over Γ, i.e. E(h(γ)) ≡ (1/#Γ) Σγ∈Γ h(γ), for any function h. Let also V indicate the variance operator associated with E, V(h) ≡ E(h2) − (E(h))2.

The γ vector is partitioned into two components γi and γ−i, where γ−i contains all elements in γ except γi. Let E(⋅|b) and V(⋅|b) (respectively E(⋅) and V(⋅)) indicate the conditional (respectively marginal) expectation and variance operators with respect to a partition (a, b) of γ, where a and b are taken equal to γi and γ−i.

Two commonly-accepted variance-based measures are reviewed here, the ‘first-order sensitivity index’ Si, Sobol’ (1993), and the ‘total-order sensitivity index’ STi, Homma and Saltelli (1996); both rely on decomposing the variance of the output, V=V(q), into portions attributable to inputs or sets of inputs.

The first-order index measures the contribution to V = V(q) of varying the i-th input alone, and it is defined as Si = V(E(q|γi))/V. This index can be seen as the application of Karl Pearson's correlation ratio η2, see Pearson (1905), to the present context. It corresponds to the effect of including or excluding regressor i, averaged over all possible combinations of the other regressors. However, this measure does not account for interactions with the inclusion/exclusion of other regressors; hence it is not used in the present paper.

Instead, here the focus is placed on the total effect index, which is defined by Homma and Saltelli (1996) as

(4) STi = E(V(q|γ−i))/V = 1 − V(E(q|γ−i))/V.

In the following, the numerator of STi is indicated as σTi2 = E(V(q|γ−i)), and the shorthand ST for STi is often used.

Examining σTi2, one can notice that the inner term, V(q|γ−i), is the variance of q due to the inclusion/exclusion of regressor i, conditional on a given combination γ−i of the remaining regressors. The outer expectation then averages over all values of γ−i; this quantity is then standardized by V to give the fraction of total output variance caused by the inclusion of xi. The second expression shows that STi is 1 minus the first-order effect of γ−i.

These measures are based on the standard variance decomposition formula, or 'law of total variance' (Billingsley 1995, Problem 34.10(b)). In the context of GSA, these decomposition formulae are discussed in Archer, Saltelli, and Sobol (1997), Saltelli and Tarantola (2002), Sobol' (1993), Brell, Li, and Rabitz (2010). For further reading about GSA in its original setting, see Saltelli et al. (2012).
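For small p, the definition in eq. (4) can be checked by brute force over all 2^p submodels. The sketch below (a hypothetical illustration) computes STi exactly, using the fact that, conditional on γ−i, the pair {γi = 0, γi = 1} has variance ((q0 − q1)/2)²:

```python
import numpy as np
from itertools import product

def total_index_exhaustive(q_values):
    """Exact S_Ti for all i from q evaluated at every gamma in {0,1}^p
    (feasible only for small p). q_values maps each gamma tuple to q(gamma).
    Implements S_Ti = E(V(q | gamma_{-i})) / V under the uniform
    distribution on Gamma."""
    p = len(next(iter(q_values)))
    q_all = np.array([q_values[g] for g in product((0, 1), repeat=p)], float)
    V = q_all.var()                       # population variance over Gamma
    ST = np.empty(p)
    for i in range(p):
        s = 0.0
        for g in product((0, 1), repeat=p):
            if g[i] == 0:                 # visit each gamma_{-i} cell once
                g_flip = g[:i] + (1,) + g[i + 1:]
                s += ((q_values[g] - q_values[g_flip]) / 2.0) ** 2
        ST[i] = (s / 2 ** (p - 1)) / V    # average conditional variance / V
    return ST
```

For the additive test function q(γ) = 2γ1 + γ2 with p = 2 the indices are 0.8 and 0.2, and they sum to 1 because there are no interactions.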

2.4 Estimation of the Total Sensitivity Index

In order to calculate the total sensitivity measure STi one should be able to compute q(γ) for all γ ∈ Γ (i.e. estimate all possible submodels of the GUM), which is infeasible or undesirable. Instead, STi can be estimated from a random subset of Γ, i.e. a sample of models. The estimation of STi is performed here using the estimator and the structured sample of Jansen (1999), a widely used method in GSA.

Specifically, generate a random draw of γ in Γ, say γ*; then consider the vector γ*(i) with all elements equal to those of γ* except the i-th coordinate, which is switched from 0 to 1 or vice versa, γ*i(i) = 1 − γ*i. Doing this for each coordinate i generates p pairs of γ vectors, γ* and γ*(i), that differ only in coordinate i. These are then used to calculate q(γ) and to apply the estimator of Jansen (1999).

This process can be formalized as follows: initialize ℓ at 1, then,

  1. Generate a random draw of γ, a p × 1 vector with each element randomly selected from {0, 1}. Denote this by γℓ.

  2. Evaluate qℓ = q(γℓ).

  3. Take the i-th element of γℓ, and switch it to 0 if it is equal to 1, and to 1 if it is 0. Denote this new vector with inverted i-th element as γℓ(i).

  4. Evaluate qℓi = q(γℓ(i)).

  5. Repeat steps 3 and 4 for i = 1, 2, …, p.

  6. Repeat steps 1–5 N times, i.e. for ℓ = 1, 2, …, N.

The estimators for σTi2 and V are then defined as in Jansen (1999), see also Saltelli et al. (2010):

(5) σˆTi2 = (1/(4N)) Σℓ (qℓi − qℓ)2,  Vˆ = (1/(N − 1)) Σℓ (qℓ − q̄)2,

where Σℓ indicates summation over ℓ = 1, …, N and q̄ = (1/N) Σℓ qℓ. This delivers the plug-in estimator for ST, SˆTi = σˆTi2/Vˆ. Readers familiar with sensitivity analysis may notice that the estimator in (5) differs by a factor of 2 from the estimator quoted in Saltelli et al. (2010). The reason for this is given in Appendix A.[12]
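The sampling scheme of steps 1–6 and the estimators in (5) can be sketched together as follows (a minimal illustration; the callable q and the seed are hypothetical):

```python
import numpy as np

def jansen_ST(q, p, N=128, seed=0):
    """Monte Carlo estimate of S_Ti for all i = 1..p, following the
    Jansen (1999) scheme of steps 1-6 and the estimators in eq. (5).
    q is a callable mapping a 0/1 vector of length p to a scalar."""
    rng = np.random.default_rng(seed)
    q_base = np.empty(N)
    q_flip = np.empty((N, p))
    for l in range(N):
        gamma = rng.integers(0, 2, size=p)   # step 1: random gamma in Gamma
        q_base[l] = q(gamma)                 # step 2
        for i in range(p):                   # steps 3-5: flip coordinate i
            g = gamma.copy()
            g[i] = 1 - g[i]
            q_flip[l, i] = q(g)              # step 4
    sigma2_T = ((q_flip - q_base[:, None]) ** 2).sum(axis=0) / (4 * N)
    V_hat = q_base.var(ddof=1)               # (1/(N-1)) sum (q_l - q_bar)^2
    return sigma2_T / V_hat                  # plug-in S_hat_Ti
```

Each of the N base draws costs p extra evaluations of q, so the total number of estimated submodels is N(p + 1) rather than 2^p.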

SˆTi is an accurate estimator for STi as the number N of models increases;[13] hence, the following discussion is based on the behavior of STi.

3 Properties of Orderings Based on STi

This section investigates the theoretical properties of orderings of variables in a regression model based on ST, and shows that these orderings satisfy the following minimal requirement: when the true regressors in T included in the DGP and the irrelevant ones in M are uncorrelated, the ordering of regressors based on ST separates the true from the irrelevant regressors in large samples.

Recall that STi = σTi2/V = E(V(q|γ−i))/V, see (4). The large-n properties of STi are studied under the following regularity assumptions.

Assumption 1:

(Assumptions on the DGP). The variables wt ≡ (yt, x1,t, …, xp,t, εt)′ are stationary with finite second moments, and satisfy a law of large numbers for large n, i.e. the second sample moments of wt converge in probability to Σ, the variance-covariance matrix of wt.

Notice that these requirements are minimal, and they are satisfied by the HP DGPs as well as the DH DGPs. The following theorem shows that, for large n, a scree plot of the ordered STi allows one to separate the relevant regressors from the irrelevant ones when true and irrelevant regressors are uncorrelated.

Theorem 2:

(Ordering based on STi works for uncorrelated regressors in M and T). Let Assumption 1 hold and assume that the covariance Σℓj between xℓ and xj equals 0 for all j ∈ T and ℓ ∈ M. Define (ST(1), ST(2), …, ST(p)) as the set of STi values in decreasing order, with ST(1) ≥ ST(2) ≥ … ≥ ST(p). Then as n → ∞ one has

(ST(1), ST(2), …, ST(p)) →p (c(1), c(2), …, c(r0), 0, …, 0)

where (c(1), c(2), …, c(r0)) is the ordered set of the ci > 0 values in decreasing order, with

(6) ci ≡ (1/(4 · 2^(p−1))) Σγ−i∈Γ−i log((σ2 + Σh,j∈T\aγ(i,1) β0,h Σhj.bγ(i,1) β0,j)/(σ2 + Σh,j∈T\aγ(i,0) β0,h Σhj.bγ(i,0) β0,j)),

see Appendix A for the definition of the relevant quantities in eq. (6). Hence the ordered STi values separate the block of true regressors in T, in the first r0 positions, from the irrelevant ones in M, in the last p − r0 positions of (ST(1), ST(2), …, ST(p)).

Proof. See Lemma 6 in Appendix A.□

Given the above, one might expect this result to extend to other, more general situations. However, this turns out not necessarily to be the case: the results in Appendix A also show that one can build examples with correlated regressors across T and M where the ordering of regressors based on ST fails to separate the sets of true and irrelevant regressors in large samples.[14]

In the end, the question of whether the ordering based on ST can help in selecting regressors is an empirical matter. Section 5 explores the frequency with which this happens in practice, based on simulated data from various DGPs.

4 Construction of the Algorithm

In order to construct an algorithm to perform model selection based on ST, an initial investigation was performed to understand to what extent the ranking of regressors provided by ST is complementary to that given by t-ratios. These experiments are based on the MC design by HP; details of these experiments are reported in Section 5. However, since the results provide the basis of the GSA algorithm, they are also summarized here.

In short, 11 different datasets were simulated following the approach and underlying DGPs defined by HP. For each DGP, the regressors were ordered using both ST and the t-ratios. A metric was then used which measures the success of each ranking in assigning the highest ranks to the regressors in the DGP. This gives a measure of the utility of each ranking in correctly identifying the DGP. It was found that ST gave overall better rankings than t-ratios, but that for some DGPs the t-ratios were still more effective.

This result pointed to the fact that the two measures are in some way complementary, and motivated the GSA algorithm proposed here, which combines the search paths obtained using the t-ratios and the ST measures, and then selects the best model between the two resulting specifications. The combined procedure is expected to be able to reap the advantages of both orderings. For simplicity, this algorithm is called the GSA algorithm, despite the fact that it exploits both the orderings based on GSA and on the t-ratios. The rest of this section contains a description of the GSA algorithm in its basic form and with two modifications.

4.1 The Basic Algorithm

The procedure involves ranking the regressors by t-ratios or ST, then adopting the ‘bottom up’ approach following the ‘Pantula principle’, where candidate models are built by successively adding regressors in order of importance. The steps are as follows.

  1. Order all regressors by method m (i.e. either the t-ratios or ST).

  2. Define the initial candidate model as the empty set of regressors (i.e. one with only the constant term).

  3. Add to the candidate model the highest-ranking regressor (that is not already in the candidate model).

  4. Perform an F test, comparing the validity of the candidate model to that of the GUM.

  5. If the p-value of the F test in step 4 is below a given significance level α, go to step 3 (continue adding regressors), otherwise, go to step 6.

  6. Since the F-test has not rejected the model in step 4, this is the selected model γ(m).

In the following, the notation γ(t) is used (respectively γ(S)) to denote the model selected by this algorithm when t-ratios (respectively ST) are used for the ordering. Note that candidate variables are added starting from an empty specification; this is hence a ‘bottom up’ approach induced by the ‘Pantula principle’.
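The steps above can be sketched as follows, assuming a given importance ordering and, for brevity, omitting the constant term that the paper's specifications include (the data and ordering in the test are hypothetical):

```python
import numpy as np
from scipy import stats

def rss(X, y):
    """Residual sum of squares of least squares on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(((y - X @ coef) ** 2).sum())

def bottom_up_select(X, y, order, alpha=0.01):
    """'Bottom up' (Pantula-style) selection: grow the candidate model by
    adding regressors in the importance order `order` (a permutation of
    0..p-1) and stop at the first submodel that the F-test does not
    reject against the GUM (steps 2-6 above)."""
    n, p = X.shape
    rss_gum = rss(X, y)
    for h in range(p):
        cols = sorted(order[:h])
        rss_h = rss(X[:, cols], y) if cols else float((y ** 2).sum())
        F = ((rss_h - rss_gum) / (p - h)) / (rss_gum / (n - p))
        if stats.f.sf(F, p - h, n - p) >= alpha:
            return cols          # first submodel not rejected against the GUM
    return sorted(order)         # only the GUM itself survives
```

Because every candidate is tested against the same GUM, only one sequence of nested comparisons is needed per ordering, which is the computational saving noted in Section 1.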

One can observe that this ‘bottom up’ approach is in line with the GETS philosophy of model selection; in fact it corresponds to the nesting of models known as the ‘Pantula-principle’ in cointegration rank determination, see Johansen (1996). Every model in the sequence is compared with the GUM, and hence the sequence of tests can be interpreted as an implementation of the GETS philosophy. Moreover, it can be proved that, for large sample sizes, the sequence selects the smallest true model in the sequence with probability equal to 1 − α, where α is the size of each test. Letting α tend to 0 as the sample size gets large, one can prove that this delivers a true model with probability tending to 1.[15]

As a last step, the final choice of regressors γˆ is made between γ(t) and γ(S) by selecting the one with the fewest regressors (since both models have been declared valid by the F-test). If the number of regressors is the same but the regressors differ, the choice is made using the BIC.

The GSA algorithm depends on some key constants; the significance level of the F-test, α, is a truly ‘sensitive’ parameter, in that varying it strongly affects its performance. Of the remaining constants in the algorithm, N, the number of points in the GSA sampling, can be increased to improve accuracy; in practice it was found that N = 128 provided good results, and further increases made little difference.

In the following two subsections, two extensions to the basic algorithm are outlined with the reasoning explained.

4.2 Adaptive-α

Varying α essentially dictates how 'strong' the effect of a regressor must be for it to be included in the final model: a high α value will tend to include more variables, whereas a low value will cut out variables more harshly. The difficulty is that some DGPs require a low α for accurate identification of the true regressors in T, whereas others require higher values. Hence, there may exist no single value of α that is suitable for the identification of all DGPs.

A proposed modification to deal with this problem is to use an ‘adaptive-α’, αϕ, which is allowed to vary depending on the data. This is based on the observation that the F-test returns a high p-value pH (typically of the order 0.2–0.6) when the proposed model is a superset of the DGP, but when one or more of the regressors in T are missing from the proposed model, the p-value will generally be low, pL (of the order 10−3 say). The values of pH and pL will vary depending on the DGP and data set, making it difficult to find a single value of α which will yield good results across all DGPs. However, for a given DGP and data set, the pH and pL values are easy to identify.

Therefore, it is proposed to use a value of αϕ, such that for each data set,

(7) αϕ = pL + ϕ(pH − pL)

where pH is taken as the p-value resulting from considering a candidate model with all regressors that have STi>0.01 against the GUM, and pL is taken as the p-value from considering the empty set of regressors against the GUM. The reasoning behind the definition of pH is that it represents a candidate model which will contain the DGP regressors with a high degree of confidence. Here ϕ is a tuning parameter that essentially determines how far between pL and pH the cutoff should be. Figure 1 illustrates this on a data set sampled from DGP 6B. Note that αϕ is used in the F-test for both the t-ranked regressors as well as those ordered by ST.
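Eq. (7) amounts to a one-line interpolation between pL and pH; a minimal sketch with hypothetical p-values:

```python
def adaptive_alpha(p_low, p_high, phi=0.2):
    """Adaptive significance level of eq. (7): a cutoff placed a fraction
    phi of the way from p_L (p-value of the empty model against the GUM)
    to p_H (p-value of the model keeping all regressors with S_Ti > 0.01
    against the GUM)."""
    return p_low + phi * (p_high - p_low)
```

For example, with pL = 0.001, pH = 0.4 and ϕ = 0.2, the cutoff is αϕ ≈ 0.081, well above pL but far below pH.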

Figure 1: p-Values from F-test comparing candidate models to the GUM in a sample from DGP 6B, for the six highest-ranked regressors. Here ϕ = 0.2 and αϕ is marked as a dotted line.

4.3 Skipping Regressors

In order to handle situations where the ordering of the regressors is incorrect, a different extension of the algorithm is to test discarding 'weak' regressors from the selected model. Here, a weak regressor is defined as one with a value of ST lower than a certain threshold, which is set at 0.2. When Step 6 is reached, if weak regressors exist in the selected model, they are removed one at a time, each time performing an F-test. If the F-test is still satisfied, the regressor is discarded; otherwise it is retained. This approach is used instead of an exhaustive search over the combinations of remaining regressors, because occasionally there may still be too many regressors left for that to be feasible.
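The skipping step can be sketched as follows; `f_test_ok` is a hypothetical stand-in for the F-test of a candidate model against the GUM described in Section 4.1:

```python
def skip_weak(selected, ST, f_test_ok, threshold=0.2):
    """Try removing 'weak' regressors (ST below threshold) one at a time
    from the selected model, weakest first; keep a removal only if the
    reduced model still passes the F-test against the GUM.

    selected  : list of regressor indices in the selected model.
    ST        : dict mapping regressor index -> total sensitivity index.
    f_test_ok : callable taking a list of indices, returning True if that
                submodel is not rejected against the GUM (hypothetical)."""
    model = list(selected)
    for i in sorted(selected, key=lambda j: ST[j]):
        if ST[i] < threshold and i in model:
            trial = [j for j in model if j != i]
            if f_test_ok(trial):
                model = trial      # removal accepted: regressor discarded
    return model
```

Only the weak regressors are probed individually, so the cost is linear in their number rather than exponential as in an exhaustive search.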

4.4 Full GSA Algorithm

Adding the extensions discussed in the previous two sections results in the final full algorithm, which can be described as follows.

  1. Obtain two orderings of regressors, one by the t-ratios, and the other by ST, using the BIC as the output (penalized measure of model fit) q.

  2. Obtain pH as the p-value resulting from considering a candidate model with all regressors that have STi > 0.01 against the GUM, and pL as the p-value from considering the empty set of regressors against the GUM. Calculate αϕ using (7), which is used in all subsequent tests.

  3. Define the initial candidate model as the empty set of regressors (i.e. one with only the constant term).

  4. Add to the candidate model the regressor with the highest ST (that is not already in the candidate model).

  5. Perform an F test, comparing the validity of the candidate model to that of the GUM.

  6. If the p-value of the F test in step 5 is below αϕ, go to step 4 (continue adding regressors); otherwise, go to step 7.

  7. Since the F-test has not rejected the model in step 5, this is the selected model γ(m).

  8. Identify any remaining ‘weak’ regressors as those with ST < 0.2. Try removing these one at a time: if removing a regressor satisfies the F-test, it is discarded; otherwise it is retained. Repeat this procedure for all weak regressors.

  9. Repeat steps 3–8, except use the ordering based on t-ratios, rather than on ST.

  10. Compare the final model selected by ST with the final model selected by the t-ratios, choosing the one with the fewest regressors (since both have satisfied the F-test). If both final specifications have the same number of regressors, choose the specification with the lowest BIC.

In the following section the performance of the algorithm, with and without the extensions introduced in the previous sections, is examined on some benchmark test cases. In the following, STfull indicates the full procedure as described above; STno-skip refers to the same procedure without the skipping extension (i.e. without step 8); finally, STsimple omits both step 8 and step 2 (adaptive-α), using a fixed value of α instead.

5 The Experiments of Hoover and Perez

This section tests the GSA algorithm on a suite of DGP simulation experiments developed by HP. These experiments consider a possibly dynamic regression equation with n = 139 and exogenous variables, fixed across experiments, taken from real-world, stationary, macroeconomic time series, in the attempt to represent typical macroeconomic data. Several papers have used HP’s experiments to test the performance of other methods (Castle, Doornik, and Hendry 2011; Hendry and Krolzig 1999). HP’s experiments are of varying degree of difficulty for model search algorithms. Details on the design of HP DGPs are reported in Appendix B.

The features of HP’s experiments prompt a number of considerations. First, because sample size is limited and fixed, consistency of model-selection algorithms cannot be the sole performance criterion. Secondly, some of the DGPs in HP’s experiments are characterized by a low signal-to-noise ratio for some coefficients; the corresponding regressors are labeled ‘weak’. This situation makes it very difficult for statistical procedures to discover if the corresponding regressors should be included or not. This raises the question of how to measure selection performance in this context.

This paper observes that, in the case of weak regressors, one can measure performance of model-selection algorithms also with respect to a simplified DGP, which contains the subset of regressors with sufficiently high signal-to-noise ratio; this is called the ‘Effective DGP’ (EDGP). The definition of the EDGP is made operational using the ‘parametricness index’ introduced in Liu and Yang (2011)—this concept is described in detail in Appendix C. For full transparency, results are presented also relative to the original DGPs in cases where the EDGP is different.

5.1 Orderings Based on t and GSA

As mentioned in Section 4, the DGPs of HP were used as the basis for an initial investigation into the comparative rankings of ST and the t ratios. Here, these numerical experiments are described in more detail.

For each of the 11 DGPs under investigation, NR = 500 replications for Z were generated; on each sample, regressors were ranked by the t-ratios and ST, using N = 128 in (5). Both for the t-ratios ranking and the ST ranking, the ordering is from the best-fitting regressor to the worst-fitting one.

In order to measure how successful the two methods were in ranking regressors, the following measure δ of minimum relative covering size is defined. Indicate by φ0 = {i1, …, i_{r0}} the set containing the positions ij of the true regressors in the list i = 1, …, p; i.e. for each j one has γ0,ij = 1. Recall also that r0 is the number of elements in φ0. Next, for a generic replication j, let φℓ(m) = {i1(m), …, iℓ(m)} be the set containing the first ℓ positions induced by the ordering of method m, with m equal to t or ST. Let bj(m) = min{ℓ : φ0 ⊆ φℓ(m)} be the minimum number of top-ranked elements for which φℓ(m) contains all the true regressors. Observe that bj(m) is well defined, because at least for ℓ = p one always has φ0 ⊆ φp(m) = {1, …, p}. δ is defined to equal bj(m) divided by its minimum attainable value; this corresponds to the (relative) minimum number of elements in the ordering m that covers the set of true regressors.

Observe that, by construction, one has r0 ≤ bj(m) ≤ p, and that one wishes bj(m) to be as small as possible; ideally one would like to have bj(m) = r0. Hence for δj(m) defined as bj(m)/r0 one has 1 ≤ δj(m) ≤ p/r0. Finally δ(m) is defined as the average of δj(m) over j = 1, …, NR, i.e. δ(m) = (1/NR) ∑_{j=1}^{NR} δj(m).

For example, if the regressors, ranked in descending order of importance by method m in replication j, were x3, x12, x21, x11, x4, x31, …, and the true DGP contained x3 and x11, the measure δj(m) would be 2: the smallest top-ranked set containing x3 and x11 has four elements, so bj(m) = 4, while r0 = 2.
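The covering measure of the example above can be sketched in a few lines of Python (an illustrative implementation with a hypothetical function name, not the authors' code):

```python
def delta(ranking, true_set):
    """Minimum relative covering size: the smallest l such that the
    top-l ranked regressors contain all true ones, divided by r0."""
    true = set(true_set)
    covered = set()
    for l, idx in enumerate(ranking, start=1):
        covered.add(idx)
        if true <= covered:      # all true regressors covered
            return l / len(true)  # b_j / r0
    raise ValueError("ranking does not contain all true regressors")
```

For the example in the text, `delta([3, 12, 21, 11, 4, 31], [3, 11])` returns 2.0.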

The results over the NR = 500 replications are summarized in Table 1. Overall, ST appears to perform better than the t-ordering. For some DGPs (such as DGPs 2 and 5) both approaches perform well (δ = 1, indicating correct ranking for all 500 data sets). There are other DGPs where the performance is significantly different. In particular, the t-ratio ordering is comparatively deficient on DGPs 3 and 6A, whereas ST performs worse on DGP 8. This suggests that there are some DGPs in which ST may offer an advantage over the t-ratios in terms of ranking regressors in order of importance. This implies that a hybrid approach, using both measures, may yield a more efficient method of regressor selection.

Table 1:

Values of δ (average over 500 data replications per DGP), using t-test and ST. Mean refers to average across DGPs. Comparatively poor rankings are in boldface.

DGP        1    2      3      4      5      6      6A     6B     7      8      9      Mean
ST         –    1.00   1.01   1.00   1.00   1.00   1.12   1.02   1.15   1.64   1.13   1.11
t-ratios   –    1.00   1.53   1.04   1.00   1.06   3.95   1.14   1.04   1.00   1.01   1.38

5.2 Measures of Performance

The performance of algorithms was measured by HP via the number of times the algorithm selected the DGP as a final specification. Here use is made of measures of performance similar to the ones in HP, as well as of additional ones proposed in Castle, Doornik, and Hendry (2011).

Recall that γT = γ0 is the true set of included regressors and let γˆj indicate the one produced by a generic algorithm in replication j = 1, …, NR. Define rj to be the number of correct inclusions of components in vector γˆj, i.e. the number of regression indices i for which γˆj,i = γ0,i = 1, namely rj = ∑_{i=1}^{p} 1(γˆj,i = γ0,i = 1). Recall that r0 indicates the number of true regressors.

The following exhaustive and mutually exclusive categories of results can be defined:

  • C1: exact matches;

  • C2: the selected model is correctly specified, but it is larger than necessary, i.e. it contains all relevant regressors as well as irrelevant ones;

  • C3: the selected model is incorrectly specified (misspecified), i.e. it lacks relevant regressors.

C1 matches correspond to the case when γˆj coincides with γT = γ0; the corresponding frequency C1 is computed as C1 = (1/NR) ∑_{j=1}^{NR} 1(γˆj = γT). The frequency of C2 cases is given by C2 = (1/NR) ∑_{j=1}^{NR} 1(γˆj ≠ γ0, rj = r0). Finally, C3 cases are the residual category, and the corresponding frequency is C3 = 1 − C1 − C2.[16]
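A minimal sketch of how these frequencies can be computed from a batch of 0/1 selection vectors (illustrative Python, with a hypothetical function name):

```python
import numpy as np

def classify(gamma_hat, gamma_0):
    """C1/C2/C3 frequencies over NR replications.
    gamma_hat: (NR, p) 0/1 array of selected regressors; gamma_0: (p,) truth."""
    gamma_hat = np.asarray(gamma_hat)
    gamma_0 = np.asarray(gamma_0)
    exact = (gamma_hat == gamma_0).all(axis=1)                # C1: exact match
    contains = (gamma_hat >= gamma_0).all(axis=1) & ~exact    # C2: true model nested, extras added
    C1, C2 = exact.mean(), contains.mean()
    return C1, C2, 1 - C1 - C2                                # C3: residual (misspecified)
```

For instance, with truth `[1,1,0,0]` and selections `[[1,1,0,0],[1,1,1,0],[1,0,0,0],[1,1,0,1]]`, the frequencies are C1 = 0.25, C2 = 0.5, C3 = 0.25.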

The performance can be further evaluated through measures taken from Castle, Doornik, and Hendry (2011), known as potency and gauge. First the retention rate p˜i of the i-th variable is defined as p˜i = (1/NR) ∑_{j=1}^{NR} 1(γˆj,i = 1). Then, potency and gauge are defined as follows:

potency = (1/r0) ∑_{i: β0,i ≠ 0} p˜i,  gauge = (1/(p − r0)) ∑_{i: β0,i = 0} p˜i.

Potency therefore measures the average frequency of inclusion of regressors belonging to the DGP, while gauge measures the average frequency of inclusion of regressors not belonging to the DGP. An ideal performance is thus represented by a potency value of 1 and a gauge of 0.
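These two definitions translate directly into code (an illustrative sketch, with a hypothetical function name):

```python
import numpy as np

def potency_gauge(gamma_hat, beta_0):
    """Potency: mean retention rate of true regressors (beta_0 != 0);
    gauge: mean retention rate of irrelevant regressors (beta_0 == 0)."""
    p_tilde = np.asarray(gamma_hat).mean(axis=0)  # retention rate of each regressor
    true = np.asarray(beta_0) != 0
    return p_tilde[true].mean(), p_tilde[~true].mean()
```

An ideal algorithm retains every true regressor in every replication (potency 1) and never retains an irrelevant one (gauge 0).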

In calculating these measures, HP chose to discard MC replications for which a preliminary application of the battery of misspecification tests defined in (15) in Appendix A reported a rejection.[17] This choice is called in the following ‘pre-search elimination’ of MC replications.

5.3 Benchmark

The performance of HP's algorithm is taken as a benchmark. The original MATLAB code for generating data from HP's experiments was downloaded from HP's home page.[18] The original scripts were then updated to run on the current version of MATLAB. A replication of the results in Tables 4, 6 and 7 in HP is reported in the first panel of Table 2, using nominal significance levels of α = 1, 5, 10% and NR = 10³ replications. The results do not appear to be significantly different from the ones reported in HP.

Table 2:

Percentages of Category 1 matches C1 for different values of α. Original script: data generated by the original script, NR = 10³ replications. The frequencies are not statistically different from the ones reported in HP (Tables 4, 6, 7). Modified script: data from the modified script for the generation of the AR series, NR = 10⁴ replications.

        Original script              Modified script
DGP     α = 0.01   0.05    0.1      0.01    0.05    0.1
1       81.1       28.6    6.8      79.3    30.0    7.3
2       1.2        0.0     0.0      77.0    27.0    6.9
3       71.4       27.2    9.1      71.7    27.3    6.9
4       78.2       31.2    6.4      81.8    31.1    7.0
5       80.9       30.1    7.4      80.7    29.9    6.4
6       0.2        1.0     0.7      0.4     0.5     0.6
6A      68.0       27.8    7.8      70.6    27.8    7.6
6B      80.8       30.7    7.8      81.1    31.4    8.0
7       23.6       4.7     0.3      75.7    26.7    7.6
8       80.6       31.0    8.0      79.2    30.3    9.4
9       0.1        0       0        0       0       0

When checking the original code, a coding error was found in the original HP script for the generation of the AR series ut in Eq. (13): it produced simulations of a moving average process of order 1, MA(1), with MA parameter 0.75, instead of an AR(1) with AR parameter 0.75.[19] The script was hence modified to produce ut as an AR(1) with AR parameter 0.75; this is called the ‘modified script’ in the following.
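The difference between the two processes can be illustrated with a short simulation (a sketch, not the HP script itself): the AR(1) has autocorrelation 0.75^k at lag k, whereas the MA(1) has zero autocorrelation beyond lag 1.

```python
import numpy as np

def ar1(n, rho=0.75, rng=None):
    """u_t = rho * u_{t-1} + e_t (the intended AR(1) process)."""
    if rng is None:
        rng = np.random.default_rng(0)
    e = rng.standard_normal(n)
    u = np.empty(n)
    u[0] = e[0]
    for t in range(1, n):
        u[t] = rho * u[t - 1] + e[t]
    return u

def ma1(n, theta=0.75, rng=None):
    """u_t = e_t + theta * e_{t-1} (what the original script produced)."""
    if rng is None:
        rng = np.random.default_rng(0)
    e = rng.standard_normal(n)
    u = e.copy()
    u[1:] += theta * e[:-1]
    return u
```

At lag 2 the AR(1) sample autocorrelation is close to 0.75² ≈ 0.56, while the MA(1) sample autocorrelation is close to zero, so the two error processes imply quite different persistence in the simulated regressions.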

Re-running the DGP simulation experiments using this modified script, the results in the second panel in Table 2 were obtained; for this set of simulations NR = 10⁴ replications were used. Comparing the first and second panel in the table for the same nominal significance level α, one observes a significant increase in C1 matches in DGPs 2 and 7. One reason for this may be that, when the modified script is employed, the regression model is well-specified, i.e. it contains the DGP as a special case.[20] Table 2 documents how HP's algorithm depends on α, the significance level chosen in the test R in (15).

5.4 Alternative Algorithms

This section presents results using the performance measures introduced in Section 5.2. The results compare the three variations of the ST algorithm with the modified HP code. To compare with a similar but more recent GETS implementation, the Autometrics package ‘gets’, see Pretis, Reade, and Sucarrat (2018), is also added as an additional algorithm in the comparison.

The performance is measured with respect to the true DGP or with respect to the Effective DGP (EDGP) that one can hope to recover, given the signal-to-noise ratio. Because the HP, GSA and Autometrics algorithms depend on tunable constants, results are given for various values of these constants.

The procedure employed to define the EDGP is discussed in Appendix C; it implies that the only EDGPs differing from the true DGP are DGP 6 and DGP 9. DGP 6 contains regressors 3 and 11, but regressor 3 is weak and hence EDGP 6 contains only regressor 11. DGP 9 contains regressors 3, 11, 21, 29 and 37, but regressors 3 and 21 are weak and they are dropped from the corresponding EDGP 9. More details are given in Appendix C.

Both the HP algorithm and Autometrics depend on the significance level α, whereas the GSA algorithm depends on the threshold ϕ (which controls αϕ) for STno-skip and STfull, and on α for STsimple. Because the values of α and ϕ can seriously affect the performance of the algorithms, a fair comparison of the algorithms may be difficult, especially since the best parameter values will not be known in practice. To deal with this problem, the performance of the algorithms was measured at a number of parameter values within a plausible range.

This allowed two ways of comparing the algorithms. The first is the ‘optimized’ performance, corresponding to the value of α or ϕ that produced the highest C1 score, averaged over the 11 DGPs; this can be viewed as the ‘potential performance’. In practice, the optimization was performed with a grid search on α and ϕ with NR = 10³ replications, averaging across DGPs.

Secondly, a qualitative comparison was performed between the algorithms by examining their average performance over the range of parameter values. This latter comparison gives some insight into the more realistic situation in which the optimal parameter values are not known.

5.5 Results for optimal values of tuning coefficients

Table 3 shows the classification results in terms of C1 matches, as well as the potency and gauge measures, for all algorithms at their optimal parameter values, using NR = 10⁴. Note that α = 4 × 10⁻⁴ for Autometrics is the lowest value of α that could be assigned without errors occurring due to singular matrices—likely an issue of numerical precision. Results for the ST algorithm are shown with and without the extensions discussed in Section 4. Recovery of the true specification is here understood in the EDGP sense.

Table 3:

Percentage C1, percentage gauge (Gge) and percentage potency (Pot) by EDGP. Optimized parameter values used.

        STsimple              STno-skip             STfull                HPopt                 Autometrics
        α = 0.0371            ϕ = 0.3               ϕ = 0.3               α = 4 × 10⁻⁴          α = 4 × 10⁻⁴
EDGP    C1    Gge   Pot       C1    Gge   Pot       C1    Gge   Pot       C1    Gge   Pot       C1    Gge   Pot
1       98.7  0.11  100       99.8  0.01  100       99.8  0     100       99.2  0.02  100       85.0  0.48  100
2       98.5  0.09  100       99.4  0.02  100       99.4  0.02  100       98.9  0.03  100       93.0  0.18  96.0
3       79.4  0.8   94.7      95.2  0.10  98.5      96.0  0.06  98.5      62.0  0.05  81.2      74.0  0.42  90.0
4       98.6  0.09  100       99.2  0.03  100       99.2  0.02  100       99.3  0.02  99.9      89.0  0.28  98.0
5       98.8  0.08  100       99.9  0     100       99.9  0     100       99.3  0.02  100       90.0  0.28  98.0
6       98.7  0.09  100       99.2  0.03  100       99.2  0.02  100       99.2  0.03  99.8      89.0  0.31  98.0
6A      65.3  0.46  87.9      78.4  0.66  97.9      96.2  0.05  98.5      85.3  0.55  92.9      87.0  0.47  94.0
6B      97.6  0.1   100       98.6  0.04  100       99.4  0.02  100       98.4  0.07  99.5      91.0  0.24  96.0
7       92.7  0.13  98.6      97.1  0.09  99.9      99.5  0.01  99.9      98.8  0.03  99.8      93.0  0.24  96.7
8       98.4  0.07  100       99.9  0     100       99.9  0     100       99.1  0.03  100       94.0  0.05  96.0
9       91.4  0.18  98.6      96.5  0.11  99.9      99.6  0.01  99.9      98.2  0.04  99.8      90.0  0.24  96.0
Mean    92.6  0.2   98.2      96.7  0.10  99.7      98.9  0.02  99.7      94.3  0.08  97.5      88.6  0.29  96.2

The C1 column measures the percentage frequency with which the algorithms identified the EDGP. One notable fact is that the performance of the HP algorithm has been vastly improved (compared to the results in HP’s original paper) simply by setting α to a better value, in this case α = 4 × 10−4, compare with Table 2.

The comparison shows that with the full ST algorithm, the correct classification rate (C1) is 98.9%, compared with 94.3% for HP, and 88.6% with Autometrics. It is presumed that if it were possible to reduce further the value of α for Autometrics, the mean C1 value would increase still further. However, it was not possible to test this conjecture. Removing the ‘skipping’ extension, the average performance falls to 96.7%, and further to 92.6% without the adaptive-α feature.

Examining the DGPs individually, the GSA algorithm performs well on all DGPs, although there are slightly lower C1 values in DGPs 3 and 6A (around 96%). For HP, these differences are more marked, with C1 = 62% for DGP 3 and C1 = 85.3% for DGP 6A. Autometrics also has a lower success rate of 74% for DGP 3, and 87% for DGP 6A. It is evident though that the adaptive-α and the skipping extensions contribute significantly to the performance of the GSA algorithm in these DGPs.

The potency and gauge measures (also in Table 3) reveal a little more about the nature of the errors made by the algorithms. Gauge is very low for the GSA and HP algorithms, but higher for Autometrics. Higher gauge measures are found in DGP 6A, particularly for HP, indicating the inclusion of irrelevant regressors. The full GSA algorithm has gauges of at most 0.06% across these data sets. Autometrics has gauge values which are relatively consistent around 0.2–0.5% in most cases, but as low as 0.05% for DGP 8. This partially confirms the authors’ assertion that gauge is close to constant across a variety of models, see Doornik (2009).

The potency measures show that the true regressors are identified nearly all the time by all three approaches. However, overall the GSA method has the highest potency values, at 98% or above, whereas HP has a lower value of 81.2% for DGP 3. Autometrics yields potency values generally over 95%, with the exception of DGP 3, which has a potency value of 90%.

5.6 Recovering the DGP

Although it is argued here that the signal-to-noise ratio in DGPs 6 and 9 is too low for certain regressors to be identified, it is still worth looking at the results with respect to the true DGP, shown in Table 4. All algorithms failed to identify the true DGP 9 even once out of the 10⁴ runs. This fact is reflected in the potency, which drops from 100 to 50% (DGP 6), and to about 60% (DGP 9). These results are mirrored in the original results of HP. This suggests that GSA may not help when regressors are very ‘weak’, but the same is true for HP and Autometrics. Put simply, the signal-to-noise ratio is too low to identify the effect of these regressors (see the discussion of the EDGP in Appendix C).

Table 4:

Percentage C1, gauge and potency by DGP. Optimized parameter values used. The mean frequency is taken over all DGPs, but only the results for DGPs 6 and 9 are shown since the remaining results are identical to Table 3.

        STsimple              STno-skip             STfull                HPopt                 Autometrics
        α = 0.0371            ϕ = 0.3               ϕ = 0.3               α = 4 × 10⁻⁴          α = 4 × 10⁻⁴
DGP     C1    Gge   Pot       C1    Gge   Pot       C1    Gge   Pot       C1    Gge   Pot       C1    Gge   Pot
6       1     0.09  50.0      1     0.03  50.0      2     0.02  50.0      3     0.03  49.9      0     0.29  49.5
9       0     0.19  59.2      0     0.11  60.0      0     0.01  60.0      0     0.04  59.9      0     0.20  58.0
Mean    0.5   0.14  54.6      0.5   0.07  55.0      1     0.02  55.0      1.5   0.03  54.9      0     0.25  53.8

5.7 Robustness of Algorithms

As discussed earlier, the results in Table 3 are obtained after optimization of the tuning parameters α and ϕ. This provides a measure of potential performance, but in reality the best α and ϕ will not be known. For this reason it is indicative to show the results when varying the tuning parameter.

The upper panel in Figure 2 shows how the categorization of the final model varies with ϕ in the full GSA algorithm. As ϕ varies between 0.1 and 0.5, the value of C1 (exact matches) remains generally above 95%, while C2 (correct specification, but with irrelevant regressors) and C3 (misspecification) remain consistently very low.

Figure 2: Optimization of algorithms with respect to tuning parameters; upper panel: full ST algorithm; middle panel: HP algorithm, lower panel: Autometrics. Percentages correspond to averages over all EDGPs.

In contrast, the middle panel shows the effect of varying α for the HP algorithm. It is clear that the peak performance of the algorithm is obtained in a small neighborhood around a rather sharp maximum at a low α value—increasing α from this value results in a rapid increase in C2, whereas decreasing it sharply increases C3.

Autometrics exhibits a similar sensitivity to α, although the decrease in C1 is even steeper, when α is increased. As noted previously, Autometrics may yield even higher C1 values than those observed here, if it were possible to further reduce α without numerical issues.

Overall, while it is difficult to make a perfectly fair comparison of the robustness of the three algorithms, due to the incomparable scales of the optimizing parameters, the GSA algorithm seems to be considerably less sensitive to variation in its tuning coefficient on these data sets, since it indirectly specifies α through ϕ. This has the advantage that the tuning parameter, ϕ, is somewhat problem-independent.

6 Comparisons with Other Methods

In the previous section, a rather detailed comparison was given with the approaches of HP and Autometrics, since those approaches share some similarities with the proposed GSA algorithm. This section compares with a wider set of approaches and also includes a real case study; specifically, the experiments in Deckers and Hanck (2014) (henceforth DH) are considered. They consist of a comparison of nine model-selection approaches on the DH DGPs (Section 6.1), as well as an application to a growth regression model (Section 6.2).

6.1 Simulation Experiments

To give a comparison of performance with a wider range of model selection procedures, the DGP simulation experiment of Deckers and Hanck (2014) is used as a test case. In their paper, the authors use a simple cross-section regression framework to compare the performance of nine model-selection approaches, additionally investigating the effect of altering tuning parameters. Here, the GSA algorithm is applied to their test case, which allows a comparison with a number of competing model selection procedures. These other methods are the “classical” hypothesis testing procedure, which simply uses p-values (classical); the same procedure but with the Bonferroni correction (Bonferroni 1936) (Bonferroni); the step-up method of Benjamini and Hochberg (1995) (BH); the bootstrap step-down method of Romano, Shaikh, and Wolf (2008) (Boot); the PcGets/Autometrics software package of Krolzig and Hendry (2001); the HP approach as investigated in the previous section (HP); Bayesian model averaging as in Ley and Steel (2009) (BMA); the “two million regressions” approach in Sala-i-Martin (1997) (S-i-M); and the least absolute shrinkage and selection operator (Lasso) of Tibshirani (1996). For full details on these approaches the reader is referred to Deckers and Hanck (2014), and references therein.

The DGPs are defined using (1) with p = 50 and n = 100, where X1, …, X50 follow a multivariate normal distribution with mean zero, unit variance and common correlation ρ, which can take values in {0, 0.3, 0.5}.

DGPs are classified according to how many βi are set to non-zero values.

  1. Every tenth βi is set to 0.5, with the rest set to zero (5 regressor indices in T).

  2. Every fifth βi is set to 0.5, with the rest set to zero (10 regressor indices in T).

  3. Every second βi is set to 0.5, with the rest set to zero (25 regressor indices in T).

DH point out that the value of βi = 0.5 is used because it results in population R2 values that are realistic for data sets encountered in growth econometrics. The experiments are run here with 2000 replications for each case.
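One replication of this design can be sketched as follows (an illustrative Python reconstruction with a hypothetical function name, not DH's code):

```python
import numpy as np

def dh_dgp(n=100, p=50, rho=0.3, every=10, beta=0.5, rng=None):
    """DH-style DGP: X ~ N(0, S) with unit variances and common
    correlation rho; every `every`-th coefficient set to `beta`."""
    if rng is None:
        rng = np.random.default_rng()
    S = np.full((p, p), rho) + (1 - rho) * np.eye(p)  # equicorrelation matrix
    X = rng.multivariate_normal(np.zeros(p), S, size=n)
    b = np.zeros(p)
    b[every - 1::every] = beta  # every tenth/fifth/second coefficient
    y = X @ b + rng.standard_normal(n)
    return X, y, b
```

Setting `every` to 10, 5 or 2 reproduces the three cases above, with 5, 10 and 25 true regressors respectively.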

Tables 5, 6 and 7 give the results of the GSA algorithm added to the existing results of the other methods. For consistency with DH, the performance measures from their work are used. The false discovery rate (FDR) is defined as the expected proportion of falsely rejected hypotheses among all rejected hypotheses—in this sense, it is similar to the gauge, but is normalized by the total number of rejections rather than by the number of true null hypotheses. CR denotes the average number of correct rejections—it is simply the potency without the normalization by the number of false null hypotheses.
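These two measures can be computed empirically as follows (an illustrative sketch with a hypothetical function name; the 0/0 case for the false discovery proportion is set to 0, a common convention):

```python
import numpy as np

def fdr_cr(gamma_hat, beta_0):
    """Empirical FDR (false rejections over total rejections, averaged over
    replications) and CR (average number of correct rejections)."""
    gamma_hat = np.asarray(gamma_hat, dtype=bool)  # (NR, p) selections
    true = np.asarray(beta_0) != 0                 # false null hypotheses
    false_rej = (gamma_hat & ~true).sum(axis=1)
    total_rej = gamma_hat.sum(axis=1)
    correct_rej = (gamma_hat & true).sum(axis=1)
    fdp = np.where(total_rej > 0, false_rej / np.maximum(total_rej, 1), 0.0)
    return fdp.mean(), correct_rej.mean()
```

For example, with truth `[1,1,0,0]` and selections `[[1,1,1,0],[1,0,0,0]]`, the FDR is (1/3 + 0)/2 = 1/6 and CR is (2 + 1)/2 = 1.5.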

Table 5:

Results of Monte Carlo experiment with five false hypotheses.

                                ρ = 0           ρ = 0.3         ρ = 0.5
                                FDR     CR      FDR     CR      FDR     CR
Classical: α = 0.01             0.083   3.96    0.097   3.08    0.125   2.25
Classical: α = 0.05             0.273   4.61    0.302   4.12    0.336   3.48
Classical: α = 0.10             0.426   4.81    0.457   4.49    0.473   3.98
Bonferroni: α = 0.01            0.003   1.76    0.003   0.94    0.006   0.46
Bonferroni: α = 0.05            0.01    2.57    0.014   1.65    0.02    0.96
Bonferroni: α = 0.10            0.021   3.03    0.031   2.06    0.035   1.27
BH: α = 0.01                    0.01    2.27    0.007   1.21    0.009   0.55
BH: α = 0.05                    0.038   3.35    0.043   2.29    0.042   1.32
BH: α = 0.10                    0.086   3.81    0.092   2.85    0.08    1.8
Bootstrap: α = 0.01             0.011   2.36    0.009   1.22    0.01    0.59
Bootstrap: α = 0.05             0.052   3.45    0.05    2.39    0.048   1.35
Bootstrap: α = 0.10             0.1     3.94    0.102   2.95    0.092   1.88
PcGets/Autometrics              0.334   4.95    0.344   4.83    0.363   4.51
HP                              0.087   4.87    0.123   4.63    0.165   4.12
Bayesian model averaging
m = k/2, g = k−2, random θ      0.005   3.97    0.02    3.97    0.047   3.34
m = 7, g = k−2, random θ        0.004   3.91    0.02    3.94    0.047   3.31
m = k/2, g = k−2, fixed θ       0.056   4.85    0.066   4.58    0.093   4.01
m = 7, g = k−2, fixed θ         0.007   4.38    0.021   4.12    0.05    3.48
m = k/2, g = n−1, random θ      0.042   4.75    0.049   4.46    0.07    3.82
m = 7, g = n−1, random θ        0.036   4.72    0.042   4.42    0.066   3.77
m = k/2, g = n−1, fixed θ       0.272   4.97    0.266   4.83    0.279   4.49
m = 7, g = n−1, fixed θ         0.04    4.81    0.047   4.5     0.072   3.88
Sala-i-Martin                   0.488   4.98    0.67    4.99    0.689   4.97
Lasso                           0.246   4.85    0.434   4.97    0.5     4.9
GSA algorithm                   0.029   4.27    0.057   3.99    0.102   3.60
Table 6:

Results of Monte Carlo experiment with 10 false hypotheses.

                                ρ = 0           ρ = 0.3         ρ = 0.5
                                FDR     CR      FDR     CR      FDR     CR
Classical: α = 0.01             0.039   7.94    0.047   6.19    0.062   4.57
Classical: α = 0.05             0.156   9.26    0.176   8.27    0.199   6.9
Classical: α = 0.10             0.268   9.61    0.284   8.95    0.299   8
Bonferroni: α = 0.01            0.003   3.48    0.003   1.87    0.003   0.94
Bonferroni: α = 0.05            0.005   5.28    0.008   3.34    0.01    1.96
Bonferroni: α = 0.10            0.01    6.06    0.014   4.11    0.019   2.51
BH: α = 0.01                    0.008   5.39    0.008   2.96    0.006   1.41
BH: α = 0.05                    0.04    7.68    0.038   5.45    0.036   3.27
BH: α = 0.10                    0.079   8.5     0.076   6.77    0.072   4.63
Bootstrap: α = 0.01             0.011   5.5     0.01    3.1     0.008   1.52
Bootstrap: α = 0.05             0.05    7.89    0.046   5.71    0.044   3.42
Bootstrap: α = 0.10             0.101   8.67    0.096   7.03    0.091   4.74
PcGets/Autometrics              0.202   9.88    0.211   9.55    0.234   8.83
HP                              0.05    9.66    0.071   9.04    0.106   7.9
Bayesian model averaging
m = k/2, g = k−2, random θ      0.007   6.57    0.022   7.9     0.044   6.51
m = 7, g = k−2, random θ        0.006   6.23    0.021   7.79    0.044   6.41
m = k/2, g = k−2, fixed θ       0.037   9.5     0.044   8.93    0.064   7.64
m = 7, g = k−2, fixed θ         0.008   7.01    0.021   7.72    0.045   6.39
m = k/2, g = n−1, random θ      0.059   9.59    0.047   8.92    0.057   7.41
m = 7, g = n−1, random θ        0.05    9.48    0.04    8.82    0.053   7.28
m = k/2, g = n−1, fixed θ       0.164   9.88    0.14    9.56    0.142   8.62
m = 7, g = n−1, fixed θ         0.025   9.29    0.029   8.55    0.047   7.06
Sala-i-Martin                   0.328   9.44    0.761   10      0.78    10
Lasso                           0.336   9.85    0.425   9.96    0.451   9.89
GSA algorithm                   0.018   8.50    0.035   8.15    0.064   7.26
Table 7:

Results of Monte Carlo experiment with 25 false hypotheses.

                                ρ = 0           ρ = 0.3         ρ = 0.5
                                FDR     CR      FDR     CR      FDR     CR
Classical: α = 0.01             0.011   19.83   0.013   15.39   0.017   11.32
Classical: α = 0.05             0.048   23.11   0.054   20.61   0.063   17.27
Classical: α = 0.10             0.09    24.01   0.098   22.38   0.103   19.98
Bonferroni: α = 0.01            0       8.79    0.004   4.55    0.001   2.28
Bonferroni: α = 0.05            0.001   13.2    0.002   8.27    0.003   4.78
Bonferroni: α = 0.10            0.003   15.2    0.004   10.2    0.005   6.4
BH: α = 0.01                    0.005   16.67   0.004   10.18   0.004   5.16
BH: α = 0.05                    0.025   21.61   0.026   17.34   0.024   11.81
BH: α = 0.10                    0.05    23.07   0.05    20.1    0.048   15.77
Bootstrap: α = 0.01             0.008   18.15   0.008   11.32   0.006   5.6
Bootstrap: α = 0.05             0.046   22.8    0.042   19.31   0.034   13.18
Bootstrap: α = 0.10             0.095   23.93   0.087   21.82   0.075   17.27
PcGets/Autometrics              0.072   24.31   0.087   22.97   0.109   20.56
HP                              0.022   22.19   0.037   20.01   0.059   17.19
Bayesian model averaging
m = k/2, g = k−2, random θ      0.005   4.17    0.035   17.26   0.044   13.97
m = 7, g = k−2, random θ        0.004   3.02    0.035   16.52   0.045   13.45
m = k/2, g = k−2, fixed θ       0.028   17.05   0.036   19.04   0.046   16.06
m = 7, g = k−2, fixed θ         0.008   2.86    0.037   12.82   0.047   11.45
m = k/2, g = n−1, random θ      0.107   24.13   0.034   19.7    0.034   14.26
m = 7, g = n−1, random θ        0.077   23.45   0.031   18.75   0.033   13.56
m = k/2, g = n−1, fixed θ       0.068   23.98   0.036   20.4    0.037   16.57
m = 7, g = n−1, fixed θ         0.02    12.15   0.03    13.93   0.035   10.83
Sala-i-Martin                   0.155   17.21   0.494   24.96   0.499   25
Lasso                           0.285   24.8    0.255   24.93   0.263   24.8
GSA algorithm                   0.013   17.58   0.019   19.42   0.031   17.14

The results of the model selection approaches apart from the GSA algorithm are discussed in Deckers and Hanck (2014) in some detail—for this reason the discussion here is limited to the relative performance of the GSA algorithm. Table 5 shows that in the case of five false hypotheses, the GSA algorithm has a CR of 4.27, 3.99 and 3.60 for ρ = 0, 0.3, 0.5 respectively. This puts it on a similar level to, albeit slightly below, the other methods. Particularly high CR values result from the Lasso, Sala-i-Martin, and Autometrics methods, for example.

However, the FDR of the GSA algorithm is lower than that of most of the other methods apart from the classical and Bonferroni approaches, with one or two exceptions. Therefore the GSA algorithm could be viewed as giving a good, but conservative, performance. In fact, only in three experiments do different versions of BMA outperform the GSA algorithm in both CR and FDR simultaneously, for ρ = 0.3 and 0.5. While the GSA algorithm is not an outright winner in this experiment, it appears to be competitive with other approaches, even without altering the tuning parameters from default values; this appears promising.

The results of the same experiment with 10 false hypotheses are shown in Table 6. Similar to the previous results, the GSA algorithm performs good-quality, but slightly conservative, model selection. The CR values show that it correctly rejects 7.3–8.5 hypotheses on average, depending on the correlation between variables. Other methods, such as the Lasso and BMA, tend to have a higher CR. However the GSA algorithm again has a very low FDR, which is only bettered by other approaches at the expense of a low CR. The exceptions to this are three instances of the BMA approach, one at ρ = 0.3 and two at ρ = 0.5, where CR is marginally higher and FDR marginally lower. Consider however that the BMA results here reflect the performance at eight different combinations of tuning parameters, while the GSA algorithm uses only default values.

Finally, Table 7 gives the results in the case of 25 false hypotheses. Again the results are similar to the other two cases. The GSA algorithm gives slightly lower rates of correct rejection than Autometrics/HP, BMA, Sala-i-Martin and Lasso; however it has an FDR which is lower than most of these approaches. The GSA algorithm is outperformed on both CR and FDR only in two experiments, by the classical procedure and a bootstrap one.

The BMA approach actually performs quite well on this case study. However, as Deckers and Hanck note, there is no one configuration of the BMA approach that performs consistently the best—particular tuning parameter values are suited to particular cases.

Overall, in spite of the fact that GSA was run with default tuning coefficients, the results are encouraging for the GSA algorithm. While it does not outperform all competitors, it has a performance which is competitive with other approaches and can be adjusted to give higher potency by changing the values of its tuning parameters.

6.2 An Empirical Growth Model

The final test case is also taken from DH and represents a real case study on economic data. This case study (more details of which can be found in their paper) takes the data set of Fernandez, Ley, and Steel (2001) to build an empirical growth model. Growth models attempt to explain the differences in economic growth across a set of countries, over a fixed period of time, in terms of a number of candidate explanatory variables. This problem is of course different from the previous test cases, in that y has not been generated from the candidate regressors, and therefore there is no “true DGP” contained within the set of candidates. Indeed, cross-country growth may well be affected by many variables outside of the list considered here.

The data set consists of n = 72 countries, whose growth is measured over the period 1960–1992, as well as p = 41 explanatory variables. Table 8 shows the results of the growth regression using the same list of model selection approaches considered in the previous example, with the addition of the GSA algorithm. For the classical, “Boot” and BH columns, the number represents the significance level (in %) at which the hypothesis is rejected; if no number is reported the regressor is not included in the final model. The BMA column can be interpreted as rejecting the null hypothesis when the probability of inclusion is greater than 0.5. The S-i-M column gives the frequency of inclusion, with an asterisk denoting a significant relation to growth, and a double asterisk indicating that the variable is always included. The FDP column refers to a method of Romano et al. (2006) which aims to ensure that the proportion of false rejections does not exceed a set value with high probability. Finally, the H&K, Lasso and GSA columns use a 1 to indicate that the variable is included, and a 0 to indicate that it is not.

Table 8:

Results for growth regression case study.

 # | Regressor                          | βˆ        | p-value | Classical (%) | BH (%) | Boot (%) | BMA   | S-i-M   | H&K | Lasso | FDP (%) | GSA
 1 | GDP level 1960                     | −0.017    | 0.00001 | 1             | 1      | 1        | 1     | 1.00**  | 1   | 1     | 1       | 1
 2 | Fraction confucian                 | 0.075     | 0.00003 | 1             | 1      | 1        | 0.995 | 1.00*   | 1   | 1     | 1       | 1
 3 | Life expectancy                    | 0.001     | 0.003   | 1             | 5      | 5        | 0.946 | 0.999** | 1   | 1     | 20      | 1
 4 | Equipment investment               | 0.127     | 0.008   | 1             | 5      | 5        | 0.942 | 1.00*   | 1   | 1     |         | 1
 5 | Sub-Saharan dummy                  | −0.02     | 0.006   | 1             | 5      | 5        | 0.757 | 0.997*  | 1   | 1     |         | 1
 6 | Fraction Muslim                    | 0.011     | 0.227   |               |        |          | 0.656 | 1.00*   | 0   | 1     |         | 0
 7 | Rule of law                        | 0.012     | 0.068   | 10            |        |          | 0.516 | 1.00*   | 0   | 1     |         | 1
 8 | Number of years open economy       | −0.003    | 0.62    |               |        |          | 0.502 | 1.00*   | 1   | 0     |         | 1
 9 | Degree of capitalism               | 0.001     | 0.284   |               |        |          | 0.471 | 0.987*  | 0   | 0     |         | 0
10 | Fraction protestant                | −0.003    | 0.677   |               |        |          | 0.461 | 0.966*  | 0   | 0     |         | 1
11 | Fraction GDP in mining             | 0.04      | 0.008   | 1             | 5      | 5        | 0.441 | 0.994*  | 1   | 1     |         | 1
12 | Non-equipment investment           | 0.037     | 0.081   | 10            |        |          | 0.431 | 0.982*  | 0   | 1     |         | 0
13 | Latin American dummy               | −0.013    | 0.039   | 5             | 10     |          | 0.19  | 0.998*  | 1   | 1     |         | 1
14 | Primary school enrolment, 1960     | 0.02      | 0.045   | 5             | 10     |          | 0.184 | 0.992** | 1   | 1     |         | 1
15 | Fraction Buddhist                  | 0.007     | 0.276   |               |        |          | 0.167 | 0.964*  | 0   | 0     |         | 0
16 | Black-market premium               | −0.007    | 0.075   | 10            |        |          | 0.157 | 0.825   | 0   | 0     |         | 0
17 | Fraction catholic                  | 0.003     | 0.593   |               |        |          | 0.11  | 0.963*  | 0   | 0     |         | 0
18 | Civil liberties                    | −0.002    | 0.321   |               |        |          | 0.1   | 0.997*  | 0   | 0     |         | 0
19 | Fraction Hindu                     | −0.097    | 0.001   | 1             | 1      | 1        | 0.097 | 0.654   | 1   | 1     | 5       | 1
20 | Political rights                   | 0.0002    | 0.934   |               |        |          | 0.071 | 0.998*  | 0   | 0     |         | 0
21 | Primary exports, 1970              | −0.006    | 0.421   |               |        |          | 0.069 | 0.990*  | 0   | 0     |         | 1
22 | Exchange rate distortions          | −0.00002  | 0.538   |               |        |          | 0.06  | 0.968*  | 0   | 0     |         | 0
23 | Age                                | −0.00001  | 0.774   |               |        |          | 0.058 | 0.903   | 0   | 0     |         | 0
24 | War dummy                          | −0.001    | 0.548   |               |        |          | 0.052 | 0.984*  | 0   | 0     |         | 0
25 | Size labor force                   | 3.00E-07  | 0.004   | 1             | 5      | 5        | 0.047 | 0.835   | 1   | 1     |         | 0
26 | Fraction speaking foreign language | −0.002    | 0.468   |               |        |          | 0.047 | 0.831   | 0   | 0     |         | 0
27 | Fraction of pop speaking English   | −0.007    | 0.131   |               |        |          | 0.047 | 0.91    | 0   | 0     |         | 0
28 | Ethnologic fractionalization       | 0.014     | 0.012   | 5             | 5      | 5        | 0.035 | 0.643   | 1   | 1     |         | 0
29 | Spanish colony dummy               | 0.013     | 0.022   | 5             | 10     | 10       | 0.034 | 0.938*  | 1   | 0     |         | 0
30 | SD of black-market premium         | −0.000001 | 0.892   |               |        |          | 0.031 | 0.993*  | 0   | 0     |         | 0
31 | French colony dummy                | 0.009     | 0.038   | 5             | 10     |          | 0.031 | 0.702   | 1   | 0     |         | 0
32 | Absolute latitude                  | −0.0001   | 0.521   |               |        |          | 0.024 | 0.980*  | 0   | 0     |         | 0
33 | Ratio of workers to population     | −0.001    | 0.945   |               |        |          | 0.024 | 0.766   | 0   | 0     |         | 0
34 | Higher education enrolment         | −0.129    | 0.002   | 1             | 5      | 5        | 0.024 | 0.579   | 1   | 1     | 20      | 0
35 | Population growth                  | −0.119    | 0.609   |               |        |          | 0.022 | 0.807   | 0   | 0     |         | 0
36 | British colony dummy               | 0.007     | 0.072   | 10            |        |          | 0.022 | 0.579   | 1   | 0     |         | 0
37 | Outward orientation                | −0.005    | 0.036   | 5             | 10     |          | 0.021 | 0.634   | 0   | 0     |         | 0
38 | Fraction Jewish                    | −0.001    | 0.942   |               |        |          | 0.019 | 0.747   | 0   | 0     |         | 0
39 | Revolutions and coups              | 0.003     | 0.503   |               |        |          | 0.017 | 0.995*  | 0   | 0     |         | 0
40 | Public education share             | 0.137     | 0.249   |               |        |          | 0.016 | 0.58    | 0   | 0     |         | 0
41 | Area (scale effect)                | 3.00E-07  | 0.637   |               |        |          | 0.016 | 0.532   | 0   | 0     |         | 0

The results show that the GSA algorithm agrees with the majority of the other approaches in selecting the regressors GDP level, Fraction Confucian and Fraction Hindu as being robustly related to growth. The results are in fact broadly similar to those of H&K and the Lasso, with a few exceptions. For example, the GSA algorithm identifies fraction protestant as an additional predictor of growth, which is not selected by any other algorithm apart from S-i-M. Similarly, primary exports in 1970 is found by the GSA algorithm to be significant. However, the GSA algorithm does not select variables such as ethnologic fractionalization and higher education enrollment, which are included by most other approaches. Overall the GSA algorithm is slightly more parsimonious than the other methods, in agreement with the results from the simulation experiments in the previous section, although the degree of conservatism can be adjusted via the tuning parameter.

7 Conclusions

In the model selection problem, one has to choose which candidate regressors to include in a regression model. The approach in this paper is to view the problem as a sensitivity analysis of a measure of fit in the space of candidate variables. One can therefore calculate the sensitivity of, e.g., the BIC with respect to the presence (or absence) of each candidate variable.
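As a concrete illustration, this quantity can be computed by brute force when p is small. The sketch below (in Python, on hypothetical toy data; it is not the paper's actual implementation) enumerates all 2^p submodels, evaluates the BIC of each, and computes the total sensitivity index S_Ti of the BIC with respect to each inclusion trigger γ_i:

```python
import itertools
import numpy as np

def bic(y, X, gamma):
    """BIC of the submodel that includes column i iff gamma[i] == 1."""
    n = len(y)
    cols = [i for i, g in enumerate(gamma) if g]
    if cols:
        Xg = X[:, cols]
        beta, *_ = np.linalg.lstsq(Xg, y, rcond=None)
        resid = y - Xg @ beta
    else:
        resid = y  # empty model: no regressors at all
    return np.log(resid @ resid / n) + len(cols) * np.log(n) / n

def total_sensitivity(y, X):
    """S_Ti of the BIC with respect to each inclusion trigger gamma_i,
    by brute-force enumeration of all 2^p submodels."""
    p = X.shape[1]
    q = {g: bic(y, X, g) for g in itertools.product([0, 1], repeat=p)}
    var_q = np.var(list(q.values()))
    st = np.empty(p)
    for i in range(p):
        # E over gamma_{-i} of V(q | gamma_{-i}) = mean of (q_i(1)-q_i(0))^2 / 4
        diffs = [q[g[:i] + (1,) + g[i + 1:]] - q[g] for g in q if g[i] == 0]
        st[i] = np.mean(np.square(diffs)) / 4.0 / var_q
    return st

# toy data: x0 and x1 are in the DGP, x2 is irrelevant
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(200)
st = total_sensitivity(y, X)
print(st)  # S_T should be largest for the two true regressors
```

This brute-force version is only feasible for small p; the paper's estimator instead samples the space of triggers.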

These interactions are in principle relevant, as the importance of including a given regressor is conditional on the inclusion or exclusion of the other regressors. For this reason it is appropriate to use S_T, a sensitivity measure capable of capturing the effect of the trigger for the presence of one regressor inclusive of its interaction effects with the triggers for all other regressors.

The proposed algorithm uses the ordering of the regressors via GSA or the t-ratios within a testing strategy based on the ‘Pantula-principle’, see Pantula (1989). This implies a reduction in the number of tests for each given ordering (with an associated saving of computing time) and favorable control of the size of the testing sequence.

When compared to the general-to-specific algorithm described by HP, and to Autometrics, the GSA algorithm performs well on the simulation experiments investigated, both in the theoretical case where tuning parameters were known, and in average performance in the practical situation when tuning parameters are unknown. The robustness of the algorithm to its main tuning parameter is a particularly positive feature, since the optimal values would not be known in a practical case. When compared to a wider range of approaches, the GSA approach performs competitively, giving comparable performance to a range of well-established approaches, without adjusting its tuning parameters in any way. This shows that GSA methods can help in model selection.

This study is a first exploration of the use of GSA in the world of model selection; it shows that GSA can be fruitfully used to order regressors by importance. These results call for more research on the use of GSA methods in model selection.


Corresponding author: William Becker, European Commission, Joint Research Centre, Ispra, VA, Italy, E-mail:
Information and views set out in this paper are those of the authors and do not necessarily reflect the ones of the institutions of affiliation.

Appendix A: Proofs

In this section, Assumption 1 is maintained throughout. σ²_{T_i} is first expressed as a sum of terms involving σ̂²_γ for γ ∈ Γ in Lemma 3; next, the large-n behavior of σ̂²_γ is discussed in Lemma 4. Lemma 5 states the probability limit of σ²_{T_i}. Lemma 6 shows that, in case the true regressors in the DGP and the irrelevant ones are uncorrelated, σ²_{T_i} →_p 0 for an irrelevant regressor i, while σ²_{T_i} →_p c_i > 0 for a relevant one, where →_p indicates convergence in probability as n → ∞. Under the same conditions, Lemma 6 proves Theorem 2, which shows that for large samples a scree plot on the ordered S_{T_i} allows one to separate the relevant regressors from the irrelevant ones.

Let γ_i = e_i′γ and γ_{−i} = A_i′γ, where e_i is the i-th column of the identity matrix of order p, I_p, and A_i is a p × (p − 1) matrix containing all the columns of I_p except the i-th one. Next, write q(γ) as q(γ_i, γ_{−i}) or, more simply, as q_i(γ_i). Denote by γ(i,0) the vector corresponding to γ_i = 0, with the remaining coordinates equal to γ_{−i}, and let γ(i,1) be the vector corresponding to γ_i = 1, with the remaining coordinates equal to γ_{−i}. Finally let Γ_{−i} ≡ {γ_{−i} = A_i′γ, γ ∈ Γ}.

Lemma 3

(σ²_{T_i} as an average over γ_{−i}). One has

$$\sigma_{T_i}^2 = E(V(q \mid \gamma_{-i})) = \frac{1}{4 \cdot 2^{p-1}} \sum_{\gamma_{-i} \in \Gamma_{-i}} \left(q_i(1) - q_i(0)\right)^2 \tag{8}$$

and, for q equal to BIC (or any other consistent information criterion),

$$q_i(1) - q_i(0) = \log\left(\frac{\hat\sigma^2_{\gamma(i,1)}}{\hat\sigma^2_{\gamma(i,0)}}\right) + o(1), \tag{9}$$

where o(1) is a non-stochastic term tending to 0 for large n and σ̂²_γ ≡ n^{−1} ε̂_γ′ε̂_γ, where ε̂_γ are the residuals of model γ.

Proof. Note that for h = 1, 2 one has E(q^h | γ_{−i}) = ½(q_i^h(1) + q_i^h(0)), so that

$$V(q \mid \gamma_{-i}) = E(q^2 \mid \gamma_{-i}) - E(q \mid \gamma_{-i})^2 = \tfrac{1}{2}\left(q_i^2(1) + q_i^2(0)\right) - \tfrac{1}{4}\left(q_i^2(1) + q_i^2(0) + 2 q_i(1) q_i(0)\right) = \tfrac{1}{4}\left(q_i(1) - q_i(0)\right)^2.$$

Hence one finds (8). When q is BIC, q(γ) = log σ̂²_γ + k_γ c_n with c_n ≡ log(n)/n. Other consistent information criteria replace log n with some other increasing function f(n) of n with the property c_n = f(n)/n → 0, see Paulsen (1984) Theorem 1. Note also that k_{γ(i,1)} − k_{γ(i,0)} = 1, and that one has

$$q_i(1) - q_i(0) = \log\left(\frac{\hat\sigma^2_{\gamma(i,1)}}{\hat\sigma^2_{\gamma(i,0)}}\right) + (k_{\gamma(i,1)} - k_{\gamma(i,0)}) c_n = \log\left(\frac{\hat\sigma^2_{\gamma(i,1)}}{\hat\sigma^2_{\gamma(i,0)}}\right) + c_n.$$

Because c_n → 0, one finds (9). □
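The two-point variance computation in this proof can be checked numerically: for any two values q_i(0), q_i(1) taken with equal probability, the conditional variance equals one quarter of their squared difference (the values below are arbitrary, purely for illustration):

```python
import numpy as np

q0, q1 = 1.7, -0.4    # arbitrary values of q_i(0) and q_i(1)
v = np.var([q0, q1])  # variance under probability 1/2 on each value
assert np.isclose(v, (q1 - q0) ** 2 / 4)  # the identity used in the proof
```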

The asymptotic behavior of σ̂²_γ is discussed next. Let w_t ≡ (y_t, x_{1,t}, …, x_{p,t}, ε_t)′, where, without loss of generality, it is assumed that all variables have mean zero. Denote Σ ≡ E(w_t w_t′), partitioned as

$$\Sigma = \begin{pmatrix} \Sigma_{yy} & \Sigma_{yx} & \sigma^2 \\ \cdot & \Sigma_{xx} & 0 \\ \cdot & \cdot & \sigma^2 \end{pmatrix},$$

where dots indicate entries determined by symmetry, Σ_{yx} = (Σ_{y1}, …, Σ_{yp}) and Σ_{xx} has generic block Σ_{ij}. Let Σ_{ij.v} ≡ Σ_{ij} − Σ_{iv}Σ_{vv}^{−1}Σ_{vj} indicate partial covariances, where v ≡ {i_1, …, i_s} indicates a set of indices. Note that Σ_{xε} = 0.

For each γ, let a_γ ≡ {i_1, …, i_{k_γ}} indicate the set of indices i_j such that γ_{i_j} = 1 in γ. Similarly let b_γ indicate the set of indices i_j that belong to a_γ\T. The representation of β₀ as β₀ = H₀φ₀ is used here, where H₀ contains the r₀ columns of I_p corresponding to T, and φ₀ contains the corresponding β_{0,i} coefficients. Moreover, the matrix of regressors in the γ specification is written as XH_γ, where H_γ contains the columns of I_p with column indices in a_γ. Define also M_γ ≡ I_n − XH_γ(H_γ′X′XH_γ)^{−1}H_γ′X′.

Lemma 4

(Large sample behavior of σ̂²_γ). As n → ∞, one has

$$\hat\sigma^2_\gamma \to_p \sigma^2 + \sum_{h,j \in T \setminus a_\gamma} \beta_{0,h}\, \Sigma_{hj.b_\gamma}\, \beta_{0,j}, \tag{10}$$

where T\a_γ is the set of indices of the true regressors omitted from the γ specification, and b_γ is the set of indices a_γ\T of the regressors included in the γ specification except the ones that belong to the DGP. Remark that the sum in (10) is equal to 0 when γ is correctly specified (i.e. it contains all regressors in the DGP), i.e. when T\a_γ = ∅.

Proof. Because y = XH₀φ₀ + ε, one has

$$\hat\sigma^2_\gamma = n^{-1} y' M_\gamma y = n^{-1} \varepsilon' M_\gamma \varepsilon + 2 n^{-1} \varepsilon' M_\gamma X H_0 \phi_0 + n^{-1} \phi_0' H_0' X' M_\gamma X H_0 \phi_0.$$

Because Σ_{xε} = 0, by the law of large numbers for stationary linear processes, see e.g. Anderson (1971), one finds

$$n^{-1} \varepsilon' M_\gamma \varepsilon \to_p \sigma^2 - \Sigma_{\epsilon x} H_\gamma (H_\gamma' \Sigma_{xx} H_\gamma)^{-1} H_\gamma' \Sigma_{x\epsilon} = \sigma^2, \qquad n^{-1} \varepsilon' M_\gamma X \to_p \Sigma_{\epsilon x} \left(I_p - H_\gamma (H_\gamma' \Sigma_{xx} H_\gamma)^{-1} H_\gamma' \Sigma_{xx}\right) = 0.$$

Similarly

$$n^{-1} H_0' X' M_\gamma X H_0 \to_p H_0' \left(\Sigma_{xx} - \Sigma_{xx} H_\gamma (H_\gamma' \Sigma_{xx} H_\gamma)^{-1} H_\gamma' \Sigma_{xx}\right) H_0 = H_0' V_\gamma (V_\gamma' \Sigma_{xx}^{-1} V_\gamma)^{-1} V_\gamma' H_0,$$

where V_γ = H_{γ⊥} contains the columns of I_p not contained in H_γ, and the last equality is a special case of a non-orthogonal projection identity, see e.g. eq. (2.13) in Paruolo and Rahbek (1999) and references therein. Here H_⊥ indicates a basis of the orthogonal complement of the space spanned by the columns in H. Observe that the (p − k_γ) × r₀ matrix C_γ ≡ V_γ′H₀ contains the columns of I_{p−k_γ} corresponding to the index set of regressors in v_γ ≡ T\a_γ. Hence, using e.g. eq. (A.4) in Paruolo and Rahbek (1999), one finds (V_γ′Σ_{xx}^{−1}V_γ)^{−1} = Σ_{v_γ v_γ . b_γ}. Substituting, one finds

$$n^{-1} \phi_0' H_0' X' M_\gamma X H_0 \phi_0 \to_p \phi_0'\, C_\gamma\, \Sigma_{v_\gamma v_\gamma . b_\gamma}\, C_\gamma'\, \phi_0.$$

Simplifying one obtains (10). □

Lemma 5

(Large sample behavior of σ²_{T_i}). As n → ∞ one has

$$\sigma_{T_i}^2 \to_p c_i \equiv \frac{1}{4 \cdot 2^{p-1}} \sum_{\gamma_{-i} \in \Gamma_{-i}} \left[\log\left( \frac{\sigma^2 + \sum_{h,j \in T \setminus a_{\gamma(i,1)}} \beta_{0,h}\, \Sigma_{hj.b_{\gamma(i,1)}}\, \beta_{0,j}}{\sigma^2 + \sum_{h,j \in T \setminus a_{\gamma(i,0)}} \beta_{0,h}\, \Sigma_{hj.b_{\gamma(i,0)}}\, \beta_{0,j}} \right)\right]^2,$$

where a_γ is the set of indices of the regressors in the γ specification, and b_γ ≡ a_γ\T includes the indices of the regressors included in the γ specification except the ones that belong to the DGP.

Proof. Apply Lemma 3 and 4. □

Lemma 5 shows that the limit behavior of σ²_{T_i} depends on the covariance structure Σ. Some covariance structures imply that, in the limit, the value of S_T for true regressors is greater than the value of S_T for irrelevant regressors. There also exist other covariance structures which can imply a reverse ordering.[21] In the special case when true and irrelevant regressors are uncorrelated, the next Lemma 6 shows that S_T converges to 0 for irrelevant regressors, while S_T converges to a positive constant for true regressors. This result proves Theorem 2, which shows that the ordering based on S_T separates true and irrelevant regressors in this special case.

Lemma 6

(Orthogonal regressors in M and T). Assume that Σ_{jℓ} = 0 for all j ∈ T and ℓ ∈ M. Then when i ∈ M one has, as n → ∞, σ²_{T_i} →_p 0, whereas when i ∈ T one finds

$$\sigma_{T_i}^2 \to_p c_i > 0. \tag{11}$$

Proof. From Lemma 4, one finds

$$q_i(1) - q_i(0) = \log\left( \frac{\sigma^2 + \sum_{h,j \in T \setminus a_{\gamma(i,1)}} \beta_{0,h}\, \Sigma_{hj.b_{\gamma(i,1)}}\, \beta_{0,j}}{\sigma^2 + \sum_{h,j \in T \setminus a_{\gamma(i,0)}} \beta_{0,h}\, \Sigma_{hj.b_{\gamma(i,0)}}\, \beta_{0,j}} \right) + o_p(1). \tag{12}$$

Assume that Σ_{jℓ} ≠ 0 for some j ∈ T and ℓ ∈ M; then for some γ_{−i} ∈ Γ_{−i} one has Σ_{hj.b_{γ(i,1)}} ≠ Σ_{hj.b_{γ(i,0)}} in the numerator and denominator on the r.h.s. of (12); let c ≠ 1 indicate the corresponding ratio. Hence (q_i(1) − q_i(0))² converges in probability to log²c > 0, and because the terms in E(V(q|γ_{−i})) = (1/(4·2^{p−1})) Σ_{γ_{−i}∈Γ_{−i}} (q_i(1) − q_i(0))², see Lemma 5, are non-negative, one concludes that σ²_{T_i} →_p c_i > 0.

Assume instead that Σ_{jℓ} = 0 for all j ∈ T and ℓ ∈ M, and that i ∈ M. Then T\a_{γ(i,1)} = T\a_{γ(i,0)} = T\a_{γ_{−i}} and, because Σ_{jℓ} = 0 for all j ∈ T and ℓ ∈ M, one has Σ_{j b_{γ(i,·)}} = 0. This implies Σ_{hj.b_{γ(i,·)}} ≡ Σ_{hj} − Σ_{h b_{γ(i,·)}} Σ_{b_{γ(i,·)} b_{γ(i,·)}}^{−1} Σ_{b_{γ(i,·)} j} = Σ_{hj}. Hence

$$q_i(1) - q_i(0) = \log\left( \frac{\sigma^2 + \sum_{h,j \in T \setminus a_{\gamma_{-i}}} \beta_{0,h}\, \Sigma_{hj}\, \beta_{0,j}}{\sigma^2 + \sum_{h,j \in T \setminus a_{\gamma_{-i}}} \beta_{0,h}\, \Sigma_{hj}\, \beta_{0,j}} \right) + o_p(1) = o_p(1)$$

for all γ_{−i} ∈ Γ_{−i}, because the numerator and denominator are identical. Thus (q_i(1) − q_i(0))² converges in probability to log²1 = 0 for all γ_{−i} ∈ Γ_{−i}, and this implies σ²_{T_i} →_p 0. □
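The partial-covariance step Σ_{hj.b} = Σ_{hj} − Σ_{hb}Σ_{bb}^{−1}Σ_{bj} used above can be illustrated numerically: when the covariance between the true regressors and the block b of included irrelevant regressors is zero, partialling out b leaves Σ_{hj} unchanged (all matrices below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3))
Sigma_bb = A @ A.T + np.eye(3)                 # covariance of the block b (positive definite)
Sigma_hj = np.array([[1.0, 0.3], [0.3, 1.0]])  # covariance among two true regressors
Sigma_hb = np.zeros((2, 3))                    # orthogonality of T and M: Sigma_{jl} = 0

# partial covariance of the true regressors given the block b
partial = Sigma_hj - Sigma_hb @ np.linalg.solve(Sigma_bb, Sigma_hb.T)
assert np.allclose(partial, Sigma_hj)          # unchanged, as used in the proof
```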

Appendix B: HP design and algorithm

B.1 HP DGPs. HP’s experiments are constructed as follows. Following Lovell (1983), HP chose a set of 18 major US quarterly macroeconomic variables. Only two variables considered in Lovell (1983) were discarded in HP, namely the linear trend and the ‘potential level of GNP in $1958’, because they were no longer relevant or available. Unlike in Lovell (1983), HP applied 0, 1 or 2 differences to the data; the order of differencing was selected by HP in order to obtain stationary variables according to standard unit root tests, see their Table 1.

The values of these (differenced) 18 major US quarterly macroeconomic series are then fixed in HP’s experiments; they are here indicated as x*_{it}, where t = 1, …, n indexes quarters and i = 1, …, k, with k = 18, indexes variables. The values of y_t were then generated by the following scheme

$$y_t = \sum_{i=1}^{k} \beta_i^* x_{it}^* + u_t, \qquad \rho(L) u_t = \epsilon_t, \tag{13}$$

where ε_t are i.i.d. N(0, σ²_ε), and ρ(z) = 1 − ρz for all DGPs except DGP 3, for which ρ(z) = 1 − ρ₁z − ρ₂z². Here β*_i for i = 1, …, k and σ²_ε are known constants, which define the DGP. In practice the ε_t are simulated using a computer random number generator; u_t is then calculated as an autoregressive series of order 1, AR(1), with coefficient ρ, and fed into the equation for y_t, while the x*_{it} are kept fixed and do not change across replications.

It is useful to express (13) as a special case of (1). To this end one can substitute (y_t − Σ_{i=1}^k β*_i x*_{it}) in place of u_t in the dynamic equation for u_t; one hence finds the following equivalent representation of the DGP

$$y_t = \rho y_{t-1} + \sum_{i=1}^{2k} \beta_i x_{it} + \epsilon_t \tag{14}$$

for all DGPs except DGP 3, where β_i = β*_i and x_{it} = x*_{it} for i = 1, …, k, while β_i = −ρβ*_{i−k} and x_{it} = x*_{i−k,t−1} for i = k + 1, …, 2k. For DGP 3, one has y_t = ρ₁y_{t−1} + ρ₂y_{t−2} + ε_t. Both these representations are of the form (1), and the parameters can be estimated as in (3).

Regressions in HP were performed setting the elements x_{i,t} in column X_i equal to variable x_{it} from (14), for i = 1, …, 2k with 2k = 36, and setting the elements x_{i,t} of the remaining columns X_i for i = 2k + 1, …, p, i.e. from 37 to 40, equal to the first, second, third and fourth lag of y_t. Therefore, four lags were always considered in estimation regardless of how many lags are in the DGP, and the only part of X that changes across replications is the block of the last four columns.
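A minimal simulation of this scheme (with synthetic stand-ins for the fixed macro series, and hypothetical column positions for G_t and M1_t) can be used to verify that representations (13) and (14) generate the same data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 139, 18
x = rng.standard_normal((n, k))       # synthetic stand-ins for the fixed x*_it
beta = np.zeros(k)
beta[2], beta[10] = -0.023, 0.67      # e.g. G_t and M1_t (positions hypothetical)
rho, sigma_eps = 0.75, 3.25           # AR(1) coefficient and error s.d.

# scheme (13): y_t = sum_i beta*_i x*_it + u_t, with u_t = rho*u_{t-1} + eps_t
eps = rng.normal(0.0, sigma_eps, n)
u = np.empty(n)
u[0] = eps[0]
for t in range(1, n):
    u[t] = rho * u[t - 1] + eps[t]
y = x @ beta + u

# form (14): y_t = rho*y_{t-1} + x_t'beta - rho*x_{t-1}'beta + eps_t
resid = y[1:] - rho * y[:-1] - x[1:] @ beta + rho * (x[:-1] @ beta)
assert np.allclose(resid, eps[1:])    # the two representations coincide
```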

HP defined 11 DGPs by choosing values for the parameters ρ, β*_i and σ²_ε. Table 9 summarizes the chosen parameter values. The choice of these values was made to reflect the coefficient estimates obtained on US data, using personal consumption expenditure as the dependent variable, following the rationale in Lovell (1983). Because they were chosen as explanatory variables for a consumption equation, not all the macroeconomic time series were included in the DGPs; in particular, only (the second differences of) government purchases of goods and services G and (the first differences of) the M1 monetary aggregate, and their respective first lags, were included in the experiments.

Table 9:

DGPs design. y_{t−j} indicates lags of the dependent variable, G_{t−j} denotes (lags of) second differences of government purchases of goods and services, and M1_{t−j} indicates (lags of) first differences of M1.

| Coefficients in DGP | 1 | 2 | 3 | 4 | 5 | 6 | 6A | 6B | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| y_{t−1} |  | 0.75 | 0.395 |  |  |  |  |  | 0.75 | 0.75 | 0.75 |
| y_{t−2} |  |  | 0.3995 |  |  |  |  |  |  |  |  |
| G_t |  |  |  |  | −0.046 | −0.023 | −0.32 | −0.65 |  | −0.046 | −0.023 |
| G_{t−1} |  |  |  |  |  |  |  |  |  | 0.0345 | 0.01725 |
| M1_t |  |  |  | 1.33 |  | 0.67 | 0.67 | 0.67 | 1.33 |  | 0.67 |
| M1_{t−1} |  |  |  |  |  |  |  |  | −0.9975 |  | −0.5025 |
| σ_ε | 130 | 85.99 | 0.00172 | 9.73 | 0.11 | 4.92 | 4.92 | 4.92 | 6.73 | 0.073 | 3.25 |

Note: in DGP 3 the regression analysis is performed on y*_t = exp(y_t), where y_t is simulated as in (14).

B.2 HP algorithm. This subsection gives an overview of the algorithm proposed by HP, following Hansen (1999). The algorithm aimed to provide a close approximation to a subset of what practitioners of the LSE approach actually do; further details can be found in the original reference.

The HP algorithm can be described by the choice of a triplet (R, f, Γ_s) composed of (i) a test procedure R, (ii) a measure of fit f and (iii) a subset Γ_s of all models Γ, Γ_s ⊆ Γ. For any model γ, the test procedure R is defined as

$$R(\gamma) = 1\left(\min_{1 \le \ell \le v} p_\ell \le \alpha\right) \tag{15}$$

where p_1, …, p_v are the p-values of the v specification tests, 1(·) is the indicator function, and α is the chosen significance level. Note that R(γ) = 0 when all v tests do not reject the null, which corresponds to the hypothesis of correct specification and/or constant parameters.[22]

HP’s measure of fit f is based on the least-squares estimate of σ², the regression variance, which equals σ̃²_γ ≡ (n − k_γ)^{−1} ε̂_γ′ε̂_γ, where k_γ and ε̂_γ are the number of regressors and the residuals in model γ. HP’s measure of fit is f(γ) = σ̃_γ, which should be minimized. Finally, the subset Γ_s is selected recursively, going from general to specific models, starting from the GUM, γ = ι_p; the recursion continues as long as R(γ) = 0. Details on HP’s choice of Γ_s are given in the next subsection.

Overall the HP algorithm selects a model γˆ as the preferred model using the rule

$$\hat\gamma = \arg\min_{\gamma \in \Gamma_s : R(\gamma) = 0} f(\gamma).$$

The above description shows that the HP algorithm depends on α, which is a tuning parameter, as well as on the choice of the specific path Γ_s. For large n, Hansen (1999) noted that γ̂ corresponds approximately to minimizing the information criterion HP(γ) = log σ̂²_γ + k_γ/n, where σ̂²_γ ≡ n^{−1} ε̂_γ′ε̂_γ is the ML estimator of σ². This differs from Akaike’s information criterion AIC(γ) = log σ̂²_γ + 2k_γ/n and from the Bayesian information criterion of Schwarz, BIC(γ) = log σ̂²_γ + k_γ log(n)/n, by the different choice of penalty term.[23]
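The triplet (R, f, Γ_s) and the selection rule above can be sketched as follows. The battery of specification tests is replaced by a single Jarque–Bera-type normality check purely for illustration, and the candidate set Γ_s is here simply all subsets of a few regressors; both choices are simplifications of HP's actual procedure, and the data are hypothetical:

```python
import itertools
import numpy as np

def fit(y, X, cols):
    """OLS fit; returns residuals and f(gamma) = sigma~_gamma."""
    Xg = X[:, list(cols)]
    beta, *_ = np.linalg.lstsq(Xg, y, rcond=None)
    e = y - Xg @ beta
    return e, np.sqrt(e @ e / (len(y) - len(cols)))

def R(e, alpha=0.05):
    """Stand-in test procedure: 1 if the (single) specification test rejects."""
    z = (e - e.mean()) / e.std()
    jb = len(e) * (np.mean(z**3) ** 2 / 6 + (np.mean(z**4) - 3) ** 2 / 24)
    pval = np.exp(-jb / 2)  # chi-squared(2) survival function
    return int(pval <= alpha)

rng = np.random.default_rng(2)
n, p = 200, 4
X = rng.standard_normal((n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(n)

# Gamma_s: here, all non-empty subsets of the p candidate regressors
Gamma_s = [c for r in range(1, p + 1) for c in itertools.combinations(range(p), r)]
admissible = {}
for c in Gamma_s:
    e, s = fit(y, X, c)
    if R(e) == 0:          # keep only models passing the specification test
        admissible[c] = s
gamma_hat = min(admissible, key=admissible.get)  # arg min f over models with R = 0
print(gamma_hat)
```

In HP's actual algorithm Γ_s is not an exhaustive list but is built recursively along general-to-specific search paths, as described in the next subsection.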

B.3 HP’s choice of search paths. The choice of the subset Γ_s of Γ is a critical aspect of the HP algorithm, as well as of any selection method based e.g. on information criteria; see Section 5.2 in Hansen (1999) and Burnham and Anderson (2002).

In particular, HP select the subset Γ_s as follows. All paths start from the GUM regression, and the regressors are ranked in ascending order according to their t-ratios. The 10 variables with the lowest t-ratios are then candidates for elimination; each of them starts an iterative elimination path. Each candidate model γ* then becomes the current specification provided R(γ*) = 0. In this stage, the first 90% of the observations are used in the specification tests. Each search terminates when the test R rejects for every remaining candidate elimination.

At this final stage, the HP algorithm reconsiders all the observations in a ‘block search’; this consists in considering the joint elimination of all the regressors with insignificant t-ratios. If the R test for the block search does not reject, the resulting model becomes the terminal specification. Otherwise, the specification that entered the final stage becomes the terminal specification. Once all 10 search paths have ended in a terminal specification, the final specification is the one among these with the lowest f(γ) = σ̃_γ.

Appendix C: Effective DGP

This appendix describes how the notion of ‘weak regressors’ was made operational in the present context. The ‘Parametricness Index’ (PI) of Liu and Yang (2011) is used here to identify the ‘effective DGP’ (EDGP). Parametricness, in the sense of Liu and Yang, is a measure dependent both on the sample size and on the proposed model; a model is parametric if the omission of any of its variables implies a marked change in its fit, and nonparametric otherwise.[24] Here parametricness is taken as a sign of detectability, i.e. of a sufficiently high signal-to-noise ratio. This concept is applied both to complete specifications as well as to single regressors; in particular, the EDGP is defined as the subset of DGP regressors which the PI would classify as parametric.

Considering a model γ_k ∈ Γ, one can express the regression fit as ŷ_k = P_k y, where P_k is the projection matrix on col(XH_{γ_k}), and col indicates the column space; let r_{γ_k} be the dimension of col(XH_{γ_k}). The index PI is defined in terms of an information criterion IC, which depends on λ_n, d and σ̂². Here λ_n is a nonnegative sequence satisfying λ_n(log n)^{−1} → 0, d is a nonnegative constant, and σ̂² is a consistent estimator of σ², such as ‖y − ŷ_k‖²/(n − r_{γ_k}) with γ_k consistent for γ_T. In the application γ_k = γ_T was used. The information criterion IC is defined by

$$IC_{\lambda_n, d}(\gamma_k, \hat\sigma^2) = \|y - \hat y_k\|^2 + \lambda_n \log(n)\, r_{\gamma_k}\, \hat\sigma^2 - n \hat\sigma^2 + d\, n^{1/2} \log(n)\, \hat\sigma^2, \tag{16}$$

where ‖·‖ represents the Euclidean norm; here the values λ_n = 1 and d = 0 are used, as suggested in Liu and Yang (2011).

Let now γ_T be the DGP; PI is defined in the present context as

$$PI = \begin{cases} \inf_{\gamma_k \in \Gamma_1(\gamma_T)} \dfrac{IC_{\lambda_n, d}(\gamma_k, \hat\sigma^2)}{IC_{\lambda_n, d}(\gamma_T, \hat\sigma^2)} & \text{if } r_{\gamma_T} > 1, \\ n & \text{if } r_{\gamma_T} = 1, \end{cases} \tag{17}$$

where Γ₁(γ_T) is the set of submodels γ_k of the DGP γ_T such that r_{γ_k} = r_{γ_T} − 1, i.e. all submodels obtained by removing one regressor at a time (with replacement).[25]

The reasoning is that if the model is parametric (and correctly specified for the data), removing any of the regressors will have a marked impact on IC. In contrast, if (some of the) regressors are just incremental terms in a nonparametric approximation, removing one of these regressors will have little effect on IC. Liu and Yang (2011) show that PI converges to 1 for a nonparametric scenario, and goes to infinity in a parametric scenario. The authors suggest to take PI = 1.2 as a cutoff point between parametric and nonparametric scenarios; this threshold was adopted in the present paper.
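Under the stated choices (λ_n = 1, d = 0), the PI of (16)–(17) can be computed as sketched below on toy data with a hypothetical DGP containing two strong regressors:

```python
import numpy as np

def ic(y, X, cols, sigma2, lam=1.0, d=0.0):
    """IC_{lambda_n,d} of (16) for the submodel using columns `cols`."""
    n = len(y)
    Xg = X[:, cols]
    beta, *_ = np.linalg.lstsq(Xg, y, rcond=None)
    rss = np.sum((y - Xg @ beta) ** 2)
    return (rss + lam * np.log(n) * len(cols) * sigma2
            - n * sigma2 + d * np.sqrt(n) * np.log(n) * sigma2)

def pi_index(y, X, dgp_cols):
    """PI of (17): smallest IC ratio over one-regressor-removed submodels."""
    n = len(y)
    Xg = X[:, dgp_cols]
    beta, *_ = np.linalg.lstsq(Xg, y, rcond=None)
    sigma2 = np.sum((y - Xg @ beta) ** 2) / (n - len(dgp_cols))
    if len(dgp_cols) == 1:
        return n
    ic_dgp = ic(y, X, dgp_cols, sigma2)
    return min(ic(y, X, [c for c in dgp_cols if c != j], sigma2) / ic_dgp
               for j in dgp_cols)

rng = np.random.default_rng(3)
n = 500
X = rng.standard_normal((n, 5))
y = X[:, 0] + 0.8 * X[:, 1] + rng.standard_normal(n)
print(pi_index(y, X, [0, 1]))  # strong regressors: PI well above the 1.2 cutoff
```

Repeating this over many simulated samples yields the MC distributions of PI summarized in Table 10.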

PI is applied in this paper at the level of each DGP; if PI indicates that the DGP is nonparametric, it is also investigated which of the submodels is responsible for this, and the corresponding omitted variables are labeled ‘weak’. As in the rest of the paper, a MC approach is used. Five thousand datasets are generated from each DGP and PI is calculated for each sample, hence obtaining a distribution of PI values. Table 10 summarizes the MC distribution of PI values through the empirical distribution function F_m(x) = m^{−1} Σ_{j=1}^m 1(PI_j ≤ x), where m = N_R and PI_j is the PI value in replication j = 1, …, N_R. Quantiles of PI are indicated as PI_α, with α = 0.01, 0.1, 0.9, 0.99, and the MC mean of PI is indicated as E_N(PI), where for simplicity the subscript R is omitted from N_R.

Table 10:

Distribution of PI values for DGPs 1–9. F_N(⋅) is the MC cumulative distribution function of PI and PI_α is the α-quantile of F_N(⋅). DGPs where EDGP ≠ DGP are in boldface.

| DGP | DGP indices | F_N(1.2) | PI_0.01 | PI_0.1 | E_N(PI) | PI_0.9 | PI_0.99 | EDGP indices |
|---|---|---|---|---|---|---|---|---|
| 1 | {} |  |  |  |  |  |  | {} |
| 2 | {37} | 0.00 | 16.55 | 25.59 | 41.80 | 60.83 | 84.61 | {37} |
| 3 | {37, 38} | 0.04 | 0.88 | 1.53 | 2.54 | 3.52 | 4.17 | {37, 38} |
| 4 | {11} | 0.00 | 30.50 | 37.82 | 49.19 | 61.78 | 74.88 | {11} |
| 5 | {3} | 0.00 | 365.84 | 415.39 | 493.63 | 578.17 | 668.84 | {3} |
| 6 | {3, 11} | 0.98 | 0.37 | 0.38 | 0.53 | 0.79 | 1.40 | {11} |
| 6A | {3, 11} | 0.00 | 2.77 | 4.15 | 6.44 | 8.95 | 11.69 | {3, 11} |
| 6B | {3, 11} | 0.00 | 15.11 | 18.10 | 23.04 | 28.38 | 33.72 | {3, 11} |
| 7 | {11, 29, 37} | 0.00 | 2.84 | 4.16 | 6.46 | 8.96 | 11.76 | {11, 29, 37} |
| 8 | {3, 21, 37} | 0.00 | 5.77 | 8.40 | 13.49 | 19.22 | 26.41 | {3, 21, 37} |
| 9 | {3, 11, 21, 29, 37} | 1.00 | 0.75 | 0.75 | 0.77 | 0.81 | 0.93 | {11, 29, 37} |

The reference threshold is PI = 1.2, and F_N(1.2) shows the frequency of PI falling below this limit; in other words, this estimates the probability that the DGP is classified as nonparametric. There is a very clear distinction: DGPs 6 and 9 are regarded as nonparametric 98% and 100% of the time, respectively. In contrast, all other DGPs are almost always regarded as parametric, with the slight exception of DGP 3, which is a little less clear cut.

Examining the quantiles, DGP 3 has a mean PI value of 2.54 and PI0.1 = 1.53, which puts it in the parametric class in the large majority of cases. DGP 6 has a mean PI of 0.53, and PI0.9 = 0.79, making it almost always nonparametric. DGP 9 has PI0.99 = 0.93, making it the most obviously nonparametric DGP. Of the remaining DGPs, all are well above the threshold and can be considered parametric.

Next it was further investigated which regressors were causing the nonparametricness, i.e. which regressors are ‘weak’, by examining the individual IC ratios for each regressor of a given DGP, see (17). Let ICR(i) indicate the IC ratio between the DGP and the submodel of the DGP where variable i is removed. Table 11 reports the distribution of ICR(i) for DGPs 6 and 9, the nonparametric DGPs. One can clearly see that in DGP 6 it is x3 that causes the nonparametricness, since it has a mean ICR(3) of 0.53: removing this regressor improves the information criterion given the data. The same is true for x3 and x21 in DGP 9, which both have ICRs with a mean of around 0.8. In contrast, removing any of the other regressors has a significant impact on the quality of the model fit. In practice, therefore, one could consider these as the weak regressors.

Table 11:

Distribution of ICRs for DGPs 6 and 9. Notation as in Table 10. Variables that are excluded from the EDGP are in boldface.

| DGP | Variable | F_N(1.2) | ICR_0.01 | ICR_0.1 | E(ICR) | ICR_0.9 | ICR_0.99 |
|---|---|---|---|---|---|---|---|
| 6 | x3 | 0.98 | 0.37 | 0.38 | 0.53 | 0.79 | 1.40 |
|  | x11 | 0.00 | 15.79 | 19.27 | 25.03 | 31.33 | 37.31 |
| 9 | x3 | 0.99 | 0.75 | 0.75 | 0.82 | 0.94 | 1.20 |
|  | x11 | 0.00 | 8.44 | 9.95 | 12.56 | 15.41 | 18.36 |
|  | x21 | 0.99 | 0.75 | 0.75 | 0.81 | 0.92 | 1.18 |
|  | x29 | 0.00 | 2.15 | 2.88 | 4.24 | 5.73 | 7.38 |
|  | x37 | 0.00 | 3.87 | 5.46 | 8.82 | 12.61 | 17.25 |

Therefore, in DGPs 6 and 9, the variables in boldface in Table 11 are excluded from the EDGP. The EDGPs are defined as the remaining regressors in each case; see the last column of Table 10.

References

Abadir, K., and J. R. Magnus. 2002. “Notation in Econometrics: A Proposal for a Standard.” The Econometrics Journal 5: 76–90, https://doi.org/10.1111/1368-423x.t01-1-00074.

Anderson, T. W. 1971. The Statistical Analysis of Time Series. New York: Wiley.

Archer, G., A. Saltelli, and I. Sobol. 1997. “Sensitivity Measures, Anova-like Techniques and the Use of Bootstrap.” Journal of Statistical Computation and Simulation 58 (2): 99–120, https://doi.org/10.1080/00949659708811825.

Becker, W., M. Saisana, P. Paruolo, and I. Vandecasteele. 2017. “Weights and Importance in Composite Indicators: Closing the Gap.” Ecological Indicators 80: 12–22, https://doi.org/10.1016/j.ecolind.2017.03.056.

Becker, W., and A. Saltelli. 2015. “Design for Sensitivity Analysis.” In Handbook of Design and Analysis of Experiments, Chapter 18, edited by A. Dean, M. Morris, J. Stufken, and D. Bingham, 631–78. Boca Raton, New York: CRC Press.

Benjamini, Y., and Y. Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society: Series B 57 (1): 289–300, https://doi.org/10.1111/j.2517-6161.1995.tb02031.x.

Billingsley, P. 1995. Probability and Measure. New York: John Wiley & Sons.

Bittman, R., J. P. Romano, C. Vallarino, and M. Wolf. 2009. “Testing Multiple Hypotheses with Common Effect.” Biometrika 96: 399–410, https://doi.org/10.1093/biomet/asp006.

Bonferroni, C. E. 1936. Teoria statistica delle classi e calcolo delle probabilità. Florence, Italy: Libreria Internazionale Seeber.

Brell, G., G. Li, and H. Rabitz. 2010. “An Efficient Algorithm to Accelerate the Discovery of Complex Material Formulations.” Journal of Chemical Physics 132 (17): 174103-1–10, https://doi.org/10.1063/1.3407440.

Brunea, F. 2008. “Consistent Selection via the Lasso for High Dimensional Approximating Regression Models.” In Pushing the Limits of Contemporary Statistics: Essays in Honor of J. K. Ghosh, edited by B. Clarke, and S. Ghosal, 122–37. Dordrecht: IMS, https://doi.org/10.1214/074921708000000101.

Burnham, K. P., and D. R. Anderson. 2002. Model Selection and Multimodel Inference – A Practical Information-Theoretic Approach, 2nd ed. New York: Springer.

Castle, J. L., J. A. Doornik, and D. F. Hendry. 2011. “Evaluating Automatic Model Selection.” Journal of Time Series Econometrics 3: 1–33, https://doi.org/10.2202/1941-1928.1097.

Claeskens, G., and N. L. Hjort. 2003. “The Focused Information Criterion, with Discussion.” Journal of the American Statistical Association 98: 900–945, https://doi.org/10.1198/016214503000000819.

Danilov, D., and J. Magnus. 2004. “On the Harm that Ignoring Pretesting Can Cause.” Journal of Econometrics 122: 27–46, https://doi.org/10.1016/j.jeconom.2003.10.018.

Deckers, T., and C. Hanck. 2014. “Variable Selection in Cross-Section Regressions: Comparisons and Extensions.” Oxford Bulletin of Economics and Statistics 76 (6): 841–73, https://doi.org/10.1111/obes.12048.

Doornik, J. A. 2009. “Autometrics.” In The Methodology and Practice of Econometrics: Festschrift in Honour of David F. Hendry. Oxford, UK: Oxford University Press, https://doi.org/10.1093/acprof:oso/9780199237197.003.0004.

Fernandez, C., E. Ley, and M. F. Steel. 2001. “Model Uncertainty in Cross-Country Growth Regressions.” Journal of Applied Econometrics 16 (5): 563–76, https://doi.org/10.1002/jae.623.

Foster, D. P., and E. I. George. 1994. “The Risk Inflation Criterion for Multiple Regression.” Annals of Statistics 22: 1947–75, https://doi.org/10.1214/aos/1176325766.

Freedman, D. 1983. “A Note on Screening Regression Equations.” The American Statistician 37: 152–5, https://doi.org/10.1080/00031305.1983.10482729.

Freedman, D., and P. Humphreys. 1999. “Are There Algorithms that Discover Causal Structure?” Synthese 121: 29–54, https://doi.org/10.1023/a:1005277613752.

Freedman, L. E., and D. Pee. 1989. “Return to a Note on Screening Regression Equations.” The American Statistician 43: 279–82, https://doi.org/10.2307/2685389.

Freedman, L. E., D. Pee, and N. Midthune. 1992. “The Problem of Underestimating the Residual Error Variance in Forward Stepwise Regression.” Journal of the Royal Statistical Society, Series D (The Statistician) 41: 405–12, https://doi.org/10.2307/2349005.

Hansen, B. 1999. “Discussion of ‘Data Mining Reconsidered’ by K.D. Hoover and S.J. Perez.” Econometrics Journal 2: 192–201, https://doi.org/10.1111/1368-423x.00026.

Harrell, F. 2001. Regression Modeling Strategies. New York: Springer, https://doi.org/10.1007/978-1-4757-3462-1.

Hendry, D., and H.-M. Krolzig. 1999. “Improving on ‘Data Mining Reconsidered’ by K.D. Hoover and S.J. Perez.” Econometrics Journal 2: 41–58, https://doi.org/10.1111/1368-423x.00027.

Hendry, D. F., and H.-M. Krolzig. 2005. “The Properties of Automatic Gets Modelling.” Economic Journal 115: C32–61, https://doi.org/10.1111/j.0013-0133.2005.00979.x.

Hjort, N. L., and G. Claeskens. 2003. “Frequentist Model Average Estimators.” Journal of the American Statistical Association 98: 879–99, https://doi.org/10.1198/016214503000000828.

Homma, T., and A. Saltelli. 1996. “Importance Measures in Global Sensitivity Analysis of Nonlinear Models.” Reliability Engineering & System Safety 52 (1): 1–17, https://doi.org/10.1016/0951-8320(96)00002-6.

Hoover, K., and S. Perez. 1999. “Data Mining Reconsidered: Encompassing and the General-to-specific Approach to Specification Search.” The Econometrics Journal 2 (2): 167–91, https://doi.org/10.1111/1368-423x.00025.

Imbens, G., and J. Wooldridge. 2009. “Recent Developments in the Econometrics of Program Evaluation.” Journal of Economic Literature 47: 5–86, https://doi.org/10.1257/jel.47.1.5.

Jansen, M. J. W. 1999. “Analysis of Variance Designs for Model Output.” Computer Physics Communications 117 (1–2): 35–43, https://doi.org/10.1016/s0010-4655(98)00154-4.

Johansen, S. 1996. Likelihood-based Inference in Cointegrated Vector Auto-Regressive Models. Oxford, UK: Oxford University Press, https://doi.org/10.1093/0198774508.001.0001.

Krolzig, H.-M., and D. F. Hendry. 2001. “Computer Automation of General-to-specific Model Selection Procedures.” Journal of Economic Dynamics and Control 25 (6): 831–66, https://doi.org/10.1016/s0165-1889(00)00058-0.

Leamer, E. E. 1983. “Let’s Take the Con Out of Econometrics.” The American Economic Review 73 (1): 31–43.

Leeb, H., and B. M. Poetscher. 2006. “Can One Estimate the Conditional Distribution of Post-Model-Selection Estimators?” Annals of Statistics 34: 2554–91, https://doi.org/10.1214/009053606000000821.

Ley, E., and M. F. Steel. 2009. “On the Effect of Prior Assumptions in Bayesian Model Averaging with Applications to Growth Regression.” Journal of Applied Econometrics 24 (4): 651–74, https://doi.org/10.1002/jae.1057.

Liu, W., and Y. Yang. 2011. “Parametric or Nonparametric? A Parametricness Index for Model Selection.” The Annals of Statistics 39 (4): 2074–102, https://doi.org/10.1214/11-aos899.

Lovell, M. C. 1983. “Data Mining.” The Review of Economics and Statistics 65: 1–12, https://doi.org/10.2307/1924403.

Magnus, J. 2007. “Local Sensitivity in Econometrics.” In Measurement in Economics, edited by M. Boumans, 295–319. San Diego: Academic Press.

Magnus, J., and J. Durbin. 1999. “Estimation of Regression Coefficients of Interest when Other Regression Coefficients are of No Interest.” Econometrica 67: 639–43, https://doi.org/10.1111/1468-0262.00040.

Magnus, J., and A. Vasnev. 2007. “Local Sensitivity and Diagnostic Tests.” The Econometrics Journal 10: 166–92, https://doi.org/10.1111/j.1368-423x.2007.00204.x.

Magnus, J., O. Powell, and P. Prufer. 2010. “A Comparison of Two Model Averaging Techniques with an Application to Growth Empirics.” Journal of Econometrics 154: 139–53, https://doi.org/10.1016/j.jeconom.2009.07.004.

Miller, A. 2002. Subset Selection in Regression, 2nd ed. Boca Raton, USA: Chapman and Hall, CRC Press, https://doi.org/10.1201/9781420035933.

Norton, J. 2015. “An Introduction to Sensitivity Assessment of Simulation Models.” Environmental Modelling & Software 69: 166–74, https://doi.org/10.1016/j.envsoft.2015.03.020.

Pantula, S. G. 1989. “Testing for Unit Roots in Time Series Data.” Econometric Theory 5 (2): 256–71, https://doi.org/10.1017/s0266466600012421.

Paruolo, P. 2001. “The Power of Lambda Max.” Oxford Bulletin of Economics and Statistics 63: 395–403, https://doi.org/10.1111/1468-0084.00227.

Paruolo, P., and A. Rahbek. 1999. “Weak Exogeneity in I(2) VAR Systems.” Journal of Econometrics 93: 281–308, https://doi.org/10.1016/s0304-4076(99)00012-3.

Paruolo, P., A. Saltelli, and M. Saisana. 2013. “Ratings and Rankings: Voodoo or Science?” Journal of the Royal Statistical Society, Series A (Statistics in Society) 176: 609–34, https://doi.org/10.1111/j.1467-985x.2012.01059.x.

Paulsen, J. 1984. “Order Determination of Multivariate Autoregressive Time Series with Unit Roots.” Journal of Time Series Analysis 5: 115–27, https://doi.org/10.1111/j.1467-9892.1984.tb00381.x.

Pearson, K. 1905. On the General Theory of Skew Correlation and Non-linear Regression, Volume XIV of Mathematical Contributions to the Theory of Evolution, Drapers’ Company Research Memoirs. London: Dulau & Co. Reprinted in: Early Statistical Papers, Cambridge University Press, Cambridge, UK, 1948.

Phillips, P. C. B. 1997. “Econometric Model Determination.” Econometrica 64: 763–812, https://doi.org/10.2307/2171845.

Phillips, P. C. B. 2003. “Laws and Limits of Econometrics.” Economic Journal 113: C26–52, https://doi.org/10.1111/1468-0297.00114.

Poetscher, B. M. 1991. “Effects of Model Selection on Inference.” Econometric Theory 7: 163–85, https://doi.org/10.1017/S0266466600004382.

Pretis, F., J. Reade, and G. Sucarrat. 2018. “Automated General-to-Specific (Gets) Regression Modeling and Indicator Saturation for Outliers and Structural Breaks.” Journal of Statistical Software 86 (3): 1–44, https://doi.org/10.18637/jss.v086.i03.

Romano, J. P., and A. M. Shaikh. 2006. “On Stepdown Control of the False Discovery Proportion.” In Optimality, 33–50. Institute of Mathematical Statistics.10.1214/074921706000000383Search in Google Scholar

Romano, J. P., A. M. Shaikh, and M. Wolf. 2008. “Control of the False Discovery Rate under Dependence Using the Bootstrap and Subsampling.” Test 17 (3): 417–42, https://doi.org/10.1007/s11749-008-0126-6.Search in Google Scholar

Romano, J. P., and M. Wolf. 2005. “Stepwise Multiple Testing as Formalized Data Snooping.” Econometrica 73: 1237–82, https://doi.org/10.1111/j.1468-0262.2005.00615.x.Search in Google Scholar

Sala-i-Martin, X. 1997. “I Just Ran Two Million Regressions.” The American Economic Review, Papers and Proceedings of the Hundred and Fourth Annual Meeting of the American Economic Association, Vol. 87, 178–83.Search in Google Scholar

Saltelli, A., and P. Annoni. 2010. “How to Avoid a Perfunctory Sensitivity Analysis.” Environmental Modelling & Software 25 (12): 1508–17, https://doi.org/10.1016/j.envsoft.2010.04.012.Search in Google Scholar

Saltelli, A., and S. Tarantola. 2002. “On the Relative Importance of Input Factors in Mathematical Models.” Journal of the American Statistical Association 97 (459): 702–9, https://doi.org/10.1198/016214502388618447.Search in Google Scholar

Saltelli, A., M. Ratto, S. Tarantola, and F. Campolongo. 2012. “Sensitivity Analysis for Chemical Models.” Chemical Reviews 112: PR1–PR21, https://doi.org/10.1021/cr200301u.Search in Google Scholar

Saltelli, A., P. Annoni, I. Azzini, F. Campolongo, M. Ratto, and S. Tarantola. 2010. “Variance Based Sensitivity Analysis of Model Output. Design and Estimator for the Total Sensitivity Index.” Computer Physics Communications 181 (2): 259–70, https://doi.org/10.1016/j.cpc.2009.09.018.Search in Google Scholar

Saltelli, A., S. Tarantola, and F. Campolongo. 2000. “Sensitivity Analysis as an Ingredient of Modelling.” Statistical Science 15 (4): 377–95.10.1214/ss/1009213004Search in Google Scholar

Saltelli, A., T. Andres, and T. Homma. 1993. “Sensitivity Analysis of Model Output: An Investigation of New Techniques.” Computational Statistics & Data Analysis 15: 211–38, https://doi.org/10.1016/0167-9473(93)90193-w.Search in Google Scholar

Sobol’, I. M. 1967. “On the Distribution of Points in a Cube and the Approximate Evaluation of Integrals.” USSR Computational Mathematics and Mathematical Physics 7 (4): 86–112, https://doi.org/10.1016/0041-5553(67)90144-9.Search in Google Scholar

Sobol’, I. M. 1993. “Sensitivity Estimates for Nonlinear Mathematical Models.” Mathematical Modeling and Computational Experiment 1 (4): 407–14.Search in Google Scholar

Tibshirani, R. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B 58 (1): 267–88, https://doi.org/10.1111/j.2517-6161.1996.tb02080.x.Search in Google Scholar

Wei, P., Z. Lu, and J. Song. 2015. “Variable Importance Analysis: A Comprehensive Review.” Reliability Engineering & System Safety 142: 399–432, https://doi.org/10.1016/j.ress.2015.05.018.Search in Google Scholar

Received: 2018-08-31
Accepted: 2021-02-16
Published Online: 2021-03-15

© 2021 William Becker et al., published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
