Forecasting in the factor model was already alluded to in the previous section; we now return to it in more detail in the context of our empirical application. The forecasting framework is very similar to the one often applied in the literature, e.g., Stock and Watson (2002b). The aim is to obtain *h*-step-ahead forecasts of a number of macroeconomic variables. To avoid having to specify a process for the factors we only consider direct forecasting. In general we consider the following forecasting model:

$$y_{t+h}^{h}=\alpha_{h}+\beta_{h}(L)'F_{t}+\gamma_{h}(L)y_{t}+e_{t+h}^{h}\qquad(20)$$

where β_{h}(*L*) and γ_{h}(*L*) are lag polynomials. Note the explicit dependence on the forecast horizon: a separate model is specified for each *h*.

The variables to be forecast are assumed to be either I(1) or I(2) in logarithms. Let *z*_{t} be the original variable of interest recorded at a monthly frequency, then in the case of I(1) we define the *h*-step-ahead variable as:

$$y_{t+h}^{h}=(1200/h)\log(z_{t+h}/z_{t})\qquad(21)$$

i.e., annualized growth over the horizon in percent. In the case of I(2) we define it as:

$$y_{t+h}^{h}=(1200/h)\log(z_{t+h}/z_{t})-1200\log(z_{t}/z_{t-1})\qquad(22)$$

i.e., the difference between annualized growth over the horizon in percent and annualized growth over the last month.
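As an illustration, the transformations (21) and (22) can be computed as follows; the function names and the array alignment conventions are ours, not the paper's:

```python
import numpy as np

def yh_i1(z, h):
    """Eq. (21): annualized h-step growth in percent for an I(1)-in-logs
    series. Entry t of the result holds (1200/h) * log(z[t+h] / z[t])."""
    z = np.asarray(z, dtype=float)
    return (1200.0 / h) * (np.log(z[h:]) - np.log(z[:-h]))

def yh_i2(z, h):
    """Eq. (22): annualized h-step growth minus annualized one-month growth,
    for an I(2)-in-logs series; defined for t >= 1."""
    z = np.asarray(z, dtype=float)
    long_run = (1200.0 / h) * (np.log(z[h:]) - np.log(z[:-h]))  # (1200/h) log(z_{t+h}/z_t)
    last_month = 1200.0 * (np.log(z[1:]) - np.log(z[:-1]))      # 1200 log(z_{t+1}/z_t)
    # align: entry for t pairs long_run[t] with last_month[t-1] = 1200 log(z_t/z_{t-1})
    return long_run[1:] - last_month[:len(long_run) - 1]
```

As a sanity check, for a series growing at a constant annualized rate *g*, `yh_i1` returns *g* for every *t* and `yh_i2` returns zero.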

Since the forecasting model (20) contains the unobserved factors *F*_{t}, estimation is done in a two-step approach. First the factors are estimated using either PC or the proposed LAD procedure. Then in the second step (20) is estimated by OLS with the estimated factors in place of the unobserved true ones. Hence, for a dataset ending at time *T* we obtain the forecast of *T*+*h* by fitting the forecasting equation using the OLS estimates:

$$\widehat{y}_{T+h|T}^{h}=\widehat{\alpha}_{h}+\sum_{j=1}^{k}\widehat{\beta}_{h,j}\widehat{F}_{T,j}+\sum_{j=1}^{p}\widehat{\gamma}_{h,j}\,y_{T-j+1}\qquad(23)$$

Clearly, in order to fully specify the forecasting model, *k* and *p* need to be chosen, i.e., the number of factors and the number of lags of *y*_{t} to include. Note that in the following we also allow for the possibility of including no lags of *y*_{t}; this case will be referred to as *p*=0. Either BIC or the IC_{j} criterion of Bai and Ng (2002), as described earlier, will be used to select the number of factors, and BIC to select the number of lags. In addition, the possibility of simply fixing *k* and *p* will also be considered.
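The two-step procedure leading to (23) can be sketched as follows, here with PC in the first step. This is an illustrative sketch, not the authors' code: the panel `X` is assumed balanced, centered, and scaled, and all function and variable names are ours:

```python
import numpy as np

def pc_factors(X, k):
    """First step: k principal-component factors from a T x n panel X
    (already centered and scaled), normalized so that F'F / T = I_k."""
    T = X.shape[0]
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sqrt(T) * U[:, :k]

def direct_forecast(y_target, y_lags, X, h, k, p):
    """Second step, eq. (23): OLS of the h-step target on a constant, the
    estimated factors, and p lags of y, then a forecast from the final period.
    y_target[t] = y_{t+h}^h (length T - h); y_lags is the monthly y_t series."""
    T = X.shape[0]
    F = pc_factors(X, k)
    rows = np.arange(max(p - 1, 0), T - h)          # need p lags and the h-step lead
    Z = np.column_stack([np.ones(len(rows)), F[rows]] +
                        [y_lags[rows - j] for j in range(p)])
    b, *_ = np.linalg.lstsq(Z, y_target[rows], rcond=None)
    z_T = np.concatenate(([1.0], F[T - 1], [y_lags[T - 1 - j] for j in range(p)]))
    return float(z_T @ b)
```

With the LAD procedure, the first step would instead use the LAD factor estimates on median-centered data; the second step is unchanged.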

Before estimation can be carried out the data need to be prepared. Recall that the factor model assumes the variables to be stationary and centered at zero, and we need to ensure that the data reflect this. All series are therefore transformed to be stationary; the details can be found in the data appendix. For the PC estimation we further center all variables to have mean zero. The intuition is that the factor estimation is essentially a regression without an intercept; centering corrects for this. In some cases the mean of the series is even referred to as the zeroth PC. The LAD estimation also requires centering; however, since LAD estimates the median, centering is done at the median.

Finally, the data have to be scaled. The need for this is not immediately apparent from the general model setup, but it is in fact quite critical. Recall that the solution to the PC estimation problem is the set of loadings that maximizes the variances of the individual factors. If some variables have a high variance and others a low variance, the latter will therefore be crowded out by the former. To avoid this problem it is common to scale all variables to have unit variance, which ensures that the “choice” of variables in the factors is not driven by differences in the variances.

Although scaling the variables to have unit variances is the obvious choice, we must remember that the variance is not a robust measure of dispersion. In the case of the LAD estimator one could imagine more appropriate scalings. A popular robust alternative to the variance (or rather the standard deviation) is the median absolute deviation (MAD):^{7}

$$\mathrm{MAD}(x)=\mathrm{med}\big[\,|x_{i}-\mathrm{med}[x]|\,\big]\qquad(24)$$

Consequently, we will consider the following methods for estimating the factors: PC, estimation by PC using data centered at the mean and scaled to have unit variance; LAD, estimation by LAD using data centered at the median and scaled to have unit variance; and LAD-MAD, estimation by LAD using data centered at the median and scaled to have unit MAD. We will also examine the effects of screening the data, hence we also have a PC-S model, i.e., estimation by PC using data centered at the mean, scaled to have unit variance, and screened for outliers. The screening procedure is the same as in the previous section.^{8}
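The three data preparations can be collected in a small helper; `prepare` and `mad` are hypothetical names for illustration, and the MAD in (24) is used without any consistency constant:

```python
import numpy as np

def mad(x, axis=0):
    """Median absolute deviation, eq. (24)."""
    med = np.median(x, axis=axis, keepdims=True)
    return np.median(np.abs(x - med), axis=axis)

def prepare(X, method):
    """Center and scale a T x n panel:
    'pc'      : center at the mean,   scale to unit standard deviation
    'lad'     : center at the median, scale to unit standard deviation
    'lad-mad' : center at the median, scale to unit MAD"""
    if method == 'pc':
        return (X - X.mean(axis=0)) / X.std(axis=0)
    if method == 'lad':
        return (X - np.median(X, axis=0)) / X.std(axis=0)
    if method == 'lad-mad':
        return (X - np.median(X, axis=0)) / mad(X)
    raise ValueError(method)
```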

In addition to these factor models we also consider a number of benchmark models. Firstly, our main benchmark is an AR model, i.e., (20) without factors; all results will be reported relative to this model. Secondly, we include a naïve scenario where the forecast is computed as the unconditional mean (denoted U.Mean) or median (denoted U.Median) of either the entire data series or a window containing the last *h* observations (abbreviated W). Lastly, we also include what could be referred to as a low-dimensional factor model based on the economic indexes developed by Stock and Watson (1989). The model is the same basic factor model as before, i.e., (2); however, we now specify the dynamics of the factors and error terms. Written in general terms, we have

$$X_{it}=\lambda_{i}F_{t}+e_{it}\qquad(25)$$

$$\Gamma(L)F_{t}=\eta_{t}\qquad(26)$$

$$\delta_{i}(L)e_{it}=\nu_{it}\qquad(27)$$

where Γ(*L*) and δ_{i}(*L*) are lag polynomials and (ν_{1t}, …, ν_{nt}, η_{1t}, …, η_{rt}) are i.i.d. normal and mutually independent error terms.

Stock and Watson (1989) only considered the case of a single factor, and we will also limit ourselves to this case. Furthermore, (26) and (27) will be specified as AR(1) processes.^{9} Identification is achieved by setting the top block of the loading matrix equal to the identity matrix, i.e., $$\Lambda=(I_{r},\Lambda_{2}')'.$$ Even though this identification scheme is different from the one used in, e.g., PC estimation, it is merely a different rotation and is unimportant for forecasting purposes. Under these assumptions we can use the Kalman filter to compute the likelihood and thus obtain the maximum likelihood estimates of the model; see Stock and Watson (2006) for further details. The interest of Stock and Watson (1989) was to build coincident and leading economic indexes using the model (25)–(27). Our interest is to forecast key macroeconomic variables, and to ensure comparability we again compute these as direct forecasts. Hence we use the Kalman smoother to extract the factors and use these in (23), the same forecasting relationship used for the other factor models.
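Under these assumptions, the likelihood evaluation takes a simple form because the idiosyncratic errors can be placed in the state vector, leaving no measurement noise. The sketch below is a bare-bones illustration for the single-factor case with AR(1) dynamics, not the authors' implementation; parameter names are ours:

```python
import numpy as np

def kf_loglik(X, lam, phi_f, d, s2_eta, s2_v):
    """Gaussian log-likelihood of (25)-(27) with one AR(1) factor and AR(1)
    idiosyncratic errors. State alpha_t = (F_t, e_1t, ..., e_nt)'; observation
    X_t = Z alpha_t with no measurement noise. lam, d, s2_v are length-n arrays."""
    T, n = X.shape
    Z = np.column_stack([lam, np.eye(n)])              # observation matrix
    Tr = np.diag(np.concatenate(([phi_f], d)))         # transition matrix
    Q = np.diag(np.concatenate(([s2_eta], s2_v)))      # state innovation variance
    a = np.zeros(n + 1)                                # predicted state
    P = np.diag(np.concatenate(([s2_eta / (1 - phi_f**2)],
                                s2_v / (1 - d**2))))   # stationary initial variance
    ll = 0.0
    for t in range(T):
        v = X[t] - Z @ a                               # one-step prediction error
        Fv = Z @ P @ Z.T                               # its variance
        Fi = np.linalg.inv(Fv)
        ll += -0.5 * (n * np.log(2 * np.pi)
                      + np.linalg.slogdet(Fv)[1] + v @ Fi @ v)
        K = Tr @ P @ Z.T @ Fi                          # Kalman gain
        a = Tr @ a + K @ v
        P = Tr @ P @ (Tr - K @ Z).T + Q
    return ll
```

Maximizing this log-likelihood over the parameters, subject to the identification restriction on the loadings, yields the ML estimates; the smoothed factor then follows from a backward smoothing pass.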

Such a model, where we fully specify the dynamics of the factors and error terms, is an example of confirmatory factor analysis, as opposed to exploratory factor analysis often conducted using PC. As pointed out by a referee the LAD approach can also be seen in a confirmatory factor analysis context as we are implicitly assuming that the errors have fat tails (or equivalently that they contain outliers) when choosing to use this approach.

## 4.1 The dataset

Many of the papers written on the subject of factor model forecasting use the same few datasets available online, more specifically the data from Stock and Watson (2002b, 2005) or an updated version of the latter from Ludvigson and Ng (2010). For this paper, however, we have collected a new, more up-to-date dataset. All variables used are either taken directly from the Federal Reserve Economic Data (FRED) database made available by the Federal Reserve Bank of St. Louis or have been computed on the basis of these. We have attempted to keep the composition of the dataset close to the original Stock and Watson datasets. A total of 111 monthly US macroeconomic variables are included, making our dataset slightly smaller than the typical datasets. The data have been taken in seasonally adjusted form from FRED, and hence we have not performed any adjustments ourselves.

We will be forecasting six variables; three variables measuring real economic activity: Industrial production (INDPRO), real personal income excluding current transfer receipts (W875RX1), and the number of employees on nonagricultural payrolls (PAYEMS); as well as three price indexes: The consumer price index less food and energy (CPILFESL), the personal consumption expenditures price index (PCEPI), and the producer price index for finished goods (PPIFGS). We assume that the first three are I(1) in logarithms and that the last three are I(2) in logarithms.

A balanced dataset is needed for the estimation procedures and hence the span of the dataset will be determined by the variable with the shortest span after possibly being differenced to obtain stationarity. We will therefore be considering data spanning the period 1971:2–2012:10. Data are available for the variables we wish to forecast prior to 1971:2 and hence the span is not reduced when lags are included in the forecasting model.

The forecasting exercise will be carried out as a pseudo-real-time forecasting experiment; real-time in the sense that both the factors and the forecasting model are reestimated in each period using only data available up until that period; and pseudo in the sense that we are not using actual real-time data, but rather final vintage data which may have undergone revision. Forecasting will be done for the period 1981:1–2012:10.
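The pseudo-real-time design can be sketched as an expanding-window loop in which everything is re-estimated at each forecast origin. Here `forecaster` stands in for any of the models considered, and the names and alignment conventions are ours:

```python
import numpy as np

def pseudo_real_time(y_target, X, h, forecaster, first_T):
    """At each origin T0 only data up to T0 are passed to `forecaster`, which
    re-estimates the factors and the forecasting equation and returns the
    direct h-step forecast. y_target[t] = y_{t+h}^h, so only y_target[:T0 - h]
    is observed at the origin, while y_target[T0 - 1] is the realized target."""
    fcsts, actuals = [], []
    for T0 in range(first_T, X.shape[0] - h + 1):
        fcsts.append(forecaster(X[:T0], y_target[:T0 - h]))
        actuals.append(y_target[T0 - 1])               # realized h-step target
    return np.array(fcsts), np.array(actuals)
```

With a naive last-value forecaster this makes the alignment explicit: the forecast made at a given origin is compared with the target realized *h* months later.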

## 4.2 Results

Before examining the forecasting results we start with an inspection of the factor estimates for the different models. In Figure 1 the first four factors estimated using the four main models and the entire dataset are plotted. All four methods give very similar estimates of the first factor, the one difference being that the LAD-MAD estimate appears to be scaled differently. Clearly, this is due to the MAD scaling of the dataset. For the next three factors the methods agree less. Although it might not be obvious from these plots, it does appear that some of the factors have interchanged. To illustrate this further we have provided the correlations between the different factor estimates in Table 5.

Table 5 Correlation matrix for the first four factors estimated using the four methods.

The correlations confirm that the methods agree on the estimation of the first factor and that the difference for the LAD-MAD estimate is indeed simply a scale effect. For the second factor, we see that the two LAD methods agree on the estimate. For the PC estimates the screening has caused factors two and three to switch, and these factors are different from the second factor estimated by the LAD methods. For the third factor the LAD methods agree on the estimate, and finally, the PC and PC-S methods agree on the fourth factor.

Figure 1 Plots of the estimated factors. The left column of the figure depicts factors 1 through 4 (top to bottom) for the PC estimates (blue solid line) and the PC-S estimates (red dashed line). Likewise for the right column with the LAD estimates (blue solid line) and the LAD-MAD estimates (red dashed line).

We thus see that the choice of estimation method can give very different estimates of the factors. As this is based on real data it is of course impossible to determine the reason for this. However, returning to our previous conjecture that outliers affect the ordering of estimated factors, we do see that in the case of the PC method controlling for outliers through screening does affect the ordering, as factors 2 and 3 change places. The ordering is of course not important for the forecasting performance if we indeed include all factors in the forecasting model. However, if the number of factors is not correctly determined it may be crucial for the performance.

In Table 6 we present the main forecasting results for the 12-month horizon. The results are divided into seven main scenarios. In the first scenario we fix the number of factors at four and include no AR terms. In the next three scenarios we determine the number of factors using IC_{1} and include no AR terms, six AR terms, or let BIC determine the number of AR terms. The fifth and sixth scenarios use BIC to determine the number of AR terms and either IC_{2} or IC_{3} for the number of factors. Finally, the seventh scenario uses BIC for both the number of AR terms and the number of factors. In addition we include the benchmark models mentioned above, the naïve scenario and the Kalman-filter-based model of Stock and Watson (1989). For the latter we consider two different specifications. First, we mimic their coincident index, which modelled four variables: industrial production; personal income, total less transfer payments; manufacturing and trade sales; and employee-hours in non-agricultural establishments. We do not have data on manufacturing and trade sales; however, production, personal income, and employment are the first three variables we wish to forecast. We therefore consider the case where *X*_{t} consists of these three variables, a case we denote KF(3). Second, we also consider a model where *X*_{t} consists of all six variables we wish to forecast; this case will be denoted KF(6).

Table 6 Forecasting results for *h* = 12 reported as relative MSFEs.

For each variable of interest we report the mean squared forecast error (MSFE) relative to the MSFE of an AR(*p*) forecast with 0≤*p*≤6 chosen by BIC, i.e., forecasts based on model (20) without factors. The root mean squared forecast error (RMSFE) of this model is included in the final row of the table. For each scenario the lowest MSFE is underlined and the overall lowest MSFE is bold. Two sets of Diebold and Mariano (1995) tests have been performed. First, we test for equal predictive accuracy comparing the best performing model, i.e., the one in bold, to all other models. Significance is indicated as: 5%(^{⁑}), 10%(^{*}). Second, we test for equal predictive accuracy comparing the benchmark AR model to all other models. Significance is indicated as: 5%(^{‡}), 10%(^{†}).
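The relative MSFE and the Diebold-Mariano statistic under squared-error loss can be sketched as follows. The long-run variance uses the rectangular window with *h*−1 autocovariances, as in Diebold and Mariano (1995); the function names are ours:

```python
import math
import numpy as np

def relative_msfe(e_model, e_bench):
    """MSFE of a candidate relative to the benchmark AR forecast errors."""
    return np.mean(np.asarray(e_model)**2) / np.mean(np.asarray(e_bench)**2)

def dm_test(e1, e2, h):
    """Diebold-Mariano test of equal predictive accuracy for two series of
    h-step forecast errors under squared-error loss. Returns the statistic
    and a two-sided p-value from the standard normal limit."""
    d = np.asarray(e1)**2 - np.asarray(e2)**2          # loss differential
    T = len(d)
    dbar = d.mean()
    lrv = np.var(d)                                    # lag-0 autocovariance
    for j in range(1, h):                              # rectangular window
        lrv += 2 * np.mean((d[j:] - dbar) * (d[:-j] - dbar))
    stat = dbar / math.sqrt(lrv / T)
    pval = 2 * (1 - 0.5 * (1 + math.erf(abs(stat) / math.sqrt(2))))
    return stat, pval
```

A significantly negative statistic indicates that the first forecast has the smaller squared-error loss.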

First, notice that the best performing model within a scenario, i.e., the underlined number, is very often associated with the LAD-MAD method, especially if we focus on the cases using the IC_{j} criterion. Furthermore, for three of the six variables LAD-MAD provides the overall best results. To a large extent this appears to be driven by the number of factors included in the forecasting equation. Looking at the average number of factors across the experiment horizon, $$\overline{k}$$, we see that the number of factors is consistently lowest for the LAD-MAD method. PC-S and LAD, on the other hand, choose roughly the same number of factors, which is perhaps to be expected since our Monte Carlo results did show very similar performance for these two methods in many cases. It is, however, surprising that LAD-MAD chooses fewer factors than LAD. This suggests that scaling using a non-robust dispersion measure may incorrectly increase the number of factors and thereby hamper forecasting performance. Furthermore, if the true number of factors is as low as 2–3, as suggested by the LAD-MAD method, the ordering of the factors becomes very important. Even if the other methods do estimate these factors but order them differently, we may need to include many more estimated factors to ensure we include the few truly relevant ones. This may in turn degrade performance, since we then also include irrelevant factors.

Looking at the individual variables, we first see that for IP, PC-S is the preferred method. Its performance is, however, close to that of the best performing LAD-MAD case, and the Diebold-Mariano test shows no significant difference in predictive accuracy. For PI the LAD-MAD method is preferred, with significantly lower MSFE than PC, PC-S, and LAD in all the scenarios as well as the benchmark AR model. Interestingly, we cannot reject equal predictive accuracy when comparing to the KF models. For Employment LAD-MAD is again preferred, but in many cases not significantly different from the alternatives. The last three variables are in some sense less interesting, since in none of the cases do we outperform the benchmark AR model. We do note, however, that if we for a moment exclude the benchmark models, then LAD-MAD is the best performing method for all three variables, although for PCE and PPI KF(6) slightly outperforms it. Hence, in this application, we generally see that LAD-MAD is either on par with the alternatives or outperforms them. It is often difficult to show that these differences are significant according to the Diebold-Mariano tests; however, in the case of PI, where we do see significant differences, the results also point to LAD-MAD as the preferred method.

The results for the PC model are very interesting since, compared to PC-S, they highlight the effects of screening. In general PC tends to choose quite a high number of factors. The drop in the number of factors from PC to PC-S suggests that we do have outliers in the data. However, this does not always translate into better forecasting performance. This could again be explained by the ordering of the factors: if the ordering is such that PC-S does not include all the true factors, and omitting true factors is more costly than including irrelevant ones, then we should include more estimated factors, as in the PC model.

In the final two scenarios either IC_{3} or BIC is used to choose *k*. Neither of these scenarios provides results of comparable performance. For IC_{3} the Monte Carlo results demonstrated a tendency to overestimate the number of factors, and in these empirical results we also find that IC_{3} in general includes more factors than IC_{1} and IC_{2}. This may very well be the reason for the poor performance. Further, as mentioned earlier, the use of BIC is no longer common, and hence its poor performance is expected. Note that one crucial difference between the IC_{j} criterion and BIC is that BIC is applied to the forecasting equation and hence depends on the variable being forecast. Although this could give rise to better performance than IC_{j}, it is not in line with the general theory of the factor model, which assumes one model with *r* factors regardless of what is being forecast. This is also why $$\overline{k}$$ is not reported in the tables for the BIC case. Finally, it is clear that the naïve alternatives are inferior to the models considered, as they underperform in all cases.

Similar forecasting experiments for 6- and 24-month horizons have also been conducted; the results can be found in the tables in the appendix. For the 6-month horizon the results are quite similar to the 12-month forecasts. In general LAD-MAD still performs very well. For IP it is now the preferred method, although not significantly different from PC. For PI LAD-MAD is no longer preferred; instead KF(3) is associated with the lowest MSFE, albeit the MSFEs are very close and not significantly different. Interestingly, we are now able to outperform the AR forecast for CPI, i.e., the relative MSFE is less than one. In the 12-month case the factor-based forecasts did not improve on the AR forecasts for any of the price indexes. For the 24-month horizon it is generally harder to show significant differences between the methods. Putting that aside, LAD-MAD is associated with the lowest MSFE for the first four variables. For the last two, PCE and CPI, we again see that KF(6) is preferred among the factor methods; however, as was also the case for the 6- and 12-month horizons, the benchmark AR model is the best performing model for these two variables. Note, however, that in general the magnitudes of the results are difficult to compare across horizons, since the definition of the variable being forecast changes with the horizon. Specifically, the variable being forecast becomes smoother as the horizon increases. This is also evident from the RMSFE of the AR model, which actually decreases with the horizon.
