The Efficient Shrinkage Path: Maximum Likelihood of Minimum MSE Risk

A new generalized ridge regression shrinkage path is proposed that is as short as possible under the restriction that it must pass through the vector of regression coefficient estimators that make the overall Optimal Variance-Bias Trade-Off under Normal distribution-theory. Five distinct types of ridge TRACE displays plus other graphics for this efficient path are motivated and illustrated here. These visualizations provide invaluable data-analytic insights and improved self-confidence to researchers and data scientists fitting linear models to ill-conditioned (confounded) data.


Regularization Perspective
Regularization of linear models in machine learning is focused here on how generalized ridge TRACE displays enable researchers to literally "see" the effects of variance-bias trade-offs on measures of mean-squared-error (MSE) risk that accompany all changes in estimated β−coefficients.Normaltheory Maximum Likelihood (ML) estimates exist for linear models when the number, p, of nonconstant x−predictor variables is strictly less than the number of observations, n.The X−predictor matrix is then said to be "narrow ", and the ordinary least-squares (OLS) estimator is uniquely 1 arXiv:2103.05161v5[stat.ME] 15 Feb 2024 determined.The additional assumption that at least one OLS residual, y i − ŷi , is non-zero assures that the estimated error-variance, σ2 , is strictly positive.Gunst (2000, p. 62) wrote: "Although ridge regression is widely used in the application of regression methods today, it remains as controversial as when it was first introduced."Many early proposals for choosing the "k-factor" defining the 1−parameter shrinkage path of Hoerl and Kennard (1970a,b) were indeed based mainly on heuristics.In sharp contrast, shrinkage based upon Normaltheory Maximum Likelihood estimation has a firm theoretical foundation, Obenchain (1975Obenchain ( , 1977Obenchain ( , 1978Obenchain ( , 2022)).
The generalized ridge estimators of β−coefficients that are most-likely to have minimum MSE risk under Normal-theory are of the known form discussed here in §5; see Thompson (1968).No GRR shrinkage path with fewer than p−parameters will always pass through this "best" pointestimate.A new Efficient p−parameter shrinkage-path in the form of a two-piece linear spline with a single interior Knot at the "best" point-estimate is proposed in §6 and illustrated graphically in the following sections, §7 − §9.
Ridge TRACE displays for this new efficient path and other visualizations are implemented in version 2.3 of the RXshrink R-package, Obenchain (2023).Each TRACE typically displays estimates of p ≥ 2 quantities that change as shrinkage occurs.The "coef" TRACE displays all p fitted linear-model β−coefficients, and the "rmse" TRACE plots relative mean-squared-error estimates given by the diagonal elements of the MSE-matrix divided by the OLS-estimate of σ 2 .Whenever shrinkage becomes excessive, the "infd" TRACE displays estimated direction-cosines for the inferior-direction in X−space along which shrunken coefficients have higher MSE Risk than OLS estimates, Obenchain (1978).The two remaining types of TRACE diagnostics, named "spat" and "exev", refer to the p rotated axes defining the principal-coordinates of the given (confounded) x−predictors.All estimates displayed in TRACEs are ML or unbiased or have correct-range when p < n − 3.

Shrinkage Estimation
Linear models and the OLS estimator, β0 , of β−coefficients can be placed in a canonical form that is easy to generalize when defining the shrinkage-estimators of interest here.We assume that the y−outcome vector has been both centered and re-scaled to have an observed mean of zero and variance 1, that each column of the X−matrix (n × p) has been standardized in this same way, and that the resulting X−matrix has full (column) rank = p that is ≥ 2 and ≤ (n − 1).Recalling that the OLS fit corresponds to an orthogonal projection in the n−dimensional space of individual observations onto the column-space of X, we can write the following well-know matrix-expressions: ŷ = HH ′ y = X β0 . (2.1) These results follow from writing the singular-value decomposition (SVD) of X as X = HΛ 1/2 G ′ .HH ′ then denotes an orthogonal projection; it is a n×n symmetric and idempotent matrix of rank p known as the Hat-matrix for OLS.In particular, β0 = Gc where G is an orthogonal rotation within the column-space of X, and c = Λ −1/2 H ′ y is the p × 1 column vector containing the uncorrelated components of β0 , Obenchain (1975).Generalized ridge estimators apply scalar-valued shrinkage-factors, each confined to the range 0 ≤ δ j ≤ 1, to the p uncorrelated components of β0 .Thus these β−coefficient estimators are of the form: where ∆ denotes the diagonal matrix containing all p shrinkage δ−factors and g j denotes the j th column of G.
While the above conventions have placed all X−information about the form and extent of any illconditioning into a convenient canonical-form, these conventions have done nothing to predetermine the relative importance of individual x−variables in predicting y−outcomes.That information, as well as information on the many effects of deliberate shrinkage, may well be best and most-clearly revealed via visual examination of TRACE diagnostic plots.For examples, see Figures 1 through  5.

Quantifying Extent of Shrinkage
The multicollinearity allowance, m, measures the "extent" of shrinkage applied in equation (2.2): where 0 ≤ m ≤ p, Obenchain (1977).Besides being the rank of X, p is also the trace of the OLS Hat-matrix.Similarly, trace(∆) is like a measure of "rank" for the (p × p) ∆ shrinkage-factor matrix in equation (2.2).Thus m is essentially a measure of inferred rank deficiency in the given X−matrix.This deficiency is due to ill-conditioning and is best quantified by using the δ Shrinkage Factors that Maximize the Normal distribution-theory Likelihood of the resulting GRR estimates.
Use of this m−scale for displaying TRACE diagnostics also suggests using the short-hand notation, βm , to denote individual β point-estimates in equation (2.2).The OLS solution, β0 , occurs at the beginning, m = 0, of each shrinkage path.Similarly, βp ≡ 0 denotes the shrinkage terminus at m = p.
Since the range of the m−index of equation (3.1) is finite, this m−scale is ideal for use as the horizontal axis on all TRACE plots.Specificaly, a TRACE can then display the entire regularization path.Since Hoerl and Kennard (1970a,b) restricted attention to shrinkage-factors of the form δ j = λ j /(λ j + k) where k is increased from 0 to ∞, their ridge TRACE display with k on the horizontal axis could only display some finite, initial fraction of their full path.

MSE Optimal Shrinkage
When the unknown true components of β are denoted by γ = G ′ β, it follows that the i th uncorrelated component of the c−vector in equation (2.2) has mean γ i and variance σ 2 /λ i .
The F-ratio for testing , where ρj denotes the observed principal correlation between the centered and rescaled y−vector and the j th column of the H−matrix in equation (2.1), while under Normal-theory.

ML Estimation of Uncorrelated Components
When no restrictions are placed on the functional form of regularization, one is free to simply substitute ML estimates for the unknowns in equation ( 4.1) to identify the estimate most likely to have minimum MSE risk.This ML shrinkage estimate under Normal-theory is of the cubic (clearly nonlinear ) form (5.1) Thompson ( 1968) studied this estimator using numerical integration and showed that it yields: (i) reduced MSE risk when a true |γ j | is small relative to σ, (ii) increased risk when |γ j | is larger, but (iii) the same limiting risk as |γ j | approaches +∞.Under conditional distribution-theory for linear models, the γML j estimates of equation ( 5.1) are viewed as being given linear functions of y multiplied by y ′ y/λ j .In other words, the Normaltheory conditional distributions of ρj −estimates are not those of correlation coefficients.After all, the individual columns of the H−matrix are considered given here and are not subject to random variation.
In the limit as the OLS estimate of σ 2 decreases to 0, R 2 naturally increases to 1 for a "correct" linear model.This causes equation (5.1) to simplify to γML . This is the special case of equation (2.2) where δ j ≡ 1, and the OLS fit becomes exact.In other words, OLS predictions, X β0 = ŷ, from equation (2.1) are then identical to the observed y−outcomes.

Efficient ML Shrinkage
As argued in Obenchain (1975), equation (2.8), the Normal-theory likelihood that a set of delta shrinkage factors (δ 1 , . . ., δ p ) has minimum MSE risk is identical with the likelihood that once this likelihood has been maximized by choice of the σ estimate of σ as well as by choice of all p of the ± signs.
In practical applications, equation (6.1) can be "too general".For example, Obenchain (1975), section 4.1, restricted interest to a 2−parameter family of shrinkage Paths that frequently fail to pass directly through the overall optimal set of shrinkage δ−factors of equation (4.1) when p > 2.
The new p−parameter "efficient" shrinkage path satisfying equation (2.2) and passing through the unrestricted βML = Gγ M L estimate defined by equation (5.1) is implemented by the eff.ridge()R-function, Obenchain (2023).Note that the "spat" TRACE plot of Figure 1 for this new Maximum Likelihood Path shows exactly why this Path is "as short as possible".After all, the shortest distance between 2 points is a straight line, so the shortest Path consists of a two-piece linear spline.This Path always starts with the OLS estimate at m = 0 and always ends with β ≡ 0 at m = p.In Figure 1, p = 4, and the "interior" knot defined by equation (5.1) occurs at m = 1.848.
Specifically, the Efficient Path starts with the OLS solution, β0 , where m = 0 and all δ−factors = 1.Similarly, this Path ends with β ≡ 0 at m = p because each δ−factor is 0 there.The most-likely shrunken β−estimates occur at the interior "Knot" that is clearly seen on the "spat" Trace.This solution on the efficient-path cannot be less likely under Normal-theory than the "best" solution on any other path.The shapes of traditional 1− or 2−parameter paths are actually predetermined almost exclusively by the eigen-values of the X−matrix, a disadvantage pointed out in Hoerl and Kennard (1975).
Since Figure 1 depicts the shrinkage δ ⋆ −factors that apply to all β−estimates for 0 ≤ m ≤ 4, note that the four δ M SE estimates are (0.9986, 0.0743, 0.9266, 0.1528) here.The second shrinkagefactor, δ ⋆ 2 (red short-dash line), is always smallest because it has the smallest δ M SE factor.Similarly, δ ⋆ 4 (blue long-dash line) is also small because δ M SE 4 is the 2 nd −smallest factor.Ultimately, the δ ⋆

3
(green dot-dash line) remains quite close to δ ⋆ 1 (black solid line) because their δ M SE −estimates are nearly equal and much larger than the other two.

TRACE Diagnostics
TRACEs are graphical aids that help users literally "see" most of the details needed to fully appreciate how and when shrinkage-estimators alleviate the effects of ill-conditioning on linear model coefficient estimates.Here, we will use a well-known benchmark dataset to perform calculations and plot TRACEs.
The Portland cement dataset of Woods, Steinour and Starke (1932) contains characteristics of n = 13 cement mixtures, where the y−outcome variable is heat (cals/gm) evolved during hardening.The p = 4 predictor x−variables recorded are "ingredient percentages" that appear to have been "rounded down" to full integers.Due to the small size (n = 13) and limited number of digits reported, this dataset has served as both a benchmark for accuracy of manual OLS computations and as an example where the "sign" of a fitted OLS coefficient differs from that of the correlation between y and the corresponding x−variable.
The sum of all n = 13 values for all four x−predictors varies from 95% to 99%.If these x−values had summed to exactly 100% for all 13 mixtures, the centered X-matrix would then be of rank = 3.In other words, this p = 4 regression model is rather clearly ill-conditioned in the sense of suffering an effective rank deficiency of at least m = 1.In fact, we will see that an m−extent of slightly less than 2 can be a more appropriate and realistic estimate of rank deficiency here.
The TRACE diagnostics displayed here in Figures 1 to 5 were generated using the plot() function for eff.ridge() objects in version 2.3 of the RXshrink package using steps = 100.Since calculations defining regularization paths are performed only on a lattice of m−extents from equation (3.1), each m−value is then a multiple of 1/100 = 0.01.The default setting for eff.ridge() uses steps = 8 to produce a unit change in m.The extra "detail" that results from steps = 100 shows that the curved portions of Figures 3, 4 and 5 are very smooth.
For the Portland cement data, the ML shrinkage-extent is m = 1.848.Thus a gray vertical dashed line is plotted at this point on the TRACEs displayed in Figures 1 to 5.
Note that Figure 2 also features a "wrong sign" correction to the 3 rd fitted coefficient (percentage of p4caf in the mix, green dot-dashed line).This 3 rd coefficient becomes negative past m = 0.75 to agree in sign with the marginal correlation (−0.5347) of p4caf with the heat y−outcome.
Note that he numerical signs and relative magnitudes of βm estimates in Figure 2 change between m = 0 and the "Interior Knot" at m = 1.848 on the Efficient Path.However, from that point onward to the Shrinkage Terminus at m = p, note that both the numerical signs and relative magnitudes of shrunken coefficients in their TRACE displays always remain perfectly stable.
All p−curves plotted in the efficient Coefficient TRACE of Figure 2, like the corresponding Shrinkage Pattern TRACE of Figure 1, are always two-piece linear splines.In sharp contrast, the three other types of TRACE plots need to contain "curved" lines to realistically depict the Non-Linear Effects of shrinkage on measures of MSE Risk and Normal-theory Likelihood Ratio statistics.
The relative MSE TRACE of Figure 3, the excess eigenvalue TRACE of Figure 4  risk estimates are given by the diagonal elements of the M SE/σ 2 matrix and are both particularly relevant and easy to interpret in Figure 3.While unbiased under Normal-theory, each estimated relative-risk is increased, if necessary, to assure it is at least as large as its relative-variance, δ2 j /λ j .Here, user interest rightly becomes focused upon the range 1.5≤m≤2.5where relative-risks are greatly reduced.
The eigen-values and vectors of the difference in risk matrices, {M SE(ols) − M SE(ridge)}, provide clear insights into key effects of ridge shrinkage.The good news is that at most one eigenvalue of this difference in MSE risks can be negative, Obenchain (1978).While an "inferior direction" corresponding to a negative estimated excess-eigenvalue does suddenly appear in Figure 5 at m≈1.8, the largest positive excess eigenvalue in Figure 4 is relatively gigantic (+50) at this same m−extent.The only negative excess-eigenvalue indicates a MSE risk increase due to shrinkage of at most 15.6 even at m = 4, while the concomitant decrease in MSE in a direction strictly orthogonal to this lone inferior direction exceeds +50.Thus, shrinkage along the path depicted in Figure 1 has clear potential for a net overall reduction in MSE risk.
Also note that the inferior-direction at the shrinkage terminus, m = 4, of Figure 5 ends up pointing almost directly "backwards" at the initial OLS β0 solution of Figure 2. The absolute value of the correlation between the two corresponding direction-cosine vectors is 0.988 here, and similar results would always be expected whenever the initial OLS solution is significantly different from the shrinkage terminus, βp ≡ 0, at m = p = 4.
Finally, Figure 6 displays a type of plot not available within the RXshrink R-package because it compares results for 2 different shrinkage paths for the Portland cement data.The horizontal axis is m−Extent of Shrinkage, while the vertical axis is −2log(Likelihood Ratio) over the finite range [0, 80].The eff.ridge() and qm.ridge() R-functions were invoked with "steps=100".In Figure 6, the solid-black curve shows how the Likelihood Ratio χ 2 for the efficient path swoops down from very large values all of the way to 0.0 at m = 1.848, then suddenly starts increasing and ultimately ends at 52.5 for m = 4.The solid blue curve that enters the top of Figure 6 at m ≈ 2 corresponds to the Likelihood Ratio χ 2 for the qm.ridge() path of most-likely q − Shape = −5 on the default mesh.This solid blue curve reaches its minimum of 26.4 at m = 2.1 [marked by a vertical red dot-dash line], then also increases to 52.5 at m = 4. Figure 6 thus confirms that the eff.ridge() path is fully efficient.After all, this path reduces the Normal-theory Negative Log Likelihood Ratio to Zero while using an equal or lesser m−Extent of shrinkage than all viable alternatives.Figure 6: Negative Log Likelihood Ratio curves for two different shrinkage-paths on the Portland cement data.The solid black curve shows how the −2 log Likelihood-Ratio under Normal-theory drops all of the way to 0 at m = 1.848 for the "Efficient" 4−parameter path.The solid blue curve depicts the corresponding LR for the 2−parameter path of "best" q −Shape = −5.This −2log(LR) reaches a minimum of 26.4 at m ≈ 2.1.Since the upper 99%−point of a χ 2 −variate with 2 degreesof-freedom is only 9.21, this "best" 2−parameter fit is significantly different from MSE-optimal.
Truly "favorable" cases of ridge shrinkage occur when an m−Extent greater than 1.0 is favored in two senses: (a) no "inferior direction" has yet appeared, and (b) the relative MSE risk over all p coefficients is optimized.Both indicators are clearly present in the TRACE plots of Figures 4, 5 and 3 for the Portland cement data benchmark.

Elliptical Confidence Regions for Pairs of Coefficients
One of the central ideas in generalized ridge regression is that unbiasedness is not a very important property in point estimation of regression β−coefficients.As shown in Figure 3, biased point estimates can have smaller Mean Squared Error risk than unbiased OLS estimators when shrinkage reduces variance by more than the concomident increase in squared bias.
On the other hand, unbiasedness plays a more central role in the Classical theory of set estimation (statistical inference) than it does in point estimation.For example, an unbiased set estimator will not cover any incorrect value with larger probability than the probability, 1 − α, of covering the correct value; biased set estimators will cover some incorrect values with probabilities greater than 1 − α.See Obenchain (1977).
Figure 7 was created using the eff.aug() and eff.biv() functions in version 2.3 of the RXshrink R-package to illustrate graphics useful in Classical statistical inference on the effects of shrinkage.Purely Bayesian inferences are quite different because they view shrinkage estimators as being unbiased relative to combined Sample and Prior information.  .Note that the change in sign of the "p4caf" coefficient occurs within the inner Ellipse of only 10% confidence.Furthermore, since the point where the "p3cs" and "p4caf" estimates are both Zero lies within the 90% confidence Ellipse, at least one of these two coefficients is not significantly different from Zero with confidence of 10% or less.Here, the "p4caf" coefficient that changes sign is not significant, while the "p3cs" coefficient is highly significant (***).See ANOVA Table 1.
9 Linear Models of Rank One or Two Shrinkage of "Simple" (p = 1) Linear Regression models (with R-code formula y ∼ x), where x is a non-constant vector, are performed by the YonX() R-function, Obenchain (2021).All statistics that are (p×1) vectors or (p×p) matrices when p > 1, are simply scalars in these p = 1 models, and the relative MSE risk is strictly quadratic in these cases.As a result, when the heat y-Outcome is regressed on only the percentage of p4caf in Portland cement mixtures, the unbiased OLS estimate of the β−coefficient, −1.256 cannot have the "wrong" numerical sign, and the "rmse" TRACE of Figure 8 is purely quadratic.Any MSE risk "inferior-direction" when p = 1 is simply "up and down."TRACE plots for p = 1 models are also augmented with a graphic like that of Figure 9 that shows the observed (x=p4caf, y=heat)−scatter plus three fitted lines that pass through (x, ȳ).The OLS fitted line is colored Blue [Best Linear Unbiased], the Minimum MSE Risk line (m = 0.161) is colored Purple, and the "Double Shrunken" line (m = 0.322) is colored Red.
Models with p = 2 non-constant X-variables, use generalized ridge shrinkage δ−factors that can be written in the form δ i = 1/(1+f i ) where f 1 and f 2 are both non-negative.For example, q −Shape paths, Goldstein and Smith (1974), are of this form and can pass through the ML coefficient-vector with MSE optimal risk.This requires for i = 1 and 2, where the γ i values are unknown but can easily be estimated.Note that the k-factor is a non-negative constant that starts at k = 0 and is steadily increased to produce shrinkage.The restrictions imposed via equation (9.1) suggest using a q−Shape of the form: Two invocations of the qm.ridge()R-function can be used to perform "best" MSE risk computations and display Traces for rank p = 2 models.An initial call of the form qm.ridge(y ∼ x1 + x2, data) uses the default argument of Q = "qmse" to make a simple mesh search for a reasonable q−Shape.But, when p = 2, the printed output also displays an estimate, Qbest, of q M SE from equation (9.2).A second call of the form qm.ridge(y ∼ x1 + x2, data, Q = Qbest) then uses this "best" smoothly curved path.Naturally, this path is slightly longer than the efficient path from the 0.0 0.2 0.4 0.6 0.8 1.0 0.000 0.002 0.004 0.006 0.008 0.010 0.012 Figure 8: The "rmse" TRACE for the "Simple" Linear Regression of heat onto p4caf shows a single quadratic curve with its minimum at m = 1 − δ M SE = 0.161.Furthermore, when the m-Extent of shrinkage is doubled to m = 0.322, note that the resulting biased β−estimate has the same Relative MSE Risk as the unbiased OLS solution at m = 0, which is also the Relative Variance of OLS.q q q q q q q q q q q q q 5 10 15 20

Updates and Corrections to Earlier Versions
A vast majority of the material included here (plus a §10 on "Bi-variate Distributions of Non-Linear Beta-Coefficient Estimates") was published in Open Statistics, Obenchain (2022).Since that journal is not currently being published, this latest revision (v5) of my 2021 arXiv paper provides some updates as well as a correction to the caption of Figure 7.
The default for the plot(ef f.ridge, trace, LP ) directive in version 2.3 of the RXshrink R−package, Obenchain (2023), is LP = 0 to display a TRACE without any "Legend".As shown below in Table 2, 9 positive integer values for the "LP" argument are recognized.When linear models are fit to ill-conditioned or confounded narrow-data, TRACE plots are useful in demonstrating and justifying deliberately biased estimation.This makes plots of TRACE diagnostics powerful "visual" displays.If advanced students of regression are trained in TRACE interpretation, they can help administrators capable of basic statistical thinking to avoid misinterpretations of questionable regression coefficient estimates.
All five types of ridge TRACE plots for a wide variety of ridge paths can be explored using existing R-functions.For example, the RXshrink aug.lars() function generates TRACE s for Least-Angle, Lasso and Forward Stagewise methods [Efron, Hastie, Johnstone and Tibshirani (2004), Hastie and Efron (2013)] when applied to narrow-data with p < (n − 3).These TRACE plots

Figure 1 :
Figure1: Efficient Shrinkage δ−factors for the Portland cement example.Note that this TRACE consists of p = 4 two-piece linear splines that all have their "interior Knot" at m = 1.848.These δ−factors are applied to the uncorrelated components vector, c, via equation (2.2) rather than directly to the OLS β−coefficients of equation (2.1).

Figure 2 :Figure 3 :
Figure2: Efficiently shrunken β−coefficient estimates for the Portland cement data.This TRACE also consists of p = 4 two-piece linear splines.The vertical dashed-line again marks the "Knot" at m = 1.848.Note the "numerical sign" correction to the green dot-dash "p4caf" coefficient on this TRACE when the m−extent of shrinkage exceeds m = 0.76.The LP = 7 (Legend Position) here is "Upper-Right".

Figure 4 :Figure 5 :
Figure4: Unrestricted Excess Eigenvalues for the Portland cement data.A positive Excess Eigenvalue emerges as shrikage occurs to the right of m = 0.0 and grows rapidly until shrinkage exceeds the MSE risk optimal extent, m = 1.848.The bad-news is that the single possible negative Excess Eigenvalue appears for m > 1.848 and slowly becomes more and more negative as shrinkage continues.The LP = 3 (Legend Position) here is "Lower-Left".

Figure 7 :
Figure7: Here are two unbiased Elliptical Confidence regions for the Second ("p3cs") and Third ("p4caf") regression coefficients in the p = 4 linear model with TRACEs depicted in Figures1 to 5. Note that the change in sign of the "p4caf" coefficient occurs within the inner Ellipse of only 10% confidence.Furthermore, since the point where the "p3cs" and "p4caf" estimates are both Zero lies within the 90% confidence Ellipse, at least one of these two coefficients is not significantly different from Zero with confidence of 10% or less.Here, the "p4caf" coefficient that changes sign is not significant, while the "p3cs" coefficient is highly significant (***).See ANOVA Table1.16

Figure 9 :
Figure9: Three different slopes for the "Simple" Linear Regression model (heat ∼ p4caf) are shown here superimposed upon the scatter of 13 Portland cement mixtures.Note that "shrinkage" always tends to make the Purple and Red fitted lines more horizontal than the Blue fit.

TABLE 2 -
Legend Positions Table Left Middle Right Upper