
Enhanced ridge regressions

Mathematical and Computer Modelling 51 (2010) 338–348. doi:10.1016/j.mcm.2009.12.028


Stan Lipovetsky, GfK Custom Research North America, 8401 Golden Valley Road, Minneapolis, MN 55427, United States

Article history: received 22 October 2009; accepted 21 December 2009.

Keywords: Least squares objective; Modified ridge regressions; Multicollinearity; Stable solutions

Abstract. With a simple transformation, the ordinary least squares objective can yield a family of modified ridge regressions which outperforms the regular ridge model. These models have more stable coefficients and a higher quality of fit as the profile parameter grows. With an additional adjustment based on minimization of the residual variance, all the characteristics become even better: the coefficients of these regressions do not shrink to zero when the ridge parameter increases, the coefficient of multiple determination stays high, while bias and generalized cross-validation stay low. In contrast to regular ridge regression, the modified ridge models yield robust solutions for a wide range of values of the ridge parameter, with interpretable coefficients and good quality characteristics.

1. Introduction

Ridge regression originated as a way to overcome the effects of multicollinearity in linear models, in the work of Hoerl [1–3] and Hoerl and Kennard [4,5]. Multicollinearity can make confidence intervals so wide that coefficients are incorrectly identified as insignificant, theoretically important variables receive negligible coefficients, or coefficients acquire signs opposite to those of the corresponding pair correlations, so that it becomes hardly possible to identify the importance of the individual predictors in the regression [6,7]. Ridge regression and its modifications have been developed in numerous works, for instance [8–16], and used in various applications, for example [17–20]. Among further innovations, regularization methods based on the quadratic L2-metric, the lasso L1-metric, and other Lp-metrics and their combinations have been considered [21–28]. Most modifications of the ridge model use the least squares objective with various added penalizing and regularizing terms to prevent inflation of the regression coefficients.

This paper can be considered a further development of the techniques suggested in [29,30]. It shows that, instead of a rather arbitrary insertion of a regularizing and penalizing term, it is possible to transform the ordinary least squares (OLS) objective itself so that it produces improved ridge solutions outperforming the regular ridge models. In contrast to the regular ridge (RR) model, the coefficients of the improved models and their quality of fit do not diminish to zero when the profile ridge parameter grows. This permits a high quality of fit and yields a multiple ridge model with coefficients of the same signs as the pair correlations of the dependent variable with the predictors, which facilitates interpretation of the individual regressors in the model. A special further adjustment of the model improves its quality and diminishes both the coefficients' bias and such a characteristic of the residual error as generalized cross-validation. Six modified ridge regressions are considered and compared with the regular ridge model.
One of these variants corresponds to the technique constructed under different assumptions in [29,30], but all the other models are newly developed and have better properties. The enhanced models surpass the regular ridge models, and the best of them is identified.

The paper is organized as follows. Section 2 describes the main features of ordinary least squares (OLS) and regular ridge (RR) regressions, and Section 3 introduces three enhanced ridge models. Section 4 considers a modification of each model obtained by adjusting it to the maximum quality of fit, and Section 5 describes some other characteristics of model quality. Section 6 presents numerical results, and Section 7 summarizes.

2. OLS and RR models

OLS regression. Consider some properties of the ordinary least squares (OLS) model needed for the further analysis. For standardized (centered and normalized by standard deviation) variables, a model of multiple linear regression is:

y_i = \beta_1 x_{i1} + \cdots + \beta_n x_{in} + \varepsilon_i,    (1)

where y_i and x_{ij} are the i-th observations (i = 1, ..., N) of the dependent variable y and of each j-th independent variable x_j (j = 1, ..., n), β_j are the theoretical beta-coefficients, and ε_i are the deviations of the observed y_i from the theoretical model. The least squares (LS) objective for estimation of the coefficients consists in minimizing the sum of squared deviations:

S^2 = \sum_{i=1}^{N} \varepsilon_i^2 = \sum_{i=1}^{N} (y_i - \beta_1 x_{i1} - \cdots - \beta_n x_{in})^2,    (2)

or, in matrix form,

S^2 = \|\varepsilon\|^2 = (y - X\beta)'(y - X\beta) = 1 - 2\beta' r + \beta' C \beta,    (3)

where X is the N by n matrix with elements x_{ij} of the observations of the independent variables, y is the N-th order vector-column of observations of the dependent variable, β is the n-th order vector-column of the beta-coefficient estimates, and ε is a vector of deviations. The prime in (3) denotes transposition, the variance of the standardized y equals y′y = 1, and the notations C and r correspond to the correlation matrix C = X′X among the x-s and the vector of correlations r = X′y between the x-s and y. The first-order condition ∂S²/∂β = 0 for minimization of (3) by the vector of coefficients yields a normal system of equations with the corresponding solution:

C \beta_{OLS} = r, \qquad \beta_{OLS} = C^{-1} r,    (4)

where the vector β_OLS denotes the OLS estimates defined via the inverse correlation matrix C^{-1}. The model quality is estimated by the residual sum of squares (3), or by the coefficient of multiple determination defined as:

R^2 = 1 - S^2 = 2\beta' r - \beta' C \beta = \beta'(2r - C\beta).    (5)

The minimum of the objective (3) corresponds to the equality Cβ_OLS = r of (4), with which the coefficient of multiple determination for OLS regression reaches its maximum and reduces to the following forms:

R^2(\beta_{OLS}) = \beta_{OLS}' r = \beta_{OLS}' C \beta_{OLS}.    (6)

If any x-s are highly correlated, or multicollinear, the matrix C in (4) becomes ill-conditioned, its determinant is close to zero, and with the inverse matrix C^{-1} the OLS solution (4) has vastly inflated values of the regression coefficients. These coefficients often have signs opposite to the matching pair correlations of the x-s with y. Such a regression can be used for prediction, but it is worthless for analysis and interpretation of the individual predictors' roles in the model.
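Since all the quantities in (3)-(6) are expressed through the correlation matrix C and the correlation vector r alone, the OLS solution is easy to sketch numerically. The following minimal Python/numpy fragment (the two-predictor correlation values are a hypothetical illustration, not the car data of Section 6) shows how strongly correlated predictors inflate the coefficients and can flip a sign against the corresponding pair correlation:

```python
import numpy as np

def ols_from_correlations(C, r):
    """OLS solution (4) and its R^2 (6) for standardized variables:
    C is the n-by-n correlation matrix of the predictors, r the vector
    of their correlations with the standardized response."""
    beta = np.linalg.solve(C, r)   # beta_OLS = C^{-1} r
    r2 = float(beta @ r)           # R^2(beta_OLS) = beta' r = beta' C beta
    return beta, r2

# Hypothetical example: two strongly correlated predictors. Multicollinearity
# inflates the coefficients and can turn one of them negative although both
# pair correlations with y are positive.
C = np.array([[1.0, 0.95],
              [0.95, 1.0]])
r = np.array([0.70, 0.62])
beta, r2 = ols_from_correlations(C, r)
print(beta, r2)
```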
Regular ridge (RR) regression. A regular ridge model is usually constructed by adding to the LS objective (3) a penalizing function of the squared norm of the vector of coefficients:

S^2 = \|\varepsilon\|^2 + k\|\beta_{RR}\|^2 = 1 - 2\beta_{RR}' r + \beta_{RR}' C \beta_{RR} + k\,\beta_{RR}' \beta_{RR},    (7)

where β_RR denotes the vector of the RR coefficient estimates, and k is the so-called "ridge profile" positive parameter. Minimizing this objective by the vector β_RR yields the following system of equations and its solution:

(C + kI)\beta_{RR} = r, \qquad \beta_{RR} = (C + kI)^{-1} r,    (8)

where I is the identity matrix of n-th order. The solution (8) exists even for a singular matrix of correlations C. For k = 0, the RR model (7)-(8) reduces to the OLS regression model (3)-(4). Multiplying Eq. (8) by β_RR′ yields the relation β_RR′Cβ_RR + kβ_RR′β_RR = β_RR′r, and using it in the expression (5) shows that the coefficient of multiple determination for the RR solution can be represented in several forms:

R^2(\beta_{RR}) = 2\beta_{RR}' r - \beta_{RR}' C \beta_{RR} = \beta_{RR}' r + k\beta_{RR}' \beta_{RR} = \beta_{RR}' C \beta_{RR} + 2k\beta_{RR}' \beta_{RR}.    (9)

The last two expressions in (9) show that an equality of the type (6) does not hold in this case. With the increase of the profile parameter k, the matrix C + kI in (8) approaches the scalar matrix kI, so the inverted matrix reduces to (C + kI)^{-1} ≈ k^{-1}I. Then the RR solution (8) and the coefficient of multiple determination (9) go asymptotically to the expressions:

\beta_{RR} = k^{-1} r, \qquad R^2(\beta_{RR}) = 2k^{-1} r'r - k^{-2} r'Cr.    (10)

Thus, the RR solution becomes proportional to the pair correlations and keeps their signs, which is convenient for interpretation of the predictors. On the other hand, this solution quickly approaches zero as k grows, and the quality of fit also reduces towards zero.
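A short sketch of the regular ridge (8)-(9), again on a hypothetical two-predictor correlation structure rather than the paper's data, makes the shrinking behavior (10) visible:

```python
import numpy as np

def ridge_regular(C, r, k):
    """Regular ridge solution (8) and its R^2 from the general form (5)."""
    n = len(r)
    beta = np.linalg.solve(C + k * np.eye(n), r)
    r2 = float(2 * beta @ r - beta @ C @ beta)
    return beta, r2

# Hypothetical correlations (not the car data of Section 6).
C = np.array([[1.0, 0.95],
              [0.95, 1.0]])
r = np.array([0.70, 0.62])
for k in (0.0, 0.5, 2.0, 10.0, 100.0):
    beta, r2 = ridge_regular(C, r, k)
    print(k, beta.round(4), round(r2, 4))
# As k grows, beta approaches r/k as in (10): the signs match the pair
# correlations, but both the coefficients and R^2 shrink towards zero.
```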
3. Enhanced ridge regressions

Consider several straightforward transformations of the LS objective which lead to a family of ridge regressions with better properties than the regular ridge model.

Ridge Enhanced 1 (RE1). Regrouping and squaring the sum of the terms in the LS objective (2) yields:

S^2 = \sum_{i=1}^{N} (y_i - \beta_1 x_{i1} - \cdots - \beta_n x_{in})^2 = \sum_{i=1}^{N} \left( \left(\tfrac{1}{n} y_i - \beta_1 x_{i1}\right) + \cdots + \left(\tfrac{1}{n} y_i - \beta_n x_{in}\right) \right)^2
    = \sum_{j=1}^{n} \sum_{i=1}^{N} \left(\tfrac{1}{n} y_i - \beta_j x_{ij}\right)^2 + 2 \sum_{j>k}^{n} \sum_{i=1}^{N} \left(\tfrac{1}{n} y_i - \beta_j x_{ij}\right)\left(\tfrac{1}{n} y_i - \beta_k x_{ik}\right).    (11)

So the LS objective for multiple OLS regression can be represented as the total of the paired regressions of the 1/n-th portion of y by each x_j separately, plus the cross-products of the residuals y_i/n − β_j x_{ij} from each two pair-wise regressions by x_j and x_k. Each j-th paired regression has a coefficient β_j = r_{yj}/n equal to the pair correlation of the quotient y/n with x_j. If we use a term g with the second part of the objective (11),

S^2 = \sum_{j=1}^{n} \sum_{i=1}^{N} \left(\tfrac{1}{n} y_i - \beta_j x_{ij}\right)^2 + g \cdot 2 \sum_{j>k}^{n} \sum_{i=1}^{N} \left(\tfrac{1}{n} y_i - \beta_j x_{ij}\right)\left(\tfrac{1}{n} y_i - \beta_k x_{ik}\right),    (12)

then for g = 0 this objective reduces to the total of the paired regressions, for g = 1 it coincides with the LS objective (2), and for intermediate g ranging from 0 to 1 it corresponds to models between the pair-wise and the multiple OLS regressions. The objective (12) can be represented as:

S^2 = g \cdot \sum_{i=1}^{N} (y_i - \beta_1 x_{i1} - \cdots - \beta_n x_{in})^2 + (1 - g) \cdot \sum_{j=1}^{n} \sum_{i=1}^{N} \left(\tfrac{1}{n} y_i - \beta_j x_{ij}\right)^2.    (13)

Indeed, using (11) in (13) returns it to (12). So the last two expressions are identical, but the latter one is more convenient for derivations. Let us divide (13) by g and use another parameter k = (1 − g)/g. For g = 1, or k = 0, the objective (13) coincides with the LS objective (2); if g diminishes to zero, or k grows, the objective (13) reduces to the total of the pair-wise objectives. Minimizing the objective (13) yields the system of equations:

\frac{\partial S^2}{\partial \beta_j} = -2 \sum_{i=1}^{N} (y_i - \beta_1 x_{i1} - \cdots - \beta_n x_{in}) x_{ij} - 2k \sum_{i=1}^{N} \left(\frac{y_i}{n} - \beta_j x_{ij}\right) x_{ij} = 0.    (14)

This system and its solution can be presented in matrix form as follows:

(C + kI)\beta_{RE1} = \left(1 + \frac{k}{n}\right) r, \qquad \beta_{RE1} = \left(1 + \frac{k}{n}\right)(C + kI)^{-1} r = \left(1 + \frac{k}{n}\right) \beta_{RR},    (15)

where C = X′X and r = X′y are the correlations among the x-s and between the x-s and y, β_RE1 denotes the vector of estimates of the enhanced ridge RE1 model in the approach (13)-(14), and it is taken into account that the variance of a standardized x_j equals x_j′x_j = 1. Using the solution (15) in (5) yields the coefficient of multiple determination for the RE1 model:

R^2(\beta_{RE1}) = 2\left(1 + \frac{k}{n}\right) r'(C + kI)^{-1} r - \left(1 + \frac{k}{n}\right)^2 r'(C + kI)^{-2} C r.    (16)

The results (15)-(16) are similar to those of the regular ridge (8), with one exception: a new term 1 + k/n enters the β_RE1 solution, which is proportional to the regular ridge vector β_RR. It seems a minor modification of the regular ridge, especially for small k with 1 + k/n close to one. However, this term leads to a very noticeable enhancement of the ridge regression results. For a large k, the inverted matrix is (C + kI)^{-1} ≈ k^{-1}I, so the RE1 solution (15) and its quality of fit (16) go to the following non-zero asymptotes, respectively:

\beta_{RE1} = \left(\frac{1}{k} + \frac{1}{n}\right) r \approx \frac{1}{n} r, \qquad R^2(\beta_{RE1}) = \frac{2}{n} r'r - \frac{1}{n^2} r'Cr.    (17)

Thus, in contrast to the regular ridge (10), the enhanced solution does not diminish to zero with increasing k but reaches the stable levels (17). So it is possible to increase k until the multiple regression attains interpretable coefficients proportional to the pair correlations of the x-s with y, with a high quality of fit.

Ridge Enhanced 2 (RE2). Consider a more general partitioning of the terms in the objective (11), where in place of the equal shares 1/n different fractions p_j of y are used with each x_j. Then, in place of the objective (13), its generalization becomes:

S^2 = \sum_{i=1}^{N} (y_i - \beta_1 x_{i1} - \cdots - \beta_n x_{in})^2 + k \cdot \sum_{j=1}^{n} \sum_{i=1}^{N} (p_j y_i - \beta_j x_{ij})^2,    (18)

with the parameter k as used in (14). Minimizing (18) by the unknown coefficients of regression β_j and by the unknown fractions p_j yields:

\frac{\partial S^2}{\partial \beta_j} = -2 \sum_{i=1}^{N} (y_i - \beta_1 x_{i1} - \cdots - \beta_n x_{in}) x_{ij} - 2k \sum_{i=1}^{N} (p_j y_i - \beta_j x_{ij}) x_{ij} = 0,
\qquad \frac{\partial S^2}{\partial p_j} = 2k \sum_{i=1}^{N} (p_j y_i - \beta_j x_{ij}) y_i = 0.    (19)

This system can be represented in matrix form as:

-r + C\beta - k(\mathrm{diag}(r)\, p - \beta) = 0, \qquad p - \mathrm{diag}(r)\beta = 0,    (20)

where diag(r) is the diagonal matrix with the elements of the pair correlations r_{yj} between y and x_j, and p is the vector of fractions p_j. Substituting p from the second expression of (20) into the first one yields the following equation and its solution:

\left(C + kI - k \cdot \mathrm{diag}(r^2)\right) \beta_{RE2} = r, \qquad \beta_{RE2} = \left(C + kI - k \cdot \mathrm{diag}(r^2)\right)^{-1} r,    (21)

where β_RE2 denotes the vector of estimates of the enhanced ridge RE2 model, with two terms depending on k in its matrix, and diag(r²) is the diagonal matrix of the squared pair correlations of the x-s with y. The obtained solution (21) is similar to the regular ridge (8); the only difference consists in using the diagonal matrix I − diag(r²) in (21) in place of the identity matrix I.
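Both enhanced estimators are one-line modifications of the regular ridge computation. A hedged numpy sketch of (15) and (21), again on a hypothetical correlation input, might look as follows:

```python
import numpy as np

def ridge_re1(C, r, k):
    """Enhanced ridge RE1 (15): the regular ridge vector scaled by (1 + k/n)."""
    n = len(r)
    return (1.0 + k / n) * np.linalg.solve(C + k * np.eye(n), r)

def ridge_re2(C, r, k):
    """Enhanced ridge RE2 (21): identity replaced by I - diag(r^2)."""
    n = len(r)
    M = C + k * (np.eye(n) - np.diag(r ** 2))
    return np.linalg.solve(M, r)

def r_squared(beta, C, r):
    """Coefficient of multiple determination (5) for any estimate."""
    return float(2 * beta @ r - beta @ C @ beta)

# Hypothetical correlations, not the car data.
C = np.array([[1.0, 0.95],
              [0.95, 1.0]])
r = np.array([0.70, 0.62])
for k in (1.0, 10.0, 100.0):
    b1, b2 = ridge_re1(C, r, k), ridge_re2(C, r, k)
    print(k, b1.round(4), round(r_squared(b1, C, r), 4),
             b2.round(4), round(r_squared(b2, C, r), 4))
# RE1 stabilizes near r/n as in (17), while RE2 still shrinks like 1/k as in (22).
```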
The coefficient of multiple determination R²(β_RE2) for the solution (21) can be constructed similarly to (9), and various numerical runs show that the enhanced RE2 regression always outperforms the regular RR regression in this characteristic of fit quality. As in the RR model (10), with increasing k the matrix in (21) reduces to the diagonal matrix k(I − diag(r²)), so the inverted matrix becomes k^{-1} diag(1/(1 − r²)). Then it is easy to show that the RE2 solution (21) and its coefficient of multiple determination go to the asymptotic levels:

\beta_{RE2} = \frac{1}{k} Dr, \qquad R^2(\beta_{RE2}) = \frac{2}{k} r'Dr - \frac{1}{k^2} r'DCDr,    (22)

where the notation D = diag(1/(1 − r²)) is used. Similar to the RR behavior (10), the RE2 solution (22) attains the signs of the pair correlations, but the coefficients and the quality of fit diminish to zero as k grows.

Ridge Enhanced 3 (RE3). Another generalized partitioning of the terms in the objective (11), with different fractions p_j restricted so that their total equals one, can be expressed by the conditional objective:

S^2 = \sum_{i=1}^{N} (y_i - \beta_1 x_{i1} - \cdots - \beta_n x_{in})^2 + k \cdot \sum_{j=1}^{n} \sum_{i=1}^{N} (p_j y_i - \beta_j x_{ij})^2 - 2\lambda \left( \sum_{j=1}^{n} p_j - 1 \right),    (23)

where λ is a Lagrange term. The objective is similar to (18) but contains the normalizing relation Σ p_j = 1. Minimization of (23) by β_j produces the same first system of equations as in (19), and minimization by p_j yields the second system of (19) with the additional term λ. The total system can be represented in matrix form as:

-r + C\beta - k(\mathrm{diag}(r)\, p - \beta) = 0, \qquad k(p - \mathrm{diag}(r)\beta) - \lambda e = 0,    (24)

where e is the n-th order identity vector (the vector of ones), and all the other notations are the same as in (20). Multiplying the second equation of (24) by e′ and taking into account the normalization e′p = 1 and the relation e′e = n, we obtain the term λ = (1 − r′β)(k/n). Using the latter expression in (24) and substituting p from the second of these equations into the first one yields the following system and its solution:

\left( C + kI - k \cdot \mathrm{diag}(r^2) + \frac{k}{n} rr' \right) \beta_{RE3} = \left(1 + \frac{k}{n}\right) r,    (25a)

\beta_{RE3} = \left(1 + \frac{k}{n}\right) \left( C + kI - k \cdot \mathrm{diag}(r^2) + \frac{k}{n} rr' \right)^{-1} r.    (25b)

This is the enhanced RE3 solution, with three terms containing k in its matrix. In comparison with RE2 (21), the matrix in (25) has one additional term (k/n)rr′ proportional to the outer product rr′ of the correlation vector. But the term 1 + k/n in the solution (25b) makes its behavior more similar to that of the RE1 (15) model than to the RE2 (21) model. Indeed, with increasing k the term C in (25) becomes negligible in comparison with the other terms, so inversion of the remaining part with the help of the well-known Sherman–Morrison formula (see [31]) can be reduced to:

\frac{1}{k} \left( I - \mathrm{diag}(r^2) + \frac{r}{\sqrt{n}} \left(\frac{r}{\sqrt{n}}\right)' \right)^{-1} = \frac{1}{k} \left( D - \frac{Drr'D}{n + r'Dr} \right),    (26)

where D = diag(1/(1 − r²)) denotes the same diagonal matrix as used in (22). Using this inverted matrix in (25b) for large k leads to the following solution and the corresponding coefficient of multiple determination:

\beta_{RE3} = \left(\frac{1}{k} + \frac{1}{n}\right) \left( Dr - \frac{r'Dr}{n + r'Dr} Dr \right) \approx \frac{1}{n + r'Dr} Dr,    (27a)

R^2(\beta_{RE3}) = \frac{2\, r'Dr}{n + r'Dr} - \frac{r'DCDr}{(n + r'Dr)^2}.    (27b)

So, in contrast to the asymptotic behavior of RR (10) and RE2 (22), but similar to the RE1 (17) model, the enhanced RE3 solution and quality of fit converge to the asymptotic levels (27), which are independent of the ridge parameter k. This means that, without losing much in the quality of fit when k increases, the RE3 model can produce interpretable coefficients of multiple regression with the signs of the pair correlations of the x-s with y.
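A sketch of the RE3 estimate (25) can also verify numerically that it approaches the k-independent limit (27a); the correlation values below are again only illustrative:

```python
import numpy as np

def ridge_re3(C, r, k):
    """Enhanced ridge RE3 (25): matrix with the extra outer-product term (k/n) r r'."""
    n = len(r)
    M = C + k * np.eye(n) - k * np.diag(r ** 2) + (k / n) * np.outer(r, r)
    return (1.0 + k / n) * np.linalg.solve(M, r)

# Hypothetical correlations, not the car data.
C = np.array([[1.0, 0.95],
              [0.95, 1.0]])
r = np.array([0.70, 0.62])
n = len(r)
D = np.diag(1.0 / (1.0 - r ** 2))
asymptote = D @ r / (n + r @ D @ r)    # limit (27a), independent of k
for k in (1.0, 10.0, 1000.0):
    print(k, ridge_re3(C, r, k).round(4))
print("limit", asymptote.round(4))
```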
4. Adjustment to the best fit

The next step in constructing the enhanced ridge models consists in the following procedure. Any of the ridge solutions β, whether the regular ridge RR (8) or the ridge estimates RE1 (15), RE2 (21), and RE3 (25), can be improved by adjusting it to the maximum possible quality of fit, which is estimated by the residual sum of squares S² (proportional to the residual variance), or by the convenient characteristic of the coefficient of multiple determination R² = 1 − S² (5), which ranges from zero for the worst models to one for the best ones. For any given solution β, consider a proportionally modified (adjusted) vector:

\beta_{adj} = q\beta,    (28)

and substitute it into the general expression (5) for the coefficient of multiple determination:

R^2(\beta_{adj}) = 2q\, r'\beta - q^2 \beta' C \beta.    (29)

This is a concave quadratic function of the unknown parameter q, and it reaches its maximum at the point between its roots, at the value:

q = (r'\beta)/(\beta' C \beta).    (30)

Using (30) in (28)-(29) yields the adjusted solution and the maximum of the coefficient of multiple determination attainable with such an adjusted solution:

\beta_{adj} = \frac{r'\beta}{\beta' C \beta} \cdot \beta, \qquad R^2(\beta_{adj}) = \frac{(r'\beta)^2}{\beta' C \beta}.    (31)

This new adjusted solution β_adj is easy to find, and it produces the maximum fit for a given vector β at any value of the ridge parameter k. The coefficient of multiple determination in (31) can be presented in the two following equivalent forms:

R^2(\beta_{adj}) = \frac{r'\beta}{\beta' C \beta} (r'\beta) = q(r'\beta) = r'\beta_{adj},    (32a)

R^2(\beta_{adj}) = \left( \frac{r'\beta}{\beta' C \beta} \right)^2 (\beta' C \beta) = q^2 (\beta' C \beta) = \beta_{adj}' C \beta_{adj}.    (32b)

This interesting result shows that the equality R²(β_adj) = r′β_adj = β_adj′Cβ_adj holds for any adjusted solution, similarly to the OLS model (6). Let us consider explicitly the adjusted solutions for the ridge models above.

Regular Ridge adjusted (RR.adj). For the regular ridge solution (8) the coefficient of adjustment (30) is:

q_{RR} = \frac{r'\beta_{RR}}{\beta_{RR}' C \beta_{RR}} = \frac{r'(C + kI)^{-1} r}{r'(C + kI)^{-2} C r},    (33)

because the matrices C and (C + kI)^{-1} commute. The adjusted solution is defined as:

\beta_{RR.adj} = \frac{r'(C + kI)^{-1} r}{r'(C + kI)^{-2} C r}\, (C + kI)^{-1} r,    (34)

and with it the coefficient of multiple determination can be found by (31) or (32). In the limit of large k, as was considered for (10), the matrix inversion is (C + kI)^{-1} ≈ k^{-1}I, so the coefficient (33) becomes proportional to the ridge parameter:

q_{RR} = k\, \frac{r'r}{r'Cr}.    (35)

Then the adjusted solution converges to the asymptote:

\beta_{RR.adj} = k\, \frac{r'r}{r'Cr} \cdot k^{-1} r = \frac{r'r}{r'Cr} \cdot r,    (36)

which does not depend on k, so it does not reduce to zero as the RR solution (10) does. The coefficient of multiple determination and its simplification for large k are as follows:

R^2(\beta_{RR.adj}) = \frac{\left(r'(C + kI)^{-1} r\right)^2}{r'(C + kI)^{-2} C r} \approx \frac{(r'r)^2}{r'Cr},    (37)

so it approaches a constant independent of k, and the quality of fit does not decrease steeply as in the case of the regular ridge model (10). The results (33)-(37) for the ridge regression were obtained in [29,30].
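The adjustment (28)-(31) is just a rescaling by the scalar q of (30), so it can be wrapped in a small helper and applied, for example, to the regular ridge solution to reproduce the stabilization (36)-(37); the numpy sketch below uses hypothetical correlations:

```python
import numpy as np

def adjust(beta, C, r):
    """Best-fit adjustment (28)-(31): rescale an estimate by q = r'b / b'Cb."""
    q = float(r @ beta) / float(beta @ C @ beta)
    beta_adj = q * beta
    r2_adj = float(r @ beta) ** 2 / float(beta @ C @ beta)  # maximum attainable R^2 (31)
    return beta_adj, r2_adj

# Hypothetical correlations, not the car data.
C = np.array([[1.0, 0.95],
              [0.95, 1.0]])
r = np.array([0.70, 0.62])
for k in (1.0, 10.0, 100.0):
    beta_rr = np.linalg.solve(C + k * np.eye(len(r)), r)    # regular ridge (8)
    beta_adj, r2_adj = adjust(beta_rr, C, r)
    print(k, beta_adj.round(4), round(r2_adj, 4))
# The adjusted vector approaches (r'r / r'Cr) r as in (36) and R^2 approaches
# (r'r)^2 / r'Cr as in (37), both independent of k, instead of collapsing to zero.
```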
Ridge Enhanced 1 adjusted (RE1.adj). For the ridge enhanced solution RE1 (15), the coefficient of adjustment (30) equals:

q_{RE1} = \frac{r'\beta_{RE1}}{\beta_{RE1}' C \beta_{RE1}} = \frac{1}{1 + k/n} \cdot \frac{r'(C + kI)^{-1} r}{r'(C + kI)^{-2} C r} = \frac{q_{RR}}{1 + k/n},    (38)

which is proportional to the coefficient (33). The adjusted solution (15) becomes:

\beta_{RE1.adj} = q_{RE1} \left(1 + \frac{k}{n}\right) \beta_{RR} = q_{RR}\, \beta_{RR} = \beta_{RR.adj},    (39)

so it coincides with the solution (34). Then the asymptotic behavior of the RE1.adj solution agrees with that of RR.adj (36), and also R²(β_RE1.adj) = R²(β_RR.adj) as given in (37). Thus, with the help of the adjustment, the behavior of the regular ridge (10) is significantly improved and made stable, similarly to the enhanced solution (17); and both models RR.adj and RE1.adj have an even higher level of fit due to the attained maximum coefficient of multiple determination over a large span of increasing k.

Ridge Enhanced 2 adjusted (RE2.adj). The behavior of the next enhanced ridge model RE2 (21), with its undesirable features (22), can also be drastically improved by the adjustment. Indeed, the coefficient of adjustment (30) for this model is:

q_{RE2} = \frac{r'\beta_{RE2}}{\beta_{RE2}' C \beta_{RE2}} = \frac{r'\left(C + k(I - \mathrm{diag}(r^2))\right)^{-1} r}{r'\left(C + k(I - \mathrm{diag}(r^2))\right)^{-1} C \left(C + k(I - \mathrm{diag}(r^2))\right)^{-1} r}.    (40)

With the coefficient (40) the adjusted solution becomes:

\beta_{RE2.adj} = \frac{r'\left(C + k(I - \mathrm{diag}(r^2))\right)^{-1} r}{r'\left(C + k(I - \mathrm{diag}(r^2))\right)^{-1} C \left(C + k(I - \mathrm{diag}(r^2))\right)^{-1} r} \cdot \left(C + k(I - \mathrm{diag}(r^2))\right)^{-1} r,    (41)

and the coefficient of multiple determination can be found by (31)-(32). For large k, as was considered for (22), the inverted matrix equals k^{-1} diag(1/(1 − r²)), so the coefficient (40) is proportional to the ridge parameter:

q_{RE2} = k\, \frac{r'Dr}{r'DCDr},    (42)

with the diagonal matrix D = diag(1/(1 − r²)) used in (22). Then the adjusted solution converges to the asymptote:

\beta_{RE2.adj} = k\, \frac{r'Dr}{r'DCDr} \cdot k^{-1} Dr = \frac{r'Dr}{r'DCDr} \cdot Dr.    (43)

The coefficient of multiple determination and its simplification for large k are as follows:

R^2(\beta_{RE2.adj}) = \frac{\left( r'\left(C + k(I - \mathrm{diag}(r^2))\right)^{-1} r \right)^2}{r'\left(C + k(I - \mathrm{diag}(r^2))\right)^{-1} C \left(C + k(I - \mathrm{diag}(r^2))\right)^{-1} r} \approx \frac{(r'Dr)^2}{r'DCDr}.    (44)

In striking contrast to (22), the adjusted solution (43) and the quality of fit (44) do not depend on k for its large values.

Ridge Enhanced 3 adjusted (RE3.adj). The model of the enhanced ridge RE3 (25) already has good features (27), but its quality of fit can nevertheless be amplified by the adjustment procedure. The coefficient of adjustment (30) for this model is:

q_{RE3} = \frac{r'\left( C + k\left(I - \mathrm{diag}(r^2) + \frac{rr'}{n}\right) \right)^{-1} r}{\left(1 + \frac{k}{n}\right) r'\left( C + k\left(I - \mathrm{diag}(r^2) + \frac{rr'}{n}\right) \right)^{-1} C \left( C + k\left(I - \mathrm{diag}(r^2) + \frac{rr'}{n}\right) \right)^{-1} r}.    (45)

The solution (25b) adjusted by the coefficient (45) is:

\beta_{RE3.adj} = \frac{r'\left( C + k\left(I - \mathrm{diag}(r^2) + \frac{rr'}{n}\right) \right)^{-1} r}{r'\left( C + k\left(I - \mathrm{diag}(r^2) + \frac{rr'}{n}\right) \right)^{-1} C \left( C + k\left(I - \mathrm{diag}(r^2) + \frac{rr'}{n}\right) \right)^{-1} r} \cdot \left( C + k\left(I - \mathrm{diag}(r^2) + \frac{rr'}{n}\right) \right)^{-1} r,    (46)

and again the corresponding coefficient of multiple determination can be estimated by (31) or (32). For large k, the matrix inversion in (45)-(46) can be taken by the formula (26), so the coefficient (45) reduces to the constant:

q_{RE3} = \frac{(r'Dr)(n + r'Dr)}{r'DCDr},    (47)

with the same diagonal matrix D as in (22) and (42). The adjusted solution in the limit of large k converges to:

\beta_{RE3.adj} = \left(1 + \frac{k}{n}\right) \frac{(r'Dr)(n + r'Dr)}{r'DCDr} \cdot k^{-1} \frac{n}{n + r'Dr}\, Dr \approx \frac{r'Dr}{r'DCDr} \cdot Dr,    (48)

which is the same solution (43) as for the previous model. So, for growing k, the coefficient of multiple determination eventually reduces to the last expression (44).
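The coincidence of the large-k limits (43) and (48) is easy to confirm numerically; a brief sketch with hypothetical correlations:

```python
import numpy as np

def adjust(beta, C, r):
    """Best-fit rescaling (30)-(31)."""
    return (float(r @ beta) / float(beta @ C @ beta)) * beta

def ridge_re2(C, r, k):
    n = len(r)
    return np.linalg.solve(C + k * (np.eye(n) - np.diag(r ** 2)), r)

def ridge_re3(C, r, k):
    n = len(r)
    M = C + k * np.eye(n) - k * np.diag(r ** 2) + (k / n) * np.outer(r, r)
    return (1.0 + k / n) * np.linalg.solve(M, r)

# Hypothetical correlations, not the car data.
C = np.array([[1.0, 0.95],
              [0.95, 1.0]])
r = np.array([0.70, 0.62])
D = np.diag(1.0 / (1.0 - r ** 2))
limit = (r @ D @ r) / (r @ D @ C @ D @ r) * (D @ r)   # common asymptote (43), (48)
k = 1e6
print(adjust(ridge_re2(C, r, k), C, r).round(5))
print(adjust(ridge_re3(C, r, k), C, r).round(5))
print(limit.round(5))
```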
To summarize, the family of seven ridge-regression solutions includes: the regular ridge RR (8); the enhanced models RE1 (15), RE2 (21), and RE3 (25); and their adjusted versions, namely the coinciding models RR.adj (34) and RE1.adj (39) (denoted further as one solution, RR&RE1.adj), then RE2.adj (41), and RE3.adj (46). Besides the classical RR and the previously considered RR&RE1.adj, all the other models are newly developed.

5. Several other characteristics of ridge models

Besides the coefficients of the ridge regressions and their multiple determination considered above, let us describe several other characteristics of the obtained models. For this aim, the notation A will be used for the matrix operator in the transformation β̃ = Ar of the vector of pair correlations r into the vector of beta-coefficients β̃, where the tilde denotes estimates obtained by any of the considered ridge techniques. For instance, in the RR model (8) this transformation is fulfilled by the matrix A = (C + kI)^{-1} for the vector β̃ = β_RR; in RE1 (15) by the matrix A = (1 + k/n)(C + kI)^{-1} for the vector β̃ = β_RE1, and so on; and in RE3.adj by the whole complicated structure shown in (46) for the vector β̃ = β_RE3.adj. The effective number of parameters for a ridge model is defined as:

n^{*} = \mathrm{Tr}(AC),    (49)

where Tr denotes the trace of a matrix, that is, the total of its diagonal elements. For k = 0 the value (49) coincides with the total number n of the predictors, and with a larger k the value n* diminishes. The residual error variance equals:

S^2_{res} = (1 - R^2)/(N - n^{*} - 1),    (50)

where R² is the coefficient of multiple determination of any of the models considered above. The mathematical expectation of an estimated vector of parameters is:

M(\tilde{\beta}) = A \cdot M(r) = AX'M(y) = AX'M(X\beta + \varepsilon) = AX'X\beta = AC\beta,    (51)

with the errors' expectation M(ε) = 0, and β denoting the theoretical vector of coefficients. If the matrix product AC does not equal the identity matrix I, then the ridge solution is biased. Actually, any ridge solution is biased, and only for k = 0 does it reduce to the unbiased OLS solution. A convenient measure of the bias can be built as the squared norm of the difference between the matrix in (51) and the identity matrix, divided by the number of parameters:

\mathrm{Bias}(\tilde{\beta}) = \frac{1}{n} \|AC - I\|^2.    (52)

The lower the total bias, the closer the measure (52) is to zero; for instance, for the OLS solution it equals exactly zero. The covariance matrix of the parameters' estimates can be written as:

\mathrm{cov}(\tilde{\beta}) = S^2_{res}\, AX'XA,    (53)

with the residual error variance estimated by (50). The trace of the matrix (53) can be used as the efficiency, or the total variance of the estimated coefficients:

\mathrm{Var}(\tilde{\beta}) = S^2_{res}\, \mathrm{Tr}(AX'XA) = S^2_{res}\, \mathrm{Tr}(A^2 C).    (54)

[Fig. 1. Ridge profile of quality characteristics: (a) coefficient of multiple determination; (b) bias of estimates; (c) efficiency of estimates; (d) generalized cross-validation.]

Another measure of residual variance is the generalized cross-validation criterion well known in regression modeling [32–34], which can be presented as follows:

GCV = N\, \frac{\|(I - H)y\|^2}{\left(\mathrm{Tr}(I - H)\right)^2} = N\, \frac{1 - R^2}{\left(N - \mathrm{Tr}(AC)\right)^2},    (55)

where the hat-matrix H corresponds to the projection of the empirical vector of the dependent variable onto the theoretical one, Hy = Xβ̃ = XAX′y. In (55) it is also taken into account that the squared norm of the residuals y − Hy can be expressed via the coefficient of multiple determination for each type of ridge model, and that Tr(XAX′) = Tr(AX′X).
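Because every characteristic in (49)-(55) depends only on the operator A, the correlations C and r, and the sample size N, they can be collected in one helper. The sketch below evaluates them for the regular ridge operator on hypothetical correlations, with N = 105 taken from the example of Section 6:

```python
import numpy as np

def ridge_diagnostics(A, C, r, N):
    """Quality characteristics (49)-(55) for an estimate beta = A r on standardized data."""
    n = C.shape[0]
    beta = A @ r
    r2 = float(2 * beta @ r - beta @ C @ beta)         # coefficient of determination (5)
    n_eff = float(np.trace(A @ C))                     # effective number of parameters (49)
    s2_res = (1.0 - r2) / (N - n_eff - 1.0)            # residual error variance (50)
    bias = np.linalg.norm(A @ C - np.eye(n)) ** 2 / n  # bias measure (52)
    variance = s2_res * np.trace(A @ C @ A)            # total variance of the estimates (54)
    gcv = N * (1.0 - r2) / (N - n_eff) ** 2            # generalized cross-validation (55)
    return dict(R2=r2, n_eff=n_eff, bias=bias, variance=variance, GCV=gcv)

# Hypothetical correlations; N = 105 matches the size of the car data set.
C = np.array([[1.0, 0.95],
              [0.95, 1.0]])
r = np.array([0.70, 0.62])
N = 105
for k in (0.0, 2.0, 6.0):
    A = np.linalg.inv(C + k * np.eye(2))               # operator A of the regular ridge (8)
    print(k, {key: round(val, 4) for key, val in ridge_diagnostics(A, C, r, N).items()})
```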
6. Numerical example

Consider a numerical example with the ridge models described above, traced by the profile parameter k. The data present various cars' characteristics given in [35] and also available in [36] (as the "car.all" data). The data describe dimensions and mechanical specifications of 105 cars, supplied by the manufacturers and measured by Consumer Reports. The variables are: y, the price of a car in US$ thousands; x1, weight (pounds); x2, overall length (inches); x3, wheel base length (inches); x4, width (inches); x5, maximum front leg room (inches); x6, front shoulder room (inches); x7, turning circle radius (feet); x8, engine displacement (cubic inches); x9, net horsepower; x10, tank fuel refill capacity (gallons). The cars' price is estimated in the regression model by the dimension and specification variables, which can help to find better design solutions.

Table 1
Correlations, OLS and ridge regressions.

k = 2:
          ryx     RR      RE1     RE2     RE3     RR&RE1.adj  RE2.adj  RE3.adj
x1        0.653   0.085   0.102   0.106   0.114   0.125       0.115    0.125
x2        0.533   0.056   0.067   0.057   0.061   0.083       0.062    0.067
x3        0.496   0.035   0.042   0.012   0.013   0.051       0.013    0.015
x4        0.478   0.025   0.030  -0.001  -0.003   0.037      -0.003   -0.001
x5        0.567   0.124   0.149   0.160   0.172   0.183       0.174    0.189
x6        0.371   0.003   0.004  -0.010  -0.015   0.005      -0.016   -0.020
x7        0.378   0.002   0.002  -0.030  -0.031   0.002      -0.031   -0.030
x8        0.642   0.083   0.099   0.085   0.092   0.122       0.093    0.101
x9        0.783   0.160   0.192   0.333   0.359   0.236       0.363    0.394
x10       0.657   0.092   0.110   0.114   0.123   0.135       0.125    0.135
R2                0.562   0.605   0.667   0.679   0.627       0.680    0.686
R2/R2OLS          0.778   0.839   0.923   0.940   0.869       0.942    0.951

k = 6:
          OLS     RR      RE1     RE2     RE3     RR&RE1.adj  RE2.adj  RE3.adj
x1        0.278   0.053   0.085   0.081   0.102   0.112       0.105    0.116
x2        0.225   0.039   0.063   0.043   0.054   0.082       0.056    0.062
x3       -0.085   0.032   0.052   0.026   0.033   0.068       0.034    0.038
x4       -0.144   0.029   0.046   0.018   0.023   0.060       0.024    0.026
x5        0.245   0.063   0.101   0.099   0.125   0.132       0.129    0.143
x6       -0.060   0.017   0.027   0.005   0.006   0.036       0.006    0.007
x7       -0.199   0.017   0.028   0.002   0.003   0.037       0.003    0.003
x8        0.101   0.053   0.084   0.075   0.094   0.111       0.097    0.107
x9        0.409   0.082   0.131   0.224   0.284   0.171       0.293    0.323
x10       0.160   0.056   0.090   0.088   0.112   0.118       0.116    0.128
R2        0.722   0.409   0.532   0.579   0.633   0.564       0.637    0.645
R2/R2OLS          0.567   0.737   0.802   0.877   0.781       0.883    0.894

Fig. 1 shows several main characteristics of the regression quality traced by the k parameter for all seven considered models: the regular ridge RR (8); the enhanced models RE1 (15), RE2 (21), RE3 (25); and the adjusted models RR&RE1.adj (coinciding RR.adj (34) and RE1.adj (39)), RE2.adj (41), and RE3.adj (46). Fig. 1 consists of: (a) the coefficient of multiple determination R2 for each model; (b) the bias of the estimates (52); (c) the efficiency of the estimates (54) (the logarithm of the variance is shown); and (d) the generalized cross-validation (55). All curves start at the OLS solution (4), corresponding to k = 0 in the ridge models. Graph (a) shows that the regular ridge RR has the worst R2 behavior, while the enhanced models are better, and the adjusted models have the most stable R2 values as k increases. The enhanced RE3, and especially its adjusted version RE3.adj, are the best of all the models. The graphs (b), (c), and (d) of Fig. 1 support this conclusion, showing that all the other models perform between the RR and RE3.adj ridge models. For this reason, the models RR and RE3.adj are chosen for presenting all ten beta-coefficients in Fig. 2.
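The figures themselves are not reproduced here, but the profile behavior they report can be emulated on a small hypothetical correlation structure standing in for the car data. The sketch below contrasts the R2 profiles of the regular ridge (8) and of the RE3 model (25) followed by the adjustment (30):

```python
import numpy as np

def r_squared(beta, C, r):
    return float(2 * beta @ r - beta @ C @ beta)

def beta_rr(C, r, k):                                   # regular ridge (8)
    return np.linalg.solve(C + k * np.eye(len(r)), r)

def beta_re3_adj(C, r, k):                              # RE3 (25) followed by adjustment (30)
    n = len(r)
    M = C + k * np.eye(n) - k * np.diag(r ** 2) + (k / n) * np.outer(r, r)
    b = (1.0 + k / n) * np.linalg.solve(M, r)
    return (float(r @ b) / float(b @ C @ b)) * b

# Hypothetical correlation structure with strong collinearity among predictors,
# standing in for the car data of Table 1 (not reproduced here).
n = 5
C = 0.85 * np.ones((n, n)) + 0.15 * np.eye(n)
r = np.array([0.65, 0.55, 0.50, 0.60, 0.70])
for k in (0.0, 1.0, 2.0, 6.0, 20.0):
    print(k,
          round(r_squared(beta_rr(C, r, k), C, r), 3),
          round(r_squared(beta_re3_adj(C, r, k), C, r), 3))
# The regular ridge R^2 keeps falling as k grows, while the adjusted RE3 R^2
# levels off, mirroring the profile behaviour reported in Fig. 1 for the car data.
```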
As discussed above, the RR solution reaches the zero level (10) for all the estimates, while the RE3.adj solution (48) reaches stable, constant levels. All the other coefficients behave within the range of these two models. Several coefficients (beta 3, 4, 6, and 7) are negative at the origin corresponding to the OLS regression, but with increasing k they become positive, as the pair correlations are. So we can always find a solution in which all the coefficients are interpretable and the quality of fit is high.

[Fig. 2. Ridge profile for RR and RE3.adj solutions.]

Table 1 presents more results. Its first numerical column contains, in the upper part, the pair correlations ryx of y with the x-s, and below them, in the lower part, the beta-coefficients of the OLS regression (4). All correlations are positive, but because of multicollinearity the variables x3, x4, x6, and x7 have negative coefficients in the multiple OLS regression, although it has a good coefficient of multiple determination R2OLS = 0.722. The next seven columns of Table 1 present, in their upper part, all the considered ridge models for the parameter k = 2, and below them the same models for the value k = 6. Below each model, its coefficient of multiple determination is shown, together with the quotient of this coefficient to its value for the OLS model, R2/R2OLS. The regular ridge RR in the upper part of Table 1 has all positive coefficients and R2 = 0.562; the enhanced models outperform it, and the adjusted models have the best quality of fit. All the ridge models in the lower part of Table 1 have positive coefficients and a high quality of fit. The values of R2/R2OLS for the two regular ridge RR models are 56.7% and 77.8%, while for the best RE3 adjusted model these values are 89.4% and 95.1%. The RE3 adjusted model systematically demonstrates the best characteristics and suggests interpretable coefficients of regression. This data set had also been used for comparison across several other regularization techniques in [26], and the current results are among the best regressions. The discussed results are very typical and have been observed with different data sets.

7. Summary

A modified least squares objective is used to produce a family of new ridge regressions with enhanced properties. These models are additionally adjusted to attain the best possible quality of fit. Together with the regular ridge regression, six newly developed models are described and compared. Each of them outperforms the regular ridge model, in contrast to which the enhanced and adjusted ridge solutions have a stabilized profile behavior and a better quality of fit. The enhanced models are less biased, yet efficient, and retain other helpful features of ridge regression. The results of the enhanced ridge models are stable and easily interpretable. Judging by the theoretical features and the numerical validation, the best of the enhanced models is the RE3 adjusted ridge regression. The suggested approach is useful for theoretical consideration and practical applications of regression modeling and analysis.

References

[1] A.E. Hoerl, Optimal solution of many variables equations, Chemical Engineering Progress 55 (1959) 69–78.
[2] A.E. Hoerl, Application of ridge analysis to regression problems, Chemical Engineering Progress 58 (1962) 54–59.
[3] A.E. Hoerl, Ridge analysis, Chemical Engineering Progress Symposium Series 60 (1964) 69–78.
[4] A.E. Hoerl, R.W. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970) 55–67 (reprinted in 42 (2000) 80–86).
[5] A.E. Hoerl, R.W. Kennard, Ridge regression: Iterative estimation of the biasing parameter, Communications in Statistics, Part A 5 (1976) 77–88.
[6] A. Grapentine, Managing multicollinearity, Marketing Research 9 (1997) 11–21.
[7] C.H. Mason, W.D. Perreault, Collinearity, power, and interpretation of multiple regression analysis, Journal of Marketing Research 28 (1991) 268–280.
[8] M. Aldrin, Multivariate prediction using softly shrunk reduced-rank regression, The American Statistician 54 (2000) 29–34.
[9] P.J. Brown, Measurement, Regression and Calibration, Oxford University Press, Oxford, 1994.
[10] G. Casella, Condition numbers and minimax ridge regression estimators, Journal of the American Statistical Association 80 (1985) 753–758.
[11] G. Diderrich, The Kalman filter from the perspective of Goldberger–Theil estimators, The American Statistician 39 (1985) 193–198.
[12] N.R. Draper, A.M. Herzberg, A ridge-regression sidelight, The American Statistician 41 (1987) 282–283.
[13] R.W. Hoerl, Ridge analysis 25 years later, The American Statistician 39 (1985) 186–192.
[14] D.R. Jensen, D.E. Ramirez, Anomalies in the foundations of ridge regression, International Statistical Review 76 (2008) 89–105.
[15] D.W. Marquardt, R.D. Snee, Ridge regression in practice, The American Statistician 29 (1975) 3–20.
[16] Y. Maruyama, W.E. Strawderman, A new class of generalized Bayes minimax ridge regression estimators, The Annals of Statistics 33 (2005) 1753–1770.
[17] G.M. Erickson, Using ridge regression to estimate directly lagged effects in marketing, Journal of the American Statistical Association 76 (1981) 766–773.
[18] E.C. Malthouse, Ridge regression and direct marketing scoring models, Journal of Interactive Marketing 13 (1999) 10–23.
[19] K.B. Newman, J. Rice, Modeling the survival of Chinook salmon smolts outmigrating through the lower Sacramento river system, Journal of the American Statistical Association 97 (2002) 983–993.
[20] E. Vago, S. Kemeny, Logistic ridge regression for clinical data analysis (a case study), Applied Ecology and Environmental Research 4 (2006) 171–179.
[21] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, The Annals of Statistics 32 (2004) 407–489.
[22] C. Fraley, T. Hesterberg, Least angle regression and LASSO for large datasets, Statistical Analysis and Data Mining 1 (2009) 251–259.
[23] G.M. James, P. Radchenko, J. Lv, DASSO: Connections between the Dantzig selector and lasso, Journal of the Royal Statistical Society, Series B 71 (2009) 127–142.
[24] S. Lipovetsky, Optimal Lp-metric for minimizing powered deviations in regression, Journal of Modern Applied Statistical Methods 6 (2007) 219–227.
[25] S. Lipovetsky, Equidistant regression modeling, Model Assisted Statistics and Applications 2 (2007) 71–80.
[26] S. Lipovetsky, Linear regression with special coefficient features attained via parameterization in exponential, logistic, and multinomial-logit forms, Mathematical and Computer Modelling 49 (2009) 1427–1435.
[27] L. Meier, S. van de Geer, P. Buhlmann, The group lasso for logistic regression, Journal of the Royal Statistical Society, Series B 70 (2008) 53–71.
[28] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B 58 (1996) 267–288.
[29] S. Lipovetsky, M. Conklin, Ridge regression in two parameter solution, Applied Stochastic Models in Business and Industry 21 (2005) 525–540.
[30] S. Lipovetsky, Two-parameter ridge regression and its convergence to the eventual pairwise model, Mathematical and Computer Modelling 44 (2006) 304–318.
[31] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes: The Art of Scientific Computing, 3rd ed., Cambridge University Press, New York, 2007.
[32] P. Craven, G. Wahba, Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation, Numerische Mathematik 31 (1979) 317–403.
[33] G.H. Golub, M. Heath, G. Wahba, Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics 21 (1979) 215–223.
[34] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, 2001.
[35] J.M. Chambers, T.J. Hastie, Statistical Models in S, Wadsworth and Brooks, Pacific Grove, CA, 1992.
[36] S-PLUS'2000, MathSoft, Seattle, WA, 1999.