Lesson 5: Regression Shrinkage Methods

Motivation: too many predictors

  • It is not unusual to see the number of input variables greatly exceed the number of observations, e.g. microarray data analysis, environmental pollution studies.

  • With many predictors, fitting the full model without penalization will result in large prediction intervals, and the LS regression estimator may not exist uniquely.

Motivation: ill-conditioned X

  • Because the LS estimates depend upon \((X'X)^{-1}\), we would have problems in computing \(\beta_{LS}\) if \(X'X\) were singular or nearly singular.

  • In those cases, small changes to the elements of \(X\) lead to large changes in \((X'X)^{-1}\).

  • The least squares estimator \(\beta_{LS}\) may provide a good fit to the training data, but it will not fit the test data sufficiently well.

Ridge Regression:

One way out of this situation is to abandon the requirement of an unbiased estimator.

We assume only that X's and Y have been centered so that we have no need for a constant term in the regression:

  • X is an n by p matrix with centered columns,
  • Y is a centered n-vector.

Hoerl and Kennard (1970) proposed that the potential instability in the LS estimator

\begin{equation*}
\hat{\beta} = (X'X)^{-1} X' Y,
\end{equation*}

could be alleviated by adding a small positive constant \( \lambda \) to the diagonal entries of the matrix \(X'X\) before taking its inverse.

The result is the ridge regression estimator

\begin{equation*}
\hat{\beta}_{ridge} = (X'X+\lambda I_p)^{-1} X' Y
\end{equation*}

Ridge regression places a particular form of constraint on the parameters \( \left(\beta\text{'s}\right)\): \(\hat{\beta}_{ridge}\) is chosen to minimize the penalized sum of squares:

\begin{equation*}
\sum_{i=1}^n (y_i - \sum_{j=1}^p x_{ij}\beta_j)^2 + \lambda \sum_{j=1}^p \beta_j^2
\end{equation*}

which is equivalent to minimization of \(\sum_{i=1}^n (y_i - \sum_{j=1}^p x_{ij}\beta_j)^2\) subject to, for some \(c>0\), \(\sum_{j=1}^p \beta_j^2 < c\), i.e. constraining the sum of the squared coefficients.

Therefore, ridge regression puts further constraints on the parameters, the \(\beta_j\)'s, in the linear model. In this case, instead of minimizing only the residual sum of squares, we also have a penalty term on the \(\beta\)'s. This penalty term is \(\lambda\) (a pre-chosen constant) times the squared norm of the \(\beta\) vector. This means that if the \(\beta_j\)'s take on large values, the objective function is penalized. We would therefore prefer smaller \(\beta_j\)'s, or \(\beta_j\)'s close to zero, which keep the penalty term small.
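As a concrete illustration, here is a minimal NumPy sketch on simulated data (the variable names, the simulated design, and the grid of \( \lambda \) values are arbitrary choices, not part of the lesson): it computes the closed-form ridge estimator and shows the coefficients shrinking as \( \lambda \) grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                       # center the columns of X
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)
y -= y.mean()                             # center Y, so no intercept is needed

def ridge(X, y, lam):
    """Closed-form ridge estimator (X'X + lambda*I)^{-1} X'Y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# lambda = 0 recovers ordinary least squares; larger lambda shrinks the
# coefficients (and their squared norm) toward zero.
for lam in (0.0, 1.0, 10.0, 100.0):
    b = ridge(X, y, lam)
    print(f"lambda = {lam:6.1f}  beta_hat = {np.round(b, 3)}  ||beta||^2 = {b @ b:.3f}")
```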

Geometric Interpretation of Ridge Regression:

[Figure: contours of the RSS (ellipses) and the ridge constraint region (circle) in the \((\beta_1, \beta_2)\) plane]


The ellipses correspond to the contours of the residual sum of squares (RSS): the inner ellipse has smaller RSS, and the RSS is minimized at the ordinary least squares (OLS) estimates.

For \(p=2\), the constraint in ridge regression corresponds to a circle, \(\sum_{j=1}^p \beta_j^2 < c\).

In ridge regression, we want the RSS (the ellipse) to be as small as possible while the coefficients stay inside the circle. The ridge estimate is given by the point at which the ellipse and the circle touch.

There is a trade-off between the penalty term and the RSS. A larger \(\beta\) might give you a smaller residual sum of squares, but it will push the penalty term higher. This is why you might actually prefer smaller \(\beta\)'s with a worse residual sum of squares. From an optimization perspective, the penalty term is equivalent to a constraint on the \(\beta\)'s: the objective is still the residual sum of squares, but now you constrain the norm of the \(\beta_j\)'s to be smaller than some constant c. There is a correspondence between \(\lambda\) and c. The larger \(\lambda\) is, the more you prefer \(\beta_j\)'s close to zero. In the extreme case when \(\lambda = 0\), you are simply doing ordinary linear regression. At the other extreme, as \(\lambda\) approaches infinity, all the \(\beta\)'s are driven to zero.

Properties of Ridge Estimator:

  • \(\hat{\beta}_{ls}\) is an unbiased estimator of \(\beta\); \(\hat{\beta}_{ridge}\) is a biased estimator of \(\beta\).

For orthogonal covariates, \(X'X=n I_p\), \(\hat{\beta}_{ridge} = \dfrac{n}{n+\lambda} \hat{\beta}_{ls}\). Hence, in this case, the ridge estimator always produces shrinkage towards \(0\). \(\lambda\) controls the amount of shrinkage.
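To see where this factor comes from, substitute \(X'X = nI_p\) into the ridge formula:

\begin{equation*}
\hat{\beta}_{ridge} = (X'X+\lambda I_p)^{-1} X'Y = \big((n+\lambda)I_p\big)^{-1} X'Y = \dfrac{n}{n+\lambda}\,(nI_p)^{-1}X'Y = \dfrac{n}{n+\lambda}\,\hat{\beta}_{ls}.
\end{equation*}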

An important concept in shrinkage is the "effective" degrees of freedom associated with a set of parameters. In a ridge regression setting:

  1. If we choose \(\lambda=0\), we have \(p\) parameters (since there is no penalization).
  2. If \(\lambda\) is large, the parameters are heavily constrained and the degrees of freedom will effectively be lower, tending to \(0\) as \(\lambda\rightarrow \infty\).

The effective degrees of freedom associated with \(\beta_1, \beta_2, \ldots, \beta_p\) is defined as
\begin{equation*}
df(\lambda) = tr(X(X'X+\lambda I_p)^{-1}X') = \sum_{j=1}^p \dfrac{d_j^2}{d_j^2+\lambda},
\end{equation*}
where \(d_j\) are the singular values of \(X\). Notice that \(\lambda = 0\), which corresponds to no shrinkage, gives \(df(\lambda) = p\) (as long as \(X'X\) is non-singular), as we would expect.

There is a 1:1 mapping between \( \lambda \) and the degrees of freedom, so in practice one may simply pick the effective degrees of freedom that one would like associated with the fit, and solve for \( \lambda \).
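The sketch below illustrates both points (a NumPy/SciPy sketch; the simulated design matrix, the target degrees of freedom, and the bracketing interval are arbitrary choices, and `brentq` is just one convenient root finder for the monotone map \(\lambda \mapsto df(\lambda)\)):

```python
import numpy as np
from scipy.optimize import brentq

def df_ridge(lam, d):
    """Effective degrees of freedom: sum_j d_j^2 / (d_j^2 + lambda)."""
    return float(np.sum(d**2 / (d**2 + lam)))

# Singular values of a simulated centered design matrix
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
X -= X.mean(axis=0)
d = np.linalg.svd(X, compute_uv=False)

print(df_ridge(0.0, d))                     # equals p = 5 when lambda = 0

# Invert the monotone map lambda -> df(lambda) to hit a target, say df = 3
target = 3.0
lam = brentq(lambda t: df_ridge(t, d) - target, 1e-12, 1e12)
print(lam, df_ridge(lam, d))                # df(lam) equals 3 up to numerical error
```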

  • As an alternative to a user-chosen \( \lambda \), cross-validation is often used to choose \( \lambda \): we select the \( \lambda \) that yields the smallest cross-validation prediction error (a minimal sketch of this procedure follows this list).
  • The intercept \(\beta_0\) has been left out of the penalty term because \( Y \) has been centered. Penalization of the intercept would make the procedure depend on the origin chosen for \( Y \).

  • Since the ridge estimator is linear in \( Y \), it is straightforward to calculate the variance-covariance matrix \(var(\hat{\beta}_{ridge}) = \sigma^2 (X'X+\lambda I_p)^{-1} X'X (X'X+\lambda I_p)^{-1}\).
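As referenced in the list above, here is a minimal sketch of choosing \( \lambda \) by K-fold cross-validation (NumPy only; the fold count, the grid of \( \lambda \) values, and the simulated data are all arbitrary choices, not part of the lesson):

```python
import numpy as np

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, K=5, seed=0):
    """Average squared prediction error over K folds for a given lambda."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        b = ridge(X[train], y[train], lam)
        errs.append(np.mean((y[test] - X[test] @ b) ** 2))
    return np.mean(errs)

# Simulated centered data and a grid of candidate lambda values
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 8)); X -= X.mean(axis=0)
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=60); y -= y.mean()
grid = 10.0 ** np.arange(-3, 4)
best = min(grid, key=lambda lam: cv_error(X, y, lam))
print("lambda chosen by 5-fold CV:", best)
```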

A Bayesian Formulation

Consider the linear regression model with normal errors:

\begin{equation*}
Y_i = \sum_{j=1}^p X_{ij}\beta_j + \epsilon_i
\end{equation*}
where the \(\epsilon_i\) are i.i.d. normal errors with mean 0 and known variance \(\sigma^2\).

Since \( \lambda \) is applied to the squared norm of the β vector, people often standardize all of the covariates to make them have a similar scale. Assume \( \beta_j \) has the prior distribution \( \beta_j \sim_{iid} N(0,\sigma^2/\lambda)\). A large value of \( \lambda \) corresponds to a prior that is more tightly concentrated around zero and hence leads to greater shrinkage towards zero.

The posterior is \(\beta|Y \sim N(\hat{\beta}, \sigma^2 (X'X+\lambda I_p)^{-1})\), where \(\hat{\beta} = \hat{\beta}_{ridge} = (X'X+\lambda I_p)^{-1} X' Y\), confirming that the posterior mean (and mode) of the Bayesian linear model corresponds to the ridge regression estimator.
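To see why (a standard completing-the-square argument, using only the model and prior above), combine the likelihood and the prior:

\begin{equation*}
p(\beta \mid Y) \;\propto\; \exp\Big(-\tfrac{1}{2\sigma^2}\|Y - X\beta\|^2\Big)\exp\Big(-\tfrac{\lambda}{2\sigma^2}\|\beta\|^2\Big) \;\propto\; \exp\Big(-\tfrac{1}{2\sigma^2}\big(\beta'(X'X+\lambda I_p)\beta - 2\beta'X'Y\big)\Big),
\end{equation*}

which is the kernel of a normal density with mean \((X'X+\lambda I_p)^{-1}X'Y = \hat{\beta}_{ridge}\) and covariance \(\sigma^2(X'X+\lambda I_p)^{-1}\). Maximizing this posterior is the same as minimizing the penalized sum of squares, so the posterior mode is again the ridge estimator.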

Whereas the least squares solution \(\hat{\beta}_{ls} = (X'X)^{-1} X' Y\) is unbiased if the model is correctly specified, the ridge solution is biased: \(E(\hat{\beta}_{ridge}) \neq \beta\). However, at the cost of this bias, ridge regression reduces the variance, and thus might reduce the mean squared error (MSE):

\begin{equation*}
MSE = Bias^2 + Variance
\end{equation*}
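The bias-variance trade-off can be checked by simulation. The sketch below is purely illustrative (the design, the true \( \beta \), the noise level, and \( \lambda \) are arbitrary choices): it repeatedly regenerates the noise and estimates the squared bias, total variance, and MSE of the LS and ridge estimators.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma, lam = 30, 10, 2.0, 5.0
X = rng.normal(size=(n, p)); X -= X.mean(axis=0)
beta = rng.normal(scale=0.5, size=p)           # fixed "true" coefficients

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

est_ls, est_ridge = [], []
for _ in range(2000):                          # repeated draws of the noise
    y = X @ beta + rng.normal(scale=sigma, size=n)
    est_ls.append(ridge(X, y, 0.0))
    est_ridge.append(ridge(X, y, lam))

def bias2_var_mse(estimates):
    est = np.array(estimates)
    bias2 = np.sum((est.mean(axis=0) - beta) ** 2)   # squared bias of beta_hat
    var = np.sum(est.var(axis=0))                    # total variance of beta_hat
    return round(bias2, 3), round(var, 3), round(bias2 + var, 3)

print("LS    (bias^2, var, MSE):", bias2_var_mse(est_ls))
print("ridge (bias^2, var, MSE):", bias2_var_mse(est_ridge))
# Ridge trades a small bias for a larger drop in variance, so its MSE
# is typically smaller than that of LS for a suitable lambda.
```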

More Geometric Interpretations (optional)

  • Inputs are centered first.
  • Consider the fitted response:

\( \begin{align} \hat{y} &=\textbf{X}\hat{\beta}^{ridge}\\
& = \textbf{X}(\textbf{X}^{T}\textbf{X} + \lambda\textbf{I})^{-1}\textbf{X}^{T}\textbf{y}\\
& = \textbf{U}\textbf{D}(\textbf{D}^2 +\lambda\textbf{I})^{-1}\textbf{D}\textbf{U}^{T}\textbf{y}\\
& = \sum_{j=1}^{p}\textbf{u}_j \dfrac{d_{j}^{2}}{d_{j}^{2}+\lambda}\textbf{u}_{j}^{T}\textbf{y},
\end{align} \)

where \(\textbf{X} = \textbf{U}\textbf{D}\textbf{V}^{T}\) is the singular value decomposition of X and the \(\textbf{u}_j\) (the columns of U) are the normalized principal components of X.

  • Ridge regression shrinks the coordinates with respect to the orthonormal basis formed by the principal components.
  • Coordinates with respect to principal components with smaller variance are shrunk more.
  • Instead of using X = (X1, X2, ... , Xp) as the predicting variables, use the new input matrix \(\tilde{X}\) = UD.
  • Then for the new inputs:

\(\hat{\beta}_{j}^{ridge}=\dfrac{d_{j}}{d_{j}^{2}+\lambda}\textbf{u}_{j}^{T}\textbf{y}, \qquad Var(\hat{\beta}_{j})=\dfrac{\sigma^2}{d_{j}^{2}},\)

where \(\sigma^2\) is the variance of the error term \(\epsilon\) in the linear model.

  • The shrinkage factor given by ridge regression is

\(\dfrac{d_{j}^{2}}{d_{j}^{2}+\lambda}.\)

We saw this factor in the formula for \(\hat{y}\) above. The larger \(\lambda\) is, the more the projection onto the direction \(\textbf{u}_j\) is shrunk. Coordinates with respect to the principal components with a smaller variance are shrunk more.
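Continuing the sketch style used above, the shrinkage factors can be read directly off the singular values of the centered design matrix (the simulated matrix and the value of \( \lambda \) below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 5))
X -= X.mean(axis=0)                        # centered design matrix
lam = 10.0

# Singular value decomposition X = U D V'
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)               # shrinkage factor for each direction u_j
for dj, sj in zip(d, shrink):
    print(f"d_j = {dj:6.2f}   d_j^2/(d_j^2 + lambda) = {sj:.3f}")
# Directions with small singular values (low-variance principal components)
# get factors closer to 0, i.e. they are shrunk the most.
```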

Let's take a look at this geometrically.

[Figure: shrinkage of the coordinates along the principal-component directions]

This interpretation will become convenient when we compare it to principal components regression, where instead of gradual shrinkage each direction is either kept as is or shrunk all the way to zero (dropped). We will see this in the "Dimension Reduction Methods" lesson.
