Lesson 5: Regression Shrinkage Methods

Motivation: too many predictors

  • It is not unusual to see the number of input variables greatly exceed the number of observations, e.g. microarray data analysis, environmental pollution studies.

  • With many predictors, fitting the full model without penalization will result in large prediction intervals, and the LS regression estimator may not exist uniquely.

Motivation: ill-conditioned X

  • Because the LS estimates depend upon \((X'X)^{-1}\), we would have problems in computing \(\beta_{LS}\) if \(X'X\) were singular or nearly singular.

  • In those cases, small changes to the elements of \(X\) lead to large changes in \((X'X)^{-1}\).

  • The least squares estimator \(\beta_{LS}\) may provide a good fit to the training data, but it will not fit the test data sufficiently well.

Ridge Regression:

One way out of this situation is to abandon the requirement of an unbiased estimator.

We assume only that X's and Y have been centered so that we have no need for a constant term in the regression:

  • X is an n by p matrix with centered columns,
  • Y is a centered n-vector.

Hoerl and Kennard (1970) proposed that the potential instability in the LS estimator

\begin{equation*}
\hat{\beta} = (X'X)^{-1} X' Y,
\end{equation*}

could be alleviated by adding a small positive constant \( \lambda \) to the diagonal entries of the matrix \(X'X\) before taking its inverse.

The result is the ridge regression estimator

\begin{equation*}
\hat{\beta}_{ridge} = (X'X+\lambda I_p)^{-1} X' Y
\end{equation*}

Ridge regression places a particular form of constraint on the parameters \( \left(\beta\text{'s}\right)\): \(\hat{\beta}_{ridge}\) is chosen to minimize the penalized sum of squares:

\begin{equation*}
\sum_{i=1}^n (y_i - \sum_{j=1}^p x_{ij}\beta_j)^2 + \lambda \sum_{j=1}^p \beta_j^2
\end{equation*}

which is equivalent to minimization of \(\sum_{i=1}^n (y_i - \sum_{j=1}^p x_{ij}\beta_j)^2\) subject to, for some \(c>0\), \(\sum_{j=1}^p \beta_j^2 < c\), i.e. constraining the sum of the squared coefficients.

Therefore, ridge regression puts further constraints on the parameters, the \(\beta_j\)'s, in the linear model. In this case, instead of minimizing only the residual sum of squares, we also have a penalty term on the \(\beta\)'s. This penalty term is \(\lambda\) (a pre-chosen constant) times the squared norm of the \(\beta\) vector. This means that if the \(\beta_j\)'s take on large values, the objective function is penalized. We would therefore prefer smaller \(\beta_j\)'s, or \(\beta_j\)'s close to zero, which keep the penalty term small.
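As a concrete illustration, here is a minimal NumPy sketch on simulated data (the variable names, the simulated design, and the grid of \( \lambda \) values are arbitrary choices, not part of the lesson): it computes the closed-form ridge estimator and shows the coefficients shrinking as \( \lambda \) grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                       # center the columns of X
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)
y -= y.mean()                             # center Y, so no intercept is needed

def ridge(X, y, lam):
    """Closed-form ridge estimator (X'X + lambda*I)^{-1} X'Y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# lambda = 0 recovers ordinary least squares; larger lambda shrinks the
# coefficients (and their squared norm) toward zero.
for lam in (0.0, 1.0, 10.0, 100.0):
    b = ridge(X, y, lam)
    print(f"lambda = {lam:6.1f}  beta_hat = {np.round(b, 3)}  ||beta||^2 = {b @ b:.3f}")
```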

Geometric Interpretation of Ridge Regression:

[Figure: contours of the RSS (ellipses) and the ridge constraint region (circle) in the \((\beta_1, \beta_2)\) plane]


The ellipses correspond to the contours of the residual sum of squares (RSS): the inner ellipse has smaller RSS, and the RSS is minimized at the ordinary least squares (OLS) estimates.

For \(p=2\), the constraint in ridge regression corresponds to a circle, \(\sum_{j=1}^p \beta_j^2 < c\).

In ridge regression, we want the RSS (the ellipse) to be as small as possible while the coefficients stay inside the circle. The ridge estimate is given by the point at which the ellipse and the circle touch.

There is a trade-off between the penalty term and the RSS. A larger \(\beta\) might give you a smaller residual sum of squares, but it will push the penalty term higher. This is why you might actually prefer smaller \(\beta\)'s with a worse residual sum of squares. From an optimization perspective, the penalty term is equivalent to a constraint on the \(\beta\)'s: the objective is still the residual sum of squares, but now you constrain the norm of the \(\beta_j\)'s to be smaller than some constant c. There is a correspondence between \(\lambda\) and c. The larger \(\lambda\) is, the more you prefer \(\beta_j\)'s close to zero. In the extreme case when \(\lambda = 0\), you are simply doing ordinary linear regression. At the other extreme, as \(\lambda\) approaches infinity, all the \(\beta\)'s are driven to zero.

Properties of Ridge Estimator:

  • \(\hat{\beta}_{ls}\) is an unbiased estimator of \(\beta\); \(\hat{\beta}_{ridge}\) is a biased estimator of \(\beta\).

For orthogonal covariates, \(X'X=n I_p\), \(\hat{\beta}_{ridge} = \dfrac{n}{n+\lambda} \hat{\beta}_{ls}\). Hence, in this case, the ridge estimator always produces shrinkage towards \(0\). \(\lambda\) controls the amount of shrinkage.
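To see where this factor comes from, substitute \(X'X = nI_p\) into the ridge formula:

\begin{equation*}
\hat{\beta}_{ridge} = (X'X+\lambda I_p)^{-1} X'Y = \big((n+\lambda)I_p\big)^{-1} X'Y = \dfrac{n}{n+\lambda}\,(nI_p)^{-1}X'Y = \dfrac{n}{n+\lambda}\,\hat{\beta}_{ls}.
\end{equation*}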

An important concept in shrinkage is the "effective" degrees of freedom associated with a set of parameters. In a ridge regression setting:

  1. If we choose \(\lambda=0\), we have \(p\) parameters (since there is no penalization).
  2. If \(\lambda\) is large, the parameters are heavily constrained and the degrees of freedom will effectively be lower, tending to \(0\) as \(\lambda\rightarrow \infty\).

The effective degrees of freedom associated with \(\beta_1, \beta_2, \ldots, \beta_p\) is defined as
\begin{equation*}
df(\lambda) = tr(X(X'X+\lambda I_p)^{-1}X') = \sum_{j=1}^p \dfrac{d_j^2}{d_j^2+\lambda},
\end{equation*}
where \(d_j\) are the singular values of \(X\). Notice that \(\lambda = 0\), which corresponds to no shrinkage, gives \(df(\lambda) = p\) (as long as \(X'X\) is non-singular), as we would expect.

There is a 1:1 mapping between \( \lambda \) and the degrees of freedom, so in practice one may simply pick the effective degrees of freedom that one would like associated with the fit, and solve for \( \lambda \).
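The sketch below illustrates both points (a NumPy/SciPy sketch; the simulated design matrix, the target degrees of freedom, and the bracketing interval are arbitrary choices, and `brentq` is just one convenient root finder for the monotone map \(\lambda \mapsto df(\lambda)\)):

```python
import numpy as np
from scipy.optimize import brentq

def df_ridge(lam, d):
    """Effective degrees of freedom: sum_j d_j^2 / (d_j^2 + lambda)."""
    return float(np.sum(d**2 / (d**2 + lam)))

# Singular values of a simulated centered design matrix
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
X -= X.mean(axis=0)
d = np.linalg.svd(X, compute_uv=False)

print(df_ridge(0.0, d))                     # equals p = 5 when lambda = 0

# Invert the monotone map lambda -> df(lambda) to hit a target, say df = 3
target = 3.0
lam = brentq(lambda t: df_ridge(t, d) - target, 1e-12, 1e12)
print(lam, df_ridge(lam, d))                # df(lam) equals 3 up to numerical error
```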

  • As an alternative to a user-chosen \( \lambda \), cross-validation is often used to choose \( \lambda \): we select the \( \lambda \) that yields the smallest cross-validation prediction error (a minimal sketch of this procedure follows this list).
  • The intercept \(\beta_0\) has been left out of the penalty term because \( Y \) has been centered. Penalization of the intercept would make the procedure depend on the origin chosen for \( Y \).

  • Since the ridge estimator is linear in \( Y \), it is straightforward to calculate the variance-covariance matrix \(var(\hat{\beta}_{ridge}) = \sigma^2 (X'X+\lambda I_p)^{-1} X'X (X'X+\lambda I_p)^{-1}\).
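As referenced in the list above, here is a minimal sketch of choosing \( \lambda \) by K-fold cross-validation (NumPy only; the fold count, the grid of \( \lambda \) values, and the simulated data are all arbitrary choices, not part of the lesson):

```python
import numpy as np

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, K=5, seed=0):
    """Average squared prediction error over K folds for a given lambda."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        b = ridge(X[train], y[train], lam)
        errs.append(np.mean((y[test] - X[test] @ b) ** 2))
    return np.mean(errs)

# Simulated centered data and a grid of candidate lambda values
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 8)); X -= X.mean(axis=0)
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=60); y -= y.mean()
grid = 10.0 ** np.arange(-3, 4)
best = min(grid, key=lambda lam: cv_error(X, y, lam))
print("lambda chosen by 5-fold CV:", best)
```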

A Bayesian Formulation

Consider the linear regression model with normal errors:

\begin{equation*}
Y_i = \sum_{j=1}^p X_{ij}\beta_j + \epsilon_i
\end{equation*}
where the \(\epsilon_i\) are i.i.d. normal errors with mean 0 and known variance \(\sigma^2\).

Since \( \lambda \) is applied to the squared norm of the β vector, people often standardize all of the covariates to make them have a similar scale. Assume \( \beta_j \) has the prior distribution \( \beta_j \sim_{iid} N(0,\sigma^2/\lambda)\). A large value of \( \lambda \) corresponds to a prior that is more tightly concentrated around zero and hence leads to greater shrinkage towards zero.

The posterior is \(\beta|Y \sim N(\hat{\beta}, \sigma^2 (X'X+\lambda I_p)^{-1})\), where \(\hat{\beta} = \hat{\beta}_{ridge} = (X'X+\lambda I_p)^{-1} X' Y\), confirming that the posterior mean (and mode) of the Bayesian linear model corresponds to the ridge regression estimator.
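To see why (a standard completing-the-square argument, using only the model and prior above), combine the likelihood and the prior:

\begin{equation*}
p(\beta \mid Y) \;\propto\; \exp\Big(-\tfrac{1}{2\sigma^2}\|Y - X\beta\|^2\Big)\exp\Big(-\tfrac{\lambda}{2\sigma^2}\|\beta\|^2\Big) \;\propto\; \exp\Big(-\tfrac{1}{2\sigma^2}\big(\beta'(X'X+\lambda I_p)\beta - 2\beta'X'Y\big)\Big),
\end{equation*}

which is the kernel of a normal density with mean \((X'X+\lambda I_p)^{-1}X'Y = \hat{\beta}_{ridge}\) and covariance \(\sigma^2(X'X+\lambda I_p)^{-1}\). Maximizing this posterior is the same as minimizing the penalized sum of squares, so the posterior mode is again the ridge estimator.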

Whereas the least squares solution \(\hat{\beta}_{ls} = (X'X)^{-1} X' Y\) is unbiased if the model is correctly specified, the ridge solution is biased: \(E(\hat{\beta}_{ridge}) \neq \beta\). However, at the cost of this bias, ridge regression reduces the variance, and thus might reduce the mean squared error (MSE):

\begin{equation*}
MSE = Bias^2 + Variance
\end{equation*}
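The bias-variance trade-off can be checked by simulation. The sketch below is purely illustrative (the design, the true \( \beta \), the noise level, and \( \lambda \) are arbitrary choices): it repeatedly regenerates the noise and estimates the squared bias, total variance, and MSE of the LS and ridge estimators.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma, lam = 30, 10, 2.0, 5.0
X = rng.normal(size=(n, p)); X -= X.mean(axis=0)
beta = rng.normal(scale=0.5, size=p)           # fixed "true" coefficients

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

est_ls, est_ridge = [], []
for _ in range(2000):                          # repeated draws of the noise
    y = X @ beta + rng.normal(scale=sigma, size=n)
    est_ls.append(ridge(X, y, 0.0))
    est_ridge.append(ridge(X, y, lam))

def bias2_var_mse(estimates):
    est = np.array(estimates)
    bias2 = np.sum((est.mean(axis=0) - beta) ** 2)   # squared bias of beta_hat
    var = np.sum(est.var(axis=0))                    # total variance of beta_hat
    return round(bias2, 3), round(var, 3), round(bias2 + var, 3)

print("LS    (bias^2, var, MSE):", bias2_var_mse(est_ls))
print("ridge (bias^2, var, MSE):", bias2_var_mse(est_ridge))
# Ridge trades a small bias for a larger drop in variance, so its MSE
# is typically smaller than that of LS for a suitable lambda.
```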

More Geometric Interpretations (optional)

  • Inputs are centered first.
  • Consider the fitted response:

\( \begin{align} \hat{y} &=\textbf{X}\hat{\beta}^{ridge}\\
& = \textbf{X}(\textbf{X}^{T}\textbf{X} + \lambda\textbf{I})^{-1}\textbf{X}^{T}\textbf{y}\\
& = \textbf{U}\textbf{D}(\textbf{D}^2 +\lambda\textbf{I})^{-1}\textbf{D}\textbf{U}^{T}\textbf{y}\\
& = \sum_{j=1}^{p}\textbf{u}_j \dfrac{d_{j}^{2}}{d_{j}^{2}+\lambda}\textbf{u}_{j}^{T}\textbf{y},
\end{align} \)

where \(\textbf{X} = \textbf{U}\textbf{D}\textbf{V}^{T}\) is the singular value decomposition of X and the \(\textbf{u}_j\) (the columns of U) are the normalized principal components of X.

  • Ridge regression shrinks the coordinates with respect to the orthonormal basis formed by the principal components.
  • Coordinates with respect to principal components with smaller variance are shrunk more.
  • Instead of using X = (X1, X2, ... , Xp) as the predicting variables, use the new input matrix \(\tilde{X}\) = UD.
  • Then for the new inputs:

\(\hat{\beta}_{j}^{ridge}=\dfrac{d_{j}}{d_{j}^{2}+\lambda}\textbf{u}_{j}^{T}\textbf{y}, \qquad Var(\hat{\beta}_{j})=\dfrac{\sigma^2}{d_{j}^{2}},\)

where \(\sigma^2\) is the variance of the error term \(\epsilon\) in the linear model.

  • The shrinkage factor given by ridge regression is

\(\dfrac{d_{j}^{2}}{d_{j}^{2}+\lambda}.\)

We saw this factor in the formula for \(\hat{y}\) above. The larger \(\lambda\) is, the more the projection onto the direction \(\textbf{u}_j\) is shrunk. Coordinates with respect to the principal components with a smaller variance are shrunk more.
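Continuing the sketch style used above, the shrinkage factors can be read directly off the singular values of the centered design matrix (the simulated matrix and the value of \( \lambda \) below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 5))
X -= X.mean(axis=0)                        # centered design matrix
lam = 10.0

# Singular value decomposition X = U D V'
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)               # shrinkage factor for each direction u_j
for dj, sj in zip(d, shrink):
    print(f"d_j = {dj:6.2f}   d_j^2/(d_j^2 + lambda) = {sj:.3f}")
# Directions with small singular values (low-variance principal components)
# get factors closer to 0, i.e. they are shrunk the most.
```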

Let's take a look at this geometrically.

[Figure: shrinkage of the coordinates along the principal-component directions]

This interpretation will become convenient when we compare it to principal components regression, where instead of gradual shrinkage each direction is either kept as is or shrunk all the way to zero (dropped). We will see this in the "Dimension Reduction Methods" lesson.
