5.4 - The Lasso

A ridge solution can be hard to interpret because it is not sparse (no \(\beta\)'s are set exactly to 0). What if we constrain the \(L1\) norm instead of the Euclidean (\(L2\)) norm?

\begin{equation*}
\textrm{Ridge subject to: } \sum_{j=1}^p \beta_j^2 \leq c.
\end{equation*}
\begin{equation*}
\textrm{Lasso subject to: } \sum_{j=1}^p |\beta_j| \leq c.
\end{equation*}

This is a subtle but important change: some of the coefficients may now be shrunk exactly to zero.

The least absolute shrinkage and selection operator, or lasso, as described in Tibshirani (1996), is a technique that has received a great deal of interest.

As with ridge regression, we assume the covariates are standardized. Lasso estimates of the coefficients (Tibshirani, 1996) achieve \(\min_\beta (Y-X\beta)'(Y-X\beta) + \lambda \sum_{j=1}^p|\beta_j|\), so that the L2 penalty of ridge regression, \(\sum_{j=1}^{p}\beta_{j}^{2}\), is replaced by an L1 penalty, \(\sum_{j=1}^{p}|\beta_{j}|\).
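As a concrete illustration, here is a minimal sketch of fitting this criterion in Python with scikit-learn (an assumed tool choice, not one prescribed by these notes). Note that scikit-learn's Lasso minimizes \(\frac{1}{2n}(Y-X\beta)'(Y-X\beta) + \alpha \sum_j|\beta_j|\), so its alpha parameter plays the role of \(\lambda/(2n)\) in the criterion above.

```python
# Minimal sketch: lasso fit on standardized covariates (assumed toy data).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0] + [0.0] * (p - 2))  # only 2 real signals
y = X @ beta_true + rng.normal(size=n)

X_std = StandardScaler().fit_transform(X)  # standardize, as the text assumes
fit = Lasso(alpha=0.1).fit(X_std, y)
print(fit.coef_)  # several coefficients come out exactly 0
```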

Let \(c_0 = \sum_{j=1}^p|\hat{\beta}_{LS,j}|\) denote the absolute size of the least squares estimates. Values of \(0< c < c_0\) cause shrinkage towards zero.

If, for example, \(c = c_0/2\), the average shrinkage of the least squares coefficients is 50%. Equivalently, in the penalized form, if \(\lambda\) is sufficiently large, some of the coefficients are driven exactly to zero, leading to a sparse model.

Geometric Interpretation

The lasso performs \(L1\) shrinkage so that there are "corners" in the constraint region, which in two dimensions is a diamond. If the elliptical contours of the sum of squares "hit" one of these corners, then the coefficient corresponding to that axis is shrunk to zero.

[Figure: the diamond-shaped lasso constraint region and the elliptical contours of the residual sum of squares in two dimensions.]

As \(p\) increases, the multidimensional diamond has an increasing number of corners, and so it is highly likely that some coefficients will be set equal to zero. Hence, the lasso performs shrinkage and (effectively) subset selection.

In contrast to subset selection, the lasso performs soft thresholding: as the smoothing parameter is varied, the sample path of each estimate moves continuously to zero.
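The soft-thresholding behaviour has a simple closed form in the special case of an orthonormal design (an illustrative assumption, not the general case): each lasso coefficient is the least squares estimate soft-thresholded at \(\lambda/2\). A small sketch of the operator:

```python
import numpy as np

def soft_threshold(b, t):
    """Shrink b towards zero by t; values with |b| <= t become exactly 0."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

beta_ls = np.array([2.5, -0.3, 0.8, -1.7])
print(soft_threshold(beta_ls, 1.0))  # [ 1.5 -0.   0.  -0.7]
```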

Least Angle Regression

The lasso loss function is no longer quadratic, but is still convex:
\begin{equation*}
\textrm{Minimize:} \sum_{i=1}^n(Y_i-\sum_{j=1}^p X_{ij}\beta_j)^2 + \lambda \sum_{j=1}^p|\beta_j|
\end{equation*}

Unlike ridge regression, there is no closed-form solution for the lasso because the solution is nonlinear in \( Y \). The entire path of lasso estimates for all values of \( \lambda \) can be efficiently computed through a modification of the Least Angle Regression (LARS) algorithm (Efron et al., 2004).
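A hedged sketch of computing the full lasso path via the LARS-based solver in scikit-learn (lars_path with method="lasso"); the data here are assumed for illustration:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

alphas, active, coefs = lars_path(X, y, method="lasso")
# The path is piecewise linear: `coefs` has one column per breakpoint,
# and `active` records the order in which variables enter the model.
print(active)
print(coefs.shape)
```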

Lasso and ridge regression both put penalties on \( \beta \). More generally, penalties of the form \(\lambda \sum_{j=1}^p |\beta_j|^q\) may be considered, for \(q \geq 0\). Ridge regression and the lasso correspond to \(q = 2\) and \(q = 1\), respectively. When \( X_j \) is only weakly related to \( Y \), the lasso pulls \(\beta_j\) to zero faster than ridge regression does.
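A toy comparison of that last point (assumed setup): with one strong and one weak predictor, ridge shrinks both coefficients but keeps them nonzero, while the lasso tends to zero the weak one out.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(size=n)  # X_2 is weak

print(Ridge(alpha=10).fit(X, y).coef_)   # both coefficients stay nonzero
print(Lasso(alpha=0.2).fit(X, y).coef_)  # the weak coefficient is driven to 0
```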

Inference for Lasso Estimation

The ordinary lasso does not address the uncertainty of parameter estimation; standard errors for \( \beta \)'s are not immediately available.

For inference using the lasso estimator, various standard error estimators have been proposed:

  • Tibshirani (1996) suggested the bootstrap (Efron, 1979) for the estimation of standard errors and derived an approximate closed-form estimate.

  • Fan and Li (2001) derived the sandwich formula in the likelihood setting as an estimator for the covariance of the estimates.

However, the above approximate covariance matrices give an estimated variance of \( 0 \) for predictors with \(\hat{\beta}_j=0\). The "Bayesian lasso" of Park and Casella (2008) provides valid standard errors for \( \beta \) and more stable point estimates by using the posterior median. The lasso estimate is equivalent to the mode of the posterior distribution under a normal likelihood and an independent Laplace (double exponential) prior:
\begin{equation*}
\pi(\beta) = \prod_{j=1}^p \frac{\lambda}{2} \exp(-\lambda |\beta_j|).
\end{equation*}
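To see the equivalence, note that under a \(N(X\beta, \sigma^2 I)\) likelihood this prior gives, up to an additive constant, the negative log-posterior
\begin{equation*}
-\log \pi(\beta \mid Y) = \frac{1}{2\sigma^2}(Y-X\beta)'(Y-X\beta) + \lambda \sum_{j=1}^p |\beta_j| + \textrm{const},
\end{equation*}
so the posterior mode minimizes the lasso criterion with penalty parameter \(2\sigma^2\lambda\).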
The Bayesian lasso estimates (posterior medians) appear to be a compromise between the ordinary lasso and ridge regression. Park and Casella (2008) showed that the posterior density is unimodal under a conditional Laplace prior on \(\beta\) given \(\sigma^2\),
\begin{equation*}
\pi(\beta|\sigma^2) = \prod_{j=1}^p \frac{\lambda}{2\sqrt{\sigma^2}}\exp\left(-\frac{\lambda |\beta_j|}{\sqrt{\sigma^2}}\right),
\end{equation*}
together with the noninformative marginal prior \(\pi(\sigma^2) \propto 1/\sigma^2\), and derived a Gibbs algorithm for sampling from the posterior distribution.

Comparing Ridge Regression and the Lasso

The colored lines are the paths of regression coefficients shrinking towards zero. If we draw a vertical line in the figure, it will give a set of regression coefficients corresponding to a fixed \( \lambda\). (The x-axis actually shows the proportion of shrinkage instead of \( \lambda\)).

[Figure: ridge regression coefficient paths as the amount of shrinkage varies.]

Ridge regression shrinks all regression coefficients towards zero; the lasso tends to give a set of zero regression coefficients and leads to a sparse solution.

[Figure: lasso coefficient paths; some coefficients reach exactly zero.]

Note that for both ridge regression and the lasso the regression coefficients can move from positive to negative values as they are shrunk toward zero.
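Plots like these can be reproduced with a short sketch (assumed data; scikit-learn's lasso_path for the lasso path, a grid of Ridge fits for the ridge path):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.0, 0.0]) + rng.normal(size=100)

alphas, lasso_coefs, _ = lasso_path(X, y)
ridge_alphas = np.logspace(-2, 4, 50)
ridge_coefs = np.array([Ridge(alpha=a).fit(X, y).coef_ for a in ridge_alphas])

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(np.log10(alphas), lasso_coefs.T)      # lasso paths hit exactly zero
ax1.set_title("Lasso")
ax2.plot(np.log10(ridge_alphas), ridge_coefs)  # ridge paths only approach zero
ax2.set_title("Ridge")
plt.show()
```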

Group Lasso

In some contexts, we may wish to treat a set of regressors as a group, for example, when we have a categorical covariate with more than two levels. The group lasso of Yuan and Lin (2006) addresses this problem by considering the simultaneous shrinkage of (pre-defined) groups of coefficients.
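Concretely, with the coefficients partitioned into pre-defined groups \(\beta_1, \ldots, \beta_G\) of sizes \(p_1, \ldots, p_G\), the group lasso criterion replaces the \(L1\) penalty with a sum of (unsquared) Euclidean norms of the blocks:
\begin{equation*}
\min_\beta \; (Y-X\beta)'(Y-X\beta) + \lambda \sum_{g=1}^G \sqrt{p_g}\, \|\beta_g\|_2.
\end{equation*}
Because each block is penalized through its whole norm, an entire group of coefficients is either retained or set to zero together.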

FAQs

What is the least absolute shrinkage?

In statistics and machine learning, lasso (least absolute shrinkage and selection operator; also Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model.

What does a Lasso coefficient mean?

Lasso regression uses shrinkage: the coefficient estimates are shrunk towards a central point, zero. The lasso penalty reduces each coefficient's magnitude, so a variable that contributes little is allowed to have a zero or near-zero coefficient.

What is the formula for Lasso regularization?

Lasso performs L1 regularization, i.e., it adds a penalty equal to the sum of the absolute values of the coefficients: minimization objective = least squares objective + α × (sum of the absolute values of the coefficients).

What is the problem with lasso regression?

The main problem with lasso regression is that when we have correlated variables, it retains only one of them and sets the other correlated variables to zero. That can lead to some loss of information, resulting in lower model accuracy; see the sketch below.
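An assumed toy illustration of this behaviour: with two nearly identical columns, the lasso tends to keep one and zero out the other rather than split the effect between them.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)

print(Lasso(alpha=0.1).fit(X, y).coef_)  # roughly one coef near 2, one at 0
```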

Why can lasso shrink to zero?

The lasso performs shrinkage so that there are "corners" in the constraint region, which in two dimensions is a diamond. If the sum of squares "hits" one of these corners, then the coefficient corresponding to that axis is shrunk to zero.

What is the lasso shrinkage method?

Least Absolute Shrinkage and Selection Operator (LASSO) is a method for modeling the relationship between a dependent variable (which may be a vector) and one or more explanatory variables by fitting a regularized least squares model. A trained LASSO model can produce sparse coefficients due to the regularization term.

How is the lasso calculated?

The lasso is an extension of OLS that adds a penalty to the RSS equal to the sum of the absolute values of the non-intercept beta coefficients, multiplied by a parameter λ that controls the strength of the penalty: the larger λ is, the more strongly the coefficients are shrunk towards zero.

What does a lasso plot tell us?

Lasso regularization does both shrinkage and variable selection. A lasso plot shows the coefficient paths as the penalty varies, and tells us how much of the deviance (a quantity similar to R-squared) has been explained by the model.

What is the penalty for ridge regression?

In ridge regression, the shrinkage of the coefficients is achieved by penalizing the regression model with a penalty term called the L2-norm, which is the sum of the squared coefficients. The amount of the penalty is fine-tuned using a constant called lambda (λ); selecting a good value for λ is critical.
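A minimal sketch of tuning λ by cross-validation (called alpha in scikit-learn's RidgeCV; the data here are assumed):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=10, noise=10.0,
                       random_state=0)
model = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X, y)
print(model.alpha_)  # the penalty selected by cross-validation
```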

When to use lasso?

Lasso Regression shines when dealing with datasets containing numerous variables, many of which may not be relevant or have a weak impact on the target variable. By shrinking some coefficient estimates to zero, Lasso automatically performs feature selection, simplifying your model and improving its interpretability.
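A sketch of that selection behaviour under an assumed setup: many noise features and two real ones; a cross-validated lasso zeroes most coefficients out.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=n)  # only 2 real signals

model = LassoCV(cv=5).fit(X, y)
print(np.sum(model.coef_ != 0), "of", p, "coefficients are nonzero")
```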

Why is lasso feature selection important?

Feature selection: lasso helps identify the most important features, making the model more interpretable. Reduced overfitting: by adding a penalty term, lasso reduces the risk of overfitting the training data.

What is lasso regression in simple terms?

Lasso regression is a regularization technique that applies a penalty to prevent overfitting and enhance the accuracy of statistical models.

What are the disadvantages of lasso regularization?

Lasso has two noticeable shortcomings (Zou and Hastie, 2005): (i) the number of selected predictors is bounded by the sample size, as shown in Rosset et al. (2004), and (ii) the lasso tends to select only one (or a few) predictors from a subset of correlated predictors and shrinks the rest to zero.

Why is lasso unstable?

Stability matters here because lasso feature selection is often performed via cross-validation; the lasso algorithm is therefore run on different folds of the data and may yield different results each time.
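An assumed demonstration of this point: reshuffling the rows changes the cross-validation folds, and the set of variables a cross-validated lasso selects can change with them.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(6)
n, p = 60, 30
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + rng.normal(size=n)

for seed in range(3):
    order = np.random.default_rng(seed).permutation(n)  # new fold assignment
    model = LassoCV(cv=5).fit(X[order], y[order])
    print(np.flatnonzero(model.coef_))  # selected variables may differ
```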

What is the least absolute shrinkage and selection operator in R?

"LASSO" stands for Least Absolute Shrinkage and Selection Operator. Lasso regression is good for models showing high levels of multicollinearity, or when you want to automate parts of model selection such as variable selection or parameter elimination.

What is the least absolute shrinkage and selection operator in Stata?

The least absolute shrinkage and selection operator (lasso) estimates model coefficients and these estimates can be used to select which covariates should be included in a model. The lasso is used for outcome prediction and for inference about causal parameters.

What is the least absolute deviation fit?

In the case of the least absolute deviations fit, the straight line is obtained by minimizing the sum of the absolute values of the residuals. The least absolute deviations fit is a robust fit method, unlike the least-squares fit.

What is the least absolute shrinkage and selection operator in Python?

Lasso is short for Least Absolute Shrinkage and Selection Operator, which is used both for regularization and model selection. If a model uses the L1 regularization technique, then it is called lasso regression.
