5.4 - The Lasso

A ridge solution can be hard to interpret because it is not sparse (no \(\beta\)'s are set exactly to 0). What if we constrain the \(L1\) norm instead of the Euclidean (\(L2\)) norm?

\begin{equation*}
\textrm{Ridge subject to: } \sum_{j=1}^p \beta_j^2 \leq c.
\end{equation*}
\begin{equation*}
\textrm{Lasso subject to: } \sum_{j=1}^p |\beta_j| \leq c.
\end{equation*}

This is a subtle but important change: some of the coefficients may now be shrunk exactly to zero.

The least absolute shrinkage and selection operator, or lasso, as described in Tibshirani (1996), is a technique that has received a great deal of interest.

As with ridge regression, we assume the covariates are standardized. Lasso estimates of the coefficients (Tibshirani, 1996) achieve \(\min_\beta (Y-X\beta)'(Y-X\beta) + \lambda \sum_{j=1}^p|\beta_j|\), so that the L2 penalty of ridge regression, \(\sum_{j=1}^{p}\beta_{j}^{2}\), is replaced by an L1 penalty, \(\sum_{j=1}^{p}|\beta_{j}|\).
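As a concrete illustration, here is a minimal sketch of fitting this criterion in Python with scikit-learn (an assumed tool choice, not one prescribed by these notes). Note that scikit-learn's Lasso minimizes \(\frac{1}{2n}(Y-X\beta)'(Y-X\beta) + \alpha \sum_j|\beta_j|\), so its alpha parameter plays the role of \(\lambda/(2n)\) in the criterion above.

```python
# Minimal sketch: lasso fit on standardized covariates (assumed toy data).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0] + [0.0] * (p - 2))  # only 2 real signals
y = X @ beta_true + rng.normal(size=n)

X_std = StandardScaler().fit_transform(X)  # standardize, as the text assumes
fit = Lasso(alpha=0.1).fit(X_std, y)
print(fit.coef_)  # several coefficients come out exactly 0
```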

Let \(c_0 = \sum_{j=1}^p|\hat{\beta}_{LS,j}|\) denote the absolute size of the least squares estimates. Values of \(0< c < c_0\) cause shrinkage towards zero.

If, for example, \(c = c_0/2\), the average shrinkage of the least squares coefficients is 50%. Equivalently, in the penalized form, if \(\lambda\) is sufficiently large, some of the coefficients are driven exactly to zero, leading to a sparse model.

Geometric Interpretation

The lasso performs \(L1\) shrinkage so that there are "corners" in the constraint region, which in two dimensions is a diamond. If the elliptical contours of the sum of squares "hit" one of these corners, then the coefficient corresponding to that axis is shrunk to zero.

[Figure: the diamond-shaped lasso constraint region and the elliptical contours of the residual sum of squares in two dimensions.]

As \(p\) increases, the multidimensional diamond has an increasing number of corners, and so it is highly likely that some coefficients will be set equal to zero. Hence, the lasso performs shrinkage and (effectively) subset selection.

In contrast to subset selection, the lasso performs soft thresholding: as the smoothing parameter is varied, the sample path of each estimate moves continuously to zero.
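The soft-thresholding behaviour has a simple closed form in the special case of an orthonormal design (an illustrative assumption, not the general case): each lasso coefficient is the least squares estimate soft-thresholded at \(\lambda/2\). A small sketch of the operator:

```python
import numpy as np

def soft_threshold(b, t):
    """Shrink b towards zero by t; values with |b| <= t become exactly 0."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

beta_ls = np.array([2.5, -0.3, 0.8, -1.7])
print(soft_threshold(beta_ls, 1.0))  # [ 1.5 -0.   0.  -0.7]
```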

Least Angle Regression

The lasso loss function is no longer quadratic, but is still convex:
\begin{equation*}
\textrm{Minimize:} \sum_{i=1}^n(Y_i-\sum_{j=1}^p X_{ij}\beta_j)^2 + \lambda \sum_{j=1}^p|\beta_j|
\end{equation*}

Unlike ridge regression, there is no closed-form solution for the lasso because the solution is nonlinear in \( Y \). The entire path of lasso estimates for all values of \( \lambda \) can be efficiently computed through a modification of the Least Angle Regression (LARS) algorithm (Efron et al., 2004).
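A hedged sketch of computing the full lasso path via the LARS-based solver in scikit-learn (lars_path with method="lasso"); the data here are assumed for illustration:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

alphas, active, coefs = lars_path(X, y, method="lasso")
# The path is piecewise linear: `coefs` has one column per breakpoint,
# and `active` records the order in which variables enter the model.
print(active)
print(coefs.shape)
```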

Lasso and ridge regression both put penalties on \( \beta \). More generally, penalties of the form \(\lambda \sum_{j=1}^p |\beta_j|^q\) may be considered, for \(q \geq 0\). Ridge regression and the lasso correspond to \(q = 2\) and \(q = 1\), respectively. When \( X_j \) is only weakly related to \( Y \), the lasso pulls \(\beta_j\) to zero faster than ridge regression does.
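A toy comparison of that last point (assumed setup): with one strong and one weak predictor, ridge shrinks both coefficients but keeps them nonzero, while the lasso tends to zero the weak one out.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(size=n)  # X_2 is weak

print(Ridge(alpha=10).fit(X, y).coef_)   # both coefficients stay nonzero
print(Lasso(alpha=0.2).fit(X, y).coef_)  # the weak coefficient is driven to 0
```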

Inference for Lasso Estimation

The ordinary lasso does not address the uncertainty of parameter estimation; standard errors for \( \beta \)'s are not immediately available.

For inference using the lasso estimator, various standard error estimators have been proposed:

  • Tibshirani (1996) suggested the bootstrap (Efron, 1979) for the estimation of standard errors and derived an approximate closed-form estimate.

  • Fan and Li (2001) derived the sandwich formula in the likelihood setting as an estimator for the covariance of the estimates.

However, the above approximate covariance matrices give an estimated variance of \( 0 \) for predictors with \(\hat{\beta}_j=0\). The "Bayesian lasso" of Park and Casella (2008) provides valid standard errors for \( \beta \) and more stable point estimates by using the posterior median. The lasso estimate is equivalent to the mode of the posterior distribution under a normal likelihood and an independent Laplace (double exponential) prior:
\begin{equation*}
\pi(\beta) = \prod_{j=1}^p \frac{\lambda}{2} \exp(-\lambda |\beta_j|).
\end{equation*}
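To see the equivalence, note that under a \(N(X\beta, \sigma^2 I)\) likelihood this prior gives, up to an additive constant, the negative log-posterior
\begin{equation*}
-\log \pi(\beta \mid Y) = \frac{1}{2\sigma^2}(Y-X\beta)'(Y-X\beta) + \lambda \sum_{j=1}^p |\beta_j| + \textrm{const},
\end{equation*}
so the posterior mode minimizes the lasso criterion with penalty parameter \(2\sigma^2\lambda\).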
The Bayesian lasso estimates (posterior medians) appear to be a compromise between the ordinary lasso and ridge regression. Park and Casella (2008) showed that the posterior density is unimodal under a conditional Laplace prior on \(\beta\) given \(\sigma^2\),
\begin{equation*}
\pi(\beta|\sigma^2) = \prod_{j=1}^p \frac{\lambda}{2\sqrt{\sigma^2}}\exp\left(-\frac{\lambda |\beta_j|}{\sqrt{\sigma^2}}\right),
\end{equation*}
together with the noninformative marginal prior \(\pi(\sigma^2) \propto 1/\sigma^2\), and derived a Gibbs algorithm for sampling from the posterior distribution.

Comparing Ridge Regression and the Lasso

The colored lines are the paths of regression coefficients shrinking towards zero. If we draw a vertical line in the figure, it will give a set of regression coefficients corresponding to a fixed \( \lambda\). (The x-axis actually shows the proportion of shrinkage instead of \( \lambda\)).

[Figure: ridge regression coefficient paths as the amount of shrinkage varies.]

Ridge regression shrinks all regression coefficients towards zero; the lasso tends to give a set of zero regression coefficients and leads to a sparse solution.

[Figure: lasso coefficient paths; some coefficients reach exactly zero.]

Note that for both ridge regression and the lasso the regression coefficients can move from positive to negative values as they are shrunk toward zero.
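Plots like these can be reproduced with a short sketch (assumed data; scikit-learn's lasso_path for the lasso path, a grid of Ridge fits for the ridge path):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.0, 0.0]) + rng.normal(size=100)

alphas, lasso_coefs, _ = lasso_path(X, y)
ridge_alphas = np.logspace(-2, 4, 50)
ridge_coefs = np.array([Ridge(alpha=a).fit(X, y).coef_ for a in ridge_alphas])

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(np.log10(alphas), lasso_coefs.T)      # lasso paths hit exactly zero
ax1.set_title("Lasso")
ax2.plot(np.log10(ridge_alphas), ridge_coefs)  # ridge paths only approach zero
ax2.set_title("Ridge")
plt.show()
```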

Group Lasso

In some contexts, we may wish to treat a set of regressors as a group, for example, when we have a categorical covariate with more than two levels. The group lasso of Yuan and Lin (2006) addresses this problem by considering the simultaneous shrinkage of (pre-defined) groups of coefficients.
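Concretely, with the coefficients partitioned into pre-defined groups \(\beta_1, \ldots, \beta_G\) of sizes \(p_1, \ldots, p_G\), the group lasso criterion replaces the \(L1\) penalty with a sum of (unsquared) Euclidean norms of the blocks:
\begin{equation*}
\min_\beta \; (Y-X\beta)'(Y-X\beta) + \lambda \sum_{g=1}^G \sqrt{p_g}\, \|\beta_g\|_2.
\end{equation*}
Because each block is penalized through its whole norm, an entire group of coefficients is either retained or set to zero together.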

FAQs

What is the least absolute shrinkage?

In statistics and machine learning, lasso (least absolute shrinkage and selection operator; also Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model.

What does a Lasso coefficient mean?

Lasso regression uses shrinkage: the coefficient estimates are shrunk towards a central point, zero. The lasso penalty reduces each coefficient's magnitude, so a variable that contributes little is allowed to have a zero or near-zero coefficient.

What is the formula for Lasso regularization?

Lasso performs L1 regularization, i.e., it adds a penalty equal to the sum of the absolute values of the coefficients: minimization objective = least squares objective + α × (sum of the absolute values of the coefficients).

What is the problem with lasso regression?

The main problem with lasso regression is that when we have correlated variables, it retains only one of them and sets the other correlated variables to zero. That can lead to some loss of information, resulting in lower model accuracy; see the sketch below.
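An assumed toy illustration of this behaviour: with two nearly identical columns, the lasso tends to keep one and zero out the other rather than split the effect between them.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)

print(Lasso(alpha=0.1).fit(X, y).coef_)  # roughly one coef near 2, one at 0
```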

Why can lasso shrink to zero?

The lasso performs shrinkage so that there are "corners" in the constraint region, which in two dimensions is a diamond. If the sum of squares "hits" one of these corners, then the coefficient corresponding to that axis is shrunk to zero.

What is the lasso shrinkage method?

Least Absolute Shrinkage and Selection Operator (LASSO) is a method for modeling the relationship between a dependent variable (which may be a vector) and one or more explanatory variables by fitting a regularized least squares model. A trained LASSO model can produce sparse coefficients due to the regularization term.

How is the lasso calculated?

The lasso is an extension of OLS that adds a penalty to the RSS equal to the sum of the absolute values of the non-intercept beta coefficients, multiplied by a parameter λ that controls the strength of the penalty: the larger λ is, the more strongly the coefficients are shrunk towards zero.

What does a lasso plot tell us?

Lasso regularization does both shrinkage and variable selection. A lasso plot shows the coefficient paths as the penalty varies, and tells us how much of the deviance (a quantity similar to R-squared) has been explained by the model.

What is the penalty for ridge regression?

In ridge regression, the shrinkage of the coefficients is achieved by penalizing the regression model with a penalty term called the L2-norm, which is the sum of the squared coefficients. The amount of the penalty is fine-tuned using a constant called lambda (λ); selecting a good value for λ is critical.
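A minimal sketch of tuning λ by cross-validation (called alpha in scikit-learn's RidgeCV; the data here are assumed):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=10, noise=10.0,
                       random_state=0)
model = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X, y)
print(model.alpha_)  # the penalty selected by cross-validation
```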

When to use lasso?

Lasso Regression shines when dealing with datasets containing numerous variables, many of which may not be relevant or have a weak impact on the target variable. By shrinking some coefficient estimates to zero, Lasso automatically performs feature selection, simplifying your model and improving its interpretability.
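A sketch of that selection behaviour under an assumed setup: many noise features and two real ones; a cross-validated lasso zeroes most coefficients out.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=n)  # only 2 real signals

model = LassoCV(cv=5).fit(X, y)
print(np.sum(model.coef_ != 0), "of", p, "coefficients are nonzero")
```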

Why is lasso feature selection important?

Feature selection: lasso helps identify the most important features, making the model more interpretable. Reduced overfitting: by adding a penalty term, lasso reduces the risk of overfitting the training data.

What is lasso regression in simple terms?

Lasso regression is a regularization technique that applies a penalty to prevent overfitting and enhance the accuracy of statistical models.

What are the disadvantages of lasso regularization?

Lasso has two noticeable shortcomings (Zou and Hastie, 2005): (i) the number of selected predictors is bounded by the sample size, as shown in Rosset et al. (2004), and (ii) the lasso tends to select only one (or a few) predictors from a subset of correlated predictors and shrinks the rest to zero.

Why is lasso unstable?

Stability matters here because lasso feature selection is often performed via cross-validation; the lasso algorithm is therefore run on different folds of the data and may yield different results each time.
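An assumed demonstration of this point: reshuffling the rows changes the cross-validation folds, and the set of variables a cross-validated lasso selects can change with them.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(6)
n, p = 60, 30
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + rng.normal(size=n)

for seed in range(3):
    order = np.random.default_rng(seed).permutation(n)  # new fold assignment
    model = LassoCV(cv=5).fit(X[order], y[order])
    print(np.flatnonzero(model.coef_))  # selected variables may differ
```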

What is the least absolute shrinkage and selection operator in R?

"LASSO" stands for Least Absolute Shrinkage and Selection Operator. Lasso regression is good for models showing high levels of multicollinearity, or when you want to automate parts of model selection such as variable selection or parameter elimination.

What is the least absolute shrinkage and selection operator in Stata?

The least absolute shrinkage and selection operator (lasso) estimates model coefficients and these estimates can be used to select which covariates should be included in a model. The lasso is used for outcome prediction and for inference about causal parameters.

What is the least absolute deviation fit?

In the case of the least absolute deviations fit, the straight line is obtained by minimizing the sum of the absolute values of the residuals. The least absolute deviations fit is a robust fit method, unlike the least-squares fit.

What is the least absolute shrinkage and selection operator in Python?

Lasso is short for Least Absolute Shrinkage and Selection Operator, which is used both for regularization and model selection. If a model uses the L1 regularization technique, then it is called lasso regression.
