Regularization and Sparsity


1. Why do we need to make the parameters sparse?

The sparsity of the parameters means that most of the parameters are equal to 0. Now, imagine a model so complicated that it can approximate most of the training samples, so that even the anomalous, noisy samples are learned by the model. This is called overfitting, which means the model has high variance across different training sets. An overfitted model will not perform well during validation, so we need to keep our model from overfitting. Then the question comes: "How can we prevent our model from overfitting?" There are many methods, such as collecting more training data, feature selection, adjusting model complexity, dropout, cross-validation, and the regularization introduced in this article.

1.1. Why can sparsity of the parameters prevent models from overfitting?

As we all know, overfitting means the model fits the training set with too much flexibility. Only a sufficiently complex model has the capacity to approximate most of the training samples, and a complex model usually means a large number of trainable parameters. If the parameter matrix is sparse, the number of trainable parameters is not decreased, but many of them are equal to zero. If a parameter is equal to 0, the corresponding connection between neurons loses its ability to transmit activation information. This means a lot:

  1. The features feeding these invalid connections will not be sent to the next layer. If we can invalidate connections selectively, we can drop the redundant features.
  2. The feature vectors sent to the next layer will not contain noisy information from the previous layer, so the representational power of the model is weakened overall. Generally speaking, making the parameters sparse makes the model simpler, so that both the feature-extracting ability and the feature representation are weakened. In this way, we can prevent models from overfitting.

1.2. How can we make parameters sparse?

This question brings us to regularization. To be precise, only $L_1$ regularization encourages parameter sparsity; $L_2$ regularization does not. The reason is explained in the following sections.

2. $L_1$ Regularization and $L_2$ Regularization

The loss function with $L_1$ regularization is as follows:

$$L(W) + C\|W\|_1 = L(W) + C\sum_{j=1}^{p}|w_j|$$

where $p$ denotes the number of parameters of the whole model, and $C$ is a tuning parameter that balances the regularization against the original objective.

This comes from an old regression method called LASSO, which was originally designed for feature selection and model shrinkage. The feature-selection behavior is also due to the $L_1$ regularization.

The loss function with $L_2$ regularization is as follows:

$$L(W) + C\|W\|_2^2 = L(W) + C\sum_{j=1}^{p}w_j^2$$

This comes from ridge regression, which uses $L_2$ regularization to constrain the model's complexity.
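
As a minimal sketch of the two objectives above (a toy least-squares $L(W)$ on synthetic data; all names and the data-generating setup are illustrative, not from the original post):

```python
import numpy as np

# Toy data: a sparse ground-truth weight vector and noisy observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

def data_loss(w):
    # L(W): mean squared error of the linear model.
    return np.mean((X @ w - y) ** 2) / 2

def l1_penalized(w, C):
    # L(W) + C * sum_j |w_j|   (LASSO-style objective)
    return data_loss(w) + C * np.sum(np.abs(w))

def l2_penalized(w, C):
    # L(W) + C * sum_j w_j^2   (ridge-style objective)
    return data_loss(w) + C * np.sum(w ** 2)

w = rng.normal(size=5)
print(l1_penalized(w, C=0.1), l2_penalized(w, C=0.1))
```

Both penalties add a positive term for any nonzero $W$; the difference in how they shape the minimizer is the subject of the next sections.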

2.1. The function-addition perspective

(figure: the curves of $L(W)$, $C|W|$, and their sum near $W=0$)

If the absolute value of the derivative of $L(W)$ is less than $C$, the gradient of $L(W) + C|W|$ differs from that of $L(W)$ in sign. If $W<0$, the derivative of $C|W|$ is $-C$, so even when $L'(W)$ is positive but less than $C$, $(L(W)+C|W|)' = L'(W) - C$ is negative. If $W>0$, the derivative of $C|W|$ is $+C$, so even when $L'(W)$ is negative with $|L'(W)| < C$, $(L(W)+C|W|)' = L'(W) + C$ is positive. So when we run gradient descent, the model converges to the point $W=0$. This is why the parameters become sparse under $L_1$ regularization.

If we use $L_2$ instead, making a parameter exactly zero requires the gradient $L'(W) + 2CW$ to vanish at $W=0$, i.e. $L'(0)=0$, which hardly ever happens. Generally speaking, $L_2$ just pushes the weights close to zero instead of making them exactly zero, so it cannot lead to parameter sparsity.
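
A minimal numeric sketch of this contrast, assuming a least-squares $L(W)$ and using the standard soft-thresholding (proximal) step for the non-differentiable $L_1$ term; the data and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[[0, 3]] = [3.0, -2.0]          # sparse ground truth
y = X @ true_w + 0.1 * rng.normal(size=100)

def grad(w):
    # Gradient of the smooth squared-error part L(W).
    return X.T @ (X @ w - y) / len(y)

def soft_threshold(w, t):
    # Proximal operator of t * ||w||_1: pulls entries toward 0,
    # and sets them EXACTLY to 0 when their magnitude is below t.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

lr, C = 0.05, 0.1
w_l1 = np.zeros(10)
w_l2 = np.zeros(10)
for _ in range(2000):
    w_l1 = soft_threshold(w_l1 - lr * grad(w_l1), lr * C)  # L1 (ISTA) step
    w_l2 = w_l2 - lr * (grad(w_l2) + 2 * C * w_l2)         # plain GD with L2

print("exact zeros, L1:", int(np.sum(w_l1 == 0.0)))  # several coefficients hit 0
print("exact zeros, L2:", int(np.sum(w_l2 == 0.0)))  # L2 only shrinks, none hit 0
```

The $L_1$ run produces exact zeros on the irrelevant coordinates, while the $L_2$ run leaves every coordinate small but nonzero.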

But what confused me was this: will any loss function with $L_1$ converge to the point $W=0$? That would mean that when training completes, all of the parameters are zero! I nearly lost my mind, because the derivatives of the loss function will surely become less than $C$ near the end of training, and as soon as they do, the optimal point seems to be $W=0$.

However, this idea of mine was wrong. Consider a specific loss function $K(W)$ as a counterexample.

For this loss function, $|K'(W)|$ near $W=0$ is much greater than $C$, so the sign of $K'(W) \pm C$ does not change there. Somewhere to the left of the minimum, $|K'(W)|$ may indeed be less than $C$, but that region is too far from $W=0$ for $K(W)+C|W|$ to attain its minimum at zero.

Of course, the bigger $C$ is, the wider the range of $W$ that converges to zero. But with a properly chosen $C$, it is impossible for the loss function to force all parameters to zero.

Actually, $L_1$ regularization just pulls the parameters whose loss-function derivative magnitude is less than $C$ toward zero, and forces the subset of them that are already close to zero to converge exactly to zero.
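
A one-dimensional sketch of exactly this behavior, using a hypothetical smooth loss $K(w) = (w-b)^2/2$ whose derivative magnitude at $w=0$ is $|b|$: the minimizer of $K(w)+C|w|$ lands exactly at zero only when $|b| \le C$; otherwise it is merely shifted toward zero.

```python
import numpy as np

def minimize_1d(b, C):
    # Brute-force minimizer of K(w) = (w - b)^2 / 2 + C * |w| on a fine grid.
    grid = np.linspace(-5, 5, 200001)
    K = (grid - b) ** 2 / 2 + C * np.abs(grid)
    return grid[np.argmin(K)]

C = 1.0
for b in [0.5, 0.9, 1.5, 3.0]:
    # |K'(0)| = |b|; the minimizer is 0 only when |b| <= C,
    # otherwise it sits at b - C, pulled toward but not onto zero.
    print(f"b={b}: argmin = {minimize_1d(b, C):.4f}")
```

For $b=0.5$ and $b=0.9$ the minimizer is zero (up to grid resolution); for $b=1.5$ and $b=3.0$ it stays away from zero at $b - C$.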

2.2. The solution-space perspective

Based on the KKT conditions and Lagrange duality, the loss function with $L_2$ is equivalent to:

$$\begin{cases}\min L(W) \\ \text{s.t. } \|W\|_2 \leq m\end{cases}$$

The loss function with $L_1$ is equivalent to:

$$\begin{cases}\min L(W) \\ \text{s.t. } \|W\|_1 \leq m\end{cases}$$
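
The equivalence is the usual Lagrangian one (a sketch, assuming suitable convexity so that strong duality holds; $\lambda \ge 0$ is the multiplier):

```latex
\min_{\|W\|_1 \leq m} L(W)
\quad\Longleftrightarrow\quad
\min_W \max_{\lambda \geq 0} \; L(W) + \lambda \left( \|W\|_1 - m \right)
```

At the optimal multiplier $\lambda^*$, the inner objective differs from $L(W) + C\|W\|_1$ with $C = \lambda^*$ only by the constant $-\lambda^* m$, so the penalized and constrained problems share the same minimizer.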

(figure: the $L_2$ and $L_1$ constraint regions with the contours of $L(W)$)

The yellow areas are the solution spaces of $L_2$ and $L_1$, and the ellipses are the contours of $L(W)$.

> Since ridge regression has a circular constraint with no sharp points, this intersection will not generally occur on an axis, and so the ridge regression coefficient estimates will be exclusively non-zero. However, the lasso constraint has corners at each of the axes, so the ellipse will often intersect the constraint region at an axis. When this occurs, one of the coefficients will equal zero.
>
> — Regularization in Machine Learning | by Prashant Gupta | Towards Data Science (medium.com)

2.3. The Bayes perspective

If $w_j$ obeys a Gaussian distribution or a Laplace distribution respectively, the log of the prior probability density function (PDF) in Bayes' rule is as follows:

$$\begin{aligned} w_j &\sim N(0, \sigma^2) &\Rightarrow\quad \log P(W) &= -\dfrac{1}{2\sigma^2}\sum_j w_j^2 + c \\ w_j &\sim \mathrm{Laplace}(0, a) &\Rightarrow\quad \log P(W) &= -\dfrac{1}{a}\sum_j |w_j| + c \end{aligned}$$

The denominator of the posterior PDF (the evidence, given by the law of total probability) is shared between the Gaussian and Laplace cases, because the training set is the same. We can see that the loss function with $L_1$ is equivalent to introducing the Laplace distribution as the prior.

(figure: the Laplace prior density, sharply peaked at $0$)

The loss function with $L_2$ is equivalent to introducing the Gaussian distribution as the prior.

(figure: the Gaussian prior density, smooth around $0$)

If we assume a Laplace prior, we believe that $W=0$ has much higher probability than any other value. By contrast, under a Gaussian prior, the values in $W \in [-\epsilon, +\epsilon]$ are assigned nearly equal probability, so small weights are shrunk but not pushed all the way to zero.
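
Writing out the MAP estimate makes the correspondence explicit (a sketch; $D$ denotes the training data and $L(W) = -\log P(D \mid W)$, with constants dropped since they do not affect the argmin):

```latex
\hat{W}_{\mathrm{MAP}}
= \arg\max_W \log P(W \mid D)
= \arg\max_W \left[ \log P(D \mid W) + \log P(W) \right]
= \arg\min_W \left[ L(W) - \log P(W) \right]
```

Under the Laplace prior, $-\log P(W) = \frac{1}{a}\sum_j |w_j| - c$, which recovers the $L_1$-penalized loss with $C = \frac{1}{a}$; under the Gaussian prior, $-\log P(W) = \frac{1}{2\sigma^2}\sum_j w_j^2 - c$, which recovers the $L_2$-penalized loss with $C = \frac{1}{2\sigma^2}$.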