Regularization and Sparsity


1. Why do we need to make the parameters sparse?

The sparsity of the parameters means that most of the parameters are equal to 0. Now, imagine a model so complicated that it can approximate most of the training samples, so that even the anomalous, noisy samples are learned by the model. This is called overfitting, which means the model has high variance across different training sets. An overfitted model will not perform well during validation, so we need to keep our model from overfitting. Then the question comes: "How can we prevent our model from overfitting?" There are many methods, such as collecting more training data, feature selection, adjusting model complexity, dropout, cross-validation, and the regularization introduced in this article.

1.1. Why can sparsity of the parameters prevent models from overfitting?

As we all know, overfitting means the model fits the training set with too much flexibility. Only a sufficiently complex model has the capacity to approximate most of the training samples, and a complex model usually means a large number of trainable parameters. If the parameter matrix is sparse, the number of trainable parameters is not decreased, but many of them are equal to zero. If a parameter is equal to 0, the corresponding connection between neurons loses its ability to transmit activation information. This means a lot:

  1. The features feeding these invalid connections will not be sent to the next layer. If we can invalidate connections selectively, we can drop the redundant features.
  2. The feature vectors sent to the next layer will not contain noisy information from the previous layer, so the representational power of the model is weakened overall. Generally speaking, making the parameters sparse makes the model simpler, so that both the feature-extracting ability and the feature representation are weakened. In this way, we can prevent models from overfitting.

1.2. How can we make parameters sparse?

This question brings us to regularization. To be precise, only $L_1$ regularization encourages parameter sparsity; $L_2$ regularization does not. The reason is explained in the following sections.

2. $L_1$ Regularization and $L_2$ Regularization

The loss function with $L_1$ regularization is as follows:

$$L(W) + C\|W\|_1 = L(W) + C\sum_{j=1}^{p}|w_j|$$

where $p$ denotes the number of parameters of the whole model, and $C$ is a tuning parameter that balances the regularization against the original objective.

This comes from an old regression method called LASSO, which was originally designed for feature selection and model shrinkage. The feature-selection behavior is also due to the $L_1$ regularization.

The loss function with $L_2$ regularization is as follows:

$$L(W) + C\|W\|_2^2 = L(W) + C\sum_{j=1}^{p}w_j^2$$

This comes from ridge regression, which uses $L_2$ regularization to constrain the model's complexity.
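
As a minimal sketch of the two objectives above (a toy least-squares $L(W)$ on synthetic data; all names and the data-generating setup are illustrative, not from the original post):

```python
import numpy as np

# Toy data: a sparse ground-truth weight vector and noisy observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

def data_loss(w):
    # L(W): mean squared error of the linear model.
    return np.mean((X @ w - y) ** 2) / 2

def l1_penalized(w, C):
    # L(W) + C * sum_j |w_j|   (LASSO-style objective)
    return data_loss(w) + C * np.sum(np.abs(w))

def l2_penalized(w, C):
    # L(W) + C * sum_j w_j^2   (ridge-style objective)
    return data_loss(w) + C * np.sum(w ** 2)

w = rng.normal(size=5)
print(l1_penalized(w, C=0.1), l2_penalized(w, C=0.1))
```

Both penalties add a positive term for any nonzero $W$; the difference in how they shape the minimizer is the subject of the next sections.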

2.1. The function-addition perspective

(figure: the curves of $L(W)$, $C|W|$, and their sum near $W=0$)

If the absolute value of the derivative of $L(W)$ is less than $C$, the gradient of $L(W) + C|W|$ differs from that of $L(W)$ in sign. If $W<0$, the derivative of $C|W|$ is $-C$, so even when $L'(W)$ is positive but less than $C$, $(L(W)+C|W|)' = L'(W) - C$ is negative. If $W>0$, the derivative of $C|W|$ is $+C$, so even when $L'(W)$ is negative with $|L'(W)| < C$, $(L(W)+C|W|)' = L'(W) + C$ is positive. So when we run gradient descent, the model converges to the point $W=0$. This is why the parameters become sparse under $L_1$ regularization.

If we use $L_2$ instead, making a parameter exactly zero requires the gradient $L'(W) + 2CW$ to vanish at $W=0$, i.e. $L'(0)=0$, which hardly ever happens. Generally speaking, $L_2$ just pushes the weights close to zero instead of making them exactly zero, so it cannot lead to parameter sparsity.
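
A minimal numeric sketch of this contrast, assuming a least-squares $L(W)$ and using the standard soft-thresholding (proximal) step for the non-differentiable $L_1$ term; the data and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[[0, 3]] = [3.0, -2.0]          # sparse ground truth
y = X @ true_w + 0.1 * rng.normal(size=100)

def grad(w):
    # Gradient of the smooth squared-error part L(W).
    return X.T @ (X @ w - y) / len(y)

def soft_threshold(w, t):
    # Proximal operator of t * ||w||_1: pulls entries toward 0,
    # and sets them EXACTLY to 0 when their magnitude is below t.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

lr, C = 0.05, 0.1
w_l1 = np.zeros(10)
w_l2 = np.zeros(10)
for _ in range(2000):
    w_l1 = soft_threshold(w_l1 - lr * grad(w_l1), lr * C)  # L1 (ISTA) step
    w_l2 = w_l2 - lr * (grad(w_l2) + 2 * C * w_l2)         # plain GD with L2

print("exact zeros, L1:", int(np.sum(w_l1 == 0.0)))  # several coefficients hit 0
print("exact zeros, L2:", int(np.sum(w_l2 == 0.0)))  # L2 only shrinks, none hit 0
```

The $L_1$ run produces exact zeros on the irrelevant coordinates, while the $L_2$ run leaves every coordinate small but nonzero.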

But what confused me was this: will any loss function with $L_1$ converge to the point $W=0$? That would mean that when training completes, all of the parameters are zero! I nearly lost my mind, because the derivatives of the loss function will surely become less than $C$ near the end of training, and as soon as they do, the optimal point seems to be $W=0$.

However, this idea of mine was wrong. Consider a specific loss function $K(W)$ as a counterexample.

For this loss function, $|K'(W)|$ near $W=0$ is much greater than $C$, so the sign of $K'(W) \pm C$ does not change there. Somewhere to the left of the minimum, $|K'(W)|$ may indeed be less than $C$, but that region is too far from $W=0$ for $K(W)+C|W|$ to attain its minimum at zero.

Of course, the bigger $C$ is, the wider the range of $W$ that converges to zero. But with a properly chosen $C$, it is impossible for the loss function to force all parameters to zero.

Actually, $L_1$ regularization just pulls the parameters whose loss-function derivative magnitude is less than $C$ toward zero, and forces the subset of them that are already close to zero to converge exactly to zero.
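
A one-dimensional sketch of exactly this behavior, using a hypothetical smooth loss $K(w) = (w-b)^2/2$ whose derivative magnitude at $w=0$ is $|b|$: the minimizer of $K(w)+C|w|$ lands exactly at zero only when $|b| \le C$; otherwise it is merely shifted toward zero.

```python
import numpy as np

def minimize_1d(b, C):
    # Brute-force minimizer of K(w) = (w - b)^2 / 2 + C * |w| on a fine grid.
    grid = np.linspace(-5, 5, 200001)
    K = (grid - b) ** 2 / 2 + C * np.abs(grid)
    return grid[np.argmin(K)]

C = 1.0
for b in [0.5, 0.9, 1.5, 3.0]:
    # |K'(0)| = |b|; the minimizer is 0 only when |b| <= C,
    # otherwise it sits at b - C, pulled toward but not onto zero.
    print(f"b={b}: argmin = {minimize_1d(b, C):.4f}")
```

For $b=0.5$ and $b=0.9$ the minimizer is zero (up to grid resolution); for $b=1.5$ and $b=3.0$ it stays away from zero at $b - C$.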

2.2. The solution-space perspective

Based on the KKT conditions and Lagrange duality, the loss function with $L_2$ is equivalent to:

$$\begin{cases}\min L(W) \\ \text{s.t. } \|W\|_2 \leq m\end{cases}$$

The loss function with $L_1$ is equivalent to:

$$\begin{cases}\min L(W) \\ \text{s.t. } \|W\|_1 \leq m\end{cases}$$
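
The equivalence is the usual Lagrangian one (a sketch, assuming suitable convexity so that strong duality holds; $\lambda \ge 0$ is the multiplier):

```latex
\min_{\|W\|_1 \leq m} L(W)
\quad\Longleftrightarrow\quad
\min_W \max_{\lambda \geq 0} \; L(W) + \lambda \left( \|W\|_1 - m \right)
```

At the optimal multiplier $\lambda^*$, the inner objective differs from $L(W) + C\|W\|_1$ with $C = \lambda^*$ only by the constant $-\lambda^* m$, so the penalized and constrained problems share the same minimizer.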

(figure: the $L_2$ and $L_1$ constraint regions with the contours of $L(W)$)

The yellow areas are the solution spaces of $L_2$ and $L_1$, and the ellipses are the contours of $L(W)$.

> Since ridge regression has a circular constraint with no sharp points, this intersection will not generally occur on an axis, and so the ridge regression coefficient estimates will be exclusively non-zero. However, the lasso constraint has corners at each of the axes, so the ellipse will often intersect the constraint region at an axis. When this occurs, one of the coefficients will equal zero.
>
> — Regularization in Machine Learning | by Prashant Gupta | Towards Data Science (medium.com)

2.3. The Bayes perspective

If $w_j$ obeys a Gaussian distribution or a Laplace distribution respectively, the log of the prior probability density function (PDF) in Bayes' rule is as follows:

$$\begin{aligned} w_j &\sim N(0, \sigma^2) &\Rightarrow\quad \log P(W) &= -\dfrac{1}{2\sigma^2}\sum_j w_j^2 + c \\ w_j &\sim \mathrm{Laplace}(0, a) &\Rightarrow\quad \log P(W) &= -\dfrac{1}{a}\sum_j |w_j| + c \end{aligned}$$

The denominator of the posterior PDF (the evidence, given by the law of total probability) is shared between the Gaussian and Laplace cases, because the training set is the same. We can see that the loss function with $L_1$ is equivalent to introducing the Laplace distribution as the prior.

(figure: the Laplace prior density, sharply peaked at $0$)

The loss function with $L_2$ is equivalent to introducing the Gaussian distribution as the prior.

(figure: the Gaussian prior density, smooth around $0$)

If we assume a Laplace prior, we believe that $W=0$ has much higher probability than any other value. By contrast, under a Gaussian prior, the values in $W \in [-\epsilon, +\epsilon]$ are assigned nearly equal probability, so small weights are shrunk but not pushed all the way to zero.
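
Writing out the MAP estimate makes the correspondence explicit (a sketch; $D$ denotes the training data and $L(W) = -\log P(D \mid W)$, with constants dropped since they do not affect the argmin):

```latex
\hat{W}_{\mathrm{MAP}}
= \arg\max_W \log P(W \mid D)
= \arg\max_W \left[ \log P(D \mid W) + \log P(W) \right]
= \arg\min_W \left[ L(W) - \log P(W) \right]
```

Under the Laplace prior, $-\log P(W) = \frac{1}{a}\sum_j |w_j| - c$, which recovers the $L_1$-penalized loss with $C = \frac{1}{a}$; under the Gaussian prior, $-\log P(W) = \frac{1}{2\sigma^2}\sum_j w_j^2 - c$, which recovers the $L_2$-penalized loss with $C = \frac{1}{2\sigma^2}$.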