Notes for Machine Learning


[TBD]

The previous notes are available in CS4220 Machine Learning 1.

SVM

Bias-Variance dilemma / tradeoff

Prediction errors can be decomposed into two components: bias and variance.

  • High bias means a large gap between the predicted values and the true values, because the model is underfitting.
  • High variance means that a small disturbance in the data leads to very different predictions, because the model is overfitting.


The relationship between bias-variance and model complexity is shown in Understanding the Bias-Variance Tradeoff.

Analysis: when a model has low bias and high variance, it is usually because the model is too complex and has too many variables.

e.g. kNN: when we increase k, we decrease variance and increase bias. If we increase k too much, we no longer follow the true boundary and we observe high bias. When k = 1, it causes severe overfitting: the training error is 0 while the test error is higher.
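
A minimal sketch of this effect, using a kNN classifier on a synthetic two-moons dataset (the dataset and the values of k are illustrative, not from the course):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy two-class data.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in [1, 5, 25, 101]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # k = 1: training error 0 but a larger test error (high variance / overfitting);
    # very large k: both errors grow because the boundary is too smooth (high bias).
    print(f"k={k:3d}  train error={1 - knn.score(X_train, y_train):.2f}  "
          f"test error={1 - knn.score(X_test, y_test):.2f}")
```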

Extension:

Least squares is an unbiased estimator -> low bias

LASSO regression can shrink the regression coefficients of some variables to exactly 0, which reduces the complexity of the model -> low variance but high bias

Regularization

Regularization adds extra information to the original loss function in order to prevent overfitting and improve the generalization performance of the model. Therefore: objective function = original loss function + penalty term

Two penalty terms (regularization methods) are commonly used: the $l_1$-norm (Lasso) and the $l_2$-norm (Ridge). The left figure shows the $l_1$-norm (Lasso), the right one the $l_2$-norm (Ridge).

Why Lasso can do feature selection

Because the Lasso objective is non-smooth, the optimum of such an optimization problem lies either where the derivative is zero or at a non-differentiable point, i.e. at a corner. The "corners" are exactly the places where the coefficients of many features are 0, so Lasso gives a sparse solution and can act as feature selection. Sparsity means that many of the model's parameters are exactly 0.
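
A small sketch of this sparsity effect on synthetic regression data (the dataset and the alpha value are illustrative): Lasso sets many coefficients exactly to 0, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 5 of which actually influence the target.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients equal to 0:", int(np.sum(lasso.coef_ == 0)))  # many exact zeros
print("Ridge coefficients equal to 0:", int(np.sum(ridge.coef_ == 0)))  # typically none
```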

Why Ridge can prevent overfitting

Because the fitting process tends to keep the weights as small as possible. When the parameters are small enough, a small shift in the data does not have a large effect, i.e. the model is robust to disturbances. In every gradient-descent iteration, $\theta_j$ is multiplied by a factor smaller than 1, so it keeps shrinking.
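
A minimal sketch of this shrinking effect, assuming the loss convention $L(w)=\frac{1}{2n}||Xw-y||^2+\frac{\lambda}{2}||w||^2$ and plain gradient descent (the data and step size are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

w, lr, lam = np.zeros(3), 0.1, 0.5
for _ in range(200):
    grad_data = X.T @ (X @ w - y) / len(y)     # gradient of the data-fit term
    w = (1 - lr * lam) * w - lr * grad_data    # each step multiplies w by 0.95 < 1
print(w)  # smaller in magnitude than the unregularized least-squares solution
```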

Choosing the regularization parameter $\lambda$

Starting from 0, gradually increase $\lambda$: learn the parameters on the training set, then check the error on the test set, and repeat this process until the test-set error is minimized. In general, as $\lambda$ increases from 0, the test-set error first decreases and then increases. The purpose of cross-validation is to find the value with the lowest misclassification rate.
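
A sketch of this $\lambda$ sweep using 5-fold cross-validation (in scikit-learn the parameter is called alpha; the data and the grid are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=30, noise=20.0, random_state=0)

alphas = np.logspace(-3, 3, 13)                      # lambda grid, small to large
cv_mse = [-cross_val_score(Ridge(alpha=a), X, y,
                           scoring="neg_mean_squared_error", cv=5).mean()
          for a in alphas]
print("best lambda:", alphas[int(np.argmin(cv_mse))])  # error drops, then rises again
```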

VC-dimension

The VC-dimension is a good way to measure complexity: it is the largest number $h$ of vectors that can be separated (shattered) in all $2^h$ possible ways.

For a linear classifier: $h = d + 1$, where $d$ is the dimensionality.

But the VC-dimension is known for only a few classifiers.

e.g. $h$ for a canonical hyperplane: $h \leq \min (\lfloor \frac{R^2}{\rho^2} \rfloor , d) + 1$
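
A tiny numeric check of this bound, taking $R$ as the radius of the data, $\rho$ as the margin and $d$ as the dimensionality (the values below are made up):

```python
from math import floor

R, rho, d = 2.0, 0.5, 10                    # hypothetical radius, margin, dimensionality
h_bound = min(floor(R**2 / rho**2), d) + 1  # min(floor(16), 10) + 1
print(h_bound)                              # 11
```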

SVM - soft margin

loss function

$l(w)=\frac{1}{n}\sum_{i=1}^{n}\max(0,1-y_i(w^Tx_i-b))+\lambda||w||^2$
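
A minimal numpy sketch of this objective (average hinge loss plus an $l_2$ penalty); the function name and interface are just for illustration:

```python
import numpy as np

def soft_margin_loss(w, b, X, y, lam):
    """X has shape (n, d), y contains labels in {-1, +1}."""
    margins = 1 - y * (X @ w - b)            # 1 - y_i (w^T x_i - b)
    hinge = np.maximum(0.0, margins).mean()  # max(0, .) averaged over the n points
    return hinge + lam * np.dot(w, w)        # + lambda ||w||^2
```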

Support vectors are the points (the orange ones) that satisfy $1-y_i(w^Tx_i-b)\geq 0$, i.e. the points that contribute to the loss function. To find the support vectors explicitly, the dual problem is used. Assume that all support vectors have been found and collected in a set $S$:

$l(w)=\frac{1}{n}\sum_{i\in S}(1-y_i(w^Tx_i-b))+\lambda||w||^2$ where $S$ is the set of support vectors. Taking the derivative with respect to $w$:

$\dfrac{dl(w)}{dw}=\frac{1}{n}\sum_{i\in S}(-y_ix_i)+2\lambda w=0$

$w=\frac{1}{2n\lambda}\sum_{i\in S}y_ix_i$

Define new points $x_i'=y_ix_i$: when $y_i=1$, $x_i'=x_i$; when $y_i=-1$, reflect $x_i$ through the origin to get a new point $x_i'=-x_i$ whose label is 1. In the end all the $x'$ have label 1, and their mean is proportional to the normal vector $w$, which determines the orientation of the separating hyperplane.

$w=\frac{1}{2n\lambda}\sum_{i\in S}x_i'$
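
A small numpy sketch of this result, with a hypothetical support-vector set $S$ (the points, $n$ and $\lambda$ are made up):

```python
import numpy as np

X_S = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5]])  # hypothetical support vectors
y_S = np.array([1, -1, -1])                              # their labels
n, lam = 10, 0.1                                         # total sample size and lambda

x_prime = y_S[:, None] * X_S              # flip the y = -1 points through the origin
w = x_prime.sum(axis=0) / (2 * n * lam)   # w = 1/(2 n lambda) * sum_{i in S} y_i x_i
print(w)                                  # direction of the separating hyperplane's normal
```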

Cross Validation

Cross Validation allows us to compare different machine learning methods and get a sense of how well they will work in practice.
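
A short sketch of this comparison with scikit-learn: 5-fold cross-validation of two classifiers on the same dataset (the dataset and models are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

for name, model in [("SVM", SVC()),
                    ("logistic regression", LogisticRegression(max_iter=5000))]:
    scores = cross_val_score(model, X, y, cv=5)   # accuracy on each of the 5 folds
    print(name, scores.mean())                    # average CV accuracy
```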

ROC: receiver operating characteristic

Instead of being overwhelmed with confusion matrices, ROC graphs provide a simple way to summarize all of the information.

Confusion matrix

This matrix makes it easy to see whether the machine has confused two different classes. The table reports the number of false positives, false negatives, true positives and true negatives, which allows not only an overall accuracy analysis but also a more detailed one, e.g. for a heart-disease classification example.


The y-axis of the ROC is True Positive Rate, which is equal to the sensitivity. The x-axis of the ROC is False Positive Rate, which is equal to 1-specificity.

$TPR = \frac{TP}{TP+FN}$

$FPR = \frac{FP}{TN+FP}$
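
A small sketch computing the confusion matrix and these two rates (the labels below are a made-up example):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # actual classes
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])   # predicted classes

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)   # sensitivity, the y-axis of the ROC
fpr = fp / (tn + fp)   # 1 - specificity, the x-axis of the ROC
print(tpr, fpr)
```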


The AUC-ROC curve is a performance measure for classification problems at various threshold settings. ROC is a probability curve and AUC (Area Under the Curve) represents the degree of separability. An excellent model has an AUC close to 1, meaning it separates the classes well; a poor model has an AUC close to 0, meaning its separability is the worst; an AUC of 0.5 means the model has no ability to separate the classes. In the accompanying figure, A is better than B, which is better than C.
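
A sketch of computing the ROC curve and the AUC from predicted probabilities (the data and model are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)   # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_te, scores))       # close to 1 here; 0.5 would mean no separation
```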

Hierarchical Clustering

e.g. Heatmap

Hierarchical clustering orders the rows and/or the columns based on similarity. Among the different combinations, figure out which two genes are the most similar and merge them into a cluster. Repeat this until only 2 clusters are left, and finally merge those as well.

Hierarchical clustering is usually accompanied by a "dendrogram", which indicates both the similarity and the order in which the clusters were formed. Cluster #1 was formed first and is the most similar; it has the shortest branch. Cluster #2 was second and is the second most similar; it has the second shortest branch. Cluster #3, which contains all of the genes, was formed last; it has the longest branch. The height of the branches in the dendrogram therefore shows which clusters are the most similar.


Criterion for the similarity decision: Euclidean distance.


Different ways to compare two clusters

We can compare the target point to one of the following (see the scipy sketch after this list):

  1. The average of each cluster (this is called the centroid)

  2. The closest point in each cluster (this is called single-linkage)

  3. The furthest point in each cluster (this is called complete-linkage)

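A sketch of hierarchical clustering with scipy, using Euclidean distance and the three linkage rules above (the data is synthetic):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (5, 2)),    # two small synthetic groups of points
               rng.normal(3, 0.5, (5, 2))])

for method in ["centroid", "single", "complete"]:
    Z = linkage(X, method=method, metric="euclidean")  # merge order and merge heights
    print(method, Z[-1, 2])                            # height of the last (longest) branch

# dendrogram(linkage(X, method="single"))  # draws the dendrogram (needs matplotlib)
```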

K-means clustering

  1. Select the number of clusters you want to identify in the data. This is the "K" in "K-means clustering". If we select K = 3, that is to say we want to identify 3 clusters.

  2. Randomly select K distinct data points.

  3. Measure the distance between the $1^{st}$ point and the three initial clusters.

  4. Assign the $1^{st}$ point to the nearest cluster. In this case, the nearest cluster is the blue cluster.

  5. Calculate the mean of each cluster.

K-means clustering knows that the $2^{nd}$ clustering is the best clustering so far, but it will do a few more clusterings to verify whether it is the best overall.
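
A minimal from-scratch sketch of the loop in steps 1-5 (empty clusters and the repeated restarts mentioned above are not handled here):

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 2: K random data points
    for _ in range(n_iter):
        # steps 3-4: Euclidean distance to each center, assign to the nearest cluster
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        # step 5: recompute the mean of each cluster
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers
```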


Choice of K

Each time we add a new cluster, the total variation within each cluster becomes smaller than before, and when there is only one point per cluster, the variation = 0.

Pick K by finding the "elbow" in the plot.

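A sketch of this elbow heuristic with scikit-learn: run K-means for several values of K and look at the total within-cluster variation (inertia); the synthetic data below has 3 true clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, inertia)   # drops sharply up to K = 3, then flattens: the elbow
```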

Difference from hierarchical clustering

K-means clustering specifically tries to put the data into the number of clusters you tell it to. Hierarchical clustering instead tells you, pairwise, which two things are most similar.

2D situation

Pick random points and use the Euclidean distance to decide which cluster each point is closest to. Then we calculate the center of each cluster and recluster.


Heatmap situation

Plot the values on the x-y plane and do it like before.