[TBD]
The previous notes are available in CS4220 Machine Learning 1.
SVM
Bias-Variance dilemma / tradeoff
Prediction errors can be decomposed into two components: bias and variance.
- Large bias means the gap between the predicted values and the true values is large, because the model is underfitting.
- Large variance means that a small disturbance in the data leads to a large change in the predictions, because the model is overfitting.
The relationship between bias-variance and model complexity is shown in Understanding the Bias-Variance Tradeoff.
Analysis: when a model has low bias and high variance, it is usually because the model is too complex and has too many variables.
e.g. kNN: when we increase k, we decrease variance and increase bias. If we increase k too much, we no longer follow the true boundary and we observe high bias. When k = 1, it causes severe overfitting: the training error is 0 while the test error is higher.
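As a concrete illustration, here is a minimal scikit-learn sketch of this effect; the synthetic two-moons dataset and the particular k values are illustrative choices, not from the notes.

```python
# Sketch: how k in kNN trades variance (small k) against bias (large k).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in [1, 5, 25, 125]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:3d}  train acc={knn.score(X_train, y_train):.2f}"
          f"  test acc={knn.score(X_test, y_test):.2f}")

# k = 1: train accuracy is 1.0 but test accuracy is lower (overfitting, high variance);
# very large k: the fit becomes too smooth to follow the true boundary (high bias).
```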
Extension:
Least squares is an unbiased estimator -> low bias
LASSO regression can shrink some regression coefficients to exactly 0, which reduces model complexity -> low variance but high bias
Regularization
Regularization adds extra information to the original loss function in order to prevent overfitting and improve the model's generalization performance. Therefore: objective function = original loss function + penalty term.
Two penalty terms (regularization methods) are commonly used: the $\ell_1$-norm (Lasso) and the $\ell_2$-norm (Ridge).
Left figure: the $\ell_1$-norm (Lasso) constraint; right figure: the $\ell_2$-norm (Ridge) constraint.
Why can Lasso do feature selection?
Because the Lasso objective function is non-smooth, and for such an optimization problem the optimum lies either where the derivative is zero or at a non-differentiable point, i.e. at a corner. The "corners" are exactly the points where many feature coefficients are 0. Therefore Lasso gives a sparse solution and can serve as a feature selector. Sparsity means that many of the model's parameters are 0.
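A minimal sketch of this sparsity effect, assuming scikit-learn and an illustrative synthetic dataset and alpha value:

```python
# Sketch: Lasso drives some coefficients exactly to 0, Ridge only shrinks them.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients that are exactly 0:", int(np.sum(lasso.coef_ == 0)))
print("Ridge coefficients that are exactly 0:", int(np.sum(ridge.coef_ == 0)))
# Lasso typically zeros out many of the uninformative features (feature selection);
# Ridge typically leaves all coefficients non-zero, just small.
```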
Why can Ridge prevent overfitting?
Because the fitting process tends to keep the weights as small as possible. When the parameters are small enough, a shift in the data does not have a large effect, i.e. the model is robust to disturbances.
At each iteration the weights are multiplied by a factor smaller than 1, so they keep shrinking.
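In equations (a standard derivation, assuming plain gradient descent with learning rate $\eta$ on a loss with an $\ell_2$ penalty $\frac{\lambda}{2}\|\mathbf{w}\|^2$):

$$\mathbf{w} \leftarrow \mathbf{w} - \eta\big(\nabla L(\mathbf{w}) + \lambda\mathbf{w}\big) = (1 - \eta\lambda)\,\mathbf{w} - \eta\,\nabla L(\mathbf{w}),$$

so for small $\eta$ the weights are multiplied by the factor $(1 - \eta\lambda) < 1$ at every iteration ("weight decay").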
Choosing the regularization parameter
Starting from 0, gradually increase the regularization parameter, learn the model parameters on the training set, and then check the error on the test set; repeat this process until the test-set error is smallest. In general, as the parameter increases from 0, the test-set error first decreases and then increases. The purpose of cross-validation is to find the position with the smallest misclassification rate.
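A minimal sketch of this sweep, assuming scikit-learn Ridge regression on a held-out split; the dataset and the alpha grid are illustrative choices:

```python
# Sketch: sweep the regularization strength and pick the value with the
# smallest held-out error.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=50, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

alphas = np.logspace(-3, 3, 13)          # increase the parameter from ~0 upwards
errors = []
for a in alphas:
    model = Ridge(alpha=a).fit(X_train, y_train)
    errors.append(mean_squared_error(y_test, model.predict(X_test)))

print("best alpha:", alphas[int(np.argmin(errors))])
# The held-out error typically first decreases and then increases again.
```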
VC-dimension
The VC-dimension is a good way to measure complexity: it is the largest number of vectors that can be separated (shattered) in all possible ways.
For a linear classifier: $h = d + 1$, where $d$ is the dimensionality.
But the VC-dimension is known for only a few classifiers.
e.g. $h$ for a canonical hyperplane, with the data contained in a sphere of radius $R$: $h \le \min\!\left(R^2\|\mathbf{w}\|^2,\, d\right) + 1$
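For instance, in the plane ($d = 2$) a line can shatter 3 points in general position (all $2^3 = 8$ labelings are separable), but no set of 4 points (the XOR labeling cannot be separated), so $h = 3 = d + 1$.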
SVM - soft margin
Loss function (soft margin): $\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$ subject to $y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$; equivalently, the hinge-loss form $\tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \max\big(0,\, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big)$.
A support vector is a point (the orange ones) that satisfies $y_i(\mathbf{w}^\top\mathbf{x}_i + b) = 1$ or $y_i(\mathbf{w}^\top\mathbf{x}_i + b) < 1$, i.e. a point that makes a contribution to the loss function. To find the support vectors explicitly, use the dual problem. Assume that all support vectors have been found; then
$\mathbf{w} = \sum_{i \in SV} \alpha_i y_i \mathbf{x}_i$, where $SV$ is the set of support-vector points and $\alpha_i > 0$ for $i \in SV$.
Define new input points: when $y_i = 1$, keep $\mathbf{x}_i$ as it is; when $y_i = -1$, reflect $\mathbf{x}_i$ through the origin, which gives a new point $y_i\mathbf{x}_i$ with label 1. In the end all points have label 1, and the weighted average of these points $y_i\mathbf{x}_i$ is proportional to the normal vector $\mathbf{w}$, which determines the orientation of the separating hyperplane.
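A minimal sketch of these quantities with scikit-learn; the blob dataset and C = 1.0 are illustrative choices, and `coef_` is only available for the linear kernel:

```python
# Sketch: fit a linear soft-margin SVM, count the support vectors, and check
# that the normal vector w equals the dual expansion over support vectors.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
svm = SVC(kernel="linear", C=1.0).fit(X, y)

print("number of support vectors:", len(svm.support_))

# dual_coef_ stores alpha_i * y_i for each support vector, so
w_from_dual = svm.dual_coef_ @ svm.support_vectors_
print(np.allclose(w_from_dual, svm.coef_))   # True: w = sum_i alpha_i y_i x_i
```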
Cross Validation
Cross Validation allows us to compare different machine learning methods and get a sense of how well they will work in practice.
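A minimal sketch, assuming scikit-learn; the two models and the dataset are illustrative choices:

```python
# Sketch: 5-fold cross-validation to compare two classifiers on the same data.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

for name, model in [("logistic regression", LogisticRegression(max_iter=5000)),
                    ("linear SVM", SVC(kernel="linear", C=1.0))]:
    scores = cross_val_score(model, X, y, cv=5)   # one accuracy score per fold
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```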
ROC: receiver operating characteristic
Instead of being overwhelmed with confusion matrices, ROC graphs provide a simple way to summarize all of the information.
Confusion matrix
This matrix makes it easy to see whether the machine has confused two different classes. The table reports the number of false positives, false negatives, true positives and true negatives. This allows not only an overall accuracy analysis but also a more detailed one, e.g. for a heart-disease classifier.
The y-axis of the ROC is True Positive Rate, which is equal to the sensitivity. The x-axis of the ROC is False Positive Rate, which is equal to 1-specificity.
The AUC-ROC curve is a performance measure for classification problems at various threshold settings. ROC is a probability curve and AUC (Area Under the Curve) represents the degree or measure of separability. An excellent model has an AUC close to 1, meaning it separates the classes well; a poor model has an AUC close to 0, meaning its separability is worst. An AUC of 0.5 means the model has no ability to separate the classes. In the figure below, A is better than B, which is better than C.
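A minimal sketch of computing an ROC curve and its AUC, assuming scikit-learn; the dataset and the model are illustrative choices:

```python
# Sketch: ROC points (FPR, TPR) over all thresholds, summarized by the AUC.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # predicted P(class = 1)

fpr, tpr, thresholds = roc_curve(y_test, scores)  # x-axis: FPR, y-axis: TPR
print("AUC:", roc_auc_score(y_test, scores))      # close to 1 = good separability
```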
Hierarchical Clustering
e.g. Heatmap
Hierarchical clustering orders the rows and/or the columns based on similarity. Among all the combinations, figure out which two genes are the most similar and merge them into a cluster; then repeat, treating the new cluster as a single item. Finally, when all we have left are 2 clusters, we merge them.
Hierarchical clustering is usually accompanied by a "dendrogram", which indicates both the similarity and the order in which the clusters were formed. Cluster #1 was formed first and is the most similar; it has the shortest branch. Cluster #2 was second and is the second most similar; it has the second shortest branch. Cluster #3, which contains all of the genes, was formed last; it has the longest branch. The height of the branches in the dendrogram therefore shows how similar the clusters are.
Criterion for the similarity decision: Euclidean distance.
Different ways to compare two clusters
We can compare the target point to (see the SciPy sketch after this list):
- The average of each cluster (this is called the centroid)
- The closest point in each cluster (this is called single-linkage)
- The furthest point in each cluster (this is called complete-linkage)
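A minimal SciPy sketch; the random data and the choice of complete linkage are illustrative assumptions:

```python
# Sketch: hierarchical clustering with Euclidean distance, plus the dendrogram
# that records the merge order and branch heights.
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))                 # 10 "genes" measured in 4 samples

Z = linkage(X, method="complete", metric="euclidean")  # or "single", "centroid"
labels = fcluster(Z, t=2, criterion="maxclust")        # cut the tree into 2 clusters
print(labels)

dendrogram(Z)                                # shorter branches = more similar
plt.show()
```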
K-means clustering
- Select the number of clusters you want to identify in the data. This is the "K" in "K-means clustering". If we select K = 3, that is to say we want to identify 3 clusters.
- Randomly select K distinct data points as the initial clusters.
- Measure the distance between each point and the three initial clusters.
- Assign the point to the nearest cluster. In this case, the nearest cluster is the blue cluster.
- Calculate the mean of each cluster, then repeat the assignment step with these means until the clusters stop changing.
K-means clustering keeps track of the best clustering found so far, but it will do a few more rounds with different random starting points to verify whether it is the best overall.
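A minimal sketch with scikit-learn; the blob dataset and K = 3 are illustrative choices, and `n_init` is what makes it retry several random starting points and keep the best result:

```python
# Sketch: K-means with K = 3, restarted from 10 random initializations;
# the run with the lowest within-cluster variation (inertia) is kept.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [list(kmeans.labels_).count(c) for c in range(3)])
print("total within-cluster variation (inertia):", kmeans.inertia_)
```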
Choice of K
Each time we add a new cluster, the total variation within the clusters is smaller than before; when there is only one point per cluster, the variation = 0.
Pick K by finding the "elbow" in the plot.
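A minimal sketch of the elbow plot, continuing the illustrative blob dataset from above; the K range is an illustrative choice:

```python
# Sketch: total within-cluster variation (inertia) for K = 1..10;
# pick K at the "elbow", where the curve stops dropping sharply.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("total within-cluster variation")
plt.show()
```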
Difference from hierarchical clustering
K-means clustering specifically tries to put the data into the number of clusters you tell it to use, whereas hierarchical clustering tells you, pairwise, which two things are most similar.
2D situation
Pick random points and use the Euclidean distance to assign each point to its nearest cluster. Then calculate the center of each cluster and re-cluster.
Heatmap situation
Plot the values on the x-y plane and do it like before.