Why does spectral clustering work well?
First, k-means has a basic limitation: it can only separate clusters that are linearly separable, so it handles non-convex shapes poorly. Take the figure below as an example:
If we run k-means on such non-convex shapes, the result is poor, because k-means assigns points to clusters purely by distance and cannot capture any other notion of similarity. Many algorithms have been proposed to address this, kernel clustering among them. Kernel clustering first maps the original data points into a high-dimensional space: points that are not linearly separable in the low-dimensional space may become linearly separable there, or at least easier to separate, because data in a high-dimensional space is very sparse and spread out, with much of it concentrated near the corners (see www.visiondummy.com/2014/ ... e… for details). So points that no hyperplane could separate in the original space may become separable after the mapping. Another approach is spectral clustering, which is also widely used; a quick comparison sketch follows, and after it one concrete spectral clustering algorithm (spectral clustering is really a family of algorithms).
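Here is a minimal sketch, assuming scikit-learn is available, comparing plain k-means with spectral clustering on a non-convex two-moons dataset; the dataset and parameter choices (noise level, number of neighbors) are my own illustration, not from the original post.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: a non-convex, linearly inseparable dataset.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# Plain k-means partitions by Euclidean distance to centroids,
# so it tends to cut each moon roughly in half.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering builds a similarity graph first and clusters
# in the space spanned by the Laplacian eigenvectors.
sc_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=0).fit_predict(X)

print("k-means ARI: ", adjusted_rand_score(y_true, km_labels))   # usually well below 1
print("spectral ARI:", adjusted_rand_score(y_true, sc_labels))   # usually close to 1
```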
As for the algorithm itself, the implementation details vary from version to version, but the overall recipe is the same: build the graph Laplacian $L$, take its first $k$ eigenvectors, and run ordinary k-means on the data represented in those $k$ eigenvector coordinates. Usually $L = I - D^{-1/2} A D^{-1/2}$, but for convenience of discussion the paper works with $L = D^{-1/2} A D^{-1/2}$ directly; this only turns each eigenvalue $\lambda$ into $1 - \lambda$ and does not change the eigenvectors. The graph Laplacian is defined as $L = D - W$, where $D$ is the diagonal degree matrix and $W$ is the weight matrix, i.e., the matrix of edge weights describing the whole graph. The normalized graph Laplacians are defined as
$$L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}, \qquad L_{rw} = D^{-1} L = I - D^{-1} W.$$
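To make the recipe concrete, here is a rough from-scratch sketch of this generic pipeline (Gaussian affinity, normalized matrix $D^{-1/2} A D^{-1/2}$, top-$k$ eigenvectors, row normalization, k-means). The kernel width sigma is an illustrative choice, and numpy, scipy, and scikit-learn are assumed to be available.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=0.1):
    # Affinity matrix: A_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)), zero diagonal.
    sq_dists = cdist(X, X, metric="sqeuclidean")
    A = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)

    # Degree matrix D and the normalized matrix M = D^{-1/2} A D^{-1/2}.
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    M = D_inv_sqrt @ A @ D_inv_sqrt

    # The largest eigenvalues of M correspond to the smallest eigenvalues
    # of L_sym = I - M (the 1 - lambda relation mentioned above).
    eigvals, eigvecs = np.linalg.eigh(M)   # ascending eigenvalue order
    Y = eigvecs[:, -k:]                    # k leading eigenvectors as columns

    # Normalize each row to unit length, then run ordinary k-means on the rows.
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y)

# Example usage on the two-moons data from the previous sketch:
# labels = spectral_clustering(X, k=2, sigma=0.1)
```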
$L_{sym}$ is symmetric, while $L_{rw}$ is tied to a random walk on the graph: $D^{-1}W$ is exactly the transition matrix of a random walk over the vertices, which is where the name comes from.

But why does building a new matrix $Y$ from the first $k$ eigenvectors of $L$, and then running k-means on $Y$, manage to separate the non-convex shapes above? Intuitively it does not seem to map the original features into a high-dimensional space the way kernel clustering does. Below is an answer someone gave to why spectral clustering works so well:

Spectral clustering works by first transforming the data from Cartesian space into similarity space and then clustering in similarity space. The original data is projected into the new coordinate space which encodes information about how nearby data points are. The similarity transformation reduces the dimensionality of space and, loosely speaking, pre-clusters the data into orthogonal dimensions. This pre-clustering is non-linear and allows for arbitrarily connected non-convex geometries, which is the main advantage of spectral clustering.

This paragraph says that building the Laplacian matrix requires building an adjacency matrix, or rather an affinity matrix, so using the Laplacian $L$ essentially already transforms the data from Cartesian space into a similarity space. In other words, the original points are remapped into a new coordinate system that encodes the pairwise similarities between points. This mapping is non-linear, which is how spectral clustering achieves the same effect as kernel clustering on clusters that are not linearly separable.

The answer continues:

The mapping from Cartesian space to similarity space is facilitated by the creation and diagonalization of the similarity matrix. In the case where you have k spatially separated, well defined clusters, regardless of the geometrical shape of the cluster, the resulting similarity matrix is block diagonal. Each block will correspond to a different cluster. When you stack the lowest k eigenvectors of this matrix as columns in a new matrix and normalize them, the rows of the matrix are the new coordinates for each data point in the new space. Ignoring degeneracy, if you inspect these new coordinates you will see that the data lies along each of the axes of your new space. These are the coordinates which are used to do the clustering and assign the original data cluster labels. The k-means part is run because the eigenvectors can be degenerate, and the clusters don't have to be so cleanly separated. The eigenvectors span the linear space defined by the clusters. But the clusters could sit at any coordinates in the space as long as they are rotated 90 degrees from each other relative to the origin. On a side note: up to the normalization steps, the similarity transformation procedure is the same one would do if applying kernel PCA [2]. If we don't use a kernel, we get linearly separated data. Non-linear clusters can't be separated this way. Using a kernel lets us separate clustered non-convex groups using the kernel to define what "close" means. To understand the similarity transformation better, you can read a little kernel PCA to get insight into the procedure. Wikipedia has a good example using a Gaussian kernel.

In fact, in the algorithm above, the entries of $A$ are exactly the Gaussian kernel values between pairs of points, so $A$ is the same matrix as the kernel matrix used in kernel clustering. Perhaps this also explains, from another angle, why decomposing the Laplacian matrix gives good clustering results.

There is also a paper arguing that kernel clustering and spectral clustering are essentially the same thing, "Kernel k-means, Spectral Clustering and Normalized Cuts". I have not finished reading it yet; I will update this post once I have.
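To see the block-diagonal argument in numbers, here is a toy illustration (my own construction, not from the quoted answer): an idealized affinity matrix with two blocks, whose leading eigenvectors turn the points of each cluster into identical, mutually orthogonal coordinates, which is exactly the situation where k-means separates them trivially.

```python
import numpy as np

# Idealized affinity: points 0-2 form one cluster, points 3-5 another;
# similarity is 1 inside a cluster and 0 across clusters (block diagonal).
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
np.fill_diagonal(A, 0.0)

d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
M = D_inv_sqrt @ A @ D_inv_sqrt            # D^{-1/2} A D^{-1/2}

eigvals, eigvecs = np.linalg.eigh(M)
Y = eigvecs[:, -2:]                        # two leading eigenvectors
Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)

# Rows of Y are the new coordinates: rows from the same block coincide,
# and the two blocks end up on directions 90 degrees apart, as the
# quoted answer describes.
print(np.round(Y, 3))
```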