This time we look at cluster analysis in RStudio: K-Means.
Cluster Analysis: K-Means Clustering
K-Means is a fast clustering algorithm that partitions the data into K clusters (K must be specified in advance). The idea is:
- Randomly pick K center points
- Assign each sample to the cluster of its nearest center
- Recompute the center of each cluster; these become the new K centers
- Repeat steps 2 and 3 until the center positions no longer change
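The loop above can be traced by hand on a toy one-dimensional dataset (the numbers below are purely illustrative); one full iteration looks like this:

```r
# toy 1-D data and K = 2 initial centers (illustrative values)
x = c(1, 2, 3, 10, 11, 12)
centers = c(1, 10)

# step 2: assign each point to its nearest center
groups = sapply(x, function(p) which.min(abs(p - centers)))

# step 3: recompute each center as the mean of its group
centers_new = tapply(x, groups, mean)
centers_new  # 2 and 11; another pass would not move them, so we stop
```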
R's built-in kmeans function
The {stats}::kmeans function implements the K-Means clustering algorithm.
# example: cluster the first four columns of iris into 2 groups
km2 = kmeans(iris[,1:4], centers = 2)
# inspect the clustering result
km2
# the k center positions
km2$centers
# the cluster each sample was assigned to
km2$cluster
# plot all samples (first two dimensions), colored by cluster
plot(iris[,1:2], col = km2$cluster + 1)
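Note that kmeans starts from random centers, so two runs can give different (and differently labeled) results. A common precaution, sketched below, is to fix the random seed and use the nstart argument so that kmeans keeps the best of several random starts:

```r
set.seed(42)  # fix the seed so the result is reproducible
# nstart = 25 runs the algorithm from 25 random starts and keeps the best fit
km3 = kmeans(iris[, 1:4], centers = 3, nstart = 25)
# total within-cluster sum of squares; smaller means tighter clusters
km3$tot.withinss
```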
Writing our own K-Means function
In this section we write our own K-Means function to understand the flow of the algorithm more deeply. Besides returning the k center positions and the k groups, we also want to plot the current centers at every iteration, so that the algorithm can be shown dynamically.
The step-by-step procedure:
# two input arguments: input_data, k
input_data = iris[,1:4]
k = 3
# convert input_data into matrix
mat = as.matrix(input_data)
# randomly choose k rows as the k initial centers
centers = mat[sample(nrow(mat),k), ]
# create a dist_mat to store the distance from each row in mat to each row in centers
dist_mat = matrix(NA, nrow=nrow(mat), ncol=k)
# this function calculates the Euclidean distance between x and y
euc_dist = function(x,y){
  x = as.vector(x)
  y = as.vector(y)
  sqrt( sum((x-y)^2) )
}
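A quick sanity check of the distance function (redefined here so the snippet runs on its own): a 3-4-5 right triangle should give a distance of exactly 5.

```r
# Euclidean distance, as defined above
euc_dist = function(x, y){
  x = as.vector(x)
  y = as.vector(y)
  sqrt(sum((x - y)^2))
}
# sanity check: the 3-4-5 right triangle
euc_dist(c(0, 0), c(3, 4))  # 5
```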
# calculate the values in dist_mat
for(i in 1:nrow(mat)){
  for(j in 1:k){
    dist_mat[i,j] = euc_dist(mat[i,], centers[j,])
  }
}
# get the clustering
groups = apply(dist_mat, 1, which.min)
# calculate the k new centers
centersNew = matrix(NA, nrow = k, ncol = ncol(mat))
for(i in 1:k){
  # drop = FALSE keeps a single-row cluster as a matrix so apply() still works
  centersNew[i,] = apply(mat[groups == i, , drop = FALSE], 2, mean)
}
# calculate the difference between old centers and new centers
diff_centers = apply(centersNew - centers, 1, function(x) sqrt(sum(x^2)) )
# set the error
err = 1e-5
# check whether all k difference are small enough
all(diff_centers < err)
# update the centers
centers = centersNew
# repeat the steps above until "all(diff_centers < err)" is TRUE
Exercise
Wrap the steps above into a function, and add a plot of the centers' positions at each iteration (with an option to turn this off).
Requirements:
kmeans_clustering(input_data, centers, err=1e-6, visual=TRUE)
- The centers argument can be a single number (representing k) or a matrix (the coordinates of the k centers)
- err is the allowed tolerance: when the change between consecutive iterations falls below err, the algorithm is considered converged
- Return value: a list containing the converged clusters and centers
kmeans_clustering = function(input_data, centers, err=1e-6, visual=TRUE){
  # convert input_data into a matrix
  mat = as.matrix(input_data)
  if(length(centers) == 1){
    k = centers
    # randomly choose k rows as the k initial centers
    centers = mat[sample(nrow(mat), k), ]
  }else{
    k = nrow(centers)
  }
  # dist_mat stores the distance from each row of mat to each center
  dist_mat = matrix(NA, nrow=nrow(mat), ncol=k)
  repeat{
    # calculate the values in dist_mat
    for(i in 1:nrow(mat)){
      for(j in 1:k){
        dist_mat[i,j] = euc_dist(mat[i,], centers[j,])
      }
    }
    # assign each row to its nearest center
    groups = apply(dist_mat, 1, which.min)
    # calculate the k new centers
    centersNew = matrix(NA, nrow = k, ncol = ncol(mat))
    for(i in 1:k){
      # drop = FALSE keeps a single-row cluster as a matrix so apply() still works
      centersNew[i,] = apply(mat[groups == i, , drop = FALSE], 2, mean)
    }
    # calculate the difference between old centers and new centers
    diff_centers = apply(centersNew - centers, 1, function(x) sqrt(sum(x^2)))
    # stop once all k differences are small enough
    if(all(diff_centers < err)) break
    # visualization
    if(visual){
      plot(input_data[,1:2], col=groups, cex=0.2) # plot only the first 2 dimensions
      points(centers[,1:2], col=1:k, pch="X")
      points(centersNew[,1:2], col=1:k, pch=19)
      title(main = paste("difference:",
                         paste(round(diff_centers,6), collapse=",")))
      locator(1) # wait for a mouse click before drawing the next iteration
    }
    # update the centers
    centers = centersNew
  }
  result = list(clusters = groups, centers = centers)
  return(result)
}
# this function calculates the Euclidean distance between x and y
euc_dist = function(x,y){
  x = as.vector(x)
  y = as.vector(y)
  sqrt( sum((x-y)^2) )
}
kmeans_clustering(iris[,1:4], centers=3)
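As a rough check of the clustering quality, we can cross-tabulate a K-Means result on iris against the known Species labels (the built-in kmeans is used here so the snippet is reproducible; with k = 3, setosa typically ends up in a cluster of its own):

```r
set.seed(7)
km = kmeans(iris[, 1:4], centers = 3, nstart = 25)
# rows: clusters; columns: true species. Cluster labels are arbitrary,
# so inspect the cross-tabulation rather than comparing labels directly.
table(km$cluster, iris$Species)
```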