Visualization: K-Means Cluster Analysis in RStudio

This post covers cluster analysis in RStudio with K-Means.

Cluster analysis: K-Means Clustering

K-Means is a fast clustering algorithm that partitions the data into K clusters (K must be specified in advance). The idea:

  1. Randomly pick K center points
  2. Assign each sample to the cluster of its nearest center
  3. Compute the mean position of each cluster; these become the new K centers
  4. Repeat steps 2 and 3 until the center positions no longer change
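As a tiny illustration of steps 2 and 3 (not from the original post; six made-up 1-D points with k = 2 and initial centers 1 and 10), one iteration looks like this:

```r
# toy 1-D data and two initial centers
x = c(1, 2, 3, 8, 9, 10)
centers = c(1, 10)

# step 2: assign each point to its nearest center
groups = ifelse(abs(x - centers[1]) < abs(x - centers[2]), 1, 2)
groups   # 1 1 1 2 2 2

# step 3: recompute each center as the mean of its group
centers = c(mean(x[groups == 1]), mean(x[groups == 2]))
centers  # 2 9
```

Here a second iteration would leave the assignments and centers unchanged, so the algorithm has already converged.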

R's built-in kmeans function

R's {stats}::kmeans function implements the K-Means algorithm.

# example: cluster the first four columns of iris into 2 groups
km2 = kmeans(iris[,1:4], centers = 2)

# inspect the clustering result
km2

# the k center positions
km2$centers

# the cluster assignment of each sample
km2$cluster

# plot all samples (first two dimensions), colored by cluster
plot(iris[,1:2], col = km2$cluster + 1)
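The kmeans result also reports the total within-cluster sum of squares (tot.withinss), which is useful for choosing K. A small sketch (not part of the original post) of the common "elbow" heuristic:

```r
# choosing k with the elbow heuristic: plot the total within-cluster
# sum of squares against k and look for the bend in the curve
set.seed(1)  # kmeans starts from random centers, so fix the seed
wss = sapply(1:6, function(k)
  kmeans(iris[,1:4], centers = k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")
```

For iris the curve flattens out noticeably after k = 2 or 3, matching the fact that the data contains three species, two of which overlap.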

Writing our own K-Means function

In this section we write a K-Means function ourselves to get a deeper understanding of how the algorithm works. Besides returning the k center positions and the k groups, we also want to plot the current centers at each iteration, so the algorithm can be shown dynamically.

The workflow, built up step by step:


# two input arguments:  input_data,  k
input_data = iris[,1:4]
k = 3

# convert input_data into matrix
mat = as.matrix(input_data)

# randomly choose k rows of mat as the k initial centers
centers = mat[sample(nrow(mat),k), ]

# create a dist_mat to store the distance from each row in mat to each row in centers
dist_mat = matrix(NA, nrow=nrow(mat), ncol=k)

# this function calculates the Euclidean distance between x and y
euc_dist = function(x,y){
  x = as.vector(x)
  y = as.vector(y)
  sqrt( sum((x-y)^2) )
}

# calculate the values in dist_mat
for(i in 1:nrow(mat)){
  for(j in 1:k){
    dist_mat[i,j] = euc_dist(mat[i,], centers[j,])
  }
}

# get the clustering
groups = apply(dist_mat, 1, which.min)

# calculate the k new centers
centersNew = matrix(NA, nrow = k, ncol = ncol(mat))
for(i in 1:k){
  # drop = FALSE keeps a one-row cluster as a matrix, so colMeans still works
  centersNew[i,] = colMeans(mat[groups == i, , drop = FALSE])
}

# calculate the difference between old centers and new centers
diff_centers = apply(centersNew - centers, 1, function(x) sqrt(sum(x^2)) )

# set the error
err = 1e-5

# check whether all k difference are small enough
all(diff_centers < err)

# update the centers
centers = centersNew

# do them again until "all(diff_centers < err)" is TRUE
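As a side note (not in the original post), the double loop that fills dist_mat can also be vectorized using the identity ||a-b||^2 = ||a||^2 + ||b||^2 - 2·a·b. A sketch, with fixed example centers so the snippet is self-contained:

```r
# same setup as in the script above, but with fixed rows as centers
mat = as.matrix(iris[,1:4])
centers = mat[c(1, 51, 101), ]   # three example rows as centers

# vectorized distances: dist_mat[i,j] = ||mat[i,] - centers[j,]||
d2 = outer(rowSums(mat^2), rowSums(centers^2), "+") - 2 * mat %*% t(centers)
dist_mat = sqrt(pmax(d2, 0))     # pmax guards against tiny negative rounding
groups = apply(dist_mat, 1, which.min)
```

This computes the same dist_mat as the nested for-loops but in a handful of matrix operations, which matters for larger data sets.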

Exercise

Wrap the procedure above into a function, and add a visualization of the centers at each iteration (with an option to turn this off).

Requirements:

  • kmeans_clustering(input_data, centers, err=1e-6, visual=TRUE)
  • the centers argument can be a single number (the k) or a matrix (the coordinates of the k centers)
  • err is the allowed tolerance: the algorithm is considered converged once the change between iterations drops below err
  • return value: a list containing the converged clusters and centers

kmeans_clustering = function(input_data, centers, err=1e-6, visual=TRUE){
  
  # convert input_data into matrix
  mat = as.matrix(input_data)
  
  if(length(centers) == 1){
    k = centers
    # randomly choose k rows of mat as the k initial centers
    centers = mat[sample(nrow(mat),k), ]
  }else{
    k = nrow(centers)
  }
  
  # create a dist_mat to store the distance from each row in mat to each row in centers
  dist_mat = matrix(NA, nrow=nrow(mat), ncol=k)

  repeat{
    # calculate the values in dist_mat
    for(i in 1:nrow(mat)){
      for(j in 1:k){
        dist_mat[i,j] = euc_dist(mat[i,], centers[j,])
      }
    }
    
    # get the clustering
    groups = apply(dist_mat, 1, which.min)
    
    # calculate the k new centers
    centersNew = matrix(NA, nrow = k, ncol = ncol(mat))
    for(i in 1:k){
      # drop = FALSE keeps a one-row cluster as a matrix, so colMeans still works
      centersNew[i,] = colMeans(mat[groups == i, , drop = FALSE])
    }
    
    # calculate the difference between old centers and new centers
    diff_centers = apply(centersNew - centers, 1, function(x) sqrt(sum(x^2)) )
    
    # check whether all k difference are small enough
    if(all(diff_centers < err)) break
    
    # visualization
    if(visual){
      plot(input_data[,1:2], col=groups, cex=0.2) # plot only the first 2 dimensions
      points(centers[,1:2], col=1:k, pch="X")     # old centers
      points(centersNew[,1:2], col=1:k, pch=19)   # new centers
      title(main = paste("difference:",
                         paste(round(diff_centers,6),collapse=",")))
      locator(1)  # wait for a mouse click before the next iteration
    }
    
    # update the centers
    centers = centersNew
    
  }  
  result = list(clusters = groups, centers = centers)
  return(result)
}

# this function calculates the Euclidean distance between x and y
euc_dist = function(x,y){
  x = as.vector(x)
  y = as.vector(y)
  sqrt( sum((x-y)^2) )
}

# the initial centers are random, so set a seed for reproducible results
set.seed(1)
kmeans_clustering(iris[,1:4], centers=3)