K-Means Clustering: Vector Quantization Application (Part 3)

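These notes are compiled from Caicai's course to make the material easier to remember and understand.

The code is located at:

Case study: clustering for dimensionality reduction, vector quantization with KMeans

One of the most important applications of K-Means clustering is vector quantization (VQ) on unstructured data such as images and audio.
- Unstructured data usually takes up a lot of storage, the files themselves are large, and computation on them is slow. We want to shrink unstructured data, or simplify its structure, while preserving as much of its quality as possible.
- Vector quantization
  - Vector quantization with KMeans is essentially a form of dimensionality reduction, but its approach differs from every dimensionality reduction algorithm covered earlier.
  - Feature selection reduces dimensionality by directly picking the features that contribute most to the model.
  - PCA reduces dimensionality by aggregating information into fewer components.
  - Vector quantization compresses the amount of information at a fixed sample size: it changes neither the number of features nor the number of samples, only the number of distinct values the samples take on those features.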

For an image, the information in a picture can be clustered and represented as follows:

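This is a dataset of 40 samples, each carrying its own pair of values (x1, x2), so there are 40 distinct pairs in all. We cluster all the sample points into 4 clusters and find the four centroids. We assume each point is very similar to the centroid of its cluster, so the information it carries is approximately equal to the information carried by that centroid. We can therefore overwrite every sample with the centroid of the cluster it belongs to. This is a bit like rounding, for example using 1 in place of 0.9 and 0.8. The 40 distinct values carried by the 40 samples are thus compressed into 4: there are still 40 samples, but they take only 4 distinct values, namely the centroids of the four clusters.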
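The 40-sample idea above can be sketched in a few lines of NumPy. This is only an illustrative toy under stated assumptions: the data, the `true_centers` values, and the hand-rolled assignment/update loop are all made up for the example, and the loop is a bare-bones stand-in for K-Means rather than sklearn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 40 points (x1, x2) scattered around 4 well-separated centers.
true_centers = np.array([[0.0, 0.0], [0.0, 10.0], [10.0, 0.0], [10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in true_centers])

# Start from one point of each group, then alternate assignment and update steps.
centroids = X[[0, 10, 20, 30]].copy()
for _ in range(10):
    # Squared distance from every point to every centroid, shape (40, 4).
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(4)])

# Overwrite every sample with the centroid of its cluster.
X_quantized = centroids[labels]

n_before = len(np.unique(X, axis=0))            # distinct value pairs before
n_after = len(np.unique(X_quantized, axis=0))   # distinct value pairs after
print(X_quantized.shape, n_before, n_after)
```

The sample count and the feature count stay the same; only the number of distinct rows shrinks.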

Import the required libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin  # matches each point in one array to its nearest point in another
from sklearn.datasets import load_sample_image  # loader for the bundled sample images
from sklearn.utils import shuffle  # shuffles the rows of an array

Load and explore the data

# Load the sample image of the Summer Palace
china = load_sample_image("china.jpg")

china
"""
array([[[174, 201, 231],
        [174, 201, 231],
        [174, 201, 231],
        ...,
        [250, 251, 255],
        [250, 251, 255],
        [250, 251, 255]],

       [[172, 199, 229],
        [173, 200, 230],
        [173, 200, 230],
        ...,
        [251, 252, 255],
        [251, 252, 255],
        [251, 252, 255]],

       [[174, 201, 231],
        [174, 201, 231],
        [174, 201, 231],
        ...,
        [252, 253, 255],
        [252, 253, 255],
        [252, 253, 255]])

"""

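# Check the data type
china.dtype
# dtype('uint8')

china.shape
# height x width x color channels (the three channel values together determine a pixel's color)
# (427, 640, 3)

china[0][0]
# array([174, 201, 231], dtype=uint8)

# How many distinct colors does the image contain?
newimage = china.reshape((427 * 640,3))
newimage.shape
# (273280, 3)

import pandas as pd
# Drop duplicate rows so that each color is counted once
pd.DataFrame(newimage).drop_duplicates().shape

# The image has more than 90,000 distinct colors
# (96615, 3)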
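As a side note, the same distinct-color count can be obtained without pandas. The sketch below uses `np.unique` with `axis=0` on a tiny made-up pixel array (the five pixels are hypothetical, not taken from the image).

```python
import numpy as np

# Hypothetical flattened image: 5 pixels, one RGB triple per row.
pixels = np.array(
    [[255, 0, 0], [0, 255, 0], [255, 0, 0], [0, 0, 255], [0, 255, 0]],
    dtype=np.uint8,
)

# axis=0 treats each row (each color) as a single item.
unique_colors, counts = np.unique(pixels, axis=0, return_counts=True)
n_colors = len(unique_colors)
print(n_colors)
```

Applied to `newimage`, `len(np.unique(newimage, axis=0))` should give the same count as the `drop_duplicates` call.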

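With the exploration done, we know the image currently has more than 90,000 colors. Let's try using K-Means to compress them down to 64 without seriously degrading the image quality.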

Display the image

# Visualize the image
plt.figure(figsize=(15,15))
plt.imshow(china)  # render the 3-D array as an image

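For comparison, we also draw a vector-quantized image compressed to 64 randomly chosen colors. We randomly pick 64 sample points as random centroids, compute the distance from every sample in the original data to each of them to find each sample's nearest random centroid, and then replace each sample with that centroid. We then visualize the image in both cases to see how much information is lost. Before that, the data must be converted into a form that the K-Means class in sklearn can accept.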

Set the hyperparameter and preprocess the data

n_clusters = 64

# plt.imshow expects floating-point RGB values in [0, 1], so convert china to floats and scale it into that range
china = np.array(china, dtype=np.float64) / china.max()

# Convert china from image layout to a 2-D matrix with one row per pixel
# The shape is (rows, columns, channels), so w holds the 427 rows and h the 640 columns
w, h, d = original_shape = tuple(china.shape)
assert d == 3

image_array = np.reshape(china, (w * h, d))  # reshape changes only the layout, not the values

image_array
"""
array([[0.68235294, 0.78823529, 0.90588235],
       [0.68235294, 0.78823529, 0.90588235],
       [0.68235294, 0.78823529, 0.90588235],
       ...,
       [0.16862745, 0.19215686, 0.15294118],
       [0.05098039, 0.08235294, 0.02352941],
       [0.05882353, 0.09411765, 0.02745098]])
"""

image_array.shape
# (273280, 3)
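The preprocessing above relies on the reshape being lossless and reversible. A minimal sketch of that property, using a small random array as a hypothetical stand-in for the image and dividing by 255 (an assumption that holds when the brightest channel value is 255, as it is for `china`):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a small RGB image: 4 rows, 5 columns, 3 channels.
img = rng.integers(0, 256, size=(4, 5, 3), dtype=np.uint8)

# Scale to floats in [0, 1], then flatten to one row per pixel.
img_float = img.astype(np.float64) / 255.0
rows, cols, channels = img_float.shape
pixels = img_float.reshape(rows * cols, channels)

# Row r * cols + c of the flat matrix is the pixel at image position (r, c).
r, c = 2, 3
same_pixel = np.array_equal(pixels[r * cols + c], img_float[r, c])

# Reshaping back recovers the image exactly.
restored = pixels.reshape(rows, cols, channels)
lossless = np.array_equal(restored, img_float)
print(pixels.shape, same_pixel, lossless)
```

This is why the quantized matrix can later be reshaped back to `(w, h, d)` and shown with `plt.imshow`.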

Vector quantization of the data with K-Means

# First, use 1,000 randomly sampled pixels to find the centroids
image_array_sample = shuffle(image_array, random_state=0)[:1000]
kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(image_array_sample)

kmeans.cluster_centers_.shape
# (64, 3)

# With the centroids found, assign every pixel to its nearest existing centroid
labels = kmeans.predict(image_array)
labels.shape
# (273280,)

# Replace every sample with its centroid
image_kmeans = image_array.copy()
image_kmeans  # about 270,000 sample points (pixels) with more than 90,000 distinct colors

kmeans.cluster_centers_[labels[0]]
# array([0.73524384, 0.82021116, 0.91925591])

# Here each pixel is replaced with the center of the cluster it belongs to
for i in range(w * h):
    image_kmeans[i] = kmeans.cluster_centers_[labels[i]]

# Count the distinct colors after replacement
pd.DataFrame(image_kmeans).drop_duplicates().shape
# (64, 3)

# Restore the image structure for display
image_kmeans = image_kmeans.reshape(w,h,d)
image_kmeans.shape
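The per-pixel replacement loop above can also be written as a single NumPy indexing step, and the pair (labels, centroids) is itself a compact encoding of the quantized image. The sketch below uses random stand-ins for `kmeans.cluster_centers_` and the predicted labels, so the names `codebook` and `compact_labels` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for kmeans.cluster_centers_ and kmeans.predict(image_array).
codebook = rng.random((64, 3))
labels = rng.integers(0, 64, size=1000)

# Loop version, mirroring the code above.
looped = np.empty((labels.size, 3))
for i in range(labels.size):
    looped[i] = codebook[labels[i]]

# Vectorized version: index the codebook with the whole label array at once.
vectorized = codebook[labels]
identical = np.array_equal(looped, vectorized)

# Storing one byte per label plus the codebook is enough to rebuild the pixels.
compact_labels = labels.astype(np.uint8)  # 64 cluster ids fit in one byte
compressed_bytes = compact_labels.nbytes + codebook.nbytes
full_bytes = vectorized.nbytes
print(identical, compressed_bytes, full_bytes)
```

For the real data, `kmeans.cluster_centers_[labels]` would give the same result as the loop, usually much faster.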

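Random vector quantization of the data

# Randomly pick 64 pixels from the image to act as centroids (no K-Means fit is involved: each pixel is simply replaced by the nearest of these 64 points)
centroid_random = shuffle(image_array, random_state=0)[:n_clusters]

labels_random = pairwise_distances_argmin(centroid_random,image_array,axis=0)

# pairwise_distances_argmin(x1, x2, axis): x1 and x2 are both arrays of points
# With axis=0 it computes the distance from each sample in x2 to every sample in x1 and returns, for each sample in x2, the index of its nearest sample in x1 (one entry per row of x2)

# Replace every sample with its random centroid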
image_random = image_array.copy()

for i in range(w*h):
    image_random[i] = centroid_random[labels_random[i]]
    
# Restore the image structure
image_random = image_random.reshape(w,h,d)
image_random.shape
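To make the nearest-centroid lookup concrete, here is a small NumPy sketch of the same computation on made-up 2-D points. It illustrates the idea only; it is not sklearn's implementation, and the `centroids` and `samples` arrays are hypothetical.

```python
import numpy as np

# Hypothetical toy data: 3 candidate centroids and 5 samples in 2-D.
centroids = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
samples = np.array(
    [[1.0, 1.0], [9.0, 1.0], [0.5, 8.0], [11.0, -1.0], [-1.0, 0.0]]
)

# Euclidean distance from every sample to every centroid, shape (5, 3).
diff = samples[:, None, :] - centroids[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

# For each sample, the index of its nearest centroid.
nearest = dist.argmin(axis=1)

# Replace each sample with that centroid.
quantized = centroids[nearest]
print(nearest)
```

With the default Euclidean metric, `pairwise_distances_argmin(centroids, samples, axis=0)` should return the same indices. For a full image the broadcasting above builds a large intermediate array, so the sklearn function is the more memory-friendly choice.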

Plot the original image, the K-Means vector-quantized image, and the randomly vector-quantized image

plt.figure(figsize=(10,10))
plt.axis('off')
plt.title('Original image (96,615 colors)')
plt.imshow(china)

plt.figure(figsize=(10,10))
plt.axis('off')
plt.title('Quantized image (64 colors, K-Means)')
plt.imshow(image_kmeans)

plt.figure(figsize=(10,10))
plt.axis('off')
plt.title('Quantized image (64 colors, Random)')
plt.imshow(image_random)
plt.show()
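Besides comparing the plots by eye, the information loss can be summarized with a number. One possible choice (an assumption of this sketch, not something the course prescribes) is the mean squared error between the original pixels and their quantized versions. The helper `quantization_mse` below is hypothetical and is demonstrated on random data.

```python
import numpy as np

def quantization_mse(pixels, codebook):
    """Mean squared error after replacing each pixel with its nearest codebook color."""
    diff = pixels[:, None, :] - codebook[None, :, :]
    nearest = (diff ** 2).sum(axis=2).argmin(axis=1)
    return float(((pixels - codebook[nearest]) ** 2).mean())

rng = np.random.default_rng(0)
pixels = rng.random((500, 3))   # hypothetical stand-in for image_array
random_codebook = pixels[:8]    # 8 pixels used directly as centroids

mse_random = quantization_mse(pixels, random_codebook)
mse_exact = quantization_mse(pixels, pixels)  # every pixel is its own centroid
print(mse_random, mse_exact)
```

With the arrays already built above, the same quantity could be computed directly, for example `((china - image_kmeans) ** 2).mean()` and `((china - image_random) ** 2).mean()`; a lower value means less information was lost.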