Machine Learning: Dimensionality Reduction Algorithms

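1. Key concepts

1.1 Dimensionality reduction

Dimensionality reduction: transforming a high-dimensional data set into lower-dimensional data that is easier to understand.

Reasons for reducing dimensionality: 1. it makes the data easier to use; 2. it lowers the computational cost of many algorithms; 3. it removes noise; 4. it makes the results easier to understand.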

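1.2 Main techniques

a) PCA: Principal Component Analysis

PCA moves the data set into a new coordinate system whose axes are chosen according to the variance of the data.

The first axis is the direction of greatest variance in the original data set. The second axis is the direction of greatest variance among those orthogonal to the first. The same step is repeated for each further axis.

Advantages: it reduces the complexity of the data and identifies the most important features. Disadvantages: it can sometimes discard important information. It applies to numeric data.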
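
The axis selection described above can be checked numerically. The following is a minimal sketch, not the article's own code: it assumes a small synthetic 2-D data set (the names `rng`, `data`, `first_axis` and `second_axis` are made up for illustration) and uses `np.linalg.eigh` on the covariance matrix to obtain the axes.

```python
import numpy as np

# Hypothetical synthetic data: y depends on x, so the point cloud is elongated.
rng = np.random.default_rng(0)
x = rng.standard_normal(500)
y = x + 0.3 * rng.standard_normal(500)
data = np.column_stack([x, y])

# Center the data and compute the covariance matrix (columns are features).
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eig_vals, eig_vecs = np.linalg.eigh(cov)
first_axis = eig_vecs[:, -1]   # direction of largest variance
second_axis = eig_vecs[:, -2]  # orthogonal direction with the next largest variance

# The variance of the data projected onto each axis equals the matching eigenvalue.
var_first = np.var(centered @ first_axis, ddof=1)
var_second = np.var(centered @ second_axis, ddof=1)

print("variance along first axis :", var_first)
print("variance along second axis:", var_second)
print("dot product of the axes   :", first_axis @ second_axis)
```

The dot product of the two axes should be zero up to rounding error, which is the orthogonality requirement stated above.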

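b) FA: Factor Analysis

Factor analysis assumes that some unobserved latent variables take part in generating the observed data: the observed data is taken to be a linear combination of these latent variables plus some noise.

There may be fewer latent variables than observed variables, so finding the latent variables reduces the dimensionality of the data.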
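
To make the generative assumption concrete, here is a minimal sketch that only simulates data under it; it does not fit a factor analysis model. The sizes, the noise level and the names `latent`, `loadings` and `observed` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_latent, n_observed = 2000, 2, 6

# Assumed model: each observation is a linear mix of latent factors plus noise.
latent = rng.standard_normal((n_samples, n_latent))       # unobserved factors
loadings = rng.standard_normal((n_latent, n_observed))    # mixing weights
noise = 0.1 * rng.standard_normal((n_samples, n_observed))
observed = latent @ loadings + noise

# Eigenvalues of the covariance of the observed data, largest first.
# Only n_latent of them are large; the rest sit near the noise variance,
# which hints at the lower-dimensional structure behind the observations.
eig_vals = np.linalg.eigvalsh(np.cov(observed, rowvar=False))[::-1]
print(np.round(eig_vals, 3))
```

Six variables are observed here, but the data is driven by only two factors, so two latent variables would be enough to describe it apart from the noise.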

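c) ICA: Independent Component Analysis

ICA assumes that the data is generated by N sources and that the observations are a mixture of those sources. The sources are assumed to be statistically independent of one another.

PCA, by contrast, only assumes that the data is uncorrelated. As with factor analysis, dimensionality reduction is possible when there are fewer sources than observed variables.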
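
Independence is a stronger condition than being uncorrelated. The sketch below illustrates the difference with a constructed example; it is not an ICA implementation, and the variables `x` and `y` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)
y = x ** 2   # fully determined by x, so clearly not independent of it

# The linear correlation is close to zero because the relationship is symmetric in x.
corr_xy = np.corrcoef(x, y)[0, 1]

# The dependence shows up in higher-order statistics: if x and y were
# independent, x**2 and y would be uncorrelated as well.
corr_x2_y = np.corrcoef(x ** 2, y)[0, 1]

print("corr(x, y)    =", corr_xy)
print("corr(x**2, y) =", corr_x2_y)
```

A method that only removes correlation would treat `x` and `y` as already separated, whereas a method that requires independence would not.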

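2. PCA example

2.1 PCA pseudocode

Remove the mean
Compute the covariance matrix
Compute the eigenvalues and eigenvectors of the covariance matrix
Sort the eigenvalues in descending order
Keep the eigenvectors of the N largest eigenvalues
Transform the data into the new space spanned by these N vectors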
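
The steps above can also be written as equations. This is a sketch under assumed notation that the pseudocode does not define: $X \in \mathbb{R}^{m \times d}$ holds one sample $x_i$ per row, $\mathbf{1}$ is a column of $m$ ones, and the covariance uses the $1/(m-1)$ normalisation (the default of NumPy's `cov`).

```latex
\begin{aligned}
\bar{x} &= \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \tilde{X} = X - \mathbf{1}\,\bar{x}^{\top} \\
C &= \frac{1}{m-1}\,\tilde{X}^{\top}\tilde{X} \\
C\,v_j &= \lambda_j v_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \\
V_N &= \begin{bmatrix} v_1 & v_2 & \cdots & v_N \end{bmatrix} \\
Y &= \tilde{X}\,V_N
\end{aligned}
```

Here $Y \in \mathbb{R}^{m \times N}$ is the reduced data, and $Y V_N^{\top} + \mathbf{1}\,\bar{x}^{\top}$ maps it back to an approximation of $X$ in the original space.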

2.2 Illustrated example

a) Plot of the original data

### Generate and view the data set
import os

import numpy as np
import matplotlib.pyplot as plt

n = 1000  # number of points to create
xcord0 = []
ycord0 = []
os.makedirs('data', exist_ok=True)  # make sure the output directory exists
with open(r'data/testSet.txt', 'w') as fw:
    for i in range(n):
        r0, r1 = np.random.standard_normal(2)
        fFlyer = r0 + 9.0         # x: standard normal noise centered at 9
        tats = 1.0 * r1 + fFlyer  # y: x plus independent standard normal noise
        xcord0.append(fFlyer)
        ycord0.append(tats)
        fw.write("%f\t%f\n" % (fFlyer, tats))

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(xcord0, ycord0, marker='^', s=90)
plt.xlabel('hours of direct sunlight')
plt.ylabel('liters of water')
plt.show()

b) Dimensionality reduction code

import numpy as np
import matplotlib.pyplot as plt

def loadDataSet(fileName, delim="\t"):
    # Read a delimited text file of numbers into a 2-D array, one row per line
    with open(fileName) as f:
        stringArr = [line.strip().split(delim) for line in f if line.strip()]
    datArr = [list(map(float, line)) for line in stringArr]
    return np.array(datArr)


def pca(dataMat, topNfeat=9999999):
    dataMat = np.asarray(dataMat, dtype=float)
    # Column-wise mean
    meanVals = np.mean(dataMat, axis=0)
    # Remove the mean (center the data)
    meanRemoved = dataMat - meanVals
    # Covariance matrix; rowvar=False treats each column as a variable
    covMat = np.cov(meanRemoved, rowvar=False)
    # Eigenvalues and eigenvectors of the covariance matrix
    eigVals, eigVects = np.linalg.eig(covMat)
    eigValInd = np.argsort(eigVals)             # indices sorted from smallest to largest
    eigValInd = eigValInd[:-(topNfeat + 1):-1]  # keep the topNfeat largest, largest first
    redEigVects = eigVects[:, eigValInd]        # selected eigenvectors as columns
    lowDDataMat = meanRemoved @ redEigVects     # project the data onto the new axes
    reconMat = (lowDDataMat @ redEigVects.T) + meanVals  # map back to the original space
    return lowDDataMat, reconMat


def test():
    dataMat = loadDataSet(r'data/testSet.txt')
    lowDMat, reconMat = pca(dataMat, 1)
    print("Matrix shape:", np.shape(lowDMat))


def plotPC():
    # This data set contains 1000 data points
    dataMat = loadDataSet(r'data/testSet.txt')
    lowDMat, reconMat = pca(dataMat, 1)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # Original data (triangles) and its reconstruction from one component (red circles)
    ax.scatter(dataMat[:, 0], dataMat[:, 1], marker='^', s=90)
    ax.scatter(reconMat[:, 0], reconMat[:, 1], marker='o', s=50, color='red')
    plt.xlabel('hours of direct sunlight')
    plt.ylabel('liters of water')
    plt.show()


plotPC()
    

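
Keeping only one component discards some information. The sketch below estimates how much, under stated assumptions: it regenerates similar synthetic data instead of reading `data/testSet.txt`, uses `np.linalg.eigh` instead of the `eig` call in the listing above, and the names `explained` and `sq_err` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for data/testSet.txt: 1000 correlated 2-D points.
x = rng.standard_normal(1000) + 9.0
data = np.column_stack([x, x + rng.standard_normal(1000)])

mean_vals = data.mean(axis=0)
centered = data - mean_vals
eig_vals, eig_vecs = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eig_vals)[::-1]   # largest eigenvalue first
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

top = eig_vecs[:, :1]                # keep one principal component
low_dim = centered @ top             # shape (1000, 1)
recon = low_dim @ top.T + mean_vals  # back in the original 2-D space

# Fraction of the total variance kept by the first component.
explained = eig_vals[0] / eig_vals.sum()

# Squared reconstruction error, normalised by n - 1 as np.cov does.
# It equals the variance along the discarded component.
sq_err = np.sum((data - recon) ** 2) / (len(data) - 1)

print("fraction of variance kept:", explained)
print("reconstruction error     :", sq_err)
print("discarded eigenvalue     :", eig_vals[1])
```

The reconstruction error matches the discarded eigenvalue, so the eigenvalues indicate how much is lost for a given number of retained components.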

2.3 Dimensionality reduction on data with multiple classes

import os

import numpy as np
import matplotlib.pyplot as plt

# Uses loadDataSet and pca defined in section 2.2 b)

n = 1000  # number of points to create
xcord0 = []; ycord0 = []
xcord1 = []; ycord1 = []
xcord2 = []; ycord2 = []
os.makedirs('data', exist_ok=True)  # make sure the output directory exists
with open(r'data/testSet3.txt', 'w') as fw:
    for i in range(n):
        groupNum = int(3 * np.random.uniform())  # random group label: 0, 1 or 2
        r0, r1 = np.random.standard_normal(2)
        if groupNum == 0:
            x = r0 + 16.0
            y = 1.0 * r1 + x
            xcord0.append(x)
            ycord0.append(y)
        elif groupNum == 1:
            x = r0 + 8.0
            y = 1.0 * r1 + x
            xcord1.append(x)
            ycord1.append(y)
        elif groupNum == 2:
            x = r0 + 0.0
            y = 1.0 * r1 + x
            xcord2.append(x)
            ycord2.append(y)
        fw.write("%f\t%f\t%d\n" % (x, y, groupNum))

fig = plt.figure()
ax = fig.add_subplot(211)  # top: original 2-D data, one marker per group
ax.scatter(xcord0, ycord0, marker='^', s=90)
ax.scatter(xcord1, ycord1, marker='o', s=50, c='red')
ax.scatter(xcord2, ycord2, marker='v', s=50, c='yellow')
ax = fig.add_subplot(212)  # bottom: data projected onto the first principal component
myDat = np.asarray(loadDataSet(r'data/testSet3.txt'))
lowDDat, reconDat = pca(myDat[:, 0:2], 1)  # the third column is the label, not a feature
lowDDat = np.asarray(lowDDat)
labels = myDat[:, 2]
label0Mat = lowDDat[labels == 0]  # items with label 0
label1Mat = lowDDat[labels == 1]  # items with label 1
label2Mat = lowDDat[labels == 2]  # items with label 2

# Draw the 1-D projections along a horizontal line (y = 0)
ax.scatter(label0Mat[:, 0], np.zeros(len(label0Mat)), marker='^', s=90)
ax.scatter(label1Mat[:, 0], np.zeros(len(label1Mat)), marker='o', s=50, c='red')
ax.scatter(label2Mat[:, 0], np.zeros(len(label2Mat)), marker='v', s=50, c='yellow')
plt.show()


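3. Summary

Dimensionality reduction makes data easier to use and often removes noise from it.

It is usually applied during preprocessing, to clean the data before it is used for training.

PCA reduces dimensionality by rotating the coordinate axes to align with the directions of greatest variance in the data.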
