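1. Key concepts
1.1 Dimensionality reduction
Dimensionality reduction: transforming a high-dimensional dataset into a lower-dimensional representation that is easier to understand.
Reasons to reduce dimensionality: 1. it makes the data easier to use; 2. it lowers the computational cost of many algorithms; 3. it removes noise; 4. it makes the results easier to interpret.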
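1.2 Main techniques
a) PCA: Principal Component Analysis
PCA moves the dataset into a new coordinate system whose axes are chosen according to the variance of the data.
The first axis points in the direction of greatest variance in the original data. The second axis is the direction of greatest variance among those orthogonal to the first, and the process repeats for the remaining axes.
Pros: reduces the complexity of the data and identifies the most important features. Cons: may sometimes discard important information. Works with: numeric data.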
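To make the "direction of greatest variance" idea concrete, the sketch below (not part of the original post) projects a synthetic 2-D point cloud onto unit vectors at many angles and compares the best angle with the top eigenvector of the covariance matrix. The data, the seed, and the names `projected_variance`, `angles`, and `best_angle` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, chosen arbitrarily
x = rng.standard_normal(2000)
y = x + 0.5 * rng.standard_normal(2000)   # y is strongly correlated with x
data = np.column_stack([x, y])
centered = data - data.mean(axis=0)

def projected_variance(points, angle):
    """Variance of the points after projection onto the unit vector at `angle`."""
    direction = np.array([np.cos(angle), np.sin(angle)])
    return np.var(points @ direction, ddof=1)

# Scan a half turn in 0.1 degree steps (a direction and its opposite are equivalent).
angles = np.linspace(0.0, np.pi, 1801)
variances = np.array([projected_variance(centered, a) for a in angles])
best_angle = angles[np.argmax(variances)]

# eigh returns eigenvalues in ascending order, so the last column is the top axis.
eig_vals, eig_vecs = np.linalg.eigh(np.cov(centered, rowvar=False))
top_vec = eig_vecs[:, -1]
top_angle = np.arctan2(top_vec[1], top_vec[0]) % np.pi

print("best angle from the scan (deg):", np.degrees(best_angle))
print("angle of top eigenvector (deg):", np.degrees(top_angle))
print("largest scanned variance      :", variances.max())
print("largest eigenvalue            :", eig_vals[-1])
```

The two angles should agree to within the scan resolution, and the largest scanned variance should approach the largest eigenvalue from below. That is why the eigenvectors of the covariance matrix can serve as the new axes.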
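b) FA: Factor Analysis
Factor analysis assumes that unobserved latent variables take part in generating the observed data, and that the observed data are a linear combination of these latent variables plus some noise.
There may be fewer latent variables than observed variables, so finding the latent variables reduces the dimensionality of the data.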
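In symbols, this assumption is commonly written as follows. This is the usual textbook formulation rather than something taken from the original post, and the symbols $x$, $z$, $W$, $\mu$, and $\varepsilon$ are chosen here for illustration.

```latex
x = \mu + W z + \varepsilon,
\qquad x \in \mathbb{R}^{d},\quad z \in \mathbb{R}^{k},\quad W \in \mathbb{R}^{d \times k},\quad k < d
```

Here $x$ is one observation, $z$ holds the latent variables, $W$ maps latent variables to observed ones, $\mu$ is the mean, and $\varepsilon$ is the noise term. Describing each observation by its estimated $z$ instead of $x$ takes it from $d$ dimensions down to $k$.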
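c) ICA: Independent Component Analysis
ICA assumes that the data are generated by N sources and that each observation is a mixture of those sources, which are statistically independent of one another.
PCA, by contrast, only assumes that the data are uncorrelated. As with factor analysis, if there are fewer sources than observed variables, dimensionality reduction is achieved.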
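The sketch below illustrates only the generative assumption behind ICA and does not perform ICA itself. Two independent sources are mixed into three observed channels. The source distributions, the `mixing` matrix, and the seed are hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)                  # arbitrary seed
n_samples = 5000

# Two statistically independent, non-Gaussian sources.
sources = np.column_stack([
    rng.uniform(-1.0, 1.0, n_samples),
    rng.laplace(0.0, 1.0, n_samples),
])

# Hypothetical mixing matrix: each row says how one sensor weights the two sources.
mixing = np.array([[1.0, 0.5],
                   [0.3, 1.0],
                   [0.8, 0.8]])
observed = sources @ mixing.T                   # 3 observed channels built from 2 sources

source_corr = np.corrcoef(sources, rowvar=False)[0, 1]
observed_corr = np.corrcoef(observed, rowvar=False)[0, 1]
observed_rank = np.linalg.matrix_rank(observed)

print("correlation between the sources   :", source_corr)
print("correlation between two sensors   :", observed_corr)
print("rank of the observed data matrix  :", observed_rank)
```

The sources are nearly uncorrelated while the observed channels are clearly correlated. The observed matrix also has only as many independent directions as there are sources, which is what makes a lower-dimensional description possible.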
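2. PCA example
2.1 PCA pseudocode
Remove the mean
Compute the covariance matrix
Compute the eigenvalues and eigenvectors of the covariance matrix
Sort the eigenvalues in descending order
Keep the eigenvectors of the top N eigenvalues
Transform the data into the new space spanned by those N eigenvectors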
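The six steps above can be sketched compactly with plain NumPy arrays. This is a minimal illustration, not the implementation used in the rest of this post: the name `pca_sketch`, the use of `np.linalg.eigh` (suitable because a covariance matrix is symmetric), and the random test data are all assumptions made here.

```python
import numpy as np

def pca_sketch(data, n_components):
    """PCA on an (n_samples, n_features) array, following the six steps above."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    centered = data - mean                            # 1. remove the mean
    cov = np.cov(centered, rowvar=False)              # 2. covariance matrix
    eig_vals, eig_vecs = np.linalg.eigh(cov)          # 3. eigenvalues and eigenvectors
    order = np.argsort(eig_vals)[::-1]                # 4. sort, largest eigenvalue first
    components = eig_vecs[:, order[:n_components]]    # 5. keep the top N eigenvectors
    reduced = centered @ components                   # 6. project into the new space
    reconstructed = reduced @ components.T + mean     # map back, for comparison only
    return reduced, reconstructed

# Hypothetical 3-D test data with very little spread along one direction.
rng = np.random.default_rng(42)
points = rng.standard_normal((200, 3)) @ np.array([[2.0, 0.0, 0.0],
                                                    [1.0, 1.0, 0.0],
                                                    [0.0, 0.5, 0.1]])
reduced, reconstructed = pca_sketch(points, 2)
print(reduced.shape)         # (200, 2)
print(reconstructed.shape)   # (200, 3)
```

Keeping all three components instead of two would reproduce `points` exactly, up to floating-point error, since no direction is discarded.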
2.2 Illustrated example
a) Plot of the original data
# Generate the dataset and view it
import numpy as np
import matplotlib.pyplot as plt

n = 1000  # number of points to create
xcord0 = []
ycord0 = []
# The data/ directory must already exist before this runs
with open(r'data/testSet.txt', 'w') as fw:
    for i in range(n):
        r0, r1 = np.random.standard_normal(2)
        fFlyer = r0 + 9.0
        tats = 1.0 * r1 + fFlyer
        xcord0.append(fFlyer)
        ycord0.append(tats)
        fw.write("%f\t%f\n" % (fFlyer, tats))
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(xcord0, ycord0, marker='^', s=90)
plt.xlabel('hours of direct sunlight')
plt.ylabel('liters of water')
plt.show()
b) Dimensionality reduction code
import numpy as np
import matplotlib.pyplot as plt

def loadDataSet(fileName, delim="\t"):
    # Read a delimited text file into a matrix of floats
    with open(fileName) as f:
        stringArr = [line.strip().split(delim) for line in f]
    datArr = [list(map(float, line)) for line in stringArr]
    # np.asmatrix is used because np.mat is no longer available in NumPy 2.x
    return np.asmatrix(datArr)

def pca(dataMat, topNfeat=9999999):
    # Column-wise mean
    meanVals = np.mean(dataMat, axis=0)
    # Remove the mean (center the data)
    meanRemoved = dataMat - meanVals
    # Covariance matrix (each column is one variable)
    covMat = np.cov(meanRemoved, rowvar=False)
    # Eigenvalues and eigenvectors of the covariance matrix
    eigVals, eigVects = np.linalg.eig(np.asmatrix(covMat))
    # argsort sorts from smallest to largest
    eigValInd = np.argsort(eigVals)
    # Walk backwards to keep the indices of the topNfeat largest eigenvalues
    eigValInd = eigValInd[:-(topNfeat + 1):-1]
    # Eigenvectors reordered from largest to smallest eigenvalue
    redEigVects = eigVects[:, eigValInd]
    # Transform the data into the new dimensions
    lowDDataMat = meanRemoved * redEigVects
    # Map the reduced data back into the original space
    reconMat = (lowDDataMat * redEigVects.T) + meanVals
    return lowDDataMat, reconMat

def test():
    dataMat = loadDataSet(r'data/testSet.txt')
    lowDMat, reconMat = pca(dataMat, 1)
    print("Matrix shape:", np.shape(lowDMat))

def plotPC():
    # The dataset contains 1000 data points
    dataMat = loadDataSet(r'data/testSet.txt')
    lowDMat, reconMat = pca(dataMat, 1)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # Original data (triangles) and its reconstruction from one component (red circles)
    ax.scatter(dataMat[:, 0].flatten().A[0], dataMat[:, 1].flatten().A[0], marker='^', s=90)
    ax.scatter(reconMat[:, 0].flatten().A[0], reconMat[:, 1].flatten().A[0], marker="o", s=50, color="red")
    plt.xlabel('hours of direct sunlight')
    plt.ylabel('liters of water')
    plt.show()

plotPC()
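The red points drawn from `reconMat` all lie on a single line, so some information is lost when only one component is kept. As a rough way to quantify that loss, the sketch below measures the share of variance carried by each component and the mean squared reconstruction error. It is self-contained and regenerates data with the same recipe as the script in a) instead of reading `data/testSet.txt`. The seed and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)                 # arbitrary seed
r = rng.standard_normal((1000, 2))
x = r[:, 0] + 9.0                              # same recipe as the data-generation script
y = r[:, 1] + x
data = np.column_stack([x, y])

cov = np.cov(data, rowvar=False)               # np.cov centers the data internally
eig_vals = np.sort(np.linalg.eigvalsh(cov))[::-1]
retained = eig_vals / eig_vals.sum()           # share of total variance per component

# Reconstruct every point from the first principal component only.
centered = data - data.mean(axis=0)
_, eig_vecs = np.linalg.eigh(cov)              # ascending order: last column is the top axis
top = eig_vecs[:, [-1]]
recon = centered @ top @ top.T
mse = np.mean(np.sum((centered - recon) ** 2, axis=1))

print("variance share per component     :", retained)
print("mean squared reconstruction error:", mse)
```

The reconstruction error equals the discarded eigenvalue up to the factor (n-1)/n that comes from the sample covariance, so the eigenvalues themselves indicate how much is lost by dropping a component.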
2.3 Dimensionality reduction on data with several groups
import numpy as np
import matplotlib.pyplot as plt

# loadDataSet and pca are the functions defined in section 2.2 b)
n = 1000  # number of points to create
xcord0 = []; ycord0 = []
xcord1 = []; ycord1 = []
xcord2 = []; ycord2 = []
# The data/ directory must already exist before this runs
with open(r'data/testSet3.txt', 'w') as fw:
    for i in range(n):
        # Pick one of three groups at random
        groupNum = int(3 * np.random.uniform())
        r0, r1 = np.random.standard_normal(2)
        if groupNum == 0:
            x = r0 + 16.0
            y = 1.0 * r1 + x
            xcord0.append(x)
            ycord0.append(y)
        elif groupNum == 1:
            x = r0 + 8.0
            y = 1.0 * r1 + x
            xcord1.append(x)
            ycord1.append(y)
        elif groupNum == 2:
            x = r0 + 0.0
            y = 1.0 * r1 + x
            xcord2.append(x)
            ycord2.append(y)
        fw.write("%f\t%f\t%d\n" % (x, y, groupNum))
fig = plt.figure()
# Top panel: the three groups in the original 2-D space
ax = fig.add_subplot(211)
ax.scatter(xcord0, ycord0, marker='^', s=90)
ax.scatter(xcord1, ycord1, marker='o', s=50, c='red')
ax.scatter(xcord2, ycord2, marker='v', s=50, c='yellow')
# Bottom panel: the same points after reduction to one dimension
ax = fig.add_subplot(212)
myDat = loadDataSet(r'data/testSet3.txt')
lowDDat, reconDat = pca(myDat[:, 0:2], 1)
label0Mat = lowDDat[np.nonzero(myDat[:, 2] == 0)[0], :2]  # items with label 0
label1Mat = lowDDat[np.nonzero(myDat[:, 2] == 1)[0], :2]  # items with label 1
label2Mat = lowDDat[np.nonzero(myDat[:, 2] == 2)[0], :2]  # items with label 2
ax.scatter(label0Mat[:, 0].A1, np.zeros(label0Mat.shape[0]), marker='^', s=90)
ax.scatter(label1Mat[:, 0].A1, np.zeros(label1Mat.shape[0]), marker='o', s=50, c='red')
ax.scatter(label2Mat[:, 0].A1, np.zeros(label2Mat.shape[0]), marker='v', s=50, c='yellow')
plt.show()
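3. Summary
Dimensionality reduction makes data easier to use and often removes noise from it.
It is usually applied as a preprocessing step, cleaning the data before it is used for training.
PCA reduces dimensionality by rotating the coordinate axes to align with the directions of greatest variance in the data and keeping only the leading axes.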
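As a small check of the "rotating the axes" description, the sketch below expresses synthetic data in the basis of its covariance eigenvectors. The data and seed are assumptions made for this example. The eigenvector matrix is orthonormal, so the change of basis preserves distances, and in the new coordinates the covariance between the axes vanishes.

```python
import numpy as np

rng = np.random.default_rng(3)                  # arbitrary seed
x = rng.standard_normal(1000)
data = np.column_stack([x, x + rng.standard_normal(1000)])
centered = data - data.mean(axis=0)

eig_vals, eig_vecs = np.linalg.eigh(np.cov(centered, rowvar=False))
rotated = centered @ eig_vecs                   # coordinates in the eigenvector basis

# Approximately the identity matrix: the new axes are orthonormal.
print(np.round(eig_vecs.T @ eig_vecs, 6))
# Approximately diagonal, with the eigenvalues on the diagonal.
print(np.round(np.cov(rotated, rowvar=False), 6))
```

Dropping the column of `rotated` with the smaller variance is then exactly the reduction step.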