Python Machine Learning (4) - Training/Test Sets, Model Accuracy, Hyperparameters, and Data Normalization


Training and test datasets

When we get a dataset, we should not feed all of it to the model: a portion must be held out to evaluate the algorithm's accuracy and performance. The data "fed" to the algorithm is called the "training set"; the data held out for evaluation is the "test set".

Splitting the iris dataset into a training set and a test set

The split rule: randomly select 20% of the samples as the test set and keep the remaining 80% as the training set.

import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

# shuffle the sample indices, then carve off the first 20% as the test set
shuffle_indexes = np.random.permutation(len(y))
test_ratio = 0.2
test_size = int(test_ratio * len(y))
test_indexes = shuffle_indexes[:test_size]
train_indexes = shuffle_indexes[test_size:]

if __name__ == '__main__':
    print(test_indexes)
    print(train_indexes)

Wrapping the train/test split logic in a reusable function

import numpy as np


def train_test_split(X, y, test_ratio=0.2, seed=None):
    assert X.shape[0] == y.shape[0], "the size of X must be equal to the size of y"
    assert 0.0 < test_ratio < 1.0, "test_ratio must be in the range (0, 1)"
    if seed is not None:
        np.random.seed(seed)
    shuffled_indexes = np.random.permutation(len(X))
    test_size = int(test_ratio * len(y))
    # the first test_size shuffled indices form the test set; the rest form the training set
    test_indexes = shuffled_indexes[:test_size]
    train_indexes = shuffled_indexes[test_size:]
    X_train = X[train_indexes]
    y_train = y[train_indexes]
    X_test = X[test_indexes]
    y_test = y[test_indexes]
    return X_train, X_test, y_train, y_test

Verifying the algorithm's accuracy

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

from playML.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y)

knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(X_train, y_train)
y_predict = knn_classifier.predict(X_test)
# compute the prediction accuracy: the fraction of test samples classified correctly
res = (sum(y_predict == y_test) / len(y_test))
if __name__ == '__main__':
    print(res)
    #out: an accuracy close to 1.0; the exact value varies with the random split
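sklearn ships the same computation as sklearn.metrics.accuracy_score; a minimal equivalent of the manual calculation above:

from sklearn.metrics import accuracy_score

# same as sum(y_predict == y_test) / len(y_test)
print(accuracy_score(y_test, y_predict))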

Hyperparameters

The value of k used above is a hyperparameter. It is usually chosen from experience or determined experimentally, not picked arbitrarily. The code below tries every k from 1 to 10 and keeps the value that yields the best accuracy:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


if __name__ == '__main__':
    digits = datasets.load_digits()
    x = digits.data
    y = digits.target

    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

    best_score = 0.0
    best_k = -1
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
    print("best_k = ", best_k)
    print("best_score", best_score)
    
    # out:
    # best_k =  3
    # best_score 0.9972222222222222

The weight parameter

Suppose k is set to 3. In the scenario below there are three nearest neighbors, and by simple majority vote the green point is classified as blue. Yet the red point is by far the closest; ignoring each neighbor's distance can cause this kind of misclassification.

[figure: a green query point whose three nearest neighbors are one red point at distance 1 and two blue points at distances 3 and 4]

To account for distance, weight each neighbor's vote, here using the reciprocal of its distance as the weight:

1. Red weight:
   red = 1 / 1 = 1
2. Blue weight:
   blue = 1 / 3 + 1 / 4 = 7 / 12
red > blue, so red wins.
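A minimal sketch of this inverse-distance vote (the distances 1, 3, and 4 are read off the figure above):

# neighbor (distance, label) pairs from the example above
neighbors = [(1.0, 'red'), (3.0, 'blue'), (4.0, 'blue')]

votes = {}
for dist, label in neighbors:
    votes[label] = votes.get(label, 0.0) + 1.0 / dist  # weight each vote by 1/distance

print(votes)                       # {'red': 1.0, 'blue': 0.5833...}
print(max(votes, key=votes.get))   # red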
Searching over the weight parameter and k
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

if __name__ == '__main__':
    digits = datasets.load_digits()
    x = digits.data
    y = digits.target

    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

    best_score = 0.0
    best_k = -1
    best_mth = ''
    for mth in ["distance", "uniform"]:
        for k in range(1, 11):
            knn_clf = KNeighborsClassifier(n_neighbors=k, weights=mth)
            knn_clf.fit(X_train, y_train)
            score = knn_clf.score(X_test, y_test)
            if score > best_score:
                best_k = k
                best_score = score
                best_mth = mth
    print("best_k = ", best_k)
    print("best_score=", best_score)
    print("best_mth=", best_mth)

#out:
#best_k =  6
#best_score= 0.9944444444444445
#best_mth= uniform

The Minkowski parameter p

Euclidean distance

d(a, b) = sqrt( sum_i (a_i - b_i)^2 )

Manhattan distance

The Manhattan distance measures the path between two points along the coordinate axes. In the figure below, the red, blue, and yellow paths all have the same length:

[figure: three equal-length axis-aligned paths between two points on a grid]

d(a, b) = sum_i |a_i - b_i|

Writing the Manhattan and Euclidean formulas in a common form,

d(a, b) = ( sum_i |a_i - b_i|^1 )^(1/1)    (Manhattan)
d(a, b) = ( sum_i |a_i - b_i|^2 )^(1/2)    (Euclidean)

and generalizing the exponent to p yields the Minkowski distance, which is what the p parameter of KNeighborsClassifier selects:

d(a, b) = ( sum_i |a_i - b_i|^p )^(1/p)
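As a quick numerical check, the three distances on one pair of 2-D points (a minimal sketch in plain numpy):

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

def minkowski(a, b, p):
    # ( sum_i |a_i - b_i|^p )^(1/p)
    return (np.abs(a - b) ** p).sum() ** (1.0 / p)

print(minkowski(a, b, 1))  # 7.0 -> Manhattan
print(minkowski(a, b, 2))  # 5.0 -> Euclidean
print(minkowski(a, b, 3))  # ~4.498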

Searching over the weight parameter, k, and the Minkowski parameter p
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

if __name__ == '__main__':
    digits = datasets.load_digits()
    x = digits.data
    y = digits.target

    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

    best_score = 0.0
    best_k = -1
    best_mth = ''
    best_p = 1
    for p in [1, 2, 3, 4, 5]:
        for mth in ["distance", "uniform"]:
            for k in range(1, 11):
                knn_clf = KNeighborsClassifier(n_neighbors=k, weights=mth, p=p)
                knn_clf.fit(X_train, y_train)
                score = knn_clf.score(X_test, y_test)
                if score > best_score:
                    best_k = k
                    best_score = score
                    best_mth = mth
                    best_p = p
    print("best_k = ", best_k)
    print("best_score=", best_score)
    print("best_mth=", best_mth)
    print("best_p=", best_p)
    
#out:
#best_k =  6
#best_score= 0.9944444444444445
#best_mth= uniform
#best_p= 2

Grid search over the parameters

Above, the best parameter values were found with hand-rolled nested for loops. sklearn provides this parameter search out of the box via GridSearchCV.

  1. Define the grid of parameter values
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
x = digits.data
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

param_grid = [
    {
        "n_neighbors": [i for i in range(1, 10)],
        "weights": ["uniform"]
    },
    {
        "n_neighbors": [i for i in range(1, 10)],
        "weights": ["distance"],
        "p": [i for i in range(1, 6)]
    }
]

knn_clf = KNeighborsClassifier()

if __name__ == '__main__':
    grid_search = GridSearchCV(knn_clf, param_grid)
    print(grid_search.fit(X_train, y_train))
    # the best estimator found, refit on the full training set
    print(grid_search.best_estimator_)
    print(grid_search.best_score_)
    print(grid_search.best_params_)

# out:
#GridSearchCV(estimator=KNeighborsClassifier(),
#             param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9],
#                          'weights': ['uniform']},
#                         {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9],
#                          'p': [1, 2, 3, 4, 5], 'weights': ['distance']}])
#KNeighborsClassifier(n_neighbors=7, p=4, weights='distance')
#0.9853948896631823
#{'n_neighbors': 7, 'p': 4, 'weights': 'distance'}
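Note that best_score_ is the mean cross-validated accuracy on the training data, not the test-set accuracy; the tuned classifier can still be scored on the held-out test set:

print(grid_search.best_estimator_.score(X_test, y_test))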

GridSearchCV also accepts n_jobs (the number of CPU cores to use in parallel; -1 uses all of them) and verbose (print progress information while the search runs):


if __name__ == '__main__':
    grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=6, verbose=2)
    print(grid_search.fit(X_train, y_train))
    # the best estimator found, refit on the full training set
    print(grid_search.best_estimator_)
    print(grid_search.best_score_)
    print(grid_search.best_params_)

#out:
[CV] END ...............n_neighbors=9, p=3, weights=distance; total time=   0.3s
[CV] END ...............n_neighbors=9, p=3, weights=distance; total time=   0.3s
[CV] END ...............n_neighbors=9, p=4, weights=distance; total time=   0.3s
[CV] END ...............n_neighbors=9, p=4, weights=distance; total time=   0.3s
[CV] END ...............n_neighbors=9, p=4, weights=distance; total time=   0.3s
[CV] END ...............n_neighbors=8, p=5, weights=distance; total time=   0.3s
[CV] END ...............n_neighbors=9, p=4, weights=distance; total time=   0.3s
[CV] END ...............n_neighbors=9, p=4, weights=distance; total time=   0.3s
[CV] END ...............n_neighbors=9, p=5, weights=distance; total time=   0.3s
[CV] END ...............n_neighbors=9, p=5, weights=distance; total time=   0.3s
[CV] END ...............n_neighbors=9, p=5, weights=distance; total time=   0.3s
[CV] END ...............n_neighbors=9, p=5, weights=distance; total time=   0.3s
[CV] END ...............n_neighbors=9, p=5, weights=distance; total time=   0.3s

Data normalization

In the cancer data shown below, tumor size and discovery time sit on very different scales: the sizes are 1 and 5, while the discovery times are 100 and 200 days. Left as-is, the discovery time dominates the distance computation and can heavily skew the prediction.

[table: tumor size 1 vs 5; time since discovery 100 days vs 200 days]

One ad-hoc fix is to change the unit, e.g. measure the time in years instead of days, so that no single feature dominates. Mapping all features onto a common scale in a principled way is called "data normalization".

The simplest scheme maps every value into the range 0 to 1 (min-max normalization, below).

Min-max normalization

Find the feature's maximum x_max and minimum x_min, then map each value x to

x_scale = (x - x_min) / (x_max - x_min)

This method suits features with clear boundaries, such as exam scores, which are bounded between 0 and 100.

Min-max normalization of a one-dimensional array

import numpy as np

if __name__ == '__main__':
    # np.random.random_integers is deprecated; randint's upper bound is exclusive, hence 101
    x = np.random.randint(0, 101, size=100)
    y = (x - np.min(x)) / (np.max(x) - np.min(x))
    print(y)
    
#out:
[0.78787879 0.35353535 0.72727273 0.25252525 0.7979798  0.64646465
 0.09090909 0.09090909 0.82828283 0.68686869 0.08080808 0.93939394
 0.34343434 0.36363636 0.04040404 0.70707071 0.19191919 0.16161616
 0.56565657 0.66666667 0.46464646 0.78787879 0.48484848 0.18181818
 0.2020202  0.73737374 0.90909091 0.1010101  0.66666667 0.22222222
 0.90909091 0.02020202 0.1010101  0.90909091 0.         0.01010101
 0.29292929 0.52525253 0.49494949 0.96969697 0.77777778 0.94949495
 0.11111111 0.87878788 0.48484848 0.92929293 0.49494949 0.12121212
 0.73737374 0.         0.48484848 1.         0.15151515 0.56565657
 0.25252525 0.50505051 0.84848485 0.31313131 0.95959596 0.64646465
 0.37373737 0.41414141 0.37373737 0.42424242 0.29292929 0.67676768
 0.78787879 0.05050505 0.         0.02020202 0.47474747 0.92929293
 0.45454545 0.13131313 0.67676768 0.29292929 0.02020202 0.97979798
 0.28282828 0.17171717 0.73737374 0.8989899  0.14141414 0.88888889
 0.94949495 0.71717172 0.29292929 0.22222222 0.90909091 0.22222222
 0.28282828 0.29292929 0.28282828 0.53535354 0.66666667 0.90909091
 0.12121212 0.67676768 0.29292929 0.36363636]
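sklearn provides this transform as sklearn.preprocessing.MinMaxScaler; a minimal sketch on the same kind of data:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.random.randint(0, 101, size=(100, 1))  # scalers expect a 2-D array
print(MinMaxScaler().fit_transform(x))        # each column mapped into [0, 1]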

Mean-variance normalization (standardization)

A feature like income has no clear boundary, so min-max normalization is a poor fit: a few very large values squeeze everything else toward 0. For such data, use mean-variance normalization (standardization), which maps all data onto a distribution with mean 0 and variance 1:

x_scale = (x - mean(x)) / std(x)

Mean-variance normalization of column 0 of a two-dimensional array

# np.random.random_integers is deprecated; randint's upper bound is exclusive, hence 101
x = np.random.randint(0, 101, size=(50, 2))
x = np.array(x, dtype="float")
x[:, 0] = (x[:, 0] - np.mean(x[:, 0])) / np.std(x[:, 0])
print(x[:, 0])

#out:
[-1.27374257  0.01286609  0.65617042 -0.23730782  0.97782258  1.62112691
 -1.48817735  0.58469216  0.08434435  0.19156173  0.12008347  0.7633878
 -0.59469911  1.08503997 -0.91635128 -0.13009043 -0.38026434  0.72764867
  0.5132139   0.83486606 -1.77409038 -1.41669909  1.62112691 -0.0943513
 -1.45243822  0.90634432  0.72764867  0.37025738  0.90634432  1.370953
  1.19225736  1.19225736  0.26303999 -0.20156869 -1.63113387 -0.91635128
 -1.63113387  0.04860522  0.97782258  1.62112691  0.54895303 -0.7019165
 -0.98782954  0.37025738  0.83486606 -0.4517426  -1.84556864 -1.27374257
 -0.16582956 -1.55965561]

Each value has the column mean subtracted and is then divided by the column's standard deviation.
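In practice the normalization statistics must be computed on the training set only and then reused to transform the test set; sklearn's StandardScaler covers this, as in the sketch below:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)

scaler = StandardScaler()
scaler.fit(X_train)                    # learn mean_ and scale_ from the training set only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)  # reuse the training-set statistics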