Machine Learning with R, the tidyverse, and mlr (Manning MEAP, Version 7)
Classifying based on similar observations: the k-nearest neighbors algorithm
Some fundamental machine learning concepts:
- underfitting (making a model that is too simple to reflect trends in the data)
- overfitting (making a model that is too complex and fits the noise in the data)
- the bias-variance trade-off (the balance between underfitting and overfitting)
- cross-validation (the process by which we tell whether we are overfitting or underfitting)
- hyperparameter tuning (the process by which we find the optimal options for an algorithm)
The k-NN algorithm has two phases:
- the training phase. For k-NN, training consists of nothing more than storing the data. This is unusual among machine learning algorithms, and it means that most of the computation is done during the prediction phase.
- the prediction phase. Here the algorithm calculates the distance between each new, unlabeled case and all of the labeled cases. When I say "distance", I mean distance in terms of the variables (aggression and body length in the earlier example), not how far away you would find them in the woods! The usual distance metric is Euclidean distance, which in two or three dimensions is easy to picture as the straight-line distance between two points on a plot, and which can be calculated in however many dimensions the data has.
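To make the prediction phase concrete, here is a minimal sketch of classifying a single new case by Euclidean distance and majority vote (my own illustration, not code from the book; predict_knn_once and its arguments are hypothetical names):

predict_knn_once <- function(new_case, train_x, train_labels, k = 3) {
  # Euclidean distance in any number of dimensions: the square root of
  # the summed squared differences across all variables
  dists <- sqrt(rowSums(sweep(as.matrix(train_x), 2, new_case)^2))
  nearest <- order(dists)[1:k]                    # the k closest labeled cases
  names(which.max(table(train_labels[nearest])))  # majority vote among them
}

For example, once diabetesTib is loaded below, predict_knn_once(c(100, 350, 150), diabetesTib[, -1], diabetesTib$class) would classify a hypothetical patient from its glucose, insulin, and sspg values.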
Building a simple k-NN model
First, install and load the required packages:
install.packages("mlr", dependencies = TRUE)
> library(mlr)
Loading required package: ParamHelpers
Warning message: 'mlr' is in 'maintenance-only' mode since July
2019. Future development will only happen in 'mlr3'
(<https://mlr3.mlr-org.com>). Due to the focus on 'mlr3' there
might be uncaught bugs meanwhile in {mlr} - please consider
switching.
> library(tidyverse)
-- Attaching packages -------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.3 v purrr 0.3.4
v tibble 3.1.1 v dplyr 1.0.5
v tidyr 1.1.3 v stringr 1.4.0
v readr 1.4.0 v forcats 0.5.1
-- Conflicts ----------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
Next, load the dataset and convert it to a tibble:
> data(diabetes, package = "mclust")
> diabetesTib <- as_tibble(diabetes)
> summary(diabetesTib)
class glucose insulin sspg
Chemical:36 Min. : 70 Min. : 45.0 Min. : 10.0
Normal :76 1st Qu.: 90 1st Qu.: 352.0 1st Qu.:118.0
Overt :33 Median : 97 Median : 403.0 Median :156.0
Mean :122 Mean : 540.8 Mean :186.1
3rd Qu.:112 3rd Qu.: 558.0 3rd Qu.:221.0
Max. :353 Max. :1568.0 Max. :748.0
> diabetesTib
# A tibble: 145 x 4
class glucose insulin sspg
<fct> <dbl> <dbl> <dbl>
1 Normal 80 356 124
2 Normal 97 289 117
3 Normal 105 319 143
4 Normal 90 356 199
5 Normal 90 323 240
6 Normal 86 381 157
7 Normal 100 350 221
8 Normal 85 301 186
9 Normal 97 379 142
10 Normal 97 296 131
# ... with 135 more rows
Plot each pair of variables, colored by class:
> ggplot(diabetesTib, aes(glucose, insulin, col = class)) +
+ geom_point() +
+ theme_bw()
> ggplot(diabetesTib, aes(sspg, insulin, col = class)) +
+ geom_point() +
+ theme_bw()
> ggplot(diabetesTib, aes(sspg, glucose, col = class)) +
+ geom_point() +
+ theme_bw()
Building a machine learning model with the mlr package has three main stages:
- Define the task. The task consists of the data and what we want to do with it. In this case, the data is diabetesTib, and we want to classify the data with the variable class as the target variable.
- Define the learner. The learner is simply the name of the algorithm we plan to use, along with any additional arguments the algorithm accepts.
- Train the model. This stage is what it sounds like: you pass the task to the learner, and the learner generates a model that you can use to make future predictions.
The components needed to define a task are:
- the data containing the predictor variables (the variables we hope contain the information needed to make predictions / solve our problem)
- the target variable we want to predict
We want to build a classification model, so we use the makeClassifTask() function to define a classification task, supplying the name of the factor containing the class labels as the target argument.
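Create the task, then print it:
> diabetesTask <- makeClassifTask(data = diabetesTib, target = "class")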
> diabetesTask
Supervised task: diabetesTib
Type: classif
Target: class
Observations: 145
Features:
numerics factors ordered functionals
3 0 0 0
Missings: FALSE
Has weights: FALSE
Has blocking: FALSE
Has coordinates: FALSE
Classes: 3
Chemical Normal Overt
36 76 33
Positive class: NA
We define the learner with the makeLearner() function. Its first argument is the algorithm we'll use to train the model; here we want the k-NN algorithm, so we supply "classif.knn" and set "k" = 2:
> knn <- makeLearner("classif.knn", par.vals = list("k" = 2))
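To see which hyperparameters a learner accepts before setting them, mlr provides the getParamSet() function (for "classif.knn", the set includes k, the number of neighbors considered):
> getParamSet("classif.knn")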
Train the model, then pass the original data back through it and evaluate the predictions (mmce is the mean misclassification error; acc is accuracy, which equals 1 - mmce):
> knnModel <- train(knn, diabetesTask)
> knnPred <- predict(knnModel, newdata = diabetesTib)
Warning in predict.WrappedModel(knnModel, newdata = diabetesTib) :
Provided data for prediction is not a pure data.frame but from class tbl_df, hence it will be converted.
> performance(knnPred, measures = list(mmce, acc))
mmce acc
0.05517241 0.94482759
Cross-validating the model
Evaluating a model on the same data it was trained on gives an overly optimistic estimate of performance, so we cross-validate instead. Re-create the task and learner, then define a holdout resampling description (stratify = TRUE keeps the class proportions similar across the training and test sets):
> diabetesTask <- makeClassifTask(data = diabetesTib, target = "class")
Warning in makeTask(type = type, data = data, weights = weights, blocking = blocking, :
Provided data is not a pure data.frame but from class tbl_df, hence it will be converted.
> knn <- makeLearner("classif.knn", par.vals = list("k" = 2))
> holdout <- makeResampleDesc(method = "Holdout", split = 2/3,
+ stratify = TRUE)
> holdoutCV <- resample(learner = knn, task = diabetesTask, resampling = holdout, measures = list(mmce, acc))
Resampling: holdout
Measures: mmce acc
[Resample] iter 1: 0.1224490 0.8775510
Aggregated Result: mmce.test.mean=0.1224490,acc.test.mean=0.8775510
> holdoutCV$aggr
mmce.test.mean acc.test.mean
0.122449 0.877551
Confusion matrix for the holdout cross-validation (in the relative matrix, each cell shows the proportion normalized by true class / by predicted class):
> calculateConfusionMatrix(holdoutCV$pred, relative = TRUE)
Relative confusion matrix (normalized by row/column):
predicted
true Chemical Normal Overt -err.-
Chemical 0.92/0.69 0.08/0.04 0.00/0.00 0.08
Normal 0.08/0.12 0.92/0.96 0.00/0.00 0.08
Overt 0.27/0.19 0.00/0.00 0.73/1.00 0.27
-err.- 0.31 0.04 0.00 0.12
Absolute confusion matrix:
predicted
true Chemical Normal Overt -err.-
Chemical 11 1 0 1
Normal 2 24 0 2
Overt 3 0 8 3
-err.- 5 1 0 6
Create a repeated k-fold cross-validation resampling description (10 folds, repeated 50 times, giving 500 performance estimates to average):
> kFold <- makeResampleDesc(method = "RepCV", folds = 10, reps = 50,
+ stratify = TRUE)
> kFoldCV <- resample(learner = knn, task = diabetesTask,
+ resampling = kFold, measures = list(mmce, acc))
Resampling: repeated cross-validation
Measures: mmce acc
[Resample] iter 1: 0.0714286 0.9285714
[Resample] iter 2: 0.0769231 0.9230769
[Resample] iter 3: 0.1333333 0.8666667
[Resample] iter 4: 0.0714286 0.9285714
[Resample] iter 5: 0.1875000 0.8125000
[Resample] iter 6: 0.0714286 0.9285714
...
[Resample] iter 498: 0.0714286 0.9285714
[Resample] iter 499: 0.1333333 0.8666667
[Resample] iter 500: 0.1428571 0.8571429
Aggregated Result: mmce.test.mean=0.1064117,acc.test.mean=0.8935883
Extract the average performance measures:
> kFoldCV$aggr
mmce.test.mean acc.test.mean
0.1064117 0.8935883
Build a confusion matrix from the repeated k-fold cross-validation:
> calculateConfusionMatrix(kFoldCV$pred, relative = TRUE)
Relative confusion matrix (normalized by row/column):
predicted
true Chemical Normal Overt -err.-
Chemical 0.81/0.77 0.10/0.05 0.09/0.11 0.19
Normal 0.04/0.08 0.96/0.95 0.00/0.00 0.04
Overt 0.17/0.15 0.00/0.00 0.83/0.89 0.17
-err.- 0.23 0.05 0.11 0.11
Absolute confusion matrix:
predicted
true Chemical Normal Overt -err.-
Chemical 1450 184 166 350
Normal 144 3656 0 144
Overt 280 0 1370 280
-err.- 424 184 166 774
Create a leave-one-out cross-validation resampling description (each case takes a turn as the sole test case):
> LOO <- makeResampleDesc(method = "LOO")
> LOOCV <- resample(learner = knn, task = diabetesTask, resampling = LOO,
+ measures = list(mmce, acc))
Resampling: LOO
Measures: mmce acc
[Resample] iter 1: 0.0000000 1.0000000
[Resample] iter 2: 0.0000000 1.0000000
[Resample] iter 3: 0.0000000 1.0000000
...
[Resample] iter 144: 0.0000000 1.0000000
[Resample] iter 145: 0.0000000 1.0000000
Aggregated Result: mmce.test.mean=0.1103448,acc.test.mean=0.8896552
> LOOCV$aggr
mmce.test.mean acc.test.mean
0.1103448 0.8896552
If you run this cross-validation repeatedly, you'll find that for this model and data, the performance estimate varies more between runs than the repeated k-fold estimate did, but less than the holdout estimate.
> calculateConfusionMatrix(LOOCV$pred, relative = TRUE)
Relative confusion matrix (normalized by row/column):
predicted
true Chemical Normal Overt -err.-
Chemical 0.81/0.76 0.08/0.04 0.11/0.13 0.19
Normal 0.04/0.08 0.96/0.96 0.00/0.00 0.04
Overt 0.18/0.16 0.00/0.00 0.82/0.87 0.18
-err.- 0.24 0.04 0.13 0.11
Absolute confusion matrix:
predicted
true Chemical Normal Overt -err.-
Chemical 29 3 4 7
Normal 3 73 0 3
Overt 6 0 27 6
-err.- 9 3 4 16
Tuning k to improve the model
Define the hyperparameter search space (values of k from 1 to 10), a grid-search control, and a cross-validation strategy for the tuning, then run the search:
> knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:10))
> gridSearch <- makeTuneControlGrid()
> cvForTuning <- makeResampleDesc("RepCV", folds = 10, reps = 20)
> tunedK <- tuneParams("classif.knn", task = diabetesTask,
+ resampling = cvForTuning,
+ par.set = knnParamSpace, control = gridSearch)
[Tune] Started tuning learner classif.knn for parameter set:
Type len Def Constr Req Tunable Trafo
k discrete - - 1,2,3,4,5,6,7,8,9,10 - TRUE -
With control class: TuneControlGrid
Imputation value: 1
[Tune-x] 1: k=1
[Tune-y] 1: mmce.test.mean=0.1119524; time: 0.0 min
[Tune-x] 2: k=2
[Tune-y] 2: mmce.test.mean=0.1035952; time: 0.0 min
[Tune-x] 3: k=3
[Tune-y] 3: mmce.test.mean=0.0881429; time: 0.0 min
[Tune-x] 4: k=4
[Tune-y] 4: mmce.test.mean=0.0920714; time: 0.0 min
[Tune-x] 5: k=5
[Tune-y] 5: mmce.test.mean=0.0843810; time: 0.0 min
[Tune-x] 6: k=6
[Tune-y] 6: mmce.test.mean=0.0835000; time: 0.0 min
[Tune-x] 7: k=7
[Tune-y] 7: mmce.test.mean=0.0753571; time: 0.0 min
[Tune-x] 8: k=8
[Tune-y] 8: mmce.test.mean=0.0839048; time: 0.0 min
[Tune-x] 9: k=9
[Tune-y] 9: mmce.test.mean=0.0896905; time: 0.0 min
[Tune-x] 10: k=10
[Tune-y] 10: mmce.test.mean=0.0841190; time: 0.0 min
[Tune] Result: k=7 : mmce.test.mean=0.0753571
> tunedK
Tune result:
Op. pars: k=7
mmce.test.mean=0.0753571
Visualizing the tuning process:
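A sketch of how the tuning results can be plotted, using mlr's generateHyperParsEffectData() and plotHyperParsEffect() functions (the exact arguments here are my assumption):
> knnTuningData <- generateHyperParsEffectData(tunedK)
> plotHyperParsEffect(knnTuningData, x = "k", y = "mmce.test.mean",
+                     plot.type = "line") +
+   theme_bw()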
Training the final model with the tuned value of k:
> tunedKnn <- setHyperPars(makeLearner("classif.knn"),
+ par.vals = tunedK$x)
> tunedKnnModel <- train(tunedKnn, diabetesTask)
Including hyperparameter tuning inside cross-validation: the inner loop tunes k, and the outer loop estimates the performance of the whole tune-then-train process (nested cross-validation):
> inner <- makeResampleDesc("CV")
> outer <- makeResampleDesc("RepCV", folds = 10, reps = 5)
> knnWrapper <- makeTuneWrapper("classif.knn", resampling = inner,
+ par.set = knnParamSpace,
+ control = gridSearch)
> cvWithTuning <- resample(knnWrapper, diabetesTask, resampling = outer)
Resampling: repeated cross-validation
Measures: mmce
[Tune] Started tuning learner classif.knn for parameter set:
Type len Def Constr Req Tunable Trafo
k discrete - - 1,2,3,4,5,6,7,8,9,10 - TRUE -
With control class: TuneControlGrid
Imputation value: 1
[Tune-x] 1: k=1
[Tune-y] 1: mmce.test.mean=0.0983516; time: 0.0 min
[Tune-x] 2: k=2
[Tune-y] 2: mmce.test.mean=0.0912088; time: 0.0 min
[Tune-x] 3: k=3
[Tune-y] 3: mmce.test.mean=0.0835165; time: 0.0 min
...
[Tune-y] 9: mmce.test.mean=0.0906593; time: 0.0 min
[Tune-x] 10: k=10
[Tune-y] 10: mmce.test.mean=0.0835165; time: 0.0 min
[Tune] Result: k=8 : mmce.test.mean=0.0758242
...
[Resample] iter 50: 0.0000000
Aggregated Result: mmce.test.mean=0.0873333
> cvWithTuning
Resample Result
Task: diabetesTib
Learner: classif.knn.tuned
Aggr perf: mmce.test.mean=0.0873333
Runtime: 19.2561
Using the model to make predictions on new patients:
> newDiabetesPatients <- tibble(glucose = c(82, 108, 300),
+ insulin = c(361, 288, 1052),
+ sspg = c(200, 186, 135))
> newDiabetesPatients
# A tibble: 3 x 3
glucose insulin sspg
<dbl> <dbl> <dbl>
1 82 361 200
2 108 288 186
3 300 1052 135
Pass these patients into the model to get their predicted diabetes status:
> newPatientsPred <- predict(tunedKnnModel, newdata = newDiabetesPatients)
Warning in predict.WrappedModel(tunedKnnModel, newdata = newDiabetesPatients) :
Provided data for prediction is not a pure data.frame but from class tbl_df, hence it will be converted.
> getPredictionResponse(newPatientsPred)
[1] Normal Normal Overt
Levels: Chemical Normal Overt
Strengths and weaknesses of k-NN
The strengths of the k-NN algorithm
- The algorithm is very simple to understand
- There is no computational cost during the learning process; all the computation is done during prediction
- It makes no assumptions about the data, such as how it’s distributed
The weaknesses of the k-NN algorithm
- It cannot natively handle categorical variables (they must be recoded first, or a different distance metric must be used)
- When the training set is large, it can be computationally expensive to compute the distance between new data and all the cases in the training set
- The model can’t be interpreted in terms of real-world relationships in the data
- Prediction accuracy can be strongly impacted by noisy data and outliers
- In high-dimensional datasets, k-NN tends to perform poorly. This is due to a phenomenon you'll learn about in chapter 5, called the curse of dimensionality. In brief, in high dimensions the distances between the cases start to look the same, so finding the nearest neighbors becomes difficult
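To see this distance-concentration effect for yourself, here is a quick sketch (my own illustration, not from the book): as the number of dimensions grows, the ratio of the nearest to the farthest distance from a point approaches 1, so the "nearest" neighbors stop being meaningfully near.

set.seed(42)
for (d in c(2, 10, 100, 1000)) {
  x <- matrix(runif(100 * d), ncol = d)  # 100 random cases in d dimensions
  dists <- as.matrix(dist(x))[1, -1]     # Euclidean distances from case 1
  cat(sprintf("d = %4d  min/max distance ratio = %.2f\n",
              d, min(dists) / max(dists)))
}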