ML with R: Reading Notes

《Machine Learning with R, tidyverse, and mlr》
MEAP Edition (Manning Early Access Program), Version 7

Classifying based on similar observations: the k-nearest neighbors algorithm
Some basic machine-learning concepts:
- underfitting (making a model that is too simple to reflect trends in the data)
- overfitting (making a model that is too complex and fits the noise in the data)
- the bias-variance trade-off (the balance between underfitting and overfitting)
- cross-validation (the process by which we tell if we are overfitting or underfitting)
- hyperparameter tuning (the process by which we find the optimal options for an algorithm)


The k-NN algorithm has two phases:

  1. the training phase
    The training phase of k-NN consists of nothing more than storing the data. This is unusual among machine-learning algorithms, and it means that most of the computation is done during the prediction phase.
  2. the prediction phase
    In the prediction phase, the k-NN algorithm calculates the distance between each new, unlabeled case and all of the labeled cases. By "distance" I mean the distance between their variable values (in the book's example, aggression and body length), not how far apart you would find them in the woods! This is usually the Euclidean distance, which in two or even three dimensions is easy to picture as the straight-line distance between two points on a plot; the same metric can be computed in however many dimensions the data has.
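The prediction phase can be sketched in a few lines of base R (a minimal illustration, not from the book; `euclid()` and `knn_predict()` are hypothetical helper names):

```r
# Euclidean distance between two numeric vectors
euclid <- function(a, b) sqrt(sum((a - b)^2))

# Classify one unlabeled case by majority vote among its k nearest neighbors.
# train: matrix of labeled cases; labels: their classes; newcase: numeric vector
knn_predict <- function(train, labels, newcase, k = 3) {
  dists <- apply(train, 1, euclid, b = newcase)  # distance to every labeled case
  nearest <- order(dists)[1:k]                   # indices of the k closest cases
  names(which.max(table(labels[nearest])))       # most common class among them
}

# Toy data: two well-separated clusters in 2-D
train  <- rbind(c(1, 1), c(1, 2), c(2, 1), c(8, 8), c(8, 9), c(9, 8))
labels <- c("A", "A", "A", "B", "B", "B")
knn_predict(train, labels, c(2, 2), k = 3)      # "A"
knn_predict(train, labels, c(8.5, 8), k = 3)    # "B"
```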
Building a simple k-NN model

First, install and load the required packages:
> install.packages("mlr", dependencies = TRUE)
> library(mlr)
Loading required package: ParamHelpers
Warning message: 'mlr' is in 'maintenance-only' mode since July
2019. Future development will only happen in 'mlr3'
(<https://mlr3.mlr-org.com>). Due to the focus on 'mlr3' there
might be uncaught bugs meanwhile in {mlr} - please consider
switching.
> library(tidyverse)
-- Attaching packages -------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.3     v purrr   0.3.4
v tibble  3.1.1     v dplyr   1.0.5
v tidyr   1.1.3     v stringr 1.4.0
v readr   1.4.0     v forcats 0.5.1
-- Conflicts ----------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Then load the dataset:

> data(diabetes, package = "mclust")
> diabetesTib <- as_tibble(diabetes)
> summary(diabetesTib)
      class       glucose       insulin            sspg      
 Chemical:36   Min.   : 70   Min.   :  45.0   Min.   : 10.0  
 Normal  :76   1st Qu.: 90   1st Qu.: 352.0   1st Qu.:118.0  
 Overt   :33   Median : 97   Median : 403.0   Median :156.0  
               Mean   :122   Mean   : 540.8   Mean   :186.1  
               3rd Qu.:112   3rd Qu.: 558.0   3rd Qu.:221.0  
               Max.   :353   Max.   :1568.0   Max.   :748.0 
> diabetesTib
# A tibble: 145 x 4
   class  glucose insulin  sspg
   <fct>    <dbl>   <dbl> <dbl>
 1 Normal      80     356   124
 2 Normal      97     289   117
 3 Normal     105     319   143
 4 Normal      90     356   199
 5 Normal      90     323   240
 6 Normal      86     381   157
 7 Normal     100     350   221
 8 Normal      85     301   186
 9 Normal      97     379   142
10 Normal      97     296   131
# ... with 135 more rows

Visualize the data:

> ggplot(diabetesTib, aes(glucose, insulin, col = class)) +
+   geom_point() +
+   theme_bw()

[Figure: Rplot01.png, insulin vs. glucose colored by class]

> ggplot(diabetesTib, aes(sspg, insulin, col = class)) +
+   geom_point() +
+   theme_bw()

[Figure: Rplot02.png, insulin vs. sspg colored by class]

> ggplot(diabetesTib, aes(sspg, glucose, col = class)) +
+   geom_point() +
+   theme_bw()

[Figure: Rplot03.png, glucose vs. sspg colored by class]

Building a machine-learning model with the mlr package has three main stages:

  1. Define the task. The task consists of the data and what we want to do with it. In this case, the data is diabetesTib, and we want to classify the data with the class variable as the target.
  2. Define the learner. The learner is simply the name of the algorithm we plan to use, along with any additional arguments the algorithm accepts.
  3. Train the model. This stage is what it sounds like: you pass the task to the learner, and the learner generates a model that you can use to make future predictions.

The components needed to define a task are:
  - the data containing the predictor variables (the variables we hope contain the information needed to make predictions / solve our problem)
  - the target variable we want to predict

We want to build a classification model, so we use the makeClassifTask() function to define a classification task, passing the name of the factor containing the class labels as the target argument:

> diabetesTask <- makeClassifTask(data = diabetesTib, target = "class")
> diabetesTask
Supervised task: diabetesTib
Type: classif
Target: class
Observations: 145
Features:
   numerics     factors     ordered functionals 
          3           0           0           0 
Missings: FALSE
Has weights: FALSE
Has blocking: FALSE
Has coordinates: FALSE
Classes: 3
Chemical   Normal    Overt 
      36       76       33 
Positive class: NA

We use the makeLearner() function to define the learner. The first argument to makeLearner() is the algorithm we will use to train the model. In this case we want the k-NN algorithm, so we supply "classif.knn" as the argument and set "k" = 2:

> knn <- makeLearner("classif.knn", par.vals = list("k" = 2))

Train the model, then make predictions on the training data and assess performance:

> knnModel <- train(knn, diabetesTask)
> knnPred <- predict(knnModel, newdata = diabetesTib)
Warning in predict.WrappedModel(knnModel, newdata = diabetesTib) :
  Provided data for prediction is not a pure data.frame but from class tbl_df, hence it will be converted.
> performance(knnPred, measures = list(mmce, acc))
      mmce        acc 
0.05517241 0.94482759 
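The two measures are directly related: mmce is simply the proportion of misclassified cases, and acc is its complement. With 145 training cases, the values above correspond to 8 misclassifications (a base-R check of the arithmetic):

```r
# mmce = mean misclassification error: proportion of wrong predictions
# acc  = accuracy: proportion of correct predictions
mmce <- 8 / 145   # 8 of the 145 training cases were misclassified
acc  <- 1 - mmce
c(mmce = mmce, acc = acc)   # 0.05517241 0.94482759
```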
Cross-validating the model:
> diabetesTask <- makeClassifTask(data = diabetesTib, target = "class")
Warning in makeTask(type = type, data = data, weights = weights, blocking = blocking,  :
  Provided data is not a pure data.frame but from class tbl_df, hence it will be converted.
> knn <- makeLearner("classif.knn", par.vals = list("k" = 2))
> holdout <- makeResampleDesc(method = "Holdout", split = 2/3,
+                             stratify = TRUE)
> holdoutCV <- resample(learner = knn, task = diabetesTask, resampling = holdout, measures = list(mmce, acc))
Resampling: holdout
Measures:             mmce      acc       
[Resample] iter 1:    0.1224490 0.8775510 


Aggregated Result: mmce.test.mean=0.1224490,acc.test.mean=0.8775510


> holdoutCV$aggr
mmce.test.mean  acc.test.mean 
      0.122449       0.877551 
> 

Confusion matrix for the holdout cross-validation:

> calculateConfusionMatrix(holdoutCV$pred, relative = TRUE)
Relative confusion matrix (normalized by row/column):
          predicted
true       Chemical  Normal    Overt     -err.-   
  Chemical 0.92/0.69 0.08/0.04 0.00/0.00 0.08     
  Normal   0.08/0.12 0.92/0.96 0.00/0.00 0.08     
  Overt    0.27/0.19 0.00/0.00 0.73/1.00 0.27     
  -err.-        0.31      0.04      0.00 0.12     


Absolute confusion matrix:
          predicted
true       Chemical Normal Overt -err.-
  Chemical       11      1     0      1
  Normal          2     24     0      2
  Overt           3      0     8      3
  -err.-          5      1     0      6
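The aggregate estimates are consistent with this matrix: the stratified 1/3 split leaves 49 of the 145 cases in the test set, and the -err.- column shows that 6 of them were misclassified (a base-R check):

```r
# 6 misclassified cases out of 49 in the holdout test set
6 / 49       # mmce = 0.1224490
1 - 6 / 49   # acc  = 0.8775510
```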

Create a repeated k-fold cross-validation resampling description:

> kFold <- makeResampleDesc(method = "RepCV", folds = 10, reps = 50,
+                           stratify = TRUE)
> kFoldCV <- resample(learner = knn, task = diabetesTask,
+                     resampling = kFold, measures = list(mmce, acc))
Resampling: repeated cross-validation
Measures:             mmce      acc       
[Resample] iter 1:    0.0714286 0.9285714 
[Resample] iter 2:    0.0769231 0.9230769 
[Resample] iter 3:    0.1333333 0.8666667 
[Resample] iter 4:    0.0714286 0.9285714 
[Resample] iter 5:    0.1875000 0.8125000 
[Resample] iter 6:    0.0714286 0.9285714 
...
[Resample] iter 498:  0.0714286 0.9285714 
[Resample] iter 499:  0.1333333 0.8666667 
[Resample] iter 500:  0.1428571 0.8571429 

Aggregated Result: mmce.test.mean=0.1064117,acc.test.mean=0.8935883

Extract the average performance measures:

> kFoldCV$aggr
mmce.test.mean  acc.test.mean 
     0.1064117      0.8935883 

Build the confusion matrix from the repeated k-fold cross-validation:

> calculateConfusionMatrix(kFoldCV$pred, relative = TRUE)
Relative confusion matrix (normalized by row/column):
          predicted
true       Chemical  Normal    Overt     -err.-   
  Chemical 0.81/0.77 0.10/0.05 0.09/0.11 0.19     
  Normal   0.04/0.08 0.96/0.95 0.00/0.00 0.04     
  Overt    0.17/0.15 0.00/0.00 0.83/0.89 0.17     
  -err.-        0.23      0.05      0.11 0.11     


Absolute confusion matrix:
          predicted
true       Chemical Normal Overt -err.-
  Chemical     1450    184   166    350
  Normal        144   3656     0    144
  Overt         280      0  1370    280
  -err.-        424    184   166    774

Create a leave-one-out cross-validation resampling description:

> LOO <- makeResampleDesc(method = "LOO")
> LOOCV <- resample(learner = knn, task = diabetesTask, resampling = LOO,
+                   measures = list(mmce, acc))
Resampling: LOO
Measures:             mmce      acc       
[Resample] iter 1:    0.0000000 1.0000000 
[Resample] iter 2:    0.0000000 1.0000000 
[Resample] iter 3:    0.0000000 1.0000000 
...
[Resample] iter 144:  0.0000000 1.0000000 
[Resample] iter 145:  0.0000000 1.0000000 

Aggregated Result: mmce.test.mean=0.1103448,acc.test.mean=0.8896552

> LOOCV$aggr
mmce.test.mean  acc.test.mean 
     0.1103448      0.8896552 

If you rerun the cross-validation repeatedly, you will find that for this model and data the performance estimate varies more than for k-fold, but less than for the holdout run earlier.

> calculateConfusionMatrix(LOOCV$pred, relative = TRUE)
Relative confusion matrix (normalized by row/column):
          predicted
true       Chemical  Normal    Overt     -err.-   
  Chemical 0.81/0.76 0.08/0.04 0.11/0.13 0.19     
  Normal   0.04/0.08 0.96/0.96 0.00/0.00 0.04     
  Overt    0.18/0.16 0.00/0.00 0.82/0.87 0.18     
  -err.-        0.24      0.04      0.13 0.11     


Absolute confusion matrix:
          predicted
true       Chemical Normal Overt -err.-
  Chemical       29      3     4      7
  Normal          3     73     0      3
  Overt           6      0    27      6
  -err.-          9      3     4     16
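Again the matrix agrees with the aggregated result: in leave-one-out cross-validation each of the 145 cases is predicted exactly once, and the -err.- column totals 16 misclassifications (a base-R check):

```r
# 16 misclassified cases out of 145 leave-one-out predictions
16 / 145   # mmce = 0.1103448
```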
Tuning k to improve the model
> knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:10))
> gridSearch <- makeTuneControlGrid()
> cvForTuning <- makeResampleDesc("RepCV", folds = 10, reps = 20)
> tunedK <- tuneParams("classif.knn", task = diabetesTask,
+                      resampling = cvForTuning,
+                      par.set = knnParamSpace, control = gridSearch)
[Tune] Started tuning learner classif.knn for parameter set:
      Type len Def               Constr Req Tunable Trafo
k discrete   -   - 1,2,3,4,5,6,7,8,9,10   -    TRUE     -
With control class: TuneControlGrid
Imputation value: 1
[Tune-x] 1: k=1
[Tune-y] 1: mmce.test.mean=0.1119524; time: 0.0 min
[Tune-x] 2: k=2
[Tune-y] 2: mmce.test.mean=0.1035952; time: 0.0 min
[Tune-x] 3: k=3
[Tune-y] 3: mmce.test.mean=0.0881429; time: 0.0 min
[Tune-x] 4: k=4
[Tune-y] 4: mmce.test.mean=0.0920714; time: 0.0 min
[Tune-x] 5: k=5
[Tune-y] 5: mmce.test.mean=0.0843810; time: 0.0 min
[Tune-x] 6: k=6
[Tune-y] 6: mmce.test.mean=0.0835000; time: 0.0 min
[Tune-x] 7: k=7
[Tune-y] 7: mmce.test.mean=0.0753571; time: 0.0 min
[Tune-x] 8: k=8
[Tune-y] 8: mmce.test.mean=0.0839048; time: 0.0 min
[Tune-x] 9: k=9
[Tune-y] 9: mmce.test.mean=0.0896905; time: 0.0 min
[Tune-x] 10: k=10
[Tune-y] 10: mmce.test.mean=0.0841190; time: 0.0 min
[Tune] Result: k=7 : mmce.test.mean=0.0753571
> tunedK
Tune result:
Op. pars: k=7
mmce.test.mean=0.0753571

Visualizing the tuning process:

[Figure: Rplot04.png, mean mmce across the candidate values of k]
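The figure can be reproduced from the tuning result (a sketch assuming the tunedK object created above; generateHyperParsEffectData() and plotHyperParsEffect() are mlr functions):

```r
library(mlr)
library(ggplot2)

# Collect the per-hyperparameter-value performance from the tune result
knnTuningData <- generateHyperParsEffectData(tunedK)

# Plot mean mmce against each candidate value of k
plotHyperParsEffect(knnTuningData, x = "k", y = "mmce.test.mean",
                    plot.type = "line") +
  theme_bw()
```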

Train the final model:

> tunedKnn <- setHyperPars(makeLearner("classif.knn"),
+                          par.vals = tunedK$x)
> tunedKnnModel <- train(tunedKnn, diabetesTask)

Include hyperparameter tuning inside the cross-validation (nested cross-validation): the inner loop tunes k, and the outer loop (10 folds × 5 repeats = 50 iterations) estimates model performance:

> inner <- makeResampleDesc("CV")
> outer <- makeResampleDesc("RepCV", folds = 10, reps = 5)
> knnWrapper <- makeTuneWrapper("classif.knn", resampling = inner,
+                               par.set = knnParamSpace,
+                               control = gridSearch)
> cvWithTuning <- resample(knnWrapper, diabetesTask, resampling = outer)
Resampling: repeated cross-validation
Measures:             mmce      
[Tune] Started tuning learner classif.knn for parameter set:
      Type len Def               Constr Req Tunable Trafo
k discrete   -   - 1,2,3,4,5,6,7,8,9,10   -    TRUE     -
With control class: TuneControlGrid
Imputation value: 1
[Tune-x] 1: k=1
[Tune-y] 1: mmce.test.mean=0.0983516; time: 0.0 min
[Tune-x] 2: k=2
[Tune-y] 2: mmce.test.mean=0.0912088; time: 0.0 min
[Tune-x] 3: k=3
[Tune-y] 3: mmce.test.mean=0.0835165; time: 0.0 min
...
[Tune-y] 9: mmce.test.mean=0.0906593; time: 0.0 min
[Tune-x] 10: k=10
[Tune-y] 10: mmce.test.mean=0.0835165; time: 0.0 min
[Tune] Result: k=8 : mmce.test.mean=0.0758242
[Resample] iter 50:   0.0000000 


Aggregated Result: mmce.test.mean=0.0873333


> cvWithTuning
Resample Result
Task: diabetesTib
Learner: classif.knn.tuned
Aggr perf: mmce.test.mean=0.0873333
Runtime: 19.2561

Using the model to make predictions on new patients:

> newDiabetesPatients <- tibble(glucose = c(82, 108, 300),
+                               insulin = c(361, 288, 1052),
+                               sspg = c(200, 186, 135))
> newDiabetesPatients
# A tibble: 3 x 3
  glucose insulin  sspg
    <dbl>   <dbl> <dbl>
1      82     361   200
2     108     288   186
3     300    1052   135

Pass these patients into the model to get their predicted diabetes status:

> newPatientsPred <- predict(tunedKnnModel, newdata = newDiabetesPatients)
Warning in predict.WrappedModel(tunedKnnModel, newdata = newDiabetesPatients) :
  Provided data for prediction is not a pure data.frame but from class tbl_df, hence it will be converted.
> getPredictionResponse(newPatientsPred)
[1] Normal Normal Overt 
Levels: Chemical Normal Overt
Strengths and weaknesses of k-NN
The strengths of the k-NN algorithm
- The algorithm is very simple to understand
- There is no computational cost during the learning process; all the computation is done during prediction
- It makes no assumptions about the data, such as how it’s distributed
The weaknesses of the k-NN algorithm
- It cannot natively handle categorical variables (they must be recoded first or a different distance metric must be used)
- When the training set is large, it can be computationally expensive to compute the distance between new data and all the cases in the training set
- The model can’t be interpreted in terms of real-world relationships in the data
- Prediction accuracy can be strongly impacted by noisy data and outliers
- In high-dimensional datasets, k-NN tends to perform poorly. This is due to a phenomenon you'll learn about in chapter 5, called the curse of dimensionality. In brief, in high dimensions the distances between the cases start to look the same, so finding the nearest neighbors becomes difficult
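The distance-concentration effect behind the curse of dimensionality is easy to demonstrate in base R (a simulation sketch, not from the book; dist_ratio() is a hypothetical helper): as the number of dimensions grows, the farthest and nearest neighbors of a point end up almost equally far away.

```r
set.seed(42)

# For n random points in d dimensions, take the first point and compare its
# farthest and nearest distances to all the other points.
dist_ratio <- function(d, n = 100) {
  x <- matrix(runif(n * d), nrow = n)
  dists <- sqrt(colSums((t(x[-1, ]) - x[1, ])^2))  # distances from point 1
  max(dists) / min(dists)
}

# As d grows, the ratio shrinks toward 1, so "nearest" loses its meaning
sapply(c(2, 10, 100, 1000), dist_ratio)
```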