(Translation) 30 Days of Python 👨‍💻 Day 28: Machine Learning & Data Science (Part 2)

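Today I explored the Scikit-Learn library and created a notebook project to review some basics and try building a machine learning model. Scikit-Learn is a huge library that takes a lot of practice and exploration to master. I followed some tutorials and articles to build a simple classifier model, just to figure out how it works. It looks a bit intimidating, but I decided to create a basic workflow in a Jupyter Notebook so that I can use it as a reference when I decide to dive deeper into ML and data science.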

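Scikit-Learn is a popular machine learning library for Python. It takes the data you give it, builds machine learning models that learn the patterns in that data, and uses those models to make predictions.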

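Why Scikit-Learn?

Built on top of the NumPy and matplotlib libraries
A large number of built-in machine learning models
Many ways to evaluate machine learning models
A well-designed API that is easy to understand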

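Machine learning can be hard to approach because it involves complex algorithms and statistics for analyzing data. Scikit-Learn abstracts away much of that complexity, making it easy to build and train models without knowing much of the underlying math and statistics.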

This is the notebook I created today. The GitHub repository link is github.com/arindamdawn…

Data preparation

The data used in this project is the heart disease dataset from www.kaggle.com/ronitf/hear…

import pandas as pd
import numpy as np
heart_disease = pd.read_csv('data/heart.csv')
heart_disease.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  target
0   63    1   3       145   233    1        0      150      0      2.3      0   1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   2       1

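The goal is to predict, from the data above, whether a patient has heart disease. The target column holds the outcome; the other columns are called features.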

# Create the feature matrix (X): every column except the target
X = heart_disease.drop('target', axis=1)

# Create the labels (y): the target column only
y = heart_disease['target']

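Choosing the right model/estimator for the problem

For this problem we will use the RandomForestClassifier model from sklearn, which is a classification machine learning model.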

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.get_params() # lists the hyperparameters

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
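The hyperparameters listed above can be set when the estimator is constructed, or changed afterwards with set_params. A minimal sketch; the values used here (50 trees, depth 3, then 200 trees) are arbitrary illustrations and not settings from the notebook:

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters can be passed to the constructor...
small_forest = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=42)

# ...and inspected afterwards (the values are arbitrary examples)
params = small_forest.get_params()
print(params['n_estimators'], params['max_depth'])

# set_params changes them on an existing estimator; refit for it to take effect
small_forest.set_params(n_estimators=200)
print(small_forest.get_params()['n_estimators'])
```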

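Fitting the model to the training data

In this step, the data is split into training and test sets, and the model is fitted to the training set.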

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# test_size=0.2 means 20% of the data is held out as test data

clf.fit(X_train, y_train);
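Because train_test_split shuffles the rows randomly, every run produces a different split and therefore slightly different scores. A minimal sketch of a reproducible split that also keeps the class ratio similar in both parts. It uses a synthetic dataset from make_classification as a stand-in for the heart disease data; the 303-row, 13-feature shape is an assumption chosen only to resemble it:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data (assumed shape: 303 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# random_state fixes the shuffle; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # fraction of positive labels in each part
```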

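# Make predictions
# Note: predict expects a 2D array with the same feature columns as X_train,
# so a call like clf.predict(np.array([0, 2, 3, 4])) raises a ValueError.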

y_preds = clf.predict(X_test)
y_preds

array([1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0,
       1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1], dtype=int64)

y_test.head()

72     1
116    1
107    1
262    0
162    1
Name: target, dtype: int64
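predict expects a 2D input with one row per sample and the same number of features the model was trained on, so a single patient has to be passed as a one-row array or DataFrame. A minimal sketch on synthetic data (a stand-in, not the heart disease dataset), which also shows predict_proba for the per-class probabilities:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 13 features (an assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# A single sample is 1D, so reshape it into a (1, n_features) row first
one_sample = np.asarray(X_demo[0]).reshape(1, -1)
label = model.predict(one_sample)        # array holding one predicted class
proba = model.predict_proba(one_sample)  # shape (1, 2): one probability per class

print(label, proba)
```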

Evaluating the model

In this step, the model is evaluated on both the training data and the test data.

clf.score(X_train, y_train)

1.0

clf.score(X_test, y_test)

0.7704918032786885
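A perfect training score next to a noticeably lower test score suggests the forest has largely memorized the training data, and a single 61-row test set gives a fairly noisy estimate. One common way to get a steadier number is k-fold cross-validation. The sketch below uses cross_val_score on synthetic stand-in data; it is not something the notebook itself does:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the heart disease data (assumed shape)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# 5-fold cross-validation: train on 4/5 of the rows, score on the held-out 1/5, five times
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0), X_demo, y_demo, cv=5
)

print(scores)
print(f'Mean accuracy: {scores.mean():.3f}')
```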

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.77      0.71      0.74        28
           1       0.77      0.82      0.79        33

    accuracy                           0.77        61
   macro avg       0.77      0.77      0.77        61
weighted avg       0.77      0.77      0.77        61

print(confusion_matrix(y_test, y_preds))

[[20  8]
 [ 6 27]]

print(accuracy_score(y_test, y_preds))

0.7704918032786885
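The numbers in the classification report can be derived by hand from the confusion matrix, which helps in reading both. A small sketch in plain Python using the matrix printed above, where rows are the true classes and columns are the predicted classes:

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[20, 8],
      [6, 27]]

tn, fp = cm[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm[1]  # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)  # of the patients predicted 1, how many really are 1
recall_1 = tp / (tp + fn)     # of the patients that really are 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f'accuracy={accuracy:.2f} precision={precision_1:.2f} '
      f'recall={recall_1:.2f} f1={f1_1:.2f}')
# accuracy=0.77 precision=0.77 recall=0.82 f1=0.79
```

These match the row for class 1 in the report and the overall accuracy of 47 correct predictions out of 61.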

Improving the model

This step is about tuning the model to get more accurate results.

# Try different values of n_estimators
np.random.seed(42)
for i in range(1, 100, 10):
    print(f'Trying model with {i} estimators')
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f'Model accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%')
    print('')

Trying model with 1 estimators
Model accuracy on test set: 72.13%

Trying model with 11 estimators
Model accuracy on test set: 83.61%

Trying model with 21 estimators
Model accuracy on test set: 78.69%

Trying model with 31 estimators
Model accuracy on test set: 78.69%

Trying model with 41 estimators
Model accuracy on test set: 75.41%

Trying model with 51 estimators
Model accuracy on test set: 75.41%

Trying model with 61 estimators
Model accuracy on test set: 75.41%

Trying model with 71 estimators
Model accuracy on test set: 73.77%

Trying model with 81 estimators
Model accuracy on test set: 73.77%

Trying model with 91 estimators
Model accuracy on test set: 75.41%
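The loop above picks n_estimators by looking at the test set, which makes the final test score optimistic, and the results jump around because each forest is random. One alternative is to let GridSearchCV compare the candidates with cross-validation on the training data only. A sketch on synthetic stand-in data; the candidate values are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the heart disease data (assumed shape)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Compare candidate values with 3-fold cross-validation on the training data only
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={'n_estimators': [10, 50, 100]},
    cv=3,
)
search.fit(X_tr, y_tr)

# The test set is touched only once, after the choice has been made
test_accuracy = search.score(X_te, y_te)
print(search.best_params_)
print(f'Test accuracy of best model: {test_accuracy * 100:.2f}%')
```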

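Saving and loading the model

We will use Python's pickle library to save the model. At this point clf refers to the last model trained in the loop above (91 estimators), so that is the model being saved.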

import pickle

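# Save the model (the with block closes the file even if an error occurs)
with open('random_forest_model_1.pkl', 'wb') as f:
    pickle.dump(clf, f)

# Load the model
with open('random_forest_model_1.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)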

0.7540983606557377
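A quick way to check that a saved model survives the round trip is to compare its predictions before and after loading. A minimal sketch on synthetic stand-in data, writing to a temporary directory so no real file is touched (the file name is only an example):

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=13, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pkl')  # hypothetical file name
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored model should give exactly the same predictions as the original
same = bool((model.predict(X_demo) == restored.predict(X_demo)).all())
print(same)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code, and load the model with the same scikit-learn version that saved it.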

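That's all for today. Machine learning and data science are an ocean in themselves, so I have decided to share my experience through blog posts and projects once I am more comfortable with the tools and concepts. For the remaining two parts of this challenge, I want to explore areas such as automated testing with Selenium in Python, and to write another post that compiles Python resources.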
