Machine Learning | K-Nearest Neighbors Disease Prediction Demo


Application Demo

We will use the KNN algorithm to demonstrate predicting whether a patient has heart disease.

Training dataset:

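| ID | Age | Sex | Cholesterol (mg/dL) | Systolic BP (mmHg) | BMI | ECG result | Senile arteriosclerosis | Heart disease |
|---|---|---|---|---|---|---|---|---|
| 1 | 50 | Female | 195 | 125 | 22 | Normal | Mild | No |
| 2 | 65 | Male | 250 | 155 | 27 | Abnormal | Severe | Yes |
| 3 | 52 | Female | 215 | 130 | 24 | Normal | Moderate | No |
| 4 | 68 | Male | 275 | 160 | 29 | Abnormal | Severe | Yes |
| 5 | 55 | Female | 210 | 120 | 23 | Normal | Mild | No |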

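Data to predict

| ID | Age | Sex | Cholesterol (mg/dL) | Systolic BP (mmHg) | BMI | ECG result | Senile arteriosclerosis |
|---|---|---|---|---|---|---|---|
| A | 63 | Male | 280 | 160 | 28 | Abnormal | Severe |
| B | 48 | Female | 190 | 115 | 21 | Normal | Mild |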

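Feature Engineering

  • Feature scaling

    Age, cholesterol level, systolic blood pressure, and BMI are numeric features, so they can be standardized or min-max scaled.

  • Feature encoding

    • Sex: male = 1, female = 0
    • ECG result: normal = 0, abnormal = 1
    • Senile arteriosclerosis: mild = 0, moderate = 1, severe = 2
    • Heart disease: no = 0, yes = 1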
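
As a rough illustration of the two steps above, the sketch below applies the listed encodings and plain min-max scaling to a few columns of the training table. The helper `min_max_scale` and the dictionary names are hypothetical, and the English category labels are translations. Because each column is scaled by its own minimum and maximum, the output may differ from the rounded values shown in the processed tables.

```python
# Encoding maps taken from the feature-encoding rules above (names are hypothetical).
ECG = {"normal": 0, "abnormal": 1}
ARTERIOSCLEROSIS = {"mild": 0, "moderate": 1, "severe": 2}


def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using the list's own min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Columns from the five training rows.
ages = [50, 65, 52, 68, 55]
ecg = ["normal", "abnormal", "normal", "abnormal", "normal"]
arteriosclerosis = ["mild", "severe", "moderate", "severe", "mild"]

scaled_ages = min_max_scale(ages)
encoded_ecg = [ECG[v] for v in ecg]
encoded_arteriosclerosis = [ARTERIOSCLEROSIS[v] for v in arteriosclerosis]

print([round(a, 2) for a in scaled_ages])
print(encoded_ecg)
print(encoded_arteriosclerosis)
```

Scaling matters for KNN because the distance sums squared differences across features, so an unscaled column such as cholesterol (hundreds of mg/dL) would otherwise dominate the 0/1 encoded columns.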

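Training data after feature engineering

| ID | Age | Sex | Cholesterol | Systolic BP | BMI | ECG result | Senile arteriosclerosis | Heart disease |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 |
| 2 | 1.0 | 1 | 1.0 | 1.0 | 1.0 | 1 | 2 | 1 |
| 3 | 0.1 | 0 | 0.36 | 0.17 | 0.4 | 0 | 1 | 0 |
| 4 | 1.2 | 1 | 1.45 | 1.2 | 1.4 | 1 | 2 | 1 |
| 5 | 0.3 | 0 | 0.27 | 0.0 | 0.2 | 0 | 0 | 0 |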

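Prediction data after feature engineering

| ID | Age | Sex | Cholesterol | Systolic BP | BMI | ECG result | Senile arteriosclerosis |
|---|---|---|---|---|---|---|---|
| A | 0.66 | 1 | 0.86 | 1.0 | 0.8 | 1 | 2 |
| B | 0.36 | 0 | 0.2 | 0.1 | 0.1 | 0 | 0 |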

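Workflow Walkthrough

  • Compute distances: compute the distance from A to every point in the training data.

  • Find the k nearest neighbors: based on those distances, select the k closest data points (here k = 3).

  • Vote: count how many of the 3 nearest neighbors are positive and how many are negative. The majority class becomes the predicted class for A.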
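
For reference, the distance used in this walkthrough is the Euclidean distance, and the vote is a simple majority. Written out for a query point $\mathbf{x}$, a training point $\mathbf{y}$ with $n$ features, and the set $N_k(\mathbf{x})$ of the $k$ nearest training points (this notation is introduced here for illustration only):

```latex
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
\qquad
\hat{y}(\mathbf{x}) = \arg\max_{c \in \{0, 1\}} \sum_{j \in N_k(\mathbf{x})} \mathbb{1}[\, y_j = c \,]
```

Here $y_j$ is the label of neighbor $j$ and $\mathbb{1}[\cdot]$ equals 1 when the condition holds and 0 otherwise.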

Distance between data point A and each training data point:

| ID | Age | Sex | Cholesterol | Systolic BP | BMI | ECG result | Senile arteriosclerosis | Distance formula | Distance |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.45 | 0 | 0.4 | 0.5 | 0.4 | 0 | 0 | $d(A,1) = \sqrt{(0.63-0.45)^2 + (1-0)^2 + (0.9-0.4)^2 + (0.95-0.5)^2 + (0.8-0.4)^2 + (1-0)^2 + (2-0)^2}$ | 2.58 |
| 2 | 0.78 | 1 | 0.8 | 0.9 | 0.7 | 1 | 2 | $d(A,2) = \sqrt{(0.63-0.78)^2 + (1-1)^2 + (0.9-0.8)^2 + (0.95-0.9)^2 + (0.8-0.7)^2 + (1-1)^2 + (2-2)^2}$ | 0.21 |
| 3 | 0.48 | 0 | 0.5 | 0.5 | 0.5 | 0 | 1 | $d(A,3) = \sqrt{(0.63-0.48)^2 + (1-0)^2 + (0.9-0.5)^2 + (0.95-0.5)^2 + (0.8-0.5)^2 + (1-0)^2 + (2-1)^2}$ | 1.86 |
| 4 | 0.82 | 1 | 0.9 | 0.95 | 0.8 | 1 | 2 | $d(A,4) = \sqrt{(0.63-0.82)^2 + (1-1)^2 + (0.9-0.9)^2 + (0.95-0.95)^2 + (0.8-0.8)^2 + (1-1)^2 + (2-2)^2}$ | 0.19 |
| 5 | 0.5 | 0 | 0.45 | 0.45 | 0.45 | 0 | 0 | $d(A,5) = \sqrt{(0.63-0.5)^2 + (1-0)^2 + (0.9-0.45)^2 + (0.95-0.45)^2 + (0.8-0.45)^2 + (1-0)^2 + (2-0)^2}$ | 2.57 |

Sorted distances

| ID | Heart disease | Distance |
|---|---|---|
| 4 | Yes | 0.19 |
| 2 | Yes | 0.21 |
| 3 | No | 1.86 |
| 5 | No | 2.57 |
| 1 | No | 2.58 |

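With K = 3, we apply the majority-voting rule for classification to the three nearest neighbors of data point A:

  • 4 - Yes
  • 2 - Yes
  • 3 - No

Voting result: two of the three neighbors have heart disease and one does not, so A is predicted as likely having heart disease.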
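
The walkthrough above can be reproduced with a few lines of standard-library Python. This is a minimal sketch rather than the implementation used later in the article: `knn_predict` is a hypothetical helper, and the feature vectors and labels are copied from the distance table and the encoded heart-disease column (1 = yes, 0 = no).

```python
import math
from collections import Counter

# Row ID -> (features, label). Feature order:
# age, sex, cholesterol, systolic BP, BMI, ECG result, arteriosclerosis.
TRAIN = {
    1: ((0.45, 0, 0.40, 0.50, 0.40, 0, 0), 0),
    2: ((0.78, 1, 0.80, 0.90, 0.70, 1, 2), 1),
    3: ((0.48, 0, 0.50, 0.50, 0.50, 0, 1), 0),
    4: ((0.82, 1, 0.90, 0.95, 0.80, 1, 2), 1),
    5: ((0.50, 0, 0.45, 0.45, 0.45, 0, 0), 0),
}
POINT_A = (0.63, 1, 0.90, 0.95, 0.80, 1, 2)


def knn_predict(train, query, k=3):
    """Return (predicted label, k nearest rows) using Euclidean distance."""
    # Step 1: distance from the query to every training row.
    distances = sorted(
        (math.dist(features, query), row_id, label)
        for row_id, (features, label) in train.items()
    )
    # Step 2: keep the k closest rows.
    nearest = distances[:k]
    # Step 3: majority vote over their labels.
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0], nearest


prediction, nearest = knn_predict(TRAIN, POINT_A, k=3)
for distance, row_id, label in nearest:
    print(f"row {row_id}: distance {distance:.2f}, label {label}")
print("prediction for A:", prediction)
```

Choosing an odd K keeps a two-class vote from ending in a tie, which is why K = 3 is a convenient choice for this small example.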

Code Implementation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# 1. Load the data

# Define the column names for the dataset
column_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Load the dataset from UCI machine learning repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
heart_data = pd.read_csv(url, header=None, names=column_names, na_values="?")

# Drop rows with missing values for simplicity
heart_data.dropna(inplace=True)

print(heart_data)

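# Separate the features from the label. In the Cleveland data, target is 0 for
# no disease and 1-4 for disease, so it is binarized to match the yes/no task.
X = heart_data.drop(columns="target")
y = (heart_data["target"] > 0).astype(int)

# 2. Data cleaning (the dataset is already preprocessed, so this step is skipped)

# 3. Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Feature engineering: fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)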

# 5. Train the model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# 6. Choose K by cross-validation
neighbors = list(range(1, 50, 2))
cv_scores = []
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring="accuracy")
    cv_scores.append(scores.mean())
optimal_k = neighbors[cv_scores.index(max(cv_scores))]

# 7. Refit with the best K and evaluate on the test set
knn = KNeighborsClassifier(n_neighbors=optimal_k)
knn.fit(X_train, y_train)
score = knn.score(X_test, y_test)

print(f"Optimal K value: {optimal_k}")
print(f"Optimized Model Accuracy: {score:.4f}")

# Visualize cross-validation accuracy against K
plt.figure(figsize=(10,6))
plt.plot(neighbors, cv_scores, label="CV Average Score")
plt.axvline(optimal_k, color="red", linestyle="--", label="Optimal K")
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
plt.title("Model accuracy with respect to K")
plt.legend()
plt.show()


# Visualize the decision boundary using two features: age and thalach (maximum heart rate)
scaler = StandardScaler()
X_visualize = scaler.fit_transform(heart_data[["age", "thalach"]].values)
# Binarize the label (0 = no heart disease, 1-4 = heart disease) so the plot has two classes
y_visualize = (heart_data["target"] > 0).astype(int).values

# Build a grid over the feature range to display the decision boundary
x_min, x_max = X_visualize[:, 0].min() - 1, X_visualize[:, 0].max() + 1
y_min, y_max = X_visualize[:, 1].min() - 1, X_visualize[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

knn = KNeighborsClassifier(n_neighbors=optimal_k)
knn.fit(X_visualize, y_visualize)
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Blue marks points without heart disease and red marks points with heart disease
plt.contourf(xx, yy, Z, alpha=0.4, cmap=plt.cm.RdBu_r)
scatter = plt.scatter(X_visualize[:, 0], X_visualize[:, 1], c=y_visualize, cmap=plt.cm.RdBu_r, s=20)
plt.xlabel("Age (standardized)")
plt.ylabel("Max Heart Rate (standardized)")
plt.title("Decision Boundary for Heart Disease Classification")
plt.legend(handles=scatter.legend_elements()[0], labels=["No Heart Disease", "Heart Disease"])
plt.show()
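
Step 6 in the script scores each candidate K by cross-validation and keeps the best one. As a dependency-free sketch of the same idea, the snippet below swaps in leave-one-out validation (a simpler scheme than the 10-fold `cross_val_score` call) and runs it on the five toy rows from the walkthrough. The helpers `predict` and `leave_one_out_accuracy` are hypothetical names introduced for illustration.

```python
import math
from collections import Counter

# Toy rows from the walkthrough: (features, label), where 1 = heart disease.
ROWS = [
    ((0.45, 0, 0.40, 0.50, 0.40, 0, 0), 0),
    ((0.78, 1, 0.80, 0.90, 0.70, 1, 2), 1),
    ((0.48, 0, 0.50, 0.50, 0.50, 0, 1), 0),
    ((0.82, 1, 0.90, 0.95, 0.80, 1, 2), 1),
    ((0.50, 0, 0.45, 0.45, 0.45, 0, 0), 0),
]


def predict(train_rows, query, k):
    """Majority vote among the k training rows closest to the query."""
    nearest = sorted(train_rows, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(rows, k):
    """Hold out each row in turn, predict it from the rest, and average the hits."""
    hits = 0
    for i, (features, label) in enumerate(rows):
        rest = rows[:i] + rows[i + 1:]
        hits += predict(rest, features, k) == label
    return hits / len(rows)


candidate_ks = [1, 3]  # odd values avoid tied votes with two classes
scores = {k: leave_one_out_accuracy(ROWS, k) for k in candidate_ks}
best_k = max(scores, key=scores.get)
print(scores, "best K:", best_k)
```

With only five rows the scores are far too coarse to guide a real choice of K; the sketch is meant to show the mechanism, not to recommend a value.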

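Data Visualization

K-value selection

[Figure: mean cross-validation accuracy versus number of neighbors K, with the optimal K marked]

Prediction result

[Figure: KNN decision boundary over age and maximum heart rate]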