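[Machine Learning] LR Logistic Regression (Classification in Practice) 1/ Background: use Logistic regression to predict whether horses with colic survive; the data contains 368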

Here are two complete logistic regression classification examples:

1. Binary logistic regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
import seaborn as sns

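# Generate a synthetic binary classification dataset
X, y = make_classification(
    n_samples=1000,           # number of samples
    n_features=2,             # total number of features
    n_informative=2,          # number of informative features
    n_redundant=0,            # number of redundant features
    n_clusters_per_class=1,   # clusters per class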
    random_state=42
)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize features (fit the scaler on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train the logistic regression model
log_reg = LogisticRegression(
    random_state=42,
    max_iter=1000  # raise the iteration cap to help the solver converge
)

log_reg.fit(X_train_scaled, y_train)

# Predict labels and class probabilities
y_pred = log_reg.predict(X_test_scaled)
y_pred_proba = log_reg.predict_proba(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification report:")
print(classification_report(y_test, y_pred))

# Plot the confusion matrix
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')

# Plot the decision boundary
plt.subplot(1, 3, 2)
h = 0.02  # grid step size
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Standardize the grid points with the fitted scaler, then predict
mesh_points = np.c_[xx.ravel(), yy.ravel()]
mesh_points_scaled = scaler.transform(mesh_points)
Z = log_reg.predict(mesh_points_scaled)
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.RdYlBu)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.RdYlBu)
plt.title('Decision boundary')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# Plot the distribution of predicted probabilities
plt.subplot(1, 3, 3)
plt.hist(y_pred_proba[:, 1], bins=20, alpha=0.7, color='skyblue')
plt.title('Predicted probability distribution')
plt.xlabel('Positive-class probability')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

# Print the model coefficients
print(f"\nModel coefficients: {log_reg.coef_}")
print(f"Intercept: {log_reg.intercept_}")
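
The coefficients and intercept printed above determine each prediction through the sigmoid function. The following is a minimal sketch of that computation for a single standardized sample. The coefficient, intercept, and feature values are hypothetical and chosen only for illustration; they are not output from the model above.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function mapping a score to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def positive_class_probability(coef, intercept, x):
    """P(y = 1 | x) for a binary logistic regression model."""
    z = sum(w * xi for w, xi in zip(coef, x)) + intercept
    return sigmoid(z)


# Hypothetical values for illustration only. With the fitted model they
# would come from log_reg.coef_[0] and log_reg.intercept_[0].
coef = [2.0, -1.0]
intercept = 0.5
x = [0.3, 0.1]  # one sample, already standardized

p = positive_class_probability(coef, intercept, x)
label = int(p > 0.5)  # a probability above 0.5 corresponds to a score above 0
print(f"P(y=1) = {p:.4f}, predicted label = {label}")
```

The second column of predict_proba corresponds to this probability, and the first column is its complement.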

2. Multiclass logistic regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
import seaborn as sns

# Load the Iris dataset (a three-class problem)
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
target_names = iris.target_names

print("Dataset info:")
print(f"Number of features: {X.shape[1]}")
print(f"Number of samples: {X.shape[0]}")
print(f"Classes: {target_names}")

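# Split into training and test sets, keeping class proportions (stratify)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize features (fit the scaler on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)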

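# Create and train the multiclass logistic regression model.
# With the lbfgs solver, scikit-learn (0.22+) fits a multinomial (softmax)
# model for multiclass targets by default; the explicit multi_class
# argument is deprecated since scikit-learn 1.5, so it is omitted here.
multi_log_reg = LogisticRegression(
    solver='lbfgs',             # solver that supports the multinomial loss
    random_state=42,
    max_iter=1000
)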

multi_log_reg.fit(X_train_scaled, y_train)

# Predict labels and class probabilities
y_pred = multi_log_reg.predict(X_test_scaled)
y_pred_proba = multi_log_reg.predict_proba(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.4f}")
print("\nClassification report:")
print(classification_report(y_test, y_pred, target_names=target_names))

# Plot the confusion matrix
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=target_names, yticklabels=target_names)
plt.title('Confusion matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')

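# Plot feature importance: mean absolute coefficient across classes
# (comparable across features because the inputs are standardized)
plt.subplot(1, 3, 2)
feature_importance = np.abs(multi_log_reg.coef_).mean(axis=0)
plt.barh(feature_names, feature_importance)
plt.title('Feature importance')
plt.xlabel('Mean absolute coefficient')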

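# Plot, for each class, the predicted probability assigned to the true class
plt.subplot(1, 3, 3)
for i in range(len(target_names)):
    plt.hist(y_pred_proba[y_test == i, i], bins=10, alpha=0.6,
             label=target_names[i], density=True)
plt.title('Predicted probability distribution by class')
plt.xlabel('Predicted probability')
plt.ylabel('Density')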
plt.legend()

plt.tight_layout()
plt.show()

# Print detailed results
print("\nDetailed predictions (first 10 test samples):")
print("True label\tPredicted label\tProbabilities")
for i in range(min(10, len(y_test))):
    true_label = target_names[y_test[i]]
    pred_label = target_names[y_pred[i]]
    probas = [f"{p:.3f}" for p in y_pred_proba[i]]
    print(f"{true_label}\t\t{pred_label}\t\t{probas}")

# Print the model coefficients
print("\nModel coefficients (one row per class):")
for i, coef in enumerate(multi_log_reg.coef_):
    print(f"{target_names[i]}: {coef}")

print(f"\nIntercepts: {multi_log_reg.intercept_}")

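# Per-class recall: accuracy restricted to the samples whose true label is
# that class (the fraction of each class that was predicted correctly)
print("\nPer-class recall:")
for i, class_name in enumerate(target_names):
    class_mask = y_test == i
    class_recall = accuracy_score(y_test[class_mask], y_pred[class_mask])
    print(f"{class_name}: {class_recall:.4f}")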
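
A multinomial logistic regression model turns one linear score per class into probabilities with the softmax function. The sketch below shows that step in isolation. The three scores are hypothetical stand-ins for the per-class values of coef · x + intercept, not output from the model above.

```python
import math


def softmax(scores):
    """Convert per-class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical per-class scores for one sample (illustration only)
scores = [2.0, 1.0, -1.0]

probs = softmax(scores)
predicted = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability

print("Probabilities:", [f"{p:.3f}" for p in probs])
print("Predicted class index:", predicted)
```

The predicted class is simply the one with the largest score, since softmax preserves the ordering of its inputs.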

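Key differences:

Main differences between binary and multiclass classification:

Dataset:
- Binary: 2 classes
- Multiclass: 3 or more classes
Model parameters:
- Binary: a single sigmoid output, so coef_ has one row
- Multiclass: a multinomial (softmax) model, so coef_ has one row per class
Output:
- Binary: predicted probabilities are [P(negative class), P(positive class)]
- Multiclass: predicted probabilities are [P(class 1), P(class 2), P(class 3), ...]
Evaluation metrics:
- Binary: focus on precision, recall, and F1 score
- Multiclass: evaluate each class separately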
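
To make the per-class evaluation concrete, here is a minimal sketch that derives precision and recall for each class from a confusion matrix. It assumes the same layout as the heatmaps above (rows are true labels, columns are predicted labels). The matrix values are hypothetical, not results from the Iris example.

```python
def per_class_metrics(cm):
    """Return a list of (precision, recall) tuples, one per class.

    cm[i][j] is the number of samples with true label i predicted as j.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]                                # correctly predicted as k
        actual = sum(cm[k])                          # row sum: true label is k
        predicted = sum(cm[r][k] for r in range(n))  # column sum: predicted as k
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        metrics.append((precision, recall))
    return metrics


# Hypothetical 3-class confusion matrix (illustration only)
cm = [
    [10, 0, 0],
    [0, 9, 1],
    [0, 2, 8],
]

for k, (precision, recall) in enumerate(per_class_metrics(cm)):
    print(f"class {k}: precision={precision:.3f}, recall={recall:.3f}")
```

These are the same per-class quantities that classification_report lists in its precision and recall columns.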

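Running the examples:

Both scripts are self-contained and can be run directly
Required libraries: numpy, matplotlib, scikit-learn, seaborn
The code covers the full workflow: data preprocessing, model training, evaluation, and visualization
The multiclass example uses the classic Iris dataset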
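
Before running either script, it can help to confirm that the required libraries are importable. The sketch below is one way to do that with the standard library. Note that scikit-learn is imported under the name sklearn; the helper name find_missing is made up for this illustration.

```python
import importlib.util

# Import names of the required libraries; scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["numpy", "matplotlib", "sklearn", "seaborn"]


def find_missing(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```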