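[Machine Learning in Practice] Regression Analysis and Prediction: Logistic Regression Principles and Practice

Companion video tutorial: www.bilibili.com/video/BV1Rx…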

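1. What Is Logistic Regression

Linear regression handles prediction problems, while logistic regression handles classification problems, although at its core it is still a regression. By adjusting the weights w and the bias b, the model fits a linear function and uses it to compute the probability that a sample belongs to a given class.

For example, if a machine learning model estimates from the input data that a person has an 80% chance of having heart disease, it assigns that person to the "diseased" class and outputs 1; if the chance is 30%, it assigns the "healthy" class and outputs 0. The probability is thereby converted into a discrete classification.

In its simplest form, logistic regression classifies by combining linear regression with a step function, as shown in the figure below:

However, a step function cannot cope with the bias introduced by outlying samples. Take a pass/fail exam as the step threshold: if many people score 0 or many score 100, the fitted pass mark may drift away from 60. A better choice is the Sigmoid function (an S-shaped function), which maps the linear output to a value that can be read as a class probability. It flattens out for extreme inputs whose outputs are already close to 0 or 1, so those samples have little influence, while it stays sensitive to small changes in the middle region.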
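To make the contrast concrete, here is a minimal sketch that evaluates a step function and a Sigmoid on the same inputs. The helper name `step` and the sample values in `z` are chosen only for illustration and do not come from the original tutorial.

```python
import numpy as np

def step(z):
    # Hard threshold: 1 for z >= 0, otherwise 0
    return (z >= 0).astype(float)

def sigmoid(z):
    # Smooth S-shaped mapping into the interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Two extreme inputs and two inputs very close to the decision boundary
z = np.array([-10.0, -0.1, 0.1, 10.0])

step_out = step(z)
sigmoid_out = sigmoid(z)

print("step:   ", step_out)
print("sigmoid:", np.round(sigmoid_out, 4))
```

The step function jumps from 0 to 1 between -0.1 and 0.1 and gives no hint of how close those inputs are to the boundary, whereas the Sigmoid returns values just below and just above 0.5 for them and saturates only for the extreme inputs.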

The Sigmoid function is $g(z) = \frac{1}{1 + e^{-z}}$, and it can be plotted with Matplotlib:

import matplotlib.pyplot as plt
import numpy as np
z = np.arange(-20, 20, 0.5)
g = 1/(1+np.exp(-z))
plt.plot(z, g)
plt.show()

From this we can also define a sigmoid function in code:

def sigmoid(z):    
    y_hat = 1/(1+ np.exp(-z))
    return y_hat

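The Sigmoid function has the following properties:

(1) It is continuous and monotonically increasing.

(2) It is differentiable everywhere.

(3) Its output lies in the open interval (0, 1), so the result can be read as a probability.

(4) It suppresses both extremes and is sensitive to changes in the middle region.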
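These properties can be checked numerically. The sketch below is only an illustration: the grid of 2001 points on [-10, 10] and the finite-difference step `h` are arbitrary choices, and the closed-form slope g(z)(1 - g(z)) is the standard derivative of the Sigmoid rather than something stated in the original text.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 2001)
g = sigmoid(z)

# (1) monotonically increasing: every consecutive difference is positive
is_increasing = bool(np.all(np.diff(g) > 0))

# (3) outputs stay strictly between 0 and 1 on this grid
in_open_interval = bool(np.all((g > 0) & (g < 1)))

# (2) differentiable: the central-difference slope matches g(z) * (1 - g(z))
h = 1e-5
numeric_slope = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic_slope = g * (1 - g)
max_gap = float(np.max(np.abs(numeric_slope - analytic_slope)))

# (4) the slope peaks in the middle and is nearly flat at both ends
steepest_z = float(z[np.argmax(analytic_slope)])

print(is_increasing, in_open_interval, max_gap, steepest_z)
```

The slope is largest at z = 0, where it equals 0.25, which is what "sensitive in the middle region" means in practice.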

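2. The Loss Function of Logistic Regression

The loss function of logistic regression is the cross-entropy averaged over the N samples:

$$L(w, b) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right], \qquad \hat{y}^{(i)} = g\left(w^{T} x^{(i)} + b\right)$$

In code: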

def loss_function(x,y,w,b):
    y_hat = sigmoid(np.dot(x,w) + b)                    # Sigmoid applied to the linear function (wX+b) gives y'
    loss = -(y*np.log(y_hat) + (1-y)*np.log(1-y_hat))   # per-sample loss
    cost = np.sum(loss) / x.shape[0]                       # average loss over the whole dataset
    return cost
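A small worked example shows how this loss behaves. It is a sketch with made-up predictions: the function name `binary_cross_entropy`, the sample values, and the clipping constant `eps` are assumptions introduced here for illustration.

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    # Same per-sample expression as in loss_function above
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Three samples whose true label is 1, predicted with decreasing confidence
y = np.array([1.0, 1.0, 1.0])
y_hat = np.array([0.9, 0.5, 0.1])

per_sample = binary_cross_entropy(y, y_hat)
print(np.round(per_sample, 4))   # [0.1054 0.6931 2.3026]

# A prediction of exactly 0 or 1 would make log() return -inf.
# Clipping is one common workaround (eps is an arbitrary small constant).
eps = 1e-12
safe_loss = binary_cross_entropy(np.array([1.0]), np.clip(np.array([0.0]), eps, 1 - eps))
print(np.isfinite(safe_loss))
```

A confident correct prediction costs little, while a confident wrong one is penalized heavily, and that asymmetry is what drives the weights toward well-calibrated probabilities.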

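3. Gradient Descent for Logistic Regression

The gradient has exactly the same form as in linear regression:

$$\frac{\partial L}{\partial w} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}^{(i)} - y^{(i)}\right) x^{(i)}, \qquad \frac{\partial L}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}^{(i)} - y^{(i)}\right)$$

Likewise, the parameter update rule is the same, with $\alpha$ denoting the learning rate:

$$w \leftarrow w - \alpha \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial L}{\partial b}$$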
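For readers who want to see where this gradient comes from, the following is a sketch of the standard chain-rule derivation for a single sample. It is not taken from the original tutorial; the symbols $z$, $\hat{y}$ and $\ell$ are introduced here, with $\hat{y} = g(z)$ and $z = w^{T}x + b$.

```latex
\begin{aligned}
\ell &= -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right],
\qquad \hat{y} = g(z) = \frac{1}{1 + e^{-z}}, \qquad z = w^{T} x + b \\[4pt]
\frac{\partial \ell}{\partial \hat{y}} &= -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
  = \frac{\hat{y} - y}{\hat{y}\,(1 - \hat{y})} \\[4pt]
\frac{\partial \hat{y}}{\partial z} &= g(z)\,\bigl(1 - g(z)\bigr) = \hat{y}\,(1 - \hat{y}) \\[4pt]
\frac{\partial \ell}{\partial z} &= \frac{\partial \ell}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}
  = \hat{y} - y \\[4pt]
\frac{\partial \ell}{\partial w} &= (\hat{y} - y)\, x,
\qquad \frac{\partial \ell}{\partial b} = \hat{y} - y
\end{aligned}
```

The factor $\hat{y}(1 - \hat{y})$ from the Sigmoid cancels against the denominator from the cross-entropy, which is why the result looks the same as the linear regression gradient. Averaging over all samples gives the batch gradient.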

The following code implements gradient descent for logistic regression:

def gradient_descent(x, y, w, b, lr, loop):
    loss_history = np.zeros(loop)                          # loss after each iteration
    w_history = np.zeros((loop, w.shape[0], w.shape[1]))   # weights after each iteration
    b_history = np.zeros(loop)                             # bias after each iteration
    for i in range(loop):
        y_hat = sigmoid(np.dot(x, w) + b)                  # predicted probabilities
        d_w = np.dot(x.T, ((y_hat-y)))/x.shape[0]          # gradient with respect to w
        d_b = np.sum(y_hat-y)/x.shape[0]                   # gradient with respect to b
        w = w - lr*d_w                                     # update the weights
        b = b - lr*d_b                                     # update the bias
        loss_history[i] = loss_function(x, y, w, b)
        w_history[i] = w
        b_history[i] = b
    return loss_history, w_history, b_history
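One way to gain confidence in the gradient expressions used above is a finite-difference check. The sketch below runs on random synthetic data (not the heart dataset); the sample size, the seed and the step `eps` are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_function(x, y, w, b):
    y_hat = sigmoid(np.dot(x, w) + b)
    loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return np.sum(loss) / x.shape[0]

# Synthetic data: 20 samples, 3 features, random 0/1 labels
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 3))
y = (rng.random((20, 1)) < 0.5).astype(float)
w = np.full((3, 1), 0.1)
b = 0.0

# Analytic gradients, same expressions as in gradient_descent
y_hat = sigmoid(np.dot(x, w) + b)
d_w = np.dot(x.T, y_hat - y) / x.shape[0]
d_b = np.sum(y_hat - y) / x.shape[0]

# Central-difference approximation of the same gradients
eps = 1e-6
num_d_w = np.zeros_like(w)
for j in range(w.shape[0]):
    shift = np.zeros_like(w)
    shift[j, 0] = eps
    num_d_w[j, 0] = (loss_function(x, y, w + shift, b)
                     - loss_function(x, y, w - shift, b)) / (2 * eps)
num_d_b = (loss_function(x, y, w, b + eps) - loss_function(x, y, w, b - eps)) / (2 * eps)

max_w_gap = float(np.max(np.abs(d_w - num_d_w)))
b_gap = float(abs(d_b - num_d_b))
print(max_w_gap, b_gap)
```

If the analytic and numerical gradients agree to within a tiny tolerance, the update step is at least consistent with the loss it is meant to minimize.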

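4. Classifying Heart Disease Patients

Meaning of the data columns:

Column      Meaning
age         age
sex         sex
cp          chest pain
trestbps    blood pressure
chol        cholesterol
fbs         blood sugar
restecg     electrocardiogram
thalach     maximum heart rate
exang       angina or not
oldpeak     ST depression
slope       ST segment slope
ca          fluoroscopy
thal        defect type
target      heart disease: 0 = no, 1 = yes

Load the data and count the samples in each class: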

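import pandas as pd
heart = pd.read_csv('./heart.csv')
print(heart.head())                   # head is a method, so it must be called to show the first rows
print(heart.target.value_counts())    # number of samples in each class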

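There are 165 samples with heart disease and 138 without. The two classes are close in size, so the dataset is reasonably balanced and can be used as is.

Build the feature set and the label set: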

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
heart = pd.read_csv('./heart.csv')
print(heart.head())
print(heart.target.value_counts())
x = heart.drop(['target'], axis=1)
y = heart.target.values
y = y.reshape(-1, 1)  # -1 lets NumPy infer that dimension; here it equals len(y)
print(x.shape, y.shape)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)     # fit on the training set, then transform it
x_test = scaler.transform(x_test)           # transform only, reusing the training-set fit
# The labels need no normalization: they are already 0 or 1, which is within the valid range
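The reason for calling `fit_transform` on the training set but only `transform` on the test set is easier to see on a tiny example. The sketch below uses made-up one-column arrays, not the heart data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy "training" and "test" columns (values are invented for illustration)
train = np.array([[40.0], [50.0], [60.0]])
test = np.array([[30.0], [55.0], [70.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)   # learns min=40 and max=60 from the training data
test_scaled = scaler.transform(test)         # reuses that min and max, learns nothing new

print(scaler.data_min_, scaler.data_max_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the scaler only knows the training range, test values outside it map outside [0, 1] (here 30 becomes negative and 70 exceeds 1). That is expected: fitting the scaler on the test set as well would leak information about the test data into preprocessing.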

Define the classification prediction function:

def predict(x, w, b):
    z = np.dot(x, w) + b
    y_hat = sigmoid(z)
    y_pred = np.zeros((y_hat.shape[0], 1))
    for i in range(y_hat.shape[0]):
        if y_hat[i, 0] < 0.5:
            y_pred[i, 0] = 0
        else:
            y_pred[i, 0] = 1
    return y_pred
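The explicit loop above can also be written as a single vectorized comparison. The following sketch, run on random synthetic inputs, checks that both forms give the same labels; the names `predict_loop` and `predict_vectorized` are introduced here only to tell the two variants apart.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict_loop(x, w, b):
    # Same logic as the predict function above
    y_hat = sigmoid(np.dot(x, w) + b)
    y_pred = np.zeros((y_hat.shape[0], 1))
    for i in range(y_hat.shape[0]):
        y_pred[i, 0] = 0 if y_hat[i, 0] < 0.5 else 1
    return y_pred

def predict_vectorized(x, w, b):
    # Compare every probability with 0.5 at once, then convert True/False to 1.0/0.0
    return (sigmoid(np.dot(x, w) + b) >= 0.5).astype(float)

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 4))
w = rng.normal(size=(4, 1))
b = 0.3

loop_pred = predict_loop(x, w, b)
vec_pred = predict_vectorized(x, w, b)
same = np.array_equal(loop_pred, vec_pred)
print(same)
```

The vectorized form is shorter and avoids a Python-level loop, which matters more as the number of samples grows.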

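Define the training function:

def logic_train(x, y, w, b, lr, loop):
    loss_history, w_history, b_history = gradient_descent(x, y, w, b,  lr, loop)
    print("Final training loss:", loss_history[-1])
    y_pred = predict(x, w_history[-1], b_history[-1])
    return loss_history, w_history, b_history, y_pred

Main script:

dimension = x.shape[1] # number of features: len(x) is the row count of the matrix, the dimension is its column count
weight = np.full((dimension,1),0.1) # all weights start at 0.1; a vector is usually 1D, but this actually creates a 2D array
bias = 0            # bias
alpha = 1           # learning rate
iterations = 500    # number of iterations
loss_history, weight_history, bias_history, y_pred = logic_train(x_train,y_train,weight,bias,alpha,iterations)
# Compute the training accuracy
train_acc = 100 - np.mean(np.abs(y_pred-y_train)) * 100
print("Logistic regression training accuracy:", train_acc)
# Predict on the test set
# Note: logic_train runs gradient descent again, this time on x_test and from the initial weight and bias,
# so the figure it reports measures fit to the test split, not generalization of the weights learned on x_train
loss_history, weight_history, bias_history, y_pred = logic_train(x_test,y_test,weight,bias,alpha,iterations)
test_acc = 100 - np.mean(np.abs(y_pred-y_test)) * 100
print("Logistic regression test accuracy:", test_acc)

Output:

Final training loss: 0.35302333688653487
Logistic regression training accuracy: 83.05785123966942
Final training loss: 0.18403998571780056
Logistic regression test accuracy: 90.1639344262295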
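Since the second `logic_train` call fits a fresh model on the test split, the second accuracy figure describes that refit rather than the model trained on `x_train`. To measure generalization, one would usually keep the weights learned on the training split and only call the prediction step on the held-out data. The sketch below shows that pattern on synthetic two-cluster data; the helper names `fit` and `accuracy`, the data, and the hyperparameters are all assumptions made for illustration, so its numbers say nothing about the heart dataset.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit(x, y, lr=0.5, loop=500):
    # Plain batch gradient descent, same update rule as gradient_descent above
    w = np.full((x.shape[1], 1), 0.1)
    b = 0.0
    for _ in range(loop):
        y_hat = sigmoid(np.dot(x, w) + b)
        w = w - lr * np.dot(x.T, y_hat - y) / x.shape[0]
        b = b - lr * np.sum(y_hat - y) / x.shape[0]
    return w, b

def predict(x, w, b):
    return (sigmoid(np.dot(x, w) + b) >= 0.5).astype(float)

def accuracy(y_pred, y_true):
    return 100 - np.mean(np.abs(y_pred - y_true)) * 100

# Synthetic data: two Gaussian clusters, shuffled, then split 80/20
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-2, 1, size=(100, 2)), rng.normal(2, 1, size=(100, 2))])
y = np.vstack([np.zeros((100, 1)), np.ones((100, 1))])
order = rng.permutation(len(x))
x, y = x[order], y[order]
x_train, x_test = x[:160], x[160:]
y_train, y_test = y[:160], y[160:]

# Train once on the training split only
w, b = fit(x_train, y_train)

# Evaluate both splits with the SAME learned parameters
train_acc = accuracy(predict(x_train, w, b), y_train)
test_acc = accuracy(predict(x_test, w, b), y_test)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

In the tutorial's own code, the equivalent change would be to call `predict(x_test, weight_history[-1], bias_history[-1])` with the histories returned by the first `logic_train` call.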

5. Fitting with sklearn

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
heart = pd.read_csv('./heart.csv')
x = heart.drop(['target'], axis=1)
y = heart.target.values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)     # fit on the training set, then transform it
x_test = scaler.transform(x_test)           # transform only, reusing the training-set fit
lr = LogisticRegression()
lr.fit(x_train, y_train)
print("Accuracy:", lr.score(x_test, y_test))

Accuracy: 0.8360655737704918
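The sklearn model follows the same Sigmoid-plus-threshold idea as the hand-written version. The sketch below, on synthetic data rather than the heart dataset, reads the learned `coef_` and `intercept_`, recomputes the probabilities by hand, and compares them with `predict_proba` and `predict`. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data: two overlapping Gaussian clusters
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-1, 1, size=(100, 2)), rng.normal(1, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

model = LogisticRegression()
model.fit(x, y)

# Probability of class 1 as reported by sklearn
proba = model.predict_proba(x)[:, 1]

# The same probability recomputed as sigmoid(w.x + b) from the fitted parameters
z = x @ model.coef_.T + model.intercept_
manual_proba = (1 / (1 + np.exp(-z))).ravel()

# Thresholding the probability at 0.5 reproduces the predicted labels
manual_pred = (proba > 0.5).astype(int)

print(np.allclose(proba, manual_proba))
print(np.array_equal(manual_pred, model.predict(x)))
```

Note that `LogisticRegression` applies L2 regularization by default, so its weights will generally differ from those found by the unregularized gradient descent shown earlier, even on the same data.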

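6. Adding Dummy Features to Improve Accuracy

import pandas as pd
heart = pd.read_csv('./heart.csv')
# Add dummy features: for a column such as cp, turn its values 0, 1, 2, 3 into 4 new columns of 0s and 1s to improve accuracy
a = pd.get_dummies(heart['cp'], prefix='cp')
b = pd.get_dummies(heart['thal'], prefix='thal')
c = pd.get_dummies(heart['slope'], prefix='slope')
frame = [heart, a, b, c]
heart = pd.concat(frame, axis=1)
heart = heart.drop(columns=['cp', 'thal', 'slope'])   # drop the original categorical columns
x = heart.drop(['target'], axis=1)
y = heart.target.values
# The split, scaling and fitting steps then proceed as in the previous section

Accuracy: 0.8688524590163934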
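To see what `pd.get_dummies` actually produces, here is a minimal sketch on a five-row toy frame. The values are invented, and `dtype=int` is passed only so that the new columns print as 0 and 1.

```python
import pandas as pd

# Toy frame with one categorical column (cp) coded as 0..3
toy = pd.DataFrame({"cp": [0, 2, 1, 3, 2], "target": [1, 0, 1, 0, 1]})

# One new 0/1 column per distinct cp value
dummies = pd.get_dummies(toy["cp"], prefix="cp", dtype=int)

# Replace the original column with its dummy columns
encoded = pd.concat([toy.drop(columns=["cp"]), dummies], axis=1)
print(encoded)
```

Each row has exactly one 1 among the `cp_*` columns. The point of the encoding is that codes such as 0, 1, 2, 3 are category labels, not quantities; with separate columns the model can learn an independent weight for each category instead of assuming that category 3 is "three times" category 1.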