Writing a Neural Network from Scratch with NumPy to Recognize MNIST Handwritten Digits


In this post we will build, from scratch and using only the basic NumPy library, a neural network that can recognize the MNIST handwritten digits. None of the popular deep learning frameworks such as PyTorch or TensorFlow will be used; scikit-learn appears, but only to download the dataset.

Everything starts simple: the first network we build has only a single node and recognizes only the digit 0, which makes it essentially an implementation of logistic regression. It looks trivial, but it already contains the full workflow and the key ideas, and it lays the groundwork for writing more complex networks later.

Next we extend it to a network with one hidden layer, still designed only to recognize the digit 0. Finally, we introduce softmax so the network can recognize all digits from 0 to 9.

Preparing the dataset

Downloading the dataset

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
X, y = mnist["data"],mnist["target"]
import sklearn
sklearn.__version__ #'0.23.1'

If the code above fails, it may be because your sklearn version is too new; sklearn 0.23.1 is used here. The MNIST dataset contains 70,000 handwritten digit images. Each image is made up of 28x28 pixels in a single grayscale channel, and each pixel takes a value between 0 and 255. Each label is the digit that appears in the corresponding image.
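As a quick sanity check on the shapes just described (assuming sklearn 0.23.1, so that data and target come back as plain numpy arrays):

print(X.shape)  # (70000, 784): one flattened 28x28 image per row
print(y.shape)  # (70000,): one label per image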

# first normalize the pixel values to the range [0, 1]
X = X/255

First we build a binary classifier: the digit 0 forms one class, whose label we change to 1 (these are the positive samples); every other digit is labelled 0.

import numpy as np
y = y.astype(np.float32)
y_new = np.zeros(y.shape,dtype=np.float32)
y_new[np.where(y==0.0)[0]] = 1
y = y_new

Note that y is originally of dtype object, so it must be converted to np.float32 with y = y.astype(np.float32); otherwise y_new[np.where(y==0.0)[0]] = 1 has no effect and every label ends up as 0.
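A tiny illustration of why the conversion matters, using made-up values: the raw targets are strings, and a string '0' never compares equal to the float 0.0.

y_obj = np.array(['0', '3', '0'], dtype=object)       # how the raw targets look
print(np.where(y_obj == 0.0)[0])                       # [] -- no matches, so every label would stay 0
print(np.where(y_obj.astype(np.float32) == 0.0)[0])    # [0 2] -- matches after the cast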

Every element of X now lies between 0 and 1, and X has shape (70000, 784). Next we split the original dataset into a training set and a test set of 60,000 and 10,000 samples respectively. We also need one more rearrangement: each column, rather than each row, should store one sample.

# number of training samples
m = 60000
# number of test samples
m_test = X.shape[0] - m

X_train, X_test = X[:m].T, X[m:].T
y_train, y_test = y[:m].reshape(1,m), y[m:].reshape(1,m_test)
X_train.shape  # (784, 60000)
y_test.shape   # (1, 10000)

Shuffling the dataset

The benefit of shuffling is that each mini-batch sees more varied data. np.random.permutation generates a random permutation of the indices, and the training data is then reordered according to this new index sequence.

np.random.seed(138)
shuffle_index = np.random.permutation(m)
X_train, y_train = X_train[:,shuffle_index],y_train[:,shuffle_index]
# X_train.shape
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

i = 3
plt.imshow(X_train[:,i].reshape(28,28),cmap=matplotlib.cm.binary)
plt.axis("off")
plt.show()

(Figure: the training image at index i = 3, rendered by the code above)

A single neuron (logistic regression)

We will build a simple feed-forward network. Its input is a $784$ ($28 \times 28$)-dimensional vector and its output is 1 or 0, where 1 marks a positive sample, i.e. the input image is a 0.

Forward propagation

In forward propagation, a sample $x$ is fed into the network as the input signal:

$$\hat{y} = \sigma(w^T x + b)$$

Defining the sigmoid function

For a binary classification problem the activation function $\sigma$ is the sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
def sigmoid(z):
    s = 1 / (1 + np.exp(-z))
    return s

In the forward-pass code the computation is split into two stages: the linear transformation z = np.matmul(W.T,X) + b, followed by the non-linear transformation A = sigmoid(z). Splitting the computation into these two stages makes backpropagation easier, because each stage can be differentiated on its own.
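As a minimal sketch of these two stages (the training loop below inlines the same computation), assuming W has shape (784, 1), b has shape (1, 1) and X stores one sample per column:

def forward(W, b, X):
    # stage 1: linear transform, (1,784) @ (784,m) + (1,1) -> (1,m)
    Z = np.matmul(W.T, X) + b
    # stage 2: element-wise non-linear transform
    A = sigmoid(Z)
    return Z, A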

Loss function

For a classification problem the loss function is usually the cross-entropy loss, and this is no exception:

$$L(y,\hat{y}) = -y\log(\hat{y}) - (1-y)\log(1-\hat{y})$$

In each iteration the losses of all samples are summed and then averaged:

$$L(y,\hat{y}) = - \frac{1}{m} \sum_{i=1}^m \left( y^{(i)}\log(\hat{y}^{(i)}) + (1-y^{(i)})\log(1-\hat{y}^{(i)})\right)$$
def compute_loss(Y, Y_hat):
    m = Y.shape[1]
    L = -(1./m) * ( np.sum( np.multiply(np.log(Y_hat),Y) ) + np.sum( np.multiply(np.log(1 - Y_hat),(1 -Y))))
    return L

Backpropagation

For backpropagation we want to know how much each component $w_j$ of $w$ influences $L$, which is equivalent to computing $\partial L/\partial w_j$.

Let us first take a single sample and compute the gradient of $L$ with respect to one component $w_j$, i.e. $\frac{\partial L}{\partial w_j}$. The dependency chain is $w_j \rightarrow z \rightarrow \hat{y} \rightarrow L$:

$$z = w^T x + b$$
$$\hat{y} = \sigma(z)$$
$$L(y,\hat{y}) = -y\log(\hat{y}) - (1-y)\log(1-\hat{y})$$

The equations above spell out the forward pass $w_j \rightarrow z \rightarrow \hat{y} \rightarrow L$.

By the chain rule,

$$\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial z}\,\frac{\partial z}{\partial w_j}$$

so we compute $\frac{\partial L}{\partial \hat{y}}$, $\frac{\partial \hat{y}}{\partial z}$ and $\frac{\partial z}{\partial w_j}$ separately.

$$\begin{aligned}
\frac{\partial L}{\partial \hat{y}} &= \frac{\partial}{\partial \hat{y}}\left(-y\log(\hat{y}) - (1-y)\log(1-\hat{y})\right)\\
&= -y\,\frac{\partial}{\partial \hat{y}}\log(\hat{y}) - (1-y)\,\frac{\partial}{\partial \hat{y}}\log(1-\hat{y})\\
&= \frac{-y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\\
&= \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}
\end{aligned}$$

$$\begin{aligned}
\frac{\partial}{\partial z}\sigma(z) &= \frac{\partial}{\partial z}\left(\frac{1}{1+e^{-z}}\right)\\
&= -\frac{1}{(1+e^{-z})^2}\,\frac{\partial}{\partial z}(1+e^{-z})\\
&= \frac{e^{-z}}{(1+e^{-z})^2}\\
&= \frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}}\\
&= \sigma(z)\,\frac{e^{-z}}{1+e^{-z}}\\
&= \sigma(z)\left(1 - \frac{1}{1+e^{-z}}\right)\\
&= \sigma(z)(1-\sigma(z))\\
&= \hat{y}(1-\hat{y})
\end{aligned}$$

$$\frac{\partial z}{\partial w_j} = \frac{\partial}{\partial w_j}(w^T x + b) = \frac{\partial}{\partial w_j}(w_0x_0 + \cdots + w_nx_n + b) = x_j$$
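Multiplying the three factors together, the $\hat{y}(1-\hat{y})$ terms cancel, so for a single sample

$$\frac{\partial L}{\partial w_j} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}\cdot\hat{y}(1-\hat{y})\cdot x_j = (\hat{y}-y)\,x_j$$

and averaging over $m$ samples, with the samples stored column-wise in $X$, gives the vectorized form used in the code below:

$$\frac{\partial L}{\partial W} = \frac{1}{m}\,X\,(\hat{y}-y)^T$$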

Taking the partial derivative with respect to $b$ in the same way, only $\frac{\partial z}{\partial b} = 1$ changes, so for a single sample

$$\frac{\partial L}{\partial b} = \hat{y} - y$$

and averaged over all samples

$$\frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^m \left(\hat{y}^{(i)} - y^{(i)}\right)$$

In the code, a variable name prefixed with d denotes the corresponding gradient, e.g. dW and db, which gives the following training loop:

# hyperparameters
learning_rate = 1
# batch size for each update step
batch_size = 256
epochs = 2000

n_feature = X_train.shape[0]
# initialize the parameters
W = np.random.randn(n_feature,1)*0.01
b = np.zeros((1,1))

from tqdm import trange
for i in (t:=trange(epochs)):
    # draw a random mini-batch
    samp = np.random.randint(0,X_train.shape[1],size=(batch_size))
    X_samp = X_train[:,samp]
    Y_samp = y_train[:,samp].astype(np.float32)

    # forward pass
    # W (784,1), W.T (1,784), X_samp (784,bs) => Z (1,bs)
    Z = np.matmul(W.T,X_samp) + b
    # (1,bs) -> (1,bs)
    A = sigmoid(Z)

    # compute the loss
    cost = compute_loss(Y_samp,A)

    # compute the gradients
    dW = (1/batch_size) * np.matmul(X_samp,(A-Y_samp).T)
    # A-Y_samp has shape (1,bs); summing over axis=1 without keepdims=True
    # would give shape (1,), keepdims=True keeps the (1,1) shape
    db = (1/batch_size) * np.sum(A-Y_samp,axis=1,keepdims=True)

    # update the parameters
    W = W - learning_rate * dW
    b = b - learning_rate * db
    if( i % 100 == 0):
        t.set_description(f"iter {i}, cost: {cost}")
print(f"final cost:{cost}")
    final cost:0.7304988760416786
# quick check: summing over axis=1 with keepdims=True preserves a (1,1) shape
u = np.random.randn(1,10)
v = np.random.randn(1,10)
print(u - v)
np.sum(u - v, axis=1, keepdims=True)
[[-1.36047055 -2.50987615  0.78183155 -1.36933903  0.6333624   1.41838218
   4.41512606  0.12380963 -0.96661973 -0.67886679]]

array([[0.48733957]])

More training iterations would buy a bit more accuracy, but there is a tradeoff to strike. Next, classification_report and confusion_matrix from sklearn's metrics module are used to inspect the results on the test set.

from sklearn.metrics import classification_report, confusion_matrix

Z = np.matmul(W.T, X_test) + b
A = sigmoid(Z)

predictions = (A>.5)[0,:]
labels = (y_test == 1)[0,:]

print(confusion_matrix(predictions,labels))
[[8958   26]
 [  62  954]]
print(classification_report(predictions,labels))
                  precision    recall  f1-score   support
    
           False       0.99      1.00      1.00      8984
            True       0.97      0.94      0.96      1016
    
        accuracy                           0.99     10000
       macro avg       0.98      0.97      0.98     10000
    weighted avg       0.99      0.99      0.99     10000

Adding a hidden layer

Now we add a hidden layer of 64 neurons on top of the single-neuron network above. This time the backpropagation is not derived step by step again; only the parts that are new need explaining.

Here we need one additional note about $\partial L/\partial w$ and $\partial L/\partial b$:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial z}\,\frac{\partial z}{\partial w},\qquad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial z}\,\frac{\partial z}{\partial b}$$

so the first thing to compute is

$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a}\,\frac{\partial a}{\partial z}$$
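For the output layer, combining the cross-entropy derivative with the sigmoid derivative exactly as before gives $\frac{\partial L}{\partial z_2} = a_2 - y$, which is dZ2 in the code. To push the error back through the hidden layer we additionally need

$$\frac{\partial L}{\partial a_1} = W_2^T\,\frac{\partial L}{\partial z_2},\qquad \frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial a_1}\odot\sigma(z_1)\,(1-\sigma(z_1))$$

which correspond to dA1 and dZ1 in the loop below.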
# hyperparameters
learning_rate = 1
# batch size for each update step
batch_size = 1024
epochs = 2000

input_feature = X_train.shape[0]
hidden_feature = 64
# initialize the parameters
W1 = np.random.randn(hidden_feature,input_feature)*0.01
b1 = np.zeros((hidden_feature,1))
W2 = np.random.randn(1,hidden_feature)*0.01
b2 = np.zeros((1,1))

from tqdm import trange
for i in (t:=trange(epochs)):

    # draw a random mini-batch
    samp = np.random.randint(0,X_train.shape[1],size=(batch_size))
    X_samp = X_train[:,samp]
    Y_samp = y_train[:,samp].astype(np.float32)

    # forward pass
    # W1 (64,784), X_samp (784,bs), +(64,1) -> (64,bs)
    Z1 = np.matmul(W1,X_samp) + b1
    # (64,bs)
    A1 = sigmoid(Z1)
    # W2 (1,64), A1 (64,bs), +(1,1) -> (1,bs)
    Z2 = np.matmul(W2,A1) + b2
    # (1,bs)
    A2 = sigmoid(Z2)
    # (1,bs) vs (1,bs)
    cost = compute_loss(Y_samp,A2)

    # backward pass: output layer
    dZ2 = A2 - Y_samp
    dW2 = (1./batch_size) * np.matmul(dZ2,A1.T)
    db2 = (1./batch_size) * np.sum(dZ2,axis=1,keepdims=True)

    # backward pass: hidden layer
    # Z1 = W1 X + b1 -> A1 = sigmoid(Z1) -> Z2 = W2 A1 + b2
    # dA1 = W2.T dZ2, then apply the sigmoid derivative element-wise
    dA1 = np.matmul(W2.T, dZ2)
    dZ1 = dA1 * sigmoid(Z1) * (1 - sigmoid(Z1))
    dW1 = (1./batch_size) * np.matmul(dZ1,X_samp.T)
    db1 = (1./batch_size) * np.sum(dZ1,axis=1,keepdims=True)

    # update the parameters
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1

    if( i % 100 == 0):
        t.set_description(f"iter {i}, cost: {cost}")
print(f"final cost:{cost}")
    iter 1900, cost: 0.6931471805599453: 100%|████████████████████████████████████████| 2000/2000 [00:16<00:00, 120.24it/s]

    final cost:0.6931471805599452
Z1 = np.matmul(W1,X_test) + b1
A1 = sigmoid(Z1)
Z2 = np.matmul(W2,A1) + b2
A2 = sigmoid(Z2)

predictions = (A2 >.5)[0,:]
labels = (y_test == 1)[0,:]
print(confusion_matrix(predictions,labels))
print(classification_report(predictions,labels))
    [[9020  980]
     [   0    0]]
                  precision    recall  f1-score   support
    
           False       1.00      0.90      0.95     10000
            True       0.00      0.00      0.00         0
    
        accuracy                           0.90     10000
       macro avg       0.50      0.45      0.47     10000
    weighted avg       1.00      0.90      0.95     10000
    
