Backpropagation


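1 - Backpropagation in a Neural Network

[Figure: image.png] The figure shows that the output layer of this network has dimension 3.

Take layer $n+1$ as an example. The parameters $w^{n+1}_1, b^{n+1}_1$ of its first neuron affect the loss function $J$ through the following chain of dependencies:

$$w^{n+1}_1 \rightarrow z^{n+1}_1 \rightarrow (a^{n+1}_1,a^{n+1}_2,a^{n+1}_3) \rightarrow J$$

$$b^{n+1}_1 \rightarrow z^{n+1}_1 \rightarrow (a^{n+1}_1,a^{n+1}_2,a^{n+1}_3) \rightarrow J$$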

By the chain rule for composite functions:

$$
\begin{aligned}
dw^{n+1}_1=\frac{\partial J}{\partial w^{n+1}_1}
&=\frac{\partial J}{\partial a^{n+1}_1}\frac{\partial a^{n+1}_1}{\partial z^{n+1}_1}\frac{\partial z^{n+1}_1}{\partial w^{n+1}_1}
+\frac{\partial J}{\partial a^{n+1}_2}\frac{\partial a^{n+1}_2}{\partial z^{n+1}_1}\frac{\partial z^{n+1}_1}{\partial w^{n+1}_1}
+\frac{\partial J}{\partial a^{n+1}_3}\frac{\partial a^{n+1}_3}{\partial z^{n+1}_1}\frac{\partial z^{n+1}_1}{\partial w^{n+1}_1}\\
&=\left(\frac{\partial J}{\partial a^{n+1}_1}\frac{\partial a^{n+1}_1}{\partial z^{n+1}_1}
+\frac{\partial J}{\partial a^{n+1}_2}\frac{\partial a^{n+1}_2}{\partial z^{n+1}_1}
+\frac{\partial J}{\partial a^{n+1}_3}\frac{\partial a^{n+1}_3}{\partial z^{n+1}_1}\right)\frac{\partial z^{n+1}_1}{\partial w^{n+1}_1}
\end{aligned}
$$

$$
\begin{aligned}
db^{n+1}_1=\frac{\partial J}{\partial b^{n+1}_1}
&=\frac{\partial J}{\partial a^{n+1}_1}\frac{\partial a^{n+1}_1}{\partial z^{n+1}_1}\frac{\partial z^{n+1}_1}{\partial b^{n+1}_1}
+\frac{\partial J}{\partial a^{n+1}_2}\frac{\partial a^{n+1}_2}{\partial z^{n+1}_1}\frac{\partial z^{n+1}_1}{\partial b^{n+1}_1}
+\frac{\partial J}{\partial a^{n+1}_3}\frac{\partial a^{n+1}_3}{\partial z^{n+1}_1}\frac{\partial z^{n+1}_1}{\partial b^{n+1}_1}\\
&=\left(\frac{\partial J}{\partial a^{n+1}_1}\frac{\partial a^{n+1}_1}{\partial z^{n+1}_1}
+\frac{\partial J}{\partial a^{n+1}_2}\frac{\partial a^{n+1}_2}{\partial z^{n+1}_1}
+\frac{\partial J}{\partial a^{n+1}_3}\frac{\partial a^{n+1}_3}{\partial z^{n+1}_1}\right)\frac{\partial z^{n+1}_1}{\partial b^{n+1}_1}
\end{aligned}
$$

Key observations:

  1. The last factor $\frac{\partial z^{n+1}_1}{\partial w^{n+1}_1}$ is the same in every term:
  • Because $z^{n+1}_1 = w^{n+1}_1 \cdot a^n + b^{n+1}_1$, we have $\frac{\partial z^{n+1}_1}{\partial w^{n+1}_1} = a^n$ (the same for every $k$).

  • This factor does not depend on the summation index $k$, so it can be moved outside the sum.

  2. The first two factors $\frac{\partial J}{\partial a^{n+1}_k} \cdot \frac{\partial a^{n+1}_k}{\partial z^{n+1}_1}$ capture the coupling introduced by softmax:
  • Every softmax output $a^{n+1}_k$ depends on all $z^{n+1}_j$, so:

    • for $k=1$: $\frac{\partial a^{n+1}_1}{\partial z^{n+1}_1} = a^{n+1}_1 (1- a^{n+1}_1)$

    • for $k \neq 1$: $\frac{\partial a^{n+1}_k}{\partial z^{n+1}_1} = -a^{n+1}_k a^{n+1}_1$

  • Hence the gradient of $w^{n+1}_1$ has to sum the contributions of all output neurons $a^{n+1}_1, a^{n+1}_2, a^{n+1}_3$.

Reading the expression above against the network structure: the last factor is identical in every term, so the common factor $\frac{\partial z^{n+1}_1}{\partial w^{n+1}_1} = a^n$ can be pulled out, and the sum of the first two factors collapses into $dz^{n+1}_1$. This gives:

$$dw^{n+1}_1=dz^{n+1}_1\frac{\partial z^{n+1}_1}{\partial w^{n+1}_1}$$

$$db^{n+1}_1=dz^{n+1}_1\frac{\partial z^{n+1}_1}{\partial b^{n+1}_1}$$

Similarly:

$$dw^{n+1}_2=dz^{n+1}_2\frac{\partial z^{n+1}_2}{\partial w^{n+1}_2},\quad db^{n+1}_2=dz^{n+1}_2\frac{\partial z^{n+1}_2}{\partial b^{n+1}_2}$$

$$dw^{n+1}_3=dz^{n+1}_3\frac{\partial z^{n+1}_3}{\partial w^{n+1}_3},\quad db^{n+1}_3=dz^{n+1}_3\frac{\partial z^{n+1}_3}{\partial b^{n+1}_3}$$

Worked example with given data:

Weight matrix $W^{n+1}$: 3 neurons, each with 4 input weights (one per neuron of the previous layer).

$$
\begin{aligned}
W^{n+1} &= \begin{bmatrix} w^{n+1}_1 \\ w^{n+1}_2 \\ w^{n+1}_3 \end{bmatrix}
= \begin{bmatrix} w_1 & w_2 & w_3 & w_4 \\ w_5 & w_6 & w_7 & w_8 \\ w_9 & w_{10} & w_{11} & w_{12} \end{bmatrix}\\
B^{n+1} &= \begin{bmatrix} b^{n+1}_1 \\ b^{n+1}_2 \\ b^{n+1}_3 \end{bmatrix}
= \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}\\
A^n &= \begin{bmatrix} a^{n}_1 \\ a^{n}_2 \\ a^{n}_3 \\ a^{n}_4 \end{bmatrix}
= \begin{bmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \\ a_{13} & a_{23} \\ a_{14} & a_{24} \end{bmatrix}
\end{aligned}
$$

$A^n$ is the input coming from the previous layer: 4 neurons and 2 samples, in matrix form with one sample per column.

By forward propagation:

$$
\begin{aligned}
Z^{n+1} &= W^{n+1}A^n+B^{n+1} \\
&= \begin{bmatrix} z^{n+1}_1 \\ z^{n+1}_2 \\ z^{n+1}_3 \end{bmatrix}
= \begin{bmatrix} z_{11} & z_{21} \\ z_{12} & z_{22} \\ z_{13} & z_{23} \end{bmatrix}\\
&= \begin{bmatrix}
w_1a_{11}+w_2a_{12}+w_3a_{13}+w_4a_{14}+b_1 & w_1a_{21}+w_2a_{22}+w_3a_{23}+w_4a_{24}+b_1 \\
w_5a_{11}+w_6a_{12}+w_7a_{13}+w_8a_{14}+b_2 & w_5a_{21}+w_6a_{22}+w_7a_{23}+w_8a_{24}+b_2 \\
w_9a_{11}+w_{10}a_{12}+w_{11}a_{13}+w_{12}a_{14}+b_3 & w_9a_{21}+w_{10}a_{22}+w_{11}a_{23}+w_{12}a_{24}+b_3
\end{bmatrix}
\end{aligned}
$$
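The forward step can be sketched in NumPy as follows. This is only an illustration with random stand-in values; the variable names (`W`, `B`, `A_prev`, `z_first`) are assumptions of this sketch, not names from the article. It checks that the matrix product reproduces the hand-expanded entry $z_{11}$.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))       # stand-in for W^{n+1}: 3 neurons, 4 inputs each
B = rng.normal(size=(3, 1))       # stand-in for B^{n+1}: one bias per neuron
A_prev = rng.normal(size=(4, 2))  # stand-in for A^n: 4 neurons, one sample per column

# Forward propagation; B is broadcast across the 2 sample columns.
Z = W @ A_prev + B

# Hand-expanded entry for neuron 1, sample 1:
# z_11 = w1*a11 + w2*a12 + w3*a13 + w4*a14 + b1
z_first = sum(W[0, k] * A_prev[k, 0] for k in range(4)) + B[0, 0]

print(Z.shape)                      # (3, 2)
print(np.isclose(Z[0, 0], z_first))
```
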

Applying the rules of vector differentiation, the parenthesized part of the following formula is exactly $dz^{n+1}_1=\frac{\partial J}{\partial z^{n+1}_1}$:

$$
\frac{\partial J}{\partial w^{n+1}_1} = \left( \frac{\partial J}{\partial a^{n+1}_1} \cdot \frac{\partial a^{n+1}_1}{\partial z^{n+1}_1} + \frac{\partial J}{\partial a^{n+1}_2} \cdot \frac{\partial a^{n+1}_2}{\partial z^{n+1}_1} + \frac{\partial J}{\partial a^{n+1}_3} \cdot \frac{\partial a^{n+1}_3}{\partial z^{n+1}_1} \right) \cdot \frac{\partial z^{n+1}_1}{\partial w^{n+1}_1}
$$

Stacking the rows for all neurons gives this layer's $dZ^{n+1}=\frac{\partial J}{\partial Z^{n+1}}$:

$$
dZ^{n+1}=\frac{\partial J}{\partial Z^{n+1}}=
\begin{bmatrix} \frac{\partial J}{\partial z^{n+1}_1} \\ \frac{\partial J}{\partial z^{n+1}_2} \\ \frac{\partial J}{\partial z^{n+1}_3} \end{bmatrix}=
\begin{bmatrix} dz^{n+1}_1 \\ dz^{n+1}_2 \\ dz^{n+1}_3 \end{bmatrix}=
\begin{bmatrix}
\frac{\partial J}{\partial z_{11}} & \frac{\partial J}{\partial z_{21}} \\
\frac{\partial J}{\partial z_{12}} & \frac{\partial J}{\partial z_{22}} \\
\frac{\partial J}{\partial z_{13}} & \frac{\partial J}{\partial z_{23}}
\end{bmatrix}=
\begin{bmatrix} dz_{11} & dz_{21} \\ dz_{12} & dz_{22} \\ dz_{13} & dz_{23} \end{bmatrix}
$$

$$
\frac{\partial z^{n+1}_1}{\partial w^{n+1}_1}=
\begin{bmatrix}
\frac{\partial z_{11}}{\partial w_1} & \frac{\partial z_{11}}{\partial w_2} & \frac{\partial z_{11}}{\partial w_3} & \frac{\partial z_{11}}{\partial w_4} \\
\frac{\partial z_{21}}{\partial w_1} & \frac{\partial z_{21}}{\partial w_2} & \frac{\partial z_{21}}{\partial w_3} & \frac{\partial z_{21}}{\partial w_4}
\end{bmatrix}=(A^n)^T,\qquad
\frac{\partial z^{n+1}_1}{\partial b^{n+1}_1}=
\begin{bmatrix} \frac{\partial z_{11}}{\partial b_1} \\ \frac{\partial z_{21}}{\partial b_1} \end{bmatrix}=
\begin{bmatrix} 1 \\ 1 \end{bmatrix}
$$

The same holds for neurons 2 and 3:

$$\frac{\partial z^{n+1}_2}{\partial w^{n+1}_2} = (A^n)^T, \quad \frac{\partial z^{n+1}_2}{\partial b^{n+1}_2} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

$$\frac{\partial z^{n+1}_3}{\partial w^{n+1}_3} = (A^n)^T, \quad \frac{\partial z^{n+1}_3}{\partial b^{n+1}_3} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

To summarize, the results for $dw^{n+1}_1,dw^{n+1}_2,dw^{n+1}_3,db^{n+1}_1,db^{n+1}_2,db^{n+1}_3$ give the gradient of the loss $J$ with respect to $W^{n+1},B^{n+1}$.

Since $dw^{n+1}_1=dz^{n+1}_1\frac{\partial z^{n+1}_1}{\partial w^{n+1}_1}$, stacking the rows for all neurons gives this layer's $dW^{n+1}$:

$$
dW^{n+1}=
\begin{bmatrix} dw^{n+1}_1 \\ dw^{n+1}_2 \\ dw^{n+1}_3 \end{bmatrix}=
\begin{bmatrix} dz^{n+1}_1(A^n)^T \\ dz^{n+1}_2(A^n)^T \\ dz^{n+1}_3(A^n)^T \end{bmatrix}=
\begin{bmatrix} dz^{n+1}_1 \\ dz^{n+1}_2 \\ dz^{n+1}_3 \end{bmatrix}(A^n)^T=
\begin{bmatrix} dz_{11} & dz_{21} \\ dz_{12} & dz_{22} \\ dz_{13} & dz_{23} \end{bmatrix}(A^n)^T
=dZ^{n+1}(A^n)^T
$$

$$
dB^{n+1}=
\begin{bmatrix} db^{n+1}_1 \\ db^{n+1}_2 \\ db^{n+1}_3 \end{bmatrix}=
\begin{bmatrix} dz^{n+1}_1 \\ dz^{n+1}_2 \\ dz^{n+1}_3 \end{bmatrix}
\begin{bmatrix} 1 \\ 1 \end{bmatrix}=
\begin{bmatrix} dz_{11} & dz_{21} \\ dz_{12} & dz_{22} \\ dz_{13} & dz_{23} \end{bmatrix}
\begin{bmatrix} 1 \\ 1 \end{bmatrix}
=\mathrm{sum}(dZ^{n+1},\mathrm{axis}=1)
$$

Dimension check

  • Whole layer: $dW^{n+1} = dZ^{n+1} \cdot (A^n)^T$, with shapes $[3,2][2,4]=[3,4]$
    • $dZ^{n+1}$ has shape $3 \times 2$ (3 neurons, 2 samples)
    • $A^n$ has shape $4 \times 2$ (4 input neurons, 2 samples)
    • $(A^n)^T$ has shape $2 \times 4$
    • The result $dW^{n+1}$ has shape $3 \times 4$ (the same as $W^{n+1}$)
  • Single neuron: $dw^{n+1}_1=dz^{n+1}_1\frac{\partial z^{n+1}_1}{\partial w^{n+1}_1}=dz^{n+1}_1(A^n)^T$, with shapes $[1,2][2,4]=[1,4]$
    • Here $dz^{n+1}_1$ has shape $[1,2]$; the dimension check in the section on $g=softmax$ with cross-entropy loss $J$ covers this in detail. The $[1,4]$ result is the weight gradient of neuron 1.
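The shape bookkeeping above can be checked with a short NumPy sketch. The arrays are random stand-ins and the names (`dZ`, `A_prev`, `dw_rows`) are assumptions made for this illustration; the point is only that stacking the per-neuron products $dz^{n+1}_i(A^n)^T$ gives the same matrix as the single product $dZ^{n+1}(A^n)^T$.

```python
import numpy as np

rng = np.random.default_rng(1)
dZ = rng.normal(size=(3, 2))      # stand-in for dZ^{n+1}: 3 neurons, 2 samples
A_prev = rng.normal(size=(4, 2))  # stand-in for A^n: 4 neurons, 2 samples

# Whole layer: [3,2][2,4] -> [3,4]
dW = dZ @ A_prev.T

# One neuron at a time: each product is [1,2][2,4] -> [1,4]
dw_rows = [dZ[i:i + 1, :] @ A_prev.T for i in range(3)]
dW_stacked = np.vstack(dw_rows)

# Bias gradient: sum over the sample axis, kept as a [3,1] column
dB = np.sum(dZ, axis=1, keepdims=True)

print(dW.shape, dw_rows[0].shape, dB.shape)  # (3, 4) (1, 4) (3, 1)
print(np.allclose(dW, dW_stacked))
```
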

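Computing the gradient of the loss $J$ with respect to the parameters $W^n,B^n$ of any fully connected layer in the same way, and comparing the results, leads to the following conclusion:

Conclusion 1: The gradients of the error $J$ with respect to the parameters $W^n,B^n$ of any fully connected layer are given by:

  • $dW^n=dZ^n(A^{n-1})^T$
  • $dB^n=\mathrm{sum}(dZ^n,\mathrm{axis}=1)$

$\mathrm{sum}(dZ^n,\mathrm{axis}=1)$ sums $dZ^n$ along the horizontal direction (across samples), which yields a column vector with one entry per neuron of layer $n$; $n$ is the layer index.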
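A finite-difference check is one way to gain confidence in Conclusion 1. The sketch below is an assumption-laden illustration, not the article's code: it uses a toy loss $J=\frac{1}{2}\sum Z^2$ (chosen only because then $dZ=Z$), and the helper `numerical_grad` is a hypothetical name introduced here.

```python
import numpy as np

def loss(W, B, A_prev):
    Z = W @ A_prev + B
    return 0.5 * np.sum(Z ** 2)  # toy loss, chosen so that dJ/dZ = Z

def numerical_grad(f, X, eps=1e-6):
    """Central-difference gradient of the scalar f() with respect to X (perturbed in place)."""
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        old = X[idx]
        X[idx] = old + eps
        f_plus = f()
        X[idx] = old - eps
        f_minus = f()
        X[idx] = old  # restore the original entry
        G[idx] = (f_plus - f_minus) / (2 * eps)
    return G

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 4))
B = rng.normal(size=(3, 1))
A_prev = rng.normal(size=(4, 2))

# Analytic gradients from Conclusion 1
dZ = W @ A_prev + B                     # equals dJ/dZ for the toy loss
dW = dZ @ A_prev.T
dB = np.sum(dZ, axis=1, keepdims=True)

# Numerical gradients for comparison
dW_num = numerical_grad(lambda: loss(W, B, A_prev), W)
dB_num = numerical_grad(lambda: loss(W, B, A_prev), B)

print(np.allclose(dW, dW_num, atol=1e-5), np.allclose(dB, dB_num, atol=1e-5))
```

`keepdims=True` keeps `dB` as a column so that it has the same shape as `B`.
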

2 - Gradient of the Error with Respect to a Fully Connected Layer's Linear Output

Conclusion 1 shows that computing $dW^n,dB^n$ first requires $dZ^n$, the gradient of the loss $J$ with respect to the linear output $Z^n$, and how $dZ^n$ is computed depends on the activation function. Because $softmax$ differs from ordinary activation functions, the computation of $dZ^n$ splits into two cases, according to whether the activation $g$ applied to $Z^n$ is $softmax$.

When the activation function is $g=softmax$ and the loss $J$ is cross-entropy

The $softmax$ activation is applied to the last layer of the network (the output layer), where it outputs the predicted probability distribution for each sample. Taking the figure above as an example, the activation $g$ of layer $n+1$ (the output layer) is $softmax$. From the network structure and the rules of vector differentiation:

$$
\begin{aligned}
dZ^{n+1} &= \begin{bmatrix} dz^{n+1}_1 \\ dz^{n+1}_2 \\ dz^{n+1}_3 \end{bmatrix}\\
&= \begin{bmatrix}
\frac{\partial J}{\partial a^{n+1}_1}\frac{\partial a^{n+1}_1}{\partial z^{n+1}_1} + \frac{\partial J}{\partial a^{n+1}_2}\frac{\partial a^{n+1}_2}{\partial z^{n+1}_1} + \frac{\partial J}{\partial a^{n+1}_3}\frac{\partial a^{n+1}_3}{\partial z^{n+1}_1}\\
\frac{\partial J}{\partial a^{n+1}_1}\frac{\partial a^{n+1}_1}{\partial z^{n+1}_2} + \frac{\partial J}{\partial a^{n+1}_2}\frac{\partial a^{n+1}_2}{\partial z^{n+1}_2} + \frac{\partial J}{\partial a^{n+1}_3}\frac{\partial a^{n+1}_3}{\partial z^{n+1}_2}\\
\frac{\partial J}{\partial a^{n+1}_1}\frac{\partial a^{n+1}_1}{\partial z^{n+1}_3} + \frac{\partial J}{\partial a^{n+1}_2}\frac{\partial a^{n+1}_2}{\partial z^{n+1}_3} + \frac{\partial J}{\partial a^{n+1}_3}\frac{\partial a^{n+1}_3}{\partial z^{n+1}_3}
\end{bmatrix}\\
&= \begin{bmatrix}
da^{n+1}_1\frac{\partial a^{n+1}_1}{\partial z^{n+1}_1} + da^{n+1}_2\frac{\partial a^{n+1}_2}{\partial z^{n+1}_1} + da^{n+1}_3\frac{\partial a^{n+1}_3}{\partial z^{n+1}_1}\\
da^{n+1}_1\frac{\partial a^{n+1}_1}{\partial z^{n+1}_2} + da^{n+1}_2\frac{\partial a^{n+1}_2}{\partial z^{n+1}_2} + da^{n+1}_3\frac{\partial a^{n+1}_3}{\partial z^{n+1}_2}\\
da^{n+1}_1\frac{\partial a^{n+1}_1}{\partial z^{n+1}_3} + da^{n+1}_2\frac{\partial a^{n+1}_2}{\partial z^{n+1}_3} + da^{n+1}_3\frac{\partial a^{n+1}_3}{\partial z^{n+1}_3}
\end{bmatrix}
\end{aligned}
\tag{1}
$$

In addition:

$$
\begin{aligned}
A^{n+1}=softmax(Z^{n+1}) &=
\begin{bmatrix} a^{n+1}_1 \\ a^{n+1}_2 \\ a^{n+1}_3 \end{bmatrix}=
\begin{bmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \\ a_{13} & a_{23} \end{bmatrix}\\
&=\begin{bmatrix}
\frac{e^{z_{11}}}{e^{z_{11}}+e^{z_{12}}+e^{z_{13}}} & \frac{e^{z_{21}}}{e^{z_{21}}+e^{z_{22}}+e^{z_{23}}} \\
\frac{e^{z_{12}}}{e^{z_{11}}+e^{z_{12}}+e^{z_{13}}} & \frac{e^{z_{22}}}{e^{z_{21}}+e^{z_{22}}+e^{z_{23}}} \\
\frac{e^{z_{13}}}{e^{z_{11}}+e^{z_{12}}+e^{z_{13}}} & \frac{e^{z_{23}}}{e^{z_{21}}+e^{z_{22}}+e^{z_{23}}}
\end{bmatrix}
\end{aligned}
$$

Note: each denominator sums over a single sample, that is, down one column.
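A column-wise softmax can be sketched in NumPy as below. The function name `softmax_columns` is an assumption of this sketch, and subtracting the per-column maximum is a standard numerical-stability step that is not part of the article's formula (it leaves the result unchanged).

```python
import numpy as np

def softmax_columns(Z):
    """Softmax over each column of Z (one column per sample)."""
    # Subtracting the column maximum avoids overflow in exp and does not change the result.
    Z_shift = Z - Z.max(axis=0, keepdims=True)
    E = np.exp(Z_shift)
    return E / E.sum(axis=0, keepdims=True)  # denominator: sum down each column

Z = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [3.0, 0.0]])  # 3 output neurons, 2 samples
A = softmax_columns(Z)

print(A.sum(axis=0))  # each column sums to 1
print(A[:, 1])        # equal logits give a uniform distribution
```
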

$$
Y=\begin{bmatrix} y_{11} & y_{21} \\ y_{12} & y_{22} \\ y_{13} & y_{23} \end{bmatrix}
$$

By the definition of the cross-entropy loss:

$$
\begin{aligned}
J&=E(Y,A)=-\frac{1}{n}\sum_{j=1}^n\sum_{i=1}^m y_{ji}\log(a_{ji}) \\
&=-\frac{1}{2}\left(y_{11}\log(a_{11})+y_{12}\log(a_{12})+y_{13}\log(a_{13})+y_{21}\log(a_{21})+y_{22}\log(a_{22})+y_{23}\log(a_{23})\right)
\end{aligned}
$$

In this formula $n$ is the number of samples (2 in this example; it is not the layer index used elsewhere) and $m$ is the number of classes (3 here). $y_{ji}$ is the true probability that sample $j$ belongs to class $i$, and $a_{ji}$ is the predicted probability that sample $j$ belongs to class $i$. To keep the notation short, the constant factor $-\frac{1}{2}$ is left out of the derivatives that follow.
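As a small numeric illustration of the loss, the sketch below evaluates it for made-up values. `cross_entropy`, `Y` and `A` are names and numbers chosen for this example only; the layout follows the article (rows are classes, columns are samples).

```python
import numpy as np

def cross_entropy(Y, A):
    """Mean cross-entropy over samples; Y and A have shape [classes, samples]."""
    n_samples = Y.shape[1]
    return -np.sum(Y * np.log(A)) / n_samples

# Two samples, three classes: sample 1 is class 1, sample 2 is class 2 (one-hot columns).
Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
# Made-up predicted probabilities; each column sums to 1.
A = np.array([[0.7, 0.2],
              [0.2, 0.5],
              [0.1, 0.3]])

J = cross_entropy(Y, A)
# Only the true-class terms survive: J = -(log 0.7 + log 0.5) / 2
print(J)
```
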

Applying the rules of vector differentiation (with the constant factor omitted) gives:

$$
da^{n+1}_1=\frac{\partial J}{\partial a^{n+1}_1}=
\begin{bmatrix} \frac{\partial J}{\partial a_{11}} & \frac{\partial J}{\partial a_{21}} \end{bmatrix}=
\begin{bmatrix} \frac{y_{11}}{a_{11}} & \frac{y_{21}}{a_{21}} \end{bmatrix}
$$

Note: the shape is $[1,2]$. Differentiating $J = y \log_e(a)$ with respect to $a$, where $y$ is the true label, gives $\frac{\partial J}{\partial a} = \frac{y}{a}$. Similarly:

$$
da^{n+1}_2=\frac{\partial J}{\partial a^{n+1}_2}=
\begin{bmatrix} \frac{\partial J}{\partial a_{12}} & \frac{\partial J}{\partial a_{22}} \end{bmatrix}=
\begin{bmatrix} \frac{y_{12}}{a_{12}} & \frac{y_{22}}{a_{22}} \end{bmatrix}
$$

$$
da^{n+1}_3=\frac{\partial J}{\partial a^{n+1}_3}=
\begin{bmatrix} \frac{\partial J}{\partial a_{13}} & \frac{\partial J}{\partial a_{23}} \end{bmatrix}=
\begin{bmatrix} \frac{y_{13}}{a_{13}} & \frac{y_{23}}{a_{23}} \end{bmatrix}
$$

The partial derivatives of the softmax function follow; each has shape $[2,2]$.

$$
\frac{\partial a^{n+1}_1}{\partial z^{n+1}_1}=
\begin{bmatrix}
\frac{\partial a_{11}}{\partial z_{11}} & \frac{\partial a_{11}}{\partial z_{21}} \\
\frac{\partial a_{21}}{\partial z_{11}} & \frac{\partial a_{21}}{\partial z_{21}}
\end{bmatrix}=
\begin{bmatrix}
\frac{e^{z_{11}}(e^{z_{12}}+e^{z_{13}})}{(e^{z_{11}}+e^{z_{12}}+e^{z_{13}})^2} & 0 \\
0 & \frac{e^{z_{21}}(e^{z_{22}}+e^{z_{23}})}{(e^{z_{21}}+e^{z_{22}}+e^{z_{23}})^2}
\end{bmatrix}=
\begin{bmatrix} a_{11}(1-a_{11}) & 0 \\ 0 & a_{21}(1-a_{21}) \end{bmatrix}
$$

The off-diagonal entries are zero because the output for one sample does not depend on the linear output of another sample. Similarly:

$$
\frac{\partial a^{n+1}_1}{\partial z^{n+1}_2}=\begin{bmatrix} -a_{11}a_{12} & 0 \\ 0 & -a_{21}a_{22} \end{bmatrix},\quad
\frac{\partial a^{n+1}_1}{\partial z^{n+1}_3}=\begin{bmatrix} -a_{11}a_{13} & 0 \\ 0 & -a_{21}a_{23} \end{bmatrix}
$$

$$
\frac{\partial a^{n+1}_2}{\partial z^{n+1}_1}=\begin{bmatrix} -a_{11}a_{12} & 0 \\ 0 & -a_{21}a_{22} \end{bmatrix},\quad
\frac{\partial a^{n+1}_2}{\partial z^{n+1}_2}=\begin{bmatrix} a_{12}(1-a_{12}) & 0 \\ 0 & a_{22}(1-a_{22}) \end{bmatrix},\quad
\frac{\partial a^{n+1}_2}{\partial z^{n+1}_3}=\begin{bmatrix} -a_{12}a_{13} & 0 \\ 0 & -a_{22}a_{23} \end{bmatrix}
$$

$$
\frac{\partial a^{n+1}_3}{\partial z^{n+1}_1}=\begin{bmatrix} -a_{11}a_{13} & 0 \\ 0 & -a_{21}a_{23} \end{bmatrix},\quad
\frac{\partial a^{n+1}_3}{\partial z^{n+1}_2}=\begin{bmatrix} -a_{13}a_{12} & 0 \\ 0 & -a_{23}a_{22} \end{bmatrix},\quad
\frac{\partial a^{n+1}_3}{\partial z^{n+1}_3}=\begin{bmatrix} a_{13}(1-a_{13}) & 0 \\ 0 & a_{23}(1-a_{23}) \end{bmatrix}
$$
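The same derivatives can be viewed per sample rather than per neuron pair. For one sample with softmax output vector $a$, the $3\times 3$ Jacobian is $\mathrm{diag}(a)-aa^T$: its diagonal holds $a_k(1-a_k)$ and its off-diagonal entries hold $-a_ka_j$. The sketch below is a hedged illustration of that view; `softmax_jacobian` and the sample values are assumptions of this example.

```python
import numpy as np

def softmax_jacobian(a):
    """Jacobian for one sample: entry [k, j] is d a_k / d z_j."""
    return np.diag(a) - np.outer(a, a)

z = np.array([0.5, -1.0, 2.0])   # made-up linear outputs of one sample
a = np.exp(z) / np.exp(z).sum()  # softmax of that sample
Jac = softmax_jacobian(a)

print(np.allclose(np.diag(Jac), a * (1 - a)))  # diagonal: a_k (1 - a_k)
print(np.isclose(Jac[1, 0], -a[1] * a[0]))     # off-diagonal: -a_k a_j
```

Each column of `Jac` sums to zero, which reflects that the softmax outputs always sum to 1.
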

Substituting these derivatives into equation (1) and multiplying by $-\frac{1}{2}$ gives:

$$dZ^{n+1}=\frac{1}{2}(A-Y)$$
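To make the substitution concrete, here is the calculation for the single entry $dz_{11}$ (neuron 1, sample 1). It assumes, as for one-hot labels, that the true probabilities of a sample sum to one, $y_{11}+y_{12}+y_{13}=1$:

$$
\begin{aligned}
dz_{11} &= -\frac{1}{2}\left(\frac{y_{11}}{a_{11}}\,a_{11}(1-a_{11}) - \frac{y_{12}}{a_{12}}\,a_{11}a_{12} - \frac{y_{13}}{a_{13}}\,a_{11}a_{13}\right)\\
&= -\frac{1}{2}\left(y_{11} - a_{11}\,(y_{11}+y_{12}+y_{13})\right)\\
&= \frac{1}{2}(a_{11}-y_{11})
\end{aligned}
$$

The other five entries follow the same pattern, which is why the whole matrix collapses to $\frac{1}{2}(A-Y)$.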

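Dimension check

  • Single neuron: each term of $dz^{n+1}_1$, for example $da^{n+1}_1\frac{\partial a^{n+1}_1}{\partial z^{n+1}_1}$, has shape $[1,2][2,2]=[1,2]$, so $dz^{n+1}_1$ is $[1,2]$
  • Whole layer: $dZ^{n+1}$ is $[3,2]$: three classes (neurons) and two samples.

In summary, we obtain the following conclusion:

Conclusion 2: If the output layer uses the $softmax$ activation and the loss is the cross-entropy loss, the gradient of the error with respect to the output layer's linear combination $Z$ is

$$dZ=\frac{1}{m}(A-Y)$$

where $m$ is the number of samples (written $n$ in the cross-entropy definition above), $A$ is the activation output of the $softmax$ layer, and $Y$ is the one-hot encoded true probability distribution of the samples.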
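Conclusion 2 can be sanity-checked numerically. The following sketch compares $\frac{1}{m}(A-Y)$ with a central-difference gradient of the loss; the function names, the random logits and the labels are assumptions made for this illustration only.

```python
import numpy as np

def softmax_columns(Z):
    E = np.exp(Z - Z.max(axis=0, keepdims=True))  # max shift for stability only
    return E / E.sum(axis=0, keepdims=True)

def loss(Z, Y):
    """Cross-entropy of softmax(Z) against Y, averaged over the samples (columns)."""
    A = softmax_columns(Z)
    return -np.sum(Y * np.log(A)) / Y.shape[1]

rng = np.random.default_rng(3)
Z = rng.normal(size=(3, 2))  # 3 classes, 2 samples
Y = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0]])   # one-hot columns
m = Y.shape[1]

# Analytic gradient from Conclusion 2
dZ = (softmax_columns(Z) - Y) / m

# Central-difference gradient, one entry at a time
eps = 1e-6
dZ_num = np.zeros_like(Z)
for idx in np.ndindex(Z.shape):
    Z_plus = Z.copy()
    Z_plus[idx] += eps
    Z_minus = Z.copy()
    Z_minus[idx] -= eps
    dZ_num[idx] = (loss(Z_plus, Y) - loss(Z_minus, Y)) / (2 * eps)

print(np.allclose(dZ, dZ_num, atol=1e-6))
```
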

When the activation function is $g \neq softmax$ and the loss $J$ is cross-entropy

[Figure: image.png]

In the figure above, the activation of layer $n$ is $g \neq softmax$, and the linear output $z^{n}_1$ of its first neuron changes the value of the loss through the following chain of dependencies:

$$z^{n}_1 \rightarrow a^{n}_1 \rightarrow (z^{n+1}_1,z^{n+1}_2,z^{n+1}_3) \rightarrow (a^{n+1}_1,a^{n+1}_2,a^{n+1}_3) \rightarrow J$$

$$
\begin{aligned}
dz^{n}_1= &\frac{\partial J}{\partial a^{n+1}_3}\frac{\partial a^{n+1}_3}{\partial z^{n+1}_1}\frac{\partial z^{n+1}_1}{\partial a^n_1}\frac{\partial a^n_1}{\partial z^n_1}+
\frac{\partial J}{\partial a^{n+1}_3}\frac{\partial a^{n+1}_3}{\partial z^{n+1}_2}\frac{\partial z^{n+1}_2}{\partial a^n_1}\frac{\partial a^n_1}{\partial z^n_1}+
\frac{\partial J}{\partial a^{n+1}_3}\frac{\partial a^{n+1}_3}{\partial z^{n+1}_3}\frac{\partial z^{n+1}_3}{\partial a^n_1}\frac{\partial a^n_1}{\partial z^n_1}+\\
&\frac{\partial J}{\partial a^{n+1}_2}\frac{\partial a^{n+1}_2}{\partial z^{n+1}_1}\frac{\partial z^{n+1}_1}{\partial a^n_1}\frac{\partial a^n_1}{\partial z^n_1}+
\frac{\partial J}{\partial a^{n+1}_2}\frac{\partial a^{n+1}_2}{\partial z^{n+1}_2}\frac{\partial z^{n+1}_2}{\partial a^n_1}\frac{\partial a^n_1}{\partial z^n_1}+
\frac{\partial J}{\partial a^{n+1}_2}\frac{\partial a^{n+1}_2}{\partial z^{n+1}_3}\frac{\partial z^{n+1}_3}{\partial a^n_1}\frac{\partial a^n_1}{\partial z^n_1}+\\
&\frac{\partial J}{\partial a^{n+1}_1}\frac{\partial a^{n+1}_1}{\partial z^{n+1}_1}\frac{\partial z^{n+1}_1}{\partial a^n_1}\frac{\partial a^n_1}{\partial z^n_1}+
\frac{\partial J}{\partial a^{n+1}_1}\frac{\partial a^{n+1}_1}{\partial z^{n+1}_2}\frac{\partial z^{n+1}_2}{\partial a^n_1}\frac{\partial a^n_1}{\partial z^n_1}+
\frac{\partial J}{\partial a^{n+1}_1}\frac{\partial a^{n+1}_1}{\partial z^{n+1}_3}\frac{\partial z^{n+1}_3}{\partial a^n_1}\frac{\partial a^n_1}{\partial z^n_1}
\end{aligned}
$$

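Analyzing the expression above, we find: [Figure: image.png]

[Figure: QQ_1745330686820.png]

In this expression, the three terms in each horizontal row correspond to the red lines, and the terms taken vertically correspond to the blue lines, that is, the blue brackets. They can be simplified using the formula at the top, as shown in the figure below.

[Figure: image.png]

Define the local gradient of layer $n+1$. Because every softmax output depends on every $z^{n+1}_k$, it sums over all three outputs, exactly as in equation (1):

$$dz^{n+1}_k = \frac{\partial J}{\partial z^{n+1}_k} = \sum_{j=1}^{3}\frac{\partial J}{\partial a^{n+1}_j} \cdot \frac{\partial a^{n+1}_j}{\partial z^{n+1}_k}$$

Grouping the nine terms by $z^{n+1}_k$ then gives:

$$dz^n_1=\left(dz^{n+1}_1\frac{\partial z^{n+1}_1}{\partial a^n_1}+dz^{n+1}_2\frac{\partial z^{n+1}_2}{\partial a^n_1}+dz^{n+1}_3\frac{\partial z^{n+1}_3}{\partial a^n_1}\right)\frac{\partial a^n_1}{\partial z^n_1}$$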

Similarly:

$$dz^n_2=\left(dz^{n+1}_1\frac{\partial z^{n+1}_1}{\partial a^n_2}+dz^{n+1}_2\frac{\partial z^{n+1}_2}{\partial a^n_2}+dz^{n+1}_3\frac{\partial z^{n+1}_3}{\partial a^n_2}\right)\frac{\partial a^n_2}{\partial z^n_2}$$

$$dz^n_3=\left(dz^{n+1}_1\frac{\partial z^{n+1}_1}{\partial a^n_3}+dz^{n+1}_2\frac{\partial z^{n+1}_2}{\partial a^n_3}+dz^{n+1}_3\frac{\partial z^{n+1}_3}{\partial a^n_3}\right)\frac{\partial a^n_3}{\partial z^n_3}$$

$$dz^n_4=\left(dz^{n+1}_1\frac{\partial z^{n+1}_1}{\partial a^n_4}+dz^{n+1}_2\frac{\partial z^{n+1}_2}{\partial a^n_4}+dz^{n+1}_3\frac{\partial z^{n+1}_3}{\partial a^n_4}\right)\frac{\partial a^n_4}{\partial z^n_4}$$

Recall the data from the worked example earlier:

$$
W^{n+1}= \begin{bmatrix} w_1 & w_2 & w_3 & w_4 \\ w_5 & w_6 & w_7 & w_8 \\ w_9 & w_{10} & w_{11} & w_{12} \end{bmatrix},\quad
B^{n+1}= \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix},\quad
A^n= \begin{bmatrix} a^n_1 \\ a^n_2 \\ a^n_3 \\ a^n_4 \end{bmatrix}=
\begin{bmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \\ a_{13} & a_{23} \\ a_{14} & a_{24} \end{bmatrix}
$$

By forward propagation, as before:

$$
\begin{aligned}
Z^{n+1} &= W^{n+1}A^n+B^{n+1} \\
&= \begin{bmatrix} z^{n+1}_1 \\ z^{n+1}_2 \\ z^{n+1}_3 \end{bmatrix}
= \begin{bmatrix} z_{11} & z_{21} \\ z_{12} & z_{22} \\ z_{13} & z_{23} \end{bmatrix}\\
&= \begin{bmatrix}
w_1a_{11}+w_2a_{12}+w_3a_{13}+w_4a_{14}+b_1 & w_1a_{21}+w_2a_{22}+w_3a_{23}+w_4a_{24}+b_1 \\
w_5a_{11}+w_6a_{12}+w_7a_{13}+w_8a_{14}+b_2 & w_5a_{21}+w_6a_{22}+w_7a_{23}+w_8a_{24}+b_2 \\
w_9a_{11}+w_{10}a_{12}+w_{11}a_{13}+w_{12}a_{14}+b_3 & w_9a_{21}+w_{10}a_{22}+w_{11}a_{23}+w_{12}a_{24}+b_3
\end{bmatrix}
\end{aligned}
$$

Applying the rules of vector differentiation:

Partial derivatives of the linear outputs $z^{n+1}_k$ of layer $n+1$ with respect to the activation $a^n_1$ of layer $n$:

$$
\frac{\partial z^{n+1}_1}{\partial a^n_1}=
\begin{bmatrix} \frac{\partial z_{11}}{\partial a_{11}} & \frac{\partial z_{11}}{\partial a_{21}}\\ \frac{\partial z_{21}}{\partial a_{11}} & \frac{\partial z_{21}}{\partial a_{21}} \end{bmatrix}=
\begin{bmatrix} w_1 & 0 \\ 0 & w_1 \end{bmatrix},\quad
\frac{\partial z^{n+1}_2}{\partial a^n_1}=
\begin{bmatrix} \frac{\partial z_{12}}{\partial a_{11}} & \frac{\partial z_{12}}{\partial a_{21}}\\ \frac{\partial z_{22}}{\partial a_{11}} & \frac{\partial z_{22}}{\partial a_{21}} \end{bmatrix}=
\begin{bmatrix} w_5 & 0 \\ 0 & w_5 \end{bmatrix},\quad
\frac{\partial z^{n+1}_3}{\partial a^n_1}=
\begin{bmatrix} \frac{\partial z_{13}}{\partial a_{11}} & \frac{\partial z_{13}}{\partial a_{21}}\\ \frac{\partial z_{23}}{\partial a_{11}} & \frac{\partial z_{23}}{\partial a_{21}} \end{bmatrix}=
\begin{bmatrix} w_9 & 0 \\ 0 & w_9 \end{bmatrix}
$$

The corresponding derivatives for neurons 2, 3 and 4 of layer $n$ (derived in the same way):

$$
\frac{\partial z^{n+1}_1}{\partial a^n_2}=
\begin{bmatrix} \frac{\partial z_{11}}{\partial a_{12}} & \frac{\partial z_{11}}{\partial a_{22}}\\ \frac{\partial z_{21}}{\partial a_{12}} & \frac{\partial z_{21}}{\partial a_{22}} \end{bmatrix}=
\begin{bmatrix} w_2 & 0 \\ 0 & w_2 \end{bmatrix},\quad
\frac{\partial z^{n+1}_2}{\partial a^n_2}=
\begin{bmatrix} \frac{\partial z_{12}}{\partial a_{12}} & \frac{\partial z_{12}}{\partial a_{22}}\\ \frac{\partial z_{22}}{\partial a_{12}} & \frac{\partial z_{22}}{\partial a_{22}} \end{bmatrix}=
\begin{bmatrix} w_6 & 0 \\ 0 & w_6 \end{bmatrix},\quad
\frac{\partial z^{n+1}_3}{\partial a^n_2}=
\begin{bmatrix} \frac{\partial z_{13}}{\partial a_{12}} & \frac{\partial z_{13}}{\partial a_{22}}\\ \frac{\partial z_{23}}{\partial a_{12}} & \frac{\partial z_{23}}{\partial a_{22}} \end{bmatrix}=
\begin{bmatrix} w_{10} & 0 \\ 0 & w_{10} \end{bmatrix}
$$

$$
\frac{\partial z^{n+1}_1}{\partial a^n_3}=
\begin{bmatrix} \frac{\partial z_{11}}{\partial a_{13}} & \frac{\partial z_{11}}{\partial a_{23}}\\ \frac{\partial z_{21}}{\partial a_{13}} & \frac{\partial z_{21}}{\partial a_{23}} \end{bmatrix}=
\begin{bmatrix} w_3 & 0 \\ 0 & w_3 \end{bmatrix},\quad
\frac{\partial z^{n+1}_2}{\partial a^n_3}=
\begin{bmatrix} \frac{\partial z_{12}}{\partial a_{13}} & \frac{\partial z_{12}}{\partial a_{23}}\\ \frac{\partial z_{22}}{\partial a_{13}} & \frac{\partial z_{22}}{\partial a_{23}} \end{bmatrix}=
\begin{bmatrix} w_7 & 0 \\ 0 & w_7 \end{bmatrix},\quad
\frac{\partial z^{n+1}_3}{\partial a^n_3}=
\begin{bmatrix} \frac{\partial z_{13}}{\partial a_{13}} & \frac{\partial z_{13}}{\partial a_{23}}\\ \frac{\partial z_{23}}{\partial a_{13}} & \frac{\partial z_{23}}{\partial a_{23}} \end{bmatrix}=
\begin{bmatrix} w_{11} & 0 \\ 0 & w_{11} \end{bmatrix}
$$

$$
\frac{\partial z^{n+1}_1}{\partial a^n_4}=
\begin{bmatrix} \frac{\partial z_{11}}{\partial a_{14}} & \frac{\partial z_{11}}{\partial a_{24}}\\ \frac{\partial z_{21}}{\partial a_{14}} & \frac{\partial z_{21}}{\partial a_{24}} \end{bmatrix}=
\begin{bmatrix} w_4 & 0 \\ 0 & w_4 \end{bmatrix},\quad
\frac{\partial z^{n+1}_2}{\partial a^n_4}=
\begin{bmatrix} \frac{\partial z_{12}}{\partial a_{14}} & \frac{\partial z_{12}}{\partial a_{24}}\\ \frac{\partial z_{22}}{\partial a_{14}} & \frac{\partial z_{22}}{\partial a_{24}} \end{bmatrix}=
\begin{bmatrix} w_8 & 0 \\ 0 & w_8 \end{bmatrix},\quad
\frac{\partial z^{n+1}_3}{\partial a^n_4}=
\begin{bmatrix} \frac{\partial z_{13}}{\partial a_{14}} & \frac{\partial z_{13}}{\partial a_{24}}\\ \frac{\partial z_{23}}{\partial a_{14}} & \frac{\partial z_{23}}{\partial a_{24}} \end{bmatrix}=
\begin{bmatrix} w_{12} & 0 \\ 0 & w_{12} \end{bmatrix}
$$

Combining these with the formulas for $dz^n_1,dz^n_2,dz^n_3,dz^n_4$:

Gradient of neuron 1 of layer $n$:

$$
\begin{aligned}
dz^n_1 &= \left( dz^{n+1}_1 \frac{\partial z^{n+1}_1}{\partial a^n_1} + dz^{n+1}_2 \frac{\partial z^{n+1}_2}{\partial a^n_1} + dz^{n+1}_3 \frac{\partial z^{n+1}_3}{\partial a^n_1} \right) \frac{\partial a^n_1}{\partial z^n_1} \\
&= \left( dz^{n+1}_1 \begin{bmatrix} w_1 & 0 \\ 0 & w_1 \end{bmatrix} + dz^{n+1}_2 \begin{bmatrix} w_5 & 0 \\ 0 & w_5 \end{bmatrix} + dz^{n+1}_3 \begin{bmatrix} w_9 & 0 \\ 0 & w_9 \end{bmatrix} \right) \frac{\partial a^n_1}{\partial z^n_1}
\end{aligned}
$$

Shape bookkeeping: $([1,2][2,2])[2,2]=[1,2]$. Note that $\partial a^n_1/\partial z^n_1$ is a $2 \times 2$ diagonal matrix.

Gradients of neurons 2, 3 and 4 of layer $n$ (derived in the same way):

$$dz^n_2 = \left( dz^{n+1}_1 \begin{bmatrix} w_2 & 0 \\ 0 & w_2 \end{bmatrix} + dz^{n+1}_2 \begin{bmatrix} w_6 & 0 \\ 0 & w_6 \end{bmatrix} + dz^{n+1}_3 \begin{bmatrix} w_{10} & 0 \\ 0 & w_{10} \end{bmatrix} \right) \frac{\partial a^n_2}{\partial z^n_2}$$

$$dz^n_3 = \left( dz^{n+1}_1 \begin{bmatrix} w_3 & 0 \\ 0 & w_3 \end{bmatrix} + dz^{n+1}_2 \begin{bmatrix} w_7 & 0 \\ 0 & w_7 \end{bmatrix} + dz^{n+1}_3 \begin{bmatrix} w_{11} & 0 \\ 0 & w_{11} \end{bmatrix} \right) \frac{\partial a^n_3}{\partial z^n_3}$$

$$dz^n_4 = \left( dz^{n+1}_1 \begin{bmatrix} w_4 & 0 \\ 0 & w_4 \end{bmatrix} + dz^{n+1}_2 \begin{bmatrix} w_8 & 0 \\ 0 & w_8 \end{bmatrix} + dz^{n+1}_3 \begin{bmatrix} w_{12} & 0 \\ 0 & w_{12} \end{bmatrix} \right) \frac{\partial a^n_4}{\partial z^n_4}$$

Recall also the gradient of the loss $J$ with respect to $Z^{n+1}$:

$$
dZ^{n+1}= \begin{bmatrix} dz^{n+1}_1 \\ dz^{n+1}_2 \\ dz^{n+1}_3 \end{bmatrix}=
\begin{bmatrix} dz_{11} & dz_{21} \\ dz_{12} & dz_{22} \\ dz_{13} & dz_{23} \end{bmatrix}
$$

Substituting into $dz^n_1,dz^n_2,dz^n_3,dz^n_4$ gives:

$$dz^n_1= \begin{bmatrix} w_1dz_{11}+w_5dz_{12}+w_9dz_{13} & w_1dz_{21}+w_5dz_{22}+w_9dz_{23} \end{bmatrix}\frac{\partial a^{n}_1}{\partial z^n_1}$$

$$dz^n_2= \begin{bmatrix} w_2dz_{11}+w_6dz_{12}+w_{10}dz_{13} & w_2dz_{21}+w_6dz_{22}+w_{10}dz_{23} \end{bmatrix}\frac{\partial a^{n}_2}{\partial z^n_2}$$

$$dz^n_3= \begin{bmatrix} w_3dz_{11}+w_7dz_{12}+w_{11}dz_{13} & w_3dz_{21}+w_7dz_{22}+w_{11}dz_{23} \end{bmatrix}\frac{\partial a^{n}_3}{\partial z^n_3}$$

$$dz^n_4= \begin{bmatrix} w_4dz_{11}+w_8dz_{12}+w_{12}dz_{13} & w_4dz_{21}+w_8dz_{22}+w_{12}dz_{23} \end{bmatrix}\frac{\partial a^{n}_4}{\partial z^n_4}$$

Furthermore, $A^n$ is the activation matrix of layer $n$, obtained by applying the element-wise activation function $g(\cdot)$ to that layer's linear output $Z^n$. Specifically:

$$
A^n=g(Z^n)=g\left( \begin{bmatrix} z^n_1 \\ z^n_2 \\ z^n_3 \\ z^n_4 \end{bmatrix}\right)=
\begin{bmatrix} g(z^n_1) \\ g(z^n_2) \\ g(z^n_3) \\ g(z^n_4) \end{bmatrix}=
\begin{bmatrix} a^n_1 \\ a^n_2 \\ a^n_3 \\ a^n_4 \end{bmatrix}
$$

Here $a^n_i = g(z^n_i)$ is the activation of neuron $i$ (still a $1 \times 2$ vector, one entry per sample). The layer output keeps shape $[4,2]$ and the output of a single neuron has shape $[1,2]$, but its derivative in the backward pass is a $[2,2]$ diagonal (and therefore symmetric) matrix.

Therefore:

$$
\frac{\partial g(Z^n)}{\partial Z^n}=
\begin{bmatrix} \frac{\partial a^{n}_1}{\partial z^n_1}\\ \frac{\partial a^{n}_2}{\partial z^n_2} \\ \frac{\partial a^{n}_3}{\partial z^n_3} \\ \frac{\partial a^{n}_4}{\partial z^n_4} \end{bmatrix}
$$

In backpropagation each block $\frac{\partial a^n_i}{\partial z^n_i}$ is a diagonal matrix, because $g$ acts element-wise. Multiplying a $1 \times 2$ row vector by such a diagonal matrix is the same as multiplying it element-wise by the diagonal entries $g'(z^n_i)$, so the blocks can be stored as the $[4,2]$ matrix $g'(Z^n)$ and applied with an element-wise product.


Example (ReLU activation). Suppose $g(z) = \text{ReLU}(z) = \max(0, z)$ and:

$$Z^n = \begin{bmatrix} 1.0 & -0.5 \\ -2.0 & 3.0 \\ 0.0 & 0.5 \\ 0.5 & -1.0 \end{bmatrix}$$

Then:

$$
A^n = \text{ReLU}(Z^n) =
\begin{bmatrix} \max(0, 1.0) & \max(0, -0.5) \\ \max(0, -2.0) & \max(0, 3.0) \\ \max(0, 0.0) & \max(0, 0.5) \\ \max(0, 0.5) & \max(0, -1.0) \end{bmatrix} =
\begin{bmatrix} 1.0 & 0.0 \\ 0.0 & 3.0 \\ 0.0 & 0.5 \\ 0.5 & 0.0 \end{bmatrix}
$$

  • Each row holds the output of one neuron (for example, $z^n_1 = [z_{11}, z_{21}]$ is the linear output of neuron 1 for the 2 samples).
  • Each column corresponds to one sample (for example, the first column $[z_{11}, z_{12}, z_{13}, z_{14}]^T$ is the output of the first sample on the 4 neurons).
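The ReLU example can be reproduced in NumPy, together with the element-wise derivative that the backward pass needs. In this sketch the names `A` and `dA_dZ` are assumptions, and taking the derivative at $z=0$ to be 0 is a common convention rather than something the article specifies.

```python
import numpy as np

Z = np.array([[1.0, -0.5],
              [-2.0, 3.0],
              [0.0, 0.5],
              [0.5, -1.0]])  # 4 neurons (rows), 2 samples (columns)

A = np.maximum(0.0, Z)  # element-wise ReLU

# Element-wise derivative g'(Z): 1 where z > 0, else 0
# (convention assumed here: the derivative at z = 0 is taken as 0).
dA_dZ = (Z > 0).astype(float)

print(A)
print(dA_dZ)
```

`dA_dZ` has the same $[4,2]$ shape as `Z`, which is what allows it to be applied with an element-wise product.
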

Dimension check:

Meaning: the quantities below refer to the linear output of layer $n$.

  • Whole layer: $dZ^n = \begin{bmatrix} dz^n_1 \\ dz^n_2 \\ dz^n_3 \\ dz^n_4 \end{bmatrix} = \left( (W^{n+1})^T \cdot dZ^{n+1} \right) \odot \frac{\partial g(Z^n)}{\partial Z^n}$, with shapes $([4,3][3,2]) \odot [4,2]=[4,2]$
    • $(W^{n+1})^T$ has shape $[4,3]$
    • $dZ^{n+1}$ has shape $[3,2]$
    • $\frac{\partial g(Z^n)}{\partial Z^n}$ has shape $[4,2]$

Check for a single neuron:

Gradient computation: the gradient $dz^n_i$ of each neuron is a linear combination of the next layer's gradients, weighted by the connecting weights.

  • For example, for the first neuron: $dz^n_1 = \left( w_1 \cdot dz^{n+1}_1 + w_5 \cdot dz^{n+1}_2 + w_9 \cdot dz^{n+1}_3 \right) \odot \frac{\partial a^n_1}{\partial z^n_1}$, with shapes $[1,2] \odot [1,2]=[1,2]$
  • Element-wise product: here $\frac{\partial a^n_1}{\partial z^n_1}$ is taken as the diagonal of the $2 \times 2$ matrix, with shape $[1 \times 2]$
  • Output shape: $dz^n_1$ has shape $[1 \times 2]$, one gradient per sample, matching one neuron producing outputs for two samples

Dimension check for forward propagation:

  • Input: the activations $A^n$ of layer $n$ have shape $[4 \times 2]$ (4 neurons, 2 samples).
  • Weight matrix: $W^{n+1}$ has shape $[3 \times 4]$, and the bias $B^{n+1}$ has shape $[3 \times 1]$.
  • Linear output: $Z^{n+1} = W^{n+1}A^n + B^{n+1}$, with shapes $[3 \times 4][4 \times 2] + [3 \times 1]=[3 \times 2]$ (3 neurons, 2 samples).

This leads to the following summary:

  • Forward propagation: $Z^{n+1} = W^{n+1}A^n + B^{n+1}$ maps shape $[4 \times 2]$ to $[3 \times 2]$.
  • Backpropagation: $dZ^n = \left( (W^{n+1})^T \cdot dZ^{n+1} \right) \odot \frac{\partial g(Z^n)}{\partial Z^n}$ maps shape $[3 \times 2]$ back to $[4 \times 2]$, so the gradient is passed back layer by layer.
  • Summary of the dimension check: the single-neuron gradient $dz^n_i$ has shape $[1 \times 2]$, one gradient per sample. The full gradient matrix $dZ^n$ has shape $[4 \times 2]$, the same as $Z^n$, which confirms that the backpropagation formula is dimensionally consistent.

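Analyzing the final forms of $dz^n_1,dz^n_2,dz^n_3,dz^n_4$ and $\frac{\partial g(Z^n)}{\partial Z^n}$, and using the rules of matrix multiplication, gives the following conclusion:

Conclusion 3: If the activation function $g$ of a fully connected layer is not $softmax$ and the loss is the cross-entropy loss, the gradient of the error with respect to the linear combination $Z$ is:

$$dZ^n = \begin{bmatrix} dz^n_1 \\ dz^n_2 \\ dz^n_3 \\ dz^n_4 \end{bmatrix} = \left( (W^{n+1})^T \cdot dZ^{n+1} \right) \odot \frac{\partial g(Z^n)}{\partial Z^n}$$

  • Here $g$ denotes the activation function, for example ReLU or tanh. The symbol $\odot$ denotes element-wise multiplication of the matrices on its two sides; the result is a matrix of the same size.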
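Conclusion 3 can also be checked numerically. The sketch below assumes a tanh hidden layer (4 neurons) feeding a softmax output layer (3 neurons) with cross-entropy loss; all names, shapes and random values are assumptions of this illustration. It compares the analytic $dZ^n$ with a central-difference gradient.

```python
import numpy as np

def softmax_columns(Z):
    E = np.exp(Z - Z.max(axis=0, keepdims=True))  # max shift for stability only
    return E / E.sum(axis=0, keepdims=True)

def loss_from_hidden(Z_hidden, W_out, B_out, Y):
    """Cross-entropy loss as a function of the hidden layer's linear output."""
    A_hidden = np.tanh(Z_hidden)
    A_out = softmax_columns(W_out @ A_hidden + B_out)
    return -np.sum(Y * np.log(A_out)) / Y.shape[1]

rng = np.random.default_rng(4)
Z_hidden = rng.normal(size=(4, 2))  # stand-in for Z^n: 4 neurons, 2 samples
W_out = rng.normal(size=(3, 4))     # stand-in for W^{n+1}
B_out = rng.normal(size=(3, 1))     # stand-in for B^{n+1}
Y = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [0.0, 0.0]])          # one-hot columns

# Analytic backward pass
A_hidden = np.tanh(Z_hidden)
A_out = softmax_columns(W_out @ A_hidden + B_out)
dZ_out = (A_out - Y) / Y.shape[1]                        # Conclusion 2
dZ_hidden = (W_out.T @ dZ_out) * (1.0 - A_hidden ** 2)   # Conclusion 3, with tanh'(z) = 1 - tanh(z)^2

# Central-difference gradient for comparison
eps = 1e-6
dZ_num = np.zeros_like(Z_hidden)
for idx in np.ndindex(Z_hidden.shape):
    Z_plus = Z_hidden.copy()
    Z_plus[idx] += eps
    Z_minus = Z_hidden.copy()
    Z_minus[idx] -= eps
    dZ_num[idx] = (loss_from_hidden(Z_plus, W_out, B_out, Y)
                   - loss_from_hidden(Z_minus, W_out, B_out, Y)) / (2 * eps)

print(dZ_hidden.shape, np.allclose(dZ_hidden, dZ_num, atol=1e-6))
```

In NumPy, `*` between two arrays of equal shape is the element-wise product written $\odot$ in the formula, while `@` is the matrix product.
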