Why ResNet Can Train 1000 Layers


The normal case

```mermaid
graph LR
Input:X-->L1["Layer1:f1(X,w1)"]
L1-->L2["Layer2:f2(X,w2)"]
L2-->Y
```

Following the diagram, the network output is $Y=f_2[f_1(X,w_1),w_2]$; let the ground-truth value be $y$ and, to keep things simple, take the loss to be $L(Y,y)=Y-y$.

In the normal case, the weight updates after backpropagation are:

$$w_2=w_2-\eta \frac{\partial L}{\partial w_2}$$
$$w_1=w_1-\eta \frac{\partial L}{\partial f_1}\frac{\partial f_1}{\partial w_1}$$

Clearly, when the number of layers $n$ is large, the update for $w_1$ takes the form:

$$w_1=w_1-\eta \frac{\partial L}{\partial f_{n-1}}\frac{\partial f_{n-1}}{\partial f_{n-2}}\cdots\frac{\partial f_1}{\partial w_1}$$

As the expression above shows, when the network is deep, this chained multiplication of gradients easily explodes or vanishes, and training fails.
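
A quick numerical sketch (my addition, not from the original post) makes this concrete: if every layer contributes a local derivative slightly below or slightly above 1, the product over many layers collapses toward 0 or blows up.

```python
import numpy as np

# Model each layer's local derivative as a single scalar factor; the plain
# network's gradient for w1 is the product of n such factors, so it shrinks
# or grows exponentially with depth.
rng = np.random.default_rng(0)

def chained_gradient(n_layers, scale):
    factors = scale * rng.uniform(0.8, 1.2, size=n_layers)  # per-layer df_k/df_{k-1}
    return np.prod(factors)

for depth in (10, 100, 1000):
    print(depth,
          "factors ~0.9:", chained_gradient(depth, scale=0.9),
          "factors ~1.1:", chained_gradient(depth, scale=1.1))
```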

The ResNet case

```mermaid
graph LR
Input:X-->L1["Layer1:f1(X,w1)"]
L1---O1(( ))
O1-->L2["Layer2:f2(X,w2)"]
L2--->O2(( ))
O2-->Y
O1-->O2
```

Now $Y=f_2[f_1(X,w_1),w_2]+f_1(X,w_1)$, the loss is still $L(Y,y)$, and the weight updates after backpropagation are:

$$w_2=w_2-\eta \frac{\partial L}{\partial w_2}=w_2-\eta \frac{\partial f_2}{\partial w_2}$$
$$w_1=w_1-\eta \left[\frac{\partial f_2}{\partial f_1}\frac{\partial f_1}{\partial w_1}+\frac{\partial f_1}{\partial w_1}\right]$$
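
As a sanity check (my addition, not part of the original post), the two-layer toy case can be reproduced with scalar "layers" $f_1(X,w_1)=w_1X$ and $f_2(h,w_2)=w_2h$ in PyTorch autograd; the skip connection adds exactly the extra $\frac{\partial f_1}{\partial w_1}$ term to $w_1$'s gradient.

```python
import torch

def grad_w1(residual: bool) -> float:
    # Scalar toy layers: f1(X, w1) = w1 * X, f2(h, w2) = w2 * h
    X = torch.tensor(2.0)
    w1 = torch.tensor(0.5, requires_grad=True)
    w2 = torch.tensor(0.3, requires_grad=True)
    f1 = w1 * X
    Y = w2 * f1 + (f1 if residual else 0.0)  # with or without the skip connection
    L = Y - torch.tensor(1.0)                # simplified loss L(Y, y) = Y - y
    L.backward()
    return w1.grad.item()

print(grad_w1(residual=False))  # w2*X     = 0.6
print(grad_w1(residual=True))   # w2*X + X = 2.6
```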

The multiplication has thus been turned into an addition. When the number of ResNet layers $n$ is large, the update for $w_1$ becomes:

$$w_1=w_1-\eta \left[\frac{\partial f_n}{\partial f_{n-1}}\cdots\frac{\partial f_1}{\partial w_1}+\frac{\partial f_{n-1}}{\partial f_{n-2}}\cdots\frac{\partial f_1}{\partial w_1}+\cdots+\frac{\partial f_1}{\partial w_1}\right]$$

The last term inside the brackets, $\frac{\partial f_1}{\partial w_1}$, guarantees that the gradient neither blows up nor collapses to zero through repeated multiplication, which is what lets ResNet train networks 1000 layers deep.
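
For reference, here is a minimal residual-block sketch in PyTorch (my addition; the layer shapes and the plain identity shortcut are illustrative choices, not the exact block from the ResNet paper). The `out + x` in `forward` is the skip connection that contributes the identity term to the backward pass.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = relu(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # the skip connection

# Stack many blocks; the gradient still reaches the input through the skips.
net = nn.Sequential(*[ResidualBlock(16) for _ in range(20)])
x = torch.randn(1, 16, 8, 8, requires_grad=True)
net(x).sum().backward()
print(x.grad.abs().mean())  # non-vanishing, thanks to the identity paths
```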