Adam: Combining Momentum and AdaGrad


1. Moment Estimate

Following the Momentum algorithm, we maintain two moving averages (the first moment $m_t$ and the second moment $v_t$): at time $t$, each combines its previous value ($m_{t-1}$ or $v_{t-1}$) with the current gradient $g_t$.

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2
\end{aligned}
$$

The first moment:

$m_t \approx \mathbb{E}(g_t)$

The second moment:

$v_t \approx \mathbb{E}(g_t^2)$

Here $\beta_1$ and $\beta_2$ are the exponential decay rates for the moment estimates.
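The two moving-average updates above can be sketched in a few lines of NumPy (a minimal illustration with made-up gradient values, not any particular library's implementation):

```python
import numpy as np

def update_moments(m_prev, v_prev, g, beta1=0.9, beta2=0.999):
    """Exponential moving averages of the gradient and its square."""
    m = beta1 * m_prev + (1 - beta1) * g       # first moment (mean of g)
    v = beta2 * v_prev + (1 - beta2) * g**2    # second moment (mean of g^2)
    return m, v

m, v = np.zeros(2), np.zeros(2)
g = np.array([0.5, -1.0])       # example gradient
m, v = update_moments(m, v, g)
print(m)  # ≈ [0.05, -0.1]
print(v)  # ≈ [0.00025, 0.001]
```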

2. Why bias-correct?

However, these moment estimates are biased toward zero during the early steps. Since the moments are initialized to $m_0 = v_0 = 0$, the first update gives $m_1 = \beta_1 \cdot 0 + (1-\beta_1)g_1$ and $v_1 = \beta_2 \cdot 0 + (1-\beta_2)g_1^2$, which are much smaller than the true moments (by factors of $1-\beta_1$ and $1-\beta_2$). We therefore need bias-corrected estimates.

$$
\begin{aligned}
\hat{m}_t &= \frac{m_t}{1-\beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1-\beta_2^t}
\end{aligned}
$$
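A quick numeric check (illustrative values only) shows how the correction recovers the true scale. Suppose every gradient happens to equal 2.0, so the true first moment is 2.0:

```python
beta1 = 0.9
g = 2.0  # constant gradient, so the true first moment is 2.0

m = 0.0
m_hats = []
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g      # raw first moment, biased toward 0
    m_hats.append(m / (1 - beta1**t))    # bias-corrected estimate
print(m_hats)  # each entry ≈ 2.0, while the raw m starts at only 0.2
```

The raw estimate $m_t$ takes values 0.2, 0.38, 0.542, ... and only slowly approaches 2.0, while the corrected $\hat{m}_t$ equals 2.0 from the very first step.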

3. Weight Update

$$\theta_{t+1} = \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
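Putting the three steps together, one full Adam update can be sketched as follows (a minimal NumPy version with common default hyperparameters; `adam_step` is a hypothetical helper for illustration, not a library function):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * g       # first-moment EMA
    v = beta2 * v + (1 - beta2) * g**2    # second-moment EMA
    m_hat = m / (1 - beta1**t)            # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x^2 (gradient 2x); the minimum is at x = 0.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
# theta ends up close to the minimum at 0
```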

4. Why is Adam effective?

Since the gradient $g$ is a vector with direction, two cases are worth distinguishing:

- If the magnitude of the first moment, $|m|$, is small but the second moment $v$ is large, the historical gradients were large but pointed in opposite directions: the small $|m|$ is the result of vectors cancelling each other out. The gradient is therefore unstable, and by the update rule in section 3 the effective step size shrinks.
- If both $|m|$ and $v$ are large, the historical gradients were consistently large and pointed in the same direction, so there is no need to shrink the step.
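This intuition can be checked numerically. Feeding the moment estimates a consistent gradient sequence versus an oscillating one (illustrative values; `effective_step` is a hypothetical helper) yields very different effective step scales $|\hat{m}| / (\sqrt{\hat{v}} + \epsilon)$:

```python
import math

def effective_step(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Scale factor |m_hat| / (sqrt(v_hat) + eps) after processing grads."""
    m = v = 0.0
    for t, g in enumerate(grads, 1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
    return abs(m_hat) / (math.sqrt(v_hat) + eps)

consistent  = [1.0] * 20                                         # same direction every step
oscillating = [1.0 if t % 2 == 0 else -1.0 for t in range(20)]   # cancelling signs
print(effective_step(consistent))   # ≈ 1.0: full stride
print(effective_step(oscillating))  # ≈ 0.05: stride shrinks sharply
```

Both sequences have the same $\hat{v} = 1$, but the oscillating one cancels in $\hat{m}$, so the update shrinks by roughly a factor of 20.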