Adam: Combining Momentum and AdaGrad


1. Moment Estimate

Following the Momentum algorithm, we maintain two moving averages (the first moment $m_t$ and the second moment $v_t$): at time $t$, each combines its previous value ($m_{t-1}$ or $v_{t-1}$) with the current gradient $g_t$.

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2
\end{aligned}
$$

The first moment:

$m_t \approx \mathbb{E}(g_t)$

The second moment:

$v_t \approx \mathbb{E}(g_t^2)$

Here $\beta_1$ and $\beta_2$ are the exponential decay rates for the moment estimates.
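The two moving-average updates above can be sketched in a few lines of NumPy (a minimal illustration with made-up gradient values, not any particular library's implementation):

```python
import numpy as np

def update_moments(m_prev, v_prev, g, beta1=0.9, beta2=0.999):
    """Exponential moving averages of the gradient and its square."""
    m = beta1 * m_prev + (1 - beta1) * g       # first moment (mean of g)
    v = beta2 * v_prev + (1 - beta2) * g**2    # second moment (mean of g^2)
    return m, v

m, v = np.zeros(2), np.zeros(2)
g = np.array([0.5, -1.0])       # example gradient
m, v = update_moments(m, v, g)
print(m)  # ≈ [0.05, -0.1]
print(v)  # ≈ [0.00025, 0.001]
```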

2. Why bias-correct?

However, these moment estimates are biased toward zero during the early steps. Since the moments are initialized to $m_0 = v_0 = 0$, the first update gives $m_1 = \beta_1 \cdot 0 + (1-\beta_1)g_1$ and $v_1 = \beta_2 \cdot 0 + (1-\beta_2)g_1^2$, which are much smaller than the true moments (by factors of $1-\beta_1$ and $1-\beta_2$). We therefore need bias-corrected estimates.

$$
\begin{aligned}
\hat{m}_t &= \frac{m_t}{1-\beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1-\beta_2^t}
\end{aligned}
$$
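A quick numeric check (illustrative values only) shows how the correction recovers the true scale. Suppose every gradient happens to equal 2.0, so the true first moment is 2.0:

```python
beta1 = 0.9
g = 2.0  # constant gradient, so the true first moment is 2.0

m = 0.0
m_hats = []
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g      # raw first moment, biased toward 0
    m_hats.append(m / (1 - beta1**t))    # bias-corrected estimate
print(m_hats)  # each entry ≈ 2.0, while the raw m starts at only 0.2
```

The raw estimate $m_t$ takes values 0.2, 0.38, 0.542, ... and only slowly approaches 2.0, while the corrected $\hat{m}_t$ equals 2.0 from the very first step.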

3. Weight Update

$$\theta_{t+1} = \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
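Putting the three steps together, one full Adam update can be sketched as follows (a minimal NumPy version with common default hyperparameters; `adam_step` is a hypothetical helper for illustration, not a library function):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * g       # first-moment EMA
    v = beta2 * v + (1 - beta2) * g**2    # second-moment EMA
    m_hat = m / (1 - beta1**t)            # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x^2 (gradient 2x); the minimum is at x = 0.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
# theta ends up close to the minimum at 0
```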

4. Why is Adam effective?

Since the gradient $g$ is a vector with direction, two cases are worth distinguishing:

- If the magnitude of the first moment, $|m|$, is small but the second moment $v$ is large, the historical gradients were large but pointed in opposite directions: the small $|m|$ is the result of vectors cancelling each other out. The gradient is therefore unstable, and by the update rule in section 3 the effective step size shrinks.
- If both $|m|$ and $v$ are large, the historical gradients were consistently large and pointed in the same direction, so there is no need to shrink the step.
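This intuition can be checked numerically. Feeding the moment estimates a consistent gradient sequence versus an oscillating one (illustrative values; `effective_step` is a hypothetical helper) yields very different effective step scales $|\hat{m}| / (\sqrt{\hat{v}} + \epsilon)$:

```python
import math

def effective_step(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Scale factor |m_hat| / (sqrt(v_hat) + eps) after processing grads."""
    m = v = 0.0
    for t, g in enumerate(grads, 1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
    return abs(m_hat) / (math.sqrt(v_hat) + eps)

consistent  = [1.0] * 20                                         # same direction every step
oscillating = [1.0 if t % 2 == 0 else -1.0 for t in range(20)]   # cancelling signs
print(effective_step(consistent))   # ≈ 1.0: full stride
print(effective_step(oscillating))  # ≈ 0.05: stride shrinks sharply
```

Both sequences have the same $\hat{v} = 1$, but the oscillating one cancels in $\hat{m}$, so the update shrinks by roughly a factor of 20.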