1. Moment Estimate
Based on Momentum Algorithm, we generate the current volumes (the first moment and the second moment ) at time are combined with the volumes before ( and ) and the current gradient .
The first moment:
The second moment:
The are the exponential decay rates for moment estimates.
2. Why bias-correct?
However, these moments estimations will lead to a slow converging speed: The initial moment The moments , so that the moments during early steps are too smaller than the real moments, so we need a bias-corrected estimate.
3. Weight Update
4. Why Adam is effective?
Since the gradient has directions,
If the absolute value of first moment , , is small,
but the second moment is large,
It will mean that the values of historical gradients is large but go in opposite directions. is the result of vectors cancelling each other out.
So if the case occurs, it means the gradient is not stable, and the update stride will tends to be smaller as the equation in section 3
If the and the are both large, It means that the direction of historical gradients is very consistent and the value is large, there is no need to shrink the stride.