Gradient Descent


Tip 1: tuning your learning rate

Monitor the curve of loss vs. the number of parameter updates, and pay attention to how it behaves over the first few updates.
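A minimal, self-contained sketch of producing such a curve on a toy quadratic loss $L(w) = w^2$ (the loss, the learning rates, and the starting point are illustrative assumptions, not from the original lecture):

```python
import matplotlib.pyplot as plt

def loss(w):
    return w ** 2          # toy loss L(w) = w^2

def grad(w):
    return 2 * w           # its derivative

for eta in [0.01, 0.1, 0.4, 1.1]:      # too small / good / large / too large
    w, history = 10.0, []
    for step in range(30):
        w -= eta * grad(w)             # one gradient-descent update
        history.append(loss(w))
    plt.plot(history, label=f"eta = {eta}")

plt.xlabel("# parameter updates")
plt.ylabel("loss")
plt.yscale("log")
plt.legend()
plt.show()
```

With `eta = 1.1` the loss blows up, with `eta = 0.01` it decreases very slowly, and the intermediate values converge quickly; these are the behaviours the curve is meant to reveal.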

Adaptive learning rates

  • Reduce the learning rate by some factor every few epochs
    • 1/t decay: $\eta_t = \eta / \sqrt{t+1}$ (see the sketch after this list)
  • Give different parameters different learning rates
    • Adagrad
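A minimal sketch of the 1/t decay schedule inside a plain gradient-descent loop, again on an assumed toy loss $L(w) = w^2$:

```python
import math

eta0 = 1.0                 # initial learning rate eta
w = 10.0                   # single toy parameter

for t in range(100):
    eta_t = eta0 / math.sqrt(t + 1)   # 1/t decay: eta_t = eta / sqrt(t + 1)
    g = 2 * w                         # gradient of L(w) = w^2
    w -= eta_t * g

print(w)   # ends up very close to the minimum at w = 0
```

The step size shrinks automatically as training proceeds, so the early updates are large and the later ones only fine-tune the parameter.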

Adagrad

Definition

Divide the learning rate of each parameter by the root mean square of its previous derivatives

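Written out for a single parameter $w$, with $g^t = \frac{\partial L(\theta^t)}{\partial w}$ the derivative at update $t$ and $\eta^t = \eta / \sqrt{t+1}$ the decayed learning rate from the list above, the update is:

$$
w^{t+1} \leftarrow w^{t} - \frac{\eta^{t}}{\sigma^{t}}\, g^{t},
\qquad
\sigma^{t} = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}\left(g^{i}\right)^{2}}
$$

Because $\eta^{t}$ and $\sigma^{t}$ share the $1/\sqrt{t+1}$ factor, it cancels and the update simplifies to

$$
w^{t+1} \leftarrow w^{t} - \frac{\eta}{\sqrt{\sum_{i=0}^{t}\left(g^{i}\right)^{2}}}\, g^{t}
$$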

Intuition

  • The update contrasts the current gradient with the accumulated historical gradients
  • The denominator acts as an estimate of the magnitude of the second derivative
    • "the larger the gradient, the farther from the optimum" only holds when comparing points of a single parameter
    • the best step size also has to take the second derivative into account (see the derivation below)

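As a concrete one-dimensional illustration of the best step: for a quadratic $L(w) = aw^{2} + bw + c$ (with $a > 0$) the minimum lies at $w_0 = -\frac{b}{2a}$, so from a point $w$ the ideal step length is

$$
|w - w_0| = \frac{|2aw + b|}{2a} = \frac{\left|L'(w)\right|}{L''(w)}
$$

i.e. the first derivative divided by the second derivative. Adagrad's denominator $\sqrt{\sum_i (g^{i})^{2}}$ can be read as a cheap estimate of that second derivative built from first-order information only, which is why dividing by it approximates the best step across parameters.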

Tip 2: stochastic gradient descent

Stochastic gradient descent can speed up the training process.

  • compute the loss (and its gradient) for only one example at a time and update the parameters immediately, as in the sketch below
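A minimal sketch of the single-example update on an assumed toy linear-regression problem (the data, model, and learning rate are illustrative, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # 100 examples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
eta = 0.02

for epoch in range(20):
    for n in rng.permutation(len(X)):       # shuffle, then take one example at a time
        err = X[n] @ w - y[n]               # this example's residual; its loss is err ** 2
        g = 2 * err * X[n]                  # gradient of that single-example loss
        w -= eta * g                        # update immediately, without waiting for the full batch

print(w)   # close to true_w
```

Each pass over the data performs 100 parameter updates instead of one, which is why the loss typically drops much faster per example seen than with full-batch gradient descent, at the cost of noisier updates.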

Tip 3: feature scaling

  • what is feature scaling: rescale the input features so that the different dimensions of $x$ have similar ranges (a standardization sketch follows at the end of this tip)

  • why feature scaling

    • before scaling: the error surface is elongated, so different parameters effectively need different (adaptive) learning rates
    • after scaling: the error surface is closer to circular, a single learning rate suffices, and the updates point more directly toward the minimum, so training is more efficient
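A minimal sketch of one common form of feature scaling, standardization of each input dimension (the toy data matrix is an assumption for illustration):

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])     # two features on very different scales

mean = X.mean(axis=0)            # m_i: mean of dimension i over all examples
std = X.std(axis=0)              # sigma_i: standard deviation of dimension i

X_scaled = (X - mean) / std      # x_i <- (x_i - m_i) / sigma_i

print(X_scaled.mean(axis=0))     # ~0 in every dimension
print(X_scaled.std(axis=0))      # ~1 in every dimension
```

After this transformation every dimension has zero mean and unit variance, so no single feature dominates the shape of the error surface.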

Gradient descent theory

Taylor series

  • gradient descent relies on a first-order Taylor approximation of the loss around the current point, so the learning rate has to be small enough for that approximation to stay accurate (see the expansion below)
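For two parameters the expansion around the current point $(a, b)$ reads (a standard statement of the argument behind gradient descent):

$$
L(\theta_1, \theta_2) \approx L(a, b)
+ \frac{\partial L(a, b)}{\partial \theta_1}\,(\theta_1 - a)
+ \frac{\partial L(a, b)}{\partial \theta_2}\,(\theta_2 - b)
$$

Inside a small enough circle around $(a, b)$ this approximation is accurate, and the point that minimizes the right-hand side within that circle lies exactly opposite the gradient, which yields the update $\theta \leftarrow \theta - \eta \nabla L(\theta)$. If $\eta$ is too large, the proposed point leaves the region where the first-order approximation holds, and the loss can increase.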

Limitation of gradient descent

  • local minima
  • saddle points (a worked example follows below)
  • plateaus: the gradient is nearly zero even though the point is not necessarily close to a local minimum
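A small worked example of the saddle-point case: for $L(w_1, w_2) = w_1^{2} - w_2^{2}$ the gradient is $\nabla L = (2w_1, -2w_2)$, which vanishes at the origin, so gradient descent stalls there; yet $(0, 0)$ is not a minimum, since moving along $w_2$ in either direction decreases the loss.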