Tip 1: tuning your learning rate
Monitor the curve of loss vs. number of parameter updates, and watch its trend over the first few updates to judge whether the learning rate is too large or too small
Adaptive learning rates
- Reduce the learning rate by some factor every few epochs
- 1/t decay: η^t = η / √(t+1)
- Give different parameters different learning rates
- Adagrad
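A minimal sketch of such a decaying schedule, assuming the η / √(t+1) form noted above (the function name and starting value are illustrative):

```python
def decayed_lr(eta0: float, t: int) -> float:
    """Learning rate at update step t (t = 0, 1, 2, ...) under 1/t decay."""
    return eta0 / (t + 1) ** 0.5

# e.g. eta0 = 0.1 shrinks to 0.1, 0.071, 0.058, ... over the first steps
```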
Adagrad
Definition
Divide the learning rate of each parameter by the root mean square of its previous derivatives
Intuition
- The denominator contrasts the current gradient with the history of past gradients
- The denominator also approximates the magnitude of the second derivative
- "The larger the gradient, the farther from the optimum" only holds when comparing points of a single parameter, not across different parameters
- The best step size is |first derivative| / second derivative, so the second derivative has to be taken into account
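A minimal sketch of an Adagrad-style update, assuming the common simplified form where the learning rate is divided by the root of the accumulated squared gradients (names and default values are illustrative):

```python
import numpy as np

def adagrad_step(w, grad, sum_sq_grad, eta=0.1, eps=1e-8):
    """One Adagrad update: each parameter's learning rate is divided by the
    root of the sum of its past squared derivatives, so parameters with a
    history of large gradients take smaller steps."""
    sum_sq_grad = sum_sq_grad + grad ** 2
    w = w - eta / np.sqrt(sum_sq_grad + eps) * grad
    return w, sum_sq_grad

# usage: start with sum_sq_grad = np.zeros_like(w) and call once per update
```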
Tip 2: stochastic gradient descent
Can speed up the training process
- Compute the loss and gradient for only one example at a time, rather than summing over the whole training set
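A minimal sketch of SGD for linear regression, assuming a squared-error loss on a single example per update (all names and defaults are illustrative):

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.01, epochs=10, seed=0):
    """Stochastic gradient descent: update the parameters after the loss of
    one example at a time instead of summing over all examples."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            err = X[i] @ w + b - y[i]   # residual of this single example
            w -= eta * err * X[i]       # gradient of 0.5 * err**2 w.r.t. w
            b -= eta * err              # gradient of 0.5 * err**2 w.r.t. b
    return w, b
```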
Tip 3: feature scaling
What's feature scaling
- Make the different input features have the same scale (e.g. normalize each dimension to zero mean and unit variance)
Why feature scaling
- Before scaling: the error surface is elongated, so adaptive (per-parameter) learning rates are required
- After scaling: the error surface is closer to circular, so adaptive learning rates are not required and the updates point more directly toward the minimum, making training more efficient
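A minimal sketch of standardization, one common way to do feature scaling (assuming zero-mean / unit-variance scaling per dimension; the function name is illustrative):

```python
import numpy as np

def standardize(X):
    """Feature scaling: for every feature dimension, subtract the mean and
    divide by the standard deviation so all features share a similar scale."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + 1e-12), mean, std

# apply the same (mean, std) computed on training data to test data
```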
Gradient descent theory
Taylor series
- Gradient descent follows from a first-order Taylor approximation of the loss around the current point; the approximation only holds in a small neighborhood, so the learning rate must be small enough (see the sketch below)
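A sketch of how the update rule follows from the first-order approximation (two-parameter case; θ1, θ2 and the anchor point (a, b) are just labels for illustration):

```latex
% First-order Taylor approximation of the loss around the current point (a, b)
L(\theta_1, \theta_2) \approx L(a, b)
  + \left.\frac{\partial L}{\partial \theta_1}\right|_{(a,b)} (\theta_1 - a)
  + \left.\frac{\partial L}{\partial \theta_2}\right|_{(a,b)} (\theta_2 - b)

% Inside a small circle around (a, b) this linear function is minimized by
% moving opposite to the gradient -- exactly the gradient descent update --
% and the approximation is only trustworthy when the step (learning rate)
% is small enough.
\begin{pmatrix} \theta_1 - a \\ \theta_2 - b \end{pmatrix}
  = -\eta
    \begin{pmatrix}
      \partial L / \partial \theta_1 \\
      \partial L / \partial \theta_2
    \end{pmatrix}
```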
Limitation of gradient descent
- local minima
- saddle point
- plateau: the gradient is very small, so training may stop even though the point is not necessarily close to a local minimum