Andrew Ng Deep Learning: Optimization Methods


Three ways to tune the parameter update: gradient descent (GD), momentum, and Adam.

# 1 - Gradient Descent

For $l = 1, ..., L$:

$$W^{[l]} = W^{[l]} - \alpha \, dW^{[l]}$$

$$b^{[l]} = b^{[l]} - \alpha \, db^{[l]}$$

```python
# GRADED FUNCTION: update_parameters_with_gd
def update_parameters_with_gd(parameters, grads, learning_rate):
    L = len(parameters) // 2 # number of layers in the neural network

    # Update rule for each parameter (layers are indexed 1..L)
    for l in range(1, L + 1):
        ### START CODE HERE ### (approx. 2 lines)
        parameters['W' + str(l)] = parameters['W' + str(l)] - learning_rate * grads['dW' + str(l)]
        parameters['b' + str(l)] = parameters['b' + str(l)] - learning_rate * grads['db' + str(l)]
        ### END CODE HERE ###

    return parameters
```
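
A minimal usage sketch with toy numbers of my own (not from the notebook), just to show the 'W1'/'b1' and 'dW1'/'db1' dictionary convention the function expects:

```python
import numpy as np

parameters = {'W1': np.array([[1.0, 2.0]]), 'b1': np.array([[0.5]])}
grads      = {'dW1': np.array([[0.1, -0.2]]), 'db1': np.array([[0.05]])}

parameters = update_parameters_with_gd(parameters, grads, learning_rate=0.1)
print(parameters['W1'])   # [[0.99 2.02]]
print(parameters['b1'])   # [[0.495]]
```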

In Stochastic Gradient Descent (SGD), only a single training example is used for each gradient update. When the training set is large, SGD can be faster. However, the parameters "oscillate" toward the minimum instead of converging smoothly. In practice, each update usually uses a number of examples somewhere between a single example and the full training set, i.e. Mini-batch Gradient Descent: you loop over mini-batches rather than over individual training examples.

Summary:

- The difference between gradient descent, mini-batch gradient descent, and stochastic gradient descent is the number of examples used for each update.
- You still have to tune the learning-rate hyperparameter $\alpha$.
- With a well-chosen mini-batch size, mini-batch gradient descent usually outperforms both plain gradient descent and stochastic gradient descent, especially when the training set is large.

(Batch) Gradient Descent:

```python
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost.
    cost = compute_cost(a, Y)
    # Backward propagation.
    grads = backward_propagation(a, caches, parameters)
    # Update parameters.
    parameters = update_parameters(parameters, grads)
```

Stochastic Gradient Descent:

```python
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    for j in range(0, m):
        # Forward propagation
        a, caches = forward_propagation(X[:,j], parameters)
        # Compute cost
        cost = compute_cost(a, Y[:,j])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters.
        parameters = update_parameters(parameters, grads)
```

# 2 - Mini-Batch Gradient Descent

1. **Shuffle**: Each column of X and Y represents one training example. Note that X and Y are shuffled synchronously, so that after shuffling, the $i$-th column of X still corresponds to the $i$-th label in Y. The shuffling step ensures that examples end up in mini-batches at random.
2. **Partition**: Split the shuffled (X, Y) into mini-batches of size mini_batch_size (here, 64). The last mini-batch may contain fewer than mini_batch_size examples.

```python
import math
import numpy as np

# GRADED FUNCTION: random_mini_batches
def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):
    """
    Creates a list of random minibatches from (X, Y)
    
    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer
    
    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """
    
    np.random.seed(seed)            # To make your "random" minibatches the same as ours
    m = X.shape[1]                  # number of training examples
    mini_batches = []
        
    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))   # random permutation of the integers [0, m-1]
    shuffled_X = X[:, permutation]                 # reorder the columns according to the permutation
    shuffled_Y = Y[:, permutation].reshape((1,m))

    # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
    num_complete_minibatches = math.floor(m/mini_batch_size) # number of mini batches of size mini_batch_size in your partitioning
    for k in range(0, num_complete_minibatches):
        ### START CODE HERE ### (approx. 2 lines)
        mini_batch_X = shuffled_X[ : , k * mini_batch_size : (k+1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[ : , k * mini_batch_size : (k+1) * mini_batch_size]
        ### END CODE HERE ###
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    # Handling the end case (last mini-batch < mini_batch_size)
    if m % mini_batch_size != 0:
        ### START CODE HERE ### (approx. 2 lines)
        mini_batch_X = shuffled_X[ : , num_complete_minibatches * mini_batch_size : m]
        mini_batch_Y = shuffled_Y[ : , num_complete_minibatches * mini_batch_size : m]
        ### END CODE HERE ###
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    return mini_batches
```
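
A quick shape check on synthetic data (sizes chosen arbitrarily here) to illustrate what the function returns:

```python
import numpy as np

np.random.seed(1)
X_demo = np.random.randn(12288, 148)        # hypothetical: 148 examples of dimension 12288
Y_demo = (np.random.randn(1, 148) < 0.5)    # hypothetical binary labels

mini_batches = random_mini_batches(X_demo, Y_demo, mini_batch_size=64)
print(len(mini_batches))                    # 3: two full batches of 64 plus a final batch of 20
print(mini_batches[0][0].shape)             # (12288, 64)
print(mini_batches[2][0].shape)             # (12288, 20)  <- the smaller end case
```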
- Shuffling and Partitioning are the two steps required to build mini-batches.
- Mini-batch sizes are usually chosen as powers of two, e.g. 16, 32, 64, 128.

# 3 - Momentum

Because mini-batch gradient descent makes a parameter update after seeing only a subset of the examples, the direction of each update has some variance, so the path taken by mini-batch gradient descent "oscillates" as it converges. Using momentum reduces these oscillations. Momentum takes the direction of past gradients into account to smooth out the update. We store the "direction" of the previous gradients in a variable $v$. Formally, this is the exponentially weighted average of the gradients of previous steps. You can also think of $v$ as the "velocity" of a ball rolling downhill, building up speed (momentum) according to the direction of the gradient (the slope of the hill).
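
To isolate the smoothing effect, here is a tiny NumPy sketch of my own (not part of the assignment) that applies the exponentially weighted average $v \leftarrow \beta v + (1-\beta)g$ to a sequence of noisy scalar "gradients":

```python
import numpy as np

np.random.seed(0)
noisy_grads = 1.0 + np.random.randn(10)    # noisy gradient samples scattered around +1

beta = 0.9
v = 0.0                                    # velocity starts at zero, as in initialize_velocity below
for g in noisy_grads:
    v = beta * v + (1 - beta) * g          # exponentially weighted average of past gradients
    print(f"g = {g:+.2f}   v = {v:+.2f}")
# v changes far less from step to step than g does, and gradually builds toward the
# underlying direction (+1); this is the "velocity" that the momentum update uses
```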

## initialize_velocity

```python
# GRADED FUNCTION: initialize_velocity

def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
                - keys: "dW1", "db1", ..., "dWL", "dbL" 
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    
    Returns:
    v -- python dictionary containing the current velocity.
                    v['dW' + str(l)] = velocity of dWl
                    v['db' + str(l)] = velocity of dbl
    """
    
    L = len(parameters) // 2 # number of layers in the neural networks
    v = {}
    
    # Initialize velocity
    for l in range(L):
        ### START CODE HERE ### (approx. 2 lines)
        v['dW' + str(l+1)] = np.zeros((parameters['W' + str(l+1)].shape[0] , parameters['W' + str(l+1)].shape[1]))
        v['db' + str(l+1)] = np.zeros((parameters['b' + str(l+1)].shape[0] , parameters['b' + str(l+1)].shape[1]))
        ### END CODE HERE ###
        
    return v
```
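
A quick check (hypothetical toy shapes of my own) that the returned velocity dictionary mirrors the parameter shapes:

```python
import numpy as np

# Hypothetical 2-layer parameter shapes
parameters = {'W1': np.random.randn(3, 2), 'b1': np.zeros((3, 1)),
              'W2': np.random.randn(1, 3), 'b2': np.zeros((1, 1))}

v = initialize_velocity(parameters)
print(v['dW1'].shape, v['db1'].shape)   # (3, 2) (3, 1)
print(v['dW2'].shape, v['db2'].shape)   # (1, 3) (1, 1)
```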

The momentum update is, for $l = 1, ..., L$:

$$\begin{cases}
v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta) dW^{[l]} \\
W^{[l]} = W^{[l]} - \alpha v_{dW^{[l]}}
\end{cases}$$

$$\begin{cases}
v_{db^{[l]}} = \beta v_{db^{[l]}} + (1 - \beta) db^{[l]} \\
b^{[l]} = b^{[l]} - \alpha v_{db^{[l]}}
\end{cases}$$

where $L$ is the number of layers, $\beta$ is the momentum coefficient, and $\alpha$ is the learning rate.

## update_parameters_with_momentum

```python
# GRADED FUNCTION: update_parameters_with_momentum

def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum
    
    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
                    v['dW' + str(l)] = ...
                    v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar
    
    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- python dictionary containing your updated velocities
    """
    
    L = len(parameters) // 2 # number of layers in the neural networks
    
    # Momentum update for each parameter
    for l in range(L):
        ### START CODE HERE ### (approx. 4 lines)
        # compute velocities
        v['dW' + str(l+1)] = beta * v['dW' + str(l+1)] + (1 - beta) * grads['dW' + str(l+1)]
        v['db' + str(l+1)] = beta * v['db' + str(l+1)] + (1 - beta) * grads['db' + str(l+1)]
        # update parameters
        parameters['W' + str(l+1)] -= learning_rate * v['dW' + str(l+1)]
        parameters['b' + str(l+1)] -= learning_rate * v['db' + str(l+1)]
        ### END CODE HERE ###
        
    return parameters, v
```

**Note:**

- The velocity is initialized with zeros, so the algorithm takes a few iterations to "build up" velocity and start taking bigger steps.
- If $\beta = 0$, this reduces to standard gradient descent without momentum.

**How do you choose $\beta$?**

- The larger $\beta$ is, the smoother the update, because past gradients are taken into account more. But if $\beta$ is too large, it can also over-smooth the updates.
- Common values for $\beta$ range from 0.8 to 0.999. If you don't want to tune it carefully, $\beta = 0.9$ is often a reasonable default.
- Finding the optimal $\beta$ for your model may require trying several values and seeing which one works best at reducing the value of the cost function $J$.

---

# 4 - Adam

Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp and Momentum.

**How Adam works:**

1. It computes an exponentially weighted average of past gradients and stores it in the variables $v$ (before bias correction) and $v^{corrected}$ (with bias correction).
2. It computes an exponentially weighted average of the squares of past gradients and stores it in the variables $s$ (before bias correction) and $s^{corrected}$ (with bias correction).
3. It updates the parameters in a direction that combines the information from "1" and "2".

The update rule is, for $l = 1, ..., L$:

$$\begin{cases}
v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \frac{\partial \mathcal{J} }{ \partial W^{[l]} } \\
v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1 - (\beta_1)^t} \\
s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) \left(\frac{\partial \mathcal{J} }{\partial W^{[l]} }\right)^2 \\
s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1 - (\beta_2)^t} \\
W^{[l]} = W^{[l]} - \alpha \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon} \\
b^{[l]} = b^{[l]} - \alpha \frac{v^{corrected}_{db^{[l]}}}{\sqrt{s^{corrected}_{db^{[l]}}} + \varepsilon}
\end{cases}$$

where:

- $t$ counts the number of steps Adam has taken (the **number of iterations**)
- $L$ is the number of layers
- $\beta_1$ and $\beta_2$ are hyperparameters controlling the two exponentially weighted averages
- $\alpha$ is the learning rate
- $\varepsilon$ is a very small number used to avoid dividing by zero
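
As a quick worked example of the bias correction (my own illustration, not from the notebook): at the very first step $t = 1$, with $v$ initialized to zero and $\beta_1 = 0.9$,

$$v_{dW^{[l]}} = 0.9 \cdot 0 + 0.1 \, dW^{[l]} = 0.1 \, dW^{[l]}, \qquad v^{corrected}_{dW^{[l]}} = \frac{0.1 \, dW^{[l]}}{1 - 0.9^{1}} = dW^{[l]}$$

so the correction undoes the shrinkage toward zero caused by the zero initialization; as $t$ grows, $1 - (\beta_1)^t$ approaches 1 and the correction fades out. The same reasoning applies to $s$ with $\beta_2$.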
## initialize_adam

```python
# GRADED FUNCTION: initialize_adam

def initialize_adam(parameters):
    """
    Initializes v and s as two python dictionaries with:
                - keys: "dW1", "db1", ..., "dWL", "dbL" 
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters["W" + str(l)] = Wl
                    parameters["b" + str(l)] = bl
    
    Returns:
    v -- python dictionary that will contain the exponentially weighted average of the gradient.
                    v["dW" + str(l)] = ...
                    v["db" + str(l)] = ...
    s -- python dictionary that will contain the exponentially weighted average of the squared gradient.
                    s["dW" + str(l)] = ...
                    s["db" + str(l)] = ...
    """
    
    L = len(parameters) // 2 # number of layers in the neural networks
    v = {}
    s = {}
    
    # Initialize v, s. Input: "parameters". Outputs: "v, s".
    for l in range(L):
        ### START CODE HERE ### (approx. 4 lines)
        v['dW' + str(l+1)] = np.zeros((parameters['W' + str(l+1)].shape[0] , parameters['W' + str(l+1)].shape[1]))
        v['db' + str(l+1)] = np.zeros((parameters['b' + str(l+1)].shape[0] , parameters['b' + str(l+1)].shape[1]))
        s['dW' + str(l+1)] = np.zeros((parameters['W' + str(l+1)].shape[0] , parameters['W' + str(l+1)].shape[1]))
        s['db' + str(l+1)] = np.zeros((parameters['b' + str(l+1)].shape[0] , parameters['b' + str(l+1)].shape[1]))
        ### END CODE HERE ###
    
    return v, s
```

## update_parameters_with_adam

```python
# GRADED FUNCTION: update_parameters_with_adam

def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate = 0.01,
                                beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8):
    """
    Update parameters using Adam
    
    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    learning_rate -- the learning rate, scalar.
    beta1 -- Exponential decay hyperparameter for the first moment estimates
    beta2 -- Exponential decay hyperparameter for the second moment estimates
    epsilon -- hyperparameter preventing division by zero in Adam updates

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    """
    
    L = len(parameters) // 2         # number of layers in the neural networks
    v_corrected = {}                 # Initializing first moment estimate, python dictionary
    s_corrected = {}                 # Initializing second moment estimate, python dictionary
    
    # Perform Adam update on all parameters
    for l in range(L):
        # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
        ### START CODE HERE ### (approx. 2 lines)
        v['dW' + str(l+1)] = beta1 * v['dW' + str(l+1)] + (1 - beta1) * grads['dW' + str(l+1)]
        v['db' + str(l+1)] = beta1 * v['db' + str(l+1)] + (1 - beta1) * grads['db' + str(l+1)]
        ### END CODE HERE ###

        # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
        ### START CODE HERE ### (approx. 2 lines)
        v_corrected['dW' + str(l+1)] = v['dW' + str(l+1)] / (1 - np.power(beta1, t))   # t is the current iteration count
        v_corrected['db' + str(l+1)] = v['db' + str(l+1)] / (1 - np.power(beta1, t))
        ### END CODE HERE ###

        # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
        ### START CODE HERE ### (approx. 2 lines)
        s['dW' + str(l+1)] = beta2 * s['dW' + str(l+1)] + (1 - beta2) * grads['dW' + str(l+1)] ** 2
        s['db' + str(l+1)] = beta2 * s['db' + str(l+1)] + (1 - beta2) * grads['db' + str(l+1)] ** 2
        ### END CODE HERE ###

        # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
        ### START CODE HERE ### (approx. 2 lines)
        s_corrected['dW' + str(l+1)] = s['dW' + str(l+1)] / (1 - np.power(beta2, t))
        s_corrected['db' + str(l+1)] = s['db' + str(l+1)] / (1 - np.power(beta2, t))
        ### END CODE HERE ###

        # Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
        ### START CODE HERE ### (approx. 2 lines)
        parameters['W' + str(l+1)] -= learning_rate * v_corrected['dW' + str(l+1)] / (np.sqrt(s_corrected['dW' + str(l+1)]) + epsilon)
        parameters['b' + str(l+1)] -= learning_rate * v_corrected['db' + str(l+1)] / (np.sqrt(s_corrected['db' + str(l+1)]) + epsilon)
        ### END CODE HERE ###

    return parameters, v, s
```
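
A single-step sanity check with toy values of my own (not from the notebook). Because of the $\frac{v^{corrected}}{\sqrt{s^{corrected}} + \varepsilon}$ ratio, the very first Adam step moves each parameter by roughly $\alpha$ in the direction of its gradient's sign, regardless of the gradient's magnitude:

```python
import numpy as np

parameters = {'W1': np.array([[1.0, 1.0]]), 'b1': np.array([[1.0]])}
grads      = {'dW1': np.array([[100.0, -0.001]]), 'db1': np.array([[0.5]])}

v, s = initialize_adam(parameters)
parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t=1,
                                               learning_rate=0.01)
print(parameters['W1'])   # approximately [[0.99 1.01]]: both entries moved by ~0.01
```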
Momentum usually helps, but with a small learning rate and a fairly simple dataset its impact is almost negligible. Also, the large oscillations in the cost curve come from the fact that some mini-batches are harder for the optimization algorithm than others.

Adam, on the other hand, clearly outperforms both mini-batch gradient descent and momentum. If you ran the model for more epochs on this simple dataset, all three methods would give very good results. However, **Adam converges a lot faster**. (I was surprised by how quickly Adam converges; impressive.)

**Some advantages of Adam include:**

- Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum)
- Usually works well even with little tuning of the hyperparameters (except $\alpha$)
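
For reference, the comparison above implies a training loop that ties all of these functions together. Below is a condensed sketch of such a model() in the spirit of the course notebook (my own outline; initialize_parameters, forward_propagation, compute_cost and backward_propagation are assumed helpers, as in the pseudocode earlier):

```python
def model(X, Y, layers_dims, optimizer="adam", learning_rate=0.0007, mini_batch_size=64,
          beta=0.9, beta1=0.9, beta2=0.999, epsilon=1e-8, num_epochs=10000, seed=10):
    parameters = initialize_parameters(layers_dims)          # assumed helper
    if optimizer == "momentum":
        v = initialize_velocity(parameters)
    elif optimizer == "adam":
        v, s = initialize_adam(parameters)
        t = 0                                                # Adam step counter for bias correction

    for i in range(num_epochs):
        seed += 1                                            # reshuffle differently every epoch
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)
        for minibatch_X, minibatch_Y in minibatches:
            a, caches = forward_propagation(minibatch_X, parameters)   # assumed helper
            cost = compute_cost(a, minibatch_Y)                        # assumed helper
            grads = backward_propagation(a, caches, parameters)        # assumed helper
            if optimizer == "gd":
                parameters = update_parameters_with_gd(parameters, grads, learning_rate)
            elif optimizer == "momentum":
                parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
            elif optimizer == "adam":
                t += 1
                parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t,
                                                               learning_rate, beta1, beta2, epsilon)
    return parameters
```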