Andrew Ng Deep Learning - Initialization, Regularization, Gradient Checking

Initialization

parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":

W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])

b1 -- bias vector of shape (layers_dims[1], 1)

...

WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])

bL -- bias vector of shape (layers_dims[L], 1)

Zero initialization

import numpy as np

# GRADED FUNCTION: initialize_parameters_zeros
def initialize_parameters_zeros(layers_dims):
    parameters = {}
    L = len(layers_dims)            # number of entries in layers_dims (input layer included)
    
    for l in range(1, L):
        parameters['W' + str(l)] = np.zeros((layers_dims[l] , layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l] , 1))
        
    return parameters
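To see why zero initialization is a poor choice, here is a minimal sketch on a hypothetical 2-4-1 network (the layer sizes, the data and the helper definitions of relu and sigmoid are assumptions for illustration, not the course's test case). With all-zero parameters every prediction is 0.5 and every weight gradient is zero, so the weights cannot start learning.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical 2-4-1 network, initialized to zeros as in initialize_parameters_zeros
W1, b1 = np.zeros((4, 2)), np.zeros((4, 1))
W2, b2 = np.zeros((1, 4)), np.zeros((1, 1))

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5))      # 5 made-up examples with 2 features
Y = np.array([[1, 0, 1, 1, 0]])      # made-up labels
m = X.shape[1]

# Forward pass: LINEAR -> RELU -> LINEAR -> SIGMOID
Z1 = W1 @ X + b1
A1 = relu(Z1)
A2 = sigmoid(W2 @ A1 + b2)

# Backward pass for the weight matrices
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
dA1 = W2.T @ dZ2
dZ1 = dA1 * (Z1 > 0)
dW1 = dZ1 @ X.T / m

print(A2)         # every prediction is 0.5
print(dW1, dW2)   # both weight gradients are all zeros
```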

Random initialization

# GRADED FUNCTION: initialize_parameters_random

def initialize_parameters_random(layers_dims):
    np.random.seed(3)               
    parameters = {}
    L = len(layers_dims)            # integer representing the number of layers
    
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l] , layers_dims[l-1]) * 10
        parameters['b' + str(l)] = np.zeros((layers_dims[l] , 1))
        
    return parameters
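The factor of 10 above makes the weights very large. As a rough illustration (a single sigmoid unit on made-up Gaussian inputs, which is a simplification of the network in these notes; saturated_fraction is a hypothetical helper), the sketch below compares how often the output saturates near 0 or 1 for a large and a small weight scale.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(3)
n_in, m = 10, 1000
X = rng.standard_normal((n_in, m))   # made-up inputs
w = rng.standard_normal((1, n_in))   # base weights, rescaled below

def saturated_fraction(scale):
    """Fraction of sigmoid outputs within 0.01 of 0 or 1 for weights w * scale."""
    a = sigmoid((w * scale) @ X)
    return np.mean((a < 0.01) | (a > 0.99))

frac_large = saturated_fraction(10.0)    # same scale as initialize_parameters_random
frac_small = saturated_fraction(0.01)
print(frac_large, frac_small)
```

When most outputs sit near 0 or 1, a wrong prediction incurs a very large log loss, which is one reason such an initialization tends to start with a high cost and train slowly.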

He initialization

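He initialization is intended for ReLU-family activations (ReLU and PReLU). The weights $W^{[l]}$ of each layer are drawn from a Gaussian distribution with mean 0 and variance $\frac{2}{n^{[l-1]}}$ (that is, standard deviation $\sqrt{\frac{2}{n^{[l-1]}}}$), where $n^{[l-1]}$ is the number of units in the previous layer. This keeps the variance of each layer's input on a consistent scale.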

$W^{[l]} = random * \sqrt{\frac{2}{\text{layers\_dims[l-1]}}}$

# GRADED FUNCTION: initialize_parameters_he
def initialize_parameters_he(layers_dims):
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1 
     
    for l in range(1, L + 1):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l] , layers_dims[l-1]) * np.sqrt(2 / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        
    return parameters
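The claim that this scaling keeps the variance consistent across layers can be checked numerically. The sketch below is only an illustration under assumed settings (a stack of ten 512-unit ReLU layers without biases, Gaussian inputs; preactivation_stds is a hypothetical helper): it tracks the standard deviation of the pre-activations with the He factor $\sqrt{2/n}$ and, for comparison, with $\sqrt{1/n}$.

```python
import numpy as np

rng = np.random.default_rng(3)

def preactivation_stds(scale_fn, n=512, depth=10, m=256):
    """Std of Z at each layer of a bias-free ReLU stack with weights randn * scale_fn(n)."""
    A = rng.standard_normal((n, m))
    stds = []
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * scale_fn(n)
        Z = W @ A
        stds.append(Z.std())
        A = np.maximum(0, Z)          # ReLU
    return stds

he = preactivation_stds(lambda n: np.sqrt(2 / n))      # He scaling
naive = preactivation_stds(lambda n: np.sqrt(1 / n))   # comparison scaling

print(he[0], he[-1])         # stays on roughly the same scale
print(naive[0], naive[-1])   # shrinks layer after layer
```

With the He factor the scale of Z stays roughly constant from the first to the last layer, while the smaller factor roughly halves the variance at every ReLU layer.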

Regularization

L2 Regularization

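The cost changes from the cross-entropy cost:

$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \right)$

to the L2-regularized cost:

$J_{regularized} = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \right) + \frac{1}{m}\frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2}$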

To calculate $\sum\limits_k\sum\limits_j W_{k,j}^{[l]2}$ , use :

np.sum(np.square(Wl))

compute_cost_with_regularization

# GRADED FUNCTION: compute_cost_with_regularization
def compute_cost_with_regularization(A3, Y, parameters, lambd):

    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]
    
    cross_entropy_cost = compute_cost(A3, Y)
    
    L2_regularization_cost = (1 / m) * (lambd / 2) * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))
    
    cost = cross_entropy_cost + L2_regularization_cost
    
    return cost
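As a quick sanity check of the regularization term, here is a small worked example. The helper name l2_penalty and the weight values are made up for illustration; the formula is the same $\frac{1}{m}\frac{\lambda}{2}\sum W^2$ term used in the function above.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """L2 term: (lambd / (2 * m)) * sum of all squared weight entries."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

# Made-up weights: squared sums are 5.25 and 10.0, so 15.25 in total
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0, 1.0]])

penalty = l2_penalty([W1, W2], lambd=0.7, m=5)
print(penalty)   # approximately 0.07 * 15.25 = 1.0675
```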

backward_propagation_with_regularization

$\frac{d}{dW} ( \frac{1}{2}\frac{\lambda}{m} W^2) = \frac{\lambda}{m} W$

# GRADED FUNCTION: backward_propagation_with_regularization
def backward_propagation_with_regularization(X, Y, cache, lambd):

    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y  # gradient w.r.t. Z3 for a sigmoid output with cross-entropy cost
    
    ### START CODE HERE ### (approx. 1 line)
    dW3 = 1. / m * np.dot(dZ3 , A2.T) + lambd / m * W3
    ### END CODE HERE ###

    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    
    dA2 = np.dot(W3.T, dZ3)
    # Every layer except the last uses ReLU, whose gradient is 1 where the input is > 0 and 0 otherwise
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))  

    ### START CODE HERE ### (approx. 1 line)
    dW2 = 1. / m * np.dot(dZ2 , A1.T) + lambd / m * W2
    ### END CODE HERE ###

    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))

    ### START CODE HERE ### (approx. 1 line)
    dW1 = 1. / m * np.dot(dZ1 , X.T) + lambd / m * W1
    ### END CODE HERE ###

    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

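Notes:

The value of $\lambda$ is a hyperparameter that you can tune using a validation set.
L2 regularization makes the decision boundary smoother. If $\lambda$ is too large, it can also "oversmooth", resulting in a model with high bias.

L2 regularization relies on the assumption that a model with small weights is simpler than a model with large weights. By penalizing the squared values of the weights in the cost function, it drives all the weights toward smaller values: large weights become too costly to keep. The result is a smoother model whose output changes more slowly as the input changes.

What L2 regularization affects:

Cost computation: a regularization term is added to the cost.
Backpropagation function: the gradients with respect to the weight matrices contain an extra term.
Weights end up smaller ("weight decay"): weights are pushed toward smaller values.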
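The "weight decay" name can be made concrete with a gradient descent step. The sketch below assumes a plain update $W := W - \alpha \, dW$ with a learning rate alpha (alpha and the random stand-in gradient are assumptions, since the update rule is not shown in these notes). It checks that adding $\frac{\lambda}{m} W$ to the gradient is the same as first shrinking $W$ by the factor $(1 - \alpha\frac{\lambda}{m})$.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
dW_data = rng.standard_normal((3, 4))   # stand-in for the cross-entropy part of the gradient
alpha, lambd, m = 0.1, 0.7, 10          # assumed learning rate, lambda and batch size

# Form 1: gradient step with the regularized gradient dW_data + (lambd / m) * W
step_a = W - alpha * (dW_data + (lambd / m) * W)

# Form 2: shrink W first, then apply the unregularized gradient step
decay = 1 - alpha * lambd / m
step_b = decay * W - alpha * dW_data

# With no data gradient at all, 100 steps shrink W by decay ** 100
shrunk = W * decay ** 100

print(np.allclose(step_a, step_b))
print(np.linalg.norm(shrunk) / np.linalg.norm(W))
```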

Dropout

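Notes:

A common mistake with dropout is to use it in both training and testing. You should use dropout (randomly eliminating nodes) only during training.
Deep learning frameworks such as tensorflow, PaddlePaddle, keras, and caffe come with a dropout layer implementation. Don't stress: you will learn some of these frameworks soon.

What to remember about dropout:

Dropout is a regularization technique.
Use dropout only during training. Do not use dropout (randomly eliminating nodes) at test time.
Apply dropout during both forward and backward propagation.
During training, divide the output of each dropout layer by keep_prob so that the expected value of the activations stays the same. For example, if keep_prob is 0.5, then on average half of the nodes are shut down, so the output is scaled by 0.5 because only the remaining half contribute. Dividing by 0.5 is equivalent to multiplying by 2, so the output now has the same expected value. You can check that this also works for values of keep_prob other than 0.5.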
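The last point can be checked empirically. The following sketch uses made-up positive activations and a hypothetical helper inverted_dropout (mask, then divide by keep_prob) and compares the mean activation before and after dropout for several values of keep_prob.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((50, 2000)) + 0.5        # made-up activations with mean close to 1.0

def inverted_dropout(A, keep_prob, rng):
    """Zero each entry with probability 1 - keep_prob, then rescale by 1 / keep_prob."""
    D = rng.random(A.shape) < keep_prob
    return A * D / keep_prob

# Mean activation after dropout for several keep_prob values
means = {kp: inverted_dropout(A, kp, rng).mean() for kp in (0.5, 0.8, 0.86)}

print(A.mean())
print(means)   # each value is close to A.mean()
```

Without the division by keep_prob, the mean after dropout would be roughly keep_prob times the original mean.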

forward_propagation_with_dropout

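Suppose you want to shut down some neurons in the first and second layers. To do that, carry out the following 4 steps:

1. In lecture, we discussed creating a variable $d^{[1]}$ with the same shape as $a^{[1]}$, using np.random.rand() to generate random numbers between 0 and 1. Here you will use a vectorized implementation, so create a random matrix $D^{[1]} = [d^{[1](1)} d^{[1](2)} ... d^{[1](m)}]$ with the same dimensions as $A^{[1]}$.
2. Set each entry of $D^{[1]}$ to 0 with probability 1-keep_prob and to 1 with probability keep_prob by thresholding its values appropriately. Hint: to set the entries of a matrix X that are less than 0.5 to 1 and all other entries to 0, run X = (X < 0.5). Note that 0 and 1 correspond to False and True, respectively.
3. Set $A^{[1]}$ to $A^{[1]} \times D^{[1]}$ (an element-wise product; this is where you shut down some neurons). You can think of $D^{[1]}$ as a mask: multiplying it with another matrix "masks out" some of the values.
4. Divide $A^{[1]}$ by keep_prob. This ensures that the result of the cost still has the same expected value as without dropout. (This technique is also called inverted dropout.)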

# GRADED FUNCTION: forward_propagation_with_dropout
def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
  
     np.random.seed(1)
     
     # retrieve parameters
     W1 = parameters["W1"]
     b1 = parameters["b1"]
     W2 = parameters["W2"]
     b2 = parameters["b2"]
     W3 = parameters["W3"]
     b3 = parameters["b3"]
     
     # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
     Z1 = np.dot(W1, X) + b1
     A1 = relu(Z1)
     ### START CODE HERE ### (approx. 4 lines)         
     # Steps 1-4 below correspond to the Steps 1-4 described above.
     # Step 1: initialize matrix D1 = np.random.rand(..., ...)
     D1 = np.random.rand(A1.shape[0] , A1.shape[1])
     
     # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
     # Comparing each entry with keep_prob yields a boolean (True/False) matrix of the same shape as D1
     D1 = (D1 < keep_prob)
          
     # Step 3: shut down some neurons of A1
     A1 *= D1

     # Step 4: scale the value of neurons that haven't been shut down
     A1 = np.divide(A1 , keep_prob)

     ### END CODE HERE ###
     Z2 = np.dot(W2, A1) + b2
     A2 = relu(Z2)
     ### START CODE HERE ### (approx. 4 lines)
     # Step 1: initialize matrix D2 = np.random.rand(..., ...)
     D2 = np.random.rand(A2.shape[0] , A2.shape[1])

     # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
     D2 = (D2 < keep_prob)

     # Step 3: shut down some neurons of A2
     A2 *= D2
     
     # Step 4: scale the value of neurons that haven't been shut down
     A2 = np.divide(A2 , keep_prob)
     ### END CODE HERE ###
     Z3 = np.dot(W3, A2) + b3
     A3 = sigmoid(Z3)
     
     cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
     
     return A3, cache

backward_propagation_with_dropout

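Backpropagation with dropout is actually quite simple. You need to carry out 2 steps:

1. In forward propagation, you shut down some neurons by applying the mask $D^{[1]}$ to A1. In backpropagation, you must shut down the same neurons by reapplying the same mask $D^{[1]}$ to dA1.
2. During forward propagation, you divided A1 by keep_prob. In backpropagation, you therefore have to divide dA1 by keep_prob again (in calculus terms, if $A^{[1]}$ is scaled by keep_prob, then its derivative $dA^{[1]}$ is scaled by the same keep_prob).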

# GRADED FUNCTION: backward_propagation_with_dropout
def backward_propagation_with_dropout(X, Y, cache, keep_prob):

    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    
    dA2 = np.dot(W3.T, dZ3)
    ### START CODE HERE ### (≈ 2 lines of code)
    # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = dA2 * D2
    # Step 2: Scale the value of neurons that haven't been shut down
    dA2 = dA2 / keep_prob
    ### END CODE HERE ###

    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
    ### START CODE HERE ### (≈ 2 lines of code)
    # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = dA1 * D1
    # Step 2: Scale the value of neurons that haven't been shut down
    dA1 = dA1 / keep_prob
    ### END CODE HERE ###
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients
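The two backward steps (mask, then divide by keep_prob) can be verified in isolation. This is a minimal sketch under assumptions: a toy cost J = sum(C * A_dropped) with arbitrary fixed coefficients C stands in for the rest of the network, and the mask D is held fixed across evaluations. The masked-and-scaled gradient is compared with a centered finite difference.

```python
import numpy as np

rng = np.random.default_rng(2)
keep_prob = 0.8
A = rng.standard_normal((3, 4))          # activations before dropout
C = rng.standard_normal((3, 4))          # arbitrary coefficients: the upstream gradient
D = rng.random(A.shape) < keep_prob      # fixed dropout mask, reused for every evaluation

def J(A):
    """Toy cost: inverted dropout applied to A, then a weighted sum."""
    return np.sum(C * (A * D / keep_prob))

# Analytic gradient: upstream gradient C, masked by D, then divided by keep_prob
analytic = C * D / keep_prob

# Centered finite differences, one entry at a time
eps = 1e-6
numeric = np.zeros_like(A)
for idx in np.ndindex(A.shape):
    A_plus, A_minus = A.copy(), A.copy()
    A_plus[idx] += eps
    A_minus[idx] -= eps
    numeric[idx] = (J(A_plus) - J(A_minus)) / (2 * eps)

max_err = np.max(np.abs(numeric - analytic))
print(max_err)   # tiny: the two gradients agree
```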

Gradient Checking

N-dimensional gradient checking

The forward propagation function forward_propagation_n and the backward propagation function backward_propagation_n are provided.

$\frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon}$

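After nudging a single parameter (an entry of w or b) by a small epsilon, recompute J; the definition of the derivative then gives an approximation of the derivative with respect to that parameter. Because this has to be repeated for every entry of every w and b, it is slow. For each index i in num_parameters:

To compute J_plus[i]:
1. Set $\theta^{+}$ to np.copy(parameters_values)
2. Set $\theta^{+}_i$ to $\theta^{+}_i + \varepsilon$
3. Compute $J^{+}_i$ using forward_propagation_n(X, Y, vector_to_dictionary( $\theta^{+}$ ))
To compute J_minus[i]: do the same with $\theta^{-}$
Compute $gradapprox[i] = \frac{J^{+}_i - J^{-}_i}{2 \varepsilon}$

You thus obtain a vector gradapprox, where gradapprox[i] is an approximation of the gradient with respect to parameters_values[i]. You can now compare this gradapprox vector with the gradient vector from backpropagation. Compute: $difference = \frac {\| grad - gradapprox \|_2}{\| grad \|_2 + \| gradapprox \|_2 }$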

# GRADED FUNCTION: gradient_check_n

def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n
    
    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
    gradients -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters
    X -- input data, of shape (input size, number of examples)
    Y -- true "label"
    epsilon -- tiny shift to the input to compute the approximated gradient with the centered-difference formula
    
    Returns:
    difference -- relative difference between the approximated gradient and the backward propagation gradient
    """
    
    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0] 
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))
    
    # Compute gradapprox
    # Perturb one parameter at a time by epsilon and recompute the cost; the centered difference then approximates dJ/dtheta_i
    for i in range(num_parameters):
        
        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because the function returns two values but we only care about the first one
        ### START CODE HERE ### (approx. 3 lines)
        theta_plus = np.copy(parameters_values)
        theta_plus[i] += epsilon
        J_plus[i], _ = forward_propagation_n(X , Y , vector_to_dictionary(theta_plus))
        ### END CODE HERE ###
        
        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        theta_minus = np.copy(parameters_values)
        theta_minus[i] -= epsilon
        J_minus[i], _ = forward_propagation_n(X , Y , vector_to_dictionary(theta_minus))
        ### END CODE HERE ###
        
        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)
        ### END CODE HERE ###
    
    # Compare gradapprox to backward propagation gradients by computing difference.
    ### START CODE HERE ### (approx. 1 line)
    # Step 1'
    numerator = np.linalg.norm(grad - gradapprox)  # np.linalg.norm computes the norm (L2 by default)
    # Step 2'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    # Step 3'
    difference = numerator / denominator
    ### END CODE HERE ###

    if difference > 1e-7:
        print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")
    
    return difference
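Since forward_propagation_n, backward_propagation_n and the vector/dictionary helpers are not shown in these notes, here is a self-contained sketch of the same procedure on a toy cost. Everything in it is an assumption for illustration: the helper name gradient_check, the cost $J(\theta) = \sum_i \theta_i^3$, and the deliberately wrong gradient used to show what a failed check looks like.

```python
import numpy as np

def gradient_check(J, grad, theta, epsilon=1e-7):
    """Relative difference between grad and a centered-difference approximation of dJ/dtheta."""
    gradapprox = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        gradapprox[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return numerator / denominator

def J(theta):
    # Toy cost with a known gradient: dJ/dtheta_i = 3 * theta_i ** 2
    return np.sum(theta ** 3)

theta = np.array([1.0, -2.0, 0.5])
correct_grad = 3 * theta ** 2

buggy_grad = correct_grad.copy()
buggy_grad[1] *= 2                      # simulate a backprop bug in one component

diff_ok = gradient_check(J, correct_grad, theta)
diff_bug = gradient_check(J, buggy_grad, theta)
print(diff_ok)    # far below the 1e-7 threshold
print(diff_bug)   # orders of magnitude above it
```

A single wrong component is enough to push the difference far above the threshold, which is why the check is useful for locating bugs such as the ones planted in the assignment.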

The author deliberately planted bugs in the backward_propagation_n function (hint: check dW2 and db1).

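Notes:

Gradient checking is slow! Approximating the gradient with $\frac{\partial J}{\partial \theta} \approx \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon}$ is computationally costly, because it needs two forward passes per parameter. For that reason, we do not run gradient checking at every iteration during training, just a few times to confirm that the gradient is correct.
Gradient checking, at least as presented here, does not work with dropout. You would usually run gradient checking without dropout to make sure backpropagation is correct, and then add dropout. :)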

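Thresholds for the difference: the difference metric above needs a threshold to decide whether backpropagation passes the gradient check. Commonly used thresholds are:

1e-7: a common, strict criterion, meaning the maximum allowed relative error is about one in ten million. If the difference is below this value, the backpropagation gradient is usually considered very close to the numerical approximation and is accepted.

1e-5: a slightly looser criterion, allowing a relative error of about one in a hundred thousand. In some cases, when computing resources are limited or the model is more complex, a slightly higher tolerance may have to be accepted.

1e-2: for some large networks or severely resource-constrained settings, an even looser criterion may be needed. Note, however, that a difference this large may indicate a significant error in backpropagation and should be treated with caution.

Choosing the threshold means weighing the required precision, the computational cost, and the complexity of the model. Gradient checking is normally used only for debugging early in model development. Once backpropagation is confirmed to be correct, you can stop using it, because it is too expensive to run continuously during regular training. In practice, most developers prefer a relatively strict threshold (such as 1e-7 or 1e-5) to make sure backpropagation is highly accurate. If the difference exceeds the chosen threshold, look for errors in the backpropagation implementation, fix them, and run the gradient check again.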