15-442/15-642 mlsys lab1: Implementing an automatic differentiation framework by hand

Disclaimer: this is not a solution, only a record of how I worked through the lab.

Automatic differentiation

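1. A computation graph consists of nodes and edges. Each node defines an operation (addition, subtraction, multiplication, division) and a list of input nodes (for example, the inputs of v4 are v1 and v2).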

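2. To run forward-mode AD (automatic differentiation), set the derivative of the first input x1 to 1 and the derivative of every other input to 0. Propagating these values through the graph yields the derivative of the output y with respect to x1.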
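The seeding rule above can be illustrated with dual numbers. This is a minimal sketch, not the lab's API: the `Dual` class, the function `f`, and `forward_grad` are names made up for this example, and only addition and multiplication are supported.

```python
from dataclasses import dataclass


@dataclass
class Dual:
    """A value paired with its derivative w.r.t. the currently seeded input."""
    value: float
    deriv: float

    def __add__(self, other: "Dual") -> "Dual":
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other: "Dual") -> "Dual":
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)


def f(x1: Dual, x2: Dual) -> Dual:
    # Hypothetical example function: y = x1 * x2 + x1
    return x1 * x2 + x1


def forward_grad(x1: float, x2: float):
    # One full forward pass per input: seed that input with 1, the other with 0.
    dy_dx1 = f(Dual(x1, 1.0), Dual(x2, 0.0)).deriv
    dy_dx2 = f(Dual(x1, 0.0), Dual(x2, 1.0)).deriv
    return dy_dx1, dy_dx2


grads = forward_grad(2.0, 3.0)
```

Note that `forward_grad` evaluates `f` twice, once per input, which is exactly the cost that motivates reverse mode.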

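3. The drawback of forward-mode AD is clear: with n inputs, the forward pass has to be repeated n times. Reverse-mode AD (backpropagation) avoids this by producing the gradients of all inputs in a single backward pass. A forward pass first computes the value of every node; the backward pass then computes each layer's gradient from the gradient handed back by the layer closer to the output.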

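4. Backpropagation therefore works as follows: (1) arrange the nodes in reverse topological order and traverse them; (2) the gradient of node i is the sum, over every node j that node i feeds into, of the partial derivative of vj with respect to vi multiplied by the gradient of vj. Each term of this sum is called a partial adjoint, and by the time node i is visited all of its partial adjoints are already stored in a dictionary; (3) for each input of node i, multiply the partial derivative of vi with respect to that input by the gradient of vi and store the product in the dictionary as a partial adjoint of that input.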
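Written as a formula, the second step above reads as follows. The bar notation is introduced here for illustration and is not taken from the lab handout: $\bar{v}_i$ denotes the gradient of the output $y$ with respect to $v_i$, and $\mathrm{inputs}(j)$ is the input list of node $j$.

$\bar{v}_i = \sum_{j:\ i \in \mathrm{inputs}(j)} \bar{v}_j \frac{\partial v_j}{\partial v_i}$

The traversal starts from $\bar{y} = 1$ at the output node, and each summand $\bar{v}_j \frac{\partial v_j}{\partial v_i}$ is one partial adjoint.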

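task1:

Pass the node_forward tests. This is straightforward: follow the reference implementations provided and implement basic multiplication, division, and matrix multiplication.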

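task2:

Implement the Evaluator.run method and pass the forward-pass tests for the computation graph. An Evaluator is an executor that stores a list of nodes; calling run computes these nodes and returns their final values.

The input argument is a dictionary holding the initial input values.
Only the final nodes are given, so a topological sort is required. Use depth-first search: a node is appended only after all of the nodes it depends on (its inputs) have been processed, which yields a topological order.
Following that order, compute the output value of each node, fetching its inputs from the dictionary, and then store its own output in the dictionary for later nodes to use.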
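The DFS-based sort and the dictionary-driven evaluation described above can be sketched as follows. The `Node` class here is a hypothetical stand-in with a `compute` callable, not the lab's `ad.Node`, and `run` is a free function rather than the real `Evaluator.run`.

```python
from typing import Callable, Dict, List, Optional, Sequence


class Node:
    """Hypothetical minimal node: a name, input nodes, and a compute function."""

    def __init__(self, name: str, inputs: Sequence["Node"] = (),
                 compute: Optional[Callable] = None):
        self.name = name
        self.inputs = list(inputs)
        self.compute = compute  # None marks an input placeholder


def topological_sort(outputs: List[Node]) -> List[Node]:
    visited = set()
    order: List[Node] = []

    def dfs(node: Node) -> None:
        if node in visited:
            return
        visited.add(node)
        for parent in node.inputs:
            dfs(parent)
        # Post-order: a node is appended only after all of its inputs.
        order.append(node)

    for out in outputs:
        dfs(out)
    return order


def run(outputs: List[Node], input_values: Dict[Node, float]) -> List[float]:
    values = dict(input_values)
    for node in topological_sort(outputs):
        if node not in values:  # placeholders are already in the dictionary
            values[node] = node.compute(*[values[p] for p in node.inputs])
    return [values[out] for out in outputs]


x1, x2 = Node("x1"), Node("x2")
v3 = Node("v3", [x1, x2], lambda a, b: a * b)
v4 = Node("v4", [v3, x1], lambda a, b: a + b)
order_names = [n.name for n in topological_sort([v4])]
result = run([v4], {x1: 2.0, x2: 3.0})
```

Because `x1` feeds both `v3` and `v4`, the `visited` set is what keeps it from being appended twice.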

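task3:

Implement the gradient method for every op class: given its own gradient and the kind of computation it performs, each node computes the gradient (partial derivative) with respect to each of its inputs. Scalar multiplication and division are simple applications of the usual formulas; for matrix multiplication, pay attention to the transposes. The following formulas are worth memorizing.

Suppose $L$ is a scalar loss that depends on $Y$, and we need $\frac{\partial L}{\partial A}$ and $\frac{\partial L}{\partial B}$. When $Y=AB$: $\frac{\partial L}{\partial A} = \frac{\partial L}{\partial Y}B^T$ and $\frac{\partial L}{\partial B} = A^T\frac{\partial L}{\partial Y}$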
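A numerical check can make the transposes easier to trust. The sketch below assumes NumPy and uses made-up shapes and a made-up loss $L=\sum_{ij} (AB)_{ij} G_{ij}$, chosen so that $\frac{\partial L}{\partial Y} = G$; it compares the two formulas against central finite differences. The helper `numeric_grad` is written for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
G = rng.standard_normal((2, 4))  # plays the role of dL/dY


def loss(A_, B_):
    # L = sum(Y * G) with Y = A_ @ B_, so dL/dY is exactly G.
    return np.sum((A_ @ B_) * G)


# Analytic gradients from the formulas above.
grad_A = G @ B.T      # shape (2, 3), same as A
grad_B = A.T @ G      # shape (3, 4), same as B


def numeric_grad(f, X, eps=1e-6):
    """Central finite differences, one entry of X at a time."""
    grad = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps
        Xm[idx] -= eps
        grad[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return grad


num_A = numeric_grad(lambda M: loss(M, B), A)
num_B = numeric_grad(lambda M: loss(A, M), B)
```

A useful sanity rule that falls out of this: each gradient must have the same shape as the matrix it belongs to, which only one placement of the transposes satisfies.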

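task4:

Build the computation graph for backpropagated gradients.

Topologically sort from output_node and reverse the result to obtain a reverse topological order starting at the output node, ready for backward gradient propagation;
Initialize a dictionary from the sorted nodes, with nodes as keys and lists of nodes as values, and initialize the list for output_node to [1], where 1 is a node representing the constant one rather than a plain number.
Traverse the nodes, computing the gradients of each node's input nodes and appending them to the dictionary.
Return the gradients of the requested nodes (a node is in essence an expression).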
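The procedure above can be sketched on plain numbers. This is a simplification and an assumption of this example: the lab builds new gradient nodes (expressions), whereas the sketch below evaluates values directly, and the `Node`, `mul`, `add`, and `gradients` names are hypothetical.

```python
from collections import defaultdict


class Node:
    """Hypothetical scalar node. `grad_fn(input_values, out_grad)` returns
    one gradient per input."""

    def __init__(self, name, inputs=(), forward=None, grad_fn=None):
        self.name, self.inputs = name, list(inputs)
        self.forward, self.grad_fn = forward, grad_fn


def mul(a, b):
    return Node(f"({a.name}*{b.name})", [a, b],
                lambda x, y: x * y,
                lambda vals, g: [g * vals[1], g * vals[0]])


def add(a, b):
    return Node(f"({a.name}+{b.name})", [a, b],
                lambda x, y: x + y,
                lambda vals, g: [g, g])


def topological_sort(output):
    visited, order = set(), []

    def dfs(node):
        if node in visited:
            return
        visited.add(node)
        for parent in node.inputs:
            dfs(parent)
        order.append(node)

    dfs(output)
    return order


def gradients(output, wrt, input_values):
    order = topological_sort(output)
    # Forward pass: fill in the value of every node.
    values = dict(input_values)
    for node in order:
        if node not in values:
            values[node] = node.forward(*[values[p] for p in node.inputs])
    # Backward pass: node -> list of partial adjoints.
    partials = defaultdict(list)
    partials[output].append(1.0)
    grads = {}
    for node in reversed(order):
        grads[node] = sum(partials[node])  # sum the partial adjoints
        if node.grad_fn is None:           # input placeholder, nothing upstream
            continue
        in_vals = [values[p] for p in node.inputs]
        for parent, g in zip(node.inputs, node.grad_fn(in_vals, grads[node])):
            partials[parent].append(g)
    return [grads[n] for n in wrt]


x1, x2 = Node("x1"), Node("x2")
y = add(mul(x1, x2), x1)  # y = x1 * x2 + x1
dy_dx1, dy_dx2 = gradients(y, [x1, x2], {x1: 2.0, x2: 3.0})
```

Here `x1` receives two partial adjoints (one from the product, one from the sum), which is why the dictionary maps each node to a list rather than to a single value.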

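task5:

Implement the logistic regression function $Z=XW+b$: the inputs are X, W, and b, and the output is Z.
Broadcasting is required, because XW has shape (batch, number_class) while b has shape (number_class,); b must be broadcast to (batch, number_class) along axis 0.
Forward formula of the broadcast operator: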

C = Broadcast(A,B)

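The hard part is the gradient of the broadcast operator. Broadcasting copies the values along some axis, so the backward pass must restore the original shape by summing over the broadcast axis, accumulating the gradients.

$\frac{\partial L}{\partial A} = \mathrm{sum}\left(\frac{\partial L}{\partial C},\ \mathrm{axis}=\text{broadcast axis}\right)$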
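For the bias case in this task, the rule can be tried out directly in NumPy. The shapes and values below are invented for illustration, and plain `np.broadcast_to` stands in for the lab's own broadcast operator.

```python
import numpy as np

batch, num_classes = 4, 3
b = np.array([0.1, 0.2, 0.3])  # shape (num_classes,)

# Forward: every row of the result is a copy of b.
b_broadcast = np.broadcast_to(b, (batch, num_classes))

# Some upstream gradient dL/dC with the broadcast shape.
upstream = np.arange(batch * num_classes, dtype=float).reshape(batch, num_classes)

# Backward: b was copied along axis 0, so its gradient is the sum over axis 0.
grad_b = upstream.sum(axis=0)
```

Each entry of `b` contributes to `batch` entries of the output, so its gradient collects one term from each copy.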

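task6:

Implement the softmax_loss function:

$\ell_{\mathrm{softmax}}(z, y) = \log\sum_{i=1}^k \exp z_i - z_y$

Here $z_i$ is the i-th entry of the prediction and $z_y$ is the entry at the correct class. $z_y$ can be obtained by multiplying the prediction element-wise by the one-hot encoding of the target and summing over the classes.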
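As a reference point for what the operator graph should compute, here is the same loss written directly in NumPy. It is a sketch under stated assumptions: the function takes a batch of logits `Z` and one-hot targets, averages over the batch, and skips the usual max-subtraction trick, so it is not numerically stable for large logits.

```python
import numpy as np


def softmax_loss(Z, y_one_hot):
    """Mean over the batch of log(sum_i exp(z_i)) - z_y."""
    log_sum_exp = np.log(np.sum(np.exp(Z), axis=1))
    # The one-hot mask zeroes every entry except the correct class.
    z_y = np.sum(Z * y_one_hot, axis=1)
    return np.mean(log_sum_exp - z_y)


Z = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 0.0]])
y_one_hot = np.array([[0.0, 0.0, 1.0],
                      [1.0, 0.0, 0.0]])
loss = softmax_loss(Z, y_one_hot)
```

Only exp, log, sum, element-wise multiplication, and subtraction appear, which is why these are the operators the task asks for.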

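Implement the exp operator. This is easy: exp is its own derivative, so the backward pass just multiplies the incoming gradient by the forward output.
Implement the log operator. In code, log normally means the natural logarithm (base e), i.e. ln, and the derivative of ln x is $\frac{1}{x}$.
Implement the sum operator. The forward pass needs no explanation; the backward pass is: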

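    import numpy as np

    def sum_backward(grad_Y, axis, original_shape):
        # Re-insert the summed axis as a dimension of size 1
        expanded_grad = np.expand_dims(grad_Y, axis=axis)
        # Broadcast back to the shape of the original input
        return np.broadcast_to(expanded_grad, original_shape)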

This requires an expand_dims operator. I did not bother implementing its backward pass, since the lab never needs it.

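task7:

Implement SGD (stochastic gradient descent).

Standardize the input data to avoid numerical blow-up.
Implement a one-hot encoding function that converts labels to one-hot vectors.
The loss function already divides by batch_size, so the resulting gradients must not be divided again.
Tune the learning rate to a suitable value.
The handout says the final accuracy should be around 95%; I only reached 90% and do not know where the problem lies.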
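The pieces listed above fit together roughly as in the sketch below. It is not the lab's code: the gradient is written analytically instead of coming from the autodiff graph, the data is synthetic, and `one_hot`, `standardize`, and `sgd_epoch` are names chosen for this example.

```python
import numpy as np


def one_hot(labels, num_classes):
    # Row i of the identity matrix is the one-hot vector of class i.
    return np.eye(num_classes)[labels]


def standardize(X, eps=1e-8):
    # Zero mean and unit variance per feature.
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)


def sgd_epoch(X, Y, W, b, lr, batch_size):
    for start in range(0, X.shape[0], batch_size):
        xb, yb = X[start:start + batch_size], Y[start:start + batch_size]
        logits = xb @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradient of the batch-mean softmax loss w.r.t. the logits.
        # The division by the batch size happens here, exactly once.
        grad_logits = (probs - yb) / xb.shape[0]
        W -= lr * (xb.T @ grad_logits)
        b -= lr * grad_logits.sum(axis=0)  # sum over the broadcast axis
    return W, b


# Synthetic three-class data, made up for the example.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
centers = np.array([[4.0, 0.0], [-4.0, 0.0], [0.0, 4.0]])
X = standardize(centers[labels] + rng.standard_normal((300, 2)))
Y = one_hot(labels, 3)

W, b = np.zeros((2, 3)), np.zeros(3)
for _ in range(50):
    W, b = sgd_epoch(X, Y, W, b, lr=0.5, batch_size=50)
accuracy = np.mean(np.argmax(X @ W + b, axis=1) == labels)
```

If the autodiff version divides by the batch size both in the loss and again in the update step, the effective learning rate shrinks by a factor of batch_size, which is one thing worth checking when accuracy falls short.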

My own operator tests are attached below:

"""
We encourage you to create your own test cases, which helps
you confirm the correctness of your implementation.

If you are interested, you can write your own tests in this file
and share them with us by including this file in your submission.
Please make the tests "pytest compatible" by starting each test
function name with prefix "test_".

We appreciate it if you can share your tests, which can help
improve this course and the assignment. However, please note that
this part is voluntary -- you will not get more scores by sharing
test cases, and conversely, will not get fewer scores if you do
not share.
"""
import auto_diff as ad
from typing import Dict, List
import numpy as np

def check_evaluator_output(
    evaluator: ad.Evaluator,
    input_values: Dict[ad.Node, np.ndarray],
    expected_outputs: List[np.ndarray],
) -> None:
    output_values = evaluator.run(input_values)
    assert len(output_values) == len(expected_outputs)
    for output_val, expected_val in zip(output_values, expected_outputs):
        print(repr(output_val))
        print()
        print(expected_val)
        print()
        np.testing.assert_allclose(actual=output_val, desired=expected_val)

def test_log():
    x = ad.Variable("x")
    y = ad.log(x)
    y_val = np.array([[ 0.          ,0.69314718 ,-0.69314718], [ 1.09861229 ,-1.38629436 , 1.38629436]])
    y_grad = ad.Variable("y_grad")
    x_grad, = y.op.gradient(y, y_grad)
    evaluator = ad.Evaluator(eval_nodes=[y,x_grad])
    y_grad_val = np.array([[0.1, -0.2, 0.3], [-0.4, 0.5, -0.6]])

    check_evaluator_output(
        evaluator,
        input_values={
            x: np.array([[1.0, 2.0, 0.5], [3.0, 0.25, 4.0]]),
            y_grad: np.array([[0.1, -0.2, 0.3], [-0.4, 0.5, -0.6]]),
        },
        expected_outputs=[
            y_val,
            y_grad_val/np.array([[1.0, 2.0, 0.5], [3.0, 0.25, 4.0]]),
        ],
    )

def test_exp():
    x = ad.Variable("x")
    y = ad.exp(x)
    y_grad = ad.Variable("y_grad")
    x_grad, = y.op.gradient(y, y_grad)
    print(x_grad)
    evaluator = ad.Evaluator(eval_nodes=[y,x_grad])
    expect_val = np.exp(np.array([[0.0, 1.0, -1.0], [2.0, -2.0, 0.5]]))
    y_grad_val = np.array([[1.0, 0.5, -0.3], [0.2, -0.4, 0.1]])
    check_evaluator_output(
        evaluator,
        input_values={
            x: np.array([[0.0, 1.0, -1.0], [2.0, -2.0, 0.5]]),
            y_grad: y_grad_val,
        },
        expected_outputs=[
            expect_val,
            # d/dx exp(x) = exp(x), so the input gradient is exp(x) * y_grad
            expect_val * y_grad_val,
        ]
    )

def test_sum():
    x = ad.Variable("x")
    y = ad.sum_op(x,axis=1)
    y_grad = ad.Variable("y_grad")
    x_grad, = y.op.gradient(y, y_grad)
    evaluator = ad.Evaluator(eval_nodes=[y,x_grad])

    check_evaluator_output(
        evaluator,
        input_values={
            x: np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
            y_grad: np.array([10.0,15.0]),  # upstream gradient, shape (2,), same as the sum output
        },
        expected_outputs=[
            np.array([6.0,15.0]),
            np.array([[10.0, 10.0, 10.0], [15.0, 15.0, 15.0]]),
        ]
    )

def test_broadcast():
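    # Define the input variables
    x = ad.Variable("x")  # tensor to be broadcast
    target_shape = ad.Variable("target_shape")  # target shape (only its shape is used, not its values)
    y = ad.broadcast_to(x, target_shape,axis=0)  # broadcast operation
    y_grad = ad.Variable("y_grad")  # upstream gradient (has the broadcast shape)
    
    # Compute the gradients
    x_grad, target_shape_grad = y.op.gradient(y, y_grad)
    
    # Create the evaluator for the computation graph
    evaluator = ad.Evaluator(eval_nodes=[y, x_grad, target_shape_grad])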

    # Test case: broadcast (3,) to (3, 3)
    check_evaluator_output(
        evaluator,
        input_values={
            x: np.array([1.0, 2.0, 3.0]),  # shape (3,)
            target_shape: np.zeros((3, 3)),  # only provides the shape (3, 3)
            y_grad: np.array([[10.0, 20.0, 30.0],   # must have shape (3, 3), matching the forward output
                          [10.0, 20.0, 30.0],
                          [10.0, 20.0, 30.0],]),
        },
        expected_outputs=[
            np.array([[1., 2., 3.],
       [1., 2., 3.],
       [1., 2., 3.]]),
            np.array([30., 60., 90.]),  # gradient of x (summed over axis 0, the broadcast axis)
            np.zeros((3, 3)),  # gradient of target_shape should be all zeros
        ]
    )

    print("Broadcast gradient test passed!")