1. Why is computing the second derivative more expensive than computing the first derivative?

Computing second derivatives costs more than computing first derivatives for several reasons:

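1. More computation:

The second derivative is obtained by differentiating the first derivative, so the first-derivative computation has to be carried out and then differentiated again. The output also grows: for a function of $n$ variables the gradient has $n$ entries, while the full matrix of second derivatives (the Hessian) has $n^2$. For a complex function the first derivative is already costly, and the second derivative costs more still.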

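2. More memory:

Computing the second derivative requires keeping the first-derivative computation around (in an autograd framework, the graph of the backward pass) so that it can be differentiated again, which increases memory use. For large models, memory can become the bottleneck.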

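3. Numerical stability:

Second derivatives can be less numerically stable. Rounding errors introduced during the first differentiation carry into the second, and this loss of precision can make the result inaccurate.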

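4. Limited use cases:

Second derivatives are needed in comparatively few applications. In most cases the first derivative is sufficient, so the extra cost is often hard to justify.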

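**In short,** second derivatives cost more in computation and memory, can be less numerically stable, and are needed in fewer applications. Whether to compute them should therefore be decided case by case.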
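A minimal PyTorch sketch of the first two points, assuming the toy function $y = x^3$ (chosen here for illustration, not taken from the text). The second derivative needs a second backward pass, and `create_graph=True` is what keeps the graph of the first pass in memory so that it can be differentiated.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 3  # toy function, an assumption made for this sketch

# First backward pass. create_graph=True records the backward computation
# itself, which costs extra memory but makes dy_dx differentiable.
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)

# Second backward pass, this time through the graph of the first one.
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)

print(dy_dx.item())    # value of 3 * x**2 at x = 3
print(d2y_dx2.item())  # value of 6 * x at x = 3
```

Without `create_graph=True` the first call returns a plain tensor with no history, and the second call would fail because there is nothing left to differentiate.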

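Some concrete examples:

In machine learning, models are usually trained with first derivatives (gradient descent and its variants). Second derivatives are used only when curvature information is wanted to control how quickly training converges.
In optimization algorithms, parameters are usually updated with first derivatives. Second-order methods such as Newton's method are used only when a more accurate optimization result is worth the extra cost.
In numerical computation, functions are usually approximated with first derivatives (a linear approximation). Second derivatives are used only when a more accurate approximation is required.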
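To make the optimization example concrete, here is a small sketch that compares one gradient-descent step with one Newton step. The quadratic `f`, the starting point, and the learning rate `lr` are all hypothetical choices for illustration; Newton's method is used only as one representative second-order method.

```python
import torch

def f(x):
    # Hypothetical quadratic with its minimum at x = 3
    return (x - 3) ** 2

x = torch.tensor(0.0, requires_grad=True)

# First derivative f'(x); create_graph=True so it can be differentiated again
(g,) = torch.autograd.grad(f(x), x, create_graph=True)
# Second derivative f''(x)
(h,) = torch.autograd.grad(g, x)

lr = 0.1  # assumed learning rate for the first-order step
x_gd = (x - lr * g).item()      # one gradient-descent step
x_newton = (x - g / h).item()   # one Newton step

print(x_gd, x_newton)
```

On a quadratic the Newton step lands on the minimum directly, while the first-order step only moves part of the way. The price is the extra backward pass needed to obtain `h`.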

2. After running the function for backpropagation, immediately run it again and see what happens.

RuntimeError: Trying to backward through the graph a second time 
(or directly access saved tensors after they have already been freed). 
Saved intermediate values of the graph are freed when you call 
.backward() or autograd.grad(). Specify retain_graph=True if you need 
to backward through the graph a second time or if you need to access 
saved tensors after calling backward.
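The error appears because the first `backward()` call frees the intermediate tensors saved in the graph. A minimal sketch of the fix the message suggests, assuming the setup `y = 2 * torch.dot(x, x)` (an assumption, since the answer does not show the code that was run):

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = 2 * torch.dot(x, x)

y.backward(retain_graph=True)  # keep the saved tensors for another pass
first = x.grad.clone()         # gradient after one pass: 4 * x

y.backward()                   # second pass succeeds, then frees the graph
second = x.grad.clone()        # gradients accumulate: 8 * x

try:
    y.backward()               # third pass: the graph has been freed
    raised = False
except RuntimeError:
    raised = True

print(first, second, raised)
```

Note that gradients accumulate in `x.grad` across calls, so call `x.grad.zero_()` between passes if independent gradients are wanted.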

3. In the control flow example we compute the derivative of `d` with respect to `a`. What happens if the variable `a` is changed to a random vector or matrix?

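import torch
from IPython.display import display  # already available inside Jupyter

def f(a):
    b = a * 2
    # Keep doubling until the norm of b reaches 1000
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

# a = torch.randn(size=(), requires_grad=True)      # scalar
a = torch.randn(size=(3, 1), requires_grad=True)    # column vector
# a = torch.randn(size=(3, 4), requires_grad=True)  # matrix
display(a, a.shape)
d = f(a)
# d.backward()  # RuntimeError: grad can be implicitly created only for scalar outputs
d.sum().backward()  # works: reduce the non-scalar output to a scalar first

# f is piecewise linear, d = k * a, so every gradient entry equals k = d / a
display(a.grad, d / a, a.grad == d / a)
a.grad.zero_()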
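The same check for the matrix case can be sketched as follows. This version passes an explicit all-ones `gradient` to `backward`, which is equivalent to calling `d.sum().backward()`, and uses `torch.allclose` rather than `==` to tolerate rounding. The fixed seed is an assumption added for reproducibility.

```python
import torch

def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

torch.manual_seed(0)  # assumed seed, only to make the run repeatable
a = torch.randn(3, 4, requires_grad=True)
d = f(a)

# Equivalent to d.sum().backward(): weight every output element by 1
d.backward(gradient=torch.ones_like(d))

k = a.grad[0, 0]  # the single scale factor chosen by the control flow
print(torch.all(a.grad == k).item())
print(torch.allclose(a.grad, d.detach() / a.detach()))
```

Every entry of `a.grad` is the same constant, because the control flow picks one scale factor for the whole tensor.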

4. Redesign an example of finding the gradient of a control flow. Run it and analyze the result.

import torch

def g(x):
    # The branch taken depends on the value of the input
    if x.sum() < 0:
        return x ** 2
    elif x.sum() <= 1:
        return x ** 3
    else:
        return 1 / x

x = torch.randn(size=(), requires_grad=True)  # random scalar input
display(x)
d = g(x)
d.backward()
display(x.grad)
x.grad.zero_()

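Output from three separate runs (each pair is `x`, then `x.grad`):

tensor(1.7029, requires_grad=True)
tensor(-0.3449)

tensor(0.0469, requires_grad=True)
tensor(0.0066)

tensor(1.4272, requires_grad=True)
tensor(-0.4910)

In the first and third runs $x > 1$, so `g` returns $1/x$ and the gradient is $-1/x^2$. In the second run $0 \le x \le 1$, so `g` returns $x^3$ and the gradient is $3x^2$. Backpropagation follows only the branch that was actually taken in the forward pass.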
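Because random inputs rarely hit the first branch in a handful of runs, a deterministic check of all three branches can help. The sketch below reuses `g` and feeds it one hand-picked value per branch; the values `-2.0`, `0.5`, and `2.0` are chosen here for illustration.

```python
import torch

def g(x):
    if x.sum() < 0:
        return x ** 2
    elif x.sum() <= 1:
        return x ** 3
    else:
        return 1 / x

grads = {}
for value in (-2.0, 0.5, 2.0):  # one input per branch
    x = torch.tensor(value, requires_grad=True)
    g(x).backward()
    grads[value] = x.grad.item()

# Analytic derivatives per branch: 2x, 3x^2, -1/x^2
print(grads)
```

Each gradient should match the analytic derivative of the branch that was executed: $2x$, $3x^2$, and $-1/x^2$ respectively.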

5. Let $f(x)=\sin(x)$. Plot $f(x)$ and $\frac{df(x)}{dx}$, where the latter is computed without using $f'(x)=\cos(x)$.

import torch
from d2l import torch as d2l

x = torch.linspace(-5, 5, 100, requires_grad=True)
y = torch.sin(x)
# Each y[i] depends only on x[i], so the gradient of the sum is the elementwise derivative
y.sum().backward()

# Plot f(x) together with the derivative obtained by autograd
d2l.plot(x.detach(), [y.detach(), x.grad], 'x', 'f(x)', legend=['f(x)', 'df(x)/dx'])
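As a sanity check, the autograd result can be compared numerically against $\cos(x)$. The closed form is used here only for verification, not to produce the plotted curve; this check is an addition and not part of the original answer.

```python
import torch

x = torch.linspace(-5, 5, 100, requires_grad=True)
y = torch.sin(x)
y.sum().backward()  # x.grad now holds df/dx at every sample point

# Largest absolute deviation from the analytic derivative cos(x)
max_err = (x.grad - torch.cos(x.detach())).abs().max().item()
print(max_err)
```

A deviation at or near zero confirms that the second curve in the plot is the derivative of $\sin(x)$.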

Automatic Differentiation | Preliminaries | Dive into Deep Learning
