
0. Paper information

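I first read this paper a little over two years ago, and it was probably the first Domain Adaptation paper I ever read. I was a complete beginner at the time (now I am merely a bigger beginner) and knew very little about the field, so I only skimmed the overall framework. When I reproduced the code, I did not really understand many of the details, and I was not even aware that I should. This post therefore skips the paper's high-level insight, describes the framework only briefly, and fills in some details from two angles: the overall gradient descent objective and the Gradient Reversal Layer (GRL).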

1. DaNN overall framework

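As the architecture figure shows, the DaNN framework has two parts. The first part is the feature extractor (Feature Extractor) together with the label classifier (Classifier); the second part is the domain classifier (Domain Classifier). The feature extractor maps the input data into a feature space, and the classifier predicts the label in that feature space. The Domain Classifier decides whether an input comes from the Source Domain or the Target Domain. The goal of the whole framework is to make Source Domain and Target Domain data as similarly distributed as possible after the feature extractor, while keeping classification accurate on the Source Domain, which in turn gives some assurance of classification quality on the Target Domain.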
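The wiring of the three components can be sketched in PyTorch roughly as follows. This is only an illustration: the class name `DaNNSketch`, the layer sizes and the single-layer modules are assumptions made here, not the paper's architecture, and the gradient reversal layer discussed in section 3 is deliberately left out so that only the data flow is visible.

```python
import torch
from torch import nn


class DaNNSketch(nn.Module):
    """Hypothetical minimal wiring of G_f, G_y and G_d (all sizes are illustrative)."""

    def __init__(self, in_dim=20, feat_dim=16, num_classes=3):
        super().__init__()
        # G_f: feature extractor shared by both branches
        self.feature_extractor = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        # G_y: label classifier, trained on labelled Source Domain samples
        self.classifier = nn.Linear(feat_dim, num_classes)
        # G_d: domain classifier, Source Domain (0) vs. Target Domain (1)
        self.domain_classifier = nn.Linear(feat_dim, 2)

    def forward(self, x):
        features = self.feature_extractor(x)
        return self.classifier(features), self.domain_classifier(features)


model = DaNNSketch()
x = torch.randn(8, 20)  # a mixed batch of source and target samples
class_logits, domain_logits = model(x)
print(class_logits.shape, domain_logits.shape)
```

Both heads read the same features, which is what lets the domain classifier's gradient influence the feature extractor.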

2. Overall gradient descent objective

\begin{aligned} E\left(\theta_f, \theta_y, \theta_d\right) &= \sum_{\substack{i=1 \ldots N \\ d_i=0}} L_y\left(G_y\left(G_f\left(\mathbf{x}_i ; \theta_f\right) ; \theta_y\right), y_i\right)-\lambda \sum_{i=1 \ldots N} L_d\left(G_d\left(G_f\left(\mathbf{x}_i ; \theta_f\right) ; \theta_d\right), y_i\right) \\ &= \sum_{\substack{i=1 \ldots N \\ d_i=0}} L_y^i\left(\theta_f, \theta_y\right)-\lambda \sum_{i=1 \ldots N} L_d^i\left(\theta_f, \theta_d\right) \end{aligned}

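Here $L_y(\cdot,\cdot)$ is the label prediction loss and $L_d(\cdot,\cdot)$ is the domain classification loss, while $L^i_y$ and $L^i_d$ denote the corresponding losses evaluated on the $i$-th training example. The first sum runs only over samples with $d_i=0$, that is, the labelled Source Domain samples. Training then looks for a saddle point of $E$: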

\begin{gathered} \left(\hat{\theta}_f, \hat{\theta}_y\right)=\arg \min _{\theta_f, \theta_y} E\left(\theta_f, \theta_y, \hat{\theta}_d\right) \\ \hat{\theta}_d=\arg \max _{\theta_d} E\left(\hat{\theta}_f, \hat{\theta}_y, \theta_d\right) . \end{gathered}

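In other words, $\theta_f,\theta_y$ are optimized to make $E\left(\theta_f, \theta_y, \theta_d\right)$ as small as possible, while $\theta_d$ is optimized to make $E\left(\theta_f, \theta_y, \theta_d\right)$ as large as possible. Because $L_d$ enters $E$ with a minus sign, maximizing $E$ over $\theta_d$ is the same as minimizing the domain classification loss. In this way the Source Domain and Target Domain data become as similarly distributed as possible after the feature extractor, while classification on the Source Domain stays accurate. Written as stochastic gradient descent (SGD) style iterations, this gives: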

\begin{aligned} & \theta_f \quad \longleftarrow \quad \theta_f-\mu\left(\frac{\partial L_y^i}{\partial \theta_f}-\lambda \frac{\partial L_d^i}{\partial \theta_f}\right) \\ & \theta_y \quad \longleftarrow \quad \theta_y-\mu \frac{\partial L_y^i}{\partial \theta_y} \\ & \theta_d \quad \longleftarrow \quad \theta_d-\mu \frac{\partial L_d^i}{\partial \theta_d} \end{aligned}

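However, the three updates above (Eqs. (4)-(6) in the paper) are awkward to implement directly as SGD: because of the $-\lambda$ factor in the $\theta_f$ update, they are not plain gradient descent on a single loss. This is where the Gradient Reversal Layer (GRL) comes in.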
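To make the three update rules concrete, here is a toy version with hand-derived gradients. Everything about the model is an assumption chosen for illustration: scalar parameters, $G_f(x)=\theta_f x$, a linear label predictor with squared error, and a sigmoid domain classifier with binary cross-entropy. The function name `sgd_step` and the default values of `mu` and `lam` are likewise made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(theta_f, theta_y, theta_d, x, y, d, mu=0.1, lam=0.5):
    """One hand-written update for a scalar toy model (all modelling choices hypothetical).

    G_f(x) = theta_f * x, G_y(f) = theta_y * f, G_d(f) = sigmoid(theta_d * f),
    L_y = 0.5 * (y_hat - y)^2, L_d = binary cross-entropy against the domain label d.
    """
    f = theta_f * x
    y_hat = theta_y * f
    p = sigmoid(theta_d * f)

    # The label loss only exists for source samples (d == 0).
    err_y = (y_hat - y) if d == 0 else 0.0
    dLy_dtheta_y = err_y * f
    dLy_dtheta_f = err_y * theta_y * x

    # For BCE on a sigmoid output, dL_d/dz = p - d with z = theta_d * f.
    dLd_dtheta_d = (p - d) * f
    dLd_dtheta_f = (p - d) * theta_d * x

    # The three updates: note the minus sign in front of lam for theta_f.
    new_theta_f = theta_f - mu * (dLy_dtheta_f - lam * dLd_dtheta_f)
    new_theta_y = theta_y - mu * dLy_dtheta_y
    new_theta_d = theta_d - mu * dLd_dtheta_d
    return new_theta_f, new_theta_y, new_theta_d


theta = (0.5, -0.3, 0.8)
theta = sgd_step(*theta, x=1.0, y=1.0, d=0)   # labelled source sample
theta = sgd_step(*theta, x=2.0, y=None, d=1)  # unlabelled target sample
print(theta)
```

Even in this tiny case the gradients have to be derived and combined by hand, which is exactly the inconvenience that an automatic-differentiation-friendly formulation removes.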

3. Gradient Reversal Layer

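This can be achieved by introducing a special gradient reversal layer (GRL), defined as follows. The GRL has no trainable parameters of its own; its only parameter is $\lambda$, which is not updated by backpropagation. During forward propagation the GRL acts as an identity transform. During backpropagation, however, it takes the gradient from the subsequent layer, multiplies it by $-\lambda$, and passes it to the preceding layer.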

Formally, it can be defined as:

\begin{array}{r} R_\lambda(\mathbf{x})=\mathbf{x} \\ \frac{d R_\lambda}{d \mathbf{x}}=-\lambda \mathbf{I} \end{array}
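In PyTorch, a layer with this behaviour is commonly written as a custom `torch.autograd.Function`. The sketch below is one minimal way to do it and is not taken from the paper's code; the class name `GradientReversal` and the value `0.5` for $\lambda$ are assumptions for illustration.

```python
import torch
from torch.autograd import Function


class GradientReversal(Function):
    """R_lambda: identity in the forward pass, gradient times -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd       # stored for the backward pass; not a trainable parameter
        return x.view_as(x)     # identity transform

    @staticmethod
    def backward(ctx, grad_output):
        # One return value per forward input; lambd is a plain float, so it gets None.
        return -ctx.lambd * grad_output, None


x = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
out = GradientReversal.apply(x, 0.5)
out.sum().backward()
print(out)     # same values as x
print(x.grad)  # every entry is -0.5 instead of the usual 1.0
```

In a full model, this function would sit between the feature extractor's output and the domain classifier's input, leaving the label classifier branch untouched.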

where $\mathbf{I}$ is the identity matrix. We can then define the objective "pseudo-function" of $(\theta_f,\theta_y,\theta_d)$ that stochastic gradient descent optimizes in this method:

\begin{gathered} \tilde{E}\left(\theta_f, \theta_y, \theta_d\right)=\sum_{\substack{i=1 \ldots N \\ d_i=0}} L_y\left(G_y\left(G_f\left(\mathbf{x}_i ; \theta_f\right) ; \theta_y\right), y_i\right)+ \\ \sum_{i=1 \ldots N} L_d\left(G_d\left(R_\lambda\left(G_f\left(\mathbf{x}_i ; \theta_f\right)\right) ; \theta_d\right), y_i\right) \end{gathered}

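To make this easier to follow, we can walk through one backward pass with the architecture figure in mind. We first compute $\tilde{E}\left(\theta_f, \theta_y, \theta_d\right)$. The domain classifier needs the descent direction $-\frac{\partial \tilde{E}\left(\theta_f, \theta_y, \theta_d\right)}{\partial \theta_d}$. The gradient reaches $\theta_d$ before it passes through the GRL, so the actual update step is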

-\mu\frac{\partial \tilde{E}\left(\theta_f, \theta_y, \theta_d\right)}{\partial \theta_d}=-\mu\frac{\partial L_d^i\left(G_d\left(G_f\left(\mathbf{x}_i ; \theta_f\right) ; \theta_d\right), y_i\right)}{\partial \theta_d}=-\mu \frac{\partial L_d^i}{\partial \theta_d}

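Once the gradient has propagated back through $R_\lambda(\cdot)$, it is multiplied by $-\lambda$. The label classifier parameters $\theta_y$ do not lie on the domain branch, so the domain term contributes nothing to their step; for $\theta_f$, the domain gradient arrives with its sign flipped. The update steps for $\theta_y,\theta_f$ therefore become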

\begin{aligned} -\mu\frac{\partial \tilde{E}\left(\theta_f, \theta_y, \theta_d\right)}{\partial \theta_y}=&-\mu\frac{\partial L_y\left(G_y\left(G_f\left(\mathbf{x}_i ; \theta_f\right) ; \theta_y\right), y_i\right)}{\partial \theta_y}\\ &+\mu\lambda\frac{\partial L_d\left(G_d\left(G_f\left(\mathbf{x}_i ; \theta_f\right) ; \theta_d\right), y_i\right)}{\partial \theta_y}\\ =&-\mu \frac{\partial L_y^i}{\partial \theta_y}\\ -\mu\frac{\partial \tilde{E}\left(\theta_f, \theta_y, \theta_d\right)}{\partial \theta_f}=&-\mu\frac{\partial L_y\left(G_y\left(G_f\left(\mathbf{x}_i ; \theta_f\right) ; \theta_y\right), y_i\right)}{\partial \theta_f}\\ &+\mu\lambda\frac{\partial L_d\left(G_d\left(G_f\left(\mathbf{x}_i ; \theta_f\right) ; \theta_d\right), y_i\right)}{\partial \theta_f}\\ =&-\mu\left(\frac{\partial L_y^i}{\partial \theta_f}-\lambda \frac{\partial L_d^i}{\partial \theta_f}\right) \end{aligned}
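The derivation can also be checked numerically with automatic differentiation. The sketch below is an illustration under several assumptions: tiny linear maps `W_f`, `W_y`, `W_d` stand in for $G_f$, $G_y$, $G_d$, the data and labels are made up, and the reversal is written in a stop-gradient form using `detach` rather than as a custom layer. It compares the gradients of $\tilde{E}$ with the gradient combinations that appear in the update rules of section 2.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
lam = 0.7


def grl(x, lam):
    # Stop-gradient form of R_lambda: the value equals x, the gradient equals -lam.
    return -lam * x + ((1.0 + lam) * x).detach()


# Hypothetical tiny linear modules standing in for G_f, G_y and G_d.
W_f = torch.randn(4, 3, requires_grad=True)
W_y = torch.randn(2, 4, requires_grad=True)
W_d = torch.randn(2, 4, requires_grad=True)

x = torch.randn(5, 3)
y = torch.tensor([0, 1, 1, 0, 1])  # class labels (only used where d == 0)
d = torch.tensor([0, 0, 0, 1, 1])  # domain labels: 0 = source, 1 = target
src = d == 0


def losses(reverse):
    feats = x @ W_f.T
    L_y = F.cross_entropy((feats @ W_y.T)[src], y[src])  # source samples only
    d_in = grl(feats, lam) if reverse else feats
    L_d = F.cross_entropy(d_in @ W_d.T, d)               # all samples
    return L_y, L_d


# Gradients of the pseudo-objective L_y + L_d with the reversal in place.
L_y, L_d = losses(reverse=True)
g_f, g_y, g_d = torch.autograd.grad(L_y + L_d, (W_f, W_y, W_d))

# Reference: the individual gradients, combined by hand as in the update rules.
L_y, L_d = losses(reverse=False)
dLy_f, dLy_y = torch.autograd.grad(L_y, (W_f, W_y), retain_graph=True)
dLd_f, dLd_d = torch.autograd.grad(L_d, (W_f, W_d))

print(torch.allclose(g_f, dLy_f - lam * dLd_f, atol=1e-5))
print(torch.allclose(g_y, dLy_y, atol=1e-5))
print(torch.allclose(g_d, dLd_d, atol=1e-5))
```

If the three comparisons hold, an ordinary optimizer step on the pseudo-objective reproduces all three updates at once, with no hand-assembled gradients.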

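I have to say, this is a really neat design: it cleverly turns the adversarial objective into a single, consistent iterative objective.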
