ControlNet: A Supplementary Note on the Condition-Compression Network | Digging into the Source Code


Where the Question Came From

[figure: our redrawn ControlNet structure diagram]

Here is how it started: in our paper-reading group we had redrawn the figure from the paper as shown above, so that readers could understand the model structure more easily. That was supposed to be the end of it.

But a senior labmate reading the paper with us raised the following question:

[screenshots of the labmate's question]

Here is what the ControlNet paper itself says:

Stable Diffusion uses a pre-processing method similar to VQ-GAN to convert the entire dataset of $512 \times 512$ images into smaller $64 \times 64$ "latent images" for stabilized training.

This requires ControlNets to convert image-based conditions to $64 \times 64$ feature space to match the convolution size.

We use a tiny network $\mathcal{E}(\cdot)$ of four convolution layers with $4 \times 4$ kernels and $2 \times 2$ strides (activated by ReLU, channels are 16, 32, 64, 128, initialized with Gaussian weights, trained jointly with the full model) to encode image-space conditions $c_{\mathrm{i}}$ into feature maps with

$$c_{\mathrm{f}}=\mathcal{E}\left(c_{\mathrm{i}}\right)$$

Roughly, this says:

To stay consistent with Stable Diffusion, the control condition is likewise compressed into the latent space: a small encoder network downsamples it four times, taking it from $512 \times 512$ down to $64 \times 64$.

But if you actually run four such downsampling convolutions, you end up at $32 \times 32$, not $64 \times 64$. The paper's description is wrong.
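To make the contradiction concrete, here is a minimal sketch of the encoder exactly as the paper describes it: four $4 \times 4$, stride-2 convolutions with channels 16, 32, 64, 128. The `padding=1` is our assumption (the paper does not state it), and the ReLU activations are omitted since they do not change spatial size:

```python
import torch
import torch.nn as nn

# The encoder as described in the paper: four 4x4 convs with 2x2 strides.
# padding=1 is an assumption; ReLUs omitted (they don't affect the shape).
paper_encoder = nn.Sequential(
    nn.Conv2d(3, 16, 4, stride=2, padding=1),    # 512 -> 256
    nn.Conv2d(16, 32, 4, stride=2, padding=1),   # 256 -> 128
    nn.Conv2d(32, 64, 4, stride=2, padding=1),   # 128 -> 64
    nn.Conv2d(64, 128, 4, stride=2, padding=1),  # 64  -> 32
)

print(paper_encoder(torch.randn(1, 3, 512, 512)).shape)
# torch.Size([1, 128, 32, 32]) -- not 64x64
```

Halving four times takes 512 down to 32; reaching 64 only takes three halvings.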

With no other option, we went and read the source code.

ControlNet/cldm.py at main · lllyasviel/ControlNet (github.com)

The Code

The relevant part is implemented as follows:

```python
self.input_hint_block = TimestepEmbedSequential(
    conv_nd(dims, hint_channels, 16, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 16, 16, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 16, 32, 3, padding=1, stride=2),   # downsamples: 512 -> 256
    nn.SiLU(),
    conv_nd(dims, 32, 32, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 32, 96, 3, padding=1, stride=2),   # downsamples: 256 -> 128
    nn.SiLU(),
    conv_nd(dims, 96, 96, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 96, 256, 3, padding=1, stride=2),  # downsamples: 128 -> 64
    nn.SiLU(),
    zero_module(conv_nd(dims, 256, model_channels, 3, padding=1))
)
```
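To check the output size, here is a minimal standalone sketch of the same stack built from plain `nn.Conv2d` (which is what `conv_nd` returns for `dims=2`, see below). The values `hint_channels=3` and `model_channels=320` are illustrative assumptions, and the zero initialization of the last layer is skipped since it does not affect shapes:

```python
import torch
import torch.nn as nn

# Illustrative assumptions: a 3-channel hint image and model_channels=320.
hint_channels, model_channels = 3, 320

hint_block = nn.Sequential(
    nn.Conv2d(hint_channels, 16, 3, padding=1), nn.SiLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.SiLU(),
    nn.Conv2d(16, 32, 3, padding=1, stride=2), nn.SiLU(),   # 512 -> 256
    nn.Conv2d(32, 32, 3, padding=1), nn.SiLU(),
    nn.Conv2d(32, 96, 3, padding=1, stride=2), nn.SiLU(),   # 256 -> 128
    nn.Conv2d(96, 96, 3, padding=1), nn.SiLU(),
    nn.Conv2d(96, 256, 3, padding=1, stride=2), nn.SiLU(),  # 128 -> 64
    nn.Conv2d(256, model_channels, 3, padding=1),           # zero_module(...) in the original
)

x = torch.randn(1, hint_channels, 512, 512)  # dummy 512x512 condition image
print(hint_block(x).shape)  # torch.Size([1, 320, 64, 64])
```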

`conv_nd` itself is simple: the author just bundles 1D, 2D, and 3D convolutions into one helper.

Here is the code:

```python
def conv_nd(dims, *args, **kwargs):
    """
    Create a 1D, 2D, or 3D convolution module.
    """
    if dims == 1:
        return nn.Conv1d(*args, **kwargs)
    elif dims == 2:
        return nn.Conv2d(*args, **kwargs)
    elif dims == 3:
        return nn.Conv3d(*args, **kwargs)
    raise ValueError(f"unsupported dimensions: {dims}")
```
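For images, ControlNet calls it with `dims=2`, so every `conv_nd` in the block above is just an `nn.Conv2d`. A quick sketch, assuming the definition above is in scope:

```python
import torch.nn as nn  # used by conv_nd

layer = conv_nd(2, 16, 32, 3, padding=1, stride=2)
print(type(layer))  # <class 'torch.nn.modules.conv.Conv2d'>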

According to the output-size formula (PyTorch floors the division): $W_t = \lfloor (W_{t-1} - \text{kernel} + 2 \cdot \text{padding}) / \text{stride} \rfloor + 1$

  • `conv_nd(dims, x, y, 3, padding=1)` : $W_t = \lfloor (W_{t-1} - 3 + 2) / 1 \rfloor + 1 = W_{t-1}$

    This convolution only changes the number of channels; the spatial size of the tensor is untouched.

  • `conv_nd(dims, x, y, 3, padding=1, stride=2)` : $W_t = \lfloor (W_{t-1} - 3 + 2) / 2 \rfloor + 1 = W_{t-1} / 2$ (for even $W_{t-1}$)

    This is the convolution that shrinks the intermediate tensor, halving its size.

    So the layers that actually do the compressing are the `conv_nd(dims, x, y, 3, padding=1, stride=2)` ones, and there are three of them (verified in the sketch after this list):

    • $512 \times 512 \to 256 \times 256$

    • $256 \times 256 \to 128 \times 128$

    • $128 \times 128 \to 64 \times 64$
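As a sanity check, here is a minimal sketch that pushes 512 through the formula for the three stride-2 layers (the helper name `conv_out_size` is ours):

```python
# W_t = floor((W_{t-1} - kernel + 2*padding) / stride) + 1
def conv_out_size(w, kernel=3, padding=1, stride=1):
    return (w - kernel + 2 * padding) // stride + 1

w = 512
for _ in range(3):               # the three stride-2 convolutions
    w = conv_out_size(w, stride=2)
    print(w)                     # 256, then 128, then 64

print(conv_out_size(w))          # 64 -- the stride-1 convolutions change nothing
```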

So, going by the code, the network we added to our figure should be drawn like this:

[figure: the corrected condition-encoder diagram]

And in the paper's own phrasing, the passage should read:

We use a tiny network $\mathcal{E}(\cdot)$ of eight convolution layers with $3 \times 3$ kernels, three of them with $2 \times 2$ strides (activated by SiLU, downsampling to channels 32, 96, 256, initialized with Gaussian weights, trained jointly with the full model) to encode image-space conditions $c_{\mathrm{i}}$ into feature maps with

$$c_{\mathrm{f}}=\mathcal{E}\left(c_{\mathrm{i}}\right)$$

The final figure looks like this:

[figure: the final model-structure diagram]

