ControlNet: A Supplementary Note on the Condition-Compression Network | Digging into the Source Code


Where the Question Came From

[figure: our redrawn ControlNet structure diagram]

Here is how it started: in our paper-reading group we had redrawn the figure from the paper as shown above, so that readers could understand the model structure more easily. That was supposed to be the end of it.

But a senior labmate reading the paper with us raised the following question:

[screenshots of the labmate's question]

Here is what the ControlNet paper itself says:

Stable Diffusion uses a pre-processing method similar to VQ-GAN to convert the entire dataset of $512 \times 512$ images into smaller $64 \times 64$ "latent images" for stabilized training.

This requires ControlNets to convert image-based conditions to $64 \times 64$ feature space to match the convolution size.

We use a tiny network $\mathcal{E}(\cdot)$ of four convolution layers with $4 \times 4$ kernels and $2 \times 2$ strides (activated by ReLU, channels are 16, 32, 64, 128, initialized with Gaussian weights, trained jointly with the full model) to encode image-space conditions $c_{\mathrm{i}}$ into feature maps with

$$c_{\mathrm{f}}=\mathcal{E}\left(c_{\mathrm{i}}\right)$$

Roughly, this says:

To stay consistent with Stable Diffusion, the control condition is likewise compressed into the latent space: a small encoder network downsamples it four times, taking it from $512 \times 512$ down to $64 \times 64$.

But if you actually run four such downsampling convolutions, you end up at $32 \times 32$, not $64 \times 64$. The paper's description is wrong.
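To make the contradiction concrete, here is a minimal sketch of the encoder exactly as the paper describes it: four $4 \times 4$, stride-2 convolutions with channels 16, 32, 64, 128. The `padding=1` is our assumption (the paper does not state it), and the ReLU activations are omitted since they do not change spatial size:

```python
import torch
import torch.nn as nn

# The encoder as described in the paper: four 4x4 convs with 2x2 strides.
# padding=1 is an assumption; ReLUs omitted (they don't affect the shape).
paper_encoder = nn.Sequential(
    nn.Conv2d(3, 16, 4, stride=2, padding=1),    # 512 -> 256
    nn.Conv2d(16, 32, 4, stride=2, padding=1),   # 256 -> 128
    nn.Conv2d(32, 64, 4, stride=2, padding=1),   # 128 -> 64
    nn.Conv2d(64, 128, 4, stride=2, padding=1),  # 64  -> 32
)

print(paper_encoder(torch.randn(1, 3, 512, 512)).shape)
# torch.Size([1, 128, 32, 32]) -- not 64x64
```

Halving four times takes 512 down to 32; reaching 64 only takes three halvings.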

With no other option, we went and read the source code.

ControlNet/cldm.py at main · lllyasviel/ControlNet (github.com)

The Code

The relevant part is implemented as follows:

```python
self.input_hint_block = TimestepEmbedSequential(
    conv_nd(dims, hint_channels, 16, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 16, 16, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 16, 32, 3, padding=1, stride=2),   # downsamples: 512 -> 256
    nn.SiLU(),
    conv_nd(dims, 32, 32, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 32, 96, 3, padding=1, stride=2),   # downsamples: 256 -> 128
    nn.SiLU(),
    conv_nd(dims, 96, 96, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 96, 256, 3, padding=1, stride=2),  # downsamples: 128 -> 64
    nn.SiLU(),
    zero_module(conv_nd(dims, 256, model_channels, 3, padding=1))
)
```
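To check the output size, here is a minimal standalone sketch of the same stack built from plain `nn.Conv2d` (which is what `conv_nd` returns for `dims=2`, see below). The values `hint_channels=3` and `model_channels=320` are illustrative assumptions, and the zero initialization of the last layer is skipped since it does not affect shapes:

```python
import torch
import torch.nn as nn

# Illustrative assumptions: a 3-channel hint image and model_channels=320.
hint_channels, model_channels = 3, 320

hint_block = nn.Sequential(
    nn.Conv2d(hint_channels, 16, 3, padding=1), nn.SiLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.SiLU(),
    nn.Conv2d(16, 32, 3, padding=1, stride=2), nn.SiLU(),   # 512 -> 256
    nn.Conv2d(32, 32, 3, padding=1), nn.SiLU(),
    nn.Conv2d(32, 96, 3, padding=1, stride=2), nn.SiLU(),   # 256 -> 128
    nn.Conv2d(96, 96, 3, padding=1), nn.SiLU(),
    nn.Conv2d(96, 256, 3, padding=1, stride=2), nn.SiLU(),  # 128 -> 64
    nn.Conv2d(256, model_channels, 3, padding=1),           # zero_module(...) in the original
)

x = torch.randn(1, hint_channels, 512, 512)  # dummy 512x512 condition image
print(hint_block(x).shape)  # torch.Size([1, 320, 64, 64])
```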

`conv_nd` itself is simple: the author just bundles 1D, 2D, and 3D convolutions into one helper.

Here is the code:

```python
def conv_nd(dims, *args, **kwargs):
    """
    Create a 1D, 2D, or 3D convolution module.
    """
    if dims == 1:
        return nn.Conv1d(*args, **kwargs)
    elif dims == 2:
        return nn.Conv2d(*args, **kwargs)
    elif dims == 3:
        return nn.Conv3d(*args, **kwargs)
    raise ValueError(f"unsupported dimensions: {dims}")
```
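For images, ControlNet calls it with `dims=2`, so every `conv_nd` in the block above is just an `nn.Conv2d`. A quick sketch, assuming the definition above is in scope:

```python
import torch.nn as nn  # used by conv_nd

layer = conv_nd(2, 16, 32, 3, padding=1, stride=2)
print(type(layer))  # <class 'torch.nn.modules.conv.Conv2d'>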

According to the output-size formula (PyTorch floors the division): $W_t = \lfloor (W_{t-1} - \text{kernel} + 2 \cdot \text{padding}) / \text{stride} \rfloor + 1$

  • `conv_nd(dims, x, y, 3, padding=1)` : $W_t = \lfloor (W_{t-1} - 3 + 2) / 1 \rfloor + 1 = W_{t-1}$

    This convolution only changes the number of channels; the spatial size of the tensor is untouched.

  • `conv_nd(dims, x, y, 3, padding=1, stride=2)` : $W_t = \lfloor (W_{t-1} - 3 + 2) / 2 \rfloor + 1 = W_{t-1} / 2$ (for even $W_{t-1}$)

    This is the convolution that shrinks the intermediate tensor, halving its size.

    So the layers that actually do the compressing are the `conv_nd(dims, x, y, 3, padding=1, stride=2)` ones, and there are three of them (verified in the sketch after this list):

    • $512 \times 512 \to 256 \times 256$

    • $256 \times 256 \to 128 \times 128$

    • $128 \times 128 \to 64 \times 64$
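As a sanity check, here is a minimal sketch that pushes 512 through the formula for the three stride-2 layers (the helper name `conv_out_size` is ours):

```python
# W_t = floor((W_{t-1} - kernel + 2*padding) / stride) + 1
def conv_out_size(w, kernel=3, padding=1, stride=1):
    return (w - kernel + 2 * padding) // stride + 1

w = 512
for _ in range(3):               # the three stride-2 convolutions
    w = conv_out_size(w, stride=2)
    print(w)                     # 256, then 128, then 64

print(conv_out_size(w))          # 64 -- the stride-1 convolutions change nothing
```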

So, going by the code, the network we added to our figure should be drawn like this:

[figure: the corrected condition-encoder diagram]

And in the paper's own phrasing, the passage should read:

We use a tiny network $\mathcal{E}(\cdot)$ of eight convolution layers with $3 \times 3$ kernels, three of them with $2 \times 2$ strides (activated by SiLU, downsampling to channels 32, 96, 256, initialized with Gaussian weights, trained jointly with the full model) to encode image-space conditions $c_{\mathrm{i}}$ into feature maps with

$$c_{\mathrm{f}}=\mathcal{E}\left(c_{\mathrm{i}}\right)$$

The final figure looks like this:

[figure: the final model-structure diagram]

