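Where the question came from
Here is what happened: originally, when presenting the paper, we only planned to redraw its figure as shown above so that readers could better understand the model structure.
However, a senior labmate who was reading the paper with me raised the following question:
Below is what the ControlNet paper says: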
Stable Diffusion uses a pre-processing method similar to VQ-GAN to convert the entire dataset of 512 × 512 images into smaller 64 × 64 "latent images" for stabilized training.
This requires ControlNets to convert image-based conditions to 64 × 64 feature space to match the convolution size.
We use a tiny network E(·) of four convolution layers with 4 × 4 kernels and 2 × 2 strides (activated by ReLU, channels are 16, 32, 64, 128, initialized with Gaussian weights, trained jointly with the full model) to encode image-space conditions c_i into feature maps with c_f = E(c_i).
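Roughly, this says:
To stay consistent with Stable Diffusion, the control condition is also compressed into the latent space: a small encoder network is added that downsamples four times, taking the condition from 512 × 512 to 64 × 64.
In practice, however, four stride-2 convolutions would compress 512 × 512 down to 32 × 32, not 64 × 64, so the description in the paper is wrong.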
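The mismatch is easy to check by hand. The sketch below assumes the usual Stable Diffusion sizes (a 512 × 512 condition image and a 64 × 64 latent) and assumes that each stride-2 layer exactly halves the spatial size; the helper name `downsample` is made up for illustration.

```python
def downsample(size: int, num_layers: int, stride: int = 2) -> int:
    """Spatial size after `num_layers` layers that each divide it by `stride`."""
    for _ in range(num_layers):
        size //= stride
    return size


four_layers = downsample(512, 4)   # what the paper's wording implies
three_layers = downsample(512, 3)  # what is needed to land on 64
print(four_layers, three_layers)   # prints: 32 64
```

Under these assumptions, reaching 64 from 512 takes three halvings, not four.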
With no other option, I had to go read the paper's source code.
ControlNet/cldm.py at main · lllyasviel/ControlNet (github.com)
Code
This part is implemented as follows:
self.input_hint_block = TimestepEmbedSequential(
    conv_nd(dims, hint_channels, 16, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 16, 16, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 16, 32, 3, padding=1, stride=2),   # stride 2: halves the spatial size
    nn.SiLU(),
    conv_nd(dims, 32, 32, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 32, 96, 3, padding=1, stride=2),   # stride 2: halves the spatial size
    nn.SiLU(),
    conv_nd(dims, 96, 96, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 96, 256, 3, padding=1, stride=2),  # stride 2: halves the spatial size
    nn.SiLU(),
    zero_module(conv_nd(dims, 256, model_channels, 3, padding=1))
)
The implementation of conv_nd is fairly simple: the author wraps 1D, 2D, and 3D convolutions behind a single function.
The code is as follows:
def conv_nd(dims, *args, **kwargs):
    """
    Create a 1D, 2D, or 3D convolution module.
    """
    if dims == 1:
        return nn.Conv1d(*args, **kwargs)
    elif dims == 2:
        return nn.Conv2d(*args, **kwargs)
    elif dims == 3:
        return nn.Conv3d(*args, **kwargs)
    raise ValueError(f"unsupported dimensions: {dims}")
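According to the convolution output-size formula, out = floor((in + 2 × padding - kernel) / stride) + 1:
- conv_nd(dims, x, y, 3, padding=1): this convolution only changes the number of channels; it has no effect on the spatial size of the tensor.
- conv_nd(dims, x, y, 3, padding=1, stride=2): this is the convolution that changes the size of the intermediate tensor, halving it.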
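The two cases above can be verified numerically. This is a minimal sketch of the standard output-size formula for a convolution with dilation 1; the function name `conv_out_size` and the 512 input size are assumptions chosen for illustration.

```python
def conv_out_size(size: int, kernel: int = 3, stride: int = 1, padding: int = 0) -> int:
    """Output length along one spatial axis of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


same = conv_out_size(512, kernel=3, stride=1, padding=1)    # 3x3, padding 1
halved = conv_out_size(512, kernel=3, stride=2, padding=1)  # 3x3, padding 1, stride 2
print(same, halved)  # prints: 512 256
```

With a 3 × 3 kernel and padding 1, the default stride of 1 preserves the size, while stride 2 halves an even-sized input.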
So the layers that actually perform the compression are the conv_nd(dims, x, y, 3, padding=1, stride=2) ones, and there are three of them: 512 → 256 → 128 → 64.
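To see the whole block at once, the sketch below walks a 512 × 512 input through the eight convolutions of input_hint_block and records the channel count and spatial size after each one. The values `HINT_CHANNELS = 3` and `MODEL_CHANNELS = 320` are assumptions for illustration (they come from the model config, not from the snippet above), and `trace` is a made-up helper.

```python
# Assumed values, not taken from the code excerpt itself.
HINT_CHANNELS = 3
MODEL_CHANNELS = 320

# (in_channels, out_channels, stride) for each conv in input_hint_block.
LAYERS = [
    (HINT_CHANNELS, 16, 1),
    (16, 16, 1),
    (16, 32, 2),
    (32, 32, 1),
    (32, 96, 2),
    (96, 96, 1),
    (96, 256, 2),
    (256, MODEL_CHANNELS, 1),
]


def trace(size, layers, kernel=3, padding=1):
    """Return the (channels, size) pair after each convolution."""
    shapes = []
    for _in_ch, out_ch, stride in layers:
        size = (size + 2 * padding - kernel) // stride + 1
        shapes.append((out_ch, size))
    return shapes


shapes = trace(512, LAYERS)
for out_ch, size in shapes:
    print(f"{out_ch:>3} channels, {size} x {size}")
```

Only the three stride-2 layers change the spatial size, so under these assumptions the block ends at 64 × 64 with `MODEL_CHANNELS` channels.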
So, going by the code, the network we added to the figure should be changed to this.
In the paper's wording, the passage should read:
We use a tiny network E(·) with three stride-2 convolution layers with 3 × 3 kernels and 2 × 2 strides (activated by SiLU, channels are 32, 96, 256, initialized with Gaussian weights, trained jointly with the full model) to encode image-space conditions c_i into feature maps with c_f = E(c_i).
The final figure looks like this: