Implementing ViT from Scratch (1): Patch Embedding


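Lately most of my work has been on perception for autonomous driving, and most of the models topping the computer vision leaderboards are now built on transformer ideas, so I wanted to understand them myself. I gathered some material and plan to learn by reimplementing ViT, one of the more basic models, as I go.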

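The main thing to look at is this figure, which lays things out fairly clearly. On the left is the ViT architecture, and the MLP head tells us this is a classification network. ViT is arguably the first work to bring the transformer into CV. There was earlier work in this direction, but representing an image as a sequence of tokens turned out to be simple and effective, and it showed that the transformer paradigm can unify NLP and CV. Much of the work that followed builds on the ViT idea.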

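I think it is still worth describing the figure briefly. The image is split into fixed-size patches, each patch is treated as a token, and the tokens are passed through a transformer that learns the relationships between them, that is, global information. Through self-attention, the added cls token gathers global information about the image, and an MLP applied to the cls token performs the classification.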
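To make the pipeline described above concrete, here is a minimal sketch of the data flow. It is not the implementation built in this series: `TinyViTSketch` is a hypothetical name, the sizes are deliberately tiny, position embeddings are left out, and PyTorch's built-in `nn.TransformerEncoder` stands in for the encoder that would otherwise be written by hand.

```python
import torch
from torch import nn


class TinyViTSketch(nn.Module):
    """Hypothetical toy model: patches -> tokens -> encoder -> cls -> linear head."""

    def __init__(self, patch_size=8, in_channels=3, emb_size=32, num_classes=10):
        super().__init__()
        # One token per non-overlapping patch
        self.to_tokens = nn.Conv2d(in_channels, emb_size,
                                   kernel_size=patch_size, stride=patch_size)
        # Extra learnable token that will collect global information
        self.cls_token = nn.Parameter(torch.randn(1, 1, emb_size))
        layer = nn.TransformerEncoderLayer(d_model=emb_size, nhead=2,
                                           dim_feedforward=64, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        # Classification head applied to the cls token only
        self.head = nn.Linear(emb_size, num_classes)

    def forward(self, x):
        tokens = self.to_tokens(x).flatten(2).transpose(1, 2)   # (b, n, e)
        cls = self.cls_token.expand(tokens.shape[0], -1, -1)    # (b, 1, e)
        tokens = torch.cat([cls, tokens], dim=1)                # (b, n + 1, e)
        tokens = self.encoder(tokens)                           # self-attention mixes tokens
        return self.head(tokens[:, 0])                          # classify from cls


images = torch.randn(2, 3, 32, 32)   # stand-in batch of two 32 x 32 RGB images
logits = TinyViTSketch()(images)
print(logits.shape)
```

With 32 x 32 inputs and 8 x 8 patches there are 16 patch tokens plus one cls token, and the output has one row of class scores per image.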

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

from torch import nn
from torch import Tensor
from PIL import Image
from torchvision.transforms import Compose, Resize, ToTensor
from einops import rearrange, reduce, repeat
from einops.layers.torch import Rearrange, Reduce
from torchsummary import summary

I will cover how to use einops in a separate post. Personally I find that this library makes it much easier to see how a tensor is being transformed.

img = Image.open('/tesla.jpeg')

fig = plt.figure()
plt.imshow(img)

# resize the image to 224 x 224 and convert it to a tensor
transform = Compose([Resize((224, 224)), ToTensor()])
x = transform(img)
x = x.unsqueeze(0) # add batch dim
x.shape

One thing worth explaining: unsqueeze adds a dimension to the tensor, here the batch dimension.

torch.Size([1, 3, 224, 224])

After this preprocessing we have a 4-dimensional tensor whose axes are batch, channel, image height, and image width.
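As a small illustration of what unsqueeze does, the sketch below uses a random tensor as a stand-in for the output of ToTensor() (the names `image` and `batched` are only for this example):

```python
import torch

# Stand-in for the output of ToTensor(): (channel, height, width)
image = torch.rand(3, 224, 224)

# Insert a new axis of size 1 at position 0, the batch axis
batched = image.unsqueeze(0)
print(batched.shape)

# Indexing with None is an equivalent way to add the same axis
same = torch.equal(batched, image[None])
print(same)
```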

patch_size = 16 # 16 pixels
patches = rearrange(x, 'b c (h s1) (w s2) -> b (h w) (s1 s2 c)', s1=patch_size, s2=patch_size)

The rearrange syntax above may be unfamiliar, and even experienced developers can find the pattern string b c (h s1) (w s2) -> b (h w) (s1 s2 c) confusing as an argument. Here b and c are the batch size and the number of channels, and s1 x s2 is the size of one patch. For example, a 1 x 3 x 224 x 224 tensor has its height and width factored into 1 x 3 x 14(h) x 16(s1) x 14(w) x 16(s2). Then h x w is the number of patches per image, 14 x 14 = 196, and s1 x s2 x c = 16 x 16 x 3 = 768 is the length of each flattened patch.

patches.shape #torch.Size([1, 196, 768])

PatchEmbedding

Let's first look at its output shape, (1, 196, 768), and its input shape, (1, 3, 224, 224).

class PatchEmbedding(nn.Module):
    def __init__(self, in_channels: int = 3, patch_size: int = 16, emb_size: int = 768):
        super().__init__()
        self.patch_size = patch_size
        self.projection = nn.Sequential(
            # split the image into s1 x s2 patches, then flatten each patch
            Rearrange('b c (h s1) (w s2) -> b (h w) (s1 s2 c)', s1=patch_size, s2=patch_size),
            nn.Linear(patch_size * patch_size * in_channels, emb_size)
        )
                
    def forward(self, x: Tensor) -> Tensor:
        x = self.projection(x)
        return x
    
PatchEmbedding()(x).shape

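We then use a linear layer to project each flattened patch to emb_size dimensions. With the default settings the flattened patch length is already 768, the same as emb_size, but emb_size can be set to other values, so feel free to experiment. The projection so far is: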

nn.Linear(patch_size * patch_size * in_channels, emb_size)
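To see what changing emb_size does, here is a quick sketch on random stand-in patches. The values 512 and 192 are arbitrary choices for illustration, not values taken from this article:

```python
import torch
from torch import nn

patch_size, in_channels = 16, 3
flat_dim = patch_size * patch_size * in_channels   # 16 * 16 * 3 = 768

# Stand-in for the flattened patches of one 224 x 224 image
patches = torch.randn(1, 196, flat_dim)

shapes = {}
for emb_size in (768, 512, 192):   # 512 and 192 are illustrative only
    projection = nn.Linear(flat_dim, emb_size)
    # nn.Linear acts on the last axis, so every patch is projected independently
    shapes[emb_size] = tuple(projection(patches).shape)

print(shapes)
```

Only the last dimension changes; the number of patches stays at 196.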

class PatchEmbedding(nn.Module):
    def __init__(self, in_channels: int = 3, patch_size: int = 16, emb_size: int = 768):
        super().__init__()
        self.patch_size = patch_size
        self.projection = nn.Sequential(
            # using a conv layer instead of a linear one -> performance gains
            nn.Conv2d(in_channels, emb_size, kernel_size=patch_size, stride=patch_size),
            Rearrange('b e (h) (w) -> b (h w) e'),
        )
                
    def forward(self, x: Tensor) -> Tensor:
        x = self.projection(x)
        return x
    
PatchEmbedding()(x).shape

Adding the cls token

class PatchEmbedding(nn.Module):
    def __init__(self, in_channels: int = 3, patch_size: int = 16, emb_size: int = 768):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Sequential(
            # use a conv layer in place of the earlier linear layer for better performance
            nn.Conv2d(in_channels, emb_size, kernel_size=patch_size, stride=patch_size),
            Rearrange('b e (h) (w) -> b (h w) e'),
        )
        
        self.cls_token = nn.Parameter(torch.randn(1,1, emb_size))
        
    def forward(self, x: Tensor) -> Tensor:
        b, _, _, _ = x.shape
        x = self.proj(x)
        cls_tokens = repeat(self.cls_token, '() n e -> b n e', b=b)
        # prepend the cls token to the patch tokens
        x = torch.cat([cls_tokens, x], dim=1)
        return x
    
PatchEmbedding()(x).shape

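A convolution whose kernel_size and stride both equal patch_size applies one shared linear projection to each non-overlapping patch, so it can take the place of the explicit Rearrange followed by nn.Linear. The switch is made for better performance.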