AAAMLP-Chapter-9: Approaching Image Classification & Segmentation

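This chapter covers image problems. The field has advanced quickly in recent years, and many computer vision tasks are now much easier to approach.

With pretrained models, it is easy to get state-of-the-art image models running even on a personal computer.

This chapter introduces the most common types of image problems.

To a computer, an image is a matrix of numbers. A grayscale image is a two-dimensional matrix whose elements range from 0 to 255: 0 is black, 255 is white, and the values in between are shades of gray. An RGB image consists of three such matrices, one per color channel.

Before deep learning became popular, people usually worked on pixels directly and treated every pixel as a feature.

In Python, you can read images with OpenCV or Python-PIL (Pillow) and convert their pixel values to a numpy array.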

import numpy as np
import matplotlib.pyplot as plt

random_img = np.random.randint(0, 256, (256, 256))
print(random_img.flatten().shape)

plt.figure(figsize=(7, 7))
plt.imshow(random_img, cmap='gray', vmin=0, vmax=255)
plt.show()

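The code above builds a random 256x256 matrix and plots it as a grayscale image.

flatten/ravel unrolls the matrix row by row into a vector, here of 256 x 256 = 65536 values. Such a vector is typically used as the input features for image problems.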
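As a small illustration of the row-by-row order, here is a sketch on a made-up 2x3 array (the array `tiny_img` is only an example, not part of the dataset):

```python
import numpy as np

# A tiny made-up "image" with 2 rows and 3 columns.
tiny_img = np.array([[1, 2, 3],
                     [4, 5, 6]])

# ravel/flatten walk the array row by row (C order) by default.
features = tiny_img.ravel()
print(features)        # [1 2 3 4 5 6]
print(features.shape)  # (6,)
```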

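Once every image in the dataset has been flattened this way, you can build decision tree, random forest, or SVM models for image classification.

You have surely heard of the Cat vs Dog problem, a classic image classification task. This chapter does not use it, though. You may remember the pneumothorax images from Chapter 3; we continue with that dataset here.

The original task is to detect which part of the image shows pneumothorax. This chapter modifies it slightly so that we only need to decide whether an image shows pneumothorax at all. Don't worry, locating the pneumothorax is covered later in the chapter.

The dataset consists of 10675 images, 2379 of which show pneumothorax.

We can already conclude that this binary classification dataset is imbalanced, so we use stratified k-fold cross-validation and AUC as the evaluation metric.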
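To make the metric concrete, the sketch below computes AUC straight from its definition: the fraction of (positive, negative) pairs in which the positive sample gets the higher score. The helper `pairwise_auc` is a hypothetical name used for illustration only; in practice `sklearn.metrics.roc_auc_score` does this far more efficiently.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. The double loop only illustrates
    the definition and is too slow for large datasets.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Share of positive samples in the modified pneumothorax dataset.
positive_rate = 2379 / 10675
print(round(positive_rate, 3))  # 0.223

# Toy labels and scores: 3 of the 4 positive/negative pairs are ordered correctly.
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)  # 0.75
```

Because only about 22% of the samples are positive, accuracy would be misleading here, while AUC depends only on how the scores rank positives against negatives.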

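As for the model, we could flatten the features and classify with an SVM or a random forest. Such models are often good enough, but they are far from state of the art.

The images are 1024x1024, so a flattened feature vector has about 1M (1,048,576) entries, and training on that would be very slow.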
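A quick back-of-the-envelope calculation shows the scale. It assumes every pixel is stored as a 4-byte float32, which is an assumption made for this estimate only:

```python
height, width = 1024, 1024
num_images = 10675

# One feature per pixel after flattening.
features_per_image = height * width

# Assumed storage: 4 bytes (float32) per feature.
total_bytes = num_images * features_per_image * 4
total_gib = total_bytes / 2**30

print(features_per_image)   # 1048576
print(round(total_gib, 1))  # 41.7
```

Holding the full-resolution dataset in memory would take roughly 42 GiB under this assumption, which is one reason the images are downscaled before a classical model is fitted.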

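The original images are in DICOM format, which is commonly used to store medical images. A DICOM file has two parts, the metadata and the image array. We read these files with the pydicom library.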

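import matplotlib.pyplot as plt
import pydicom

# dcmread returns a dataset holding both the metadata and the pixel data
dcm_file = pydicom.dcmread('pneumothorax/sample.dcm')
print(dcm_file)  # prints the DICOM metadata
plt.imshow(dcm_file.pixel_array, cmap=plt.cm.bone)
plt.show()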

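The dataset I downloaded is organized differently from the one in the original book, so the dcm files first need to be gathered into a single folder.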

import os
import os.path as pth
import shutil

root_dir = 'pneumothorax/dicom-images-train'
target_dir = 'pneumothorax/sample'

def copy(root_dir, target_dir):
    for f in os.listdir(root_dir):
        file_path = pth.join(root_dir, f)
        if pth.isfile(file_path):
            shutil.copy(file_path, pth.join(target_dir, f))
        elif pth.isdir(file_path):
            copy(file_path, target_dir)
os.makedirs(target_dir, exist_ok=True)  # the target folder must exist before copying
copy(root_dir, target_dir)

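We start with a deliberately naive approach and use a random forest for the classification problem.

The original images are already grayscale, so no extra conversion is needed; we only shrink them to 256x256.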

import os 
import numpy as np
import pandas as pd
from PIL import Image
from sklearn import ensemble, metrics, model_selection
from tqdm import tqdm
import pydicom

def create_dataset(training_df, image_dir):
    images = []
    targets = []
    for index, row in tqdm(
        training_df.iterrows(),
        total=len(training_df),
        desc='processing images'
    ):
        image_id = row['id']
        image_path = os.path.join(image_dir, image_id + '.dcm')
        img = pydicom.dcmread(image_path)
        img = np.array(img.pixel_array)
        img = Image.fromarray(img).resize((256, 256))
        img = np.array(img).ravel()
        images.append(img)
        targets.append(int(row['target']))
    images = np.array(images)
    print(images.shape)
    return images, np.array(targets)

csv_path = 'pneumothorax/train-rle.csv'
image_path = 'pneumothorax/sample/'

df = pd.read_csv(csv_path)
df.columns = ['id', 'target']
df.loc[:, 'target'] = (df['target'] != ' -1').astype(int)
df.loc[:, 'kfold'] = -1
df = df.sample(frac=1).reset_index(drop=True)

y = df.target.values.astype(int)

kf = model_selection.StratifiedKFold(n_splits=5)
for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
    df.loc[v_, 'kfold'] = f
for f in range(5):
    train_df = df[df.kfold != f].reset_index(drop=True)
    valid_df = df[df.kfold == f].reset_index(drop=True)

    x_train, y_train = create_dataset(train_df, image_path)
    x_valid, y_valid = create_dataset(valid_df, image_path)

    clf = ensemble.RandomForestClassifier()
    clf.fit(x_train, y_train)
    preds = clf.predict_proba(x_valid)[:, 1]
    print(f"FOLD: {f}")
    print(f"AUC = {metrics.roc_auc_score(y_valid, preds)}")
    print()

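The average AUC is about 0.82, which is not bad.

Handling images this way is the traditional machine learning approach to image classification, and SVMs in particular were well known for it.

Deep learning has proven to be state of the art on image problems, so next we try a neural network.

I won't go through the history of deep learning; let's look directly at one famous deep learning model, AlexNet.

Today people may call it a basic convolutional neural network, but it is the foundation of many newer deep learning models.

As the architecture figure shows, the network has 5 convolutional layers, 2 fully connected layers, and an output layer, with max pooling layers after some of the convolutional layers.

Here are some terms that come up often.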

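A filter (convolution kernel) is, in image processing, a two-dimensional array that is convolved with the image. Filter values need a suitable initialization; a common choice is Kaiming normal initialization. This matters because modern networks use ReLU, and proper initialization is needed to avoid the vanishing gradient problem.

Convolution lays the filter over the image, multiplies it element-wise with the pixels it covers, and sums the products into a single output value. Other tutorials cover convolution in more detail.

Starting at the top-left corner of the image, we apply the filter and then move it horizontally or vertically. If it moves one pixel at a time, the stride is 1, and so on.

Stride is a useful concept, and it is used in natural language processing as well.

If the original image is not padded, the output of the convolution is smaller than the input.

The example below shows how filter size, image size, and stride are related.

Take a 3x3 filter convolved over a 6x6 image with stride 1. With a padding of 1 pixel on each side, the padded image is 8x8 and the output size is floor((8 - 3) / 1) + 1 = 6, so the convolution outputs a 6x6 image. In general, for input size n, filter size k, padding p, and stride s, the output size is floor((n + 2p - k) / s) + 1.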
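The relationship can be checked with a few lines of Python. This is a minimal sketch of the standard output-size formula along one axis; `conv_output_size` is a hypothetical helper name, not a library function:

```python
def conv_output_size(n, kernel_size, padding=0, stride=1):
    """Output length along one axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel_size) // stride + 1


same_size = conv_output_size(6, 3, padding=1, stride=1)
no_padding = conv_output_size(6, 3, padding=0, stride=1)
alexnet_conv1 = conv_output_size(227, 11, padding=0, stride=4)

print(same_size)      # 6: padding of 1 keeps the 6x6 size
print(no_padding)     # 4: without padding the output shrinks
print(alexnet_conv1)  # 55: an 11x11 filter with stride 4 on a 227x227 input
```

The same function also covers pooling layers, since a pooling window slides over the image in the same way as a filter.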

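The purpose of padding here is to keep the output the same size as the input, which is why it is usually paired with stride=1.

The next term is dilation. A dilated filter is spread out so that it skips some pixels during convolution, which is very useful in segmentation tasks.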
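To see which pixels a dilated filter actually reads, the following sketch lists the offsets along one axis. Both helper names are hypothetical; the formula d * (k - 1) + 1 for the span covered by the filter is the standard one.

```python
def effective_kernel_size(kernel_size, dilation=1):
    """Span of input pixels covered by a dilated filter along one axis."""
    return dilation * (kernel_size - 1) + 1


def dilated_taps(kernel_size, dilation=1):
    """Offsets of the input pixels the filter reads along one axis."""
    return [i * dilation for i in range(kernel_size)]


print(dilated_taps(3, dilation=1))           # [0, 1, 2]
print(dilated_taps(3, dilation=2))           # [0, 2, 4]
print(effective_kernel_size(3, dilation=2))  # 5
```

With dilation 2, a 3x3 filter still has 9 weights but covers a 5x5 area, so the receptive field grows without adding parameters.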

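Next is max pooling: a filter that only returns the maximum of the pixels it covers, instead of a weighted sum as in convolution. Mean pooling (average pooling) works the same way but returns the mean.

A pooling filter slides over the whole image just like a convolution filter and produces an output image. Max pooling can be used to detect edges, and average pooling to smooth the image.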
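A minimal NumPy sketch of both pooling variants on a made-up 4x4 image follows. The function `pool2d` is written for illustration only; in PyTorch the corresponding layers are `nn.MaxPool2d` and `nn.AvgPool2d`.

```python
import numpy as np


def pool2d(image, size=2, stride=2, mode="max"):
    """Slide a size x size window over a 2-D image and reduce each window."""
    h, w = image.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride:i * stride + size,
                           j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)
    return out


img = np.array([[1, 2, 5, 6],
                [3, 4, 7, 8],
                [9, 10, 13, 14],
                [11, 12, 15, 16]], dtype=np.float32)

max_pooled = pool2d(img, mode="max")
mean_pooled = pool2d(img, mode="mean")
print(max_pooled)   # [[ 4.  8.] [12. 16.]]
print(mean_pooled)  # [[ 2.5  6.5] [10.5 14.5]]
```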

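Convolutional neural networks involve many more concepts. We have only covered the very basics here, enough to get you started.

Now we implement this convolutional neural network by hand in PyTorch. PyTorch provides a simple interface for building and running neural networks and hides the low-level computation, so our job is to tell PyTorch how the layers are connected.

The code uses the notation BS, C, H, and W.

BS is the batch size. C is the number of channels, which can also be thought of as the number of feature maps. H and W are the height and width of the image.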
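The sketch below shows how images stored as [H, W, C] arrays end up in the [BS, C, H, W] layout. It uses NumPy and made-up all-zero images purely for illustration:

```python
import numpy as np

# Two made-up RGB images in the usual image-library layout [H, W, C].
images_hwc = [np.zeros((224, 224, 3), dtype=np.float32) for _ in range(2)]

# Move the channel axis to the front: [H, W, C] -> [C, H, W].
images_chw = [np.transpose(img, (2, 0, 1)) for img in images_hwc]

# Stack along a new leading axis to form a batch: [BS, C, H, W].
batch = np.stack(images_chw, axis=0)
bs, c, h, w = batch.shape
print(batch.shape)  # (2, 3, 224, 224)
```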

import torch
import torch.nn.functional as F
import torch.nn as nn

class AlexNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(
            in_channels=3,
            out_channels=96,
            kernel_size=11,
            stride=4,
            padding=0
        )
        self.pool1 = nn.MaxPool2d(kernel_size=3, stride=2)
        self.conv2 = nn.Conv2d(
            in_channels=96,
            out_channels=256,
            kernel_size=5,
            stride=1,
            padding=2
        )
        self.pool2 = nn.MaxPool2d(kernel_size=3, stride=2)
        self.conv3 = nn.Conv2d(
            in_channels=256,
            out_channels=384,
            kernel_size=3,
            stride=1,
            padding=1
        )
        self.conv4 = nn.Conv2d(
            in_channels=384,
            out_channels=384,
            kernel_size=3, 
            stride=1,
            padding=1
        )
        self.conv5 = nn.Conv2d(
            in_channels=384,
            out_channels=256,
            kernel_size=3,
            stride=1,
            padding=1
        )
        self.pool3 = nn.MaxPool2d(kernel_size=3, stride=2)
        self.fc1 = nn.Linear(
            in_features=9216,
            out_features=4096
        )
        self.dropout1 = nn.Dropout(0.5)
        self.fc2 = nn.Linear(
            in_features=4096,
            out_features=4096
        )
        self.dropout2 = nn.Dropout(0.5)
        self.fc3 = nn.Linear(
            in_features=4096,
            out_features=1000
        )

    def forward(self, image):
        # input size: BatchSize 3 227 227
        bs, c, h, w = image.size()
        x = F.relu(self.conv1(image)) # bs 96 55 55
        x = self.pool1(x)             # bs 96 27 27
        x = F.relu(self.conv2(x))     # bs 256 27 27
        x = self.pool2(x)             # bs 256 13 13
        x = F.relu(self.conv3(x))     # bs 384 13 13
        x = F.relu(self.conv4(x))     # bs 384 13 13
        x = F.relu(self.conv5(x))     # bs 256 13 13
        x = self.pool3(x)             # bs 256 6 6
        x = x.view(bs, -1)            # bs 256*6*6=9216
        x = F.relu(self.fc1(x))       # bs 4096
        x = self.dropout1(x)          # bs 4096
        x = F.relu(self.fc2(x))       # bs 4096
        x = self.dropout2(x)          # bs 4096
        x = self.fc3(x)               # bs 1000, raw class scores (no ReLU on the output layer)
        x = torch.softmax(x, dim=1)   # bs 1000, normalized to probabilities in 0-1
        return x

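The in_channels parameter of a convolution layer is the number of channels of its input. For the original image it is 3, one channel each for R, G, and B.

The out_channels parameter is the number of filters, and also the number of output channels: each filter produces one output channel.

An image here is no longer described in two dimensions as [H, W] but in three dimensions including the channels, [C, H, W]. In other words, one image is represented by a three-dimensional tensor.

Each filter is randomly initialized and spans all input channels. It is convolved with every input channel, the per-channel results are summed (plus a bias), and the sum is stored in the corresponding output channel.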
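A minimal NumPy sketch of how a single filter combines the input channels: the element-wise products are summed over the filter window and over all input channels. `conv_one_filter` is a hypothetical helper (stride 1, no padding), not the way PyTorch implements `nn.Conv2d` internally.

```python
import numpy as np


def conv_one_filter(image, kernel, bias=0.0):
    """image: [C, H, W], kernel: [C, k, k]. Returns one output channel."""
    c, h, w = image.shape
    kc, k, _ = kernel.shape
    assert c == kc, "the filter must have one slice per input channel"
    out = np.zeros((h - k + 1, w - k + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[:, i:i + k, j:j + k]
            # Sum over the window AND over the channel axis.
            out[i, j] = np.sum(patch * kernel) + bias
    return out


image = np.ones((3, 4, 4), dtype=np.float32)   # made-up 3-channel input
kernel = np.ones((3, 3, 3), dtype=np.float32)  # one 3x3 filter spanning 3 channels
feature_map = conv_one_filter(image, kernel)
print(feature_map.shape)  # (2, 2)
print(feature_map[0, 0])  # 27.0 = 3 channels * 3 * 3 products of 1
```

A layer with out_channels filters simply repeats this once per filter and stacks the resulting feature maps.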

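You can design a convolutional neural network for your own task, and a network you design yourself can often be matched more closely to what the task needs.

Next, we build a neural network for the pneumothorax detection problem.

First, create the stratified k-fold cross-validation splits, using 5 folds.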

import pandas as pd
from sklearn import model_selection

csv_path = 'pneumothorax/train-rle.csv'
folds_path = 'pneumothorax/train-rle-folds.csv'
image_path = 'pneumothorax/sample/'

df = pd.read_csv(csv_path)
df.columns = ['id', 'target']
df.loc[:, 'target'] = (df['target'] != ' -1').astype(int)
df.loc[:, 'kfold'] = -1
df = df.sample(frac=1).reset_index(drop=True)

y = df.target.values.astype(int)

kf = model_selection.StratifiedKFold(n_splits=5)
for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
    df.loc[v_, 'kfold'] = f
df.to_csv(folds_path, index=False)

Then create a PyTorch Dataset class to iterate over the samples.

import torch
import numpy as np
from PIL import Image, ImageFile
import pydicom

ImageFile.LOAD_TRUNCATED_IMAGES = True

class ClassificationDataset:
    def __init__(
        self,
        image_paths,
        targets,
        resize=None,
        augmentations=None
    ):
        self.image_paths = image_paths
        self.targets = targets
        self.resize = resize
        self.augmentations = augmentations

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, item):
        img = pydicom.dcmread(self.image_paths[item])
        img = np.array(img.pixel_array)
        image = Image.fromarray(img)
        image = image.convert('RGB')
        target = self.targets.iloc[item]

        if self.resize is not None:
            image = image.resize(
                (self.resize[1], self.resize[0])
            )
        image = np.array(image)
        if self.augmentations is not None:
            augmented = self.augmentations(image=image)
            image = augmented['image']
        image = np.transpose(image, [2, 0, 1]).astype(np.float32)
        return (
            torch.tensor(image, dtype=torch.float),
            torch.tensor(target, dtype=torch.long)
        )

Next, write the training and evaluation functions.

import torch
import torch.nn as nn
from tqdm import tqdm

def train(data_loader, model, optimizer, device):
    model.train()
    criterion = nn.BCEWithLogitsLoss()
    for inputs, targets in data_loader:
        # move the batch to the same device as the model
        inputs = inputs.to(device)
        targets = targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets.view(-1, 1).float())
        loss.backward()
        optimizer.step()

def evaluate(data_loader, model, device):
    model.eval()
    final_targets = []
    final_outputs = []
    with torch.no_grad():
        for inputs, targets in data_loader:
            inputs = inputs.to(device)
            outputs = model(inputs)
            # extend (not append) so both results stay flat lists,
            # which is what roc_auc_score expects
            final_targets.extend(targets.view(-1).tolist())
            final_outputs.extend(outputs.cpu().view(-1).tolist())
    return final_outputs, final_targets

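Then create the model architecture.

It is best to keep the model definition separate from the training logic, so that different models can be swapped in easily and compared side by side.

torchvision.models, used below, ships many architectures, including AlexNet, ResNet, and DenseNet, and the third-party pretrainedmodels package offers a similar collection. These classic architectures come with pretrained weights, which we can use directly, or we can train from scratch.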

import torch.nn as nn
from torchvision.models import alexnet

def get_model(pretrained):
    if pretrained:
        model = alexnet(pretrained=True)
    else:
        model = alexnet(pretrained=False)
    num_f = model.classifier[6].in_features
    model.classifier[6] = nn.Linear(
        in_features=num_f,
        out_features=1
    )
    return model

get_model(False)

Now that the data, the model architecture, and the training loop are ready, we can start training.

import os
import pandas as pd
import numpy as np
import albumentations
import torch
from sklearn import metrics, model_selection

data_path = 'pneumothorax/train-rle-folds.csv'
image_path = 'pneumothorax/sample/'

device = 'cpu'
epochs = 1
df = pd.read_csv(data_path)
images = df.id.values.tolist()
images = [os.path.join(image_path, image + '.dcm') for image in images]
targets = df.target
model = get_model(pretrained=True).to(device)

mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
aug = albumentations.Compose([
    albumentations.Normalize(
        mean, std, max_pixel_value=255.0, always_apply=True
    )
])
train_images, valid_images, train_targets, valid_targets = model_selection.train_test_split(
    images, targets, stratify=targets, random_state=42
)
train_dataset = ClassificationDataset(
    image_paths=train_images,
    targets=train_targets,
    resize=(224, 224),
    augmentations=aug
)
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=16,
    shuffle=True
)
valid_dataset = ClassificationDataset(
    image_paths=valid_images,
    targets=valid_targets,
    resize=(224, 224),
    augmentations=aug
)
valid_loader = torch.utils.data.DataLoader(
    valid_dataset,
    batch_size=16,
    shuffle=False
)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
for epoch in range(epochs):
    train(train_loader, model, optimizer, device)
    # keep the results under new names so valid_targets is not overwritten
    predictions, eval_targets = evaluate(valid_loader, model, device)
    auc_score = metrics.roc_auc_score(eval_targets, predictions)
    print(f'Epoch: {epoch}, AUC: {auc_score}')

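After this brief training run, the model's AUC on the validation set is only around 0.6. That is far below the random forest, and training takes longer too. Why is that?

One advantage of off-the-shelf networks is that pretrained models are easy to swap. Next we try ResNet18 with its pretrained weights.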

import torch.nn as nn
from torchvision.models import resnet18

def get_model(pretrained):
    if pretrained:
        model = resnet18(pretrained=True)
    else:
        model = resnet18(pretrained=False)
    num_f = model.fc.in_features
    model.fc = nn.Linear(
        in_features=num_f,
        out_features=1
    )
    return model

get_model(False)

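Then modify the training and evaluation script above to resize the images to 512x512, and run it.

This model looks somewhat better, with an AUC of about 0.7.

You can still tune the AlexNet model to improve its performance. Optimizing deep networks is hard, but not impossible.

Options include using the Adam optimizer with a low learning rate, reducing the learning rate when the validation loss plateaus, applying augmentations to the images, and changing the batch size.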
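The idea of reducing the learning rate on a plateau can be sketched in plain Python. The class `PlateauLR` below is a hypothetical, simplified helper written for illustration; PyTorch provides `torch.optim.lr_scheduler.ReduceLROnPlateau` for real training, and its exact behavior has more options than this sketch.

```python
class PlateauLR:
    """Multiply lr by `factor` once the loss has failed to improve
    for more than `patience` consecutive epochs."""

    def __init__(self, lr, factor=0.1, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            # New best validation loss: reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


# Made-up validation losses: the loss stalls after the second epoch.
sched = PlateauLR(lr=1e-3, factor=0.5, patience=2)
history = [sched.step(loss) for loss in [1.0, 0.8, 0.81, 0.82, 0.83, 0.79]]
print(history)  # the learning rate is halved at the fifth epoch
```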

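ResNet has a more complex architecture than AlexNet. ResNet stands for Residual Neural Network.

A residual network is built from residual blocks, which pass information across layers. These connections are known as skip connections, because they skip the computation of some layers. Skip connections help with the vanishing gradient problem by letting gradients flow to earlier layers, which makes it possible to train very large convolutional neural networks without losing performance. In very deep networks the loss often starts to increase beyond a certain point, and skip connections help avoid this.

A residual block is easy to understand: take the output of one layer, skip the next few layers, and add it to the input of a layer further along.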
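The core of a residual block, y = relu(x + F(x)), can be sketched with NumPy. This is a deliberately simplified version that uses dense matrices `w1` and `w2` as stand-ins for the convolution layers; real ResNet blocks use convolutions and batch normalization.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(x, w1, w2):
    """y = relu(x + F(x)), where F(x) = w2 @ relu(w1 @ x)."""
    fx = w2 @ relu(w1 @ x)  # the layers that the skip connection bypasses
    return relu(x + fx)     # skip connection: add the input back


x = np.array([1.0, 2.0, 3.0])
zeros = np.zeros((3, 3))

# With all-zero weights F(x) is 0, so the block passes x through unchanged.
out = residual_block(x, zeros, zeros)
print(out)  # [1. 2. 3.]
```

Because the block can fall back to the identity this easily, stacking more blocks does not have to make the network worse, which is the intuition behind training very deep ResNets.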

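ResNet comes in several variants with 18, 34, 50, 101, and 152 layers, all of which have weights pretrained on the ImageNet dataset.

Pretrained models can be applied to almost any problem, but beginners should still start with simpler models.

Other models with ImageNet-pretrained weights include: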

Inception
DenseNet
NASNet
PNASNet
VGG
Xception
ResNeXt
EfficientNet

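Next, let's see how pretrained models are used for image segmentation.

Segmentation is a very common computer vision problem, in which the foreground is removed from, or extracted out of, the background. What counts as foreground and background depends on the task.

We can treat segmentation as a pixel-level classification task in which every pixel is assigned a class.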
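Viewed this way, turning a model's per-class scores into a mask is just an argmax over the channel axis. The sketch below uses made-up scores for a 2x3 image and two classes (background and foreground):

```python
import numpy as np

# Made-up class scores with layout [C, H, W]: 2 classes on a 2x3 image.
logits = np.array([
    [[2.0, 0.1, 0.3], [1.5, 0.2, 0.9]],  # channel 0: background scores
    [[0.5, 1.2, 0.4], [0.1, 2.2, 0.8]],  # channel 1: foreground scores
])

# For every pixel, pick the class with the highest score.
mask = np.argmax(logits, axis=0)
print(mask)
# [[0 1 1]
#  [0 1 0]]
```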

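The pneumothorax dataset is in fact a segmentation task: given an X-ray image, segment the region where pneumothorax occurs.

The best-known segmentation model is U-Net, whose architecture is shown in the figure.

U-Net has two parts, an encoder and a decoder. The encoder is made of the convolution and pooling operations seen earlier. The decoder introduces up-convolutional layers, which expand a small feature map into a larger one. In PyTorch this operation is provided by ConvTranspose2d.

A U-Net implementation follows.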

import torch
import torch.nn as nn
from torch.nn import functional as F

def double_conv(in_channels, out_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_channels, out_channels, kernel_size=3),
        nn.ReLU(inplace=True)
    )

def crop_tensor(tensor, target_tensor):
    target_size = target_tensor.size()[2]
    tensor_size = tensor.size()[2]
    delta = (tensor_size - target_size) // 2
    return tensor[
        :,
        :,
        delta: tensor_size - delta,
        delta: tensor_size - delta
    ]

class UNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.max_pool_2x2 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.down_conv_1 = double_conv(1, 64)
        self.down_conv_2 = double_conv(64, 128)
        self.down_conv_3 = double_conv(128, 256)
        self.down_conv_4 = double_conv(256, 512)
        self.down_conv_5 = double_conv(512, 1024)
        self.up_trans_1 = nn.ConvTranspose2d(1024, 512, kernel_size=2, stride=2)
        self.up_conv_1 = double_conv(1024, 512)
        self.up_trans_2 = nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2)
        self.up_conv_2 = double_conv(512, 256)
        self.up_trans_3 = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
        self.up_conv_3 = double_conv(256, 128)
        self.up_trans_4 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.up_conv_4 = double_conv(128, 64)
        self.out = nn.Conv2d(64, 2, kernel_size=1)

    def forward(self, image):
        # bs c h w
        x1 = self.down_conv_1(image)
        x2 = self.max_pool_2x2(x1)
        x3 = self.down_conv_2(x2)
        x4 = self.max_pool_2x2(x3)
        x5 = self.down_conv_3(x4)
        x6 = self.max_pool_2x2(x5)
        x7 = self.down_conv_4(x6)
        x8 = self.max_pool_2x2(x7)
        x9 = self.down_conv_5(x8)

        x = self.up_trans_1(x9)
        y = crop_tensor(x7, x)
        x = self.up_conv_1(torch.cat([x, y], axis=1))
        x = self.up_trans_2(x)
        y = crop_tensor(x5, x)
        x = self.up_conv_2(torch.cat([x, y], axis=1))
        x = self.up_trans_3(x)
        y = crop_tensor(x3, x)
        x = self.up_conv_3(torch.cat([x, y], axis=1))
        x = self.up_trans_4(x)
        y = crop_tensor(x1, x)
        x = self.up_conv_4(torch.cat([x, y], axis=1))
        return self.out(x)

image = torch.rand(1, 1, 572, 572)
model = UNet()
model(image)

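The implementation above follows the U-Net paper; many other implementations can be found online. Some replace the transposed convolutions with bilinear upsampling, which may perform better, but that is not the architecture from the paper.

The model takes a single-channel image as input and outputs an image with two channels, one for the foreground and one for the background.

You can adapt the code to change the number of input and output channels. The output is smaller than the input because the convolutions use no padding.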
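The size difference can be traced with simple arithmetic. This sketch assumes the structure of the implementation above: each double 3x3 convolution without padding removes 4 pixels, each 2x2 max pool halves the size, and each transposed convolution doubles it. `unet_output_size` is a hypothetical helper name.

```python
def unet_output_size(size, depth=4):
    """Trace the spatial size through an unpadded U-Net with `depth` pooling steps."""
    # Encoder: double 3x3 conv (-4 pixels), then 2x2 max pool (halve).
    for _ in range(depth):
        size = (size - 4) // 2
    # Bottleneck: one more double conv.
    size -= 4
    # Decoder: transposed conv (double), then double 3x3 conv (-4 pixels).
    for _ in range(depth):
        size = size * 2 - 4
    # The final 1x1 convolution does not change the spatial size.
    return size


output_size = unet_output_size(572)
print(output_size)  # 388
```

So a 572x572 input yields a 388x388 output, which is why the encoder feature maps have to be cropped before they are concatenated in the decoder.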

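The encoder only performs convolutions and pooling, so any convolutional network, such as ResNet, can replace it, and the replacement can bring its pretrained weights along.

Most segmentation problems have two inputs: the original image and a mask. With multiple objects, there are multiple masks.

The pneumothorax segmentation dataset provides its masks as RLE. RLE stands for Run-Length Encoding, a compact way to store binary masks that saves space.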
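To illustrate the idea, here is a small pure-Python sketch that encodes a flat 0/1 mask as "start length" pairs and decodes it again. It assumes absolute, 1-indexed start positions, which is the convention the loader in this chapter uses; actual competition files may follow a different convention, so check the data description before relying on it. Both function names are hypothetical.

```python
def rle_encode(flat_mask):
    """Encode a flat list of 0/1 values as 'start length' pairs (1-indexed starts)."""
    runs = []
    i, n = 0, len(flat_mask)
    while i < n:
        if flat_mask[i] == 1:
            start = i
            while i < n and flat_mask[i] == 1:
                i += 1
            runs.extend([start + 1, i - start])
        else:
            i += 1
    return " ".join(str(v) for v in runs)


def rle_decode(rle, length):
    """Rebuild the flat 0/1 list from 'start length' pairs."""
    flat = [0] * length
    values = [int(v) for v in rle.split()]
    for start, run in zip(values[0::2], values[1::2]):
        for k in range(start - 1, start - 1 + run):
            flat[k] = 1
    return flat


flat_mask = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
encoded = rle_encode(flat_mask)
decoded = rle_decode(encoded, len(flat_mask))
print(encoded)               # 2 3 7 2
print(decoded == flat_mask)  # True
```

A 1024x1024 mask has over a million pixels, but if the foreground consists of a few hundred runs, the encoded string only needs a few hundred number pairs.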

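Suppose we have an input image and its mask. We first design a Dataset class that returns both.

Here is the Dataset implementation. The same processing flow can be reused for almost any segmentation problem.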

import os, glob, torch
import numpy as np
import pandas as pd
from PIL import Image, ImageFile
from tqdm import tqdm
from collections import defaultdict
from torchvision import transforms
from albumentations import (
    Compose, OneOf, ShiftScaleRotate,
    RandomBrightnessContrast, RandomGamma
)
import pydicom

ImageFile.LOAD_TRUNCATED_IMAGES = True

data_path = 'pneumothorax/train-rle.csv'
image_path = 'pneumothorax/sample/'

# [h, w, c] float32
def load_dicom(img_path: str):
    img = pydicom.dcmread(img_path)
    image = Image.fromarray(img.pixel_array)
    image = image.convert('RGB')
    image = np.array(image, dtype=np.float32)
    return image

# [h, w] float32
def load_rle_mask(rle: str, shape=(1024, 1024)):
    s = rle.split()
    # a single token (' -1') means the image has no mask
    if len(s) == 1:
        return np.zeros(shape, dtype=np.float32)
    # assumes alternating 'start length' pairs with absolute, 1-indexed starts
    starts, lengths = [
        np.array(x, dtype=np.int32) for x in (s[0::2], s[1::2])
    ]
    starts -= 1
    ends = starts + lengths
    pixel_array = np.zeros(shape[0] * shape[1], dtype=np.float32)
    for lo, hi in zip(starts, ends):
        pixel_array[lo:hi] = 1
    return pixel_array.reshape(shape)

class SIIMDataset(torch.utils.data.Dataset):
    def __init__(
        self,
        df,
        transform=True,
        preprocessing_fn=None
    ):
        self.data = defaultdict(dict)
        self.transform = transform
        self.fn = preprocessing_fn
        self.aug = Compose([
            ShiftScaleRotate(
                shift_limit=0.0625,
                scale_limit=0.1,
                rotate_limit=10,
                p=0.8
            ),
            OneOf([
                RandomGamma(gamma_limit=(90, 110)),
                RandomBrightnessContrast(
                    brightness_limit=0.1,
                    contrast_limit=0.1
                )
            ])
        ])
        for idx, row in df.iterrows():
            img_id = row['id']
            self.data[idx] = {
                'img_path': os.path.join(image_path, img_id + '.dcm'),
                'mask_rle': row['target']
            }

    def __len__(self):
        return len(self.data)

    def __getitem__(self, item):
        img_path = self.data[item]['img_path']
        mask_rle = self.data[item]['mask_rle']

        img = load_dicom(img_path)
        mask = load_rle_mask(mask_rle)
        if self.transform is True:
            augmented = self.aug(image=img, mask=mask)
            img = augmented['image']
            mask = augmented['mask']
        if self.fn is not None:
            img = self.fn(img)
        img = np.transpose(img, [2, 0, 1]).astype(np.float32)
        return torch.tensor(img), torch.tensor(mask)

With the Dataset class in place, we can write the training functions.

import os, sys, torch
import numpy as np
import pandas as pd
import torch.nn as nn
import torch.optim as optim
from sklearn import model_selection
from tqdm import tqdm
from torch.optim import lr_scheduler
import segmentation_models_pytorch as smp
from collections import OrderedDict

EPOCHS = 1
TRAIN_BATCH_SIZE = 16
VALID_BATCH_SIZE = 4
ENCODER = 'resnet18'
ENCODER_WEIGHTS = 'imagenet'
DEVICE = 'cpu'

def train(
    dataset, data_loader,
    model, criterion, optimizer
):
    model.train()
    num_batchs = int(len(dataset) / data_loader.batch_size)
    tk0 = tqdm(data_loader, total=num_batchs)
    for inputs, targets in tk0:
        optimizer.zero_grad()
        preds = model(inputs)
        # squeeze only the channel axis so a batch of size 1 keeps its batch axis
        loss = criterion(preds.squeeze(1), targets)
        loss.backward()
        optimizer.step()
    tk0.close()

def evaluate(
    dataset, data_loader,
    model, criterion
):
    model.eval()
    final_loss = 0.0
    num_batches = len(data_loader)  # also counts the last partial batch
    tk0 = tqdm(data_loader, total=num_batches)
    # no gradients are needed for validation
    with torch.no_grad():
        for inputs, targets in tk0:
            preds = model(inputs)
            # squeeze only the channel axis so a batch of size 1 still works
            loss = criterion(preds.squeeze(1), targets)
            final_loss += loss.item()
    tk0.close()
    return final_loss / num_batches

df = pd.read_csv(data_path)
df.columns = ['id', 'target']
df_train, df_valid = model_selection.train_test_split(
    df, random_state=42, test_size=0.2
)
df_train = df_train.reset_index(drop=True)
df_valid = df_valid.reset_index(drop=True)
model = smp.Unet(
    encoder_name=ENCODER,
    encoder_weights=ENCODER_WEIGHTS,
    classes=1,
    activation=None
)
prep_fn = smp.encoders.get_preprocessing_fn(
    ENCODER, ENCODER_WEIGHTS
)

train_dataset = SIIMDataset(
    df_train, preprocessing_fn=prep_fn
)
train_loader = torch.utils.data.DataLoader(
    train_dataset, TRAIN_BATCH_SIZE, shuffle=True  # shuffle training samples each epoch
)
valid_dataset = SIIMDataset(
    df_valid, transform=False, preprocessing_fn=prep_fn
)
valid_loader = torch.utils.data.DataLoader(
    valid_dataset, VALID_BATCH_SIZE, shuffle=False
)

criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', patience=3, verbose=True
)

print(f"Train batch size: {TRAIN_BATCH_SIZE}")
print(f"Valid batch size: {VALID_BATCH_SIZE}")
print(f"Epochs: {EPOCHS}")
print(f"Number of training images: {len(train_dataset)}")
print(f"Number of validation images: {len(valid_dataset)}")
print(f"Encoder: {ENCODER}")

for epoch in range(EPOCHS):
    print(f"Train Epoch: {epoch}")
    train(
        train_dataset, train_loader,
        model, criterion, optimizer
    )
    print(f"Valid Epoch: {epoch}")
    loss = evaluate(
        valid_dataset, valid_loader,
        model, criterion
    )
    scheduler.step(loss)  # the scheduler is not callable; step() takes the monitored loss
    print("")

For segmentation problems there are several loss functions to choose from, such as pixel-wise binary cross-entropy, focal loss, and dice loss.
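As a rough illustration of two of these losses, the NumPy sketch below computes pixel-wise binary cross-entropy and a common smoothed form of dice loss on predicted probabilities. The function names, the `smooth` term, and the toy masks are assumptions made for this example; the training script above uses `nn.BCEWithLogitsLoss`, which works on raw logits instead.

```python
import numpy as np


def bce_loss(probs, targets, eps=1e-7):
    """Mean pixel-wise binary cross-entropy on probabilities."""
    probs = np.clip(np.asarray(probs, dtype=np.float64), eps, 1 - eps)
    targets = np.asarray(targets, dtype=np.float64)
    return float(-np.mean(targets * np.log(probs)
                          + (1 - targets) * np.log(1 - probs)))


def dice_loss(probs, targets, smooth=1.0):
    """1 - dice coefficient, with a smoothing term to avoid division by zero."""
    probs = np.asarray(probs, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    intersection = np.sum(probs * targets)
    dice = (2.0 * intersection + smooth) / (np.sum(probs) + np.sum(targets) + smooth)
    return float(1.0 - dice)


targets = np.array([[0, 1], [1, 0]], dtype=np.float32)  # made-up 2x2 mask
perfect = targets.copy()   # prediction identical to the mask
inverted = 1.0 - targets   # prediction wrong on every pixel

print(dice_loss(perfect, targets))   # 0.0
print(dice_loss(inverted, targets))  # 0.8
print(bce_loss(perfect, targets) < bce_loss(inverted, targets))  # True
```

Dice loss measures the overlap between prediction and mask directly, which is why it is often preferred when the foreground covers only a small part of the image.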