U-Net: Convolutional Networks for Biomedical Image Segmentation (Translation)


Original article: U-Net: Convolutional Networks for Biomedical Image Segmentation

U-Net: Convolutional Networks for Biomedical Image Segmentation


Olaf Ronneberger, Philipp Fischer, and Thomas Brox


Computer Science Department and BIOSS Centre for Biological Signalling Studies,


University of Freiburg, Germany


ronneber@informatik.uni-freiburg.de, WWW home page: lmb.informatik.uni-freiburg.de/


Abstract. There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512×512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at lmb.informatik.uni-freiburg.de/people/ronn….


1 Introduction


In the last two years, deep convolutional networks have outperformed the state of the art in many visual recognition tasks, e.g. [7,3]. While convolutional networks have already existed for a long time [8], their success was limited due to the size of the available training sets and the size of the considered networks. The breakthrough by Krizhevsky et al. [7] was due to supervised training of a large network with 8 layers and millions of parameters on the ImageNet dataset with 1 million training images. Since then, even larger and deeper networks have been trained [12].


The typical use of convolutional networks is on classification tasks, where the output to an image is a single class label. However, in many visual tasks, especially in biomedical image processing, the desired output should include localization, i.e., a class label is supposed to be assigned to each pixel. Moreover, thousands of training images are usually beyond reach in biomedical tasks. Hence, Ciresan et al. [1] trained a network in a sliding-window setup to predict the class label of each pixel by providing a local region (patch) around that pixel as input. First, this network can localize. Secondly, the training data in terms of patches is much larger than the number of training images. The resulting network won the EM segmentation challenge at ISBI 2012 by a large margin.

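The sliding-window setup of [1] can be pictured as running a classifier on one local patch per pixel. A minimal NumPy sketch of the patch extraction such a classifier would consume (all names here are illustrative, not from [1]):

```python
import numpy as np

def extract_patch(image, row, col, patch_size):
    """Cut the local region around one pixel, mirror-padding at the
    borders, so a per-pixel classifier can be run sliding-window style."""
    half = patch_size // 2
    padded = np.pad(image, half, mode="reflect")  # mirror the borders
    return padded[row:row + patch_size, col:col + patch_size]

# One forward pass per pixel: the overlap between neighbouring patches
# is exactly the redundancy criticized in the next paragraph.
image = np.arange(36, dtype=float).reshape(6, 6)
patch = extract_patch(image, 0, 0, 5)
assert patch.shape == (5, 5)
assert patch[2, 2] == image[0, 0]  # centre pixel is the one being classified
```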

Fig. 1. U-net architecture (example for 32×32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.


Obviously, the strategy in Ciresan et al. [1] has two drawbacks. First, it is quite slow because the network must be run separately for each patch, and there is a lot of redundancy due to overlapping patches. Secondly, there is a trade-off between localization accuracy and the use of context. Larger patches require more max-pooling layers that reduce the localization accuracy, while small patches allow the network to see only little context. More recent approaches [11,4] proposed a classifier output that takes into account the features from multiple layers. Good localization and the use of context are possible at the same time.


In this paper, we build upon a more elegant architecture, the so-called "fully convolutional network" [9]. We modify and extend this architecture such that it works with very few training images and yields more precise segmentations; see Figure 1. The main idea in [9] is to supplement a usual contracting network by successive layers, where pooling operators are replaced by upsampling operators. Hence, these layers increase the resolution of the output. In order to localize, high resolution features from the contracting path are combined with the upsampled output. A successive convolution layer can then learn to assemble a more precise output based on this information.


Fig. 2. Overlap-tile strategy for seamless segmentation of arbitrary large images (here segmentation of neuronal structures in EM stacks). Prediction of the segmentation in the yellow area requires image data within the blue area as input. Missing input data is extrapolated by mirroring.


One important modification in our architecture is that in the upsampling part we also have a large number of feature channels, which allow the network to propagate context information to higher resolution layers. As a consequence, the expansive path is more or less symmetric to the contracting path, and yields a u-shaped architecture. The network does not have any fully connected layers and only uses the valid part of each convolution, i.e., the segmentation map only contains the pixels, for which the full context is available in the input image. This strategy allows the seamless segmentation of arbitrarily large images by an overlap-tile strategy (see Figure 2). To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image. This tiling strategy is important to apply the network to large images, since otherwise the resolution would be limited by the GPU memory.

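As a rough illustration of the overlap-tile strategy, the NumPy sketch below mirror-pads an image and cuts overlapping input tiles whose central, valid parts cover it seamlessly (tile and margin sizes here are illustrative, not the paper's):

```python
import numpy as np

def mirror_pad(image, margin):
    """Extrapolate missing context at the image border by mirroring,
    as in the overlap-tile strategy of Figure 2."""
    return np.pad(image, margin, mode="reflect")

def tiles(image, tile, margin):
    """Yield overlapping input tiles whose valid (central) parts cover
    the image without gaps. `tile` is the size of the valid output
    region; each input tile is larger by `margin` on every side."""
    padded = mirror_pad(image, margin)
    for r in range(0, image.shape[0], tile):
        for c in range(0, image.shape[1], tile):
            yield padded[r:r + tile + 2 * margin, c:c + tile + 2 * margin]

image = np.zeros((8, 8))
parts = list(tiles(image, tile=4, margin=2))
assert len(parts) == 4            # a 2 x 2 grid of output tiles
assert parts[0].shape == (8, 8)   # 4 valid + 2 context pixels per side
```

In the real network the margin is the total border lost to the unpadded convolutions, so the segmentation of each tile's valid region is exact, and stitching the tiles is seam-free.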

As for our tasks there is very little training data available, we use excessive data augmentation by applying elastic deformations to the available training images. This allows the network to learn invariance to such deformations, without the need to see these transformations in the annotated image corpus. This is particularly important in biomedical segmentation, since deformation used to be the most common variation in tissue and realistic deformations can be simulated efficiently. The value of data augmentation for learning invariance has been shown in Dosovitskiy et al. [2] in the scope of unsupervised feature learning.

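The elastic augmentation can be sketched as drawing random displacements on a coarse grid, upsampling them to a per-pixel displacement field, and resampling the image. The paper uses bicubic interpolation of the displacements; the dependency-free sketch below uses nearest-neighbour upsampling and sampling instead, and all parameter values are illustrative:

```python
import numpy as np

def elastic_deform(image, grid=3, sigma=1.0, seed=0):
    """Rough sketch of elastic deformation: random Gaussian displacement
    vectors on a coarse grid, upsampled to full resolution, then used to
    resample the image (nearest neighbour, for simplicity)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    # random displacement vectors (dy, dx) on a coarse grid
    coarse = rng.normal(0.0, sigma, size=(2, grid, grid))
    # upsample the coarse field to full resolution (nearest neighbour)
    reps = (-(-h // grid), -(-w // grid))  # ceil division
    field = np.kron(coarse, np.ones(reps))[:, :h, :w]
    rows = np.clip(np.arange(h)[:, None] + field[0], 0, h - 1).round().astype(int)
    cols = np.clip(np.arange(w)[None, :] + field[1], 0, w - 1).round().astype(int)
    return image[rows, cols]

warped = elastic_deform(np.arange(100, dtype=float).reshape(10, 10))
assert warped.shape == (10, 10)  # the deformation preserves the image size
```

The same displacement field would be applied to the ground-truth segmentation map, so image and labels stay consistent.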

Another challenge in many cell segmentation tasks is the separation of touching objects of the same class; see Figure 3. To this end, we propose the use of a weighted loss, where the separating background labels between touching cells obtain a large weight in the loss function.

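A sketch of such a weight map: background pixels are weighted by how close they lie to the two nearest cells, so the thin ridge between touching cells dominates the loss. The w0·exp(−(d1+d2)²/(2σ²)) form and the values w0=10, σ=5 follow the paper's training section; the brute-force distance search here is only for illustration:

```python
import numpy as np

def touching_cell_weights(labels, w0=10.0, sigma=5.0):
    """Weight map for the separating background between touching cells.
    `labels`: integer image, 0 = background, 1..n = cell instances
    (at least two cells are assumed)."""
    h, w = labels.shape
    ids = [i for i in np.unique(labels) if i != 0]
    ys, xs = np.mgrid[0:h, 0:w]
    dists = []
    for i in ids:
        cy, cx = np.nonzero(labels == i)
        # distance from every pixel to the nearest pixel of cell i
        d = np.sqrt((ys[..., None] - cy) ** 2 + (xs[..., None] - cx) ** 2).min(-1)
        dists.append(d)
    d1, d2 = np.sort(np.stack(dists), axis=0)[:2]  # two nearest cells
    weights = np.zeros((h, w))
    background = labels == 0
    weights[background] = (w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2)))[background]
    return weights

labels = np.zeros((7, 7), dtype=int)
labels[1:3, 1:3] = 1   # cell 1
labels[1:3, 4:6] = 2   # cell 2, one background column between them
weights = touching_cell_weights(labels)
assert weights[1, 3] > weights[6, 0]  # the gap pixel outweighs distant background
```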

The resulting network is applicable to various biomedical segmentation problems. In this paper, we show results on the segmentation of neuronal structures in EM stacks (an ongoing competition started at ISBI 2012), where we outperformed the network of Ciresan et al. [1]. Furthermore, we show results for cell segmentation in light microscopy images from the ISBI cell tracking challenge 2015. Here we won with a large margin on the two most challenging 2D transmitted light datasets.


2 Network Architecture


The network architecture is illustrated in Figure 1. It consists of a contracting path (left side) and an expansive path (right side). The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3×3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) and a 2×2 max pooling operation with stride 2 for downsampling. At each downsampling step we double the number of feature channels. Every step in the expansive path consists of an upsampling of the feature map followed by a 2×2 convolution ("up-convolution") that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3×3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss of border pixels in every convolution. At the final layer a 1×1 convolution is used to map each 64-component feature vector to the desired number of classes. In total the network has 23 convolutional layers.

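The size arithmetic above can be traced in a few lines: every level loses 2 pixels per 3×3 valid convolution, halves at each pooling, and doubles at each up-convolution, which is why the 572×572 input tile of Figure 1 yields a 388×388 segmentation map:

```python
def unet_output_size(n, depth=4):
    """Trace the x-y size of the feature maps through the U-Net of
    Figure 1: two unpadded 3x3 convolutions (-2 pixels each) per level,
    2x2 max pooling on the way down, 2x2 up-convolution on the way up."""
    sizes = []
    for _ in range(depth):          # contracting path
        n -= 4                      # two valid 3x3 convolutions
        sizes.append(n)             # size handed across the skip connection
        n //= 2                     # 2x2 max pooling, stride 2
    n -= 4                          # two convolutions at the bottom
    for skip in reversed(sizes):    # expansive path
        n *= 2                      # 2x2 up-convolution
        assert skip >= n            # the skip feature map is cropped down to n
        n -= 4                      # two valid 3x3 convolutions
    return n

assert unet_output_size(572) == 388   # the tile size shown in Figure 1
```

The assert inside the loop makes the cropping explicit: the contracting-path feature map is always at least as large as the upsampled one it is concatenated with.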

To allow a seamless tiling of the output segmentation map (see Figure 2), it is important to select the input tile size such that all 2×2 max-pooling operations are applied to a layer with an even x- and y-size.

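This evenness condition is easy to check mechanically; a small sketch for the 4-level contracting path of Figure 1:

```python
def valid_input_tile(n, depth=4):
    """Check that every 2x2 max pooling in the contracting path is
    applied to a feature map with an even x-y size (two unpadded 3x3
    convolutions shrink the map by 4 before each pooling)."""
    for _ in range(depth):
        n -= 4               # two unpadded 3x3 convolutions
        if n <= 0 or n % 2:  # pooling would hit an odd-sized map
            return False
        n //= 2              # 2x2 max pooling
    return True

assert valid_input_tile(572)        # the Figure 1 tile size satisfies it
assert not valid_input_tile(570)    # an odd map appears after one level
```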

3 Training


The input images and their corresponding segmentation maps are used to train the network with the stochastic gradient descent implementation of Caffe [6]. Due to the unpadded convolutions, the output image is smaller than the input by a constant border width. To minimize the overhead and make maximum use of the GPU memory, we favor large input tiles over a large batch size and hence reduce the batch to a single image. Accordingly we use a high momentum (0.99) such that a large number of the previously seen training samples determine the update in the current optimization step.

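The role of the high momentum can be seen in a plain SGD-with-momentum update: with a batch of a single tile, the running velocity lets many previously seen samples shape the current step. A minimal NumPy sketch (the learning rate here is an assumed value, not stated in this section):

```python
import numpy as np

def momentum_step(params, grad, velocity, lr=0.01, momentum=0.99):
    """One SGD update with high momentum, in the style of Caffe's
    solver: the velocity is a decaying sum of past gradients."""
    velocity = momentum * velocity - lr * grad
    return params + velocity, velocity

# With momentum m, a constant gradient direction is amplified by up to
# 1/(1-m) = 100x, i.e. roughly the last hundred tiles drive each step.
p, v = np.zeros(3), np.zeros(3)
g = np.ones(3)
for _ in range(1000):
    p, v = momentum_step(p, g, v)
assert v[0] < -0.9   # velocity approaches -lr/(1-m) = -1.0
```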

The energy function is computed by a pixel-wise soft-max over the final feature map combined with the cross entropy loss function. The soft-max is defined as $p_k(\mathbf{x}) = \exp(a_k(\mathbf{x})) / \left( \sum_{k'=1}^{K} \exp(a_{k'}(\mathbf{x})) \right)$, where $a_k(\mathbf{x})$ denotes the activation in feature channel $k$ at the pixel position $\mathbf{x} \in \Omega$ with $\Omega \subset \mathbb{Z}^2$. $K$ is the number of classes and $p_k(\mathbf{x})$ is the approximated maximum-function, i.e. $p_k(\mathbf{x}) \approx 1$ for the $k$ that has the maximum activation $a_k(\mathbf{x})$ and $p_k(\mathbf{x}) \approx 0$ for all other $k$. The cross entropy then penalizes at each position the deviation of $p_{\ell(\mathbf{x})}(\mathbf{x})$ from 1 using

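The pixel-wise soft-max and cross entropy can be written directly in NumPy; the sketch below omits the per-position weight map w(x) that the paper multiplies in:

```python
import numpy as np

def pixelwise_softmax(a):
    """p_k(x) for an activation map a of shape (K, H, W), stabilised by
    subtracting the per-pixel maximum before exponentiating."""
    e = np.exp(a - a.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def cross_entropy(a, labels):
    """Mean over positions of -log p_l(x)(x): the penalty at each pixel
    for the true-class probability deviating from 1 (unweighted)."""
    p = pixelwise_softmax(a)
    h, w = labels.shape
    picked = p[labels, np.arange(h)[:, None], np.arange(w)[None, :]]
    return -np.log(picked).mean()

a = np.zeros((3, 2, 2))              # K=3 classes, uniform activations
labels = np.array([[0, 1], [2, 0]])  # true class l(x) per pixel
assert np.isclose(cross_entropy(a, labels), np.log(3))  # p_k = 1/3 everywhere
```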