UNet++: A Nested U-Net Architecture for Medical Image Segmentation

Original paper: arXiv:1807.10165

UNet++: A Nested U-Net Architecture for Medical Image Segmentation

Zongwei Zhou, Md Mahfuzur Rahman Siddiquee,

Nima Tajbakhsh, and Jianming Liang

Arizona State University

{zongweiz,mrahmans,ntajbakh,jianming.liang}@asu.edu

Abstract. In this paper, we present UNet++, a new, more powerful architecture for medical image segmentation. Our architecture is essentially a deeply-supervised encoder-decoder network where the encoder and decoder sub-networks are connected through a series of nested, dense skip pathways. The re-designed skip pathways aim at reducing the semantic gap between the feature maps of the encoder and decoder sub-networks. We argue that the optimizer would deal with an easier learning task when the feature maps from the decoder and encoder networks are semantically similar. We have evaluated UNet++ in comparison with U-Net and wide U-Net architectures across multiple medical image segmentation tasks: nodule segmentation in the low-dose CT scans of chest, nuclei segmentation in the microscopy images, liver segmentation in abdominal CT scans, and polyp segmentation in colonoscopy videos. Our experiments demonstrate that UNet++ with deep supervision achieves an average IoU gain of 3.9 and 3.4 points over U-Net and wide U-Net, respectively.

1 Introduction

The state-of-the-art models for image segmentation are variants of the encoder-decoder architecture like U-Net [9] and the fully convolutional network (FCN) [8]. These encoder-decoder networks used for segmentation share a key similarity: skip connections, which combine deep, semantic, coarse-grained feature maps from the decoder sub-network with shallow, low-level, fine-grained feature maps from the encoder sub-network. Skip connections have proved effective in recovering fine-grained details of the target objects, generating segmentation masks with fine details even against complex backgrounds. Skip connections are also fundamental to the success of instance-level segmentation models such as Mask-RCNN, enabling the segmentation of occluded objects. Arguably, image segmentation in natural images has reached a satisfactory level of performance, but do these models meet the strict segmentation requirements of medical images?

Segmenting lesions or abnormalities in medical images demands a higher level of accuracy than what is desired in natural images. While a precise segmentation mask may not be critical in natural images, even marginal segmentation errors in medical images can lead to poor user experience in clinical settings. For instance, the subtle spiculation patterns around a nodule may indicate nodule malignancy; and therefore, their exclusion from the segmentation masks would lower the credibility of the model from the clinical perspective. Furthermore, inaccurate segmentation may also lead to a major change in the subsequent computer-generated diagnosis. For example, an erroneous measurement of nodule growth in longitudinal studies can result in the assignment of an incorrect Lung-RADS category to a screening patient. It is therefore desired to devise more effective image segmentation architectures that can effectively recover the fine details of the target objects in medical images.

To address the need for more accurate segmentation in medical images, we present UNet++, a new segmentation architecture based on nested and dense skip connections. The underlying hypothesis behind our architecture is that the model can more effectively capture fine-grained details of the foreground objects when high-resolution feature maps from the encoder network are gradually enriched prior to fusion with the corresponding semantically rich feature maps from the decoder network. We argue that the network would deal with an easier learning task when the feature maps from the decoder and encoder networks are semantically similar. This is in contrast to the plain skip connections commonly used in U-Net, which directly fast-forward high-resolution feature maps from the encoder to the decoder network, resulting in the fusion of semantically dissimilar feature maps. According to our experiments, the suggested architecture is effective, yielding significant performance gain over U-Net and wide U-Net.

2 Related Work

Long et al. [8] first introduced fully convolutional networks (FCN), while U-Net was introduced by Ronneberger et al. [9]. They both share a key idea: skip connections. In FCN, up-sampled feature maps are summed with feature maps skipped from the encoder, while U-Net concatenates them and adds convolutions and non-linearities between each up-sampling step. Skip connections have been shown to help recover the full spatial resolution at the network output, making fully convolutional methods suitable for semantic segmentation. Inspired by the DenseNet architecture [5], Li et al. [7] proposed H-denseunet for liver and liver tumor segmentation. In the same spirit, Drozdzal et al. [2] systematically investigated the importance of skip connections, and introduced short skip connections within the encoder. Despite the minor differences between the above architectures, they all tend to fuse semantically dissimilar feature maps from the encoder and decoder sub-networks, which, according to our experiments, can degrade segmentation performance.

The other two recent related works are GridNet [3] and Mask-RCNN [4]. GridNet is an encoder-decoder architecture wherein the feature maps are wired in a grid fashion, generalizing several classical segmentation architectures. GridNet, however, lacks up-sampling layers between skip connections; and thus, it does not represent UNet++. Mask-RCNN is perhaps the most important meta framework for object detection, classification and segmentation. We would like to note that

Fig. 1: (a) UNet++ consists of an encoder and decoder that are connected through a series of nested dense convolutional blocks. The main idea behind UNet++ is to bridge the semantic gap between the feature maps of the encoder and decoder prior to fusion. For example, the semantic gap between $(\mathrm{X}^{0,0}, \mathrm{X}^{1,3})$ is bridged using a dense convolution block with three convolution layers. In the graphical abstract, black indicates the original U-Net, green and blue show dense convolution blocks on the skip pathways, and red indicates deep supervision. Red, green, and blue components distinguish UNet++ from U-Net. (b) Detailed analysis of the first skip pathway of UNet++. (c) UNet++ can be pruned at inference time, if trained with deep supervision.

UNet++ can be readily deployed as the backbone architecture in Mask-RCNN by simply replacing the plain skip connections with the suggested nested dense skip pathways. Due to limited space, we were not able to include results of Mask-RCNN with UNet++ as the backbone architecture; however, interested readers can refer to the supplementary material for further details.

3 Proposed Network Architecture: UNet++

Fig. 1a shows a high-level overview of the suggested architecture. As seen, UNet++ starts with an encoder sub-network or backbone followed by a decoder sub-network. What distinguishes UNet++ from U-Net (the black components in Fig. 1a) is the re-designed skip pathways (shown in green and blue) that connect the two sub-networks and the use of deep supervision (shown in red).

3.1 Re-designed skip pathways

Re-designed skip pathways transform the connectivity of the encoder and decoder sub-networks. In U-Net, the feature maps of the encoder are directly received in the decoder; however, in UNet++, they undergo a dense convolution block whose number of convolution layers depends on the pyramid level. For example, the skip pathway between nodes $\mathrm{X}^{0,0}$ and $\mathrm{X}^{1,3}$ consists of a dense convolution block with three convolution layers, where each convolution layer is preceded by a concatenation layer that fuses the output from the previous convolution layer of the same dense block with the corresponding up-sampled output of the lower dense block. Essentially, the dense convolution block brings the semantic level of the encoder feature maps closer to that of the feature maps awaiting in the decoder. The hypothesis is that the optimizer would face an easier optimization problem when the received encoder feature maps and the corresponding decoder feature maps are semantically similar.

Formally, we formulate the skip pathway as follows: let $x^{i,j}$ denote the output of node $\mathrm{X}^{i,j}$, where $i$ indexes the down-sampling layer along the encoder and $j$ indexes the convolution layer of the dense block along the skip pathway. The stack of feature maps represented by $x^{i,j}$ is computed as

$$
x^{i,j} =
\begin{cases}
\mathcal{H}\left(x^{i-1,j}\right), & j = 0\\[4pt]
\mathcal{H}\left(\left[\left[x^{i,k}\right]_{k=0}^{j-1},\ \mathcal{U}\left(x^{i+1,j-1}\right)\right]\right), & j > 0
\end{cases}
\tag{1}
$$

where the function $\mathcal{H}(\cdot)$ is a convolution operation followed by an activation function, $\mathcal{U}(\cdot)$ denotes an up-sampling layer, and $[\,]$ denotes the concatenation layer. Basically, nodes at level $j = 0$ receive only one input from the previous layer of the encoder; nodes at level $j = 1$ receive two inputs, both from the encoder sub-network but at two consecutive levels; and nodes at level $j > 1$ receive $j + 1$ inputs, of which $j$ inputs are the outputs of the previous $j$ nodes in the same skip pathway and the last input is the up-sampled output from the lower skip pathway. The reason that all prior feature maps accumulate and arrive at the current node is because we make use of a dense convolution block along each skip pathway. Fig. 1b further clarifies Eq. 1 by showing how the feature maps travel through the top skip pathway of UNet++.

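The input rule of Eq. 1 can be made concrete with a small symbolic helper that lists which feature maps each node concatenates. This is only a sketch of the connectivity, not an implementation; the helper name `node_inputs` and the `U(...)` label for up-sampling are illustrative.

```python
# Symbolic sketch of the node wiring defined by Eq. 1 (no tensors).
# Node indices (i, j) follow the paper: i is the pyramid level along the
# encoder, j is the convolution layer along the skip pathway.

def node_inputs(i, j):
    """Names of the feature maps concatenated at node X^{i,j}."""
    if j == 0:
        # Encoder backbone: a single input from the previous encoder layer.
        return [f"x{i-1},0"] if i > 0 else ["input image"]
    # j outputs of the earlier nodes on the same skip pathway...
    same_pathway = [f"x{i},{k}" for k in range(j)]
    # ...plus the up-sampled output of the node below: j + 1 inputs total.
    return same_pathway + [f"U(x{i+1},{j-1})"]

# Example: X^{0,2} fuses x^{0,0}, x^{0,1} and the up-sampled x^{1,1}.
print(node_inputs(0, 2))  # → ['x0,0', 'x0,1', 'U(x1,1)']
```

For $j = 1$ this yields exactly the two encoder-derived inputs described above, and for $j > 1$ the $j + 1$ inputs of the dense block.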

3.2 Deep supervision

We propose to use deep supervision [6] in UNet++, enabling the model to operate in two modes: 1) accurate mode wherein the outputs from all segmentation branches are averaged; 2) fast mode wherein the final segmentation map is selected from only one of the segmentation branches, the choice of which determines the extent of model pruning and speed gain. Fig. 1c shows how the choice of segmentation branch in fast mode results in architectures of varying complexity.

Owing to the nested skip pathways, UNet++ generates full resolution feature maps at multiple semantic levels, $\{x^{0,j},\ j \in \{1,2,3,4\}\}$, which are amenable to deep supervision.

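The two inference modes can be sketched in plain Python, assuming the four full-resolution branch outputs are available as per-pixel probability maps; the function names and toy numbers below are illustrative, and a real implementation would operate on tensors.

```python
# Minimal sketch of the two inference modes enabled by deep supervision.
# The four branch outputs x^{0,1}..x^{0,4} are modeled as flat lists of
# per-pixel foreground probabilities.

def accurate_mode(branch_outputs):
    """Accurate mode: average the segmentation maps of all branches."""
    n = len(branch_outputs)
    return [sum(pixels) / n for pixels in zip(*branch_outputs)]

def fast_mode(branch_outputs, j):
    """Fast mode: keep only branch x^{0,j}; deeper branches (and the
    sub-networks feeding them) can be pruned away at inference time."""
    return branch_outputs[j - 1]

# Toy 3-pixel maps from the four branches x^{0,1}..x^{0,4}.
branches = [[0.2, 0.8, 0.5],
            [0.4, 0.6, 0.5],
            [0.4, 0.9, 0.7],
            [0.6, 0.7, 0.3]]

print(accurate_mode(branches))  # per-pixel mean over the four branches
print(fast_mode(branches, 2))   # → [0.4, 0.6, 0.5]
```

Choosing a smaller $j$ in fast mode corresponds to keeping a shallower pruned sub-network, trading accuracy for speed as in Fig. 1c.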
