Original paper: arxiv.org/pdf/2402.18…
Modality-Agnostic Structural Image Representation Learning for Deformable Multi-Modality Medical Image Registration
Tony C. W. Zi Yunhao Jianpeng Wei
Yan-Jie Zhou Ke Yan
1 DAMO Academy, Alibaba Group
2 Hupan Lab, 310023, Hangzhou, China
Shengjing Hospital of China Medical University, China
College of Computer Science and Technology, Zhejiang University, China
Abstract
Establishing dense anatomical correspondence across distinct imaging modalities is a foundational yet challenging procedure for numerous medical image analysis studies and image-guided radiotherapy. Existing multi-modality image registration algorithms rely on statistical-based similarity measures or local structural image representations. However, the former is sensitive to locally varying noise, while the latter is not discriminative enough to cope with complex anatomical structures in multimodal scans, causing ambiguity in determining the anatomical correspondence across scans with different modalities. In this paper, we propose a modality-agnostic structural representation learning method, which leverages Deep Neighbourhood Self-similarity (DNS) and anatomy-aware contrastive learning to learn discriminative and contrast-invariant deep structural image representations (DSIR) without the need for anatomical delineations or pre-aligned training images. We evaluate our method on multiphase CT, abdomen MR-CT, and brain MR T1w-T2w registration. Comprehensive results demonstrate that our method is superior to conventional local structural representations and statistical-based similarity measures in terms of discriminability and accuracy.
1. Introduction
Determining anatomical correspondence between multimodal data is crucial for medical image analysis and clinical applications, including diagnostic settings [33], surgical planning and post-operative evaluation [39]. As a vital component for modern medical image analysis studies and image-guided interventions, deformable multimodal registration aims to establish the dense anatomical correspondence between multimodal scans and fuse information from multimodal scans, e.g., propagating anatomical or tumour delineation for image-guided radiotherapy [30]. Since different imaging modalities provide valuable complementary visual cues and diagnosis information of the patient, precise anatomical alignment between multimodal scans benefits the radiological observation and the
Figure 1. Visualization of feature similarity between the marked feature vector (red dot) of the image and all feature vectors of augmented images using a convolutional neural network without pretraining (CNN), the Modality Independent Neighbourhood Descriptor (MIND), and our proposed Deep Neighbourhood Self-similarity (DNS). Our method captures a contrast-invariant and highly discriminative structural representation of the image, reducing the ambiguity in matching the anatomical correspondence between multimodal images.
*Contributed equally.
subsequent downstream computerized analyses. However, finding anatomical correspondences between homologous points in multimodal images is notoriously challenging due to the complex appearance changes across modalities. For instance, in multiphase abdomen computed tomography (CT) scans, the soft tissues can be deformed due to gravity, body motion, and other muscle contractions, resulting in an unavoidable large non-linear misalignment between subsequent imaging scans. Moreover, anatomical structures and tumours in multiphase CT scans show heterogeneous intensity distribution across different multiphase contrast-enhanced CT scans due to the intravenously injected contrast agent during the multiphase CT imaging.
Despite vast research on deformable image registration, most studies focus on the mono-modal registration setting and rely on intensity-based similarity metrics, e.g., normalized cross-correlation (NCC) and mean squared error (MSE), which are not applicable to multimodal registration. Recently, several methods have proposed to learn an inter-domain similarity metric using supervised learning with pre-aligned training images [14, 25, 44]. However, perfectly aligned images and ideal ground-truth deformations are often absent in multimodal medical imaging, which limits the applicability of these methods.
Historically, a pioneering work of Maes et al. [32] uses mutual information (MI) [55] to perform rigid multimodal registration. Nevertheless, for deformable multimodal registration, many disadvantages have been identified when using the MI-based similarity measures [45]. Specifically, MI-based similarity measures are sensitive to locally varying noise distribution but not sensitive to the subtle anatomical and vascular structures due to the statistical nature of MI.
As an alternative to directly assessing similarity or MI on the original images, structural image representation approaches have gained great interest for deformable multimodal registration. By computing an intermediate structural image representation independent of the underlying image acquisition, well-established monomodal optimization techniques can be employed to address the multimodal registration problem. A prominent example is the Modality-Independent Neighbourhood Descriptor [17], which is motivated by image self-similarity [48] and able to capture the internal geometric layouts of local self-similarities within images. Yet, such local feature descriptors are not expressive and discriminative enough to cope with complex anatomical structures in abdomen CT, exhibiting many local optima, as shown in Fig. 1. Therefore, they are often used jointly with a dedicated optimization strategy or require a robust initialization.
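The self-similarity idea behind MIND can be sketched in a few lines. The following is a deliberately simplified, hypothetical variant (single-voxel "patches" and a six-neighbourhood instead of the Gaussian-weighted patch distances of the original descriptor), shown only to illustrate why such descriptors are contrast invariant yet locally ambiguous:

```python
import numpy as np

def local_self_similarity(img):
    """Toy MIND-style descriptor on a 3D volume: for each voxel, the
    exp-normalised squared difference to its six axial neighbours.
    Simplified: single-voxel 'patches' instead of Gaussian patch distances."""
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    D, H, W = img.shape
    pad = np.pad(img.astype(np.float64), 1, mode="edge")
    d = np.stack([(img - pad[1 + dx:D + 1 + dx,
                             1 + dy:H + 1 + dy,
                             1 + dz:W + 1 + dz]) ** 2
                  for dx, dy, dz in offsets])          # (6, D, H, W)
    v = d.mean(axis=0) + 1e-8                          # local noise estimate
    return np.exp(-d / v)                              # descriptor in (0, 1]
```

Because the distances are normalised by a local noise estimate, the descriptor is unchanged under an affine intensity remapping; however, any two locations with similar local geometry produce nearly identical descriptors, which is the source of the local optima discussed above.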
In this paper, we analyze and expose the limitations of self-similarity-based feature descriptors and mutual information-based methods in multi-modality registration. We depart from the classical self-similarity descriptor and propose a novel structural image representation learning paradigm dedicated to learning expressive deep structural image representations (DSIRs) for deformable multimodal registration. Our proposed method reduces the multimodal registration problem to a monomodal one, in which existing well-established monomodal registration algorithms can be applied. To the best of our knowledge, this is the first modality-agnostic structural representation learning approach that learns to capture DSIR with high discriminability from multimodal images without using perfectly aligned image pairs or anatomical delineation.
The main contributions of this work are as follows:
- We propose a novel self-supervised deep structural representation learning approach for multimodal image registration that learns to extract deep structural image representations from standalone medical images, circumventing the need for anatomical delineations or perfectly aligned training image pairs for supervision.
- We introduce the Deep Neighbourhood Self-similarity (DNS), which captures long-range and complex structural information from medical images, addressing the ambiguity in classical feature descriptors and similarity metrics.
- We propose a novel contrastive learning strategy with non-linear intensity transformation, maximizing the discriminability of the feature representation across anatomical positions with homogeneous and heterogeneous intensity distributions.
- We demonstrate that the proposed deep structural image representation can be adapted to a variety of well-established learning-based and iterative optimization registration algorithms, reducing the multimodal registration problem to a monomodal one.
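As a concrete, hypothetical instance of the non-linear intensity transformation in the third contribution, one simple choice is a random monotone piecewise-linear remapping of the normalised intensities; the exact transformation used by the authors may differ:

```python
import numpy as np

def random_intensity_remap(img, n_knots=8, rng=None):
    """Hypothetical augmentation in the spirit of the paper's non-linear
    intensity transformation: remap normalised intensities through a random
    monotone piecewise-linear curve, simulating a contrast/modality change
    while leaving the underlying anatomy (structure) untouched."""
    rng = rng or np.random.default_rng()
    x = np.linspace(0.0, 1.0, n_knots)                 # fixed input knots
    y = np.sort(rng.random(n_knots))                   # random monotone outputs
    y = (y - y[0]) / (y[-1] - y[0] + 1e-12)            # span the full [0, 1] range
    lo, hi = img.min(), img.max()
    norm = (img - lo) / (hi - lo + 1e-12)              # normalise to [0, 1]
    return np.interp(norm, x, y)                       # apply the random curve
```

Two such remaps of the same image share anatomy but not intensity statistics, giving natural positive pairs for contrastive training without any aligned multimodal data.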
We rigorously evaluate the proposed method on three challenging multimodal registration tasks: liver multiphase CT registration, abdomen magnetic resonance imaging (MR) to CT registration, and brain MR T1w-T2w registration. Results demonstrate that our method is capable of computing highly expressive and discriminative structural representations of multimodal images, reaching the performance of state-of-the-art conventional methods solely with a simple gradient descent-based optimization framework.
2. Related Work
Multi-modal image registration. In general, multimodal registration methods can be divided into three categories: statistical-based, structural representation-based and deep learning-based methods.
Prominent examples of statistical-based methods use information theory and optimize joint voxel statistics, such as maximizing MI or normalized mutual information (NMI) as similarity measures for multi-modal registration. These approaches aim to estimate a solution that minimizes the entropy of the joint histogram between image pairs for rigid registration. However, as evidenced in [45], MI-based similarity measures are restricted to measuring the statistical co-occurrence of image intensities: they do not extend to the co-occurrence of complex patterns, such as the subtle structural information of soft tissues and vascular structures in medical images, and they are sensitive to locally varying noise distributions, which is not ideal for deformable multi-modal registration.
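The statistical nature of MI is easy to see in code. Below is a minimal histogram-based MI estimator (a sketch, not the estimator of any specific registration package); note that it depends only on the joint intensity histogram, so any spatial rearrangement applied identically to both images leaves it unchanged:

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Mutual information from the joint intensity histogram of two images.
    Purely statistical: permuting both inputs with the same permutation
    leaves MI unchanged, which is why MI is blind to spatial structure."""
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = hist / hist.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0                                       # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))
```

The permutation invariance demonstrated in the assertions below is exactly the weakness cited above: MI cannot distinguish a well-aligned pair from a spatially scrambled one with the same joint statistics.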
To circumvent the limitations of statistical-based methods, another common strategy is to project multimodal images into a common intermediate structural representation, or to measure the misalignment through modality-invariant similarity metrics. Prominent examples include MIND [17, 18, 20], attribute vectors [49] and the Linear Correlation of Linear Combination, which are designed to capture a dense structural image representation of multimodal images that is independent of the underlying image acquisition. However, these local structural representations are often not sufficiently discriminative and expressive to drive a non-rigid registration with many degrees of freedom, and can be non-differentiable or expensive to compute. As such, they are often used in conjunction with robust optimization methods, e.g., discrete optimization [19] and Bound Optimization by Quadratic Approximation [46], or require a robust rigid initialization.
Deep learning-based image registration (DLIR) methods have demonstrated remarkable results on diverse mono-modal and multi-modal registration tasks, as evidenced by numerous registration benchmarks [10, 22]. However, the success of recent DLIR approaches has largely been fueled by supervision from anatomical segmentation labels or from perfectly aligned multimodal images. The absence of pre-aligned multimodal medical images and the dependence on segmentation labels further restrict their generalizability across different anatomies and modalities. In contrast to mainstream DLIR and learning-based structural representation methods, our proposed method is fully self-supervised, which circumvents the need for anatomical delineations or perfectly aligned multi-modal images.
Contrastive learning in image registration. Motivated by the success of contrastive learning in visual representation learning, several methods adopt contrastive learning to extract anatomical structural embeddings for monomodal registration. Contrastive learning focuses on extracting discriminative representations by contrasting positive and negative pairs of instances [16]. These methods use noise contrastive estimation (NCE) [54] to learn an anatomical structural representation in which feature vectors from the same anatomical location are pulled together, in contrast to feature vectors from different anatomical locations. Nevertheless, the anatomical structural representations learned by these methods are not contrast invariant and hence are incapable of multimodal registration. Apart from learning structural representations, a recent work [9] directly minimizes the PatchNCE [43] loss for brain MR T1w-T2w registration. Yet, minimizing PatchNCE is identical to maximizing a lower bound on the mutual information between corresponding spatial locations in the feature maps, which inherits the limitations of MI in deformable registration.
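The NCE-style objective used by these structural-embedding methods can be sketched as a generic InfoNCE over paired feature vectors (the temperature value and pairing scheme here are illustrative, not the exact loss of any cited method):

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE over L2-normalised feature vectors: row i of `anchors` and
    row i of `positives` come from the same anatomical location (positive
    pair); all other rows serve as negatives. A lower loss means a more
    discriminative representation."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                     # cosine similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))          # -log softmax on diagonal
```

Training drives the diagonal (same-location) similarities above all off-diagonal (different-location) ones, which is precisely the discriminability property the DSIR is meant to have.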
3. Method
3.1. Problem Setup and Overview
Let $F$ and $M$ be the fixed and moving volumes defined over an $n$-D mutual spatial domain $\Omega$. For simplicity, we further assume that $F$ and $M$ are three-dimensional, single-channel grayscale images, i.e., $n = 3$. In this paper, we aim to extract DSIRs of $F$ and $M$, i.e., $S_F$ and $S_M$, with high discriminability, such that the cosine similarity of two feature vectors is maximised only at identical anatomical locations $p$, i.e., between $S_F(p)$ and $S_M(p)$, or at locations with similar structural information. To this end, we introduce a Modality-Agnostic deep Structural Representation Network (MASR-Net, Sec. 3.2) and an anatomy-aware contrastive learning paradigm (Sec. 3.3), followed by a multimodal similarity metric formulation with DNS (Sec. 3.4).
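Under this setup, the registration-time similarity reduces to a voxel-wise cosine similarity between the two DSIRs. A minimal sketch, assuming channel-first arrays (the full similarity metric formulation is given in Sec. 3.4):

```python
import numpy as np

def cosine_similarity_map(s_f, s_m, eps=1e-8):
    """Voxel-wise cosine similarity between two DSIRs of shape (C, D, H, W).
    A well-learned representation gives values near 1 only where the two
    volumes show the same (or structurally similar) anatomy."""
    num = np.sum(s_f * s_m, axis=0)
    den = np.linalg.norm(s_f, axis=0) * np.linalg.norm(s_m, axis=0) + eps
    return num / den                                   # (D, H, W), in [-1, 1]
```

Maximising this map under a deformation model is a monomodal problem, so any well-established monomodal registration algorithm can consume the DSIRs directly.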
The overview of MASR-Net and anatomy-aware contrastive learning paradigm is illustrated in the lower and upper panels of Fig. 2, respectively. Our network first computes the image feature with an encoder-decoder network, extracts the deep structural information from the feature map with the DNS extractor, and encodes them with the feature squeezing module. The proposed network is trained with non-linear intensity transformation, followed by anatomy-aware contrastive learning. With these components, the complex structural and anatomical location-aware information are well reflected in the resulting deep intermediate structural representation.
3.2. Modality-Agnostic Deep Structural Representation Network
Feature extraction. We first leverage a feed-forward 3D convolutional neural network (CNN) to extract the image feature map of the input image. The proposed CNN is built with a 4-level encoder-decoder structure with skip connections [47]; it is composed of 3D convolution layers and LeakyReLU activations [31], and uses BlurPool [61] and trilinear interpolation for downsampling and upsampling, respectively, resulting in a maximum striding factor of 8. The network takes the input image and outputs an image feature map. The details of the network are given in the supplementary material.
Figure 2. Overview of the Modality-Agnostic Deep Structural Representation Network (MASR-Net) and anatomy-aware contrastive learning paradigm. For brevity, we visualize the 3D feature maps in a 2D aspect. Only negative pairs of the first feature vector are shown.
Deep Neighbourhood Self-similarity (DNS). The vanilla local self-similarity descriptor [48] captures internal geometric layouts of local self-similarities within images by computing the pairwise distances between the patch centre and its neighbourhood within a patch at the pixel level. Yet, this formulation may be sensitive to the noise present in the image, especially for medical images. In contrast, our proposed DNS is computed at the feature level and avoids using the patch centre in the calculation; it is capable of capturing complex structural information of the image beyond the local neighbourhood context and is less sensitive to the noise present in medical images.
Formally, given the feature map $f$ of the input image, a patch centred at $x$, and a certain neighbourhood layout $\mathcal{N}$, the proposed deep neighbourhood self-similarity $\mathrm{DNS}(f, x)$ is given by:

$$\mathrm{DNS}(f, x)_{(i,j)} = \exp\left(-\frac{\lVert f(x + r_i) - f(x + r_j) \rVert_2^2}{V(x)}\right), \quad r_i, r_j \in \mathcal{N},\ i \neq j,$$

where $r_i$ defines a neighbour location relative to $x$. The denominator $V(x)$ is a noise estimator, defined as the mean of all patch distances, i.e., $V(x) = \frac{1}{|\mathcal{P}|}\sum_{(i,j) \in \mathcal{P}} \lVert f(x + r_i) - f(x + r_j) \rVert_2^2$, where $\mathcal{P}$ denotes the set of neighbour pairs in $\mathcal{N}$. We follow [20] to further restrict the pairwise distance calculations within the six-neighbourhood (6-NH) to pairs separated by a Euclidean distance of $\sqrt{2}$, reducing the computation complexity of DNS, i.e., reducing the number of unique pairwise distances from 15 to 12.
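The DNS computation described above can be sketched as follows: a hypothetical NumPy implementation for one neighbourhood layout, using edge padding at the volume border (a detail the text does not specify):

```python
import numpy as np
from itertools import combinations

def deep_neighbourhood_self_similarity(f):
    """Sketch of DNS on a feature map f of shape (C, D, H, W): pairwise
    squared distances between the six axial neighbours of every voxel
    (patch centre excluded), restricted to the 12 neighbour pairs at
    Euclidean distance sqrt(2), normalised by their mean and mapped
    through exp(-d/V) as in MIND-style descriptors."""
    offs = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    C, D, H, W = f.shape
    pad = np.pad(f, ((0, 0), (1, 1), (1, 1), (1, 1)), mode="edge")
    shift = {o: pad[:, 1 + o[0]:D + 1 + o[0],
                    1 + o[1]:H + 1 + o[1],
                    1 + o[2]:W + 1 + o[2]] for o in offs}
    # keep only neighbour pairs sqrt(2) apart (drops the 3 opposite pairs)
    pairs = [(a, b) for a, b in combinations(offs, 2)
             if sum((x - y) ** 2 for x, y in zip(a, b)) == 2]
    d = np.stack([((shift[a] - shift[b]) ** 2).sum(axis=0) for a, b in pairs])
    v = d.mean(axis=0) + 1e-8                          # noise estimator V(x)
    return np.exp(-d / v)                              # (12, D, H, W)
```

Because each pairwise distance is divided by the per-voxel mean distance, uniformly rescaling the feature map leaves the output unchanged, which is the contrast-invariance property the descriptor is designed around.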
To further maximize the discriminability of the computed feature, we compute two sets of DNS from $f$ using two different neighbourhood layouts, $\mathcal{N}_1$ and $\mathcal{N}_2$. We define $\mathcal{N}_1$ and $\mathcal{N}_2$ to be the direct 6-NH and dilated 6-NH layouts, respectively, as shown in Fig. 2. The DNS of the two neighbourhood layouts is then concatenated to form a 5-D feature map containing the deep structural information of the feature map $f$.
Feature Squeezing. To compute a compact intermediate DSIR from the 5-D DNS feature map, we encode the high-dimensional 5-D feature map into a compact 4-D DNS descriptor using a feature squeezing module. The feature squeezing module consists of a single linear layer, followed by a feed-forward convolution head. It first takes the 5-D DNS feature vector as input and encodes it into a compact deep structural embedding using a linear projection. The feed-forward convolution head is composed of two 3D convolution layers with a LeakyReLU activation in between, which further encode the compact deep structural embedding into the DSIR.
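The linear-projection half of the feature squeezing module amounts to a per-voxel matrix multiplication. A minimal sketch, with an illustrative channel count (24 = 12 pairs × 2 layouts) and the convolution head omitted:

```python
import numpy as np

def squeeze_dns(dns_two_layouts, weight, bias):
    """Sketch of the feature-squeezing linear projection: the concatenated
    DNS of the two neighbourhood layouts, shape (K, D, H, W), is treated
    as a K-dim vector per voxel and linearly projected to a compact c-dim
    structural embedding of shape (c, D, H, W)."""
    K, D, H, W = dns_two_layouts.shape
    flat = dns_two_layouts.reshape(K, -1).T            # (voxels, K)
    emb = flat @ weight + bias                         # (voxels, c)
    return emb.T.reshape(-1, D, H, W)                  # (c, D, H, W)
```

Applying the same weights at every voxel keeps the projection translation-equivariant, so the squeezed DSIR can be compared across spatial locations just like the raw DNS.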
3.3. Anatomy-aware Contrastive Learning
The deep intermediate structural representation using DNS is a regional descriptor that is able to capture local geometric structures in the feature map while suppressing appearance variation inside it. However, most regions in the image may share similar local geometric structures or suffer from image noise, causing ambiguity in matching the true anatomical correspondence. To further enhance the dis-