Learning Large-Factor EM Image Super-Resolution with Generative Priors
Jiateng Shou Zeyu Xiao Shiyu Deng Wei Huang Peiyao Shi
Ruobing Zhang Zhiwei Xiong Feng Wu
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Institute of Artificial Intelligence,Hefei Comprehensive National Science Center
Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences shoujt@mail.ustc.edu.cn {zwxiong,fengwu}@ustc.edu.cn
Abstract
As the mainstream technique for capturing images of biological specimens at nanometer resolution, electron microscopy (EM) is extremely time-consuming for scanning wide field-of-view (FOV) specimens. In this paper, we investigate the challenging task of large-factor EM image super-resolution (EMSR), which holds great promise for reducing scanning time, relaxing acquisition conditions, and expanding the imaging FOV. By exploiting the repetitive structures and volumetric coherence of EM images, we propose the first generative learning-based framework for large-factor EMSR. Specifically, motivated by the predictability of repetitive structures and textures in EM images, we first learn a discrete codebook in the latent space to represent high-resolution (HR) cell-specific priors, and a latent vector indexer to map low-resolution (LR) EM images to their corresponding latent vectors in a generative manner. After incorporating the generative cell-specific priors from HR EM images through a multi-scale prior fusion module, we then deploy multi-image feature alignment and fusion to further exploit the inter-section coherence in the volumetric EM data. Extensive experiments demonstrate that our proposed framework outperforms advanced single-image and video super-resolution methods for ×8 and ×16 EMSR (i.e., with 64 and 256 times less data acquired, respectively), achieving superior visual reconstruction quality and downstream segmentation accuracy on benchmark EM datasets. Code is available at github.com/jtshou/GPEMSR.
1. Introduction
Electron microscopy (EM) is a commonly used imaging technique in life sciences to investigate the ultrastructure of cells, tissues, organelles, and macromolecular complexes, which captures images of biological specimens at nanometer resolution. However, high-quality EM image acquisition typically requires a strict and time-consuming process, involving careful adjustments of beam current, aperture size, and detector settings. This process may take up to years to scan wide field-of-view (FOV) specimens. For example, Zheng et al. [57] spent approximately 16 months to acquire a whole-brain dataset of an adult drosophila melanogaster. The long acquisition time greatly limits the application of EM imaging in analyzing complete biological structures in large specimens, such as neuron connections in mammalian brains.
Image super-resolution (SR), which restores high-resolution (HR) images from their corresponding low-resolution (LR) observations, has the potential to revolutionize EM imaging by allowing faster and less restrictive data acquisition while also providing high-quality images with a wide field of view. By applying SR to EM images (shortened as EMSR hereafter), the capturing time can be significantly reduced and the strict capturing conditions can be relaxed. By deploying a simple ResNet-based UNet model, Fang et al. [13] have demonstrated the promising performance of EMSR for ×4 magnification (i.e., with 16 times less data acquired). However, achieving even larger-factor EMSR to further reduce capturing time remains challenging. This is in accordance with existing methods for natural images, which can achieve satisfactory results for up to ×4 magnification but fail to meet the demand for larger factors.
On the other hand, recent advances in generative models, such as ChatGPT and diffusion-based models [9, 16, 21, 43], reveal powerful capability in automatic content generation, including natural languages and images. This motivates us to consider the EMSR task from a generative perspective. Especially, compared with natural images that possess diverse structures and textures, EM images often exhibit repetitive structures and textures due to the predictability of imaging specimens, making it more suitable to leverage generative learning for accurate reconstruction. In this paper, we propose a novel deep learning-based framework tailored to the challenging task of large-factor EMSR, by 1) exploiting the repetitive structures and textures in EM images with generative cell-specific priors learned from HR EM images, and 2) exploiting the inter-section coherence in the volumetric EM data by aggregating features learned from multiple consecutive images.
Corresponding author.
Specifically, our framework explores cell-specific priors using a VQGAN-Indexer network, consisting of VQ-GAN [12] and our proposed latent vector indexer. We first learn a discrete codebook to represent the distribution of HR EM images in the latent space. The codebook captures both structure and texture information, while the decoder establishes relationships between latent vectors and image patches. We then train a latent vector indexer to acquire the corresponding latent vectors and integrate the indexer with the codebook and the decoder for generating HR EM images. By treating the generation process as an indexing task, we can match LR EM images with their corresponding HR feature representations from the latent space, thereby obtaining priors solely derived from HR EM images.
To maintain reconstruction quality while prioritizing downstream segmentation accuracy, we propose a Multi-Scale Prior Fusion (MPF) module for incorporating the above learned cell-specific priors in EMSR. We use the VQGAN-Indexer output as reference images and learn a mask for fusing reference features based on the patch-level cosine similarity between LR EM images and corresponding reference images. To fully utilize the latent vectors and relationships learned by the decoder, we use multi-scale reference features from different layers of the decoder with varying resolutions. Following the MPF module, our framework includes two key steps for exploiting inter-section coherence in the volumetric EM data: multi-image feature alignment (along the axial direction) and multi-image feature fusion. To this end, we introduce a Pyramid Optical-flow-based Deformable convolution alignment (POD) module and a 3D Spatial-Attention fusion (3DA) module. The former leverages a pre-trained optical-flow network SPyNet [42] and deformable convolutions, while the latter leverages the spatial attention mechanism and 3D convolutions. Both improve reconstruction quality and downstream segmentation accuracy for large-factor EMSR.
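The flow-based part of the alignment step rests on backward warping: features of an adjacent section are resampled at positions displaced by an estimated optical-flow field. As a minimal illustration only (the paper's POD module operates on learned features with SPyNet and deformable convolutions, not raw images), the following sketch warps a single-channel image by a dense flow field with bilinear interpolation:

```python
import numpy as np

def flow_warp(img, flow):
    """Backward-warp a single-channel image by an optical-flow field.

    img:  (H, W) array.
    flow: (H, W, 2) array of (dx, dy) displacements; output pixel (y, x)
          is sampled from img at (y + dy, x + dx) with bilinear
          interpolation, clipped at the image border.
    """
    H, W = img.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    x = np.clip(xs + flow[..., 0], 0, W - 1)
    y = np.clip(ys + flow[..., 1], 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

A zero flow field returns the image unchanged, and an integer flow shifts it; in the full module, the warped neighbor features are then refined by deformable convolution offsets.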
In summary, this paper offers the following contributions. 1) We present the first generative learning-based framework for the challenging task of large-factor EMSR. 2) We introduce the VQGAN-Indexer network to explore generative cell-specific prior information from HR EM images. 3) We propose the MPF module to effectively utilize the generative priors while preserving image fidelity with LR observations, followed by the POD and 3DA modules for multi-image feature alignment and fusion. 4) Extensive experiments demonstrate the superiority of our framework in terms of both reconstruction quality and downstream segmentation accuracy for ×8 and ×16 EMSR.
2. Related Work
Electron microscopy image super-resolution. Existing EMSR methods can be categorized into two types: restoring isotropic volumes from anisotropic ones, i.e., SR along the axial dimension, and reconstructing HR images from corresponding LR observations in the lateral dimensions. We focus on the latter task in this paper, while our proposed framework may also apply to the former. As a pioneering work in the field of EMSR, Sreehari et al. [46] introduce a Bayesian framework and utilize a library-based non-local means (LB-NLM) algorithm to achieve EMSR without requiring a training process. However, this non-learning-based method limits performance and is not specifically designed for large-factor EMSR. Along the deep learning line, Nehme et al. [40] train a fully convolutional encoder-decoder network on simulated data to reconstruct super-resolved images. Hann et al. [7] train a GAN model using pairs of test specimens captured from the same region of interest. Xie et al. [53] leverage the attention mechanism to capture inter-section dependencies and shared features among adjacent images. Compared to previous EMSR methods, our framework not only utilizes adjacent EM images but also explores and integrates generative cell-specific priors to tackle the challenging task of large-factor EMSR.
Video super-resolution. Video super-resolution (VSR) aims to restore HR frames by leveraging adjacent temporal information in multiple LR frames. To align temporal features, optical flow [3, 5, 26, 44, 49, 52, 54] and deformable convolution have been widely adopted. Recently, transformer-based approaches have yielded remarkable advancements in VSR, owing to the utilization of diverse attention mechanisms. Inspired by these VSR methods, to exploit the inter-section coherence in the volumetric EM data, we utilize optical-flow networks and deformable convolutions for multi-image feature alignment, and spatial attention mechanisms for multi-image feature fusion.
Generative priors in image restoration. Generative image restoration methods [11, 31-33] employ the priors from a pre-trained generative adversarial network (GAN), such as StyleGAN [24] and BigGAN [2], to approximate the natural image manifold and synthesize high-quality images. Given the superior performance of discrete codebook-based generative methods in semantic image synthesis, structure-to-image, and stochastic super-resolution tasks [12, 48], recent methods explore codebook-based facial priors [17, 59] by leveraging VQGAN [12] for training. Different from these methods, we propose a latent vector indexer to exploit the information contained within the input LR images, and the MPF module to fuse generative priors.
3. Method
3.1. Overview
As illustrated in Figure 1, the goal of large-factor EMSR is to obtain the super-resolved image $I_t^{SR}$, given a sequence of $2N+1$ consecutive LR EM images $\{I_{t-N}^{LR}, \dots, I_{t+N}^{LR}\}$, such that $I_t^{SR}$ is close to the ground-truth image $I_t^{HR}$, where $I_i^{LR} \in \mathbb{R}^{H \times W}$, $I_t^{HR} \in \mathbb{R}^{sH \times sW}$, and $s$ is the large scale factor. In this paper, we set $s = 8$ and $s = 16$.
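The connection between the scale factor and the acquisition savings quoted in the abstract follows from the two lateral dimensions: an $s\times$ factor means $s^2$ times fewer pixels scanned per section, which is where the 64-fold and 256-fold figures come from. A small helper (illustrative only; the function name is our own) makes this explicit:

```python
def acquisition_cost(h, w, s):
    """Pixels actually scanned for an (h, w) HR field of view when
    acquiring at 1/s resolution in both lateral dimensions.

    Returns (lr_pixels, reduction_factor); assumes h and w are
    divisible by s.
    """
    lr_pixels = (h // s) * (w // s)
    return lr_pixels, (h * w) // lr_pixels
```

For a 1024 x 1024 field of view, `acquisition_cost(1024, 1024, 8)` reports a 64-fold reduction and `acquisition_cost(1024, 1024, 16)` a 256-fold reduction, matching the $s = 8$ and $s = 16$ settings above.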
To achieve this, we propose a generative learning-based framework consisting of three stages. Stage I involves exploring generative cell-specific priors through the VQGAN model, which identifies a discrete latent space of HR EM images and generates the HR EM image from this latent space. The latent space is represented using vectors from the VQGAN codebook $\mathcal{Z}$. In Stage II, we train a latent vector indexer and connect it with the VQGAN codebook and VQGAN decoder obtained from Stage I. This connection allows us to generate the reference HR EM image $I^{Ref}$ and its corresponding multi-scale generative features $F^{Ref}$ from the LR EM image $I_t^{LR}$. Finally, in Stage III, we fuse the multi-scale generative features using the MPF module, align adjacent image features in the axial direction using the POD module, and fuse the adjacent image features using the 3DA module. We finally use sub-pixel convolution to reconstruct the HR output $I_t^{SR}$.
3.2. Exploring Generative Cell-Specific Priors
EM images are characterized by their repetitive structures and textures, such as cellular membranes and subcellular organelles. These features offer an opportunity to exploit their regularity and predictability, motivating our exploration of generative cell-specific priors in EM images for large-factor EMSR. As shown in Figure 1 (a), our framework for exploring generative cell-specific priors consists of two main steps. First, we identify a discrete latent space that represents the features of HR EM images. Then, we generate the HR EM images from this latent space, leveraging its compact and representative nature.
Identifying a discrete latent space. We aim to identify a discrete latent space that represents HR EM images. To achieve this, we utilize an encoder $E$ to parameterize the posterior categorical distribution $q_{\phi}(z \mid x)$ of HR EM images, where $z$ represents the variable for latent vectors, $x$ represents the variable for HR EM images, and $\phi$ represents the encoder's parameters. Specifically, we quantize the output feature $\hat{Z} = E(x)$ by mapping each of its elements to the closest latent vector in the codebook $\mathcal{Z}$, obtaining the quantized feature $Z^{q}$ and a one-hot index for each mapped HR patch:
$$z^{q}_{(m,n)} = \arg\min_{z_k \in \mathcal{Z}} \big\| \hat{z}_{(m,n)} - z_k \big\|_2,$$
where $z_k$ denotes the $k$-th latent vector stored in $\mathcal{Z}$, $\hat{z}_{(m,n)}$ denotes the encoder output feature element at position $(m,n)$ within $\hat{Z}$, and $z^{q}_{(m,n)}$ denotes the quantized feature element at position $(m,n)$ within $Z^{q}$; each latent vector is a $v$-dimensional vector, and $z^{(i)}_k$ denotes the $i$-th element of $z_k$. The codebook $\mathcal{Z}$ consists of $K$ latent vectors, each with a dimensionality of $v$. Thus, the posterior categorical distribution is defined as
$$q_{\phi}\big(z_{(m,n)} = z_k \mid x\big) =
\begin{cases}
1, & k = \arg\min_{j} \big\| \hat{z}_{(m,n)} - z_j \big\|_2, \\
0, & \text{otherwise}.
\end{cases}$$
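The nearest-neighbor quantization above can be sketched in a few lines. This is an illustrative NumPy version (in practice the codebook and encoder features are learned tensors), with shapes following the notation in the text:

```python
import numpy as np

def vector_quantize(z_e, codebook):
    """Map each encoder feature vector to its nearest codebook entry.

    z_e:      (M, N, v) encoder output feature map.
    codebook: (K, v) array of latent vectors.
    Returns the quantized feature map (M, N, v) and the (M, N) index map
    (the argmin index plays the role of the one-hot assignment).
    """
    # Squared Euclidean distance from every feature vector to every code.
    d = ((z_e[:, :, None, :] - codebook[None, None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=-1)   # (M, N): nearest code per spatial position
    z_q = codebook[idx]       # (M, N, v): quantized features
    return z_q, idx
```

During training, gradients are passed through this non-differentiable lookup with a straight-through estimator, as in VQGAN [12].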
Generating the HR EM image. Given the quantized feature $Z^{q}$, we can generate the HR EM image through the decoder. We parameterize the prior distribution $p_{\theta}(x \mid z)$ of HR EM images through the decoder $G$, where $\theta$ denotes the parameters of the decoder.
The encoder $E$ is composed of multiple Res-blocks [19] and convolution blocks for downsampling, while the decoder $G$ is composed of multiple Res-blocks and deconvolution blocks for upsampling. The compression patch size for downsampling is fixed. Both the encoder and decoder leverage self-attention mechanisms to enhance generation quality. We optimize the latent vectors in $\mathcal{Z}$ jointly with the encoder $E$ and decoder $G$.
3.3. Generating LR Reference
With the latent space parameterized by $E$ and $\mathcal{Z}$, and the prior distribution of HR EM images parameterized by $G$, we can generate HR EM images given real HR EM images as input. However, in the large-factor EMSR task, only highly degraded LR images are available as input. Hence, we need to map the LR images to their corresponding quantized feature $Z^{q}$ in order to utilize the generative cell-specific priors stored in $\mathcal{Z}$.
To achieve this, one straightforward approach is to interpolate the LR images and feed them into the encoder $E$ [17]. This approach approximates the posterior categorical distribution of LR EM images by applying the parameterized posterior categorical distribution of HR EM images, $q_{\phi}(z \mid x)$, to interpolated inputs $\uparrow(x^{LR})$, where $x^{LR}$ represents the LR image variable and $\uparrow$ denotes the interpolation operation. However, in scenarios where significant degradation occurs in LR input images, the interpolation operation struggles to restore the rich details and textures of HR EM images. Consequently, this mismatch leads to discrepancies between the real HR EM image distribution and the interpolated HR EM image distribution, thereby resulting in disparities between the corresponding quantized features. Despite the loss of fine-grained details in LR images, the partial preservation of information enables the utilization of such details as priors for mapping LR images to their corresponding quantized feature $Z^{q}$. To fully utilize these priors, we propose a latent vector indexer to predict the probabilities of the corresponding latent vectors in $\mathcal{Z}$ given LR EM images as input, as shown in Figure 1 (b). We denote the output of the latent vector indexer as $P$, where the element at position $(m,n)$ is denoted as $p_{(m,n)}$ and represents a $K$-dimensional vector. Then, by selecting the latent vector with the highest probability, we can effectively model the posterior categorical distribution as
$$q_{\psi}\big(z_{(m,n)} = z_k \mid x^{LR}\big) =
\begin{cases}
1, & k = \arg\max_{j} p^{(j)}_{(m,n)}, \\
0, & \text{otherwise},
\end{cases}$$
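The indexing step reduces to a per-position argmax over predicted codebook probabilities, followed by a codebook lookup. A hedged NumPy sketch (the real indexer is a trained network producing these scores; the softmax normalization here is our own illustrative choice):

```python
import numpy as np

def index_and_decode_latents(logits, codebook):
    """Select the most probable codebook entry at each LR feature position.

    logits:   (M, N, K) indexer scores over the K codebook entries.
    codebook: (K, v) array of latent vectors.
    Returns the predicted quantized feature map (M, N, v) and index map.
    """
    # Numerically stable softmax over the codebook dimension.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    idx = probs.argmax(axis=-1)     # pick the most probable code per position
    return codebook[idx], idx
```

The resulting quantized feature map is then fed to the fixed decoder to produce the reference HR image and multi-scale generative features.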
Figure 1. Overview of our framework. The proposed generative learning-based framework consists of three stages. In Stage I, the encoder, the codebook, and the decoder are trained by self-generating the input HR EM image. In Stage II, a latent vector indexer is trained and connected with the codebook and the decoder with fixed parameters to generate the reference HR EM image and multi-scale generative features from the LR EM image. In Stage III, the HR output is reconstructed by fusing multi-scale generative features and exploiting inter-section coherence in volumetric EM data with the POD module and the 3DA module. Rec denotes reconstruction layers composed of convolution layers and pixel shuffle operation.
where $\psi$ denotes the parameters of the latent vector indexer and $p^{(k)}_{(m,n)}$ denotes the $k$-th element of the vector $p_{(m,n)}$.
This quantization operation allows us to capture the most representative latent vector corresponding to the LR image. The predicted quantized feature is then obtained and fed into the decoder to generate the reference HR EM image $I^{Ref}$ and its corresponding multi-scale generative features $F^{Ref}$. Note that all the latent vectors used to generate $I^{Ref}$ and $F^{Ref}$ are obtained from the codebook, which contains the generative cell-specific priors of HR EM images.
3.4. Reconstructing HR Image
MPF module. The complete process of reconstructing the HR image is depicted in Figure 1 (c). The discrepancies between the generated HR image $I^{Ref}$ and the real EM image pose a challenge in achieving accurate multi-scale generative feature fusion. To overcome this challenge, we propose the MPF module, which focuses on identifying and fusing the multi-scale generative features. The MPF module comprises two essential processes: a mask-learning process and a multi-scale fusion process.
The mask-learning mechanism within the MPF module enables us to identify and mask out multi-scale generative features in regions that show significant boundary differences compared to real HR images. As shown in Figure 1 (d), we first embed the interpolated LR image and the reference image $I^{Ref}$ from the decoder into the feature space using a pre-trained VGG19 encoder [45]. We extract patches from the LR feature maps and reference (Ref) feature maps without overlap. Then, we calculate the cosine similarity vector between LR feature patches and Ref feature patches, where each element denotes cosine similarity be-
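The patch-level similarity computation can be sketched as follows. This illustrative NumPy version compares non-overlapping patches of the two (VGG-like) feature maps; the patch size and the use of a flat dot product over all channels are our assumptions, and the learned mask would be derived from these scores rather than by simple thresholding:

```python
import numpy as np

def patch_cosine_similarity(lr_feat, ref_feat, p):
    """Patch-level cosine similarity between LR and reference feature maps.

    lr_feat, ref_feat: (C, H, W) feature maps with H and W divisible by p.
    p: patch size; patches are extracted without overlap.
    Returns an (H // p, W // p) map with one cosine similarity per patch.
    """
    C, H, W = lr_feat.shape
    sim = np.empty((H // p, W // p))
    for i in range(H // p):
        for j in range(W // p):
            a = lr_feat[:, i * p:(i + 1) * p, j * p:(j + 1) * p].ravel()
            b = ref_feat[:, i * p:(i + 1) * p, j * p:(j + 1) * p].ravel()
            sim[i, j] = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return sim
```

Patches whose similarity is low indicate regions where the generated reference deviates from the LR observation, and their generative features are down-weighted by the learned mask.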