[Paper Translation] Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Original link: arxiv.org/pdf/2410.13…

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Chengyue Wu¹,²  Xiaokang Chen¹,*,†  Zhiyu Wu¹,³  Yiyang Ma¹,³  Xingchao Liu¹  Zizheng Pan¹  Wen Liu¹  Zhenda Xie¹  Xingkai Yu¹  Chong Ruan¹  Ping Luo²,*

¹DeepSeek-AI  ²The University of Hong Kong  ³Peking University

†: Project lead *: Corresponding authors

Project Page: github.com/deepseek-ai…

Abstract

In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.

1. Introduction

In recent years, multimodal large models have made significant advancements in both understanding and generation domains [20, 51]. In the field of multimodal understanding, researchers follow the design of LLaVA [51] by using a vision encoder as a bridge to enable large language models (LLMs) to understand images. In the field of visual generation, diffusion-based approaches [9, 20, 67] have seen notable success. More recently, some works have explored autoregressive methods for vision generation [73, 79], achieving performance comparable to diffusion models. To build more powerful and generalist multimodal models, researchers have sought to combine multimodal understanding and generation tasks [75, 77, 94]. For instance, some studies have attempted to connect multimodal understanding models with pretrained diffusion models [27, 28, 75]. For example, Emu [75] uses the output of the LLM as a condition for a pretrained diffusion model, and then relies on the diffusion model to generate images. However, strictly speaking, this approach cannot be considered a truly unified model, because the visual generation functionality is handled by the external diffusion model, while the multimodal LLM itself lacks the capability to directly generate images.

Other approaches [77, 85, 86, 94] employ a single transformer to unify both multimodal understanding and generation tasks, which improves instruction-following for visual generation, unlocks potential emergent abilities, and reduces model redundancy. Such methods typically use a single vision encoder to process inputs for both tasks. However, the representations required by multimodal understanding and generation tasks differ significantly. In multimodal understanding tasks, the purpose of the vision encoder is to extract high-level semantic information (e.g., object categories or visual attributes within an image). The output of understanding tasks involves not only extracting information from images but also complex semantic reasoning. Therefore, the granularity of the vision encoder's representation tends to mainly focus on high-dimensional semantic representation. By contrast, in visual generation tasks, the main focus is on generating local details and maintaining global consistency in the image. The representation in this context necessitates a low-dimensional encoding that is capable of expressing fine-grained spatial structure and textural detail. Unifying the representations of these two tasks within the same space will lead to conflicts and trade-offs. Consequently, existing unified models for multimodal understanding and generation often compromise on multimodal understanding performance, falling markedly short of the state-of-the-art multimodal understanding models. We explore this issue further in the ablation study.

Figure 1 | Multimodal understanding and vision generation results from our Janus. Janus outperforms the previous state-of-the-art unified multimodal models as well as some task-specific multimodal understanding models, while also demonstrating strong visual generation capabilities. The image resolution is 384×384. Best viewed on screen.

To solve this problem, we propose Janus¹, a unified multimodal framework that decouples visual encoding for multimodal understanding and generation. Specifically, we introduce two independent visual encoding pathways: one for multimodal understanding and one for multimodal generation, unified by the same transformer architecture. The proposed method offers two main benefits: (1) Janus alleviates the conflict stemming from the different granular needs of multimodal understanding and generation and eliminates the need to make trade-offs between two tasks when selecting visual encoders. (2) Janus is flexible and extensible. After decoupling, both the understanding and generation tasks can adopt state-of-the-art encoding techniques specific to their domain. Moreover, it is possible for Janus to accommodate additional input types in the future, such as point clouds, EEG signals, or audio data, where independent encoders can extract features and then use a unified transformer to process them.

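The two decoupled pathways can be pictured with a minimal PyTorch-style sketch (all module names, dimensions, and the stand-in encoders below are hypothetical placeholders for illustration, not the released implementation): each pathway owns its encoder and adaptor, and both map into the same LLM input space.

```python
import torch
import torch.nn as nn

class DecoupledVisualFrontend(nn.Module):
    """Sketch of decoupled visual encoding: two independent pathways
    that project into one shared LLM input space. Dimensions are
    illustrative placeholders, not the paper's actual sizes."""

    def __init__(self, und_dim=64, gen_dim=8, llm_dim=128, codebook_size=512):
        super().__init__()
        # Understanding pathway: a semantic encoder (stand-in for SigLIP)
        # followed by an understanding adaptor into the LLM space.
        self.und_encoder = nn.Linear(und_dim, und_dim)
        self.und_adaptor = nn.Linear(und_dim, llm_dim)
        # Generation pathway: VQ codebook embeddings (stand-in for the
        # VQ tokenizer's codebook) followed by a generation adaptor.
        self.gen_codebook = nn.Embedding(codebook_size, gen_dim)
        self.gen_adaptor = nn.Linear(gen_dim, llm_dim)

    def encode_for_understanding(self, patch_feats):
        # patch_feats: (B, N, und_dim) semantic features, already flattened
        return self.und_adaptor(self.und_encoder(patch_feats))

    def encode_for_generation(self, token_ids):
        # token_ids: (B, N) discrete codebook IDs
        return self.gen_adaptor(self.gen_codebook(token_ids))
```

Both methods return `(B, N, llm_dim)` sequences, so the downstream transformer never needs to know which pathway produced its inputs.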

¹ In Roman mythology, Janus is the god of duality and transitions, symbolizing the coexistence of contradictory forces by having two faces, each looking in opposite directions. Similarly, our model captures the inherent tension between vision tasks: understanding demands abstract, high-level semantic representations, while generation requires concrete, detailed information. By decoupling these processes into specialized encoders, our system mirrors Janus's dual nature, resolving this tension within a unified architecture.


To the best of our knowledge, we are the first to highlight the importance of decoupling visual encoding within the unified multimodal understanding and generation framework. Our experimental results show that Janus surpasses existing unified models with comparable parameter sizes on both multimodal understanding and generation benchmarks, achieving state-of-the-art results. Notably, Janus even outperforms some task-specific models which have significantly more parameters (Figure 1). Specifically, on multimodal understanding benchmarks MMBench [54], SEED-Bench [42], and POPE [48], Janus (1.3B) achieved scores of 69.4, 63.7, and 87.0, respectively, outperforming LLaVA-v1.5 (7B) [50] and Qwen-VL-Chat (7B) [3]. On visual generation benchmarks MSCOCO-30K [11] and GenEval [30], Janus achieved an FID score of 8.53 and an accuracy of 61%, surpassing text-to-image generative models such as DALL-E 2 [66] and SDXL [62]. We believe that the strong performance, coupled with the high flexibility and extensibility of Janus, presents it as a strong candidate for next-generation unified multimodal models.

2. Related Work

2.1. Visual Generation

Visual generation is a rapidly evolving field that combines concepts from natural language processing with advancements in transformer architectures. Autoregressive models, influenced by the success in language processing, leverage transformers to predict sequences of discrete visual tokens (codebook IDs) [24, 65, 75]. These models tokenize visual data and employ a prediction approach similar to GPT-style [64] techniques. Additionally, masked prediction models [7, 8] draw upon BERT-style [19] masking methods, predicting masked sections of visual inputs to improve synthesis efficiency, and have been adapted for video generation [89]. Concurrently, continuous diffusion models have showcased impressive capabilities in visual generation [33, 67, 71], complementing discrete methods by approaching generation through a probabilistic lens.

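As a concrete illustration of the GPT-style approach, next-token sampling over discrete visual tokens might look like the following sketch. The `logits_fn` interface is hypothetical (a real model would also condition on a text prompt and use KV caching); this only shows the autoregressive loop over codebook IDs.

```python
import torch

def sample_image_tokens(logits_fn, prompt, num_tokens, temperature=1.0):
    """Autoregressively sample `num_tokens` discrete visual tokens
    (codebook IDs), one at a time, GPT-style.

    logits_fn: hypothetical callable mapping a (B, N) token sequence to
               (B, N, V) next-token logits over the visual codebook.
    prompt:    (B, P) conditioning tokens already in the sequence.
    """
    seq = prompt.clone()
    for _ in range(num_tokens):
        logits = logits_fn(seq)[:, -1, :] / temperature  # logits for next position
        probs = torch.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)  # (B, 1)
        seq = torch.cat([seq, next_tok], dim=1)
    return seq[:, prompt.shape[1]:]  # return only the generated visual tokens
```

The sampled ID grid would then be decoded back to pixels by the VQ tokenizer's decoder, a step omitted here.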
2.2. Multimodal Understanding

Multimodal large language models (MLLMs) integrate both text and images [6, 80, 81]. By leveraging pretrained LLMs, MLLMs [1, 2, 12, 51, 55, 82, 95] demonstrate a robust ability to understand and process multimodal information. Recent advancements have explored extending MLLMs with pretrained diffusion models to facilitate image generation [27, 29, 36, 75, 76]. These methods fall under the category of tool utilization, where diffusion models are used to generate images based on the conditions output by the MLLM, while the MLLM itself does not have the ability to directly perform visual generation. Moreover, the generative ability of the entire system is often constrained by the external diffusion model, making its performance inferior to directly using the diffusion model on its own [27, 75].

2.3. Unified Multimodal Understanding and Generation

Unified multimodal understanding and generation models are considered powerful for facilitating seamless reasoning and generation across different modalities [77, 94]. Traditional approaches in these models typically use a single visual representation for both understanding and generation tasks, regardless of whether they are based on autoregressive (AR) models [77, 85] or diffusion models [86, 94]. For example, Chameleon [77] adopts a VQ Tokenizer to encode images for both multimodal understanding and generation. However, this practice may lead to suboptimal outcomes, as the vision encoder might face a trade-off between the demands of understanding and generation. In contrast, our Janus can explicitly decouple the visual representations for understanding and generation, recognizing that different tasks may require varying levels of information.

Figure 2 | Architecture of our Janus. Different from previous approaches [77, 85] that typically assume visual understanding and generation require the same visual encoder, our Janus decouples visual encoding for visual understanding and visual generation. "Und. Encoder" and "Gen. Encoder" are abbreviations for "Understanding Encoder" and "Generation Encoder", respectively. Best viewed in color.

3. Janus: A Simple, Unified and Flexible Multimodal Framework

3.1. Architecture

The architecture of Janus is shown in Figure 2. For pure text understanding, multimodal understanding, and visual generation, we apply independent encoding methods to convert the raw inputs into features, which are then processed by a unified autoregressive transformer. Specifically, for text understanding, we use the built-in tokenizer of the LLM to convert the text into discrete IDs and obtain the feature representations corresponding to each ID. For multimodal understanding, we use the SigLIP [92] encoder to extract high-dimensional semantic features from images. These features are flattened from a 2-D grid into a 1-D sequence, and an understanding adaptor is used to map these image features into the input space of the LLM. For visual generation tasks, we use the VQ tokenizer from [73] to convert images into discrete IDs. After the ID sequence is flattened into 1-D, we use a generation adaptor to map the codebook embeddings corresponding to each ID into the input space of the LLM. We then concatenate these feature sequences to form a multimodal feature sequence, which is subsequently fed into the LLM for processing. The built-in prediction head of the LLM is utilized for text predictions in both the pure text understanding and multimodal understanding tasks, while a randomly initialized prediction head is used for image predictions in the visual generation task. The entire model adheres to an autoregressive framework without the need for specially designed

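The pipeline described above can be summarized in a hedged PyTorch sketch. Everything here is a placeholder for illustration: the tiny dimensions, the two-layer encoder standing in for the autoregressive LLM (a real model would use causal attention masks, omitted for brevity), and all module names are assumptions, not the actual Janus implementation.

```python
import torch
import torch.nn as nn

class JanusSketch(nn.Module):
    """Sketch of the Janus forward pass: independent encodings are mapped
    into one LLM input space, concatenated, and routed to a task-specific
    prediction head. All sizes are illustrative placeholders."""

    def __init__(self, vocab=1000, codebook=512, d=256, und_dim=64, cb_dim=8):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, d)      # LLM's built-in text embeddings
        self.und_adaptor = nn.Linear(und_dim, d)      # maps semantic (SigLIP-like) features
        self.gen_codebook = nn.Embedding(codebook, cb_dim)  # VQ codebook embeddings
        self.gen_adaptor = nn.Linear(cb_dim, d)       # maps codebook embeddings
        self.llm = nn.TransformerEncoder(             # stand-in for the autoregressive LLM
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True),
            num_layers=2)
        self.text_head = nn.Linear(d, vocab)          # built-in text prediction head
        self.image_head = nn.Linear(d, codebook)      # randomly initialized image head

    def forward(self, text_ids, und_feats=None, gen_ids=None, task="und"):
        parts = [self.text_embed(text_ids)]           # (B, T, d)
        if und_feats is not None:                     # (B, N, und_dim), pre-flattened grid
            parts.append(self.und_adaptor(und_feats))
        if gen_ids is not None:                       # (B, N) discrete VQ IDs, flattened
            parts.append(self.gen_adaptor(self.gen_codebook(gen_ids)))
        h = self.llm(torch.cat(parts, dim=1))         # one multimodal feature sequence
        return self.image_head(h) if task == "gen" else self.text_head(h)
```

The key structural point is that only the front-end encoders and the output heads are task-specific; the concatenated sequence is processed by the same transformer regardless of task.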