[Paper Translation] Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models

Original paper: arxiv.org/pdf/2404.13…

Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models


Vitali Petsiuk$^{1}$ and Kate Saenko$^{1}$


Boston University


{vpetsiuk,saenko}@bu.edu


Abstract. Motivated by ethical and legal concerns, the scientific community is actively developing methods to limit the misuse of Text-to-Image diffusion models for reproducing copyrighted, violent, explicit, or personal information in the generated images. Simultaneously, researchers put these newly developed safety measures to the test by assuming the role of an adversary to find vulnerabilities and backdoors in them. We use the compositional property of diffusion models, which allows leveraging multiple prompts in a single image generation. This property allows us to combine other concepts, which should not have been affected by the inhibition, to reconstruct the vector responsible for target concept generation, even though the direct computation of this vector is no longer accessible. We provide theoretical and empirical evidence for why the proposed attacks are possible and discuss the implications of these findings for safe model deployment. We argue that it is essential to consider all possible approaches to image generation with diffusion models that can be employed by an adversary. Our work opens up the discussion about the implications of concept arithmetics and compositional inference for safety mechanisms in diffusion models.


Content Advisory: This paper contains discussions and model-generated content that may be considered offensive. Reader discretion is advised.


Project page: cs-people.bu.edu/vpetsiuk/ar…


1 Introduction


Recent advances in Text-to-Image (T2I) generation [24,26,28] have led to the rapid growth of applications enabled by the models, including many commercial projects as well as creative applications by the general public. On the other hand, they can also be used for generating deep fakes, hateful or inappropriate images [2,9], copyrighted materials, or artistic styles [30]. Trained on vast amounts of data scraped from the web, these models also learn to reproduce the biases and stereotypes present in the data [2,8,17,19]. While some legal [9,18] and ethical [27] questions concerning image generation models remain unresolved, the scientific community is developing methods to limit their malicious utility while keeping them open and accessible to the community.


Fig. 1: While recent methods for erasing concepts in Diffusion Models successfully pass their respective evaluations (middle row), they do not entirely remove the target concept (such as zebra) from model weights as claimed. In this work, we propose a method to reproduce the erased concept using the inhibited models (bottom row).


Some recently proposed approaches, which we refer to as Concept Inhibition methods [7,8,10,15,29,39], modify the Diffusion Model (DM) to "forget" some specified information. Given a target concept, the weights of the model are fine-tuned or otherwise edited so that the model is no longer capable of generating images that contain that concept. Unlike post-hoc filtering methods (safety checkers) that can be easily circumvented by an adversary [1,25,38], these methods are designed to prevent the generation of undesired content in the first place. One of the motivating factors of this line of work is to limit inappropriate content generation by the models, while keeping them open-source and accessible to the community. Based on the evaluation results of these works, which demonstrate a significantly reduced reproduction rate of the target concept in the generated images, the authors conclude that the model is no longer capable of generating the target concept and that such "erasure cannot be easily circumvented, even by users who have access to the parameters" [7]. However, we demonstrate, theoretically and experimentally, that the models inhibited with existing methods still contain the information needed to reproduce the erased concept (Figure 1). This information can be easily exploited by an adversary with access to compositional inference of the model, which is a weaker requirement than full access to the model weights.


Recent works have explored how a semantic concept can be constructed by specifying and composing more than one prompt in a single generation [3,16,29]. We consider the implications of this compositional property in the context of concept inhibition. Using concept arithmetics, which is not available in single-prompt inference, we use multiple input points to reconstruct the erased concept. Unlike prompt optimization attacks [4,34] that leverage insufficiently generalized inhibition near the target concept (similar to adversarial attacks), our attacks leverage the compositional property and use input points further away from the target. These points are sufficiently inhibited according to the design of the inhibition methods but still contain information about the erased concept. Since a defense against these attacks has to take compositionality into consideration, our attacks cannot be mitigated by methods that exclusively address adversarial robustness.

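To make the intuition concrete, here is a minimal numpy sketch, not the paper's actual attack: concept "guidance directions" are plain vectors, the inhibited model returns nothing for the erased prompt, yet a weighted combination of un-inhibited surrogate prompts recovers the same direction. All concept names, vectors, and weights below are hypothetical stand-ins for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64

# Hypothetical guidance directions in a toy "concept space".
# By construction, the target is a mixture of two surrogate concepts.
horse = rng.normal(size=DIM)
stripes = rng.normal(size=DIM)
zebra = 0.6 * horse + 0.8 * stripes   # toy erased target concept

def guidance(name, erased=("zebra",)):
    """Toy inhibited model: returns the guidance vector for a named
    concept, but a zero vector if the concept was 'erased'."""
    concepts = {"horse": horse, "stripes": stripes, "zebra": zebra}
    if name in erased:
        return np.zeros(DIM)
    return concepts[name]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Direct query for the erased concept yields nothing ...
direct = guidance("zebra")
# ... but compositional inference over surrogates recovers the direction.
recon = 0.6 * guidance("horse") + 0.8 * guidance("stripes")

print(cosine(recon, zebra))    # ≈ 1.0: the erased direction is recovered
print(np.linalg.norm(direct))  # 0.0: direct access is inhibited
```

The point of the sketch is that the defense only suppresses the mapping from one input point (the target prompt), while the information remains reachable through other points of the input space.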

Intuitive and straightforward to implement, our proposed ARC (ARithmetics in Concept space) attacks would be readily available to an adversary, which makes them a serious threat against presumably safe models. The attacks require black-box access to compositional inference of the model. This is the case for multi-prompting APIs, which are becoming increasingly popular$^{1}$, or for an adversary with full access to the model weights and code, e.g., if the model is open-source.


We present both theoretical grounding and empirical evidence of the attack effectiveness, and we quantitatively show that the attacks significantly increase the reproduction rates of the erased concepts. Compositional inference attacks are applicable to all safety mechanisms that modify the model locally (near a given input point). This simple alteration in the inference process may break the assumptions made by the defense mechanisms developers, or exploit the vulnerabilities considered to be minor to a larger extent.


To summarize, our main contributions are as follows:


  1. We are the first work to consider the compositional property of Diffusion Models in the context of concept inhibition and its circumvention.

  2. We design novel attacks that exploit the limitations of concept inhibition methods, based on the theoretical framework we develop.

  3. We test our attacks against models inhibited with a variety of inhibition methods and show that the attacks significantly increase the reproduction rates of the erased concepts.

Our work is not intended to discourage the use of the presented inhibition methods but to determine the strengths and limitations of different approaches, further define the notion of concept inhibition, and ultimately advance the research on safe and responsible Text-to-Image generation. The proposed attacks can be used to test the robustness of inhibition methods and to guide the choice of an inhibition method and its parameters. Our intentions do not include enabling the generation of inappropriate content; however, by the nature of red-team work, the presented approach can be used for malicious purposes.



$^{1}$ docs.midjourney.com/docs/multi-…,

platform.stability.ai/docs/featur…


2 Related Work


2.1 Diffusion Models


A Diffusion Model is a type of generative model that employs a gradual denoising process to learn the distribution $p(x)$ of the data [5,12,20,31,32]. The diffusion model generates an image $x_0$ in $T$ steps by iteratively predicting and removing noise, starting from an initial Gaussian noise sample $x_T$. Noise prediction is learned to optimize the score function $\nabla_x \log p(x)$.

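The iterative denoising loop described above can be sketched with a toy DDPM-style sampler. `eps_theta` below is a hypothetical stand-in for the trained U-Net noise predictor, and the linear beta schedule is a common default, not a detail taken from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50                                 # number of denoising steps
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(x_t, t):
    """Hypothetical stand-in for the learned U-Net noise predictor."""
    return 0.1 * x_t                   # placeholder; a real model is trained

def sample(shape=(8, 8)):
    """Generate x_0 from Gaussian noise x_T by iterative denoising."""
    x = rng.normal(size=shape)         # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = eps_theta(x, t)
        # Remove the predicted noise component (DDPM posterior mean).
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                      # re-inject noise except at the final step
            x = x + np.sqrt(betas[t]) * rng.normal(size=shape)
    return x

x0 = sample()
print(x0.shape)
```

In a real Latent Diffusion Model the loop runs in latent space and the final latent is decoded into an image; the structure of the loop is the same.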

Classifier guidance [5,31,33] enables generation conditioned on some input $c$ by adding a conditional score term $\gamma \nabla_x \log p(c \mid x)$, with guidance scale $\gamma > 1$ controlling the influence of the conditional signal. $p(c \mid x)$ can be an external image classifier model predicting the class label $c$. Classifier-free guidance [12] proposes to train the model jointly on conditional and unconditional denoising to obtain a single neural network that models both the unconditional $p(x)$ and conditional $p(c \mid x)$ distributions. In this case, the total guidance can be expressed as

$$\nabla_x \log p(x) + \gamma \nabla_x \log p(c \mid x),$$


or in terms of the learned U-Net model $\epsilon_\theta$ that predicts the noise to be removed from $x_t$ at timestep $t$, conditioned on prompt $c_1$ $^{2}$:

$$\tilde{\epsilon}(x_t, c_1, t) = \epsilon_\theta(x_t, t) + \gamma \left( \epsilon_\theta(x_t, c_1, t) - \epsilon_\theta(x_t, t) \right) \tag{1}$$

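The classifier-free guidance combination can be sketched directly; the arrays below are toy stand-ins for the two U-Net evaluations (unconditional and conditioned on the prompt) at one timestep, and the scale 7.5 is just a commonly used default, not a value from this paper.

```python
import numpy as np

def cfg(eps_uncond, eps_cond, gamma):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one with scale gamma."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)

# Toy stand-ins for the two U-Net evaluations at one timestep.
eps_u = np.array([0.0, 0.0, 1.0])    # eps_theta(x_t, t), empty prompt
eps_c = np.array([1.0, 0.0, 1.0])    # eps_theta(x_t, c1, t)

print(cfg(eps_u, eps_c, gamma=1.0))  # gamma=1 reduces to the conditional prediction
print(cfg(eps_u, eps_c, gamma=7.5))  # larger gamma amplifies the conditional signal
```

Note that the guided prediction is a linear combination of two model evaluations; it is exactly this linearity that the compositional attacks later exploit.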

Latent Diffusion Models [26] incorporate an encoder $E$ and a decoder $D$ before and after the diffusion process, respectively. Moving the gradual denoising from image pixel space to a lower-dimensional encoder-decoder latent space improves convergence and running speed.


2.2 Concept Arithmetics in Diffusion Models


A series of recent works [3,16,29] has demonstrated that adding the guidance terms for multiple prompts during the diffusion process results in an image that corresponds to multiple prompts simultaneously. With the additional prompt guidance incorporated in Equation 1, the updated noise prediction equation becomes

$$\tilde{\epsilon}(x_t, t) = \epsilon_\theta(x_t, t) + \sum_j d_j \gamma_j \left( \epsilon_\theta(x_t, c_j, t) - \epsilon_\theta(x_t, t) \right) \tag{2}$$


where $\gamma_j$ (typically the same for all concepts) is the guidance scale for each additional prompt/concept $c_j$, and $d_j \in \{-1, 1\}$ determines the direction of guidance, negative or positive. For example, the generation conditioned on



$^{2}$ Throughout, we imply that the string is embedded using the CLIP [23] textual encoder before being passed to $\epsilon$.
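The multi-prompt composition described above can be sketched as a small extension of classifier-free guidance: each concept contributes a signed, scaled guidance term. The arrays and the scale 7.5 below are hypothetical stand-ins, not values from this paper.

```python
import numpy as np

def composed_guidance(eps_uncond, eps_conds, gammas, directions):
    """Multi-prompt composition: sum signed, scaled guidance terms.
    d_j = +1 steers toward a concept, d_j = -1 steers away from it."""
    out = eps_uncond.copy()
    for eps_c, gamma, d in zip(eps_conds, gammas, directions):
        out += d * gamma * (eps_c - eps_uncond)
    return out

eps_u = np.zeros(3)                  # unconditional prediction (toy)
eps_a = np.array([1.0, 0.0, 0.0])    # prediction conditioned on concept c_1
eps_b = np.array([0.0, 1.0, 0.0])    # prediction conditioned on concept c_2

# Steer toward c_1 and away from c_2, both with scale 7.5.
eps = composed_guidance(eps_u, [eps_a, eps_b], [7.5, 7.5], [+1, -1])
print(eps)
```

With a single prompt and $d_1 = +1$ this reduces to ordinary classifier-free guidance; with several prompts it is exactly the arithmetic in concept space that the attacks in this paper build on.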

