DINOv2: State-of-the-art computer vision models with self-supervised learning (Translation)


Original article: ai.facebook.com/blog/dino-v…

DINOv2: State-of-the-art computer vision models with self-supervised learning


April 17, 2023

Summary:

Key takeaways up front, to make the article easier to follow as you read.

  • Self-supervised learning in computer vision has been making progress, and DINOv2 is one example of it.
  • DINOv2 is a self-supervised learning framework that trains image features without needing labeled data.
  • Building large neural networks requires more training data; DINOv2 uses publicly available web data to build a large-scale pretraining dataset of 142 million images.
  • DINOv2 adds new regularization methods and efficient implementation techniques that make training more stable; on the same hardware it runs roughly twice as fast while using only a third of the memory.
  • To avoid the high hardware cost of running large models, DINOv2 also uses self-distillation to compress the knowledge of large models into smaller ones at only a small cost in accuracy.
  • On a wide range of vision tasks, DINOv2 matches or exceeds text-image models such as CLIP and OpenCLIP, and it does so without fine-tuning; linear evaluation is enough.
  • Ultimately, DINOv2 is meant to serve as a building block that can be integrated into more complex AI systems, enabling deeper image reasoning for large language models.

Foreword

DINOv2 is able to take a video and generate a higher-quality segmentation than the original DINO method. DINOv2 allows remarkable properties to emerge, such as a robust understanding of object parts, and robust semantic and low-level understanding of images.

Translator's notes:

Segmentation

  • In computer vision, image segmentation is the process of subdividing a digital image into multiple sub-regions (sets of pixels, also known as superpixels).

Low-level understanding of images

  • "Low-level understanding of images" 通常指对图像的基础特征的理解,包括但不限于颜色、纹理、形状等。这些特征不需要高级的语义知识,而是与图像中的像素级信息有关。在计算机视觉中,这些特征通常是用来进行图像处理、分割、分类等任务的基础。在 DINOv2 中,对图像的低层次理解是指它能够有效地利用图像中的低级特征来进行图像分割和语义理解,而不仅仅是依赖高级的语义知识。

Key Points

  • Meta AI has built DINOv2, a new method for training high-performance computer vision models.

  • DINOv2 delivers strong performance and does not require fine-tuning. This makes it suitable for use as a backbone for many different computer vision tasks.

  • Because it uses self-supervision, DINOv2 can learn from any collection of images. It can also learn features, such as depth estimation, that the current standard approach cannot.

  • We are open-sourcing our model and sharing an interactive demo.

Translator's notes:

depth estimation:

When we look at an image, we can easily judge the distance and depth of the objects in it. For a computer vision model, however, understanding depth is a hard problem. Depth estimation is a computer vision task that aims to estimate the depth of objects in a scene from one or more images. Depth maps can be used to build 3D models or to create realistic effects in virtual reality applications.

That DINOv2 can learn features such as depth estimation means it can infer the distance and depth of objects from an image, a very useful capability for many computer vision tasks.

open-sourcing

Thanks to the authors for open-sourcing it.


Today, we are open-sourcing DINOv2, the first method for training computer vision models that uses self-supervised learning to achieve results that match or surpass the standard approach used in the field.


Translator's notes:

the standard approach used in the field

Here, "the standard approach used in the field" refers to the training approach currently prevalent in computer vision. Traditional supervised learning requires large amounts of manually annotated data, which costs a lot of labor and time, and the annotation process can be subjective and inconsistent. Self-supervised learning is a newer approach that does not need large labeled datasets; it trains the model using the information in the images themselves. Compared with traditional supervised learning it is more scalable and adaptable, and on many computer vision tasks it can achieve better results. Hence the claim that DINOv2 matches or surpasses the standard approaches currently used in the field.

ViT (Vision Transformer) can be considered one of the standard architectures in computer vision, especially for image classification. A ViT splits an image into a fixed number of patches, converts those patches into a sequence of tokens, and trains a Transformer on that sequence to learn features. ViTs are typically trained with supervised learning on labeled data (and more recently with image-text or self-supervised objectives), and their results have surpassed earlier standard methods on many computer vision tasks.
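
To make the note above concrete, here is a minimal, illustrative sketch (not Meta's code) of the patch-to-sequence step a ViT performs; the 224-pixel input, 16-pixel patches, and 768-dim embedding are just common example values.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into fixed-size patches and project each patch to a token vector."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution applies one projection per non-overlapping patch.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, 768): a token sequence for the Transformer

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```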


Self-supervised learning — the same method that’s used to create cutting-edge large language models for text applications — is a powerful, flexible way to train AI models because it does not require large amounts of labeled data. Like with other self-supervised systems, models using the DINOv2 method can be trained on any collection of images, without needing any associated metadata. Think of it as being able to learn from all the images it’s given, rather than only those that contain a specific set of hashtags or alt text or caption.


Translator's note:

Not much to add here; it simply says the model can train itself with self-supervision, which is impressive.


Unlike many recent reconstruction-based self-supervised learning methods, our model requires no fine-tuning. DINOv2 provides high-performance features that can be directly used as inputs for simple linear classifiers. This flexibility means DINOv2 can be used to create multipurpose backbones for many different computer vision tasks. Our measurements show very strong prediction capabilities on tasks such as classification, segmentation, and image retrieval. Surprisingly, on depth estimation, our features significantly outperform specialized state-of-the-art pipelines evaluated both in-domain and out-of-domain. We believe that this strong out-of-domain performance is due to the combination of self-supervised feature learning and the use of lightweight task-specific modules, such as linear classifiers. Finally, because we don’t resort to fine-tuning, the backbone remains general and the same features can be used simultaneously on many different tasks.


Translator's notes:

recent reconstruction-based self-supervised learning methods

These are methods that generate a self-supervisory signal by reconstructing parts of the image itself, for example masked-image modeling approaches such as MAE and BEiT that predict masked pixels or patches. (Contrastive and joint-embedding methods such as SimCLR, MoCo, and BYOL are a different family.) Such methods typically need to be fine-tuned to perform well on a given task, whereas DINOv2 does not: its features can be used directly.

inputs for simple linear classifiers

This means the learned features can be fed directly into a simple linear classifier, a common machine learning model that maps input features to class labels. Because DINOv2 learns strong feature representations, those features can be used as-is, without further feature engineering or preprocessing, which is what allows DINOv2 to be applied to many different computer vision tasks such as classification, segmentation, and image retrieval.
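
To illustrate "inputs for simple linear classifiers", below is a minimal linear-probing sketch: the backbone stays frozen and only a linear head is trained. It assumes the torch.hub entry points published in the facebookresearch/dinov2 repository (e.g. dinov2_vits14, which outputs 384-dim features); the class count, learning rate, and data loading are placeholders you would supply.

```python
import torch
import torch.nn as nn

# Assumes the torch.hub entry points from github.com/facebookresearch/dinov2;
# dinov2_vits14 is the small distilled backbone (384-dim global features).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()                       # frozen: the backbone is never fine-tuned
for p in backbone.parameters():
    p.requires_grad = False

num_classes = 1000                    # placeholder: depends on your dataset
head = nn.Linear(384, num_classes)    # the only trainable module
optimizer = torch.optim.SGD(head.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    # images: (B, 3, 224, 224), resized to a multiple of the 14-pixel patch size
    # and ImageNet-normalized; labels: (B,)
    with torch.no_grad():             # backbone acts as a fixed feature extractor
        feats = backbone(images)      # (B, 384)
    loss = criterion(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```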

in-domain and out-of-domain

"in-domain"指的是模型在训练时所使用的数据集的领域,例如,在分类任务中,如果模型在猫和狗的数据集上进行训练,则“猫和狗”就是模型的in-domain领域。在这个领域内,模型通常表现最佳,因为它已经学会了该领域的特征和规律。

"out-of-domain"则指的是模型在训练时没有接触到的领域,例如,在分类任务中,如果将模型训练于猫和狗的数据集,而在测试时用鸟类的数据集进行测试,则鸟类就是模型的out-of-domain领域。在这种情况下,模型可能会表现出较差的性能,因为它没有学会处理这种新领域的特征和规律。

So "both in-domain and out-of-domain" here means that DINOv2 features perform strongly on the training domain and on unseen domains alike, i.e. they generalize well.


Self-supervised computer vision models like DINOv2 will be useful in a wide variety of applications. Meta collaborated with the World Resources Institute to use AI to map forests, tree by tree, across areas the size of continents. Our self-supervised model was trained on data from forests in North America, but evaluations confirm that it generalizes well and delivers accurate maps in other locations around the world.



DINOv2 complements our other recent computer vision research, including Segment Anything. Segment Anything is a promptable segmentation system focused on zero-shot generalization to a diverse set of segmentation tasks. DINOv2 combines with simple linear classifiers to achieve strong results across multiple tasks beyond the segmentation sub-field, creating horizontal impact.



Overcoming the limitations of image-text pretraining

Translator's note: "Overcoming the limitations of image-text pretraining." This likely refers to the limitations of earlier image-text pretraining techniques, such as insufficient data and long training times; it may also refer to problems that current image-text pretraining still faces, such as how to combine images and text effectively during training and how to improve the performance and applicability of the pretrained models.


In recent years, a different technique, known as image-text pretraining, has been the standard approach for many computer vision tasks. But because the method relies on handwritten captions to learn the semantic content of an image, it ignores important information that typically isn’t explicitly mentioned in those text descriptions. For instance, a caption of a picture of a chair in a vast purple room might read “single oak chair.” Yet, the caption misses important information about the background, such as where the chair is spatially located in the purple room. Because of that, we believe caption-based features lack a proper understanding of local information and can lead to poor performance on downstream tasks requiring detailed localized information. Because DINOv2 is based on self-supervised learning, we avoid this problem by not relying on text descriptions. This, in turn, coupled with strong execution, allows DINOv2 to provide state-of-the-art results for monocular depth estimation. For context, monocular depth estimation is a task where the goal is to predict which objects are in the foreground and which are in the background.



In general, the need for human annotations of images is a bottleneck because it limits how much data you can use to train a model. In specialized application domains, images are hard or even impossible to label. Training machine learning models on labeled cellular imaging, for instance, is challenging, as there are a limited number of experts who can annotate the cells, and certainly not at the scale required. Self-supervised training on microscopic cellular imagery, however, opens up the way for foundational cell imagery models and, consequently, biological discovery, as it becomes possible to compare known treatments with new ones, for example. The same story holds for the estimation of animal density and abundance, allowing the identification of sources of biodiversity decline and the effectiveness of conservation efforts. Both examples are based on the original open source DINO algorithm, and we hope DINOv2 can improve such lines of work. DINOv2’s training stability and scalability will fuel further advances in applicative domains. One application already underway is our forest-mapping collaboration with the World Resources Institute noted above.



Our release comes at a time when the performance of joint embedding models that train features by matching data augmentations is plateauing. Specifically, the evaluation performance on ImageNet had moved by 10 percent between 2019 and 2021, and not much since then (+1 percent since 2021). The community focused more on developing alternatives, such as masked-image modeling, limiting progress in that field. In addition, the DINO class of models, among other SSL methods, was difficult to train outside of the classical scope of ImageNet, limiting their adoption for research.

Making progress from DINO to DINOv2 required overcoming several challenges: creating a large and curated training dataset, improving the training algorithm and implementation, and designing a functional distillation pipeline.


Translator's note:

designing a functional distillation pipeline

Here, distillation refers to knowledge distillation, a technique that transfers the knowledge of a large pretrained model to a small target model to improve the small model's performance. "A functional distillation pipeline" means a working knowledge-distillation workflow designed for DINOv2, which extracts knowledge from a large DINOv2 model and passes it on to smaller DINOv2 models to boost their performance.


Building a large, curated, and diverse dataset to train the models


One of the key components of our work is training larger architectures, and to increase the performance, larger models require more data for training. But accessing more data is not always possible. With no sufficiently large curated dataset available to suit our needs, we looked into leveraging a publicly available repository of crawled web data and built a pipeline to select useful data inspired by LASER. Two key ingredients are required for building a large-scale pretraining dataset from such a source: discarding irrelevant images and balancing the dataset across concepts. Such delicate curation can’t realistically be done manually, and we wanted a method that allowed capturing distributions not easily associated with metadata. This was achieved by curating a set of seed images from a collection of about 25 third-party datasets and extending it by retrieving images sufficiently close to those seed images. This approach enabled us to produce a pretraining dataset totaling 142 million images out of the 1.2 billion source images.

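As a rough sketch of the retrieval step described above (not the actual DINOv2 curation pipeline), the snippet below keeps uncurated web images whose embedding is close enough to some curated seed image, using FAISS for the nearest-neighbor search; the embedding files and the 0.6 similarity threshold are hypothetical.

```python
import numpy as np
import faiss  # pip install faiss-cpu

# Hypothetical inputs: L2-normalized embeddings computed with any pretrained image encoder.
seed_embs = np.load("seed_embeddings.npy").astype("float32")  # (n_seed, d)
web_embs = np.load("web_embeddings.npy").astype("float32")    # (n_web, d)

index = faiss.IndexFlatIP(seed_embs.shape[1])  # inner product == cosine similarity on normalized vectors
index.add(seed_embs)

# For every uncurated web image, find its most similar curated seed image.
sims, _ = index.search(web_embs, 1)

# Keep only web images sufficiently close to some seed concept (threshold is illustrative).
keep = np.where(sims[:, 0] > 0.6)[0]
print(f"retained {len(keep)} of {len(web_embs)} web images")
```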


Algorithmic and technical improvements


With more training data, larger models perform better than smaller ones, but their training poses two major challenges. First, increasing the model size makes the training more challenging because of potential instability. In DINOv2, we included additional regularization methods inspired by the similarity search and classification literature, making the training algorithm much more stable. Second, in order to remain tractable, larger models require more efficient implementations. The DINOv2 training code integrates the latest mixed-precision and distributed training implementations proposed in the cutting-edge PyTorch 2 (fully sharded data parallel), an efficient implementation of the stochastic depth technique, as well as the latest compute algorithm implementations of xFormers (in particular, variable-length memory-efficient attention). This allows faster and more efficient iteration cycles. Overall, with equivalent hardware, our code runs around twice as fast with only a third of the memory usage, allowing scaling in data, model size, and hardware.

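To give a flavor of the efficiency techniques listed above, here is a generic PyTorch 2 sketch of fully sharded data parallel (FSDP) training with bf16 mixed precision; it is not the DINOv2 training code, and torchvision's ViT-B/16 merely stands in for the real backbone.

```python
import os
import torch
import torch.distributed as dist
import torchvision
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def main():
    # Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Any nn.Module works here; ViT-B/16 is only a stand-in backbone.
    model = torchvision.models.vit_b_16().cuda()

    mp = MixedPrecision(
        param_dtype=torch.bfloat16,   # compute in bf16
        reduce_dtype=torch.bfloat16,  # gradient all-reduce in bf16
        buffer_dtype=torch.bfloat16,
    )
    # FSDP shards parameters, gradients, and optimizer state across GPUs,
    # which is what keeps the memory usage of large models tractable.
    model = FSDP(model, mixed_precision=mp)

    x = torch.randn(8, 3, 224, 224, device="cuda")
    model(x).sum().backward()         # one illustrative forward/backward pass

if __name__ == "__main__":
    main()
```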


Strong, lightweight models with distillation

Running inference for larger models requires more powerful hardware, potentially limiting many practical use cases. To circumvent this problem, researchers typically resort to model distillation, to compress the knowledge of a large model into a smaller one. Our training algorithm is based on self-distillation, making it straightforward to compress our large models into smaller ones. This procedure allows us to compress our highest-performance architecture into significantly smaller ones at only a minimal cost in accuracy, for a dramatically decreased inference cost, leading to remarkably strong ViT-Small, ViT-Base, and ViT-Large models.

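Below is a schematic sketch of the distillation idea, compressing a frozen large teacher into a smaller student by matching features. The cosine-matching loss is a generic stand-in, not the actual DINOv2 self-distillation objective, and the hub model names assume the facebookresearch/dinov2 release (ViT-L/14 outputs 1024-dim features, ViT-S/14 outputs 384-dim features).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Frozen large teacher and trainable small student (names assume the dinov2 torch.hub release).
teacher = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()
student = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
for p in teacher.parameters():
    p.requires_grad = False

# Project student features to the teacher's width so the two can be compared.
proj = nn.Linear(384, 1024)
opt = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-4)

def distill_step(images):
    with torch.no_grad():
        t = F.normalize(teacher(images), dim=-1)   # target features from the frozen teacher
    s = F.normalize(proj(student(images)), dim=-1)
    loss = 1 - (s * t).sum(dim=-1).mean()          # illustrative cosine-matching loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```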

The DINOv2 family of models drastically improves over the previous state of the art in self-supervised learning (SSL), and reaches performance comparable with weakly-supervised features (WSL).


Releasing a family of high-performance pretrained models


We release DINOv2 pretrained models to the community with a matching stable, accurate, and scaled implementation: We share pretraining code and recipe for ViT-L/16 (300 M params) and ViT-g/14 (1.1 B params) architectures, as well as checkpoints for a range of pretrained models from the larger ViT-g/14 down to smaller distilled models (ViT-S/14, ViT-B/14 and ViT-L/14). The performance of our approach is competitive or better than the performance of text-image models such as CLIP and OpenCLIP on a wide array of tasks, some of which are illustrated in our demo. Don’t hesitate to play with it! Our features can be used out of the box for nearest neighbor classification or paired with linear classification, yielding strong performance. DINOv2 allows skipping the model adaptation phase (fine-tuning) — our linear evaluation performance is close to their fine-tuned counterparts (within 2 percent on ImageNet-1k).

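As a usage note for "out of the box for nearest neighbor classification", the fragment below runs a k-NN classifier directly on frozen DINOv2 features with scikit-learn; the saved feature files and the choice of k = 20 are placeholders.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder arrays: features extracted once with a frozen DINOv2 backbone (see the linear-probe sketch above).
train_feats = np.load("train_feats.npy")     # (N_train, 384)
train_labels = np.load("train_labels.npy")   # (N_train,)
test_feats = np.load("test_feats.npy")       # (N_test, 384)

# No fine-tuning anywhere: classification is just a comparison of feature vectors.
knn = KNeighborsClassifier(n_neighbors=20, metric="cosine")
knn.fit(train_feats, train_labels)
predictions = knn.predict(test_feats)
```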

Going forward, the team plans to integrate this model, which can function as a building block, in a larger, more complex AI system that could interact with large language models. A visual backbone providing rich information on images will allow complex AI systems to reason on images in a deeper way than describing them with a single text sentence. Models trained with text supervision are ultimately limited by the image captions. With DINOv2, there is no such built-in limitation.
