This is a CoSOD survey. It explains what CoSOD is, identifies flaws in the existing datasets, releases a new dataset (CoSOD3k), proposes a unified, trainable framework (CoEG-Net), summarizes 40 state-of-the-art algorithms, benchmarks 18 of them on 4 datasets, provides a benchmark toolbox that integrates the publicly available datasets and multiple evaluation metrics, and discusses the challenges and future work of CoSOD.
CoSOD
What is CoSOD
As an extension of this, co-salient object detection (CoSOD) has recently emerged to operate on a set of images.
The goal of CoSOD is to extract the salient object(s) that are common within a single image (e.g., red-clothed football players in Fig. 1 (b)) or across multiple images (e.g., the blue-clothed gymnast in Fig. 1 (c)).
Two important characteristics of co-salient objects are local saliency and global similarity.
Application prospects
Due to its useful potential, CoSOD has been attracting growing attention in many applications, including collection-aware crops [23], co-segmentation [24], [25], weakly supervised learning [26], image retrieval [27], [28], and video foreground detection [29].
- collection-aware crops
  - Cosaliency: Where people look when comparing images
- co-segmentation
  - Higher-order image co-segmentation
  - Object-Based Multiple Foreground Video Co-Segmentation via Multi-State Selection Graph
- weakly supervised learning
  - Capsal: Leveraging captioning to boost semantics for salient object detection
- image retrieval
  - A model of visual attention for natural image retrieval
  - Salientshape: group saliency in image collections
- video foreground detection
  - Cluster-based co-saliency detection
Existing datasets
MSRC and Image Pair are the two earliest datasets.
THUR15K and CoSal2015 are two large-scale, publicly available datasets; CoSal2015 in particular is widely used for evaluating CoSOD algorithms.
- MSRC
MSRC was designed for recognizing object classes from images.
This dataset includes 8 image groups and 240 images in total, with manually annotated pixel-level ground-truth data.
- iCoSeg
The iCoSeg [41] dataset was released in 2010. It is a relatively large dataset consisting of 38 categories with 643 images in total. Each image group in this dataset contains 4 to 42 images, rather than only 2 images as in the Image Pair dataset.
- Image Pair
Image Pair was specifically designed for image pairs and contains 210 images (105 groups) in total.
- CoSal2015
- WICOS
Different from the above-mentioned datasets, the WICOS [31] dataset aims to detect co-salient objects within a single image, where each image can be viewed as one group.
Problems with existing datasets
- Although the aforementioned datasets have advanced the CoSOD task to various degrees, they are severely limited in variety, with only dozens of groups. On such small-scale datasets, the scalability of methods cannot be fully evaluated.
- Moreover, these datasets only provide object-level labels. None of them provide richer annotations such as bounding boxes or instances, which are important for progressing many vision tasks and multi-task modeling, especially in the current deep learning era, where models are often data-hungry.
- Most CoSOD datasets tend to focus on the appearance-similarity between objects to identify the co-salient object across multiple images. However, this leads to data selection bias [Salient objects in clutter: Bringing salient object detection to the foreground], [Unbiased look at dataset bias] and is not always appropriate, since, in real-world applications, the salient objects in a group of images often vary in terms of texture, scene, and background, even if they belong to the same category.
Evaluating CoSOD results
Limitations of existing evaluation
- Completeness.
MAE (Mean Absolute Error) [36] and F-measure [37] are two widely used metrics in CoSOD/SOD model evaluation. As discussed in [38], these metrics have inherent limitations. To provide thorough and reliable conclusions, we need to introduce more accurate metrics, e.g., structure-based or perception-based evaluation metrics (namely, S-measure and E-measure).
- Fairness. To evaluate the F-measure, the first step is to binarize a saliency map into a set of foreground maps using different threshold values. There are many binarization strategies [39], such as adaptive threshold, fixed threshold and so on. However, different strategies will result in different F-measure performances. Further, few previous works provide details on their binarization strategy, leading to inconsistent F-measures for different researchers.
To address the aforementioned limitations, we argue that integrating various publicly available CoSOD algorithms, datasets, and metrics, and then providing a complete, unified benchmark, is highly desired.
Existing evaluation metrics
To provide a comprehensive evaluation, four widely used metrics are employed for evaluating CoSOD performance, including maximum F-measure [37], mean absolute error (MAE) [36], S-measure [126], and maximum E-measure [127]. The complete evaluation toolbox can be found at github.com/DengPingFan….
- F-measure
[37] evaluates the weighted harmonic mean of precision and recall. The saliency map has to be binarized using different thresholds, where each threshold yields a binary saliency prediction. The predicted and ground-truth binary maps are compared to obtain precision and recall values. The maximum F-measure is typically chosen, i.e., the F-measure score corresponding to the best fixed threshold over the whole dataset.
- MAE
[36] is a much simpler evaluation metric that directly measures the absolute difference between the ground-truth value and the predicted value, without any binarization. Both F-measure and MAE evaluate the prediction in a pixel-by-pixel manner.
- S-measure
[126] is designed to evaluate the structural similarity between a saliency map and the corresponding ground-truth. It can directly evaluate the continuous saliency prediction without binarization, while also considering large-scale structural similarity.
- E-measure
[127] is a perceptual metric that evaluates both local and global similarity between the predicted map and ground-truth simultaneously.
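The two pixel-wise metrics (MAE and the thresholded F-measure, whose per-threshold curve's maximum gives maxF) can be sketched in a few lines of NumPy. This is my own illustration, not the official toolbox code:

```python
import numpy as np

def mae(pred, gt):
    """Mean Absolute Error between a continuous saliency map and a binary mask."""
    return float(np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean())

def f_measure_curve(pred, gt, beta2=0.3, n_thresholds=256):
    """F-measure at each of 256 thresholds; the curve's max gives maxF.

    beta2 = 0.3 is the weighting conventionally used in SOD papers.
    """
    gt = gt.astype(bool)
    scores = np.empty(n_thresholds)
    for i, t in enumerate(np.linspace(0, 1, n_thresholds)):
        fg = pred >= t                       # binarize at threshold t
        tp = np.logical_and(fg, gt).sum()    # true-positive pixels
        precision = tp / max(fg.sum(), 1)
        recall = tp / max(gt.sum(), 1)
        denom = beta2 * precision + recall
        scores[i] = (1 + beta2) * precision * recall / denom if denom > 0 else 0.0
    return scores
```

A perfect prediction (pred equal to the binary ground-truth) gives MAE 0 and a curve whose maximum is 1.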
Differences between CoSOD and SOD evaluation
CoSOD involves grouping, i.e., metrics are computed within each group (the objects that appear throughout the images of a group are exactly the co-salient objects). One detail deserves attention:
- For metrics whose values are obtained directly (e.g., MAE, S-measure, weighted F-measure, adaptive F-measure, and adaptive E-measure), compute the mean within each group first, then take the mean of the per-group results.
- For metrics computed by sweeping thresholds (e.g., max F-measure, mean F-measure, max E-measure, and mean E-measure), average the length-256 curves within each group first, then average those per-group curves across all groups. Taking the maximum or the mean of the final length-256 curve gives the corresponding metric value.
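The two-level (per-group, then overall) averaging described above can be sketched as follows; the helper names are mine, not from any released toolbox:

```python
import numpy as np

def dataset_score_scalar(per_image_scores, group_ids):
    """Scalar metrics (MAE, S-measure, ...): mean within each group,
    then mean of the per-group means."""
    groups = {}
    for score, gid in zip(per_image_scores, group_ids):
        groups.setdefault(gid, []).append(score)
    return float(np.mean([np.mean(v) for v in groups.values()]))

def dataset_curve(per_image_curves, group_ids):
    """Threshold-swept metrics (length-256 F/E-measure curves): average
    the curves within each group, then across groups.  max()/mean() of
    the returned curve gives maxF/meanF etc."""
    groups = {}
    for curve, gid in zip(per_image_curves, group_ids):
        groups.setdefault(gid, []).append(np.asarray(curve, dtype=np.float64))
    return np.mean([np.mean(g, axis=0) for g in groups.values()], axis=0)
```

Note that this differs from plain SOD evaluation, where all images would be averaged in one step; here a small group weighs as much as a large one.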
For the exact definitions of each metric, see my Python code or the MATLAB code provided by Fan.
Note that the code linked here computes metrics for SOD or COD data.
To handle the group-wise computation required by CoSOD, adjustments are needed; see another piece of code provided by Fan for computing CoSOD metrics. Its metric coverage is not complete and the code contains some errors (the same error pointed out here: github.com/DengPingFan…), but its computation logic is a useful reference:
I have recently put together a Python implementation (not yet public) that covers more metrics (judging by this paper, all of the SOD metrics can in fact be applied to CoSOD) and runs faster.
My thoughts on speeding up the E-measure computation can be found in the following two articles:
- How I made the computation 25.6× faster: www.yuque.com/lart/blog/a…
- How I made the computation >150× faster: www.yuque.com/lart/blog/l…
Contributions
- Proposes the CoSOD3k dataset, containing 13 super-classes, 160 groups, and 3316 images, with annotations at multiple levels: category, bounding box, object, and instance.
- Surveys 40 related works, benchmarks 18 models, and provides a set of evaluation code.
- Proposes a simple yet effective CoSOD framework (CoEG-Net) that uses co-attention maps and a SOD network to embed appearance and semantic features in a unified, simultaneous manner.
- Analyzes the results and offers several suggestions for future work.
CoSOD3k
The overall dataset mask (the right of Fig. 7) tends to appear as a center-biased map without shape bias. As is well known, humans usually pay more attention to the center of a scene when taking a photo. Thus, it is easy for a SOD model to achieve a high score by employing a Gaussian function in its algorithm. Due to space limitations, we present all 160 category-specific masks in the supplementary materials.
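To make the center bias concrete: a trivial "detector" that ignores its input and outputs a 2-D Gaussian centered in the image already exploits this bias. A sketch (the sigma parameterization is my own choice):

```python
import numpy as np

def gaussian_center_prior(h, w, sigma=0.3):
    """A saliency map that is just a 2-D Gaussian centered in the image.

    sigma is expressed as a fraction of the image size, so the prior
    is resolution-independent.
    """
    ys = (np.arange(h) - (h - 1) / 2) / h    # normalized vertical offsets
    xs = (np.arange(w) - (w - 1) / 2) / w    # normalized horizontal offsets
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    g = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return g / g.max()                       # peak value 1 at the center
```

Such a map peaks at the image center and decays toward the borders, which is exactly the shape of the overall dataset mask the paper shows.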
CoEG-Net
The paper proposes a two-branch framework that captures concurrent dependencies and the salient foreground separately, "in a multiply independent fashion". The final co-saliency prediction is produced by element-wise multiplication of the co-attention maps obtained by the top branch and the saliency prior maps obtained by the bottom branch.
- The bottom (saliency) branch is straightforward: it directly uses an EGNet trained on DUTS to collect multi-scale saliency priors. This helps identify the salient regions within each image without exploiting cross-image information.
- The top branch generates the co-attention maps in an unsupervised manner. This part deserves a closer look.
Co-attention Projection for Co-saliency Learning
The design here is inspired by CAM [Learning deep features for discriminative localization]:
- Given an input image $I$ with image-level category (keyword) label $c$;
- obtain the feature activation maps $F \in \mathbb{R}^{H \times W \times K}$ from the last convolutional layer of VGG;
- with category supervision, obtain the weight $w_k^c$ for each channel $k$ of the activation output (e.g., from the parameters of the fully connected classification layer; the weights $w^c$ could be trained using keyword-level weak supervision);
- the final class-specific attention map is then $M_c = \sum_k w_k^c F_k$;
- more concretely, for each position $(x, y)$ of the feature map: $M_c(x, y) = \sum_k w_k^c F_k(x, y)$.

CAM thus realizes a linear transformation from the features to the class-specific activation map $M_c$.
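The CAM projection is just a per-position weighted sum over channels. A minimal sketch (the array names are mine, for illustration only):

```python
import numpy as np

def cam(features, class_weights):
    """Class activation map: M_c(x, y) = sum_k w_k^c * F_k(x, y).

    features:      (H, W, K) activations of the last conv layer
    class_weights: (K,) FC-layer weights for one class c
    """
    # Contract the channel axis against the class weights at every position.
    return np.einsum("hwk,k->hw", features, class_weights)
```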
This paper follows the same idea and, since no category labels are available, explores a further unsupervised formulation.
The authors give the following analysis:
Ideally, the unknown common object category among a group of associated images should correspond to a linear projection that results in high class activation scores in the common object regions, while having low class activation scores in other image regions.
From another point of view, the common object category should correspond to the linear transformation that generates the highest variance (most informative) in the resulting class activation maps.
Following the idea of the coarse localization task [Unsupervised object discovery and co-localization by deep descriptor transformation], we achieve this goal by exploring the classical principal component analysis (PCA) method [LIII. On lines and planes of closest fit to systems of points in space], which is the simplest way of revealing the internal structure of the data in a way that best explains its variance.
Next comes a quick PCA refresher:
- Given the feature maps $\{F_n\}_{n=1}^{N}$ of the $N$ images in a group, with $F_n \in \mathbb{R}^{H \times W \times K}$, collect the $K$-dimensional descriptors $f_i$ from every spatial position of every image;
- the aim is a transformation that maps $\{F_n\}$ to co-attention maps $\{A_n\}$ with maximal variance (note that this is a set of results); this transformation is obtained by analyzing the covariance matrix of the feature descriptors $\{f_i\}$;
- compute the mean $\bar{f} = \frac{1}{NHW} \sum_{i=1}^{NHW} f_i$, where the stacked descriptors form an $NHW \times K$ tensor;
- subtract the mean from $\{f_i\}$ to obtain the zero-mean descriptors $\hat{f}_i = f_i - \bar{f}$;
- then form the covariance matrix $\mathrm{Cov} = \frac{1}{NHW} \sum_i (\hat{f}_i - \bar{f})(\hat{f}_i - \bar{f})^{\top}$ (this is how the original paper writes it, but why subtract the mean again?);
- the linear transformation $w^{*}$ is then given by the eigenvector of $\mathrm{Cov}$ corresponding to its largest eigenvalue, where $w^{*}$ denotes that eigenvector; projecting each position onto it gives the co-attention maps, $A_n(x, y) = {w^{*}}^{\top} \hat{F}_n(x, y)$.
Note that the resulting attention maps are grayscale and highly ambiguous. Before they can be combined with the saliency prior map already produced by EGNet, they need further processing; the paper uses DenseCRF and manifold ranking to refine them.
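The PCA-based co-attention projection can be sketched in NumPy as below; this illustrates the steps described above under my own naming, not the authors' released code, and the DenseCRF/manifold-ranking refinement is omitted:

```python
import numpy as np

def co_attention_maps(features):
    """features: list of N arrays of shape (H, W, K), one per image in the group.
    Returns N (H, W) co-attention maps via the top principal direction."""
    # Stack all spatial positions of all images into (N*H*W, K) descriptors.
    descs = np.concatenate([f.reshape(-1, f.shape[-1]) for f in features])
    mean = descs.mean(axis=0)
    centered = descs - mean                        # zero-mean descriptors
    cov = centered.T @ centered / len(centered)    # (K, K) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    w = eigvecs[:, -1]                             # eigenvector of the largest eigenvalue
    # Project every position of every (mean-subtracted) feature map onto w.
    return [(f - mean) @ w for f in features]
```

The sign of an eigenvector is arbitrary, so in practice the projected maps may need to be sign-flipped (or normalized) before thresholding.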
Experimental results
Discussion and suggestions
- Good performance from SOD methods does not necessarily mean that the current datasets are not complex enough, or that directly applying SOD methods yields good performance: From the evaluation, we observe that, in most cases, the current SOD methods can obtain very competitive or even better performances than the CoSOD methods. However, this does not necessarily mean that the current datasets are not complex enough or that using the SOD methods directly can obtain good performances; the performances of the SOD methods on the CoSOD datasets are actually lower than those on the SOD datasets.
- Several problems remain in CoSOD research: Consequently, the evaluation results reveal that many problems in CoSOD are still under-studied, and this makes the existing CoSOD models less effective.
- Scalability: Existing methods struggle to process larger groups of images simultaneously. Reducing the computational cost incurred by the number of images within a group is a key issue for practical applications.
- Stability: Some methods depend on the order of the samples within a group, which hurts the stability of model performance (performance may change if the order changes or the sub-group partition changes). This limits practical application.
- Compatibility: Introducing SOD methods into a CoSOD framework is shown to be effective in this paper, but how to achieve more time-efficient, end-to-end trainable detection is a question worth studying.
- Metrics: Existing metrics mainly evaluate predictions on single images and do not consider evaluating predictions across images.
Related links
- Paper: arxiv.org/abs/2007.03…
- Project page: dpfan.net/CoSOD3K/
- CoSOD paper collection: hzfu.github.io/proj_cosal_…