Applications of CLIP in Video Understanding
CLIP4Clip
Original paper: CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Architecture
- The text branch uses a Transformer; the video branch uses a ViT, applied to the sampled frames.
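As a concrete illustration of this frame-level setup, here is a minimal sketch using the open-source `clip` package; the random tensor stands in for preprocessed video frames, and the batch size and frame count are placeholder values rather than the paper's actual setting.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

B, T = 2, 12  # illustrative batch size / frames sampled per clip
# Stand-in for preprocessed frames; in practice apply `preprocess` to real frames.
frames = torch.randn(B, T, 3, 224, 224, device=device, dtype=model.dtype)
tokens = clip.tokenize(["a man is cooking", "a dog runs on grass"]).to(device)

with torch.no_grad():
    text_emb = model.encode_text(tokens)                    # (B, D): one feature per caption
    frame_emb = model.encode_image(frames.flatten(0, 1))    # (B*T, D): one feature per frame
    frame_emb = frame_emb.view(B, T, -1)                    # (B, T, D): T frame-level features per video
```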
Key Points
- Because a video has an extra temporal dimension compared to an image, the ViT produces one cls token per frame, and these multiple video cls tokens cannot be compared with the single text cls token by a plain dot product.
- The authors adopt the three most common ways of resolving this (a sketch of all three follows this list):
  - Parameter-free type: simply average the per-frame cls tokens; the simplest method, and also shown in this paper to be the most competitive;
  - Sequential type: model the temporal order with an LSTM or a Transformer;
  - Tight type: add positional encodings to the text feature and the multi-frame video features and feed them jointly into a single Transformer;
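A minimal PyTorch sketch of the three similarity calculators is given below. The layer counts, positional/type embeddings, and the MLP scoring head are illustrative assumptions rather than the authors' exact implementation; `text_emb` is the text cls feature `(B, D)` and `frame_embs` are the per-frame cls features `(B, T, D)` from the sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 512  # shared CLIP embedding dimension (ViT-B/32)

def parameter_free_sim(text_emb, frame_embs):
    """Parameter-free type: mean-pool the per-frame features, then cosine similarity."""
    video_emb = frame_embs.mean(dim=1)                       # (B, D)
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    return text_emb @ video_emb.t()                          # (B, B) text-video similarity matrix

class SequentialSim(nn.Module):
    """Sequential type: temporal modelling with a small Transformer over frame features."""
    def __init__(self, d=D, max_frames=12):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(max_frames, d))  # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, text_emb, frame_embs):
        T = frame_embs.size(1)
        x = self.temporal(frame_embs + self.pos[:T]).mean(dim=1)   # (B, D) video embedding
        return F.normalize(text_emb, dim=-1) @ F.normalize(x, dim=-1).t()

class TightSim(nn.Module):
    """Tight type: feed the text token and frame tokens jointly into one Transformer,
    then predict a similarity score from the first output token."""
    def __init__(self, d=D, max_len=1 + 12):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(max_len, d))
        self.type_emb = nn.Parameter(torch.zeros(2, d))       # 0 = text token, 1 = video token
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, text_emb, frame_embs):
        # Scores one (text, video) pair per row: input sequence is (B, 1+T, D).
        T = frame_embs.size(1)
        x = torch.cat([text_emb.unsqueeze(1), frame_embs], dim=1)
        types = torch.cat([self.type_emb[:1], self.type_emb[1:].expand(T, -1)], dim=0)
        x = x + self.pos[: x.size(1)] + types
        return self.head(self.fusion(x)[:, 0]).squeeze(-1)    # (B,) similarity scores
```

Note the design difference: the first two calculators produce separate text and video embeddings (so a full similarity matrix can be computed cheaply for retrieval), while the tight type must run the fusion Transformer once per text-video pair.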
Summary
Because the conclusion of the original paper is already spot-on, it is quoted here.
In this paper, we use the pretrained CLIP as our backbone to solve the video clip retrieval task from frame-level input. We employ parameter-free type, sequential type, and tight type similarity calculators to obtain the final results. The experimental results demonstrate the effectiveness of our model and achieve the SOTA results on MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo. Besides, we give several insights from our empirical studies:
- image feature can also promote the video-text retrieval,
- post-pretraining on an even outstanding image-text pretrained CLIP can further improve the performance on video-text retrieval,
- 3D patch linear projection and sequential type similarity are promising approaches on the retrieval task,
- the CLIP used on video-text retrieval is learning-rate sensitive.