Applications of CLIP in Video Understanding
CLIP4Clip
Original paper: CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Architecture
- The text branch uses a Transformer; the video branch uses a ViT, applied to the sampled frames.
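As a concrete illustration of this frame-level setup, here is a minimal sketch using the open-source `clip` package; the random tensor stands in for preprocessed video frames, and the batch size and frame count are placeholder values rather than the paper's actual setting.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

B, T = 2, 12  # illustrative batch size / frames sampled per clip
# Stand-in for preprocessed frames; in practice apply `preprocess` to real frames.
frames = torch.randn(B, T, 3, 224, 224, device=device, dtype=model.dtype)
tokens = clip.tokenize(["a man is cooking", "a dog runs on grass"]).to(device)

with torch.no_grad():
    text_emb = model.encode_text(tokens)                    # (B, D): one feature per caption
    frame_emb = model.encode_image(frames.flatten(0, 1))    # (B*T, D): one feature per frame
    frame_emb = frame_emb.view(B, T, -1)                    # (B, T, D): T frame-level features per video
```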
Key Points
- Because a video has an extra temporal dimension compared to an image, the ViT produces one cls token per frame, and these multiple video cls tokens cannot be compared with the single text cls token by a plain dot product.
- The authors adopt the three most common ways of resolving this (a sketch of all three follows this list):
  - Parameter-free type: simply average the per-frame cls tokens; the simplest method, and also shown in this paper to be the most competitive;
  - Sequential type: model the temporal order with an LSTM or a Transformer;
  - Tight type: add positional encodings to the text feature and the multi-frame video features and feed them jointly into a single Transformer;
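A minimal PyTorch sketch of the three similarity calculators is given below. The layer counts, positional/type embeddings, and the MLP scoring head are illustrative assumptions rather than the authors' exact implementation; `text_emb` is the text cls feature `(B, D)` and `frame_embs` are the per-frame cls features `(B, T, D)` from the sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 512  # shared CLIP embedding dimension (ViT-B/32)

def parameter_free_sim(text_emb, frame_embs):
    """Parameter-free type: mean-pool the per-frame features, then cosine similarity."""
    video_emb = frame_embs.mean(dim=1)                       # (B, D)
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    return text_emb @ video_emb.t()                          # (B, B) text-video similarity matrix

class SequentialSim(nn.Module):
    """Sequential type: temporal modelling with a small Transformer over frame features."""
    def __init__(self, d=D, max_frames=12):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(max_frames, d))  # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, text_emb, frame_embs):
        T = frame_embs.size(1)
        x = self.temporal(frame_embs + self.pos[:T]).mean(dim=1)   # (B, D) video embedding
        return F.normalize(text_emb, dim=-1) @ F.normalize(x, dim=-1).t()

class TightSim(nn.Module):
    """Tight type: feed the text token and frame tokens jointly into one Transformer,
    then predict a similarity score from the first output token."""
    def __init__(self, d=D, max_len=1 + 12):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(max_len, d))
        self.type_emb = nn.Parameter(torch.zeros(2, d))       # 0 = text token, 1 = video token
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, text_emb, frame_embs):
        # Scores one (text, video) pair per row: input sequence is (B, 1+T, D).
        T = frame_embs.size(1)
        x = torch.cat([text_emb.unsqueeze(1), frame_embs], dim=1)
        types = torch.cat([self.type_emb[:1], self.type_emb[1:].expand(T, -1)], dim=0)
        x = x + self.pos[: x.size(1)] + types
        return self.head(self.fusion(x)[:, 0]).squeeze(-1)    # (B,) similarity scores
```

Note the design difference: the first two calculators produce separate text and video embeddings (so a full similarity matrix can be computed cheaply for retrieval), while the tight type must run the fusion Transformer once per text-video pair.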
Summary
Because the conclusion of the original paper is already spot-on, it is quoted here.
In this paper, we use the pretrained CLIP as our backbone to solve the video clip retrieval task from frame-level input. We employ parameter-free type, sequential type, and tight type similarity calculators to obtain the final results. The experimental results demonstrate the effectiveness of our model and achieve the SOTA results on MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo. Besides, we give several insights from our empirical studies:
- image feature can also promote the video-text retrieval,
- post-pretraining on an even outstanding image-text pretrained CLIP can further improve the performance on video-text retrieval,
- 3D patch linear projection and sequential type similarity are promising approaches on the retrieval task,
- the CLIP used on video-text retrieval is learning-rate sensitive.