Multimodal Learning Paper: CLIP4Clip


Applying CLIP to video understanding

CLIP4Clip

Original paper: CLIP4Clip

Architecture

  • The text branch uses a Transformer; the video branch uses a ViT that encodes each sampled frame (a minimal sketch of this dual-encoder setup follows below).
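A minimal PyTorch sketch of this dual-encoder layout, under the assumption that each frame is encoded independently. The encoder modules, tensor shapes, and dimensions below are illustrative stand-ins (CLIP's actual tokenizer and patch embedding are omitted), not the paper's code.

```python
import torch
import torch.nn as nn

embed_dim = 512      # illustrative joint embedding size
num_frames = 12      # frames sampled from one video clip

# Stand-ins for CLIP's text Transformer and ViT (the real models take raw
# tokens / pixels; random embeddings are used here to keep the sketch short).
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
    num_layers=2,
)
image_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
    num_layers=2,
)

# One caption -> a single [CLS]-like text feature.
text_tokens = torch.randn(1, 16, embed_dim)              # (batch, seq_len, dim)
text_feat = text_encoder(text_tokens)[:, 0]              # (1, dim)

# Each frame is encoded independently, so one video yields
# num_frames [CLS]-like features instead of a single one.
frame_patches = torch.randn(num_frames, 50, embed_dim)   # (frames, 1+patches, dim)
frame_feats = image_encoder(frame_patches)[:, 0]         # (num_frames, dim)
print(text_feat.shape, frame_feats.shape)                # [1, 512] vs [12, 512]
```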

Key points

  • A video adds a temporal dimension over a single image, so the ViT produces one [CLS] token per frame. These multiple frame-level features cannot be dot-producted directly with the single [CLS] token from the text branch to compute a similarity.
  • The authors compare three common ways of bridging this gap (sketched in code after this list):
    1. Parameter-free type: simply average the per-frame [CLS] tokens. This is the simplest approach, and the paper shows it is also the strongest overall;
    2. Sequential type: model the temporal order of the frame features with an LSTM or a Transformer;
    3. Tight type: add positional embeddings to the text feature and the multi-frame video features, then feed them jointly into a Transformer that learns the similarity;
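A minimal PyTorch sketch of the three similarity calculators, assuming `text_feat` of shape (1, dim) and `frame_feats` of shape (num_frames, dim) as produced by frame-level encoding. The module sizes, the pooling after the LSTM, and the linear score head are illustrative assumptions, not the released CLIP4Clip implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, num_frames = 512, 12
text_feat = torch.randn(1, dim)             # single text [CLS] feature
frame_feats = torch.randn(num_frames, dim)  # per-frame [CLS] features

def cosine(a, b):
    return F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t()

# 1. Parameter-free type: mean-pool the frame features, then cosine similarity.
video_feat = frame_feats.mean(dim=0, keepdim=True)        # (1, dim)
sim_mean = cosine(text_feat, video_feat)

# 2. Sequential type: model temporal order with an LSTM (a Transformer with
#    positional embeddings works the same way), then pool and compare.
lstm = nn.LSTM(dim, dim, batch_first=True)
seq_out, _ = lstm(frame_feats.unsqueeze(0))               # (1, num_frames, dim)
sim_seq = cosine(text_feat, seq_out.mean(dim=1))

# 3. Tight type: concatenate text and frame features, add (learnable)
#    positional embeddings, and let a cross-modal Transformer plus a small
#    head predict the similarity score directly.
pos_emb = nn.Parameter(torch.zeros(1, 1 + num_frames, dim))
cross_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)
score_head = nn.Linear(dim, 1)
pair = torch.cat([text_feat.unsqueeze(1), frame_feats.unsqueeze(0)], dim=1)
sim_tight = score_head(cross_encoder(pair + pos_emb)[:, 0])  # (1, 1)

print(sim_mean.item(), sim_seq.item(), sim_tight.item())
```

Note that the mean-pooling variant adds no new parameters on top of CLIP, while the sequential and tight variants introduce extra modules that must be trained on the video data.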

Summary

Since the conclusion of the original paper already sums things up very well, it is quoted here:
In this paper, we use the pretrained CLIP as our backbone to solve the video clip retrieval task from frame-level input. We employ parameter-free type, sequential type, and tight type similarity calculator to obtain the final results. The experimental results demonstrate the effectiveness of our model and achieve the SOTA results on MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo. Besides, we give several insights from our empirical studies:

  1. image feature can also promote the video-text retrieval,
  2. post-pretrain on even outstanding image-text pretrained CLIP can further improve the performance on video-text retrieval,
  3. 3D patch linear projection and sequential type similarity are promising approaches on the retrieval task,
  4. The CLIP used on video-text retrieval is learning-rate sensitive.