Reproduction Series 4: Building a Video Deduplication System Step by Step

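It has been quite a while since the last post in the Reproduction series, mainly for two reasons.

First, I caught COVID and spent many days recovering. I did not feel like doing anything, code stopped making sense to me, and I kept hitting the wrong keys when typing.

Second, this code runs into a huge number of problems on a Mac. After a long time searching and debugging I gave up, found an Ubuntu server, and spent three or four hours on everything from reinstalling the OS to preparing the environment before the reproduction finally worked.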

Environment setup

Installing LevelDB

On a Mac, the latest version (1.23) can be installed with brew install leveldb. The installation itself goes through, but importing plyvel afterwards fails with:

>>> import plyvel
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/plyvel/__init__.py", line 6, in <module>
    from ._plyvel import (  # noqa
ImportError: dlopen(/Users/xxx/Library/Python/3.9/lib/python/site-packages/plyvel/_plyvel.cpython-39-darwin.so, 0x0002): symbol not found in flat namespace (__ZTIN7leveldb10ComparatorE)

Next I tried installing version 1.21:

brew extract --version=1.21 leveldb homebrew/cask
brew install /usr/local/Homebrew/Library/Taps/homebrew/homebrew-cask/Formula/leveldb@1.21.rb

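Running the code still fails, though. So I strongly recommend running and debugging this code on a regular Linux server instead of fighting with a Mac; Macs with M1/M2 chips are too much trouble.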

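Environment setup after installing Ubuntu

First point apt and pip at suitable sources (mirrors). Both are basic operations that a quick search will cover, so I will not go into them here. Then install the necessary tools and libraries: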

sudo apt-get install git-lfs python3-pip golang procps file git libleveldb-dev wget curl build-essential gcc cmake ffmpeg libjpeg-dev zlib1g-dev libgit2-dev screen
pip3 install -U pip
pip3 install -r requirements.txt

The contents of requirements.txt are:

av
opencv-python
jupyterlab
requests
pandas
numpy
redis
plyvel==1.2.0
ujson==5.4.0
pymilvus==2.2.1
towhee==0.9.0
towhee.models==0.9.0
pillow

Do not use the latest plyvel; it fails to compile. The version is pinned here to a slightly older release.
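Before going further, it can save time to confirm that the main pieces are actually findable. The following is a minimal sketch, not part of the original setup: the helper name check_environment and the particular lists of modules and tools are my own choices. It only checks that the packages and executables can be located; a real import plyvel is still needed to catch link errors like the dlopen failure shown earlier.

```python
import importlib.util
import shutil

# Import names differ from some pip package names (e.g. opencv-python -> cv2, pillow -> PIL).
PYTHON_MODULES = ["av", "cv2", "plyvel", "pymilvus", "towhee", "PIL"]
CLI_TOOLS = ["git-lfs", "ffmpeg"]


def check_environment(modules=PYTHON_MODULES, tools=CLI_TOOLS):
    """Return the names of modules and command-line tools that cannot be found."""
    # find_spec locates a top-level module without importing it.
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which looks the executable up on PATH.
    missing_tools = [t for t in tools if shutil.which(t) is None]
    return missing_modules, missing_tools


if __name__ == "__main__":
    missing_modules, missing_tools = check_environment()
    print("missing Python modules:", missing_modules or "none")
    print("missing command-line tools:", missing_tools or "none")
```

If either list is non-empty, go back to the apt-get and pip3 steps above before running the notebook code.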

Data preparation

There is not much to say about this step; just download the archive and unzip it.

curl -L https://github.com/towhee-io/examples/releases/download/data/VCSL-demo.zip -O  
unzip -q -o VCSL-demo.zip

After unzipping there is one extra folder. Whoever uploaded the archive was probably on a Mac, so it contains a __MACOSX folder; just delete it.
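The cleanup can also be scripted. This is a small sketch under the assumption that the leftover folder is named __MACOSX, the name macOS's archiver normally uses; the helper name remove_macos_residue is hypothetical, and removing .DS_Store files is an extra I added.

```python
import shutil
from pathlib import Path


def remove_macos_residue(root):
    """Delete __MACOSX folders and .DS_Store files under root; return what was removed."""
    root = Path(root)
    removed = []
    # Collect the matches first so that deleting does not disturb the directory walk.
    for folder in [p for p in root.rglob("__MACOSX") if p.is_dir()]:
        if folder.exists():  # may already be gone if a parent __MACOSX was removed
            shutil.rmtree(folder)
            removed.append(folder)
    for ds_store in list(root.rglob(".DS_Store")):
        if ds_store.is_file():
            ds_store.unlink()
            removed.append(ds_store)
    return removed
```

Call it on the directory where the archive was extracted, for example remove_macos_residue('.'), keeping in mind that it searches that directory recursively.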

Running the code

Here is the code; it ran without problems.

from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

connections.connect(host='127.0.0.1', port='19530')

def create_milvus_collection(collection_name, dim):
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)
    
    fields = [
        # Note: the keyword is 'description'; a misspelled keyword is not stored as the field description.
        FieldSchema(name='id', dtype=DataType.INT64, description='id of the embedding', is_primary=True, auto_id=True),
        FieldSchema(name='path', dtype=DataType.VARCHAR, description='path of the embedding', max_length=500),
        FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='video embedding vectors', dim=dim)
    ]
    schema = CollectionSchema(fields=fields, description='video dedup')
    collection = Collection(name=collection_name, schema=schema)

    index_params = {
        'metric_type':'IP',  # inner product, so the index agrees with metric_type='IP' in the search step
        'index_type':"IVF_FLAT",
        'params':{"nlist":1}
    }
    collection.create_index(field_name="embedding", index_params=index_params)
    return collection

collection = create_milvus_collection('video_deduplication', 256)
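A note on metric_type: the index takes one here, and the search step later in this post passes one as well, so the two should agree. The sketch below shows how inner product (IP) and L2 distance relate when the vectors have unit length. That the frame embeddings are L2-normalized is an assumption on my part that I have not verified for the model used here; the random vectors are stand-ins, not real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for frame embeddings: 256-d vectors scaled to unit length.
vectors = rng.normal(size=(5, 256))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query, candidates = vectors[0], vectors[1:]

inner_product = candidates @ query                      # higher means more similar
squared_l2 = np.sum((candidates - query) ** 2, axis=1)  # lower means more similar

# For unit vectors: ||a - b||^2 = 2 - 2 * (a . b)
assert np.allclose(squared_l2, 2.0 - 2.0 * inner_product)

# Same ranking, opposite sort direction.
assert np.array_equal(np.argsort(-inner_product), np.argsort(squared_l2))
```

So under that assumption both metrics rank candidates identically, but the score direction differs. That matters for the later select_video step, which sums scores with reverse=True and therefore treats larger scores as better.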

from IPython import display
from pathlib import Path
import towhee
from PIL import Image as PILImage
import os

def display_gif(video_path_list, text_list):
    html = ''
    for video_path, text in zip(video_path_list, text_list):
        html_line = '<img src=\"{}\"> {} <br/><br/>'.format(video_path, text)
        html += html_line
    return display.HTML(html)

    
def convert_video2gif(video_path, output_gif_path, start_time=0.0, end_time=1000.0, num_samples=16):
    frames = (
        towhee.glob(video_path)
              .video_decode.ffmpeg(start_time=start_time, end_time=end_time, sample_type='time_step_sample', args={'time_step': 3})
              .to_list()[0]
    )
    imgs = [PILImage.fromarray(frame) for frame in frames]
    imgs = [img.resize((int(img.width/6), int(img.height/6)), PILImage.NEAREST) for img in imgs]
    imgs[0].save(fp=output_gif_path, format='GIF', append_images=imgs[1:], save_all=True, loop=0)


def display_gifs_from_video(video_path_list, text_list, start_time_list = None, end_time_list = None, tmpdirname = './tmp_gifs'):
    Path(tmpdirname).mkdir(exist_ok=True)
    gif_path_list = []
    for i, video_path in enumerate(video_path_list):
        video_name = str(Path(video_path).name).split('.')[0]
        gif_path = Path(tmpdirname) / (video_name + '.gif')
        if start_time_list is not None:
            convert_video2gif(video_path, gif_path, start_time=start_time_list[i], end_time=end_time_list[i])
        else:
            convert_video2gif(video_path, gif_path)
        gif_path_list.append(gif_path)
    return display_gif(gif_path_list, text_list)

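This step downloads some models and dependencies; just wait for the downloads to finish. Note that git-lfs must be installed beforehand.

Next, take a look at the generated GIFs.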

import random
random.seed(9)
vcsl_demo_root = './VCSL-demo/'

event_list = os.listdir(vcsl_demo_root)
if 'crashed_video' in event_list:
    event_list.remove('crashed_video')

random_event = random.choice(event_list)
random_event_folder = os.path.join(vcsl_demo_root, random_event)
random_event_videos = [os.path.join(random_event_folder, video_file) for video_file in os.listdir(random_event_folder)]
tmpdirname = './tmp_gifs'
display_gifs_from_video(random_event_videos, random_event_videos, tmpdirname=tmpdirname)

Some preparation before inserting the videos into the database.

import towhee
import numpy as np
from towhee.types import Image


os.environ["CUDA_VISIBLE_DEVICES"] = '0'


@towhee.register
def get_image(x):
    for i in x:
        yield Image(i.__array__(), 'RGB')

@towhee.register
def merge_ndarray(x):
    return np.concatenate(x).reshape(-1, x[0].shape[0])

@towhee.register
def split_res(x):
    return [i.path for i in x], [i.score for i in x]

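Adjust this to your own setup: if you have an NVIDIA GPU with a working CUDA environment, change the CUDA_VISIBLE_DEVICES environment variable accordingly.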
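One way to avoid hard-coding the value is to pick it at runtime. This is only a sketch: the helper name choose_cuda_visible_devices is hypothetical, and using the presence of nvidia-smi on PATH as a sign that an NVIDIA driver is installed is a heuristic I am assuming, not something the original code does.

```python
import os
import shutil


def choose_cuda_visible_devices(gpu_index=0, which=shutil.which):
    """Return a CUDA_VISIBLE_DEVICES value: the GPU index if nvidia-smi is on PATH, else ''."""
    # Finding nvidia-smi is only a heuristic for "an NVIDIA driver is installed".
    if which("nvidia-smi") is None:
        return ""  # an empty value hides all GPUs, so frameworks fall back to CPU
    return str(gpu_index)


# Set this before any CUDA-using library initializes, otherwise it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = choose_cuda_visible_devices()
```

On a machine with several GPUs, pass a different gpu_index to select another card.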

%%time

all_file_pattern = os.path.join(vcsl_demo_root, '*', '*.*')

dc = (
    towhee.glob['video_url'](all_file_pattern).stream()
        .video_decode.ffmpeg['video_url', 'frames'](sample_type='time_step_sample', args={'time_step': 1})
        .get_image['frames', 'images']()
        .flatten('images')
        .drop_empty()
        .image_embedding.isc['images', 'embeddings']()
        .select['video_url', 'embeddings']()
        .ann_insert.milvus[('video_url', 'embeddings'), 'insert_result'](uri='tcp://127.0.0.1:19530/video_deduplication')
        .group_by('video_url')
        .merge_ndarray['embeddings', 'video_embedding']()
        .insert_leveldb[('video_url', 'video_embedding'), ]('url_vec.db')
        .select['video_url', 'video_embedding']()
        .show(limit=20)
)

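The insertion step takes a while; just be patient.

You may see errors like video_decoder.py-video_decoder:120 - ERROR: header damaged along the way. They are harmless: the source video is corrupted and gets skipped.

Finally, run the video deduplication search.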
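In the pipeline above, merge_ndarray runs after group_by('video_url'), so it receives all frame embeddings of one video and turns them into a single matrix with one row per frame. Here is a standalone NumPy sketch of that behavior. The function body is copied from the registered helper; the tiny 4-dimensional inputs are made up for readability (the collection above uses 256 dimensions).

```python
import numpy as np


def merge_ndarray(x):
    # Same body as the registered helper: flatten all frames, then reshape to (frames, dim).
    return np.concatenate(x).reshape(-1, x[0].shape[0])


# Three hypothetical frame embeddings of dimension 4.
frame_embeddings = [np.full(4, i, dtype=np.float32) for i in range(3)]
video_embedding = merge_ndarray(frame_embeddings)

print(video_embedding.shape)  # (3, 4)
print(video_embedding[1])     # [1. 1. 1. 1.]
```

For 1-D inputs of equal length this gives the same result as np.stack(x).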

%%time
collection.load()

query_file_pattern = os.path.join(vcsl_demo_root, 'madongmei', '*.*')

dc = (
    towhee.glob['query_url'](query_file_pattern).stream()
        .video_decode.ffmpeg['query_url', 'frames'](sample_type='time_step_sample', args={'time_step': 1})
        .get_image['frames', 'images']()
        .flatten('images')
        .drop_empty()
        .image_embedding.isc['images', 'embeddings']()
        .select['query_url', 'embeddings']()
        .ann_search.milvus['embeddings', 'results'](collection=collection, limit=64, output_fields=['path'], metric_type='IP')
        .split_res['results', ('retrieved_urls','scores')]()
        .group_by('query_url')
        .video_copy_detection.select_video[('retrieved_urls','scores'), 'ref_url'](top_k=5, reduce_function='sum', reverse=True)
        .from_leveldb['ref_url', 'retrieved_embedding']('url_vec.db', True)
        .merge_ndarray['embeddings', 'video_embedding']()
        .flatten('retrieved_embedding', 'ref_url')
        .video_copy_detection.temporal_network[('video_embedding', 'retrieved_embedding'), ('predict_segments', 'segment_scores')]()
        .select['query_url',  'ref_url', 'predict_segments', 'segment_scores']()
        .show(limit=50)
)
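To make the select_video step less of a black box: judging from its parameters (top_k=5, reduce_function='sum', reverse=True), it appears to add up the frame-level scores per retrieved video and keep the highest-scoring ones as candidates for the temporal network. The following pure-Python sketch illustrates that idea only. It is my reading of the parameters, not the operator's actual implementation, and the function name and sample data are hypothetical.

```python
from collections import defaultdict


def select_top_videos(retrieved_urls, scores, top_k=5, reverse=True):
    """Sum frame-level scores per video URL and return the top_k URLs.

    retrieved_urls and scores are parallel lists with one inner list per query frame.
    """
    totals = defaultdict(float)
    for frame_urls, frame_scores in zip(retrieved_urls, scores):
        for url, score in zip(frame_urls, frame_scores):
            totals[url] += score
    # reverse=True puts the largest summed score first, which suits similarity scores such as IP.
    ranked = sorted(totals, key=totals.get, reverse=reverse)
    return ranked[:top_k]


# Hypothetical hits for two query frames.
urls = [["a.mp4", "b.mp4"], ["a.mp4", "c.mp4"]]
scores = [[0.9, 0.4], [0.8, 0.5]]
print(select_top_videos(urls, scores, top_k=2))  # ['a.mp4', 'c.mp4']
```

Here a.mp4 is hit by both query frames, so its summed score is the largest, while c.mp4 edges out b.mp4 on a single hit.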