Digital Humans Explained: Building a Digital Human from Scratch (Part 3)

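“This is a signed article first published on the Juejin (稀土掘金) technical community. Reposting is prohibited within 30 days of publication and prohibited without authorization after that. Infringement will be pursued.”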

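1. Introduction

The previous two parts covered the basic principles behind digital humans and implemented the Dataset class, a Wav2Lip-based lip-sync model, and the training code. If you have not read them yet, it is worth doing so before continuing.

With a trained model in hand, we can start generating a digital human. Before that, though, the inference speed needs some optimization.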

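2. Inference Optimization

There are many ways to speed up inference: running on the GPU, lowering the numeric precision, converting the model to ONNX, using TensorRT, and so on. Let's first measure the speed with plain GPU acceleration, using the following test code: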

import time
import torch
from wav2lip.models import Wav2Lip

# Load the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
model = Wav2Lip()
# map_location lets a checkpoint saved on the GPU load on a CPU-only machine too
ckpt = torch.load('pretrained/wav2lip.pth', map_location=device)
model.load_state_dict(ckpt['state_dict'])

# Move the model to the device and switch to evaluation mode
model.to(device)
model.eval()
start = time.time()
with torch.no_grad():
    for i in range(10):
        # Random stand-ins for a batch of mel spectrograms and face images
        a = torch.randn(32, 1, 80, 16).to(device, dtype=torch.float32)
        v = torch.randn(32, 6, 96, 96).to(device, dtype=torch.float32)
        s = time.time()
        # Run the prediction
        out = model(a, v)
        print('inference cost: ', time.time() - s)
print('total inference cost: ', time.time() - start)

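The code above feeds random data through the model to test inference. On a laptop with a 3060 GPU (6 GB of VRAM), the ten iterations take about 0.9 s in total and use about 1.2 GB of VRAM. In a real application the results still need post-processing, so it is worth optimizing inference further.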
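
Timing figures like these depend on how the loop is measured: the first iterations usually include one-off setup cost, and GPU work is queued asynchronously, so the clock can be read before the kernels have finished. A minimal sketch of a reusable timing helper follows. The names `benchmark` and `sync` are hypothetical and not part of the original code; on a GPU, `sync` would typically be `torch.cuda.synchronize`.

```python
import time
from typing import Callable, Optional


def benchmark(
    fn: Callable[[], object],
    warmup: int = 3,
    iters: int = 10,
    sync: Optional[Callable[[], None]] = None,
) -> dict:
    """Time `fn` over `iters` runs after `warmup` untimed runs.

    `sync` is an optional hook called after the warm-up and after every
    timed call, so that queued work has finished before the clock is read.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)

    return {
        "total": sum(timings),
        "mean": sum(timings) / len(timings),
        "best": min(timings),
    }


if __name__ == "__main__":
    # Stand-in workload; in the article's setting this would be
    # something like `lambda: model(a, v)`.
    stats = benchmark(lambda: sum(i * i for i in range(100_000)))
    print(stats)
```

Numbers measured this way are not directly comparable with the figures quoted in this article, which come from the simpler loop shown above.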

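The first idea is to lower the precision. Calling .half() switches inference to half-precision floating point (GPU only). After loading the model, convert it and its inputs to half precision as follows: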

model = model.half()

...
a = torch.randn(32, 1, 80, 16).to(device).half()
v = torch.randn(32, 6, 96, 96).to(device).half()

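The rest of the code stays the same. After this change the inference time drops to about 0.7 s and VRAM usage to about 0.7 GB, which is already good enough for our needs.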
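
To see where the saving comes from, the size of the input tensors can be worked out by hand: a dense tensor takes (number of elements) × (bytes per element), and fp16 uses 2 bytes where fp32 uses 4. The sketch below does that arithmetic for the two input shapes used in this article; `tensor_bytes` is a hypothetical helper. Total VRAM usage does not halve exactly, because fixed overheads such as the CUDA context are unaffected by the data type.

```python
import struct
from math import prod

# Bytes per element, taken from the C sizes of float ('f') and half ('e').
ITEMSIZE = {"fp32": struct.calcsize("f"), "fp16": struct.calcsize("e")}


def tensor_bytes(shape, dtype):
    """Return the number of bytes a dense tensor of `shape` occupies."""
    return prod(shape) * ITEMSIZE[dtype]


if __name__ == "__main__":
    shapes = {"mel_batch": (32, 1, 80, 16), "face_batch": (32, 6, 96, 96)}
    for name, shape in shapes.items():
        for dtype in ("fp32", "fp16"):
            kib = tensor_bytes(shape, dtype) / 1024
            print(f"{name:10s} {dtype}: {kib:8.1f} KiB")
```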

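If the model needs to be deployed on other devices, it can be converted to ONNX. The export can target fp32, fp16, or even uint8. In practice, fp16 output is essentially identical to fp32, while uint8 falls short in both sharpness and color, so we export an fp16 model here: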

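import os
import torch
from wav2lip.models import Wav2Lip

# Directory for the exported file; 'pretrained' matches the path loaded later on
output_dir = 'pretrained'

# Load the model
model = Wav2Lip()
ckpt = torch.load('pretrained/wav2lip.pth')
model.load_state_dict(ckpt['state_dict'])
model.eval()

# Prepare example inputs
mel_batch = torch.randn(16, 1, 80, 16)
face_batch = torch.randn(16, 6, 96, 96)

# Move everything to the GPU and switch to half precision
model = model.cuda().half()
mel_batch = mel_batch.cuda().half()
face_batch = face_batch.cuda().half()
# Save the model as an ONNX file
torch.onnx.export(
    model,
    (mel_batch, face_batch),
    os.path.join(output_dir, 'wav2lip_fp16.onnx'),
    opset_version=11,
    input_names=['mel_batch', 'face_batch'],
    output_names=['output'],
    # Mark the batch dimension as dynamic so any batch size can be used later
    dynamic_axes={
        'mel_batch': {0: 'batch_size'},
        'face_batch': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)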

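The export needs example inputs with the expected shapes, which are passed to torch.onnx.export. The exported model can then be run with onnxruntime. Make sure the CUDA, cuDNN, and onnxruntime-gpu versions match. My test environment was: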

torch==2.1.0+cuda118
cudnn==8.x
onnxruntime-gpu==1.8.0

Once the matching version of onnxruntime-gpu is installed, inference looks like this:

import time
import numpy as np
import onnxruntime as ort

# Create the inference session
ort_session = ort.InferenceSession(
    'pretrained/wav2lip_fp16.onnx', providers=['CUDAExecutionProvider']
)
start = time.time()

for i in range(10):
    # Create fp16 NumPy arrays as inputs
    a = np.random.randn(32, 1, 80, 16).astype(np.float16)
    v = np.random.randn(32, 6, 96, 96).astype(np.float16)
    s = time.time()
    # Run inference
    outputs = ort_session.run(
        ['output'],  # output names; pass None to fetch all outputs
        {"mel_batch": a, "face_batch": v}
    )
    print('inference cost: ', time.time() - s)
print('total inference cost: ', time.time() - start)

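Note that providers specifies the devices to run on. Also, the inputs are fp16 NumPy arrays rather than torch.Tensor objects, and they do not need to be moved to the GPU manually.

Running this code takes about 3 s, considerably slower than the previous approaches, but an ONNX model is easy to move to other devices.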
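
When ONNX inference is slower than expected, one thing worth checking (not necessarily the cause here) is whether the requested execution provider is actually available on the machine. The sketch below shows a small, dependency-free way to choose providers in order of preference. `pick_providers` is a hypothetical helper; with onnxruntime installed, the list of available providers would come from `onnxruntime.get_available_providers()`.

```python
from typing import Iterable, List

# Preference order: try the GPU first, fall back to the CPU.
PREFERRED = ("CUDAExecutionProvider", "CPUExecutionProvider")


def pick_providers(available: Iterable[str], preferred=PREFERRED) -> List[str]:
    """Keep the preferred providers that are actually available, in order."""
    available = set(available)
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError(f"none of {list(preferred)} is available")
    return chosen


if __name__ == "__main__":
    # Hard-coded here so the example runs without onnxruntime installed.
    print(pick_providers(["CPUExecutionProvider"]))
```

The returned list can be passed as the `providers` argument of `InferenceSession`, and printing it makes it obvious when a run has silently ended up on the CPU.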

3. Writing an Inference Class

For convenience, we wrap Wav2Lip inference in a class that supports both ONNX and .pth models. Here is the overall skeleton:

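from pathlib import Path
from typing import Literal, Union

import numpy as np
import torch


class Wav2LipInfer:
    # Precision name -> [NumPy dtype (for ONNX), torch dtype (for PyTorch)]
    dtype_mapping = {
        'fp32': [np.float32, torch.float32],
        'fp16': [np.float16, torch.float16],
    }

    # Path (rather than WindowsPath) keeps the class usable on any OS
    def __init__(self, model_path: Union[str, Path], dtype: Literal['fp16', 'fp32'] = 'fp16', device=None):
        pass

    @torch.no_grad()
    def __call__(self, mel_batch: Union[np.ndarray, torch.Tensor], face_batch: Union[np.ndarray, torch.Tensor]):
        pass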

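The class only has __init__ and __call__: __init__ loads the model and __call__ runs inference. The dtype_mapping attribute maps a precision name to the corresponding NumPy and PyTorch data types. __init__ takes the model path and the data type. A path ending in .onnx is treated as an ONNX model; anything else, such as a .pth file, is treated as a PyTorch model. Both fp32 and fp16 are supported. The implementation: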

def __init__(self, model_path: Union[str, Path], dtype: Literal['fp16', 'fp32'] = 'fp16', device=None):
    # Try the GPU first and fall back to the CPU
    providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
    # Accept pathlib paths on any OS, not only WindowsPath
    if isinstance(model_path, Path):
        model_path = str(model_path)
    # ONNX model
    if model_path.endswith('.onnx'):
        self.model_type = 'onnx'
        # ONNX models use the NumPy data type
        self.dtype = self.dtype_mapping[dtype][0]
        options = ort.SessionOptions()
        options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        # Create the inference session
        self.ort_session = ort.InferenceSession(
            model_path, providers=providers, sess_options=options
        )
    else:
        # Pick the GPU automatically when no device is given
        if device is None:
            self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        else:
            self.device = torch.device(device)
        self.model_type = 'pt'
        # PyTorch models use the torch data type
        self.dtype = self.dtype_mapping[dtype][1]
        # Load the model, move it to the device, and cast it to the chosen data type
        self.model = load_checkpoint(model_path, Wav2Lip())[0].to(self.device, dtype=self.dtype)
        self.model.eval()
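
The suffix-based dispatch in __init__ can also be isolated into a small function, which makes it easy to test on its own. This is only a sketch: `detect_model_type` and `BACKENDS` are hypothetical names, and unlike the code above, which treats every non-.onnx path as a PyTorch checkpoint, this version rejects unknown suffixes.

```python
from pathlib import PurePath
from typing import Union

# Suffix -> backend name; the two entries mirror the cases in the article.
BACKENDS = {".onnx": "onnx", ".pth": "pt"}


def detect_model_type(model_path: Union[str, PurePath]) -> str:
    """Map a model file path to the backend that should load it."""
    suffix = PurePath(model_path).suffix.lower()
    try:
        return BACKENDS[suffix]
    except KeyError:
        raise ValueError(f"unsupported model file: {model_path}") from None


if __name__ == "__main__":
    print(detect_model_type("pretrained/wav2lip_fp16.onnx"))  # onnx
    print(detect_model_type(PurePath("pretrained/wav2lip.pth")))  # pt
```

Using `PurePath` means strings and pathlib objects are handled the same way on every operating system.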

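Creating a Wav2LipInfer instance gives us either a model or an ort_session, depending on which kind of model file was passed in. Next comes __call__, which checks the model_type attribute to decide whether to run inference with model or with ort_session: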

@torch.no_grad()
def __call__(
        self,
        mel_batch: Union[np.ndarray, torch.Tensor],
        face_batch: Union[np.ndarray, torch.Tensor]
):
    # ONNX model
    if self.model_type == 'onnx':
        # onnxruntime expects NumPy arrays, so convert tensors first
        if isinstance(mel_batch, torch.Tensor):
            mel_batch = mel_batch.cpu().numpy()
        if isinstance(face_batch, torch.Tensor):
            face_batch = face_batch.cpu().numpy()
        # Cast to the NumPy data type chosen in __init__
        mel_batch = mel_batch.astype(self.dtype)
        face_batch = face_batch.astype(self.dtype)
        # Run inference with ort_session
        outputs = self.ort_session.run(
            ['output'],  # output names; pass None to fetch all outputs
            {"mel_batch": mel_batch, "face_batch": face_batch}
        )
        return outputs[0]
    else:
        # Convert to torch.Tensor on the right device and data type, then run the model
        mel_batch = torch.as_tensor(mel_batch).to(self.device, dtype=self.dtype)
        face_batch = torch.as_tensor(face_batch).to(self.device, dtype=self.dtype)
        return self.model(mel_batch, face_batch)

Now we only need to specify the model path, and the class works out how to run inference. The test code is as follows:

if __name__ == '__main__':
    wav2lip_infer = Wav2LipInfer('pretrained/wav2lip_fp16.onnx')
    # wav2lip_infer = Wav2LipInfer('pretrained/wav2lip.pth')
    for i in range(30):
        o = wav2lip_infer(
            np.random.randn(1, 1, 80, 16),
            np.random.randn(1, 6, 96, 96)
        )
        print(o.shape)

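4. Digital Human Inference

We now have Wav2Lip inference, but a digital human is more than a face: it also includes the body and the scene. How do we handle that?

It is quite simple. The steps are:

Take a full image as input (one that includes the body)
Detect the face, crop it out, and record its position
Run inference on the cropped face
Paste the result back into the full image at the recorded position

That handles a single image. For a video, we simply repeat these steps for every frame.

One problem remains. During training the model receives three inputs: the audio features, reference_image, and masked_image, where reference_image is a face image taken from a frame other than the current one. What should we use at inference time?

Testing shows that using the current frame's face as reference_image already gives good results. So we use the current face as reference_image, and the same face with its lower half masked as masked_image.

With that in place, we can write the digital human inference code. The overall structure first: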

class Avatar:
    def __init__(self, avatar_name: str, wav2lip: Wav2LipInfer, video_path: str):
        self.init_avatar()

    def init_avatar(self):
        pass
        
    def inference(self, audio_path: str, batch_size=32):
        pass

In __init__ we define a few attributes and call init_avatar to initialize the current digital human:

def __init__(self, avatar_name: str, wav2lip: Wav2LipInfer, video_path: str):
    self.wav2lip = wav2lip
    self.video_path = video_path
    self.full_images = []
    self.full_faces = []
    self.coords = None
    # Directories used by this digital human
    self.avatar_base = Path('assets/avatars') / avatar_name
    self.frame_base = self.avatar_base / 'frames'
    self.face_base = self.avatar_base / 'faces'
    self.tmp_path = self.avatar_base / 'tmp'
    self.init_avatar()

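init_avatar extracts the video frames ahead of time, detects and crops the face in each frame, and records the face positions. At inference time we need the following:

Original frames (with the scene and body)
Face crops
Face positions

These three correspond to full_images, full_faces, and coords. The implementation: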

def init_avatar(self):
    # If the directory for this digital human does not exist yet
    if not self.avatar_base.exists():
        # Create the directories
        self.frame_base.mkdir(exist_ok=True, parents=True)
        self.face_base.mkdir(exist_ok=True, parents=True)
        # Extract the video frames and read them back.
        # glob() does not guarantee any order, so sort by file name
        # to keep frames, faces, and coords aligned.
        video2images(self.video_path, str(self.frame_base))
        self.full_images = read_images([
            str(filepath) for filepath in sorted(self.frame_base.glob('*.[jpJP][pPN][gG]'))
        ], to_rgb=False)
        # Extract the audio track
        video2audio(self.video_path, self.avatar_base)
        # Detect faces
        face_det_results = face_detect(self.full_images)
        # Record the information needed for each frame
        coords = []
        for face_det_result, frame in zip(face_det_results, self.full_images):
            face, coord = face_det_result
            y1, y2, x1, x2 = coord
            coords.append((y1, y2, x1, x2))
            dst_path = self.face_base / f'{len(self.full_faces):08d}.jpg'
            cv2.imwrite(str(dst_path), face)
            self.full_faces.append(face)
        dst_path = self.avatar_base / 'coords.npy'
        np.save(str(dst_path), np.array(coords))
        self.coords = np.array(coords)
    # If the directory already exists, read everything from disk
    else:
        frame_path_list = [str(filepath) for filepath in sorted(self.frame_base.glob("*.[jpJP][pPN][gG]"))]
        face_path_list = [str(filepath) for filepath in sorted(self.face_base.glob("*.[jpJP][pPN][gG]"))]
        self.full_images = [cv2.imread(frame_path) for frame_path in frame_path_list]
        self.full_faces = [cv2.imread(face_path) for face_path in face_path_list]
        self.coords = np.load(self.avatar_base / 'coords.npy')
    # Build the loop: forward frames followed by the same frames reversed
    self.full_images += self.full_images[::-1]
    self.full_faces += self.full_faces[::-1]
    self.coords = np.concatenate((self.coords, self.coords[::-1]), axis=0)

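During initialization, if the directory for this digital human does not exist, it is created and the three pieces of information above are extracted. If it already exists, everything is read from disk, so the preprocessing is not repeated.

After that, the following code runs: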

self.full_images += self.full_images[::-1]
self.full_faces += self.full_faces[::-1]
self.coords = np.concatenate((self.coords, self.coords[::-1]), axis=0)

This builds a seamless loop: if the audio is longer than the original video, inference continues over the video played in reverse.
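
The effect of appending the reversed list and then indexing modulo its length can be written as a closed-form index into the original frames. The following sketch illustrates that mapping; `ping_pong_index` is a hypothetical name and is not used by the article's code, which materializes the doubled lists instead.

```python
def ping_pong_index(i: int, n: int) -> int:
    """Index into the original n frames for position i of the looped clip.

    The looped clip is the frames followed by the same frames reversed,
    repeated for as long as needed.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    j = i % (2 * n)
    return j if j < n else 2 * n - 1 - j


if __name__ == "__main__":
    # Three frames, eight positions: forward, backward, then forward again.
    print([ping_pong_index(i, 3) for i in range(8)])  # [0, 1, 2, 2, 1, 0, 0, 1]
```

Because the sequence turns around at the last frame instead of jumping back to the first, there is no visible cut where the clip repeats.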

Next comes the inference itself. Before implementing it, we write a function that returns data in batches. datagen is implemented as follows:

def datagen(faces, mels, images, coords, batch_size=4, img_size=settings.common.image_size):
    face_batch, mel_batch, image_batch, coord_batch = [], [], [], []

    for idx, mel in enumerate(mels):
        # Wrap around when there are more mel chunks than faces
        idx %= len(faces)
        # Resize, scale to [0, 1], and convert from HWC to CHW
        face = np.transpose(
            cv2.resize(faces[idx], (img_size, img_size)) / 255.,
            (2, 0, 1)
        )
        image_batch.append(images[idx].copy())
        face_batch.append(face)
        mel_batch.append(mel)
        coord_batch.append(coords[idx])
        if len(image_batch) >= batch_size:
            # Duplicate the faces along the channel axis, then zero the
            # lower half of the first 3 channels
            face_batch = np.concatenate([face_batch, face_batch], axis=1)
            face_batch[:, :3, img_size // 2:, :] = 0
            yield (
                face_batch,
                np.array(mel_batch), np.array(image_batch),
                np.array(coord_batch),
            )
            face_batch, mel_batch, image_batch, coord_batch = [], [], [], []
    # Yield the final, possibly smaller, batch
    if len(image_batch) > 0:
        face_batch = np.concatenate([face_batch, face_batch], axis=1)
        face_batch[:, :3, img_size // 2:, :] = 0
        yield (
            face_batch,
            np.array(mel_batch), np.array(image_batch),
            np.array(coord_batch)
        )

The code is straightforward. The only part that needs attention is:

face_batch = np.concatenate([face_batch, face_batch], axis=1)
face_batch[:, :3, img_size // 2:, :] = 0

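Here the batch of current faces is duplicated along the channel axis, and the lower half of the first three channels is set to 0. This constructs masked_image and reference_image, exactly the input the model expects.

Also, mels may be longer than faces, so the face paired with the current mel is the one at index idx % len(faces).

With this function we can write the inference code: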
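
The batching pattern itself is independent of images and audio: pair each mel chunk with a face, wrap the face index around, and emit a final partial batch. Here is a minimal sketch with plain Python values standing in for arrays. `batched_pairs` is a hypothetical name used only for illustration.

```python
from math import ceil
from typing import Iterator, List, Sequence, Tuple


def batched_pairs(
    mels: Sequence, faces: Sequence, batch_size: int
) -> Iterator[List[Tuple[object, object]]]:
    """Yield lists of (mel, face) pairs; faces wrap around when they run out."""
    batch = []
    for idx, mel in enumerate(mels):
        batch.append((mel, faces[idx % len(faces)]))
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


if __name__ == "__main__":
    mels = [f"mel{i}" for i in range(7)]
    faces = ["faceA", "faceB", "faceC"]
    batches = list(batched_pairs(mels, faces, batch_size=4))
    # The number of batches equals ceil(len(mels) / batch_size).
    assert len(batches) == ceil(len(mels) / 4)
    for b in batches:
        print(b)
```

The same ceil(len(mels) / batch_size) count is what a progress bar needs as its total when iterating over such a generator.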

@torch.no_grad()
def inference(self, audio_path: str, batch_size=32):
    recreate_dirs([self.tmp_path])
    # Extract the audio features
    mel_chunks = get_mel_chunks(extract_mel_frames(audio_path, return_tensor=False))
    # Read the data in batches
    gen = datagen(self.full_faces, mel_chunks, self.full_images, self.coords, batch_size)
    fidx = 0
    for idx, (face_batch, mel_batch, image_batch, coord_batch) in enumerate(
            tqdm(
                gen,
                total=int(np.ceil(float(len(mel_chunks)) / batch_size))
            )):
        # Run inference with the Wav2LipInfer instance
        pred = self.wav2lip(mel_batch, face_batch)
        if isinstance(pred, torch.Tensor):
            pred = pred.cpu().numpy()
        # NCHW -> NHWC, and scale from [0, 1] to [0, 255]
        pred = pred.transpose(0, 2, 3, 1) * 255.
        # Merge the predictions back into the original frames
        for p, f, c in zip(pred, image_batch, coord_batch):
            y1, y2, x1, x2 = c
            # cv2.resize takes the target size as (width, height)
            p = cv2.resize(p.astype(np.uint8), (x2 - x1, y2 - y1))
            f[y1:y2, x1:x2] = p
            dst_path = self.tmp_path / f'{fidx:08d}.jpg'
            fidx += 1
            cv2.imwrite(str(dst_path), f)
    # Save the video
    images2video(str(self.tmp_path), str(self.tmp_path / 'tmp.mp4'))
    # Make sure the output directory exists before ffmpeg writes into it
    output_dir = Path('outputs/output_videos')
    output_dir.mkdir(parents=True, exist_ok=True)
    merge_audio_video(
        str(self.tmp_path / 'tmp.mp4'),
        audio_path,
        str(output_dir / f'{datetime.now().strftime("%Y-%m-%d_%H-%M-%S")}.mp4')
    )

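The code above roughly does the following:

Read the audio data
Predict the faces
Paste the predictions onto the original frames
Save the result as a video

The part that needs care is the handling of the prediction. Its shape is (batch_size, 3, 96, 96) and its values lie between 0 and 1, so we first move the channel axis to the end and scale the values to the range 0 to 255: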

pred = pred.transpose(0, 2, 3, 1) * 255.

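Then each image is processed individually: convert it to np.uint8, resize it to the size of the face region, paste it onto the original frame, and save the result.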

y1, y2, x1, x2 = c
p = cv2.resize(p.astype(np.uint8), (x2 - x1, y2 - y1))
f[y1:y2, x1:x2] = p
dst_path = self.tmp_path / f'{fidx:08d}.jpg'
fidx += 1
cv2.imwrite(str(dst_path), f)

The code above uses two new functions, images2video and merge_audio_video. Each one simply runs an ffmpeg command:

def images2video(images_dir, output_path, fps=25, verbose=False):
    command = f"ffmpeg -framerate {fps} -i {images_dir}/%08d.jpg -c:v libx264 -pix_fmt yuv420p -y {output_path}".split()
    run_command(command, verbose)
    return output_path


def merge_audio_video(vid_path, audio_path, output_path, verbose=False):
    command = (
        f"ffmpeg -i {vid_path} -i {audio_path} -c:v copy -c:a aac "
        f"-strict experimental -map 0:v:0 -map 1:a:0 -shortest -y {output_path}"
    ).split()
    run_command(command, verbose)
    return output_path
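
One caveat with building the command as a string and calling .split() is that a path containing spaces gets broken into several arguments. A possible alternative is to build the argument list directly, as in the sketch below. The function names `images2video_cmd` and `merge_audio_video_cmd` are hypothetical; the ffmpeg flags are the same ones used above.

```python
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def images2video_cmd(images_dir: PathLike, output_path: PathLike, fps: int = 25) -> List[str]:
    """Arguments for encoding numbered JPEG frames into an H.264 video."""
    return [
        "ffmpeg", "-framerate", str(fps),
        "-i", str(Path(images_dir) / "%08d.jpg"),
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-y", str(output_path),
    ]


def merge_audio_video_cmd(
    vid_path: PathLike, audio_path: PathLike, output_path: PathLike
) -> List[str]:
    """Arguments for muxing a video stream with an audio stream."""
    return [
        "ffmpeg", "-i", str(vid_path), "-i", str(audio_path),
        "-c:v", "copy", "-c:a", "aac", "-strict", "experimental",
        "-map", "0:v:0", "-map", "1:a:0", "-shortest",
        "-y", str(output_path),
    ]


if __name__ == "__main__":
    # Each path stays a single argument even though it contains a space.
    print(images2video_cmd("my frames", "out dir/tmp.mp4"))
```

A list built this way can be handed to `subprocess.run(cmd, check=True)`, or to a helper such as the run_command function used above, whose implementation is not shown in this article.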

Finally, running the following code performs digital human inference:

wav2lip = Wav2LipInfer('pretrained/wav2lip_fp16.onnx')
avatar = Avatar('tjl', wav2lip, 'assets/data/tjl.mp4')
avatar.inference('assets/data/111.mp3', batch_size=64)

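That completes the code for the whole digital human.

5. Summary

This series used three articles to walk through the implementation of a digital human. The first covered the underlying principle, where the key point is the model's inputs and outputs: the model takes reference_image, masked_image, and audio features, and outputs a lip-synced image.

The first article also implemented face detection, dataset preparation, and dataset loading.

The second introduced the Wav2Lip model structure, a classic UNet made up of three parts: face_encoder, audio_encoder, and face_decoder. Inference also borrows ideas from FPN. We implemented the Wav2Lip training code as well.

This third article covered inference optimization and the Wav2LipInfer class, which makes the model easy to run. Finally, we implemented the digital human inference code. With the Avatar class, preprocessing no longer has to be repeated for every inference run.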