Troubleshooting Log: Intermittent Timeouts in an Embedding Model Service Deployed with Xinference

1. Symptoms

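When an Embedding Model is deployed with Xinference, a single call normally returns within a few hundredths of a second (0.0x s). After a run of normal calls, however, some calls suddenly time out (the timeout is set to 0.2 s).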

The latencies recorded during troubleshooting are listed below (call index, then elapsed time in seconds):

0 0.05194091796875 
1 0.03154397010803223 
2 0.02721405029296875 
3 0.022481203079223633 
4 0.0260312557220459 
5 0.022041797637939453 
6 0.023247241973876953 
7 0.022098064422607422 
8 0.0237271785736084 
9 0.02179098129272461 
10 0.022876739501953125 
11 0.02112102508544922 
12 0.022140026092529297 
13 0.021735668182373047 
14 0.022485971450805664 
15 0.02161383628845215 
16 0.02232217788696289 
17 0.4504587650299072 timeout!
18 0.5312082767486572 timeout!
19 0.025865793228149414 
20 0.027109146118164062 
21 0.02774524688720703 
22 0.026093721389770508 
23 0.023926973342895508 
24 0.026304006576538086 
25 0.02429652214050293 
26 0.02492690086364746 
27 0.0239565372467041 
28 0.024293184280395508 
29 0.024419307708740234 
30 0.027068614959716797 
31 0.022825241088867188 
32 0.022889137268066406 
33 0.022517919540405273 
34 0.022736310958862305 
35 0.023031949996948242 
36 0.024333715438842773 
37 0.35354113578796387 timeout!
38 0.34353184700012207 timeout!
39 0.025719404220581055 
40 0.02369856834411621 
41 0.02283000946044922 
42 0.022718429565429688 
43 0.022561311721801758 
44 0.02275228500366211 
45 0.02206563949584961 
46 0.022105693817138672 
47 0.02268385887145996 
48 0.023158788681030273 
49 0.021632909774780273 
50 0.02277398109436035 
51 0.02237224578857422 
52 0.022454023361206055 
53 0.021343469619750977 
54 0.021682262420654297 
55 0.021616697311401367 
56 0.022090911865234375 
57 0.34398388862609863 timeout!
58 0.3253335952758789 timeout!
59 0.027069807052612305 
60 0.025043725967407227 
61 0.0226593017578125 
62 0.0223391056060791 
63 0.02141594886779785 
64 0.022568941116333008 
65 0.021624088287353516 
66 0.021976947784423828 
67 0.02163410186767578 
68 0.022001981735229492 
69 0.022622346878051758 
70 0.023020267486572266 
71 0.02154064178466797 
72 0.023201704025268555 
73 0.02171468734741211 
74 0.02272319793701172 
75 0.02292656898498535 
76 0.024205446243286133 
77 0.3319737911224365 timeout!
78 0.3243284225463867 timeout!
79 0.024248361587524414 
80 0.02418971061706543 
81 0.02288365364074707 
82 0.023025989532470703 
83 0.02157282829284668 
84 0.021953105926513672 
85 0.02224111557006836 
86 0.022006988525390625 
87 0.023322582244873047 
88 0.023366689682006836 
89 0.023983478546142578 
90 0.024150609970092773 
91 0.02383112907409668 
92 0.024561643600463867 
93 0.022654056549072266 
94 0.023354530334472656 
95 0.02341461181640625 
96 0.022315025329589844 
97 0.3170638084411621 timeout!
98 0.32648158073425293 timeout!
99 0.024491071701049805
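
A log like the one above could be collected with a simple timing loop. The following is a minimal sketch, not the script actually used for this record: time_calls and fake_embedding_call are hypothetical names, and the stand-in function would need to be replaced with the real embedding request.

```python
import time
from typing import Callable, List, Tuple

TIMEOUT_S = 0.2  # the client-side timeout mentioned above


def time_calls(
    call: Callable[[], object], n: int, timeout_s: float = TIMEOUT_S
) -> List[Tuple[int, float, bool]]:
    """Invoke `call` n times; return (index, elapsed seconds, exceeded timeout?)."""
    records = []
    for i in range(n):
        start = time.perf_counter()
        call()
        elapsed = time.perf_counter() - start
        records.append((i, elapsed, elapsed > timeout_s))
    return records


def fake_embedding_call() -> None:
    """Hypothetical stand-in for the real embedding request."""
    time.sleep(0.001)


if __name__ == "__main__":
    for i, elapsed, slow in time_calls(fake_embedding_call, 5):
        print(i, elapsed, "timeout!" if slow else "")
```

Measuring on the client side like this includes network and serialization time, which matches what a caller with a 0.2 s timeout actually experiences.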

2. Investigation and Analysis

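The intermittent timeouts follow a regular pattern: out of every 20 calls, 2 time out. I deployed 2 instances of the model, and the 2 timeouts in each pair occur on different instances. In other words, each instance times out once every 10 calls. Why does this happen? Let's look at the source code: github.com/xorbitsai/i…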
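
The "once every 10 calls per instance" arithmetic can be checked against the logged indices. This sketch assumes requests are spread round-robin across the 2 instances, which the log itself does not confirm; per_instance_gaps is a hypothetical helper name.

```python
from collections import defaultdict

# Indices of the calls flagged as timeouts in the log above.
slow_calls = [17, 18, 37, 38, 57, 58, 77, 78, 97, 98]
NUM_INSTANCES = 2  # two model instances, as described above


def per_instance_gaps(slow, num_instances):
    """Group slow calls by instance and return the gaps between them,
    counted in calls handled by that instance."""
    local = defaultdict(list)
    for i in slow:
        # Assumption: call i goes to instance i % num_instances and is
        # that instance's (i // num_instances)-th call (round-robin).
        local[i % num_instances].append(i // num_instances)
    return {
        inst: [b - a for a, b in zip(idx, idx[1:])]
        for inst, idx in sorted(local.items())
    }


gaps = per_instance_gaps(slow_calls, NUM_INSTANCES)
print(gaps)  # {0: [10, 10, 10, 10], 1: [10, 10, 10, 10]}
```

Under that assumption, every gap is exactly 10 calls on each instance, which points at something periodic inside the model process rather than random load spikes.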

The end of the def encode function contains the following logic:

        self._counter += 1
        if (
            self._counter % EMBEDDING_EMPTY_CACHE_COUNT == 0
            or all_token_nums >= EMBEDDING_EMPTY_CACHE_TOKENS
        ):
            logger.debug(
                "Empty embedding cache, calling count %s, all_token_nums %s",
                self._counter,
                all_token_nums,
            )
            gc.collect()
            empty_cache()

        return result

gc.collect() runs Python's garbage collector to reclaim host memory, and empty_cache() releases the GPU cache:

def empty_cache():
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    if torch.backends.mps.is_available():
        torch.mps.empty_cache()
    if is_xpu_available():
        torch.xpu.empty_cache()
    if is_npu_available():
        torch.npu.empty_cache()
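
To see how much time each cleanup step takes on a given machine, the two calls can be timed separately. This is a rough sketch with hypothetical helper names (timed, measure_cleanup). It only covers the CUDA branch, and in a fresh process the numbers will be far smaller than inside a loaded model server, so the figures are meaningful only when measured in the serving process itself.

```python
import gc
import time


def timed(fn):
    """Run fn() once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def measure_cleanup():
    """Time the two cleanup steps separately.

    The GPU step is skipped when torch is not installed or CUDA is unavailable.
    """
    timings = {"gc.collect": timed(gc.collect)}
    try:
        import torch  # optional: only needed for the GPU measurement
    except ImportError:
        return timings
    if torch.cuda.is_available():
        timings["torch.cuda.empty_cache"] = timed(torch.cuda.empty_cache)
    return timings


print(measure_cleanup())
```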

Next, look at the default values of the related environment variables:

EMBEDDING_EMPTY_CACHE_COUNT = int(
    os.getenv("XINFERENCE_EMBEDDING_EMPTY_CACHE_COUNT", "10")
)
EMBEDDING_EMPTY_CACHE_TOKENS = int(
    os.getenv("XINFERENCE_EMBEDDING_EMPTY_CACHE_TOKENS", "8192")
)

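In other words, by default host memory and the GPU cache are cleaned up on every 10th call, or whenever all_token_nums reaches 8192. As a result, the service slows down on the next call.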
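
The trigger condition can be replayed outside Xinference to see which calls would run the cleanup. The sketch below mirrors only the condition quoted above. read_thresholds and cleanup_calls are hypothetical helpers, and the request sizes are made up for illustration.

```python
import os


def read_thresholds(env=os.environ):
    """Read both thresholds with the same variable names and defaults as the quoted source."""
    count = int(env.get("XINFERENCE_EMBEDDING_EMPTY_CACHE_COUNT", "10"))
    tokens = int(env.get("XINFERENCE_EMBEDDING_EMPTY_CACHE_TOKENS", "8192"))
    return count, tokens


def cleanup_calls(token_counts, count_threshold, token_threshold):
    """Return the 1-based call numbers on which the cleanup branch would run."""
    triggered = []
    counter = 0
    for all_token_nums in token_counts:
        counter += 1  # mirrors `self._counter += 1`
        if counter % count_threshold == 0 or all_token_nums >= token_threshold:
            triggered.append(counter)
    return triggered


# 30 short requests of 20 tokens each (made-up sizes, for illustration only).
short_requests = [20] * 30
print(cleanup_calls(short_requests, 10, 8192))    # defaults -> [10, 20, 30]
print(cleanup_calls(short_requests, 100, 81920))  # raised thresholds -> []
print(cleanup_calls([20, 9000, 20], 10, 8192))    # one large request -> [2]
```

With short requests the token threshold never fires, so the count threshold alone explains a slow call on every 10th request per instance.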

3. Solution

To fix this, set larger values for the environment variables XINFERENCE_EMBEDDING_EMPTY_CACHE_COUNT and XINFERENCE_EMBEDDING_EMPTY_CACHE_TOKENS when starting Xinference:

#!/bin/bash

# Activate the conda environment
echo "Activating conda environment: xinference"
source /data/zsj/miniconda3/etc/profile.d/conda.sh
conda activate xinference

# Set the environment variables
export XINFERENCE_HOME=/data/zsj/xinference
export XINFERENCE_EMBEDDING_EMPTY_CACHE_COUNT=100
export XINFERENCE_EMBEDDING_EMPTY_CACHE_TOKENS=81920

# Check whether xinference-local is already running, and print the result
if pgrep -f "xinference-local --host 192.168.0.5 --port 8089" > /dev/null
then
    echo "Service is already running."
else
    echo "Service is not running. Starting service..."
    # Not running: start the service in the background and print its PID
    nohup xinference-local --host 192.168.0.5 --port 8089 >> /data/zsj/xinference/nohup.out 2>&1 &
    echo "Service started with PID $!"
    sleep 10
fi
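
As a rough estimate of what the larger count threshold changes, the number of cleanup-triggered slow calls can be computed directly. This sketch assumes that requests are split evenly across instances, that each instance's counter starts at zero, and that requests are short enough that the token threshold never fires. expected_cleanups is a hypothetical helper.

```python
def expected_cleanups(total_calls, num_instances, count_threshold):
    """Estimate how many calls hit the count-based cleanup across all instances."""
    calls_per_instance = total_calls // num_instances  # assumes an even split
    return num_instances * (calls_per_instance // count_threshold)


# 100 calls over 2 instances with the default threshold of 10:
print(expected_cleanups(100, 2, 10))    # 10, the same count as the timeouts in the log
# The same traffic with the threshold raised to 100:
print(expected_cleanups(100, 2, 100))   # 0
# Over a longer run the cleanup still happens, just less often:
print(expected_cleanups(1000, 2, 100))  # 10
```

Raising the threshold therefore makes the slow calls rarer rather than removing them, and the cache is released less often in exchange.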