LLMPerf provides command-line tools for load testing and correctness testing of large language models.
Installing and configuring LLMPerf is fairly tedious, so I packaged it into a Docker image for easy use.
This post shows how to run both kinds of tests from the command line using my Docker image.
Load Testing
Suppose you want to evaluate the performance of the cloud or local model you are currently using and need a baseline. LLMPerf's load test command token_benchmark_ray.py does exactly that:
docker run \
-e OPENAI_API_KEY=ollama \
-e OPENAI_API_BASE=http://monkey:11434/v1 \
-v $(pwd)/result_outputs:/app/result_outputs \
marshalw/llmperf \
python3 token_benchmark_ray.py \
--model "qwen2" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'
Configuring model parameters to get the test running
To run the command above, you need to change at least two groups of parameters (the rest can stay as-is):
- Parameters for an OpenAI-compatible model API
OPENAI_API_KEY=ollama
OPENAI_API_BASE=http://monkey:11434/v1
- The model name
--model "qwen2"
I'm using the Ollama RESTful API here, which is OpenAI-compatible.
Here is what a run looks like:
$ docker run \
-e OPENAI_API_KEY=ollama \
-e OPENAI_API_BASE=http://monkey:11434/v1 \
-v $(pwd)/result_outputs:/app/result_outputs \
marshalw/llmperf \
python3 token_benchmark_ray.py \
--model "qwen2" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
2024-07-29 10:27:37,739 WARNING services.py:2017 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=5.07gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2024-07-29 10:27:37,862 INFO worker.py:1781 -- Started a local Ray instance.
100%|██████████| 2/2 [00:07<00:00, 3.94s/it]
Results for token benchmark for qwen2 queried with the openai api.
inter_token_latency_s
p25 = 0.018408454432277666
p50 = 0.02065592641166753
p75 = 0.022903398391057393
p90 = 0.02425188157869131
p95 = 0.02470137597456928
p99 = 0.02506097149127166
mean = 0.02065592641166753
min = 0.016160982452887804
max = 0.025150870370447255
stddev = 0.0063568107086132974
ttft_s
p25 = 0.607668885262683
p50 = 0.9441499398089945
p75 = 1.280630994355306
p90 = 1.482519627083093
p95 = 1.5498158379923552
p99 = 1.603652806719765
mean = 0.9441499398089945
min = 0.27118783071637154
max = 1.6171120489016175
stddev = 0.9517121416419898
end_to_end_latency_s
p25 = 3.270658512949012
p50 = 3.502787458943203
p75 = 3.734916404937394
p90 = 3.8741937725339084
p95 = 3.920619561732747
p99 = 3.957760193091817
mean = 3.502787458943203
min = 3.0385295669548213
max = 3.967045350931585
stddev = 0.656559807288713
request_output_throughput_token_per_s
p25 = 45.150048420697644
p50 = 50.72404310076146
p75 = 56.29803778082528
p90 = 59.642434588863566
p95 = 60.757233524876334
p99 = 61.64907267368654
mean = 50.72404310076146
min = 39.57605374063383
max = 61.872032460889095
stddev = 15.765637746283458
number_input_tokens
p25 = 548.75
p50 = 559.5
p75 = 570.25
p90 = 576.7
p95 = 578.85
p99 = 580.57
mean = 559.5
min = 538
max = 581
stddev = 30.405591591021544
number_output_tokens
p25 = 164.75
p50 = 172.5
p75 = 180.25
p90 = 184.9
p95 = 186.45
p99 = 187.69
mean = 172.5
min = 157
max = 188
stddev = 21.920310216782973
Number Of Errored Requests: 0
Overall Output Throughput: 43.720811627735536
Number Of Completed Requests: 2
Completed Requests Per Minute: 15.207238827038447
Tuning the parameters
The main tuning parameters are:
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
- mean-input-tokens: mean number of input tokens
- stddev-input-tokens: standard deviation of the input token count
- mean-output-tokens: mean number of output tokens
- stddev-output-tokens: standard deviation of the output token count
- max-num-completed-requests: maximum number of requests in the test
- num-concurrent-requests: number of concurrent requests
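To get a feel for what the mean and stddev flags do, here is a minimal sketch (not LLMPerf's actual code) of drawing per-request input and output token budgets from a normal distribution:

```python
import random

def sample_token_counts(mean_in=550, stddev_in=150,
                        mean_out=150, stddev_out=10, n=5, seed=42):
    """Sample per-request input/output token budgets from normal
    distributions, mirroring the --mean-*/--stddev-* flags above."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        # clamp to at least 1 token so a low draw stays valid
        n_in = max(1, round(rng.gauss(mean_in, stddev_in)))
        n_out = max(1, round(rng.gauss(mean_out, stddev_out)))
        pairs.append((n_in, n_out))
    return pairs

for n_in, n_out in sample_token_counts():
    print(f"input tokens: {n_in:4d}, output tokens: {n_out:3d}")
```

This is why the number_input_tokens and number_output_tokens blocks in the results vary from request to request rather than sitting exactly on 550 and 150.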
What the load test sends
This is the prompt used for the conversation with the LLM:
Randomly stream lines from the following text. Don't generate eos tokens:
LINE 1,
LINE 2,
LINE 3,
...
The lines are drawn at random from Shakespeare's sonnets, and the streamed output simulates a conversation.
Also, input and output tokens are counted with LlamaTokenizer rather than each model's own tokenizer, so that every model is measured against a single token-counting standard.
Because the standard is the same, the load test can serve as a benchmark for comparing the performance of multiple cloud and local models.
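A toy illustration of why a shared tokenizer matters: two different tokenization schemes report very different counts for the same text, so throughput numbers are only comparable when every model is measured with one tokenizer. (This is not LLMPerf code, just the idea.)

```python
def whitespace_tokens(text: str) -> int:
    """Count tokens by splitting on whitespace."""
    return len(text.split())

def char_tokens(text: str) -> int:
    """Count tokens as individual non-space characters."""
    return sum(1 for c in text if not c.isspace())

text = "Shall I compare thee to a summer's day?"
print(whitespace_tokens(text))  # prints 8
print(char_tokens(text))        # far more "tokens" for the same text
```

If model A's tokenizer behaved like the first scheme and model B's like the second, their tokens-per-second figures would not be comparable; pinning both to LlamaTokenizer removes that bias.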
Result data
inter_token_latency_s
p25 = 0.018408454432277666
p50 = 0.02065592641166753
p75 = 0.022903398391057393
p90 = 0.02425188157869131
p95 = 0.02470137597456928
p99 = 0.02506097149127166
mean = 0.02065592641166753
min = 0.016160982452887804
max = 0.025150870370447255
stddev = 0.0063568107086132974
ttft_s
p25 = 0.607668885262683
p50 = 0.9441499398089945
p75 = 1.280630994355306
p90 = 1.482519627083093
p95 = 1.5498158379923552
p99 = 1.603652806719765
mean = 0.9441499398089945
min = 0.27118783071637154
max = 1.6171120489016175
stddev = 0.9517121416419898
end_to_end_latency_s
p25 = 3.270658512949012
p50 = 3.502787458943203
p75 = 3.734916404937394
p90 = 3.8741937725339084
p95 = 3.920619561732747
p99 = 3.957760193091817
mean = 3.502787458943203
min = 3.0385295669548213
max = 3.967045350931585
stddev = 0.656559807288713
request_output_throughput_token_per_s
p25 = 45.150048420697644
p50 = 50.72404310076146
p75 = 56.29803778082528
p90 = 59.642434588863566
p95 = 60.757233524876334
p99 = 61.64907267368654
mean = 50.72404310076146
min = 39.57605374063383
max = 61.872032460889095
stddev = 15.765637746283458
number_input_tokens
p25 = 548.75
p50 = 559.5
p75 = 570.25
p90 = 576.7
p95 = 578.85
p99 = 580.57
mean = 559.5
min = 538
max = 581
stddev = 30.405591591021544
number_output_tokens
p25 = 164.75
p50 = 172.5
p75 = 180.25
p90 = 184.9
p95 = 186.45
p99 = 187.69
mean = 172.5
min = 157
max = 188
stddev = 21.920310216782973
Number Of Errored Requests: 0
Overall Output Throughput: 43.720811627735536
Number Of Completed Requests: 2
Completed Requests Per Minute: 15.207238827038447
- inter_token_latency_s: latency between consecutive tokens (seconds)
- ttft_s: time to first token, i.e. the delay before the first token arrives; important for real-time applications such as chat assistants, which need to respond promptly
- end_to_end_latency_s: end-to-end latency, from sending the request to receiving the last token; another key real-time metric
- request_output_throughput_token_per_s: output tokens per second for a single request
- number_input_tokens: number of input tokens
- number_output_tokens: number of output tokens
- Overall Output Throughput: overall output tokens per second
- Number Of Completed Requests: number of completed requests
- Completed Requests Per Minute: completed requests per minute
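The percentile rows above can be reproduced from the per-request samples with nothing but the standard library; a sketch (LLMPerf's own aggregation code may differ):

```python
import statistics

def summarize(samples):
    """Build the same summary rows LLMPerf prints:
    p25/p50/p75/p90/p95/p99, mean, min, max, stddev."""
    # quantiles(n=100) returns the 1st..99th percentile cut points
    q = statistics.quantiles(samples, n=100)
    return {
        "p25": q[24], "p50": q[49], "p75": q[74],
        "p90": q[89], "p95": q[94], "p99": q[98],
        "mean": statistics.mean(samples),
        "min": min(samples), "max": max(samples),
        "stddev": statistics.stdev(samples),
    }

# e.g. end-to-end latencies (seconds) from a hypothetical run
latencies = [3.04, 3.21, 3.38, 3.50, 3.61, 3.77, 3.97]
for name, value in summarize(latencies).items():
    print(f"    {name} = {value}")
```

With only 2 completed requests, as in the run above, the percentiles are just interpolations between the two samples, which is why the mean equals p50 there; more requests give more meaningful tails.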
Where the results are stored
The results are stored in ./result_outputs:
$ ls result_outputs -hl
total 5K
-rw-r--r-- 1 root root 777 Jul 29 10:27 qwen2_550_150_individual_responses.json
-rw-r--r-- 1 root root 4.3K Jul 29 10:27 qwen2_550_150_summary.json
You can generate analysis charts from these JSON files.
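For example, a minimal sketch of pulling chart-ready numbers out of the summary file. The key names below are made up for illustration; check the actual qwen2_550_150_summary.json for the real field names before charting:

```python
import json

def extract_chart_points(summary: dict, metric: str) -> dict:
    """Pick the percentile entries for one metric, ready for a bar chart."""
    prefix = f"results_{metric}_quantiles_p"
    return {k: v for k, v in summary.items() if k.startswith(prefix)}

# Stand-in for:
#   summary = json.load(open("result_outputs/qwen2_550_150_summary.json"))
# The keys here are hypothetical, shaped after the console output above.
summary = {
    "results_ttft_s_quantiles_p50": 0.944,
    "results_ttft_s_quantiles_p90": 1.483,
    "results_ttft_s_quantiles_p99": 1.604,
    "results_end_to_end_latency_s_quantiles_p50": 3.503,
}
print(extract_chart_points(summary, "ttft_s"))
```

The same dict can then be handed to any plotting library to compare several models side by side.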
Other notes
- For local models, run the command twice and take the second result, since the first run may include the model's loading time.
Correctness Testing
Suppose you want to compare how well several large models answer questions correctly.
LLMPerf provides the correctness test command llm_correctness.py:
docker run \
-e OPENAI_API_KEY=ollama \
-e OPENAI_API_BASE=http://monkey:11434/v1 \
-v $(pwd)/result_outputs:/app/result_outputs \
marshalw/llmperf \
python3 llm_correctness.py \
--model "qwen2" \
--max-num-completed-requests 20 \
--timeout 600 \
--num-concurrent-requests 2 \
--results-dir "result_outputs"
Configuring model parameters to get the test running
As with the load test, two groups of parameters need to be changed (the rest can stay as-is):
- Parameters for an OpenAI-compatible model API
OPENAI_API_KEY=ollama
OPENAI_API_BASE=http://monkey:11434/v1
- The model name
--model "qwen2"
Here is what a run looks like:
$ docker run \
-e OPENAI_API_KEY=ollama \
-e OPENAI_API_BASE=http://monkey:11434/v1 \
-v $(pwd)/result_outputs:/app/result_outputs \
marshalw/llmperf \
python3 llm_correctness.py \
--model "qwen2" \
--max-num-completed-requests 20 \
--timeout 600 \
--num-concurrent-requests 2 \
--results-dir "result_outputs"
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
2024-07-29 11:40:02,433 WARNING services.py:2017 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=5.08gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2024-07-29 11:40:03,564 INFO worker.py:1781 -- Started a local Ray instance.
100%|██████████| 20/20 [00:03<00:00, 5.32it/s]
Mismatched and errored requests.
mismatched request: 132, expected: 213
Results for llm correctness test for qwen2 queried with the openai api.
Errors: 0, Error rate: 0.0
Mismatched: 1, Mismatch rate: 0.05
Completed: 20
Completed without errors: 20
Tuning the parameters
There are fewer parameters than in the load test:
--max-num-completed-requests 20 \
--timeout 600 \
--num-concurrent-requests 2 \
- max-num-completed-requests: set this fairly high; with too few requests, every answer may happen to be correct and the mismatch rate tells you nothing
- num-concurrent-requests: number of concurrent requests
What the correctness test sends
The prompt:
Convert the following sequence of words into a number: {random_number_in_word_format}. Output just your final answer.
The LLM is given a number written out in English words and asked to output it in digits.
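A toy version of the pass/fail check, assuming nothing about LLMPerf's internals: convert the English words back into digits and compare with the model's reply. The reply below is an imagined wrong answer, chosen to mirror the log line in the results.

```python
# Lookup tables for a small subset of English number words.
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
         "ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
         "fourteen": 14, "fifteen": 15, "sixteen": 16,
         "seventeen": 17, "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def words_to_number(words: str) -> int:
    """Parse a small subset of English number words,
    e.g. 'two hundred thirteen' -> 213."""
    total = 0
    for w in words.lower().replace("-", " ").split():
        if w in UNITS:
            total += UNITS[w]
        elif w in TENS:
            total += TENS[w]
        elif w == "hundred":
            total *= 100
    return total

expected = words_to_number("two hundred thirteen")
model_reply = "132"  # imagined incorrect model output
if int(model_reply) != expected:
    print(f"mismatched request: {model_reply}, expected: {expected}")
```

Each request that fails this comparison is counted as a mismatch, and the mismatch rate is mismatches divided by completed requests.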
Result data
Mismatched and errored requests.
mismatched request: 132, expected: 213
Results for llm correctness test for qwen2 queried with the openai api.
Errors: 0, Error rate: 0.0
Mismatched: 1, Mismatch rate: 0.05
Completed: 20
Completed without errors: 20
- "mismatched request: 132, expected: 213" means one answer was wrong
- "Mismatched: 1, Mismatch rate: 0.05" means the mismatch rate is 5%
Where the results are stored
The results are stored in ./result_outputs:
$ ls result_outputs -hl
total 20K
-rw-r--r-- 1 root root 19K Jul 29 11:40 qwen2_correctness_individual_responses.json
-rw-r--r-- 1 root root 360 Jul 29 11:40 qwen2_correctness_summary.json
Summary
LLMPerf provides two command-line benchmarking tools, covering:
- load
- correctness
With the Docker image, there is nothing to install or configure: you run the tests directly from the command line, then feed the resulting JSON files into charting tools to produce visual analysis reports.