Load Testing Models with LLMPerf


LLMPerf can run load tests and correctness tests against large language models, and provides command-line tools for both.


Installing and configuring LLMPerf is fairly tedious, so I packaged it into a Docker image for direct use.

Below is how to run the tests against my Docker image from the command line.

Load testing

For example, if you need a baseline for evaluating the performance of the cloud or local model you are currently using, you can use LLMPerf's load-testing command token_benchmark_ray.py:

docker run  \
    -e OPENAI_API_KEY=ollama \
    -e OPENAI_API_BASE=http://monkey:11434/v1 \
    -v $(pwd)/result_outputs:/app/result_outputs \
    marshalw/llmperf \
        python3 token_benchmark_ray.py \
            --model "qwen2" \
            --mean-input-tokens 550 \
            --stddev-input-tokens 150 \
            --mean-output-tokens 150 \
            --stddev-output-tokens 10 \
            --max-num-completed-requests 2 \
            --timeout 600 \
            --num-concurrent-requests 1 \
            --results-dir "result_outputs" \
            --llm-api openai \
            --additional-sampling-params '{}'

Configure the model parameters to get the test running

To run the command above, at minimum two groups of parameters need changing (the rest can stay as-is):

  • Parameters for the OpenAI-compatible model API
    • OPENAI_API_KEY=ollama
    • OPENAI_API_BASE=http://monkey:11434/v1
  • The model name: --model "qwen2"

I am using the Ollama RESTful API here, which is OpenAI-compatible.

What a run looks like:

$ docker run  \
    -e OPENAI_API_KEY=ollama \
    -e OPENAI_API_BASE=http://monkey:11434/v1 \
    -v $(pwd)/result_outputs:/app/result_outputs \
    marshalw/llmperf \
        python3 token_benchmark_ray.py \
            --model "qwen2" \
            --mean-input-tokens 550 \
            --stddev-input-tokens 150 \
            --mean-output-tokens 150 \
            --stddev-output-tokens 10 \
            --max-num-completed-requests 2 \
            --timeout 600 \
            --num-concurrent-requests 1 \
            --results-dir "result_outputs" \
            --llm-api openai \
            --additional-sampling-params '{}'
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
2024-07-29 10:27:37,739 WARNING services.py:2017 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=5.07gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2024-07-29 10:27:37,862 INFO worker.py:1781 -- Started a local Ray instance.
100%|██████████| 2/2 [00:07<00:00,  3.94s/it]
Results for token benchmark for qwen2 queried with the openai api.

inter_token_latency_s
    p25 = 0.018408454432277666
    p50 = 0.02065592641166753
    p75 = 0.022903398391057393
    p90 = 0.02425188157869131
    p95 = 0.02470137597456928
    p99 = 0.02506097149127166
    mean = 0.02065592641166753
    min = 0.016160982452887804
    max = 0.025150870370447255
    stddev = 0.0063568107086132974
ttft_s
    p25 = 0.607668885262683
    p50 = 0.9441499398089945
    p75 = 1.280630994355306
    p90 = 1.482519627083093
    p95 = 1.5498158379923552
    p99 = 1.603652806719765
    mean = 0.9441499398089945
    min = 0.27118783071637154
    max = 1.6171120489016175
    stddev = 0.9517121416419898
end_to_end_latency_s
    p25 = 3.270658512949012
    p50 = 3.502787458943203
    p75 = 3.734916404937394
    p90 = 3.8741937725339084
    p95 = 3.920619561732747
    p99 = 3.957760193091817
    mean = 3.502787458943203
    min = 3.0385295669548213
    max = 3.967045350931585
    stddev = 0.656559807288713
request_output_throughput_token_per_s
    p25 = 45.150048420697644
    p50 = 50.72404310076146
    p75 = 56.29803778082528
    p90 = 59.642434588863566
    p95 = 60.757233524876334
    p99 = 61.64907267368654
    mean = 50.72404310076146
    min = 39.57605374063383
    max = 61.872032460889095
    stddev = 15.765637746283458
number_input_tokens
    p25 = 548.75
    p50 = 559.5
    p75 = 570.25
    p90 = 576.7
    p95 = 578.85
    p99 = 580.57
    mean = 559.5
    min = 538
    max = 581
    stddev = 30.405591591021544
number_output_tokens
    p25 = 164.75
    p50 = 172.5
    p75 = 180.25
    p90 = 184.9
    p95 = 186.45
    p99 = 187.69
    mean = 172.5
    min = 157
    max = 188
    stddev = 21.920310216782973
Number Of Errored Requests: 0
Overall Output Throughput: 43.720811627735536
Number Of Completed Requests: 2
Completed Requests Per Minute: 15.207238827038447

Tuning the remaining parameters

The main detail parameters are:

--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \

  • mean-input-tokens: mean number of input tokens per request
  • stddev-input-tokens: standard deviation of the input token count
  • mean-output-tokens: mean number of output tokens per request
  • stddev-output-tokens: standard deviation of the output token count
  • max-num-completed-requests: maximum number of requests in the test
  • timeout: overall test timeout, in seconds
  • num-concurrent-requests: number of concurrent requests
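As a rough illustration of how the mean/stddev parameters shape each request (a hypothetical sketch, not llmperf's actual sampling code), per-request token counts can be drawn from a normal distribution:

```python
import random

def sample_token_count(mean: int, stddev: int, minimum: int = 1) -> int:
    """Draw a per-request token count from a normal distribution,
    clamped so it never drops below `minimum`."""
    return max(minimum, round(random.gauss(mean, stddev)))

random.seed(0)
# Mirrors --mean-input-tokens 550 --stddev-input-tokens 150
inputs = [sample_token_count(550, 150) for _ in range(5)]
# Mirrors --mean-output-tokens 150 --stddev-output-tokens 10
outputs = [sample_token_count(150, 10) for _ in range(5)]
print(inputs, outputs)
```

This is why the observed number_input_tokens values in the results scatter around 550 rather than matching it exactly.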

What the load test sends

This is the prompt sent to the LLM:

Randomly stream lines from the following text. Don't generate eos tokens:
LINE 1,
LINE 2,
LINE 3,
...

The text comes from Shakespeare's sonnets: a few lines are sampled at random and streamed back, simulating a conversation.
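A minimal sketch of how such a prompt could be assembled (hypothetical helper names; the sonnet lines here are just a stand-in for the full text llmperf ships with):

```python
import random

SONNET_LINES = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
    "Rough winds do shake the darling buds of May,",
    "And summer's lease hath all too short a date;",
]

def build_prompt(lines, num_lines, seed=None):
    """Sample lines at random and format them into the benchmark-style prompt."""
    rng = random.Random(seed)
    chosen = rng.sample(lines, num_lines)
    header = "Randomly stream lines from the following text. Don't generate eos tokens:\n"
    return header + ",\n".join(chosen)

print(build_prompt(SONNET_LINES, 2, seed=1))
```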

Also, input and output token counts are both computed with the LlamaTokenizer rather than each model's own tokenizer, so that token counting follows a single consistent standard.

Because the standard is the same, the load test can serve as a benchmark for comparing the performance of multiple cloud and local models.

Result data

inter_token_latency_s
    p25 = 0.018408454432277666
    p50 = 0.02065592641166753
    p75 = 0.022903398391057393
    p90 = 0.02425188157869131
    p95 = 0.02470137597456928
    p99 = 0.02506097149127166
    mean = 0.02065592641166753
    min = 0.016160982452887804
    max = 0.025150870370447255
    stddev = 0.0063568107086132974
ttft_s
    p25 = 0.607668885262683
    p50 = 0.9441499398089945
    p75 = 1.280630994355306
    p90 = 1.482519627083093
    p95 = 1.5498158379923552
    p99 = 1.603652806719765
    mean = 0.9441499398089945
    min = 0.27118783071637154
    max = 1.6171120489016175
    stddev = 0.9517121416419898
end_to_end_latency_s
    p25 = 3.270658512949012
    p50 = 3.502787458943203
    p75 = 3.734916404937394
    p90 = 3.8741937725339084
    p95 = 3.920619561732747
    p99 = 3.957760193091817
    mean = 3.502787458943203
    min = 3.0385295669548213
    max = 3.967045350931585
    stddev = 0.656559807288713
request_output_throughput_token_per_s
    p25 = 45.150048420697644
    p50 = 50.72404310076146
    p75 = 56.29803778082528
    p90 = 59.642434588863566
    p95 = 60.757233524876334
    p99 = 61.64907267368654
    mean = 50.72404310076146
    min = 39.57605374063383
    max = 61.872032460889095
    stddev = 15.765637746283458
number_input_tokens
    p25 = 548.75
    p50 = 559.5
    p75 = 570.25
    p90 = 576.7
    p95 = 578.85
    p99 = 580.57
    mean = 559.5
    min = 538
    max = 581
    stddev = 30.405591591021544
number_output_tokens
    p25 = 164.75
    p50 = 172.5
    p75 = 180.25
    p90 = 184.9
    p95 = 186.45
    p99 = 187.69
    mean = 172.5
    min = 157
    max = 188
    stddev = 21.920310216782973
Number Of Errored Requests: 0
Overall Output Throughput: 43.720811627735536
Number Of Completed Requests: 2
Completed Requests Per Minute: 15.207238827038447

  • inter_token_latency_s: latency between consecutive tokens, in seconds
  • ttft_s: time to first token, i.e. how long until the first token arrives. This matters for real-time applications such as chat assistants, which need to respond promptly
  • end_to_end_latency_s: end-to-end latency, from sending the request to receiving the last token; another key real-time metric
  • request_output_throughput_token_per_s: output tokens per second, per request
  • number_input_tokens: number of input tokens
  • number_output_tokens: number of output tokens
  • Overall Output Throughput: total output tokens per second across all requests
  • Number Of Completed Requests: number of completed requests
  • Completed Requests Per Minute: completed requests per minute
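These metrics are related to one another. A hedged sketch (a simplified model, not llmperf's exact formulas) of deriving per-request metrics from raw timings:

```python
def derive_metrics(ttft_s, e2e_s, num_output_tokens):
    """Derive per-request metrics from raw timings.

    Simplified relationship (an assumption, not llmperf's exact code):
    the time after the first token is spread evenly across the
    remaining output tokens.
    """
    inter_token_latency = (e2e_s - ttft_s) / max(num_output_tokens - 1, 1)
    throughput = num_output_tokens / e2e_s  # tokens per second
    return inter_token_latency, throughput

# Roughly the medians above: ~0.94 s TTFT, ~3.5 s end to end, ~172 tokens
itl, tps = derive_metrics(0.94, 3.5, 172)
print(f"inter-token latency ~ {itl:.4f} s, throughput ~ {tps:.1f} tok/s")
```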

Where the results are stored

Results are written to ./result_outputs:

$ ls result_outputs -hl
total 5K
-rw-r--r-- 1 root root  777 Jul 29 10:27 qwen2_550_150_individual_responses.json
-rw-r--r-- 1 root root 4.3K Jul 29 10:27 qwen2_550_150_summary.json

You can generate charts and analysis from these JSON files.
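For example, a small Python sketch for pulling the numeric fields out of a summary file before charting (the exact key names depend on the llmperf version, so treat them as an assumption):

```python
import json
from pathlib import Path

def load_summary(path):
    """Load an llmperf summary JSON and keep only the numeric fields,
    which are the ones worth plotting."""
    data = json.loads(Path(path).read_text())
    return {k: v for k, v in data.items() if isinstance(v, (int, float))}

# Hypothetical usage -- key names vary by llmperf version:
# metrics = load_summary("result_outputs/qwen2_550_150_summary.json")
# for name, value in sorted(metrics.items()):
#     print(f"{name}: {value}")
```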

Other notes

  • For a local model, run the command twice and take the second result, since the first run may include model loading time

Correctness testing

For example, you may want to compare how well several large models answer questions correctly.

LLMPerf provides the correctness-testing command llm_correctness.py:

docker run  \
    -e OPENAI_API_KEY=ollama \
    -e OPENAI_API_BASE=http://monkey:11434/v1 \
    -v $(pwd)/result_outputs:/app/result_outputs \
    marshalw/llmperf \
        python3 llm_correctness.py \
            --model "qwen2" \
            --max-num-completed-requests 20 \
            --timeout 600 \
            --num-concurrent-requests 2 \
            --results-dir "result_outputs"

Configure the model parameters to get the test running

As with the load test, two groups of parameters need changing (the rest can stay as-is):

  • Parameters for the OpenAI-compatible model API
    • OPENAI_API_KEY=ollama
    • OPENAI_API_BASE=http://monkey:11434/v1
  • The model name: --model "qwen2"

What a run looks like:

$ docker run  \
    -e OPENAI_API_KEY=ollama \
    -e OPENAI_API_BASE=http://monkey:11434/v1 \
    -v $(pwd)/result_outputs:/app/result_outputs \
    marshalw/llmperf \
        python3 llm_correctness.py \
            --model "qwen2" \
            --max-num-completed-requests 20 \
            --timeout 600 \
            --num-concurrent-requests 2 \
            --results-dir "result_outputs"
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
2024-07-29 11:40:02,433 WARNING services.py:2017 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=5.08gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2024-07-29 11:40:03,564 INFO worker.py:1781 -- Started a local Ray instance.
100%|██████████| 20/20 [00:03<00:00,  5.32it/s]
Mismatched and errored requests.
    mismatched request: 132, expected: 213

Results for llm correctness test for qwen2 queried with the openai api.
Errors: 0, Error rate: 0.0
Mismatched: 1, Mismatch rate: 0.05
Completed: 20
Completed without errors: 20

Tuning the remaining parameters

There are fewer parameters than in the load test:

--max-num-completed-requests 20 \
--timeout 600 \
--num-concurrent-requests 2 \

  • max-num-completed-requests: set this reasonably high; with too few requests, every answer may come back correct and no mismatches will show up
  • num-concurrent-requests: number of concurrent requests

What the correctness test sends

The prompt:

Convert the following sequence of words into a number: {random_number_in_word_format}. Output just your final answer.

It gives the LLM a number written out in English words and asks it to answer with the number in digits.
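A minimal sketch of how such an answer could be verified (a hypothetical word-to-number parser, not llmperf's actual matching code; it only handles numbers below one thousand):

```python
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
         "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
         "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
         "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def words_to_number(text: str) -> int:
    """Parse numbers below one thousand, e.g. 'two hundred thirteen' -> 213."""
    total = 0
    for word in text.lower().replace("-", " ").split():
        if word in UNITS:
            total += UNITS[word]
        elif word in TENS:
            total += TENS[word]
        elif word == "hundred":
            total = max(total, 1) * 100
    return total

expected = words_to_number("two hundred thirteen")
model_answer = "213"  # what the LLM is expected to return
print(expected == int(model_answer))
```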

Result data

Mismatched and errored requests.
    mismatched request: 132, expected: 213

Results for llm correctness test for qwen2 queried with the openai api.
Errors: 0, Error rate: 0.0
Mismatched: 1, Mismatch rate: 0.05
Completed: 20
Completed without errors: 20

  • mismatched request: 132, expected: 213, meaning one request returned the wrong answer
  • Mismatched: 1, Mismatch rate: 0.05, i.e. a mismatch rate of 5%

Where the results are stored

Results are written to ./result_outputs:

$ ls result_outputs -hl
total 20K
-rw-r--r-- 1 root root  19K Jul 29 11:40 qwen2_correctness_individual_responses.json
-rw-r--r-- 1 root root  360 Jul 29 11:40 qwen2_correctness_summary.json

Summary

LLMPerf provides two command-line benchmarking tools, targeting:

  • load
  • correctness

With the Docker image, no installation or configuration is needed: the tests run from a single command, and the resulting JSON files can be fed into charting tools to produce a visual analysis report.