vLLM Engine Arguments in Detail
The following is a detailed explanation of each argument supported by the vLLM engine:
Basic model and tokenizer arguments
--model <model_name_or_path>: name or path of the Hugging Face model to use.
--tokenizer <tokenizer_name_or_path>: name or path of the Hugging Face tokenizer to use.
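A minimal launch sketch using these two flags. The script only assembles and prints the command rather than starting a server, and the model name is illustrative; when --tokenizer is omitted it falls back to the model path.

```shell
# Assemble (but do not execute) a minimal vLLM OpenAI-compatible server command.
# The model name is illustrative; --tokenizer defaults to --model when omitted.
MODEL="Qwen/Qwen1.5-MoE-A2.7B-Chat"
CMD="python -m vllm.entrypoints.openai.api_server --model $MODEL --tokenizer $MODEL"
echo "$CMD"
```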
Revision arguments
--revision <revision>: the specific model version to use; can be a branch name, a tag name, or a commit ID. If unspecified, the default version is used.
--tokenizer-revision <revision>: likewise, the specific tokenizer version to use.
Tokenizer mode arguments
--tokenizer-mode {auto,slow}: controls the tokenizer mode. "auto" uses the fast tokenizer when one is available; "slow" always uses the slow tokenizer.
Security and remote-code trust arguments
--trust-remote-code: trust remote code from Hugging Face.
Download and load path arguments
--download-dir <directory>: directory to download and load the model weights from; defaults to the Hugging Face cache directory.
Weight load format arguments
--load-format {auto,pt,safetensors,npcache,dummy,tensorizer}: file format to load the model weights in.
"auto": try to load the weights in safetensors format and fall back to the PyTorch binary format if safetensors is unavailable.
"pt": load the weights in PyTorch binary format.
"safetensors": load the weights in safetensors format.
"npcache": load the weights in PyTorch format and store a numpy cache to speed up subsequent loads.
"dummy": initialize the weights with random values, mainly for profiling.
"tensorizer": load serialized weights with CoreWeave's Tensorizer model deserializer. See examples/tensorize_vllm_model.py for how to serialize a vLLM model and for more information.
Data type arguments
--dtype {auto,half,float16,bfloat16,float,float32}: data type for the model weights and activations.
"auto": pick the precision automatically from the model type (FP16 for FP32/FP16 models, BF16 for BF16 models).
"half" / "float16": FP16; recommended for AWQ quantization.
"bfloat16": a data type that balances precision and range.
"float" / "float32": FP32.
Model context length arguments
--max-model-len <length>: model context length. If unspecified, it is derived automatically from the model config.
Distributed serving and parallelism arguments
--worker-use-ray: use Ray for distributed serving; set automatically when more than one GPU is used.
--pipeline-parallel-size (-pp) <size>: number of pipeline-parallel stages.
--tensor-parallel-size (-tp) <size>: number of tensor-parallel replicas.
--max-parallel-loading-workers <workers>: load the model in sequential batches of workers, to avoid running out of RAM when loading large models with tensor parallelism.
Block size argument
--block-size {8,16,32}: size of a contiguous block of tokens (the unit in which the KV cache is allocated).
Other optimization arguments
--enable-prefix-caching: enable automatic prefix caching.
--seed <seed>: random seed.
--swap-space <size>: CPU swap space per GPU, in GiB.
--gpu-memory-utilization <fraction>: fraction of GPU memory to use, between 0 and 1.
--max-num-batched-tokens <tokens>: maximum number of batched tokens per iteration.
--max-num-seqs <sequences>: maximum number of sequences per iteration.
--max-paddings <paddings>: maximum number of paddings in a batch.
--disable-log-stats: disable logging of statistics.
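A tuning sketch combining the memory-related flags above. The command is assembled but not executed, the model path is illustrative, and the values shown are examples rather than recommendations; for comparison, the startup logs later in this post use the defaults (0.9 utilization, 4 GiB swap, 256 sequences).

```shell
# Assemble (but do not execute) a launch command with memory-related knobs:
# lower the GPU memory fraction, keep 4 GiB of CPU swap per GPU, and cap
# the number of concurrent sequences. Path and values are illustrative.
CMD="python -m vllm.entrypoints.openai.api_server --model ./my-model --gpu-memory-utilization 0.85 --swap-space 4 --max-num-seqs 128"
echo "$CMD"
```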
Quantization arguments
--quantization (-q) {awq,squeezellm,None}: method used to quantize the weights.
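Since the dtype section above recommends FP16 for AWQ, an AWQ launch would typically pair the two flags. A sketch (command assembled but not executed; the model path is illustrative):

```shell
# Assemble (but do not execute) a launch command for an AWQ-quantized model;
# --dtype half (FP16) is the recommended dtype for AWQ. Path is illustrative.
CMD="python -m vllm.entrypoints.openai.api_server --model ./my-awq-model --quantization awq --dtype half"
echo "$CMD"
```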
Additional async engine arguments
--engine-use-ray: use Ray to launch the LLM engine in a process separate from the server process.
--disable-log-requests: disable logging of requests.
--max-log-len: maximum number of prompt characters or prompt IDs printed in logs; unlimited by default.

While exploring LLaMA Factory, we watched a model being deployed this way. First startup log:
INFO 04-16 10:17:52 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-16 10:17:52 api_server.py:150] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='Qwen1___5-MoE-A2___7B', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=True, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-04-16 10:17:54,532 INFO worker.py:1752 -- Started a local Ray instance.
INFO 04-16 10:17:55 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-16 10:17:58 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 04-16 10:17:58 selector.py:25] Using XFormers backend.
INFO 04-16 10:18:07 model_runner.py:104] Loading model weights took 26.6740 GB
INFO 04-16 10:18:08 ray_gpu_executor.py:240] # GPU blocks: 14171, # CPU blocks: 1365
INFO 04-16 10:18:10 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-16 10:18:10 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-16 10:18:14 model_runner.py:867] Graph capturing finished in 4 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-16 10:18:14 serving_chat.py:331] Using default chat template:
INFO 04-16 10:18:14 serving_chat.py:331] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 04-16 10:18:14 serving_chat.py:331] You are a helpful assistant<|im_end|>
INFO 04-16 10:18:14 serving_chat.py:331] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 04-16 10:18:14 serving_chat.py:331] ' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '
INFO 04-16 10:18:14 serving_chat.py:331] '}}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant
INFO 04-16 10:18:14 serving_chat.py:331] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO: Started server process [528169]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Log walkthrough
This log comes from the application we just started on the server: the vLLM (large language model) engine, with workers managed through the Ray distributed computing framework. Some key points:
- API server version: the vLLM API server version is 0.4.0.post1.
  INFO 04-16 10:17:52 api_server.py:149] vLLM API server version 0.4.0.post1
- Startup arguments: the arguments the API server was launched with, including host, port, log level, allowed origins, methods and headers, API key, served model name, LoRA modules, chat template, response role, SSL configuration, and so on.
  INFO 04-16 10:17:52 api_server.py:150] args: Namespace(...)
- LLM engine initialization: the engine is initialized with a specific configuration, including the model path, tokenizer, revision, remote-code trust, data type, maximum sequence length, and more.
  INFO 04-16 10:17:55 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model=...
- Special tokens: special tokens were added to the vocabulary; make sure the associated word embeddings are fine-tuned or trained.
  Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
- FlashAttention package not found: FlashAttention cannot be used because it is not installed; installing it improves performance.
  INFO 04-16 10:17:58 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
- XFormers backend: without FlashAttention, the application falls back to the XFormers backend.
  INFO 04-16 10:17:58 selector.py:25] Using XFormers backend.
- Model weight loading: loading the model weights used 26.6740 GB of GPU memory.
  INFO 04-16 10:18:07 model_runner.py:104] Loading model weights took 26.6740 GB
- GPU and CPU block counts: the KV cache was allocated as 14171 GPU blocks and 1365 CPU blocks.
  INFO 04-16 10:18:08 ray_gpu_executor.py:240] # GPU blocks: 14171, # CPU blocks: 1365
- Graph capture finished: capturing the model for CUDA graphs took 4 seconds.
  INFO 04-16 10:18:14 model_runner.py:867] Graph capturing finished in 4 secs.
- Server startup: the server process starts, application startup completes, and the server is now listening on port 8000.
  INFO: Started server process [528169]
  INFO: Waiting for application startup.
  INFO: Application startup complete.
  INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
These log lines help a developer understand the server's startup, its configuration, the model-loading state, the attention backend in use, and the server's runtime status.
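With Uvicorn listening on port 8000, the server can be queried through its OpenAI-compatible chat endpoint. A request sketch, assuming the server above is still running locally; the "model" field must match --served-model-name from the launch command:

```shell
# Send a chat request to the OpenAI-compatible endpoint of the server above.
# "model" must match --served-model-name; the || branch keeps this runnable
# when no server is listening on localhost:8000.
PAYLOAD='{"model": "Qwen1___5-MoE-A2___7B", "messages": [{"role": "user", "content": "Hello"}]}'
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "server not reachable"
```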
On the first launch we found that the FlashAttention package was not installed, so we built and installed it from source:
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
pip install .
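A quick sanity check after installation: import flash_attn from Python and print its version. The fallback branch keeps the snippet runnable even on a machine where the build did not succeed.

```shell
# Verify that flash_attn is importable; the || branch keeps this runnable
# even where the package is absent or the build failed.
OUT=$(python -c "import flash_attn; print(flash_attn.__version__)" 2>/dev/null || echo "flash_attn not importable")
echo "$OUT"
```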
Second startup log
(qwen2_moe) ca2@ubuntu:~$ python -m vllm.entrypoints.openai.api_server --served-model-name Qwen1___5-MoE-A2___7B --model /home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat --worker-use-ray --tensor-parallel-size 2
INFO 04-17 05:46:14 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-17 05:46:14 api_server.py:150] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='Qwen1___5-MoE-A2___7B', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=True, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-04-17 05:46:17,103 INFO worker.py:1752 -- Started a local Ray instance.
INFO 04-17 05:46:17 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-17 05:46:21 selector.py:16] Using FlashAttention backend.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:22 selector.py:16] Using FlashAttention backend.
INFO 04-17 05:46:22 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=633099) INFO 04-17 05:46:22 pynccl_utils.py:45] vLLM is using nccl==2.18.1
WARNING 04-17 05:46:23 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=633099) WARNING 04-17 05:46:23 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:28 model_runner.py:104] Loading model weights took 13.3677 GB
INFO 04-17 05:46:31 model_runner.py:104] Loading model weights took 13.3677 GB
INFO 04-17 05:46:33 ray_gpu_executor.py:240] # GPU blocks: 37540, # CPU blocks: 2730
INFO 04-17 05:46:35 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-17 05:46:35 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:35 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:35 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-17 05:46:39 model_runner.py:867] Graph capturing finished in 4 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:39 model_runner.py:867] Graph capturing finished in 4 secs.
INFO 04-17 05:46:39 serving_chat.py:331] Using default chat template:
INFO 04-17 05:46:39 serving_chat.py:331] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 04-17 05:46:39 serving_chat.py:331] You are a helpful assistant<|im_end|>
INFO 04-17 05:46:39 serving_chat.py:331] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 04-17 05:46:39 serving_chat.py:331] ' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '
INFO 04-17 05:46:39 serving_chat.py:331] '}}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant
INFO 04-17 05:46:39 serving_chat.py:331] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
This log again shows the vLLM API server starting up with Ray, but now with FlashAttention installed and tensor parallelism across two GPUs (--tensor-parallel-size 2). Key points:
- API server version: the vLLM API server version is 0.4.0.post1.
  INFO 04-17 05:46:14 api_server.py:149] vLLM API server version 0.4.0.post1
- Startup arguments: the same set as before, except that tensor_parallel_size is now 2.
  INFO 04-17 05:46:14 api_server.py:150] args: Namespace(...)
- LLM engine initialization: the same configuration as the first run, with tensor_parallel_size=2.
  INFO 04-17 05:46:17 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model=...
- Special tokens: special tokens were added to the vocabulary; make sure the associated word embeddings are fine-tuned or trained.
  Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
- FlashAttention backend: the FlashAttention backend is now used, in both the driver and the Ray worker.
  INFO 04-17 05:46:21 selector.py:16] Using FlashAttention backend.
- Custom all-reduce disabled: the platform lacks GPU P2P capability (or the P2P test failed), so the custom all-reduce kernel is disabled.
  WARNING 04-17 05:46:23 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed.
- Model weight loading: each of the two workers loads 13.3677 GB, i.e. the weights are sharded across the two GPUs (roughly half of the 26.6740 GB loaded on a single GPU before).
  INFO 04-17 05:46:31 model_runner.py:104] Loading model weights took 13.3677 GB
- GPU and CPU block counts: 37540 GPU blocks and 2730 CPU blocks; with the weights halved per GPU, far more memory is left for the KV cache than in the single-GPU run.
  INFO 04-17 05:46:33 ray_gpu_executor.py:240] # GPU blocks: 37540, # CPU blocks: 2730
- Graph capture finished: capturing the model for CUDA graphs again took 4 seconds.
  INFO 04-17 05:46:39 model_runner.py:867] Graph capturing finished in 4 secs.
- Server startup: as in the first run, the server process then starts and Uvicorn listens on port 8000.
Hope this content was helpful.