vLLM Engine Arguments in Detail
The following is a detailed explanation of each argument supported by the vLLM engine:
Basic model and tokenizer arguments
--model <model_name_or_path>: name or path of the Hugging Face model to use.
--tokenizer <tokenizer_name_or_path>: name or path of the Hugging Face tokenizer to use.
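A minimal launch sketch using these two flags. The script only assembles and prints the command rather than starting a server, and the model name is illustrative; when --tokenizer is omitted it falls back to the model path.

```shell
# Assemble (but do not execute) a minimal vLLM OpenAI-compatible server command.
# The model name is illustrative; --tokenizer defaults to --model when omitted.
MODEL="Qwen/Qwen1.5-MoE-A2.7B-Chat"
CMD="python -m vllm.entrypoints.openai.api_server --model $MODEL --tokenizer $MODEL"
echo "$CMD"
```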
Revision arguments
--revision <revision>: the specific model version to use; can be a branch name, a tag name, or a commit ID. If unspecified, the default version is used.
--tokenizer-revision <revision>: likewise, the specific tokenizer version to use.
Tokenizer mode arguments
--tokenizer-mode {auto,slow}: controls the tokenizer mode. "auto" uses the fast tokenizer when one is available; "slow" always uses the slow tokenizer.
Security and remote-code trust arguments
--trust-remote-code: trust remote code from Hugging Face.
Download and load path arguments
--download-dir <directory>: directory to download and load the model weights from; defaults to the Hugging Face cache directory.
Weight load format arguments
--load-format {auto,pt,safetensors,npcache,dummy,tensorizer}: file format to load the model weights in.
"auto": try to load the weights in safetensors format and fall back to the PyTorch binary format if safetensors is unavailable.
"pt": load the weights in PyTorch binary format.
"safetensors": load the weights in safetensors format.
"npcache": load the weights in PyTorch format and store a numpy cache to speed up subsequent loads.
"dummy": initialize the weights with random values, mainly for profiling.
"tensorizer": load serialized weights with CoreWeave's Tensorizer model deserializer. See examples/tensorize_vllm_model.py for how to serialize a vLLM model and for more information.
Data type arguments
--dtype {auto,half,float16,bfloat16,float,float32}: data type for the model weights and activations.
"auto": pick the precision automatically from the model type (FP16 for FP32/FP16 models, BF16 for BF16 models).
"half" / "float16": FP16; recommended for AWQ quantization.
"bfloat16": a data type that balances precision and range.
"float" / "float32": FP32.
Model context length arguments
--max-model-len <length>: model context length. If unspecified, it is derived automatically from the model config.
Distributed serving and parallelism arguments
--worker-use-ray: use Ray for distributed serving; set automatically when more than one GPU is used.
--pipeline-parallel-size (-pp) <size>: number of pipeline-parallel stages.
--tensor-parallel-size (-tp) <size>: number of tensor-parallel replicas.
--max-parallel-loading-workers <workers>: load the model in sequential batches of workers, to avoid running out of RAM when loading large models with tensor parallelism.
Block size argument
--block-size {8,16,32}: size of a contiguous block of tokens (the unit in which the KV cache is allocated).
Other optimization arguments
--enable-prefix-caching: enable automatic prefix caching.
--seed <seed>: random seed.
--swap-space <size>: CPU swap space per GPU, in GiB.
--gpu-memory-utilization <fraction>: fraction of GPU memory to use, between 0 and 1.
--max-num-batched-tokens <tokens>: maximum number of batched tokens per iteration.
--max-num-seqs <sequences>: maximum number of sequences per iteration.
--max-paddings <paddings>: maximum number of paddings in a batch.
--disable-log-stats: disable logging of statistics.
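A tuning sketch combining the memory-related flags above. The command is assembled but not executed, the model path is illustrative, and the values shown are examples rather than recommendations; for comparison, the startup logs later in this post use the defaults (0.9 utilization, 4 GiB swap, 256 sequences).

```shell
# Assemble (but do not execute) a launch command with memory-related knobs:
# lower the GPU memory fraction, keep 4 GiB of CPU swap per GPU, and cap
# the number of concurrent sequences. Path and values are illustrative.
CMD="python -m vllm.entrypoints.openai.api_server --model ./my-model --gpu-memory-utilization 0.85 --swap-space 4 --max-num-seqs 128"
echo "$CMD"
```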
Quantization arguments
--quantization (-q) {awq,squeezellm,None}: method used to quantize the weights.
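Since the dtype section above recommends FP16 for AWQ, an AWQ launch would typically pair the two flags. A sketch (command assembled but not executed; the model path is illustrative):

```shell
# Assemble (but do not execute) a launch command for an AWQ-quantized model;
# --dtype half (FP16) is the recommended dtype for AWQ. Path is illustrative.
CMD="python -m vllm.entrypoints.openai.api_server --model ./my-awq-model --quantization awq --dtype half"
echo "$CMD"
```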
Additional async engine arguments
--engine-use-ray: use Ray to launch the LLM engine in a process separate from the server process.
--disable-log-requests: disable logging of requests.
--max-log-len: maximum number of prompt characters or prompt IDs printed in logs; unlimited by default.

While exploring LLaMA Factory, we watched a model being deployed this way. First startup log:
INFO 04-16 10:17:52 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-16 10:17:52 api_server.py:150] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='Qwen1___5-MoE-A2___7B', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=True, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-04-16 10:17:54,532 INFO worker.py:1752 -- Started a local Ray instance.
INFO 04-16 10:17:55 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-16 10:17:58 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 04-16 10:17:58 selector.py:25] Using XFormers backend.
INFO 04-16 10:18:07 model_runner.py:104] Loading model weights took 26.6740 GB
INFO 04-16 10:18:08 ray_gpu_executor.py:240] # GPU blocks: 14171, # CPU blocks: 1365
INFO 04-16 10:18:10 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-16 10:18:10 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-16 10:18:14 model_runner.py:867] Graph capturing finished in 4 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-16 10:18:14 serving_chat.py:331] Using default chat template:
INFO 04-16 10:18:14 serving_chat.py:331] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 04-16 10:18:14 serving_chat.py:331] You are a helpful assistant<|im_end|>
INFO 04-16 10:18:14 serving_chat.py:331] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 04-16 10:18:14 serving_chat.py:331] ' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '
INFO 04-16 10:18:14 serving_chat.py:331] '}}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant
INFO 04-16 10:18:14 serving_chat.py:331] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO: Started server process [528169]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Log walkthrough
This log comes from the application we just started on the server: the vLLM (large language model) engine, with workers managed through the Ray distributed computing framework. Some key points:
- API server version: the vLLM API server version is 0.4.0.post1.
  INFO 04-16 10:17:52 api_server.py:149] vLLM API server version 0.4.0.post1
- Startup arguments: the arguments the API server was launched with, including host, port, log level, allowed origins, methods and headers, API key, served model name, LoRA modules, chat template, response role, SSL configuration, and so on.
  INFO 04-16 10:17:52 api_server.py:150] args: Namespace(...)
- LLM engine initialization: the engine is initialized with a specific configuration, including the model path, tokenizer, revision, remote-code trust, data type, maximum sequence length, and more.
  INFO 04-16 10:17:55 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model=...
- Special tokens: special tokens were added to the vocabulary; make sure the associated word embeddings are fine-tuned or trained.
  Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
- FlashAttention package not found: FlashAttention cannot be used because it is not installed; installing it improves performance.
  INFO 04-16 10:17:58 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
- XFormers backend: without FlashAttention, the application falls back to the XFormers backend.
  INFO 04-16 10:17:58 selector.py:25] Using XFormers backend.
- Model weight loading: loading the model weights used 26.6740 GB of GPU memory.
  INFO 04-16 10:18:07 model_runner.py:104] Loading model weights took 26.6740 GB
- GPU and CPU block counts: the KV cache was allocated as 14171 GPU blocks and 1365 CPU blocks.
  INFO 04-16 10:18:08 ray_gpu_executor.py:240] # GPU blocks: 14171, # CPU blocks: 1365
- Graph capture finished: capturing the model for CUDA graphs took 4 seconds.
  INFO 04-16 10:18:14 model_runner.py:867] Graph capturing finished in 4 secs.
- Server startup: the server process starts, application startup completes, and the server is now listening on port 8000.
  INFO: Started server process [528169]
  INFO: Waiting for application startup.
  INFO: Application startup complete.
  INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
These log lines help a developer understand the server's startup, its configuration, the model-loading state, the attention backend in use, and the server's runtime status.
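With Uvicorn listening on port 8000, the server can be queried through its OpenAI-compatible chat endpoint. A request sketch, assuming the server above is still running locally; the "model" field must match --served-model-name from the launch command:

```shell
# Send a chat request to the OpenAI-compatible endpoint of the server above.
# "model" must match --served-model-name; the || branch keeps this runnable
# when no server is listening on localhost:8000.
PAYLOAD='{"model": "Qwen1___5-MoE-A2___7B", "messages": [{"role": "user", "content": "Hello"}]}'
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "server not reachable"
```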
On the first launch we found that the FlashAttention package was not installed, so we built and installed it from source:
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
pip install .
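A quick sanity check after installation: import flash_attn from Python and print its version. The fallback branch keeps the snippet runnable even on a machine where the build did not succeed.

```shell
# Verify that flash_attn is importable; the || branch keeps this runnable
# even where the package is absent or the build failed.
OUT=$(python -c "import flash_attn; print(flash_attn.__version__)" 2>/dev/null || echo "flash_attn not importable")
echo "$OUT"
```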
Second startup log
(qwen2_moe) ca2@ubuntu:~$ python -m vllm.entrypoints.openai.api_server --served-model-name Qwen1___5-MoE-A2___7B --model /home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat --worker-use-ray --tensor-parallel-size 2
INFO 04-17 05:46:14 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-17 05:46:14 api_server.py:150] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='Qwen1___5-MoE-A2___7B', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=True, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-04-17 05:46:17,103 INFO worker.py:1752 -- Started a local Ray instance.
INFO 04-17 05:46:17 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer='/home/ca2/.cache/modelscope/hub/qwen/Qwen1___5-MoE-A2___7B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-17 05:46:21 selector.py:16] Using FlashAttention backend.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:22 selector.py:16] Using FlashAttention backend.
INFO 04-17 05:46:22 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=633099) INFO 04-17 05:46:22 pynccl_utils.py:45] vLLM is using nccl==2.18.1
WARNING 04-17 05:46:23 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=633099) WARNING 04-17 05:46:23 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:28 model_runner.py:104] Loading model weights took 13.3677 GB
INFO 04-17 05:46:31 model_runner.py:104] Loading model weights took 13.3677 GB
INFO 04-17 05:46:33 ray_gpu_executor.py:240] # GPU blocks: 37540, # CPU blocks: 2730
INFO 04-17 05:46:35 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-17 05:46:35 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:35 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:35 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-17 05:46:39 model_runner.py:867] Graph capturing finished in 4 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(RayWorkerVllm pid=633099) INFO 04-17 05:46:39 model_runner.py:867] Graph capturing finished in 4 secs.
INFO 04-17 05:46:39 serving_chat.py:331] Using default chat template:
INFO 04-17 05:46:39 serving_chat.py:331] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 04-17 05:46:39 serving_chat.py:331] You are a helpful assistant<|im_end|>
INFO 04-17 05:46:39 serving_chat.py:331] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 04-17 05:46:39 serving_chat.py:331] ' + message['content']}}{% if (loop.last and add_generation_prompt) or not loop.last %}{{ '<|im_end|>' + '
INFO 04-17 05:46:39 serving_chat.py:331] '}}{% endif %}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{ '<|im_start|>assistant
INFO 04-17 05:46:39 serving_chat.py:331] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
This log again shows the vLLM API server starting up with Ray, but now with FlashAttention installed and tensor parallelism across two GPUs (--tensor-parallel-size 2). Key points:
- API server version: the vLLM API server version is 0.4.0.post1.
  INFO 04-17 05:46:14 api_server.py:149] vLLM API server version 0.4.0.post1
- Startup arguments: the same set as before, except that tensor_parallel_size is now 2.
  INFO 04-17 05:46:14 api_server.py:150] args: Namespace(...)
- LLM engine initialization: the same configuration as the first run, with tensor_parallel_size=2.
  INFO 04-17 05:46:17 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model=...
- Special tokens: special tokens were added to the vocabulary; make sure the associated word embeddings are fine-tuned or trained.
  Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
- FlashAttention backend: the FlashAttention backend is now used, in both the driver and the Ray worker.
  INFO 04-17 05:46:21 selector.py:16] Using FlashAttention backend.
- Custom all-reduce disabled: the platform lacks GPU P2P capability (or the P2P test failed), so the custom all-reduce kernel is disabled.
  WARNING 04-17 05:46:23 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed.
- Model weight loading: each of the two workers loads 13.3677 GB, i.e. the weights are sharded across the two GPUs (roughly half of the 26.6740 GB loaded on a single GPU before).
  INFO 04-17 05:46:31 model_runner.py:104] Loading model weights took 13.3677 GB
- GPU and CPU block counts: 37540 GPU blocks and 2730 CPU blocks; with the weights halved per GPU, far more memory is left for the KV cache than in the single-GPU run.
  INFO 04-17 05:46:33 ray_gpu_executor.py:240] # GPU blocks: 37540, # CPU blocks: 2730
- Graph capture finished: capturing the model for CUDA graphs again took 4 seconds.
  INFO 04-17 05:46:39 model_runner.py:867] Graph capturing finished in 4 secs.
- Server startup: as in the first run, the server process then starts and Uvicorn listens on port 8000.
Hope this content was helpful.