LLM Inference Process

1. Flow overview

Step 0  API input              messages + tools
Step 1  chat_template          → ChatML text
Step 2  tokenizer              → token ids
Step 3  LLM inference #1       → <tool_call>
Step 4  App runs the tool      → tool result
Step 5  Inject tool_response   → rebuild the prompt
Step 6  LLM inference #2       → final answer
Step 7  Return to user         content
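
The control flow of Steps 0 to 7 can be sketched at the message level. This is a minimal illustration, not the code of any particular serving stack: `fake_generate` is a hypothetical stand-in for template rendering, tokenization and inference, `get_weather` returns canned data, and the `tool_call` key and `tool` role are assumed shapes chosen for this example.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool: returns canned data instead of calling a weather service."""
    return {"temperature": "10°C", "weather": "sunny"}


def fake_generate(messages: list) -> dict:
    """Hypothetical stand-in for Steps 1-3 and Step 6 (template, tokenizer, inference)."""
    if any(m["role"] == "tool" for m in messages):
        # A tool result is already in the context, so produce the final answer.
        return {"role": "assistant", "content": "It is sunny in Beijing, about 10°C."}
    return {
        "role": "assistant",
        "content": "",
        "tool_call": {"name": "get_weather", "arguments": {"city": "Beijing"}},
    }


def run_tool_loop(messages: list, generate, tools: dict, max_rounds: int = 4) -> str:
    """Drive the model until it answers without requesting a tool."""
    for _ in range(max_rounds):
        reply = generate(messages)              # Steps 1-3 (or Step 6 on later rounds)
        messages.append(reply)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]             # Step 7: plain content goes to the user
        result = tools[call["name"]](**call["arguments"])   # Step 4
        # Step 5: feed the result back so the next round sees it.
        messages.append({"role": "tool", "content": json.dumps(result, ensure_ascii=False)})
    raise RuntimeError("no final answer within max_rounds")


MESSAGES = [{"role": "user", "content": "What's the weather like in Beijing?"}]
ANSWER = run_tool_loop(MESSAGES, fake_generate, {"get_weather": get_weather})
```

The `max_rounds` cap is a design choice for the sketch: the model decides whether to call a tool again, so the application needs its own bound on the loop.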

2. Core tokenizer_config fields

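| Field | Purpose |
| --- | --- |
| `chat_template` | ⭐ Determines the actual structure of the model input |
| `eos_token` | Token that stops generation (`<\|im_end\|>`) |
| `additional_special_tokens` | Set of ChatML control tokens |
| `model_max_length` | Upper bound on the context window |
| `add_eos_token` | Usually disabled (the template controls the end token) |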
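
As a rough illustration of how these fields can be inspected, the sketch below parses a small JSON excerpt and summarizes the five keys. The excerpt is made up for this example (the template string and the `32768` limit are placeholders, not values copied from a real model), and real files contain many more keys.

```python
import json

# Illustrative excerpt only; values are placeholders, not taken from a real model.
SAMPLE_CONFIG = """
{
  "chat_template": "{% for m in messages %}<|im_start|>{{ m.role }}\\n{{ m.content }}<|im_end|>\\n{% endfor %}",
  "eos_token": "<|im_end|>",
  "additional_special_tokens": ["<|im_start|>", "<|im_end|>"],
  "model_max_length": 32768,
  "add_eos_token": false
}
"""


def summarize(config_text: str) -> dict:
    """Pull out the fields from the table above.

    Assumes eos_token is stored as a plain string; some configs store it
    as an object instead, which this sketch does not handle.
    """
    cfg = json.loads(config_text)
    specials = cfg.get("additional_special_tokens", [])
    return {
        "has_chat_template": bool(cfg.get("chat_template")),
        "eos_token": cfg.get("eos_token"),
        "eos_is_control_token": cfg.get("eos_token") in specials,
        "model_max_length": cfg.get("model_max_length"),
        "add_eos_token": cfg.get("add_eos_token", False),
    }


SUMMARY = summarize(SAMPLE_CONFIG)
```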

3. Application input (API request)

{
  "messages": [
    {"role": "user", "content": "What's the weather like in Beijing?"}
  ],
  "tools": [
    {
      "name": "get_weather",
      "description": "Look up the weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  ]
}
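
The `parameters` object in the request is a JSON Schema fragment, and the application can reuse it later to sanity-check the arguments the model produces. The sketch below is a deliberately small hand-rolled check covering only `required` and top-level `type`; it is not a full JSON Schema validator, and `check_arguments` is a name made up for this example.

```python
# Maps JSON Schema type names to Python types for a shallow check.
# Note: bool is a subclass of int in Python, so True passes an "integer" check here.
TYPE_MAP = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}

WEATHER_SCHEMA = {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
}


def check_arguments(schema: dict, arguments: dict) -> list:
    """Return a list of human-readable problems; an empty list means the call looks fine."""
    errors = []
    for name in schema.get("required", []):
        if name not in arguments:
            errors.append(f"missing required argument: {name}")
    properties = schema.get("properties", {})
    for name, value in arguments.items():
        prop = properties.get(name)
        if prop is None:
            errors.append(f"unexpected argument: {name}")
            continue
        expected = TYPE_MAP.get(prop.get("type"))
        if expected is not None and not isinstance(value, expected):
            errors.append(f"argument {name} should be of type {prop['type']}")
    return errors
```

For example, `check_arguments(WEATHER_SCHEMA, {"city": "Beijing"})` returns an empty list, while an empty arguments object reports the missing `city`.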

4. First inference input (after chat_template expansion)

<|im_start|>system
# Tools
<tools>
{"name":"get_weather","description":"Look up the weather for a city","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}
</tools>
For each function call, return a json within <tool_call></tool_call>
<|im_end|>
<|im_start|>user
What's the weather like in Beijing?
<|im_end|>
<|im_start|>assistant
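
To make the expansion concrete, here is a plain-Python sketch that produces text in the layout shown above. It only imitates the listing: a real `chat_template` is a Jinja template shipped with the tokenizer, and its exact wording and line breaks may differ. `render_chatml` and `WEATHER_TOOL` are names invented for this example.

```python
import json


def render_chatml(messages, tools=None, add_generation_prompt=True):
    """Render messages (and optional tools) as ChatML text, following the listing above."""
    parts = []
    if tools:
        # One compact JSON object per line inside <tools>...</tools>.
        tool_lines = "\n".join(
            json.dumps(tool, ensure_ascii=False, separators=(",", ":")) for tool in tools
        )
        parts.append(
            "<|im_start|>system\n# Tools\n<tools>\n"
            f"{tool_lines}\n</tools>\n"
            "For each function call, return a json within <tool_call></tool_call>\n"
            "<|im_end|>\n"
        )
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}\n<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Look up the weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

PROMPT = render_chatml(
    [{"role": "user", "content": "What's the weather like in Beijing?"}],
    tools=[WEATHER_TOOL],
)
```

The open `<|im_start|>assistant` line at the end is what turns the rendered conversation into a prompt: the model's next tokens become the assistant turn.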

5. First inference output

<tool_call>
{"name":"get_weather","arguments":{"city":"Beijing"}}
</tool_call>
<|im_end|>          ← vLLM detects EOS and stops
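
The model output is just text, so something outside the model has to pull the call out of the `<tool_call>` tags. A minimal sketch of that extraction, assuming the JSON body carries `name` and `arguments` as in the listing above (`parse_tool_calls` is a name made up here; serving frameworks ship their own parsers):

```python
import json
import re

# Non-greedy match so several <tool_call> blocks in one output are found separately.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list:
    """Extract every well-formed tool call from generated text."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # the model produced broken JSON; skip this block
        if isinstance(call, dict) and "name" in call:
            calls.append({"name": call["name"], "arguments": call.get("arguments", {})})
    return calls


OUTPUT = '<tool_call>\n{"name":"get_weather","arguments":{"city":"Beijing"}}\n</tool_call>'
CALLS = parse_tool_calls(OUTPUT)
```

An empty result means the output is an ordinary answer and can be returned to the user as is.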

6. The application executes the tool and gets the result

{"temperature": "10°C", "weather": "sunny"}
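
Execution happens entirely in application code. One common way to organize it, sketched here under the assumption that each tool is a plain Python function, is a registry keyed by tool name. `get_weather` returns canned data rather than calling a real service, and errors are returned as JSON so the model can see them on the next round.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool: canned data instead of a real weather lookup."""
    return {"temperature": "10°C", "weather": "sunny"}


TOOL_REGISTRY = {"get_weather": get_weather}


def execute_tool(call: dict) -> str:
    """Run the named tool and return its result serialized as JSON text."""
    func = TOOL_REGISTRY.get(call.get("name"))
    if func is None:
        result = {"error": f"unknown tool: {call.get('name')}"}
    else:
        try:
            result = func(**call.get("arguments", {}))
        except Exception as exc:  # report failures to the model instead of crashing
            result = {"error": str(exc)}
    # ensure_ascii=False keeps characters such as "°" readable in the prompt.
    return json.dumps(result, ensure_ascii=False)


RESULT = execute_tool({"name": "get_weather", "arguments": {"city": "Beijing"}})
```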

7. Second inference input (tool_response appended)

...(same as above)...
<|im_start|>assistant
<tool_call>
{"name":"get_weather","arguments":{"city":"Beijing"}}
</tool_call>
<|im_end|>
<|im_start|>user
<tool_response>
{"temperature":"10°C","weather":"sunny"}
</tool_response>
<|im_end|>
<|im_start|>assistant
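
The second prompt is the first prompt plus two closed turns and a newly opened assistant turn. The sketch below builds it by string concatenation purely to make the layout explicit; `append_tool_round` is a made-up helper. Applications that keep a message list would more typically append the assistant and tool-result messages and let `chat_template` render them again.

```python
def append_tool_round(prompt: str, tool_call_text: str, tool_result_json: str) -> str:
    """Close the assistant's tool-call turn, add the tool result, reopen the assistant turn."""
    return (
        prompt
        + tool_call_text.strip() + "\n<|im_end|>\n"        # assistant turn from round 1
        + "<|im_start|>user\n<tool_response>\n"             # result goes in as a user turn
        + tool_result_json + "\n</tool_response>\n<|im_end|>\n"
        + "<|im_start|>assistant\n"                         # generation prompt for round 2
    )


FIRST_PROMPT = (
    "<|im_start|>user\nWhat's the weather like in Beijing?\n<|im_end|>\n"
    "<|im_start|>assistant\n"
)
TOOL_CALL = '<tool_call>\n{"name":"get_weather","arguments":{"city":"Beijing"}}\n</tool_call>'
SECOND_PROMPT = append_tool_round(
    FIRST_PROMPT, TOOL_CALL, '{"temperature":"10°C","weather":"sunny"}'
)
```

Because the first prompt is an exact prefix of the second, nothing the model saw in round 1 changes; only new turns are added.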

8. Second inference output

It is currently sunny in Beijing, with a temperature of about 10°C. Good weather for going out.
<|im_end|>          ← EOS, return to user
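
Stopping at EOS is a property of the decoding loop, not of the model: the model keeps predicting one token at a time, and the engine ends the sequence when the predicted token is the EOS id. A toy sketch of that loop follows. It is not vLLM's implementation; `EOS_ID = 0` is an arbitrary id chosen for the example, and `scripted_model` is a hypothetical stand-in that replays a fixed token sequence.

```python
from typing import Callable, List

EOS_ID = 0  # arbitrary id standing in for <|im_end|> in this toy example


def generate(
    next_token: Callable[[List[int]], int],
    prompt_ids: List[int],
    max_new_tokens: int = 64,
) -> List[int]:
    """Greedy-style decoding loop that stops at EOS or at the token budget."""
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(prompt_ids + generated)  # the model only predicts a token
        if token == EOS_ID:
            break                                   # the engine decides to stop
        generated.append(token)
    return generated


def scripted_model(script: List[int]) -> Callable[[List[int]], int]:
    """Hypothetical model that ignores its input and replays a fixed script."""
    tokens = iter(script)
    return lambda ids: next(tokens)
```

With the script `[5, 6, 7, EOS_ID, 8]`, the loop returns `[5, 6, 7]`: the token after EOS is never requested.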

9. The whole mechanism in five lines

chat_template  →  determines the structure
tokenizer      →  turns text into tokens
Qwen3          →  only predicts tokens
vLLM           →  stops at EOS
tool           →  driven by the external application loop