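Dual RTX 3090 Multi-GPU Inference Deep Dive: From Tensor Parallelism to the vLLM Source Code

Based on hands-on deployment of vLLM with Qwen2.5-14B, this article systematically covers Tensor Parallelism, NCCL communication, GPU memory budgeting, and the vLLM multi-GPU scheduling source code. It is aimed at engineers who already know the basics of GPU inference.

1. Tensor Parallelism Fundamentals

1.1 Core Idea

Tensor Parallelism (TP) is an intra-layer parallelism strategy: the weight matrices of every Transformer layer are split by column or by row across multiple GPUs, each GPU computes only its slice, and collective communication combines the results.

Unlike Pipeline Parallelism (inter-layer pipelining), TP splits at a finer granularity. Instead of "you compute the first 20 layers and I compute the last 20", it is "we each compute half of every layer".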

graph TB
    subgraph PP["Pipeline Parallelism -- inter-layer split"]
        direction LR
        PP_IN(["Input"]) --> PP_G0
        subgraph PP_G0["GPU 0: first half of the model"]
            PP_L0["Layer 0"] --> PP_L1["Layer 1"] --> PP_D1["..."] --> PP_L23["Layer 23"]
        end
        PP_G0 -->|"send intermediate activations<br/>only 1 transfer"| PP_G1
        subgraph PP_G1["GPU 1: second half of the model"]
            PP_L24["Layer 24"] --> PP_L25["Layer 25"] --> PP_D2["..."] --> PP_L47["Layer 47"]
        end
        PP_G1 --> PP_OUT(["Output"])
    end

    subgraph TP["Tensor Parallelism -- intra-layer split"]
        direction LR
        TP_IN(["Input"]) --> TP_LAYER
        subgraph TP_LAYER["Every layer, Layer 0 ~ 47"]
            direction TB
            subgraph TP_G0["GPU 0"]
                TP_W0["left half of weights"]
            end
            subgraph TP_G1["GPU 1"]
                TP_W1["right half of weights"]
            end
            TP_G0 <-->|"AllReduce x2 per layer<br/>96 per forward"| TP_G1
        end
        TP_LAYER --> TP_OUT(["Output"])
    end

    classDef gpu0 fill:#1f6feb,stroke:#58a6ff,color:#fff
    classDef gpu1 fill:#238636,stroke:#3fb950,color:#fff
    classDef io fill:#9e6a03,stroke:#d29922,color:#fff

    class PP_L0,PP_L1,PP_D1,PP_L23,TP_W0 gpu0
    class PP_L24,PP_L25,PP_D2,PP_L47,TP_W1 gpu1
    class PP_IN,PP_OUT,TP_IN,TP_OUT io

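Dimension	Pipeline Parallelism	Tensor Parallelism
Split granularity	By layer (inter-layer)	By matrix (intra-layer)
Communication count	1 per forward (2 stages)	2L per forward (L = number of layers)
Load balance	May be uneven	Balanced by construction
Main problem	Pipeline bubbles (GPUs idle while waiting)	Communication overhead (limited by interconnect bandwidth)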

1.2 Column Parallel and Row Parallel

The core computation in a Transformer is the linear map Y = XA. TP splits the weight matrix A in one of two ways:

graph TD
    subgraph COL["Column Parallel -- split by column"]
        direction TB
        CX["X full input [b,s,hidden]"]
        CX --> CG0 & CG1
        subgraph CG0["GPU 0"]
            CA1["W left half of columns [hidden, out/2]"]
            CY1["Y1 = X * W_left"]
        end
        subgraph CG1["GPU 1"]
            CA2["W right half of columns [hidden, out/2]"]
            CY2["Y2 = X * W_right"]
        end
        CY1 & CY2 -->|"AllGather concat"| COUT["Y = concat(Y1, Y2)"]
    end

    subgraph ROW["Row Parallel -- split by row"]
        direction TB
        RX["X sharded input (from the preceding Column output)"]
        RX -->|"X1 first half"| RG0
        RX -->|"X2 second half"| RG1
        subgraph RG0["GPU 0"]
            RA1["W top half of rows [hidden/2, out]"]
            RY1["Y1 = X1 * W_top (partial sum)"]
        end
        subgraph RG1["GPU 1"]
            RA2["W bottom half of rows [hidden/2, out]"]
            RY2["Y2 = X2 * W_bot (partial sum)"]
        end
        RY1 & RY2 -->|"AllReduce sum"| ROUT["Y = Y1 + Y2 (complete)"]
    end

    classDef g0 fill:#1f6feb,stroke:#58a6ff,color:#fff
    classDef g1 fill:#238636,stroke:#3fb950,color:#fff
    classDef data fill:#9e6a03,stroke:#d29922,color:#fff
    classDef out fill:#8957e5,stroke:#bc8cff,color:#fff

    class CA1,CY1,RA1,RY1 g0
    class CA2,CY2,RA2,RY2 g1
    class CX,RX data
    class COUT,ROUT out

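Key design point: Column Parallel and Row Parallel are used as a pair. The sharded output of the Column layer feeds directly into the Row layer as its sharded input, with no communication in between; a single AllReduce is needed only on the final Row Parallel output.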
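
The pairing can be checked numerically. The following is a minimal pure-Python sketch, not vLLM or Megatron code: the helper names (`matmul`, `split_cols`, `split_rows`, `relu`, `add`) and the tiny integer matrices are made up for illustration, and `add` stands in for the AllReduce across two GPUs. An elementwise activation is placed between the two layers to show that it can be applied shard by shard.

```python
# Pure-Python sketch of a 2-way Column Parallel layer followed by a Row Parallel layer.
# All helper names and matrices here are illustrative assumptions, not library APIs.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_cols(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]

def relu(m):
    """Elementwise activation; it acts on each column shard independently."""
    return [[max(0, v) for v in row] for row in m]

def add(a, b):
    """Elementwise sum, standing in for AllReduce(sum) across 2 GPUs."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

X = [[1, 2], [3, 4]]                      # input, replicated on both "GPUs"
A = [[1, -1, 2, 1], [0, 1, -1, -3]]       # first weight  [2, 4], split by column
B = [[1, 2], [0, 1], [2, 0], [1, 1]]      # second weight [4, 2], split by row

A_shards = split_cols(A, 2)               # each GPU holds 2 of the 4 columns
B_shards = split_rows(B, 2)               # each GPU holds 2 of the 4 rows

# Each GPU runs column-parallel matmul, activation, row-parallel matmul locally.
partials = [matmul(relu(matmul(X, A_shards[r])), B_shards[r]) for r in range(2)]

tp_result = add(partials[0], partials[1])  # the only communication step
reference = matmul(relu(matmul(X, A)), B)  # unsharded computation

print(tp_result == reference)
```

The partial results on each GPU differ from the final answer; only their sum matches the unsharded computation, which is why the AllReduce cannot be skipped.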

1.3 How a Transformer Block Is Actually Split (Megatron-LM Scheme)

Each Transformer Block has two sublayers, Attention and MLP, and each needs one AllReduce:

graph TD
    INPUT["Input x (full copy on each GPU)"]

    subgraph ATTN["Self-Attention"]
        direction TB
        ATTN_COL["Column Parallel: Wq Wk Wv<br/>GPU0: head 0~19 | GPU1: head 20~39<br/>(split by head, no communication)"]
        ATTN_COMP["Each GPU computes attention independently<br/>Softmax(QK^T / sqrt d) * V"]
        ATTN_ROW["Row Parallel: Wo<br/>GPU0: Wo_top | GPU1: Wo_bot"]
        ATTN_AR{{"AllReduce #1 (sum)"}}
        ATTN_COL --> ATTN_COMP --> ATTN_ROW --> ATTN_AR
    end

    LN1["Add & LayerNorm (no communication)"]

    subgraph MLP["MLP"]
        direction TB
        MLP_COL["Column Parallel: W_gate, W_up<br/>GPU0: left half of columns | GPU1: right half of columns<br/>(no communication)"]
        MLP_ACT["GeLU / SiLU activation (elementwise, no communication)"]
        MLP_ROW["Row Parallel: W_down<br/>GPU0: W_down_top | GPU1: W_down_bot"]
        MLP_AR{{"AllReduce #2 (sum)"}}
        MLP_COL --> MLP_ACT --> MLP_ROW --> MLP_AR
    end

    LN2["Add & LayerNorm (no communication)"]
    OUTPUT["Output x (full copy on each GPU)"]

    INPUT --> ATTN
    ATTN_AR --> LN1 --> MLP
    MLP_AR --> LN2 --> OUTPUT

    classDef nocomm fill:#238636,stroke:#3fb950,color:#fff
    classDef comm fill:#da3633,stroke:#f85149,color:#fff
    classDef norm fill:#30363d,stroke:#8b949e,color:#c9d1d9
    classDef io fill:#9e6a03,stroke:#d29922,color:#fff

    class ATTN_COL,ATTN_COMP,MLP_COL,MLP_ACT nocomm
    class ATTN_AR,MLP_AR comm
    class LN1,LN2 norm
    class INPUT,OUTPUT io

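Communication volume (using Qwen2.5-14B as the example; each payload is batch * tokens * hidden_dim * 2 bytes for FP16):

Item	Value
Number of layers L	48
AllReduce calls per layer	2 (Attention + MLP)
Total AllReduce calls per forward	96
hidden_dim	5120
Decode payload per AllReduce	1 * 1 * 5120 * 2 B = 10 KB
Prefill (2048 tokens) payload per AllReduce	1 * 2048 * 5120 * 2 B = 20 MB
Decode total communication per step	96 * 10 KB = 960 KB (latency-dominated)
Prefill total communication	96 * 20 MB = 1.92 GB (bandwidth-dominated)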
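
The arithmetic in this table can be reproduced with a few lines of Python. This is only a sketch of the calculation: the constant and function names are chosen here for illustration, and sizes are reported in binary units (KiB, MiB), which is how the 10 KB and 20 MB figures come out as round numbers.

```python
# Sketch: AllReduce payload sizes for TP inference with the Qwen2.5-14B numbers above.
# Names are illustrative; only the constants come from the table.
NUM_LAYERS = 48
ALLREDUCE_PER_LAYER = 2      # one after Attention, one after MLP
HIDDEN_DIM = 5120
BYTES_PER_ELEM = 2           # FP16

def allreduce_bytes(batch, tokens):
    """Bytes carried by a single AllReduce of a [batch, tokens, hidden] tensor."""
    return batch * tokens * HIDDEN_DIM * BYTES_PER_ELEM

def forward_bytes(batch, tokens):
    """Bytes carried by all AllReduce calls in one forward pass."""
    return NUM_LAYERS * ALLREDUCE_PER_LAYER * allreduce_bytes(batch, tokens)

decode_kib = allreduce_bytes(1, 1) / 1024
prefill_mib = allreduce_bytes(1, 2048) / 1024**2
decode_total_kib = forward_bytes(1, 1) / 1024
prefill_total_mib = forward_bytes(1, 2048) / 1024**2

print(f"decode:  {decode_kib:.0f} KiB per AllReduce, {decode_total_kib:.0f} KiB per step")
print(f"prefill: {prefill_mib:.0f} MiB per AllReduce, {prefill_total_mib:.0f} MiB per forward")
```

The three orders of magnitude between the decode and prefill payloads are what make decode sensitive to per-call latency and prefill sensitive to link bandwidth.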

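1.4 Why TP=2 Is the Best Fit for Consumer Hardware

TP size	Communication overhead	When to use
TP=1	None	Best choice when the model fits on a single GPU
TP=2	Moderate	Most common; latency over 2-GPU NVLink/PCIe stays manageable
TP=4	Large	Needs fully connected NVLink, otherwise PCIe becomes the bottleneck
TP=8	Very large	Only suited to 8-GPU NVLink systems such as A100/H100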

2. NCCL Communication Primitives

NCCL (NVIDIA Collective Communications Library) is the communication foundation of Tensor Parallelism.

2.1 Five Core Operations

graph LR
    subgraph BC["Broadcast"]
        direction TB
        BCB["GPU0: A B C D<br/>GPU1: _ _ _ _"]
        BCA["GPU0: A B C D<br/>GPU1: A B C D"]
        BCB -->|"1 to all"| BCA
    end

    subgraph RD["Reduce"]
        direction TB
        RDB["GPU0: 1 2 3 4<br/>GPU1: 5 6 7 8"]
        RDA["GPU0: 6 8 10 12<br/>GPU1: unchanged"]
        RDB -->|"sum to root"| RDA
    end

    subgraph AR["AllReduce"]
        direction TB
        ARB["GPU0: 1 2 3 4<br/>GPU1: 5 6 7 8"]
        ARA["GPU0: 6 8 10 12<br/>GPU1: 6 8 10 12"]
        ARB -->|"sum to all"| ARA
    end

    subgraph AG["AllGather"]
        direction TB
        AGB["GPU0: A B<br/>GPU1: C D"]
        AGA["GPU0: A B C D<br/>GPU1: A B C D"]
        AGB -->|"concat all"| AGA
    end

    subgraph RS["ReduceScatter"]
        direction TB
        RSB["GPU0: 1 2 3 4<br/>GPU1: 5 6 7 8"]
        RSA["GPU0: 6 8<br/>GPU1: 10 12"]
        RSB -->|"sum+scatter"| RSA
    end

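How often each operation is used in TP inference:

Operation	Purpose	Frequency
AllReduce	Sum the Row Parallel outputs	Highest (2 per layer)
AllGather	Concatenate Column Parallel outputs	Medium
Broadcast	Broadcast weights at model initialization	Startup only
ReduceScatter	Building block of AllReduce	Internal
Reduce	Gather gradients (training)	Not used in inference

Key relationship: AllReduce = ReduceScatter + AllGather. Ring AllReduce in NCCL is built from exactly this decomposition.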

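2.2 Ring AllReduce Algorithm

NCCL commonly uses the Ring algorithm (it can also select others, such as Tree, depending on topology and message size). Taking 4 GPUs as an example: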

graph TD
    subgraph INIT["Initial: 4 GPUs form a ring"]
        direction LR
        I0["GPU0: a0 b0 c0 d0"] -->|ring| I1["GPU1: a1 b1 c1 d1"]
        I1 -->|ring| I2["GPU2: a2 b2 c2 d2"]
        I2 -->|ring| I3["GPU3: a3 b3 c3 d3"]
        I3 -->|ring| I0
    end

    subgraph P1["Phase 1: ReduceScatter (K-1=3 steps)"]
        direction TB
        RS1["Step 1-3: pass around the ring and accumulate<br/>each step, every GPU sends 1 chunk to the next and accumulates the 1 chunk it receives"]
        RS_R["Result: each GPU holds 1 fully reduced chunk<br/>GPU0: SUM_a | GPU1: SUM_b | GPU2: SUM_c | GPU3: SUM_d"]
        RS1 --> RS_R
    end

    subgraph P2["Phase 2: AllGather (K-1=3 steps)"]
        direction TB
        AG1["Step 4-6: pass complete chunks around the ring<br/>each step, every GPU sends a complete chunk it already holds to the next"]
        AG_R["Result: every GPU holds all reduced chunks<br/>All GPUs: SUM_a SUM_b SUM_c SUM_d"]
        AG1 --> AG_R
    end

    INIT --> P1 --> P2

    classDef p1 fill:#1f6feb,stroke:#58a6ff,color:#fff
    classDef p2 fill:#238636,stroke:#3fb950,color:#fff
    classDef init fill:#9e6a03,stroke:#d29922,color:#fff
    classDef final fill:#8957e5,stroke:#bc8cff,color:#fff

    class I0,I1,I2,I3 init
    class RS1 p1
    class AG1 p2
    class RS_R,AG_R final

Communication volume formula: data sent per GPU = 2 * (K-1)/K * N, where K is the number of GPUs and N is the data size. With K=2, each GPU sends N.
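
To make the two phases and the volume formula concrete, here is a small single-process simulation of Ring AllReduce. It is a sketch under stated assumptions, not NCCL's implementation: GPUs are modelled as Python lists, the function name `ring_allreduce` is invented here, the vector length must be divisible by K, and which rank ends up owning which chunk after Phase 1 depends on the indexing convention (this sketch leaves rank r with chunk (r + 1) mod K).

```python
# Single-process simulation of Ring AllReduce (ReduceScatter + AllGather).
# Illustrative only; assumes len(buffer) is divisible by the number of ranks.

def ring_allreduce(buffers):
    k = len(buffers)
    chunk = len(buffers[0]) // k
    data = [list(b) for b in buffers]   # per-rank working copy
    sent = [0] * k                      # elements sent by each rank

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    def exchange(chunk_of, combine):
        # Snapshot every outgoing chunk first so all sends in a step are simultaneous.
        outgoing = [(chunk_of(r), [data[r][i] for i in span(chunk_of(r))]) for r in range(k)]
        for r, (c, payload) in enumerate(outgoing):
            dst = (r + 1) % k
            sent[r] += chunk
            for i, v in zip(span(c), payload):
                data[dst][i] = combine(data[dst][i], v)

    # Phase 1: ReduceScatter, K-1 steps; receivers accumulate.
    for step in range(k - 1):
        exchange(lambda r: (r - step) % k, lambda old, new: old + new)

    # Phase 2: AllGather, K-1 steps; receivers overwrite with the complete chunk.
    for step in range(k - 1):
        exchange(lambda r: (r + 1 - step) % k, lambda old, new: new)

    return data, sent

result4, sent4 = ring_allreduce([
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
])
result2, sent2 = ring_allreduce([[1, 2, 3, 4], [5, 6, 7, 8]])

print(result4[0], sent4)   # every rank holds the same sums
print(result2[0], sent2)
```

With K=4 and N=4 elements, each rank sends 2 * 3/4 * 4 = 6 elements; with K=2 each rank sends exactly N = 4 elements, matching the formula.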

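2.3 Communication Pattern in TP Inference

Data flow and communication points within each Transformer Block:

Input x (full copy on each GPU)
 |
 +---> Self-Attention
 |       Wq,Wk,Wv: Column Parallel (no communication, split by head)
 |       Each GPU computes attention for its own heads
 |       Wo: Row Parallel --> AllReduce  [communication point 1]
 |
 +---> Add & LayerNorm (elementwise, no communication)
 |
 +---> MLP
 |       W_gate, W_up: Column Parallel (no communication)
 |       GeLU/SiLU: elementwise (no communication)
 |       W_down: Row Parallel --> AllReduce  [communication point 2]
 |
 +---> Add & LayerNorm (elementwise, no communication)
 |
Output x (full copy on each GPU)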

3. How the PCIe vs NVLink Bandwidth Gap Affects Inference Performance

3.1 Bandwidth Comparison

Interconnect	Bidirectional bandwidth	Relative to PCIe 3.0
PCIe 3.0 x16	32 GB/s	1x
PCIe 4.0 x16	64 GB/s	2x
NVLink (3090)	112.5 GB/s	3.5x
PCIe 5.0 x16	128 GB/s	4x
NVLink (A100)	600 GB/s	18.75x
NVLink (H100)	900 GB/s	28x

NVLink on the RTX 3090 is about 1.75x faster than PCIe 4.0 x16, but it requires a separately purchased NVLink Bridge.

3.2 Interconnect Topology

graph LR
    subgraph PCIE["PCIe interconnect"]
        direction TB
        CPU_P["CPU"]
        PS["PCIe Switch"]
        PG0["GPU 0"]
        PG1["GPU 1"]
        CPU_P <-->|"PCIe 4.0"| PS
        PS <-->|"32 GB/s per direction"| PG0
        PS <-->|"32 GB/s per direction"| PG1
    end

    subgraph NVL["NVLink interconnect"]
        direction TB
        CPU_N["CPU"]
        NG0["GPU 0"]
        NG1["GPU 1"]
        CPU_N <-->|"PCIe (H2D)"| NG0
        CPU_N <-->|"PCIe (H2D)"| NG1
        NG0 <-->|"NVLink Bridge<br/>56.25 GB/s per direction<br/>direct GPU-to-GPU, bypasses the CPU"| NG1
    end

    classDef cpu fill:#9e6a03,stroke:#d29922,color:#fff
    classDef pcie fill:#1f6feb,stroke:#58a6ff,color:#fff
    classDef nvl fill:#238636,stroke:#3fb950,color:#fff

    class CPU_P,CPU_N cpu
    class PG0,PG1 pcie
    class NG0,NG1 nvl

3.3 Different Impact on the Two Inference Phases

graph TD
    subgraph PREFILL["Prefill phase (compute-intensive)"]
        direction TB
        PF_IN["Process the full prompt (2048 tokens)"]
        PF_C["Many GEMM matrix multiplications ~50ms"]
        PF_P["PCIe AllReduce ~3ms"]
        PF_N["NVLink AllReduce ~1.7ms"]
        PF_R["Communication share: PCIe 5.6% vs NVLink 3.3%<br/>Conclusion: compute-bound, PCIe is sufficient"]
        PF_IN --> PF_C --> PF_P & PF_N --> PF_R
    end

    subgraph DECODE["Decode phase (memory-access-intensive)"]
        direction TB
        DC_IN["Generate 1 token per step"]
        DC_C["Light compute, mostly reading GPU memory ~2ms"]
        DC_P["PCIe AllReduce ~0.3ms"]
        DC_N["NVLink AllReduce ~0.15ms"]
        DC_R["Communication share: PCIe 13% vs NVLink 7%<br/>Conclusion: memory-bound, NVLink advantage is significant"]
        DC_IN --> DC_C --> DC_P & DC_N --> DC_R
    end

    classDef compute fill:#238636,stroke:#3fb950,color:#fff
    classDef pcie fill:#1f6feb,stroke:#58a6ff,color:#fff
    classDef nvlink fill:#8957e5,stroke:#bc8cff,color:#fff
    classDef warn fill:#da3633,stroke:#f85149,color:#fff

    class PF_C,DC_C compute
    class PF_P,DC_P pcie
    class PF_N,DC_N nvlink
    class DC_R warn
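
The communication shares in the diagram follow from communication time / (compute time + communication time). The sketch below recomputes them from the approximate timings shown above; the dictionary layout and the function name `comm_share` are illustrative choices, and the results agree with the diagram to within rounding.

```python
# Sketch: share of step time spent in AllReduce, from the approximate timings above.
# The timings are the diagram's rough figures, not new measurements.

def comm_share(compute_ms, comm_ms):
    """Fraction of total step time spent communicating."""
    return comm_ms / (compute_ms + comm_ms)

timings_ms = {
    "prefill": {"compute": 50.0, "pcie": 3.0, "nvlink": 1.7},
    "decode": {"compute": 2.0, "pcie": 0.3, "nvlink": 0.15},
}

shares = {
    phase: {
        link: comm_share(t["compute"], t[link])
        for link in ("pcie", "nvlink")
    }
    for phase, t in timings_ms.items()
}

for phase, by_link in shares.items():
    print(f"{phase}: PCIe {by_link['pcie']:.1%}, NVLink {by_link['nvlink']:.1%}")
```

The absolute time saved by NVLink is larger in prefill, but the relative share it removes is larger in decode, which is the phase that repeats for every generated token.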

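3.4 Measured Performance Reference (Dual RTX 3090)

Configuration	Prefill throughput	Decode throughput	TTFT	TPOT
Single GPU AWQ	baseline	baseline	baseline	baseline
Dual GPU FP16 PCIe	+80%	+40%	-30%	-25%
Dual GPU FP16 NVLink	+90%	+70%	-35%	-40%

NVLink helps most in the decode phase (from +40% to +70%), because communication latency makes up a larger share of each decode step.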

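3.5 How to Check and Optimize

# Show the GPU interconnect topology (NV# = NVLink, PHB = PCIe through the host bridge, SYS = crossing NUMA nodes)
nvidia-smi topo -m

# NCCL environment variables for PCIe setups (0 is already the default for both)
export NCCL_SHM_DISABLE=0    # keep the shared-memory transport enabled
export NCCL_P2P_DISABLE=0    # keep P2P direct transfers enabled

# Inspect the NUMA layout; ideally both GPUs sit on the same NUMA node
numactl --hardware

# Raise concurrency to improve the compute/communication ratio
vllm serve ... --max-num-seqs 16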

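4. GPU Memory Calculation for Multi-GPU Inference

4.1 The Four Components of GPU Memory

Total GPU memory = model weights + KV Cache + activations + system overhead

4.2 Item-by-Item Calculation (Qwen2.5-14B)

Model weights

Parameter count: 14.7B

FP16:  14.7B * 2 bytes = 29.4 GB  --> TP=2: 14.7 GB per GPU
AWQ:   14.7B * 0.5 bytes = 7.35 GB --> TP=2: 3.68 GB per GPU

KV Cache (the largest dynamic consumer of GPU memory during inference)

KV Cache per token = 2 (K,V) * 48 (layers) * 8 (KV heads, GQA) * 128 (head_dim) * 2 (bytes, FP16)
                   = 192 KB / token

With TP=2 the KV heads are split across GPUs (4 per GPU): 96 KB per token per GPU

max_model_len	KV Cache / GPU (one sequence)	Scenario
2,048	192 MB	Short chat
8,192	768 MB	Medium length
16,384	1.5 GB	Long documents
32,768	3.0 GB	Very long context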
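
The per-token figure and the table rows can be recomputed directly. This is a minimal sketch of that arithmetic, with function names invented for illustration and sizes in binary units (KiB, MiB), which is what makes 192 KB and 192 MB come out exact; it assumes the KV heads divide evenly across the TP ranks.

```python
# Sketch: KV Cache size per token and per sequence for Qwen2.5-14B (FP16 KV).
# Function names are illustrative; constants come from the formula above.
NUM_LAYERS = 48
NUM_KV_HEADS = 8      # GQA
HEAD_DIM = 128
BYTES_PER_ELEM = 2    # FP16

def kv_bytes_per_token(tp_size=1):
    """Bytes of K and V stored per token on one GPU."""
    heads_per_gpu = NUM_KV_HEADS // tp_size   # assumes an even split
    return 2 * NUM_LAYERS * heads_per_gpu * HEAD_DIM * BYTES_PER_ELEM

def kv_cache_mib(max_model_len, tp_size=2):
    """KV Cache per GPU, in MiB, for one sequence of max_model_len tokens."""
    return max_model_len * kv_bytes_per_token(tp_size) / 1024**2

print(kv_bytes_per_token(1) // 1024, "KiB per token (single GPU)")
print(kv_bytes_per_token(2) // 1024, "KiB per token per GPU (TP=2)")
for length in (2048, 8192, 16384, 32768):
    print(f"{length:>6} tokens -> {kv_cache_mib(length):.0f} MiB per GPU")
```

Because the size is linear in both sequence length and the number of concurrent sequences, the same function can be reused to budget for batches by multiplying by the expected concurrency.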

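Activations

Inference needs no backward pass, so activations are small:
  Prefill: ~120 MB per GPU (rough estimate)
  Decode:  ~60 KB per GPU (negligible)

System overhead

CUDA Context:      300-500 MB
NCCL Buffer:       200-400 MB
CUDA Graph:        200-500 MB
PyTorch Allocator: 100-200 MB
Total: ~0.8-1.6 GB per GPU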

4.3 Full GPU Memory Budget

Using Qwen2.5-14B FP16, TP=2, max_model_len=16384 as the example:

Per-GPU memory budget (RTX 3090 = 24 GB):

+------------------------------------+------------+--------+
| Item                               | Memory/GPU | Share  |
+------------------------------------+------------+--------+
| Model weights (FP16, TP split)     | 14.7 GB    | 61.3%  |
| KV Cache (16K tokens)              | 1.5 GB     | 6.3%   |
| Activations                        | 0.1 GB     | 0.4%   |
| System overhead (CUDA/NCCL)        | 1.2 GB     | 5.0%   |
+------------------------------------+------------+--------+
| Total                              | 17.5 GB    | 73.0%  |
| Remaining (for more concurrent KV) | 6.5 GB     | 27.0%  |
+------------------------------------+------------+--------+

Remaining 6.5 GB / 96 KB per token = ~71,000 tokens of additional KV
= roughly 4 concurrent sequences with a 16K context
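
The budget can be turned into a small calculator. The sketch below is one way to do it, assuming GB and KB are read as binary units (GiB, KiB) to stay consistent with the KV Cache figures earlier in this section; the variable names are illustrative and the inputs are the estimates from the table, not measurements.

```python
# Sketch: per-GPU memory budget for Qwen2.5-14B FP16, TP=2, on a 24 GB RTX 3090.
# Inputs are the estimates from the table above; GB/KB are treated as GiB/KiB.
TOTAL_GB = 24.0
KV_KB_PER_TOKEN = 96          # per GPU with TP=2
CONTEXT_TOKENS = 16384

budget_gb = {
    "weights": 14.7,
    "kv_cache_16k": 1.5,
    "activations": 0.1,
    "system": 1.2,
}

used_gb = sum(budget_gb.values())
free_gb = TOTAL_GB - used_gb

# Convert the free space to KiB, then to tokens of extra KV Cache.
extra_tokens = int(free_gb * 1024 * 1024 / KV_KB_PER_TOKEN)
extra_sequences = extra_tokens // CONTEXT_TOKENS

print(f"used {used_gb:.1f} GB ({used_gb / TOTAL_GB:.1%}), free {free_gb:.1f} GB")
print(f"extra KV capacity: {extra_tokens} tokens, about {extra_sequences} more 16K sequences")
```

Changing `budget_gb["weights"]` to the AWQ figure, or `CONTEXT_TOKENS` to a different max_model_len, gives a quick first estimate for other configurations before launching the server.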

4.4 Comparison of Four Options

                          Per-GPU memory layout (24 GB)

  FP16 1 GPU        FP16 TP=2         AWQ 1 GPU         AWQ TP=2
  (OOM!)            (recommended)     (best value)      (peak perf.)

  +---------+       +---------+       +---------+       +---------+
  |         |       |  Free   |       |         |       |         |
  | Weights |       | 6.5 GB  |       |  Free   |       |  Free   |
  | 29.4 GB |       +---------+       | 14.5 GB |       | 18.7 GB |
  |         |       | KV 1.5  |       |         |       |         |
  | Exceeds |       +---------+       +---------+       |         |
  | 24 GB!  |       | Weights |       | Weights |       +---------+
  |         |       | 14.7 GB |       | 7.35 GB |       | Weights |
  |         |       |         |       |         |       | 3.68 GB |
  +---------+       +---------+       +---------+       +---------+
                    |Sys 1.2GB|       |Sys 1.2GB|       |Sys 1.2GB|
                    +---------+       +---------+       +---------+
  30.6 GB           17.5 GB           8.65 GB           4.98 GB
  [OOM]             [73%]             [36%]             [21%]

Option	Weights/GPU	KV available/GPU	Supported max_len	Concurrency
FP16 single GPU	29.4 GB	OOM	-	-
FP16 TP=2	14.7 GB	6.5 GB	16,384	~4
AWQ INT4 single GPU	7.35 GB	14.5 GB	32,768	~6
AWQ INT4 TP=2	3.68 GB	18.7 GB	65,536	8+

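Conclusion: a 14B model in FP16 does not fit on a single 24 GB GPU, so either TP=2 or quantization is required. For production, use AWQ on a single GPU (best cost-effectiveness) or FP16 with TP=2 (best quality).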

5. vLLM Multi-GPU Scheduling Source Code Analysis

5.1 Overall Architecture

graph TB
    subgraph ENGINE["LLMEngine (main process)"]
        REQ["Request queue<br/>add_request()"]
        SCHED["Scheduler"]
        REQ --> SCHED
    end

    subgraph EXEC["Executor layer"]
        GPU_EXEC["GPUExecutor (single node)"]
        RAY_EXEC["RayDistributedExecutor (multi-node)"]
    end

    subgraph IPC["Inter-process communication (IPC)"]
        TQ["task_queue"]
        RQ["result_queue"]
    end

    subgraph WORKERS["Worker processes"]
        direction LR
        subgraph W0["Worker 0"]
            MR0["ModelRunner<br/>left-half model shard"]
            FW0["forward() on GPU 0"]
        end
        subgraph W1["Worker 1"]
            MR1["ModelRunner<br/>right-half model shard"]
            FW1["forward() on GPU 1"]
        end
    end

    subgraph DIST["NCCL communication layer"]
        PS["parallel_state<br/>GroupCoordinator"]
        G0["RTX 3090 #0"] <-->|"AllReduce"| G1["RTX 3090 #1"]
    end

    SCHED -->|"execute_model()"| GPU_EXEC
    GPU_EXEC --> IPC
    IPC --> W0 & W1
    W0 & W1 --> PS
    PS --> G0 & G1
    FW0 -.->|"logits (sampling on rank 0)"| SCHED

    classDef engine fill:#9e6a03,stroke:#d29922,color:#fff
    classDef exec fill:#8957e5,stroke:#bc8cff,color:#fff
    classDef worker fill:#1f6feb,stroke:#58a6ff,color:#fff
    classDef nccl fill:#da3633,stroke:#f85149,color:#fff
    classDef dist fill:#238636,stroke:#3fb950,color:#fff

    class REQ,SCHED engine
    class GPU_EXEC,RAY_EXEC exec
    class MR0,FW0,MR1,FW1 worker
    class G0,G1 nccl
    class PS dist

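5.2 Key Modules and Files

File paths follow the vLLM release used for this deployment; the executor modules have been renamed across releases.

vllm/
  engine/llm_engine.py              # LLMEngine: request queue + scheduling
  executor/
    executor_base.py                # ExecutorBase: abstract base class
    gpu_executor.py                 # GPUExecutor: single-node multi-GPU executor
    ray_distributed_executor.py     # RayDistributedExecutor: multi-node
    multiproc_worker_utils.py       # ProcessWorkerWrapper: multi-process management
  worker/
    worker.py                       # Worker: one process per GPU
    model_runner.py                 # ModelRunner: model loading and inference
  distributed/
    parallel_state.py               # distributed environment init, process group management
    device_communicators/           # NCCL communicator implementations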

5.3 Executor: Creating and Managing Workers

# Simplified logic (vllm/executor/gpu_executor.py)

class GPUExecutor(ExecutorBase):
    def _init_executor(self):
        if self.parallel_config.tensor_parallel_size > 1:
            # TP > 1: multi-process mode, one Worker process per GPU
            self._init_workers_multiproc()
        else:
            self._init_workers_single()

    def _init_workers_multiproc(self):
        for rank in range(self.parallel_config.tensor_parallel_size):
            worker = ProcessWorkerWrapper(
                worker_class=Worker,
                rank=rank,
                local_rank=rank,  # single node: local rank equals global rank
                # ... remaining arguments omitted
            )
            self.workers.append(worker)

IPC communication:

# Simplified (vllm/executor/multiproc_worker_utils.py)
import multiprocessing
from uuid import uuid4

class ProcessWorkerWrapper:
    """The main process talks to a Worker subprocess through Queues."""
    def __init__(self, *worker_args):  # other constructor arguments omitted
        self.task_queue = multiprocessing.Queue()    # send tasks
        self.result_queue = multiprocessing.Queue()  # receive results
        self.process = multiprocessing.Process(
            target=_run_worker_process,
            args=(self.task_queue, self.result_queue, *worker_args),
        )
        self.process.start()

    def execute_method(self, method, *args):
        # Task tuple: (task id, method name, positional args, keyword args)
        self.task_queue.put((uuid4(), method, args, {}))
        return self.result_queue.get()
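
The task-queue/result-queue pattern can be tried out without GPUs. The following is a toy sketch, not vLLM code: `ToyWorker`, `WorkerHandle`, and `run_worker_loop` are hypothetical names, and threads with `queue.Queue` stand in for the processes and `multiprocessing.Queue` used in the real executor so that the example runs anywhere.

```python
# Toy sketch of task-queue / result-queue dispatch to two workers.
# Threads stand in for worker processes; all class names are hypothetical.
import queue
import threading
from uuid import uuid4

class ToyWorker:
    """Stand-in for a per-GPU worker that computes a partial result for its shard."""
    def __init__(self, rank):
        self.rank = rank

    def execute_model(self, tokens):
        return {"rank": self.rank, "partial": sum(tokens) * (self.rank + 1)}

def run_worker_loop(worker, task_queue, result_queue):
    """Serve (task_id, method, args, kwargs) tuples until a None sentinel arrives."""
    while True:
        task = task_queue.get()
        if task is None:
            break
        task_id, method, args, kwargs = task
        result_queue.put((task_id, getattr(worker, method)(*args, **kwargs)))

class WorkerHandle:
    def __init__(self, rank):
        self.task_queue = queue.Queue()
        self.result_queue = queue.Queue()
        self.thread = threading.Thread(
            target=run_worker_loop,
            args=(ToyWorker(rank), self.task_queue, self.result_queue),
            daemon=True,
        )
        self.thread.start()

    def submit(self, method, *args, **kwargs):
        task_id = uuid4()
        self.task_queue.put((task_id, method, args, kwargs))
        return task_id

    def result(self, task_id):
        done_id, value = self.result_queue.get(timeout=5)
        assert done_id == task_id
        return value

    def shutdown(self):
        self.task_queue.put(None)
        self.thread.join(timeout=5)

handles = [WorkerHandle(rank) for rank in range(2)]
# Submit to every worker before waiting on any result.
task_ids = [h.submit("execute_model", [1, 2, 3]) for h in handles]
results = [h.result(t) for h, t in zip(handles, task_ids)]
for h in handles:
    h.shutdown()
print(results)
```

The sketch submits to all workers before collecting any result. With real TP workers this ordering matters: every rank has to enter the same AllReduce, so waiting for rank 0 to finish before rank 1 has even received its task would stall both.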

5.4 Worker: Model Execution on the GPU

# Simplified (vllm/worker/worker.py)

class Worker(LocalOrDistributedWorkerBase):
    def init_device(self):
        torch.cuda.set_device(self.local_rank)        # bind this process to its GPU
        init_distributed_environment(                   # initialize NCCL
            world_size=self.parallel_config.tensor_parallel_size,
            rank=self.rank, backend="nccl"
        )
        initialize_model_parallel(                      # create the TP process group
            tensor_model_parallel_size=self.parallel_config.tensor_parallel_size
        )

    def load_model(self):
        self.model_runner = ModelRunner(...)
        self.model_runner.load_model()                  # weights are sharded by TP rank automatically

    def execute_model(self, req):
        return self.model_runner.execute_model(req)     # one forward pass

5.5 Distributed Initialization: parallel_state

# Simplified (vllm/distributed/parallel_state.py)
import torch

def init_distributed_environment(world_size, rank, backend="nccl"):
    torch.distributed.init_process_group(
        backend=backend, world_size=world_size, rank=rank
    )

def initialize_model_parallel(tensor_model_parallel_size):
    global _TP_GROUP  # module-level handle used by the TP layers
    ranks = list(range(tensor_model_parallel_size))  # [0, 1]
    group = torch.distributed.new_group(ranks, backend="nccl")
    _TP_GROUP = GroupCoordinator(group, ranks, ...)

class GroupCoordinator:
    """Wrapper around a PyTorch ProcessGroup that exposes convenient communication methods."""
    def all_reduce(self, tensor):
        torch.distributed.all_reduce(tensor, group=self.group)

    def all_gather(self, tensor_list, tensor):
        torch.distributed.all_gather(tensor_list, tensor, group=self.group)

5.6 TP Splitting Inside Model Layers

# Conceptual sketch (Megatron style)
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColumnParallelLinear(nn.Module):
    """Column parallel: each GPU holds output_size/tp_size columns of A."""
    def __init__(self, input_size, output_size):
        super().__init__()
        tp_size = get_tensor_model_parallel_world_size()  # = 2
        self.weight = nn.Parameter(
            torch.empty(output_size // tp_size, input_size)  # half of A's columns, stored as [out/tp, in]
        )

    def forward(self, x):
        return F.linear(x, self.weight)  # output shape: [b, s, out/2]

class RowParallelLinear(nn.Module):
    """Row parallel: each GPU holds input_size/tp_size rows of A; the output needs an AllReduce."""
    def __init__(self, input_size, output_size):
        super().__init__()
        tp_size = get_tensor_model_parallel_world_size()
        self.weight = nn.Parameter(
            torch.empty(output_size, input_size // tp_size)  # half of A's rows, stored as [out, in/tp]
        )

    def forward(self, x):
        output = F.linear(x, self.weight)       # partial sum
        torch.distributed.all_reduce(output,     # AllReduce to sum across GPUs
            group=get_tensor_model_parallel_group())
        return output                            # complete result

5.7 End-to-End Inference Execution Flow

flowchart TD
    REQ(["Client: POST /v1/completions"])
    ENGINE["LLMEngine.add_request()"]
    SCHED["Scheduler: pick a batch, allocate KV blocks"]
    DISPATCH["Executor: dispatch via task_queue to 2 Workers"]

    subgraph FWD["Parallel Forward (GPU0 + GPU1)"]
        direction TB
        EMB["Embedding: token -> hidden"]
        subgraph L0["Layer 0"]
            direction LR
            L0A["Attention (TP)"] --> L0AR{{"AllReduce"}} --> L0M["MLP (TP)"] --> L0MR{{"AllReduce"}}
        end
        DOTS["Layer 1 ~ 46 (same as above)"]
        subgraph L47["Layer 47"]
            direction LR
            L47A["Attention (TP)"] --> L47AR{{"AllReduce #95"}} --> L47M["MLP (TP)"] --> L47MR{{"AllReduce #96"}}
        end
        LN["Final LayerNorm"]
        LM["LM Head -> vocab logits"]
        EMB --> L0 --> DOTS --> L47 --> LN --> LM
    end

    SAMPLE["Sampling (rank 0 only)<br/>Top-p / Temperature"]
    KV["Update KV Cache"]
    CHECK{"EOS or max_tokens?"}
    LOOP["Continue decode"]
    RESP(["Return response (stream/batch)"])

    REQ --> ENGINE --> SCHED --> DISPATCH --> FWD
    LM --> SAMPLE --> KV --> CHECK
    CHECK -->|"No"| LOOP --> SCHED
    CHECK -->|"Yes"| RESP

    classDef req fill:#238636,stroke:#3fb950,color:#fff
    classDef engine fill:#9e6a03,stroke:#d29922,color:#fff
    classDef fwd fill:#1f6feb,stroke:#58a6ff,color:#fff
    classDef comm fill:#da3633,stroke:#f85149,color:#fff
    classDef sample fill:#8957e5,stroke:#bc8cff,color:#fff

    class REQ,RESP req
    class ENGINE,SCHED,DISPATCH engine
    class EMB,L0A,L0M,L47A,L47M,DOTS,LN,LM fwd
    class L0AR,L0MR,L47AR,L47MR comm
    class SAMPLE,KV sample

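5.8 Single-Node Multiprocessing vs Ray Distributed

Feature	multiprocessing (single node)	Ray (multi-node)
Launch method	Python multiprocessing.Process	Ray remote actors
IPC	Queue (task/result)	Ray Object Store
GPU assignment	CUDA_VISIBLE_DEVICES	Ray placement group
Use case	Multiple GPUs in one machine	GPUs across machines
Default behavior	Used automatically for single-node TP	Requires Ray to be installed

# Worker process start method for the multiprocessing backend (spawn or fork);
# the backend itself is selected with --distributed-executor-backend mp
export VLLM_WORKER_MULTIPROC_METHOD=spawn

# Use Ray (multi-node deployments)
vllm serve ... --distributed-executor-backend ray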

Appendix: Actual Deployment Configuration

dev210 dual RTX 3090 configuration

python3 -m vllm.entrypoints.openai.api_server \
  --model /models/Qwen2.5-14B-Instruct-1M \
  --port 9090 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --tensor-parallel-size 2

# Measured GPU memory: ~22.4 GB / 24 GB per GPU
# GPU utilization: 0% when idle, 60-80% during inference

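Performance Tuning Cheat Sheet

Symptom	What to check	Tuning action
High decode latency	PCIe communication bottleneck	Add an NVLink Bridge / increase batch size
Out of GPU memory	Weights + KV exceed the limit	Lower gpu-memory-utilization / max-model-len
Low throughput	Poor batch utilization	Increase max-num-seqs
Slow first token	Heavy prefill compute	--enable-chunked-prefill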
