Demystifying Koordinator, the Cloud-Native Colocation Resource Scheduler (13): GPU Resource Management Overview

I. Core Mission and Design Philosophy

1.1 The Mission of Koordinator GPU Scheduling

In a native Kubernetes environment, GPU scheduling has the following pain points:

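┌────────────────────────────────────────────────────────────────────┐
│          Pain points of native Kubernetes GPU scheduling           │
├────────────────────────────────────────────────────────────────────┤
│  1. Coarse-grained allocation: whole cards only, no GPU sharing    │
│  2. Wasted resources: one container rarely saturates a full GPU    │
│  3. High cost: each Pod monopolizes a GPU, so utilization is low   │
│  4. No isolation: no QoS guarantee when Pods do share a GPU        │
│  5. No FPGA/RDMA support: GPU only, no unified device framework    │
└────────────────────────────────────────────────────────────────────┘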

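Production case: the machine-learning platform of an internet company owns 200 NVIDIA V100 GPUs and runs about 500 inference services:

Native K8s approach	Koordinator approach
Each Pod monopolizes 1 GPU	Up to 5 Pods share 1 GPU
Average GPU utilization 18%	Average GPU utilization 75%
Requires 500 GPUs	Requires 133 GPUs (73% fewer, about 3.8 Pods per GPU on average)
Cost: 500 × 60,000 = 30 million/year	Cost: 133 × 60,000 = 7.98 million/year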

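Koordinator's three core missions:

Fine-grained GPU sharing: allocate by proportion of GPU compute (core) and GPU memory
Unified device management: GPU, FPGA, and RDMA devices share one abstraction, the Device CRD
QoS guarantees: resource isolation protects service quality when a GPU is shared

1.2 Core Design Philosophy (How)

Koordinator GPU scheduling uses a three-layer architecture: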

┌──────────────────────────────────────────────────────────────┐
│             Koordinator three-layer architecture             │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────────────────────────────────────────────┐    │
│  │  1. Scheduler layer: DeviceShare scheduling plugin   │    │
│  │     - PreFilter: validate the GPU request            │    │
│  │     - Filter: keep nodes that satisfy the request    │    │
│  │     - Reserve: reserve GPU resources                 │    │
│  │     - PreBind: write the result to Pod annotations   │    │
│  └──────────────────────────────────────────────────────┘    │
│                            ↓                                 │
│  ┌──────────────────────────────────────────────────────┐    │
│  │  2. Controller layer: Device CRD & NodeResource      │    │
│  │     - Device CRD: unified node device description    │    │
│  │     - NodeResource Controller: available resources   │    │
│  │     - SLO Controller: distributes QoS policies       │    │
│  └──────────────────────────────────────────────────────┘    │
│                            ↓                                 │
│  ┌──────────────────────────────────────────────────────┐    │
│  │  3. Agent layer: Koordlet GPU reporting              │    │
│  │     - StatesInformer: collects GPU info via NVML     │    │
│  │     - MetricAdvisor: collects GPU usage metrics      │    │
│  │     - RuntimeHooks: injects isolation parameters     │    │
│  └──────────────────────────────────────────────────────┘    │
│                                                              │
└──────────────────────────────────────────────────────────────┘

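Core design principles:

Compatibility first:
- Supports the native nvidia.com/gpu resource name
- Supports the fine-grained koordinator.sh/gpu-core resource name
- Supports both koordinator.sh/gpu-memory and koordinator.sh/gpu-memory-ratio
Transparency:
- GPU allocation results are injected through annotations, so containers need no changes
- The scheduling result includes the GPU minor number, which can be used to set CUDA_VISIBLE_DEVICES
Flexibility:
- Whole-card allocation (100% GPU core + 100% memory)
- Fractional allocation (for example 25% GPU core + 25% memory)
- Multi-card allocation (200% = 2 full GPUs)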
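
The transparency principle above says the minor numbers in the scheduling result can be turned into a CUDA_VISIBLE_DEVICES value. A minimal sketch of that conversion follows. `cudaVisibleDevices` is a hypothetical helper written for this article, not a Koordinator function, and it assumes the minor numbers match the device indices that CUDA sees on the node.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cudaVisibleDevices joins allocated GPU minor numbers into the
// comma-separated form used by the CUDA_VISIBLE_DEVICES variable.
// Hypothetical helper for illustration; not part of Koordinator.
func cudaVisibleDevices(minors []int32) string {
	parts := make([]string, 0, len(minors))
	for _, m := range minors {
		parts = append(parts, strconv.Itoa(int(m)))
	}
	return strings.Join(parts, ",")
}

func main() {
	// A fractional share of GPU 0.
	fmt.Println("CUDA_VISIBLE_DEVICES=" + cudaVisibleDevices([]int32{0}))
	// Two whole cards (a 200% request).
	fmt.Println("CUDA_VISIBLE_DEVICES=" + cudaVisibleDevices([]int32{2, 3}))
}
```

A fractional request and a whole-card request produce the same kind of value; the share each container may use is carried separately in the allocation result.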

II. The GPU Resource Model in Detail

2.1 Resource Names

Koordinator defines the following GPU resource names:

// apis/extension/resource.go
const (
    // Native GPU resource (whole-card allocation)
    ResourceNvidiaGPU corev1.ResourceName = "nvidia.com/gpu"

    // Koordinator extended resources (fine-grained allocation)
    ResourceGPU            corev1.ResourceName = "koordinator.sh/gpu"
    ResourceGPUCore        corev1.ResourceName = "koordinator.sh/gpu-core"
    ResourceGPUMemory      corev1.ResourceName = "koordinator.sh/gpu-memory"
    ResourceGPUMemoryRatio corev1.ResourceName = "koordinator.sh/gpu-memory-ratio"
)

// GPU node labels
const (
    LabelGPUModel         string = "node.koordinator.sh/gpu-model"          // GPU model
    LabelGPUDriverVersion string = "node.koordinator.sh/gpu-driver-version" // driver version
)

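Rules for combining resource requests:

Combination	Example request	Meaning	Typical use
Whole card (native-compatible)	`nvidia.com/gpu: 1`	1 full GPU	Training jobs, inference that needs an exclusive GPU
Koordinator whole card	`koordinator.sh/gpu: 100`	Equivalent to `nvidia.com/gpu: 1`	The same request in Koordinator's percentage semantics
Core + memory	`gpu-core: 50` `gpu-memory: 8Gi`	50% of GPU compute + 8 GiB of GPU memory	Inference services with a fixed memory footprint
Core + ratio	`gpu-core: 50` `gpu-memory-ratio: 50`	50% of GPU compute + 50% of GPU memory	Memory demand proportional to compute
Multi-card	`gpu-core: 200` `gpu-memory-ratio: 200`	2 full GPUs	Large-model training, multi-card inference

Validation logic: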

// pkg/scheduler/plugins/deviceshare/utils.go

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// One bit per GPU resource name that may appear in a Pod request.
const (
    NvidiaGPUExist      = 1 << 0  // 0b00001
    KoordGPUExist       = 1 << 1  // 0b00010
    GPUCoreExist        = 1 << 2  // 0b00100
    GPUMemoryExist      = 1 << 3  // 0b01000
    GPUMemoryRatioExist = 1 << 4  // 0b10000
)

// ValidateGPURequest returns a bitmask of the GPU resources present in the
// request and rejects unsupported values and combinations.
func ValidateGPURequest(podRequest corev1.ResourceList) (uint, error) {
    var gpuCombination uint

    // Detect which resources are present
    if _, exist := podRequest["nvidia.com/gpu"]; exist {
        gpuCombination |= NvidiaGPUExist
    }
    if koordGPU, exist := podRequest["koordinator.sh/gpu"]; exist {
        // Any value up to 100 is accepted; above 100 only whole cards are allowed
        if koordGPU.Value() > 100 && koordGPU.Value()%100 != 0 {
            return gpuCombination, fmt.Errorf("gpu value above 100 must be a multiple of 100")
        }
        gpuCombination |= KoordGPUExist
    }
    if gpuCore, exist := podRequest["koordinator.sh/gpu-core"]; exist {
        if gpuCore.Value() > 100 && gpuCore.Value()%100 != 0 {
            return gpuCombination, fmt.Errorf("gpu-core value above 100 must be a multiple of 100")
        }
        gpuCombination |= GPUCoreExist
    }
    // ... gpu-memory and gpu-memory-ratio are detected the same way

    // Allowed combinations
    if gpuCombination == NvidiaGPUExist ||
       gpuCombination == KoordGPUExist ||
       gpuCombination == (GPUCoreExist | GPUMemoryExist) ||
       gpuCombination == (GPUCoreExist | GPUMemoryRatioExist) {
        return gpuCombination, nil
    }

    return gpuCombination, fmt.Errorf("invalid GPU resource combination")
}
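
To see which requests the bitmask check accepts, the following dependency-free sketch reproduces only the combination logic on a plain `map[string]int64`. The names `combination` and `resourceBits` are illustrative and do not exist in Koordinator, and the value checks (multiples of 100) are left out on purpose.

```go
package main

import (
	"errors"
	"fmt"
)

// One bit per GPU resource name, in the same order as the excerpt above.
const (
	nvidiaGPU = 1 << iota
	koordGPU
	gpuCore
	gpuMemory
	gpuMemoryRatio
)

// resourceBits maps each resource name to its bit flag.
var resourceBits = map[string]uint{
	"nvidia.com/gpu":                  nvidiaGPU,
	"koordinator.sh/gpu":              koordGPU,
	"koordinator.sh/gpu-core":         gpuCore,
	"koordinator.sh/gpu-memory":       gpuMemory,
	"koordinator.sh/gpu-memory-ratio": gpuMemoryRatio,
}

// combination builds the bitmask for the GPU resources present in req and
// rejects every mask outside the four allowed patterns.
func combination(req map[string]int64) (uint, error) {
	var mask uint
	for name, bit := range resourceBits {
		if _, ok := req[name]; ok {
			mask |= bit
		}
	}
	switch mask {
	case nvidiaGPU, koordGPU, gpuCore | gpuMemory, gpuCore | gpuMemoryRatio:
		return mask, nil
	}
	return mask, errors.New("invalid GPU resource combination")
}

func main() {
	requests := []map[string]int64{
		{"nvidia.com/gpu": 1},
		{"koordinator.sh/gpu-core": 50, "koordinator.sh/gpu-memory-ratio": 50},
		{"koordinator.sh/gpu-core": 50},                      // core without memory
		{"nvidia.com/gpu": 1, "koordinator.sh/gpu-core": 50}, // mixed styles
	}
	for _, r := range requests {
		mask, err := combination(r)
		fmt.Printf("mask=%05b err=%v\n", mask, err)
	}
}
```

Under these four patterns, a request that names gpu-core alone, or that mixes the native and the fine-grained resource names, is rejected.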

Production case: workload resource specs

An AI platform runs three classes of GPU workloads:

# 1. Small inference service (ResNet-50)
resources:
  requests:
    koordinator.sh/gpu-core: "25"          # 25% of GPU compute
    koordinator.sh/gpu-memory-ratio: "25"  # 25% of GPU memory (4 GiB on a 16 GiB card)
  limits:
    koordinator.sh/gpu-core: "25"
    koordinator.sh/gpu-memory-ratio: "25"

# 2. Medium inference service (BERT-Large)
resources:
  requests:
    koordinator.sh/gpu-core: "50"
    koordinator.sh/gpu-memory: "8Gi"       # fixed 8 GiB of GPU memory
  limits:
    koordinator.sh/gpu-core: "50"
    koordinator.sh/gpu-memory: "8Gi"

# 3. Large-model training (GPT-3)
resources:
  requests:
    koordinator.sh/gpu-core: "200"         # 2 full GPUs
    koordinator.sh/gpu-memory-ratio: "200"
  limits:
    koordinator.sh/gpu-core: "200"
    koordinator.sh/gpu-memory-ratio: "200"

Observed results:

Service type	GPUs	Pods per GPU	Total Pods in cluster	GPU utilization
Small inference	50	4	200	78%
Medium inference	30	2	60	82%
Large-model training	20	Exclusive (2 GPUs per Pod)	10	95%

2.2 Device CRD Data Structures

The Device CRD is the core of Koordinator's unified device management:

// apis/scheduling/v1alpha1/device_types.go

type DeviceType string

const (
    GPU  DeviceType = "gpu"
    FPGA DeviceType = "fpga"
    RDMA DeviceType = "rdma"
)

// Device object (cluster-scoped)
type Device struct {
    metav1.TypeMeta
    metav1.ObjectMeta

    Spec   DeviceSpec
    Status DeviceStatus
}

// Spec: the devices present on a node
type DeviceSpec struct {
    Devices []DeviceInfo `json:"devices"`
}

// Details of a single device
type DeviceInfo struct {
    UUID      string                   `json:"id"`         // GPU UUID
    Minor     *int32                   `json:"minor"`      // device number (0, 1, 2...)
    Type      DeviceType               `json:"type"`       // gpu/fpga/rdma
    Health    bool                     `json:"health"`     // health status
    Resources corev1.ResourceList      `json:"resources"`  // resource capacity
}

// Status: device allocation state
type DeviceStatus struct {
    Allocations []DeviceAllocation `json:"allocations"`
}

type DeviceAllocation struct {
    Type    DeviceType             `json:"type"`
    Entries []DeviceAllocationItem `json:"entries"` // Pods that hold an allocation
}

type DeviceAllocationItem struct {
    Name      string  `json:"name"`       // Pod name
    Namespace string  `json:"namespace"`
    UUID      string  `json:"uuid"`       // UUID of the allocated GPU
    Minors    []int32 `json:"minors"`     // minors of the allocated GPUs
}

An example Device object:

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Device
metadata:
  name: node-1
spec:
  devices:
  - id: "GPU-abcd1234"               # NVIDIA GPU UUID
    minor: 0                          # /dev/nvidia0
    type: gpu
    health: true
    resources:
      koordinator.sh/gpu-core: 100
      koordinator.sh/gpu-memory: 16Gi
      koordinator.sh/gpu-memory-ratio: 100
  - id: "GPU-efgh5678"
    minor: 1
    type: gpu
    health: true
    resources:
      koordinator.sh/gpu-core: 100
      koordinator.sh/gpu-memory: 16Gi
      koordinator.sh/gpu-memory-ratio: 100
status:
  allocations:
  - type: gpu
    entries:
    - name: pod-inference-1
      namespace: default
      uuid: "GPU-abcd1234"
      minors: [0]
    - name: pod-inference-2
      namespace: default
      uuid: "GPU-abcd1234"
      minors: [0]
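
As a rough illustration of how a consumer might read the spec portion of such an object, the sketch below decodes a JSON rendering of the devices list and keeps only the healthy GPUs. The types `deviceInfo` and `deviceSpec` are simplified stand-ins written for this example (resources are kept as plain strings), not the real `schedulingv1alpha1` types, and `healthyMinors` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deviceInfo is a simplified stand-in for DeviceInfo; note that the UUID
// is serialized under the key "id", as in the struct tags shown earlier.
type deviceInfo struct {
	ID        string            `json:"id"`
	Minor     int32             `json:"minor"`
	Type      string            `json:"type"`
	Health    bool              `json:"health"`
	Resources map[string]string `json:"resources"`
}

type deviceSpec struct {
	Devices []deviceInfo `json:"devices"`
}

// healthyMinors returns the minor numbers of healthy devices of one type.
func healthyMinors(spec deviceSpec, typ string) []int32 {
	var minors []int32
	for _, d := range spec.Devices {
		if d.Type == typ && d.Health {
			minors = append(minors, d.Minor)
		}
	}
	return minors
}

func main() {
	raw := `{"devices":[
	  {"id":"GPU-abcd1234","minor":0,"type":"gpu","health":true,
	   "resources":{"koordinator.sh/gpu-core":"100"}},
	  {"id":"GPU-efgh5678","minor":1,"type":"gpu","health":false,
	   "resources":{"koordinator.sh/gpu-core":"100"}}
	]}`

	var spec deviceSpec
	if err := json.Unmarshal([]byte(raw), &spec); err != nil {
		panic(err)
	}
	// Only GPU 0 is healthy in this sample, so only minor 0 is schedulable.
	fmt.Println(healthyMinors(spec, "gpu"))
}
```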

GPU information collection flow:

┌──────────────────────────────────────────────────────────────┐
│           Koordlet GPU information reporting flow            │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  1. Initialize the NVML library                              │
│     nvml.Init() → locate libnvidia-ml.so                     │
│                                                              │
│  2. List the GPU devices                                     │
│     nvml.DeviceGetCount() → number of GPUs                   │
│     nvml.DeviceGetHandleByIndex(i) → iterate over each GPU   │
│                                                              │
│  3. Collect details for each GPU                             │
│     ┌────────────────────────────────────────────────┐       │
│     │ nvml.DeviceGetUUID()          → GPU UUID       │       │
│     │ nvml.DeviceGetMinorNumber()   → minor number   │       │
│     │ nvml.DeviceGetMemoryInfo()    → memory size    │       │
│     │ nvml.DeviceGetName()          → GPU model      │       │
│     │ nvml.SystemGetDriverVersion() → driver version │       │
│     └────────────────────────────────────────────────┘       │
│                                                              │
│  4. Build the Device CRD                                     │
│     Fill the DeviceInfo list (UUID / Minor / Resources)      │
│                                                              │
│  5. Report to the APIServer                                  │
│     First time: Create Device                                │
│     Afterwards: Update Device (with version check)           │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Core implementation:

// pkg/koordlet/statesinformer/states_device_linux.go

func (s *statesInformer) buildGPUDevice() []schedulingv1alpha1.DeviceInfo {
    // Read the GPU metrics from the MetricCache (queryParam is built earlier; omitted here)
    nodeResource := s.metricsCache.GetNodeResourceMetric(queryParam)
    if len(nodeResource.Metric.GPUs) == 0 {
        return nil
    }

    var deviceInfos []schedulingv1alpha1.DeviceInfo
    for i := range nodeResource.Metric.GPUs {
        gpu := nodeResource.Metric.GPUs[i]

        // Check the GPU health status
        health := true
        if _, ok := s.unhealthyGPU[gpu.DeviceUUID]; ok {
            health = false
        }

        // Every card reports 100 units of core and memory ratio plus its real memory size
        deviceInfos = append(deviceInfos, schedulingv1alpha1.DeviceInfo{
            UUID:   gpu.DeviceUUID,
            Minor:  &gpu.Minor,
            Type:   schedulingv1alpha1.GPU,
            Health: health,
            Resources: map[corev1.ResourceName]resource.Quantity{
                extension.ResourceGPUCore:        *resource.NewQuantity(100, resource.DecimalSI),
                extension.ResourceGPUMemory:      gpu.MemoryTotal,  // for example 16Gi
                extension.ResourceGPUMemoryRatio: *resource.NewQuantity(100, resource.DecimalSI),
            },
        })
    }
    return deviceInfos
}

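Production scale:

Statistics from one cloud vendor's GPU cluster:

Metric	Value
Nodes	500
Total GPUs	4,000 (8 per node on average)
Device objects	500 (one per node)
Device object size	15 KB on average
Update frequency	Health status checked every 60 seconds
APIServer write QPS	At most 500/60 ≈ 8.3 QPS (if every check results in an update)

III. End-to-End GPU Scheduling Sequence

3.1 The Full Path from Pod Creation to GPU Allocation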

┌─────────┐  ┌──────────┐  ┌──────────────┐  ┌─────────┐  ┌─────────┐
│  User   │  │APIServer │  │ DeviceShare  │  │ Device  │  │Koordlet │
│         │  │          │  │   Plugin     │  │  Cache  │  │         │
└────┬────┘  └─────┬────┘  └──────┬───────┘  └────┬────┘  └────┬────┘
     │             │               │               │             │
     │ 1. Create   │               │               │             │
     │   Pod with  │               │               │             │
     │   gpu-core  │               │               │             │
     ├─────────────>               │               │             │
     │             │               │               │             │
     │             │ 2. PreFilter  │               │             │
     │             │   validate    │               │             │
     │             ├───────────────>               │             │
     │             │               │               │             │
     │             │               │ ValidateGPURequest()        │
     │             │               │ - check the combination     │
     │             │               │ - convert to unified form   │
     │             │               │               │             │
     │             │ 3. Filter     │               │             │
     │             │   (all nodes) │               │             │
     │             ├───────────────>               │             │
     │             │               │               │             │
     │             │               │ 4. getNodeDevice()          │
     │             │               ├───────────────>             │
     │             │               │               │             │
     │             │               │ 5. return node GPU state    │
     │             │               <───────────────┤             │
     │             │               │  deviceTotal: {0: 100%}     │
     │             │               │  deviceFree:  {0: 75%}      │
     │             │               │  deviceUsed:  {0: 25%}      │
     │             │               │               │             │
     │             │               │ 6. tryAllocateDevice()      │
     │             │               │   - check free resources    │
     │             │               │   - try GPU 0               │
     │             │               │               │             │
     │             │ 7. Results    │               │             │
     │             │   node-1: OK  │               │             │
     │             │   node-2: OK  │               │             │
     │             <───────────────┤               │             │
     │             │               │               │             │
     │             │ 8. Score      │               │             │
     │             │   (pick best) │               │             │
     │             ├───────────────>               │             │
     │             │               │               │             │
     │             │ 9. Pick node-1│               │             │
     │             <───────────────┤               │             │
     │             │               │               │             │
     │             │ 10. Reserve   │               │             │
     │             ├───────────────>               │             │
     │             │               │               │             │
     │             │               │ 11. Allocate()              │
     │             │               │   actual GPU allocation     │
     │             │               ├───────────────>             │
     │             │               │               │             │
     │             │               │ 12. Reserve()               │
     │             │               │   update deviceUsed         │
     │             │               <───────────────┤             │
     │             │               │  deviceUsed[0] += 25%       │
     │             │               │               │             │
     │             │ 13. PreBind   │               │             │
     │             ├───────────────>               │             │
     │             │               │               │             │
     │             │               │ 14. Inject annotation       │
     │             │               │    device-allocated:        │
     │             │               │    {"gpu":[{                │
     │             │               │      "minor": 0,            │
     │             │               │      "resources": {         │
     │             │               │        "gpu-core":"25"      │
     │             │               │      }                      │
     │             │               │    }]}                      │
     │             │               │               │             │
     │             │ 15. Patch Pod │               │             │
     │             <───────────────┤               │             │
     │             │               │               │             │
     │             │ 16. Bind Pod  │               │             │
     │             │   to node-1   │               │             │
     │             <───────────────┤               │             │
     │             │               │               │             │
     │             │               │               │ 17. Watch   │
     │             │               │               │   Pod events│
     │             │               │               ├─────────────>
     │             │               │               │             │
     │             │               │               │ 18. Read    │
     │             │               │               │  annotation │
     │             │               │               │             │
     │             │               │               │ 19. Set     │
     │             │               │               │  CUDA_      │
     │             │               │               │  VISIBLE_   │
     │             │               │               │  DEVICES=0  │
     │             │               │               │             │
     │             │               │               │ 20. Start   │
     │             │               │               │  container  │
     ◀─────────────────── scheduling complete ───────────────────┘

3.2 Key Stages in Detail

Stage 1: PreFilter - validate the GPU request

// pkg/scheduler/plugins/deviceshare/plugin.go

func (p *Plugin) PreFilter(ctx context.Context, cycleState *framework.CycleState, pod *corev1.Pod) *framework.Status {
    podRequests, _ := resource.PodRequestsAndLimits(pod)

    // Validate the GPU request
    combination, err := ValidateGPURequest(podRequests)
    if err != nil {
        return framework.NewStatus(framework.Error, err.Error())
    }

    // Convert to the unified form (gpu-core + gpu-memory-ratio).
    // state is the plugin's per-cycle state kept in cycleState; its setup is omitted in this excerpt.
    state.podRequests = ConvertGPUResource(podRequests, combination)

    // Example: nvidia.com/gpu: 1 → gpu-core: 100, gpu-memory-ratio: 100
    // Example: gpu-core: 50 → gpu-core: 50 (unchanged)

    return nil
}
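
The conversion to a unified form can be pictured with the sketch below. It is not the body of `ConvertGPUResource`; `normalize` and the `normalized` type are hypothetical, requests are modelled as a plain `map[string]int64`, and turning an absolute gpu-memory value into a ratio assumes the per-card memory size is known at that point, which the article does not state.

```go
package main

import "fmt"

// normalized is the unified form used in this sketch: a percentage of GPU
// compute and a percentage of GPU memory, where 100 means one full card.
type normalized struct {
	Core        int64
	MemoryRatio int64
}

// normalize maps each accepted request style to the unified form.
// cardMemBytes is the memory of one card and is only needed when the
// request carries an absolute koordinator.sh/gpu-memory value.
func normalize(req map[string]int64, cardMemBytes int64) normalized {
	if n, ok := req["nvidia.com/gpu"]; ok {
		// n whole cards: 100% of compute and memory per card.
		return normalized{Core: n * 100, MemoryRatio: n * 100}
	}
	if p, ok := req["koordinator.sh/gpu"]; ok {
		// A single percentage applies to both compute and memory.
		return normalized{Core: p, MemoryRatio: p}
	}
	out := normalized{Core: req["koordinator.sh/gpu-core"]}
	if r, ok := req["koordinator.sh/gpu-memory-ratio"]; ok {
		out.MemoryRatio = r
	} else if b, ok := req["koordinator.sh/gpu-memory"]; ok && cardMemBytes > 0 {
		out.MemoryRatio = b * 100 / cardMemBytes
	}
	return out
}

func main() {
	const gib = int64(1) << 30
	fmt.Println(normalize(map[string]int64{"nvidia.com/gpu": 1}, 16*gib))
	fmt.Println(normalize(map[string]int64{"koordinator.sh/gpu": 50}, 16*gib))
	fmt.Println(normalize(map[string]int64{
		"koordinator.sh/gpu-core":   50,
		"koordinator.sh/gpu-memory": 8 * gib, // 8 GiB of a 16 GiB card
	}, 16*gib))
}
```

With a single internal form, the later Filter and Reserve stages only have to compare two percentages per card, whichever resource names the Pod used.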

Stage 2: Filter - keep the nodes that can satisfy the request

func (p *Plugin) Filter(ctx context.Context, cycleState *framework.CycleState, pod *corev1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    // state is read back from cycleState (lookup omitted in this excerpt)
    nodeName := nodeInfo.Node().Name
    nodeDeviceInfo := p.nodeDeviceCache.getNodeDevice(nodeName, false)

    // Try to allocate the GPU resources on this node
    allocateResult, err := p.allocator.Allocate(
        nodeName,
        pod,
        state.podRequests,
        nodeDeviceInfo,
        state.preemptibleDevices[nodeName],
    )

    if len(allocateResult) > 0 && err == nil {
        return nil  // the Pod can be scheduled onto this node
    }
    return framework.NewStatus(framework.Unschedulable, ErrInsufficientDevices)
}
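
What "try to allocate" means for a single node can be sketched as follows. This is a simplified model rather than Koordinator's allocator: `tryAllocate` and the `free` type are hypothetical, cards are tried in minor order, and a request above 100 is assumed to be a multiple of 100 that needs that many completely free cards.

```go
package main

import (
	"fmt"
	"sort"
)

// free holds the unallocated share of one GPU, in percent.
type free struct {
	Core, MemoryRatio int64
}

// tryAllocate returns the minors that can serve the request, or nil when
// the node cannot. Requests of up to 100 must fit on a single card; larger
// requests take core/100 completely free cards.
func tryAllocate(node map[int32]free, core, memRatio int64) []int32 {
	minors := make([]int32, 0, len(node))
	for m := range node {
		minors = append(minors, m)
	}
	sort.Slice(minors, func(i, j int) bool { return minors[i] < minors[j] })

	if core <= 100 {
		for _, m := range minors {
			if node[m].Core >= core && node[m].MemoryRatio >= memRatio {
				return []int32{m}
			}
		}
		return nil
	}

	want := int(core / 100)
	var picked []int32
	for _, m := range minors {
		if node[m].Core == 100 && node[m].MemoryRatio == 100 {
			picked = append(picked, m)
			if len(picked) == want {
				return picked
			}
		}
	}
	return nil
}

func main() {
	node := map[int32]free{
		0: {Core: 25, MemoryRatio: 25},
		1: {Core: 100, MemoryRatio: 100},
		2: {Core: 100, MemoryRatio: 100},
	}
	fmt.Println(tryAllocate(node, 25, 25))   // fits on GPU 0
	fmt.Println(tryAllocate(node, 50, 50))   // GPU 0 is too full, GPU 1 is used
	fmt.Println(tryAllocate(node, 200, 200)) // needs two free cards
	fmt.Println(tryAllocate(node, 300, 300)) // only two free cards: no fit
}
```

In this model a node passes Filter exactly when the function returns a non-empty slice, which mirrors the `len(allocateResult) > 0` check in the excerpt.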

Stage 3: Reserve - reserve the resources

func (p *Plugin) Reserve(ctx context.Context, cycleState *framework.CycleState, pod *corev1.Pod, nodeName string) *framework.Status {
    nodeDeviceInfo := p.nodeDeviceCache.getNodeDevice(nodeName, false)

    // Perform the real allocation on the selected node
    allocateResult, err := p.allocator.Allocate(...)
    if err != nil {
        return framework.NewStatus(framework.Unschedulable, ErrInsufficientDevices)
    }

    // Update the used resources in the cache and keep the result for PreBind
    p.allocator.Reserve(pod, nodeDeviceInfo, allocateResult)
    state.allocationResult = allocateResult

    return nil
}
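
The cache update in Reserve, and its rollback in the scheduling framework's Unreserve hook, can be illustrated with a small in-memory model. `deviceCache` here is a hypothetical type that tracks only the GPU core percentage per minor; Koordinator's real node device cache holds more state.

```go
package main

import (
	"fmt"
	"sync"
)

// deviceCache tracks the used GPU-core percentage per minor on one node.
type deviceCache struct {
	mu   sync.Mutex
	used map[int32]int64
}

// reserve books core percent on the given minor and fails when the card
// would exceed 100%.
func (c *deviceCache) reserve(minor int32, core int64) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.used[minor]+core > 100 {
		return fmt.Errorf("gpu %d: %d%% used, cannot reserve %d%%", minor, c.used[minor], core)
	}
	c.used[minor] += core
	return nil
}

// unreserve rolls a reservation back, for example when a later stage of
// the scheduling cycle fails.
func (c *deviceCache) unreserve(minor int32, core int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.used[minor] -= core
}

func main() {
	cache := &deviceCache{used: map[int32]int64{0: 25}}

	fmt.Println(cache.reserve(0, 25), cache.used[0]) // succeeds, GPU 0 is now half used
	fmt.Println(cache.reserve(0, 75))                // would exceed 100%
	cache.unreserve(0, 25)
	fmt.Println(cache.used[0]) // back to the initial share
}
```

Booking the share in memory before the Pod is bound keeps two Pods scheduled back to back from being handed the same free capacity.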

Stage 4: PreBind - inject the allocation result

func (p *Plugin) PreBind(ctx context.Context, cycleState *framework.CycleState, pod *corev1.Pod, nodeName string) *framework.Status {
    // Write the allocation result saved in Reserve into the Pod annotations
    if err := apiext.SetDeviceAllocations(pod, state.allocationResult); err != nil {
        return framework.AsStatus(err)
    }

    // Patch the Pod, retrying on conflicts and throttling
    err := util.RetryOnConflictOrTooManyRequests(func() error {
        _, err := util.NewPatch().
            WithHandle(p.handle).
            AddAnnotations(pod.Annotations).
            Patch(ctx, pod)
        return err
    })

    return framework.AsStatus(err)
}
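
The annotation value shown in step 14 of the sequence diagram is a JSON document keyed by device type. The sketch below builds such a value with the standard library. The `allocation` struct and `encodeAllocations` are simplified, hypothetical stand-ins (the real type in Koordinator's `apis/extension` package may carry more fields), and the full resource name is used where the diagram abbreviates it to gpu-core.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// allocation describes the share of one device granted to a Pod.
type allocation struct {
	Minor     int32             `json:"minor"`
	Resources map[string]string `json:"resources"`
}

// encodeAllocations renders the per-device-type allocations as the JSON
// string that would be stored in the Pod annotation.
func encodeAllocations(allocs map[string][]allocation) (string, error) {
	b, err := json.Marshal(allocs)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	value, err := encodeAllocations(map[string][]allocation{
		"gpu": {{
			Minor:     0,
			Resources: map[string]string{"koordinator.sh/gpu-core": "25"},
		}},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(value)

	// The node agent can decode the same value to learn which minor to expose.
	var decoded map[string][]allocation
	if err := json.Unmarshal([]byte(value), &decoded); err != nil {
		panic(err)
	}
	fmt.Println(decoded["gpu"][0].Minor)
}
```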

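IV. Production Tuning Guide

4.1 Defining GPU Resource Specs

Recommended GPU resource spec matrix:

Workload type	GPU core	GPU memory	Pods per GPU	Typical use
Lightweight inference	10-25%	10-25%	4-8	Image classification, OCR
Standard inference	25-50%	25-50%	2-4	Object detection, NLP
Heavy inference	50-100%	50-100%	1-2	Video analysis, speech
Online training	100%	100%	Exclusive	Fine-tuning, incremental training
Offline training	200-800%	200-800%	Exclusive, multi-card	Large-model pre-training

Spec definition workflow:

1. Performance testing
   ├─ Baseline test: one Pod with an exclusive GPU; measure QPS and latency
   ├─ Sharing test: 2 Pods at 50% each; measure the performance drop
   ├─ Concurrency test: 4 Pods at 25% each; find the concurrency limit
   └─ Pick the best sharing ratio

2. Cost analysis
   ├─ Compute the unit cost: GPU cost / (QPS per Pod × Pods per GPU)
   ├─ Compare the cost of the different specs
   └─ Choose the most cost-effective option

3. SLA validation
   ├─ P99 latency < SLA requirement
   ├─ Preemption test (high-priority Pods evict low-priority Pods)
   └─ Failure recovery test (behavior when a GPU becomes unhealthy)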

Production case: an e-commerce image recognition service

Initial configuration (exclusive whole card):

resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1

QPS per Pod: 150
GPU utilization: 22%
100 Pods require 100 GPUs
Cost: 100 × 60,000 = 6 million/year

Optimized configuration (4 Pods share one GPU):

resources:
  requests:
    koordinator.sh/gpu-core: "25"
    koordinator.sh/gpu-memory-ratio: "25"
  limits:
    koordinator.sh/gpu-core: "25"
    koordinator.sh/gpu-memory-ratio: "25"

QPS per Pod: 140 (6.7% lower)
GPU utilization: 76%
100 Pods require 25 GPUs
Cost: 25 × 60,000 = 1.5 million/year
Savings: 4.5 million/year (75%)
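
The arithmetic behind this case, including the unit-cost formula from the spec definition workflow, can be reproduced with a few lines of Go. The function names are illustrative; the inputs are the figures quoted above (100 Pods, 60,000 per GPU per year, 150 and 140 QPS per Pod).

```go
package main

import "fmt"

// gpusNeeded returns how many cards are required to host pods when
// podsPerGPU Pods share each card (rounded up).
func gpusNeeded(pods, podsPerGPU int) int {
	return (pods + podsPerGPU - 1) / podsPerGPU
}

// unitCost is the yearly cost of one card divided by the QPS that the
// card delivers: GPU cost / (QPS per Pod x Pods per GPU).
func unitCost(costPerGPU, qpsPerPod float64, podsPerGPU int) float64 {
	return costPerGPU / (qpsPerPod * float64(podsPerGPU))
}

func main() {
	const pods = 100
	const costPerGPU = 60000.0

	exclusive := gpusNeeded(pods, 1)
	shared := gpusNeeded(pods, 4)
	saving := 1 - float64(shared)/float64(exclusive)

	fmt.Printf("exclusive: %d GPUs, cost per QPS %.1f\n", exclusive, unitCost(costPerGPU, 150, 1))
	fmt.Printf("shared:    %d GPUs, cost per QPS %.1f\n", shared, unitCost(costPerGPU, 140, 4))
	fmt.Printf("saving:    %.0f%%\n", saving*100)
}
```

Although each Pod loses about 6.7% of its throughput, the cost per unit of QPS drops from 400 to roughly 107, which is why the shared spec wins the comparison.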

4.2 Scheduling Strategy Configuration

DeviceShare plugin configuration:

apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: koord-scheduler
  plugins:
    preFilter:
      enabled:
      - name: DeviceShare
    filter:
      enabled:
      - name: DeviceShare
    reserve:
      enabled:
      - name: DeviceShare
    preBind:
      enabled:
      - name: DeviceShare
  pluginConfig:
  - name: DeviceShare
    args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: DeviceShareArgs
      allocator: default  # allocator strategy: default / binpack

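Allocator strategy comparison:

Strategy	Description	Strength	Weakness	Best for
default	Allocates in GPU minor order	Simple and fast	Load may be uneven	General use
binpack	Fills already-used GPUs first	Reduces fragmentation	More computation	Clusters that are short on GPU capacity

binpack example:

Suppose a node has 4 GPUs in the following state:

GPU 0: 75% used, 25% free
GPU 1: 50% used, 50% free
GPU 2: 25% used, 75% free
GPU 3: 0% used, 100% free

A new Pod requests 25% of a GPU:

default strategy: allocates to GPU 0 (the first card in minor order with enough room)
- Result: GPU 0 becomes full (100%), GPU 1-3 are unchanged
- Consequence: GPU 0 accepts no further Pods, while the other GPUs stay lightly used
binpack strategy: allocates to GPU 0 (the fullest card that still fits the request)
- Result: the same card in this state, but chosen because it has the least free capacity
- Advantage: maximizes per-card utilization and keeps emptier cards free for large requests
The two strategies diverge when minor order and fill level disagree: if the usage were reversed (GPU 0 empty, GPU 3 75% used), default would still pick GPU 0 while binpack would pick GPU 3.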
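
The difference between the two strategies is easier to see in code. The sketch below is a simplified model and not Koordinator's allocator: it assumes default means "first card in minor order with enough free share" and binpack means "card with the least free share that still fits", which is how the table above describes them.

```go
package main

import "fmt"

// pickDefault returns the lowest minor whose free share covers the
// request, or -1 when no card fits.
func pickDefault(free []int64, req int64) int {
	for minor, f := range free {
		if f >= req {
			return minor
		}
	}
	return -1
}

// pickBinpack returns the minor with the smallest free share that still
// covers the request, or -1 when no card fits. Ties go to the lower minor.
func pickBinpack(free []int64, req int64) int {
	best := -1
	for minor, f := range free {
		if f >= req && (best == -1 || f < free[best]) {
			best = minor
		}
	}
	return best
}

func main() {
	// Free share of GPU 0..3 in the example above.
	article := []int64{25, 50, 75, 100}
	// The same node with the usage order flipped.
	reversed := []int64{100, 75, 50, 25}

	fmt.Println(pickDefault(article, 25), pickBinpack(article, 25))   // both pick GPU 0
	fmt.Println(pickDefault(reversed, 25), pickBinpack(reversed, 25)) // GPU 0 vs GPU 3
}
```

On the node from the example both functions return GPU 0; on the flipped node the default choice breaks up the only empty card, whereas binpack fills the nearly full GPU 3 and leaves GPU 0 whole for a later 100% request.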

4.3 Monitoring and Alerting

Key monitoring metrics:

# Prometheus rules
groups:
- name: gpu_device_share
  rules:
  # 1. GPU allocation success rate
  #    (both sides are aggregated with sum() so that the label sets match)
  - record: gpu:allocation_success_rate
    expr: |
      sum(rate(scheduler_deviceshare_allocations_total{result="success"}[5m]))
      /
      sum(rate(scheduler_deviceshare_allocations_total[5m]))

  # 2. GPU utilization
  - record: gpu:utilization
    expr: |
      (device_gpu_used_core / device_gpu_total_core) * 100

  # 3. GPU fragmentation rate (share of cards with less than 25% left, but not full)
  - record: gpu:fragmentation_rate
    expr: |
      count(device_gpu_free_core < 25 and device_gpu_free_core > 0)
      /
      count(device_gpu_total_core)

  # Alerting rules
  - alert: GPUAllocationFailureHigh
    expr: gpu:allocation_success_rate < 0.9
    for: 5m
    annotations:
      summary: "GPU allocation failure rate is above 10%"

  - alert: GPUFragmentationHigh
    expr: gpu:fragmentation_rate > 0.3
    for: 10m
    annotations:
      summary: "GPU fragmentation rate is above 30%; consider changing the allocation strategy"
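
The fragmentation rule counts cards that still have some free share, but less than 25%, and divides by the total number of cards. The same computation on a plain slice makes the boundary cases explicit. `fragmentationRate` is a hypothetical helper and the sample values are made up for illustration.

```go
package main

import "fmt"

// fragmentationRate returns the fraction of GPUs whose free core share is
// above 0 but below threshold, mirroring the recording rule above.
func fragmentationRate(freeCore []float64, threshold float64) float64 {
	if len(freeCore) == 0 {
		return 0
	}
	fragmented := 0
	for _, f := range freeCore {
		if f > 0 && f < threshold {
			fragmented++
		}
	}
	return float64(fragmented) / float64(len(freeCore))
}

func main() {
	// Free core share per GPU, in percent. Cards with exactly 25% free or
	// with nothing free are not counted as fragments.
	free := []float64{25, 50, 75, 100, 10, 0, 5, 20}
	fmt.Printf("%.3f\n", fragmentationRate(free, 25))
}
```

A fully allocated card is excluded on purpose: it is well used, not fragmented. Only leftovers too small for the smallest common request (25% here) count.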

Grafana dashboard layout:

┌──────────────────────────────────────────────────────────┐
│            GPU resource monitoring dashboard             │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌──────────────────┐  ┌──────────────────┐              │
│  │ GPU utilization  │  │ Alloc. success   │              │
│  │     76.3%        │  │     98.5%        │              │
│  └──────────────────┘  └──────────────────┘              │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │                   Per-GPU usage                    │  │
│  ├────────────────────────────────────────────────────┤  │
│  │  GPU 0: [████████████░░░░] 75%                     │  │
│  │  GPU 1: [████████░░░░░░░░] 50%                     │  │
│  │  GPU 2: [████░░░░░░░░░░░░] 25%                     │  │
│  │  GPU 3: [░░░░░░░░░░░░░░░░]  0%                     │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │             Fragmentation distribution             │  │
│  ├────────────────────────────────────────────────────┤  │
│  │  < 10% free:   8 GPUs  ████████                    │  │
│  │  10-25% free: 12 GPUs  ████████████                │  │
│  │  25-50% free: 20 GPUs  ████████████████████        │  │
│  │  > 50% free:  60 GPUs  ████████████████████████…   │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
└──────────────────────────────────────────────────────────┘

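4.4 Troubleshooting Common Problems

Problem 1: GPU allocation fails with Insufficient Devices

Possible causes:

The node does not have enough free GPU resources
The Device CRD is not being reported properly
The GPU driver or the NVML library is malfunctioning

Troubleshooting steps:

# 1. Inspect the Device object
kubectl get device <node-name> -o yaml

# 2. Check the Koordlet logs
kubectl logs -n koordinator-system koordlet-<xxx> | grep -i gpu

# 3. Check the GPU driver on the node
nvidia-smi

# 4. Check the DeviceShare plugin logs (use the namespace where koord-scheduler is installed, koordinator-system by default)
kubectl logs -n koordinator-system koord-scheduler-<xxx> | grep DeviceShare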

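Problem 2: Uneven GPU utilization

Symptom: some GPUs are full while others sit idle

Solution:

# Enable the binpack allocator so that partly used GPUs are filled first
# and whole cards stay free for large requests
pluginConfig:
- name: DeviceShare
  args:
    allocator: binpack

# Combine it with Pod affinity so that similar Pods land on the same node
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: gpu-inference
        topologyKey: kubernetes.io/hostname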

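Problem 3: Performance jitter when sharing a GPU

Symptom: P99 latency occasionally exceeds the SLA

Root cause analysis:

Several Pods use the GPU at the same time and contend for compute
GPU memory runs short, which leads to out-of-memory failures or, with CUDA unified memory, paging to host memory

Mitigations:

# 1. Raise the GPU resource quota
resources:
  requests:
    koordinator.sh/gpu-core: "50"        # raised from 25% to 50%
    koordinator.sh/gpu-memory: "10Gi"    # fixed memory size, avoids memory pressure
  limits:
    koordinator.sh/gpu-core: "50"
    koordinator.sh/gpu-memory: "10Gi"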

# 2. Raise the Pod's priority so it is less likely to be preempted
spec:
  priorityClassName: koord-prod          # assumes Koordinator's default PriorityClasses are installed

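V. Summary and Outlook

5.1 The Core Value of Koordinator GPU Scheduling

Cost: fine-grained sharing raises GPU utilization from about 20% to 75% or more and cuts cost by 60-80% in the cases above
Flexibility: whole-card, fractional, and multi-card requests (the combinations in section 2.1) cover different workload types
Compatibility: works with the native nvidia.com/gpu resource name, so migration cost is low
Observability: monitoring metrics and alerting rules support day-to-day operations

5.2 Comparison with Other Solutions

Solution	GPU sharing	Multi-card allocation	Unified device management
Koordinator	✅ Fine-grained	✅	✅ GPU/FPGA/RDMA
Native K8s	❌ Whole card only	✅	❌ GPU only
Volcano	✅	✅	❌ GPU only
GPUStack	✅	✅	❌ GPU only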
