Building Production-Ready AI Systems with Docker: Serving Models with Docker Model Runner

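Building AI models into applications can be challenging. You need to manage large model files, make sure inference runs efficiently on the hardware you have, and wire model endpoints into your application stack, all while keeping costs and data privacy under control. Docker Model Runner (DMR) is Docker's answer to these challenges: it supports local-first large language model (LLM) inference while keeping complexity to a minimum.

In this chapter, we focus on using DMR to serve AI models, assuming you already know the basics of Docker. We start by exploring what Docker Model Runner is, including its architecture and how it supports LLM inference. Next, we walk through installing and configuring DMR on Docker Desktop (macOS / Windows) and Docker Engine (Linux), covering GPU enablement and diagnostic checks. You will then run your first AI model: pulling it from a registry, executing it locally, and verifying inference through the CLI and a simple API call. From there, we move on to model serving and API integration, showing how to expose models through an OpenAI-compatible API and interact with them using tools such as curl and the OpenAI SDKs. We also look at building AI applications with Docker Compose and DMR: binding models to application services, injecting environment variables for model endpoints, and applying real-world integration patterns. Performance optimization and GPU configuration are discussed in detail, including speed tuning, choosing between CPU and GPU inference engines, adjusting context length, applying quantization levels, and accelerating inference with GPUs. Finally, we cover observability and monitoring practices, such as logging, collecting metrics with Prometheus and Grafana, and tracing with Jaeger to monitor model performance and usage. The chapter closes with hands-on exercises that guide you through deploying DMR in development and production scenarios.

This chapter covers the following main topics: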

Introduction to Docker Model Runner
Installing and configuring Docker Model Runner
Running your first AI model
Model serving and API integration
Building AI applications using Docker Model Runner
Performance optimization and GPU configuration
Observability and monitoring

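By the end of this chapter, you will understand how to treat AI models as first-class services in a Docker environment. We use concrete examples and configs throughout to help you apply these concepts to your own projects. Let's begin by exploring what Docker Model Runner is and why it is a game-changer for local AI model deployment.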

Technical requirements

To follow the examples in this chapter, make sure you meet the following prerequisites:

Software:

Docker Desktop 4.40+ (macOS, Windows 10/11) or Docker Engine (Linux)
Docker Model Runner (available through Docker Desktop, or as docker-model-plugin on Linux)

Features:

Docker Model Runner is enabled and accessible through the Docker CLI (docker model commands)
(Optional) GPU support is enabled for accelerated inference, if available

Environment:

Docker Desktop or Docker Engine is installed and configured
You have access to a terminal with the Docker CLI

Introduction to Docker Model Runner

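After completing this section, you will be able to explain the DMR architecture at the system level, including the runner, the inference engine, and the model store, and map these components to real developer workflows. You will also be able to choose a suitable inference engine, such as llama.cpp or vLLM, based on platform, model format, and throughput requirements. In addition, you will be able to judge when DMR is a better fit than cloud-based APIs by weighing data privacy, latency predictability, cost control, and the need for offline development environments.

Docker Model Runner, or DMR for short, is a native Docker tool for running AI models locally. It integrates model execution into the Docker ecosystem, so you can pull, run, and manage models with familiar Docker commands and workflows.

Its main goal is to let developers run LLMs and other AI models on their own machines with full control, instead of relying on cloud APIs. This means you avoid cloud inference costs and protect sensitive data by keeping inference on your own hardware.

Before going deeper, let's take a closer look at the key features that make Docker Model Runner a practical, developer-friendly tool for local AI inference.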

Key features of DMR

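DMR treats models as OCI artifacts, much like container images: they can be stored in registries such as Docker Hub or Hugging Face and pulled on demand. It exposes an OpenAI-compatible REST API for inference, so your applications can use local models with minimal code changes.

It also supports GPU acceleration natively; if you have a suitable GPU, DMR can use it for faster inference. Just as importantly, DMR integrates with Docker Compose, the Docker Desktop UI, and other Docker tools, so AI models become a seamless part of your development workflow.

Now that you have a high-level view of what DMR does, let's explore its internal architecture and break down the core components that make it work.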

Architecture and components

Under the hood, the Docker Model Runner architecture consists of three main components:

Model Runner Backend: This is the core service, sometimes called model-runner, that manages and executes model inference. It is a native host process rather than a container; it loads model files and runs an inference engine for each model. Running natively lets it use system resources (CPUs, GPUs) directly, without container overhead, which yields better performance.

The backend includes a scheduler / loader that loads models into memory on demand when a request arrives and unloads them after a period of inactivity to free resources. It also includes an installer subsystem that can fetch additional binaries when needed, such as GPU-specific libraries or alternate engines.

Model CLI Plugin (docker model): This is a Docker CLI extension (docker-model) that provides commands for working with models. It works like the docker image or docker container commands, except that it operates on models, so you can pull models, list models, configure settings, run models, and so on.

The CLI is the interface you use in the terminal. Internally, it talks to the model runner backend through API calls to carry out operations. Because it is a standard Docker CLI plugin, it automatically respects Docker contexts and works seamlessly across Docker environments (Desktop vs. Engine).

Model distribution and storage: DMR introduces a model distribution specification, sometimes called model-spec, that defines how model files are packaged and stored as OCI artifacts. When you pull a model, for example ai/mymodel:tag, the model files are downloaded and stored in a local model cache.

This storage is separate from Docker image storage, because model files can be very large and have specific requirements; for efficiency, they are usually memory-mapped. The model distribution component handles the interaction with registries, so docker model pull knows how to fetch content from Docker Hub or Hugging Face. In essence, it resembles Docker image distribution, but tailored to AI models, with support for model metadata, versioning, and so on.

Figure 3.1: Docker Model Runner architecture

With this architectural view in mind, the next step is to understand what actually happens when you run a model, and how inference is handled behind the scenes.

How inference works

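Once a model has been pulled and cached locally, the model runner backend spawns a suitable inference engine process when you actually use the model. Notably, DMR currently supports two inference backends out of the box: llama.cpp and vLLM.

By default, it uses llama.cpp, an efficient C++ LLM engine that runs on CPU (with optional GPU offloading) and supports quantized model files in GGUF format. llama.cpp is well suited to local development and smaller models because it is lightweight and resource-efficient.

The other option is vLLM, an engine optimized for high-throughput serving on NVIDIA GPUs, which uses models in Safetensors format. We discuss performance tuning later; the key point here is that DMR can route inference requests to different engines based on the model format or configuration. It is flexible by design.

In fact, supporting multiple backends was a deliberate design goal: DMR started with llama.cpp, but the architecture leaves room for other engines, such as vLLM today and possibly ONNX, PyTorch, and others in the future.

Because model serving is only useful when applications can interact with it easily, let's look next at how DMR exposes an OpenAI-compatible API and what that means for integration.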

OpenAI-compatible API

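One of DMR's standout features is a RESTful API that mimics OpenAI's API endpoints. The model runner backend runs an API server that accepts chat completions, completions, and embeddings requests, just like the OpenAI cloud API.

Your application sends a JSON request and receives a response in the same format as the OpenAI service, including fields such as id, choices, and usage. This compatibility greatly reduces integration effort: if your app uses the OpenAI SDK or curl calls, you only need to point it at the local DMR endpoint, and it runs with minimal or even zero code changes.

Finally, to round out the conceptual picture, let's look at how models are managed across their whole lifecycle in the Docker ecosystem.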

Model lifecycle

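With DMR, models are first-class citizens in the Docker environment. You can list models (docker model ls), inspect them (docker model inspect), configure parameters (such as context length or performance flags), and remove them (docker model rm), much as you manage container images.

Figure 3.2: Docker Model Runner lifecycle: Pull, Run, Serve, and Manage

Models are loaded into memory only when needed. For example, when you run a prompt against a model, the runner loads it; after it has been idle for a while, the runner may unload the model to free memory. When necessary, DMR queues requests instead of failing outright. For example, if you send concurrent requests that exceed available memory, DMR serializes them, or loads and unloads models in between, rather than returning HTTP 503 errors.

This design acknowledges a simple fact: models are very large, and on developer machines in particular you may not be able to keep every model in RAM all the time.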
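To make the load-on-demand and unload-when-idle behavior concrete, here is a minimal Python sketch of the idea. It is a toy illustration, not DMR's actual scheduler: the class name LazyModelCache, the loader callable, and the idle timeout are all hypothetical.

```python
import time


class LazyModelCache:
    """Toy model cache: load on first use, unload after an idle timeout."""

    def __init__(self, loader, idle_seconds, clock=time.monotonic):
        self._loader = loader              # hypothetical callable: name -> loaded model
        self._idle_seconds = idle_seconds
        self._clock = clock                # injectable so the behavior is easy to test
        self._loaded = {}                  # name -> (model, last_used_timestamp)

    def get(self, name):
        """Return the model, loading it only if it is not already in memory."""
        self.evict_idle()
        if name in self._loaded:
            model = self._loaded[name][0]
        else:
            model = self._loader(name)
        self._loaded[name] = (model, self._clock())
        return model

    def evict_idle(self):
        """Drop every model that has not been used for idle_seconds."""
        now = self._clock()
        idle = [n for n, (_, used) in self._loaded.items()
                if now - used >= self._idle_seconds]
        for name in idle:
            del self._loaded[name]

    def loaded_names(self):
        """Names of the models currently held in memory."""
        return sorted(self._loaded)
```

The first call to get() pays the loading cost, later calls reuse the loaded model, and a model that stays unused for longer than the timeout is dropped and reloaded on its next use.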

In summary, the Docker Model Runner architecture extends Docker so that it can manage AI models much as it manages containers, bridging the gap between containerized apps and AI inference. Next, we get hands-on and set up DMR on your system.

Installing and configuring Docker Model Runner

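After completing this section, you will be able to enable DMR in Docker Desktop and verify that it works correctly using commands such as docker model status and a first model run. You will also learn how to install DMR on Linux using docker-model-plugin, and understand what the docker model install-runner command configures behind the scenes. In addition, you will be able to configure host-side access safely: managing ports, setting CORS sensibly, and deciding where applications should connect from, either directly from the host or through a container network.

How you set up Docker Model Runner depends on your environment. There are two main scenarios:

Docker Desktop (Windows / macOS): DMR ships as an optional feature in Docker Desktop (v4.40+). You only need to enable it.

Docker Engine on Linux: DMR is provided as a CLI plugin package (docker-model-plugin) for Docker Engine (v24+). You install it through your package manager.

We walk through both approaches below, and cover how to enable GPU support and verify the installation.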

Enabling DMR on Docker Desktop

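If you use Docker Desktop, make sure you are on version 4.40 or later, the minimum listed in the technical requirements (Windows support arrived in 4.41). You enable DMR through the Docker Desktop settings:

Open Docker Desktop's settings / preferences. Navigate to the AI section (Docker Desktop 4.46+ has a dedicated AI settings tab; in 4.45 and earlier, Model Runner is under "Beta features").
Switch Enable Docker Model Runner to on. Also enable host-side TCP support. Docker may prompt you to restart to apply the changes; restart if asked.
(Optional) If you have a supported NVIDIA GPU and want to use it, also enable GPU-backed inference in the AI settings. This lets DMR download and use a GPU-enabled inference backend, such as vLLM or GPU-accelerated llama.cpp. GPU details are discussed later; for now, it is enough to know that this switch turns on GPU support for DMR in Docker Desktop.

Once enabled, a new Models tab appears in the Docker Desktop dashboard UI. You can also use the docker model CLI commands from a terminal. For example, try running: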

docker model version

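If DMR is enabled correctly, this prints the Model Runner version, confirming that the plugin is installed.

If you are using Docker Engine on a Linux machine rather than Docker Desktop, the setup is slightly different. Let's walk through the installation for a Linux environment next.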

Installing DMR on Docker Engine (Linux)

On a Linux server or development machine running Docker Engine (no GUI), Docker Model Runner is distributed as a CLI plugin package. It is currently available for mainstream distros such as Ubuntu and Red Hat. Installation is straightforward.

For Ubuntu / Debian

Install docker-model-plugin with apt:

sudo apt-get update
sudo apt-get install docker-model-plugin

For RPM-based (Fedora, RHEL, CentOS, and so on)

Use dnf or yum:

sudo dnf update
sudo dnf install docker-model-plugin

This downloads the Docker model CLI plugin and installs it into Docker's plugin directory, which is usually:

/usr/lib/docker/cli-plugins/

After installing, verify it by checking the version or the help command. For example:

docker model version
docker model --help

You should see that the Model Runner commands are available. You can even do a quick test run:

docker model run ai/smollm2 "Hello"

If you see:

docker: 'model' is not a docker command

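the Docker CLI did not find the plugin. This can happen when the plugin is not in the expected directory. The fix is to create a symlink in Docker's CLI plugins directory. For example, on macOS, linking the plugin from Docker's Resources folder into ~/.docker/cli-plugins/ resolves it. On Linux with the official packages, this problem is less common.

Once DMR is installed and running, the next step is to make sure it is up to date and configured correctly, especially if you plan to use GPU acceleration to improve performance.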

Updating and enabling GPU support

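Docker is actively improving Model Runner, so you may need to update it from time to time. On Docker Desktop, DMR updates are included in Docker Desktop releases. On Docker Engine, when a new version is available, update it through your package manager, for example:

sudo apt-get update && sudo apt-get install --only-upgrade docker-model-plugin

The documentation also mentions a command that uninstalls and reinstalls the runner to ensure a clean update: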

docker model uninstall-runner --images && docker model install-runner

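This command removes the current model runner backend and downloads the latest version again, without deleting the models you have already pulled (unless you add --models to remove them as well).

For GPU support: if you enabled GPU in the Docker Desktop settings, Docker takes care of pulling the required backend, such as a CUDA-enabled engine build. On Linux, you need to install the GPU backend manually after installing the plugin. Docker provides a CLI command for this: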

docker model install-runner --backend vllm --gpu cuda

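This command (as shown in Docker's vLLM integration announcement) fetches and installs the vLLM engine with CUDA support. Because it includes the NVIDIA libraries and the engine, the download can be quite large.

After installation, DMR can run models on the GPU. You can first run a command such as nvidia-smi on the host to verify that the GPU is accessible, and then test a model that requires vLLM (we do this later).

If you use AMD GPUs or other accelerators, note that current support is aimed mainly at NVIDIA (CUDA); there is some limited support for AMD and Intel through llama.cpp's ROCm and OpenCL backends.

At this point, you have enabled Docker Model Runner, confirmed that the CLI works, and configured your environment, whether on Docker Desktop or Linux, and with or without GPU support. Now it is time for some hands-on work. Next, we run our first AI model and see DMR in action.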

Running your first AI model

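After completing this section, you will be able to pull and run small models through the CLI and the Docker Desktop UI, and understand how on-demand loading works behind the scenes. You will also learn how to inspect and adjust model configurations, such as context size and runtime flags, in a controlled and predictable way. In addition, you will be able to debug model behavior effectively with logs and status commands, to monitor performance and troubleshoot issues.

With Docker Model Runner set up, running an AI model locally is as simple as running a container. In this section, we walk through pulling a model and executing a test inference. To get started quickly, we use a small model and demonstrate both CLI usage and API usage.

Discovering and pulling a model

Docker Hub hosts a curated set of open-source AI models under the ai namespace (you can also pull from the Hugging Face Hub). For a first test, we use a relatively small LLM so that it runs quickly on most machines. A good choice is: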

ai/smollm2:360M-Q4_K_M

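This is an instruct-tuned model with 360 million parameters, quantized to 4 bits (Q4_K_M) for efficiency. It is small by modern standards, but it is enough to verify that everything works.

You can find models through the Docker Hub UI or the CLI.

Figure 3.3: List of models available on Docker Hub for Docker Model Runner

If you prefer the command line, just run: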

docker model pull ai/smollm2:360M-Q4_K_M

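This fetches the model from Docker Hub and stores it locally. The naming works as it does for images: here ai/smollm2 is the repository, and 360M-Q4_K_M is the tag that identifies the specific model variant.

In this example, 360M denotes the model size and Q4_K_M denotes 4-bit quantization. Quantized models are much smaller; this one is only a few hundred MB, so it downloads quickly and runs on a CPU.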
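The naming scheme can be sketched as a small parser. This is a minimal Python illustration under stated assumptions: the helper names are hypothetical, defaulting a missing tag to latest mirrors the container image convention rather than documented DMR behavior, and the size-then-quantization tag layout is the convention of this example, not a rule.

```python
from typing import NamedTuple


class ModelRef(NamedTuple):
    namespace: str
    name: str
    tag: str


def parse_model_ref(ref: str) -> ModelRef:
    """Split 'namespace/name:tag' into its parts."""
    repo, _, tag = ref.partition(":")
    namespace, _, name = repo.rpartition("/")
    # Assumption: a missing tag is treated as "latest", as with images.
    return ModelRef(namespace, name, tag or "latest")


def split_variant(tag: str):
    """Split a tag such as '360M-Q4_K_M' into (size, quantization)."""
    size, _, quantization = tag.partition("-")
    return size, quantization


ref = parse_model_ref("ai/smollm2:360M-Q4_K_M")
print(ref.namespace, ref.name, split_variant(ref.tag))
```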

When the pull completes, confirm that the model is listed:

docker model ls

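You should see ai/smollm2:360M-Q4_K_M in the output, possibly alongside other models you pulled earlier. At this point, the model weights are cached in the Docker model store on your machine. You do not need to download them again unless you remove them or pull an updated version.

Now that we have the model, let's run a simple inference.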

Running the model via CLI

Docker Model Runner provides the docker model run CLI command for invoking a model directly from the terminal. This is handy for quick tests and interactive experimentation.

Try running the model with a prompt:

docker model run ai/smollm2:360M-Q4_K_M "Hello, how are you?"

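On the first run, DMR starts the inference engine in the background (llama.cpp for this GGUF model). Because the model has to be loaded into RAM, there may be a one-time initialization of a few seconds. It then prints the model's response to the prompt. For example: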

> Hello, how are you?
I'm just a machine, but I'm functioning normally! How can I assist you today?

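The exact output will differ, because language models are nondeterministic by default. You should still get a coherent answer.

If you run the command without a prompt argument, it may start an interactive REPL where you can enter prompts one after another. For a quick test, supplying the prompt inline as we did is simpler; it runs the prompt and exits.

Under the hood, docker model run is essentially a convenience command that wraps an API call to the local model. It is a good way to verify that everything works. If you hit errors, check docker model logs, or the Models | Logs UI in Docker Desktop, for clues.

Common problems include running out of memory, although that is unlikely with a 360M model, or a missing backend if you try to run a GPU model without GPU support enabled.

Now that we have confirmed the model runs through the CLI, let's see how to use the API interface for more programmatic integration.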

Serving the model and making API requests

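When you run a model, the Model Runner backend also starts an HTTP server that serves requests for that model. The API follows OpenAI's API patterns, so it plugs easily into existing tools and SDKs.

On Docker Desktop, however, TCP host access is not enabled by default. So if you immediately try to reach: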

http://localhost:12434

you get:

Failed to connect to localhost port 12434

To fix this, enable host-side TCP support from the Docker Desktop UI (under the Model Runner settings), or run the following from the CLI:

docker desktop enable model-runner --tcp 12434

Once that is done, the server is reachable at:

http://localhost:12434

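If you are running Docker Engine on Linux, TCP access is enabled by default on the same port, so you will not run into this problem.

Next, we simulate what an application would do: using curl as an example client, we send an HTTP request to the model and get a result back.

For a chat model such as smollm2, the right endpoint is the Chat Completions API. We send a chat-style prompt with the user role.

Run the following curl command (line breaks added for readability):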

curl http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2:360M-Q4_K_M",
    "messages": [
      {"role": "user", "content": "Explain Docker in simple terms."}
    ]
  }'

Let's break this down:

We are POSTing to:

http://localhost:12434/engines/v1/chat/completions

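The path contains /engines/v1/ because DMR supports multiple engines; this form of the path does not name a specific engine.

When the engine is omitted, DMR falls back to its primary engine, which in this case is llama.cpp. To be more explicit, you could also write: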

/engines/llama.cpp/v1/chat/completions

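but this is not required.

The JSON body contains the exact name of the model we pulled, plus a messages list in the ChatGPT-style format: we supply one user message asking for an explanation of Docker.

If all goes well, the response is a JSON payload like the one the OpenAI API returns. For example: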

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "ai/smollm2:360M-Q4_K_M",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Docker is a platform that allows developers to package applications into containers. These containers bundle the application's code with all its dependencies, ensuring it runs the same everywhere. In simple terms, Docker helps create a consistent environment so an app can run reliably on different computers."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 18,
    "completion_tokens": 56,
    "total_tokens": 74
  }
}

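The actual wording will vary, but you can see the structure: the assistant message content holds a simple explanation of Docker, and the usage section shows how many tokens the prompt and the response consumed.

This format confirms that DMR is faithfully emulating the OpenAI API. The local request needs no API key (the DMR local API requires no authentication by default). Our data never left the machine, and there is no token cost, unlike with an OpenAI cloud model.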
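The same request and response handling can be sketched in Python using only the standard library. This is a minimal illustration that assumes the host-side endpoint and model used in this chapter; the helper names (build_chat_request, extract_reply, ask) are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:12434/engines/v1"  # host-side endpoint used in this chapter


def build_chat_request(prompt, model="ai/smollm2:360M-Q4_K_M", base_url=BASE_URL):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(payload):
    """Return (assistant text, total token count) from a chat completion payload."""
    text = payload["choices"][0]["message"]["content"]
    return text, payload["usage"]["total_tokens"]


def ask(prompt, timeout=60):
    """Send the request; needs a running Model Runner with TCP access enabled."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

Calling ask("Explain Docker in simple terms.") should return the assistant's text together with the total token count from the usage section, provided the endpoint is reachable.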

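We have now successfully treated a local model as a service. In a real scenario, you can point your application at localhost:12434 and use this model in place of an external API.

In the next section, we explore how to integrate the DMR model service into applications more systematically, using APIs, SDKs, and Docker Compose for multi-container setups.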

Model serving and API integration

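By the end of this section, you will be comfortable interacting with Docker Model Runner through its OpenAI-compatible endpoints, including knowing which base URL to use when your application runs on the host and when it runs inside a container. You will see that adapting existing applications is easy: you configure a custom base_url for the OpenAI SDK and reference Docker Hub identifiers such as ai/.... Along the way, you will explore practical integration patterns, such as enabling streaming responses and adding simple production safeguards like timeouts, retry logic, and maximum token limits, to make applications more resilient and reliable.

Running a model is useful, but the real power comes from integrating it into applications through the API. Docker Model Runner's OpenAI-compatible API means you can plug it into existing tools and frameworks with minimal changes. In this section, we cover common ways to interact with the model service, from simple curl calls (as we did earlier) to the OpenAI SDKs, and discuss how to handle model endpoints and parameters. We also introduce the available API endpoints and what they return.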

API endpoints overview

Docker Model Runner exposes two sets of REST endpoints: OpenAI-style endpoints for inference, and Docker-style endpoints for model management.

Figure 3.4: Docker Model Runner OpenAI-compatible API and SDK integration

For serving models, we focus on the OpenAI-compatible endpoints:

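POST /v1/chat/completions: The chat completion endpoint. It expects a JSON body containing model and messages (with optional generation parameters) and returns a chat completion containing choices, as in the earlier example.

POST /v1/completions: The completion endpoint, for non-chat (instruction) models or backward compatibility. You supply a prompt and get a completion back. In practice, chat models have largely replaced this interface, but it remains available for older models and interfaces.

POST /v1/embeddings: The embeddings endpoint. You send a piece of text and get a vector embedding back. This is useful for semantic search, vector databases, and similar use cases. DMR supports this endpoint when the model is capable of generating embeddings.

GET /v1/models: Lists the available models, like OpenAI's list models endpoint. DMR can list the models you have pulled as well as built-in models. There is also GET /v1/models/{model}, which returns information about a specific model.

In the URLs above, we omitted the engines/ prefix for simplicity. As mentioned earlier, the DMR API has an "engine" concept in the path, which allows for multiple inference backends. In current versions, the default engine is llama.cpp, and you can call /v1/... or /engines/v1/... interchangeably.

If you explicitly use a different engine, such as vLLM, the engine name may be part of the model identifier or set internally at installation time. Most of the time you do not need to care: just use the model name, and DMR knows which engine to route to. For example, if the model file is in safetensors format, it selects vLLM automatically.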
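As a small illustration of the path layout described above, the following Python sketch builds endpoint URLs with or without an explicit engine name. The helper name endpoint_url is hypothetical, and the default host assumes the host-side TCP port used in this chapter.

```python
def endpoint_url(path, engine=None, host="http://localhost:12434"):
    """Join host, optional engine name, and an OpenAI-style path.

    With no engine, the URL uses the generic /engines/v1/ prefix and DMR
    picks the engine; with an engine, the name is inserted into the path.
    """
    prefix = f"engines/{engine}/v1" if engine else "engines/v1"
    return f"{host.rstrip('/')}/{prefix}/{path.lstrip('/')}"


print(endpoint_url("chat/completions"))
print(endpoint_url("chat/completions", engine="llama.cpp"))
```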

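One important difference from OpenAI is that DMR endpoints require no authentication token by default. DMR runs locally, so it does not expect an Authorization: Bearer ... header.

If your code adds an OpenAI API key automatically, DMR ignores it. Just be aware that on a shared server the DMR endpoint is not secured; it is intended for local development.

Now that you know which endpoints exist and what they return, let's see how to call them from real applications using the familiar OpenAI SDKs.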

Using the API with OpenAI SDKs

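The big benefit of API compatibility is that you can query a local model with the official OpenAI SDKs, or with any library built around the OpenAI API. Let's see how this works in practice.

Python (OpenAI Python SDK)

If you have the openai Python package installed (version 1.0 or later), you can point it at the DMR endpoint by setting the client's base URL. For example:

from openai import OpenAI

# Point the client at the local DMR endpoint instead of the OpenAI cloud
client = OpenAI(
    base_url="http://localhost:12434/engines/v1",  # Base URL for local DMR
    api_key="not-used",  # DMR doesn't require a key, but the SDK expects one to be set
)

response = client.chat.completions.create(
    model="ai/smollm2:360M-Q4_K_M",
    messages=[{"role": "user", "content": "What is Docker Compose?"}],
)
print(response.choices[0].message.content)

In this snippet, we set base_url to the local Model Runner address (including the /engines/v1 part, to match the DMR API). We still pass the model name in the request; the SDK includes it in the JSON payload. Releases of the package before 1.0 used module-level settings (openai.api_base) and openai.ChatCompletion.create instead; those calls were removed in 1.0.

The api_key is set to a dummy value only to satisfy the library; DMR does not check it. When you call client.chat.completions.create, the OpenAI SDK actually sends an HTTP request to:

http://localhost:12434/engines/v1/chat/completions

with the given payload.

The response we receive has the same structure as a response from OpenAI. So:

response.choices[0].message.content

might print something like:

Docker Compose is a tool that allows you to define and manage multi-container Docker applications…

That is the model's answer.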

Node.js (`openai` npm package)

In Node or TypeScript, you can achieve the same effect by configuring a custom base URL on the OpenAI client. For example, using the official library (version 4 or later):

const OpenAI = require("openai");

const client = new OpenAI({
  apiKey: "unused",
  baseURL: "http://model-runner.docker.internal/engines/v1"  // for Docker Desktop containers
});

async function main() {
  const completion = await client.chat.completions.create({
    model: "ai/smollm2:360M-Q4_K_M",
    messages: [{ role: "user", content: "What is Docker Compose?" }]
  });
  console.log(completion.choices[0].message.content);
}

main().catch(console.error);

Note that here we use model-runner.docker.internal as the host in baseURL. This is a special DNS name provided by Docker Desktop that lets containers reach the Model Runner on the host. In this example, we assume your Node app itself runs as a container (perhaps through Compose).

Using model-runner.docker.internal ensures that the request reaches the Model Runner on the host. If your code runs directly on the host machine, use localhost instead. The rest is straightforward: the OpenAI client sends a chat completion request, and you read the answer from completion.choices. Version 3 of the library used Configuration, OpenAIApi, basePath, and completion.data instead.

Ollama API compatibility

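DMR also mentions support for Ollama-compatible APIs. Ollama is another local model runner. Unless you specifically need it, you can ignore this; it is essentially an alternative endpoint format. For most developers, the OpenAI API is the main integration target.

Once you are comfortable with basic requests, the next step is to understand how to fine-tune model behavior by passing generation parameters through the API.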

Configuring model parameters via API

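When you send requests to the model API, you can pass parameters that control generation, much as with OpenAI: temperature, max tokens, and so on. DMR's engines support many of these parameters:

Temperature (temperature in the JSON): Controls randomness. For example, temperature: 0.0 gives near-deterministic output and 1.0 gives more creative output. If unspecified, the model default or engine default applies, usually around 0.8.

Max Tokens (max_tokens): Limits the response length. By default, the model will use the full context if it needs to. If you only want a short answer, you could set this to 100, for example.

Top-p, Top-k, and so on: You can pass nucleus sampling and other sampling controls, for example top_p: 0.9 or top_k: 40, to restrict sampling to the more likely tokens. These have the same meaning as in the OpenAI API.

Stop Sequences (stop): Provide one or more stop strings; generation stops when it encounters one of them, as with OpenAI.

Not every OpenAI API parameter is fully supported. For example, frequency_penalty and presence_penalty take effect only if the underlying engine supports them, and that support can vary by engine and version, so test any parameter you rely on. If you pass unsupported parameters, DMR will most likely ignore them.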
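The parameters above can be collected into a request body with a small helper. This is a minimal Python sketch with a hypothetical function name; it includes only the options the caller sets, so unset parameters fall back to the model or engine defaults.

```python
def chat_payload(model, prompt, *, temperature=None, max_tokens=None,
                 top_p=None, stop=None):
    """Build a chat completions body, including only the options that were set."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    options = {
        "temperature": temperature,  # randomness of sampling
        "max_tokens": max_tokens,    # upper bound on response length
        "top_p": top_p,              # nucleus sampling threshold
        "stop": stop,                # string or list of strings that end generation
    }
    # Leave out unset options so the engine defaults apply.
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload


body = chat_payload("ai/smollm2:360M-Q4_K_M", "Summarize Docker in one line.",
                    temperature=0.0, max_tokens=100)
print(sorted(body))
```

Note that temperature=0.0 is kept in the body because the filter tests for None rather than for falsy values.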

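One special parameter is the model context length. In the OpenAI API, you cannot set the context length directly per request; it is fixed for the model. In DMR, however, you can configure the model's context window (covered in detail in the next section). If you configure a model with a context of 2048 or 4096 tokens, that is a persistent setting, not a per-request parameter.

With request-level controls covered, let's zoom out and discuss how to structure an application so that the model endpoint itself is configurable and easy to switch between local and cloud environments.

Handling the model endpoint in applications

When integrating DMR into an application, you usually want both the base URL and the model name to be configurable, so that you can switch easily between local and cloud. For example, you might have environment variables or config file entries like these: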

LLM_API_BASE=http://localhost:12434/engines/v1
LLM_MODEL=ai/smollm2:360M-Q4_K_M

In your code, you use LLM_API_BASE instead of a hardcoded OpenAI URL, and LLM_MODEL instead of a fixed model name. In production, you can then set LLM_API_BASE to the OpenAI endpoint and LLM_MODEL to gpt-4 or a similar model, while in development you use the local values. Many teams adopt this pattern: cloud models in production, and local models in dev / testing to save costs.
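A minimal Python sketch of this configuration pattern follows. The function name load_llm_config and the choice to fall back to the local DMR values when the variables are unset are assumptions made for illustration.

```python
import os

# Local development defaults, matching the values shown above.
LOCAL_DEFAULTS = {
    "LLM_API_BASE": "http://localhost:12434/engines/v1",
    "LLM_MODEL": "ai/smollm2:360M-Q4_K_M",
}


def load_llm_config(env=None):
    """Return (base_url, model), falling back to the local DMR values."""
    env = os.environ if env is None else env
    return (
        env.get("LLM_API_BASE", LOCAL_DEFAULTS["LLM_API_BASE"]),
        env.get("LLM_MODEL", LOCAL_DEFAULTS["LLM_MODEL"]),
    )


# In production, the same code picks up the cloud values from the environment.
print(load_llm_config({"LLM_API_BASE": "https://api.openai.com/v1",
                       "LLM_MODEL": "gpt-4"}))
```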

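Docker Model Runner supports this pattern by providing a consistent environment through Docker Compose, as we see in the next section. Even without Compose, the core idea holds: abstract the backend behind configuration.

If OpenAI compatibility is sufficient, your application will not notice the difference, apart from response speed and the fact that local models may be somewhat lower in quality than the latest OpenAI model.

Finally, remember to monitor resource usage when you use DMR in an application. Every API call consumes CPU / GPU and memory on your machine. If your app issues many parallel calls and the model is large, the machine may slow down.

Performance tuning is discussed later, but as a rule of thumb, treat the DMR endpoint like any other limited resource in your architecture. If needed, implement queueing or limit concurrency in your app to avoid overloading the model.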
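One way to apply this rule of thumb is to wrap whatever function sends the request in a concurrency cap with simple retries. The sketch below is a hypothetical Python illustration: the class name LimitedClient, the default limits, and the choice of which exceptions to retry are assumptions, not anything DMR prescribes.

```python
import threading
import time


class LimitedClient:
    """Wrap a request function with a concurrency cap and simple retries."""

    def __init__(self, send, max_concurrent=2, retries=3, backoff_seconds=0.5,
                 sleep=time.sleep):
        self._send = send                        # hypothetical callable performing one request
        self._slots = threading.Semaphore(max_concurrent)
        self._retries = retries
        self._backoff = backoff_seconds
        self._sleep = sleep                      # injectable so tests do not really wait

    def call(self, payload):
        """Send one request, waiting for a free slot and retrying transient failures."""
        with self._slots:                        # blocks while max_concurrent calls are in flight
            for attempt in range(self._retries):
                try:
                    return self._send(payload)
                except (TimeoutError, ConnectionError):
                    if attempt == self._retries - 1:
                        raise                    # out of retries: surface the error
                    # Exponential backoff: 0.5s, 1s, 2s, ...
                    self._sleep(self._backoff * 2 ** attempt)
```

The slot is held during the backoff sleep, which keeps pressure off a struggling model server at the cost of throughput; releasing it between attempts is an equally reasonable design.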

That covers the basics of API integration. Next, we build a more complete application setup in which the app and the model are orchestrated together through Docker Compose.

Building AI applications using Docker Model Runner

In this section, you will learn how to design a multi-service architecture in which Docker Model Runner acts as the inference backend, while integration details such as model endpoints and configuration stay out of the application code. You will explore how to use Docker Compose "models" as a declarative dependency layer, including the Docker and Compose versions this workflow requires. You will also see how to package and publish a custom GGUF model as an OCI artifact, which gives you full control over distribution when you need your own model delivery strategy.

We demonstrate how to define models in a Compose file, show how the model service and the app service connect, and provide examples of configuration and environment variable bindings. We also discuss some real-world patterns, such as using environment variables and switching between local and remote deployments. To make this concrete, let's first look at how models are declared directly in compose.yaml and how a service references them.

Compose models: defining a model in `compose.yaml`

Docker Compose introduces a top-level key in the Compose file format called models. Here you list the AI models your application needs, much as you list services or volumes. Each model entry specifies which model image to use, along with any configuration, such as context length and runtime flags. In the services section, you can attach one or more models to a service.

Let's look at a basic example. Suppose a Compose application has one service (a web app) that needs an LLM. We can define it like this:

services:
  chat-app:
    image: my-chat-app:latest
    models:
      - llm  # referring to a model named "llm"

models:
  llm:
    model: ai/smollm2:360M-Q4_K_M

In this snippet, models: - llm under services.chat-app means that the service uses a model named llm. Then, under the top-level models: key, we define what llm is: specifically, it uses the ai/smollm2:360M-Q4_K_M model artifact.

This Compose file tells Docker that when we deploy chat-app with Docker Model Runner enabled, it should make sure the ai/smollm2:360M-Q4_K_M model is pulled and available, and effectively link it to the app.

What happens when you run docker compose up? If Docker Model Runner is enabled on Docker Desktop or on Docker Engine with DMR, Compose does the following:

Pulls the my-chat-app:latest image (your application).
Pulls the ai/smollm2:360M-Q4_K_M model (if it has not been pulled yet).
Starts the model runner backend (if it is not already running), which loads the model when the app tries to use it.
Provides the app with environment variables for connecting to the model.

The last point is crucial. Docker Compose injects the connection info into the environment of the chat-app service. By default, the short syntax above sets two variables: LLM_URL and LLM_MODEL.

LLM_URL is the HTTP address at which the app can reach the model API; LLM_MODEL is the model identifier, the same value we wrote in the compose file.

For example, Compose might set the following in the container:

LLM_URL=http://model-runner.docker.internal/engines/v1
LLM_MODEL=ai/smollm2:360M-Q4_K_M

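The URL may also be an IP address, depending on the environment. The core idea is that it points to a DMR endpoint reachable from the container. Your app reads these environment variables to work out how to call the model.

If you use the OpenAI Python SDK, you can combine them, for example by creating the client with:

client = OpenAI(base_url=os.environ["LLM_URL"], api_key="not-used")

and then passing the following in your calls:

model=os.environ["LLM_MODEL"]

Compose model binding also supports a service that uses multiple models. For example, the same app might have an llm for a chatbot and an embedding-model for vector embeddings. You list both under the service: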

services:
  app:
    image: my-app
    models:
      - llm
      - embedding-model

models:
  llm:
    model: ai/smollm2:360M-Q4_K_M
  embedding-model:
    model: all-minilm-l6-v2-vllm

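In this case, the container receives the LLM_URL and LLM_MODEL environment variables, as well as EMBEDDING_MODEL_URL and EMBEDDING_MODEL_MODEL. Each variable is prefixed with the uppercase name of the model reference (here, EMBEDDING_MODEL_*).

The URL variable says where to POST requests, and the MODEL variable says which model to request. Usually the two URLs are identical, both pointing at the same model runner endpoint, and only the model ID differs. Conceptually, keeping them separate allows models to be served in different ways.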
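The naming rule described above can be sketched in a few lines of Python. This mirrors the rule as stated in this section (uppercase the model reference and, as the EMBEDDING_MODEL_* example implies, turn hyphens into underscores); the function name is hypothetical, and this is not Compose's own implementation.

```python
def binding_env_names(model_ref):
    """Default variable names for a short-syntax model binding."""
    prefix = model_ref.upper().replace("-", "_")
    return f"{prefix}_URL", f"{prefix}_MODEL"


for ref in ("llm", "embedding-model"):
    print(ref, "->", binding_env_names(ref))
```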

If the default env var names do not suit you, Compose offers a long syntax for model bindings:

services:
  app:
    image: my-app
    models:
      llm:
        endpoint_var: AI_MODEL_URL
        model_var: AI_MODEL_NAME
      embedding-model:
        endpoint_var: EMBEDDING_URL
        model_var: EMBEDDING_NAME

models:
  llm:
    model: ai/smollm2
  embedding-model:
    model: all-minilm-l6-v2-vllm

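Here we explicitly name the env vars we want for each model binding. The service container receives AI_MODEL_URL / AI_MODEL_NAME for the first model and EMBEDDING_URL / EMBEDDING_NAME for the second.

This is purely a convenience for the app code; it is otherwise equivalent to the short syntax.

Defining a model is only part of the story. Next, we look at how to fine-tune model behavior directly in the Compose file.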

Configuring model parameters in Compose

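The models section in Compose is not only for the model identifier. You can also set model-specific configuration that is applied when the model runs. Two common options are:

context_size: Sets the maximum context length (in tokens) for inference with this model. Specify it here if you want to override the default, for example when the model supports up to 8192 tokens and you want to allow that. For example: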

models:
  llm:
    model: ai/smollm2
    context_size: 4096

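If omitted, DMR uses the model default or a safe default, typically 2048 or 4096. Keep in mind that a larger context_size increases memory usage significantly. Compose passes this setting to the model runner, with the same effect as running the following for that model:

docker model configure --context-size <N> <model>

runtime_flags: A list of advanced flags passed to the inference engine. These flags correspond to the command-line options of llama.cpp (or another engine). For example, you can set: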

models:
  llm:
    model: ai/smollm2
    runtime_flags:
      - "--threads"
      - "8"
      - "--temp"
      - "0.7"
      - "--top-p"
      - "0.9"

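This configures the model to use 8 threads, with a default sampling temperature of 0.7 and top-p of 0.9. It is roughly equivalent to running the following, where the engine flags follow a -- separator:

docker model configure ai/smollm2 -- --threads 8 --temp 0.7 --top-p 0.9

Common runtime_flags for llama.cpp include --threads, --batch-size, and --repeat-penalty; there are also GPU-specific flags, such as --n-gpu-layers, which offloads a given number of layers to the GPU.

You usually do not need to set these unless you are optimizing, but Compose provides the hooks for it. The performance section later discusses these flags further.

Now let's look at a combined example: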

models:
  llm:
    model: ai/qwen2.5-coder
    context_size: 8192
    runtime_flags:
      - "--no-prefill-assistant"
      - "--threads"
      - "16"

This could be the configuration for a code assistant model, "Qwen 2.5 - Coder": we want it to have an 8k-token context, disable assistant prefill (an engine-specific flag), and use 16 threads. Compose makes this configuration part of the deployment, which is valuable for transparency (everyone can see how the model is configured by reading the YAML) and improves reproducibility.

Now that we have covered model definitions and configuration, let's walk through what an end-to-end workflow looks like when you bring everything up with Docker Compose.

End-to-end application with Compose

Let's outline an end-to-end workflow using Docker Compose:

Write the Docker Compose file: As shown above, include your application service and the models it needs. For example, compose.yaml might contain a web service and an llm model.

Use Env Vars in App Code: Modify your application code to read the model endpoint info from the environment. For example, a Python Flask app can read LLM_URL and LLM_MODEL from os.environ at startup and use those values in every LLM call (perhaps wrapped in a helper function, call_model(prompt)).

Bring up the stack: Run:

docker compose up

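Compose starts the app container and makes sure the model is available. On Docker Desktop, the Model Runner backend runs on the host alongside Docker Desktop (as a native process, as described earlier), and Compose communicates with it. On Linux with Docker Engine, DMR runs on the host side, and containers reach it through a special host gateway address.

Application calls the model: When the app needs a completion, it sends an HTTP request to the URL in LLM_URL, for example: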

http://model-runner.docker.internal/engines/v1

The payload contains:

model: LLM_MODEL

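The request reaches Model Runner; if the model is not loaded yet, Model Runner loads it, generates the response, and returns the result. The app then uses that response, for example by returning it to the user through a web endpoint.

Iterate and scale: You can add more services or more models. You can run multiple app container replicas; they share a single model instance on the host, with each container calling the same Model Runner API independently. DMR is not a microservice per se: there is one model server per host, but Compose wires everything together neatly.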
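The call_model(prompt) helper mentioned in the workflow above could look something like the following Python sketch. It assumes LLM_URL ends with the /engines/v1 base path, as in the example values shown earlier, and uses only the standard library; the injectable post argument is a hypothetical convenience for testing without a running Model Runner.

```python
import json
import os
import urllib.request


def _post_json(url, body, timeout):
    """Default transport: POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def call_model(prompt, env=None, post=_post_json, timeout=60):
    """Send one user prompt to the model described by LLM_URL and LLM_MODEL."""
    env = os.environ if env is None else env
    # Assumption: LLM_URL is an OpenAI-style base URL (.../engines/v1).
    url = env["LLM_URL"].rstrip("/") + "/chat/completions"
    body = {
        "model": env["LLM_MODEL"],
        "messages": [{"role": "user", "content": prompt}],
    }
    return post(url, body, timeout)["choices"][0]["message"]["content"]
```

Because the endpoint and model come from the environment, the same helper works unchanged whether the values were injected by Compose or set by hand; the explicit timeout keeps a slow model from blocking the app indefinitely.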

Beyond local development, it is worth considering how this approach works across environments and what it means for portability.

Portability

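One powerful aspect of Compose models is that the same Compose file can potentially be used in a different context, such as a cloud platform that supports the Compose spec and model serving.

For example, imagine deploying the Compose app above to a cloud service where ai/smollm2 is not run locally but provided by a managed service. The Compose spec lets cloud providers use extension fields, such as x-cloud-options, to specify instance types for the model.

You can deploy the app without changing its code. In the cloud, LLM_URL might point to a remote endpoint provided by the platform, while LLM_MODEL might stay the same or map to some ID. To a degree, this future-proofs the application. Docker's vision is that you develop and test against local models with DMR, and then scale out in the cloud with the same app definition.

Before we close this section, let's look at some practical tips and patterns that help you avoid common pitfalls when integrating DMR into real applications.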

Real-world integration tips

The following tips capture common patterns, trade-offs, and safeguards that make your setup more robust and more production-aware:

Graceful fallback: If your app sometimes runs without Docker Model Runner, for example because a developer forgot to enable it, you can detect that LLM_URL is missing and then either raise an error or fall back to a cloud API. For example, use OpenAI if ENV=prod and DMR if ENV=local. This is more of an app logic concern, but it is worth noting.

Streaming responses: The DMR API supports streaming token responses, like stream: true in an OpenAI request. If your app expects streamed data (chunked responses), make sure you handle it in the same way. Compose model binding does not change this; streaming still works through the same endpoint. Just remember that a local model may stream tokens more slowly than a large OpenAI model, so test the UX.

Multiple models or versions: If you want to A/B test model versions, you can bind multiple models and have your code call one or the other. You can also run two different model types side by side, for example a smaller, faster model for simple queries and a larger model for complex ones. Compose pulls both and gives you separate endpoints.

Error handling: Handle errors from the local model as you would handle external API errors. For example, if the model runner runs out of memory or hits a bug, it may return HTTP 500 or time out. Your app should not freeze indefinitely. The DMR logs (and Docker logs) help you debug model crashes or evictions.
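For the streaming tip, the following Python sketch shows how a client might turn a streamed response into text. It assumes the stream follows the OpenAI server-sent events format (lines beginning with data:, JSON chunks carrying choices[0].delta.content, and a final [DONE] sentinel), which is the format the OpenAI API uses; whether every DMR engine emits exactly this shape should be verified against your setup. The function name is hypothetical.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue                       # skip blank separators and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break                          # end-of-stream sentinel
        choices = json.loads(data).get("choices") or []
        if choices:
            text = choices[0].get("delta", {}).get("content")
            if text:
                yield text                 # chunks without content (e.g. role only) are skipped


events = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    '',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    b'data: {"choices":[{"delta":{"content":"lo"}}]}\n',
    'data: [DONE]',
]
print("".join(iter_stream_text(events)))
```

In a real client, lines would be the iterable HTTP response body, so fragments can be shown to the user as they arrive.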

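When you use Model Runner, Docker Compose fits naturally into the development workflow. You define the app and the model it depends on in one Compose file, run docker compose up, and everything starts together: the application and the model, already wired and ready to go. When you finish testing, docker compose down tears everything down cleanly. You do not need to pull or start models by hand each time. To try a different model, replace the model tag in the Compose file and bring the stack up again. This makes experimenting with different models almost effortless.

Now that we know how to integrate models into applications, we turn to performance optimization and hardware utilization. This becomes very important as you scale up model usage.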

Performance optimization and GPU configuration

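By the end of this section, you will be able to choose an appropriate inference engine and quantization strategy based on your available hardware and desired throughput targets. You will learn how to tune context size and runtime flags carefully to avoid common failure modes such as out-of-memory (OOM) errors or excessively slow inference. You will also know how to verify, by inspecting logs and engine status, that GPU acceleration is actually being used, and which configuration changes to make when performance falls short.

Running LLMs locally can be very resource-intensive. Performance varies widely with model size, quantization, CPU versus GPU, and runtime configuration. Docker Model Runner provides mechanisms to tune these aspects and strike a good balance between speed, memory usage, and model accuracy. In this section, we discuss strategies for optimizing performance and configuring GPU support correctly. Because performance differs so much between setups, the first optimization decision is choosing the right inference engine for the workload.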

Choosing the right inference engine

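As mentioned earlier, DMR supports multiple inference engines, primarily llama.cpp and vLLM.

The choice of engine significantly affects performance:

llama.cpp (GGUF, quantized): A CPU-first engine built for quantized models. It is very memory-efficient and runs on machines without high-end GPUs. It is best suited to development, low-cost setups, and cases where throughput is not the critical bottleneck.

Trade-offs: Tokens/sec is lower than with GPU serving, and quantization causes a small quality loss. Use it when you want maximum compatibility and the lowest memory usage.

vLLM (safetensors, FP16 / BF16): A GPU-first, high-throughput serving engine, typically used in NVIDIA CUDA environments. It is optimized for fast generation with efficient batching and attention, which makes it a good fit for streaming and multi-request serving.

Trade-offs: VRAM requirements are higher; for example, a 7B model in FP16 typically needs about 14 GB of VRAM for the weights alone (7 billion parameters × 2 bytes each). Startup is also slower because the model has to be loaded onto the GPU. Use it when you have a strong GPU and need speed, throughput, or better full-precision quality.

Docker Model Runner (DMR): Usually selects the engine automatically based on model format: GGUF → llama.cpp, safetensors → vLLM. If vLLM is selected, make sure the GPU backend is installed and configured; without a GPU, safetensors models typically fail with an error unless a fallback exists.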

Figure 3.5: When to choose which inference engine in Docker Model Runner: llama.cpp versus vLLM

For simplicity, start with llama.cpp; if performance becomes a bottleneck, switch to vLLM.
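The format-based selection described above can be written down as a tiny decision function. This is only a sketch of the rule as stated in this chapter, not DMR's actual selection code; the pick_engine name and its arguments are hypothetical.

```python
def pick_engine(model_format: str, has_nvidia_gpu: bool) -> str:
    """Mirror the chapter's rule of thumb: GGUF -> llama.cpp, safetensors -> vLLM."""
    fmt = model_format.lower()
    if fmt == "gguf":
        # Quantized GGUF models run on llama.cpp, with or without a GPU.
        return "llama.cpp"
    if fmt == "safetensors":
        if not has_nvidia_gpu:
            # Assumption from the text: no CPU fallback for safetensors models.
            raise ValueError("safetensors models need a GPU backend for vLLM")
        return "vllm"
    raise ValueError(f"unknown model format: {model_format!r}")
```

A check like this is mainly useful in setup scripts or CI, where you want a clear error before pulling a multi-gigabyte model onto a machine that cannot serve it.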

If you decide that GPU acceleration is the right path, the next step is to make sure your system is configured correctly to take advantage of the GPU.

GPU setup and configuration

If you have an NVIDIA GPU, enabling it can dramatically improve inference speed. These are the key points for GPU usage:

Docker Desktop (Windows with WSL2 / Linux containers, or Mac): On Windows, you need a reasonably recent NVIDIA GPU and drivers. Enabling "GPU-backed inference" in Docker Desktop settings lets DMR use the GPU through CUDA (in WSL2).

On Mac, Apple Silicon does not support CUDA, because CUDA is a framework specific to NVIDIA hardware. The good news is that Apple Silicon GPUs can be accelerated natively through Metal, Metal Performance Shaders (MPS), and MLX, and engines such as llama.cpp, PyTorch, and TensorFlow can use these capabilities directly on the host. For containerized workloads, Docker Desktop on Mac has its own GPU acceleration path: tools such as Docker Model Runner use Apple's Virtualization framework and Metal to access the host GPU directly, rather than relying on the NVIDIA + WSL2 / Linux passthrough model used on Windows and Linux. Once GPU support is enabled, Docker downloads a GPU-enabled engine when you run a model that can use the GPU. On Windows, if experimental WSL GPU support is not yet enabled, you may also see a prompt about it.

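Linux (Docker Engine): Make sure the NVIDIA drivers are installed, for example 525.60.13 or later for CUDA 12, or the version specified in the docs.

Install the Model Runner GPU backend discussed earlier.

If you also want containerized apps to use the GPU, install the NVIDIA Container Toolkit. In the DMR case, however, the model runs as a host process rather than in a container, so the main requirement is that the user running it can access the /dev/nvidia* devices.

Typically, running Docker as a user in the docker group with the NVIDIA toolkit set up correctly covers this requirement. You can run a simple CUDA container as a test to confirm that Docker can access the GPU. Once configured, running a vLLM model should engage the GPU; you will see GPU memory usage rise in nvidia-smi as the model loads.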

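ROCm / AMD GPUs: There is experimental support for AMD through llama.cpp's ROCm backend, and other hardware can be supported through Vulkan or OpenCL. If you have an AMD GPU, DMR (llama.cpp) may be able to use it through these backends, but performance may be less optimized. This usually requires building the engine with the relevant flags. At the time of writing, NVIDIA remains the primary and most mature supported path.

Multi-GPU: vLLM, as used through DMR, currently expects a single GPU. llama.cpp can offload to multiple GPUs by splitting layers or tensors, using flags such as --split-mode, --tensor-split, and --main-gpu, but such setups are advanced and may require manual configuration.

In most cases, one model runs on one GPU. If you have multiple GPUs, you may be able to run several models concurrently and pin each to a different GPU; however, DMR cannot yet natively assign distinct GPUs to different models, except through raw flags.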

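GPU Memory Considerations: If the model you run is larger than GPU memory, it may fail to load in vLLM or run extremely slowly because of paging. With llama.cpp, you can use --n-gpu-layers to place only some layers on the GPU and keep the rest on the CPU. For example: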

--n-gpu-layers 20

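This setting offloads only 20 layers to the GPU, which lowers VRAM usage.

The trade-off is some loss of performance, but you can run models that would not otherwise fit entirely on the GPU. Use nvidia-smi to monitor usage and adjust. Similarly, the --mlock flag locks the model in RAM to avoid swapping (useful for CPU runs).

Throughput vs. Latency: GPU engines such as vLLM excel at throughput and can serve a large number of tokens per second, especially when batching multiple requests. If what you care about is single-request latency for small prompts, a quantized CPU model may start faster and respond within a few seconds, whereas the GPU may take a few seconds to initialize the model but then generates quickly. For interactive use, the difference may not be large.

For many parallel users, however, the GPU is clearly stronger. Keep this in mind when optimizing: measure tokens per second in the logs or responses. The DMR UI shows the generation speed (tokens/sec) and the time to first token for each request.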
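The nvidia-smi checks mentioned above can also be scripted, which is handy when you want to confirm that loading a model really moved memory onto the GPU. The sketch below assumes nvidia-smi is on the PATH and uses its CSV query mode; the function names and the 512 MiB threshold are illustrative choices, not part of DMR.

```python
import subprocess
from typing import List, Sequence, Tuple

# Ask nvidia-smi for used and total memory in MiB, one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> List[Tuple[int, int]]:
    """Parse 'used, total' pairs (MiB) from nvidia-smi CSV output."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(part.strip()) for part in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory_in_use() -> List[Tuple[int, int]]:
    """Run nvidia-smi and return (used, total) MiB for each GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


def model_seems_loaded(
    before: Sequence[Tuple[int, int]],
    after: Sequence[Tuple[int, int]],
    min_delta_mib: int = 512,
) -> bool:
    """True if any GPU gained at least min_delta_mib of used memory."""
    return any(a[0] - b[0] >= min_delta_mib for b, a in zip(before, after))
```

Take one reading with gpu_memory_in_use() before sending the first request and another afterwards; if model_seems_loaded returns False, the model is probably running on the CPU.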

With engine and GPU considerations covered, we now look at how model-level settings directly affect memory usage and responsiveness.

Context length and model size

We now explore how context length, model size, quantization, and runtime tuning directly affect performance and resource usage in Docker Model Runner. You will learn how to balance memory, speed, and output quality through thoughtful parameter choices, so that your hardware performs at its best.

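Context length (memory vs capability)

We start with context length because it is one of the most common, and most often misunderstood, performance levers. Increasing the context size, that is, the amount of conversation or text the model can consider at once, can be useful, but it comes at the cost of significantly more memory and compute. DMR lets you configure the context up to the maximum the model supports. For example, some newer models support 8k or 16k tokens.

As a rule, use only the context size you actually need. If your use case never needs more than about 1,000 tokens, running with an 8k context instead of a 4k one may simply waste memory and slow down inference, because self-attention runs over a larger window.

The default context for llama.cpp is typically 4096 tokens, which is reasonable for many scenarios. If you set it to 8192, expect memory usage to jump, possibly by several GB, and throughput to drop somewhat. Test every time you change the context size.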
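To see why doubling the context can cost several GB, it helps to estimate the key-value (KV) cache, which grows linearly with context length. The sketch below uses the standard formula for a transformer KV cache; the layer and head counts are assumptions chosen to resemble a Llama-2-7B-style model with an FP16 cache and no grouped-query attention, so treat the numbers as an order-of-magnitude guide rather than what DMR will report.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,
) -> int:
    """Estimate KV cache size for one sequence.

    The leading 2 accounts for storing both keys and values in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Assumed shape: 32 layers, 32 KV heads, head dimension 128, FP16 cache.
    for ctx in (4096, 8192):
        size = kv_cache_bytes(32, 32, 128, ctx)
        print(f"context {ctx}: {size / GIB:.1f} GiB")
```

Under these assumptions the cache takes 2 GiB at 4096 tokens and 4 GiB at 8192 tokens, on top of the model weights. Models that use grouped-query attention have fewer KV heads and therefore a proportionally smaller cache.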

You can check the maximum context a model supports with the following command:

docker model inspect <model>

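If you force the context beyond what the model was trained to support, for example setting 16k on a 4k model, it will not work correctly unless the model explicitly supports RoPE scaling. Some models do, and DMR exposes --rope-scaling flags for advanced usage.

Besides context length, the other major performance factor is the size of the model itself and how aggressively it is quantized.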

Model size and quantization

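Larger models (more parameters) generally produce better results but run more slowly. Quantization (for example, 4-bit) can close the gap in speed and memory, letting you run a 13B or 30B model on a commodity machine, although it does reduce output quality slightly. If you need top-notch quality and have the hardware, use FP16 models with vLLM. If good enough and fast is what you need, a 4-bit quantized model on llama.cpp may be sufficient.

Iterate to find the smallest model that meets your needs. For example, if a 7B parameter model does not give good enough answers, try a 13B model. If 13B is too slow on CPU, consider a quantized 13B, or try a 7B model with a different fine-tune.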

Quantization strategies

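Once you have settled on a model size, the remaining choice is how aggressively to quantize it.

Q4_K_M (used in this chapter's examples) is a balanced 4-bit method that provides decent accuracy. There are also Q5, Q8, and other levels. Q8 (8-bit) comes close to full-precision accuracy at roughly half the size of FP16; if you have enough memory, it is a good middle ground.

DMR models on Docker Hub usually come in several quantization levels, identified by the tag, for example :Q5_K_M. You can pull a few variants and test how they perform. Because DMR caches models, switching quantization is as simple as pulling a new tag and updating your Compose file or config to use it.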
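The size differences between quantization levels follow from simple arithmetic: the weights take roughly the parameter count times the bits per weight. The sketch below uses nominal bit widths as an assumption; real GGUF quantization schemes such as Q4_K_M store some extra metadata and mix precisions, so actual files are somewhat larger than the nominal 4-bit figure.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes) for a given precision."""
    # params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9 for GB.
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # Nominal bit widths; treat the quantized figures as lower bounds.
    for label, bits in (("FP16", 16), ("Q8", 8), ("Q4", 4)):
        print(f"7B at {label}: about {weight_memory_gb(7, bits):.1f} GB")
```

For a 7B model this gives about 14 GB at FP16, 7 GB at 8-bit, and 3.5 GB at a nominal 4-bit, which matches the FP16 figure quoted earlier and shows why Q8 is roughly half the size of full precision. Remember to add the KV cache and runtime overhead on top of the weights.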

Threading and parallelism

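Finally, once you have chosen a model and quantization level, you can fine-tune how efficiently your hardware is used by adjusting threading and parallelism settings.

By default, llama.cpp uses as many threads as your CPU has physical cores (it tries to autodetect this), which is usually fine. For example, on a machine with 8 physical cores (16 logical cores), it would likely choose 8. If you see that the CPU is underutilized, or you want to cap usage, adjust --threads.

If all of those are high-performance cores, great. If you also have efficiency cores, as on newer Intel or Apple chips, be careful: too many threads can sometimes reduce performance because of memory bandwidth contention. Experiment; setting the thread count to the number of physical cores often gives the best result.

The --batch-size flag (how many tokens are processed at once during prompt ingestion) can also improve speed; a higher batch size speeds up initial prompt processing at the cost of more memory. The default of 512 is usually fine; if you have plenty of RAM, you can raise it to 1024 or 2048 to process long prompts slightly faster.

For vLLM, parallelism is handled internally; it exploits GPU parallelism and can batch multiple requests automatically. Make sure you are not limiting it artificially, for example by sending one request at a time and waiting for it to finish. If you have a concurrent workload, let the requests hit the model concurrently so that you actually see vLLM's advantage.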
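One simple way to keep several requests in flight is a thread pool on the client side. The sketch below is deliberately transport-agnostic: send is a hypothetical callable that you would implement with your HTTP client or OpenAI SDK of choice, and run_concurrently is an illustrative helper, not part of DMR or vLLM.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_concurrently(
    send: Callable[[str], str],
    prompts: Sequence[str],
    max_workers: int = 8,
) -> List[str]:
    """Send prompts in parallel and return responses in prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps results aligned with the input order even though
        # individual requests may finish at different times.
        return list(pool.map(send, prompts))


if __name__ == "__main__":
    # Stand-in for a real model call, used only to show the call shape.
    def fake_send(prompt: str) -> str:
        return f"echo: {prompt}"

    print(run_concurrently(fake_send, ["hello", "world"]))
```

Threads are sufficient here because each worker spends its time waiting on network I/O. With a batching engine, the overlapping requests can be grouped on the server; with a CPU engine they mostly share the same cores, so expect less benefit.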

Of course, optimization is incomplete without measurement. So we finish by looking at how to monitor performance based on real metrics and tune the configuration iteratively.

Monitoring and tuning

To truly optimize, you have to measure performance. In a separate terminal, run:

docker model logs -f

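This lets you watch requests as they arrive. The output shows whether models are being loaded and unloaded frequently; if so, you may be switching models too often, and should either stick to one model or make sure there is enough memory to keep several loaded at once. It also records how long each request took and, where available, token counts.

From the Models → Requests UI in Docker Desktop, you can see the token usage and latency of each request. Look at the "generation speed" (tokens/sec). For example, if you see about 5 tokens/sec for a 7B model on CPU and you have a GPU available, moving to the GPU could easily raise that to 50+ tokens/sec; that is your signal to switch.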
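If you want the same tokens/sec figure from your own code, you can time a call and divide the number of generated tokens by the elapsed time. The helper below is a rough sketch: generate is a hypothetical callable that performs one request and returns the completion token count (for example, from the usage field of an OpenAI-style response), and the injectable clock exists only to make the function easy to test.

```python
import time
from typing import Callable, Tuple


def measure_generation_speed(
    generate: Callable[[], int],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[int, float, float]:
    """Return (completion_tokens, elapsed_seconds, tokens_per_second)."""
    start = clock()
    completion_tokens = generate()
    elapsed = clock() - start
    # Guard against a zero-length interval on very coarse clocks.
    rate = completion_tokens / elapsed if elapsed > 0 else float("inf")
    return completion_tokens, elapsed, rate
```

Because the timer wraps the whole request, the result includes prompt processing and network time, so it will read slightly lower than the pure generation speed shown in the UI. It is still a useful relative measure when comparing engines, quantization levels, or context sizes on the same machine.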

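If you run out of memory (the logs or errors may show OOM), consider reducing the context or model size, or adding swap (not ideal for performance, but it avoids crashes).

Another optimization strategy is prompt batching. If your use case can send multiple queries together, vLLM batches them automatically. llama.cpp does not natively batch different prompts; however, if you generate multiple completions sequentially in a loop, you can consider using threads or asynchronous calls to overlap them, although on CPU this only divides the same resources. With vLLM, take advantage of its capabilities by sending concurrent requests.

In summary, performance tuning for DMR comes down to matching the hardware to the model and settings:

Use quantization and a smaller context to fit within limited RAM / VRAM.
If you can afford it, use GPUs and full-precision models for speed.
Tune threads and flags for your specific machine, especially for CPU runs.
Monitor generation speed and memory usage, and adjust accordingly.

With these strategies in hand, you can get the most out of your local models. Next, we turn to observability: how to monitor and debug the model service more systematically.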

Observability and monitoring

By the end of this section, you will be able to identify and define the key performance indicators for local inference services, including tokens per second, latency percentiles, error rates, and request queueing behavior. You will learn how to set up a minimal but effective monitoring pipeline with Prometheus and Grafana, and understand where adding tracing tools such as Jaeger provides meaningful insight. You will also gain practical techniques for systematically diagnosing slowdowns and memory pressure, and for troubleshooting performance issues, by combining model logs with an analysis of the runtime configuration.

When you deploy AI models as services, especially as part of an application, you want insight into their behavior and performance. Observability for Docker Model Runner can be handled like that of any web service, combining logging, metrics, and tracing. DMR and Docker provide some built-in hooks, such as logs and the Docker UI, and you can extend them with Prometheus, Grafana, and Jaeger to build a complete monitoring stack.

We start with the simplest and most direct source of insight: the logs and request-level inspection tools already built into Docker Model Runner and Docker Desktop.

Logging and request inspection

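Docker Model Runner logs are your first layer of observability:

Model Runner logs: You can view these from the CLI with docker model logs, which shows logs from the model backend. They include model download progress, model loading and unloading events, and inference request logs. For example, when you send a request, the log entry may show the model name, prompt length, generation time, and so on. Errors (such as out-of-memory) also appear here. Use -f to follow the logs in real time and watch what is happening.

Docker Desktop UI: In the Models tab of Docker Desktop, a Logs sub-tab shows the same information in a friendlier interface. There is also a Requests tab that lists every request the model service has handled.

Selecting a specific request shows its details: the full request payload, the response, token counts, and latency. This is very useful for debugging; you can verify exactly which prompt the model received and what it returned, which helps with troubleshooting and prompt engineering.

With these tools, you can answer questions such as: Is the model being called as expected? How long do responses take? Are we hitting the context limit? For example, if request details consistently show context usage close to the maximum, you may need a model with a larger context, or you may need to trim the conversation in your application logic.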
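Trimming the conversation in application logic can be as simple as keeping the system prompt plus the most recent messages that fit a token budget. The sketch below makes a loud assumption: it estimates tokens as roughly four characters each, which is only a crude heuristic, so use the model's real tokenizer if you need accuracy. The function names and message shape (OpenAI-style role/content dictionaries) are illustrative.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Crude assumption: about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_conversation(
    messages: List[Dict[str, str]],
    max_tokens: int,
) -> List[Dict[str, str]]:
    """Keep system messages and the newest messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)
    kept: List[Dict[str, str]] = []
    # Walk backwards so the most recent turns are kept first.
    for message in reversed(rest):
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))
```

Call trim_conversation before each request, with max_tokens set comfortably below the configured context size so that the model still has room to generate its reply.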

Logs are good for debugging individual issues, but they do not give you a long-term view of performance. Next, we look at how to collect and visualize metrics over time with Prometheus and Grafana.

Metrics with Prometheus and Grafana

To get a more quantitative, long-term view, you need metrics. Prometheus can scrape metrics from your application and related services. Although Docker Model Runner itself does not yet provide a Prometheus endpoint out of the box, you can instrument your application so that it emits metrics about model usage.

In the example from the Docker GenAI chatbot guide, the authors created a backend that acts as a metrics bridge. It collects metrics such as tokens per second and request latency and exposes them to Prometheus. A Prometheus server (running in a container) scrapes and stores these metrics, and Grafana visualizes them.

Some key metrics you may want to track include:

Total Requests and Request Rate: For example, a counter of completion requests, possibly with a model name label.

Latency: For example, a histogram of request durations.

Token usage: Counters of processed tokens, possibly split between prompts and completions.

Tokens per second: A gauge of current throughput. Some apps measure how many tokens the model generates per second as an indicator of performance.

Active Requests: A gauge of in-flight requests, from which you can infer concurrency.

Context Length Used: The context length of each request, or the current conversation length, to monitor how often you approach the limit.

Model-specific metrics: For example, with llama.cpp you could track the number of threads. The chatbot example even tracks whether "reasoning mode" was used (a specialized metric), along with a context size gauge.

Typically, you add instrumentation in the backend code that calls the model. For example, in Python you can use the prometheus_client library to define metrics and update them around your API calls. If the backend is written in Go, you can use the Prometheus Go client to register metrics and expose /metrics.
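To make the shape of such instrumentation concrete, here is a dependency-free sketch that keeps a few counters in memory and renders them in the Prometheus text exposition format. It is a stand-in written with the standard library only; in a real backend you would normally use prometheus_client rather than formatting the output by hand. The metric names and the ModelMetrics class are hypothetical, and the code assumes label values contain no quotes or backslashes.

```python
import threading
from typing import Dict, Tuple


class ModelMetrics:
    """Minimal in-memory metrics rendered as Prometheus text format."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._requests: Dict[Tuple[str, str], int] = {}
        self._latency_sum = 0.0
        self._latency_count = 0
        self._completion_tokens = 0

    def observe(self, model: str, status: str, seconds: float, completion_tokens: int) -> None:
        """Record one finished model call."""
        with self._lock:
            key = (model, status)
            self._requests[key] = self._requests.get(key, 0) + 1
            self._latency_sum += seconds
            self._latency_count += 1
            self._completion_tokens += completion_tokens

    def render(self) -> str:
        """Return the body you would serve at /metrics."""
        with self._lock:
            lines = ["# TYPE llm_requests_total counter"]
            for (model, status), count in sorted(self._requests.items()):
                lines.append(
                    f'llm_requests_total{{model="{model}",status="{status}"}} {count}'
                )
            lines += [
                "# TYPE llm_request_duration_seconds summary",
                f"llm_request_duration_seconds_sum {self._latency_sum}",
                f"llm_request_duration_seconds_count {self._latency_count}",
                "# TYPE llm_completion_tokens_total counter",
                f"llm_completion_tokens_total {self._completion_tokens}",
            ]
        return "\n".join(lines) + "\n"
```

Call observe() after every model request and serve render() from your web framework's /metrics route. Prometheus can then derive request rate, average latency, and tokens per second from these raw counters with rate() queries, so you do not need to compute them in the app.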

Once metrics are exposed, run a Prometheus server (possibly as part of your Compose setup) and configure it to scrape your app's metrics endpoint. For example, if your app exposes metrics inside the Compose network at:

http://backend:9090/metrics

then the scrape target in the Prometheus config is:

backend:9090

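In the chatbot example, the backend serves the metrics so that the Prometheus container can scrape them without needing direct access to the host (because DMR runs on the host). In our simpler scenario, where your app calls DMR directly, you only need the app to expose metrics, and these will include information such as the time spent waiting for DMR.

You can then set up Grafana to visualize the metrics. Grafana pulls data from Prometheus and builds dashboards. For example, a dashboard might show requests per minute, a graph of average latency, the current tokens per second value, and so on.

You can also set up alerts, for example when latency exceeds X, or when there have been no successful requests in the last Y minutes.

The following screenshot shows metrics listed in the Prometheus UI, such as genai_app_token_latency, as well as Grafana dashboards:

Figure 3.6: Prometheus dashboard showing the available app metrics, used here to inspect active requests

Setting up Prometheus and Grafana takes extra work, but it pays off if you are iterating on model performance or running a long-lived service. These metrics let you catch performance regressions, such as an update that slows responses or a sudden rise in context usage.

If your application involves multiple services, or you need deeper insight into request flow, the next step is to add distributed tracing to understand where the time actually goes.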

Distributed tracing with Jaeger

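If your application is part of a microservice architecture, or you want to trace requests end to end, tracing is very valuable. Jaeger is a popular open-source tracing system. You can integrate it using OpenTelemetry or specific client libraries.

In the context of DMR, your application can create a trace span when it receives a request that will involve the model. For example, in a web server, when an API call arrives that needs a model response, start a span called "Handle request", and inside it start another span called "LLM inference". Because the model call is an HTTP call, you can propagate context through headers, for example with the traceparent header from W3C Trace Context.

Currently, DMR does not automatically continue traces from incoming requests; it is not yet instrumented at that level. You can still measure it from the outside, though, by creating a span in your app that covers the duration of the HTTP call.

As a result, Jaeger shows the trace for the request with the model call as one segment. If the model call is slow, it is obvious in the timeline. If it errors, you can add an error tag to that span. This helps you tell whether a slowdown comes from your own code or from model inference. If you have multiple services, for example a frontend service that calls a backend, which in turn calls the model, distributed tracing follows the whole chain.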
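The idea of wrapping the model call in a span and propagating a traceparent header can be illustrated without any tracing library. The sketch below builds a header in the W3C Trace Context format (version, 32-hex trace ID, 16-hex span ID, flags) and records the call duration in a plain dictionary. It is a simplified stand-in for what the OpenTelemetry SDK does for you; the function names and the span dictionary shape are hypothetical.

```python
import secrets
import time
from typing import Callable, Dict, Optional, Tuple


def new_traceparent(trace_id: Optional[str] = None) -> str:
    """Build a W3C traceparent value: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)  # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # 01 marks the trace as sampled


def timed_llm_call(
    call: Callable[[Dict[str, str]], object],
    trace_id: Optional[str] = None,
) -> Tuple[object, Dict[str, object]]:
    """Run the model call with trace headers and return (result, span)."""
    headers = {"traceparent": new_traceparent(trace_id)}
    start = time.perf_counter()
    result = None
    error = None
    try:
        result = call(headers)
    except Exception as exc:  # record the failure on the span
        error = repr(exc)
    span = {
        "name": "LLM inference",
        "traceparent": headers["traceparent"],
        "duration_s": time.perf_counter() - start,
        "error": error,
    }
    return result, span
```

Here call is whatever function performs the HTTP request to the model endpoint; it receives the headers to attach. Note that this sketch records exceptions on the span instead of re-raising them, which keeps the example short; production code would usually re-raise after recording, and would export spans through OpenTelemetry so that they show up in Jaeger.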

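Setting up Jaeger is similar to setting up Prometheus; in development, you can run a Jaeger all-in-one container. In your app, use the OpenTelemetry SDK or a Jaeger client.

In the Jaeger UI, you can search for traces of your app's requests and see details, including how much time was spent in the model runner.

Beyond application-level observability, also keep an eye on the underlying system resources that the model service depends on.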

System monitoring

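Do not forget basic system metrics:

Docker stats / System metrics: The model runner process itself (on the host) consumes CPU and memory. You can use docker stats to check whether any container is using a lot of resources; the model itself, however, is not a container but a plugin process. On the host, therefore, use OS tools such as htop or top to check the CPU usage of the docker-model-runner (or similarly named) process, and use nvidia-smi to check GPU usage. If you want to graph host metrics, Prometheus Node Exporter can capture them as well.

If you run multiple models, pay attention to how they share resources. For example, if two large models are used alternately, they may keep loading and unloading. The logs reveal this thrashing; a metric such as "active model switches" can also be derived (although it is not available out of the box), and you will likely see latency spikes at each switch.

Logs aggregation: For wider deployments, consider shipping logs to a centralized system, such as an ELK stack or a cloud logging service. For development, the built-in logs are sufficient.

In summary, treat the model service like any other critical service: instrument it, log it, and watch it. Use Prometheus and Grafana for real-time metrics on performance and usage patterns, and use Jaeger to trace and pinpoint latency sources in distributed scenarios.

This helps not only with troubleshooting but also with optimizing prompts and usage. For example, if you see that most prompts use only 100 tokens of context, you may not need an 8k context model; or if you see the request rate climbing, it may be time to plan for scaling.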

Summary

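In this chapter, we explored how to use Docker Model Runner to turn AI models into services. You learned how DMR enables local LLM inference through Docker tooling, from pulling models the way you pull images to running an OpenAI-like API on your own machine.

We covered the DMR architecture: how it combines a model store, a runner engine, and the CLI; how to install and enable it on different platforms; and how to integrate it with your applications through Docker Compose and direct API calls. We also looked closely at performance tuning: how to choose between llama.cpp and vLLM, how to adjust context, threads, and quantization, and how to implement observability for the model service.

With Docker Model Runner, you can develop and test AI-powered features without depending on cloud services for every iteration, which saves cost and gives you more control. It also paves the way for deploying these models to production-like environments when needed, still using the Docker workflows you already know.

As a closing thought: treating models as Docker-managed resources abstracts away a great deal of complexity. It lets you focus on what the model should do in your app rather than on how to wire it up. With great power comes great responsibility, however; you are now running potentially large workloads locally, so monitoring and optimizing are critical, as we have discussed.

So far, we have been running everything locally: models, containers, and Compose stacks all live on your own machine. So what happens when your AI / ML workloads outgrow what a laptop can handle? In the next chapter, we look at Docker Offload, a way to push heavyweight tasks such as CUDA-enabled builds, model conversion, and batch preprocessing to managed cloud resources while keeping the same Docker workflow you already know.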