- RuntimeError("UVA is not available")
As the figure below shows, this error is raised because the is_pin_memory_available function returns False immediately when in_wsl() is true. As a temporary fix, change that False to True.
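The check described above can be sketched roughly as follows. This is a minimal illustration, not vLLM's actual source: the uname-based WSL heuristic and the `force_on_wsl` parameter are assumptions introduced here, with `force_on_wsl=True` standing in for the manual edit of False to True.

```python
import platform


def in_wsl() -> bool:
    # Assumed heuristic: WSL kernels report "microsoft" in their uname fields.
    return "microsoft" in " ".join(platform.uname()).lower()


def is_pin_memory_available(force_on_wsl: bool = False) -> bool:
    # Behaviour described in the text: under WSL the function returns False.
    # `force_on_wsl` is a hypothetical switch that mimics the temporary
    # workaround of returning True instead.
    if in_wsl():
        return force_on_wsl
    # Outside WSL, pinned memory is assumed to be available.
    return True
```

Calling `is_pin_memory_available(force_on_wsl=True)` returns True on any platform, which corresponds to the patched behaviour. Whether pinned memory actually works depends on the WSL and driver setup, so this is best treated as a temporary workaround.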
- RuntimeError: CUDA graphs must be captured on a non-default stream. (However, after capture, it's ok to replay them on the default stream.)
The error is shown in the figure below:
This happens because the CUDA graph capture runs on the default stream. The following temporary change is needed:
diff --git a/vllm/compilation/cuda_graph.py b/vllm/compilation/cuda_graph.py
index 13e88448c..0293d2e5c 100644
--- a/vllm/compilation/cuda_graph.py
+++ b/vllm/compilation/cuda_graph.py
@@ -292,10 +292,18 @@ class CUDAGraphWrapper:
get_offloader().sync_prev_onload()
# mind-exploding: carefully manage the reference and memory.
+ # Ensure current stream is not the default stream.
+ # In WSL2, spawn subprocesses may not have TLS stream
+ # initialized before CUDA graph capture is triggered.
+ _capture_stream = current_stream()
+ if _capture_stream.stream_id == torch.cuda.default_stream().stream_id:
+ import torch as _torch
+ _torch.cuda.set_stream(_torch.cuda.Stream())
+ _capture_stream = current_stream()
with torch.cuda.graph(
cudagraph,
pool=self.graph_pool,
- stream=current_stream(),
+ stream=_capture_stream,
):
# `output` is managed by pytorch's cudagraph pool
output = self.runnable(*args, **kwargs)
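The stream-selection logic in the patch above can be isolated into a small helper. The following is a sketch under stated assumptions: `pick_capture_stream`, `FakeStream`, and `ensure_side_stream_torch` are hypothetical names introduced for illustration and are not part of vLLM. The core helper only relies on a `stream_id` attribute, so it can be exercised without a GPU; the torch variant uses `torch.cuda.current_stream`, `torch.cuda.default_stream`, `torch.cuda.Stream`, and `torch.cuda.set_stream`, and needs a CUDA-enabled PyTorch install.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FakeStream:
    # Stand-in for torch.cuda.Stream; only `stream_id` matters for this sketch.
    stream_id: int


def pick_capture_stream(current, default, make_stream: Callable[[], object]):
    """Return a stream that is safe to use for CUDA graph capture.

    If `current` is the default stream, create a fresh stream via
    `make_stream`; otherwise reuse `current` unchanged.
    """
    if current.stream_id == default.stream_id:
        return make_stream()
    return current


def ensure_side_stream_torch():
    # Same logic against real PyTorch objects (hypothetical helper).
    import torch  # imported lazily so the helper above stays dependency-free

    stream = pick_capture_stream(
        torch.cuda.current_stream(),
        torch.cuda.default_stream(),
        torch.cuda.Stream,
    )
    # Make the chosen stream current for this thread before capturing.
    torch.cuda.set_stream(stream)
    return stream
```

With fake streams, `pick_capture_stream(FakeStream(0), FakeStream(0), lambda: FakeStream(1))` yields the newly created stream, while a non-default current stream is returned as-is. In a real run, the stream returned by `ensure_side_stream_torch()` would be passed as the `stream=` argument of `torch.cuda.graph(...)`, as the patch does.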