# How to Run vLLM with GPT-OSS-20B Model
This guide provides step-by-step instructions for setting up and running the vLLM server with the GPT-OSS-20B model.
## Prerequisites
- Conda package manager installed
- NVIDIA GPU with sufficient memory (recommended 24GB+ for GPT-OSS-20B)
- CUDA 12.x installed
## Setup Instructions

### 1. Create Conda Environment
```bash
conda create --name oss_env_py312 python=3.12
conda activate oss_env_py312
```
### 2. Install vLLM with GPT-OSS Support
```bash
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
```

Note: the `--index-strategy` flag is specific to `uv pip`; plain `pip` does not accept it. Install `uv` first (`pip install uv`), or drop the flag if you use plain `pip`.
### 3. Run vLLM Server
```bash
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 vllm serve /path/to/gpt-oss-20b \
    --host 192.168.31.103 \
    --port 8000 \
    --served-model-name openai/gpt-oss-20b \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 26384 \
    --async-scheduling \
    --uvicorn-log-level debug
```
## Parameters

- `--host`: IP address to bind the server to
- `--port`: Port number to listen on
- `--served-model-name`: Name used to identify the model in API responses
- `--tensor-parallel-size`: Number of GPUs to shard the model across (1 for a single GPU)
- `--gpu-memory-utilization`: Fraction of GPU memory vLLM may allocate (0.9 = 90%)
- `--max-model-len`: Maximum sequence length in tokens (26384 here)
- `--async-scheduling`: Enables asynchronous scheduling for better throughput
- `--uvicorn-log-level`: Log level for the underlying Uvicorn server (`debug` for detailed logs)
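As an illustration of how `--gpu-memory-utilization` translates into an actual byte budget (the 24 GB card size below is a hypothetical figure matching the prerequisites, not something vLLM reports):

```python
# Sketch: how --gpu-memory-utilization maps to a memory budget.
# The 24 GiB total is an assumed card size for illustration only.
def gpu_memory_budget(total_bytes: int, utilization: float) -> int:
    """Return the number of bytes vLLM may allocate for weights + KV cache."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return int(total_bytes * utilization)

total = 24 * 1024**3                      # hypothetical 24 GiB GPU
budget = gpu_memory_budget(total, 0.9)    # matches --gpu-memory-utilization 0.9
print(f"{budget / 1024**3:.1f} GiB available to vLLM")
```

Whatever is left after loading the model weights within this budget goes to the KV cache, so a lower utilization value leaves headroom for other processes at the cost of cache capacity.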
## Notes

- Replace `/path/to/gpt-oss-20b` with the actual path to your model files; the model will be downloaded automatically from ModelScope if it is not found locally
- For production use, consider a log level less verbose than `debug`
- Once running, the server will be accessible at `http://192.168.31.103:8000`
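A quick way to confirm the server is up is to query the OpenAI-compatible `/v1/models` endpoint. A minimal sketch using only the standard library (the host and port are assumed to match the serve command above):

```python
import json
import urllib.request
from urllib.error import URLError

BASE_URL = "http://192.168.31.103:8000"  # matches --host/--port above

def list_models(base_url: str = BASE_URL) -> list[str]:
    """Return the model IDs the server reports, or [] if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            payload = json.load(resp)
        return [m["id"] for m in payload.get("data", [])]
    except (URLError, OSError):
        return []

print(list_models())  # expect the --served-model-name value once the server is up
```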
## API Usage
Once running, you can interact with the model using the OpenAI-compatible API:
```bash
curl http://192.168.31.103:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "openai/gpt-oss-20b",
        "prompt": "Hello, world!",
        "max_tokens": 100
    }'
```
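The same request can be issued from Python without third-party packages. A sketch whose payload mirrors the `curl` call above (the helper names are illustrative, not part of any vLLM API):

```python
import json
import urllib.request

def build_payload(prompt: str, max_tokens: int = 100) -> dict:
    """Request body mirroring the curl example above."""
    return {
        "model": "openai/gpt-oss-20b",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }

def complete(prompt: str, base_url: str = "http://192.168.31.103:8000") -> dict:
    """POST a completion request to the OpenAI-compatible /v1/completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

# complete("Hello, world!")["choices"][0]["text"]  # requires the running server
```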
## Troubleshooting

Requests may fail with:

```text
openai_harmony.HarmonyError: Unexpected token 12606 while expecting start token 200006
```

This means the Harmony response parser did not see the expected start-of-message token (`200006`) at the beginning of the model's output. If you hit this, verify that the model files at `/path/to/gpt-oss-20b` are the GPT-OSS release (which emits the Harmony response format) and that the installed vLLM build is the `0.10.1+gptoss` wheel from step 2.