How to Run vLLM with GPT-OSS-20B Model


This guide provides step-by-step instructions for setting up and running the vLLM server with the GPT-OSS-20B model.

Prerequisites

  • Conda package manager installed
  • NVIDIA GPU with sufficient memory (recommended 24GB+ for GPT-OSS-20B)
  • CUDA 12.8 installed (the install step below pulls cu128 nightly PyTorch wheels)

Setup Instructions

1. Create Conda Environment

conda create --name oss_env_py312 python=3.12
conda activate oss_env_py312

2. Install vLLM with GPT-OSS Support

Note: the --index-strategy flag is supported by uv rather than plain pip, so install uv first (pip install uv) and run:

uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

3. Run vLLM Server

VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 vllm serve /path/to/gpt-oss-20b \
    --host 192.168.31.103 \
    --port 8000 \
    --served-model-name openai/gpt-oss-20b \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 26384 \
    --async-scheduling \
    --uvicorn-log-level debug

Parameters Explanation

  • --host: IP address to bind the server to (192.168.31.103 here; use 0.0.0.0 to listen on all interfaces)
  • --port: Port number to listen on
  • --served-model-name: Name to identify the model in API responses
  • --tensor-parallel-size: Number of GPUs to use (1 for single GPU)
  • --gpu-memory-utilization: Fraction of GPU memory to allocate (0.9 = 90%)
  • --max-model-len: Maximum sequence length (26384 tokens)
  • --async-scheduling: Enables asynchronous scheduling for better throughput
  • --uvicorn-log-level: Sets the log level for the server (debug for detailed logs)
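To see what --gpu-memory-utilization means in practice, here is the arithmetic as a sketch (the 24 GB figure is just the card size assumed in the prerequisites, not something vLLM reports):

```python
# Fraction of total VRAM that vLLM may claim for weights + KV cache.
gpu_total_gb = 24        # card size assumed in the prerequisites
utilization = 0.9        # value passed via --gpu-memory-utilization
budget_gb = gpu_total_gb * utilization
print(budget_gb)         # 21.6
```

Whatever remains after loading the model weights within this budget is used for the KV cache, so lowering --max-model-len or --gpu-memory-utilization frees memory for other processes on the same GPU.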

Notes

  1. Replace /path/to/gpt-oss-20b with the actual path to your model files
  2. If the path is a model ID rather than a local directory, vLLM downloads it from the Hugging Face Hub by default (or from ModelScope when the VLLM_USE_MODELSCOPE=1 environment variable is set)
  3. For production use, set --uvicorn-log-level to something less verbose than debug (e.g. info)
  4. Once running, the server is reachable at http://192.168.31.103:8000 (the --host and --port configured above)
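Before sending any completions, you can verify the server is up by listing the served models. A minimal sketch using only the standard library (the base URL matches the example host/port above):

```python
import json
import urllib.request

def list_models(base_url="http://192.168.31.103:8000"):
    """Fetch the OpenAI-compatible /v1/models listing from the vLLM server."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return json.load(resp)

# list_models()["data"][0]["id"] should report the --served-model-name,
# i.e. "openai/gpt-oss-20b"
```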

API Usage

Once running, you can interact with the model using the OpenAI-compatible API:

curl http://192.168.31.103:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "Hello, world!"}],
        "max_tokens": 100
    }'
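The same request can be made from Python with the official openai client. A sketch, assuming `pip install openai` and the server address used above; the api_key value is an arbitrary placeholder, since vLLM only validates keys when started with --api-key:

```python
def chat(prompt: str) -> str:
    """Send one chat turn to the vLLM server and return the reply text."""
    from openai import OpenAI  # pip install openai

    client = OpenAI(
        base_url="http://192.168.31.103:8000/v1",  # host/port from `vllm serve`
        api_key="EMPTY",  # placeholder; vLLM ignores it without --api-key
    )
    resp = client.chat.completions.create(
        model="openai/gpt-oss-20b",  # must match --served-model-name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
    )
    return resp.choices[0].message.content

# print(chat("Hello, world!"))
```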

Troubleshooting

If a request fails with:

openai_harmony.HarmonyError: Unexpected token 12606 while expecting start token 200006

the prompt likely reached the model without OpenAI's harmony formatting, which GPT-OSS models require (token 200006 is the harmony <|start|> marker). This typically happens when a plain-text prompt is sent to the raw /v1/completions endpoint. Prefer /v1/chat/completions, where the server's chat template wraps the messages in harmony format automatically.