sglang (1): Setting Up a Development Environment

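0. Introduction

sglang is a high-performance serving framework for large language models and vision-language models. It is designed to deliver low-latency, high-throughput inference across deployment environments ranging from a single GPU to large distributed clusters.

The nano-vllm series gave me a first grounding in the basic inference flow and in how an inference framework is put together. I now plan to study LLM inference in more depth, with sglang as the focus.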

1. Machine Environment

Machine: RTX 4090:

$ nvidia-smi    
Sat Mar 14 15:57:08 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0  On |                  Off |
|  0%   44C    P8             31W /  450W |   21348MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3889      G   /usr/lib/xorg/Xorg                      871MiB |
|    0   N/A  N/A            4015      G   /usr/bin/gnome-shell                    110MiB |
|    0   N/A  N/A            6040      G   ...0313-010038.057000-production        291MiB |
|    0   N/A  N/A            6978      G   ...ess --variations-seed-version         64MiB |
|    0   N/A  N/A          101026      C   sglang::scheduler                     19962MiB |
+-----------------------------------------------------------------------------------------+

nvcc:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Apr__9_19:24:57_PDT_2025
Cuda compilation tools, release 12.9, V12.9.41
Build cuda_12.9.r12.9/compiler.35813241_0
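The two outputs above report different CUDA versions: nvidia-smi shows the highest CUDA version the driver supports (13.0), while nvcc shows the installed toolkit (12.9). A minimal sketch of how the two could be compared follows. The helper names are hypothetical, and the "toolkit must not be newer than the driver" rule is a conservative rule of thumb, not an sglang requirement.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int]


def parse_driver_cuda(smi_text: str) -> Optional[Version]:
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi output."""
    m = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    return (int(m.group(1)), int(m.group(2))) if m else None


def parse_toolkit_cuda(nvcc_text: str) -> Optional[Version]:
    """Extract the 'release X.Y' field from `nvcc --version` output."""
    m = re.search(r"release\s+(\d+)\.(\d+)", nvcc_text)
    return (int(m.group(1)), int(m.group(2))) if m else None


def toolkit_supported(driver: Version, toolkit: Version) -> bool:
    """Rule of thumb: the toolkit should not be newer than what the driver supports."""
    return toolkit <= driver


# Sample lines copied from the outputs shown above.
smi_line = "| NVIDIA-SMI 580.95.05   Driver Version: 580.95.05   CUDA Version: 13.0 |"
nvcc_line = "Cuda compilation tools, release 12.9, V12.9.41"

driver = parse_driver_cuda(smi_line)
toolkit = parse_toolkit_cuda(nvcc_line)
print("driver supports up to CUDA", driver)
print("toolkit is CUDA", toolkit)
print("compatible:", toolkit_supported(driver, toolkit))
```

On this machine the toolkit (12.9) is older than the driver's supported version (13.0), so the check passes.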

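2. Environment Setup

2.1 Basic Workflow

Creating the Conda environment

conda create -n sglang-test python=3.12.3 -y

Installing sglang

Following Method 2 in the referenced installation guide, install from source: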

git clone https://github.com/sgl-project/sglang.git
cd sglang

Then run:

conda activate sglang-test

# Install the python packages
pip install --upgrade pip
pip install -e "python"
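After the editable install, it can help to confirm which versions actually landed in the environment before launching anything. The sketch below uses only the standard library. The list of distribution names is an assumption based on the packages mentioned in this post (sglang, torch, flashinfer-python); adjust it as needed.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def report(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in names}


if __name__ == "__main__":
    # Assumed set of packages worth checking for this setup.
    for name, version in report(["sglang", "torch", "flashinfer-python"]).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run it inside the activated sglang-test environment; a missing entry points to an incomplete install.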

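Running and verifying

For the main entry point in python/sglang/launch_server.py, add the following run arguments (in PyCharm, through the run configuration): --model-path /home/iguochan/.cache/huggingface/hub/models--deepseek-ai--deepseek-llm-7b-chat/snapshots/afbda8b347ec881666061fa67447046fc5164ec8 --trust-remote-code --host 0.0.0.0 --port 30000. Then start the service. The output looks like this: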

import sys; print('Python %s on %s' % (sys.version, sys.platform))
/home/iguochan/miniconda3/envs/sglang-test/bin/python -X pycache_prefix=/home/iguochan/.cache/JetBrains/PyCharm2025.2/cpython-cache /opt/pycharm-2025.2.6/plugins/python-ce/helpers/pydev/pydevd.py --multiprocess --qt-support=auto --client 127.0.0.1 --port 33969 --file /home/iguochan/workspace/github/sglang/python/sglang/launch_server.py --model-path /home/iguochan/.cache/huggingface/hub/models--deepseek-ai--deepseek-llm-7b-chat/snapshots/afbda8b347ec881666061fa67447046fc5164ec8 --trust-remote-code --host 0.0.0.0 --port 30000 
Connected to: <socket.socket fd=3, family=2, type=1, proto=0, laddr=('127.0.0.1', 48420), raddr=('127.0.0.1', 33969)>.
Connected to pydev debugger (build 252.28539.27)
/home/iguochan/workspace/github/sglang/python/sglang/launch_server.py:51: UserWarning: 'python -m sglang.launch_server' is still supported, but 'sglang serve' is the recommended entrypoint.
  Example: sglang serve --model-path <model> [options]
  warnings.warn(
[2026-03-14 16:15:38] INFO server_args.py:2140: Attention backend not specified. Use flashinfer backend by default.
[2026-03-14 16:15:39] server_args=ServerArgs(model_path='/home/iguochan/.cache/huggingface/hub/models--deepseek-ai--deepseek-llm-7b-chat/snapshots/afbda8b347ec881666061fa67447046fc5164ec8', tokenizer_path='/home/iguochan/.cache/huggingface/hub/models--deepseek-ai--deepseek-llm-7b-chat/snapshots/afbda8b347ec881666061fa67447046fc5164ec8', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_keyfile_password=None, enable_ssl_refresh=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.833, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=2048, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, disable_priority_preemption=False, default_priority_value=None, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='cuda', tp_size=1, pp_size=1, 
pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, enable_streaming_session=False, random_seed=94410903, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, use_ray=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, extra_metric_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='/home/iguochan/.cache/huggingface/hub/models--deepseek-ai--deepseek-llm-7b-chat/snapshots/afbda8b347ec881666061fa67447046fc5164ec8', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', attn_cp_size=1, moe_dp_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, 
lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='auto', nsa_prefill_backend=None, nsa_decode_backend=None, disable_flashinfer_autotune=False, mamba_backend='triton', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, enable_aiter_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, 
elastic_ep_backend=None, enable_elastic_expert_backup=False, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype=None, mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, linear_attn_backend='triton', linear_attn_decode_backend=None, linear_attn_prefill_backend=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', disable_hicache_numa_detect=False, hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, hierarchical_sparse_attention_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=24, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, 
enable_torch_compile=False, disable_piecewise_cuda_graph=False, enforce_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=2048, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='round-robin-split', enable_fused_qk_norm_rope=False, enable_precise_embedding_interpolation=False, enable_fused_moe_sum_all_reduce=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, 
disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], enable_adaptive_dispatch_to_encoder=False, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, enable_mm_global_cache=False, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2026-03-14 16:15:39] Using default HuggingFace chat template with detected content format: string
Connected to: <socket.socket fd=3, family=2, type=1, proto=0, laddr=('127.0.0.1', 42718), raddr=('127.0.0.1', 33969)>.
Connected to: <socket.socket fd=3, family=2, type=1, proto=0, laddr=('127.0.0.1', 42728), raddr=('127.0.0.1', 33969)>.
Connected to: <socket.socket fd=3, family=2, type=1, proto=0, laddr=('127.0.0.1', 42744), raddr=('127.0.0.1', 33969)>.
[2026-03-14 16:15:50] Mamba selective_state_update backend initialized: triton
[2026-03-14 16:15:50] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-03-14 16:15:50] Init torch distributed ends. elapsed=0.11 s, mem usage=0.06 GB
[2026-03-14 16:15:50] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-03-14 16:15:50] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-03-14 16:15:50] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/home/iguochan/miniconda3/envs/sglang-test/lib/python3.12/site-packages/transformers/__init__.py)
[2026-03-14 16:15:51] Load weight begin. avail mem=21.64 GB
Loading pt checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading pt checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.33s/it]
Loading pt checkpoint shards: 100% Completed | 2/2 [00:04<00:00,  2.60s/it]
Loading pt checkpoint shards: 100% Completed | 2/2 [00:04<00:00,  2.41s/it]
[2026-03-14 16:15:55] Load weight end. elapsed=4.89 s, type=LlamaForCausalLM, avail mem=8.75 GB, mem usage=12.89 GB.
[2026-03-14 16:15:55] Using KV cache dtype: torch.bfloat16
[2026-03-14 16:15:55] KV Cache is allocated. #tokens: 11210, K size: 2.57 GB, V size: 2.57 GB
[2026-03-14 16:15:55] Memory pool end. avail mem=3.56 GB
[2026-03-14 16:15:56] Capture cuda graph begin. This can take up to several minutes. avail mem=3.13 GB
[2026-03-14 16:15:56] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24]
Capturing batches (bs=1 avail_mem=2.98 GB): 100%|██████████| 7/7 [00:06<00:00,  1.00it/s]
[2026-03-14 16:16:03] Capture cuda graph end. Time elapsed: 7.27 s. mem usage=0.15 GB. avail mem=2.98 GB.
[2026-03-14 16:16:03] Capture piecewise CUDA graph begin. avail mem=2.98 GB
[2026-03-14 16:16:03] Capture cuda graph num tokens [4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048]
[2026-03-14 16:16:03] install_torch_compiled
Compiling num tokens (num_tokens=2048):   0%|          | 0/42 [00:00<?, ?it/s]/home/iguochan/miniconda3/envs/sglang-test/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1692: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)
[2026-03-14 16:16:08] Initializing SGLangBackend
[2026-03-14 16:16:08] SGLangBackend __call__
[2026-03-14 16:16:09] Compiling a graph for dynamic shape takes 1.38 s
[2026-03-14 16:16:09] Computation graph saved to /home/iguochan/.cache/sglang/torch_compile_cache/rank_0_0/backbone/computation_graph_1773476169.8871477.py
Compiling num tokens (num_tokens=4): 100%|██████████| 42/42 [00:09<00:00,  4.38it/s]
Capturing num tokens (num_tokens=4 avail_mem=2.51 GB): 100%|██████████| 42/42 [00:02<00:00, 14.21it/s]
[2026-03-14 16:16:16] Capture piecewise CUDA graph end. Time elapsed: 12.75 s. mem usage=0.47 GB. avail mem=2.51 GB.
[2026-03-14 16:16:16] max_total_num_tokens=11210, chunked_prefill_size=2048, max_prefill_tokens=16384, max_running_requests=2048, context_len=4096, available_gpu_mem=2.51 GB
[2026-03-14 16:16:16] INFO:     Started server process [109642]
[2026-03-14 16:16:16] INFO:     Waiting for application startup.
[2026-03-14 16:16:16] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.7, 'top_k': 50, 'top_p': 0.95}
[2026-03-14 16:16:16] INFO:     Application startup complete.
[2026-03-14 16:16:16] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2026-03-14 16:16:17] INFO:     127.0.0.1:48318 - "GET /model_info HTTP/1.1" 200 OK
[2026-03-14 16:16:18] Prefill batch, #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.00, cuda graph: True
[2026-03-14 16:16:18] INFO:     127.0.0.1:48328 - "POST /generate HTTP/1.1" 200 OK
[2026-03-14 16:16:18] The server is fired up and ready to roll!
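The startup log reports "KV Cache is allocated. #tokens: 11210, K size: 2.57 GB, V size: 2.57 GB". As a sanity check, the arithmetic below reproduces that figure. It is a sketch under stated assumptions: the model shape (30 layers, 32 KV heads, head dimension 128) is taken from the published deepseek-llm-7b-chat configuration rather than from this log, bf16 (2 bytes per element) comes from the "Using KV cache dtype: torch.bfloat16" line, and the function name kv_cache_bytes is hypothetical.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, dtype_bytes: int) -> int:
    """Bytes needed for ONE of the two caches (K or V); both have the same size."""
    per_token = num_layers * num_kv_heads * head_dim * dtype_bytes
    return num_tokens * per_token


# num_tokens comes from the log; the model shape values are assumptions (see above).
k_bytes = kv_cache_bytes(
    num_tokens=11210,
    num_layers=30,      # assumed from the model config
    num_kv_heads=32,    # assumed from the model config (no grouped-query attention)
    head_dim=128,       # assumed: hidden size 4096 / 32 heads
    dtype_bytes=2,      # bf16
)

print(f"K size: {k_bytes / GIB:.2f} GiB")
print(f"K + V:  {2 * k_bytes / GIB:.2f} GiB")
```

Under these assumptions the K cache comes to about 2.57 GiB, matching the log, which suggests the "GB" in the log is really GiB. It also shows why only 11210 tokens fit: after the 12.89 GB of weights, little of the 24 GB card is left for the cache.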

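The service has started successfully. Let's verify it with a minimal example:

import requests
# Send one chat request to the server's OpenAI-compatible endpoint.
# The prompt asks, in Chinese, for a travel guide to Xinjiang.
response = requests.post("http://localhost:30000/v1/chat/completions",
                       json={"model": "deepseek-llm-7b-chat",
                             "messages": [{"role": "user", "content": "你好，我想要一份去新疆旅游的攻略"}]},
                       timeout=120)
response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
print(response.json()["choices"][0]["message"]["content"])

The client prints the following (the reply is in Chinese because the prompt is):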

你好！新疆是一个美丽而富饶的地方，有着丰富的自然景观和独特的民族文化。以下是一份简单的去新疆旅游攻略：

1. 行程规划：
   - 首先确定你的旅行时间，新疆的旅游旺季通常是夏季，此时天气适宜，旅游资源丰富。
   - 规划好行程路线，可以考虑乌鲁木齐为中心，向四周辐射，比如去吐鲁番看葡萄沟、火焰山，去喀纳斯湖体验北疆风光，去伊犁看薰衣草花海等。

2. 交通方式：
   - 新疆地域广阔，可以选择飞机、火车或者长途汽车作为交通方式。
   - 乌鲁木齐有国际机场，可以直飞到乌鲁木齐，再开始你的旅程。

3. 住宿建议：
   - 新疆的住宿选择很多，从豪华酒店到经济型旅馆都有。
   - 建议在主要旅游城市和景区附近预订住宿，如乌鲁木齐、喀纳斯等地。

4. 注意事项：
   - 新疆地域广阔，气候多变，出行前请关注当地天气预报。
   - 新疆是多民族聚居区，尊重当地的风俗习惯是非常重要的。
   - 注意个人安全，尤其是在人烟稀少的地区，最好结伴出行。

5. 美食推荐：
   - 新疆的美食丰富多样，有烤羊肉串、大盘鸡、手抓饭等。
   - 不要错过当地的水果，如葡萄、哈密瓜、杏子等。

6. 文化体验：
   - 参观新疆的博物馆和历史遗迹，了解新疆的历史文化。
   - 体验当地的民族风情，如参加维吾尔族的婚礼、观看民族歌舞表演等。

希望这份简单的攻略能帮助你规划一次难忘的新疆之旅！

The server prints the following:

[2026-03-14 16:20:41] Prefill batch, #new-seq: 1, #new-token: 17, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.03, cuda graph: True
[2026-03-14 16:20:41] Decode batch, #running-req: 1, #token: 51, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.14, #queue-req: 0
[2026-03-14 16:20:42] Decode batch, #running-req: 1, #token: 91, token usage: 0.01, cuda graph: True, gen throughput (token/s): 69.18, #queue-req: 0
[2026-03-14 16:20:42] Decode batch, #running-req: 1, #token: 131, token usage: 0.01, cuda graph: True, gen throughput (token/s): 69.11, #queue-req: 0
[2026-03-14 16:20:43] Decode batch, #running-req: 1, #token: 171, token usage: 0.02, cuda graph: True, gen throughput (token/s): 68.96, #queue-req: 0
[2026-03-14 16:20:43] Decode batch, #running-req: 1, #token: 211, token usage: 0.02, cuda graph: True, gen throughput (token/s): 67.23, #queue-req: 0
[2026-03-14 16:20:44] Decode batch, #running-req: 1, #token: 251, token usage: 0.02, cuda graph: True, gen throughput (token/s): 68.62, #queue-req: 0
[2026-03-14 16:20:45] Decode batch, #running-req: 1, #token: 291, token usage: 0.03, cuda graph: True, gen throughput (token/s): 68.58, #queue-req: 0
[2026-03-14 16:20:45] Decode batch, #running-req: 1, #token: 331, token usage: 0.03, cuda graph: True, gen throughput (token/s): 68.51, #queue-req: 0
[2026-03-14 16:20:46] Decode batch, #running-req: 1, #token: 371, token usage: 0.03, cuda graph: True, gen throughput (token/s): 68.42, #queue-req: 0
[2026-03-14 16:20:46] Decode batch, #running-req: 1, #token: 0, token usage: 0.00, cuda graph: True, gen throughput (token/s): 68.12, #queue-req: 0
[2026-03-14 16:20:46] INFO:     127.0.0.1:35042 - "POST /v1/chat/completions HTTP/1.1" 200 OK

This confirms that the service is up and answering requests.
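The "Decode batch" lines in the server log carry a "gen throughput (token/s)" field, which makes it easy to get a rough single-request decode speed. The sketch below parses those lines with a regular expression. The regex is written against the log format shown above and may need adjusting for other sglang versions. The first sample is skipped because its value (0.14) presumably averages over the idle time before the request arrived.

```python
import re
from statistics import mean
from typing import List, Optional

# Matches only "Decode batch" lines; prefill lines report "input throughput" instead.
DECODE_RE = re.compile(r"Decode batch,.*?gen throughput \(token/s\): ([\d.]+)")


def decode_throughputs(log_text: str) -> List[float]:
    """Return every gen-throughput value found in Decode batch log lines."""
    return [float(value) for value in DECODE_RE.findall(log_text)]


def steady_state_mean(values: List[float], warmup: int = 1) -> Optional[float]:
    """Average throughput after dropping the first `warmup` samples."""
    steady = values[warmup:]
    return mean(steady) if steady else None


# Three lines copied from the server output above.
sample_log = """\
[2026-03-14 16:20:41] Prefill batch, #new-seq: 1, #new-token: 17, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.03, cuda graph: True
[2026-03-14 16:20:41] Decode batch, #running-req: 1, #token: 51, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.14, #queue-req: 0
[2026-03-14 16:20:42] Decode batch, #running-req: 1, #token: 91, token usage: 0.01, cuda graph: True, gen throughput (token/s): 69.18, #queue-req: 0
[2026-03-14 16:20:42] Decode batch, #running-req: 1, #token: 131, token usage: 0.01, cuda graph: True, gen throughput (token/s): 69.11, #queue-req: 0
"""

values = decode_throughputs(sample_log)
print("samples:", values)
print("steady-state mean (token/s):", steady_state_mean(values))
```

For the full log above, the steady-state samples all sit between roughly 67 and 69 token/s for a single running request on this RTX 4090.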

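Remote development

Remote development is also an option. I use PyCharm, and configuration guides for it are easy to find online. Connect to the server over SSH and select the conda environment on the server as the interpreter, and it feels much like working locally. One caveat: changes made on the local and remote sides are synced without the version tracking that Git provides, so watch out for conflicting edits.

2.2 Pitfalls

Of course, the whole process did not go as smoothly as described above! Here are the pitfalls I ran into, in the hope that they save you some trouble.

gcc and g++ versions

At first, starting the service failed with: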

/home/iguochan/miniconda3/envs/sglang/bin/python -X pycache_prefix=/home/iguochan/.cache/JetBrains/PyCharm2025.2/cpython-cache /opt/pycharm-2025.2.6/plugins/python-ce/helpers/pydev/pydevd.py --multiprocess --qt-support=auto --client 127.0.0.1 --port 42727 --file /home/iguochan/workspace/github/sglang/python/sglang/launch_server.py --model-path /home/iguochan/.cache/huggingface/hub/models--deepseek-ai--deepseek-llm-7b-chat/snapshots/afbda8b347ec881666061fa67447046fc5164ec8 --trust-remote-code --host 0.0.0.0 --port 30000 --log-level debug 
Connected to: <socket.socket fd=3, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 48896), raddr=('127.0.0.1', 42727)>.
Connected to pydev debugger (build 252.28539.27)
/home/iguochan/workspace/github/sglang/python/sglang/launch_server.py:51: UserWarning: 'python -m sglang.launch_server' is still supported, but 'sglang serve' is the recommended entrypoint.
  Example: sglang serve --model-path <model> [options]
  warnings.warn(
[2026-03-14 10:55:06] INFO server_args.py:2140: Attention backend not specified. Use flashinfer backend by default.

...
gcc: fatal error: cannot execute ‘cc1plus’: execvp: No such file or directory
compilation terminated.

...

Possible solutions:
1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
3. disable torch compile by not using --enable-torch-compile
4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 


[2026-03-14 10:55:28] Received sigquit from a child process. It usually means the child failed.

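Process finished with exit code 137 (interrupted by signal 9:SIGKILL)

The key error is that gcc cannot execute cc1plus ("No such file or directory"). cc1plus is the C++ compiler proper that ships with g++, so the message means gcc could not find a C++ compiler matching its own version. I took many detours here. Eventually I entered the Docker container I had built by following the "Development Guide Using Docker", found that both its gcc and g++ were newer, and so upgraded gcc and g++ on the host:

1. Add the PPA and update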

sudo apt update
sudo apt install software-properties-common
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt update

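2. Install g++-13

sudo apt install gcc-13 g++-13

3. Set the default version

Use update-alternatives to manage multiple versions:

# Register gcc-13/g++-13 as alternatives
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-13 100
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-13 100

# To keep the old version available, register it with a lower priority
# sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 50
# sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-11 50

# Interactively choose the default version (do this for both gcc and g++)
sudo update-alternatives --config gcc
sudo update-alternatives --config g++

4. Verify the installation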

gcc --version
g++ --version
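A version check alone does not reveal whether the cc1plus error would come back, because gcc can report a version while its C++ backend is missing. A more direct check is to ask gcc where it would find cc1plus, using the standard -print-prog-name option: gcc prints an absolute path when the program is found and echoes the bare name otherwise. The sketch below wraps that in Python. The helper names are hypothetical.

```python
import os
import shutil
import subprocess
from typing import Optional


def cc1plus_resolved(prog_name_output: str) -> bool:
    """True if gcc reported an absolute path, meaning it located cc1plus."""
    return os.path.isabs(prog_name_output.strip())


def query_cc1plus(gcc: str = "gcc") -> Optional[str]:
    """Ask gcc where cc1plus lives; return None if gcc itself is not on PATH."""
    if shutil.which(gcc) is None:
        return None
    result = subprocess.run(
        [gcc, "-print-prog-name=cc1plus"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    answer = query_cc1plus()
    if answer is None:
        print("gcc not found on PATH")
    elif cc1plus_resolved(answer):
        print("cc1plus found at", answer)
    else:
        print("cc1plus is missing: install the g++ package matching the active gcc")
```

If the last branch fires, the "cannot execute ‘cc1plus’" error is to be expected as soon as anything tries to compile C++ code.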

flashinfer also needs to be reinstalled

After the upgrade, the next run failed with a different error:

/home/iguochan/miniconda3/envs/sglang/bin/python -X pycache_prefix=/home/iguochan/.cache/JetBrains/PyCharm2025.2/cpython-cache /opt/pycharm-2025.2.6/plugins/python-ce/helpers/pydev/pydevd.py --multiprocess --qt-support=auto --client 127.0.0.1 --port 35571 --file /home/iguochan/workspace/github/sglang/python/sglang/launch_server.py --model-path /home/iguochan/.cache/huggingface/hub/models--deepseek-ai--deepseek-llm-7b-chat/snapshots/afbda8b347ec881666061fa67447046fc5164ec8 --trust-remote-code --host 0.0.0.0 --port 30000 
Connected to: <socket.socket fd=3, family=2, type=1, proto=0, laddr=('127.0.0.1', 42852), raddr=('127.0.0.1', 35571)>.
Connected to pydev debugger (build 252.28539.27)
/home/iguochan/workspace/github/sglang/python/sglang/launch_server.py:51: UserWarning: 'python -m sglang.launch_server' is still supported, but 'sglang serve' is the recommended entrypoint.
  Example: sglang serve --model-path <model> [options]
  warnings.warn(
[2026-03-14 12:31:59] INFO server_args.py:2140: Attention backend not specified. Use flashinfer backend by default.
...



/home/iguochan/.cache/flashinfer/0.6.4/89/generated/batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False/batch_decode_config.inc:2:10: fatal error: flashinfer/page.cuh: No such file or directory
    2 | #include <flashinfer/page.cuh>
      |          ^~~~~~~~~~~~~~~~~~~~~
compilation terminated.

...

Possible solutions:
1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
3. disable torch compile by not using --enable-torch-compile
4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 


[2026-03-14 12:32:18] Received sigquit from a child process. It usually means the child failed.

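Process finished with exit code 137 (interrupted by signal 9:SIGKILL)

The error says that a flashinfer header file (flashinfer/page.cuh) cannot be found. At the time I had no idea what the underlying problem was. On a whim, I simply reinstalled flashinfer:

1. Completely uninstall the existing FlashInfer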

pip uninstall -y flashinfer-python flashinfer-cubin flashinfer-jit-cache
# Clear the cache
rm -rf /home/iguochan/.cache/flashinfer/

2. Reinstall the full FlashInfer suite

# Install the main package first
pip install flashinfer-python==0.6.4

# Install the cubin package
pip install flashinfer-cubin==0.6.4

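And then, as if by magic, it worked! My guess is that the first failure came from the gcc/g++ setup, and that the error persisted after the upgrade because the existing flashinfer installation had been set up under the old gcc/g++, so it had to be reinstalled. The evidence for this: when I later created a brand-new conda environment, I hit none of the pitfalls above, because by then the gcc/g++ versions were: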

$ gcc --version
gcc (Ubuntu 13.4.0-6ubuntu1~22~ppa2) 13.4.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ g++ --version
g++ (Ubuntu 13.4.0-6ubuntu1~22~ppa2) 13.4.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
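To tell whether an environment would hit the missing-header error, one option is to look for page.cuh inside the installed flashinfer package. The sketch below searches recursively because the exact location of the headers inside the package is an assumption I have not verified for version 0.6.4. The helper names are hypothetical.

```python
import importlib.util
from pathlib import Path
from typing import List, Optional


def package_dir(name: str) -> Optional[Path]:
    """Return the install directory of a top-level package without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def find_header(root: Path, filename: str) -> List[Path]:
    """Recursively collect every file called `filename` under `root`."""
    return sorted(root.rglob(filename))


if __name__ == "__main__":
    root = package_dir("flashinfer")
    if root is None:
        print("flashinfer is not installed in this environment")
    else:
        hits = find_header(root, "page.cuh")
        if hits:
            print("page.cuh found at:", hits[0])
        else:
            print("page.cuh not found under", root, "- a reinstall may be needed")
```

If the header is present but compilation still fails, clearing the ~/.cache/flashinfer/ directory as in step 1 is the next thing to try.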

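3. Docker development

See the "Development Guide Using Docker". I have no pressing need for it at the moment, but I followed its instructions to set up a container, and it runs fine!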