计费/限流变了别慌：Python 情景评估“旧价vs新价”成本冲击 + 并发估算这篇直接给你一份最小 Python 脚本

大模型应用上线后，供应商最常见的“外部变量”就是：价格调整 / 计费口径变化 / 限流收紧。
吐槽没用，工程上最有效的应对方式是把它做成一个可运行的 what-if 计算器：

旧价 vs 新价：单次/日/月成本差多少？
限流收紧：我的峰值并发会不会排队？
需要哪些门禁/降级，才能不翻车？

这篇直接给你一份最小 Python 脚本：读 jsonl 日志，统计 P50/P95 token 和延迟，然后输出成本冲击与并发估算。

0）公式先记住（够用）

0.1 单次成本

[ Cost_{call} \approx \frac{T_{in}}{1000}P_{in} + \frac{T_{out}}{1000}P_{out} ]

0.2 月成本冲击（旧价 vs 新价）

[ \Delta Cost_{month} \approx 30 \times N_{day} \times (Cost_{call}^{new}-Cost_{call}^{old}) ]

0.3 并发估算（是否会排队）

[ Concurrency \approx QPS_{peak} \times Latency_{p95}(\text{seconds}) ]

经验：再乘 1.2~1.5 安全系数覆盖抖动与重试。

1）你需要的最少日志字段（建议先齐这 4 个）

每行一条 json（jsonl）：

{"id":"r1","ok":true,"input_tokens":1200,"output_tokens":300,"latency_ms":820}
{"id":"r2","ok":true,"input_tokens":5400,"output_tokens":900,"latency_ms":1400}

可选但强烈建议：

model（用于分模型算账）
retry_count（重试会放大成本）
prompt_version（提示词漂移会导致 token 漂移）

2）最小脚手架：pricing_impact.py

import json
from dataclasses import dataclass
from typing import Optional


def percentile(sorted_vals: list[int], p: float) -> Optional[int]:
    if not sorted_vals:
        return None
    idx = int(p * (len(sorted_vals) - 1))
    return sorted_vals[idx]


@dataclass
class Pricing:
    # 单价：元/1k token
    input_per_1k: float
    output_per_1k: float


@dataclass
class BudgetInputs:
    daily_calls: int
    qps_peak: float
    safety_factor: float = 1.3


@dataclass
class Stats:
    p50_in: int
    p95_in: int
    p50_out: int
    p95_out: int
    p50_latency_ms: int
    p95_latency_ms: int


def load_jsonl(path: str) -> list[dict]:
    rows = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                rows.append(json.loads(line))
    return rows


def compute_stats(rows: list[dict]) -> Stats:
    ins = sorted(int(r["input_tokens"]) for r in rows)
    outs = sorted(int(r["output_tokens"]) for r in rows)
    lats = sorted(int(r["latency_ms"]) for r in rows)
    return Stats(
        p50_in=percentile(ins, 0.50),
        p95_in=percentile(ins, 0.95),
        p50_out=percentile(outs, 0.50),
        p95_out=percentile(outs, 0.95),
        p50_latency_ms=percentile(lats, 0.50),
        p95_latency_ms=percentile(lats, 0.95),
    )


def cost_per_call(tin: int, tout: int, pricing: Pricing) -> float:
    return (tin / 1000.0) * pricing.input_per_1k + (tout / 1000.0) * pricing.output_per_1k


def report(pricing_old: Pricing, pricing_new: Pricing, stats: Stats, budget: BudgetInputs):
    # 用 P95 做 SLA 预算（更贴近线上上限）
    old_p95 = cost_per_call(stats.p95_in, stats.p95_out, pricing_old)
    new_p95 = cost_per_call(stats.p95_in, stats.p95_out, pricing_new)

    delta_per_call = new_p95 - old_p95
    delta_month = 30 * budget.daily_calls * delta_per_call

    concurrency = budget.qps_peak * (stats.p95_latency_ms / 1000.0) * budget.safety_factor

    return {
        "p50_in": stats.p50_in,
        "p95_in": stats.p95_in,
        "p50_out": stats.p50_out,
        "p95_out": stats.p95_out,
        "p50_latency_ms": stats.p50_latency_ms,
        "p95_latency_ms": stats.p95_latency_ms,
        "old_cost_per_call_p95": round(old_p95, 6),
        "new_cost_per_call_p95": round(new_p95, 6),
        "delta_per_call_p95": round(delta_per_call, 6),
        "delta_month_p95": round(delta_month, 2),
        "concurrency_estimate": round(concurrency, 2),
    }


def main():
    rows = [r for r in load_jsonl("requests.jsonl") if r.get("ok", True)]
    stats = compute_stats(rows)

    # TODO：按你的真实价格改（单位：元/1k token）
    pricing_old = Pricing(input_per_1k=1.0, output_per_1k=3.0)
    pricing_new = Pricing(input_per_1k=1.2, output_per_1k=3.6)

    # TODO：按你的业务量改
    budget = BudgetInputs(daily_calls=1000, qps_peak=10, safety_factor=1.3)

    print(report(pricing_old, pricing_new, stats, budget))


if __name__ == "__main__":
    main()

你得到的输出大概长这样（示意）：

{
  'p95_in': 5400,
  'p95_out': 900,
  'old_cost_per_call_p95': 7.2,
  'new_cost_per_call_p95': 8.64,
  'delta_per_call_p95': 1.44,
  'delta_month_p95': 43200.0,
  'concurrency_estimate': 18.2
}

你可以把 delta_month_p95 直接贴到评审会里：
“价格变化导致按 P95 估算的月成本上升 4.32 万（示例）。”

3）怎么把“算出来的冲击”变成可执行动作

我建议按这个优先级做（别一上来就换模型）：

加预算门禁：max_input_tokens/max_output_tokens/context_budget/history_window
控 token：history 摘要、context 预算、工具返回投影/截断
修重试：把可重试/不可重试错误区分，避免放大回路
上缓存：相同请求/检索结果/工具结果缓存
再做路由：任务分级 + 默认模型 + 回退模型（有评测才动）

4）路由骨架（伪代码，够你落地）

def choose_model(task):
    # 例：高风险任务用更稳的模型；低风险走便宜模型
    if task.kind in {"payment", "policy", "legal"}:
        return "strong-model"
    if task.requires_citation:
        return "strong-model"
    return "cheap-model"


def call_llm(task, client):
    model = choose_model(task)
    try:
        return client.chat.completions.create(
            model=model,
            messages=task.messages,
            max_tokens=task.max_tokens,
        )
    except RateLimitError:
        # 降级策略：换模型/缩短上下文/返回缓存/排队
        return fallback(task, client)

关键不是“写个 if-else”，而是要有：评测基线 + 门禁 + 回退策略。

5）变更当天 Checklist（复制到工单）

确认变化类型：价格/计费口径/限流/模型上下线
用脚本跑一次 P50/P95 基线（token/延迟）
算旧价 vs 新价：单次/月冲击（P50/P95 两套）
估并发：(QPS_{peak} \times Latency_{p95}) 是否会排队
预算门禁上线（token/context/history）
重试策略审查（避免放大回路）
线上观察 24 小时：P95 token、P95 延迟、错误率、成本突增告警

资源区：做成本对比/路由试验时，先把接入层统一（省掉很多工程摩擦）

很多团队在“价格变化应对”里会做多模型对比与路由试验。
如果每次对比都要换 SDK/鉴权，评估成本会很高。

更省事的方式是统一成 OpenAI 兼容入口（多数时候只改 base_url 与 api_key）。
举个例子：我会用 147ai 这类多模型聚合入口做对比（具体模型/价格/限流以其控制台与文档为准）：

API Base URL：https://147ai.com
端点：POST /v1/chat/completions
鉴权：Authorization: Bearer <KEY>
文档：https://147api.apifox.cn/