Quantizing ONNX models with onnxruntime in Python

How to quantize an existing ONNX model

The official ONNX documentation is rather scattered, and it took some digging to find example code for quantization.

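from onnxruntime.quantization import quantize_dynamic, QuantType

model_fp32 = 'maybe_one_bs_roformer.onnx'         # path to the original FP32 model
model_quant = 'maybe_one_bs_roformer.quant.onnx'  # path the quantized model is written to

# quantize_dynamic saves the result to model_quant; it does not return the model.
quantize_dynamic(
    model_fp32,
    model_quant,
    weight_type=QuantType.QUInt8,  # store weights as unsigned 8-bit integers
)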
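Once the quantized file has been written, it is worth checking that it still behaves like the original. The following is a minimal sketch, not part of the original example: the helper name max_abs_diff is made up for illustration, and the input names and shapes in feeds depend on your model, so they are left to the caller.

```python
def max_abs_diff(fp32_path, quant_path, feeds):
    """Run both models on the same inputs and return, for each output,
    the largest absolute difference between the two results.

    feeds is a dict mapping input names to NumPy arrays. The names and
    shapes are model-specific and are assumptions the caller must supply.
    """
    # Imported here so the helper can be defined even where onnxruntime
    # is not installed yet.
    import onnxruntime as ort

    providers = ["CPUExecutionProvider"]
    ref = ort.InferenceSession(fp32_path, providers=providers).run(None, feeds)
    out = ort.InferenceSession(quant_path, providers=providers).run(None, feeds)
    return [float(abs(a - b).max()) for a, b in zip(ref, out)]

# Hypothetical usage ("input" is a placeholder name, not taken from the model):
# diffs = max_abs_diff(model_fp32, model_quant, {"input": some_array})
```

How large a difference is acceptable depends on the model and the task, so treat the numbers as a rough sanity check rather than a pass/fail criterion.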

For guidance on choosing weight_type, see this site: iot-robotics.github.io/ONNXRuntime…. According to it, QuantType.QUInt8 is the only option when running on CPU.
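To make concrete what storing weights as unsigned 8-bit integers means, here is a generic sketch of asymmetric (scale plus zero-point) quantization in plain Python. It illustrates the general idea only and is not onnxruntime's actual implementation; the function names and sample values are made up for the example.

```python
def quantize_uint8(values):
    """Map floats to integers in 0..255 using a scale and a zero point."""
    # The representable range is widened to include 0.0 so that zero
    # maps exactly onto an integer.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255
    if scale == 0:
        scale = 1.0  # all-zero input: any scale works
    zero_point = round(-lo / scale)
    q = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from the 8-bit integers."""
    return [(x - zero_point) * scale for x in q]

weights = [-0.8, -0.1, 0.0, 0.35, 1.2]  # sample values, not from a real model
q, scale, zero_point = quantize_uint8(weights)
restored = dequantize(q, scale, zero_point)
print(q, scale, zero_point)
print(restored)
```

Each restored value differs from the original by roughly half a quantization step at most, which is the precision given up in exchange for the smaller storage.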

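Quantizing 32-bit weights with QUInt8 should, in theory, shrink them to a quarter of their original size, since each weight goes from 32 bits to 8. The resulting file is usually somewhat larger than exactly one quarter, because the scale and zero-point values are stored too and any tensors left unquantized keep their original size.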
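The actual reduction can be checked by comparing the two files on disk. A small sketch using only the standard library; the helper name size_ratio is made up for illustration, and the file names are the ones from the example above.

```python
import os

def size_ratio(fp32_path, quant_path):
    """Return the quantized file size divided by the original file size."""
    return os.path.getsize(quant_path) / os.path.getsize(fp32_path)

model_fp32 = 'maybe_one_bs_roformer.onnx'
model_quant = 'maybe_one_bs_roformer.quant.onnx'

# Only report when both files are present, so the snippet is safe to run anywhere.
if os.path.exists(model_fp32) and os.path.exists(model_quant):
    print(f"quantized / original size: {size_ratio(model_fp32, model_quant):.2f}")
```

A ratio close to 0.25 means nearly all of the file was weight data that got quantized.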