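Overview

I have recently been looking into inference acceleration for large language models. The commonly used techniques I came across include:

Model pruning
Model distillation
Model quantization
Weight matrix decomposition
Parameter sharing
Batch inference
...

There are many ways to speed up large-model inference. The definition, principle, and effect of each method listed above are summarized below: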

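Model Pruning:

- Definition: Shrink the model's size and computation by removing redundant parameters, connections, or layers.
- Principle: Select by weight importance. Using a threshold, an evolutionary algorithm, or a similar method, prune the neurons, connections, or layers whose weights are small or unimportant.
- Effect: Reduces model size and computation, but may affect the model's performance or accuracy.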
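
As a rough illustration of threshold-style pruning, the sketch below zeroes out the weights with the smallest magnitude. The function name and the toy weights are made up for this example, and real pruning tools work on tensors and usually fine-tune the model afterwards.

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(len(weights) * sparsity)  # number of weights to remove
    # Indices ordered by absolute value, smallest first
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_drop = set(order[:k])
    return [0.0 if i in to_drop else w for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.3, -1.2, 0.01, 0.6]  # toy values, not from a real model
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # [0.8, 0.0, 0.0, -1.2, 0.0, 0.6]
```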

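Model Distillation:

- Definition: Train a smaller model (the student) using the knowledge of a larger model (the teacher).
- Principle: The student learns from the teacher's predictions or from other auxiliary objectives, so that the teacher's knowledge is distilled into the student.
- Effect: Transfers much of the teacher's performance to a compact, efficient model; the student is usually smaller and faster at inference.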
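
One common way to "learn from the teacher's predictions" is a KL divergence between temperature-softened output distributions. The sketch below assumes that formulation; the function names, the temperature value, and the logits are illustrative only.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by temperature^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [2.0, 1.0, 0.1]  # toy logits
student = [1.5, 1.2, 0.3]
print(f"loss = {distillation_loss(student, teacher):.4f}")
```

In practice this term is usually combined with the ordinary training loss on the ground-truth labels.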

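Model Quantization:

- Definition: Convert the model's floating-point parameters to a lower-precision integer representation to reduce memory usage and computation.
- Principle: Use a discretization method (such as K-means clustering or uniform quantization) to map floating-point parameters to integers.
- Effect: Reduces computation and memory usage and speeds up inference, but may affect the model's performance.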
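
To make uniform quantization concrete, here is a minimal sketch of symmetric per-tensor quantization to signed 8-bit integers. It is a generic illustration with made-up values, not the GPTQ algorithm used later in this article.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric uniform quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


values = [0.42, -1.30, 0.07, 0.99]  # toy weights
codes, scale = quantize_int8(values)
restored = dequantize(codes, scale)
print(codes, [round(x, 3) for x in restored])
```

Each restored value differs from the original by at most half a quantization step (scale / 2), which is the source of the accuracy loss mentioned above.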

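Weight Matrix Decomposition:

- Definition: Decompose the model's weight matrices into several smaller matrices to reduce computation and memory usage.
- Principle: Use a matrix factorization method (such as SVD or QR decomposition) to split a weight matrix into several smaller matrices.
- Effect: Reduces computation and memory usage and speeds up inference, but may affect the model's performance.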
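
The saving from a low-rank factorization can be estimated with simple arithmetic: an m x n matrix is replaced by an m x r matrix times an r x n matrix. The matrix size and the rank below are hypothetical values chosen only to show the calculation.

```python
def dense_params(m: int, n: int) -> int:
    """Parameters in a full m x n weight matrix."""
    return m * n


def low_rank_params(m: int, n: int, rank: int) -> int:
    """Parameters in A (m x rank) plus B (rank x n), whose product approximates the matrix."""
    return rank * (m + n)


m, n, rank = 5120, 5120, 256  # illustrative sizes; the rank is an assumption
dense = dense_params(m, n)
factored = low_rank_params(m, n, rank)
print(f"dense: {dense}, factored: {factored}, ratio: {factored / dense:.2%}")
```

The factorization only pays off when rank is well below m * n / (m + n); how small the rank can go before quality degrades has to be checked experimentally.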

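Parameter Sharing:

- Definition: Share parameters between different parts of the model to reduce the total number of parameters.
- Principle: Reuse the same weights or other parameters in several places to minimize redundancy in the model.
- Effect: Reduces the number of parameters and the memory footprint, which can also help inference speed, but may affect the model's representational capacity or performance.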
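
A small sketch of the idea, assuming cross-layer sharing where every layer points at the same weight list. The Layer class and the sizes are invented for illustration.

```python
from typing import List


class Layer:
    """Toy layer that holds a reference to its weights (not a copy)."""

    def __init__(self, weights: List[float]) -> None:
        self.weights = weights


def count_unique_parameters(layers: List[Layer]) -> int:
    """Count parameters, counting each shared weight list only once."""
    seen = {}
    for layer in layers:
        seen[id(layer.weights)] = len(layer.weights)
    return sum(seen.values())


shared_weights = [0.0] * 1000
shared_stack = [Layer(shared_weights) for _ in range(12)]
independent_stack = [Layer([0.0] * 1000) for _ in range(12)]

print(count_unique_parameters(shared_stack))       # 1000
print(count_unique_parameters(independent_stack))  # 12000
```

Note that both stacks still run 12 layers per forward pass, so this kind of sharing saves memory rather than arithmetic.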

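Batch Inference:

- Definition: Group multiple input samples into one batch for inference to make full use of the parallel computing power of hardware accelerators.
- Principle: Process multiple input samples at the same time, using parallel computation to speed up inference.
- Effect: Improves inference efficiency and throughput.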
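
The sketch below shows the bookkeeping that batching usually needs: splitting inputs into fixed-size batches and padding token sequences to a common length. Left padding is assumed here because it is the usual choice for decoder-only generation; the helper names and the pad id are illustrative.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def pad_left(sequences: List[List[int]], pad_id: int) -> List[List[int]]:
    """Left-pad token id sequences so they all share the longest length."""
    width = max(len(seq) for seq in sequences)
    return [[pad_id] * (width - len(seq)) + seq for seq in sequences]


prompts = [f"prompt {i}" for i in range(10)]
batches = list(batched(prompts, batch_size=4))
print([len(batch) for batch in batches])    # [4, 4, 2]
print(pad_left([[1, 2, 3], [4]], pad_id=0))  # [[1, 2, 3], [0, 0, 4]]
```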

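These methods can be used alone or in combination to speed up large-model inference. Choose the technique that fits your scenario and requirements. Note that different methods affect performance, size, and accuracy differently, so run experiments and evaluate before adopting one.

This article focuses on model quantization, using the quantization of Qwen-14B with AutoGPTQ as the running example. For the theory behind quantization, see the resources at the end of the article. This article is mainly a hands-on walkthrough and covers the following:

Quantization results
Quantization process
Evaluating the quantized model
Quantization best practices
Problems encountered during quantization
Related resources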

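Quantization results

Model	Token rate	GPU memory usage	Token rate improvement	GPU memory saved
Qwen-14B	23.66 tokens/s	27.73 GB	-	-
Qwen-14B-int4-gptq	31.54 tokens/s	12.11 GB	34%	55%
Qwen-14B-int8-gptq	26.30 tokens/s	17.86 GB	13%	37%

Conclusion: The evaluation ran on a single A100-80G GPU with PyTorch 2.0.1 and CUDA 11.7, generating 5120 tokens. Quantizing the model to 4-bit raises the token generation rate by roughly a third (about 33%) and cuts GPU memory usage by 55%. Quantizing the model to 8-bit leaves the token generation rate close to that of the unquantized model (about 13% faster in the table above) while cutting GPU memory usage by 37%. These results are broadly consistent with Qwen's own benchmarks.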

Quantization process

1. Quantizing with AutoGPTQ

First install auto_gptq with the following command:

pip install auto_gptq

The quantization code is based on the example code provided by AutoGPTQ:

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

pretrained_model_name = "Qwen-14B"
quantized_model_dir = 'Qwen-14B-4bit-gptq'

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize the model to 4-bit
    group_size=128,  # 128 is the generally recommended value
    desc_act=False,  # False speeds up inference significantly, but perplexity may get slightly worse
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_name, 
                                            quantize_config, 
                                            trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, 
                                          use_fast=True, 
                                          trust_remote_code=True)

examples = [
    tokenizer(
        "Auto-GPTQ 是一个简单易用的模型量化库，基于 GPTQ 算法，具有用户友好的 API。"
    )
]

model.quantize(examples)

model.save_quantized(quantized_model_dir, use_safetensors=True)
tokenizer.save_pretrained(quantized_model_dir)

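Note: Only one text is used to guide quantization here to keep the code simple. Using more examples will most likely produce a better quantized model.

In my tests, the models quantized with 1 text and with 100 texts scored about the same, so how many examples give the best quantized model remains to be verified.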
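
To try more than one calibration text, the examples list can be built from a collection of texts. The sketch below assumes that model.quantize() accepts a list of dicts with input_ids and attention_mask, as in the example above; build_calibration_examples and toy_tokenize are hypothetical helpers, and toy_tokenize only stands in for the real tokenizer so the snippet runs on its own.

```python
import random
from typing import Callable, Dict, List

Encoded = Dict[str, List[int]]


def build_calibration_examples(
    texts: List[str],
    tokenize: Callable[[str], Encoded],
    num_samples: int,
    max_length: int = 512,
    seed: int = 0,
) -> List[Encoded]:
    """Sample texts and tokenize them into the list-of-dicts shape used for calibration."""
    rng = random.Random(seed)  # fixed seed so runs are comparable
    chosen = rng.sample(texts, k=min(num_samples, len(texts)))
    examples = []
    for text in chosen:
        encoded = tokenize(text)
        examples.append({
            "input_ids": list(encoded["input_ids"])[:max_length],
            "attention_mask": list(encoded["attention_mask"])[:max_length],
        })
    return examples


def toy_tokenize(text: str) -> Encoded:
    """Stand-in tokenizer (one id per character); replace with the real tokenizer."""
    ids = [ord(ch) for ch in text]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


texts = ["first sample", "second sample", "third", "fourth sample text", "fifth"]
examples = build_calibration_examples(texts, toy_tokenize, num_samples=3, max_length=8)
print(len(examples), [len(e["input_ids"]) for e in examples])
```

With the real tokenizer from the quantization script, the tokenizer object itself could be passed as the tokenize argument and the result handed to model.quantize().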

If tokenizer.save_pretrained(quantized_model_dir) is not called to save the tokenizer files, a successful quantization run produces the following files:

For 8-bit quantization, set BaseQuantizeConfig(bits=8).
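
To show what bits and group_size control, the sketch below quantizes one weight row group by group with plain round-to-nearest. This is deliberately simpler than GPTQ, which additionally adjusts the remaining weights to compensate for quantization error; all names and values here are illustrative.

```python
import math
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (codes, scale, minimum)


def quantize_group(values: List[float], bits: int = 4) -> Group:
    """Round-to-nearest quantization of one group to unsigned `bits`-bit codes."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def quantize_row(row: List[float], bits: int = 4, group_size: int = 128) -> List[Group]:
    """Quantize a row in chunks; each chunk gets its own scale and minimum."""
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


def dequantize_row(groups: List[Group]) -> List[float]:
    """Rebuild approximate float values from the per-group codes."""
    return [lo + c * scale for codes, scale, lo in groups for c in codes]


row = [math.sin(0.1 * i) for i in range(256)]  # toy weight row
groups = quantize_row(row, bits=4, group_size=128)
restored = dequantize_row(groups)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), f"max error = {max_error:.4f}")
```

A smaller group_size means more scales to store but a tighter value range per group, so the per-weight error tends to shrink.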

2. Loading the quantized model

After quantization finishes, loading the quantized model directly with the code below raises an error:

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

from transformers import AutoTokenizer, GenerationConfig

quantized_model_dir = 'Qwen-14B-4bit-gptq/'

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir,
                                          use_fast=True,
                                          trust_remote_code=True)

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,  
                                           device_map="auto", 
                                           use_safetensors=True, 
                                           trust_remote_code=True)

print(tokenizer)
print(model)

The error message is:

Could not locate the modeling_qwen.py inside Qwen-14B-4bit-gptq/.
OSError: Qwen-14B-4bit-gptq/ does not appear to have a file named modeling_qwen.py.

This happens because the following files are missing after quantization:

modeling_qwen.py
qwen_generation_utils.py
cpp_kernels.py
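
One way to restore these files is to copy them over from the original Qwen-14B model directory. The helper below is a small sketch using only the standard library; copy_missing_files is a made-up name and the directory arguments are whatever paths you use locally.

```python
import shutil
from pathlib import Path
from typing import Iterable, List, Union

REQUIRED_FILES = ("modeling_qwen.py", "qwen_generation_utils.py", "cpp_kernels.py")


def copy_missing_files(
    source_dir: Union[str, Path],
    target_dir: Union[str, Path],
    names: Iterable[str] = REQUIRED_FILES,
) -> List[str]:
    """Copy each named file from source_dir to target_dir unless it already exists there."""
    source, target = Path(source_dir), Path(target_dir)
    copied = []
    for name in names:
        destination = target / name
        if destination.exists():
            continue  # keep whatever is already in the quantized directory
        origin = source / name
        if not origin.is_file():
            raise FileNotFoundError(f"{name} not found in {source}")
        shutil.copy2(origin, destination)
        copied.append(name)
    return copied
```

For the directories used in this article, the call would look like copy_missing_files("Qwen-14B", "Qwen-14B-4bit-gptq").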

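These files can be copied from the latest Qwen-14B model files into the quantized model directory. Running the loading code above again then loads the model normally. The model is quantized if the printed model structure contains the following: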

3. Inference with the quantized model

Once the quantized model loads successfully, it can be tested with the following inference code:

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("你是谁", return_tensors="pt").to(model.device))[0]))

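The quantized model answers normally, but the answer is incomplete. This is because the code above is only test inference code; inference code for production use needs to be modified. I made the change in LLaMA-Factory's inference code, after which the quantized model loads and runs inference normally. The modified code is as follows:

The change is simply to load the quantized model at model-loading time; the rest of the inference code stays unchanged.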

Evaluating the quantized model

Token rate benchmark script

import time
import logging
import random
from argparse import ArgumentParser
from itertools import chain
from typing import Dict, List, Optional
import json
import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from tqdm import tqdm
from transformers import AutoTokenizer, GenerationConfig, AutoModelForCausalLM
from transformers.generation.logits_process import LogitsProcessor
from datasets import Dataset

logger = logging.getLogger(__name__)

random.seed(0)


class CustomizedMinNewTokensLogitsProcessor(LogitsProcessor):
    def __init__(
            self,
            min_new_tokens: int = None,
            eos_token_id: int = None,
    ):
        self.eos_token_id = eos_token_id
        self.min_new_tokens = min_new_tokens or 0
        self.current_step = 0

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        self.current_step += 1

        if self._skip_process():
            return scores

        if any(each is not None for each in [self.eos_token_id]):
            banned_mask = torch.zeros_like(scores).to(scores.device)
            if self.eos_token_id and self.current_step <= self.min_new_tokens:
                banned_mask = self._fill_banned_mask(input_ids, banned_mask, {1: [[self.eos_token_id]]})
            scores = scores.masked_fill(banned_mask.bool(), -float("inf"))

        return scores

    def _skip_process(self):
        if self.current_step > self.min_new_tokens:
            return True
        return False

    @staticmethod
    def _fill_banned_mask(
            input_ids: torch.LongTensor,
            banned_mask: torch.Tensor,
            len2words_ids: Dict[int, List[List[int]]]
    ):
        for token_len, token_ids in len2words_ids.items():
            if token_len == 1:
                banned_mask[..., list(chain(*token_ids))] = 1
            elif input_ids.shape[-1] < token_len - 1:
                continue
            else:
                token_ids = torch.LongTensor(token_ids).to(input_ids.device)
                hit_masks = torch.all(
                    token_ids[..., :-1].unsqueeze(0).repeat(input_ids.shape[0], 1, 1)
                    == input_ids[..., -(token_ids.shape[-1] - 1):].unsqueeze(1),
                    dim=-1
                )
                for idx in range(hit_masks.shape[0]):
                    selected_token_ids = torch.masked_select(token_ids[..., -1], hit_masks[idx])
                    if len(selected_token_ids):
                        banned_mask[idx, selected_token_ids] = 1
        return banned_mask


def load_data(data_path, tokenizer, n_samples, max_new_tokens):
    with open(data_path, "r", encoding="utf-8") as f:
        raw_data = json.load(f)

    # raw_data = random.sample(raw_data, k=min(n_samples, len(raw_data)))
    raw_data = raw_data[:n_samples]

    def dummy_gen():
        return raw_data

    def tokenize(examples):
        instructions = examples["instruction"]
        inputs = examples["input"]
        prompts = []
        input_ids = []
        attention_mask = []
        for istr, inp in zip(instructions, inputs):
            if inp:
                prompt = f"Instruction:\n{istr}\nInput:\n{inp}\nOutput:\n"

            else:
                prompt = f"Instruction:\n{istr}\nOutput:\n"

            if len(tokenizer(prompt)["input_ids"]) >= tokenizer.model_max_length - max_new_tokens:
                continue

            tokenized_data = tokenizer(prompt)

            input_ids.append(tokenized_data["input_ids"][: tokenizer.model_max_length])
            attention_mask.append(tokenized_data["attention_mask"][: tokenizer.model_max_length])
            prompts.append(prompt)

        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "prompt": prompts
        }

    dataset = Dataset.from_generator(dummy_gen)

    dataset = dataset.map(
        tokenize,
        batched=True,
        batch_size=len(dataset),
        num_proc=1,
        keep_in_memory=True,
        load_from_cache_file=False,
        remove_columns=["instruction", "input"]
    )

    dataset = dataset.to_list()

    for sample in dataset:
        sample["input_ids"] = torch.LongTensor(sample["input_ids"])
        sample["attention_mask"] = torch.LongTensor(sample["attention_mask"])

    return dataset


def load_model_tokenizer(
        model_name_or_path: str,
        max_memory: Optional[dict] = None,
        model_basename: Optional[str] = None,
        quantize_config: Optional[BaseQuantizeConfig] = None,
        trust_remote_code: bool = True,
        use_triton: bool = False,
        use_safetensors: bool = False,
        use_fast_tokenizer: bool = False,
        inject_fused_attention: bool = True,
        inject_fused_mlp: bool = True,
        disable_exllama: bool = False,
        not_quantize_model: bool = False,
        use_flash_attn: bool = False
):
    tokenizer = AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path=model_name_or_path,
        use_fast=use_fast_tokenizer,
        trust_remote_code=trust_remote_code
    )
    if not tokenizer.pad_token_id:
        tokenizer.pad_token_id = tokenizer.eos_token_id

    if not_quantize_model:
        logger.info('start loading not quantize model')
        model = AutoModelForCausalLM.from_pretrained(
            model_name_or_path,
            device_map="auto",
            max_memory=max_memory,
            torch_dtype=torch.float16,
            trust_remote_code=trust_remote_code
        )
    else:
        logger.info('start loading gptq quantized model')
        model = AutoGPTQForCausalLM.from_quantized(
            model_name_or_path,
            max_memory=max_memory,
            low_cpu_mem_usage=True,
            use_triton=use_triton,
            inject_fused_attention=inject_fused_attention,
            inject_fused_mlp=inject_fused_mlp,
            use_cuda_fp16=True,
            quantize_config=quantize_config,
            model_basename=model_basename,
            use_safetensors=use_safetensors,
            trust_remote_code=trust_remote_code,
            warmup_triton=False,
            disable_exllama=disable_exllama,
            use_flash_attn=use_flash_attn
        )

    return model, tokenizer


def benchmark_generation_speed(model, tokenizer, examples, generation_config):
    generation_time_list = []
    num_generated_tokens_list = []
    progress_bar = tqdm(examples)
    max_gpu_memory_cost = 0
    max_input_len = 0
    max_output_len = 0
    for example in progress_bar:
        input_ids = example["input_ids"].to(model.device)

        prompt = tokenizer.decode(input_ids)
        logger.info(f'prompt: {prompt}')
        max_input_len = max(max_input_len, len(input_ids))

        start = time.time()
        outputs_ids = model.generate(
            input_ids=input_ids.unsqueeze(0),
            generation_config=generation_config,
            logits_processor=[
                CustomizedMinNewTokensLogitsProcessor(generation_config.max_new_tokens, tokenizer.eos_token_id)
            ]
        )
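        # Stop the clock right after generation so decoding and logging are not timed
        end = time.time()

        # Decode only the newly generated tokens
        output_ids = outputs_ids[0][len(input_ids):]
        output = tokenizer.decode(
            output_ids,
            spaces_between_special_tokens=False,
        )
        logger.info(f'answer: {output}')
        # Track output length in tokens, consistent with max_input_len
        max_output_len = max(max_output_len, len(output_ids))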

        generation_time_list.append(end - start)
        num_generated_tokens = 0
        for output_ids in outputs_ids:
            num_generated_tokens += len(
                [
                    token_id for token_id in output_ids[len(input_ids):] if token_id != tokenizer.pad_token_id
                ]
            )
        num_generated_tokens_list.append(num_generated_tokens)

        # Track the peak GPU memory allocated so far, in bytes
        max_gpu_memory_cost = max(max_gpu_memory_cost, torch.cuda.max_memory_allocated())

        progress_bar.set_postfix(
            num_tokens=num_generated_tokens_list[-1],
            time=generation_time_list[-1],
            speed=f"{num_generated_tokens_list[-1] / generation_time_list[-1]:.4f}tokens/s",
            gpu_memory_cost=f"{max_gpu_memory_cost / 1024 / 1024 / 1024}GB",
        )
        # Release cached GPU memory that is currently unused
        torch.cuda.empty_cache()

    total_tokens = sum(num_generated_tokens_list)
    total_seconds = sum(generation_time_list)

    benchmark_result = {
        "total_tokens": total_tokens,
        "total_seconds": total_seconds,
        "token_per_second": f"{total_tokens / total_seconds}tokens/s",
        "max_gpu_memory_cost": f"{max_gpu_memory_cost / 1024 / 1024 / 1024}GB",
        "max_input_len": max_input_len,
        "max_output_len": max_output_len,
    }
    result = json.dumps(benchmark_result, indent=4, ensure_ascii=False)
    logger.info(f"benchmark result: {result}")


def main():
    parser = ArgumentParser()
    parser.add_argument("--model_name_or_path", type=str)
    parser.add_argument("--model_basename", type=str, default=None)
    parser.add_argument("--quantize_config_save_dir", type=str, default=None)
    parser.add_argument("--trust_remote_code", action="store_true")
    parser.add_argument("--use_triton", action="store_true")
    parser.add_argument("--not_quantize_model", action="store_true", help='not quantize model')
    parser.add_argument("--use_safetensors", action="store_true")
    parser.add_argument("--use_fast_tokenizer", action="store_true")
    parser.add_argument("--disable_exllama", action="store_true")
    parser.add_argument("--no_inject_fused_attention", action="store_true")
    parser.add_argument("--no_inject_fused_mlp", action="store_true")
    parser.add_argument("--num_samples", type=int, default=10)
    parser.add_argument("--per_gpu_max_memory", type=int, default=None)
    parser.add_argument("--cpu_max_memory", type=int, default=None)
    parser.add_argument("--max_new_tokens", type=int, default=512)
    parser.add_argument("--do_sample", action="store_true")
    parser.add_argument("--num_beams", type=int, default=1)
    parser.add_argument("--data_path", type=str, default="../quantization/dataset/alpaca_data_cleaned.json")
    parser.add_argument("--use_flash_attn", action="store_true")
    args = parser.parse_args()

    max_memory = dict()
    if args.per_gpu_max_memory is not None and args.per_gpu_max_memory > 0:
        if torch.cuda.is_available():
            max_memory.update(
                {i: f"{args.per_gpu_max_memory}GIB" for i in range(torch.cuda.device_count())}
            )
    if args.cpu_max_memory is not None and args.cpu_max_memory > 0 and max_memory:
        max_memory["cpu"] = f"{args.cpu_max_memory}GIB"
    if not max_memory:
        max_memory = None

    logger.info(f"max_memory: {max_memory}")

    quantize_config = None
    if args.quantize_config_save_dir:
        quantize_config = BaseQuantizeConfig.from_pretrained(args.quantize_config_save_dir)

    logger.info("loading model and tokenizer")
    start = time.time()
    model, tokenizer = load_model_tokenizer(
        model_name_or_path=args.model_name_or_path,
        max_memory=max_memory,
        model_basename=args.model_basename,
        quantize_config=quantize_config,
        trust_remote_code=args.trust_remote_code,
        use_triton=args.use_triton,
        use_safetensors=args.use_safetensors,
        use_fast_tokenizer=args.use_fast_tokenizer,
        inject_fused_attention=not args.no_inject_fused_attention,
        inject_fused_mlp=not args.no_inject_fused_mlp,
        disable_exllama=args.disable_exllama,
        not_quantize_model=args.not_quantize_model,
        use_flash_attn=args.use_flash_attn
    )
    end = time.time()
    logger.info(f"model and tokenizer loading time: {end - start:.4f}s")
    if not args.not_quantize_model:
        logger.info(f"model quantized: {model.quantized}")
        logger.info(f"quantize config: {model.quantize_config.to_dict()}")
        logger.info(f"model device map: {model.hf_device_map}")

    if args.use_triton:
        logger.info("warmup triton, this may take a while.")
        model.warmup_triton()

    logger.info("loading data")
    examples = load_data(
        args.data_path, tokenizer, args.num_samples, args.max_new_tokens
    )

    generation_config = GenerationConfig(
        num_beams=args.num_beams,
        num_return_sequences=args.num_beams,
        do_sample=args.do_sample,
        min_new_tokens=args.max_new_tokens,
        max_new_tokens=args.max_new_tokens,
        pad_token_id=tokenizer.pad_token_id
    )
    logger.info(f"generation config: {generation_config.to_dict()}")

    logger.info(f"benchmark generation speed")
    benchmark_generation_speed(model, tokenizer, examples, generation_config)


if __name__ == "__main__":
    logging.basicConfig(
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
    )

    main()

# Benchmark the unquantized model
# python generation_speed.py \
# --model_name_or_path Qwen-14B/ \
# --data_path dataset/instruct_法律问答3000.json \
# --trust_remote_code \
# --use_fast_tokenizer \
# --not_quantize_model \
# --max_new_tokens 512 \
# --do_sample \
# --num_samples 50 \
# --num_beams 1


# Benchmark the quantized model
# python generation_speed.py \
# --model_name_or_path Qwen-14B-int4-gptq/ \
# --data_path dataset/instruct_法律问答3000.json \
# --trust_remote_code \
# --use_safetensors \
# --use_fast_tokenizer \
# --max_new_tokens 512 \
# --do_sample \
# --num_samples 50 \
# --num_beams 1

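The code for benchmarking the quantized model's token generation rate is adapted from generation_speed.py in the AutoGPTQ repository. I modified it to also benchmark the unquantized model and to record peak GPU memory usage.

The inference code above is not perfect: the generated answers may contain repeated content. Since the script only measures the token generation rate, this does not affect the measurement. The inference code will be improved later.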
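
The rate reported by the script is total generated tokens divided by total generation time across all samples, not an average of per-sample rates. A minimal sketch of that aggregation, with made-up measurements:

```python
from typing import List, Tuple


def tokens_per_second(runs: List[Tuple[int, float]]) -> float:
    """Aggregate rate over runs given as (generated_tokens, seconds) pairs."""
    total_tokens = sum(tokens for tokens, _ in runs)
    total_seconds = sum(seconds for _, seconds in runs)
    if total_seconds <= 0:
        raise ValueError("total time must be positive")
    return total_tokens / total_seconds


# Hypothetical per-sample measurements: 512 new tokens each
runs = [(512, 16.0), (512, 16.5), (512, 15.5)]
print(f"{tokens_per_second(runs):.2f} tokens/s")  # 32.00 tokens/s
```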

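The script's parameters are explained below:

--model_name_or_path: Name or path of the model to load. In the commands below it is a local model directory.

--model_basename: Base name of the quantized weights file (without the extension), used to locate the weights when loading a quantized model. Defaults to None.

--quantize_config_save_dir: Directory from which the quantization config is loaded with BaseQuantizeConfig.from_pretrained(). If omitted, the script passes quantize_config=None when loading the model.

--trust_remote_code: If set, custom model code shipped with the model files is allowed to run. Qwen needs this because its model is implemented in modeling_qwen.py.

--use_triton: If set, the quantized layers use Triton kernels. Triton here is the GPU kernel language, not the Triton Inference Server.

--not_quantize_model: If set, the script loads the original unquantized model in float16 with AutoModelForCausalLM instead of a GPTQ model, which gives the baseline for comparison.

--use_safetensors: If set, the weights are loaded from a .safetensors file. Safetensors is a file format for storing tensors safely.

--use_fast_tokenizer: If set, the fast tokenizer is used, which tokenizes more quickly.

--disable_exllama: If set, the ExLlama kernels are disabled. ExLlama provides optimized kernels for 4-bit GPTQ inference.

--no_inject_fused_attention: If set, fused attention is not injected. Fused attention can speed up inference.

--no_inject_fused_mlp: If set, the fused MLP is not injected. The fused MLP can speed up inference.

--num_samples: Number of samples taken from the dataset for the benchmark. Defaults to 10.

--per_gpu_max_memory: Maximum memory to use on each GPU, in GiB.

--cpu_max_memory: Maximum CPU memory to use, in GiB. In this script it only takes effect when --per_gpu_max_memory is also set.

--max_new_tokens: Number of new tokens to generate per sample. The script sets min_new_tokens to the same value, so every sample generates exactly this many tokens. Defaults to 512.

--do_sample: If set, text is generated by sampling, that is, the next token is drawn at random from the model's output probability distribution.

--num_beams: Number of beams for beam search, which keeps several candidate sequences at each step and picks the most likely one. Defaults to 1. The script also uses this value for num_return_sequences.

--data_path: Path to the JSON dataset file that supplies the prompts for the benchmark.

--use_flash_attn: If set, Flash Attention is used. Flash Attention is a method for speeding up attention computation.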
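
The way --per_gpu_max_memory and --cpu_max_memory turn into the max_memory argument can be isolated into a small function. The sketch below mirrors the logic in main(); build_max_memory is a hypothetical name, and gpu_count is passed in explicitly where the script calls torch.cuda.device_count().

```python
from typing import Dict, Optional, Union

MaxMemory = Dict[Union[int, str], str]


def build_max_memory(
    per_gpu_gib: Optional[int],
    cpu_gib: Optional[int],
    gpu_count: int,
) -> Optional[MaxMemory]:
    """Build a max_memory mapping keyed by GPU index (and "cpu"), or None if no limit is set."""
    max_memory: MaxMemory = {}
    if per_gpu_gib is not None and per_gpu_gib > 0:
        max_memory.update({i: f"{per_gpu_gib}GIB" for i in range(gpu_count)})
    # As in the script, the CPU limit is only added when a GPU limit exists
    if cpu_gib is not None and cpu_gib > 0 and max_memory:
        max_memory["cpu"] = f"{cpu_gib}GIB"
    return max_memory or None


print(build_max_memory(40, 64, gpu_count=2))    # {0: '40GIB', 1: '40GIB', 'cpu': '64GIB'}
print(build_max_memory(None, 64, gpu_count=2))  # None
```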

Command for benchmarking the unquantized model

python generation_speed.py \
--model_name_or_path Qwen-14B/ \
--data_path dataset/instruct_法律问答3000.json \
--trust_remote_code \
--use_fast_tokenizer \
--not_quantize_model \
--max_new_tokens 512 \
--do_sample \
--num_samples 50 \
--num_beams 1

Command for benchmarking the quantized model

python generation_speed.py \
--model_name_or_path Qwen-14B-int4-gptq/ \
--data_path dataset/instruct_法律问答3000.json \
--trust_remote_code \
--use_safetensors \
--use_fast_tokenizer \
--max_new_tokens 512 \
--do_sample \
--num_samples 50 \
--num_beams 1

The data file used for the benchmark has the following format:

[
    {
        "instruction": "基于法律法规和司法案例回答法律问题",
        "input": "网络虚拟财产被骗不还直播平台货币被玩家借走18000元，一个月不还，请问可以立案吗",
        "output": "您好，根据您的描述，双方属于民事纠纷，不属于刑事案件，可以向法院提起诉讼维权，根据《民事诉讼法》第一百一十九条规定：起诉必须符合下列条件：（一）原告是与本案有直接利害关系的公民、法人和其他组织；（二）有明确的被告；（三）有具体的诉讼请求和事实、理由；（四）属于人民法院受理民事诉讼的范围和受诉人民法院管辖。谢谢。"
    }
]
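
Each record is turned into a prompt using the same template as the tokenize() function in the benchmark script. The sketch below reproduces just that step on an in-memory JSON string; build_prompt and load_prompts are illustrative names, and the sample records are invented.

```python
import json
from typing import Dict, List


def build_prompt(record: Dict[str, str]) -> str:
    """Format one record with the Instruction/Input/Output template used by the script."""
    if record.get("input"):
        return (
            f"Instruction:\n{record['instruction']}\n"
            f"Input:\n{record['input']}\nOutput:\n"
        )
    return f"Instruction:\n{record['instruction']}\nOutput:\n"


def load_prompts(json_text: str, n_samples: int) -> List[str]:
    """Parse the dataset and build prompts for the first n_samples records."""
    records = json.loads(json_text)
    return [build_prompt(record) for record in records[:n_samples]]


sample = json.dumps([
    {"instruction": "Answer the question", "input": "What is GPTQ?", "output": ""},
    {"instruction": "Say hi", "input": "", "output": ""},
])
prompts = load_prompts(sample, n_samples=2)
print(prompts[0])
```

The output field is not used when building prompts, since the benchmark only measures generation speed.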

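Model quantization best practices

1. Always try to load the entire model onto the GPU first, since this avoids the time spent transferring module weights between CPU and GPU.

By default, AutoGPTQForCausalLM.from_pretrained() loads the entire pretrained model into CPU memory.

When GPU memory is sufficient and CPU memory is limited (by default the model is initialized on the CPU), or when you want the model to load faster, add the parameter low_cpu_mem_usage=True when loading the pretrained model and the quantized model.

2. For inference, follow this rule: use a single GPU whenever possible, otherwise multiple GPUs. CPU offload is the last resort.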

Problems encountered during quantization

1. Model loading errors

1) Model not found

Loading the model fails with the error FileNotFoundError: Could not find model:

  File "/opt/conda/lib/python3.8/site-packages/auto_gptq/modeling/auto.py", line 108, in from_quantized
    return quant_func(
  File "/opt/conda/lib/python3.8/site-packages/auto_gptq/modeling/_base.py", line 791, in from_quantized
    raise FileNotFoundError(f"Could not find model in {model_name_or_path}")
FileNotFoundError: Could not find model

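Cause

When saving the quantized model, model.save_quantized(quantized_model_dir, use_safetensors=True) writes the weights as gptq_model-4bit-128g.safetensors. Without use_safetensors=True, the weights are saved as gptq_model-4bit-128g.bin. If the format requested at load time does not match the saved file, the loader cannot find the model.

Solution

When loading the quantized model, if the weights file is in safetensors format, pass use_safetensors=True to AutoGPTQForCausalLM.from_quantized().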
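
To avoid guessing, the flag can be derived from what is actually in the model directory. This is a small sketch with a hypothetical helper name; it only looks at file extensions.

```python
from pathlib import Path
from typing import Union


def detect_use_safetensors(model_dir: Union[str, Path]) -> bool:
    """Return True if the directory holds .safetensors weights, False if it holds .bin weights."""
    directory = Path(model_dir)
    if any(directory.glob("*.safetensors")):
        return True
    if any(directory.glob("*.bin")):
        return False
    raise FileNotFoundError(f"no .safetensors or .bin weights found in {directory}")
```

The result could then be passed as use_safetensors=detect_use_safetensors(quantized_model_dir) when calling from_quantized().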

2) QWenConfig has no use_cache_quantization attribute

Loading the model fails with AttributeError: 'QWenConfig' object has no attribute 'use_cache_quantization'. The full traceback is:

Traceback (most recent call last):
    model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,  
  File "/opt/conda/lib/python3.8/site-packages/auto_gptq/modeling/auto.py", line 108, in from_quantized
    return quant_func(
  File "/opt/conda/lib/python3.8/site-packages/auto_gptq/modeling/_base.py", line 817, in from_quantized
    model = AutoModelForCausalLM.from_config(
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 443, in from_config
    return model_class._from_config(config, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1170, in _from_config
    model = cls(config, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/autogptq-4-v11-all/modeling_qwen.py", line 1044, in __init__
    self.transformer = QWenModel(config)
  File "/root/.cache/huggingface/modules/transformers_modules/autogptq-4-v11-all/modeling_qwen.py", line 773, in __init__
    [
  File "/root/.cache/huggingface/modules/transformers_modules/autogptq-4-v11-all/modeling_qwen.py", line 774, in <listcomp>
    QWenBlock(
  File "/root/.cache/huggingface/modules/transformers_modules/autogptq-4-v11-all/modeling_qwen.py", line 630, in __init__
    self.attn = QWenAttention(config)
  File "/root/.cache/huggingface/modules/transformers_modules/autogptq-4-v11-all/modeling_qwen.py", line 296, in __init__
    if config.use_cache_quantization and config.use_cache_kernel:
  File "/opt/conda/lib/python3.8/site-packages/transformers/configuration_utils.py", line 261, in __getattribute__
    return super().__getattribute__(key)
AttributeError: 'QWenConfig' object has no attribute 'use_cache_quantization'

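Cause

The configuration_qwen.py file generated during quantization does not define the use_cache_quantization attribute. An online diff showed that the generated configuration_qwen.py is missing the following parameters:

Solution

Replace the generated file with the configuration_qwen.py from the Qwen-14B model files.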
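
The comparison described under Cause can also be done locally by listing the self.<name> assignments in both files. The sketch below uses a simple regular expression rather than a full parser, and the two source strings are invented stand-ins for the real files.

```python
import re
from typing import Set

# Matches lines such as "self.use_cache_kernel = ..." but not comparisons with "=="
_ATTRIBUTE_PATTERN = re.compile(r"^\s*self\.(\w+)\s*=(?!=)", re.MULTILINE)


def config_attributes(source: str) -> Set[str]:
    """Collect attribute names assigned on self in a Python source string."""
    return set(_ATTRIBUTE_PATTERN.findall(source))


def missing_attributes(reference_source: str, candidate_source: str) -> Set[str]:
    """Attributes set in the reference file but absent from the candidate file."""
    return config_attributes(reference_source) - config_attributes(candidate_source)


# Toy stand-ins; in practice read both configuration_qwen.py files from disk
reference = """
class QWenConfig:
    def __init__(self):
        self.hidden_size = 5120
        self.use_cache_quantization = False
        self.use_cache_kernel = False
"""
generated = """
class QWenConfig:
    def __init__(self):
        self.hidden_size = 5120
"""
print(sorted(missing_attributes(reference, generated)))
```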

2. Model inference error

Calling model.generate() to get the inference result fails with FWD: Unsupported hidden_size or types: 5120BFloat16FloatFloatFloatFloat. The full traceback is:

Traceback (most recent call last):
  File "generation_speed.py", line 343, in <module>
    main()
  File "generation_speed.py", line 335, in main
    benchmark_generation_speed(model, tokenizer, examples, generation_config)
  File "generation_speed.py", line 209, in benchmark_generation_speed
    outputs_ids = model.generate(
  File "/opt/conda/lib/python3.8/site-packages/auto_gptq/modeling/_base.py", line 443, in generate
    return self.model.generate(**kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 1330, in generate
    return super().generate(
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/generation/utils.py", line 1642, in generate
    return self.sample(
  File "/opt/conda/lib/python3.8/site-packages/transformers/generation/utils.py", line 2724, in sample
    outputs = self(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 1120, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 950, in forward
    outputs = block(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 670, in forward
    layernorm_output = self.ln_2(layernorm_input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 1425, in forward
    return rms_norm(x, self.weight, self.eps)
  File "/opt/conda/lib/python3.8/site-packages/flash_attn/ops/rms_norm.py", line 15, in rms_norm
    return DropoutAddLayerNormFn.apply(
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/lib/python3.8/site-packages/flash_attn/ops/layer_norm.py", line 334, in forward
    zmat, xmat, dmask, mu, rsigma = _dropout_add_layer_norm_forward(
  File "/opt/conda/lib/python3.8/site-packages/flash_attn/ops/layer_norm.py", line 33, in _dropout_add_layer_norm_forward
    zmat, xmat, dmask, mu, rsigma = dropout_layer_norm.dropout_add_ln_fwd(
RuntimeError: FWD: Unsupported hidden_size or types: 5120BFloat16FloatFloatFloatFloat

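Cause

The exact cause is not yet clear. An issue reporting this problem has already been filed in the Qwen GitHub repository.

Solution

The traceback ends in the flash_attn rms_norm and layer_norm ops, so the error is related to flash attention. Uninstalling flash-attn with the following command resolves it: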

pip uninstall flash-attn
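
To confirm whether the package is still present in the current environment, a quick check with the standard library is enough. The helper name is made up for this sketch.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found by the import system."""
    return importlib.util.find_spec(package) is not None


print("flash_attn installed:", is_installed("flash_attn"))
```

After the uninstall, this should print False, and the Qwen modeling code is then expected to fall back to its non-flash-attn path.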