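A Brief Look at Text Generation Strategies in transformers

Preface

When a new language model is published on Hugging Face, its model page provides a snippet of sample code showing readers how to load the model with transformers and run inference.

For example, qwen2-7b-instruct provides the following usage demo: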

from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

llama3.1-8b-instruct provides the following usage demo:

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

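Either way, for text generation models, inference ultimately relies on the transformers.generation.utils.GenerationMixin.generate method. This article describes the text generation strategies contained in generate.

Background

Before describing the 9 text generation strategies in generate in detail, here are 2 pieces of background knowledge:

A transformer language model is an autoregressive generative model.
The output of a transformer is a vector of relative scores (logits) over every token in the vocabulary (before softmax, the components do not sum to 1).

Autoregressive generative models

Take the figure below as an example. The input token sequence is "once upon". After a forward pass through the transformer, some text generation strategy picks "a" from the output token probability distribution. "once upon a" is then fed back into the transformer, and this time "time" is picked. Next, "once upon a time" is fed into the transformer, and so on. When generating text, transformers.generation.utils.GenerationMixin.generate repeats this autoregressive process until any one of the following stop conditions is met:

The number of generated tokens reaches the maximum output length (for example max_new_tokens).
Inference time exceeds the specified time limit (max_time).
The model outputs the tokenizer's end-of-sequence token, eos_token_id.
The output contains one of the stop strings passed in at inference time, stop_strings.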
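The loop and its stop conditions can be sketched in a few lines of plain Python. This is only a toy illustration: `NEXT_TOKEN`, `toy_next_token` and this `generate` are hypothetical stand-ins for a real model, and each step looks only at the last token, whereas a real transformer attends to the whole sequence.

```python
import time

# Hypothetical stand-in for a model: maps the last token to the next token.
NEXT_TOKEN = {"once": "upon", "upon": "a", "a": "time", "time": "<eos>"}


def toy_next_token(tokens):
    """Return the next token for the sequence (toy rule: look at the last token only)."""
    return NEXT_TOKEN.get(tokens[-1], "<eos>")


def generate(prompt_tokens, max_new_tokens=10, max_time=1.0, eos_token="<eos>"):
    """Repeat 'predict one token, append it' until a stop condition is met."""
    tokens = list(prompt_tokens)
    start = time.monotonic()
    for _ in range(max_new_tokens):              # stop: token budget exhausted
        if time.monotonic() - start > max_time:  # stop: time budget exhausted
            break
        next_token = toy_next_token(tokens)
        if next_token == eos_token:              # stop: end-of-sequence token
            break
        tokens.append(next_token)
    return tokens


print(generate(["once", "upon"]))                    # ['once', 'upon', 'a', 'time']
print(generate(["once", "upon"], max_new_tokens=1))  # ['once', 'upon', 'a']
```

The stop-string condition is left out to keep the sketch short; it would be one more check on the decoded text inside the loop.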

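The model's last layer outputs a relative probability distribution over tokens

Take qwen2-7b-instruct as an example: its config.json specifies that when the model is loaded via transformers.modeling_utils.PreTrainedModel.from_pretrained, the class used is Qwen2ForCausalLM.

The last layer of Qwen2ForCausalLM is a linear layer (lm_head).

Suppose the input has shape (1, seq_len, hidden_size), i.e. one sequence of seq_len tokens. The final linear layer of Qwen2ForCausalLM maps it from (1, seq_len, hidden_size) to (1, seq_len, vocab_size). After softmax is applied, the vector at the last position gives, for each token in the vocabulary, the probability that it is the (seq_len+1)-th token following the seq_len tokens processed by Qwen2ForCausalLM.

Random sampling with torch.multinomial

torch.multinomial draws samples from a distribution. For instance, given a randomly initialized tensor of size 6 as weights, it can draw one sample and return that sample's index.

The likelihood of each entry being drawn is proportional to its weight.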
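To make the behaviour concrete without requiring a GPU stack, here is a pure-Python stand-in. It is not torch.multinomial itself: the `multinomial` function below is a hypothetical helper built on `random.choices`, and it samples with replacement, which for a single draw gives the same kind of result.

```python
import random
from collections import Counter


def multinomial(weights, num_samples, rng):
    """Draw indices with probability proportional to non-negative weights (with replacement)."""
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)


rng = random.Random(0)                      # fixed seed so the run is repeatable
weights = [rng.random() for _ in range(6)]  # a "randomly initialized" vector of size 6
index = multinomial(weights, 1, rng)[0]
print(index)                                # an index between 0 and 5

# Over many draws, the empirical frequencies approach weights / sum(weights).
counts = Counter(multinomial(weights, 20000, rng))
total = sum(weights)
for i, w in enumerate(weights):
    print(i, round(w / total, 3), round(counts[i] / 20000, 3))
```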

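Overview of text generation strategies

According to the enum transformers.generation.configuration_utils.GenerationMode, transformers supports 9 text generation strategies in total. The following sections describe each of the 9 in detail (the order differs slightly from the enum order).

greedy search

The idea behind greedy search is very intuitive: at every autoregressive step, pick the token with the highest probability as the output token.

Drawback: the output loses diversity. Because the highest-probability token is chosen at every step, the output is always the same once the input is fixed.

Hands-on practice

According to the docstring of generate, the default generation parameters are loaded with the following priority: first from the generation_config.json file in the model directory, if it exists; otherwise from the model configuration (for qwen, transformers.models.qwen2.configuration_qwen2.Qwen2Config). Parameters specified in neither place inherit the defaults of GenerationConfig. To use values other than these, pass them explicitly when calling generate.

The qwen2-7b-instruct repository contains a generation_config.json, so the parameters in that file take effect by default. It sets do_sample=true, so sampling is enabled at inference time. To use greedy search, pass do_sample=False explicitly to generate.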

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(model_path: str):
    # Qwen2 is supported natively by transformers, so trust_remote_code=True is optional here.
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, )
    return model, tokenizer
def get_answer(prompt: str, tokenizer, model):
    device = "cuda"  # the device to load the model onto

    # Instead of using model.chat(), we directly use model.generate()
    # But you need to use tokenizer.apply_chat_template() to format your inputs as shown below
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # Directly use generate() and tokenizer.decode() to get the output.
    # Use `max_new_tokens` to control the maximum output length.
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512,
        do_sample=False,
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

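Taking prompt="你是谁" ("Who are you?") as an example, the verification code was run 5 times with the results below. With do_sample=False the model performs greedy search and the output stays the same.

sample

A fixed output is not always desirable. If you want diverse output (different results for the same prompt), use the sample strategy. Its core idea is simple: at each step, instead of choosing the highest-probability token as greedy search does, randomly draw a token according to the probability distribution.

top_k, top_p and temperature, which are commonly set at inference time, are the parameters that affect sampling results. The following sections describe how each of them affects sampling.

top_k sampling

In the figure below, the input to the model is: Yesterday I went to the cinema to see a

In the first autoregressive step, the output token has 5 possible values: omelette, like, film, documental, love

The probability of each token being selected is listed under it.

With top_k=3, tokens are sorted by probability from high to low and the top 3 are kept as candidates: film, documental and omelette enter the candidate list, and the remaining tokens are discarded.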
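A minimal sketch of such a filter, assuming the logits (3, 1, 4.9, 4.5, 1.1) that the temperature section of this article assigns to these 5 tokens. `top_k_filter` and `softmax` are illustrative helpers written for this article, not library functions; the filter masks everything below the k-th largest score and the survivors are renormalized by softmax.

```python
import math

NEG_INF = float("-inf")


def top_k_filter(logits, k):
    """Keep tokens scoring at least as high as the k-th largest logit; mask the rest with -inf."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else NEG_INF for x in logits]


def softmax(logits):
    """Numerically stable softmax; exp(-inf) is 0.0, so masked tokens get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["omelette", "like", "film", "documental", "love"]
logits = [3.0, 1.0, 4.9, 4.5, 1.1]

kept = top_k_filter(logits, k=3)
for token, p in zip(tokens, softmax(kept)):
    print(token, round(p, 2))
```

Because the comparison is against a threshold value, tokens that tie with the k-th score all survive. For example, `top_k_filter([2.0, 2.0, 1.0], k=1)` keeps both 2.0 entries.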

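top_p sampling

top_p filters by cumulative probability. With top_p=0.95, tokens are sorted by probability from high to low, and the smallest set of top tokens whose cumulative probability reaches top_p is kept. Since 0.54+0.36=0.90<0.95 while 0.54+0.36+0.08=0.98>=0.95, the three tokens film, documental and omelette enter the candidate list, and the remaining tokens are discarded.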
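The cumulative rule can be sketched as follows. `top_p_filter` is a hypothetical helper, and the probabilities for like and love (0.01 each) are rounded values assumed for illustration; only 0.54, 0.36 and 0.08 appear in the text.

```python
def top_p_filter(probs, top_p):
    """Return indices of the smallest set of most probable tokens whose cumulative probability >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


tokens = ["omelette", "like", "film", "documental", "love"]
probs = [0.08, 0.01, 0.54, 0.36, 0.01]

kept = top_p_filter(probs, top_p=0.95)
print([tokens[i] for i in kept])   # ['film', 'documental', 'omelette']

# Renormalize the survivors so they sum to 1 before sampling.
mass = sum(probs[i] for i in kept)
print({tokens[i]: round(probs[i] / mass, 3) for i in kept})
```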

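How temperature affects sampling results

According to transformers.generation.logits_process.TemperatureLogitsWarper.__call__, the logic is very simple: the relative probability distribution (logits) output by the model is divided by temperature.

In mathematical terms, when softmax normalizes the distribution, each component $z_i$ is first divided by the temperature $T$: $p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$.

Consider a concrete example. Suppose the last layer outputs the vector (3, 1, 4.9, 4.5, 1.1), whose 5 components are the relative scores of omelette, like, film, documental and love for this step. With temperature 1, after softmax film has the highest sampling probability, 0.54.

With temperature 0.3, after dividing by temperature and applying softmax, the probability of film rises to 0.79.

With temperature 10, after dividing by temperature and applying softmax, the probability of film drops to 0.24.

As shown, a temperature below 1 sharpens the distribution, making already likely tokens even more likely; a temperature above 1 flattens the distribution, shrinking the differences between the tokens' sampling probabilities.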
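The three numbers above can be reproduced with a short script. `softmax_with_temperature` is an illustrative helper name, not a transformers function.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide each logit by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["omelette", "like", "film", "documental", "love"]
logits = [3.0, 1.0, 4.9, 4.5, 1.1]

for t in (1.0, 0.3, 10.0):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 2) for tok, p in zip(tokens, probs)})
```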

Order in which top_k, top_p and temperature are applied

According to transformers.generation.utils.GenerationMixin._get_logits_warper, temperature is applied first, followed by top_k and then top_p.

Hands-on practice

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(model_path: str):
    # Qwen2 is supported natively by transformers, so trust_remote_code=True is optional here.
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, )
    return model, tokenizer
def get_answer(prompt: str, tokenizer, model):
    device = "cuda"  # the device to load the model onto

    # Instead of using model.chat(), we directly use model.generate()
    # But you need to use tokenizer.apply_chat_template() to format your inputs as shown below
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # Directly use generate() and tokenizer.decode() to get the output.
    # Use `max_new_tokens` to control the maximum output length.
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512,
        do_sample=True,
        top_p=0.9,
        top_k=1,
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

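With top_k=1 and prompt="你是谁", 5 inference runs gave the results below.

This is an interesting point. The author had assumed that with top_k=1 inference would degenerate into greedy search, but it did not: the output was still diverse. Reading the source showed that the top_k filter only removes tokens whose score is strictly below the k-th largest score, so if 2 tokens tie for the top-1 sampling probability, both of them enter the final candidate list when top_k=1.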

contrastive search

penalty alpha

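As described above, greedy search produces a unique output and sample brings diversity through sampling, but both can suffer from repetitive output, for example the model repeatedly outputting a b c a b c a b c.

The penalty alpha parameter of contrastive search addresses the problem of the model repeatedly outputting similar tokens.

Its core idea: when choosing the next token, compute the similarity between each candidate token and every previously generated token. If a candidate has low similarity to all previous tokens, it is a new token that has not been output before, and it is more likely to be selected. If a candidate is highly similar to one or more previous tokens, it is less likely to be selected.

In mathematical terms: $x_t = \arg\max_{v \in V^{(k)}} \left\{ (1-\alpha)\, p_\theta(v \mid x_{<t}) - \alpha \max_{1 \le j \le t-1} s(h_v, h_{x_j}) \right\}$. Here $V^{(k)}$ is the set of candidate tokens at the current step (the top_k most probable tokens), $v$ is any token in that set, $p_\theta(v \mid x_{<t})$ is the model's probability for $v$, and $\max_j s(h_v, h_{x_j})$ is the maximum similarity between $v$ and the previously generated tokens. The argmax picks the $v$ in $V^{(k)}$ that maximizes the expression in braces.

alpha is the penalty coefficient, i.e. the weight given to the maximum similarity between the candidate and the previously generated tokens.

The larger alpha is, the heavier the penalty and the less likely similar tokens are to appear.
When alpha is 0 there is no penalty, and contrastive search degenerates into greedy search.
When alpha is negative, similar tokens are encouraged.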
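The scoring rule can be tried out on toy data. Everything below is a sketch under stated assumptions: `cosine` and `contrastive_pick` are hypothetical helpers, the 2-dimensional "hidden states" and probabilities are made-up numbers, and cosine similarity is assumed as the similarity function s.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_pick(candidates, history, alpha):
    """candidates: {token: (probability, hidden_vector)}; history: hidden vectors of earlier tokens.

    Returns the token maximizing (1 - alpha) * probability - alpha * max similarity to history.
    """
    def score(item):
        prob, hidden = item[1]
        degeneration = max(cosine(hidden, h) for h in history)
        return (1 - alpha) * prob - alpha * degeneration

    return max(candidates.items(), key=score)[0]


# Hypothetical numbers: "a" is the most probable candidate but nearly identical to history.
history = [[1.0, 0.0], [0.9, 0.1]]
candidates = {
    "a": (0.6, [1.0, 0.05]),
    "d": (0.4, [0.0, 1.0]),
}

print(contrastive_pick(candidates, history, alpha=0.0))  # 'a' (pure probability, i.e. greedy)
print(contrastive_pick(candidates, history, alpha=0.6))  # 'd' (similarity penalty dominates)
```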

Hands-on practice

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(model_path: str):
    # Qwen2 is supported natively by transformers, so trust_remote_code=True is optional here.
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, )
    return model, tokenizer
def get_answer(prompt: str, tokenizer, model):
    device = "cuda"  # the device to load the model onto

    # Instead of using model.chat(), we directly use model.generate()
    # But you need to use tokenizer.apply_chat_template() to format your inputs as shown below
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # Directly use generate() and tokenizer.decode() to get the output.
    # Use `max_new_tokens` to control the maximum output length.
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512,
        top_k=50,
        top_p=1,
        do_sample=False,
        temperature=1,
        penalty_alpha=1,
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

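With penalty_alpha set to 1, when the instruction asks the model for similar output, the model starts producing strange text.

With penalty_alpha set to 0, the model can output similar tokens as instructed.

A negative penalty coefficient acts as a reward; unsurprisingly, the result is the same as above.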

repetition_penalty

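Having covered the core idea of contrastive search, the author also wants to mention the repetition penalty parameter, which can be passed alongside contrastive search. penalty alpha addresses repeated output of similar tokens, whereas repetition penalty addresses repeated output of identical tokens.

When generating with repetition penalty, if a candidate token at the current step has already appeared earlier in the sequence, the probability of that token is lowered.

In mathematical terms: $p_i = \frac{\exp(x_i / (T \cdot I(i \in g)))}{\sum_j \exp(x_j / (T \cdot I(j \in g)))}$, where $g$ is the list of already generated tokens and $I(c) = \theta$ if $c$ is true and 1 otherwise, with $\theta$ being repetition_penalty. In other words, if a token has already appeared, its logit is rescaled by repetition_penalty before softmax is computed. For example, with repetition_penalty = 1.2, a positive logit of a previously generated token is multiplied by 1/1.2, i.e. about 0.833, before softmax, which lowers its sampling probability. In the transformers implementation a negative logit is multiplied by repetition_penalty instead, so the score moves down in both cases.

As with penalty alpha, different values act as a reward, a penalty, or neither:

If repetition_penalty > 1, identical tokens are penalized.
If repetition_penalty = 1, there is no penalty.
If repetition_penalty < 1, identical tokens are rewarded, which encourages the model to output them again.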
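The three cases can be checked numerically with a small sketch. `apply_repetition_penalty` is a hypothetical helper that follows the divide-if-positive, multiply-if-negative rule; the logits and the set of already generated token ids are made-up values.

```python
import math


def apply_repetition_penalty(logits, seen_ids, penalty):
    """Rescale the logits of already-seen token ids so that penalty > 1 always lowers them."""
    out = list(logits)
    for i in set(seen_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.0, 1.0, 4.9, 4.5, -1.0]
seen = [2, 4]  # token ids 2 and 4 were already generated

for penalty in (1.0, 1.2, 0.8):
    probs = softmax(apply_repetition_penalty(logits, seen, penalty))
    print(penalty, [round(p, 3) for p in probs])
```

With a penalty of 1.2 the probability of token 2 drops, with 0.8 it rises, and with 1.0 the distribution is unchanged.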

Hands-on practice

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(model_path: str):
    # Qwen2 is supported natively by transformers, so trust_remote_code=True is optional here.
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, )
    return model, tokenizer
def get_answer(prompt: str, tokenizer, model):
    device = "cuda"  # the device to load the model onto

    # Instead of using model.chat(), we directly use model.generate()
    # But you need to use tokenizer.apply_chat_template() to format your inputs as shown below
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # Directly use generate() and tokenizer.decode() to get the output.
    # Use `max_new_tokens` to control the maximum output length.
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512,
        top_k=50,
        top_p=1,
        do_sample=False,
        temperature=1,
        repetition_penalty=10.0,
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

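With the repetition penalty set to 10.0, the model cannot repeat output as instructed.

With the repetition penalty set to 1, i.e. no penalty, the model can repeat output as instructed.

With the repetition penalty set below 1, it acts as a reward and the model tends to output repeated tokens.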

assisted generation

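assisted generation, also known as speculative decoding, was described in a paper dated May 18, 2023. This generation strategy is used to speed up inference.

As noted in the background section, generation is an autoregressive process. If the model has 100B parameters, every autoregressive step must apply all 100B parameters to the input token sequence.

The core idea of speculative decoding is to run inference with 2 models working together (for example a large model with 100B parameters and a small model with 1B parameters). The small model first runs several autoregressive steps, and the small batch of tokens it produces is handed to the large model, which verifies them in a single forward pass. Tokens the large model agrees with are kept; at the first token it rejects, the large model supplies the correction.

Note: the large model and the small model must use exactly the same tokenizer and vocabulary.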
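One draft-then-verify round can be sketched with two toy "models". This is a simplification under stated assumptions: `greedy_next` and `speculative_step` are hypothetical helpers, both models are plain lookup tables, and verification is a greedy exact match rather than the probabilistic acceptance rule used when sampling.

```python
def greedy_next(model, tokens):
    """Toy 'model': a dict from the last token to the next token (hypothetical stand-in)."""
    return model.get(tokens[-1], "<eos>")


def speculative_step(target, draft, tokens, num_draft_tokens=3):
    """One round: the draft model proposes tokens, the target model keeps the agreeing prefix."""
    # 1. The small model drafts several tokens autoregressively.
    proposal = []
    for _ in range(num_draft_tokens):
        proposal.append(greedy_next(draft, tokens + proposal))

    # 2. The large model checks each drafted position. A real implementation scores
    #    all positions in one forward pass; this loop only imitates the outcome.
    accepted = []
    for drafted in proposal:
        expected = greedy_next(target, tokens + accepted)
        if drafted != expected:
            accepted.append(expected)  # first disagreement: take the target's token and stop
            break
        accepted.append(drafted)
    return tokens + accepted


target = {"once": "upon", "upon": "a", "a": "time", "time": "there"}
draft = {"once": "upon", "upon": "a", "a": "day", "day": "ago"}

print(speculative_step(target, draft, ["once"]))
# ['once', 'upon', 'a', 'time']
```

The result is exactly what the large model would have produced greedily on its own, but here it only had to run one verification round instead of three separate steps.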

dola generation

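dola generation, short for Decoding by Contrasting Layers, was described in a paper dated May 11, 2024. This text generation strategy aims to address output that is not factual, i.e. hallucination.

Its core idea: amplify the factual knowledge in the model through contrastive decoding. Specifically, the probability of the next token is obtained from the difference between the outputs of a higher layer and a lower layer of the model.

Take the figure below as an example. LLaMA-7B has 32 transformer blocks. With the prompt "Where is the capital of Washington State?", the probability assigned to the token Seattle stays roughly constant across layers 8, 16, 24 and 32, while the probability assigned to the token Olympia rises across those layers. After the contrastive strategy is applied, the probability of Olympia is therefore boosted, and the model ends up preferring Olympia as the first output token for the prompt.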
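The contrast itself can be illustrated with made-up numbers that mimic the figure: Seattle keeps the same probability in both layers while Olympia rises. `contrast_layers` is a hypothetical helper and the two distributions are assumptions; the method described in the paper additionally chooses the lower layer dynamically and restricts the candidates to plausible tokens, both of which are omitted here.

```python
import math


def contrast_layers(final_probs, early_probs):
    """Score each token by log p_final(token) - log p_early(token)."""
    return [math.log(f) - math.log(e) for f, e in zip(final_probs, early_probs)]


tokens = ["Seattle", "Olympia", "<other>"]
early_probs = [0.50, 0.05, 0.45]  # hypothetical early-layer distribution
final_probs = [0.50, 0.40, 0.10]  # hypothetical final-layer distribution

greedy = tokens[max(range(3), key=lambda i: final_probs[i])]
scores = contrast_layers(final_probs, early_probs)
contrasted = tokens[max(range(3), key=lambda i: scores[i])]
print(greedy, contrasted)  # Seattle Olympia
```

Picking from the final layer alone would give Seattle; contrasting against the early layer rewards the token whose probability grew the most, which is Olympia.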

Hands-on practice

The author was very curious whether this strategy can really solve the hallucination problem. The test went as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(model_path: str):
    # Qwen2 is supported natively by transformers, so trust_remote_code=True is optional here.
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, )
    return model, tokenizer
def get_answer(prompt: str, tokenizer, model):
    device = "cuda"  # the device to load the model onto

    # Instead of using model.chat(), we directly use model.generate()
    # But you need to use tokenizer.apply_chat_template() to format your inputs as shown below
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # Directly use generate() and tokenizer.decode() to get the output.
    # Use `max_new_tokens` to control the maximum output length.
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512,
        do_sample=False,
        dola_layers='high',
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

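The author first set dola_layers='high' and gave the model the prompt "都江堰是谁修建的" ("Who built Dujiangyan?"). The model answered that Dujiangyan was built in the Spring and Autumn period.

After removing the dola_layers setting, the model answered that it was built in the Warring States period.

According to Baidu Baike, it was in fact built in the Warring States period. A single experiment like this shows that setting dola_layers does not necessarily make the model produce factual output, so the author does not recommend this text generation strategy.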

beam search

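During generation, beam search tracks the cumulative probability of several candidate sentences; at each autoregressive step it keeps those with a high cumulative probability and drops those with a low one.

Hugging Face hosts an interactive beam search visualizer; the diagram the author produced with it is shown below.

Here the prompt given to gpt2 is "nice to meet you", and the generation parameters are:

number of steps=4: the model runs 4 autoregressive steps.
number of beams=4: at each step the model tracks the cumulative score of all sentences, but continues generating only from the 4 sentences with the highest cumulative score.
length penalty=1: the exponent applied to the sequence length that divides the cumulative score; 1 is the transformers default.
number of return sequences=3: the number of highest-scoring sentences to return.

With these parameters, after the input "nice to meet you" the model returned 3 token sequences:

nice to meet you."\n\n" with a cumulative score of -1.13
nice to meet you.\n\n" with a cumulative score of -1.60
nice to meet you.\n\nI with a cumulative score of -1.6

The scores are negative because they are cumulative log-probabilities. All other sentences scored lower than these 3, so they were not returned.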
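The bookkeeping behind beam search can be sketched on a toy model. `MODEL` is a made-up table of next-token probabilities that depend only on the last token, and `beam_search` is an illustrative function that ignores length penalty and end-of-sequence handling.

```python
import math

# Hypothetical toy model: next-token probabilities depend only on the last token.
MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.4, "b": 0.3, "c": 0.3},
    "b": {"a": 0.05, "b": 0.05, "c": 0.9},
    "c": {"a": 0.3, "b": 0.3, "c": 0.4},
}


def beam_search(start, num_steps, num_beams, num_return_sequences):
    """Keep the num_beams sequences with the highest cumulative log-probability at every step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(num_steps):
        expanded = [
            (tokens + [tok], score + math.log(p))
            for tokens, score in beams
            for tok, p in MODEL[tokens[-1]].items()
        ]
        expanded.sort(key=lambda beam: beam[1], reverse=True)
        beams = expanded[:num_beams]
    return beams[:num_return_sequences]


for tokens, score in beam_search("<s>", num_steps=2, num_beams=2, num_return_sequences=2):
    print(" ".join(tokens), round(score, 2))
```

With num_beams=2 the best sequence is "<s> b c" (probability 0.4 × 0.9 = 0.36), whereas num_beams=1, which is greedy search, commits to "a" first and ends at "<s> a a" (0.6 × 0.4 = 0.24).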

Hands-on practice

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(model_path: str):
    # Qwen2 is supported natively by transformers, so trust_remote_code=True is optional here.
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, )
    return model, tokenizer
def get_answer(prompt: str, tokenizer, model):
    device = "cuda"  # the device to load the model onto

    # Instead of using model.chat(), we directly use model.generate()
    # But you need to use tokenizer.apply_chat_template() to format your inputs as shown below
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # Directly use generate() and tokenizer.decode() to get the output.
    # Use `max_new_tokens` to control the maximum output length.
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512,
        do_sample=False,
        num_beams=5,
        num_return_sequences=2,
        top_p=1,
        temperature=1,
    )

    # Strip the prompt tokens from every returned sequence.
    prompt_len = model_inputs.input_ids.shape[1]
    generated_ids = [output_ids[prompt_len:] for output_ids in generated_ids]


    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return response

As shown, with num_return_sequences=2 the model returned 2 answers.

Over 3 repeated runs, the highest-scoring sentences returned by the model were the same every time.

beam sample

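The core idea of beam sample is the same as that of beam search: track the cumulative probability of several sentences during generation, keep those with a high cumulative probability and drop those with a low one.

Difference: before computing the cumulative probabilities at each step, beam search takes the highest-probability tokens, whereas beam sample draws tokens by multinomial random sampling.

As a result, the output of beam sample is random.

Hands-on practice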

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(model_path: str):
    # Qwen2 is supported natively by transformers, so trust_remote_code=True is optional here.
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, )
    return model, tokenizer
def get_answer(prompt: str, tokenizer, model):
    device = "cuda"  # the device to load the model onto

    # Instead of using model.chat(), we directly use model.generate()
    # But you need to use tokenizer.apply_chat_template() to format your inputs as shown below
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # Directly use generate() and tokenizer.decode() to get the output.
    # Use `max_new_tokens` to control the maximum output length.
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512,
        do_sample=True,
        num_beams=5,
        num_return_sequences=2,
        top_p=0.9,
        top_k=50,
        temperature=10.0,
    )

    # Strip the prompt tokens from every returned sequence.
    prompt_len = model_inputs.input_ids.shape[1]
    generated_ids = [output_ids[prompt_len:] for output_ids in generated_ids]


    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return response

Over 3 repeated runs, the 2 highest-scoring sentences returned by the model varied from run to run.

constrained beam search

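When constraints or force_words_ids is specified in the generation parameters, the constrained beam search strategy is used for text generation.

constrained beam search builds on the idea of beam search: track the cumulative probability of several sentences during generation, keep those with a high cumulative probability and drop those with a low one.

Difference from beam search: constrained beam search lets the caller require that the model output contain a certain token or tokens.

force_words_ids

Like force_words_ids, constraints requires the model output to contain a certain token or tokens. This section focuses on force_words_ids; constraints works in a similar way.

Take the figure below as an example. The caller wants the output token sequence to contain the tokens for the 2 words "is fast", and the prompt is "The". In the first autoregressive step, besides tracking the top-k most probable continuations as in beam search (The dog, The nice, The car, The boat), the token required by the caller is also appended to the input, and generation continues along that direction as well (The is).

In the second autoregressive step, the model continues only from the 2 beams with the highest cumulative probability, The dog and The nice, plus The is, which contains a required token. Continuation is again split into 2 parts: one part appends the next required token, and the other appends the top-k (top k=3) most probable tokens as in beam search.

Since only 2 tokens are required here, after 2 steps one sentence (The is fast) already contains all required tokens; it corresponds to Bank 2 in the figure below.

A word on what Bank n means: Bank n is the set of beams that have advanced n steps toward satisfying the required token list. In the figure below, Bank 1 has advanced 1 step toward the required tokens, while Bank 0 has not advanced at all. Because the required tokens do not necessarily form a fluent, meaningful sentence, when returning results constrained beam search picks, according to number of return sequences, the sentences with the highest cumulative probability from bank 0, bank 1, bank 2, ..., bank n.

Hands-on practice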

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(model_path: str):
    # Qwen2 is supported natively by transformers, so trust_remote_code=True is optional here.
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, )
    return model, tokenizer
def get_answer(prompt: str, tokenizer, model):
    device = "cuda"  # the device to load the model onto

    # Instead of using model.chat(), we directly use model.generate()
    # But you need to use tokenizer.apply_chat_template() to format your inputs as shown below
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    force_words=["达模型"]
    force_words_ids = tokenizer(force_words, add_special_tokens=False).input_ids

    # Directly use generate() and tokenizer.decode() to get the output.
    # Use `max_new_tokens` to control the maximum output length.
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512,
        do_sample=False,
        num_beams=5,
        num_return_sequences=5,
        force_words_ids = force_words_ids,
    )

    # Strip the prompt tokens from every returned sequence.
    prompt_len = model_inputs.input_ids.shape[1]
    generated_ids = [output_ids[prompt_len:] for output_ids in generated_ids]


    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return response

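The model was required to include "达模型" in its output.

The last word of the 5th beam returned is the required word, but the overall output is repetitive and not very usable.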

group beam search

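According to the paper, group beam search addresses the following problem: the sentences found along different beams in beam search share highly similar prefixes and differ only in their last few words. group beam search aims for beams within the same beam group to share similar prefixes, while different beam groups use dissimilar prefixes.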
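One way to push groups apart is to penalize, within a step, any token that an earlier group has already chosen. The sketch below assumes such a count-based penalty; `pick_with_diversity` is a hypothetical helper and the log-probabilities are made-up numbers, with each group reduced to a single beam for brevity.

```python
from collections import Counter


def pick_with_diversity(log_probs, chosen_by_earlier_groups, diversity_penalty):
    """Pick the best token after subtracting a penalty for each earlier group that chose it."""
    counts = Counter(chosen_by_earlier_groups)
    adjusted = {tok: lp - diversity_penalty * counts[tok] for tok, lp in log_probs.items()}
    return max(adjusted, key=adjusted.get)


# Hypothetical log-probabilities for the next token, shared by every group at this step.
log_probs = {"I": -0.5, "My": -1.2, "As": -1.6}

chosen = []
for _ in range(3):  # three groups pick one after another within the same step
    chosen.append(pick_with_diversity(log_probs, chosen, diversity_penalty=10.0))
print(chosen)  # ['I', 'My', 'As']
```

With diversity_penalty=0.0 every group would pick "I", which is the prefix-collapse problem described above.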

Hands-on practice

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(model_path: str):
    # Qwen2 is supported natively by transformers, so trust_remote_code=True is optional here.
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, )
    return model, tokenizer
def get_answer(prompt: str, tokenizer, model):
    device = "cuda"  # the device to load the model onto

    # Instead of using model.chat(), we directly use model.generate()
    # But you need to use tokenizer.apply_chat_template() to format your inputs as shown below
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # Directly use generate() and tokenizer.decode() to get the output.
    # Use `max_new_tokens` to control the maximum output length.
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512,
        do_sample=False,
        num_beams=5,
        num_return_sequences=5,
        num_beam_groups=5,
        diversity_penalty=10.0,
    )

    # Strip the prompt tokens from every returned sequence.
    prompt_len = model_inputs.input_ids.shape[1]
    generated_ids = [output_ids[prompt_len:] for output_ids in generated_ids]


    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return response

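As shown, the 5 results returned from the 5 groups each have their own character.

Summary

This article described the 9 text generation strategies supported by transformers.generation.utils.GenerationMixin.generate and tried 8 of them hands-on with qwen2-7b-instruct. The conclusions are:

Hallucination can still occur with dola generation, which may return conclusions that are not factual; the author does not recommend it.
If you want the model's output to be the same every time, use greedy search or beam search.
If you want diverse output, the author suggests simply using the inference settings in the generation_config.json published with the model.
If you want several answers in different styles, use group beam search.

References

How top_k, top_p and temperature affect sampling results: medium.com/@daniel.pue…
Hugging Face's introduction to contrastive_search: huggingface.co/blog/introd…
Paper introducing the repetition penalty, from which the author took the formula screenshot: arxiv.org/pdf/1909.05…
Introduction to speculative decoding: medium.com/ai-science/…
Hugging Face's documentation on text generation strategies: huggingface.co/docs/transf…
The DoLa paper: arxiv.org/pdf/2309.03…
Hugging Face beam search visualizer: huggingface.co/spaces/m-ri…
Hugging Face's explanation of constrained beam search: huggingface.co/blog/constr…
The original group beam search paper: arxiv.org/pdf/1610.02…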