Model structure: Qwen2

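Background

This article uses the Qwen2 family of large language models as a basis to explain the Qwen2 technical architecture and how the model works.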

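Tokenization

Vocabulary design affects both training efficiency and downstream task performance. The Qwen series uses the tiktoken tokenizer, a fast tokenizer that is also used by OpenAI's models. The core logic of tiktoken is still the BPE algorithm, so BPE is introduced first and tiktoken after it.

Splitting an input string into words or subwords (parts of words) is one of the most basic steps in natural language processing. This step is called tokenization. Many algorithms exist; the most classic is BPE (Byte Pair Encoding).

Core code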

bpe_train

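  Function definition:
  def bpe_train(data: str, vocab_size: int, pat_str: str) -> dict[bytes, int]
      Sentence: data = "你好，qwen大模型"
      Vocabulary size: vocab_size=275
      Pre-tokenization regex: pat_str = r"'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"
  Steps:
  step1:  Check whether vocab_size < 2**8 = 256. If so, raise an error; otherwise continue. See notes1 for the reason.
  step2:  Fill ids 0-255 into the rank vocabulary first. This part of the vocabulary is fixed; an excerpt is shown below.
  {b'\x00': 0, b'\x01': 1,...b'A': 65, b'B': 66,..., b'y': 121, b'z': 122,...,b'\xfe': 254, b'\xff': 255}
  step3:  Split data with pat_str into the pieces ["你好"], ["，"], ["qwen大模型"], then turn each piece into a list of single bytes:
  data = [[b'\xe4', b'\xbd', b'\xa0', b'\xe5', b'\xa5', b'\xbd'], [b'\xef', b'\xbc', b'\x8c'], [b'q', b'w', b'e', b'n', b'\xe5', b'\xa4', b'\xa7', b'\xe6', b'\xa8', b'\xa1', b'\xe5', b'\x9e', b'\x8b']]
  step4:  Count adjacent byte pairs, merge the most frequent pair (here every pair occurs once, so the first pair seen wins), and add the merged bytes to the vocabulary with the next free id, starting at 256. Repeat until the size of the rank vocabulary equals vocab_size.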
  Co-occurring byte pairs:
  {
  (b'\xe4', b'\xbd'): 1,
  (b'\xbd', b'\xa0'): 1,
  (b'q', b'w'): 1,
  (b'w', b'e'): 1,
  (b'e', b'n'): 1,
  ...
  }
  Rank vocabulary after the first merge:
  {
  b'\x00': 0,
  b'\x01': 1,
  ...,
  b'\xe4\xbd': 256,
  }
  After repeated iterations, the final rank vocabulary:
  {
  b'\x00': 0,
  b'\x01': 1,
  ...
  b'\xe4\xbd': 256
  b'\xe4\xbd\xa0': 257
  b'\xe4\xbd\xa0\xe5': 258
  b'\xe4\xbd\xa0\xe5\xa5': 259
  b'\xe4\xbd\xa0\xe5\xa5\xbd': 260
  b'\xef\xbc': 261
  b'\xef\xbc\x8c': 262
  b'qw': 263
  b'qwe': 264
  b'qwen': 265
  b'qwen\xe5': 266
  b'qwen\xe5\xa4': 267
  b'qwen\xe5\xa4\xa7': 268
  b'qwen\xe5\xa4\xa7\xe6': 269
  b'qwen\xe5\xa4\xa7\xe6\xa8': 270
  b'qwen\xe5\xa4\xa7\xe6\xa8\xa1': 271
  b'qwen\xe5\xa4\xa7\xe6\xa8\xa1\xe5': 272
  b'qwen\xe5\xa4\xa7\xe6\xa8\xa1\xe5\x9e': 273
  b'qwen\xe5\xa4\xa7\xe6\xa8\xa1\xe5\x9e\x8b': 274
  }

  step5:  END
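
The steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not tiktoken's implementation: the text is assumed to be already split into pieces (the `pat_str` step is skipped so that only the standard library is needed), and the name `bpe_train_sketch` and its `pieces` parameter are made up for illustration.

```python
from collections import Counter


def bpe_train_sketch(pieces: list[str], vocab_size: int) -> dict[bytes, int]:
    # step1: the vocabulary must at least cover every possible byte value
    if vocab_size < 2**8:
        raise ValueError("vocab_size must be at least 256")
    # step2: ids 0-255 are the single bytes
    ranks = {bytes([i]): i for i in range(2**8)}
    # step3: every piece becomes a list of single-byte tokens
    words = [[bytes([b]) for b in piece.encode("utf-8")] for piece in pieces]
    # step4: merge the most frequent adjacent pair until the vocabulary is full
    while len(ranks) < vocab_size:
        stats = Counter()
        for word in words:
            for pair in zip(word[:-1], word[1:]):
                stats[pair] += 1
        if not stats:
            break  # nothing left to merge
        best = max(stats, key=lambda p: stats[p])  # first pair wins on ties
        merged = best[0] + best[1]
        ranks[merged] = len(ranks)
        new_words = []
        for word in words:
            new_word, i = [], 0
            while i < len(word) - 1:
                if (word[i], word[i + 1]) == best:
                    new_word.append(merged)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            if i == len(word) - 1:
                new_word.append(word[i])
            new_words.append(new_word)
        words = new_words
    return ranks


ranks = bpe_train_sketch(["你好", "，", "qwen大模型"], vocab_size=275)
print(ranks["你好".encode("utf-8")], ranks["qwen大模型".encode("utf-8")])
```

With the three pieces from step3 and vocab_size=275, this sketch assigns the same ids to the fully merged pieces as the final rank vocabulary listed above.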

bpe_encode

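  Function definition:
  def bpe_encode(mergeable_ranks: dict[bytes, int], input: bytes) -> list[int]:
  Steps:
  Encoding is fairly simple and closely mirrors training: split input into single bytes, repeatedly look up adjacent pairs in the vocabulary and merge the pair with the lowest rank, and stop when no adjacent pair is in the vocabulary. The ids of the remaining pieces are the encoding.
  Experiment: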
  ```python
  >data = "你好，qwen大模型"
  >pat_str = r"'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"
  >mergeable_ranks = bpe_train(data=data, vocab_size=275, pat_str=pat_str)
  >print(bpe_encode(mergeable_ranks, data.encode("utf-8")))
  [260, 262, 274]
  ```
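  Result:
  After running the code above, "你好，qwen大模型" is encoded as [260, 262, 274]. Looking these ids up in the final vocabulary from bpe_train
  gives the corresponding bytes:
  b'\xe4\xbd\xa0\xe5\xa5\xbd': 260
  b'\xef\xbc\x8c': 262
  b'qwen\xe5\xa4\xa7\xe6\xa8\xa1\xe5\x9e\x8b': 274
  Decoding them again:
  b'\xe4\xbd\xa0\xe5\xa5\xbd'.decode("utf8") = 你好
  b'\xef\xbc\x8c'.decode("utf8") = ，
  b'qwen\xe5\xa4\xa7\xe6\xa8\xa1\xe5\x9e\x8b'.decode("utf8") = qwen大模型
  Note that intermediate entries such as id 272 (b'qwen\xe5\xa4\xa7\xe6\xa8\xa1\xe5') end in the middle of a multi-byte character and raise an error if decoded as UTF-8 on their own.
  Explanation:
  The training above only builds a very small vocabulary, with no filtering or special handling. Real tokenizers are more involved and treat special tokens separately, which is briefly covered later.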
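
The encoding loop described above can be sketched as follows. This is an illustrative sketch, not the library code: the function name `bpe_encode_sketch` and the tiny hand-written vocabulary (`qw`, `qwe`, `qwen` with ids 256-258) are assumptions chosen so that the block runs on its own.

```python
def bpe_encode_sketch(mergeable_ranks: dict[bytes, int], data: bytes) -> list[int]:
    # Start from single bytes
    parts = [bytes([b]) for b in data]
    while True:
        # Find the adjacent pair whose merged bytes have the lowest rank
        min_idx = None
        min_rank = None
        for i, (left, right) in enumerate(zip(parts[:-1], parts[1:])):
            rank = mergeable_ranks.get(left + right)
            if rank is not None and (min_rank is None or rank < min_rank):
                min_idx, min_rank = i, rank
        if min_rank is None:
            break  # no adjacent pair is in the vocabulary any more
        parts = (
            parts[:min_idx]
            + [parts[min_idx] + parts[min_idx + 1]]
            + parts[min_idx + 2:]
        )
    return [mergeable_ranks[part] for part in parts]


# Toy vocabulary: all single bytes plus three hypothetical merges
toy_ranks = {bytes([i]): i for i in range(256)}
toy_ranks.update({b"qw": 256, b"qwe": 257, b"qwen": 258})

print(bpe_encode_sketch(toy_ranks, b"qwen!"))  # [258, 33]
```

Merging the lowest-rank pair first replays the merges in the order in which training learned them, which is why encoding and training look so similar.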

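Key points
notes1:
- 1 byte = 8 bits = [0-1][0-1][0-1][0-1][0-1][0-1][0-1][0-1], which can represent 2**8 = 256 states
- BPE works on bytes, so the vocabulary must cover at least every value of one byte, i.e. 256 states; each state can be seen as a vocabulary entry with id 0-255
- Conclusion: at least 1 byte -> the vocabulary has at least 256 entries
- Further background
  - ASCII maps characters to numbers. English is written with an alphabet: the 52 letters a-zA-Z cover almost all text, so 1 byte is enough for its basic characters. Chinese is logographic and its encoding is more complex; a single Chinese character usually needs several bytes.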
    - UTF-8 encoding: a common Chinese character takes 3 bytes
      ```python
      >>> zh = "你好"
      >>> zh.encode("utf8")
      b'\xe4\xbd\xa0\xe5\xa5\xbd'
      >>> zh[0].encode("utf8")
      b'\xe4\xbd\xa0'
      >>> zh[1].encode("utf8")
      b'\xe5\xa5\xbd'
      >>> list(zh.encode("utf8"))
      [228, 189, 160, 229, 165, 189]
      >>> for v in zh.encode("utf8"):
      ...     bin(v)
      ...
      '0b11100100'
      '0b10111101'
      '0b10100000'
      '0b11100101'
      '0b10100101'
      '0b10111101'
      ```
    - GBK encoding: a Chinese character takes 2 bytes
      ```python
      >>> zh = "你好"
      >>> zh.encode("gbk")
      b'\xc4\xe3\xba\xc3'
      >>> zh[0].encode("gbk")
      b'\xc4\xe3'
      >>> zh[1].encode("gbk")
      b'\xba\xc3'
      >>> list(zh.encode("gbk"))
      [196, 227, 186, 195]
      >>> for v in zh.encode("gbk"):
      ...     bin(v)
      ...
      '0b11000100'
      '0b11100011'
      '0b10111010'
      '0b11000011'
      ```

tiktoken

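The previous section covered the core logic of BPE. Why does Qwen2 use tiktoken as its tokenizer? The main reason is simple: it is fast.

Speed

As the figure above shows, tiktoken is 3-6x faster than a comparable open-source tokenizer from Hugging Face.

Rust implementation
- The core encoding logic of tiktoken is implemented in Rust, which brings several advantages.
- Performance: Rust is compiled ahead of time to native machine code, with no interpreter or garbage collector at runtime. Its speed is comparable to C++ and considerably higher than pure Python.
- Parallelism: Rust code can run on multiple threads without being limited by a global interpreter lock such as Python's GIL.

Algorithmic optimization

BPE can be implemented in several ways. The version described earlier is the simplest, naive one; tiktoken optimizes this part.
Optimized core BPE algorithm
The _byte_pair_merge function implements the core BPE logic. It maintains a parts vector that tracks the byte pairs that can still be merged.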

  fn _byte_pair_merge(ranks: &HashMap<Vec<u8>, Rank>, piece: &[u8]) -> Vec<(usize, Rank)> {
      // This is a vector of (start, rank).
      // The rank is of the pair starting at position start.
      let mut parts = Vec::with_capacity(piece.len() + 1);

      // Note that we hash bytes when indexing into `ranks`, not token pairs. As long as we train BPE
      // the way we currently do, this is equivalent. An easy way to break this would be to decouple
      // merge priority from token index or to prevent specific token merges.
      let mut min_rank: (Rank, usize) = (Rank::MAX, usize::MAX);
      for i in 0..piece.len() - 1 {
          let rank = *ranks.get(&piece[i..i + 2]).unwrap_or(&Rank::MAX);
          if rank < min_rank.0 {
              min_rank = (rank, i);
          }
          parts.push((i, rank));
      }
      parts.push((piece.len() - 1, Rank::MAX));
      parts.push((piece.len(), Rank::MAX));

      let get_rank = {
          #[inline(always)]
          |parts: &Vec<(usize, Rank)>, i: usize| {
              if (i + 3) < parts.len() {
                  // Similar to `piece[i..i + 2]` above. The +3 is because we haven't yet deleted
                  // parts[i + 1], see comment in the main loop.
                  *ranks
                      .get(&piece[parts[i].0..parts[i + 3].0])
                      .unwrap_or(&Rank::MAX)
              } else {
                  Rank::MAX
              }
          }
      };

      // If you have n parts and m merges, this does O(mn) work.
      // We could do something with a heap and do O(m log n) work.
      // n is often very small so considerations like cache-locality outweigh the algorithmic
      // complexity downsides of the `parts` vector.
      while min_rank.0 != Rank::MAX {
          let i = min_rank.1;
          // Update parts[i] and parts[i - 1] before removing parts[i + 1], since
          // `parts.remove(i + 1)` will thrash the cache.
          if i > 0 {
              parts[i - 1].1 = get_rank(&parts, i - 1);
          }
          parts[i].1 = get_rank(&parts, i);
          parts.remove(i + 1);

          min_rank = (Rank::MAX, usize::MAX);
          for (i, &(_, rank)) in parts[..parts.len() - 1].iter().enumerate() {
              if rank < min_rank.0 {
                  min_rank = (rank, i);
              }
          }
      }
      parts
  }
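
For readers who do not read Rust, the function above can be transcribed to Python roughly as follows. This is a reading aid under stated assumptions, not part of tiktoken: the name `byte_pair_merge_sketch` is made up, `math.inf` stands in for `Rank::MAX`, the toy vocabulary is hypothetical, and the sketch returns token ids directly instead of the `parts` vector.

```python
import math


def byte_pair_merge_sketch(ranks: dict[bytes, int], piece: bytes) -> list[int]:
    # parts holds [start, rank]; rank belongs to the pair starting at `start`
    parts = []
    min_rank = (math.inf, None)
    for i in range(len(piece) - 1):
        rank = ranks.get(piece[i:i + 2], math.inf)
        if rank < min_rank[0]:
            min_rank = (rank, i)
        parts.append([i, rank])
    parts.append([len(piece) - 1, math.inf])
    parts.append([len(piece), math.inf])

    def get_rank(i: int):
        # +3 because parts[i + 1] has not been deleted yet (see the main loop)
        if i + 3 < len(parts):
            return ranks.get(piece[parts[i][0]:parts[i + 3][0]], math.inf)
        return math.inf

    while min_rank[0] != math.inf:
        i = min_rank[1]
        # Refresh the ranks of the two pairs affected by the merge, then merge
        if i > 0:
            parts[i - 1][1] = get_rank(i - 1)
        parts[i][1] = get_rank(i)
        del parts[i + 1]
        # Linear rescan for the next lowest rank, as in the Rust version
        min_rank = (math.inf, None)
        for j, (_, rank) in enumerate(parts[:-1]):
            if rank < min_rank[0]:
                min_rank = (rank, j)
    # Consecutive start offsets delimit the final tokens
    return [ranks[piece[parts[k][0]:parts[k + 1][0]]] for k in range(len(parts) - 1)]


# Toy vocabulary: all single bytes plus three hypothetical merges
toy_ranks = {bytes([i]): i for i in range(256)}
toy_ranks.update({b"qw": 256, b"qwe": 257, b"qwen": 258})
print(byte_pair_merge_sketch(toy_ranks, b"qwen!"))  # [258, 33]
```

Unlike the naive version, the piece is never copied: a merge only deletes one entry from `parts` and refreshes the ranks of its two neighbours.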

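Special token handling

The _encode_native and _encode_unstable_native functions treat ordinary text and special tokens differently, which allows user-defined special tokens to appear in the input during encoding.

Function definition:
_encode_native(&self, text: &str, allowed_special: &HashSet<&str>) -> (Vec<Rank>, usize) 
Core code:
step1: use a regex to find the next special token, starting from the current search position
next_special = special_regex.find_from_pos(text, start_find).unwrap();
step2: scan text for special-token matches. A match that is not in allowed_special is skipped and the search continues one position further; a match that is in allowed_special ends the search, and the special token is then emitted directly as its id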
```rust
let mut next_special;
let mut start_find = start;
loop {
    // Find the next allowed special token, if any
    next_special = special_regex.find_from_pos(text, start_find).unwrap();
    match next_special {
        Some(m) => {
            if allowed_special.contains(&text[m.start()..m.end()]) {
                break;
            }
            start_find = m.start() + 1;
        }
        None => break,
    }
}
...
match next_special {
    Some(m) => {
        let piece = m.as_str();
        let token = self.special_tokens_encoder[piece];
        ret.push(token);
        start = m.end();
        last_piece_token_len = 0;
    }
    None => break,
}

```
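
The same control flow can be sketched in Python with the standard `re` module. This is a simplified illustration, not tiktoken's API: `split_on_special` is a hypothetical helper that only separates ordinary text from allowed special tokens and leaves the BPE encoding of the ordinary text out.

```python
import re


def split_on_special(text: str, special_tokens: dict[str, int], allowed_special: set[str]):
    # One regex that matches any known special token
    pattern = re.compile("|".join(re.escape(tok) for tok in special_tokens))
    out = []
    start = 0
    while True:
        # Find the next *allowed* special token, skipping disallowed matches
        start_find = start
        while True:
            match = pattern.search(text, start_find)
            if match is None or match.group() in allowed_special:
                break
            start_find = match.start() + 1
        end = match.start() if match else len(text)
        if end > start:
            out.append(("text", text[start:end]))  # would go through BPE
        if match is None:
            break
        out.append(("special", special_tokens[match.group()]))  # emitted as its id
        start = match.end()
    return out


specials = {"<|im_start|>": 100264, "<|im_end|>": 100265}
print(split_on_special("<|im_start|>hi<|im_end|>", specials, {"<|im_start|>"}))
```

In this example only `<|im_start|>` is allowed, so `<|im_end|>` stays inside the ordinary text segment and would be tokenized like any other string.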

Caching

Core code: the compiled regular expression is cached and returned. It matches any one token from the given set of special tokens.
@functools.lru_cache(maxsize=128)
def _special_token_regex(tokens: frozenset[str]) -> "regex.Pattern[str]":
    inner = "|".join(regex.escape(token) for token in tokens)
    return regex.compile(f"({inner})")
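
The effect of the cache can be seen in a small experiment. The sketch below assumes the standard `re` module in place of the third-party `regex` package, and the function name `special_token_regex` is chosen for illustration.

```python
import functools
import re


@functools.lru_cache(maxsize=128)
def special_token_regex(tokens: frozenset) -> re.Pattern:
    # frozenset is hashable, so it can serve as the cache key
    inner = "|".join(re.escape(token) for token in tokens)
    return re.compile(f"({inner})")


first = special_token_regex(frozenset({"<|im_start|>", "<|im_end|>"}))
second = special_token_regex(frozenset({"<|im_end|>", "<|im_start|>"}))

print(first is second)                         # True: the second call hits the cache
print(special_token_regex.cache_info().hits)   # 1
```

Equal frozensets hash to the same key regardless of element order, so the pattern is compiled once per distinct set of special tokens.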

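Hash table

See _byte_pair_merge above: ranks are stored in a HashMap, which speeds up lookups.

Extensibility

tiktoken can be extended. As mentioned earlier, vocabulary training needs some special handling: certain special strings should each map to a single id instead of being split into pieces. tiktoken supports this, as the following example shows.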

  cl100k_base = tiktoken.get_encoding("cl100k_base")

  # In production, load the arguments directly instead of accessing private attributes
  # See openai_public.py for examples of arguments for specific encodings
  enc = tiktoken.Encoding(
      # If you're changing the set of special tokens, make sure to use a different name
      # It should be clear from the name what behaviour to expect.
      name="cl100k_im",
      pat_str=cl100k_base._pat_str,
      mergeable_ranks=cl100k_base._mergeable_ranks,
      special_tokens={
          **cl100k_base._special_tokens,
          "<|im_start|>": 100264,
          "<|im_end|>": 100265,
      }
  )

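Model architecture

The Qwen2 series is fundamentally a family of large language models based on the Transformer architecture. It comes in two variants: dense models and MoE (Mixture-of-Experts) models.

Overall architecture

Dense Model

Grouped Query Attention

Qwen2 uses GQA (Grouped Query Attention) in place of the MHA (Multi-Head Attention) used by earlier Qwen models. GQA reduces KV-cache usage during inference, which significantly improves throughput.
- Background
Autoregressive decoder inference is a bottleneck for Transformer models: at every decoding step the decoder weights and all attention keys and values have to be loaded, which causes memory overhead.
- MQA
MQA (Multi-Query Attention) reduces this overhead by keeping multiple query heads but only a single key head and a single value head. However, it tends to degrade model quality and make training unstable. An MQA model can be obtained from an MHA checkpoint: the key and value projections of all heads are mean-pooled, so the multi-head keys and values collapse into a single projection matrix each. The conversion takes two steps:
1. Convert the checkpoint
2. Run additional pre-training so that the model adapts to its new structure
Figure 3-1-1
- GQA
  GQA sits between MHA and MQA. For H query heads, MHA has H key/value heads, MQA has a single key/value head, and GQA has one key/value head per group. GQA splits the query heads into G groups, and each group shares one K and one V. GQA-G denotes G groups: GQA-1 has a single group and is equivalent to MQA, while GQA-H has H key/value heads and is equivalent to MHA.

  GQA is a compromise: it is faster than MHA and gives higher quality than MQA. GQA is not applied to the encoder self-attention layers, because the encoder runs in parallel over the sequence and memory is usually not the bottleneck there.

  Figure 3-1-2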
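
To make the memory argument concrete, the KV-cache size per token can be estimated with simple arithmetic. The sketch below is an assumption-laden estimate: the helper names are made up, 2 bytes per value assumes a 16-bit cache, and the head counts (28 query heads, 4 key/value heads, head size 128, 28 layers) are the Qwen2-7B values used later in this article.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    # One K and one V vector per layer and per key/value head
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    # Consecutive query heads share one key/value head
    queries_per_kv_head = num_q_heads // num_kv_heads
    return q_head // queries_per_kv_head


mha = kv_cache_bytes_per_token(num_layers=28, num_kv_heads=28, head_dim=128)
gqa = kv_cache_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)
mqa = kv_cache_bytes_per_token(num_layers=28, num_kv_heads=1, head_dim=128)
print(mha, gqa, mqa)   # 401408 57344 14336
print(mha // gqa)      # 7

# Query heads 0-6 read key/value head 0, heads 7-13 read head 1, and so on
print([kv_head_for_query_head(h, 28, 4) for h in (0, 6, 7, 27)])  # [0, 0, 1, 3]
```

Under these assumptions GQA with 4 key/value heads stores 7 times less cache per token than MHA with 28, while still keeping more key/value capacity than MQA.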

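Dual Chunk Attention with YARN

Qwen2 implements DCA (Dual Chunk Attention), which splits a long sequence into chunks of manageable length. If the input fits in a single chunk, DCA produces the same result as the original attention. DCA helps capture relative position information both within and across chunks, which improves long-context performance. In addition, YARN is used to rescale the attention weights for better length extrapolation.

SwiGLU as the activation function
- GLU
$GLU(x,W,V,b,c)=\sigma(xW+b)\otimes(xV+c)$
1. x: the input to the layer
2. Weight matrices W and V: the two weight matrices of the linear transforms
3. Bias terms b and c: the biases of the linear transforms
4. Sigmoid activation $\sigma$: applied to the linear transform xW + b to produce the gate
- SwiGLU
$Swish_\beta(x)=x\cdot\sigma(\beta x)$
1. $\sigma$ is the sigmoid function
2. $\beta$ is a tunable hyperparameter
$SwiGLU(x,W,V,b,c,\beta)=Swish_\beta(xW+b)\otimes(xV+c)$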
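
The two formulas can be checked numerically with a small pure-Python sketch. The helper names (`linear`, `swiglu`) and the toy 2x2 weights are assumptions made for illustration; real implementations operate on tensors.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def swish(x: float, beta: float = 1.0) -> float:
    # Swish_beta(x) = x * sigmoid(beta * x); beta = 1 gives SiLU
    return x * sigmoid(beta * x)


def linear(x, W, b):
    # x: vector of length m, W: m x n matrix (list of rows), b: vector of length n
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]


def swiglu(x, W, V, b, c, beta: float = 1.0):
    gate = [swish(a, beta) for a in linear(x, W, b)]   # Swish_beta(xW + b)
    value = linear(x, V, c)                             # xV + c
    return [g * v for g, v in zip(gate, value)]         # element-wise product


x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]   # toy identity weights
V = [[2.0, 0.0], [0.0, 2.0]]
zero = [0.0, 0.0]
print(swiglu(x, W, V, zero, zero))
```

With β = 1 the gate is the SiLU function, which matches the `act_fn = SiLU()` entry in the Qwen2MLP module dump later in this article.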

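Figure 3-1-3: GLUE Language-Understanding Benchmark

Summary:
1. The model is more stable
2. The computational cost is modest
3. Results are good in both pre-training and fine-tuning (experimental results)
4. Theoretical explanation: none is given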

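RMSNorm as the layer normalization technique
- LayerNorm
  - $a_i=\sum_{j=1}^{m}w_{ij}x_j,\quad y_i=f(a_i+b_i)\qquad(1)$
  - $\overline{a}_i=\frac{a_i-\mu}{\sigma}g_i,\quad y_i=f(\overline{a}_i+b_i)\qquad(2)$
  - Formula (1):
    
    $x\in{R^m},y\in{R^n}$
    
    $w_i$ is the weight vector of the $i$-th output neuron, $b_i$ is a scalar bias initialized to 0, and $f$ is an element-wise nonlinear function
    
    Formula (2):
    
    $\overline{a}_i$ is the $i$-th value of the vector $\overline{a}\in{R^n}$, $g_i$ is a gain coefficient initialized to 1, and $\mu$ and $\sigma^2$ are the mean and variance of $a$.
    
    Summary:
    
    The problem with formula (1) is that it is sensitive to covariate shift, which makes gradients unstable and convergence hard. Normalizing with the mean and standard deviation as in (2) mitigates this.
- RMSNorm
$\overline{a}_i=\frac{a_i}{RMS(a)}g_i,\quad RMS(a)=\sqrt{\frac{1}{n}\sum\nolimits_{j=1}^{n}a_j^2}\qquad(3)$

Formula (3):

This formula keeps only the re-scaling invariance and drops re-centering: the input is normalized by its root mean square instead of by the usual mean and standard deviation.

Summary:

Without re-centering, this approach still achieves results similar to or better than LayerNorm.

Figure 3-1-4: SacreBLEU score of RNNSearch on newstest2013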
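
The difference between formulas (2) and (3) is easy to see on a small vector. The following pure-Python sketch is illustrative only: the function names are made up, the bias and nonlinearity are left out, and the `eps` term that real implementations add for numerical stability defaults to 0 here so that the code matches formula (3) exactly.

```python
import math


def layer_norm(a, g):
    # Formula (2): subtract the mean, divide by the standard deviation, scale by g
    n = len(a)
    mu = sum(a) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / n)
    return [(x - mu) / sigma * gi for x, gi in zip(a, g)]


def rms_norm(a, g, eps: float = 0.0):
    # Formula (3): divide by the root mean square only, no re-centering
    rms = math.sqrt(sum(x * x for x in a) / len(a) + eps)
    return [x / rms * gi for x, gi in zip(a, g)]


g = [1.0, 1.0, 1.0]
a = [1.0, 2.0, 3.0]
print(layer_norm(a, g))
print(rms_norm(a, g))

# Both are invariant to re-scaling the input ...
print(rms_norm([10 * x for x in a], g))
# ... but only LayerNorm is invariant to shifting it
print(layer_norm([x + 100 for x in a], g))
print(rms_norm([x + 100 for x in a], g))
```

RMSNorm skips the mean computation and the subtraction, which is where its cost saving over LayerNorm comes from.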

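Other

Pre-normalization: normalization is applied to the input of each sub-layer (before attention and before the MLP), which improves training stability
RoPE as the positional encoding
QKV bias

A learnable bias is added when computing Q, K and V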

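Expert granularity

The key structural difference between MoE models and dense models is that an MoE layer contains multiple feed-forward networks (FFNs), each acting as an independent expert.

Expert routing

The design of expert routing is critical to MoE performance. Recently there has been a clear trend toward combining shared experts and routed experts within an MoE layer. This design lets shared experts be applied across different tasks while the other experts are reserved for selective use in specific routing scenarios. Introducing both shared and specialized experts gives a more adaptable and more efficient way to build MoE routing mechanisms.

The configurations of the different Qwen2 models are shown in the figure below. "57B-A14B" means that the model has 57 billion parameters in total, of which 14 billion are active for each token. "Intermediate size" is the intermediate size of each expert's FFN, and "# Activated Experts" is the number of experts activated at inference time, not counting shared experts. In an MoE model each expert is an independent feed-forward network (FFN); the experts can process different parts of the input in parallel, which improves computational efficiency and expressive power. The numbers in this figure are important and are referred to again later.

Figure 3-2-1

An MoE layer consists of multiple MLPs with independent parameters; each MLP is called an expert. The router assigns every input token to one or more experts. If several experts process the same input token, their outputs are aggregated to form the final output for that token.

Figure 3-2-2

The following example shows how tokens are routed:

Suppose there are three input token representations, T1, T2 and T3.
The router weights are multiplied (dot product) with each token's embedding vector, producing one score per expert.
The scores are normalized with a softmax so that, for each token, they sum to 1.
Based on the normalized scores, each token is routed to the k experts with the highest scores.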
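
The four routing steps above can be sketched as follows. This is a minimal pure-Python illustration, not Qwen2's router: the function `route`, the 2-dimensional token and the three rows of router weights are all made-up toy values.

```python
import math


def softmax(xs):
    m = max(xs)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(token, router_weights, k):
    # One dot product per expert gives the raw scores
    logits = [sum(t * w for t, w in zip(token, row)) for row in router_weights]
    probs = softmax(logits)                  # scores sum to 1 for this token
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    return [(expert, probs[expert]) for expert in top]


token = [1.0, 0.0]
router_weights = [[0.0, 0.0],   # expert 0
                  [2.0, 0.0],   # expert 1
                  [1.0, 0.0]]   # expert 2
print(route(token, router_weights, k=2))
```

The returned probabilities are what an MoE layer would typically use as weights when aggregating the outputs of the selected experts.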

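Figure 3-2-3

Expert initialization

Expert weights are initialized with an approach similar to upcycling, which leverages the weights of a dense model.

Figure 3-2-4

Detailed structure

The previous sections described the overall architecture. This section walks through the model structure in detail and shows how an input is turned into an output. The Qwen2 series contains many models; Qwen2-7B-Instruct is used as the example here.

Model layers

Before going into the individual layers, look at the overall structure of the model. It can be divided into three main blocks: embed_tokens, layers and norm.

  Architecture: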
  Qwen2ForCausalLM(
    (model): Qwen2Model(
      (embed_tokens): Embedding(152064, 3584)
      (layers): ModuleList(
        (0-27): 28 x Qwen2DecoderLayer(
          (self_attn): Qwen2SdpaAttention(
            (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
            (k_proj): Linear(in_features=3584, out_features=512, bias=True)
            (v_proj): Linear(in_features=3584, out_features=512, bias=True)
            (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
            (rotary_emb): Qwen2RotaryEmbedding()
          )
          (mlp): Qwen2MLP(
            (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
            (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
            (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): Qwen2RMSNorm()
          (post_attention_layernorm): Qwen2RMSNorm()
        )
      )
      (norm): Qwen2RMSNorm()
    )
    (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
  )

embed_tokens

  Core code:
  self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx) = Embedding(152064, 3584)
  See Figure 3-2-1: here Hidden Size = 3584

layers

  Core code:
  self.layers = nn.ModuleList(
      [Qwen2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
  )
  Qwen2DecoderLayer (x 28):
  1. self_attn: Qwen2SdpaAttention
      hidden_size = 3584
      num_heads = 28 (Query Heads in Figure 3-2-1)
      head_dim = 128 (Head Size in Figure 3-2-1)
      num_key_value_heads = 4 (KV Heads in Figure 3-2-1)
      num_key_value_groups = 7 (see Figure 3-1-1: with 28 query heads and 4 key/value heads, each key/value head is shared by 28/4 = 7 query heads)
      num_heads*head_dim = 28*128 = 3584 = hidden_size
      self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim,           bias=True)  = Linear(in_features=3584, out_features=3584, bias=True)
      self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)  = Linear(in_features=3584, out_features=512, bias=True)
      self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)  = Linear(in_features=3584, out_features=512, bias=True)
      self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size,           bias=False) = Linear(in_features=3584, out_features=3584, bias=False)
  2. rotary_emb: Qwen2RotaryEmbedding
  3. mlp: Qwen2MLP
      hidden_size = 3584
      intermediate_size = 18944 (Intermediate Size in Figure 3-2-1)
      gate_proj = Linear(in_features=3584, out_features=18944, bias=False)
      up_proj =  Linear(in_features=3584, out_features=18944, bias=False)
      down_proj = Linear(in_features=18944, out_features=3584, bias=False)
      act_fn = SiLU()

norm

  Qwen2RMSNorm:
      input_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps), weight shape (3584,)
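
The numbers in the module dump above are tied together by simple arithmetic, which the following sketch checks. It only uses the sizes printed above; the helper `linear_params` and the variable names are made up, and the totals are derived from those sizes rather than read from the model.

```python
hidden_size = 3584
num_heads = 28
head_dim = 128
num_kv_heads = 4
intermediate_size = 18944
vocab_size = 152064
num_layers = 28

# Attention head bookkeeping
assert num_heads * head_dim == hidden_size
kv_out_features = num_kv_heads * head_dim          # 512, the k_proj / v_proj width
queries_per_kv_head = num_heads // num_kv_heads    # 7


def linear_params(in_features: int, out_features: int, bias: bool) -> int:
    return in_features * out_features + (out_features if bias else 0)


per_layer = {
    "q_proj": linear_params(hidden_size, hidden_size, bias=True),
    "k_proj": linear_params(hidden_size, kv_out_features, bias=True),
    "v_proj": linear_params(hidden_size, kv_out_features, bias=True),
    "o_proj": linear_params(hidden_size, hidden_size, bias=False),
    "gate_proj": linear_params(hidden_size, intermediate_size, bias=False),
    "up_proj": linear_params(hidden_size, intermediate_size, bias=False),
    "down_proj": linear_params(intermediate_size, hidden_size, bias=False),
    "input_layernorm": hidden_size,
    "post_attention_layernorm": hidden_size,
}

# Tensors per layer: weight + bias for q/k/v, weights for o/gate/up/down, two norms
tensors_per_layer = 3 * 2 + 4 + 2
total_tensors = tensors_per_layer * num_layers + 3   # + embed_tokens, norm, lm_head

total_params = (
    vocab_size * hidden_size                   # embed_tokens
    + num_layers * sum(per_layer.values())     # 28 decoder layers
    + hidden_size                              # final norm
    + vocab_size * hidden_size                 # lm_head
)
print(kv_out_features, queries_per_kv_head, total_tensors, total_params)
```

The total comes out at roughly 7.6 billion parameters, consistent with the "7B" in the model name.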

Parameter layer
- Parameter loading and model layers

Parameter loading:
    1. Load the model:
    def _load_pretrained_model(
    cls,
    model,
    state_dict,
    loaded_keys,
    resolved_archive_file,
    pretrained_model_name_or_path,
    ignore_mismatched_sizes=False,
    sharded_metadata=None,
    _fast_init=True,
    low_cpu_mem_usage=False,
    device_map=None,
    offload_folder=None,
    offload_state_dict=None,
    dtype=None,
    hf_quantizer=None,
    keep_in_fp32_modules=None,
    gguf_path=None,
    ):
    resolved_archive_file = [
    'model-00001-of-00004.safetensors', 
    'model-00002-of-00004.safetensors', 
    'model-00003-of-00004.safetensors', 
    'model-00004-of-00004.safetensors'
    ]
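    2. Load the parameters:
    step1: load each shard file in resolved_archive_file in turn
        state_dict = load_state_dict(shard_file, is_quantized=is_quantized)
        shard_file = 'model-00001-of-00004.safetensors'
    step2: read the file with safetensors.torch.load_file()
        ```python
        from safetensors import safe_open

        def load_file(filename, device="cpu"):
            # Read every tensor in the shard into a dict keyed by parameter name
            result = {}
            with safe_open(filename, framework="pt", device=device) as f:
                for k in f.keys():
                    result[k] = f.get_tensor(k)
            return result
        ```
        After this code runs, the parameter values have been loaded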
    
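Parameter structure:
    1. About the parameters: the list below has 339 entries, computed as 339 = 12*28 + 3 (embed_tokens, norm, lm_head)
    2. Running inference: again take "你好，qwen大模型" as the input
        step1: encode
        prompt = "你好，qwen大模型"
        inputs = tokenizer(prompt, return_tensors="pt")
        inputs = 
        {'input_ids': tensor([[108386,   3837,     80,  16948,  26288, 104949]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
        tokens  = ['ä½łå¥½', 'ï¼Į', 'q', 'wen', 'å¤§', 'æ¨¡åŀĭ']
        "".join(tokens).encode("raw_unicode_escape","ignore").decode("UTF-8","ignore") = "\u0142好\u012eqwen大模\u0140\u012d"
        As shown above, every id maps to a token string in the vocabulary. These strings are not the text itself: the tokenizer is byte-level BPE, and each UTF-8 byte is displayed as a printable Unicode character (the GPT-2 style byte-to-unicode mapping). Bytes that are remapped to code points above 255, such as 0xa0 -> \u0142, do not survive the raw_unicode_escape round trip, which is why the result above is only partly readable Chinese.
        step2: assemble the model inputs
        model_inputs = 
        {
        'input_ids': tensor([[108386,   3837,     80,  16948,  26288, 104949]]), 
        'position_ids': tensor([[0, 1, 2, 3, 4, 5]]), 'past_key_values': DynamicCache(), 'use_cache': True, 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]]), 'cache_position': tensor([0, 1, 2, 3, 4, 5])
        }
        step3: forward pass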
        outputs = self.model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                position_ids=position_ids,
                past_key_values=past_key_values,
                inputs_embeds=inputs_embeds,
                use_cache=use_cache,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
                cache_position=cache_position,
        )
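       Layer 1 (embed_tokens): inputs_embeds = self.embed_tokens(input_ids): ids of shape (1, 6) index into a table of shape (152064, 3584) and give (1, 6, 3584), i.e. (batch_size, *) with (num_embeddings, embedding_dim) -> (batch_size, *, embedding_dim)
       This step maps tokens to embeddings:
       "你好，qwen大模型" -> ['ä½łå¥½', 'ï¼Į', 'q', 'wen', 'å¤§', 'æ¨¡åŀĭ'] -> [[108386,3837,80,16948,26288,104949]] -> (1, 6, 3584). 152064 is exactly the vocabulary size of the model,
       so each token id selects its row of 'model.embed_tokens.weight': (152064, 3584) as its embedding vector. In (1, 6, 3584) the leading (1, 6) is always preserved: it is the actual input size, one batch of length 6. The trailing 3584 is the length of every token vector.
       Layer 2 (model.layers.0):
       hidden_states, self_attn_weights, present_key_value = self.self_attn(
           hidden_states=hidden_states,
           attention_mask=attention_mask,
           position_ids=position_ids,
           past_key_value=past_key_value,
           output_attentions=output_attentions,
           use_cache=use_cache,
           cache_position=cache_position,
       )
       Here hidden_states is inputs_embeds after the layer norm; its shape is still (1, 6, 3584)
       1. First, the three projection layers
       query_states = self.q_proj(hidden_states)
       key_states = self.k_proj(hidden_states)
       value_states = self.v_proj(hidden_states)
       In the parameter mapping table below, the q, k and v weights have shapes (3584, 3584), (512, 3584) and (512, 3584)
       so after the three projections the shapes are (1, 6, 3584), (1, 6, 512) and (1, 6, 512)
       The first is straightforward. The second becomes (1, 6, 512) because nn.Linear computes x @ W^T + b, so with W of shape (512, 3584) this is
       (1, 6, 3584) @ (3584, 512) = (1, 6, 512)
       2. Then, reshape into heads
       query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2) = (1, 28, 6, 128)
       key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2) = (1, 4, 6, 128)
       value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2) = (1, 4, 6, 128)
       Important: remember these numbers. 6 is the sequence length and 1 is the batch size. 28 is the number of query heads, the 28 heads mentioned earlier. Qwen2 uses GQA: every 7 query heads share one key/value head, so K and V each have 28/7 = 4 heads.
       3. The remaining sub-layers: the rest of the layer is more of the same (rotary position embedding, matrix multiplications, reshapes and so on) and is not expanded here. After the {'model.layers.0.post_attention_layernorm.weight': (3584,)} step the shape is (1, 6, 3584) again, and the remaining layers, 28 in total, repeat the same process.
       4. Final layer (lm_head): after the decoder layers, hidden_states has shape (1, 6, 3584). lm_head then computes logits = lm_head(hidden_states) = (1, 6, 152064) = (batch_size, seq_len, vocab_size)
       step4: decoding (auto-regressive generation)
       Decoding is autoregressive. It can be thought of as a while True loop that ends when the predicted token is the end token. It involves repeated forward passes like the one above.
       1. Take next_token_logits: next_token_logits = logits[:, -1, :]
       2. Logits processing: apply temperature, top-k, top-p and similar settings to reshape the distribution of next_token_logits
       3. softmax: turn next_token_logits into a probability distribution, probs = nn.functional.softmax(next_token_logits, dim=-1) = (1, 152064)
       4. Pick the next token index, next_tokens: the most probable token for greedy decoding, or a sample drawn from probs
       5. Extend input_ids: input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
       Autoregressive decoding appends each newly predicted token to input_ids and uses the result as the input for predicting the next token; this concatenation does exactly that.
       6. Check for termination: after each step, check whether the new token is eos_token_id. If it is, decoding ends; otherwise it continues.
    3. Vocabulary lookup: once decoding is done and all token ids are available, they are mapped back to text through the vocabulary, essentially a dictionary lookup. Here generate_ids has shape (1, 10), 10 tokens in total, with the content:
       你好，qwen大模型，请问你能做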
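
The odd-looking token strings from step1, such as 'ä½łå¥½', become readable once the byte-to-character display mapping is inverted. The sketch below rebuilds the GPT-2 style byte-to-unicode table from scratch; it assumes that the Qwen2 tokenizer uses this same table (the strings shown in step1 are consistent with it), and all function names are made up for illustration.

```python
def bytes_to_unicode() -> dict[int, str]:
    # Printable bytes are shown as themselves
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # The remaining bytes are shifted to code points 256, 257, ... in ascending order
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def text_to_token_string(text: str) -> str:
    # How a piece of text is displayed inside the vocabulary
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def token_strings_to_text(tokens: list[str]) -> str:
    # Invert the display mapping, then decode the bytes as UTF-8
    return bytes(CHAR_TO_BYTE[c] for c in "".join(tokens)).decode("utf-8")


tokens = [text_to_token_string(t) for t in ["你好", "，", "q", "wen", "大", "模型"]]
print(tokens)
print(token_strings_to_text(tokens))
```

Under this assumption, "looking up the dictionary" in item 3 is two steps: map ids to token strings, then map the characters of those strings back to bytes and decode them as UTF-8.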
Parameter mapping table:
[
{'model.embed_tokens.weight': (152064, 3584)},
 {'model.layers.0.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.0.self_attn.q_proj.bias': (3584,)},
 {'model.layers.0.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.0.self_attn.k_proj.bias': (512,)},
 {'model.layers.0.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.0.self_attn.v_proj.bias': (512,)},
 {'model.layers.0.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.0.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.0.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.0.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.0.input_layernorm.weight': (3584,)},
 {'model.layers.0.post_attention_layernorm.weight': (3584,)},
 {'model.layers.1.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.1.self_attn.q_proj.bias': (3584,)},
 {'model.layers.1.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.1.self_attn.k_proj.bias': (512,)},
 {'model.layers.1.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.1.self_attn.v_proj.bias': (512,)},
 {'model.layers.1.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.1.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.1.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.1.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.1.input_layernorm.weight': (3584,)},
 {'model.layers.1.post_attention_layernorm.weight': (3584,)},
 {'model.layers.2.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.2.self_attn.q_proj.bias': (3584,)},
 {'model.layers.2.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.2.self_attn.k_proj.bias': (512,)},
 {'model.layers.2.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.2.self_attn.v_proj.bias': (512,)},
 {'model.layers.2.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.2.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.2.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.2.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.2.input_layernorm.weight': (3584,)},
 {'model.layers.2.post_attention_layernorm.weight': (3584,)},
 {'model.layers.3.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.3.self_attn.q_proj.bias': (3584,)},
 {'model.layers.3.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.3.self_attn.k_proj.bias': (512,)},
 {'model.layers.3.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.3.self_attn.v_proj.bias': (512,)},
 {'model.layers.3.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.3.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.3.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.3.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.3.input_layernorm.weight': (3584,)},
 {'model.layers.3.post_attention_layernorm.weight': (3584,)},
 {'model.layers.4.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.4.self_attn.q_proj.bias': (3584,)},
 {'model.layers.4.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.4.self_attn.k_proj.bias': (512,)},
 {'model.layers.4.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.4.self_attn.v_proj.bias': (512,)},
 {'model.layers.4.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.4.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.4.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.4.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.4.input_layernorm.weight': (3584,)},
 {'model.layers.4.post_attention_layernorm.weight': (3584,)},
 {'model.layers.5.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.5.self_attn.q_proj.bias': (3584,)},
 {'model.layers.5.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.5.self_attn.k_proj.bias': (512,)},
 {'model.layers.5.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.5.self_attn.v_proj.bias': (512,)},
 {'model.layers.5.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.5.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.5.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.5.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.5.input_layernorm.weight': (3584,)},
 {'model.layers.5.post_attention_layernorm.weight': (3584,)},
 {'model.layers.6.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.6.self_attn.q_proj.bias': (3584,)},
 {'model.layers.6.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.6.self_attn.k_proj.bias': (512,)},
 {'model.layers.6.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.6.self_attn.v_proj.bias': (512,)},
 {'model.layers.6.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.6.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.6.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.6.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.6.input_layernorm.weight': (3584,)},
 {'model.layers.6.post_attention_layernorm.weight': (3584,)},
 {'model.layers.7.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.7.self_attn.q_proj.bias': (3584,)},
 {'model.layers.7.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.7.self_attn.k_proj.bias': (512,)},
 {'model.layers.7.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.7.self_attn.v_proj.bias': (512,)},
 {'model.layers.7.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.7.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.7.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.7.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.7.input_layernorm.weight': (3584,)},
 {'model.layers.7.post_attention_layernorm.weight': (3584,)},
 {'model.layers.8.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.8.self_attn.q_proj.bias': (3584,)},
 {'model.layers.8.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.8.self_attn.k_proj.bias': (512,)},
 {'model.layers.8.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.8.self_attn.v_proj.bias': (512,)},
 {'model.layers.8.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.8.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.8.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.8.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.8.input_layernorm.weight': (3584,)},
 {'model.layers.8.post_attention_layernorm.weight': (3584,)},
 {'model.layers.9.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.9.self_attn.q_proj.bias': (3584,)},
 {'model.layers.9.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.9.self_attn.k_proj.bias': (512,)},
 {'model.layers.9.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.9.self_attn.v_proj.bias': (512,)},
 {'model.layers.9.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.9.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.9.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.9.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.9.input_layernorm.weight': (3584,)},
 {'model.layers.9.post_attention_layernorm.weight': (3584,)},
 {'model.layers.10.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.10.self_attn.q_proj.bias': (3584,)},
 {'model.layers.10.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.10.self_attn.k_proj.bias': (512,)},
 {'model.layers.10.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.10.self_attn.v_proj.bias': (512,)},
 {'model.layers.10.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.10.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.10.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.10.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.10.input_layernorm.weight': (3584,)},
 {'model.layers.10.post_attention_layernorm.weight': (3584,)},
 {'model.layers.11.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.11.self_attn.q_proj.bias': (3584,)},
 {'model.layers.11.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.11.self_attn.k_proj.bias': (512,)},
 {'model.layers.11.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.11.self_attn.v_proj.bias': (512,)},
 {'model.layers.11.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.11.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.11.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.11.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.11.input_layernorm.weight': (3584,)},
 {'model.layers.11.post_attention_layernorm.weight': (3584,)},
 {'model.layers.12.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.12.self_attn.q_proj.bias': (3584,)},
 {'model.layers.12.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.12.self_attn.k_proj.bias': (512,)},
 {'model.layers.12.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.12.self_attn.v_proj.bias': (512,)},
 {'model.layers.12.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.12.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.12.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.12.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.12.input_layernorm.weight': (3584,)},
 {'model.layers.12.post_attention_layernorm.weight': (3584,)},
 {'model.layers.13.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.13.self_attn.q_proj.bias': (3584,)},
 {'model.layers.13.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.13.self_attn.k_proj.bias': (512,)},
 {'model.layers.13.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.13.self_attn.v_proj.bias': (512,)},
 {'model.layers.13.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.13.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.13.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.13.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.13.input_layernorm.weight': (3584,)},
 {'model.layers.13.post_attention_layernorm.weight': (3584,)},
 {'model.layers.14.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.14.self_attn.q_proj.bias': (3584,)},
 {'model.layers.14.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.14.self_attn.k_proj.bias': (512,)},
 {'model.layers.14.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.14.self_attn.v_proj.bias': (512,)},
 {'model.layers.14.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.14.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.14.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.14.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.14.input_layernorm.weight': (3584,)},
 {'model.layers.14.post_attention_layernorm.weight': (3584,)},
 {'model.layers.15.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.15.self_attn.q_proj.bias': (3584,)},
 {'model.layers.15.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.15.self_attn.k_proj.bias': (512,)},
 {'model.layers.15.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.15.self_attn.v_proj.bias': (512,)},
 {'model.layers.15.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.15.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.15.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.15.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.15.input_layernorm.weight': (3584,)},
 {'model.layers.15.post_attention_layernorm.weight': (3584,)},
 {'model.layers.16.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.16.self_attn.q_proj.bias': (3584,)},
 {'model.layers.16.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.16.self_attn.k_proj.bias': (512,)},
 {'model.layers.16.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.16.self_attn.v_proj.bias': (512,)},
 {'model.layers.16.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.16.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.16.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.16.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.16.input_layernorm.weight': (3584,)},
 {'model.layers.16.post_attention_layernorm.weight': (3584,)},
 {'model.layers.17.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.17.self_attn.q_proj.bias': (3584,)},
 {'model.layers.17.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.17.self_attn.k_proj.bias': (512,)},
 {'model.layers.17.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.17.self_attn.v_proj.bias': (512,)},
 {'model.layers.17.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.17.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.17.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.17.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.17.input_layernorm.weight': (3584,)},
 {'model.layers.17.post_attention_layernorm.weight': (3584,)},
 {'model.layers.18.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.18.self_attn.q_proj.bias': (3584,)},
 {'model.layers.18.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.18.self_attn.k_proj.bias': (512,)},
 {'model.layers.18.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.18.self_attn.v_proj.bias': (512,)},
 {'model.layers.18.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.18.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.18.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.18.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.18.input_layernorm.weight': (3584,)},
 {'model.layers.18.post_attention_layernorm.weight': (3584,)},
 {'model.layers.19.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.19.self_attn.q_proj.bias': (3584,)},
 {'model.layers.19.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.19.self_attn.k_proj.bias': (512,)},
 {'model.layers.19.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.19.self_attn.v_proj.bias': (512,)},
 {'model.layers.19.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.19.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.19.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.19.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.19.input_layernorm.weight': (3584,)},
 {'model.layers.19.post_attention_layernorm.weight': (3584,)},
 {'model.layers.20.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.20.self_attn.q_proj.bias': (3584,)},
 {'model.layers.20.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.20.self_attn.k_proj.bias': (512,)},
 {'model.layers.20.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.20.self_attn.v_proj.bias': (512,)},
 {'model.layers.20.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.20.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.20.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.20.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.20.input_layernorm.weight': (3584,)},
 {'model.layers.20.post_attention_layernorm.weight': (3584,)},
 {'model.layers.21.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.21.self_attn.q_proj.bias': (3584,)},
 {'model.layers.21.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.21.self_attn.k_proj.bias': (512,)},
 {'model.layers.21.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.21.self_attn.v_proj.bias': (512,)},
 {'model.layers.21.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.21.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.21.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.21.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.21.input_layernorm.weight': (3584,)},
 {'model.layers.21.post_attention_layernorm.weight': (3584,)},
 {'model.layers.22.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.22.self_attn.q_proj.bias': (3584,)},
 {'model.layers.22.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.22.self_attn.k_proj.bias': (512,)},
 {'model.layers.22.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.22.self_attn.v_proj.bias': (512,)},
 {'model.layers.22.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.22.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.22.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.22.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.22.input_layernorm.weight': (3584,)},
 {'model.layers.22.post_attention_layernorm.weight': (3584,)},
 {'model.layers.23.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.23.self_attn.q_proj.bias': (3584,)},
 {'model.layers.23.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.23.self_attn.k_proj.bias': (512,)},
 {'model.layers.23.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.23.self_attn.v_proj.bias': (512,)},
 {'model.layers.23.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.23.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.23.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.23.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.23.input_layernorm.weight': (3584,)},
 {'model.layers.23.post_attention_layernorm.weight': (3584,)},
 {'model.layers.24.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.24.self_attn.q_proj.bias': (3584,)},
 {'model.layers.24.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.24.self_attn.k_proj.bias': (512,)},
 {'model.layers.24.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.24.self_attn.v_proj.bias': (512,)},
 {'model.layers.24.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.24.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.24.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.24.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.24.input_layernorm.weight': (3584,)},
 {'model.layers.24.post_attention_layernorm.weight': (3584,)},
 {'model.layers.25.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.25.self_attn.q_proj.bias': (3584,)},
 {'model.layers.25.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.25.self_attn.k_proj.bias': (512,)},
 {'model.layers.25.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.25.self_attn.v_proj.bias': (512,)},
 {'model.layers.25.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.25.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.25.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.25.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.25.input_layernorm.weight': (3584,)},
 {'model.layers.25.post_attention_layernorm.weight': (3584,)},
 {'model.layers.26.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.26.self_attn.q_proj.bias': (3584,)},
 {'model.layers.26.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.26.self_attn.k_proj.bias': (512,)},
 {'model.layers.26.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.26.self_attn.v_proj.bias': (512,)},
 {'model.layers.26.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.26.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.26.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.26.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.26.input_layernorm.weight': (3584,)},
 {'model.layers.26.post_attention_layernorm.weight': (3584,)},
 {'model.layers.27.self_attn.q_proj.weight': (3584, 3584)},
 {'model.layers.27.self_attn.q_proj.bias': (3584,)},
 {'model.layers.27.self_attn.k_proj.weight': (512, 3584)},
 {'model.layers.27.self_attn.k_proj.bias': (512,)},
 {'model.layers.27.self_attn.v_proj.weight': (512, 3584)},
 {'model.layers.27.self_attn.v_proj.bias': (512,)},
 {'model.layers.27.self_attn.o_proj.weight': (3584, 3584)},
 {'model.layers.27.mlp.gate_proj.weight': (18944, 3584)},
 {'model.layers.27.mlp.up_proj.weight': (18944, 3584)},
 {'model.layers.27.mlp.down_proj.weight': (3584, 18944)},
 {'model.layers.27.input_layernorm.weight': (3584,)},
 {'model.layers.27.post_attention_layernorm.weight': (3584,)},
 {'model.norm.weight': (3584,)},
 {'lm_head.weight': (152064, 3584)}
 ]
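The shapes in the dump above can be turned into a quick sanity check. The sketch below counts the parameters of one decoder layer and derives the head counts implied by grouped query attention. The value `HEAD_DIM = 128` is an assumption that does not appear in the dump, and the helper names are illustrative only.

```python
from math import prod

# Shapes copied from one decoder layer in the dump (all layers share them).
LAYER_SHAPES = {
    "self_attn.q_proj.weight": (3584, 3584),
    "self_attn.q_proj.bias": (3584,),
    "self_attn.k_proj.weight": (512, 3584),
    "self_attn.k_proj.bias": (512,),
    "self_attn.v_proj.weight": (512, 3584),
    "self_attn.v_proj.bias": (512,),
    "self_attn.o_proj.weight": (3584, 3584),
    "mlp.gate_proj.weight": (18944, 3584),
    "mlp.up_proj.weight": (18944, 3584),
    "mlp.down_proj.weight": (3584, 18944),
    "input_layernorm.weight": (3584,),
    "post_attention_layernorm.weight": (3584,),
}

HEAD_DIM = 128  # assumption: per-head dimension, not stated in the dump


def count_params(shapes):
    """Sum the number of elements over all tensors."""
    return sum(prod(shape) for shape in shapes.values())


per_layer = count_params(LAYER_SHAPES)
num_q_heads = LAYER_SHAPES["self_attn.q_proj.weight"][0] // HEAD_DIM
num_kv_heads = LAYER_SHAPES["self_attn.k_proj.weight"][0] // HEAD_DIM
print(per_layer, num_q_heads, num_kv_heads)
```

Under that assumption, the k_proj and v_proj outputs are one seventh the size of the q_proj output, which is what grouped query attention looks like in the weight shapes.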

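Pre-training

Quality improvement

The filtering pipeline was refined with additional heuristic and model-based methods, including the use of Qwen models to filter out low-quality data.

Data expansion

Compared with Qwen1.5, a larger volume of high-quality code, math, and multilingual data was collected, strengthening the model in each of these areas. The new dataset covers about 30 languages, such as English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, and Vietnamese.

Distribution improvement

So that the model learns a knowledge distribution close to that of human learning, experiments were run on small models to optimize the mix of data from different sources and domains.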

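Long-context processing

In the final stage of pre-training, Qwen2's context length was increased from 4,096 tokens to 32,768 tokens. Its long-context capability is achieved with the YaRN mechanism and Dual Chunk Attention (DCA).

What makes long-context processing difficult?

Training data: readily available training samples usually contain at most about 2k tokens, roughly 4k-5k Chinese characters, and a single sample far longer than that is hard to come by. Such data is therefore missing at training time, and quality at inference time is hard to guarantee. If training only ever sees 2k tokens and inference meets 8k, the two distributions differ and training and inference become inconsistent.
Extrapolation: training may only cover 2k tokens, yet the model must still answer properly at 8k. This is hard to solve, and most long-context solutions focus on this part (see the figure below).
Cost: even when long text is available, training on very long sequences is expensive.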

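Basic formulas
- Input sequence
- $S_N = \{w_i\}_{i=1}^N$, where $N$ is the input sequence length. For example, the earlier input "你好，qwen大模型" is 6 tokens, $w_i$ is the $i$-th token, and $w_3=q=80$.
- Embedding
- $E_N = \{x_i\}_{i=1}^N$, where $x_i$ is the $d$-dimensional embedding of the $i$-th token.
- Computing Q, K, V
- $q_m = f_q(x_m,m),\ k_n = f_k(x_n,n),\ v_n=f_v(x_n,n)$
- $q_m$ is the query vector obtained from the embedding $x_m$ of the $m$-th token after integrating the position $m$. Likewise, $k_n$ and $v_n$ are the key and value vectors obtained from the embedding $x_n$ of the $n$-th token after integrating the position $n$. Designing a positional encoding amounts to choosing the form of the functions $f_{\{q,k,v\}}$.
- Result
- $a_{m,n}=\frac{\exp\left(\frac{q_m^Tk_n}{\sqrt{d}}\right)}{\sum_{j=1}^{N}\exp\left(\frac{q_m^Tk_j}{\sqrt{d}}\right)},\quad o_m=\sum\limits_{n=1}^{N}a_{m,n}v_n$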
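The attention formula above can be sketched in plain Python. This is a minimal illustration using nested lists instead of tensors. It follows the formula as written, so every query attends to all $N$ keys with no causal mask, and the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """o_m = sum_n a_{m,n} v_n with a_{m,n} = softmax_n(q_m . k_n / sqrt(d))."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Two identical keys receive equal weight, so the output is the mean of the values.
print(attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [2.0, 0.0]]))
```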
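Absolute positional encoding

$f_{\{q,k,v\}}(x_i,i)=W_{\{q,k,v\}}(x_i+p_i)$

$p_{i,2t} = \sin\left(\frac{i}{10000^{\frac{2t}{d}}}\right)$

$p_{i,2t+1} = \cos\left(\frac{i}{10000^{\frac{2t}{d}}}\right)$

$2t$ and $2t+1$ index the even and odd components of the position vector. This scheme is used mainly because (1) the values stay in a narrow range, and (2) it is periodic, so encodings at different positions are linearly related.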
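A minimal sketch of the sinusoidal formulas above, assuming an even dimension `d`. The function name is illustrative.

```python
import math


def sinusoidal_position(i, d, base=10000.0):
    """Return the d-dimensional position vector p_i (d is assumed even)."""
    p = [0.0] * d
    for t in range(d // 2):
        angle = i / base ** (2 * t / d)
        p[2 * t] = math.sin(angle)      # even components use sin
        p[2 * t + 1] = math.cos(angle)  # odd components use cos
    return p


print(sinusoidal_position(0, 4))
print(sinusoidal_position(1, 4))
```

Every component stays within $[-1, 1]$ regardless of the position, which is the "narrow value range" property mentioned above.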

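RoPE

RoPE is a positional encoding that is applied like an absolute positional encoding yet introduces relative position information. The formula below says that the inner product of the two encoded vectors can be expressed by a function $g$ whose arguments are $x_m$, $x_n$, and their relative offset $m-n$.

$\langle f_q(x_m,m), f_k(x_n,n)\rangle=g(x_m,x_n,m-n)$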

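Review of vectors and complex numbers
- Vector representation: $\vec{z}=(a,b) = a+bi$. A 2-D vector can be written either as a coordinate pair or as a complex number, where $a$ is the real part and $b$ is the imaginary part.
- Polar form: $\vec{r} = (x,y) = x+yi=r\cos\theta +ir\sin\theta$. For example, $(2,2) = 2+2i=2\sqrt{2}\cos{\frac{\pi}{4}}+2\sqrt{2}\,i\sin{\frac{\pi}{4}}$.
- The conversion formulas between polar and Cartesian coordinates are:
  1. Cartesian to polar:
    1. Radius $r$: $r=\sqrt{x^2+y^2}$
    2. Angle $\theta$: $\theta=\arctan\left(\frac{y}{x}\right)$
  2. Polar to Cartesian:
    1. $x=r\cos\theta$
    2. $y=r\sin\theta$
- Compact notation
- By Euler's formula, $e^{i\theta} = \cos\theta + i\sin\theta$, so the polar form above simplifies to $\vec{r} = re^{i\theta}$.
- Operations
  - Complex arithmetic
    - $(a + i b) \cdot (a - i b) = a^{2} - i a b + i a b - i^{2} b^{2} = a^{2} + b^{2}$
    - $(a_1+ib_1) + (a_2+ib_2) = (a_1+a_2)+(b_1+b_2)i$
    - Complex arithmetic obeys the distributive and associative laws and differs little from ordinary arithmetic.
  - Vector arithmetic
    - Test case:
    - $\vec{A}=(a_1,b_1)=a_1\vec{i}+b_1\vec{j},\ \vec{B}=(a_2,b_2)=a_2\vec{i}+b_2\vec{j}$
    - Note: here $\vec{i}$ and $\vec{j}$ are unit vectors with an angle of 90 degrees between them.
    - Addition:
    - $\vec{A}+\vec{B}=(a_1+a_2,b_1+b_2)=(a_1+a_2)\vec{i}+(b_1+b_2)\vec{j}$
    - Multiplication:
      1. Scalar (dot) product: $\vec{A}\cdot\vec{B}=|A||B|\cos\theta=a_1a_2+b_1b_2$. Note: $\vec{i}\cdot\vec{j}=0$.
      2. Vector (cross) product: $|\vec{A}\times\vec{B}|=|A||B|\sin\theta$. It is omitted here because it is rarely needed.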
RoPE derivation

$\langle f_q(x_m,m), f_k(x_n,n)\rangle=g(x_m,x_n,m-n)$

Definitions (only 2-D vectors are considered here; higher dimensions are left to the reader):

$f_q(x_m, m) =(W_q x_m) (\cos(m\theta) + i\sin(m\theta))= (W_q x_m) e^{im\theta}$

$f_k(x_n, n) = (W_k x_n)(\cos(n\theta) + i\sin(n\theta)) =(W_k x_n)e^{in\theta}$

$g(x_{m},x_{n},m-n) = \operatorname{Re}[(W_{q} x_{m})(W_{k} x_{n})^{*} e^{i(m-n)\theta}]$, where Re takes the real part and discards the imaginary part.

Derivation:

$f_q(x_m, m) =(W_q x_m) (\cos(m\theta) + i\sin(m\theta))= (W_q x_m) e^{im\theta}=\begin{pmatrix} W_q^{11} & W_q^{12} \\ W_q^{21} & W_q^{22} \end{pmatrix}\begin{pmatrix}x_m^1 \\ x_m^2\end{pmatrix}e^{im\theta}$

Since $\begin{pmatrix} W_q^{11} & W_q^{12} \\ W_q^{21} & W_q^{22} \end{pmatrix}\begin{pmatrix}x_m^1 \\ x_m^2\end{pmatrix}$ is a $2\times2$ matrix multiplied by a $2\times1$ matrix, the result is necessarily a $2\times1$ vector, so we define:

$q_m=\begin{pmatrix}q_m^1\\q_m^2\end{pmatrix}=\begin{pmatrix} W_q^{11} & W_q^{12} \\ W_q^{21} & W_q^{22} \end{pmatrix}\begin{pmatrix}x_m^1 \\ x_m^2\end{pmatrix} =q_m^1+q_m^2i$, using the fact stated earlier that a vector can be written as a complex number.

Then:

$f_q(x_m, m) = (q_m^1+q_m^2i)e^{im\theta}=(q_m^1+q_m^2i)(\cos(m\theta)+i\sin(m\theta))$

$=q_m^1\cos(m\theta)+q_m^1i\sin(m\theta)+q_m^2i\cos(m\theta)+q_m^2i^2\sin(m\theta)$

$=[q_m^1\cos(m\theta)-q_m^2\sin(m\theta)]+[q_m^1\sin(m\theta)+q_m^2\cos(m\theta)]i$

$=\begin{pmatrix}\cos(m\theta)&-\sin(m\theta)\\\sin(m\theta)&\cos(m\theta)\end{pmatrix}\begin{pmatrix}q_m^1\\q_m^2\end{pmatrix}$

Therefore:

$f_q(x_m, m)=\begin{pmatrix}\cos(m\theta)&-\sin(m\theta)\\\sin(m\theta)&\cos(m\theta)\end{pmatrix}\begin{pmatrix}q_m^1\\q_m^2\end{pmatrix}=\begin{pmatrix}\cos(m\theta)&-\sin(m\theta)\\\sin(m\theta)&\cos(m\theta)\end{pmatrix}\begin{pmatrix} W_q^{11} & W_q^{12} \\ W_q^{21} & W_q^{22} \end{pmatrix}\begin{pmatrix}x_m^1 \\ x_m^2\end{pmatrix}$

This completes the derivation for $f_q$. The function $g$ and the remaining parts are left to the reader.
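The 2-D derivation can be checked numerically. The sketch below is an illustration with made-up vectors and hypothetical helper names. It compares the complex form $q\,e^{im\theta}$ with the rotation-matrix form, and shows that $\operatorname{Re}[f_q f_k^{*}]$ depends only on the offset $m-n$.

```python
import cmath
import math


def rotate_complex(q, m, theta):
    """Treat the 2-D vector q as q1 + q2*i and multiply by e^{i m theta}."""
    return complex(q[0], q[1]) * cmath.exp(1j * m * theta)


def rotate_matrix(q, m, theta):
    """Multiply q by the 2x2 rotation matrix for the angle m*theta."""
    c, s = math.cos(m * theta), math.sin(m * theta)
    return (c * q[0] - s * q[1], s * q[0] + c * q[1])


def score(q, k, m, n, theta):
    """Re[f_q(m) * conj(f_k(n))], the inner product of the rotated vectors."""
    return (rotate_complex(q, m, theta) * rotate_complex(k, n, theta).conjugate()).real


q, k, theta = (0.3, -1.2), (0.7, 0.5), 0.1
print(rotate_complex(q, 5, theta), rotate_matrix(q, 5, theta))
# Same offset m - n = 3 at two different absolute positions:
print(score(q, k, 5, 2, theta), score(q, k, 13, 10, theta))
```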

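Key point ❗️❗️

Multiplying by a matrix can be viewed as applying a transformation. The matrix $\begin{pmatrix}\cos\theta&-\sin\theta\\\sin\theta&\cos\theta\end{pmatrix}$ in the formula has the following properties:

Length is preserved: multiplying any vector $v$ by this matrix leaves its length unchanged.
Angles are preserved: a rotation matrix preserves not only vector lengths but also the relative angles between vectors. If two vectors are orthogonal before the rotation (the angle between them is 90 degrees), they are still orthogonal afterwards.
Determinant is 1: the determinant of this rotation matrix is $\cos^2\theta+\sin^2\theta=1$. This is the defining property of a special orthogonal matrix, which means the matrix is invertible and its inverse is its transpose.

An example:

Let $\theta=45^\circ$ and take the vectors $\begin{pmatrix}2\\2\end{pmatrix}$ and $\begin{pmatrix}0\\1\end{pmatrix}$. Then $\frac{\sqrt{2}}{2}\begin{pmatrix}1&-1\\1&1\end{pmatrix}\begin{pmatrix}2\\2\end{pmatrix}=\begin{pmatrix}0\\2\sqrt2\end{pmatrix}$ and $\frac{\sqrt{2}}{2}\begin{pmatrix}1&-1\\1&1\end{pmatrix}\begin{pmatrix}0\\1\end{pmatrix}=\frac{\sqrt{2}}{2}\begin{pmatrix}-1\\1\end{pmatrix}$.

Length: $\sqrt{2^2+2^2}=\sqrt{8}=2\sqrt{2}=\sqrt{0^2+(2\sqrt{2})^2}$, and $\sqrt{0^2+1^2}=1=\sqrt{\frac{2}{4}+\frac{2}{4}}$.

Angle between the two vectors, before and after the rotation: $\cos\varphi=\frac{2\cdot0+2\cdot1}{2\sqrt{2}\cdot1}=\frac{\sqrt{2}}{2}=\frac{0\cdot(-\frac{\sqrt{2}}{2})+2\sqrt{2}\cdot\frac{\sqrt{2}}{2}}{2\sqrt{2}\cdot1}$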
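The worked example can be reproduced in a few lines. This sketch uses hypothetical helper names. It rotates both vectors by 45 degrees and compares their lengths and the cosine of the angle between them before and after.

```python
import math


def rotation(theta):
    """2x2 rotation matrix for an angle given in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def apply(R, v):
    """Matrix-vector product for a 2x2 matrix and a 2-D vector."""
    return [R[0][0] * v[0] + R[0][1] * v[1], R[1][0] * v[0] + R[1][1] * v[1]]


def norm(v):
    return math.hypot(v[0], v[1])


def cos_angle(u, v):
    return (u[0] * v[0] + u[1] * v[1]) / (norm(u) * norm(v))


R = rotation(math.pi / 4)  # 45 degrees
a, b = [2.0, 2.0], [0.0, 1.0]
ra, rb = apply(R, a), apply(R, b)
print(ra, rb)
print(norm(a), norm(ra), cos_angle(a, b), cos_angle(ra, rb))
```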

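Extending to more dimensions:

The discussion above assumes an embedding dimension of 2. For $d>2$, the embedding elements are grouped in pairs and each pair is rotated in the same way, using its own angle $\theta_i$. For any vector $q$ at position $m$, the rotation angle (in radians) of its $i$-th pair is:

$m\theta_i=m\cdot base^{-\frac{2i}{d}},\ i=0,1,\dots,\frac{d}{2}-1,\ base=10000$. Equivalently, with $j=i+1$: $\theta_j=10000^{\frac{-2(j-1)}{d}},\ j=1,2,\dots,\frac{d}{2}$.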
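A minimal sketch of the multi-dimensional case, assuming that adjacent elements $(2i, 2i+1)$ form the $i$-th pair. Real implementations may pair elements differently, for example element $i$ with element $i + d/2$, so the pairing and the function names here are illustrative assumptions.

```python
import math


def rope_angles(d, base=10000.0):
    """theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2 * i / d) for i in range(d // 2)]


def apply_rope(x, m, base=10000.0):
    """Rotate each adjacent pair (x[2i], x[2i+1]) by the angle m * theta_i."""
    out = list(x)
    for i, theta in enumerate(rope_angles(len(x), base)):
        c, s = math.cos(m * theta), math.sin(m * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# The score depends only on the offset (4 in both cases), not on absolute positions.
print(dot(apply_rope(q, 7), apply_rope(k, 3)))
print(dot(apply_rope(q, 104), apply_rope(k, 100)))
```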

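Achieving extrapolation:

Define the rotation matrix:
1. The rotation matrix $R$ is a $d\times{d}$ orthogonal matrix that rotates vectors in $d$-dimensional space. In 2-D, for example, it can be written as $R(\theta)=\begin{pmatrix}\cos\theta&-\sin\theta\\\sin\theta&\cos\theta\end{pmatrix}$
2. Here $\theta$ is the rotation angle.
Position encodings from pre-training:
1. Suppose the model learns the encodings of $n$ positions during training, denoted $P_1,P_2,P_3,P_4,...,P_n$.
Generating new position encodings:
1. To generate encodings beyond position $n$, repeatedly apply the rotation matrix $R$ to the last known encoding $P_n$.
2. The steps are:
  - $Q_{n+1}=R\cdot P_n$
  - $Q_{n+2}=R\cdot Q_{n+1}=R(RP_n)=R^2P_n$
  - $Q_{n+3}=R\cdot Q_{n+2}=R(R^2P_n)=R^3P_n$
  - Continuing in this way yields position encodings of arbitrary length, $Q_{n+1},Q_{n+2},\dots,Q_m$ with $m>n$.

Summary:

The core of RoPE is the matrix $\begin{pmatrix}\cos\theta&-\sin\theta\\\sin\theta&\cos\theta\end{pmatrix}$. It is a rotation matrix, hence the name rotary position embedding.
By the formulas above, any vector can be encoded by multiplying it by the rotation matrix.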
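Position Interpolation (PI)

PI is an extension of RoPE. Suppose a model has a default context length $L$. For a new context length $L'$ ($L'>L$), the encoding becomes $f'(x,m)=f(x,\frac{L}{L'}m)$, followed by fine-tuning. The figure below shows the case $L'=2L$.

Summary:
1. Position interpolation makes very long context windows easy to enable.
2. Position interpolation yields strong models that can make effective use of the extended context window.
3. For tasks within the original context window size, position interpolation preserves model quality relatively well.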
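Position interpolation only changes the position index that is fed into the encoding. A minimal sketch with a hypothetical helper name:

```python
def interpolate_position(m, train_len, target_len):
    """Map position m in a window of target_len back into [0, train_len)."""
    if target_len <= train_len:
        return float(m)  # no extension needed, keep the original index
    return m * train_len / target_len


# With L = 2048 and L' = 4096, every position is halved before applying RoPE.
print([interpolate_position(m, 2048, 4096) for m in (0, 1, 2047, 4095)])
```

The largest new position, 4095, maps to 2047.5, so every index stays inside the range seen during pre-training. The price is that neighboring tokens are now only half a position step apart.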

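NTK-aware interpolation

PI, described above, rescales every component in the same way. RoPE's dimension groups, however, rotate at different frequencies. As a rough intuition, a group whose rotation period is below 2k tokens is high-frequency, while a group whose period exceeds 2k or even 32k tokens rarely completes a cycle during training and is low-frequency. High-frequency components sit in the leading groups and low-frequency components in the trailing groups, and the two need to be handled differently to keep good extrapolation ability.

The core idea of NTK-aware interpolation is to preserve high-frequency information: the rotation speed of high-frequency components is reduced only slightly, while that of low-frequency components is reduced more. In effect, the high-frequency part extrapolates and the low-frequency part interpolates.

Summary:
1. Leading groups see many complete rotation cycles during training, so their position information is well trained and they extrapolate well.
2. Trailing groups see no complete rotation cycle during training, or only very few, so they are under-trained, extrapolate poorly, and need position interpolation.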
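One common way to realize this idea is to enlarge the RoPE base instead of rescaling positions. The sketch below uses the formula $base' = base \cdot s^{d/(d-2)}$ with $s = L'/L$. This formula is not given in the text above; it is the form usually quoted for NTK-aware scaling and should be read as an assumption, and the function names are illustrative.

```python
def ntk_aware_base(base, scale, d):
    """Enlarged base so that the lowest frequency is divided by `scale`."""
    return base * scale ** (d / (d - 2))


def angles(d, base):
    """theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2 * i / d) for i in range(d // 2)]


d, scale = 128, 8.0
old = angles(d, 10000.0)
new = angles(d, ntk_aware_base(10000.0, scale, d))
# First (highest-frequency) group is unchanged; last group is slowed by `scale`.
print(old[0], new[0])
print(old[-1] / new[-1])
```

Groups in between are slowed down by a factor that grows smoothly from 1 to `scale`, which matches the "small reduction for high frequencies, large reduction for low frequencies" description.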

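NTK-by-parts interpolation

NTK-by-parts interpolation refines NTK-aware interpolation. Its core idea is to leave the high-frequency part unchanged and shrink only the rotation angle of the low-frequency part. In other words, the rotation angles of the leading groups are kept and only those of the trailing groups are reduced.

Derivation:

Given: $m\theta_i=m\cdot base^{-\frac{2i}{d}}$

So the rotation period of the $i$-th group is:

$\lambda_i = \frac{2\pi}{\theta_i}=\frac{2\pi}{base^{-\frac{2i}{d}}}=2\pi\cdot base^{\frac{2i}{d}}$

Explanation: as shown earlier, $m\theta_i$ is the rotation angle of the $i$-th group at position $m$. An angle only matters modulo $2\pi$ (a full turn of 360 degrees is the same as no rotation), so $\frac{2\pi}{\theta_i}$ is the number of steps of size $\theta_i$ needed to complete one full turn. For example, if $\theta_i=\frac{\pi}{2}$, then $\frac{2\pi}{\theta_i}=\frac{2\pi}{\frac{\pi}{2}}=4$. The period is 4: four steps close the loop, and a whole rotation cycle has been seen.

Put differently, the wavelength $\lambda$ is the number of tokens over which the RoPE embedding completes a full $2\pi$ rotation. With $\lambda=4$, one full cycle spans 4 tokens, each token advancing the rotation by one step.

Interpolation strategy:
- If the wavelength $\lambda$ is much smaller than the context size $L$, no interpolation is applied;
- If the wavelength $\lambda$ is equal to or larger than the context size $L$, only interpolation is applied and any extrapolation is avoided (unlike the earlier "NTK-aware" method);
- Dimensions in between get a bit of both, similar to "NTK-aware" interpolation.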
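The three cases above can be sketched with a ramp over the number of full rotations $r = L/\lambda_i$ that a group completes within the training context. The thresholds `alpha=1` and `beta=32` and the linear blend are assumptions for illustration; the text above does not specify them.

```python
import math


def ramp(r, alpha=1.0, beta=32.0):
    """0 -> interpolate fully, 1 -> leave unchanged, linear in between."""
    if r < alpha:
        return 0.0
    if r > beta:
        return 1.0
    return (r - alpha) / (beta - alpha)


def ntk_by_parts_angles(d, train_len, scale, base=10000.0, alpha=1.0, beta=32.0):
    """Blend interpolated (theta / scale) and original theta per group."""
    new_angles = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        wavelength = 2 * math.pi / theta
        r = train_len / wavelength  # full rotations seen within train_len
        gamma = ramp(r, alpha, beta)
        new_angles.append((1 - gamma) * theta / scale + gamma * theta)
    return new_angles


d, L, s = 128, 4096, 8.0
old = [10000.0 ** (-2 * i / d) for i in range(d // 2)]
new = ntk_by_parts_angles(d, L, s)
print(new[0] / old[0], new[-1] / old[-1])
```

The first group has a wavelength of about 6 tokens, far below $L$, so it is left untouched. The last group has a wavelength well above $L$, so it is purely interpolated and its angle is divided by the scale factor.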

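YaRN

Understanding PI and NTK-by-parts interpolation is a prerequisite for understanding the YaRN strategy.

Interpolation strategy:

Attention scaling

$\mathrm{softmax}\left(\frac{q_m^Tk_n}{t\sqrt{|D|}}\right),\quad \sqrt{\frac{1}{t}}=0.1\ln\left(\frac{L'}{L}\right)+1$, where $t$ is the temperature and $|D|$ is the dimension of the query and key vectors (the $d$ in the attention formula), not the vocabulary size.

NTK-by-parts

Explanation:

Recall the attention formula: $a_{m,n}=\frac{\exp\left(\frac{q_m^Tk_n}{\sqrt{d}}\right)}{\sum_{j=1}^{N}\exp\left(\frac{q_m^Tk_j}{\sqrt{d}}\right)},\quad o_m=\sum\limits_{n=1}^{N}a_{m,n}v_n$
The NTK-by-parts strategy is still used for the interpolation.

A "length scaling" trick is used: by simply scaling the complex RoPE embeddings by the same amount, both $q_m$ and $k_n$ are scaled by the constant factor $\sqrt{\frac{1}{t}}$. In this way YaRN changes the attention mechanism without modifying the attention code. The goal of the change is to lower the model's perplexity, a proxy for how well the model understands the text.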
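The scaling factor above is easy to compute. The sketch below uses hypothetical function names. It returns the factor $\sqrt{1/t}$ applied to both $q_m$ and $k_n$, and the implied temperature $t$.

```python
import math


def yarn_attention_factor(train_len, target_len):
    """sqrt(1/t) = 0.1 * ln(L'/L) + 1, or 1.0 when the context is not extended."""
    scale = target_len / train_len
    if scale <= 1.0:
        return 1.0
    return 0.1 * math.log(scale) + 1.0


def yarn_temperature(train_len, target_len):
    """Scaling q and k by the factor multiplies the logits by factor^2 = 1/t."""
    return 1.0 / yarn_attention_factor(train_len, target_len) ** 2


# Extending 4,096 -> 32,768 tokens, the lengths quoted for Qwen2 above.
print(yarn_attention_factor(4096, 32768), yarn_temperature(4096, 32768))
```

Because the factor is folded into the rotary embeddings of both the query and the key, the logits are divided by $t$ without touching the softmax code itself.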

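DCA

Principle: by decomposing the attention computation over a long sequence into chunk-based modules, DCA captures the relative position information of tokens within the same chunk (intra-chunk) and across different chunks (inter-chunk). It integrates seamlessly with Flash Attention and thereby enables context-window extrapolation for large models.
- Design
  - Intra-chunk attention, designed for tokens within the same chunk
  - Inter-chunk attention, for tokens in different chunks
  - Successive-chunk attention, for tokens in adjacent chunks
- Unlike plain RoPE, DCA builds the relative-position matrix $M$ used in attention from the differences between the position indices of Q and K.
  - Intra-chunk
  - Definition:
  - Intra-chunk attention computes inner products between $q$ and $k$ from the same chunk.
  - Formula:
  - Suppose there are 12 position indices in total, split into two chunks, Chunk1 and Chunk2, with pre-training window $c=10$ and chunk size $s=6$. Then:
  - $P_\mathbf{k}=[\underbrace{0,1,2,3,4,5}_{\text{Chunk1}},\underbrace{0,1,2,3,4,5}_{\text{Chunk2}}]$
  - $P_\mathbf{q}^{Intra}=P_\mathbf{k}=[0,1,\dots,l-1] \bmod s$
  - $\mathbf{q}_i^\top\mathbf{k}_j=f(\mathbf{q},P_\mathbf{q}^{Intra}[i])^\top f(\mathbf{k},P_\mathbf{k}[j])$
  - Inter-chunk
  - Definition:
  - Inter-chunk attention is introduced to aggregate position information from other chunks. It lets the model compute attention weights between different chunks correctly while keeping the order of information in the sequence. In this way the model can handle inputs longer than any sequence seen during pre-training, without additional training.
  - Formula:
  - $P_\mathbf{q}^{Inter} =[\underbrace{c-1,c-1,\dots,c-1}_{l\text{ elements}}]$
  - This formula defines the query position indices $P_\mathbf{q}^{Inter}$ used in inter-chunk attention. Every query is assigned the fixed value $c-1$, the largest position index seen during pre-training, which guarantees that the query index is larger than the index of every key in the preceding chunks. Here $l$ is the number of elements, i.e. the length of the input sequence.
  - $M[i][j]=P_\mathbf{q}^{Inter}[i]-P_\mathbf{k}[j]=c-1-P_\mathbf{k}[j]\geq c-s$
  - See panel (b), Inter-chunk Attention, in the figure: with pre-training window $c=10$ and $s=6$, computing the matrix gives exactly the values shown.
  - Successive-Chunk
  - Definition:
  - Successive-chunk attention can be seen as a special case of inter-chunk attention. It is introduced to preserve the locality of large language models (LLMs), meaning that LLMs rely heavily on nearby tokens when predicting the next token. Using inter-chunk attention alone may no longer keep the exact relative positions between neighboring tokens, which degrades performance.
  - Formula:
  - $P_\mathbf{q}^{Succ}=[\underbrace{\underbrace{s,s+1,\dots,s+w-1}_{w\text{ elements}},c-1,\dots,c-1}_{\text{the same for all chunks}}]$
    - $s$ is the chunk size.
    - $w$ is the local window size, the number of neighboring tokens that need special attention.
    - $c$ is the maximum context length during pre-training.
    - The first $w$ position indices run from $s$ to $s+w-1$ and preserve the locality of neighboring tokens.
    - The remaining position indices are all set to $c-1$, so that these query indices are larger than the indices of all keys in the preceding chunks.
  - $M[i][j]=\begin{cases} P_\mathbf{q}^{Intra}[i]-P_\mathbf{k}[j], & \text{if } \lfloor i/s\rfloor-\lfloor j/s\rfloor=0 \\ P_\mathbf{q}^{Succ}[i]-P_\mathbf{k}[j], & \text{if } \lfloor i/s\rfloor-\lfloor j/s\rfloor=1 \\ P_\mathbf{q}^{Inter}[i]-P_\mathbf{k}[j], & \text{if } \lfloor i/s\rfloor-\lfloor j/s\rfloor>1 \end{cases}$
  - $\mathbf{q}_i^\top\mathbf{k}_j=\begin{cases} f(\mathbf{q},P_\mathbf{q}^{Intra}[i])^\top f(\mathbf{k},P_\mathbf{k}[j]), & \text{if } \lfloor i/s\rfloor-\lfloor j/s\rfloor=0 \\ f(\mathbf{q},P_\mathbf{q}^{Succ}[i])^\top f(\mathbf{k},P_\mathbf{k}[j]), & \text{if } \lfloor i/s\rfloor-\lfloor j/s\rfloor=1 \\ f(\mathbf{q},P_\mathbf{q}^{Inter}[i])^\top f(\mathbf{k},P_\mathbf{k}[j]), & \text{if } \lfloor i/s\rfloor-\lfloor j/s\rfloor>1 \end{cases}$
  - Example:
  - Initialization: $s=6,\ w=4,\ c=10$
  - $P_\mathbf{q}^{Inter} =[\underbrace{c-1,c-1,\dots,c-1}_{l\text{ elements}}]=[\underbrace{9,9,9,9,9,9}_{\text{chunk0}},\underbrace{9,9,9,9,9,9}_{\text{chunk1}}]$
  - $P_\mathbf{q}^{Succ}=[6,7,8,6+4-1,10-1,10-1,6,7,8,6+4-1,10-1,10-1]=[\underbrace{6,7,8,9,9,9}_{\text{chunk0}},\underbrace{6,7,8,9,9,9}_{\text{chunk1}}]$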
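The three cases can be combined into one relative-position matrix. The sketch below follows the formulas above with $s=6$, $w=4$, $c=10$, and uses $l=18$ so that all three cases occur. Filling only the causal entries $j \le i$ is an assumption of this illustration, and the function name is hypothetical.

```python
def dca_relative_positions(l, s, w, c):
    """Return (P_k, P_q_succ, P_q_inter, M) for a sequence of length l."""
    p_k = [j % s for j in range(l)]
    p_q_intra = p_k  # intra-chunk queries share the key indices
    # One chunk of successive-chunk indices: s, ..., s+w-1, then c-1 repeated.
    succ_chunk = [s + t if t < w else c - 1 for t in range(s)]
    p_q_succ = [succ_chunk[i % s] for i in range(l)]
    p_q_inter = [c - 1] * l
    M = [[None] * l for _ in range(l)]
    for i in range(l):
        for j in range(i + 1):  # assumption: causal, keys at or before the query
            gap = i // s - j // s
            if gap == 0:
                q_pos = p_q_intra[i]
            elif gap == 1:
                q_pos = p_q_succ[i]
            else:
                q_pos = p_q_inter[i]
            M[i][j] = q_pos - p_k[j]
    return p_k, p_q_succ, p_q_inter, M


p_k, p_q_succ, p_q_inter, M = dca_relative_positions(l=18, s=6, w=4, c=10)
print(p_q_succ[:6])
print(M[6][5], M[12][0])
```

The entry `M[6][5]` is 1, so two neighboring tokens on either side of a chunk boundary keep their true distance, while no entry exceeds $c-1=9$, the largest relative position seen during pre-training.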

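Experiments

The experiments were run on an NVIDIA A100 80G GPU, using the Llama2 7B model to measure the effect of DCA. The left plot shows GPU memory usage and the right plot shows inference time.

Summary:

At shorter input lengths all methods work, but as the input grows, vanilla self-attention and Flash Attention start to hit limits on inputs beyond 12k to 16k tokens.
At every tested input length, DCA performs better than Flash Attention, with no significant increase in inference time or GPU memory usage and no OOM.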

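Post-training

Post-training data: after pre-training, the model goes through post-training to further improve its performance on specific tasks. The data used for post-training consists of demonstration data and preference data.
Demonstration data: this form of data is the basis of supervised learning. Each instruction $x_i$ has one explicit satisfactory response $y_i$, which can be used to train the model directly so that it imitates the demonstrations.
Preference data: in some cases the model must not only give a correct response but also give a better one. Preference data provides a preferred response $y_i^+$ and a less preferred response $y_i^-$ to help the model learn to tell responses of different quality apart.
Data construction process:
- Collaborative data annotation: this step involves humans, whose annotations ensure the quality and accuracy of the data. It typically means extracting instructions and matching responses from large amounts of text, after which human annotators decide which responses are satisfactory, which are preferred, and which are not.
- Automated data synthesis: to enlarge the dataset, automated methods generate additional data. This may include using language models to generate new instructions and responses, or transforming existing data into new data points. These automated strategies can markedly increase the diversity and volume of the data and thus improve the model's generalization.

Demonstration data (D): a set of instructions with their satisfactory answers, written $D=\{(x_i,y_i)\}$, where $x_i$ is the instruction and $y_i$ is the satisfactory response.
Preference data (P): besides the instruction, it contains two answers to the same instruction, one preferred over the other, written $P=\{(x_i,y_i^+,y_i^-)\}$, where $y_i^+$ is the preferred response and $y_i^-$ is the less preferred one.

The demonstration data D is used for supervised fine-tuning (SFT), and the preference data P is used for reinforcement learning from human feedback (RLHF).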

References

https://guillaume-be.github.io/2021-09-16/byte_pair_encoding#rust-tokenizers
https://arxiv.org/pdf/2305.13245
https://arxiv.org/pdf/2309.00071
https://zhuanlan.zhihu.com/p/670118670
https://www.st-andrews.ac.uk/~il4/2ComplexNumbersVectors.pdf
https://zhuanlan.zhihu.com/p/647109286
https://zhuanlan.zhihu.com/p/683863159
https://arxiv.org/pdf/2406.14673
https://arxiv.org/pdf/2306.15595
https://arxiv.org/html/2402.17463v2#S3.E2
https://arxiv.org/pdf/1612.08083
https://arxiv.org/pdf/2002.05202
https://zhuanlan.zhihu.com/p/671434414
