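Getting Embedding Vectors from ChatGLM

This article discusses how to use ChatGLM's embeddings and applies whitening to improve their unsupervised retrieval performance.

1. ChatGLM code implementation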

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().to("cuda:1").eval()
response, history = model.chat(tokenizer, "你好", history=[])
print(response)

> '你好👋！我是人工智能助手 ChatGLM-6B，很高兴见到你，欢迎问我任何问题。'

1.1 Printing the model structure

print(model)

>
ChatGLMForConditionalGeneration(
  (transformer): ChatGLMModel(
    (word_embeddings): Embedding(130528, 4096)
    (layers): ModuleList(
      (0-27): 28 x GLMBlock(
        (input_layernorm): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (attention): SelfAttention(
          (rotary_emb): RotaryEmbedding()
          (query_key_value): Linear(in_features=4096, out_features=12288, bias=True)
          (dense): Linear(in_features=4096, out_features=4096, bias=True)
        )
        (post_attention_layernorm): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (mlp): GLU(
          (dense_h_to_4h): Linear(in_features=4096, out_features=16384, bias=True)
          (dense_4h_to_h): Linear(in_features=16384, out_features=4096, bias=True)
        )
      )
    )
    (final_layernorm): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=4096, out_features=130528, bias=False)
)

2. Getting the GLM embedding

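import torch

text, device = "你好", "cuda:1"  # example input; same device the model was loaded on
inputs = tokenizer([text], return_tensors="pt").to(device)
resp = model.transformer(**inputs, output_hidden_states=True)
# Drop the singleton batch axis so that y has shape [seq_len, hidden]
y = resp.last_hidden_state.squeeze()
# Mean-pool over the token axis to get one vector for the whole text
y_mean = torch.mean(y, dim=0, keepdim=True)
y_mean = y_mean.cpu().detach().numpy()
print(y_mean.shape)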

> [1, 4096]

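2.1 Comparing embedding quality

def get_glm_embedding(text, device="cuda:1"):
    """Return the mean-pooled last hidden state of `text` as a [1, 4096] array."""
    inputs = tokenizer([text], return_tensors="pt").to(device)
    resp = model.transformer(**inputs, output_hidden_states=True)
    y = resp.last_hidden_state.squeeze()  # drop the batch axis -> [seq_len, hidden]
    y_mean = torch.mean(y, dim=0, keepdim=True)
    return y_mean.cpu().detach().numpy()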

a = get_glm_embedding("你好！")
b = get_glm_embedding("Hello！")
c = get_glm_embedding("机器学习")

from scipy.spatial.distance import cosine
print(1 - cosine(a[0], b[0]), 1 - cosine(c[0], b[0]), 1 - cosine(c[0], a[0]))

> 0.81396484375, 0.71826171875, 0.634765625

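The results show that the GLM embeddings are already usable for unsupervised retrieval: the greeting pair "你好！" / "Hello！" gets the highest similarity (0.81), above both pairs that involve "机器学习" (machine learning).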
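To make "unsupervised retrieval" concrete, here is a minimal sketch of ranking candidates by cosine similarity to a query. The helper name `rank_by_cosine` and the toy 3-dimensional vectors are illustrative assumptions, not part of the original code; in practice the rows would be outputs of `get_glm_embedding`.

```python
import numpy as np

def rank_by_cosine(query, candidates):
    """Return candidate indices sorted from most to least similar, plus the scores.

    query: array of shape [dim]; candidates: array of shape [n, dim].
    (Hypothetical helper for illustration.)
    """
    query = np.asarray(query, dtype=np.float32)
    candidates = np.asarray(candidates, dtype=np.float32)
    # L2-normalize so that the dot product equals the cosine similarity
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)
    return order, scores

# Toy vectors standing in for real embeddings
query = np.array([1.0, 0.0, 0.0])
candidates = np.array([
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.9, 0.1, 0.0],   # nearly parallel to the query
    [0.5, 0.5, 0.0],   # in between
])
order, scores = rank_by_cosine(query, candidates)
print(order.tolist())  # [1, 2, 0]
```

Normalizing once and using a matrix product avoids calling a pairwise distance function inside a loop, which matters once the candidate set grows.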

2.2 Refining the embedding scheme

Next, test the embedding quality when the output is additionally passed through the lm_head layer.

def get_glm_embedding(text, device="cuda:1", logits=False):
    inputs = tokenizer([text], return_tensors="pt").to(device)
    resp = model.transformer(**inputs, output_hidden_states=True)
    y = resp.last_hidden_state
    if logits:
        y = model.lm_head(y).permute(1, 0, 2).contiguous()
    y = y.squeeze()
    y_mean = torch.mean(y, dim=0, keepdim=True)
    return y_mean.cpu().detach().numpy()

a = get_glm_embedding("你好！")
b = get_glm_embedding("Hello！")
c = get_glm_embedding("机器学习")

from scipy.spatial.distance import cosine
print(1 - cosine(a[0], b[0]), 1 - cosine(c[0], b[0]), 1 - cosine(c[0], a[0]))

> 0.52587890625, 0.2998046875, 0.70654296875

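Conclusion: adding the lm_head layer does not improve the results. The similarity of the greeting pair "你好！" / "Hello！" drops from 0.81 to 0.53, and the unrelated pair "机器学习" / "你好！" now gets the highest score (0.71).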

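3. Improving the unsupervised retrieval performance of the embeddings

Following Su Jianlin's BERT-whitening method, we try to improve the embeddings further.

3.1 What is BERT-whitening

BERT-whitening is a technique for improving text vector representations in natural language processing (NLP). It uses feature whitening to reduce redundant information in the vectors, which improves both the quality and the efficiency of the text representations. The principle is explained step by step below.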

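3.1.1 Removing the offset

When BERT outputs are used for downstream text representation tasks, the vectors of different input sequences usually need to be brought to a common scale and form. The first step is to remove the offset from the vectors, that is, to subtract the mean vector. This is the centering operation.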
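As a small illustration of centering, the sketch below subtracts the mean vector from a toy matrix. The 4×3 values are made up for the example and stand in for real sentence embeddings.

```python
import numpy as np

# Toy matrix standing in for sentence embeddings: 4 vectors of dimension 3
vecs = np.array([
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 4.0, 6.0],
    [2.0, 0.0, 2.0],
])

mu = vecs.mean(axis=0, keepdims=True)  # mean vector, shape [1, 3]
centered = vecs - mu                   # subtract the offset from every row

print(mu)                      # [[2. 2. 3.]]
print(centered.mean(axis=0))   # [0. 0. 0.]
```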

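3.1.2 Feature whitening

Feature whitening is a common data processing method. It transforms the data so that its covariance matrix becomes the identity matrix (diagonal, with unit variances), which removes the correlation between features and can improve the interpretability and generalization of the data. In BERT-whitening, this can be done with ZCA whitening (Zero-phase Component Analysis whitening). Specifically, ZCA whitening applies a singular value decomposition (SVD) to the centered vectors to obtain an orthogonal matrix, and then applies that matrix, together with a rescaling by the singular values, to the centered vectors to obtain the whitened vectors.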

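3.1.3 Inverse transform

To preserve the quality of the text representation, the whitened vectors can be mapped back with an inverse transform to obtain the final text vectors. This inverse ZCA whitening uses the covariance matrix of the original vectors to undo the whitening. Note that the computation in 3.2 and the code in 3.3 stop at the whitened vectors and do not apply this step.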

3.2 Computation

Concretely, the computation consists of the following steps.

Compute the mean $\mu$ of the vectors:

$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$

where $x_i$ is the $i$-th vector and $n$ is the number of vectors.

Compute the covariance matrix cov of the vectors:

$cov = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T$

Apply a singular value decomposition (SVD) to the covariance matrix cov:

$cov = USV^T$

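where $U$ holds the left singular vectors of the covariance matrix, $S$ its singular values, and $V$ its right singular vectors. Because the covariance matrix is symmetric positive semi-definite, $U$ and $V$ can be taken to be equal, so only $U$ is needed below.

Compute the ZCA whitening matrix W: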

$W = U\frac{1}{\sqrt{S + \epsilon}}U^T$

where $\epsilon$ is a very small constant that avoids division by zero.

Apply kernel and bias:

$y = (x + bias)\cdot kernel$

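where $x$ is the original vector, $bias$ is the negative mean $-\mu$, and $kernel$ is the ZCA whitening matrix $W$. The code in 3.3 drops the trailing $U^T$ and uses $U\frac{1}{\sqrt{S}}$ as the kernel. This only rotates the whitened vectors, so cosine similarities are unchanged, and it allows keeping just the first n_components columns to reduce the dimension.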
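The steps above can be sketched in NumPy as follows. The function name `zca_whitening`, the mixing matrix and the random toy data are assumptions made for the illustration; the check at the end confirms that the whitened data has (numerically) identity covariance, and that dropping the trailing $U^T$ leaves all dot products unchanged.

```python
import numpy as np

def zca_whitening(vecs, eps=1e-12):
    """Return (kernel, bias) such that (x + bias) @ kernel is ZCA-whitened."""
    mu = vecs.mean(axis=0, keepdims=True)   # mean vector
    cov = np.cov(vecs.T)                    # covariance with 1/(n-1) normalization
    u, s, _ = np.linalg.svd(cov)            # cov = U S V^T
    kernel = u @ np.diag(1.0 / np.sqrt(s + eps)) @ u.T
    return kernel, -mu

rng = np.random.default_rng(0)
# Correlated toy data standing in for embeddings: 1000 vectors of dimension 4
mixing = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.0, 0.5],
])
x = rng.normal(size=(1000, 4)) @ mixing + 5.0

kernel, bias = zca_whitening(x)
y = (x + bias) @ kernel

# After whitening, the covariance is (numerically) the identity matrix
print(np.allclose(np.cov(y.T), np.eye(4), atol=1e-6))  # True

# PCA whitening (no trailing U^T) is a rotation of the ZCA result,
# so all pairwise dot products, and hence cosines, are the same
u, s, _ = np.linalg.svd(np.cov(x.T))
y_pca = (x + bias) @ (u @ np.diag(1.0 / np.sqrt(s + 1e-12)))
print(np.allclose(y @ y.T, y_pca @ y_pca.T))  # True
```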

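3.3 Detailed code

The kernel and bias here are computed on a semantic similarity dataset. Computing them on datasets from other tasks will give different retrieval results, so please test this on your own data.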

import pandas as pd
import numpy as np
from tqdm import tqdm

df = pd.read_csv("train.csv")
# Filter out overly long texts (100 characters or more)
sent1_cond = df["sentence1"].map(lambda x:len(x)<100)
sent2_cond = df["sentence2"].map(lambda x:len(x)<100)
df = df.loc[(sent1_cond & sent2_cond), ["sentence1", "sentence2", "label"]]

# Compute the embeddings for every sentence pair in the dataset
all_vecs=[]
all_labels=[]
with tqdm(total=df.shape[0]) as pbar:
    for sent1, sent2, label in df.itertuples(index=False):
        # get_glm_embedding takes only the text (see its definition in 2.2)
        vec1 = get_glm_embedding(sent1).reshape([-1, 4096])
        vec2 = get_glm_embedding(sent2).reshape([-1, 4096])
        all_vecs.append((vec1, vec2))
        all_labels.append(label)
        pbar.update(1)


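# Compute kernel and bias
def compute_kernel_bias(vecs, n_components=256):
    """
    n_components: number of leading PCA components to keep
    """
    mu = vecs.mean(axis=0, keepdims=True)
    cov = np.cov(vecs.T)
    u, s, vh = np.linalg.svd(cov)
    W = np.dot(u, np.diag(1 / np.sqrt(s)))
    return W[:, :n_components], -mu

# Assumption: kernel and bias are fitted on all sentence vectors, stacked to [2N, 4096]
vecs = np.concatenate([v for pair in all_vecs for v in pair], axis=0).astype(np.float32)
kernel, bias = compute_kernel_bias(vecs)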


# Compute similarity
def transform_and_normalize(vecs, kernel=None, bias=None):
    """
    Apply the transform, then L2-normalize.
    Final transform: y = (x + bias).dot(kernel)
    """
    if not (kernel is None or bias is None):
        vecs = (vecs + bias).dot(kernel)
    return vecs / (vecs**2).sum(axis=1, keepdims=True)**0.5

def compute_cosine(a, b, kernel_=None, bias_=None):
    a_vec=transform_and_normalize(a, kernel_, bias_)
    b_vec=transform_and_normalize(b, kernel_, bias_)
    return round((a_vec * b_vec).sum(axis=1).tolist()[0], 2)
    
from scipy.stats import spearmanr
def compute_corrcoef(x, y):
    """Evaluate with the Spearman correlation coefficient"""
    return spearmanr(x, y).correlation

# Similarity results with the original 4096-dimensional embeddings
res_without_kernel=[]
for vecs in tqdm(all_vecs):
    res_without_kernel.append(compute_cosine(*vecs))

print(compute_corrcoef(all_labels, res_without_kernel))
# > Correlation: 0.4524263870506247

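# Similarity results after PCA feature extraction and dimensionality reduction
res=[]
for vecs in tqdm(all_vecs):
    # Pass kernel and bias so that the whitening transform is actually applied
    res.append(compute_cosine(*vecs, kernel, bias))

print(compute_corrcoef(all_labels, res))
# > Correlation: 0.4672354045287457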

# Persist the vectors for later engineering use
np.savez('svd256.npz', kernel=kernel, bias=bias, all_vecs=all_vecs, all_labels=all_labels)

3.4 Final conclusions

Method	Spearman correlation (closer to 0 means weaker agreement with the labels)	First 15 cosine scores (score, label)
Original	0.4524263870506247	`[(0.46, 0),(0.53, 0),(0.58, 0),(0.81, 0),(0.82, 1),(0.56, 0),(0.49, 0),(0.74, 1),(0.66, 1),(0.6, 1),(0.83, 1),(0.67, 0),(0.94, 1),(0.55, 0),(0.73, 1)]`
Whitening	0.4672354045287457	`[(0, 0),(0.04, 0),(0, 0),(0.25, 0),(0.45, 1),(0, 0),(0, 0),(0.42, 1),(0.21, 1),(0.33, 1),(0.33, 1),(0.02, 0),(0.86, 1),(0.02, 0),(0.25, 1)]`

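The results show that reducing the dimension from 4096 to 256 does not lower unsupervised retrieval accuracy. It even improves it slightly (Spearman correlation 0.452 vs. 0.467), and at the engineering level the smaller vectors can greatly improve retrieval efficiency. In addition, compared with the original cosine scores, the whitened cosine scores are more discriminative: pairs labeled 0 mostly score close to 0 instead of clustering around 0.5 to 0.8.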
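For later engineering use, the saved kernel and bias can be loaded back and applied to new vectors before searching. The following is a minimal sketch under stated assumptions: the helper `top_k`, the file name `whitening_demo.npz` and the synthetic 8-dimensional data are hypothetical stand-ins, whereas a real setup would load `svd256.npz` and use 4096-dimensional GLM embeddings.

```python
import os
import tempfile
import numpy as np

def transform_and_normalize(vecs, kernel, bias):
    """Apply y = (x + bias).dot(kernel), then L2-normalize each row."""
    vecs = (vecs + bias).dot(kernel)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def top_k(query_vec, corpus_vecs, kernel, bias, k=3):
    """Return the indices of the k most similar corpus vectors (hypothetical helper)."""
    q = transform_and_normalize(query_vec, kernel, bias)     # [1, n_components]
    c = transform_and_normalize(corpus_vecs, kernel, bias)   # [n, n_components]
    scores = (c @ q.T).ravel()
    return np.argsort(-scores)[:k], scores

# Synthetic stand-ins: 8-dimensional "embeddings" reduced to 4 dimensions
rng = np.random.default_rng(0)
corpus = rng.normal(size=(50, 8))
mu = corpus.mean(axis=0, keepdims=True)
u, s, _ = np.linalg.svd(np.cov(corpus.T))
kernel, bias = (u @ np.diag(1 / np.sqrt(s)))[:, :4], -mu

# Round-trip through an .npz file, in the same way svd256.npz is written
path = os.path.join(tempfile.mkdtemp(), "whitening_demo.npz")
np.savez(path, kernel=kernel, bias=bias)
loaded = np.load(path)
kernel, bias = loaded["kernel"], loaded["bias"]

# A query that is itself in the corpus should be returned first
idx, scores = top_k(corpus[7:8], corpus, kernel, bias)
print(idx[0])  # 7
```

Whitening the corpus once and storing the reduced vectors means that each query costs only one small matrix product.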