How to Better Configure a Large Language Model Cache


Last updated: 2023-06-26

Latest version: v0.1.32

We have been running GPTCache in an internal product for some time now. Based on the data collected so far, the cache performs reasonably well, though there is still room for improvement. We also suspect that, for a specific scenario, training a dedicated model on data from that scenario could make the cache work even better. If you have a real-world use case and a relevant dataset, we would love to hear from you.

Before reading on, you should be familiar with the basic structure of GPTCache; we recommend reading:

Introduction to GPTCache initialization

Core components of GPTCache:

  • pre-process func: the pre-processing function
  • embedding func: the embedding function
  • data manager: data management
    • cache store: a traditional (scalar) database
    • vector store: a vector database
    • object store (optional, for multi-modal use): mainly stores files such as images and videos
  • similarity evaluation
  • post-process func: the post-processing function

All of these core components must be set when a similarity cache is initialized, though most have default values. Beyond these, there are a few additional parameters:

  • config: cache settings such as the similarity threshold and parameter values for certain pre-processing functions;

  • next_cache: used to set up a multi-level cache.

    For example, suppose there are two GPTCache instances, L1 and L2, and L1 sets L2 as its next cache at initialization time.

    When a user request arrives and L1 misses, L2 is searched next. If L2 also misses, the LLM is called and the result is stored in both L1 and L2. If L2 hits, the cached result is written back into L1.
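The fall-through and write-back flow above can be sketched in a few lines of plain Python. Note that `SimpleCache` is invented purely for illustration; it is not the GPTCache API.

```python
# A minimal sketch of the L1/L2 fall-through flow described above.
# SimpleCache is invented for illustration -- it is NOT the GPTCache API.
class SimpleCache:
    def __init__(self, next_cache=None):
        self.store = {}
        self.next_cache = next_cache

    def get(self, key):
        if key in self.store:                 # hit at this level
            return self.store[key]
        if self.next_cache is not None:       # miss: fall through to the next level
            value = self.next_cache.get(key)
            if value is not None:
                self.store[key] = value       # an L2 hit is written back into L1
                return value
        return None                           # miss everywhere: the caller invokes the LLM

    def put(self, key, value):
        # an LLM result is stored at every level
        self.store[key] = value
        if self.next_cache is not None:
            self.next_cache.put(key, value)

l2 = SimpleCache()
l1 = SimpleCache(next_cache=l2)
l2.put("q", "answer")            # seed only L2
assert l1.get("q") == "answer"   # L1 miss, L2 hit, promoted into L1
assert "q" in l1.store
```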

That covers the basic description of all initialization parameters.

The GPTCache library keeps a global cache object; if an LLM request does not specify a cache object, this global object is used.

There are currently three ways to initialize a cache:

  1. The init method of the Cache class. The default is exact key matching, i.e. a simple map-based cache:
def init(
    self,
    cache_enable_func=cache_all,
    pre_func=last_content,
    embedding_func=string_embedding,
    data_manager: DataManager = get_data_manager(),
    similarity_evaluation=ExactMatchEvaluation(),
    post_func=temperature_softmax,
    config=Config(),
    next_cache=None,
):
    pass
  2. The init_similar_cache method in the api package. The default is fuzzy matching with onnx + sqlite + faiss:
def init_similar_cache(
    data_dir: str = "api_cache",
    cache_obj: Optional[Cache] = None,
    pre_func: Callable = get_prompt,
    embedding: Optional[BaseEmbedding] = None,
    data_manager: Optional[DataManager] = None,
    evaluation: Optional[SimilarityEvaluation] = None,
    post_func: Callable = temperature_softmax,
    config: Config = Config(),
):
    pass
  3. init_similar_cache_from_config in the api package, which initializes the cache from a yaml file. The default is likewise fuzzy matching with onnx + sqlite + faiss:
def init_similar_cache_from_config(config_dir: str, cache_obj: Optional[Cache] = None):
    pass

Pre-processing function

The pre-processing function extracts the user's question from the parameters of the LLM request and assembles it into a string, which it returns. That return value becomes the input to the embedding model.

Note that different LLMs require different pre-processing functions, because each LLM has its own request parameter list, and the parameter that carries the user's question has a different name in each. If you want to choose a different pre-processing flow based on other LLM parameters, that is also possible.

A pre-processing function takes two parameters and returns either one or two values:

def foo_pre_process_func(data: Dict[str, Any], **params: Dict[str, Any]) -> Any:
  pass

Here data is the user's parameter list, and params holds extra arguments such as the cache config, which can be retrieved with params.get("cache_config", None).

If you have no special requirements, returning a single value is enough; it is used both as the embedding input and as the cache key for the current request.

You can also return two values: the first becomes the cache key for the current request, and the second becomes the embedding input. This is mainly used for long openai chat conversations. In that case, the first return value is the user's original long dialogue, produced by simple string concatenation, while the second is obtained by using a model to extract the key information from the dialogue, shortening the embedding input.
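The two-return-value contract can be sketched as below. The function name and the naive "shortening" are invented for illustration; a real implementation would run a model to extract the key information from the dialogue.

```python
from typing import Any, Dict, Tuple

# Hypothetical pre-process function illustrating the (cache_key, embedding_input)
# contract described above. The name and the trivial shortening are made up.
def summarized_pre_process(data: Dict[str, Any], **params: Dict[str, Any]) -> Tuple[str, str]:
    messages = data.get("messages", [])
    # first return value: the full concatenated dialogue, used as the cache key
    cache_key = "\n".join(m["content"] for m in messages)
    # second return value: a shortened form, used as the embedding input
    # (a real implementation would call a summarization model here)
    embedding_input = messages[-1]["content"] if messages else ""
    return cache_key, embedding_input

key, emb_input = summarized_pre_process(
    {"messages": [
        {"role": "user", "content": "hi"},
        {"role": "user", "content": "what is GPTCache"},
    ]}
)
assert key == "hi\nwhat is GPTCache"
assert emb_input == "what is GPTCache"
```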

Currently available pre-processing functions:

Source code: processor/pre

Pre-processing API docs: gptcache.processor.pre

If any of the functions below are unclear, consult the API docs above, which contain simple examples for each function.

openai chat complete

  • last_content: takes the last content entry in the messages array.
  • last_content_without_prompt: takes the last content entry in the messages array, with the prompt stripped; used together with the prompts parameter of Config.
  • last_content_without_template: takes the last content entry in the messages array, with the template stripped; used together with the template parameter of Config.
  • all_content: concatenates all content entries in the messages array.
  • concat_all_queries: concatenates the content and role of every entry in the messages array.
  • context_process: handles long openai conversations; the core idea is to compress the dialogue and extract its key content to use as the cache key.
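As a rough illustration, last_content behaves like the following sketch (the real implementation lives in gptcache.processor.pre; this mimic is only meant to show what gets extracted):

```python
from typing import Any, Dict

# Minimal mimic of last_content: return the content of the last message
# in the openai chat request's messages array.
def last_content_sketch(data: Dict[str, Any], **params: Dict[str, Any]) -> str:
    return data.get("messages")[-1]["content"]

request = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "system", "content": "be brief"},
        {"role": "user", "content": "what's github"},
    ],
}
assert last_content_sketch(request) == "what's github"
```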

langchain llm

  • get_prompt: takes the prompt parameter from the request

langchain chat llm

  • get_messages_last_content: takes the content of the last message object in the message array

openai image

  • get_prompt: takes the prompt parameter from the request

openai audio

  • get_file_name: takes the file name from the request
  • get_file_bytes: takes the file's byte array from the request

openai moderation

  • get_openai_moderation_input: takes the input parameter of an openai moderation request

llama

  • get_prompt: takes the prompt parameter from the request

replicate (image -> text, image and text -> text)

  • get_input_str: takes the image and question from the request
  • get_input_image_file_name: takes the image file name from the request

stable diffusion

  • get_prompt: takes the prompt parameter from the request

minigpt4

  • get_image_question: takes the image and question from the request
  • get_image: takes the image object from the request

dolly

  • get_inputs: takes the inputs parameter from the request

Note: different LLMs require different pre-processing functions at cache initialization; if none fits, you need to write your own.

Embedding

Converts the input into a multi-dimensional numeric array; the options below are grouped by input type.

Note: the choice of embedding model matters a great deal for cache accuracy. Points to watch: the languages the model supports and the number of tokens it supports. In general, for a given amount of compute, larger models are more accurate but slower, while smaller models run fast but are less accurate.

Embedding API docs: embedding api

text

audio

image

Data Manager

A text-only similarity cache needs just a cache store and a vector store; a multi-modal cache additionally needs an object store. The choice of storage is independent of the LLM type, but note that when using a vector store you must set the vector dimension.

cache store

  • sqlite
  • duckdb
  • mysql
  • mariadb
  • sqlserver
  • oracle
  • postgresql

vector store

  • milvus
  • faiss
  • chromadb
  • hnswlib
  • pgvector
  • docarray
  • usearch
  • redis

object store

  • local
  • s3

How to get a Data Manager

  • Get one from the factory by store name
from gptcache.manager import manager_factory

data_manager = manager_factory("sqlite,faiss", data_dir="./workspace", vector_params={"dimension": 128})
  • Combine multiple store objects with the get_data_manager method
from gptcache.manager import get_data_manager, CacheBase, VectorBase

data_manager = get_data_manager(CacheBase('sqlite'), VectorBase('faiss', dimension=128))

Each store has further initialization parameters; refer to each store object's constructor: store api reference

Similarity Evaluation

Besides the embedding model and the vector engine, a suitable similarity evaluation is also key to getting the most out of the cache.

Similarity evaluation means scoring the recalled cache entries against the current LLM request and producing a number. The simplest approach is to use the embedding distance, but there are other methods too, such as using a model to judge how similar two questions are.

The similarity evaluation components currently available:

  1. SearchDistanceEvaluation: the vector search distance; simple and fast, but not very accurate
  2. OnnxModelEvaluation: uses a model to compare how related two questions are; a small model supporting only 512 tokens, more accurate than plain distance
  3. NumpyNormEvaluation: computes the distance between the embedding vectors of the LLM request and the cached entry; fast, simple, and about as accurate as the search distance
  4. KReciprocalEvaluation: uses k-reciprocal nearest neighbors to assess similarity; it recalls multiple cache entries for comparison, so it needs multiple recalls and more time, but is comparatively more accurate; see the API docs for details
  5. CohereRerankEvaluation: uses the cohere API service; more accurate, but incurs a cost; details: cohere rerank
  6. SequenceMatchEvaluation: sequence matching, suited to multi-turn dialogue; each turn is scored separately and the final score is a weighted combination
  7. TimeEvaluation: scores entries by their creation time, to avoid serving stale cache entries
  8. SbertCrossencoderEvaluation: uses an sbert model for rerank-style scoring; so far this has proven to be the most effective similarity evaluation

For more usage details, see the api doc

To get an even better similarity evaluation, you will need to customize one for your scenario, for example by composing the existing ones. If you want better cache behavior in long conversations, you might combine SequenceMatchEvaluation and TimeEvaluation; there may well be better approaches too.
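Composing evaluations into a weighted score could look like the sketch below. WeightedEvaluation and the assumption that every sub-score lies in [0, 1] are invented for illustration; a real component would implement GPTCache's SimilarityEvaluation interface.

```python
# Sketch of composing several similarity evaluations into one weighted score.
# WeightedEvaluation is illustrative only, not part of the GPTCache API.
class WeightedEvaluation:
    def __init__(self, evaluations, weights):
        assert len(evaluations) == len(weights)
        self.evaluations = evaluations
        self.weights = weights

    def evaluation(self, src, cached):
        # each sub-evaluation is assumed to return a score in [0, 1]
        return sum(w * e(src, cached) for e, w in zip(self.evaluations, self.weights))

semantic = lambda src, cached: 0.9   # stand-in for a model-based (rerank) score
freshness = lambda src, cached: 0.5  # stand-in for a time-based score

combined = WeightedEvaluation([semantic, freshness], [0.7, 0.3])
score = combined.evaluation("question", "cached question")
assert abs(score - 0.78) < 1e-9      # 0.7 * 0.9 + 0.3 * 0.5
```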

Post-Process function

Post-processing determines the final answer to the user's question from all cached entries that pass the similarity threshold. You can pick one entry from the list according to some policy, or use a model to fine-tune the answers so that similar questions can receive different answers.

Currently available post-processing functions:

  1. temperature_softmax: selects according to a softmax policy, which gives the returned cached answer some randomness
  2. first: returns the most similar cached answer
  3. random: returns a random one of the similar cached answers
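The idea behind temperature_softmax can be sketched as follows. This is a simplified stand-in, not the real gptcache.processor.post implementation.

```python
import math
import random

# Simplified sketch of softmax-based selection over candidate cached answers.
def softmax_pick(answers, scores, temperature=1.0, rng=random):
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # low temperature: the top-scoring answer dominates;
    # high temperature: the choice becomes closer to uniform
    return rng.choices(answers, weights=weights, k=1)[0]

# with a very low temperature, the highest-scoring answer is picked
pick = softmax_pick(["best", "other"], [5.0, 1.0], temperature=0.1)
assert pick == "best"
```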

Recommendations for newcomers

Note: different LLMs need different pre-processing functions, and you should also adapt them to your needs.

Beginner

To get a feel for what GPTCache can do, use the simplest combination: onnx embedding + (sqlite + faiss) data manager + distance similarity evaluation.

English version
import time

from gptcache.adapter import openai
from gptcache.adapter.api import init_similar_cache
from gptcache.processor.pre import last_content

init_similar_cache(pre_func=last_content)

questions = [
    "what's github",
    "can you explain what GitHub is",
    "can you tell me more about GitHub",
    "what is the purpose of GitHub",
]

for question in questions:
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=[{"role": "user", "content": question}],
    )
    print(f"Question: {question}")
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response["choices"][0]["message"]["content"]}\n')
# console output

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Question: what's github
Time consuming: 12.23s
Answer: GitHub is a web-based platform that uses Git for version control. It provides developers with a collaborative environment in which they can store and share their code, manage projects, track issues, and build software. GitHub also provides a range of features for code collaboration and review, including pull requests, forks, and merge requests, that enable users to work together on code development and share their work with a wider community. GitHub is widely used by businesses, open-source communities, and individual developers around the world.

Question: can you explain what GitHub is
Time consuming: 0.64s
Answer: GitHub is a web-based platform that uses Git for version control. It provides developers with a collaborative environment in which they can store and share their code, manage projects, track issues, and build software. GitHub also provides a range of features for code collaboration and review, including pull requests, forks, and merge requests, that enable users to work together on code development and share their work with a wider community. GitHub is widely used by businesses, open-source communities, and individual developers around the world.

Question: can you tell me more about GitHub
Time consuming: 0.21s
Answer: GitHub is a web-based platform that uses Git for version control. It provides developers with a collaborative environment in which they can store and share their code, manage projects, track issues, and build software. GitHub also provides a range of features for code collaboration and review, including pull requests, forks, and merge requests, that enable users to work together on code development and share their work with a wider community. GitHub is widely used by businesses, open-source communities, and individual developers around the world.

Question: what is the purpose of GitHub
Time consuming: 0.24s
Answer: GitHub is a web-based platform that uses Git for version control. It provides developers with a collaborative environment in which they can store and share their code, manage projects, track issues, and build software. GitHub also provides a range of features for code collaboration and review, including pull requests, forks, and merge requests, that enable users to work together on code development and share their work with a wider community. GitHub is widely used by businesses, open-source communities, and individual developers around the world.

Chinese version

If the questions are in Chinese, you need to use a different embedding model; here we use a model from huggingface.

import time

from gptcache.adapter import openai
from gptcache.adapter.api import init_similar_cache
from gptcache.embedding import Huggingface
from gptcache.processor.pre import last_content

huggingface = Huggingface(model="uer/albert-base-chinese-cluecorpussmall")
init_similar_cache(pre_func=last_content, embedding=huggingface)

questions = ["什么是Github", "你可以解释下什么是Github吗", "可以告诉我关于Github一些信息吗"]

for question in questions:
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=[{"role": "user", "content": question}],
    )
    print(f"Question: {question}")
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response["choices"][0]["message"]["content"]}\n')
# console output

Some weights of the model checkpoint at uer/albert-base-chinese-cluecorpussmall were not used when initializing AlbertModel: ['predictions.decoder.bias', 'predictions.LayerNorm.bias', 'predictions.bias', 'predictions.dense.bias', 'predictions.LayerNorm.weight', 'predictions.dense.weight', 'predictions.decoder.weight']
- This IS expected if you are initializing AlbertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of AlbertModel were not initialized from the model checkpoint at uer/albert-base-chinese-cluecorpussmall and are newly initialized: ['albert.pooler.weight', 'albert.pooler.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
2023-06-27 18:05:20,233 - 140704448365760 - connectionpool.py-connectionpool:812 - WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1125)'))': /v1/chat/completions
Question: 什么是Github
Time consuming: 24.09s
Answer: GitHub是一个面向开源及私有软件项目的托管平台,因为只支持git(一个分布式版本控制系统)作为唯一的版本库格式进行托管,故名GitHub。GitHub于2008年4月10日正式上线,除了目前,GitHub已经成为了世界上最大的开源社区和开源软件开发平台之一。

Question: 你可以解释下什么是Github吗
Time consuming: 0.49s
Answer: GitHub是一个面向开源及私有软件项目的托管平台,因为只支持git(一个分布式版本控制系统)作为唯一的版本库格式进行托管,故名GitHub。GitHub于2008年4月10日正式上线,除了目前,GitHub已经成为了世界上最大的开源社区和开源软件开发平台之一。

Question: 可以告诉我关于Github一些信息吗
Time consuming: 0.12s
Answer: GitHub是一个面向开源及私有软件项目的托管平台,因为只支持git(一个分布式版本控制系统)作为唯一的版本库格式进行托管,故名GitHub。GitHub于2008年4月10日正式上线,除了目前,GitHub已经成为了世界上最大的开源社区和开源软件开发平台之一。

Standard

Experiment with different component combinations to better match your own use case.

Tips:

  1. Different LLMs need different pre-processing functions, and the LLM's usage scenario should also be considered
  2. Mind the token limit and language support of any models involved, e.g. the embedding model and the similarity evaluation
  3. Don't forget to pass the dimension parameter when initializing the vector store
  4. The source code contains many examples, found in the bootcamp/example/test directories
  5. If one program uses caches for multiple LLMs, create a separate Cache object for each
An example that does not use the default values
import time

from gptcache import Cache, Config
from gptcache.adapter import openai
from gptcache.adapter.api import init_similar_cache
from gptcache.embedding import Onnx
from gptcache.manager import manager_factory
from gptcache.processor.post import random_one
from gptcache.processor.pre import last_content
from gptcache.similarity_evaluation import OnnxModelEvaluation

openai_complete_cache = Cache()
encoder = Onnx()
sqlite_faiss_data_manager = manager_factory(
    "sqlite,faiss",
    data_dir="openai_complete_cache",
    scalar_params={
        "sql_url": "sqlite:///./openai_complete_cache.db",
        "table_name": "openai_chat",
    },
    vector_params={
        "dimension": encoder.dimension,
        "index_file_path": "./openai_chat_faiss.index",
    },
)
onnx_evaluation = OnnxModelEvaluation()
cache_config = Config(similarity_threshold=0.75)

init_similar_cache(
    cache_obj=openai_complete_cache,
    pre_func=last_content,
    embedding=encoder,
    data_manager=sqlite_faiss_data_manager,
    evaluation=onnx_evaluation,
    post_func=random_one,
    config=cache_config,
)

questions = [
    "what's github",
    "can you explain what GitHub is",
    "can you tell me more about GitHub",
    "what is the purpose of GitHub",
]

for question in questions:
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
        cache_obj=openai_complete_cache,
    )
    print(f"Question: {question}")
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response["choices"][0]["message"]["content"]}\n')
# console output

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Question: what's github
Time consuming: 24.73s
Answer: GitHub is an online platform used primarily for version control and coding collaborations. It's used by developers to store, share and manage their codebase. It allows users to collaborate on projects with other developers, track changes, and manage their code repositories. It also provides several features such as pull requests, code review, and issue tracking. GitHub is widely used in the open source community and is considered as an industry standard for version control and software development.

Question: can you explain what GitHub is
Time consuming: 0.61s
Answer: GitHub is an online platform used primarily for version control and coding collaborations. It's used by developers to store, share and manage their codebase. It allows users to collaborate on projects with other developers, track changes, and manage their code repositories. It also provides several features such as pull requests, code review, and issue tracking. GitHub is widely used in the open source community and is considered as an industry standard for version control and software development.

Question: can you tell me more about GitHub
Time consuming: 33.38s
Answer: GitHub is a web-based hosting service for version control using Git. It is used by developers to collaborate on code from anywhere in the world. It allows developers to easily collaborate on projects, keep track of changes to code, and work together on large codebases. GitHub provides a comprehensive platform for developers to build software together, making it easier to track changes, test and deploy code, and manage project issues. It also hosts millions of open-source projects, making it a valuable resource for developers looking to learn from others’ code and contribute to the open-source community. Additionally, non-developers can use GitHub to store and share documents, create and contribute to wikis, and track projects and issues. GitHub is a key tool in modern software development and has come to be an essential part of the software development process.

Question: what is the purpose of GitHub
Time consuming: 0.32s
Answer: GitHub is an online platform used primarily for version control and coding collaborations. It's used by developers to store, share and manage their codebase. It allows users to collaborate on projects with other developers, track changes, and manage their code repositories. It also provides several features such as pull requests, code review, and issue tracking. GitHub is widely used in the open source community and is considered as an industry standard for version control and software development.

Notice that the third question did not hit the cache. This is because OnnxModelEvaluation was used. Model-based similarity evaluation can improve the quality of cached answers, but it may also lower the hit rate, since it can filter out entries that could have been served but that the model judged dissimilar.

So choose your components according to your own needs.

Note: when using a custom cache, add the cache_obj parameter to the LLM request to specify the cache object:

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": question}],
    cache_obj=openai_complete_cache,
)

Expert

Read the GPTCache source code and get familiar with how it runs; then you can customize components to your needs or combine the existing ones.

Based on usage so far, the main factors that determine cache quality are:

  1. The pre-processing function, since its return value becomes the embedding input
  2. The embedding model
  3. The vector store
  4. The similarity evaluation, i.e. using a rerank model for scoring

We have been running GPTCache in an internal product, and based on the data collected so far the cache performs reasonably well, though there is still room for improvement. We have also found that training a dedicated model on data from a specific scenario lets the cache work even better there. If you have a real-world use case and a relevant dataset, we would love to hear from you.