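RAG Basics, Lesson 1: Hand-Rolling a Bare-Bones RAG (Rag入门-第1课-手搓一个土得掉渣的RAG.ipynb). This first hand-rolled RAG covers the main RAG pipeline, from text chunking and the word-embedding mod…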

1. Set up the Jupyter Notebook environment

# Download and install Python; use version 3.9.9
# Create a virtual environment
python -m venv myenv
# Activate the virtual environment (Windows path)
.\myenv\Scripts\activate
# Install Jupyter Notebook
pip install notebook
# Start Jupyter Notebook
jupyter notebook

In Jupyter, create a new Rag.ipynb file. Install the dependencies by running the following commands:

!pip install faiss-cpu scikit-learn scipy
!pip install openai
!pip install python-dotenv

2. Create a .env configuration file and add ZHIPU_API_KEY=your_api_key

3. Write the code

Load environment variables

import os
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()
# Read the API key from the environment
api_key = os.getenv('ZHIPU_API_KEY')
base_url = "https://open.bigmodel.cn/api/paas/v4/"
chat_model = "glm-4-flash"
emb_model = "embedding-2"
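
`os.getenv` returns `None` when a variable is missing, so a forgotten `.env` entry may only surface later, when the client is built or the first request is sent. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper introduced here for illustration, not part of python-dotenv or the openai package.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty.

    `require_env` is a hypothetical helper, not a library function.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to the .env file before running the notebook"
        )
    return value
```

Under that assumption, `api_key = require_env("ZHIPU_API_KEY")` could replace the plain `os.getenv` call, placed after `load_dotenv()`.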

Construct the client

from openai import OpenAI
client = OpenAI(
    api_key = api_key,
    base_url = base_url
)

Build the document

embedding_text = """  
Multimodal Agent AI systems have many applications. In addition to interactive AI, grounded multimodal models could help drive content generation for bots and AI agents, and assist in productivity applications, helping to re-play, paraphrase, action prediction or synthesize 3D or 2D scenario. Fundamental advances in agent AI help contribute towards these goals and many would benefit from a greater understanding of how to model embodied and empathetic in a simulate reality or a real world. Arguably many of these applications could have positive benefits.  
  
However, this technology could also be used by bad actors. Agent AI systems that generate content can be used to manipulate or deceive people. Therefore, it is very important that this technology is developed in accordance with responsible AI guidelines. For example, explicitly communicating to users that content is generated by an AI system and providing the user with controls in order to customize such a system. It is possible the Agent AI could be used to develop new methods to detect manipulative content - partly because it is rich with hallucination performance of large foundation model - and thus help address another real world problem.  
  
For examples, 1) in health topic, ethical deployment of LLM and VLM agents, especially in sensitive domains like healthcare, is paramount. AI agents trained on biased data could potentially worsen health disparities by providing inaccurate diagnoses for underrepresented groups. Moreover, the handling of sensitive patient data by AI agents raises significant privacy and confidentiality concerns. 2) In the gaming industry, AI agents could transform the role of developers, shifting their focus from scripting non-player characters to refining agent learning processes. Similarly, adaptive robotic systems could redefine manufacturing roles, necessitating new skill sets rather than replacing human workers. Navigating these transitions responsibly is vital to minimize potential socio-economic disruptions.  
  
Furthermore, the agent AI focuses on learning collaboration policy in simulation and there is some risk if directly applying the policy to the real world due to the distribution shift. Robust testing and continual safety monitoring mechanisms should be put in place to minimize risks of unpredictable behaviors in real-world scenarios. Our “VideoAnalytica" dataset is collected from the Internet and considering which is not a fully representative source, so we already go through-ed the ethical review and legal process from both Microsoft and University Washington. Be that as it may, we also need to understand biases that might exist in this corpus. Data distributions can be characterized in many ways. In this workshop, we have captured how the agent level distribution in our dataset is different from other existing datasets. However, there is much more than could be included in a single dataset or workshop. We would argue that there is a need for more approaches or discussion linked to real tasks or topics and that by making these data or system available.  
  
We will dedicate a segment of our project to discussing these ethical issues, exploring potential mitigation strategies, and deploying a responsible multi-modal AI agent. We hope to help more researchers answer these questions together via this paper.  
  
"""  
  
# Set the size of each text chunk to 512 characters
chunk_size = 512
# Use a list comprehension to split the long text into chunks of chunk_size characters each
chunks = [embedding_text[i:i + chunk_size] for i in range(0, len(embedding_text), chunk_size)]
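
Fixed-size slicing can cut a sentence in half at a chunk boundary. One common mitigation is to let neighbouring chunks overlap by a few characters. The sketch below shows this variant, which the lesson itself does not use; `chunk_with_overlap` and its `overlap` parameter are hypothetical names.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list:
    """Split text into chunks of chunk_size characters that share `overlap` characters.

    Hypothetical helper: with overlap=0 it matches the plain slicing used in the lesson.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last chunk is not fully contained in the previous one.
    stop = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


print(chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap keeps more context across boundaries but produces more chunks, and therefore more embedding requests.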

Vectorization

from sklearn.preprocessing import normalize  
import numpy as np  
import faiss  
  
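# Initialize an empty list to hold the embedding of each text chunk
embeddings = []

# Iterate over the text chunks
for chunk in chunks:
    # Create an embedding for the current chunk through the OpenAI-compatible API
    response = client.embeddings.create(
        model=emb_model,
        input=chunk,
    )

    # Append the embedding to the list
    embeddings.append(response.data[0].embedding)

# L2-normalize the embeddings with sklearn's normalize function
normalized_embeddings = normalize(np.array(embeddings).astype('float32'))

# Get the dimensionality of the embeddings
d = len(embeddings[0])

# Create a Faiss inner-product index to store and search the embeddings
index = faiss.IndexFlatIP(d)

# Add the normalized embeddings to the index
index.add(normalized_embeddings)

# Get the total number of vectors in the index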
n_vectors = index.ntotal  
  
  
print(n_vectors)
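
The code normalizes the vectors before adding them to an inner-product index because the inner product of two unit-length vectors equals their cosine similarity. The sketch below checks this with the standard library only; the helper names are made up for illustration, and the plain lists stand in for the NumPy arrays used in the lesson.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


a, b = [3.0, 4.0], [4.0, 3.0]
ip_after_norm = inner_product(l2_normalize(a), l2_normalize(b))
print(round(ip_after_norm, 4), round(cosine_similarity(a, b), 4))
# 0.96 0.96
```

This is why the scores printed during retrieval can be read as cosine similarities, provided the query vector is normalized the same way.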

Vector retrieval

from sklearn.preprocessing import normalize  
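def match_text(input_text, index, chunks, k=2):
    """
    Find the top-k text chunks most similar to the input text.

    Args:
        input_text (str): The input text to match.
        index (faiss.Index): The Faiss index to search.
        chunks (list of str): The list of text chunks.
        k (int, optional): Number of most similar chunks to return. Defaults to 2.

    Returns:
        str: A formatted string containing the most similar chunks and their similarity scores.
    """
    # Make sure k does not exceed the total number of chunks
    k = min(k, len(chunks))

    # Create an embedding for the input text through the OpenAI-compatible API
    response = client.embeddings.create(
        model=emb_model,
        input=input_text,
    )
    # Get the embedding of the input text
    input_embedding = response.data[0].embedding
    # L2-normalize the input embedding
    input_embedding = normalize(np.array([input_embedding]).astype('float32'))

    # Search the index for the k vectors most similar to the input embedding
    distances, indices = index.search(input_embedding, k)
    # Initialize a string to accumulate the matched text
    matching_texts = ""
    # Iterate over the search results
    for i, idx in enumerate(indices[0]):
        # Print the similarity and text of each matched chunk
        print(f"similarity: {distances[0][i]:.4f}\nmatching text: \n{chunks[idx]}\n")
        # Append the similarity and text to the matched-text string
        matching_texts += f"similarity: {distances[0][i]:.4f}\nmatching text: \n{chunks[idx]}\n"

    # Return the string containing the matched chunks and their similarities
    return matching_texts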
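
`match_text` returns one formatted string, which is convenient for printing but awkward to filter. As a sketch of an alternative, the search output could be turned into `(score, chunk)` pairs first. `pair_matches` and `min_score` are hypothetical names, and the plain lists below stand in for the NumPy arrays that `index.search` returns for a single query.

```python
def pair_matches(distances, indices, chunks, min_score=None):
    """Pair each hit's score with its chunk text.

    `distances` and `indices` are expected in the single-query layout:
    one row holding k entries each. Hypothetical helper for illustration.
    """
    matches = []
    for score, idx in zip(distances[0], indices[0]):
        if idx < 0:
            # Faiss uses -1 as the label when fewer than k results exist.
            continue
        if min_score is not None and score < min_score:
            # Drop hits below the optional similarity threshold.
            continue
        matches.append((float(score), chunks[int(idx)]))
    return matches


demo_chunks = ["chunk A", "chunk B", "chunk C"]
demo_distances = [[0.91, 0.42, 0.0]]
demo_indices = [[2, 0, -1]]
print(pair_matches(demo_distances, demo_indices, demo_chunks, min_score=0.5))
# [(0.91, 'chunk C')]
```

Keeping the scores separate from the text also makes it possible to pass only the chunk text to the prompt, if the similarity labels turn out to distract the model.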

Enter the question

input_text = "What are the applications of Agent AI systems ?"  
  
matched_texts = match_text(input_text=input_text, index=index, chunks=chunks, k=2)

Build the question prompt

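prompt = f"""
Based on the retrieved documents
{matched_texts}
generate an answer to
{input_text}
Quote the documents' original wording wherever possible. Do not restate the question; start answering directly.
"""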

Build the chat engine

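def get_completion_stream(prompt):
    """
    Generate a streamed text reply through the OpenAI-compatible Chat Completions API.

    Args:
        prompt (str): The prompt text to generate a reply for.

    Returns:
        None: The function prints the generated reply directly.
    """
    # Create a chat completion request through the Chat Completions API
    response = client.chat.completions.create(
        model=chat_model,  # name of the model to call
        messages=[
            {"role": "user", "content": prompt},
        ],
        stream=True,
    )
    # If a response was returned
    if response:
        # Iterate over each chunk of the response
        for chunk in response:
            # Get the content of the current chunk
            content = chunk.choices[0].delta.content
            # If there is content
            if content:
                # Print the content and flush the output buffer
                print(content, end='', flush=True)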
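
`get_completion_stream` only prints the reply, so the text is gone once the cell finishes. A minimal sketch of collecting the streamed pieces into one string follows. `collect_stream` and `fake_chunk` are hypothetical names; the fake objects only mimic the `chunk.choices[0].delta.content` attribute path used above so that the example runs without an API call. The empty-`choices` guard is a defensive assumption, not something the lesson's API is confirmed to need.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Concatenate the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            # Defensive assumption: skip chunks that carry no choices.
            continue
        content = chunk.choices[0].delta.content
        if content:
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute path as a streamed chunk."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


fake_stream = [fake_chunk("Agent AI "), fake_chunk(None), fake_chunk("systems")]
print(collect_stream(fake_stream))
# Agent AI systems
```

With a real response object, the same loop could both print each piece and append it, so the full answer remains available for logging or evaluation.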

Run the chat

get_completion_stream(prompt)
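
The prompt above is built once from module-level variables, so asking a second question means repeating the retrieval and prompt cells. One way to make that repeatable is to wrap the template in a function, sketched here. `build_prompt` is a hypothetical helper, and its wording is an English rendering of the lesson's prompt rather than a required format.

```python
def build_prompt(matched_texts: str, input_text: str) -> str:
    """Fill the retrieved text and the question into the prompt template.

    Hypothetical helper; the template text is an assumption modelled on the lesson's prompt.
    """
    return (
        "Based on the retrieved documents\n"
        f"{matched_texts}\n"
        "generate an answer to\n"
        f"{input_text}\n"
        "Quote the documents' original wording wherever possible. "
        "Do not restate the question; start answering directly.\n"
    )


demo_prompt = build_prompt(
    "similarity: 0.9000\nmatching text: \nAgent AI systems have many applications.",
    "What are the applications of Agent AI systems ?",
)
print(demo_prompt)

# In the notebook, the pieces would chain like this (requires the API key):
# matched = match_text(input_text=question, index=index, chunks=chunks, k=2)
# get_completion_stream(build_prompt(matched, question))
```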