Identity-Aware Document Retrieval with PebbloRetrievalQA

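# Introduction

When building question-answering applications on top of large language models (LLMs), enforcing identity-based authorization and semantic controls on data access is a significant challenge. PebbloRetrievalQA addresses this with identity and semantic enforcement, so that retrieval only draws on documents the user is authorized to see and that do not carry denied topics or entities. This article shows how to use PebbloRetrievalQA to build identity-aware retrieval-augmented generation (RAG).

# Main content

## 1. Loading documents

We load documents into the Qdrant vector database together with authorization and semantic metadata. The metadata identifies which user groups may access a document (`authorized_identities`) and which semantic topics and entities it covers (`pebblo_semantic_topics`, `pebblo_semantic_entities`).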

```python
from langchain_community.vectorstores.qdrant import Qdrant
from langchain_core.documents import Document
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_openai.llms import OpenAI

llm = OpenAI()
embeddings = OpenAIEmbeddings()
collection_name = "pebblo-identity-and-semantic-rag"

page_content = """
**ACME Corp Financial Report**

**Overview:**
ACME Corp, a leading player in the merger and acquisition industry, presents its financial report for the fiscal year ending December 31, 2020. 
Despite a challenging economic landscape, ACME Corp demonstrated robust performance and strategic growth.

**Financial Highlights:**
Revenue soared to $50 million, marking a 15% increase from the previous year, driven by successful deal closures and expansion into new markets. 
Net profit reached $12 million, showcasing a healthy margin of 24%.

**Key Metrics:**
Total assets surged to $80 million, reflecting a 20% growth, highlighting ACME Corp's strong financial position and asset base. 
Additionally, the company maintained a conservative debt-to-equity ratio of 0.5, ensuring sustainable financial stability.

**Future Outlook:**
ACME Corp remains optimistic about the future, with plans to capitalize on emerging opportunities in the global M&A landscape. 
The company is committed to delivering value to shareholders while maintaining ethical business practices.

**Bank Account Details:**
For inquiries or transactions, please refer to ACME Corp's US bank account:
Account Number: 123456789012
Bank Name: Fictitious Bank of America
"""

documents = [
    Document(
        **{
            "page_content": page_content,
            "metadata": {
                "pebblo_semantic_topics": ["financial-report"],
                "pebblo_semantic_entities": ["us-bank-account-number"],
                "authorized_identities": ["finance-team", "exec-leadership"],
                "page": 0,
                "source": "https://drive.google.com/file/d/xxxxxxxxxxxxx/view",
                "title": "ACME Corp Financial Report.pdf",
            },
        }
    )
]

vectordb = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",  # in-memory Qdrant instance; nothing is persisted
    collection_name=collection_name,
)
```

## 2. Identity and semantic enforcement tests

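Using identity and semantic context, we test how question answering behaves under different identities and semantic labels, and confirm that access to information is properly restricted.

### Identity enforcement

When the user holds a relevant identity, the question is answered correctly.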

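```python
auth = {
    "user_id": "finance-user@acme.org",
    "user_auth": [
        "finance-team",
    ],
}

# NOTE: `ask` is not defined in this article. It is assumed to be a helper that
# forwards the question and the auth context to a PebbloRetrievalQA chain.
question = "Share the financial performance of ACME Corp for the year 2020"
resp = ask(question, auth)
print(f"Question: {question}\n\nAnswer: {resp['result']}")

# Output: ACME Corp's financial highlights for 2020
```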
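As a rough mental model of identity enforcement (not Pebblo's actual implementation), the rule can be pictured as an overlap check between the user's `user_auth` groups and a document's `authorized_identities`. The helper name `is_authorized` and the `marketing-team` group are hypothetical and exist only for this sketch.

```python
from typing import Any, Dict, List


def is_authorized(metadata: Dict[str, Any], user_auth: List[str]) -> bool:
    """Return True if the user shares at least one identity with the document."""
    allowed = set(metadata.get("authorized_identities", []))
    return bool(allowed & set(user_auth))


# Metadata shaped like the document loaded in section 1
doc_metadata = {"authorized_identities": ["finance-team", "exec-leadership"]}

finance_ok = is_authorized(doc_metadata, ["finance-team"])
marketing_ok = is_authorized(doc_metadata, ["marketing-team"])
print(finance_ok, marketing_ok)
```

Under this model, a user outside every authorized group retrieves no context from the document, so the LLM has nothing to answer from.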

### Semantic enforcement

When a query touches a denied semantic topic or entity, the answer is restricted.

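```python
# NOTE: here `ask` is assumed to accept deny lists instead of an auth context;
# its definition is not shown in this article.
topic_to_deny = ["financial-report"]
entities_to_deny = []
question = "Share the financial performance of ACME Corp for the year 2020"
resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny)

# Output: Access restricted due to denied semantic topic
```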
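Semantic enforcement can be pictured in a similar way: a document is excluded when any of its semantic labels appears in a deny list. The sketch below is an illustration under that assumption, not Pebblo's implementation. `is_denied` is a hypothetical helper, and `medical-advice` is a made-up topic used only to show the non-matching case.

```python
from typing import Any, Dict, List, Optional


def is_denied(
    metadata: Dict[str, Any],
    topics_to_deny: Optional[List[str]] = None,
    entities_to_deny: Optional[List[str]] = None,
) -> bool:
    """Return True if the document carries any denied topic or entity."""
    topics = set(metadata.get("pebblo_semantic_topics", []))
    entities = set(metadata.get("pebblo_semantic_entities", []))
    topic_hit = bool(topics & set(topics_to_deny or []))
    entity_hit = bool(entities & set(entities_to_deny or []))
    return topic_hit or entity_hit


# Semantic labels shaped like the document loaded in section 1
doc_metadata = {
    "pebblo_semantic_topics": ["financial-report"],
    "pebblo_semantic_entities": ["us-bank-account-number"],
}

denied_by_topic = is_denied(doc_metadata, topics_to_deny=["financial-report"])
denied_by_entity = is_denied(doc_metadata, entities_to_deny=["us-bank-account-number"])
not_denied = is_denied(doc_metadata, topics_to_deny=["medical-advice"])
print(denied_by_topic, denied_by_entity, not_denied)
```

Empty deny lists never match, which mirrors the `entities_to_deny = []` case in the example above.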

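# Common problems and solutions

- Problem: data access errors
  - Solution: make sure the correct identity and semantic context is passed both when loading documents and when asking questions.
- Problem: semantic labels have no effect
  - Solution: check that the semantic context is set correctly, and use the latest version of Pebblo and its components.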
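One way to catch both problems early is to check document metadata before loading it into the vector store. The sketch below assumes the three metadata keys used in this article are the ones enforcement depends on. `ENFORCEMENT_KEYS` and `missing_metadata_keys` are hypothetical names, not part of Pebblo or LangChain.

```python
from typing import Any, Dict, List

# Metadata keys used in this article for identity and semantic enforcement
ENFORCEMENT_KEYS = (
    "authorized_identities",
    "pebblo_semantic_topics",
    "pebblo_semantic_entities",
)


def missing_metadata_keys(metadata: Dict[str, Any]) -> List[str]:
    """List the enforcement-related keys that are absent from a metadata dict."""
    return [key for key in ENFORCEMENT_KEYS if key not in metadata]


complete = {
    "authorized_identities": ["finance-team"],
    "pebblo_semantic_topics": ["financial-report"],
    "pebblo_semantic_entities": [],
}
incomplete = {"authorized_identities": ["finance-team"]}

missing_in_complete = missing_metadata_keys(complete)
missing_in_incomplete = missing_metadata_keys(incomplete)
print(missing_in_complete)
print(missing_in_incomplete)
```

A key with an empty list counts as present here; only a key that is absent altogether is reported.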

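# Summary and further resources

PebbloRetrievalQA offers a secure, controllable way to retrieve documents. By combining identity enforcement with semantic controls, it keeps data access accurate and compliant. Readers who want to go deeper can consult the resources below.

# References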

  1. Pebblo Documentation
  2. LangChain GitHub Repository
  3. Qdrant Documentation
