Struggling with papers? Let AI read them for you!

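Project overview:

This is an introductory RAG project built on multimodal large models. By setting up a targeted private-domain knowledge base, it lets you quickly build an AI for a vertical domain. Compared with existing chatbots such as Taobao's customer-service bot, a multimodal large model backed by such a knowledge base has stronger domain-specific knowledge and better comprehension, so its answers are more professional and better targeted. The example in this project is a text analyst: it helps you quickly summarize the contents of papers so that researchers can pick out the articles they need.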

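Technical approach:

If you're in a rush to get to the bathroom, here's the rough idea up front. OK, now let's continue.

Call NVIDIA NIM through the corresponding API. NIM offers a large number of models, so you can pick one based on your specific needs. This solution uses several models, for example: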

    ai-embed-qa-4 Embeddings
    ai-phi-3-vision-128k-instruct
    ai-mixtral-8x7b-instruct
    meta/llama-3.1-405b-instruct

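Step 1: Set up the environment

Building this project mainly requires three packages:

langchain_nvidia_ai_endpoints: used to call the compute resources of NVIDIA NIM
langchain: used to build the conversation chain and connect the components of the agent
base64: because this experiment builds a multimodal agent, base64 is needed to encode and decode images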

    from langchain_nvidia_ai_endpoints import ChatNVIDIA
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain.schema.runnable import RunnableLambda
    from langchain.schema.runnable.passthrough import RunnableAssign
    from langchain_core.runnables import RunnableBranch
    from langchain_core.runnables import RunnablePassthrough
    from langchain.chains import ConversationChain
    from langchain.memory import ConversationBufferMemory

    import os
    import re  # used later by extract_python_code
    import base64
    import pdf2image
    import matplotlib.pyplot as plt
    import numpy as np

    # Fill in your own NVIDIA API key here
    os.environ["NVIDIA_API_KEY"] = ""
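
Hard-coding the key in a notebook is easy to leak, so here is a minimal sketch of an alternative that reads it from the environment and only prompts when it is missing. The helper name `ensure_nvidia_api_key` is made up for this example, and the check that keys start with `nvapi-` is an assumption about NVIDIA's key format, not something stated in this post.

```python
import getpass
import os


def ensure_nvidia_api_key(env=os.environ, prompt=getpass.getpass):
    """Return a usable key, asking the user only when the environment lacks one."""
    key = env.get("NVIDIA_API_KEY", "")
    # Assumption: NVIDIA API keys start with "nvapi-"
    if not key.startswith("nvapi-"):
        key = prompt("Enter your NVIDIA API key: ")
        if not key.startswith("nvapi-"):
            raise ValueError("expected a key starting with 'nvapi-'")
        env["NVIDIA_API_KEY"] = key
    return key
```

Calling `ensure_nvidia_api_key()` once before creating any `ChatNVIDIA` object would be enough; the `env` and `prompt` parameters exist only so the helper can be tried out without a real key.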

Step 2: Convert the PDF paper into images, then use Microsoft Phi 3 vision to parse the image data

Code example

    def image2b64(image_file):
        """Read an image file and return its contents as a base64 string."""
        with open(image_file, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()
        return image_b64

    image_b64 = image2b64("economic-assistance-chart.png")
    # image_b64 = image2b64("eco-good-bad-chart.png")
    print(image_b64)
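
The PDF-to-image half of this step is not shown above, so here is a minimal sketch of what it could look like. It assumes `pdf2image` (which needs Poppler installed on the system) and writes one PNG per page; the function names, the `pages` output folder, and the file naming scheme are all made up for illustration.

```python
from pathlib import Path


def page_image_path(pdf_path, page_number, out_dir="pages"):
    """Build the output file name for one page, e.g. pages/paper_p001.png."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_p{page_number:03d}.png"


def pdf_to_pngs(pdf_path, out_dir="pages", dpi=200):
    """Render every page of a PDF to a PNG file and return the file paths."""
    # Imported lazily so the naming helper works without Poppler installed
    from pdf2image import convert_from_path

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    paths = []
    # convert_from_path returns one PIL image per page
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = page_image_path(pdf_path, number, out_dir)
        page.save(target, "PNG")
        paths.append(target)
    return paths
```

Each returned path can then be handed to `image2b64` to get the base64 string for one page.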

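Step 3: Build the multimodal agent with LangChain

Agent use case: turn the statistical charts in an image into data that can be analyzed with Python

Agent workflow:

Receive the image and read its data
Adjust and analyze the data
Generate code that can draw the chart, and execute it
Draw the chart from the processed data

Receive image -> analyze data -> modify data -> execute code -> show results

Helper functions

These functions display input, execute code, and so on; we may need them along the way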

    # Save the table produced during the LangChain run into a global variable
    def save_table_to_global(x):
        global table
        if 'TABLE' in x.content:
            table = x.content.split('TABLE', 1)[1].split('END_TABLE')[0]
        return x

    # Helper function for debugging
    def print_and_return(x):
        print(x)
        return x

    # Post-process the output of the large model: drop the commentary and
    # explanatory text, keeping only the Python code
    def extract_python_code(text):
        pattern = r'```python\s*(.*?)\s*```'
        matches = re.findall(pattern, text, re.DOTALL)
        return [match.strip() for match in matches]

    # Execute the code generated by the large model
    def execute_and_return(x):
        try:
            # Raises IndexError when the reply contains no python block
            code = extract_python_code(x.content)[0]
            exec(str(code))
        except Exception:
            print("The code is not executable, don't give up, try again!")
        return x

    # Encode an image as base64 so it can be passed to the large model
    def image2b64(image_file):
        with open(image_file, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()
        return image_b64
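
To see what the two parsing helpers do, it helps to run them on a reply by hand. The sketch below re-implements them on plain strings (the versions above take a LangChain message and read `x.content`), and the model reply is invented for illustration.

```python
import re

FENCE = "`" * 3  # built this way so the example itself contains no literal fence


def extract_table(text):
    """Return the text between the TABLE and END_TABLE markers, or None."""
    if "TABLE" not in text:
        return None
    return text.split("TABLE", 1)[1].split("END_TABLE")[0].strip()


def extract_python_code(text):
    """Return the contents of every fenced python block in the text."""
    pattern = FENCE + r"python\s*(.*?)\s*" + FENCE
    return [m.strip() for m in re.findall(pattern, text, re.DOTALL)]


# Hypothetical model reply, made up for illustration
reply = (
    "TABLE\n"
    "| Year | Aid |\n"
    "| 2020 | 10 |\n"
    "END_TABLE\n"
    "Here is the plotting code:\n"
    f"{FENCE}python\n"
    "print('plotting')\n"
    f"{FENCE}\n"
)

table = extract_table(reply)
code_blocks = extract_python_code(reply)
print(table)
print(code_blocks)
```

The markers matter because the instruct model is asked to wrap tables in `TABLE` / `END_TABLE` and code in a python fence, which is what makes this kind of string splitting workable.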

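Define the multimodal data analysis agent

First we define the prompt template, chart_reading_prompt; the image we pass in is converted to a base64 string and sent to it
The processed prompt is fed to chart_reading, i.e. the microsoft/phi-3-vision model, which analyzes the data and produces the table we need (the table variable)
The table produced by Phi 3 vision and the prompt are fed to another large model, llama3.1, which modifies the data and generates code
The generated code is executed by the execution function above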
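
The four steps above can be sketched in plain Python, with the two model calls passed in as ordinary functions so the control flow is visible without an API key. This is not the LangChain implementation used in this project: `chart_agent_sketch`, `build_chart_prompt`, and the stub models are hypothetical, the inline `<img>` prompt format is an assumption, and so is the choice to skip the vision model when a table already exists.

```python
import re

FENCE = "`" * 3  # avoids writing a literal code fence inside this example


def build_chart_prompt(image_b64):
    """Wrap the base64 image in an inline <img> tag (an assumed prompt format)."""
    return (
        "Generate the underlying data table of the figure below: "
        f'<img src="data:image/png;base64,{image_b64}" />'
    )


def chart_agent_sketch(image_b64, user_input, table, read_chart, instruct):
    """Run the four steps; read_chart and instruct are callables str -> str."""
    # Steps 1-2: ask the vision model for a table unless we already have one
    if table is None:
        table = read_chart(build_chart_prompt(image_b64))
    # Step 3: ask the instruct model to modify the data and/or write code
    reply = instruct(f"Based on this table {table}, {user_input}")
    if "TABLE" in reply:
        table = reply.split("TABLE", 1)[1].split("END_TABLE")[0].strip()
    # Step 4: run the first fenced python block, if the reply contains one
    blocks = re.findall(FENCE + r"python\s*(.*?)\s*" + FENCE, reply, re.DOTALL)
    if blocks:
        exec(blocks[0], {})
    return table


# Stub models, standing in for the Phi 3 vision and llama3.1 calls
def fake_vision(prompt):
    return "| Year | Aid |\n| 2020 | 10 |"


def fake_instruct(prompt):
    return "TABLE\n| Year | Aid |\n| 2020 | 20 |\nEND_TABLE"


result = chart_agent_sketch("aGVsbG8=", "double the Aid column", None,
                            fake_vision, fake_instruct)
print(result)
```

In the real agent, `read_chart` and `instruct` would each wrap a prompt-plus-`ChatNVIDIA` chain and return the content of the reply.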


    # Instruct LLM Runnable
    # Alternative models: only the last assignment would take effect,
    # so the others are commented out
    # instruct_chat = ChatNVIDIA(model="nv-mistralai/mistral-nemo-12b-instruct")
    # instruct_chat = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")
    # instruct_chat = ChatNVIDIA(model="ai-llama3-70b")
    instruct_chat = ChatNVIDIA(model="meta/llama-3.1-405b-instruct")

    # Adjacent string literals are concatenated, so each one ends with a space
    instruct_prompt = ChatPromptTemplate.from_template(
        "Do NOT repeat my requirements already stated. Based on this table {table}, {input} "
        "If has table string, start with 'TABLE', end with 'END_TABLE'. "
        "If has code, start with '```python' and end with '```'. "
        "Do NOT include table inside code, and vice versa."
    )
    instruct_chain = instruct_prompt | instruct_chat

Next we initialize the data

    # Use the global variable table to store the data
    table = None
    # Convert the image to be processed to base64
    image_b64 = image2b64("economic-assistance-chart.png")

    # Display the image that was read (display is available in Jupyter/IPython)
    from PIL import Image

    display(Image.open("economic-assistance-chart.png"))
    # Let the model try to work on the table by itself
    user_input = "show this table in string"
    chart_agent(image_b64, user_input, table)
    print(table)    # let's see what 'table' looks like now

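Step 4: Integrate the generated results into Gradio for output

    import gradio as gr

    def extract_image_info(image_path, user_requirement):
        # Image recognition or a call to a suitable model could go here to
        # extract information from the image.
        # For simplicity, assume we only return a simple description on request
        if user_requirement == "Describe the image":
            return "This is an image with rich content"
        else:
            return "Unsupported requirement"

    # The interface has two inputs, so the handler takes two arguments;
    # gr.Image(type="filepath") passes the path of the uploaded file
    def chart_agent_gr(image_path, user_requirement):
        return extract_image_info(image_path, user_requirement)

    iface = gr.Interface(
        fn=chart_agent_gr,
        inputs=[gr.Image(label="Upload image", type="filepath"), gr.Textbox(label="Enter your requirement")],
        outputs="text",
        title="Image information processing",
        description="Upload an image and enter a requirement to get the corresponding text"
    )

    iface.launch()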
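
The handler above only returns a placeholder string. To hook the interface up to the actual agent, one option is a small adapter that turns the uploaded file path into base64 and forwards it. This is a sketch under assumptions: `make_gradio_handler` and `echo_agent` are hypothetical names, and the agent is assumed to take `(image_b64, user_input, table)` and return something printable.

```python
import base64
import os
import tempfile


def make_gradio_handler(agent):
    """Return an (image_path, requirement) -> text function for gr.Interface."""
    def handler(image_path, user_requirement):
        # Encode the uploaded file the same way image2b64 does
        with open(image_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()
        # Start each request without a cached table
        return str(agent(image_b64, user_requirement, None))
    return handler


# Stub agent used only to show the wiring; it echoes what it received
def echo_agent(image_b64, user_input, table):
    return f"{user_input}: {len(image_b64)} base64 chars"


with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
    tmp.write(b"abc")  # three bytes encode to four base64 characters
try:
    output = make_gradio_handler(echo_agent)(tmp.name, "describe")
finally:
    os.remove(tmp.name)
print(output)
```

With the real agent in place, passing `fn=make_gradio_handler(chart_agent)` to `gr.Interface` would replace the placeholder handler.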

What are you waiting for? Go try it out!