PandaAI: A Handy Tool for Data Analysis in Natural Language

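Overview

PandaAI is an open-source LLM agent for processing and analyzing data through natural language. It converts plain natural-language input into executable code and can generate charts.

Quick Start

Installation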

pip install "pandasai>=3.0.0b2"

Getting an API Key

Log in to app.pandabi.ai/ and register to obtain an API key, then set it in your code:

import pandasai as pai
pai.api_key.set("your-key")

Loading Data

df = pai.read_csv("filepath")
dataset = pai.create("your-team/dataset-name", df)
dataset.push()

Basic Usage

Call the .chat method to ask questions about the dataset.

dataset.chat('Which are the top 5 countries by sales?')

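Output Formats

PandaAI supports several output formats:

○ DataFrame response: for tabular data.

○ Chart response: for visualizations.

○ Text response: for textual analysis and explanations.

○ Number response: for numeric output.

○ Error response: for error messages.

PandaAI works out which response type best fits the question and returns the answer in that form.

For the same dataset, different questions therefore produce different return formats: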

import pandasai as pai

df = pai.load("my-org/users")

# Returns text
response = df.chat("Who is the user with the highest age?")
# Returns a number
response = df.chat("How many users in total?")
# Returns a DataFrame
response = df.chat("Show me the data")
# Returns a chart
response = df.chat("Plot the distribution")
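
The response kinds above can be told apart by a type tag. The generated-code example in the few-shot section of this article ends with a dictionary of the form {"type": "number", "value": ...}; the sketch below assumes that same shape and shows one way calling code might branch on it. The function name describe_result and every tag other than "number" are assumptions made for illustration, not PandaAI's API.

```python
def describe_result(result: dict) -> str:
    """Return a one-line summary for a {"type": ..., "value": ...} result.

    The type tags used here are assumptions for illustration only.
    """
    kind = result.get("type")
    value = result.get("value")
    if kind == "number":
        return f"number: {value}"
    if kind == "string":
        return f"text: {value}"
    if kind == "dataframe":
        # A list of row dicts stands in for a real DataFrame here.
        return f"table with {len(value)} rows"
    if kind == "plot":
        # Chart results are assumed to carry a path to a saved image.
        return f"chart saved at {value}"
    # Anything unrecognized is treated as an error message.
    return f"error: {value}"


print(describe_result({"type": "number", "value": 42}))
print(describe_result({"type": "dataframe", "value": [{"age": 30}, {"age": 41}]}))
```

Branching on an explicit tag keeps the caller independent of how the value was computed, which is useful when the same question could plausibly be answered with a number or a table.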

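Data Preprocessing

PandaAI provides a rich set of data transformations for cleaning a dataset and simplifying the processing pipeline. The basic transformation types are:

● String transformations: case conversion, whitespace stripping, text truncation, fixed-width padding, regular-expression extraction, and so on.

● Numeric transformations: rounding, scaling, clipping, normalization, standardization, enforcing positive values, binning, and so on.

● Date and time transformations: time zone conversion, date formatting, conversion to datetime, date range validation, and so on.

● Data cleaning transformations: filling missing values, replacing values, removing duplicates, phone number normalization, and so on.

● Categorical transformations: one-hot encoding of categories, value mapping, category standardization, and so on.

● Column renaming: changing a column name to a new one.

● Validation transformations: email format validation, foreign-key reference validation, and so on.

● Privacy and security transformations: anonymizing sensitive data, and so on.

● Type conversion: converting data to numeric types, and so on.

These transformations can be specified either in a static file or programmatically.

Static File Approach

Add the transformations to the schema configuration file. Each entry only needs the transformation type and the column name, plus any extra parameters the transformation takes.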

transformations:
  - type: to_lowercase
    params:
      column: product_name
  - type: strip
    params:
      column: product_name
  - type: truncate
    params:
      column: product_name
      length: 50
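
To make the declarative style concrete, here is a minimal sketch of how a list of type/params steps like the one above could be interpreted. It works on plain row dictionaries and uses only the standard library; the function apply_transformations and its handler table are hypothetical and are not how PandaAI itself is implemented.

```python
def apply_transformations(rows, transformations):
    """Apply a list of {"type", "params"} steps to a list of row dicts."""
    # Hypothetical handlers; each takes a cell value plus the step's params.
    handlers = {
        "to_lowercase": lambda value, params: value.lower(),
        "strip": lambda value, params: value.strip(),
        "truncate": lambda value, params: value[: params["length"]],
    }
    result = [dict(row) for row in rows]  # copy so the input stays untouched
    for step in transformations:
        handler = handlers[step["type"]]
        column = step["params"]["column"]
        for row in result:
            row[column] = handler(row[column], step["params"])
    return result


# The same three steps as the schema fragment above, written as Python data.
steps = [
    {"type": "to_lowercase", "params": {"column": "product_name"}},
    {"type": "strip", "params": {"column": "product_name"}},
    {"type": "truncate", "params": {"column": "product_name", "length": 50}},
]
rows = [{"product_name": "  Wireless MOUSE  "}]
print(apply_transformations(rows, steps))
```

Steps run in the order they are listed, so stripping before truncating means the length limit applies to the cleaned text rather than to leading whitespace.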

Programmatic Approach

This approach is more flexible: transformations are applied through TransformationManager, which supports method chaining.

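import pandasai as pai
# Assumed import path for pandasai 3.x; verify it against your installed version
from pandasai.data_loader.transformation_manager import TransformationManager

df = pai.read_csv("data.csv")
manager = TransformationManager(df)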
result = (manager
    .validate_email("email", drop_invalid=True)
    .normalize_phone("phone")
    .validate_date_range("birth_date", "1900-01-01", "2024-01-01")
    .remove_duplicates("user_id")
    .ensure_positive("amount")
    .standardize_categories("company", {"Apple Inc.": "Apple"})
    .df)
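
Chaining works because each transformation method returns the manager itself, so the next call can be attached directly. The sketch below illustrates that pattern with a small hypothetical class, MiniManager, operating on row dictionaries. It borrows three method names from the example above, but the behavior shown (for instance, dropping non-positive rows in ensure_positive) is an assumption for this sketch and may differ from PandaAI's.

```python
class MiniManager:
    """Hypothetical stand-in for a chainable transformation manager."""

    def __init__(self, rows):
        self.rows = [dict(row) for row in rows]  # work on copies

    def remove_duplicates(self, column):
        seen = set()
        unique = []
        for row in self.rows:
            if row[column] not in seen:
                seen.add(row[column])
                unique.append(row)
        self.rows = unique
        return self  # returning self is what enables chaining

    def ensure_positive(self, column):
        # Assumption for this sketch: rows with non-positive values are dropped.
        self.rows = [row for row in self.rows if row[column] > 0]
        return self

    def standardize_categories(self, column, mapping):
        for row in self.rows:
            row[column] = mapping.get(row[column], row[column])
        return self


rows = [
    {"user_id": 1, "amount": 10.0, "company": "Apple Inc."},
    {"user_id": 1, "amount": 10.0, "company": "Apple Inc."},
    {"user_id": 2, "amount": -5.0, "company": "Google"},
]
cleaned = (MiniManager(rows)
    .remove_duplicates("user_id")
    .ensure_positive("amount")
    .standardize_categories("company", {"Apple Inc.": "Apple"})
    .rows)
print(cleaned)
```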

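Configuration

Basic Settings

The following configuration parameters are available:

○ llm: the LLM to use.

○ save_logs: whether to save the LLM logs.

○ verbose: whether to print logs to the console.

○ max_retries: the maximum number of retries on failure.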

import pandasai as pai

pai.config.set({
   "llm": "openai",
   "save_logs": True,
   "verbose": False,
   "max_retries": 3
})
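
As a rough illustration of what a retry limit such as max_retries controls, the sketch below regenerates and re-executes code until execution succeeds or the retries are used up. The callables generate_code and execute are hypothetical, and treating max_retries as the number of retries after the first attempt is an assumption; PandaAI's internal loop may count differently.

```python
def run_with_retries(generate_code, execute, max_retries=3):
    """Call generate_code/execute until execution succeeds or retries run out.

    generate_code(last_error) and execute(code) are hypothetical callables
    supplied by the caller; max_retries counts retries after the first attempt.
    """
    last_error = None
    for _ in range(max_retries + 1):
        code = generate_code(last_error)
        try:
            return execute(code)
        except Exception as exc:  # a real system would catch narrower errors
            last_error = exc
    raise RuntimeError(f"gave up after {max_retries} retries") from last_error


# Toy stand-ins: the first "generated" snippet is broken, the second works.
snippets = iter(["1 / 0", "6 * 7"])
print(run_with_retries(lambda err: next(snippets), eval))
```

Passing the last error back into the generator gives the model a chance to correct its previous attempt instead of repeating it.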

Model Settings

PandaAI supports multiple LLMs. By default it uses BambooLLM, developed by the PandaAI team, and it can also call other local or hosted models.

1. OpenAI models

import pandasai as pai
from pandasai_openai import OpenAI

# Set your OpenAI API key
llm = OpenAI(api_token="my-openai-api-key")

# Register the model as the LLM PandaAI should use
pai.config.set({"llm": llm})

2. Models via the LangChain interface

import pandasai as pai
from pandasai_langchain import LangchainLLM

llm = LangchainLLM(openai_api_key="my-openai-api-key")

pai.config.set({"llm": llm})

3. Ollama

import pandasai as pai
from pandasai_local import LocalLLM

ollama_llm = LocalLLM(api_base="http://localhost:11434/v1", model="codellama")

pai.config.set({"llm": ollama_llm})

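Few-Shot Learning

PandaAI supports few-shot learning to improve query results in specific scenarios.

1. Instruction training

Train the agent by giving it general instructions on how the model should handle particular kinds of queries.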

import pandasai as pai
from pandasai import Agent

pai.api_key.set("your-pai-api-key")

agent = Agent("data.csv")
agent.train(docs="The fiscal year starts in April")

response = agent.chat("What is the total sales for the fiscal year?")
print(response)
# The model will use the information provided in the training to generate a response

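2. Q&A training

Train the agent by supplying the expected answer for specific questions (that is, question-answer pairs), which improves the model's performance and determinism.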

from pandasai import Agent

agent = Agent("data.csv")

# Train the model
query = "What is the total sales for the current fiscal year?"
# The following code is passed as a string to the response variable
response = '\n'.join([
    'import pandas as pd',
    '',
    'df = dfs[0]',
    '',
    '# Calculate the total sales for the current fiscal year',
    'total_sales = df[df[\'date\'] >= pd.to_datetime(\'today\').replace(month=4, day=1)][\'sales\'].sum()',
    'result = { "type": "number", "value": total_sales }'
])

agent.train(queries=[query], codes=[response])

response = agent.chat("What is the total sales for the last fiscal year?")
print(response)

# The model will use the information provided in the training to generate a response
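
This article does not describe how trained question-code pairs are stored or retrieved. One common way to use such pairs is to find the stored question closest to the new one and prepend it to the prompt as an example. The sketch below does this with the standard library's difflib for string similarity, swapped in for the embedding-based search a real system would more likely use. The function names and the prompt layout are hypothetical.

```python
from difflib import SequenceMatcher


def most_similar_example(question, examples):
    """Return the (query, code) pair whose query is closest to the question."""
    return max(
        examples,
        key=lambda pair: SequenceMatcher(
            None, question.lower(), pair[0].lower()
        ).ratio(),
    )


def build_few_shot_prompt(question, examples):
    """Prepend the closest stored example to the new question."""
    query, code = most_similar_example(question, examples)
    return (
        f"Example question: {query}\n"
        f"Example code:\n{code}\n\n"
        f"Question: {question}\n"
    )


examples = [
    ("What is the total sales for the current fiscal year?",
     "result = {'type': 'number', 'value': total_sales}"),
    ("Plot the age distribution",
     "result = {'type': 'plot', 'value': 'chart.png'}"),
]
print(build_few_shot_prompt(
    "What is the total sales for the last fiscal year?", examples
))
```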

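Summary

PandaAI is a typical Text2SQL agent. Its core idea is to use an LLM to turn a natural-language query into executable code. The .chat method wraps the whole process: it passes the question, the table header, and 5-10 rows of data to the model, which generates the most relevant code (Python or SQL). The generated code is then executed locally, and the result is returned in the form that best fits the question.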
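
The pipeline described above can be sketched in a few lines: build a prompt from the question, the header, and a handful of sample rows, obtain code from a model, run it locally, and read back a result dictionary. Everything here is a simplified assumption. fake_llm is a hypothetical stand-in that returns fixed code instead of calling a model, the table is a list of row dictionaries rather than a DataFrame, and the dfs and result names mirror the generated-code example in the few-shot section.

```python
import csv
import io


def build_table_prompt(question, csv_text, sample_rows=5):
    """Combine the question with the header and the first few data rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, sample = rows[0], rows[1 : 1 + sample_rows]
    lines = [",".join(header)] + [",".join(row) for row in sample]
    return (
        "Columns and sample rows:\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}"
    )


def fake_llm(prompt):
    """Hypothetical stand-in for a model call; returns fixed Python code."""
    return (
        "total = sum(float(row['sales']) for row in dfs[0])\n"
        "result = {'type': 'number', 'value': total}\n"
    )


def answer(question, csv_text):
    """Run the generated code locally and return its result dictionary."""
    table = list(csv.DictReader(io.StringIO(csv_text)))
    code = fake_llm(build_table_prompt(question, csv_text))
    namespace = {"dfs": [table]}
    # exec runs arbitrary code; real use needs sandboxing and validation.
    exec(code, namespace)
    return namespace["result"]


data = "country,sales\nChina,100\nJapan,50\n"
print(answer("What is the total sales?", data))
```

Only the header and a small sample go into the prompt, so the full dataset stays on the local machine while the generated code still runs against all rows.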

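Most Text2SQL tools work in much the same way. PandaAI additionally offers a set of data preprocessing methods and automatic selection of the output format, which are nice extras.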