From Basics to Advanced: Retrieving Biomedical Abstracts with AI Agents (Part 2)

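This is the second part of the series, in which we build an AI agent based on RAG (Retrieval-Augmented Generation) that answers users' questions and converses with literature abstracts.

In the previous part we set up the Streamlit user interface and the chat interface. Now we build the logic for fetching relevant biomedical abstracts from the PubMed database, searching on the natural-language question the user asks. Along the way, we use a large language model (LLM) to improve our PubMed search results!

As a reminder, this is the solution we will build over the course of the series: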

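Problem statement

The AI agent's user interface provides an input box where users can ask questions. Here are some example questions:

What significant progress has been made over the past five years in treating Alzheimer's disease with monoclonal antibodies?
What are the latest treatments for triple-negative breast cancer?
What are the recent advances in applying artificial intelligence to diagnostic radiology?

With the help of an LLM, we turn these specialist questions into PubMed queries and retrieve the matching abstracts. The abstracts then ground the agent's answers in biomedical data.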

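Build process

Recap of completed steps

If you have not completed the previous part yet, please do so first, because we build on it here. At the end of the previous part, our project structure looked like this: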

.
├── app
│   ├── app.py
│   ├── components
│   │   ├── chat_utils.py
│   │   ├── llm.py
│   │   └── prompts.py
│   └── tests
│       └── test_chat_utils.py
├── assets
│   ├── m.png
│   └── favicon32-32.ico
└── environment
    └── requirements.txt

Installed dependencies

In this part we use a few dependencies in addition to those installed in the previous part. These are the new entries in requirements.txt:

pydantic==2.8.2
metapub==0.5.12

Creating the new "backend" module

In this article we build a new module named backend under the /app subfolder.
The part of the project we build today looks like this:

.
└── app
    ├── app.py
    └── backend
       ├── abstract_retrieval
       │   ├── interface.py
       │   ├── pubmed_retriever.py
       │   ├── pubmed_query_simplification.py
       │   └── translation_query.py
       └── data_repository
           └── models.py

app/backend/data_repository

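To abstract the data layer, we use the data_repository module. In this module we define the data models and the logic for interacting with the database (the database interaction itself is covered in detail in part three of the series).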

models.py

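In models.py we create a data model for biomedical abstracts. It stores the basic information about an abstract: title, DOI, authors, publication year and abstract content (title, doi, authors, year, abstract_content). We define the model with the Pydantic library.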

from typing import Optional
from pydantic import BaseModel

class ScientificAbstract(BaseModel):
    doi: Optional[str]
    title: Optional[str]
    authors: Optional[list]
    year: Optional[int]
    abstract_content: str
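
To make the model's behavior concrete, here is a minimal sketch of how it validates input, assuming Pydantic 2 (the version pinned in requirements.txt). The model is repeated so the snippet runs on its own, and all sample values are invented for illustration. One detail worth knowing: in Pydantic 2, a field annotated Optional[...] without a default value is still required. It accepts None, but it cannot be left out.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class ScientificAbstract(BaseModel):
    doi: Optional[str]
    title: Optional[str]
    authors: Optional[list]
    year: Optional[int]
    abstract_content: str


# Every field is supplied; None is accepted for the Optional ones.
# The values below are invented for illustration only.
example = ScientificAbstract(
    doi=None,
    title="An illustrative title",
    authors=["Doe J", "Roe R"],
    year="2021",  # a numeric string is coerced to int in Pydantic's default (lax) mode
    abstract_content="Illustrative abstract text.",
)
print(type(example.year).__name__, example.year)

# Leaving fields out raises a ValidationError, even for Optional[...] fields
try:
    ScientificAbstract(abstract_content="Only the abstract is given.")
    missing_fields = []
except ValidationError as err:
    missing_fields = sorted(e["loc"][0] for e in err.errors())
print(missing_fields)
```

Because the retriever always passes all five fields explicitly, the required-but-nullable behavior causes no problems in this tutorial. Adding `= None` defaults would be the usual way to make the fields truly optional.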

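Pydantic is my favorite library for handling the data layer in code! To learn more about the benefits of Pydantic, see the official documentation.

This data model is independent of the type of (vector) database we choose. Databases, and how to build a vector index, are discussed in detail in the third article of the series.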

app/backend/abstract_retrieval

The abstract_retrieval module contains the logic for retrieving abstracts.

interface.py

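The job of interface.py is to decouple the clients of abstract retrieval from its concrete implementation (such as pubmed_retriever.py).
This makes the solution more extensible, so that other literature sources, such as Wikipedia or Scopus, can be added easily later. It keeps the application easier to maintain and extend, and it guarantees that the inputs and outputs of abstract retrieval stay the same even though the actual retrieval implementations differ.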

from abc import ABC, abstractmethod
from typing import List
from backend.data_repository.models import ScientificAbstract


class AbstractRetriever(ABC):

    @abstractmethod
    def get_abstract_data(self, scientist_question: str) -> List[ScientificAbstract]:
        """ Retrieve a list of scientific abstracts based on a given query. """
        raise NotImplementedError
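
As a rough illustration of what this decoupling buys us, the sketch below plugs a second, purely hypothetical source into the same interface. InMemoryAbstractRetriever and count_hits are made-up names, and ScientificAbstract is replaced by a small standard-library dataclass so the snippet runs without the project installed.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScientificAbstract:
    """Stand-in for the Pydantic model, so the sketch runs without the project."""
    title: Optional[str]
    abstract_content: str


class AbstractRetriever(ABC):
    @abstractmethod
    def get_abstract_data(self, scientist_question: str) -> List[ScientificAbstract]:
        """Retrieve a list of scientific abstracts based on a given query."""
        raise NotImplementedError


class InMemoryAbstractRetriever(AbstractRetriever):
    """Hypothetical second source: keyword search over a fixed list of abstracts."""

    def __init__(self, corpus: List[ScientificAbstract]):
        self.corpus = corpus

    def get_abstract_data(self, scientist_question: str) -> List[ScientificAbstract]:
        words = scientist_question.lower().split()
        return [a for a in self.corpus
                if any(w in a.abstract_content.lower() for w in words)]


def count_hits(retriever: AbstractRetriever, question: str) -> int:
    """Client code: depends only on the interface, not on a concrete source."""
    return len(retriever.get_abstract_data(question))


corpus = [
    ScientificAbstract("A", "Gene therapy for hereditary blindness."),
    ScientificAbstract("B", "Gut microbiota in health and disease."),
]
hits = count_hits(InMemoryAbstractRetriever(corpus), "microbiota")
print(hits)
```

The client function never mentions PubMed, so swapping the source means changing only the object that is passed in. Instantiating AbstractRetriever directly raises a TypeError, which catches implementations that forget to define get_abstract_data.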

pubmed_retriever.py

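This .py file contains the concrete implementation of the AbstractRetriever abstract class.
We use the metapub library, which wraps the PubMed API, to run PubMed searches and fetch abstracts.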

from typing import List
import time
import random
from metapub import PubMedFetcher
from backend.data_repository.models import ScientificAbstract
from backend.abstract_retrieval.interface import AbstractRetriever
from config.logging_config import get_logger


class PubMedAbstractRetriever(AbstractRetriever):
    def __init__(self, pubmed_fetch_object: PubMedFetcher):
        # Store the PubMedFetcher client and set up the logger
        self.pubmed_fetch_object = pubmed_fetch_object
        self.logger = get_logger(__name__)

    def _get_abstract_list(self, query: str) -> List[str]:
        # Get the list of PubMed IDs for the given query
        self.logger.info(f'Searching PubMed for query: {query}')
        return self.pubmed_fetch_object.pmids_for_query(query)

    def _get_abstracts(self, pubmed_ids: List[str]) -> List[ScientificAbstract]:
        # Fetch the abstract for each PubMed ID
        self.logger.info(f'Fetching abstract data for the following PubMed IDs: {pubmed_ids}')
        scientific_abstracts = []
        initial_delay = 1  # initial delay in seconds
        max_attempts = 10  # maximum number of attempts per PubMed ID

        for pmid in pubmed_ids:
            for attempt in range(max_attempts):
                try:
                    article = self.pubmed_fetch_object.article_by_pmid(pmid)

                    # No abstract available: skip this PubMed ID without retrying
                    if article.abstract is None:
                        self.logger.warning(f'No abstract found for PubMed ID {pmid}, skipping...')
                        break

                    # Make sure the authors field is a list
                    authors = article.authors
                    if isinstance(authors, str):  # split a string into a list
                        authors = authors.split(', ')

                    # Build the ScientificAbstract object
                    abstract_formatted = ScientificAbstract(
                        doi=article.doi,
                        title=article.title,
                        authors=authors,
                        year=article.year,
                        abstract_content=article.abstract
                    )

                    scientific_abstracts.append(abstract_formatted)
                    break

                except Exception as e:
                    # The request failed: back off exponentially, with random jitter
                    wait_time = initial_delay * (2 ** attempt) + random.uniform(0, 1)
                    self.logger.warning(f'Attempt {attempt + 1} for PubMed ID {pmid} failed. Error: {e}. Retrying in {wait_time:.2f} seconds...')
                    time.sleep(wait_time)
            else:
                # The loop ended without a break, so every attempt failed
                self.logger.error(f'Failed to fetch the abstract for PubMed ID {pmid} after {max_attempts} attempts')

        self.logger.info(f'Fetched {len(scientific_abstracts)} abstracts in total')
        return scientific_abstracts

    def get_abstract_data(self, scientist_question: str) -> List[ScientificAbstract]:
        # Retrieve the abstracts for the user's query
        pmids = self._get_abstract_list(scientist_question)  # list of PubMed IDs
        abstracts = self._get_abstracts(pmids)  # the corresponding abstracts
        return abstracts

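The pmids_for_query method of the metapub library takes the user's free-form query (for example, 'what is the relationship between dental cavities and osteoporosis') and returns the PubMed IDs of the matching articles.
Behind the scenes, the PubMed search engine expands the query automatically (a feature PubMed calls Automatic Term Mapping): it matches the query terms against its vocabulary, including MeSH subject headings and their synonyms, and rewrites them into a concrete PubMed query.
The article_by_pmid method (wrapped by _get_abstracts) then fetches the article, including its abstract, for a given PubMed ID (pmid).
You can run the code with any query you like, for example: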

from metapub import PubMedFetcher
from backend.abstract_retrieval.pubmed_retriever import PubMedAbstractRetriever

pubmed_fetch = PubMedAbstractRetriever(PubMedFetcher())
abstracts = pubmed_fetch.get_abstract_data('what is the relationship between dental cavities and osteoporosis')

for abstract in abstracts:
    print(f'doi: {abstract.doi} \n title: {abstract.title} \n author: {abstract.authors} \n content:{abstract.abstract_content} \n \n')
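
The retry loop inside _get_abstracts uses exponential backoff with random jitter. The sketch below isolates that idea in two helper functions so it can be tried without calling PubMed. backoff_delay and call_with_retries are hypothetical names that are not part of the tutorial code, and the injectable sleep argument is an addition that lets the demo record the delays instead of actually waiting.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, initial_delay: float = 1.0) -> float:
    """Delay before the next try: initial_delay * 2**attempt plus up to 1 s of jitter."""
    return initial_delay * (2 ** attempt) + random.uniform(0, 1)


def call_with_retries(func: Callable[[], T], max_attempts: int = 10,
                      sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call func until it succeeds or max_attempts is reached; return None on failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            sleep(backoff_delay(attempt))
    return None


# Demo: a function that fails twice with a simulated error, then succeeds
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


waits: List[float] = []
result = call_with_retries(flaky, sleep=waits.append)  # record delays instead of sleeping
print(result, len(waits))
```

With the values used in the listing (a 1 second initial delay and 10 attempts), the delays for a PubMed ID that never succeeds add up to at least 1 + 2 + 4 + ... + 512 = 1023 seconds, so a lower max_attempts may be preferable for interactive use.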

translation_query.py

Because PubMed only supports searching in English, we need a translation step that turns the user's input into English.

from langchain_core.prompts import PromptTemplate
from components.llm import llm


def translation_chain(scientist_question: str) -> str:
    """ Translate a Chinese question into English """
    prompt_formatted_str = translation_prompt.format(question=scientist_question)
    return llm.invoke(prompt_formatted_str).content


translation_prompt = PromptTemplate.from_template("""
  You are an expert in biomedical terminology. Your task is to translate the following Chinese question into English, ensuring the correct use of scientific and medical terms. Please focus on preserving the meaning and accurately translating any biomedical concepts.

  Chinese Question: {question}

  Only output the English translation, nothing else.
""")
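
Calling the LLM costs time and tokens, so translation is only worth doing when the input actually contains Chinese. The retriever shown later in this article checks this with a regular expression. Here is that check as a standalone sketch, with needs_translation as a hypothetical helper name.

```python
import re

# CJK Unified Ideographs block, which covers the commonly used Chinese characters
CHINESE_CHAR = re.compile(r'[\u4e00-\u9fff]')


def needs_translation(query: str) -> bool:
    """Return True if the query contains at least one Chinese character."""
    return CHINESE_CHAR.search(query) is not None


print(needs_translation("三阴性乳腺癌的最新治疗方法有哪些？"))
print(needs_translation("latest treatments for triple-negative breast cancer"))
print(needs_translation("CRISPR 技术"))
```

The check is deliberately narrow: a query in another non-English language contains no characters from this range and would be passed to PubMed untranslated.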

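Optimizing the PubMed query with an LLM

Sometimes the user's query is long or complex. Consider this question, for example:

Has there been any significant progress in Alzheimer's disease treatment using monoclonal antibodies in the last five years?

If we search with this query directly, our PubMed retrieval client may return no results at all. But if we simplify the query to:

"Alzheimer's disease treatment using monoclonal antibodies"

we get many relevant results. Simplifying the user's query is therefore sometimes necessary, and an LLM (large language model) is well suited to the job!

Note: in the example above, simplifying the query dropped specifics such as "the last five years". These matter to the user, who cares about that time window. For now, however, we retrieve all relevant literature on the topic regardless of publication date. How to restrict the answers to articles from the last five years during question answering is discussed in a later part.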
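
For readers who want to experiment before then, one possible alternative is to push the time window into the search itself. This is not the approach taken later in the series. The sketch assumes PubMed's publication-date range syntax (start:end followed by the [dp] field tag), and add_year_filter is a hypothetical helper.

```python
from datetime import date
from typing import Optional


def add_year_filter(query: str, last_n_years: int, today: Optional[date] = None) -> str:
    """Append a publication-date range covering the last N years to a PubMed query."""
    today = today or date.today()
    start_year = today.year - last_n_years
    # Parentheses keep the original terms together before the AND clause
    return f'({query}) AND {start_year}:{today.year}[dp]'


# A fixed date makes the example reproducible
filtered = add_year_filter("Alzheimer's disease monoclonal antibodies", 5,
                           today=date(2024, 6, 1))
print(filtered)
```

The resulting string could be passed to the retriever in place of the simplified query. Whether that gives better answers than filtering by year after retrieval depends on the use case.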

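Building the prompt that simplifies a user query into a PubMed query

First we need a prompt for the LLM that uses a few examples to show when a query needs simplifying and when it does not.
Create a new file pubmed_query_simplification.py in the abstract_retrieval module. It contains the simplification prompt with its examples, and it defines a function that wraps the prompt and calls the LLM (the one defined in the first part of the tutorial) to produce the simplified query.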

from langchain_core.prompts import PromptTemplate
from components.llm import llm


def simplify_pubmed_query(scientist_question: str) -> str:
    """ Transform verbose queries to simplified queries for PubMed """
    prompt_formatted_str = pubmed_query_simplification_prompt.format(question=scientist_question)
    return llm.invoke(prompt_formatted_str).content

pubmed_query_simplification_prompt = PromptTemplate.from_template("""
    You are an expert in biomedical search queries. Your task is to simplify verbose and detailed user queries into concise and effective search queries suitable for the PubMed database. Focus on capturing the essential scientific or medical elements relevant to biomedical research.

    Here are examples of the queries that need simplification, and what the simplification should look like:

    Example 1:
    Verbose Query: Has there been any significant progress in Alzheimer's disease treatment using monoclonal antibodies in the last five years?
    Is simplification needed here: Yes.
    Simplified Query: Alzheimer's disease monoclonal antibodies treatment progress

    Example 2:
    Verbose Query: What are the latest findings on the impact of climate change on the incidence of vector-borne diseases in tropical regions?
    Is simplification needed here: Yes.
    Simplified Query: Climate change and vector-borne diseases in tropics

    Example 3:
    Verbose Query: Can you provide detailed insights into the recent advancements in gene therapy for treating hereditary blindness?
    Is simplification needed here: Yes.
    Simplified Query: Gene therapy for hereditary blindness advancements

    Example 4:
    Verbose Query: I am interested in understanding how CRISPR technology has been applied in the development of cancer therapies over the recent years.
    Is simplification needed here: Yes.
    Simplified Query: CRISPR technology in cancer therapy development

    Example 5:
    Verbose Query: Alzheimer's disease and amyloid plaques
    Is simplification needed here: No.
    Simplified Query: Alzheimer's disease and amyloid plaques

    Example 6:
    Verbose Query: Effects of aerobic exercise on elderly cognitive function
    Is simplification needed here: No.
    Simplified Query: Effects of aerobic exercise on elderly cognitive function

    Example 7:
    Verbose Query: Molecular mechanisms of insulin resistance in type 2 diabetes
    Is simplification needed here: No.
    Simplified Query: Molecular mechanisms of insulin resistance in type 2 diabetes

    Example 8:
    Verbose Query: Role of gut microbiota in human health and disease
    Is simplification needed here: No.
    Simplified Query: Role of gut microbiota in human health and disease

    This is the user query:
    {question}

    Only decide to simplify the user's question if it is verbose. If it is already simple enough, just return the original user question.
    Only output the simplified query, or the original query if it is simple enough already, nothing else!
""")
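
Even with the "nothing else" instruction, a model may occasionally wrap its answer in quotes or echo the "Simplified Query:" label from the examples. Since the retriever compares the returned string with the original query and sends it straight to PubMed, a small cleanup step can help. The following is only a sketch; clean_llm_query is a hypothetical helper that the tutorial code does not include.

```python
def clean_llm_query(raw_output: str, original_query: str) -> str:
    """Normalise an LLM reply into a bare query string (hypothetical helper)."""
    cleaned = raw_output.strip()

    # Drop a leading label such as "Simplified Query:" if the model echoes it
    label = "simplified query:"
    if cleaned.lower().startswith(label):
        cleaned = cleaned[len(label):].strip()

    # Remove one pair of matching surrounding quotes
    if len(cleaned) >= 2 and cleaned[0] == cleaned[-1] and cleaned[0] in "\"'":
        cleaned = cleaned[1:-1].strip()

    # Fall back to the user's own query if nothing usable is left
    return cleaned or original_query


print(clean_llm_query('  Simplified Query: "CRISPR technology in cancer therapy development"\n',
                      "original question"))
```

It could be applied to the value returned by simplify_pubmed_query, for example as clean_llm_query(llm.invoke(prompt).content, scientist_question).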

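Now we return to the retrieval logic and add query simplification as an option to the PubMed search, by adding a new method _simplify_pubmed_query and an extra parameter to _get_abstract_list and get_abstract_data. The listing also adds _translation_chain, which translates Chinese input into English before the search: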

from typing import List
import re
import time
import random
from metapub import PubMedFetcher
from backend.data_repository.models import ScientificAbstract
from backend.abstract_retrieval.interface import AbstractRetriever
from backend.abstract_retrieval.pubmed_query_simplification import simplify_pubmed_query
from backend.abstract_retrieval.translation_query import translation_chain
from config.logging_config import get_logger


class PubMedAbstractRetriever(AbstractRetriever):
    def __init__(self, pubmed_fetch_object: PubMedFetcher):
        # Store the PubMedFetcher client and set up the logger
        self.pubmed_fetch_object = pubmed_fetch_object
        self.logger = get_logger(__name__)

    def _simplify_pubmed_query(self, query: str, simplification_function: callable = simplify_pubmed_query) -> str:
        # Simplify the query with the given simplification function
        return simplification_function(query)

    def _translation_chain(self, query: str, translation_function: callable = translation_chain) -> str:
        # Translate the query only if it contains Chinese characters
        contains_chinese = bool(re.search('[\u4e00-\u9fff]', query))
        if contains_chinese:
            trans_query = translation_function(query)
            self.logger.info(f'Input is Chinese, English translation: {trans_query}')
            return trans_query
        else:
            self.logger.info('No Chinese characters found in the input, no translation needed')
            return query

    def _get_abstract_list(self, query: str, simplify_query: bool = True) -> List[str]:
        # Get the list of PubMed IDs for the given query
        if simplify_query:
            # Simplify the query if requested
            self.logger.info(f'Trying to simplify the user query {query}')
            query_simplified = self._simplify_pubmed_query(query)

            if query_simplified != query:
                self.logger.info(f'Initial query simplified to: {query_simplified}')
                query = query_simplified
            else:
                self.logger.info('Initial query is simple enough, no simplification needed')

        self.logger.info(f'Searching PubMed for query: {query}')
        return self.pubmed_fetch_object.pmids_for_query(query)

    def _get_abstracts(self, pubmed_ids: List[str]) -> List[ScientificAbstract]:
        # Fetch the abstract for each PubMed ID
        self.logger.info(f'Fetching abstract data for the following PubMed IDs: {pubmed_ids}')
        scientific_abstracts = []
        initial_delay = 1  # initial delay in seconds
        max_attempts = 10  # maximum number of attempts per PubMed ID

        for pmid in pubmed_ids:
            for attempt in range(max_attempts):
                try:
                    article = self.pubmed_fetch_object.article_by_pmid(pmid)

                    # No abstract available: skip this PubMed ID without retrying
                    if article.abstract is None:
                        self.logger.warning(f'No abstract found for PubMed ID {pmid}, skipping...')
                        break

                    # Make sure the authors field is a list
                    authors = article.authors
                    if isinstance(authors, str):  # split a string into a list
                        authors = authors.split(', ')

                    # Build the ScientificAbstract object
                    abstract_formatted = ScientificAbstract(
                        doi=article.doi,
                        title=article.title,
                        authors=authors,
                        year=article.year,
                        abstract_content=article.abstract
                    )

                    scientific_abstracts.append(abstract_formatted)
                    break

                except Exception as e:
                    # The request failed: back off exponentially, with random jitter
                    wait_time = initial_delay * (2 ** attempt) + random.uniform(0, 1)
                    self.logger.warning(f'Attempt {attempt + 1} for PubMed ID {pmid} failed. Error: {e}. Retrying in {wait_time:.2f} seconds...')
                    time.sleep(wait_time)
            else:
                # The loop ended without a break, so every attempt failed
                self.logger.error(f'Failed to fetch the abstract for PubMed ID {pmid} after {max_attempts} attempts')

        self.logger.info(f'Fetched {len(scientific_abstracts)} abstracts in total')
        return scientific_abstracts

    def get_abstract_data(self, scientist_question: str, simplify_query: bool = True) -> List[ScientificAbstract]:
        # Retrieve the abstracts for the user's query
        translation_question = self._translation_chain(scientist_question)  # translate if needed
        pmids = self._get_abstract_list(translation_question, simplify_query)  # list of PubMed IDs
        abstracts = self._get_abstracts(pmids)  # the corresponding abstracts
        return abstracts
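
Because the fetcher is passed into the constructor and the LLM-backed functions are default arguments, the query-preparation logic can be exercised without network access or an LLM. The sketch below mirrors the order used by get_abstract_data (translate if Chinese, then optionally simplify, then search) with stand-ins. prepare_query, FakeFetcher and the two fake functions are hypothetical and return fixed, made-up values.

```python
import re
from typing import Callable, List


def prepare_query(question: str,
                  translate: Callable[[str], str],
                  simplify: Callable[[str], str],
                  simplify_query: bool = True) -> str:
    """Mirror the retriever's order: translate if Chinese, then optionally simplify."""
    if re.search(r'[\u4e00-\u9fff]', question):
        question = translate(question)
    if simplify_query:
        question = simplify(question)
    return question


class FakeFetcher:
    """Stand-in for PubMedFetcher that records the query instead of calling PubMed."""

    def __init__(self) -> None:
        self.seen: List[str] = []

    def pmids_for_query(self, query: str) -> List[str]:
        self.seen.append(query)
        return ["111", "222"]  # made-up IDs


# Fake LLM steps with fixed answers, so the run is deterministic and offline
def fake_translate(question: str) -> str:
    return "What are the latest treatments for triple-negative breast cancer?"


def fake_simplify(question: str) -> str:
    return "triple-negative breast cancer treatment"


fetcher = FakeFetcher()
query = prepare_query("三阴性乳腺癌的最新治疗方法有哪些？", fake_translate, fake_simplify)
pmids = fetcher.pmids_for_query(query)
print(query, pmids)
```

The same idea applies to the real class: passing an object like FakeFetcher to PubMedAbstractRetriever, and fake functions to _simplify_pubmed_query and _translation_chain, would let a unit test check the control flow in isolation.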

Create a test script, for example test_pubmed_fetch.py, and test the PubMed retrieval client to compare the results with and without query simplification:

from metapub import PubMedFetcher
from backend.abstract_retrieval.pubmed_retriever import PubMedAbstractRetriever

# Initialize the PubMedFetcher and the abstract retriever
pubmed_fetcher = PubMedFetcher()
abstract_retriever = PubMedAbstractRetriever(pubmed_fetcher)

# Fetch abstracts directly, without query simplification
scientist_question = "Has there been any significant progress in Alzheimer's disease treatment using monoclonal antibodies in the last five years?"

abstracts_without_simplification = abstract_retriever.get_abstract_data(scientist_question, simplify_query=False)

# Simplify the query first (the default behavior), then fetch abstracts
abstracts_with_simplification = abstract_retriever.get_abstract_data(scientist_question)

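Note: when your search query concerns a very popular topic, retrieval can take a long time to complete.

From the log output you can see that without simplification no abstracts were retrieved, while with the simplified query 243 abstracts were retrieved.

Conclusion

In this article, From Basics to Advanced: Retrieving Biomedical Abstracts with AI Agents (Part 2), we built an abstract retrieval API with the metapub library. We used an LLM to optimize the user's natural-language query, which raises the chance of retrieving relevant results. In the next part, we will focus on saving the retrieved abstracts to a database and creating a vector index for the RAG system.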