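Large language models such as ChatGPT cannot yet read network resources directly, and the length of a single user input is limited. Frameworks such as LangChain emerged to work around these two limitations. They handle the preprocessing, for example extracting the text from a PDF found online and handing it to the model to summarize. When the text is too long for a single input, it can be split into chunks and processed with different chain types. For summarization LangChain offers three chain types: stuff, map_reduce, and refine. A fourth, map_rerank, is available for question answering.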
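To make the splitting step concrete, here is a minimal sketch of a fixed-size splitter. It is not how LangChain's splitters work internally (those split on separators first); `split_text` and its parameters are hypothetical names chosen for illustration.

```python
from typing import List


def split_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


print(split_text("abcdefghij", chunk_size=4))
```

A real splitter would usually count tokens rather than characters and try to break at paragraph or sentence boundaries.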
stuff
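Feeds all of the content to the model in one go.
- Pros: the model is called only once, and it sees the complete context.
- Cons: only works for short texts. Current models cap the length of a single input, so long texts do not fit this mode.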
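As a rough sketch of the idea (not LangChain's actual implementation), stuff amounts to joining every chunk into one prompt and making a single call. `stuff_summarize`, `fake_llm`, and the `max_chars` limit are hypothetical; a real model limits tokens, not characters.

```python
from typing import Callable, List


def stuff_summarize(chunks: List[str], llm: Callable[[str], str], max_chars: int = 4000) -> str:
    """Join all chunks into one prompt and call the model exactly once."""
    prompt = "Write a concise summary of the following:\n\n" + "\n\n".join(chunks)
    if len(prompt) > max_chars:
        # Hypothetical limit for illustration; real models limit tokens.
        raise ValueError("text too long for a single call; use map_reduce or refine")
    return llm(prompt)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: reports how much text it received."""
    return f"summary of {len(prompt)} characters"


print(stuff_summarize(["first paragraph", "second paragraph"], fake_llm))
```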
map_reduce
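First splits the long document into smaller chunks, then calls the model on each chunk to produce a summary, and finally merges the per-chunk summaries into a summary of the whole text.
- Pros: can handle very long texts.
- Cons: a single task needs several model calls. Information can be lost when the summaries are merged, because the model never sees the complete context the way it does with stuff.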
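The two phases can be sketched as follows. This is a simplified illustration, not LangChain's code; `map_reduce_summarize`, `fake_llm`, and the prompt wording are assumptions. Because the chunks are independent, the map phase can run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def map_reduce_summarize(chunks: List[str], llm: Callable[[str], str]) -> str:
    """Summarize each chunk independently, then summarize the summaries."""
    # Map step: chunks do not depend on each other, so calls can run in parallel.
    with ThreadPoolExecutor(max_workers=4) as pool:
        partial_summaries = list(pool.map(lambda c: llm("Summarize:\n" + c), chunks))
    # Reduce step: one more call that only sees the partial summaries.
    return llm("Combine these summaries into one:\n" + "\n".join(partial_summaries))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: echoes the start of the text it was given."""
    body = prompt.split("\n", 1)[1]
    return body[:20]


print(map_reduce_summarize(["chunk one text", "chunk two text"], fake_llm))
```

For n chunks this makes n + 1 model calls, and the reduce call sees only the summaries, which is where detail can get lost.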
refine
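Splits the long text into chunks, summarizes the first chunk, then combines that summary with the second chunk to produce an updated summary, and so on until the whole article has been summarized.
- Pros: this way of merging loses less information than map_reduce.
- Cons: needs several model calls and depends on chunk order. Each chunk's summary builds on the result of the previous chunk, so the per-chunk steps cannot run in parallel and the whole task becomes a serial process.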
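A minimal sketch of that sequential loop, assuming hypothetical names (`refine_summarize`, `fake_llm`) and made-up prompt wording rather than LangChain's real prompts:

```python
from typing import Callable, List


def refine_summarize(chunks: List[str], llm: Callable[[str], str]) -> str:
    """Build the summary sequentially, one chunk at a time."""
    if not chunks:
        raise ValueError("need at least one chunk")
    summary = llm("Summarize:\n" + chunks[0])
    for chunk in chunks[1:]:
        # Each step needs the previous summary, so calls cannot run in parallel.
        summary = llm(
            "Existing summary:\n" + summary
            + "\nRefine it with this additional text:\n" + chunk
        )
    return summary


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: reports the size of its input."""
    return f"summary built from {len(prompt)} characters"


print(refine_summarize(["chunk one", "chunk two", "chunk three"], fake_llm))
```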
Map-Rerank
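Processes each chunk of the article separately. Each call returns a result together with a score for how correct that result is, and the highest-scoring result is returned.
- Pros: can handle long texts.
- Cons: cannot merge the results into a conclusion about the whole text. It suits retrieval within a document, where the answer comes from a single chunk.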
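The selection step can be sketched like this. As a simplifying assumption, the model callable here returns an `(answer, score)` tuple directly; `map_rerank` and `fake_llm` are hypothetical names, and the keyword-based scoring is only a stand-in for a model's own confidence score.

```python
from typing import Callable, List, Tuple


def map_rerank(chunks: List[str], question: str,
               llm: Callable[[str], Tuple[str, float]]) -> str:
    """Ask the question against every chunk and keep the best-scored answer."""
    if not chunks:
        raise ValueError("need at least one chunk")
    scored = [llm(f"Context:\n{chunk}\nQuestion: {question}") for chunk in chunks]
    best_answer, _ = max(scored, key=lambda pair: pair[1])
    return best_answer


def fake_llm(prompt: str) -> Tuple[str, float]:
    """Stand-in for a real model call: scores high if the context mentions 1929."""
    context = prompt.split("\nQuestion:")[0]
    if "1929" in context:
        return ("It began in 1929.", 90.0)
    return ("Not answerable from this chunk.", 10.0)


chunks = ["IBM sold business machines.", "The Great Depression began in 1929."]
print(map_rerank(chunks, "When did the Great Depression begin?", fake_llm))
```

Only one chunk's answer survives, which is why this mode does not produce a whole-document summary.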
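Below we take a Toutiao article and summarize it with different chain types. The article text first has to be extracted; the code below reads it from a local text file.
map_reduce chain type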
import os

from langchain import OpenAI
from langchain.chains import AnalyzeDocumentChain
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import CharacterTextSplitter

os.environ["OPENAI_API_KEY"] = "sk-xxxxxx"

# Source article: https://www.toutiao.com/article/7237465072535208463
TEXT_URL = './toutiao/text_txt.txt'
# Chain type to use
CHAIN_TYPE = "map_reduce"

# Read the extracted article text
with open(TEXT_URL, 'r', encoding='utf-8') as f:
    article_text = f.read()

llm = OpenAI(temperature=0)

# Text splitter: chunks of up to 1500 characters, no overlap
text_splitter = CharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=0,
    length_function=len,
)

# Generate the summary
summary_chain = load_summarize_chain(llm, chain_type=CHAIN_TYPE)
summarize_document_chain = AnalyzeDocumentChain(
    combine_docs_chain=summary_chain,
    text_splitter=text_splitter,
)
res = summarize_document_chain.run(article_text)
print(res)
The generated summary is as follows:
This article examines the economic downturns of the past 100 years, focusing on the Great Depression of 1929 and the Japanese recession of the 1990s. It provides an analysis of the strategies used by successful companies to survive and thrive during economic downturns, and offers insight into the current Chinese economy. It identifies five industries that are likely to suffer in the event of a global economic downturn, and encourages readers to ask their own questions and develop strategies to survive an economic downturn.
refine chain type
# Chain type to use; the rest of the script stays the same
CHAIN_TYPE = "refine"
The generated summary is as follows:
This article discusses the economic downturns of the past 100 years and which industries have fared well during these times. It focuses on the Great Depression of 1929, which lasted for 10 years and had a devastating impact on the global economy. It provides an analysis of the companies and industries that have been able to survive and thrive during economic downturns, and offers insight into the current Chinese economy. During the Great Depression, 32,000 businesses and nearly 10,000 banks went bankrupt, and unemployment reached 25%. Despite the dire situation, some companies and entrepreneurs were able to weather the storm and even grow during the downturn. Examples include DuPont (nylon products), Procter & Gamble (bathroom products), Revlon (cosmetics), RCA (wireless radio) and IBM (commercial machines). In the 1970s, Microsoft (founded in 1975) and Apple (founded in 1976) emerged during the "stagflation" period. The article also looks at the impact of the Great Depression on other countries, such as Germany, where the Nazi Party rose to power and started World War II. In the US, the crisis was characterized by business closures, job losses, personal bankruptcies, and negative wealth effects. The only bright spot was the entertainment industry, particularly
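Use cases
- Generating smart summaries for long illustrated articles, to raise their conversion rate in the feed.
- Shortening illustrated articles so the same piece can be repurposed for apps with a multi-image, short-text format, such as Youshi (有柿) and Xiaohongshu (小红书).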
Author link: bento.me/kraken