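I. Why build a NotebookLM?
NotebookLM's core mission is to extract knowledge and present it in a more engaging, entertaining way. Now that large language models can extract knowledge with ease, presenting that knowledge in new forms (a podcast, for example) is a promising direction to explore. NotebookLM's format is more than a starting point; it is also a prompt to ask what other possibilities are still waiting to be discovered along this path.
That raises a fundamental question: how do we make knowledge more expressive?
We can start from basic forms such as audio, video, and images, then extend to richer ones: film, podcasts, comedy sketches, debates, staged conflict. Each form has its own appeal; the key is combining them well so that knowledge becomes more compelling as it is passed on.
To extend and apply the idea, we first need to understand how it is implemented today. The next section walks through that.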
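II. How to build a NotebookLM at zero cost
Here is a rough outline of how to build your own NotebookLM cheaply. Space is limited, so only the core code is shown; if readers ask for it, I can go into detail in follow-up posts. Follows are welcome:
1) Extract the PDF text with LangChain
This is where LangChain comes in. It is a heavy dependency, but it makes reading all kinds of documents convenient.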
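    # Requires the langchain-community and pypdf packages; older LangChain
    # releases expose the same loader as langchain.document_loaders.PyPDFLoader.
    from langchain_community.document_loaders import PyPDFLoader

    def get_pdf_content(file_path):
        """Load a PDF and return the text of all its pages as one string."""
        loader = PyPDFLoader(file_path)
        pages = loader.load_and_split()
        integrated_content = ""
        for page in pages:
            integrated_content += page.page_content + "\n"
        return integrated_content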
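2) Write an agent pool that rewrites the text into a podcast-like structure
The agent pool mainly makes it easy to create and destroy the agents that rewrite the original text.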
    # generate_agent_with_name_instruments and generate_response_with_agent_query
    # are helper functions whose definitions are not shown in this post.
    class AgentPool:
        def __init__(self, max_agents=20):
            self.max_agents = max_agents
            self.agents = []

        def add_agent(self, name, instructions, functions=None):
            # Use None instead of a shared mutable default argument.
            if functions is None:
                functions = []
            if len(self.agents) < self.max_agents:
                agent = generate_agent_with_name_instruments(name, instructions, functions)
                self.agents.append(agent)
                return agent
            else:
                raise Exception("Agent pool is full")

        # May return None
        def get_agent(self, name):
            for agent in self.agents:
                if agent.name == name:
                    return agent
            return None

        def remove_agent(self, name):
            self.agents = [agent for agent in self.agents if agent.name != name]

        def execute_and_release(self, name, query):
            agent = self.get_agent(name)
            if agent:
                response = generate_response_with_agent_query(agent, query)
                self.remove_agent(name)
                return response
            else:
                # No agent with this name exists, so tell the caller.
                raise Exception(f"Agent with name {name} not found")

    if __name__ == "__main__":
        # Example usage
        agent_pool = AgentPool()
        agent_pool.add_agent("Title assistant", "Helps generate and polish titles")
        response = agent_pool.execute_and_release("Title assistant", "How do I write a good title for a drink promotion?")
        print(response)
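To try the add, run once, release lifecycle without an LLM backend, the following is a minimal sketch. `StubAgent`, `MiniPool`, and the stub bodies of the two helper functions are hypothetical stand-ins invented for illustration; the real helpers presumably call a model.

```python
from dataclasses import dataclass, field


@dataclass
class StubAgent:
    """Hypothetical stand-in for a real LLM-backed agent."""
    name: str
    instructions: str
    functions: list = field(default_factory=list)


def generate_agent_with_name_instruments(name, instructions, functions):
    # Assumption: the real helper builds an LLM agent; this stub only stores the fields.
    return StubAgent(name, instructions, functions)


def generate_response_with_agent_query(agent, query):
    # Assumption: the real helper calls a model; this stub echoes its inputs.
    return f"[{agent.name}] {query}"


class MiniPool:
    """Condensed pool: add an agent, run it once, release its slot."""

    def __init__(self, max_agents=2):
        self.max_agents = max_agents
        self.agents = {}

    def add_agent(self, name, instructions, functions=None):
        if len(self.agents) >= self.max_agents:
            raise RuntimeError("Agent pool is full")
        agent = generate_agent_with_name_instruments(name, instructions, functions or [])
        self.agents[name] = agent
        return agent

    def execute_and_release(self, name, query):
        # Popping first frees the slot even if the query below raises.
        agent = self.agents.pop(name, None)
        if agent is None:
            raise KeyError(f"Agent with name {name} not found")
        return generate_response_with_agent_query(agent, query)


pool = MiniPool(max_agents=1)
pool.add_agent("title_helper", "Help write titles")
reply = pool.execute_and_release("title_helper", "A title for a drink promotion")
print(reply)             # [title_helper] A title for a drink promotion
print(len(pool.agents))  # 0
```

One design difference from the pool in the post: this sketch releases the agent before running the query, so a failed call cannot leave a stale agent occupying a slot.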
a) Writer agent: turns the extracted text into speaker-style dialogue
Required structure:
...
Speaker 1: My pleasure! This is such a fascinating field, and I'm glad we could explore it together. Until next time, keep asking those great questions!
Speaker 2: Will do! Thanks again, and see you next time.
...
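Implementation:
    # pool is an AgentPool instance and WRITE_PROMPT is the writer's prompt;
    # neither definition is shown in this post.
    async def writer_gen(content):
        pool.add_agent("writer_agent", WRITE_PROMPT)
        writer_res = pool.execute_and_release("writer_agent", content)
        return writer_res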
b) Rewriter agent: polishes the dialogue and returns it as an array
Required structure:
[
...
("Speaker 1", "My pleasure! This is such a fascinating field, and I'm glad we could explore it together. Until next time, keep asking those great questions!"),
("Speaker 2", "Will do! Thanks again, and see you next time.")
...
]
Implementation:
    # REWRITE_PROMPT is the rewriter's prompt; its definition is not shown in this post.
    async def rewriter_gen(writer_res):
        pool.add_agent("rewriter_agent", REWRITE_PROMPT)
        rewriter_res = pool.execute_and_release("rewriter_agent", writer_res)
        return rewriter_res
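If the rewriter returns its array as text, it has to be turned into real Python data before the speech step. The following is a sketch under the assumption that `rewriter_res` is a string containing a Python-style list of `(speaker, text)` pairs, possibly wrapped in extra prose; `parse_dialogue` is a hypothetical helper name.

```python
import ast


def parse_dialogue(raw):
    """Parse text such as '[("Speaker 1", "Hi"), ...]' into a list of (speaker, text) tuples."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no list literal found in the model output")
    # literal_eval only evaluates literals, so it never runs arbitrary code.
    data = ast.literal_eval(raw[start:end + 1])
    if not isinstance(data, list):
        raise ValueError("expected a list of (speaker, text) pairs")
    dialogue = []
    for item in data:
        is_pair = isinstance(item, (tuple, list)) and len(item) == 2
        if not is_pair or not all(isinstance(part, str) for part in item):
            raise ValueError(f"unexpected item: {item!r}")
        dialogue.append(tuple(item))
    return dialogue


sample = 'Here is the script:\n[("Speaker 1", "My pleasure!"), ("Speaker 2", "Will do!")]'
print(parse_dialogue(sample))
# [('Speaker 1', 'My pleasure!'), ('Speaker 2', 'Will do!')]
```

Validating each item up front means a malformed reply fails here, with a clear message, rather than halfway through audio generation.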
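3) Convert to speech
Convert the array to speech segment by segment, mainly using Parler-TTS.
It is fairly simple to use, and a text description controls whether the voice sounds male or female.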
    import torch
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer
    import soundfile as sf
    import logging
    import os
    from concurrent.futures import ProcessPoolExecutor

    # Raise the log level to cut down on warnings
    logging.getLogger("transformers").setLevel(logging.ERROR)

    def generate_audio_with_parler(text_prompt, description, output_file):
        device = "cuda" if torch.cuda.is_available() else "cpu"
        # Load model and tokenizer (this happens on every call)
        model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
        tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")
        # Tokenize inputs: the description controls the voice, the prompt is the text to speak
        input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
        prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)
        # Generate audio
        generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
        audio_arr = generation.cpu().numpy().squeeze()
        # If the output file already exists, delete it first
        if os.path.exists(output_file):
            os.remove(output_file)
            print(f"File {output_file} already existed and was removed.")
        sf.write(output_file, audio_arr, model.config.sampling_rate)

    if __name__ == "__main__":
        # Define text and description
        text_prompt = """
        Exactly! And the distillation part is where you take a LARGE-model,and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
        """
        male_prompt = """Zack's voice is smokey and deep magnetic, speaking at a moderate speed with a very close recording that almost has no background noise."""
        output_file = "output_male.wav"
        with ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
            # Consume the iterator so that exceptions raised in workers are not silently dropped
            list(executor.map(generate_audio_with_parler, [text_prompt], [male_prompt], [output_file]))
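Converting the array segment by segment implies two bookkeeping steps: one TTS job per dialogue turn, and one joined audio file at the end. The following is a sketch of both using only the standard library. The `VOICES` descriptions, the file-name pattern, and the helper names are hypothetical, and `concat_wavs` assumes every segment is a plain PCM WAV file with the same sample rate and channel count.

```python
import wave

# Hypothetical voice descriptions, one per speaker label.
VOICES = {
    "Speaker 1": "A female voice, speaking at a moderate speed with very clear audio.",
    "Speaker 2": "A deep male voice, speaking at a moderate speed with very clear audio.",
}


def plan_segments(dialogue, prefix="segment"):
    """Return one (text, description, output_file) job per dialogue turn."""
    jobs = []
    for i, (speaker, text) in enumerate(dialogue):
        jobs.append((text, VOICES[speaker], f"{prefix}_{i:03d}.wav"))
    return jobs


def concat_wavs(paths, output_file):
    """Join WAV files that share channel count, sample width and sample rate."""
    params = None
    chunks = []
    for path in paths:
        with wave.open(path, "rb") as src:
            fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                raise ValueError(f"{path} has a different audio format")
            chunks.append(src.readframes(src.getnframes()))
    if params is None:
        raise ValueError("no input files given")
    with wave.open(output_file, "wb") as dst:
        dst.setnchannels(params[0])
        dst.setsampwidth(params[1])
        dst.setframerate(params[2])
        for chunk in chunks:
            dst.writeframes(chunk)


dialogue = [("Speaker 1", "My pleasure!"), ("Speaker 2", "Will do!")]
jobs = plan_segments(dialogue)
print(jobs[1][2])  # segment_001.wav
```

Each job tuple matches the `(text_prompt, description, output_file)` argument order of `generate_audio_with_parler`, and the zero-padded index keeps the segment files in playback order when sorted by name.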
4) Generate subtitles
Read the audio with Whisper
    from faster_whisper import WhisperModel

    model = WhisperModel("base", device="cpu", compute_type="float32")
    ...
    segments, info = model.transcribe(
        str(audioPath),
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500)
    )
Convert to SRT
    from datetime import timedelta
    import srt

    subtitles = []
    for segment in segments:
        print(f"Transcript: {segment.text}")
        start = timedelta(seconds=segment.start)
        end = timedelta(seconds=segment.end)
        content = segment.text.strip()
        subtitles.append(
            srt.Subtitle(
                index=len(subtitles) + 1,
                start=start,
                end=end,
                content=content
            )
        )
    # Convert the subtitles to a string in SRT format
    srt_data = srt.compose(subtitles)
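For readers unfamiliar with the format, an SRT entry is an index line, a `start --> end` time line with millisecond timestamps, and the text. The following standard-library sketch only illustrates that layout; `srt_timestamp` and `srt_block` are hypothetical helpers, and the `srt` library used above should stay the actual tool for the job.

```python
from datetime import timedelta


def srt_timestamp(delta):
    """Format a timedelta as HH:MM:SS,mmm, the timestamp form used in SRT files."""
    total_ms = delta // timedelta(milliseconds=1)  # whole milliseconds as an int
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def srt_block(index, start, end, content):
    """Build one subtitle entry: index line, time line, text line."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{content}\n"


block = srt_block(1, timedelta(seconds=0), timedelta(seconds=2.5), "Will do!")
print(block)
# 1
# 00:00:00,000 --> 00:00:02,500
# Will do!
```

Note the comma before the milliseconds: SRT uses `,` rather than `.` as the decimal separator.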
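5) A podcast image
Any model that can generate images will do (Kimi or Mistral, for example). Generate something like this one, or like the cover image.
6) Generate the video
Now that we have the image, the audio, and the subtitles, we can generate the video with ffmpeg.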
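    import subprocess

    async def gen_audio_without_zoom(image_file, audio_file, subtitle_file, video_file):
        command = [
            "ffmpeg", "-y",
            "-loop", "1",                # repeat the still image for the whole video
            "-i", str(image_file),
            "-i", str(audio_file),
            "-vf", f"subtitles={str(subtitle_file)}:force_style='Fontsize=20'",  # burn in the subtitles
            "-c:v", "libx264",
            "-tune", "stillimage",
            "-c:a", "aac",
            "-shortest",                 # stop when the audio ends
            str(video_file)
        ]
        # subprocess.run blocks until ffmpeg exits, even inside this async function
        subprocess.run(command, check=True)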
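Because `subprocess.run` blocks, calling it directly inside an `async` function stalls the event loop while ffmpeg runs. One possible way around that, sketched here as an assumption rather than the author's approach, is to push the call onto a worker thread with `asyncio.to_thread` (Python 3.9+). The command below is a harmless stand-in so the example runs without ffmpeg; `run_command` is a hypothetical helper.

```python
import asyncio
import subprocess
import sys


async def run_command(command):
    """Run a blocking command in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(
        subprocess.run, command, check=True, capture_output=True, text=True
    )


async def main():
    # Stand-in command; in the pipeline this would be the ffmpeg argument list.
    command = [sys.executable, "-c", "print('done')"]
    result = await run_command(command)
    return result.stdout.strip()


output = asyncio.run(main())
print(output)  # done
```

With `check=True`, a non-zero exit code still raises `subprocess.CalledProcessError`, which propagates through the `await` to the caller.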
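The final result:
III. What is NotebookLM?
NotebookLM is an experimental AI tool from Google: a note-management tool built on a large language model (LLM) that helps users organize, extract, and process document content more efficiently. Users can upload files in several formats (PDF, TXT, Markdown, audio, and so on) and interact with the documents through conversation to generate summaries, answer questions, or create audio podcasts.
One standout feature is its "AI podcast" capability, which made it go viral worldwide.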