How to Quickly Build a NotebookLM from Scratch


[Cover image: screenshot]

I. Why Build a NotebookLM?

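NotebookLM's core mission is to extract knowledge and present it in a more engaging, entertaining form. Now that large models can extract knowledge with ease, presenting that knowledge in novel formats (a podcast, for example) is a direction with a lot of potential. NotebookLM's format is more than a starting point; it is a prompt to ask what other possibilities are waiting to be discovered along this path.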

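This raises a fundamental question: how can we make knowledge more expressive?
We can start from basic media such as audio, video, and images, then extend to richer formats such as film, podcasts, comedy sketches, debates, and staged conflict. Each format has its own appeal; the key is combining them skillfully so that knowledge is more compelling as it is conveyed.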

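To extend and apply the idea, we first need to understand how the existing approach can be implemented. The next section walks through it in detail.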

II. How to Build a NotebookLM at Zero Cost

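Here is a rough walkthrough of how to build your own NotebookLM at low cost. For reasons of space, only the core code is shown; a more detailed write-up may follow if readers ask for it, so feel free to follow: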

1) Extract the PDF text with langchain

This is where langchain comes in. It is a heavy dependency, but it does make reading all kinds of documents convenient.

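# Import path for recent langchain releases; older versions expose it as langchain.document_loaders
from langchain_community.document_loaders import PyPDFLoader


def get_pdf_content(file_path):
    # Load the PDF and join the text of every chunk into one string
    loader = PyPDFLoader(file_path)
    pages = loader.load_and_split()
    integrated_content = ""
    for page in pages:
        integrated_content += page.page_content + "\n"
    return integrated_content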

2) Write an agent pool to rewrite the text into a podcast-like structure

The agent pool mainly makes it easy to create and destroy agents, which are used to rewrite the source text.


class AgentPool:
    def __init__(self, max_agents=20):
        self.max_agents = max_agents
        self.agents = []

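    def add_agent(self, name, instructions, functions=None):
        if len(self.agents) < self.max_agents:
            # generate_agent_with_name_instruments is the author's helper; its definition is not shown here
            agent = generate_agent_with_name_instruments(name, instructions, functions or [])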
            self.agents.append(agent)
            return agent
        else:
            raise Exception("Agent pool is full")

    # May return None if no agent has this name
    def get_agent(self, name):
        for agent in self.agents:
            if agent.name == name:
                return agent
        return None

    def remove_agent(self, name):
        self.agents = [agent for agent in self.agents if agent.name != name]


    def execute_and_release(self, name, query):
        agent = self.get_agent(name)
        if agent:
            response = generate_response_with_agent_query(agent, query)
            self.remove_agent(name)
            return response
        else:
            # No agent with this name exists, so raise to tell the caller
            raise Exception(f"Agent with name {name} not found")


if __name__ == "__main__":
    # Example usage
    agent_pool = AgentPool()
    agent_pool.add_agent("title_assistant", "Help generate and refine titles")
    response = agent_pool.execute_and_release("title_assistant", "How do I write a good title for a beverage promotion?")
    print(response)
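The pool relies on two helpers, generate_agent_with_name_instruments and generate_response_with_agent_query, whose definitions are not shown in this post. To make the class runnable on its own, here is a minimal sketch of what they might look like. The Agent dataclass, the llm parameter, and the echo stub are all assumptions for illustration, not the original implementation; in practice llm would wrap whatever chat-completion client you use.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Agent:
    """Hypothetical agent record: a name, a system prompt, and optional tools."""
    name: str
    instructions: str
    functions: List[Callable] = field(default_factory=list)


def generate_agent_with_name_instruments(name, instructions, functions=None):
    """Stand-in for the helper called by AgentPool.add_agent (assumed behaviour)."""
    return Agent(name=name, instructions=instructions, functions=list(functions or []))


def generate_response_with_agent_query(agent, query, llm=None):
    """Stand-in for the helper called by AgentPool.execute_and_release.

    llm is any callable taking (system_prompt, user_message) and returning text.
    The default echo stub exists only so the sketch runs without an API key.
    """
    if llm is None:
        llm = lambda system, user: f"[{system}] {user}"
    return llm(agent.instructions, query)
```

With these two functions defined above the AgentPool class, the example in the `__main__` block runs end to end and prints the echoed prompt.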

a) Script-writing agent: rewrite the text as a dialogue between speakers

Required structure:

...

Speaker 1: My pleasure! This is such a fascinating field, and I'm glad we could explore it together. Until next time, keep asking those great questions!

Speaker 2: Will do! Thanks again, and see you next time.

...

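Implementation:

async def writer_gen(content):
    # pool is assumed to be a module-level AgentPool instance; WRITE_PROMPT is the author's prompt (not shown)
    pool.add_agent("writer_agent", WRITE_PROMPT)
    writer_res = pool.execute_and_release("writer_agent", content)
    return writer_res

b) Polishing agent: polish the script and return it as an array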

Required structure:

[
    ...
    ("Speaker 1", "My pleasure! This is such a fascinating field, and I'm glad we could explore it together. Until next time, keep asking those great questions!"),
    ("Speaker 2", "Will do! Thanks again, and see you next time.")
    ...
]

Implementation:

async def rewriter_gen(writer_res):
    pool.add_agent("rewriter_agent", REWRITE_PROMPT)
    rewriter_res = pool.execute_and_release("rewriter_agent", writer_res)
    return rewriter_res
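The rewriter agent returns text, so before the speech step it has to become a real Python list of (speaker, text) tuples. The post does not show how that conversion is done; the sketch below is one possible way, assuming the model output contains a single Python-style list literal, possibly surrounded by extra chatter. parse_dialogue is a hypothetical helper name.

```python
import ast


def parse_dialogue(raw):
    """Parse text such as '[("Speaker 1", "Hi"), ...]' into a list of (speaker, text) tuples."""
    # Keep only the outermost [...] so leading or trailing chatter is ignored
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no list literal found in model output")
    # literal_eval only evaluates literals, so it is safer than eval on model output
    data = ast.literal_eval(raw[start:end + 1])
    dialogue = []
    for item in data:
        if not (isinstance(item, (tuple, list)) and len(item) == 2):
            raise ValueError(f"unexpected entry: {item!r}")
        speaker, text = item
        dialogue.append((str(speaker), str(text).strip()))
    return dialogue
```

If the model produces malformed output, ast.literal_eval raises SyntaxError or ValueError, which is a reasonable signal to retry the rewriter agent.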

3) Convert to speech

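The array is converted to speech segment by segment, mainly using Parler-TTS.

It is fairly simple to use: a text description controls the voice, for example whether it sounds male or female.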

import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
import logging
import os
from concurrent.futures import ProcessPoolExecutor

# Raise the log level to cut down on warning messages
logging.getLogger("transformers").setLevel(logging.ERROR)


def generate_audio_with_parler(text_prompt, description, output_file):
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Load model and tokenizer
    model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
    tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

    # Tokenize inputs
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)

    # Generate audio
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    audio_arr = generation.cpu().numpy().squeeze()

    # Remove the output file first if it already exists
    if os.path.exists(output_file):
        os.remove(output_file)
        print(f"File {output_file} already existed and was removed.")

    sf.write(output_file, audio_arr, model.config.sampling_rate)


if __name__ == "__main__":

    # Define text and description
    text_prompt = """
    Exactly! And the distillation part is where you take a LARGE-model,and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
    """
    male_prompt = """Zack's voice is smokey and deep magnetic,  speaking at a moderate speed  with a very close recording that almost has no background noise."""

    output_file = "output_male.wav"

    with ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
        # Consume the iterator so that exceptions raised in worker processes surface here
        list(executor.map(generate_audio_with_parler, [text_prompt], [male_prompt], [output_file]))
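Since each dialogue line is synthesized into its own WAV file, the pieces need to be joined into one episode before the subtitle and video steps. The post does not show this step; below is a minimal sketch using only the standard-library wave module. It assumes all segments are 16-bit PCM WAV files with the same sample rate and channel count (which should be the case for .wav files written by sf.write with its default settings, but is worth verifying). concat_wav_files and pause_seconds are hypothetical names.

```python
import wave


def concat_wav_files(input_paths, output_path, pause_seconds=0.3):
    """Join WAV files end to end, inserting a short silence between segments."""
    params = None
    frames = []
    for path in input_paths:
        with wave.open(str(path), "rb") as wf:
            current = (wf.getnchannels(), wf.getsampwidth(), wf.getframerate())
            if params is None:
                params = current
            elif current != params:
                raise ValueError(f"{path} has a different audio format: {current} != {params}")
            frames.append(wf.readframes(wf.getnframes()))
    if params is None:
        raise ValueError("no input files given")
    channels, sampwidth, framerate = params
    # Zero bytes are silence for signed 16-bit PCM (assumption stated above)
    silence = b"\x00" * (int(framerate * pause_seconds) * channels * sampwidth)
    with wave.open(str(output_path), "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(sampwidth)
        out.setframerate(framerate)
        out.writeframes(silence.join(frames))
```

A typical call would pass the per-line files in dialogue order, for example concat_wav_files(["line_000.wav", "line_001.wav"], "episode.wav"), where the file names are placeholders.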


4) Generate subtitles

Transcribe the audio with Whisper (the WhisperModel class here comes from the faster-whisper package)

from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="float32")

...
segments, info = model.transcribe(
    str(audioPath),
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500)
)

Convert to SRT

from datetime import timedelta

import srt

subtitles = []
for segment in segments:
    print(f"Transcript: {segment.text}")
    start = timedelta(seconds=segment.start)
    end = timedelta(seconds=segment.end)
    content = segment.text.strip()
    subtitles.append(
        srt.Subtitle(
            index=len(subtitles) + 1,
            start=start,
            end=end,
            content=content
        )
    )

# Convert the subtitles into an SRT-formatted string
srt_data = srt.compose(subtitles)
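To make the target format concrete, here is a small standard-library sketch of what an SRT file looks like: a running index, a start --> end line with HH:MM:SS,mmm timestamps, the text, and a blank line between entries. It illustrates the format only and is not a replacement for the srt library used above; format_srt_timestamp and compose_srt are hypothetical names.

```python
def format_srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def compose_srt(entries):
    """Build SRT text from an iterable of (start_seconds, end_seconds, text)."""
    blocks = []
    for index, (start, end, text) in enumerate(entries, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_srt_timestamp(start)} --> {format_srt_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # A blank line separates consecutive subtitle entries
    return "\n".join(blocks)
```

Either way, the resulting string has to be saved to disk (for example with open("podcast.srt", "w", encoding="utf-8"), where the file name is a placeholder) because the video step reads the subtitles from a file.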

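5) A cover image for the podcast

Any model that can generate images will do, for example Kimi or Mistral. Generate something like the image below, or like the cover image of this post.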

[Image: example podcast cover]

6) Generate the video

Now that we have the image, the audio, and the subtitles, ffmpeg can combine them into a video.

import subprocess


async def gen_audio_without_zoom(image_file, audio_file, subtitle_file, video_file):
    command = [
        "ffmpeg", "-y",
        "-loop", "1",
        "-i", str(image_file),
        "-i", str(audio_file),
        "-vf", f"subtitles={str(subtitle_file)}:force_style='Fontsize=20'",
        "-c:v", "libx264",
        "-tune", "stillimage",
        "-c:a", "aac",
        "-shortest",
        str(video_file)
    ]
    subprocess.run(command, check=True)
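One thing to be aware of: the function above is declared async but calls the blocking subprocess.run, so the event loop stalls while ffmpeg is encoding. If that matters in your setup, a possible alternative is sketched below using asyncio.create_subprocess_exec. build_ffmpeg_command and run_ffmpeg are hypothetical names, and the ffmpeg arguments are copied from the function above.

```python
import asyncio


def build_ffmpeg_command(image_file, audio_file, subtitle_file, video_file, font_size=20):
    """Return the ffmpeg argument list: a still image plus audio, with burned-in subtitles."""
    return [
        "ffmpeg", "-y",
        "-loop", "1",
        "-i", str(image_file),
        "-i", str(audio_file),
        "-vf", f"subtitles={subtitle_file}:force_style='Fontsize={font_size}'",
        "-c:v", "libx264",
        "-tune", "stillimage",
        "-c:a", "aac",
        "-shortest",
        str(video_file),
    ]


async def run_ffmpeg(command):
    """Run ffmpeg without blocking the event loop; raise if it exits with an error."""
    proc = await asyncio.create_subprocess_exec(*command)
    returncode = await proc.wait()
    if returncode != 0:
        raise RuntimeError(f"ffmpeg exited with code {returncode}")
```

Usage would look like asyncio.run(run_ffmpeg(build_ffmpeg_command("cover.jpg", "episode.wav", "podcast.srt", "episode.mp4"))), with the file names as placeholders. Separating command construction from execution also makes the argument list easy to inspect or log.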

The final result:

[Screenshot: the generated video]

III. What Is NotebookLM?

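NotebookLM is an experimental AI tool from Google. It is a note-management tool built on a large language model (LLM), designed to help users organize, extract, and process document content more efficiently. Users can upload files in several formats (such as PDF, TXT, Markdown, and audio) and interact with the documents through conversation to generate summaries, answer questions, or create audio podcasts.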

One standout feature is its "AI podcast" capability, which quickly made it popular worldwide.