How to Convert a Scanned PDF into a Markdown Document

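1. Introduction

When building RAG applications, handling multimodal documents such as PPT and PDF files is troublesome, scanned PDFs in particular. Extracting the text of a regular PDF is easy, but tables, images, and math formulas still cause many problems. This article focuses on scanned PDFs and builds a small application that converts a PDF into a Markdown document.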

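2. Document Layout Detection

Converting a PDF to Markdown involves several techniques, such as document layout analysis and OCR. We start with how to parse the layout of a PDF document.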

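2.1 DocLayout-YOLO

Document layout detection can be treated as an object detection task. A scanned PDF page is essentially an image, and the page contains elements such as figures, text, titles, and tables. So all we need is an object detection network that finds every element in the image along with its type.

The good news is that we do not need to train this network ourselves. DocLayout-YOLO already provides a ready-made YOLO network for document layout detection, and a few lines of code are enough to run it. See the project for details: github.com/opendatalab…

2.2 Code

DocLayout-YOLO is very simple to use. The following lines are all it takes: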

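from doclayout_yolo import YOLOv10
from huggingface_hub import hf_hub_download

# Download the model weights
filepath = hf_hub_download(
    repo_id="juliozhao/DocLayout-YOLO-DocStructBench",
    filename="doclayout_yolo_docstructbench_imgsz1024.pt",
)
# Load the model
model = YOLOv10(filepath)
# Path to one page image of the scanned PDF
img_path = "01.png"
# Run prediction; "mps" targets Apple Silicon, use "cpu" or "cuda:0" elsewhere
det_res = model.predict(img_path, imgsz=1024, conf=0.2, device="mps")[0]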

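det_res holds the detection results. The element type, confidence, and position are available through these attributes:

# Class ids
det_res.boxes.cls
# Confidence scores
det_res.boxes.conf
# Box coordinates as (x1, y1, x2, y2)
det_res.boxes.xyxy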
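
To make these three tensors easier to work with, one option is to zip them into plain Python records. The sketch below is an illustration rather than part of DocLayout-YOLO: the `Detection` dataclass and the `to_detections` helper are hypothetical names, and the inputs are assumed to be plain lists, such as those obtained with `det_res.boxes.cls.tolist()`, `det_res.boxes.conf.tolist()`, and `det_res.boxes.xyxy.tolist()`, together with the `det_res.names` id-to-label mapping. The sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected layout element (hypothetical helper type)."""
    label: str      # class name, e.g. "title"
    conf: float     # confidence score in [0, 1]
    box: tuple      # (x1, y1, x2, y2) in integer pixel coordinates


def to_detections(cls_ids, confs, boxes, names, min_conf=0.2):
    """Zip parallel lists of class ids, scores, and boxes into records.

    Detections whose score is below min_conf are dropped.
    """
    detections = []
    for cls_id, conf, box in zip(cls_ids, confs, boxes):
        if conf < min_conf:
            continue
        x1, y1, x2, y2 = (int(v) for v in box)
        detections.append(
            Detection(names[int(cls_id)], float(conf), (x1, y1, x2, y2))
        )
    return detections


# Made-up values standing in for real model output.
names = {0: "title", 1: "plain text", 3: "figure"}
sample = to_detections(
    cls_ids=[1.0, 0.0, 3.0],
    confs=[0.91, 0.88, 0.10],
    boxes=[
        [50.2, 200.7, 900.0, 400.1],
        [50.0, 40.0, 900.0, 120.0],
        [100.0, 500.0, 800.0, 900.0],
    ],
    names=names,
)
for det in sample:
    print(det.label, round(det.conf, 2), det.box)
```

Working with labels instead of raw class ids keeps later branching readable and avoids depending on the numeric order of the classes.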

We can preview the detection result:

import cv2

annotated_frame = det_res.plot(pil=True, line_width=5, font_size=20)
cv2.imwrite("result.jpg", annotated_frame)

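The annotated image is written to result.jpg. The detection quality is quite good, and the body text is detected paragraph by paragraph.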
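
Each paragraph comes back as its own box, and the script later in this article orders the boxes by their top y coordinate. A slightly more forgiving variant groups boxes whose tops are close together into one row, then reads each row left to right. This is only a sketch for single-column pages: `sort_reading_order` and the `row_tol` threshold are hypothetical, and multi-column layouts would need a different approach.

```python
def sort_reading_order(boxes, row_tol=10):
    """Sort (x1, y1, x2, y2) boxes top-to-bottom, then left-to-right.

    A box joins the current row when its top edge is within row_tol
    pixels of the top edge of that row's first box.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):   # order by top y
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tol:
            rows[-1].append(box)                    # close to the row's first box
        else:
            rows.append([box])                      # start a new row
    ordered = []
    for row in rows:
        # Within a row, read from left to right.
        ordered.extend(sorted(row, key=lambda b: b[0]))
    return ordered


boxes = [
    (500, 102, 900, 300),   # right-hand box on the first row
    (50, 400, 900, 600),    # paragraph further down the page
    (50, 100, 450, 300),    # left-hand box on the first row
]
ordered_boxes = sort_reading_order(boxes)
print(ordered_boxes)
```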

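3. OCR

When processing a PDF, image regions can simply be cropped out, whereas text regions must have their content recognized, which requires OCR. There are many OCR options; here we use Tesseract.

For Tesseract setup, see: juejin.cn/post/696437…

A few lines of code are enough to recognize the text in an image: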

import pytesseract
from PIL import Image

# Load the image
im = Image.open('sentence.jpg')
# Recognize the text
string = pytesseract.image_to_string(im)
print(string)
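
Tesseract returns one line of text per visual line in the image, so a paragraph usually arrives with hard line breaks inside it. Before writing Markdown, it can help to collapse those breaks. The helper below is a hypothetical sketch that operates on a plain string, such as the return value of `pytesseract.image_to_string`; the rule for rejoining hyphenated words is an assumption that suits English text and may need to change for other languages.

```python
import re


def normalize_ocr_text(raw):
    """Collapse OCR line breaks so each paragraph becomes a single line."""
    text = raw.replace("\r\n", "\n")
    # Rejoin words hyphenated across a line break: "detec-\ntion" -> "detection".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines are kept as paragraph boundaries.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, any run of whitespace becomes one space.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


raw = (
    "Document layout detec-\ntion finds titles,\ntext and figures.\n\n"
    "OCR reads the text.\n\x0c"
)
normalized = normalize_ocr_text(raw)
print(normalized)
```

For non-English scans, `image_to_string` also accepts a `lang` argument (for example `lang="chi_sim"`), provided the corresponding Tesseract language data is installed.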

4. Converting to Markdown

Markdown syntax is very simple. To convert the recognition results, titles become:

# title

Images become:

![Figure](...)

Text recognized from the image is written as-is. The complete code is as follows:

#!/usr/bin/python3
# -*- coding: utf-8 -*-
# @Author: ZackFair
# @Desc: Extract layout from image and convert to Markdown
# @Date: 2025/6/30

import cv2
import pytesseract
from doclayout_yolo import YOLOv10
from PIL import Image

from huggingface_hub import hf_hub_download

# Tesseract path (macOS with Homebrew); adjust for your system
pytesseract.pytesseract.tesseract_cmd = '/opt/homebrew/bin/tesseract'

# Load the model
filepath = hf_hub_download(repo_id="juliozhao/DocLayout-YOLO-DocStructBench", filename="doclayout_yolo_docstructbench_imgsz1024.pt")
model = YOLOv10(filepath)

# Run inference on the image ("mps" targets Apple Silicon; use "cpu" or "cuda:0" elsewhere)
img_path = '01.png'
image = cv2.imread(img_path)
det_res = model.predict(img_path, imgsz=1024, conf=0.2, device="mps")[0]

# Class id -> class name mapping
names = det_res.names

# Markdown output for each block
markdown_blocks = []

# Sort the boxes top-to-bottom by y coordinate
boxes = zip(det_res.boxes.cls, det_res.boxes.conf, det_res.boxes.xyxy)
boxes = sorted(boxes, key=lambda x: float(x[2][1]))  # xyxy[1] is the top y coordinate

# Extract and process each region
for cls, conf, xyxy in boxes:
    cls = int(cls.item())
    name = names[cls]
    conf = conf.item()
    xyxy = list(map(int, xyxy.cpu().numpy().tolist()))
    x1, y1, x2, y2 = xyxy
    roi = image[y1:y2, x1:x2]
    roi_pil = Image.fromarray(cv2.cvtColor(roi, cv2.COLOR_BGR2RGB))

    # Map each class to its Markdown form
    if cls == 0:  # title
        text = pytesseract.image_to_string(roi_pil).strip()
        if text:
            markdown_blocks.append(f"# {text}")

    elif cls == 1:  # plain text
        text = pytesseract.image_to_string(roi_pil).strip()
        if text:
            markdown_blocks.append(text)

    elif cls == 3:  # figure
        fig_name = f"figure_{x1}_{y1}.png"
        cv2.imwrite(fig_name, roi)
        markdown_blocks.append(f"![Figure]({fig_name})")

    elif cls == 4:  # figure_caption
        text = pytesseract.image_to_string(roi_pil).strip()
        if text:
            markdown_blocks.append(f"*{text}*")

    elif cls == 5:  # table
        table_name = f"table_{x1}_{y1}.png"
        cv2.imwrite(table_name, roi)
        markdown_blocks.append(f"![Table]({table_name})")

    elif cls == 6:  # table_caption
        text = pytesseract.image_to_string(roi_pil).strip()
        if text:
            markdown_blocks.append(f"*{text}*")

    # Other classes are ignored here and can be added as needed

# Join the blocks into one Markdown document
markdown = "\n\n".join(markdown_blocks)

# Write the Markdown file
output_md = img_path.replace(".png", ".md")
with open(output_md, "w", encoding="utf-8") as f:
    f.write(markdown)

print(f"✅ Markdown saved to {output_md}")

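Here we first detect the document layout, then convert each element to Markdown according to its type and append it to the markdown_blocks list. Finally, the contents of markdown_blocks are written to a file with the .md extension.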
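
The branching in the script can also be expressed as a small function keyed by class name rather than class id. The following is a sketch, not the author's code: `render_block` is a hypothetical helper, the label strings mirror the class comments in the script above, and collapsing whitespace in titles and captions is an added assumption so that a multi-line OCR result does not break the heading or emphasis syntax.

```python
def render_block(label, text="", path=""):
    """Return the Markdown for one layout element, or None to skip it.

    label: class name such as "title" or "figure"
    text:  OCR result for text-like elements
    path:  saved image file for figures and tables
    """
    one_line = " ".join(text.split())          # collapse OCR line breaks
    if label == "title":
        return f"# {one_line}" if one_line else None
    if label == "plain text":
        return text.strip() or None
    if label in ("figure_caption", "table_caption"):
        return f"*{one_line}*" if one_line else None
    if label == "figure":
        return f"![Figure]({path})" if path else None
    if label == "table":
        return f"![Table]({path})" if path else None
    return None                                 # other classes are ignored


elements = [
    ("title", "Scanned PDF\nto Markdown", ""),
    ("plain text", "First paragraph.", ""),
    ("figure", "", "figure_10_20.png"),
    ("figure_caption", "Figure 1: layout", ""),
    ("other", "page 3", ""),                    # any label not handled above
]
blocks = [render_block(*e) for e in elements]
markdown = "\n\n".join(b for b in blocks if b)
print(markdown)
```

Keeping the rendering rules in one function makes it straightforward to add further classes without touching the detection loop.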

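5. Summary

This article showed how to convert a single PDF page into Markdown, and the approach extends easily to multiple pages. There are also ready-made projects that handle PDF-to-Markdown conversion, such as MinerU, which makes it easy to convert Word, PDF, and other documents into Markdown files and also supports object storage.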
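
Extending to several pages mostly means running the same per-page routine in a loop and joining the results. The sketch below assumes the pages have already been rendered to images by some other tool, and takes the per-page conversion as a callable; `convert_pages`, `fake_page_to_markdown`, and the page separator are hypothetical names and choices, not part of any library.

```python
from pathlib import Path


def convert_pages(page_images, page_to_markdown, separator="\n\n"):
    """Convert page images in order and join the per-page Markdown.

    page_images:      iterable of image paths, one per PDF page
    page_to_markdown: callable taking a path and returning Markdown text
    """
    parts = []
    for path in page_images:
        md = page_to_markdown(path).strip()
        if md:                      # skip pages that produced no content
            parts.append(md)
    return separator.join(parts)


def fake_page_to_markdown(path):
    """Stand-in for the detection + OCR step, used only for this demo."""
    return f"# Page {Path(path).stem}\n\nText from {Path(path).name}"


pages = ["01.png", "02.png"]
document = convert_pages(pages, fake_page_to_markdown)
print(document)
```

In practice, the callable would wrap the single-page script from section 4 so that it returns the Markdown string instead of writing a file.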