Word

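Any of the current LLM products can generate Python code that extracts text from a Word document. The snippet below retrieves the paragraphs and the tables of a Word file separately: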

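import docx

file_path = "xxxx"  # path to the .docx file
doc = docx.Document(file_path)
# Text of every paragraph, in paragraph order
text = []
for paragraph in doc.paragraphs:
    text.append(paragraph.text)
# Cell text of every table, as rows of strings
tables = []
for table in doc.tables:
    tables.append([[cell.text for cell in row.cells] for row in table.rows])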

However, this loses the relative order of paragraphs and tables. The fix below was posted by a user in a GitHub issue:

from docx import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import Table
from docx.text.paragraph import Paragraph
from tqdm import tqdm  # third-party, used only for the progress bar

docx_file = Document("xxxx")
contents = []
# The children of the body element are stored in document order
for element in tqdm(docx_file.element.body):
    if isinstance(element, CT_Tbl):
        table_text = "--------------------table-------------------"
        keys = None
        table = Table(element, docx_file)
        for i, row in enumerate(table.rows):
            text = tuple(cell.text for cell in row.cells)
            if i == 0:
                # Treat the first row as the header row
                keys = text
                continue
            # Assumed completion: pair each data row with the header and keep it
            table_text += "\n" + str(dict(zip(keys, text)))
        contents.append(table_text)
    elif isinstance(element, CT_P):
        paragraph = Paragraph(element, docx_file)
        contents.append(paragraph.text)
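
Order survives in the snippet above because a .docx file is a zip archive whose word/document.xml lists paragraphs (w:p) and tables (w:tbl) as sibling children of w:body, in reading order. The sketch below illustrates that idea with the standard library only. It is a different route from the python-docx code, not a replacement for it. The names `iter_blocks` and `_text` are hypothetical helpers. The sketch also simplifies: multiple paragraphs inside one cell are concatenated, and tabs, line breaks and nested tables are not handled separately.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used in word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _text(node):
    """Concatenate the text of every w:t run found under an XML node."""
    return "".join(t.text or "" for t in node.iter(W + "t"))


def iter_blocks(docx_path):
    """Yield ("paragraph", str) or ("table", rows) in document order.

    docx_path may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find(W + "body")
    for child in body:
        if child.tag == W + "p":
            yield "paragraph", _text(child)
        elif child.tag == W + "tbl":
            # One list of cell strings per table row
            rows = [
                [_text(cell) for cell in row.findall(W + "tc")]
                for row in child.findall(W + "tr")
            ]
            yield "table", rows
```

Calling `list(iter_blocks("xxxx"))` on a real file should give one flat list in which each table sits between the paragraphs that surround it in the document.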

PDF

PDF parsing has the same problem. Taking the pdfplumber package as an example:

How do you get the position index of the table on the page · jsvine/pdfplumber · Discussion #849 (github.com)
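
One possible way to restore the order on a PDF page, not necessarily the approach taken in the linked discussion, is to compare vertical positions: each table found by pdfplumber has a bounding box, each text line has coordinates, and sorting both by their top edge gives a reading order for single-column pages. The sketch below keeps the ordering logic in a plain function, `merge_by_position` (a hypothetical name), so it can be tried without a PDF. `page_blocks` is also hypothetical and assumes a pdfplumber version that provides `Page.find_tables()` and the experimental `Page.extract_text_lines()`.

```python
def merge_by_position(lines, tables):
    """Merge text lines and tables into one list ordered top to bottom.

    lines:  dicts with "x0", "x1", "top", "bottom" and "text" keys
    tables: (bbox, rows) pairs, where bbox = (x0, top, x1, bottom)
    """
    blocks = [(bbox[1], "table", rows) for bbox, rows in tables]
    for line in lines:
        cx = (line["x0"] + line["x1"]) / 2
        cy = (line["top"] + line["bottom"]) / 2
        # Skip lines whose centre lies inside a table: that text is
        # already present in the table's rows
        inside = any(
            b[0] <= cx <= b[2] and b[1] <= cy <= b[3] for b, _ in tables
        )
        if not inside:
            blocks.append((line["top"], "text", line["text"]))
    # Sort on the top coordinate only, so contents are never compared
    blocks.sort(key=lambda block: block[0])
    return [(kind, content) for _, kind, content in blocks]


def page_blocks(page):
    """Collect ordered blocks from a pdfplumber Page (assumed API, see above)."""
    tables = [(t.bbox, t.extract()) for t in page.find_tables()]
    return merge_by_position(page.extract_text_lines(), tables)
```

With pdfplumber installed, this would be used as `page_blocks(pdf.pages[0])` inside a `with pdfplumber.open("xxxx") as pdf:` block. Multi-column layouts would need an extra sort key on the horizontal position.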
