Word
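Any LLM product can generate Python code that parses a Word document into text. The snippet below extracts the paragraph text and the tables from a Word document separately.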
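import docx

file_path = "xxxx"  # path to the .docx file
text = []
doc = docx.Document(file_path)
# Text of every paragraph, in paragraph order
for paragraph in doc.paragraphs:
    text.append(paragraph.text)
# Text of every table, one tab-separated line per row (illustrative handling)
for table in doc.tables:
    for row in table.rows:
        text.append("\t".join(cell.text for cell in row.cells))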
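However, this loses the relative order of paragraphs and tables, because doc.paragraphs and doc.tables are two separate lists. One fix is to walk the child elements of the document body directly: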
from docx import Document
from docx.table import Table
from docx.oxml.text.paragraph import CT_P
from docx.text.paragraph import Paragraph
from docx.oxml.table import CT_Tbl
from tqdm import tqdm

docx_file = Document("xxxx")
contents = []
# Iterate over the body's child elements so paragraphs and tables stay in document order
for element in tqdm(docx_file.element.body):
    if isinstance(element, CT_Tbl):
        table_text = "--------------------table-------------------"
        keys = None
        table = Table(element, docx_file)
        for i, row in enumerate(table.rows):
            text = (cell.text for cell in row.cells)
            if i == 0:
                # The first row is treated as the header row
                keys = tuple(text)
                continue
            # Assumed row format: render each data row as "header: value" pairs
            table_text += "\n" + ", ".join(f"{k}: {v}" for k, v in zip(keys, text))
        contents.append(table_text)
    elif isinstance(element, CT_P):
        paragraph = Paragraph(element, docx_file)
        contents.append(paragraph.text)
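The reason walking the body works is that a .docx file is a zip archive whose word/document.xml stores paragraphs (w:p) and tables (w:tbl) as sibling elements in reading order. As a rough illustration of that idea, here is a minimal standard-library sketch. The function names blocks_from_document_xml and blocks_from_docx are hypothetical helpers, not part of python-docx, and the sketch assumes a simple document: it ignores nested tables and content controls, and it joins multiple paragraphs inside one cell without a separator.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _text(node):
    """Concatenate the text of all w:t runs found under a node."""
    return "".join(t.text or "" for t in node.iter(W + "t"))


def blocks_from_document_xml(xml_bytes):
    """Return ("paragraph", str) and ("table", rows) tuples in body order."""
    body = ET.fromstring(xml_bytes).find(W + "body")
    blocks = []
    for child in body:
        if child.tag == W + "p":
            blocks.append(("paragraph", _text(child)))
        elif child.tag == W + "tbl":
            # Each table becomes a list of rows, each row a list of cell texts
            rows = [
                [_text(tc) for tc in tr.findall(W + "tc")]
                for tr in child.findall(W + "tr")
            ]
            blocks.append(("table", rows))
        # Other children (for example section properties) are skipped
    return blocks


def blocks_from_docx(path):
    """Read word/document.xml from a .docx archive and return its blocks."""
    with zipfile.ZipFile(path) as zf:
        return blocks_from_document_xml(zf.read("word/document.xml"))
```

Calling blocks_from_docx("xxxx") would give one list in which paragraph and table entries appear in the same order as in the document, which is the property the python-docx version above relies on.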
PDF parsing has the same problem. Taking the pdfplumber package as an example: