一种PP-Structure/PaddleOCR生成HTML富文本 + 辅助校验的方案本文实现： 1.将PPStruct

演示效果

paddle_ocr_to_html_demo_和另外_1_个页面_-个人-_Microsoft__Edge_2023-07-25_09-05-44.gif

实现功能:

将PPStructure输出的结果属于转化为HTML富文本。
通过点击对应的富文本区域。左边的图片聚焦到对应的区域，可进行校对。

本文所用示例图片：

输出示例页面：

左侧为原图根据PPStruture输出的结果数组通过fabricjs绘制边框和type类型标识；

右侧为根据PPStruture输出的结果数组转化成的HTML结构；

点击右侧的HTML，左侧会自动聚焦到对应的图片区域，进行校对。

码上掘金

git仓库地址

仓库地址：PPStructure_To_HTML: 一种PaddleOCR/PPStructure根据图片生成HTML，并人工校对的方案 (gitee.com)

项目分为前后端，也可以只安装前端查看无后端示例样例。后端使用python编写，版本为3.10。前端采用vue3+webpack;

前端部署

cd paddle_ocr_to_html_demo # 进入前端项目文件夹

npm install 

npm run serve

后端部署

cd api # 进入后端 文件夹

pip install -r requirements.txt # 安装所需库

python -m uvicorn main:app --reaload

技术简介

PP-Structure： PP-Structure是一个可用于复杂文档结构分析和处理的OCR工具包，旨在帮助开发者更好的完成文档理解相关任务。

简单概括：PPStruture 可以通过文档矫正，结构识别和OCR等技术直接将图片/PDF文件转化为Word文档。

使用简介可见：

ppstructure/docs/quickstart.md · PaddlePaddle/PaddleOCR - Gitee.com

需求

在官方文档中是只存在直接生成Word文档或者纯文本的形式。不存在一个直接生成HTML的形式，因此提出一个根据OCR结果生成HTML结构的方案。
其次，在某些场景下，文档输出的OCR结果并不一定准确。因此需要引入人工校验文本，对输出的OCR文本进行校验纠正，由此提出一个人工校验文本的方案。

实现思路及核心代码

通过阅读PPStructure输出word文档的源码convert_info_docx方法可知道，是通过遍历sorted_layout_boxes函数输出的布局数组通过py-docx库去生成word文档

sorted_layout_boxes 输出的数组格式解析如下

JSON格式实例;

{
    type: '', // 检测到的元素类型,figure（图像），text（文本），title（标题），table(表格)
    bbox: [], // 元素边框， 由四个元素组成，分别为左上角的坐标：x,y 值和 元素长宽值：w,h
    children: [], // 元素区域内的子元素及识别的文本
    html: '', // table元素特有， 包含table对应的HTML字符串
    img: [], // 区域内图片的nparraty数组， 此处在返回的时候已去除，通过canvas根据bbox区域剪裁图片
    layout: 'single'，  // 表示当前识别元素的布局为单列或双列
}

图片对应如下:

single(单列布局)、 double(双列布局) 的结构数组，再根据bbox（范围框） 的维度根据元素的类型：figure（图像）、text（文本）、title（标题），table(表格) 等

具体源码如下（含部分个人注释）：

import os

from copy import deepcopy



from docx import Document

from docx import shared

from docx.enum.text import WD_ALIGN_PARAGRAPH

from docx.enum.section import WD_SECTION

from docx.oxml.ns import qn

from docx.enum.table import WD_TABLE_ALIGNMENT



from ppstructure.recovery.table_process import HtmlToDocx



from ppocr.utils.logging import get_logger

def convert_info_docx(img, res, save_folder, img_name):

    doc = Document()

    doc.styles['Normal'].font.name = 'Times New Roman'

    doc.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')

    doc.styles['Normal'].font.size = shared.Pt(6.5)



    flag = 1

    for i, region in enumerate(res):

        img_idx = region['img_idx']
        # 如果是布局为single则设置只有一列
        if flag == 2 and region['layout'] == 'single':

            section = doc.add_section(WD_SECTION.CONTINUOUS)

            section._sectPr.xpath('./w:cols')[0].set(qn('w:num'), '1')

            flag = 1
        # 如果是布局为double则设置只有一列
        elif flag == 1 and region['layout'] == 'double':

            section = doc.add_section(WD_SECTION.CONTINUOUS)

            section._sectPr.xpath('./w:cols')[0].set(qn('w:num'), '2')

            flag = 2


        # 如果类型为图像，则根据img_idx获取图片数据
        if region['type'].lower() == 'figure':

            excel_save_folder = os.path.join(save_folder, img_name)

            img_path = os.path.join(excel_save_folder,

                                    '{}_{}.jpg'.format(region['bbox'], img_idx))

            paragraph_pic = doc.add_paragraph()

            paragraph_pic.alignment = WD_ALIGN_PARAGRAPH.CENTER

            run = paragraph_pic.add_run("")

            if flag == 1:

                run.add_picture(img_path, width=shared.Inches(5))

            elif flag == 2:

                run.add_picture(img_path, width=shared.Inches(2))

        elif region['type'].lower() == 'title':

            doc.add_heading(region['res'][0]['text'])
        # 如果是table，则直接获取html数据写入
        elif region['type'].lower() == 'table':

            parser = HtmlToDocx()

            parser.table_style = 'TableGrid'

            parser.handle_table(region['res']['html'], doc)

        else:

            paragraph = doc.add_paragraph()

            paragraph_format = paragraph.paragraph_format

            for i, line in enumerate(region['res']):

                if i == 0:

                    paragraph_format.first_line_indent = shared.Inches(0.25)

                text_run = paragraph.add_run(line['text'] + ' ')

                text_run.font.size = shared.Pt(10)



    # save to docx

    docx_path = os.path.join(save_folder, '{}_ocr.docx'.format(img_name))

    doc.save(docx_path)

    logger.info('docx save to {}'.format(docx_path))

链接： ppstructure/recovery/recovery_to_doc.py · PaddlePaddle/PaddleOCR - 码云 - 开源中国 (gitee.com)

PPStructure输出结果数组转换为HTML布局数组

将PPStructure 输出的word文档结构数组转换为HTML布局数组：

由于PPStructure输出的数组是自上而下，自左而右的，如图

因此需要对结果元素进行边界值判断

res[i].layout=== 'double' && res[i].bbox[0] > （图片宽度/2 - 20） // 20 是我设置的一个bbox的差值，不一定准确

符合该条件，则证明该元素已经换列

具体代码如下

    // 获取htmlList的结果
    const getHTMLList = (imgShape) => {
      // 创建HTML元素
      const html = document.createElement('div');  // 用于记录直接生成的HTML元素 
      //let flag = 1;
      let currentDiv = null;    // 当前布局的div元素
      let currentDoubleLeft = null; // 记录当前双列布局当前布局的left值
      let currentElement = null;  // 当前布局结构的对象；
      let doubleElement = null; // 用于记录当前的双层布局结构对象；
      const list = [];  // 输出结构对象数组
      // 循环遍历OCR结果
      for (let i = 0; i < ocrRes.value.length; ++i) {
        // 获取区域元素
        const region = ocrRes.value[i];
        // 如果区域元素为图片识别结果，则跳过。
        if (region['type'].toLowerCase() === 'figure_caption') {
          continue
        }
        // 元素布局判断
        // 如果布局是单列
        if (region['layout'] == 'single') {
          // 创建新的HTML元素
          const newDiv = document.createElement('div')
          newDiv.classList = `single-layout ck-editor-ctn`  // 单列类
          newDiv.id = `editor${i}`   
          html.appendChild(newDiv); // 添加布局元素
          currentDiv = newDiv;  // 设置当前布局元素为
          currentDoubleLeft = null; // 重置 用于标记双列布局当前布局的left值为null
          // 创建单列布局元素
          currentElement = {
            type: 'single',
            children: []
          }
          // 添加布局元素
          list.push(currentElement)
        }
        // 如果布局时双列
        else if (region['layout'] == 'double') {
          // 如果双列布局当前布局的left值为空，则当前是新的双列元素布局
          if (currentDoubleLeft === null) {
            // 创建双列布局HTML元素
            const doubleDivCtn = document.createElement('div')
            doubleDivCtn.classList = 'double-div-ctn' // 类
            html.appendChild(doubleDivCtn)  // 添加元素
            currentDoubleLeft = region['bbox'][0];  // 设置当前的布局的左值为当前区域元素的左值
            const currentDoubleDiv = document.createElement('div')  // 当前双列布局中的活跃布局元素
            currentDoubleDiv.classList = 'double-div ck-editor-ctn'
            currentDoubleDiv.id = `editor${i}`
            currentDoubleDiv.style.setProperty('display', 'inline-block');
            currentDoubleDiv.style.setProperty('width', '48%')
            currentDoubleDiv.style.setProperty('margin-left', '2%')
            doubleDivCtn.appendChild(currentDoubleDiv); // 添加当前单列布局进入双列布局中
            currentDiv = currentDoubleDiv;
            doubleElement = {
              type: 'double',
              children: []
            }
            currentElement = {
              children: []
            }
            doubleElement.children.push(currentElement);
            list.push(doubleElement)
          }
          else {
            // 这里判断double 布局下是否已经换行
            if (Math.abs(currentDoubleLeft < Math.floor(imgShape.w / 2 - 20) &&  region['bbox'][0] > Math.floor(imgShape.w / 2 - 20))) {
              const currentDoubleDiv = document.createElement('div')
              currentDoubleDiv.classList = 'double-div ck-editor-ctn'
              currentDoubleDiv.id = `editor${i}`
              currentDoubleDiv.style.setProperty('display', 'inline-block');
              currentDoubleDiv.style.setProperty('width', '48%')
              currentDoubleDiv.style.setProperty('margin-left', '2%')
              currentDiv.parentElement.appendChild(currentDoubleDiv);
              currentDiv = currentDoubleDiv;
              currentDoubleLeft = region['bbox'][0]
              currentElement = {
                children: []
              }
              doubleElement.children.push(currentElement)
              // list.push(currentElement)
            }
            else {
              1;
            }
          }
          // flag = 2;
        }
        /******************* */
        // 以下处理是 根据当前元素的类型， 进行处理后直接添加至当前布局元素下
        
        // 如果是图片则直接从原图进行剪裁
        if (region['type'].toLowerCase() == 'figure') {
          const img = document.createElement('img', {
            class: 'img-ctn',
            src: ''
          })
          img.src = getClipPicUrl({
            x: region['bbox'][0],
            y: region['bbox'][1],
            w: region['bbox'][2] - region['bbox'][0],
            h: region['bbox'][3] - region['bbox'][1],
          }, canvasContext.value)
          currentDiv.appendChild(img);
          currentElement.children.push({
            type: region['type'].toLowerCase(),
            index: i,
            html: `
                    <div id="editor${i}">
                        ${img.outerHTML}
                    </div>
                  `
          })
        }
        else if (region['type'].toLowerCase() == 'title') {
          const title = document.createElement('h1', {
            class: 'title'
          })
          for (let j = 0; j < region['res'].length; ++j) {
            title.innerText = `${title.innerText}${region['res'][j].text}`;
            /* title.appendChild(p) */
          }
          currentDiv.appendChild(title);
          currentElement.children.push({
            type: region['type'].toLowerCase(),
            index: i,
            html: `
                    <div id="editor${i}">
                       ${title.outerHTML}
                    </div>
             `
          })
        }
        else if (region['type'].toLowerCase() == 'table') {
          const tableCtn = document.createElement('div', {
            class: 'table-ctn'
          })
          tableCtn.innerHTML = region['res']['html'];
          currentDiv.appendChild(tableCtn)
          currentElement.children.push({
            type: region['type'].toLowerCase(),
            index: i,
            html: `
                    <div id="editor${i}">
                          ${tableCtn.outerHTML}
                    </div>
                  `
          })
        }
        // text类型
        else {
          const textCtn = document.createElement('div', {
            class: 'text-ctn'
          })
          const tab = document.createElement('span');
          tab.innerHTML = `&nbsp;&nbsp;&nbsp;&nbsp;`
          textCtn.appendChild(tab);
          let textLength = 0
          for (let j = 0; j < region['res'].length; ++j) {
            const line = region['res'][j];
            const paragrah = document.createElement('span')
            paragrah.innerText = line['text'];
            textLength = textLength + line['text'].length;
            textCtn.appendChild(paragrah);
          }
          if(textLength === 0) {
            continue
          }
          currentDiv.appendChild(textCtn)
          currentElement.children.push({
            type: region['type'].toLowerCase(),
            index: i,
            html: `
                    <div id="editor${i}">
                        ${textCtn.outerHTML}
                     </div>
                 `
          })
        }
      }
      
      console.log(html.outerHTML) // 输出生成的HTML
      layoutList.value = list;  // 生成的HTML结构数组
      console.log(list)
      return list;
    }

HTML布局数组转化成HTML文本

代码如下


    // HTML布局数组转换为HTML
    const handleLayoutListToHTML = () => {
      // 创建最外层HTML元素
      const htmlElement = document.createElement('div');
      // 遍历结构数组
      for(let i = 0; i < layoutList.value.length; ++i) {
        const layout = layoutList.value[i];
        // 如果是单列结构
        if(layout.type === 'single') {
          const singleLayout = document.createElement('div');
          for(let j = 0; j < layout.children.length; ++j) {
            singleLayout.innerHTML = `${singleLayout.innerHTML}${layout.children[j].html}`
          }
          htmlElement.appendChild(singleLayout)
        }
        // 如果是双列结构
        else if(layout.type === 'double') {
          // 创建双列布局元素
          const doubleLayout = document.createElement('div');
          // 创建第一列，并遍历第一列元素
          const leftDiv = document.createElement('div');
          leftDiv.style.setProperty('width', '49%');
          leftDiv.style.setProperty('display', 'inline-block');
          leftDiv.style.setProperty('vertical-align', 'top');
          for(let j  = 0; j < layout.children[0].children.length; ++j) {
            leftDiv.innerHTML =  `${leftDiv.innerHTML}${layout.children[0].children[j].html}`
          }
          doubleLayout.appendChild(leftDiv);
          // 遍历第二列
          const rightDiv = document.createElement('div');
          rightDiv.style.setProperty('width', '49%');
          rightDiv.style.setProperty('display', 'inline-block');
          rightDiv.style.setProperty('vertical-align', 'top');
          rightDiv.style.setProperty('margin-left', '2%');
          for(let j  = 0; j < layout.children[1].children.length; ++j) {
            rightDiv.innerHTML = `${rightDiv.innerHTML}${layout.children[1].children[j].html}`
          }
          doubleLayout.appendChild(rightDiv);
          // 添加双列布局元素进入外层元素
          htmlElement.appendChild(doubleLayout);
        }
      }
      htmlText.value = htmlElement.outerHTML;
      return htmlElement.outerHTML;
    }

校正方案

通过Fabricjs实现显示图片，并且绘制bbox边框及label。详见代码。

总结

只是简单的写了个demo，代码写的有些凌乱，仅供参考。不足之处敬请见谅。

参考文档及资料

Fabric.js 从入门到________ - 掘金 (juejin.cn)

JSDoc: Source: fabric.js (fabricjs.com)

PaddleOCR: 基于飞桨的OCR工具库，包含总模型仅8.6M的超轻量级中文OCR，单模型支持中英文数字组合识别、竖排文本识别、长文本识别。同时支持多种文本检测、文本识别的训练算法。 - Gitee.com