Java code for converting a table image to Excel, tested and working

  1. Add the pom dependencies
<dependencies>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>4.1.2</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>5.6.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-api -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>2.0.6</version>
    </dependency>

</dependencies>
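  2. Download the trained data files from github.com/tesseract-o… and place them in src/main/resources/tessdata. This location must match the path passed to setDatapath in the code. These are the trained recognition files, and the program fails with an error without them.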
src/main/resources/tessdata
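Since a missing trained data file is the most common reason the program fails, it can help to check for it before running OCR. The following is a minimal sketch, assuming the language file follows Tesseract's usual `<language>.traineddata` naming (so `eng.traineddata` for the `eng` language used here). The class and method names are made up for illustration and are not part of Tess4J.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TessdataCheck {

    // Returns true when <dataPath>/<language>.traineddata exists as a regular file.
    static boolean hasTrainedData(String dataPath, String language) {
        Path trained = Paths.get(dataPath, language + ".traineddata");
        return Files.isRegularFile(trained);
    }

    public static void main(String[] args) {
        String dataPath = "src/main/resources/tessdata"; // same value passed to setDatapath
        String language = "eng";                          // same value passed to setLanguage
        if (hasTrainedData(dataPath, language)) {
            System.out.println("Trained data found.");
        } else {
            System.err.println("Missing " + language + ".traineddata under " + dataPath);
        }
    }
}
```

Running this from the project root before the main program shows quickly whether the folder layout matches what the code expects.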
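  3. Prepare an empty Excel file named table.xlsx and put it in src/main/resources, together with the image to be recognized. The file names and paths must match those used in the code.
  4. Run the code. The result looks like this: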

image.png

  5. Code
package org.example;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.ss.usermodel.*;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class Main {

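    public static void main(String[] args) throws InvalidFormatException, IOException, TesseractException {
        // Point Tesseract at the trained data folder and select the language
        ITesseract instance = new Tesseract();
        instance.setDatapath("src/main/resources/tessdata");
        instance.setLanguage("eng");
        // Run OCR on the image and split the recognized text into lines
        String result = instance.doOCR(new File("src/main/resources/image.PNG"));
        String[] lines = result.split("\\r?\\n");
        // Open the empty template workbook and write into its first sheet
        Workbook workbook = WorkbookFactory.create(new File("src/main/resources/table.xlsx"));
        Sheet sheet = workbook.getSheetAt(0);
        int rowCount = 0;
        for (String line : lines) {
            // One sheet row per recognized line
            Row row = sheet.createRow(rowCount++);
            int columnCount = 0;
            // One cell per whitespace-separated token; trim so leading spaces do not create an empty first cell
            for (String word : line.trim().split("\\s+")) {
                Cell cell = row.createCell(columnCount++);
                cell.setCellValue(word);
            }
        }
        // Save the result to a new file, closing the stream even if the write fails
        File file = new File("src/main/resources/table_from_image.xlsx");
        try (FileOutputStream out = new FileOutputStream(file)) {
            workbook.write(out);
        }
        workbook.close();
    }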
}
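The text-to-cells step can be tried on its own, without Tesseract or POI, to see how a recognized string turns into rows and columns. The sketch below uses a hypothetical helper class and a made-up sample string. Unlike the code above, it skips blank lines instead of creating empty rows for them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OcrGrid {

    // Splits OCR text into rows (one per non-blank line) and
    // cells (one per whitespace-separated token).
    static List<List<String>> toGrid(String ocrText) {
        List<List<String>> grid = new ArrayList<>();
        for (String line : ocrText.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            grid.add(Arrays.asList(trimmed.split("\\s+")));
        }
        return grid;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the string returned by doOCR
        String sample = "Name Qty Price\r\n  Apple 3 1.50\n\nPear 10 0.80\n";
        for (List<String> row : toGrid(sample)) {
            System.out.println(row);
        }
    }
}
```

Because cells are separated purely by whitespace, a table cell that itself contains a space (for example a two-word product name) ends up split across two columns. That is a limitation of this simple approach.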

  6. Original image

image.PNG

  7. Recognition result

image.png

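  8. Notes: Before every run, delete table.xlsx and copy in a fresh blank workbook (for example, one newly created on the desktop). Otherwise the output may be garbled or the program may throw an error. table_from_image.xlsx must also be deleted, but do not create a new one; the program generates it.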
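The manual cleanup described in the notes could be automated with a few lines of standard Java. This is only a sketch: it assumes an untouched empty workbook is kept under the hypothetical name blank.xlsx next to the other files, and the class name is made up as well. One plausible reason the reset is needed (an assumption, not something verified here) is that a workbook opened from a File may be written back to that same file when it is closed, so table.xlsx no longer stays empty after a run.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class ResetWorkFiles {

    // Overwrites `template` with a fresh copy of `blank`
    // and deletes the previous `output` if it exists.
    static void reset(Path blank, Path template, Path output) throws IOException {
        Files.copy(blank, template, StandardCopyOption.REPLACE_EXISTING);
        Files.deleteIfExists(output);
    }

    public static void main(String[] args) throws IOException {
        Path resources = Paths.get("src/main/resources");
        reset(resources.resolve("blank.xlsx"),            // hypothetical pristine empty workbook
              resources.resolve("table.xlsx"),            // template opened by the OCR program
              resources.resolve("table_from_image.xlsx")); // output of the previous run
    }
}
```

Running this before the main program leaves table.xlsx blank and removes the old result, which matches the manual steps above.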