- Prepare the pom dependencies
<dependencies>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>4.1.2</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>5.6.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-api -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>2.0.6</version>
    </dependency>
</dependencies>
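- Download the trained data from github.com/tesseract-o… and put it under src/main/resources/tessdata. The directory must match the path passed to setDatapath on the second line of main. These are the trained recognition files; without them the program fails with an error.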
src/main/resources/tessdata
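A missing language file only shows up once OCR is attempted, so a small pre-flight check can make the cause more obvious. The sketch below assumes the usual Tesseract naming convention of `<language>.traineddata` inside the data directory; `TessdataCheck` and `hasLanguage` are hypothetical names for illustration and are not part of Tess4J.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tess4J: checks that a language file is present.
final class TessdataCheck {

    // Returns true if <tessdataDir>/<language>.traineddata exists as a regular file.
    static boolean hasLanguage(Path tessdataDir, String language) {
        return Files.isRegularFile(tessdataDir.resolve(language + ".traineddata"));
    }

    public static void main(String[] args) {
        // Same directory and language as used in this post.
        Path dir = Paths.get("src/main/resources/tessdata");
        if (hasLanguage(dir, "eng")) {
            System.out.println("Found " + dir.resolve("eng.traineddata"));
        } else {
            System.err.println("Missing " + dir.resolve("eng.traineddata"));
        }
    }
}
```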
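- Prepare an empty Excel file named table.xlsx and put it under src/main/resources. Put the image to be recognized under src/main/resources as well. Make sure the file names and paths match the ones used in the code.
- Run the code; the result is shown below.
- Code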
package org.example;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.ss.usermodel.*;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class Main {
    public static void main(String[] args) throws InvalidFormatException, IOException, TesseractException {
        // Configure Tesseract: trained data directory and recognition language.
        ITesseract instance = new Tesseract();
        instance.setDatapath("src/main/resources/tessdata");
        instance.setLanguage("eng");

        // Run OCR on the image and split the recognized text into lines.
        String result = instance.doOCR(new File("src/main/resources/image.PNG"));
        String[] lines = result.split("\r?\n");

        // Open the empty template workbook and use its first sheet.
        Workbook workbook = WorkbookFactory.create(new File("src/main/resources/table.xlsx"));
        Sheet sheet = workbook.getSheetAt(0);

        // One row per text line, one cell per whitespace-separated word.
        int rowCount = 0;
        for (String line : lines) {
            Row row = sheet.createRow(rowCount++);
            int columnCount = 0;
            // The backslash must be doubled in a Java string literal.
            for (String word : line.split("\\s+")) {
                Cell cell = row.createCell(columnCount++);
                cell.setCellValue(word);
            }
        }

        // Write the result to a new file; try-with-resources closes the stream.
        File file = new File("src/main/resources/table_from_image.xlsx");
        try (OutputStream out = new FileOutputStream(file)) {
            workbook.write(out);
        }
        workbook.close();
    }
}
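The row and column logic can be tried out without Tesseract or POI. Below is a minimal sketch of the same line and word splitting as a standalone helper; `OcrGrid` and `toGrid` are hypothetical names introduced here. Unlike the code above, this variant trims each line and skips blank ones, which is an assumption about the desired output rather than what the original does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tess4J or Apache POI.
final class OcrGrid {

    // Splits raw OCR text into rows (one per non-blank line)
    // and cells (whitespace-separated tokens within a line).
    static List<List<String>> toGrid(String ocrText) {
        List<List<String>> grid = new ArrayList<>();
        for (String line : ocrText.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines instead of emitting empty rows
            }
            grid.add(Arrays.asList(trimmed.split("\\s+")));
        }
        return grid;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for a doOCR result.
        String sample = "Name  Qty Price\n\n  Apple 3 1.20\r\nPear 10 0.85\n";
        for (List<String> row : toGrid(sample)) {
            System.out.println(row);
        }
    }
}
```

Each inner list maps to one `Row`, and each element to one `Cell`, in the same order the loops above create them.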
6. Original image
7. Recognition result
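- Notes: before every run, delete table.xlsx, create a new blank workbook on the desktop, and copy it into src/main/resources; otherwise you may get garbled output or errors. table_from_image.xlsx must also be deleted, but do not create a new one in its place.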
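The manual cleanup can be scripted. The sketch below assumes a pristine blank workbook is kept somewhere outside the project; `WorkFileReset`, `reset`, and the argument order are hypothetical and only illustrate the idea with standard `java.nio.file` calls.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper; class, method, and file names are illustrative.
final class WorkFileReset {

    // Overwrites the working template with a pristine blank copy
    // and removes the output of the previous run, if any.
    static void reset(Path blankTemplate, Path workingTemplate, Path staleOutput) throws IOException {
        Files.copy(blankTemplate, workingTemplate, StandardCopyOption.REPLACE_EXISTING);
        Files.deleteIfExists(staleOutput);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: WorkFileReset <path-to-blank.xlsx>");
            return;
        }
        Path resources = Paths.get("src/main/resources");
        reset(Paths.get(args[0]),
              resources.resolve("table.xlsx"),
              resources.resolve("table_from_image.xlsx"));
    }
}
```

Running this before the OCR program gives the same starting state as the manual steps: a fresh table.xlsx and no leftover table_from_image.xlsx.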