tesseract使用记录接到需要爬取带有图形验证码的接口，之前只是有大概的了解，并没有详细操作过。在此记录解决此问题

开始背景

接到需要爬取带有图形验证码的接口，之前只是有大概的了解，并没有详细操作过。在此记录解决此问题的过程。

前期准备

过程记录

1、mac安装tesseract：

不需要训练的情况

brew install tesseract

需要训练的情况

brew install --with-training-tools tesseract
尝试时发现报错，Homebrew中找不到此命令，需要编译安装：
Error: invalid option: --with-training-tools

参考以下文章(感谢文章作者)

https://pengshiyu.blog.csdn.net/article/details/104398303
https://blog.csdn.net/u010670689/article/details/78374623

1、安装依赖
# Packages which are always needed.
brew install automake autoconf libtool //提示新版本
brew install pkgconfig
brew install icu4c
brew install leptonica

# Packages required for training tools.
brew install pango

# Optional packages for extra features.
brew install libarchive

# Optional package for builds using g++.
brew install gcc

2、下载解压tesseract-4.1.1.tar.gz
https://github.com/tesseract-ocr/tesseract/releases

3、编译安装
cd tesseract-4.1.1
./autogen.sh
mkdir build
cd build

# Optionally add CXX=g++-8 to the configure command if you really want to use a different compiler.
../configure PKG_CONFIG_PATH=/usr/local/opt/icu4c/lib/pkgconfig:/usr/local/opt/libarchive/lib/pkgconfig:/usr/local/opt/libffi/lib/pkgconfig
make -j

# Optionally install Tesseract.
sudo make install

# Optionally build and install training tools.
make training
sudo make training-install

4、下载eng.traineddata
https://github.com/tesseract-ocr/tessdata

tessdata库所在位置 /usr/local/share/tessdata

jTessBoxEditor基本训练 blog.csdn.net/qq_40147863…

准备tiff格式字体图片

将准备好的tiff格式图片合并（jTessBoxEditor -> Toools -> Merge）
图片最好进行反相+二值化

生成.box文件

【语法】：tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox  
【语法】：lang为语言名称，fontname为字体名称，num为序号；在tesseract中，一定要注意格式

定义字符配置文件

在文件夹文件夹内，新建一个文本文件，名为font_properties，删掉.txt，用记事本打开，写入内容为：

font 0 0 0 0 0

【语法】：<fontname> <italic> <bold> <fixed> <serif> <fraktur> 
【语法】：fontname为字体名称，italic为斜体，bold为黑体字，
fixed为默认字体，serif为衬线字体，fraktur德文黑字体，
1和0代表有和无，精细区分时可使用

jTessBoxEditor标记Box文件

jTessBoxEditor生成训练数据

后续：

tesseract 使用专门训练后的数据集识别率有一定提高，对于纯文本识别有较强的实用性，基本能够完成需求。学习和使用都比较简单，标记数据方便（大量数据下会变成单纯的体力劳动）。

在识别下图这种字符交叉的数据时准确率较低（图片已经进行了简单的二值化处理，只标记了100张左右的数据，不清楚训练的数据集足够大的情况下准确率如何。ps:这种标记方式很累，不知道是不是姿势不对，所以放弃使用此方案了。下方提供训练后的数据集可用来简单测试）

训练后的数据集：github.com/FreeFlash/c…

附相关python代码

from PIL import Image
import tesserocr
import os

def code_image_ocr(rootPath,file):
    img = Image.open(rootPath+file)
    result = tesserocr.image_to_text(img,lang = 'eng')
    result1 = tesserocr.image_to_text(img,lang = 'code')
    print(file + '->\n eng: ' + result+"\n code: "+result1)