Hands-On: Using OCR to Check Images for Chinese Characters

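Preface

In internationalized projects, text embedded in images can cause i18n problems. For example, if an image contains Chinese, it needs extra handling during internationalization, or a separate version of the image has to be prepared for each language, which raises development and maintenance costs. This article describes a way to run OCR with front-end tooling to check whether images contain Chinese, and to trigger that check when code is committed.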

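Result

A quick image check is triggered on git commit.
It only warns and never blocks the commit, and because it runs as a Git hook, commits made from VS Code or other tools get the reminder too.
Adding the --no-verify flag skips the commit check quickly.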

Flowchart

The dashed part is what this article implements.

Let's implement it step by step.

Getting started: a simple check of all images

We use tesseract.js for OCR. tesseract.js is a JavaScript library based on the Tesseract OCR engine that runs both in the browser and in Node.js.

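Why tesseract.js:

Safe and stable: no code or images are uploaded; recognition runs entirely on the local machine, so nothing leaves it.
Free of charge: commercial OCR services, such as the one offered by Tencent Cloud (120 yuan per thousand calls), are comparatively expensive.
Broad compatibility: the underlying Tesseract engine was long sponsored by Google, the JS library is stable and reliable, and it is one of the preferred choices in the industry.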

Installing dependencies

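First, we need to install tesseract.js and the Tesseract language pack. Note that the scripts in this article call worker.loadLanguage and worker.initialize; newer major versions of tesseract.js changed this worker API, so an older version (v4, presumably) may need to be pinned: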

npm install tesseract.js -D

Next, download the Tesseract language pack. The Simplified Chinese pack can be downloaded with:

wget https://github.com/tesseract-ocr/tessdata/raw/main/chi_sim.traineddata

Put the downloaded chi_sim.traineddata file in a tesseract folder inside the project, structured as follows:

// Directory structure

project
  |- src/
  |- scripts/
      |- check-image.js
      |- tesseract/
          |- chi_sim.traineddata

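A note on choosing the language pack:

Our goal is only to determine whether an image contains text, not what the text says. Weighing language pack size, recognition speed, and accuracy, we decided to give up some accuracy in exchange for a smaller pack and faster recognition. This language pack is about 19.3 MB, so it takes up little space. (The size may differ between versions.)

Writing the code

We write a simple script that checks every image for Chinese: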

// scripts/check-image.js
const fs = require('fs');
const os = require('os');
const path = require('path');
const { createWorker } = require('tesseract.js');

async function main() {
  const extensions = ['.jpg', '.jpeg', '.png', '.gif'];
  const cwd = process.cwd();
  const imagePaths = (await readdirRecursive('./')).filter((file) => {
    const ext = path.extname(file);
    return extensions.includes(ext);
  });

  // Create a Tesseract worker with `createWorker`
  // and load the language pack from the tesseract directory next to this script.
  const worker = await createWorker({
    langPath: path.join(__dirname, 'tesseract'),
    cachePath: path.join(__dirname, 'tesseract'),
  });
  // Load and initialize the language pack
  await worker.loadLanguage('chi_sim');
  await worker.initialize('chi_sim');

  const imagesWithChinesePaths = [];

  // 👇 Core recognition
  // Check the images: iterate over all image files, run OCR with Tesseract, and check whether the result contains Chinese.
  for (const filePath of imagePaths) {
    const imageUrl = path.relative(cwd, filePath);
    const { data: { text } } = await worker.recognize(imageUrl);
    const chineseText = text.replace(/[^\u4e00-\u9fa5]/g, ''); // keep only Chinese characters
    if (chineseText.length > 0) {
      imagesWithChinesePaths.push(filePath);
    }
  }

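  // 👇 Finish recognition
  await worker.terminate();

  // Print the result: if any image contains Chinese, print a warning.
  if (imagesWithChinesePaths.length > 0) {
    console.log(`
\x1b[33m[WARNING] The following images may contain Chinese characters:\x1b[0m
\x1b[31m${imagesWithChinesePaths.join('\n')}\x1b[0m
\x1b[1m
If you have confirmed that the images contain no Chinese characters, ignore this message.\x1b[0m
    `);
  }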
}

// Recursively collect every file under a directory
async function readdirRecursive(dir) {
  const files = [];
  const readdir = (currentDir) => {
    const entries = fs.readdirSync(currentDir, { withFileTypes: true });
    for (const entry of entries) {
      const fullPath = path.join(currentDir, entry.name);
      if (entry.isDirectory()) {
        readdir(fullPath);
      } else {
        files.push(fullPath);
      }
    }
  };
  readdir(dir);
  return files;
}

main().catch((error) => {
  console.error(error);
});
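
The character class [^\u4e00-\u9fa5] used above keeps only characters from the basic CJK Unified Ideographs range. As a possible alternative, a Unicode script property also covers Han characters outside that range. The sketch below is only an illustration: countHan, containsHan, and the minCount threshold are hypothetical names introduced here, and a threshold above 1 is an assumption that might reduce false positives from OCR noise at the cost of missing images with a single character.

```javascript
// Sketch: decide whether OCR output contains Han (Chinese) characters.
// `countHan`, `containsHan` and `minCount` are hypothetical, not part of the article's script.
const HAN_PATTERN = /\p{Script=Han}/gu;

// Count how many Han characters appear in the text.
function countHan(text) {
  const matches = text.match(HAN_PATTERN);
  return matches ? matches.length : 0;
}

// Report true when at least `minCount` Han characters are present.
function containsHan(text, minCount = 1) {
  return countHan(text) >= minCount;
}

console.log(containsHan('Hello, world')); // false
console.log(containsHan('提交 button'));   // true
```

Note that the Han script also includes characters used in Japanese and Korean text, so this check cannot tell those languages apart from Chinese.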

Running it and the result

Run:

node scripts/check-image.js

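Result:

✅ It ran successfully and accurately identified the images that contain text! However, it also revealed other problems:

Image scope: node_modules is included, so unnecessary images are checked.
Execution time: it takes a very long time to run, which does not fit a quick commit.
How it runs: it is started with node xxx and has no real connection to committing.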
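
The first problem can be reduced on its own by skipping dependency directories during the walk. The following is a minimal sketch under stated assumptions: findImages is a hypothetical replacement for the recursive helper, and the IGNORED_DIRS list is a guess at directories that never hold project images. It does nothing about the running time of OCR itself.

```javascript
const fs = require('fs');
const path = require('path');

// Assumption: directories that never contain images we care about.
const IGNORED_DIRS = new Set(['node_modules', '.git', 'dist']);
const IMAGE_EXTENSIONS = new Set(['.jpg', '.jpeg', '.png', '.gif']);

// Recursively collect image files, skipping ignored directories.
// `findImages` is a hypothetical helper name.
function findImages(dir, found = []) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      if (!IGNORED_DIRS.has(entry.name)) {
        findImages(fullPath, found);
      }
    } else if (IMAGE_EXTENSIONS.has(path.extname(entry.name).toLowerCase())) {
      // Lower-casing the extension also matches files such as logo.PNG.
      found.push(fullPath);
    }
  }
  return found;
}
```

A call such as findImages(process.cwd()) would then return only the images outside the ignored directories.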

All in all, this is not a perfect solution. Could we check only the images in the current commit to solve these problems?

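Advanced: incremental image check

In real development we usually only need to check new or modified images rather than all of them. Git hooks let us do this.

Configuring the Git hook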

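Install husky, lint-staged, and cross-env (the lint-staged command below uses cross-env to set an environment variable). The package.json-based hook configuration shown below is the husky v4 format, so that version is pinned here:

npm install husky@4 lint-staged cross-env -D

Add a pre-commit hook that checks new or modified images when code is committed. The -v flag makes lint-staged show task output even when the task succeeds, which is needed here because the script only warns.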

// package.json
{
  // ...,
  "husky": {
    "hooks": {
      "pre-commit": "lint-staged -v"
    }
  },
  "lint-staged": {
    "*.{jpg,jpeg,png,gif}": "cross-env LINT_STAGED=1 node ./scripts/check-image.js"
  }
}
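
With this configuration, lint-staged appends the paths of the staged files that match the glob to the command, which is why the script can read them from process.argv. A minimal sketch of normalizing that list follows. selectImages is a hypothetical helper name, and both the de-duplication and the case-insensitive extension match are assumptions added here so that files such as banner.PNG are not missed.

```javascript
const path = require('path');

const IMAGE_EXTENSIONS = new Set(['.jpg', '.jpeg', '.png', '.gif']);

// Turn the raw argument list into a de-duplicated list of absolute image paths.
// `selectImages` is a hypothetical helper, not part of lint-staged.
function selectImages(argv) {
  const unique = new Set(argv.map((file) => path.resolve(file)));
  return [...unique].filter((file) =>
    IMAGE_EXTENSIONS.has(path.extname(file).toLowerCase())
  );
}

// Intended usage inside the script: selectImages(process.argv.slice(2))
```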

Writing the script

We write a script that checks new or modified images:

const os = require('os');
const path = require('path');
const { createWorker } = require('tesseract.js');
const childProcess = require('child_process');

async function main(files) {
  // ...
}

async function lintStaged() {
  const isLintStaged = !!process.env.LINT_STAGED;
  if (!isLintStaged) {
    return;
  }
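  try {
    // Check whether the Git repository is in the middle of a merge.
    // Skip the check during a merge; otherwise it would trigger unnecessary checks.
    childProcess.execSync('git rev-parse -q --verify MERGE_HEAD', { encoding: 'utf-8' });
    return;
  } catch {
    // MERGE_HEAD does not exist: no merge is in progress, so continue.
  }

  // Only check the file list passed in by lint-staged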
  const lintStagedFiles = process.argv.slice(2);
  return main(lintStagedFiles);
}

lintStaged().catch((error) => {
  console.error(error);
});
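
The try/catch above relies on git rev-parse exiting with a non-zero status when MERGE_HEAD does not exist. The same check can be written so that the intent is visible in a return value. This is only a sketch of the idea: isMergeInProgress is a hypothetical name, and treating every failure (including Git being unavailable) as "not merging" is an assumption made here.

```javascript
const { spawnSync } = require('child_process');

// Return true when the repository at `cwd` has a merge in progress.
// `isMergeInProgress` is a hypothetical helper name.
function isMergeInProgress(cwd = process.cwd()) {
  const result = spawnSync(
    'git',
    ['rev-parse', '-q', '--verify', 'MERGE_HEAD'],
    { cwd, stdio: 'ignore' }
  );
  // Exit status 0 means MERGE_HEAD resolved; anything else is treated as "no merge".
  return result.status === 0;
}
```

In lintStaged, the try/catch could then become a plain if (isMergeInProgress()) return; guard.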

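Running it and the result

Run:

Add an image that contains text to the project.
Commit with git commit.

Result:

✅ It ran successfully and fully met our expectations: the commit is not blocked and only a warning is shown, and only the images in the current commit are checked, without node_modules.

However, as the number of images grows it still feels slow. Node runs JavaScript on a single thread with asynchronous I/O, so can we process images in batches and put the CPU to use?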

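Final: batch-checking incremental images

To improve efficiency, we can check new or modified images in batches.

Modifying the script

Modify the script so that it can process images in batches: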

const os = require('os');
const path = require('path');
const { createWorker } = require('tesseract.js');
const childProcess = require('child_process');

// Number of workers to start at the same time
const workerCount = Math.min(32, os.cpus().length);

async function main(files) {
  // ...
  const imagePaths = (files || ['./']).filter((file) => {
    // ...
  });

  const workers = Array.from({ length: workerCount }, async () => {
    const worker = await createWorker({
      langPath: path.join(__dirname, 'tesseract'),
      cachePath: path.join(__dirname, 'tesseract'),
    });
    await worker.loadLanguage('chi_sim');
    await worker.initialize('chi_sim');
    return worker;
  });

  // Wait for all workers to finish initializing in parallel
  const initializedWorkers = await Promise.all(workers);

  const imagesWithChinesePaths = [];

  while (imagePaths.length > 0) {
    const filePathsToProcess = imagePaths.splice(0, workerCount);
    const workerPromises = filePathsToProcess.map(async (filePath, index) => {
      try {
        const imageUrl = path.relative(cwd, filePath);
        const worker = initializedWorkers[index];
        const { data: { text } } = await worker.recognize(imageUrl);
        const chineseText = text.replace(/[^\u4e00-\u9fa5]/g, ''); // keep only Chinese characters
        if (chineseText.length > 0) {
          imagesWithChinesePaths.push(filePath);
        }
      } catch (error) {
        // Log and skip: re-queueing here would loop forever on an image that always fails.
        console.error(`Error while recognizing ${filePath}:`, error);
      }
    });

    await Promise.all(workerPromises);
  }

  // Terminate all workers
  await Promise.all(initializedWorkers.map(worker => worker.terminate()));

  if (imagesWithChinesePaths.length > 0) {
    // ...
  }
}

async function lintStaged() {
  // ...
}

lintStaged().catch((error) => {
  console.error(error);
});
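
Re-queueing a failed image unconditionally can loop forever when the same image fails every time. If retrying is still wanted, a bounded retry is one option. The sketch below is an illustration only: withRetry is a hypothetical helper, and the default of three attempts is an arbitrary assumption.

```javascript
// Run `task` up to `maxAttempts` times and return its first successful result.
// `withRetry` is a hypothetical helper; `task` receives the attempt number.
async function withRetry(task, maxAttempts = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error; // remember the failure and try again
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}
```

Inside the batch loop, the call worker.recognize(imageUrl) could be wrapped as withRetry(() => worker.recognize(imageUrl)), with the surrounding catch block logging the final failure.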

Run it again, and multiple images are now processed faster.
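
The batch loop waits for the slowest image in each batch before it starts the next one. A possible refinement is to let every worker pull the next image from a shared queue as soon as it is free. The sketch below shows that scheduling idea with plain promises; runPool is a hypothetical helper, and handler stands in for a call such as initializedWorkers[workerIndex].recognize(file).

```javascript
// Process `items` with at most `concurrency` handlers running at once.
// `runPool` is a hypothetical helper; results keep the order of `items`.
async function runPool(items, concurrency, handler) {
  const results = new Array(items.length);
  let next = 0; // index of the next unprocessed item, shared by all lanes

  // Each lane keeps taking the next item until the queue is empty.
  async function lane(workerIndex) {
    while (next < items.length) {
      const index = next;
      next += 1;
      results[index] = await handler(items[index], workerIndex);
    }
  }

  // Never start more lanes than there are items.
  const laneCount = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: laneCount }, (_, i) => lane(i)));
  return results;
}
```

If handler throws, the returned promise rejects, so error handling would still need to live inside the handler. tesseract.js also provides a createScheduler helper for distributing jobs across workers, which may be worth checking in the library's documentation before writing a custom pool.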

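In an emergency: skipping the check on commit

If we are sure that a commit contains no images with Chinese, or we simply want to skip the OCR check, we can use the following command:

git commit -m "commit message" --no-verify
# or
git commit -m "commit message" -n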

Complete configuration

Code

// scripts/check-image.js
const os = require('os');
const path = require('path');
const { createWorker } = require('tesseract.js');
const childProcess = require('child_process');

const workerCount = Math.min(32, os.cpus().length);

async function main(files) {
  const extensions = ['.jpg', '.jpeg', '.png', '.gif'];
  const cwd = process.cwd();
  const imagePaths = (files || ['./']).filter((file) => {
    const ext = path.extname(file);
    return extensions.includes(ext);
  });

  const workers = Array.from({ length: workerCount }, async () => {
    const worker = await createWorker({
      langPath: path.join(__dirname, 'tesseract'),
      cachePath: path.join(__dirname, 'tesseract'),
    });
    await worker.loadLanguage('chi_sim');
    await worker.initialize('chi_sim');
    return worker;
  });

  const initializedWorkers = await Promise.all(workers);

  const imagesWithChinesePaths = [];

  while (imagePaths.length > 0) {
    const filePathsToProcess = imagePaths.splice(0, workerCount);
    const workerPromises = filePathsToProcess.map(async (filePath, index) => {
      try {
        const imageUrl = path.relative(cwd, filePath);
        const worker = initializedWorkers[index];
        const { data: { text } } = await worker.recognize(imageUrl);
        const chineseText = text.replace(/[^\u4e00-\u9fa5]/g, ''); // keep only Chinese characters
        if (chineseText.length > 0) {
          imagesWithChinesePaths.push(filePath);
        }
      } catch (error) {
        // Log and skip: re-queueing here would loop forever on an image that always fails.
        console.error(`Error while recognizing ${filePath}:`, error);
      }
    });

    await Promise.all(workerPromises);
  }

  await Promise.all(initializedWorkers.map(worker => worker.terminate()));

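  if (imagesWithChinesePaths.length > 0) {
    console.log(`
\x1b[33m[WARNING] The following newly added images may contain Chinese characters:\x1b[0m
\x1b[31m${imagesWithChinesePaths.join('\n')}\x1b[0m
\x1b[1m
If you have confirmed that the images contain no Chinese characters, ignore this message.\x1b[0m
    `);
  }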
}

async function lintStaged() {
  const isLintStaged = !!process.env.LINT_STAGED;
  if (!isLintStaged) {
    return;
  }
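  try {
    // Skip the check while a merge is in progress.
    childProcess.execSync('git rev-parse -q --verify MERGE_HEAD', { encoding: 'utf-8' });
    return;
  } catch {
    // MERGE_HEAD does not exist: no merge is in progress, so continue.
  }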

  const lintStagedFiles = process.argv.slice(2);
  return main(lintStagedFiles);
}

lintStaged().catch((error) => {
  console.error(error);
});

package.json

{
  // ...
  "husky": {
    "hooks": {
      "pre-commit": "lint-staged -v"
    }
  },
  "lint-staged": {
    "*.{jpg,jpeg,png,gif}": "cross-env LINT_STAGED=1 node ./scripts/check-image.js"
  }
}

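Summary

With the approach above, we can use OCR in front-end tooling to check images for Chinese and trigger the check when code is committed. This helps avoid internationalization problems and makes maintenance more efficient.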