JavaScript 如何判断中文字符？作为一位中国的开发者，会经常遇到需要判断字符串是否为中文字符的场景，不过判断是否

导读

作为一位中国的开发者，会经常遇到需要判断字符串是否为中文字符的场景，不过判断是否为中文字符，并不像想象中那么容易，今天就来介绍一下中文字符都包含哪些数据，如何用 JavaScript 编写一个较严谨的 isChinese() 方法检测中文字符的。

中文字符的编码取值范围

检测中文字符的主要思路就是需要知道中文字符的 Unicode 编码范围，然后结合正则表达式检测字符串是否在给定的 Unicode 编码范围内。 通常的一个较为简单的实现如下：

const isChinese = (str) => {
  const pattern = /^[\u4e00-\u9fa5]+$/;
  return pattern.test(str);
}

但这段代码的正则表达式包含的中文字符并不全面，根据维基百科的介绍，中文字符包含以下内容：

中文汉字
象形文字扩展 A-H
兼容象形文字符
兼容表意文字增补字符
中文标点符号
兼容标点符号

中文汉字

const Ideographs = [0x4e00, 0x9fff]

所以上文中那个 isChinese() 方法仅仅只检测了是否包含了中文汉字的字符集，还有很多缺失的部分。

象形文字扩展 A - H

const IdeographsExtensionA = [0x3400, 0x4dbf]
const IdeographsExtensionB = [0x20000, 0x2a6df]
const IdeographsExtensionC = [0x2a700, 0x2b73f]
const IdeographsExtensionD = [0x2b740, 0x2b81f]
const IdeographsExtensionE = [0x2b820, 0x2ceaf]
const IdeographsExtensionF = [0x2ceb0, 0x2ebef]
const IdeographsExtensionG = [0x30000, 0x3134f]
const IdeographsExtensionH = [0x31350, 0x323af]

汉字的标点符号

const Punctuations = [
    // ，
    [0xff0c, 0xff0c],
    // 。
    [0x3002, 0x3002],
    // ·
    [0x00b7, 0x00b7],
    // ×
    [0x00d7, 0x00d7],
    // —
    [0x2014, 0x2014],
    // ‘
    [0x2018, 0x2018],
    // ’
    [0x2019, 0x2019],
    // “
    [0x201c, 0x201c],
    // ”
    [0x201d, 0x201d],
    // …
    [0x2026, 0x2026],
    // 、
    [0x3001, 0x3001],
    // 《
    [0x300a, 0x300a],
    // 》
    [0x300b, 0x300b],
    // 『
    [0x300e, 0x300e],
    // 』
    [0x300f, 0x300f],
    // 【
    [0x3010, 0x3010],
    // 】
    [0x3011, 0x3011],
    // ！
    [0xff01, 0xff01],
    // （
    [0xff08, 0xff08],
    // ）
    [0xff09, 0xff09],
    // ：
    [0xff1a, 0xff1a],
    // ；
    [0xff1b, 0xff1b],
    // ？
    [0xff1f, 0xff1f],
    // ～
    [0xff5e, 0xff5e]
]

兼容性标点符号

const CompatibilityForms = [0xfe30, 0xfe4f]

其中，兼容象形文字符：[0xf900, 0xfaff]和兼容表意文字增补字符：[0x2f800, 0x2fa1f]，（根据个人判断）只是看上去像汉字，是中国文化圈内周边国家造出来。因此在 isChinese() 方法中也没有纳入到汉字字符。

isChinese() 方法

到此，我们已经了解了中文字符都包含了哪些字符，并且也通过wiki百科查找到了对应正则表达式，现在我们归纳总结一下，实现一个较为严谨的 isChinese() 方法。

/**
 * @method isChinese
 * @param {String} str - （必须）检测字符串
 * @param {Boolean} [includePunctuation] - （可选）是否包含标点符号：默认值：true
 * @returns {boolean}
 */
const isChinese = (str, includePunctuation = true) => {
  // 转换成正则表达式
  const toRegExp = (range) => {
    const pattern = range
      .map((range) => {
        const rangeStart = range[0]
        const rangeEnd = range[1]
        const hexStart = rangeStart.toString(16)
        const hexEnd = rangeEnd.toString(16)

        if (rangeStart === rangeEnd) {
          return `\\u{${hexStart}}`
        }
        return `[\\u{${hexStart}}-\\u{${hexEnd}}]`
      })
      .join('|')

    return new RegExp(`^(?:${pattern})+$`, 'u')
  }
  // 文字
  // https://en.wikipedia.org/wiki/CJK_Unified_Ideographs
  const chineseIdeographs = [
    // 中文汉字
    [0x4e00, 0x9fff],

    // 象形文字扩展 A - H
    [0x3400, 0x4dbf],
    [0x20000, 0x2a6df],
    [0x2a700, 0x2b73f],
    [0x2b740, 0x2b81f],
    [0x2b820, 0x2ceaf],
    [0x2ceb0, 0x2ebef],
    [0x30000, 0x3134f],
    [0x31350, 0x323af]
  ]
  // 标点符号
  const chinesePunctuations = [
    // ，
    [0xff0c, 0xff0c],
    // 。
    [0x3002, 0x3002],
    // ·
    [0x00b7, 0x00b7],
    // ×
    [0x00d7, 0x00d7],
    // —
    [0x2014, 0x2014],
    // ‘
    [0x2018, 0x2018],
    // ’
    [0x2019, 0x2019],
    // “
    [0x201c, 0x201c],
    // ”
    [0x201d, 0x201d],
    // …
    [0x2026, 0x2026],
    // 、
    [0x3001, 0x3001],
    // 《
    [0x300a, 0x300a],
    // 》
    [0x300b, 0x300b],
    // 『
    [0x300e, 0x300e],
    // 』
    [0x300f, 0x300f],
    // 【
    [0x3010, 0x3010],
    // 】
    [0x3011, 0x3011],
    // ！
    [0xff01, 0xff01],
    // （
    [0xff08, 0xff08],
    // ）
    [0xff09, 0xff09],
    // ：
    [0xff1a, 0xff1a],
    // ；
    [0xff1b, 0xff1b],
    // ？
    [0xff1f, 0xff1f],
    // ～
    [0xff5e, 0xff5e],
    // 兼容性标点符号
    // https://en.wikipedia.org/wiki/CJK_Compatibility_Forms
    [0xfe30, 0xfe4f]
  ]
  const asciiChars = /\w+/

  if (typeof str !== 'string') {
    return false
  }

  if (asciiChars.test(str)) {
    return false
  }

  const pattern = includePunctuation
    ? toRegExp(chineseIdeographs.concat(chinesePunctuations))
    : toRegExp(chineseIdeographs)

  return pattern.test(str)
}

现在就可以用这个 isChinese() 方法去验证一下效果了。当然，如果你觉得兼容象形文字符和兼容表意文字增补字符也是汉字字符，那应该将他们加入到 isChinese() 方法的字符取值范围中。

总结

本文的重点是给大家介绍了中文字符都包含了哪些范围，以及验证中文字符使用正则表达式的基本检测思路。也希望 isChinese() 方法能在大家的日常开发工作中帮助到大家。