Regular Expressions: The "Simple" Tool You Have to Google Every Time

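Summary: Regular expressions are billed as "the programmer's Swiss Army knife", but in practice they are "black magic you have to relearn every time you use it". Is /^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$/ a password check or an incantation for summoning demons? This is not a regex tutorial; it is a field guide to the pitfalls. Every trap comes with code, and every pit can be reproduced. Also included: a regex cheat sheet and a section on why you should not parse HTML with regex.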

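A "Simple" Requirement

Product manager: "Add phone number validation. It's really simple."

You think: piece of cake, I'll write a regex.

5 minutes later: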

const phoneRegex = /\d{11}/
console.log(phoneRegex.test("13812345678")) // true ✓

Product manager: "That's wrong. It has to check whether it's a real mobile number."

10 minutes later:

const phoneRegex = /^1[3-9]\d{9}$/
console.log(phoneRegex.test("13812345678")) // true ✓
console.log(phoneRegex.test("19912345678")) // true ✓

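Product manager: "It needs to support +86 and spaces."

20 minutes later:

const phoneRegex = /^(\+86)?1[3-9]\d{9}$/
// ❌ Does not support spaces

30 minutes later:

const phoneRegex = /^(\+86)?[\s-]?1[3-9]\d{9}$/
// ❌ Still wrong: only one separator is allowed, and only right after +86

One hour later, you are on Stack Overflow searching for "regex phone number china".

You find 20 different answers, each claiming to be "the most complete".

You break down.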
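One way out of this spiral is to stop asking a single regex to do everything. The sketch below strips separators and the country code first, then reuses the simple pattern from earlier. The helper name isChineseMobile and the decision to accept spaces and hyphens anywhere are assumptions made for illustration, not a definitive rule for what counts as a valid number.

```javascript
// Hypothetical helper: normalize the input first, then validate with the simple pattern.
function isChineseMobile(input) {
  // Remove spaces and hyphens so separators may appear anywhere in the number.
  const compact = input.replace(/[\s-]/g, "")
  // Drop an optional +86 country code.
  const national = compact.startsWith("+86") ? compact.slice(3) : compact
  return /^1[3-9]\d{9}$/.test(national)
}

console.log(isChineseMobile("+86 138 1234 5678")) // true
console.log(isChineseMobile("138-1234-5678")) // true
console.log(isChineseMobile("12812345678")) // false
```

Each step stays readable on its own, and the regex that remains is the 10-minute version that already worked.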

01. Greedy vs. Non-Greedy Matching: Regex Plays "Snake"

Reproducing the scenario

You want to extract the content of an HTML tag:

const html = "<div>Hello</div><div>World</div>"
const regex = /<div>.*<\/div>/
const match = html.match(regex)
console.log(match[0])

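You expect the output to be:

<div>Hello</div>

The actual output:

<div>Hello</div><div>World</div>

WTF?

Why?

.* is greedy: it matches as many characters as it can.

How the match proceeds:

1. <div> matches.
2. .* starts matching and eats everything up to the end of the string.
3. <\/div> fails to match, because nothing is left.
4. .* gives back one character and the engine tries again.
5. Steps 3-4 repeat until <\/div> matches, which first happens at the last </div>.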

The fix: non-greedy matching

const regex = /<div>.*?<\/div>/ // add a ?
const match = html.match(regex)
console.log(match[0]) // <div>Hello</div> ✓

And with the g flag you get every tag:

const html = "<div>Hello</div><div>World</div>"
const regex = /<div>.*?<\/div>/g // global matching
const matches = html.match(regex)
console.log(matches)
// ['<div>Hello</div>', '<div>World</div>'] ✓
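If what you actually want is the text between the tags, a capture group combined with matchAll is one option. The sketch below also swaps .*? for the negated character class [^<]*, which cannot run past the next "<" in the first place. The variable names are illustrative, and this only holds for flat markup like the sample string; real HTML is a different story.

```javascript
const sampleHtml = "<div>Hello</div><div>World</div>"

// [^<]* stops at the next "<" by construction, so the engine never has to
// give characters back the way it does with a greedy .*
const innerTexts = [...sampleHtml.matchAll(/<div>([^<]*)<\/div>/g)].map((m) => m[1])

console.log(innerTexts) // ["Hello", "World"]
```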

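Greedy vs. non-greedy, side by side

const text = "aaa"

// Greedy
console.log(text.match(/a+/)[0]) // "aaa" (as many as possible)

// Non-greedy
console.log(text.match(/a+?/)[0]) // "a" (as few as possible)

The common greedy quantifiers:

*: 0 or more times (greedy)
+: 1 or more times (greedy)
?: 0 or 1 time (greedy)
{n,m}: n to m times (greedy)

Their non-greedy counterparts:

*?: 0 or more times (non-greedy)
+?: 1 or more times (non-greedy)
??: 0 or 1 time (non-greedy)
{n,m}?: n to m times (non-greedy)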
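The two lists are easier to remember once you see each pair on the same input. A small sketch (the lazyResults object is just a container for illustration); the last entry shows that a non-greedy quantifier still expands when the rest of the pattern forces it to.

```javascript
const digits = "12345"

const lazyResults = {
  star: digits.match(/\d*/)[0], // "12345"
  starLazy: digits.match(/\d*?/)[0], // "" (zero repetitions already satisfy the pattern)
  optional: digits.match(/\d?/)[0], // "1"
  optionalLazy: digits.match(/\d??/)[0], // ""
  range: digits.match(/\d{2,4}/)[0], // "1234"
  rangeLazy: digits.match(/\d{2,4}?/)[0], // "12"
  // Non-greedy does not mean "short at any cost": $ forces +? to take everything.
  forcedToExpand: digits.match(/^\d+?$/)[0], // "12345"
}

console.log(lazyResults)
```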

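02. Catastrophic Backtracking: The Regex That Takes Down Your Server

Reproducing the scenario

You write a "simple" regex to validate email addresses:

const emailRegex = /^([a-z0-9_\.-]+)+@([\da-z\.-]+)\.([a-z\.]{2,6})$/

// Test
console.log(emailRegex.test("test@example.com")) // true ✓
console.log(emailRegex.test("user@domain.co.uk")) // true ✓

Looks fine.

But then:

const maliciousInput = "a".repeat(30) // no "@", so the match has to fail
console.time("test")
emailRegex.test(maliciousInput)
console.timeEnd("test")
// On a backtracking engine this can take seconds or longer, and each extra "a" roughly doubles it

Your server hangs.

Why?

This is called catastrophic backtracking.

The problem is here:

;/^([a-z0-9_\.-]+)+@/
//              ↑ ↑ a greedy + inside a group that is itself repeated by +

When the input is aaaaaaa... with no @:

1. The inner + matches all of the a's.
2. The engine tries to match @ and fails.
3. The inner + gives back one a, and the outer + starts a new repetition of the group with it.
4. Steps 2-3 repeat for every possible way of splitting the a's into groups, so the work grows exponentially.

Time complexity: O(2^n)

A single greedy + followed by a literal, as in /^[a-z0-9_\.-]+@/, does not behave this way: it gives characters back at most n times. The blow-up needs a quantifier nested inside another quantifier, or adjacent quantified parts that can match the same characters.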

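Solutions

Option 1: Non-greedy quantifiers (not a real fix)

const emailRegex = /^([a-z0-9_\.-]+?)@([\da-z\.-]+?)\.([a-z\.]{2,6})$/

Making a quantifier non-greedy only changes the order in which the engine tries the alternatives, not how many there are. When the input cannot match, the non-greedy pattern walks through the same combinations as the greedy one, so treat this as a matching-behavior choice rather than a defense.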

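Option 2: Atomic groups

// JavaScript has no atomic group syntax (?>...); PHP (PCRE) and .NET support it.
// In JavaScript, a lookahead plus a backreference gives the same effect:
const emailRegex = /^(?=([a-z0-9_\.-]+))\1@/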

Option 3: Avoid nested quantifiers

// ❌ Dangerous: nested quantifiers
/^(a+)+$/

// ✅ Safe: no nesting
/^a+$/
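To watch the difference between these two patterns, you can time both against inputs that are guaranteed to fail. This is a rough sketch: timeMatch is a hypothetical helper, the sizes are kept small on purpose, and absolute numbers depend on your machine and engine. On a backtracking engine the nested pattern's time typically about doubles per extra character, while the flat one stays flat.

```javascript
// Time a pattern against n "a" characters followed by "!".
// The trailing "!" guarantees failure, which forces the engine to exhaust its options.
function timeMatch(regex, n) {
  const input = "a".repeat(n) + "!"
  const start = performance.now()
  const matched = regex.test(input)
  return { matched, elapsedMs: performance.now() - start }
}

// Keep n small: raising it far past these values can stall the process for a long time.
for (const n of [16, 18, 20, 22]) {
  const nested = timeMatch(/^(a+)+$/, n)
  const flat = timeMatch(/^a+$/, n)
  console.log(`n=${n} nested=${nested.elapsedMs.toFixed(2)}ms flat=${flat.elapsedMs.toFixed(2)}ms`)
}
```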

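Option 4: Vet patterns up front (there is no built-in timeout)

// Node.js has no built-in regex timeout.
// A third-party checker such as safe-regex can flag risky patterns statically.
const safeRegex = require("safe-regex")

if (!safeRegex(/^(a+)+$/)) {
  console.error("Unsafe regex!")
}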
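Since there is no timeout to lean on, a cheap extra layer is to bound the input before the regex ever sees it. A minimal sketch: testWithLengthCap is a hypothetical helper, and the default of 254 characters is an assumption (a commonly cited upper bound for email addresses), so pick a limit that fits your own field.

```javascript
// Hypothetical guard: refuse to run the pattern on anything longer than maxLength.
function testWithLengthCap(regex, input, maxLength = 254) {
  if (typeof input !== "string" || input.length > maxLength) {
    return false // rejected before the regex engine is involved
  }
  return regex.test(input)
}

// A flat pattern (no quantified group inside another quantifier).
const simpleEmail = /^[a-z0-9_.-]+@[\da-z.-]+\.[a-z.]{2,6}$/

console.log(testWithLengthCap(simpleEmail, "test@example.com")) // true
console.log(testWithLengthCap(simpleEmail, "a".repeat(100000) + "@example.com")) // false
```

A length cap does not make a dangerous pattern safe, it only limits how much damage a single request can do.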

A real-world case

In 2019, Cloudflare went down worldwide:

The cause: one regular expression pushed CPU usage to 100%.

(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

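The damage comes from the tail of the pattern, .*(?:.*=.*): several adjacent .* can match the same characters, so on some inputs the engine backtracks through a huge number of combinations.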

03. Zero-Width Assertions: The Invisible "Ghosts"

What is a zero-width assertion?

A zero-width assertion (lookaround) consumes no characters; it only matches a position.

The four lookarounds:

(?=...): positive lookahead
(?!...): negative lookahead
(?<=...): positive lookbehind
(?<!...): negative lookbehind
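Here are the four of them side by side on one made-up string, as a quick sketch. Note that lookbehind arrived with ES2018, so very old runtimes will reject the last two patterns.

```javascript
const amounts = "100 USD, 200 EUR, $300"

// Positive lookahead: digits that are followed by " USD".
const beforeUsd = amounts.match(/\d+(?= USD)/)[0]
// Negative lookahead: a whole number that is NOT followed by " USD".
const notBeforeUsd = amounts.match(/\b\d+\b(?! USD)/)[0]
// Positive lookbehind: digits that are preceded by "$".
const afterDollar = amounts.match(/(?<=\$)\d+/)[0]
// Negative lookbehind: a number that does NOT have "$" right before it.
const notAfterDollar = amounts.match(/(?<!\$)\b\d+/)[0]

console.log(beforeUsd, notBeforeUsd, afterDollar, notAfterDollar)
// 100 200 300 100
```

In every case the match contains only the digits: the text inside the lookaround is checked but never consumed.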

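Scenario 1: Password validation

**Requirement:** The password must contain an uppercase letter, a lowercase letter, and a digit, and be at least 8 characters long.

The wrong way:

// ❌ This only checks that every character is a letter or a digit, not that all three kinds are present
const regex = /^[A-Za-z0-9]{8,}$/
console.log(regex.test("12345678")) // true (but there are no letters)

The right way:

const regex = /^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$/

console.log(regex.test("Password1")) // true ✓
console.log(regex.test("password1")) // false (no uppercase letter)
console.log(regex.test("PASSWORD1")) // false (no lowercase letter)
console.log(regex.test("Password")) // false (no digit)
console.log(regex.test("Pass1")) // false (fewer than 8 characters)

Explanation:

^              Start
(?=.*[A-Z])    Must contain at least one uppercase letter (lookahead)
(?=.*[a-z])    Must contain at least one lowercase letter (lookahead)
(?=.*\d)       Must contain at least one digit (lookahead)
.{8,}          At least 8 characters of any kind
$              End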
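The stacked lookaheads work, but they can only answer yes or no. If you also want to tell the user which rule failed, one small check per rule is an alternative. A sketch, with checkPassword and its messages being made-up names for illustration:

```javascript
// Hypothetical helper: one tiny regex per rule, so each failure can be named.
function checkPassword(password) {
  const problems = []
  if (password.length < 8) problems.push("shorter than 8 characters")
  if (!/[A-Z]/.test(password)) problems.push("no uppercase letter")
  if (!/[a-z]/.test(password)) problems.push("no lowercase letter")
  if (!/\d/.test(password)) problems.push("no digit")
  return problems // an empty array means the password passed every rule
}

console.log(checkPassword("Password1")) // []
console.log(checkPassword("pass1")) // ["shorter than 8 characters", "no uppercase letter"]
```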

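Scenario 2: Extracting a price (without the currency symbol)

const text = "Price: $99.99"

// ❌ Includes the $
const regex1 = /\$\d+\.\d+/
console.log(text.match(regex1)[0]) // "$99.99"

// ✅ Excludes the $
const regex2 = /(?<=\$)\d+\.\d+/
console.log(text.match(regex2)[0]) // "99.99"

Explanation:

(?<=\$)    Positive lookbehind: must be preceded by $, but the $ is not part of the match
\d+\.\d+   Matches the number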
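If you cannot rely on lookbehind (it is an ES2018 feature), a plain capture group gets the same result: match the $ and simply read the group instead of the whole match. A short sketch with illustrative variable names:

```javascript
const priceText = "Price: $99.99"

// Match the "$" as usual, but capture only the number that follows it.
const priceMatch = priceText.match(/\$(\d+\.\d+)/)
const price = priceMatch ? priceMatch[1] : null // null when no price is present

console.log(price) // "99.99"
```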

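Scenario 3: Thousands separators

**Requirement:** Add thousands separators to a number

const num = "1234567890"

// Uses a positive lookahead
const formatted = num.replace(/\B(?=(\d{3})+(?!\d))/g, ",")
console.log(formatted) // "1,234,567,890"

Explanation:

\B                  Not a word boundary (so never at the very start)
(?=(\d{3})+(?!\d))  Followed by a multiple of three digits, with no further digit after them

Tests: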

console.log("123".replace(/\B(?=(\d{3})+(?!\d))/g, ",")) // "123"
console.log("1234".replace(/\B(?=(\d{3})+(?!\d))/g, ",")) // "1,234"
console.log("12345".replace(/\B(?=(\d{3})+(?!\d))/g, ",")) // "12,345"
console.log("123456".replace(/\B(?=(\d{3})+(?!\d))/g, ",")) // "123,456"
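One edge case the tests above do not cover is decimals: the pattern happily groups the fractional digits too. The sketch below reproduces that and shows the built-in Intl.NumberFormat as an alternative. It assumes the value is available as a number and that en-US style grouping is what you want; addCommas is just an illustrative wrapper around the regex from this section.

```javascript
const addCommas = (s) => s.replace(/\B(?=(\d{3})+(?!\d))/g, ",")

// The fractional part gets grouped as well, which is almost never intended.
const brokenDecimal = addCommas("1234.5678")
console.log(brokenDecimal) // "1,234.5,678"

// Assumption: en-US grouping, up to 4 fraction digits.
const formatter = new Intl.NumberFormat("en-US", { maximumFractionDigits: 4 })
const formattedDecimal = formatter.format(1234.5678)
const formattedInteger = formatter.format(1234567890)
console.log(formattedDecimal) // "1,234.5678"
console.log(formattedInteger) // "1,234,567,890"
```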

04. Capturing vs. Non-Capturing Groups: The Secret of Parentheses

Capturing groups

const text = "2026-01-20"
const regex = /(\d{4})-(\d{2})-(\d{2})/
const match = text.match(regex)

console.log(match[0]) // "2026-01-20" (the full match)
console.log(match[1]) // "2026" (1st capturing group)
console.log(match[2]) // "01" (2nd capturing group)
console.log(match[3]) // "20" (3rd capturing group)

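Non-capturing groups

const text = "2026-01-20"
const regex = /(?:\d{4})-(?:\d{2})-(?:\d{2})/
const match = text.match(regex)

console.log(match[0]) // "2026-01-20" (the full match)
console.log(match[1]) // undefined (there are no capturing groups)

Why use non-capturing groups?

Slightly less work: the engine does not have to store the captured text
Less confusion: they do not shift the numbering of the capturing groups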
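The numbering point is the one that bites in practice. In this sketch (the URL and variable names are made up), grouping the scheme with plain parentheses pushes the host to group 2, while a non-capturing group keeps it at group 1.

```javascript
const pageUrl = "https://example.com/path"

// Capturing group for the scheme: the host ends up in group 2.
const hostWithCapture = pageUrl.match(/(https?):\/\/([^/]+)/)[2]

// Non-capturing group for the scheme: the host stays in group 1.
const hostWithoutCapture = pageUrl.match(/(?:https?):\/\/([^/]+)/)[1]

console.log(hostWithCapture, hostWithoutCapture) // example.com example.com
```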

Named capturing groups (ES2018)

const text = "2026-01-20"
const regex = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
const match = text.match(regex)

console.log(match.groups.year) // "2026"
console.log(match.groups.month) // "01"
console.log(match.groups.day) // "20"

Using them in a replacement:

const text = "2026-01-20"
const formatted = text.replace(
  /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/,
  "$<month>/$<day>/$<year>",
)
console.log(formatted) // "01/20/2026"

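05. Regexes for Common Scenarios

Email validation

// ❌ Too simple
/^.+@.+\..+$/

// ❌ Too complex (a full RFC-grade pattern; the well-known RFC 822 one runs to 6,000+ characters)
// Don't try to match the RFC completely

// ✅ Practical version
/^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/

// Test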
const emailRegex = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;
console.log(emailRegex.test("user@example.com"));     // true
console.log(emailRegex.test("user.name@example.com")); // true
console.log(emailRegex.test("user@example.co.uk"));   // true
console.log(emailRegex.test("user@example"));         // false
console.log(emailRegex.test("@example.com"));         // false

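Mobile number validation (China)

// Basic version
/^1[3-9]\d{9}$/

// Allows an optional +86 and a single space or hyphen after it
/^(\+86)?[\s-]?1[3-9]\d{9}$/

// Stricter version (restricts the prefix to specific number segments, which change over time)
/^1(3\d|4[5-9]|5[0-35-9]|6[2567]|7[0-8]|8\d|9[0-35-9])\d{8}$/

// Test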
const phoneRegex = /^1[3-9]\d{9}$/;
console.log(phoneRegex.test("13812345678")); // true
console.log(phoneRegex.test("19912345678")); // true
console.log(phoneRegex.test("12812345678")); // false
console.log(phoneRegex.test("138123456"));   // false

URL validation

// Simple version
/^https?:\/\/.+/

// Stricter version
/^https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$/

// Test
const urlRegex = /^https?:\/\/.+/;
console.log(urlRegex.test("https://example.com"));       // true
console.log(urlRegex.test("http://example.com/path"));   // true
console.log(urlRegex.test("ftp://example.com"));         // false
console.log(urlRegex.test("example.com"));               // false
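For URLs in particular, the runtime already ships a parser. A minimal sketch using the built-in URL constructor, which is available in browsers and in Node.js; isHttpUrl is a hypothetical helper name, and restricting to http and https mirrors the regex above.

```javascript
// Hypothetical helper: let the built-in parser decide, then check the scheme.
function isHttpUrl(value) {
  try {
    const parsed = new URL(value)
    return parsed.protocol === "http:" || parsed.protocol === "https:"
  } catch {
    return false // new URL() throws a TypeError for input it cannot parse
  }
}

console.log(isHttpUrl("https://example.com")) // true
console.log(isHttpUrl("http://example.com/path")) // true
console.log(isHttpUrl("ftp://example.com")) // false
console.log(isHttpUrl("example.com")) // false
```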

ID card number validation (China)

// 18-digit resident ID number
;/^[1-9]\d{5}(18|19|20)\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])\d{3}[\dXx]$/

// Test
const idRegex =
  /^[1-9]\d{5}(18|19|20)\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])\d{3}[\dXx]$/
console.log(idRegex.test("110101199001011234")) // true (format only; the check digit is not verified)
console.log(idRegex.test("110101199013011234")) // false (invalid month)
console.log(idRegex.test("110101199001321234")) // false (invalid day)
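The regex only checks the shape of the number. The last character is a check digit, and verifying it takes arithmetic, not pattern matching. The sketch below assumes the commonly documented scheme for 18-digit numbers (weighted sum modulo 11, as described for GB 11643-1999); the function and constant names are hypothetical.

```javascript
// Assumption: weights and check characters per the commonly documented
// MOD 11-2 scheme for 18-digit resident ID numbers.
const ID_WEIGHTS = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
const ID_CHECK_CHARS = "10X98765432"

function hasValidIdChecksum(id) {
  if (!/^\d{17}[\dXx]$/.test(id)) return false
  let sum = 0
  for (let i = 0; i < 17; i++) {
    sum += Number(id[i]) * ID_WEIGHTS[i]
  }
  // The remainder selects the expected check character.
  return ID_CHECK_CHARS[sum % 11] === id[17].toUpperCase()
}

console.log(hasValidIdChecksum("11010519491231002X")) // true
console.log(hasValidIdChecksum("110101199001011234")) // false (passes the regex, fails the checksum)
```

Running the regex first and the checksum second keeps each piece simple, and neither has to do the other's job.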

Password strength validation

// At least 8 characters, with lowercase and uppercase letters, a digit, and a special character
;/^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$/

// Test
const passwordRegex =
  /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$/
console.log(passwordRegex.test("Password1!")) // true
console.log(passwordRegex.test("password1!")) // false (no uppercase letter)
console.log(passwordRegex.test("PASSWORD1!")) // false (no lowercase letter)
console.log(passwordRegex.test("Password!")) // false (no digit)
console.log(passwordRegex.test("Password1")) // false (no special character)

IP address validation

// IPv4
;/^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$/

// Test
const ipRegex =
  /^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$/
console.log(ipRegex.test("192.168.1.1")) // true
console.log(ipRegex.test("255.255.255.255")) // true
console.log(ipRegex.test("256.1.1.1")) // false
console.log(ipRegex.test("192.168.1")) // false

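06. Why You Should Not Parse HTML with Regex

The classic question

One of the most famous answers on Stack Overflow:

**Q:** How do I parse HTML with a regular expression?

**A:** You can't. HTML is not a regular language.

Why?

Describing HTML's nested structure takes at least a context-free grammar, while regular expressions can only handle regular grammars.

Trying to parse HTML with regex

// Try to extract the href of every <a> tag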
const html = `
  <a href="https://example.com">Link 1</a>
  <a href='https://example.org'>Link 2</a>
  <a href="https://example.net" target="_blank">Link 3</a>
  <a href="https://example.com/path?query=value&other=123">Link 4</a>
`

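// ❌ A simple regex
const regex1 = /<a href="(.+)">/g
const matches1 = [...html.matchAll(regex1)]
console.log(matches1.map((m) => m[1]))
// ['https://example.com', 'https://example.net" target="_blank', 'https://example.com/path?query=value&other=123']
// ❌ Wrong! The second match swallowed an extra attribute, and the single-quoted link was missed entirely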

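Problems:

Attributes can use single or double quotes
There can be whitespace between attributes
Attributes can appear in any order
Tags can be nested
Tags can span multiple lines

The right way: use a DOM parser

// ✅ Use DOMParser (browser)
const parser = new DOMParser()
const doc = parser.parseFromString(html, "text/html")
const links = doc.querySelectorAll("a")
// getAttribute returns the value as written; link.href would return the resolved, normalized URL
const hrefs = Array.from(links).map((link) => link.getAttribute("href"))
console.log(hrefs)
// ["https://example.com", "https://example.org", "https://example.net", ...]

// ✅ Use cheerio (Node.js)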
const cheerio = require("cheerio")
const $ = cheerio.load(html)
const hrefs = $("a")
  .map((i, el) => $(el).attr("href"))
  .get()
console.log(hrefs)

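The one exception

If all you want is a simple text replacement:

// Remove all HTML tags
const text = html.replace(/<[^>]+>/g, "")
console.log(text)
// "Link 1", "Link 2", "Link 3", "Link 4", with the original line breaks and indentation left in

But even this has problems:

const html = "<div>Price: 5 < 10</div>"
const text = html.replace(/<[^>]+>/g, "")
console.log(text)
// "Price: 5 "
// ❌ Wrong! "< 10</div>" was treated as a tag and deleted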

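07. Regex Performance Tips

Tip 1: Avoid backtracking

// ❌ Slow: the nested quantifier backtracks exponentially when the match fails
/^(a+)+b$/

// ✅ Fast: same matches, no nested quantifier
/^a+b$/

Tip 2: Use non-capturing groups

// ❌ Slower: creates capturing groups
/(\d{4})-(\d{2})-(\d{2})/

// ✅ Faster: no capturing groups (when you don't need the captures)
/(?:\d{4})-(?:\d{2})-(?:\d{2})/

Tip 3: Anchor early

// ❌ Slow: tries every position
/\d{4}-\d{2}-\d{2}/

// ✅ Fast: only tries the start (use it only when the match has to begin there)
/^\d{4}-\d{2}-\d{2}/

Tip 4: Use a character class instead of alternation

// ❌ Slow
/(a|b|c|d|e)/

// ✅ Fast
/[a-e]/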

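Tip 5: Avoid unnecessary global matching

// ❌ Avoid: with the g flag, test() is stateful and resumes from the regex's lastIndex
const globalRegex = /\d+/g
globalRegex.test(text)

// ✅ Better: no g flag when you only need to know whether a match exists
const plainRegex = /\d+/
plainRegex.test(text)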
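The bigger cost of a stray g flag is correctness rather than speed. A regex with g remembers where its last match ended in lastIndex, so calling test() repeatedly on the same string can flip between true and false. A short sketch (variable names are illustrative):

```javascript
const hasDigitsGlobal = /\d+/g
const sample = "abc 123"

console.log(hasDigitsGlobal.test(sample)) // true  (lastIndex is now 7)
console.log(hasDigitsGlobal.test(sample)) // false (the search resumed at index 7; lastIndex resets to 0)
console.log(hasDigitsGlobal.test(sample)) // true

// Without the g flag there is no state to carry over.
const hasDigits = /\d+/
console.log(hasDigits.test(sample), hasDigits.test(sample)) // true true
```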

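08. Regex Cheat Sheet

Basic syntax

.       Any character (except line breaks)
\d      Digit [0-9]
\D      Non-digit [^0-9]
\w      Word character [a-zA-Z0-9_]
\W      Non-word character
\s      Whitespace character [ \t\n\r\f\v]
\S      Non-whitespace character
^       Start
$       End
\b      Word boundary
\B      Not a word boundary

Quantifiers

*       0 or more times (greedy)
+       1 or more times (greedy)
?       0 or 1 time (greedy)
{n}     Exactly n times
{n,}    At least n times
{n,m}   n to m times
*?      0 or more times (non-greedy)
+?      1 or more times (non-greedy)
??      0 or 1 time (non-greedy)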

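Character classes

[abc]   a, b, or c
[^abc]  Anything except a, b, or c
[a-z]   a through z
[A-Z]   A through Z
[0-9]   0 through 9

Groups

(...)         Capturing group
(?:...)       Non-capturing group
(?<name>...)  Named capturing group

Assertions

(?=...)     Positive lookahead
(?!...)     Negative lookahead
(?<=...)    Positive lookbehind
(?<!...)    Negative lookbehind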

Flags

g       Global matching
i       Ignore case
m       Multiline mode
s       dotAll mode (. matches line breaks)
u       Unicode mode
y       Sticky mode
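A one-line description per flag is easy to forget, so here is a small sketch with one tiny example each. The flagDemo object is only a container for the results; the inputs are made up.

```javascript
const flagDemo = {
  first: "a1b2".replace(/\d/, "#"), // "a#b2": without g, only the first match
  g: "a1b2".replace(/\d/g, "#"), // "a#b#": every match
  i: /hello/i.test("HeLLo"), // true: case is ignored
  m: /^b$/m.test("a\nb"), // true: ^ and $ also match at line breaks
  noS: /a.b/.test("a\nb"), // false: . stops at a newline
  s: /a.b/s.test("a\nb"), // true: . matches the newline too
  noU: /^.$/.test("😀"), // false: the emoji is two UTF-16 code units
  u: /^.$/u.test("😀"), // true: matching works on code points
}

// y: the match must start exactly at lastIndex.
const sticky = /b/y
sticky.lastIndex = 1
flagDemo.y = sticky.test("abc") // true: "b" sits at index 1

console.log(flagDemo)
```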

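Final Words: Regex Is Not a Silver Bullet

Regular expressions are powerful, but they cannot do everything.

Where regex fits:

Simple text matching and replacement
Input validation (email, phone number, password)
Log analysis
Simple text extraction

Where regex does not fit:

Parsing HTML/XML (use a DOM parser)
Parsing JSON (use JSON.parse)
Complex syntax analysis (use a dedicated parser)
Matching that depends on context (use a state machine)

Keep these principles in mind:

If a string method will do, skip the regex
If a simple regex will do, skip the complex one
If an existing library will do, don't write your own regex
Always test the edge cases after writing a regex
Always comment a complex regex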
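JavaScript has no free-spacing mode for writing comments inside a pattern, so one way to follow the last principle is to build the regex from named, commented pieces. A sketch under the assumption that the input is a YYYY-MM-DD date; the constant names are illustrative.

```javascript
// Assumption: dates look like YYYY-MM-DD. Each piece carries its own comment.
const YEAR = "(?<year>\\d{4})" // four-digit year
const MONTH = "(?<month>0[1-9]|1[0-2])" // 01 through 12
const DAY = "(?<day>0[1-9]|[12]\\d|3[01])" // 01 through 31 (not checked against the month)

const isoDate = new RegExp(`^${YEAR}-${MONTH}-${DAY}$`)

console.log(isoDate.test("2026-01-20")) // true
console.log(isoDate.test("2026-13-20")) // false
console.log("2026-01-20".match(isoDate).groups.month) // "01"
```

The assembled pattern is the same one you would have written inline, but the next reader can see what each part is for.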

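Regex is a tool, not a way to show off.

Code is written for people to read, not for the regex engine.

Which regex has tormented you?

Share your "regex hell" story in the comments!

Your experience might just save another developer who is Googling "regex email validation" right now.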