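The Linux Command Line - WILLIAM - Regular Expressions

Text data plays an important role on Unix-like systems, and regular expressions are often used together with command-line tools.

What Are Regular Expressions?

Put simply, a regular expression is a symbolic notation for identifying patterns in text. In some ways it resembles the shell's wildcard method of matching file and pathnames, but on a much grander scale. Many command-line tools and most programming languages support regular expressions to help solve text-manipulation problems. However, not all regular expressions are the same; they vary slightly from tool to tool and from language to language. The discussion here is limited to regular expressions as described in the POSIX standard (which covers most command-line tools), as opposed to the larger and richer set of notations used by many programming languages (most notably Perl).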

grep—Search Through Text

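The name grep derives from "global regular expression print", which hints at its connection to regular expressions. In essence, grep searches text files for text that matches a specified regular expression and prints any line containing a match to standard output.

So far we have used grep with fixed strings, like this: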

[me@linuxbox ~]$ ls /usr/bin | grep zip

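This lists every file in /usr/bin whose name contains the substring zip.

The grep program accepts options and arguments this way:

grep [options] regex [file...]

where regex is a regular expression.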

Table 19-1: grep Options

Option	Description
-i	Ignore case. Do not distinguish between upper- and lowercase characters. May also be specified --ignore-case.
-v	Invert match. Normally, grep prints lines that contain a match. This option causes grep to print every line that does not contain a match. May also be specified --invert-match.
-c	Print the number of matching lines (or non-matching lines if the -v option is also specified) instead of the lines themselves. May also be specified --count.
-l	Print the name of each file that contains a match instead of the lines themselves. May also be specified --files-with-matches.
-L	Like the -l option, but print only the names of files that do not contain matches. May also be specified --files-without-match.
-n	Prefix each matching line with the number of the line within the file. May also be specified --line-number.
-h	For multifile searches, suppress the output of filenames. May also be specified --no-filename.
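
As a quick way to try a few of these options without depending on the contents of a real system directory, here is a minimal sketch. The file tools.txt and its four lines are made-up sample data, and the variable names are only illustrative.

```shell
# Build a small sample file in a temporary directory (hypothetical data).
tmpdir=$(mktemp -d)
printf '%s\n' 'Zip' 'gzip' 'tar' 'unzip' > "$tmpdir/tools.txt"

# -i: ignore case, so "Zip" matches as well as "gzip" and "unzip".
ignore_case=$(grep -i zip "$tmpdir/tools.txt")

# -v: print only the lines that do NOT contain "zip" (case-sensitive).
inverted=$(grep -v zip "$tmpdir/tools.txt")

# -c: print the number of matching lines instead of the lines themselves.
count=$(grep -c zip "$tmpdir/tools.txt")

# -n: prefix each matching line with its line number in the file.
numbered=$(grep -n zip "$tmpdir/tools.txt")

printf '%s\n' "$ignore_case" '---' "$inverted" '---' "$count" '---' "$numbered"
rm -r "$tmpdir"
```

With this sample data, -c reports 2 and -n labels the matches as lines 2 and 4.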

To explore regular expressions, first create some text files to search:

[me@linuxbox ~]$ ls /bin > dirlist-bin.txt
[me@linuxbox ~]$ ls /usr/bin > dirlist-usr-bin.txt
[me@linuxbox ~]$ ls /sbin > dirlist-sbin.txt
[me@linuxbox ~]$ ls /usr/sbin > dirlist-usr-sbin.txt
[me@linuxbox ~]$ ls dirlist*.txt
dirlist-bin.txt dirlist-sbin.txt dirlist-usr-sbin.txt
dirlist-usr-bin.txt

A simple search of the file lists looks like this:

[me@linuxbox ~]$ grep bzip dirlist*.txt
dirlist-bin.txt:bzip2
dirlist-bin.txt:bzip2recover

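In this example, grep searches all of the listed files for the string bzip and finds two matches, both in the file dirlist-bin.txt. If we are interested only in which files contain matches rather than in the matches themselves, we can specify the -l option: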

[me@linuxbox ~]$ grep -l bzip dirlist*.txt
dirlist-bin.txt

Conversely, to see only the files that do not contain a match:

[me@linuxbox ~]$ grep -L bzip dirlist*.txt
dirlist-sbin.txt
dirlist-usr-bin.txt
dirlist-usr-sbin.txt

Metacharacters and Literals

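The grep searches above already use regular expressions, albeit very simple ones. The regular expression bzip means that a line matches only if it contains the characters b, z, i, and p in that order, with no other characters in between. The characters in bzip are all literal characters, which match themselves. In addition to literals, regular expressions may include metacharacters, which are used to specify more complex matches. The regular expression metacharacters are: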

^ $ . [ ] { } - ? * + ( ) | \

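All other characters are considered literals, although the backslash is used in a few cases to create metasequences, as well as to escape metacharacters so that they are treated as literals.

Note: Many of the regular expression metacharacters are also meaningful to the shell when it performs expansion. When passing a regular expression that contains metacharacters on the command line, enclose it in single quotes to prevent the shell from expanding it.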
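
To see why the quoting matters, the following sketch compares what a command receives when a pattern is left unquoted and when it is single-quoted. The temporary directory and the file name abc are made up for the illustration.

```shell
# Work in an empty temporary directory that contains one file named "abc".
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
touch abc

# Unquoted: the shell expands a* to the matching filename before the command runs.
unquoted=$(printf '%s' a*)

# Single-quoted: the pattern reaches the command unchanged.
quoted=$(printf '%s' 'a*')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"

# Return to the previous directory and clean up.
cd - > /dev/null || exit 1
rm -r "$tmpdir"
```

If grep had been the command, the unquoted form would have searched for abc rather than for the regular expression a*.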

The Any Character

The first metacharacter we look at is the dot (.), which matches any single character at that position.

[me@linuxbox ~]$ grep -h '.zip' dirlist*.txt
bunzip2
bzip2
bzip2recover
gunzip
gzip
funzip
gpg-zip
preunzip
prezip
prezip-bin
unzip
unzipsfx

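This searches the files for any line that matches the regular expression .zip. Notice that the zip program itself is not found: including the dot metacharacter increases the length of the required match to four characters, and the name zip contains only three. Also, any file in the lists with a .zip extension would have been matched as well, because the dot in the expression matches any character, including the literal period in the extension.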
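
The difference between the dot as a metacharacter and an escaped, literal dot can be checked with a small sketch. The four names below are hypothetical sample data, not taken from the directory listings.

```shell
# Hypothetical sample names, one per line.
names=$(printf '%s\n' 'zip' 'gzip' 'notes.zip' 'unzipsfx')

# '.zip' needs one character (any character) before "zip", so "zip" alone is skipped.
any_char=$(printf '%s\n' "$names" | grep '.zip')

# '\.zip' escapes the dot, so only a literal period before "zip" matches.
literal_dot=$(printf '%s\n' "$names" | grep '\.zip')

echo "$any_char"
echo '---'
echo "$literal_dot"
```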

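Anchors

The caret (^) and dollar sign ($) are treated as anchors in regular expressions. They cause a match to occur only if the regular expression is found at the beginning of the line (^) or at the end of the line ($).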

[me@linuxbox ~]$ grep -h '^zip' dirlist*.txt
zip
zipcloak
zipgrep
zipinfo
zipnote
zipsplit
[me@linuxbox ~]$ grep -h 'zip$' dirlist*.txt
gunzip
gzip
funzip
gpg-zip
preunzip
prezip
unzip
zip
[me@linuxbox ~]$ grep -h '^zip$' dirlist*.txt
zip

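These examples match lines that begin with zip, lines that end with zip, and lines that consist of exactly zip. Note also that the regular expression ^$ (a beginning and an end with nothing in between) matches empty lines.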
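
As a small illustration of the ^$ expression, this sketch counts and then removes the empty lines in a made-up snippet of text. The variable names are only illustrative.

```shell
# Count the empty lines in a hypothetical six-line snippet.
blank_count=$(printf 'first\n\nsecond\n\n\nthird\n' | grep -c '^$')

# Combine ^$ with -v to drop the empty lines instead.
non_blank=$(printf 'first\n\nsecond\n\n\nthird\n' | grep -v '^$')

echo "blank lines: $blank_count"
echo "$non_blank"
```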

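Bracket Expressions and Character Classes

A bracket expression matches a single character from a specified set of characters, and the set may include characters that would otherwise be interpreted as metacharacters. For example, to match lines containing either bzip or gzip: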

[me@linuxbox ~]$ grep -h '[bg]zip' dirlist*.txt
bzip2
bzip2recover
gzip

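Inside a bracket expression, metacharacters lose their special meaning. Two characters, however, are interpreted differently there. The first is the caret (^), which indicates negation; the second is the dash (-), which indicates a character range.

Negation

If the first character in a bracket expression is a caret (^), the remaining characters are taken to be a set of characters that must not be present at that character position.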

[me@linuxbox ~]$ grep -h '[^bg]zip' dirlist*.txt
bunzip2
gunzip
funzip
gpg-zip
preunzip
prezip
prezip-bin
unzip
unzipsfx

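This matches lines that contain zip preceded by any character other than b or g. Note that zip itself is not listed: a negated set still requires a character at that position, it just must not be a member of the set.

The caret invokes negation only when it is the first character in a bracket expression; anywhere else it is an ordinary member of the set.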
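
The effect of the caret's position can be seen in a short sketch. The three sample lines are made up so that the two expressions select different lines.

```shell
# Hypothetical sample lines: one contains a literal caret, one is all x's.
samples=$(printf '%s\n' 'a^b' 'abc' 'xxx')

# '[^x]': the caret is first, so this matches any character other than x.
negated=$(printf '%s\n' "$samples" | grep '[^x]')

# '[x^]': the caret is not first, so it is a literal; matches an x or a caret.
literal_caret=$(printf '%s\n' "$samples" | grep '[x^]')

echo "$negated"
echo '---'
echo "$literal_caret"
```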

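Traditional Character Ranges

Suppose we want to find every file in our lists whose name begins with an uppercase letter:

[me@linuxbox ~]$ grep -h '^[ABCDEFGHIJKLMNOPQRSTUVWXYZ]' dirlist*.txt

There is another way that avoids typing out all 26 letters: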

[me@linuxbox ~]$ grep -h '^[A-Z]' dirlist*.txt
MAKEDEV
ControlPanel
GET
HEAD
POST
X
X11
Xorg
MAKEFLOPPIES
NetworkManager
NetworkManagerDispatcher

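A three-character range stands in for all 26 letters. Any range of characters can be expressed this way, and several ranges can be combined, for example to match filenames that begin with a letter or a digit: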

[me@linuxbox ~]$ grep -h '^[A-Za-z0-9]' dirlist*.txt

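Because the dash has a special meaning inside a bracket expression, a literal dash is included in the set by making it the first character. The following matches every filename that contains a dash, an uppercase A, or an uppercase Z: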

[me@linuxbox ~]$ grep -h '[-AZ]' dirlist*.txt
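
To compare the set [-AZ] with the range [A-Z], here is a minimal sketch on made-up names. LC_ALL=C is set for each grep only to make the range behave the same on any system; that choice is an assumption of this example.

```shell
# Hypothetical sample names.
samples=$(printf '%s\n' 'gpg-zip' 'AZ' 'MAKEDEV' 'unzip')

# '[-AZ]': a set of three characters: dash, A, and Z.
dash_set=$(printf '%s\n' "$samples" | LC_ALL=C grep '[-AZ]')

# '[A-Z]': a range covering every uppercase ASCII letter.
upper_range=$(printf '%s\n' "$samples" | LC_ALL=C grep '[A-Z]')

echo "$dash_set"
echo '---'
echo "$upper_range"
```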

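POSIX Character Classes

Traditional character ranges are easy to understand, but they do not always work. We have not run into problems with grep so far, but other programs can behave differently.

Pathname expansion with shell wildcards uses character ranges in almost the same way as regular expressions do, and that is where the problem shows up: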

[me@linuxbox ~]$ ls /usr/sbin/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]*
/usr/sbin/MAKEFLOPPIES
/usr/sbin/NetworkManagerDispatcher
/usr/sbin/NetworkManager

That command lists the files whose names begin with an uppercase letter, as expected. With a range, however, the result can be different:

[me@linuxbox ~]$ ls /usr/sbin/[A-Z]*
/usr/sbin/biosdecode
/usr/sbin/chat
/usr/sbin/chgpasswd
/usr/sbin/chpasswd
/usr/sbin/chroot
/usr/sbin/cleanup-info
/usr/sbin/complain
/usr/sbin/console-kit-daemon

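The result does not match the previous one. The reason goes back to the early days of Unix, when it only knew about ASCII characters. In ASCII, the first 32 characters (0–31) are control codes (such as tab, backspace, and carriage return). The next 32 (32–63) contain printable characters, including most punctuation and the digits 0 through 9. The next 32 (64–95) contain the uppercase letters and a few more punctuation symbols. The final 32 (96–127) contain the lowercase letters and yet more punctuation symbols. Based on this arrangement, systems using ASCII used the following collation order: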

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

This differs from proper dictionary order, which looks like this:

aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

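As Unix spread beyond the United States, there was a growing need to support characters not found in U.S. English. The ASCII table was expanded to use a full 8 bits, adding characters 128–255 to accommodate many more languages. To support this ability, the POSIX standards introduced the concept of a locale, which can be adjusted to select the character set needed for a particular location. The language setting of the system can be seen with this command: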

[me@linuxbox ~]$ echo $LANG
en_US.UTF-8

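With this setting, POSIX-compliant applications use dictionary collation order rather than ASCII order, which explains the behavior above. Interpreted in dictionary order, the range [A-Z] includes every alphabetic character except the lowercase a: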

AbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
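
The ASCII collation order described above can be observed directly with sort. This is a minimal sketch on four made-up letters; only the C (ASCII) order is shown, since the dictionary order depends on which locales are installed on the system.

```shell
# Sort four letters using the traditional ASCII collation order.
# Setting LC_ALL=C affects only this one sort command.
sorted_c=$(printf '%s\n' b A a B | LC_ALL=C sort)

echo "$sorted_c"
```

In the C locale all uppercase letters sort before all lowercase letters, so the result is A, B, a, b.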

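To partially work around this problem, the POSIX standard defines a number of character classes, listed in Table 19-2.

Table 19-2: POSIX Character Classes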

Character Class	Description
[:alnum:]	The alphanumeric characters; in ASCII, equivalent to [A-Za-z0-9]
[:word:]	The same as [:alnum:], with the addition of the underscore character ( _ )
[:alpha:]	The alphabetic characters; in ASCII, equivalent to [A-Za-z]
[:blank:]	Includes the space and tab characters
[:cntrl:]	The ASCII control codes; includes the ASCII characters 0 through 31 and 127
[:digit:]	The numerals 0 through 9
[:graph:]	The visible characters; in ASCII, includes characters 33 through 126
[:lower:]	The lowercase letters
[:punct:]	The punctuation characters; in ASCII, equivalent to [-!"#$%&'()*+,./:;<=>?@[\\\]_`{|}~]
[:print:]	The printable characters; all the characters in [:graph:] plus the space character
[:space:]	The whitespace characters including space, tab, carriage return, newline, vertical tab, and form feed; in ASCII, equivalent to [ \t\r\n\v\f]
[:upper:]	The uppercase characters
[:xdigit:]	Characters used to express hexadecimal numbers; in ASCII, equivalent to [0-9A-Fa-f]

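Even with the character classes, there is still no convenient way to express a partial range such as [A-M].

Using a character class, we can repeat the directory listing and get a better result: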

[me@linuxbox ~]$ ls /usr/sbin/[[:upper:]]*
/usr/sbin/MAKEFLOPPIES
/usr/sbin/NetworkManagerDispatcher
/usr/sbin/NetworkManager

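Reverting to Traditional Collation Order

The system can be made to use the traditional (ASCII) collation order by changing the value of the LANG environment variable. Its value is determined by the language selected when the system was installed.

To see the locale settings, use the locale command: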

[me@linuxbox ~]$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

To switch to the traditional (ASCII) collation order, set the LANG variable to POSIX:

[me@linuxbox ~]$ export LANG=POSIX

The change can also be made permanent by adding the following line to the .bashrc file:

export LANG=POSIX
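
When changing the locale for the whole session is more than is needed, a variable can be set for a single command instead. The sketch below uses LC_ALL, which takes precedence over LANG and the individual LC_* settings; the sample names are made up.

```shell
# Hypothetical sample names.
names=$(printf '%s\n' 'MAKEDEV' 'chroot' 'biosdecode' 'NetworkManager')

# LC_ALL=C applies only to this one grep process, so [A-Z] means
# the uppercase ASCII letters here regardless of the session's LANG.
upper_ascii=$(printf '%s\n' "$names" | LC_ALL=C grep '^[A-Z]')

echo "$upper_ascii"
```

This approach only helps for programs that interpret the range themselves. A shell wildcard such as [A-Z]* is expanded by the shell before the command starts, so a per-command variable does not affect it.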

POSIX Basic vs. Extended Regular Expressions

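POSIX also splits regular expression implementations into two kinds: basic regular expressions (BRE) and extended regular expressions (ERE). The features covered so far are supported by any application that is POSIX compliant and implements BRE; grep is one such program.

What is the difference between BRE and ERE? It is a matter of metacharacters. With BRE, the metacharacters ^ $ . [ ] * are recognized, and all other characters are considered literals. ERE adds the metacharacters ( ) { } ? + | and their associated functions.

However, the characters ( ) { } are treated as metacharacters in BRE if they are escaped with a backslash, whereas in ERE, preceding any metacharacter with a backslash causes it to be treated as a literal.

The egrep program supports ERE; the GNU version of grep also supports ERE when the -E option is used.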
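
The reversed role of the backslash can be checked with braces. In this sketch the two sample lines are made up: one contains two consecutive a characters, the other contains the literal text a{2}.

```shell
# Two hypothetical lines: "aa" and the literal text "a{2}".
lines=$(printf '%s\n' 'aa' 'a{2}')

# BRE: escaped braces form an interval, so this matches two consecutive a's.
bre_interval=$(printf '%s\n' "$lines" | grep 'a\{2\}')

# BRE: unescaped braces are literals, so this matches the text "a{2}".
bre_literal=$(printf '%s\n' "$lines" | grep 'a{2}')

# ERE: unescaped braces form an interval.
ere_interval=$(printf '%s\n' "$lines" | grep -E 'a{2}')

# ERE: escaped braces are literals.
ere_literal=$(printf '%s\n' "$lines" | grep -E 'a\{2\}')

printf '%s\n' "$bre_interval" "$bre_literal" "$ere_interval" "$ere_literal"
```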

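The Origin of POSIX

During the 1980s, many computer manufacturers licensed Unix from AT&T and supplied their own versions of the operating system with their systems. As these versions diverged, software compatibility suffered.

In the mid-1980s, the IEEE (Institute of Electrical and Electronics Engineers) began developing a set of standards that would define how Unix systems perform. These standards, formally known as IEEE 1003, define the application programming interfaces (APIs). The name POSIX, which stands for Portable Operating System Interface (with the X added for extra snappiness), was suggested by Richard Stallman and adopted by the IEEE.

Extended Regular Expressions (ERE)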

Alternation

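The first extended regular expression feature we discuss is alternation, which allows a match to occur from among a set of expressions. Just as a bracket expression allows a single character to match from a set of specified characters, alternation allows matches from a set of strings or other regular expressions.

To demonstrate, first try a plain string match the old way: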

[me@linuxbox ~]$ echo "AAA" | grep AAA
AAA
[me@linuxbox ~]$ echo "BBB" | grep AAA
[me@linuxbox ~]$

Here the output of echo is piped into grep.

Now add alternation, signified by the vertical bar (|) metacharacter:

[me@linuxbox ~]$ echo "AAA" | grep -E 'AAA|BBB'
AAA
[me@linuxbox ~]$ echo "BBB" | grep -E 'AAA|BBB'
BBB
[me@linuxbox ~]$ echo "CCC" | grep -E 'AAA|BBB'
[me@linuxbox ~]$

The regular expression 'AAA|BBB' means "match either the string AAA or the string BBB". Alternation is not limited to two choices:

[me@linuxbox ~]$ echo "AAA" | grep -E 'AAA|BBB|CCC'
AAA

To combine alternation with other regular expression elements, use ( ) to separate the alternation:

[me@linuxbox ~]$ grep -Eh '^(bz|gz|zip)' dirlist*.txt

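This expression matches filenames that begin with bz, gz, or zip. Without the parentheses, the meaning changes: the expression matches any filename that begins with bz, or that contains gz or zip anywhere: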

[me@linuxbox ~]$ grep -Eh '^bz|gz|zip' dirlist*.txt
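
The two forms can be compared side by side on a short, made-up list of names. In this sketch the name unzip is the one that tells them apart, because it contains zip but does not begin with it.

```shell
# Hypothetical sample names.
names=$(printf '%s\n' 'bzip2' 'gzip' 'zipinfo' 'unzip' 'tar')

# Grouped: the ^ anchor applies to all three alternatives.
grouped=$(printf '%s\n' "$names" | grep -E '^(bz|gz|zip)')

# Ungrouped: the ^ anchor applies only to bz; gz and zip may appear anywhere.
ungrouped=$(printf '%s\n' "$names" | grep -E '^bz|gz|zip')

echo "$grouped"
echo '---'
echo "$ungrouped"
```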

Quantifiers

Extended regular expressions support several ways to specify the number of times an element is matched.

?—Match an Element Zero Times or One Time

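This quantifier means, in effect, "make the preceding element optional". Suppose we want to check a phone number for validity, and we consider a number valid if it matches either of the two forms (nnn) nnn-nnnn or nnn nnn-nnnn, where n is a digit. We could construct a regular expression like this: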

^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$

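In this expression, each parenthesis is followed by a question mark, meaning that it is matched zero or one time. Because parentheses are normally metacharacters in ERE, they are preceded by backslashes so that they are treated as literals.

[me@linuxbox ~]$ echo "(555) 123-4567" | grep -E '^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'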
(555) 123-4567
[me@linuxbox ~]$ echo "555 123-4567" | grep -E '^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
555 123-4567
[me@linuxbox ~]$ echo "AAA 123-4567" | grep -E '^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
[me@linuxbox ~]$

*—Match an Element Zero or More Times

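Like the ? metacharacter, * denotes an optional element; but unlike ?, the element may occur any number of times, not just once. Suppose we want to check whether a string is a sentence: it starts with an uppercase letter, then contains any number of uppercase and lowercase letters and spaces, and ends with a period. The following regular expression matches this (crude) definition: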

[[:upper:]][[:upper:][:lower:] ]*\.

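The expression consists of three items: a bracket expression containing the [:upper:] character class, a bracket expression containing the [:upper:] and [:lower:] character classes plus a space, and a period escaped with a backslash. The second item is followed by the * metacharacter, so that after the leading uppercase letter any number of uppercase and lowercase letters and spaces may follow and still match:

[me@linuxbox ~]$ echo "This works." | grep -E '[[:upper:]][[:upper:][:lower:] ]*\.'
This works.
[me@linuxbox ~]$ echo "This Works." | grep -E '[[:upper:]][[:upper:][:lower:] ]*\.'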
This Works.
[me@linuxbox ~]$ echo "this does not" | grep -E '[[:upper:]][[:upper:][:lower:] ]*\.'
[me@linuxbox ~]$
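
Because * also allows zero occurrences, even a single uppercase letter followed by a period counts as a sentence under this definition. The sketch below shows that on made-up lines; the ^ and $ anchors are an addition of this example so that the whole line has to fit the pattern.

```shell
# Hypothetical sample lines.
sentences=$(printf '%s\n' 'A.' 'Go now.' 'no capital.' \
  | grep -E '^[[:upper:]][[:upper:][:lower:] ]*\.$')

echo "$sentences"
```

The line "A." matches with the starred item occurring zero times, while "no capital." fails because it does not start with an uppercase letter.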

+—Match an Element One or More Times

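The + metacharacter works much like *, except that it requires at least one instance of the preceding element. The following regular expression matches only lines consisting of groups of one or more alphabetic characters separated by single spaces:

^([[:alpha:]]+ ?)+$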

[me@linuxbox ~]$ echo "This that" | grep -E '^([[:alpha:]]+ ?)+$'
This that
[me@linuxbox ~]$ echo "a b c" | grep -E '^([[:alpha:]]+ ?)+$'
a b c
[me@linuxbox ~]$ echo "a b 9" | grep -E '^([[:alpha:]]+ ?)+$'
[me@linuxbox ~]$ echo "abc  d" | grep -E '^([[:alpha:]]+ ?)+$'
[me@linuxbox ~]$

The expression does not match "a b 9" because 9 is not an alphabetic character, nor does it match "abc  d" because two spaces separate c and d.
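
The difference between + and * shows up most clearly on an empty line. This sketch counts the matching lines among three made-up inputs, one of which is empty.

```shell
# Three hypothetical lines: digits, an empty line, and letters.
# '+' requires at least one digit, so the empty line is rejected.
plus_count=$(printf '%s\n' '123' '' 'abc' | grep -Ec '^[[:digit:]]+$')

# '*' accepts zero digits, so the empty line matches as well.
star_count=$(printf '%s\n' '123' '' 'abc' | grep -Ec '^[[:digit:]]*$')

echo "lines matched with +: $plus_count"
echo "lines matched with *: $star_count"
```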

{ }—Match an Element a Specific Number of Times

The { and } metacharacters express the minimum and maximum number of required matches. They may be specified in four ways:

Table 19-3: Specifying the Number of Matches

Specifier	Meaning
{n}	Match the preceding element if it occurs exactly n times.
{n,m}	Match the preceding element if it occurs at least n times, but no more than m times.
{n,}	Match the preceding element if it occurs n or more times.
{,m}	Match the preceding element if it occurs no more than m times.
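
The specifiers in the table can be tried on a few made-up strings of digits. In this sketch the "no more than m" case is written as {0,2}, on the assumption that this spelling is accepted more widely than {,2}.

```shell
# Hypothetical sample lines with 1, 2, 3, and 5 digits.
nums=$(printf '%s\n' '1' '12' '123' '12345')

# {n}: exactly two digits.
exactly_two=$(printf '%s\n' "$nums" | grep -E '^[0-9]{2}$')

# {n,m}: at least two but no more than three digits.
two_to_three=$(printf '%s\n' "$nums" | grep -E '^[0-9]{2,3}$')

# {n,}: four or more digits.
four_or_more=$(printf '%s\n' "$nums" | grep -E '^[0-9]{4,}$')

# {0,m}: no more than two digits.
at_most_two=$(printf '%s\n' "$nums" | grep -E '^[0-9]{0,2}$')

printf '%s\n' "$exactly_two" '---' "$two_to_three" '---' "$four_or_more" '---' "$at_most_two"
```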

Using this feature, the phone number expression from earlier can be simplified from:

^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$

to:

^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$

Let's try it:

[me@linuxbox ~]$ echo "(555) 123-4567" | grep -E '^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'
(555) 123-4567
[me@linuxbox ~]$ echo "555 123-4567" | grep -E '^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'
555 123-4567
[me@linuxbox ~]$ echo "5555 123-4567" | grep -E '^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'
[me@linuxbox ~]$
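
One property of this expression is worth noting: since each parenthesis is optional on its own, a number with only one of the two parentheses is also accepted. The sketch below contrasts it with a stricter variant built from alternation; the variant and the variable names loose and strict are suggestions of this example, not part of the original text.

```shell
# The expression from the text: each parenthesis is independently optional.
loose='^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'

# A stricter sketch: either both parentheses are present, or neither is.
strict='^(\([0-9]{3}\)|[0-9]{3}) [0-9]{3}-[0-9]{4}$'

# Hypothetical test numbers, two of them with unbalanced parentheses.
numbers=$(printf '%s\n' '(555) 123-4567' '555 123-4567' '(555 123-4567' '555) 123-4567')

loose_count=$(printf '%s\n' "$numbers" | grep -Ec "$loose")
strict_count=$(printf '%s\n' "$numbers" | grep -Ec "$strict")

echo "accepted by the original expression: $loose_count"
echo "accepted by the stricter variant:    $strict_count"
```

The original expression accepts all four numbers, while the stricter variant accepts only the two well-formed ones.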

Putting Regular Expressions to Work

Validating a Phone List with grep

The earlier examples checked a single phone number at a time. A more realistic scenario is checking a whole list of numbers.

First, create the list:

[me@linuxbox ~]$ for i in {1..10}; do echo "(${RANDOM:0:3}) ${RANDOM:0:3}-${RANDOM:0:4}" >> phonelist.txt; done

This command creates a file named phonelist.txt containing ten phone numbers.

[me@linuxbox ~]$ cat phonelist.txt
(232) 298-2265
(624) 381-1078
(540) 126-1980
(874) 163-2885
(286) 254-2860
(292) 108-518
(129) 44-1379
(458) 273-1642
(686) 299-8268
(198) 307-2440

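Some of the numbers are malformed, because $RANDOM yields a number between 0 and 32767 that may have fewer digits than the substring expansion asks for. We can scan the file and list the invalid numbers: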

[me@linuxbox ~]$ grep -Ev '^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$' phonelist.txt
(292) 108-518
(129) 44-1379
[me@linuxbox ~]$

The -v option inverts the match, so only the lines that do not match the expression are printed.
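
Adding the -n option to the same check also reports where each invalid number sits in the file. The following is a minimal sketch that uses a fixed, four-line sample instead of the randomly generated list, so that the result is reproducible.

```shell
# Write a small, fixed sample list to a temporary directory (hypothetical data).
tmpdir=$(mktemp -d)
printf '%s\n' '(232) 298-2265' '(292) 108-518' '(129) 44-1379' '(198) 307-2440' \
  > "$tmpdir/phonelist.txt"

pattern='^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$'

# -E: extended syntax, -v: invert the match, -n: show line numbers.
bad=$(grep -Evn "$pattern" "$tmpdir/phonelist.txt")

if [ -n "$bad" ]; then
  echo "invalid entries (line:number):"
  echo "$bad"
else
  echo "all entries are valid"
fi

rm -r "$tmpdir"
```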

Finding Ugly Filenames with find

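The find command supports a test based on a regular expression. There is an important difference between find and grep here: grep prints a line when the line contains a string that matches the expression, whereas find requires that the pathname exactly match the regular expression. The following example finds every pathname that contains any character that is not a member of this set: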

[-_./0-9a-zA-Z]

Such a scan reveals pathnames that contain embedded spaces and other potentially troublesome characters:

[me@linuxbox ~]$ find . -regex '.*[^-_./0-9a-zA-Z].*'

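Because the whole pathname must match, the expression uses .* at both ends to match zero or more instances of any character. In the middle, a negated bracket expression contains the set of acceptable pathname characters.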
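
The same search can be tried safely in a scratch directory. In this sketch the three file names are made up, the search runs from inside the temporary directory so that its own path does not affect the match, and LC_ALL=C is set only to keep the ranges in the set predictable.

```shell
# Create a scratch directory with two acceptable names and one containing a space.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
touch 'good_name.txt' 'also-good.sh' 'bad name.txt'

# The whole pathname must match, hence the .* on both sides of the negated set.
ugly=$(LC_ALL=C find . -regex '.*[^-_./0-9a-zA-Z].*')

echo "$ugly"

# Return to the previous directory and clean up.
cd - > /dev/null || exit 1
rm -r "$tmpdir"
```

Only the pathname with the embedded space is reported.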

Searching for Files with locate

The locate program supports both basic (the --regexp option) and extended (the --regex option) regular expressions:

[me@linuxbox ~]$ locate --regex 'bin/(bz|gz|zip)'
/bin/bzcat
/bin/bzcmp
/bin/bzdiff
/bin/bzegrep
/bin/bzexe
/bin/bzfgrep
/bin/bzgrep
/bin/bzip2
/bin/bzip2recover
/bin/bzless
/bin/bzmore
/bin/gzexe
/bin/gzip
/usr/bin/zip
/usr/bin/zipcloak
/usr/bin/zipgrep
/usr/bin/zipinfo
/usr/bin/zipnote
/usr/bin/zipsplit

Searching for Text with less and vim

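less and vim share the same method of searching text. Pressing the / key followed by a regular expression performs a search.

First, open the file with less:

[me@linuxbox ~]$ less phonelist.txt

Then search: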

(232) 298-2265
(624) 381-1078
(540) 126-1980
(874) 163-2885
(286) 254-2860
(292) 108-518
(129) 44-1379
(458) 273-1642
(686) 299-8268
(198) 307-2440
~
~
~
/^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$

less highlights the strings that match.

vim, on the other hand, supports basic regular expressions, so the search expression changes to:

/([0-9]\{3\}) [0-9]\{3\}-[0-9]\{4\}

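The expression is mostly the same; however, many of the characters that are metacharacters in extended expressions are literals in basic expressions and are treated as metacharacters only when escaped with a backslash. Depending on the configuration of vim on the system, the matches are highlighted; if they are not, try the command-mode command :set hlsearch to turn highlighting on.

Note: Depending on the distribution, vim may or may not support text-search highlighting. Ubuntu, for example, ships a stripped-down version of vim by default.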