Regular Expressions

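A regular expression is a pattern, or rule, for matching strings. It can be used to search for and replace text that fits a particular rule. Nearly every programming language supports regular expressions: front-end JavaScript and back-end languages such as Python and Java all provide functions or modules for them. Python's re module, for example, provides the commonly used regular-expression functions.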

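Main use cases for regular expressions

Validating that a string has a required form, for example checking whether a username or password meets the rules, or whether an email address is well formed;
Searching for strings: finding every substring of a long text that has a given form, which is more flexible than searching for a fixed string;
Replacing text, which is more powerful than an ordinary fixed-string replacement.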
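
The three use cases above can be sketched in a few lines. This is only an illustration: the username rule (3 to 16 letters, digits, or underscores), the helper is_valid_username, and the sample text are assumptions made up for the example, not rules from any particular system.

```python
import re

# Hypothetical rule for illustration: 3-16 letters, digits, or underscores.
USERNAME_RE = re.compile(r'[A-Za-z0-9_]{3,16}')

def is_valid_username(name):
    """Return True when the whole string satisfies the username rule."""
    return USERNAME_RE.fullmatch(name) is not None

text = 'Order 66 shipped on 2024-01-15, order 67 on 2024-02-03.'

# Validating: the entire string must match the pattern.
print(is_valid_username('alice_01'))   # True
print(is_valid_username('a!'))         # False

# Searching: collect every date-like substring.
dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)
print(dates)                           # ['2024-01-15', '2024-02-03']

# Replacing: mask every date-like substring.
masked = re.sub(r'\d{4}-\d{2}-\d{2}', '<date>', text)
print(masked)
```

fullmatch() is used for validation because match() would accept a string that merely starts with a valid prefix.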

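Common functions in the re module

Function	Description
match()	Matches the regular expression only at the start of the string; returns a Match object, or None if there is no match
search()	Scans the whole string and returns a Match object for the first match, or None if there is no match
findall()	Searches the string and returns all non-overlapping matches as a list
compile()	Compiles a regular expression and returns a Pattern object
split()	Splits a string at each match of the regular expression and returns a list
sub()	Replaces the substrings that match the regular expression and returns the resulting string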

import re
# match()
# re.match(pattern, string, flags=0)
# pattern is the regular expression, string is the text to match against, flags sets the matching mode
# Matches only at the start of the string; returns a Match object on success, otherwise None
d1 = re.match('w', 'helloworldworld')
d11 = re.match('h', 'helloworld')
print(d1)    # None
print(d11)   # <re.Match object; span=(0, 1), match='h'>

# search()
# re.search(pattern, string, flags=0)
# Scans the whole string and returns a Match object for the first match, otherwise None
d2 = re.search('w', 'helloworldworld')
print(d2)  # <re.Match object; span=(5, 6), match='w'>

# findall()
# re.findall(pattern, string, flags=0)
# Finds every match of the regular expression and returns them as a list; returns an empty list if nothing matches
d3 = re.findall('w', 'helloworldworld')
print(d3)   # ['w', 'w']

# compile()
# re.compile(pattern, flags=0)
# Compiles a regular expression into a Pattern object, which can be passed to
# functions such as match() and search() or used through its own methods
d4 = re.compile('w')  # re.compile('w')
d44 = re.search(d4, 'helloworldworld')
print(d44)  # <re.Match object; span=(5, 6), match='w'>

# split()
# re.split(pattern, string, maxsplit=0, flags=0)
# Splits a string at each match of the regular expression and returns a list
d5 = re.split('h', 'helloworldhelloworldh')
print(d5)  # ['', 'elloworld', 'elloworld', '']

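# sub()
# re.sub(pattern, repl, string, count=0, flags=0)
# Replaces the substrings that match the regular expression and returns the resulting string
d6 = re.sub('l', 'k', 'helloworldhelloworldh', count=3)
print(d6)  # hekkoworkdhelloworldh; when count is omitted, every match is replaced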
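
A compiled Pattern object also offers the same operations as methods, which may be convenient when one pattern is reused many times. The following is a minimal sketch; the variable name word and the sample strings are chosen only for illustration.

```python
import re

# Compile once, then call the methods on the Pattern object.
word = re.compile(r'w')

print(word.match('helloworld'))                   # None: 'w' is not at the start
print(word.search('helloworld').span())           # (5, 6)
print(word.findall('helloworldworld'))            # ['w', 'w']
print(word.split('helloworldworld', maxsplit=1))  # ['hello', 'orldworld']
print(word.sub('W', 'helloworldworld', count=1))  # helloWorldworld
```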

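Regular expression flags

Flag	Description
re.I	Makes matching case-insensitive
re.M	Makes ^ match at the start of every line of the string, and $ match at the end of every line
re.S	Makes . match every character, including newline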
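
The effect of each flag is easiest to see by running the same pattern with and without it. The sketch below uses a made-up three-line string; the variable names are illustrative only.

```python
import re

text = 'Hello\nhello\nHELLO'

# re.I: case-insensitive matching.
case_insensitive = re.findall(r'hello', text, re.I)
print(case_insensitive)   # ['Hello', 'hello', 'HELLO']

# re.M: ^ matches at the start of every line, not only at the start of the string.
without_m = re.findall(r'^h\w+', text)
with_m = re.findall(r'^h\w+', text, re.M)
print(without_m)          # []
print(with_m)             # ['hello']

# re.S: . also matches the newline character.
without_s = re.findall(r'Hello.hello', text)
with_s = re.findall(r'Hello.hello', text, re.S)
print(without_s)          # []
print(with_s)             # ['Hello\nhello']

# Flags can be combined with the | operator.
both = re.findall(r'^h\w+', text, re.I | re.M)
print(both)               # ['Hello', 'hello', 'HELLO']
```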

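Regular expression rules

Ordinary characters

Letters, digits, the underscore, Chinese characters, and any symbol without a special meaning are ordinary characters. An ordinary character in a regular expression matches exactly one character identical to itself.

For example, the expression c applied to the string abcde matches successfully; the matched text is c and the matched position is (2, 3) (Python indices start at 0).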
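
The example above can be checked directly. This is a small sketch using re.search(), since the character c is not at the start of the string.

```python
import re

m = re.search('c', 'abcde')
print(m.group())   # c
print(m.span())    # (2, 3)

# An ordinary character that does not occur gives no match at all.
missing = re.search('z', 'abcde')
print(missing)     # None
```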

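Special characters

Characters that are awkward to type are written with a leading backslash (\), for example the tab character \t and the newline character \n;
Punctuation marks that have a special meaning stand for themselves when preceded by a backslash (\), for example {, }, [, ], (, ), \, +, *, ., $, ^, |, and ?.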
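
A short sketch of escaping follows. The sample strings are made up for illustration; re.escape() is the standard-library helper that adds the backslashes for a literal string automatically.

```python
import re

price = 'Total: $3.50 (approx.)'

# Unescaped, . matches any character, so '3.50' also matches '3x50'.
loose = re.findall(r'3.50', 'Total: 3x50')
print(loose)       # ['3x50']

# Escaped, each symbol matches only itself.
strict = re.findall(r'\$3\.50', price)
print(strict)      # ['$3.50']

# \t matches a tab character.
tabs = re.findall(r'\t', 'a\tb')
print(tabs)        # ['\t']

# re.escape() builds the escaped form of a literal string.
literal = re.escape('(approx.)')
escaped_hits = re.findall(literal, price)
print(escaped_hits)  # ['(approx.)']
```

Writing patterns as raw strings (r'...') keeps Python from interpreting the backslashes before the re module sees them.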

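Metacharacters

Metacharacter	Matches
.	Any character except a newline
\w	Any word character (letter, digit, or underscore)
\s	Any whitespace character
\d	Any digit
\W	Any character that is not a letter, digit, or underscore
\S	Any non-whitespace character
\D	Any non-digit character
a|b	Either the character a or the character b
()	Groups the expression inside the parentheses; with findall(), only the text matched by the groups is returned
[...]	Any one character from the set
[^...]	Any one character that is not in the set
^	The start of the string; it consumes no characters
$	The end of the string; it consumes no characters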
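
Groups behave slightly differently depending on the function used, so a small sketch may help. The sample text and variable names are assumptions for illustration. With search(), the Match object keeps both the whole match and each group; with findall(), only the group contents are returned when the pattern contains groups.

```python
import re

text = 'id=42 name=alice'

# search(): group(0) is the whole match, group(1), group(2) are the groups.
m = re.search(r'(\w+)=(\d+)', text)
print(m.group(0))   # id=42
print(m.group(1))   # id
print(m.group(2))   # 42
print(m.groups())   # ('id', '42')

# findall() with one group returns only the text captured by that group.
values = re.findall(r'\w+=(\w+)', text)
print(values)       # ['42', 'alice']

# findall() without a group returns the whole matches.
pairs = re.findall(r'\w+=\w+', text)
print(pairs)        # ['id=42', 'name=alice']
```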

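Quantifiers

Greedy and non-greedy matching

Quantifiers such as ?, *, and + let the same expression match a varying number of times, depending on the string being matched. An expression that repeats an indefinite number of times always matches as much as it can; this rule is called greedy matching. Appending ? to a quantifier (for example *? or +?) makes it non-greedy, so that it matches as little as possible.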
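
The difference between the two modes can be sketched with a pair of patterns that differ only in the trailing ?. The sample strings are made up for illustration.

```python
import re

text = '<b>one</b><b>two</b>'

# Greedy: .* runs to the last possible </b>.
greedy = re.findall(r'<b>.*</b>', text)
print(greedy)        # ['<b>one</b><b>two</b>']

# Non-greedy: .*? stops at the first possible </b>.
lazy = re.findall(r'<b>.*?</b>', text)
print(lazy)          # ['<b>one</b>', '<b>two</b>']

# The same applies to +: greedy takes all digits, non-greedy takes one at a time.
greedy_digits = re.findall(r'\d+', '12345')
lazy_digits = re.findall(r'\d+?', '12345')
print(greedy_digits)  # ['12345']
print(lazy_digits)    # ['1', '2', '3', '4', '5']
```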

import re
# . matches any character except a newline
res = re.findall('.', 'egH4 5#\n%##')  # ['e', 'g', 'H', '4', ' ', '5', '#', '%', '#', '#']
print(res)

# \d, \w, \s (patterns are written as raw strings so the backslashes reach the re module unchanged)
res1 = re.findall(r'\w', 'egH4_5#_n%#123#')  # ['e', 'g', 'H', '4', '_', '5', '_', 'n', '1', '2', '3']
print(res1)
res2 = re.findall(r'\s', 'egH4_5#_  n%\n#12\t3#')  # [' ', ' ', '\n', '\t']
print(res2)
res3 = re.findall(r'\d', 'egH4_5#_n%#123#')  # ['4', '5', '1', '2', '3']
print(res3)

# \D, \W, \S
res4 = re.findall(r'\W', 'egH4_5#_n%#123#')  # ['#', '%', '#', '#']
print(res4)
res5 = re.findall(r'\S', 'egH4_5#_  n%\n#12\t3#')  # ['e', 'g', 'H', '4', '_', '5', '#', '_', 'n', '%', '#', '1', '2', '3', '#']
print(res5)
res6 = re.findall(r'\D', 'egH4_5#_n%#123#')  # ['e', 'g', 'H', '_', '#', '_', 'n', '%', '#', '#']
print(res6)

# a|b, (), [...]
res7 = re.findall(r'\d|\s', 'egH4_ 5#_\n%#\t#')  # ['4', ' ', '5', '\n', '\t']
print(res7)
res8 = re.findall(r'\w_(\d)\s(\w)', 'we_e95sd_6 g4_3 sfdsdsewr _8 sfe')  # [('6', 'g'), ('3', 's')]
print(res8)
res9 = re.findall(r'[\d_&#]', 'egH4_ 5#_\n%##')   # ['4', '_', '5', '#', '_', '#', '#']; any one character listed inside the brackets counts as a match
print(res9)

# [^...], ^, $
res10 = re.findall(r'[^\d_&#]', 'egH4_ 5#_\n%##')  # ['e', 'g', 'H', ' ', '\n', '%']
print(res10)
res11 = re.findall(r'^\w', 'egH4_\n5#_\n%##', re.M)  # with re.M, ^ matches at the start of each line: ['e', '5']
print(res11)
res12 = re.findall(r'\w$', 'egH4_\n5#h\n%#4')  # $ matches at the end of the string: ['4']
print(res12)

# {n}: exactly n times
res = re.findall(r'\d{3}', 'sdf34fsd3dsgfd123456')  # ['123', '456']
print(res)
# {n,m}: at least n and at most m times
res = re.findall(r'\d{2,5}', '12345678912345678889534')  # ['12345', '67891', '23456', '78889', '534']; at least two, at most five
print(res)
# {n,}: at least n times
res = re.findall(r'\d{3,}', '23sdf3dd23432sdfs323rsdfds2342352')  # ['23432', '323', '2342352']; at least three
print(res)

# ?: zero or one time
res = re.findall(r'\d?', '23dsfs3dweIUI')  # ['2', '3', '', '', '', '', '3', '', '', '', '', '', '', '']
print(res)
# +: one or more times (greedy)
res = re.findall(r'\d+', '23dsfs3dwe12345dds12345678')  # ['23', '3', '12345', '12345678']; at least once
print(res)
# *: zero or more times (greedy)
res = re.findall(r'\d*', '23dsfs3dwe12345dds12345678')  # ['23', '', '', '', '', '3', '', '', '', '12345', '', '', '', '12345678', '']; any number of times
print(res)

Using the "universal pattern"

The "universal pattern" refers to .*?, a non-greedy match of any sequence of characters.

import re

# The universal pattern .*?
str2 = '''
 <div data-role="ershoufang" >
     <div>
        <a href="/ershoufang/yuhua/"  title="长沙雨花在售二手房 ">雨花</a>
         <a href="/ershoufang/yuelu/"  title="长沙岳麓在售二手房 ">岳麓</a>
        <a href="/ershoufang/tianxin/"  title="长沙天心在售二手房 ">天心</a>
         <a href="/ershoufang/kaifu/"  title="长沙开福在售二手房 ">开福</a>
        <a href="/ershoufang/furong/"  title="长沙芙蓉在售二手房 ">芙蓉</a>
        <a href="/ershoufang/wangcheng/"  title="长沙望城在售二手房 ">望城</a>
        <a href="/ershoufang/ningxiang/"  title="长沙宁乡在售二手房 ">宁乡</a>
        <a href="/ershoufang/liuyang/"  title="长沙浏阳在售二手房 ">浏阳</a>
        <a href="/ershoufang/changshaxian/"  title="长沙长沙县在售二手房 ">长沙县</a>
    </div>
 </div>
'''
# Named pattern rather than str, so the built-in str type is not shadowed
pattern = '<a href="/ershoufang/.*?/"  title="长沙(.*?) ">(.*?)</a>'
re1 = re.findall(pattern, str2)
print(re1)

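Output:

[('雨花在售二手房', '雨花'), ('岳麓在售二手房', '岳麓'), ('天心在售二手房', '天心'), ('开福在售二手房', '开福'), ('芙蓉在售二手房', '芙蓉'), ('望城在售二手房', '望城'), ('宁乡在售二手房', '宁乡'), ('浏阳在售二手房', '浏阳'), ('长沙县在售二手房', '长沙县')]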

import re
s = """
<div>
    <div class='t1'>
        <span>hhhh</span>
        <ul>
             <li>1</li>
             <li>1</li>
             <li>1</li>
             <li>1</li>
        </ul>
    </div>
    <span>lllll</span>
</div>
"""
res = re.findall("<div class='t1'>.*?</div>",s,re.S)
print(res)

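Without re.S the result above is [], that is, nothing matches, because <div class='t1'> is followed by a newline, which . does not match by default. If .*? is replaced with .*, the match runs on to the last </div> and therefore also captures the span tag that follows; this is greedy matching.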
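
The three cases can be compared side by side. The sketch below uses a shortened version of the HTML above; the variable names are illustrative only.

```python
import re

html = """
<div>
    <div class='t1'>
        <span>hhhh</span>
    </div>
    <span>lllll</span>
</div>
"""

pattern_lazy = "<div class='t1'>.*?</div>"
pattern_greedy = "<div class='t1'>.*</div>"

# Without re.S, . cannot cross the newline after the opening tag.
no_flag = re.findall(pattern_lazy, html)
print(no_flag)                   # []

# Non-greedy with re.S: stops at the first </div>.
lazy = re.findall(pattern_lazy, html, re.S)
print('lllll' in lazy[0])        # False

# Greedy with re.S: runs to the last </div>, so the second span is included.
greedy = re.findall(pattern_greedy, html, re.S)
print('lllll' in greedy[0])      # True
```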

Parsing with regular expressions