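String Processing
A few simple string-processing recipes.
Splitting a String on Multiple Delimiters
This is useful for filtering out delimiters, and even for filtering the strings themselves.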
import re
line = 'asdf fjdk; afed, fjek,asdf, foo'
sep_list = re.split(r'[;,\s]\s*',line)
print(sep_list)
# ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
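If the delimiters themselves are needed later, for example to rebuild the string, wrapping the pattern in a capture group keeps them in the result. The following is a minimal sketch that reuses the sample line above; the names `fields`, `values`, and `delimiters` are illustrative only.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

# With a capture group, re.split() also returns the text matched by the group.
fields = re.split(r'(;|,|\s)\s*', line)

values = fields[::2]        # even positions hold the fields
delimiters = fields[1::2]   # odd positions hold the captured delimiters

print(values)      # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
print(delimiters)  # [' ', ';', ',', ',', ',']
```

To group alternatives without capturing them, a non-capturing group of the form `(?:...)` can be used instead.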
Matching Text at the Start or End of a String
filename = 'spam.txt'
print(filename.endswith('.txt')) # True
print(filename.startswith('file:')) # False
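Both methods also accept a tuple of candidates, which is handy when several prefixes or suffixes are acceptable. A small sketch, with a made-up list of file names:

```python
filenames = ['Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h']

# The candidates must be passed as a tuple; a list or set has to be
# converted with tuple() first.
c_files = [name for name in filenames if name.endswith(('.c', '.h'))]
print(c_files)     # ['foo.c', 'spam.c', 'spam.h']

has_python = any(name.endswith('.py') for name in filenames)
print(has_python)  # True
```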
Matching Strings Using Shell Wildcard Patterns
from fnmatch import fnmatch, fnmatchcase
fnmatch('foo.txt', '*.txt') # True
fnmatch('foo.txt', '?oo.txt') # True
fnmatch('Dat45.csv', 'Dat[0-9]*') # True
names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']
[name for name in names if fnmatch(name, 'Dat*.csv')]
# ['Dat1.csv', 'Dat2.csv']
# On OS X (Mac)
fnmatch('foo.txt', '*.TXT')
# False
# On Windows
fnmatch('foo.txt', '*.TXT')
# True
fnmatchcase('foo.txt', '*.TXT') # False
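These functions are not limited to file names. As a rough sketch, `fnmatchcase()` can filter arbitrary strings in the same way; the address list below is invented for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical sample data, not taken from the notes above.
addresses = [
    '5412 N CLARK ST',
    '1060 W ADDISON ST',
    '1039 W GRANVILLE AVE',
    '2122 N CLARK ST',
    '4802 N BROADWAY',
]

streets = [addr for addr in addresses if fnmatchcase(addr, '* ST')]
print(streets)
# ['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']

clark_54xx = [addr for addr in addresses
              if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]
print(clark_54xx)
# ['5412 N CLARK ST']
```

Because `fnmatchcase()` compares case exactly as written, the result does not depend on the operating system.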
Matching and Searching for Text Patterns
import re
text = 'yeah, but no, but yeah, but no, but yeah'
print( text.find('no'))
# 10 (index of the first occurrence)
text1 = '11/27/2012'
text2 = 'Nov 27, 2012'
pattern = re.compile(r'\d+/\d+/\d+')
print(re.match(pattern, text1))
# <re.Match object; span=(0, 10), match='11/27/2012'>  (shown as <_sre.SRE_Match object ...> before Python 3.7)
print(re.match(pattern, text2))
# None
# `match()` always tries to match at the start of the string. To find occurrences of a pattern anywhere in the string, use `findall()` instead
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
print(pattern.findall(text))
# ['11/27/2012', '3/13/2013']
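When the individual parts of each match are needed, capture groups can be added to the pattern. The sketch below assumes the same date format as above; `date_pattern`, `iso_dates`, and `spans` are illustrative names.

```python
import re

# Parentheses create capture groups for month, day, and year.
date_pattern = re.compile(r'(\d+)/(\d+)/(\d+)')

m = date_pattern.match('11/27/2012')
month, day, year = m.groups()
print(month, day, year)  # 11 27 2012

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# With groups in the pattern, findall() returns one tuple per match.
iso_dates = ['{}-{}-{}'.format(y, mo, d)
             for mo, d, y in date_pattern.findall(text)]
print(iso_dates)  # ['2012-11-27', '2013-3-13']

# finditer() yields match objects lazily, which also expose positions.
spans = [found.span() for found in date_pattern.finditer(text)]
```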
Searching and Replacing Text
# Literal replacement
text = 'yeah, but no, but yeah, but no, but yeah'
text.replace('yeah', 'yep')
# Replacement with a more complex pattern
import re
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
res = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
print(res)
# Today is 2012-11-27. PyCon starts 2013-3-13.
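# The first argument to sub() is the pattern to match and the second is the replacement pattern. Backslashed digits such as \3 refer to capture group numbers in the pattern.
# Case-insensitive search and replace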
text = 'UPPER PYTHON, lower python, Mixed Python'
res = re.findall('python', text, flags=re.IGNORECASE)
print(res)
# ['PYTHON', 'python', 'Python']
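The same flag works for replacement with `re.sub()`. A plain replacement does not follow the case of the matched text, so the sketch below also includes a small helper that does; `matchcase` is a hypothetical helper written for this example, not part of the `re` module.

```python
import re

text = 'UPPER PYTHON, lower python, Mixed Python'

# Plain case-insensitive replacement: the replacement text is used as-is.
replaced = re.sub('python', 'snake', text, flags=re.IGNORECASE)
print(replaced)  # UPPER snake, lower snake, Mixed snake


def matchcase(word):
    """Return a replacement callback that mimics the case of each match."""
    def replace(m):
        matched = m.group()
        if matched.isupper():
            return word.upper()
        if matched.islower():
            return word.lower()
        if matched[0].isupper():
            return word.capitalize()
        return word
    return replace


# re.sub() accepts a function as the replacement argument.
case_aware = re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
print(case_aware)  # UPPER SNAKE, lower snake, Mixed Snake
```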
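Shortest-Match (Non-Greedy) Patterns
When writing a regex, use `.*?` instead of `.*`. The `?` after `*` makes the match non-greedy, so it stops at the shortest possible match.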
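The difference between greedy and non-greedy matching is easiest to see on text with several quoted parts. A minimal sketch with an invented sample sentence:

```python
import re

text = 'Computer says "no." Phone says "yes."'

# Greedy: .* runs to the last closing quote in the string.
greedy = re.compile(r'"(.*)"')
greedy_result = greedy.findall(text)
print(greedy_result)  # ['no." Phone says "yes.']

# Non-greedy: .*? stops at the first closing quote it can.
lazy = re.compile(r'"(.*?)"')
lazy_result = lazy.findall(text)
print(lazy_result)    # ['no.', 'yes.']
```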
Multiline Pattern Matching
text = '''/* this is a
multiline comment */
'''
comment = re.compile(r'/\*(.*?)\*/')
print(comment.findall(text))
# [] because . does not match a newline
comment = re.compile(r'/\*((?:.|\n)*?)\*/')
print(comment.findall(text))
# [' this is a\nmultiline comment ']
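# Another approach, not given in the book: pass the `re.S` flag (an alias of `re.DOTALL`) so that . matches any character, including a newline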
comment = re.compile(r'/\*(.*?)\*/',re.S)
print(comment.findall(text))
# [' this is a\nmultiline comment ']
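Stripping Unwanted Characters from Strings
# strip() removes characters from the beginning and end of a string. lstrip() and rstrip() strip from the left and from the right respectively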
s = ' hello world \n'
print(s.strip()) # hello world
t = '-----hello====='
print(t.lstrip('-')) # hello=====
print(t.strip('-=')) # hello
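Note that the strip methods never touch characters in the middle of a string. For inner whitespace, `replace()` or a regular-expression substitution can be used instead. A short sketch with an invented sample string:

```python
import re

s = '  hello     world  \n'

stripped = s.strip()
print(repr(stripped))   # 'hello     world'  (inner spaces are kept)

# Remove every space, including the inner ones.
no_spaces = stripped.replace(' ', '')
print(repr(no_spaces))  # 'helloworld'

# Collapse each run of whitespace into a single space.
collapsed = re.sub(r'\s+', ' ', stripped)
print(repr(collapsed))  # 'hello world'
```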
Aligning Text Strings
# Use the string methods ljust(), rjust(), and center()
text = 'Hello World'
text.ljust(20)
# 'Hello World         '
text.rjust(20)
# '         Hello World'
text.center(20)
# '    Hello World     '
# All of these methods accept an optional fill character
text.rjust(20,'=')
# =========Hello World
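The built-in `format()` function offers the same alignment through format specifiers (`<`, `>`, `^`), and it also works for values that are not strings. A brief sketch; the variable names are illustrative.

```python
text = 'Hello World'

right = format(text, '>20')     # right-aligned in 20 columns
left = format(text, '<20')      # left-aligned
centered = format(text, '^20')  # centered

# A fill character goes in front of the alignment symbol.
filled = format(text, '=>20')
print(filled)   # =========Hello World
starred = format(text, '*^20')
print(starred)  # ****Hello World*****

# Unlike ljust()/rjust()/center(), format() also handles numbers.
number = format(1.2345, '>10.2f')
print(repr(number))  # '      1.23'
```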
Combining and Concatenating Strings
parts = ['Is', 'Chicago', 'Not', 'Chicago?']
print(' '.join(parts))
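`join()` only accepts strings, so mixed data has to be converted first. A generator expression does this without building an intermediate list. A minimal sketch with made-up values:

```python
data = ['ACME', 50, 91.1]

# Convert each item to str on the fly while joining.
csv_line = ','.join(str(item) for item in data)
print(csv_line)  # ACME,50,91.1

# For a handful of values, print() with sep gives the same output directly.
print(*data, sep=',')  # ACME,50,91.1
```

For many pieces, `join()` is generally preferable to repeated `+=` in a loop, which may create a new string on each step.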
Interpolating Variables in Strings
s = '{name} has {n} messages.'
res = s.format(name='Guido',n=32)
print(res)
# Guido has 32 messages.
# Alternatively, if the values to substitute are found in variables in scope, combine format_map() with vars()
name = 'Guido'
n = 37
res = s.format_map(vars())
print(res)
# Guido has 37 messages.
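On Python 3.6 and later, an f-string covers the common case more directly. When some values may be missing, `format_map()` can be combined with a dict subclass that defines `__missing__()`. The sketch below shows both; `SafeDict` is a hypothetical helper class written for this example.

```python
name = 'Guido'
n = 37

# f-strings (Python 3.6+) read variables from the enclosing scope.
message = f'{name} has {n} messages.'
print(message)  # Guido has 37 messages.


class SafeDict(dict):
    """Leave unknown placeholders in place instead of raising KeyError."""
    def __missing__(self, key):
        return '{' + key + '}'


template = '{name} has {n} messages.'
partial = template.format_map(SafeDict(name='Guido'))
print(partial)  # Guido has {n} messages.
```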
Reformatting Text to a Fixed Number of Columns
Use the textwrap module to reformat text for output.
import textwrap
s = "Look into my eyes, look into my eyes, the eyes, the eyes, \
the eyes, not around the eyes, don't look around the eyes, \
look into my eyes, you're under."
print(textwrap.fill(s, 70))
# Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,
# not around the eyes, don't look around the eyes, look into my eyes,
# you're under.
print(textwrap.fill(s, 40))
# Look into my eyes, look into my eyes,
# the eyes, the eyes, the eyes, not around
# the eyes, don't look around the eyes,
# look into my eyes, you're under.
print(textwrap.fill(s, 40, initial_indent='    '))
#     Look into my eyes, look into my
# eyes, the eyes, the eyes, the eyes, not
# around the eyes, don't look around the
# eyes, look into my eyes, you're under.
print(textwrap.fill(s, 40, subsequent_indent='    '))
# Look into my eyes, look into my eyes,
#     the eyes, the eyes, the eyes, not
#     around the eyes, don't look around
#     the eyes, look into my eyes, you're
#     under.
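When the text should fit the current terminal, the width can be queried at run time instead of being hard-coded. This is a sketch that assumes `shutil.get_terminal_size()` is an acceptable way to get the width; the fallback of 80 columns is an arbitrary choice.

```python
import shutil
import textwrap

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")

# Falls back to 80x24 when the size cannot be determined (e.g. output is piped).
width = shutil.get_terminal_size(fallback=(80, 24)).columns

wrapped = textwrap.fill(s, width)
print(wrapped)
```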
Handling HTML and XML Entities in Text
import html
s = 'Elements are written as "<tag>text</tag>".'
print(s)
# Elements are written as "<tag>text</tag>".
print(html.escape(s))
# Elements are written as &quot;&lt;tag&gt;text&lt;/tag&gt;&quot;.
# Do not escape quote characters
print(html.escape(s, quote=False))
# Elements are written as "&lt;tag&gt;text&lt;/tag&gt;".
# HTMLParser.unescape() was removed in Python 3.9; html.unescape() replaces it
s = 'Spicy &quot;Jalape&#241;o&quot;.'
print(html.unescape(s))
# Spicy "Jalapeño".
t = 'The prompt is &gt;&gt;&gt;'
from xml.sax.saxutils import unescape
print(unescape(t))
# The prompt is >>>
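Going the other way, text that must be emitted as ASCII can have its non-ASCII characters written as numeric character references by passing `errors='xmlcharrefreplace'` when encoding. A small sketch with an invented sample string:

```python
import html

s = 'Spicy Jalapeño'

# Characters outside ASCII become &#NNN; references.
encoded = s.encode('ascii', errors='xmlcharrefreplace')
print(encoded)  # b'Spicy Jalape&#241;o'

# html.unescape() turns the references back into characters.
round_trip = html.unescape(encoded.decode('ascii'))
print(round_trip)  # Spicy Jalapeño
```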