String Processing

A few simple string-processing recipes.

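Splitting a String on Multiple Delimiters

re.split() is useful when the delimiters are inconsistent, for example a mix of semicolons, commas, and whitespace. It can also discard the whitespace that surrounds them.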

import re
line = 'asdf fjdk; afed, fjek,asdf, foo'
sep_list = re.split(r'[;,\s]\s*',line)

print(sep_list)
# ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
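If the delimiters themselves are needed later, for example to rebuild the line, one option is to put the alternatives in a capture group so that re.split() also returns the matched delimiters. A minimal sketch on the same input (the names fields, values, and delimiters are illustrative, not from the original notes):

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

# A capture group makes re.split() include the matched delimiters in the result
fields = re.split(r'(;|,|\s)\s*', line)

values = fields[::2]       # even positions: the split values
delimiters = fields[1::2]  # odd positions: the delimiter that followed each value

print(values)      # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
print(delimiters)  # [' ', ';', ',', ',', ',']
```

A non-capturing group, (?:...), keeps the grouping without returning the delimiters.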

Matching Text at the Start or End of a String

filename = 'spam.txt'
print(filename.endswith('.txt'))   # True
print(filename.startswith('file:')) # False
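Both methods also accept a tuple of candidates, which avoids chaining several checks with or. A short sketch with a made-up list of file names:

```python
# Hypothetical directory listing, used only for illustration
filenames = ['Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h']

# endswith() accepts a tuple (not a list) of suffixes
c_files = [name for name in filenames if name.endswith(('.c', '.h'))]
has_python = any(name.endswith('.py') for name in filenames)

print(c_files)     # ['foo.c', 'spam.c', 'spam.h']
print(has_python)  # True
```

Passing a list instead of a tuple raises TypeError, so convert with tuple() first if the choices come from a list.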

Matching Strings with Shell Wildcard Patterns

from fnmatch import fnmatch, fnmatchcase

fnmatch('foo.txt', '*.txt') # True 
fnmatch('foo.txt', '?oo.txt') # True 
fnmatch('Dat45.csv', 'Dat[0-9]*') # True 
names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']
[name for name in names if fnmatch(name, 'Dat*.csv')]
# ['Dat1.csv', 'Dat2.csv']

# On OS X (Mac)
fnmatch('foo.txt', '*.TXT')
# False
# On Windows
fnmatch('foo.txt', '*.TXT')
# True

fnmatchcase('foo.txt', '*.TXT') # False
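The wildcard functions are not limited to file names. Because fnmatchcase() compares case exactly as the pattern is written on every platform, it can also serve as a lightweight matcher for other text. A small sketch with invented street addresses:

```python
from fnmatch import fnmatchcase

# Made-up sample data for illustration
addresses = [
    '5412 N CLARK ST',
    '1060 W ADDISON ST',
    '1039 W GRANVILLE AVE',
    '2122 N CLARK ST',
    '4802 N BROADWAY',
]

streets = [addr for addr in addresses if fnmatchcase(addr, '* ST')]
clark_54xx = [addr for addr in addresses
              if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]

print(streets)     # ['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']
print(clark_54xx)  # ['5412 N CLARK ST']
```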

Matching and Searching for Text

import re

text = 'yeah, but no, but yeah, but no, but yeah'
print(text.find('no'))
# 10, the index of the first occurrence

text1 = '11/27/2012'
text2 = 'Nov 27, 2012'

pattern = re.compile(r'\d+/\d+/\d+')

print(pattern.match(text1))
# <re.Match object; span=(0, 10), match='11/27/2012'>
print(pattern.match(text2))
# None

# `match()` always tries to match at the start of the string. To find every occurrence of the pattern anywhere in the text, use `findall()` instead

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

print(pattern.findall(text)) 
# ['11/27/2012', '3/13/2013']
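When the individual parts of each match are needed, capture groups combined with search() or finditer() give access to them through match objects. A minimal sketch, assuming the month/day/year order used in the sample text (datepat, first, and dates are illustrative names):

```python
import re

# Same date pattern as above, with each field in its own capture group
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# search() finds the first match anywhere in the string
first = datepat.search(text)
month, day, year = first.groups()
print(first.start(), month, day, year)  # 9 11 27 2012

# finditer() yields one match object per occurrence
dates = [(int(m.group(3)), int(m.group(1)), int(m.group(2)))
         for m in datepat.finditer(text)]
print(dates)  # [(2012, 11, 27), (2013, 3, 13)]
```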

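Searching and Replacing Text

# Literal replacement
text = 'yeah, but no, but yeah, but no, but yeah'
print(text.replace('yeah', 'yep'))
# yep, but no, but yep, but no, but yep

# Replacement with a more complex pattern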
import re 

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
res = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
print(res)
# Today is 2012-11-27. PyCon starts 2013-3-13.

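# The first argument to sub() is the pattern to match and the second is the replacement pattern. Backslashed digits such as \3 refer to the capture groups of that pattern.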
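When a replacement string is not flexible enough, sub() and subn() also accept a callback that receives each match object and returns the replacement text. The sketch below zero-pads the date fields; to_iso is a hypothetical helper name, and subn() is used so the number of substitutions is returned as well:

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

def to_iso(m):
    """Build a zero-padded YYYY-MM-DD string from one month/day/year match."""
    month, day, year = (int(g) for g in m.groups())
    return '{:04d}-{:02d}-{:02d}'.format(year, month, day)

# subn() works like sub() but also reports how many replacements were made
iso_text, count = datepat.subn(to_iso, text)
print(iso_text)  # Today is 2012-11-27. PyCon starts 2013-03-13.
print(count)     # 2
```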


# Case-insensitive search (the same flags argument also works with re.sub())
text = 'UPPER PYTHON, lower python, Mixed Python'
res = re.findall('python', text, flags=re.IGNORECASE)
print(res)
# ['PYTHON', 'python', 'Python']
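The same flag applies to replacement. A plain re.sub() inserts the replacement exactly as written, so if the replacement should follow the case of the matched text, a callback is one way to do it. The matchcase helper below is an illustrative sketch, not part of the re module:

```python
import re

text = 'UPPER PYTHON, lower python, Mixed Python'

# Plain case-insensitive replacement: the new word is inserted as written
plain = re.sub('python', 'snake', text, flags=re.IGNORECASE)
print(plain)  # UPPER snake, lower snake, Mixed snake

def matchcase(word):
    """Return a callback that adapts `word` to the case of each match."""
    def replace(m):
        found = m.group()
        if found.isupper():
            return word.upper()
        if found.islower():
            return word.lower()
        if found[0].isupper():
            return word.capitalize()
        return word
    return replace

matched = re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
print(matched)  # UPPER SNAKE, lower snake, Mixed Snake
```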

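Shortest-Match Patterns

By default the * operator is greedy and matches as much text as possible. Write .*? instead of .* in the regular expression to get the shortest match.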
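The difference shows up as soon as a line contains more than one quoted phrase. A short sketch with a made-up sentence:

```python
import re

text = 'Computer says "no." Phone says "yes."'

# Greedy: .* runs to the last closing quote on the line
greedy = re.findall(r'"(.*)"', text)
print(greedy)  # ['no." Phone says "yes.']

# Non-greedy: .*? stops at the first closing quote it can
lazy = re.findall(r'"(.*?)"', text)
print(lazy)    # ['no.', 'yes.']
```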

Multiline Patterns

text = '''/* this is a
		multiline comment */
	'''
comment = re.compile(r'/\*(.*?)\*/')

print(comment.findall(text))
# []   because . does not match a newline

comment = re.compile(r'/\*((?:.|\n)*?)\*/')
print(comment.findall(text))
# [' this is a\n\t\tmultiline comment ']

# Alternatively, pass the `re.S` (`re.DOTALL`) flag so that . matches any character, including a newline
comment = re.compile(r'/\*(.*?)\*/',re.S)
print(comment.findall(text))
# [' this is a\n\t\tmultiline comment ']

Stripping Unwanted Characters from Strings

# strip() removes characters from the beginning and end of a string; lstrip() and rstrip() strip only the left or the right side
s = ' hello world \n'

print(s.strip()) # hello world 
t = '-----hello====='
print(t.lstrip('-')) # hello=====
print(t.strip('-=')) # hello
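The strip methods never touch characters in the middle of a string. For inner whitespace, replace() or re.sub() is needed; a minimal sketch (variable names are illustrative):

```python
import re

s = ' hello     world \n'

stripped = s.strip()
print(repr(stripped))   # 'hello     world' (inner spaces untouched)

# Collapse every run of whitespace into a single space
collapsed = re.sub(r'\s+', ' ', stripped)
print(repr(collapsed))  # 'hello world'

# Remove all space characters, wherever they are (the newline remains)
no_spaces = s.replace(' ', '')
print(repr(no_spaces))  # 'helloworld\n'
```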

Aligning Strings

# Use the ljust(), rjust(), and center() string methods
text = 'Hello World'
text.ljust(20)
# 'Hello World         '
text.rjust(20)
# '         Hello World'
text.center(20)
# '    Hello World     '

# All of these methods also accept an optional fill character

text.rjust(20,'=')
# '=========Hello World'
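The built-in format() function and str.format() offer the same alignment through the <, >, and ^ format codes, and they also work on values that are not strings. A brief sketch:

```python
text = 'Hello World'

right = format(text, '>20')      # same result as text.rjust(20)
centered = format(text, '*^20')  # fill character goes before the alignment code
print(repr(right))     # '         Hello World'
print(repr(centered))  # '****Hello World*****'

# Several fields can be aligned in one call
row = '{:>10s} {:>10s}'.format('Hello', 'World')
print(repr(row))       # '     Hello      World'

# Numbers can be aligned and rounded in the same step
x = format(1.2345, '>10.2f')
print(repr(x))         # '      1.23'
```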

Combining and Concatenating Strings

parts = ['Is', 'Chicago', 'Not', 'Chicago?']
print(' '.join(parts))
# Is Chicago Not Chicago?
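join() requires every item to be a string, so mixed data has to be converted first. A generator expression does this without building a temporary list. A small sketch with invented values:

```python
# Sample record mixing strings and numbers (illustrative data)
data = ['ACME', 50, 91.1]

# Convert each item to str on the fly while joining
csv_line = ','.join(str(d) for d in data)
print(csv_line)  # ACME,50,91.1
```

Building a long string with repeated += in a loop creates a new string at every step, so collecting the pieces and joining once is usually the better choice.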

Interpolating Variables in Strings

s = '{name} has {n} messages.'
res = s.format(name='Guido',n=32)
print(res)
# Guido has 32 messages.

# Alternatively, if the values to substitute are already defined as variables, combine format_map() with vars()
name = 'Guido'
n = 37
res = s.format_map(vars())
print(res)
# Guido has 37 messages.
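Two related techniques may help here. On Python 3.6 and later, an f-string interpolates variables directly. When some values may be missing, format_map() can be given a dict subclass that defines __missing__ so that unknown fields are left in place. SafeDict below is a hypothetical helper name used for this sketch:

```python
name = 'Guido'
n = 37

# f-strings (Python 3.6+) read the variables from the enclosing scope
msg = f'{name} has {n} messages.'
print(msg)  # Guido has 37 messages.

class SafeDict(dict):
    """Leave a placeholder untouched when its key is missing."""
    def __missing__(self, key):
        return '{' + key + '}'

s = '{name} has {n} messages.'
partial = s.format_map(SafeDict(name='Guido'))
print(partial)  # Guido has {n} messages.
```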

Reformatting Text to a Fixed Number of Columns

Use the textwrap module to reformat text for output.

import textwrap
s = "Look into my eyes, look into my eyes, the eyes, the eyes, \
the eyes, not around the eyes, don't look around the eyes, \
look into my eyes, you're under."

print(textwrap.fill(s, 70))
# Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,
# not around the eyes, don't look around the eyes, look into my eyes,
# you're under.

print(textwrap.fill(s, 40))

# Look into my eyes, look into my eyes,
# the eyes, the eyes, the eyes, not around
# the eyes, don't look around the eyes,
# look into my eyes, you're under.

print(textwrap.fill(s, 40, initial_indent='    '))

#     Look into my eyes, look into my
# eyes, the eyes, the eyes, the eyes, not
# around the eyes, don't look around the
# eyes, look into my eyes, you're under.


print(textwrap.fill(s, 40, subsequent_indent='    '))

# Look into my eyes, look into my eyes,
#     the eyes, the eyes, the eyes, not
#     around the eyes, don't look around
#     the eyes, look into my eyes, you're
#     under.
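When the wrapped lines are needed individually rather than as one string, textwrap.wrap() returns them as a list; fill() is equivalent to joining that list with newlines. A short sketch using the same text:

```python
import textwrap

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")

# wrap() returns a list of lines, each at most `width` characters long
lines = textwrap.wrap(s, width=40)

for number, line in enumerate(lines, start=1):
    print(number, line)

# fill() produces the same lines joined by newlines
same = '\n'.join(lines) == textwrap.fill(s, 40)
print(same)  # True
```

To wrap to the current terminal width, shutil.get_terminal_size().columns can supply the width argument.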

Handling HTML and XML Entities in Text

import html 

s = 'Elements are written as "<tag>text</tag>".'
print(s)

# Elements are written as "<tag>text</tag>".

print(html.escape(s))

# Elements are written as &quot;&lt;tag&gt;text&lt;/tag&gt;&quot;.

# Leave quotation marks unescaped
print(html.escape(s, quote=False))

# Elements are written as "&lt;tag&gt;text&lt;/tag&gt;".

# HTMLParser.unescape() was removed in Python 3.9; html.unescape() is the supported replacement
s = 'Spicy &quot;Jalape&#241;o&quot.'

print(html.unescape(s))
# Spicy "Jalapeño".

t = 'The prompt is &gt;&gt;&gt;'
from xml.sax.saxutils import unescape
print(unescape(t))
# The prompt is >>>
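For output that must be plain ASCII, non-ASCII characters can be turned into numeric character references during encoding with errors='xmlcharrefreplace'. A minimal round-trip sketch (variable names are illustrative):

```python
import html

s = 'Spicy "Jalapeño".'

# Escape the markup-sensitive characters first
escaped = html.escape(s)

# Then encode to ASCII, replacing non-ASCII characters with &#NNN; references
ascii_bytes = escaped.encode('ascii', errors='xmlcharrefreplace')
print(ascii_bytes)  # b'Spicy &quot;Jalape&#241;o&quot;.'

# html.unescape() reverses both steps
roundtrip = html.unescape(ascii_bytes.decode('ascii'))
print(roundtrip == s)  # True
```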

Python3-cookbook Notes 2: Strings and Text Processing
