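Reading and Writing Files

Creating a File

Python's open() function opens a file and returns a file object. Any file processing starts with this call; if the file cannot be opened, an OSError is raised.

**Note:** A file object opened with open() must be closed by calling its close() method.

open() is most commonly called with two arguments: the file name (file) and the mode (mode). mode is optional and sets how the file is opened.

Syntax: open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

f = open("new_file.txt", "w")   # create and open
f.write("some text...")         # write something into the file
f.close()                       # close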

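A relative file name is resolved against the current working directory. If you run me.py from its own folder, that is the script's directory, so new_file.txt is created next to the script.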
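As a quick check of where a relative file name ends up, the sketch below resolves it against the current working directory. The file name is only an example and nothing is written to disk.

```python
import os

relative = "new_file.txt"  # example name, not created here

# A relative path is resolved against the current working directory,
# which is where open(relative, "w") would create the file.
resolved = os.path.abspath(relative)
expected = os.path.join(os.getcwd(), relative)

print(resolved)
print(resolved == expected)
```

If the script is started from another folder, os.getcwd() (and therefore the location of the new file) changes with it.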

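If writing f.close() feels tedious, or you worry about forgetting it, Python offers another way to open files: the with statement, which opens the file and closes it automatically when the block ends.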

with open("new_file2.txt", "w") as f:
    f.writelines(["some text for file2...\n", "2nd line\n"])
    # writelines() writes each element of the list as-is; the "\n" at the end of each element is what puts it on its own line.

Reading a File

Reading works much like writing: change the w to r. w stands for write and r stands for read.

f = open("new_file2.txt", "r")
print(f.read())
f.close()

Output:

some text for file2...
2nd line

If your records are in a list and you want one value per line in the file, you can write them with writelines(); when reading, readlines() returns the lines directly as a list.

with open("new_file2.txt", "r") as f:
    print(f.readlines())

Output:

['some text for file2...\n', '2nd line\n']

readline() (no trailing s, unlike readlines()) reads one line at a time instead of loading the whole file, so memory is not filled all at once.

with open("new_file2.txt", "r") as f:
    while True:
        line = f.readline()
        print(line)
        if not line:
            break

Output:

some text for file2...

2nd line
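The same line-by-line reading can also be written by looping over the file object itself. This is a minimal sketch that uses a temporary directory (an assumption, so that no real files are touched) and strips the trailing newline, which avoids the blank lines seen in the output above.

```python
import os
import tempfile

# Work in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "new_file2.txt")
    with open(path, "w") as f:
        f.writelines(["some text for file2...\n", "2nd line\n"])

    lines = []
    with open(path, "r") as f:
        for line in f:                       # yields one line per iteration
            lines.append(line.rstrip("\n"))  # drop the trailing newline

print(lines)
```

The loop ends by itself at the end of the file, so no explicit check for an empty line is needed.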

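File Encoding and Garbled Chinese Text

Files containing Chinese text are usually encoded as utf-8, gbk, or gb2312.

Some files saved on Windows are stored in gbk. The chinese.txt file below is saved with gbk encoding.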

with open("chinese.txt", "wb") as f:
    f.write("这是中文的，this is Chinese".encode("gbk"))

with open("chinese.txt", "rb") as f:
    print(f.read())
    # print(f.read().decode("gbk"))  # alternative to the line above: decode the bytes to get the text back

Note that the write mode w gains a b here, giving wb: write in binary form instead of the default text form. The read mode likewise gains a b and becomes rb, read binary.

Output:

b'\xd5\xe2\xca\xc7\xd6\xd0\xce\xc4\xb5\xc4\xa3\xacthis is Chinese'

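Run it and you get raw bytes rather than readable text: in binary mode Python returns a bytes object and does not decode it.

Reading the file with plain r may fail outright. In text mode Python decodes with the platform's default encoding, and if that is not gbk (for example utf-8), a UnicodeDecodeError is raised.

# The code below raises an error when the default encoding is not gbk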
with open("chinese.txt", "r") as f:
    print(f.read())

Solution: first find out which encoding the file uses, then pass it as the encoding argument when reading. That resolves the garbled Chinese text.

with open("chinese.txt", "r", encoding="gbk") as f:
    print(f.read())

Output:

这是中文的，this is Chinese
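The effect of picking the wrong or right codec can be reproduced without any file, directly on bytes. The sketch below assumes the same sample sentence and only uses str.encode() and bytes.decode().

```python
text = "这是中文的，this is Chinese"
data = text.encode("gbk")        # the bytes a gbk-encoded file would contain

# Decoding gbk bytes as utf-8 fails for this sample.
try:
    data.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

# Decoding with the codec that was used for writing restores the text.
decoded = data.decode("gbk")
print(failed, decoded == text)
```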

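Read and Write Modes

mode	Meaning
w	Write text. Creates the file if it does not exist; if it exists, it is truncated, so the existing content is deleted.
r	Read text; raises an error if the file does not exist
a	Append text at the end of the file
wb	Write binary
rb	Read binary
ab	Append binary
w+	Read and write; creates the file, or truncates it if it exists
r+	Read and write; raises an error if the file does not exist
a+	Read and write; writes are appended at the end of the file
x	Create exclusively; raises an error if the file already exists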
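The x mode is the only one in the table that refuses to touch an existing file. A minimal sketch, assuming a hypothetical file name inside a temporary directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "only_once.txt")  # hypothetical file name

    with open(path, "x") as f:       # first call: the file is created
        f.write("first")

    try:
        with open(path, "x") as f:   # second call: the file already exists
            f.write("second")
        second_call_failed = False
    except FileExistsError:
        second_call_failed = True

    with open(path, "r") as f:
        content = f.read()

print(second_call_failed, content)
```

This makes x a reasonable choice when overwriting existing data by accident would be a problem.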

with open("new_file.txt", "r") as f:
    print(f.read())
with open("new_file.txt", "r+") as f:
    f.write("text has been replaced")
    f.seek(0)       # move the file pointer from the end of the written text back to the start before reading
    print(f.read())

Output:

some text...
text has been replaced

with open("new_file.txt", "a+") as f:
    print(f.read())
    f.write("\nadd new line")
    f.seek(0)       # move the file pointer from the end of the written text back to the start
    print(f.read())

Output:


text has been replaced
add new line

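File and Directory Management

Directory Operations

Locating the Current Directory

import os

print("Current directory:", os.getcwd())	# cwd stands for "current working directory"
print("Contents of the current directory:", os.listdir())

Output:

Current directory: /home/pyodide
Contents of the current directory: []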

Creating a Directory

os.makedirs("project", exist_ok=True)
print(os.path.exists("project")) # check that it really exists

Output:

True

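os.makedirs() creates directories recursively, including any missing parent directories.

If exist_ok is False (the default) and the target directory already exists, an OSError (more precisely, FileExistsError) is raised.

A File Management System

First, create a folder for the user: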

if os.path.exists("user/mofan"):
    print("user exist")
else:
    os.makedirs("user/mofan")
    print("user created")
print(os.listdir("user"))

Output:

user created
['mofan']

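If the user deletes their account, the user folder has to be removed. Note that os.removedirs() also removes parent directories that become empty, so here the empty user folder is removed as well; os.rmdir() removes only the leaf directory.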

if os.path.exists("user/mofan"):
    os.removedirs("user/mofan")
    print("user removed")
else:
    print("user not exist")

Output:

user removed
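The difference between os.removedirs() and os.rmdir() is easy to miss, so here is a small sketch of it. The temporary directory and the keep.txt sentinel file are assumptions made only to keep the experiment contained.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # A sentinel file keeps the temporary root non-empty, so pruning stops there.
    with open(os.path.join(tmp, "keep.txt"), "w") as f:
        f.write("keep")

    # removedirs() deletes the leaf and then every parent that became empty.
    os.makedirs(os.path.join(tmp, "user", "mofan"))
    os.removedirs(os.path.join(tmp, "user", "mofan"))
    parent_after_removedirs = os.path.exists(os.path.join(tmp, "user"))

    # rmdir() deletes only the leaf directory.
    os.makedirs(os.path.join(tmp, "user", "mofan"))
    os.rmdir(os.path.join(tmp, "user", "mofan"))
    parent_after_rmdir = os.path.exists(os.path.join(tmp, "user"))

print(parent_after_removedirs, parent_after_rmdir)
```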

If the folder contains files, it is not empty and removing it raises an error.

os.makedirs("user/mofan", exist_ok=True)
with open("user/mofan/a.txt", "w") as f:
    f.write("nothing")
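os.removedirs("user/mofan")  # raises OSError here: the directory is not empty

Deleting the files one by one is inefficient; rmtree from the shutil library deletes a directory tree recursively.

Use it with care: it removes the directory together with everything inside it.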

import shutil

shutil.rmtree("user/mofan")
print(os.listdir("user"))

Output:

[]

To rename the user folder:

os.makedirs("user/mofan", exist_ok=True)
os.rename("user/mofan", "user/mofanpy")
print(os.listdir("user"))

Output:

['mofanpy']

Checking Files and Directories

How to tell whether a path is a file or a folder:

import os
os.makedirs("user/mofan", exist_ok=True)
with open("user/mofan/a.txt", "w") as f:
    f.write("nothing")
print(os.path.isfile("user/mofan/a.txt")) # True
print(os.path.exists("user/mofan/a.txt")) # True
print(os.path.isdir("user/mofan/a.txt")) # False
print(os.path.isdir("user/mofan"))  # True

Output:

True
True
False
True

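Files are often passed in as arguments. Suppose you are given a file path and want to build the path for a copy of that file. That takes four steps:

Get the file name with os.path.basename
Get the folder name with os.path.dirname
Rename the copy
Put the path back together with os.path.join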

import os
def copy(path):
    filename = os.path.basename(path)   # file name: a.txt
    dir_name = os.path.dirname(path)    # folder name: user/mofan
    new_filename = "new_" + filename    # new file name
    return os.path.join(dir_name, new_filename) # reassemble the path
print(copy("user/mofan/a.txt"))

Output:

user/mofan/new_a.txt

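os.path.join() joins path components and accepts any number of arguments. If a later argument is an absolute path (it starts with '/', or with '\' or a drive letter on Windows), every argument before it is discarded and the result starts from that argument, that is, from the root directory of the corresponding drive.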
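A short sketch of that reset behaviour. It uses posixpath, the POSIX flavour of os.path, purely so that the result looks the same on every operating system; the path names are examples.

```python
import posixpath  # POSIX flavour of os.path, chosen for OS-independent results

# Ordinary case: the components are joined with "/".
joined = posixpath.join("user", "mofan", "a.txt")

# An absolute component in the middle discards everything before it.
reset = posixpath.join("user", "/tmp", "a.txt")

print(joined)
print(reset)
```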

Another approach: use os.path.split()

def copy(path):
    dir_name, filename = os.path.split(path)
    new_filename = "new_" + filename    # new file name
    return os.path.join(dir_name, new_filename) # reassemble the path
print(copy("user/mofan/a.txt"))

Output:

user/mofan/new_a.txt

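os.path.split() splits a path into its directory part and its file name.

Syntax: os.path.split('PATH')

Argument: PATH is the full path of a file. Given a directory plus a file name, it returns the directory and the file name; given a directory path that ends with a separator, it returns the directory and an empty file name.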
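The three cases can be checked with a short sketch. As an assumption for portability it again uses posixpath, and the paths are only examples.

```python
import posixpath  # POSIX flavour of os.path, chosen for OS-independent results

file_case = posixpath.split("user/mofan/a.txt")  # directory + file name
dir_case = posixpath.split("user/mofan")         # no trailing separator
slash_case = posixpath.split("user/mofan/")      # trailing separator

print(file_case)
print(dir_case)
print(slash_case)
```

Without a trailing separator the last component is treated as the name, whether or not it is really a directory; split() looks only at the string and never at the file system.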

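Regular Expressions

A regular expression describes a string-matching pattern. It can be used to check whether a string contains a certain substring, to replace matched substrings, or to extract substrings that meet some condition.

Matching Without Regular Expressions

When you need to find a piece of information in text, you will probably start with checks like the ones below.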

pattern1 = "file"
pattern2 = "files"
string = "the file is in the folder"
print("file in string", pattern1 in string)   
print("files in string", pattern2 in string)

Output:

file in string True
files in string False

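Once the number of cases grows, though, writing checks like these only gets you so far.

A file system that manages registrations will certainly need regular expressions, for example to check whether a user's email address is valid. A general-purpose email check is one of the most typical uses of regular expressions. Here is a simple example: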

import re
ptn = re.compile(r"\w+?@\w+?\.com")

matched = ptn.search("mofan@mofanpy.com")
print("mofan@mofanpy.com is a valid email:", matched)
matched = ptn.search("mofan@mofanpy+com")
print("mofan@mofanpy+com is a valid email:", matched)

Output:

mofan@mofanpy.com is a valid email: <re.Match object; span=(0, 17), match='mofan@mofanpy.com'>
mofan@mofanpy+com is a valid email: None
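One caveat worth illustrating: search() succeeds if the pattern occurs anywhere in the string, so it is not a strict validity check by itself. The sketch below contrasts it with fullmatch(), which requires the whole string to match. The pattern is a simplified tutorial pattern, not a complete email validator.

```python
import re

ptn = re.compile(r"\w+@\w+\.com")  # simplified pattern for illustration only

# search() finds the address inside a longer, messy string.
loose = ptn.search("see mofan@mofanpy.com!!") is not None

# fullmatch() accepts the string only if all of it matches the pattern.
strict = ptn.fullmatch("see mofan@mofanpy.com!!") is not None
exact = ptn.fullmatch("mofan@mofanpy.com") is not None

print(loose, strict, exact)
```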

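Extra Information from a Match

Besides telling you whether a pattern occurs, regular expressions can do much more. As the example above shows, when a pattern is found the result is not just True but an object carrying extra information. Take the email check again: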

import re
matched = re.search(r"\w+?@\w+?\.com", "mofan@mofanpy.com")
print("mofan@mofanpy.com:", matched)
matched = re.search(r"\w+?@\w+?\.com", "the email is mofan@mofanpy.com.")
print("the email is mofan@mofanpy.com:", matched)

Output:

mofan@mofanpy.com: <re.Match object; span=(0, 17), match='mofan@mofanpy.com'>
the email is mofan@mofanpy.com: <re.Match object; span=(13, 30), match='mofan@mofanpy.com'>

The returned information includes span=(13, 30) and match='mofan@mofanpy.com'. They tell you from which position to which position in the original string the pattern was found, and which exact string it matched.

Another example:

match = re.search(r"run", "I run to you")
print(match)
print(match.group())

Output:

<re.Match object; span=(2, 5), match='run'>
run

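Regular expressions often contain \. The r prefix marks a raw string; using it means you never have to wonder how many \ to write in the pattern. As a rule of thumb, always put an r in front of a pattern string.

Matching Several Alternatives

The run example above finds a fixed word in a string.

What if the pattern should be more flexible, so that one pattern covers several possible characters?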

print(re.search(r"ran", "I run to you"))
print(re.search(r"run", "I run to you"))

Output:

None
<re.Match object; span=(2, 5), match='run'>

A regular expression can put both cases into a single check. All it takes is a |, which means "or".

re.search(r"ran|run", "I run to you")

Output:

<re.Match object; span=(2, 5), match='run'>

In another common case the beginning and the end are fixed, but the middle may vary. For example, to find both find and found:

print(re.search(r"f(ou|i)nd", "I find you"))
print(re.search(r"f(ou|i)nd", "I found you"))

Output:

<re.Match object; span=(2, 6), match='find'>
<re.Match object; span=(2, 7), match='found'>

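Matching by Character Type

Common tokens:

Token	Meaning	Range (ASCII)
\d	Any digit	[0-9]
\D	Any non-digit
\s	Any whitespace character	[ \t\n\r\f\v]
\S	Any non-whitespace character
\w	Any upper- or lower-case letter, digit, or _	[a-zA-Z0-9_]
\W	Anything \w does not match
\b	A word boundary	e.g. er\b matches the er in never but not the er in verb
\B	A non-word boundary	e.g. er\B matches the er in verb but not the er in never
\\	A literal \
.	Any character (except \n)
?	The preceding pattern is optional
*	Repeat zero or more times
+	Repeat one or more times
{n,m}	Repeat n to m times
{n}	Repeat n times
+?	Non-greedy (minimal) version of +
*?	Non-greedy (minimal) version of *
??	Non-greedy (minimal) version of ?
^	Start of the text; with re.M, also the start of every line
$	End of the text; with re.M, also the end of every line
\A	Start of the whole text, even with re.M
\Z	End of the whole text, even with re.M

For example: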

re.search(r"\w+?@\w+?\.com", "mofan@mofanpy.com")

Output:

<re.Match object; span=(0, 17), match='mofan@mofanpy.com'>

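\w stands for any letter, digit, or underscore. +? makes \w match at least once, non-greedily: it consumes as few characters as possible, so as soon as an @ can be matched the repetition stops and matching moves on to the next part of the pattern.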

print(re.search(r"138\d{8}", "13812345678"))
print(re.search(r"138\d{8}", "138123456780000"))

Output:

<re.Match object; span=(0, 11), match='13812345678'>
<re.Match object; span=(0, 11), match='13812345678'>

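\d{8} means any digit, repeated 8 times. Note that search() also matches the first 11 digits of the longer number.

Chinese Text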
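How the anchors ^, $, \A, and \Z differ under re.M is easiest to see on a two-line string. This is a small sketch with made-up sample text:

```python
import re

text = "first line\nsecond line"  # sample text with two lines

# With re.M, ^ and $ match at the start and end of every line.
line_starts = re.findall(r"^\w+", text, flags=re.M)
line_ends = re.findall(r"\w+$", text, flags=re.M)

# \A and \Z always refer to the whole text, re.M or not.
text_start = re.findall(r"\A\w+", text, flags=re.M)
text_end = re.findall(r"\w+\Z", text, flags=re.M)

print(line_starts, line_ends)
print(text_start, text_end)
```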

print(re.search(r"不?爱", "我爱你"))
print(re.search(r"不?爱", "我不爱你"))
print(re.search(r"不.*?爱", "我不是很爱你"))

Output:

<re.Match object; span=(1, 2), match='爱'>
<re.Match object; span=(1, 3), match='不爱'>
<re.Match object; span=(1, 5), match='不是很爱'>

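Chinese characters are represented in Unicode. Once you know the Unicode code points of Chinese characters, you can match them with a character range, just as r"[a-zA-Z]" matches all English letters.

The encode() method encodes a string with the codec given by encoding. The errors argument selects the error-handling scheme.

Syntax: str.encode(encoding='utf-8', errors='strict')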

"中".encode("unicode-escape")

Output:

b'\\u4e2d'

Using the same approach as for English letters:

re.search(r"[\u4e00-\u9fa5]+", "我爱莫烦Python。")

Output:

<re.Match object; span=(0, 4), match='我爱莫烦'>

Sometimes punctuation should be matched as well. Just add the Chinese punctuation marks you need, for example [！？。，￥【】「」], to the character class.

re.search(r"[\u4e00-\u9fa5！？。，￥【】「」]+", "我爱莫烦。莫烦棒！")

Output:

<re.Match object; span=(0, 9), match='我爱莫烦。莫烦棒！'>

Search, Replace, and More

Function	Description	Example
re.search()	Scans the whole string and returns the first match	re.search(r"run", "I run to you") > 'run'
re.match()	Matches only at the very beginning of the string; even with re.M it still matches only at the start	re.match(r"run", "I run to you") > None re.match("ang","angsan5lisi") > <re.Match object; span=(0, 3), match='ang'>
re.findall()	Returns a list of all non-overlapping matches of the pattern	re.findall(r"r[ua]n", "I run to you. you ran to him") > ['run', 'ran']
re.finditer()	Like findall, but yields match objects through an iterator	for item in re.finditer(r"r[ua]n", "I run to you. you ran to him"):
re.split()	Splits a string at every match	re.split(r"r[ua]n", "I run to you. you ran to him") > ['I ', ' to you. you ', ' to him']
re.sub()	Replaces every match	re.sub(r"r[ua]n", "jump", "I run to you. you ran to him") > 'I jump to you. you jump to him'
re.subn()	Like sub, but also returns the number of replacements	re.subn(r"r[ua]n", "jump", "I run to you. you ran to him") > ('I jump to you. you jump to him', 2)

Iterator: an object that hands out its values one at a time; you consume it with a for loop (or next()), as with the result of re.finditer().

print("search:", re.search(r"run", "I run to you"))
print("match:", re.match(r"run", "I run to you"))
print("findall:", re.findall(r"r[ua]n", "I run to you. you ran to him"))

for i in re.finditer(r"r[ua]n", "I run to you. you ran to him"):	# iterator
    print("finditer:", i)

print("split:", re.split(r"r[ua]n", "I run to you. you ran to him"))
print("sub:", re.sub(r"r[ua]n", "jump", "I run to you. you ran to him"))
print("subn:", re.subn(r"r[ua]n", "jump", "I run to you. you ran to him"))

Output:

search: <re.Match object; span=(2, 5), match='run'>
match: None
findall: ['run', 'ran']
finditer: <re.Match object; span=(2, 5), match='run'>
finditer: <re.Match object; span=(18, 21), match='ran'>
split: ['I ', ' to you. you ', ' to him']
sub: I jump to you. you jump to him
subn: ('I jump to you. you jump to him', 2)

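Capturing Specific Information in a Pattern

There are many more techniques for handling more specific cases.

For example: find *.jpg image files and return only the bare file name with .jpg removed.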

found = []
for i in re.finditer(r"[\w-]+?\.jpg", "I have 2021-02-01.jpg, 2021-02-02.jpg, 2021-02-03.jpg"):
    found.append(re.sub(r"\.jpg$", "", i.group()))  # escape the dot so only a literal ".jpg" suffix is removed
print(found)

Output:

['2021-02-01', '2021-02-02', '2021-02-03']

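Where:

list.append(object) adds object to the end of the list

re.sub() searches and replaces (signature: re.sub(pattern, repl, string, count=0), where pattern is the regular expression, repl is the replacement text, string is the text to search, and count is an optional maximum number of replacements; 0 means replace all)

In a regular expression, () defines a group and group() returns the string captured by a group

The approach above works, but it is not as neat as it could be because it combines two functions, finditer and sub. Regular expressions can do this more simply.

This is where () comes in, together with group. Put () around the part of the pattern you want returned, and the content matched inside the parentheses is returned directly.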

string = "I have 2021-02-01.jpg, 2021-02-02.jpg, 2021-02-03.jpg"
print("without ():", re.findall(r"[\w-]+?\.jpg", string))
print("with ():", re.findall(r"([\w-]+?)\.jpg", string))

Output:

without (): ['2021-02-01.jpg', '2021-02-02.jpg', '2021-02-03.jpg']
with (): ['2021-02-01', '2021-02-02', '2021-02-03']

To get more detail, for example year, month, and day separately, add more parentheses and use group to fetch the string matched by each pair.

string = "I have 2021-02-01.jpg, 2021-02-02.jpg, 2021-02-03.jpg"
match = re.finditer(r"(\d+?)-(\d+?)-(\d+?)\.jpg", string)	# iterator; each pair of parentheses corresponds to a group() index
for file in match:	# match is an iterator
    print("matched string:", file.group(0), ",year:", file.group(1), ", month:", file.group(2), ", day:", file.group(3))

Output:

matched string: 2021-02-01.jpg ,year: 2021 , month: 02 , day: 01
matched string: 2021-02-02.jpg ,year: 2021 , month: 02 , day: 02
matched string: 2021-02-03.jpg ,year: 2021 , month: 02 , day: 03

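group() is the same as group(0): the result of the whole regular expression.
group(1) returns the part matched by the first pair of parentheses, group(2) the second, and group(3) the third.

Another route to the same result: findall also works. It just does not provide the full-match information that file.group(0) gives.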

string = "I have 2021-02-01.jpg, 2021-02-02.jpg, 2021-02-03.jpg"
match = re.findall(r"(\d+?)-(\d+?)-(\d+?)\.jpg", string)
for file in match:
    print("year:", file[0], ", month:", file[1], ", day:", file[2])

Output:

year: 2021 , month: 02 , day: 01
year: 2021 , month: 02 , day: 02
year: 2021 , month: 02 , day: 03

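When there are too many groups and parentheses to keep track of, you can give each group a name and fetch the matched piece with group("name").

Note that findall, used above, does not support lookup by name; use search or finditer and call group on the match. To name a group, write ?P<name> at the start of the parentheses.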

string = "I have 2021-02-01.jpg, 2021-02-02.jpg, 2021-02-03.jpg"
match = re.finditer(r"(?P<y>\d+?)-(?P<m>\d+?)-(?P<d>\d+?)\.jpg", string)  # ?P<name> added to each group
for file in match:
    print("matched string:", file.group(0), 
        ", year:", file.group("y"), 
        ", month:", file.group("m"), 
        ", day:", file.group("d"))

Output:

matched string: 2021-02-01.jpg , year: 2021 , month: 02 , day: 01
matched string: 2021-02-02.jpg , year: 2021 , month: 02 , day: 02
matched string: 2021-02-03.jpg , year: 2021 , month: 02 , day: 03
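With named groups, all captured pieces can also be collected at once through the match object's groupdict() method. A minimal sketch on a shortened version of the sample string:

```python
import re

string = "I have 2021-02-01.jpg, 2021-02-02.jpg"  # shortened sample
ptn = re.compile(r"(?P<y>\d+?)-(?P<m>\d+?)-(?P<d>\d+?)\.jpg")

# groupdict() maps every group name to the text it captured.
dates = [m.groupdict() for m in ptn.finditer(string)]
print(dates)
```

A dict per match is convenient when the pieces are passed on to other code by name rather than by position.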

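Matching with Flags

Flag	Full name	Description
re.I	re.IGNORECASE	Ignore case
re.M	re.MULTILINE	Multi-line mode; changes the behavior of '^' and '$'
re.S	re.DOTALL	Dot-matches-all mode; makes "." match any character, including newline
re.L	re.LOCALE	Makes \w \W \b \B and case-insensitive matching depend on the current locale (bytes patterns only)
re.U	re.UNICODE	Makes \w \W \b \B \s \S \d \D follow Unicode character properties (the default for str patterns in Python 3)
re.X	re.VERBOSE	Verbose mode: the pattern may span several lines, whitespace is ignored, and comments may be added

For example, ignoring case with re.I: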
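To illustrate re.X, the sketch below writes the same date pattern twice, once compactly and once in verbose form with comments. The pattern and the sample file name are examples chosen for illustration.

```python
import re

compact = re.compile(r"(\d+)-(\d+)-(\d+)\.jpg")

# In verbose mode, whitespace in the pattern is ignored and # starts a comment.
verbose = re.compile(r"""
    (\d+)   # year
    -
    (\d+)   # month
    -
    (\d+)   # day
    \.jpg   # literal file extension
""", flags=re.X)

sample = "2021-02-01.jpg"
compact_groups = compact.search(sample).groups()
verbose_groups = verbose.search(sample).groups()
print(compact_groups == verbose_groups)
```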

ptn, string = r"r[ua]n", "I Ran to you"
print("without re.I:", re.search(ptn, string))
print("with re.I:", re.search(ptn, string, flags=re.I))	# re.I makes the match ignore case

Output:

without re.I: None
with re.I: <re.Match object; span=(2, 5), match='Ran'>

For example, matching specific characters at the start of every line:

ptn = r"^ran"
string = """I 
ran to you"""	# a triple-quoted string can contain line breaks, so lines 2 and 3 are not equivalent to string = """I ran to you"""
string0 = """ran to you"""

print("without re.M:", re.search(ptn, string0))
print("without re.M:", re.search(ptn, string))	# compare with the output of the previous line

print("with re.M:", re.search(ptn, string, flags=re.M))
print("with re.M and match:", re.match(ptn, string, flags=re.M))

Output:

without re.M: <re.Match object; span=(0, 3), match='ran'>
without re.M: None
with re.M: <re.Match object; span=(3, 6), match='ran'>
with re.M and match: None

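Note: re.search() and re.match() behave differently. re.match() matches only at the very start of the string, whether or not the re.M flag is set.

Several flags can be combined. To use re.M and re.I together, write re.M|re.I: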

ptn = r"^ran"
string = """I
Ran to you"""
print("with re.M and re.I:", re.search(ptn, string, flags=re.M|re.I))

Output:

with re.M and re.I: <re.Match object; span=(2, 5), match='Ran'>

There is also a way to set these flags inside the pattern itself.

Some people prefer the style below: at the start of the pattern, state which flags to use. (?im) means re.I and re.M.

string = """I
Ran to you"""
re.search(r"(?im)^ran", string)

Output:

<re.Match object; span=(2, 5), match='Ran'>

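Running Faster

If the same regular expression is applied over and over, you usually do not pass the pattern string to re.search() each time. Instead, compile the pattern once outside the loop and reuse the compiled pattern for every search. This is more efficient: over one million searches, compiling first saves a noticeable amount of time.

import re
import time
n = 1000000
# Without compiling in advance
t0 = time.time()
for _ in range(n):
    re.search(r"ran", "I ran to you")
t1 = time.time()
print("Without precompiling:", t1-t0)

# Compile first
ptn = re.compile(r"ran")
t2 = time.time()    # start timing only after compiling and printing
for _ in range(n):
    ptn.search("I ran to you")
print("With precompiling:", time.time()-t2)

# Run times vary with hardware

Output:

Without precompiling: 1.32200026512146
With precompiling: 0.3129997253417969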
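The same comparison can be sketched with the standard timeit module, which handles the timing loop itself. The repeat count here is an arbitrary smaller value. Since the re module keeps an internal cache of recently compiled patterns, the gap measured this way mostly reflects the per-call lookup overhead rather than repeated compilation; the actual numbers depend on the machine and Python version.

```python
import re
import timeit

n = 100_000  # arbitrary repeat count, smaller than above so the sketch runs quickly
ptn = re.compile(r"ran")

# Module-level function: the pattern string is looked up on every call.
t_module = timeit.timeit(lambda: re.search(r"ran", "I ran to you"), number=n)

# Precompiled pattern object: the lookup is skipped.
t_compiled = timeit.timeit(lambda: ptn.search("I ran to you"), number=n)

print("module-level re.search:", t_module)
print("precompiled ptn.search:", t_compiled)
```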

pickle/json Serialization

While a program runs, all of its variables live in memory. For example, define a dict:

a = {'name':'Bob','age':20,'score':90}

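The dict a can be changed at any time, for example by setting name to 'Bill'. But once the program ends, the memory used by its variables is reclaimed by the operating system. If the modified 'Bill' was not saved to disk, the variable is initialized to 'Bob' again the next time the program runs.

Turning an object (variable) in memory into a form that can be stored or transmitted is called serialization. In Python it is called pickling; other languages call it serialization, marshalling, flattening, and so on, all meaning the same thing.

Once serialized, the content can be written to disk or sent over the network to another machine (disks and network transfers only accept bytes).

Conversely, reading the content of a serialized object back into memory is called deserialization, or unpickling.

In Python, both the pickle and json modules can serialize data.

Where:

json serializes and deserializes between JSON text and basic Python data types such as strings, numbers, lists, and dicts
pickle serializes and deserializes between a Python-specific byte format and almost any Python object, including Python-only types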

Pickle

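Use pickle when you want to pack up a whole class instance. Pickling class instances carries some risk, though, and sometimes fails.

Example: packing a dict with pickle

pickle.dumps(obj): serializes obj and returns the result as a bytes object, without writing to a file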

import pickle

data = {"filename": "f1.txt", "create_time": "today", "size": 111}
pickle.dumps(data)

Output:

b'\x80\x04\x958\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x08filename\x94\x8c\x06f1.txt\x94\x8c\x0bcreate_time\x94\x8c\x05today\x94\x8c\x04size\x94Kou.'

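After running this you can see that the pickled dict can no longer be read directly, because its content has been encoded.

So when packing data, consider whether anyone needs to read the packed form. If not, pickle is fine; if so, use json, whose output is something you can read.

pickle.dump() writes the dict straight into a file.

pickle.dump(obj, file): serializes the object and writes the resulting data stream to the file object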

data = {"filename": "f1.txt", "create_time": "today", "size": 111}
with open("data.pkl", "wb") as f:
    pickle.dump(data, f)

import os
os.listdir()

Output:

['data.pkl']

To read the file back, use pickle.load().

pickle.load(file): deserializes the object, parsing the data in the file into a Python object

Open the file with open, then let pickle parse it and restore the original dict.

with open("data.pkl", "rb") as f:
    data = pickle.load(f)
print(data)

Output:

{'filename': 'f1.txt', 'create_time': 'today', 'size': 111}
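The in-memory counterpart of dump()/load() is the dumps()/loads() pair. A short round-trip sketch with the same sample dict, without touching any file:

```python
import pickle

data = {"filename": "f1.txt", "create_time": "today", "size": 111}

blob = pickle.dumps(data)      # serialize to a bytes object
restored = pickle.loads(blob)  # rebuild an equal, independent dict

print(type(blob).__name__, restored == data, restored is data)
```

Only call pickle.loads() on data from a source you trust, because unpickling can run arbitrary code.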

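**Note:** When unpickling an instance of a class, such as File below, the File class must be defined (importable); otherwise unpickling fails because the class cannot be found.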

class File:
    def __init__(self, name, create_time, size):
        self.name = name
        self.create_time = create_time
        self.size = size
    
    def change_name(self, new_name):
        self.name = new_name

data = File("f2.txt", "now", 222)
# Save
with open("data.pkl", "wb") as f:
    pickle.dump(data, f)
# Load
with open("data.pkl", "rb") as f:
    read_data = pickle.load(f)
print(read_data.name)
print(read_data.size)

Output:

f2.txt
222

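In the end, what comes out of unpickling is still an instance of the class and can be used like any other instance.

Some kinds of objects cannot be serialized. These are usually objects that depend on external system state, such as open files, network connections, threads, processes, and stack frames.

If such an object is assigned to an attribute of a class instance, like self.file = open() below, pickling the instance raises an error.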

class File:
    def __init__(self, name, create_time, size):
        self.name = name
        self.create_time = create_time
        self.size = size
        self.file = open(name, "w")

data = File("f3.txt", "now", 222)
# Saving with pickle raises an error
with open("data.pkl", "wb") as f:
    pickle.dump(data, f)

Output:

Traceback (most recent call last):
  File "<input>", line 11, in <module>
TypeError: cannot pickle '_io.TextIOWrapper' object

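To handle instance attributes that cannot be pickled, a user-defined class can provide __getstate__() and __setstate__() methods to work around these limits. pickle.dump() calls __getstate__() to obtain the state to serialize, and __setstate__() is called during unpickling. Note that in the example below __setstate__() calls __init__() again, which reopens the file in "w" mode and therefore truncates it.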

class File:
    def __init__(self, name, create_time, size):
        self.name = name
        self.create_time = create_time
        self.size = size
        self.file = open(name, "w")

    def __getstate__(self):
        # the information that is needed and can be pickled
        pickled = {"name": self.name, "create_time": self.create_time, "size": self.size}
        return pickled

    def __setstate__(self, pickled_dict):
        # on unpickling, rebuild the instance
        self.__init__(
            pickled_dict["name"], pickled_dict["create_time"], pickled_dict["size"])

data = File("f3.txt", "now", 222)
# Save
with open("data.pkl", "wb") as f:
    pickle.dump(data, f)
# Load
with open("data.pkl", "rb") as f:
    read_data = pickle.load(f)
print(read_data.name)
print(read_data.size)

Output:

f3.txt
222

Json

Python's json library handles data in JSON form.

import json

data = {"filename": "f1.txt", "create_time": "today", "size": 111}
j = json.dumps(data)
print(j)
print(type(j))

Output:

{"filename": "f1.txt", "create_time": "today", "size": 111}
<class 'str'>

The result mirrors the dict: it is the dict written out as a string.

JSON data like this is used all the time in web requests.

To store data as plain text from Python, use json.dump().

data = {"filename": "f1.txt", "create_time": "today", "size": 111}
with open("data.json", "w") as f:
    json.dump(data, f)

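print("Read as plain text:")
with open("data.json", "r") as f:
    print(f.read())

print("Read by loading with json:")
with open("data.json", "r") as f:
    new_data = json.load(f)
print("Read from the dict:", new_data["filename"])

Output:

Read as plain text:
{"filename": "f1.txt", "create_time": "today", "size": 111}
Read by loading with json:
Read from the dict: f1.txt

Compared with pickle, json still has drawbacks.

As shown above, pickle packs a Python class instance with little effort, but json cannot serialize one. You can only pick out the important information, put it into a dict or list, and then pack that with json.

Example: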

class File:
    def __init__(self, name, create_time, size):
        self.name = name
        self.create_time = create_time
        self.size = size
    
    def change_name(self, new_name):
        self.name = new_name

data = File("f4.txt", "now", 222)
# Saving raises TypeError: a File object is not JSON serializable
with open("data.json", "w") as f:
    json.dump(data, f)
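One way to "pick out the important information" is to dump the instance's attribute dict and rebuild the object from it afterwards. This is a minimal sketch, assuming every attribute is itself JSON-friendly and that the attribute names match the __init__ parameters, as they do for this File class.

```python
import json

class File:
    def __init__(self, name, create_time, size):
        self.name = name
        self.create_time = create_time
        self.size = size

data = File("f4.txt", "now", 222)

# vars() returns the instance's attribute dict, which json can serialize.
text = json.dumps(vars(data))

# Rebuild an instance by passing the loaded dict back as keyword arguments.
restored = File(**json.loads(text))

print(text)
print(restored.name, restored.size)
```

Methods are not stored in the JSON text; they come from the File class definition when the instance is rebuilt.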

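Differences Between Pickle and Json

Comparison	Pickle	Json
Storage format	Python-specific bytes format	Universal JSON text format, usable in common network communication
Data types	Classes, functions, dicts, lists, tuples, etc.	Much the same as Pickle, but cannot store classes or functions
Readability after saving	Not human-readable	Human-readable
Cross-language use	Python only	Readable and writable from many languages
Processing time	Long (data must be encoded)	Short (no encoding needed)
Security	Unsafe unless you trust the data source. Pickle output may also differ slightly between Python versions, so it is awkward to use across platforms and versions	Relatively safe

Cloud services often use json; pickle is often used for saving local data.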
