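WeChat public account: 尤而小屋
Author: Peter
Editor: Peter
Hi everyone, I'm Peter~
This article shows how to use Pandas to read TXT files in a variety of situations, focusing on some of the most commonly used parameters.
It relies on regular expressions for text matching, so some regex background will help. I plan to write a dedicated article on Python regular expressions later.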
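Regex basics
The table below lists common regular-expression metacharacters and their meanings:
| Symbol | Meaning |
|---|---|
| . | Matches any character except a newline |
| * | Matches the preceding element 0 or more times |
| ? | Matches the preceding element 0 or 1 times; placed after another quantifier, it makes that quantifier non-greedy |
| ^ | Start of the string |
| $ | End of the string |
| \s | Matches any whitespace character |
| \S | Matches any non-whitespace character |
| \d | Matches a digit |
| \D | Matches a non-digit |
| \w | Matches a word character (letter, digit, or underscore) |
| \W | Matches a non-word character (anything other than a letter, digit, or underscore) |
| [abcd] | Matches any one of the characters a, b, c, d |
| [^abcd] | Matches any one character other than a, b, c, d; ^ inside brackets means negation |
| + | Matches the preceding element 1 or more times |
| {n} | Matches exactly n times |
| {n,} | Matches at least n times |
| {n,m} | Matches n to m times |
| x\|y | Matches x or y |
| () | Groups the enclosed pattern and captures what it matches |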
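As a quick check of a few entries in the table, here is a minimal sketch using Python's built-in re module. The sample strings are made up for illustration and do not come from the data files used later.

```python
import re

# \s+ : one or more whitespace characters (spaces and tabs alike)
fields = re.split(r"\s+", "18   xiaoming\tmale")

# \d+ : one or more digits
numbers = re.findall(r"\d+", "room 12, floor 3")

# [^abcd] : any single character that is not a, b, c, or d
others = re.findall(r"[^abcd]", "abxcdy")

# ^ and $ anchor the pattern; {2} requires exactly two digits
two_digits = re.search(r"^\d{2}$", "18") is not None

# x|y : either alternative
genders = re.findall(r"female|male", "male female")

print(fields)      # ['18', 'xiaoming', 'male']
print(numbers)     # ['12', '3']
print(others)      # ['x', 'y']
print(two_digits)  # True
print(genders)     # ['male', 'female']
```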
Parameters
See the official documentation for the full parameter list:
pandas.pydata.org/docs/refere…
pandas.read_table(
filepath_or_buffer,
sep=NoDefault.no_default,
delimiter=None,
header='infer',
names=NoDefault.no_default,
index_col=None,
usecols=None,
squeeze=None,
prefix=NoDefault.no_default,
mangle_dupe_cols=True,
dtype=None,
engine=None,
converters=None,
true_values=None,
false_values=None,
skipinitialspace=False,
skiprows=None,
skipfooter=0,
nrows=None,
na_values=None,
keep_default_na=True,
na_filter=True,
verbose=False,
skip_blank_lines=True,
parse_dates=False,
infer_datetime_format=False,
keep_date_col=False,
date_parser=None,
dayfirst=False,
cache_dates=True,
iterator=False,
chunksize=None,
compression='infer',
thousands=None,
decimal='.',
lineterminator=None,
quotechar='"',
quoting=0,
doublequote=True,
escapechar=None,
comment=None,
encoding=None,
encoding_errors='strict',
dialect=None,
error_bad_lines=None,
warn_bad_lines=None,
on_bad_lines=None,
delim_whitespace=False,
low_memory=True,
memory_map=False,
float_precision=None,
storage_options=None)
As the signature shows, most parameters of pandas.read_table() match those of pandas.read_csv, so the usage described below carries over to read_csv as well.
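To make the relationship concrete, the sketch below reads the same tab-separated text with both functions. The inline text and io.StringIO stand in for a real file and are assumptions made for illustration only.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample, used instead of a file on disk
tab_text = "age\tname\tsex\n18\txiaoming\tmale\n20\txiaozhou\tfemale\n"

# read_table splits on tabs by default
df_table = pd.read_table(io.StringIO(tab_text))

# read_csv needs sep="\t" to behave the same way (its default is a comma)
df_csv = pd.read_csv(io.StringIO(tab_text), sep="\t")

print(df_table.equals(df_csv))  # True
```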
Sample data
import pandas as pd
import numpy as np
Six sample files cover different scenarios:
1. Data 1:
- No header
- Fields separated by a single space
# txt_data1.txt
18 xiaoming male
20 xiaozhou female
30 sunjun male
19 zhouqiang male
2. Data 2:
- Has a header
- Fields separated by a single space
age name sex
18 xiaoming male
20 xiaozhou female
30 sunjun male
19 zhouqiang male
3. Data 3:
- Has a header
- Fields separated by multiple spaces
age name sex # header
18 xiaoming male # fields are separated by multiple spaces
20 xiaozhou female
30 sunjun male
19 zhouqiang male
4. Data 4:
- Has a header
- Fields are joined by + instead of a space
age+name+sex
18+xiaoming+male
20+xiaozhou+female
30+sunjun+male
19+zhouqiang+male
5. Data 5:
- No header
- No fixed separator
0female135guangzhou139
1male140shenzhen128
2male127xiamen145
3female129beijing150
6. Data 6:
- Contains irrelevant lines
- Contains a blank line
- No header
## Data: student records, School of Information
## Term: first semester
18 xiaoming male
20 xiaozhou female
30 sunjun male
19 zhouqiang male

## The records above are simulated data
Default read
pd.read_table("txt_data1.txt")
| | 18 xiaoming male |
|---|---|
| 0 | 20 xiaozhou female |
| 1 | 30 sunjun male |
| 2 | 19 zhouqiang male |
pd.read_table("txt_data2.txt")
| | age name sex |
|---|---|
| 0 | 18 xiaoming male |
| 1 | 20 xiaozhou female |
| 2 | 30 sunjun male |
| 3 | 19 zhouqiang male |
pd.read_table("txt_data3.txt")
| | age name sex |
|---|---|
| 0 | 18 xiaoming male |
| 1 | 20 xiaozhou female |
| 2 | 30 sunjun male |
| 3 | 19 zhouqiang male |
The default results show that pandas treats the first row as the header and produces only a single column. The reason is that read_table splits on the tab character by default, and these files contain no tabs.
Header row: header
pd.read_table("txt_data1.txt", header=None)  # no header row; columns are numbered 0, 1, 2, ...
| | 0 |
|---|---|
| 0 | 18 xiaoming male |
| 1 | 20 xiaozhou female |
| 2 | 30 sunjun male |
| 3 | 19 zhouqiang male |
pd.read_table("txt_data1.txt", header=[0])  # use the first row as the header
| | 18 xiaoming male |
|---|---|
| 0 | 20 xiaozhou female |
| 1 | 30 sunjun male |
| 2 | 19 zhouqiang male |
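The two header settings above can be compared side by side. This is a minimal sketch that assumes an inline two-line sample in place of txt_data1.txt and uses header=0, the usual spelling when there is a single header row.

```python
import io

import pandas as pd

# Hypothetical two-line sample in the same layout as txt_data1.txt
text = "18 xiaoming male\n20 xiaozhou female\n"

# header=None: every line is data, columns get integer labels
no_header = pd.read_table(io.StringIO(text), header=None)

# header=0: the first line is consumed as column names
with_header = pd.read_table(io.StringIO(text), header=0)

print(no_header.shape)            # (2, 1)
print(list(no_header.columns))    # [0]
print(with_header.shape)          # (1, 1)
print(list(with_header.columns))  # ['18 xiaoming male']
```

Both results still have a single column because no separator has been specified yet.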
Separator: sep
Use a space as the separator:
pd.read_table("txt_data1.txt",sep=" ")
| | 18 | xiaoming | male |
|---|---|---|---|
| 0 | 20 | xiaozhou | female |
| 1 | 30 | sunjun | male |
| 2 | 19 | zhouqiang | male |
\s can also serve as the separator; it matches any single whitespace character:
pd.read_table("txt_data1.txt", sep=r"\s")  # \s matches one whitespace character (space, tab, ...)
| | 18 | xiaoming | male |
|---|---|---|---|
| 0 | 20 | xiaozhou | female |
| 1 | 30 | sunjun | male |
| 2 | 19 | zhouqiang | male |
sep and header can be combined:
pd.read_table("txt_data1.txt", sep=" ", header=None)
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 18 | xiaoming | male |
| 1 | 20 | xiaozhou | female |
| 2 | 30 | sunjun | male |
| 3 | 19 | zhouqiang | male |
Use + as the separator. Because header=None is passed, the file's header row is read as the first data row:
pd.read_table("txt_data4.txt",sep="+",header=None)
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | age | name | sex |
| 1 | 18 | xiaoming | male |
| 2 | 20 | xiaozhou | female |
| 3 | 30 | sunjun | male |
| 4 | 19 | zhouqiang | male |
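If the header row should become the column names instead, header=None can simply be left out. The sketch below assumes an inline sample in place of txt_data4.txt. A single-character sep such as + is taken literally by pandas rather than as a regular expression.

```python
import io

import pandas as pd

# Hypothetical sample in the same layout as txt_data4.txt
text = "age+name+sex\n18+xiaoming+male\n20+xiaozhou+female\n"

# Without header=None the first line is inferred as the header
df_plus = pd.read_table(io.StringIO(text), sep="+")

print(list(df_plus.columns))    # ['age', 'name', 'sex']
print(df_plus["age"].tolist())  # [18, 20]
```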
Other separators
In a regular expression, + matches one or more of the preceding element:
# \s matches a whitespace character; + repeats it one or more times
pd.read_table("txt_data3.txt", sep=r"\s+")
| | age | name | sex |
|---|---|---|---|
| 0 | 18 | xiaoming | male |
| 1 | 20 | xiaozhou | female |
| 2 | 30 | sunjun | male |
| 3 | 19 | zhouqiang | male |
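Here is a small self-contained sketch of the same idea, assuming an inline sample with uneven runs of spaces in place of txt_data3.txt. The pattern is written as a raw string (r"\s+") so that Python does not warn about \s being an invalid escape sequence in an ordinary string literal.

```python
import io

import pandas as pd

# Hypothetical sample: columns separated by varying numbers of spaces
text = "age   name     sex\n18    xiaoming male\n20  xiaozhou   female\n"

# r"\s+" treats any run of whitespace as one separator
df_ws = pd.read_table(io.StringIO(text), sep=r"\s+")

print(list(df_ws.columns))  # ['age', 'name', 'sex']
print(df_ws.shape)          # (2, 3)
```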
Custom column names: names
pd.read_table("txt_data1.txt",
              sep=" ",
              names=["age", "name", "sex"]  # custom column names
              )
| | age | name | sex |
|---|---|---|---|
| 0 | 18 | xiaoming | male |
| 1 | 20 | xiaozhou | female |
| 2 | 30 | sunjun | male |
| 3 | 19 | zhouqiang | male |
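When a file already has a header row, names can be combined with header=0 to replace the existing names rather than keep them as data. This is a sketch with an inline sample; the capitalised names are arbitrary choices for illustration.

```python
import io

import pandas as pd

# Hypothetical sample with its own header row
text = "age name sex\n18 xiaoming male\n20 xiaozhou female\n"

# header=0 marks the first line as a header to discard; names supplies the replacements
df_named = pd.read_table(
    io.StringIO(text),
    sep=" ",
    header=0,
    names=["Age", "Name", "Sex"],
)

print(list(df_named.columns))  # ['Age', 'Name', 'Sex']
print(df_named.shape)          # (2, 3)
```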
Index column: index_col
Specify which column becomes the index:
pd.read_table("txt_data1.txt",
              sep=" ",
              names=["age", "name", "sex"],
              index_col=[1]  # use the name column (position 1) as the index
              )
| name | age | sex |
|---|---|---|
| xiaoming | 18 | male |
| xiaozhou | 20 | female |
| sunjun | 30 | male |
| zhouqiang | 19 | male |
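index_col also accepts a column label, which can be easier to read than a position. The sketch below assumes an inline sample in place of txt_data1.txt and then looks a row up by its index value.

```python
import io

import pandas as pd

# Hypothetical sample in the same layout as txt_data1.txt
text = "18 xiaoming male\n20 xiaozhou female\n"

df_indexed = pd.read_table(
    io.StringIO(text),
    sep=" ",
    names=["age", "name", "sex"],
    index_col="name",  # refer to the index column by label instead of position
)

print(df_indexed.index.name)                    # name
print(int(df_indexed.loc["xiaozhou", "age"]))   # 20
```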
Letters as separators
pd.read_table("txt_data5.txt",
              sep=r"\D+",  # split on runs of non-digit characters
              names=["id", "col1", "col2"])
| | id | col1 | col2 |
|---|---|---|---|
| 0 | 0 | 135 | 139 |
| 1 | 1 | 140 | 128 |
| 2 | 2 | 127 | 145 |
| 3 | 3 | 129 | 150 |
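To see why only the numbers survive, the same pattern can be applied to the raw lines with Python's re module. This sketch only illustrates how the pattern behaves; it is not a description of how pandas splits lines internally.

```python
import re

lines = ["0female135guangzhou139", "1male140shenzhen128"]

# Splitting on \D+ discards every run of letters and keeps the digits between them
rows = [re.split(r"\D+", line) for line in lines]

# For comparison: \d+|\D+ keeps both the digit runs and the letter runs
tokens = re.findall(r"\d+|\D+", lines[0])

print(rows)    # [['0', '135', '139'], ['1', '140', '128']]
print(tokens)  # ['0', 'female', '135', 'guangzhou', '139']
```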
Data types: dtype
df = pd.read_table("txt_data5.txt",
                   sep=r"\D+",
                   names=["id", "col1", "col2"])
df
| | id | col1 | col2 |
|---|---|---|---|
| 0 | 0 | 135 | 139 |
| 1 | 1 | 140 | 128 |
| 2 | 2 | 127 | 145 |
| 3 | 3 | 129 | 150 |
df.dtypes  # default types
id int64
col1 int64
col2 int64
dtype: object
df = pd.read_table("txt_data5.txt",
                   sep=r"\D+",  # split on runs of non-digit characters
                   names=["id", "col1", "col2"],
                   dtype={"id": "int32", "col1": "int32", "col2": "float64"})
df.dtypes  # specified types
id int32
col1 int32
col2 float64
dtype: object
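The dtype mapping does not have to cover every column; columns that are left out keep their inferred type. The sketch below assumes an inline sample in place of txt_data5.txt and passes engine="python" explicitly, since a regular-expression separator is handled by the Python parsing engine.

```python
import io

import pandas as pd

# Hypothetical sample in the same layout as txt_data5.txt
text = "0female135guangzhou139\n1male140shenzhen128\n"

df_typed = pd.read_table(
    io.StringIO(text),
    sep=r"\D+",
    names=["id", "col1", "col2"],
    engine="python",            # regex separators need the Python engine
    dtype={"col2": "float64"},  # only col2 is overridden
)

print(df_typed["col2"].tolist())  # [139.0, 128.0]
print(df_typed["col2"].dtype)     # float64
```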
Field conversion: converters
pd.read_table(
    "txt_data3.txt",
    sep=r"\s+",
    usecols=[0, 1, 2],
    converters={
        1: lambda x: x.upper(),  # convert to upper case
        2: lambda x: x.title()   # capitalize the first letter
    }
)
| | age | name | sex |
|---|---|---|---|
| 0 | 18 | XIAOMING | Male |
| 1 | 20 | XIAOZHOU | Female |
| 2 | 30 | SUNJUN | Male |
| 3 | 19 | ZHOUQIANG | Male |
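converters can also be keyed by column name, and any callable that takes the raw string works. The sketch below assumes an inline sample in place of txt_data3.txt and passes the built-in string methods directly instead of lambdas.

```python
import io

import pandas as pd

# Hypothetical sample in the same layout as txt_data3.txt
text = "age name sex\n18 xiaoming male\n20 xiaozhou female\n"

df_conv = pd.read_table(
    io.StringIO(text),
    sep=r"\s+",
    converters={
        "name": str.upper,  # applied to each raw string in the name column
        "sex": str.title,   # applied to each raw string in the sex column
    },
)

print(df_conv["name"].tolist())  # ['XIAOMING', 'XIAOZHOU']
print(df_conv["sex"].tolist())   # ['Male', 'Female']
```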
Skipping rows: skiprows
pd.read_table("txt_data6.txt",
              sep=r"\s+",
              names=["age", "name", "sex"],
              skiprows=[0, 1, 7]  # line numbers start at 0; skip the listed lines
              )
| | age | name | sex |
|---|---|---|---|
| 0 | 18 | xiaoming | male |
| 1 | 20 | xiaozhou | female |
| 2 | 30 | sunjun | male |
| 3 | 19 | zhouqiang | male |
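Listing line numbers works when their positions are known in advance. As an alternative that is not used in the original walkthrough, the comment parameter from the signature above can drop every line that starts with a given character. The sketch assumes an inline sample whose note lines start with #, similar to txt_data6.txt.

```python
import io

import pandas as pd

# Hypothetical sample: note lines start with '#', plus one blank line
text = (
    "## header note one\n"
    "## header note two\n"
    "18 xiaoming male\n"
    "20 xiaozhou female\n"
    "\n"
    "## trailing note\n"
)

df_clean = pd.read_table(
    io.StringIO(text),
    sep=r"\s+",
    names=["age", "name", "sex"],
    comment="#",  # ignore everything from '#' to the end of the line
)

print(df_clean.shape)            # (2, 3)
print(df_clean["age"].tolist())  # [18, 20]
```

This only helps when the unwanted lines share a marker character; skiprows remains the more general tool.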
Skipping blank lines: skip_blank_lines
pd.read_table("txt_data6.txt",
              sep=r"\s+",
              skip_blank_lines=False,  # default is True; here the blank line is kept and becomes a row of NaN
              names=["age", "name", "sex"],
              skiprows=[0, 1, 7]
              )
| | age | name | sex |
|---|---|---|---|
| 0 | 18.0 | xiaoming | male |
| 1 | 20.0 | xiaozhou | female |
| 2 | 30.0 | sunjun | male |
| 3 | 19.0 | zhouqiang | male |
| 4 | NaN | NaN | NaN |
pd.read_table("txt_data6.txt",
              sep=r"\s+",
              skip_blank_lines=True,  # the default; blank lines are dropped
              names=["age", "name", "sex"],
              skiprows=[0, 1, 7]
              )
| | age | name | sex |
|---|---|---|---|
| 0 | 18 | xiaoming | male |
| 1 | 20 | xiaozhou | female |
| 2 | 30 | sunjun | male |
| 3 | 19 | zhouqiang | male |
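The two settings can be compared on a tiny inline sample, assumed here in place of txt_data6.txt. Keeping the blank line adds a row of missing values, which is also why the age column is shown as floats in the first result above.

```python
import io

import pandas as pd

# Hypothetical sample with one blank line between two records
text = "18 xiaoming male\n\n20 xiaozhou female\n"
names = ["age", "name", "sex"]

# Default behaviour: the blank line is dropped
skipped = pd.read_table(io.StringIO(text), sep=r"\s+", names=names)

# skip_blank_lines=False: the blank line is kept as a row of NaN
kept = pd.read_table(
    io.StringIO(text), sep=r"\s+", names=names, skip_blank_lines=False
)

print(len(skipped))                   # 2
print(len(kept))                      # 3
print(int(kept["age"].isna().sum()))  # 1
```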