pandas03:索引(上)

187 阅读9分钟

排版问题

导入包

import numpy as np
import pandas as pd

索引器

表的列索引

  • 列索引是最常见的索引形式
  • 一般通过[]来实现
    • 通过[列名]可以从DataFrame中取出一列,返回值为Series
    • 取出多列,可通过列名组成的列表,返回值为DataFrame
    • 取出单列,列名不包含空格,可用.列名取出,这和[列名]等价
# 从表中取出姓名一列
df = pd.read_csv('data/learn_pandas.csv',
                 usecols = ['School','Grade','Name','Gender','Weight','Height'])
df['Name'].head()
0      Gaopeng Yang
1    Changqiang You
2           Mei Sun
3      Xiaojuan Sun
4       Gaojuan You
Name: Name, dtype: object
# 取出多列
df[['Gender','Name']].head()

# 取出单列,列名不包含空格
df.Name.head()
0      Gaopeng Yang
1    Changqiang You
2           Mei Sun
3      Xiaojuan Sun
4       Gaojuan You
Name: Name, dtype: object

序列的行索引

  • 1.以字符串为索引的Series
    • 取出单个索引的对应元素,可以使用[item]
      • Series只有单个值对应,返回这个标量值
      • 多个值对应,返回Series
    • 取出多个索引对应元素,可以使用[items的列表]
# 取出单个索引的对应元素
s = pd.Series(np.arange(1,7,1),index=['a','b','a','a','e','f'])
s['a']
a    1
a    3
a    4
dtype: int64
s['b']
2
# 取出多个索引对应元素
s[['a','f']]
a    1
a    3
a    4
f    6
dtype: int64
  • 取出某2个索引之间的元素,并且这2个索引在整个索引中唯一出现,则可以使用切片
  • 这里切片会包含2个端点
# 取出2个索引之间的元素
s['b':'e'] 
b    2
a    3
a    4
e    5
dtype: int64
  • 2.以整数为索引的Series
    • 默认会生成从0开始的整数索引,任意一组符合长度要求的整数都可以作为索引
    • 和字符串一样,如果使用[int]或者[int_list],可以取出对应索引元素的值
s = pd.Series(['a','b','c','d','e','f'],index=[1,2,3,1,4,5])
s[1]
1    a
1    d
dtype: object
s[[2,3]]
2    b
3    c
dtype: object
  • 如果使用整数切片,则会取出对应索引位置的值
  • 这里整数切片和python中一样,不包含右端点
s[1:5:2]
2    b
1    d
dtype: object

loc索引器

  • 对行对选取
  • 对于表而言,有2种索引器
    • 基于元素的loc索引
      • loc索引形式:loc[*,*]
        • 第1个*代表的选择
        • 第2个*代表的选择
      • 如果省略第2个位置写作loc[*],这里只选择
      • *的位置一共有5类合法对象:单个元素、元素列表、元素切片、布尔列表、函数
    • 基于位置的iloc索引
# 设置Name列为索引
df_demo = df.set_index('Name')
df_demo.head(10)

  • *为单个元素
df_demo.loc['Mei Sun'] # 名字重复的人
df_demo.loc['Quan Zhao'] # 名字唯一
School    Shanghai Jiao Tong University
Grade                            Junior
Gender                           Female
Height                            160.6
Weight                               53
Name: Quan Zhao, dtype: object
# 同时选择行和列
df_demo.loc['Mei Sun','School']
Name
Mei Sun    Shanghai Jiao Tong University
Mei Sun    Shanghai Jiao Tong University
Name: School, dtype: object
  • *为元素列表
# 取出列表所有元素值对应的行/列
df_demo.loc[['Mei Sun','Quan Zhao'],['School','Gender']]
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
School Gender
Name
Mei Sun Shanghai Jiao Tong University Male
Mei Sun Shanghai Jiao Tong University Female
Quan Zhao Shanghai Jiao Tong University Female
  • *为切片
  • Series使用字符串索引,如果是唯一值的起点和终点字符,可以使用切片,且包含2个端点
    • 如果不唯一,则报错
df_demo.loc['Xiaoli Qian':'Gaoqiang Qian','School':'Gender']
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
School Grade Gender
Name
Xiaoli Qian Tsinghua University Freshman Female
Qiang Chu Shanghai Jiao Tong University Freshman Female
Gaoqiang Qian Tsinghua University Junior Female

注意⚠️ 如果DataFrame使用整数索引,使用整数切片的时候和上面字符串索引要求一致,都是元素切片。 包含端点且起点、终点不允许有重复值。

df_demo1 = df_demo.copy()
df_demo1.index = range(df_demo.shape[0],0,-1)
df_demo1.loc[5:3]
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
School Grade Gender Height Weight
5 Fudan University Junior Female 153.9 46.0
4 Tsinghua University Senior Female 160.9 50.0
3 Shanghai Jiao Tong University Senior Female 153.9 45.0
df_demo1.loc[3:5] # 没有返回值,说明不是整数位置切片
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
School Grade Gender Height Weight
  • *为布尔列表 根据条件筛选,传入loc布尔列表和dataframe长度相同,且列表为True会被选中
# 选出体重超过70kg的学生
df_demo.loc[df_demo.Weight>70].head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
School Grade Gender Height Weight
Name
Mei Sun Shanghai Jiao Tong University Senior Male 188.9 89.0
Gaojuan You Fudan University Sophomore Male 174.0 74.0
Xiaopeng Zhou Shanghai Jiao Tong University Freshman Male 174.1 74.0
Xiaofeng Sun Tsinghua University Senior Male 170.3 71.0
Qiang Zheng Shanghai Jiao Tong University Senior Male 183.9 87.0
# 通过isin返回布尔列表。isin是series中的一个函数,用于查看某列中是否包含某个字符串
# 选出大一和大四同学信息
df_demo.loc[df_demo.Grade.isin(['Freshman','Senior'])].head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
School Grade Gender Height Weight
Name
Gaopeng Yang Shanghai Jiao Tong University Freshman Female 158.9 46.0
Changqiang You Peking University Freshman Male 166.5 70.0
Mei Sun Shanghai Jiao Tong University Senior Male 188.9 89.0
Xiaoli Qian Tsinghua University Freshman Female 158.0 51.0
Qiang Chu Shanghai Jiao Tong University Freshman Female 162.5 52.0

复合条件

  • |(或)
  • &(且)
  • ~(取反)
# 选出复旦大学体重超过70kg 大四学生, 或者 北大男生超过80kg 非大四学生
c1 = df_demo.School == 'Fudan University'
c2 = df_demo.Weight > 70
c3 = df_demo.Grade == 'Senior' 
fudan = c1 & c2 & c3

c4 = df_demo.School == 'Peking University'
c5 = df_demo.Weight > 80
c6 = df_demo.Grade == 'Senior' 
qinghua = c4 & c5 & (~c6)

df_demo.loc[fudan | qinghua]
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
School Grade Gender Height Weight
Name
Qiang Han Peking University Freshman Male 185.3 87.0
Chengpeng Zhou Fudan University Senior Male 177.1 81.0
Changpeng Zhao Peking University Freshman Male 181.3 83.0
Chengpeng Qian Fudan University Senior Male 177.2 73.0
Name
Gaopeng Yang      False
Changqiang You    False
Mei Sun           False
Xiaojuan Sun      False
Gaojuan You       False
                  ...  
Xiaojuan Sun      False
Li Zhao           False
Chengqiang Chu    False
Chengmei Shen     False
Chunpeng Lv       False
Length: 200, dtype: bool
  • *为函数 这里的函数,必须以前面的四种合法形式之一为返回值,并且函数的输入值为 DataFrame 本身。
# 把上述逻辑写入函数中返回
def condition(x):
    c1 = df_demo.School == 'Fudan University'
    c2 = df_demo.Weight > 70
    c3 = df_demo.Grade == 'Senior' 
    fudan = c1 & c2 & c3

    c4 = df_demo.School == 'Peking University'
    c5 = df_demo.Weight > 80
    c6 = df_demo.Grade == 'Senior' 
    qinghua = c4 & c5 & (~c6)
    
    result = fudan | qinghua
    return result
df_demo.loc[condition]
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
School Grade Gender Height Weight
Name
Qiang Han Peking University Freshman Male 185.3 87.0
Chengpeng Zhou Fudan University Senior Male 177.1 81.0
Changpeng Zhao Peking University Freshman Male 181.3 83.0
Chengpeng Qian Fudan University Senior Male 177.2 73.0
# 使用lambda表达式实现
df_demo.loc[lambda x:'Qiang Han',lambda x:'Gender']
Name
Qiang Han      Male
Qiang Han    Female
Name: Gender, dtype: object
  • 函数无法返回如start: end: step的切片形式
  • 返回切片时要用slice对象进行包装
df_demo.loc[lambda x:slice('Gaojuan You','Gaoqiang Qian')]
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
School Grade Gender Height Weight
Name
Gaojuan You Fudan University Sophomore Male 174.0 74.0
Xiaoli Qian Tsinghua University Freshman Female 158.0 51.0
Qiang Chu Shanghai Jiao Tong University Freshman Female 162.5 52.0
Gaoqiang Qian Tsinghua University Junior Female 161.9 50.0

不要使用链式赋值

在对表或者序列赋值时,应当在使用一层索引器后直接进行赋值操作,这样做是由于进行多次索引后赋值是赋在临时返回的 copy 副本上的,而没有真正修改元素从而报出 SettingWithCopyWarning 警告。

# 举例
df_chain = pd.DataFrame([[0,0],[1,0],[-1,0]],columns=list('AB'))
df_chain
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
A B
0 0 0
1 1 0
2 -1 0
import warnings
with warnings.catch_warnings():
    warnings.filterwarnings('error')
    try:
        df_chain[df_chain.A!=0].B = 1
    except Warning as w:
        Warning_Msg = w
print(Warning_Msg)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_chain
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
A B
0 0 0
1 1 0
2 -1 0
df_chain.loc[df_chain.A!=0,'B'] = 1
df_chain
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
A B
0 0 0
1 1 1
2 -1 1

iloc索引器

  • iloc使用与loc完全相似,不过是针对位置进行筛选
  • 在相应的*位置出一种有5类合法对象
    • 整数
    • 整数列表
    • 整数切片
    • 布尔列表
    • 函数,返回值必须是前面4个合法对象中的一个,输入为dataframe本身
df_demo.iloc[1,1] # 第2行 第2列
'Freshman'
df_demo.iloc[[0,1],[0,1]] # 前2行,前2列
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
School Grade
Name
Gaopeng Yang Shanghai Jiao Tong University Freshman
Changqiang You Peking University Freshman
df_demo.iloc[1:4,2:4] # 切片,左闭右开
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Gender Height
Name
Changqiang You Male 166.5
Mei Sun Male 188.9
Xiaojuan Sun Female NaN
df_demo.iloc[lambda x:slice(1,4)] 
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
School Grade Gender Height Weight
Name
Changqiang You Peking University Freshman Male 166.5 70.0
Mei Sun Shanghai Jiao Tong University Senior Male 188.9 89.0
Xiaojuan Sun Fudan University Sophomore Female NaN 41.0
  • 使用布尔不能传入Series,而必须传入序列的values
  • 使用布尔筛选优先考虑loc方式
# 选体重超过80kg学生
df_demo.iloc[(df_demo.Weight>80).values].head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
School Grade Gender Height Weight
Name
Mei Sun Shanghai Jiao Tong University Senior Male 188.9 89.0
Qiang Zheng Shanghai Jiao Tong University Senior Male 183.9 87.0
Qiang Han Peking University Freshman Male 185.3 87.0
Chengpeng Zhou Fudan University Senior Male 177.1 81.0
Feng Han Shanghai Jiao Tong University Sophomore Male 183.4 82.0
# 返回相应位置的值/子序列
df_demo.School.iloc[1]
'Peking University'
df_demo.School.iloc[1:5:2] # 2为步长
Name
Changqiang You    Peking University
Xiaojuan Sun       Fudan University
Name: School, dtype: object

query方法

  • 支持以字符串的形式查询表达式传入query方法,表达式执行结果必须返回布尔列表
  • 进行复杂索引时,无须像普通方法重复使用Dataframe作为列名
# 改写loc复合查询
df_demo.query('((School == "Fudan University")&'
         ' (Grade == "Senior")&'
         ' (Weight > 70) |'
         ' (School == "Peking University")&'
         ' (Grade != "Senior")&'
         ' (Weight > 80))')
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
School Grade Gender Height Weight
Name
Qiang Han Peking University Freshman Male 185.3 87.0
Chengpeng Zhou Fudan University Senior Male 177.1 81.0
Changpeng Zhao Peking University Freshman Male 181.3 83.0
Chengpeng Qian Fudan University Senior Male 177.2 73.0
# query中,帮助用户注册了所有来自dataframe的列名
# 所有属于Series的方法都可以被调用
df_demo.query('Weight > Weight.mean()').head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
School Grade Gender Height Weight
Name
Changqiang You Peking University Freshman Male 166.5 70.0
Mei Sun Shanghai Jiao Tong University Senior Male 188.9 89.0
Gaojuan You Fudan University Sophomore Male 174.0 74.0
Xiaopeng Zhou Shanghai Jiao Tong University Freshman Male 174.1 74.0
Xiaomei Zhou Tsinghua University Senior Female 165.3 57.0
  • 对于含有空格的列名,需要使用 col name 的方式进行引用
  • query其他常用
    • query提供可读性语法,orandis innot in
    • 等价替换
      • ==is in
      • !=not in
    • 需要引用外部变量,在变量名前加@
# 选出男生中不是大一、大二的学生
df_demo.query('(Grade not in ["Freshman","Sophomore"]) and'
        '(Gender == "Male")').head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
School Grade Gender Height Weight
Name
Mei Sun Shanghai Jiao Tong University Senior Male 188.9 89.0
Xiaoqiang Qin Tsinghua University Junior Male 170.1 68.0
Peng Wang Tsinghua University Junior Male 162.8 65.0
Xiaofeng Sun Tsinghua University Senior Male 170.3 71.0
Xiaopeng Shen Shanghai Jiao Tong University Senior Male 166.0 62.0
# 查询所有大三和大四学生
df_demo.query('Grade == ["Junior","Senior"]').head()
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
School Grade Gender Height Weight
Name
Mei Sun Shanghai Jiao Tong University Senior Male 188.9 89.0
Gaoqiang Qian Tsinghua University Junior Female 161.9 50.0
Juan Xu Peking University Junior Female 164.8 NaN
Xiaoquan Lv Tsinghua University Junior Female 153.2 43.0
Peng You Shanghai Jiao Tong University Senior Female NaN 48.0
# 引用外部变量 @
low, high = 70, 80
df_demo.query('Weight.between(@low,@high)').head()
School Grade Gender Height Weight
Name
Changqiang You Peking University Freshman Male 166.5 70.0
Gaojuan You Fudan University Sophomore Male 174.0 74.0
Xiaopeng Zhou Shanghai Jiao Tong University Freshman Male 174.1 74.0
Xiaofeng Sun Tsinghua University Senior Male 170.3 71.0
Gaoli Zhao Peking University Freshman Male 175.4 78.0

随机抽样

  • sample,对样本/特征进行随机抽样,参数如下:
    • n,抽样数量
    • axis,抽样的方向(0为行,1为列)
    • frac,抽样比例(0.3为从总体中抽出30%的样本)
    • replace,是否放回,True,有放回抽样
    • weights,每个样本的抽样相对概率
# 有放回抽样,抽样3次
df = pd.DataFrame({'id':list('abcde'),
                   'value':[1,2,3,4,9]})
df
id value
0 a 1
1 b 2
2 c 3
3 d 4
4 e 9
# e为90出现概率非常高。。。
df.sample(5, replace = True, weights = df.value)
id value
2 c 3
4 e 9
3 d 4
4 e 9
3 d 4

参考