12 Efficient NumPy and Pandas Functions to Keep You Covered

6 Efficient NumPy Functions

import numpy as np

argpartition()

argpartition() finds the indices of the N largest values in a NumPy array and returns those indices. The selected values can then be sorted as needed.

# Random array
x = np.array([12, 10, 12, 0, 6, 8, 9, 1, 16, 4, 6, 0])
help(np.argpartition)
Help on function argpartition in module numpy:

argpartition(a, kth, axis=-1, kind='introselect', order=None)
    Perform an indirect partition along the given axis using the
    algorithm specified by the `kind` keyword. It returns an array of
    indices of the same shape as `a` that index data along the given
    axis in partitioned order.
    
    .. versionadded:: 1.8.0
    
    Parameters
    ----------
    a : array_like
        Array to sort.
    kth : int or sequence of ints
        Element index to partition by. The k-th element will be in its
        final sorted position and all smaller elements will be moved
        before it and all larger elements behind it. The order all
        elements in the partitions is undefined. If provided with a
        sequence of k-th it will partition all of them into their sorted
        position at once.
    axis : int or None, optional
        Axis along which to sort. The default is -1 (the last axis). If
        None, the flattened array is used.
    kind : {'introselect'}, optional
        Selection algorithm. Default is 'introselect'
    order : str or list of str, optional
        When `a` is an array with fields defined, this argument
        specifies which fields to compare first, second, etc. A single
        field can be specified as a string, and not all fields need be
        specified, but unspecified fields will still be used, in the
        order in which they come up in the dtype, to break ties.
    
    Returns
    -------
    index_array : ndarray, int
        Array of indices that partition `a` along the specified axis.
        If `a` is one-dimensional, ``a[index_array]`` yields a partitioned `a`.
        More generally, ``np.take_along_axis(a, index_array, axis=a)`` always
        yields the partitioned `a`, irrespective of dimensionality.
    
    See Also
    --------
    partition : Describes partition algorithms used.
    ndarray.partition : Inplace partition.
    argsort : Full indirect sort.
    take_along_axis : Apply ``index_array`` from argpartition 
                      to an array as if by calling partition.
    
    Notes
    -----
    See `partition` for notes on the different selection algorithms.
    
    Examples
    --------
    One dimensional array:
    
    >>> x = np.array([3, 4, 2, 1])
    >>> x[np.argpartition(x, 3)]
    array([2, 1, 3, 4])
    >>> x[np.argpartition(x, (1, 3))]
    array([1, 2, 3, 4])
    
    >>> x = [3, 4, 2, 1]
    >>> np.array(x)[np.argpartition(x, 3)]
    array([2, 1, 3, 4])
    
    Multi-dimensional array:
    
    >>> x = np.array([[3, 4, 2], [1, 3, 1]])
    >>> index_array = np.argpartition(x, kth=1, axis=-1)
    >>> np.take_along_axis(x, index_array, axis=-1)  # same as np.partition(x, kth=1)
    array([[2, 3, 4],
           [1, 1, 3]])
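# kth=5 only guarantees that the last 6 positions hold the 6 largest values, in no particular order;
# kth=-5 is the choice that guarantees the last 5 positions hold the 5 largest values
index_val = np.argpartition(x, 5)[-5:]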
index_val
array([6, 8, 1, 2, 0], dtype=int64)
np.sort(x[index_val])

array([ 9, 10, 12, 12, 16])
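
The top-N lookup described above can be sketched more defensively by passing a negative kth, which places exactly the N largest values in the last N positions. The names n, top_idx and top_values are illustrative and not part of the original notebook.

```python
import numpy as np

x = np.array([12, 10, 12, 0, 6, 8, 9, 1, 16, 4, 6, 0])
n = 5  # how many of the largest values to keep (illustrative name)

# kth=-n moves the n largest values into the last n positions, in no particular order
top_idx = np.argpartition(x, -n)[-n:]

# Order only those n indices by value, largest first
top_idx_sorted = top_idx[np.argsort(x[top_idx])[::-1]]
top_values = x[top_idx_sorted]
print(top_values)
```

Only the N selected entries are fully sorted here, which is the usual reason to prefer argpartition over a full argsort on large arrays.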
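
allclose()

allclose() compares two arrays and returns a single Boolean. It returns False if the two arrays are not element-wise equal within a tolerance. This makes it useful for checking whether two arrays are approximately equal.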

array1 = np.array([0.12, 0.17, 0.24, 0.29])
array2 = np.array([0.13, 0.19, 0.26, 0.31])
# With a relative tolerance (rtol) of 0.1, returns False
np.allclose(array1, array2, rtol=0.1)
False
# With a relative tolerance (rtol) of 0.2, returns True
np.allclose(array1, array2, rtol=0.2)
True
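
The third positional argument of allclose() is a relative tolerance, not an absolute one. NumPy documents the test as |a - b| <= atol + rtol * |b|. The sketch below reproduces that check by hand and also shows an absolute-only comparison. The 0.05 threshold is an arbitrary value chosen for illustration.

```python
import numpy as np

array1 = np.array([0.12, 0.17, 0.24, 0.29])
array2 = np.array([0.13, 0.19, 0.26, 0.31])

rtol, atol = 0.1, 1e-8  # atol=1e-8 is NumPy's default

# Element-wise test documented for allclose: |a - b| <= atol + rtol * |b|
manual = bool(np.all(np.abs(array1 - array2) <= atol + rtol * np.abs(array2)))
builtin = bool(np.allclose(array1, array2, rtol=rtol, atol=atol))
print(manual, builtin)

# Absolute tolerance only: switch the relative part off (0.05 is an illustrative value)
abs_only = bool(np.allclose(array1, array2, rtol=0, atol=0.05))
print(abs_only)
```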

clip()

clip() keeps the values of an array within an interval. Values outside the interval are clipped to the nearest interval edge.

x = np.array([3, 17, 14, 23, 2, 2, 6, 8, 1, 2, 16, 0])
np.clip(x, 2, 5)
array([3, 5, 5, 5, 2, 2, 5, 5, 2, 2, 5, 2])
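
As a rough illustration of what clipping does, the sketch below rebuilds the same result with np.maximum and np.minimum, and shows a one-sided clip by passing None for the upper bound. The variable names are illustrative.

```python
import numpy as np

x = np.array([3, 17, 14, 23, 2, 2, 6, 8, 1, 2, 16, 0])

clipped = np.clip(x, 2, 5)

# Same result built by hand: raise to the lower bound, then cap at the upper bound
by_hand = np.minimum(np.maximum(x, 2), 5)

# One-sided clip: only enforce the lower bound
lower_only = np.clip(x, 2, None)
print(clipped, by_hand, lower_only)
```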

extract()

extract() pulls out the elements of an array that satisfy a given condition.

array = np.random.randint(20, size=12)
array
array([11,  4, 17, 16,  6, 16, 12,  4,  6, 11,  7, 14])
# Check which elements have remainder 1 modulo 2 (the odd ones)
cond = np.mod(array, 2) == 1
cond
array([ True, False,  True, False, False, False, False, False, False,
        True,  True, False])
# Select the elements whose remainder modulo 2 is 1
np.extract(cond, array)
array([11, 17, 11,  7])
# Several conditions can be combined as well
np.extract(((array < 3) | (array > 15)), array)
array([17, 16, 16])
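
For a one-dimensional array, np.extract(cond, arr) gives the same result as Boolean-mask indexing arr[cond]. The sketch below checks this on the fixed values printed above instead of a fresh random draw. The name values is illustrative.

```python
import numpy as np

# Fixed copy of the random array shown above, so the result is reproducible
values = np.array([11, 4, 17, 16, 6, 16, 12, 4, 6, 11, 7, 14])
cond = np.mod(values, 2) == 1

via_extract = np.extract(cond, values)
via_mask = values[cond]  # Boolean-mask indexing
print(via_extract, via_mask)
```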

where()

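where() returns the indices of the array elements that satisfy a condition, similar to a WHERE clause in SQL. When called with three arguments, it instead picks each output element from one of two alternatives depending on the condition.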

y = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
# Return the indices of the elements greater than 5
np.where(y > 5)
(array([2, 3, 5, 7, 8], dtype=int64),)
# Replace elements greater than 5 with the letter "a" and all others with the letter "b"
np.where(y > 5, "a", "b")
array(['b', 'b', 'a', 'a', 'b', 'a', 'b', 'a', 'a'], dtype='<U1')
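
Since the one-argument form returns indices, a second indexing step is needed to get the matching values. The three-argument form also accepts arrays as its alternatives, not only scalars. A small sketch, with illustrative variable names:

```python
import numpy as np

y = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])

# Indices first, then the values at those indices
idx = np.where(y > 5)
selected = y[idx]

# Keep values greater than 5, set every other element to 0
replaced = np.where(y > 5, y, 0)
print(selected, replaced)
```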

percentile()

percentile() computes the n-th percentile of the array elements along a specified axis.

a = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
print("50th percentile along axis=0: ",
      np.percentile(a, 50, axis=0))
50th percentile along axis=0:  6.0
b = np.array([[10, 7, 4], [3, 2, 1]])
print("30th percentile along axis=0: ",
      np.percentile(b, 30, axis=0))
30th percentile along axis=0:  [5.1 3.5 1.9]
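
The 5.1 in the first column comes from linear interpolation between the sorted values, which is NumPy's default method. The sketch below redoes that arithmetic by hand for the column [10, 3]. The helper names pos, lo and hi are illustrative.

```python
import numpy as np

col = np.sort(np.array([10, 3]))  # first column of b, sorted: [3, 10]
q = 30

# Fractional position of the q-th percentile among the sorted values
pos = q / 100 * (len(col) - 1)
lo, hi = int(np.floor(pos)), int(np.ceil(pos))

# Linear interpolation between the two neighbouring values
manual = col[lo] + (pos - lo) * (col[hi] - col[lo])
print(manual, np.percentile(col, q))
```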

6 Efficient Pandas Functions

import pandas as pd

map()

map() maps the values of a Series according to an input correspondence. It replaces each value in the Series with another value, which can come from a function, a dict, or another Series.

df = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['India', 'USA', 'China', 'Russia'])
df
               b         d         e
India   1.607278 -0.917900 -0.802763
USA     0.106430  0.360755  0.480584
China   0.036781  1.210999  1.006594
Russia -0.202205 -0.689697  0.436188
changefn = lambda x: '%.2f' % x
df['d'].map(changefn)
India     -0.92
USA        0.36
China      1.21
Russia    -0.69
Name: d, dtype: object
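
The dict form mentioned above can be sketched as follows. The countries Series and the capitals dict are made-up illustration data, not part of the original notebook. Values without a matching key become NaN.

```python
import pandas as pd

# Illustrative data: 'Russia' is deliberately missing from the dict
countries = pd.Series(['India', 'USA', 'China', 'Russia'])
capitals = {'India': 'New Delhi', 'USA': 'Washington, D.C.', 'China': 'Beijing'}

# Each value is looked up in the dict; keys that are not found map to NaN
mapped = countries.map(capitals)
print(mapped)
```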

apply()

apply() lets the user pass a function and applies it along an axis of the DataFrame: to each column by default, or to each row with axis=1.

fn = lambda x: x.max() - x.min()
df.apply(fn)
b    1.809483
d    2.128899
e    1.809358
dtype: float64
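
To make the axis behaviour concrete, the sketch below runs the same range function per column and per row. It uses a small hypothetical DataFrame with fixed values so that the results are reproducible.

```python
import pandas as pd

# Hypothetical fixed data instead of random numbers
df_fixed = pd.DataFrame(
    {'b': [1.0, 4.0], 'd': [3.0, -2.0], 'e': [0.5, 2.0]},
    index=['India', 'USA'],
)
fn = lambda x: x.max() - x.min()

per_column = df_fixed.apply(fn)       # axis=0 (default): one result per column
per_row = df_fixed.apply(fn, axis=1)  # axis=1: one result per row
print(per_column, per_row, sep='\n')
```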

isin()

isin() is used to filter a DataFrame. It helps select the rows in which a given column contains one of several specific values.

df = pd.DataFrame(np.random.randn(4, 4), columns=['A', 'B', 'C', 'D'])
df
          A         B         C         D
0 -0.055425 -0.005992  0.715223  1.061672
1 -0.248495 -0.336913  0.179430  2.288203
2  0.451122  0.936068  0.838075  1.089782
3  0.864347  1.231097 -0.076225 -0.319944
df['E'] = ['a', 'a', 'b', 'c']
df
          A         B         C         D  E
0 -0.055425 -0.005992  0.715223  1.061672  a
1 -0.248495 -0.336913  0.179430  2.288203  a
2  0.451122  0.936068  0.838075  1.089782  b
3  0.864347  1.231097 -0.076225 -0.319944  c
df.E.isin(['a', 'c'])
0     True
1     True
2    False
3     True
Name: E, dtype: bool
df.isin(['a', 'c'])
       A      B      C      D      E
0  False  False  False  False   True
1  False  False  False  False   True
2  False  False  False  False  False
3  False  False  False  False   True
df.loc[df.E.isin(['a', 'c'])]
          A         B         C         D  E
0 -0.055425 -0.005992  0.715223  1.061672  a
1 -0.248495 -0.336913  0.179430  2.288203  a
3  0.864347  1.231097 -0.076225 -0.319944  c
df['F'] = [0, 1, 0, 2]
# Filter on several columns at once
df.loc[df.E.isin(['a', 'c']) & df.F.isin([0])]
          A         B         C         D  E  F
0 -0.055425 -0.005992  0.715223  1.061672  a  0
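
Two related patterns are sketched below: negating the mask with ~ to exclude values, and passing a dict to DataFrame.isin() to test each column against its own list. The small frame reuses only the E and F columns from above. The names labels, not_ac and both are illustrative.

```python
import pandas as pd

labels = pd.DataFrame({'E': ['a', 'a', 'b', 'c'], 'F': [0, 1, 0, 2]})

# Rows whose E is NOT 'a' or 'c'
not_ac = labels.loc[~labels['E'].isin(['a', 'c'])]

# Per-column membership test, then keep rows where every column matched
both = labels[labels.isin({'E': ['a'], 'F': [0]}).all(axis=1)]
print(not_ac, both, sep='\n')
```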

copy()

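copy() creates a copy of a pandas object. Assigning a DataFrame or Series to a second variable does not duplicate it: both names refer to the same object, so a change made through one name is visible through the other. Use copy() to prevent this.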

data = pd.Series(['India', 'Pakistan', 'China', 'Mongolia'])
# Assigning data to data1 only creates a second reference; both names still point to the same Series
data1 = data
# Modify a value through data1
data1[0] = 'USA'
data
0         USA
1    Pakistan
2       China
3    Mongolia
dtype: object
# Make an independent copy
new = data.copy()
new[1] = 'Changed value'
print(new)
print(data)
0              USA
1    Changed value
2            China
3         Mongolia
dtype: object
0         USA
1    Pakistan
2       China
3    Mongolia
dtype: object
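
The difference between an alias and a copy can also be checked directly with Python's is operator. This is a minimal sketch. The names alias and clone are illustrative.

```python
import pandas as pd

data = pd.Series(['India', 'Pakistan', 'China', 'Mongolia'])

alias = data         # second name for the same object
clone = data.copy()  # independent object with its own data

print(alias is data, clone is data)

# Changing the clone leaves the original untouched
clone[0] = 'USA'
print(data[0], clone[0])
```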

select_dtypes()

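select_dtypes() returns the subset of a DataFrame's columns selected by dtype. Its include parameter keeps all columns of the given data types, and its exclude parameter drops all columns of the given data types.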

framex = df.select_dtypes(include="float64")
framex
          A         B         C         D
0 -0.055425 -0.005992  0.715223  1.061672
1 -0.248495 -0.336913  0.179430  2.288203
2  0.451122  0.936068  0.838075  1.089782
3  0.864347  1.231097 -0.076225 -0.319944
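
The exclude parameter and the generic 'number' selector work the same way. The sketch below uses a small hypothetical frame with one float, one string and one integer column.

```python
import pandas as pd

# Hypothetical frame with mixed column types
mixed = pd.DataFrame({'A': [0.1, 0.2], 'E': ['a', 'b'], 'F': [0, 1]})

numeric = mixed.select_dtypes(include='number')     # float and integer columns
non_float = mixed.select_dtypes(exclude='float64')  # everything except float64
print(list(numeric.columns), list(non_float.columns))
```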

pivot_table()

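pivot_table() builds a pivot table. Its main parameters are:

  • values: the column or columns to aggregate; by default all numeric columns are aggregated.
  • index: column names or other grouping keys used to group along the rows of the resulting pivot table.
  • columns: column names or other grouping keys used to group along the columns of the resulting pivot table.
  • aggfunc: the aggregation function or list of functions (default 'mean').
  • fill_value: the value used to replace missing values in the resulting table.
  • dropna: whether to drop columns whose entries are all NaN.
  • margins: whether to add row/column subtotals and grand totals (default False).
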
school = pd.DataFrame({
    'A': ['Jay', 'Usher', 'Nicky', 'Romero', 'Will'], 
    'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'], 
    'C': [26, 22, 20, 23, 24]})
school
        A         B   C
0     Jay   Masters  26
1   Usher  Graduate  22
2   Nicky  Graduate  20
3  Romero   Masters  23
4    Will  Graduate  24
table = pd.pivot_table(school, values='A', index=['B', 'C'],
                       columns=['B'], aggfunc=np.sum, fill_value="Not Available")

table
B                 Graduate        Masters
B        C
Graduate 20          Nicky  Not Available
         22          Usher  Not Available
         24           Will  Not Available
Masters  23  Not Available         Romero
         26  Not Available            Jay
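
A more typical use aggregates a numeric column. The sketch below averages column C for each value of B and uses margins=True to add an overall row. It reuses the school data from above. The name mean_c is illustrative.

```python
import pandas as pd

school = pd.DataFrame({
    'A': ['Jay', 'Usher', 'Nicky', 'Romero', 'Will'],
    'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
    'C': [26, 22, 20, 23, 24]})

# Mean of C for each value of B, plus an 'All' row with the overall mean
mean_c = pd.pivot_table(school, values='C', index='B',
                        aggfunc='mean', margins=True)
print(mean_c)
```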