6 Efficient NumPy Functions
import numpy as np
argpartition()
argpartition() finds the indices of the N largest values in a NumPy array and returns them. The corresponding values can then be sorted as needed.
# Random array
x = np.array([12, 10, 12, 0, 6, 8, 9, 1, 16, 4, 6, 0])
help(np.argpartition)
Help on function argpartition in module numpy:
argpartition(a, kth, axis=-1, kind='introselect', order=None)
Perform an indirect partition along the given axis using the
algorithm specified by the `kind` keyword. It returns an array of
indices of the same shape as `a` that index data along the given
axis in partitioned order.
.. versionadded:: 1.8.0
Parameters
----------
a : array_like
Array to sort.
kth : int or sequence of ints
Element index to partition by. The k-th element will be in its
final sorted position and all smaller elements will be moved
before it and all larger elements behind it. The order all
elements in the partitions is undefined. If provided with a
sequence of k-th it will partition all of them into their sorted
position at once.
axis : int or None, optional
Axis along which to sort. The default is -1 (the last axis). If
None, the flattened array is used.
kind : {'introselect'}, optional
Selection algorithm. Default is 'introselect'
order : str or list of str, optional
When `a` is an array with fields defined, this argument
specifies which fields to compare first, second, etc. A single
field can be specified as a string, and not all fields need be
specified, but unspecified fields will still be used, in the
order in which they come up in the dtype, to break ties.
Returns
-------
index_array : ndarray, int
Array of indices that partition `a` along the specified axis.
If `a` is one-dimensional, ``a[index_array]`` yields a partitioned `a`.
More generally, ``np.take_along_axis(a, index_array, axis=axis)`` always
yields the partitioned `a`, irrespective of dimensionality.
See Also
--------
partition : Describes partition algorithms used.
ndarray.partition : Inplace partition.
argsort : Full indirect sort.
take_along_axis : Apply ``index_array`` from argpartition
to an array as if by calling partition.
Notes
-----
See `partition` for notes on the different selection algorithms.
Examples
--------
One dimensional array:
>>> x = np.array([3, 4, 2, 1])
>>> x[np.argpartition(x, 3)]
array([2, 1, 3, 4])
>>> x[np.argpartition(x, (1, 3))]
array([1, 2, 3, 4])
>>> x = [3, 4, 2, 1]
>>> np.array(x)[np.argpartition(x, 3)]
array([2, 1, 3, 4])
Multi-dimensional array:
>>> x = np.array([[3, 4, 2], [1, 3, 1]])
>>> index_array = np.argpartition(x, kth=1, axis=-1)
>>> np.take_along_axis(x, index_array, axis=-1) # same as np.partition(x, kth=1)
array([[2, 3, 4],
[1, 1, 3]])
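# kth=-5 guarantees that the last 5 positions hold the indices of the 5 largest values (their relative order is undefined)
index_val = np.argpartition(x, -5)[-5:]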
index_val
array([6, 8, 1, 2, 0], dtype=int64)
np.sort(x[index_val])
array([ 9, 10, 12, 12, 16])
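As a minimal sketch of the top-N pattern described above, the following selects the N largest values and orders them from largest to smallest. The variable names and the choice `n = 5` are illustrative assumptions, not part of the original example.

```python
import numpy as np

x = np.array([12, 10, 12, 0, 6, 8, 9, 1, 16, 4, 6, 0])
n = 5  # hypothetical choice of N

# kth=-n puts the indices of the n largest values in the last n slots.
# Their order within those slots is undefined.
top_idx = np.argpartition(x, -n)[-n:]

# Order only those n indices by value, largest first.
top_idx_sorted = top_idx[np.argsort(x[top_idx])[::-1]]
top_values = x[top_idx_sorted]

print(top_values)
```

Only the N selected elements are fully sorted here, which is the reason to prefer argpartition() over a full argsort() on large arrays.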
allclose()
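allclose() compares two arrays and returns a boolean. It returns False if any pair of elements differs by more than the given tolerance, which makes it useful for checking whether two arrays are approximately equal.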
array1 = np.array([0.12, 0.17, 0.24, 0.29])
array2 = np.array([0.13, 0.19, 0.26, 0.31])
# With a relative tolerance (rtol) of 0.1, returns False
np.allclose(array1, array2, rtol=0.1)
False
# With a relative tolerance (rtol) of 0.2, returns True
np.allclose(array1, array2, rtol=0.2)
True
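The tolerance passed above is relative, not absolute. The sketch below reproduces the element-wise test that NumPy documents for allclose(), `|a - b| <= atol + rtol * |b|`, to show which pair of elements fails at `rtol=0.1`. The names `within`, `manual`, and `abs_only` are illustrative.

```python
import numpy as np

array1 = np.array([0.12, 0.17, 0.24, 0.29])
array2 = np.array([0.13, 0.19, 0.26, 0.31])

rtol, atol = 0.1, 1e-8  # 1e-8 is NumPy's documented default for atol

# Element-wise version of the check that np.allclose reduces with all().
within = np.abs(array1 - array2) <= atol + rtol * np.abs(array2)
manual = bool(within.all())
print(within, manual)

# A purely absolute tolerance: every difference here is about 0.02 or less.
abs_only = np.allclose(array1, array2, rtol=0, atol=0.05)
print(abs_only)
```

Because the relative term scales with `array2`, the same `rtol` allows larger absolute differences for larger values.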
clip()
clip() keeps the values of an array within an interval: values outside the interval are clipped to the nearest boundary.
x = np.array([3, 17, 14, 23, 2, 2, 6, 8, 1, 2, 16, 0])
np.clip(x, 2, 5)
array([3, 5, 5, 5, 2, 2, 5, 5, 2, 2, 5, 2])
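As a small illustration, either bound can be left open by passing None, and the two-sided clip is equivalent to combining an element-wise maximum and minimum. The variable names are illustrative.

```python
import numpy as np

x = np.array([3, 17, 14, 23, 2, 2, 6, 8, 1, 2, 16, 0])

# Clip from below only: values under 2 become 2.
lower_only = np.clip(x, 2, None)

# Clip from above only: values over 5 become 5.
upper_only = np.clip(x, None, 5)

# The two-sided clip written out with maximum/minimum.
both = np.minimum(np.maximum(x, 2), 5)

print(lower_only)
print(upper_only)
print(both)
```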
extract()
extract() returns the elements of an array that satisfy a given condition.
array = np.random.randint(20, size=12)
array
array([11, 4, 17, 16, 6, 16, 12, 4, 6, 11, 7, 14])
# Check whether each element is odd (remainder 1 when divided by 2)
cond = np.mod(array, 2) == 1
cond
array([ True, False, True, False, False, False, False, False, False,
True, True, False])
# Select the odd elements from the array
np.extract(cond, array)
array([11, 17, 11, 7])
# Multiple conditions can be combined
np.extract(((array < 3) | (array > 15)), array)
array([17, 16, 16])
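For a one-dimensional array, the same selection can be written with boolean indexing. The sketch below uses the array printed above as fixed data (named `arr` here, an illustrative choice) so that the result is reproducible.

```python
import numpy as np

# Fixed copy of the random array shown above.
arr = np.array([11, 4, 17, 16, 6, 16, 12, 4, 6, 11, 7, 14])
cond = arr % 2 == 1

via_extract = np.extract(cond, arr)
via_mask = arr[cond]  # boolean-mask indexing gives the same elements

print(via_extract)
print(via_mask)
```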
where()
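where() returns the indices of the elements that satisfy a condition or, given two more arguments, chooses between two values element by element. It is similar to WHERE in SQL.
y = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
# Return the indices of the elements greater than 5
np.where(y > 5)
(array([2, 3, 5, 7, 8], dtype=int64),)
# Replace elements greater than 5 with "a" and all others with "b"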
np.where(y > 5, "a", "b")
array(['b', 'b', 'a', 'a', 'b', 'a', 'b', 'a', 'a'], dtype='<U1')
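A short sketch of the two call forms side by side: the one-argument form yields indices, which can be used to fetch the matching values, and the three-argument form can take an array as one of its choices. The names `idx`, `values`, and `capped` are illustrative.

```python
import numpy as np

y = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])

# One-argument form: a tuple of index arrays, one per dimension.
idx = np.where(y > 5)[0]
values = y[idx]  # the elements themselves

# Three-argument form: take 5 where the condition holds, else keep y.
capped = np.where(y > 5, 5, y)

print(idx, values)
print(capped)
```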
percentile()
percentile() computes the q-th percentile of the array elements along the specified axis.
a = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
print("50th percentile along axis=0:",
np.percentile(a, 50, axis=0))
50th percentile along axis=0: 6.0
b = np.array([[10, 7, 4], [3, 2, 1]])
print("30th percentile along axis=0:",
np.percentile(b, 30, axis=0))
30th percentile along axis=0: [5.1 3.5 1.9]
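The fractional results come from linear interpolation between neighbouring sorted values, which is NumPy's default method. The helper `percentile_linear` below is a hypothetical re-implementation for one-dimensional input, written only to make the arithmetic visible.

```python
import numpy as np

def percentile_linear(values, q):
    """Hypothetical 1-D helper: q-th percentile by linear interpolation."""
    s = np.sort(np.asarray(values, dtype=float))
    pos = (len(s) - 1) * q / 100.0  # fractional position in the sorted data
    lo, hi = int(np.floor(pos)), int(np.ceil(pos))
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

a = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
b = np.array([[10, 7, 4], [3, 2, 1]])

print(percentile_linear(a, 50))        # median of a
print(percentile_linear(b[:, 0], 30))  # first column of b: 3 + 0.3 * (10 - 3)

# axis=1 computes one percentile per row instead of per column.
row_medians = np.percentile(b, 50, axis=1)
print(row_medians)
```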
6 Efficient Pandas Functions
import pandas as pd
map()
map() maps the values of a Series according to an input correspondence. It replaces each value in a Series with another value, which may come from a function, a dict, or another Series.
df = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['India', 'USA', 'China', 'Russia'])
df
| | b | d | e |
|---|---|---|---|
| India | 1.607278 | -0.917900 | -0.802763 |
| USA | 0.106430 | 0.360755 | 0.480584 |
| China | 0.036781 | 1.210999 | 1.006594 |
| Russia | -0.202205 | -0.689697 | 0.436188 |
changefn = lambda x: '%.2f' % x
df['d'].map(changefn)
India -0.92
USA 0.36
China 1.21
Russia -0.69
Name: d, dtype: object
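The example above maps with a function. A minimal sketch of the dict form follows; the country codes in `codes` are illustrative data, not from the original. Values with no key in the dict become NaN.

```python
import pandas as pd

s = pd.Series(['India', 'USA', 'China', 'Russia'])

# Hypothetical lookup table; 'Russia' is deliberately missing.
codes = {'India': 'IN', 'USA': 'US', 'China': 'CN'}

mapped = s.map(codes)   # dict lookup per value, NaN where no key matches
lengths = s.map(len)    # any callable taking one value also works

print(mapped)
print(lengths)
```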
apply()
apply() lets the user pass a function and applies it along an axis of the DataFrame: to each column by default, or to each row with axis=1.
fn = lambda x: x.max() - x.min()
df.apply(fn)
b 1.809483
d 2.128899
e 1.809358
dtype: float64
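To make the axis behaviour concrete, here is a small sketch on fixed data (the values are illustrative, since the DataFrame above is random). The same function receives a whole column or a whole row depending on `axis`.

```python
import pandas as pd

df_small = pd.DataFrame(
    {'b': [1.0, 4.0], 'd': [3.0, 0.0], 'e': [2.0, 2.0]},
    index=['India', 'USA'],
)

spread = lambda x: x.max() - x.min()

by_column = df_small.apply(spread)       # one result per column
by_row = df_small.apply(spread, axis=1)  # one result per row

print(by_column)
print(by_row)
```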
isin()
isin() is used to filter DataFrames. It helps select the rows whose value in a given column matches one of several specified values.
df=pd.DataFrame(np.random.randn(4,4),columns=['A','B','C','D'])
df
| | A | B | C | D |
|---|---|---|---|---|
| 0 | -0.055425 | -0.005992 | 0.715223 | 1.061672 |
| 1 | -0.248495 | -0.336913 | 0.179430 | 2.288203 |
| 2 | 0.451122 | 0.936068 | 0.838075 | 1.089782 |
| 3 | 0.864347 | 1.231097 | -0.076225 | -0.319944 |
df['E'] = ['a', 'a', 'b', 'c']
df
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | -0.055425 | -0.005992 | 0.715223 | 1.061672 | a |
| 1 | -0.248495 | -0.336913 | 0.179430 | 2.288203 | a |
| 2 | 0.451122 | 0.936068 | 0.838075 | 1.089782 | b |
| 3 | 0.864347 | 1.231097 | -0.076225 | -0.319944 | c |
df.E.isin(['a', 'c'])
0 True
1 True
2 False
3 True
Name: E, dtype: bool
df.isin(['a', 'c'])
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | False | False | False | False | True |
| 1 | False | False | False | False | True |
| 2 | False | False | False | False | False |
| 3 | False | False | False | False | True |
df.loc[df.E.isin(['a', 'c'])]
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | -0.055425 | -0.005992 | 0.715223 | 1.061672 | a |
| 1 | -0.248495 | -0.336913 | 0.179430 | 2.288203 | a |
| 3 | 0.864347 | 1.231097 | -0.076225 | -0.319944 | c |
df['F'] = [0, 1, 0, 2]
# Filter on several columns at once
df.loc[df.E.isin(['a', 'c']) & df.F.isin([0,])]
| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | -0.055425 | -0.005992 | 0.715223 | 1.061672 | a | 0 |
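Two variations that often come up, sketched on a reduced frame with only the E and F columns (the name `df_ef` is illustrative): negating the mask with `~` to exclude values, and passing a dict to DataFrame.isin() to test each column against its own list.

```python
import pandas as pd

df_ef = pd.DataFrame({'E': ['a', 'a', 'b', 'c'], 'F': [0, 1, 0, 2]})

# Rows whose E value is NOT 'a' or 'c'.
not_ac = df_ef.loc[~df_ef.E.isin(['a', 'c'])]

# Per-column membership test: keys are column names.
per_col = df_ef.isin({'E': ['a'], 'F': [0]})

print(not_ac)
print(per_col)
```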
copy()
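copy() creates a copy of a Pandas object. Assigning a DataFrame or Series to another variable does not copy it: both names refer to the same object, so a change made through one is visible through the other. Use copy() to prevent this.
data = pd.Series(['India', 'Pakistan', 'China', 'Mongolia'])
# Assigning data to data1 does not copy anything; both names refer to the same Series
data1= data
# Modify a value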
data1[0]='USA'
data
0 USA
1 Pakistan
2 China
3 Mongolia
dtype: object
# Make a copy
new = data.copy()
new[1]='Changed value'
print(new)
print(data)
0 USA
1 Changed value
2 China
3 Mongolia
dtype: object
0 USA
1 Pakistan
2 China
3 Mongolia
dtype: object
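The difference between an alias and a copy can be checked directly with an identity test. This is a minimal sketch; the names `alias` and `snapshot` are illustrative.

```python
import pandas as pd

data = pd.Series(['India', 'Pakistan', 'China', 'Mongolia'])

alias = data            # second name for the same object
snapshot = data.copy()  # independent copy of the current values

data.iloc[0] = 'USA'    # modify the original

print(alias is data)     # the alias is the very same object
print(alias.iloc[0])     # so it sees the change
print(snapshot.iloc[0])  # the copy does not
```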
select_dtypes()
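select_dtypes() returns the subset of a DataFrame's columns that match the given dtypes. Its parameters can either include all columns of certain data types or exclude all columns of certain data types.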
framex = df.select_dtypes(include="float64")
framex
| | A | B | C | D |
|---|---|---|---|---|
| 0 | -0.055425 | -0.005992 | 0.715223 | 1.061672 |
| 1 | -0.248495 | -0.336913 | 0.179430 | 2.288203 |
| 2 | 0.451122 | 0.936068 | 0.838075 | 1.089782 |
| 3 | 0.864347 | 1.231097 | -0.076225 | -0.319944 |
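A brief sketch of the `exclude` parameter and of the generic `'number'` selector, on a small frame with one float, one string, and one integer column. The frame `mixed` and its values are illustrative.

```python
import pandas as pd

mixed = pd.DataFrame({'A': [0.5, 1.5], 'E': ['a', 'b'], 'F': [0, 1]})

# 'number' matches every numeric dtype, integer and float alike.
numeric = mixed.select_dtypes(include='number')

# exclude drops the matching columns and keeps the rest.
non_float = mixed.select_dtypes(exclude='float64')

print(numeric.columns.tolist())
print(non_float.columns.tolist())
```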
pivot_table()
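pivot_table() builds a pivot table. It takes the following parameters:
- values: the name(s) of the column(s) to aggregate; by default all numeric columns are aggregated.
- index: column names or other group keys to group by on the rows of the resulting pivot table.
- columns: column names or other group keys to group by on the columns of the resulting pivot table.
- aggfunc: an aggregation function or a list of functions (the default is 'mean').
- fill_value: the value used to replace missing values in the resulting table.
- dropna: whether to drop columns whose entries are all missing.
- margins: whether to add row/column subtotals and grand totals (the default is False).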
school = pd.DataFrame({
'A': ['Jay', 'Usher', 'Nicky', 'Romero', 'Will'],
'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
'C': [26, 22, 20, 23, 24]})
school
| | A | B | C |
|---|---|---|---|
| 0 | Jay | Masters | 26 |
| 1 | Usher | Graduate | 22 |
| 2 | Nicky | Graduate | 20 |
| 3 | Romero | Masters | 23 |
| 4 | Will | Graduate | 24 |
table = pd.pivot_table(school, values='A', index=['B', 'C'],
columns=['B'], aggfunc=np.sum, fill_value="Not Available")
table
| B | C | Graduate | Masters |
|---|---|---|---|
| Graduate | 20 | Nicky | Not Available |
| | 22 | Usher | Not Available |
| | 24 | Will | Not Available |
| Masters | 23 | Not Available | Romero |
| | 26 | Not Available | Jay |
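The example above aggregates a string column. A more typical use aggregates a numeric column, so here is a minimal sketch that averages the numeric column C per group in B and uses margins to add an overall row. The names `mean_c` and `with_total` are illustrative.

```python
import pandas as pd

school = pd.DataFrame({
    'A': ['Jay', 'Usher', 'Nicky', 'Romero', 'Will'],
    'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
    'C': [26, 22, 20, 23, 24]})

# Mean of C for each value of B.
mean_c = pd.pivot_table(school, values='C', index='B', aggfunc='mean')

# margins=True appends an 'All' row computed over every record.
with_total = pd.pivot_table(school, values='C', index='B',
                            aggfunc='mean', margins=True)

print(mean_c)
print(with_total)
```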