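When working with large arrays, a Python-level "for" loop can slow code down considerably. Suppose we have an array "a" with 100000000 rows and 2 columns: the first column is an index and the second is the associated value. We need the median of the second-column values for each index. A "for" loop can do this, but it is very slow because every iteration scans the whole array again.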
import numpy as np
b = np.zeros(1000000)
a = np.array([[1, 2],
[1, 3],
[2, 3],
[2, 4],
[2, 6],
[1, 4],
...
...
[1000000,6]])
for i in range(1, 1000001):  # indices run from 1 to 1000000
    b[i - 1] = np.median(a[a[:, 0] == i, 1])  # rescans all of a on every iteration
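To get a feel for the cost, here is a minimal, runnable sketch of the same loop on a much smaller synthetic array. The sizes, the random data, and the helper name loop_median are assumptions made for illustration only; they are not from the original problem.

```python
import time
import numpy as np

# Synthetic stand-in for "a": column 0 is the index, column 1 the value.
rng = np.random.default_rng(0)
n_groups, n_rows = 500, 50_000  # assumed sizes, far smaller than the real data
a = np.column_stack((rng.integers(1, n_groups + 1, size=n_rows),
                     rng.integers(0, 100, size=n_rows)))

def loop_median(a, n_groups):
    """Median of column 1 for each index 1..n_groups, one full scan per index."""
    out = np.full(n_groups, np.nan)
    for i in range(1, n_groups + 1):
        vals = a[a[:, 0] == i, 1]  # boolean mask over the whole array
        if vals.size:
            out[i - 1] = np.median(vals)
    return out

start = time.perf_counter()
b = loop_median(a, n_groups)
elapsed = time.perf_counter() - start
print(f"{n_groups} indices over {n_rows} rows took {elapsed:.3f} s")
```

The work grows with (number of indices) x (number of rows), which is why the loop becomes impractical at the sizes described above.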
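2. Solutions
Method 1: Using the Pandas library
Pandas is a powerful data-processing library with many convenient functions for tabular data. We can use its groupby() function to perform the task above.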
import numpy as np
import pandas as pd
a = np.array([[1.0, 2.0],
[1.0, 3.0],
[2.0, 5.0],
[2.0, 6.0],
[2.0, 8.0],
[1.0, 4.0],
[1.0, 1.0],
[1.0, 3.5],
[5.0, 8.0],
[2.0, 1.0],
[5.0, 9.0]])
# Create the pandas DataFrame.
df = pd.DataFrame(a, columns=['index', 'value'])
# Form the groups.
grouped = df.groupby('index')
# `result` is the DataFrame containing the aggregated results.
result = grouped.median()
print(result)
Output (the exact index formatting may vary with the pandas version):
       value
index
1        3.0
2        5.5
5        8.5
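If the rest of the code works with NumPy arrays, the grouped result can be converted back. The following is a minimal sketch under that assumption; the helper name group_median_pandas is hypothetical.

```python
import numpy as np
import pandas as pd

def group_median_pandas(a):
    """Return (unique index values, per-index medians) as NumPy arrays."""
    df = pd.DataFrame(a, columns=['index', 'value'])
    # groupby sorts the group keys by default, so the output is ordered by index.
    med = df.groupby('index')['value'].median()
    return med.index.to_numpy(), med.to_numpy()

a = np.array([[1.0, 2.0], [1.0, 3.0], [2.0, 5.0], [2.0, 6.0],
              [2.0, 8.0], [1.0, 4.0], [1.0, 1.0], [1.0, 3.5],
              [5.0, 8.0], [2.0, 1.0], [5.0, 9.0]])
keys, medians = group_median_pandas(a)
print(keys)
print(medians)
```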
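Method 2: Using the NumPy library
NumPy is a powerful scientific-computing library with many functions for working with arrays. We can sort the rows by index with argsort() and then use where() to find the positions where the index changes, so that each group becomes one contiguous slice.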
# First sort the whole array by index (there are probably other ways):
sorter = np.argsort(a[:, 0])  # sort by index
a = a[sorter]  # sorted version of a
# Now find the positions where the index changes:
w = np.where(a[:-1, 0] != a[1:, 0])[0] + 1
# For simplicity, prepend 0 and append len(a) to get full slices.
w = np.concatenate(([0], w, [len(a)]))
# Use a float result so that medians such as 5.5 are not truncated.
result = np.zeros(len(w) - 1, dtype=float)
for i in range(len(w) - 1):
    # Median of the value column (column 1) within each group.
    result[i] = np.median(a[w[i]:w[i + 1], 1])
# If the indices are not exactly 1, 2, ..., N, keep the index of each group:
classes = a[w[:-1], 0]
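The remaining loop over groups can also be avoided. The sketch below is a variation on the sort-based idea, not the method given above: it sorts by index and then by value with np.lexsort, so the median of each group can be read directly from its middle position(s). The helper name group_median is hypothetical.

```python
import numpy as np

def group_median(keys, values):
    """Per-key medians without a Python-level loop (illustrative helper)."""
    keys = np.asarray(keys)
    values = np.asarray(values, dtype=float)
    # lexsort uses the LAST key as the primary one: sort by key, then by value.
    order = np.lexsort((values, keys))
    k, v = keys[order], values[order]
    # Start offset and size of each run of equal keys.
    uniq, start, counts = np.unique(k, return_index=True, return_counts=True)
    # Lower and upper middle positions; they coincide for odd-sized groups.
    lo = start + (counts - 1) // 2
    hi = start + counts // 2
    return uniq, (v[lo] + v[hi]) / 2.0

a = np.array([[1.0, 2.0], [1.0, 3.0], [2.0, 5.0], [2.0, 6.0],
              [2.0, 8.0], [1.0, 4.0], [1.0, 1.0], [1.0, 3.5],
              [5.0, 8.0], [2.0, 1.0], [5.0, 9.0]])
uniq, med = group_median(a[:, 0], a[:, 1])
print(uniq)
print(med)
```

Averaging the two middle elements matches what np.median does for even-sized groups.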
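Method 3: Using the bincount() function
We can also use NumPy's bincount() function, which counts how many rows each index has, to locate the middle element of each sorted group.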
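idx = a[:, 0].astype(int)  # bincount() needs non-negative integers
num_in_ind = np.bincount(idx)
# Note: for even-sized groups this picks the upper middle element instead of averaging the two middle ones.
results = [np.sort(a[idx == ii, 1])[num_in_ind[ii] // 2] for ii in np.unique(idx)]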
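As an illustration of what bincount() returns and how picking position count // 2 compares with np.median, here is a small sketch using the sample data from Method 1. The variable names picked and exact are introduced only for this example.

```python
import numpy as np

a = np.array([[1.0, 2.0], [1.0, 3.0], [2.0, 5.0], [2.0, 6.0],
              [2.0, 8.0], [1.0, 4.0], [1.0, 1.0], [1.0, 3.5],
              [5.0, 8.0], [2.0, 1.0], [5.0, 9.0]])

idx = a[:, 0].astype(int)
counts = np.bincount(idx)  # counts[k] = number of rows whose index equals k
print(counts)

picked, exact = [], []
for ii in np.unique(idx):
    group = np.sort(a[idx == ii, 1])
    picked.append(float(group[counts[ii] // 2]))  # middle (or upper middle) element
    exact.append(float(np.median(group)))         # averages the two middle elements
print(picked)
print(exact)
```

The two lists agree for odd-sized groups and differ for even-sized ones, so the positional pick is only an approximation of the median in that case.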
Method 4: Using the Pandas melt() function
import numpy as np
import pandas as pd
a = np.array([[1, 2],
[1, 3],
[2, 3],
[2, 4],
[2, 6],
[1, 4],
...
...
[1000000,6]])
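# Convert the array to a DataFrame. The value column is named 'val' here
# because the value_name passed to melt() must not match an existing column.
df = pd.DataFrame(a, columns=['index', 'val'])
# Melt the DataFrame. The data is already in long format, so this only adds a 'variable' column.
df = df.melt(id_vars='index', value_name='value')
# Group the DataFrame by 'index' and calculate the median of 'value'
result = df.groupby('index')['value'].median()
# Print the result
print(result)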
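To show what melt() does to a two-column frame, here is a small sketch on a few rows of the sample data. The column name 'val' is a hypothetical choice that avoids a clash with value_name. Since the frame is already in long format, grouping with or without melt() gives the same medians.

```python
import pandas as pd

df = pd.DataFrame({'index': [1, 1, 2, 2, 2, 1],
                   'val': [2, 3, 3, 4, 6, 4]})

# melt() keeps 'index' as the identifier and stacks the other columns
# into a 'variable' column (the old column name) and a 'value' column.
melted = df.melt(id_vars='index', value_name='value')
print(melted.columns.tolist())

with_melt = melted.groupby('index')['value'].median()
without_melt = df.groupby('index')['val'].median()
print(with_melt.tolist())
print(without_melt.tolist())
```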