How to Avoid Using "for" Loops to Process Arrays


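When processing large arrays, a Python-level "for" loop can greatly reduce execution speed. Suppose we have an array "a" with 100000000 rows and 2 columns: the first column is an index and the second column is the associated value. We need the median of the second-column values for each index. A "for" loop can do this, but it is very slow, because every iteration scans the entire array to find the rows that belong to one index.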

import numpy as np
b = np.zeros(1000000)
a = np.array([[1, 2],
              [1, 3],
              [2, 3],
              [2, 4],
              [2, 6],
              [1, 4],
              ...
              ...
              [1000000,6]])
for i in range(1000000):
    # Indices in column 0 start at 1, so slot i holds the median for index i + 1.
    b[i] = np.median(a[a[:, 0] == i + 1, 1])


2. Solutions

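Method 1: Using the Pandas library

Pandas is a powerful data-processing library that provides many convenient functions for working with tabular data. Its groupby() function can perform the task above.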

import numpy as np
import pandas as pd

a = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [2.0, 5.0],
              [2.0, 6.0],
              [2.0, 8.0],
              [1.0, 4.0],
              [1.0, 1.0],
              [1.0, 3.5],
              [5.0, 8.0],
              [2.0, 1.0],
              [5.0, 9.0]])

# Create the pandas DataFrame.
df = pd.DataFrame(a, columns=['index', 'value'])

# Form the groups.
grouped = df.groupby('index')

# `result` is the DataFrame containing the aggregated results.
result = grouped.median()
print(result)

Output:

       value
index       
1        3.0
2        5.5
5        8.5
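
The original loop stored its results in a dense array "b" with one slot per index, while groupby() returns one row per index that actually occurs. A minimal sketch of how the grouped medians could be mapped back to that dense layout, assuming the indices are positive integers starting at 1 and that indices with no rows should become NaN (the helper name medians_to_dense is hypothetical):

```python
import numpy as np
import pandas as pd

def medians_to_dense(a, n_indices):
    """Return an array of length n_indices; slot i - 1 holds the median for index i."""
    # Cast the index column to int so the group labels are integers.
    df = pd.DataFrame({'index': a[:, 0].astype(int), 'value': a[:, 1]})
    medians = df.groupby('index')['value'].median()
    # reindex() aligns on the labels 1..n_indices and fills absent labels with NaN.
    return medians.reindex(np.arange(1, n_indices + 1)).to_numpy()

a = np.array([[1.0, 2.0], [1.0, 3.0], [2.0, 5.0], [2.0, 6.0], [2.0, 8.0],
              [1.0, 4.0], [1.0, 1.0], [1.0, 3.5], [5.0, 8.0], [2.0, 1.0],
              [5.0, 9.0]])
b = medians_to_dense(a, 5)
print(b)
```

Indices 3 and 4 do not occur in this sample, so their slots come back as NaN instead of being silently dropped.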

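Method 2: Using the NumPy library

NumPy is a powerful scientific computing library that provides many functions for working with arrays. We can use its argsort() function to sort the rows by index and then locate the group boundaries; unique() offers another way to find them.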

# First sort the whole array by index (there are probably other ways):
sorter = np.argsort(a[:, 0])  # sort by class
a = a[sorter]  # sorted version of a

# Now find the positions where the class changes:
w = np.where(a[:-1, 0] != a[1:, 0])[0] + 1
# For simplicity, add 0 and len(a) so that every group is a full slice.
w = np.concatenate(([0], w, [len(a)]))
result = np.zeros(len(w) - 1, dtype=float)  # medians can be fractional
for i in range(len(w) - 1):
    # Median of the value column within group i.
    result[i] = np.median(a[w[i]:w[i + 1], 1])

# If the classes are not exactly 1, 2, ..., N we can recover the class labels:
classes = a[w[:-1], 0]
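
The code above still loops once per group. The remaining loop could be removed by sorting on both columns and reading the middle element(s) of each group directly. This is a sketch of one possible approach, not the method described above; the function name group_medians is hypothetical, and it assumes a two-column numeric array:

```python
import numpy as np

def group_medians(a):
    """Median of column 1 for each distinct value in column 0, without a Python loop."""
    keys, vals = a[:, 0], a[:, 1]
    # Sort rows by key, then by value within each key
    # (the last array passed to lexsort is the primary sort key).
    order = np.lexsort((vals, keys))
    keys, vals = keys[order], vals[order]
    # First row and size of each group in the sorted arrays.
    classes, starts, counts = np.unique(keys, return_index=True, return_counts=True)
    # Positions of the two middle elements; they coincide for odd-sized groups.
    lo = starts + (counts - 1) // 2
    hi = starts + counts // 2
    return classes, (vals[lo] + vals[hi]) / 2.0

a = np.array([[1.0, 2.0], [1.0, 3.0], [2.0, 5.0], [2.0, 6.0], [2.0, 8.0],
              [1.0, 4.0], [1.0, 1.0], [1.0, 3.5], [5.0, 8.0], [2.0, 1.0],
              [5.0, 9.0]])
classes, medians = group_medians(a)
print(classes, medians)
```

Averaging the two middle elements matches the definition np.median() uses for even-sized groups, so the results should agree with Method 1 on the same data.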

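Method 3: Using the bincount() function

NumPy's bincount() function counts how many rows each index has, which tells us where the middle element of each sorted group is.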

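idx = a[:, 0].astype(int)  # bincount() needs non-negative integers
num_in_ind = np.bincount(idx)  # number of rows per index
# Note: for an even-sized group this picks the upper middle value, not the average.
results = [np.sort(a[idx == ii, 1])[num_in_ind[ii] // 2] for ii in np.unique(idx)]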

Method 4: Using the Pandas melt() function

import numpy as np
import pandas as pd

a = np.array([[1, 2],
              [1, 3],
              [2, 3],
              [2, 4],
              [2, 6],
              [1, 4],
              ...
              ...
              [1000000,6]])

# Convert the array to a DataFrame
df = pd.DataFrame(a, columns=['index', 'value'])

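# Melt the DataFrame. The data is already in long format, so this step only adds
# a 'variable' column; recent pandas versions reject a value_name that matches
# an existing column, so a different name ('val') is used here.
df = df.melt(id_vars='index', value_name='val')

# Group the DataFrame by 'index' and calculate the median of 'val'
result = df.groupby('index')['val'].median()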

# Print the result
print(result)
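
To check that a grouped approach gives the same answers as the original loop, and to get a rough feel for the speed difference, a small comparison on synthetic data can help. The sketch below uses assumed sizes (20000 rows, 200 indices) that are far smaller than the array described at the start, along with randomly generated values; timings depend on the machine and the data, so none are quoted here.

```python
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_idx = 20000, 200  # assumed sizes, chosen only to keep the check quick
a = np.column_stack((rng.integers(1, n_idx + 1, n_rows),   # index column
                     rng.integers(0, 100, n_rows)))        # value column

# Baseline: one full scan of the array per distinct index.
ids = np.unique(a[:, 0])
t0 = time.perf_counter()
loop_result = np.array([np.median(a[a[:, 0] == i, 1]) for i in ids])
t1 = time.perf_counter()

# Grouped version: pandas forms all groups in one pass.
df = pd.DataFrame(a, columns=['index', 'value'])
grouped_result = df.groupby('index')['value'].median()
t2 = time.perf_counter()

print(f"loop: {t1 - t0:.4f} s, groupby: {t2 - t1:.4f} s")
print("same medians:", np.allclose(loop_result, grouped_result.to_numpy()))
```

The loop here iterates only over indices that actually occur, which avoids taking the median of an empty selection.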