I used Python to compute the set difference of two ID files with over a hundred million lines each, and it absolutely flies!


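I used Python 3.11.10 to compute the set difference of two ID files with over a hundred million lines each (I could not install Roaringbitmap on newer Python versions).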

1. Implementation notes

Assume each file contains one number per line.

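We will use Roaringbitmap and build one bitmap per file. Each bit of a bitmap represents one element, so each bitmap covers 200 million bit positions. At the end, we simply take the difference of the two bitmaps.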
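To make the "one bit per element" idea concrete, here is a minimal sketch that uses a plain Python integer as an uncompressed bitmap. It is only an illustration and not how Roaringbitmap stores data internally; the helper names `to_bitmap`, `bitmap_difference`, and `to_ids` are made up for this example.

```python
def to_bitmap(ids):
    """Build an integer whose bit i is 1 exactly when i is in ids."""
    bits = 0
    for i in ids:
        bits |= 1 << i  # setting the same bit twice changes nothing
    return bits

def bitmap_difference(a, b):
    """Keep the bits that are set in a but not in b."""
    return a & ~b

def to_ids(bits):
    """List the positions of the set bits in ascending order."""
    ids = []
    position = 0
    while bits:
        if bits & 1:
            ids.append(position)
        bits >>= 1
        position += 1
    return ids

first = to_bitmap([0, 1, 2, 3, 4, 5])
second = to_bitmap([0, 1, 2, 3])
print(to_ids(bitmap_difference(first, second)))  # [4, 5]
```

Because a repeated ID sets the same bit again, duplicates in the input add no new elements to the bitmap.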

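Roaringbitmap compresses the bits. For example, if 100 consecutive bits are all 1, it can compress them into something like 1*100 (a simplified example; the real scheme is more complex), so there is no need to worry about running out of memory.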
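The "1*100" idea can be sketched as run-length encoding over sorted IDs. This is a simplified illustration only, assuming the input is sorted and has no duplicates; `to_runs` is a hypothetical helper, and the library's actual storage format is different and more elaborate.

```python
def to_runs(sorted_ids):
    """Collapse sorted, distinct integers into (start, length) runs."""
    runs = []
    for value in sorted_ids:
        if runs and value == runs[-1][0] + runs[-1][1]:
            # value directly follows the last run, so extend that run by one
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            # a gap was found, so start a new run
            runs.append((value, 1))
    return runs

print(to_runs(range(100)))              # [(0, 100)]
print(to_runs([1, 2, 3, 10, 11, 50]))   # [(1, 3), (10, 2), (50, 1)]
```

A fully contiguous range collapses to a single pair, while every gap in the IDs adds one more run to store.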

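If these 200-million-line files contain duplicates, the computation will be even faster, since repeated IDs map to the same bit.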

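If the IDs in these files are not contiguous and have gaps in between, the computation may slow down a little, but on balance it should not become much slower.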

2. Generate the two files

# ids2.txt has 10 fewer numbers than ids1.txt, so the difference is easy to see later
import time

start_ts = int(time.time())
max_num = 200_000_000
with open('ids1.txt', 'w') as f:
    for i in range(max_num):
        f.write(str(i) + "\n")

with open('ids2.txt', 'w') as f:
    for i in range(max_num):
        if i < max_num - 10:
            f.write(str(i) + "\n")
print(int(time.time() - start_ts))

Generating the two 200-million-line files took 44 seconds, with a file size of 1.8 GB.
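As a rough cross-check of that size, the byte count of a file holding 0 to n-1, one number per line, can be worked out without writing anything to disk. The sketch below assumes ASCII digits and a one-byte "\n" line ending; `expected_file_bytes` is a hypothetical helper name.

```python
def expected_file_bytes(n):
    """Bytes needed to store 0..n-1, one number per line, with 1-byte newlines."""
    total = 0
    digits = 1
    low = 0  # smallest number not yet counted (0 is treated as a 1-digit number)
    while low < n:
        high = min(n, 10 ** digits)  # exclusive upper bound for this digit count
        total += (high - low) * (digits + 1)  # digits plus the newline
        low = high
        digits += 1
    return total

size = expected_file_bytes(200_000_000)
print(size)                       # 1888888890
print(round(size / 1024 ** 3, 2)) # 1.76
```

That is about 1.9 billion bytes, or roughly 1.76 GiB, for ids1.txt, which is in line with the reported 1.8 G; ids2.txt is only 100 bytes smaller.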

3. Compute the difference

Install RoaringBitmap

pip install roaringbitmap

The code for the difference is as follows

from roaringbitmap import RoaringBitmap
import time

def read_ids_to_bitmap(file_path):
    bitmap = RoaringBitmap()
    with open(file_path, 'r') as f:
        for line in f:
            try:
                id_num = int(line.strip())
                bitmap.add(id_num)
            except ValueError:
                continue
    return bitmap

def compute_difference(file1_path, file2_path, output_path):
    start_time = time.time()

    # Read both files into RoaringBitmaps
    print("Reading first file...")
    bitmap1 = read_ids_to_bitmap(file1_path)
    print(f"First file loaded. Size: {len(bitmap1)} IDs")

    print("Reading second file...")
    bitmap2 = read_ids_to_bitmap(file2_path)
    print(f"Second file loaded. Size: {len(bitmap2)} IDs")

    # Compute difference
    print("Computing difference...")
    diff = bitmap1 - bitmap2

    # Write result to output file
    print("Writing result...")
    with open(output_path, 'w') as f:
        for id_num in diff:
            f.write(f"{id_num}\n")

    print(f"Difference computed. Result size: {len(diff)} IDs")
    print(f"Total time: {time.time() - start_time:.2f} seconds")

if __name__ == "__main__":
    file1 = "ids1.txt"
    file2 = "ids2.txt"
    output = "diff.txt"
    compute_difference(file1, file2, output)

Computing the difference took 42 seconds.
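Given how the two files were generated, diff.txt should contain exactly the ten IDs 199999990 through 199999999. Before committing to a full run, the same logic can be checked on a tiny pair of files with Python's built-in set, which needs no extra package. This is only a sanity-check sketch, not a replacement for the bitmap approach at full scale; `read_ids` and `small_scale_check` are hypothetical helpers and the files are written to a temporary directory.

```python
import os
import tempfile

def read_ids(path):
    """Read one integer per line into a set, skipping lines that are not integers."""
    ids = set()
    with open(path) as f:
        for line in f:
            try:
                ids.add(int(line.strip()))
            except ValueError:
                continue
    return ids

def small_scale_check(max_num=1000, missing=10):
    """Build tiny files laid out like ids1.txt / ids2.txt and return their difference."""
    with tempfile.TemporaryDirectory() as tmp:
        path1 = os.path.join(tmp, "ids1.txt")
        path2 = os.path.join(tmp, "ids2.txt")
        with open(path1, "w") as f:
            f.writelines(f"{i}\n" for i in range(max_num))
        with open(path2, "w") as f:
            # the second file is missing the last `missing` IDs
            f.writelines(f"{i}\n" for i in range(max_num - missing))
        return sorted(read_ids(path1) - read_ids(path2))

print(small_scale_check())
# [990, 991, 992, 993, 994, 995, 996, 997, 998, 999]
```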

I first came across RoaringBitmap through the Hbase database, while working with Hbase coprocessors. RoaringBitmap is more convenient to use in Java. Below is the official RoaringBitmap website.


One last plug: you are welcome to visit my newly built Mac software site, which collects high-quality Mac software:

macos-software.com/