Java 中大集合 <Long> 求交集的方法比较

343 阅读4分钟

背景

项目中使用到 List 求交集,很容易想到 collecion.retainAll () 方法,但是在数据量比较大时,这个方法效率并不高。本文研究了几种常用的方法,以供大家参考。

方法

【首先】 梳理下思路,List 去重一般有几种方法。

  1. 『外层遍历 + 内层遍历』查找:

复杂度 O (NM) ,一般使用 contains () 检查是否包含

  1. 『外层遍历 + 内层 Hash』查找:

复杂度 O (N),一般将内层 List 转化为 HashSet 实现

  1. 『外层遍历 + 内层 bitMap』查找:

复杂度 O (N),一般将内层 List 转化为字节映射实现

【其次】 这里其实忽略了一个点,就是 『单层遍历』中,检查 元素不包含 时,需要将这个元素移除(即 remove 方法)。remove 时,也会导致性能问题。

这里面我们使用 Java8 中java.util.AbstractCollection#retainAll方法来验证下我们的思路。

// Java8 中 方法:java.util.AbstractCollection#retainAll
public boolean retainAll(Collection<?> c) {
    Objects.requireNonNull(c);
    boolean modified = false;
    Iterator<E> it = iterator();
    
    // 1. 外层遍历
    while (it.hasNext()) {
        
        // 2. 内层查找『是否包含』
        if (!c.contains(it.next())) {
        
            // 3. 不包含时,移除外层元素
            it.remove();
            modified = true;
        }
    }
    return modified;
}

这里用图总结下,求交集的流程:   image.png

实现

1.『外层遍历 + 内层遍历』查找:

java 中常用 2 种遍历查找的 List:ArrayList、LinkedList,在内外层中测试。

// 外层:ArrayList,内层:ArrayList
private void outArrayListInnerArrayList(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    ArrayList<Long> setA = new ArrayList<>(listA);
    ArrayList<Long> setB = new ArrayList<>(listB);
    setA.retainAll(setB);
    long end = System.currentTimeMillis();
    System.out.println("[ArrayList-ArrayList]RetainAll耗时:" + (end - begin));
}

// 外层:LinkedList,内层:ArrayList
private void outLinkedListInnerArrayList(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    LinkedList<Long> setA = new LinkedList<>(listA);
    ArrayList<Long> setB = new ArrayList<>(listB);
    setA.retainAll(setB);
    long end = System.currentTimeMillis();
    System.out.println("[LinkedList-ArrayList]RetainAll耗时:" + (end - begin));
}

// 外层:ArrayList,内层:LinkedList
private void outArrayListInnerLinkedList(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    LinkedList<Long> setA = new LinkedList<>(listA);
    ArrayList<Long> setB = new ArrayList<>(listB);
    setA.retainAll(setB);
    long end = System.currentTimeMillis();
    System.out.println("[LinkedList-ArrayList]RetainAll耗时:" + (end - begin));
}

// 外层:LinkedList,内层:LinkedList
private void outLinkedListInnerLinkedList(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    LinkedList<Long> setA = new LinkedList<>(listA);
    ArrayList<Long> setB = new ArrayList<>(listB);
    setA.retainAll(setB);
    long end = System.currentTimeMillis();
    System.out.println("[LinkedList-LinkedList]RetainAll耗时:" + (end - begin));
}

 

2.『外层遍历 + 内层 Hash』查找:

java 中常用 HashSet,内层替换为 HashSet 查找。

// 外层:ArrayList,内层:HashSet
private void outArrayListInnerHashSet(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    ArrayList<Long> setA = new ArrayList<>(listA);
    HashSet<Long> setB = new HashSet<>(listB);
    setA.retainAll(setB);
    long end = System.currentTimeMillis();
    System.out.println("[ArrayList-HashSet]RetainAll耗时:" + (end - begin));
}

// 外层:LinkedList,内层:HashSet
private void outLinkedListInnerHashSet(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    LinkedList<Long> setA = new LinkedList<>(listA);
    HashSet<Long> setB = new HashSet<>(listB);
    setA.retainAll(setB);
    long end = System.currentTimeMillis();
    System.out.println("[LinkedList-HashSet]RetainAll耗时:" + (end - begin));
}

3.『外层遍历 + 内层 bitMap』查找:

BitSet 也称作 BitMap,它是一种通用的快速数据结构,不幸的是它太费内存,所以通常我们使用压缩位图。RoaringBitmap 是一种压缩位置,它提供更好的压缩效果,在某些情况下比其它压缩位图快好几百倍。

github.com/RoaringBitm…

RoaringBitmap 已经使用在很多知名的开源项目中:

  • Apache Spark
  • Apache Hive
  • Apache Tez
  • Apache Kylin
  • ... ...

Roaringbitmap 中在 Long 类型中,提供了 2 种实现Roaring64NavigableMapRoaring64Bitmap Roaring64NavigableMap基于红黑树实现,Roaring64Bitmap基于 ART(The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases )数据结构实现。

那么,外层使用 ArrayList、LinkedList,内层使用 Roaring64NavigableMap、Roaring64Bitmap。 

// 外层:ArrayList,内层:Roaring64NavigableMap
private void outArrayListInnerRoaring64NavigableMap(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    ArrayList<Long> setA = new ArrayList<>(listA);
    Roaring64NavigableMap ansB = new Roaring64NavigableMap();
    listB.forEach(ansB::addLong);
    setA.removeIf(e -> !ansB.contains(e));
    long end = System.currentTimeMillis();
    System.out.println("[ArrayList-Roaring64NavigableMap]RetainAll耗时:" + (end - begin));
}

// 外层:LinkedList,内层:Roaring64NavigableMap
private void outLinkedListInnerRoaring64NavigableMap(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    LinkedList<Long> setA = new LinkedList<>(listA);
    Roaring64NavigableMap ansB = new Roaring64NavigableMap();
    listB.forEach(ansB::addLong);
    setA.removeIf(e -> !ansB.contains(e));
    long end = System.currentTimeMillis();
    System.out.println("[LinkedList-Roaring64NavigableMap]RetainAll耗时:" + (end - begin));
}


// 外层:ArrayList,内层:Roaring64Bitmap
private void outArrayListInnerRoaring64Bitmap(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    ArrayList<Long> setA = new ArrayList<>(listA);
    Roaring64NavigableMap ansB = new Roaring64NavigableMap();
    listB.forEach(ansB::addLong);
    setA.removeIf(e -> !ansB.contains(e));
    long end = System.currentTimeMillis();
    System.out.println("[ArrayList-Roaring64Bitmap]RetainAll耗时:" + (end - begin));
}

// 外层:LinkedList,内层:Roaring64Bitmap
private void outLinkedListInnerRoaring64Bitmap(List<Long> listA, List<Long> listB) {
    long begin = System.currentTimeMillis();
    LinkedList<Long> setA = new LinkedList<>(listA);
    Roaring64Bitmap ansB = new Roaring64Bitmap();
    listB.forEach(ansB::addLong);
    setA.removeIf(e -> !ansB.contains(e));
    long end = System.currentTimeMillis();
    System.out.println("[LinkedList-Roaring64Bitmap]RetainAll耗时:" + (end - begin));
}

 

测试结果

使用 Mac Pro 2021 M1 + JDK8 测试,100 万级的数据太慢,实行中没有太大参考意义,有兴趣可以自行测试。

说明:由于受数据,以及电脑本身负载影响,测试结果可能不一致,仅做量级参考。

 

 查找方法(外层 - 内层)1 万 (毫秒)10 万 (毫秒)20 万 (毫秒)50 万 (毫秒)
『外层遍历 + 内层遍历』查找ArrayList-ArrayList67550226280264133
LinkedList-ArrayList63541827961272362
LinkedList-ArrayList57543621330260976
LinkedList-LinkedList59515323251252472
『外层遍历 + 内层 Hash』查找ArrayList-HashSet84675102
LinkedList-HashSet8174982
『外层遍历 + 内层 bitMap』查找ArrayList-Roaring64NavigableMap91265719876
LinkedList-Roaring64NavigableMap26125562876
ArrayList-Roaring64Bitmap2078572801
LinkedList-Roaring64Bitmap119171221384

附录

pom.xml

<dependency>
  <groupId>org.roaringbitmap</groupId>
  <artifactId>RoaringBitmap</artifactId>
  <version>0.9.23</version>
</dependency>

完整测试代码

import org.apache.commons.lang.math.RandomUtils;
import org.junit.Test;
import org.roaringbitmap.longlong.Roaring64Bitmap;
import org.roaringbitmap.longlong.Roaring64NavigableMap;

import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Random;

public class SetOperation {

    /**
     * 集合的运算方法用时测试
     */
    @Test
    public void setOperation() {
        int size = 50_0000;
        List<Long> listA = new ArrayList<>(size);
        List<Long> listB = new ArrayList<>(size);

        initData(size, listA, listB);

        //『外层遍历+内层遍历』查找
        System.out.println("1. 『外层遍历+内层遍历』查找");
        outArrayListInnerArrayList(new ArrayList<>(listA), new ArrayList<>(listB));
        outLinkedListInnerArrayList(new ArrayList<>(listA), new ArrayList<>(listB));
        outArrayListInnerLinkedList(new ArrayList<>(listA), new ArrayList<>(listB));
        outLinkedListInnerLinkedList(new ArrayList<>(listA), new ArrayList<>(listB));

        //『外层遍历+内层Hash』查找:
        System.out.println("2.『外层遍历+内层Hash』查找:");
        outArrayListInnerHashSet(new ArrayList<>(listA), new ArrayList<>(listB));
        outLinkedListInnerHashSet(new ArrayList<>(listA), new ArrayList<>(listB));

        //『外层遍历+内层bitMap』查找
        System.out.println("3.『外层遍历+内层bitMap』查找");
        outArrayListInnerRoaring64NavigableMap(new ArrayList<>(listA), new ArrayList<>(listB));
        outLinkedListInnerRoaring64NavigableMap(new ArrayList<>(listA), new ArrayList<>(listB));
        outArrayListInnerRoaring64Bitmap(new ArrayList<>(listA), new ArrayList<>(listB));
        outLinkedListInnerRoaring64Bitmap(new ArrayList<>(listA), new ArrayList<>(listB));
    }

    // 外层:ArrayList,内层:ArrayList
    private void outArrayListInnerArrayList(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        ArrayList<Long> setA = new ArrayList<>(listA);
        ArrayList<Long> setB = new ArrayList<>(listB);
        setA.retainAll(setB);
        long end = System.currentTimeMillis();
        System.out.println("[ArrayList-ArrayList]RetainAll耗时:" + (end - begin));
    }

    // 外层:LinkedList,内层:ArrayList
    private void outLinkedListInnerArrayList(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        LinkedList<Long> setA = new LinkedList<>(listA);
        ArrayList<Long> setB = new ArrayList<>(listB);
        setA.retainAll(setB);
        long end = System.currentTimeMillis();
        System.out.println("[LinkedList-ArrayList]RetainAll耗时:" + (end - begin));
    }

    // 外层:ArrayList,内层:LinkedList
    private void outArrayListInnerLinkedList(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        LinkedList<Long> setA = new LinkedList<>(listA);
        ArrayList<Long> setB = new ArrayList<>(listB);
        setA.retainAll(setB);
        long end = System.currentTimeMillis();
        System.out.println("[LinkedList-ArrayList]RetainAll耗时:" + (end - begin));
    }

    // 外层:LinkedList,内层:LinkedList
    private void outLinkedListInnerLinkedList(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        LinkedList<Long> setA = new LinkedList<>(listA);
        ArrayList<Long> setB = new ArrayList<>(listB);
        setA.retainAll(setB);
        long end = System.currentTimeMillis();
        System.out.println("[LinkedList-LinkedList]RetainAll耗时:" + (end - begin));
    }

    // 外层:ArrayList,内层:HashSet
    private void outArrayListInnerHashSet(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        ArrayList<Long> setA = new ArrayList<>(listA);
        HashSet<Long> setB = new HashSet<>(listB);
        setA.retainAll(setB);
        long end = System.currentTimeMillis();
        System.out.println("[ArrayList-HashSet]RetainAll耗时:" + (end - begin));
    }

    // 外层:LinkedList,内层:HashSet
    private void outLinkedListInnerHashSet(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        LinkedList<Long> setA = new LinkedList<>(listA);
        HashSet<Long> setB = new HashSet<>(listB);
        setA.retainAll(setB);
        long end = System.currentTimeMillis();
        System.out.println("[LinkedList-HashSet]RetainAll耗时:" + (end - begin));
    }


    // 外层:ArrayList,内层:Roaring64NavigableMap
    private void outArrayListInnerRoaring64NavigableMap(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        ArrayList<Long> setA = new ArrayList<>(listA);
        Roaring64NavigableMap ansB = new Roaring64NavigableMap();
        listB.forEach(ansB::addLong);
        setA.removeIf(e -> !ansB.contains(e));
        long end = System.currentTimeMillis();
        System.out.println("[ArrayList-Roaring64NavigableMap]RetainAll耗时:" + (end - begin));
    }

    // 外层:LinkedList,内层:Roaring64NavigableMap
    private void outLinkedListInnerRoaring64NavigableMap(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        LinkedList<Long> setA = new LinkedList<>(listA);
        Roaring64NavigableMap ansB = new Roaring64NavigableMap();
        listB.forEach(ansB::addLong);
        setA.removeIf(e -> !ansB.contains(e));
        long end = System.currentTimeMillis();
        System.out.println("[LinkedList-Roaring64NavigableMap]RetainAll耗时:" + (end - begin));
    }


    // 外层:ArrayList,内层:Roaring64Bitmap
    private void outArrayListInnerRoaring64Bitmap(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        ArrayList<Long> setA = new ArrayList<>(listA);
        Roaring64NavigableMap ansB = new Roaring64NavigableMap();
        listB.forEach(ansB::addLong);
        setA.removeIf(e -> !ansB.contains(e));
        long end = System.currentTimeMillis();
        System.out.println("[ArrayList-Roaring64Bitmap]RetainAll耗时:" + (end - begin));
    }

    // 外层:LinkedList,内层:Roaring64Bitmap
    private void outLinkedListInnerRoaring64Bitmap(List<Long> listA, List<Long> listB) {
        long begin = System.currentTimeMillis();
        LinkedList<Long> setA = new LinkedList<>(listA);
        Roaring64Bitmap ansB = new Roaring64Bitmap();
        listB.forEach(ansB::addLong);
        setA.removeIf(e -> !ansB.contains(e));
        long end = System.currentTimeMillis();
        System.out.println("[LinkedList-Roaring64Bitmap]RetainAll耗时:" + (end - begin));
    }

    private void initData(int size, List<Long> listA, List<Long> listB) {
        Random random = new Random();
        Random random2 = new Random();
        random.longs(size).forEach(e -> {
            listA.add(e);
            if (random2.nextFloat() > 0.5) {
                listB.add(e);
            } else {
                listB.add(RandomUtils.nextLong());
            }
        });
    }
}

写在后面

过多内容就不一一展示了, 好了,” 退休程序猿 “会持续更新,咱们下期再见~~~