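How to Deduplicate Large Data Volumes in a Business System

Keep creating, keep growing! This is day 17 of my participation in the Juejin Daily New Plan · June Writing Challenge.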

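Preface

Consider a business scenario such as e-commerce marketing. A big-data model selects a large audience, and we then run targeted marketing against it, for example sending push notifications or SMS messages to tell customers that a coupon has been issued or a campaign is running. The user IDs in that audience may contain duplicates, so we need to deduplicate them. If we overlook the data volume, our first instinct is probably to reach for a HashSet. But how much memory does a set of 500,000 or 5,000,000 users actually take?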

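Calculating memory usage

How much memory a Java object occupies depends on the JVM's memory layout (heap, stack, method area, and so on). For data held in variables while the program runs, the heap is what matters. To measure it I rely on a third-party utility class, RamUsageEstimator, which ships with Lucene: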

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>4.0.0</version>
</dependency>

import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.util.RamUsageEstimator;

public class HeapTest {

    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        long phone = 13800000001L;
        // Store 5,000,000 distinct 11-digit phone numbers as strings
        for (int i = 0; i < 5000000; i++) {
            set.add(String.valueOf(phone + i));
        }
        // Total size, in bytes, of the object plus everything reachable from it
        System.out.println(RamUsageEstimator.sizeOf(set));
        // Size, in bytes, of the object itself on the heap (referenced objects excluded)
        System.out.println(RamUsageEstimator.shallowSizeOf(set));
        // Same total as sizeOf, returned as a human-readable string
        System.out.println(RamUsageEstimator.humanSizeOf(set));
    }
}

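The numbers from this approach can vary somewhat with the runtime environment. A live heap snapshot in JProfiler tells the same story, though: 5 million phone numbers take roughly 500 MB, which is not acceptable in a business system.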
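The figure of roughly 500 MB can be sanity-checked with a back-of-the-envelope calculation. The sketch below is only an estimate under stated assumptions: a 64-bit HotSpot JVM with compressed references (12-byte object headers, 4-byte references, 8-byte alignment) and a char[]-backed String as in Java 8. The class and method names are hypothetical, and newer JDKs with compact strings should come out somewhat lower.

```java
class HashSetFootprintEstimate {

    // Objects are assumed to be padded to a multiple of 8 bytes
    static long align(long bytes) {
        return (bytes + 7) / 8 * 8;
    }

    // Rough heap estimate for a HashSet<String> holding `entries` strings of `digits` characters
    static long estimateBytes(int entries, int digits) {
        long node = align(12 + 4 + 3 * 4);     // HashMap.Node: header + hash + key/value/next references
        long string = align(12 + 4 + 4);       // String: header + array reference + cached hash
        long chars = align(16 + 2L * digits);  // char[]: array header + 2 bytes per character

        // HashMap doubles its table until the entries fit under the 0.75 load factor
        long capacity = 16;
        while (entries > capacity * 3 / 4) {
            capacity *= 2;
        }
        long table = align(16 + 4 * capacity); // Node[] table: array header + one reference per slot

        return entries * (node + string + chars) + table;
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(5000000, 11);
        System.out.printf("about %.0f MiB%n", bytes / (1024.0 * 1024.0));
    }
}
```

Under these assumptions each entry costs about 96 bytes before the table itself is counted, so nearly all of the memory goes to per-object overhead rather than to the 11 digits of payload.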

So how do we deduplicate data at this scale inside a business system?

The BitMap data structure

Though you can set the bits in any order (e.g., set(100), set(10), set(1)), you will typically get better performance if you set the bits in increasing order (e.g., set(1), set(10), set(100)).

Setting a bit that is larger than any of the current set bit is a constant time operation. Setting a bit that is smaller than an already set bit can require time proportional to the compressed size of the bitmap, as the bitmap may need to be rewritten.

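The text above comes from the library's official documentation. In practice we would otherwise have to implement a bitmap ourselves; here I simply use JavaEWAH, an open-source library originally hosted on Google Code.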

<dependency>
    <groupId>com.googlecode.javaewah</groupId>
    <artifactId>JavaEWAH</artifactId>
    <version>1.1.0</version>
</dependency>

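import com.googlecode.javaewah.EWAHCompressedBitmap;
import org.apache.lucene.util.RamUsageEstimator;

public class HeapTest {
    public static void main(String[] args) {
        EWAHCompressedBitmap bitMap = new EWAHCompressedBitmap();
        // Bits are indexed by int, so an 11-digit phone number cannot be used directly.
        // Each number is stored as its offset from the smallest number instead.
        long base = 13800000001L;
        for (int i = 0; i < 5000000; i++) {
            long phone = base + i;
            // set(int) marks a single bit; addWord(long) would append a raw 64-bit word instead
            bitMap.set((int) (phone - base));
        }
        // Total size, in bytes, of the object plus everything reachable from it
        System.out.println(RamUsageEstimator.sizeOf(bitMap));
        // Size, in bytes, of the object itself on the heap (referenced objects excluded)
        System.out.println(RamUsageEstimator.shallowSizeOf(bitMap));
        // Same total as sizeOf, returned as a human-readable string
        System.out.println(RamUsageEstimator.humanSizeOf(bitMap));
    }
}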

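The storage footprint drops dramatically. Even so, a bitmap is not a great fit for our scenario: the data is massive, and every campaign targets a different set of users.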
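The measurement only shows the size of the bitmap, not the deduplication step itself. A minimal sketch of that step follows. It uses java.util.BitSet from the JDK (uncompressed) rather than JavaEWAH so that it runs without extra dependencies. The class name, the dedup helper, and the assumption that every phone number lies within an int-sized range above a known base value are all illustrative.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class BitmapDedup {

    // Hypothetical helper: keeps the first occurrence of each phone number.
    // Assumes every number satisfies base <= phone < base + Integer.MAX_VALUE.
    static List<Long> dedup(long[] phones, long base) {
        BitSet seen = new BitSet();
        List<Long> unique = new ArrayList<>();
        for (long phone : phones) {
            // Map the phone number to a small non-negative bit index
            int offset = Math.toIntExact(phone - base);
            if (!seen.get(offset)) {
                seen.set(offset);
                unique.add(phone);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        long base = 13800000001L;
        long[] phones = {base + 5, base, base + 5, base + 2, base};
        // prints [13800000006, 13800000001, 13800000003]
        System.out.println(dedup(phones, base));
    }
}
```

Each distinct number costs one bit, so 5 million consecutive offsets need well under 1 MB here. The catch is the one already noted: every campaign needs its own bitmap, and the ID range has to be known up front.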

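Relying on external systems

Deduplicating this much data inside the business system itself is not really appropriate. We can lean on a third-party database or framework instead, for example a Redis Bloom filter. Alternatively, when the audience table is written to the database we can skip the unique-index constraint and deduplicate later, at query time.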
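To make the Bloom filter option more concrete, here is a small in-process sketch of the idea. It is not a Redis client and does not reflect how Redis implements its filter. The class name, the sizing parameters, and the hash mixing (the 64-bit finalizer from MurmurHash3 combined with double hashing) are illustrative choices.

```java
import java.util.BitSet;

class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // 64-bit finalizer from MurmurHash3, used here only to scramble the id
    private static long mix(long x) {
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33;
        return x;
    }

    // Returns true if the id was definitely not added before,
    // false if it was probably added before (false positives are possible).
    boolean add(long id) {
        long h1 = mix(id);
        long h2 = mix(h1) | 1L; // second hash, forced odd so the probe positions vary
        boolean isNew = false;
        for (int i = 0; i < numHashes; i++) {
            int index = (int) Long.remainderUnsigned(h1 + i * h2, numBits);
            if (!bits.get(index)) {
                isNew = true;
                bits.set(index);
            }
        }
        return isNew;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 24, 3);
        long[] userIds = {13800000001L, 13800000002L, 13800000001L};
        for (long id : userIds) {
            // Only send the push message when the filter reports the id as new
            System.out.println(id + " -> send=" + filter.add(id));
        }
    }
}
```

The trade-off is that a Bloom filter can occasionally report a new ID as already seen, so a small fraction of users may be skipped. It never reports a repeated ID as new, which is usually the more important guarantee when the goal is to avoid sending the same message twice.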
