Implementing a Bloom Filter with Redis

Note: every article in this column is my own original work. Please do not repost without my permission. Thank you.

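A Bloom filter is commonly used to test whether a set contains a given value. Its strengths are space and lookup time: it needs far less memory than storing the elements themselves, and a lookup costs a fixed number of hash computations no matter how many elements have been inserted.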

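A Bloom filter holds one long bit vector. To insert a value, it applies a series of hash functions to the input, maps every resulting hash value to a position in the bit vector, and sets the bit at each of those positions to 1. To query a value, it applies the same hash functions to the input and checks whether all of the resulting bits are 1; that decides whether the value is considered present.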

How a Bloom filter works

1. Initially, the bit vector is:

Position: F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
Value:    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

2. Add the string "string1" to the set. Its 3 hash values are hash1 = 5, hash2 = 7, hash3 = 12 (C), so the bit vector becomes:

Position: F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
Value:    0  0  0  1  0  0  0  0  1  0  1  0  0  0  0  0

3. Add the string "string2" to the set. Its 3 hash values are hash1 = 2, hash2 = 10 (A), hash3 = 14 (E), so the bit vector becomes:

Position: F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
Value:    0  1  0  1  0  1  0  0  1  0  1  0  0  1  0  0

4. Query whether "string1" exists

The 3 computed hash values are hash1 = 5, hash2 = 7, hash3 = 12 (C). All three positions in the bit vector are 1, so the string may exist (and in fact it does).

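5. Query whether "string3" exists

The 3 computed hash values are hash1 = 5, hash2 = 7, hash3 = 14 (E). All three positions in the bit vector are 1, so the string is reported as possibly present. It was never added, but the bits set jointly by string1 and string2 make it look present in the bit vector.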

6. Query whether "string4" exists

The 3 computed hash values are hash1 = 5, hash2 = 8, hash3 = 14 (E). One of these three positions (position 8) is 0, so the string definitely does not exist.
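
The six steps above can be replayed with a small in-memory sketch. This is only an illustration: the class name BloomWalkthrough is made up for this example, and the bit positions are hard-coded from the walkthrough instead of being produced by real hash functions.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Illustrative class: replays the 16-bit walkthrough with hard-coded positions.
public class BloomWalkthrough {

    // Positions copied from the walkthrough; a real filter would compute them by hashing.
    private static final Map<String, int[]> POSITIONS = new HashMap<>();
    static {
        POSITIONS.put("string1", new int[]{5, 7, 12});
        POSITIONS.put("string2", new int[]{2, 10, 14});
        POSITIONS.put("string3", new int[]{5, 7, 14});
        POSITIONS.put("string4", new int[]{5, 8, 14});
    }

    // 16-bit vector, positions 0 to F
    private final BitSet bits = new BitSet(16);

    // Sets every bit that the value maps to (only the four walkthrough strings are supported).
    public void add(String value) {
        for (int pos : POSITIONS.get(value)) {
            bits.set(pos);
        }
    }

    // Returns false as soon as one mapped bit is 0; true means "possibly present".
    public boolean mightContain(String value) {
        for (int pos : POSITIONS.get(value)) {
            if (!bits.get(pos)) {
                return false;
            }
        }
        return true;
    }

    // Renders the vector from position F down to position 0, like the diagrams.
    public String toBitString() {
        StringBuilder sb = new StringBuilder();
        for (int pos = 15; pos >= 0; pos--) {
            sb.append(bits.get(pos) ? '1' : '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        BloomWalkthrough filter = new BloomWalkthrough();
        filter.add("string1");
        filter.add("string2");
        System.out.println(filter.toBitString());           // 0101010010100100
        System.out.println(filter.mightContain("string1")); // true (really present)
        System.out.println(filter.mightContain("string3")); // true (false positive)
        System.out.println(filter.mightContain("string4")); // false (definitely absent)
    }
}
```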

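Discussion:

Because the hash results of different values overlap, a value that the Bloom filter reports as present may not actually be present (a false positive). A value that it reports as absent, however, is definitely absent.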

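Let n be the number of elements in the set and m the length of the bit vector. The optimal number of hash functions k is:

Optimal number of hash functions: k = ln(2) * (m/n)

Correspondingly, the false-positive rate p is given by:

False-positive rate: p = (1 - e^(-kn/m))^k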

For an expected n, you can keep adjusting m (and the k that follows from it) until the computed false-positive rate is acceptable.
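
The two formulas can be evaluated directly to try out candidate sizes. The sketch below is a minimal illustration; the class name BloomMath, the method names, and the sample values of m and n are assumptions made for this example, and k is rounded to the nearest integer (at least 1) because a filter needs a whole number of hash functions.

```java
// Illustrative helper for the two formulas above; names and sample sizes are made up.
public class BloomMath {

    // k = ln(2) * (m / n), rounded to the nearest integer and never below 1
    public static int optimalK(long m, long n) {
        return Math.max(1, (int) Math.round(Math.log(2) * m / n));
    }

    // p = (1 - e^(-k*n/m))^k
    public static double falsePositiveRate(long m, long n, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long n = 1_000; // expected number of inserted elements (sample value)
        // Try several bit-vector lengths and print the resulting k and p
        for (long m = 4_000; m <= 16_000; m += 4_000) {
            int k = optimalK(m, n);
            System.out.printf("m = %d, k = %d, p = %.6f%n", m, k, falsePositiveRate(m, n, k));
        }
    }
}
```

Growing m while n stays fixed raises the optimal k and lowers p, which is the trade-off between memory and accuracy that the tuning step refers to.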

Using a single-node Bloom filter

Utility libraries such as Google Guava and HuTool ship Bloom filter implementations. The following shows how to use the Guava version:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;
import java.util.concurrent.ThreadLocalRandom;

// The class name is illustrative; it only exists to make the snippet compile.
public class GuavaBloomFilterDemo {

    @SuppressWarnings("UnstableApiUsage")
    public static void main(String[] args) {
        // Create a Bloom filter. Arguments: the funnel that turns objects into bytes,
        // the expected number of insertions, and the desired false-positive rate
        BloomFilter<String> bloomFilter = BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1000, 0.001);
        // Insert random strings
        final int count = 200;
        String[] values = new String[count];
        for (int i = 0; i < count; i++) {
            String value = "value" + ThreadLocalRandom.current().nextLong();
            bloomFilter.put(value);
            values[i] = value;
        }
        // Query whether the inserted strings exist
        for (int i = 0; i < count; i++) {
            String value = values[i];
            boolean isExist = bloomFilter.mightContain(value);
            System.out.println(String.format("i = %03d, isExist = %b, hashCode = %x", i, isExist, value.hashCode()));
        }
        // Generate other random strings and query whether they exist
        for (int i = count; i < count + 100; i++) {
            String value = "value" + ThreadLocalRandom.current().nextLong();
            boolean isExist = bloomFilter.mightContain(value);
            System.out.println(String.format("i = %03d, isExist = %b, hashCode = %x", i, isExist, value.hashCode()));
        }
    }
}

A distributed Bloom filter based on Redis

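In a distributed setting, however, the single-node Bloom filter above is clearly not enough, because its bit vector lives in the memory of one process.

We can use the Redis BitMap to build a distributed, scalable Bloom filter. A Redis string can be as large as 512 MB, that is 2^32 bits, which gives us a bit vector large enough to hold the hash values.

Below is a simple Redis-based Bloom filter that I implemented with Jedis. The core code is as follows: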

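import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

public class BloomFilter {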

    private static final Logger logger = LoggerFactory.getLogger(BloomFilter.class);

    /**
     * Length of the bit vector, in bits
     */
    private static final long BITMAP_LEN = 10_000_000L;

    /**
     * Key prefix
     */
    public static final String KEY_PREFIX = "bf-";

    /**
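     * The three built-in hash functions. All three are shifts of the same
     * String.hashCode(), so their outputs are strongly correlated; this keeps
     * the demo short but is weaker than using independent hash functions.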
     */
    private List<Function<String, Long>> hashFuncs = new ArrayList<>(Arrays.asList(
            str -> (str.hashCode() >>> 1) % BITMAP_LEN,
            str -> (str.hashCode() >>> 2) % BITMAP_LEN,
            str -> (str.hashCode() >>> 3) % BITMAP_LEN
    ));

    /**
     * Jedis connection (a single connection, not a pool)
     */
    private Jedis jedis;

    public BloomFilter(Jedis jedis) {
        this.jedis = jedis;
    }

    /**
     * Inserts a value into the Bloom filter.
     *
     * @param key   name of the filter; the Redis key is KEY_PREFIX + key
     * @param value value to insert
     */
    public void insert(String key, String value) {
        String realKey = KEY_PREFIX + key;
        // Apply every hash function to value, then set the bit at each returned offset to true
        try (Pipeline pipeline = jedis.pipelined()) {
            hashFuncs.stream().map(e -> e.apply(value))
                    .forEach(idx -> pipeline.setbit(realKey, idx, true));
            pipeline.syncAndReturnAll();
        } catch (IOException e) {
            logger.error("Failed to insert into Bloom filter {}", realKey, e);
        }
    }

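    /**
     * Checks whether a value may exist in the Bloom filter.
     *
     * @param key   name of the filter; the Redis key is KEY_PREFIX + key
     * @param value value to look up
     * @return true if the value may exist, false if it definitely does not
     */
    public boolean exist(String key, String value) {
        String realKey = KEY_PREFIX + key;
        // Apply every hash function to value, then check whether any returned bit is false
        try (Pipeline pipeline = jedis.pipelined()) {
            hashFuncs.stream().map(e -> e.apply(value))
                    .forEach(idx -> pipeline.getbit(realKey, idx));
            return !pipeline.syncAndReturnAll().contains(false);
        } catch (IOException e) {
            logger.error("Failed to query Bloom filter {}", realKey, e);
        }
        // Reached only after an I/O failure; callers should not read this as "definitely absent"
        return false;
    }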
}
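
The three built-in hash functions in the class above are all shifts of the same String.hashCode(), so the bit offsets they produce are closely related. One common alternative is double hashing, where k offsets are derived from two base hashes as h1 + i * h2. The sketch below is only an illustration of that technique and not part of the implementation above: the class name DoubleHashing, the method name indexes, and the choice of MD5 as the source of the two base hashes are all assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch of double hashing; not part of the BloomFilter class above.
public class DoubleHashing {

    // Returns k bit offsets in [0, m) for the given value.
    public static long[] indexes(String value, int k, long m) {
        byte[] digest;
        try {
            // MD5 yields 16 bytes, enough for two 64-bit base hashes
            digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
        ByteBuffer buffer = ByteBuffer.wrap(digest);
        long h1 = buffer.getLong();
        long h2 = buffer.getLong();

        long[] result = new long[k];
        for (int i = 0; i < k; i++) {
            // The sum may overflow and wrap around; floorMod still maps it into [0, m)
            result[i] = Math.floorMod(h1 + i * h2, m);
        }
        return result;
    }

    public static void main(String[] args) {
        for (long idx : indexes("value", 3, 10_000_000L)) {
            System.out.println(idx);
        }
    }
}
```

Each returned offset could then be passed to setbit or getbit in the same way as the offsets produced by hashFuncs.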

The test code for the Redis-based filter is as follows:

@Test
public void redisBloomFilterTest() throws Exception {
    // jedis is assumed to be a connected Jedis instance held by the test class
    BloomFilter bloomFilter = new BloomFilter(jedis);
    // Insert random strings
    final int count = 200;
    String[] values = new String[count];
    for (int i = 0; i < count; i++) {
        String value = "value" + ThreadLocalRandom.current().nextLong();
        bloomFilter.insert("test", value);
        values[i] = value;
    }
    // Query whether the inserted strings exist
    for (int i = 0; i < count; i++) {
        String value = values[i];
        boolean isExist = bloomFilter.exist("test", value);
        System.out.println(String.format("i = %03d, isExist = %b, hashCode = %x", i, isExist, value.hashCode()));
    }
    // Generate other random strings and query whether they exist
    for (int i = count; i < count + 100; i++) {
        String value = "value" + ThreadLocalRandom.current().nextLong();
        boolean isExist = bloomFilter.exist("test", value);
        System.out.println(String.format("i = %03d, isExist = %b, hashCode = %x", i, isExist, value.hashCode()));
    }
}

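The full output is long, so it is not reproduced here. The isExist values in the output show the following pattern:

Every string that was inserted earlier prints true. Strings that were never inserted print false in most cases and occasionally print true (a false positive), which matches the earlier discussion of Bloom filters.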

The code above is only meant to illustrate how a Redis-based Bloom filter can be implemented. In production you can use Redisson's RBloomFilter directly.

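A Redis-based Bloom filter is mainly used to prevent cache penetration and to filter blacklists. Those topics will be covered in a later article.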