🎯 Redis Bloom Filters: The Art of Trading Certainty for Performance

Key topics: bitmaps, hash functions, definite-absence checks, capacity planning

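🎬 Opening: A Story About a "Blacklist"

Imagine you are an airport security officer ✈️:

Scenario 1: The perfect solution (a conventional set)

Terrorist watchlist: 10 million people
Every check: look the person up among 10 million entries
- Memory: 10 million × 32 bytes = 320 MB
- Lookup speed: O(log n) or O(1)
- Accuracy: 100%

Scenario 2: The practical solution (a Bloom filter)

Terrorist watchlist: 10 million people
With a Bloom filter:
- Memory: about 12 MB (roughly 96% less!)
- Lookup speed: O(k), where k is the number of hash functions; very fast
- Accuracy: about 99% (at this size, roughly 1% of people who are not on the list get flagged anyway)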
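The memory comparison above can be reproduced with a little arithmetic. The sketch below assumes 32 bytes of raw payload per entry (container overhead ignored) and an optimally tuned filter at a 1% false-positive rate, sized with the standard formula m = -n × ln(p) / (ln 2)^2. The function names are illustrative, not part of any library.

```python
import math

def set_size_bytes(n, bytes_per_entry=32):
    """Raw payload of an exact set: n fixed-size entries, overhead ignored."""
    return n * bytes_per_entry

def bloom_size_bytes(n, error_rate):
    """Bytes needed by an optimally tuned Bloom filter for n elements."""
    bits = -n * math.log(error_rate) / (math.log(2) ** 2)
    return math.ceil(bits / 8)

n = 10_000_000
exact = set_size_bytes(n)
bloom = bloom_size_bytes(n, 0.01)

print(f"exact set : {exact / 1e6:.0f} MB")
print(f"bloom (1%): {bloom / 1e6:.1f} MB")
print(f"saved     : {1 - bloom / exact:.1%}")
```

Asking for a lower false-positive rate raises the filter size: each extra factor of 10 in accuracy costs about 4.8 more bits per element.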

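What is a false positive?

Ordinary passenger A goes through security → the Bloom filter says: "He is on the blacklist!"
Reality: he is not on the blacklist (a false positive!)

Terrorist B shows up → the Bloom filter says: "He is on the blacklist!"
Reality: he really is on the blacklist (correct!)

Key points:
✅ If it says "not on the list", the person is definitely not on it (a real bad guy is never missed)
❌ If it says "on the list", it may be wrong (an innocent person may be flagged)

A Redis Bloom filter is exactly this kind of "high-efficiency screener"! 🎯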

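Part 1: How Bloom Filters Work 📚

1.1 What Is a Bloom Filter?

Bloom filter = bitmap + multiple hash functions

Core idea:
Use very little space to quickly decide whether an element "possibly exists" or "definitely does not exist"

Properties:
✅ Extremely space-efficient (a bitmap)
✅ Extremely fast lookups (O(k))
✅ No false negatives ("absent" always means absent)
❌ Possible false positives ("present" may be wrong)
❌ No deletion (clearing a bit would affect other elements)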

1.2 Data Structure

Bloom filter = one very long bit array + k hash functions

Bit array:
Index: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Bits:  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
       ↑ each position takes only 1 bit (0 or 1)

k hash functions:
hash1("apple") = 3
hash2("apple") = 7
hash3("apple") = 12
1.3 Adding an Element (Add)

def add(element):
    """Add an element to the Bloom filter"""
    for hash_func in hash_functions:
        # Compute the hash and map it to a position in the bit array
        index = hash_func(element) % bit_array_size
        # Set that bit to 1
        bit_array[index] = 1

Example: adding "apple"

Initial state:
Index: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Bits:  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Compute the hashes:
hash1("apple") % 16 = 3
hash2("apple") % 16 = 7
hash3("apple") % 16 = 12

After the add:
Index: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Bits:  0  0  0  1  0  0  0  1  0  0  0  0  1  0  0  0
                ↑           ↑              ↑
              hash1       hash2          hash3

1.4 Querying an Element (Contains)

def contains(element):
    """Check whether an element may be present"""
    for hash_func in hash_functions:
        index = hash_func(element) % bit_array_size
        # If any bit is 0, the element is definitely absent
        if bit_array[index] == 0:
            return False
    # Every bit is 1, so the element may be present
    return True

Example: querying "apple" and "banana"

Current state:
Index: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Bits:  0  0  0  1  0  0  0  1  0  0  0  0  1  0  0  0

Query "apple":
hash1("apple") % 16 = 3  → bit[3] = 1 ✓
hash2("apple") % 16 = 7  → bit[7] = 1 ✓
hash3("apple") % 16 = 12 → bit[12] = 1 ✓
Result: possibly present ✅

Query "banana":
hash1("banana") % 16 = 2  → bit[2] = 0 ✗
Result: definitely absent ❌
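The add and contains steps above can be combined into a small in-memory filter that actually runs. This is a minimal sketch rather than a production implementation: the class name `TinyBloomFilter` is made up for illustration, the k hash functions are simulated with salted BLAKE2b from the standard library, and one byte is spent per bit to keep the code readable.

```python
import hashlib

class TinyBloomFilter:
    """In-memory Bloom filter: a bit array plus hash_num salted hashes."""

    def __init__(self, size=1024, hash_num=3):
        self.size = size
        self.hash_num = hash_num
        self.bits = bytearray(size)  # one byte per bit, for readability only

    def _positions(self, element):
        """Yield one array position per hash function."""
        data = str(element).encode("utf-8")
        for seed in range(self.hash_num):
            salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash
            digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
            yield int.from_bytes(digest, "little") % self.size

    def add(self, element):
        for pos in self._positions(element):
            self.bits[pos] = 1

    def contains(self, element):
        # Stops at the first 0 bit: one 0 bit proves the element was never added
        return all(self.bits[pos] for pos in self._positions(element))

bf = TinyBloomFilter()
for fruit in ("apple", "orange", "grape"):
    bf.add(fruit)

# Every added element is reported as present: there are no false negatives
print(all(bf.contains(fruit) for fruit in ("apple", "orange", "grape")))
```

An element that was never added usually hits at least one 0 bit and is rejected, but nothing guarantees that, which is the subject of the next section.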

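1.5 Why False Positives Happen

Suppose "apple", "orange", and "grape" have been added

Bit array state:
Index: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Bits:  0  1  0  1  0  1  0  1  0  0  1  0  1  0  0  1

Query "mango" (never added):
hash1("mango") % 16 = 3  → bit[3] = 1 ✓ (set by apple)
hash2("mango") % 16 = 5  → bit[5] = 1 ✓ (set by orange)
hash3("mango") % 16 = 15 → bit[15] = 1 ✓ (set by grape)

Result: wrongly reported as present! ❌

Cause:
Those bits had already been set to 1 by other elements
In other words, a "collision" occurred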
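Collisions like this can be measured directly. The sketch below fills a small bit array, probes it with keys that were never added, and compares the measured false-positive rate with the usual approximation (1 - e^(-kn/m))^k. The parameters m, k, n and the key names are arbitrary choices for illustration, and the salted BLAKE2b hashing is an assumption standing in for k independent hash functions.

```python
import hashlib
import math

def positions(element, size, hash_num):
    """Return hash_num array positions for element, using salted BLAKE2b."""
    data = str(element).encode("utf-8")
    return [
        int.from_bytes(
            hashlib.blake2b(data, digest_size=8, salt=seed.to_bytes(8, "little")).digest(),
            "little",
        ) % size
        for seed in range(hash_num)
    ]

m, k, n = 10_000, 3, 2_000  # bits, hash functions, inserted elements
bits = bytearray(m)
for i in range(n):
    for pos in positions(f"member:{i}", m, k):
        bits[pos] = 1

# Probe with keys that were never inserted; every hit is a false positive
probes = 20_000
false_positives = sum(
    all(bits[pos] for pos in positions(f"stranger:{i}", m, k))
    for i in range(probes)
)

measured = false_positives / probes
predicted = (1 - math.exp(-k * n / m)) ** k
print(f"measured  false-positive rate: {measured:.3f}")
print(f"predicted false-positive rate: {predicted:.3f}")
```

The two numbers should land close to each other, and both grow quickly as more elements are packed into the same array.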

Part 2: Implementing a Bloom Filter with Redis 🛠️

2.1 Option 1: Use a Redis Bitmap

import redis
import mmh3  # MurmurHash3

class RedisBloomFilter:
    """Bloom filter built on a Redis bitmap"""

    def __init__(self, redis_client, key, size=10000000, hash_num=7):
        """
        :param redis_client: Redis connection
        :param key: Redis key name
        :param size: size of the bit array, in bits
        :param hash_num: number of hash functions
        """
        self.redis = redis_client
        self.key = key
        self.size = size
        self.hash_num = hash_num

    def _get_hash_positions(self, element):
        """Compute the bit positions for an element"""
        positions = []
        for seed in range(self.hash_num):
            # MurmurHash3 with a different seed acts as a different hash function.
            # mmh3.hash returns a signed 32-bit integer; Python's % still yields
            # a non-negative position.
            hash_value = mmh3.hash(str(element), seed)
            position = hash_value % self.size
            positions.append(position)
        return positions

    def add(self, element):
        """Add an element"""
        positions = self._get_hash_positions(element)

        # Send all SETBIT commands in one pipeline round trip
        pipe = self.redis.pipeline()
        for pos in positions:
            pipe.setbit(self.key, pos, 1)
        pipe.execute()

        print(f"✅ Added element: {element}")

    def contains(self, element):
        """Check whether an element may be present"""
        positions = self._get_hash_positions(element)

        # Send all GETBIT commands in one pipeline round trip
        pipe = self.redis.pipeline()
        for pos in positions:
            pipe.getbit(self.key, pos)
        results = pipe.execute()

        # Return True only if every bit is 1
        return all(results)

    def add_batch(self, elements):
        """Add many elements in a single pipeline"""
        pipe = self.redis.pipeline()

        for element in elements:
            positions = self._get_hash_positions(element)
            for pos in positions:
                pipe.setbit(self.key, pos, 1)

        pipe.execute()
        print(f"✅ Added {len(elements)} elements in one batch")


# Usage example
if __name__ == "__main__":
    r = redis.Redis(host='localhost', port=6379)

    # Create the Bloom filter
    bf = RedisBloomFilter(r, key="user:blacklist", size=10000000, hash_num=7)

    # Add elements
    bf.add("user:12345")
    bf.add("user:67890")

    # Query
    print(bf.contains("user:12345"))  # True (it was added)
    print(bf.contains("user:99999"))  # False (not present)
    print(bf.contains("user:11111"))  # could in principle be True (a false positive)
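It helps to know how many elements the parameters `size=10000000` and `hash_num=7` can actually carry. The sketch below is plain arithmetic, with no Redis involved, based on the standard Bloom filter formulas. The helper names are illustrative, and the byte count assumes the bitmap ends up fully allocated, which happens once a high offset has been set.

```python
import math

def max_elements(m_bits, error_rate):
    """Largest n an optimally tuned filter of m_bits can hold at error_rate."""
    return math.floor(m_bits * (math.log(2) ** 2) / -math.log(error_rate))

def false_positive_rate(m_bits, n, k):
    """Approximate false-positive rate after n insertions with k hashes."""
    return (1 - math.exp(-k * n / m_bits)) ** k

size, hash_num = 10_000_000, 7

print("bitmap bytes when fully allocated:", size // 8)
print("elements supported at 1%         :", max_elements(size, 0.01))
for n in (500_000, 1_000_000, 2_000_000, 5_000_000):
    rate = false_positive_rate(size, n, hash_num)
    print(f"n={n:>9,}  false-positive rate={rate:.4%}")
```

Under these assumptions the parameters suit roughly one million elements. Past that point the false-positive rate climbs steeply, so the bitmap size should be chosen from the expected element count rather than guessed.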

2.2 Option 2: Use the RedisBloom Module (Recommended ⭐⭐⭐⭐⭐)

On Redis 4.0+ you can install the RedisBloom module, which provides native Bloom filter commands.

Installing RedisBloom

# Install with Docker
docker run -p 6379:6379 --name redis-bloom redis/redis-stack-server:latest

# Or build from source
git clone --recursive https://github.com/RedisBloom/RedisBloom.git
cd RedisBloom
make
redis-server --loadmodule ./redisbloom.so

Using RedisBloom

import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

# Create a Bloom filter
# BF.RESERVE key error_rate capacity
r.execute_command('BF.RESERVE', 'user:blacklist', '0.01', '1000000')
# False-positive rate 0.01, expected capacity 1 million

# Add an element
r.execute_command('BF.ADD', 'user:blacklist', 'user:12345')
# Returns: 1 (newly added)

# Add several elements at once
r.execute_command('BF.MADD', 'user:blacklist', 'user:111', 'user:222', 'user:333')
# Returns: [1, 1, 1]

# Check whether an element exists
exists = r.execute_command('BF.EXISTS', 'user:blacklist', 'user:12345')
print(exists)  # 1 (present)

# Check several elements at once
results = r.execute_command('BF.MEXISTS', 'user:blacklist', 'user:111', 'user:999')
print(results)  # [1, 0]

# Inspect the filter
info = r.execute_command('BF.INFO', 'user:blacklist')
print(info)
# Example output (exact values depend on the RedisBloom version):
# ['Capacity', 1000000, 'Size', 1437759, 'Number of filters', 1, ...]
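The `BF.INFO` reply shown above is a flat list of alternating names and values, which is awkward to work with. A small helper can turn it into a dict. This sketch assumes the flat shape from the example; `parse_bf_info` is a hypothetical name, and the sample list is the one quoted above, not captured server output.

```python
def parse_bf_info(reply):
    """Turn a flat [name, value, name, value, ...] reply into a dict."""
    if isinstance(reply, dict):
        # Assumption: some client or protocol settings may already return a mapping
        return dict(reply)
    names = reply[0::2]
    values = reply[1::2]
    return {
        (name.decode() if isinstance(name, bytes) else name): value
        for name, value in zip(names, values)
    }

sample = ['Capacity', 1000000, 'Size', 1437759, 'Number of filters', 1]
info = parse_bf_info(sample)

print(info['Capacity'])
print(info['Number of filters'])
```

With a live connection the call would look like `parse_bf_info(r.execute_command('BF.INFO', 'user:blacklist'))`.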

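Part 3: Calculating the False-Positive Rate 📊

3.1 The False-Positive Formula

The false-positive rate of a Bloom filter is determined by three parameters:

m: size of the bit array (number of bits)
n: number of elements added so far
k: number of hash functions

False-positive rate:
ε ≈ (1 - e^(-kn/m))^k

Derivation:
1. Probability that a given bit is still 0 after one hash: 1 - 1/m
2. Probability that it is still 0 after the k hashes of one element: (1 - 1/m)^k
3. Probability that it is still 0 after n elements: (1 - 1/m)^(kn)
4. Approximation for large m: (1 - 1/m)^(kn) ≈ e^(-kn/m)
5. So a given bit is 1 with probability 1 - e^(-kn/m). A false positive needs all k probed bits to be 1, which gives (1 - e^(-kn/m))^k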
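Step 4 of the derivation is an approximation, so it is worth checking how much it costs. The sketch below evaluates the expression before and after the approximation for a few parameter sets. The function names are illustrative, and both versions share the usual assumption that bit positions are independent.

```python
import math

def exact_rate(m, n, k):
    """Rate using (1 - 1/m)^(kn) directly, before the exponential approximation."""
    return (1 - (1 - 1 / m) ** (k * n)) ** k

def approx_rate(m, n, k):
    """Rate using the approximation (1 - 1/m)^(kn) ≈ e^(-kn/m)."""
    return (1 - math.exp(-k * n / m)) ** k

for m, n, k in [(16, 3, 3), (10_000, 1_000, 7), (9_585_059, 1_000_000, 7)]:
    print(
        f"m={m:>9,} n={n:>9,} k={k}  "
        f"exact={exact_rate(m, n, k):.6f}  approx={approx_rate(m, n, k):.6f}"
    )
```

For the 16-bit toy array from Part 1 the two values differ visibly. For realistically sized filters they agree to several decimal places, which is why the approximate form is the one normally quoted.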

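3.2 The Optimal Number of Hash Functions

For a given m and n, the optimal value of k is:

k_optimal = (m/n) × ln(2) ≈ 0.693 × m/n

At that k the false-positive rate reaches its minimum:
ε_min ≈ 0.6185^(m/n)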
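A quick sweep over k makes the optimum visible: with m and n fixed, the false-positive rate falls as k grows, bottoms out near 0.693 × m/n, and then rises again. The sketch below assumes 10 bits per element, a value chosen only for illustration.

```python
import math

def false_positive_rate(m, n, k):
    """Approximate false-positive rate for m bits, n elements, k hashes."""
    return (1 - math.exp(-k * n / m)) ** k

m, n = 10_000_000, 1_000_000  # 10 bits per element
rates = {k: false_positive_rate(m, n, k) for k in range(1, 15)}
best_k = min(rates, key=rates.get)

for k, rate in rates.items():
    marker = "  <- minimum" if k == best_k else ""
    print(f"k={k:>2}  rate={rate:.4%}{marker}")

print("continuous optimum 0.693 * m/n:", round(m / n * math.log(2), 2))
print("0.6185 ** (m/n)               :", f"{0.6185 ** (m / n):.4%}")
```

Here the best integer k is 7, next to the continuous optimum of about 6.93, and the rate at that point matches 0.6185^(m/n) closely. Pushing k to 12 or more makes the filter both slower and less accurate.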

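3.3 Parameter Calculator

import math

class BloomFilterCalculator:
    """Bloom filter parameter calculator"""

    @staticmethod
    def optimal_params(n, error_rate):
        """
        Compute the optimal parameters from the expected element count and false-positive rate
        :param n: expected number of elements
        :param error_rate: target false-positive rate (e.g. 0.01 means 1%)
        :return: (m, k) bit array size and number of hash functions
        """
        # Number of bits required
        m = -n * math.log(error_rate) / (math.log(2) ** 2)
        m = int(math.ceil(m))

        # Optimal number of hash functions (rounded up to an integer)
        k = (m / n) * math.log(2)
        k = int(math.ceil(k))

        return m, k

    @staticmethod
    def calculate_error_rate(m, n, k):
        """
        Compute the false-positive rate
        :param m: bit array size
        :param n: number of elements
        :param k: number of hash functions
        :return: false-positive rate
        """
        error_rate = (1 - math.exp(-k * n / m)) ** k
        return error_rate

    @staticmethod
    def size_in_mb(m):
        """
        Compute the memory footprint in MB (1 MB = 1024 × 1024 bytes)
        """
        return m / 8 / 1024 / 1024


# Usage example
calc = BloomFilterCalculator()

# Scenario 1: 1 million users, 1% false-positive rate
print("=" * 50)
print("Scenario 1: 1 million users, 1% false-positive rate")
n = 1000000
error_rate = 0.01
m, k = calc.optimal_params(n, error_rate)
print(f"Bit array size: {m:,} bits")
print(f"Memory: {calc.size_in_mb(m):.2f} MB")
print(f"Hash functions: {k}")
print(f"Actual false-positive rate: {calc.calculate_error_rate(m, n, k):.4%}")

# Scenario 2: 10 million users, 0.1% false-positive rate
print("=" * 50)
print("Scenario 2: 10 million users, 0.1% false-positive rate")
n = 10000000
error_rate = 0.001
m, k = calc.optimal_params(n, error_rate)
print(f"Bit array size: {m:,} bits")
print(f"Memory: {calc.size_in_mb(m):.2f} MB")
print(f"Hash functions: {k}")
print(f"Actual false-positive rate: {calc.calculate_error_rate(m, n, k):.4%}")

# Approximate output (the rate in scenario 1 lands slightly above 1% because k is rounded up to 7):
# ==================================================
# Scenario 1: 1 million users, 1% false-positive rate
# Bit array size: 9,585,059 bits
# Memory: 1.14 MB
# Hash functions: 7
# Actual false-positive rate: 1.0039%
# ==================================================
# Scenario 2: 10 million users, 0.1% false-positive rate
# Bit array size: 143,775,876 bits
# Memory: 17.14 MB
# Hash functions: 10
# Actual false-positive rate: 0.1000%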

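3.4 Parameter Reference Table

Expected elements	False-positive rate	Bit array size	Memory	Hash functions
100 thousand	1%	958,506 bits	117 KB	7
100 thousand	0.1%	1,437,759 bits	176 KB	10
1 million	1%	9,585,059 bits	1.14 MB	7
1 million	0.1%	14,377,588 bits	1.71 MB	10
10 million	1%	95,850,584 bits	11.4 MB	7
10 million	0.1%	143,775,876 bits	17.1 MB	10
100 million	1%	958,505,838 bits	114 MB	7
100 million	0.1%	1,437,758,757 bits	171 MB	10

Conclusions:

The lower the false-positive rate, the more memory is needed
For a given m and n, too few or too many hash functions both raise the false-positive rate; the optimum is k ≈ 0.693 × m/n, and a larger k also costs more hashing per operation
A false-positive rate between 1% and 0.1% is the usual choice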

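Part 4: Real-World Use Cases 💼

4.1 Scenario 1: Preventing Cache Penetration

The problem:

A flood of requests for keys that do not exist
→ cache miss
→ query the database
→ the database has nothing either
→ return empty
→ the attack continues and the database goes down!

The solution:

class CacheWithBloomFilter:
    """Cache guarded by a Bloom filter"""

    def __init__(self, redis_client, mysql_conn):
        self.redis = redis_client
        self.mysql = mysql_conn
        self.bf = RedisBloomFilter(redis_client, "cache:bloom", size=10000000, hash_num=7)

        # Initialization: load every existing key into the Bloom filter
        self._init_bloom_filter()

    def _init_bloom_filter(self):
        """Initialize the Bloom filter"""
        print("🔄 Initializing Bloom filter...")

        # Read all valid keys from the database
        cursor = self.mysql.cursor()
        cursor.execute("SELECT id FROM users WHERE deleted = 0")

        keys = [f"user:{row[0]}" for row in cursor.fetchall()]

        # Add to the Bloom filter in batches of 1000 (one pipeline per batch)
        for i in range(0, len(keys), 1000):
            batch = keys[i:i+1000]
            self.bf.add_batch(batch)

        print(f"✅ Initialization complete, loaded {len(keys)} keys")

    def _query_db(self, key):
        """Actual database lookup (application-specific, left unimplemented here)"""
        raise NotImplementedError

    def get(self, key):
        """Read through the cache"""
        # 1. Check the Bloom filter first
        if not self.bf.contains(key):
            # Definitely absent: return immediately
            print(f"❌ Bloom filter: {key} does not exist")
            return None

        # 2. Possibly present: check the Redis cache
        value = self.redis.get(key)
        if value is not None:
            print(f"✅ Cache hit: {key}")
            return value

        # 3. Cache miss: query the database
        print(f"🔍 Querying database: {key}")
        value = self._query_db(key)

        if value:
            # Present: write it to the cache
            self.redis.setex(key, 3600, value)
            return value
        else:
            # Really absent (the filter gave a false positive)
            print(f"⚠️ False positive: {key} does not actually exist")
            return None

    def set(self, key, value):
        """Write to the cache"""
        # Write the value to the cache
        self.redis.setex(key, 3600, value)

        # Add the key to the Bloom filter
        self.bf.add(key)

    def delete(self, key):
        """Delete from the cache"""
        self.redis.delete(key)
        # Note: the key cannot be removed from the Bloom filter!
        # It stays there until the filter expires or is rebuilt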
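The effect of the filter on database load can be seen without Redis or MySQL. The sketch below is a simulation under stated assumptions: a dict stands in for the database, `InMemoryBloom` is a hypothetical stand-in for the Redis-backed filter, and the attack is 10,000 lookups for keys that do not exist.

```python
import hashlib

class InMemoryBloom:
    """Stand-in for the Redis-backed filter, with the same add/contains contract."""

    def __init__(self, size=100_000, hash_num=7):
        self.size, self.hash_num = size, hash_num
        self.bits = bytearray(size)

    def _positions(self, key):
        data = key.encode("utf-8")
        for seed in range(self.hash_num):
            salt = seed.to_bytes(8, "little")
            digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
            yield int.from_bytes(digest, "little") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def contains(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

# Hypothetical "database" of 5,000 users, all loaded into the filter up front
database = {f"user:{i}": f"profile-{i}" for i in range(5_000)}
bloom = InMemoryBloom()
for key in database:
    bloom.add(key)

stats = {"db_queries": 0}

def lookup(key):
    """Return the value for key, touching the database only if the filter allows it."""
    if not bloom.contains(key):
        return None  # definitely absent: the database is never queried
    stats["db_queries"] += 1
    return database.get(key)  # still None when the filter gave a false positive

attack = [f"user:{i}" for i in range(1_000_000, 1_010_000)]  # keys that do not exist
results = [lookup(key) for key in attack]
print(f"database queries during attack: {stats['db_queries']} of {len(attack)}")
```

With about 20 bits per stored key, only a handful of the bogus lookups should reach the database, while every real key still passes the filter.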

4.2 Scenario 2: Preventing Duplicate Pushes

class PushService:
    """Push service (prevents duplicate pushes)"""

    def __init__(self, redis_client):
        self.redis = redis_client
        # One Bloom filter per day
        self.bf_key_prefix = "push:dedup:"

    def _get_bf_key(self):
        """Return today's Bloom filter key"""
        from datetime import datetime
        today = datetime.now().strftime("%Y%m%d")
        return f"{self.bf_key_prefix}{today}"

    def push(self, user_id, message):
        """Push a message"""
        bf_key = self._get_bf_key()

        # Build a unique identifier
        push_id = f"{user_id}:{message['type']}:{message['content_id']}"

        # BF.ADD returns 1 if the item was newly added and 0 if it may already exist,
        # so a single call checks and records in one step (no gap between BF.EXISTS and BF.ADD).
        # A false positive here means a new message is occasionally skipped.
        added = self.redis.execute_command('BF.ADD', bf_key, push_id)

        if not added:
            print(f"⚠️ Duplicate push filtered: {push_id}")
            return False

        # Perform the push
        print(f"✅ Push succeeded: {push_id}")
        self._do_push(user_id, message)

        return True

    def _do_push(self, user_id, message):
        """Actual push logic"""
        pass


# Usage example
import redis
redis_client = redis.Redis(host='localhost', port=6379)
push_service = PushService(redis_client)

# Push a message
push_service.push(12345, {"type": "article", "content_id": 999})
# ✅ Push succeeded

push_service.push(12345, {"type": "article", "content_id": 999})
# ⚠️ Duplicate push filtered

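4.3 Scenario 3: URL Deduplication for a Crawler

class WebCrawler:
    """URL deduplication for a crawler"""

    def __init__(self, redis_client):
        self.redis = redis_client
        self.bf = RedisBloomFilter(redis_client, "crawler:urls", size=960000000, hash_num=7)
        # About 9.6 bits per URL: supports 100 million URLs at roughly 1% false positives (~114 MB)

    def add_url(self, url):
        """Add a URL to the crawl queue"""
        # Check whether the URL has been seen before
        if self.bf.contains(url):
            print(f"⚠️ URL already seen, skipping: {url}")
            return False

        # Add it to the Bloom filter
        self.bf.add(url)

        # Add it to the crawl queue
        self.redis.lpush("crawler:queue", url)
        print(f"✅ New URL queued: {url}")

        return True

    def crawl(self):
        """Crawl pages"""
        while True:
            # Take a URL from the queue
            url = self.redis.rpop("crawler:queue")

            if not url:
                break

            print(f"🕷️ Crawling: {url}")
            # Fetch the page...
            # Extract new links...
            # Add the new links to the queue


# Usage example
import redis
redis_client = redis.Redis(host='localhost', port=6379)
crawler = WebCrawler(redis_client)

crawler.add_url("https://example.com/page1")  # ✅ New URL queued
crawler.add_url("https://example.com/page2")  # ✅ New URL queued
crawler.add_url("https://example.com/page1")  # ⚠️ URL already seen, skipping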
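Sizing matters a great deal for a crawler, because the bit array must scale with the number of URLs, not merely equal it. The sketch below evaluates the false-positive formula for two array sizes and two URL counts with k = 7. The specific sizes are chosen only to illustrate the effect.

```python
import math

def false_positive_rate(m, n, k):
    """Approximate false-positive rate for m bits, n elements, k hashes."""
    return (1 - math.exp(-k * n / m)) ** k

k = 7
for m in (100_000_000, 960_000_000):
    for n in (10_000_000, 100_000_000):
        rate = false_positive_rate(m, n, k)
        print(f"m={m:>11,} bits  n={n:>11,} URLs  false-positive rate={rate:.4f}")
```

With one bit per URL (100 million bits for 100 million URLs) the filter answers "already seen" for almost every new URL, so the crawler would stall. At roughly 9.6 bits per URL the rate drops to about 1%.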

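Part 5: Optimizations and Caveats ⚠️

5.1 Bloom Filters Cannot Delete

The problem:

user:12345 was added
Later user:12345 was deleted
But it cannot be removed from the Bloom filter!

Solution 1: Rebuild periodically

def rebuild_bloom_filter():
    """Rebuild the Bloom filter on a schedule"""
    # Clear any leftover temporary key from an interrupted rebuild
    redis_client.delete("cache:bloom:new")

    # Create a new Bloom filter under a temporary key
    new_bf = RedisBloomFilter(redis_client, "cache:bloom:new")

    # Load the currently valid keys from the database
    # (get_valid_keys_from_db is an application-specific helper)
    valid_keys = get_valid_keys_from_db()

    for key in valid_keys:
        new_bf.add(key)

    # Switch over atomically
    redis_client.rename("cache:bloom:new", "cache:bloom")

    print("✅ Bloom filter rebuild complete")

# Scheduled job: rebuild every day at 3 a.m.

Solution 2: Use a Counting Bloom Filter

# Use several bits (a small counter) per position instead of a single bit, which makes deletion possible
# The cost is more space (4-16×)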
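The counting variant can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not a RedisBloom feature: `CountingBloomFilter` is a made-up class, plain Python integers stand in for the packed 4-bit or 8-bit counters a real implementation would use, and salted BLAKE2b simulates the k hash functions.

```python
import hashlib

class CountingBloomFilter:
    """Bloom filter with a counter per position, so elements can be removed."""

    def __init__(self, size=1024, hash_num=3):
        self.size = size
        self.hash_num = hash_num
        self.counters = [0] * size  # a real implementation would pack small counters

    def _positions(self, element):
        data = str(element).encode("utf-8")
        for seed in range(self.hash_num):
            salt = seed.to_bytes(8, "little")
            digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
            yield int.from_bytes(digest, "little") % self.size

    def add(self, element):
        for pos in self._positions(element):
            self.counters[pos] += 1

    def contains(self, element):
        return all(self.counters[pos] > 0 for pos in self._positions(element))

    def remove(self, element):
        """Decrement the element's counters; only safe for elements that were added."""
        if not self.contains(element):
            return False
        for pos in self._positions(element):
            self.counters[pos] -= 1
        return True

cbf = CountingBloomFilter()
cbf.add("user:12345")
cbf.add("user:67890")
cbf.remove("user:12345")

print(cbf.contains("user:12345"))
print(cbf.contains("user:67890"))
```

Removing an element that was never added can corrupt the counters of other elements, which is why `remove` refuses when the element does not appear to be present. That check is itself subject to false positives, so callers still need to track what they inserted.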

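Solution 3: Time sharding

def get_bf_key(ttl_days=7):
    """One Bloom filter per day, kept for 7 days; returns the keys that are still valid"""
    from datetime import datetime, timedelta

    today = datetime.now()
    bf_keys = []

    for i in range(ttl_days):
        date = (today - timedelta(days=i)).strftime("%Y%m%d")
        bf_keys.append(f"cache:bloom:{date}")

    return bf_keys

def bf_exists(bf_key, element):
    """Check one daily filter with RedisBloom (BF.EXISTS returns 0 for a missing key)"""
    return bool(redis_client.execute_command('BF.EXISTS', bf_key, element))

def contains(element):
    """Query every Bloom filter that is still valid"""
    for bf_key in get_bf_key():
        if bf_exists(bf_key, element):
            return True
    return False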

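5.2 Capacity Planning

# Initial estimate
Expected number of elements: 10 million
False-positive rate: 0.1%
→ Memory needed: 17.1 MB

Actual plan:
- Reserve a 50% buffer → capacity of 15 million
- Memory needed: 25.7 MB
- With RedisBloom: BF.RESERVE key 0.001 15000000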
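The planning steps above reduce to a short calculation. The sketch below applies the 50% buffer and the standard sizing formula. `plan` is a hypothetical helper, and the result is the theoretical minimum, so the memory RedisBloom actually allocates may differ somewhat.

```python
import math

def plan(expected_items, error_rate, headroom=0.5):
    """Return (capacity, size_mb) for a filter with the given headroom."""
    capacity = math.ceil(expected_items * (1 + headroom))
    bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    size_mb = bits / 8 / 1024 / 1024
    return capacity, size_mb

capacity, size_mb = plan(10_000_000, 0.001)

print(f"BF.RESERVE key 0.001 {capacity}")
print(f"estimated size: {size_mb:.1f} MB")
```

Memory grows linearly with capacity, so the 50% buffer costs exactly 50% more memory. That is cheap insurance against the false-positive rate degrading once the element count passes the original estimate.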

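5.3 Performance Optimization

# 1. Batch operations (redis-cli)
BF.MADD key element1 element2 element3  # faster than several separate BF.ADD calls

# 2. Pipelining (Python, redis-py)
pipe = r.pipeline()
for element in elements:
    pipe.execute_command('BF.ADD', key, element)
pipe.execute()

# 3. Set a sensible expiration (redis-cli)
EXPIRE bloom:filter:20240101 86400  # expires after 24 hours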
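When the input is large, a single `BF.MADD` with millions of arguments is impractical, so the elements are usually split into chunks. The sketch below only builds the argument tuples and does not contact Redis. The helper names and the chunk size of 1000 are assumptions for illustration.

```python
from itertools import islice

def chunked(iterable, size):
    """Yield lists of at most `size` items, preserving order."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

def madd_commands(key, elements, chunk_size=1000):
    """Build one BF.MADD argument tuple per chunk of elements."""
    return [("BF.MADD", key, *chunk) for chunk in chunked(elements, chunk_size)]

commands = madd_commands("user:blacklist", (f"user:{i}" for i in range(2500)))

print(len(commands))
print([len(command) - 2 for command in commands])
```

Each tuple can then be sent with `r.execute_command(*command)`, or queued on a pipeline to cut round trips further.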

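🎓 Summary: Choosing a Bloom Filter

         [Need deduplication or filtering?]
                      |
           ┌──────────┴──────────┐
           ↓                     ↓
   [How much data?]    [Can you tolerate false positives?]
           |                     |
      < 1 million               Yes
           |                     |
          Set               Bloom filter

      > 10 million               No
           |                     |
      Bloom filter          Set / database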

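Mnemonic 🎵

Bloom filters work like magic,
a bitmap plus hash functions.
They save about 95% of the space,
and lookups fly.

If it says absent, it is absent,
if it says present, it may be wrong.
You can add but never delete,
so rebuild on a schedule to stay fresh.

It guards against cache penetration,
it excels at URL deduplication.
It handles blacklists and abuse prevention,
and push deduplication works too!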

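Interview Essentials ⭐

Principle: a bitmap plus k hash functions
Properties: no false negatives ("absent" means absent), possible false positives ("present" may be wrong)
False-positive rate: ε ≈ (1 - e^(-kn/m))^k
Optimal k: k = 0.693 × m/n
No deletion: rebuild the filter or use a counting Bloom filter
Use cases: cache penetration, deduplication, blacklist filtering
RedisBloom: BF.RESERVE, BF.ADD, BF.EXISTS

Final summary:

A Bloom filter is like the "fast lane" at airport security ✈️:

It says you are not on the blacklist: always correct, go straight through ✅
It says you are on the blacklist: possibly wrong, needs a second check 🔍
Benefit: extremely fast and light on resources
Cost: it may flag an innocent person (but it never lets a bad guy through)

Remember: give up a little certainty to save a lot of space and time! 🎯

Keep going, high-performance systems architect! 💪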