Redis Red Book: HyperLogLog and Bloom Filters in Depth



Overview

HyperLogLog and the Bloom filter are two important probabilistic data structures that play a central role in big-data processing and distributed systems. This article takes a close look at how both are implemented, when to use them, and how they are applied in practice.

1. HyperLogLog in Depth

1.1 Basic Concepts

HyperLogLog is a probabilistic data structure for estimating the cardinality of a set (the number of distinct elements). It can estimate the cardinality of extremely large data sets using very little memory, with a standard error of about 1.04/√m, where m is the number of buckets (registers).
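For Redis's default configuration this error bound works out as follows (a quick arithmetic check, not Redis source code):

```java
public class HllError {
    public static void main(String[] args) {
        int m = 1 << 14; // 16384 registers, Redis's default
        double stdError = 1.04 / Math.sqrt(m); // 1.04 / 128
        System.out.printf("%.4f%%%n", stdError * 100); // prints "0.8125%"
    }
}
```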

1.2 How HyperLogLog Works

1.2.1 Overall Structure
graph TB
    subgraph "HyperLogLog data structure"
        subgraph "Bucket array (16384 buckets)"
            B0["Bucket 0: 5"]
            B1["Bucket 1: 3"]
            B2["Bucket 2: 7"]
            B3["Bucket 3: 2"]
            B4["..."]
            B16383["Bucket 16383: 4"]
        end
        
        subgraph "Each bucket stores"
            MAX["Max leading-zero rank"]
            RANGE["Range: 0-63"]
            BITS["Storage: 6 bits"]
        end
    end
    
    subgraph "Hash function"
        HASH["64-bit hash value"]
        PREFIX["Top 14 bits → bucket index"]
        SUFFIX["Low 50 bits → leading-zero count"]
    end
    
    subgraph "Estimation formula"
        FORMULA["Cardinality = α × m² / Σ(2^(-M[j]))"]
        ALPHA["α: bias-correction constant"]
        M_VAL["m: number of buckets"]
        M_J["M[j]: value of bucket j"]
    end
    
    HASH --> PREFIX
    HASH --> SUFFIX
    PREFIX --> B0
    PREFIX --> B1
    PREFIX --> B2
    SUFFIX --> MAX
    B0 --> FORMULA
    B1 --> FORMULA
    B2 --> FORMULA
    
    style B0 fill:#e3f2fd
    style B1 fill:#e3f2fd
    style B2 fill:#e3f2fd
    style MAX fill:#fff3e0
    style FORMULA fill:#e8f5e8
1.2.2 The Add Path, Step by Step
flowchart TD
    START(["Start: add element 'user123'"]) --> HASH_CALC["Compute hash"]
    HASH_CALC --> HASH_RESULT["Hash: 0x3A7F...B2C8"]
    HASH_RESULT --> BINARY["Convert to binary"]
    BINARY --> BINARY_RESULT["0011101001111111...10110010110001000"]
    
    BINARY_RESULT --> SPLIT["Split the hash"]
    SPLIT --> PREFIX_BITS["Top 14 bits: 00111010011111"]
    SPLIT --> SUFFIX_BITS["Low 50 bits: 11...10110010110001000"]
    
    PREFIX_BITS --> BUCKET_NUM["Bucket index = 3743"]
    SUFFIX_BITS --> LEADING_ZEROS["Compute leading-zero rank"]
    LEADING_ZEROS --> ZERO_COUNT["Rank = 2 (leading zeros + 1)"]
    
    BUCKET_NUM --> CHECK_BUCKET["Check current value of bucket 3743"]
    CHECK_BUCKET --> CURRENT_VAL["Current value = 1"]
    ZERO_COUNT --> COMPARE["Compare: max(1, 2)"]
    CURRENT_VAL --> COMPARE
    COMPARE --> UPDATE["Set bucket 3743 = 2"]
    UPDATE --> END(["Done"])
    
    style HASH_RESULT fill:#e1f5fe
    style BINARY_RESULT fill:#f3e5f5
    style BUCKET_NUM fill:#e8f5e8
    style ZERO_COUNT fill:#fff3e0
    style UPDATE fill:#ffebee
1.2.3 A Concrete Example

Let's walk through a concrete example to see how HyperLogLog behaves:

Sample data set: ["user1", "user2", "user3", "user1", "user4"]

sequenceDiagram
    participant Input as Input element
    participant Hash as Hash function
    participant Bucket as Bucket array
    participant Counter as Leading-zero counter
    
    Note over Input,Counter: Add "user1"
    Input->>Hash: "user1"
    Hash->>Hash: Compute 64-bit hash
    Hash-->>Input: 0x1A2B3C4D5E6F7890
    Hash->>Bucket: Top 14 bits → bucket 6789
    Hash->>Counter: Count leading zeros in low 50 bits
    Counter-->>Bucket: Rank = 3
    Bucket->>Bucket: Bucket 6789: 0→3
    
    Note over Input,Counter: Add "user2"
    Input->>Hash: "user2"
    Hash-->>Input: 0x9F8E7D6C5B4A3210
    Hash->>Bucket: Top 14 bits → bucket 2543
    Hash->>Counter: Count leading zeros in low 50 bits
    Counter-->>Bucket: Rank = 1
    Bucket->>Bucket: Bucket 2543: 0→1
    
    Note over Input,Counter: Add "user3"
    Input->>Hash: "user3"
    Hash-->>Input: 0x5A5A5A5A5A5A5A5A
    Hash->>Bucket: Top 14 bits → bucket 1434
    Hash->>Counter: Count leading zeros in low 50 bits
    Counter-->>Bucket: Rank = 4
    Bucket->>Bucket: Bucket 1434: 0→4
    
    Note over Input,Counter: Add "user1" again (duplicate)
    Input->>Hash: "user1"
    Hash-->>Input: 0x1A2B3C4D5E6F7890 (same hash)
    Hash->>Bucket: Top 14 bits → bucket 6789 (same bucket)
    Hash->>Counter: Count leading zeros in low 50 bits
    Counter-->>Bucket: Rank = 3 (same value)
    Bucket->>Bucket: Bucket 6789: max(3,3)=3 (unchanged)
    
    Note over Input,Counter: Add "user4"
    Input->>Hash: "user4"
    Hash-->>Input: 0x0F0F0F0F0F0F0F0F
    Hash->>Bucket: Top 14 bits → bucket 963
    Hash->>Counter: Count leading zeros in low 50 bits
    Counter-->>Bucket: Rank = 5
    Bucket->>Bucket: Bucket 963: 0→5
1.2.4 Bucket-State Timeline
graph LR
    subgraph "Initial state"
        I0["All buckets = 0"]
    end
    
    subgraph "After adding user1"
        A1["Bucket 6789 = 3"]
        A2["Other buckets = 0"]
    end
    
    subgraph "After adding user2"
        B1["Bucket 6789 = 3"]
        B2["Bucket 2543 = 1"]
        B3["Other buckets = 0"]
    end
    
    subgraph "After adding user3"
        C1["Bucket 6789 = 3"]
        C2["Bucket 2543 = 1"]
        C3["Bucket 1434 = 4"]
        C4["Other buckets = 0"]
    end
    
    subgraph "After duplicate user1"
        D1["Bucket 6789 = 3 (unchanged)"]
        D2["Bucket 2543 = 1"]
        D3["Bucket 1434 = 4"]
        D4["Other buckets = 0"]
    end
    
    subgraph "After adding user4"
        E1["Bucket 6789 = 3"]
        E2["Bucket 2543 = 1"]
        E3["Bucket 1434 = 4"]
        E4["Bucket 963 = 5"]
        E5["Other buckets = 0"]
    end
    
    I0 --> A1
    A1 --> B1
    B1 --> C1
    C1 --> D1
    D1 --> E1
    
    style A1 fill:#e3f2fd
    style B2 fill:#e8f5e8
    style C3 fill:#fff3e0
    style E4 fill:#ffebee
1.2.5 Computing the Cardinality Estimate

Based on the example above, let's compute the final cardinality estimate:

Current bucket state:

  • Bucket 963: 5
  • Bucket 1434: 4
  • Bucket 2543: 1
  • Bucket 6789: 3
  • The other 16380 buckets: 0
flowchart TD
    START(["Start cardinality estimation"]) --> COLLECT["Collect all bucket values"]
    COLLECT --> BUCKET_VALUES["Bucket values: [5,4,1,3,0,0,0,...]"]
    
    BUCKET_VALUES --> HARMONIC["Compute the harmonic-mean denominator"]
    HARMONIC --> SUM_CALC["Σ(2^(-M[j]))"]
    SUM_CALC --> SUM_DETAIL["2^(-5) + 2^(-4) + 2^(-1) + 2^(-3) + 16380×2^0"]
    SUM_DETAIL --> SUM_RESULT["0.03125 + 0.0625 + 0.5 + 0.125 + 16380"]
    SUM_RESULT --> SUM_FINAL["≈ 16380.72"]
    
    SUM_FINAL --> FORMULA["Apply the HLL formula"]
    FORMULA --> ALPHA_M["α₁₆₃₈₄ ≈ 0.7213"]
    FORMULA --> M_SQUARED["m² = 16384² ≈ 268M"]
    
    ALPHA_M --> CALCULATE["Estimate = α × m² / Σ"]
    M_SQUARED --> CALCULATE
    SUM_FINAL --> CALCULATE
    
    CALCULATE --> RESULT["≈ 0.7213 × 268M / 16380.72"]
    RESULT --> FINAL["≈ 11,820"]
    
    FINAL --> CORRECTION{"Correction needed?"}
    CORRECTION -->|Small range| SMALL_RANGE["Linear-counting correction"]
    CORRECTION -->|Large range| LARGE_RANGE["Large-range correction"]
    CORRECTION -->|Mid range| NO_CORRECTION["No correction"]
    
    SMALL_RANGE --> CORRECTED_RESULT["Corrected result"]
    LARGE_RANGE --> CORRECTED_RESULT
    NO_CORRECTION --> CORRECTED_RESULT
    CORRECTED_RESULT --> END(["Final estimate: ≈ 4"])
    
    note1["Note: the raw estimate is dominated by the 16380 empty buckets\nthe small-range (linear-counting) correction gives ≈ 4"]
    FINAL -.-> note1
    
    style SUM_DETAIL fill:#e1f5fe
    style RESULT fill:#fff3e0
    style FINAL fill:#e8f5e8
    style END fill:#ffebee
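The flowchart's arithmetic can be replayed directly. A minimal sketch (plain Java, no Redis) that reproduces both the raw estimate and the linear-counting correction for the four non-zero buckets above:

```java
public class EstimateCheck {
    public static void main(String[] args) {
        int m = 16384;
        int[] nonZero = {5, 4, 1, 3};              // buckets 963, 1434, 2543, 6789
        double alpha = 0.7213 / (1 + 1.079 / m);   // ≈ 0.7213 for large m

        // Harmonic-mean denominator: four non-zero buckets plus 16380 zero buckets
        double sum = 0.0;
        for (int v : nonZero) sum += Math.pow(2, -v);
        sum += (m - nonZero.length);               // each zero bucket contributes 2^0 = 1

        double raw = alpha * m * (double) m / sum; // on the order of 1.2e4, far too high
        int zeros = m - nonZero.length;
        double corrected = m * Math.log(m / (double) zeros); // linear counting, ≈ 4.0

        System.out.printf("raw=%.0f corrected=%.1f%n", raw, corrected);
    }
}
```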
1.2.6 Leading-Zero Counting in Detail
graph TD
    subgraph "Leading-zero examples"
        subgraph "Low 50 bits of user1's hash"
            U1_BINARY["001010...110001000"]
            U1_LEADING["Leading zeros = 2"]
            U1_RESULT["Stored value: 2+1 = 3"]
        end
        
        subgraph "Low 50 bits of user2's hash"
            U2_BINARY["110101...010010000"]
            U2_LEADING["Leading zeros = 0"]
            U2_RESULT["Stored value: 0+1 = 1"]
        end
        
        subgraph "Low 50 bits of user3's hash"
            U3_BINARY["000110...101010101"]
            U3_LEADING["Leading zeros = 3"]
            U3_RESULT["Stored value: 3+1 = 4"]
        end
        
        subgraph "Low 50 bits of user4's hash"
            U4_BINARY["000010...111111111"]
            U4_LEADING["Leading zeros = 4"]
            U4_RESULT["Stored value: 4+1 = 5"]
        end
    end
    
    NOTE["Note: stored value = leading zeros + 1\nso 'no element yet' (0) is distinguishable from 'zero leading zeros' (1)"]
    
    style U1_RESULT fill:#e3f2fd
    style U2_RESULT fill:#e8f5e8
    style U3_RESULT fill:#fff3e0
    style U4_RESULT fill:#ffebee
    style NOTE fill:#f5f5f5
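The +1 convention maps cleanly onto Java's Long.numberOfLeadingZeros once the 50-bit suffix is shifted into the high bits of a long. A small sketch (the bit patterns are illustrative, not real hashes):

```java
public class RankDemo {
    // Stored value = leading zeros of the suffix + 1, so an untouched bucket (0)
    // is distinguishable from "zero leading zeros" (1).
    static int rank(long suffixInHighBits) {
        return Long.numberOfLeadingZeros(suffixInHighBits) + 1;
    }

    public static void main(String[] args) {
        System.out.println(rank(1L << 63)); // 100...   -> 0 leading zeros -> stores 1
        System.out.println(rank(1L << 61)); // 001...   -> 2 leading zeros -> stores 3
        System.out.println(rank(1L << 59)); // 00001... -> 4 leading zeros -> stores 5
    }
}
```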
1.2.7 A Worked Example by Hand

Let's verify the HyperLogLog computation with a simplified example:

Assume a simplified version with 4 buckets (m=4, b=2 bits):

graph TD
    subgraph "Simplified example: 4 buckets"
        subgraph "Input data"
            INPUT["Elements: ['A', 'B', 'C', 'A', 'D']"]
        end
        
        subgraph "Hashing and bucketing"
            HASH_A["A → hash: 1100..."]
            HASH_B["B → hash: 0110..."]
            HASH_C["C → hash: 0010..."]
            HASH_D["D → hash: 1010..."]
            
            BUCKET_A["First 2 bits=11 → bucket 3"]
            BUCKET_B["First 2 bits=01 → bucket 1"]
            BUCKET_C["First 2 bits=00 → bucket 0"]
            BUCKET_D["First 2 bits=10 → bucket 2"]
        end
        
        subgraph "Leading-zero counting"
            LEADING_A["A suffix: 00... → leading zeros=2 → stores 3"]
            LEADING_B["B suffix: 10... → leading zeros=0 → stores 1"]
            LEADING_C["C suffix: 10... → leading zeros=0 → stores 1"]
            LEADING_D["D suffix: 10... → leading zeros=0 → stores 1"]
        end
        
        subgraph "Bucket state"
            BUCKET_STATE["Bucket0=1, Bucket1=1, Bucket2=1, Bucket3=3"]
        end
        
        subgraph "Cardinality calculation"
            SUM_CALC2["Σ = 2^(-1) + 2^(-1) + 2^(-1) + 2^(-3)"]
            SUM_RESULT2["= 0.5 + 0.5 + 0.5 + 0.125 = 1.625"]
            ALPHA_4["α₄ = 0.673"]
            FINAL_CALC["Estimate = 0.673 × 16 / 1.625 ≈ 6.6"]
            ACTUAL["Actual distinct elements: 4 (A,B,C,D)"]
            ERROR["Error: (6.6-4)/4 ≈ 65%"]
        end
    end
    
    INPUT --> HASH_A
    INPUT --> HASH_B
    INPUT --> HASH_C
    INPUT --> HASH_D
    
    HASH_A --> BUCKET_A
    HASH_B --> BUCKET_B
    HASH_C --> BUCKET_C
    HASH_D --> BUCKET_D
    
    HASH_A --> LEADING_A
    HASH_B --> LEADING_B
    HASH_C --> LEADING_C
    HASH_D --> LEADING_D
    
    BUCKET_A --> BUCKET_STATE
    BUCKET_B --> BUCKET_STATE
    BUCKET_C --> BUCKET_STATE
    BUCKET_D --> BUCKET_STATE
    LEADING_A --> BUCKET_STATE
    LEADING_B --> BUCKET_STATE
    LEADING_C --> BUCKET_STATE
    LEADING_D --> BUCKET_STATE
    
    BUCKET_STATE --> SUM_CALC2
    SUM_CALC2 --> SUM_RESULT2
    SUM_RESULT2 --> FINAL_CALC
    ALPHA_4 --> FINAL_CALC
    FINAL_CALC --> ACTUAL
    ACTUAL --> ERROR
    
    style BUCKET_STATE fill:#e1f5fe
    style FINAL_CALC fill:#fff3e0
    style ACTUAL fill:#e8f5e8
    style ERROR fill:#ffcdd2
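The hand calculation can be replayed in a few lines (plain Java, using the α value from the diagram above):

```java
public class TinyHllCheck {
    public static void main(String[] args) {
        int m = 4;
        int[] buckets = {1, 1, 1, 3};   // bucket state from the diagram
        double alpha = 0.673;           // the α used in the example above

        double sum = 0.0;
        for (int v : buckets) sum += Math.pow(2, -v); // 0.5 + 0.5 + 0.5 + 0.125 = 1.625

        double estimate = alpha * m * m / sum;        // 0.673 × 16 / 1.625 ≈ 6.63
        double error = (estimate - 4) / 4;            // ≈ 65-66% vs the true count of 4
        System.out.printf("estimate=%.2f error=%.1f%%%n", estimate, error * 100);
    }
}
```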

Why is the error so large?

  1. Too few buckets: with only 4 buckets, the standard error is 1.04/√4 = 52%
  2. Too little data: HyperLogLog is designed for large data sets
  3. Standard configuration: Redis uses 16384 buckets, for an error of about 0.81%
1.2.8 A Real Redis HyperLogLog Session
sequenceDiagram
    participant Client as Client
    participant Redis as Redis server
    participant HLL as HyperLogLog structure
    
    Note over Client,HLL: Create and add data
    Client->>Redis: PFADD mykey user1
    Redis->>HLL: Hash user1
    HLL->>HLL: Hash: 0x1A2B3C4D5E6F7890
    HLL->>HLL: Bucket: 6789, rank: 3
    HLL->>HLL: Set bucket 6789 = max(0,3) = 3
    Redis-->>Client: Returns: 1 (a register changed)
    
    Client->>Redis: PFADD mykey user2
    Redis->>HLL: Hash user2
    HLL->>HLL: Hash: 0x9F8E7D6C5B4A3210
    HLL->>HLL: Bucket: 2543, rank: 1
    HLL->>HLL: Set bucket 2543 = max(0,1) = 1
    Redis-->>Client: Returns: 1 (a register changed)
    
    Client->>Redis: PFADD mykey user1
    Redis->>HLL: Hash user1 (same hash)
    HLL->>HLL: Bucket: 6789, rank: 3
    HLL->>HLL: Set bucket 6789 = max(3,3) = 3
    Redis-->>Client: Returns: 0 (no register changed — duplicate)
    
    Note over Client,HLL: Query the cardinality
    Client->>Redis: PFCOUNT mykey
    Redis->>HLL: Run the estimation algorithm
    HLL->>HLL: Collect all bucket values
    HLL->>HLL: Compute the harmonic mean
    HLL->>HLL: Apply correction formulas
    HLL-->>Redis: Estimated value: 2
    Redis-->>Client: Returns: 2
1.2.9 Mathematical Foundations

Basic probability:

  • If a random binary string starts with exactly k leading zeros (followed by a 1), that event has probability 1/2^(k+1)
  • If the maximum observed leading-zero count is k, the data set size can be estimated as roughly 2^(k+1)
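As a quick numeric illustration of the second bullet (toy numbers, not a real hash stream):

```java
public class MaxZerosEstimate {
    public static void main(String[] args) {
        int k = 10;                    // largest leading-zero count observed so far
        long estimate = 1L << (k + 1); // the event has probability 1/2^(k+1),
                                       // so seeing it suggests ~2^(k+1) elements
        System.out.println(estimate);  // prints 2048
    }
}
```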

Bucketing optimization:

  • Use m buckets to reduce the estimation error
  • Each bucket keeps the maximum leading-zero rank observed
  • Final estimate = α_m × m² × (Σ(2^(-M[j])))^(-1)

1.3 Implementing HyperLogLog

1.3.1 A Basic Implementation
public class HyperLogLog {
    private final int b;         // log2 of the bucket count
    private final int m;         // bucket count = 2^b
    private final double alpha;  // bias-correction constant
    private final int[] buckets; // bucket array
    
    public HyperLogLog(int b) {
        this.b = b;
        this.m = 1 << b; // 2^b
        this.alpha = getAlpha(m);
        this.buckets = new int[m];
    }
    
    /**
     * Add an element.
     */
    public void add(String element) {
        // 1. Compute a 64-bit hash
        long hash = hash64(element);
        
        // 2. The top b bits select the bucket
        int bucketIndex = (int) (hash >>> (64 - b));
        
        // 3. Leading-zero rank of the remaining bits (+1, capped at 64 - b + 1 so w == 0 is handled)
        long w = hash << b;
        int rank = Math.min(64 - b, Long.numberOfLeadingZeros(w)) + 1;
        
        // 4. Keep the per-bucket maximum
        buckets[bucketIndex] = Math.max(buckets[bucketIndex], rank);
    }
    
    /**
     * Estimate the cardinality.
     */
    public long cardinality() {
        // Harmonic-mean denominator: Σ 2^(-M[j])
        double sum = 0.0;
        for (int bucket : buckets) {
            sum += Math.pow(2, -bucket);
        }
        
        double estimate = alpha * m * m / sum;
        
        // Small-range correction: fall back to linear counting
        if (estimate <= 2.5 * m) {
            int zeros = 0;
            for (int bucket : buckets) {
                if (bucket == 0) zeros++;
            }
            if (zeros != 0) {
                return Math.round(m * Math.log(m / (double) zeros));
            }
        }
        
        // Large-range correction (for 32-bit hash spaces)
        if (estimate <= (1.0/30.0) * (1L << 32)) {
            return Math.round(estimate);
        } else {
            return Math.round(-1 * (1L << 32) * Math.log(1 - estimate / (1L << 32)));
        }
    }
    
    /**
     * Bias-correction constant for m buckets.
     */
    private double getAlpha(int m) {
        switch (m) {
            case 16: return 0.673;
            case 32: return 0.697;
            case 64: return 0.709;
            default: return 0.7213 / (1 + 1.079 / m);
        }
    }
    
    /**
     * 64-bit hash. String.hashCode() alone is a poor fit because it only fills
     * 32 bits, so spread it with a SplitMix64-style finalizer. Production code
     * should use a proper 64-bit hash such as MurmurHash3 or xxHash.
     */
    private long hash64(String input) {
        long h = input.hashCode();
        h ^= h >>> 33; h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33; h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }
}
1.3.2 Using HyperLogLog in Redis
@Service
public class RedisHyperLogLogService {
    
    @Autowired
    private RedisTemplate<String, String> redisTemplate;
    
    /**
     * Add elements to a HyperLogLog
     */
    public Long pfAdd(String key, String... elements) {
        return redisTemplate.opsForHyperLogLog().add(key, elements);
    }
    
    /**
     * Get the estimated cardinality
     */
    public Long pfCount(String... keys) {
        return redisTemplate.opsForHyperLogLog().size(keys);
    }
    
    /**
     * Merge multiple HyperLogLogs
     */
    public Long pfMerge(String destKey, String... sourceKeys) {
        return redisTemplate.opsForHyperLogLog().union(destKey, sourceKeys);
    }
    
    /**
     * Example: tracking site UV (unique visitors)
     */
    public void trackUniqueVisitor(String date, String userId) {
        String key = "uv:" + date;
        pfAdd(key, userId);
        
        // Set a TTL
        redisTemplate.expire(key, Duration.ofDays(7));
    }
    
    /**
     * UV for a given date
     */
    public Long getUniqueVisitors(String date) {
        String key = "uv:" + date;
        return pfCount(key);
    }
    
    /**
     * Merged UV across several dates
     */
    public Long getUniqueVisitorsRange(String... dates) {
        String[] keys = Arrays.stream(dates)
            .map(date -> "uv:" + date)
            .toArray(String[]::new);
        return pfCount(keys);
    }
}

1.4 HyperLogLog Use Cases

1.4.1 Site UV Statistics
@RestController
public class AnalyticsController {
    
    @Autowired
    private RedisHyperLogLogService hyperLogLogService;
    
    /**
     * Record a user visit
     */
    @PostMapping("/track")
    public ResponseEntity<String> trackVisit(@RequestParam String userId) {
        String today = LocalDate.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd"));
        hyperLogLogService.trackUniqueVisitor(today, userId);
        return ResponseEntity.ok("Tracked");
    }
    
    /**
     * Today's UV
     */
    @GetMapping("/uv/today")
    public ResponseEntity<Long> getTodayUV() {
        String today = LocalDate.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd"));
        Long uv = hyperLogLogService.getUniqueVisitors(today);
        return ResponseEntity.ok(uv);
    }
    
    /**
     * UV over the last 7 days
     */
    @GetMapping("/uv/week")
    public ResponseEntity<Long> getWeekUV() {
        String[] dates = IntStream.range(0, 7)
            .mapToObj(i -> LocalDate.now().minusDays(i))
            .map(date -> date.format(DateTimeFormatter.ofPattern("yyyy-MM-dd")))
            .toArray(String[]::new);
        
        Long uv = hyperLogLogService.getUniqueVisitorsRange(dates);
        return ResponseEntity.ok(uv);
    }
}
1.4.2 Large-Scale Deduplicated Counting
@Slf4j
@Component
public class BigDataDeduplication {
    
    @Autowired
    private RedisHyperLogLogService hyperLogLogService;
    
    /**
     * Count distinct IP addresses
     */
    public void processLogFile(String logFilePath) {
        String key = "unique_ips:" + LocalDate.now();
        
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(logFilePath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String ip = extractIP(line);
                if (ip != null) {
                    hyperLogLogService.pfAdd(key, ip);
                }
            }
        } catch (IOException e) {
            log.error("Failed to process log file", e);
        }
    }
    
    /**
     * Number of distinct IPs
     */
    public Long getUniqueIPCount() {
        String key = "unique_ips:" + LocalDate.now();
        return hyperLogLogService.pfCount(key);
    }
    
    private String extractIP(String logLine) {
        // Extract the IP address from a log line
        String[] parts = logLine.split(" ");
        return parts.length > 0 ? parts[0] : null;
    }
}

1.5 HyperLogLog Performance Characteristics

1.5.1 Memory Usage
graph LR
    A["Standard error 1.04/√m"] --> B["m=2^14=16384 buckets"]
    B --> C["6 bits per bucket = 12KB"]
    C --> D["Error rate ≈ 0.81%"]
    
    E["Exact counting"] --> F["Must store every element"]
    F --> G["Memory = element count × element size"]
    G --> H["100 million 64-bit integers ≈ 800MB"]
    
    style C fill:#c8e6c9
    style H fill:#ffcdd2
1.5.2 Error Analysis
@Component
public class HyperLogLogAccuracyTest {
    
    public void testAccuracy() {
        HyperLogLog hll = new HyperLogLog(14); // 16384 buckets
        Set<String> actualSet = new HashSet<>();
        
        // Add 1,000,000 random elements (drawn from 800,000 possible values)
        Random random = new Random();
        for (int i = 0; i < 1_000_000; i++) {
            String element = "element_" + random.nextInt(800_000);
            hll.add(element);
            actualSet.add(element);
        }
        
        long actualCardinality = actualSet.size();
        long estimatedCardinality = hll.cardinality();
        
        double errorRate = Math.abs(estimatedCardinality - actualCardinality) 
                          / (double) actualCardinality * 100;
        
        System.out.printf("Actual cardinality: %d%n", actualCardinality);
        System.out.printf("Estimated cardinality: %d%n", estimatedCardinality);
        System.out.printf("Error rate: %.2f%%%n", errorRate);
    }
}

2. Bloom Filters in Depth

2.1 Basic Concepts

A Bloom filter is an extremely space-efficient probabilistic data structure for testing whether an element is a member of a set. It can produce false positives, but never false negatives.

2.2 How a Bloom Filter Works

2.2.1 Core Structure
flowchart TD
    A[Bloom filter] --> B[Bit array]
    A --> C[Hash function family]
    
    B --> D["Bit array size: m bits"]
    C --> E["Number of hash functions: k"]
    
    F[Add element] --> G["Compute k hash values"]
    G --> H["Set the corresponding bits to 1"]
    
    I[Query element] --> J["Compute k hash values"]
    J --> K["Check the corresponding bits"]
    K --> L{"All bits set to 1?"}
    L -->|Yes| M["Possibly present"]
    L -->|No| N["Definitely absent"]
    
    style B fill:#e1f5fe
    style C fill:#fff3e0
    style M fill:#ffecb3
    style N fill:#c8e6c9
2.2.2 Operation Flow
sequenceDiagram
    participant E as Element
    participant H1 as Hash function 1
    participant H2 as Hash function 2
    participant H3 as Hash function 3
    participant B as Bit array
    
    Note over E,B: Adding an element
    E->>H1: Compute hash 1
    E->>H2: Compute hash 2
    E->>H3: Compute hash 3
    H1->>B: Set position i1 to 1
    H2->>B: Set position i2 to 1
    H3->>B: Set position i3 to 1
    
    Note over E,B: Querying an element
    E->>H1: Compute hash 1
    E->>H2: Compute hash 2
    E->>H3: Compute hash 3
    B-->>H1: Check position i1
    B-->>H2: Check position i2
    B-->>H3: Check position i3
    Note over H1,H3: If all bits are 1, the element may be present<br/>If any bit is 0, it is definitely absent
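The add/query flow can be sketched with java.util.BitSet and two toy hash functions (both are illustrative stand-ins, not what Redis or Guava use):

```java
import java.util.BitSet;

public class ToyBloomFilter {
    static final int M = 64;                  // tiny bit array, for illustration only
    static final BitSet bits = new BitSet(M);

    // Two toy hash functions standing in for the k-function family
    static int h1(String s) { return Math.floorMod(s.hashCode(), M); }
    static int h2(String s) { return Math.floorMod(s.hashCode() * 31 + 17, M); }

    static void add(String s) { bits.set(h1(s)); bits.set(h2(s)); }
    static boolean mightContain(String s) { return bits.get(h1(s)) && bits.get(h2(s)); }

    public static void main(String[] args) {
        add("user1");
        add("user2");
        // No false negatives: every added element reports "possibly present"
        System.out.println(mightContain("user1") && mightContain("user2")); // true
        // For an element never added, any 0 bit proves absence; if all probed
        // bits happen to be 1, the filter reports a false positive.
    }
}
```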
2.2.3 Mathematical Foundations

False-positive probability:

  • Bit array size: m
  • Number of hash functions: k
  • Number of inserted elements: n
  • False-positive probability: p ≈ (1 - e^(-kn/m))^k

Optimal parameters:

  • Optimal number of hash functions: k = (m/n) × ln(2)
  • Optimal bit array size: m = -n × ln(p) / (ln(2))²
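Plugging typical numbers into these formulas (n = 1,000,000 expected elements, p = 1% target false-positive rate; illustrative values, not a recommendation):

```java
public class OptimalParams {
    public static void main(String[] args) {
        long n = 1_000_000;  // expected insertions
        double p = 0.01;     // target false-positive rate

        // m = -n·ln(p)/(ln 2)^2, rounded up to whole bits
        long m = (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
        // k = (m/n)·ln 2, rounded to the nearest integer
        int k = (int) Math.round((double) m / n * Math.log(2));

        System.out.println(m + " bits ≈ " + m / 8 / 1024 + " KiB"); // ~9.59M bits ≈ 1.14 MiB
        System.out.println(k + " hash functions");                  // 7
    }
}
```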

2.3 Implementing a Bloom Filter

2.3.1 A Basic Implementation
public class BloomFilter {
    private final BitSet bitSet;
    private final int bitSetSize;
    private final int hashFunctionCount;
    private int addedElements;
    
    /**
     * Construct a Bloom filter
     * @param expectedElements expected number of elements
     * @param falsePositiveRate target false-positive rate
     */
    public BloomFilter(int expectedElements, double falsePositiveRate) {
        this.bitSetSize = optimalBitSetSize(expectedElements, falsePositiveRate);
        this.hashFunctionCount = optimalHashFunctionCount(expectedElements, bitSetSize);
        this.bitSet = new BitSet(bitSetSize);
        this.addedElements = 0;
    }
    
    /**
     * Add an element
     */
    public void add(String element) {
        int[] hashes = getHashes(element);
        for (int hash : hashes) {
            // floorMod avoids the Math.abs(Integer.MIN_VALUE) overflow pitfall
            bitSet.set(Math.floorMod(hash, bitSetSize));
        }
        addedElements++;
    }
    
    /**
     * Check whether an element might be present
     */
    public boolean mightContain(String element) {
        int[] hashes = getHashes(element);
        for (int hash : hashes) {
            if (!bitSet.get(Math.floorMod(hash, bitSetSize))) {
                return false; // definitely absent
            }
        }
        return true; // possibly present
    }
    
    /**
     * Current false-positive probability
     */
    public double getCurrentFalsePositiveRate() {
        double ratio = (double) addedElements / bitSetSize;
        return Math.pow(1 - Math.exp(-hashFunctionCount * ratio), hashFunctionCount);
    }
    
    /**
     * Optimal bit array size: m = -n·ln(p)/(ln 2)^2
     */
    private int optimalBitSetSize(int expectedElements, double falsePositiveRate) {
        return (int) (-expectedElements * Math.log(falsePositiveRate) / (Math.log(2) * Math.log(2)));
    }
    
    /**
     * Optimal number of hash functions: k = (m/n)·ln 2
     */
    private int optimalHashFunctionCount(int expectedElements, int bitSetSize) {
        return Math.max(1, (int) Math.round((double) bitSetSize / expectedElements * Math.log(2)));
    }
    
    /**
     * Derive k hash values via double hashing: g_i = h1 + i·h2
     */
    private int[] getHashes(String element) {
        int[] hashes = new int[hashFunctionCount];
        int hash1 = element.hashCode();
        int hash2 = (hash1 >>> 16) | 1; // force odd so the k probes always differ
        
        for (int i = 0; i < hashFunctionCount; i++) {
            hashes[i] = hash1 + i * hash2;
        }
        return hashes;
    }
    
    /**
     * Statistics
     */
    public String getStats() {
        return String.format(
            "BitSet size: %d, hash functions: %d, elements added: %d, current false-positive rate: %.4f",
            bitSetSize, hashFunctionCount, addedElements, getCurrentFalsePositiveRate()
        );
    }
}
2.3.2 Bloom Filters in Redis
// Requires the RedisBloom module, which provides the BF.* commands
@Service
public class RedisBloomFilterService {
    
    @Autowired
    private RedisTemplate<String, String> redisTemplate;
    
    private static final String BF_PREFIX = "bf:";
    
    /**
     * Create a Bloom filter (BF.RESERVE key error_rate capacity)
     */
    public void createBloomFilter(String key, long expectedInsertions, double falsePositiveRate) {
        String script = 
            "return redis.call('BF.RESERVE', KEYS[1], ARGV[1], ARGV[2])";
        
        redisTemplate.execute((RedisCallback<Object>) connection -> {
            return connection.eval(
                script.getBytes(),
                ReturnType.STATUS,
                1,
                (BF_PREFIX + key).getBytes(),
                String.valueOf(falsePositiveRate).getBytes(),
                String.valueOf(expectedInsertions).getBytes()
            );
        });
    }
    
    /**
     * Add an element to the Bloom filter
     */
    public boolean add(String key, String element) {
        String script = 
            "return redis.call('BF.ADD', KEYS[1], ARGV[1])";
        
        Long result = redisTemplate.execute((RedisCallback<Long>) connection -> {
            return (Long) connection.eval(
                script.getBytes(),
                ReturnType.INTEGER,
                1,
                (BF_PREFIX + key).getBytes(),
                element.getBytes()
            );
        });
        
        return result != null && result == 1;
    }
    
    /**
     * Add elements in bulk
     */
    public List<Boolean> addMulti(String key, String... elements) {
        String script = 
            "return redis.call('BF.MADD', KEYS[1], unpack(ARGV))";
        
        List<Long> results = redisTemplate.execute((RedisCallback<List<Long>>) connection -> {
            byte[][] args = new byte[elements.length][];
            for (int i = 0; i < elements.length; i++) {
                args[i] = elements[i].getBytes();
            }
            
            return (List<Long>) connection.eval(
                script.getBytes(),
                ReturnType.MULTI,
                1,
                (BF_PREFIX + key).getBytes(),
                args
            );
        });
        
        return results.stream()
            .map(result -> result == 1)
            .collect(Collectors.toList());
    }
    
    /**
     * Check whether an element may exist
     */
    public boolean exists(String key, String element) {
        String script = 
            "return redis.call('BF.EXISTS', KEYS[1], ARGV[1])";
        
        Long result = redisTemplate.execute((RedisCallback<Long>) connection -> {
            return (Long) connection.eval(
                script.getBytes(),
                ReturnType.INTEGER,
                1,
                (BF_PREFIX + key).getBytes(),
                element.getBytes()
            );
        });
        
        return result != null && result == 1;
    }
    
    /**
     * Check elements in bulk
     */
    public List<Boolean> existsMulti(String key, String... elements) {
        String script = 
            "return redis.call('BF.MEXISTS', KEYS[1], unpack(ARGV))";
        
        List<Long> results = redisTemplate.execute((RedisCallback<List<Long>>) connection -> {
            byte[][] args = new byte[elements.length][];
            for (int i = 0; i < elements.length; i++) {
                args[i] = elements[i].getBytes();
            }
            
            return (List<Long>) connection.eval(
                script.getBytes(),
                ReturnType.MULTI,
                1,
                (BF_PREFIX + key).getBytes(),
                args
            );
        });
        
        return results.stream()
            .map(result -> result == 1)
            .collect(Collectors.toList());
    }
}

2.4 Bloom Filter Use Cases

2.4.1 Guarding Against Cache Penetration
@Slf4j
@Service
public class CacheService {
    
    @Autowired
    private RedisTemplate<String, Object> redisTemplate;
    
    @Autowired
    private RedisBloomFilterService bloomFilterService;
    
    @Autowired
    private UserMapper userMapper;
    
    private static final String USER_BF_KEY = "user_bloom_filter";
    private static final String USER_CACHE_PREFIX = "user:";
    
    @PostConstruct
    public void initBloomFilter() {
        // Create the filter: expect 1,000,000 users, 0.01% false-positive rate
        bloomFilterService.createBloomFilter(USER_BF_KEY, 1_000_000, 0.0001);
        
        // Seed the filter with every existing user id
        List<Long> allUserIds = userMapper.getAllUserIds();
        for (Long userId : allUserIds) {
            bloomFilterService.add(USER_BF_KEY, String.valueOf(userId));
        }
    }
    
    /**
     * Look up a user (guarded by the Bloom filter)
     */
    public User getUserById(Long userId) {
        String userIdStr = String.valueOf(userId);
        
        // 1. Fast rejection via the Bloom filter
        if (!bloomFilterService.exists(USER_BF_KEY, userIdStr)) {
            log.info("Bloom filter says user {} does not exist; cache penetration avoided", userId);
            return null; // definitely absent, return immediately
        }
        
        // 2. Check the cache
        String cacheKey = USER_CACHE_PREFIX + userId;
        User cachedUser = (User) redisTemplate.opsForValue().get(cacheKey);
        if (cachedUser != null) {
            return cachedUser;
        }
        
        // 3. Query the database (the user may exist)
        User user = userMapper.selectById(userId);
        if (user != null) {
            // Refresh the cache
            redisTemplate.opsForValue().set(cacheKey, user, Duration.ofMinutes(30));
        } else {
            // Cache an empty placeholder briefly to absorb repeated lookups
            redisTemplate.opsForValue().set(cacheKey, new User(), Duration.ofMinutes(5));
        }
        
        return user;
    }
    
    /**
     * Update the Bloom filter when creating a new user
     */
    public User createUser(User user) {
        // MyBatis insert returns the affected row count; the generated id is back-filled into the entity
        userMapper.insert(user);
        
        // Add to the Bloom filter
        bloomFilterService.add(USER_BF_KEY, String.valueOf(user.getId()));
        
        // Refresh the cache
        String cacheKey = USER_CACHE_PREFIX + user.getId();
        redisTemplate.opsForValue().set(cacheKey, user, Duration.ofMinutes(30));
        
        return user;
    }
}
2.4.2 Duplicate Detection
@Service
public class DuplicateDetectionService {
    
    @Autowired
    private RedisBloomFilterService bloomFilterService;
    
    private static final String EMAIL_BF_KEY = "email_bloom_filter";
    private static final String URL_BF_KEY = "crawled_url_bloom_filter";
    
    /**
     * Check whether an email may be a duplicate
     */
    public boolean isEmailDuplicate(String email) {
        return bloomFilterService.exists(EMAIL_BF_KEY, email);
    }
    
    /**
     * Record an email in the deduplication set
     */
    public void addEmail(String email) {
        bloomFilterService.add(EMAIL_BF_KEY, email);
    }
    
    /**
     * URL deduplication for a web crawler
     */
    @Slf4j
    @Component
    public static class WebCrawlerDeduplication {
        
        @Autowired
        private RedisBloomFilterService bloomFilterService;
        
        /**
         * Has this URL (possibly) been crawled?
         */
        public boolean isUrlCrawled(String url) {
            return bloomFilterService.exists(URL_BF_KEY, url);
        }
        
        /**
         * Mark a URL as crawled
         */
        public void markUrlAsCrawled(String url) {
            bloomFilterService.add(URL_BF_KEY, url);
        }
        
        /**
         * Crawl a page
         */
        public void crawlPage(String url) {
            if (isUrlCrawled(url)) {
                log.info("URL {} already crawled (possibly a false positive), skipping", url);
                return;
            }
            
            try {
                // Fetch and process
                String content = fetchPageContent(url);
                processContent(content);
                
                // Mark as crawled
                markUrlAsCrawled(url);
                
                log.info("Crawled URL: {}", url);
            } catch (Exception e) {
                log.error("Failed to crawl URL: {}", url, e);
            }
        }
        
        private String fetchPageContent(String url) {
            // Real page-fetching logic goes here
            return "page content";
        }
        
        private void processContent(String content) {
            // Content-processing logic goes here
        }
    }
}
2.4.3 Applications in Distributed Systems
@Slf4j
@Service
public class DistributedBloomFilterService {
    
    @Autowired
    private RedisBloomFilterService bloomFilterService;
    
    /**
     * Duplicate-request suppression.
     * Note: a false positive silently skips a request that was never processed,
     * so pair the filter with an exact check when that is unacceptable.
     */
    public boolean processUniqueRequest(String requestId) {
        String bfKey = "processed_requests";
        
        // Check whether it has already been handled
        if (bloomFilterService.exists(bfKey, requestId)) {
            log.warn("Request {} may already have been processed, skipping", requestId);
            return false;
        }
        
        try {
            // Run the business logic
            doBusinessLogic(requestId);
            
            // Mark as processed
            bloomFilterService.add(bfKey, requestId);
            
            return true;
        } catch (Exception e) {
            log.error("Failed to process request: {}", requestId, e);
            return false;
        }
    }
    
    /**
     * Message-queue deduplication
     */
    @RabbitListener(queues = "business.queue")
    public void handleMessage(@Payload String message, 
                             @Header Map<String, Object> headers) {
        String messageId = (String) headers.get("messageId");
        String bfKey = "processed_messages";
        
        // Check whether the message has already been handled
        if (bloomFilterService.exists(bfKey, messageId)) {
            log.warn("Message {} may already have been processed, skipping", messageId);
            return;
        }
        
        try {
            // Process the message
            processMessage(message);
            
            // Mark as processed
            bloomFilterService.add(bfKey, messageId);
            
            log.info("Processed message: {}", messageId);
        } catch (Exception e) {
            log.error("Failed to process message: {}", messageId, e);
            throw e; // rethrow to trigger the retry mechanism
        }
    }
    
    private void doBusinessLogic(String requestId) {
        // Business logic goes here
    }
    
    private void processMessage(String message) {
        // Message-handling logic goes here
    }
}

2.5 布隆过滤器性能优化

2.5.1 参数调优
@Component
public class BloomFilterOptimizer {
    
    /**
     * 计算最优参数
     */
    public BloomFilterParams calculateOptimalParams(long expectedElements, 
                                                   double maxFalsePositiveRate) {
        // 计算最优位数组大小
        long optimalBitSize = (long) (-expectedElements * Math.log(maxFalsePositiveRate) 
                                     / (Math.log(2) * Math.log(2)));
        
        // 计算最优哈希函数个数
        int optimalHashCount = Math.max(1, 
            (int) Math.round((double) optimalBitSize / expectedElements * Math.log(2)));
        
        // 计算实际假阳性率
        double actualFalsePositiveRate = Math.pow(1 - Math.exp(
            -optimalHashCount * (double) expectedElements / optimalBitSize), optimalHashCount);
        
        return new BloomFilterParams(optimalBitSize, optimalHashCount, actualFalsePositiveRate);
    }
    
    /**
     * 性能测试
     */
    public void performanceTest() {
        int[] elementCounts = {10_000, 100_000, 1_000_000, 10_000_000};
        double[] falsePositiveRates = {0.01, 0.001, 0.0001};
        
        for (int elementCount : elementCounts) {
            for (double fpRate : falsePositiveRates) {
                BloomFilterParams params = calculateOptimalParams(elementCount, fpRate);
                
                System.out.printf(
                    "Elements: %d, target FP rate: %.4f, " +
                    "bit array size: %d, hash functions: %d, " +
                    "actual FP rate: %.6f, memory: %.2f KB%n",
                    elementCount, fpRate,
                    params.getBitSize(), params.getHashCount(),
                    params.getActualFalsePositiveRate(),
                    params.getBitSize() / 8.0 / 1024
                );
            }
            System.out.println();
        }
    }
    
    public static class BloomFilterParams {
        private final long bitSize;
        private final int hashCount;
        private final double actualFalsePositiveRate;
        
        public BloomFilterParams(long bitSize, int hashCount, double actualFalsePositiveRate) {
            this.bitSize = bitSize;
            this.hashCount = hashCount;
            this.actualFalsePositiveRate = actualFalsePositiveRate;
        }
        
        // Getters
        public long getBitSize() { return bitSize; }
        public int getHashCount() { return hashCount; }
        public double getActualFalsePositiveRate() { return actualFalsePositiveRate; }
    }
}
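As a quick sanity check, the formulas above can be exercised standalone (the class and method names here are illustrative, not from the article): for 1,000,000 elements at a 1% target, the standard result is roughly 9.59 million bits (about 1.14 MB) and 7 hash functions.

```java
public class BloomParamsCheck {

    // m = -n * ln(p) / (ln 2)^2 : optimal bit array size
    static long optimalBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // k = (m / n) * ln 2 : optimal number of hash functions
    static int optimalHashes(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000;   // expected elements
        double p = 0.01;      // target false-positive rate
        long m = optimalBits(n, p);
        int k = optimalHashes(m, n);

        // p_actual = (1 - e^(-k*n/m))^k
        double actual = Math.pow(1 - Math.exp(-k * (double) n / m), k);

        System.out.printf("m = %d bits (%.2f MB), k = %d, actual FP rate = %.5f%n",
                m, m / 8.0 / 1024 / 1024, k, actual);
    }
}
```

Note how the actual rate lands very close to the 1% target: rounding k to an integer is the only source of deviation.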
2.5.2 Memory Optimization Strategies
@Component
public class BloomFilterMemoryOptimization {
    
    /**
     * Layered Bloom filter: rolls over to a fresh layer once the current one fills up
     */
    public static class LayeredBloomFilter {
        private final List<BloomFilter> layers;
        private final int maxElementsPerLayer;
        private final double falsePositiveRate;
        private int currentLayerElements;
        
        public LayeredBloomFilter(int maxElementsPerLayer, double falsePositiveRate) {
            this.layers = new ArrayList<>();
            this.maxElementsPerLayer = maxElementsPerLayer;
            this.falsePositiveRate = falsePositiveRate;
            this.currentLayerElements = 0;
            
            // Create the first layer
            addNewLayer(falsePositiveRate);
        }
        
        public void add(String element) {
            if (currentLayerElements >= maxElementsPerLayer) {
                // New layers reuse the configured rate; note the overall
                // false-positive rate grows roughly linearly with layer count
                addNewLayer(falsePositiveRate);
                currentLayerElements = 0;
            }
            
            layers.get(layers.size() - 1).add(element);
            currentLayerElements++;
        }
        
        public boolean mightContain(String element) {
            return layers.stream().anyMatch(layer -> layer.mightContain(element));
        }
        
        private void addNewLayer(double falsePositiveRate) {
            layers.add(new BloomFilter(maxElementsPerLayer, falsePositiveRate));
        }
        
        public int getLayerCount() {
            return layers.size();
        }
    }
    
    /**
     * Scalable Bloom filter: adds progressively larger, stricter filters as data grows
     */
    public static class ScalableBloomFilter {
        private final List<BloomFilter> filters;
        private final double falsePositiveRate;
        private final int initialCapacity;
        private final double growthFactor;
        private int totalElements;
        
        public ScalableBloomFilter(double falsePositiveRate, int initialCapacity) {
            this.filters = new ArrayList<>();
            this.falsePositiveRate = falsePositiveRate;
            this.initialCapacity = initialCapacity;
            this.growthFactor = 2.0;
            this.totalElements = 0;
            
            // Create the first filter
            addNewFilter();
        }
        
        public void add(String element) {
            BloomFilter currentFilter = filters.get(filters.size() - 1);
            
            // If the current filter has hit its false-positive budget, add a new one
            if (needsExpansion()) {
                addNewFilter();
                currentFilter = filters.get(filters.size() - 1);
            }
            
            currentFilter.add(element);
            totalElements++;
        }
        
        public boolean mightContain(String element) {
            return filters.stream().anyMatch(filter -> filter.mightContain(element));
        }
        
        private boolean needsExpansion() {
            BloomFilter currentFilter = filters.get(filters.size() - 1);
            return currentFilter.getCurrentFalsePositiveRate() > falsePositiveRate;
        }
        
        private void addNewFilter() {
            // Each new filter doubles in capacity and halves its false-positive
            // budget, so the compound rate stays bounded by falsePositiveRate
            // (fp/2 + fp/4 + fp/8 + ... = fp)
            int capacity = (int) (initialCapacity * Math.pow(growthFactor, filters.size()));
            double adjustedFpRate = falsePositiveRate / Math.pow(2, filters.size() + 1);
            filters.add(new BloomFilter(capacity, adjustedFpRate));
        }
    }
}
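Both classes above delegate to a `BloomFilter` type whose implementation is not shown. A minimal in-memory sketch satisfying the assumed `add`/`mightContain`/`getCurrentFalsePositiveRate` contract could look like the following (hypothetical: it uses Kirsch-Mitzenmacher double hashing over `String.hashCode()`, which is fine for illustration but not a production-grade hash):

```java
import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final long bitSize;
    private final int hashCount;
    private long insertedCount;

    public SimpleBloomFilter(long expectedElements, double falsePositiveRate) {
        // Same sizing formulas as in BloomFilterOptimizer above
        this.bitSize = (long) (-expectedElements * Math.log(falsePositiveRate)
                               / (Math.log(2) * Math.log(2)));
        this.hashCount = Math.max(1,
            (int) Math.round((double) bitSize / expectedElements * Math.log(2)));
        // BitSet caps out at Integer.MAX_VALUE bits; acceptable for a sketch
        this.bits = new BitSet((int) bitSize);
    }

    public void add(String element) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(indexFor(element, i));
        }
        insertedCount++;
    }

    public boolean mightContain(String element) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(indexFor(element, i))) return false; // definitely absent
        }
        return true; // possibly present
    }

    /** Estimated current false-positive rate, as used by ScalableBloomFilter. */
    public double getCurrentFalsePositiveRate() {
        return Math.pow(1 - Math.exp(
            -hashCount * (double) insertedCount / bitSize), hashCount);
    }

    // Kirsch-Mitzenmacher double hashing: index_i = h1 + i * h2
    private int indexFor(String element, int i) {
        int h1 = element.hashCode();
        int h2 = (h1 ^ (h1 >>> 16)) | 1; // force odd second hash
        long combined = ((long) h1 + (long) i * h2) & 0x7fffffffL;
        return (int) (combined % bitSize);
    }
}
```

Because a Bloom filter never produces false negatives, every added element is guaranteed to return `true` from `mightContain`; only absent elements can occasionally (mis)report presence.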

3. Summary and Comparison

3.1 Feature Comparison

graph TD
    subgraph "HyperLogLog"
        A1["Cardinality estimation"]
        A2["Extremely memory-efficient"]
        A3["Standard error ~0.81%"]
        A4["Supports merge operations"]
    end
    
    subgraph "Bloom Filter"
        B1["Membership testing"]
        B2["No false negatives"]
        B3["Possible false positives"]
        B4["Space-efficient"]
    end
    
    subgraph "Use Cases"
        C1["UV statistics"]
        C2["Deduplicated counting"]
        C3["Cache penetration protection"]
        C4["Duplicate detection"]
    end
    
    A1 --> C1
    A2 --> C2
    B1 --> C3
    B2 --> C4
    
    style A1 fill:#e1f5fe
    style A2 fill:#e1f5fe
    style B1 fill:#fff3e0
    style B2 fill:#fff3e0

3.2 Performance Comparison

| Feature | HyperLogLog | Bloom Filter |
| --- | --- | --- |
| Primary use | Cardinality estimation | Membership testing |
| Memory usage | Fixed 12 KB (standard configuration) | Depends on expected element count and false-positive rate |
| Accuracy | Approximate result, ~0.81% error | No false negatives, possible false positives |
| Time complexity | O(1) | O(k), where k is the number of hash functions |
| Merge support | Yes | Yes (identical parameters required) |
| Applicable data volume | Any size | Must be estimated up front |
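The 12 KB / 0.81% figures follow from Redis's standard configuration of m = 16384 six-bit registers (16384 × 6 / 8 = 12288 bytes) and the HyperLogLog standard error 1.04/√m ≈ 1.04/128 ≈ 0.81%. A toy estimator shows the core formula E = α · m² / Σ 2^(−M[j]) in action (a hypothetical sketch for illustration only; it omits Redis's small-range and bias corrections and uses SHA-256 as a convenient strong hash):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ToyHyperLogLog {
    private final int p;              // number of index bits
    private final int m;              // number of registers = 2^p
    private final byte[] registers;   // each stores a max leading-zero rank
    private final double alpha;       // bias-correction constant
    private final MessageDigest digest;

    public ToyHyperLogLog(int p) {
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
        this.alpha = 0.7213 / (1 + 1.079 / m); // valid for m >= 128
        try {
            this.digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public void add(String element) {
        long hash = hash64(element);
        int bucket = (int) (hash >>> (64 - p));   // top p bits select the register
        long rest = hash << p;                    // remaining 64-p bits
        // Rank = position of the first 1-bit; the OR caps it when rest is all zeros
        int rank = Long.numberOfLeadingZeros(rest | (1L << (p - 1))) + 1;
        if (rank > registers[bucket]) registers[bucket] = (byte) rank;
    }

    public long estimate() {
        double sum = 0;
        for (byte r : registers) sum += Math.pow(2, -r);
        return Math.round(alpha * m * (double) m / sum); // E = alpha * m^2 / sum
    }

    private long hash64(String s) {
        byte[] d = digest.digest(s.getBytes(StandardCharsets.UTF_8));
        long h = 0;
        for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
        return h;
    }
}
```

With p = 10 (1024 registers, only 768 bytes of register state), the expected standard error is 1.04/√1024 ≈ 3.25%, so estimates for 100,000 distinct elements typically land within a few thousand of the truth.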

3.3 Selection Guidelines

3.3.1 When to Use HyperLogLog
// 1. Large-scale UV statistics
if (need to count unique visitors && data volume is huge && small errors are acceptable) {
    useHyperLogLog();
}

// 2. Real-time cardinality estimation
if (need real-time deduplicated counts && memory is limited) {
    useHyperLogLog();
}

// 3. Multi-source data merging
if (need to merge deduplicated statistics from multiple data sources) {
    useHyperLogLog();
}
3.3.2 When to Use a Bloom Filter
// 1. Cache penetration protection
if (need fast existence checks && cannot tolerate false negatives) {
    useBloomFilter();
}

// 2. Large-scale deduplication
if (need duplicate detection && memory is limited && a few false positives are tolerable) {
    useBloomFilter();
}

// 3. Deduplication in distributed systems
if (distributed environment && need fast duplicate checks) {
    useBloomFilter();
}

3.4 Best Practices

3.4.1 HyperLogLog Best Practices
@Component
public class HyperLogLogBestPractices {
    
    /**
     * 1. Set sensible expiration times
     */
    public void setExpirationPolicy() {
        // Keep daily UV stats for 7 days
        redisTemplate.expire("uv:daily:" + date, Duration.ofDays(7));
        
        // Keep monthly UV stats for 1 year
        redisTemplate.expire("uv:monthly:" + month, Duration.ofDays(365));
    }
    
    /**
     * 2. Batch operation optimization
     */
    public void batchOperations(String key, List<String> elements) {
        // Use pipelining for bulk adds
        redisTemplate.executePipelined((RedisCallback<Object>) connection -> {
            for (String element : elements) {
                connection.pfAdd(key.getBytes(), element.getBytes());
            }
            return null;
        });
    }
    
    /**
     * 3. Monitor the error rate
     */
    public void monitorAccuracy(String key, Set<String> actualSet) {
        long actual = actualSet.size();
        long estimated = redisTemplate.opsForHyperLogLog().size(key);
        double errorRate = Math.abs(estimated - actual) / (double) actual;
        
        if (errorRate > 0.02) { // error exceeds 2%
            log.warn("HyperLogLog error too large: key={}, actual={}, estimated={}, error={}%", 
                    key, actual, estimated, errorRate * 100);
        }
    }
}
3.4.2 Bloom Filter Best Practices
@Component
public class BloomFilterBestPractices {
    
    /**
     * 1. Parameter estimation and tuning
     */
    public void parameterTuning() {
        // Choose a false-positive rate that fits the business requirements
        double fpRate = 0.001; // 0.1% false-positive rate
        long expectedElements = 1_000_000; // 1 million expected elements
        
        // Compute the required memory
        long bitSize = (long) (-expectedElements * Math.log(fpRate) / (Math.log(2) * Math.log(2)));
        double memoryMB = bitSize / 8.0 / 1024 / 1024;
        
        // SLF4J placeholders do not support printf-style specifiers like {:.2f}
        log.info("Bloom filter memory requirement: {} MB", String.format("%.2f", memoryMB));
    }
    
    /**
     * 2. Periodic rebuild strategy
     */
    @Scheduled(cron = "0 0 2 * * ?") // every day at 2 AM
    public void rebuildBloomFilter() {
        String oldKey = "bf:users";
        String newKey = "bf:users:new";
        
        try {
            // Create a new Bloom filter
            bloomFilterService.createBloomFilter(newKey, 1_000_000, 0.001);
            
            // Reload all the data
            List<String> allUsers = userService.getAllUserIds();
            bloomFilterService.addMulti(newKey, allUsers.toArray(new String[0]));
            
            // Atomic swap
            redisTemplate.rename(newKey, oldKey);
            
            log.info("Bloom filter rebuild completed, loaded {} users", allUsers.size());
        } catch (Exception e) {
            log.error("Bloom filter rebuild failed", e);
            // Clean up the temporary key
            redisTemplate.delete(newKey);
        }
    }
    
    /**
     * 3. False-positive handling strategy
     */
    public boolean handleFalsePositive(String key, String element) {
        // The Bloom filter says the element may exist
        if (bloomFilterService.exists(key, element)) {
            // Verify against the authoritative store
            boolean actualExists = databaseService.exists(element);
            
            if (!actualExists) {
                // Record the false positive
                metricsService.incrementFalsePositive(key);
                log.debug("False positive detected: key={}, element={}", key, element);
            }
            
            return actualExists;
        }
        
        return false; // definitely absent
    }
}

3.5 Conclusion

HyperLogLog and Bloom filters are both excellent probabilistic data structures, each playing an important role in different scenarios:

HyperLogLog is suited to:

  • Cardinality estimation over large-scale data
  • Real-time UV/PV statistics
  • Deduplicated counting under tight memory constraints
  • Statistics that must be merged across multiple data sources

Bloom filters are suited to:

  • Cache penetration protection
  • Duplicate detection over large data sets
  • Guarding against repeated processing in distributed systems
  • URL deduplication in web crawlers

In practice, the two structures are often used together to form a complete data-processing solution. Choosing the right one requires weighing business requirements, data volume, accuracy needs, and resource constraints.

With sensible parameter configuration, regular maintenance, and solid monitoring, both structures can deliver efficient, reliable data processing for large-scale distributed systems.