Bloom Filters: Membership Tests over a Million Items in About 1 MB, the "Space Magic" Behind Large-Scale Systems

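How could Chrome check a URL against a large blocklist of malicious sites without a network round trip?
How do Redis-backed services avoid hitting the database for keys that do not exist?
How do lightweight Bitcoin (SPV) clients ask full nodes for only the transactions they care about?
A Bloom filter is doing the work behind each of these.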

📚 Full tutorial: github.com/Lee985-cmd/…
⭐ Star to support | 💬 Open an Issue | 🔄 Fork to share

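🎯 Starting from a real scenario

Suppose you are building a user registration system:

// Scenario 1: check whether a username is already taken
const registeredUsers = [
  'zhangsan', 'lisi', 'wangwu', 'zhaoliu', 
  'qianqi', 'sunba', 'zhoujiu', 'wushi'
  // ... imagine 100 million registered users here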
];

function isUsernameTaken(username) {
  return registeredUsers.includes(username);
}

console.log(isUsernameTaken('zhangsan')); // true
console.log(isUsernameTaken('newuser'));  // false

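Here is the problem:

With 100 million registered users in the database (rough estimates):

Stored in an array: about 800 MB of memory (around 8 bytes per name, no overhead)
Stored in a hash table: several GB once per-entry overhead is counted (about 8 GB at roughly 80 bytes per entry)
An array lookup scans every element, O(n); a hash table answers in O(1) on average but pays for that speed in memory.

Is there a way to answer "does this exist?" with only about 1 MB per million items, roughly 120 MB for all 100 million users?

There is: the Bloom filter.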

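🔍 The core idea of a Bloom filter

In one sentence

Use several hash functions to map each item to a few bits, then answer "possibly present" or "definitely not present" in very little space.

Hash table vs Bloom filter

Dimension	Hash table	Bloom filter
Storage	100 million users ≈ 8 GB (estimate)	100 million users ≈ 120 MB at a 1% false positive rate
Lookup time	O(1) on average	O(k), where k is the number of hash functions
Accuracy	Exact	False positives possible, false negatives impossible
Deletion	✅	❌ (not in the standard version)
Best for	Small data sets that need exact answers	Large data sets that tolerate a small error rate

The Bloom filter's "philosophy of imperfection"

A Bloom filter gives one of two answers:

"Definitely not present": always correct
"Possibly present": usually correct; with a filter tuned to a 1% false positive rate, about 1 in 100 lookups for absent items wrongly gets this answer

Why accept false positives?

Take a malicious-IP blocklist: a 1% false positive rate means about 1 in 100 legitimate users is flagged by mistake
In exchange, the filter stops every one of, say, 1 million real attackers
In many systems that trade is well worth it, especially when a flagged request can be double-checked against the real list.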

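🛠️ How it works, step by step

Structure of a Bloom filter

Index: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Bit:   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

An array of 0s and 1s, all 0 to begin with.

Insertion

Suppose we insert the string "hello":

Step 1: compute positions with 3 hash functions

hash1("hello") % 16 = 3  →  set bit 3 to 1
hash2("hello") % 16 = 7  →  set bit 7 to 1
hash3("hello") % 16 = 12 →  set bit 12 to 1

Step 2: update the bit array

Index:  0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Before: 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
After:  0  0  0  1  0  0  0  1  0  0  0  0  1  0  0  0

Step 3: insert "world" as well

hash1("world") % 16 = 1  →  set bit 1 to 1
hash2("world") % 16 = 7  →  bit 7 is already 1 (shared with "hello")
hash3("world") % 16 = 10 →  set bit 10 to 1

Index: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Bit:   0  1  0  1  0  0  0  1  0  0  1  0  1  0  0  0

Bits 1, 3, 7, 10 and 12 are now set.

Query

Is "hello" present?

1. hash1("hello") % 16 = 3  →  bit 3 is 1 ✅
2. hash2("hello") % 16 = 7  →  bit 7 is 1 ✅
3. hash3("hello") % 16 = 12 →  bit 12 is 1 ✅

All 3 positions are 1 → "hello" may be present.

Is "foo" present?

1. hash1("foo") % 16 = 2  →  bit 2 is 0 ❌

The first position is already 0 → "foo" is definitely not present.
(No need to check the remaining positions.)

Key point: a single 0 bit is enough to prove the item was never inserted.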

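Why do false positives happen?

Suppose we query "test":
hash1("test") % 16 = 3  →  bit 3 is 1 ✅ (set by "hello")
hash2("test") % 16 = 7  →  bit 7 is 1 ✅ (set by "hello" and "world")
hash3("test") % 16 = 10 →  bit 10 is 1 ✅ (set by "world")

All 3 positions are 1 → the Bloom filter says "test" may be present.

But "test" was never inserted. That is a false positive.

The root cause: bits set by other elements happen to cover every position the queried element maps to.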
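
The walkthrough above can be replayed in a few lines of JavaScript. This is only a sketch: the `positions` table hard-codes the bit positions from the example as stand-ins for hash1, hash2 and hash3 (positions 5 and 9 for "foo" are made up here), and `insert` and `mightContain` are illustrative names rather than part of any library.

```javascript
// Bit positions copied from the walkthrough; stand-ins for three real hash functions.
const positions = {
  hello: [3, 7, 12],
  world: [1, 7, 10],
  test: [3, 7, 10], // never inserted, but every position is covered by the other two
  foo: [2, 5, 9],   // only position 2 comes from the text; 5 and 9 are assumed
};

const bits = new Array(16).fill(0);

function insert(word) {
  for (const pos of positions[word]) bits[pos] = 1;
}

function mightContain(word) {
  // A single 0 bit proves the word was never inserted.
  return positions[word].every((pos) => bits[pos] === 1);
}

insert('hello');
insert('world');

console.log(bits.join(' '));        // 0 1 0 1 0 0 0 1 0 0 1 0 1 0 0 0
console.log(mightContain('hello')); // true  (really inserted)
console.log(mightContain('foo'));   // false (bit 2 is 0)
console.log(mightContain('test'));  // true  (a false positive)
```

Note that the filter cannot tell the "hello" answer apart from the "test" answer: both are just "all bits set".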

💻 Full implementation

A basic Bloom filter

/**
 * Bloom filter: tests set membership in very little space.
 *
 * Core idea:
 * - Several hash functions map each element to several positions in a bit array.
 * - If any of those positions is 0 at query time, the element is definitely absent.
 * - If all of them are 1, the element is "possibly present" (false positives can occur).
 *
 * Typical uses:
 * - Cache penetration protection (Redis + Bloom filter)
 * - Spam detection
 * - Malicious URL blocking
 * - Deduplication
 * - Bitcoin SPV clients
 *
 * Complexity:
 * - Insert: O(k), where k is the number of hash functions
 * - Query: O(k)
 * - Space: O(m), where m is the size of the bit array
 */

class BloomFilter {
  /**
   * @param {number} expectedItems - expected number of inserted elements
   * @param {number} falsePositiveRate - acceptable false positive rate (0.01 = 1%)
   */
  constructor(expectedItems = 10000, falsePositiveRate = 0.01) {
    this.expectedItems = expectedItems;
    this.falsePositiveRate = falsePositiveRate;
    
    // Optimal bit array size
    // Formula: m = -(n * ln(p)) / (ln(2))^2
    this.size = Math.ceil(
      -(expectedItems * Math.log(falsePositiveRate)) / (Math.log(2) ** 2)
    );
    
    // Optimal number of hash functions
    // Formula: k = (m/n) * ln(2)
    this.hashCount = Math.ceil(
      (this.size / expectedItems) * Math.log(2)
    );
    
    // Bit array backed by a Uint8Array (8 bits per element)
    this.bitArray = new Uint8Array(Math.ceil(this.size / 8));
    this.itemCount = 0;
    
    console.log(`Bloom filter configuration:`);
    console.log(`  Expected items: ${expectedItems}`);
    console.log(`  Bit array size: ${this.size} bits = ${(this.size / 8 / 1024).toFixed(2)} KB`);
    console.log(`  Hash functions: ${this.hashCount}`);
    console.log(`  Target false positive rate: ${(falsePositiveRate * 100).toFixed(1)}%`);
  }

  /**
   * Adds an element to the filter.
   * 
   * @param {string} item - element to add
   */
  add(item) {
    const hashValues = this._getHashValues(item);
    
    for (const bitIndex of hashValues) {
      const byteIndex = Math.floor(bitIndex / 8);
      const bitOffset = bitIndex % 8;
      
      // Set the corresponding bit to 1
      this.bitArray[byteIndex] |= (1 << bitOffset);
    }
    
    // Counts add() calls, so inserting the same item twice is counted twice
    this.itemCount++;
  }

  /**
   * Tests whether an element may be present.
   * 
   * @param {string} item - element to look up
   * @returns {boolean} true = possibly present, false = definitely not present
   */
  contains(item) {
    const hashValues = this._getHashValues(item);
    
    for (const bitIndex of hashValues) {
      const byteIndex = Math.floor(bitIndex / 8);
      const bitOffset = bitIndex % 8;
      
      // Any 0 bit means the element was never inserted
      if (!(this.bitArray[byteIndex] & (1 << bitOffset))) {
        return false;
      }
    }
    
    // All bits are 1: possibly present
    return true;
  }

  /**
   * Returns the k bit positions for an element.
   * 
   * Uses the "double hashing" trick:
   * h(i) = h1(x) + i * h2(x)
   * Two hash functions stand in for k of them, which saves computation.
   */
  _getHashValues(item) {
    const hash1 = this._hash1(item);
    // A step of 0 would send every probe to the same bit, so fall back to 1
    const hash2 = (this._hash2(item) % this.size) || 1;
    
    const hashes = [];
    for (let i = 0; i < this.hashCount; i++) {
      // Both values are non-negative and well below 2^53, so the sum is exact
      hashes.push((hash1 + i * hash2) % this.size);
    }
    
    return hashes;
  }

  /**
   * First hash function (32-bit FNV-1a; the shifts add up to a multiply by 16777619)
   */
  _hash1(str) {
    let hash = 2166136261;
    for (let i = 0; i < str.length; i++) {
      hash ^= str.charCodeAt(i);
      hash += (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24);
    }
    return hash >>> 0; // convert to an unsigned 32-bit integer
  }

  /**
   * Second hash function (DJB2)
   */
  _hash2(str) {
    let hash = 5381;
    for (let i = 0; i < str.length; i++) {
      hash = ((hash << 5) + hash) + str.charCodeAt(i);
    }
    return hash >>> 0;
  }

  /**
   * Estimates the current false positive rate.
   */
  getFalsePositiveRate() {
    // Formula: p = (1 - e^(-kn/m))^k
    const exponent = -this.hashCount * this.itemCount / this.size;
    const probability = Math.pow(1 - Math.exp(exponent), this.hashCount);
    return probability;
  }

  /**
   * Returns the fraction of bits currently set to 1.
   */
  getFillRate() {
    let onesCount = 0;
    for (const byte of this.bitArray) {
      for (let i = 0; i < 8; i++) {
        if (byte & (1 << i)) onesCount++;
      }
    }
    return onesCount / this.size;
  }

  /**
   * Returns summary statistics.
   */
  getStats() {
    return {
      itemCount: this.itemCount,
      size: this.size,
      sizeKB: (this.size / 8 / 1024).toFixed(2),
      hashCount: this.hashCount,
      fillRate: (this.getFillRate() * 100).toFixed(2) + '%',
      estimatedFPR: (this.getFalsePositiveRate() * 100).toFixed(3) + '%'
    };
  }
}

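🚀 Real-world applications

Application 1: cache penetration protection with Redis

Problem: an attacker deliberately requests product IDs that do not exist, so every request misses the cache and lands on the database.

// Solution: Bloom filter + Redis

class ProductCache {
  constructor() {
    this.bloomFilter = new BloomFilter(1000000, 0.01); // 1 million products, 1% false positives
    this.redis = new Map(); // stands in for Redis
    this.db = new Map();    // stands in for the database
    
    // Startup: add every product ID to the Bloom filter
    this._initializeBloomFilter();
  }

  _initializeBloomFilter() {
    // Assume the database holds 1 million products
    for (let i = 1; i <= 1000000; i++) {
      this.db.set(`product_${i}`, { name: `Product ${i}`, price: i * 10 });
      this.bloomFilter.add(`product_${i}`);
    }
  }

  getProduct(productId) {
    const key = `product_${productId}`;
    
    // Step 1: ask the Bloom filter
    if (!this.bloomFilter.contains(key)) {
      // Definitely absent: return without touching Redis or the database
      console.log(`❌ Product ${productId} does not exist (blocked by Bloom filter)`);
      return null;
    }
    
    // Step 2: check Redis
    if (this.redis.has(key)) {
      console.log(`✅ Product ${productId} served from Redis`);
      return this.redis.get(key);
    }
    
    // Step 3: query the database
    const product = this.db.get(key);
    if (product) {
      // Cache the result in Redis
      this.redis.set(key, product);
      console.log(`✅ Product ${productId} loaded from database and cached`);
      return product;
    }
    
    // Bloom filter false positive: the product does not actually exist
    console.log(`⚠️  Product ${productId} does not exist (Bloom filter false positive)`);
    return null;
  }
}

// Test
const cache = new ProductCache();

// Normal lookup
cache.getProduct(12345);
// ✅ Product 12345 loaded from database and cached

// Attack: look up a product that does not exist
cache.getProduct(9999999);
// ❌ Product 9999999 does not exist (blocked by Bloom filter)
// (about 1% of such lookups slip through as false positives and print the ⚠️ line instead)

Effect:

Approach	Attacker sends 1 million lookups	Database load
No protection	1 million database queries	💥 Database overwhelmed
Bloom filter	About 990,000 blocked, about 10,000 false positives reach the database	✅ Database stays healthy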

Application 2: spam detection

class SpamDetector {
  constructor() {
    // Known spam domains
    this.spamDomains = [
      'spam.com', 'junk.org', 'phishing.net', 
      'scam.io', 'malware.cn'
    ];
    
    this.bloomFilter = new BloomFilter(10000, 0.001); // 0.1% false positives
    
    // Load the blocklist into the filter
    this.spamDomains.forEach(domain => {
      this.bloomFilter.add(domain);
    });
  }

  isSpam(email) {
    const domain = email.split('@')[1];
    
    // A malformed address has no domain part; do not pass undefined to the filter
    if (!domain) {
      return { isSpam: false, reason: 'no domain in address' };
    }
    
    if (!this.bloomFilter.contains(domain)) {
      return { isSpam: false, reason: 'domain not on the blocklist' };
    }
    
    // Possibly present: confirm against the real blocklist
    if (this.spamDomains.includes(domain)) {
      return { isSpam: true, reason: 'confirmed spam domain' };
    }
    
    return { isSpam: false, reason: 'Bloom filter false positive' };
  }
}

const detector = new SpamDetector();

console.log(detector.isSpam('user@spam.com'));
// { isSpam: true, reason: 'confirmed spam domain' }

console.log(detector.isSpam('user@gmail.com'));
// most likely { isSpam: false, reason: 'domain not on the blocklist' }

Application 3: URL deduplication (a crawler staple)

class WebCrawler {
  constructor() {
    // A Bloom filter records URLs that have already been queued
    this.visitedURLs = new BloomFilter(10000000, 0.01); // 10 million URLs
    this.urlQueue = [];
  }

  addURL(url) {
    // Trade-off: a false positive here means a new URL is skipped and never crawled
    if (!this.visitedURLs.contains(url)) {
      this.visitedURLs.add(url);
      this.urlQueue.push(url);
      console.log(`✅ Added URL: ${url}`);
    } else {
      console.log(`⏭️  Skipped already seen URL: ${url}`);
    }
  }

  // maxPages bounds the demo: the simulated links below never run out
  crawl(maxPages = 100) {
    let crawled = 0;
    while (this.urlQueue.length > 0 && crawled < maxPages) {
      const url = this.urlQueue.shift();
      console.log(` Crawling: ${url}`);
      crawled++;
      
      // Simulate the links discovered on the page
      const newURLs = this._extractLinks(url);
      newURLs.forEach(newURL => this.addURL(newURL));
    }
  }

  _extractLinks(url) {
    // Simulated link extraction
    return [
      `${url}/page1`,
      `${url}/page2`,
      `${url}/about`
    ];
  }
}

const crawler = new WebCrawler();

crawler.addURL('https://example.com');
crawler.addURL('https://example.com'); // duplicate, skipped
crawler.addURL('https://example.com/page1');

// Output:
// ✅ Added URL: https://example.com
// ⏭️  Skipped already seen URL: https://example.com
// ✅ Added URL: https://example.com/page1

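⚙️ Parameter tuning guide

How large should the bit array be?

Formula: m = -(n * ln(p)) / (ln(2))^2

n = expected number of elements
p = acceptable false positive rate
m = bit array size (bits)

Worked examples:

// Case 1: 1 million elements, 1% false positive rate
m = -(1000000 * ln(0.01)) / (ln(2))^2
  = -(1000000 * -4.6052) / 0.4805
  ≈ 9,585,059 bits
  ≈ 1.14 MiB

// Case 2: 100 million elements, 0.1% false positive rate
m = -(100000000 * ln(0.001)) / (ln(2))^2
  = -(100000000 * -6.9078) / 0.4805
  ≈ 1,437,758,757 bits
  ≈ 171 MiB

Side by side:

Elements	False positive rate	Bloom filter	Hash table (assuming ~80 bytes per entry)	Space saved
1 million	1%	1.14 MiB	80 MB	about 98.5%
10 million	1%	11.4 MiB	800 MB	about 98.5%
100 million	0.1%	171 MiB	8 GB	about 98%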
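
The sizing formula is easy to check numerically. The sketch below is a small calculator under the same assumptions as the formula above; `optimalBits` and `describe` are names chosen for this example, not part of the `BloomFilter` class.

```javascript
// Optimal bit-array size for n items at false positive rate p:
// m = -(n * ln p) / (ln 2)^2
function optimalBits(n, p) {
  return Math.ceil(-(n * Math.log(p)) / (Math.LN2 ** 2));
}

// Reports the size in bits, in MiB, and in bits per stored item.
function describe(n, p) {
  const bits = optimalBits(n, p);
  return {
    bits,
    mib: (bits / 8 / 1024 / 1024).toFixed(2),
    bitsPerItem: (bits / n).toFixed(2),
  };
}

console.log(describe(1e6, 0.01));  // bits: 9585059, about 1.14 MiB
console.log(describe(1e8, 0.001)); // about 1.44e9 bits, roughly 171 MiB
```

The bits-per-item figure depends only on p, not on n: about 9.6 bits per item at 1% and about 14.4 at 0.1%, whatever the items themselves look like.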

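How many hash functions?

Formula: k = (m/n) * ln(2)

// 1 million elements, 1% false positive rate
k = (9585059 / 1000000) * 0.693
  = 6.64
  ≈ 7 hash functions

Rules of thumb:

1% false positive rate: 6-7 hash functions
0.1% false positive rate: 9-10 hash functions
0.01% false positive rate: 13-14 hash functions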
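
Substituting the optimal m into k = (m/n) * ln(2) gives k = log2(1/p), which is where these rules of thumb come from. A quick check, with `optimalHashCount` as an illustrative name:

```javascript
// k = (m / n) * ln 2 simplifies to log2(1 / p) when m is chosen optimally.
function optimalHashCount(p) {
  return Math.log2(1 / p);
}

for (const p of [0.01, 0.001, 0.0001]) {
  const exact = optimalHashCount(p);
  console.log(`p = ${p}: k = ${exact.toFixed(2)}, rounded to ${Math.round(exact)}`);
}
// p = 0.01: k = 6.64, rounded to 7
// p = 0.001: k = 9.97, rounded to 10
// p = 0.0001: k = 13.29, rounded to 13
```

In practice k has to be a whole number, so the achieved false positive rate lands slightly off the target.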

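What if the false positive rate is too high?

Three options:

Enlarge the bit array

// Going from a 1% to a 0.1% false positive rate
const bf = new BloomFilter(1000000, 0.001); // about 1.5x the space (14.4 vs 9.6 bits per element)

Adjust the number of hash functions

// Only helps while k is below the optimum (m/n) * ln(2). Past that point, extra hashes
// fill the array faster, raise the false positive rate, and slow down inserts and queries.

Layered Bloom filters

// Layer 1: low false positive rate (0.01%)
// Layer 2: normal false positive rate (1%)
// Treat an element as present only when both layers return true
// (the layers need independent hash functions for the two rates to multiply)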
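
It is worth doing the arithmetic on the layered option before reaching for it. Assuming the two layers use independent hash functions, their false positive rates multiply, but so does the cost: the sketch below (with `bitsPerItem` as an illustrative helper) compares the space of two optimally sized layers against one filter sized directly for the combined rate.

```javascript
// Bits per stored item for an optimally sized filter at false positive rate p.
const bitsPerItem = (p) => -Math.log(p) / (Math.LN2 ** 2);

const layer1 = 0.0001; // 0.01%
const layer2 = 0.01;   // 1%

// Both layers must answer "possibly present", so the rates multiply
// (this assumes the layers hash independently).
const combined = layer1 * layer2;

const layeredCost = bitsPerItem(layer1) + bitsPerItem(layer2);
const singleCost = bitsPerItem(combined);

console.log(`combined false positive rate: ${combined.toExponential(1)}`);
console.log(`two layers:    ${layeredCost.toFixed(2)} bits per item`);
console.log(`single filter: ${singleCost.toFixed(2)} bits per item`);
```

Both costs come out the same (about 28.76 bits per item), because the size formula is linear in ln(p). Layering is therefore mainly an organizational choice, for example when the layers are built or refreshed separately, rather than a way to save space.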

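Common pitfalls and fixes

Pitfall 1: no deletion

Problem: a standard Bloom filter cannot delete an element, because clearing its bits can also erase other elements that share them.

Fix 1: counting Bloom filter

class CountingBloomFilter {
  constructor(size, hashCount = 3) {
    // An array of counters replaces the bit array
    this.counters = new Uint8Array(size);
    this.hashCount = hashCount;
  }

  // Derives hashCount indexes from two simple string hashes (double hashing)
  _getHashes(item) {
    let h1 = 2166136261;
    let h2 = 5381;
    for (let i = 0; i < item.length; i++) {
      h1 = Math.imul(h1 ^ item.charCodeAt(i), 16777619);
      h2 = (Math.imul(h2, 33) + item.charCodeAt(i)) | 0;
    }
    const size = this.counters.length;
    const step = ((h2 >>> 0) % size) || 1;
    const indexes = [];
    for (let i = 0; i < this.hashCount; i++) {
      indexes.push(((h1 >>> 0) + i * step) % size);
    }
    return indexes;
  }

  add(item) {
    const hashes = this._getHashes(item);
    hashes.forEach(index => {
      if (this.counters[index] < 255) {
        this.counters[index]++;
      }
    });
  }

  remove(item) {
    // Removing an item that was never added would corrupt other items' counters
    // (best effort: a false positive can still pass this check)
    if (!this.contains(item)) return false;
    const hashes = this._getHashes(item);
    hashes.forEach(index => {
      // A counter stuck at 255 has lost count, so it is never decremented
      if (this.counters[index] > 0 && this.counters[index] < 255) {
        this.counters[index]--;
      }
    });
    return true;
  }

  contains(item) {
    const hashes = this._getHashes(item);
    return hashes.every(index => this.counters[index] > 0);
  }
}

Fix 2: rebuild the filter

// Rebuild from the source data after a removal.
// Keeping the raw items costs the memory the filter was meant to save, so in practice
// the items live in a database and the rebuild runs periodically, not on every removal.
class RebuildableBloomFilter {
  constructor() {
    this.bf = new BloomFilter(10000, 0.01);
    this.items = []; // source data
  }

  add(item) {
    this.items.push(item);
    this.bf.add(item);
  }

  contains(item) {
    return this.bf.contains(item);
  }

  remove(item) {
    this.items = this.items.filter(i => i !== item);
    this._rebuild();
  }

  _rebuild() {
    this.bf = new BloomFilter(10000, 0.01);
    this.items.forEach(item => this.bf.add(item));
  }
}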

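Pitfall 2: hash collisions drive the false positive rate up

Problem: a low-quality hash function clusters its outputs, which produces far more collisions than the formulas assume.

Fix: use a well-distributed hash function

// MurmurHash3 is a common choice in production Bloom filters. The function below is
// cyrb53, a compact 53-bit string hash for JavaScript that borrows MurmurHash3's
// mixing constants; it is not MurmurHash3 itself.
function cyrb53(str, seed = 0) {
  let h1 = 0xdeadbeef ^ seed;
  let h2 = 0x41c6ce57 ^ seed;
  
  for (let i = 0; i < str.length; i++) {
    const ch = str.charCodeAt(i);
    h1 = Math.imul(h1 ^ ch, 2654435761);
    h2 = Math.imul(h2 ^ ch, 1597334677);
  }
  
  h1 = Math.imul(h1 ^ (h1 >>> 16), 2246822507) ^ Math.imul(h2 ^ (h2 >>> 13), 3266489909);
  h2 = Math.imul(h2 ^ (h2 >>> 16), 2246822507) ^ Math.imul(h1 ^ (h1 >>> 13), 3266489909);
  
  // Combine 21 bits of h2 with 32 bits of h1 into one 53-bit safe integer
  return 4294967296 * (2097151 & h2) + (h1 >>> 0);
}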

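Pitfall 3: more elements than planned

Problem: once the number of inserted elements exceeds the design capacity, the false positive rate climbs steeply toward 100%.

Fix: grow dynamically

class ScalableBloomFilter {
  constructor(initialCapacity = 10000) {
    this.filters = [];
    this.currentFilter = new BloomFilter(initialCapacity, 0.01);
    this.filters.push(this.currentFilter);
    this.itemCount = 0;
  }

  add(item) {
    // Compare against the current filter's own count, not the running total
    if (this.currentFilter.itemCount >= this.currentFilter.expectedItems) {
      // Current filter is full: start a new one with double the capacity
      const newCapacity = this.currentFilter.expectedItems * 2;
      this.currentFilter = new BloomFilter(newCapacity, 0.01);
      this.filters.push(this.currentFilter);
    }
    
    this.currentFilter.add(item);
    this.itemCount++;
  }

  contains(item) {
    // Each item lives in exactly one filter, so a hit in any filter counts.
    // The overall false positive rate is roughly the sum of the per-filter rates.
    return this.filters.some(f => f.contains(item));
  }
}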

📊 Benchmark

// Test: Bloom filter vs hash table

console.log('===== Performance comparison =====\n');

// Prepare 1 million items
const testData = [];
for (let i = 0; i < 1000000; i++) {
  testData.push(`user_${i}`);
}

// Test 1: Bloom filter
console.log('Test 1: Bloom filter');
const bf = new BloomFilter(1000000, 0.01);

let startTime = Date.now();
testData.forEach(item => bf.add(item));
let insertTime = Date.now() - startTime;
console.log(`Insert 1M items: ${insertTime}ms`);
console.log(`Memory: ${bf.getStats().sizeKB} KB`);

startTime = Date.now();
for (let i = 0; i < 100000; i++) {
  bf.contains(`user_${i}`);
}
let queryTime = Date.now() - startTime;
console.log(`Query 100k times: ${queryTime}ms`);
// This is the rate predicted by the formula, not a measured one
console.log(`Estimated false positive rate: ${bf.getStats().estimatedFPR}\n`);

// Test 2: hash table (Set)
console.log('Test 2: hash table (Set)');
const hashSet = new Set();

startTime = Date.now();
testData.forEach(item => hashSet.add(item));
insertTime = Date.now() - startTime;
console.log(`Insert 1M items: ${insertTime}ms`);

startTime = Date.now();
for (let i = 0; i < 100000; i++) {
  hashSet.has(`user_${i}`);
}
queryTime = Date.now() - startTime;
console.log(`Query 100k times: ${queryTime}ms`);

// Rough estimate of the Set's memory use
const estimatedSize = testData.length * 50; // assume about 50 bytes per element
console.log(`Memory: ${(estimatedSize / 1024 / 1024).toFixed(2)} MB\n`);

console.log('Summary:');
console.log(`Space saved: ${((1 - bf.getStats().sizeKB / 1024 / (estimatedSize / 1024 / 1024)) * 100).toFixed(1)}%`);

Typical output (timings are illustrative and vary with machine and runtime; the constructor's configuration printout is omitted):

===== Performance comparison =====

Test 1: Bloom filter
Insert 1M items: 234ms
Memory: 1170.05 KB
Query 100k times: 18ms
Estimated false positive rate: 1.004%

Test 2: hash table (Set)
Insert 1M items: 189ms
Query 100k times: 12ms
Memory: 47.68 MB

Summary:
Space saved: 97.6%
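
The benchmark reports the false positive rate predicted by the formula. Measuring it directly means querying keys that were never inserted. The sketch below does that with a compact stand-alone filter; `fnv1a`, `createFilter`, the seed values and the `member_`/`absent_` key names are all assumptions made for this example, and the measured rate will vary with the hash functions and keys used.

```javascript
// Seeded 32-bit FNV-1a. The seed parameter is an addition for this sketch,
// used to derive two different hash values from one function.
function fnv1a(str, seed) {
  let h = (2166136261 ^ seed) >>> 0;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0;
}

// Minimal Bloom filter sized for n items at target false positive rate p.
function createFilter(n, p) {
  const m = Math.ceil(-(n * Math.log(p)) / (Math.LN2 ** 2));
  const k = Math.max(1, Math.round((m / n) * Math.LN2));
  const bits = new Uint8Array(Math.ceil(m / 8));
  const indexes = (item) => {
    const h1 = fnv1a(item, 0);
    const h2 = (fnv1a(item, 0x9e3779b9) % m) || 1; // the step must not be 0
    return Array.from({ length: k }, (_, i) => (h1 + i * h2) % m);
  };
  return {
    add(item) {
      for (const i of indexes(item)) bits[i >> 3] |= 1 << (i & 7);
    },
    has(item) {
      return indexes(item).every((i) => (bits[i >> 3] & (1 << (i & 7))) !== 0);
    },
  };
}

const n = 20000;
const filter = createFilter(n, 0.01);
for (let i = 0; i < n; i++) filter.add(`member_${i}`);

// Every inserted key must be found: a Bloom filter has no false negatives.
let falseNegatives = 0;
for (let i = 0; i < n; i++) if (!filter.has(`member_${i}`)) falseNegatives++;

// Keys that were never inserted should be reported only rarely.
let falsePositives = 0;
for (let i = 0; i < n; i++) if (filter.has(`absent_${i}`)) falsePositives++;

const measuredRate = falsePositives / n;
console.log(`false negatives: ${falseNegatives}`);
console.log(`measured false positive rate: ${(measuredRate * 100).toFixed(2)}%`);
```

If the measured rate sits well above the target, the usual suspects are a weak hash function or more inserted items than the filter was sized for.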

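🎯 Related LeetCode problems

Bloom filters rarely appear as interview problems themselves, but the ideas behind them come up often:

Design HashSet (LeetCode 705)
- Tests hash function design
- A bit array works when the key range is bounded
Design HashMap (LeetCode 706)
- Uses the same hashing and collision-handling ideas that Bloom filters build on
Check whether two strings are permutations of each other
- A fixed-size array of character counts keeps the space constant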

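💡 Frequently asked interview questions

Q1: How is a Bloom filter's false positive rate calculated?

A: Formula: p = (1 - e^(-kn/m))^k

k = number of hash functions
n = number of inserted elements
m = bit array size

Explanation:

e^(-kn/m) approximates the probability that a given bit is still 0 after n elements (kn bit writes in total) have been inserted
1 - e^(-kn/m) is therefore the probability that a given bit is 1
(1 - e^(-kn/m))^k is the probability that all k probed bits are 1 for an element that was never inserted, which is the false positive rate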
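
Plugging numbers into the formula shows how sensitive the rate is to overfilling. The sketch below fixes m and k at the values computed earlier for 1 million items at 1% and varies n; `falsePositiveRate` is an illustrative name.

```javascript
// p = (1 - e^(-k * n / m))^k
function falsePositiveRate(k, n, m) {
  return Math.pow(1 - Math.exp((-k * n) / m), k);
}

const m = 9585059; // bits, sized for 1,000,000 items at a 1% target
const k = 7;       // hash functions

for (const n of [500000, 1000000, 2000000, 5000000]) {
  const rate = falsePositiveRate(k, n, m);
  console.log(`${n} items: ${(rate * 100).toFixed(2)}%`);
}
```

At the design capacity the rate is about 1%; at twice the capacity it is already around 16%, and at five times the capacity more than 80% of lookups for absent items come back "possibly present".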

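Q2: Why doesn't a Bloom filter support deletion?

A: Because several elements can share the same bit.

Example:

Insert "hello": sets bits 3, 7, 12
Insert "world": sets bits 1, 7, 10

If we delete "hello":
- Clearing bit 7 also breaks lookups for "world", which would now be reported as absent
- That is the "accidental deletion" problem: it turns into a false negative

Fix: use a counting Bloom filter, which replaces each bit with a counter.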
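
The example can be checked directly. The sketch below hard-codes the bit positions from the answer above (they stand in for real hash functions) and compares what happens when "hello" is removed from a plain bit array and from an array of counters.

```javascript
// Bit positions from the example above; stand-ins for real hash functions.
const positions = { hello: [3, 7, 12], world: [1, 7, 10] };

// Plain bits: clearing "hello" also clears bit 7, which "world" needs.
const bits = new Array(16).fill(0);
for (const word of ['hello', 'world']) {
  positions[word].forEach((p) => { bits[p] = 1; });
}
positions.hello.forEach((p) => { bits[p] = 0; });
const worldInBits = positions.world.every((p) => bits[p] === 1);

// Counters: position 7 holds 2, so removing "hello" leaves 1 behind for "world".
const counters = new Array(16).fill(0);
for (const word of ['hello', 'world']) {
  positions[word].forEach((p) => { counters[p] += 1; });
}
positions.hello.forEach((p) => { counters[p] -= 1; });
const worldInCounters = positions.world.every((p) => counters[p] > 0);
const helloInCounters = positions.hello.every((p) => counters[p] > 0);

console.log(worldInBits);     // false: "world" has vanished, a false negative
console.log(worldInCounters); // true:  "world" survives the removal
console.log(helloInCounters); // false: "hello" is gone, as intended
```

The price is memory: each position now needs a few bits (commonly 4 or 8) instead of one.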

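Q3: How do Bloom filters differ from cuckoo filters?

A:

Feature	Bloom filter	Cuckoo filter
Deletion	❌	✅
False positive rate	Set by m, n and k; rises as the filter fills	Set mainly by fingerprint length; inserts can fail when the table is nearly full
Implementation	Simple	More complex
Space efficiency	High	Comparable, and often better at low target false positive rates
Typical use	Mostly append-only sets	Sets that need deletion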

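📈 Going further: Bloom filter variants

1. Counting Bloom filter

Supports deletion
Uses counters instead of bits

2. Layered Bloom filter

Chains several filters together
Lowers the false positive rate

3. Scalable Bloom filter

Grows dynamically
Creates new filters automatically as earlier ones fill up

4. Cuckoo filter

Supports deletion
Space efficiency close to that of a Bloom filter
Available in Redis through the RedisBloom module (Redis 4.0+ supports modules), alongside Bloom filter commands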

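🎓 Summary

What a Bloom filter buys you

Tiny footprint: about 1 MB vs 80 MB for 1 million items, roughly 98% less space
Fast lookups: O(k), typically well under 1 ms
"Definitely not present": negative answers are always correct
Used in production: Chrome Safe Browsing (historically), Redis via RedisBloom, Bitcoin SPV clients

When to use one

✅ A good fit:

Cache penetration protection
Large-scale deduplication
Blocklists and allowlists
More than about 1 million items
A small false positive rate is acceptable

❌ A poor fit:

Small data sets (fewer than about 10,000 items)
Answers must be 100% accurate
Frequent deletions
You need to enumerate the stored elements

What should the next article cover?

Leave a comment with the algorithm topic you most want to see!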

💬 Questions are welcome in the comments!