LevelDB Source Code Walkthrough (Part 1)


skiplist

MemTable

The MemTable stores its data in an internal skiplist.

ADD

input:seq number, type, user key, value

internalKey: user key + seq num (7 bytes) + type (1 byte)

memKey: internalKey size (var int) + internalKey + value size (var int) + value

The memKey is what ultimately gets stored in the skiplist.
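The layout above can be sketched in code. This is a hedged illustration of the memKey layout, not leveldb's actual helpers (the real code lives in MemTable::Add and util/coding.cc); the packing of seq and type into a single fixed64 follows the 7-byte/1-byte split described above.

```cpp
#include <cstdint>
#include <string>

// Minimal varint32: 7 bits per byte, high bit set on continuation bytes.
static void PutVarint32(std::string* dst, uint32_t v) {
  while (v >= 128) {
    dst->push_back(static_cast<char>(v | 128));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Little-endian fixed64.
static void PutFixed64(std::string* dst, uint64_t v) {
  for (int i = 0; i < 8; i++) {
    dst->push_back(static_cast<char>(v & 0xff));
    v >>= 8;
  }
}

// Sketch of the memKey layout: internalKey size (varint32), user key,
// then one fixed64 holding (seq << 8) | type, then value size and value.
// type 0x1 = value, 0x0 = deletion (mirroring leveldb's ValueType).
std::string EncodeMemKey(const std::string& user_key, uint64_t seq,
                         uint8_t type, const std::string& value) {
  std::string buf;
  PutVarint32(&buf, static_cast<uint32_t>(user_key.size() + 8));
  buf.append(user_key);
  PutFixed64(&buf, (seq << 8) | type);  // 7-byte seq + 1-byte type
  PutVarint32(&buf, static_cast<uint32_t>(value.size()));
  buf.append(value);
  return buf;
}
```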

LookupKey

keyLength (var int) + user key + seq num (7 bytes) + type (1 byte)

Get

input: LookupKey

output: value

A memKey (with an empty value portion) is used to do a >= seek in the skiplist. Within the skiplist, entries with the same userKey are further ordered by seqNum descending, so the newest record for a given key is retrieved first (note that the type must be checked to see whether it is a deletion record). The value is then parsed out of the entry found.

MemTableIterator

It is simply an iterator over the MemTable's internal skiplist.

Ref

The MemTable maintains a reference count; when it drops to zero, the MemTable destroys itself.

The Comparator interface

Compares two slices.

Name: recorded in the leveldb metadata, so that a db opened multiple times cannot be written and queried with mismatched Comparators

FindShortestSeparator: returns a shorter slice that sorts between two keys, used to save storage space

FindShortSuccessor: returns a (possibly) shorter slice that is >= the key, used to save storage space
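A minimal sketch of what these two hooks do for a plain bytewise comparator (simplified from leveldb's BytewiseComparatorImpl; the logic is abridged, not the exact implementation):

```cpp
#include <cstdint>
#include <string>

// Shorten *start so it still satisfies *start >= old *start and *start < limit.
void FindShortestSeparator(std::string* start, const std::string& limit) {
  size_t min_len = start->size() < limit.size() ? start->size() : limit.size();
  size_t diff = 0;
  while (diff < min_len && (*start)[diff] == limit[diff]) diff++;
  if (diff >= min_len) return;  // one is a prefix of the other: leave unchanged
  uint8_t byte = static_cast<uint8_t>((*start)[diff]);
  if (byte < 0xff && byte + 1 < static_cast<uint8_t>(limit[diff])) {
    (*start)[diff] = static_cast<char>(byte + 1);
    start->resize(diff + 1);  // drop the tail: a shorter key, still in range
  }
}

// Shorten *key to a (possibly) shorter string that is >= *key.
void FindShortSuccessor(std::string* key) {
  for (size_t i = 0; i < key->size(); i++) {
    uint8_t byte = static_cast<uint8_t>((*key)[i]);
    if (byte != 0xff) {
      (*key)[i] = static_cast<char>(byte + 1);
      key->resize(i + 1);
      return;
    }
  }
  // key consists entirely of 0xff bytes: leave it as-is
}
```

For example, `FindShortestSeparator` turns "abd" with limit "aef" into "ac", the same separator example used later for the index block.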

InternalKeyComparator

Used for comparing memKeys inside the MemTable; it wraps a user Comparator implementation:

memKeys are ordered by: 1) user key ascending, 2) seq num descending, 3) type descending
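A hedged sketch of this ordering, assuming plain byte comparison for user keys and that the trailing 8 bytes hold (seq << 8) | type as a little-endian fixed64 (the decode below also assumes a little-endian host):

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Build an internal key: user key followed by one fixed64 tag.
std::string MakeInternalKey(const std::string& user_key, uint64_t seq,
                            uint8_t type) {
  std::string k = user_key;
  uint64_t tag = (seq << 8) | type;
  k.append(reinterpret_cast<const char*>(&tag), 8);  // little-endian host
  return k;
}

static uint64_t DecodeTag(const std::string& ikey) {
  uint64_t tag = 0;
  std::memcpy(&tag, ikey.data() + ikey.size() - 8, 8);
  return tag;
}

// < 0 if a sorts before b, > 0 if after, 0 if equal.
int CompareInternalKey(const std::string& a, const std::string& b) {
  std::string ua = a.substr(0, a.size() - 8);
  std::string ub = b.substr(0, b.size() - 8);
  int r = ua.compare(ub);  // 1) user key ascending
  if (r != 0) return r;
  uint64_t ta = DecodeTag(a), tb = DecodeTag(b);
  if (ta > tb) return -1;  // 2)+3) larger (seq, type) tag sorts first
  if (ta < tb) return +1;
  return 0;
}
```

Comparing the whole tag descending gives seq descending and, for equal seq, type descending, which is why a Get finds the newest version of a key first.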

WAL log

A log record is written before the MemTable is updated; after a crash, the MemTable can be rebuilt from the WAL log. All log writes are sequential, append-only.

Format:

The log file contents are a sequence of 32KB blocks. 
The only exception is that the tail of the file may contain a partial block.
Each block consists of a sequence of records:
block:= record* trailer?
record :=
checksum: uint32     // crc32c of type and data[] ; little-endian
length: uint16       // little-endian
type: uint8          // One of FULL, FIRST, MIDDLE, LAST
data: uint8[length]

record type:

  • FULL: the block's data holds one complete record

  • FIRST: a long record is split into multiple fragments across blocks; FIRST marks the first fragment

  • MIDDLE: a middle fragment of a long record

  • LAST: the final fragment of a long record

When a block has fewer than 7 bytes left (the size of the checksum + length + type header), no record data can fit, so the remainder is simply padded with zeros.
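The fragmentation rule can be sketched as follows. This is an illustrative model only, not the real Writer::AddRecord (which also emits the 7-byte headers, checksums and the zero padding); it just computes which fragment types a record of a given size produces.

```cpp
#include <cstddef>
#include <vector>

constexpr size_t kBlockSize = 32768;  // 32KB blocks
constexpr size_t kHeaderSize = 7;     // checksum(4) + length(2) + type(1)

enum RecordType { kFull, kFirst, kMiddle, kLast };

// Fragment types emitted for a record of `size` payload bytes, starting at
// `block_offset` within the current block.
std::vector<RecordType> FragmentTypes(size_t size, size_t block_offset) {
  std::vector<RecordType> types;
  bool begin = true;
  do {
    size_t leftover = kBlockSize - block_offset;
    if (leftover < kHeaderSize) {
      block_offset = 0;  // too small even for a header: zero-pad, new block
      leftover = kBlockSize;
    }
    size_t avail = leftover - kHeaderSize;
    size_t fragment = size < avail ? size : avail;
    bool end = (fragment == size);
    if (begin && end)  types.push_back(kFull);
    else if (begin)    types.push_back(kFirst);
    else if (end)      types.push_back(kLast);
    else               types.push_back(kMiddle);
    block_offset += kHeaderSize + fragment;
    size -= fragment;
    begin = false;
  } while (size > 0);
  return types;
}
```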

Writer

The writer tracks the current block_offset. For each record, the writer splits it into physical records according to block_offset, and appends all of the physical records into the blocks of the file.

WritableFile

This is leveldb's file interface abstraction; a posix implementation is provided by default. Users can subclass and implement this interface themselves to port leveldb to other platforms (e.g. windows, object storage, hdfs). It supports:

  • append

  • close

  • flush

  • sync: synchronous write. Under the default posix semantics, the write is flushed to disk before success is returned; otherwise success is returned once the write reaches the page cache. Data that reaches the page cache survives a process crash, but not an os/machine crash. Conversely, writing through to disk keeps data intact across an os/machine crash, at the cost of performance.
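A minimal POSIX sketch of the distinction. This is illustrative only; the function name and path below are made up and this is not leveldb's PosixWritableFile. write() lands data in the page cache (survives a process crash); fsync() forces it to stable storage (survives an OS crash).

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <string>

// Append `data` to `path`, then force it to stable storage.
bool AppendAndSync(const char* path, const std::string& data) {
  int fd = ::open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
  if (fd < 0) return false;
  bool ok = ::write(fd, data.data(), data.size()) ==
            static_cast<ssize_t>(data.size());  // page cache only so far
  if (ok) ok = ::fsync(fd) == 0;                // now durable on disk
  ::close(fd);
  return ok;
}
```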

Reader

Main members:

  • reporter: reports errors

  • SequentialFile: a file interface that supports seeking and sequential reads; it can be extended and ported the same way as the WritableFile above

  • checksum: whether to verify each record's checksum

  • SkipToInitialBlock: moves the file offset to the start of the first block

  • ReadRecord

SSTable

The sstable is where leveldb organizes and stores sorted kv pairs in the file system.

Format

table/block.h

  • Data blocks: sorted kv pairs

  • Meta block: filters

  • Meta index block: an index into the filter block

  • Index block: an index into the data blocks

  • Footer:

Block

Restart point

Keys in an sstable are stored sorted and prefix-compressed. This saves storage space but adds computation and latency to lookups. Therefore, every few keys one key is stored in full without prefix compression; these fully stored points are called restart points. The offsets of the restart points are recorded at the end of the block.

The block format is reused heavily; its layout is:

Block data + type (1 byte) + crc32 (4 bytes)

type indicates the compression method: none or snappy

Key-value pairs within a block:

  • shared_bytes: varint32, length of the key prefix shared with the previous entry

  • unshared_bytes: varint32, length of the key after the shared prefix

  • value_length: varint32, length of the value

  • key_delta: char[unshared_bytes], the key bytes after the shared prefix

  • value: char[value_length], the value

For restart points: shared_bytes = 0.

End of a block:

  • restarts: uint32[num_restarts], the offset of each restart point

  • num_restarts: uint32
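The prefix compression above can be sketched as follows. This is a hedged illustration (single-byte lengths instead of real varint32s, and no restart-array handling), not leveldb's BlockBuilder:

```cpp
#include <cstdint>
#include <string>

// One block entry, with single-byte "varints" for brevity.
struct Entry {
  uint8_t shared, unshared, value_length;
  std::string key_delta, value;
};

// Encode a key relative to the previously appended key.
Entry MakeEntry(const std::string& prev_key, const std::string& key,
                const std::string& value, bool restart_point) {
  size_t shared = 0;
  if (!restart_point) {  // restart points store the full key: shared = 0
    size_t n = prev_key.size() < key.size() ? prev_key.size() : key.size();
    while (shared < n && prev_key[shared] == key[shared]) shared++;
  }
  return Entry{static_cast<uint8_t>(shared),
               static_cast<uint8_t>(key.size() - shared),
               static_cast<uint8_t>(value.size()),
               key.substr(shared), value};
}

// A reader rebuilds the key from the previous key plus the delta.
std::string RebuildKey(const std::string& prev_key, const Entry& e) {
  return prev_key.substr(0, e.shared) + e.key_delta;
}
```

This is also why decoding must start from a restart point: every other entry's key depends on the key before it.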

Block Builder

Blocks are constructed by a BlockBuilder.

  • The Add() method appends entries to the block in order; appended keys must be strictly increasing.

A slice records the last appended key, and a counter tracks how many entries have been appended since the last restart point; from these, shared_bytes, unshared_bytes, value_length and key_delta are computed.

  • Finally, Finish() is called to append the restart index to the block.

Block Reader

Inside class Block:

  const char* data_;
  size_t size_;
  uint32_t restart_offset_;  // Offset in data_ of restart array
  bool owned_;               // Block owns data_[]

An iterator is provided for traversing the entries in a block:

  const Comparator* const comparator_;
  const char* const data_;       // underlying block contents
  uint32_t const restarts_;      // Offset of restart array (list of fixed32)
  uint32_t const num_restarts_;  // Number of uint32_t entries in restart array

  // current_ is offset in data_ of current entry.  >= restarts_ if !Valid
  uint32_t current_;
  uint32_t restart_index_;  // Index of restart block in which current_ falls
  std::string key_;
  Slice value_;
  Status status_;
  • Seek: binary-search the restart array for the last restart point whose key is < target, then scan forward linearly from there

Create SSTable

Source: table_builder.cc / block_builder.cc

The only member is a Rep*, with the fields wrapped in a separate struct, presumably to hide the internal implementation and allow it to evolve, since table_builder.h is exposed directly to users of the LevelDB library (it lives under include/leveldb).

struct TableBuilder::Rep {
  Options options;
  Options index_block_options;
  WritableFile* file;
  uint64_t offset;
  Status status;
  BlockBuilder data_block;
  BlockBuilder index_block;
  std::string last_key;
  int64_t num_entries;
  bool closed;  // Either Finish() or Abandon() has been called.
  FilterBlockBuilder* filter_block;
  bool pending_index_entry;  // true only if data block is empty
  BlockHandle pending_handle;  // Handle to add to index block
  std::string compressed_output;
};

Add key

  • The inserted key must be > last_key.

  • When the first entry of a new data block is added, an index_handle for the previous block is created and inserted into the index block. The index key is the shortest separator of last_key and the current key (e.g. the shortest separator of keys "abd" and "aef" can be "ac", which effectively shortens the index keys)

  • The current key is added to the bloom filter

  • The key & value are inserted into the current data block

  • When the current data block reaches the size threshold, Flush() is called to write it out to the file.

void TableBuilder::Add(const Slice& key, const Slice& value) {
  Rep* r = rep_;
  assert(!r->closed);
  if (!ok()) return;
  if (r->num_entries > 0) {
    assert(r->options.comparator->Compare(key, Slice(r->last_key)) > 0);
  }

  if (r->pending_index_entry) {
    assert(r->data_block.empty());
    r->options.comparator->FindShortestSeparator(&r->last_key, key);
    std::string handle_encoding;
    r->pending_handle.EncodeTo(&handle_encoding);
    r->index_block.Add(r->last_key, Slice(handle_encoding));
    r->pending_index_entry = false;
  }

  if (r->filter_block != nullptr) {
    r->filter_block->AddKey(key);
  }

  r->last_key.assign(key.data(), key.size());
  r->num_entries++;
  r->data_block.Add(key, value);

  const size_t estimated_block_size = r->data_block.CurrentSizeEstimate();
  if (estimated_block_size >= r->options.block_size) {
    Flush();
  }
}

Finish

Called once the SSTable's data is complete; it writes the following to the ssTable in order:

  • filterBlock

  • Meta index block

  • Index block

  • footer

Read SSTable

  • Read and validate the footer

  • Get the offsets of the index block and meta index block from the footer

  • Read the index block and meta index block

  Footer footer;
  s = footer.DecodeFrom(&footer_input);
  if (!s.ok()) return s;

  // Read the index block
  BlockContents index_block_contents;
  ReadOptions opt;
  if (options.paranoid_checks) {
    opt.verify_checksums = true;
  }
  s = ReadBlock(file, opt, footer.index_handle(), &index_block_contents);

  if (s.ok()) {
    // We've successfully read the footer and the index block: we're
    // ready to serve requests.
    Block* index_block = new Block(index_block_contents);
    Rep* rep = new Table::Rep;
    rep->options = options;
    rep->file = file;
    rep->metaindex_handle = footer.metaindex_handle();
    rep->index_block = index_block;
    rep->cache_id = (options.block_cache ? options.block_cache->NewId() : 0);
    rep->filter_data = nullptr;
    rep->filter = nullptr;
    *table = new Table(rep);
    (*table)->ReadMeta(footer);
  }

SSTable iterator

The table's iterator is a TwoLevelIterator. The first level is the index block's iterator; each of its entries points at a data block, and each data block has its own iterator, forming the second level. A lookup through the table iterator is thus a binary search built over both iterator levels.

Status Table::InternalGet(const ReadOptions& options, const Slice& k, void* arg,
                          void (*handle_result)(void*, const Slice&,
                                                const Slice&)) {
  Status s;
  Iterator* iiter = rep_->index_block->NewIterator(rep_->options.comparator);
  iiter->Seek(k);
  if (iiter->Valid()) {
    Slice handle_value = iiter->value();
    FilterBlockReader* filter = rep_->filter;
    BlockHandle handle;
    if (filter != nullptr && handle.DecodeFrom(&handle_value).ok() &&
        !filter->KeyMayMatch(handle.offset(), k)) {
      // Not found
    } else {
      Iterator* block_iter = BlockReader(this, options, iiter->value());
      block_iter->Seek(k);
      if (block_iter->Valid()) {
        (*handle_result)(arg, block_iter->key(), block_iter->value());
      }
      s = block_iter->status();
      delete block_iter;
    }
  }
  if (s.ok()) {
    s = iiter->status();
  }
  delete iiter;
  return s;
}

Table Cache

  • The table cache provides an LRU cache over SSTables.

  • It can read (and cache) an sstable and return an iterator over that sstable

  • It can read (and cache) an sstable, then look a key up in it and invoke a callback

  • It supports manually evicting an sstable from the LRU cache

Reading (and caching) an sstable:

Status TableCache::FindTable(uint64_t file_number, uint64_t file_size,
                             Cache::Handle** handle) {
  Status s;
  char buf[sizeof(file_number)]; // file_number is the key in the LRU cache
  EncodeFixed64(buf, file_number);
  Slice key(buf, sizeof(buf));
  *handle = cache_->Lookup(key);
  // cache miss:
  if (*handle == nullptr) {
    std::string fname = TableFileName(dbname_, file_number);
    RandomAccessFile* file = nullptr;
    Table* table = nullptr;
    // the corresponding file in the file system
    s = env_->NewRandomAccessFile(fname, &file);
    if (!s.ok()) {
        // compatibility with the old file name
      std::string old_fname = SSTTableFileName(dbname_, file_number);
      if (env_->NewRandomAccessFile(old_fname, &file).ok()) {
        s = Status::OK();
      }
    }
    if (s.ok()) {
        // load the sstable from the file system into memory
      s = Table::Open(options_, file, file_size, &table);
    }

    if (!s.ok()) {
      assert(table == nullptr);
      delete file;
      // We do not cache error results so that if the error is transient,
      // or somebody repairs the file, we recover automatically.
    } else {
        // cache the sstable in the LRU cache
      TableAndFile* tf = new TableAndFile;
      tf->file = file;
      tf->table = table;
      *handle = cache_->Insert(key, tf, 1, &DeleteEntry);
    }
  }
  // cache hit: return directly; the handle holds the cached table and file
  return s;
}

Reading (and caching) an sstable and returning its iterator:

Iterator* TableCache::NewIterator(const ReadOptions& options,
                                  uint64_t file_number, uint64_t file_size,
                                  Table** tableptr) {
  if (tableptr != nullptr) {
    *tableptr = nullptr;
  }

  Cache::Handle* handle = nullptr;
  // get the sstable from the cache, or read it from the file system and cache it
  Status s = FindTable(file_number, file_size, &handle);
  if (!s.ok()) {
    return NewErrorIterator(s);
  }

  Table* table = reinterpret_cast<TableAndFile*>(cache_->Value(handle))->table;
  // build the sstable's iterator
  Iterator* result = table->NewIterator(options);
  // register a cleanup callback: when the iterator is destructed it releases the cache item; unreleased cache items cannot be evicted
  result->RegisterCleanup(&UnrefEntry, cache_, handle);
  if (tableptr != nullptr) {
    *tableptr = table;
  }
  return result;
}

Reading (and caching) an sstable, then looking up a key in it and invoking a callback:

Status TableCache::Get(const ReadOptions& options, uint64_t file_number,
                       uint64_t file_size, const Slice& k, void* arg,
                       void (*handle_result)(void*, const Slice&,
                                             const Slice&)) {
  Cache::Handle* handle = nullptr;
  Status s = FindTable(file_number, file_size, &handle);
  if (s.ok()) {
    Table* t = reinterpret_cast<TableAndFile*>(cache_->Value(handle))->table;
    s = t->InternalGet(options, k, arg, handle_result);
    cache_->Release(handle);
  }
  return s;
}

Manual cache eviction:

void TableCache::Evict(uint64_t file_number) {
  char buf[sizeof(file_number)];
  EncodeFixed64(buf, file_number);
  cache_->Erase(Slice(buf, sizeof(buf)));
}

LRU Cache

LRU Handle

Because the hash table in LevelDB exists only to support the LRU cache, its implementation is specialized rather than templated; the elements of the hash table are LRUHandles.

struct LRUHandle {
  void* value;
  void (*deleter)(const Slice&, void* value);
  LRUHandle* next_hash; 
  LRUHandle* next;
  LRUHandle* prev;
  size_t charge;  // TODO(opt): Only allow uint32_t?
  size_t key_length;
  bool in_cache;     // Whether entry is in the cache.
  uint32_t refs;     // References, including cache reference, if present.
  uint32_t hash;     // cached so node hashes need not be recomputed on every hash table resize
  char key_data[1];  // Beginning of key

  Slice key() const {
    // next is only equal to this if the LRU handle is the list head of an
    // empty list. List heads never have meaningful keys.
    assert(next != this);

    return Slice(key_data, key_length);
  }
};

Hash Table

The LRU cache needs a hash table. leveldb ships a simple internal hash table implementation, reportedly somewhat faster than built-in ones.

  • Members: the HashTable resolves hash collisions by chaining; its internal members are
// The table consists of an array of buckets where each bucket is
  // a linked list of cache entries that hash into the bucket.
  uint32_t length_;
  uint32_t elems_;
  LRUHandle** list_;

  • Resize: when the hash table's size reaches the threshold, a resize is triggered: allocate a new chunk of memory for the table, rehash the elements of the old table into the new one, and free the old table's memory
void Resize() {
    uint32_t new_length = 4;
    while (new_length < elems_) {
      new_length *= 2;
    }
    LRUHandle** new_list = new LRUHandle*[new_length];
    memset(new_list, 0, sizeof(new_list[0]) * new_length);
    uint32_t count = 0;
    for (uint32_t i = 0; i < length_; i++) {
      LRUHandle* h = list_[i];
      while (h != nullptr) {
        LRUHandle* next = h->next_hash;
        uint32_t hash = h->hash;
        LRUHandle** ptr = &new_list[hash & (new_length - 1)];
        h->next_hash = *ptr;
        *ptr = h;
        h = next;
        count++;
      }
    }
    assert(elems_ == count);
    delete[] list_;
    list_ = new_list;
    length_ = new_length;
  }
  • Insert: find the slot for the key and hash in the existing hashTable; the new node is linked in, and any old node with the same key is returned (nullptr if none). Resize is triggered when needed
LRUHandle* Insert(LRUHandle* h) {
    LRUHandle** ptr = FindPointer(h->key(), h->hash);
    LRUHandle* old = *ptr;
    h->next_hash = (old == nullptr ? nullptr : old->next_hash);
    *ptr = h;
    if (old == nullptr) {
      ++elems_;
      if (elems_ > length_) {
        // Since each cache entry is fairly large, we aim for a small
        // average linked list length (<= 1).
        Resize();
      }
    }
    return old;
  }
  
  // Return a pointer to slot that points to a cache entry that
  // matches key/hash.  If there is no such cache entry, return a
  // pointer to the trailing slot in the corresponding linked list.
  LRUHandle** FindPointer(const Slice& key, uint32_t hash) {
    LRUHandle** ptr = &list_[hash & (length_ - 1)];
    while (*ptr != nullptr && ((*ptr)->hash != hash || key != (*ptr)->key())) {
      ptr = &(*ptr)->next_hash;
    }
    return ptr;
  }
  • Remove
  LRUHandle* Remove(const Slice& key, uint32_t hash) {
    LRUHandle** ptr = FindPointer(key, hash);
    LRUHandle* result = *ptr;
    if (result != nullptr) {
      *ptr = result->next_hash;
      --elems_;
    }
    return result;
  }

Now to the LRU Cache itself:

The LRU cache is implemented as doubly linked lists plus a hash table. The cache consists of two doubly linked lists; every current cache item is on exactly one of them, while items that clients are still accessing but that have already been evicted from the cache are on neither. The LRU cache protects its state with a mutex, so it is thread-safe. The two lists are:

  • in_use_: holds items currently accessed by clients, in no particular order

  • lru_: holds items not currently accessed by any client, in LRU order

When a client requests or releases access to a cache item, the item moves between the two lists accordingly.

Main members:

// Initialized before use.
  size_t capacity_;

  // mutex_ protects the following state.
  mutable port::Mutex mutex_;
  size_t usage_ GUARDED_BY(mutex_);  // GUARDED_BY is a clang thread-safety annotation that leveldb uses for mutex-protected state

  // Dummy head of LRU list.
  // lru.prev is newest entry, lru.next is oldest entry.
  // Entries have refs==1 and in_cache==true.
  LRUHandle lru_ GUARDED_BY(mutex_);

  // Dummy head of in-use list.
  // Entries are in use by clients, and have refs >= 2 and in_cache==true.
  LRUHandle in_use_ GUARDED_BY(mutex_);

  HandleTable table_ GUARDED_BY(mutex_);
};

Initialization creates two empty circular linked lists:

LRUCache::LRUCache() : capacity_(0), usage_(0) {
  // Make empty circular linked lists.
  lru_.next = &lru_;
  lru_.prev = &lru_;
  in_use_.next = &in_use_;
  in_use_.prev = &in_use_;
}

Destruction:

LRUCache::~LRUCache() {
  assert(in_use_.next == &in_use_);  // Error if caller has an unreleased handle
  for (LRUHandle* e = lru_.next; e != &lru_;) {
    LRUHandle* next = e->next;
    assert(e->in_cache);
    e->in_cache = false;
    assert(e->refs == 1);  // Invariant of lru_ list.
    Unref(e);
    e = next;
  }
}
void LRUCache::Unref(LRUHandle* e) {
  assert(e->refs > 0);
  e->refs--;
  if (e->refs == 0) {  // Deallocate.
    assert(!e->in_cache);
    (*e->deleter)(e->key(), e->value);
    free(e);
  } else if (e->in_cache && e->refs == 1) {
   // No longer in use; move to lru_ list.
    LRU_Remove(e);
    LRU_Append(&lru_, e);
  }
}

Lookup and release: via Ref and Unref, entries move between the lru_ and in_use_ lists:

Cache::Handle* LRUCache::Lookup(const Slice& key, uint32_t hash) {
  MutexLock l(&mutex_);
  LRUHandle* e = table_.Lookup(key, hash);
  if (e != nullptr) {
    Ref(e);
  }
  return reinterpret_cast<Cache::Handle*>(e);
}

void LRUCache::Release(Cache::Handle* handle) {
  MutexLock l(&mutex_);
  Unref(reinterpret_cast<LRUHandle*>(handle));
}

Insertion, shown below, mainly consists of:

  • initializing the node to insert

  • appending it to the in_use_ list and adding it to the hash table

  • if usage exceeds capacity, evicting entries from the lru_ list in LRU order

Cache::Handle* LRUCache::Insert(const Slice& key, uint32_t hash, void* value,
                                size_t charge,
                                void (*deleter)(const Slice& key,
                                                void* value)) {
  MutexLock l(&mutex_);

  LRUHandle* e =
      reinterpret_cast<LRUHandle*>(malloc(sizeof(LRUHandle) - 1 + key.size()));
  e->value = value;
  e->deleter = deleter;
  e->charge = charge;
  e->key_length = key.size();
  e->hash = hash;
  e->in_cache = false;
  e->refs = 1;  // for the returned handle.
  std::memcpy(e->key_data, key.data(), key.size());

  if (capacity_ > 0) {
    e->refs++;  // for the cache's reference.
    e->in_cache = true;
    LRU_Append(&in_use_, e);
    usage_ += charge;
    FinishErase(table_.Insert(e));
  } else {  // don't cache. (capacity_==0 is supported and turns off caching.)
    // next is read by key() in an assert, so it must be initialized
    e->next = nullptr;
  }
  while (usage_ > capacity_ && lru_.next != &lru_) {
    LRUHandle* old = lru_.next;
    assert(old->refs == 1);
    bool erased = FinishErase(table_.Remove(old->key(), old->hash));
    if (!erased) {  // to avoid unused variable when compiled NDEBUG
      assert(erased);
    }
  }

 return reinterpret_cast<Cache::Handle*>(e);
}

Sharded LRU Cache

It uses LRUCache instances as the underlying caches and shards data across them by hash. The default cache actually used by LevelDB is this sharded LRU cache.

class ShardedLRUCache : public Cache {
 private:
  LRUCache shard_[kNumShards];
  port::Mutex id_mutex_;
  uint64_t last_id_;

  static inline uint32_t HashSlice(const Slice& s) {
    return Hash(s.data(), s.size(), 0);
  }

  static uint32_t Shard(uint32_t hash) { return hash >> (32 - kNumShardBits); }

 public:
  explicit ShardedLRUCache(size_t capacity) : last_id_(0) {
    const size_t per_shard = (capacity + (kNumShards - 1)) / kNumShards;
    for (int s = 0; s < kNumShards; s++) {
      shard_[s].SetCapacity(per_shard);
    }
  }
  ~ShardedLRUCache() override {}
  Handle* Insert(const Slice& key, void* value, size_t charge,
                 void (*deleter)(const Slice& key, void* value)) override {
    const uint32_t hash = HashSlice(key);
    return shard_[Shard(hash)].Insert(key, hash, value, charge, deleter);
  }
  Handle* Lookup(const Slice& key) override {
    const uint32_t hash = HashSlice(key);
    return shard_[Shard(hash)].Lookup(key, hash);
  }
  void Release(Handle* handle) override {
    LRUHandle* h = reinterpret_cast<LRUHandle*>(handle);
    shard_[Shard(h->hash)].Release(handle);
  }
  void Erase(const Slice& key) override {
    const uint32_t hash = HashSlice(key);
    shard_[Shard(hash)].Erase(key, hash);
  }
  void* Value(Handle* handle) override {
    return reinterpret_cast<LRUHandle*>(handle)->value;
  }
  uint64_t NewId() override {
    MutexLock l(&id_mutex_);
    return ++(last_id_);
  }
  void Prune() override {
    for (int s = 0; s < kNumShards; s++) {
      shard_[s].Prune();
    }
  }
  size_t TotalCharge() const override {
    size_t total = 0;
    for (int s = 0; s < kNumShards; s++) {
      total += shard_[s].TotalCharge();
    }
    return total;
  }
};

}  // end anonymous namespace

Filter

bloom filter

A [bloom filter](en.wikipedia.org/wiki/Bloom\…) is a multi-hash model. With the hash space held constant, using more hash functions reduces the probability that all of them collide at the same time.

Because it is ultimately built on hash functions, collisions are unavoidable, so a bloom filter's verdict on whether a key exists is not exact. If the bloom filter misses, the key definitely does not exist; if it hits, the key may still be absent (a hash collision occurred). leveldb therefore uses the bloom filter to quickly rule out that a key is in a block.
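The false-positive behavior can be quantified. The following standard bloom filter analysis is not from the source article, but it makes the trade-off concrete: with m bits, k hash functions and n inserted keys,

```latex
% Probability that a specific bit is still 0 after inserting n keys:
\left(1 - \tfrac{1}{m}\right)^{kn} \approx e^{-kn/m}
% Hence the false-positive rate is approximately
p \approx \left(1 - e^{-kn/m}\right)^{k}
% which is minimized at k = (m/n)\ln 2 \approx 0.69 \cdot m/n.
% With m/n = 10 bits per key this gives p \approx 1\%.
```

This matches the constants that appear later in BloomFilterPolicy: k_ = bits_per_key * 0.69, and the comment that 10 bits per key yields about a 1% false positive rate.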

double hash

A bloom filter needs several distinct hash functions; leveldb simulates arbitrarily many hash functions via double hashing. Given a hash function H1, define

H2(x) = (H1(x) >> 17) | (H1(x) << 15)

Then the i-th hash function Gi of the bloom filter is:

Gi(key) = H1(key) + i * H2(key)

The FilterPolicy interface is defined in filter_policy.h and can be implemented by users:

class LEVELDB_EXPORT FilterPolicy {
 public:
  virtual ~FilterPolicy();

  // Return the name of this policy.  Note that if the filter encoding
  // changes in an incompatible way, the name returned by this method
  // must be changed.  Otherwise, old incompatible filters may be
  // passed to methods of this type.
  virtual const char* Name() const = 0;

  // keys[0,n-1] contains a list of keys (potentially with duplicates)
  // that are ordered according to the user supplied comparator.
  // Append a filter that summarizes keys[0,n-1] to *dst.
  //
  // Warning: do not change the initial contents of *dst.  Instead,
  // append the newly constructed filter to *dst.
  virtual void CreateFilter(const Slice* keys, int n,
                            std::string* dst) const = 0;

  // "filter" contains the data appended by a preceding call to
  // CreateFilter() on this class.  This method must return true if
  // the key was in the list of keys passed to CreateFilter().
  // This method may return true or false if the key was not on the
  // list, but it should aim to return false with a high probability.
  virtual bool KeyMayMatch(const Slice& key, const Slice& filter) const = 0;
};

// Return a new filter policy that uses a bloom filter with approximately
// the specified number of bits per key.  A good value for bits_per_key
// is 10, which yields a filter with ~ 1% false positive rate.
//
// Callers must delete the result after any database that is using the
// result has been closed.
//
// Note: if you are using a custom comparator that ignores some parts
// of the keys being compared, you must not use NewBloomFilterPolicy()
// and must provide your own FilterPolicy that also ignores the
// corresponding parts of the keys.  For example, if the comparator
// ignores trailing spaces, it would be incorrect to use a
// FilterPolicy (like NewBloomFilterPolicy) that does not ignore
// trailing spaces in keys.
LEVELDB_EXPORT const FilterPolicy* NewBloomFilterPolicy(int bits_per_key);

}  // namespace leveldb

#endif  // STORAGE_LEVELDB_INCLUDE_FILTER_POLICY_H_

leveldb's default filter implementation, the bloom filter:

class BloomFilterPolicy : public FilterPolicy {
 public:
  explicit BloomFilterPolicy(int bits_per_key) : bits_per_key_(bits_per_key) {
    // We intentionally round down to reduce probing cost a little bit
    k_ = static_cast<size_t>(bits_per_key * 0.69);  // 0.69 =~ ln(2)
    if (k_ < 1) k_ = 1;
    if (k_ > 30) k_ = 30;
  }

  const char* Name() const override { return "leveldb.BuiltinBloomFilter2"; }

  void CreateFilter(const Slice* keys, int n, std::string* dst) const override {
    // Compute bloom filter size (in both bits and bytes)
    size_t bits = n * bits_per_key_;

    // For small n, we can see a very high false positive rate.  Fix it
    // by enforcing a minimum bloom filter length.
    if (bits < 64) bits = 64;

    size_t bytes = (bits + 7) / 8;
    bits = bytes * 8;

    const size_t init_size = dst->size();
    dst->resize(init_size + bytes, 0);
    dst->push_back(static_cast<char>(k_));  // Remember # of probes in filter
    char* array = &(*dst)[init_size];
    for (int i = 0; i < n; i++) {
      // Use double-hashing to generate a sequence of hash values.
      // See analysis in [Kirsch,Mitzenmacher 2006].
      uint32_t h = BloomHash(keys[i]);
      const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
      for (size_t j = 0; j < k_; j++) {
        const uint32_t bitpos = h % bits;
        array[bitpos / 8] |= (1 << (bitpos % 8));
        h += delta;
      }
    }
  }

  bool KeyMayMatch(const Slice& key, const Slice& bloom_filter) const override {
    const size_t len = bloom_filter.size();
    if (len < 2) return false;

    const char* array = bloom_filter.data();
    const size_t bits = (len - 1) * 8;

    // Use the encoded k so that we can read filters generated by
    // bloom filters created using different parameters.
    const size_t k = array[len - 1];
    if (k > 30) {
      // Reserved for potentially new encodings for short bloom filters.
      // Consider it a match.
      return true;
    }

    uint32_t h = BloomHash(key);
    const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
    for (size_t j = 0; j < k; j++) {
      const uint32_t bitpos = h % bits;
      if ((array[bitpos / 8] & (1 << (bitpos % 8))) == 0) return false;
      h += delta;
    }
    return true;
  }

 private:
  size_t bits_per_key_;
  size_t k_;
};
}  // namespace

filter block builder

The filter block is stored in the sstable as a meta block; see the sections above for how meta blocks are read and written.