skiplist
(omitted)
MemTable
Internally, data is stored in a skiplist.
ADD
input: seq number, type, user key, value
internalKey: user key + seq num (7 bytes) + type (1 byte)
memKey: internalKey size (varint) + internalKey + value size (varint) + value
The memKey is what ultimately gets stored in the skiplist (encoding sketched below).
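A sketch of this encoding, lightly simplified from MemTable::Add in db/memtable.cc (EncodeVarint32/EncodeFixed64/VarintLength are leveldb's coding helpers):
void MemTable::Add(SequenceNumber s, ValueType type,
                   const Slice& key, const Slice& value) {
  size_t internal_key_size = key.size() + 8;  // user key + 8-byte tag
  size_t encoded_len = VarintLength(internal_key_size) + internal_key_size +
                       VarintLength(value.size()) + value.size();
  char* buf = arena_.Allocate(encoded_len);
  char* p = EncodeVarint32(buf, internal_key_size);  // internalKey size (varint)
  std::memcpy(p, key.data(), key.size());            // user key
  p += key.size();
  EncodeFixed64(p, (s << 8) | type);                 // tag: seq num (7 bytes) + type (1 byte)
  p += 8;
  p = EncodeVarint32(p, value.size());               // value size (varint)
  std::memcpy(p, value.data(), value.size());        // value
  table_.Insert(buf);                                // the memKey goes into the skiplist
}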
LookupKey
keyLength (var int) + user key + seq num (7 bytes) + type (1 byte)
Get
input: LookupKey
output: value
Construct a memKey (with an empty value part) and perform a >= lookup in the skiplist. Within the skiplist, entries with the same user key are further ordered by seqNum in descending order, so the newest record for a given key is found first (the type must be checked to see whether it is a deletion record). The value is then parsed out of the item that was found.
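A condensed sketch of this lookup, based on MemTable::Get in db/memtable.cc:
bool MemTable::Get(const LookupKey& key, std::string* value, Status* s) {
  Slice memkey = key.memtable_key();   // memKey with empty value part
  Table::Iterator iter(&table_);
  iter.Seek(memkey.data());            // >= search in the skiplist
  if (iter.Valid()) {
    // entry layout: klength | user key | tag | vlength | value
    const char* entry = iter.key();
    uint32_t key_length;
    const char* key_ptr = GetVarint32Ptr(entry, entry + 5, &key_length);
    // The newest version of the user key sorts first; confirm it is the same user key.
    if (comparator_.comparator.user_comparator()->Compare(
            Slice(key_ptr, key_length - 8), key.user_key()) == 0) {
      const uint64_t tag = DecodeFixed64(key_ptr + key_length - 8);
      switch (static_cast<ValueType>(tag & 0xff)) {  // type decides hit vs. tombstone
        case kTypeValue: {
          Slice v = GetLengthPrefixedSlice(key_ptr + key_length);
          value->assign(v.data(), v.size());
          return true;
        }
        case kTypeDeletion:
          *s = Status::NotFound(Slice());
          return true;
      }
    }
  }
  return false;
}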
MemTableIterator
This is simply the iterator of the MemTable's internal skiplist.
Ref
The MemTable maintains a reference count; when the count reaches zero, the MemTable destroys itself.
Comparator接口
Compares slices.
Name: recorded in leveldb's metadata so that a mismatch between the comparator used for writing and the one used for reading can be detected when the db is reopened.
FindShortestSeparator: returns a shorter slice lying between two keys, to save storage space.
FindShortSuccessor: returns a (possibly) shorter slice that is >= key, to save storage space (a usage sketch follows).
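For intuition, a small usage sketch with the default bytewise comparator (the exact output values below are my assumption of its behavior, shown only for illustration):
const leveldb::Comparator* cmp = leveldb::BytewiseComparator();

std::string start = "abcdefg";
cmp->FindShortestSeparator(&start, "abzzzzz");  // start becomes a short key in ["abcdefg", "abzzzzz"), e.g. "abd"

std::string key = "helloworld";
cmp->FindShortSuccessor(&key);                  // key becomes a short key >= "helloworld", e.g. "i"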
InternalKeyComparator
Used to compare memKeys inside the MemTable; it wraps a user-supplied Comparator:
memKeys are ordered by: 1) user key ascending, 2) seq num descending, 3) type descending.
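A sketch of this ordering, adapted from InternalKeyComparator::Compare in db/dbformat.cc; since the 8-byte tag packs (seq << 8) | type, comparing the whole tag in descending order covers rules 2) and 3):
int InternalKeyComparator::Compare(const Slice& akey, const Slice& bkey) const {
  // 1) user key ascending, per the user-supplied comparator
  int r = user_comparator_->Compare(ExtractUserKey(akey), ExtractUserKey(bkey));
  if (r == 0) {
    // 2) + 3): tag = (seq << 8) | type, compared in descending order
    const uint64_t anum = DecodeFixed64(akey.data() + akey.size() - 8);
    const uint64_t bnum = DecodeFixed64(bkey.data() + bkey.size() - 8);
    if (anum > bnum) {
      r = -1;
    } else if (anum < bnum) {
      r = +1;
    }
  }
  return r;
}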
WAL log
Before the MemTable is updated, the log must be written first, so that the MemTable can be rebuilt from the WAL after a crash. All log writes are sequential, append-only writes.
Format:
The log file contents are a sequence of 32KB blocks.
The only exception is that the tail of the file may contain a partial block.
Each block consists of a sequence of records:
block:= record* trailer?
record :=
checksum: uint32 // crc32c of type and data[] ; little-endian
length: uint16 // little-endian
type: uint8 // One of FULL,FIRST, MIDDLE, LAST
data: uint8[length]
record type:
- FULL: the data in the block is one complete record
- FIRST: the first fragment of a long record that has been split across multiple blocks
- MIDDLE: a middle fragment of such a long record
- LAST: the final fragment of such a long record
When fewer than 7 bytes (checksum + length + type) remain in a block, not even a record header fits, so the remaining bytes are zero-padded as a trailer.
Writer
The writer tracks the current block_offset. For each record, the writer splits it into one or more physical records according to block_offset and appends those physical records to the file's blocks in order.
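The fragmentation logic, lightly condensed from Writer::AddRecord in db/log_writer.cc (kHeaderSize is the 7-byte checksum/length/type header):
Status Writer::AddRecord(const Slice& slice) {
  const char* ptr = slice.data();
  size_t left = slice.size();
  Status s;
  bool begin = true;
  do {
    const int leftover = kBlockSize - block_offset_;
    if (leftover < kHeaderSize) {
      // Not even a header fits: zero-pad the trailer and switch to a new block.
      if (leftover > 0) {
        dest_->Append(Slice("\x00\x00\x00\x00\x00\x00", leftover));
      }
      block_offset_ = 0;
    }
    const size_t avail = kBlockSize - block_offset_ - kHeaderSize;
    const size_t fragment_length = (left < avail) ? left : avail;
    const bool end = (left == fragment_length);
    RecordType type = (begin && end) ? kFullType
                      : begin        ? kFirstType
                      : end          ? kLastType
                                     : kMiddleType;
    s = EmitPhysicalRecord(type, ptr, fragment_length);  // header + data, updates block_offset_
    ptr += fragment_length;
    left -= fragment_length;
    begin = false;
  } while (s.ok() && left > 0);
  return s;
}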
WritableFile
leveldb's abstraction of a writable file; a default implementation for POSIX environments is provided. Users can subclass this interface to port leveldb to other platforms (e.g. Windows, object storage, HDFS). It supports (interface sketched after this list):
- Append
- Close
- Flush
- Sync: synchronous write. Under the default POSIX semantics, the call returns success only after the data has been written to disk; without Sync, writes return success once the data reaches the page cache. Data in the page cache survives a process crash but not an OS/machine crash; writing through to disk survives an OS/machine crash, at the cost of performance.
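The interface shape, abridged from include/leveldb/env.h:
class LEVELDB_EXPORT WritableFile {
 public:
  virtual ~WritableFile();
  virtual Status Append(const Slice& data) = 0;
  virtual Status Close() = 0;
  virtual Status Flush() = 0;
  virtual Status Sync() = 0;
};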
Reader
Main members:
- reporter: reports corruption errors
- SequentialFile: a file interface supporting skipping and sequential reads; it can be extended/ported the same way as the WritableFile above
- checksum: whether to verify each record's checksum
- SkipToInitialBlock: moves the file offset to the start of the first block to be read
- ReadRecord: reads the next complete logical record (usage sketched below)
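A hedged usage sketch of the reader, in the shape of a recovery loop (the file/reporter setup and ApplyToMemTable are placeholders, not leveldb APIs):
// Replay a WAL file, one logical record at a time.
log::Reader reader(file, &reporter, true /*checksum*/, 0 /*initial_offset*/);
std::string scratch;
Slice record;
while (reader.ReadRecord(&record, &scratch)) {
  // `record` is a complete logical record, reassembled from FULL or
  // FIRST/MIDDLE/LAST physical records. During db recovery it is decoded
  // into a WriteBatch and re-applied to the MemTable.
  ApplyToMemTable(record);  // hypothetical handler
}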
SSTable
An sstable is where leveldb stores its sorted kv pairs in the file system.
Format
table/block.h
- Data blocks: sorted kv pairs
- Meta block: filters
- Meta index block: index into the meta (filter) blocks
- Index block: index into the data blocks
- Footer
Block
Restart point
Keys in an sstable are stored in sorted order and prefix-compressed. This saves storage space but increases the computation and time needed for lookups. Therefore, every few keys one key is stored in full without prefix compression; such points are called restart points. The offsets of the restart points are recorded at the end of the block.
This block format is reused throughout leveldb; its layout is:
block data + type (1 byte) + crc32 (4 bytes)
where type indicates the compression: none or snappy.
kv entries within a block (decoding sketched after this list):
- shared_bytes: varint32, length of the prefix shared with the previous key
- unshared_bytes: varint32, length of the key portion after the shared prefix
- value_length: varint32, length of the value
- key_delta: char[unshared_bytes], the key bytes after the shared prefix
- value: char[value_length], the value
At a restart point, shared_bytes = 0.
Block tail:
- restarts: uint32[num_restarts], the offsets of the restart points
- num_restarts: uint32
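A simplified sketch of how a reader reconstructs keys from this encoding (error checks omitted; the real parsing lives in DecodeEntry / Block::Iter in table/block.cc):
// p points at the current entry; key holds the previous entry's full key.
uint32_t shared, non_shared, value_length;
p = GetVarint32Ptr(p, limit, &shared);        // shared_bytes
p = GetVarint32Ptr(p, limit, &non_shared);    // unshared_bytes
p = GetVarint32Ptr(p, limit, &value_length);  // value_length
key.resize(shared);                           // keep the shared prefix
key.append(p, non_shared);                    // append key_delta -> full current key
Slice value(p + non_shared, value_length);
p += non_shared + value_length;               // advance to the next entry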
Block Builder
Blocks are constructed by the BlockBuilder.
- Add() appends entries to the block in order; the appended keys must be strictly increasing. It keeps the previously appended key and counts the entries appended since the last restart point, from which shared_bytes, unshared_bytes, value_length, and key_delta are computed (sketched below).
- Finally, Finish() is called to append the restart-point index to the end of the block.
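Condensed from BlockBuilder::Add in table/block_builder.cc:
void BlockBuilder::Add(const Slice& key, const Slice& value) {
  size_t shared = 0;
  if (counter_ < options_->block_restart_interval) {
    // Compute the prefix shared with the previously appended key.
    const size_t min_length = std::min(last_key_.size(), key.size());
    while (shared < min_length && last_key_[shared] == key[shared]) shared++;
  } else {
    // Start a new restart point: store the key in full.
    restarts_.push_back(buffer_.size());
    counter_ = 0;
  }
  const size_t non_shared = key.size() - shared;
  PutVarint32(&buffer_, shared);                    // shared_bytes
  PutVarint32(&buffer_, non_shared);                // unshared_bytes
  PutVarint32(&buffer_, value.size());              // value_length
  buffer_.append(key.data() + shared, non_shared);  // key_delta
  buffer_.append(value.data(), value.size());       // value
  last_key_.assign(key.data(), key.size());         // remember the last appended key
  counter_++;
}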
Block Reader
Inside class Block:
const char* data_;
size_t size_;
uint32_t restart_offset_; // Offset in data_ of restart array
bool owned_; // Block owns data_[]
An iterator is provided for traversing the entries in a block:
const Comparator* const comparator_;
const char* const data_; // underlying block contents
uint32_t const restarts_; // Offset of restart array (list of fixed32)
uint32_t const num_restarts_; // Number of uint32_t entries in restart array
// current_ is offset in data_ of current entry. >= restarts_ if !Valid
uint32_t current_;
uint32_t restart_index_; // Index of restart block in which current_ falls
std::string key_;
Slice value_;
Status status_;
- Seek: binary search over the restart array for the last restart point whose key is < target, then a linear scan forward to the first entry >= target (sketched below).
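A condensed sketch of Block::Iter::Seek; restart-point keys are stored uncompressed, so they can be compared directly (KeyAtRestartPoint is a simplified helper standing in for the real decoding code):
void Seek(const Slice& target) {
  // Binary search the restart array for the last restart point with key < target.
  uint32_t left = 0, right = num_restarts_ - 1;
  while (left < right) {
    uint32_t mid = (left + right + 1) / 2;
    Slice mid_key = KeyAtRestartPoint(mid);  // simplified helper
    if (comparator_->Compare(mid_key, target) < 0) {
      left = mid;        // everything at or before mid is < target
    } else {
      right = mid - 1;   // key at mid is >= target
    }
  }
  // Linear scan within that restart interval for the first entry >= target.
  SeekToRestartPoint(left);
  while (ParseNextKey()) {
    if (comparator_->Compare(key_, target) >= 0) return;
  }
}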
Create SSTable
Source: table_builder.cc / block_builder.cc
The only data member is a Rep*; the actual state is wrapped in a separate struct, presumably to hide the internal implementation and allow it to evolve, since table_builder.h is exposed directly to users of the leveldb library (it lives under include/leveldb).
struct TableBuilder::Rep {
Options options;
Options index_block_options;
WritableFile* file;
uint64_t offset;
Status status;
BlockBuilder data_block;
BlockBuilder index_block;
std::string last_key;
int64_t num_entries;
bool closed; // Either Finish() or Abandon() has been called.
FilterBlockBuilder* filter_block;
bool pending_index_entry; // true only if data block is empty
BlockHandle pending_handle; // Handle to add to index block
std::string compressed_output;
};
Add key
- The inserted key must be > last_key.
- When the first key of a new data block is added (pending_index_entry), an index entry for the previous data block is added to the index block. The index key is the shortest separator between last_key and the current key (e.g. for keys "abd" and "aef" the shortest separator could be "ac"), which keeps index keys short.
- Add the current key to the bloom filter.
- Insert the key & value into the current data block.
- When the current data block reaches the size threshold, call Flush() to write it out to the file.
void TableBuilder::Add(const Slice& key, const Slice& value) {
Rep* r = rep_;
assert(!r->closed);
if (!ok()) return;
if (r->num_entries > 0) {
assert(r->options.comparator->Compare(key, Slice(r->last_key)) > 0);
}
if (r->pending_index_entry) {
assert(r->data_block.empty());
r->options.comparator->FindShortestSeparator(&r->last_key, key);
std::string handle_encoding;
r->pending_handle.EncodeTo(&handle_encoding);
r->index_block.Add(r->last_key, Slice(handle_encoding));
r->pending_index_entry = false;
}
if (r->filter_block != nullptr) {
r->filter_block->AddKey(key);
}
r->last_key.assign(key.data(), key.size());
r->num_entries++;
r->data_block.Add(key, value);
const size_t estimated_block_size = r->data_block.CurrentSizeEstimate();
if (estimated_block_size >= r->options.block_size) {
Flush();
}
}
Finish
Called once the SSTable's contents are complete; after flushing any pending data block it writes the following to the sstable in order (condensed code after the list):
- filter block
- metaindex block
- index block
- footer
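Condensed from TableBuilder::Finish in table/table_builder.cc (error handling trimmed):
Status TableBuilder::Finish() {
  Rep* r = rep_;
  Flush();                                     // write out any pending data block first
  r->closed = true;
  BlockHandle filter_handle, metaindex_handle, index_handle;

  if (r->filter_block != nullptr) {            // filter block (not compressed)
    WriteRawBlock(r->filter_block->Finish(), kNoCompression, &filter_handle);
  }

  BlockBuilder metaindex_block(&r->options);   // metaindex block: "filter.<Name>" -> filter handle
  if (r->filter_block != nullptr) {
    std::string key = "filter.";
    key.append(r->options.filter_policy->Name());
    std::string handle_encoding;
    filter_handle.EncodeTo(&handle_encoding);
    metaindex_block.Add(key, handle_encoding);
  }
  WriteBlock(&metaindex_block, &metaindex_handle);

  if (r->pending_index_entry) {                // index entry for the final data block
    r->options.comparator->FindShortSuccessor(&r->last_key);
    std::string handle_encoding;
    r->pending_handle.EncodeTo(&handle_encoding);
    r->index_block.Add(r->last_key, Slice(handle_encoding));
    r->pending_index_entry = false;
  }
  WriteBlock(&r->index_block, &index_handle);  // index block

  Footer footer;                               // footer records both handles
  footer.set_metaindex_handle(metaindex_handle);
  footer.set_index_handle(index_handle);
  std::string footer_encoding;
  footer.EncodeTo(&footer_encoding);
  r->status = r->file->Append(footer_encoding);
  return r->status;
}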
Read SSTable
- Read and verify the footer.
- From the footer, obtain the offsets of the index block and the metaindex block.
- Read the index block and the metaindex block.
Footer footer;
s = footer.DecodeFrom(&footer_input);
if (!s.ok()) return s;
// Read the index block
BlockContents index_block_contents;
ReadOptions opt;
if (options.paranoid_checks) {
opt.verify_checksums = true;
}
s = ReadBlock(file, opt, footer.index_handle(), &index_block_contents);
if (s.ok()) {
// We've successfully read the footer and the index block: we're
// ready to serve requests.
Block* index_block = new Block(index_block_contents);
Rep* rep = new Table::Rep;
rep->options = options;
rep->file = file;
rep->metaindex_handle = footer.metaindex_handle();
rep->index_block = index_block;
rep->cache_id = (options.block_cache ? options.block_cache->NewId() : 0);
rep->filter_data = nullptr;
rep->filter = nullptr;
*table = new Table(rep);
(*table)->ReadMeta(footer);
}
SSTable iterator
The table's iterator is a TwoLevelIterator: the first level is the index block's iterator, each of whose entries points to a data block; each data block has its own iterator, forming the second level. A lookup through the Table iterator is therefore a binary search built on these two levels of iterators.
Status Table::InternalGet(const ReadOptions& options, const Slice& k, void* arg,
void (*handle_result)(void*, const Slice&,
const Slice&)) {
Status s;
Iterator* iiter = rep_->index_block->NewIterator(rep_->options.comparator);
iiter->Seek(k);
if (iiter->Valid()) {
Slice handle_value = iiter->value();
FilterBlockReader* filter = rep_->filter;
BlockHandle handle;
if (filter != nullptr && handle.DecodeFrom(&handle_value).ok() &&
!filter->KeyMayMatch(handle.offset(), k)) {
// Not found
} else {
Iterator* block_iter = BlockReader(this, options, iiter->value());
block_iter->Seek(k);
if (block_iter->Valid()) {
(*handle_result)(arg, block_iter->key(), block_iter->value());
}
s = block_iter->status();
delete block_iter;
}
}
if (s.ok()) {
s = iiter->status();
}
delete iiter;
return s;
}
Table Cache
- TableCache is an LRU cache of SSTables.
- It can read (and cache) an sstable.
- It can read (and cache) an sstable and return an iterator over it.
- It can read (and cache) an sstable and look up a key in it, invoking a callback.
- It can manually evict an sstable from the LRU cache.
Reading (and caching) an sstable:
Status TableCache::FindTable(uint64_t file_number, uint64_t file_size,
Cache::Handle** handle) {
Status s;
char buf[sizeof(file_number)];  // file_number serves as the LRU cache key
EncodeFixed64(buf, file_number);
Slice key(buf, sizeof(buf));
*handle = cache_->Lookup(key);
// cache miss:
if (*handle == nullptr) {
std::string fname = TableFileName(dbname_, file_number);
RandomAccessFile* file = nullptr;
Table* table = nullptr;
// open the corresponding file in the file system
s = env_->NewRandomAccessFile(fname, &file);
if (!s.ok()) {
// compatibility with the old .sst file naming
std::string old_fname = SSTTableFileName(dbname_, file_number);
if (env_->NewRandomAccessFile(old_fname, &file).ok()) {
s = Status::OK();
}
}
if (s.ok()) {
// load the sstable from the file system into memory
s = Table::Open(options_, file, file_size, &table);
}
if (!s.ok()) {
assert(table == nullptr);
delete file;
// We do not cache error results so that if the error is transient,
// or somebody repairs the file, we recover automatically.
} else {
// cache the sstable in the LRU cache
TableAndFile* tf = new TableAndFile;
tf->file = file;
tf->table = table;
*handle = cache_->Insert(key, tf, 1, &DeleteEntry);
}
}
// cache hit: return directly; the handle holds the cached table and file
return s;
}
Reading (and caching) an sstable and returning an iterator over it:
Iterator* TableCache::NewIterator(const ReadOptions& options,
uint64_t file_number, uint64_t file_size,
Table** tableptr) {
if (tableptr != nullptr) {
*tableptr = nullptr;
}
Cache::Handle* handle = nullptr;
// get the sstable from the cache, or read it from the file system and cache it
Status s = FindTable(file_number, file_size, &handle);
if (!s.ok()) {
return NewErrorIterator(s);
}
Table* table = reinterpret_cast<TableAndFile*>(cache_->Value(handle))->table;
// build an iterator over the sstable
Iterator* result = table->NewIterator(options);
// register a cleanup callback: when the iterator is destroyed the cache item is released; unreleased cache items cannot be evicted
result->RegisterCleanup(&UnrefEntry, cache_, handle);
if (tableptr != nullptr) {
*tableptr = table;
}
return result;
}
Reading (and caching) an sstable, then looking up a key in it and invoking a callback:
Status TableCache::Get(const ReadOptions& options, uint64_t file_number,
uint64_t file_size, const Slice& k, void* arg,
void (*handle_result)(void*, const Slice&,
const Slice&)) {
Cache::Handle* handle = nullptr;
Status s = FindTable(file_number, file_size, &handle);
if (s.ok()) {
Table* t = reinterpret_cast<TableAndFile*>(cache_->Value(handle))->table;
s = t->InternalGet(options, k, arg, handle_result);
cache_->Release(handle);
}
return s;
}
Manual cache eviction:
void TableCache::Evict(uint64_t file_number) {
char buf[sizeof(file_number)];
EncodeFixed64(buf, file_number);
cache_->Erase(Slice(buf, sizeof(buf)));
}
LRU Cache
LRU Handle
Because the hash table in LevelDB exists only to support the LRU cache, its implementation is specialized rather than templated; the elements of the hash table are LRUHandle:
struct LRUHandle {
void* value;
void (*deleter)(const Slice&, void* value);
LRUHandle* next_hash;
LRUHandle* next;
LRUHandle* prev;
size_t charge; // TODO(opt): Only allow uint32_t?
size_t key_length;
bool in_cache; // Whether entry is in the cache.
uint32_t refs; // References, including cache reference, if present.
uint32_t hash;  // cached hash of key(); avoids recomputing hashes when the hash table is resized
char key_data[1]; // Beginning of key
Slice key() const {
// next is only equal to this if the LRU handle is the list head of an
// empty list. List heads never have meaningful keys.
assert(next != this);
return Slice(key_data, key_length);
}
};
Hash Table
The LRU cache needs a hash table; leveldb provides a simple internal hash table implementation, which is said to be somewhat faster than builtin implementations.
- Members: the HashTable resolves hash collisions by chaining; its members are
// The table consists of an array of buckets where each bucket is
// a linked list of cache entries that hash into the bucket.
uint32_t length_;
uint32_t elems_;
LRUHandle** list_;
- As noted above, this hash table exists only to support the LRU cache, so it is specialized rather than templated; its elements are the LRUHandle structs shown earlier.
- Resize: when the number of elements reaches the threshold, a resize is triggered: allocate a new bucket array, rehash the entries from the old table into the new one, and free the old array's memory.
void Resize() {
uint32_t new_length = 4;
while (new_length < elems_) {
new_length *= 2;
}
LRUHandle** new_list = new LRUHandle*[new_length];
memset(new_list, 0, sizeof(new_list[0]) * new_length);
uint32_t count = 0;
for (uint32_t i = 0; i < length_; i++) {
LRUHandle* h = list_[i];
while (h != nullptr) {
LRUHandle* next = h->next_hash;
uint32_t hash = h->hash;
LRUHandle** ptr = &new_list[hash & (new_length - 1)];
h->next_hash = *ptr;
*ptr = h;
h = next;
count++;
}
}
assert(elems_ == count);
delete[] list_;
list_ = new_list;
length_ = new_length;
}
- Insert: use the key and hash to find the matching slot in the hashTable (the old entry, or nullptr if none exists). Place the new node in that slot, return the old node, and resize as needed.
LRUHandle* Insert(LRUHandle* h) {
LRUHandle** ptr = FindPointer(h->key(), h->hash);
LRUHandle* old = *ptr;
h->next_hash = (old == nullptr ? nullptr : old->next_hash);
*ptr = h;
if (old == nullptr) {
++elems_;
if (elems_ > length_) {
// Since each cache entry is fairly large, we aim for a small
// average linked list length (<= 1).
Resize();
}
}
return old;
}
// Return a pointer to slot that points to a cache entry that
// matches key/hash. If there is no such cache entry, return a
// pointer to the trailing slot in the corresponding linked list.
LRUHandle** FindPointer(const Slice& key, uint32_t hash) {
LRUHandle** ptr = &list_[hash & (length_ - 1)];
while (*ptr != nullptr && ((*ptr)->hash != hash || key != (*ptr)->key())) {
ptr = &(*ptr)->next_hash;
}
return ptr;
}
- Remove: unlink the matching node from its bucket chain and return it
LRUHandle* Remove(const Slice& key, uint32_t hash) {
LRUHandle** ptr = FindPointer(key, hash);
LRUHandle* result = *ptr;
if (result != nullptr) {
*ptr = result->next_hash;
--elems_;
}
return result;
}
Now on to the LRU Cache itself.
The LRU cache is implemented with doubly linked lists plus the hash table. The cache consists of two doubly linked lists, and every cached item sits on exactly one of them; items that clients still reference but that have already been evicted from the cache are on neither list. The LRU cache protects its shared state with a mutex, so it is thread safe. The two lists are:
- in_use_: items currently referenced by clients, in no particular order
- lru_: items not currently referenced by clients, in LRU order
When a client acquires or releases a reference to a cache item, the item is moved between the two lists accordingly.
主要成员:
// Initialized before use.
size_t capacity_;
// mutex_ protects the following state.
mutable port::Mutex mutex_;
size_t usage_ GUARDED_BY(mutex_);  // GUARDED_BY is a clang thread-safety annotation used by leveldb
// Dummy head of LRU list.
// lru.prev is newest entry, lru.next is oldest entry.
// Entries have refs==1 and in_cache==true.
LRUHandle lru_ GUARDED_BY(mutex_);
// Dummy head of in-use list.
// Entries are in use by clients, and have refs >= 2 and in_cache==true.
LRUHandle in_use_ GUARDED_BY(mutex_);
HandleTable table_ GUARDED_BY(mutex_);
};
Initialization creates two empty circular lists:
LRUCache::LRUCache() : capacity_(0), usage_(0) {
// Make empty circular linked lists.
lru_.next = &lru_;
lru_.prev = &lru_;
in_use_.next = &in_use_;
in_use_.prev = &in_use_;
}
Destructor:
LRUCache::~LRUCache() {
assert(in_use_.next == &in_use_); // Error if caller has an unreleased handle
for (LRUHandle* e = lru_.next; e != &lru_;) {
LRUHandle* next = e->next;
assert(e->in_cache);
e->in_cache = false;
assert(e->refs == 1); // Invariant of lru_ list.
Unref(e);
e = next;
}
}
void LRUCache::Unref(LRUHandle* e) {
assert(e->refs > 0);
e->refs--;
if (e->refs == 0) { // Deallocate.
assert(!e->in_cache);
(*e->deleter)(e->key(), e->value);
free(e);
} else if (e->in_cache && e->refs == 1) {
// No longer in use; move to lru_ list.
LRU_Remove(e);
LRU_Append(&lru_, e);
}
}
Lookup and release: Ref and Unref move entries between the lru_ and in_use_ lists (Ref is sketched after Release below).
Cache::Handle* LRUCache::Lookup(const Slice& key, uint32_t hash) {
MutexLock l(&mutex_);
LRUHandle* e = table_.Lookup(key, hash);
if (e != nullptr) {
Ref(e);
}
return reinterpret_cast<Cache::Handle*>(e);
}
void LRUCache::Release(Cache::Handle* handle) {
MutexLock l(&mutex_);
Unref(reinterpret_cast<LRUHandle*>(handle));
}
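Ref is the counterpart of Unref shown earlier; condensed from util/cache.cc:
void LRUCache::Ref(LRUHandle* e) {
  if (e->refs == 1 && e->in_cache) {  // currently on the lru_ list: move it to in_use_
    LRU_Remove(e);
    LRU_Append(&in_use_, e);
  }
  e->refs++;
}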
Insert, shown below, mainly:
- initializes the new node
- appends it to the in_use_ list and inserts it into the hash table
- if usage exceeds capacity, evicts entries from the lru_ list in LRU order
Cache::Handle* LRUCache::Insert(const Slice& key, uint32_t hash, void* value,
size_t charge,
void (*deleter)(const Slice& key,
void* value)) {
MutexLock l(&mutex_);
LRUHandle* e =
reinterpret_cast<LRUHandle*>(malloc(sizeof(LRUHandle) - 1 + key.size()));
e->value = value;
e->deleter = deleter;
e->charge = charge;
e->key_length = key.size();
e->hash = hash;
e->in_cache = false;
e->refs = 1; // for the returned handle.
std::memcpy(e->key_data, key.data(), key.size());
if (capacity_ > 0) {
e->refs++; // for the cache's reference.
e->in_cache = true;
LRU_Append(&in_use_, e);
usage_ += charge;
FinishErase(table_.Insert(e));
} else { // don't cache. (capacity_==0 is supported and turns off caching.)
// next is read by key() in an assert, so it must be initialized
e->next = nullptr;
}
while (usage_ > capacity_ && lru_.next != &lru_) {
LRUHandle* old = lru_.next;
assert(old->refs == 1);
bool erased = FinishErase(table_.Remove(old->key(), old->hash));
if (!erased) { // to avoid unused variable when compiled NDEBUG
assert(erased);
}
}
return reinterpret_cast<Cache::Handle*>(e);
}
Sharded LRU Cache
It uses LRUCache instances as the underlying caches and shards data across them by hash. The default cache actually used by LevelDB is this ShardedLRUCache.
class ShardedLRUCache : public Cache {
private:
LRUCache shard_[kNumShards];
port::Mutex id_mutex_;
uint64_t last_id_;
static inline uint32_t HashSlice(const Slice& s) {
return Hash(s.data(), s.size(), 0);
}
static uint32_t Shard(uint32_t hash) { return hash >> (32 - kNumShardBits); }
public:
explicit ShardedLRUCache(size_t capacity) : last_id_(0) {
const size_t per_shard = (capacity + (kNumShards - 1)) / kNumShards;
for (int s = 0; s < kNumShards; s++) {
shard_[s].SetCapacity(per_shard);
}
}
~ShardedLRUCache() override {}
Handle* Insert(const Slice& key, void* value, size_t charge,
void (*deleter)(const Slice& key, void* value)) override {
const uint32_t hash = HashSlice(key);
return shard_[Shard(hash)].Insert(key, hash, value, charge, deleter);
}
Handle* Lookup(const Slice& key) override {
const uint32_t hash = HashSlice(key);
return shard_[Shard(hash)].Lookup(key, hash);
}
void Release(Handle* handle) override {
LRUHandle* h = reinterpret_cast<LRUHandle*>(handle);
shard_[Shard(h->hash)].Release(handle);
}
void Erase(const Slice& key) override {
const uint32_t hash = HashSlice(key);
shard_[Shard(hash)].Erase(key, hash);
}
void* Value(Handle* handle) override {
return reinterpret_cast<LRUHandle*>(handle)->value;
}
uint64_t NewId() override {
MutexLock l(&id_mutex_);
return ++(last_id_);
}
void Prune() override {
for (int s = 0; s < kNumShards; s++) {
shard_[s].Prune();
}
}
size_t TotalCharge() const override {
size_t total = 0;
for (int s = 0; s < kNumShards; s++) {
total += shard_[s].TotalCharge();
}
return total;
}
};
} // end anonymous namespace
Filter
bloom filter
A [bloom filter](en.wikipedia.org/wiki/Bloom\… is essentially a multi-hash scheme: several hash functions are used to reduce the probability of collisions; with the hash space fixed, adding more hash functions lowers the probability that all of them collide at once.
Because it is still built on hash functions, collisions cannot be avoided entirely, so a bloom filter's answer to "does this key exist" is not exact. If the filter reports a miss, the key definitely does not exist; if it reports a hit, the key may still be absent (a hash collision occurred). leveldb therefore uses the bloom filter to quickly rule out keys that are not in a block.
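Enabling the bloom filter is a matter of setting the filter policy in Options when opening the db; 10 bits per key is the commonly cited setting (the db path below is just an example):
#include "leveldb/db.h"
#include "leveldb/filter_policy.h"

leveldb::Options options;
options.create_if_missing = true;
options.filter_policy = leveldb::NewBloomFilterPolicy(10);  // ~1% false positive rate
leveldb::DB* db = nullptr;
leveldb::Status s = leveldb::DB::Open(options, "/tmp/testdb", &db);
// ... reads now consult the bloom filter before touching data blocks ...
delete db;
delete options.filter_policy;  // caller owns the policy; delete after the db is closed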
double hash
The bloom filter implementation needs several different hash functions. leveldb simulates an arbitrary number of hash functions via double hashing: given a hash function h1, define h2(x) = (h1(x) >> 17) | (h1(x) << 15), i.e. a rotation of h1(x); the i-th hash function of the bloom filter is then
Gi(key) = h1(key) + i * h2(key)
The FilterPolicy interface is defined in include/leveldb/filter_policy.h; users can provide their own implementation:
class LEVELDB_EXPORT FilterPolicy {
public:
virtual ~FilterPolicy();
// Return the name of this policy. Note that if the filter encoding
// changes in an incompatible way, the name returned by this method
// must be changed. Otherwise, old incompatible filters may be
// passed to methods of this type.
virtual const char* Name() const = 0;
// keys[0,n-1] contains a list of keys (potentially with duplicates)
// that are ordered according to the user supplied comparator.
// Append a filter that summarizes keys[0,n-1] to *dst.
//
// Warning: do not change the initial contents of *dst. Instead,
// append the newly constructed filter to *dst.
virtual void CreateFilter(const Slice* keys, int n,
std::string* dst) const = 0;
// "filter" contains the data appended by a preceding call to
// CreateFilter() on this class. This method must return true if
// the key was in the list of keys passed to CreateFilter().
// This method may return true or false if the key was not on the
// list, but it should aim to return false with a high probability.
virtual bool KeyMayMatch(const Slice& key, const Slice& filter) const = 0;
};
// Return a new filter policy that uses a bloom filter with approximately
// the specified number of bits per key. A good value for bits_per_key
// is 10, which yields a filter with ~ 1% false positive rate.
//
// Callers must delete the result after any database that is using the
// result has been closed.
//
// Note: if you are using a custom comparator that ignores some parts
// of the keys being compared, you must not use NewBloomFilterPolicy()
// and must provide your own FilterPolicy that also ignores the
// corresponding parts of the keys. For example, if the comparator
// ignores trailing spaces, it would be incorrect to use a
// FilterPolicy (like NewBloomFilterPolicy) that does not ignore
// trailing spaces in keys.
LEVELDB_EXPORT const FilterPolicy* NewBloomFilterPolicy(int bits_per_key);
} // namespace leveldb
#endif // STORAGE_LEVELDB_INCLUDE_FILTER_POLICY_H_
leveldb's default filter implementation: the bloom filter
class BloomFilterPolicy : public FilterPolicy {
public:
explicit BloomFilterPolicy(int bits_per_key) : bits_per_key_(bits_per_key) {
// We intentionally round down to reduce probing cost a little bit
k_ = static_cast<size_t>(bits_per_key * 0.69); // 0.69 =~ ln(2)
if (k_ < 1) k_ = 1;
if (k_ > 30) k_ = 30;
}
const char* Name() const override { return "leveldb.BuiltinBloomFilter2"; }
void CreateFilter(const Slice* keys, int n, std::string* dst) const override {
// Compute bloom filter size (in both bits and bytes)
size_t bits = n * bits_per_key_;
// For small n, we can see a very high false positive rate. Fix it
// by enforcing a minimum bloom filter length.
if (bits < 64) bits = 64;
size_t bytes = (bits + 7) / 8;
bits = bytes * 8;
const size_t init_size = dst->size();
dst->resize(init_size + bytes, 0);
dst->push_back(static_cast<char>(k_)); // Remember # of probes in filter
char* array = &(*dst)[init_size];
for (int i = 0; i < n; i++) {
// Use double-hashing to generate a sequence of hash values.
// See analysis in [Kirsch,Mitzenmacher 2006].
uint32_t h = BloomHash(keys[i]);
const uint32_t delta = (h >> 17) | (h << 15); // Rotate right 17 bits
for (size_t j = 0; j < k_; j++) {
const uint32_t bitpos = h % bits;
array[bitpos / 8] |= (1 << (bitpos % 8));
h += delta;
}
}
}
bool KeyMayMatch(const Slice& key, const Slice& bloom_filter) const override {
const size_t len = bloom_filter.size();
if (len < 2) return false;
const char* array = bloom_filter.data();
const size_t bits = (len - 1) * 8;
// Use the encoded k so that we can read filters generated by
// bloom filters created using different parameters.
const size_t k = array[len - 1];
if (k > 30) {
// Reserved for potentially new encodings for short bloom filters.
// Consider it a match.
return true;
}
uint32_t h = BloomHash(key);
const uint32_t delta = (h >> 17) | (h << 15); // Rotate right 17 bits
for (size_t j = 0; j < k; j++) {
const uint32_t bitpos = h % bits;
if ((array[bitpos / 8] & (1 << (bitpos % 8))) == 0) return false;
h += delta;
}
return true;
}
private:
size_t bits_per_key_;
size_t k_;
};
} // namespace
filter block builder
The filter block is stored in the sstable as a meta block; see the sections above for how meta blocks are written and read.