在理解了LevelDB的整体架构后,再理解读操作就会很容易了,而且我们很容易猜到:先去内存中的memtable中查找key;如果没有,再去内存中只读的immutable中查找;如果还没有,再去level-0的SSTable文件中查找;如果还没有,则到下一级level-1层的SSTable文件中查找,依次类推。但是我们也会发现:越往下一层,代价越大,读性能越不好。这是由于简化并提高了写操作及写性能带来的代价的,因此,优化读操作就是需要重点考虑的问题,怎么优化呢?如果让我们去设计,我们该怎么设计呢?
我们分析一下思考的过程:
首先,内存中数据结构的选择。
读操作实际上就是查找key的操作,查找算法我们首先想到的就是二分查找,效率非常的高,时间复杂度是O(logn),但是二分查找是有要求的,需要在有序的数组中查找,有序可以通过比较函数在写入操作时保证key有序排列,但是数组是要求连续的内存空间,这就很难满足要求。链表不受连续内存的限制,但是查找起来比较费劲,效率不高。继续想还有哪些高效的查找数据结构?跳表SkipList、二叉查找树、红黑树。红黑树是非常常用的高效的查找数据结构,LevelDB并没有选择红黑树,而是选用了跳表SkipList,它的性能也非常优秀,时间复杂度是O(logn),而且相比红黑树容易实现,出错概率低。跳表SkipList实际上是链表的二分查找,链表是不适合用二分查找的,单链表的二分查找时间复杂度是O(n),于是跳表SkipList被发明出来,就是利用了指针的内存链接优势和二分查找的高效性能,二者的结合就是跳表SkipList,本质是利用空间换时间。
其次,磁盘上SSTable文件的结构设计。
要查找磁盘上的SSTable文件,必然有磁盘IO,在机械硬盘HDD时代,这显然是很低效的,避免磁盘IO或者尽量减少磁盘IO就是优先考虑的重点。读SSTable文件,磁盘IO是避免不了的,那就想着如何减少磁盘IO的次数,以及在一个SSTable文件中查找key时如何更快。所以要求SSTable文件中的key必须是有序的,这就是Sorted String Table,有序的固化表文件。有序体现在Key是按序存储的,也体现在除了Level-0之外,其他Level中的SSTable文件之间也是Key有序的,即:Key不重叠。且在一个SSTable文件中记录最小key和最大key,这样key是否在[minKey, maxKey]区间内就可以判定这个文件是否包含待查找的key,节省了遍历查找时间,没有就去读下一个SSTable文件。如果一个key落在了某个SSTable文件的[minKey, maxKey]区间内,怎么再高效查找呢?至少先判断这个key存不存在,布隆过滤器是一个很好的解决方案,因此,在SSTable文件中放一个布隆过滤器。如果布隆过滤器判断key存在,那就继续在该文件中查找,一个SSTable文件包含了大量的块Block,块Block里面才是用户key,如何快速查找key所在的块Block?仍然利用高效的二分查找算法,先明确这个key所在的块Block位置,然后再到这个块Block位置处继续查找。一个块Block所包含的key也是大量的数据,如何快速查找?还是利用高效的二分查找算法,进一步缩小范围,明确key所在的key区间,这个key区间数据量就有限了,可以遍历。因此,这就是SSTable文件的设计考虑。
再次,level层次的策略。
Level-0是由内存中的immutable生成的,所以Level-0中的SSTable文件中的Key是存在重叠的,不同的SSTable文件也存在Key重叠的情况,这显然就不利于查找,读性能在Level-0就不好,因此Level-0会有很多的限制条件,1)Level-0中文件的个数达到4个时,会触发压缩Compaction;2)Level-0中文件的个数达到8个时,写入操作将会受到限制;3)Level-0中文件的个数达到12个时,写入操作将会被停止。后期归并生成的SSTable文件在Level-i层,这就是LevelDB的名字的由来。而之所以叫leveled,而不是tiered,是因为第i+1层的数据量是i层的倍数,这样减少了文件数目,也就减少了磁盘IO的次数。
好了,经过以上分析,我们来看下LevelDB具体是如何查找key的。
LevelDB中数据的流向是这样的: Memtable > Immutable Memtable > level 0 > level L > level L+1,因此,其查找顺序也必然是这个顺序,越往后查找代价越高。
一、数据库读操作:
LevelDB提供了 Get 方法来查询数据库。
接口:
Status DBImpl::Get(const ReadOptions& options, const Slice& key, std::string* value)
下面的代码展示了读 key 对应的 value:
std::string value;
leveldb::Status s = db->Get(leveldb::ReadOptions(), key, &value);
二 、读取过程:
1)内存Memtable中查找:LevelDB首先会去查看内存中的Memtable。如果Memtable中包含key及其对应的value,则返回value值即可;没有则继续向下一级查找。
2)内存Immutable Memtable中查找:如果在Memtable没有读到key,则接下来到内存中的Immutable Memtable中去读取。同样,如果读到就返回,如果没有读到,那就只能从磁盘中的大量SSTable文件中查找了。
3)磁盘SSTable文件查找:因为磁盘上的SSTable文件数量较多,而且分成多个Level,所以在SSTable中读数据就像大海捞针,是非常的费劲的。总的原则是这样的:首先从level-0的SSTable文件中查找,如果找到则返回对应的value值;如果没有找到,那么继续到下一level-1中的SSTable文件中去找,如此查找下去,直到在某层SSTable文件中找到这个key对应的value为止。
4)从SSTable文件中读取一个键的步骤:
4.1)首先需要打开这个SSTable,读取文件最后48字节,即:Footer。这样就可以读取Footer里面的Meta Index Block和Index Block,将Index Block的内容缓存到内存中;再根据Meta Index Block读取布隆过滤器的数据,缓存到内存中。
4.2)根据key对Index Block的restart point进行二分搜索,找到这个key对应的Data Block的BlockHandler;
4.3)根据Meta Index Block指向的BlockHandler的offset_和size_,计算出布隆过滤器的位置,读取相应的布隆过滤器;在布隆过滤器中查找key是否存在,如果判定不存在,则返回;
4.4)如果判定存在,进一步读取对应的Data Block;对Data Block里的restart point进行二分搜索,找到key对应的restart point,对这个restart point对应的key进行搜索,最多搜索16个key,找到key或者找不到key。
如上可知,Index Block和布隆过滤器的内容都可以缓存在内存里的,所以当一个键在SSTable不存在时,99%的概率是不需要磁盘IO的。
三、下面我们结合代码分析读操作
1、代码层次非常清晰,先从mem->Get()查找,再从imm->Get查找,最后从current->Get查找。
代码:
Status DBImpl::Get(const ReadOptions& options, const Slice& key, std::string* value) {
… …
|-MemTable* mem = mem_;//memtable variable
|-MemTable* imm = imm_;//immutable variable
|-Version* current = versions_->current();//get current version
… …
{
|-mutex_.Unlock();
// First look in the memtable, then in the immutable memtable (if any).
|-LookupKey lkey(key, snapshot);
|-if (mem->Get(lkey, value, &s)) {//search from memtable firstly
// Done
|-} else if (imm != nullptr && imm->Get(lkey, value, &s)) {//search from immutable sencondly
// Done
|-} else {
|-s = current->Get(options, lkey, value, &stats);//from current version to search user key
|-have_stat_update = true;//if there is need to get key from sstable file, it means need to update state that's MaybeScheduleCompaction()
|-}
|-mutex_.Lock();
}
|-if (have_stat_update && current->UpdateStats(stats)) {//trigger compaction
|-MaybeScheduleCompaction();
|-}
… …
|-return s;
}
2、从memtable和immutable memtable中如何查找key的,二者是一样的。
本质就是如何从跳表SkipList中查找数据:
2.1)从跳表SkipList中查找道第一个大于等于memkey的节点node;
2.2)解码key,如果key == userkey,那么就找到了,如果是有效值,返回value;如果是删除值,返回空,结束;
2.3)没有找到返回false
代码:
bool MemTable::Get(const LookupKey& key, std::string* value, Status* s) {
|-Slice memkey = key.memtable_key();// Return a key suitable for lookup in a MemTable.
|-return Slice(start_, end_ - start_);//that's memkey == lookupkey
|-Table::Iterator iter(&table_); //initialize a iterator with SkipList
|-list_ = list; //SkipList table_(cache memkey)
|-node_ = nullptr;
|-iter.Seek(memkey.data());//search greater or equal memkey from skiplist and assigned to node_ of iter
|-node_ = list_->FindGreaterOrEqual(target, nullptr);//get the first key >= mem_key
|-if (iter.Valid()) { //return node_ != nullptr;
|-const char* entry = iter.key(); //return node_->key;
|-uint32_t key_length;
|-const char* key_ptr = GetVarint32Ptr(entry, entry + 5, &key_length);//entry+5 means maxsize of
varint32, return key_ptr is interna_key and key_length is varing32 size of internal_key.
|-if (comparator_.comparator.user_comparator()->Compare(Slice(key_ptr, key_length - 8), key.user_key()) == 0) {//key_length-8
is user_key, so compare user_key of node_ and user key. if equal, find this key.
|-const uint64_t tag = DecodeFixed64(key_ptr + key_length - 8);//parse out sequenceNumber+valueType
|-switch (static_cast<ValueType>(tag & 0xff)) {//jadge valueType
|-case kTypeValue: {//if this is a normal value, get the target value, return true.
|-Slice v = GetLengthPrefixedSlice(key_ptr + key_length);//key_ptr+key_length is end of memkey, this is data address.
|-value->assign(v.data(), v.size());
|-return true;
|-}
|-case kTypeDeletion: //if this is a deleted value, status is NotFound, return empty Slice
|-*s = Status::NotFound(Slice());
|-return true;
|-}
|-}
|-}
|-return false;//otherwise not found key-value in memtable
}
3、从SSTable中查找key
从当前最新版本CurrentVersion中查找,先查找level-0,如果没有,再从level-1~level-N中依次查找。
我们看看LevelDB是怎么搜索SSTable文件的,在前面04| LevelDB中版本控制Version介绍中,我们知道SSTable是由版本控制的,并且当前版本Current拥有最新的SSTable文件,因此必然从Current中查找current->Get(options, lkey, value, &stats);。
3.1)先遍历搜索level-0中的文件files_[0]:快速在每个文件的元数据FileMetaData中查找是否落在文件key的[smallest, largest]区间,在这个区间,就把这个SSTable文件记录下来,因为level-0中存在key重复的问题,因此需要把所有满足这一条件的文件都记录下来。
3.2)如果level-0中确有文件包含查找key,回调函数Match()进一步判定,如果Match()判定找到key,则返回;如果没有找到,那就从level-1level-N中依次查找,因为level-1level-N中的SSTable文件记录的key是不重复的,因此可以使用二分查找算法,快速查找符合要求的文件。
代码:
void Version::ForEachOverlapping(Slice user_key, Slice internal_key, void* arg, bool (*func)(void*, int, FileMetaData*)) {
|-const Comparator* ucmp = vset_->icmp_.user_comparator();
// Search level-0 in order from newest to oldest.
|-std::vector<FileMetaData*> tmp;
|-tmp.reserve(files_[0].size());//capacity is expanded to size of files_ under level-0 in Version!
|-for (uint32_t i = 0; i < files_[0].size(); i++) { //traverse over files under level-0
|-FileMetaData* f = files_[0][i];//get this file's FileMetaData, and compare smallest user key and largest user key in this file
|-if (ucmp->Compare(user_key, f->smallest.user_key()) >= 0 && ucmp->Compare(user_key, f->largest.user_key()) <= 0) {
|-tmp.push_back(f);//if smallestUserKey <= this key <= largestUserKey, record it. that's means this key in this file! As level-0 has many file, the same key has different values in different files. So collection all files which contains this key.
|-}
|-}
|-if (!tmp.empty()) {//if found files contain this key,
|-std::sort(tmp.begin(), tmp.end(), NewestFirst);//return a->number > b->number; sort them from largest to smallest!
|-for (uint32_t i = 0; i < tmp.size(); i++) {//traverse over this sorted files whick contains this key.
|-if (!(*func)(arg, 0, tmp[i])) {//call func State::Match to judge.
|-return;
|-}
|-}
|-}
// Search other levels, from level-1 to leve-max
|-for (int level = 1; level < config::kNumLevels; level++) {
|-size_t num_files = files_[level].size(); //get files numbers of this level
|-if (num_files == 0) continue;//this level has no file, continue.
// Binary search to find earliest index whose largest key >= internal_key. In order to filter these files whick largestKey < key
|-uint32_t index = FindFile(vset_->icmp_, files_[level], internal_key);
|-if (index < num_files) {
|-FileMetaData* f = files_[level][index];//get this file
|-if (ucmp->Compare(user_key, f->smallest.user_key()) < 0) {//if this file's smallest key > this key, this file has not this key
// All of "f" is past any data for user_key
|-} else {
|-if (!(*func)(arg, level, f)) {//otherwise, this file has this key, call func State::Match to judge.
|-return;
|-}
|-}//if
|-}//if
|-}//for
}
3.3)接下来,我们重点看Match()是怎么查找key的。
核心处理:state->vset->table_cache_->Get(*state->options, f->number, f->file_size, state->ikey, &state->saver, SaveValue)
代码:
static bool Match(void* arg, int level, FileMetaData* f) {
|-State* state = reinterpret_cast<State*>(arg);
|-if (state->stats->seek_file == nullptr && state->last_file_read != nullptr) {
// We have had more than one seek for this read. Charge the 1st file.
|-state->stats->seek_file = state->last_file_read;
|-state->stats->seek_file_level = state->last_file_read_level;
|-}
|-state->last_file_read = f;//record this file and level
|-state->last_file_read_level = level;
|-state->s = state->vset->table_cache_->Get(*state->options, f->number, f->file_size, state->ikey, &state->saver, SaveValue);//search this file:
[step1]: find this file's table, if not exist, open file and read its footer, get MetaIndexBlock, IndexBlock, FilterBlock, construct table and cache Table-File;
[step2]: search this user key in Table, first binary search userkey in IndexBlock and find the first key(restart point key) >= target, reserve its value(BlockHandle), then bloom filter judge this key exist or not, if exist, further to read data block to search this key, if found, hadnle_result to save.
|-if (!state->s.ok()) {//return true indicate found, otherwise indicate not found.
|-state->found = true;
|-return false;
|-}
|-switch (state->saver.state) {//judge the state
case kNotFound:
return true; // Keep searching in other files
case kFound: //found
state->found = true;
return false;
case kDeleted: //user key has beed deleted, return not found.
return false;
case kCorrupt:
state->s =
Status::Corruption("corrupted key for ", state->saver.user_key);
state->found = true;
return false;
|-}
// Not reached. Added to avoid false compilation warnings of "control reaches end of non-void function".
return false;
}
3.3.1)首先,记录查找的文件seek_file及文件的所在的level,这是为了根据seek进行压缩做记录。
3.3.2)其次,从缓存table_cache_中查找,也就是SSTable文件的一些关键信息是缓存在内存中的,哪些信息呢?MetaIndexBlock、IndexBlock、FilterBlock,这些信息都是为了快速查找key而设计的。如果一个文件没有在缓存中呢?LevelDB会下发读磁盘IO,打开文件,读取文件的最后48字节的Footer(包含metaindex_handle_和index_handle_),然后就解析出metaindex_handle_和index_handle_,这两个文件指针BlockHandler(offset_, size_)分别定位到一个块Block,它们分别是MetaIndexBlock和IndexBlock。
3.3.3)继续读IndexBlock和MetaIndexBlock的块Block到内存中:Index Block是一系列的KV,每个KV又指向一个具体的块Data Block;而MetaIndexBlock也是一系列KV,目前只有一个布隆过滤器,其value是filter_handler指向一个具体的块Filter Block,把Filter Block也读到内存中缓存。
3.3.4)把读到内存的Index Block、Filter Block构建一个Table对象。
class LEVELDB_EXPORT Table {
Rep* const rep_;
};
struct Table::Rep {
Options options;
Status status;
RandomAccessFile* file;
uint64_t cache_id;
FilterBlockReader* filter;
const char* filter_data;
BlockHandle metaindex_handle; // Handle to metaindex_block: saved from footer
Block* index_block;
};
3.3.5)将file和Table构建TableAndFile对象,将其插入到LRUCache中。
struct TableAndFile {
RandomAccessFile* file;
Table* table;
};
// A single shard of sharded cache.
class LRUCache {
// Initialized before use.
size_t capacity_;
// mutex_ protects the following state.
mutable port::Mutex mutex_;
size_t usage_ GUARDED_BY(mutex_);
// Dummy head of LRU list.
// lru.prev is newest entry, lru.next is oldest entry.
// Entries have refs==1 and in_cache==true.
LRUHandle lru_ GUARDED_BY(mutex_);
// Dummy head of in-use list.
// Entries are in use by clients, and have refs >= 2 and in_cache==true.
LRUHandle in_use_ GUARDED_BY(mutex_);
HandleTable table_ GUARDED_BY(mutex_);
};
struct LRUHandle {
void* value;
void (*deleter)(const Slice&, void* value);
LRUHandle* next_hash;
LRUHandle* next;
LRUHandle* prev;
size_t charge; // TODO(opt): Only allow uint32_t?
size_t key_length;
bool in_cache; // Whether entry is in the cache.
uint32_t refs; // References, including cache reference, if present.
uint32_t hash; // Hash of key(); used for fast sharding and comparisons
char key_data[1]; // Beginning of key
};
3.3.6)现在文件的元数据缓存信息TableCache有了,接下来就可以查找key是否真正在一个文件了t->InternalGet(options, k, arg, handle_result)。
代码:
Status TableCache::Get(const ReadOptions& options, uint64_t file_number, uint64_t file_size, const Slice& k,
void* arg, void (*handle_result)(void*, const Slice&, const Slice&)) {
|-Cache::Handle* handle = nullptr;
|-Status s = FindTable(file_number, file_size, &handle);//get handle of this file from LRU cache_, if not cached
then construct a key(file_number):value(Table-File) and insert into cache_. Now &handle has this fils's LRUHandle
|-if (s.ok()) {
|-Table* t = reinterpret_cast<TableAndFile*>(cache_->Value(handle))->table;//get the value of &handle, which
is TableAndFile object, then return element table that contains MetaIndexBlock, IndexBlock, File, FilterBlock.
|-s = t->InternalGet(options, k, arg, handle_result);//search this user key in Table, first binary search userkey in IndexBlock and find the first key(restart point key) >= target, reserve its value(BlockHandle), then bloom filter judge this key exist or not, if exist, further to read data block to search this key, if found, hadnle_result to save
|-cache_->Release(handle);
|-}
|-return s;
}
核心处理流程:t->InternalGet(options, k, arg, handle_result)
首先,在布隆过滤器中判定这个key是否存在,如果不存在,则返回,查找结束。
其次,在Index Block中用二分查找算法快速查找第一个大于等于key的restart point,读取该restart point对应的block_handle所指向的块Block,然后在该块Block中再次二分查找第一个大于等于key的restart point,读取该restart point对应的block_handle所指向的KV处,线性遍历key-value,如果查找到key,则返回value。
代码:
Status Table::InternalGet(const ReadOptions& options, const Slice& k, void* arg,
void (*handle_result)(void*, const Slice&, const Slice&)) {
|-Iterator* iiter = rep_->index_block->NewIterator(rep_->options.comparator);//construct a iterator for index_block
|-const uint32_t num_restarts = NumRestarts();//the last 4-bytes of index_block is number restarts
|-return DecodeFixed32(data_ + size_ - sizeof(uint32_t));
|-return new Iter(comparator, data_, restart_offset_, num_restarts);//like FilterBlock's iterator
|-iiter->Seek(k);//binary search userkey in IndexBlock and find the first key(restart point key) >= target, reserve its value(BlockHandle)
|-if (iiter->Valid()) {
|-Slice handle_value = iiter->value();//return value of key from iterator whick has decode key(restart point key) and value(block_handle)
from IndexBlock
|-FilterBlockReader* filter = rep_->filter;//return bloom filter, and bloom filter search this key match or not?
|-BlockHandle handle;//decode from handle_value to get offset and size
|-if (filter != nullptr && handle.DecodeFrom(&handle_value).ok() && !filter->KeyMayMatch(handle.offset(), k)) {
|-GetVarint64(input, &offset_) && GetVarint64(input, &size_)
// Not found //bllom filter judge this user key is not exist! OK, return status, and game over!!!
|-} else {//otherwise, this user key may be exist!
|-Iterator* block_iter = BlockReader(this, options, iiter->value());//decode offset_ and size_ from BlockHandle, then read size_ from offset_ in DataBlock, and construct a Block object, then cache it, finally construct a iterator for datablock. Everything is ready!!!
|-block_iter->Seek(k);//Now, binary search this user key!
|-if (block_iter->Valid()) {//return current_ < restarts_, OK, find user key!!!
|-(*handle_result)(arg, block_iter->key(), block_iter->value());//callback function SaveValue() to save result!
|-}
|-s = block_iter->status();
|-delete block_iter;//free block_iterator
|-}
|-}
|-if (s.ok()) {
|-s = iiter->status();
|-}
|-delete iiter;//free iterator
|-return s;
}
至此,读操作就分析完成了。