06| LevelDB读操作

379 阅读13分钟

在理解了LevelDB的整体架构后,再理解读操作就会很容易了,而且我们很容易猜到:先去内存中的memtable中查找key;如果没有,再去内存中只读的immutable中查找;如果还没有,再去level-0的SSTable文件中查找;如果还没有,则到下一级level-1层的SSTable文件中查找,依次类推。但是我们也会发现:越往下一层,代价越大,读性能越不好。这是由于简化并提高了写操作及写性能带来的代价的,因此,优化读操作就是需要重点考虑的问题,怎么优化呢?如果让我们去设计,我们该怎么设计呢?

 

我们分析一下思考的过程:

首先,内存中数据结构的选择。

读操作实际上就是查找key的操作,查找算法我们首先想到的就是二分查找,效率非常的高,时间复杂度是O(logn),但是二分查找是有要求的,需要在有序的数组中查找,有序可以通过比较函数在写入操作时保证key有序排列,但是数组是要求连续的内存空间,这就很难满足要求。链表不受连续内存的限制,但是查找起来比较费劲,效率不高。继续想还有哪些高效的查找数据结构?跳表SkipList、二叉查找树、红黑树。红黑树是非常常用的高效的查找数据结构,LevelDB并没有选择红黑树,而是选用了跳表SkipList,它的性能也非常优秀,时间复杂度是O(logn),而且相比红黑树容易实现,出错概率低。跳表SkipList实际上是链表的二分查找,链表是不适合用二分查找的,单链表的二分查找时间复杂度是O(n),于是跳表SkipList被发明出来,就是利用了指针的内存链接优势和二分查找的高效性能,二者的结合就是跳表SkipList,本质是利用空间换时间。

 

其次,磁盘上SSTable文件的结构设计。

要查找磁盘上的SSTable文件,必然有磁盘IO,在机械硬盘HDD时代,这显然是很低效的,避免磁盘IO或者尽量减少磁盘IO就是优先考虑的重点。读SSTable文件,磁盘IO是避免不了的,那就想着如何减少磁盘IO的次数,以及在一个SSTable文件中查找key时如何更快。所以要求SSTable文件中的key必须是有序的,这就是Sorted String Table,有序的固化表文件。有序体现在Key是按序存储的,也体现在除了Level-0之外,其他Level中的SSTable文件之间也是Key有序的,即:Key不重叠。且在一个SSTable文件中记录最小key和最大key,这样key是否在[minKey, maxKey]区间内就可以判定这个文件是否包含待查找的key,节省了遍历查找时间,没有就去读下一个SSTable文件。如果一个key落在了某个SSTable文件的[minKey, maxKey]区间内,怎么再高效查找呢?至少先判断这个key存不存在,布隆过滤器是一个很好的解决方案,因此,在SSTable文件中放一个布隆过滤器。如果布隆过滤器判断key存在,那就继续在该文件中查找,一个SSTable文件包含了大量的块Block,块Block里面才是用户key,如何快速查找key所在的块Block?仍然利用高效的二分查找算法,先明确这个key所在的块Block位置,然后再到这个块Block位置处继续查找。一个块Block所包含的key也是大量的数据,如何快速查找?还是利用高效的二分查找算法,进一步缩小范围,明确key所在的key区间,这个key区间数据量就有限了,可以遍历。因此,这就是SSTable文件的设计考虑。

 

再次,level层次的策略。

Level-0是由内存中的immutable生成的,所以Level-0中的SSTable文件中的Key是存在重叠的,不同的SSTable文件也存在Key重叠的情况,这显然就不利于查找,读性能在Level-0就不好,因此Level-0会有很多的限制条件,1)Level-0中文件的个数达到4个时,会触发压缩Compaction;2)Level-0中文件的个数达到8个时,写入操作将会受到限制;3)Level-0中文件的个数达到12个时,写入操作将会被停止。后期归并生成的SSTable文件在Level-i层,这就是LevelDB的名字的由来。而之所以叫leveled,而不是tiered,是因为第i+1层的数据量是i层的倍数,这样减少了文件数目,也就减少了磁盘IO的次数。

 

好了,经过以上分析,我们来看下LevelDB具体是如何查找key的。

LevelDB中数据的流向是这样的: Memtable > Immutable Memtable > level 0 > level L > level L+1,因此,其查找顺序也必然是这个顺序,越往后查找代价越高。

 

一、数据库读操作:

LevelDB提供了 Get 方法来查询数据库。

接口:

Status DBImpl::Get(const ReadOptions& options, const Slice& key, std::string* value)

下面的代码展示了读 key 对应的 value:

std::string value;
 leveldb::Status s = db->Get(leveldb::ReadOptions(), key, &value);

 

、读取过程:

1)内存Memtable中查找:LevelDB首先会去查看内存中的Memtable。如果Memtable中包含key及其对应的value,则返回value值即可;没有则继续向下一级查找。

2)内存Immutable Memtable中查找:如果在Memtable没有读到key,则接下来到内存中的Immutable Memtable中去读取。同样,如果读到就返回,如果没有读到,那就只能从磁盘中的大量SSTable文件中查找了。

3)磁盘SSTable文件查找:因为磁盘上的SSTable文件数量较多,而且分成多个Level,所以在SSTable中读数据就像大海捞针,是非常的费劲的。总的原则是这样的:首先从level-0的SSTable文件中查找,如果找到则返回对应的value值;如果没有找到,那么继续到下一level-1中的SSTable文件中去找,如此查找下去,直到在某层SSTable文件中找到这个key对应的value为止。

4)从SSTable文件中读取一个键的步骤:

4.1)首先需要打开这个SSTable,读取文件最后48字节,即:Footer。这样就可以读取Footer里面的Meta Index Block和Index Block,将Index Block的内容缓存到内存中;再根据Meta Index Block读取布隆过滤器的数据,缓存到内存中。

4.2)根据key对Index Block的restart point进行二分搜索,找到这个key对应的Data Block的BlockHandler;

4.3)根据Meta Index Block指向的BlockHandler的offset_和size_,计算出布隆过滤器的位置,读取相应的布隆过滤器;在布隆过滤器中查找key是否存在,如果判定不存在,则返回;

4.4)如果判定存在,进一步读取对应的Data Block;对Data Block里的restart point进行二分搜索,找到key对应的restart point,对这个restart point对应的key进行搜索,最多搜索16个key,找到key或者找不到key。

如上可知,Index Block和布隆过滤器的内容都可以缓存在内存里的,所以当一个键在SSTable不存在时,99%的概率是不需要磁盘IO的。

 

三、下面我们结合代码分析读操作

1、代码层次非常清晰,先从mem->Get()查找,再从imm->Get查找,最后从current->Get查找。

代码:

Status DBImpl::Get(const ReadOptions& options, const Slice& key, std::string* value) {
    … …
  |-MemTable* mem = mem_;//memtable variable
  |-MemTable* imm = imm_;//immutable variable
  |-Version* current = versions_->current();//get current version
    … …
  {
    |-mutex_.Unlock();
    // First look in the memtable, then in the immutable memtable (if any).
    |-LookupKey lkey(key, snapshot);
    |-if (mem->Get(lkey, value, &s)) {//search from memtable firstly
      // Done
    |-} else if (imm != nullptr && imm->Get(lkey, value, &s)) {//search from immutable sencondly
      // Done
    |-} else {
      |-s = current->Get(options, lkey, value, &stats);//from current version to search user key
      |-have_stat_update = true;//if there is need to get key from sstable file, it means need to update state that's MaybeScheduleCompaction()
    |-}
    |-mutex_.Lock();
  }
 
  |-if (have_stat_update && current->UpdateStats(stats)) {//trigger compaction
    |-MaybeScheduleCompaction();
  |-}
 
  … …
  |-return s;
}

 

2、从memtable和immutable memtable中如何查找key的,二者是一样的。

本质就是如何从跳表SkipList中查找数据:

2.1)从跳表SkipList中查找道第一个大于等于memkey的节点node;

2.2)解码key,如果key == userkey,那么就找到了,如果是有效值,返回value;如果是删除值,返回空,结束;

2.3)没有找到返回false

代码:

bool MemTable::Get(const LookupKey& key, std::string* value, Status* s) {
  |-Slice memkey = key.memtable_key();// Return a key suitable for lookup in a MemTable.
                      |-return Slice(start_, end_ - start_);//that's memkey == lookupkey
  |-Table::Iterator iter(&table_); //initialize a iterator with SkipList 
                   |-list_ = list; //SkipList table_(cache memkey)
                   |-node_ = nullptr;
  |-iter.Seek(memkey.data());//search greater or equal memkey from skiplist and assigned to node_ of iter
         |-node_ = list_->FindGreaterOrEqual(target, nullptr);//get the first key >= mem_key
  |-if (iter.Valid()) { //return node_ != nullptr;
    |-const char* entry = iter.key(); //return node_->key;
    |-uint32_t key_length;
    |-const char* key_ptr = GetVarint32Ptr(entry, entry + 5, &key_length);//entry+5 means maxsize of
varint32, return key_ptr is interna_key and key_length is varing32 size of internal_key.
    |-if (comparator_.comparator.user_comparator()->Compare(Slice(key_ptr, key_length - 8), key.user_key()) == 0) {//key_length-8
is user_key, so compare user_key of node_ and user key. if equal, find this key.
      |-const uint64_t tag = DecodeFixed64(key_ptr + key_length - 8);//parse out sequenceNumber+valueType
      |-switch (static_cast<ValueType>(tag & 0xff)) {//jadge valueType
        |-case kTypeValue: {//if this is a normal value, get the target value, return true.
          |-Slice v = GetLengthPrefixedSlice(key_ptr + key_length);//key_ptr+key_length is end of memkey, this is data address.
          |-value->assign(v.data(), v.size());
          |-return true;
        |-}
        |-case kTypeDeletion: //if this is a deleted value, status is NotFound, return empty Slice
          |-*s = Status::NotFound(Slice());
          |-return true;
      |-}
    |-}
  |-}
  |-return false;//otherwise not found key-value in memtable
}

 

3、从SSTable中查找key

从当前最新版本CurrentVersion中查找,先查找level-0,如果没有,再从level-1~level-N中依次查找。

我们看看LevelDB是怎么搜索SSTable文件的,在前面04| LevelDB中版本控制Version介绍中,我们知道SSTable是由版本控制的,并且当前版本Current拥有最新的SSTable文件,因此必然从Current中查找current->Get(options, lkey, value, &stats);。

3.1)先遍历搜索level-0中的文件files_[0]:快速在每个文件的元数据FileMetaData中查找是否落在文件key的[smallest, largest]区间,在这个区间,就把这个SSTable文件记录下来,因为level-0中存在key重复的问题,因此需要把所有满足这一条件的文件都记录下来。

3.2)如果level-0中确有文件包含查找key,回调函数Match()进一步判定,如果Match()判定找到key,则返回;如果没有找到,那就从level-1level-N中依次查找,因为level-1level-N中的SSTable文件记录的key是不重复的,因此可以使用二分查找算法,快速查找符合要求的文件。

代码:

void Version::ForEachOverlapping(Slice user_key, Slice internal_key, void* arg, bool (*func)(void*, int, FileMetaData*)) {
  |-const Comparator* ucmp = vset_->icmp_.user_comparator();
  // Search level-0 in order from newest to oldest.
  |-std::vector<FileMetaData*> tmp;
  |-tmp.reserve(files_[0].size());//capacity is expanded to size of files_ under level-0 in Version!
  |-for (uint32_t i = 0; i < files_[0].size(); i++) { //traverse over files under level-0
    |-FileMetaData* f = files_[0][i];//get this file's FileMetaData, and compare smallest user key and largest user key in this file
    |-if (ucmp->Compare(user_key, f->smallest.user_key()) >= 0 && ucmp->Compare(user_key, f->largest.user_key()) <= 0) {
      |-tmp.push_back(f);//if smallestUserKey <= this key <= largestUserKey, record it. that's means this key in this file! As level-0 has many file, the same key has different values in different files. So collection all files which contains this key.
    |-}
  |-}
  |-if (!tmp.empty()) {//if found files contain this key,
    |-std::sort(tmp.begin(), tmp.end(), NewestFirst);//return a->number > b->number; sort them from largest to smallest!
    |-for (uint32_t i = 0; i < tmp.size(); i++) {//traverse over this sorted files whick contains this key.
      |-if (!(*func)(arg, 0, tmp[i])) {//call func State::Match to judge.
        |-return;
      |-}
    |-}
  |-}
  // Search other levels, from level-1 to leve-max
  |-for (int level = 1; level < config::kNumLevels; level++) {
    |-size_t num_files = files_[level].size(); //get files numbers of this level
    |-if (num_files == 0) continue;//this level has no file, continue.
    // Binary search to find earliest index whose largest key >= internal_key. In order to filter these files whick largestKey < key
    |-uint32_t index = FindFile(vset_->icmp_, files_[level], internal_key);
    |-if (index < num_files) {
      |-FileMetaData* f = files_[level][index];//get this file
      |-if (ucmp->Compare(user_key, f->smallest.user_key()) < 0) {//if this file's smallest key > this key, this file has not this key
        // All of "f" is past any data for user_key
      |-} else {
        |-if (!(*func)(arg, level, f)) {//otherwise, this file has this key, call func State::Match to judge.
          |-return;
        |-}
      |-}//if
    |-}//if
  |-}//for
}

 

3.3)接下来,我们重点看Match()是怎么查找key的。

核心处理:state->vset->table_cache_->Get(*state->options, f->number, f->file_size, state->ikey, &state->saver, SaveValue)

代码:

static bool Match(void* arg, int level, FileMetaData* f) {
  |-State* state = reinterpret_cast<State*>(arg);
  |-if (state->stats->seek_file == nullptr && state->last_file_read != nullptr) {
        // We have had more than one seek for this read.  Charge the 1st file.
    |-state->stats->seek_file = state->last_file_read;
    |-state->stats->seek_file_level = state->last_file_read_level;
  |-}
  |-state->last_file_read = f;//record this file and level
  |-state->last_file_read_level = level;
  |-state->s = state->vset->table_cache_->Get(*state->options, f->number, f->file_size, state->ikey, &state->saver, SaveValue);//search this file: 
  [step1]: find this file's table, if not exist, open file and read its footer, get MetaIndexBlock, IndexBlock, FilterBlock, construct table and cache Table-File;
  [step2]: search this user key in Table, first binary search userkey in IndexBlock and find the first key(restart point key) >= target, reserve its value(BlockHandle), then bloom filter judge this key exist or not, if exist, further to read data block to search this key, if found, hadnle_result to save.
  |-if (!state->s.ok()) {//return true indicate found, otherwise indicate not found.
    |-state->found = true;
    |-return false;
  |-}
  |-switch (state->saver.state) {//judge the state
        case kNotFound:
          return true;  // Keep searching in other files
        case kFound: //found
          state->found = true;
          return false;
        case kDeleted: //user key has beed deleted, return not found.
          return false;
        case kCorrupt:
          state->s =
              Status::Corruption("corrupted key for ", state->saver.user_key);
          state->found = true;
          return false;
  |-}
  // Not reached. Added to avoid false compilation warnings of "control reaches end of non-void function".
  return false;
}

 

3.3.1)首先,记录查找的文件seek_file及文件的所在的level,这是为了根据seek进行压缩做记录。

3.3.2)其次,从缓存table_cache_中查找,也就是SSTable文件的一些关键信息是缓存在内存中的,哪些信息呢?MetaIndexBlock、IndexBlock、FilterBlock,这些信息都是为了快速查找key而设计的。如果一个文件没有在缓存中呢?LevelDB会下发读磁盘IO,打开文件,读取文件的最后48字节的Footer(包含metaindex_handle_和index_handle_),然后就解析出metaindex_handle_和index_handle_,这两个文件指针BlockHandler(offset_, size_)分别定位到一个块Block,它们分别是MetaIndexBlock和IndexBlock。

3.3.3)继续读IndexBlock和MetaIndexBlock的块Block到内存中:Index Block是一系列的KV,每个KV又指向一个具体的块Data Block;而MetaIndexBlock也是一系列KV,目前只有一个布隆过滤器,其value是filter_handler指向一个具体的块Filter Block,把Filter Block也读到内存中缓存。

3.3.4)把读到内存的Index Block、Filter Block构建一个Table对象。

class LEVELDB_EXPORT Table {
  Rep* const rep_;
};
struct Table::Rep {
  Options options;
  Status status;
  RandomAccessFile* file;
  uint64_t cache_id;
  FilterBlockReader* filter;
  const char* filter_data;
  BlockHandle metaindex_handle;  // Handle to metaindex_block: saved from footer
  Block* index_block;
};

3.3.5)将file和Table构建TableAndFile对象,将其插入到LRUCache中。

struct TableAndFile {
  RandomAccessFile* file;
  Table* table;
};
 
// A single shard of sharded cache.
class LRUCache {
  // Initialized before use.
  size_t capacity_;
  // mutex_ protects the following state.
  mutable port::Mutex mutex_;
  size_t usage_ GUARDED_BY(mutex_);
  // Dummy head of LRU list.
  // lru.prev is newest entry, lru.next is oldest entry.
  // Entries have refs==1 and in_cache==true.
  LRUHandle lru_ GUARDED_BY(mutex_);
  // Dummy head of in-use list.
  // Entries are in use by clients, and have refs >= 2 and in_cache==true.
  LRUHandle in_use_ GUARDED_BY(mutex_);
  HandleTable table_ GUARDED_BY(mutex_);
};
 
struct LRUHandle {
  void* value;
  void (*deleter)(const Slice&, void* value);
  LRUHandle* next_hash;
  LRUHandle* next;
  LRUHandle* prev;
  size_t charge;  // TODO(opt): Only allow uint32_t?
  size_t key_length;
  bool in_cache;     // Whether entry is in the cache.
  uint32_t refs;     // References, including cache reference, if present.
  uint32_t hash;     // Hash of key(); used for fast sharding and comparisons
  char key_data[1];  // Beginning of key
};

 

3.3.6)现在文件的元数据缓存信息TableCache有了,接下来就可以查找key是否真正在一个文件了t->InternalGet(options, k, arg, handle_result)。

代码:

Status TableCache::Get(const ReadOptions& options, uint64_t file_number, uint64_t file_size, const Slice& k, 
                       void* arg, void (*handle_result)(void*, const Slice&, const Slice&)) {
  |-Cache::Handle* handle = nullptr;
  |-Status s = FindTable(file_number, file_size, &handle);//get handle of this file from LRU cache_, if not cached
then construct a key(file_number):value(Table-File) and insert into cache_. Now &handle has this fils's LRUHandle
  |-if (s.ok()) {
    |-Table* t = reinterpret_cast<TableAndFile*>(cache_->Value(handle))->table;//get the value of &handle, which
is TableAndFile object, then return element table that contains MetaIndexBlock, IndexBlock, File, FilterBlock.
    |-s = t->InternalGet(options, k, arg, handle_result);//search this user key in Table, first binary search userkey in IndexBlock and find the first key(restart point key) >= target, reserve its value(BlockHandle), then bloom filter judge this key exist or not, if exist, further to read data block to search this key, if found, hadnle_result to save
    |-cache_->Release(handle);
  |-}
  |-return s;
}

 

核心处理流程:t->InternalGet(options, k, arg, handle_result)

首先,在布隆过滤器中判定这个key是否存在,如果不存在,则返回,查找结束。

其次,在Index Block中用二分查找算法快速查找第一个大于等于key的restart point,读取该restart point对应的block_handle所指向的块Block,然后在该块Block中再次二分查找第一个大于等于key的restart point,读取该restart point对应的block_handle所指向的KV处,线性遍历key-value,如果查找到key,则返回value。

代码:

Status Table::InternalGet(const ReadOptions& options, const Slice& k, void* arg,
                          void (*handle_result)(void*, const Slice&, const Slice&)) {
  |-Iterator* iiter = rep_->index_block->NewIterator(rep_->options.comparator);//construct a iterator for index_block
                                       |-const uint32_t num_restarts = NumRestarts();//the last 4-bytes of index_block is number restarts
                                                          |-return DecodeFixed32(data_ + size_ - sizeof(uint32_t));
                                       |-return new Iter(comparator, data_, restart_offset_, num_restarts);//like FilterBlock's iterator
  |-iiter->Seek(k);//binary search userkey in IndexBlock and find the first key(restart point key) >= target, reserve its value(BlockHandle)
  |-if (iiter->Valid()) {
    |-Slice handle_value = iiter->value();//return value of key from iterator whick has decode key(restart point key) and value(block_handle)
 from IndexBlock
    |-FilterBlockReader* filter = rep_->filter;//return bloom filter, and bloom filter search this key match or not?
    |-BlockHandle handle;//decode from handle_value to get offset and size
    |-if (filter != nullptr && handle.DecodeFrom(&handle_value).ok() && !filter->KeyMayMatch(handle.offset(), k)) {
                                     |-GetVarint64(input, &offset_) && GetVarint64(input, &size_)
      // Not found //bllom filter judge this user key is not exist! OK, return status, and game over!!!
    |-} else {//otherwise, this user key may be exist!
      |-Iterator* block_iter = BlockReader(this, options, iiter->value());//decode offset_ and size_ from BlockHandle, then read size_ from offset_ in DataBlock, and construct a Block object, then cache it, finally construct a iterator for datablock. Everything is ready!!!
      |-block_iter->Seek(k);//Now, binary search this user key!
      |-if (block_iter->Valid()) {//return current_ < restarts_, OK, find user key!!!
        |-(*handle_result)(arg, block_iter->key(), block_iter->value());//callback function SaveValue() to save result!
      |-}
      |-s = block_iter->status();
      |-delete block_iter;//free block_iterator
    |-}
  |-}
  |-if (s.ok()) {
    |-s = iiter->status();
  |-}
  |-delete iiter;//free iterator
  |-return s;
}

 

至此,读操作就分析完成了。