06 | LevelDB Read Operations

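Once the overall architecture of LevelDB is understood, the read path is easy to follow, and easy to guess: look for the key in the in-memory memtable first; if it is not there, look in the read-only immutable memtable, also in memory; if it is still not found, search the SSTable files in level-0; failing that, move on to the SSTable files in level-1, and so on. We also notice that the deeper the level, the higher the cost and the worse the read performance. This is the price paid for keeping writes simple and fast, so optimizing reads becomes a key concern. How can reads be optimized? If the design were up to us, what would we do?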

Let us walk through the reasoning:

First, the choice of in-memory data structure.

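A read is essentially a key lookup. The first lookup algorithm that comes to mind is binary search: it is very efficient, with O(log n) time complexity, but it requires a sorted array. Ordering can be guaranteed at write time with a comparison function, but an array needs contiguous memory, which is hard to satisfy. A linked list is free of the contiguous-memory constraint, but lookups in it are slow. What other efficient search structures are there? The skip list (SkipList), the binary search tree, and the red-black tree. Red-black trees are very widely used, yet LevelDB chose the SkipList instead: its performance is also excellent, with expected O(log n) lookups, and it is easier to implement than a red-black tree and less error-prone. A SkipList is in effect binary search over a linked list. A plain linked list is a poor fit for binary search (on a singly linked list it degrades to O(n)), so the SkipList was invented to combine the flexible, pointer-linked memory layout of a list with the efficiency of binary search. At its core, it trades space for time.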
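
To make the "binary search over a linked list" idea concrete, here is a minimal sketch of a skip list over int keys. It is not LevelDB's SkipList: the class name TinySkipList, the maximum height of 8 and the 1-in-4 promotion probability are all choices made for this illustration, and duplicate keys are not rejected.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Illustrative skip list over int keys (hypothetical, not LevelDB's class).
class TinySkipList {
 public:
  TinySkipList() : head_(new Node(0, kMaxHeight)), height_(1) {}
  ~TinySkipList() {
    Node* n = head_;
    while (n != nullptr) {
      Node* next = n->next[0];
      delete n;
      n = next;
    }
  }
  TinySkipList(const TinySkipList&) = delete;
  TinySkipList& operator=(const TinySkipList&) = delete;

  void Insert(int key) {
    Node* prev[kMaxHeight];
    FindGreaterOrEqual(key, prev);  // fills prev[0 .. height_-1]
    const int h = RandomHeight();
    for (int i = height_; i < h; i++) prev[i] = head_;
    if (h > height_) height_ = h;
    Node* n = new Node(key, h);
    for (int i = 0; i < h; i++) {  // splice the node into each level
      n->next[i] = prev[i]->next[i];
      prev[i]->next[i] = n;
    }
  }

  bool Contains(int key) const {
    const Node* n = FindGreaterOrEqual(key, nullptr);
    return n != nullptr && n->key == key;
  }

 private:
  static const int kMaxHeight = 8;

  struct Node {
    Node(int k, int h) : key(k), next(h, nullptr) {}
    int key;
    std::vector<Node*> next;  // next[i] is the successor on level i
  };

  int RandomHeight() {
    int h = 1;
    while (h < kMaxHeight && (std::rand() % 4) == 0) h++;
    return h;
  }

  // Starts on the top level and drops down one level whenever the next
  // node is not smaller than key; this is what skips most of the list.
  Node* FindGreaterOrEqual(int key, Node** prev) const {
    Node* x = head_;
    int level = height_ - 1;
    while (true) {
      Node* next = x->next[level];
      if (next != nullptr && next->key < key) {
        x = next;
      } else {
        if (prev != nullptr) prev[level] = x;
        if (level == 0) return next;
        level--;
      }
    }
  }

  Node* head_;
  int height_;
};
```

The upper levels act as express lanes, which is where the extra space goes and where the time is saved.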

Second, the on-disk layout of SSTable files.

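Looking up a key in on-disk SSTable files necessarily involves disk IO, which was clearly slow in the era of mechanical hard drives (HDDs), so avoiding disk IO, or at least minimizing it, is the top priority. Disk IO cannot be avoided entirely when reading SSTable files, so the questions become how to reduce the number of IOs and how to find a key quickly within a single SSTable file. This is why the keys in an SSTable file must be sorted: the name stands for Sorted String Table, a sorted, persisted table file. The ordering shows up in two ways: keys within a file are stored in order, and in every level except Level-0 the SSTable files are also ordered with respect to each other, that is, their key ranges do not overlap. In addition, the smallest and largest key of each SSTable file are recorded, so checking whether a key falls within [minKey, maxKey] tells us whether the file can contain it. This avoids scanning the file; if the key is out of range, move on to the next SSTable file. If a key does fall within the [minKey, maxKey] range of some SSTable file, how do we search efficiently from there? The first step is to decide whether the key exists at all, and a Bloom filter is a good fit: it can rule a key out for certain, although a positive answer only means the key may exist. So a Bloom filter is stored in the SSTable file. If the Bloom filter says the key may exist, the search continues inside the file. An SSTable file consists of many Blocks, and the user keys live inside those Blocks, so how do we quickly find the Block that holds the key? Again with binary search: first pin down which Block the key belongs to, then continue the search inside that Block. A Block itself holds many keys, so how do we search it quickly? Once more with binary search, which narrows the range down to a small key interval that can simply be scanned. These are the considerations behind the design of the SSTable file.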
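
The Bloom filter's one-sided answer ("definitely absent" versus "maybe present") can be illustrated with a small sketch. This is not LevelDB's filter code: TinyBloom is a hypothetical class, std::hash stands in for the real hash function, and the figure of roughly 1% false positives at 10 bits per key is the usual rule of thumb rather than something measured here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Illustrative Bloom filter (hypothetical names, not LevelDB's API).
class TinyBloom {
 public:
  TinyBloom(size_t expected_keys, size_t bits_per_key)
      : bits_(std::max<size_t>(64, expected_keys * bits_per_key), false),
        // About 0.69 (ln 2) probes per bit-per-key keeps false positives low.
        probes_(std::max<size_t>(1, static_cast<size_t>(bits_per_key * 0.69))) {}

  void Add(const std::string& key) {
    uint32_t h = Hash(key);
    const uint32_t delta = (h >> 17) | (h << 15);  // second hash derived from h
    for (size_t i = 0; i < probes_; i++) {
      bits_[h % bits_.size()] = true;
      h += delta;
    }
  }

  // false: the key was definitely never added.
  // true:  the key may have been added (false positives are possible).
  bool KeyMayMatch(const std::string& key) const {
    uint32_t h = Hash(key);
    const uint32_t delta = (h >> 17) | (h << 15);
    for (size_t i = 0; i < probes_; i++) {
      if (!bits_[h % bits_.size()]) return false;
      h += delta;
    }
    return true;
  }

 private:
  static uint32_t Hash(const std::string& key) {
    // Assumption: std::hash is good enough for an illustration.
    return static_cast<uint32_t>(std::hash<std::string>()(key));
  }

  std::vector<bool> bits_;
  size_t probes_;
};
```

A reader only has to touch the data blocks when KeyMayMatch returns true, which is why a negative answer saves a disk read.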

Third, the level strategy.

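Level-0 files are generated from the in-memory immutable memtable, so the key ranges of different Level-0 SSTable files can overlap. This clearly hurts lookups, and read performance is poor at Level-0, so Level-0 is subject to several limits: 1) when the number of Level-0 files reaches 4, a compaction is triggered; 2) when it reaches 8, writes are throttled; 3) when it reaches 12, writes are stopped. SSTable files produced by later merges live in Level-i, which is where the name LevelDB comes from. It is called leveled rather than tiered because the data volume of level i+1 is a multiple of that of level i, which reduces the number of files and hence the number of disk IOs.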
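
The three Level-0 thresholds can be summarized as a small decision function. This is only a sketch: the constant names, the WriteAction enum and DecideForLevel0 are hypothetical, and only the values 4, 8 and 12 come from the text.

```cpp
#include <cassert>

// Thresholds quoted in the text; the names are chosen for this sketch.
constexpr int kL0CompactionTrigger = 4;
constexpr int kL0SlowdownWritesTrigger = 8;
constexpr int kL0StopWritesTrigger = 12;

enum class WriteAction { kProceed, kSlowDown, kStop };

struct Level0Decision {
  bool needs_compaction;     // should a compaction be scheduled?
  WriteAction write_action;  // how should an incoming write be treated?
};

Level0Decision DecideForLevel0(int num_level0_files) {
  Level0Decision d;
  d.needs_compaction = num_level0_files >= kL0CompactionTrigger;
  if (num_level0_files >= kL0StopWritesTrigger) {
    d.write_action = WriteAction::kStop;      // wait for compaction to catch up
  } else if (num_level0_files >= kL0SlowdownWritesTrigger) {
    d.write_action = WriteAction::kSlowDown;  // delay each write a little
  } else {
    d.write_action = WriteAction::kProceed;
  }
  return d;
}
```

The staged thresholds let compaction start well before writers are affected, so that the hard stop is rarely reached.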

With this analysis in hand, let us look at how LevelDB actually looks up a key.

Data in LevelDB flows as follows: Memtable -> Immutable Memtable -> level 0 -> level L -> level L+1. Lookups therefore follow the same order, and the further down a lookup goes, the more it costs.
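
The newest-to-oldest search order can be sketched with plain in-memory maps standing in for the memtable, the immutable memtable and each level. Layer, Entry and LayeredGet are hypothetical names for this illustration; the point is that the first layer that knows the key decides the outcome, even if what it holds is a deletion marker.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

enum class Kind { kValue, kDeletion };

struct Entry {
  Kind kind;
  std::string value;
};

// One layer of the store: memtable, immutable memtable, or one level.
using Layer = std::map<std::string, Entry>;

// `layers` is ordered from newest to oldest. Returns true and fills *value
// if the newest entry for `key` is a live value.
bool LayeredGet(const std::vector<Layer>& layers, const std::string& key,
                std::string* value) {
  for (const Layer& layer : layers) {
    auto it = layer.find(key);
    if (it == layer.end()) continue;  // not here, try the next (older) layer
    if (it->second.kind == Kind::kDeletion) return false;  // deleted: stop
    *value = it->second.value;
    return true;
  }
  return false;  // no layer knows the key
}
```

Stopping at the first hit is what makes the order matter: an older layer may still hold a stale value for the same key.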

I. Reading from the database:

LevelDB provides the Get method for querying the database.

Interface:

Status DBImpl::Get(const ReadOptions& options, const Slice& key, std::string* value)

The following code reads the value stored for a key:

std::string value;
leveldb::Status s = db->Get(leveldb::ReadOptions(), key, &value);

II. The read path:

1) Memtable lookup: LevelDB first checks the in-memory Memtable. If the Memtable contains the key and its value, the value is returned; otherwise the search moves on to the next stage.

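2) Immutable Memtable lookup: if the key is not in the Memtable, LevelDB next reads the in-memory Immutable Memtable. Again, the value is returned if it is found; if not, the only option left is to search the many SSTable files on disk.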

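3) SSTable lookup on disk: there are many SSTable files on disk, spread over multiple levels, so reading from SSTables is like looking for a needle in a haystack and takes real effort. The general rule is: search the SSTable files in level-0 first and return the value if the key is found; if not, continue with the SSTable files in level-1, and so on, until the key is found in the SSTable files of some level or every level has been searched.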

4) Steps for reading a key from an SSTable file:

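4.1) First open the SSTable and read the last 48 bytes of the file, that is, the Footer. The Footer holds the BlockHandles of the Meta Index Block and the Index Block. The Index Block is read and cached in memory; the Bloom filter data, located through the Meta Index Block, is read and cached in memory as well.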

4.2) Binary-search the restart points of the Index Block for the key to find the BlockHandle of the corresponding Data Block;

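4.3) Use the offset_ and size_ of the BlockHandle that the Meta Index Block points to in order to locate the Bloom filter, and read the filter that covers this Data Block; ask the filter whether the key may exist, and return if it says the key does not exist;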

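4.4) If the filter says the key may exist, read the corresponding Data Block; binary-search the restart points of the Data Block to find the restart point for the key, then scan the entries starting at that restart point. The scan covers at most 16 keys (the default restart interval) and ends with the key either found or not found.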

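As shown above, the contents of the Index Block and the Bloom filter can both be cached in memory, so when a key does not exist in an SSTable, about 99% of lookups need no disk IO (the remainder being Bloom filter false positives).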
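
Step 4.4, the binary search over restart points followed by a short linear scan, can be sketched as follows. This is a simplification made for illustration: TinyBlock keeps full keys in a vector instead of prefix-compressed bytes, restart points are stored as indexes rather than byte offsets, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct TinyBlock {
  std::vector<std::pair<std::string, std::string>> entries;  // sorted by key
  std::vector<size_t> restarts;  // index of every interval-th entry
};

TinyBlock BuildBlock(std::vector<std::pair<std::string, std::string>> sorted,
                     size_t interval) {
  assert(interval > 0);
  TinyBlock b;
  b.entries = std::move(sorted);
  for (size_t i = 0; i < b.entries.size(); i += interval) {
    b.restarts.push_back(i);
  }
  return b;
}

// Returns the index of the first entry with key >= target,
// or entries.size() if there is none.
size_t SeekInBlock(const TinyBlock& b, const std::string& target) {
  if (b.restarts.empty()) return b.entries.size();
  // Binary search for the last restart point whose key is < target.
  size_t left = 0;
  size_t right = b.restarts.size() - 1;
  while (left < right) {
    const size_t mid = (left + right + 1) / 2;
    if (b.entries[b.restarts[mid]].first < target) {
      left = mid;
    } else {
      right = mid - 1;
    }
  }
  // Linear scan from that restart point; it ends within one interval.
  size_t i = b.restarts[left];
  while (i < b.entries.size() && b.entries[i].first < target) i++;
  return i;
}

// Returns the value stored for key, or nullptr if the block lacks it.
const std::string* GetFromBlock(const TinyBlock& b, const std::string& key) {
  const size_t i = SeekInBlock(b, key);
  if (i == b.entries.size() || b.entries[i].first != key) return nullptr;
  return &b.entries[i].second;
}
```

With 7 entries and an interval of 3, for example, the restart points sit at entries 0, 3 and 6, so a lookup compares against at most two restart keys before scanning a handful of entries.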

III. Walking through the read path in code

1. The code is cleanly layered: it searches mem->Get() first, then imm->Get(), and finally current->Get().

Code:

Status DBImpl::Get(const ReadOptions& options, const Slice& key, std::string* value) {
    … …
  |-MemTable* mem = mem_;//memtable variable
  |-MemTable* imm = imm_;//immutable variable
  |-Version* current = versions_->current();//get current version
    … …
  {
    |-mutex_.Unlock();
    // First look in the memtable, then in the immutable memtable (if any).
    |-LookupKey lkey(key, snapshot);
    |-if (mem->Get(lkey, value, &s)) {//search the memtable first
      // Done
    |-} else if (imm != nullptr && imm->Get(lkey, value, &s)) {//then search the immutable memtable
      // Done
    |-} else {
      |-s = current->Get(options, lkey, value, &stats);//finally search the SSTable files of the current version
      |-have_stat_update = true;//an SSTable file was consulted, so the seek stats may need updating (see UpdateStats and MaybeScheduleCompaction below)
    |-}
    |-mutex_.Lock();
  }
 
  |-if (have_stat_update && current->UpdateStats(stats)) {//UpdateStats returns true when a file should be compacted
    |-MaybeScheduleCompaction();
  |-}
 
  … …
  |-return s;
}
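
The have_stat_update / UpdateStats path at the end of Get can be sketched as a per-file seek budget. The sketch below follows my reading of that logic and is not a copy of LevelDB's code: FileMeta, SeekStats and TinyVersion are hypothetical stand-ins, and allowed_seeks is assumed to be a counter that starts positive and is decremented each time a read had to look in that file without finding the key there.

```cpp
#include <cassert>

// Hypothetical stand-in for a file's metadata.
struct FileMeta {
  int allowed_seeks;  // remaining "wasted" seeks before compaction is due
};

// Hypothetical stand-in for the stats a read collects.
struct SeekStats {
  FileMeta* seek_file = nullptr;  // first file searched in vain, if any
};

struct TinyVersion {
  FileMeta* file_to_compact = nullptr;

  // Returns true when this read used up a file's seek budget, meaning the
  // caller should schedule a compaction.
  bool UpdateStats(const SeekStats& stats) {
    FileMeta* f = stats.seek_file;
    if (f == nullptr) return false;  // the read was served by one file
    f->allowed_seeks--;
    if (f->allowed_seeks <= 0 && file_to_compact == nullptr) {
      file_to_compact = f;
      return true;
    }
    return false;
  }
};
```

The idea is that a file which is repeatedly searched without yielding the key costs reads more than compacting it would, so reads themselves nominate it for compaction.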

2. How a key is looked up in the memtable and in the immutable memtable; the two work the same way.

In essence this is a lookup in a SkipList:

2.1) Find the first node in the SkipList that is greater than or equal to memkey;

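2.2) Decode the key. If its user key equals the user key being looked up, the key is found: for a live value, return the value; for a deletion marker, set the status to NotFound and stop;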

2.3) Otherwise return false: the key is not in this memtable.

Code:

bool MemTable::Get(const LookupKey& key, std::string* value, Status* s) {
  |-Slice memkey = key.memtable_key();// Return a key suitable for lookup in a MemTable.
                      |-return Slice(start_, end_ - start_);//that's memkey == lookupkey
  |-Table::Iterator iter(&table_); //initialize an iterator over the SkipList
                   |-list_ = list; //the SkipList table_ that stores the memkeys
                   |-node_ = nullptr;
  |-iter.Seek(memkey.data());//find the first node >= memkey in the skiplist and store it in iter's node_
         |-node_ = list_->FindGreaterOrEqual(target, nullptr);//get the first key >= memkey
  |-if (iter.Valid()) { //return node_ != nullptr;
    |-const char* entry = iter.key(); //return node_->key;
    |-uint32_t key_length;
    |-const char* key_ptr = GetVarint32Ptr(entry, entry + 5, &key_length);//entry+5: a varint32 takes at most 5 bytes; key_ptr points at the internal key and key_length is the internal key's length
    |-if (comparator_.comparator.user_comparator()->Compare(Slice(key_ptr, key_length - 8), key.user_key()) == 0) {//the first key_length-8 bytes are the user key (the last 8 bytes are the tag); compare the node's user key with the lookup user key, equal means the key is found
      |-const uint64_t tag = DecodeFixed64(key_ptr + key_length - 8);//parse out sequenceNumber+valueType
      |-switch (static_cast<ValueType>(tag & 0xff)) {//check the valueType
        |-case kTypeValue: {//a normal value: copy out the target value and return true
          |-Slice v = GetLengthPrefixedSlice(key_ptr + key_length);//key_ptr+key_length is the end of the internal key, where the length-prefixed value starts
          |-value->assign(v.data(), v.size());
          |-return true;
        |-}
        |-case kTypeDeletion: //a deletion marker: set the status to NotFound and return true so that older data is not searched
          |-*s = Status::NotFound(Slice());
          |-return true;
      |-}
    |-}
  |-}
  |-return false;//otherwise not found key-value in memtable
}
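
The 8-byte tag that MemTable::Get decodes can be illustrated with a small round-trip. This is a sketch based on the layout visible in the code above (user key followed by a fixed 64-bit tag whose low byte is the value type); the helper names MakeInternalKey and ParseInternalKey and the ParsedKey struct are written for this example, and the little-endian byte order and the numeric values of the two types are stated here as assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Assumed numeric values for the two entry types.
enum ValueType : uint8_t { kTypeDeletion = 0x0, kTypeValue = 0x1 };

// internal key = user key + 8-byte tag, tag = (sequence << 8) | type.
std::string MakeInternalKey(const std::string& user_key, uint64_t sequence,
                            ValueType type) {
  const uint64_t tag = (sequence << 8) | type;
  std::string out = user_key;
  for (int i = 0; i < 8; i++) {  // little-endian fixed64 (assumed)
    out.push_back(static_cast<char>((tag >> (8 * i)) & 0xff));
  }
  return out;
}

struct ParsedKey {
  std::string user_key;
  uint64_t sequence;
  ValueType type;
};

bool ParseInternalKey(const std::string& internal_key, ParsedKey* out) {
  if (internal_key.size() < 8) return false;
  const size_t n = internal_key.size() - 8;  // length of the user key
  uint64_t tag = 0;
  for (int i = 0; i < 8; i++) {
    const uint64_t byte = static_cast<unsigned char>(internal_key[n + i]);
    tag |= byte << (8 * i);
  }
  const uint8_t type = static_cast<uint8_t>(tag & 0xff);
  if (type > kTypeValue) return false;  // unknown type byte
  out->user_key = internal_key.substr(0, n);
  out->sequence = tag >> 8;
  out->type = static_cast<ValueType>(type);
  return true;
}
```

This is why the code compares only the first key_length - 8 bytes against the user key and then switches on tag & 0xff.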

3. Looking up a key in SSTables

The lookup runs against the current (latest) version: level-0 is searched first and, if the key is not found, level-1 through level-N are searched in turn.

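Let us see how LevelDB searches SSTable files. From the earlier article, 04 | Version control in LevelDB, we know that SSTables are managed by versions and that the Current version holds the latest set of SSTable files, so the lookup necessarily goes through Current: current->Get(options, lkey, value, &stats).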

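3.1) First scan the level-0 files, files_[0]: for each file, quickly check against its metadata (FileMetaData) whether the key falls within the file's [smallest, largest] range, and if so record the SSTable file. Because the key ranges of level-0 files can overlap, every file that satisfies the condition must be recorded.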

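3.2) If some level-0 files do cover the lookup key, the callback Match() is invoked on them to decide further; if Match() finds the key, the lookup returns. If not, level-1 through level-N are searched in turn. Because the SSTable files in level-1 through level-N have non-overlapping key ranges, binary search can be used to locate the candidate file quickly.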

Code:

void Version::ForEachOverlapping(Slice user_key, Slice internal_key, void* arg, bool (*func)(void*, int, FileMetaData*)) {
  |-const Comparator* ucmp = vset_->icmp_.user_comparator();
  // Search level-0 in order from newest to oldest.
  |-std::vector<FileMetaData*> tmp;
  |-tmp.reserve(files_[0].size());//reserve capacity for the number of level-0 files in this Version
  |-for (uint32_t i = 0; i < files_[0].size(); i++) { //iterate over the level-0 files
    |-FileMetaData* f = files_[0][i];//get this file's FileMetaData and compare against its smallest and largest user key
    |-if (ucmp->Compare(user_key, f->smallest.user_key()) >= 0 && ucmp->Compare(user_key, f->largest.user_key()) <= 0) {
      |-tmp.push_back(f);//smallest user key <= key <= largest user key: the file's range covers the key, so the key may be in this file. Level-0 files can overlap, so collect every file whose range covers the key.
    |-}
  |-}
  |-if (!tmp.empty()) {//some level-0 files cover this key
    |-std::sort(tmp.begin(), tmp.end(), NewestFirst);//NewestFirst is a->number > b->number: sort by file number in descending order, newest file first
    |-for (uint32_t i = 0; i < tmp.size(); i++) {//iterate over the sorted files that cover this key
      |-if (!(*func)(arg, 0, tmp[i])) {//call func (State::Match) to search the file; false means stop
        |-return;
      |-}
    |-}
  |-}
  // Search other levels, from level-1 to the max level
  |-for (int level = 1; level < config::kNumLevels; level++) {
    |-size_t num_files = files_[level].size(); //number of files in this level
    |-if (num_files == 0) continue;//this level has no files, continue
    // Binary search to find the earliest index whose largest key >= internal_key, skipping all files whose largest key < key
    |-uint32_t index = FindFile(vset_->icmp_, files_[level], internal_key);
    |-if (index < num_files) {
      |-FileMetaData* f = files_[level][index];//get this file
      |-if (ucmp->Compare(user_key, f->smallest.user_key()) < 0) {//the key is smaller than this file's smallest key, so the file cannot contain it
        // All of "f" is past any data for user_key
      |-} else {
        |-if (!(*func)(arg, level, f)) {//otherwise the file's range covers the key; call func (State::Match) to search it
          |-return;
        |-}
      |-}//if
    |-}//if
  |-}//for
}
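
For level-1 and deeper, the file search above boils down to a binary search over files sorted by key range. A minimal sketch, using plain strings as keys: FileRange, FindFileIndex and CandidateFile are hypothetical names, and std::string comparison stands in for the comparator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct FileRange {
  std::string smallest;
  std::string largest;
};

// `files` must be sorted by key and must not overlap (as in level >= 1).
// Returns the index of the first file whose largest key is >= key,
// or files.size() if every file ends before key.
size_t FindFileIndex(const std::vector<FileRange>& files,
                     const std::string& key) {
  size_t left = 0;
  size_t right = files.size();
  while (left < right) {
    const size_t mid = (left + right) / 2;
    if (files[mid].largest < key) {
      left = mid + 1;  // files at or before mid end too early
    } else {
      right = mid;     // mid could be the answer; keep it in range
    }
  }
  return right;
}

// Returns the only file that can contain key, or nullptr if the key falls
// in a gap between files or past the last one.
const FileRange* CandidateFile(const std::vector<FileRange>& files,
                               const std::string& key) {
  const size_t i = FindFileIndex(files, key);
  if (i == files.size() || key < files[i].smallest) return nullptr;
  return &files[i];
}
```

Because the ranges do not overlap, at most one file per level has to be opened, in contrast to level-0 where every overlapping file is a candidate.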

3.3) Next, let us focus on how Match() looks up the key.

Core call: state->vset->table_cache_->Get(*state->options, f->number, f->file_size, state->ikey, &state->saver, SaveValue)

Code:

static bool Match(void* arg, int level, FileMetaData* f) {
  |-State* state = reinterpret_cast<State*>(arg);
  |-if (state->stats->seek_file == nullptr && state->last_file_read != nullptr) {
        // We have had more than one seek for this read.  Charge the 1st file.
    |-state->stats->seek_file = state->last_file_read;
    |-state->stats->seek_file_level = state->last_file_read_level;
  |-}
  |-state->last_file_read = f;//record this file and level
  |-state->last_file_read_level = level;
  |-state->s = state->vset->table_cache_->Get(*state->options, f->number, f->file_size, state->ikey, &state->saver, SaveValue);//search this file:
  // [step1]: look up this file's Table in the cache; on a miss, open the file, read its footer to get the MetaIndexBlock, IndexBlock and FilterBlock, construct the Table, and cache the Table-File pair;
  // [step2]: search for the user key in the Table: binary-search the IndexBlock for the first key (restart point key) >= target and keep its value (a BlockHandle); ask the bloom filter whether the key may exist; if it may, read the data block and search it; if an entry is found, handle_result saves it.
  |-if (!state->s.ok()) {//the table lookup itself failed: set found = true so that the error status is returned, and return false to stop searching
    |-state->found = true;
    |-return false;
  |-}
  |-switch (state->saver.state) {//judge the state
        case kNotFound:
          return true;  // Keep searching in other files
        case kFound: //found
          state->found = true;
          return false;
        case kDeleted: //the user key has been deleted; stop searching and report not found.
          return false;
        case kCorrupt:
          state->s =
              Status::Corruption("corrupted key for ", state->saver.user_key);
          state->found = true;
          return false;
  |-}
  // Not reached. Added to avoid false compilation warnings of "control reaches end of non-void function".
  return false;
}

3.3.1) First, record the file being searched (seek_file) and the level it belongs to; this bookkeeping feeds seek-triggered compaction.

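3.3.2) Second, look the file up in the cache table_cache_: certain key pieces of information about an SSTable file are cached in memory. Which ones? The MetaIndexBlock, the IndexBlock and the FilterBlock, all of which exist to make key lookups fast. What if a file is not in the cache? LevelDB issues disk reads: it opens the file, reads the 48-byte Footer at the end of the file (which contains metaindex_handle_ and index_handle_), and decodes metaindex_handle_ and index_handle_. Each of these BlockHandles (offset_, size_) locates one Block: the MetaIndexBlock and the IndexBlock respectively.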

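3.3.3) Next, read the IndexBlock and the MetaIndexBlock into memory. The Index Block is a sequence of key-value entries, each pointing to a specific Data Block. The MetaIndexBlock is also a sequence of key-value entries; at present it holds only one, for the Bloom filter, whose value is a filter handle pointing to the Filter Block. The Filter Block is read into memory and cached as well.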

3.3.4) Build a Table object from the Index Block and Filter Block that were read into memory.

class LEVELDB_EXPORT Table {
  Rep* const rep_;
};
struct Table::Rep {
  Options options;
  Status status;
  RandomAccessFile* file;
  uint64_t cache_id;
  FilterBlockReader* filter;
  const char* filter_data;
  BlockHandle metaindex_handle;  // Handle to metaindex_block: saved from footer
  Block* index_block;
};

3.3.5) Wrap file and Table in a TableAndFile object and insert it into the LRUCache.

struct TableAndFile {
  RandomAccessFile* file;
  Table* table;
};
 
// A single shard of sharded cache.
class LRUCache {
  // Initialized before use.
  size_t capacity_;
  // mutex_ protects the following state.
  mutable port::Mutex mutex_;
  size_t usage_ GUARDED_BY(mutex_);
  // Dummy head of LRU list.
  // lru.prev is newest entry, lru.next is oldest entry.
  // Entries have refs==1 and in_cache==true.
  LRUHandle lru_ GUARDED_BY(mutex_);
  // Dummy head of in-use list.
  // Entries are in use by clients, and have refs >= 2 and in_cache==true.
  LRUHandle in_use_ GUARDED_BY(mutex_);
  HandleTable table_ GUARDED_BY(mutex_);
};
 
struct LRUHandle {
  void* value;
  void (*deleter)(const Slice&, void* value);
  LRUHandle* next_hash;
  LRUHandle* next;
  LRUHandle* prev;
  size_t charge;  // TODO(opt): Only allow uint32_t?
  size_t key_length;
  bool in_cache;     // Whether entry is in the cache.
  uint32_t refs;     // References, including cache reference, if present.
  uint32_t hash;     // Hash of key(); used for fast sharding and comparisons
  char key_data[1];  // Beginning of key
};
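
The role the LRUCache plays for open tables can be illustrated with a much smaller cache keyed by file number. This sketch is not LevelDB's implementation (which, per the structs above, uses intrusive lists, a hash table and reference counts): TinyTableCache is a hypothetical class, and a std::string stands in for the cached TableAndFile value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Illustrative LRU cache: file number -> cached table (here just a string).
class TinyTableCache {
 public:
  explicit TinyTableCache(size_t capacity) : capacity_(capacity) {}

  void Insert(uint64_t file_number, const std::string& table) {
    auto it = index_.find(file_number);
    if (it != index_.end()) {  // replace an existing entry
      lru_.erase(it->second);
      index_.erase(it);
    }
    lru_.emplace_front(file_number, table);  // newest entry at the front
    index_[file_number] = lru_.begin();
    if (lru_.size() > capacity_) {           // evict the oldest entry
      index_.erase(lru_.back().first);
      lru_.pop_back();
    }
  }

  // Returns the cached table or nullptr; a hit makes the entry the newest.
  const std::string* Lookup(uint64_t file_number) {
    auto it = index_.find(file_number);
    if (it == index_.end()) return nullptr;
    lru_.splice(lru_.begin(), lru_, it->second);  // iterator stays valid
    return &it->second->second;
  }

 private:
  using Entry = std::pair<uint64_t, std::string>;
  size_t capacity_;
  std::list<Entry> lru_;
  std::unordered_map<uint64_t, std::list<Entry>::iterator> index_;
};
```

On a miss, the caller would open the file, build the table and Insert it, which mirrors the FindTable step in the code that follows.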

3.3.6) Now that the file's metadata is cached in the TableCache, we can check whether the key really is in the file: t->InternalGet(options, k, arg, handle_result).

Code:

Status TableCache::Get(const ReadOptions& options, uint64_t file_number, uint64_t file_size, const Slice& k, 
                       void* arg, void (*handle_result)(void*, const Slice&, const Slice&)) {
  |-Cache::Handle* handle = nullptr;
  |-Status s = FindTable(file_number, file_size, &handle);//get this file's handle from the LRU cache_; if it is not cached, build a key(file_number):value(Table-File) entry and insert it into cache_. handle now refers to this file's LRUHandle
  |-if (s.ok()) {
    |-Table* t = reinterpret_cast<TableAndFile*>(cache_->Value(handle))->table;//the value behind handle is a TableAndFile object; take its table member, which holds the metaindex handle, IndexBlock, File and FilterBlock
    |-s = t->InternalGet(options, k, arg, handle_result);//search for the user key in the Table: binary-search the IndexBlock for the first key (restart point key) >= target and keep its value (a BlockHandle); ask the bloom filter whether the key may exist; if it may, read the data block and search it; if an entry is found, handle_result saves it
    |-cache_->Release(handle);
  |-}
  |-return s;
}

Core flow: t->InternalGet(options, k, arg, handle_result)

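First, binary-search the Index Block for the first entry whose key is greater than or equal to the lookup key; the value of that entry is the block_handle of the candidate Data Block.

Second, if a Bloom filter is present, ask it whether the key may exist in that Data Block; if it says the key does not exist, the lookup ends here.

Otherwise, read the Data Block that the block_handle points to, binary-search its restart points for the key, and scan the key-value entries linearly from the restart point that was found. The first entry whose key is greater than or equal to the lookup key is passed to handle_result, which saves the value if the key matches.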

Code:

Status Table::InternalGet(const ReadOptions& options, const Slice& k, void* arg,
                          void (*handle_result)(void*, const Slice&, const Slice&)) {
  |-Status s;
  |-Iterator* iiter = rep_->index_block->NewIterator(rep_->options.comparator);//construct an iterator over index_block
                                       |-const uint32_t num_restarts = NumRestarts();//the last 4 bytes of index_block hold the number of restart points
                                                          |-return DecodeFixed32(data_ + size_ - sizeof(uint32_t));
                                       |-return new Iter(comparator, data_, restart_offset_, num_restarts);//a Block::Iter over the block contents
  |-iiter->Seek(k);//binary-search the IndexBlock for the first key (restart point key) >= target and keep its value (a BlockHandle)
  |-if (iiter->Valid()) {
    |-Slice handle_value = iiter->value();//value of the index entry the iterator stopped at: the key is a restart point key, the value an encoded block_handle from the IndexBlock
    |-FilterBlockReader* filter = rep_->filter;//the bloom filter reader, if any; asked below whether the key may match
    |-BlockHandle handle;//decoded from handle_value to get offset and size
    |-if (filter != nullptr && handle.DecodeFrom(&handle_value).ok() && !filter->KeyMayMatch(handle.offset(), k)) {
                                     |-GetVarint64(input, &offset_) && GetVarint64(input, &size_)
      // Not found //the bloom filter says this user key does not exist: nothing more to read, return the status
    |-} else {//otherwise, this user key may exist
      |-Iterator* block_iter = BlockReader(this, options, iiter->value());//decode offset_ and size_ from the BlockHandle, read size_ bytes at offset_ in the file to build a Block object (cached when a block cache is configured), then construct an iterator over the data block
      |-block_iter->Seek(k);//binary-search the restart points, then scan forward to the first key >= k
      |-if (block_iter->Valid()) {//Valid() is current_ < restarts_: the iterator stopped at an entry with key >= k
        |-(*handle_result)(arg, block_iter->key(), block_iter->value());//callback SaveValue(): saves the value if the entry's user key matches
      |-}
      |-s = block_iter->status();
      |-delete block_iter;//free block_iterator
    |-}
  |-}
  |-if (s.ok()) {
    |-s = iiter->status();
  |-}
  |-delete iiter;//free iterator
  |-return s;
}

This completes the analysis of the read path.