MemTable

Using MemTable

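mem_ is the in-memory memtable that receives writes. imm_ is the memtable being flushed: CompactMemTable writes its contents to disk, and it is read-only. A Get that does not find the key in mem_ also searches imm_. imm_ is a pointer to the old mem_; the switch happens in DBImpl::MakeRoomForWrite(bool force), and has_imm_ records whether an imm_ currently exists.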

The switch in MakeRoomForWrite:

      imm_ = mem_;
      has_imm_.store(true, std::memory_order_release);
      mem_ = new MemTable(internal_comparator_);
      mem_->Ref();

The release happens in CompactMemTable():

    // Commit to the new state
    imm_->Unref();
    imm_ = nullptr;
    has_imm_.store(false, std::memory_order_release);
    RemoveObsoleteFiles();
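
The switch and release above can be sketched in isolation. This is a minimal model, not LevelDB code: ToyTable, ToyDB, SwitchMemTable and FinishFlush are hypothetical names, std::map stands in for the skip list, and there is no locking or disk I/O.

```cpp
#include <cassert>
#include <initializer_list>
#include <map>
#include <string>

// Hypothetical stand-in for MemTable: a reference-counted sorted map.
class ToyTable {
 public:
  void Ref() { ++refs_; }
  void Unref() {
    --refs_;
    assert(refs_ >= 0);
    if (refs_ <= 0) delete this;  // the last reference destroys the table
  }
  std::map<std::string, std::string> data;

 private:
  ~ToyTable() = default;  // only Unref() may destroy a table
  int refs_ = 0;
};

// Hypothetical stand-in for the part of DBImpl that owns mem_ and imm_.
class ToyDB {
 public:
  ToyDB() : mem_(new ToyTable), imm_(nullptr) { mem_->Ref(); }
  ~ToyDB() {
    mem_->Unref();
    if (imm_ != nullptr) imm_->Unref();
  }
  ToyDB(const ToyDB&) = delete;
  ToyDB& operator=(const ToyDB&) = delete;

  void Put(const std::string& k, const std::string& v) { mem_->data[k] = v; }

  // Freeze the current table and start a fresh one (the MakeRoomForWrite step).
  void SwitchMemTable() {
    assert(imm_ == nullptr);  // the previous flush must have finished
    imm_ = mem_;
    mem_ = new ToyTable;
    mem_->Ref();
  }

  // Drop the frozen table after it has been written out (the CompactMemTable step).
  void FinishFlush() {
    imm_->Unref();
    imm_ = nullptr;
  }

  // Search mem_ first, then imm_.
  bool Get(const std::string& k, std::string* value) const {
    for (const ToyTable* t : {mem_, imm_}) {
      if (t == nullptr) continue;
      auto it = t->data.find(k);
      if (it != t->data.end()) {
        *value = it->second;
        return true;
      }
    }
    return false;
  }

  bool has_imm() const { return imm_ != nullptr; }

 private:
  ToyTable* mem_;
  ToyTable* imm_;
};
```

In this toy, a key that lived only in the frozen table disappears after FinishFlush(), because nothing is written to disk; in the real database the flushed data would be found in an on-disk table instead.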

Skip list definition

MemTable

Implementation files: memtable.h/memtable.cc

Definition:

class MemTable {
  typedef SkipList<const char*, KeyComparator> Table;
  KeyComparator comparator_;  // comparison functor for the keys stored in the skip list
  int refs_;  // reference count: Ref() increments it, Unref() decrements it
  Arena arena_;  // allocates memory for the skip list
  Table table_;  // the skip list
};

struct KeyComparator is defined to compare keys during Insert and Get:

int MemTable::KeyComparator::operator()(const char* aptr,
                                        const char* bptr) const {
  // Internal keys are encoded as length-prefixed strings.
  // Decode only the key length and key bytes from aptr and bptr;
  // each call returns a Slice built over that key.
  Slice a = GetLengthPrefixedSlice(aptr);
  Slice b = GetLengthPrefixedSlice(bptr);
  return comparator.Compare(a, b);
}

The key stored in the memtable is a single string encoded from several fields:

klength varint32

userkey char[klength]

tag uint64

vlength varint32

value char[vlength]

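During insert and lookup, the comparator parses klength and the key bytes that follow it out of the full entry and compares only those. Note that klength covers the user key plus the 8-byte tag, which is why Get below slices the user key as key_length - 8 bytes.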
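
To make the layout concrete, here is a minimal sketch that encodes and decodes one entry. The helper names (AppendVarint32, AppendFixed64, EncodeEntry, DecodeEntry, DecodedEntry) are made up for this example and are not LevelDB's API. It assumes the tag packs the sequence number into the upper 56 bits with the value type in the low byte (consistent with the tag & 0xff in Get) and that fixed64 is little-endian.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append v as a varint32: 7 bits per byte, high bit set on every byte but the last.
void AppendVarint32(std::string* dst, uint32_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Append v as 8 little-endian bytes (assumed byte order).
void AppendFixed64(std::string* dst, uint64_t v) {
  for (int i = 0; i < 8; i++) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }
}

// Decode a varint32 at p and return the position just past it.
const char* ReadVarint32(const char* p, uint32_t* out) {
  uint32_t result = 0;
  for (int shift = 0; shift <= 28; shift += 7) {
    uint32_t byte = static_cast<unsigned char>(*p++);
    result |= (byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) break;
  }
  *out = result;
  return p;
}

uint64_t ReadFixed64(const char* p) {
  uint64_t v = 0;
  for (int i = 0; i < 8; i++) {
    v |= static_cast<uint64_t>(static_cast<unsigned char>(p[i])) << (8 * i);
  }
  return v;
}

// klength | userkey | tag | vlength | value, with klength = userkey size + 8.
std::string EncodeEntry(const std::string& user_key, uint64_t sequence,
                        uint8_t type, const std::string& value) {
  std::string buf;
  AppendVarint32(&buf, static_cast<uint32_t>(user_key.size() + 8));
  buf.append(user_key);
  AppendFixed64(&buf, (sequence << 8) | type);
  AppendVarint32(&buf, static_cast<uint32_t>(value.size()));
  buf.append(value);
  return buf;
}

struct DecodedEntry {
  std::string user_key;
  uint64_t sequence;
  uint8_t type;
  std::string value;
};

DecodedEntry DecodeEntry(const std::string& buf) {
  DecodedEntry e;
  uint32_t klength = 0;
  const char* key_ptr = ReadVarint32(buf.data(), &klength);
  e.user_key.assign(key_ptr, klength - 8);
  const uint64_t tag = ReadFixed64(key_ptr + klength - 8);
  e.sequence = tag >> 8;
  e.type = static_cast<uint8_t>(tag & 0xff);
  uint32_t vlength = 0;
  const char* value_ptr = ReadVarint32(key_ptr + klength, &vlength);
  e.value.assign(value_ptr, vlength);
  return e;
}
```

For a 3-byte user key and a 3-byte value the entry takes 1 + 3 + 8 + 1 + 3 = 16 bytes, and the first byte holds klength = 11.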

Get implementation:

// Look up a key in the memtable. If the entry found is a deletion,
// return true and set *s to a NotFound status.
// Input:  LookupKey key
// Output: std::string* value
bool MemTable::Get(const LookupKey& key, std::string* value, Status* s) {
  Slice memkey = key.memtable_key();  // klength + userkey + tag
  Table::Iterator iter(&table_);  // the iterator starts with node_ = nullptr
  iter.Seek(memkey.data());  // calls the skip list's FindGreaterOrEqual
  if (iter.Valid()) {
    // entry format is:
    //    klength  varint32
    //    userkey  char[klength]
    //    tag      uint64
    //    vlength  varint32
    //    value    char[vlength]
    // Check that it belongs to same user key.  We do not check the
    // sequence number since the Seek() call above should have skipped
    // all entries with overly large sequence numbers.
    const char* entry = iter.key();
    uint32_t key_length;
    const char* key_ptr = GetVarint32Ptr(entry, entry + 5, &key_length);  // a varint32 takes at most 5 bytes; decodes klength
    if (comparator_.comparator.user_comparator()->Compare(
            Slice(key_ptr, key_length - 8), key.user_key()) == 0) {
      // Correct user key
      const uint64_t tag = DecodeFixed64(key_ptr + key_length - 8);  // the last 8 bytes of the key
      switch (static_cast<ValueType>(tag & 0xff)) {  // the low byte of the tag is the value type
        case kTypeValue: {
          Slice v = GetLengthPrefixedSlice(key_ptr + key_length);
          value->assign(v.data(), v.size());  // copy out the value
          return true;
        }
        case kTypeDeletion:
          *s = Status::NotFound(Slice());  // the entry is a deletion marker, so the key counts as not found
          return true;
      }
    }
  }
  return false;
}

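SkipList

Skip list structure

Class definition

skiplist.h has no matching .cc file. It is a template implementation that provides the class:

class SkipList;  // defines its own nested Iterator and Node classes

SkipList members: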

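  Comparator const compare_;  // immutable after construction
  Arena* const arena_;  // allocates memory for nodes
  Node* const head_;  // Node holds: Key const key + std::atomic<Node*> next_[1]; next_ has height entries, index 0 is the lowest level
  std::atomic<int> max_height_;  // at most 12; modified only by Insert(); readers read it racily, and a stale value is acceptable
  Random rnd_;  // random number generator; read and written only by Insert()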

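The Node definition matters most. It has two data members: key and the next_[] array. key is the data stored in the node. next_[i] points to the node's successor on level i, so a node of the maximum height 12 uses next_[0] through next_[11]. The class declares next_ with length 1 because the number of pointers (levels) differs per node; NewNode allocates exactly as many slots as the node's height, which saves memory.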

Node definition

// Implementation details follow
template <typename Key, class Comparator>
struct SkipList<Key, Comparator>::Node {
  explicit Node(const Key& k) : key(k) {}

  Key const key;  // the data stored in this node
  // Accessors/mutators for links.  Wrapped in methods so we can
  // add the appropriate barriers as necessary.
  // Returns the n-th element of next_.
  Node* Next(int n) {  // n is the level
    assert(n >= 0);
    //Use an 'acquire load' so that we observe a fully initialized
    // version of the returned Node.
    return next_[n].load(std::memory_order_acquire);
  }
  void SetNext(int n, Node* x) {
    assert(n >= 0);
    //Use a 'release store' so that anybody who reads through this
    // pointer observes a fully initialized version of the inserted node.
    next_[n].store(x, std::memory_order_release);
  }
  // No-barrier variants that can be safely used in a few locations.
  Node* NoBarrier_Next(int n) {
    assert(n >= 0);
    return next_[n].load(std::memory_order_relaxed);
  }
  void NoBarrier_SetNext(int n, Node* x) {
    assert(n >= 0);
    next_[n].store(x, std::memory_order_relaxed);
  }

 private:
  // Array of length equal to the node height.  next_[0] is lowest level link.
  std::atomic<Node*> next_[1];  // next_[0] is the lowest-level link (level 0)
};

The skip list's safety under concurrent reads and writes is implemented inside Node, using atomic variables and memory barriers.

SkipList iterator:

class Iterator {
    const SkipList* list_;
    Node* node_;
};

Allocating a new Node:

Allocate a chunk of memory from the arena, then construct the object in it with placement new.

template <typename Key, class Comparator>
typename SkipList<Key, Comparator>::Node* SkipList<Key, Comparator>::NewNode(
    const Key& key, int height) {
  // Memory for the node: sizeof(Node) plus the next_ slots beyond the first one
  char* const node_memory = arena_->AllocateAligned(
      sizeof(Node) + sizeof(std::atomic<Node*>) * (height - 1));
  return new (node_memory) Node(key);  // construct a Node from key inside node_memory
}
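
The over-allocation trick can be tried outside the arena. The sketch below is an assumption-laden illustration: ToyNode and NewToyNode are hypothetical names, std::malloc replaces the arena, and the key is a plain int. Like the original, it indexes past the declared length of next_, which works on mainstream compilers but is not something the C++ standard guarantees.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical node type mirroring the layout above: a key followed by
// an array of links that is declared with one slot but allocated with `height`.
struct ToyNode {
  explicit ToyNode(int k) : key(k) {
    next_[0].store(nullptr, std::memory_order_relaxed);
  }

  int const key;

  ToyNode* Next(int n) { return next_[n].load(std::memory_order_acquire); }
  void SetNext(int n, ToyNode* x) {
    next_[n].store(x, std::memory_order_release);
  }

  // Public only to keep the sketch short.
  std::atomic<ToyNode*> next_[1];
};

// Allocate one node with room for `height` links and construct it in place.
ToyNode* NewToyNode(int key, int height) {
  assert(height >= 1);
  const std::size_t bytes =
      sizeof(ToyNode) + sizeof(std::atomic<ToyNode*>) * (height - 1);
  void* memory = std::malloc(bytes);
  assert(memory != nullptr);
  ToyNode* node = new (memory) ToyNode(key);
  // Construct the extra link slots that live past the declared array.
  for (int i = 1; i < height; i++) {
    new (&node->next_[i]) std::atomic<ToyNode*>(nullptr);
  }
  return node;
}
```

A node of height 1 costs sizeof(ToyNode) bytes and every extra level adds one pointer-sized slot, so short nodes (the vast majority) stay small. Nodes created this way must be released with std::free, since they were not created by a plain new expression.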

Key function implementations

FindGreaterOrEqual

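// Input:  key   the key to search for
//         prev  array of size kMaxHeight (12); may be nullptr
// Searches from the current top level (GetMaxHeight() - 1) down to level 0.
// Records the predecessor of key on every level in prev and returns the
// level-0 successor next, i.e. the first node with key <= next (or nullptr).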
template <typename Key, class Comparator>
typename SkipList<Key, Comparator>::Node*
SkipList<Key, Comparator>::FindGreaterOrEqual(const Key& key,
                                              Node** prev) const {
  Node* x = head_;
  int level = GetMaxHeight() - 1;  // max_height_ - 1
  while (true) {
    Node* next = x->Next(level);  // x's successor on this level; the search starts at the top level
    if (KeyIsAfterNode(key, next)) {
      // key is greater than next's key, so keep walking this level's list
      // Keep searching in this list
      x = next;
    } else {
      // key is not greater than next: record x as the predecessor on this level,
      // then return if this is level 0, otherwise drop to the next lower level
      if (prev != nullptr) prev[level] = x;
      if (level == 0) {
        return next;  // reached level 0, done
      } else {
        // Switch to next list
        level--;  // continue on the next lower level
      }
    }
  }
}

Insert function:

// MemTable::Add writes through this method; Key is const char* and points to an encoded entry holding both key and value
template <typename Key, class Comparator>
void SkipList<Key, Comparator>::Insert(const Key& key) {
  // TODO(opt): We can use a barrier-free variant of FindGreaterOrEqual()
  // here since Insert() is externally synchronized.
  Node* prev[kMaxHeight];  // kMaxHeight is 12
  Node* x = FindGreaterOrEqual(key, prev);  // fills prev with key's predecessor on every level; x is the level-0 successor

  // Our data structure does not allow duplicate insertion
  assert(x == nullptr || !Equal(key, x->key));
  // Pick a random height with 0 < height <= 12. Each extra level is taken with
  // probability 1/4, so low heights are far more likely than high ones.
  int height = RandomHeight();
  if (height > GetMaxHeight()) {
    // max_height_ is the height of the tallest node inserted so far, so it can be
    // smaller than kMaxHeight. This branch runs while the list is still growing
    // toward kMaxHeight and becomes rare after that.
    for (int i = GetMaxHeight(); i < height; i++) {
      // Levels above the old max_height_ hold no nodes yet, so their predecessor is head_
      prev[i] = head_;
    }
    // It is ok to mutate max_height_ without any synchronization
    // with concurrent readers.  A concurrent reader that observes
    // the new value of max_height_ will see either the old value of
    // new level pointers from head_ (nullptr), or a new value set in
    // the loop below.  In the former case the reader will
    // immediately drop to the next level since nullptr sorts after all
    // keys.  In the latter case the reader will use the new node.
    max_height_.store(height, std::memory_order_relaxed);
  }

  // Create the node, then for each level i in [0, height): point the new node's
  // next at prev[i]'s old successor, and point prev[i]'s next at the new node.
  // The loop runs from the lowest level up, so a concurrent reader never finds the
  // node on a high level and then fails to find it on a lower one.
  x = NewNode(key, height);  // this node only needs height levels
  for (int i = 0; i < height; i++) {
    // NoBarrier_SetNext() suffices since we will add a barrier when
    // we publish a pointer to "x" in prev[i].
    // Splice x into level i
    x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));  // x.next_[i] = prev[i].next_[i]
    prev[i]->SetNext(i, x);  // prev[i].next_[i].store(x, std::memory_order_release)
  }
}
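
The two functions above can be exercised together in a stripped-down form. The following is a single-threaded sketch under simplifying assumptions, not the LevelDB class: ToySkipList is a hypothetical name, keys are ints, links are plain pointers in a std::vector instead of atomics in arena memory, and std::mt19937 replaces LevelDB's Random.

```cpp
#include <cassert>
#include <random>
#include <vector>

class ToySkipList {
 public:
  ToySkipList()
      : head_(new Node(0, kMaxHeight)), max_height_(1), rng_(12345u) {}
  ~ToySkipList() {
    Node* x = head_;
    while (x != nullptr) {
      Node* next = x->next[0];
      delete x;
      x = next;
    }
  }
  ToySkipList(const ToySkipList&) = delete;
  ToySkipList& operator=(const ToySkipList&) = delete;

  void Insert(int key) {
    Node* prev[kMaxHeight];
    Node* x = FindGreaterOrEqual(key, prev);
    assert(x == nullptr || x->key != key);  // duplicates are not allowed

    int height = RandomHeight();
    if (height > max_height_) {
      // The new levels are still empty, so head_ is the predecessor there.
      for (int i = max_height_; i < height; i++) prev[i] = head_;
      max_height_ = height;
    }

    x = new Node(key, height);
    for (int i = 0; i < height; i++) {  // splice in from the bottom level up
      x->next[i] = prev[i]->next[i];
      prev[i]->next[i] = x;
    }
  }

  bool Contains(int key) const {
    Node* x = FindGreaterOrEqual(key, nullptr);
    return x != nullptr && x->key == key;
  }

  // Walk level 0, which links every node in sorted order.
  std::vector<int> Keys() const {
    std::vector<int> out;
    for (Node* x = head_->next[0]; x != nullptr; x = x->next[0]) {
      out.push_back(x->key);
    }
    return out;
  }

 private:
  enum { kMaxHeight = 12 };

  struct Node {
    Node(int k, int height) : key(k), next(height, nullptr) {}
    int key;
    std::vector<Node*> next;  // next[i] is the successor on level i
  };

  // Returns the first node with key >= `key` (or nullptr) and, if prev is
  // non-null, fills in the predecessor on every level in use.
  Node* FindGreaterOrEqual(int key, Node** prev) const {
    Node* x = head_;
    int level = max_height_ - 1;
    while (true) {
      Node* next = x->next[level];
      if (next != nullptr && next->key < key) {
        x = next;  // keep searching on this level
      } else {
        if (prev != nullptr) prev[level] = x;
        if (level == 0) return next;
        level--;  // drop to the next lower level
      }
    }
  }

  int RandomHeight() {
    int height = 1;
    while (height < kMaxHeight && (rng_() % 4) == 0) height++;
    return height;
  }

  Node* head_;
  int max_height_;
  std::mt19937 rng_;
};
```

Inserting keys in any order and then calling Keys() returns them sorted, because level 0 always links every node in key order regardless of the heights that were drawn.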

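Skip list structure

A figure found online gives a vivid picture of the skip list structure and of the insert operation.

The figure shows a skip list whose max_height_ is 4 (top level 3). Each node holds a key (a number) and a next_ pointer array with as many entries as the node's height. Each update[i]->forward[i] line in the figure corresponds to next_[i] of a node.

For example, the node with key=6 has height 4, and its next_ entries are: next_[0] is the key=7 node (height 1), next_[1] is the key=9 node (height 2), next_[2] is the key=25 node (height 3), and next_[3] is nil.

Inserting a node with key=17:

Start at height=4 (level=3). head's next_[3] is the key=6 node, and 6 < 17. The key=6 node's next_[3] is nil, so the predecessor of node 17 on level 3 is the key=6 node.
On level 2, start from the key=6 node. Its next_[2] is the key=25 node, and 25 > 17, so the predecessor of node 17 on level 2 is also the key=6 node.
Continuing down, the predecessors on levels 1 and 0 are the key=9 node and the key=12 node.
RandomHeight() gives the new node a height of 2.
For levels 0 and 1, set the new node's next_[0] and next_[1], then set key9.next_[1]=key17 and key12.next_[0]=key17.

On skip list lookup efficiency: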

template <typename Key, class Comparator>
int SkipList<Key, Comparator>::RandomHeight() {
  // Increase height with probability 1 in kBranching
  static const unsigned int kBranching = 4;
  int height = 1;
  while (height < kMaxHeight && ((rnd_.Next() % kBranching) == 0)) {
    height++;
  }
  assert(height > 0);
  assert(height <= kMaxHeight);
  return height;
}

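height grows by one with probability 1/4, so about 1/4 of all nodes reach height 2, about 1/4 of those reach height 3, and so on.

Assuming the heights follow this distribution and there are n nodes in total, the top level holds roughly n * 1/4 * 1/4 * ... = n / 4^(max_height - 1) nodes, and each lower level adds only a constant number of steps (about 4 on average). A lookup is therefore expected to take O(log n) comparisons.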
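
The 1/4 ratio between levels is easy to check empirically. This sketch samples heights with the same loop as RandomHeight(), but with std::mt19937 in place of LevelDB's Random; SampleHeights and FractionAtLeast are names invented for the example.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Draw `samples` heights and return counts[h] = number of nodes whose
// height is exactly h (index 0 is unused).
std::vector<int> SampleHeights(int samples, int max_height,
                               unsigned branching, unsigned seed) {
  std::mt19937 rng(seed);
  std::vector<int> counts(max_height + 1, 0);
  for (int s = 0; s < samples; s++) {
    int height = 1;
    while (height < max_height && (rng() % branching) == 0) height++;
    counts[height]++;
  }
  return counts;
}

// Fraction of sampled nodes that appear on level (height - 1), i.e. whose
// height is at least `height`.
double FractionAtLeast(const std::vector<int>& counts, int height) {
  long total = 0;
  long at_least = 0;
  for (int h = 1; h < static_cast<int>(counts.size()); h++) {
    total += counts[h];
    if (h >= height) at_least += counts[h];
  }
  return total == 0 ? 0.0 : static_cast<double>(at_least) / total;
}
```

With kBranching = 4 and a large sample, FractionAtLeast(counts, 2) should come out close to 0.25 and FractionAtLeast(counts, 3) close to 0.0625, matching the geometric thinning described above.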

Arena

Implemented mainly in arena.h/arena.cc.

The arena allocates memory for the skip list's Nodes and for the key strings.

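class Arena {
  ......
  char* alloc_ptr_;  // next free byte in the current block
  size_t alloc_bytes_remaining_;  // bytes still free in the current block
  std::vector<char*> blocks_;  // every block allocated so far, including the one alloc_ptr_ points into
  std::atomic<size_t> memory_usage_;  // total size of all blocks recorded in blocks_ plus the memory blocks_ itself uses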
};

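The arena provides an AllocateAligned method and an Allocate method; AllocateAligned aligns the result to the size of void* (at least 8 bytes). If the current block has enough room, the request is served from it directly. Otherwise AllocateFallback takes over, with these rules:

If the object is larger than 1/4 of a block, allocate a block of exactly that size for it and record it in blocks_; the current block stays in use.
Otherwise, allocate a new block of kBlockSize bytes, push it onto blocks_, point alloc_ptr_ at it, and serve the object from there. Whatever was left in the previous block is wasted.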

Aligned allocation

// Returns the start address of the allocated memory (alloc_ptr_ + slop); the address is aligned
char* Arena::AllocateAligned(size_t bytes) {
  const int align = (sizeof(void*) > 8) ? sizeof(void*) : 8;  // platform dependent; align is at least 8 bytes
  static_assert((align & (align - 1)) == 0,
                "Pointer size should be a power of 2");  // checks that align is a power of 2
  // Check whether alloc_ptr_ is already aligned. Because align is a power of 2,
  // address & (align - 1) equals the address modulo align.
  size_t current_mod = reinterpret_cast<uintptr_t>(alloc_ptr_) & (align - 1);
  size_t slop = (current_mod == 0 ? 0 : align - current_mod);  // padding up to the next aligned address; 0 if already aligned
  size_t needed = bytes + slop;  // bytes to take from the block, padding included
  char* result;
  if (needed <= alloc_bytes_remaining_) {
    // The current block still has enough room
    result = alloc_ptr_ + slop;
    alloc_ptr_ += needed;
    alloc_bytes_remaining_ -= needed;
  } else {
    // AllocateFallback always returned aligned memory
    result = AllocateFallback(bytes);  // a fresh block is always aligned, so no padding is needed
  }
  assert((reinterpret_cast<uintptr_t>(result) & (align - 1)) == 0);
  return result;
}
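
The padding computation is the core of AllocateAligned and can be isolated as a pure function. AlignmentSlop is a helper name chosen for this sketch; it takes the address as an integer so that the arithmetic can be checked without allocating anything.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of padding bytes needed to round addr up to a multiple of align.
// align must be a power of two, so (align - 1) is a mask of the low bits
// and addr & (align - 1) equals addr % align.
std::size_t AlignmentSlop(std::uintptr_t addr, std::size_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  const std::size_t current_mod = static_cast<std::size_t>(addr & (align - 1));
  return current_mod == 0 ? 0 : align - current_mod;
}
```

For example, with 8-byte alignment an address ending in 0x3 needs 5 bytes of padding, while an address ending in 0x8 needs none.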

Allocating a new block

// Each call allocates either one kBlockSize (4096) byte block or exactly the requested size
// (the caller decides through block_bytes)
char* Arena::AllocateNewBlock(size_t block_bytes) {
  char* result = new char[block_bytes];
  blocks_.push_back(result);
  // The extra sizeof(char*) accounts for the pointer stored in blocks_
  memory_usage_.fetch_add(block_bytes + sizeof(char*),
                          std::memory_order_relaxed);  // atomic update
  return result;
}

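Thread safety

As the definition shows, MemTable uses reference counting to make its destruction safe: int refs_ is the count. Ref() and Unref() take no lock themselves, so callers must provide the synchronization.

Once a thread holds a MemTable pointer, though, inserts and deletes on the memtable take no lock either, which means the skip list is accessed without locking. How does the skip list stay thread-safe? Write requests can be accepted from multiple threads, but writes to the memtable go through a queue, so only one thread writes at any time. The workload is one writer with many readers.

Concurrent readers do not interfere with each other, and the skip list never removes elements, so the only case to handle is one inserting thread running alongside many reading threads. The skip list implementation uses atomic operations and memory barriers to keep reads safe while an insert is in progress.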
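
The publish pattern that makes one writer and many readers safe can be shown on a plain linked list. This is a minimal sketch with hypothetical names (PubNode, SingleWriterList), not the skip list itself: the writer fully initializes a node and only then makes it reachable with a release store, and readers follow links with acquire loads, so any node they can reach is already complete.

```cpp
#include <atomic>
#include <cassert>

struct PubNode {
  explicit PubNode(int v) : value(v), next(nullptr) {}
  int value;
  std::atomic<PubNode*> next;
};

// Append-only list: Append() may be called by one thread at a time,
// Sum() may run concurrently on any number of reader threads.
class SingleWriterList {
 public:
  SingleWriterList() : head_(0), tail_(&head_) {}
  ~SingleWriterList() {
    PubNode* x = head_.next.load(std::memory_order_relaxed);
    while (x != nullptr) {
      PubNode* next = x->next.load(std::memory_order_relaxed);
      delete x;
      x = next;
    }
  }
  SingleWriterList(const SingleWriterList&) = delete;
  SingleWriterList& operator=(const SingleWriterList&) = delete;

  // Writer side: build the node first, then publish it.
  void Append(int value) {
    PubNode* node = new PubNode(value);
    // Release store: everything written to *node above becomes visible
    // to any reader that observes this pointer with an acquire load.
    tail_->next.store(node, std::memory_order_release);
    tail_ = node;  // touched only by the writer
  }

  // Reader side: each acquire load pairs with the writer's release store.
  int Sum() const {
    int sum = 0;
    for (PubNode* x = head_.next.load(std::memory_order_acquire);
         x != nullptr; x = x->next.load(std::memory_order_acquire)) {
      sum += x->value;
    }
    return sum;
  }

 private:
  PubNode head_;   // dummy node, like head_ in the skip list
  PubNode* tail_;  // last node
};
```

A reader racing with Append() sees either the list without the new node or the list with the fully built node, never a half-initialized one. Because nothing is ever removed, readers also never follow a pointer to freed memory while the list is alive, which is the same property the skip list relies on.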
