0. Preface

LSM basics:
- Document: LSM paper walkthrough and LevelDB source code reading
- Document: hardcore.feishu.cn/docs/doccnx…
- Video: www.bilibili.com/video/BV16A…
Bigtable basics:
- Document: hardcore.feishu.cn/docs/doccnY…
- Video: www.bilibili.com/video/BV1jh…

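1. LevelDB preparation

1.1 Introduction

LevelDB is a single-node key/value storage engine developed by Google and released as open source under the BSD license. It is a typical LSM implementation that combines leveling with partitioning.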

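1.2 Features

Keys and values can be arbitrary byte arrays, not just strings.
LevelDB is a persistent KV store that keeps most of its data on disk.
Data is stored sorted by key, and LevelDB lets the user supply a custom comparison function for the ordering.
The interface is simple: write, read, and delete a record, plus atomic batches of multiple operations.
Snapshots are supported, so reads are isolated from concurrent writes and see a consistent view of the data for their whole duration.
Data compression (Snappy) is supported, which reduces storage space and can improve I/O efficiency.
It is a typical LSM implementation, well suited to write-heavy, read-light workloads.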

1.3 Building and using the source

Download the source

git clone github.com/google/leve…

cd leveldb

git submodule update --init

Build

mkdir -p build && cd build

Then configure and build with CMake, setting CMAKE_BUILD_TYPE to either Release or Debug.

For a Release build:

cmake -DCMAKE_BUILD_TYPE=Release .. && cmake --build .

Or, for a Debug build:

cmake -DCMAKE_BUILD_TYPE=Debug .. && cmake --build .

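Install the headers and the static library from the repository root (root privileges may be required):

cd ..

cp -r ./include/leveldb /usr/include/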

cp build/libleveldb.a /usr/local/lib/

Write a demo
Compile the demo

g++ -o xxx xxx.cpp libleveldb.a -lpthread

1.4 Benchmarking

See the benchmarks directory in the source tree.

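2. Basic components

Start with a quick look at this diagram to get a rough idea of the major parts of LevelDB.

2.1 Byte order

Byte order is a property of the processor architecture. Take a 16-bit integer, which consists of 2 bytes. There are two ways to store those 2 bytes in memory:

Little-endian: the low-order byte is stored at the starting address.
Big-endian: the high-order byte is stored at the starting address.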

#include <iostream>

using namespace std;

int main() {
    // Overlay a short with a byte array to see which byte is stored first.
    union {
        short s;
        char c[sizeof(short)];
    } un;
    un.s = 0x0102;
    if (sizeof(short) == 2) {
        if (un.c[0] == 1 && un.c[1] == 2) {
            cout << "big-endian" << endl;
        } else if (un.c[1] == 1 && un.c[0] == 2) {
            cout << "little-endian" << endl;
        } else {
            cout << "unknown" << endl;
        }
    } else {
        cout << "sizeof(short)=" << sizeof(short) << endl;
    }
    return 0;
}

LevelDB uses little-endian byte order.
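To make the little-endian choice concrete, here is a minimal sketch of writing and reading a 32-bit integer one byte at a time, least significant byte first, so the stored bytes are the same on any host. The names PutFixed32LE and GetFixed32LE are hypothetical and used only for illustration; LevelDB's own helpers (EncodeFixed32 and DecodeFixed32 in util/coding.h) may differ in detail.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: store v into buf[0..3], least significant byte first.
inline void PutFixed32LE(unsigned char* buf, uint32_t v) {
  assert(buf != nullptr);
  buf[0] = static_cast<unsigned char>(v & 0xff);
  buf[1] = static_cast<unsigned char>((v >> 8) & 0xff);
  buf[2] = static_cast<unsigned char>((v >> 16) & 0xff);
  buf[3] = static_cast<unsigned char>((v >> 24) & 0xff);
}

// Hypothetical helper: rebuild the integer from 4 little-endian bytes.
inline uint32_t GetFixed32LE(const unsigned char* buf) {
  assert(buf != nullptr);
  return static_cast<uint32_t>(buf[0]) |
         (static_cast<uint32_t>(buf[1]) << 8) |
         (static_cast<uint32_t>(buf[2]) << 16) |
         (static_cast<uint32_t>(buf[3]) << 24);
}
```

Because the bytes are produced with shifts rather than by copying the integer's memory, the result does not depend on the byte order of the machine running the code.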

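2.2 slice

Slice is LevelDB's own string-handling class. It exists mainly because std::string from the standard library:

copies by default, which costs performance (when lifetimes are predictable, passing a pointer is enough)
does not provide functions such as remove_prefix and starts_with, which is inconvenient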
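The two points above can be sketched with a small non-owning view. MiniSlice is a hypothetical, simplified stand-in written for this example; it is not the real leveldb::Slice, which has more members.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// MiniSlice: a hypothetical, simplified stand-in for leveldb::Slice.
// It does not own the bytes it points to; the caller keeps them alive.
class MiniSlice {
 public:
  MiniSlice() : data_(""), size_(0) {}
  MiniSlice(const char* d, size_t n) : data_(d), size_(n) {}
  MiniSlice(const std::string& s) : data_(s.data()), size_(s.size()) {}

  const char* data() const { return data_; }
  size_t size() const { return size_; }

  // Drop the first n bytes by moving the pointer; nothing is copied.
  void remove_prefix(size_t n) {
    assert(n <= size_);
    data_ += n;
    size_ -= n;
  }

  // True if this slice begins with the bytes of x.
  bool starts_with(const MiniSlice& x) const {
    return size_ >= x.size_ && std::memcmp(data_, x.data_, x.size_) == 0;
  }

  // Copy out only when an owning string is really needed.
  std::string ToString() const { return std::string(data_, size_); }

 private:
  const char* data_;
  size_t size_;
};
```

Copying a MiniSlice copies only a pointer and a length, which is the cost the text contrasts with copying a std::string. The trade-off is that the viewed bytes must outlive the slice.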


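2.3 status

Status records the outcome of an operation in LevelDB. It stores an error code together with the matching error message string (the set of codes is fixed and cannot be extended by the user). Its basic structure: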


 private:
  enum Code {
    kOk = 0,
    kNotFound = 1,
    kCorruption = 2,
    kNotSupported = 3,
    kInvalidArgument = 4,
    kIOError = 5
  };

Status::Status(Code code, const Slice& msg, const Slice& msg2) {
  assert(code != kOk);
  const uint32_t len1 = static_cast<uint32_t>(msg.size());
  const uint32_t len2 = static_cast<uint32_t>(msg2.size());
  // The message is "msg" or "msg: msg2"; the extra 2 bytes hold ": ".
  const uint32_t size = len1 + (len2 ? (2 + len2) : 0);
  // Layout: bytes [0,4) message length, byte [4] code, bytes [5,...) message.
  char* result = new char[size + 5];
  std::memcpy(result, &size, sizeof(size));
  result[4] = static_cast<char>(code);
  std::memcpy(result + 5, msg.data(), len1);
  if (len2) {
    result[5 + len1] = ':';
    result[6 + len1] = ' ';
    std::memcpy(result + 7 + len1, msg2.data(), len2);
  }
  state_ = result;
}
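The constructor above packs everything into one heap buffer: a 4-byte length, a 1-byte code, then the message. The sketch below rebuilds that layout with plain std::string inputs and shows how the code and message can be read back. PackState, StateCode, and StateMessage are hypothetical names invented for this example, and the integer codes follow the enum shown above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper mirroring the constructor's layout:
//   bytes [0,4): message length, byte [4]: code, bytes [5,5+length): message.
inline char* PackState(int code, const std::string& msg,
                       const std::string& msg2) {
  assert(code != 0);  // code 0 (OK) carries no buffer at all
  const uint32_t len1 = static_cast<uint32_t>(msg.size());
  const uint32_t len2 = static_cast<uint32_t>(msg2.size());
  const uint32_t size = len1 + (len2 ? (2 + len2) : 0);
  char* result = new char[size + 5];
  std::memcpy(result, &size, sizeof(size));
  result[4] = static_cast<char>(code);
  std::memcpy(result + 5, msg.data(), len1);
  if (len2) {
    result[5 + len1] = ':';
    result[6 + len1] = ' ';
    std::memcpy(result + 7 + len1, msg2.data(), len2);
  }
  return result;  // caller releases it with delete[]
}

// A null state stands for success, so it maps to code 0.
inline int StateCode(const char* state) {
  return state == nullptr ? 0 : static_cast<int>(state[4]);
}

// Read the length prefix, then copy out exactly that many message bytes.
inline std::string StateMessage(const char* state) {
  if (state == nullptr) return "";
  uint32_t size;
  std::memcpy(&size, state, sizeof(size));
  return std::string(state + 5, size);
}
```

Keeping a single pointer that is null on success means the common OK case needs no allocation, which appears to be the motivation for this layout.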

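2.4 Encoding

LevelDB uses both fixed-length and variable-length integer encodings; the variable-length encoding exists to save space. The basic idea is that the most significant bit of each byte is a flag (1 means more bytes follow, 0 means this is the last byte), and the remaining 7 bits carry the actual value. The same scheme is widely used in protobuf.

See the function char* EncodeVarint32(char* dst, uint32_t v).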
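The following is a minimal sketch of the variable-length scheme described above, written from the description rather than copied from LevelDB. PutVarint32Sketch and GetVarint32Sketch are hypothetical names; the real EncodeVarint32 may be structured differently.

```cpp
#include <cassert>
#include <cstdint>

// Writes v as a varint and returns the pointer just past the last byte.
// dst must have room for up to 5 bytes.
inline unsigned char* PutVarint32Sketch(unsigned char* dst, uint32_t v) {
  assert(dst != nullptr);
  while (v >= 0x80) {
    // Low 7 bits of v, with the high bit set to signal "more bytes follow".
    *dst++ = static_cast<unsigned char>(v | 0x80);
    v >>= 7;
  }
  *dst++ = static_cast<unsigned char>(v);  // final byte, high bit clear
  return dst;
}

// Decodes one varint from [p, limit). Returns the pointer past it,
// or nullptr if the input is truncated or longer than 5 bytes.
inline const unsigned char* GetVarint32Sketch(const unsigned char* p,
                                              const unsigned char* limit,
                                              uint32_t* value) {
  uint32_t result = 0;
  for (uint32_t shift = 0; shift <= 28 && p < limit; shift += 7) {
    uint32_t byte = *p++;
    if (byte & 0x80) {
      result |= (byte & 0x7f) << shift;  // continuation byte
    } else {
      result |= byte << shift;  // last byte
      *value = result;
      return p;
    }
  }
  return nullptr;
}
```

For example, 300 fits in two bytes (0xAC, 0x02) instead of the four a fixed-length encoding needs, while values of 2^28 or more take five bytes.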

2.5 Option

Option (the Options struct in the source) holds LevelDB's configuration parameters.


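2.6 skiplist

The conventional skip list

Definition

A skip list is a data structure that can be used in place of a balanced tree; it can be viewed as several sorted linked lists stacked in parallel. A skip list stays balanced probabilistically, whereas a balanced tree relies on strict rotations, so a skip list is easier to implement and still runs efficiently compared with a balanced tree. In the Redis source quoted below, the maximum level defaults to 64.

Common operations
- Initialization
- Insertion
- Search
- Deletion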
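The four operations listed above can be sketched as a small single-threaded skip list of ints. Everything here (TinySkipList, its members, the 1/4 promotion probability, the cap of 12 levels) is an assumption made for illustration; it is neither the Redis nor the LevelDB implementation.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// TinySkipList: an illustrative, single-threaded skip list of ints.
class TinySkipList {
 public:
  // Initialization: a head node that owns one link per possible level.
  TinySkipList() : head_(new Node(0, kMaxLevel)), level_(1) {}
  ~TinySkipList() {
    for (Node* n = head_; n != nullptr;) {
      Node* next = n->next[0];
      delete n;
      n = next;
    }
  }
  TinySkipList(const TinySkipList&) = delete;
  TinySkipList& operator=(const TinySkipList&) = delete;

  // Search: descend from the top level, moving right while keys are smaller.
  bool Contains(int key) const {
    Node* prev[kMaxLevel];
    const Node* x = FindPrev(key, prev);
    return x != nullptr && x->key == key;
  }

  // Insertion: returns false if key was already present.
  bool Insert(int key) {
    Node* prev[kMaxLevel];
    Node* x = FindPrev(key, prev);
    if (x != nullptr && x->key == key) return false;
    const int h = RandomLevel();
    for (int i = level_; i < h; ++i) prev[i] = head_;  // new top levels
    if (h > level_) level_ = h;
    Node* n = new Node(key, h);
    for (int i = 0; i < h; ++i) {  // splice the node in on each of its levels
      n->next[i] = prev[i]->next[i];
      prev[i]->next[i] = n;
    }
    return true;
  }

  // Deletion: unlink the node on every level where it appears.
  bool Erase(int key) {
    Node* prev[kMaxLevel];
    Node* x = FindPrev(key, prev);
    if (x == nullptr || x->key != key) return false;
    for (int i = 0; i < level_; ++i) {
      if (prev[i]->next[i] == x) prev[i]->next[i] = x->next[i];
    }
    delete x;
    while (level_ > 1 && head_->next[level_ - 1] == nullptr) --level_;
    return true;
  }

 private:
  enum { kMaxLevel = 12 };
  struct Node {
    Node(int k, int h) : key(k), next(h, nullptr) {}
    int key;
    std::vector<Node*> next;  // next[i] is the successor on level i
  };

  // Fills prev[i] with the last node before key on each level in use and
  // returns the first node at level 0 whose key is >= key (or nullptr).
  Node* FindPrev(int key, Node** prev) const {
    assert(prev != nullptr);
    Node* x = head_;
    for (int i = level_ - 1; i >= 0; --i) {
      while (x->next[i] != nullptr && x->next[i]->key < key) x = x->next[i];
      prev[i] = x;
    }
    return x->next[0];
  }

  // Promote to the next level with probability 1/4, capped at kMaxLevel.
  static int RandomLevel() {
    int h = 1;
    while (h < kMaxLevel && (std::rand() & 3) == 0) ++h;
    return h;
  }

  Node* head_;
  int level_;  // number of levels currently in use
};
```

Search, insertion, and deletion all share the same descent in FindPrev, which is why a skip list tends to need far less code than a rotation-based balanced tree.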

The skip list in Redis

#define ZSKIPLIST_MAXLEVEL 64 /* Should be enough for 2^64 elements */
#define ZSKIPLIST_P 0.25 /* Skiplist P = 1/4 */

// A single node may appear on several levels.
/* ZSETs use a specialized version of Skiplists */
typedef struct zskiplistNode {
 sds ele; // member object
 double score; // score of this node; nodes are ordered by ascending score
 struct zskiplistNode *backward; // previous node; used when traversing from tail to head
 struct zskiplistLevel {
 struct zskiplistNode *forward; // next node toward the tail on this level; head-to-tail traversal follows these pointers
 unsigned long span; // span: not used for traversal (forward is); it is used to compute ranks
 } level[];
} zskiplistNode;

typedef struct zskiplist {
 struct zskiplistNode *header, *tail;
 unsigned long length; // number of nodes
 int level; // total number of levels
} zskiplist;

Characteristics

Nodes are ordered by score; when scores are equal, they are ordered by ele.
The average lookup time complexity is O(log n).
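The ordering rule above (score first, then ele as the tie-breaker) can be written as a comparator. This is only a sketch: Entry, EntryLess, and SortedEntries are hypothetical names, and std::string stands in for Redis's sds type.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical element type standing in for one sorted-set entry.
struct Entry {
  double score;
  std::string ele;
};

// Orders by score first; equal scores fall back to comparing ele.
inline bool EntryLess(const Entry& a, const Entry& b) {
  if (a.score != b.score) return a.score < b.score;
  return a.ele < b.ele;
}

// Returns a copy of v sorted in the order a skip list would keep it.
inline std::vector<Entry> SortedEntries(std::vector<Entry> v) {
  std::sort(v.begin(), v.end(), EntryLess);
  assert(std::is_sorted(v.begin(), v.end(), EntryLess));
  return v;
}
```

The tie-breaker matters because several members can share one score, and without it their relative order in the list would be undefined.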

Common operations

Level calculation

#define ZSKIPLIST_MAXLEVEL 64 /* Should be enough for 2^64 elements */
#define ZSKIPLIST_P 0.25 /* Skiplist P = 1/4 */

int zslRandomLevel(void) {
 int level = 1;
 while ((random()&0xFFFF) < (ZSKIPLIST_P * 0xFFFF))
  level += 1;
 return (level<ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}

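From the code above, suppose each additional level is added with probability $p$. Then:

A node has 1 level with probability $1-p$

A node has 2 levels with probability $p(1-p)$

A node has 3 levels with probability $p^2(1-p)$

A node has 4 levels with probability $p^3(1-p)$

…

The average number of levels per node is $1(1-p) + 2p(1-p) + 3p^2(1-p) + \dots = \frac{1}{1-p}$. Substituting Redis's $p = 0.25$ gives an average of about 1.33 pointers per node.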
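As a sanity check on the $1/(1-p)$ result, the sketch below compares the closed form with an empirical mean. The names are hypothetical, and std::mt19937 with a fixed seed replaces random() so that the run is repeatable; apart from that, the sampling loop follows zslRandomLevel.

```cpp
#include <cassert>
#include <random>

// Draw one level: keep promoting while a 16-bit sample falls below p * 0xFFFF.
inline int RandomLevelSketch(std::mt19937& rng, double p, int max_level) {
  assert(p >= 0.0 && p < 1.0 && max_level >= 1);
  int level = 1;
  while ((rng() & 0xFFFF) < p * 0xFFFF) level += 1;
  return level < max_level ? level : max_level;
}

// Closed form for the uncapped mean: sum over k >= 1 of k * p^(k-1) * (1-p).
inline double ExpectedLevel(double p) { return 1.0 / (1.0 - p); }

// Empirical mean over n draws, using a fixed seed.
inline double SampleMeanLevel(double p, int max_level, int n) {
  std::mt19937 rng(12345);
  long long total = 0;
  for (int i = 0; i < n; ++i) total += RandomLevelSketch(rng, p, max_level);
  return static_cast<double>(total) / n;
}
```

With $p = 0.25$ and a few hundred thousand draws, the sample mean should land close to 4/3. The cap of 64 has no practical effect, because reaching it would take 63 consecutive promotions.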

Insertion

Nodes are ordered by score; when scores are equal, they are ordered by ele.

Search and deletion are not shown here.

The skip list in LevelDB



template <typename Key, class Comparator>
class SkipList {
 private:
  struct Node;

 public:
  // (constructor, Insert() and Contains() are elided in this excerpt)

  // Iteration over the contents of a skip list
  class Iterator {
   public:
    // (public iterator methods are elided in this excerpt)

   private:
    const SkipList* list_;
    Node* node_;
    // Intentionally copyable
  };

 private:
  enum { kMaxHeight = 12 };

  inline int GetMaxHeight() const {
    return max_height_.load(std::memory_order_relaxed);
  }

  Node* NewNode(const Key& key, int height);
  int RandomHeight();
  bool Equal(const Key& a, const Key& b) const { return (compare_(a, b) == 0); }

  // Return true if key is greater than the data stored in "n"
  bool KeyIsAfterNode(const Key& key, Node* n) const;

  // Return the earliest node that comes at or after key.
  // Return nullptr if there is no such node.
  //
  // If prev is non-null, fills prev[level] with pointer to previous
  // node at "level" for every level in [0..max_height_-1].
  Node* FindGreaterOrEqual(const Key& key, Node** prev) const;

  // Return the latest node with a key < key.
  // Return head_ if there is no such node.
  Node* FindLessThan(const Key& key) const;

  // Return the last node in the list.
  // Return head_ if list is empty.
  Node* FindLast() const;

  // Immutable after construction
  Comparator const compare_;
  Arena* const arena_;  // Arena used for allocations of nodes

  Node* const head_;

  // Modified only by Insert().  Read racily by readers, but stale
  // values are ok.
  std::atomic<int> max_height_;  // Height of the entire list

  // Read/written only by Insert().
  Random rnd_;
};





// Implementation details follow
template <typename Key, class Comparator>
struct SkipList<Key, Comparator>::Node {
  explicit Node(const Key& k) : key(k) {}

  Key const key;

  // Accessors/mutators for links.  Wrapped in methods so we can
  // add the appropriate barriers as necessary.
  Node* Next(int n) {
    assert(n >= 0);
    // Use an 'acquire load' so that we observe a fully initialized
    // version of the returned Node.
    return next_[n].load(std::memory_order_acquire);
  }
  void SetNext(int n, Node* x) {
    assert(n >= 0);
    // Use a 'release store' so that anybody who reads through this
    // pointer observes a fully initialized version of the inserted node.
    next_[n].store(x, std::memory_order_release);
  }

  // No-barrier variants that can be safely used in a few locations.
  Node* NoBarrier_Next(int n) {
    assert(n >= 0);
    return next_[n].load(std::memory_order_relaxed);
  }
  void NoBarrier_SetNext(int n, Node* x) {
    assert(n >= 0);
    next_[n].store(x, std::memory_order_relaxed);
  }

 private:
  // Array of length equal to the node height.  next_[0] is lowest level link.
  // 1. One slot is declared up front because every node is present on level 0,
  //    which links all of the data.
  // 2. With an array, the slots for the higher levels are allocated contiguously
  //    right after it, which is cache-friendly.
  std::atomic<Node*> next_[1];  // std::atomic makes each link access atomic
};





template <typename Key, class Comparator>
typename SkipList<Key, Comparator>::Node* SkipList<Key, Comparator>::NewNode(
    const Key& key, int height) {
  // Why (height - 1)? Node already declares one link slot (next_[1]),
  // so only the remaining height - 1 slots need extra space.
  char* const node_memory = arena_->AllocateAligned(
      sizeof(Node) + sizeof(std::atomic<Node*>) * (height - 1));
  // Placement new: construct the Node in the memory that was just allocated.
  return new (node_memory) Node(key);
}
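To see the allocation trick in isolation, the sketch below reproduces it outside LevelDB. FlexNode, NewFlexNode, and FreeFlexNode are hypothetical names, and std::malloc stands in for Arena::AllocateAligned. Like the original, it indexes past the declared one-element array into the extra space allocated behind the struct, which relies on the allocation layout rather than on anything the language guarantees for arrays.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical node type copying the layout trick: one link slot is declared,
// and the extra slots live in the bytes allocated right after the struct.
struct FlexNode {
  explicit FlexNode(int k) : key(k) {}
  int const key;

  FlexNode* Next(int n) {
    assert(n >= 0);
    return next_[n].load(std::memory_order_acquire);
  }
  void SetNext(int n, FlexNode* x) {
    assert(n >= 0);
    next_[n].store(x, std::memory_order_release);
  }

 private:
  std::atomic<FlexNode*> next_[1];
};

// Allocate a node with `height` link slots in a single block.
inline FlexNode* NewFlexNode(int key, int height) {
  assert(height >= 1);
  const std::size_t bytes =
      sizeof(FlexNode) + sizeof(std::atomic<FlexNode*>) * (height - 1);
  void* mem = std::malloc(bytes);  // stand-in for the arena allocation
  if (mem == nullptr) std::abort();
  FlexNode* node = new (mem) FlexNode(key);  // placement new: build in place
  for (int i = 0; i < height; ++i) node->SetNext(i, nullptr);
  return node;
}

// Memory from malloc needs an explicit destructor call and free.
inline void FreeFlexNode(FlexNode* node) {
  node->~FlexNode();
  std::free(node);
}
```

A node of height 4 therefore costs one allocation instead of a struct plus a separate array of links. In LevelDB the arena frees all nodes at once, so it has no per-node counterpart to FreeFlexNode.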

2.7 bloomfilter

Document: hardcore.feishu.cn/docs/doccnt…

Video: www.bilibili.com/video/BV1mK…

LevelDB source code walkthrough (part 1)