RocksDB
zhuanlan.zhihu.com/p/632841342
RocksDB is a persistent, embedded key-value store, designed to hold large numbers of keys and their associated values. On top of this simple key-value data model you can build more complex systems such as inverted indexes, document databases, SQL databases, caches, and message brokers.
RocksDB was forked from Google's LevelDB in 2012 and optimized for servers running on SSDs. It is currently developed and maintained by Meta.
RocksDB is written in C++, so besides C and C++ it can be embedded, via C bindings, into applications written in other languages such as Rust, Go, or Java.
In the database world, and specifically in the context of RocksDB, "embedded" means:
- The database runs in no separate process; it is integrated into the application and shares its resources, such as memory, avoiding the overhead of cross-process communication.
- It has no built-in server and cannot be accessed remotely over the network.
- It is not distributed, meaning it provides no fault tolerance, redundancy, or sharding.
If these capabilities are needed, they must be implemented at the application layer.
RocksDB stores data as a collection of key-value pairs, where both keys and values are arbitrary-length byte arrays and therefore untyped. RocksDB exposes only a small set of low-level functions for working with the collection:
- put(key, value): insert a new key-value pair or update an existing one
- merge(key, value): combine the new value with the existing value for the given key
- delete(key): remove a key-value pair from the collection
- get(key): read the value associated with a key
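As a quick illustration of this interface, here is a minimal C++ sketch against the public RocksDB API. The database path and the example keys are arbitrary, and merge is omitted because it additionally requires a merge operator to be configured in Options.

```cpp
// Minimal sketch of the basic RocksDB C++ API described above.
// The path "/tmp/rocksdb_demo" and the keys/values are arbitrary examples.
#include <cassert>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;  // create the DB if it does not exist yet

  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_demo", &db);
  assert(s.ok());

  // put(key, value): insert or overwrite a key-value pair
  s = db->Put(rocksdb::WriteOptions(), "user:1", "alice");
  assert(s.ok());

  // get(key): read the value back
  std::string value;
  s = db->Get(rocksdb::ReadOptions(), "user:1", &value);
  assert(s.ok() && value == "alice");

  // delete(key): remove the key-value pair
  // (merge(key, value) would also be called on db, but it needs
  //  options.merge_operator to be set, so it is left out here)
  s = db->Delete(rocksdb::WriteOptions(), "user:1");
  assert(s.ok());

  delete db;
  return 0;
}
```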
memtable
The default memtable implementation is based on a skip list. Besides the default, users can choose other memtable implementations, such as HashLinkList, HashSkipList, or Vector, to speed up certain queries.
Skiplist MemTable
The skiplist-based memtable provides generally good performance for both reads and writes, for random access as well as sequential scans. It also provides some useful features that other memtable implementations currently lack, such as [Concurrent Insert] and [Insert with Hint].
HashSkiplist MemTable
HashSkipList organizes data in a hash table where each hash bucket is a skip list, while HashLinkList organizes data in a hash table where each hash bucket is a sorted singly linked list. Both types are built to reduce the number of comparisons during queries. One good use case is to combine them with the PlainTable SST format and store data in RAMFS.
When looking up or inserting a key, the target key's prefix is retrieved using Options.prefix_extractor and used to find the hash bucket. Inside a hash bucket, all comparisons are done using whole (internal) keys, just as in the skiplist-based memtable.
The biggest limitation of the hash-based memtables is that a scan across multiple prefixes requires a copy and a sort, which is very slow and memory-costly.
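A rough sketch of how a HashSkipList memtable is typically selected via Options, paired with a prefix extractor and the PlainTable SST format as suggested above; the 4-byte prefix length and the bucket count are arbitrary example values.

```cpp
// Sketch: selecting a HashSkipList memtable together with a prefix extractor.
// Factory names follow the public RocksDB API; the prefix length (4 bytes)
// and bucket count (50000) are example values, not recommendations.
#include "rocksdb/db.h"
#include "rocksdb/memtablerep.h"
#include "rocksdb/slice_transform.h"
#include "rocksdb/table.h"

rocksdb::Options MakeHashMemtableOptions() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // The prefix extractor defines which leading bytes of a key select a bucket.
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(4));

  // Each hash bucket is a skip list.
  options.memtable_factory.reset(rocksdb::NewHashSkipListRepFactory(50000));

  // Optionally pair it with the PlainTable SST format (PlainTable needs mmap reads).
  options.table_factory.reset(rocksdb::NewPlainTableFactory());
  options.allow_mmap_reads = true;

  return options;
}
```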
Log-structured Merge (LSM) Tree
distributed-computing-musings.com/2022/07/lsm…
LSM trees target write-intensive databases.
LSM trees are the data structure underlying many highly scalable NoSQL distributed key-value databases such as HBase, Cassandra, Google Bigtable, and LevelDB.
An LSM tree can be viewed as an n-level merge-tree. It transforms random writes into sequential writes using a log file and an in-memory store.
- All writes are initially persisted into an in-memory sorted data structure (known as the Memtable). The sorting property typically comes from an [AVL-tree] implementation.
- An AVL tree is a self-balancing [Binary Search Tree] (BST) in which the heights of the left and right subtrees of any node differ by at most one.
- Once the Memtable reaches a memory threshold, it is converted into an immutable data structure known as an SSTable (Sorted String Table) and flushed to disk. The original Memtable is then cleared and allocated new memory.
- If a key is deleted, it is assigned a tombstone value (a unique value that distinguishes the tombstone from other user-assigned values for the key).
- From time to time, a background job merges the on-disk SSTables in order to compress overlapping keys. For example, if the current SSTable has an entry SET A 2 whereas a previous SSTable has an entry SET A 4, we can merge these two entries and keep only SET A 2. This process is known as compaction.
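A deliberately simplified toy sketch of this write path, with std::map standing in for the memtable's sorted structure and in-memory maps standing in for on-disk SSTables; the tombstone sentinel and the tiny flush threshold are made-up values for illustration.

```cpp
// Toy sketch of the LSM write path: a sorted in-memory memtable, flush to an
// immutable "SSTable" once a size threshold is hit, tombstones for deletes,
// and a naive compaction that keeps only the newest version of each key.
#include <map>
#include <string>
#include <utility>
#include <vector>

static const std::string kTombstone = "__TOMBSTONE__";  // assumed sentinel value

struct ToyLSM {
  std::map<std::string, std::string> memtable;               // sorted in-memory store
  std::vector<std::map<std::string, std::string>> sstables;  // newest at the back
  size_t flush_threshold = 4;                                // tiny threshold for demo

  void put(const std::string& key, const std::string& value) {
    memtable[key] = value;
    if (memtable.size() >= flush_threshold) flush();
  }

  void del(const std::string& key) { put(key, kTombstone); }

  void flush() {
    // The memtable becomes an immutable SSTable; a fresh memtable replaces it.
    sstables.push_back(std::move(memtable));
    memtable.clear();
  }

  // Compaction: merge all SSTables, keeping only the newest version of each key.
  void compact() {
    std::map<std::string, std::string> merged;
    for (const auto& table : sstables) {                 // oldest first...
      for (const auto& [k, v] : table) merged[k] = v;    // ...so newer versions win
    }
    sstables.clear();
    if (!merged.empty()) sstables.push_back(std::move(merged));
  }
};
```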
On the other hand, an update on a B-Tree can trigger multiple updates as part of rebalancing the tree. These updates happen on disk, resulting in multiple disk hops that affect write latency. In an LSM tree, an update just appends an entry to an in-memory store, so it is comparatively faster than a B-Tree.
However, an LSM tree is slower than a B-Tree for read operations:
- To read the value associated with a key, we first look in the in-memory Memtable. If the value was updated recently, there is a high chance we can find it in memory and avoid touching the disk. This case is quite efficient in terms of latency.
- If we cannot find the key in the Memtable, we start reading the SSTables on disk in decreasing order of creation time. The most recently created SSTable is queried first, then the second most recent, and so on. If our compaction algorithm is efficient, there is a high chance we will find the key in the first few SSTables.
- If the key was deleted, we will find a tombstone associated with it and can return a KEY_NOT_FOUND exception.
- But what happens if the key was never written to our system? This becomes the worst case, as we end up reading all the SSTables on disk just to determine that the key is not present.
There are some optimizations that can avoid this worst-case lookup of a non-existent key, such as a Bloom filter. To improve read performance, we can also maintain an index file on disk that maps each key to the id of the latest on-disk segment it was written to. When we cannot find the key in the Memtable, we read the latest segment id from the index file and go directly to that segment. This requires only two disk hops and prevents us from reading all the segments on disk. But all of these optimizations come with their own challenges and maintenance costs.
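Continuing the ToyLSM sketch above, here is what the read path looks like when the Bloom filter and index-file optimizations are left out; a comment marks where a per-SSTable Bloom filter check would go.

```cpp
// Toy read path for ToyLSM (defined above): check the memtable first, then
// SSTables from newest to oldest, and treat a tombstone as "not found".
#include <optional>
#include <string>

std::optional<std::string> Get(const ToyLSM& db, const std::string& key) {
  // 1. The newest data lives in the memtable.
  if (auto it = db.memtable.find(key); it != db.memtable.end()) {
    if (it->second == kTombstone) return std::nullopt;  // key was deleted
    return it->second;
  }
  // 2. Otherwise scan SSTables from newest (back of the vector) to oldest.
  for (auto t = db.sstables.rbegin(); t != db.sstables.rend(); ++t) {
    // (A per-SSTable Bloom filter check would go here to skip this table.)
    if (auto it = t->find(key); it != t->end()) {
      if (it->second == kTombstone) return std::nullopt;  // key was deleted
      return it->second;
    }
  }
  // 3. Key was never written: the worst case, since every table was consulted.
  return std::nullopt;
}
```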
If we compare this with the read performance of a B-Tree, we need to query at most to the depth of the tree, which is logarithmic for a balanced tree. Hence the read amplification (amount of work done per read operation) of an LSM tree tends to be higher than that of a B-Tree unless we perform additional optimizations.
LSM
users.cs.utah.edu/~pandey/cou…
When a tree exceeds its size limit, its data is merged and rewritten:
- The higher level is always merged into the next lower level (Ci merged with Ci+1)
  - Merging always proceeds top down
- Recall mergesort from data structures/algorithms:
  - We can efficiently merge two sorted structures in linear time using iterators
- When merging two levels, newer key-value pair versions replace older ones (GC)
  - LSM-tree invariant: the newest version of any key-value pair is the version nearest to the top of the LSM-tree
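A small sketch of that merge step under the assumption that each level is a sorted run of key-value pairs: two runs are merged in linear time, mergesort-style, and when a key appears in both, the version from the newer level wins.

```cpp
// Merge the newer level Ci with the older level Ci+1 in linear time.
// Each run is assumed to be a vector of (key, value) pairs sorted by key.
#include <string>
#include <utility>
#include <vector>

using Run = std::vector<std::pair<std::string, std::string>>;  // sorted by key

Run MergeLevels(const Run& newer, const Run& older) {
  Run out;
  size_t i = 0, j = 0;
  while (i < newer.size() && j < older.size()) {
    if (newer[i].first < older[j].first) {
      out.push_back(newer[i++]);
    } else if (older[j].first < newer[i].first) {
      out.push_back(older[j++]);
    } else {
      out.push_back(newer[i++]);  // same key: newer version wins, older is GC'd
      ++j;
    }
  }
  while (i < newer.size()) out.push_back(newer[i++]);
  while (j < older.size()) out.push_back(older[j++]);
  return out;
}
```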
LevelDB
Google’s Open Source LSM-tree-ish KV-store
LevelDB consists of a hierarchy of SSTables
- An SSTable is a sorted set of key-value pairs (Sorted Strings Table)
- Typical SSTable size is 2MiB
The growth factor describes how the size of each level scales
- Let F be the growth factor (fanout)
- Let M be the size of the first level (e.g., 10MiB)
- Then the ith level, Ci has size (F^i)*M
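As a quick worked example of this sizing rule (F = 10 and M = 10 MiB are just illustrative numbers):

```cpp
// Level i has capacity F^i * M: with F = 10 and M = 10 MiB this prints
// 10, 100, 1000, 10000 MiB for levels C0..C3.
#include <cmath>
#include <cstdio>

int main() {
  const double F = 10.0;  // growth factor (fanout), example value
  const double M = 10.0;  // size of the first level in MiB, example value
  for (int i = 0; i < 4; ++i) {
    std::printf("level C%d: %.0f MiB\n", i, std::pow(F, i) * M);
  }
  return 0;
}
```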
The spine stores metadata about each level
- {key_i, offset_i} for all SSTables in a level (plus other metadata TBD)
- The spine is cached for fast searches of a given level
  - (If too big, a B-tree can be used to hold the spine for optimal searches)
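A sketch of how a spine of {key, offset} entries could be searched to locate the one SSTable in a level that may hold a key; the struct and function names here are illustrative, not LevelDB's actual types.

```cpp
// Binary-search the spine of one level: each entry records the smallest key of
// an SSTable and its offset, so the search picks the single SSTable whose key
// range could contain the target key.
#include <algorithm>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

struct SpineEntry {
  std::string first_key;  // smallest key stored in the SSTable
  uint64_t offset;        // file offset (or id) of the SSTable within the level
};

// Returns the offset of the SSTable whose key range may contain `key`.
std::optional<uint64_t> FindSSTable(const std::vector<SpineEntry>& spine,
                                    const std::string& key) {
  // First entry whose first_key is > key; the candidate table is the one before it.
  auto it = std::upper_bound(
      spine.begin(), spine.end(), key,
      [](const std::string& k, const SpineEntry& e) { return k < e.first_key; });
  if (it == spine.begin()) return std::nullopt;  // key is smaller than every table
  return std::prev(it)->offset;
}
```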
Final Thoughts
- LSM-trees are a write-optimized data structure:
- Many updates are batched and committed in a sequential I/O
- Although we may need to search for data in multiple levels, we can avoid unnecessary I/Os with additional metadata
- Bloom filters help avoid unnecessary searches in a given level
- Metadata in “spine” helps to target searches within a level
- I/O amplification is one of the biggest challenges for LSM-trees
- Leveled-design causes read amplification
- Searches may require I/Os at each level in worst case
- Compaction causes write amplification
- Different compaction strategies favor write vs. read performance