The origin of Bitcask is tied to the history of the Riak distributed database. In a Riak key/value cluster, each node uses pluggable local storage; nearly anything k/v-shaped can be used as the per-host storage engine. This pluggability allowed progress on Riak to be parallelized such that storage engines could be improved and tested without impact on the rest of the codebase.
Many such local key/value stores already exist, including but not limited to Berkeley DB, Tokyo Cabinet, and Innostore. We had several goals in mind when evaluating such storage engines, including:
• low latency per item read or written
• high throughput, especially when writing an incoming stream of random items
• ability to handle datasets much larger than RAM w/o degradation
• crash friendliness, both in terms of fast recovery and not losing data
• ease of backup and restore
• a relatively simple, understandable (and thus supportable) code structure and data format
• predictable behavior under heavy access load or large volume
• a license that allowed for easy default use in Riak
Achieving some of these is easy. Achieving them all is less so.
None of the local key/value storage systems available (including but not limited to those written by the authors) were ideal with regard to all of the above goals. We were discussing this issue with Eric Brewer when he had a key insight about hash table log merging: that doing so could potentially be as fast as or faster than LSM-trees.
This led us to explore some of the techniques used in the log-structured file systems first developed in the 1980s and 1990s in a new light. That exploration led to the development of Bitcask, a storage system that meets all of the above goals very well. While Bitcask was originally developed with a goal of being used under Riak, it was built to be generic and can serve as a local key/value store for other applications as well.
The model we ended up going with is conceptually very simple. A Bitcask instance is a directory, and we enforce that only one operating system process will open that Bitcask for writing at a given time. You can think of that process effectively as the "database server". At any moment, one file is "active" in that directory for writing by the server. When that file meets a size threshold it will be closed and a new active file will be created. Once a file is closed, either purposefully or due to server exit, it is considered immutable and will never be opened for writing again.
The active file is only written by appending, which means that sequential writes do not require disk seeking. The format that is written for each key/value entry is simple:
With each write, a new entry is appended to the active file. Note that deletion is simply a write of a special tombstone value, which will be removed on the next merge. Thus, a Bitcask data file is nothing more than a linear sequence of these entries:
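The entry layout above (a CRC, a timestamp, the key and value sizes, then the key and value bytes) can be sketched as follows. This is an illustrative encoding, not Bitcask's actual Erlang implementation; the 32-bit field widths here are assumptions.

```python
import struct
import time
import zlib

# Illustrative on-disk entry: crc | tstamp | ksz | value_sz | key | value.
# All four header fields are assumed to be 32-bit big-endian integers.
HEADER = struct.Struct(">IIII")  # crc, tstamp, ksz, value_sz

def encode_entry(key, value, tstamp=None):
    tstamp = int(time.time()) if tstamp is None else tstamp
    body = struct.pack(">III", tstamp, len(key), len(value)) + key + value
    crc = zlib.crc32(body)       # CRC covers everything after the crc field
    return struct.pack(">I", crc) + body

def decode_entry(buf):
    crc, tstamp, ksz, vsz = HEADER.unpack_from(buf, 0)
    body = buf[4:HEADER.size + ksz + vsz]
    if zlib.crc32(body) != crc:
        raise ValueError("corrupt entry")
    key = buf[HEADER.size:HEADER.size + ksz]
    value = buf[HEADER.size + ksz:HEADER.size + ksz + vsz]
    return tstamp, key, value
```

Because each entry carries its own CRC, a torn write at the tail of the active file can be detected and discarded on recovery, which is part of what makes the model crash friendly.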
After the append completes, an in-memory structure called a "keydir" is updated. A keydir is simply a hash table that maps every key in a Bitcask to a fixed-size structure giving the file, offset, and size of the most recently written entry for that key.
When a write occurs, the keydir is atomically updated with the location of the newest data. The old data is still present on disk, but any new reads will use the latest version available in the keydir. As we’ll see later, the merge process will eventually remove the old value.
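A toy sketch of that write path, assuming a keydir entry holds (file_id, value_sz, value_pos, tstamp) for the newest version of each key; the entry framing is simplified to raw value bytes so only the keydir logic is in focus.

```python
import io
import time

keydir = {}
active = io.BytesIO()   # stands in for the active data file
ACTIVE_ID = 7           # illustrative file id for the active file

def put(key, value):
    pos = active.tell()
    active.write(value)  # append-only: we never seek back or overwrite
    # Atomically point the keydir at the newest location; the old bytes
    # remain on disk until a merge removes them.
    keydir[key] = (ACTIVE_ID, len(value), pos, int(time.time()))

put(b"k", b"old")
put(b"k", b"new value")  # stale bytes stay behind; keydir sees only the latest
```

After the second `put`, the keydir maps `b"k"` to offset 3 with size 9, so readers never see the shadowed `b"old"` bytes at offset 0.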
Reading a value is simple, and doesn't ever require more than a single disk seek. We look up the key in our keydir, and from there we read the data using the file id, position, and size that are returned from that lookup. In many cases, the operating system's filesystem read-ahead cache makes this a much faster operation than would be otherwise expected.
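The read path can be sketched as one keydir lookup followed by one positioned read. The on-disk layout is simplified here to raw value bytes, and the tuple shape (file_id, value_sz, value_pos) is illustrative.

```python
import io

# Files by id; offset 4 holds the latest version, offsets 0-2 hold a stale one.
data_files = {1: io.BytesIO(b"oldXnewvalue")}
keydir = {b"k": (1, 8, 4)}   # newest entry: file 1, size 8, position 4

def get(key):
    file_id, value_sz, value_pos = keydir[key]
    f = data_files[file_id]
    f.seek(value_pos)        # at most one seek per read
    return f.read(value_sz)
```

Here `get(b"k")` returns `b"newvalue"` directly, never touching the stale bytes earlier in the file.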
This simple model may use up a lot of space over time, since we just write out new values without touching the old ones. A process for compaction that we refer to as "merging" solves this. The merge process iterates over all non-active (i.e. immutable) files in a Bitcask and produces as output a set of data files containing only the "live" or latest versions of each present key.
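The merge pass can be sketched as a replay of the immutable files from oldest to newest, keeping only the latest version of each key and dropping deleted keys. Files are modeled as lists of (key, value) appends, and `TOMBSTONE` is an illustrative stand-in for Bitcask's special deletion value.

```python
TOMBSTONE = b"__tombstone__"  # hypothetical marker for a deleted key

def merge(immutable_files):
    live = {}
    for entries in immutable_files:      # oldest file first
        for key, value in entries:       # a later append shadows earlier ones
            if value == TOMBSTONE:
                live.pop(key, None)      # deletion: drop the key entirely
            else:
                live[key] = value
    return list(live.items())            # entries for the merged output file

old = [(b"a", b"1"), (b"b", b"2")]
newer = [(b"a", b"3"), (b"b", TOMBSTONE)]
merged = merge([old, newer])             # only b"a" -> b"3" survives
```

Note that both the shadowed value for `b"a"` and the tombstoned key `b"b"` vanish from the output, which is how the space used by dead entries is reclaimed.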
When this is done we also create a "hint file" next to each data file. These are essentially like the data files, but instead of the values they contain the position and size of the values within the corresponding data file.
When a Bitcask is opened by an Erlang process, it checks to see if there is already another Erlang process in the same VM that is using that Bitcask. If so, it will share the keydir with that process. If not, it scans all of the data files in a directory in order to build a new keydir. For any data file that has a hint file, that will be scanned instead for a much quicker startup time.
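Startup keydir construction from hint files can be sketched as below, assuming each hint entry carries (tstamp, key, value_sz, value_pos) for its data file. Because hint entries omit the values, this scan reads far less data than replaying the data files themselves.

```python
def load_keydir(hint_files):
    """hint_files: {file_id: [(tstamp, key, value_sz, value_pos), ...]}
    Replayed oldest file first, so later files overwrite earlier entries,
    leaving the keydir pointing at the newest version of every key."""
    keydir = {}
    for file_id, hints in sorted(hint_files.items()):
        for tstamp, key, value_sz, value_pos in hints:
            keydir[key] = (file_id, value_sz, value_pos, tstamp)
    return keydir

kd = load_keydir({
    1: [(100, b"a", 3, 0)],                       # older file: stale b"a"
    2: [(200, b"a", 5, 0), (200, b"b", 2, 5)],    # newer file wins
})
```

After loading, `kd[b"a"]` points into file 2, not file 1, matching the invariant that the keydir always references the most recently written entry for each key.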