References
- rocksdb介绍之数据写入流程_rocksdb 写入流程-CSDN博客
- RocksDB写流程梳理 - 知乎 / RocksDB写流程梳理 - 文章详情
- RocksDB学习笔记#3 写流程_rocksdb读写流程-CSDN博客
- RocksDB 第三课 读取写入和并发 - 墨天轮
- Rocksdb 系列 - 写流程 - 知乎
Function call order
DB::Put
The user calls the Put function:
// include/rocksdb/db.h
virtual Status Put(const WriteOptions& options, const Slice& key, const Slice& value);
This overload forwards to the one that also takes a column family:
// include/rocksdb/db.h
virtual Status Put(const WriteOptions& options, ColumnFamilyHandle* column_family, const Slice& key, const Slice& value) = 0;
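This is a pure virtual function that nevertheless has a default implementation (C++ has always allowed a pure virtual function to be given an out-of-class definition; this is not specific to C++11). Because DBImpl::Put overrides it, the call is dispatched to DBImpl::Put.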
DBImpl::Put
// db/db_impl/db_impl.h (declaration)
Status Put(const WriteOptions& options, ColumnFamilyHandle* column_family,
const Slice& key, const Slice& value) override;
// db/db_impl/db_impl_write.cc (definition)
// Convenience methods
Status DBImpl::Put(const WriteOptions& o, ColumnFamilyHandle* column_family,
const Slice& key, const Slice& val) {
const Status s = FailIfCfHasTs(column_family); // timestamp-related check on the column family
if (!s.ok()) {
return s;
}
return DB::Put(o, column_family, key, val); // call the default implementation of the pure virtual DB::Put
}
As shown, it ends up calling the default implementation of the pure virtual function.
DB::Put
// db/db_impl/db_impl_write.cc
// Default implementations of convenience methods that subclasses of DB
// can call if they wish
Status DB::Put(const WriteOptions& opt, ColumnFamilyHandle* column_family,
const Slice& key, const Slice& value) {
// Pre-allocate size of write batch conservatively.
// 8 bytes are taken by header, 4 bytes for count, 1 byte for type,
// and we allocate 11 extra bytes for key length, as well as value length.
WriteBatch batch(key.size() + value.size() + 24, 0 /* max_bytes */,
opt.protection_bytes_per_key, 0 /* default_cf_ts_sz */);
Status s = batch.Put(column_family, key, value);
if (!s.ok()) {
return s;
}
return Write(opt, &batch);
}
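This function first creates a WriteBatch and then puts the key-value pair into it. That looks odd at first: there is only one key, so why create a "batch"? The reason is that RocksDB lets users write a batch of updates atomically (see RocksDB 第三课 读取写入和并发 - 墨天轮), for example: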
WriteBatch batch;
batch.Delete("key1");
batch.Put("key2", value);
s = db->Write(WriteOptions(), &batch);
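RocksDB therefore unifies its internal interface: every write goes through a WriteBatch. The call chain then reaches DB::Write, which is also a pure virtual function and is implemented by DBImpl::Write. That implementation ends by calling DBImpl::WriteImpl, which is a very complex function.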
DBImpl::WriteImpl
// db/db_impl/db_impl_write.cc
// The main write queue. This is the only write queue that updates LastSequence.
// When using one write queue, the same sequence also indicates the last
// published sequence.
Status DBImpl::WriteImpl(const WriteOptions& write_options,
WriteBatch* my_batch, WriteCallback* callback,
uint64_t* log_used, uint64_t log_ref,
bool disable_memtable, uint64_t* seq_used,
size_t batch_cnt,
PreReleaseCallback* pre_release_callback,
PostMemTableCallback* post_memtable_callback) {
// Take different branches depending on the configuration
// ...
// Create the Writer
WriteThread::Writer w(write_options, my_batch, callback, log_ref,
disable_memtable, batch_cnt, pre_release_callback,
post_memtable_callback);
// Join the write batch group: append this writer to the tail of the writer list and obtain its role
write_thread_.JoinBatchGroup(&w);
// Each role does different work
// If this writer is a follower and concurrent memtable writes are allowed
if (w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER) {
// Write to the memtable concurrently
if (w.ShouldWriteToMemtable()) {
// ...
}
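// Report to the group leader that this writer has finished its memtable write, and wait for the other writers
// If this thread is the last one in the group to finish, the function returns true and this thread does the cleanup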
if (write_thread_.CompleteParallelMemTableWriter(&w)) {
// Cleanup
// ...
write_thread_.ExitAsBatchGroupFollower(&w);
}
// Reaching this point means the writer's state is STATE_COMPLETED
assert(w.state == WriteThread::STATE_COMPLETED);
// STATE_COMPLETED conditional below handles exit
}
// Either this follower finished its memtable write in the if-block above,
// or the leader wrote to the memtable on its behalf,
// so the function can return
if (w.state == WriteThread::STATE_COMPLETED) {
if (log_used != nullptr) {
*log_used = w.log_used;
}
if (seq_used != nullptr) {
*seq_used = w.sequence;
}
// write is complete and leader has updated sequence
return w.FinalStatus();
}
// If this writer is the leader, it has the following work to do
// else we are the leader of the write batch group
assert(w.state == WriteThread::STATE_GROUP_LEADER);
Status status;
// Once reaches this point, the current writer "w" will try to do its write
// job. It may also pick up some of the remaining writers in the "writers_"
// when it finds suitable, and finish them in the same write batch.
// This is how a write job could be done by the other writer.
WriteContext write_context;
LogContext log_context(write_options.sync);
WriteThread::WriteGroup write_group;
// ...
assert(!two_write_queues_ || !disable_memtable);
{
// ...
// The leader checks conditions such as whether the WAL is full
status = PreprocessWrite(write_options, &log_context, &write_context);
// ...
}
// Pick a number of writers from the writer list in write_thread_ to build the write batch group
last_batch_group_size_ = write_thread_.EnterAsBatchGroupLeader(&w, &write_group);
// ...
if (status.ok()) {
// Inspect each writer in the group, total up the write size, and decide whether concurrent writes are possible
bool parallel = immutable_db_options_.allow_concurrent_memtable_write &&
write_group.size > 1;
size_t total_count = 0;
size_t valid_batches = 0;
size_t total_byte_size = 0;
size_t pre_release_callback_cnt = 0;
// ...
// Collect some statistics
// ...
if (!two_write_queues_) {
if (status.ok() && !write_options.disableWAL) {
// ...
// The leader writes the WAL
io_s =
WriteToWAL(write_group, log_context.writer, log_used,
log_context.need_log_sync, log_context.need_log_dir_sync,
last_sequence + 1, log_file_number_size);
}
} else {
if (status.ok() && !write_options.disableWAL) {
// ...
io_s = ConcurrentWriteToWAL(write_group, log_used, &last_sequence,
seq_inc);
} else {
// Otherwise we inc seq number for memtable writes
last_sequence = versions_->FetchAddLastAllocatedSequence(seq_inc);
}
}
// Not sure what this part does yet
// PreReleaseCallback is called after WAL write and before memtable write
if (status.ok()) {
// ...
}
if (status.ok()) {
PERF_TIMER_GUARD(write_memtable_time);
// Whether concurrent memtable writes are possible was decided above
if (!parallel) {
// If not, the leader writes to the memtable on behalf of the followers in the group
// w.sequence will be set inside InsertInto
w.status = WriteBatchInternal::InsertInto(
write_group, current_sequence, column_family_memtables_.get(),
&flush_scheduler_, &trim_history_scheduler_,
write_options.ignore_missing_column_families,
0 /*recovery_log_number*/, this, parallel, seq_per_batch_,
batch_per_txn_);
} else {
write_group.last_sequence = last_sequence;
// If so, the leader wakes the followers, which then take the WriteThread::STATE_PARALLEL_MEMTABLE_WRITER branch shown above
write_thread_.LaunchParallelMemTableWriters(&write_group);
in_parallel_group = true;
// Each parallel follower is doing each own writes. The leader should
// also do its own.
if (w.ShouldWriteToMemtable()) {
// ...
}
}
if (seq_used != nullptr) {
*seq_used = w.sequence;
}
}
}
// ...
// Appears to sync the WAL to disk
if (log_context.need_log_sync) {
// ...
}
bool should_exit_batch_group = true;
// This if-block runs only when the memtable was written concurrently
if (in_parallel_group) {
// CompleteParallelWorker returns true if this thread should
// handle exit, false means somebody else did
// Wait for the other threads to finish writing and determine whether the leader finished last; followers call this function too
should_exit_batch_group = write_thread_.CompleteParallelMemTableWriter(&w);
}
// If the leader is the last one in the group to finish, it does the cleanup
if (should_exit_batch_group) {
// This code closely mirrors the follower cleanup above
// ...
write_thread_.ExitAsBatchGroupLeader(write_group, status);
}
if (status.ok()) {
status = w.FinalStatus();
}
return status;
}
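DBImpl::WriteImpl takes different branches depending on the options (both WriteOptions and ImmutableDBOptions) and on its arguments. Here we only analyze the default case:
- Create a WriteThread::Writer.
- Call WriteThread::JoinBatchGroup to add the Writer to a write batch group and obtain its role, either follower or leader. This step may block. The function appends the current Writer to the tail of the request list. If the Writer is the first one in the list, it becomes the leader immediately; otherwise it blocks until it either becomes a leader or is assigned a role by a leader (this function will be analyzed in detail later).
- Each role then performs different tasks:
  - If it is the leader:
    - Call DBImpl::PreprocessWrite to check, among other things, whether the WAL is full. This function is where flushes get triggered.
    - Call WriteThread::EnterAsBatchGroupLeader to build the write batch group, that is, take a number of Writers off the request list and form a group.
    - Write the WAL.
    - Either write to the memtable on behalf of the followers in the group, or call WriteThread::LaunchParallelMemTableWriters to wake the followers so that they write to the memtable concurrently.
    - If the leader is the last one in its group to finish writing the memtable, it does the cleanup and calls WriteThread::ExitAsBatchGroupLeader.
  - If it is a follower:
    - It may be woken by WriteThread::LaunchParallelMemTableWriters to write to the memtable. After writing, it calls WriteThread::CompleteParallelMemTableWriter to wait for the other writers. If it is the last one to finish, it does the cleanup and calls WriteThread::ExitAsBatchGroupFollower.
    - Once it has written to the memtable itself, or the leader has done so on its behalf, the flow ends and the function returns.
- TODO: add an analysis of data structures such as Writer and WriteThread, and of how a group is created.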
Summary
To be continued