The RocksDB Write Path



Function call sequence

DB::Put

The user calls the Put function:

// include/rocksdb/db.h

virtual Status Put(const WriteOptions& options, const Slice& key, const Slice& value);

This overload forwards to the one that takes a column family:

// include/rocksdb/db.h

virtual Status Put(const WriteOptions& options, ColumnFamilyHandle* column_family, const Slice& key, const Slice& value) = 0;

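This is a pure virtual function that nevertheless has a default implementation. (C++ has always allowed a pure virtual function to have a definition, provided it is written outside the class body; this is not something C++11 introduced.) Because DBImpl::Put overrides it, the call dispatches to DBImpl::Put.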
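The language rule is easy to check in isolation. Below is a minimal sketch, assuming two hypothetical classes `Store` and `StoreImpl` that stand in for DB and DBImpl (none of this is RocksDB code): the pure virtual function gets an out-of-class body, and the override reaches that body through a qualified call.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for DB and DBImpl; not RocksDB code.
class Store {
 public:
  virtual ~Store() = default;
  // Pure virtual: every concrete subclass must override it...
  virtual std::string Put(const std::string& key) = 0;
};

// ...yet it may still have a body, defined outside the class.
std::string Store::Put(const std::string& key) {
  return "default:" + key;
}

class StoreImpl : public Store {
 public:
  std::string Put(const std::string& key) override {
    if (key.empty()) {
      return "rejected";  // subclass-specific check comes first
    }
    // A qualified call bypasses virtual dispatch and runs the base body.
    return Store::Put(key);
  }
};

// Calls through a base reference, so virtual dispatch selects StoreImpl::Put.
std::string PutThroughBase(Store& s, const std::string& key) {
  return s.Put(key);
}
```

Calling `PutThroughBase` on a `StoreImpl` runs the override first and the base body second, which is the same shape as the Put call chain described here.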

DBImpl::Put

// db/db_impl/db_impl.h (declaration)

Status Put(const WriteOptions& options, ColumnFamilyHandle* column_family,
           const Slice& key, const Slice& value) override;

// db/db_impl/db_impl_write.cc (definition)

// Convenience methods
Status DBImpl::Put(const WriteOptions& o, ColumnFamilyHandle* column_family,
                   const Slice& key, const Slice& val) {
  const Status s = FailIfCfHasTs(column_family);  // timestamp-related check on the column family
  if (!s.ok()) {
    return s;
  }
  return DB::Put(o, column_family, key, val);  // call the default implementation of the pure virtual DB::Put
}

As shown, the override ends by calling the default implementation of the pure virtual function.

DB::Put

// db/db_impl/db_impl_write.cc

// Default implementations of convenience methods that subclasses of DB
// can call if they wish
Status DB::Put(const WriteOptions& opt, ColumnFamilyHandle* column_family,
               const Slice& key, const Slice& value) {
    // Pre-allocate size of write batch conservatively.
    // 8 bytes are taken by header, 4 bytes for count, 1 byte for type,
    // and we allocate 11 extra bytes for key length, as well as value length.
    WriteBatch batch(key.size() + value.size() + 24, 0 /* max_bytes */,
                     opt.protection_bytes_per_key, 0 /* default_cf_ts_sz */);
    Status s = batch.Put(column_family, key, value);
    if (!s.ok()) {
        return s;
    }
    return Write(opt, &batch);
}
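The 24 in the reservation is the sum the comment lists: 8 + 4 + 1 + 11. A varint32 length prefix takes at most 5 bytes, so the two prefixes need at most 10 of the 11 extra bytes. The sketch below checks that arithmetic with a hypothetical encoder, `EncodeSinglePut`, which follows the layout the comment describes and assumes the default column family and a type tag of 0x01. It illustrates the size estimate and is not RocksDB's own encoder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends v as a little-endian base-128 varint (at most 5 bytes for 32 bits).
void PutVarint32(std::string* dst, uint32_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Hypothetical helper: encodes one Put using the layout from the comment.
std::string EncodeSinglePut(const std::string& key, const std::string& value) {
  std::string rep;
  rep.reserve(key.size() + value.size() + 24);  // same estimate as DB::Put
  rep.append(8, '\0');    // 8-byte header (sequence number placeholder)
  rep.push_back('\x01');  // 4-byte little-endian count = 1 ...
  rep.append(3, '\0');    // ... remaining three count bytes
  rep.push_back('\x01');  // 1-byte record type (assumed "value" tag)
  PutVarint32(&rep, static_cast<uint32_t>(key.size()));
  rep.append(key);
  PutVarint32(&rep, static_cast<uint32_t>(value.size()));
  rep.append(value);
  return rep;
}
```

For a one-byte key and value the encoding takes 17 bytes against a reservation of 26, so the estimate is conservative in the common case.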

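This function first creates a WriteBatch and then puts the key and value into it. That may look odd: there is only one key, so why create an object whose name means "batch"? The reason is that RocksDB lets users apply a group of updates atomically (RocksDB 第三课 读取写入和并发 - 墨天轮), for example: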

WriteBatch batch;
batch.Delete("key1");
batch.Put("key2", value);
s = db->Write(WriteOptions(), &batch);

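RocksDB therefore uses one internal interface: every write takes the form of a WriteBatch. The call chain then reaches DB::Write, another pure virtual function, which DBImpl::Write implements. That implementation ends by calling DBImpl::WriteImpl, a very complex function.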
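The idea of a single entry point can be sketched with a toy in-memory store. `ToyDB` and `ToyBatch` are hypothetical names that only mirror the shape of the real API; the sketch has no WAL, no concurrency, and no real atomicity guarantee.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy types; the names echo RocksDB but the code does not.
struct ToyBatch {
  struct Op {
    bool is_delete;
    std::string key;
    std::string value;
  };
  std::vector<Op> ops;
  void Put(const std::string& k, const std::string& v) {
    ops.push_back({false, k, v});
  }
  void Delete(const std::string& k) { ops.push_back({true, k, ""}); }
};

class ToyDB {
 public:
  // The only function that mutates state: applies every op in the batch.
  void Write(const ToyBatch& batch) {
    for (const auto& op : batch.ops) {
      if (op.is_delete) {
        table_.erase(op.key);
      } else {
        table_[op.key] = op.value;
      }
    }
    ++writes_;
  }
  // Single-key convenience call: wrap the pair in a one-entry batch.
  void Put(const std::string& k, const std::string& v) {
    ToyBatch batch;
    batch.Put(k, v);
    Write(batch);
  }
  bool Get(const std::string& k, std::string* v) const {
    auto it = table_.find(k);
    if (it == table_.end()) return false;
    *v = it->second;
    return true;
  }
  int writes() const { return writes_; }

 private:
  std::map<std::string, std::string> table_;
  int writes_ = 0;
};
```

Because `Put` is only a wrapper, any logic added to `Write` covers single-key and multi-key updates alike, which is the benefit of routing everything through one function.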

DBImpl::WriteImpl

// db/db_impl/db_impl_write.cc

// The main write queue. This is the only write queue that updates LastSequence.
// When using one write queue, the same sequence also indicates the last
// published sequence.
Status DBImpl::WriteImpl(const WriteOptions& write_options,
                         WriteBatch* my_batch, WriteCallback* callback,
                         uint64_t* log_used, uint64_t log_ref,
                         bool disable_memtable, uint64_t* seq_used,
                         size_t batch_cnt,
                         PreReleaseCallback* pre_release_callback,
                         PostMemTableCallback* post_memtable_callback) {
    // Take different branches depending on the configuration
    // ...

    // Create the Writer
    WriteThread::Writer w(write_options, my_batch, callback, log_ref,
                          disable_memtable, batch_cnt, pre_release_callback,
                          post_memtable_callback);
    // Join the write batch group: append this writer to the tail of the writer list and obtain its role
    write_thread_.JoinBatchGroup(&w);

    // Each role does different work

    // If this writer is a follower and concurrent memtable writes are allowed
    if (w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER) {
        // Write to the memtable concurrently
        if (w.ShouldWriteToMemtable()) {
            // ...
        }
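        // Report to the group leader that this writer has finished its memtable write, and wait for the other writers
        // If this thread is the last in the group to finish, the function returns true and the thread does the cleanup
        if (write_thread_.CompleteParallelMemTableWriter(&w)) {
            // Cleanup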
            // ...
            write_thread_.ExitAsBatchGroupFollower(&w);  
        }

        // Reaching this point means the writer's state is STATE_COMPLETED
        assert(w.state == WriteThread::STATE_COMPLETED);
        // STATE_COMPLETED conditional below handles exit
    }

    // Either this follower finished its memtable write in the if block above,
    // or the leader wrote to the memtable on its behalf;
    // in both cases the function can return
    if (w.state == WriteThread::STATE_COMPLETED) {
        if (log_used != nullptr) {
            *log_used = w.log_used;
        }
        if (seq_used != nullptr) {
            *seq_used = w.sequence;
        }
        // write is complete and leader has updated sequence
        return w.FinalStatus();
    }

    // If this writer is the leader, it has the following work to do
    // else we are the leader of the write batch group
    assert(w.state == WriteThread::STATE_GROUP_LEADER);
    Status status;
    // Once reaches this point, the current writer "w" will try to do its write
    // job.  It may also pick up some of the remaining writers in the "writers_"
    // when it finds suitable, and finish them in the same write batch.
    // This is how a write job could be done by the other writer.
    WriteContext write_context;
    LogContext log_context(write_options.sync);
    WriteThread::WriteGroup write_group;
    // ... 
    assert(!two_write_queues_ || !disable_memtable);
    {
        // ...
        // The leader checks, among other things, whether the WAL is full
        status = PreprocessWrite(write_options, &log_context, &write_context);
        // ...
    }

    // Pick a number of writers from the writer list in write_thread_ to build the write batch group
    last_batch_group_size_ = write_thread_.EnterAsBatchGroupLeader(&w, &write_group);
    // ...


    if (status.ok()) {
        // Inspect every writer in the group, total up the write size, and decide whether concurrent writes are possible
        bool parallel = immutable_db_options_.allow_concurrent_memtable_write &&
            write_group.size > 1;
        size_t total_count = 0;
        size_t valid_batches = 0;
        size_t total_byte_size = 0;
        size_t pre_release_callback_cnt = 0;
        // ...
        // Collect some statistics
        // ...
        if (!two_write_queues_) {
            if (status.ok() && !write_options.disableWAL) {
                // ...
                // The leader writes the WAL
                io_s =
                    WriteToWAL(write_group, log_context.writer, log_used,
                               log_context.need_log_sync, log_context.need_log_dir_sync,
                               last_sequence + 1, log_file_number_size);
            }
        } else {
            if (status.ok() && !write_options.disableWAL) {
                // ...
                io_s = ConcurrentWriteToWAL(write_group, log_used, &last_sequence,
                                            seq_inc);
            } else {
                // Otherwise we inc seq number for memtable writes
                last_sequence = versions_->FetchAddLastAllocatedSequence(seq_inc);
            }
        }

        // Not yet analyzed
        // PreReleaseCallback is called after WAL write and before memtable write
        if (status.ok()) {
            // ...
        }

        if (status.ok()) {
            PERF_TIMER_GUARD(write_memtable_time);
            // Whether concurrent memtable writes are possible was decided above
            if (!parallel) {
                // If not, the leader writes the memtable on behalf of the followers in the group
                // w.sequence will be set inside InsertInto
                w.status = WriteBatchInternal::InsertInto(
                    write_group, current_sequence, column_family_memtables_.get(),
                    &flush_scheduler_, &trim_history_scheduler_,
                    write_options.ignore_missing_column_families,
                    0 /*recovery_log_number*/, this, parallel, seq_per_batch_,
                    batch_per_txn_);
            } else {
                write_group.last_sequence = last_sequence;
                // If concurrent writes are possible, the leader wakes the followers, which then take the STATE_PARALLEL_MEMTABLE_WRITER branch shown above
                write_thread_.LaunchParallelMemTableWriters(&write_group);
                in_parallel_group = true;

                // Each parallel follower is doing each own writes. The leader should
                // also do its own.
                if (w.ShouldWriteToMemtable()) {
                    // ...
                }
            }
            if (seq_used != nullptr) {
                *seq_used = w.sequence;
            }
        }
    }
    // ...
    // This appears to sync the WAL to disk
    if (log_context.need_log_sync) {
        // ...
    }

    bool should_exit_batch_group = true;
    // This if block runs when concurrent memtable writes are in use
    if (in_parallel_group) {
        // CompleteParallelWorker returns true if this thread should
        // handle exit, false means somebody else did
        // Wait for the other threads to finish writing and determine whether the leader finished last; followers call this function too
        should_exit_batch_group = write_thread_.CompleteParallelMemTableWriter(&w);
    }

    // If the leader is the last in the group to finish, the leader does the cleanup
    if (should_exit_batch_group) {
        // This code closely resembles the follower cleanup code above
        // ...
        write_thread_.ExitAsBatchGroupLeader(write_group, status);
    }

    if (status.ok()) {
        status = w.FinalStatus();
    }
    return status;
}

Depending on the options (both WriteOptions and ImmutableDBOptions) and on its arguments, this function takes different branches. We analyze only the default case:

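  • Create a WriteThread::Writer

  • Call WriteThread::JoinBatchGroup to join the write batch group and obtain a role, either follower or leader. This step may block. The function actually appends the current Writer to the tail of the request list: if this Writer is the first one, it becomes leader immediately; otherwise it blocks until it becomes leader or the leader assigns it a role (this function is analyzed in detail later)

  • Each role then performs different tasks:

    • If it is the leader

      • Call DBImpl::PreprocessWrite to check, among other things, whether the WAL is full; this function is involved in triggering flushes
      • Call WriteThread::EnterAsBatchGroupLeader to create the write batch group, that is, take a number of Writers off the request list and form a group
      • Write the WAL
      • Write the memtable on behalf of the followers in the group, or call WriteThread::LaunchParallelMemTableWriters to wake the followers so that they write the memtable concurrently
      • If the leader is the last in its group to finish writing the memtable, it does the cleanup and calls WriteThread::ExitAsBatchGroupLeader
    • If it is a follower

      • It may be woken by WriteThread::LaunchParallelMemTableWriters to write the memtable; after writing, it calls WriteThread::CompleteParallelMemTableWriter to wait for the other writers. If it is the last to finish, it does the cleanup and calls WriteThread::ExitAsBatchGroupFollower
      • Once it has written the memtable itself, or the leader has written on its behalf, the flow ends and the function returns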
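The "last writer to finish does the cleanup" rule in the steps above can be sketched with a shared countdown. This is a minimal sketch that assumes the group keeps an atomic counter of unfinished writers. `ToyGroup` and `FinishMemtableWrite` are hypothetical names, and the sketch leaves out the waiting that the real CompleteParallelMemTableWriter performs for writers that are not last.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the shared state of one write group.
struct ToyGroup {
  explicit ToyGroup(size_t n) : running(n) {}
  std::atomic<size_t> running;  // writers that have not finished yet
};

// Returns true only for the writer that finishes last; that writer is the
// one responsible for the cleanup work.
bool FinishMemtableWrite(ToyGroup* group) {
  // fetch_sub returns the value held before the decrement, so exactly one
  // caller observes 1, no matter how the calls interleave.
  return group->running.fetch_sub(1) == 1;
}
```

Since the decrement is atomic, leader and followers can call the same function in any order and exactly one of them is chosen to exit the group.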

TODO: add an analysis of data structures such as Writer and WriteThread, and of how a group is created
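Until that analysis is written, the role assignment and group formation described in this section can be approximated with a single-threaded simulation. Everything here is a hypothetical simplification: `ToyWriteThread`, `ToyWriter`, the byte budget, and the use of plain vectors for the WAL and memtable are assumptions made for illustration, and waiting writers simply stay in a `kWaiting` state instead of blocking.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

enum class ToyState { kWaiting, kLeader, kCompleted };

struct ToyWriter {
  std::string payload;
  ToyState state = ToyState::kWaiting;
  unsigned long long sequence = 0;
};

class ToyWriteThread {
 public:
  // Appends w to the pending list. The writer that finds the list empty
  // becomes leader; the others stay in kWaiting (no blocking in this sketch).
  void Join(ToyWriter* w) {
    if (pending_.empty()) w->state = ToyState::kLeader;
    pending_.push_back(w);
  }

  // The leader takes itself plus the writers behind it until the byte
  // budget (a hypothetical limit) would be exceeded.
  std::vector<ToyWriter*> EnterAsLeader(size_t max_bytes) {
    std::vector<ToyWriter*> group;
    size_t bytes = 0;
    while (!pending_.empty()) {
      ToyWriter* w = pending_.front();
      if (!group.empty() && bytes + w->payload.size() > max_bytes) break;
      bytes += w->payload.size();
      group.push_back(w);
      pending_.pop_front();
    }
    return group;
  }

  // The leader writes one combined log record, then applies each writer's
  // payload on its behalf (the non-parallel path) and marks it completed.
  void CommitGroup(const std::vector<ToyWriter*>& group) {
    std::string record;
    for (ToyWriter* w : group) record += w->payload;
    wal_.push_back(record);
    for (ToyWriter* w : group) {
      w->sequence = ++last_sequence_;
      memtable_.push_back(w->payload);
      w->state = ToyState::kCompleted;
    }
    // The next pending writer, if any, is promoted to leader.
    if (!pending_.empty()) pending_.front()->state = ToyState::kLeader;
  }

  const std::vector<std::string>& wal() const { return wal_; }
  const std::vector<std::string>& memtable() const { return memtable_; }

 private:
  std::deque<ToyWriter*> pending_;
  std::vector<std::string> wal_;
  std::vector<std::string> memtable_;
  unsigned long long last_sequence_ = 0;
};
```

With three writers queued and a budget that fits only the first two, one log record covers both of them, each still receives its own sequence number, and the third writer becomes the next leader.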

Summary

To be continued