05 | LevelDB Write Operations


From the previous two articles we already know what LevelDB's architecture looks like and how an SSTable file is laid out, so how data gets stored should now be fairly clear. In this article we analyze LevelDB's write operations.

I. Database writes come in three flavors: (1) ordinary writes; (2) atomic writes; (3) synchronous writes.

1) Ordinary writes

LevelDB provides the Put() method for writing to the database. The following code shows writing a (key, value) pair.

std::string value = "some value";
leveldb::Status s = db->Put(leveldb::WriteOptions(), key, value);
if (!s.ok()) {
  // Put failed; s describes the error
}

2) Atomic writes

Sometimes we need to perform several consecutive operations against the database. The code below moves the value stored under key1 to key2, calling Get, Put, and Delete in sequence to query and modify the database.

std::string value;
 leveldb::Status s = db->Get(leveldb::ReadOptions(), key1, &value);
 if (s.ok()) s = db->Put(leveldb::WriteOptions(), key2, value);
 if (s.ok()) s = db->Delete(leveldb::WriteOptions(), key1);

 

If the process crashes after putting key2 but before deleting key1, the same value ends up stored under multiple keys. How do we avoid this? LevelDB provides WriteBatch, which applies a group of operations atomically.

#include "leveldb/write_batch.h"
 ...
 std::string value;
 leveldb::Status s = db->Get(leveldb::ReadOptions(), key1, &value);
 if (s.ok()) {
   leveldb::WriteBatch batch;
   batch.Delete(key1);
   batch.Put(key2, value);
   s = db->Write(leveldb::WriteOptions(), &batch);
 }

A WriteBatch holds a sequence of operations to be applied to the database, and they are applied in the order in which they were added. In the example above we put Delete before Put, so that if key1 happens to equal key2 we do not accidentally drop the value entirely.

Besides atomicity, WriteBatch can also speed up bulk updates, because a large number of independent operations can be added to a single batch and executed in one go.
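
For example, a minimal sketch of bulk insertion, assuming db is an open leveldb::DB*; the keys and values here are purely illustrative:

#include <string>
#include "leveldb/db.h"
#include "leveldb/write_batch.h"

// Group many independent updates into one batch and apply them with a
// single Write() call. `db` is assumed to be an open leveldb::DB*.
void BulkInsert(leveldb::DB* db) {
  leveldb::WriteBatch batch;
  for (int i = 0; i < 1000; i++) {
    batch.Put("key" + std::to_string(i), "value" + std::to_string(i));
  }
  leveldb::Status s = db->Write(leveldb::WriteOptions(), &batch);
  // s.ok() tells us whether the whole group was applied.
}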

The WriteBatch class (abridged) is declared as follows:

class LEVELDB_EXPORT WriteBatch {
 public:
  class LEVELDB_EXPORT Handler {
   public:
    virtual ~Handler();
    virtual void Put(const Slice& key, const Slice& value) = 0;
    virtual void Delete(const Slice& key) = 0;
  };
 
 private:
  friend class WriteBatchInternal;
  std::string rep_;  // See comment in write_batch.cc for the format of rep_
};

Its member variable rep_ records the user data (key, value), and that data is stored in a specific format. What does the format look like?

[WriteBatch header[SequenceNumber64|Count32] | Data[type|keysize|key|valuesize|value]]

The 12-byte header = 8-byte SequenceNumber + 4-byte count of key-value records:

// WriteBatch header has an 8-byte sequence number followed by a 4-byte count.
static const size_t kHeader = 12;
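
To make the header layout concrete, here is a minimal sketch of decoding those 12 bytes out of rep_, assuming the little-endian fixed-width encoding LevelDB uses (DecodeFixed64/DecodeFixed32); the helper names below are our own:

#include <cstdint>
#include <cstring>
#include <string>

// Decode the 12-byte WriteBatch header: 8-byte sequence + 4-byte count.
struct BatchHeader {
  uint64_t sequence;  // first 8 bytes of rep_
  uint32_t count;     // next 4 bytes: number of Put/Delete records
};

BatchHeader DecodeBatchHeader(const std::string& rep) {
  BatchHeader h{};
  std::memcpy(&h.sequence, rep.data(), 8);   // assumes a little-endian host
  std::memcpy(&h.count, rep.data() + 8, 4);
  return h;
}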

 

3) Synchronous writes

By default every LevelDB write is asynchronous: the call returns as soon as the content to be written has been pushed into the buffer, and persisting it from the buffer to disk happens asynchronously.

A write can be made synchronous by setting write_options.sync = true, so that it does not return until the data has actually been persisted to disk (on Linux this is implemented by calling fsync(...) or fdatasync(...) before the write returns).

leveldb::WriteOptions write_options;
 write_options.sync = true;
 db->Put(write_options, ...);

 

Asynchronous writes are often more than a thousand times faster than synchronous ones. Their downside is that if the machine crashes, the most recent writes whose data is still in memory and not yet flushed to disk may be lost. If only the writing process crashes (the machine itself does not reboot), no data is lost, because the in-memory data is flushed toward disk before the process exits.

WriteBatch provides an alternative to purely asynchronous writes: multiple updates can be placed in the same WriteBatch and flushed together with one synchronous write (write_options.sync = true), so the cost of the sync is amortized across all of them.

II. The Write Path

LevelDB's write path is relatively simple: a write is complete once it has been appended to the log file and inserted into the in-memory memtable; persisting the memtable to disk is handled later by the compaction process.

1. Interfaces

As mentioned above, the write interfaces LevelDB exposes are:

Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value)
Status DBImpl::Write(const WriteOptions& options, WriteBatch* updates)
Status DBImpl::Delete(const WriteOptions& options, const Slice& key) {
  return DB::Delete(options, key);
}

leveldb::WriteOptions() is where synchronous vs. asynchronous is chosen. Put also goes through a WriteBatch, just one containing a single operation, and ultimately calls the Write interface; Delete likewise ends up calling Write:

Status DB::Delete(const WriteOptions& opt, const Slice& key) {
  WriteBatch batch;
  batch.Delete(key);
  return Write(opt, &batch);
}

Let's analyze the implementation flow of these write interfaces:

Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value) {
  |-WriteBatch batch;
  |-batch.Put(key, value);
         |-WriteBatchInternal::SetCount(this, WriteBatchInternal::Count(this) + 1); //increase the operation count
         |-rep_.push_back(static_cast<char>(kTypeValue)); //rep_ is the buffer that formats the data to be written; the first byte is kTypeValue
         |-PutLengthPrefixedSlice(&rep_, key); //write key into buffer rep_
           |-PutVarint32(dst, value.size()); //first the size of the slice, then the slice itself
           |-dst->append(value.data(), value.size());
         |-PutLengthPrefixedSlice(&rep_, value); //write value into buffer rep_, again size then data
         //rep_ now holds: kTypeValue|keySize|key|valueSize|value
  |-return Write(opt, &batch);
}

From the code above we can see rep_.push_back(static_cast<char>(kTypeValue)); where kTypeValue indicates that this operation adds a user (key, value) rather than deleting a key. Next the size of the user key and the key itself are written into rep_, with the key size encoded as a variable-length Varint32, and finally the size of the user value and the value itself are appended. So in the end rep_ is: kTypeValue|keySize|key|valueSize|value
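
To make the Varint32 length prefix concrete, here is a minimal sketch of the encoding PutVarint32 performs (the function name below is our own): each byte carries 7 bits of the value, and the high bit marks "more bytes follow".

#include <cstdint>
#include <string>

// Varint32 encoding used for the length prefixes: 7 payload bits per byte,
// high bit (0x80) set on every byte except the last. A length under 128
// therefore costs a single byte.
void AppendVarint32(std::string* dst, uint32_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7F) | 0x80));  // low 7 bits + continuation flag
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));                    // final byte, high bit clear
}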

The value types:

enum ValueType { kTypeDeletion = 0x0, kTypeValue = 0x1 };

Because user data (key, value) is always appended, writing a key and later deleting that same key does not make LevelDB locate the previously written key and remove it; instead there are simply two append operations. The same key therefore has redundant entries, the older of which is stale. How do we tell whether a key has been deleted? LevelDB defines two value types: an add is kTypeValue and a delete is kTypeDeletion, and the type distinguishes them.
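
A minimal sketch of this append-only behavior against the public API, assuming db is an open leveldb::DB*; the key and value are illustrative:

// Both operations below are appends internally; the later kTypeDeletion
// record simply shadows the earlier kTypeValue record.
leveldb::Status s = db->Put(leveldb::WriteOptions(), "color", "blue");
if (s.ok()) s = db->Delete(leveldb::WriteOptions(), "color");

std::string value;
s = db->Get(leveldb::ReadOptions(), "color", &value);
// s.IsNotFound() is now true: the key reads as deleted even though the old
// record may still physically exist until compaction discards it.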

WriteBatch::Delete is implemented as follows:

void WriteBatch::Delete(const Slice& key) {
  WriteBatchInternal::SetCount(this, WriteBatchInternal::Count(this) + 1);
  rep_.push_back(static_cast<char>(kTypeDeletion));
  PutLengthPrefixedSlice(&rep_, key);
}

 

2. The Write interface in detail

The task queue writers_

Status DBImpl::Write(const WriteOptions& options, WriteBatch* updates) {
  |-Writer w(&mutex_); //use mutex_ to initialize the condition variable inside the Writer object
  |-w.batch = updates; //this WriteBatch holds the key-value contents; w represents one write operation
  |-w.sync = options.sync;
  |-w.done = false;
 
    |-MutexLock l(&mutex_); //initialize a MutexLock object via mutex_ and call mutex_.lock()
  |-writers_.push_back(&w); //push this Writer onto the tail of the deque writers_ (double-ended queue)
  |-while (!w.done && &w != writers_.front()) { //this writer is not done and not at the front of the deque, so it has not been scheduled yet
    |-w.cv.Wait(); //wait on the condition variable. Why? At the end of this function, a finished batch signals the new head of the queue; once this Writer reaches the front it gets to perform its write.
  |-}
 
  |-if (w.done) { //woken up and found this writer already done: its WriteBatch was folded into another writer's group and has been applied, so just return
    |-return w.status;
  |-}


writers_ is a member variable of class DBImpl : public DB; it is a double-ended queue. A write is not executed immediately: a Writer object is created for it and pushed onto the deque writers_ to wait for scheduling.

std::deque<Writer*> writers_ GUARDED_BY(mutex_);  // Queue of writers.

A Writer object looks like this: it records the WriteBatch for the operation, whether the write is synchronous, whether it is done, its status, and a condition variable port::CondVar used for signaling.

struct DBImpl::Writer {
  explicit Writer(port::Mutex* mu)
      : batch(nullptr), sync(false), done(false), cv(mu) {}
  Status status;
  WriteBatch* batch;
  bool sync;
  bool done;
  port::CondVar cv;
};

writers_ is a task queue following the producer-consumer model: producer threads keep adding pending Writer tasks to the queue, and LevelDB's "consumer" is simply one of those producer threads, chosen to process the queued tasks.

Once scheduled, the thread performs the actual write, which we analyze in the next section; here let's finish the writers_ queue mechanics. When the write completes, the finished tasks are popped from the queue, their done flag is set, and their CondVar is signaled.

Code:

  |-while (true) {//at this point, the deque still holds all of the writers above
    |-Writer* ready = writers_.front();//take the first element from the deque and pop it
    |-writers_.pop_front();
    |-if (ready != &w) {//this writer was folded into the group and has been written into the memtable: mark it done and signal its waiter
      |-ready->status = status;
      |-ready->done = true;
      |-ready->cv.Signal();
    |-}
    |-if (ready == last_writer) break;//every writer merged into this batch group has been marked
  |-}
  // Notify new head of write queue.
  |-if (!writers_.empty()) {//this batch group is finished; if the queue is not empty, new writers have queued up, so notify the new head that it may handle its request
    |-writers_.front()->cv.Signal();
  |-}
  • LevelDB supports multiple threads, so the mutex (MutexLock) protects writers_.
  • After adding its task to the writers_ queue, each producer enters a while loop and goes to sleep. The thread is only woken when its task reaches the head of the queue, or when its task has already been processed (writer.done == true). After waking, it re-checks the loop condition and goes back to sleep if it still cannot be scheduled.
  • If the task was processed by another thread, this thread simply returns.
  • If the task sits at the head of writers_ and has not been processed yet, this thread performs the write itself; a standalone sketch of this queue pattern follows below.
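
The following is a standalone sketch of the same producer-consumer pattern (not LevelDB's actual code; the names are our own), showing how one "leader" thread services a whole group of queued writers:

#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <vector>

// Every caller enqueues a Task, sleeps until it is either done or at the
// front, and the front thread (the leader) performs the work for the group.
struct Task {
  std::string payload;
  bool done = false;
  std::condition_variable cv;
};

class WriteQueue {
 public:
  void Write(Task* t) {
    std::unique_lock<std::mutex> lock(mu_);
    queue_.push_back(t);
    while (!t->done && t != queue_.front()) {
      t->cv.wait(lock);                  // sleep until done or at the front
    }
    if (t->done) return;                 // another thread already handled it

    // Leader: snapshot every task queued so far as one group, then release
    // the mutex while doing the real work (as LevelDB does around the log
    // and memtable writes), which lets new writers queue up behind us.
    std::vector<Task*> group(queue_.begin(), queue_.end());
    Task* last = group.back();
    lock.unlock();
    for (Task* task : group) {
      applied_ += task->payload;         // stand-in for "append to log + memtable"
    }
    lock.lock();

    // Pop the group, mark everyone but ourselves done, and wake them.
    while (true) {
      Task* ready = queue_.front();
      queue_.pop_front();
      if (ready != t) {
        ready->done = true;
        ready->cv.notify_one();
      }
      if (ready == last) break;
    }
    // If new writers arrived while we were working, wake the new head.
    if (!queue_.empty()) queue_.front()->cv.notify_one();
  }

 private:
  std::mutex mu_;
  std::deque<Task*> queue_;  // guarded by mu_
  std::string applied_;      // only ever touched by the current leader
};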

 

3. The actual data-writing logic

The overall flow is:

1. Pre-write checks: Status status = MakeRoomForWrite(updates == nullptr);

2. Batch grouping: WriteBatch* write_batch = BuildBatchGroup(&last_writer);

3. First append to the log file, for crash recovery: log_->AddRecord(WriteBatchInternal::Contents(updates));

4. Then insert into the in-memory memtable: WriteBatchInternal::InsertInto(updates, mem_);


Let's walk through the write path in detail alongside the code.

Step 1. First, Status status = MakeRoomForWrite(updates == nullptr); performs the pre-write checks:

1) Check whether the background thread (compaction runs in the background) has reported an error via bg_error_.ok(). If so, return the error immediately and abort the write.

2) Check whether the number of level-0 files has reached the soft limit of 8 (kL0_SlowdownWritesTrigger = 8). If so, the write rate must be throttled, so the thread sleeps for 1 ms before continuing; because keys can overlap between level-0 files, their number must be strictly limited for the sake of read performance.

3) Check whether the memtable's memory usage is within the configured write buffer size (default 4 MB): mem_->ApproximateMemoryUsage() <= options_.write_buffer_size. If it is, the write can go into the current memtable.

4) If the memtable is full but the previous memtable has already been turned into a read-only immutable memtable (imm_ != nullptr) and is still being compacted to level-0, wait for the background compaction to finish.

5) Reaching this point means the immutable memtable has already been compacted to level-0. If the number of level-0 files has hit the hard limit kL0_StopWritesTrigger = 12, writes must stop and wait for the level-0 file count to come down.

6) Reaching this point means: the memtable has no room left, the immutable memtable has been compacted to level-0, and the level-0 file count is acceptable. So the current memtable is converted into a read-only immutable memtable, background compaction is scheduled, and a fresh memtable and log file are created; the write then goes into the new memtable.

Code:

Status DBImpl::MakeRoomForWrite(bool force) {
  |-mutex_.AssertHeld(); //make sure the lock is held
  |-assert(!writers_.empty()); //the writers deque is not empty: there is at least one pending writer
  |-bool allow_delay = !force; // not force means delay is allowed
  |-Status s;
  |-while (true) {
    |-if (!bg_error_.ok()) {
      // Yield previous error
      s = bg_error_;
      break;
    |-} else if (allow_delay && versions_->NumLevelFiles(0) >= config::kL0_SlowdownWritesTrigger) { // getting close to the hard limit on the number of level-0 files, so slow down each write by sleeping 1 ms
      mutex_.Unlock();
      env_->SleepForMicroseconds(1000);
      allow_delay = false;  // Do not delay a single write more than once
      mutex_.Lock();
    |-} else if (!force && (mem_->ApproximateMemoryUsage() <= options_.write_buffer_size)) { //the memtable still has enough space to hold the written data
      // There is room in current memtable
      break;
    |-} else if (imm_ != nullptr) { //the memtable is out of space and an immutable memtable still exists, so just wait for compaction to finish
      // We have filled up the current memtable, but the previous one is still being compacted, so we wait.
      Log(options_.info_log, "Current memtable full; waiting...\n");
      background_work_finished_signal_.Wait(); // wait for compaction finished
    |-} else if (versions_->NumLevelFiles(0) >= config::kL0_StopWritesTrigger){//level-0 has reached its maximum number of files
      // There are too many level-0 files.
      Log(options_.info_log, "Too many L0 files; waiting...\n");
      background_work_finished_signal_.Wait();//just wait for compaction finished.
    |-} else {//the memtable is out of space and the immutable memtable is nullptr (compaction has finished), so allocate a new memtable, turn the previous one into an immutable memtable, and write into the new memtable
      // Attempt to switch to a new memtable and trigger compaction of old
      |-assert(versions_->PrevLogNumber() == 0);
      |-uint64_t new_log_number = versions_->NewFileNumber(); //increase a new file number
      |-WritableFile* lfile = nullptr;
      |-s = env_->NewWritableFile(LogFileName(dbname_, new_log_number), &lfile);//create a log file under dbname (e.g. /tmp/dbname/000002.log), open it, and wrap the fd in a WritableFile, ready for writing
                 |-int fd = ::open(filename.c_str(), O_TRUNC | O_WRONLY | O_CREAT | kOpenBaseFlags, 0644);
                 |-*result = new PosixWritableFile(filename, fd);
                                 |-PosixWritableFile(std::string filename, int fd) : pos_(0), fd_(fd),
                                                                                     is_manifest_(IsManifest(filename)),
                                                                                     filename_(std::move(filename)),
                                                                                     dirname_(Dirname(filename_)) {}
      |-delete log_; //delete old written log operation
      |-delete logfile_; //delete old log file
      |-logfile_ = lfile; //save this new log file
      |-logfile_number_ = new_log_number; //save new log number
      |-log_ = new log::Writer(lfile); //instance a new Writer for this written operation
                      |-Writer::Writer(WritableFile* dest) : dest_(dest), block_offset_(0) { InitTypeCrc(type_crc_); }
      |-imm_ = mem_; //change memtable to immutable 
      |-has_imm_.store(true, std::memory_order_release);
      |-mem_ = new MemTable(internal_comparator_); //instantiate a new MemTable object. table_ is a skiplist
                  |-: comparator_(comparator), refs_(0), table_(comparator_, &arena_) {}
      |-mem_->Ref(); //increase reference count
      |-force = false;  // Do not force another compaction if have room
      |-MaybeScheduleCompaction();//schedule compaction
    |-}
  }
  return s;
}

After MakeRoomForWrite(updates == nullptr) returns, we have a memtable with room to accept the write.

Step 2. Next, build a larger batch group: WriteBatch* write_batch = BuildBatchGroup(&last_writer);

1) Iterate over the Writer objects in the writers_ queue: std::deque<Writer*>::iterator iter = writers_.begin();

The WriteBatch of each Writer is merged into one large WriteBatch before the subsequent write, which improves write throughput.

        |-WriteBatchInternal::Append(result, first->batch);//result == DBImpl->tmp_batch
                             |-SetCount(dst, Count(dst) + Count(src));//add the new writer's record count
                             |-assert(src->rep_.size() >= kHeader);
                             |-dst->rep_.append(src->rep_.data() + kHeader, src->rep_.size() - kHeader);//skip each writer's kHeader and append only the record data of its rep_

2) There are constraints on what gets added to the batch group, two in particular:

2.1) The sync flag (synchronous vs. asynchronous): a writer that requires sync is not folded into a batch group that is being handled as a non-sync write; it is left in the queue to be written on its own. Writers that do not require sync are all merged into the DBImpl->tmp_batch buffer.

2.2) To keep a single combined write from getting too large, a size limit is applied (the limit values are sketched below):

    |-if (w->batch != nullptr) {//this writer's WriteBatch is not nullptr, so add its rep_ size
      |-size += WriteBatchInternal::ByteSize(w->batch);
      |-if (size > max_size) { //exceed the max_size, so break.
        // Do not make batch too big
        |-break;
      |-}
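
For reference, the limit in the upstream DBImpl::BuildBatchGroup is computed roughly as follows; treat this as a sketch based on the leveldb source, where first is the writer at the head of the queue:

// Allow the group to grow up to about 1 MB, but if the leading write is
// small, cap the group at "its size + 128 KB" so that a small write is not
// delayed by a huge group being built around it.
size_t size = WriteBatchInternal::ByteSize(first->batch);
size_t max_size = 1 << 20;            // ~1 MB hard cap on the group
if (size <= (128 << 10)) {            // leading write is small (<= 128 KB)
  max_size = size + (128 << 10);
}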

After this step, the returned WriteBatch* write_batch is either a single WriteBatch that must be synced immediately, or a merged batch of several writers' updates.

 

Step 3. Encode the header of this write (the WriteBatch header):

// WriteBatch header has an 8-byte sequence number followed by a 4-byte count.
static const size_t kHeader = 12;

The 12-byte kHeader = an 8-byte SequenceNumber + a 4-byte count of key-value records; this header describes one batched write.

 

    |-WriteBatchInternal::SetSequence(write_batch, last_sequence + 1);//there is a reserved 8-byte slot for the sequence number
                         |-EncodeFixed64(&b->rep_[0], seq);//take the previous sequence number + 1 and encode the 64-bit/8-byte sequence number at the start of rep_
    |-last_sequence += WriteBatchInternal::Count(write_batch);//advance the sequence number by the number of records in this batched write
                                         |-return DecodeFixed32(b->rep_.data() + 8);//decode the record count of this batch at offset 8 in rep_

The user data (key, value) recorded in the buffer rep_ now has the following format:

[WriteBatch header[sequencenumber64|count32] | Data[valuetype|keysize|key|valuesize|value]]

 

Step 4. First write the user data in the WriteBatch to the log file log_, so that it can be recovered after a machine failure.

The WAL written before the memtable:

WAL (Write-Ahead Logging) is an efficient logging scheme used by databases. For non in-memory databases, disk I/O is a major bottleneck. For the same amount of data, a WAL-based database performs roughly half the disk writes at commit time compared with traditional rollback journaling, which significantly improves disk I/O efficiency and therefore database performance.

 

A write appends one Record. The log file log_ is read and written logically in Records but physically organized in Blocks. Each Record carries a small header holding a checksum, a length, and a type. A single Record may be larger than a Block, so the FIRST, MIDDLE, and LAST types mark Records that span multiple Blocks.
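
To make the fragmentation concrete, here is a small sketch (our own helper, not LevelDB code) that counts how many fragments one record needs, assuming 32 KB blocks, 7-byte record headers, and a record that starts at the beginning of a fresh block:

// Count the fragments for one record under the assumptions above.
int CountFragments(int record_size) {
  const int kBlockSize = 32768;
  const int kHeaderSize = 7;
  const int kPayloadPerBlock = kBlockSize - kHeaderSize;  // 32761 bytes
  int fragments = 0;
  do {
    int chunk = record_size < kPayloadPerBlock ? record_size : kPayloadPerBlock;
    record_size -= chunk;
    fragments++;
  } while (record_size > 0);
  return fragments;
}

// Example: a 50,000-byte record becomes a kFirstType fragment of 32,761
// bytes plus a kLastType fragment of 17,239 bytes -> 2 fragments.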

      |-status = log_->AddRecord(WriteBatchInternal::Contents(write_batch));
                                                     |-return Slice(batch->rep_);

 

The log file log_ was created back in Step 1:

      |-WritableFile* lfile = nullptr;
      |-s = env_->NewWritableFile(LogFileName(dbname_, new_log_number), &lfile);
 
      |-log_ = new log::Writer(lfile); //instance a new Writer for this written operation
                   |-Writer::Writer(WritableFile* dest) : dest_(dest), block_offset_(0) { InitTypeCrc(type_crc_); }

The log::Writer class is shown below; WritableFile* dest_ is the buffered destination the data is written to.

class Writer {
 private:
  WritableFile* dest_;
  int block_offset_;  // Current offset in block
  // crc32c values for all supported record types.  These are
  // pre-computed to reduce the overhead of computing the crc of the
  // record type stored in the header.
  uint32_t type_crc_[kMaxRecordType + 1];
};

4.1) By default a Block is 32 KB. Data inside a block is laid out as kHeader|KVdata, where this kHeader is the 7-byte record header; it is distinct from the 12-byte WriteBatch header described earlier, which describes one batched operation.

static const int kBlockSize = 32768;
// Header is checksum (4 bytes), length (2 bytes), type (1 byte).
static const int kHeaderSize = 4 + 2 + 1;

The record header kHeader consists of a checksum crc, the length of the written data, and the record type.

The record type, defined below, indicates whether the data fits into a single block. If one block cannot hold the data being written, multiple blocks are used, and the fragments are typed so that one logical write can be reassembled completely.

enum RecordType {
  // Zero is reserved for preallocated files
  kZeroType = 0,
  kFullType = 1,    //the data fits completely into one block
  // For fragments
  kFirstType = 2,  //the data does not fit into one block; this is the first fragment
  kMiddleType = 3, //the data does not fit into one block; this is a middle fragment
  kLastType = 4    //the data does not fit into one block; this is the last fragment
};

 

4.2) The record header is written first, followed by the user data. If the remaining space in a block is too small to hold even a header, the rest of the block is padded with zeros and a new block is started. The write goes into a buffer; once the data has been written, a flush pushes the buffered data toward the disk file.

Code:

//Write data into the log file: Append->[header|payload][header|payload][header|payload]...
Status Writer::EmitPhysicalRecord(RecordType t, const char* ptr,size_t length) {
  // Format the header 7bytes [ crc(4bytes) | length(2bytes) | type(1byte) ]
  char buf[kHeaderSize];
  buf[4] = static_cast<char>(length & 0xff);//length hold 2 bytes
  buf[5] = static_cast<char>(length >> 8);
  buf[6] = static_cast<char>(t); //type hold 1 byte
  // Compute the crc of the record type and the payload.
  uint32_t crc = crc32c::Extend(type_crc_[t], ptr, length);
  crc = crc32c::Mask(crc);  // Adjust for storage
  EncodeFixed32(buf, crc);
  // Write the header and the payload
  Status s = dest_->Append(Slice(buf, kHeaderSize)); //write header firstly. append the data directly into the buffer of log file or flush into disk.
  if (s.ok()) {
    s = dest_->Append(Slice(ptr, length)); //then write payload data. append the data directly into the buffer of log file or flush into disk.
    if (s.ok()) {
      s = dest_->Flush(); //finally flush buffer into disk file.
                 |-{ return FlushBuffer(); }
    }
  }
  block_offset_ += kHeaderSize + length;//calculate the offset in current block.
  return s;
}
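
EmitPhysicalRecord writes one fragment; the splitting into kFullType/kFirstType/kMiddleType/kLastType fragments and the zero padding happen in log::Writer::AddRecord. Roughly, it looks like this (a sketch based on the upstream leveldb source, slightly abridged):

Status Writer::AddRecord(const Slice& slice) {
  const char* ptr = slice.data();
  size_t left = slice.size();
  Status s;
  bool begin = true;
  do {
    const int leftover = kBlockSize - block_offset_;
    if (leftover < kHeaderSize) {
      if (leftover > 0) {
        // Not enough room left for a 7-byte header: pad the block with zeros.
        dest_->Append(Slice("\x00\x00\x00\x00\x00\x00", leftover));
      }
      block_offset_ = 0;  // switch to a fresh block
    }
    const size_t avail = kBlockSize - block_offset_ - kHeaderSize;
    const size_t fragment_length = (left < avail) ? left : avail;

    RecordType type;
    const bool end = (left == fragment_length);
    if (begin && end)   type = kFullType;    // fits in one fragment
    else if (begin)     type = kFirstType;   // first of several fragments
    else if (end)       type = kLastType;    // last fragment
    else                type = kMiddleType;  // somewhere in the middle

    s = EmitPhysicalRecord(type, ptr, fragment_length);
    ptr += fragment_length;
    left -= fragment_length;
    begin = false;
  } while (s.ok() && left > 0);
  return s;
}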

 

4.3) The buffering data structure behind the log file log_; the buffer defaults to 64 KB: kWritableFileBufferSize = 65536;

If the data fits within the remaining buffer space, it is copied into the buffer and the write returns immediately; if it exceeds the buffer, the buffer is actively flushed and the data is written to the disk file.

class PosixWritableFile final : public WritableFile {
  // buf_[0, pos_ - 1] contains data to be written to fd_.
  char buf_[kWritableFileBufferSize];
  size_t pos_;
  int fd_;
  const bool is_manifest_;  // True if the file's name starts with MANIFEST.
  const std::string filename_;
  const std::string dirname_;  // The directory of filename_.
};

Code:

//Write slice data into buffer or disk file directly!
Status Append(const Slice& data) override {
  |-size_t write_size = data.size(); //get size of this slice data
  |-const char* write_data = data.data(); //get address of this slice data
  // Fit as much as possible into buffer. buffer size is 64KB, get minimum value
  |-size_t copy_size = std::min(write_size, kWritableFileBufferSize - pos_);
  |-std::memcpy(buf_ + pos_, write_data, copy_size);//copy slice data to buffer
  |-write_data += copy_size;//skip the written part, reposition address of slice data buffer
  |-write_size -= copy_size;//subtract the written part, calculate left size of slice data
  |-pos_ += copy_size;//plus the written data, reposition the file pos_
  |-if (write_size == 0) {//if slice data size has written fully, the write is done, return.
    return Status::OK();
  }
  //Otherwise, slice data has leftover, write operator hasn't finished. 
  //Can't fit in buffer, so need to do at least one write.
  |-Status status = FlushBuffer(); //flush buffer data into disk file.
    if (!status.ok()) {
      return status;
    }
  // Small writes go to buffer, large writes are written directly.
  |-if (write_size < kWritableFileBufferSize) {//if leftover size less than buffer size 64KB.
     std::memcpy(buf_, write_data, write_size);//copy slice data into buffer
      pos_ = write_size; //the buffer was just flushed (pos_ was reset to 0), so pos_ now equals write_size
      return Status::OK();
    }
  |-return WriteUnbuffered(write_data, write_size);//if the leftover is larger than the buffer, write it to the disk file directly
}

 

Step 5. Synchronous write: if the write is synchronous, an explicit sync pushes the buffered data to the disk file; on Linux this ultimately calls fdatasync(fd)/fsync(fd).

      |-if (status.ok() && options.sync) {//write success and if need to sync, call log file's Sync() to flush to disk file.
        |-status = logfile_->Sync();//flush data in buffer back to disk file.
        |-if (!status.ok()) {
          |-sync_error = true;
        |-}
      |-}

Code:

Status Sync() override {
  |-Status status = SyncDirIfManifest();
  //flush buffer data into disk file
  |-status = FlushBuffer();
  //flush file fd_ whose data exist in kernel buffer into disk
  |-return SyncFd(fd_, filename_);
 
static Status SyncFd(int fd, const std::string& fd_path) {
#if HAVE_FULLFSYNC
  |-if (::fcntl(fd, F_FULLFSYNC) == 0) {
      return Status::OK();
   }
#endif  // HAVE_FULLFSYNC
#if HAVE_FDATASYNC
  |-bool sync_success = ::fdatasync(fd) == 0;
#else
  |-bool sync_success = ::fsync(fd) == 0;
#endif  // HAVE_FDATASYNC
  |-if (sync_success) {
     return Status::OK();
  }
  |-return PosixError(fd_path, errno);

Once the data has been written to the log file log_, even a machine failure at this point is no longer a problem.

 

Step 6. Next, insert the WriteBatch data into the in-memory memtable.

The log file can take the user key-value data in bulk, but inserting into the memtable has to be done one key-value pair at a time.

status = WriteBatchInternal::InsertInto(write_batch, mem_);

Code:

Status WriteBatchInternal::InsertInto(const WriteBatch* b, MemTable* memtable) {
  |-MemTableInserter inserter;
  |-inserter.sequence_ = WriteBatchInternal::Sequence(b);//get 8-bytes sequence number
                         |-return SequenceNumber(DecodeFixed64(b->rep_.data()));
  |-inserter.mem_ = memtable;//DBImpl's mem_
  |-return b->Iterate(&inserter);
}

 

When we looked at the WriteBatch class earlier, we saw the nested Handler class with two pure virtual functions, Put and Delete; this is where they come into play.

  class LEVELDB_EXPORT Handler {
   public:
    virtual ~Handler();
    virtual void Put(const Slice& key, const Slice& value) = 0;
    virtual void Delete(const Slice& key) = 0;
  };

 

We established earlier that the buffer rep_ records user (key, value) data in the following format:

[WriteBatch header[sequencenumber64|count32] | Data[valuetype|keysize|key|valuesize|value]]

So the WriteBatch header is stripped first, then each Data entry (a user key-value record) is read, and depending on the ValueType stored in rep_ (enum ValueType { kTypeDeletion = 0x0, kTypeValue = 0x1 }) either Put or Delete is called.

Code:

  |-while (!input.empty()) {//If this Record has data
    |-found++;//statistics number
    |-char tag = input[0];//read a byte, this tag is kValueType{kTypeDeletion = 0x0, kTypeValue = 0x1}
    |-input.remove_prefix(1);//remove this tag from buffer because it already read
    |-switch (tag) {//judge valueType
      |-case kTypeValue://kTypeValue that means putting key:value
        |-if (GetLengthPrefixedSlice(&input, &key) && GetLengthPrefixedSlice(&input, &value)) {
          |-handler->Put(key, value);//get key and value from Record buffer rep_ and put into 
SkipList of Memtable
        }
        break;
      |-case kTypeDeletion://kTypeDeletion that means deleting key:value
        |-if (GetLengthPrefixedSlice(&input, &key)) {//get key and delete it
          |-handler->Delete(key);
        }
        break;
      |-default:
        return Status::Corruption("unknown WriteBatch tag");
    }//switch
  }//while

6.1) Put: add a key-value pair to the memtable via mem_->Add(), with type kTypeValue.

//In MemTableInserter, the WriteBatch::Handler subclass defined in write_batch.cc:
void Put(const Slice& key, const Slice& value) override {
    mem_->Add(sequence_, kTypeValue, key, value);
    sequence_++;
  }

 

6.2) Delete: "deleting" a key from the memtable also calls mem_->Add(), adding an entry of type kTypeDeletion with an empty value.

void Delete(const Slice& key) override {//delete that means add a void Slice!
    mem_->Add(sequence_, kTypeDeletion, key, Slice());
    sequence_++;
  }

 

6.3) The implementation of mem_->Add():

Encode the internal_key: | User key (string) | sequence number (7 bytes) | value type (1 byte) |

Ask the memory allocator arena_ for a buffer buf large enough to hold the record.

Write the user data into buf in order: internal_key_size | internal_key | val_size | value

Finally the key-value record is inserted into the skiplist: table_.Insert(buf);

void MemTable::Add(SequenceNumber s, ValueType type, const Slice& key, const Slice& value) {
  |-size_t key_size = key.size();
  |-size_t val_size = value.size();
  |-size_t internal_key_size = key_size + 8;//the extra 8 bytes hold SequenceNumber + ValueType, so the internal key = | User key (string) | sequence number (7 bytes) | value type (1 byte) |
  |-const size_t encoded_len = VarintLength(internal_key_size) +
                             internal_key_size + VarintLength(val_size) + val_size;
  |-char* buf = arena_.Allocate(encoded_len);// new a memory to store Record data
  |-char* p = EncodeVarint32(buf, internal_key_size);//encode internal key size and return pointer
  |-std::memcpy(p, key.data(), key_size);//copy key into memtable buffer
  |-p += key_size;
  |-EncodeFixed64(p, (s << 8) | type);//pack (sequence number << 8) | type into 8 bytes
  p += 8;
  p = EncodeVarint32(p, val_size);//encode value
  std::memcpy(p, value.data(), val_size);//copy value into memtable buffer
  assert(p + val_size == buf + encoded_len);
  table_.Insert(buf);//insert the key:value data into the SkipList of the MemTable
}
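
As a worked example of the layout (illustrative values only):

// key = "foo" (3 bytes), value = "bar" (3 bytes)
//   internal_key_size = 3 + 8 = 11          -> Varint32 "11" takes 1 byte
//   val_size          = 3                   -> Varint32 "3"  takes 1 byte
//   encoded_len       = 1 + 11 + 1 + 3 = 16 bytes allocated from the arena
// Layout in buf: [0x0B]["foo"][(seq << 8) | type, 8 bytes][0x03]["bar"]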

 

6.4) The implementation of table_.Insert(buf): with the standard description of skip lists from data structures and algorithms in mind, the following steps are easy to follow.

Find where the key should be inserted: keys in the SkipList are kept in sorted order, so using the user-defined comparator, find the position x at which the new key belongs.

Randomly generate the node height, and record in prev the insertion position at each level.

Insert the new node at each level.

void SkipList<Key, Comparator>::Insert(const Key& key) {
  |-Node* prev[kMaxHeight];
  |-Node* x = FindGreaterOrEqual(key, prev);
  // Our data structure does not allow duplicate insertion
  |-assert(x == nullptr || !Equal(key, x->key));
  |-int height = RandomHeight();//get random height for node
  |-if (height > GetMaxHeight()) {
    |-for (int i = GetMaxHeight(); i < height; i++) {
      |-prev[i] = head_;
    }
    |-max_height_.store(height, std::memory_order_relaxed);//publish the new max height
  }
  |-x = NewNode(key, height);//allocate memory for the node and construct the Node object
  |-for (int i = 0; i < height; i++) {
    // NoBarrier_SetNext() suffices since we will add a barrier when
    // we publish a pointer to "x" in prev[i].
    x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));
    prev[i]->SetNext(i, x);
  }

At this point, the user data has been inserted into the memtable and the write path is complete.

 

==============================================================================