The previous two articles made clear what LevelDB's overall architecture looks like and how an SSTable file is laid out, so how data gets stored should now be fairly obvious. In this article we analyze LevelDB's write path.
I. Database writes involve the following kinds of operations: (1) plain writes; (2) atomic writes; (3) synchronous writes.
(1) Plain writes
LevelDB provides Put(), Get(), and Delete() methods for modifying and querying the database. The following code writes a (key, value) pair: it reads the value stored under key1, writes it under key2, and then deletes key1.
std::string value;
leveldb::Status s = db->Get(leveldb::ReadOptions(), key1, &value);
if (s.ok()) s = db->Put(leveldb::WriteOptions(), key2, value);
if (s.ok()) s = db->Delete(leveldb::WriteOptions(), key1);
(2) Atomic writes
Sometimes we need to perform several operations on the database in a row; as the code below shows, moving the value stored under key1 to key2 requires consecutive Get, Put, and Delete calls.
std::string value;
leveldb::Status s = db->Get(leveldb::ReadOptions(), key1, &value);
if (s.ok()) s = db->Put(leveldb::WriteOptions(), key2, value);
if (s.ok()) s = db->Delete(leveldb::WriteOptions(), key1);
If the process dies after the Put of key2 but before the Delete of key1, the same value ends up stored under multiple keys. How can this be avoided? LevelDB provides WriteBatch, which applies a group of updates atomically, precisely to avoid this kind of problem.
#include "leveldb/write_batch.h"
...
std::string value;
leveldb::Status s = db->Get(leveldb::ReadOptions(), key1, &value);
if (s.ok()) {
leveldb::WriteBatch batch;
batch.Delete(key1);
batch.Put(key2, value);
s = db->Write(leveldb::WriteOptions(), &batch);
}
A WriteBatch holds a sequence of edits to be applied to the database, and they are applied in the order in which they were added. In the example above we put the Delete before the Put, so that if key1 happens to be identical to key2 we do not end up losing the value entirely.
Apart from atomicity, WriteBatch can also speed up bulk updates, since a large number of independent edits can be added to the same batch and applied in one go.
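As a quick illustration (a minimal sketch, not from the LevelDB sources: the function name WriteManyKeys and the key/value contents are made up, and db is assumed to be an already-opened leveldb::DB*), a large number of Put operations can share a single Write call:
#include <string>
#include "leveldb/db.h"
#include "leveldb/write_batch.h"

// Sketch: collect many independent Put operations into one WriteBatch and
// apply them with a single Write() call.
leveldb::Status WriteManyKeys(leveldb::DB* db) {
  leveldb::WriteBatch batch;
  for (int i = 0; i < 1000; ++i) {
    batch.Put("key-" + std::to_string(i), "value-" + std::to_string(i));
  }
  return db->Write(leveldb::WriteOptions(), &batch);
}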
The WriteBatch class looks like this:
class LEVELDB_EXPORT WriteBatch {
public:
class LEVELDB_EXPORT Handler {
public:
virtual ~Handler();
virtual void Put(const Slice& key, const Slice& value) = 0;
virtual void Delete(const Slice& key) = 0;
};
private:
friend class WriteBatchInternal;
std::string rep_; // See comment in write_batch.cc for the format of rep_
};
Its member variable rep_ stores the user data (key, value), and the stored data follows a specific format. What does that format look like?
[WriteBatch header[SequenceNumber64|Count32] | Data[type|keysize|key|valuesize|value]]
The 12-byte header consists of an 8-byte SequenceNumber plus a 4-byte count of key-value records:
// WriteBatch header has an 8-byte sequence number followed by a 4-byte count.
static const size_t kHeader = 12;
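As a concrete, unofficial illustration of that layout, the 12-byte header can be decoded by hand. The helper name DecodeBatchHeader is made up, and the sketch assumes a little-endian host (which matches LevelDB's fixed-width encoding); the real code does this via WriteBatchInternal::Sequence() and WriteBatchInternal::Count():
#include <cstdint>
#include <cstring>
#include <string>

// Sketch: decode the 12-byte WriteBatch header out of rep_.
// Bytes [0,8) hold the 64-bit sequence number, bytes [8,12) the record count,
// both little-endian. memcpy is used for brevity and assumes a little-endian host.
void DecodeBatchHeader(const std::string& rep, uint64_t* sequence, uint32_t* count) {
  std::memcpy(sequence, rep.data(), sizeof(*sequence));   // sequence number
  std::memcpy(count, rep.data() + 8, sizeof(*count));     // number of records
}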
(3) Synchronous writes
By default every LevelDB write is asynchronous: the call returns as soon as the process has pushed the write into the operating system's buffers, and the transfer from those buffers to durable storage on disk happens asynchronously.
The sync flag can be turned on for a write (write_options.sync = true) so that the call does not return until the data has actually been persisted to disk (on POSIX systems this is implemented by calling fsync(...) or fdatasync(...) before the write returns).
leveldb::WriteOptions write_options;
write_options.sync = true;
db->Put(write_options, ...);
Asynchronous writes are often more than a thousand times faster than synchronous ones. The downside is that a machine crash may lose the last few updates that are still sitting in memory. Note that a crash of just the writing process (rather than a machine reboot) loses nothing, because every update is pushed from the process into the operating system before it is considered done.
WriteBatch provides an alternative to asynchronous writes: multiple updates can be placed in the same WriteBatch and applied together with one synchronous write (write_options.sync = true), so the cost of the sync is amortized across all the updates in the batch.
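For example (a sketch: the function name SyncBatchWrite and the keys/values are made up, and db is assumed to be an open leveldb::DB*), several updates can share the cost of a single fsync:
#include "leveldb/db.h"
#include "leveldb/write_batch.h"

// Sketch: group two updates and pay for a single synchronous write.
leveldb::Status SyncBatchWrite(leveldb::DB* db) {
  leveldb::WriteBatch batch;
  batch.Put("key1", "value1");
  batch.Put("key2", "value2");
  leveldb::WriteOptions write_options;
  write_options.sync = true;  // one fsync/fdatasync covers the whole batch
  return db->Write(write_options, &batch);
}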
II. The write path
LevelDB's write path is relatively simple: a write is finished once it has been appended to the log file and inserted into the in-memory memtable; persisting the memtable itself is left to the compaction process.
1. Interfaces
As noted above, the write interfaces LevelDB exposes are:
Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value)
Status DBImpl::Write(const WriteOptions& options, WriteBatch* updates)
Status DBImpl::Delete(const WriteOptions& options, const Slice& key) {
return DB::Delete(options, key);
}
leveldb::WriteOptions() is where synchronous versus asynchronous behavior is chosen. Put also goes through a WriteBatch, just one containing a single operation, and ultimately calls the Write interface; Delete likewise ends up calling Write:
Status DB::Delete(const WriteOptions& opt, const Slice& key) {
WriteBatch batch;
batch.Delete(key);
return Write(opt, &batch);
}
Let us now walk through how these write interfaces are implemented:
Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value) {
|-WriteBatch batch;
|-batch.Put(key, value);
|-WriteBatchInternal::SetCount(this, WriteBatchInternal::Count(this) + 1);//increase a write operator
|-rep_.push_back(static_cast<char>(kTypeValue));//rep_ is a buffer of data which need to be written,
rep_ is used to format the data, the first value is kTypeValue
|-PutLengthPrefixedSlice(&rep_, key); //write key into buffer rep_
|-PutVarint32(dst, value.size()); //the first is size of key, the second is key
|-dst->append(value.data(), value.size());
|-PutLengthPrefixedSlice(&rep_, value); //write value into buffer rep_, also size and value
rep_: kTypeValue|keySize|key|valueSize|value
|-return Write(opt, &batch);
}
From the code above we can see rep_.push_back(static_cast<char>(kTypeValue)), where kTypeValue marks this operation as inserting a user record (key, value) rather than deleting a key. Next the size of the user key and the key itself are appended to rep_, with the key size encoded as a variable-length Varint32; finally the size of the value and the value itself are appended as well. The end result is: rep_: kTypeValue|keySize|key|valueSize|value
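To make the length-prefix encoding concrete, here is a minimal sketch of the Varint32 scheme (7 payload bits per byte, with the high bit marking "more bytes follow"); the helper names AppendVarint32 and AppendLengthPrefixed are made up, while the real implementation is PutVarint32/PutLengthPrefixedSlice in util/coding.cc:
#include <cstdint>
#include <string>
#include "leveldb/slice.h"

// Sketch of LevelDB-style varint32 encoding: each output byte carries 7 bits
// of the value, and the high bit says whether more bytes follow.
void AppendVarint32(std::string* dst, uint32_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// A length-prefixed slice is then simply varint32(length) followed by the raw bytes.
void AppendLengthPrefixed(std::string* dst, const leveldb::Slice& value) {
  AppendVarint32(dst, static_cast<uint32_t>(value.size()));
  dst->append(value.data(), value.size());
}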
The value types:
enum ValueType { kTypeDeletion = 0x0, kTypeValue = 0x1 };
Because user records (key, value) are only ever appended, writing a key and later deleting that same key does not make LevelDB locate the earlier record and erase it; both operations are simply appended, so the same key now appears twice and the older record is stale. How, then, do we tell whether a key has been deleted? LevelDB defines two value types: an insertion is tagged kTypeValue and a deletion kTypeDeletion, and this tag is what distinguishes them.
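At the API level the effect looks like this (a sketch; the function name PutThenDelete and the key are made up, and db is assumed to be open): a Put followed by a Delete leaves two appended records, and a later Get sees the newer kTypeDeletion entry and reports NotFound:
#include <string>
#include "leveldb/db.h"

// Sketch: both operations are appended; the read path honors the newer tombstone.
void PutThenDelete(leveldb::DB* db) {
  db->Put(leveldb::WriteOptions(), "some_key", "some_value");  // appends a kTypeValue record
  db->Delete(leveldb::WriteOptions(), "some_key");             // appends a kTypeDeletion record
  std::string value;
  leveldb::Status s = db->Get(leveldb::ReadOptions(), "some_key", &value);
  // s.IsNotFound() is expected to be true: the tombstone shadows the older value.
}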
Here is the delete operation:
void WriteBatch::Delete(const Slice& key) {
WriteBatchInternal::SetCount(this, WriteBatchInternal::Count(this) + 1);
rep_.push_back(static_cast<char>(kTypeDeletion));
PutLengthPrefixedSlice(&rep_, key);
}
2. The Write interface
The task queue writers_
Status DBImpl::Write(const WriteOptions& options, WriteBatch* updates) {
|-Writer w(&mutex_); //use mutex_ to initialize a condition variable which Writer object internal used
|-w.batch = updates; //this WriteBatch is the contents of key-value. w is once written operation.
|-w.sync = options.sync;
|-w.done = false;
|-MutexLock l(&mutex_); //initialize a MutexLock object via mutex_ and call mutex_.lock()
|-writers_.push_back(&w); //add this Writer object into tail of dqueue writers_(double-ended queue)
|-while (!w.done && &w != writers_.front()) { //if this writer is not yet done and is not at the front of the deque, it has not been scheduled yet
|-w.cv.Wait(); //wait on the condition variable. Why? As the end of this function shows, when a batch group finishes, the new head of the queue is signaled; once this Writer reaches the front it gets to perform the write itself.
|-}
|-if (w.done) { //if we wake up and find this writer already done, its WriteBatch was applied by another (front) writer, so just return its status.
|-return w.status;
|-}
The writers_ deque that writers_.push_back(&w) appends to is a member of class DBImpl : public DB; it is a double-ended queue. A write is not executed immediately; instead a Writer object is created and appended to the deque writers_, where it waits to be scheduled.
std::deque<Writer*> writers_ GUARDED_BY(mutex_); // Queue of writers.
The Writer object is shown below: it records the WriteBatch to apply, whether the write is synchronous, whether it is done, its status, and a condition variable port::CondVar used for signaling.
struct DBImpl::Writer {
explicit Writer(port::Mutex* mu)
: batch(nullptr), sync(false), done(false), cv(mu) {}
Status status;
WriteBatch* batch;
bool sync;
bool done;
port::CondVar cv;
};
writers_ is a task queue that follows the producer/consumer model: producer threads keep appending pending Writer tasks to the queue, and LevelDB's "consumer" is simply one of those producer threads, chosen to process the queued tasks.
What happens once a writer actually gets scheduled is analyzed in the next section; here we first finish looking at the writers_ queue mechanism. When the write completes, the finished tasks are taken off the queue, marked done, and their condition variables are signaled.
Code:
|-while (true) {//at present, double-end queue has all above writers!
|-Writer* ready = writers_.front();//get first element from double-end queue, pop from queue.
|-writers_.pop_front();
|-if (ready != &w) {//these writers' batches were applied by the front writer; mark them done and signal the waiters
|-ready->status = status;
|-ready->done = true;
|-ready->cv.Signal();
|-}
|-if (ready == last_writer) break;//every writer merged into this batch group has now been marked
|-}
// Notify new head of write queue.
|-if (!writers_.empty()) {//this batch group has finished; if the queue is not empty, new writers have queued up,
so notify the new head of the queue that it may handle its request
|-writers_.front()->cv.Signal();
|-}
- LevelDB supports multiple threads, so the mutex (MutexLock) protects writers_.
- After adding its task to the writers_ queue, each producer enters a while loop and goes to sleep. The thread is woken up only when its task reaches the head of the queue or when its task has already been processed (writer.done == true). After waking it re-checks the loop condition and goes back to sleep if it still cannot be scheduled.
- If its task was processed by another thread, the thread simply returns.
- If its task sits at the head of writers_ and has not been processed, this thread performs the write itself (a standalone sketch of this pattern follows the list).
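The queueing pattern itself can be captured in a standalone sketch. This is not LevelDB code (the names Task, WriteQueue, and Submit are made up); it just mirrors the deque-plus-condition-variable idea using std::mutex and std::condition_variable, and it omits batch grouping and error handling:
#include <condition_variable>
#include <deque>
#include <mutex>

// Standalone sketch: every caller enqueues a Task, waits until it is at the
// front of the queue or already marked done, and the front caller performs the
// actual work. LevelDB additionally merges followers' batches and drops the
// mutex while writing to the log file; both are omitted here.
struct Task {
  bool done = false;
  std::condition_variable cv;
};

class WriteQueue {
 public:
  void Submit(Task* t) {
    std::unique_lock<std::mutex> lock(mu_);
    queue_.push_back(t);
    while (!t->done && t != queue_.front()) {
      t->cv.wait(lock);              // sleep until we are scheduled or done
    }
    if (t->done) return;             // a front writer already did our work
    // ... perform the write for t here (LevelDB would unlock around the I/O) ...
    queue_.pop_front();              // t was the front element
    t->done = true;
    if (!queue_.empty()) {
      queue_.front()->cv.notify_one();  // wake the new head of the queue
    }
  }

 private:
  std::mutex mu_;
  std::deque<Task*> queue_;
};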
3. The actual write logic
The overall flow is:
1. Pre-write checks: Status status = MakeRoomForWrite(updates == nullptr);
2. Batch grouping: WriteBatch* write_batch = BuildBatchGroup(&last_writer);
3. Append to the log file first, for crash recovery: log_->AddRecord(WriteBatchInternal::Contents(updates));
4. Then insert into the in-memory memtable: WriteBatchInternal::InsertInto(updates, mem_);
Below we walk through the write process in detail alongside the code.
Step 1. First, Status status = MakeRoomForWrite(updates == nullptr); performs the pre-write checks:
1) Check whether the background thread (compaction runs in the background) has hit an error: bg_error_.ok()? If so, return the error immediately and abort the write.
2) Check whether the number of level-0 files has reached the soft limit kL0_SlowdownWritesTrigger = 8. If it has, the write rate must be throttled to protect overall performance, so the writer sleeps for 1ms (SleepForMicroseconds(1000)) before retrying; because level-0 files can contain overlapping keys, their number has to be kept strictly limited for the sake of read performance.
3) Check whether the memtable is still within the configured write buffer (default 4MB): mem_->ApproximateMemoryUsage() <= options_.write_buffer_size. If it is, there is room and the write can go into the memtable.
4) If the memtable is full, it must be switched to an immutable memtable; but if imm_ != nullptr, the previous immutable memtable is still being compacted to level-0, so wait for that background compaction to finish.
5) Reaching this point means the previous immutable memtable has been compacted to level-0; if the number of level-0 files has hit the hard limit kL0_StopWritesTrigger = 12, writes must also stop until the level-0 file count comes down.
6) Reaching this point means the memtable is out of space, the immutable memtable has been compacted to level-0, and the level-0 file count is acceptable, so the current memtable is turned into a read-only immutable memtable, a background compaction is scheduled, and a new memtable and a new log file are created; the write then goes into the new memtable.
Code:
Status DBImpl::MakeRoomForWrite(bool force) {
|-mutex_.AssertHeld(); //make sure lock is holded
|-assert(!writers_.empty()); //writers dequeue is not empty, that at least one writer in dequeue.
|-bool allow_delay = !force; // not force means allow delay.
|-Status s;
|-while (true) {
|-if (!bg_error_.ok()) {
// Yield previous error
s = bg_error_;
break;
|-} else if (allow_delay && versions_->NumLevelFiles(0) >= config::kL0_SlowdownWritesTrigger) { // we are getting close to
hitting the hard limit on the number of level-0 files, so slow down and sleep for a while (1ms).
mutex_.Unlock();
env_->SleepForMicroseconds(1000);
allow_delay = false; // Do not delay a single write more than once
mutex_.Lock();
|-} else if (!force && (mem_->ApproximateMemoryUsage() <= options_.write_buffer_size)) { //currently, memtable has enough space
to hold the write data.
// There is room in current memtable
break;
|-} else if (imm_ != nullptr) { //step into this branch means memtable has no enough space, immutable table also exists, so
just wait for compaction finished.
// We have filled up the current memtable, but the previous one is still being compacted, so we wait.
Log(options_.info_log, "Current memtable full; waiting...\n");
background_work_finished_signal_.Wait(); // wait for compaction finished
|-} else if (versions_->NumLevelFiles(0) >= config::kL0_StopWritesTrigger){//currently leve-0 has reached maximum number of files
// There are too many level-0 files.
Log(options_.info_log, "Too many L0 files; waiting...\n");
background_work_finished_signal_.Wait();//just wait for compaction finished.
|-} else {//reaching here means the memtable has no space left and imm_ is nullptr (its compaction has finished), so we can
allocate a new memtable, turn the previous one into an immutable memtable, and direct this write into the new memtable
// Attempt to switch to a new memtable and trigger compaction of old
|-assert(versions_->PrevLogNumber() == 0);
|-uint64_t new_log_number = versions_->NewFileNumber(); //increase a new file number
|-WritableFile* lfile = nullptr;
|-s = env_->NewWritableFile(LogFileName(dbname_, new_log_number), &lfile);//construct a log file under dbname, return WritableFile
e.g.: /tmp/dbname/000002.log; open this log file and instantiate a WritableFile object with its fd, preparing for the write.
|-int fd = ::open(filename.c_str(), O_TRUNC | O_WRONLY | O_CREAT | kOpenBaseFlags, 0644);
|-*result = new PosixWritableFile(filename, fd);
|-PosixWritableFile(std::string filename, int fd) : pos_(0), fd_(fd),
is_manifest_(IsManifest(filename)),
filename_(std::move(filename)),
dirname_(Dirname(filename_)) {}
|-delete log_; //delete old written log operation
|-delete logfile_; //delete old log file
|-logfile_ = lfile; //save this new log file
|-logfile_number_ = new_log_number; //save new log number
|-log_ = new log::Writer(lfile); //instance a new Writer for this written operation
|-Writer::Writer(WritableFile* dest) : dest_(dest), block_offset_(0) { InitTypeCrc(type_crc_); }
|-imm_ = mem_; //change memtable to immutable
|-has_imm_.store(true, std::memory_order_release);
|-mem_ = new MemTable(internal_comparator_); //instantiate a new MemTable object. table_ is a skiplist
|-: comparator_(comparator), refs_(0), table_(comparator_, &arena_) {}
|-mem_->Ref(); //increase reference count
|-force = false; // Do not force another compaction if have room
|-MaybeScheduleCompaction();//schedule compaction
|-}
}
return s;
After MakeRoomForWrite(updates == nullptr) returns, we have a memtable with room for the incoming data.
Step 2. Next, build a larger batch group: WriteBatch* write_batch = BuildBatchGroup(&last_writer);
1) Iterate over the Writer objects in the writers_ queue: std::deque<Writer*>::iterator iter = writers_.begin();
The WriteBatches of several Writer objects are merged into one large WriteBatch that is then written in a single pass; this improves write throughput.
|-WriteBatchInternal::Append(result, first->batch);//result == DBImpl->tmp_batch
|-SetCount(dst, Count(dst) + Count(src));//plus new writer
|-assert(src->rep_.size() >= kHeader);
|-dst->rep_.append(src->rep_.data() + kHeader, src->rep_.size() - kHeader);//skip all writer's kHeader, just append the rep_ data of every writer's
2) There are two restrictions on which writers may join the batch group:
2.1) The sync flag: a writer that requires sync is not merged into a group whose leading writer is asynchronous (the leader's non-sync write must not silently take on a sync obligation); writers compatible with the leader's sync setting are appended into DBImpl's tmp_batch_.
2.2) To keep a single write from becoming too large, there is a size cap:
|-if (w->batch != nullptr) {//current writer's WriteBatch is not nullptr, so plus current writer's rep_ size.
|-size += WriteBatchInternal::ByteSize(w->batch);
|-if (size > max_size) { //exceed the max_size, so break.
// Do not make batch too big
|-break;
|-}
After this step, the returned WriteBatch* write_batch is either a single WriteBatch that has to be synced right away, or a merged group of batches.
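For reference, the size cap is not a fixed constant; in the LevelDB source (DBImpl::BuildBatchGroup) it is computed roughly as sketched below: the group may grow to about 1MB, but when the leading write is small the group is only allowed to grow by about another 128KB so that small writes stay low-latency. This is a sketch only and the helper name MaxGroupSize is made up.
#include <cstddef>

// Sketch of the grouping-limit heuristic used by DBImpl::BuildBatchGroup.
size_t MaxGroupSize(size_t leading_batch_size) {
  size_t max_size = 1 << 20;                    // ~1MB hard cap
  if (leading_batch_size <= (128 << 10)) {      // leading batch is "small" (<= 128KB)
    max_size = leading_batch_size + (128 << 10);
  }
  return max_size;
}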
Step 3. Encode the header of this write (the WriteBatch header):
// WriteBatch header has an 8-byte sequence number followed by a 4-byte count.
static const size_t kHeader = 12;
The 12-byte kHeader = an 8-byte SequenceNumber + a 4-byte count of key-value records; this header describes one batched write operation.
|-WriteBatchInternal::SetSequence(write_batch, last_sequence + 1);//there is a reserved 8bytes place for sequence number
|-EncodeFixed64(&b->rep_[0], seq);//take last_sequence + 1 and encode the 64-bit/8-byte sequence number at the start of rep_
|-last_sequence += WriteBatchInternal::Count(write_batch);//advance the sequence number by the number of records in this batched write
|-return DecodeFixed32(b->rep_.data() + 8);//decode the record count at offset 8 in rep_
After this, the user data recorded in rep_ has the following layout:
[WriteBatch header[sequencenumber64|count32] | Data[valuetype|keysize|key|valuesize|value]]
Step 4. First append the user data in the WriteBatch to the log file log_, so that it can be recovered after a machine failure.
The WAL written before the memtable:
WAL (Write-Ahead Logging) is an efficient logging technique used by databases. For databases that are not purely in-memory, disk I/O is one of the biggest bottlenecks; for the same amount of data, a WAL-based system does roughly half the disk writes at commit time compared with traditional rollback journaling, which greatly improves disk I/O efficiency and therefore overall performance.
A write appends one Record. The log file log_ is read and written logically in Records but is organized physically in Blocks. Each Record carries a small header holding a checksum, a length, and a type. A single Record may be larger than a Block, so the FIRST, MIDDLE, and LAST types exist for Records that span multiple Blocks.
|-status = log_->AddRecord(WriteBatchInternal::Contents(write_batch));
|-return Slice(batch->rep_);
而日志文件log_是在Step 1中生成的:
|-WritableFile* lfile = nullptr;
|-s = env_->NewWritableFile(LogFileName(dbname_, new_log_number), &lfile);
|-log_ = new log::Writer(lfile); //instance a new Writer for this written operation
|-Writer::Writer(WritableFile* dest) : dest_(dest), block_offset_(0) { InitTypeCrc(type_crc_); }
The log Writer class is shown below; WritableFile* dest_ is the buffered destination the data is written to.
class Writer {
private:
WritableFile* dest_;
int block_offset_; // Current offset in block
// crc32c values for all supported record types. These are
// pre-computed to reduce the overhead of computing the crc of the
// record type stored in the header.
uint32_t type_crc_[kMaxRecordType + 1];
};
4.1) A Block is 32KB by default. Each record fragment stored in a Block carries its own small header, so a Block looks like header|data. Do not confuse this record header with the WriteBatch header discussed earlier, which describes one batched write.
static const int kBlockSize = 32768;
// Header is checksum (4 bytes), length (2 bytes), type (1 byte).
static const int kHeaderSize = 4 + 2 + 1;
The record header consists of a crc checksum, the length of the payload, and the record type.
The record type, defined below, indicates whether the data of this write fits in a single Block. If one Block cannot hold all of it, several Blocks are used and each fragment is tagged with a type so that the complete write can later be reassembled (a simplified sketch of the fragmentation loop follows the enum below).
enum RecordType {
// Zero is reserved for preallocated files
kZeroType = 0,
kFullType = 1, //the record fits entirely within one block
// For fragments
kFirstType = 2, //the record does not fit in one block; this fragment is its first piece
kMiddleType = 3, //the record does not fit in one block; this fragment is a middle piece
kLastType = 4 //the record does not fit in one block; this fragment is its last piece
};
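These types are assigned by a fragmentation loop in log::Writer::AddRecord. Below is a simplified sketch of that loop (the method name AddRecordSketch is made up, and error handling plus the zero-length-record corner case are omitted); it relies on the kBlockSize, kHeaderSize, block_offset_, dest_, and EmitPhysicalRecord members shown in this section:
// Sketch: split the payload across 32KB blocks, padding a block with zeros
// when fewer than kHeaderSize (7) bytes remain, and tagging each fragment as
// FULL, FIRST, MIDDLE, or LAST.
Status Writer::AddRecordSketch(const Slice& slice) {
  const char* ptr = slice.data();
  size_t left = slice.size();
  Status s;
  bool begin = true;
  do {
    const int leftover = kBlockSize - block_offset_;
    if (leftover < kHeaderSize) {
      if (leftover > 0) {
        dest_->Append(Slice("\x00\x00\x00\x00\x00\x00", leftover));  // zero-fill the trailer
      }
      block_offset_ = 0;                                // switch to a new block
    }
    const size_t avail = kBlockSize - block_offset_ - kHeaderSize;
    const size_t fragment_length = (left < avail) ? left : avail;
    const bool end = (left == fragment_length);
    RecordType type;
    if (begin && end)  type = kFullType;
    else if (begin)    type = kFirstType;
    else if (end)      type = kLastType;
    else               type = kMiddleType;
    s = EmitPhysicalRecord(type, ptr, fragment_length);  // header + payload for this fragment
    ptr += fragment_length;
    left -= fragment_length;
    begin = false;
  } while (s.ok() && left > 0);
  return s;
}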
4.2) The record header is written first, followed by the payload. If the space left in a Block cannot even hold a record header, the rest of the Block is padded with zeros and a new Block is started. The writes go into a buffer; once the record has been written, a flush pushes the buffered data on toward the disk file.
Code:
//Write the record into the file: Append->[header|payload][header|payload][header|payload]...
Status Writer::EmitPhysicalRecord(RecordType t, const char* ptr,size_t length) {
// Format the header 7bytes [ crc(4bytes) | length(2bytes) | type(1byte) ]
char buf[kHeaderSize];
buf[4] = static_cast<char>(length & 0xff);//length hold 2 bytes
buf[5] = static_cast<char>(length >> 8);
buf[6] = static_cast<char>(t); //type hold 1 byte
// Compute the crc of the record type and the payload.
uint32_t crc = crc32c::Extend(type_crc_[t], ptr, length);
crc = crc32c::Mask(crc); // Adjust for storage
EncodeFixed32(buf, crc);
// Write the header and the payload
Status s = dest_->Append(Slice(buf, kHeaderSize)); //write header firstly. append the data directly into the buffer of log file or flush into disk.
if (s.ok()) {
s = dest_->Append(Slice(ptr, length)); //then write payload data. append the data directly into the buffer of log file or flush into disk.
if (s.ok()) {
s = dest_->Flush(); //finally flush buffer into disk file.
|-{ return FlushBuffer(); }
}
}
block_offset_ += kHeaderSize + length;//calculate the offset in current block.
return s;
}
4.3) The buffer behind the log file log_; its default size is 64KB: kWritableFileBufferSize = 65536.
If the data fits in the remaining buffer space it is simply copied into the buffer and the write returns; if it exceeds the buffer, the buffer is flushed first and the data is then written out to the file.
class PosixWritableFile final : public WritableFile {
// buf_[0, pos_ - 1] contains data to be written to fd_.
char buf_[kWritableFileBufferSize];
size_t pos_;
int fd_;
const bool is_manifest_; // True if the file's name starts with MANIFEST.
const std::string filename_;
const std::string dirname_; // The directory of filename_.
};
Code:
//Write slice data into buffer or disk file directly!
Status Append(const Slice& data) override {
|-size_t write_size = data.size(); //get size of this slice data
|-const char* write_data = data.data(); //get address of this slice data
// Fit as much as possible into buffer. buffer size is 64KB, get minimum value
|-size_t copy_size = std::min(write_size, kWritableFileBufferSize - pos_);
|-std::memcpy(buf_ + pos_, write_data, copy_size);//copy slice data to buffer
|-write_data += copy_size;//skip the written part, reposition address of slice data buffer
|-write_size -= copy_size;//subtract the written part, calculate left size of slice data
|-pos_ += copy_size;//plus the written data, reposition the file pos_
|-if (write_size == 0) {//if slice data size has written fully, the write is done, return.
return Status::OK();
}
//Otherwise, slice data has leftover, write operator hasn't finished.
//Can't fit in buffer, so need to do at least one write.
|-Status status = FlushBuffer(); //flush buffer data into disk file.
if (!status.ok()) {
return status;
}
// Small writes go to buffer, large writes are written directly.
|-if (write_size < kWritableFileBufferSize) {//if leftover size less than buffer size 64KB.
std::memcpy(buf_, write_data, write_size);//copy slice data into buffer
pos_ = write_size; //the buffer was just flushed, so pos_ restarts at 0 and ends up equal to write_size
return Status::OK();
}
|-return WriteUnbuffered(write_data, write_size);//if leftover write size larger than buffer
then write data into disk file directly.
Step 5. Synchronous write: if sync was requested, an explicit sync flushes the buffered data all the way to the disk file; on Linux this ends up calling fdatasync(fd)/fsync(fd).
|-if (status.ok() && options.sync) {//write success and if need to sync, call log file's Sync() to flush to disk file.
|-status = logfile_->Sync();//flush data in buffer back to disk file.
|-if (!status.ok()) {
|-sync_error = true;
|-}
|-}
Code:
Status Sync() override {
|-Status status = SyncDirIfManifest();
//flush buffer data into disk file
|-status = FlushBuffer();
//flush file fd_ whose data exist in kernel buffer into disk
|-return SyncFd(fd_, filename_);
static Status SyncFd(int fd, const std::string& fd_path) {
#if HAVE_FULLFSYNC
|-if (::fcntl(fd, F_FULLFSYNC) == 0) {
return Status::OK();
}
#endif // HAVE_FULLFSYNC
#if HAVE_FDATASYNC
|-bool sync_success = ::fdatasync(fd) == 0;
#else
|-bool sync_success = ::fsync(fd) == 0;
#endif // HAVE_FDATASYNC
|-if (sync_success) {
return Status::OK();
}
|-return PosixError(fd_path, errno);
Once the data has made it into the log file log_, even a machine failure at this point is no longer a problem.
Step 6. Next, insert the WriteBatch data into the in-memory memtable.
The log file can absorb the user key-value data in one batched append, but insertion into the memtable happens one key-value pair at a time.
status = WriteBatchInternal::InsertInto(write_batch, mem_);
Code:
Status WriteBatchInternal::InsertInto(const WriteBatch* b, MemTable* memtable) {
|-MemTableInserter inserter;
|-inserter.sequence_ = WriteBatchInternal::Sequence(b);//get 8-bytes sequence number
|-return SequenceNumber(DecodeFixed64(b->rep_.data()));
|-inserter.mem_ = memtable;//DBImpl's mem_
|-return b->Iterate(&inserter);
}
As we saw when analyzing the WriteBatch class, it contains a Handler class that declares two pure virtual functions, Put and Delete; this is where they come into play.
class LEVELDB_EXPORT Handler {
public:
virtual ~Handler();
virtual void Put(const Slice& key, const Slice& value) = 0;
virtual void Delete(const Slice& key) = 0;
};
Recall that the user data recorded in rep_ has the following layout:
[WriteBatch header[sequencenumber64|count32] | Data[valuetype|keysize|key|valuesize|value]]
So the WriteBatch header is stripped first, and then each Data entry (one user key-value record) is read; depending on the ValueType tag that was stored in rep_ (enum ValueType { kTypeDeletion = 0x0, kTypeValue = 0x1 }), either Put or Delete is invoked.
Code:
|-while (!input.empty()) {//If this Record has data
|-found++;//statistics number
|-char tag = input[0];//read a byte, this tag is kValueType{kTypeDeletion = 0x0, kTypeValue = 0x1}
|-input.remove_prefix(1);//remove this tag from buffer because it already read
|-switch (tag) {//judge valueType
|-case kTypeValue://kTypeValue that means putting key:value
|-if (GetLengthPrefixedSlice(&input, &key) && GetLengthPrefixedSlice(&input, &value)) {
|-handler->Put(key, value);//get key and value from Record buffer rep_ and put into
SkipList of Memtable
}
break;
|-case kTypeDeletion://kTypeDeletion that means deleting key:value
|-if (GetLengthPrefixedSlice(&input, &key)) {//get key and delete it
|-handler->Delete(key);
}
break;
|-default:
return Status::Corruption("unknown WriteBatch tag");
}//switch
}//while
6.1) Put: add a key-value pair to the memtable via mem_->Add() with type kTypeValue.
void Put(const Slice& key, const Slice& value) override {  // MemTableInserter::Put
mem_->Add(sequence_, kTypeValue, key, value);
sequence_++;
}
6.2) Delete: "deleting" a key from the memtable also goes through mem_->Add(), which appends an entry of type kTypeDeletion with an empty value.
void Delete(const Slice& key) override {//delete that means add a void Slice!
mem_->Add(sequence_, kTypeDeletion, key, Slice());
sequence_++;
}
6.3) The implementation of mem_->Add():
Encode the internal_key (| User key (string) | sequence number (7 bytes) | value type (1 byte) |).
Ask the memory manager arena_ to allocate a buffer buf large enough for the record.
Write the record into buf in order: internal_key_size | internal_key | val_size | value.
Finally add the record to the skiplist: table_.Insert(buf);
void MemTable::Add(SequenceNumber s, ValueType type, const Slice& key, const Slice& value) {
|-size_t key_size = key.size();
|-size_t val_size = value.size();
|-size_t internal_key_size = key_size + 8;//8 bytes = SequenceNumber + ValueType, so the internal
key = | User key (string) | sequence number (7 bytes) | value type (1 byte) |
|-const size_t encoded_len = VarintLength(internal_key_size) +
internal_key_size + VarintLength(val_size) + val_size;
|-char* buf = arena_.Allocate(encoded_len);// new a memory to store Record data
|-char* p = EncodeVarint32(buf, internal_key_size);//encode internal key size and return pointer
|-std::memcpy(p, key.data(), key_size);//copy key into memtable buffer
|-p += key_size;
|-EncodeFixed64(p, (s << 8) | type);//pack the sequence number and value type into 8 bytes
p += 8;
p = EncodeVarint32(p, val_size);//encode the value size as a varint32
std::memcpy(p, value.data(), val_size);//copy value into memtable buffer
assert(p + val_size == buf + encoded_len);
table_.Insert(buf);//insert the key:value data into SkipList of Memtable
6.4) The implementation of table_.Insert(buf): with the standard treatment of skiplists from data structures and algorithms in mind, the steps below are easy to follow.
Find where the key should be inserted: keys in the SkipList are kept in sorted order, and the user-defined comparator is used to find the node x before which the new key belongs, recording the predecessor at each level in prev.
Generate a random height for the new node, extending prev with the list head for any newly added levels.
Link the new node in at every level.
void SkipList<Key, Comparator>::Insert(const Key& key) {
|-Node* prev[kMaxHeight];
|-Node* x = FindGreaterOrEqual(key, prev);
// Our data structure does not allow duplicate insertion
|-assert(x == nullptr || !Equal(key, x->key));
|-int height = RandomHeight();//get random height for node
|-if (height > GetMaxHeight()) {
|-for (int i = GetMaxHeight(); i < height; i++) {
|-prev[i] = head_;
|-max_height_.store(height, std::memory_order_relaxed);
}
|-x = NewNode(key, height);//malloc memory for node and construct Node object
|-for (int i = 0; i < height; i++) {
// NoBarrier_SetNext() suffices since we will add a barrier when
// we publish a pointer to "x" in prev[i].
x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));
prev[i]->SetNext(i, x);
}
At this point the user data has been fully inserted into the memtable and the write is complete.
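To wrap up, here is a minimal end-to-end sketch at the API level (the database path and keys below are made up): an asynchronous Put followed by a synchronous batched write, both of which go through the log-then-memtable path analyzed above.
#include <cassert>
#include <string>
#include "leveldb/db.h"
#include "leveldb/write_batch.h"

// End-to-end sketch: open a database, do one asynchronous Put, then one
// synchronous batched write.
int main() {
  leveldb::DB* db = nullptr;
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/writedemo", &db);
  assert(s.ok());

  // Asynchronous write: returns once the record is in the log buffer and memtable.
  s = db->Put(leveldb::WriteOptions(), "k1", "v1");

  // Synchronous batched write: one fsync covers both updates.
  leveldb::WriteBatch batch;
  batch.Delete("k1");
  batch.Put("k2", "v1");
  leveldb::WriteOptions sync_opts;
  sync_opts.sync = true;
  s = db->Write(sync_opts, &batch);

  delete db;
  return 0;
}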
==============================================================================