A WAL (write-ahead log) writes a log record to a file before a database write is committed; the sequentially written file that holds these records is the WAL. It is one of the key components etcd relies on to make raft's two-phase commit of data crash-safe, so before digging into crash safety we first need a good picture of this component.
The etcd WAL has two access modes, read and append, and it can only run in one of them at a time. Mode switching is tied to how etcd uses the WAL: as the name suggests, the WAL exists for crash safety, i.e. recovering data when a node restarts after a sudden crash. The mode switches as follows:
- A newly created WAL is in append mode.
- A just-opened WAL is in read mode; once all the records stored in it have been read out, it switches to append mode (a minimal sketch of both transitions follows).
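A minimal sketch of these two transitions, using only the wal package APIs covered in this article. The import paths assume the etcd v3.4 module layout, and the re-save at the end is there purely to show that appending has become legal:
package walexample

import (
	"go.etcd.io/etcd/wal"
	"go.etcd.io/etcd/wal/walpb"
	"go.uber.org/zap"
)

// walModeLifecycle walks through the two mode-switch rules described above.
func walModeLifecycle(lg *zap.Logger, dir string, metadata []byte) error {
	// Rule 1: a freshly created WAL is already in append mode.
	w, err := wal.Create(lg, dir, metadata)
	if err != nil {
		return err
	}
	w.Close()

	// Rule 2: a just-opened WAL starts in read mode ...
	w, err = wal.Open(lg, dir, walpb.Snapshot{})
	if err != nil {
		return err
	}
	defer w.Close()
	_, st, ents, err := w.ReadAll() // ... and switches to append mode only after ReadAll.
	if err != nil {
		return err
	}
	// Appending is legal from here on; re-saving what was just read is purely illustrative.
	return w.Save(st, ents)
}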
The WAL object
The WAL struct is defined in etcd/wal/wal.go:
// WAL is a logical representation of the stable storage.
// WAL is either in read mode or append mode but not both.
// A newly created WAL is in append mode, and ready for appending records.
// A just opened WAL is in read mode, and ready for reading records.
// The WAL will be ready for appending after reading out all the previous records.
type WAL struct {
lg *zap.Logger
dir string // the living directory of the underlay files
// dirFile is a fd for the wal directory for syncing on Rename
dirFile *os.File
metadata []byte // metadata recorded at the head of each WAL
state raftpb.HardState // hardstate recorded at the head of WAL
start walpb.Snapshot // snapshot to start reading
decoder *decoder // decoder to decode records
readClose func() error // closer for decode reader
mu sync.Mutex
enti uint64 // index of the last entry saved to the wal
encoder *encoder // encoder to encode records
locks []*fileutil.LockedFile // the locked files the WAL holds (the name is increasing)
fp *filePipeline
}
The key fields of the WAL are:
dir: the directory in which the WAL files live
dirFile: a file descriptor for dir, used to fsync the directory on renames
metadata: the byte sequence passed in when the WAL is created; in etcd it is the serialized node ID and cluster ID, and it is written to the head of every WAL file created afterwards
state: the HardState recorded while the WAL is appending. It is updated and promptly flushed to disk whenever the HardState handed out by raft changes, and when the WAL is cut the latest HardState is written at the head of the new file. After a restart, etcd reads the last saved HardState to restore the HardState held in storage before the crash or reboot. Its structure is:
type HardState struct {
Term uint64 `protobuf:"varint,1,opt,name=term" json:"term"`
Vote uint64 `protobuf:"varint,2,opt,name=vote" json:"vote"`
Commit uint64 `protobuf:"varint,3,opt,name=commit" json:"commit"`
XXX_unrecognized []byte `json:"-"`
}
start: metadata of the most recently saved snapshot, i.e. the index and term of the last log entry covered by that snapshot. walpb.Snapshot is defined as:
type Snapshot struct {
Index uint64 `protobuf:"varint,1,opt,name=index" json:"index"`
Term uint64 `protobuf:"varint,2,opt,name=term" json:"term"`
XXX_unrecognized []byte `json:"-"`
}
decoder: a wrapper around the readers of the WAL files, used to read records out of the WAL
readClose: closes the readers backing the decoder and with them the WAL's read mode; it is normally called at the end of ReadAll
enti: the index of the last log entry saved to the WAL
encoder: a wrapper around the current WAL file, used to append records of all kinds to the WAL
locks: the locked WAL files currently held open, ordered by file name
fp: a filePipeline that pre-creates and preallocates the temporary file used by the next cut
WAL file naming
In etcd the WAL is a series of size-limited files, 64MB each by default. When the current file can no longer hold new entries, the WAL creates a new file that carries on from the previous one. To make the relationship between files easy to see, WAL files are named as follows:
{{seq}}-{{index}}.wal
seq: increments from 0; the first WAL file has seq 0, the second 1, and so on
index: the index of the first log entry stored in that WAL file
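Both fields are rendered as zero-padded 16-digit hex, so the very first segment is 0000000000000000-0000000000000000.wal. A sketch of the naming helpers (in etcd they live in wal/util.go; errBadWALName is declared locally here to keep the snippet self-contained):
package walexample

import (
	"errors"
	"fmt"
	"strings"
)

var errBadWALName = errors.New("bad wal name")

// walName renders a segment name from its sequence number and the index of the
// first entry it will hold, e.g. walName(1, 162) -> "0000000000000001-00000000000000a2.wal".
func walName(seq, index uint64) string {
	return fmt.Sprintf("%016x-%016x.wal", seq, index)
}

// parseWALName is the inverse, used when scanning the WAL directory.
func parseWALName(str string) (seq, index uint64, err error) {
	if !strings.HasSuffix(str, ".wal") {
		return 0, 0, errBadWALName
	}
	_, err = fmt.Sscanf(str, "%016x-%016x.wal", &seq, &index)
	return seq, index, err
}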
Creating or loading a WAL
When the etcd server starts, it decides whether a WAL was created before by checking whether any file ending in .wal exists in the WAL directory. If none exists, etcd calls wal.Create to create one; otherwise it uses wal.Open plus wal.ReadAll to reload the previous WAL. The logic lives in NewServer in etcd/etcdserver/server.go, and restartNode is called when a WAL already exists (a simplified sketch of the branching follows). The two cases, creating and loading, are covered below.
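A heavily trimmed sketch of that branching in NewServer; the case structure follows the etcd source, but the bodies are reduced to comments and it is not meant to compile on its own:
haveWAL := wal.Exist(cfg.WALDir())
switch {
case !haveWAL && !cfg.NewCluster:
	// joining an existing cluster with no local WAL: fetch the cluster from
	// the remote peers, then create a WAL via startNode
case !haveWAL && cfg.NewCluster:
	// bootstrapping a brand-new cluster: create a WAL via startNode
	// id, n, s, w = startNode(cfg, cl, cl.MemberIDs())
case haveWAL:
	// restarting with an existing WAL: load the latest snapshot, then reload
	// the WAL via restartNode (wal.Open + ReadAll under the hood)
	// id, cl, n, s, w = restartNode(cfg, snapshot)
}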
Creating a WAL
If the etcd server finds no WAL at startup, it calls startNode, which performs the actual creation through wal.Create:
func startNode(cfg ServerConfig, cl *membership.RaftCluster, ids []types.ID) (id types.ID, n raft.Node, s *raft.MemoryStorage, w *wal.WAL) {
var err error
member := cl.MemberByName(cfg.Name)
metadata := pbutil.MustMarshal(
&pb.Metadata{
NodeID: uint64(member.ID),
ClusterID: uint64(cl.ID()),
},
)
if w, err = wal.Create(cfg.Logger, cfg.WALDir(), metadata); err != nil {
if cfg.Logger != nil {
cfg.Logger.Panic("failed to create WAL", zap.Error(err))
} else {
plog.Panicf("create wal error: %v", err)
}
}
// ...
}
This key part of startNode shows what metadata carries when the WAL is created: the serialized node ID and cluster ID. wal.Create itself looks like this:
// Create creates a WAL ready for appending records. The given metadata is
// recorded at the head of each WAL file, and can be retrieved with ReadAll.
func Create(lg *zap.Logger, dirpath string, metadata []byte) (*WAL, error) {
if Exist(dirpath) {
return nil, os.ErrExist
}
// keep temporary wal directory so WAL initialization appears atomic
tmpdirpath := filepath.Clean(dirpath) + ".tmp"
if fileutil.Exist(tmpdirpath) {
if err := os.RemoveAll(tmpdirpath); err != nil {
return nil, err
}
}
if err := fileutil.CreateDirAll(tmpdirpath); err != nil {
if lg != nil {
lg.Warn(
"failed to create a temporary WAL directory",
zap.String("tmp-dir-path", tmpdirpath),
zap.String("dir-path", dirpath),
zap.Error(err),
)
}
return nil, err
}
p := filepath.Join(tmpdirpath, walName(0, 0))
f, err := fileutil.LockFile(p, os.O_WRONLY|os.O_CREATE, fileutil.PrivateFileMode)
if err != nil {
if lg != nil {
lg.Warn(
"failed to flock an initial WAL file",
zap.String("path", p),
zap.Error(err),
)
}
return nil, err
}
if _, err = f.Seek(0, io.SeekEnd); err != nil {
if lg != nil {
lg.Warn(
"failed to seek an initial WAL file",
zap.String("path", p),
zap.Error(err),
)
}
return nil, err
}
if err = fileutil.Preallocate(f.File, SegmentSizeBytes, true); err != nil {
if lg != nil {
lg.Warn(
"failed to preallocate an initial WAL file",
zap.String("path", p),
zap.Int64("segment-bytes", SegmentSizeBytes),
zap.Error(err),
)
}
return nil, err
}
w := &WAL{
lg: lg,
dir: dirpath,
metadata: metadata,
}
w.encoder, err = newFileEncoder(f.File, 0)
if err != nil {
return nil, err
}
w.locks = append(w.locks, f)
if err = w.saveCrc(0); err != nil {
return nil, err
}
if err = w.encoder.encode(&walpb.Record{Type: metadataType, Data: metadata}); err != nil {
return nil, err
}
if err = w.SaveSnapshot(walpb.Snapshot{}); err != nil {
return nil, err
}
if w, err = w.renameWAL(tmpdirpath); err != nil {
if lg != nil {
lg.Warn(
"failed to rename the temporary WAL directory",
zap.String("tmp-dir-path", tmpdirpath),
zap.String("dir-path", w.dir),
zap.Error(err),
)
}
return nil, err
}
var perr error
defer func() {
if perr != nil {
w.cleanupWAL(lg)
}
}()
// directory was renamed; sync parent dir to persist rename
pdir, perr := fileutil.OpenDir(filepath.Dir(w.dir))
if perr != nil {
if lg != nil {
lg.Warn(
"failed to open the parent data directory",
zap.String("parent-dir-path", filepath.Dir(w.dir)),
zap.String("dir-path", w.dir),
zap.Error(perr),
)
}
return nil, perr
}
if perr = fileutil.Fsync(pdir); perr != nil {
if lg != nil {
lg.Warn(
"failed to fsync the parent data directory file",
zap.String("parent-dir-path", filepath.Dir(w.dir)),
zap.String("dir-path", w.dir),
zap.Error(perr),
)
}
return nil, perr
}
if perr = pdir.Close(); perr != nil {
if lg != nil {
lg.Warn(
"failed to close the parent data directory file",
zap.String("parent-dir-path", filepath.Dir(w.dir)),
zap.String("dir-path", w.dir),
zap.Error(perr),
)
}
return nil, perr
}
return w, nil
}
wal.Create mainly creates the first fixed-size (64MB preallocated) WAL file, named walName(0, 0), and builds an encoder on top of it. No decoder is initialized here, so the WAL cannot be read: it is in append mode, matching the first mode-switch rule at the start of this article. With the encoder in place, etcd writes the node metadata record and an empty snapshot record into the file, and finally returns the *wal.WAL.
Note: the WAL directory is first assembled in a temporary directory, which is then renamed to the real WAL directory. This guarantees that initialization cannot corrupt data already sitting in the real directory: everything is created under the temporary directory, and one atomic rename (mv) swaps it safely into place (a generic sketch of the pattern follows).
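The pattern itself is generic and worth keeping in mind. A self-contained sketch with a hypothetical helper name; this is not etcd's renameWAL, just the same build-in-tmp, rename, fsync-parent idea:
package walexample

import (
	"os"
	"path/filepath"
)

// initAtomically builds a directory's contents under dir+".tmp", swaps it into
// place with a single rename, and fsyncs the parent directory so the rename
// itself survives a crash. Hypothetical helper illustrating the pattern.
func initAtomically(dir string, build func(tmpdir string) error) error {
	tmpdir := filepath.Clean(dir) + ".tmp"
	if err := os.RemoveAll(tmpdir); err != nil { // drop leftovers from an interrupted init
		return err
	}
	if err := os.MkdirAll(tmpdir, 0700); err != nil {
		return err
	}
	if err := build(tmpdir); err != nil {
		return err
	}
	if err := os.Rename(tmpdir, dir); err != nil {
		return err
	}
	pdir, err := os.Open(filepath.Dir(dir))
	if err != nil {
		return err
	}
	defer pdir.Close()
	return pdir.Sync() // persist the directory entry created by the rename
}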
Loading a WAL
If the etcd server finds an existing WAL at startup, and the cluster is not being force-created, it calls restartNode. restartNode uses wal.Open to build a WAL object from the existing files, then calls ReadAll to load the data written before the restart:
func readWAL(lg *zap.Logger, waldir string, snap walpb.Snapshot) (w *wal.WAL, id, cid types.ID, st raftpb.HardState, ents []raftpb.Entry) {
var (
err error
wmetadata []byte
)
repaired := false
for {
if w, err = wal.Open(lg, waldir, snap); err != nil {
if lg != nil {
lg.Fatal("failed to open WAL", zap.Error(err))
} else {
plog.Fatalf("open wal error: %v", err)
}
}
if wmetadata, st, ents, err = w.ReadAll(); err != nil {
w.Close()
// we can only repair ErrUnexpectedEOF and we never repair twice.
if repaired || err != io.ErrUnexpectedEOF {
if lg != nil {
lg.Fatal("failed to read WAL, cannot be repaired", zap.Error(err))
} else {
plog.Fatalf("read wal error (%v) and cannot be repaired", err)
}
}
if !wal.Repair(lg, waldir) {
if lg != nil {
lg.Fatal("failed to repair WAL", zap.Error(err))
} else {
plog.Fatalf("WAL error (%v) cannot be repaired", err)
}
} else {
if lg != nil {
lg.Info("repaired WAL", zap.Error(err))
} else {
plog.Infof("repaired WAL error (%v)", err)
}
repaired = true
}
continue
}
break
}
var metadata pb.Metadata
pbutil.MustUnmarshal(&metadata, wmetadata)
id = types.ID(metadata.NodeID)
cid = types.ID(metadata.ClusterID)
return w, id, cid, st, ents
}
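wal.Open itself is thin: it runs openAtIndex with write mode enabled and keeps a directory fd around so later renames during cut can be fsynced. A rough sketch, close to but not a verbatim copy of the v3.4 source; OpenForRead is the read-only counterpart used by tools that only need to read:
// Open opens the WAL at the given snapshot for reading and, after ReadAll, appending.
func Open(lg *zap.Logger, dirpath string, snap walpb.Snapshot) (*WAL, error) {
	w, err := openAtIndex(lg, dirpath, snap, true) // write = true: keep the file locks held
	if err != nil {
		return nil, err
	}
	// keep a directory fd so renames performed by cut() can be fsynced
	if w.dirFile, err = fileutil.OpenDir(w.dir); err != nil {
		return nil, err
	}
	return w, nil
}

// OpenForRead opens the WAL for reading only (write = false), so the file locks
// can be dropped once ReadAll finishes.
func OpenForRead(lg *zap.Logger, dirpath string, snap walpb.Snapshot) (*WAL, error) {
	return openAtIndex(lg, dirpath, snap, false)
}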
openAtIndex then locates and opens the WAL files needed to read everything after the given snapshot:
func openAtIndex(lg *zap.Logger, dirpath string, snap walpb.Snapshot, write bool) (*WAL, error) {
names, nameIndex, err := selectWALFiles(lg, dirpath, snap)
if err != nil {
return nil, err
}
rs, ls, closer, err := openWALFiles(lg, dirpath, names, nameIndex, write)
if err != nil {
return nil, err
}
// create a WAL ready for reading
w := &WAL{
lg: lg,
dir: dirpath,
start: snap,
decoder: newDecoder(rs...),
readClose: closer,
locks: ls,
}
if write {
// write reuses the file descriptors from read; don't close so
// WAL can append without dropping the file lock
w.readClose = nil
if _, _, err := parseWALName(filepath.Base(w.tail().Name())); err != nil {
closer()
return nil, err
}
w.fp = newFilePipeline(lg, w.dir, SegmentSizeBytes)
}
return w, nil
}
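Inside selectWALFiles, the starting file is chosen by parsing each file name and comparing the index encoded in it with the snapshot index. The core lookup is roughly the searchIndex helper below (a sketch of the wal/util.go helper, with the error path simplified):
// searchIndex walks the sorted segment names backwards and returns the position
// of the last file whose first-entry index is <= the wanted index.
func searchIndex(names []string, index uint64) (int, bool) {
	for i := len(names) - 1; i >= 0; i-- {
		_, curIndex, err := parseWALName(names[i])
		if err != nil {
			return -1, false // the real code logs and panics on a malformed name
		}
		if index >= curIndex {
			return i, true
		}
	}
	return -1, false
}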
openAtIndex first uses the last entry index recorded in snap to find the WAL file that contains that entry (the searchIndex sketch above), then opens that file and every later one and builds the WAL object from them. With a wal.WAL in hand, ReadAll can be called to read out the contents:
// ReadAll reads out records of the current WAL.
// If opened in write mode, it must read out all records until EOF. Or an error
// will be returned.
// If opened in read mode, it will try to read all records if possible.
// If it cannot read out the expected snap, it will return ErrSnapshotNotFound.
// If loaded snap doesn't match with the expected one, it will return
// all the records and error ErrSnapshotMismatch.
// TODO: detect not-last-snap error.
// TODO: maybe loose the checking of match.
// After ReadAll, the WAL will be ready for appending new records.
func (w *WAL) ReadAll() (metadata []byte, state raftpb.HardState, ents []raftpb.Entry, err error) {
w.mu.Lock()
defer w.mu.Unlock()
rec := &walpb.Record{}
decoder := w.decoder
var match bool
for err = decoder.decode(rec); err == nil; err = decoder.decode(rec) {
switch rec.Type {
case entryType:
e := mustUnmarshalEntry(rec.Data)
if e.Index > w.start.Index {
ents = append(ents[:e.Index-w.start.Index-1], e)
}
w.enti = e.Index
case stateType:
state = mustUnmarshalState(rec.Data)
case metadataType:
if metadata != nil && !bytes.Equal(metadata, rec.Data) {
state.Reset()
return nil, state, nil, ErrMetadataConflict
}
metadata = rec.Data
case crcType:
crc := decoder.crc.Sum32()
// current crc of decoder must match the crc of the record.
// do no need to match 0 crc, since the decoder is a new one at this case.
if crc != 0 && rec.Validate(crc) != nil {
state.Reset()
return nil, state, nil, ErrCRCMismatch
}
decoder.updateCRC(rec.Crc)
case snapshotType:
var snap walpb.Snapshot
pbutil.MustUnmarshal(&snap, rec.Data)
if snap.Index == w.start.Index {
if snap.Term != w.start.Term {
state.Reset()
return nil, state, nil, ErrSnapshotMismatch
}
match = true
}
default:
state.Reset()
return nil, state, nil, fmt.Errorf("unexpected block type %d", rec.Type)
}
}
switch w.tail() {
case nil:
// We do not have to read out all entries in read mode.
// The last record maybe a partial written one, so
// ErrunexpectedEOF might be returned.
if err != io.EOF && err != io.ErrUnexpectedEOF {
state.Reset()
return nil, state, nil, err
}
default:
// We must read all of the entries if WAL is opened in write mode.
if err != io.EOF {
state.Reset()
return nil, state, nil, err
}
// decodeRecord() will return io.EOF if it detects a zero record,
// but this zero record may be followed by non-zero records from
// a torn write. Overwriting some of these non-zero records, but
// not all, will cause CRC errors on WAL open. Since the records
// were never fully synced to disk in the first place, it's safe
// to zero them out to avoid any CRC errors from new writes.
if _, err = w.tail().Seek(w.decoder.lastOffset(), io.SeekStart); err != nil {
return nil, state, nil, err
}
if err = fileutil.ZeroToEnd(w.tail().File); err != nil {
return nil, state, nil, err
}
}
err = nil
if !match {
err = ErrSnapshotNotFound
}
// close decoder, disable reading
if w.readClose != nil {
w.readClose()
w.readClose = nil
}
w.start = walpb.Snapshot{}
w.metadata = metadata
if w.tail() != nil {
// create encoder (chain crc with the decoder), enable appending
w.encoder, err = newFileEncoder(w.tail().File, w.decoder.lastCRC())
if err != nil {
return
}
}
w.decoder = nil
return metadata, state, ents, err
}
The WAL's decoder wraps the readers of every opened WAL file, so the for loop walks through each file in turn, decoding records of each type, and at the end returns the accumulated metadata, state and ents.
Note: at the end of ReadAll, since every record in the WAL files has been loaded, the WAL sets its decoder to nil and can no longer be read. This is the second mode-switch rule in action: a just-opened WAL is in read mode, and switches to append mode once its stored data has been read out. The entry bookkeeping inside the loop is also worth a closer look, as illustrated below.
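One line in the entryType case deserves attention: ents = append(ents[:e.Index-w.start.Index-1], e). Entries are kept relative to the snapshot, and a later record may legitimately carry an index that was already read, because raft can overwrite an uncommitted suffix after a leader change; the stale suffix is truncated before the new entry is appended. A toy illustration with hypothetical numbers:
// Entries are positioned relative to the snapshot: raft index w.start.Index+1
// sits at ents[0], w.start.Index+2 at ents[1], and so on.
//
// Suppose w.start.Index = 10 and entries 11, 12, 13 have already been read,
// so len(ents) == 3. The next record carries e.Index = 12 (an overwrite):
//
//	ents = append(ents[:12-10-1], e) // ents[:1] keeps index 11, then the new 12 is appended
//
// Afterwards ents holds indexes 11 and 12, and the stale entry 13 is gone.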
Cutting WAL files
Saving log entries
WAL files have a fixed size. Log entries are written through WAL.Save, which also saves the HardState when a non-empty one is passed in, and calls sync so that both entries and state reach disk. If a Save pushes the current file past 64MB, WAL.cut is triggered. The Save logic is:
func (w *WAL) Save(st raftpb.HardState, ents []raftpb.Entry) error {
w.mu.Lock()
defer w.mu.Unlock()
// short cut, do not call sync
if raft.IsEmptyHardState(st) && len(ents) == 0 {
return nil
}
mustSync := raft.MustSync(st, w.state, len(ents))
// TODO(xiangli): no more reference operator
for i := range ents {
if err := w.saveEntry(&ents[i]); err != nil {
return err
}
}
if err := w.saveState(&st); err != nil {
return err
}
curOff, err := w.tail().Seek(0, io.SeekCurrent)
if err != nil {
return err
}
if curOff < SegmentSizeBytes {
if mustSync {
return w.sync()
}
return nil
}
return w.cut()
}
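Whether sync is called at all is decided by raft.MustSync. The rule follows the raft paper's requirement that term, vote and log entries be on stable storage before responding to RPCs: any new entries, or a change of term or vote, forces an fsync. A sketch of the helper (it lives in the raft package):
// MustSync reports whether an fsync is required before the write can be acknowledged.
func MustSync(st, prevst raftpb.HardState, entsnum int) bool {
	return entsnum != 0 || st.Vote != prevst.Vote || st.Term != prevst.Term
}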
Saving snapshot metadata
Saving snapshot metadata differs from saving log entries: the metadata is small, so no cut is attempted; instead a sync is issued right after the record is written, guaranteeing that the metadata reaches disk:
func (w *WAL) SaveSnapshot(e walpb.Snapshot) error {
b := pbutil.MustMarshal(&e)
w.mu.Lock()
defer w.mu.Unlock()
rec := &walpb.Record{Type: snapshotType, Data: b}
if err := w.encoder.encode(rec); err != nil {
return err
}
// update enti only when snapshot is ahead of last index
if w.enti < e.Index {
w.enti = e.Index
}
return w.sync()
}
Cutting the WAL
Cutting is implemented in WAL.cut. It first seals the previous WAL file, truncating it to its actual size so that the unused part of the preallocated 64MB does not waste space.
Once the old file is sealed, a new one is created, named:
walName(w.seq()+1, w.enti+1)
cut then writes the metadata, the current state and the previous file's CRC into the new file, syncs them to disk, and finally builds a new encoder on the new file and installs it as WAL.encoder, so that subsequent writes land in the new file. The full cut logic is:
// cut closes current file written and creates a new one ready to append.
// cut first creates a temp wal file and writes necessary headers into it.
// Then cut atomically rename temp wal file to a wal file.
func (w *WAL) cut() error {
// close old wal file; truncate to avoid wasting space if an early cut
off, serr := w.tail().Seek(0, io.SeekCurrent)
if serr != nil {
return serr
}
if err := w.tail().Truncate(off); err != nil {
return err
}
if err := w.sync(); err != nil {
return err
}
fpath := filepath.Join(w.dir, walName(w.seq()+1, w.enti+1))
// create a temp wal file with name sequence + 1, or truncate the existing one
newTail, err := w.fp.Open()
if err != nil {
return err
}
// update writer and save the previous crc
w.locks = append(w.locks, newTail)
prevCrc := w.encoder.crc.Sum32()
w.encoder, err = newFileEncoder(w.tail().File, prevCrc)
if err != nil {
return err
}
if err = w.saveCrc(prevCrc); err != nil {
return err
}
if err = w.encoder.encode(&walpb.Record{Type: metadataType, Data: w.metadata}); err != nil {
return err
}
if err = w.saveState(&w.state); err != nil {
return err
}
// atomically move temp wal file to wal file
if err = w.sync(); err != nil {
return err
}
off, err = w.tail().Seek(0, io.SeekCurrent)
if err != nil {
return err
}
if err = os.Rename(newTail.Name(), fpath); err != nil {
return err
}
if err = fileutil.Fsync(w.dirFile); err != nil {
return err
}
// reopen newTail with its new path so calls to Name() match the wal filename format
newTail.Close()
if newTail, err = fileutil.LockFile(fpath, os.O_WRONLY, fileutil.PrivateFileMode); err != nil {
return err
}
if _, err = newTail.Seek(off, io.SeekStart); err != nil {
return err
}
w.locks[len(w.locks)-1] = newTail
prevCrc = w.encoder.crc.Sum32()
w.encoder, err = newFileEncoder(w.tail().File, prevCrc)
if err != nil {
return err
}
if w.lg != nil {
w.lg.Info("created a new WAL segment", zap.String("path", fpath))
} else {
plog.Infof("segmented wal file %v is created", fpath)
}
return nil
}
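Note how the CRC is chained across segments: before any data is written to the new file, the running CRC of the previous file is stored as a crcType record, and the new encoder is seeded with the same value, so ReadAll can verify continuity when its decoder crosses a file boundary. The helper that writes the record is tiny (a sketch of WAL.saveCrc):
// saveCrc appends a crcType record carrying the previous segment's running CRC.
func (w *WAL) saveCrc(prevCrc uint32) error {
	return w.encoder.encode(&walpb.Record{Type: crcType, Crc: prevCrc})
}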
Purging WAL files
After the application layer has applied a certain number of entries, storage triggers a snapshot and deletes from MemoryStorage the entries that were packed into it. (There is an optimization here: not every snapshotted entry is dropped. A tail of them, sized by the SnapshotCatchUpEntries config option, is kept in memory so that lagging followers can catch up straight from memory instead of being sent a full snapshot. A simplified sketch of this compaction follows.)
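A simplified, self-contained sketch of that compaction step; the function name is hypothetical, the logic sits in EtcdServer's snapshot path in etcd, and the import path assumes the v3.4 layout:
package walexample

import "go.etcd.io/etcd/raft"

// compactRaftLog keeps the last catchUpEntries entries in MemoryStorage so slow
// followers can catch up from memory, and compacts everything before them.
func compactRaftLog(ms *raft.MemoryStorage, snapi, catchUpEntries uint64) error {
	compacti := uint64(1)
	if snapi > catchUpEntries {
		compacti = snapi - catchUpEntries
	}
	if err := ms.Compact(compacti); err != nil && err != raft.ErrCompacted {
		// ErrCompacted only means this index was compacted earlier; anything else is real.
		return err
	}
	return nil
}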
After the snapshot has been persisted, the WAL files that are no longer needed can be let go. ReleaseLockTo closes/releases the locks on those files, so that the routine watching the WAL directory can then delete them:
// SaveSnap saves the snapshot to disk and release the locked
// wal files since they will not be used.
func (st *storage) SaveSnap(snap raftpb.Snapshot) error {
walsnap := walpb.Snapshot{
Index: snap.Metadata.Index,
Term: snap.Metadata.Term,
}
err := st.WAL.SaveSnapshot(walsnap)
if err != nil {
return err
}
err = st.Snapshotter.SaveSnap(snap)
if err != nil {
return err
}
return st.WAL.ReleaseLockTo(snap.Metadata.Index)
}
// ReleaseLockTo releases the locks, which has smaller index than the given index
// except the largest one among them.
// For example, if WAL is holding lock 1,2,3,4,5,6, ReleaseLockTo(4) will release
// lock 1,2 but keep 3. ReleaseLockTo(5) will release 1,2,3 but keep 4.
func (w *WAL) ReleaseLockTo(index uint64) error {
w.mu.Lock()
defer w.mu.Unlock()
if len(w.locks) == 0 {
return nil
}
var smaller int
found := false
for i, l := range w.locks {
_, lockIndex, err := parseWALName(filepath.Base(l.Name()))
if err != nil {
return err
}
if lockIndex >= index {
smaller = i - 1
found = true
break
}
}
// if no lock index is greater than the release index, we can
// release lock up to the last one(excluding).
if !found {
smaller = len(w.locks) - 1
}
if smaller <= 0 {
return nil
}
for i := 0; i < smaller; i++ {
if w.locks[i] == nil {
continue
}
w.locks[i].Close()
}
w.locks = w.locks[smaller:]
return nil
}
The actual watching and purging of WAL files is done in EtcdServer's purgeFile:
func (s *EtcdServer) purgeFile() {
var werrc <-chan error
// ...
if s.Cfg.MaxWALFiles > 0 {
werrc = fileutil.PurgeFile(s.getLogger(), s.Cfg.WALDir(), "wal", s.Cfg.MaxWALFiles, purgeFileInterval, s.done)
}
lg := s.getLogger()
select {
// other case ...
case e := <-werrc:
if lg != nil {
lg.Fatal("failed to purge wal file", zap.Error(e))
} else {
plog.Fatalf("failed to purge wal file %v", e)
}
case <-s.stopping:
return
}
}