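Raft[1] is easier to understand than Paxos. The original Raft paper and its many translations already cover the basic concepts in detail, so this series focuses on the etcd[2] raft implementation and analyzes etcd-raft to learn how Raft is put into practice in a real engineering project.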
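Raft roles
Raft is a distributed consensus algorithm built on a state machine replicated through its log. A node in that state machine takes one of three roles:
Leader: accepts proposals, turns each proposal into a log entry, replicates the entry to the followers, commits it, and then propagates the commit progress to all followers
Follower: receives log replication requests from the Leader, persists the entries locally, and reports its local log state back to the Leader
Candidate: an intermediate role in the state machine. When the cluster has no Leader, a Follower turns into a Candidate to run for Leader; a Candidate that wins the election becomes the cluster's Leader
Driven by the term mechanism, a node moves between Leader, Follower, and Candidate. The transitions are shown in the figure:
As the figure shows, every node is a Follower when the cluster first starts. Because the cluster has no Leader yet, a Follower that still has not received a Leader heartbeat when its election timer expires becomes a Candidate. The Candidate opens a new term and campaigns for Leader. If no Candidate wins within that term, each Candidate waits a random interval and starts the next round of the election. The etcd-raft source walkthrough below covers the flow in detail.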
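The role transitions described above can be written down as a small transition function. This is a minimal sketch for illustration only: the `State` and `Event` types and the `next` function are hypothetical names, not part of etcd-raft, and the events are a simplification of what the real state machine reacts to.

```go
package main

import "fmt"

// State is a hypothetical enum for the three Raft roles.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// Event is a hypothetical name for what triggers a role change.
type Event int

const (
	ElectionTimeout  Event = iota // no Leader heartbeat before the election timer expired
	WonElection                   // votes granted by a quorum
	SawHigherTerm                 // a message carried a term larger than ours
	SawCurrentLeader              // a valid Leader for the current term showed up
)

func (s State) String() string {
	return [...]string{"Follower", "Candidate", "Leader"}[s]
}

// next returns the role after handling ev. Events that do not apply
// to the current role leave it unchanged.
func next(s State, ev Event) State {
	switch {
	case ev == SawHigherTerm:
		return Follower
	case s == Follower && ev == ElectionTimeout:
		return Candidate
	case s == Candidate && ev == ElectionTimeout:
		return Candidate // nobody won: start another round in a new term
	case s == Candidate && ev == WonElection:
		return Leader
	case s == Candidate && ev == SawCurrentLeader:
		return Follower
	}
	return s
}

func main() {
	s := Follower
	for _, ev := range []Event{ElectionTimeout, ElectionTimeout, WonElection, SawHigherTerm} {
		s = next(s, ev)
		fmt.Println(s)
	}
}
```

Note that there is no direct Follower to Leader edge: a node always passes through Candidate first, which matches the panic guarding that transition in etcd-raft's becomeLeader.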
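Key objects in etcd
State machine object: raft
raft is the object that implements the algorithm's state machine and holds its core processing logic. It is defined in: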
https://github.com/etcd-io/etcd/blob/v3.4.9/raft/raft.go
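Cluster node object: Node
Node represents a single node inside a raft cluster and is responsible for driving the raft state machine. The Node interface and its implementation, node, are defined in: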
https://github.com/etcd-io/etcd/blob/v3.4.9/raft/node.go
Follower timer
Starting a raft node
A raft cluster node starts with the following code:
// StartNode returns a new Node given configuration and a list of raft peers.
// It appends a ConfChangeAddNode entry for each given peer to the initial log.
func StartNode(c *Config, peers []Peer) Node {
r := newRaft(c)
// become the follower at term 1 and apply initial configuration
// entries of term 1
r.becomeFollower(1, None)
for _, peer := range peers {
cc := pb.ConfChange{Type: pb.ConfChangeAddNode, NodeID: peer.ID, Context: peer.Context}
d, err := cc.Marshal()
if err != nil {
panic("unexpected marshal error")
}
e := pb.Entry{Type: pb.EntryConfChange, Term: 1, Index: r.raftLog.lastIndex() + 1, Data: d}
r.raftLog.append(e)
}
// Mark these initial entries as committed.
// TODO(bdarnell): These entries are still unstable; do we need to preserve
// the invariant that committed < unstable?
r.raftLog.committed = r.raftLog.lastIndex()
// Now apply them, mainly so that the application can call Campaign
// immediately after StartNode in tests. Note that these nodes will
// be added to raft twice: here and when the application's Ready
// loop calls ApplyConfChange. The calls to addNode must come after
// all calls to raftLog.append so progress.next is set after these
// bootstrapping entries (it is an error if we try to append these
// entries since they have already been committed).
// We do not set raftLog.applied so the application will be able
// to observe all conf changes via Ready.CommittedEntries.
for _, peer := range peers {
r.addNode(peer.ID)
}
n := newNode()
n.logger = c.Logger
go n.run(r)
return &n
}
On startup, the node first switches the raft state machine to the follower role through becomeFollower:
func (r *raft) becomeFollower(term uint64, lead uint64) {
r.step = stepFollower
r.reset(term)
r.tick = r.tickElection
r.lead = lead
r.state = StateFollower
r.logger.Infof("%x became follower at term %d", r.id, r.Term)
}
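StartNode then records the other peers and the raft log position (index and term); if the node has never served in a raft cluster before, its log index and term both start at 0. Once this role metadata is set, the node is started: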
n := newNode()
go n.run(r)
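The node's run method is a loop that drives the raft state machine and hands its output to the application layer. The logic is shown below, trimmed to the parts related to elections: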
func (n *node) run(r *raft) {
for {
select {
case m := <-n.recvc:
// filter out response message from unknown From.
if pr := r.getProgress(m.From); pr != nil || !IsResponseMsg(m.Type) {
r.Step(m) // raft never returns an error
}
case <-n.tickc:
r.tick()
case <-n.stop:
close(n.done)
return
}
}
}
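The timer signal for elections is produced by the application layer and delivered to the raft layer through tickc, which ends up calling raft.tick. tick is a function-valued field assigned in the becomeXXX functions; becomeFollower sets it to tickElection, which is implemented as follows: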
// tickElection is run by followers and candidates after r.electionTimeout.
func (r *raft) tickElection() {
r.electionElapsed++
if r.promotable() && r.pastElectionTimeout() {
r.electionElapsed = 0
r.Step(pb.Message{From: r.id, Type: pb.MsgHup})
}
}
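The tick mechanism above can be sketched in isolation. The code below is a simplified model, not the etcd-raft implementation: the `timer` type, `newTimer`, and `onHeartbeat` are hypothetical names, and the sketch assumes the effective timeout is randomized within [electionTimeout, 2*electionTimeout-1] ticks so that nodes do not all time out on the same tick.

```go
package main

import (
	"fmt"
	"math/rand"
)

// timer is a hypothetical, stripped-down stand-in for the election
// fields of the raft struct.
type timer struct {
	electionTimeout   int         // base timeout, in ticks
	randomizedTimeout int         // assumed range: [electionTimeout, 2*electionTimeout-1]
	elapsed           int         // ticks since the last reset
	tick              func() bool // swapped per role, like the tick field on raft
}

// newTimer builds a timer in the follower role. electionTimeout must be > 0.
func newTimer(electionTimeout int, seed int64) *timer {
	rng := rand.New(rand.NewSource(seed))
	t := &timer{electionTimeout: electionTimeout}
	t.randomizedTimeout = electionTimeout + rng.Intn(electionTimeout)
	t.tick = t.tickElection // what becomeFollower does with r.tick
	return t
}

// tickElection advances the counter and reports whether an election
// should start on this tick.
func (t *timer) tickElection() bool {
	t.elapsed++
	if t.elapsed >= t.randomizedTimeout {
		t.elapsed = 0
		return true
	}
	return false
}

// onHeartbeat models a follower hearing from the Leader.
func (t *timer) onHeartbeat() { t.elapsed = 0 }

func main() {
	t := newTimer(10, 1)
	ticks := 1
	for !t.tick() {
		ticks++
	}
	fmt.Printf("election started after %d ticks (randomized timeout %d)\n", ticks, t.randomizedTimeout)
}
```

Keeping tick as a function value means the run loop never needs to know the current role: it calls tick, and the role-specific behavior was chosen when the role changed.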
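Each time tick is called from node.run, electionElapsed is incremented by one, and it is reset to zero whenever the Follower receives a heartbeat from the Leader. If the cluster has no Leader, or a network problem keeps the Follower from receiving Leader heartbeats for too long, an election is triggered and a MsgHup message is fed into the raft state machine. The entry point of the state machine is raft.Step, which handles the campaign message as follows: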
func (r *raft) Step(m pb.Message) error {
// ...
switch m.Type {
case pb.MsgHup:
if r.state != StateLeader {
ents, err := r.raftLog.slice(r.raftLog.applied+1, r.raftLog.committed+1, noLimit)
if err != nil {
r.logger.Panicf("unexpected error getting unapplied entries (%v)", err)
}
if n := numOfPendingConf(ents); n != 0 && r.raftLog.committed > r.raftLog.applied {
r.logger.Warningf("%x cannot campaign at term %d since there are still %d pending configuration changes to apply", r.id, r.Term, n)
return nil
}
r.logger.Infof("%x is starting a new election at term %d", r.id, r.Term)
if r.preVote {
r.campaign(campaignPreElection)
} else {
r.campaign(campaignElection)
}
} else {
r.logger.Debugf("%x ignoring MsgHup because already leader", r.id)
}
// ...
default:
r.step(r, m)
}
return nil
}
Ignoring PreVote for now, the flow enters raft.campaign(campaignElection), where raft switches to the Candidate state:
func (r *raft) campaign(t CampaignType) {
var term uint64
var voteMsg pb.MessageType
if t == campaignPreElection {
r.becomePreCandidate()
voteMsg = pb.MsgPreVote
// PreVote RPCs are sent for the next term before we've incremented r.Term.
term = r.Term + 1
} else {
r.becomeCandidate()
voteMsg = pb.MsgVote
term = r.Term
}
// ...
}
func (r *raft) becomeCandidate() {
// TODO(xiangli) remove the panic when the raft implementation is stable
if r.state == StateLeader {
panic("invalid transition [leader -> candidate]")
}
r.step = stepCandidate
r.reset(r.Term + 1)
r.tick = r.tickElection
r.Vote = r.id
r.state = StateCandidate
r.logger.Infof("%x became candidate at term %d", r.id, r.Term)
}
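The two branches of campaign differ in how they treat the term, which is easy to miss in the excerpt. The following sketch isolates that difference under simplifying assumptions; `node`, `requestTerm`, and the two constants are hypothetical names chosen for this illustration.

```go
package main

import "fmt"

// campaignType is a hypothetical mirror of the two campaign kinds discussed here.
type campaignType string

const (
	preElection campaignType = "pre-election"
	election    campaignType = "election"
)

// node is a hypothetical reduction of the raft struct to the fields
// that matter when choosing the term of a vote request.
type node struct {
	id, term, vote uint64
}

// requestTerm returns the term to put in the outgoing vote request.
func (n *node) requestTerm(t campaignType) uint64 {
	if t == preElection {
		// A pre-vote asks about the next term without changing local state.
		return n.term + 1
	}
	// A real candidacy bumps the local term and votes for itself.
	n.term++
	n.vote = n.id
	return n.term
}

func main() {
	n := &node{id: 1, term: 5}
	fmt.Println(n.requestTerm(preElection), n.term) // request for term 6, local term still 5
	fmt.Println(n.requestTerm(election), n.term)    // request for term 6, local term now 6
}
```

Because a pre-vote leaves the local term alone, a node that cannot win (for example one cut off from the majority) does not push its term ahead of the rest of the cluster.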
After switching to Candidate, raft sends a campaign message to every other node. The message is built as follows:
func (r *raft) campaign(t CampaignType) {
// ...
for id := range r.prs {
if id == r.id {
continue
}
r.logger.Infof("%x [logterm: %d, index: %d] sent %s request to %x at term %d",
r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), voteMsg, id, r.Term)
var ctx []byte
if t == campaignTransfer {
ctx = []byte(t)
}
r.send(pb.Message{Term: term, To: id, Type: voteMsg, Index: r.raftLog.lastIndex(), LogTerm: r.raftLog.lastTerm(), Context: ctx})
}
}
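Term: the term being campaigned for; becomeCandidate increments the node's term by 1
To: ID of the receiving node; how the message is actually sent is up to the application's transport layer
Type: message type; for a campaign it is the value of voteMsg (MsgVote, or MsgPreVote for a pre-vote)
Index: index of the node's last log entry
LogTerm: term of the node's last log entry
Context (ctx): extra data; not needed unless leadership is being transferred
Handling campaign messages
Once the candidate has sent its messages, each receiving node processes them in raft.Step. The handling of campaign messages in raft.Step looks like this: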
func (r *raft) Step(m pb.Message) error {
// Handle the message term, which may result in our stepping down to a follower.
switch {
case m.Term == 0:
// local message
case m.Term > r.Term:
// The Leader's heartbeat is still healthy: unless this is a leader transfer, ignore the campaign message
if m.Type == pb.MsgVote || m.Type == pb.MsgPreVote {
force := bytes.Equal(m.Context, []byte(campaignTransfer))
inLease := r.checkQuorum && r.lead != None && r.electionElapsed < r.electionTimeout
if !force && inLease {
// If a server receives a RequestVote request within the minimum election timeout
// of hearing from a current leader, it does not update its term or grant its vote
r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] ignored %s from %x [logterm: %d, index: %d] at term %d: lease is not expired (remaining ticks: %d)",
r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.Type, m.From, m.LogTerm, m.Index, r.Term, r.electionTimeout-r.electionElapsed)
return nil
}
}
switch {
case m.Type == pb.MsgPreVote:
// Never change our term in response to a PreVote
case m.Type == pb.MsgPreVoteResp && !m.Reject:
// We send pre-vote requests with a term in our future. If the
// pre-vote is granted, we will increment our term when we get a
// quorum. If it is not, the term comes from the node that
// rejected our vote so we should become a follower at the new
// term.
default:
// Any other message with a higher term (a real MsgVote included): this node steps down to Follower
r.logger.Infof("%x [term: %d] received a %s message with higher term from %x [term: %d]",
r.id, r.Term, m.Type, m.From, m.Term)
if m.Type == pb.MsgApp || m.Type == pb.MsgHeartbeat || m.Type == pb.MsgSnap {
r.becomeFollower(m.Term, m.From)
} else {
r.becomeFollower(m.Term, None)
}
}
case m.Term < r.Term: // A heartbeat or append request with a lower term may come from a stale Leader; reply directly so it sees the higher term and steps down to Follower
if r.checkQuorum && (m.Type == pb.MsgHeartbeat || m.Type == pb.MsgApp) {
// We have received messages from a leader at a lower term. It is possible
// that these messages were simply delayed in the network, but this could
// also mean that this node has advanced its term number during a network
// partition, and it is now unable to either win an election or to rejoin
// the majority on the old term. If checkQuorum is false, this will be
// handled by incrementing term numbers in response to MsgVote with a
// higher term, but if checkQuorum is true we may not advance the term on
// MsgVote and must generate other messages to advance the term. The net
// result of these two features is to minimize the disruption caused by
// nodes that have been removed from the cluster's configuration: a
// removed node will send MsgVotes (or MsgPreVotes) which will be ignored,
// but it will not receive MsgApp or MsgHeartbeat, so it will not create
// disruptive term increases
r.send(pb.Message{To: m.From, Type: pb.MsgAppResp})
} else {
// ignore other cases
r.logger.Infof("%x [term: %d] ignored a %s message with lower term from %x [term: %d]",
r.id, r.Term, m.Type, m.From, m.Term)
}
return nil
}
switch m.Type {
case pb.MsgHup:
// ...
case pb.MsgVote, pb.MsgPreVote:
if r.isLearner {
// TODO: learner may need to vote, in case of node down when confchange.
r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] ignored %s from %x [logterm: %d, index: %d] at term %d: learner can not vote",
r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.Type, m.From, m.LogTerm, m.Index, r.Term)
return nil
}
// The m.Term > r.Term clause is for MsgPreVote. For MsgVote m.Term should
// always equal r.Term.
if (r.Vote == None || m.Term > r.Term || r.Vote == m.From) && r.raftLog.isUpToDate(m.Index, m.LogTerm) {
r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] cast %s for %x [logterm: %d, index: %d] at term %d",
r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.Type, m.From, m.LogTerm, m.Index, r.Term)
// When responding to Msg{Pre,}Vote messages we include the term
// from the message, not the local term. To see why consider the
// case where a single node was previously partitioned away and
// it's local term is now of date. If we include the local term
// (recall that for pre-votes we don't update the local term), the
// (pre-)campaigning node on the other end will proceed to ignore
// the message (it ignores all out of date messages).
// The term in the original message and current local term are the
// same in the case of regular votes, but different for pre-votes.
r.send(pb.Message{To: m.From, Term: m.Term, Type: voteRespMsgType(m.Type)})
if m.Type == pb.MsgVote {
// Only record real votes.
r.electionElapsed = 0
r.Vote = m.From
}
} else {
r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] rejected %s from %x [logterm: %d, index: %d] at term %d",
r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.Type, m.From, m.LogTerm, m.Index, r.Term)
r.send(pb.Message{To: m.From, Term: r.Term, Type: voteRespMsgType(m.Type), Reject: true})
}
default:
r.step(r, m)
}
return nil
}
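The lease check at the top of Step decides whether a higher-term vote request is dropped before any term handling happens. It can be isolated as a small predicate. This is a sketch: the `follower` type and `ignoresVote` are hypothetical names, `none` stands in for raft.None, and only the four fields used by the check are modeled.

```go
package main

import "fmt"

const none uint64 = 0 // hypothetical stand-in for raft.None

// follower is a hypothetical reduction of the raft struct to the
// fields read by the lease check.
type follower struct {
	checkQuorum     bool
	lead            uint64
	electionElapsed int
	electionTimeout int
}

// ignoresVote reports whether a vote request with a higher term is
// dropped without updating the term or granting the vote.
// leaderTransfer is true when the request was forced by a leader transfer.
func (f follower) ignoresVote(leaderTransfer bool) bool {
	inLease := f.checkQuorum && f.lead != none && f.electionElapsed < f.electionTimeout
	return !leaderTransfer && inLease
}

func main() {
	f := follower{checkQuorum: true, lead: 2, electionElapsed: 3, electionTimeout: 10}
	fmt.Println(f.ignoresVote(false)) // heard from the leader recently
	fmt.Println(f.ignoresVote(true))  // a leader transfer overrides the lease
	f.electionElapsed = 10
	fmt.Println(f.ignoresVote(false)) // lease expired
}
```

The predicate only applies when checkQuorum is enabled; without it, a higher-term vote request always proceeds to the term handling that follows.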
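Much of the term and election logic in Step exists to handle abnormal situations, which is crucial for making raft work in practice; the comments in the code briefly explain those cases. What follows covers how raft handles a campaign message in the normal case. A node that receives a campaign message grants its vote when the following conditions hold:
- It has not voted for another candidate, or it has already voted for this same candidate (in etcd, recording raft state and sending messages are asynchronous, so a vote may be recorded without the response having gone out; the recorded vote is cleared in reset)
- Alternatively, the message's term is greater than the node's current term (per the code comment, this clause exists for MsgPreVote)
- In every case, the candidate's last log entry must be at least as new as the node's own. Log entries are compared as follows:
(1) the entry with the larger term is newer
(2) with equal terms, the entry with the larger index is newer
(3) with equal term and equal index, the two are the same entry
If these conditions are met, the node sends the candidate a response whose type is returned by voteRespMsgType (MsgVoteResp or MsgPreVoteResp) to support it.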
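The voting conditions and the log comparison rule above translate almost directly into code. The sketch below is self-contained and uses hypothetical names (`logPos`, `voter`, `upToDate`, `grants`); it mirrors the shape of the condition in Step but is not the etcd-raft code itself.

```go
package main

import "fmt"

const none uint64 = 0 // hypothetical stand-in for raft.None

// logPos identifies the last entry of a log by term and index.
type logPos struct{ term, index uint64 }

// upToDate reports whether the candidate's last entry is at least as
// new as the local one: larger term wins, then larger or equal index.
func upToDate(cand, local logPos) bool {
	return cand.term > local.term ||
		(cand.term == local.term && cand.index >= local.index)
}

// voter is a hypothetical reduction of a node to its voting state.
type voter struct {
	term uint64
	vote uint64 // none if no vote has been recorded in this term
	last logPos
}

// grants reports whether the vote request from `from` is granted.
func (v voter) grants(from, msgTerm uint64, candLast logPos) bool {
	free := v.vote == none || msgTerm > v.term || v.vote == from
	return free && upToDate(candLast, v.last)
}

func main() {
	v := voter{term: 3, vote: none, last: logPos{term: 2, index: 7}}
	fmt.Println(v.grants(2, 3, logPos{term: 2, index: 7})) // same last entry
	fmt.Println(v.grants(2, 3, logPos{term: 2, index: 6})) // candidate's log is behind
	fmt.Println(v.grants(2, 3, logPos{term: 3, index: 1})) // larger term beats a longer log
}
```

The up-to-date check is what keeps a node with a stale log from becoming Leader, regardless of how high its term is.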
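Collecting and counting votes
When the candidate receives the vote response, raft.Step dispatches it to the candidate's step method. Since step differs by role, the response ends up in stepCandidate, whose main flow is: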
// stepCandidate is shared by StateCandidate and StatePreCandidate; the difference is
// whether they respond to MsgVoteResp or MsgPreVoteResp.
func stepCandidate(r *raft, m pb.Message) {
// Only handle vote responses corresponding to our candidacy (while in
// StateCandidate, we may get stale MsgPreVoteResp messages in this term from
// our pre-candidate state).
var myVoteRespType pb.MessageType
if r.state == StatePreCandidate {
myVoteRespType = pb.MsgPreVoteResp
} else {
myVoteRespType = pb.MsgVoteResp
}
switch m.Type {
// ...
case myVoteRespType:
gr := r.poll(m.From, m.Type, !m.Reject)
r.logger.Infof("%x [quorum:%d] has received %d %s votes and %d vote rejections", r.id, r.quorum(), gr, m.Type, len(r.votes)-gr)
switch r.quorum() {
case gr:
if r.state == StatePreCandidate {
r.campaign(campaignElection)
} else {
r.becomeLeader()
r.bcastAppend()
}
case len(r.votes) - gr:
r.becomeFollower(r.Term, None)
}
}
}
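The tally that stepCandidate performs through poll and quorum can be sketched on its own. The names below (`tally`, `record`, `Outcome`) are hypothetical, and the sketch compares with >= where the excerpt switches on equality; since counts grow by one response at a time, the two give the same result.

```go
package main

import "fmt"

// Outcome is a hypothetical result type for the vote count.
type Outcome int

const (
	Pending Outcome = iota
	Won
	Lost
)

// tally tracks one election round on the candidate's side.
type tally struct {
	peers int             // cluster size, including the candidate
	votes map[uint64]bool // first response seen from each node
}

func newTally(peers int) *tally {
	return &tally{peers: peers, votes: make(map[uint64]bool)}
}

// quorum is a strict majority of the cluster.
func (t *tally) quorum() int { return t.peers/2 + 1 }

// record stores the first response from id and returns the state of
// the election. Later responses from the same id are ignored.
func (t *tally) record(id uint64, granted bool) Outcome {
	if _, seen := t.votes[id]; !seen {
		t.votes[id] = granted
	}
	yes := 0
	for _, v := range t.votes {
		if v {
			yes++
		}
	}
	no := len(t.votes) - yes
	switch {
	case yes >= t.quorum():
		return Won
	case no >= t.quorum():
		return Lost
	}
	return Pending
}

func main() {
	t := newTally(5)
	fmt.Println(t.record(1, true) == Pending) // the candidate's own vote
	fmt.Println(t.record(2, false) == Pending)
	fmt.Println(t.record(3, true) == Pending)
	fmt.Println(t.record(4, true) == Won)
}
```

A rejection quorum ends the round early: once a majority has refused, no remaining response can change the outcome, so the candidate goes back to Follower.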
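After receiving a myVoteRespType message, the candidate first uses poll to count the votes it has. If the number of rejections reaches quorum, it switches back to Follower; if the number of granted votes reaches quorum, it becomes Leader. A new Leader initializes the log replication progress it tracks for each Follower and appends an empty entry to its log. The reason for the empty entry is that raft's rule of committing an entry once it has been replicated to a quorum only applies to entries from the current term; entries left uncommitted from earlier terms can only be committed indirectly, by committing an entry from the current term (see section 5.4 of the raft paper). becomeLeader proceeds as follows: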
func (r *raft) becomeLeader() {
// TODO(xiangli) remove the panic when the raft implementation is stable
if r.state == StateFollower {
panic("invalid transition [follower -> leader]")
}
r.step = stepLeader
r.reset(r.Term)
r.tick = r.tickHeartbeat
r.lead = r.id
r.state = StateLeader
// Followers enter replicate mode when they've been successfully probed
// (perhaps after having received a snapshot as a result). The leader is
// trivially in this state. Note that r.reset() has initialized this
// progress with the last index already.
r.prs[r.id].becomeReplicate()
// Conservatively set the pendingConfIndex to the last index in the
// log. There may or may not be a pending config change, but it's
// safe to delay any future proposals until we commit all our
// pending log entries, and scanning the entire tail of the log
// could be expensive.
r.pendingConfIndex = r.raftLog.lastIndex()
emptyEnt := pb.Entry{Data: nil}
if !r.appendEntry(emptyEnt) {
// This won't happen because we just called reset() above.
r.logger.Panic("empty entry was dropped")
}
// As a special case, don't count the initial empty entry towards the
// uncommitted log quota. This is because we want to preserve the
// behavior of allowing one entry larger than quota if the current
// usage is zero.
r.reduceUncommittedSize([]pb.Entry{emptyEnt})
r.logger.Infof("%x became leader at term %d", r.id, r.Term)
}
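The reason for the empty entry becomes clearer with a small model of the commit rule. The sketch below is an illustration under simplifying assumptions, not etcd-raft code: `commitIndex` is a hypothetical function, `match` holds the highest replicated index per node (Leader included), and `terms[i]` is the term of the entry at index i, with terms[0] unused.

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the new commit index for a Leader in currentTerm.
// An index advances the commit point only if it is stored on a quorum
// AND the entry at that index was created in currentTerm.
func commitIndex(match []uint64, terms []uint64, currentTerm, committed uint64) uint64 {
	sorted := append([]uint64(nil), match...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(sorted)/2 + 1
	candidate := sorted[quorum-1] // highest index present on at least a quorum
	if candidate > committed && terms[candidate] == currentTerm {
		return candidate
	}
	return committed
}

func main() {
	// Index 2 was written in term 1 and never committed; the new Leader is in term 2.
	terms := []uint64{0, 1, 1}
	fmt.Println(commitIndex([]uint64{2, 2, 1}, terms, 2, 1)) // stays at 1: entry 2 is from an old term

	// The Leader appends its empty entry at index 3 in term 2 and replicates it.
	terms = append(terms, 2)
	fmt.Println(commitIndex([]uint64{3, 3, 1}, terms, 2, 1)) // moves to 3, which also commits index 2
}
```

Without the empty entry, a freshly elected Leader with no new proposals would have nothing from its own term to commit, and older entries could stay uncommitted indefinitely.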
References
- Raft paper: raft.github.io/raft.pdf
- etcd: github.com/etcd-io/etc…