Raft (Part 1): Leader Election

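Raft [1] is easier to understand than Paxos. The original Raft paper and its translations already cover the basic concepts in detail, so this series focuses on the etcd [2] raft implementation and uses it to study how Raft is put into practice in a real engineering project.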

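Raft roles

Raft is a distributed consensus algorithm built on a replicated log state machine. A node in a Raft cluster is in one of three roles:

Leader: accepts proposals, turns them into log entries, replicates those entries to the followers, commits them, and then propagates the commit progress to all followers.

Follower: receives log replication requests from the Leader, persists the entries locally, and reports the state of its local log back to the Leader.

Candidate: an intermediate role used for leader election. When the cluster has no Leader, a Follower turns into a Candidate and campaigns; a Candidate that wins the election becomes the cluster's Leader.

Driven by the term mechanism, a node moves between the Leader, Follower, and Candidate roles. The transitions are shown in the figure: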

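As the figure shows, every node starts as a Follower when the cluster boots. Since there is no Leader yet, a Follower whose election timer expires without receiving a Leader heartbeat becomes a Candidate. The Candidate starts a new term and campaigns for leadership. If no Candidate wins in that term, each Candidate waits for a randomized period and then starts the next round. The details are covered in the etcd-raft source walkthrough below.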
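The role transitions described above can be modeled as a tiny state machine. This is only a sketch: the Role and Event types and the next function are names invented for illustration (etcd-raft uses StateType and the becomeXXX methods), and it deliberately ignores PreVote and the case where a Candidate discovers an already elected Leader in its own term.

```go
package main

import "fmt"

// Role is a hypothetical enum for this sketch; etcd-raft uses StateType.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

func (r Role) String() string {
	return [...]string{"Follower", "Candidate", "Leader"}[r]
}

// Event names are assumptions made for this sketch, not etcd identifiers.
type Event int

const (
	ElectionTimeout Event = iota // no heartbeat within the election timeout
	WonElection                  // votes granted by a quorum
	SawHigherTerm                // a message carrying a larger term arrived
)

// next returns the role after ev. Events that do not apply to the
// current role leave it unchanged.
func next(r Role, ev Event) Role {
	switch {
	case ev == SawHigherTerm:
		return Follower
	case ev == ElectionTimeout && r != Leader:
		return Candidate // a Candidate that times out starts a new round
	case ev == WonElection && r == Candidate:
		return Leader
	}
	return r
}

func main() {
	r := Follower
	for _, ev := range []Event{ElectionTimeout, WonElection, SawHigherTerm} {
		r = next(r, ev)
		fmt.Println(r)
	}
}
```

Run as is, the program walks one node through Follower, Candidate, Leader, and back to Follower.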

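Key objects in etcd

State machine object: raft

The raft object implements the Raft state machine and contains the core processing logic of the algorithm. It is defined in the following file: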

https://github.com/etcd-io/etcd/blob/v3.4.9/raft/raft.go

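Cluster node object: Node

A Node represents one node of the Raft cluster and is responsible for driving the raft state machine. The Node interface and its implementation, node, are defined in the following file: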

https://github.com/etcd-io/etcd/blob/v3.4.9/raft/node.go

Follower timer

Starting a raft node

A cluster node is started with the following code:

// StartNode returns a new Node given configuration and a list of raft peers.
// It appends a ConfChangeAddNode entry for each given peer to the initial log.
func StartNode(c *Config, peers []Peer) Node {
	r := newRaft(c)
	// become the follower at term 1 and apply initial configuration
	// entries of term 1
	r.becomeFollower(1, None)
	for _, peer := range peers {
		cc := pb.ConfChange{Type: pb.ConfChangeAddNode, NodeID: peer.ID, Context: peer.Context}
		d, err := cc.Marshal()
		if err != nil {
			panic("unexpected marshal error")
		}
		e := pb.Entry{Type: pb.EntryConfChange, Term: 1, Index: r.raftLog.lastIndex() + 1, Data: d}
		r.raftLog.append(e)
	}
	// Mark these initial entries as committed.
	// TODO(bdarnell): These entries are still unstable; do we need to preserve
	// the invariant that committed < unstable?
	r.raftLog.committed = r.raftLog.lastIndex()
	// Now apply them, mainly so that the application can call Campaign
	// immediately after StartNode in tests. Note that these nodes will
	// be added to raft twice: here and when the application's Ready
	// loop calls ApplyConfChange. The calls to addNode must come after
	// all calls to raftLog.append so progress.next is set after these
	// bootstrapping entries (it is an error if we try to append these
	// entries since they have already been committed).
	// We do not set raftLog.applied so the application will be able
	// to observe all conf changes via Ready.CommittedEntries.
	for _, peer := range peers {
		r.addNode(peer.ID)
	}

	n := newNode()
	n.logger = c.Logger
	go n.run(r)
	return &n
}

On startup, the node first switches the raft state machine to the follower role through becomeFollower:

func (r *raft) becomeFollower(term uint64, lead uint64) {
	r.step = stepFollower 
	r.reset(term)
	r.tick = r.tickElection 
	r.lead = lead 
	r.state = StateFollower
	r.logger.Infof("%x became follower at term %d", r.id, r.Term)
}

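StartNode also records the other peers and the raft log position (index and term). If the node has never served in a raft cluster before, both the log index and the term are 0. Once this metadata is in place, the node is started: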

n := newNode()
go n.run(r)

The run method of node is a loop that drives the raft state machine and hands its output to the application layer. The code below keeps only the parts relevant to election:

func (n *node) run(r *raft) {
	for {
		select {
		case m := <-n.recvc:
			// filter out response message from unknown From.
			if pr := r.getProgress(m.From); pr != nil || !IsResponseMsg(m.Type) {
				r.Step(m) // raft never returns an error
			}
		case <-n.tickc:
			r.tick()
		case <-n.stop:
			close(n.done)
			return
		}
	}
}

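The timer signal used for elections is generated by the application layer and delivered to the raft layer through tickc, which calls raft's tick function. tick is a function-valued field assigned in the becomeXXX functions; becomeFollower sets it to tickElection, which is implemented as follows: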

// tickElection is run by followers and candidates after r.electionTimeout.
func (r *raft) tickElection() {
	r.electionElapsed++

	if r.promotable() && r.pastElectionTimeout() {
		r.electionElapsed = 0
		r.Step(pb.Message{From: r.id, Type: pb.MsgHup})
	}
}
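To make the tick mechanism concrete, here is a toy model of a function-valued tick field that is swapped on role changes and of an elapsed counter that a heartbeat resets. The node struct below and its fields (timeout, campaigns, heartbeats) are invented for this sketch and are not the etcd types; the sketch also uses a fixed timeout for simplicity, whereas a real implementation would randomize it.

```go
package main

import "fmt"

// node is a toy stand-in for the raft struct; all names are illustrative.
type node struct {
	tick            func() // swapped when the role changes, like r.tick
	electionElapsed int
	timeout         int // election timeout measured in ticks
	campaigns       int // how many times an election was triggered
	heartbeats      int // how many heartbeat rounds a leader sent
}

func (n *node) becomeFollower() {
	n.electionElapsed = 0
	n.tick = n.tickElection
}

func (n *node) becomeLeader() {
	n.tick = n.tickHeartbeat
}

// tickElection counts ticks and triggers a campaign once the timeout passes.
func (n *node) tickElection() {
	n.electionElapsed++
	if n.electionElapsed >= n.timeout {
		n.electionElapsed = 0
		n.campaigns++
	}
}

// tickHeartbeat is what a leader runs on each tick in this sketch.
func (n *node) tickHeartbeat() { n.heartbeats++ }

// onHeartbeat models a follower hearing from the leader.
func (n *node) onHeartbeat() { n.electionElapsed = 0 }

func main() {
	n := &node{timeout: 3}
	n.becomeFollower()
	n.tick()
	n.tick()
	n.onHeartbeat() // the leader is alive, so the counter starts over
	n.tick()
	n.tick()
	fmt.Println("campaigns after heartbeat:", n.campaigns)
	n.tick() // third tick without a heartbeat
	fmt.Println("campaigns after timeout:", n.campaigns)
}
```

The caller never needs to know the current role: it only calls tick, and the role-specific behavior is whatever function the last becomeXXX call installed.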

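Each tick delivered through node.run increments electionElapsed, and electionElapsed is reset to zero when the Follower receives a heartbeat from the Leader. If the cluster has no Leader, or a network problem keeps the Follower from receiving Leader heartbeats for too long, an election is triggered: a MsgHup message is passed to the raft state machine. The entry point of the state machine is raft.Step, which handles this message as follows: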

func (r *raft) Step(m pb.Message) error {
	// ...
	switch m.Type {
	case pb.MsgHup:
		if r.state != StateLeader {
			ents, err := r.raftLog.slice(r.raftLog.applied+1, r.raftLog.committed+1, noLimit)
			if err != nil {
				r.logger.Panicf("unexpected error getting unapplied entries (%v)", err)
			}
			if n := numOfPendingConf(ents); n != 0 && r.raftLog.committed > r.raftLog.applied {
				r.logger.Warningf("%x cannot campaign at term %d since there are still %d pending configuration changes to apply", r.id, r.Term, n)
				return nil
			}

			r.logger.Infof("%x is starting a new election at term %d", r.id, r.Term)
			if r.preVote {
				r.campaign(campaignPreElection)
			} else {
				r.campaign(campaignElection)
			}
		} else {
			r.logger.Debugf("%x ignoring MsgHup because already leader", r.id)
		}

	// ...
	default:
		r.step(r, m)
	}
	return nil
}

Ignoring PreVote for now, the flow enters raft.campaign(campaignElection), where raft switches to the Candidate state:

func (r *raft) campaign(t CampaignType) {
	var term uint64
	var voteMsg pb.MessageType
	if t == campaignPreElection {
		r.becomePreCandidate()
		voteMsg = pb.MsgPreVote
		// PreVote RPCs are sent for the next term before we've incremented r.Term.
		term = r.Term + 1
	} else {
		r.becomeCandidate()
		voteMsg = pb.MsgVote
		term = r.Term
	}
	// ...
}

func (r *raft) becomeCandidate() {
	// TODO(xiangli) remove the panic when the raft implementation is stable
	if r.state == StateLeader {
		panic("invalid transition [leader -> candidate]")
	}
	r.step = stepCandidate
	r.reset(r.Term + 1)
	r.tick = r.tickElection
	r.Vote = r.id
	r.state = StateCandidate
	r.logger.Infof("%x became candidate at term %d", r.id, r.Term)
}

After switching to Candidate, raft sends a vote request to every other node. The request is built as follows:

func (r *raft) campaign(t CampaignType) {
	// ...
	for id := range r.prs {
		if id == r.id {
			continue
		}
		r.logger.Infof("%x [logterm: %d, index: %d] sent %s request to %x at term %d",
			r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), voteMsg, id, r.Term)

		var ctx []byte
		if t == campaignTransfer {
			ctx = []byte(t)
		}
		r.send(pb.Message{Term: term, To: id, Type: voteMsg, Index: r.raftLog.lastIndex(), LogTerm: r.raftLog.lastTerm(), Context: ctx})
	}
}

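Term: the term being campaigned for; becomeCandidate has already incremented the node's term by 1

To: the ID of the receiving node; how the message is actually delivered is up to the application's transport layer

Type: the message type; for a vote request it is voteMsg (pb.MsgVote, or pb.MsgPreVote in a pre-election)

Index: the index of the node's last log entry

LogTerm: the term of the node's last log entry

Context: extra data; it is only used for leadership transfer and can be ignored otherwise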

Handling vote requests

After the candidate has sent its requests, the receiving nodes process them in raft.Step. The relevant flow is:

func (r *raft) Step(m pb.Message) error {
	// Handle the message term, which may result in our stepping down to a follower.
	switch {
	case m.Term == 0:
		// local message
         
	case m.Term > r.Term:
		// If Leader heartbeats are still arriving and this is not a leadership transfer, ignore the vote request
		if m.Type == pb.MsgVote || m.Type == pb.MsgPreVote {
			force := bytes.Equal(m.Context, []byte(campaignTransfer))
			inLease := r.checkQuorum && r.lead != None && r.electionElapsed < r.electionTimeout
			if !force && inLease {
				// If a server receives a RequestVote request within the minimum election timeout
				// of hearing from a current leader, it does not update its term or grant its vote
				r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] ignored %s from %x [logterm: %d, index: %d] at term %d: lease is not expired (remaining ticks: %d)",
					r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.Type, m.From, m.LogTerm, m.Index, r.Term, r.electionTimeout-r.electionElapsed)
				return nil
			}
		}
		switch {
		case m.Type == pb.MsgPreVote:
			// Never change our term in response to a PreVote
		case m.Type == pb.MsgPreVoteResp && !m.Reject:
			// We send pre-vote requests with a term in our future. If the
			// pre-vote is granted, we will increment our term when we get a
			// quorum. If it is not, the term comes from the node that
			// rejected our vote so we should become a follower at the new
			// term.
		default:
			// Any other message with a higher term (MsgVote included): this node steps down to Follower
			r.logger.Infof("%x [term: %d] received a %s message with higher term from %x [term: %d]",
				r.id, r.Term, m.Type, m.From, m.Term)
			if m.Type == pb.MsgApp || m.Type == pb.MsgHeartbeat || m.Type == pb.MsgSnap {
				r.becomeFollower(m.Term, m.From)
			} else {
				r.becomeFollower(m.Term, None)
			}
		}

	case m.Term < r.Term: // a heartbeat or append request with a lower term may come from a stale Leader; reply directly so that it sees the higher term and steps down to Follower
		if r.checkQuorum && (m.Type == pb.MsgHeartbeat || m.Type == pb.MsgApp) {
			// We have received messages from a leader at a lower term. It is possible
			// that these messages were simply delayed in the network, but this could
			// also mean that this node has advanced its term number during a network
			// partition, and it is now unable to either win an election or to rejoin
			// the majority on the old term. If checkQuorum is false, this will be
			// handled by incrementing term numbers in response to MsgVote with a
			// higher term, but if checkQuorum is true we may not advance the term on
			// MsgVote and must generate other messages to advance the term. The net
			// result of these two features is to minimize the disruption caused by
			// nodes that have been removed from the cluster's configuration: a
			// removed node will send MsgVotes (or MsgPreVotes) which will be ignored,
			// but it will not receive MsgApp or MsgHeartbeat, so it will not create
			// disruptive term increases
			r.send(pb.Message{To: m.From, Type: pb.MsgAppResp})
		} else {
			// ignore other cases
			r.logger.Infof("%x [term: %d] ignored a %s message with lower term from %x [term: %d]",
				r.id, r.Term, m.Type, m.From, m.Term)
		}
		return nil
	}

	switch m.Type {
	case pb.MsgHup:
		// ...
	case pb.MsgVote, pb.MsgPreVote:
		if r.isLearner {
			// TODO: learner may need to vote, in case of node down when confchange.
			r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] ignored %s from %x [logterm: %d, index: %d] at term %d: learner can not vote",
				r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.Type, m.From, m.LogTerm, m.Index, r.Term)
			return nil
		}
		// The m.Term > r.Term clause is for MsgPreVote. For MsgVote m.Term should
		// always equal r.Term.
		if (r.Vote == None || m.Term > r.Term || r.Vote == m.From) && r.raftLog.isUpToDate(m.Index, m.LogTerm) {
			r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] cast %s for %x [logterm: %d, index: %d] at term %d",
				r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.Type, m.From, m.LogTerm, m.Index, r.Term)
			// When responding to Msg{Pre,}Vote messages we include the term
			// from the message, not the local term. To see why consider the
			// case where a single node was previously partitioned away and
			// its local term is now out of date. If we include the local term
			// (recall that for pre-votes we don't update the local term), the
			// (pre-)campaigning node on the other end will proceed to ignore
			// the message (it ignores all out of date messages).
			// The term in the original message and current local term are the
			// same in the case of regular votes, but different for pre-votes.
			r.send(pb.Message{To: m.From, Term: m.Term, Type: voteRespMsgType(m.Type)})
			if m.Type == pb.MsgVote {
				// Only record real votes.
				r.electionElapsed = 0
				r.Vote = m.From
			}
		} else {
			r.logger.Infof("%x [logterm: %d, index: %d, vote: %x] rejected %s from %x [logterm: %d, index: %d] at term %d",
				r.id, r.raftLog.lastTerm(), r.raftLog.lastIndex(), r.Vote, m.Type, m.From, m.LogTerm, m.Index, r.Term)
			r.send(pb.Message{To: m.From, Term: r.Term, Type: voteRespMsgType(m.Type), Reject: true})
		}

	default:
		r.step(r, m)
	}
	return nil
}
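The lease check at the top of Step can be isolated into a small predicate. The sketch below mirrors the `!force && inLease` condition from the code above; the follower struct, the hasLeader field (standing in for r.lead != None), and the function names are assumptions made for illustration.

```go
package main

import "fmt"

// follower holds only the fields the lease check reads. The struct and
// function names are made up for this sketch.
type follower struct {
	checkQuorum     bool
	hasLeader       bool // stands in for r.lead != None
	electionElapsed int
	electionTimeout int
}

// inLease reports whether the node heard from a leader recently enough
// that it should keep trusting it.
func (f follower) inLease() bool {
	return f.checkQuorum && f.hasLeader && f.electionElapsed < f.electionTimeout
}

// ignoreVoteRequest mirrors the `!force && inLease` condition: a forced
// request (leadership transfer) always gets through.
func ignoreVoteRequest(f follower, force bool) bool {
	return !force && f.inLease()
}

func main() {
	f := follower{checkQuorum: true, hasLeader: true, electionElapsed: 3, electionTimeout: 10}
	fmt.Println(ignoreVoteRequest(f, false)) // lease still valid
	fmt.Println(ignoreVoteRequest(f, true))  // leadership transfer
	f.electionElapsed = 10
	fmt.Println(ignoreVoteRequest(f, false)) // lease expired
}
```

Only the first call reports that the request should be ignored; a forced transfer or an expired lease lets the vote request proceed to the normal voting logic.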

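Much of the term and election handling in Step exists to deal with abnormal situations. It matters a great deal for a production implementation, and the comments in the code explain it briefly. The rest of this section covers how raft handles a vote request in the normal case. A node grants its vote when at least one of the first two conditions below holds and the third holds as well:

  • The node has not voted for another candidate, or has already voted for this same candidate (in etcd, recording raft state and sending messages are asynchronous, so a vote can be recorded without the response having been sent; the vote record is cleared in reset)
  • The candidate's term is greater than the node's term (per the code comment, this clause is there for MsgPreVote)
  • The candidate's last log entry is at least as new as the node's last entry, where "newer" is decided as follows:

(1) The entry with the larger term is newer

(2) With equal terms, the entry with the larger index is newer

(3) With equal terms and equal indexes, the two are the same entry

If these conditions are met, the node replies to the candidate with a message of type voteRespMsgType(m.Type) that is not marked as rejected, which grants its vote.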
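The voting conditions above fit in two small functions. This is a sketch under assumptions: the parameter lists and the none constant are invented for illustration and differ from the etcd signatures (there, isUpToDate is a method on raftLog and the vote check is inlined in Step).

```go
package main

import "fmt"

// none marks "no vote cast yet"; a hypothetical stand-in for None.
const none = 0

// isUpToDate reports whether a candidate whose last entry is
// (candTerm, candIndex) has a log at least as new as the local log,
// whose last entry is (lastTerm, lastIndex).
func isUpToDate(candTerm, candIndex, lastTerm, lastIndex uint64) bool {
	return candTerm > lastTerm || (candTerm == lastTerm && candIndex >= lastIndex)
}

// canGrant combines the three conditions: (no conflicting vote OR the
// message term is higher) AND the candidate's log is up to date.
func canGrant(votedFor, from, msgTerm, localTerm uint64, logOK bool) bool {
	free := votedFor == none || votedFor == from || msgTerm > localTerm
	return free && logOK
}

func main() {
	// Local log ends at term 2, index 5.
	fmt.Println(isUpToDate(3, 1, 2, 5)) // higher term wins despite shorter log
	fmt.Println(isUpToDate(2, 4, 2, 5)) // same term, shorter log
	fmt.Println(isUpToDate(2, 5, 2, 5)) // identical last entry

	// Node already voted for candidate 7 in this term; candidate 9 asks.
	fmt.Println(canGrant(7, 9, 4, 4, true))
	// The same candidate 7 asks again (for example, a retransmission).
	fmt.Println(canGrant(7, 7, 4, 4, true))
}
```

Note that an up-to-date log is required in every case: a free vote or a higher message term alone is never enough.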

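Collecting and counting votes

When the candidate receives a vote response, raft.Step dispatches it to the node's step function. Since step depends on the role, the response ends up in stepCandidate, whose main flow is: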

// stepCandidate is shared by StateCandidate and StatePreCandidate; the difference is
// whether they respond to MsgVoteResp or MsgPreVoteResp.
func stepCandidate(r *raft, m pb.Message) {
	// Only handle vote responses corresponding to our candidacy (while in
	// StateCandidate, we may get stale MsgPreVoteResp messages in this term from
	// our pre-candidate state).
	var myVoteRespType pb.MessageType
	if r.state == StatePreCandidate {
		myVoteRespType = pb.MsgPreVoteResp
	} else {
		myVoteRespType = pb.MsgVoteResp
	}
	switch m.Type {
	// ...

	case myVoteRespType:
		gr := r.poll(m.From, m.Type, !m.Reject)
		r.logger.Infof("%x [quorum:%d] has received %d %s votes and %d vote rejections", r.id, r.quorum(), gr, m.Type, len(r.votes)-gr)
		switch r.quorum() {
		case gr:
			if r.state == StatePreCandidate {
				r.campaign(campaignElection)
			} else {
				r.becomeLeader()
				r.bcastAppend()
			}
		case len(r.votes) - gr:
			r.becomeFollower(r.Term, None)
		}
	}
}

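When the candidate receives a myVoteRespType message, it first tallies the votes through poll. If the number of rejections reaches quorum, it falls back to Follower; if the number of granted votes reaches quorum, it becomes Leader. On becoming Leader, the node initializes the log replication progress it tracks for its followers and appends an empty entry to its log. The empty entry is needed because the rule "an entry is committed once it is replicated to a quorum" applies only to entries from the current term; uncommitted entries from earlier terms can only be committed indirectly, by committing an entry from the current term (see section 5.4 of the Raft paper). becomeLeader is implemented as follows: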

func (r *raft) becomeLeader() {
	// TODO(xiangli) remove the panic when the raft implementation is stable
	if r.state == StateFollower {
		panic("invalid transition [follower -> leader]")
	}
	r.step = stepLeader
	r.reset(r.Term)
	r.tick = r.tickHeartbeat
	r.lead = r.id
	r.state = StateLeader
	// Followers enter replicate mode when they've been successfully probed
	// (perhaps after having received a snapshot as a result). The leader is
	// trivially in this state. Note that r.reset() has initialized this
	// progress with the last index already.
	r.prs[r.id].becomeReplicate()

	// Conservatively set the pendingConfIndex to the last index in the
	// log. There may or may not be a pending config change, but it's
	// safe to delay any future proposals until we commit all our
	// pending log entries, and scanning the entire tail of the log
	// could be expensive.
	r.pendingConfIndex = r.raftLog.lastIndex()

	emptyEnt := pb.Entry{Data: nil}
	if !r.appendEntry(emptyEnt) {
		// This won't happen because we just called reset() above.
		r.logger.Panic("empty entry was dropped")
	}
	// As a special case, don't count the initial empty entry towards the
	// uncommitted log quota. This is because we want to preserve the
	// behavior of allowing one entry larger than quota if the current
	// usage is zero.
	r.reduceUncommittedSize([]pb.Entry{emptyEnt})
	r.logger.Infof("%x became leader at term %d", r.id, r.Term)
}
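The reason for the empty entry can be illustrated with a toy commit rule. The sketch below assumes a simplified log held in a slice and a maybeCommit function written for this example: it advances the commit index to a quorum-replicated index only when the entry at that index belongs to the current term, which is the restriction from section 5.4 of the Raft paper.

```go
package main

import "fmt"

// entry is a minimal log entry for this sketch; only the term matters.
type entry struct{ term uint64 }

// maybeCommit returns the new commit index. log[i] holds the entry at
// index i+1, and quorumIndex is the highest index stored on a quorum.
// The commit index only moves when that entry is from currentTerm.
func maybeCommit(log []entry, committed, quorumIndex, currentTerm uint64) uint64 {
	if quorumIndex > committed &&
		quorumIndex <= uint64(len(log)) &&
		log[quorumIndex-1].term == currentTerm {
		return quorumIndex
	}
	return committed
}

func main() {
	// Entries 1 and 2 were written in term 1 but never committed.
	// A new leader is then elected in term 2.
	log := []entry{{term: 1}, {term: 1}}
	fmt.Println(maybeCommit(log, 0, 2, 2)) // old-term entries alone do not commit

	// The leader appends its empty entry at index 3 in term 2.
	log = append(log, entry{term: 2})
	fmt.Println(maybeCommit(log, 0, 3, 2)) // committing index 3 covers 1 and 2
}
```

Without the empty entry, a new Leader that receives no client proposals would have no current-term entry to replicate, and the older entries would stay uncommitted.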

References

  1. Raft paper: raft.github.io/raft.pdf
  2. etcd github.com/etcd-io/etc…