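Prerequisite: the Raft protocol
Overall architecture diagram
Below is an architecture diagram of the RocketMQ leader-election flow. We will follow it to see how RocketMQ carries out an election: www.processon.com/diagraming/…
Functional breakdown:
1. Start the state machine
RocketMQ's master-slave architecture is implemented on top of the Raft protocol. In Raft every node takes one of three roles: leader, follower, or candidate. When a broker starts its election module, it first has to start the state machine.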
DLedgerLeaderElector#startup()
public class StateMaintainer extends ShutdownAbleThread {
public StateMaintainer(String name, Logger logger) {
super(name, logger);
}
@Override public void doWork() {
try {
if (DLedgerLeaderElector.this.dLedgerConfig.isEnableLeaderElector()) { // is leader election enabled?
DLedgerLeaderElector.this.refreshIntervals(dLedgerConfig); // refresh the timing intervals from the config
DLedgerLeaderElector.this.maintainState(); // act according to the current role
}
sleep(10);
} catch (Throwable t) {
DLedgerLeaderElector.logger.error("Error in heartbeat", t);
}
}
}
2. Initiate a vote
DLedgerLeaderElector.this.maintainState()
private void maintainState() throws Exception {
if (memberState.isLeader()) {
maintainAsLeader();
} else if (memberState.isFollower()) {
maintainAsFollower();
} else {
maintainAsCandidate();
}
}
maintainAsCandidate()
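When a broker starts, its default role is candidate, so the first thing it has to do is solicit votes. Let's see how RocketMQ does this. I split vote solicitation into the following phases:
- preCheck phase
① If the current time is earlier than the next scheduled vote time and needIncreaseTermImmediately=false, return directly: no vote is needed yet.
② If the current node's role is not candidate, return directly.
③ If the previous vote result (lastParseResult) is "wait for the next round of voting" or needIncreaseTermImmediately=true, increment the term and persist it to disk, then compute the next vote time and return.
- vote phase
① Iterate over all nodes in the cluster and send each one a vote request (RPC), then wait for the responses. Note: when the node being iterated is the current node itself, it calls its own handleVote method directly (it votes for itself).
② Each peer handles the vote request with its own handleVote method and returns the result.
- checkResult phase
Iterate over the vote results returned by the peers and decide what to do next.
Iterating over the nodes to send vote requests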
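The preCheck rules above can be read as a small decision function. The following is a minimal sketch, not DLedger's code: the class, the `Action` enum and the `nextVoteTime` helper are hypothetical names, and the randomized delay between a minimum and a maximum interval is an assumption about how "the next vote time" is chosen.

```java
import java.util.Random;

/** Hypothetical sketch of the preCheck phase; none of these names come from DLedger. */
final class PreCheckSketch {
    enum Action { SKIP, BUMP_TERM, VOTE_WITH_CURRENT_TERM }

    /** Decides what a node does at the start of one candidate tick. */
    static Action preCheck(long now, long nextTimeToRequestVote, boolean needIncreaseTermNow,
                           boolean isCandidate, boolean lastResultWasWaitToVoteNext) {
        if (now < nextTimeToRequestVote && !needIncreaseTermNow) {
            return Action.SKIP;          // rule 1: not yet time to vote
        }
        if (!isCandidate) {
            return Action.SKIP;          // rule 2: only candidates solicit votes
        }
        if (lastResultWasWaitToVoteNext || needIncreaseTermNow) {
            return Action.BUMP_TERM;     // rule 3: the term is incremented and persisted
        }
        return Action.VOTE_WITH_CURRENT_TERM;
    }

    /** Assumed scheduling rule: a random delay in [minIntervalMs, maxIntervalMs) after now. */
    static long nextVoteTime(long now, int minIntervalMs, int maxIntervalMs, Random random) {
        return now + minIntervalMs + random.nextInt(maxIntervalMs - minIntervalMs);
    }

    public static void main(String[] args) {
        System.out.println(preCheck(100, 200, false, true, false));
        System.out.println(preCheck(300, 200, false, true, true));
        System.out.println(nextVoteTime(0, 300, 1000, new Random()));
    }
}
```

Randomizing the delay is the usual Raft way of making it unlikely that two candidates keep retrying at the same moment and splitting the vote.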
private List<CompletableFuture<VoteResponse>> voteForQuorumResponses(long term, long ledgerEndTerm,
long ledgerEndIndex) throws Exception {
List<CompletableFuture<VoteResponse>> responses = new ArrayList<>();
for (String id : memberState.getPeerMap().keySet()) {
VoteRequest voteRequest = new VoteRequest();
voteRequest.setGroup(memberState.getGroup());
voteRequest.setLedgerEndIndex(ledgerEndIndex);
voteRequest.setLedgerEndTerm(ledgerEndTerm);
voteRequest.setLeaderId(memberState.getSelfId());
voteRequest.setTerm(term);
voteRequest.setRemoteId(id);
CompletableFuture<VoteResponse> voteResponse;
if (memberState.getSelfId().equals(id)) {
voteResponse = handleVote(voteRequest, true);
} else {
//async
voteResponse = dLedgerRpcService.vote(voteRequest);
}
responses.add(voteResponse);
}
return responses;
}
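The futures returned by this method are later awaited with a single-count latch that is released as soon as the outcome is decided. Here is a minimal, self-contained sketch of that pattern under simplifying assumptions: each response is reduced to a `Boolean` (accept or not), and `QuorumWaitSketch`, `awaitAccepts` and the `quorum` parameter are hypothetical names, not DLedger API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: wait for a quorum of accepts, for all answers, or for a timeout. */
final class QuorumWaitSketch {
    static int awaitAccepts(List<CompletableFuture<Boolean>> responses, int quorum, long timeoutMs) {
        final int total = responses.size();
        final AtomicInteger answered = new AtomicInteger(0);
        final AtomicInteger accepted = new AtomicInteger(0);
        final CountDownLatch latch = new CountDownLatch(1);
        for (CompletableFuture<Boolean> future : responses) {
            future.whenComplete((ok, ex) -> {
                if (ex == null && Boolean.TRUE.equals(ok) && accepted.incrementAndGet() >= quorum) {
                    latch.countDown();   // enough accepts: stop waiting early
                }
                if (answered.incrementAndGet() == total) {
                    latch.countDown();   // every peer answered or failed
                }
            });
        }
        try {
            latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return accepted.get();
    }

    public static void main(String[] args) {
        CompletableFuture<Boolean> silentPeer = new CompletableFuture<>(); // never answers
        List<CompletableFuture<Boolean>> responses = Arrays.asList(
            CompletableFuture.completedFuture(true),
            CompletableFuture.completedFuture(true),
            silentPeer);
        System.out.println("accepted=" + awaitAccepts(responses, 2, 1000));
    }
}
```

Because the latch has a count of one, calling `countDown()` several times is harmless, which lets several independent conditions release the waiter.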
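3. Peer nodes handle the vote request
In brief:
step1: Check whether the remote server that sent the request is a member of the cluster (at startup the broker reads a dLedgerConfig that lists all nodes). If it is not, reject with an unknown-node result (REJECT_UNKNOWN_LEADER).
step2: If the node requesting votes is this node itself, but the call is not a local self-call, reject with REJECT_UNEXPECTED_LEADER.
step3: If the request's term is lower than this node's term, reject with REJECT_EXPIRED_VOTE_TERM.
step4: If the request's term equals this node's term:
step4-1: If this node has not voted for anyone yet, continue.
step4-2: If this node has already voted for the requesting node, this is a repeated request and it also continues.
step4-3: If neither of the two cases above applies:
step4-3-1: If leaderId is not null, reject because a leader already exists (REJECT_ALREADY__HAS_LEADER).
step4-3-2: If leaderId is null and currVoteFor is not null, reject because this node has already voted for another node (REJECT_ALREADY_VOTED).
step5: The last case is that the request's term is higher than this node's term. The node switches its role to candidate using the request's term, sets needIncreaseTermImmediately so that it starts its own vote round right away, and replies REJECT_TERM_NOT_READY.
step6: Compare the requester's log position with this node's own. Reject if the requester's last log term is smaller (REJECT_EXPIRED_LEDGER_TERM), or if the last log terms are equal but its last log index is smaller (REJECT_SMALL_LEDGER_END_INDEX).
Code walkthrough: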
public CompletableFuture<VoteResponse> handleVote(VoteRequest request, boolean self) {
//hold the lock to get the latest term, leaderId, ledgerEndIndex
synchronized (memberState) {
if (!memberState.isPeerMember(request.getLeaderId())) { // is the requester a member of the cluster?
// logging omitted
return CompletableFuture.completedFuture(new VoteResponse(request).term(memberState.currTerm()).voteResult(VoteResponse.RESULT.REJECT_UNKNOWN_LEADER));
}
if (!self && memberState.getSelfId().equals(request.getLeaderId())) { // a remote request that claims to come from this node itself
// logging omitted
return CompletableFuture.completedFuture(new VoteResponse(request).term(memberState.currTerm()).voteResult(VoteResponse.RESULT.REJECT_UNEXPECTED_LEADER));
}
if (request.getTerm() < memberState.currTerm()) { // the request's term is lower than this node's current term
return CompletableFuture.completedFuture(new VoteResponse(request).term(memberState.currTerm()).voteResult(VoteResponse.RESULT.REJECT_EXPIRED_VOTE_TERM));
} else if (request.getTerm() == memberState.currTerm()) { // same term
if (memberState.currVoteFor() == null) { // this node has not voted in the current term yet, so continue
//let it go
} else if (memberState.currVoteFor().equals(request.getLeaderId())) { // already voted for this requester: a repeated request, so continue
//repeat just let it go
} else {
if (memberState.getLeaderId() != null) { // this node already has a leader
return CompletableFuture.completedFuture(new VoteResponse(request).term(memberState.currTerm()).voteResult(VoteResponse.RESULT.REJECT_ALREADY__HAS_LEADER));
} else { // already voted for another node
return CompletableFuture.completedFuture(new VoteResponse(request).term(memberState.currTerm()).voteResult(VoteResponse.RESULT.REJECT_ALREADY_VOTED));
}
}
} else { // the request's term is higher: switch to candidate and start a new vote round right away
changeRoleToCandidate(request.getTerm());
needIncreaseTermImmediately = true;
return CompletableFuture.completedFuture(new VoteResponse(request).term(memberState.currTerm()).voteResult(VoteResponse.RESULT.REJECT_TERM_NOT_READY));
}
//assert acceptedTerm is true
if (request.getLedgerEndTerm() < memberState.getLedgerEndTerm()) {
return CompletableFuture.completedFuture(new VoteResponse(request).term(memberState.currTerm()).voteResult(VoteResponse.RESULT.REJECT_EXPIRED_LEDGER_TERM));
} else if (request.getLedgerEndTerm() == memberState.getLedgerEndTerm() && request.getLedgerEndIndex() < memberState.getLedgerEndIndex()) {
return CompletableFuture.completedFuture(new VoteResponse(request).term(memberState.currTerm()).voteResult(VoteResponse.RESULT.REJECT_SMALL_LEDGER_END_INDEX));
}
if (request.getTerm() < memberState.getLedgerEndTerm()) {
return CompletableFuture.completedFuture(new VoteResponse(request).term(memberState.getLedgerEndTerm()).voteResult(VoteResponse.RESULT.REJECT_TERM_SMALL_THAN_LEDGER));
}
memberState.setCurrVoteFor(request.getLeaderId());
return CompletableFuture.completedFuture(new VoteResponse(request).term(memberState.currTerm()).voteResult(VoteResponse.RESULT.ACCEPT));
}
}
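Stripped of locking, logging and the membership checks, the decision order in handleVote can be expressed as a pure function. The sketch below is a simplified model under stated assumptions: `VoteDecisionSketch`, `Local` and `decide` are hypothetical names, the result constants merely mirror the names in the listing, and the role change triggered by a higher term is reduced to a comment.

```java
/** Hypothetical model of the vote decision order; not DLedger's actual classes. */
final class VoteDecisionSketch {
    enum Result {
        ACCEPT, REJECT_EXPIRED_VOTE_TERM, REJECT_ALREADY_VOTED, REJECT_ALREADY__HAS_LEADER,
        REJECT_TERM_NOT_READY, REJECT_EXPIRED_LEDGER_TERM, REJECT_SMALL_LEDGER_END_INDEX,
        REJECT_TERM_SMALL_THAN_LEDGER
    }

    /** Minimal local state of the voter; all fields are illustrative. */
    static final class Local {
        long currTerm;
        String votedFor;
        String leaderId;
        long ledgerEndTerm;
        long ledgerEndIndex;
    }

    static Result decide(Local s, String candidateId, long term, long ledgerEndTerm, long ledgerEndIndex) {
        if (term < s.currTerm) {
            return Result.REJECT_EXPIRED_VOTE_TERM;
        }
        if (term > s.currTerm) {
            // The listing also switches this node to candidate here; omitted in this sketch.
            return Result.REJECT_TERM_NOT_READY;
        }
        // Same term: one vote per term, but a repeated request from the same candidate is fine.
        if (s.votedFor != null && !s.votedFor.equals(candidateId)) {
            return s.leaderId != null ? Result.REJECT_ALREADY__HAS_LEADER : Result.REJECT_ALREADY_VOTED;
        }
        // The candidate's log must be at least as up to date as the voter's.
        if (ledgerEndTerm < s.ledgerEndTerm) {
            return Result.REJECT_EXPIRED_LEDGER_TERM;
        }
        if (ledgerEndTerm == s.ledgerEndTerm && ledgerEndIndex < s.ledgerEndIndex) {
            return Result.REJECT_SMALL_LEDGER_END_INDEX;
        }
        if (term < s.ledgerEndTerm) {
            return Result.REJECT_TERM_SMALL_THAN_LEDGER;
        }
        s.votedFor = candidateId;
        return Result.ACCEPT;
    }

    public static void main(String[] args) {
        Local s = new Local();
        s.currTerm = 5;
        s.ledgerEndTerm = 4;
        s.ledgerEndIndex = 10;
        System.out.println(decide(s, "n1", 5, 4, 10));
        System.out.println(decide(s, "n2", 5, 4, 10));
    }
}
```

The term check comes before the log check on purpose: a vote is only meaningful within a single term, so term mismatches are resolved first.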
4. Process the vote results
private void maintainAsCandidate() throws Exception {
long term;
long ledgerEndTerm;
long ledgerEndIndex;
synchronized (memberState) {
// ... (pre-check logic elided in this excerpt)
}
long startVoteTimeMs = System.currentTimeMillis();
final List<CompletableFuture<VoteResponse>> quorumVoteResponses = voteForQuorumResponses(term, ledgerEndTerm, ledgerEndIndex);
final AtomicLong knownMaxTermInGroup = new AtomicLong(-1); // highest term seen in the group
final AtomicInteger allNum = new AtomicInteger(0); // number of peers whose response has completed, including failures
final AtomicInteger validNum = new AtomicInteger(0); // number of valid responses (result is not UNKNOWN)
final AtomicInteger acceptedNum = new AtomicInteger(0); // number of accept votes
final AtomicInteger notReadyTermNum = new AtomicInteger(0); // number of peers whose term is not ready yet (req.term > peer's term)
final AtomicInteger biggerLedgerNum = new AtomicInteger(0); // number of peers whose log is ahead of this node's
final AtomicBoolean alreadyHasLeader = new AtomicBoolean(false); // whether some peer reports that a leader already exists
CountDownLatch voteLatch = new CountDownLatch(1);
for (CompletableFuture<VoteResponse> future : quorumVoteResponses) {
// handle each vote response
future.whenComplete((VoteResponse x, Throwable ex) -> {
try {
if (ex != null) {
throw ex;
}
if (x.getVoteResult() != VoteResponse.RESULT.UNKNOWN) {
validNum.incrementAndGet(); // any result other than UNKNOWN counts as a valid response
}
synchronized (knownMaxTermInGroup) {
switch (x.getVoteResult()) {
case ACCEPT:
acceptedNum.incrementAndGet(); // one more accept vote
break;
case REJECT_ALREADY_VOTED:
break;
case REJECT_ALREADY__HAS_LEADER:
alreadyHasLeader.compareAndSet(false, true); // remember that a leader already exists; used below
break;
case REJECT_TERM_SMALL_THAN_LEDGER:
case REJECT_EXPIRED_VOTE_TERM:
if (x.getTerm() > knownMaxTermInGroup.get()) {
knownMaxTermInGroup.set(x.getTerm()); // the responder's term is higher than the highest known term: update it
}
break;
case REJECT_EXPIRED_LEDGER_TERM:
case REJECT_SMALL_LEDGER_END_INDEX:
biggerLedgerNum.incrementAndGet(); // count peers whose log is ahead of this node's
break;
case REJECT_TERM_NOT_READY:
notReadyTermNum.incrementAndGet(); // count peers whose term is behind this node's
break;
default:
break;
}
}
if (alreadyHasLeader.get()
|| memberState.isQuorum(acceptedNum.get())
|| memberState.isQuorum(acceptedNum.get() + notReadyTermNum.get())) {
voteLatch.countDown(); // stop waiting once a leader already exists, a majority accepted, or accepts + not-ready peers form a majority
}
} catch (Throwable t) {
logger.error("Get error when parsing vote response ", t);
} finally {
allNum.incrementAndGet();
if (allNum.get() == memberState.peerSize()) {
voteLatch.countDown();
}
}
});
}
try {
voteLatch.await(3000 + random.nextInt(maxVoteIntervalMs), TimeUnit.MILLISECONDS);
} catch (Throwable ignore) {
}
lastVoteCost = DLedgerUtils.elapsed(startVoteTimeMs); // time spent on this vote round
VoteResponse.ParseResult parseResult;
if (knownMaxTermInGroup.get() > term) { // some peer has a higher term: adopt it for the next vote round
parseResult = VoteResponse.ParseResult.WAIT_TO_VOTE_NEXT;
nextTimeToRequestVote = getNextTimeToRequestVote();
changeRoleToCandidate(knownMaxTermInGroup.get());
} else if (alreadyHasLeader.get()) { // a leader already exists: wait for the next round and push the next vote time back by heartBeatTimeIntervalMs * maxHeartBeatLeak
parseResult = VoteResponse.ParseResult.WAIT_TO_VOTE_NEXT;
nextTimeToRequestVote = getNextTimeToRequestVote() + heartBeatTimeIntervalMs * maxHeartBeatLeak;
} else if (!memberState.isQuorum(validNum.get())) { // valid responses do not form a majority: revote later
parseResult = VoteResponse.ParseResult.WAIT_TO_REVOTE;
nextTimeToRequestVote = getNextTimeToRequestVote();
} else if (memberState.isQuorum(acceptedNum.get())) { // a majority accepted: the vote passes
parseResult = VoteResponse.ParseResult.PASSED;
} else if (memberState.isQuorum(acceptedNum.get() + notReadyTermNum.get())) { // accepts + not-ready peers form a majority: revote immediately
parseResult = VoteResponse.ParseResult.REVOTE_IMMEDIATELY;
} else if (memberState.isQuorum(acceptedNum.get() + biggerLedgerNum.get())) { // accepts + peers with a more advanced log form a majority: wait, then revote
parseResult = VoteResponse.ParseResult.WAIT_TO_REVOTE;
nextTimeToRequestVote = getNextTimeToRequestVote();
} else {
parseResult = VoteResponse.ParseResult.WAIT_TO_VOTE_NEXT;
nextTimeToRequestVote = getNextTimeToRequestVote();
}
lastParseResult = parseResult;
logger.info("[{}] [PARSE_VOTE_RESULT] cost={} term={} memberNum={} allNum={} acceptedNum={} notReadyTermNum={} biggerLedgerNum={} alreadyHasLeader={} maxTerm={} result={}",
memberState.getSelfId(), lastVoteCost, term, memberState.peerSize(), allNum, acceptedNum, notReadyTermNum, biggerLedgerNum, alreadyHasLeader, knownMaxTermInGroup.get(), parseResult);
if (parseResult == VoteResponse.ParseResult.PASSED) {
changeRoleToLeader(term); // the vote passed: become leader
}
}
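The if/else chain at the end of maintainAsCandidate is easier to reason about as a pure function of the tallies. The sketch below mirrors the branch order of the listing; `VoteTallySketch` and `parse` are hypothetical names, and `isQuorum` is assumed to mean "more than half of the peers", which the article does not show.

```java
/** Hypothetical model of how the vote tallies map to a parse result. */
final class VoteTallySketch {
    enum ParseResult { WAIT_TO_REVOTE, REVOTE_IMMEDIATELY, PASSED, WAIT_TO_VOTE_NEXT }

    /** Assumed majority rule: strictly more than half of the peers. */
    static boolean isQuorum(int num, int peerSize) {
        return num >= peerSize / 2 + 1;
    }

    static ParseResult parse(int peerSize, long term, long knownMaxTerm, boolean alreadyHasLeader,
                             int valid, int accepted, int notReady, int biggerLedger) {
        if (knownMaxTerm > term) {
            return ParseResult.WAIT_TO_VOTE_NEXT;   // someone is on a higher term
        }
        if (alreadyHasLeader) {
            return ParseResult.WAIT_TO_VOTE_NEXT;   // a leader exists, back off
        }
        if (!isQuorum(valid, peerSize)) {
            return ParseResult.WAIT_TO_REVOTE;      // too few peers reachable
        }
        if (isQuorum(accepted, peerSize)) {
            return ParseResult.PASSED;              // majority accepted
        }
        if (isQuorum(accepted + notReady, peerSize)) {
            return ParseResult.REVOTE_IMMEDIATELY;  // peers just need to catch up on the term
        }
        if (isQuorum(accepted + biggerLedger, peerSize)) {
            return ParseResult.WAIT_TO_REVOTE;      // peers with a newer log refused
        }
        return ParseResult.WAIT_TO_VOTE_NEXT;
    }

    public static void main(String[] args) {
        // Three-node group, term 5, no higher term seen, two accepts.
        System.out.println(parse(3, 5, -1, false, 3, 2, 0, 0));
    }
}
```

The order of the branches matters: a higher known term or an existing leader overrides even a majority of accepts.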
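Heartbeat detection & failure awareness
After the steps above, our node has been elected leader. How does it maintain its leadership, and how do the other nodes notice its failure and trigger a new election?
Let's go back to the method shown earlier: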
private void maintainState() throws Exception {
if (memberState.isLeader()) {
maintainAsLeader();
} else if (memberState.isFollower()) {
maintainAsFollower();
} else {
maintainAsCandidate();
}
}
This method runs continuously in a dedicated thread. Once the node's role becomes leader, it takes the first branch:
private void maintainAsLeader() throws Exception {
if (DLedgerUtils.elapsed(lastSendHeartBeatTime) > heartBeatTimeIntervalMs) {
long term;
String leaderId;
synchronized (memberState) {
if (!memberState.isLeader()) {
//stop sending
return;
}
term = memberState.currTerm();
leaderId = memberState.getLeaderId();
lastSendHeartBeatTime = System.currentTimeMillis();
}
sendHeartbeats(term, leaderId);
}
}
sendHeartbeats
The leader sends heartbeats to the followers and waits for the results:
private void sendHeartbeats(long term, String leaderId) throws Exception {
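final AtomicInteger allNum = new AtomicInteger(1); // nodes accounted for, starting at 1 for the leader itself
final AtomicInteger succNum = new AtomicInteger(1); // successful nodes, starting at 1 for the leader itself
final AtomicInteger notReadyNum = new AtomicInteger(0); // followers that are not ready (leader's term > follower's term)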
final AtomicLong maxTerm = new AtomicLong(-1);
final AtomicBoolean inconsistLeader = new AtomicBoolean(false);
final CountDownLatch beatLatch = new CountDownLatch(1);
long startHeartbeatTimeMs = System.currentTimeMillis();
for (String id : memberState.getPeerMap().keySet()) {
if (memberState.getSelfId().equals(id)) {
continue;
}
HeartBeatRequest heartBeatRequest = new HeartBeatRequest();
heartBeatRequest.setGroup(memberState.getGroup());
heartBeatRequest.setLocalId(memberState.getSelfId());
heartBeatRequest.setRemoteId(id);
heartBeatRequest.setLeaderId(leaderId);
heartBeatRequest.setTerm(term);
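CompletableFuture<HeartBeatResponse> future = dLedgerRpcService.heartBeat(heartBeatRequest); // send the heartbeat to the follower
// ... the code that handles the heartbeat responses follows
Other nodes handle the heartbeat and update their own state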
public CompletableFuture<HeartBeatResponse> handleHeartBeat(HeartBeatRequest request) throws Exception {
if (!memberState.isPeerMember(request.getLeaderId())) {
// the leaderId in the request is not a member of the cluster: return an error
return CompletableFuture.completedFuture(new HeartBeatResponse().term(memberState.currTerm()).code(DLedgerResponseCode.UNKNOWN_MEMBER.getCode()));
}
if (memberState.getSelfId().equals(request.getLeaderId())) {
// the leaderId in the request is this node's own id: return an error
return CompletableFuture.completedFuture(new HeartBeatResponse().term(memberState.currTerm()).code(DLedgerResponseCode.UNEXPECTED_MEMBER.getCode()));
}
if (request.getTerm() < memberState.currTerm()) {
// this node's term is higher than the sender's term: return EXPIRED_TERM
return CompletableFuture.completedFuture(new HeartBeatResponse().term(memberState.currTerm()).code(DLedgerResponseCode.EXPIRED_TERM.getCode()));
} else if (request.getTerm() == memberState.currTerm()) {
// the terms are equal
if (request.getLeaderId().equals(memberState.getLeaderId())) { // the sender is this node's current leader: record the heartbeat time and return success
lastLeaderHeartBeatTime = System.currentTimeMillis();
return CompletableFuture.completedFuture(new HeartBeatResponse());
}
}
synchronized (memberState) {
if (request.getTerm() < memberState.currTerm()) {
return CompletableFuture.completedFuture(new HeartBeatResponse().term(memberState.currTerm()).code(DLedgerResponseCode.EXPIRED_TERM.getCode()));
} else if (request.getTerm() == memberState.currTerm()) {
if (memberState.getLeaderId() == null) {
// this node has no leader yet: adopt the sender as leader and switch to follower via changeRoleToFollower
changeRoleToFollower(request.getTerm(), request.getLeaderId());
return CompletableFuture.completedFuture(new HeartBeatResponse());
} else if (request.getLeaderId().equals(memberState.getLeaderId())) {
// the sender is this node's current leader: record the heartbeat time and return success
lastLeaderHeartBeatTime = System.currentTimeMillis();
return CompletableFuture.completedFuture(new HeartBeatResponse());
} else {
// this node already follows a different leader: return INCONSISTENT_LEADER
return CompletableFuture.completedFuture(new HeartBeatResponse().code(DLedgerResponseCode.INCONSISTENT_LEADER.getCode()));
}
} else {
// curTerm < req.term: return TERM_NOT_READY
return CompletableFuture.completedFuture(new HeartBeatResponse().code(DLedgerResponseCode.TERM_NOT_READY.getCode()));
}
}
}
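The follower-side handling can be modelled as a small function over the follower's state. This is a minimal sketch with hypothetical names (`FollowerHeartbeatSketch`, `Follower`, `onHeartbeat`) and without the membership checks and locking. The `leaderTimedOut` helper is an assumption: the article does not show how a follower detects a dead leader, so it simply reuses the maxHeartBeatLeak * heartBeatTimeIntervalMs threshold that appears elsewhere in the listings.

```java
/** Hypothetical model of follower-side heartbeat handling; not DLedger's actual classes. */
final class FollowerHeartbeatSketch {
    enum Code { SUCCESS, EXPIRED_TERM, INCONSISTENT_LEADER, TERM_NOT_READY }

    static final class Follower {
        long currTerm;
        String leaderId;
        long lastLeaderHeartBeatTime;

        Follower(long currTerm, String leaderId) {
            this.currTerm = currTerm;
            this.leaderId = leaderId;
        }
    }

    static Code onHeartbeat(Follower f, String reqLeaderId, long reqTerm, long now) {
        if (reqTerm < f.currTerm) {
            return Code.EXPIRED_TERM;          // the sender is behind
        }
        if (reqTerm > f.currTerm) {
            return Code.TERM_NOT_READY;        // this node is behind
        }
        if (f.leaderId == null) {
            f.leaderId = reqLeaderId;          // adopt the sender as leader
            return Code.SUCCESS;
        }
        if (f.leaderId.equals(reqLeaderId)) {
            f.lastLeaderHeartBeatTime = now;   // the leader is alive
            return Code.SUCCESS;
        }
        return Code.INCONSISTENT_LEADER;       // this node follows someone else
    }

    /** Assumption: the leader is considered lost after maxHeartBeatLeak silent intervals. */
    static boolean leaderTimedOut(Follower f, long now, long heartBeatTimeIntervalMs, int maxHeartBeatLeak) {
        return now - f.lastLeaderHeartBeatTime > maxHeartBeatLeak * heartBeatTimeIntervalMs;
    }

    public static void main(String[] args) {
        Follower f = new Follower(5, null);
        System.out.println(onHeartbeat(f, "n0", 5, 1000));
        System.out.println(onHeartbeat(f, "n1", 5, 2000));
    }
}
```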
The leader processes the heartbeat responses
for (String id : memberState.getPeerMap().keySet()) {
if (memberState.getSelfId().equals(id)) {
continue;
}
HeartBeatRequest heartBeatRequest = new HeartBeatRequest();
heartBeatRequest.setGroup(memberState.getGroup());
heartBeatRequest.setLocalId(memberState.getSelfId());
heartBeatRequest.setRemoteId(id);
heartBeatRequest.setLeaderId(leaderId);
heartBeatRequest.setTerm(term);
CompletableFuture<HeartBeatResponse> future = dLedgerRpcService.heartBeat(heartBeatRequest);
// handle the heartbeat response
future.whenComplete((HeartBeatResponse x, Throwable ex) -> {
try {
if (ex != null) {
throw ex;
}
switch (DLedgerResponseCode.valueOf(x.getCode())) {
case SUCCESS:
// one more success
succNum.incrementAndGet();
break;
case EXPIRED_TERM:
// the follower's term is higher than the leader's: record it as maxTerm
maxTerm.set(x.getTerm());
break;
case INCONSISTENT_LEADER:
// the follower follows a different leader: set inconsistLeader = true
inconsistLeader.compareAndSet(false, true);
break;
case TERM_NOT_READY:
// count followers whose term is behind the leader's
notReadyNum.incrementAndGet();
break;
default:
break;
}
if (memberState.isQuorum(succNum.get())
|| memberState.isQuorum(succNum.get() + notReadyNum.get())) {
// release the latch once successes, or successes + not-ready followers, form a majority
beatLatch.countDown();
}
} catch (Throwable t) {
logger.error("Parse heartbeat response failed", t);
} finally {
allNum.incrementAndGet();
if (allNum.get() == memberState.peerSize()) {
beatLatch.countDown();
}
}
});
}
beatLatch.await(heartBeatTimeIntervalMs, TimeUnit.MILLISECONDS);
if (memberState.isQuorum(succNum.get())) {
// a majority succeeded: record the time of this successful heartbeat round
lastSuccHeartBeatTime = System.currentTimeMillis();
} else {
logger.info("[{}] Parse heartbeat responses in cost={} term={} allNum={} succNum={} notReadyNum={} inconsistLeader={} maxTerm={} peerSize={} lastSuccHeartBeatTime={}",
memberState.getSelfId(), DLedgerUtils.elapsed(startHeartbeatTimeMs), term, allNum.get(), succNum.get(), notReadyNum.get(), inconsistLeader.get(), maxTerm.get(), memberState.peerSize(), new Timestamp(lastSuccHeartBeatTime));
if (memberState.isQuorum(succNum.get() + notReadyNum.get())) {
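lastSendHeartBeatTime = -1; // reset the last-send time so that the next loop iteration sends heartbeats immediately
} else if (maxTerm.get() > term) {
changeRoleToCandidate(maxTerm.get()); // a follower has a higher term: step down to candidate and vote again
} else if (inconsistLeader.get()) { // a follower reports a different leader: step down to candidate and vote again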
changeRoleToCandidate(term);
} else if (DLedgerUtils.elapsed(lastSuccHeartBeatTime) > maxHeartBeatLeak * heartBeatTimeIntervalMs) {
// no successful heartbeat round for longer than maxHeartBeatLeak * heartBeatTimeIntervalMs: step down to candidate and start a new election
changeRoleToCandidate(term);
}
}
}
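What the leader does after one heartbeat round can be summarized as a function of the counters. The following is a sketch under stated assumptions: `HeartbeatOutcomeSketch`, `Next` and `decide` are hypothetical names, and the majority rule is assumed to be "more than half of the peers". As in the listing, `succ` already includes the leader itself.

```java
/** Hypothetical model of the leader's decision after one heartbeat round. */
final class HeartbeatOutcomeSketch {
    enum Next { STAY_LEADER, RESEND_IMMEDIATELY, STEP_DOWN_TO_CANDIDATE, KEEP_WAITING }

    static Next decide(int peerSize, long term, long maxTerm, boolean inconsistLeader,
                       int succ, int notReady, long msSinceLastSuccess,
                       long heartBeatTimeIntervalMs, int maxHeartBeatLeak) {
        int quorum = peerSize / 2 + 1;              // assumed majority rule
        if (succ >= quorum) {
            return Next.STAY_LEADER;                // a majority acknowledged the heartbeat
        }
        if (succ + notReady >= quorum) {
            return Next.RESEND_IMMEDIATELY;         // followers only need to catch up on the term
        }
        if (maxTerm > term) {
            return Next.STEP_DOWN_TO_CANDIDATE;     // a follower is on a higher term
        }
        if (inconsistLeader) {
            return Next.STEP_DOWN_TO_CANDIDATE;     // a follower follows another leader
        }
        if (msSinceLastSuccess > maxHeartBeatLeak * heartBeatTimeIntervalMs) {
            return Next.STEP_DOWN_TO_CANDIDATE;     // lost contact with the majority for too long
        }
        return Next.KEEP_WAITING;                   // tolerate a few failed rounds
    }

    public static void main(String[] args) {
        // Three-node group: only the leader itself succeeded, last success 7 s ago.
        System.out.println(decide(3, 5, -1, false, 1, 0, 7000, 2000, 3));
    }
}
```

The last branch is what prevents a partitioned leader from believing it is still in charge forever: after enough silent intervals it gives up the role on its own.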
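Summary:
That is the complete RocketMQ election flow. Some parts may not be explained clearly enough; if you want to dig deeper, have a look at the book 《RocketMQ技术内幕》, which covers much of RocketMQ's internals and helps a lot when studying it. I believe that only by understanding how something came to be can we use it well in day-to-day work.
Resources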
rocketmq: 《RocketMQ技术内幕》
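That wraps up the Raft-based election flow that gives the RocketMQ broker high availability in a distributed deployment. Next we will study log replication between the master and slave broker nodes!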