Preface
A share group (also known as "Kafka Queues") is the new consumption model introduced in Kafka 4.0 by KIP-932.
Differences between share groups and classic consumer groups:
- Partition exclusivity: share: one partition can be consumed by multiple members of the same group; classic: strict one-to-one ownership;
- Group size limit: share: 200 members by default, up to 1000; classic: effective members ≤ number of partitions;
- Acknowledgement granularity: share: per-record acks; classic: batch offset commits at partition granularity;
- Ordering guarantee: share: ordered only within a single fetched batch, no overall ordering; classic: ordered within a partition;
Kafka 4.x share groups resemble RocketMQ 5.x POP consumption, though the two differ in features and implementation.
This article analyzes how Kafka share groups are implemented, based on Kafka 4.2.0.
1. Share Group Usage Example
Create a consumer with KafkaShareConsumer; usage is similar to a regular consumer.
share.acknowledgement.mode: defaults to implicit, where each poll marks the records returned by the previous poll as ACCEPT; can be set to explicit, where every record returned by the previous poll must be explicitly acknowledged through the acknowledge API.
Properties properties = new Properties();
// share.acknowledgement.mode: implicit (default, implicit commit) or explicit (explicit commit)
properties.put(ConsumerConfig.SHARE_ACKNOWLEDGEMENT_MODE_CONFIG, "explicit");
properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
properties.put(ConsumerConfig.GROUP_ID_CONFIG, "group001");
properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
ShareConsumer<String, String> consumer = new KafkaShareConsumer<>(properties);
consumer.subscribe(Collections.singleton("TopicA"));
while (running) {
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String, String> record : records) {
try {
System.out.println(Thread.currentThread() + "===" + record.partition() + "===" + record.offset());
// consumer.acknowledge(record, AcknowledgeType.RENEW);
consumer.acknowledge(record, AcknowledgeType.ACCEPT);
} catch (Exception e) {
e.printStackTrace();
// consumer.acknowledge(record, AcknowledgeType.RELEASE);
consumer.acknowledge(record, AcknowledgeType.REJECT);
}
}
}
consumer.close();
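For comparison, a minimal sketch of the default implicit mode (our sketch; it assumes the same properties as above minus the SHARE_ACKNOWLEDGEMENT_MODE_CONFIG line, and process() is a hypothetical business method). Everything returned by one poll is implicitly marked ACCEPT by the next poll (or by close()):
ShareConsumer<String, String> consumer = new KafkaShareConsumer<>(properties);
consumer.subscribe(Collections.singleton("TopicA"));
while (running) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        process(record); // hypothetical business logic; no acknowledge() calls needed
    }
}
consumer.close();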
2. Share Group Consumer Overview
The regular KafkaConsumer supports two group protocols in 4.x, selected by the group.protocol config:
- classic: ClassicKafkaConsumer, the default for regular consumer groups, i.e. the legacy protocol (analyzed previously, including JoinGroup/SyncGroup). The consumer instance uses a single thread for both application logic and IO, plus one heartbeat thread;
- consumer: AsyncKafkaConsumer, optional; the new protocol (KIP-848) moves part of the client-side group logic to the broker. The client also adopts a new threading model that separates the application thread (poll API) from the IO thread (broker communication); see the "Consumer threading refactor design" wiki page;
The new KafkaShareConsumer supports only the new consumer protocol.
2.1 Application Thread
ShareConsumerImpl.poll: the application-thread loop
- handleCompletedAcknowledgements: if the user registered a callback via KafkaShareConsumer.setAcknowledgementCommitCallback, it fires when ack responses arrive;
- acknowledgeBatchIfImplicitAcknowledgement: in implicit mode, marks the records from the previous poll as ACCEPT;
- ensureInFlightAcknowledgedIfExplicitAcknowledgement: in explicit mode, verifies that every record from the previous poll was acknowledged through the acknowledge API;
- sends a SharePollEvent to the IO thread;
- pollForFetches: the application thread pulls data from the ShareFetchBuffer;
public synchronized ConsumerRecords<K, V> poll(final Duration timeout) {
Timer timer = time.timer(timeout);
acquireAndEnsureOpen();
try {
// Process ErrorEvents returned by the IO thread (omitted here)
processBackgroundEvents();
// Trigger the user's KafkaShareConsumer.setAcknowledgementCommitCallback,
// which handles ack responses (omitted here)
handleCompletedAcknowledgements();
// share.acknowledgement.mode = implicit (default): mark all records
// from the previous poll as ACCEPT
acknowledgeBatchIfImplicitAcknowledgement();
// share.acknowledgement.mode = explicit: every record from the previous
// poll must have been passed to consumer#acknowledge, otherwise an exception is thrown
ensureInFlightAcknowledgedIfExplicitAcknowledgement();
shouldSendShareFetchEvent = true;
do {
// The application thread submits a SharePollEvent, resetting the poll-timeout timer
applicationEventHandler.add(new SharePollEvent(timer.currentTimeMs()));
wakeupTrigger.maybeTriggerWakeup();
// Pull data from the in-memory ShareFetchBuffer
final ShareFetch<K, V> fetch = pollForFetches(timer);
if (!fetch.isEmpty()) {
currentFetch = fetch;
handleCompletedAcknowledgements();
return new ConsumerRecords<>(fetch.records(), Map.of());
}
processBackgroundEvents();
metadata.maybeThrowAnyException();
} while (timer.notExpired());
handleCompletedAcknowledgements();
return ConsumerRecords.empty();
} catch (ShareFetchException e) {
currentFetch = (ShareFetch<K, V>) e.shareFetch();
throw e.cause();
} finally {
release();
}
}
2.2 IO Thread
ConsumerNetworkThread: the IO thread.
public class ConsumerNetworkThread extends KafkaThread implements Closeable {
// Event queue fed by the application thread
private final BlockingQueue<ApplicationEvent> applicationEventQueue;
// Processes application-thread events
private ApplicationEventProcessor applicationEventProcessor;
// Network client for communicating with brokers
private NetworkClientDelegate networkClientDelegate;
// The n request managers
private RequestManagers requestManagers;
}
public RequestManagers(...) {
List<RequestManager> list = new ArrayList<>();
// 1. FindCoordinatorRequest: coordinator discovery
coordinatorRequestManager.ifPresent(list::add);
// 2. ShareGroupHeartbeatRequest: send heartbeats, receive partition assignments
shareHeartbeatRequestManager.ifPresent(list::add);
// 3. Maintains this member's group state (sends nothing itself; processes heartbeat responses) and updates partition assignments
shareMembershipManager.ifPresent(list::add);
// 4. Sends ShareFetchRequest and ShareAcknowledgeRequest
list.add(shareConsumeRequestManager);
entries = Collections.unmodifiableList(list);
}
ConsumerNetworkThread.runOnce: the IO-thread loop:
- processes events from the application thread, e.g. a SharePollEvent resets the poll-timeout timer (max.poll.interval.ms = 5 minutes) and a ShareFetchEvent triggers fetching;
- iterates over the n RequestManagers to build requests and enqueue them;
- networkClientDelegate.poll sends the requests and receives responses; e.g. on a ShareFetch response, the records are written into the ShareFetchBuffer for the application thread to consume;
void runOnce() {
// 1. Process application events
processApplicationEvents();
// 2. Invoke every RequestManager and add the requests to be sent to networkClientDelegate
for (RequestManager rm : requestManagers.entries()) {
// Build requests
NetworkClientDelegate.PollResult pollResult = rm.poll(currentTimeMs);
// Enqueue the requests on the network client
long timeoutMs = networkClientDelegate.addAll(pollResult);
}
// 3. Send requests and receive responses...
networkClientDelegate.poll(pollWaitTimeMs, currentTimeMs);
// ...
}
// Process events sent by the application thread
public void process(ApplicationEvent event) {
switch (event.type()) {
// ...
case SHARE_POLL:
process((SharePollEvent) event);
return;
}
}
CoordinatorRequestManager.poll: taking FindCoordinatorRequest as an example of how a RequestManager builds a request. A response callback is registered when the request is built, and networkClientDelegate.poll triggers it directly when the response arrives.
public NetworkClientDelegate.PollResult poll(final long currentTimeMs) {
if (coordinatorRequestState.canSendRequest(currentTimeMs)) {
NetworkClientDelegate.UnsentRequest request = makeFindCoordinatorRequest(currentTimeMs);
return new NetworkClientDelegate.PollResult(request);
}
}
NetworkClientDelegate.UnsentRequest makeFindCoordinatorRequest(final long currentTimeMs) {
// Build the request
NetworkClientDelegate.UnsentRequest unsentRequest = new NetworkClientDelegate.UnsentRequest(
new FindCoordinatorRequest.Builder(data),
Optional.empty()
);
// Response callback
return unsentRequest.whenComplete((clientResponse, throwable) -> {
});
}
3. Coordinator Overview
A brief look at how write operations are implemented in the newer group coordinator.
The logic is much like older versions: write a log record and apply it to memory; after a restart, replay the log and apply it to memory again.
In the share-group scenario there are two sets of state:
- GroupCoordinatorService (GroupCoordinatorShard) manages the state machine backed by __consumer_offsets, hereafter "group state": group members and partition assignments (for a classic consumer group it also maintains consumption offsets);
- ShareCoordinatorService (ShareCoordinatorShard) manages the state machine backed by __share_group_state, hereafter "share state": the share group's consumption progress;
public class GroupCoordinatorService implements GroupCoordinator {
private final CoordinatorRuntime
<GroupCoordinatorShard, CoordinatorRecord> runtime;
}
public class ShareCoordinatorService implements ShareCoordinator {
private final CoordinatorRuntime
<ShareCoordinatorShard, CoordinatorRecord> runtime;
}
GroupCoordinatorService.shareGroupHeartbeat: a write operation calls CoordinatorRuntime.scheduleWriteOperation with four arguments.
// State machine for __consumer_offsets
private final CoordinatorRuntime<GroupCoordinatorShard, CoordinatorRecord> runtime;
// ....
return runtime.scheduleWriteOperation(
"share-group-heartbeat",
// TopicPartition
topicPartitionFor(request.groupId()),
// Timeout
Duration.ofMillis(config.offsetCommitTimeoutMs()),
// CoordinatorWriteOperation
coordinator -> coordinator.shareGroupHeartbeat(context, request)
).thenCompose(result -> {
});
CoordinatorRuntime.scheduleWriteOperation: the write operation is wrapped in a CoordinatorWriteEvent and placed on the accumulator queue.
// CoordinatorRuntime
private final CoordinatorEventProcessor processor;
public <T> CompletableFuture<T> scheduleWriteOperation(
String name, TopicPartition tp,
Duration timeout, CoordinatorWriteOperation<S, T, U> op) {
throwIfNotRunning();
CoordinatorWriteEvent<T> event = new CoordinatorWriteEvent<>(name, tp, timeout, op);
enqueueLast(event);
return event.future;
}
private void enqueueLast(CoordinatorEvent event) {
processor.enqueueLast(event);
}
// MultiThreadedEventProcessor
private final EventAccumulator<TopicPartition, CoordinatorEvent> accumulator;
public void enqueueLast(CoordinatorEvent event) throws RejectedExecutionException {
accumulator.addLast(event);
}
MultiThreadedEventProcessor consumes the accumulator queue with multiple threads.
- __consumer_offsets uses group.coordinator.threads = 4 threads;
- __share_group_state uses share.coordinator.threads = 1 thread;
private final EventAccumulator<TopicPartition, CoordinatorEvent> accumulator;
private final List<EventProcessorThread> threads;
public MultiThreadedEventProcessor(int numThreads, ...) {
this.accumulator = eventAccumulator;
this.threads = IntStream.range(0, numThreads).mapToObj(threadId ->
new EventProcessorThread(
threadPrefix + threadId
)
).collect(Collectors.toList());
this.threads.forEach(EventProcessorThread::start);
}
private class EventProcessorThread extends Thread {
private void handleEvents() {
while (!shuttingDown) {
if (event != null) {
try {
event.run();
} catch (Throwable t) {
event.complete(t);
} finally {
accumulator.done(event);
}
}
}
}
@Override
public void run() {
handleEvents();
}
}
CoordinatorWriteEvent:
- withActiveContextOrThrow executes the write operation while holding a partition-level lock;
- the CoordinatorWriteOperation generates log records;
- CoordinatorShard.replay applies the log records to memory;
- maybeFlushCurrentBatch writes the log;
- respond to the client;
public void run() {
try {
withActiveContextOrThrow(tp, context -> {
// The CoordinatorWriteOperation generates log records
result = op.generateRecordsAndResult(context.coordinator.coordinator());
// Write the log and apply to memory
context.append(
producerId,
producerEpoch,
verificationGuard,
result.records(),
result.replayRecords(),
result.isAtomic(),
this
);
// Schedule the timeout for the response
if (!future.isDone()) {
operationTimeout = new OperationTimeout(tp, this, writeTimeout.toMillis());
timer.add(operationTimeout);
deferredEventQueuedTimestamp = time.milliseconds();
}
});
} catch (Throwable t) {
complete(t);
}
}
// CoordinatorRuntime.CoordinatorContext#append
private void append(...) {
// 1. The CoordinatorShard applies the record to memory
coordinator.replay(
currentBatch.nextOffset, // start offset of this record batch
producerId,
producerEpoch,
recordToReplay
);
// 2. Write the log
maybeFlushCurrentBatch(currentTimeMs);
}
private void withActiveContextOrThrow(
TopicPartition tp,
Consumer<CoordinatorContext> func
) throws NotCoordinatorException, CoordinatorLoadInProgressException {
// The context is created when this node becomes leader of the partition
CoordinatorContext context = contextOrThrow(tp);
try {
context.lock.lock();
if (context.state == CoordinatorState.ACTIVE) {
func.accept(context);
}
// ...
} finally {
context.lock.unlock();
}
}
4. Consumer Discovers the Coordinator
CoordinatorRequestManager.poll:
- builds a FindCoordinatorRequest; note keyType=0 (GROUP), i.e. the share-group coordinator is located exactly like a regular consumer-group coordinator;
- registers a FindCoordinatorResponse callback that caches the coordinator node in the coordinator field;
// Consumer group coordinator node
private Node coordinator;
public NetworkClientDelegate.PollResult poll(final long currentTimeMs) {
// Skip if the coordinator has already been discovered
if (closing || this.coordinator != null)
return EMPTY;
// Allowed to send
if (coordinatorRequestState.canSendRequest(currentTimeMs)) {
NetworkClientDelegate.UnsentRequest request = makeFindCoordinatorRequest(currentTimeMs);
return new NetworkClientDelegate.PollResult(request);
}
// The last attempt failed and we are still within the backoff window, don't send
return new NetworkClientDelegate.PollResult(coordinatorRequestState.remainingBackoffMs(currentTimeMs));
}
NetworkClientDelegate.UnsentRequest makeFindCoordinatorRequest(final long currentTimeMs) {
coordinatorRequestState.onSendAttempt(currentTimeMs);
FindCoordinatorRequestData data = new FindCoordinatorRequestData()
// Note keyType is still 0 (GROUP)
.setKeyType(FindCoordinatorRequest.CoordinatorType.GROUP.id())
// Consumer group id
.setKey(this.groupId);
NetworkClientDelegate.UnsentRequest unsentRequest = new NetworkClientDelegate.UnsentRequest(
new FindCoordinatorRequest.Builder(data),
// No target node specified: the request goes to any bootstrap node
Optional.empty()
);
return unsentRequest.whenComplete((clientResponse, throwable) -> {
getAndClearFatalError();
if (clientResponse != null) {
FindCoordinatorResponse response = (FindCoordinatorResponse) clientResponse.responseBody();
onResponse(clientResponse.receivedTimeMs(), response);
} else {
onFailedResponse(unsentRequest.handler().completionTimeMs(), throwable);
}
});
}
private void onSuccessfulResponse(
final long currentTimeMs,
final FindCoordinatorResponseData.Coordinator coordinator
) {
// Use a non-brokerId connectionId so the client talks to the coordinator broker over a dedicated connection
int coordinatorConnectionId = Integer.MAX_VALUE - coordinator.nodeId();
this.coordinator = new Node(
coordinatorConnectionId,
coordinator.host(),
coordinator.port());
}
KafkaApis.getCoordinator: the broker computes the leader of partition hash(groupId) % number of __consumer_offsets partitions and returns it as the coordinator node; __consumer_offsets has 50 partitions by default.
private def getCoordinator(request: RequestChannel.Request, keyType: Byte, key: String): (Errors, Node) = {
val (partition, internalTopicName) = CoordinatorType.forId(keyType) match {
case CoordinatorType.GROUP =>
// Still goes through here: topic = __consumer_offsets
(groupCoordinator.partitionFor(key), GROUP_METADATA_TOPIC_NAME)
case CoordinatorType.TRANSACTION =>
(txnCoordinator.partitionFor(key), TRANSACTION_STATE_TOPIC_NAME)
case CoordinatorType.SHARE =>
// Does not go through here
(shareCoordinator.partitionFor(SharePartitionKey.getInstance(key)), SHARE_GROUP_STATE_TOPIC_NAME)
}
}
public int partitionFor(String groupId) {
return Utils.abs(groupId.hashCode()) % numPartitions;
}
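As a sanity check, here is a minimal standalone sketch of the same computation (our example; "group001" is an arbitrary group name, and 50 is the default offsets.topic.num.partitions). Note that Kafka's Utils.abs masks the sign bit rather than calling Math.abs:
public class CoordinatorPartitionDemo {
    public static void main(String[] args) {
        String groupId = "group001";   // example group
        int numPartitions = 50;        // default number of __consumer_offsets partitions
        // Utils.abs(n) is (n & 0x7fffffff), which also handles Integer.MIN_VALUE safely
        int partition = (groupId.hashCode() & 0x7fffffff) % numPartitions;
        System.out.println("__consumer_offsets partition = " + partition);
    }
}
The leader of that __consumer_offsets partition is the group's coordinator, so all members of the same group always land on the same broker.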
5. Consumer Heartbeats
The classic protocol completes a rebalance through JoinGroup + SyncGroup; the new protocol completes it through heartbeats.
5.1 Client
AbstractHeartbeatRequestManager.poll: how a new-protocol consumer builds heartbeat requests
- pollTimer.isExpired: if the application thread has not polled (no SharePollEvent) within max.poll.interval.ms (5 minutes), the member leaves the group; memberEpoch = -1 in the heartbeat signals LeaveGroup;
- otherwise heartbeats are sent periodically;
public NetworkClientDelegate.PollResult poll(long currentTimeMs) {
// No coordinator discovered yet, don't send heartbeats
if (coordinatorRequestManager.coordinator().isEmpty() ||
membershipManager().shouldSkipHeartbeat()) {
membershipManager().onHeartbeatRequestSkipped();
maybePropagateCoordinatorFatalErrorEvent();
return NetworkClientDelegate.PollResult.EMPTY;
}
pollTimer.update(currentTimeMs);
// max.poll.interval.ms = 300000 = 5 minutes: leave the group automatically
if (pollTimer.isExpired() && !membershipManager().isLeavingGroup()) {
// member_epoch = -1
membershipManager().transitionToSendingLeaveGroup(true);
// Build the heartbeat
NetworkClientDelegate.UnsentRequest leaveHeartbeat = makeHeartbeatRequest(currentTimeMs, true);
heartbeatRequestState.reset();
resetHeartbeatState();
return new NetworkClientDelegate.PollResult(heartbeatRequestState.heartbeatIntervalMs(), Collections.singletonList(leaveHeartbeat));
}
boolean heartbeatNow = shouldSendLeaveHeartbeatNow() ||
(membershipManager().shouldHeartbeatNow() && !heartbeatRequestState.requestInFlight());
// canSendRequest=false: not yet time for the next heartbeat, return
if (!heartbeatRequestState.canSendRequest(currentTimeMs) && !heartbeatNow) {
return new NetworkClientDelegate.PollResult(heartbeatRequestState.timeToNextHeartbeatMs(currentTimeMs));
}
// Build the heartbeat
NetworkClientDelegate.UnsentRequest request = makeHeartbeatRequest(currentTimeMs, false);
return new NetworkClientDelegate.PollResult(heartbeatRequestState.heartbeatIntervalMs(), Collections.singletonList(request));
}
Note that under the new (consumer) protocol, the client-side heartbeat.interval.ms and session.timeout.ms configs have no effect; the heartbeat interval and session timeout are decided by the broker. The consumer learns the interval for subsequent heartbeats from a successful heartbeat response (see KIP-848).
ShareHeartbeatRequestManager.buildHeartbeatRequest: the heartbeat request carries
- groupId: the consumer group id;
- memberId: a UUID identifying the consumer instance;
- memberEpoch: the member's epoch; 0 on first join, -1 for a leave-group heartbeat, otherwise updated from server responses;
- subscribedTopicNames: the subscribed topics, sent only on first join or when the subscription changes;
public NetworkClientDelegate.UnsentRequest buildHeartbeatRequest() {
return new NetworkClientDelegate.UnsentRequest(
new ShareGroupHeartbeatRequest.Builder(this.heartbeatState.buildRequestData()),
// Sent to the coordinator node
coordinatorRequestManager.coordinator());
}
public ShareGroupHeartbeatRequestData buildRequestData() {
ShareGroupHeartbeatRequestData data = new ShareGroupHeartbeatRequestData();
data.setGroupId(shareMembershipManager.groupId());
// uuid
data.setMemberId(shareMembershipManager.memberId());
// 0 = just joined, -1 = LeaveGroup
data.setMemberEpoch(shareMembershipManager.memberEpoch());
if (sentFields.rackId == null) {
data.setRackId(shareMembershipManager.rackId());
sentFields.rackId = shareMembershipManager.rackId();
}
// On first join or subscription change, the subscribed topics must be sent
boolean sendAllFields = shareMembershipManager.state() == MemberState.JOINING;
TreeSet<String> subscribedTopicNames = new TreeSet<>(this.subscriptions.subscription());
if (sendAllFields || !subscribedTopicNames.equals(sentFields.subscribedTopicNames)) {
data.setSubscribedTopicNames(new ArrayList<>(this.subscriptions.subscription()));
sentFields.subscribedTopicNames = subscribedTopicNames;
}
return data;
}
AbstractHeartbeatRequestManager.onResponse: the coordinator returns the heartbeat response
- heartbeat interval = heartbeatIntervalMs returned by the server;
- consumed partitions = assignment returned by the server;
private void onResponse(final R response, final long currentTimeMs) {
if (errorForResponse(response) == Errors.NONE) {
// The server returns heartbeatIntervalMs; store it as the interval for subsequent heartbeats
heartbeatRequestState.updateHeartbeatIntervalMs(
heartbeatIntervalForResponse(response));
heartbeatRequestState.onSuccessfulAttempt(currentTimeMs);
// share and non-share have different implementations
membershipManager().onHeartbeatSuccess(response);
return;
}
onErrorResponse(response, currentTimeMs);
}
public void onHeartbeatSuccess(ShareGroupHeartbeatResponse response) {
ShareGroupHeartbeatResponseData responseData = response.data();
// ...
// Update the epoch
updateMemberEpoch(responseData.memberEpoch());
// Partitions the coordinator assigned to this member
ShareGroupHeartbeatResponseData.Assignment assignment = responseData.assignment();
if (assignment != null) {
if (!state.canHandleNewAssignment()) {
return;
}
Map<Uuid, SortedSet<Integer>> newAssignment = new HashMap<>();
assignment.topicPartitions().forEach(topicPartition -> newAssignment.put(topicPartition.topicId(), new TreeSet<>(topicPartition.partitions())));
// Process the received assignment
processAssignmentReceived(newAssignment);
}
}
AbstractMembershipManager.processAssignmentReceived: while handling the heartbeat response, if the consumer finds that the partitions assigned by the coordinator (currentTargetAssignment) differ from its current ones (currentAssignment), it enters the RECONCILING state.
Under the new protocol, learning of an assignment change through a heartbeat response is what starts a rebalance.
// Partitions this member is currently consuming
private LocalAssignment currentAssignment;
// Partitions the coordinator assigned to this member
private LocalAssignment currentTargetAssignment;
protected void processAssignmentReceived(Map<Uuid, SortedSet<Integer>> assignment) {
// currentTargetAssignment = the coordinator's assignment
replaceTargetAssignmentWithNewAssignment(assignment);
if (!targetAssignmentReconciled()) {
// currentAssignment != currentTargetAssignment: enter RECONCILING
transitionTo(MemberState.RECONCILING);
} else {
// currentAssignment == currentTargetAssignment: enter STABLE
if (state == MemberState.RECONCILING
|| state == MemberState.JOINING) {
transitionTo(MemberState.STABLE);
}
}
}
In the reconciling state, the consumer must bring its own assignment up to the target assignment; a regular consumer, for example, has to commit its partition offsets before the assignment can be updated.
Reconciliation is driven by the IO thread and triggered in two ways:
- the application thread calls the poll API, which sends a SharePollEvent;
- the IO-thread loop calls RequestManager (AbstractMembershipManager).poll;
// ApplicationEventProcessor#process
private void process(final SharePollEvent event) {
// Drive the reconciling state
requestManagers.consumerMembershipManager.ifPresent(
consumerMembershipManager -> consumerMembershipManager.maybeReconcile(true));
requestManagers.shareHeartbeatRequestManager.ifPresent(hrm -> {
hrm.membershipManager().onConsumerPoll();
// Reset the poll-timeout timer
hrm.resetPollTimer(event.pollTimeMs());
});
}
// AbstractMembershipManager#poll
public NetworkClientDelegate.PollResult poll(final long currentTimeMs) {
maybeReconcile(false);
return NetworkClientDelegate.PollResult.EMPTY;
}
AbstractMembershipManager.maybeReconcile: performs reconciliation
- compares the target state currentTargetAssignment with the current state currentAssignment;
- markPendingRevocationToPauseFetching: stops fetching the revoked partitions;
- updates currentAssignment = currentTargetAssignment; a regular consumer must finish committing offsets first, a share consumer does not need to;
Once the client finishes reconciling and a heartbeat response shows that its assignment equals the server-side assignment, it enters the STABLE state and the rebalance ends.
// Target state
private LocalAssignment currentTargetAssignment;
// Current state
private LocalAssignment currentAssignment;
public void maybeReconcile(boolean canCommit) {
if (state != MemberState.RECONCILING) {
return;
}
if (targetAssignmentReconciled()) {
return;
}
if (reconciliationInProgress) {
return;
}
// Target partitions
TopicIdPartitionSet assignedTopicIdPartitions = findResolvableAssignmentAndTriggerMetadataUpdate();
final LocalAssignment resolvedAssignment = new LocalAssignment(currentTargetAssignment.localEpoch, assignedTopicIdPartitions);
if (!currentAssignment.isNone() && resolvedAssignment.partitions.equals(currentAssignment.partitions)) {
currentAssignment = resolvedAssignment;
transitionTo(MemberState.ACKNOWLEDGING);
return;
}
// share autoCommitEnabled = false
if (autoCommitEnabled && !canCommit) return;
markReconciliationInProgress();
// Target partitions (new)
SortedSet<TopicPartition> assignedTopicPartitions = assignedTopicIdPartitions.toTopicNamePartitionSet();
// Current partitions (old)
SortedSet<TopicPartition> ownedPartitions = new TreeSet<>(TOPIC_PARTITION_COMPARATOR);
ownedPartitions.addAll(subscriptions.assignedPartitions());
// Added partitions
SortedSet<TopicPartition> addedPartitions = new TreeSet<>(TOPIC_PARTITION_COMPARATOR);
addedPartitions.addAll(assignedTopicPartitions);
addedPartitions.removeAll(ownedPartitions);
// Revoked partitions
SortedSet<TopicPartition> revokedPartitions = new TreeSet<>(TOPIC_PARTITION_COMPARATOR);
revokedPartitions.addAll(ownedPartitions);
revokedPartitions.removeAll(assignedTopicPartitions);
// Stop fetching the revoked partitions
markPendingRevocationToPauseFetching(revokedPartitions);
// For non-share consumers with auto-commit enabled, perform one commit
CompletableFuture<Void> commitResult = signalReconciliationStarted();
commitResult.whenComplete((__, commitReqError) -> {
// ...
if (!maybeAbortReconciliation()) {
// Commit finished: apply the target assignment to memory
// SubscriptionState#assignment
revokeAndAssign(resolvedAssignment, assignedTopicIdPartitions, revokedPartitions, addedPartitions);
}
});
}
5.2 Coordinator
GroupCoordinatorService.shareGroupHeartbeat: the coordinator handles the heartbeat against two sets of state:
- group state (__consumer_offsets): group members and partition assignments;
- share state (__share_group_state): the share group's per-partition consumption progress, which must be initialized when first created;
public CompletableFuture<ShareGroupHeartbeatResponseData> shareGroupHeartbeat(
AuthorizableRequestContext context,
ShareGroupHeartbeatRequestData request) {
// Handle the __consumer_offsets state
return runtime.scheduleWriteOperation(
"share-group-heartbeat",
// TopicPartition
topicPartitionFor(request.groupId()),
// Timeout
Duration.ofMillis(config.offsetCommitTimeoutMs()),
// CoordinatorWriteOperation
coordinator -> coordinator.shareGroupHeartbeat(context, request)
).thenCompose(result -> {
if (result.getValue().isPresent()) {
// InitializeShareGroupStateParameters was returned: initialize the share-group state
timer.add(new TimerTask(0L) {
@Override
public void run() {
persisterInitialize(result.getValue().get(), result.getKey())
}
});
}
return CompletableFuture.completedFuture(result.getKey());
});
}
5.2.1 Group State: Member Info & Partition Assignment
Three things to watch: 1) which data records are generated; 2) how partitions are assigned; 3) how the records are applied to memory.
5.2.1.1 Data Records
GroupMetadataManager.shareGroupHeartbeat: handles the heartbeat request; note the key __consumer_offsets records
- getOrMaybeCreateShareGroup: a group cannot change type, i.e. a classic consumer group cannot be converted into a share group;
- hasMemberSubscriptionChanged: on member changes (subscription change, new member joining): key=(groupId, memberId), value=ShareGroupMemberMetadataValue(clientId, topics);
- updateTargetAssignment: on target-assignment changes: key=(groupId, memberId), value=ShareGroupTargetAssignmentMemberValue(topics, partitionIds);
- maybeReconcile: if the target assignment changed, record the member's current assignment: key=(groupId, memberId), value=ShareGroupCurrentMemberAssignmentValue(memberEpoch, previousMemberEpoch, state, assignment);
- the response carries the heartbeat interval = group.share.heartbeat.interval.ms = 5s; heartbeat-expiry detection uses group.share.session.timeout.ms = 45s;
- if this is the client's first heartbeat, or its partition assignment changed, the response carries the assignment;
- maybeCreateInitializeShareGroupStateRequest: if the group has partitions not yet initialized in __share_group_state, generate a record key=groupId, value=ShareGroupStatePartitionMetadataValue (initialized, initializing, and deleting partitions) and return InitializeShareGroupStateParameters so the corresponding share-group partition state is initialized asynchronously;
This step only mutates the __consumer_offsets state; __share_group_state is updated by an asynchronous RPC and therefore needs compensation. Compensation is driven by member heartbeats, see GroupMetadataManager.subscribedTopicsChangeMap: if a partition has stayed in the initializing state in __consumer_offsets for longer than group.share.initialize.retry.interval.ms (30s), it is considered timed out and the initialization of the corresponding __share_group_state partition is issued again.
private CoordinatorResult<Map.Entry<ShareGroupHeartbeatResponseData,
Optional<InitializeShareGroupStateParameters>>, CoordinatorRecord> shareGroupHeartbeat(
String groupId, String memberId, int memberEpoch,
String rackId, String clientId,
String clientHost, List<String> subscribedTopicNames) throws ApiException {
final long currentTimeMs = time.milliseconds();
final List<CoordinatorRecord> records = new ArrayList<>();
boolean createIfNotExists = memberEpoch == 0;
// new ShareGroup(); if the groupId already exists as a classic consumer group, using it as a share group fails with GroupIdNotFoundException
final ShareGroup group = getOrMaybeCreateShareGroup(groupId, createIfNotExists);
// group.share.max.size = 200: a group may have at most 200 consumer instances
throwIfShareGroupIsFull(group, memberId);
// new ShareGroupMember()
ShareGroupMember member = getOrMaybeSubscribeShareGroupMember(
group, memberId,
memberEpoch, createIfNotExists);
ShareGroupMember updatedMember = new ShareGroupMember.Builder(member)
.maybeUpdateRackId(Optional.ofNullable(rackId))
.maybeUpdateSubscribedTopicNames(Optional.ofNullable(subscribedTopicNames))
.setClientId(clientId)
.setClientHost(clientHost)
.build();
// Consumer subscription changed / new member joined:
// key=groupId+memberId, value=ShareGroupMemberMetadataValue(clientId, topics)
boolean bumpGroupEpoch = hasMemberSubscriptionChanged(
groupId, member, updatedMember, records) || initializedAssignmentPending(group);
int groupEpoch = group.groupEpoch();
Map<String, SubscriptionCount> subscribedTopicNamesMap = group.subscribedTopicNames();
SubscriptionType subscriptionType = group.subscriptionType();
if (bumpGroupEpoch || group.hasMetadataExpired(currentTimeMs)) {
subscribedTopicNamesMap = group.computeSubscribedTopicNames(member, updatedMember);
// Compute a hash over the metadata of the subscribed topics; it changes when partitions change
long groupMetadataHash = ModernGroup.computeMetadataHash(
subscribedTopicNamesMap,
topicHashCache,
metadataImage
);
int numMembers = group.numMembers();
if (!group.hasMember(updatedMember.memberId())) {
numMembers++;
}
subscriptionType = ModernGroup.subscriptionType(
subscribedTopicNamesMap,
numMembers
);
if (groupMetadataHash != group.metadataHash()) {
bumpGroupEpoch = true;
}
if (bumpGroupEpoch) {
groupEpoch += 1;
// The group changed:
// key=groupId, value=ShareGroupMetadataValue(groupEpoch, groupMetadataHash)
records.add(newShareGroupEpochRecord(groupId, groupEpoch, groupMetadataHash));
}
group.setMetadataRefreshDeadline(currentTimeMs + METADATA_REFRESH_INTERVAL_MS, groupEpoch);
}
final int targetAssignmentEpoch;
final Assignment targetAssignment;
if (groupEpoch > group.assignmentEpoch()) {
// The group assignment changed (partition assignment)
targetAssignment = updateTargetAssignment(
group,
groupEpoch,
updatedMember,
subscriptionType,
records
);
targetAssignmentEpoch = groupEpoch;
} else {
targetAssignmentEpoch = group.assignmentEpoch();
targetAssignment = group.targetAssignment(updatedMember.memberId());
}
// The member assignment changed
updatedMember = maybeReconcile(
groupId,
updatedMember,
targetAssignmentEpoch,
targetAssignment,
bumpGroupEpoch,
records
);
// group.share.session.timeout.ms = 45s; restart the timer that detects member heartbeat expiry
scheduleShareGroupSessionTimeout(groupId, memberId);
ShareGroupHeartbeatResponseData response = new ShareGroupHeartbeatResponseData()
.setMemberId(updatedMember.memberId())
.setMemberEpoch(updatedMember.memberEpoch())
// group.share.heartbeat.interval.ms = 5s heartbeat interval
.setHeartbeatIntervalMs(shareGroupHeartbeatIntervalMs(groupId));
// Return the consumer's assignment
boolean isFullRequest = subscribedTopicNames != null;
if (memberEpoch == 0 || isFullRequest || hasAssignedPartitionsChanged(member, updatedMember)) {
response.setAssignment(ShareGroupHeartbeatResponse.createAssignment(updatedMember.assignedPartitions()));
}
return new CoordinatorResult<>(
records,
Map.entry(
response,
// Decide whether the share group's topic-partition state needs initialization:
// key=groupId, value=ShareGroupStatePartitionMetadataValue(initialized, initializing, deleting partitions)
// InitializeShareGroupStateParameters = groupId + the partitions whose __share_group_state needs initialization
maybeCreateInitializeShareGroupStateRequest(groupId, groupEpoch, subscribedTopicNamesMap.keySet(), records)
)
);
}
5.2.1.2 Partition Assignment
Share groups still need partition assignment; the difference is that one partition may be assigned to several consumers in the group.
SimpleAssignor: the share-group partition assignor, which distinguishes whether the subscription is homogeneous (all members subscribe to the same topics).
- first satisfy balance: the partition counts of any two members differ by at most 1;
- then satisfy stickiness: avoid moving partitions between members;
public class SimpleAssignor implements ShareGroupPartitionAssignor {
public GroupAssignment assign(GroupSpec groupSpec,
SubscribedTopicDescriber subscribedTopicDescriber)
throws PartitionAssignorException {
if (groupSpec.subscriptionType().equals(HOMOGENEOUS)) {
// Homogeneous subscription
return new SimpleHomogeneousAssignmentBuilder(groupSpec, subscribedTopicDescriber).build();
} else {
// Heterogeneous subscription
return new SimpleHeterogeneousAssignmentBuilder(groupSpec, subscribedTopicDescriber).build();
}
}
}
SimpleHomogeneousAssignmentBuilder: a quick look at the homogeneous case.
- desiredSharing = ceil(members / partitions): the desired number of members sharing each partition; when members ≤ partitions, partitions are not shared;
- desiredAssignmentCount[i]: given desiredSharing, how many partitions member i gets;
Examples (a worked sketch follows the constructor below):
- 3 members, 7 partitions: since members < partitions, partitions remain exclusively owned;
- 6 members, 4 partitions: desiredSharing = ceil(6/4) = 2; members 1 and 4 get 2 partitions each, members 2, 3, 5, 6 get 1 each;
public SimpleHomogeneousAssignmentBuilder(GroupSpec groupSpec, SubscribedTopicDescriber subscribedTopicDescriber) {
this.subscribedTopicIds = groupSpec.memberSubscription(groupSpec.memberIds().iterator().next()).subscribedTopicIds();
this.numGroupMembers = groupSpec.memberIds().size();
this.memberIds = new ArrayList<>(groupSpec.memberIds());
this.memberIndices = AssignorHelpers.newHashMap(numGroupMembers);
for (int memberIndex = 0; memberIndex < numGroupMembers; memberIndex++) {
memberIndices.put(memberIds.get(memberIndex), memberIndex);
}
this.targetPartitions = computeTargetPartitions(groupSpec, subscribedTopicIds, subscribedTopicDescriber);
int numTargetPartitions = targetPartitions.size();
if (numTargetPartitions == 0) {
this.desiredSharing = 0;
} else {
// Each partition is shared by desiredSharing members
// desiredSharing = ceil(members/partitions) = (members + partitions - 1) / partitions
// i.e. when members <= partitions, partitions are not shared
this.desiredSharing = (numGroupMembers + numTargetPartitions - 1) / numTargetPartitions;
}
// How many partitions each member consumes
this.desiredAssignmentCount = new int[numGroupMembers];
this.oldGroupAssignment = AssignorHelpers.newHashMap(numGroupMembers);
this.newGroupAssignment = AssignorHelpers.newHashMap(numGroupMembers);
this.finalAssignmentByPartition = AssignorHelpers.newHashMap(numTargetPartitions);
this.finalAssignmentByMember = AssignorHelpers.newHashMap(numGroupMembers);
this.unfilledMembers = AssignorHelpers.newHashSet(numGroupMembers);
this.overfilledMembers = AssignorHelpers.newHashSet(numGroupMembers);
groupSpec.memberIds().forEach(memberId -> {
int memberIndex = memberIndices.get(memberId);
oldGroupAssignment.put(memberIndex, groupSpec.memberAssignment(memberId).partitions());
});
// Example: 3 members, 7 partitions
// desiredSharing = ceil(3/7) = 1
// preciseDesiredAssignmentCount = 1 * 7 / 3 = 2.33
// Example: 6 members, 4 partitions
// desiredSharing = ceil(6/4) = 2
// preciseDesiredAssignmentCount = 2 * 4 / 6 = 1.33
double preciseDesiredAssignmentCount = desiredSharing * numTargetPartitions / (double) numGroupMembers;
for (int memberIndex = 0; memberIndex < numGroupMembers; memberIndex++) {
desiredAssignmentCount[memberIndex] =
(int) Math.ceil(preciseDesiredAssignmentCount * (double) (memberIndex + 1)) -
(int) Math.ceil(preciseDesiredAssignmentCount * (double) memberIndex);
}
}
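To make the rounding concrete, here is a minimal standalone sketch (our example, not Kafka code) that reproduces the desiredAssignmentCount math for the 6-member/4-partition case above:
public class DesiredCountDemo {
    public static void main(String[] args) {
        int members = 6, partitions = 4;
        // desiredSharing = ceil(members / partitions)
        int desiredSharing = (members + partitions - 1) / partitions;      // 2
        double precise = desiredSharing * partitions / (double) members;   // 1.33
        for (int i = 0; i < members; i++) {
            // Difference of rounded cumulative quotas spreads the remainder evenly
            int count = (int) Math.ceil(precise * (i + 1)) - (int) Math.ceil(precise * i);
            System.out.println("member " + (i + 1) + " -> " + count + " partition(s)");
        }
        // Prints 2,1,1,2,1,1: members 1 and 4 get two partitions, the rest one,
        // for a total of desiredSharing * partitions = 8 member-partition slots.
    }
}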
SimpleHomogeneousAssignmentBuilder.build: constructs an assignment that satisfies both desiredAssignmentCount[i] and desiredSharing.
public GroupAssignment build() {
if (subscribedTopicIds.isEmpty()) {
return new GroupAssignment(Map.of());
}
// Revoke topics the member no longer subscribes to
revokeUnassignablePartitions();
// Enforce desiredAssignmentCount[i]: revoke partitions from members owning too many
revokeOverfilledMembers();
// Partitions shared by more than desiredSharing members: revoke the excess sharing
revokeOversharedPartitions();
targetPartitions.forEach(topicPartition -> finalAssignmentByPartition.computeIfAbsent(topicPartition, k -> new HashSet<>()));
// Partitions below desiredSharing are assigned to the remaining members according to desiredAssignmentCount[i]
assignRemainingPartitions();
Map<String, MemberAssignment> targetAssignment = AssignorHelpers.newHashMap(numGroupMembers);
for (int memberIndex = 0; memberIndex < numGroupMembers; memberIndex++) {
Map<Uuid, Set<Integer>> memberAssignment = newGroupAssignment.get(memberIndex);
if (memberAssignment == null) {
targetAssignment.put(memberIds.get(memberIndex), new MemberAssignmentImpl(oldGroupAssignment.get(memberIndex)));
} else {
targetAssignment.put(memberIds.get(memberIndex), new MemberAssignmentImpl(memberAssignment));
}
}
return new GroupAssignment(targetAssignment);
}
5.2.1.3 Applying Records to Memory
GroupCoordinatorShard.replay: the single entry point for applying __consumer_offsets records to memory
public void replay(
long offset, long producerId,
short producerEpoch, CoordinatorRecord record) throws RuntimeException {
ApiMessage key = record.key();
ApiMessageAndVersion value = record.value();
CoordinatorRecordType recordType =
CoordinatorRecordType.fromId(key.apiKey());
switch (recordType) {
// ... other record types
// groupId+memberId => member metadata, e.g. subscribed topics
case SHARE_GROUP_MEMBER_METADATA:
groupMetadataManager.replay(
(ShareGroupMemberMetadataKey) key,
(ShareGroupMemberMetadataValue) Utils.messageOrNull(value)
);
break;
// groupId+memberId => target partition assignment
case SHARE_GROUP_TARGET_ASSIGNMENT_MEMBER:
groupMetadataManager.replay(
(ShareGroupTargetAssignmentMemberKey) key,
(ShareGroupTargetAssignmentMemberValue) Utils.messageOrNull(value)
);
break;
// groupId+memberId => current partition assignment
case SHARE_GROUP_CURRENT_MEMBER_ASSIGNMENT:
groupMetadataManager.replay(
(ShareGroupCurrentMemberAssignmentKey) key,
(ShareGroupCurrentMemberAssignmentValue) Utils.messageOrNull(value)
);
break;
// groupId => initialization status of the __share_group_state partitions
case SHARE_GROUP_STATE_PARTITION_METADATA:
groupMetadataManager.replay(
(ShareGroupStatePartitionMetadataKey) key,
(ShareGroupStatePartitionMetadataValue) Utils.messageOrNull(value)
);
break;
}
}
GroupMetadataManager: generates the data records on one side and manages the in-memory state on the other.
- groups: per-group information, including members, their subscribed topics, and partition assignments;
- shareGroupStatePartitionMetadata: the in-memory form of ShareGroupStatePartitionMetadataValue, tracking whether each partition has been properly initialized in __share_group_state;
// groupId -> group (classic/share)
private final TimelineHashMap<String, Group> groups;
// groupId -> __share_group_state metadata
private final TimelineHashMap<String, ShareGroupStatePartitionMetadataInfo> shareGroupStatePartitionMetadata;
public record ShareGroupStatePartitionMetadataInfo(
// Initializing
Map<Uuid, InitMapValue> initializingTopics,
// Initialized
Map<Uuid, InitMapValue> initializedTopics,
Set<Uuid> deletingTopics) {
}
public record InitMapValue(
String name, Set<Integer> partitions,
long timestamp) { // creation time
}
ModernGroup is the top-level implementation of a new-protocol group, storing the group and member information.
targetAssignment = ShareGroupTargetAssignmentMemberValue = the desired partition assignment.
public abstract class ModernGroup<T extends ModernGroupMember>
implements Group {
// groupId
protected final String groupId;
// group epoch
protected final TimelineInteger groupEpoch;
// memberId -> member
protected final TimelineHashMap<String, T> members;
// memberId -> assignment(partitions)
protected final TimelineHashMap<String, Assignment> targetAssignment;
}
public class ShareGroup extends ModernGroup<ShareGroupMember> {
private final TimelineObject<ShareGroupState> state;
}
public enum ShareGroupState {
EMPTY("Empty"),
STABLE("Stable"),
DEAD("Dead"),
}
public class Assignment implements MemberAssignment {
private final Map<Uuid, Set<Integer>> partitions;
}
ModernGroupMember: stores each member's state:
- subscribedTopicNames = ShareGroupMemberMetadataValue = subscribed topics;
- assignedPartitions = ShareGroupCurrentMemberAssignmentValue = the partitions the member currently consumes;
public abstract class ModernGroupMember {
protected String memberId;
protected int memberEpoch;
protected int previousMemberEpoch;
protected MemberState state;
// Subscribed topics
protected Set<String> subscribedTopicNames;
// topicId -> partitionId
protected Map<Uuid, Set<Integer>> assignedPartitions;
}
public class ShareGroupMember extends ModernGroupMember {
}
5.2.2 Initializing Share State: Consumption Progress
5.2.2.1 Group Coordinator Node: Sending WriteShareGroupStateRequest
The group-state changes above are complete; for partitions newly added to the share group, the share state must still be initialized. The initialization parameters, InitializeShareGroupStateParameters, cover n partitions.
public class InitializeShareGroupStateParameters {
private final GroupTopicPartitionData<PartitionStateData>
groupTopicPartitionData;
}
public class GroupTopicPartitionData<P extends PartitionInfoData> {
private final String groupId;
private final List<TopicData<P>> topicsData;
}
public class TopicData<P extends PartitionInfoData> {
private final Uuid topicId;
// n partitions
private final List<P> partitions;
}
public class PartitionData implements ...{
// Partition id
private final int partition;
// groupEpoch
private final int stateEpoch;
// -1
private final long startOffset;
}
GroupCoordinatorService.persisterInitialize: once the asynchronous call to the share-state storage node completes, the partition states tracked in shareGroupStatePartitionMetadata are updated, which again goes through a write-log-and-apply-to-memory round (omitted here).
private final Persister persister;
CompletableFuture<ShareGroupHeartbeatResponseData> persisterInitialize(
InitializeShareGroupStateParameters request,
ShareGroupHeartbeatResponseData defaultResponse) {
// Remote call: initialize __share_group_state for some of the group's partitions
return persister.initializeState(request)
.handle((response, exp) -> {
if (exp == null) {
// No exception: initializing -> initialized
return handlePersisterInitializeResponse(request.groupTopicPartitionData().groupId(), response, defaultResponse);
}
// Exception: remove from initializing
GroupTopicPartitionData<PartitionStateData> gtp = request.groupTopicPartitionData();
Errors error = Errors.forException(exp);
Map<Uuid, Set<Integer>> topicPartitionMap = new HashMap<>();
gtp.topicsData().forEach(topicData -> topicPartitionMap.computeIfAbsent(topicData.topicId(), k -> new HashSet<>())
.addAll(topicData.partitions().stream().map(PartitionStateData::partition).collect(Collectors.toSet())));
return uninitializeShareGroupState(error, gtp.groupId(), topicPartitionMap);
})
.thenCompose(resp -> resp);
}
DefaultStatePersister.initializeState: builds one InitializeStateHandler per partition and submits them to the sender thread.
public CompletableFuture<InitializeShareGroupStateResult> initializeState(InitializeShareGroupStateParameters request) {
GroupTopicPartitionData<PartitionStateData> gtp = request.groupTopicPartitionData();
String groupId = gtp.groupId();
Map<Uuid, Map<Integer, CompletableFuture<InitializeShareGroupStateResponse>>> futureMap = new HashMap<>();
// The topic-partitions to initialize, wrapped as n InitializeStateHandlers
List<PersisterStateManager.InitializeStateHandler> handlers = new ArrayList<>();
gtp.topicsData().forEach(topicData -> {
topicData.partitions().forEach(partitionData -> {
CompletableFuture<InitializeShareGroupStateResponse> future = futureMap
.computeIfAbsent(topicData.topicId(), k -> new HashMap<>())
.computeIfAbsent(partitionData.partition(), k -> new CompletableFuture<>());
handlers.add(
stateManager.new InitializeStateHandler(
groupId, topicData.topicId(), partitionData.partition(),
partitionData.stateEpoch(), partitionData.startOffset(),
future, null
)
);
});
});
// Submit the requests to the sender thread for asynchronous processing
for (PersisterStateManager.PersisterStateManagerHandler handler : handlers) {
stateManager.enqueue(handler);
}
// When the sender finishes, trigger the shareGroupStatePartitionMetadata update
CompletableFuture<Void> combinedFuture = CompletableFuture.allOf(
handlers.stream()
.map(PersisterStateManager.InitializeStateHandler::result)
.toArray(CompletableFuture[]::new));
return combinedFuture.thenApply(v -> initializeResponsesToResult(futureMap));
}
PersisterStateManager.SendThread.generateRequests: the sender thread
- first tries the local metadata cache for the share-state storage node = leaderOf(hash(groupId, topic, partitionId) % 50); __share_group_state also has 50 partitions by default;
- if the metadata cache misses, a FindCoordinator request is sent to an arbitrary node, exactly like regular coordinator discovery except keyType=2 (SHARE);
- requests targeting the same share-state storage node are merged into one WriteShareGroupStateRequest;
public Collection<RequestAndCompletionHandler> generateRequests() {
List<RequestAndCompletionHandler> requests = new ArrayList<>();
if (!queue.isEmpty()) {
PersisterStateManagerHandler handler = queue.peek();
queue.poll();
// 1. Look up the share-state storage node in the metadata cache
if (handler.lookupNeeded()) {
Node randomNode = randomNode();
if (randomNode == Node.noNode()) {
handler.findCoordinatorErrorResponse(Errors.COORDINATOR_NOT_AVAILABLE);
return List.of();
}
// 2. If absent from the metadata cache, issue a FindCoordinator RPC
return List.of(new RequestAndCompletionHandler(
time.milliseconds(),
randomNode,
handler.findShareCoordinatorBuilder(),
handler
));
}
}
// ...
// 3. Assemble the WriteShareGroupStateRequest
requests.add(new RequestAndCompletionHandler(
time.milliseconds(),
coordNode,
// Merge requests per share-state storage node
combinedRequestPerTypePerGroup,
response -> {
handlersPerGroup.forEach(handler1 -> handler1.onComplete(response));
wakeup();
}));
return requests;
}
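The share-coordinator partition computation mirrors the group-coordinator one, but hashes a composite key. A minimal sketch (our example; it assumes SharePartitionKey.asCoordinatorKey() formats the key as groupId:topicId:partition, and 50 is the default number of __share_group_state partitions):
import org.apache.kafka.common.Uuid;

public class ShareCoordinatorPartitionDemo {
    public static void main(String[] args) {
        String groupId = "group001";            // example values
        Uuid topicId = Uuid.randomUuid();
        int topicPartition = 0;
        // Assumed key format of SharePartitionKey.asCoordinatorKey()
        String key = groupId + ":" + topicId + ":" + topicPartition;
        int numStatePartitions = 50;            // default __share_group_state partition count
        int statePartition = (key.hashCode() & 0x7fffffff) % numStatePartitions;
        System.out.println("__share_group_state partition = " + statePartition);
    }
}
Because the key includes the topic and partition, the states of one group's partitions spread across multiple share coordinators, unlike group state, which lands entirely on one coordinator.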
5.2.2.2 Share-State Storage Node: Initializing Partition Consumption Progress
ShareCoordinatorService.writeState: the share-state storage node loops over the partitions and schedules a write of the share state; the target __share_group_state partition is hash(groupId, topic, partitionId) % 50.
public CompletableFuture<WriteShareGroupStateResponseData> writeState(RequestContext context, WriteShareGroupStateRequestData request) {
request.topics().forEach(topicData -> {
Map<Integer, CompletableFuture<WriteShareGroupStateResponseData>> partitionFut =
futureMap.computeIfAbsent(topicData.topicId(), k -> new HashMap<>());
topicData.partitions().forEach(
partitionData -> {
CompletableFuture<WriteShareGroupStateResponseData> future = runtime.scheduleWriteOperation(
"write-share-group-state",
topicPartitionFor(SharePartitionKey.getInstance(groupId, topicData.topicId(), partitionData.partition())),
Duration.ofMillis(config.shareCoordinatorWriteTimeoutMs()),
coordinator -> coordinator.writeState(new WriteShareGroupStateRequestData()...);
partitionFut.put(partitionData.partition(), future);
});
});
//..
}
ShareCoordinatorShard.generateShareStateRecord: generates the __share_group_state data record.
// Per-partition consumption progress of share groups
private final TimelineHashMap<SharePartitionKey, ShareGroupOffset> shareStateMap;
private CoordinatorRecord generateShareStateRecord(
WriteShareGroupStateRequestData.PartitionData partitionData,
SharePartitionKey key,
boolean updateLeaderEpoch) {
long timestamp = time.milliseconds();
ShareGroupOffset currentState = shareStateMap.get(key);
int newLeaderEpoch = currentState.leaderEpoch();
if (updateLeaderEpoch) {
newLeaderEpoch = partitionData.leaderEpoch() != -1 ? partitionData.leaderEpoch() : newLeaderEpoch;
}
// ...
// key=(groupId, topicId, partition), value=ShareUpdateValue
return ShareCoordinatorRecordHelpers.newShareUpdateRecord(
key.groupId(), key.topicId(), partitionData.partition(),
new ShareGroupOffset.Builder()
.setSnapshotEpoch(currentState.snapshotEpoch())
.setStartOffset(partitionData.startOffset())
.setDeliveryCompleteCount(partitionData.deliveryCompleteCount())
.setLeaderEpoch(newLeaderEpoch)
.setStateBatches(mergeBatches(List.of(), partitionData))
.build());
}
ShareCoordinatorShard.shareStateMap: the __share_group_state records are ultimately applied to memory, i.e. the share group's consumption progress.
// (groupId, topicId, partition) -> share-group consumption progress
private final TimelineHashMap<SharePartitionKey, ShareGroupOffset> shareStateMap;
private void handleShareUpdate(ShareUpdateKey key, ShareUpdateValue value) {
SharePartitionKey mapKey = SharePartitionKey.getInstance(key.groupId(), key.topicId(), key.partition());
maybeUpdateLeaderEpochMap(mapKey, value.leaderEpoch());
ShareGroupOffset offsetRecord = ShareGroupOffset.fromRecord(value);
// Merge the partition's consumption progress
shareStateMap.compute(mapKey, (k, v) -> v == null ? offsetRecord : merge(v, value));
snapshotUpdateCount.compute(mapKey, (k, v) -> v == null ? 0 : v + 1);
}
ShareGroupOffset: the consumption progress of one share-group partition
- stateBatches: the n record batches whose state must be persisted; e.g. offsets [1,10] were released on timeout and became AVAILABLE, so they can be fetched again, with deliveryCount tracking the delivery attempts;
- startOffset: the first offset of stateBatches; when stateBatches is empty it is simply the consumption position;
public class ShareGroupOffset {
private final int snapshotEpoch;
private final int stateEpoch;
private final int leaderEpoch;
private final long startOffset;
private final int deliveryCompleteCount;
private final List<PersisterStateBatch> stateBatches;
private final long createTimestamp;
private final long writeTimestamp;
}
public class PersisterStateBatch implements Comparable<PersisterStateBatch> {
// First offset of the batch
private final long firstOffset;
// Last offset of the batch
private final long lastOffset;
// Delivery count
private final short deliveryCount;
// RecordState
private final byte deliveryState;
}
public enum RecordState {
AVAILABLE((byte) 0),
ACQUIRED((byte) 1),
ACKNOWLEDGED((byte) 2),
ARCHIVED((byte) 4);
}
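Putting the ack types and the acquisition-lock timeout together, the per-record state transitions can be sketched roughly as follows (our summary based on the behavior described in this article, not a diagram from the Kafka source):
// AVAILABLE  --acquired by a fetch-->           ACQUIRED     (deliveryCount + 1)
// ACQUIRED   --ack ACCEPT-->                    ACKNOWLEDGED (done)
// ACQUIRED   --ack RELEASE / lock timeout-->    AVAILABLE    (redelivered later)
// ACQUIRED   --ack REJECT / delivery limit-->   ARCHIVED     (never redelivered)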
6. ShareFetch
6.1 Consumer Sends the Fetch
ShareConsumerImpl.collect: the application thread polls for records, preferring the in-memory ShareFetchBuffer; if it is empty, a ShareFetchEvent is sent to the IO thread.
// In-memory records
private final ShareFetchBuffer fetchBuffer;
private ShareFetch<K, V> collect(Map<TopicIdPartition, NodeAcknowledgements> acknowledgementsMap) {
Map<TopicIdPartition, NodeAcknowledgements> acksToSend = acknowledgementsMap;
if (currentFetch.isEmpty() && !currentFetch.hasRenewals()) {
final ShareFetch<K, V> fetch = fetchCollector.collect(fetchBuffer);
if (fetch.isEmpty()) {
// No records available in memory:
// send a ShareFetchEvent so the IO thread issues a ShareFetchRequest
if (shouldSendShareFetchEvent) {
applicationEventHandler.add(new ShareFetchEvent(acksToSend));
shouldSendShareFetchEvent = false;
applicationEventHandler.wakeupNetworkThread();
}
}
return fetch;
}
}
ShareConsumeRequestManager.poll: on the IO thread, builds one ShareFetchRequest per node (partition leader). ShareSessionHandler follows the session mechanism covered back in the 2.x analysis: KIP-227 establishes a session with the broker (cwiki.apache.org/confluence/…) to support incremental fetches, so each request only carries the changed partitions rather than the full partition list.
public PollResult poll(long currentTimeMs) {
// If the user is using the commitSync API, skip fetching
PollResult pollResult = processAcknowledgements(currentTimeMs);
if (pollResult != null) {
return pollResult;
}
// Per-broker session handling
Map<Node, ShareSessionHandler> handlerMap = new HashMap<>();
// Partitions
for (TopicPartition partition : partitionsToFetch()) {
Optional<Node> leaderOpt = metadata.currentLeader(partition).leader;
if (leaderOpt.isEmpty()) {
// No leader, skip
metadata.requestUpdate(false);
continue;
}
Uuid topicId = topicIds.get(partition.topic());
if (topicId == null) {
// No topic metadata, skip
metadata.requestUpdate(false);
continue;
}
Node node = leaderOpt.get();
if (nodesWithPendingRequests.contains(node.id())) {
// The node still has an in-flight fetch request, skip
} else {
ShareSessionHandler handler = handlerMap.computeIfAbsent(node,
k -> sessionHandlers.computeIfAbsent(node.id(), n -> new ShareSessionHandler(logContext, n, memberId)));
TopicIdPartition tip = new TopicIdPartition(topicId, partition);
Acknowledgements acknowledgementsToSend = null;
boolean canSendAcknowledgements = true;
// piggy back ack
Map<TopicIdPartition, Acknowledgements> nodeAcksFromFetchMap = fetchAcknowledgementsToSend.get(node.id());
if (nodeAcksFromFetchMap != null) {
acknowledgementsToSend = nodeAcksFromFetchMap.remove(tip);
if (acknowledgementsToSend != null) {
if (!maybeAddAcknowledgements(handler, node, tip, acknowledgementsToSend)) {
canSendAcknowledgements = false;
}
}
}
// Add the partition to the session
if (canSendAcknowledgements) {
handler.addPartitionToFetch(tip, acknowledgementsToSend);
} else {
handler.addPartitionToFetch(tip, null);
}
topicNamesMap.putIfAbsent(new IdAndPartition(tip.topicId(), tip.partition()), tip.topic());
}
}
// ...
List<UnsentRequest> requests = handlerMap.entrySet().stream().map(entry -> {
Node target = entry.getKey();
ShareSessionHandler handler = entry.getValue();
// ...
// Build the fetch request
ShareFetchRequest.Builder requestBuilder = handler.newShareFetchBuilder(groupId, shareFetchConfig, canSkipIfRequestEmpty);
nodesWithPendingRequests.add(target.id());
BiConsumer<ClientResponse, Throwable> responseHandler = (clientResponse, error) -> {
// Handle the fetch response
if (error != null) {
handleShareFetchFailure(target, requestBuilder.data(), error);
} else {
handleShareFetchSuccess(target, requestBuilder.data(), clientResponse);
}
};
return new UnsentRequest(requestBuilder, Optional.of(target)).whenComplete(responseHandler);
}).filter(Objects::nonNull).collect(Collectors.toList());
return new PollResult(requests);
}
ShareFetchRequest.json: the ShareFetchRequest is shown below
- compared with a regular FetchRequest (the classic fetch), it carries no FetchOffset: the consumption position is managed by the server, not supplied by the client;
- AcknowledgementBatches: acks for record batches can piggyback on the fetch request, reducing network round trips;
- MaxWaitMs = fetch.max.wait.ms = 500ms: with no data available, the fetch request parks on the server for up to 500ms;
- MinBytes = fetch.min.bytes = 1 byte: return as soon as any data is available; MaxBytes = fetch.max.bytes = 50MB: the response carries at most 50MB;
{
"apiKey": 78,
"type": "request",
"listeners": ["broker"],
"name": "ShareFetchRequest",
"validVersions": "1-2",
"flexibleVersions": "0+",
"fields": [
{ "name": "GroupId", "type": "string", "versions": "0+", "nullableVersions": "0+", "default": "null", "entityType": "groupId",
"about": "The group identifier." },
{ "name": "MemberId", "type": "string", "versions": "0+", "nullableVersions": "0+",
"about": "The member ID." },
{ "name": "ShareSessionEpoch", "type": "int32", "versions": "0+",
"about": "The current share session epoch: 0 to open a share session; -1 to close it; otherwise increments for consecutive requests." },
{ "name": "MaxWaitMs", "type": "int32", "versions": "0+",
"about": "The maximum time in milliseconds to wait for the response." },
{ "name": "MinBytes", "type": "int32", "versions": "0+",
"about": "The minimum bytes to accumulate in the response." },
{ "name": "MaxBytes", "type": "int32", "versions": "0+", "default": "0x7fffffff",
"about": "The maximum bytes to fetch. See KIP-74 for cases where this limit may not be honored." },
{ "name": "MaxRecords", "type": "int32", "versions": "1+",
"about": "The maximum number of records to fetch. This limit can be exceeded for alignment of batch boundaries." },
{ "name": "BatchSize", "type": "int32", "versions": "1+",
"about": "The optimal number of records for batches of acquired records and acknowledgements." },
{ "name": "ShareAcquireMode", "type": "int8", "versions": "2+", "default": "0", "ignorable": true,
"about": "The acquire mode to control the fetch behavior - 0:batch-optimized,1:record-limit." },
{ "name": "IsRenewAck", "type": "bool", "versions": "2+", "default": "false",
"about": "Whether Renew type acknowledgements present in AcknowledgementBatches." },
{ "name": "Topics", "type": "[]FetchTopic", "versions": "0+",
"about": "The topics to fetch.", "fields": [
{ "name": "TopicId", "type": "uuid", "versions": "0+", "about": "The unique topic ID.", "mapKey": true },
{ "name": "Partitions", "type": "[]FetchPartition", "versions": "0+",
"about": "The partitions to fetch.", "fields": [
{ "name": "PartitionIndex", "type": "int32", "versions": "0+", "mapKey": true,
"about": "The partition index." },
{ "name": "PartitionMaxBytes", "type": "int32", "versions": "0",
"about": "The maximum bytes to fetch from this partition. 0 when only acknowledgement with no fetching is required. See KIP-74 for cases where this limit may not be honored." },
{ "name": "AcknowledgementBatches", "type": "[]AcknowledgementBatch", "versions": "0+",
"about": "Record batches to acknowledge.", "fields": [
{ "name": "FirstOffset", "type": "int64", "versions": "0+",
"about": "First offset of batch of records to acknowledge." },
{ "name": "LastOffset", "type": "int64", "versions": "0+",
"about": "Last offset (inclusive) of batch of records to acknowledge." },
{ "name": "AcknowledgeTypes", "type": "[]int8", "versions": "0+",
"about": "Array of acknowledge types - 0:Gap,1:Accept,2:Release,3:Reject,4:Renew." }
]}
]}
]}
]
}
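Since these fields map to the ordinary fetch.* consumer configs mentioned above, fetch behavior can be tuned when constructing the share consumer; a minimal sketch with illustrative values (assuming the share consumer honors the same configs):
Properties props = new Properties();
props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);     // MaxWaitMs: server-side park time
props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1);         // MinBytes: respond as soon as data exists
props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 52428800);  // MaxBytes: 50MB response cap
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);      // maps to MaxRecords in the request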
6.2 Partition Leader Handles the Fetch
KafkaApis.handleShareFetchRequest: the leader handles the fetch; acks are ignored for now.
def handleShareFetchRequest(request: RequestChannel.Request): CompletableFuture[Unit] = {
val shareFetchRequest = request.body[ShareFetchRequest]
def isAcknowledgeDataPresentInFetchRequest: Boolean = {
shareFetchRequest.data.topics.asScala
.flatMap(t => t.partitions().asScala)
.exists(partition => partition.acknowledgementBatches != null && !partition.acknowledgementBatches.isEmpty)
}
val isAcknowledgeDataPresent = isAcknowledgeDataPresentInFetchRequest
// 1. Handle acks
if (isAcknowledgeDataPresent) {
val erroneous = mutable.Map[TopicIdPartition, ShareAcknowledgeResponseData.PartitionData]()
val acknowledgementDataFromRequest = getAcknowledgeBatchesFromShareFetchRequest(shareFetchRequest, topicIdNames, erroneous)
acknowledgeResult = handleAcknowledgements(...)
}
// 2. Handle the fetch
val fetchResult: CompletableFuture[Map[TopicIdPartition, ShareFetchResponseData.PartitionData]] =
handleFetchFromShareFetchRequest(request, shareSessionEpoch, erroneousAndValidPartitionData, sharePartitionManager, authorizedTopics)
// 3. Combine the ack and fetch responses
combineShareFetchAndShareAcknowledgeResponses(fetchResult, acknowledgeResult).handle[Unit] {(result, exception) =>
if (exception != null) {
requestHelper.sendMaybeThrottle(request, shareFetchRequest.getErrorResponse(AbstractResponse.DEFAULT_THROTTLE_TIME, exception))
} else {
requestChannel.sendResponse(request, result, None)
}
}
}
6.2.1 Initializing the Partition Consumption Position
SharePartition: manages the consumption state of one group's partition.
- cachedState: the states of the n in-flight batches, keyed by each batch's first offset;
- startOffset, endOffset: the offset bounds of the in-flight batches;
private final String groupId;
private final TopicIdPartition topicIdPartition;
private final int leaderEpoch;
// first offset of an in-flight batch -> state of that batch
private final NavigableMap<Long, InFlightBatch> cachedState;
// Whether the next fetch offset must be recomputed
private boolean findNextFetchOffset;
// __share_group_state storage, called over RPC on the share coordinator
private final Persister persister;
// First offset of cachedState
private long startOffset;
// Last offset of cachedState
private long endOffset;
InFlightBatch is the state of an in-flight record batch.
- offsetState: if records within the batch are acked differently, e.g. offset=1 is released for redelivery while offsets [2,3] are accepted, offsetState tracks each record's state individually;
- batchState: if the batch is handled as a whole, e.g. the entire batch is released for redelivery, batchState tracks the batch-level state;
cachedState: ConcurrentSkipListMap<Long, InFlightBatch>
│
├── Key: Long (firstOffset) — first offset of the batch
│
└── Value: InFlightBatch — batch state
│
├── firstOffset: long — first offset of the batch
├── lastOffset: long — last offset of the batch
├── batchState: InFlightState — batch-level state
│ │
│ ├── state: RecordState — state
│ ├── deliveryCount: int — delivery count
│ ├── memberId: String — consumer holding the batch
│ └── acquisitionLockTimeoutTask: TimerTask — timeout-release task
│
└── offsetState: NavigableMap<Long, InFlightState> — per-record state tracking
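For example (our illustration of the structure above, not actual Kafka output): a batch [1,5] acquired by one member, where offset 1 is later released and offsets 2-3 accepted, switches from batch-level to per-offset tracking, roughly:
cachedState[1] = InFlightBatch(firstOffset=1, lastOffset=5)
  batchState  = (no longer used once per-record states diverge)
  offsetState = { 1 -> AVAILABLE (deliveryCount=1),   // released, redeliverable
                  2 -> ACKNOWLEDGED, 3 -> ACKNOWLEDGED,
                  4 -> ACQUIRED, 5 -> ACQUIRED }      // still held by the member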
SharePartitionManager.processShareFetch: when a group-partition is seen for the first time, the partition leader does not yet hold its __share_group_state, so the consumption position must be initialized and cached in the in-memory SharePartition. Later fetches decide the start offset directly from the SharePartition.
void processShareFetch(ShareFetch shareFetch) {
List<DelayedShareFetchKey> delayedShareFetchWatchKeys = new ArrayList<>();
LinkedHashMap<TopicIdPartition, SharePartition> sharePartitions = new LinkedHashMap<>();
Set<String> topics = new HashSet<>();
for (TopicIdPartition topicIdPartition : shareFetch.topicIdPartitions()) {
topics.add(topicIdPartition.topic());
SharePartitionKey sharePartitionKey = sharePartitionKey(
shareFetch.groupId(),
topicIdPartition
);
// 1. Create the in-memory SharePartition that manages the share group's consumption state
SharePartition sharePartition;
try {
sharePartition = getOrCreateSharePartition(sharePartitionKey);
} catch (Exception e) {
shareFetch.addErroneous(topicIdPartition, e);
continue;
}
DelayedShareFetchKey delayedShareFetchKey = new DelayedShareFetchGroupKey(shareFetch.groupId(),
topicIdPartition.topicId(), topicIdPartition.partition());
delayedShareFetchWatchKeys.add(delayedShareFetchKey);
delayedShareFetchWatchKeys.add(new DelayedShareFetchPartitionKey(topicIdPartition.topicId(), topicIdPartition.partition()));
// 2. Send a ReadShareGroupStateRequest to the share coordinator to initialize the in-memory SharePartition
CompletableFuture<Void> initializationFuture = sharePartition.maybeInitialize();
//...
sharePartitions.put(topicIdPartition, sharePartition);
}
if (shareFetch.errorInAllPartitions()) {
shareFetch.maybeComplete(Map.of());
return;
}
// 3. Try to execute the fetch, or park it
addDelayedShareFetch(new DelayedShareFetch(shareFetch, replicaManager, fencedSharePartitionHandler(), sharePartitions, shareGroupMetrics, time, remoteFetchMaxWaitMs),
delayedShareFetchWatchKeys);
}
SharePartition.maybeInitialize: the partition leader locates the __share_group_state storage node via leader(hash(group, topic, partition) % 50) and sends a ReadShareGroupStateRequest to read the partition's consumption progress. If startOffset = -1 there is no progress yet; the group config share.auto.offset.reset defaults to latest (earliest/by_duration are also available), in which case the partition high watermark becomes the initial position.
A classic consumer, by contrast, has to obtain its initial fetch offset itself via a ListOffsetRequest.
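share.auto.offset.reset is a group-level dynamic config; assuming the standard kafka-configs.sh group-config flow described in KIP-932 applies, resetting an example group to earliest would look like:
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type groups --entity-name group001 \
  --alter --add-config share.auto.offset.reset=earliest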
public CompletableFuture<Void> maybeInitialize() {
CompletableFuture<Void> future = new CompletableFuture<>();
persister.readState(new ReadShareGroupStateParameters.Builder()
.setGroupTopicPartitionData(new GroupTopicPartitionData.Builder<PartitionIdLeaderEpochData>()
.setGroupId(this.groupId)
.setTopicsData(List.of(new TopicData<>(topicIdPartition.topicId(),
List.of(PartitionFactory.newPartitionIdLeaderEpochData(topicIdPartition.partition(), leaderEpoch)))))
.build())
.build()
).whenComplete((result, exception) -> {
lock.writeLock().lock();
try {
TopicData<PartitionAllData> state = result.topicsData().get(0);
// If the group-partition has no consumption progress yet, initialize the start offset
startOffset = startOffsetDuringInitialization(partitionData.startOffset());
stateEpoch = partitionData.stateEpoch();
// States of the in-flight record batches
List<PersisterStateBatch> stateBatches = partitionData.stateBatches();
long gapStartOffset = -1;
long previousBatchLastOffset = startOffset - 1;
for (PersisterStateBatch stateBatch : stateBatches) {
if (gapStartOffset == -1 && stateBatch.firstOffset() > previousBatchLastOffset + 1) {
gapStartOffset = previousBatchLastOffset + 1;
}
previousBatchLastOffset = stateBatch.lastOffset();
// State of one in-flight record batch
InFlightBatch inFlightBatch = new InFlightBatch(timer, time, EMPTY_MEMBER_ID, stateBatch.firstOffset(),
stateBatch.lastOffset(), RecordState.forId(stateBatch.deliveryState()), stateBatch.deliveryCount(),
null, timeoutHandler, sharePartitionMetrics);
cachedState.put(stateBatch.firstOffset(), inFlightBatch);
if (isStateTerminal(RecordState.forId(stateBatch.deliveryState()))) {
deliveryCompleteCount.addAndGet((int) (stateBatch.lastOffset() - stateBatch.firstOffset() + 1));
}
}
if (!cachedState.isEmpty()) {
// There are in-flight batches
updateFindNextFetchOffset(true);
endOffset = cachedState.lastEntry().getValue().lastOffset();
if (gapStartOffset != -1) {
persisterReadResultGapWindow = new GapWindow(endOffset, gapStartOffset);
}
maybeUpdateCachedStateAndOffsets();
} else {
// No in-flight batches: start = end
endOffset = startOffset;
}
partitionState = SharePartitionState.ACTIVE;
} catch (Exception e) {
throwable = e;
} finally {
boolean isFailed = throwable != null;
if (isFailed) {
partitionState = SharePartitionState.FAILED;
}
lock.writeLock().unlock();
if (isFailed) {
future.completeExceptionally(throwable);
} else {
future.complete(null);
}
}
});
return future;
}
ShareCoordinatorShard.readStateAndMaybeUpdateLeaderEpoch: on the node hosting __share_group_state, reads the in-memory consumption progress and returns it to the partition leader.
private final TimelineHashMap<SharePartitionKey, ShareGroupOffset> shareStateMap;
public CoordinatorResult<ReadShareGroupStateResponseData, CoordinatorRecord>
readStateAndMaybeUpdateLeaderEpoch(ReadShareGroupStateRequestData request) {
ReadShareGroupStateRequestData.ReadStateData topicData = request.topics().get(0);
ReadShareGroupStateRequestData.PartitionData partitionData = topicData.partitions().get(0);
Uuid topicId = topicData.topicId();
int partitionId = partitionData.partition();
int leaderEpoch = partitionData.leaderEpoch();
SharePartitionKey key = SharePartitionKey.getInstance(request.groupId(), topicId, partitionId);
ShareGroupOffset offsetValue = shareStateMap.get(key);
List<ReadShareGroupStateResponseData.StateBatch> stateBatches = (offsetValue.stateBatches() != null && !offsetValue.stateBatches().isEmpty()) ?
offsetValue.stateBatches().stream()
.map(
stateBatch -> new ReadShareGroupStateResponseData.StateBatch()
.setFirstOffset(stateBatch.firstOffset())
.setLastOffset(stateBatch.lastOffset())
.setDeliveryState(stateBatch.deliveryState())
.setDeliveryCount(stateBatch.deliveryCount())
).toList() : List.of();
ReadShareGroupStateResponseData responseData = ReadShareGroupStateResponse.toResponseData(
topicId,
partitionId,
offsetValue.startOffset(),
offsetValue.stateEpoch(),
stateBatches
);
// partition leaderEpoch unchanged: return directly
if (leaderEpoch == -1 ||
(leaderEpochMap.get(key) != null && leaderEpochMap.get(key) == leaderEpoch)) {
return new CoordinatorResult<>(List.of(), responseData);
}
// leaderEpoch changed: a record must be written
WriteShareGroupStateRequestData.PartitionData writePartitionData = new WriteShareGroupStateRequestData.PartitionData()
.setPartition(partitionId)
.setLeaderEpoch(leaderEpoch)
.setStateBatches(List.of())
.setStartOffset(responseData.results().get(0).partitions().get(0).startOffset())
.setDeliveryCompleteCount(offsetValue.deliveryCompleteCount())
.setStateEpoch(responseData.results().get(0).partitions().get(0).stateEpoch());
CoordinatorRecord record = generateShareStateRecord(writePartitionData, key, true);
return new CoordinatorResult<>(List.of(record), responseData);
}
6-2-2、Executing the Fetch
DelayedShareFetch.tryComplete:
- acquirablePartitions: acquire partition locks and fetchOffsets; if no partition lock can be acquired, park the request until it times out or another thread wakes it;
- maybeReadFromLog: read message data for the partitions whose lock was acquired;
- forceComplete: on a read error, or once the bytes read satisfy minBytes (1), release the partition locks and respond to the client;
- releasePartitionLocks: if step 3 is not satisfied, release the partition locks;
// the fetch request
private final ShareFetch shareFetch;
// SharePartition
private final LinkedHashMap<TopicIdPartition, SharePartition> sharePartitions;
public boolean tryComplete() {
// 1. Try to acquire partition locks; returns successfully locked partitions -> fetch offset
LinkedHashMap<TopicIdPartition, Long> topicPartitionData = acquirablePartitions(sharePartitions);
try {
if (!topicPartitionData.isEmpty()) {
// 2. Read the log
LinkedHashMap<TopicIdPartition, LogReadResult> replicaManagerReadResponse =
maybeReadFromLog(topicPartitionData);
maybeUpdateFetchOffsetMetadata(topicPartitionData, replicaManagerReadResponse);
if (anyPartitionHasLogReadError(replicaManagerReadResponse)
|| isMinBytesSatisfied(topicPartitionData,
partitionMaxBytesStrategy.maxBytes(shareFetch.fetchParams().maxBytes,
topicPartitionData.keySet(), topicPartitionData.size()))) {
// 3. If any partition has an error, or minBytes (1: respond as soon as there is data) is satisfied, respond to the client
partitionsAcquired = topicPartitionData;
localPartitionsAlreadyFetched = replicaManagerReadResponse;
return forceComplete();
} else {
// 4. No data: the request must be parked, release the partition locks
releasePartitionLocks(topicPartitionData.keySet());
}
}
return false;
} catch (Exception e) {
return forceComplete();
}
}
DelayedShareFetch.acquirablePartitions: loops over the consumed partitions, acquires each partition's fetch lock via CAS with the fetch request's unique uuid, and obtains its nextFetchOffset.
// a uuid generated per fetch request, used to acquire the lock
private final Uuid fetchId;
LinkedHashMap<TopicIdPartition, Long> acquirablePartitions(
LinkedHashMap<TopicIdPartition, SharePartition> sharePartitionsForAcquire) {
LinkedHashMap<TopicIdPartition, Long> topicPartitionData = new LinkedHashMap<>();
sharePartitionsForAcquire.forEach((topicIdPartition, sharePartition) -> {
if (sharePartition.maybeAcquireFetchLock(fetchId)) {
try {
if (sharePartition.canAcquireRecords()) {
// if records can be acquired, the lock is released only after the data has been read
// compute the partition's fetchOffset
topicPartitionData.put(topicIdPartition,
sharePartition.nextFetchOffset());
} else {
// release the lock
sharePartition.releaseFetchLock(fetchId);
}
} catch (Exception e) {
sharePartition.releaseFetchLock(fetchId);
}
}
});
return topicPartitionData;
}
// SharePartition#maybeAcquireFetchLock
// a successful CAS means this fetch request holds the partition lock
private final AtomicReference<Uuid> fetchLock;
public boolean maybeAcquireFetchLock(Uuid fetchId) {
boolean acquired = fetchLock.compareAndSet(null, Objects.requireNonNull(fetchId));
if (acquired) {
long currentTime = time.hiResClockMs(); // start time for lock-held-time metrics (rest elided)
}
return acquired;
}
// SharePartition#canAcquireRecords
boolean canAcquireRecords() {
// nextFetchOffset = the next offset to fetch
// endOffset = the highest offset ever fetched
// if they differ, some earlier messages must have been released, e.g. on timeout
if (nextFetchOffset() != endOffset() + 1) {
return true;
}
// group.share.partition.max.record.locks = 2000: at most 2000 records locked per partition
return numInFlightRecords() < maxInFlightRecords;
}
SharePartition.nextFetchOffset: nextFetchOffset is computed as follows
- simple case, e.g. all batches were consumed completely and normally: return endOffset (the previous fetch progress) directly;
- cachedState contains in-flight batches
- with a gap, e.g. offsets [0,10] and [20,30] are in flight: nextFetchOffset = 11. This can happen when the consumer only acked (release) 0-10 and 20-30; share_group_state persisted just those two ranges, [11,20) was never persisted, so after a broker restart consumption can only resume from 11;
- without a gap, when an in-flight batch was fully released via ack(release): return the first offset of the first fully released batch. E.g. offsets [10,20] were ack(release)d, so consumption resumes from 10;
- without a gap, when an in-flight batch was not processed as a whole, e.g. for batch [10,12]: ack(offset=10, release), ack(offset=11, accept), ack(offset=12, release), then consumption resumes from 10;
// whether the fetchOffset needs to be recomputed
private boolean findNextFetchOffset;
// in-flight batches, key = first offset of the batch
private final NavigableMap<Long, InFlightBatch> cachedState;
// offset of the first message tracked in cachedState
private long startOffset;
// offset of the last fetched message
private long endOffset;
public long nextFetchOffset() {
lock.writeLock().lock();
try {
// case 1: simple scenario, just return endOffset (or endOffset + 1)
if (!findNextFetchOffset) {
if (cachedState.isEmpty()
|| startOffset > cachedState.lastEntry().getValue().lastOffset()) {
return endOffset;
} else {
return endOffset + 1;
}
}
if (cachedState.isEmpty()
|| startOffset > cachedState.lastEntry().getValue().lastOffset()) {
updateFindNextFetchOffset(false);
return endOffset;
}
// findNextFetchOffset = true: the full computation is required
long nextFetchOffset = -1;
long gapStartOffset = isPersisterReadGapWindowActive() ? persisterReadResultGapWindow.gapStartOffset() : -1;
// loop over all in-flight batches
for (Map.Entry<Long, InFlightBatch> entry : cachedState.entrySet()) {
if (isPersisterReadGapWindowActive()) {
if (entry.getKey() > gapStartOffset) {
// case 2: a gap exists, return the gap's start offset
nextFetchOffset = gapStartOffset;
break;
}
gapStartOffset = Math.max(entry.getValue().lastOffset() + 1, gapStartOffset);
}
if (entry.getValue().offsetState() == null) {
// case 3: the batch is tracked as a whole and was released via ack (AVAILABLE), so it may be consumed again
if (entry.getValue().batchState() == RecordState.AVAILABLE && !entry.getValue().batchHasOngoingStateTransition()) {
nextFetchOffset = entry.getValue().firstOffset();
break;
}
} else {
// case 4: the batch is tracked per offset, e.g. offset=1 acked (ACCEPT) and offset=2 released (RELEASE): return offset=2
for (Map.Entry<Long, InFlightState> offsetState : entry.getValue().offsetState().entrySet()) {
if (offsetState.getValue().state() == RecordState.AVAILABLE && !offsetState.getValue().hasOngoingStateTransition()) {
nextFetchOffset = offsetState.getKey();
break;
}
}
if (nextFetchOffset != -1) {
break;
}
}
}
if (nextFetchOffset == -1) {
updateFindNextFetchOffset(false);
nextFetchOffset = endOffset + 1;
}
return nextFetchOffset;
} finally {
lock.writeLock().unlock();
}
}
6-2-3、Updating Message Delivery State (Record Locks)
ShareFetchUtils.processFetchResponse: handles the log read result; the key step is SharePartition.acquire, which updates the partition's in-memory delivery state.
SharePartition.acquire: mirroring the nextFetchOffset logic, after message batches are fetched from the underlying log they still have to be filtered, and the acquired batches recorded as InFlightBatch entries in cachedState.
- if cachedState is empty there are no in-flight batches, so new in-memory InFlightBatch entries are created; otherwise the data actually returned to the client must be filtered against the existing InFlightBatch entries;
- inFlightBatch.offsetState() != null: the batch was partially acked; select the individual AVAILABLE (released) offsets, return them, and move their state to ACQUIRED (in flight);
- inFlightBatch.offsetState() == null: the batch was released as a whole; select AVAILABLE batches and move them to ACQUIRED;
- once the existing InFlightBatch entries are handled, if maxFetchRecords has not been reached, create new in-memory InFlightBatch entries;
- share groups also support isolation levels; reading only committed messages is enabled via the group config share.isolation.level=read_committed (default read_uncommitted). Traditional consumer groups filter transactional messages on the client (e.g. skipping control batches and aborted batches), whereas share groups filter them on the server;
public ShareAcquiredRecords acquire(
String memberId,
ShareAcquireMode shareAcquireMode,
int batchSize,
int maxFetchRecords,
long fetchOffset,
// the fetched message records, n batches
FetchPartitionData fetchPartitionData,
FetchIsolation isolationLevel) {
RecordBatch lastBatch = fetchPartitionData.records.lastBatch().orElse(null);
if (lastBatch == null) {
// nothing fetched, nothing to acquire (guard elided in the original excerpt)
return ShareAcquiredRecords.empty();
}
LastOffsetAndMaxRecords lastOffsetAndMaxRecords = lastOffsetAndMaxRecordsToAcquire(fetchOffset,
maxFetchRecords, lastBatch.lastOffset());
if (lastOffsetAndMaxRecords.maxRecords() <= 0) {
return ShareAcquiredRecords.empty();
}
int maxRecordsToAcquire = lastOffsetAndMaxRecords.maxRecords();
final long lastOffsetToAcquire = lastOffsetAndMaxRecords.lastOffset();
// isRecordLimitMode (used below) is derived from shareAcquireMode; its declaration is elided in this excerpt
RecordBatch firstBatch = fetchPartitionData.records.batches().iterator().next();
lock.writeLock().lock();
try {
long baseOffset = firstBatch.baseOffset();
// mark the range between the requested fetchOffset and the first fetched offset as ARCHIVED
maybeArchiveStaleBatches(fetchOffset, baseOffset);
Map.Entry<Long, InFlightBatch> floorEntry = cachedState.floorEntry(baseOffset);
if (floorEntry == null) {
if (baseOffset < startOffset) {
baseOffset = startOffset;
}
} else if (floorEntry.getValue().lastOffset() >= baseOffset) {
baseOffset = floorEntry.getKey();
}
final NavigableMap<Long, InFlightBatch> subMap = cachedState.subMap(
Math.min(firstBatch.baseOffset(), baseOffset), true, lastOffsetToAcquire, true);
if (subMap.isEmpty()) {
// case 1: no InFlightBatch in cachedState can serve this range, create new InFlightBatch entries
ShareAcquiredRecords shareAcquiredRecords = acquireNewBatchRecords(memberId, fetchPartitionData.records.batches(), isRecordLimitMode,
baseOffset, lastOffsetToAcquire, batchSize, maxRecordsToAcquire);
return maybeFilterAbortedTransactionalAcquiredRecords(fetchPartitionData, isolationLevel, shareAcquiredRecords);
}
List<AcquiredRecords> result = new ArrayList<>();
int acquiredCount = 0;
long maybeGapStartOffset = baseOffset;
for (Map.Entry<Long, InFlightBatch> entry : subMap.entrySet()) {
if (acquiredCount >= maxRecordsToAcquire) {
break;
}
InFlightBatch inFlightBatch = entry.getValue();
boolean fullMatch = checkForFullMatch(inFlightBatch, firstBatch.baseOffset(), lastOffsetToAcquire);
int numRecordsRemaining = maxRecordsToAcquire - acquiredCount;
boolean recordLimitSubsetMatch = isRecordLimitMode && checkForRecordLimitSubsetMatch(inFlightBatch, maxRecordsToAcquire, acquiredCount);
boolean throttleRecordsDelivery = shouldThrottleRecordsDelivery(inFlightBatch, firstBatch.baseOffset(), lastOffsetToAcquire);
if (throttleRecordsDelivery && acquiredCount > 0) {
maxRecordsToAcquire = 0;
break;
}
// inFlightBatch.offsetState() != null: the batch was partially acked, only some of its offsets can be acquired
if (!fullMatch || inFlightBatch.offsetState() != null || recordLimitSubsetMatch || throttleRecordsDelivery) {
// case 2: acquire a subset of the offsets in the InFlightBatch
int acquiredSubsetCount = acquireSubsetBatchRecords(memberId, isRecordLimitMode, numRecordsRemaining, firstBatch.baseOffset(), lastOffsetToAcquire, inFlightBatch, result);
acquiredCount += acquiredSubsetCount;
if (throttleRecordsDelivery && acquiredSubsetCount > 0) {
maxRecordsToAcquire = 0;
break;
}
continue;
}
if (inFlightBatch.batchState() != RecordState.AVAILABLE
|| inFlightBatch.batchHasOngoingStateTransition()) {
continue;
}
// case 3: acquire the entire InFlightBatch, moving its state to ACQUIRED
InFlightState updateResult = inFlightBatch.tryUpdateBatchState(RecordState.ACQUIRED, DeliveryCountOps.INCREASE, maxDeliveryCount, memberId);
// schedule the acquisition-lock timer task
AcquisitionLockTimerTask acquisitionLockTimeoutTask = scheduleAcquisitionLockTimeout(memberId, inFlightBatch.firstOffset(), inFlightBatch.lastOffset());
inFlightBatch.updateAcquisitionLockTimeout(acquisitionLockTimeoutTask);
result.add(new AcquiredRecords()
.setFirstOffset(inFlightBatch.firstOffset())
.setLastOffset(inFlightBatch.lastOffset())
.setDeliveryCount((short) inFlightBatch.batchDeliveryCount()));
acquiredCount += (int) (inFlightBatch.lastOffset() - inFlightBatch.firstOffset() + 1);
}
// case 4: existing InFlightBatch entries did not reach maxRecordsToAcquire, continue acquiring new batches
if (acquiredCount < maxRecordsToAcquire
&& subMap.lastEntry().getValue().lastOffset() < lastOffsetToAcquire) {
int numRecordsRemaining = maxRecordsToAcquire - acquiredCount;
ShareAcquiredRecords shareAcquiredRecords = acquireNewBatchRecords(memberId, fetchPartitionData.records.batches(), isRecordLimitMode,
subMap.lastEntry().getValue().lastOffset() + 1,
lastOffsetToAcquire, batchSize, numRecordsRemaining);
result.addAll(shareAcquiredRecords.acquiredRecords());
acquiredCount += shareAcquiredRecords.count();
}
if (!result.isEmpty()) {
maybeUpdatePersisterGapWindowStartOffset(result.get(result.size() - 1).lastOffset() + 1);
return maybeFilterAbortedTransactionalAcquiredRecords(fetchPartitionData, isolationLevel, new ShareAcquiredRecords(result, acquiredCount));
}
return new ShareAcquiredRecords(result, acquiredCount);
} finally {
lock.writeLock().unlock();
}
}
6-2-4、Starting the Ack Timeout Task
A message batch (or single message) carries a lock timeout, default group.share.record.lock.duration.ms=30000: if it receives no ack within 30s it reverts to AVAILABLE and can be fetched again. This is handled whenever an InFlightBatch is created or updated.
SharePartition.acquireNewBatchRecords: taking the creation of a new InFlightBatch as the example.
private ShareAcquiredRecords acquireNewBatchRecords(....) {
lock.writeLock().lock();
try {
long firstAcquiredOffset = firstOffset;
if (cachedState.isEmpty() && endOffset > firstAcquiredOffset) {
firstAcquiredOffset = endOffset;
}
long lastAcquiredOffset = lastOffset;
if (maxFetchRecords < lastAcquiredOffset - firstAcquiredOffset + 1) {
lastAcquiredOffset = lastOffsetFromBatchWithRequestOffset(batches, firstAcquiredOffset + maxFetchRecords - 1);
if (isPersisterReadGapWindowActive() && lastAcquiredOffset > lastOffset) {
lastAcquiredOffset = lastOffset;
}
}
// create new InFlightBatch entries
List<AcquiredRecords> acquiredRecords = createBatches(memberId, batches, isRecordLimitMode, maxFetchRecords, firstAcquiredOffset, lastAcquiredOffset, batchSize);
// update startOffset and endOffset
if (cachedState.firstKey() == firstAcquiredOffset) {
startOffset = firstAcquiredOffset;
}
if (lastAcquiredOffset > endOffset) {
endOffset = lastAcquiredOffset;
}
// update the gap window
maybeUpdatePersisterGapWindowStartOffset(lastAcquiredOffset + 1);
return new ShareAcquiredRecords(acquiredRecords, (int) (lastAcquiredOffset - firstAcquiredOffset + 1));
} finally {
lock.writeLock().unlock();
}
}
SharePartition.addBatches: creates the new InFlightBatch and starts an AcquisitionLockTimerTask.
private void addBatches(String memberId, List<AcquiredRecords> acquiredRecordsList) {
lock.writeLock().lock();
try {
acquiredRecordsList.forEach(acquiredRecords -> {
// start the timeout timer task
AcquisitionLockTimerTask timerTask = scheduleAcquisitionLockTimeout(
memberId, acquiredRecords.firstOffset(), acquiredRecords.lastOffset());
// cache the new InFlightBatch
cachedState.put(acquiredRecords.firstOffset(), new InFlightBatch(...));
});
} finally {
lock.writeLock().unlock();
}
}
AcquisitionLockTimerTask scheduleAcquisitionLockTimeout(...) {
int recordLockDurationMs = recordLockDurationMsOrDefault(groupConfigManager, groupId, defaultRecordLockDurationMs);
return scheduleAcquisitionLockTimeout(memberId, firstOffset, lastOffset, recordLockDurationMs);
}
private final Timer timer;
private AcquisitionLockTimerTask scheduleAcquisitionLockTimeout(...) {
AcquisitionLockTimerTask acquisitionLockTimerTask = acquisitionLockTimerTask(memberId, firstOffset, lastOffset, delayMs);
timer.add(acquisitionLockTimerTask);
return acquisitionLockTimerTask;
}
SharePartition.releaseAcquisitionLockOnTimeout: handles InFlightBatch timeouts
- update the in-memory InFlightBatch: offsetState=null means the whole batch timed out, offsetState≠null means some of its offsets did; the batch or offsets move to AVAILABLE/ARCHIVED. With the default group.share.delivery.count.limit=5, a message that still fails after 5 deliveries (timeouts and client RELEASE acks both count) is set to ARCHIVED (equivalent to an ack REJECT);
- writeShareGroupState: the InFlightBatch changed, so persist share_group_state asynchronously;
- maybeUpdateCachedStateAndOffsets: clean up InFlightBatch entries in cachedState, since step 1 may have ARCHIVED some, which can then be removed;
- maybeCompleteDelayedShareFetchRequest: step 1 may have released some InFlightBatch entries back to AVAILABLE, so try to wake up the group's parked ShareFetch requests for this partition;
private AcquisitionLockTimeoutHandler releaseAcquisitionLockOnTimeout() {
// the timed-out record range = [firstOffset, lastOffset]
return (memberId, firstOffset, lastOffset, timerTask) -> {
List<PersisterStateBatch> stateBatches;
lock.writeLock().lock();
try {
Map.Entry<Long, InFlightBatch> floorOffset = cachedState.floorEntry(firstOffset);
if (floorOffset == null) {
return;
}
stateBatches = new ArrayList<>();
// find InFlightBatch entries whose first offset falls within [firstOffset, lastOffset]
NavigableMap<Long, InFlightBatch> subMap = cachedState.subMap(floorOffset.getKey(), true, lastOffset, true);
for (Map.Entry<Long, InFlightBatch> entry : subMap.entrySet()) {
InFlightBatch inFlightBatch = entry.getValue();
// ...
// 1. Update the in-memory InFlightBatch: AVAILABLE/ARCHIVED
if (inFlightBatch.offsetState() == null) {
releaseAcquisitionLockOnTimeoutForCompleteBatch(inFlightBatch, stateBatches, memberId);
} else {
releaseAcquisitionLockOnTimeoutForPerOffsetBatch(inFlightBatch, stateBatches, memberId, firstOffset, lastOffset);
}
}
} finally {
lock.writeLock().unlock();
}
// 2. Persist to share_group_state
if (!stateBatches.isEmpty()) {
writeShareGroupState(stateBatches).whenComplete((result, exception) -> {
// 3. Update cachedState; some in-memory InFlightBatch entries (e.g. ARCHIVED) may now be removable
maybeUpdateCachedStateAndOffsets();
});
}
// 4. Try to complete parked share fetch requests (keyed by group+topic+partition), since some records may have been released to AVAILABLE
maybeCompleteDelayedShareFetchRequest(!stateBatches.isEmpty());
};
}
6-3、Consumer Receiving Messages
ShareConsumeRequestManager.handleShareFetchSuccess: wraps the broker's response data as a ShareCompletedFetch and caches it in the ShareFetchBuffer, from which the application thread's poll pulls records.
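A minimal sketch of this IO-thread → application-thread hand-off, assuming simplified types (the real ShareFetchBuffer also handles wake-ups and close semantics):
import java.util.concurrent.ConcurrentLinkedQueue;
// Sketch only: models the hand-off described above, not Kafka's actual class.
class ShareFetchBufferSketch<T> {
private final ConcurrentLinkedQueue<T> completedFetches = new ConcurrentLinkedQueue<>();
// IO thread: called when a ShareFetch response arrives (a ShareCompletedFetch in Kafka)
void add(T completedFetch) {
completedFetches.add(completedFetch);
// the real buffer also signals a condition so a blocked poll() wakes up
}
// application thread: drained by pollForFetches when collecting records
T poll() {
return completedFetches.poll();
}
}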
7、Ack
7-1、Consumers Sending Acks
The client controls the ack mode via share.acknowledgement.mode:
- implicit: the default; implicit commit;
- explicit: explicit commit;
In explicit mode, user code must explicitly acknowledge every record returned by poll via the ShareConsumer.acknowledge API.
ShareConsumerImpl.acknowledge: the acknowledge API only marks in-flight records (offsets) in memory.
AcknowledgeType: there are four ack types
- ACCEPT: the message was processed successfully and enters a terminal state;
- REJECT: processing failed and no redelivery is wanted; the message enters a terminal state;
- RELEASE: processing failed; release the record lock so the broker may redeliver;
- RENEW: renew the lease, telling the broker not to release the lock; the message stays exclusively held by the current consumer;
In implicit mode, behaviour resembles a traditional consumer's auto-commit: each consumer.poll marks all messages from the previous poll (currentFetch) as ACCEPT.
ShareConsumerImpl.collect: when the application thread polls records, acks are triggered. They are sent in one of two ways (a sketch follows this list):
- if a ShareFetch must be sent to pull messages (a ShareFetchEvent goes to the IO thread), the acks are attached to the ShareFetch's AcknowledgementBatches field, reducing network round trips;
- if no ShareFetch is needed (a ShareAcknowledgeAsyncEvent goes to the IO thread), the acks are sent in a standalone ShareAcknowledgeRequest;
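A sketch of the two send paths under assumed names (PendingAck, sendShareFetch, sendShareAcknowledge are illustrative, not the actual ShareConsumeRequestManager members):
// Sketch only: chooses between piggybacking acks on a ShareFetch and a standalone request.
void maybeSendAcks(List<PendingAck> pendingAcks, boolean shareFetchNeeded) {
if (pendingAcks.isEmpty()) {
return;
}
if (shareFetchNeeded) {
// path 1: acks ride in the AcknowledgementBatches field of the next ShareFetchRequest,
// saving one network round trip
sendShareFetch(pendingAcks);
} else {
// path 2: no fetch is due, so a dedicated ShareAcknowledgeRequest is sent
sendShareAcknowledge(pendingAcks);
}
pendingAcks.clear();
}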
7-2、Partition Leader Handling Acks
KafkaApis.handleShareFetchRequest: for a ShareFetch, acks are processed first, then the fetch.
KafkaApis.handleShareAcknowledgeRequest: a ShareAcknowledge is processed as a standalone ack.
SharePartition.acknowledge: for either ack path, loop over each groupId's consumed-partition SharePartition and:
- update the in-memory InFlightBatch and build a PersisterBatch containing:
- InFlightState updatedState: the post-update in-memory state of a whole in-flight batch (InFlightBatch.batchState) or of a single offset (InFlightBatch.offsetState);
- PersisterStateBatch stateBatch: the persisted message state corresponding to updatedState;
- send a WriteShareGroupStateRequest (PersisterStateBatch) to persist the share_group_state; on failure, roll back the in-memory InFlightBatch;
InFlightState: represents the state of a message batch or of a single offset.
startStateTransition mutates the in-memory message state; with the default group.share.delivery.count.limit=5, a message released by ack can be redelivered at most 5 times.
The WriteShareGroupStateRequest is sent to leaderOf(hash(groupId, consumed topic, consumed partitionId) % 50) and persisted to share_group_state, same as in 5-2-2; a sketch of the routing follows.
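The sketch below assumes the 50-partition default of the internal state topic (share.coordinator.state.topic.num.partitions); the exact hash Kafka applies may differ, this only illustrates the routing idea.
// Sketch only: maps a (groupId, topicId, partition) key to a share_group_state partition.
static int shareStatePartitionFor(String groupId, String topicId, int partition, int numStatePartitions) {
String key = groupId + ":" + topicId + ":" + partition; // SharePartitionKey-style composite key
return (key.hashCode() & 0x7fffffff) % numStatePartitions; // mask keeps the value non-negative
}
// The request is then sent to the leader replica of share_group_state-<shareStatePartitionFor(...)>.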
7-3、About renew
For accept/reject acks, the message is never delivered again.
For a release ack, the record lock is released and the message must be retransmitted, possibly to a different group member.
For a renew ack, the consumer does not drop the messages from memory and no retransmission is needed: as long as the broker renews the record lock successfully, the consumer can simply consume them again, as in the example below.
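For instance, a consumer doing long-running work can renew instead of letting the 30s lock lapse; a hedged example (needsMoreTime and process are placeholder application methods):
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String, String> record : records) {
if (needsMoreTime(record)) {
// keep the broker-side acquisition lock alive; the record stays with this member
consumer.acknowledge(record, AcknowledgeType.RENEW);
continue; // the record is re-delivered from local memory on a later poll
}
process(record);
consumer.acknowledge(record, AcknowledgeType.ACCEPT);
}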
ShareInFlightBatch: on the consumer side
- while the application thread's poll collects the acks to send, renewed messages are marked as renewing;
- when the IO thread receives the ack response, it sends a ShareAcknowledgementEvent to the application thread;
- on the next poll, the application thread first processes the ShareAcknowledgementEvent, moving renewing to renewed; when records are pulled from memory, renewed moves back to inFlight and the re-pulled messages are consumed again;
SharePartition.acknowledgePerOffsetBatchRecords: on the broker side, taking a single-message renew as the example. The partition leader does not change the message state for a renew ack; it only restarts the AcquisitionLockTimerTask lock-timeout task.
Summary
Consumer heartbeats
Share groups use the new KIP-848 consumer group protocol, completing rebalances via heartbeats.
Heartbeat request:
1) groupId;
2) memberId: a uuid generated on each process start;
3) memberEpoch: 0 when just coming online, -1 when leaving the group, otherwise controlled by the coordinator (= assignment epoch);
4) topics: the subscribed topics;
Heartbeat response:
1) with the new protocol the server controls the timing: heartbeat interval = share.heartbeat.interval.ms = 5s, heartbeat timeout detection = share.session.timeout.ms = 45s;
2) the partition assignment, i.e. which topic partitions the member may currently consume, used by subsequent fetch requests;
Share group specifics:
1) the coordinator uses only __consumer_offsets to store group members, subscriptions, and partition assignments, represented in memory as ModernGroup and ModernGroupMember;
2) the coordinator locates the leader of the share_group_state topic via leaderOf(hash(groupId+topic+partition) % 50) to store share group consumption progress, represented in memory as SharePartitionKey(groupId+topic+partition) → ShareGroupOffset(stateBatches+startOffset);
3) because share group messages are acked individually, ShareGroupOffset.stateBatches stores the message states that must be tracked, e.g. offsets 1-100 released by a consumer's release ack become available for retry; delivery counts are stored as well. ShareGroupOffset changes only on message acks and timeout releases;
4) when heartbeats reveal that the share group is consuming new partitions, the coordinator triggers the SharePartitionKey → ShareGroupOffset initialization. This step involves an extra asynchronous RPC and relies on heartbeats for passive compensation;
Partition assignment:
When subscriptions are homogeneous, SimpleHomogeneousAssignmentBuilder is used.
Each partition is shared by desiredSharing members, where desiredSharing = ceil(members / partitions), i.e. member count divided by partition count, rounded up. For example: with members ≤ partitions, each partition is still exclusively owned; with 7 members and 3 partitions, each partition is shared by 3 members, 2 members consume 2 partitions and the other 5 consume 1 (see the sketch below).
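A quick sketch of the arithmetic (illustrative only, not the assignor's actual code):
// desiredSharing = ceil(members / partitions), via integer ceiling division
static int desiredSharing(int members, int partitions) {
return (members + partitions - 1) / partitions;
}
// desiredSharing(7, 3) == 3: 3 partitions x 3 members = 9 member->partition slots
// for 7 members, so 9 - 7 = 2 members get 2 partitions and the other 5 get 1.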
Fetch
Consumer side:
- application thread: if the ShareFetchBuffer is empty, send a ShareFetchEvent to the IO thread;
- IO thread: on a ShareFetchEvent, send a ShareFetch request to the broker; unlike a traditional group, the ShareFetch request carries no fetchOffset;
- IO thread: on the ShareFetch response, put the records into the ShareFetchBuffer;
- application thread: take records from the ShareFetchBuffer and consume them;
Broker side (partition leader):
- when a consumer starts on a newly consumed partition, locate the share group progress storage node and load the progress into the in-memory SharePartition;
- if offset = -1 there is no progress; with the default group config share.auto.offset.reset=latest, the partition high watermark becomes the initial progress (a traditional consumer must fetch its initial fetchOffset itself via a ListOffsetsRequest);
- the share group fetchOffset is computed by the leader, which takes the partition lock before pulling messages;
- the pulled records are tracked in the in-memory InFlightBatch and marked ACQUIRED, so subsequent fetchOffset computations skip them;
- a timeout watch is started for these records, default group.share.record.lock.duration.ms=30000: without an ack within 30s they are automatically released and marked AVAILABLE, fetchable again (the release is persisted to share_group_state, with delivery count + 1);
Ack
AcknowledgeType: there are four ack types
- ACCEPT: processed successfully; the message enters a terminal state;
- REJECT: processing failed and no redelivery is wanted; the message enters a terminal state;
- RELEASE: processing failed; release the record lock so the broker may redeliver;
- RENEW: renew the lease, telling the broker not to release the lock; the message remains exclusively held by the current consumer;
The consumer has two ack modes, controlled by share.acknowledgement.mode:
- implicit: the default, implicit commit. Works like a traditional consumer's auto-commit: each consumer.poll marks all messages from the previous poll as ACCEPT;
- explicit: explicit commit; for every record returned by poll, call the ShareConsumer.acknowledge API with an AcknowledgeType;
Acks are sent in one of two ways: preferably piggybacked on a ShareFetchRequest, otherwise in a standalone ShareAcknowledgeRequest, reducing network round trips. Note also that share group acks go to the consumed partition's leader, whereas a traditional group commits offsets to the coordinator.
Broker side:
- update the in-memory InFlightBatch, e.g. RELEASE marks the message AVAILABLE, RENEW restarts the timeout watch;
- for non-RENEW acks, the message state is persisted to share_group_state via RPC;
- default group.share.delivery.count.limit=5: a RELEASEd message can be redelivered at most 5 times; beyond that its state becomes ARCHIVED (equivalent to an ack REJECT);