API
The coordinator is dedicated to handling work related to consumer groups and offset commits. The related APIs in the Kafka protocol are:
- OFFSET_COMMIT
- OFFSET_FETCH
- JOIN_GROUP
- LEAVE_GROUP
- SYNC_GROUP
- DESCRIBE_GROUPS
- LIST_GROUPS
- DELETE_GROUPS
The following sections describe how these APIs are handled.
offset_commit and offset_fetch
This section describes how the Kafka server handles offset-related requests. There are two such requests: offset_commit and offset_fetch.
offset_commit
An offset_commit request contains:
- groupId
- offsetData: Map<TopicPartition,PartitionData>
- memberId
- generationId
We already saw PartitionData earlier when looking at partition replication:
public static final class PartitionData {
    @Deprecated
    public final long timestamp; // for V1
    public final long offset;
    public final String metadata;
    public final Optional<Integer> leaderEpoch;
}
After receiving a commit-offset request:
- The request is first authorized against the group; if authorization fails, GROUP_AUTHORIZATION_FAILED is returned.
- Otherwise the topics are authorized; failures return TOPIC_AUTHORIZATION_FAILED.
- Otherwise, if the broker's metadataCache does not contain the topicPartition, UNKNOWN_TOPIC_OR_PARTITION is returned.
- If authorization passes, the handling depends on the request header's version:
  - If version == 0, the offset for the topicPartition is written to ZooKeeper.
  - Otherwise, the groupCoordinator's handleCommitOffsets method is called to store the offset.
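The validation chain above can be sketched as follows. This is a minimal illustration, not Kafka's actual classes; the enum and method names are hypothetical:

```java
// Illustrative sketch of the OFFSET_COMMIT validation/dispatch chain.
// Names here are hypothetical; Kafka's real handler lives in KafkaApis.
public class OffsetCommitDispatch {
    enum Error { NONE, GROUP_AUTHORIZATION_FAILED, TOPIC_AUTHORIZATION_FAILED, UNKNOWN_TOPIC_OR_PARTITION }

    static Error dispatch(boolean groupAuthorized, boolean topicAuthorized,
                          boolean partitionKnown, int version) {
        if (!groupAuthorized) return Error.GROUP_AUTHORIZATION_FAILED;
        if (!topicAuthorized) return Error.TOPIC_AUTHORIZATION_FAILED;
        if (!partitionKnown)  return Error.UNKNOWN_TOPIC_OR_PARTITION;
        if (version == 0) {
            // version 0: write the offset to ZooKeeper
        } else {
            // version >= 1: delegate to groupCoordinator.handleCommitOffsets
        }
        return Error.NONE;
    }
}
```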
Here we focus on the implementation of groupCoordinator's handleCommitOffsets method:
def handleCommitOffsets(groupId: String,
                        memberId: String,
                        generationId: Int,
                        offsetMetadata: immutable.Map[TopicPartition, OffsetAndMetadata],
                        responseCallback: immutable.Map[TopicPartition, Errors] => Unit) {
  validateGroupStatus(groupId, ApiKeys.OFFSET_COMMIT) match {
    case Some(error) => responseCallback(offsetMetadata.mapValues(_ => error))
    case None =>
      groupManager.getGroup(groupId) match {
        case None =>
          if (generationId < 0) {
            // the group is not relying on Kafka for group management, so allow the commit
            // no group metadata exists yet and no generation has started
            val group = groupManager.addGroup(new GroupMetadata(groupId, Empty, time))
            doCommitOffsets(group, memberId, generationId, NO_PRODUCER_ID, NO_PRODUCER_EPOCH,
              offsetMetadata, responseCallback)
          } else {
            // or this is a request coming from an older generation. either way, reject the commit
            // no group information exists after the rebalance
            responseCallback(offsetMetadata.mapValues(_ => Errors.ILLEGAL_GENERATION))
          }
        // group information exists
        case Some(group) =>
          doCommitOffsets(group, memberId, generationId, NO_PRODUCER_ID, NO_PRODUCER_EPOCH,
            offsetMetadata, responseCallback)
      }
  }
}
groupCoordinator then calls the doCommitOffsets method to store the offsets:
private def doCommitOffsets(group: GroupMetadata,
                            memberId: String,
                            generationId: Int,
                            producerId: Long,
                            producerEpoch: Short,
                            offsetMetadata: immutable.Map[TopicPartition, OffsetAndMetadata],
                            responseCallback: immutable.Map[TopicPartition, Errors] => Unit) {
  group.inLock {
    if (group.is(Dead)) {
      // a Dead group no longer has any members
      responseCallback(offsetMetadata.mapValues(_ => Errors.UNKNOWN_MEMBER_ID))
    } else if ((generationId < 0 && group.is(Empty)) || (producerId != NO_PRODUCER_ID)) {
      // The group is only using Kafka to store offsets.
      // Also, for transactional offset commits we don't need to validate group membership and the generation.
      groupManager.storeOffsets(group, memberId, offsetMetadata, responseCallback, producerId, producerEpoch)
    } else if (group.is(CompletingRebalance)) {
      // the group is in the middle of a rebalance
      responseCallback(offsetMetadata.mapValues(_ => Errors.REBALANCE_IN_PROGRESS))
    } else if (!group.has(memberId)) {
      responseCallback(offsetMetadata.mapValues(_ => Errors.UNKNOWN_MEMBER_ID))
    } else if (generationId != group.generationId) {
      responseCallback(offsetMetadata.mapValues(_ => Errors.ILLEGAL_GENERATION))
    } else {
      val member = group.get(memberId)
      completeAndScheduleNextHeartbeatExpiration(group, member)
      groupManager.storeOffsets(group, memberId, offsetMetadata, responseCallback)
    }
  }
}
Finally, GroupMetadataManager's storeOffsets method is called:
def storeOffsets(group: GroupMetadata,
                 consumerId: String,
                 offsetMetadata: immutable.Map[TopicPartition, OffsetAndMetadata],
                 responseCallback: immutable.Map[TopicPartition, Errors] => Unit,
                 producerId: Long = RecordBatch.NO_PRODUCER_ID,
                 producerEpoch: Short = RecordBatch.NO_PRODUCER_EPOCH): Unit = {
  // first filter out partitions with offset metadata size exceeding limit
  val filteredOffsetMetadata = offsetMetadata.filter { case (_, offsetAndMetadata) =>
    validateOffsetMetadataLength(offsetAndMetadata.metadata)
  }
  group.inLock {
    if (!group.hasReceivedConsistentOffsetCommits)
      warn(s"group: ${group.groupId} with leader: ${group.leaderOrNull} has received offset commits from consumers as well " +
        s"as transactional producers. Mixing both types of offset commits will generally result in surprises and " +
        s"should be avoided.")
  }
  val isTxnOffsetCommit = producerId != RecordBatch.NO_PRODUCER_ID
  // construct the message set to append
  if (filteredOffsetMetadata.isEmpty) {
    // compute the final error codes for the commit response
    val commitStatus = offsetMetadata.mapValues(_ => Errors.OFFSET_METADATA_TOO_LARGE)
    responseCallback(commitStatus)
    None
  } else {
    getMagic(partitionFor(group.groupId)) match {
      case Some(magicValue) =>
        // We always use CREATE_TIME, like the producer. The conversion to LOG_APPEND_TIME (if necessary) happens automatically.
        val timestampType = TimestampType.CREATE_TIME
        val timestamp = time.milliseconds()
        val records = filteredOffsetMetadata.map { case (topicPartition, offsetAndMetadata) =>
          val key = GroupMetadataManager.offsetCommitKey(group.groupId, topicPartition)
          val value = GroupMetadataManager.offsetCommitValue(offsetAndMetadata, interBrokerProtocolVersion)
          new SimpleRecord(timestamp, key, value)
        }
        val offsetTopicPartition = new TopicPartition(Topic.GROUP_METADATA_TOPIC_NAME, partitionFor(group.groupId))
        val buffer = ByteBuffer.allocate(AbstractRecords.estimateSizeInBytes(magicValue, compressionType, records.asJava))
        if (isTxnOffsetCommit && magicValue < RecordBatch.MAGIC_VALUE_V2)
          throw Errors.UNSUPPORTED_FOR_MESSAGE_FORMAT.exception("Attempting to make a transaction offset commit with an invalid magic: " + magicValue)
        val builder = MemoryRecords.builder(buffer, magicValue, compressionType, timestampType, 0L, time.milliseconds(),
          producerId, producerEpoch, 0, isTxnOffsetCommit, RecordBatch.NO_PARTITION_LEADER_EPOCH)
        records.foreach(builder.append)
        val entries = Map(offsetTopicPartition -> builder.build())
        // set the callback function to insert offsets into cache after log append completed
        def putCacheCallback(responseStatus: Map[TopicPartition, PartitionResponse]) {
          ...
        }
        if (isTxnOffsetCommit) {
          group.inLock {
            addProducerGroup(producerId, group.groupId)
            group.prepareTxnOffsetCommit(producerId, offsetMetadata)
          }
        } else {
          group.inLock {
            group.prepareOffsetCommit(offsetMetadata)
          }
        }
        appendForGroup(group, entries, putCacheCallback)
      case None =>
        val commitStatus = offsetMetadata.map { case (topicPartition, _) =>
          (topicPartition, Errors.NOT_COORDINATOR)
        }
        responseCallback(commitStatus)
        None
    }
  }
}
Since 0.9, Kafka stores offset data in an internal topic. The steps are:
- First, build the records to append to the topic partition:
  val records = filteredOffsetMetadata.map { case (topicPartition, offsetAndMetadata) =>
    val key = GroupMetadataManager.offsetCommitKey(group.groupId, topicPartition)
    val value = GroupMetadataManager.offsetCommitValue(offsetAndMetadata, interBrokerProtocolVersion)
    new SimpleRecord(timestamp, key, value)
  }
- Define the putCacheCallback method, which is invoked after the records have been persisted and inserts the offset information into the cache.
- Call the appendForGroup method to store the offsets; ultimately the log is written to disk through the replicaManager.
The method that finally writes the log to a file is:
def appendRecordsToLeader(records: MemoryRecords, isFromClient: Boolean, requiredAcks: Int = 0): LogAppendInfo = {
  val (info, leaderHWIncremented) = inReadLock(leaderIsrUpdateLock) {
    leaderReplicaIfLocal match {
      // only the leader replica is written to
      case Some(leaderReplica) =>
        val log = leaderReplica.log.get
        val minIsr = log.config.minInSyncReplicas
        val inSyncSize = inSyncReplicas.size
        // Avoid writing to leader if there are not enough insync replicas to make it safe
        if (inSyncSize < minIsr && requiredAcks == -1) {
          throw new NotEnoughReplicasException(s"The size of the current ISR ${inSyncReplicas.map(_.brokerId)} " +
            s"is insufficient to satisfy the min.isr requirement of $minIsr for partition $topicPartition")
        }
        val info = log.appendAsLeader(records, leaderEpoch = this.leaderEpoch, isFromClient)
        // probably unblock some follower fetch requests since log end offset has been updated
        replicaManager.tryCompleteDelayedFetch(TopicPartitionOperationKey(this.topic, this.partitionId))
        // we may need to increment high watermark since ISR could be down to 1
        (info, maybeIncrementLeaderHW(leaderReplica))
      case None =>
        throw new NotLeaderForPartitionException("Leader not local for partition %s on broker %d"
          .format(topicPartition, localBrokerId))
    }
  }
  // some delayed operations may be unblocked after HW changed
  if (leaderHWIncremented)
    tryCompleteDelayedRequests()
  info
}
After the log append succeeds, the putCacheCallback callback updates the metadata; the method that stores the offset into the cache is:
def onOffsetCommitAppend(topicPartition: TopicPartition, offsetWithCommitRecordMetadata: CommitRecordMetadataAndOffset) {
  if (pendingOffsetCommits.contains(topicPartition)) {
    if (offsetWithCommitRecordMetadata.appendedBatchOffset.isEmpty)
      throw new IllegalStateException("Cannot complete offset commit write without providing the metadata of the record " +
        "in the log.")
    // only update the cache with the newer offset
    if (!offsets.contains(topicPartition) || offsets(topicPartition).olderThan(offsetWithCommitRecordMetadata))
      offsets.put(topicPartition, offsetWithCommitRecordMetadata)
  }
  pendingOffsetCommits.get(topicPartition) match {
    case Some(stagedOffset) if offsetWithCommitRecordMetadata.offsetAndMetadata == stagedOffset =>
      pendingOffsetCommits.remove(topicPartition)
    case _ =>
      // The pendingOffsetCommits for this partition could be empty if the topic was deleted, in which case
      // its entries would be removed from the cache by the `removeOffsets` method.
  }
}
At this point, both the cache and the topic hold the latest offset data.
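The cache-update rule in onOffsetCommitAppend can be illustrated with a small sketch: an entry is replaced only by a commit whose record was appended at a higher position in the log, so a stale commit that completes late cannot overwrite a newer one. The class and field names below are illustrative, not Kafka's:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "only update with the newer offset" rule (names are hypothetical).
public class OffsetCache {
    static class CommittedOffset {
        final long appendedBatchOffset; // position of the commit record in the offsets topic
        final long consumerOffset;      // the offset the consumer committed
        CommittedOffset(long appendedBatchOffset, long consumerOffset) {
            this.appendedBatchOffset = appendedBatchOffset;
            this.consumerOffset = consumerOffset;
        }
        boolean olderThan(CommittedOffset other) {
            return appendedBatchOffset < other.appendedBatchOffset;
        }
    }

    private final Map<String, CommittedOffset> offsets = new HashMap<>();

    void onCommitAppend(String topicPartition, CommittedOffset c) {
        CommittedOffset current = offsets.get(topicPartition);
        if (current == null || current.olderThan(c)) // keep only the newer commit
            offsets.put(topicPartition, c);
    }

    Long committed(String topicPartition) {
        CommittedOffset c = offsets.get(topicPartition);
        return c == null ? null : c.consumerOffset;
    }
}
```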
offset_fetch
An offset_fetch request retrieves the offsets a group has consumed. The request mainly contains:
private static final List<TopicPartition> ALL_TOPIC_PARTITIONS = null;
private final String groupId;
private final List<TopicPartition> partitions;
The request is handled by:
def createResponse(requestThrottleMs: Int): AbstractResponse = {
  val offsetFetchResponse =
    // reject the request if not authorized to the group
    if (!authorize(request.session, Describe, Resource(Group, offsetFetchRequest.groupId, LITERAL)))
      offsetFetchRequest.getErrorResponse(requestThrottleMs, Errors.GROUP_AUTHORIZATION_FAILED)
    else {
      if (header.apiVersion == 0) {
        val (authorizedPartitions, unauthorizedPartitions) = offsetFetchRequest.partitions.asScala
          .partition(authorizeTopicDescribe)
        // version 0 reads offsets from ZK
        val authorizedPartitionData = authorizedPartitions.map { topicPartition =>
          try {
            if (!metadataCache.contains(topicPartition))
              (topicPartition, OffsetFetchResponse.UNKNOWN_PARTITION)
            else {
              val payloadOpt = zkClient.getConsumerOffset(offsetFetchRequest.groupId, topicPartition)
              payloadOpt match {
                case Some(payload) =>
                  (topicPartition, new OffsetFetchResponse.PartitionData(payload.toLong,
                    Optional.empty(), OffsetFetchResponse.NO_METADATA, Errors.NONE))
                case None =>
                  (topicPartition, OffsetFetchResponse.UNKNOWN_PARTITION)
              }
            }
          } catch {
            case e: Throwable =>
              (topicPartition, new OffsetFetchResponse.PartitionData(OffsetFetchResponse.INVALID_OFFSET,
                Optional.empty(), OffsetFetchResponse.NO_METADATA, Errors.forException(e)))
          }
        }.toMap
        val unauthorizedPartitionData = unauthorizedPartitions.map(_ -> OffsetFetchResponse.UNAUTHORIZED_PARTITION).toMap
        new OffsetFetchResponse(requestThrottleMs, Errors.NONE, (authorizedPartitionData ++ unauthorizedPartitionData).asJava)
      } else {
        // versions 1 and above read offsets from Kafka
        if (offsetFetchRequest.isAllPartitions) {
          val (error, allPartitionData) = groupCoordinator.handleFetchOffsets(offsetFetchRequest.groupId)
          if (error != Errors.NONE)
            offsetFetchRequest.getErrorResponse(requestThrottleMs, error)
          else {
            // clients are not allowed to see offsets for topics that are not authorized for Describe
            val authorizedPartitionData = allPartitionData.filter { case (topicPartition, _) => authorizeTopicDescribe(topicPartition) }
            new OffsetFetchResponse(requestThrottleMs, Errors.NONE, authorizedPartitionData.asJava)
          }
        } else {
          val (authorizedPartitions, unauthorizedPartitions) = offsetFetchRequest.partitions.asScala
            .partition(authorizeTopicDescribe)
          val (error, authorizedPartitionData) = groupCoordinator.handleFetchOffsets(offsetFetchRequest.groupId,
            Some(authorizedPartitions))
          if (error != Errors.NONE)
            offsetFetchRequest.getErrorResponse(requestThrottleMs, error)
          else {
            val unauthorizedPartitionData = unauthorizedPartitions.map(_ -> OffsetFetchResponse.UNAUTHORIZED_PARTITION).toMap
            new OffsetFetchResponse(requestThrottleMs, Errors.NONE, (authorizedPartitionData ++ unauthorizedPartitionData).asJava)
          }
        }
      }
    }
  trace(s"Sending offset fetch response $offsetFetchResponse for correlation id ${header.correlationId} to client ${header.clientId}.")
  offsetFetchResponse
}
The error types here are similar to offset_commit's, so we won't repeat them. The core of this method is the call to groupMetadataManager's getOffsets method:
def getOffsets(groupId: String, topicPartitionsOpt: Option[Seq[TopicPartition]]): Map[TopicPartition, OffsetFetchResponse.PartitionData] = {
  trace("Getting offsets of %s for group %s.".format(topicPartitionsOpt.getOrElse("all partitions"), groupId))
  val group = groupMetadataCache.get(groupId)
  // if the group does not exist, return INVALID_OFFSET
  if (group == null) {
    topicPartitionsOpt.getOrElse(Seq.empty[TopicPartition]).map { topicPartition =>
      val partitionData = new OffsetFetchResponse.PartitionData(OffsetFetchResponse.INVALID_OFFSET,
        Optional.empty(), "", Errors.NONE)
      topicPartition -> partitionData
    }.toMap
  } else {
    group.inLock {
      if (group.is(Dead)) {
        topicPartitionsOpt.getOrElse(Seq.empty[TopicPartition]).map { topicPartition =>
          val partitionData = new OffsetFetchResponse.PartitionData(OffsetFetchResponse.INVALID_OFFSET,
            Optional.empty(), "", Errors.NONE)
          topicPartition -> partitionData
        }.toMap
      } else {
        topicPartitionsOpt match {
          case None =>
            // Return offsets for all partitions owned by this consumer group. (this only applies to consumers
            // that commit offsets to Kafka.)
            group.allOffsets.map { case (topicPartition, offsetAndMetadata) =>
              topicPartition -> new OffsetFetchResponse.PartitionData(offsetAndMetadata.offset,
                offsetAndMetadata.leaderEpoch, offsetAndMetadata.metadata, Errors.NONE)
            }
          case Some(topicPartitions) =>
            topicPartitions.map { topicPartition =>
              val partitionData = group.offset(topicPartition) match {
                case None =>
                  new OffsetFetchResponse.PartitionData(OffsetFetchResponse.INVALID_OFFSET,
                    Optional.empty(), "", Errors.NONE)
                case Some(offsetAndMetadata) =>
                  new OffsetFetchResponse.PartitionData(offsetAndMetadata.offset,
                    offsetAndMetadata.leaderEpoch, offsetAndMetadata.metadata, Errors.NONE)
              }
              topicPartition -> partitionData
            }.toMap
        }
      }
    }
  }
}
groupMetadataManager's GroupMetadata keeps a cache of the consumed offset for each topic partition; fetching an offset reads from this cache.
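The lookup semantics above can be sketched as: an unknown (or dead) group yields INVALID_OFFSET (-1) with no error, and so does an unknown partition; otherwise the cached value is returned. The class below is illustrative only:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the getOffsets cache-lookup semantics (hypothetical names).
public class OffsetFetchSketch {
    static final long INVALID_OFFSET = -1L;
    private final Map<String, Map<String, Long>> groupOffsets = new HashMap<>();

    void commit(String groupId, String topicPartition, long offset) {
        groupOffsets.computeIfAbsent(groupId, g -> new HashMap<>()).put(topicPartition, offset);
    }

    long fetch(String groupId, String topicPartition) {
        Map<String, Long> offsets = groupOffsets.get(groupId);
        if (offsets == null) return INVALID_OFFSET;                  // unknown group
        return offsets.getOrDefault(topicPartition, INVALID_OFFSET); // unknown partition
    }
}
```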
group-related requests
Kafka stores the offsets consumed by a consumer group in an internal topic, and group-related information is stored in an internal topic as well.
When Kafka handles a group-related request, it passes through several components. Take JOIN_GROUP as an example: a join-group request means a consumer wants to join a given consumer group, and it goes through these steps:
- Check that authorization passes.
- The groupCoordinator acts as the coordinator handling the request; it calls the groupManager to get the group's information, and if the group does not exist, the groupManager creates it. At this step, join_group for example reaches the following groupCoordinator methods:
  - doJoinGroup: add the member to the group
  - addMemberAndRebalance/updateMemberAndRebalance: since the membership has changed, a rebalance is needed
Again taking join_group as the example, let's look at the implementation:
group.inLock {
  if (!group.is(Empty) && (!group.protocolType.contains(protocolType) || !group.supportsProtocols(protocols.map(_._1).toSet))) {
    // if the new member does not support the group protocol, reject it
    responseCallback(joinError(memberId, Errors.INCONSISTENT_GROUP_PROTOCOL))
  } else if (group.is(Empty) && (protocols.isEmpty || protocolType.isEmpty)) {
    // reject if this is the first member and the group protocol or protocolType is empty
    responseCallback(joinError(memberId, Errors.INCONSISTENT_GROUP_PROTOCOL))
  } else if (memberId != JoinGroupRequest.UNKNOWN_MEMBER_ID && !group.has(memberId)) {
    // if the member trying to register with a un-recognized id, send the response to let
    // it reset its member id and retry
    responseCallback(joinError(memberId, Errors.UNKNOWN_MEMBER_ID))
  } else {
    group.currentState match {
      case Dead =>
        // if the group is marked as dead, it means some other thread has just removed the group
        // from the coordinator metadata; this is likely that the group has migrated to some other
        // coordinator OR the group is in a transient unstable phase. Let the member retry
        // joining without the specified member id,
        // while the group is Dead the member keeps retrying until the group leaves this state
        responseCallback(joinError(memberId, Errors.UNKNOWN_MEMBER_ID))
      case PreparingRebalance =>
        if (memberId == JoinGroupRequest.UNKNOWN_MEMBER_ID) {
          // a new member id: perform the add-member path
          addMemberAndRebalance(rebalanceTimeoutMs, sessionTimeoutMs, clientId, clientHost, protocolType,
            protocols, group, responseCallback)
        } else {
          // an existing member: perform the update-member path
          val member = group.get(memberId)
          updateMemberAndRebalance(group, member, protocols, responseCallback)
        }
      case CompletingRebalance =>
        if (memberId == JoinGroupRequest.UNKNOWN_MEMBER_ID) {
          // a new member id: perform the add-member path
          addMemberAndRebalance(rebalanceTimeoutMs, sessionTimeoutMs, clientId, clientHost, protocolType,
            protocols, group, responseCallback)
        } else {
          val member = group.get(memberId)
          // the member may simply not have received the join-group response; no rebalance
          // is needed, just return the member's current information
          if (member.matches(protocols)) {
            // member is joining with the same metadata (which could be because it failed to
            // receive the initial JoinGroup response), so just return current group information
            // for the current generation.
            responseCallback(JoinGroupResult(
              members = if (group.isLeader(memberId)) {
                group.currentMemberMetadata
              } else {
                Map.empty
              },
              memberId = memberId,
              generationId = group.generationId,
              subProtocol = group.protocolOrNull,
              leaderId = group.leaderOrNull,
              error = Errors.NONE))
          } else {
            // member has changed metadata, so force a rebalance
            updateMemberAndRebalance(group, member, protocols, responseCallback)
          }
        }
      case Empty | Stable =>
        if (memberId == JoinGroupRequest.UNKNOWN_MEMBER_ID) {
          // if the member id is unknown, register the member to the group
          addMemberAndRebalance(rebalanceTimeoutMs, sessionTimeoutMs, clientId, clientHost, protocolType,
            protocols, group, responseCallback)
        } else {
          val member = group.get(memberId)
          // in the Empty or Stable state, a join request from the leader triggers a rebalance
          if (group.isLeader(memberId) || !member.matches(protocols)) {
            // force a rebalance if a member has changed metadata or if the leader sends JoinGroup.
            // The latter allows the leader to trigger rebalances for changes affecting assignment
            // which do not affect the member metadata (such as topic metadata changes for the consumer)
            updateMemberAndRebalance(group, member, protocols, responseCallback)
          } else {
            // for followers with no actual change to their metadata, just return group information
            // for the current generation which will allow them to issue SyncGroup
            responseCallback(JoinGroupResult(
              members = Map.empty,
              memberId = memberId,
              generationId = group.generationId,
              subProtocol = group.protocolOrNull,
              leaderId = group.leaderOrNull,
              error = Errors.NONE))
          }
        }
    }
  }
}
rebalance
addMemberAndRebalance
private def addMemberAndRebalance(rebalanceTimeoutMs: Int,
                                  sessionTimeoutMs: Int,
                                  clientId: String,
                                  clientHost: String,
                                  protocolType: String,
                                  protocols: List[(String, Array[Byte])],
                                  group: GroupMetadata,
                                  callback: JoinCallback): MemberMetadata = {
  val memberId = clientId + "-" + group.generateMemberIdSuffix
  val member = new MemberMetadata(memberId, group.groupId, clientId, clientHost, rebalanceTimeoutMs,
    sessionTimeoutMs, protocolType, protocols)
  // update the newMemberAdded flag to indicate that the join group can be further delayed
  // note the flag set here, marking that the group has a newly added member
  if (group.is(PreparingRebalance) && group.generationId == 0)
    group.newMemberAdded = true
  group.add(member, callback)
  maybePrepareRebalance(group, s"Adding new member $memberId")
  member
}
Before the rebalance runs, the prepareRebalance method is called:
private def prepareRebalance(group: GroupMetadata, reason: String) {
  // if any members are awaiting sync, cancel their request and have them rejoin
  if (group.is(CompletingRebalance))
    resetAndPropagateAssignmentError(group, Errors.REBALANCE_IN_PROGRESS)
  val delayedRebalance = if (group.is(Empty))
    new InitialDelayedJoin(this,
      joinPurgatory,
      group,
      groupConfig.groupInitialRebalanceDelayMs,
      groupConfig.groupInitialRebalanceDelayMs,
      max(group.rebalanceTimeoutMs - groupConfig.groupInitialRebalanceDelayMs, 0))
  else
    new DelayedJoin(this, group, group.rebalanceTimeoutMs)
  group.transitionTo(PreparingRebalance)
  info(s"Preparing to rebalance group ${group.groupId} in state ${group.currentState} with old generation " +
    s"${group.generationId} (${Topic.GROUP_METADATA_TOPIC_NAME}-${partitionFor(group.groupId)}) (reason: $reason)")
  val groupKey = GroupKey(group.groupId)
  joinPurgatory.tryCompleteElseWatch(delayedRebalance, Seq(groupKey))
}
Here the group's state becomes PreparingRebalance, and a delayed operation is registered. We will cover delayed operations in detail later; the gist is:
- Every Kafka DelayedOperation implements three methods: tryComplete, onExpiration, and onComplete.
- The delayed operation is registered with the helper class DelayedOperationPurgatory. The purgatory keeps keys and, for each key, watchers; a watcher observes an operation. Here the key is the groupId, and the operation is the delayed operation we just defined.
- When the purgatory registers the operation on the key's watcher, it first tries to complete the operation by calling tryComplete; if the condition holds, forceComplete is called. So we should pay attention to how the delayed operation's forceComplete is implemented. Let's look at the code.
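The contract above can be sketched in a few lines: tryComplete checks a condition and, if it holds, calls forceComplete, which guarantees onComplete runs exactly once. The sketch below follows Kafka's method names but omits timers and onExpiration; the DelayedJoinSketch subclass is purely illustrative:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal sketch of the DelayedOperation lifecycle (timers/purgatory omitted).
abstract class DelayedOperation {
    private final AtomicBoolean completed = new AtomicBoolean(false);

    abstract boolean tryComplete(); // call forceComplete() if the condition holds
    abstract void onComplete();     // the actual work; runs at most once

    final boolean forceComplete() {
        if (completed.compareAndSet(false, true)) { // first caller wins
            onComplete();
            return true;
        }
        return false;
    }
}

// Hypothetical DelayedJoin: completes once all members are awaiting a join response.
class DelayedJoinSketch extends DelayedOperation {
    final int members;
    int awaiting = 0;
    boolean rebalanced = false;

    DelayedJoinSketch(int members) { this.members = members; }

    void memberJoined() { awaiting++; tryComplete(); }

    @Override boolean tryComplete() { return members <= awaiting && forceComplete(); }
    @Override void onComplete() { rebalanced = true; } // would bump the generation, etc.
}
```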
First, the condition checked in tryComplete:
def tryCompleteJoin(group: GroupMetadata, forceComplete: () => Boolean) = {
  group.inLock {
    if (group.hasAllMembersJoined)
      forceComplete()
    else false
  }
}
where
// every time a member initiates a join (possibly an existing member), numMembersAwaitingJoin is incremented; it is decremented when the join request returns
def hasAllMembersJoined = members.size <= numMembersAwaitingJoin
As the code shows, once every member of the group has issued a join request, hasAllMembersJoined holds and the rebalance can begin.
def forceComplete(): Boolean = {
  if (completed.compareAndSet(false, true)) {
    // cancel the timeout timer
    cancel()
    onComplete()
    true
  } else {
    false
  }
}
forceComplete checks whether the operation has already completed; if not, it calls the delayed operation's overridden onComplete method:
def onCompleteJoin(group: GroupMetadata) {
  group.inLock {
    // remove any members who haven't joined the group yet
    group.notYetRejoinedMembers.foreach { failedMember =>
      removeHeartbeatForLeavingMember(group, failedMember)
      group.remove(failedMember.memberId)
      // TODO: cut the socket connection to the client
    }
    if (!group.is(Dead)) {
      // first move to a new generation id
      group.initNextGeneration()
      if (group.is(Empty)) {
        info(s"Group ${group.groupId} with generation ${group.generationId} is now empty " +
          s"(${Topic.GROUP_METADATA_TOPIC_NAME}-${partitionFor(group.groupId)})")
        groupManager.storeGroup(group, Map.empty, error => {
          if (error != Errors.NONE) {
            // we failed to write the empty group metadata. If the broker fails before another rebalance,
            // the previous generation written to the log will become active again (and most likely timeout).
            // This should be safe since there are no active members in an empty generation, so we just warn.
            warn(s"Failed to write empty metadata for group ${group.groupId}: ${error.message}")
          }
        })
      } else {
        info(s"Stabilized group ${group.groupId} generation ${group.generationId} " +
          s"(${Topic.GROUP_METADATA_TOPIC_NAME}-${partitionFor(group.groupId)})")
        // trigger the awaiting join group response callback for all the members after rebalancing
        for (member <- group.allMemberMetadata) {
          assert(member.awaitingJoinCallback != null)
          val joinResult = JoinGroupResult(
            members = if (group.isLeader(member.memberId)) {
              group.currentMemberMetadata
            } else {
              Map.empty
            },
            memberId = member.memberId,
            generationId = group.generationId,
            subProtocol = group.protocolOrNull,
            leaderId = group.leaderOrNull,
            error = Errors.NONE)
          group.invokeJoinCallback(member, joinResult)
          completeAndScheduleNextHeartbeatExpiration(group, member)
        }
      }
    }
  }
}
Note that the coordinator waits here for a certain period; members that have not sent a join request within that window are removed from the group. Finally, the coordinator sends the member metadata to the leader. So who is a consumer group's leader? The add method reveals it:
def add(member: MemberMetadata, callback: JoinCallback = null) {
  if (members.isEmpty)
    this.protocolType = Some(member.protocolType)
  assert(groupId == member.groupId)
  assert(this.protocolType.orNull == member.protocolType)
  assert(supportsProtocols(member.protocols))
  if (leaderId.isEmpty)
    leaderId = Some(member.memberId)
  members.put(member.memberId, member)
  member.supportedProtocols.foreach { case (protocol, _) => supportedProtocols(protocol) += 1 }
  member.awaitingJoinCallback = callback
  if (member.awaitingJoinCallback != null)
    numMembersAwaitingJoin += 1
}
As the code shows, the first consumer to join the group becomes the leader.
In Kafka, the consumers' assignment is decided by each group's leader: after receiving the join_group response, the leader assigns topic partitions and sends the result back to the coordinator via a sync_group request.
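What the leader does between JoinGroup and SyncGroup can be sketched as follows. Real consumers use pluggable assignors (range, round-robin, sticky); this illustrative version is a plain round-robin over one topic's partitions, with all names hypothetical:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical leader-side assignment: memberId -> list of partition numbers.
public class RoundRobinAssignor {
    static Map<String, List<Integer>> assign(List<String> members, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String m : members)
            assignment.put(m, new ArrayList<>());
        for (int p = 0; p < numPartitions; p++)
            assignment.get(members.get(p % members.size())).add(p); // deal partitions in turn
        return assignment;
    }
}
```

The result map is what the leader would serialize into the sync_group request's groupAssignment field.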
sync group
After receiving the join-group response, the leader computes the rebalance assignment and sends the result to the coordinator. The main fields of a sync-group request are:
private final String groupId;
private final int generationId;
private final String memberId;
private final Map<String, ByteBuffer> groupAssignment;
Like all other group requests, the coordinator first authorizes a sync-group request; once authorized it runs handleSyncGroup, which eventually calls doSyncGroup:
private def doSyncGroup(group: GroupMetadata,
                        generationId: Int,
                        memberId: String,
                        groupAssignment: Map[String, Array[Byte]],
                        responseCallback: SyncCallback) {
  group.inLock {
    if (!group.has(memberId)) {
      responseCallback(Array.empty, Errors.UNKNOWN_MEMBER_ID)
    } else if (generationId != group.generationId) {
      responseCallback(Array.empty, Errors.ILLEGAL_GENERATION)
    } else {
      group.currentState match {
        // groups in the Empty or Dead state return an error
        case Empty | Dead =>
          responseCallback(Array.empty, Errors.UNKNOWN_MEMBER_ID)
        // a group in PreparingRebalance does not accept sync requests
        case PreparingRebalance =>
          responseCallback(Array.empty, Errors.REBALANCE_IN_PROGRESS)
        case CompletingRebalance =>
          group.get(memberId).awaitingSyncCallback = responseCallback
          // if this is the leader, then we can attempt to persist state and transition to stable
          // only the leader's request is processed
          if (group.isLeader(memberId)) {
            info(s"Assignment received from leader for group ${group.groupId} for generation ${group.generationId}")
            // fill any missing members with an empty assignment
            val missing = group.allMembers -- groupAssignment.keySet
            val assignment = groupAssignment ++ missing.map(_ -> Array.empty[Byte]).toMap
            groupManager.storeGroup(group, assignment, (error: Errors) => {
              group.inLock {
                // another member may have joined the group while we were awaiting this callback,
                // so we must ensure we are still in the CompletingRebalance state and the same generation
                // when it gets invoked. if we have transitioned to another state, then do nothing
                if (group.is(CompletingRebalance) && generationId == group.generationId) {
                  if (error != Errors.NONE) {
                    resetAndPropagateAssignmentError(group, error)
                    maybePrepareRebalance(group, s"error when storing group assignment during SyncGroup (member: $memberId)")
                  } else {
                    setAndPropagateAssignment(group, assignment)
                    group.transitionTo(Stable)
                  }
                }
              }
            })
          }
        case Stable =>
          // if the group is stable, we just return the current assignment
          val memberMetadata = group.get(memberId)
          responseCallback(memberMetadata.assignment, Errors.NONE)
          completeAndScheduleNextHeartbeatExpiration(group, group.get(memberId))
      }
    }
  }
}
After receiving the assignment from the leader, the coordinator persists the group information to the __consumer_offsets topic and updates each member's assignment metadata. When the other, non-leader members send their requests, the coordinator directly returns each one's assignment.
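The "fill any missing members with an empty assignment" step above (the `missing = group.allMembers -- groupAssignment.keySet` line) can be sketched on its own; the class name is hypothetical:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of filling in members the leader omitted: they get an empty assignment
// rather than being dropped from the propagated result.
public class SyncGroupSketch {
    static Map<String, byte[]> fillMissing(Set<String> allMembers, Map<String, byte[]> fromLeader) {
        Map<String, byte[]> full = new HashMap<>(fromLeader);
        for (String m : allMembers)
            full.putIfAbsent(m, new byte[0]); // omitted members receive an empty assignment
        return full;
    }
}
```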
GroupMetadata
- PreparingRebalance: preparing for a rebalance
- CompletingRebalance: waiting for the leader to return the assignment
- Stable
- Dead: the group has no members left and its metadata is being removed
- Empty: the group has no members, but is still waiting for its offsets to expire
Rows are the current state; columns are the target state:

| | PreparingRebalance | CompletingRebalance | Stable | Dead | Empty |
|---|---|---|---|---|---|
| PreparingRebalance | no | a member has joined the group | no | the group is removed | all members have left the group |
| CompletingRebalance | a member joins, leaves, or fails | no | the leader's assignment is received | the group is removed | no |
| Stable | a member fails, leaves, or a new member joins | no | no | the group is removed | no |
| Dead | no | no | no | no | no |
| Empty | a new member joins | no | no | offsets have expired and the group is removed | no |
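The table above describes a small state machine; it can be encoded directly as a transition map (Dead is terminal and allows no transitions out). This sketch only mirrors the table:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Encoding of the GroupMetadata state-transition table above.
public class GroupState {
    static final Map<String, Set<String>> VALID = new HashMap<>();
    static {
        VALID.put("PreparingRebalance",  new HashSet<>(Arrays.asList("CompletingRebalance", "Dead", "Empty")));
        VALID.put("CompletingRebalance", new HashSet<>(Arrays.asList("PreparingRebalance", "Stable", "Dead")));
        VALID.put("Stable",              new HashSet<>(Arrays.asList("PreparingRebalance", "Dead")));
        VALID.put("Dead",                Collections.emptySet()); // terminal: no transitions out
        VALID.put("Empty",               new HashSet<>(Arrays.asList("PreparingRebalance", "Dead")));
    }

    static boolean canTransition(String from, String to) {
        return VALID.getOrDefault(from, Collections.emptySet()).contains(to);
    }
}
```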