kafka server - coordinator


api

The coordinator handles consumer group management and offset commits. The Kafka APIs related to it are:

  • OFFSET_COMMIT
  • OFFSET_FETCH
  • JOIN_GROUP
  • LEAVE_GROUP
  • SYNC_GROUP
  • DESCRIBE_GROUPS
  • LIST_GROUPS
  • DELETE_GROUPS

The following sections walk through how these APIs are handled.

offset_commit and offset_fetch

This section describes how the Kafka server handles offset-related requests. There are two of them: offset_commit and offset_fetch.

offset_commit

An offset_commit request contains:

  • groupId
  • offsetData: Map<TopicPartition,PartitionData>
  • memberId
  • generationId

We already encountered PartitionData earlier when covering partition replication:

public static final class PartitionData {
        @Deprecated
        public final long timestamp;                // for V1

        public final long offset;
        public final String metadata;
        public final Optional<Integer> leaderEpoch;
}

After receiving a commit offset request, the server:

  • First authorizes the request against the group; if authorization fails (no permission on the group), it returns a GROUP_AUTHORIZATION_FAILED error.
  • Otherwise it authorizes each topic; failures return a TOPIC_AUTHORIZATION_FAILED error.
  • Otherwise, if the API's metadataCache does not contain the topicPartition, it returns an UNKNOWN_TOPIC_OR_PARTITION error.
  • If authorization passes, it branches on the request header's version:
    • if version == 0, it writes the offset for the topicPartition to ZooKeeper
    • otherwise it calls groupCoordinator's handleCommitOffsets method to store the offset
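The check order above can be sketched as a small decision function. This is a minimal illustration with our own class and method names, not Kafka's actual API:

```java
// A minimal sketch of the OFFSET_COMMIT check order above; names are ours, not Kafka's.
class OffsetCommitChecks {
    enum Error { GROUP_AUTHORIZATION_FAILED, TOPIC_AUTHORIZATION_FAILED, UNKNOWN_TOPIC_OR_PARTITION }

    // Returns the first failing check, or null when the commit may proceed
    // (version 0 then writes to ZooKeeper, v1+ goes to handleCommitOffsets).
    static Error validate(boolean groupAuthorized,
                          boolean topicAuthorized,
                          boolean partitionInMetadataCache) {
        if (!groupAuthorized) return Error.GROUP_AUTHORIZATION_FAILED;
        if (!topicAuthorized) return Error.TOPIC_AUTHORIZATION_FAILED;
        if (!partitionInMetadataCache) return Error.UNKNOWN_TOPIC_OR_PARTITION;
        return null;
    }
}
```

The order matters: group authorization is checked before topic authorization, so a client with no group permission never learns which topics exist.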

Here we focus on the implementation of groupCoordinator's handleCommitOffsets method:

def handleCommitOffsets(groupId: String,
                          memberId: String,
                          generationId: Int,
                          offsetMetadata: immutable.Map[TopicPartition, OffsetAndMetadata],
                          responseCallback: immutable.Map[TopicPartition, Errors] => Unit) {
    validateGroupStatus(groupId, ApiKeys.OFFSET_COMMIT) match {
      case Some(error) => responseCallback(offsetMetadata.mapValues(_ => error))
      case None =>
        groupManager.getGroup(groupId) match {
          case None =>
            if (generationId < 0) {
              // the group is not relying on Kafka for group management, so allow the commit
              // no group metadata exists and no generation has started yet
              val group = groupManager.addGroup(new GroupMetadata(groupId, Empty, time))
              doCommitOffsets(group, memberId, generationId, NO_PRODUCER_ID, NO_PRODUCER_EPOCH,
                offsetMetadata, responseCallback)
            } else {
              // or this is a request coming from an older generation. either way, reject the commit
              // no group information exists after a rebalance
              responseCallback(offsetMetadata.mapValues(_ => Errors.ILLEGAL_GENERATION))
            }

          // group information exists
          case Some(group) =>
            doCommitOffsets(group, memberId, generationId, NO_PRODUCER_ID, NO_PRODUCER_EPOCH,
              offsetMetadata, responseCallback)
        }
    }
  }

groupCoordinator then calls the doCommitOffsets method to store the offsets:

private def doCommitOffsets(group: GroupMetadata,
                             memberId: String,
                             generationId: Int,
                             producerId: Long,
                             producerEpoch: Short,
                             offsetMetadata: immutable.Map[TopicPartition, OffsetAndMetadata],
                             responseCallback: immutable.Map[TopicPartition, Errors] => Unit) {
   group.inLock {
     if (group.is(Dead)) {
       // a Dead group no longer has any members
       responseCallback(offsetMetadata.mapValues(_ => Errors.UNKNOWN_MEMBER_ID))
     } else if ((generationId < 0 && group.is(Empty)) || (producerId != NO_PRODUCER_ID)) {
       // The group is only using Kafka to store offsets.
       // Also, for transactional offset commits we don't need to validate group membership and the generation.
       groupManager.storeOffsets(group, memberId, offsetMetadata, responseCallback, producerId, producerEpoch)
     } else if (group.is(CompletingRebalance)) {
       // the group is in the middle of a rebalance
       responseCallback(offsetMetadata.mapValues(_ => Errors.REBALANCE_IN_PROGRESS))
     } else if (!group.has(memberId)) {
       responseCallback(offsetMetadata.mapValues(_ => Errors.UNKNOWN_MEMBER_ID))
     } else if (generationId != group.generationId) {
       responseCallback(offsetMetadata.mapValues(_ => Errors.ILLEGAL_GENERATION))
     } else {
       val member = group.get(memberId)
       completeAndScheduleNextHeartbeatExpiration(group, member)
       groupManager.storeOffsets(group, memberId, offsetMetadata, responseCallback)
     }
   }
 }
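The decision order in doCommitOffsets can be modeled as a pure function. This is an illustrative sketch with our own names; for brevity it omits the transactional producerId branch:

```java
// An illustrative model of doCommitOffsets' decision order; names are ours,
// and the transactional (producerId) branch is omitted for brevity.
import java.util.Set;

class CommitDecision {
    enum State { DEAD, EMPTY, PREPARING_REBALANCE, COMPLETING_REBALANCE, STABLE }

    static String decide(State state, int groupGeneration, Set<String> members,
                         String memberId, int requestGeneration) {
        if (state == State.DEAD) return "UNKNOWN_MEMBER_ID";
        // generation < 0 on an Empty group: the group only uses Kafka to store offsets
        if (requestGeneration < 0 && state == State.EMPTY) return "STORE";
        if (state == State.COMPLETING_REBALANCE) return "REBALANCE_IN_PROGRESS";
        if (!members.contains(memberId)) return "UNKNOWN_MEMBER_ID";
        if (requestGeneration != groupGeneration) return "ILLEGAL_GENERATION";
        return "STORE";
    }
}
```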

Finally, GroupMetadataManager's storeOffsets method is called:

def storeOffsets(group: GroupMetadata,
                   consumerId: String,
                   offsetMetadata: immutable.Map[TopicPartition, OffsetAndMetadata],
                   responseCallback: immutable.Map[TopicPartition, Errors] => Unit,
                   producerId: Long = RecordBatch.NO_PRODUCER_ID,
                   producerEpoch: Short = RecordBatch.NO_PRODUCER_EPOCH): Unit = {
    // first filter out partitions with offset metadata size exceeding limit
    val filteredOffsetMetadata = offsetMetadata.filter { case (_, offsetAndMetadata) =>
      validateOffsetMetadataLength(offsetAndMetadata.metadata)
    }

    group.inLock {
      if (!group.hasReceivedConsistentOffsetCommits)
        warn(s"group: ${group.groupId} with leader: ${group.leaderOrNull} has received offset commits from consumers as well " +
          s"as transactional producers. Mixing both types of offset commits will generally result in surprises and " +
          s"should be avoided.")
    }

    val isTxnOffsetCommit = producerId != RecordBatch.NO_PRODUCER_ID
    // construct the message set to append
    if (filteredOffsetMetadata.isEmpty) {
      // compute the final error codes for the commit response
      val commitStatus = offsetMetadata.mapValues(_ => Errors.OFFSET_METADATA_TOO_LARGE)
      responseCallback(commitStatus)
      None
    } else {
      getMagic(partitionFor(group.groupId)) match {
        case Some(magicValue) =>
          // We always use CREATE_TIME, like the producer. The conversion to LOG_APPEND_TIME (if necessary) happens automatically.
          val timestampType = TimestampType.CREATE_TIME
          val timestamp = time.milliseconds()

          val records = filteredOffsetMetadata.map { case (topicPartition, offsetAndMetadata) =>
            val key = GroupMetadataManager.offsetCommitKey(group.groupId, topicPartition)
            val value = GroupMetadataManager.offsetCommitValue(offsetAndMetadata, interBrokerProtocolVersion)
            new SimpleRecord(timestamp, key, value)
          }
          val offsetTopicPartition = new TopicPartition(Topic.GROUP_METADATA_TOPIC_NAME, partitionFor(group.groupId))
          val buffer = ByteBuffer.allocate(AbstractRecords.estimateSizeInBytes(magicValue, compressionType, records.asJava))

          if (isTxnOffsetCommit && magicValue < RecordBatch.MAGIC_VALUE_V2)
            throw Errors.UNSUPPORTED_FOR_MESSAGE_FORMAT.exception("Attempting to make a transaction offset commit with an invalid magic: " + magicValue)

          val builder = MemoryRecords.builder(buffer, magicValue, compressionType, timestampType, 0L, time.milliseconds(),
            producerId, producerEpoch, 0, isTxnOffsetCommit, RecordBatch.NO_PARTITION_LEADER_EPOCH)

          records.foreach(builder.append)
          val entries = Map(offsetTopicPartition -> builder.build())

          // set the callback function to insert offsets into cache after log append completed
          def putCacheCallback(responseStatus: Map[TopicPartition, PartitionResponse]) {
            ...
          }

          if (isTxnOffsetCommit) {
            group.inLock {
              addProducerGroup(producerId, group.groupId)
              group.prepareTxnOffsetCommit(producerId, offsetMetadata)
            }
          } else {
            group.inLock {
              group.prepareOffsetCommit(offsetMetadata)
            }
          }

          appendForGroup(group, entries, putCacheCallback)

        case None =>
          val commitStatus = offsetMetadata.map { case (topicPartition, _) =>
            (topicPartition, Errors.NOT_COORDINATOR)
          }
          responseCallback(commitStatus)
          None
      }
    }
  }

Since 0.9, Kafka has stored offset data in an internal topic. The concrete steps are:

  1. Construct the records to append to the topic partition:

val records = filteredOffsetMetadata.map { case (topicPartition, offsetAndMetadata) =>
  val key = GroupMetadataManager.offsetCommitKey(group.groupId, topicPartition)
  val value = GroupMetadataManager.offsetCommitValue(offsetAndMetadata, interBrokerProtocolVersion)
  new SimpleRecord(timestamp, key, value)
}

  2. Define the putCacheCallback method, invoked after the records have been stored, to insert the offset information into the cache.
  3. Call the appendForGroup method to store the offsets; ultimately the records are written to disk via replicaManager.
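The effect of this keyed storage can be shown with a toy model (our names): each commit is appended as a keyed record, and because __consumer_offsets is log-compacted, the latest record per (group, topic, partition) key is what survives. Replaying the log in order into a map reproduces the committed offsets:

```java
// Toy model of the keyed offset-commit records; later records for the same key
// overwrite earlier ones, just as log compaction would preserve only the latest.
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OffsetLog {
    record Entry(String group, String topic, int partition, long offset) {}

    static Map<String, Long> compacted(List<Entry> log) {
        Map<String, Long> latest = new LinkedHashMap<>();
        for (Entry e : log)
            latest.put(e.group() + "/" + e.topic() + "-" + e.partition(), e.offset());
        return latest;
    }
}
```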

The method that ultimately writes the records to the log file is:

def appendRecordsToLeader(records: MemoryRecords, isFromClient: Boolean, requiredAcks: Int = 0): LogAppendInfo = {
    val (info, leaderHWIncremented) = inReadLock(leaderIsrUpdateLock) {
      leaderReplicaIfLocal match {
        // records are written only to the leader
        case Some(leaderReplica) =>
          val log = leaderReplica.log.get
          val minIsr = log.config.minInSyncReplicas
          val inSyncSize = inSyncReplicas.size

          // Avoid writing to leader if there are not enough insync replicas to make it safe
          if (inSyncSize < minIsr && requiredAcks == -1) {
            throw new NotEnoughReplicasException(s"The size of the current ISR ${inSyncReplicas.map(_.brokerId)} " +
              s"is insufficient to satisfy the min.isr requirement of $minIsr for partition $topicPartition")
          }

          val info = log.appendAsLeader(records, leaderEpoch = this.leaderEpoch, isFromClient)
          // probably unblock some follower fetch requests since log end offset has been updated
          replicaManager.tryCompleteDelayedFetch(TopicPartitionOperationKey(this.topic, this.partitionId))
          // we may need to increment high watermark since ISR could be down to 1
          (info, maybeIncrementLeaderHW(leaderReplica))

        case None =>
          throw new NotLeaderForPartitionException("Leader not local for partition %s on broker %d"
            .format(topicPartition, localBrokerId))
      }
    }

    // some delayed operations may be unblocked after HW changed
    if (leaderHWIncremented)
      tryCompleteDelayedRequests()

    info
  }

After the append to the log succeeds, the putCacheCallback callback runs to update the metadata. The method that stores the offset into the cache is:

def onOffsetCommitAppend(topicPartition: TopicPartition, offsetWithCommitRecordMetadata: CommitRecordMetadataAndOffset) {
   if (pendingOffsetCommits.contains(topicPartition)) {
     if (offsetWithCommitRecordMetadata.appendedBatchOffset.isEmpty)
       throw new IllegalStateException("Cannot complete offset commit write without providing the metadata of the record " +
         "in the log.")
     // only update the cache with a newer offset
     if (!offsets.contains(topicPartition) || offsets(topicPartition).olderThan(offsetWithCommitRecordMetadata))
       offsets.put(topicPartition, offsetWithCommitRecordMetadata)
   }

   pendingOffsetCommits.get(topicPartition) match {
     case Some(stagedOffset) if offsetWithCommitRecordMetadata.offsetAndMetadata == stagedOffset =>
       pendingOffsetCommits.remove(topicPartition)
     case _ =>
       // The pendingOffsetCommits for this partition could be empty if the topic was deleted, in which case
       // its entries would be removed from the cache by the `removeOffsets` method.
   }
 }

At this point, both the cache and the internal topic hold the latest offset data.
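The append-then-cache pattern can be sketched as follows. Note this is a simplification with our own names: the real onOffsetCommitAppend compares the commit records' positions in the log (olderThan), while here we compare the committed offsets themselves:

```java
// Sketch of the "append to the log, then update the cache in the callback" rule:
// the cache only moves forward, never backward. Names are illustrative.
import java.util.HashMap;
import java.util.Map;

class OffsetCommitCache {
    private final Map<Integer, Long> offsets = new HashMap<>(); // partition -> offset

    // Invoked from the log-append callback: keep the entry only if it is newer.
    void onAppend(int partition, long offset) {
        Long current = offsets.get(partition);
        if (current == null || current < offset)
            offsets.put(partition, offset);
    }

    Long get(int partition) { return offsets.get(partition); }
}
```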

offset_fetch

The offset_fetch request retrieves the offsets a group has consumed. The request body mainly contains:

private static final List<TopicPartition> ALL_TOPIC_PARTITIONS = null;
private final String groupId;
private final List<TopicPartition> partitions;

The handler method is:

def createResponse(requestThrottleMs: Int): AbstractResponse = {
      val offsetFetchResponse =
        // reject the request if not authorized to the group
        if (!authorize(request.session, Describe, Resource(Group, offsetFetchRequest.groupId, LITERAL)))
          offsetFetchRequest.getErrorResponse(requestThrottleMs, Errors.GROUP_AUTHORIZATION_FAILED)
        else {
          if (header.apiVersion == 0) {
            val (authorizedPartitions, unauthorizedPartitions) = offsetFetchRequest.partitions.asScala
              .partition(authorizeTopicDescribe)

            // version 0 reads offsets from ZK
            val authorizedPartitionData = authorizedPartitions.map { topicPartition =>
              try {
                if (!metadataCache.contains(topicPartition))
                  (topicPartition, OffsetFetchResponse.UNKNOWN_PARTITION)
                else {
                  val payloadOpt = zkClient.getConsumerOffset(offsetFetchRequest.groupId, topicPartition)
                  payloadOpt match {
                    case Some(payload) =>
                      (topicPartition, new OffsetFetchResponse.PartitionData(payload.toLong,
                        Optional.empty(), OffsetFetchResponse.NO_METADATA, Errors.NONE))
                    case None =>
                      (topicPartition, OffsetFetchResponse.UNKNOWN_PARTITION)
                  }
                }
              } catch {
                case e: Throwable =>
                  (topicPartition, new OffsetFetchResponse.PartitionData(OffsetFetchResponse.INVALID_OFFSET,
                    Optional.empty(), OffsetFetchResponse.NO_METADATA, Errors.forException(e)))
              }
            }.toMap

            val unauthorizedPartitionData = unauthorizedPartitions.map(_ -> OffsetFetchResponse.UNAUTHORIZED_PARTITION).toMap
            new OffsetFetchResponse(requestThrottleMs, Errors.NONE, (authorizedPartitionData ++ unauthorizedPartitionData).asJava)
          } else {
            // versions 1 and above read offsets from Kafka
            if (offsetFetchRequest.isAllPartitions) {
              val (error, allPartitionData) = groupCoordinator.handleFetchOffsets(offsetFetchRequest.groupId)
              if (error != Errors.NONE)
                offsetFetchRequest.getErrorResponse(requestThrottleMs, error)
              else {
                // clients are not allowed to see offsets for topics that are not authorized for Describe
                val authorizedPartitionData = allPartitionData.filter { case (topicPartition, _) => authorizeTopicDescribe(topicPartition) }
                new OffsetFetchResponse(requestThrottleMs, Errors.NONE, authorizedPartitionData.asJava)
              }
            } else {
              val (authorizedPartitions, unauthorizedPartitions) = offsetFetchRequest.partitions.asScala
                .partition(authorizeTopicDescribe)
              val (error, authorizedPartitionData) = groupCoordinator.handleFetchOffsets(offsetFetchRequest.groupId,
                Some(authorizedPartitions))
              if (error != Errors.NONE)
                offsetFetchRequest.getErrorResponse(requestThrottleMs, error)
              else {
                val unauthorizedPartitionData = unauthorizedPartitions.map(_ -> OffsetFetchResponse.UNAUTHORIZED_PARTITION).toMap
                new OffsetFetchResponse(requestThrottleMs, Errors.NONE, (authorizedPartitionData ++ unauthorizedPartitionData).asJava)
              }
            }
          }
        }

      trace(s"Sending offset fetch response $offsetFetchResponse for correlation id ${header.correlationId} to client ${header.clientId}.")
      offsetFetchResponse
    }

The error types here are similar to offset_commit, so we won't go over them again. The heart of this method is the call to groupMetadataManager's getOffsets method:

def getOffsets(groupId: String, topicPartitionsOpt: Option[Seq[TopicPartition]]): Map[TopicPartition, OffsetFetchResponse.PartitionData] = {
    trace("Getting offsets of %s for group %s.".format(topicPartitionsOpt.getOrElse("all partitions"), groupId))
    val group = groupMetadataCache.get(groupId)
    // if the group does not exist, return INVALID_OFFSET
    if (group == null) {
      topicPartitionsOpt.getOrElse(Seq.empty[TopicPartition]).map { topicPartition =>
        val partitionData = new OffsetFetchResponse.PartitionData(OffsetFetchResponse.INVALID_OFFSET,
          Optional.empty(), "", Errors.NONE)
        topicPartition -> partitionData
      }.toMap
    } else {
      group.inLock {
        if (group.is(Dead)) {
          topicPartitionsOpt.getOrElse(Seq.empty[TopicPartition]).map { topicPartition =>
            val partitionData = new OffsetFetchResponse.PartitionData(OffsetFetchResponse.INVALID_OFFSET,
              Optional.empty(), "", Errors.NONE)
            topicPartition -> partitionData
          }.toMap
        } else {
          topicPartitionsOpt match {
            case None =>
              // Return offsets for all partitions owned by this consumer group. (this only applies to consumers
              // that commit offsets to Kafka.)
              group.allOffsets.map { case (topicPartition, offsetAndMetadata) =>
                topicPartition -> new OffsetFetchResponse.PartitionData(offsetAndMetadata.offset,
                  offsetAndMetadata.leaderEpoch, offsetAndMetadata.metadata, Errors.NONE)
              }

            case Some(topicPartitions) =>
              topicPartitions.map { topicPartition =>
                val partitionData = group.offset(topicPartition) match {
                  case None =>
                    new OffsetFetchResponse.PartitionData(OffsetFetchResponse.INVALID_OFFSET,
                      Optional.empty(), "", Errors.NONE)
                  case Some(offsetAndMetadata) =>
                    new OffsetFetchResponse.PartitionData(offsetAndMetadata.offset,
                      offsetAndMetadata.leaderEpoch, offsetAndMetadata.metadata, Errors.NONE)
                }
                topicPartition -> partitionData
              }.toMap
          }
        }
      }
    }
  }

groupMetadataManager keeps, in each group's GroupMetadata, a cache of the consumed offset for every topic partition; offset fetches read from this cache.
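The fetch semantics are worth stating as a minimal sketch (our names): an unknown group or partition yields INVALID_OFFSET (-1) with Errors.NONE, not a failure:

```java
// Minimal sketch of getOffsets' fetch semantics: missing data returns INVALID_OFFSET.
import java.util.Map;

class OffsetFetchModel {
    static final long INVALID_OFFSET = -1L;

    static long fetch(Map<String, Map<Integer, Long>> cache, String group, int partition) {
        Map<Integer, Long> groupOffsets = cache.get(group);
        if (groupOffsets == null) return INVALID_OFFSET; // group not in the cache
        return groupOffsets.getOrDefault(partition, INVALID_OFFSET);
    }
}
```

A consumer that sees INVALID_OFFSET falls back to its auto.offset.reset policy.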

group-related requests

Kafka stores the offsets consumed by a consumer group in an internal topic; group-related information is stored in the same internal topic.

Processing a group-related request passes through several components; take JOIN_GROUP as an example. A join group request means a consumer wants to join the specified consumer group, and it goes through these stages:

  • Check that authorization passes.
  • groupCoordinator acts as the coordinator handling the request; it calls groupManager to fetch the group's information, and if the group does not exist, adds it via groupManager. For join_group, this step calls the following groupCoordinator methods:
    • doJoinGroup: add the member to the group
    • addMemberAndRebalance/updateMemberAndRebalance: the member information has changed, so a rebalance is required

Sticking with join_group as the example, here is the implementation:

group.inLock {
      if (!group.is(Empty) && (!group.protocolType.contains(protocolType) || !group.supportsProtocols(protocols.map(_._1).toSet))) {
        // if the new member does not support the group protocol, reject it
        responseCallback(joinError(memberId, Errors.INCONSISTENT_GROUP_PROTOCOL))
      } else if (group.is(Empty) && (protocols.isEmpty || protocolType.isEmpty)) {
        //reject if first member with empty group protocol or protocolType is empty
        responseCallback(joinError(memberId, Errors.INCONSISTENT_GROUP_PROTOCOL))
      } else if (memberId != JoinGroupRequest.UNKNOWN_MEMBER_ID && !group.has(memberId)) {
        // if the member trying to register with a un-recognized id, send the response to let
        // it reset its member id and retry
        responseCallback(joinError(memberId, Errors.UNKNOWN_MEMBER_ID))
      } else {
        group.currentState match {
          case Dead =>
            // if the group is marked as dead, it means some other thread has just removed the group
            // from the coordinator metadata; this is likely that the group has migrated to some other
            // coordinator OR the group is in a transient unstable phase. Let the member retry
            // joining without the specified member id,
            // if the group is Dead right now, the member can keep retrying until the group leaves this state
            responseCallback(joinError(memberId, Errors.UNKNOWN_MEMBER_ID))
          case PreparingRebalance =>
            if (memberId == JoinGroupRequest.UNKNOWN_MEMBER_ID) {
              // a new member id: perform the add-member path
              addMemberAndRebalance(rebalanceTimeoutMs, sessionTimeoutMs, clientId, clientHost, protocolType,
                protocols, group, responseCallback)
            } else {
              // an existing member: perform the update-member path
              val member = group.get(memberId)
              updateMemberAndRebalance(group, member, protocols, responseCallback)
            }

          case CompletingRebalance =>
            if (memberId == JoinGroupRequest.UNKNOWN_MEMBER_ID) {
              // a new member id: perform the add-member path
              addMemberAndRebalance(rebalanceTimeoutMs, sessionTimeoutMs, clientId, clientHost, protocolType,
                protocols, group, responseCallback)
            } else {
              val member = group.get(memberId)
              // the member may have missed the join group response; no rebalance is needed, just return the current member's information
              if (member.matches(protocols)) {
                // member is joining with the same metadata (which could be because it failed to
                // receive the initial JoinGroup response), so just return current group information
                // for the current generation.
                responseCallback(JoinGroupResult(
                  members = if (group.isLeader(memberId)) {
                    group.currentMemberMetadata
                  } else {
                    Map.empty
                  },
                  memberId = memberId,
                  generationId = group.generationId,
                  subProtocol = group.protocolOrNull,
                  leaderId = group.leaderOrNull,
                  error = Errors.NONE))
              } else {
                // member has changed metadata, so force a rebalance
                updateMemberAndRebalance(group, member, protocols, responseCallback)
              }
            }

          case Empty | Stable =>
            if (memberId == JoinGroupRequest.UNKNOWN_MEMBER_ID) {
              // if the member id is unknown, register the member to the group
              addMemberAndRebalance(rebalanceTimeoutMs, sessionTimeoutMs, clientId, clientHost, protocolType,
                protocols, group, responseCallback)
            } else {
              val member = group.get(memberId)
              // in Empty or Stable state, a join request from the leader triggers a rebalance
              if (group.isLeader(memberId) || !member.matches(protocols)) {
                // force a rebalance if a member has changed metadata or if the leader sends JoinGroup.
                // The latter allows the leader to trigger rebalances for changes affecting assignment
                // which do not affect the member metadata (such as topic metadata changes for the consumer)
                updateMemberAndRebalance(group, member, protocols, responseCallback)
              } else {
                // for followers with no actual change to their metadata, just return group information
                // for the current generation which will allow them to issue SyncGroup
                responseCallback(JoinGroupResult(
                  members = Map.empty,
                  memberId = memberId,
                  generationId = group.generationId,
                  subProtocol = group.protocolOrNull,
                  leaderId = group.leaderOrNull,
                  error = Errors.NONE))
              }
            }
        }
      }
}

rebalance

addMemberAndRebalance
private def addMemberAndRebalance(rebalanceTimeoutMs: Int,
                                    sessionTimeoutMs: Int,
                                    clientId: String,
                                    clientHost: String,
                                    protocolType: String,
                                    protocols: List[(String, Array[Byte])],
                                    group: GroupMetadata,
                                    callback: JoinCallback): MemberMetadata = {
    val memberId = clientId + "-" + group.generateMemberIdSuffix
    val member = new MemberMetadata(memberId, group.groupId, clientId, clientHost, rebalanceTimeoutMs,
      sessionTimeoutMs, protocolType, protocols)
    // update the newMemberAdded flag to indicate that the join group can be further delayed
    // note: a flag is set here to indicate the group has a newly added member
    if (group.is(PreparingRebalance) && group.generationId == 0)
      group.newMemberAdded = true

    group.add(member, callback)
    maybePrepareRebalance(group, s"Adding new member $memberId")
    member
  }

Before the rebalance is carried out, the prepareRebalance method is called:

private def prepareRebalance(group: GroupMetadata, reason: String) {
   // if any members are awaiting sync, cancel their request and have them rejoin
   if (group.is(CompletingRebalance))
     resetAndPropagateAssignmentError(group, Errors.REBALANCE_IN_PROGRESS)

   val delayedRebalance = if (group.is(Empty))
     new InitialDelayedJoin(this,
       joinPurgatory,
       group,
       groupConfig.groupInitialRebalanceDelayMs,
       groupConfig.groupInitialRebalanceDelayMs,
       max(group.rebalanceTimeoutMs - groupConfig.groupInitialRebalanceDelayMs, 0))
   else
     new DelayedJoin(this, group, group.rebalanceTimeoutMs)

   group.transitionTo(PreparingRebalance)

   info(s"Preparing to rebalance group ${group.groupId} in state ${group.currentState} with old generation " +
     s"${group.generationId} (${Topic.GROUP_METADATA_TOPIC_NAME}-${partitionFor(group.groupId)}) (reason: $reason)")

   val groupKey = GroupKey(group.groupId)
   joinPurgatory.tryCompleteElseWatch(delayedRebalance, Seq(groupKey))
 }

Here the group's state has changed to PreparingRebalance, and a delayed operation has been registered. We'll cover how delayed operations work later; in outline:

  • Every Kafka DelayedOperation overrides three methods: tryComplete, onExpiration, and onComplete.
  • The delayed operation is registered with the helper class DelayedOperationPurgatory. The purgatory keeps, for each key, a set of watchers, and each watcher observes operations. Here the key is the groupId and the operation is the delayed operation just defined.
  • When registering the operation on the key's watcher, DelayedOperationPurgatory first tries to complete it by calling tryComplete; if the completion condition is met, it calls forceComplete. So pay attention to how the delayed operation implements forceComplete; the code follows.
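The contract just described can be modeled compactly. This is a rough sketch: the real purgatory also manages watchers, timers, and onExpiration, and DelayedJoinLike is our illustrative name:

```java
// Rough model of the DelayedOperation contract: tryComplete checks a condition,
// and forceComplete runs onComplete at most once.
import java.util.concurrent.atomic.AtomicBoolean;

abstract class DelayedOp {
    private final AtomicBoolean completed = new AtomicBoolean(false);

    abstract boolean tryComplete();
    abstract void onComplete();

    // Complete exactly once, mirroring forceComplete's compareAndSet.
    boolean forceComplete() {
        if (completed.compareAndSet(false, true)) {
            onComplete();
            return true;
        }
        return false;
    }
}

class DelayedJoinLike extends DelayedOp {
    final int expected;   // expected number of members
    int joined = 0;       // stand-in for numMembersAwaitingJoin
    int fired = 0;        // how many times onComplete ran

    DelayedJoinLike(int expected) { this.expected = expected; }

    // Complete once every member has rejoined (cf. hasAllMembersJoined).
    boolean tryComplete() { return joined >= expected && forceComplete(); }
    void onComplete() { fired++; }
}
```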

First, look at the condition checked in tryComplete:

def tryCompleteJoin(group: GroupMetadata, forceComplete: () => Boolean) = {
    group.inLock {
      if (group.hasAllMembersJoined)
        forceComplete()
      else false
    }
  }

where

// whenever a member initiates a join (possibly an existing member), numMembersAwaitingJoin is incremented; when the join request completes, it is decremented
def hasAllMembersJoined = members.size <= numMembersAwaitingJoin

As the code shows, once all of the group's members have issued join requests, the hasAllMembersJoined condition is met and the rebalance can begin.

def forceComplete(): Boolean = {
    if (completed.compareAndSet(false, true)) {
      // cancel the timeout timer
      cancel()
      onComplete()
      true
    } else {
      false
    }
  }

forceComplete checks whether the operation has already completed; if not, it invokes the onComplete method overridden by the delayed operation:

def onCompleteJoin(group: GroupMetadata) {
   group.inLock {
     // remove any members who haven't joined the group yet
     group.notYetRejoinedMembers.foreach { failedMember =>
       removeHeartbeatForLeavingMember(group, failedMember)
       group.remove(failedMember.memberId)
       // TODO: cut the socket connection to the client
     }

     if (!group.is(Dead)) {
        // first generate a new generation id
       group.initNextGeneration()
       if (group.is(Empty)) {
         info(s"Group ${group.groupId} with generation ${group.generationId} is now empty " +
           s"(${Topic.GROUP_METADATA_TOPIC_NAME}-${partitionFor(group.groupId)})")

         groupManager.storeGroup(group, Map.empty, error => {
           if (error != Errors.NONE) {
             // we failed to write the empty group metadata. If the broker fails before another rebalance,
             // the previous generation written to the log will become active again (and most likely timeout).
             // This should be safe since there are no active members in an empty generation, so we just warn.
             warn(s"Failed to write empty metadata for group ${group.groupId}: ${error.message}")
           }
         })
       } else {
         info(s"Stabilized group ${group.groupId} generation ${group.generationId} " +
           s"(${Topic.GROUP_METADATA_TOPIC_NAME}-${partitionFor(group.groupId)})")

         // trigger the awaiting join group response callback for all the members after rebalancing
         for (member <- group.allMemberMetadata) {
           assert(member.awaitingJoinCallback != null)
           val joinResult = JoinGroupResult(
             members = if (group.isLeader(member.memberId)) {
               group.currentMemberMetadata
             } else {
               Map.empty
             },
             memberId = member.memberId,
             generationId = group.generationId,
             subProtocol = group.protocolOrNull,
             leaderId = group.leaderOrNull,
             error = Errors.NONE)

           group.invokeJoinCallback(member, joinResult)
           completeAndScheduleNextHeartbeatExpiration(group, member)
         }
       }
     }
   }
 }

Note that the coordinator waits here for a period of time; members that do not send a join request within that window are removed from the group. Finally, the coordinator sends the members' metadata to the leader. So who is a consumer group's leader? GroupMetadata's add method reveals it:

def add(member: MemberMetadata, callback: JoinCallback = null) {
   if (members.isEmpty)
     this.protocolType = Some(member.protocolType)

   assert(groupId == member.groupId)
   assert(this.protocolType.orNull == member.protocolType)
   assert(supportsProtocols(member.protocols))

   if (leaderId.isEmpty)
     leaderId = Some(member.memberId)
   members.put(member.memberId, member)
   member.supportedProtocols.foreach{ case (protocol, _) => supportedProtocols(protocol) += 1 }
   member.awaitingJoinCallback = callback
   if (member.awaitingJoinCallback != null)
     numMembersAwaitingJoin += 1;
 }

As the code shows, the first consumer to join the group is treated as the leader.

In Kafka, the consumers' assignment is decided by each group's leader: after receiving the join_group response, the leader assigns the topic partitions and returns the result to the coordinator in a sync_group request.
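What the leader does on the client side can be sketched as below. Kafka's real assignors (range, round-robin, sticky) live in the consumer and are considerably more involved; this simple round-robin is illustrative only:

```java
// Hedged sketch of leader-side assignment: deal partitions out to members in turn.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LeaderAssignor {
    static Map<String, List<Integer>> roundRobin(List<String> members, List<Integer> partitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String member : members)
            assignment.put(member, new ArrayList<>());
        // Partition i goes to member i mod |members|.
        for (int i = 0; i < partitions.size(); i++)
            assignment.get(members.get(i % members.size())).add(partitions.get(i));
        return assignment;
    }
}
```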

sync group

After receiving the join group response, the leader computes the rebalance assignment and sends the result to the coordinator. The main fields of a sync group request are:

private final String groupId;
private final int generationId;
private final String memberId;
private final Map<String, ByteBuffer> groupAssignment;

Like every other group request, the coordinator authorizes a sync group request on receipt; once authorized it executes handleSyncGroup, which ultimately calls doSyncGroup:

private def doSyncGroup(group: GroupMetadata,
                          generationId: Int,
                          memberId: String,
                          groupAssignment: Map[String, Array[Byte]],
                          responseCallback: SyncCallback) {
    group.inLock {
      if (!group.has(memberId)) {
        responseCallback(Array.empty, Errors.UNKNOWN_MEMBER_ID)
      } else if (generationId != group.generationId) {
        responseCallback(Array.empty, Errors.ILLEGAL_GENERATION)
      } else {
        group.currentState match {
          // groups in Empty or Dead state return an error
          case Empty | Dead =>
            responseCallback(Array.empty, Errors.UNKNOWN_MEMBER_ID)
          // a group in PreparingRebalance does not accept sync requests
          case PreparingRebalance =>
            responseCallback(Array.empty, Errors.REBALANCE_IN_PROGRESS)

          case CompletingRebalance =>
            group.get(memberId).awaitingSyncCallback = responseCallback

            // if this is the leader, then we can attempt to persist state and transition to stable
            // only requests from the leader are processed
            if (group.isLeader(memberId)) {
              info(s"Assignment received from leader for group ${group.groupId} for generation ${group.generationId}")

              // fill any missing members with an empty assignment
              val missing = group.allMembers -- groupAssignment.keySet
              val assignment = groupAssignment ++ missing.map(_ -> Array.empty[Byte]).toMap

              groupManager.storeGroup(group, assignment, (error: Errors) => {
                group.inLock {
                  // another member may have joined the group while we were awaiting this callback,
                  // so we must ensure we are still in the CompletingRebalance state and the same generation
                  // when it gets invoked. if we have transitioned to another state, then do nothing
                  if (group.is(CompletingRebalance) && generationId == group.generationId) {
                    if (error != Errors.NONE) {
                      resetAndPropagateAssignmentError(group, error)
                      maybePrepareRebalance(group, s"error when storing group assignment during SyncGroup (member: $memberId)")
                    } else {
                      setAndPropagateAssignment(group, assignment)
                      group.transitionTo(Stable)
                    }
                  }
                }
              })
            }

          case Stable =>
            // if the group is stable, we just return the current assignment
            val memberMetadata = group.get(memberId)
            responseCallback(memberMetadata.assignment, Errors.NONE)
            completeAndScheduleNextHeartbeatExpiration(group, group.get(memberId))
        }
      }
    }
  }

After receiving the assignment from the leader, the coordinator stores the group information in the __consumer_offsets topic and updates the assignment in each member's metadata. When the other, non-leader consumer members send their sync requests, the coordinator simply returns their corresponding assignments.
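One detail from doSyncGroup deserves a sketch: the "fill any missing members with an empty assignment" step, which guarantees every member ends up with an entry even when the leader assigned it nothing. Names here are illustrative:

```java
// Sketch of filling missing members with an empty assignment, as in doSyncGroup.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class SyncGroupFill {
    static Map<String, byte[]> fillMissing(Set<String> allMembers, Map<String, byte[]> assignment) {
        Map<String, byte[]> complete = new HashMap<>(assignment);
        for (String member : allMembers)
            complete.putIfAbsent(member, new byte[0]); // unassigned members get an empty blob
        return complete;
    }
}
```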

GroupMetadata

A group's metadata tracks its state, which is one of:

  • PreparingRebalance: preparing to rebalance
  • CompletingRebalance: waiting for the assignment result from the leader
  • Stable: the group has a stable generation and assignment
  • Dead: the group has no members left and its metadata is being cleaned up
  • Empty: the group has no members but is still waiting for its offsets to expire

The state transitions (rows are the current state, columns the target state):

| from \ to | PreparingRebalance | CompletingRebalance | Stable | Dead | Empty |
| --- | --- | --- | --- | --- | --- |
| PreparingRebalance | no | members have joined the group | no | group is removed | all members have left the group |
| CompletingRebalance | a member joins, leaves, or fails | no | assignment result received from the leader | group is removed | no |
| Stable | a member failure, leave, or new-member join is observed | no | no | group is removed | no |
| Dead | no | no | no | no | no |
| Empty | a new member joins | no | no | offsets expired and group is removed | no |
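The table can be transcribed into an executable transition check. Per the table, Dead has no outgoing transitions:

```java
// The group state table as an executable transition check.
import java.util.Map;
import java.util.Set;

class GroupStates {
    enum State { PREPARING_REBALANCE, COMPLETING_REBALANCE, STABLE, DEAD, EMPTY }

    // For each state, the set of states it may transition to.
    static final Map<State, Set<State>> TRANSITIONS = Map.of(
        State.PREPARING_REBALANCE, Set.of(State.COMPLETING_REBALANCE, State.DEAD, State.EMPTY),
        State.COMPLETING_REBALANCE, Set.of(State.PREPARING_REBALANCE, State.STABLE, State.DEAD),
        State.STABLE, Set.of(State.PREPARING_REBALANCE, State.DEAD),
        State.DEAD, Set.<State>of(),
        State.EMPTY, Set.of(State.PREPARING_REBALANCE, State.DEAD)
    );

    static boolean canTransition(State from, State to) {
        return TRANSITIONS.get(from).contains(to);
    }
}
```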