完成了前期准备工作之后,消费者将正式开始执行分区再分配,这是一个客户端与服务端交互配合的过程,消费者需要构造并发送 JoinGroupResult 请求到对应的 GroupCoordinator 实例所在节点申请加入目标 group。
这一过程位于 AbstractCoordinator#initiateJoinGroup 方法中,该方法的主要工作就是切换当前消费者的状态为 REBALANCING,创建并缓存 JoinGroupRequest 请求,并处理申请加入的结果。如果申请加入成功,则会切换当前消费者的状态为 STABLE,并重启心跳机制(为了避免心跳机制干扰分区再分配,在开始执行分区再分配之前会临时关闭心跳机制);如果申请加入失败,则会切换当前消费者的状态为 UNJOINED。
private synchronized RequestFuture<ByteBuffer> initiateJoinGroup() { // we store the join future in case we are woken up by the user after beginning the // rebalance in the call to poll below. This ensures that we do not mistakenly attempt // to rejoin before the pending rebalance has completed. if (joinFuture == null) { // fence off the heartbeat thread explicitly so that it cannot interfere with the join group. // Note that this must come after the call to onJoinPrepare since we must be able to continue // sending heartbeats if that callback takes some time. disableHeartbeatThread(); // 切换当前消费者的状态为 REBALANCING state = MemberState.REBALANCING; // a rebalance can be triggered consecutively if the previous one failed, // in this case we would not update the start time. if (lastRebalanceStartMs == -1L) lastRebalanceStartMs = time.milliseconds(); // 创建并缓存 JoinGroupRequest 请求,并处理申请加入的结果 joinFuture = sendJoinGroupRequest(); joinFuture.addListener(new RequestFutureListener<ByteBuffer>() { @Override public void onSuccess(ByteBuffer value) { // handle join completion in the callback so that the callback will be invoked // even if the consumer is woken up before finishing the rebalance synchronized (AbstractCoordinator.this) { if (generation != Generation.NO_GENERATION) { log.info("Successfully joined group with generation {}", generation.generationId); // 果申请加入成功,则会切换当前消费者的状态为 STABLE,并重启心跳机制 state = MemberState.STABLE; rejoinNeeded = false; // record rebalance latency lastRebalanceEndMs = time.milliseconds(); sensors.successfulRebalanceSensor.record(lastRebalanceEndMs - lastRebalanceStartMs); lastRebalanceStartMs = -1L; if (heartbeatThread != null) heartbeatThread.enable(); } else { log.info("Generation data was cleared by heartbeat thread. Rejoin failed."); recordRebalanceFailure(); } } } @Override public void onFailure(RuntimeException e) { // we handle failures below after the request finishes. if the join completes // after having been woken up, the exception is ignored and we will rejoin synchronized (AbstractCoordinator.this) { recordRebalanceFailure(); } } private void recordRebalanceFailure() { state = MemberState.UNJOINED; sensors.failedRebalanceSensor.record(); } }); } return joinFuture; }sendJoinGroupRequest
JoinGroupRequest 请求中包含了当前消费者的 ID,消费者所属 group 的 ID、消费者支持的分区策略、协议类型、以及会话超时时间等信息。构造、发送,以及处理 JoinGroupRequest 请求及其响应的过程位于 AbstractCoordinator#sendJoinGroupRequest 方法中,实现如下:
RequestFuture<ByteBuffer> sendJoinGroupRequest() { if (coordinatorUnknown()) // 如果目标 GroupCoordinator 节点不可达,则返回异常 return RequestFuture.coordinatorNotAvailable(); // send a join group request to the coordinator log.info("(Re-)joining group"); // 构建 JoinGroupRequest 请求 JoinGroupRequest.Builder requestBuilder = new JoinGroupRequest.Builder( new JoinGroupRequestData() .setGroupId(rebalanceConfig.groupId) .setSessionTimeoutMs(this.rebalanceConfig.sessionTimeoutMs) .setMemberId(this.generation.memberId) .setGroupInstanceId(this.rebalanceConfig.groupInstanceId.orElse(null)) .setProtocolType(protocolType()) .setProtocols(metadata()) .setRebalanceTimeoutMs(this.rebalanceConfig.rebalanceTimeoutMs) ); log.debug("Sending JoinGroup ({}) to coordinator {}", requestBuilder, this.coordinator); // Note that we override the request timeout using the rebalance timeout since that is the // maximum time that it may block on the coordinator. We add an extra 5 seconds for small delays. int joinGroupTimeoutMs = Math.max(rebalanceConfig.rebalanceTimeoutMs, rebalanceConfig.rebalanceTimeoutMs + 5000); // 发送 JoinGroupRequest 请求,并注册结果处理器 JoinGroupResponseHandler return client.send(coordinator, requestBuilder, joinGroupTimeoutMs) .compose(new JoinGroupResponseHandler()); }JoinGroupResponseHandler
消费者通过注册结果处理器 JoinGroupResponseHandler 对请求的响应结果进行处理,如果是正常响应则会执行分区分配操作,核心逻辑实现如下:
private class JoinGroupResponseHandler extends CoordinatorResponseHandler<JoinGroupResponse, ByteBuffer> { @Override public void handle(JoinGroupResponse joinResponse, RequestFuture<ByteBuffer> future) { Errors error = joinResponse.error(); if (error == Errors.NONE) { log.debug("Received successful JoinGroup response: {}", joinResponse); sensors.joinSensor.record(response.requestLatencyMs()); synchronized (AbstractCoordinator.this) { if (state != MemberState.REBALANCING) { // 确认当前状态是不是 REBALANCING // 例如消费者因某种原理离开了所属的 group,这种情况下不应该再继续执行下去。 // if the consumer was woken up before a rebalance completes, we may have already left // the group. In this case, we do not want to continue with the sync group. future.raise(new UnjoinedGroupException()); } else { // 基于响应,更新 group 的年代信息 AbstractCoordinator.this.generation = new Generation(joinResponse.data().generationId(), joinResponse.data().memberId(), joinResponse.data().protocolName()); // 如果当前消费者是 group 中的 leader 角色 if (joinResponse.isLeader()) { /* * 基于分区分配策略执行分区分配,leader 需要关注当前 group 中所有消费者订阅的 topic, * 依据 GroupCoordinator 最终确定的分区分配策略为当前 group 名下所有的消费者分配分区, * 并发送 SyncGroupRequest 请求向对应的 GroupCoordinator 节点反馈最终的分区分配结果。 */ onJoinLeader(joinResponse).chain(future); } else { // 如果是 follower 消费者,则只关注自己订阅的 topic,这一步仅发送 SyncGroupRequest 请求 // 响应 JoinGroupRequest 请求的逻辑只是构造一个包含空的分区分配结果的 SyncGroupRequest 请求,并附带上所属的 group 和自身 ID,以及 group 年代信息,发送给对应的 GroupCoordinator 节点, // 如果此时所属的 group 已经处于正常运行的状态,则该消费者会拿到分配给自己的分区信息。 onJoinFollower().chain(future); } } } } else if (error == Errors.COORDINATOR_LOAD_IN_PROGRESS) { log.debug("Attempt to join group rejected since coordinator {} is loading the group.", coordinator()); // backoff and retry // 在接收到响应之前,消费者的状态发生变更(可能已经从所属 group 离开),抛出异常 future.raise(error); } else if (error == Errors.UNKNOWN_MEMBER_ID) { // reset the member id and retry immediately resetGenerationOnResponseError(ApiKeys.JOIN_GROUP, error); log.debug("Attempt to join group failed due to unknown member id."); future.raise(error); } else if (error == Errors.COORDINATOR_NOT_AVAILABLE || error == Errors.NOT_COORDINATOR) { // re-discover the coordinator and retry with backoff markCoordinatorUnknown(); log.debug("Attempt to join group failed due to obsolete coordinator information: {}", error.message()); future.raise(error); } else if (error == Errors.FENCED_INSTANCE_ID) { log.error("Received fatal exception: group.instance.id gets fenced"); future.raise(error); } else if (error == Errors.INCONSISTENT_GROUP_PROTOCOL || error == Errors.INVALID_SESSION_TIMEOUT || error == Errors.INVALID_GROUP_ID || error == Errors.GROUP_AUTHORIZATION_FAILED || error == Errors.GROUP_MAX_SIZE_REACHED) { // log the error and re-throw the exception log.error("Attempt to join group failed due to fatal error: {}", error.message()); if (error == Errors.GROUP_MAX_SIZE_REACHED) { future.raise(new GroupMaxSizeReachedException("Consumer group " + rebalanceConfig.groupId + " already has the configured maximum number of members.")); } else if (error == Errors.GROUP_AUTHORIZATION_FAILED) { future.raise(GroupAuthorizationException.forGroupId(rebalanceConfig.groupId)); } else { future.raise(error); } } else if (error == Errors.UNSUPPORTED_VERSION) { log.error("Attempt to join group failed due to unsupported version error. Please unset field group.instance.id and retry" + "to see if the problem resolves"); future.raise(error); } else if (error == Errors.MEMBER_ID_REQUIRED) { // Broker requires a concrete member id to be allowed to join the group. Update member id // and send another join group request in next cycle. synchronized (AbstractCoordinator.this) { AbstractCoordinator.this.generation = new Generation(OffsetCommitRequest.DEFAULT_GENERATION_ID, joinResponse.data().memberId(), null); AbstractCoordinator.this.resetStateAndRejoin(); } future.raise(error); } else { // unexpected error, throw the exception log.error("Attempt to join group failed due to unexpected error: {}", error.message()); future.raise(new KafkaException("Unexpected error in join group response: " + error.message())); } } }AbstractCoordinator#onJoinLeader onJoinLeader基于分区分配策略分配分区并返回结果给服务端。
private RequestFuture<ByteBuffer> onJoinLeader(JoinGroupResponse joinResponse) { try { // perform the leader synchronization and send back the assignment for the group // 基于分区分配策略分配分区 Map<String, ByteBuffer> groupAssignment = performAssignment(joinResponse.data().leader(), joinResponse.data().protocolName(), joinResponse.data().members()); List<SyncGroupRequestData.SyncGroupRequestAssignment> groupAssignmentList = new ArrayList<>(); for (Map.Entry<String, ByteBuffer> assignment : groupAssignment.entrySet()) { groupAssignmentList.add(new SyncGroupRequestData.SyncGroupRequestAssignment() .setMemberId(assignment.getKey()) .setAssignment(Utils.toArray(assignment.getValue())) ); } // 创建 SyncGroupRequest 请求,反馈分区分配结果给 GroupCoordinator 节点 SyncGroupRequest.Builder requestBuilder = new SyncGroupRequest.Builder( new SyncGroupRequestData() .setGroupId(rebalanceConfig.groupId) .setMemberId(generation.memberId) .setGroupInstanceId(this.rebalanceConfig.groupInstanceId.orElse(null)) .setGenerationId(generation.generationId) .setAssignments(groupAssignmentList) ); log.debug("Sending leader SyncGroup to coordinator {} at generation {}: {}", this.coordinator, this.generation, requestBuilder); // 发送 SyncGroupRequest 请求 return sendSyncGroupRequest(requestBuilder); } catch (RuntimeException e) { return RequestFuture.failure(e); } }分配分区的具体过程位于 ConsumerCoordinator#performAssignment 方法中
protected Map<String, ByteBuffer> performAssignment(String leaderId, String assignmentStrategy, List<JoinGroupResponseData.JoinGroupResponseMember> allSubscriptions) { // 从消费者支持的分区分配策略集合中选择指定策略对应的分区分配器 ConsumerPartitionAssignor assignor = lookupAssignor(assignmentStrategy); if (assignor == null) throw new IllegalStateException("Coordinator selected invalid assignment protocol: " + assignmentStrategy); // 解析封装 topic 订阅信息 Set<String> allSubscribedTopics = new HashSet<>();// 记录 group 名下所有消费者订阅的 topic 集合 Map<String, Subscription> subscriptions = new HashMap<>(); // Map<String, ByteBuffer> -> Map<String, Subscription> // collect all the owned partitions Map<String, List<TopicPartition>> ownedPartitions = new HashMap<>(); for (JoinGroupResponseData.JoinGroupResponseMember memberSubscription : allSubscriptions) { Subscription subscription = ConsumerProtocol.deserializeSubscription(ByteBuffer.wrap(memberSubscription.metadata())); subscription.setGroupInstanceId(Optional.ofNullable(memberSubscription.groupInstanceId())); subscriptions.put(memberSubscription.memberId(), subscription); allSubscribedTopics.addAll(subscription.topics()); ownedPartitions.put(memberSubscription.memberId(), subscription.ownedPartitions()); } // the leader will begin watching for changes to any of the topics the group is interested in, // which ensures that all metadata changes will eventually be seen // 分区再分配之后,检测是否需要更新集群元数据信息,如果需要则立即更新 updateGroupSubscription(allSubscribedTopics); // 标记当前消费者为 leader 角色 isLeader = true; log.debug("Performing assignment using strategy {} with subscriptions {}", assignor.name(), subscriptions); /* * 基于分区分配器(range/round-robin)执行分区分配, * 返回结果:key 是消费者 ID,value 是对应的分区分配结果 */ Map<String, Assignment> assignments = assignor.assign(metadata.fetch(), new GroupSubscription(subscriptions)).groupAssignment(); if (protocol == RebalanceProtocol.COOPERATIVE) { validateCooperativeAssignment(ownedPartitions, assignments); } // user-customized assignor may have created some topics that are not in the subscription list // and assign their partitions to the members; in this case we would like to update the leader's // own metadata with the newly added topics so that it will not trigger a subsequent rebalance // when these topics gets updated from metadata refresh. // // TODO: this is a hack and not something we want to support long-term unless we push regex into the protocol // we may need to modify the ConsumerPartitionAssignor API to better support this case. // 记录所有完成分配的 topic 集合 Set<String> assignedTopics = new HashSet<>(); for (Assignment assigned : assignments.values()) { for (TopicPartition tp : assigned.partitions()) assignedTopics.add(tp.topic()); } // 如果 group 中存在一些已经订阅的 topic 并未分配,则日志记录 if (!assignedTopics.containsAll(allSubscribedTopics)) { Set<String> notAssignedTopics = new HashSet<>(allSubscribedTopics); notAssignedTopics.removeAll(assignedTopics); log.warn("The following subscribed topics are not assigned to any members: {} ", notAssignedTopics); } // 如果分配的 topic 集合包含一些未订阅的 topic 集合 if (!allSubscribedTopics.containsAll(assignedTopics)) { // 日志记录这些未订阅的 topic Set<String> newlyAddedTopics = new HashSet<>(assignedTopics); newlyAddedTopics.removeAll(allSubscribedTopics); log.info("The following not-subscribed topics are assigned, and their metadata will be " + "fetched from the brokers: {}", newlyAddedTopics); // 将这些已分配但是未订阅的 topic 添加到 group 集合中 allSubscribedTopics.addAll(assignedTopics); // 更新元数据信息 updateGroupSubscription(allSubscribedTopics); } // 更新本地缓存的元数据信息快照 assignmentSnapshot = metadataSnapshot; log.info("Finished assignment for group at generation {}: {}", generation().generationId, assignments); // 对分区分配结果进行序列化,后续需要反馈给集群 Map<String, ByteBuffer> groupAssignment = new HashMap<>(); for (Map.Entry<String, Assignment> assignmentEntry : assignments.entrySet()) { ByteBuffer buffer = ConsumerProtocol.serializeAssignment(assignmentEntry.getValue()); groupAssignment.put(assignmentEntry.getKey(), buffer); } return groupAssignment; }