Pulsar Load Management & Topic Ownership: the Lookup Mechanism


This article is part of the "开源摘星计划" (Open Source Star program); readers are welcome to join. Activity link: github.com/weopenproje…

Let's read the Pulsar source code (v2.9.2) with the following questions in mind:

  • How is the relationship between a bundle and a topic established?
  • How is the relationship between a bundle and a broker established?
  • Is unloading a bundle the same as unloading a topic? And what is the difference between splitting a bundle and unloading a bundle?
  • How do we inspect a broker's load, and what is worth paying attention to?
  • How does Pulsar manage load, and how does it balance it?

With that, let's begin!

Note: in this article, "topic" refers to a single partition of a topic, not the topic as a whole.

Every topic must be bound to a broker. How is that binding established? Topics are bound to brokers through bundles: a bundle is a virtual node on a consistent hash ring. A topic's name is hashed to find the bundle it belongs to.

As shown in the figure below, the namespace is divided into 4 ranges (defaultNumberOfNamespaceBundles defaults to 4) covering 0x00000000~0xffffffff, and the topics under the namespace are hashed by name onto this ring.

(figure: the namespace hash ring split into 4 bundle ranges)

Check the namespace's bundle information in ZooKeeper:

/admin/local-policies/apache/pulsar

{
  "bundles" : {
    "boundaries" : [ "0x00000000", "0x40000000", "0x80000000", "0xc0000000", "0xffffffff" ],
    "numBundles" : 4
  }
}
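To make the mapping concrete, here is a minimal, self-contained sketch of how a topic name falls into one of these ranges. It assumes the CRC32C hash (Guava's Hashing.crc32c(), the default hash of NamespaceBundleFactory); FindBundleSketch and its helpers are illustrative names, not Pulsar APIs:

import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class FindBundleSketch {
    // Boundaries copied from /admin/local-policies/apache/pulsar above
    static final long[] BOUNDARIES = {0x00000000L, 0x40000000L, 0x80000000L, 0xc0000000L, 0xffffffffL};

    // Hash the full topic name into the 32-bit key space of the ring
    static long hash(String topicName) {
        return Hashing.crc32c().hashString(topicName, StandardCharsets.UTF_8).padToLong();
    }

    // Return the index of the bundle range the hash falls into
    static int findBundleIndex(String topicName) {
        long h = hash(topicName);
        for (int i = 0; i < BOUNDARIES.length - 1; i++) {
            if (h >= BOUNDARIES[i] && h < BOUNDARIES[i + 1]) {
                return i; // bundle [BOUNDARIES[i], BOUNDARIES[i+1])
            }
        }
        return BOUNDARIES.length - 2; // the upper edge belongs to the last bundle
    }

    public static void main(String[] args) {
        String topic = "persistent://apache/pulsar/test-topic-partition-0";
        int i = findBundleIndex(topic);
        System.out.printf("%s -> 0x%08x_0x%08x%n", topic, BOUNDARIES[i], BOUNDARIES[i + 1]);
    }
}

The real implementation (NamespaceBundles.findBundle, shown later) does essentially the same thing over the boundary array.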

The two things worth digging into:

  • how the bundle-topic relationship is established
  • how the bundle-broker relationship is established

Lookup: finding the broker that owns a topic

Non-partitioned topic lookup:

pulsar.apache.org/docs/next/a…

Partitioned topic lookup:

pulsar.apache.org/docs/next/a…

bin/pulsar-admin topics  partitioned-lookup persistent://apache/pulsar/test-topic

Lookup internals

The command above goes through the REST interface; one request is issued per partition, and each one lands on:

GET{u=//localhost:8081/lookup/v2/topic/persistent/apache/pulsar/test-topic-partition-0,HTTP/1.1,h=3,cl=-1}

GET{u=//localhost:8081/lookup/v2/topic/persistent/apache/pulsar/test-topic-partition-1,HTTP/1.1,h=3,cl=-1}
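You can also hit this endpoint directly, for example with curl against the web service port seen in the captured requests (host and port depend on your deployment):

curl http://localhost:8081/lookup/v2/topic/persistent/apache/pulsar/test-topic-partition-0

The handler behind this path is TopicLookup: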

@Path("/v2/topic")
public class TopicLookup extends TopicLookupBase {

    static final String LISTENERNAME_HEADER = "X-Pulsar-ListenerName";

    @GET
    @Path("{topic-domain}/{tenant}/{namespace}/{topic}")
    public void lookupTopicAsync(
            @Suspended AsyncResponse asyncResponse,
            @PathParam("topic-domain") String topicDomain, @PathParam("tenant") String tenant,
            @PathParam("namespace") String namespace, @PathParam("topic") @Encoded String encodedTopic,
            @QueryParam("authoritative") @DefaultValue("false") boolean authoritative,
            @QueryParam("listenerName") String listenerName,
            @HeaderParam(LISTENERNAME_HEADER) String listenerNameHeader) {
        TopicName topicName = getTopicName(topicDomain, tenant, namespace, encodedTopic);
        if (StringUtils.isEmpty(listenerName) && StringUtils.isNotEmpty(listenerNameHeader)) {
            listenerName = listenerNameHeader;
        }
        // Perform the lookup
        internalLookupTopicAsync(topicName, authoritative, asyncResponse, listenerName);
    }

internalLookupTopicAsync:

protected void internalLookupTopicAsync(TopicName topicName, boolean authoritative,
                                            AsyncResponse asyncResponse, String listenerName) {
	// Acquire the lookup-request semaphore (limits concurrent lookups)
        if (!pulsar().getBrokerService().getLookupRequestSemaphore().tryAcquire()) {
            log.warn("No broker was found available for topic {}", topicName);
            asyncResponse.resume(new WebApplicationException(Response.Status.SERVICE_UNAVAILABLE));
            return;
        }

        try {
            // Validate cluster ownership. Older Pulsar topic names could embed a cluster; if the name carries one,
            // check whether it matches the current cluster and redirect if it does not.
            validateClusterOwnership(topicName.getCluster());
            // Validate admin/client permissions
            validateAdminAndClientPermission(topicName);
            // Geo-replication: if this global namespace is owned by a peer cluster, redirect to that cluster
            validateGlobalNamespaceOwnership(topicName.getNamespaceObject());
        } catch (WebApplicationException we) {
            // Validation checks failed
            log.error("Validation check failed: {}", we.getMessage());
            completeLookupResponseExceptionally(asyncResponse, we);
            return;
        } catch (Throwable t) {
            // Validation checks failed with unknown error
            log.error("Validation check failed: {}", t.getMessage(), t);
            completeLookupResponseExceptionally(asyncResponse, new RestException(t));
            return;
        }

        CompletableFuture<Optional<LookupResult>> lookupFuture = pulsar().getNamespaceService()
                .getBrokerServiceUrlAsync(topicName,
                        LookupOptions.builder().advertisedListenerName(listenerName)
                                .authoritative(authoritative).loadTopicsInBundle(false).build());

        lookupFuture.thenAccept(optionalResult -> {
            if (optionalResult == null || !optionalResult.isPresent()) {
                log.warn("No broker was found available for topic {}", topicName);
                completeLookupResponseExceptionally(asyncResponse,
                        new WebApplicationException(Response.Status.SERVICE_UNAVAILABLE));
                return;
            }

            LookupResult result = optionalResult.get();
            // We have found either a broker that owns the topic, or a broker to which we should redirect the client to
            if (result.isRedirect()) {
                boolean newAuthoritative = result.isAuthoritativeRedirect();
                URI redirect;
                try {
                    String redirectUrl = isRequestHttps() ? result.getLookupData().getHttpUrlTls()
                            : result.getLookupData().getHttpUrl();
                    checkNotNull(redirectUrl, "Redirected cluster's service url is not configured");
                    String lookupPath = topicName.isV2() ? LOOKUP_PATH_V2 : LOOKUP_PATH_V1;
                    String path = String.format("%s%s%s?authoritative=%s",
                            redirectUrl, lookupPath, topicName.getLookupName(), newAuthoritative);
                    path = listenerName == null ? path : path + "&listenerName=" + listenerName;
                    redirect = new URI(path);
                } catch (URISyntaxException | NullPointerException e) {
                    log.error("Error in preparing redirect url for {}: {}", topicName, e.getMessage(), e);
                    completeLookupResponseExceptionally(asyncResponse, e);
                    return;
                }
                if (log.isDebugEnabled()) {
                    log.debug("Redirect lookup for topic {} to {}", topicName, redirect);
                }
                completeLookupResponseExceptionally(asyncResponse,
                        new WebApplicationException(Response.temporaryRedirect(redirect).build()));

            } else {
                // Found broker owning the topic
                if (log.isDebugEnabled()) {
                    log.debug("Lookup succeeded for topic {} -- broker: {}", topicName, result.getLookupData());
                }
                completeLookupResponseSuccessfully(asyncResponse, result.getLookupData());
            }
        }).exceptionally(exception -> {
            log.warn("Failed to lookup broker for topic {}: {}", topicName, exception.getMessage(), exception);
            completeLookupResponseExceptionally(asyncResponse, exception);
            return null;
        });
    }

How the bundle-topic relationship is established

getBrokerServiceUrlAsync: from the topic name to the bundle range

public CompletableFuture<Optional<LookupResult>> getBrokerServiceUrlAsync(TopicName topic, LookupOptions options) {
        long startTime = System.nanoTime();
        // Find the bundle from the topic name:
        // persistent://apache/pulsar/test-topic-partition-1  --->  apache/pulsar/0x80000000_0xc0000000
        CompletableFuture<Optional<LookupResult>> future = getBundleAsync(topic)
                .thenCompose(bundle -> findBrokerServiceUrl(bundle, options));
				
        future.thenAccept(optResult -> {
            lookupLatency.observe(System.nanoTime() - startTime, TimeUnit.NANOSECONDS);
            if (optResult.isPresent()) {
                if (optResult.get().isRedirect()) {
                    lookupRedirects.inc();
                } else {
                    lookupAnswers.inc();
                }
            }
        }).exceptionally(ex -> {
            lookupFailures.inc();
            return null;
        });

        return future;
    }

getBundleAsync(topic):

public CompletableFuture<NamespaceBundle> getBundleAsync(TopicName topic) {
        return bundleFactory.getBundlesAsync(topic.getNamespaceObject())
                .thenApply(bundles -> bundles.findBundle(topic));
    }

The bundles object returned here is a NamespaceBundles. Its findBundle method picks the bundle for the topic:


bundles.findBundle

public NamespaceBundle findBundle(TopicName topicName) {
        checkArgument(this.nsname.equals(topicName.getNamespaceObject()));
        // Compute the hash of the topic name
        long hashCode = factory.getLongHashCode(topicName.toString());
        // Find the bundle whose range contains the hash
        NamespaceBundle bundle = getBundle(hashCode);
        if (topicName.getDomain().equals(TopicDomain.non_persistent)) {
            bundle.setHasNonPersistentTopic(true);
        }
        return bundle;
    }


At this point the topic-bundle relationship is established!

How the bundle-broker relationship is established

In one sentence: the LoadManager picks the least-loaded broker and binds the bundle to it.

findBrokerServiceUrl then takes the bundle returned by getBundleAsync(topic), resolves the broker service URL and completes the future with a LookupResult.


This is how the client learns which broker owns the topic, so it can start talking to that broker.
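As a quick usage example, the same lookup can be driven from the Java admin client; a minimal sketch (the serviceHttpUrl is an assumption, point it at any broker's web service address):

import org.apache.pulsar.client.admin.PulsarAdmin;

public class LookupExample {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8081")   // assumed broker web service address
                .build()) {
            // Returns the broker service URL that owns this partition, e.g. pulsar://pulsar-broker-0...:6650
            String owner = admin.lookups()
                    .lookupTopic("persistent://apache/pulsar/test-topic-partition-0");
            System.out.println(owner);
        }
    }
}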

Now the details:

findBrokerServiceUrl:

private CompletableFuture<Optional<LookupResult>> findBrokerServiceUrl(
            NamespaceBundle bundle, LookupOptions options) {
        if (LOG.isDebugEnabled()) {
            LOG.debug("findBrokerServiceUrl: {} - options: {}", bundle, options);
        }

        ConcurrentOpenHashMap<NamespaceBundle, CompletableFuture<Optional<LookupResult>>> targetMap;
        if (options.isAuthoritative()) {
            targetMap = findingBundlesAuthoritative;
        } else {
            targetMap = findingBundlesNotAuthoritative;
        }

        return targetMap.computeIfAbsent(bundle, (k) -> {
            CompletableFuture<Optional<LookupResult>> future = new CompletableFuture<>();

            // First check if we or someone else already owns the bundle
            ownershipCache.getOwnerAsync(bundle).thenAccept(nsData -> {
                if (!nsData.isPresent()) {
                    // No one owns this bundle

                    if (options.isReadOnly()) {
                        // Do not attempt to acquire ownership
                        future.complete(Optional.empty());
                    } else {
                        // Now, no one owns the namespace yet. Hence, we will try to dynamically assign it
                        pulsar.getExecutor().execute(() -> {
                            // Find a candidate broker
                            searchForCandidateBroker(bundle, future, options);
                        });
                    }
                } else if (nsData.get().isDisabled()) {
 ...
    }

searchForCandidateBroker finds a candidate broker for a bundle that currently has no owner:

private void searchForCandidateBroker(NamespaceBundle bundle,
                                          CompletableFuture<Optional<LookupResult>> lookupFuture,
                                          LookupOptions options) {
        // Make sure the leader election service is available
        if (null == pulsar.getLeaderElectionService()) {
            LOG.warn("The leader election has not yet been completed! NamespaceBundle[{}]", bundle);
            lookupFuture.completeExceptionally(
                    new IllegalStateException("The leader election has not yet been completed!"));
            return;
        }
        String candidateBroker = null;

        LeaderElectionService les = pulsar.getLeaderElectionService();
        if (les == null) {
            // The leader election service was not initialized yet. This can happen because the broker service is
            // initialized first and it might start receiving lookup requests before the leader election service is
            // fully initialized.
            LOG.warn("Leader election service isn't initialized yet. "
                            + "Returning empty result to lookup. NamespaceBundle[{}]",
                    bundle);
            lookupFuture.complete(Optional.empty());
            return;
        }

        boolean authoritativeRedirect = les.isLeader();

        try {
           // Prefer the broker that owns the heartbeat or SLAMonitor namespace
            // check if this is Heartbeat or SLAMonitor namespace
            candidateBroker = checkHeartbeatNamespace(bundle);
            if (candidateBroker == null) {
                candidateBroker = checkHeartbeatNamespaceV2(bundle);
            }
            if (candidateBroker == null) {
                String broker = getSLAMonitorBrokerName(bundle);
                // checking if the broker is up and running
                if (broker != null && isBrokerActive(broker)) {
                    candidateBroker = broker;
                }
            }

            if (candidateBroker == null) {
                Optional<LeaderBroker> currentLeader = pulsar.getLeaderElectionService().getCurrentLeader();

                if (options.isAuthoritative()) {
                    // leader broker already assigned the current broker as owner   
                    // If the request was redirected here by another broker (usually the leader), this broker takes the bundle.
                    candidateBroker = pulsar.getSafeWebServiceAddress();
                } else {
										
                 // Otherwise, ask the LoadManager to pick a broker
                    LoadManager loadManager = this.loadManager.get();
                    boolean makeLoadManagerDecisionOnThisBroker = !loadManager.isCentralized() || les.isLeader();
                    if (!makeLoadManagerDecisionOnThisBroker) {
                        // Centralized LoadManager, and this broker is not the leader
                        // If leader is not active, fallback to pick the least loaded from current broker loadmanager
                        boolean leaderBrokerActive = currentLeader.isPresent()
                                && isBrokerActive(currentLeader.get().getServiceUrl());
                        if (!leaderBrokerActive) {
                            makeLoadManagerDecisionOnThisBroker = true;
                            if (!currentLeader.isPresent()) {
                                LOG.warn(
                                        "The information about the current leader broker wasn't available. "
                                                + "Handling load manager decisions in a decentralized way. "
                                                + "NamespaceBundle[{}]",
                                        bundle);
                            } else {
                                LOG.warn(
                                        "The current leader broker {} isn't active. "
                                                + "Handling load manager decisions in a decentralized way. "
                                                + "NamespaceBundle[{}]",
                                        currentLeader.get(), bundle);
                            }
                        }
                    }
                    if (makeLoadManagerDecisionOnThisBroker) {
                        // This broker makes the decision (it is the leader in the centralized case, or the manager is
                        // decentralized): the LoadManager holds every broker's load data and picks a suitable broker
                        // via ModularLoadManagerImpl.selectBrokerForAssignment.
                        Optional<String> availableBroker = getLeastLoadedFromLoadManager(bundle);
                        if (!availableBroker.isPresent()) {
                            LOG.warn("Load manager didn't return any available broker. "
                                            + "Returning empty result to lookup. NamespaceBundle[{}]",
                                    bundle);
                            lookupFuture.complete(Optional.empty());
                            return;
                        }
                        candidateBroker = availableBroker.get();
                        authoritativeRedirect = true;
                    } else {
                        // forward to leader broker to make assignment         
                        candidateBroker = currentLeader.get().getServiceUrl();
                    }
                }
            }
        } catch (Exception e) {
            LOG.warn("Error when searching for candidate broker to acquire {}: {}", bundle, e.getMessage(), e);
            lookupFuture.completeExceptionally(e);
            return;
        }

        try {
            checkNotNull(candidateBroker);
		// If the selected broker is the current broker
            if (candidateBroker.equals(pulsar.getSafeWebServiceAddress())) {
                // Load manager decided that the local broker should try to become the owner
		// Try to acquire ownership of the bundle
                ownershipCache.tryAcquiringOwnership(bundle).thenAccept(ownerInfo -> {
                    if (ownerInfo.isDisabled()) {
                        if (LOG.isDebugEnabled()) {
                            LOG.debug("Namespace bundle {} is currently being unloaded", bundle);
                        }
                        lookupFuture.completeExceptionally(new IllegalStateException(
                                String.format("Namespace bundle %s is currently being unloaded", bundle)));
                    } else {
                        // Found owner for the namespace bundle

                        if (options.isLoadTopicsInBundle()) {
                            // Schedule the task to pre-load topics: once ownership is acquired, load every topic in the bundle
                            pulsar.loadNamespaceTopics(bundle);
                        }
                        // find the target
                        if (options.hasAdvertisedListenerName()) {
                            AdvertisedListener listener =
                                    ownerInfo.getAdvertisedListeners().get(options.getAdvertisedListenerName());
                            if (listener == null) {
                                lookupFuture.completeExceptionally(
                                        new PulsarServerException("the broker do not have "
                                                + options.getAdvertisedListenerName() + " listener"));
                                return;
                            } else {
                                URI url = listener.getBrokerServiceUrl();
                                URI urlTls = listener.getBrokerServiceUrlTls();
                                lookupFuture.complete(Optional.of(
                                        new LookupResult(ownerInfo,
                                                url == null ? null : url.toString(),
                                                urlTls == null ? null : urlTls.toString())));
                                return;
                            }
                        } else {
                            lookupFuture.complete(Optional.of(new LookupResult(ownerInfo)));
                            return;
                        }
                    }
                }).exceptionally(exception -> {
                    LOG.warn("Failed to acquire ownership for namespace bundle {}: {}", bundle, exception);
                    lookupFuture.completeExceptionally(new PulsarServerException(
                            "Failed to acquire ownership for namespace bundle " + bundle, exception));
                    return null;
                });

            } else {
                // Load managed decider some other broker should try to acquire ownership

                if (LOG.isDebugEnabled()) {
                    LOG.debug("Redirecting to broker {} to acquire ownership of bundle {}", candidateBroker, bundle);
                }

                // Now set the redirect url: redirect the request to the candidate broker
                createLookupResult(candidateBroker, authoritativeRedirect, options.getAdvertisedListenerName())
                        .thenAccept(lookupResult -> lookupFuture.complete(Optional.of(lookupResult)))
                        .exceptionally(ex -> {
                            lookupFuture.completeExceptionally(ex);
                            return null;
                        });

            }
        } catch (Exception e) {
            LOG.warn("Error in trying to acquire namespace bundle ownership for {}: {}", bundle, e.getMessage(), e);
            lookupFuture.completeExceptionally(e);
        }
    }

Let's continue with getLeastLoadedFromLoadManager:

private Optional<String> getLeastLoadedFromLoadManager(ServiceUnitId serviceUnit) throws Exception {
        // Ask the LoadManager for the least-loaded broker.
        Optional<ResourceUnit> leastLoadedBroker = loadManager.get().getLeastLoaded(serviceUnit);
        if (!leastLoadedBroker.isPresent()) {
            LOG.warn("No broker is available for {}", serviceUnit);
            return Optional.empty();
        }
        // Return the available broker's lookup address.
        String lookupAddress = leastLoadedBroker.get().getResourceId();
        if (LOG.isDebugEnabled()) {
            LOG.debug("{} : redirecting to the least loaded broker, lookup address={}",
                    pulsar.getSafeWebServiceAddress(),
                    lookupAddress);
        }
        return Optional.of(lookupAddress);
    }

Following getLeastLoaded down the call chain, here is the ModularLoadManagerImpl implementation:

@Override
    public Optional<String> selectBrokerForAssignment(final ServiceUnitId serviceUnit) {
        // Use brokerCandidateCache as a lock to reduce synchronization.
        long startTime = System.nanoTime();

        try {
            synchronized (brokerCandidateCache) {
                final String bundle = serviceUnit.toString();
                if (preallocatedBundleToBroker.containsKey(bundle)) {
                    // If the bundle was already preallocated, return the broker it was assigned to
                    // If the given bundle is already in preallocated, return the selected broker.
                    return Optional.of(preallocatedBundleToBroker.get(bundle));
                }
                final BundleData data = loadData.getBundleData().computeIfAbsent(bundle,
                        key -> getBundleDataOrDefault(bundle));
                brokerCandidateCache.clear();
                // Apply namespace (isolation) policies
                LoadManagerShared.applyNamespacePolicies(serviceUnit, policies, brokerCandidateCache,
                        getAvailableBrokers(),
                        brokerTopicLoadingPredicate);

                // Filter out brokers that already own more topics than the configured threshold
                LoadManagerShared.filterBrokersWithLargeTopicCount(brokerCandidateCache, loadData,
                        conf.getLoadBalancerBrokerMaxTopics());

                // Anti-affinity: distribute namespaces across failure domains and brokers according to the anti-affinity group
                LoadManagerShared.filterAntiAffinityGroupOwnedBrokers(pulsar, serviceUnit.toString(),
                        brokerCandidateCache,
                        brokerToNamespaceToBundleRange, brokerToFailureDomainMap);
                // distribute bundles evenly to candidate-brokers
		// todo 
                LoadManagerShared.removeMostServicingBrokersForNamespace(serviceUnit.toString(), brokerCandidateCache,
                        brokerToNamespaceToBundleRange);
                log.info("{} brokers being considered for assignment of {}", brokerCandidateCache.size(), bundle);

                // Use the filter pipeline to finalize broker candidates.
                // Currently the only built-in broker filter is the version filter
                try {
                    for (BrokerFilter filter : filterPipeline) {
                        filter.filter(brokerCandidateCache, data, loadData, conf);
                    }
                } catch (BrokerFilterException x) {
                    // Rebuild the candidate broker list
                    // restore the list of brokers to the full set
                    LoadManagerShared.applyNamespacePolicies(serviceUnit, policies, brokerCandidateCache,
                            getAvailableBrokers(),
                            brokerTopicLoadingPredicate);
                }

                if (brokerCandidateCache.isEmpty()) {
                    // Rebuild the candidate broker list
                    // restore the list of brokers to the full set
                    LoadManagerShared.applyNamespacePolicies(serviceUnit, policies, brokerCandidateCache,
                            getAvailableBrokers(),
                            brokerTopicLoadingPredicate);
                }

                // Choose a broker among the potentially smaller filtered list, when possible
                Optional<String> broker = placementStrategy.selectBroker(brokerCandidateCache, data, loadData, conf);
                if (log.isDebugEnabled()) {
                    log.debug("Selected broker {} from candidate brokers {}", broker, brokerCandidateCache);
                }

                if (!broker.isPresent()) {
                    // No brokers available
                    return broker;
                }

                final double overloadThreshold = conf.getLoadBalancerBrokerOverloadedThresholdPercentage() / 100.0;
                final double maxUsage = loadData.getBrokerData().get(broker.get()).getLocalData().getMaxResourceUsage();
                if (maxUsage > overloadThreshold) {
                    // All brokers that were in the filtered list were overloaded, so check if there is a better broker
                    LoadManagerShared.applyNamespacePolicies(serviceUnit, policies, brokerCandidateCache,
                            getAvailableBrokers(),
                            brokerTopicLoadingPredicate);
                    broker = placementStrategy.selectBroker(brokerCandidateCache, data, loadData, conf);
                }

                // Add new bundle to preallocated.
                loadData.getBrokerData().get(broker.get()).getPreallocatedBundleData().put(bundle, data);
                preallocatedBundleToBroker.put(bundle, broker.get());

                final String namespaceName = LoadManagerShared.getNamespaceNameFromBundleName(bundle);
                final String bundleRange = LoadManagerShared.getBundleRangeFromBundleName(bundle);
                final ConcurrentOpenHashMap<String, ConcurrentOpenHashSet<String>> namespaceToBundleRange =
                        brokerToNamespaceToBundleRange
                                .computeIfAbsent(broker.get(), k -> new ConcurrentOpenHashMap<>());
                synchronized (namespaceToBundleRange) {
                    namespaceToBundleRange.computeIfAbsent(namespaceName, k -> new ConcurrentOpenHashSet<>())
                            .add(bundleRange);
                }
                return broker;
            }
        } finally {
            selectBrokerForAssignment.observe(System.nanoTime() - startTime, TimeUnit.NANOSECONDS);
        }
    }
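The final selection, placementStrategy.selectBroker, defaults to LeastLongTermMessageRate, which roughly scores each candidate by its long-term message rate (including preallocated bundles) and picks the minimum. As a simplified illustration of the idea (not the real scoring code; names and numbers are made up):

import java.util.Comparator;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class LeastLoadedSketch {
    // Toy stand-in for placementStrategy.selectBroker: given the candidate brokers and a
    // load score per broker (lower is better), pick the least-loaded candidate.
    static Optional<String> selectBroker(Set<String> candidates, Map<String, Double> score) {
        return candidates.stream()
                .min(Comparator.comparingDouble((String b) -> score.getOrDefault(b, Double.MAX_VALUE)));
    }

    public static void main(String[] args) {
        Set<String> candidates = Set.of("broker-0:8080", "broker-1:8080", "broker-2:8080");
        Map<String, Double> longTermMsgRate = Map.of(
                "broker-0:8080", 1200.0,
                "broker-1:8080", 300.0,
                "broker-2:8080", 800.0);
        System.out.println(selectBroker(candidates, longTermMsgRate)); // Optional[broker-1:8080]
    }
}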

Unloading a bundle

Unloading a bundle moves it from one broker to another, migrating every topic on the bundle to the new broker.

Unloading in practice:

root@pulsar-toolset-0:/pulsar# bin/pulsar-admin topics create-partitioned-topic apache/pulsar/test-topic -p 5

 bin/pulsar-admin topics list-partitioned-topics apache/pulsar
"persistent://apache/pulsar/test-topic"

// Get the topic's bundle range
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin topics bundle-range persistent://apache/pulsar/test-topic
"0x59999994_0x66666660"

// Get the brokers currently serving the topic
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin topics  partitioned-lookup persistent://apache/pulsar/test-topic
"persistent://apache/pulsar/test-topic-partition-0    pulsar://pulsar-broker-1.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-1    pulsar://pulsar-broker-0.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-2    pulsar://pulsar-broker-2.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-3    pulsar://pulsar-broker-1.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-4    pulsar://pulsar-broker-0.pulsar-broker.components.svc.cluster.local:6650"

// Unload all bundles of the namespace
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin  namespaces unload apache/pulsar
// Or unload only a specific bundle range
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin  namespaces unload  -b 0x59999994_0x66666660  apache/pulsar

The corresponding admin SDK calls:
if (bundle == null) {
    getAdmin().namespaces().unload(namespace);
} else {
    getAdmin().namespaces().unloadNamespaceBundle(namespace, bundle);
}
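The snippet above runs inside the broker (getAdmin() is the broker's internal admin client). From the outside, the equivalent standalone calls would look roughly like this; the service URL is an assumption:

import org.apache.pulsar.client.admin.PulsarAdmin;

public class UnloadExample {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")   // assumed admin/web service address
                .build()) {
            // Unload every bundle of the namespace ...
            admin.namespaces().unload("apache/pulsar");
            // ... or only a single bundle range
            admin.namespaces().unloadNamespaceBundle("apache/pulsar", "0x59999994_0x66666660");
        }
    }
}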

root@pulsar-toolset-0:/pulsar# bin/pulsar-admin topics  partitioned-lookup persistent://apache/pulsar/test-topic
"persistent://apache/pulsar/test-topic-partition-0    pulsar://pulsar-broker-0.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-1    pulsar://pulsar-broker-2.pulsar-broker.components.svc.cluster.local:6650"  // 注意到这个有所改变。
"persistent://apache/pulsar/test-topic-partition-2    pulsar://pulsar-broker-2.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-3    pulsar://pulsar-broker-1.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-4    pulsar://pulsar-broker-0.pulsar-broker.components.svc.cluster.local:6650"

Unload source code analysis

The main question: after a bundle is unloaded, how does it find a new broker and how is it assigned?

Entry points:


		// Unload the whole namespace
		@PUT
    @Path("/{tenant}/{namespace}/unload")
    @ApiOperation(value = "Unload namespace",
            notes = "Unload an active namespace from the current broker serving it. Performing this operation will"
                    + " let the brokerremoves all producers, consumers, and connections using this namespace,"
                    + " and close all topics (includingtheir persistent store). During that operation,"
                    + " the namespace is marked as tentatively unavailable until thebroker completes "
                    + "the unloading action. This operation requires strictly super user privileges,"
                    + " since it wouldresult in non-persistent message loss and"
                    + " unexpected connection closure to the clients.")
    @ApiResponses(value = {
            @ApiResponse(code = 307, message = "Current broker doesn't serve the namespace"),
            @ApiResponse(code = 403, message = "Don't have admin permission"),
            @ApiResponse(code = 404, message = "Tenant or namespace doesn't exist"),
            @ApiResponse(code = 412, message = "Namespace is already unloaded or Namespace has bundles activated")})
    public void unloadNamespace(@Suspended final AsyncResponse asyncResponse, @PathParam("tenant") String tenant,
            @PathParam("namespace") String namespace) {
        try {
            validateNamespaceName(tenant, namespace);
            internalUnloadNamespace(asyncResponse);
        } catch (WebApplicationException wae) {
            asyncResponse.resume(wae);
        } catch (Exception e) {
            asyncResponse.resume(new RestException(e));
        }
    }

		// Unload a specific bundle range
    @PUT
    @Path("/{tenant}/{namespace}/{bundle}/unload")
    @ApiOperation(value = "Unload a namespace bundle")
    @ApiResponses(value = {
            @ApiResponse(code = 307, message = "Current broker doesn't serve the namespace"),
            @ApiResponse(code = 403, message = "Don't have admin permission") })
    public void unloadNamespaceBundle(@Suspended final AsyncResponse asyncResponse,
            @PathParam("tenant") String tenant, @PathParam("namespace") String namespace,
            @PathParam("bundle") String bundleRange,
            @QueryParam("authoritative") @DefaultValue("false") boolean authoritative) {
        validateNamespaceName(tenant, namespace);
        internalUnloadNamespaceBundle(asyncResponse, bundleRange, authoritative);
    }

Unloading the whole namespace

internalUnloadNamespace:

Policies policies = getNamespacePolicies(namespaceName);

        final List<CompletableFuture<Void>> futures = Lists.newArrayList();
        List<String> boundaries = policies.bundles.getBoundaries(); // iterate over all bundles
        for (int i = 0; i < boundaries.size() - 1; i++) {
            String bundle = String.format("%s_%s", boundaries.get(i), boundaries.get(i + 1));
            try {
                futures.add(pulsar().getAdminClient().namespaces().unloadNamespaceBundleAsync(namespaceName.toString(), // call unloadNamespaceBundleAsync for each bundle
                        bundle));
            } catch (PulsarServerException e) {
                log.error("[{}] Failed to unload namespace {}", clientAppId(), namespaceName, e);
                asyncResponse.resume(new RestException(e));
                return;
            }
        }

So it simply iterates over all bundles and calls the bundle-range unload for each.


Unload with a bundle range

A source-debugging tip: unloading a namespace iterates over all of its bundles. Some bundles will not contain the topic we want to debug, which makes it hard to hit breakpoints quickly, so set defaultNumberOfNamespaceBundles=1.
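In broker.conf that is just the following (a debugging setup; namespaces created afterwards will have a single bundle covering the whole ring):

# broker.conf, debug setup: newly created namespaces get exactly one bundle
defaultNumberOfNamespaceBundles=1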

Before calling unload, run a lookup first to make sure this single bundle actually has an owner (a lookup on an unowned bundle triggers the assignment flow):

bin/pulsar-admin topics  partitioned-lookup persistent://apache/pulsar/test-topic

internalUnloadNamespaceBundle then checks ownership before unloading:

// Check whether the bundle is owned by any broker
	isBundleOwnedByAnyBroker(namespaceName, policies.bundles, bundleRange).thenAccept(flag -> {
            log.info("judge ....  bundleRange: {}", bundleRange);
            if (!flag) {
                log.info("[{}] Namespace bundle is not owned by any broker {}/{}", clientAppId(), namespaceName,
                        bundleRange);
                asyncResponse.resume(Response.noContent().build());
                return; // not owned by any broker, so there is nothing to unload
            }
            NamespaceBundle nsBundle;

            try {
                nsBundle = validateNamespaceBundleOwnership(namespaceName, policies.bundles, bundleRange,
                    authoritative, true);
            } catch (WebApplicationException wae) {
                asyncResponse.resume(wae);
                return;
            }
	// Start unloading
            pulsar().getNamespaceService().unloadNamespaceBundle(nsBundle)

NamespaceService.unloadNamespaceBundle:

public CompletableFuture<Void> unloadNamespaceBundle(NamespaceBundle bundle) {
        // unload namespace bundle
        return unloadNamespaceBundle(bundle, 5, TimeUnit.MINUTES);
    }

    public CompletableFuture<Void> unloadNamespaceBundle(NamespaceBundle bundle, long timeout, TimeUnit timeoutUnit) {
        // unload namespace bundle
        OwnedBundle ob = ownershipCache.getOwnedBundle(bundle);
        if (ob == null) {
            return FutureUtil.failedFuture(new IllegalStateException("Bundle " + bundle + " is not currently owned"));
        } else {
            return ob.handleUnloadRequest(pulsar, timeout, timeoutUnit);
        }
    }

ob.handleUnloadRequest:

public CompletableFuture<Void> handleUnloadRequest(PulsarService pulsar, long timeout, TimeUnit timeoutUnit) {
        long unloadBundleStartTime = System.nanoTime();
        // Need a per namespace RenetrantReadWriteLock
        // Here to do a writeLock to set the flag and proceed to check and close connections
        try {
            while (!this.nsLock.writeLock().tryLock(1, TimeUnit.SECONDS)) {
                // Using tryLock to avoid deadlocks caused by 2 threads trying to acquire 2 readlocks (eg: replicators)
                // while a handleUnloadRequest happens in the middle
                LOG.warn("Contention on OwnedBundle rw lock. Retrying to acquire lock write lock");
            }

            try {
	// Mark the bundle as inactive: reject all new producers and consumers
                // set the flag locally s.t. no more producer/consumer to this namespace is allowed
                if (!IS_ACTIVE_UPDATER.compareAndSet(this, TRUE, FALSE)) {
                    // An exception is thrown when the namespace is not in active state (i.e. another thread is
                    // removing/have removed it)
                    return FutureUtil.failedFuture(new IllegalStateException(
                            "Namespace is not active. ns:" + this.bundle + "; state:" + IS_ACTIVE_UPDATER.get(this)));
                }
            } finally {
                // no matter success or not, unlock
                this.nsLock.writeLock().unlock();
            }
        } catch (InterruptedException e) {
            return FutureUtil.failedFuture(e);
        }

        AtomicInteger unloadedTopics = new AtomicInteger();
        LOG.info("Disabling ownership: {}", this.bundle);

        // close topics forcefully
        return pulsar.getNamespaceService().getOwnershipCache()
                .updateBundleState(this.bundle, false)
                .thenCompose(v -> pulsar.getBrokerService().unloadServiceUnit(bundle, true, timeout, timeoutUnit)) // unloadServiceUnit closes every topic: all connections and the underlying storage (ledgers)
                .handle((numUnloadedTopics, ex) -> {
                    if (ex != null) {
                        // ignore topic-close failure to unload bundle
                        LOG.error("Failed to close topics under namespace {}", bundle.toString(), ex);
                    } else {
                        unloadedTopics.set(numUnloadedTopics);
                    }
                    // clean up topics that failed to unload from the broker ownership cache
                    pulsar.getBrokerService().cleanUnloadedTopicFromCache(bundle);
                    return null;
                })
                .thenCompose(v -> {
                    // delete ownership node on zk
                    return pulsar.getNamespaceService().getOwnershipCache().removeOwnership(bundle);
                }).whenComplete((ignored, ex) -> {
                    double unloadBundleTime = TimeUnit.NANOSECONDS
                            .toMillis((System.nanoTime() - unloadBundleStartTime));
                    LOG.info("Unloading {} namespace-bundle with {} topics completed in {} ms", this.bundle,
                            unloadedTopics, unloadBundleTime, ex);
                });
    }

It deletes the ownership node in ZooKeeper:

/namespace/apache/pulsar/0x00000000_0xffffffff

{
  "nativeUrl" : "pulsar://192.168.18.135:6651",
  "nativeUrlTls" : "pulsar+ssl://192.168.18.135:6671",
  "httpUrl" : "http://192.168.18.135:8081",
  "httpUrlTls" : "https://192.168.18.135:8441",
  "disabled" : false,
  "advertisedListeners" : { }
}

Note that unload only deletes the ephemeral ownership node; re-assignment still happens through the lookup flow!

Unloading a topic

Unloading a topic is not the same as unloading a bundle, and it cannot be used to shift traffic. It only disconnects the topic's producers, consumers and replicators, and closes the underlying ledger. It is useful for recovering from abnormal client states by forcing a reset. If the topic is fenced, it will reject new client connection attempts.

Topic unloading in practice:


root@pulsar-toolset-0:/pulsar# bin/pulsar-admin topics create-partitioned-topic apache/pulsar/test-topic -p 5

 bin/pulsar-admin topics list-partitioned-topics apache/pulsar
"persistent://apache/pulsar/test-topic"

// Get the topic's bundle range
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin topics bundle-range persistent://apache/pulsar/test-topic
"0x59999994_0x66666660"

// Get the brokers currently serving the topic
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin topics  partitioned-lookup persistent://apache/pulsar/test-topic
"persistent://apache/pulsar/test-topic-partition-0    pulsar://pulsar-broker-0.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-1    pulsar://pulsar-broker-2.pulsar-broker.components.svc.cluster.local:6650"  
"persistent://apache/pulsar/test-topic-partition-2    pulsar://pulsar-broker-2.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-3    pulsar://pulsar-broker-1.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-4    pulsar://pulsar-broker-0.pulsar-broker.components.svc.cluster.local:6650"

// Unload the topic
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin topics unload persistent://apache/pulsar/test-topic

 
// After unloading the topic nothing changes: no re-assignment happens, no matter how many times you unload it.
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin topics  partitioned-lookup persistent://apache/pulsar/test-topic
"persistent://apache/pulsar/test-topic-partition-0    pulsar://pulsar-broker-0.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-1    pulsar://pulsar-broker-2.pulsar-broker.components.svc.cluster.local:6650"  
"persistent://apache/pulsar/test-topic-partition-2    pulsar://pulsar-broker-2.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-3    pulsar://pulsar-broker-1.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-4    pulsar://pulsar-broker-0.pulsar-broker.components.svc.cluster.local:6650"

Splitting a bundle

Automatic splitting

Automatic split configuration:

# Enable/disable automatic bundle splitting for namespaces
loadBalancerAutoBundleSplitEnabled=true

# Enable/disable automatic unloading of the bundles produced by a split
loadBalancerAutoUnloadSplitBundlesEnabled=true

# Maximum number of topics in a bundle; exceeding it triggers a split
loadBalancerNamespaceBundleMaxTopics=1000

# Maximum number of sessions (producers + consumers) in a bundle; exceeding it triggers a split
loadBalancerNamespaceBundleMaxSessions=1000

# Maximum msgRate (in + out) of a bundle; exceeding it triggers a split
loadBalancerNamespaceBundleMaxMsgRate=30000

# Maximum bandwidth (in + out) of a bundle, in MBytes; exceeding it triggers a split
loadBalancerNamespaceBundleMaxBandwidthMbytes=100

# Maximum number of bundles in a namespace (cap used when splitting automatically)
loadBalancerNamespaceMaximumBundles=128

When a bundle splits, the topics it manages are forced to reconnect and look up again, which is not very client-friendly, so splits are usually done manually during low-traffic periods.

Manual splitting

Manual splitting in practice

hcb@ubuntu:~/data/code/pulsar$ bin/pulsar-admin topics list-partitioned-topics apache/pulsar
hcb@ubuntu:~/data/code/pulsar$ bin/pulsar-admin topics create-partitioned-topic apache/pulsar/test-topic -p 5
hcb@ubuntu:~/data/code/pulsar$ bin/pulsar-admin topics bundle-range persistent://apache/pulsar/test-topic
"0x40000000_0x80000000"
hcb@ubuntu:~/data/code/pulsar$ bin/pulsar-admin topics  partitioned-lookup persistent://apache/pulsar/test-topic
"persistent://apache/pulsar/test-topic-partition-0    pulsar://192.168.18.135:6651"
"persistent://apache/pulsar/test-topic-partition-1    pulsar://192.168.18.135:6651"
"persistent://apache/pulsar/test-topic-partition-2    pulsar://192.168.18.135:6652"
"persistent://apache/pulsar/test-topic-partition-3    pulsar://192.168.18.135:6652"
"persistent://apache/pulsar/test-topic-partition-4    pulsar://192.168.18.135:6651"

bin/pulsar-admin namespaces split-bundle --bundle 0x40000000_0x80000000  apache/pulsar

After the split, the 4 bundles become 5:

/admin/local-policies/apache/pulsar
{
  "bundles" : {
    "boundaries" : [ "0x00000000", "0x40000000", "0x60000000", "0x80000000", "0xc0000000", "0xffffffff" ],
    "numBundles" : 5
  }
}

hcb@ubuntu:~/data/code/pulsar$ bin/pulsar-admin topics  partitioned-lookup persistent://apache/pulsar/test-topic
"persistent://apache/pulsar/test-topic-partition-0    pulsar://192.168.18.135:6651"
"persistent://apache/pulsar/test-topic-partition-1    pulsar://192.168.18.135:6651"
"persistent://apache/pulsar/test-topic-partition-2    pulsar://192.168.18.135:6652"
"persistent://apache/pulsar/test-topic-partition-3    pulsar://192.168.18.135:6652"
"persistent://apache/pulsar/test-topic-partition-4    pulsar://192.168.18.135:6651"

split-bundle supports two algorithms:

You can split namespace bundles in two ways, by setting supportedNamespaceBundleSplitAlgorithms to range_equally_divide or topic_count_equally_divide in the broker.conf file.
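In broker.conf this looks roughly as follows (defaultNamespaceBundleSplitAlgorithm picks the algorithm used when a split request does not specify one; double-check the exact keys against your version):

# Algorithms the broker accepts for bundle splitting
supportedNamespaceBundleSplitAlgorithms=range_equally_divide,topic_count_equally_divide
# Algorithm used when a split request does not name one
defaultNamespaceBundleSplitAlgorithm=range_equally_divide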

One important caveat!

The bundles produced by a split are assigned to the current broker, unless you enable unload!
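If you want the new bundles to be re-assigned immediately, pass the unload flag to split-bundle (flag name as I understand it on 2.9.x; verify with pulsar-admin namespaces split-bundle --help):

bin/pulsar-admin namespaces split-bundle --bundle 0x40000000_0x80000000 --unload apache/pulsar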

How splitting works

Whether the split is triggered automatically by the LoadManager or manually, it goes through the same /{tenant}/{namespace}/{bundle}/split request!

@PUT
    @Path("/{tenant}/{namespace}/{bundle}/split")
    @ApiOperation(value = "Split a namespace bundle")
    @ApiResponses(value = {
            @ApiResponse(code = 307, message = "Current broker doesn't serve the namespace"),
            @ApiResponse(code = 403, message = "Don't have admin permission") })
    public void splitNamespaceBundle(
            @Suspended final AsyncResponse asyncResponse,
            @PathParam("tenant") String tenant,
            @PathParam("namespace") String namespace,
            @PathParam("bundle") String bundleRange,
            @QueryParam("authoritative") @DefaultValue("false") boolean authoritative,
            @QueryParam("unload") @DefaultValue("false") boolean unload,  // 默认是分配给当前的broker,是不会卸载的!
            @QueryParam("splitAlgorithmName") String splitAlgorithmName) {

        try {
            validateNamespaceName(tenant, namespace);
            internalSplitNamespaceBundle(asyncResponse, bundleRange, authoritative, unload, splitAlgorithmName);
        } catch (WebApplicationException wae) {
            asyncResponse.resume(wae);
        } catch (Exception e) {
            asyncResponse.resume(new RestException(e));
        }
    }
protected void internalSplitNamespaceBundle(AsyncResponse asyncResponse, String bundleName,
                                                boolean authoritative, boolean unload, String splitAlgorithmName) {
        validateSuperUserAccess();
        checkNotNull(bundleName, "BundleRange should not be null");
        log.info("[{}] Split namespace bundle {}/{}", clientAppId(), namespaceName, bundleName);

        String bundleRange = bundleName.equals(Policies.LARGEST_BUNDLE)
                ? findLargestBundleWithTopics(namespaceName).getBundleRange()
                : bundleName;

        Policies policies = getNamespacePolicies(namespaceName);

        if (namespaceName.isGlobal()) {
            // check cluster ownership for a given global namespace: redirect if peer-cluster owns it
            validateGlobalNamespaceOwnership(namespaceName);
        } else {
            validateClusterOwnership(namespaceName.getCluster());
            validateClusterForTenant(namespaceName.getTenant(), namespaceName.getCluster());
        }

        validatePoliciesReadOnlyAccess();

        List<String> supportedNamespaceBundleSplitAlgorithms =
                pulsar().getConfig().getSupportedNamespaceBundleSplitAlgorithms();
        if (StringUtils.isNotBlank(splitAlgorithmName)
                && !supportedNamespaceBundleSplitAlgorithms.contains(splitAlgorithmName)) {
            asyncResponse.resume(new RestException(Status.PRECONDITION_FAILED,
                    "Unsupported namespace bundle split algorithm, supported algorithms are "
                            + supportedNamespaceBundleSplitAlgorithms));
        }

        NamespaceBundle nsBundle;

        try {
            nsBundle = validateNamespaceBundleOwnership(namespaceName, policies.bundles, bundleRange,
                    authoritative, true);
        } catch (Exception e) {
            asyncResponse.resume(e);
            return;
        }

        pulsar().getNamespaceService().splitAndOwnBundle(nsBundle, unload,
                getNamespaceBundleSplitAlgorithmByName(splitAlgorithmName))
                .thenRun(() -> {
                    log.info("[{}] Successfully split namespace bundle {}", clientAppId(), nsBundle.toString());
                    asyncResponse.resume(Response.noContent().build());
                }).exceptionally(ex -> {
            if (ex.getCause() instanceof IllegalArgumentException) {
                log.error("[{}] Failed to split namespace bundle {}/{} due to {}", clientAppId(), namespaceName,
                        bundleRange, ex.getMessage());
                asyncResponse.resume(new RestException(Status.PRECONDITION_FAILED,
                        "Split bundle failed due to invalid request"));
            } else {
                log.error("[{}] Failed to split namespace bundle {}/{}", clientAppId(), namespaceName, bundleRange, ex);
                asyncResponse.resume(new RestException(ex.getCause()));
            }
            return null;
        });
    }

The overall flow: split the bundle, write the new bundle boundaries to ZooKeeper, then run the topics through the findBundle flow again (analyzed above).

Load analysis

A few ways to inspect load:

  • bin/pulsar-perf monitor-brokers --connect-string 127.0.0.1:2181
  • bin/pulsar-admin broker-stats load-report
  • bin/pulsar-admin broker-stats topics -i

LoadManager overview

The load manager class has several implementations; you choose one by setting loadManagerClassName in the configuration.

If loading the configured class fails, org.apache.pulsar.broker.loadbalance.impl.SimpleLoadManagerImpl is used as a fallback.

The default is org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl.

  • ModularLoadManagerWrapper is an adapter (adapter pattern) between ModularLoadManagerImpl and LoadManager. (ModularLoadManagerImpl implements the newer ModularLoadManager interface and cannot be plugged into the existing LoadManager flow directly, hence the adapter.)
public class ModularLoadManagerWrapper implements LoadManager {
private ModularLoadManager loadManager;  // points at ModularLoadManagerImpl
  • SimpleLoadManagerImpl: the simplest rebalancing implementation; the balancing logic is not pluggable and ships with three fixed placement strategies.
  • NoopLoadManager: an empty implementation; with it there is no rebalancing and no load reporting at all.
  • The LoadManager implementation class can also be changed dynamically:
bin/pulsar-admin update-dynamic-config --config loadManagerClassName --value 类名

The LoadManager is responsible for three things (a minimal sketch of the scheduling pattern follows the list):

  • On every broker, the LoadManager periodically reports that broker's load to the metadata store (zk) —> LoadReportUpdaterTask
  • On the leader, the LoadManager periodically writes aggregated load statistics (per-broker and per-bundle history) to zk —> loadResourceQuotaTask
  • On the leader, the LoadManager periodically rebalances and splits bundles based on the load data —> loadSheddingTask
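As a rough sketch of the pattern behind these tasks (the real ones are LoadReportUpdaterTask, LoadResourceQuotaUpdaterTask and LoadSheddingTask scheduled inside PulsarService), each is a Runnable fired at a fixed rate that delegates to the LoadManager; the facade and intervals below are illustrative:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class LoadTaskSchedulingSketch {
    // Minimal stand-in for the LoadManager methods the real tasks delegate to
    interface LoadManagerFacade {
        void writeLoadReportOnZookeeper() throws Exception; // this broker's load -> metadata store
        void doLoadShedding();                               // leader only: unload bundles to rebalance
    }

    public static void main(String[] args) {
        LoadManagerFacade lm = new LoadManagerFacade() {
            public void writeLoadReportOnZookeeper() { System.out.println("load report written"); }
            public void doLoadShedding() { System.out.println("load shedding check"); }
        };
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        // Every broker: periodically publish its own load report
        executor.scheduleAtFixedRate(() -> {
            try { lm.writeLoadReportOnZookeeper(); } catch (Exception e) { e.printStackTrace(); }
        }, 0, 60, TimeUnit.SECONDS);
        // Leader only: periodically decide whether any bundle should be unloaded (shed)
        executor.scheduleAtFixedRate(lm::doLoadShedding, 1, 1, TimeUnit.MINUTES);
    }
}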

Here is the LoadManager interface:

/**
 * LoadManager runs through set of load reports collected from different brokers and generates a recommendation of
 * namespace/ServiceUnit placement on machines/ResourceUnit. Each Concrete Load Manager will use different algorithms to
 * generate this mapping.
 *
 * Concrete Load Manager is also return the least loaded broker that should own the new namespace.
 */

/*
  The LoadManager runs on every broker. Each broker reports its load, and the LoadManager produces a recommendation
  of which machine/ResourceUnit a namespace/ServiceUnit should be placed on.
  It also returns the least-loaded broker that should own a new namespace.
*/
public interface LoadManager {
    Logger LOG = LoggerFactory.getLogger(LoadManager.class);

    String LOADBALANCE_BROKERS_ROOT = "/loadbalance/brokers"; // root zk node for brokers

    void start() throws PulsarServerException;

    /**
     * Is centralized decision making to assign a new bundle.
     * (i.e. whether bundle assignment is decided centrally)
     */
    boolean isCentralized();

    /**
     * Returns the Least Loaded Resource Unit decided by some algorithm or criteria which is implementation specific.
     * (i.e. the least-loaded broker, chosen by an implementation-specific algorithm)
     */
    Optional<ResourceUnit> getLeastLoaded(ServiceUnitId su) throws Exception;

    /**
     * Generate the load report.
     * pulsar-admin broker-stats load-report ends up calling this.
     */
    LoadManagerReport generateLoadReport() throws Exception;

    /**
     * Set flag to force load report update.
     * (sets a flag to force the next load report update)
     */
    void setLoadReportForceUpdateFlag();

    /**
     * Publish the current load report on ZK (push this broker's current load to the metadata service).
     */
    void writeLoadReportOnZookeeper() throws Exception;

    /**
     * Publish the current load report on ZK, forced or not.
     * By default rely on method writeLoadReportOnZookeeper().
     */
    default void writeLoadReportOnZookeeper(boolean force) throws Exception {
        writeLoadReportOnZookeeper();
    }

    /**
     * Update namespace bundle resource quota on ZK.
     */
    void writeResourceQuotasToZooKeeper() throws Exception;

    /**
     * Generate load balancing stats metrics.
     */
    List<Metrics> getLoadBalancingMetrics();

    /**
     * Unload a candidate service unit to balance the load. The leader calls this periodically.
     */
    void doLoadShedding();

    /**
     * Namespace bundle split.
     */
    void doNamespaceBundleSplit() throws Exception;

    /**
     * Removes visibility of current broker from loadbalancer list so, other brokers can't redirect any request to this
     * broker and this broker won't accept new connection requests. (Author's question: does this also remove all of its bundles?)
     *
     * @throws Exception
     */
    void disableBroker() throws Exception;

    /**
     * Get list of available brokers in cluster.
     *
     * @return
     * @throws Exception
     */
    Set<String> getAvailableBrokers() throws Exception;

    void stop() throws PulsarServerException;

    /**
     * Initialize this LoadManager.
     *
     * @param pulsar
     *            The service to initialize this with.
     */
    void initialize(PulsarService pulsar);
    // Factory method: create a LoadManager instance via reflection
    static LoadManager create(final PulsarService pulsar) {
        try {
            final ServiceConfiguration conf = pulsar.getConfiguration();
            final Class<?> loadManagerClass = Class.forName(conf.getLoadManagerClassName());
            // Assume there is a constructor with one argument of PulsarService.
            final Object loadManagerInstance = loadManagerClass.getDeclaredConstructor().newInstance();
            // if it already implements LoadManager, use it directly
            if (loadManagerInstance instanceof LoadManager) {
                final LoadManager casted = (LoadManager) loadManagerInstance;
                casted.initialize(pulsar);
                return casted;
            } else if (loadManagerInstance instanceof ModularLoadManager) {  // wrap a ModularLoadManager in the adapter
                final LoadManager casted = new ModularLoadManagerWrapper((ModularLoadManager) loadManagerInstance);
                casted.initialize(pulsar);
                return casted;
            }
        } catch (Exception e) {
            LOG.warn("Error when trying to create load manager: ", e);
        }
        // If we failed to create a load manager, default to SimpleLoadManagerImpl.
        return new SimpleLoadManagerImpl(pulsar);
    }

Initialization

So how exactly is the leader elected?

In one sentence: by racing to create an ephemeral node.

PulsarService holds a LeaderElectionService. Its implementation is quite thin: it composes CoordinationServiceImpl (the CoordinationService implementation) to do the actual work.

private LeaderElectionService leaderElectionService = null;
public class LeaderElectionService implements AutoCloseable {

    private static final String ELECTION_ROOT = "/loadbalance/leader";

    private final LeaderElection<LeaderBroker> leaderElection;
    private final LeaderBroker localValue;

    public LeaderElectionService(CoordinationService cs, String localWebServiceAddress,
            Consumer<LeaderElectionState> listener) {
        this.leaderElection = cs.getLeaderElection(LeaderBroker.class, ELECTION_ROOT, listener);
        this.localValue = new LeaderBroker(localWebServiceAddress);
    }

    public void start() {
        leaderElection.elect(localValue).join();
    }

CoordinationServiceImpl.getLeaderElection:

@Override
    public <T> LeaderElection<T> getLeaderElection(Class<T> clazz, String path,
            Consumer<LeaderElectionState> stateChangesListener) {

        return (LeaderElection<T>) leaderElections.computeIfAbsent(path,
                key -> new LeaderElectionImpl<T>(store, clazz, path, stateChangesListener, executor)); // returns a LeaderElectionImpl
    }

LeaderElectionImpl:

LeaderElectionImpl(MetadataStoreExtended store, Class<T> clazz, String path,
            Consumer<LeaderElectionState> stateChangesListener,
                       ScheduledExecutorService executor) {
        this.path = path;  // the "/loadbalance/leader" path passed in above
        this.serde = new JSONMetadataSerdeSimpleType<>(TypeFactory.defaultInstance().constructSimpleType(clazz, null));
        this.store = store;
        this.cache = store.getMetadataCache(clazz);
        this.leaderElectionState = LeaderElectionState.NoLeader;
        this.internalState = InternalState.Init;
        this.stateChangesListener = stateChangesListener;
        this.executor = executor;

        store.registerListener(this::handlePathNotification);   // register a listener for changes on the path
        store.registerSessionListener(this::handleSessionNotification); // register a session listener
    }

The election method:

private synchronized CompletableFuture<LeaderElectionState> elect() {
        // First check if there's already a leader elected
        internalState = InternalState.ElectionInProgress;
        return store.get(path).thenCompose(optLock -> {
            if (optLock.isPresent()) {
                return handleExistingLeaderValue(optLock.get());
            } else {
                return tryToBecomeLeader();  // try to become the leader
            }
        }).thenCompose(leaderElectionState ->
                // make sure that the cache contains the current leader
                // so that getLeaderValueIfPresent works on all brokers
                cache.get(path).thenApply(__ -> leaderElectionState));
    }
    
    private synchronized CompletableFuture<LeaderElectionState> tryToBecomeLeader() {
        byte[] payload;
        try {
            payload = serde.serialize(path, proposedValue.get());
        } catch (Throwable t) {
            return FutureUtils.exception(t);
        }

        CompletableFuture<LeaderElectionState> result = new CompletableFuture<>();
        store.put(path, payload, Optional.of(-1L), EnumSet.of(CreateOption.Ephemeral))  // create the "/loadbalance/leader" ephemeral node
                .thenAccept(stat -> {
                    synchronized (LeaderElectionImpl.this) {
                        if (internalState == InternalState.ElectionInProgress) {

In PulsarService's start method, if this broker is elected leader it starts the loadSheddingTask and LoadResourceQuotaUpdaterTask:

// Start the leader election service
startLeaderElectionService();

protected void startLeaderElectionService() {
        this.leaderElectionService = new LeaderElectionService(coordinationService, getSafeWebServiceAddress(),
                state -> {
                    if (state == LeaderElectionState.Leading) {
                        LOG.info("This broker was elected leader");
                        if (getConfiguration().isLoadBalancerEnabled()) {  // enabled by default
                            long loadSheddingInterval = TimeUnit.MINUTES
                                    .toMillis(getConfiguration().getLoadBalancerSheddingIntervalMinutes());
                            long resourceQuotaUpdateInterval = TimeUnit.MINUTES
                                    .toMillis(getConfiguration().getLoadBalancerResourceQuotaUpdateIntervalMinutes());
                            // cancel any previously scheduled loadSheddingTask and loadResourceQuotaTask
                            if (loadSheddingTask != null) {
                                loadSheddingTask.cancel(false);
                            }
                            if (loadResourceQuotaTask != null) {
                                loadResourceQuotaTask.cancel(false);
                            }
                            // the leader schedules loadSheddingTask and LoadResourceQuotaUpdaterTask
                            loadSheddingTask = loadManagerExecutor.scheduleAtFixedRate(
                                    new LoadSheddingTask(loadManager),
                                    loadSheddingInterval, loadSheddingInterval, TimeUnit.MILLISECONDS);
                            loadResourceQuotaTask = loadManagerExecutor.scheduleAtFixedRate(
                                    new LoadResourceQuotaUpdaterTask(loadManager), resourceQuotaUpdateInterval,
                                    resourceQuotaUpdateInterval, TimeUnit.MILLISECONDS);
                        }
                    } else {
                        if (leaderElectionService != null) {
                            LOG.info("This broker is a follower. Current leader is {}",
                                    leaderElectionService.getCurrentLeader());
                        }
                        if (loadSheddingTask != null) {
                            loadSheddingTask.cancel(false);
                            loadSheddingTask = null;
                        }
                        if (loadResourceQuotaTask != null) {
                            loadResourceQuotaTask.cancel(false);
                            loadResourceQuotaTask = null;
                        }
                    }
                });

        leaderElectionService.start();
    }
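
leader 信息保存在元数据存储的 /loadbalance/leader 临时节点上。下面是一个查看当前 leader 的小示例(假设元数据存储使用 ZooKeeper、地址为 localhost:2181,仅作演示):

# 使用 ZooKeeper 自带客户端查看 leader 节点内容
zkCli.sh -server localhost:2181 get /loadbalance/leader
# 节点内容包含 leader broker 的服务地址(即上面传入的 getSafeWebServiceAddress())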

注: 从源码中,我们看出元数据服务抽象出了一个MetadataStore,这样脱离zk会更加轻松。

ModularLoadManagerImpl详解

属性


// Path to ZNode whose children contain BundleData jsons for each bundle (new API version of ResourceQuota).
// Bundle 负载的根目录
public static final String BUNDLE_DATA_PATH = "/loadbalance/bundle-data";

// todo 什么时候bundle是unseen的? —— 指还没有采集到负载样本的 bundle(例如新建或刚分裂出来的),此时用下面的默认值估算

// Default message rate to assume for unseen bundles.
public static final double DEFAULT_MESSAGE_RATE = 50;

// Default message throughput to assume for unseen bundles.
// Note that the default message size is implicitly defined as DEFAULT_MESSAGE_THROUGHPUT / DEFAULT_MESSAGE_RATE.
public static final double DEFAULT_MESSAGE_THROUGHPUT = 50000;
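
// 由上面两个默认值可以推出隐含的默认消息大小:
// DEFAULT_MESSAGE_THROUGHPUT / DEFAULT_MESSAGE_RATE = 50000 / 50 = 1000 字节/条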

//  为了统计长期负载的样本
// The number of effective samples to keep for observing long term data.
public static final int NUM_LONG_SAMPLES = 1000;

//  为了统计短期负载的样本
// The number of effective samples to keep for observing short term data.
public static final int NUM_SHORT_SAMPLES = 10;

// Path to ZNode whose children contain ResourceQuota jsons.
public static final String RESOURCE_QUOTA_ZPATH = "/loadbalance/resource-quota/namespace";

// Path to ZNode containing TimeAverageBrokerData jsons for each broker.
// 每个broker的长期和短期负载数据结点
public static final String TIME_AVERAGE_BROKER_ZPATH = "/loadbalance/broker-time-average";

// Set of broker candidates to reuse so that object creation is avoided.
// 把候选broker缓存起来,避免重复创建
private final Set<String> brokerCandidateCache;

// Cache of the local broker data, stored in LoadManager.LOADBALANCE_BROKER_ROOT.
// LocalBrokerData 是broker负载的信息。
private LockManager<LocalBrokerData> brokersData;
private ResourceLock<LocalBrokerData> brokerDataLock;
// 各个缓存,避免频繁读取zk
private MetadataCache<BundleData> bundlesCache;
private MetadataCache<ResourceQuota> resourceQuotaCache;
private MetadataCache<TimeAverageBrokerData> timeAverageBrokerDataCache;

// Broker host usage object used to calculate system resource usage.
// broker用来计算系统资源的,比如内存,cpu等
private BrokerHostUsage brokerHostUsage;

// Map from brokers to namespaces to the bundle ranges in that namespace assigned to that broker.
// Used to distribute bundles within a namespace evenly across brokers.
// 存储 broker → namespace → 该 namespace 下分配给该 broker 的 bundle range 的映射关系
private final ConcurrentOpenHashMap<String, ConcurrentOpenHashMap<String, ConcurrentOpenHashSet<String>>>
        brokerToNamespaceToBundleRange;

// Path to the ZNode containing the LocalBrokerData json for this broker.
// 每个broker是一个/loadbalance/brokers/192.168.18.135:8081
private String brokerZnodePath;

// Strategy to use for splitting bundles.
private BundleSplitStrategy bundleSplitStrategy;

// Service configuration belonging to the pulsar service.
private ServiceConfiguration conf;

// The default bundle stats which are used to initialize historic data.
// This data is overridden after the bundle receives its first sample.
// 初始化历史数据时使用,bundle 收到第一个样本后会被覆盖
private final NamespaceBundleStats defaultStats;

// Used to filter brokers from being selected for assignment.
// 用来过滤可选的broker
private final List<BrokerFilter> filterPipeline;

// Timestamp of last invocation of updateBundleData.
// 最后一次调用updateBundleData的时间
private long lastBundleDataUpdate;

// LocalBrokerData available before most recent update.
// 最近一次更新之前的 LocalBrokerData(用于对比前后变化)
private LocalBrokerData lastData;

// Pipeline used to determine what namespaces, if any, should be unloaded.
private final List<LoadSheddingStrategy> loadSheddingPipeline;

// Local data for the broker this is running on.
// 当前broker运行负载的数据
private LocalBrokerData localData;

// Load data comprising data available for each broker.
// 包含每个broker的负载数据
private final LoadData loadData;

// Used to determine whether a bundle is preallocated.
// 预分配的 bundle 到目标 broker 的映射,用于判断某个 bundle 是否已被预分配
private final Map<String, String> preallocatedBundleToBroker;

// Strategy used to determine where new topics should be placed.
// 确认一个新的topic应该放在哪里的策略
private ModularLoadManagerStrategy placementStrategy;

// Policies used to determine which brokers are available for particular namespaces.
// 确定特定命名空间可以放在哪些 broker 上的策略
private SimpleResourceAllocationPolicies policies;

// Pulsar service used to initialize this.
private PulsarService pulsar;

// Executor service used to regularly update broker data.
private final ScheduledExecutorService scheduler;

// check if given broker can load persistent/non-persistent topic
// 检查给定的broker能否加载持久化和非持久化topic
private final BrokerTopicLoadingPredicate brokerTopicLoadingPredicate;

private Map<String, String> brokerToFailureDomainMap;

private SessionEvent lastMetadataSessionEvent = SessionEvent.Reconnected;

// record load balancing metrics
private AtomicReference<List<Metrics>> loadBalancingMetrics = new AtomicReference<>();
// record bundle unload metrics
private AtomicReference<List<Metrics>> bundleUnloadMetrics = new AtomicReference<>();
// record bundle split metrics
private AtomicReference<List<Metrics>> bundleSplitMetrics = new AtomicReference<>();

private long bundleSplitCount = 0;
private long unloadBrokerCount = 0;
private long unloadBundleCount = 0;

// 保护负载数据的互斥锁
private final Lock lock = new ReentrantLock();

负载信息

/loadbalance/broker-time-average下是一段时间内,broker的长期和短期负载


/loadbalance/bundle-data 下是各个 bundle 的负载,同样分为长期和短期两部分;其中 topics 表示该 bundle 上的 topic 数。


/loadbalance/brokers/192.168.18.135:8081 是临时节点,包含该 broker 的大量负载信息,broker-stats load-report 展示的信息就是从这里来的。

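想直观看到这份数据,可以直接执行 admin 命令查看当前 broker 的负载报告(示例命令,输出内容大致对应下面的 LocalBrokerData):

bin/pulsar-admin broker-stats load-report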

ModularLoadManagerImpl 会生成 LocalBrokerData 对象,LocalBrokerData 包含 broker 的所有负载数据:

public class LocalBrokerData implements LoadManagerReport {

    // URLs to satisfy contract of ServiceLookupData (used by NamespaceService). broker连接信息
    private final String webServiceUrl;
    private final String webServiceUrlTls;
    private final String pulsarServiceUrl;
    private final String pulsarServiceUrlTls;
    private boolean persistentTopicsEnabled = true; // 启用持久化主题
    private boolean nonPersistentTopicsEnabled = true; // 启用非持久化主题

    // Most recently available system resource usage. 最新的系统资源使用量
    private ResourceUsage cpu;
    private ResourceUsage memory; // 内存
    private ResourceUsage directMemory; // 堆外内存

    private ResourceUsage bandwidthIn; // 入带宽
    private ResourceUsage bandwidthOut; // 出带宽

    // Message data from the most recent namespace bundle stats.  bundle相关的状态
    private double msgThroughputIn;   // 总接收消息吞吐量
    private double msgThroughputOut;  // 总推送消息吞吐量
    private double msgRateIn;    //入消息的QPS
    private double msgRateOut;  // 出消息的QPS

    // Timestamp of last update.
    private long lastUpdate;

    // The stats given in the most recent invocation of update. 每个Bundle的详细流量信息
    private Map<String, NamespaceBundleStats> lastStats;

    private int numTopics; // broker上的总主题(分区)数
    private int numBundles; // broker上的总bundle数
    private int numConsumers; // broker上的消费者数
    private int numProducers; // broker上的生产者数

    // All bundles belonging to this broker.
    private Set<String> bundles; // 负责的所有bundle

    // The bundles gained since the last invocation of update.
    private Set<String> lastBundleGains; // 和上一次数据更新比较,Broker获取了哪些bundle

    // The bundles lost since the last invocation of update.
    private Set<String> lastBundleLosses; // 和上一次数据更新比较,Broker失去了哪些bundle

    // The version string that this broker is running, obtained from the Maven build artifact in the POM  broker的版本号
    private String brokerVersionString;
    // This place-holder requires to identify correct LoadManagerReport type while deserializing
    @SuppressWarnings("checkstyle:ConstantName")
    public static final String loadReportType = LocalBrokerData.class.getSimpleName();

    // the external protocol data advertised by protocol handlers.
    private Map<String, String> protocols;   // 外部协议
    // 广播给客户端的 advertised listener 列表
    private Map<String, AdvertisedListener> advertisedListeners;

此外还有一个 LoadData 对象,聚合了整个集群的负载数据:

/**
     * Map from broker names to their available data.
     */
    private final Map<String, BrokerData> brokerData;

    /**
     * Map from bundle names to their time-sensitive aggregated data.
     */
    private final Map<String, BundleData> bundleData;

    /**
     * Map from recently unloaded bundles to the timestamp of when they were last loaded.
     */
    private final Map<String, Long> recentlyUnloadedBundles;

它由 updateAll 方法负责更新,下面做详细分析。

启动start

@Override
    public void start() throws PulsarServerException {
        try {
            // At this point, the ports will be updated with the real port number that the server was assigned
            Map<String, String> protocolData = pulsar.getProtocolDataToAdvertise();

            lastData = new LocalBrokerData(pulsar.getSafeWebServiceAddress(), pulsar.getWebServiceAddressTls(),
                    pulsar.getBrokerServiceUrl(), pulsar.getBrokerServiceUrlTls(), pulsar.getAdvertisedListeners());
            lastData.setProtocols(protocolData);
            // configure broker-topic mode
            lastData.setPersistentTopicsEnabled(pulsar.getConfiguration().isEnablePersistentTopics());
            lastData.setNonPersistentTopicsEnabled(pulsar.getConfiguration().isEnableNonPersistentTopics());

            localData = new LocalBrokerData(pulsar.getSafeWebServiceAddress(), pulsar.getWebServiceAddressTls(),
                    pulsar.getBrokerServiceUrl(), pulsar.getBrokerServiceUrlTls(), pulsar.getAdvertisedListeners());
            localData.setProtocols(protocolData);
            localData.setBrokerVersionString(pulsar.getBrokerVersion());
            // configure broker-topic mode
            localData.setPersistentTopicsEnabled(pulsar.getConfiguration().isEnablePersistentTopics());
            localData.setNonPersistentTopicsEnabled(pulsar.getConfiguration().isEnableNonPersistentTopics());

            String lookupServiceAddress = pulsar.getAdvertisedAddress() + ":"
                    + (conf.getWebServicePort().isPresent() ? conf.getWebServicePort().get()
                            : conf.getWebServicePortTls().get());
            brokerZnodePath = LoadManager.LOADBALANCE_BROKERS_ROOT + "/" + lookupServiceAddress;
            final String timeAverageZPath = TIME_AVERAGE_BROKER_ZPATH + "/" + lookupServiceAddress;
            
						// 更新当前broker负载到localData,更新系统指标和bundle状态指标,更新LoadBalance的指标
						updateLocalBrokerData();

            brokerDataLock = brokersData.acquireLock(brokerZnodePath, localData).join();

            timeAverageBrokerDataCache.readModifyUpdateOrCreate(timeAverageZPath,
                    __ -> new TimeAverageBrokerData()).join();
						// 更新
            updateAll();
            lastBundleDataUpdate = System.currentTimeMillis();
        } catch (Exception e) {
            log.error("Unable to acquire lock for broker: [{}]", brokerZnodePath, e);
            throw new PulsarServerException(e);
        }
    }

updateAll 详细分析

调用时机有:

  • 每个broker的LoadManager启动时会调用
  • 当 LOADBALANCE_BROKERS_ROOT 这个 zk 节点的子节点有变更时会调用(由 watch 监听触发,见 handleDataNotification)
  • LoadReportUpdaterTask 定时上报负载时会调用(见其 run 方法)
// Update both the broker data and the bundle data.
    public void updateAll() {
        if (log.isDebugEnabled()) {
            log.debug("Updating broker and bundle data for loadreport");
        }
        updateAllBrokerData(); // 更新loadData
        updateBundleData(); // 这里也是更新loadData, 更新bundleData
        // broker has latest load-report: check if any bundle requires split 看下是否需要拆分bundle
        // 只有 leader 且开启了自动 split 时才会执行
				checkNamespaceBundleSplit();
    }

updateAllBrokerData

所有 broker 都要通过 updateLocalBrokerData 把自身负载上报给元数据存储(zk),这样 leader 才能读取到数据,进而更新 loadData 中的 broker 数据映射。


private void updateAllBrokerData() {
        final Set<String> activeBrokers = getAvailableBrokers();
        final Map<String, BrokerData> brokerDataMap = loadData.getBrokerData();
					// 遍历存活的broker
        for (String broker : activeBrokers) {
            try {
                String key = String.format("%s/%s", LoadManager.LOADBALANCE_BROKERS_ROOT, broker);
                Optional<LocalBrokerData> localData = brokersData.readLock(key).get();
                if (!localData.isPresent()) {
                    brokerDataMap.remove(broker); // 不存在就移除了,可能是结点下线了,或者其他问题
                    log.info("[{}] Broker load report is not present", broker);
                    continue;
                }

                if (brokerDataMap.containsKey(broker)) {
                    // Replace previous local broker data.
                    brokerDataMap.get(broker).setLocalData(localData.get());
                } else {
                    // Initialize BrokerData object for previously unseen
                    // brokers.
                    brokerDataMap.put(broker, new BrokerData(localData.get()));
                }
            } catch (Exception e) {
                log.warn("Error reading broker data from cache for broker - [{}], [{}]", broker, e.getMessage());
            }
        }
        // Remove obsolete brokers.
        for (final String broker : brokerDataMap.keySet()) {
            if (!activeBrokers.contains(broker)) {
                brokerDataMap.remove(broker);
            }
        }
    }

updateLocalBrokerData 里的 bundle 数据来源于 PulsarStats.updateStats,这个统计在 broker 启动时会更新,之后也会定时更新。

updateBundleData

同样的道理,这里是统计bundle的负载,通过Broker上报的数据更新loadData中的Bundle的数据

private void updateBundleData() {
        final Map<String, BundleData> bundleData = loadData.getBundleData();
        // Iterate over the broker data.
        for (Map.Entry<String, BrokerData> brokerEntry : loadData.getBrokerData().entrySet()) {
            final String broker = brokerEntry.getKey();
            final BrokerData brokerData = brokerEntry.getValue();
            final Map<String, NamespaceBundleStats> statsMap = brokerData.getLocalData().getLastStats();

            // Iterate over the last bundle stats available to the current
            // broker to update the bundle data.
            for (Map.Entry<String, NamespaceBundleStats> entry : statsMap.entrySet()) {
                final String bundle = entry.getKey();
                final NamespaceBundleStats stats = entry.getValue();
                if (bundleData.containsKey(bundle)) { // 如果已经识别过这个bundle了就更新
                    // If we recognize the bundle, add these stats as a new sample.
                    bundleData.get(bundle).update(stats);
                } else { //如果没有识别过这个bundle就新增
                    // Otherwise, attempt to find the bundle data on metadata store.
                    // If it cannot be found, use the latest stats as the first sample.
                    BundleData currentBundleData = getBundleDataOrDefault(bundle);
                    currentBundleData.update(stats);
                    bundleData.put(bundle, currentBundleData);
                }
            }
						// 移除预分配bundle中的,已经加载的bundle
            // Remove all loaded bundles from the preallocated maps.
            final Map<String, BundleData> preallocatedBundleData = brokerData.getPreallocatedBundleData();
            synchronized (preallocatedBundleData) {
                for (String preallocatedBundleName : brokerData.getPreallocatedBundleData().keySet()) {
                    if (brokerData.getLocalData().getBundles().contains(preallocatedBundleName)) {
                        final Iterator<Map.Entry<String, BundleData>> preallocatedIterator =
                                preallocatedBundleData.entrySet()
                                        .iterator();
                        while (preallocatedIterator.hasNext()) {
                            final String bundle = preallocatedIterator.next().getKey();
														// bundleData中已经有bundle的信息,就把这个bundle从预分配中移除
                            if (bundleData.containsKey(bundle)) { 
                                preallocatedIterator.remove(); 
                                preallocatedBundleToBroker.remove(bundle);
                            }
                        }
                    }
									
										// todo 这里没想明白 —— 对应下面的英文注释:处理被预分配 bundle 的 broker 宕机后又重新上线的情况,把残留的预分配映射清理掉
                    // This is needed too in case a broker which was assigned a bundle dies and comes back up.
                    preallocatedBundleToBroker.remove(preallocatedBundleName);
                }
            }
							// 使用最新的数据去更新
            // Using the newest data, update the aggregated time-average data for the current broker.
            brokerData.getTimeAverageData().reset(statsMap.keySet(), bundleData, defaultStats);
            final ConcurrentOpenHashMap<String, ConcurrentOpenHashSet<String>> namespaceToBundleRange =
                    brokerToNamespaceToBundleRange
                            .computeIfAbsent(broker, k -> new ConcurrentOpenHashMap<>());
            synchronized (namespaceToBundleRange) {
                namespaceToBundleRange.clear();
                LoadManagerShared.fillNamespaceToBundlesMap(statsMap.keySet(), namespaceToBundleRange);
                LoadManagerShared.fillNamespaceToBundlesMap(preallocatedBundleData.keySet(), namespaceToBundleRange);
            }
        }
    }

checkNamespaceBundleSplit

决定是否需要拆分

synchronized (bundleSplitStrategy) {
            final Set<String> bundlesToBeSplit = bundleSplitStrategy.findBundlesToSplit(loadData, pulsar);
            NamespaceBundleFactory namespaceBundleFactory = pulsar.getNamespaceService().getNamespaceBundleFactory();
            // 遍历每个bundle看能否拆分
						for (String bundleName : bundlesToBeSplit) {
                try {
                    final String namespaceName = LoadManagerShared.getNamespaceNameFromBundleName(bundleName);
                    final String bundleRange = LoadManagerShared.getBundleRangeFromBundleName(bundleName);
                    if (!namespaceBundleFactory
                            .canSplitBundle(namespaceBundleFactory.getBundle(namespaceName, bundleRange))) {
                        continue;
                    }

                    // 清理被拆分的bundle缓存 ,确保不会再被选中
                    loadData.getBundleData().remove(bundleName);
                    localData.getLastStats().remove(bundleName);
                   
                    this.pulsar.getNamespaceService().getNamespaceBundleFactory()
                            .invalidateBundleCache(NamespaceName.get(namespaceName));
                    deleteBundleDataFromMetadataStore(bundleName);
										// 调用split, 其实是向broker发送一个split请求。
                    log.info("Load-manager splitting bundle {} and unloading {}", bundleName, unloadSplitBundles);
                    pulsar.getAdminClient().namespaces().splitNamespaceBundle(namespaceName, bundleRange,
                        unloadSplitBundles, null);

                    log.info("Successfully split namespace bundle {}", bundleName);
                } catch (Exception e) {
                    log.error("Failed to split namespace bundle {}", bundleName, e);
                }
            }

            updateBundleSplitMetrics(bundlesToBeSplit);
        }
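
leader 触发 split 走的就是 namespaces 的 admin 接口,效果等同于手动执行类似下面的命令(bundle 范围仅为示例,--unload 表示分裂后顺带卸载,以实际 CLI 帮助为准):

bin/pulsar-admin namespaces split-bundle apache/pulsar --bundle 0x00000000_0x40000000 --unload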

findBundlesToSplit:

public Set<String> findBundlesToSplit(final LoadData loadData, final PulsarService pulsar) {
        bundleCache.clear();
        final ServiceConfiguration conf = pulsar.getConfiguration();
				// 获取配置中的最大bundle数,最大topic数,最大会话数,最大消息速率,最大带宽
        int maxBundleCount = conf.getLoadBalancerNamespaceMaximumBundles();
        long maxBundleTopics = conf.getLoadBalancerNamespaceBundleMaxTopics();
        long maxBundleSessions = conf.getLoadBalancerNamespaceBundleMaxSessions();
        long maxBundleMsgRate = conf.getLoadBalancerNamespaceBundleMaxMsgRate();
        long maxBundleBandwidth = conf.getLoadBalancerNamespaceBundleMaxBandwidthMbytes() * LoadManagerShared.MIBI;
        loadData.getBrokerData().forEach((broker, brokerData) -> {
            LocalBrokerData localData = brokerData.getLocalData();
            for (final Map.Entry<String, NamespaceBundleStats> entry : localData.getLastStats().entrySet()) {
                final String bundle = entry.getKey();
                final NamespaceBundleStats stats = entry.getValue();
                if (stats.topics < 2) { // bundle中负责的topic小于2,没有分裂的必要。
                    log.info("The count of topics on the bundle {} is less than 2,skip split!", bundle);
                    continue;
                }
                double totalMessageRate = 0;
                double totalMessageThroughput = 0;
                // Attempt to consider long-term message data, otherwise effectively ignore.
                if (loadData.getBundleData().containsKey(bundle)) {
                    final TimeAverageMessageData longTermData = loadData.getBundleData().get(bundle).getLongTermData();
                    totalMessageRate = longTermData.totalMsgRate();
                    totalMessageThroughput = longTermData.totalMsgThroughput();
                }
								//  只要超过上述任意一个阈值就可以分裂了。
                if (stats.topics > maxBundleTopics || stats.consumerCount + stats.producerCount > maxBundleSessions
                        || totalMessageRate > maxBundleMsgRate || totalMessageThroughput > maxBundleBandwidth) {
                    final String namespace = LoadManagerShared.getNamespaceNameFromBundleName(bundle);
                    try {
                        final int bundleCount = pulsar.getNamespaceService()
                                .getBundleCount(NamespaceName.get(namespace));
												// namespace 当前 bundle 数要小于配置的最大 bundle 数才允许分裂
                        if (bundleCount < maxBundleCount) {
                            bundleCache.add(bundle);
                        } else {
                            log.warn(
                                    "Could not split namespace bundle {} because namespace {} has too many bundles: {}",
                                    bundle, namespace, bundleCount);
                        }
                    } catch (Exception e) {
                        log.warn("Error while getting bundle count for namespace {}", namespace, e);
                    }
                }
            }
        });
        return bundleCache;
    }
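
上面用到的几个分裂阈值都来自 broker 配置,常见默认值大致如下(以所用版本 broker.conf 的实际内容为准,数值仅供参考):

  • loadBalancerNamespaceMaximumBundles:单个 namespace 允许的最大 bundle 数(默认 128)
  • loadBalancerNamespaceBundleMaxTopics:单个 bundle 的最大 topic 数(默认 1000)
  • loadBalancerNamespaceBundleMaxSessions:单个 bundle 的最大 producer+consumer 数(默认 1000)
  • loadBalancerNamespaceBundleMaxMsgRate:单个 bundle 的最大消息速率(默认 30000 条/秒)
  • loadBalancerNamespaceBundleMaxBandwidthMbytes:单个 bundle 的最大带宽(默认 100 MB/s)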

负载上报

LoadReportUpdaterTask

if (config.isLoadBalancerEnabled()) {
            LOG.info("Starting load balancer");
            if (this.loadReportTask == null) {
                long loadReportMinInterval = LoadManagerShared.LOAD_REPORT_UPDATE_MINIMUM_INTERVAL;
                this.loadReportTask = this.loadManagerExecutor.scheduleAtFixedRate(
                        new LoadReportUpdaterTask(loadManager), loadReportMinInterval, loadReportMinInterval,
                        TimeUnit.MILLISECONDS);
            }
        }

LoadResourceQuotaUpdaterTask

这个任务只有 leader 才会启动,可以理解为:对各个 broker 的负载以及 broker 上的 bundle 信息做统计和窗口计算,得到长期和短期的平均负载。

触发 LoadManager#writeResourceQuotasToZooKeeper,最终做这两件事:

  • 计算Bundle负载写入:/loadbalance/bundle-data/xxx
  • 计算Broker并写入负载到zk:/loadbalance/broker-time-average/xxx

配置:

  • loadBalancerResourceQuotaUpdateIntervalMinutes 更新Bundle负载的时间间隔

负载策略 loadSheddingTask

负载(卸载)策略用于判断 Broker 上的哪些 Bundle 需要卸载、交给其他 broker 接管,从而让集群更加均衡,由 loadSheddingTask 定时触发。

调度开关:loadBalancerEnabled 默认开启

调度间隔:loadBalancerSheddingIntervalMinutes

卸载冷却期:loadBalancerSheddingGracePeriodMinutes,同一个 bundle 在该时间内不会被再次卸载,避免 bundle 在多个 broker 之间来回跳

ModularLoadManagerImpl 内置了以下三个负载策略:DeviationShedder、OverloadShedder 和 ThresholdShedder。

DeviationShedder

一个抽象类,用于让基于标准差的 LoadSheddingStrategy 更容易实现:假设存在某个可以估计 broker 负载的指标,该策略计算这个指标在集群内的标准差,并卸载偏差高于某个阈值的 Broker 上的负载。不能直接使用。

源码中也没有继承该类做进一步实现。

OverloadShedder

默认的负载策略,当某个Broker负载超过了loadBalancerBrokerOverloadedThresholdPercentage(默认85%)时,会尝试在Broker上卸载一个bundle。

@Override
    public Multimap<String, String> findBundlesForUnloading(final LoadData loadData, final ServiceConfiguration conf) {
        selectedBundlesCache.clear();
        final double overloadThreshold = conf.getLoadBalancerBrokerOverloadedThresholdPercentage() / 100.0;
        final Map<String, Long> recentlyUnloadedBundles = loadData.getRecentlyUnloadedBundles();

        // Check every broker and select
        loadData.getBrokerData().forEach((broker, brokerData) -> {

            final LocalBrokerData localData = brokerData.getLocalData();
            final double currentUsage = localData.getMaxResourceUsage();
            if (currentUsage < overloadThreshold) {
                if (log.isDebugEnabled()) {
                    log.debug("[{}] Broker is not overloaded, ignoring at this point ({})", broker,
                            localData.printResourceUsage());
                }
                return;
            }

            // We want to offload enough traffic such that this broker will go below the overload threshold
            // Also, add a small margin so that this broker won't be very close to the threshold edge.
            double percentOfTrafficToOffload = currentUsage - overloadThreshold + ADDITIONAL_THRESHOLD_PERCENT_MARGIN;
            double brokerCurrentThroughput = localData.getMsgThroughputIn() + localData.getMsgThroughputOut();

            double minimumThroughputToOffload = brokerCurrentThroughput * percentOfTrafficToOffload;

            log.info(
                    "Attempting to shed load on {}, which has resource usage {}% above threshold {}%"
                            + " -- Offloading at least {} MByte/s of traffic ({})",
                    broker, 100 * currentUsage, 100 * overloadThreshold, minimumThroughputToOffload / 1024 / 1024,
                    localData.printResourceUsage());

            MutableDouble trafficMarkedToOffload = new MutableDouble(0);
            MutableBoolean atLeastOneBundleSelected = new MutableBoolean(false);

            if (localData.getBundles().size() > 1) {  // 拥有的bundle要大于1
                // Sort bundles by throughput, then pick the biggest N which combined
                // make up for at least the minimum throughput to offload

                loadData.getBundleDataForLoadShedding().entrySet().stream()
                    .filter(e -> localData.getBundles().contains(e.getKey()))
                    .map((e) -> {
                        // Map to throughput value
                        // Consider short-term byte rate to address system resource burden
                        String bundle = e.getKey();
                        BundleData bundleData = e.getValue();
                        TimeAverageMessageData shortTermData = bundleData.getShortTermData();
                        double throughput = shortTermData.getMsgThroughputIn() + shortTermData
                                .getMsgThroughputOut();
                    return Pair.of(bundle, throughput);
                }).filter(e -> {
                    // Only consider bundles that were not already unloaded recently  // 最近没有被卸载过
                    return !recentlyUnloadedBundles.containsKey(e.getLeft());
                }).filter(e ->
                        localData.getBundles().contains(e.getLeft())
                ).sorted((e1, e2) -> {
                    // Sort by throughput in reverse order    // 卸载throughput最大的几个
                    return Double.compare(e2.getRight(), e1.getRight());
                }).forEach(e -> {
                    if (trafficMarkedToOffload.doubleValue() < minimumThroughputToOffload
                            || atLeastOneBundleSelected.isFalse()) {
                       selectedBundlesCache.put(broker, e.getLeft());
                       trafficMarkedToOffload.add(e.getRight());
                       atLeastOneBundleSelected.setTrue();
                   }
                });
            } else if (localData.getBundles().size() == 1) {
                log.warn(
                        "HIGH USAGE WARNING : Sole namespace bundle {} is overloading broker {}. "
                                + "No Load Shedding will be done on this broker",
                        localData.getBundles().iterator().next(), broker);
            } else {
                log.warn("Broker {} is overloaded despite having no bundles", broker);
            }
        });
        return selectedBundlesCache;
    }
}
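
举个例子(ADDITIONAL_THRESHOLD_PERCENT_MARGIN 是源码中一个小的固定余量,这里假设为 0.05,数值仅作演示):某 broker 的最大资源使用率为 92%,超过默认阈值 85%,则 percentOfTrafficToOffload = 0.92 - 0.85 + 0.05 = 0.12;若该 broker 当前出入吞吐合计 100 MB/s,则本轮会按吞吐从大到小选择 bundle,直到累计标记卸载约 12 MB/s 为止。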

优点:

  • 尽量少地触发卸载:只要 broker 没有超过阈值,哪怕新加 broker 也不会触发卸载;卸载次数少,系统更加稳定

缺点:

  • 当集群的 broker 负载都比较高的时候,bundle 会在各个 broker 之间来回转移。

这个时候就一定要注意总体压力不要过高,监控要到位。

ThresholdShedder

根据单个 Broker 的负载相对于集群平均负载的偏离程度来判断是否需要卸载 Bundle,这个策略的目的是让集群每个节点的负载比较平衡。

首先从 LoadData 中计算集群所有 Broker 的平均负载:

private double getBrokerAvgUsage(final LoadData loadData, final double historyPercentage,
                                     final ServiceConfiguration conf) {
        double totalUsage = 0.0;
        int totalBrokers = 0;

        for (Map.Entry<String, BrokerData> entry : loadData.getBrokerData().entrySet()) {
            LocalBrokerData localBrokerData = entry.getValue().getLocalData();
            String broker = entry.getKey();
	          // 单个 broker 的负载 = max(cpu*cpu权重, 内存*内存权重, 堆外内存*权重, 入带宽*权重, 出带宽*权重),再与历史值平滑后累加
            totalUsage += updateAvgResourceUsage(broker, localBrokerData, historyPercentage, conf);
            totalBrokers++;
        }

        return totalBrokers > 0 ? totalUsage / totalBrokers : 0;  // 平均负载就是负载之和 / Broker数量
    }

	
    private double updateAvgResourceUsage(String broker, LocalBrokerData localBrokerData,
                                          final double historyPercentage, final ServiceConfiguration conf) {
        Double historyUsage =
                brokerAvgResourceUsage.get(broker);
        double resourceUsage = localBrokerData.getMaxResourceUsageWithWeight(
                conf.getLoadBalancerCPUResourceWeight(),
                conf.getLoadBalancerMemoryResourceWeight(), conf.getLoadBalancerDirectMemoryResourceWeight(),
                conf.getLoadBalancerBandwithInResourceWeight(),
                conf.getLoadBalancerBandwithOutResourceWeight());
        historyUsage = historyUsage == null
                ? resourceUsage : historyUsage * historyPercentage + (1 - historyPercentage) * resourceUsage;
				// 会将当前broker的负载也记录,方便下面计算
        brokerAvgResourceUsage.put(broker, historyUsage);
        return historyUsage;
    }
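
这里的 historyUsage 实际上是一个指数加权移动平均。下面是一个独立的小演示(与 Pulsar 源码无关,仅为示意;historyPercentage 假设取 0.9,对应 loadBalancerHistoryResourcePercentage 的默认值,请以实际配置为准):

// 一个最小化的示例:模拟 updateAvgResourceUsage 中历史平滑的计算过程
public class EmaDemo {
    public static void main(String[] args) {
        double historyPercentage = 0.9;          // 历史值所占比例,这里假设取默认值 0.9
        Double historyUsage = null;              // 第一次没有历史值
        double[] samples = {0.50, 0.80, 0.30};   // 连续三次采集到的加权最大资源使用率

        for (double resourceUsage : samples) {
            historyUsage = (historyUsage == null)
                    ? resourceUsage
                    : historyUsage * historyPercentage + (1 - historyPercentage) * resourceUsage;
            System.out.printf("sample=%.2f -> smoothed=%.4f%n", resourceUsage, historyUsage);
        }
        // 输出依次约为 0.5000、0.5300、0.5070:新样本只占 10% 权重,负载抖动被显著平滑
    }
}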

权重配置:

  • loadBalancerBandwithInResourceWeight
  • loadBalancerBandwithOutResourceWeight
  • loadBalancerCPUResourceWeight
  • loadBalancerMemoryResourceWeight
  • loadBalancerDirectMemoryResourceWeight

findBundlesForUnloading细节:

@Override
    public Multimap<String, String> findBundlesForUnloading(final LoadData loadData, final ServiceConfiguration conf) {
        selectedBundlesCache.clear();
				// 阈值默认10%
        final double threshold = conf.getLoadBalancerBrokerThresholdShedderPercentage() / 100.0; 
					
        final Map<String, Long> recentlyUnloadedBundles = loadData.getRecentlyUnloadedBundles();
	      // loadBalancerBundleUnloadMinThroughputThreshold 默认 10,即 10MB/s
        final double minThroughputThreshold = conf.getLoadBalancerBundleUnloadMinThroughputThreshold() * MB;
				// 获取集群的平均负载
        final double avgUsage = getBrokerAvgUsage(loadData, conf.getLoadBalancerHistoryResourcePercentage(), conf);

        if (avgUsage == 0) {
            log.warn("average max resource usage is 0");
            return selectedBundlesCache;
        }
				// 遍历每个 Broker,将其负载与集群平均负载比较
        loadData.getBrokerData().forEach((broker, brokerData) -> {
            final LocalBrokerData localData = brokerData.getLocalData();
						
            final double currentUsage = brokerAvgResourceUsage.getOrDefault(broker, 0.0);
            // 当前负载没有大于平均 + 阈值
            if (currentUsage < avgUsage + threshold) {
                if (log.isDebugEnabled()) {
                    log.debug("[{}] broker is not overloaded, ignoring at this point", broker);
                }
                return;
            }
					
            double percentOfTrafficToOffload =
                    currentUsage - avgUsage - threshold + ADDITIONAL_THRESHOLD_PERCENT_MARGIN;
            double brokerCurrentThroughput = localData.getMsgThroughputIn() + localData.getMsgThroughputOut();
            // 计算期望至少卸载掉的吞吐量
						double minimumThroughputToOffload = brokerCurrentThroughput * percentOfTrafficToOffload;

            if (minimumThroughputToOffload < minThroughputThreshold) { // 要卸载的吞吐量低于最小阈值(默认 10MB/s)时不卸载
                if (log.isDebugEnabled()) {
                    log.info("[{}] broker is planning to shed throughput {} MByte/s less than "
                                    + "minimumThroughputThreshold {} MByte/s, skipping bundle unload.",
                            broker, minimumThroughputToOffload / MB, minThroughputThreshold / MB);
                }
                return;
            }

            log.info(
                    "Attempting to shed load on {}, which has max resource usage above avgUsage  and threshold {}%"
                            + " > {}% + {}% -- Offloading at least {} MByte/s of traffic, left throughput {} MByte/s",
                    broker, currentUsage, avgUsage, threshold, minimumThroughputToOffload / MB,
                    (brokerCurrentThroughput - minimumThroughputToOffload) / MB);

            MutableDouble trafficMarkedToOffload = new MutableDouble(0);
            MutableBoolean atLeastOneBundleSelected = new MutableBoolean(false);

            if (localData.getBundles().size() > 1) {
                loadData.getBundleDataForLoadShedding().entrySet().stream()
                    .map((e) -> {
                        String bundle = e.getKey();
                        BundleData bundleData = e.getValue();
                        TimeAverageMessageData shortTermData = bundleData.getShortTermData(); // 使用短期吞吐数据来排序
                        double throughput = shortTermData.getMsgThroughputIn() + shortTermData.getMsgThroughputOut();
                        return Pair.of(bundle, throughput);
                }).filter(e ->
                        !recentlyUnloadedBundles.containsKey(e.getLeft())  // 最近没有卸载过才要卸载
                ).filter(e ->
                        localData.getBundles().contains(e.getLeft())
                ).sorted((e1, e2) ->
                        Double.compare(e2.getRight(), e1.getRight())
                ).forEach(e -> {
										// 至少要选择一个
                    if (trafficMarkedToOffload.doubleValue() < minimumThroughputToOffload
                            || atLeastOneBundleSelected.isFalse()) {
                        selectedBundlesCache.put(broker, e.getLeft());
                        trafficMarkedToOffload.add(e.getRight());
                        atLeastOneBundleSelected.setTrue();
                    }
                });
            } else if (localData.getBundles().size() == 1) {
                log.warn(
                        "HIGH USAGE WARNING : Sole namespace bundle {} is overloading broker {}. "
                                + "No Load Shedding will be done on this broker",
                        localData.getBundles().iterator().next(), broker);
            } else {
                log.warn("Broker {} is overloaded despite having no bundles", broker);
            }
        });

        return selectedBundlesCache;
    }
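
同样举个例子(阈值取默认的 10%,ADDITIONAL_THRESHOLD_PERCENT_MARGIN 假设为 0.05,数值仅作演示):集群平均负载 avgUsage = 0.50,某 broker 平滑后的负载 currentUsage = 0.70,满足 0.70 > 0.50 + 0.10,触发卸载;percentOfTrafficToOffload = 0.70 - 0.50 - 0.10 + 0.05 = 0.15,即期望至少卸载该 broker 当前吞吐的 15%(该吞吐量还必须超过 loadBalancerBundleUnloadMinThroughputThreshold,否则跳过)。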

优点:

  • 更加均衡一些

缺点:

  • 对于某个主题分区负载特别大的情况(单个主题分区的负载就超过集群平均值),承载它的 Bundle 可能会在多个 Broker 之间来回跳,这种情况就要加大分区数,或者换用其他负载策略

具体负载均衡过程

回到doLoadShedding 方法

		/**
     * As the leader broker, select bundles for the namespace service to unload so that they may be reassigned to new
     * brokers.
     */
    @Override
    public synchronized void doLoadShedding() {
        if (!LoadManagerShared.isLoadSheddingEnabled(pulsar)) {
            return;
        }
				// 只有一个broker, 直接返回
        if (getAvailableBrokers().size() <= 1) {
            log.info("Only 1 broker available: no load shedding will be performed");
            return;
        }
				// 如果是卸载过的,在loadBalancerSheddingGracePeriodMinutes时间内就不要再卸载了。
        // Remove bundles who have been unloaded for longer than the grace period from the recently unloaded map.
        final long timeout = System.currentTimeMillis()
                - TimeUnit.MINUTES.toMillis(conf.getLoadBalancerSheddingGracePeriodMinutes());
        final Map<String, Long> recentlyUnloadedBundles = loadData.getRecentlyUnloadedBundles();
  
				recentlyUnloadedBundles.keySet().removeIf(e -> recentlyUnloadedBundles.get(e) < timeout);

        for (LoadSheddingStrategy strategy : loadSheddingPipeline) {
            final Multimap<String, String> bundlesToUnload = strategy.findBundlesForUnloading(loadData, conf); //  通过策略类获取到要被卸载的bundle

            bundlesToUnload.asMap().forEach((broker, bundles) -> {
                bundles.forEach(bundle -> {
                    final String namespaceName = LoadManagerShared.getNamespaceNameFromBundleName(bundle); // 根据bundle获取对应的namespace名称
                    final String bundleRange = LoadManagerShared.getBundleRangeFromBundleName(bundle); // 获取bundle范围
                    // 反亲和性不符合就不进行卸载,和namespace隔离相关
										if (!shouldAntiAffinityNamespaceUnload(namespaceName, bundleRange, broker)) {
                        return;
                    }

                    log.info("[{}] Unloading bundle: {} from broker {}",
                            strategy.getClass().getSimpleName(), bundle, broker);
                    try {
												// 调用 admin REST API 执行卸载。
                        pulsar.getAdminClient().namespaces().unloadNamespaceBundle(namespaceName, bundleRange);  // 卸载Bundle
                        loadData.getRecentlyUnloadedBundles().put(bundle, System.currentTimeMillis());
                    } catch (PulsarServerException | PulsarAdminException e) {
                        log.warn("Error when trying to perform load shedding on {} for broker {}", bundle, broker, e);
                    }
                });
            });

            updateBundleUnloadingMetrics(bundlesToUnload);
        }
    }
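
leader 最终也是通过 admin 接口完成卸载,等同于手动执行类似下面的命令(bundle 范围仅为示例):

bin/pulsar-admin namespaces unload apache/pulsar --bundle 0x00000000_0x40000000

被卸载的 bundle 会在下一次 lookup 时由负载管理器重新选择 broker 接管。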


注: 引用《深入解析Apache Pulsar》