本文已参与「开源摘星计划」,欢迎正在阅读的你加入。活动链接:github.com/weopenproje…
带着以下疑问来看pulsar源码v2.9.2:
- bundle和topic的关系是怎么建立的?
- bundle和broker的关系又是怎么建立的?
- 卸载bundle和卸载topic一样么? split bundle和unload bundle又有什么区别?
- 如何查看broker的负载? 有什么值得注意的?
- pulsar的负载是怎么管理的?如何做负载均衡?
好了,本文开始!
此文中的topic不是指整个主题,而是指主题的某个分区。
每个Topic都要和Broker绑定,那是怎么绑定的呢?Topic 和Broker依靠Bundle来绑定,Bundle是一致性哈希环中的虚拟结点。Topic通过主题名计算hash值,从而找到对应的Bundle。
我们将Namespace划分为4个区域(defaultNumberOfNamespaceBundles默认就是4),范围从0x00000000到0xffffffff,之后将Namespace下的Topic按照名字做hash运算,哈希值落在哪个区间,Topic就归属哪个Bundle。
查看namespace的Bundle信息,在zk上:
/admin/local-policies/apache/pulsar
{
"bundles" : {
"boundaries" : [ "0x00000000", "0x40000000", "0x80000000", "0xc0000000", "0xffffffff" ],
"numBundles" : 4
}
}
值得讨论和研究的是
- bundle和topic关系是如何建立的
- bundle和broker关系是如何建立的
Lookup:查找Topic归属Broker的方法
非分区主题lookup:
pulsar.apache.org/docs/next/a…
分区主题lookup:
pulsar.apache.org/docs/next/a…
bin/pulsar-admin topics partitioned-lookup persistent://apache/pulsar/test-topic
Lookup解析
以上的请求方式会访问restful接口,每个分区都会有一次请求,最终落到:
GET{u=//localhost:8081/lookup/v2/topic/persistent/apache/pulsar/test-topic-partition-0,HTTP/1.1,h=3,cl=-1}
GET{u=//localhost:8081/lookup/v2/topic/persistent/apache/pulsar/test-topic-partition-1,HTTP/1.1,h=3,cl=-1}
@Path("/v2/topic")
public class TopicLookup extends TopicLookupBase {
static final String LISTENERNAME_HEADER = "X-Pulsar-ListenerName";
@GET
@Path("{topic-domain}/{tenant}/{namespace}/{topic}"
@HeaderParam(LISTENERNAME_HEADER) String listenerNameHeader) {
TopicName topicName = getTopicName(topicDomain, tenant, namespace, encodedTopic);
if (StringUtils.isEmpty(listenerName) && StringUtils.isNotEmpty(listenerNameHeader)) {
listenerName = listenerNameHeader;
}
// 查找Lookup
internalLookupTopicAsync(topicName, authoritative, asyncResponse, listenerName);
}
internalLookupTopicAsync:
protected void internalLookupTopicAsync(TopicName topicName, boolean authoritative,
AsyncResponse asyncResponse, String listenerName) {
// 获取信号量
if (!pulsar().getBrokerService().getLookupRequestSemaphore().tryAcquire()) {
log.warn("No broker was found available for topic {}", topicName);
asyncResponse.resume(new WebApplicationException(Response.Status.SERVICE_UNAVAILABLE));
return;
}
try {
// 验证集群归属,之前pulsar的Topic名称包含集群信息, 如果有集群信息的话,需要判断是当前这个集群是否是topic对应集群。
// 如果不是就要重定向一下
validateClusterOwnership(topicName.getCluster());
// 验证权限
validateAdminAndClientPermission(topicName);
// pulsar跨地域复制,判断这个namespace是否是复制过来的, 如果是就重定向一下到原始集群
validateGlobalNamespaceOwnership(topicName.getNamespaceObject());
} catch (WebApplicationException we) {
// Validation checks failed
log.error("Validation check failed: {}", we.getMessage());
completeLookupResponseExceptionally(asyncResponse, we);
return;
} catch (Throwable t) {
// Validation checks failed with unknown error
log.error("Validation check failed: {}", t.getMessage(), t);
completeLookupResponseExceptionally(asyncResponse, new RestException(t));
return;
}
CompletableFuture<Optional<LookupResult>> lookupFuture = pulsar().getNamespaceService()
.getBrokerServiceUrlAsync(topicName,
LookupOptions.builder().advertisedListenerName(listenerName)
.authoritative(authoritative).loadTopicsInBundle(false).build());
lookupFuture.thenAccept(optionalResult -> {
if (optionalResult == null || !optionalResult.isPresent()) {
log.warn("No broker was found available for topic {}", topicName);
completeLookupResponseExceptionally(asyncResponse,
new WebApplicationException(Response.Status.SERVICE_UNAVAILABLE));
return;
}
LookupResult result = optionalResult.get();
// We have found either a broker that owns the topic, or a broker to which we should redirect the client to
if (result.isRedirect()) {
boolean newAuthoritative = result.isAuthoritativeRedirect();
URI redirect;
try {
String redirectUrl = isRequestHttps() ? result.getLookupData().getHttpUrlTls()
: result.getLookupData().getHttpUrl();
checkNotNull(redirectUrl, "Redirected cluster's service url is not configured");
String lookupPath = topicName.isV2() ? LOOKUP_PATH_V2 : LOOKUP_PATH_V1;
String path = String.format("%s%s%s?authoritative=%s",
redirectUrl, lookupPath, topicName.getLookupName(), newAuthoritative);
path = listenerName == null ? path : path + "&listenerName=" + listenerName;
redirect = new URI(path);
} catch (URISyntaxException | NullPointerException e) {
log.error("Error in preparing redirect url for {}: {}", topicName, e.getMessage(), e);
completeLookupResponseExceptionally(asyncResponse, e);
return;
}
if (log.isDebugEnabled()) {
log.debug("Redirect lookup for topic {} to {}", topicName, redirect);
}
completeLookupResponseExceptionally(asyncResponse,
new WebApplicationException(Response.temporaryRedirect(redirect).build()));
} else {
// Found broker owning the topic
if (log.isDebugEnabled()) {
log.debug("Lookup succeeded for topic {} -- broker: {}", topicName, result.getLookupData());
}
completeLookupResponseSuccessfully(asyncResponse, result.getLookupData());
}
}).exceptionally(exception -> {
log.warn("Failed to lookup broker for topic {}: {}", topicName, exception.getMessage(), exception);
completeLookupResponseExceptionally(asyncResponse, exception);
return null;
});
}
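除了上面的HTTP请求,也可以在外部程序里用admin SDK对单个分区做lookup,效果是一样的。下面是一个最小示意(其中serviceHttpUrl是假设的broker地址):
import org.apache.pulsar.client.admin.PulsarAdmin;

public class LookupExample {
    public static void main(String[] args) throws Exception {
        // serviceHttpUrl为假设的broker http地址
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            // 对单个分区做lookup,返回该分区归属broker的服务地址
            String brokerUrl = admin.lookups()
                    .lookupTopic("persistent://apache/pulsar/test-topic-partition-0");
            System.out.println(brokerUrl);
        }
    }
}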
bundle和topic关系是如何建立的
getBrokerServiceUrlAsync: 先通过主题名找到bundle range,再由bundle查找归属的broker
public CompletableFuture<Optional<LookupResult>> getBrokerServiceUrlAsync(TopicName topic, LookupOptions options) {
long startTime = System.nanoTime();
// 通过主题名找bundle
// persistent://apache/pulsar/test-topic-partition-1 ---> apache/pulsar/0x80000000_0xc0000000
CompletableFuture<Optional<LookupResult>> future = getBundleAsync(topic)
.thenCompose(bundle -> findBrokerServiceUrl(bundle, options));
future.thenAccept(optResult -> {
lookupLatency.observe(System.nanoTime() - startTime, TimeUnit.NANOSECONDS);
if (optResult.isPresent()) {
if (optResult.get().isRedirect()) {
lookupRedirects.inc();
} else {
lookupAnswers.inc();
}
}
}).exceptionally(ex -> {
lookupFailures.inc();
return null;
});
return future;
}
getBundleAsync(topic):
public CompletableFuture<NamespaceBundle> getBundleAsync(TopicName topic) {
return bundleFactory.getBundlesAsync(topic.getNamespaceObject())
.thenApply(bundles -> bundles.findBundle(topic));
}
bundles对象中保存了该namespace的bundle边界(boundaries)等信息。bundles.findBundle根据topic的哈希值在其中定位对应的bundle:
public NamespaceBundle findBundle(TopicName topicName) {
checkArgument(this.nsname.equals(topicName.getNamespaceObject()));
// 计算topic哈希值
long hashCode = factory.getLongHashCode(topicName.toString());
// 根据哈希值找到对应的bundle
NamespaceBundle bundle = getBundle(hashCode);
if (topicName.getDomain().equals(TopicDomain.non_persistent)) {
bundle.setHasNonPersistentTopic(true);
}
return bundle;
}
到此为止, topic和bundle的关系就建立起来了!
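用一个可运行的小例子直观感受这个映射过程(一个简化示意:这里假设用CRC32做哈希,真实实现中哈希函数由NamespaceBundleFactory注入,getLongHashCode的具体取值处理以源码为准):
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class FindBundleSketch {
    // 对应 /admin/local-policies/apache/pulsar 中的 boundaries
    static final long[] BOUNDARIES = {0x00000000L, 0x40000000L, 0x80000000L, 0xc0000000L, 0xffffffffL};

    // 计算topic全名的哈希值,再看它落在哪个区间,这个区间就是topic归属的bundle range
    static String findBundleRange(String topicFullName) {
        CRC32 crc = new CRC32();
        crc.update(topicFullName.getBytes(StandardCharsets.UTF_8));
        long hash = crc.getValue(); // 32位无符号值,必然落在[0x00000000, 0xffffffff]内
        for (int i = 0; i < BOUNDARIES.length - 1; i++) {
            boolean last = (i == BOUNDARIES.length - 2);
            if (hash >= BOUNDARIES[i] && (hash < BOUNDARIES[i + 1] || last)) {
                return String.format("0x%08x_0x%08x", BOUNDARIES[i], BOUNDARIES[i + 1]);
            }
        }
        throw new IllegalStateException("unreachable");
    }

    public static void main(String[] args) {
        System.out.println(findBundleRange("persistent://apache/pulsar/test-topic-partition-0"));
    }
}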
bundle和broker关系是如何建立的
一句话总结, 通过LoadManager查找负载最低的Broker来和bundle建立关系。
findBrokerServiceUrl再通过bundle(getBundleAsync(topic)的返回值)找到brokerServiceUrl并放进future中,这样客户端就知道这个主题对应哪个broker,可以直接和它交互了。
具体细节:
findBrokerServiceUrl:
private CompletableFuture<Optional<LookupResult>> findBrokerServiceUrl(
NamespaceBundle bundle, LookupOptions options) {
if (LOG.isDebugEnabled()) {
LOG.debug("findBrokerServiceUrl: {} - options: {}", bundle, options);
}
ConcurrentOpenHashMap<NamespaceBundle, CompletableFuture<Optional<LookupResult>>> targetMap;
if (options.isAuthoritative()) {
targetMap = findingBundlesAuthoritative;
} else {
targetMap = findingBundlesNotAuthoritative;
}
return targetMap.computeIfAbsent(bundle, (k) -> {
CompletableFuture<Optional<LookupResult>> future = new CompletableFuture<>();
// First check if we or someone else already owns the bundle
ownershipCache.getOwnerAsync(bundle).thenAccept(nsData -> {
if (!nsData.isPresent()) {
// No one owns this bundle
if (options.isReadOnly()) {
// Do not attempt to acquire ownership
future.complete(Optional.empty());
} else {
// Now, no one owns the namespace yet. Hence, we will try to dynamically assign it
pulsar.getExecutor().execute(() -> {
// 找一个候选broker.
searchForCandidateBroker(bundle, future, options);
});
}
} else if (nsData.get().isDisabled()) {
...
}
searchForCandidateBroker为尚未归属任何broker的bundle寻找候选broker:
private void searchForCandidateBroker(NamespaceBundle bundle,
CompletableFuture<Optional<LookupResult>> lookupFuture,
LookupOptions options) {
// 找到broker的leader
if (null == pulsar.getLeaderElectionService()) {
LOG.warn("The leader election has not yet been completed! NamespaceBundle[{}]", bundle);
lookupFuture.completeExceptionally(
new IllegalStateException("The leader election has not yet been completed!"));
return;
}
String candidateBroker = null;
LeaderElectionService les = pulsar.getLeaderElectionService();
if (les == null) {
// The leader election service was not initialized yet. This can happen because the broker service is
// initialized first and it might start receiving lookup requests before the leader election service is
// fully initialized.
LOG.warn("Leader election service isn't initialized yet. "
+ "Returning empty result to lookup. NamespaceBundle[{}]",
bundle);
lookupFuture.complete(Optional.empty());
return;
}
boolean authoritativeRedirect = les.isLeader();
try {
// 优先考虑heartbeat和SLAMonitor所在broker
// check if this is Heartbeat or SLAMonitor namespace
candidateBroker = checkHeartbeatNamespace(bundle);
if (candidateBroker == null) {
candidateBroker = checkHeartbeatNamespaceV2(bundle);
}
if (candidateBroker == null) {
String broker = getSLAMonitorBrokerName(bundle);
// checking if the broker is up and running
if (broker != null && isBrokerActive(broker)) {
candidateBroker = broker;
}
}
if (candidateBroker == null) {
Optional<LeaderBroker> currentLeader = pulsar.getLeaderElectionService().getCurrentLeader();
if (options.isAuthoritative()) {
// leader broker already assigned the current broker as owner
// 如果请求是被其他broker(一般是leader)重定向过来的,那直接就是当前broker承载bundle了。
candidateBroker = pulsar.getSafeWebServiceAddress();
} else {
// 去找LoadManager 寻找broker
LoadManager loadManager = this.loadManager.get();
boolean makeLoadManagerDecisionOnThisBroker = !loadManager.isCentralized() || les.isLeader();
if (!makeLoadManagerDecisionOnThisBroker) {
// loadManager中心化场景下,当前broker又不是leader
// If leader is not active, fallback to pick the least loaded from current broker loadmanager
boolean leaderBrokerActive = currentLeader.isPresent()
&& isBrokerActive(currentLeader.get().getServiceUrl());
if (!leaderBrokerActive) {
makeLoadManagerDecisionOnThisBroker = true;
if (!currentLeader.isPresent()) {
LOG.warn(
"The information about the current leader broker wasn't available. "
+ "Handling load manager decisions in a decentralized way. "
+ "NamespaceBundle[{}]",
bundle);
} else {
LOG.warn(
"The current leader broker {} isn't active. "
+ "Handling load manager decisions in a decentralized way. "
+ "NamespaceBundle[{}]",
currentLeader.get(), bundle);
}
}
}
if (makeLoadManagerDecisionOnThisBroker) {
// 中心化场景且自己是leader(或回退到本地决策): leader的LoadManager保存着各个节点的负载,
// 通过ModularLoadManagerImpl.selectBrokerForAssignment找到一个合适的broker返回。
Optional<String> availableBroker = getLeastLoadedFromLoadManager(bundle);
if (!availableBroker.isPresent()) {
LOG.warn("Load manager didn't return any available broker. "
+ "Returning empty result to lookup. NamespaceBundle[{}]",
bundle);
lookupFuture.complete(Optional.empty());
return;
}
candidateBroker = availableBroker.get();
authoritativeRedirect = true;
} else {
// forward to leader broker to make assignment
candidateBroker = currentLeader.get().getServiceUrl();
}
}
}
} catch (Exception e) {
LOG.warn("Error when searching for candidate broker to acquire {}: {}", bundle, e.getMessage(), e);
lookupFuture.completeExceptionally(e);
return;
}
try {
checkNotNull(candidateBroker);
// 如果选出的broker和当前的broker是一致的
if (candidateBroker.equals(pulsar.getSafeWebServiceAddress())) {
// Load manager decided that the local broker should try to become the owner
// 尝试把bundle占领
ownershipCache.tryAcquiringOwnership(bundle).thenAccept(ownerInfo -> {
if (ownerInfo.isDisabled()) {
if (LOG.isDebugEnabled()) {
LOG.debug("Namespace bundle {} is currently being unloaded", bundle);
}
lookupFuture.completeExceptionally(new IllegalStateException(
String.format("Namespace bundle %s is currently being unloaded", bundle)));
} else {
// Found owner for the namespace bundle
if (options.isLoadTopicsInBundle()) {
// Schedule the task to pre-load topics 占领成功就加载每一个topic
pulsar.loadNamespaceTopics(bundle);
}
// find the target
if (options.hasAdvertisedListenerName()) {
AdvertisedListener listener =
ownerInfo.getAdvertisedListeners().get(options.getAdvertisedListenerName());
if (listener == null) {
lookupFuture.completeExceptionally(
new PulsarServerException("the broker do not have "
+ options.getAdvertisedListenerName() + " listener"));
return;
} else {
URI url = listener.getBrokerServiceUrl();
URI urlTls = listener.getBrokerServiceUrlTls();
lookupFuture.complete(Optional.of(
new LookupResult(ownerInfo,
url == null ? null : url.toString(),
urlTls == null ? null : urlTls.toString())));
return;
}
} else {
lookupFuture.complete(Optional.of(new LookupResult(ownerInfo)));
return;
}
}
}).exceptionally(exception -> {
LOG.warn("Failed to acquire ownership for namespace bundle {}: {}", bundle, exception);
lookupFuture.completeExceptionally(new PulsarServerException(
"Failed to acquire ownership for namespace bundle " + bundle, exception));
return null;
});
} else {
// Load managed decider some other broker should try to acquire ownership
if (LOG.isDebugEnabled()) {
LOG.debug("Redirecting to broker {} to acquire ownership of bundle {}", candidateBroker, bundle);
}
// Now setting the redirect url 把请求重定向到候选broker
createLookupResult(candidateBroker, authoritativeRedirect, options.getAdvertisedListenerName())
.thenAccept(lookupResult -> lookupFuture.complete(Optional.of(lookupResult)))
.exceptionally(ex -> {
lookupFuture.completeExceptionally(ex);
return null;
});
}
} catch (Exception e) {
LOG.warn("Error in trying to acquire namespace bundle ownership for {}: {}", bundle, e.getMessage(), e);
lookupFuture.completeExceptionally(e);
}
}
继续分析一下getLeastLoadedFromLoadManager:
private Optional<String> getLeastLoadedFromLoadManager(ServiceUnitId serviceUnit) throws Exception {
// 获取getLeastLoaded 最低负载的。
Optional<ResourceUnit> leastLoadedBroker = loadManager.get().getLeastLoaded(serviceUnit);
if (!leastLoadedBroker.isPresent()) {
LOG.warn("No broker is available for {}", serviceUnit);
return Optional.empty();
}
// 返回可用的broker地址。
String lookupAddress = leastLoadedBroker.get().getResourceId();
if (LOG.isDebugEnabled()) {
LOG.debug("{} : redirecting to the least loaded broker, lookup address={}",
pulsar.getSafeWebServiceAddress(),
lookupAddress);
}
return Optional.of(lookupAddress);
}
getLeastLoaded 一路点过来,我们看下ModularLoadManagerImpl的实现:
@Override
public Optional<String> selectBrokerForAssignment(final ServiceUnitId serviceUnit) {
// Use brokerCandidateCache as a lock to reduce synchronization.
long startTime = System.nanoTime();
try {
synchronized (brokerCandidateCache) {
final String bundle = serviceUnit.toString();
if (preallocatedBundleToBroker.containsKey(bundle)) {
// 如果已经预分配就直接返回对应的broker地址
// If the given bundle is already in preallocated, return the selected broker.
return Optional.of(preallocatedBundleToBroker.get(bundle));
}
final BundleData data = loadData.getBundleData().computeIfAbsent(bundle,
key -> getBundleDataOrDefault(bundle));
brokerCandidateCache.clear();
// 命名空间策略
LoadManagerShared.applyNamespacePolicies(serviceUnit, policies, brokerCandidateCache,
getAvailableBrokers(),
brokerTopicLoadingPredicate);
// filter brokers which owns topic higher than threshold 过滤掉超出topic数broker
LoadManagerShared.filterBrokersWithLargeTopicCount(brokerCandidateCache, loadData,
conf.getLoadBalancerBrokerMaxTopics());
// 亲和性相关 distribute namespaces to domain and brokers according to anti-affinity-group
LoadManagerShared.filterAntiAffinityGroupOwnedBrokers(pulsar, serviceUnit.toString(),
brokerCandidateCache,
brokerToNamespaceToBundleRange, brokerToFailureDomainMap);
// distribute bundles evenly to candidate-brokers
// todo
LoadManagerShared.removeMostServicingBrokersForNamespace(serviceUnit.toString(), brokerCandidateCache,
brokerToNamespaceToBundleRange);
log.info("{} brokers being considered for assignment of {}", brokerCandidateCache.size(), bundle);
// Use the filter pipeline to finalize broker candidates.
// brokerFilter目前只有版本过滤
try {
for (BrokerFilter filter : filterPipeline) {
filter.filter(brokerCandidateCache, data, loadData, conf);
}
} catch (BrokerFilterException x) {
// 重新选候选broker
// restore the list of brokers to the full set
LoadManagerShared.applyNamespacePolicies(serviceUnit, policies, brokerCandidateCache,
getAvailableBrokers(),
brokerTopicLoadingPredicate);
}
if (brokerCandidateCache.isEmpty()) {
// 重新选候选broker
// restore the list of brokers to the full set
LoadManagerShared.applyNamespacePolicies(serviceUnit, policies, brokerCandidateCache,
getAvailableBrokers(),
brokerTopicLoadingPredicate);
}
// Choose a broker among the potentially smaller filtered list, when possible
Optional<String> broker = placementStrategy.selectBroker(brokerCandidateCache, data, loadData, conf);
if (log.isDebugEnabled()) {
log.debug("Selected broker {} from candidate brokers {}", broker, brokerCandidateCache);
}
if (!broker.isPresent()) {
// No brokers available
return broker;
}
final double overloadThreshold = conf.getLoadBalancerBrokerOverloadedThresholdPercentage() / 100.0;
final double maxUsage = loadData.getBrokerData().get(broker.get()).getLocalData().getMaxResourceUsage();
if (maxUsage > overloadThreshold) {
// All brokers that were in the filtered list were overloaded, so check if there is a better broker
LoadManagerShared.applyNamespacePolicies(serviceUnit, policies, brokerCandidateCache,
getAvailableBrokers(),
brokerTopicLoadingPredicate);
broker = placementStrategy.selectBroker(brokerCandidateCache, data, loadData, conf);
}
// Add new bundle to preallocated.
loadData.getBrokerData().get(broker.get()).getPreallocatedBundleData().put(bundle, data);
preallocatedBundleToBroker.put(bundle, broker.get());
final String namespaceName = LoadManagerShared.getNamespaceNameFromBundleName(bundle);
final String bundleRange = LoadManagerShared.getBundleRangeFromBundleName(bundle);
final ConcurrentOpenHashMap<String, ConcurrentOpenHashSet<String>> namespaceToBundleRange =
brokerToNamespaceToBundleRange
.computeIfAbsent(broker.get(), k -> new ConcurrentOpenHashMap<>());
synchronized (namespaceToBundleRange) {
namespaceToBundleRange.computeIfAbsent(namespaceName, k -> new ConcurrentOpenHashSet<>())
.add(bundleRange);
}
return broker;
}
} finally {
selectBrokerForAssignment.observe(System.nanoTime() - startTime, TimeUnit.NANOSECONDS);
}
}
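selectBrokerForAssignment的最后一步由placementStrategy决定具体选哪个broker。默认策略(我理解是LeastLongTermMessageRate,以实际配置为准)的思路大致是:按broker的长期消息速率(含预分配bundle)打分,资源使用超过过载阈值的broker视为不可选,最后在得分最低的broker里随机挑一个。下面是按这个思路写的一个简化示意(非源码原样,方法名和参数名均为假设):
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Random;

public class SelectBrokerSketch {
    // brokerScore: broker -> 长期msgRateIn+msgRateOut(含预分配bundle的流量)
    // brokerMaxUsage: broker -> 最大资源使用率(cpu/内存/带宽中的最大值, 0~1)
    static Optional<String> selectBroker(Map<String, Double> brokerScore,
                                         Map<String, Double> brokerMaxUsage,
                                         double overloadThreshold) {
        double minScore = Double.POSITIVE_INFINITY;
        List<String> best = new ArrayList<>();
        for (Map.Entry<String, Double> e : brokerScore.entrySet()) {
            // 过载的broker打分为无穷大,只有所有broker都过载时才可能被选中
            double usage = brokerMaxUsage.getOrDefault(e.getKey(), 0.0);
            double score = usage > overloadThreshold ? Double.POSITIVE_INFINITY : e.getValue();
            if (score < minScore) {
                minScore = score;
                best.clear();
                best.add(e.getKey());
            } else if (score == minScore) {
                best.add(e.getKey());
            }
        }
        return best.isEmpty() ? Optional.empty()
                : Optional.of(best.get(new Random().nextInt(best.size())));
    }
}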
卸载bundle
bundle的卸载是将一个Bundle从当前Broker上释放并转移到另外一个broker,bundle上所有的topic也随之迁移。
卸载实践:
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin topics create-partitioned-topic apache/pulsar/test-topic -p 5
bin/pulsar-admin topics list-partitioned-topics apache/pulsar
"persistent://apache/pulsar/test-topic"
// 获取topic的bundle范围
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin topics bundle-range persistent://apache/pulsar/test-topic
"0x59999994_0x66666660"
// 获取主题目前负责的broker
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin topics partitioned-lookup persistent://apache/pulsar/test-topic
"persistent://apache/pulsar/test-topic-partition-0 pulsar://pulsar-broker-1.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-1 pulsar://pulsar-broker-0.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-2 pulsar://pulsar-broker-2.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-3 pulsar://pulsar-broker-1.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-4 pulsar://pulsar-broker-0.pulsar-broker.components.svc.cluster.local:6650"
// 卸载bundle
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin namespaces unload apache/pulsar
// 或者指定范围卸载bundle
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin namespaces unload -b 0x59999994_0x66666660 apache/pulsar
对应的sdk:
if (bundle == null) {
getAdmin().namespaces().unload(namespace);
} else {
getAdmin().namespaces().unloadNamespaceBundle(namespace, bundle);
}
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin topics partitioned-lookup persistent://apache/pulsar/test-topic
"persistent://apache/pulsar/test-topic-partition-0 pulsar://pulsar-broker-0.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-1 pulsar://pulsar-broker-2.pulsar-broker.components.svc.cluster.local:6650" // 注意到这个有所改变。
"persistent://apache/pulsar/test-topic-partition-2 pulsar://pulsar-broker-2.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-3 pulsar://pulsar-broker-1.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-4 pulsar://pulsar-broker-0.pulsar-broker.components.svc.cluster.local:6650"
卸载源码分析
主要是想知道卸载之后的bundle是如何找到新的broker的,是怎么分配的?
入口:
// unload整个namespace
@PUT
@Path("/{tenant}/{namespace}/unload")
@ApiOperation(value = "Unload namespace",
notes = "Unload an active namespace from the current broker serving it. Performing this operation will"
+ " let the brokerremoves all producers, consumers, and connections using this namespace,"
+ " and close all topics (includingtheir persistent store). During that operation,"
+ " the namespace is marked as tentatively unavailable until thebroker completes "
+ "the unloading action. This operation requires strictly super user privileges,"
+ " since it wouldresult in non-persistent message loss and"
+ " unexpected connection closure to the clients.")
@ApiResponses(value = {
@ApiResponse(code = 307, message = "Current broker doesn't serve the namespace"),
@ApiResponse(code = 403, message = "Don't have admin permission"),
@ApiResponse(code = 404, message = "Tenant or namespace doesn't exist"),
@ApiResponse(code = 412, message = "Namespace is already unloaded or Namespace has bundles activated")})
public void unloadNamespace(@Suspended final AsyncResponse asyncResponse, @PathParam("tenant") String tenant,
@PathParam("namespace") String namespace) {
try {
validateNamespaceName(tenant, namespace);
internalUnloadNamespace(asyncResponse);
} catch (WebApplicationException wae) {
asyncResponse.resume(wae);
} catch (Exception e) {
asyncResponse.resume(new RestException(e));
}
}
// 带bundle range 的unload
@PUT
@Path("/{tenant}/{namespace}/{bundle}/unload")
@ApiOperation(value = "Unload a namespace bundle")
@ApiResponses(value = {
@ApiResponse(code = 307, message = "Current broker doesn't serve the namespace"),
@ApiResponse(code = 403, message = "Don't have admin permission") })
public void unloadNamespaceBundle(@Suspended final AsyncResponse asyncResponse,
@PathParam("tenant") String tenant, @PathParam("namespace") String namespace,
@PathParam("bundle") String bundleRange,
@QueryParam("authoritative") @DefaultValue("false") boolean authoritative) {
validateNamespaceName(tenant, namespace);
internalUnloadNamespaceBundle(asyncResponse, bundleRange, authoritative);
}
unload整个namespace
internalUnloadNamespace:
Policies policies = getNamespacePolicies(namespaceName);
final List<CompletableFuture<Void>> futures = Lists.newArrayList();
List<String> boundaries = policies.bundles.getBoundaries(); // 遍历所有的bundle
for (int i = 0; i < boundaries.size() - 1; i++) {
String bundle = String.format("%s_%s", boundaries.get(i), boundaries.get(i + 1));
try {
futures.add(pulsar().getAdminClient().namespaces().unloadNamespaceBundleAsync(namespaceName.toString(), // 调用unloadNamespaceBundleAsync
bundle));
} catch (PulsarServerException e) {
log.error("[{}] Failed to unload namespace {}", clientAppId(), namespaceName, e);
asyncResponse.resume(new RestException(e));
return;
}
}
遍历所有的bundle 调用带bundle range 的unload。
带bundle range 的unload
源码调试技巧: unload namespace的时候,会遍历namespace的所有bundle逐个调用。有些bundle上并没有我们想要调试的主题,不便于快速断点调试,所以可以将defaultNumberOfNamespaceBundles设为1。
调用unload之前,先用lookup查找一下,确保这个唯一的bundle已经有归属的broker(lookup时如果没有归属,会触发分配流程):
bin/pulsar-admin topics partitioned-lookup persistent://apache/pulsar/test-topic
// 判断bundle是否属于任何broker
isBundleOwnedByAnyBroker(namespaceName, policies.bundles, bundleRange).thenAccept(flag -> {
log.info("judge .... bundleRange: {}", bundleRange);
if (!flag) {
log.info("[{}] Namespace bundle is not owned by any broker {}/{}", clientAppId(), namespaceName,
bundleRange);
asyncResponse.resume(Response.noContent().build());
return; // 不属于任何broker就退出了,没有卸载的必要。
}
NamespaceBundle nsBundle;
try {
nsBundle = validateNamespaceBundleOwnership(namespaceName, policies.bundles, bundleRange,
authoritative, true);
} catch (WebApplicationException wae) {
asyncResponse.resume(wae);
return;
}
// 开始卸载
pulsar().getNamespaceService().unloadNamespaceBundle(nsBundle)
public CompletableFuture<Void> unloadNamespaceBundle(NamespaceBundle bundle) {
// unload namespace bundle
return unloadNamespaceBundle(bundle, 5, TimeUnit.MINUTES);
}
public CompletableFuture<Void> unloadNamespaceBundle(NamespaceBundle bundle, long timeout, TimeUnit timeoutUnit) {
// unload namespace bundle
OwnedBundle ob = ownershipCache.getOwnedBundle(bundle);
if (ob == null) {
return FutureUtil.failedFuture(new IllegalStateException("Bundle " + bundle + " is not currently owned"));
} else {
return ob.handleUnloadRequest(pulsar, timeout, timeoutUnit);
}
}
ob.handleUnloadRequest:
public CompletableFuture<Void> handleUnloadRequest(PulsarService pulsar, long timeout, TimeUnit timeoutUnit) {
long unloadBundleStartTime = System.nanoTime();
// Need a per namespace RenetrantReadWriteLock
// Here to do a writeLock to set the flag and proceed to check and close connections
try {
while (!this.nsLock.writeLock().tryLock(1, TimeUnit.SECONDS)) {
// Using tryLock to avoid deadlocks caused by 2 threads trying to acquire 2 readlocks (eg: replicators)
// while a handleUnloadRequest happens in the middle
LOG.warn("Contention on OwnedBundle rw lock. Retrying to acquire lock write lock");
}
try {
// 将namespace 置为非活动状态,拒绝所有生产者和消费者
// set the flag locally s.t. no more producer/consumer to this namespace is allowed
if (!IS_ACTIVE_UPDATER.compareAndSet(this, TRUE, FALSE)) {
// An exception is thrown when the namespace is not in active state (i.e. another thread is
// removing/have removed it)
return FutureUtil.failedFuture(new IllegalStateException(
"Namespace is not active. ns:" + this.bundle + "; state:" + IS_ACTIVE_UPDATER.get(this)));
}
} finally {
// no matter success or not, unlock
this.nsLock.writeLock().unlock();
}
} catch (InterruptedException e) {
return FutureUtil.failedFuture(e);
}
AtomicInteger unloadedTopics = new AtomicInteger();
LOG.info("Disabling ownership: {}", this.bundle);
// close topics forcefully
return pulsar.getNamespaceService().getOwnershipCache()
.updateBundleState(this.bundle, false)
.thenCompose(v -> pulsar.getBrokerService().unloadServiceUnit(bundle, true, timeout, timeoutUnit)) // unloadServiceUnit做卸载主题流程, 关闭所有连接和存储
.handle((numUnloadedTopics, ex) -> {
if (ex != null) {
// ignore topic-close failure to unload bundle
LOG.error("Failed to close topics under namespace {}", bundle.toString(), ex);
} else {
unloadedTopics.set(numUnloadedTopics);
}
// clean up topics that failed to unload from the broker ownership cache 清除缓存
pulsar.getBrokerService().cleanUnloadedTopicFromCache(bundle);
return null;
})
.thenCompose(v -> {
// delete ownership node on zk
return pulsar.getNamespaceService().getOwnershipCache().removeOwnership(bundle);
}).whenComplete((ignored, ex) -> {
double unloadBundleTime = TimeUnit.NANOSECONDS
.toMillis((System.nanoTime() - unloadBundleStartTime));
LOG.info("Unloading {} namespace-bundle with {} topics completed in {} ms", this.bundle,
unloadedTopics, unloadBundleTime, ex);
});
}
删除zk上的关系:
/namespace/apache/pulsar/0x00000000_0xffffffff
{
"nativeUrl" : "pulsar://192.168.18.135:6651",
"nativeUrlTls" : "pulsar+ssl://192.168.18.135:6671",
"httpUrl" : "http://192.168.18.135:8081",
"httpUrlTls" : "https://192.168.18.135:8441",
"disabled" : false,
"advertisedListeners" : { }
}
注意到unload只是删除临时结点, 归属的流程还是在lookup里面!
卸载topic
卸载topic和卸载bundle是不一样的,无法用来做流量转移。topic的卸载只是把当前topic上的生产者、消费者、副本同步等连接断开,并关闭对应的存储ledger。它可以用在客户端异常场景下强制重置状态进行恢复。如果topic处于fencing状态,topic会阻止客户端重新发起连接请求。
卸载topic实践:
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin topics create-partitioned-topic apache/pulsar/test-topic -p 5
bin/pulsar-admin topics list-partitioned-topics apache/pulsar
"persistent://apache/pulsar/test-topic"
// 获取topic的bundle范围
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin topics bundle-range persistent://apache/pulsar/test-topic
"0x59999994_0x66666660"
// 获取主题目前负责的broker
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin topics partitioned-lookup persistent://apache/pulsar/test-topic
"persistent://apache/pulsar/test-topic-partition-0 pulsar://pulsar-broker-0.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-1 pulsar://pulsar-broker-2.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-2 pulsar://pulsar-broker-2.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-3 pulsar://pulsar-broker-1.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-4 pulsar://pulsar-broker-0.pulsar-broker.components.svc.cluster.local:6650"
// 卸载topic
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin topics unload persistent://apache/pulsar/test-topic
// 卸载topic之后可以看到没有任何改变,并没有发生重分配,无论执行多少次卸载topic都不会触发重分配。
root@pulsar-toolset-0:/pulsar# bin/pulsar-admin topics partitioned-lookup persistent://apache/pulsar/test-topic
"persistent://apache/pulsar/test-topic-partition-0 pulsar://pulsar-broker-0.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-1 pulsar://pulsar-broker-2.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-2 pulsar://pulsar-broker-2.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-3 pulsar://pulsar-broker-1.pulsar-broker.components.svc.cluster.local:6650"
"persistent://apache/pulsar/test-topic-partition-4 pulsar://pulsar-broker-0.pulsar-broker.components.svc.cluster.local:6650"
分裂bundle
自动分裂
自动分裂配置:
# 启用/禁用 自动拆分命名空间中的bundle
loadBalancerAutoBundleSplitEnabled=true
# 启用/禁用 自动卸载切分的bundle
loadBalancerAutoUnloadSplitBundlesEnabled=true
# bundle 中最大的主题数, 一旦超过这个值,将触发拆分操作。
loadBalancerNamespaceBundleMaxTopics=1000
# bundle 最大的session数量(生产 + 消费), 一旦超过这个值,将触发拆分操作。
loadBalancerNamespaceBundleMaxSessions=1000
# bundle 最大的msgRate(进+出)的值, 一旦超过这个值,将触发拆分操作。
loadBalancerNamespaceBundleMaxMsgRate=30000
# bundle 最大的带宽(进+出)的值, 一旦超过这个值,将触发拆分操作
loadBalancerNamespaceBundleMaxBandwidthMbytes=100
# 命名空间中最大的 bundle 数量 (用于自动拆分bundle时)
loadBalancerNamespaceMaximumBundles=128
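把上面这些阈值串起来,自动分裂的判定逻辑大致如下(一个简化示意,只表达“任一指标超阈值且bundle数未达上限就标记待分裂”的思路,非BundleSplitterTask源码原样):
public class ShouldSplitSketch {
    // 任一指标超过阈值,且namespace内bundle数未达上限,就认为该bundle需要分裂
    static boolean shouldSplit(long topics, long sessions, double msgRateInOut, double bandwidthMBytes,
                               int bundlesInNamespace) {
        final int maxTopics = 1000;          // loadBalancerNamespaceBundleMaxTopics
        final int maxSessions = 1000;        // loadBalancerNamespaceBundleMaxSessions
        final double maxMsgRate = 30000;     // loadBalancerNamespaceBundleMaxMsgRate
        final double maxBandwidth = 100;     // loadBalancerNamespaceBundleMaxBandwidthMbytes
        final int maxBundles = 128;          // loadBalancerNamespaceMaximumBundles
        boolean overThreshold = topics > maxTopics || sessions > maxSessions
                || msgRateInOut > maxMsgRate || bandwidthMBytes > maxBandwidth;
        return overThreshold && bundlesInNamespace < maxBundles;
    }
}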
bundle分裂后,其管理的topic会触发重连、重新lookup,这对客户端不太友好,一般在流量低峰期再做分裂。
手动分裂
手动分裂实践
hcb@ubuntu:~/data/code/pulsar$ bin/pulsar-admin topics list-partitioned-topics apache/pulsar
hcb@ubuntu:~/data/code/pulsar$ bin/pulsar-admin topics create-partitioned-topic apache/pulsar/test-topic -p 5
hcb@ubuntu:~/data/code/pulsar$ bin/pulsar-admin topics bundle-range persistent://apache/pulsar/test-topic
"0x40000000_0x80000000"
hcb@ubuntu:~/data/code/pulsar$ bin/pulsar-admin topics partitioned-lookup persistent://apache/pulsar/test-topic
"persistent://apache/pulsar/test-topic-partition-0 pulsar://192.168.18.135:6651"
"persistent://apache/pulsar/test-topic-partition-1 pulsar://192.168.18.135:6651"
"persistent://apache/pulsar/test-topic-partition-2 pulsar://192.168.18.135:6652"
"persistent://apache/pulsar/test-topic-partition-3 pulsar://192.168.18.135:6652"
"persistent://apache/pulsar/test-topic-partition-4 pulsar://192.168.18.135:6651"
bin/pulsar-admin namespaces split-bundle --bundle 0x40000000_0x80000000 apache/pulsar
分裂之后,由4个bundle 分裂成 5个:
/admin/local-policies/apache/pulsar
{
"bundles" : {
"boundaries" : [ "0x00000000", "0x40000000", "0x60000000", "0x80000000", "0xc0000000", "0xffffffff" ],
"numBundles" : 5
}
}
hcb@ubuntu:~/data/code/pulsar$ bin/pulsar-admin topics partitioned-lookup persistent://apache/pulsar/test-topic
"persistent://apache/pulsar/test-topic-partition-0 pulsar://192.168.18.135:6651"
"persistent://apache/pulsar/test-topic-partition-1 pulsar://192.168.18.135:6651"
"persistent://apache/pulsar/test-topic-partition-2 pulsar://192.168.18.135:6652"
"persistent://apache/pulsar/test-topic-partition-3 pulsar://192.168.18.135:6652"
"persistent://apache/pulsar/test-topic-partition-4 pulsar://192.168.18.135:6651"
split-bundle 有两种方式:
You can split namespace bundles in two ways, by setting supportedNamespaceBundleSplitAlgorithms to range_equally_divide or topic_count_equally_divide in broker.conf file.
这里要注意了!!
分裂出的bundle会分配给当前的broker。 除非你开启了unload!
原理解析
无论是由LoadManager模块调用的自动分裂还是手动调用的split,都是走/{tenant}/{namespace}/{bundle}/split 这样的请求!
@PUT
@Path("/{tenant}/{namespace}/{bundle}/split")
@ApiOperation(value = "Split a namespace bundle")
@ApiResponses(value = {
@ApiResponse(code = 307, message = "Current broker doesn't serve the namespace"),
@ApiResponse(code = 403, message = "Don't have admin permission") })
public void splitNamespaceBundle(
@Suspended final AsyncResponse asyncResponse,
@PathParam("tenant") String tenant,
@PathParam("namespace") String namespace,
@PathParam("bundle") String bundleRange,
@QueryParam("authoritative") @DefaultValue("false") boolean authoritative,
@QueryParam("unload") @DefaultValue("false") boolean unload, // 默认是分配给当前的broker,是不会卸载的!
@QueryParam("splitAlgorithmName") String splitAlgorithmName) {
try {
validateNamespaceName(tenant, namespace);
internalSplitNamespaceBundle(asyncResponse, bundleRange, authoritative, unload, splitAlgorithmName);
} catch (WebApplicationException wae) {
asyncResponse.resume(wae);
} catch (Exception e) {
asyncResponse.resume(new RestException(e));
}
}
protected void internalSplitNamespaceBundle(AsyncResponse asyncResponse, String bundleName,
boolean authoritative, boolean unload, String splitAlgorithmName) {
validateSuperUserAccess();
checkNotNull(bundleName, "BundleRange should not be null");
log.info("[{}] Split namespace bundle {}/{}", clientAppId(), namespaceName, bundleName);
String bundleRange = bundleName.equals(Policies.LARGEST_BUNDLE)
? findLargestBundleWithTopics(namespaceName).getBundleRange()
: bundleName;
Policies policies = getNamespacePolicies(namespaceName);
if (namespaceName.isGlobal()) {
// check cluster ownership for a given global namespace: redirect if peer-cluster owns it
validateGlobalNamespaceOwnership(namespaceName);
} else {
validateClusterOwnership(namespaceName.getCluster());
validateClusterForTenant(namespaceName.getTenant(), namespaceName.getCluster());
}
validatePoliciesReadOnlyAccess();
List<String> supportedNamespaceBundleSplitAlgorithms =
pulsar().getConfig().getSupportedNamespaceBundleSplitAlgorithms();
if (StringUtils.isNotBlank(splitAlgorithmName)
&& !supportedNamespaceBundleSplitAlgorithms.contains(splitAlgorithmName)) {
asyncResponse.resume(new RestException(Status.PRECONDITION_FAILED,
"Unsupported namespace bundle split algorithm, supported algorithms are "
+ supportedNamespaceBundleSplitAlgorithms));
}
NamespaceBundle nsBundle;
try {
nsBundle = validateNamespaceBundleOwnership(namespaceName, policies.bundles, bundleRange,
authoritative, true);
} catch (Exception e) {
asyncResponse.resume(e);
return;
}
pulsar().getNamespaceService().splitAndOwnBundle(nsBundle, unload,
getNamespaceBundleSplitAlgorithmByName(splitAlgorithmName))
.thenRun(() -> {
log.info("[{}] Successfully split namespace bundle {}", clientAppId(), nsBundle.toString());
asyncResponse.resume(Response.noContent().build());
}).exceptionally(ex -> {
if (ex.getCause() instanceof IllegalArgumentException) {
log.error("[{}] Failed to split namespace bundle {}/{} due to {}", clientAppId(), namespaceName,
bundleRange, ex.getMessage());
asyncResponse.resume(new RestException(Status.PRECONDITION_FAILED,
"Split bundle failed due to invalid request"));
} else {
log.error("[{}] Failed to split namespace bundle {}/{}", clientAppId(), namespaceName, bundleRange, ex);
asyncResponse.resume(new RestException(ex.getCause()));
}
return null;
});
}
大致流程就是:分裂出新的bundle,将分裂结果写入zk,再让bundle中的topic重新走一遍findBundle流程(上面已经分析过了)。
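以range_equally_divide为例,分裂点的计算可以用一个最小示意来说明,结果与前面实践中0x40000000_0x80000000分裂出0x60000000一致(简化示意,假设边界是32位无符号整数):
public class SplitRangeSketch {
    // 把一个bundle的哈希区间从中点一分为二: [lower, upper) -> [lower, mid) 和 [mid, upper)
    static long[] splitRangeEqually(long lower, long upper) {
        long mid = lower + (upper - lower) / 2;
        return new long[]{lower, mid, upper};
    }

    public static void main(String[] args) {
        long[] r = splitRangeEqually(0x40000000L, 0x80000000L);
        System.out.printf("0x%08x_0x%08x -> 0x%08x_0x%08x, 0x%08x_0x%08x%n",
                r[0], r[2], r[0], r[1], r[1], r[2]);
    }
}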
负载解析
查看负载的几个API:
- bin/pulsar-perf monitor-brokers --connect-string 127.0.0.1:2181
- bin/pulsar-admin broker-stats load-report
- bin/pulsar-admin broker-stats topics -i
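这几个命令背后的数据也可以直接用admin SDK拿到,例如load-report(最小示意,serviceHttpUrl为假设的地址):
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.policies.data.loadbalancer.LoadManagerReport;

public class LoadReportExample {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // 假设的broker地址
                .build()) {
            // 等价于 bin/pulsar-admin broker-stats load-report
            LoadManagerReport report = admin.brokerStats().getLoadReport();
            System.out.println("cpu = " + report.getCpu().percentUsage()
                    + "%, memory = " + report.getMemory().percentUsage() + "%");
        }
    }
}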
LoadManager 概述
LoadManager是负载管理类,它有多种实现,可以在配置中指定loadManagerClassName来选择不同的实现类。
默认是org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl;
如果加载失败,将回退到org.apache.pulsar.broker.loadbalance.impl.SimpleLoadManagerImpl。
- ModularLoadManagerWrapper采用了适配器模式,是一个适配器,适配ModularLoadManagerImpl和LoadManager。(ModularLoadManagerImpl 实现了新的接口,不能直接用于现有的LoadManager流程,所以需要适配)
public class ModularLoadManagerWrapper implements LoadManager {
private ModularLoadManager loadManager; // 会指向ModularLoadManagerImpl
- SimpleLoadManagerImpl ,最简单的重均衡实现,负载均衡未实现插件化,固有三种判定策略
- NoopLoadManager,一个空实现,使用这个NoopLoadManager,不会有任何的重均衡和负载上报
- 可动态修改LoadManager实现类。
bin/pulsar-admin brokers update-dynamic-config --config loadManagerClassName --value 类名
LoadManager主要做三种事情:
- 每个结点的LoadManager周期将自己的负载信息上报元数据中心(zk) —> LoadReportUpdaterTask
- leader节点的LoadManager定时将统计型负载信息(每个broker、Bundle的历史信息)更新到zk —> loadResourceQuotaTask
- leader节点的LoadManager定时根据负载信息重平衡、分裂Bundle —> loadSheddingTask
看下LoadManager类的抽象:
/**
* LoadManager runs through set of load reports collected from different brokers and generates a recommendation of
* namespace/ServiceUnit placement on machines/ResourceUnit. Each Concrete Load Manager will use different algorithms to
* generate this mapping.
*
* Concrete Load Manager is also return the least loaded broker that should own the new namespace.
*/
/*
LoadManager汇总从各个broker收集来的负载报告,生成namespace/ServiceUnit应该归属于哪个machine/ResourceUnit的推荐结果,每种具体实现会用不同的算法来生成这个映射。
具体的LoadManager实现还会返回应当承载新命名空间的最低负载broker。
*/
public interface LoadManager {
Logger LOG = LoggerFactory.getLogger(LoadManager.class);
String LOADBALANCE_BROKERS_ROOT = "/loadbalance/brokers"; // zk broker的根节点
void start() throws PulsarServerException;
/**
* Is centralized decision making to assign a new bundle.
* 是否是集中决策分配bundle
*/
boolean isCentralized();
/**
* Returns the Least Loaded Resource Unit decided by some algorithm or criteria which is implementation specific.
* 返回由某些特定于实现的算法或标准决定的最少负载资源单元。
*/
Optional<ResourceUnit> getLeastLoaded(ServiceUnitId su) throws Exception;
/**
* Generate the load report.
* 生成负载报告 调用pulsar-admin broker-stats load-report ,其实就是走这里
*/
LoadManagerReport generateLoadReport() throws Exception;
/**
* Set flag to force load report update.
* 设置强制更新负载报告标志
*/
void setLoadReportForceUpdateFlag();
/**
* Publish the current load report on ZK. 向元数据服务更新当前负载
*/
void writeLoadReportOnZookeeper() throws Exception;
/**
* Publish the current load report on ZK, forced or not. 发布负载报告到zk
* By default rely on method writeLoadReportOnZookeeper().
*/
default void writeLoadReportOnZookeeper(boolean force) throws Exception {
writeLoadReportOnZookeeper();
}
/**
* Update namespace bundle resource quota on ZK.更新资源配额到元数据
*/
void writeResourceQuotasToZooKeeper() throws Exception;
/**
* Generate load balancing stats metrics. 获取资源负载指标
*/
List<Metrics> getLoadBalancingMetrics();
/**
* Unload a candidate service unit to balance the load. 卸载某些服务单元以达到负载均衡,leader会调用这个方法
*/
void doLoadShedding();
/**
* Namespace bundle split. bundle分裂
*/
void doNamespaceBundleSplit() throws Exception;
/**
* Removes visibility of current broker from loadbalancer list so, other brokers can't redirect any request to this
* broker and this broker won't accept new connection requests. 移除broker的可见性(是否会移除所有bundle?),其他broker的请求无法重定向到这台broker,broker也没法接受新的请求
*
* @throws Exception
*/
void disableBroker() throws Exception;
/**
* Get list of available brokers in cluster. 获取可用的broker列表
*
* @return
* @throws Exception
*/
Set<String> getAvailableBrokers() throws Exception;
void stop() throws PulsarServerException;
/**
* Initialize this LoadManager. 反射初始化LoadManager
*
* @param pulsar
* The service to initialize this with.
*/
void initialize(PulsarService pulsar);
// 创建一个LoadManager类
static LoadManager create(final PulsarService pulsar) {
try {
final ServiceConfiguration conf = pulsar.getConfiguration();
final Class<?> loadManagerClass = Class.forName(conf.getLoadManagerClassName());
// Assume there is a constructor with one argument of PulsarService.
final Object loadManagerInstance = loadManagerClass.getDeclaredConstructor().newInstance();
//是子类就直接创建
if (loadManagerInstance instanceof LoadManager) {
final LoadManager casted = (LoadManager) loadManagerInstance;
casted.initialize(pulsar);
return casted;
} else if (loadManagerInstance instanceof ModularLoadManager) { //包装一下再创建
final LoadManager casted = new ModularLoadManagerWrapper((ModularLoadManager) loadManagerInstance);
casted.initialize(pulsar);
return casted;
}
} catch (Exception e) {
LOG.warn("Error when trying to create load manager: ", e);
}
// If we failed to create a load manager, default to SimpleLoadManagerImpl.
return new SimpleLoadManagerImpl(pulsar);
}
初始化
一句话总结:leader选举是依靠抢占临时节点来实现的。
在PulsarService中有一个LeaderElectionService服务。LeaderElectionService自身的实现非常少,它组合了CoordinationService的实现类CoordinationServiceImpl来完成具体逻辑。
private LeaderElectionService leaderElectionService = null;
public class LeaderElectionService implements AutoCloseable {
private static final String ELECTION_ROOT = "/loadbalance/leader";
private final LeaderElection<LeaderBroker> leaderElection;
private final LeaderBroker localValue;
public LeaderElectionService(CoordinationService cs, String localWebServiceAddress,
Consumer<LeaderElectionState> listener) {
this.leaderElection = cs.getLeaderElection(LeaderBroker.class, ELECTION_ROOT, listener);
this.localValue = new LeaderBroker(localWebServiceAddress);
}
public void start() {
leaderElection.elect(localValue).join();
}
@Override
public <T> LeaderElection<T> getLeaderElection(Class<T> clazz, String path,
Consumer<LeaderElectionState> stateChangesListener) {
return (LeaderElection<T>) leaderElections.computeIfAbsent(path,
key -> new LeaderElectionImpl<T>(store, clazz, path, stateChangesListener, executor)); // 返回了一个LeaderElectionImpl
}
LeaderElectionImpl:
LeaderElectionImpl(MetadataStoreExtended store, Class<T> clazz, String path,
Consumer<LeaderElectionState> stateChangesListener,
ScheduledExecutorService executor) {
this.path = path; // 上面传递进来的"/loadbalance/leader"
this.serde = new JSONMetadataSerdeSimpleType<>(TypeFactory.defaultInstance().constructSimpleType(clazz, null));
this.store = store;
this.cache = store.getMetadataCache(clazz);
this.leaderElectionState = LeaderElectionState.NoLeader;
this.internalState = InternalState.Init;
this.stateChangesListener = stateChangesListener;
this.executor = executor;
store.registerListener(this::handlePathNotification); // 注册临时节点监听
store.registerSessionListener(this::handleSessionNotification); // 注册会话监听
}
选举方法:
private synchronized CompletableFuture<LeaderElectionState> elect() {
// First check if there's already a leader elected
internalState = InternalState.ElectionInProgress;
return store.get(path).thenCompose(optLock -> {
if (optLock.isPresent()) {
return handleExistingLeaderValue(optLock.get());
} else {
return tryToBecomeLeader(); // 尝试成为主
}
}).thenCompose(leaderElectionState ->
// make sure that the cache contains the current leader
// so that getLeaderValueIfPresent works on all brokers
cache.get(path).thenApply(__ -> leaderElectionState));
}
private synchronized CompletableFuture<LeaderElectionState> tryToBecomeLeader() {
byte[] payload;
try {
payload = serde.serialize(path, proposedValue.get());
} catch (Throwable t) {
return FutureUtils.exception(t);
}
CompletableFuture<LeaderElectionState> result = new CompletableFuture<>();
store.put(path, payload, Optional.of(-1L), EnumSet.of(CreateOption.Ephemeral)) // 创建 "/loadbalance/leader" 临时节点。
.thenAccept(stat -> {
synchronized (LeaderElectionImpl.this) {
if (internalState == InternalState.ElectionInProgress) {
...
在PulsarService的start方法中,如果当前broker当选leader,会启动loadSheddingTask任务和LoadResourceQuotaUpdaterTask任务:
// Start the leader election service
startLeaderElectionService();
protected void startLeaderElectionService() {
this.leaderElectionService = new LeaderElectionService(coordinationService, getSafeWebServiceAddress(),
state -> {
if (state == LeaderElectionState.Leading) {
LOG.info("This broker was elected leader");
if (getConfiguration().isLoadBalancerEnabled()) { // 默认开启
long loadSheddingInterval = TimeUnit.MINUTES
.toMillis(getConfiguration().getLoadBalancerSheddingIntervalMinutes());
long resourceQuotaUpdateInterval = TimeUnit.MINUTES
.toMillis(getConfiguration().getLoadBalancerResourceQuotaUpdateIntervalMinutes());
// 取消之前的loadSheddingTask任务和loadResourceQuotaTask任务
if (loadSheddingTask != null) {
loadSheddingTask.cancel(false);
}
if (loadResourceQuotaTask != null) {
loadResourceQuotaTask.cancel(false);
}
// leader 初始化 loadSheddingTask和LoadResourceQuotaUpdaterTask
loadSheddingTask = loadManagerExecutor.scheduleAtFixedRate(
new LoadSheddingTask(loadManager),
loadSheddingInterval, loadSheddingInterval, TimeUnit.MILLISECONDS);
loadResourceQuotaTask = loadManagerExecutor.scheduleAtFixedRate(
new LoadResourceQuotaUpdaterTask(loadManager), resourceQuotaUpdateInterval,
resourceQuotaUpdateInterval, TimeUnit.MILLISECONDS);
}
} else {
if (leaderElectionService != null) {
LOG.info("This broker is a follower. Current leader is {}",
leaderElectionService.getCurrentLeader());
}
if (loadSheddingTask != null) {
loadSheddingTask.cancel(false);
loadSheddingTask = null;
}
if (loadResourceQuotaTask != null) {
loadResourceQuotaTask.cancel(false);
loadResourceQuotaTask = null;
}
}
});
leaderElectionService.start();
}
注: 从源码中,我们看出元数据服务抽象出了一个MetadataStore,这样脱离zk会更加轻松。
ModularLoadManagerImpl详解
属性
// Path to ZNode whose children contain BundleData jsons for each bundle (new API version of ResourceQuota).
// Bundle 负载的根目录
public static final String BUNDLE_DATA_PATH = "/loadbalance/bundle-data";
// todo 什么时候bundle是unseen的?
// Default message rate to assume for unseen bundles.
public static final double DEFAULT_MESSAGE_RATE = 50;
// Default message throughput to assume for unseen bundles.
// Note that the default message size is implicitly defined as DEFAULT_MESSAGE_THROUGHPUT / DEFAULT_MESSAGE_RATE.
public static final double DEFAULT_MESSAGE_THROUGHPUT = 50000;
// 为了统计长期负载的样本
// The number of effective samples to keep for observing long term data.
public static final int NUM_LONG_SAMPLES = 1000;
// 为了统计短期负载的样本
// The number of effective samples to keep for observing short term data.
public static final int NUM_SHORT_SAMPLES = 10;
// Path to ZNode whose children contain ResourceQuota jsons.
public static final String RESOURCE_QUOTA_ZPATH = "/loadbalance/resource-quota/namespace";
// Path to ZNode containing TimeAverageBrokerData jsons for each broker.
// 每个broker的长期和短期负载数据结点
public static final String TIME_AVERAGE_BROKER_ZPATH = "/loadbalance/broker-time-average";
// Set of broker candidates to reuse so that object creation is avoided.
// 把候选broker缓存起来,避免重复创建
private final Set<String> brokerCandidateCache;
// Cache of the local broker data, stored in LoadManager.LOADBALANCE_BROKER_ROOT.
// LocalBrokerData 是broker负载的信息。
private LockManager<LocalBrokerData> brokersData;
private ResourceLock<LocalBrokerData> brokerDataLock;
// 各个缓存,避免频繁读取zk
private MetadataCache<BundleData> bundlesCache;
private MetadataCache<ResourceQuota> resourceQuotaCache;
private MetadataCache<TimeAverageBrokerData> timeAverageBrokerDataCache;
// Broker host usage object used to calculate system resource usage.
// broker用来计算系统资源的,比如内存,cpu等
private BrokerHostUsage brokerHostUsage;
// Map from brokers to namespaces to the bundle ranges in that namespace assigned to that broker.
// Used to distribute bundles within a namespace evenly across brokers.
// 存储了broker到namespace再到namespace的bundle range的关系
private final ConcurrentOpenHashMap<String, ConcurrentOpenHashMap<String, ConcurrentOpenHashSet<String>>>
brokerToNamespaceToBundleRange;
// Path to the ZNode containing the LocalBrokerData json for this broker.
// 每个broker对应一个节点,例如/loadbalance/brokers/192.168.18.135:8081
private String brokerZnodePath;
// Strategy to use for splitting bundles.
private BundleSplitStrategy bundleSplitStrategy;
// Service configuration belonging to the pulsar service.
private ServiceConfiguration conf;
// The default bundle stats which are used to initialize historic data.
// This data is overridden after the bundle receives its first sample.
// 初始化时使用的默认值,采集到第一个样本后会被覆盖
private final NamespaceBundleStats defaultStats;
// Used to filter brokers from being selected for assignment.
// 用来过滤可选的broker
private final List<BrokerFilter> filterPipeline;
// Timestamp of last invocation of updateBundleData.
// 最后一次调用updateBundleData的时间
private long lastBundleDataUpdate;
// LocalBrokerData available before most recent update.
// 上一次更新之前的LocalBrokerData数据
private LocalBrokerData lastData;
// Pipeline used to determine what namespaces, if any, should be unloaded.
private final List<LoadSheddingStrategy> loadSheddingPipeline;
// Local data for the broker this is running on.
// 当前broker运行负载的数据
private LocalBrokerData localData;
// Load data comprising data available for each broker.
// 包含每个broker的负载数据
private final LoadData loadData;
// Used to determine whether a bundle is preallocated.
// 预分配namespaceBundle到broker
private final Map<String, String> preallocatedBundleToBroker;
// Strategy used to determine where new topics should be placed.
// 确认一个新的topic应该放在哪里的策略
private ModularLoadManagerStrategy placementStrategy;
// Policies used to determine which brokers are available for particular namespaces.
// 确认特殊命名空间应该放在哪些broker的策略
private SimpleResourceAllocationPolicies policies;
// Pulsar service used to initialize this.
private PulsarService pulsar;
// Executor service used to regularly update broker data.
private final ScheduledExecutorService scheduler;
// check if given broker can load persistent/non-persistent topic
// 检查给定的broker能否加载持久化和非持久化topic
private final BrokerTopicLoadingPredicate brokerTopicLoadingPredicate;
private Map<String, String> brokerToFailureDomainMap;
private SessionEvent lastMetadataSessionEvent = SessionEvent.Reconnected;
// record load balancing metrics
private AtomicReference<List<Metrics>> loadBalancingMetrics = new AtomicReference<>();
// record bundle unload metrics
private AtomicReference<List<Metrics>> bundleUnloadMetrics = new AtomicReference<>();
// record bundle split metrics
private AtomicReference<List<Metrics>> bundleSplitMetrics = new AtomicReference<>();
private long bundleSplitCount = 0;
private long unloadBrokerCount = 0;
private long unloadBundleCount = 0;
// 保护负载数据的互斥锁(ReentrantLock)
private final Lock lock = new ReentrantLock();
负载信息
/loadbalance/broker-time-average下是一段时间内,broker的长期和短期负载
/loadbalance/bundle-data 下是各个bundle的负载,同样有长期和短期两部分,其中topics表示该bundle负责的topic数
/loadbalance/brokers/192.168.18.135:8081 是临时节点,包含大量信息,broker-stats load-report 的信息就是从这里来的。
ModularLoadManagerImpl会生成LocalBrokerData对象。LocalBrokerData包含broker所有的负载数据:
public class LocalBrokerData implements LoadManagerReport {
// URLs to satisfy contract of ServiceLookupData (used by NamespaceService). broker连接信息
private final String webServiceUrl;
private final String webServiceUrlTls;
private final String pulsarServiceUrl;
private final String pulsarServiceUrlTls;
private boolean persistentTopicsEnabled = true; // 启用持久化主题
private boolean nonPersistentTopicsEnabled = true; // 启用非持久化主题
// Most recently available system resource usage. 最新的系统资源使用量
private ResourceUsage cpu;
private ResourceUsage memory; // 内存
private ResourceUsage directMemory; // 堆外内存
private ResourceUsage bandwidthIn; // 入带宽
private ResourceUsage bandwidthOut; // 出带宽
// Message data from the most recent namespace bundle stats. bundle相关的状态
private double msgThroughputIn; // 总接收消息吞吐量
private double msgThroughputOut; // 总推送消息吞吐量
private double msgRateIn; //入消息的QPS
private double msgRateOut; // 出消息的QPS
// Timestamp of last update.
private long lastUpdate;
// The stats given in the most recent invocation of update. 每个Bundle的详细流量信息
private Map<String, NamespaceBundleStats> lastStats;
private int numTopics; // broker上的总主题数
private int numBundles; // broker上的总bundle数
private int numConsumers; // broker上的消费者数
private int numProducers; // broker上的生产者数
// All bundles belonging to this broker.
private Set<String> bundles; // 负责的所有bundle
// The bundles gained since the last invocation of update.
private Set<String> lastBundleGains; // 和上一次数据更新比较,Broker获取了哪些bundle
// The bundles lost since the last invocation of update.
private Set<String> lastBundleLosses; // 和上一次数据更新比较,Broker失去了哪些bundle
// The version string that this broker is running, obtained from the Maven build artifact in the POM broker的版本号
private String brokerVersionString;
// This place-holder requires to identify correct LoadManagerReport type while deserializing
@SuppressWarnings("checkstyle:ConstantName")
public static final String loadReportType = LocalBrokerData.class.getSimpleName();
// the external protocol data advertised by protocol handlers.
private Map<String, String> protocols; // 外部协议
//
private Map<String, AdvertisedListener> advertisedListeners;
还有一个LoadData数据。
/**
* Map from broker names to their available data.
*/
private final Map<String, BrokerData> brokerData;
/**
* Map from bundle names to their time-sensitive aggregated data.
*/
private final Map<String, BundleData> bundleData;
/**
* Map from recently unloaded bundles to the timestamp of when they were last loaded.
*/
private final Map<String, Long> recentlyUnloadedBundles;
它的更新由updateAll触发,下面会做详细分析。
启动start
@Override
public void start() throws PulsarServerException {
try {
// At this point, the ports will be updated with the real port number that the server was assigned
Map<String, String> protocolData = pulsar.getProtocolDataToAdvertise();
lastData = new LocalBrokerData(pulsar.getSafeWebServiceAddress(), pulsar.getWebServiceAddressTls(),
pulsar.getBrokerServiceUrl(), pulsar.getBrokerServiceUrlTls(), pulsar.getAdvertisedListeners());
lastData.setProtocols(protocolData);
// configure broker-topic mode
lastData.setPersistentTopicsEnabled(pulsar.getConfiguration().isEnablePersistentTopics());
lastData.setNonPersistentTopicsEnabled(pulsar.getConfiguration().isEnableNonPersistentTopics());
localData = new LocalBrokerData(pulsar.getSafeWebServiceAddress(), pulsar.getWebServiceAddressTls(),
pulsar.getBrokerServiceUrl(), pulsar.getBrokerServiceUrlTls(), pulsar.getAdvertisedListeners());
localData.setProtocols(protocolData);
localData.setBrokerVersionString(pulsar.getBrokerVersion());
// configure broker-topic mode
localData.setPersistentTopicsEnabled(pulsar.getConfiguration().isEnablePersistentTopics());
localData.setNonPersistentTopicsEnabled(pulsar.getConfiguration().isEnableNonPersistentTopics());
String lookupServiceAddress = pulsar.getAdvertisedAddress() + ":"
+ (conf.getWebServicePort().isPresent() ? conf.getWebServicePort().get()
: conf.getWebServicePortTls().get());
brokerZnodePath = LoadManager.LOADBALANCE_BROKERS_ROOT + "/" + lookupServiceAddress;
final String timeAverageZPath = TIME_AVERAGE_BROKER_ZPATH + "/" + lookupServiceAddress;
// 更新当前broker负载到localData,更新系统指标和bundle状态指标,更新LoadBalance的指标
updateLocalBrokerData();
brokerDataLock = brokersData.acquireLock(brokerZnodePath, localData).join();
timeAverageBrokerDataCache.readModifyUpdateOrCreate(timeAverageZPath,
__ -> new TimeAverageBrokerData()).join();
// 更新
updateAll();
lastBundleDataUpdate = System.currentTimeMillis();
} catch (Exception e) {
log.error("Unable to acquire lock for broker: [{}]", brokerZnodePath, e);
throw new PulsarServerException(e);
}
}
updateAll 详细分析
调用时机有:
- 每个broker的LoadManager启动时会调用
- 当LOADBALANCE_BROKERS_ROOT这个zk节点有变更时会调用,由watch监听触发(handleDataNotification)
- LoadReportUpdaterTask 每次上报负载的时候(run方法→LoadReportUpdaterTask中)
// Update both the broker data and the bundle data.
public void updateAll() {
if (log.isDebugEnabled()) {
log.debug("Updating broker and bundle data for loadreport");
}
updateAllBrokerData(); // 更新loadData
updateBundleData(); // 这里也是更新loadData, 更新bundleData
// broker has latest load-report: check if any bundle requires split 看下是否需要拆分bundle
// 只有leader且开启了自动split才会发生
checkNamespaceBundleSplit();
}
updateAllBrokerData
所有broker的负载数据都要通过updateLocalBrokerData上报给元数据存储(zk),leader读取到这些数据后,才能更新loadData中的broker数据映射。
private void updateAllBrokerData() {
final Set<String> activeBrokers = getAvailableBrokers();
final Map<String, BrokerData> brokerDataMap = loadData.getBrokerData();
// 遍历存活的broker
for (String broker : activeBrokers) {
try {
String key = String.format("%s/%s", LoadManager.LOADBALANCE_BROKERS_ROOT, broker);
Optional<LocalBrokerData> localData = brokersData.readLock(key).get();
if (!localData.isPresent()) {
brokerDataMap.remove(broker); // 不存在就移除了,可能是结点下线了,或者其他问题
log.info("[{}] Broker load report is not present", broker);
continue;
}
if (brokerDataMap.containsKey(broker)) {
// Replace previous local broker data.
brokerDataMap.get(broker).setLocalData(localData.get());
} else {
// Initialize BrokerData object for previously unseen
// brokers.
brokerDataMap.put(broker, new BrokerData(localData.get()));
}
} catch (Exception e) {
log.warn("Error reading broker data from cache for broker - [{}], [{}]", broker, e.getMessage());
}
}
// Remove obsolete brokers.
for (final String broker : brokerDataMap.keySet()) {
if (!activeBrokers.contains(broker)) {
brokerDataMap.remove(broker);
}
}
}
The bundle stats used by updateLocalBrokerData come from PulsarStats.updateStats, which runs once at startup and then periodically.
updateBundleData
In the same way, this aggregates bundle load: the bundle entries in loadData are updated from the stats reported by each broker.
private void updateBundleData() {
final Map<String, BundleData> bundleData = loadData.getBundleData();
// Iterate over the broker data.
for (Map.Entry<String, BrokerData> brokerEntry : loadData.getBrokerData().entrySet()) {
final String broker = brokerEntry.getKey();
final BrokerData brokerData = brokerEntry.getValue();
final Map<String, NamespaceBundleStats> statsMap = brokerData.getLocalData().getLastStats();
// Iterate over the last bundle stats available to the current
// broker to update the bundle data.
for (Map.Entry<String, NamespaceBundleStats> entry : statsMap.entrySet()) {
final String bundle = entry.getKey();
final NamespaceBundleStats stats = entry.getValue();
if (bundleData.containsKey(bundle)) {
// If we recognize the bundle, add these stats as a new sample.
bundleData.get(bundle).update(stats);
} else {
// Otherwise, attempt to find the bundle data on metadata store.
// If it cannot be found, use the latest stats as the first sample.
BundleData currentBundleData = getBundleDataOrDefault(bundle);
currentBundleData.update(stats);
bundleData.put(bundle, currentBundleData);
}
}
// Remove all loaded bundles from the preallocated maps.
final Map<String, BundleData> preallocatedBundleData = brokerData.getPreallocatedBundleData();
synchronized (preallocatedBundleData) {
for (String preallocatedBundleName : brokerData.getPreallocatedBundleData().keySet()) {
if (brokerData.getLocalData().getBundles().contains(preallocatedBundleName)) {
final Iterator<Map.Entry<String, BundleData>> preallocatedIterator =
preallocatedBundleData.entrySet()
.iterator();
while (preallocatedIterator.hasNext()) {
final String bundle = preallocatedIterator.next().getKey();
// bundleData already tracks this bundle, so drop it from the preallocated map
if (bundleData.containsKey(bundle)) {
preallocatedIterator.remove();
preallocatedBundleToBroker.remove(bundle);
}
}
}
// TODO: not yet fully understood
// This is needed too in case a broker which was assigned a bundle dies and comes back up.
preallocatedBundleToBroker.remove(preallocatedBundleName);
}
}
// Using the newest data, update the aggregated time-average data for the current broker.
brokerData.getTimeAverageData().reset(statsMap.keySet(), bundleData, defaultStats);
final ConcurrentOpenHashMap<String, ConcurrentOpenHashSet<String>> namespaceToBundleRange =
brokerToNamespaceToBundleRange
.computeIfAbsent(broker, k -> new ConcurrentOpenHashMap<>());
synchronized (namespaceToBundleRange) {
namespaceToBundleRange.clear();
LoadManagerShared.fillNamespaceToBundlesMap(statsMap.keySet(), namespaceToBundleRange);
LoadManagerShared.fillNamespaceToBundlesMap(preallocatedBundleData.keySet(), namespaceToBundleRange);
}
}
}
checkNamespaceBundleSplit
Decides whether any bundles need to be split; the core of the method is:
synchronized (bundleSplitStrategy) {
final Set<String> bundlesToBeSplit = bundleSplitStrategy.findBundlesToSplit(loadData, pulsar);
NamespaceBundleFactory namespaceBundleFactory = pulsar.getNamespaceService().getNamespaceBundleFactory();
// Go through each candidate bundle and check whether it can actually be split
for (String bundleName : bundlesToBeSplit) {
try {
final String namespaceName = LoadManagerShared.getNamespaceNameFromBundleName(bundleName);
final String bundleRange = LoadManagerShared.getBundleRangeFromBundleName(bundleName);
if (!namespaceBundleFactory
.canSplitBundle(namespaceBundleFactory.getBundle(namespaceName, bundleRange))) {
continue;
}
// Clear the cached data of the bundle being split so it will not be selected again
loadData.getBundleData().remove(bundleName);
localData.getLastStats().remove(bundleName);
this.pulsar.getNamespaceService().getNamespaceBundleFactory()
.invalidateBundleCache(NamespaceName.get(namespaceName));
deleteBundleDataFromMetadataStore(bundleName);
// Trigger the split: this actually sends a split request to the owning broker.
log.info("Load-manager splitting bundle {} and unloading {}", bundleName, unloadSplitBundles);
pulsar.getAdminClient().namespaces().splitNamespaceBundle(namespaceName, bundleRange,
unloadSplitBundles, null);
log.info("Successfully split namespace bundle {}", bundleName);
} catch (Exception e) {
log.error("Failed to split namespace bundle {}", bundleName, e);
}
}
updateBundleSplitMetrics(bundlesToBeSplit);
}
findBundlesToSplit:
public Set<String> findBundlesToSplit(final LoadData loadData, final PulsarService pulsar) {
bundleCache.clear();
final ServiceConfiguration conf = pulsar.getConfiguration();
// Configured limits: maximum bundle count, topics, sessions, message rate and bandwidth
int maxBundleCount = conf.getLoadBalancerNamespaceMaximumBundles();
long maxBundleTopics = conf.getLoadBalancerNamespaceBundleMaxTopics();
long maxBundleSessions = conf.getLoadBalancerNamespaceBundleMaxSessions();
long maxBundleMsgRate = conf.getLoadBalancerNamespaceBundleMaxMsgRate();
long maxBundleBandwidth = conf.getLoadBalancerNamespaceBundleMaxBandwidthMbytes() * LoadManagerShared.MIBI;
loadData.getBrokerData().forEach((broker, brokerData) -> {
LocalBrokerData localData = brokerData.getLocalData();
for (final Map.Entry<String, NamespaceBundleStats> entry : localData.getLastStats().entrySet()) {
final String bundle = entry.getKey();
final NamespaceBundleStats stats = entry.getValue();
if (stats.topics < 2) { // a bundle with fewer than 2 topics cannot usefully be split
log.info("The count of topics on the bundle {} is less than 2,skip split!", bundle);
continue;
}
double totalMessageRate = 0;
double totalMessageThroughput = 0;
// Attempt to consider long-term message data, otherwise effectively ignore.
if (loadData.getBundleData().containsKey(bundle)) {
final TimeAverageMessageData longTermData = loadData.getBundleData().get(bundle).getLongTermData();
totalMessageRate = longTermData.totalMsgRate();
totalMessageThroughput = longTermData.totalMsgThroughput();
}
// Exceeding any one of the thresholds makes the bundle a split candidate.
if (stats.topics > maxBundleTopics || stats.consumerCount + stats.producerCount > maxBundleSessions
|| totalMessageRate > maxBundleMsgRate || totalMessageThroughput > maxBundleBandwidth) {
final String namespace = LoadManagerShared.getNamespaceNameFromBundleName(bundle);
try {
final int bundleCount = pulsar.getNamespaceService()
.getBundleCount(NamespaceName.get(namespace));
// Only split while the namespace is still below the configured maximum bundle count
if (bundleCount < maxBundleCount) {
bundleCache.add(bundle);
} else {
log.warn(
"Could not split namespace bundle {} because namespace {} has too many bundles: {}",
bundle, namespace, bundleCount);
}
} catch (Exception e) {
log.warn("Error while getting bundle count for namespace {}", namespace, e);
}
}
}
});
return bundleCache;
}
Load reporting
LoadReportUpdaterTask
if (config.isLoadBalancerEnabled()) {
LOG.info("Starting load balancer");
if (this.loadReportTask == null) {
long loadReportMinInterval = LoadManagerShared.LOAD_REPORT_UPDATE_MINIMUM_INTERVAL;
this.loadReportTask = this.loadManagerExecutor.scheduleAtFixedRate(
new LoadReportUpdaterTask(loadManager), loadReportMinInterval, loadReportMinInterval,
TimeUnit.MILLISECONDS);
}
}
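What the task does on every tick is essentially one call. The following is a simplified sketch of LoadReportUpdaterTask#run rather than a verbatim copy, assuming the task holds the LoadManager in an AtomicReference and has a log field:
@Override
public void run() {
    try {
        // Publish the local broker's load report to the metadata store; per the
        // call timings listed earlier, this update also fires the watch on
        // LOADBALANCE_BROKERS_ROOT.
        loadManager.get().writeLoadReportOnZookeeper();
    } catch (Exception e) {
        log.warn("Unable to write the load report to the metadata store", e);
    }
}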
LoadResourceQuotaUpdaterTask
This task runs only on the leader broker. It aggregates the load of every broker and of the bundles on each broker, and uses windowed calculations to derive short-term and long-term averages.
It triggers LoadManager#writeResourceQuotasToZooKeeper and ultimately does two things:
- computes bundle load and writes it to /loadbalance/bundle-data/xxx
- computes broker load and writes it to /loadbalance/broker-time-average/xxx
Configuration:
- loadBalancerResourceQuotaUpdateIntervalMinutes: the interval at which bundle load data is updated
Load shedding strategy (loadSheddingTask)
A load shedding strategy decides which bundles should be unloaded from a broker so that other brokers can take them over and the cluster becomes better balanced; it is driven by loadSheddingTask.
Scheduling switch: loadBalancerEnabled (enabled by default)
Scheduling interval: loadBalancerSheddingIntervalMinutes
Grace period: loadBalancerSheddingGracePeriodMinutes, which keeps a bundle from bouncing back and forth between brokers
ModularLoadManagerImpl comes with the following three built-in shedding strategies.
DeviationShedder
An abstract class that makes it easier to implement a LoadSheddingStrategy based on standard deviation. Assuming there is some metric that estimates a broker's load, this strategy computes the standard deviation of that metric across brokers and sheds load from brokers whose deviation exceeds a threshold. It cannot be used directly.
The source tree contains no concrete subclass of it either; a hedged sketch of what such a strategy could look like follows.
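The sketch below is not part of Pulsar: the class name and the one-standard-deviation threshold are invented here, and it only reuses types already shown in this article (LoadData, LocalBrokerData, TimeAverageMessageData, the LoadSheddingStrategy signature).
// Hedged sketch of a standard-deviation based shedder (not in the Pulsar code base).
public class StdDeviationShedderSketch implements LoadSheddingStrategy {

    @Override
    public Multimap<String, String> findBundlesForUnloading(LoadData loadData, ServiceConfiguration conf) {
        Multimap<String, String> selected = ArrayListMultimap.create();
        // Collect the max resource usage of every broker.
        Map<String, Double> usageByBroker = new HashMap<>();
        loadData.getBrokerData().forEach((broker, data) ->
                usageByBroker.put(broker, data.getLocalData().getMaxResourceUsage()));
        if (usageByBroker.size() < 2) {
            return selected;
        }
        // Mean and standard deviation of that usage metric across the cluster.
        double mean = usageByBroker.values().stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double variance = usageByBroker.values().stream()
                .mapToDouble(u -> (u - mean) * (u - mean)).average().orElse(0);
        double stdDev = Math.sqrt(variance);
        usageByBroker.forEach((broker, usage) -> {
            // Shed from brokers that sit more than one standard deviation above the mean.
            if (stdDev > 0 && usage > mean + stdDev) {
                // Pick that broker's bundle with the highest short-term throughput.
                loadData.getBrokerData().get(broker).getLocalData().getBundles().stream()
                        .filter(bundle -> loadData.getBundleData().containsKey(bundle))
                        .max(Comparator.comparingDouble((String bundle) -> {
                            TimeAverageMessageData shortTerm =
                                    loadData.getBundleData().get(bundle).getShortTermData();
                            return shortTerm.getMsgThroughputIn() + shortTerm.getMsgThroughputOut();
                        }))
                        .ifPresent(bundle -> selected.put(broker, bundle));
            }
        });
        return selected;
    }
}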
OverloadShedder
The default strategy: when a broker's usage exceeds loadBalancerBrokerOverloadedThresholdPercentage (85% by default), it tries to unload bundles from that broker.
@Override
public Multimap<String, String> findBundlesForUnloading(final LoadData loadData, final ServiceConfiguration conf) {
selectedBundlesCache.clear();
final double overloadThreshold = conf.getLoadBalancerBrokerOverloadedThresholdPercentage() / 100.0;
final Map<String, Long> recentlyUnloadedBundles = loadData.getRecentlyUnloadedBundles();
// Check every broker and select
loadData.getBrokerData().forEach((broker, brokerData) -> {
final LocalBrokerData localData = brokerData.getLocalData();
final double currentUsage = localData.getMaxResourceUsage();
if (currentUsage < overloadThreshold) {
if (log.isDebugEnabled()) {
log.debug("[{}] Broker is not overloaded, ignoring at this point ({})", broker,
localData.printResourceUsage());
}
return;
}
// We want to offload enough traffic such that this broker will go below the overload threshold
// Also, add a small margin so that this broker won't be very close to the threshold edge.
double percentOfTrafficToOffload = currentUsage - overloadThreshold + ADDITIONAL_THRESHOLD_PERCENT_MARGIN;
double brokerCurrentThroughput = localData.getMsgThroughputIn() + localData.getMsgThroughputOut();
double minimumThroughputToOffload = brokerCurrentThroughput * percentOfTrafficToOffload;
log.info(
"Attempting to shed load on {}, which has resource usage {}% above threshold {}%"
+ " -- Offloading at least {} MByte/s of traffic ({})",
broker, 100 * currentUsage, 100 * overloadThreshold, minimumThroughputToOffload / 1024 / 1024,
localData.printResourceUsage());
MutableDouble trafficMarkedToOffload = new MutableDouble(0);
MutableBoolean atLeastOneBundleSelected = new MutableBoolean(false);
if (localData.getBundles().size() > 1) { // the broker must own more than one bundle
// Sort bundles by throughput, then pick the biggest N which combined
// make up for at least the minimum throughput to offload
loadData.getBundleDataForLoadShedding().entrySet().stream()
.filter(e -> localData.getBundles().contains(e.getKey()))
.map((e) -> {
// Map to throughput value
// Consider short-term byte rate to address system resource burden
String bundle = e.getKey();
BundleData bundleData = e.getValue();
TimeAverageMessageData shortTermData = bundleData.getShortTermData();
double throughput = shortTermData.getMsgThroughputIn() + shortTermData
.getMsgThroughputOut();
return Pair.of(bundle, throughput);
}).filter(e -> {
// Only consider bundles that were not already unloaded recently
return !recentlyUnloadedBundles.containsKey(e.getLeft());
}).filter(e ->
localData.getBundles().contains(e.getLeft())
).sorted((e1, e2) -> {
// Sort by throughput in reverse order: shed the bundles with the highest throughput first
return Double.compare(e2.getRight(), e1.getRight());
}).forEach(e -> {
if (trafficMarkedToOffload.doubleValue() < minimumThroughputToOffload
|| atLeastOneBundleSelected.isFalse()) {
selectedBundlesCache.put(broker, e.getLeft());
trafficMarkedToOffload.add(e.getRight());
atLeastOneBundleSelected.setTrue();
}
});
} else if (localData.getBundles().size() == 1) {
log.warn(
"HIGH USAGE WARNING : Sole namespace bundle {} is overloading broker {}. "
+ "No Load Shedding will be done on this broker",
localData.getBundles().iterator().next(), broker);
} else {
log.warn("Broker {} is overloaded despite having no bundles", broker);
}
});
return selectedBundlesCache;
}
}
Pros:
- Unloading is triggered as rarely as possible: as long as no broker crosses the threshold, nothing is unloaded, even when new brokers join. Fewer unloads make the cluster more stable.
Cons:
- When the brokers in the cluster are all heavily loaded, bundles can be passed back and forth between brokers.
In that case keep the overall load from getting too high and make sure monitoring is in place.
ThresholdShedder
Decides whether to unload bundles based on a broker's share of the cluster-wide load; the goal of this strategy is to keep every node's load roughly balanced.
It first derives the average broker usage of the cluster from LoadData:
private double getBrokerAvgUsage(final LoadData loadData, final double historyPercentage,
final ServiceConfiguration conf) {
double totalUsage = 0.0;
int totalBrokers = 0;
for (Map.Entry<String, BrokerData> entry : loadData.getBrokerData().entrySet()) {
LocalBrokerData localBrokerData = entry.getValue().getLocalData();
String broker = entry.getKey();
// Weighted usage per broker: getMaxResourceUsageWithWeight takes the maximum of cpu * cpuWeight, memory * memoryWeight, direct memory, and bandwidth in/out, each multiplied by its weight
totalUsage += updateAvgResourceUsage(broker, localBrokerData, historyPercentage, conf);
totalBrokers++;
}
return totalBrokers > 0 ? totalUsage / totalBrokers : 0; // average usage = total usage / number of brokers
}
private double updateAvgResourceUsage(String broker, LocalBrokerData localBrokerData,
final double historyPercentage, final ServiceConfiguration conf) {
Double historyUsage =
brokerAvgResourceUsage.get(broker);
double resourceUsage = localBrokerData.getMaxResourceUsageWithWeight(
conf.getLoadBalancerCPUResourceWeight(),
conf.getLoadBalancerMemoryResourceWeight(), conf.getLoadBalancerDirectMemoryResourceWeight(),
conf.getLoadBalancerBandwithInResourceWeight(),
conf.getLoadBalancerBandwithOutResourceWeight());
historyUsage = historyUsage == null
? resourceUsage : historyUsage * historyPercentage + (1 - historyPercentage) * resourceUsage;
// Also record this broker's smoothed usage so it can be used in the comparison below
brokerAvgResourceUsage.put(broker, historyUsage);
return historyUsage;
}
Weight configuration (a worked example follows this list):
- loadBalancerBandwithInResourceWeight
- loadBalancerBandwithOutResourceWeight
- loadBalancerCPUResourceWeight
- loadBalancerMemoryResourceWeight
- loadBalancerDirectMemoryResourceWeight
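To make the arithmetic concrete, here is a tiny worked example with made-up numbers. The max-of-weighted-usages behaviour follows the name of getMaxResourceUsageWithWeight, and a history percentage of 0.9 (loadBalancerHistoryResourcePercentage) is used purely for illustration:
// Made-up sample values, not measurements from a real broker:
// cpu = 0.80 (weight 1.0), memory = 0.40 (weight 0.5), direct memory = 0.30 (weight 0.5),
// bandwidth in = 0.20 and bandwidth out = 0.10 (weight 1.0 each).
double resourceUsage = Math.max(0.80 * 1.0,
        Math.max(0.40 * 0.5,
        Math.max(0.30 * 0.5,
        Math.max(0.20 * 1.0, 0.10 * 1.0))));          // = 0.80
// Exponential smoothing with historyPercentage = 0.9 and a previous value of 0.60:
double historyUsage = 0.9 * 0.60 + (1 - 0.9) * 0.80;  // = 0.62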
Details of findBundlesForUnloading:
@Override
public Multimap<String, String> findBundlesForUnloading(final LoadData loadData, final ServiceConfiguration conf) {
selectedBundlesCache.clear();
// threshold, 10% by default
final double threshold = conf.getLoadBalancerBrokerThresholdShedderPercentage() / 100.0;
final Map<String, Long> recentlyUnloadedBundles = loadData.getRecentlyUnloadedBundles();
// loadBalancerBundleUnloadMinThroughputThreshold, defaults to 10 (i.e. 10 MB)
final double minThroughputThreshold = conf.getLoadBalancerBundleUnloadMinThroughputThreshold() * MB;
// Average broker usage of the cluster
final double avgUsage = getBrokerAvgUsage(loadData, conf.getLoadBalancerHistoryResourcePercentage(), conf);
if (avgUsage == 0) {
log.warn("average max resource usage is 0");
return selectedBundlesCache;
}
// Go through each broker and compare its usage with the cluster average
loadData.getBrokerData().forEach((broker, brokerData) -> {
final LocalBrokerData localData = brokerData.getLocalData();
final double currentUsage = brokerAvgResourceUsage.getOrDefault(broker, 0.0);
// Current usage does not exceed average + threshold: nothing to shed on this broker
if (currentUsage < avgUsage + threshold) {
if (log.isDebugEnabled()) {
log.debug("[{}] broker is not overloaded, ignoring at this point", broker);
}
return;
}
double percentOfTrafficToOffload =
currentUsage - avgUsage - threshold + ADDITIONAL_THRESHOLD_PERCENT_MARGIN;
double brokerCurrentThroughput = localData.getMsgThroughputIn() + localData.getMsgThroughputOut();
// The minimum amount of throughput that should be offloaded (only an expectation)
double minimumThroughputToOffload = brokerCurrentThroughput * percentOfTrafficToOffload;
if (minimumThroughputToOffload < minThroughputThreshold) { // below the minimum throughput threshold (10 MB by default): do not unload
if (log.isDebugEnabled()) {
log.info("[{}] broker is planning to shed throughput {} MByte/s less than "
+ "minimumThroughputThreshold {} MByte/s, skipping bundle unload.",
broker, minimumThroughputToOffload / MB, minThroughputThreshold / MB);
}
return;
}
log.info(
"Attempting to shed load on {}, which has max resource usage above avgUsage and threshold {}%"
+ " > {}% + {}% -- Offloading at least {} MByte/s of traffic, left throughput {} MByte/s",
broker, currentUsage, avgUsage, threshold, minimumThroughputToOffload / MB,
(brokerCurrentThroughput - minimumThroughputToOffload) / MB);
MutableDouble trafficMarkedToOffload = new MutableDouble(0);
MutableBoolean atLeastOneBundleSelected = new MutableBoolean(false);
if (localData.getBundles().size() > 1) {
loadData.getBundleDataForLoadShedding().entrySet().stream()
.map((e) -> {
String bundle = e.getKey();
BundleData bundleData = e.getValue();
TimeAverageMessageData shortTermData = bundleData.getShortTermData(); // use the short-term data to rank bundles
double throughput = shortTermData.getMsgThroughputIn() + shortTermData.getMsgThroughputOut();
return Pair.of(bundle, throughput);
}).filter(e ->
!recentlyUnloadedBundles.containsKey(e.getLeft()) // only bundles that were not unloaded recently
).filter(e ->
localData.getBundles().contains(e.getLeft())
).sorted((e1, e2) ->
Double.compare(e2.getRight(), e1.getRight())
).forEach(e -> {
// select at least one bundle
if (trafficMarkedToOffload.doubleValue() < minimumThroughputToOffload
|| atLeastOneBundleSelected.isFalse()) {
selectedBundlesCache.put(broker, e.getLeft());
trafficMarkedToOffload.add(e.getRight());
atLeastOneBundleSelected.setTrue();
}
});
} else if (localData.getBundles().size() == 1) {
log.warn(
"HIGH USAGE WARNING : Sole namespace bundle {} is overloading broker {}. "
+ "No Load Shedding will be done on this broker",
localData.getBundles().iterator().next(), broker);
} else {
log.warn("Broker {} is overloaded despite having no bundles", broker);
}
});
return selectedBundlesCache;
}
Pros:
- The cluster ends up more evenly balanced.
Cons:
- If a single topic partition carries an exceptionally heavy load (its load alone exceeds the average), its bundle may keep hopping between brokers. In that case increase the partition count (see the sketch below) or switch to another shedding strategy.
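A hedged sketch of raising the partition count with the Java admin client; the service URL, topic name and partition count are illustrative values:
// Hedged sketch: spread a hot partitioned topic over more partitions so that
// no single bundle has to carry the whole load.
public static void main(String[] args) throws Exception {
    PulsarAdmin admin = PulsarAdmin.builder()
            .serviceHttpUrl("http://localhost:8080")
            .build();
    admin.topics().updatePartitionedTopic("persistent://apache/pulsar/test-topic", 8);
    admin.close();
}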
The actual load balancing process
Back to the doLoadShedding method:
/**
* As the leader broker, select bundles for the namespace service to unload so that they may be reassigned to new
* brokers.
*/
@Override
public synchronized void doLoadShedding() {
if (!LoadManagerShared.isLoadSheddingEnabled(pulsar)) {
return;
}
// Only one broker available: nothing to rebalance
if (getAvailableBrokers().size() <= 1) {
log.info("Only 1 broker available: no load shedding will be performed");
return;
}
// A bundle unloaded within the last loadBalancerSheddingGracePeriodMinutes must not be unloaded again.
// Remove bundles who have been unloaded for longer than the grace period from the recently unloaded map.
final long timeout = System.currentTimeMillis()
- TimeUnit.MINUTES.toMillis(conf.getLoadBalancerSheddingGracePeriodMinutes());
final Map<String, Long> recentlyUnloadedBundles = loadData.getRecentlyUnloadedBundles();
recentlyUnloadedBundles.keySet().removeIf(e -> recentlyUnloadedBundles.get(e) < timeout);
for (LoadSheddingStrategy strategy : loadSheddingPipeline) {
final Multimap<String, String> bundlesToUnload = strategy.findBundlesForUnloading(loadData, conf); // ask the strategy for the bundles to unload
bundlesToUnload.asMap().forEach((broker, bundles) -> {
bundles.forEach(bundle -> {
final String namespaceName = LoadManagerShared.getNamespaceNameFromBundleName(bundle); // namespace name derived from the bundle name
final String bundleRange = LoadManagerShared.getBundleRangeFromBundleName(bundle); // bundle range derived from the bundle name
// Skip the unload if it would violate anti-affinity constraints (related to namespace isolation)
if (!shouldAntiAffinityNamespaceUnload(namespaceName, bundleRange, broker)) {
return;
}
log.info("[{}] Unloading bundle: {} from broker {}",
strategy.getClass().getSimpleName(), bundle, broker);
try {
// Unload the bundle through the admin REST API.
pulsar.getAdminClient().namespaces().unloadNamespaceBundle(namespaceName, bundleRange);
loadData.getRecentlyUnloadedBundles().put(bundle, System.currentTimeMillis());
} catch (PulsarServerException | PulsarAdminException e) {
log.warn("Error when trying to perform load shedding on {} for broker {}", bundle, broker, e);
}
});
});
updateBundleUnloadingMetrics(bundlesToUnload);
}
}
Note: parts of this article reference the book 《深入解析Apache Pulsar》.