Exploring Nacos: Why Instance Weight Changes Take Effect With a Delay

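While setting up zero-downtime releases with Jenkins recently, I noticed that after changing a service instance's weight in the Nacos registry from 1 to 0, clients kept calling the weight-0 instance for more than 30 seconds before they stopped.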

Spring Cloud LoadBalancer cache: source code analysis

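Spring provides a load balancer implementation, Spring Cloud LoadBalancer. Tracing one of its load-balancing strategies in the debugger shows how it obtains the service instances registered in Nacos.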

Start with the source of RoundRobinLoadBalancer.java:

	@Override
	// see original
	// https://github.com/Netflix/ocelli/blob/master/ocelli-core/
	// src/main/java/netflix/ocelli/loadbalancer/RoundRobinLoadBalancer.java
	public Mono<Response<ServiceInstance>> choose(Request request) {
		ServiceInstanceListSupplier supplier = serviceInstanceListSupplierProvider
				.getIfAvailable(NoopServiceInstanceListSupplier::new);
		return supplier.get(request).next()
				.map(serviceInstances -> processInstanceResponse(supplier, serviceInstances));
	}
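The choose method only picks from whatever list the supplier hands it, so a stale list leads directly to calls on the instance that should have been drained. Here is a minimal round-robin sketch to make that concrete. It is not the Spring implementation: the class name RoundRobinSketch and the use of plain strings for instances are assumptions made for illustration.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinSketch {

    // Hypothetical rotation counter shared by all calls.
    private final AtomicInteger position = new AtomicInteger(0);

    /** Returns the next instance in rotation, or null when the list is empty. */
    public String choose(List<String> instances) {
        if (instances.isEmpty()) {
            return null;
        }
        // Clear the sign bit so the index stays non-negative after int overflow.
        int pos = position.getAndIncrement() & Integer.MAX_VALUE;
        return instances.get(pos % instances.size());
    }

    public static void main(String[] args) {
        RoundRobinSketch lb = new RoundRobinSketch();
        // A cached list that still contains an instance whose weight is already 0 upstream.
        List<String> cached = List.of("10.0.0.1:8080", "10.0.0.2:8080");
        for (int i = 0; i < 4; i++) {
            System.out.println(lb.choose(cached));
        }
    }
}
```

The balancer itself knows nothing about weights here; as long as the address is in the list, it gets its turn.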

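Stepping through in the debugger leads into CachingServiceInstanceListSupplier.java. It first tries to read the service instances from a local cache; only on a cache miss does it go through DiscoveryClientServiceInstanceListSupplier.java to fetch the latest instances from Nacos.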

	public CachingServiceInstanceListSupplier(ServiceInstanceListSupplier delegate, CacheManager cacheManager) {
		super(delegate);
		this.serviceInstances = CacheFlux.lookup(key -> {
			// TODO: configurable cache name
			// 1. First try to read the service instances from the cache
			Cache cache = cacheManager.getCache(SERVICE_INSTANCE_CACHE_NAME);
			if (cache == null) {
				if (log.isErrorEnabled()) {
					log.error("Unable to find cache: " + SERVICE_INSTANCE_CACHE_NAME);
				}
				return Mono.empty();
			}
			List<ServiceInstance> list = cache.get(key, List.class);
			if (list == null || list.isEmpty()) {
				return Mono.empty();
			}
			return Flux.just(list).materialize().collectList();
			// 2. On a cache miss (for example after expiry), fetch the latest instances from Nacos and write them to the cache
		}, delegate.getServiceId()).onCacheMissResume(delegate.get().take(1))
				.andWriteWith((key, signals) -> Flux.fromIterable(signals).dematerialize().doOnNext(instances -> {
					Cache cache = cacheManager.getCache(SERVICE_INSTANCE_CACHE_NAME);
					if (cache == null) {
						if (log.isErrorEnabled()) {
							log.error("Unable to find cache for writing: " + SERVICE_INSTANCE_CACHE_NAME);
						}
					}
					else {
						cache.put(key, instances);
					}
				}).then());
	}
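The lookup, miss, and write-back flow above can be reduced to a small stand-alone sketch. It uses a plain map and an injectable clock instead of Reactor and Caffeine, so every name in it (TtlCachingSupplierSketch, the string instances, the "demo" service id) is an assumption for illustration, not the library's API.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

public class TtlCachingSupplierSketch {

    /** A cached instance list plus the time it was written. */
    private static final class Entry {
        final List<String> instances;
        final long writtenAtMillis;

        Entry(List<String> instances, long writtenAtMillis) {
            this.instances = instances;
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final Supplier<List<String>> delegate; // stands in for the discovery lookup
    private final long ttlMillis;
    private final LongSupplier clock;              // injectable so the demo needs no sleeping

    public TtlCachingSupplierSketch(Supplier<List<String>> delegate, long ttlMillis, LongSupplier clock) {
        this.delegate = delegate;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    public List<String> get(String serviceId) {
        long now = clock.getAsLong();
        Entry entry = cache.get(serviceId);
        boolean fresh = entry != null && now - entry.writtenAtMillis < ttlMillis;
        if (fresh && !entry.instances.isEmpty()) {
            return entry.instances; // cache hit: the delegate is not consulted
        }
        List<String> latest = delegate.get(); // cache miss: reload and write back
        cache.put(serviceId, new Entry(latest, now));
        return latest;
    }

    public static void main(String[] args) {
        long[] now = {0L};
        AtomicReference<List<String>> registry = new AtomicReference<>(List.of("a:8080", "b:8080"));
        TtlCachingSupplierSketch supplier =
                new TtlCachingSupplierSketch(registry::get, 35_000L, () -> now[0]);

        System.out.println(supplier.get("demo")); // fills the cache
        registry.set(List.of("a:8080"));          // b is drained upstream
        now[0] = 30_000L;
        System.out.println(supplier.get("demo")); // still the stale cached list
        now[0] = 35_000L;
        System.out.println(supplier.get("demo")); // entry expired, list reloaded
    }
}
```

The point of the sketch is the middle call: until the entry written at time 0 expires, the upstream change is invisible to callers.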

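Debugging also shows that the cacheManager above is implemented by CaffeineBasedLoadBalancerCacheManager. It takes a LoadBalancerCacheProperties object, whose spring.cloud.loadbalancer.cache.ttl property controls when cache entries expire; the default is 35 seconds.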

public class CaffeineBasedLoadBalancerCacheManager extends CaffeineCacheManager implements LoadBalancerCacheManager {

	public CaffeineBasedLoadBalancerCacheManager(String cacheName, LoadBalancerCacheProperties properties) {
		super(cacheName);
		if (!StringUtils.isEmpty(properties.getCaffeine().getSpec())) {
			setCacheSpecification(properties.getCaffeine().getSpec());
		}
		else {
			setCaffeine(Caffeine.newBuilder().initialCapacity(properties.getCapacity())
					.expireAfterWrite(properties.getTtl()).softValues());
		}

	}

	public CaffeineBasedLoadBalancerCacheManager(LoadBalancerCacheProperties properties) {
		this(SERVICE_INSTANCE_CACHE_NAME, properties);
	}

}

@ConfigurationProperties("spring.cloud.loadbalancer.cache")
public class LoadBalancerCacheProperties {

	private Caffeine caffeine = new Caffeine();

	/**
	 * Time To Live - time counted from writing of the record, after which cache entries
	 * are expired, expressed as a {@link Duration}. The property {@link String} has to be
	 * in keeping with the appropriate syntax as specified in Spring Boot
	 * <code>StringToDurationConverter</code>.
	 * @see <a href=
	 * "https://github.com/spring-projects/spring-boot/blob/master/spring-boot-project/spring-boot/src/main/java/org/springframework/boot/convert/StringToDurationConverter.java">StringToDurationConverter.java</a>
	 */
	private Duration ttl = Duration.ofSeconds(35);

	/**
	 * Initial cache capacity expressed as int.
	 */
	private int capacity = 256;

	public Caffeine getCaffeine() {
		return caffeine;
	}

	public void setCaffeine(Caffeine caffeine) {
		this.caffeine = caffeine;
	}

	public Duration getTtl() {
		return ttl;
	}

	public void setTtl(Duration ttl) {
		this.ttl = ttl;
	}

	public int getCapacity() {
		return capacity;
	}

	public void setCapacity(int capacity) {
		this.capacity = capacity;
	}

	/**
	 * Caffeine-specific LoadBalancer cache properties. NOTE: Passing your own Caffeine
	 * specification will override any other LoadBalancerCache settings, including TTL.
	 */
	public static class Caffeine {

		/**
		 * The spec to use to create caches. See CaffeineSpec for more details on the spec
		 * format.
		 */
		private String spec = "";

		public String getSpec() {
			return spec;
		}

		public void setSpec(String spec) {
			this.spec = spec;
		}

	}

}

NacosDiscoveryClient: source code analysis

NacosDiscoveryClient.java (nacos-client:1.4.2)

	public List<ServiceInstance> getInstances(String serviceId) {
		try {
			return Optional.of(serviceDiscovery.getInstances(serviceId)).map(instances -> {
						ServiceCache.setInstances(serviceId, instances);
						return instances;
					}).get();
		}
		catch (Exception e) {
			if (failureToleranceEnabled) {
				return ServiceCache.getInstances(serviceId);
			}
			throw new RuntimeException(
					"Can not get hosts from nacos server. serviceId: " + serviceId, e);
		}
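	}

	// NacosServiceDiscovery#getInstances, which the serviceDiscovery.getInstances(serviceId) call above delegates to
	public List<ServiceInstance> getInstances(String serviceId) throws NacosException {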
		String group = discoveryProperties.getGroup();
		List<Instance> instances = namingService().selectInstances(serviceId, group,
				true);
		return hostToServiceInstanceList(instances, serviceId);
	}

NacosNamingService#selectInstances

Because subscribe is true, the call goes through HostReactor in subscribe mode:

    @Override
    public List<Instance> selectInstances(String serviceName, String groupName, List<String> clusters, boolean healthy,
            boolean subscribe) throws NacosException {
        
        ServiceInfo serviceInfo;
        if (subscribe) {
            serviceInfo = hostReactor.getServiceInfo(NamingUtils.getGroupedName(serviceName, groupName),
                    StringUtils.join(clusters, ","));
        } else {
            serviceInfo = hostReactor
                    .getServiceInfoDirectlyFromServer(NamingUtils.getGroupedName(serviceName, groupName),
                            StringUtils.join(clusters, ","));
        }
        return selectInstances(serviceInfo, healthy);
    }
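The excerpt ends by calling the private selectInstances(serviceInfo, healthy) overload, whose body is not shown. As far as I recall from the 1.4.x client, that overload drops instances whose health flag does not match, that are disabled, or whose weight is 0 or less, which would be why a weight-0 instance disappears from the list at all. Treat that as an assumption. The sketch below shows the filter with a hypothetical Inst class standing in for the Nacos Instance model.

```java
import java.util.ArrayList;
import java.util.List;

public class SelectInstancesSketch {

    /** Hypothetical stand-in for the Nacos Instance model; only the fields used here. */
    static final class Inst {
        final String ip;
        final boolean healthy;
        final boolean enabled;
        final double weight;

        Inst(String ip, boolean healthy, boolean enabled, double weight) {
            this.ip = ip;
            this.healthy = healthy;
            this.enabled = enabled;
            this.weight = weight;
        }
    }

    /** Keeps instances that match the requested health flag, are enabled, and have weight > 0. */
    static List<Inst> selectInstances(List<Inst> hosts, boolean healthy) {
        List<Inst> selected = new ArrayList<>();
        for (Inst inst : hosts) {
            if (inst.healthy == healthy && inst.enabled && inst.weight > 0) {
                selected.add(inst);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Inst> hosts = List.of(
                new Inst("10.0.0.1", true, true, 1.0),
                new Inst("10.0.0.2", true, true, 0.0),   // weight set to 0 before a release
                new Inst("10.0.0.3", false, true, 1.0)); // unhealthy
        for (Inst inst : selectInstances(hosts, true)) {
            System.out.println(inst.ip);
        }
    }
}
```

If this matches the real client, the filter only helps once the client's local ServiceInfo has been refreshed, which is what the next part looks at.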

The HostReactor class has an UpdateTask that pulls the instance list of a given service from the Nacos registry (/nacos/v1/ns/instance/list) every 10s (cacheMillis).

        @Override
        public void run() {
            // starts at the 1s default
            long delayTime = DEFAULT_DELAY;
            
            try {
                ServiceInfo serviceObj = serviceInfoMap.get(ServiceInfo.getKey(serviceName, clusters));
                
                if (serviceObj == null) {
                    updateService(serviceName, clusters);
                    return;
                }
                
                if (serviceObj.getLastRefTime() <= lastRefTime) {
                    updateService(serviceName, clusters);
                    serviceObj = serviceInfoMap.get(ServiceInfo.getKey(serviceName, clusters));
                } else {
                    // if serviceName already updated by push, we should not override it
                    // since the push data may be different from pull through force push
                    refreshOnly(serviceName, clusters);
                }
                
                lastRefTime = serviceObj.getLastRefTime();
                
                if (!notifier.isSubscribed(serviceName, clusters) && !futureMap
                        .containsKey(ServiceInfo.getKey(serviceName, clusters))) {
                    // abort the update task
                    NAMING_LOGGER.info("update task is stopped, service:" + serviceName + ", clusters:" + clusters);
                    return;
                }
                if (CollectionUtils.isEmpty(serviceObj.getHosts())) {
                    incFailCount();
                    return;
                }
                // cacheMillis returned by the Nacos server: 10s
                delayTime = serviceObj.getCacheMillis();
                resetFailCount();
            } catch (Throwable e) {
                incFailCount();
                NAMING_LOGGER.warn("[NA] failed to update serviceName: " + serviceName, e);
            } finally {
                executor.schedule(this, Math.min(delayTime << failCount, DEFAULT_DELAY * 60), TimeUnit.MILLISECONDS);
            }
        }
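The rescheduling expression in the finally block is worth working through by hand. The sketch below copies only that arithmetic; DEFAULT_DELAY = 1000 ms follows the "1s" comment in the excerpt, and the class and method names are my own. After a successful refresh failCount has been reset to 0, so the delay is simply cacheMillis. After a failure delayTime is still the 1s default, so the delay doubles with each consecutive failure until it reaches the 60s cap.

```java
public class UpdateTaskDelaySketch {

    // Assumed from the "1s" comment in the excerpt.
    static final long DEFAULT_DELAY = 1000L;

    /** Same expression as in UpdateTask's finally block. */
    static long nextDelay(long delayTime, int failCount) {
        return Math.min(delayTime << failCount, DEFAULT_DELAY * 60);
    }

    public static void main(String[] args) {
        // Successful refresh: failCount was reset, delayTime is cacheMillis (10s).
        System.out.println("success: " + nextDelay(10_000L, 0) + " ms");

        // Consecutive failures: delayTime stays at DEFAULT_DELAY and the shift grows.
        for (int failCount = 1; failCount <= 6; failCount++) {
            System.out.println("failCount=" + failCount + ": " + nextDelay(DEFAULT_DELAY, failCount) + " ms");
        }
    }
}
```

So in the healthy case the Nacos client contributes up to about 10s of staleness on its own, before the LoadBalancer cache is even considered.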

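Summary: both Spring Cloud LoadBalancer and the Nacos client behind NacosDiscoveryClient cache the instance list, so a weight change has to pass through two layers before callers see it: the Nacos client's periodic refresh (every 10s) and the LoadBalancer cache (35s by default). That accounts for the delay of more than 30 seconds. Lowering spring.cloud.loadbalancer.cache.ttl shortens how long the client keeps serving the stale list.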
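As a rough upper bound, the two delays add up: the LoadBalancer cache may be refilled from the Nacos client's local list just before that list is refreshed, and then hold the stale copy for a full TTL. The sketch below does that arithmetic. It assumes the refresh relies on polling only and that no refresh fails; the 5s TTL in the second call is just an example value, not a recommendation from the source.

```java
import java.time.Duration;

public class WorstCaseDelaySketch {

    /** Upper bound on staleness: one full poll interval plus one full cache TTL. */
    static Duration worstCase(Duration lbCacheTtl, Duration nacosPollInterval) {
        return lbCacheTtl.plus(nacosPollInterval);
    }

    public static void main(String[] args) {
        Duration poll = Duration.ofSeconds(10);

        // Defaults discussed above: 35s TTL + 10s poll interval.
        System.out.println(worstCase(Duration.ofSeconds(35), poll).getSeconds() + "s");

        // Hypothetical tuned TTL of 5s.
        System.out.println(worstCase(Duration.ofSeconds(5), poll).getSeconds() + "s");
    }
}
```

With the defaults this gives 45s in the worst case, which is in line with the "more than 30 seconds" observed at the start.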

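More on Spring Cloud LoadBalancer

Spring Cloud LoadBalancer cache

Two cache implementations are provided:

  • Caffeine-backed LoadBalancer Cache: used when Caffeine is on the classpath; suited to cases with high demands on cache performance and expiry management
  • Default LoadBalancer Cache: the fallback when Caffeine is absent; suited to simple load-balancing scenarios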

LoadBalancerCacheAutoConfiguration.java

	@Configuration(proxyBeanMethods = false)
	@ConditionalOnClass({ Caffeine.class, CaffeineCacheManager.class })
	protected static class CaffeineLoadBalancerCacheManagerConfiguration {

		@Bean(autowireCandidate = false)
		@ConditionalOnMissingBean
		LoadBalancerCacheManager caffeineLoadBalancerCacheManager(LoadBalancerCacheProperties cacheProperties) {
			return new CaffeineBasedLoadBalancerCacheManager(cacheProperties);
		}

	}

	@Configuration(proxyBeanMethods = false)
	@Conditional(OnCaffeineCacheMissingCondition.class)
	@ConditionalOnClass(ConcurrentMapWithTimedEviction.class)
	protected static class DefaultLoadBalancerCacheManagerConfiguration {

		@Bean(autowireCandidate = false)
		@ConditionalOnMissingBean
		LoadBalancerCacheManager defaultLoadBalancerCacheManager(LoadBalancerCacheProperties cacheProperties) {
			return new DefaultLoadBalancerCacheManager(cacheProperties);
		}

	}

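Cache configuration properties

  • spring.cloud.loadbalancer.cache.enabled: whether the cache is enabled
  • spring.cloud.loadbalancer.cache.ttl: how long entries stay cached (35s by default)
  • spring.cloud.loadbalancer.cache.capacity: initial cache capacity (256 by default)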

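ServiceInstanceListSupplier implementations

The ServiceInstanceListSupplier interface defines how the list of service instances is obtained.

Spring Cloud ships several ServiceInstanceListSupplier implementations:

  • NoopServiceInstanceListSupplier: a placeholder implementation that does no real work and provides no service instances.
  • HintBasedServiceInstanceListSupplier: filters service instances by a hint. The hint comes from the request (request metadata, request context, and so on) and is used to customize which instances are eligible.
  • HealthCheckServiceInstanceListSupplier: removes unhealthy instances from the list. It decides whether to include each instance based on the result of its health check.
  • RequestBasedStickySessionServiceInstanceListSupplier: implements sticky sessions. It picks a fixed instance based on information in the request, so that follow-up requests from the same client are routed to the same instance.
  • SameInstancePreferenceServiceInstanceListSupplier: prefers the same instance, that is, it tries to select the instance that was used previously whenever it is still available.
  • CachingServiceInstanceListSupplier: caches the service instance list to avoid fetching it from the discovery system on every request. The cached list is reloaded once the cache entry has expired.
  • RetryAwareServiceInstanceListSupplier: works together with the retry mechanism. When a request is retried, it avoids selecting the instance that was just tried if other instances are available.
  • ZonePreferenceServiceInstanceListSupplier: selects instances according to a zone preference, for example preferring instances in a particular zone (such as a data center or geographic region).
  • DiscoveryClientServiceInstanceListSupplier: the implementation based on the Spring Cloud DiscoveryClient interface. DiscoveryClient is Spring Cloud's abstraction for service discovery and provides the ability to fetch service instances from a registry. This supplier usually serves as the delegate of the other suppliers and provides the raw instance data.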
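Most of the suppliers listed above wrap another supplier and post-process its result. The sketch below shows that delegate pattern with a zone filter on top of a base supplier. The InstanceListSupplier interface, the Inst class, and the zone names are all hypothetical simplifications; the real interface is reactive and carries more context.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class SupplierChainSketch {

    /** Hypothetical, simplified instance: an address plus the zone it runs in. */
    static final class Inst {
        final String host;
        final String zone;

        Inst(String host, String zone) {
            this.host = host;
            this.zone = zone;
        }
    }

    /** Hypothetical, non-reactive stand-in for a service instance list supplier. */
    interface InstanceListSupplier extends Supplier<List<Inst>> {
    }

    /** Wraps a delegate and prefers instances in the given zone, falling back to all of them. */
    static InstanceListSupplier zonePreference(InstanceListSupplier delegate, String zone) {
        return () -> {
            List<Inst> all = delegate.get();
            List<Inst> sameZone = all.stream()
                    .filter(inst -> zone.equals(inst.zone))
                    .collect(Collectors.toList());
            return sameZone.isEmpty() ? all : sameZone;
        };
    }

    public static void main(String[] args) {
        // Base supplier: plays the role of the discovery-backed delegate.
        InstanceListSupplier discovery = () -> List.of(
                new Inst("10.0.0.1", "zone-a"),
                new Inst("10.0.0.2", "zone-b"));

        InstanceListSupplier chain = zonePreference(discovery, "zone-a");
        for (Inst inst : chain.get()) {
            System.out.println(inst.host + " " + inst.zone);
        }
    }
}
```

A caching layer slots into the same chain as just another wrapper, which is why the staleness discussed earlier applies no matter which filters sit on top of it.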