Eureka Source Code Analysis -- Service Eviction (5)
Preface
The client proactively tells the server, "Hey, I'm still alive!" But what if the server doesn't hear that greeting from a client for a long time? Eureka server handles it like this: every instance in the server's registry holds a lease that records its latest renewal deadline, so a scheduled thread can scan the registry and compare that deadline against the current time. This way the server's registry never keeps a dead instance around.
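Conceptually, the expiry check boils down to something like the minimal sketch below. This is not Eureka's actual Lease class, only the idea of comparing "last renewal time + lease duration" with the current time; the 90-second duration matches Eureka's default:

public class LeaseSketch {
    private final long durationMs;              // how long a lease stays valid without a heartbeat
    private volatile long lastRenewalTimestamp; // refreshed on every heartbeat

    LeaseSketch(long durationMs) {
        this.durationMs = durationMs;
        this.lastRenewalTimestamp = System.currentTimeMillis();
    }

    void renew() {
        // a heartbeat from the client pushes the deadline forward
        lastRenewalTimestamp = System.currentTimeMillis();
    }

    boolean isExpired(long additionalLeaseMs) {
        // expired once "last renewal + lease duration (+ compensation)" lies behind the current time
        return System.currentTimeMillis() > lastRenewalTimestamp + durationMs + additionalLeaseMs;
    }

    public static void main(String[] args) {
        LeaseSketch lease = new LeaseSketch(90_000L); // 90s, Eureka's default lease duration
        System.out.println(lease.isExpired(0L));      // false: the lease was just created
    }
}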
The scheduled scanning task thread in the server
registry.openForTraffic(applicationInfoManager, registryCount);
Stepping into this line, something rather odd shows up:
// Renewals happen every 30 seconds and for a minute it should be a factor of 2.
//master branch
this.expectedNumberOfClientsSendingRenews = count;
The comment right above this line clearly says the value should be multiplied by 2, yet here the expected value is simply set to the total instance count.
That's because I was reading the master branch. I later checked the 1.4.x line, where the code looks like this:
// Renewals happen every 30 seconds and for a minute it should be a factor of 2.
//the 1.4.x version really does multiply by 2
this.expectedNumberOfRenewsPerMin = count * 2;
Let's set that aside for now; it's just a small digression.
Next, a minimum number of heartbeat renewals per minute is set:
protected void updateRenewsPerMinThreshold() {
this.numberOfRenewsPerMinThreshold = (int) (this.expectedNumberOfClientsSendingRenews
* (60.0 / serverConfig.getExpectedClientRenewalIntervalSeconds())
* serverConfig.getRenewalPercentThreshold());
}
In other words: minimum heartbeats received per minute = total number of instances * (60 / expected client renewal interval in seconds) * renewal percent threshold.
For example, with 5 instances: 5 * (60 / 30) * 0.85 = 8.5, which the (int) cast truncates to 8. So at least 8 heartbeats must arrive per minute, otherwise the self-preservation mechanism kicks in; this threshold is exactly what self-preservation is based on.
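As a quick sanity check of that formula, here is a minimal standalone sketch. The 30-second renewal interval and the 0.85 threshold are assumed defaults hard-coded for the demo, not the real EurekaServerConfig calls:

public class RenewsThresholdDemo {
    // assumed defaults for the demo, mirroring the usual server configuration
    static final int EXPECTED_CLIENT_RENEWAL_INTERVAL_SECONDS = 30;
    static final double RENEWAL_PERCENT_THRESHOLD = 0.85;

    // same shape as updateRenewsPerMinThreshold(): clients * (60 / interval) * percent, truncated by the int cast
    static int renewsPerMinThreshold(int expectedClientsSendingRenews) {
        return (int) (expectedClientsSendingRenews
                * (60.0 / EXPECTED_CLIENT_RENEWAL_INTERVAL_SECONDS)
                * RENEWAL_PERCENT_THRESHOLD);
    }

    public static void main(String[] args) {
        // 5 instances -> 5 * 2 * 0.85 = 8.5, truncated to 8
        System.out.println(renewsPerMinThreshold(5)); // prints 8
    }
}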
The code that follows isn't particularly interesting, until super.postInit() is finally called.
Inside it we find the scheduled task that evicts services:
protected void postInit() {
renewsLastMin.start();
if (evictionTaskRef.get() != null) {
evictionTaskRef.get().cancel();
}
evictionTaskRef.set(new EvictionTask());
evictionTimer.schedule(evictionTaskRef.get(),
serverConfig.getEvictionIntervalTimerInMs(),
serverConfig.getEvictionIntervalTimerInMs());
}
This task is scheduled with an initial delay of one eviction interval and then repeats at that same interval; both values come from getEvictionIntervalTimerInMs(), which defaults to 60 seconds.
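For reference, the same scheduling pattern can be reproduced with a plain java.util.Timer. This is only a sketch: the 60-second value stands in for getEvictionIntervalTimerInMs(), and the printed line stands in for the real EvictionTask:

import java.util.Timer;
import java.util.TimerTask;

public class EvictionSchedulingDemo {
    public static void main(String[] args) throws InterruptedException {
        long evictionIntervalMs = 60 * 1000L; // stands in for serverConfig.getEvictionIntervalTimerInMs()

        Timer evictionTimer = new Timer("Eureka-EvictionTimer", true); // daemon timer, like Eureka's
        evictionTimer.schedule(new TimerTask() {
            @Override
            public void run() {
                // this is where EvictionTask would compute the compensation time and call evict()
                System.out.println("eviction pass at " + System.currentTimeMillis());
            }
        }, evictionIntervalMs, evictionIntervalMs); // initial delay == period

        Thread.sleep(3 * evictionIntervalMs); // keep the demo alive long enough to observe a few passes
    }
}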
What exactly does the scheduled task do?
Once the task fires, it simply runs the run method of the EvictionTask class:
public void run() {
try {
long compensationTimeMs = getCompensationTimeMs();
logger.info("Running the evict task with compensationTime {}ms", compensationTimeMs);
evict(compensationTimeMs);
} catch (Throwable e) {
logger.error("Could not run the evict task", e);
}
}
There is a getCompensationTimeMs() method here, which computes a compensation time:
/**
* compute a compensation time defined as the actual time this task was executed since the prev iteration,
* vs the configured amount of time for execution. This is useful for cases where changes in time (due to
* clock skew or gc for example) causes the actual eviction task to execute later than the desired time
* according to the configured cycle.
*/
long getCompensationTimeMs() {
//get the current time (in nanoseconds)
long currNanos = getCurrentTimeNano();
//get the time of the previous execution and store the current one
long lastNanos = lastExecutionNanosRef.getAndSet(currNanos);
if (lastNanos == 0l) {
return 0l;
}
//elapsed time since the previous run, in milliseconds
long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
//compensation = actual elapsed time minus the configured interval
long compensationTime = elapsedMs - serverConfig.getEvictionIntervalTimerInMs();
return compensationTime <= 0l ? 0l : compensationTime;
}
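A quick worked example of the same arithmetic, with made-up timestamps and 60 seconds assumed as the eviction interval:

import java.util.concurrent.TimeUnit;

public class CompensationTimeDemo {
    public static void main(String[] args) {
        long evictionIntervalMs = 60_000L;                        // stands in for getEvictionIntervalTimerInMs()
        long lastNanos = 0L;                                      // pretend the previous run happened at t = 0
        long currNanos = TimeUnit.MILLISECONDS.toNanos(75_000L);  // this run fired 75s later, e.g. after a long GC pause

        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
        long compensationTime = elapsedMs - evictionIntervalMs;
        // 75s elapsed - 60s expected = 15s of extra slack granted to every lease in this round
        System.out.println(compensationTime <= 0L ? 0L : compensationTime); // prints 15000
    }
}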
Once the compensation time is obtained, the actual eviction of service instances begins.
Evicting instances
Eviction is done by the evict method:
public void evict(long additionalLeaseMs) {
logger.debug("Running the evict task");
if (!isLeaseExpirationEnabled()) {
logger.debug("DS: lease expiration is currently disabled.");
return;
}
// We collect first all expired items, to evict them in random order. For large eviction sets,
// if we do not that, we might wipe out whole apps before self preservation kicks in. By randomizing it,
// the impact should be evenly distributed across all applications.
List<Lease<InstanceInfo>> expiredLeases = new ArrayList<>();
//iterate over the registry
for (Entry<String, Map<String, Lease<InstanceInfo>>> groupEntry : registry.entrySet()) {
Map<String, Lease<InstanceInfo>> leaseMap = groupEntry.getValue();
if (leaseMap != null) {
//iterate over each application's instance leases
for (Entry<String, Lease<InstanceInfo>> leaseEntry : leaseMap.entrySet()) {
Lease<InstanceInfo> lease = leaseEntry.getValue();
//if the last renewal time + lease duration + compensation time is already earlier than the current time, the lease counts as expired
if (lease.isExpired(additionalLeaseMs) && lease.getHolder() != null) {
expiredLeases.add(lease);
}
}
}
}
// To compensate for GC pauses or drifting local time, we need to use current registry size as a base for
// triggering self-preservation. Without that we would wipe out full registry.
int registrySize = (int) getLocalRegistrySize();
int registrySizeThreshold = (int) (registrySize * serverConfig.getRenewalPercentThreshold());
//the maximum number of instances that may be evicted in this round
int evictionLimit = registrySize - registrySizeThreshold;
int toEvict = Math.min(expiredLeases.size(), evictionLimit);
if (toEvict > 0) {
logger.info("Evicting {} items (expired={}, evictionLimit={})", toEvict, expiredLeases.size(), evictionLimit);
Random random = new Random(System.currentTimeMillis());
//evict the expired instances in random order
for (int i = 0; i < toEvict; i++) {
// Pick a random item (Knuth shuffle algorithm)
int next = i + random.nextInt(expiredLeases.size() - i);
Collections.swap(expiredLeases, i, next);
Lease<InstanceInfo> lease = expiredLeases.get(i);
String appName = lease.getHolder().getAppName();
String id = lease.getHolder().getId();
EXPIRED.increment();
logger.warn("DS: Registry: expired lease for {}/{}", appName, id);
//the method that actually performs the removal
internalCancel(appName, id, false);
}
}
}
The eviction step uses the current registry size and the renewal percent threshold to compute an upper bound on how many instances may be evicted in one round. The reasoning is that the server's own network may be what actually failed, so eviction never removes all expired instances at once; otherwise, if the server's network recovered shortly afterwards, a large number of instances would all have to re-register at the same time.
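A small sketch of that limit calculation, assuming the default 0.85 renewal percent threshold and made-up registry numbers:

public class EvictionLimitDemo {
    public static void main(String[] args) {
        double renewalPercentThreshold = 0.85; // stands in for serverConfig.getRenewalPercentThreshold()
        int registrySize = 20;                 // current number of registered leases
        int expiredCount = 10;                 // leases found expired in this round

        int registrySizeThreshold = (int) (registrySize * renewalPercentThreshold); // 17
        int evictionLimit = registrySize - registrySizeThreshold;                   // 3
        int toEvict = Math.min(expiredCount, evictionLimit);                        // only 3 of the 10 get evicted now
        System.out.println(toEvict); // prints 3
    }
}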
The compensation-time design here is actually quite nice: it takes unusual situations such as GC pauses and clock skew into account.
One more thing: when the client performs an incremental registry sync, the data comes from a recentlyChangedQueue on the server. During service registration the final lines of code push an entry into this queue, so in theory every operation on instance information should also touch it. Inside the eviction loop the only place where that can happen is the internalCancel method, whose actual work is shown below (a small sketch of the queue idea follows the method):
internalCancel(appName, id, false)
protected boolean internalCancel(String appName, String id, boolean isReplication) {
//acquire the read lock
read.lock();
try {
CANCEL.increment(isReplication);
//fetch the lease of the instance to be removed
Map<String, Lease<InstanceInfo>> gMap = registry.get(appName);
Lease<InstanceInfo> leaseToCancel = null;
if (gMap != null) {
leaseToCancel = gMap.remove(id);
}
//record the cancellation in the recently canceled queue
recentCanceledQueue.add(new Pair<Long, String>(System.currentTimeMillis(), appName + "(" + id + ")"));
InstanceStatus instanceStatus = overriddenInstanceStatusMap.remove(id);
if (instanceStatus != null) {
logger.debug("Removed instance id {} from the overridden map which has value {}", id, instanceStatus.name());
}
if (leaseToCancel == null) {
CANCEL_NOT_FOUND.increment(isReplication);
logger.warn("DS: Registry: cancel failed because Lease is not registered for: {}/{}", appName, id);
return false;
} else {
//mark the lease as cancelled (sets its eviction timestamp)
leaseToCancel.cancel();
InstanceInfo instanceInfo = leaseToCancel.getHolder();
String vip = null;
String svip = null;
if (instanceInfo != null) {
//mark the action type as DELETED
instanceInfo.setActionType(ActionType.DELETED);
//add it to the recently changed queue (used by delta fetches)
recentlyChangedQueue.add(new RecentlyChangedItem(leaseToCancel));
//update the last-updated timestamp
instanceInfo.setLastUpdatedTimestamp();
vip = instanceInfo.getVIPAddress();
svip = instanceInfo.getSecureVipAddress();
}
invalidateCache(appName, vip, svip);
logger.info("Cancelled instance {}/{} (replication={})", appName, id, isReplication);
}
} finally {
read.unlock();
}
synchronized (lock) {
if (this.expectedNumberOfClientsSendingRenews > 0) {
// Since the client wants to cancel it, reduce the number of clients to send renews.
//then refresh the expected-heartbeats-per-minute threshold
this.expectedNumberOfClientsSendingRenews = this.expectedNumberOfClientsSendingRenews - 1;
updateRenewsPerMinThreshold();
}
}
return true;
}
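As a rough illustration of that queue idea (a hypothetical sketch, not Eureka's actual recentlyChangedQueue handling), every registry mutation appends an entry that a later delta fetch can read:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class RecentlyChangedSketch {
    // hypothetical change record: which instance changed, what happened, and when
    static class Change {
        final String instanceId;
        final String actionType;
        final long timestamp;
        Change(String instanceId, String actionType, long timestamp) {
            this.instanceId = instanceId;
            this.actionType = actionType;
            this.timestamp = timestamp;
        }
    }

    private final Queue<Change> recentlyChangedQueue = new ConcurrentLinkedQueue<>();

    // every mutation (register, cancel, status change) appends here,
    // which is what makes the client's incremental fetch possible
    void onCancel(String instanceId) {
        recentlyChangedQueue.add(new Change(instanceId, "DELETED", System.currentTimeMillis()));
    }

    Iterable<Change> fetchDelta() {
        return recentlyChangedQueue; // a delta request reads this instead of the full registry
    }

    public static void main(String[] args) {
        RecentlyChangedSketch sketch = new RecentlyChangedSketch();
        sketch.onCancel("demo-service/instance-1");
        for (Change c : sketch.fetchDelta()) {
            System.out.println(c.instanceId + " " + c.actionType + " @ " + c.timestamp);
        }
    }
}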
At this point, the server's active eviction of expired, failed instances is complete.
Summary
It's not hard to see that the server-side failure eviction accounts for many situations you wouldn't normally think of, and a careful read shows its handling of them is quite well done.