Spring Cloud Netflix Eureka的参数调优

1,442 阅读8分钟

“这是我参与更文挑战的11天,活动详情查看: 更文挑战

下面主要分为Client端和Server端两大类进行简述,Eureka的几个核心参数

客户端参数

Client端的核心参数

参数默认值说明
eureka.client.availability-zones告知Client有哪些region以及availability-zones,支持配置修改运行时生效
eureka.client.filter-only-up-instancestrue是否过滤出InstanceStatus为UP的实例
eureka.client.regionus-east-1指定该应用实例所在的region,AWS datacenters适用
eureka.client.register-with-eurekatrue是否将该应用实例注册到Eureka Server
eureka.client.prefer-szme-zone-eurekatrue是否优先使用与该实例处于相同zone的Eureka Server
eureka.client.on-demand-update-status-changetrue是够将本地实例状态更新通过ApplicationInfoManager实时触发同步到Eureka Server
eureka.instance.metadata-map指定应用实例的元数据信息
eureka.instance.prefer-ip-addressfalse指定优先使用ip地址替代host name作为实例的hostName字段值
eureka.instance.lease-expiration-duration-in-seconds90指定Eureka Client间隔多久需要向Eureka Server发送心跳来告知Eureka Server该实例还存活

定时任务参数

参数默认值说明
eureka.client.cache-refresh-executor-thread-pool-size2刷新缓存的CacheRefreshThread的线程池大小
eureka.client.cache-refresh-executor-exponential-back-off-bound10(刷新缓存)调度任务执行超时时下次的调度的延迟时间
reka.client.heartbeat-executor-thread-pool-size2心跳线程HeartBeatThread的线程池大小
eureka.client.heartbeat-executor-exponential-back-off-bound10(心跳执行)调度任务超时时下次的调度的延时时间
eureka.client.registry-fetch-interval-seconds30CacheRefreshThread线程调度频率
eureka.client.eureka-service-url-poll-interval-seconds5*60AsyncResolver.updateTask刷新Eureka Server地址的时间间隔
eureka.client.initial-instance-info-replication-interval-seconds40InstanceInfoReplicator将实例信息变更同步到Eureka Server的初始延时时间
eureka.client.instance-info-replication-interval-seconds30InstanceInfoReplicator将实例信息变更同步到Eureka Server的时间间隔
ureka.instance.lease-renewal-interval-in-seconds30Eureka Client向Eureka Server发送心跳的时间间隔

http参数

Eureka Client底层httpClient与Eureka Server通信,提供的先关参数

参数默认值说明
eureka.client.eureka-server-connect-timeout-seconds5连接超时时间
eureka.client.eureka-server-read-timeout-seconds8读超时时间
eureka.client.eureka-server-total-connections200连接池最大活动连接数
eureka.client.eureka-server-total-connections-per-host50每个host能使用的最大链接数
eureka.client.eureka-connection-idle-timeout-seconds30连接池中链接的空闲时间

服务端端参数

主要包含这几类:基本参数、response cache参数、peer相参数、http参数

基本参数

参数默认值说明
eureka.server.enable-self-perservationtrue是否开启自我保护模式
eureka.server.renewal-percent-threshold0.85指定每分钟需要收到续约次数的阈值
eureka.instance.registry.expected-number-of-renews-per-min1指定每分钟需要接收到的续约次数值,实际该值在其中被写死为count*2,另外也会被更新
eureka.server.renewal-threshold-update-interval-ms15分钟指定updateRenewalThreshold定时任务的调度频率,来动态更新expectedNumberOfRenewsPerMin及numberOfRenewsPerminThreshold值
eureka.server.eviction-interval-timer-in-ms60*1000指定EvictionTask定时任务的调度频率,用于剔除过期的实例

response cache参数

Eureka Server为了提升自身REST API接口的性能,提供了两个缓存:一个是基于ConcurrentMap的readOnlyCacheMap,一个是基于Guava Cache的readWriteCacheMap。其相关参数如下:

参数默认值说明
eureka.server.use-read-only-response-cachetrue是否使用只读的response-cache
eureka.server.response-cache-update-interval-ms30*1000设置CacheUpdateTask的调度时间间隔,用于从readWriteCacheMap更新数据到readOnlyCacheMap。仅仅在eureka.server.use-read-only-response-cache为true的时候生效
eureka.server.response-cache-auto-expiration-in-seconds180设置readWriteCacheMap的expireAfterWrite参数,指定写入多长时间过过期

peer相关参数

参数默认值说明
eureka.server.peer-eureka-nodes-update-interval-ms10分钟指定peersUpdateTask调度的时间间隔,用于从配置文件刷新peerEurekaNodes节点的配置信息('eureka.client.serviceUrl相关zone的配置')
eureka.server.peer-eureka-status-refresh-time-interval-ms30*1000指定更新peer node状态信息的时间间隔

http参数

Eureka Server需要与其他peer节点进行通信,复制实例信息,其底层使用httpClient,提供相关的参数

参数默认值说明
eureka.server.peer-node-connect-timeout-ms200连接超时时间
eureka.server.peer-node-read-timeout-ms200读超时时间
eureka.server.peer-node-total-connections1000连接池最大活动连接数
eureka.server.peer-node-total-connections-per-host500每个host能使用的最大连接数
eureka.server.peer-node-connection-idle-timeout-seconds30连接池中连接的空闲时间

参数调优

常见问题

1.为什么服务下线了,Eureka Server接口返回的信息还会存在?

2.为什么服务上线了,Eureka Client不能及时获取到?

3.为什么会有一下提示:

EMERGENCY!EUREKA MAY BE INCORRECTLY CLAIMING INSTANCES ARE UP WHEN THEY’RE NOT. RENEWALS ARE LESSER THAN THRESHOLD AND HENCE THE INSTANCES ARE NOT BEING EXPIRED JUST TO BE SAFE

解决方法:

1.Eureka Server并不是强一致的,因此registry中会议保留过期的实例信息。原因如下:

  • 应用实例异常挂掉,没能在挂掉之前告知Eureka Server要下线掉该服务实例信息。这个就需要依赖Eureka Server的EvictionTask去剔除。
  • 应用实例下线是有告知Eureka Server下线,但是由于Eureka Server的REST API有response cache,因此需要等待缓存过期才能更新。
  • 由于Eureka Server开启并以入了SELF PRESERVATION(自我保护)模式,导致registry的信息不会因为过期而被剔除掉,直到退出SELF PRESERVATION(自我保护)模式。

针对Client下线而没有通知Eureka Server的问题,可以调整EvictionTask的调度频率,比如把默认的时间间隔60s,调整为5s:

eureka:
  server:
    eviction-interval-timer-in-ms: 5000

针对response cache的问题,可以根据情况考虑关闭readOnlyCacheMap:

eureka:
  server:
    use-read-only-response-cache: false

或者调整readWriteCacheMap的过期时间:

eureka:
  server:
    response-cache-auto-expiration-in-seconds: 60

针对SELF PRESERVATION(自我保护)的问题,在测试环境可以将enable-self-preservation设置为false:

eureka:
  server:
    enable-self-preservation: false

关闭之后会提示:

THE SELF PRESER VAT ION MODE IS TURNED OFF.  THIS MAY NOT PRO TECT INSTANCE EXPIRY IN CASE OF NETWORK/OTHER PROBLEMS.

或者:

RENEWALS ARE LESSER THAN THE THRESHOLD.THE SELF PRESERVATION MODE IS TURNED OFF.THIS MAY  NOT PROTECT INSTANCE EXPIRY IN CASE OF NETWORK/OTHER PROBLEMS.

2.针对新服务上线,Eureka Client获取不及时的问题,在测试环境,可以适当提高client端拉取Server注册信息的频率,例如下面将默认的30s改为5s:

eureka:
  client:
    registry-fetch-interval-seconds: 5

3.在实际生产过程中,经常会有网络抖动等问题造成服务实例与Eureka Server的心跳未能如期保持,但是服务实例本身是健康的,这个时候如果按照租约剔除机制剔除的话,会造成误判无果大范围误判的话,可能导致整个服务注册列表的大部分注册信息被删除,从而没有可用服务。Eureka为了解决这个问题引入了SELF PRESERVATION机制,当最近一分钟接收到的租约次数小于等于指定阈值的话,则关闭租约失效剔除,禁止定时任务失效的实例,从而保护注册信息。

在生产环境下,可以吧renewwalPercentThreshold及leaseRenewalIntervalInSeconds参数调小一点,从而提高触发SELF PRESERVATION机制的阈值。

eureka:
  instance:
    lease-renewal-interval-in-seconds: 10 #默认是30
    renewal-percent-threshold: 0.49       #默认是0.85

监控指标

Eureka内置了基于servo的指标统计,具体在com.netflix.eureka.util.EurekaMonitors。Spring Boot 2.x版本改为使用Micrometer,不再支持Neflix Servo,转而支持Neflix Servo的替代品Neflix Spectator。不过对于Servo,可以通过DefaultMonitorRegistry.getInstance().getRegisteredMonitors来获取所有注册了的Monitor,进而获取其指标值。

//
// Source code recreated from a .class file by IntelliJ IDEA
// (powered by Fernflower decompiler)
//

package com.netflix.eureka.util;

import com.netflix.appinfo.AmazonInfo;
import com.netflix.appinfo.ApplicationInfoManager;
import com.netflix.appinfo.DataCenterInfo;
import com.netflix.appinfo.AmazonInfo.MetaDataKey;
import com.netflix.appinfo.DataCenterInfo.Name;
import com.netflix.servo.DefaultMonitorRegistry;
import com.netflix.servo.annotations.DataSourceType;
import com.netflix.servo.annotations.Monitor;
import com.netflix.servo.monitor.Monitors;
import java.util.concurrent.atomic.AtomicLong;

public enum EurekaMonitors {
    // 自启动以来收到的总续约次数
    RENEW("renewCounter", "Number of total renews seen since startup"),
    // 自启动以来收到的总取消租约次数
    CANCEL("cancelCounter", "Number of total cancels seen since startup"),
    // 自启动以来查询registry的总次数
    GET_ALL_CACHE_MISS("getAllCacheMissCounter", "Number of total registery queries seen since startup"),
    // 自启动以来delta查询registry的总次数
    GET_ALL_CACHE_MISS_DELTA("getAllCacheMissDeltaCounter", "Number of total registery queries for delta seen since startup"),
    // 自启动以来使用remote region查询registry的总次数
    GET_ALL_WITH_REMOTE_REGIONS_CACHE_MISS("getAllWithRemoteRegionCacheMissCounter", "Number of total registry with remote region queries seen since startup"),
  // 自启动以来使用remote region及delta方式查询registry的总次数
  GET_ALL_WITH_REMOTE_REGIONS_CACHE_MISS_DELTA("getAllWithRemoteRegionCacheMissDeltaCounter", "Number of total registry queries for delta with remote region seen since startup"),
    // 自启动以来查询delta的总次数
    GET_ALL_DELTA("getAllDeltaCounter", "Number of total deltas since startup"),
    // 自启动以来传递regions查询delta的总次数
    GET_ALL_DELTA_WITH_REMOTE_REGIONS("getAllDeltaWithRemoteRegionCounter", "Number of total deltas with remote regions since startup"),
    // 自启动以来查询'/{version}/apps'的次数
    GET_ALL("getAllCounter", "Number of total registry queries seen since startup"),
    // 自启动以来传递regions参数查询'/{version}/apps'的次数
    GET_ALL_WITH_REMOTE_REGIONS("getAllWithRemoteRegionCounter", "Number of total registry queries with remote regions, seen since startup"),
    // 自启动以来请求/{version}/apps/{appId}的总次数
    GET_APPLICATION("getApplicationCounter", "Number of total application queries seen since startup"),
    // 自启动以来register的总次数
    REGISTER("registerCounter", "Number of total registers seen since startup"),
    // 自启动以来剔除过期实例的总次数
    EXPIRED("expiredCounter", "Number of total expired leases since startup"),
    // 自启动以来statusUpdate的总次数
    STATUS_UPDATE("statusUpdateCounter", "Number of total admin status updates since startup"),
    // 自启动以来deleteStatusOverride的总次数
    STATUS_OVERRIDE_DELETE("statusOverrideDeleteCounter", "Number of status override removals"),
    // 自启动以来收到cancel请求时对应实例找不到的次数
    CANCEL_NOT_FOUND("cancelNotFoundCounter", "Number of total cancel requests on non-existing instance since startup"),
    // 自启动以来收到renew请求时对应实例找不到的次数
    RENEW_NOT_FOUND("renewNotFoundexpiredCounter", "Number of total renew on non-existing instance since startup"),
    REJECTED_REPLICATIONS("numOfRejectedReplications", "Number of replications rejected because of full queue"),
    FAILED_REPLICATIONS("numOfFailedReplications", "Number of failed replications - likely from timeouts"),
    // 由于开启rate limiter被丢弃的请求数量
    RATE_LIMITED("numOfRateLimitedRequests", "Number of requests discarded by the rate limiter"),
    // 如果开启rate limiter的话,将被丢弃的请求数
    RATE_LIMITED_CANDIDATES("numOfRateLimitedRequestCandidates", "Number of requests that would be discarded if the rate limiter's throttling is activated"),
    // 开启rate limiter时请求全量registry被丢弃的请求数
    RATE_LIMITED_FULL_FETCH("numOfRateLimitedFullFetchRequests", "Number of full registry fetch requests discarded by the rate limiter"),
    // 如果开启rate limiter时请求全量registry将被丢弃的请求数
    RATE_LIMITED_FULL_FETCH_CANDIDATES("numOfRateLimitedFullFetchRequestCandidates", "Number of full registry fetch requests that would be discarded if the rate limiter's throttling is activated");

    private final String name;
    private final String myZoneCounterName;
    private final String description;
    @Monitor(
        name = "count",
        type = DataSourceType.COUNTER
    )
    private final AtomicLong counter = new AtomicLong();
    @Monitor(
        name = "count-minus-replication",
        type = DataSourceType.COUNTER
    )
    private final AtomicLong myZoneCounter = new AtomicLong();

    private EurekaMonitors(String name, String description) {
        this.name = name;
        this.description = description;
        DataCenterInfo dcInfo = ApplicationInfoManager.getInstance().getInfo().getDataCenterInfo();
        if (dcInfo.getName() == Name.Amazon) {
            this.myZoneCounterName = ((AmazonInfo)dcInfo).get(MetaDataKey.availabilityZone) + "." + name;
        } else {
            this.myZoneCounterName = "dcmaster." + name;
        }

    }

    public void increment() {
        this.increment(false);
    }

    public void increment(boolean isReplication) {
        this.counter.incrementAndGet();
        if (!isReplication) {
            this.myZoneCounter.incrementAndGet();
        }

    }

    public String getName() {
        return this.name;
    }

    public String getZoneSpecificName() {
        return this.myZoneCounterName;
    }

    public String getDescription() {
        return this.description;
    }

    public long getCount() {
        return this.counter.get();
    }

    public long getZoneSpecificCount() {
        return this.myZoneCounter.get();
    }

    public static void registerAllStats() {
        EurekaMonitors[] var0 = values();
        int var1 = var0.length;

        for(int var2 = 0; var2 < var1; ++var2) {
            EurekaMonitors c = var0[var2];
            Monitors.registerObject(c.getName(), c);
        }

    }

    public static void shutdown() {
        EurekaMonitors[] var0 = values();
        int var1 = var0.length;

        for(int var2 = 0; var2 < var1; ++var2) {
            EurekaMonitors c = var0[var2];
            DefaultMonitorRegistry.getInstance().unregister(Monitors.newObjectMonitor(c.getName(), c));
        }

    }
}