Building an Enterprise-Grade API Gateway Solo in 50 Days (Part 3): Resilience Design, Rate Limiting, and Degradation

Part 3 of the series | Hand-Rolling a Spring Cloud Gateway in 50 Days: 44 Features and 561 Test Cases in Practice


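Series Navigation

  • Part 1: Architecture Design and Dynamic Routing
  • Part 2: Security Protection and Performance Optimization
  • Part 3: Resilience Design, Rate Limiting, and Degradation ← this article
  • Part 4: End-to-End Observability and AI Copilot
  • Part 5: Kubernetes Deployment and Test Assurance
  • Part 6: Advanced Routing and Load Balancing in Practice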

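1. Resilience Design Overview

1.1 Why Resilience Design?

In a microservice architecture, call chains between services keep getting longer, and a failure at any hop can cascade into a system-wide outage (the avalanche effect). Resilience design has four core goals:

  1. Fail fast: when a downstream service is unavailable, return an error quickly instead of exhausting resources
  2. Graceful degradation: keep core functionality available when some features are down
  3. Automatic recovery: return to normal operation automatically once the fault clears
  4. Rate limiting: keep traffic bursts from overwhelming the system

1.2 Resilience Architecture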

graph TB
    A[Request reaches gateway]
    B[Rate limit check]
    C{Allowed?}
    D[Circuit breaker check]
    E{Breaker state?}
    F[Execute request]
    G{Request succeeded?}
    H[Return response]
    I[Fail fast 429]
    J[Fail fast 503]
    K[Record failure metric]
    L[Record success metric]
    
    A --> B
    B --> C
    C -->|yes| D
    C -->|no| I
    D --> E
    E -->|CLOSED| F
    E -->|OPEN| J
    E -->|HALF_OPEN| F
    F --> G
    G -->|yes| L
    G -->|no| K
    L --> H
    K --> H
    
    style B fill:#e1f5ff
    style D fill:#fff4e1
    style I fill:#ffe1e1
    style J fill:#ffe1e1
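
The decision order in the diagram can be sketched in plain Java. This is only an illustration of the flow, not the gateway's code: the class `ResilienceFlow`, its `handle` method, and the choice to count any status below 500 as a success are assumptions made for the example.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch of the flow: rate limit first, then breaker, then upstream.
final class ResilienceFlow {
    enum BreakerState { CLOSED, OPEN, HALF_OPEN }

    int successes;
    int failures;

    /** Returns the HTTP status the caller would see for one request. */
    int handle(boolean rateLimitAllowed, BreakerState state, IntSupplier upstream) {
        if (!rateLimitAllowed) {
            return 429; // rejected by the rate limiter; the breaker is never consulted
        }
        if (state == BreakerState.OPEN) {
            return 503; // fail fast; the upstream is never called
        }
        // CLOSED and HALF_OPEN both forward the request
        int status = upstream.getAsInt();
        if (status < 500) {
            successes++; // assumption: anything below 500 counts as a success
        } else {
            failures++;
        }
        return status;
    }
}
```

Checking the rate limit before the breaker means a rejected request costs nothing downstream and does not skew the breaker's statistics.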

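2. Circuit Breaker Implementation with Resilience4j

2.1 Circuit Breaker State Machine

The circuit breaker is built on Resilience4j and protects downstream services from cascading failures: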

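stateDiagram-v2
    [*] --> Closed: initial state
    Closed --> Open: failure rate >= 50%
    Open --> HalfOpen: after 30 seconds
    HalfOpen --> Closed: trial requests succeed
    HalfOpen --> Open: trial requests fail
    
    note right of Closed
        Normal state
        All requests are forwarded
        Call outcomes are recorded
    end note
    
    note right of Open
        Open (tripped) state
        All requests are rejected immediately
        503 is returned
    end note
    
    note right of HalfOpen
        Half-open state
        3 trial requests are allowed
        Their results decide the next state
    end note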
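
To make the transitions concrete, here is a minimal, single-threaded sketch of the same state machine in plain Java. It is not how Resilience4j is implemented: `SimpleBreaker` and its methods are hypothetical, a single failed trial call reopens the breaker (as in the diagram), and it does not cap how many trial calls run concurrently.

```java
import java.time.Duration;

// Hypothetical, simplified breaker: illustrates the transitions only.
final class SimpleBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final long waitInOpenMillis;
    private final int permittedTrialCalls;
    private State state = State.CLOSED;
    private long openedAtMillis;
    private int trialSuccesses;

    SimpleBreaker(Duration waitInOpen, int permittedTrialCalls) {
        this.waitInOpenMillis = waitInOpen.toMillis();
        this.permittedTrialCalls = permittedTrialCalls;
    }

    /** Closed -> Open (or HalfOpen -> Open): called when the failure threshold is reached. */
    void trip(long nowMillis) {
        state = State.OPEN;
        openedAtMillis = nowMillis;
    }

    /** Returns true if a call may proceed; moves Open -> HalfOpen once the wait has elapsed. */
    boolean allow(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAtMillis >= waitInOpenMillis) {
            state = State.HALF_OPEN;
            trialSuccesses = 0;
        }
        return state != State.OPEN;
    }

    /** Records the outcome of a trial call; outcomes in other states are ignored here. */
    void record(boolean success, long nowMillis) {
        if (state != State.HALF_OPEN) {
            return;
        }
        if (!success) {
            trip(nowMillis); // HalfOpen -> Open
        } else if (++trialSuccesses >= permittedTrialCalls) {
            state = State.CLOSED; // HalfOpen -> Closed
        }
    }

    State state() {
        return state;
    }
}
```

Passing the clock in as a parameter keeps the sketch deterministic, so the 30-second wait can be exercised without sleeping.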

2.2 Circuit Breaker Configuration

{
  "routeId": "critical-service",
  "failureRateThreshold": 50.0,
  "slowCallDurationThreshold": 60000,
  "slowCallRateThreshold": 80.0,
  "waitDurationInOpenState": 30000,
  "slidingWindowSize": 10,
  "minimumNumberOfCalls": 5,
  "permittedNumberOfCallsInHalfOpenState": 3,
  "enabled": true
}

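Parameters (the defaults listed are this gateway's; several differ from Resilience4j's own library defaults, for example slidingWindowSize defaults to 100 there):

| Parameter | Description | Default |
| --- | --- | --- |
| failureRateThreshold | Failure rate threshold (%) | 50 |
| slowCallDurationThreshold | Slow call duration threshold (ms) | 60000 |
| slowCallRateThreshold | Slow call rate threshold (%) | 80 |
| waitDurationInOpenState | Wait time in the OPEN state (ms) | 30000 |
| slidingWindowSize | Sliding window size | 10 |
| minimumNumberOfCalls | Minimum number of calls before the rates are evaluated | 5 |
| permittedNumberOfCallsInHalfOpenState | Trial calls permitted in the HALF_OPEN state | 3 |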
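
How slidingWindowSize, minimumNumberOfCalls, and failureRateThreshold interact is easiest to see in a small worked example. The sketch below assumes a count-based window and is written from scratch for illustration; `CountWindow` is a hypothetical class, not part of Resilience4j.

```java
// Hypothetical count-based sliding window over the last N call outcomes.
final class CountWindow {
    private final boolean[] failed; // ring buffer; true marks a failed call
    private final int minimumCalls;
    private int recorded;           // outcomes currently in the window
    private int next;               // slot that the next outcome overwrites
    private int failures;

    CountWindow(int slidingWindowSize, int minimumNumberOfCalls) {
        this.failed = new boolean[slidingWindowSize];
        this.minimumCalls = minimumNumberOfCalls;
    }

    void record(boolean failure) {
        if (failed[next]) {
            failures--; // the oldest outcome leaves the window
        }
        failed[next] = failure;
        if (failure) {
            failures++;
        }
        next = (next + 1) % failed.length;
        if (recorded < failed.length) {
            recorded++;
        }
    }

    /** Failure rate in percent, or -1 while fewer than minimumCalls outcomes exist. */
    double failureRate() {
        if (recorded < minimumCalls) {
            return -1;
        }
        return 100.0 * failures / recorded;
    }

    boolean shouldOpen(double failureRateThreshold) {
        double rate = failureRate();
        return rate >= 0 && rate >= failureRateThreshold;
    }
}
```

With the values from the configuration above (window 10, minimum 5, threshold 50), four straight failures do not trip the breaker because the minimum has not been reached; a fifth call makes the rate 80% and trips it.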

2.3 Circuit Breaker Implementation

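import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.github.resilience4j.reactor.circuitbreaker.operator.CircuitBreakerOperator;
import org.springframework.cloud.gateway.filter.GatewayFilterChain;
import org.springframework.cloud.gateway.filter.GlobalFilter;
import org.springframework.core.Ordered;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

// Excerpt: getOrder(), getRouteDefinition(), and buildConfig() are omitted.
// RouteDefinition, RouteManager, and CircuitBreakerOpenException are the project's own types.
@Component
public class CircuitBreakerGlobalFilter implements GlobalFilter, Ordered {
    
    private final CircuitBreakerRegistry circuitBreakerRegistry;
    private final RouteManager routeManager;
    
    public CircuitBreakerGlobalFilter(CircuitBreakerRegistry circuitBreakerRegistry,
                                      RouteManager routeManager) {
        this.circuitBreakerRegistry = circuitBreakerRegistry;
        this.routeManager = routeManager;
    }
    
    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        RouteDefinition route = getRouteDefinition(exchange);
        
        if (!route.getCircuitBreakerConfig().isEnabled()) {
            return chain.filter(exchange);
        }
        
        CircuitBreakerConfig config = buildConfig(route.getCircuitBreakerConfig());
        CircuitBreaker circuitBreaker = circuitBreakerRegistry.circuitBreaker(
            route.getId(), config
        );
        
        // CircuitBreakerOperator acquires a permission before subscribing, so HALF_OPEN
        // admits only the permitted trial calls, and it records the real call duration
        // on success or error, which slow-call detection depends on.
        return chain.filter(exchange)
            .transformDeferred(CircuitBreakerOperator.of(circuitBreaker))
            // A rejected call surfaces as CallNotPermittedException; map it to the gateway's exception
            .onErrorMap(CallNotPermittedException.class,
                e -> new CircuitBreakerOpenException("Circuit breaker is open"));
    }
}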

2.4 Circuit Breaker Error Response

When the circuit breaker is open, the gateway returns a standard error response:

{
  "code": 55301,
  "error": "Service Unavailable",
  "message": "Circuit breaker is open, please try again later",
  "data": null,
  "routeId": "critical-service"
}

3. Timeout Control

3.1 Timeout Implementation

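import java.time.Duration;
import java.util.concurrent.TimeoutException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Excerpt: getOrder() and getRouteDefinition() are omitted.
@Component
public class TimeoutGlobalFilter implements GlobalFilter, Ordered {
    
    private static final Logger log = LoggerFactory.getLogger(TimeoutGlobalFilter.class);
    
    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        RouteDefinition route = getRouteDefinition(exchange);
        long timeout = route.getTimeoutConfig().getResponseTimeout();
        
        // A non-positive value disables the response timeout for this route
        if (timeout <= 0) {
            return chain.filter(exchange);
        }
        
        return chain.filter(exchange)
            .timeout(Duration.ofMillis(timeout))
            // Reactor's timeout() signals java.util.concurrent.TimeoutException
            .onErrorMap(TimeoutException.class, e -> {
                log.warn("Request timeout for route: {}, timeout: {}ms", 
                    route.getId(), timeout);
                return new GatewayTimeoutException("Request timeout");
            });
    }
}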

3.2 Timeout Configuration Example

{
  "routeId": "slow-service",
  "connectTimeout": 5000,
  "responseTimeout": 30000,
  "enabled": true
}
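
The responseTimeout semantics can be tried outside the gateway with nothing but the JDK. The sketch below is an illustration under assumptions: `ResponseTimeout` is a hypothetical helper, the upstream call is modeled as a `Supplier`, and a timeout is mapped to status 504. It uses `CompletableFuture.orTimeout`, which requires Java 9 or later, and it blocks on `join()`, so it is suitable for a demo but not for an EventLoop thread.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper showing response-timeout semantics with plain JDK types.
final class ResponseTimeout {
    /**
     * Runs the upstream call and returns its status, or 504 if it takes longer
     * than responseTimeout. A zero or negative timeout disables the check.
     */
    static int call(Supplier<Integer> upstream, Duration responseTimeout) {
        CompletableFuture<Integer> future = CompletableFuture.supplyAsync(upstream);
        if (!responseTimeout.isZero() && !responseTimeout.isNegative()) {
            future = future.orTimeout(responseTimeout.toMillis(), TimeUnit.MILLISECONDS);
        }
        try {
            return future.join();
        } catch (CompletionException e) {
            if (e.getCause() instanceof TimeoutException) {
                return 504; // assumption: a timeout maps to Gateway Timeout
            }
            throw e;
        }
    }
}
```

As in the filter above, the timeout only stops the caller from waiting; the upstream work itself keeps running unless it is cancelled separately.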

4. Retry Mechanism

4.1 Retry Strategy

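import io.github.resilience4j.reactor.retry.RetryOperator;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import io.github.resilience4j.retry.RetryRegistry;

// Excerpt: getOrder() and getRouteDefinition() are omitted.
@Component
public class RetryGlobalFilter implements GlobalFilter, Ordered {
    
    private final RetryRegistry retryRegistry;
    
    public RetryGlobalFilter(RetryRegistry retryRegistry) {
        this.retryRegistry = retryRegistry;
    }
    
    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        RouteDefinition route = getRouteDefinition(exchange);
        
        if (!route.getRetryConfig().isEnabled()) {
            return chain.filter(exchange);
        }
        
        Retry retry = retryRegistry.retry(route.getId(), buildRetryConfig(route));
        
        // Mono<Void> never emits a value, so retryOnResult cannot observe the response.
        // Instead, turn a retryable status into an error for the exception predicate.
        // Assumes ServiceUnavailableException has a (String) constructor and that the
        // response has not been committed when the chain completes.
        Mono<Void> attempt = Mono.defer(() -> chain.filter(exchange))
            .then(Mono.defer(() -> isRetryableResponse(exchange)
                ? Mono.<Void>error(new ServiceUnavailableException("Retryable upstream status"))
                : Mono.<Void>empty()));
        
        // RetryOperator.apply returns a Publisher, so attach it with transformDeferred
        return attempt.transformDeferred(RetryOperator.of(retry));
    }
    
    private RetryConfig buildRetryConfig(RouteDefinition route) {
        return RetryConfig.custom()
            .maxAttempts(route.getRetryConfig().getMaxAttempts())
            .waitDuration(Duration.ofMillis(route.getRetryConfig().getWaitDuration()))
            .retryOnException(this::isRetryableException)
            .build();
    }
    
    private boolean isRetryableException(Throwable ex) {
        return ex instanceof ConnectTimeoutException 
            || ex instanceof ReadTimeoutException
            || ex instanceof ServiceUnavailableException;
    }
    
    private boolean isRetryableResponse(ServerWebExchange exchange) {
        // The status is null until the response status has been set
        var status = exchange.getResponse().getStatusCode();
        if (status == null) {
            return false;
        }
        int statusCode = status.value();
        return statusCode == 502 || statusCode == 503 || statusCode == 504;
    }
}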

4.2 Retry Configuration Example

{
  "routeId": "unstable-service",
  "maxAttempts": 3,
  "waitDuration": 1000,
  "retryOnStatusCodes": [502, 503, 504],
  "retryOnExceptions": ["ConnectTimeoutException", "ReadTimeoutException"],
  "enabled": true
}
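
What maxAttempts, waitDuration, and retryOnStatusCodes mean together can be shown with a small blocking loop. This is a plain-JDK illustration of the semantics only; `StatusRetry` is a hypothetical helper, and the `Thread.sleep` it uses would not be acceptable on an EventLoop thread. As in Resilience4j, maxAttempts counts the initial call, so 3 means one call plus at most two retries.

```java
import java.util.Set;
import java.util.function.IntSupplier;

// Hypothetical blocking sketch of status-code-based retry semantics.
final class StatusRetry {
    /**
     * Calls the upstream up to maxAttempts times (the first call included),
     * waiting waitMillis between attempts, for as long as the returned status
     * is in retryOn. Returns the status of the last attempt.
     */
    static int call(IntSupplier upstream, int maxAttempts, long waitMillis, Set<Integer> retryOn) {
        int status = upstream.getAsInt();
        for (int attempt = 1; attempt < maxAttempts && retryOn.contains(status); attempt++) {
            try {
                Thread.sleep(waitMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return status; // stop retrying if the thread is interrupted
            }
            status = upstream.getAsInt();
        }
        return status;
    }
}
```

With the configuration above, an upstream that answers 503, 503, 200 is called three times and the caller sees 200; a 404 is returned after a single call because it is not in the retryable set.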

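5. Shadow Quota: Rate-Limit Degradation (Key Highlight)

5.1 Why Shadow Quota?

Traditional rate limiters fall back straight to local rate limiting when Redis fails, which causes:

  1. Abrupt traffic swings: the local limit is usually set low, so a large share of requests is suddenly rejected
  2. Poor user experience: users perceive the service as unavailable
  3. Avalanche risk: the flood of 429 errors triggers client retries, which adds even more load

The core idea of Shadow Quota: when Redis fails, degrade smoothly onto a shadow quota instead, avoiding abrupt swings in admitted traffic.

5.2 Shadow Quota Architecture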

graph TB
    A[Request arrives]
    B{Redis healthy?}
    C[Redis rate limiting]
    D[Local Caffeine rate limiting]
    E[Shadow Quota calculation]
    
    A --> B
    B -->|yes| C
    B -->|no| D
    
    C --> F{Within limit?}
    D --> E
    E --> G[Dynamically adjust local threshold]
    G --> F
    
    F -->|yes| H[Continue request]
    F -->|no| I[429 Too Many Requests]
    
    style C fill:#e1ffe1
    style D fill:#fff4e1
    style E fill:#e1f5ff
    style G fill:#f0e1ff

5.3 How Shadow Quota Works

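import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Excerpt: calculateRejectRate() and recordReject() are omitted.
@Component
public class ShadowQuotaManager {
    
    private static final Logger log = LoggerFactory.getLogger(ShadowQuotaManager.class);
    
    // Quota applied while Redis is healthy
    private volatile int normalQuota = 1000;
    
    // Shadow quota applied while Redis is down (80% of the normal quota)
    private volatile int shadowQuota = 800;
    
    // Redis health flag
    private volatile boolean redisHealthy = true;
    
    // Number of requests rejected while degraded
    private final AtomicLong rejectCount = new AtomicLong(0);
    
    // Quota in effect while ramping back up after Redis recovers
    private final AtomicInteger recoveringQuota = new AtomicInteger(normalQuota);
    
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    private volatile ScheduledFuture<?> recoveryTask;
    
    /**
     * Returns the quota currently in effect.
     */
    public int getEffectiveQuota() {
        if (redisHealthy) {
            // Equals normalQuota except during the ramp that follows a recovery
            return recoveringQuota.get();
        } else {
            // Redis is down: use the shadow quota, adjusted by the reject rate
            double rejectRate = calculateRejectRate();
            return calculateShadowQuota(rejectRate);
        }
    }
    
    /**
     * Computes the shadow quota for the given reject rate.
     */
    private int calculateShadowQuota(double rejectRate) {
        if (rejectRate > 0.5) {
            // Reject rate too high: lower the quota further
            return (int) (shadowQuota * 0.7);
        } else if (rejectRate > 0.2) {
            // Moderate reject rate: keep the shadow quota as is
            return shadowQuota;
        } else {
            // Low reject rate: raise the quota slightly, capped at the normal quota
            return Math.min(normalQuota, (int) (shadowQuota * 1.1));
        }
    }
    
    /**
     * After Redis recovers, ramp back to the normal quota instead of jumping.
     */
    public void onRedisRecovered() {
        // Start the ramp before flipping the flag so the quota never jumps
        gradualQuotaRecovery();
        redisHealthy = true;
    }
    
    private void gradualQuotaRecovery() {
        ScheduledFuture<?> previous = recoveryTask;
        if (previous != null) {
            previous.cancel(false); // a second recovery restarts the ramp
        }
        recoveringQuota.set(shadowQuota);
        int target = normalQuota;
        // Raise the quota by 10% of the normal quota each second (at least 1)
        int step = Math.max(1, target / 10);
        
        recoveryTask = scheduler.scheduleAtFixedRate(() -> {
            int current = recoveringQuota.updateAndGet(q -> Math.min(target, q + step));
            log.info("Gradually recovering quota to: {}", current);
            if (current >= target) {
                recoveryTask.cancel(false); // target reached, stop the ramp
            }
        }, 1, 1, TimeUnit.SECONDS);
    }
}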

5.4 Hybrid Rate Limiter Filter

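// Excerpt: the constructor, getOrder(), getRouteId(), buildRateLimitKey(),
// and rejectWith429() are omitted.
@Component
public class HybridRateLimiterFilter implements GlobalFilter, Ordered {
    
    private final RedisRateLimiter redisRateLimiter;
    private final LocalRateLimiter localRateLimiter;
    private final ShadowQuotaManager shadowQuotaManager;
    private final RedisHealthChecker redisHealthChecker;
    
    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        String routeId = getRouteId(exchange);
        String key = buildRateLimitKey(exchange);
        
        // Check Redis health
        if (!redisHealthChecker.isHealthy()) {
            // Redis is down: local rate limiting with the shadow quota
            return acquireLocally(routeId, key)
                ? chain.filter(exchange)
                : rejectWith429(exchange);
        }
        
        // Redis is healthy: distributed rate limiting
        return redisRateLimiter.isAllowed(routeId, key)
            .map(response -> response.isAllowed())
            // A Redis call can still fail between health checks; if isAllowed
            // signals an error, fall back to the local decision
            .onErrorResume(e -> Mono.fromSupplier(() -> acquireLocally(routeId, key)))
            .flatMap(allowed -> allowed
                ? chain.filter(exchange)
                : rejectWith429(exchange));
    }
    
    private boolean acquireLocally(String routeId, String key) {
        int effectiveQuota = shadowQuotaManager.getEffectiveQuota();
        // Two-argument form, matching LocalRateLimiter.tryAcquire(key, quota) in section 6.2
        boolean allowed = localRateLimiter.tryAcquire(routeId + ":" + key, effectiveQuota);
        if (!allowed) {
            shadowQuotaManager.recordReject();
        }
        return allowed;
    }
}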

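5.5 Degradation Behavior Compared

| Scenario | Traditional approach | Shadow Quota approach |
| --- | --- | --- |
| Redis healthy | 1000 QPS | 1000 QPS |
| Moment Redis fails | Drops straight to 500 QPS | Steps down smoothly to 800 QPS |
| Reject rate < 20% | Stays at 500 QPS | Adjusts up to 880 QPS |
| Reject rate > 50% | Stays at 500 QPS | Adjusts down to 560 QPS |
| Redis recovers | Jumps back to 1000 QPS | Gradual recovery (+10% per second) |
| User experience | Sudden burst of 429s | Smooth transition that most users do not notice |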
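
The Shadow Quota figures in the table follow directly from the 0.7 and 1.1 multipliers applied to the 800 QPS shadow quota. The sketch below restates that arithmetic as a pure function so it can be checked; `ShadowQuotaMath` is a hypothetical class, and passing the reject rate as a parameter is a simplification for the example.

```java
// Hypothetical pure-function version of the shadow quota calculation.
final class ShadowQuotaMath {
    static int effectiveQuota(boolean redisHealthy, double rejectRate,
                              int normalQuota, int shadowQuota) {
        if (redisHealthy) {
            return normalQuota;
        }
        if (rejectRate > 0.5) {
            return (int) (shadowQuota * 0.7);   // 800 -> 560
        }
        if (rejectRate > 0.2) {
            return shadowQuota;                 // 800 stays 800
        }
        // 800 -> 880, never above the normal quota
        return Math.min(normalQuota, (int) (shadowQuota * 1.1));
    }
}
```

Note that a reject rate between 20% and 50% leaves the quota at 800, a band the table does not list.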

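6. Non-Blocking Lock Design

6.1 Protecting the EventLoop

In reactive programming, a blocking operation stalls the EventLoop thread and degrades the performance of the whole gateway. We use a non-blocking lock instead: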

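import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;
import reactor.core.publisher.Mono;

public class NonBlockingLock {
    
    // Interval between attempts while waiting (assumed value)
    private static final Duration RETRY_INTERVAL = Duration.ofMillis(10);
    
    private final AtomicBoolean locked = new AtomicBoolean(false);
    
    /**
     * Tries to acquire the lock without blocking.
     */
    public boolean tryLock() {
        return locked.compareAndSet(false, true);
    }
    
    /**
     * Releases the lock.
     */
    public void unlock() {
        locked.set(false);
    }
    
    /**
     * Tries to acquire the lock, retrying without blocking until the timeout
     * elapses. Emits true if acquired, false if the timeout passed first.
     */
    public Mono<Boolean> tryLock(Duration timeout) {
        return Mono.defer(() -> attempt(System.nanoTime() + timeout.toNanos()));
    }
    
    private Mono<Boolean> attempt(long deadlineNanos) {
        if (tryLock()) {
            return Mono.just(true);
        }
        if (System.nanoTime() - deadlineNanos >= 0) {
            return Mono.just(false);
        }
        // Mono.delay waits on a timer thread, so the calling thread is never parked
        return Mono.delay(RETRY_INTERVAL)
            .flatMap(tick -> attempt(deadlineNanos));
    }
}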
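
The property that matters here is that a contended acquire returns immediately instead of parking the thread. The following plain-JDK sketch demonstrates that with a compare-and-set guard; `CasGuard` and `runIfFree` are hypothetical names used only for this illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical guard: runs a task only if nobody else holds it, and never waits.
final class CasGuard {
    private final AtomicBoolean busy = new AtomicBoolean(false);

    /** Returns true if the task ran, false if the guard was already held. */
    boolean runIfFree(Runnable task) {
        if (!busy.compareAndSet(false, true)) {
            return false; // contended: return at once rather than wait
        }
        try {
            task.run();
            return true;
        } finally {
            busy.set(false); // always release, even if the task throws
        }
    }
}
```

Because the guard is not reentrant, a nested `runIfFree` call made while the guard is held returns false, which is the same answer a second thread would get under contention.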

6.2 Using the Lock in the Rate Limiter

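public class LocalRateLimiter {
    
    private static final Logger log = LoggerFactory.getLogger(LocalRateLimiter.class);
    
    private final NonBlockingLock lock = new NonBlockingLock();
    private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();
    // Current one-second window (epoch seconds); only touched while holding the lock
    private long windowSecond;
    
    public boolean tryAcquire(String key, int quota) {
        // The non-blocking lock guards the window reset and the counter update
        if (lock.tryLock()) {
            try {
                // Assumed fixed one-second window: start fresh each second so the
                // quota applies per second rather than to the process lifetime
                long nowSecond = System.currentTimeMillis() / 1000;
                if (nowSecond != windowSecond) {
                    windowSecond = nowSecond;
                    counters.clear();
                }
                
                AtomicLong counter = counters.computeIfAbsent(key, k -> new AtomicLong(0));
                long count = counter.incrementAndGet();
                
                if (count <= quota) {
                    return true;
                } else {
                    counter.decrementAndGet();
                    return false;
                }
            } finally {
                lock.unlock();
            }
        } else {
            // Lock not acquired: let the request through rather than block.
            // This fails open, so under heavy contention the effective limit is looser than quota.
            log.warn("Lock acquisition failed, allowing request to avoid blocking");
            return true;
        }
    }
}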

7. Multi-Dimensional Rate Limiting

7.1 Rate Limiting Dimensions

Hierarchical rate limiting is supported: global → tenant → user → IP

{
  "routeId": "tenant-api",
  "enabled": true,
  "dimensions": [
    {
      "type": "GLOBAL",
      "qps": 10000,
      "burstCapacity": 20000
    },
    {
      "type": "TENANT",
      "qps": 1000,
      "burstCapacity": 2000,
      "keyExtractor": "X-Tenant-Id"
    },
    {
      "type": "USER",
      "qps": 100,
      "burstCapacity": 200,
      "keyExtractor": "X-User-Id"
    },
    {
      "type": "IP",
      "qps": 50,
      "burstCapacity": 100
    }
  ]
}
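
One way to read this configuration is as a stack of token buckets, where qps is the refill rate and burstCapacity is the bucket size, and a request must pass every layer from the widest scope to the narrowest. The article does not say which algorithm the gateway uses, so the token bucket here is an assumption, and `TokenBucket` and `LayeredLimiter` are hypothetical classes.

```java
import java.util.List;

// Hypothetical token bucket: "qps" is the refill rate, "burstCapacity" the size.
final class TokenBucket {
    private final double refillPerSecond;
    private final double capacity;
    private double tokens;
    private long lastNanos;

    TokenBucket(double qps, double burstCapacity, long nowNanos) {
        this.refillPerSecond = qps;
        this.capacity = burstCapacity;
        this.tokens = burstCapacity; // start full
        this.lastNanos = nowNanos;
    }

    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSeconds = (nowNanos - lastNanos) / 1e9;
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastNanos = nowNanos;
        if (tokens < 1) {
            return false;
        }
        tokens -= 1;
        return true;
    }

    /** Gives one token back, used when a narrower layer rejects the request. */
    synchronized void refund() {
        tokens = Math.min(capacity, tokens + 1);
    }

    synchronized double available() {
        return tokens;
    }
}

final class LayeredLimiter {
    /** Checks layers in order (global first); every layer must admit the request. */
    static boolean tryAcquire(List<TokenBucket> layers, long nowNanos) {
        for (int i = 0; i < layers.size(); i++) {
            if (!layers.get(i).tryAcquire(nowNanos)) {
                for (int j = 0; j < i; j++) {
                    layers.get(j).refund(); // undo tokens taken by wider layers
                }
                return false;
            }
        }
        return true;
    }
}
```

Refunding the wider layers on rejection keeps one abusive IP from draining the tenant or global budget with requests that were never served.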

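7.2 Rate Limit Key Types

| Key type | Description | Use case |
| --- | --- | --- |
| ip | Per client IP | Prevent abuse from a single IP |
| route | Shared across the route | Global protection |
| combined | Route + IP | Fine-grained control |
| header | By header value | Per-user or per-tenant limiting |
| user | By user ID | Limiting after authentication |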
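
As an illustration of how these key types might map to limiter keys, here is a small sketch. The key prefixes, the `RateLimitKeys` class, and the parameter list are all assumptions for the example; the gateway's actual key format is not shown in this article.

```java
// Hypothetical key builder; the prefixes and layout are illustrative only.
final class RateLimitKeys {
    enum KeyType { IP, ROUTE, COMBINED, HEADER, USER }

    static String build(KeyType type, String routeId, String clientIp,
                        String headerValue, String userId) {
        switch (type) {
            case IP:
                return "ip:" + clientIp;
            case ROUTE:
                return "route:" + routeId;
            case COMBINED:
                return "route:" + routeId + ":ip:" + clientIp;
            case HEADER:
                return "header:" + headerValue;
            case USER:
                return "user:" + userId;
            default:
                throw new IllegalArgumentException("Unknown key type: " + type);
        }
    }
}
```

Requests that produce the same key share one counter, so the key type decides how widely a limit is shared: `route` pools every caller of a route, while `combined` gives each IP its own budget per route.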

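8. Project Screenshots

8.1 Policy Configuration

Figures 18-20 (18.png, 19.png, 20.png): the policy configuration UI. Figures 18-19 show the supported policy types (IP allowlist/denylist, rate limiting, circuit breaking, timeout, retry, degradation, and others); Figure 20 shows an authentication configuration bound to a specific route.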


9. Core Source File Index

| Feature | File path | Description |
| --- | --- | --- |
| Circuit breaker filter | my-gateway/src/main/java/com/leoli/gateway/filter/resilience/CircuitBreakerGlobalFilter.java | Resilience4j circuit breaker |
| Timeout filter | my-gateway/src/main/java/com/leoli/gateway/filter/resilience/TimeoutGlobalFilter.java | Timeout control |
| Retry filter | my-gateway/src/main/java/com/leoli/gateway/filter/resilience/RetryGlobalFilter.java | Retry mechanism |
| Hybrid rate limiter filter | my-gateway/src/main/java/com/leoli/gateway/filter/ratelimit/HybridRateLimiterFilter.java | Redis + local hybrid rate limiting |
| Multi-dimensional rate limiter filter | my-gateway/src/main/java/com/leoli/gateway/filter/ratelimit/MultiDimRateLimiterFilter.java | Global/tenant/user/IP rate limiting |
| Shadow Quota manager | my-gateway/src/main/java/com/leoli/gateway/limiter/ShadowQuotaManager.java | Shadow quota degradation |
| Redis health checker | my-gateway/src/main/java/com/leoli/gateway/limiter/RedisHealthChecker.java | Redis health monitoring |
| Local rate limiter | my-gateway/src/main/java/com/leoli/gateway/limiter/LocalRateLimiter.java | Caffeine local rate limiting |
| Non-blocking lock | my-gateway/src/main/java/com/leoli/gateway/util/NonBlockingLock.java | EventLoop protection |

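10. Summary and Preview

Summary

This article covered resilience design, rate limiting, and degradation in an enterprise-grade API gateway:

  1. Circuit breaker: a Resilience4j state machine that protects downstream services
  2. Timeout control: connect-timeout and response-timeout configuration
  3. Retry: a configurable retry policy driven by status codes and exceptions
  4. Shadow Quota degradation: the key highlight; degrades smoothly when Redis fails and avoids abrupt traffic swings
  5. Non-blocking lock design: keeps the EventLoop thread from blocking
  6. Multi-dimensional rate limiting: hierarchical limits from global → tenant → user → IP

Next Up

Part 4: End-to-End Observability and AI Copilot

  • Real-time monitoring (JVM, GC, CPU, HTTP metrics)
  • Distributed tracing (Jaeger integration)
  • Filter chain performance analysis (P50/P95/P99)
  • AI Copilot in practice (35+ tools)
    • Case 1: AI-assisted 404 debugging (overturning the user's conclusion)
    • Case 2: AI load-test analysis (based on Prometheus and the database)
    • Case 3: AI analysis of a disabled route

Stay tuned!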



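About the Author

李朝, gateway developer with 7+ years of experience in distributed systems, focused on API gateways, microservice architecture, and cloud-native technology.

Built an enterprise-grade API gateway platform solo in 50 days, covering 44 core features and 561 test cases, from architecture design through production deployment.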


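Professional Services

If you need to build a similar API gateway or microservice platform, I offer the following services:

  • Custom API gateway development: gateway features built to your business requirements
  • Architecture design and consulting: microservice architecture design, technology selection, performance optimization
  • Performance tuning: JVM tuning, connection pool optimization, rate limiting and degradation design
  • AI integration: AI Copilot development, intelligent operations, automated diagnostics

Contact

Need help with an API gateway or microservice architecture? Reach out by email or on Upwork for technical consulting and custom development.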