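High-Concurrency Post-Mortems: The Pitfalls I Hit on an E-Commerce Project, Each One Worth Its Weight in Gold

🔥 Up front: this article does not teach you "what high concurrency is". It is a post-mortem of five high-concurrency incidents I went through first-hand. Each incident comes with a timeline, root cause analysis, fixes, and follow-up actions. These incidents cost me performance bonuses and earned me written self-criticisms, and they also turned me into a competent engineer.

⚠️ Disclaimer: every incident below is real. Some company details have been anonymized, but the problems themselves are real.

1. The Conclusion First: Why Is High Concurrency So Hard?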

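┌─────────────────────────────────────────────────────────────────┐
│               Why high-concurrency bugs are hard                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Low concurrency (single thread):                               │
│  ├─ Execution order is deterministic                            │
│  ├─ State is predictable                                        │
│  └─ Bugs are easy to reproduce                                  │
│                                                                 │
│  High concurrency:                                              │
│  ├─ Execution order is nondeterministic                         │
│  ├─ State changes underneath you                                │
│  ├─ Bugs are hard to reproduce ("passed in test, died in prod") │
│  └─ May only show up in 1% of extreme cases                     │
│                                                                 │
│  Core difficulties:                                             │
│  1. Race conditions: the most common                            │
│  2. Deadlocks: the most severe                                  │
│  3. Memory leaks: the hardest to spot                           │
│  4. Network partitions: the hardest to debug                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘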

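2. Incident 1: Stock Oversold at Midnight on Singles' Day (about ¥500,000 lost)

2.1 Incident timeline

2024-11-11 00:00:15 - Monitoring alert: order-system P99 latency spikes to 8 seconds
2024-11-11 00:00:23 - User reports: order placed successfully, but no goods received
2024-11-11 00:01:30 - Emergency investigation: stock count found to be negative
2024-11-11 00:03:00 - Product pulled from sale; the loss can no longer be undone
2024-11-11 00:10:00 - Manual refunds begin

Loss summary:
├─ Oversold items: 1,200 × ¥420 = ¥504,000
├─ Manual handling: 8 people × 3 hours × ¥200 = ¥4,800
├─ Complaint handling: about ¥20,000
└─ Total: about ¥530,000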

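2.2 Root cause analysis

The faulty code:

/**
 * The stock-deduction code at the time (trimmed)
 * Problem: checking stock and deducting stock are not one atomic operation!
 */
@Service
public class StockService {

    @Autowired
    private StockMapper stockMapper;

    // ❌ Faulty implementation
    @Transactional
    public boolean deductStock(Long productId, Integer count) {
        // 1. Read the stock first (thread A and thread B both read 100)
        Stock stock = stockMapper.selectByProductId(productId);

        // 2. Check whether there is enough stock
        if (stock.getCount() < count) {
            return false;  // insufficient stock
        }

        // ⚠️ This is the problem: both threads pass the check!
        // Thread A: reads 100, passes
        // Thread B: reads 100, passes (thread A has not deducted yet)

        // 3. Deduct the stock
        // Thread A: 100 - 1 = 99
        // Thread B: 100 - 1 = 99 ❌ Wrong! It should be 98
        stockMapper.updateCount(productId, stock.getCount() - count);

        return true;
    }
}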

Concurrency timing diagram:

┌─────────────────────────────────────────────────────────────────┐
│                     Oversell timing diagram                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Thread A (orders 1 item)          Thread B (orders 1 item)    │
│         ↓                                 ↓                     │
│   1. SELECT stock → 100             1. SELECT stock → 100       │
│         ↓                                 ↓                     │
│   2. if (100 >= 1) ✓                2. if (100 >= 1) ✓          │
│         ↓                                 ↓                     │
│   3. waiting...                     3. UPDATE stock=100-1=99    │
│         ↓                                 ↓                     │
│   4. UPDATE stock=100-1=99  ❌      4. commit                   │
│         ↓                                                       │
│   5. commit                                                     │
│                                                                 │
│   Result: 2 items sold, but stock went down by only 1!          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
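
To make the interleaving above concrete, here is a minimal JDK-only sketch with no database involved. `StockRaceDemo`, `replayLostUpdate`, `tryDeduct`, and `sellConcurrently` are hypothetical names, and an `AtomicInteger` stands in for the stock row. The first method replays the diagram step by step in a single thread; the second shows a compare-and-set loop that turns check-and-deduct into one atomic step.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class StockRaceDemo {

    /** Replays the diagram: both "threads" read before either writes. Returns the final stock. */
    static int replayLostUpdate(int initialStock) {
        int stock = initialStock;
        int readByA = stock;                     // step 1, thread A
        int readByB = stock;                     // step 1, thread B
        // step 2: both checks pass against the same stale value
        if (readByB >= 1) stock = readByB - 1;   // step 3, thread B writes
        if (readByA >= 1) stock = readByA - 1;   // step 4, thread A overwrites B's result
        return stock;
    }

    /** Atomic check-and-deduct: retry the CAS until it wins or stock runs out. */
    static boolean tryDeduct(AtomicInteger stock, int count) {
        while (true) {
            int current = stock.get();
            if (current < count) return false;
            if (stock.compareAndSet(current, current - count)) return true;
        }
    }

    /** Fires `buyers` concurrent single-item orders and returns how many succeeded. */
    static int sellConcurrently(AtomicInteger stock, int buyers) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(16);
        AtomicInteger sold = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(buyers);
        for (int i = 0; i < buyers; i++) {
            pool.execute(() -> {
                try {
                    if (tryDeduct(stock, 1)) sold.incrementAndGet();
                } finally {
                    done.countDown();
                }
            });
        }
        done.await(30, TimeUnit.SECONDS);
        pool.shutdown();
        return sold.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("lost update leaves stock at " + replayLostUpdate(100));
        AtomicInteger stock = new AtomicInteger(100);
        System.out.println("sold " + sellConcurrently(stock, 1000) + ", remaining " + stock.get());
    }
}
```

With 100 units and 1,000 buyers, the CAS version sells exactly 100 and leaves 0, while the replayed interleaving ends at 99 after two sales.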

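2.3 Fixes

Option 1: Database optimistic locking (simplest)

// ✅ Correct implementation 1: optimistic, lock-free deduction through a conditional UPDATE
// (the guard against overselling is "count >= #{count}"; the version column is only bumped here,
//  it is not compared, so this is not a strict version-number CAS)
@Transactional
public boolean deductStock_optimistic(Long productId, Integer count) {
    // The condition lives inside the UPDATE itself, so check + deduct is one atomic statement
    // The row is updated only when stock >= count
    int rows = stockMapper.deductWithCondition(productId, count);

    return rows > 0;
}

// Mapper XML
<update id="deductWithCondition">
    UPDATE stock
    SET count = count - #{count},
        version = version + 1
    WHERE product_id = #{productId}
      AND count >= #{count}  -- key point: deduct only when there is enough stock
</update>

// rows == 0 means stock was insufficient, possibly because other threads deducted it first.
// Callers that prefer an exception over a boolean can translate it at the call site:
//     if (!deductStock_optimistic(productId, count)) {
//         throw new StockInsufficientException("Insufficient stock");
//     }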
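
The statement above guards on `count >= #{count}`. The stricter version-number flavor of optimistic locking, which the `version` column hints at, reads the row first and then updates only if the version is unchanged, retrying on conflict. Below is a minimal in-memory sketch of that idea. `StockRow`, `VersionedStockTable`, and `updateIfVersionMatches` are hypothetical stand-ins for the table and for an `UPDATE ... WHERE version = ?` statement; none of them appear in the original code.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Immutable snapshot of one stock row. */
final class StockRow {
    final int count;
    final long version;

    StockRow(int count, long version) {
        this.count = count;
        this.version = version;
    }
}

class VersionedStockTable {
    private final AtomicReference<StockRow> row;

    VersionedStockTable(int initialCount) {
        this.row = new AtomicReference<>(new StockRow(initialCount, 0));
    }

    /** Stand-in for: SELECT count, version FROM stock WHERE product_id = ? */
    StockRow select() {
        return row.get();
    }

    /** Stand-in for: UPDATE stock SET count = ?, version = version + 1 WHERE version = ? */
    synchronized int updateIfVersionMatches(int newCount, long expectedVersion) {
        StockRow current = row.get();
        if (current.version != expectedVersion) return 0;   // another writer got there first
        row.set(new StockRow(newCount, expectedVersion + 1));
        return 1;
    }

    /** Read, check, conditional write; retry a bounded number of times on conflict. */
    boolean deduct(int count, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            StockRow snapshot = select();
            if (snapshot.count < count) return false;        // genuinely out of stock
            if (updateIfVersionMatches(snapshot.count - count, snapshot.version) == 1) {
                return true;
            }
        }
        return false;                                        // gave up after repeated conflicts
    }
}
```

On a hot row most retries lose under heavy contention, which is one reason the single conditional UPDATE is usually the simpler choice for stock deduction.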

Option 2: Redis distributed lock (works across instances, but serializes each product)

// ✅ Correct implementation 2: Redis distributed lock (mutual exclusion across service instances)
@Service
public class StockServiceWithLock {

    @Autowired
    private RedissonClient redisson;

    @Autowired
    private StockMapper stockMapper;

    public boolean deductStock_withLock(Long productId, Integer count) {
        String lockKey = "stock:lock:" + productId;
        RLock lock = redisson.getLock(lockKey);

        try {
            // 1. Acquire the lock (wait at most 5 s; the lease expires after 30 s).
            //    Passing an explicit leaseTime turns off Redisson's watchdog renewal,
            //    so the critical section must finish well within 30 s.
            if (!lock.tryLock(5, 30, TimeUnit.SECONDS)) {
                throw new BusinessException("System busy, please try again later");
            }

            // 2. Read the stock
            Stock stock = stockMapper.selectByProductId(productId);
            if (stock.getCount() < count) {
                return false;
            }

            // 3. Deduct the stock (safe, because we hold the lock)
            stockMapper.updateCount(productId, stock.getCount() - count);
            return true;

        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new BusinessException("System error");
        } finally {
            // 4. Release the lock (must be in finally)
            if (lock.isHeldByCurrentThread()) {
                lock.unlock();
            }
        }
    }
}

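Option 3: Redis atomic script (best performance)

// ✅ Correct implementation 3: atomic check-and-deduct inside Redis (fastest of the three)
@Service
public class StockServiceRedis {

    // StringRedisTemplate keeps keys, values, and script arguments as plain strings,
    // so tonumber() in the script sees "5" rather than a serialized Java object
    @Autowired
    private StringRedisTemplate redisTemplate;

    /**
     * A Lua script runs atomically inside Redis, so the check and the DECRBY
     * cannot interleave with other clients and stock never goes negative.
     * Returns true if the stock was deducted, false if the key is missing or stock is insufficient.
     */
    public boolean deductStock_atomic(Long productId, Integer count) {
        String key = "stock:" + productId;

        // The Lua script guarantees atomicity (check first, then deduct)
        String script =
            "local stock = redis.call('GET', KEYS[1]) " +
            "if stock and tonumber(stock) >= tonumber(ARGV[1]) then " +
            "    redis.call('DECRBY', KEYS[1], ARGV[1]) " +
            "    return 1 " +
            "else " +
            "    return 0 " +
            "end";

        Long result = redisTemplate.execute(
            new DefaultRedisScript<>(script, Long.class),
            List.of(key),
            count.toString()
        );

        return result != null && result == 1;
    }
}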

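2.4 Incident summary

┌─────────────────────────────────────────────────────────────────┐
│                     Lessons from incident 1                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Direct cause:                                                  │
│  Check and deduct were not one atomic operation                 │
│                                                                 │
│  Root causes:                                                   │
│  ├─ No concurrency mindset: "users won't order at once"         │
│  ├─ No load testing                                             │
│  └─ Belief that "fine on one machine means fine in prod"        │
│                                                                 │
│  Fixes:                                                         │
│  ├─ Make stock operations atomic (optimistic lock /             │
│  │  distributed lock / Redis atomic script)                     │
│  ├─ Load testing is mandatory before release                    │
│  └─ Critical endpoints need monitoring and alerts               │
│                                                                 │
│  Follow-up actions:                                             │
│  ├─ Stock deduction must use optimistic locking                 │
│  ├─ Run a load test every day                                   │
│  └─ Alert immediately when stock goes negative                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘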

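3. Incident 2: Coupons Claimed Repeatedly (coupon abusers walked off with ¥300,000)

3.1 Incident timeline

2024-03-15 10:00 - Operations launches the "¥100 coupon for new users" campaign
2024-03-15 10:15 - Monitoring shows 5,000 coupons issued but only 3,000 new users
2024-03-15 10:20 - Campaign entry point shut down as an emergency measure
2024-03-15 11:00 - Data analysis: 2,000 coupons were claimed by the same users through different accounts
2024-03-15 all day - Manual review, coupon revocation, account bans

Loss summary:
├─ Abused coupons: 2,000 × ¥100 = ¥200,000
├─ Manual handling: ¥50,000
├─ Complaint handling: ¥50,000 (some coupons had already been used)
└─ Total: about ¥300,000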

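3.2 Root cause analysis

The faulty code:

/**
 * Coupon claim code (the problematic version)
 * No idempotency control at all!
 */
@Service
public class CouponService {

    @Autowired
    private CouponUserMapper couponUserMapper;

    // ❌ Problem 1: never checks whether the user has already claimed
    public void claimCoupon(Long userId, Long couponId) {
        // Issue the coupon straight away
        CouponUser couponUser = new CouponUser();
        couponUser.setUserId(userId);
        couponUser.setCouponId(couponId);
        couponUser.setStatus(1);  // claimed
        couponUserMapper.insert(couponUser);
    }

    // ❌ Problem 2: no limit on how many coupons can be claimed
    // A user can fire many requests at once with a script

    // ❌ Problem 3: no device/IP restrictions
    // Coupon abusers use virtual machines and proxy IPs
}

// ⚠️ Why was this not caught?
// - Testing only covered the happy path for a single user
// - No concurrency testing
// - No anti-abuse testing of the endpoint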

3.3 Fixes

First line of defense: request de-duplication

/**
 * Fix: Redis-based idempotency control
 */
@Service
public class CouponServiceWithIdempotent {

    @Autowired
    private RedisTemplate<String, String> redis;

    /**
     * Claim a coupon (idempotent version)
     */
    public Result claimCoupon(Long userId, Long couponId) {
        String key = "coupon:claim:" + couponId + ":" + userId;

        // 1. Mark the claim atomically (SET ... NX with an expiry); only the first request succeeds
        Boolean claimed = redis.opsForValue().setIfAbsent(
            key, "1", 24, TimeUnit.HOURS  // no repeat claim within 24 hours
        );

        if (Boolean.FALSE.equals(claimed)) {
            return Result.error("You have already claimed this coupon");
        }

        try {
            // 2. Run the business logic (issue the coupon); doClaimCoupon is not shown here
            return doClaimCoupon(userId, couponId);

        } catch (Exception e) {
            // 3. On failure, delete the key so the user can retry
            redis.delete(key);
            throw e;
        }
    }
}
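
The claim-marker pattern above can be exercised without Redis. In this sketch `ConcurrentHashMap.putIfAbsent` plays the role of the atomic SET NX, and `ClaimGuard`, `issueCoupon`, and `hammer` are hypothetical names introduced only for illustration. It shows that exactly one of many simultaneous requests gets through, and that the marker is rolled back when issuing fails.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiConsumer;

class ClaimGuard {
    private final ConcurrentHashMap<String, Boolean> markers = new ConcurrentHashMap<>();
    private final BiConsumer<Long, Long> issueCoupon;   // hypothetical business step

    ClaimGuard(BiConsumer<Long, Long> issueCoupon) {
        this.issueCoupon = issueCoupon;
    }

    /** Returns true if this call issued the coupon, false if it was a duplicate. */
    boolean claim(long userId, long couponId) {
        String key = "coupon:claim:" + couponId + ":" + userId;
        // putIfAbsent is atomic: exactly one caller sees null for a given key
        if (markers.putIfAbsent(key, Boolean.TRUE) != null) {
            return false;
        }
        try {
            issueCoupon.accept(userId, couponId);
            return true;
        } catch (RuntimeException e) {
            markers.remove(key);   // roll back the marker so the user can retry
            throw e;
        }
    }

    /** Fires `requests` simultaneous claims for one user and counts the successes. */
    static int hammer(ClaimGuard guard, long userId, long couponId, int requests)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        AtomicInteger granted = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(requests);
        for (int i = 0; i < requests; i++) {
            pool.execute(() -> {
                try {
                    if (guard.claim(userId, couponId)) granted.incrementAndGet();
                } finally {
                    done.countDown();
                }
            });
        }
        done.await(30, TimeUnit.SECONDS);
        pool.shutdown();
        return granted.get();
    }
}
```

Note that a per-user marker only stops one account from claiming twice. The same person behind many accounts, as in this incident, still has to be caught by the rule checks and risk control that follow.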

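Second line of defense: business rule checks

/**
 * Fix: multi-dimensional business rule checks
 */
@Service
public class CouponRuleService {

    @Autowired
    private CouponUserMapper couponUserMapper;

    @Autowired
    private UserMapper userMapper;

    @Autowired
    private CouponTemplateMapper couponTemplateMapper;

    /**
     * Check the claim rules.
     * Each check is a read followed by a later insert, so it narrows the window but is not
     * atomic by itself; it relies on the idempotency key in front of it.
     */
    public void checkClaimRules(Long userId, Long couponId) {
        // 1. Is this a new user? (baseline risk-control requirement; isNewUser is not shown here)
        User user = userMapper.selectById(userId);
        if (!isNewUser(user)) {
            throw new BusinessException("This coupon is for new users only");
        }

        // 2. Has the user already claimed this coupon?
        Long count = couponUserMapper.countByUserAndCoupon(userId, couponId);
        if (count > 0) {
            throw new BusinessException("You have already claimed this coupon");
        }

        // 3. Per-user daily limit (how many coupons this user claimed today)
        Long todayCount = couponUserMapper.countTodayByUser(userId);
        if (todayCount >= 5) {
            throw new BusinessException("Daily claim limit reached");
        }

        // 4. Campaign-wide quota
        Long totalClaimed = couponUserMapper.countByCoupon(couponId);
        CouponTemplate template = couponTemplateMapper.selectById(couponId);
        if (totalClaimed >= template.getTotalCount()) {
            throw new BusinessException("All coupons have been claimed");
        }
    }
}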

Third line of defense: risk control system

/**
 * Fix: integrate a risk control system (Alibaba Cloud / Tencent risk control)
 */
@Service
public class RiskControlService {

    @Autowired
    private RiskControlClient riskClient;

    /**
     * Risk check
     */
    public RiskResult check(Long userId, String ip, String deviceId) {
        RiskRequest request = RiskRequest.builder()
            .userId(userId)
            .ip(ip)                         // request IP
            .deviceId(deviceId)             // device fingerprint
            .eventType("COUPON_CLAIM")      // event type
            .build();

        RiskResponse response = riskClient.check(request);

        return RiskResult.builder()
            .pass(response.isPass())
            .score(response.getScore())
            .reason(response.getReason())
            .build();
    }
}

/**
 * Call risk control whenever a coupon is claimed
 */
@Slf4j  // Lombok; provides the `log` field used below
@Aspect
@Component
public class CouponClaimAspect {

    @Autowired
    private RiskControlService riskService;

    // proceed() is declared to throw Throwable, so the advice has to declare it as well
    @Around("execution(* com.xxx.service.CouponService.claimCoupon(..))")
    public Object aroundClaim(ProceedingJoinPoint point) throws Throwable {
        Long userId = (Long) point.getArgs()[0];
        String ip = getClientIp();        // helper that reads the request IP (not shown)
        String deviceId = getDeviceId();  // helper that reads the device fingerprint (not shown)

        // Risk check
        RiskResult result = riskService.check(userId, ip, deviceId);

        if (!result.isPass()) {
            log.warn("Blocked by risk control: userId={}, reason={}", userId, result.getReason());
            throw new BusinessException("Operation blocked, please try again later");
        }

        return point.proceed();
    }
}

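3.4 Incident summary

┌─────────────────────────────────────────────────────────────────┐
│                     Lessons from incident 2                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Direct causes:                                                 │
│  ├─ No idempotency control on the endpoint                      │
│  ├─ No business-rule validation                                 │
│  └─ No risk-control integration                                 │
│                                                                 │
│  Root causes:                                                   │
│  ├─ "Launch now, add risk control later" mindset                │
│  └─ No security review process                                  │
│                                                                 │
│  Fixes:                                                         │
│  ├─ Every write endpoint must be idempotent                     │
│  ├─ Critical flows must pass a risk-control check               │
│  └─ Campaigns need a security review before launch              │
│                                                                 │
│  Follow-up actions:                                             │
│  ├─ Integrate the Alibaba Cloud risk-control SDK                │
│  ├─ Coupon claims go through risk control before issuing        │
│  └─ Real-time alerts on abnormal claim patterns                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘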

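4. Incident 3: Cache Avalanche Takes Down the Whole Service

4.1 Incident timeline

2024-06-20 14:30 - Routine maintenance: Redis gets a security patch and needs a restart
2024-06-20 14:32 - Redis master restart complete, replicas finish syncing
2024-06-20 14:33 - Database alerts start: CPU 95%+
2024-06-20 14:34 - Service-wide timeouts, API response time > 30 seconds
2024-06-20 14:35 - Emergency investigation begins: a flood of requests is falling through to the database
2024-06-20 14:40 - Some non-core services shut down as an emergency measure
2024-06-20 14:50 - Cache warm-up complete, service gradually recovers

Outage duration: 18 minutes
Scope: all core endpoints
Orders lost: about 2,000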

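4.2 Root cause analysis

┌─────────────────────────────────────────────────────────────────┐
│                How the cache avalanche unfolded                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Before the Redis restart:                                     │
│   ├─ Cache hit rate: 100%                                       │
│   ├─ Database load: 5%                                          │
│   └─ API latency: 5 ms                                          │
│                                                                 │
│   During the restart (cache emptied):                           │
│   ├─ Cache hit rate: 0%                                         │
│   ├─ Database load: 100% (every request falls through)          │
│   └─ Database times out, requests pile up                       │
│                                                                 │
│   Avalanche:                                                    │
│   ├─ Tomcat thread pool exhausted                               │
│   ├─ Every endpoint times out                                   │
│   └─ Service goes down                                          │
│                                                                 │
│   Root causes:                                                  │
│   ├─ Redis came back empty and nothing warmed it up             │
│   └─ TTLs had no random offset (keys expire in bulk)            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘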

The faulty code:

// ❌ Faulty code: every cache entry gets the same expiry
@Service
public class ProductService {

    private static final int CACHE_TTL = 3600;  // uniform 1-hour expiry

    public Product getProduct(Long productId) {
        String key = "product:" + productId;

        Product product = redis.get(key);
        if (product != null) {
            return product;
        }

        product = productMapper.selectById(productId);
        redis.setex(key, CACHE_TTL, product);  // same TTL for everything
        return product;
    }
}

4.3 Fixes

Option 1: Add a random offset to the expiry time

// ✅ Correct implementation 1: expiry time with a random offset
@Service
public class ProductServiceFixed {

    private static final int BASE_TTL = 3600;      // base: 1 hour
    private static final int RANDOM_TTL = 300;     // random offset: 0-5 minutes

    public Product getProduct(Long productId) {
        String key = "product:" + productId;

        Product product = redis.get(key);
        if (product != null) {
            return product;
        }

        product = productMapper.selectById(productId);

        // ✅ TTL = base + random offset
        // Keys written together no longer expire together, which avoids the avalanche
        // (ThreadLocalRandom avoids allocating a new Random on every cache miss)
        int ttl = BASE_TTL + ThreadLocalRandom.current().nextInt(RANDOM_TTL);
        redis.setex(key, ttl, product);

        return product;
    }
}
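
To see what the random offset buys, the sketch below writes 10,000 keys at the same instant and counts how many of them expire in the busiest single second, once with a fixed TTL and once with a 0-300 s offset. `ExpirySpread` and its methods are hypothetical names; the 3600/300 values are the ones used above.

```java
import java.util.concurrent.ThreadLocalRandom;

class ExpirySpread {

    /** TTL for one key: base plus a uniform random offset in [0, jitterSeconds). */
    static int jitteredTtl(int baseSeconds, int jitterSeconds) {
        int offset = jitterSeconds > 0 ? ThreadLocalRandom.current().nextInt(jitterSeconds) : 0;
        return baseSeconds + offset;
    }

    /** Writes `keys` entries at the same instant and returns the count for the busiest expiry second. */
    static int peakExpirationsPerSecond(int keys, int baseSeconds, int jitterSeconds) {
        int[] perSecond = new int[Math.max(jitterSeconds, 1)];
        for (int i = 0; i < keys; i++) {
            // Bucket each key by how many seconds after the base TTL it expires
            perSecond[jitteredTtl(baseSeconds, jitterSeconds) - baseSeconds]++;
        }
        int peak = 0;
        for (int n : perSecond) {
            peak = Math.max(peak, n);
        }
        return peak;
    }

    public static void main(String[] args) {
        System.out.println("fixed TTL peak:    " + peakExpirationsPerSecond(10_000, 3600, 0));
        System.out.println("jittered TTL peak: " + peakExpirationsPerSecond(10_000, 3600, 300));
    }
}
```

With a fixed TTL all 10,000 keys land in the same second; with the offset they spread over five minutes. This only helps for keys that were written together. It does nothing for a cache that comes back empty after a restart, which is why the other options below still matter.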

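Option 2: Hot data never expires + proactive refresh

// ✅ Correct implementation 2: logical expiration for hot data (never delete, only overwrite)
@Service
public class ProductServiceWithLogicExpire {

    // Hot entries have no Redis TTL; freshness is judged from the timestamp stored in the value
    private static final long LOGIC_EXPIRE_TIME = 5 * 60 * 1000;  // logically stale after 5 minutes

    // Calling an @Async method on `this` bypasses the Spring proxy and runs synchronously,
    // so the refresh is handed to the injected thread pool explicitly instead
    @Autowired
    @Qualifier("cacheRefreshExecutor")
    private Executor cacheRefreshExecutor;

    public Product getProduct(Long productId) {
        String key = "product:" + productId;

        // 1. Look in the cache first
        String cacheJson = redis.get(key);
        if (cacheJson != null) {
            ProductCache cache = JSON.parseObject(cacheJson, ProductCache.class);

            // 2. Check logical expiration (getExpireTime() holds the time the entry was written)
            if (System.currentTimeMillis() - cache.getExpireTime() < LOGIC_EXPIRE_TIME) {
                return cache.getProduct();  // still fresh, return directly
            }

            // 3. Stale: refresh in the background without blocking the request
            cacheRefreshExecutor.execute(() -> refreshCache(productId, key));
            return cache.getProduct();  // serve the old value meanwhile
        }

        // 4. Nothing cached: read the database and backfill
        Product product = productMapper.selectById(productId);
        ProductCache cache = new ProductCache(product, System.currentTimeMillis());
        redis.set(key, JSON.toJSONString(cache));

        return product;
    }

    /**
     * Reloads one product and overwrites its cache entry (runs on cacheRefreshExecutor)
     */
    private void refreshCache(Long productId, String key) {
        try {
            Product product = productMapper.selectById(productId);
            ProductCache cache = new ProductCache(product, System.currentTimeMillis());
            redis.set(key, JSON.toJSONString(cache));
        } catch (Exception e) {
            log.error("Cache refresh failed: {}", productId, e);
        }
    }
}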
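
One thing the logical-expiration code leaves open is that every request hitting a stale entry schedules its own refresh. A common refinement, which is an assumption here and not part of the original code, is to let only one refresh per key run at a time. The sketch below is a JDK-only, in-memory version: `LogicalExpiryCache`, its `loader`, and the injectable `clock` and `refreshExecutor` are hypothetical names chosen to keep it testable.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.function.Function;
import java.util.function.LongSupplier;

class LogicalExpiryCache<K, V> {

    /** Cached value plus the time it was written. */
    private static final class Entry<T> {
        final T value;
        final long writtenAt;

        Entry(T value, long writtenAt) {
            this.value = value;
            this.writtenAt = writtenAt;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> store = new ConcurrentHashMap<>();
    private final Set<K> refreshing = ConcurrentHashMap.newKeySet();
    private final Function<K, V> loader;
    private final long maxAgeMillis;
    private final LongSupplier clock;
    private final Executor refreshExecutor;

    LogicalExpiryCache(Function<K, V> loader, long maxAgeMillis,
                       LongSupplier clock, Executor refreshExecutor) {
        this.loader = loader;
        this.maxAgeMillis = maxAgeMillis;
        this.clock = clock;
        this.refreshExecutor = refreshExecutor;
    }

    V get(K key) {
        Entry<V> entry = store.get(key);
        if (entry == null) {
            // Cold miss: load synchronously and backfill
            V value = loader.apply(key);
            store.put(key, new Entry<>(value, clock.getAsLong()));
            return value;
        }
        boolean stale = clock.getAsLong() - entry.writtenAt >= maxAgeMillis;
        // refreshing.add returns true for exactly one caller until that refresh finishes
        if (stale && refreshing.add(key)) {
            refreshExecutor.execute(() -> {
                try {
                    store.put(key, new Entry<>(loader.apply(key), clock.getAsLong()));
                } finally {
                    refreshing.remove(key);
                }
            });
        }
        return entry.value;   // fresh or stale, the caller is never blocked
    }
}
```

Across several service instances the in-process set would have to be replaced by something shared, for example a short-lived Redis lock per key.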

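Option 3: Highly available Redis

# Redis Sentinel configuration in sentinel.conf (automatic failover)
# Monitor the master at 192.168.1.100:6379; quorum 2 means two Sentinels must agree it is down
sentinel monitor mymaster 192.168.1.100 6379 2
# Failover timeout, in milliseconds
sentinel failover-timeout mymaster 18000

# Application configuration (Spring Boot 2.x property names)
spring:
  redis:
    sentinel:
      master: mymaster
      nodes: 192.168.1.101:26379,192.168.1.102:26379,192.168.1.103:26379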

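4.4 Incident summary

┌─────────────────────────────────────────────────────────────────┐
│                     Lessons from incident 3                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Direct causes:                                                 │
│  ├─ The Redis restart wiped the entire cache                    │
│  └─ A flood of requests fell through to the database            │
│                                                                 │
│  Root causes:                                                   │
│  ├─ No high-availability plan for the cache                     │
│  ├─ Cache TTLs were not randomized                              │
│  └─ No cache warm-up                                            │
│                                                                 │
│  Fixes:                                                         │
│  ├─ Add a random offset to TTLs                                 │
│  ├─ Logical expiration for hot data                             │
│  ├─ Redis high availability via Sentinel/Cluster                │
│  └─ Warm the cache proactively after a restart                  │
│                                                                 │
│  Follow-up actions:                                             │
│  ├─ Every cache key must have a TTL                             │
│  ├─ Hot data must use randomized expiry times                   │
│  └─ Warm the cache before a restarted node takes traffic        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘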

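5. Incident 4: Distributed Lock Failure Causes Duplicate Shipments

5.1 Incident timeline

2024-08-10 15:00 - Order system upgrade: a new "auto-ship" feature goes live
2024-08-10 15:30 - Monitoring shows some orders were shipped twice
2024-08-10 15:45 - Emergency investigation: the distributed lock is not taking effect
2024-08-10 16:00 - Auto-ship disabled, manual handling begins

Loss summary:
├─ Duplicate shipments: 150 orders
├─ Customer service handling: ¥5,000
├─ Courier interception cost: ¥3,000
└─ Total: about ¥8,000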

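5.2 Root cause analysis

The faulty code:

// ❌ Faulty code: the distributed lock is used incorrectly
@Service
public class OrderShipService {

    @Autowired
    private RedisTemplate<String, String> redis;

    public void shipOrder(Long orderId) {
        String lockKey = "order:ship:" + orderId;

        // ❌ Problem 1: SETNX + EXPIRE are two commands, not one atomic operation!
        // If the process crashes after SETNX succeeds but before EXPIRE, the lock is never released
        Boolean locked = redis.opsForValue().setIfAbsent(lockKey, "1");
        if (locked) {
            redis.expire(lockKey, 30, TimeUnit.SECONDS);

            try {
                // Shipping logic
                doShip(orderId);
            } finally {
                // ❌ Problem 2: the key is deleted without checking who owns it. If doShip runs
                // past the 30 s expiry, another instance may already hold the lock, and this
                // line releases it on their behalf
                redis.delete(lockKey);
            }
        }
    }
}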

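5.3 Fixes

Use Redisson (recommended):

// ✅ Correct implementation: Redisson's RLock (atomic acquire, owner-checked release, auto-renewal)
@Service
public class OrderShipServiceFixed {

    @Autowired
    private RedissonClient redisson;

    public void shipOrder(Long orderId) {
        String lockKey = "order:ship:" + orderId;
        RLock lock = redisson.getLock(lockKey);

        try {
            // 1. Acquire the lock; waitTime = 5 s (give up if it is still taken after 5 s).
            //    No leaseTime is passed on purpose: Redisson's watchdog then keeps renewing the
            //    lock while this thread holds it (default lock timeout 30 s), and the lock still
            //    expires on its own if the process dies. Passing an explicit leaseTime, as in
            //    tryLock(5, 30, TimeUnit.SECONDS), would switch the renewal off.
            boolean locked = lock.tryLock(5, TimeUnit.SECONDS);

            if (!locked) {
                log.warn("Failed to acquire lock, order: {}", orderId);
                throw new BusinessException("System busy, please try again later");
            }

            // 2. Shipping logic. doShip should still verify that the order has not been shipped:
            //    a caller that waited for the lock acquires it right after the first one finishes.
            doShip(orderId);

        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new BusinessException("Shipping failed");
        } finally {
            // 3. Release the lock (must be in finally)
            if (lock.isHeldByCurrentThread()) {
                lock.unlock();
            }
        }
    }
}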
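
A lock only keeps two callers from shipping at the same moment. A second caller that arrives after the first has released the lock would still ship again unless the shipping step itself is idempotent. One way to get that, sketched here as an assumption because the original does not show `doShip`, is a guarded state transition: move the order from PAID to SHIPPED atomically and ship only if the transition succeeded. `IdempotentShipper` and its members are hypothetical names, and the `AtomicReference` stands in for a conditional `UPDATE` on the order row.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

class IdempotentShipper {

    enum Status { PAID, SHIPPED }

    private final ConcurrentHashMap<Long, AtomicReference<Status>> orders = new ConcurrentHashMap<>();
    final AtomicInteger parcelsSent = new AtomicInteger();

    void addPaidOrder(long orderId) {
        orders.put(orderId, new AtomicReference<>(Status.PAID));
    }

    /** Ships at most once per order, no matter how many callers race or retry. */
    boolean ship(long orderId) {
        AtomicReference<Status> status = orders.get(orderId);
        if (status == null) {
            return false;   // unknown order
        }
        // Stand-in for: UPDATE orders SET status = 'SHIPPED' WHERE id = ? AND status = 'PAID'
        if (!status.compareAndSet(Status.PAID, Status.SHIPPED)) {
            return false;   // already shipped by an earlier call
        }
        parcelsSent.incrementAndGet();   // the real carrier call would go here
        return true;
    }
}
```

With this guard in place the distributed lock becomes an optimization that avoids wasted work, and correctness no longer depends on the lock never expiring early.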

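5.4 Incident summary

┌─────────────────────────────────────────────────────────────────┐
│                     Lessons from incident 4                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Direct causes:                                                 │
│  ├─ SETNX and EXPIRE were not one atomic operation              │
│  └─ The distributed lock did not take effect                    │
│                                                                 │
│  Root causes:                                                   │
│  ├─ No mature distributed-lock library was used                 │
│  └─ Hand-rolled lock that missed the edge cases                 │
│                                                                 │
│  Fixes:                                                         │
│  ├─ Use a mature lock library such as Redisson                  │
│  └─ Locks must auto-renew (in case the job runs long)           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘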

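6. Incident 5: Thread Pool Exhaustion Hangs the Service

6.1 Incident timeline

2024-09-05 09:00 - New feature developed: asynchronous notifications (HTTP calls)
2024-09-05 10:00 - Tests pass, release goes out
2024-09-05 10:15 - Service-wide timeouts, no endpoint responds
2024-09-05 10:20 - Emergency rollback
2024-09-05 11:00 - Post-mortem finds a badly configured thread pool

Outage duration: 20 minutes
Scope: all services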

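6.2 Root cause analysis

// ❌ Faulty code: badly configured thread pool
@Service
public class NotificationService {

    // ❌ Problem: too many core threads, which led to OOM
    // (newFixedThreadPool also puts an unbounded queue behind them)
    private final ExecutorService executor = Executors.newFixedThreadPool(100);

    public void sendNotification(Long userId, String message) {
        executor.submit(() -> {
            // Send the HTTP request (can be slow, 10-second timeout)
            httpClient.post("http://notification-service/send", message);
        });
    }
}

// Analysis:
// 100 threads × about 1 MB of stack each = about 100 MB of stack memory
// Each HTTP request may hold 10-50 MB of heap
// 100 concurrent HTTP calls = 1-5 GB of heap
// Server memory: 8 GB, JVM heap: 4 GB
// The queue behind the pool is unbounded, so slow calls let the backlog grow without limit
// Result: OOM, service hangs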

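6.3 Fixes

// ✅ Correct implementation: a deliberately sized thread pool
@Service
public class NotificationServiceFixed {

    /**
     * Thread pool for HTTP calls
     *
     * Sizing formula:
     * threads = CPU cores / (1 - blocking coefficient)
     *
     * For HTTP calls the blocking coefficient is about 0.9; with 8 CPU cores:
     * threads = 8 / (1 - 0.9) = 80
     *
     * 80 is the ceiling suggested by the formula. To keep memory in check, the pool
     * below is deliberately smaller (16 core threads, 32 max).
     */
    private final ThreadPoolExecutor notificationPool = new ThreadPoolExecutor(
        16,                           // core threads (keep this modest)
        32,                           // max threads (only created once the queue is full)
        60L, TimeUnit.SECONDS,        // idle thread keep-alive
        new LinkedBlockingQueue<>(1000),  // bounded queue (caps memory)
        new ThreadFactoryBuilder().setNameFormat("notify-%d").build(),
        new ThreadPoolExecutor.CallerRunsPolicy()  // rejection policy: the caller runs the task, which slows producers down
    );

    public void sendNotification(Long userId, String message) {
        notificationPool.submit(() -> {
            try {
                httpClient.post("http://notification-service/send", message);
            } catch (Exception e) {
                log.error("Failed to send notification: userId={}", userId, e);
            }
        });
    }
}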
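
Two parts of the configuration above are easy to check in isolation: the sizing formula, and what a bounded queue does once it fills up. The sketch below uses a tiny pool (2 threads, queue of 3) with `AbortPolicy` so that rejections are visible as exceptions; `BoundedPoolDemo`, `suggestedThreads`, and `countRejections` are hypothetical names, and the pool sizes are chosen only to keep the demo small.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedPoolDemo {

    /** threads = cores / (1 - blockingCoefficient), rounded to the nearest integer. */
    static int suggestedThreads(int cores, double blockingCoefficient) {
        return (int) Math.round(cores / (1.0 - blockingCoefficient));
    }

    /** Submits `tasks` blocked tasks to a 2-thread pool with a queue of 3; returns how many were rejected. */
    static int countRejections(int tasks) throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            2, 2, 0L, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(3),
            new ThreadPoolExecutor.AbortPolicy());
        int rejected = 0;
        for (int i = 0; i < tasks; i++) {
            try {
                pool.execute(() -> {
                    try {
                        release.await();   // simulate a slow downstream HTTP call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++;                // pool and queue are both full
            }
        }
        release.countDown();               // let the accepted tasks finish
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return rejected;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("suggested threads: " + suggestedThreads(8, 0.9));
        System.out.println("rejected of 10:    " + countRejections(10));
    }
}
```

The pool accepts 5 tasks (2 running, 3 queued) and turns the rest away, so a slow downstream shows up as explicit rejections and not as an ever-growing backlog. `Executors.newFixedThreadPool` uses an unbounded `LinkedBlockingQueue`, so it never reaches this point. With `CallerRunsPolicy`, as in the fix above, the overflow tasks would run on the submitting thread, which throttles producers but can also tie up request threads.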

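7. Production Checklist

┌─────────────────────────────────────────────────────────────────────┐
│   ⚠️  High-concurrency checklist (hard lessons from 5 incidents)     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ❌ Don't:                                                          │
│  ─────────────────                                                  │
│  1. Deduct stock without a lock                                     │
│     → Overselling!                                                  │
│                                                                     │
│  2. Skip idempotency on write endpoints                             │
│     → Duplicate submissions!                                        │
│                                                                     │
│  3. Give every cache key the same TTL                               │
│     → Avalanche!                                                    │
│                                                                     │
│  4. Hand-roll a distributed lock                                    │
│     → SETNX + EXPIRE is not atomic!                                 │
│                                                                     │
│  5. Configure thread pools carelessly                               │
│     → OOM!                                                          │
│                                                                     │
│  ✅ Do:                                                             │
│  ─────────────────                                                  │
│  1. Deduct stock with optimistic locking or a Redis atomic script   │
│  2. Make every write operation idempotent                           │
│  3. Add a random offset to cache TTLs                               │
│  4. Use a mature lock library such as Redisson                      │
│  5. Configure thread pools by hand with ThreadPoolExecutor          │
│                                                                     │
│  📊 From experience:                                                │
│  ─────────────────                                                  │
│  Overselling: 90% come from non-atomic check-then-modify            │
│  Duplicate claims: 80% come from missing idempotency control        │
│  Cache avalanches: 70% come from TTLs without random offsets        │
│  Service crashes: 60% come from misconfigured thread pools          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘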

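8. Summary

┌─────────────────────────────────────────────────────────────────┐
│              Core principles for high concurrency               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1️⃣  Never trust concurrency                                    │
│     → Synchronize every access to a shared resource             │
│                                                                 │
│  2️⃣  Make every write idempotent                                │
│     → Network retries, double clicks and timeouts all           │
│       cause repeated calls                                      │
│                                                                 │
│  3️⃣  A cache is a lifeline and also a time bomb                 │
│     → Randomize TTLs and get high availability right            │
│                                                                 │
│  4️⃣  Load-test before going live                                │
│     → A test environment never surfaces production issues       │
│                                                                 │
│  5️⃣  Monitoring matters more than code                          │
│     → If you can't see a problem, you can't find it             │
│                                                                 │
│  Remember:                                                      │
│  High-concurrency incidents are a matter of "when", not "if"    │
│  For those who prepared, an incident is "experience"            │
│  For those who did not, an incident is a "disaster"             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘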
