Redisson BlockingQueue导致生产环境OOM最近连续收到线上报警：线上容器接连出现 OOM，容器自动重

1. 背景

最近连续收到线上报警：线上容器接连出现 OOM，容器自动重启。怀疑出现了大对象或者内存泄漏。

因为服务最近一次上线为半月前，之前一直稳定运行，所以怀疑上次服务变更导致的本次 OOM。上次上线的功能中引入了 redisson 延迟队列。但是，服务上线几天后，由于一些业务上的原因，我们通过开关暂时关闭了这个功能，不会再向 redisson 延迟队列写入任何数据，理论上不会出现 OOM。

于是在本地分析了线上 heap dump 数据，发现罪魁祸首是 Redisson blockingQueue.take()导致的。

tips：通过-XX:+HeapDumpOnOutOfMemoryError 配置，JVM 会在 OOM 时自动生成堆快照，便于排查问题

问题现象如下：

使用blockingQueue.take()阻塞消费，期间队列中一直没数据写入
十天左右就会 OOM（取决于你的 JVM 内存大小）

2. 问题复盘

2.1. 问题复现 demo 代码

本地环境如下：

JDK 8
Redisson 3.11.1（这个版本很关键，3.12.5 开始修复了这个问题）

本地使用如下代码即可复现，复现要点为：

1. 用blockingQueue.take()阻塞消费
1. 队列中一直没有数据

@Slf4j
@Service
public class DelayService implements ApplicationListener<ContextRefreshedEvent> {
    
    private static final String QUEUE_NAME = "demo-delay";
    
    @Resource
    RedissonClient redissonClient;

    @Override
    public void onApplicationEvent(ContextRefreshedEvent event) {
        Thread thread = new Thread(new QueueMsgListener(redissonClient));
        thread.setDaemon(true);
        thread.start();
    }

    private static class QueueMsgListener implements Runnable{
        private final RedissonClient redissonClient;
        public QueueMsgListener(RedissonClient redissonClient) {
            this.redissonClient = redissonClient;
        }
        
        @Override
        public void run() {
            RBlockingQueue<Object> blockingQueue = redissonClient.getBlockingQueue(QUEUE_NAME);
            while (true) {
                try {
                    // 阻塞获取数据
                    Object obj = blockingQueue.take();
                    // do sth.
                } catch (InterruptedException e) {
                    log.error("[QueueMsgListener] err", e);
                    break;
                }
            }
        }
    }
}

本地运行代码，使用 visual vm 查看堆快照，内存占用变化曲线如下（复现OOM需要的时间比较久，我们分析趋势就行了）：

可以发现，期间虽然有 GC 的发生，但是内存占用一直是上升的趋势。

堆快照如下，占用内存最多的是 Netty 中的一个无界队列，这个队列是 Redisson CommandQueue 的成员变量：

2.2. 问题原因 & 源码分析

我们知道， Redis 的 BLPOP 命令是一个阻塞命令，在没有数据返回的情况下会阻塞当前连接。RedissonRBlockingQueue#take() 底层调用的 Redis BLPOP 命令，所以此方法会阻塞当前 redis 连接，直到有数据返回。

在阻塞期间，Redisson 会按照指定的时间间隔 (pingConnectionInterval) 异步发送 PING 命令，这些命令都会存到一个队列中，按顺序依次发送。

来看一张 Redisson 底层访问 Redis 的示意图：

由于 BLPOP 是阻塞命令，所以上图中的步骤 4 不会返回， PING 命令一直在 CommandsQueue 中积压，队列中的数据无限增长，随着时间的推移，最终 OOM。

发送心跳报文源码如下：

@Sharable
public class PingConnectionHandler extends ChannelInboundHandlerAdapter {
    // ...
    @Override
    public void channelActive(final ChannelHandlerContext ctx) throws Exception {
        RedisConnection connection = RedisConnection.getFrom(ctx.channel());
        connection.getConnectionPromise().onComplete((res, e) -> {
            if (e == null) {
                // 连接成功之后，发送心跳
                sendPing(ctx); 
            }
        });
        ctx.fireChannelActive();
    }

    protected void sendPing(final ChannelHandlerContext ctx) {
        final RedisConnection connection = RedisConnection.getFrom(ctx.channel());
        // 1. 异步发送PING命令
        final RFuture<String> future = connection.async(StringCodec.INSTANCE, RedisCommands.PING);
        
        // 2. pingConnectionInterval 后再次发送心跳
        config.getTimer().newTimeout(new TimerTask() {
            @Override
            public void run(Timeout timeout) throws Exception {
                CommandData<?, ?> commandData = connection.getCurrentCommand();
                if ((commandData == null || !commandData.isBlockingCommand()) 
                        && (future.cancel(false) || !future.isSuccess())) {
                    ctx.channel().close();
                    log.debug("channel: {} closed due to PING response timeout set in {} ms", ctx.channel(), config.getPingConnectionInterval());
                } else {
                    // 3. 递归调用
                    sendPing(ctx);
                }
            }
        }, config.getPingConnectionInterval(), TimeUnit.MILLISECONDS);
    }


}

通过 NettychannelActive回调，与 redis 连接建立成功之后，就会按照pingConnectionInterval 间隔定时发送RedisCommands.PING报文。

CommandsQueue 命令入队、出队相关代码如下：

public class CommandsQueue extends ChannelDuplexHandler {
    // ...
    private final Queue<QueueCommandHolder> queue = PlatformDependent.newMpscQueue();
    // ...

    /**
     *  发送下一个命令，CommandDecoder 解码完毕会调用这个方法
     */
    public void sendNextCommand(Channel channel) {
        QueueCommand command = channel.attr(CommandsQueue.CURRENT_COMMAND).getAndSet(null);
        if (command != null) {
            queue.poll();
        } else {
            QueueCommandHolder c = queue.peek();
            if (c != null) {
                QueueCommand data = c.getCommand();
                List<CommandData<Object, Object>> pubSubOps = data.getPubSubOperations();
                if (!pubSubOps.isEmpty()) {
                    queue.poll();
                }
            }
        }
        sendData(channel);
    }
    
    /**
     *  异步发送逻辑，CommandData全都放入队列中，按序依次发送
     */
    @Override
    public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) throws Exception {
        // CommandData实现了QueueCommand接口
        if (msg instanceof QueueCommand) {
            QueueCommand data = (QueueCommand) msg;
            QueueCommandHolder holder = queue.peek();
            if (holder != null && holder.getCommand() == data) {
                // case1: 队列头部命令就当前命令，直接发送
                super.write(ctx, msg, promise);
            } else {
                // case2: 否则，命令入队
                queue.add(new QueueCommandHolder(data, promise));
                // 发送队列头部的命令
                sendData(ctx.channel());
            }
        } else {
            super.write(ctx, msg, promise);
        }
    }
}

3. 如何解决

3.1. 方法 1 升级至 Redisson 3.12.5+

Redisson 官方已经修复了这个问题，3.12.5+版本不存在这个问题。

github.com/redisson/re…

官方修复之后，PingConnectionHandler相关代码如下：

@Sharable
public class PingConnectionHandler extends ChannelInboundHandlerAdapter {
    
    // ...
    
    private void sendPing(ChannelHandlerContext ctx) {
        RedisConnection connection = RedisConnection.getFrom(ctx.channel());
        CommandData<?, ?> commandData = connection.getCurrentCommand();
        RFuture<String> future;
        // 如果正在执行的 commandData 是一个阻塞命令，就不再发送 PING
        if (commandData == null || !commandData.isBlockingCommand()) {
            future = connection.async(StringCodec.INSTANCE, RedisCommands.PING);
        } else {
            future = null;
        }

        config.getTimer().newTimeout(new TimerTask() {
            @Override
            public void run(Timeout timeout) throws Exception {
                if (future != null
                        && (future.cancel(false) || !future.isSuccess())) {
                    ctx.channel().close();
                    if (future.cause() != null) {
                        log.error("Unable to send PING command over channel: " + ctx.channel(), future.cause());
                    }
                    log.debug("channel: {} closed due to PING response timeout set in {} ms", ctx.channel(), config.getPingConnectionInterval());
                } else {
                    sendPing(ctx);
                }
            }
        }, config.getPingConnectionInterval(), TimeUnit.MILLISECONDS);
    }

}

修复的方式也比较简单，判断当前执行的命令是否为阻塞命令，如果是阻塞命令不再发送 PING 报文了。

3.2. 方法 2 慎重使用`BlockingQueue#take()`

BlockingQueue#take()是阻塞方法，应当谨慎使用。可以使用带超时参数的 poll()方法规避上述问题。

Redisson BlockingQueue导致生产环境OOM