RocketMQ broker busy 问题排查

774 阅读4分钟

问题排查

首先是报错代码:

private void cleanExpiredRequest() {
        while (this.brokerController.getMessageStore().isOSPageCacheBusy()) {
            try {
                if (!this.brokerController.getSendThreadPoolQueue().isEmpty()) {
                    final Runnable runnable = this.brokerController.getSendThreadPoolQueue().poll(0, TimeUnit.SECONDS);
                    if (null == runnable) {
                        break;
                    }

                    final RequestTask rt = castRunnable(runnable);
                    rt.returnResponse(RemotingSysResponseCode.SYSTEM_BUSY, String.format("[PCBUSY_CLEAN_QUEUE]broker busy, start flow control for a while, period in queue: %sms, size of queue: %d", System.currentTimeMillis() - rt.getCreateTimestamp(), this.brokerController.getSendThreadPoolQueue().size()));
                } else {
                    break;
                }
            } catch (Throwable ignored) {
            }
        }

        cleanExpiredRequestInQueue(this.brokerController.getSendThreadPoolQueue(),
            this.brokerController.getBrokerConfig().getWaitTimeMillsInSendQueue());

        cleanExpiredRequestInQueue(this.brokerController.getPullThreadPoolQueue(),
            this.brokerController.getBrokerConfig().getWaitTimeMillsInPullQueue());

        cleanExpiredRequestInQueue(this.brokerController.getHeartbeatThreadPoolQueue(),
            this.brokerController.getBrokerConfig().getWaitTimeMillsInHeartbeatQueue());

        cleanExpiredRequestInQueue(this.brokerController.getEndTransactionThreadPoolQueue(), this
            .brokerController.getBrokerConfig().getWaitTimeMillsInTransactionQueue());
    }

其中'[PCBUSY_CLEAN_QUEUE]broker busy'就是客户端输出的报错日志。

查看调用栈发现是在broker启动时会调用成员变量brokerFastFailure的start()方法,开启一个定时任务线程池,每隔10秒钟执行上面的cleanExpiredRequest()方法。从成员变量的名字大概能猜到这个是broker的fast-failure机制,用于在broker消息存储繁忙的时候不用让生产者傻等,直接返回一个错误。那broker是基于什么判断当前写繁忙的呢?看下this.brokerController.getMessageStore().isOSPageCacheBusy()方法。

public boolean isOSPageCacheBusy() {
        long begin = this.getCommitLog().getBeginTimeInLock();
        long diff = this.systemClock.now() - begin;

        return diff < 10000000
                && diff > this.messageStoreConfig.getOsPageCacheBusyTimeOutMills();
    }

该方法位于DefaultMessageStore.java中,即broker默认的消息存储类,该类的分析见源码分析RocketMQ之CommitLog消息存储机制,此处不再赘述,我们重点关注上面方法的逻辑。

begin是commitLog执行putMessage()方法设置的:

// 默认操作系统页缓存繁忙超时时长
private long osPageCacheBusyTimeOutMills = 1000;
...
// commitLog存储一条消息时的开始时间戳
private volatile long beginTimeInLock = 0;
...
// putMessage时设置beginTimeInLock的代码
long beginLockTimestamp = this.defaultMessageStore.getSystemClock().now();
this.beginTimeInLock = beginLockTimestamp;

也即是上一条开始存储时的时间戳,注意这是个volatile字段。而messageStoreConfig中默认的操作系统页缓存繁忙超时时间是1s,也就是说当broker的BrokerFastFailure类中的定时线程每隔10秒钟对比一下当前时间和上一条消息存储时的开始时间,如果大于1s(默认值,可配置)并且小于10000s就执行fast-failure机制。当然commitLog存储完一条消息后会将begin重置为0,这样如果broker空闲时diff就会大于10000s了。

RocketMQ消息处理机制

注册消息处理器

回到cleanExpiredRequest()方法,我们看到有个!this.brokerController.getSendThreadPoolQueue().isEmpty()的判断,猜测是broker会将每个接收到的消息封装成Runnable任务塞到sendThreadPoolQueue队列里进行处理。看下具体实现:

brokerController构造方法中会初始化该队列,默认容量为10000

this.sendThreadPoolQueue = new LinkedBlockingQueue<>(this.brokerConfig.getSendThreadPoolQueueCapacity());

brokerController.initialize()中会初始化sendMessageExecutor线程池:

this.sendMessageExecutor = new BrokerFixedThreadPoolExecutor(
                this.brokerConfig.getSendMessageThreadPoolNums(),
                this.brokerConfig.getSendMessageThreadPoolNums(),
                1000 * 60,
                TimeUnit.MILLISECONDS,
                this.sendThreadPoolQueue,
                new ThreadFactoryImpl("SendMessageThread_"));

而sendMessageExecutor又在哪里用到呢?

public void registerProcessor() {
        /**
         * SendMessageProcessor
         */
        SendMessageProcessor sendProcessor = new SendMessageProcessor(this);
        sendProcessor.registerSendMessageHook(sendMessageHookList);
        sendProcessor.registerConsumeMessageHook(consumeMessageHookList);

        this.remotingServer.registerProcessor(RequestCode.SEND_MESSAGE, sendProcessor, this.sendMessageExecutor);
        this.remotingServer.registerProcessor(RequestCode.SEND_MESSAGE_V2, sendProcessor, this.sendMessageExecutor);
        this.remotingServer.registerProcessor(RequestCode.SEND_BATCH_MESSAGE, sendProcessor, this.sendMessageExecutor);
        this.remotingServer.registerProcessor(RequestCode.CONSUMER_SEND_MSG_BACK, sendProcessor, this.sendMessageExecutor);
        this.fastRemotingServer.registerProcessor(RequestCode.SEND_MESSAGE, sendProcessor, this.sendMessageExecutor);
        this.fastRemotingServer.registerProcessor(RequestCode.SEND_MESSAGE_V2, sendProcessor, this.sendMessageExecutor);
        this.fastRemotingServer.registerProcessor(RequestCode.SEND_BATCH_MESSAGE, sendProcessor, this.sendMessageExecutor);
        this.fastRemotingServer.registerProcessor(RequestCode.CONSUMER_SEND_MSG_BACK, sendProcessor, this.sendMessageExecutor);
        ...省略其他代码
    }

可以看到在brokerController.registerProcessor()中创建了一个SendMessageProcessor消息发送处理对象,并调用remotingServer.registerProcessor()根据不同消息类型进行注册。进入NettyRemotingServer:

public void registerProcessor(int requestCode, NettyRequestProcessor processor, ExecutorService executor) {
        ExecutorService executorThis = executor;
        if (null == executor) {
            executorThis = this.publicExecutor;
        }

        Pair<NettyRequestProcessor, ExecutorService> pair = new Pair<NettyRequestProcessor, ExecutorService>(processor, executorThis);
        this.processorTable.put(requestCode, pair);
    }

逻辑很简单,根据processer和executor构造Pair对象并以requestCode(即消息类型)作为key放入processorTable(HashMap)中。

消息处理

RocketMQ底层基于netty,那消息处理的入口肯定是NettyRemotingServer啦,乱七八糟的先别看,直奔主题,看下start()方法:

public void start() {
        ...
        // 初始化handlers
        prepareSharableHandlers();

        ServerBootstrap childHandler =
            this.serverBootstrap.group(this.eventLoopGroupBoss, this.eventLoopGroupSelector)
                .channel(useEpoll() ? EpollServerSocketChannel.class : NioServerSocketChannel.class)
                .option(ChannelOption.SO_BACKLOG, 1024)
                .option(ChannelOption.SO_REUSEADDR, true)
                .option(ChannelOption.SO_KEEPALIVE, false)
                .childOption(ChannelOption.TCP_NODELAY, true)
                .childOption(ChannelOption.SO_SNDBUF, nettyServerConfig.getServerSocketSndBufSize())
                .childOption(ChannelOption.SO_RCVBUF, nettyServerConfig.getServerSocketRcvBufSize())
                .localAddress(new InetSocketAddress(this.nettyServerConfig.getListenPort()))
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    public void initChannel(SocketChannel ch) throws Exception {
                        ch.pipeline()
                            .addLast(defaultEventExecutorGroup, HANDSHAKE_HANDLER_NAME, handshakeHandler)
                            .addLast(defaultEventExecutorGroup,
                                encoder,
                                new NettyDecoder(),
                                new IdleStateHandler(0, 0, nettyServerConfig.getServerChannelMaxIdleTimeSeconds()),
                                connectionManageHandler,
                                // 添加serverHandler
                                serverHandler
                            );
                    }
                });
                ...
    }

在prepareSharableHandlers()中初始化了serverHandler对象,见名知义,这个就是用来处理接收到的消息的:

@ChannelHandler.Sharable
    class NettyServerHandler extends SimpleChannelInboundHandler<RemotingCommand> {

        @Override
        protected void channelRead0(ChannelHandlerContext ctx, RemotingCommand msg) throws Exception {
            // 处理接收到的消息
            processMessageReceived(ctx, msg);
        }
    }
    
public void processMessageReceived(ChannelHandlerContext ctx, RemotingCommand msg) throws Exception {
        final RemotingCommand cmd = msg;
        if (cmd != null) {
            switch (cmd.getType()) {
                case REQUEST_COMMAND:
                    processRequestCommand(ctx, cmd);
                    break;
                case RESPONSE_COMMAND:
                    processResponseCommand(ctx, cmd);
                    break;
                default:
                    break;
            }
        }
    }

我们重点关注processRequestCommand(ctx, cmd)方法,简化后如下:

public void processRequestCommand(final ChannelHandlerContext ctx, final RemotingCommand cmd) {
        // 从消息注册表里根据消息类型获取Pair
        final Pair<NettyRequestProcessor, ExecutorService> matched = this.processorTable.get(cmd.getCode());
        final Pair<NettyRequestProcessor, ExecutorService> pair = null == matched ? this.defaultRequestProcessor : matched;
        final int opaque = cmd.getOpaque();

        if (pair != null) {
            // 消息处理入口
            Runnable run = new Runnable() {
                @Override
                public void run() {
                    try {
                        doBeforeRpcHooks(RemotingHelper.parseChannelRemoteAddr(ctx.channel()), cmd);
                        // 这里是处理消息的入口
                        final RemotingCommand response = pair.getObject1().processRequest(ctx, cmd);
                        ...
                    } catch (Throwable e) {
                        ...
                    }
                }
            };

            try {
                // 封装成RequestTask
                final RequestTask requestTask = new RequestTask(run, ctx.channel(), cmd);
                // 提交到executor,即前面的sendMessageExecutor
                pair.getObject2().submit(requestTask);
            } catch (RejectedExecutionException e) {
                ...
            }
        } else {
            ...
        }
    }

至此,我们对broker busy情况的处理逻辑算是有了一个比较清晰的认知。总结一下:broker启动时会新建一个brokerFastFailure定时任务线程池,每10s查看当前时间与上一条消息开始的时间间隔,大于配置值(默认1s)则启动fast-failure机制,从发送消息处理线程池的队列里拿到还未处理的任务,直接报错返回给客户端。