如果胖友对心跳检测的逻辑不是很了解，建议先看《简易RPC框架-心跳与重连机制》。

1. sofa-bolt

Server ：基于 Netty IdleStateHandler(0, 0, 90000 ms)，即 90 秒无读或者写操作，进行关闭( ServerIdleHandler )。
Client ：基于 Netty IdleStateHandler(15000 ms, 150000 ms, 0) ，即 15 秒无来自 Server 无读或者写操作，发起心跳( HeartbeatHandler + RpcHeartbeatTrigger ) 。在 RpcHeartbeatTrigger 中，如果连续三次发起心跳失败，则关闭连接。并且，是否发起成功的标准为 Server 是否回复一个成功的响应。

比较神奇的是，bolt 的 Server 设置 IdleStateHandler 的第三个参数，Client 设置了 IdleStateHandler 的第一、二个参数。按照 IdleStateHandler 的第三个参数的字面解释，“所有类型的超时时间”，那么 Client 使用 IdleStateHandler(0, 0, 150000 ms) 也可以呀，这样还能少一个 Task 呢！😈 带着酱紫的疑问，请教了下闪电侠大大。经过对 IdleStateHandler 的源码解读，发现 AllIdleTimeoutTask 每次读或写触发空闲后，重新计时。代码如下( 重点见 <1> 处的英文说明 )：

private final class AllIdleTimeoutTask extends AbstractIdleTask {

    AllIdleTimeoutTask(ChannelHandlerContext ctx) {
        super(ctx);
    }

    @Override
    protected void run(ChannelHandlerContext ctx) {

        long nextDelay = allIdleTimeNanos;
        if (!reading) {
            nextDelay -= ticksInNanos() - Math.max(lastReadTime, lastWriteTime);
        }
        if (nextDelay <= 0) {
            // Both reader and writer are idle - set a new timeout and
            // notify the callback.
            allIdleTimeout = schedule(ctx, this, allIdleTimeNanos, TimeUnit.NANOSECONDS);

            boolean first = firstAllIdleEvent;
            firstAllIdleEvent = false;

            try {
                if (hasOutputChanged(ctx, first)) {
                    return;
                }

                IdleStateEvent event = newIdleStateEvent(IdleState.ALL_IDLE, first);
                channelIdle(ctx, event);
            } catch (Throwable t) {
                ctx.fireExceptionCaught(t);
            }
        } else {
            // <1> 重点！！！
            // Either read or write occurred before the timeout - set a new
            // timeout with shorter delay.
            allIdleTimeout = schedule(ctx, this, nextDelay, TimeUnit.NANOSECONDS);
        }
    }
}

2. Jupiter

整体和 sofa-bolt 类似。

Jupiter 比较特殊的是，自己实现了 org.jupiter.transport.netty.handler.IdleStateChecker ，原因如下：

/**
 * 基于 {@link HashedWheelTimer} 的空闲链路监测.
 *
 *  1. Netty4.x默认的链路检测使用的是eventLoop的delayQueue, delayQueue是一个优先级队列, 复杂度为O(log n),
 *      每个worker处理自己的链路监测, 可能有助于减少上下文切换, 但是网络IO操作与idle会相互影响.
 *  2. 这个实现使用{@link HashedWheelTimer}的复杂度为O(1), 而且网络IO操作与idle不会相互影响, 但是有上下文切换.
 *  3. 如果连接数小, 比如几万以内, 可以直接用Netty4.x默认的链路检测 {@link io.netty.handler.timeout.IdleStateHandler},
 *      如果连接数较大, 建议使用这个实现.
 *
 * jupiter
 * org.jupiter.transport.netty.handler
 *
 * @author jiachun.fjc
 */

Server ：基于 Jupiter IdleStateChecker(60000 ms, 0, 0) ，即 60 秒没有从 Client 读取到数据，进行关闭( AcceptorIdleStateTrigger + AcceptorHandler)。
Client ：基于 Jupiter IdleStateChecker(0, 30000 ms, 0) ，即 30 秒没有向 Server 写入数据，发起心跳( ConnectorIdleStateTrigger )。

Jupiter 校验的更轻量，sofa-bolt 校验的更严谨。极端情况下，Jupiter Client 无法检测到连接不上 Jupiter Server 的情况，例如断网的情况。

TODO 芋艿，断网的情况下，Client Channel 会变成 Inactive 么？

3. Dubbo

因为 Dubbo 抽象了通信模块，所以不直接使用 Netty 自带的 IdleStateChecker ，而是自己使用 HeartbeatHandler + HeartBeatTask 实现，从而方便的接入 Mina、Netty3、Netty4 。

当任一一端（无论是 Server 还是 Client），检测 60 秒内无读或者写操作，向对方发起心跳请求。和 sofa-bolt 相同，接收到心跳请求的一端，会返回一个心跳响应给发送端。

当 Server 端 180 秒( 60 秒 * 3 )未收到写操作时，关闭和 Client 的连接。
当 Client 端 180 秒( 60 秒 * 3 )未收到写操作时，重新发起和 Server 的连接。

相比 sofa-bolt 来说，Server 新增了向 Client 主动发起心跳检测的逻辑，这样可能带来一定的资源消耗( 恰好 Server 和 Client 都向对方发起了心跳请求；一般情况下，只会单边发起心跳请求，因为一边的写，就是另外一边的读 )，那么带来什么样的好处呢？TODO

更多 Dubbo 源码解析，请见《Dubbo 实现原理与源码解析系列》。

4. Motan

Motan 的心跳设计方案和意图，和上述的三种都不一样。

发送消息时检测通道是否可用

Motan 在发送请求失败时，会记录连续调用失败次数。代码如下(重点见 <1> 和 <2> 处)：

// NettyChannel.java
@Override
public Response request(Request request) throws TransportException {
    int timeout = nettyClient.getUrl().getMethodParameter(request.getMethodName(), request.getParamtersDesc(), URLParamType.requestTimeout.getName(), URLParamType.requestTimeout.getIntValue());
    if (timeout <= 0) {
        throw new MotanFrameworkException("NettyClient init Error: timeout(" + timeout + ") <= 0 is forbid.", MotanErrorMsgConstant.FRAMEWORK_INIT_ERROR);
    }
    ResponseFuture response = new DefaultResponseFuture(request, timeout, this.nettyClient.getUrl());
    this.nettyClient.registerCallback(request.getRequestId(), response);
    byte[] msg = CodecUtil.encodeObjectToBytes(this, codec, request); // TODO 芋艿，为啥在这里处理，而不是在 Encoder 中
    ChannelFuture writeFuture = this.channel.writeAndFlush(msg);
    boolean result = writeFuture.awaitUninterruptibly(timeout, TimeUnit.MILLISECONDS);

    if (result && writeFuture.isSuccess()) {
        response.addListener(new FutureListener() {
            @Override
            public void operationComplete(Future future) throws Exception {
                if (future.isSuccess() || (future.isDone() && ExceptionUtil.isBizException(future.getException()))) {
                    // 成功的调用
                    nettyClient.resetErrorCount();
                } else {
                    // <1>
                    // 失败的调用
                    nettyClient.incrErrorCount();
                }
            }
        });
        return response;
    }

    writeFuture.cancel(true);
    response = this.nettyClient.removeCallback(request.getRequestId());
    if (response != null) {
        response.cancel();
    }
    // <2>
    // 失败的调用
    nettyClient.incrErrorCount();

    if (writeFuture.cause() != null) {
        throw new MotanServiceException("NettyChannel send request to server Error: url="
                + nettyClient.getUrl().getUri() + " local=" + localAddress + " "
                + MotanFrameworkUtil.toString(request), writeFuture.cause());
    } else {
        throw new MotanServiceException("NettyChannel send request to server Timeout: url="
                + nettyClient.getUrl().getUri() + " local=" + localAddress + " "
                + MotanFrameworkUtil.toString(request));
    }
}

而 NettyClient#resetErrorCount() 方法，代码如下：

// AbstractSharedPoolClient.java // NettyClient 继承 AbstractSharedPoolClient 类。 
void incrErrorCount() {
    long count = errorCount.incrementAndGet();

    // 如果节点是可用状态，同时当前连续失败的次数超过连接数，那么把该节点标示为不可用
    if (count >= connections && state.isAliveState()) {
        synchronized (this) {
            count = errorCount.longValue();

            if (count >= connections && state.isAliveState()) {
                LoggerUtil.error("NettyClient unavailable Error: url=" + url.getIdentity() + " "
                        + url.getServerPortStr());
                state = ChannelState.UNALIVE;
            }
        }
    }
}

当超过 connections 连接数时，标记 Client 不可用。

另外，Motan Client 比较特殊( 不同于 Dubbo )，若 connections 配置了 > 1 ，则一个 Client 会包含多个连接 Server 的 Channel 通道。而 Client 发起请求时，从 Channel 数组中选择一个可用的 Channel 。代码如下：

// AbstractSharedPoolClient.java
protected Channel getChannel() throws MotanServiceException {
    int index = MathUtil.getNonNegative(idx.getAndIncrement());
    Channel channel;

    for (int i = index; i < connections + index; i++) {
        channel = channels.get(i % connections);
        // <1>
        if (channel.isAvailable()) {
            return channel;
        // <2>
        } else {
            factory.rebuildObject(channel);
        }
    }

    String errorMsg = this.getClass().getSimpleName() + " getChannel Error: url=" + url.getUri();
    LoggerUtil.error(errorMsg);
    throw new MotanServiceException(errorMsg);
}

<1> 处，选择一个可用的 Channel。
<2> 处，当顺序遍历到的 Channel 不可用，调用 SharedObjectFactory#rebuildObject(NettyChannel nettyChannel) 方法，重新建立和 Server 的连接 Channel 通道。建立的过程是异步的，从而避免连接导致的阻塞过程，影响 #getChannel() 的遍历。

发送消息时检测客户端是否可用

Motan 的后台线程 HeartbeatClientEndpointManager ，负责每 500 ms ，将不可用的 Channel 通道，重新建立和 Server 的连接 Channel 通道。代码如下：

@Override
public void init() {
    executorService = Executors.newScheduledThreadPool(1);
    executorService.scheduleWithFixedDelay(new Runnable() {
        @Override
        public void run() {

            // <1>
            for (Map.Entry<Client, HeartbeatFactory> entry : endpoints.entrySet()) {
                Client endpoint = entry.getKey();

                try {
                    // 如果节点是存活状态，那么没必要走心跳
                    // <2>
                    if (endpoint.isAvailable()) {
                        continue;
                    }

                    // <3>
                    HeartbeatFactory factory = entry.getValue();
                    endpoint.heartbeat(factory.createRequest());
                } catch (Exception e) {
                    LoggerUtil.error("HeartbeatEndpointManager send heartbeat Error: url=" + endpoint.getUrl().getUri() + ", " + e.getMessage());
                }
            }

        }
    }, MotanConstants.HEARTBEAT_PERIOD, MotanConstants.HEARTBEAT_PERIOD, TimeUnit.MILLISECONDS);
    ShutDownHook.registerShutdownHook(new Closable() {
        @Override
        public void close() {
            if (!executorService.isShutdown()) {
                executorService.shutdown();
            }
        }
    });
}

注意，<1> 处的遍历，是以 Client 为维度，也就是说，Client 只有所有的 Channel 都不可用时，才能不满足 <2> 处的条件，不被 continue。
<3> 处，在调用 NettyClient#heartbeat(Request request) 方法，从而触发上述的 #request(Request request) 的重连逻辑。
SharedObjectFactory#rebuildObject(NettyChannel nettyChannel) 方法，重新建立和 Server 的连接 Channel 通道的逻辑，如果重连失败，不会发起重试，所以需要 HeartbeatClientEndpointManager 的定时检测，相当于发起重试。

总结

通过这样的设计，定时 + 调用时两种检测方式，从而实现 Client 和 Server 的连接发生异常时，快速的检测到并进行重连。当然有优点也会有缺点，如果网络发生抖动时，Client 和 Server 的连接断开，Client 在没有向 Server 发起请求时，是无法发现连接其实已经断开。当然，在微博的实际场景下，Client 向 Server 一分钟内不发起请求的肯能性极小，所以这个缺点也一定算。

为什么微博会是这样的设计呢？笔者猜想和微博的混合云方案有关，跨机房难免会存在一些网络抖动的问题。

补充

Motan 的方案，目前没有 Server 到 Client 的空闲检测，依赖 KeepAlive 机制实现。实际场景下，可以考虑添加下。

5. 最后的总结

如果让笔者实现 RPC 的心跳机制时，我倾向于使用 sofa-bolt 或 Dubbo 的方案，因为能够保证 Client 和 Server 双向都增加了严格的检测。虽然相比 Jupiter 的开销会略大一丢丢，但是基本可以忽略。

另外，如果有跨机房的需求，Motan 的方案可以作为补充，实际上，前三者( sofa-bolt / Jupiter / Dubbo ) 和 Motan 是不冲突的，可以作为互为补充。

空闲检测的思考

1. sofa-bolt

2. Jupiter

3. Dubbo

4. Motan

5. 最后的总结