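Preface
It has been a while since I last updated this blog. Over the past two months or so I lived through the flood and the pandemic in Zhengzhou while also racing against a project deadline, so I really had no energy for reading source code and writing notes. Fortunately things have recently settled back to normal, so it is time to collect myself and resume studying.
In the previous section we looked at the Nacos service registration mechanism. In this section we continue with the Nacos service heartbeat mechanism.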
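Client-side health check process
The heartbeat thread is initialized in NamingGrpcClientProxy. As you may recall, NamingGrpcClientProxy is the gRPC implementation of the communication between the Nacos server and client.
Its constructor calls the method that starts the thread: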
public NamingGrpcClientProxy(String namespaceId, SecurityProxy securityProxy, ServerListFactory serverListFactory,
        Properties properties, ServiceInfoHolder serviceInfoHolder) throws NacosException {
    super(securityProxy, properties);
    this.namespaceId = namespaceId;
    this.uuid = UUID.randomUUID().toString();
    this.requestTimeout = Long.parseLong(properties.getProperty(CommonParams.NAMING_REQUEST_TIMEOUT, "-1"));
    Map<String, String> labels = new HashMap<String, String>();
    labels.put(RemoteConstants.LABEL_SOURCE, RemoteConstants.LABEL_SOURCE_SDK);
    labels.put(RemoteConstants.LABEL_MODULE, RemoteConstants.LABEL_MODULE_NAMING);
    // Create the RPC client
    this.rpcClient = RpcClientFactory.createClient(uuid, ConnectionType.GRPC, labels);
    this.namingGrpcConnectionEventListener = new NamingGrpcConnectionEventListener(this);
    // The threads are started from here
    start(serverListFactory, serviceInfoHolder);
}

private void start(ServerListFactory serverListFactory, ServiceInfoHolder serviceInfoHolder) throws NacosException {
    rpcClient.serverListFactory(serverListFactory);
    // rpcClient.start() launches the worker threads
    rpcClient.start();
    rpcClient.registerServerRequestHandler(new NamingPushRequestHandler(serviceInfoHolder));
    rpcClient.registerConnectionListener(namingGrpcConnectionEventListener);
}
Next, let's look at rpcClient.start():
public final void start() throws NacosException {
    // ......
    // Thread pool
    clientEventExecutor = new ScheduledThreadPoolExecutor(2, new ThreadFactory() {
        @Override
        public Thread newThread(Runnable r) {
            Thread t = new Thread(r);
            t.setName("com.alibaba.nacos.client.remote.worker");
            t.setDaemon(true);
            return t;
        }
    });
    // Connection event consumer (handles connect and disconnect events).
    clientEventExecutor.submit(new Runnable() {
        // ......
    });
    // Heartbeat thread
    clientEventExecutor.submit(new Runnable() {
        @Override
        public void run() {
            // Loop forever
            while (true) {
                try {
                    ReconnectContext reconnectContext = reconnectionSignal
                            .poll(keepAliveTime, TimeUnit.MILLISECONDS);
                    if (reconnectContext == null) {
                        // Check alive time: run the health check only when
                        // now - lastActiveTimeStamp >= keepAliveTime (5 s).
                        if (System.currentTimeMillis() - lastActiveTimeStamp >= keepAliveTime) {
                            // Run the health check
                            boolean isHealthy = healthCheck();
                            if (!isHealthy) {
                                if (currentConnection == null) {
                                    continue;
                                }
                                // Unhealthy: mark the client status as UNHEALTHY
                                rpcClientStatus.set(RpcClientStatus.UNHEALTHY);
                                reconnectContext = new ReconnectContext(null, false);
                            } else {
                                // Healthy: refresh the timestamp and keep looping
                                lastActiveTimeStamp = System.currentTimeMillis();
                                continue;
                            }
                        } else {
                            continue;
                        }
                    }
                    // ......
                    reconnect(reconnectContext.serverInfo, reconnectContext.onRequestFail);
                } catch (Throwable throwable) {
                    //Do nothing
                }
            }
        }
    });
    // .......
    // .......
}
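To make the branching in the heartbeat loop easier to follow, here is a minimal sketch of a single pass through it. It is not the Nacos implementation: the class `KeepAliveSketch`, the `Action` enum, the `step` method, and the injected `nowMillis` clock are all hypothetical names introduced for illustration, and the sketch leaves out the `currentConnection == null` shortcut and the actual reconnect call.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical, simplified model of one iteration of the heartbeat loop.
class KeepAliveSketch {

    enum Action { IDLE, REFRESHED, RECONNECT }

    // Stands in for reconnectionSignal; the payload is just a reason string here.
    private final BlockingQueue<String> reconnectSignal = new LinkedBlockingQueue<>();
    private final long keepAliveMillis;
    private final BooleanSupplier healthCheck;
    private long lastActiveTimestamp;

    KeepAliveSketch(long keepAliveMillis, BooleanSupplier healthCheck, long startMillis) {
        this.keepAliveMillis = keepAliveMillis;
        this.healthCheck = healthCheck;
        this.lastActiveTimestamp = startMillis;
    }

    void requestReconnect(String reason) {
        reconnectSignal.offer(reason);
    }

    // One pass of the loop body. The clock is passed in so the logic can be tested.
    Action step(long waitMillis, long nowMillis) {
        String signal;
        try {
            // Wait for an explicit reconnect request, but no longer than waitMillis.
            signal = reconnectSignal.poll(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Action.IDLE;
        }
        if (signal != null) {
            return Action.RECONNECT;      // someone asked for a reconnect
        }
        if (nowMillis - lastActiveTimestamp < keepAliveMillis) {
            return Action.IDLE;           // recently active, no check needed
        }
        if (healthCheck.getAsBoolean()) {
            lastActiveTimestamp = nowMillis;
            return Action.REFRESHED;      // healthy, remember when we checked
        }
        return Action.RECONNECT;          // unhealthy, fall through to reconnect
    }

    public static void main(String[] args) {
        KeepAliveSketch sketch = new KeepAliveSketch(5000L, () -> true, 0L);
        System.out.println(sketch.step(1L, 1000L));
        System.out.println(sketch.step(1L, 5000L));
    }
}
```

The point of the sketch is that the timed `poll` does double duty: it is both the sleep between health checks and the channel through which other code can force an immediate reconnect.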
OK, next let's look at healthCheck():
private boolean healthCheck() {
    HealthCheckRequest healthCheckRequest = new HealthCheckRequest();
    if (this.currentConnection == null) {
        return false;
    }
    try {
        // Send the request
        Response response = this.currentConnection.request(healthCheckRequest, 3000L);
        // Check the response: this verifies not only that the server is OK
        // but also that the connection is registered.
        return response != null && response.isSuccess();
    } catch (NacosException e) {
        // ignore
    }
    return false;
}
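This simply sends a health check request with an empty body. The response carries no extra data either; only its success flag matters.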
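The rule that healthCheck() applies (anything other than a timely, successful response counts as unhealthy) can be sketched on its own. The following is only an illustration using plain java.util.concurrent, not the Nacos gRPC client: `HealthProbeSketch`, `isHealthy`, and the `Callable<Boolean>` probe are hypothetical stand-ins for the real request call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: a probe is healthy only if it answers "true" in time.
class HealthProbeSketch {

    static boolean isHealthy(Callable<Boolean> probe, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "health-probe-sketch");
            t.setDaemon(true); // do not keep the JVM alive for a stuck probe
            return t;
        });
        try {
            Future<Boolean> future = executor.submit(probe);
            Boolean success = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            // A missing response is treated the same as a failed one.
            return success != null && success;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException | TimeoutException e) {
            // The probe threw or did not answer in time: unhealthy.
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(isHealthy(() -> true, 3000L));
        System.out.println(isHealthy(() -> null, 3000L));
    }
}
```

Creating an executor per call keeps the sketch short; a real client would reuse its connection and thread pool.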
Server-side heartbeat handling
From the previous chapters we know that the server-side entry point for handling requests is GrpcRequestAcceptor#request():
// GrpcRequestAcceptor # request()
public void request(Payload grpcRequest, StreamObserver<Payload> responseObserver) {
    // ...... Much of the logic is omitted; we focus only on the request-handling part
    Request request = (Request) parseObj;
    try {
        // Look up the connection
        Connection connection = connectionManager.getConnection(CONTEXT_KEY_CONN_ID.get());
        // Build the request metadata
        RequestMeta requestMeta = new RequestMeta();
        requestMeta.setClientIp(connection.getMetaInfo().getClientIp());
        requestMeta.setConnectionId(CONTEXT_KEY_CONN_ID.get());
        requestMeta.setClientVersion(connection.getMetaInfo().getVersion());
        requestMeta.setLabels(connection.getMetaInfo().getLabels());
        // Refresh the connection's last active time
        connectionManager.refreshActiveTime(requestMeta.getConnectionId());
        // Run the business handler
        Response response = requestHandler.handleRequest(request, requestMeta);
        // Convert the response into a payload
        Payload payloadResponse = GrpcUtils.convert(response);
        // Write the response back
        traceIfNecessary(payloadResponse, false);
        responseObserver.onNext(payloadResponse);
        responseObserver.onCompleted();
    } catch (Throwable e) {
        // ...
    }
}
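For a health check request, the handler that runs here is HealthCheckRequestHandler, which simply creates and returns a new HealthCheckResponse. HealthCheckResponse itself is an empty class that adds no fields or methods: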
@Component
public class HealthCheckRequestHandler extends RequestHandler<HealthCheckRequest, HealthCheckResponse> {
    @Override
    @TpsControl(pointName = "HealthCheck")
    public HealthCheckResponse handle(HealthCheckRequest request, RequestMeta meta) {
        return new HealthCheckResponse();
    }
}

public class HealthCheckResponse extends Response {
}
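This raises a question: if the server-side handler does nothing at all for a health check, how does the Nacos server remove unhealthy nodes when a client goes offline?
ConnectionManager runs a scheduled task dedicated to evicting unhealthy connections. For now it is enough to know that this task exists; there is no need to dig too deeply into the source logic.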
public class ConnectionManager extends Subscriber<ConnectionLimitRuleChangeEvent> {
    /**
     * Start Task:Expel the connection which active Time expire.
     */
    @PostConstruct
    public void start() {
        // Start UnHealthy Connection Expel Task.
        RpcScheduledExecutor.COMMON_SERVER_EXECUTOR.scheduleWithFixedDelay(new Runnable() {
            @Override
            public void run() {
                try {
                    // ..................
                    // IDs of connections whose active time has expired
                    Set<String> outDatedConnections = new HashSet<>();
                    long now = System.currentTimeMillis();
                    for (Map.Entry<String, Connection> entry : entries) {
                        Connection client = entry.getValue();
                        String clientIp = client.getMetaInfo().getClientIp();
                        AtomicInteger integer = expelForIp.get(clientIp);
                        if (integer != null && integer.intValue() > 0) {
                            integer.decrementAndGet();
                            expelClient.add(client.getMetaInfo().getConnectionId());
                            expelCount--;
                        // Key point: the check below compares against the last active time.
                        // Whenever the server handles a request, it calls
                        // refreshActiveTime(requestMeta.getConnectionId()), which updates lastActiveTime.
                        } else if (now - client.getMetaInfo().getLastActiveTime() >= KEEP_ALIVE_TIME) {
                            outDatedConnections.add(client.getMetaInfo().getConnectionId());
                        }
                    }
                    // ...........
                    // Send a connection reset request to each expelled client,
                    // recommending the redirect address if one is set
                    String serverIp = null;
                    String serverPort = null;
                    if (StringUtils.isNotBlank(redirectAddress) && redirectAddress.contains(Constants.COLON)) {
                        String[] split = redirectAddress.split(Constants.COLON);
                        serverIp = split[0];
                        serverPort = split[1];
                    }
                    for (String expelledClientId : expelClient) {
                        try {
                            Connection connection = getConnection(expelledClientId);
                            if (connection != null) {
                                ConnectResetRequest connectResetRequest = new ConnectResetRequest();
                                connectResetRequest.setServerIp(serverIp);
                                connectResetRequest.setServerPort(serverPort);
                                connection.asyncRequest(connectResetRequest, null);
                                Loggers.REMOTE_DIGEST
                                        .info("Send connection reset request , connection id = {},recommendServerIp={}, recommendServerPort={}",
                                                expelledClientId, connectResetRequest.getServerIp(),
                                                connectResetRequest.getServerPort());
                            }
                        } catch (ConnectionAlreadyClosedException e) {
                            // The connection is already closed, so unregister it.
                            unregister(expelledClientId);
                        } catch (Exception e) {
                            Loggers.REMOTE_DIGEST.error("Error occurs when expel connection, expelledClientId:{}", expelledClientId, e);
                        }
                    }
                    //4.client active detection.
                    // The server sends a detection request to each outdated client.
                    // A successful response means the client is alive after all.
                    if (CollectionUtils.isNotEmpty(outDatedConnections)) {
                        Set<String> successConnections = new HashSet<>();
                        final CountDownLatch latch = new CountDownLatch(outDatedConnections.size());
                        for (String outDateConnectionId : outDatedConnections) {
                            try {
                                Connection connection = getConnection(outDateConnectionId);
                                if (connection != null) {
                                    ClientDetectionRequest clientDetectionRequest = new ClientDetectionRequest();
                                    connection.asyncRequest(clientDetectionRequest, new RequestCallBack() {
                                        @Override
                                        public void onResponse(Response response) {
                                            latch.countDown();
                                            // Revived: refresh the last active time and add it to the success set.
                                            if (response != null && response.isSuccess()) {
                                                connection.freshActiveTime();
                                                successConnections.add(outDateConnectionId);
                                            }
                                        }

                                        @Override
                                        public void onException(Throwable e) {
                                            latch.countDown();
                                        }
                                    });
                                } else {
                                    latch.countDown();
                                }
                            } catch (ConnectionAlreadyClosedException e) {
                                latch.countDown();
                            } catch (Exception e) {
                                latch.countDown();
                            }
                        }
                        latch.await(3000L, TimeUnit.MILLISECONDS);
                        for (String outDateConnectionId : outDatedConnections) {
                            // Unregister every connection that is not in the success set.
                            if (!successConnections.contains(outDateConnectionId)) {
                                unregister(outDateConnectionId);
                            }
                        }
                    }
                    //reset loader client
                    if (isLoaderClient) {
                        loadClient = -1;
                        redirectAddress = null;
                    }
                } catch (Throwable e) {
                    Loggers.REMOTE.error("Error occurs during connection check... ", e);
                }
            }
        }, 1000L, 3000L, TimeUnit.MILLISECONDS);
    }
    // ......
}
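The interplay between refreshing lastActiveTime on every request and the periodic eviction task can be condensed into a small model. This is a sketch under stated assumptions, not the ConnectionManager code: `ConnectionRegistrySketch`, `sweep`, and the synchronous `probe` predicate are hypothetical, the timeout is a constructor parameter rather than the real KEEP_ALIVE_TIME, and the real task probes clients asynchronously and waits on a CountDownLatch instead of calling them one by one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical model: connection id -> last active time, plus an eviction sweep.
class ConnectionRegistrySketch {

    private final Map<String, Long> lastActive = new ConcurrentHashMap<>();
    private final long keepAliveMillis;

    ConnectionRegistrySketch(long keepAliveMillis) {
        this.keepAliveMillis = keepAliveMillis;
    }

    void register(String connectionId, long nowMillis) {
        lastActive.put(connectionId, nowMillis);
    }

    // Called for every incoming request, including empty health checks.
    void refreshActiveTime(String connectionId, long nowMillis) {
        lastActive.computeIfPresent(connectionId, (id, old) -> nowMillis);
    }

    boolean contains(String connectionId) {
        return lastActive.containsKey(connectionId);
    }

    // One run of the eviction task; returns the ids that were removed.
    Set<String> sweep(long nowMillis, Predicate<String> probe) {
        // Step 1: collect connections that have been silent for too long.
        List<String> outdated = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastActive.entrySet()) {
            if (nowMillis - entry.getValue() >= keepAliveMillis) {
                outdated.add(entry.getKey());
            }
        }
        // Step 2: give each one a last chance by probing it from the server side.
        Set<String> removed = new TreeSet<>();
        for (String id : outdated) {
            if (probe.test(id)) {
                lastActive.put(id, nowMillis); // it answered, so keep it
            } else {
                lastActive.remove(id);         // no answer, so unregister it
                removed.add(id);
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        ConnectionRegistrySketch registry = new ConnectionRegistrySketch(20_000L);
        registry.register("conn-1", 0L);
        System.out.println(registry.sweep(25_000L, id -> false));
    }
}
```

This is why the health check handler can be empty: the request itself is what matters, because merely arriving at the server refreshes the timestamp that the sweep later inspects.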
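In addition, ClientManager also has a scheduled task that removes expired client information.
The approach here is a little unusual: the task is started in the constructor, so when Spring Boot instantiates the bean it calls the constructor and thereby starts the task.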
public ConnectionBasedClientManager() {
    GlobalExecutor
            .scheduleExpiredClientCleaner(new ExpiredClientCleaner(this), 0, Constants.DEFAULT_HEART_BEAT_INTERVAL,
                    TimeUnit.MILLISECONDS);
}

private static class ExpiredClientCleaner implements Runnable {
    private final ConnectionBasedClientManager clientManager;

    public ExpiredClientCleaner(ConnectionBasedClientManager clientManager) {
        this.clientManager = clientManager;
    }

    @Override
    public void run() {
        long currentTime = System.currentTimeMillis();
        for (String each : clientManager.allClientId()) {
            ConnectionBasedClient client = (ConnectionBasedClient) clientManager.getClient(each);
            if (null != client && client.isExpire(currentTime)) {
                clientManager.clientDisconnected(each);
            }
        }
    }
}

// ConnectionBasedClient # isExpire
@Override
public boolean isExpire(long currentTime) {
    return !isNative() && currentTime - getLastRenewTime() > Constants.DEFAULT_IP_DELETE_TIMEOUT;
}
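The expiry rule in isExpire has two details that are easy to miss: clients for which isNative() is true are never expired by this cleaner, and the comparison is strict, so a client is removed only once it has been silent for longer than the timeout. A minimal sketch, assuming a hypothetical `ClientState` holder and an arbitrary timeout passed as a parameter (the `nativeClient` flag merely stands in for isNative(), whose meaning is not covered here):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the expiry check applied by the cleaner task.
class ClientExpirySketch {

    static final class ClientState {
        final boolean nativeClient;   // stands in for isNative()
        final long lastRenewMillis;   // stands in for getLastRenewTime()

        ClientState(boolean nativeClient, long lastRenewMillis) {
            this.nativeClient = nativeClient;
            this.lastRenewMillis = lastRenewMillis;
        }
    }

    static boolean isExpired(ClientState client, long nowMillis, long timeoutMillis) {
        // Strictly greater than: a client exactly at the timeout is still kept.
        return !client.nativeClient && nowMillis - client.lastRenewMillis > timeoutMillis;
    }

    // Returns the ids the cleaner would disconnect, in the map's iteration order.
    static List<String> expiredIds(Map<String, ClientState> clients, long nowMillis, long timeoutMillis) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, ClientState> entry : clients.entrySet()) {
            if (isExpired(entry.getValue(), nowMillis, timeoutMillis)) {
                expired.add(entry.getKey());
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        Map<String, ClientState> clients = new LinkedHashMap<>();
        clients.put("c1", new ClientState(false, 0L));
        clients.put("c2", new ClientState(true, 0L));
        System.out.println(expiredIds(clients, 40_000L, 30_000L));
    }
}
```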
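Summary
- Connection has a lastActiveTime property. Every time the client communicates with the server, the server refreshes this value.
- ConnectionManager has a scheduled task that runs every 3 s and checks Connection#lastActiveTime. Connections whose last refresh is too old are picked out and probed with a request from the server; if the probe does not get a successful response, the connection is considered dead and is removed.
- ClientManager works in a similar way: it tracks the time of the last communication, and a Client with no communication for more than 30 s is considered expired and is removed.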
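Side note
While reading the source I came across two managers: ConnectionManager and ClientManager.
At first I did not understand the difference between them, so I asked in the Nacos open-source community group and eventually got an answer from 杨翊 (Yang Yi), one of the main Nacos developers:
- Connection is the lower-level concept and covers the connections of all modules, including naming and config.
- Client applies only to the business of the naming (service registration and discovery) module.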