Dubbo升级导致ZK宕机

4,270 阅读3分钟

升级过程及现象

1、Dubbo版本从2.7.4 升级到 2.7.8 2、在开发、测试、预发环境运行的时间超过3周 3、发布线上时,应用启动失败,并伴有大量的ZK客户端重连日志, 如下:

2020-09-24 15:03:40.664, [TID:N/A], INFO, org.apache.zookeeper.ClientCnxn, ClientCnxn.java, 1299, onConnected Session establishment complete on server xx/x:2181, sessionid = 0xe72dc060e3f1d06, negotiated timeout = 40000
2020-09-24 15:03:40.666, [TID:N/A], INFO, org.apache.dubbo.remoting.zookeeper.curator.CuratorZookeeperClient, Log4jLogger.java, 66, info  [DUBBO] Curator zookeeper client instance initiated successfully, session id is e72dc060e3f1d06, dubbo version: 2.7.8, current host: 10.3.107.40
2020-09-24 15:03:53.734, [TID:N/A], WARN, org.apache.zookeeper.ClientCnxn, ClientCnxn.java, 1164, run Session 0xe72dc060e3f1d06 for server xx/xx:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Packet len18594568 is out of range!
	at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:113)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2020-09-24 15:03:55.540, [TID:N/A], WARN, org.apache.dubbo.remoting.zookeeper.curator.CuratorZookeeperClient, Log4jLogger.java, 81, warn  [DUBBO] Curator zookeeper connection of session e72dc060e3f1d06 timed out. connection timeout value is 3000, session expire timeout value is 60000, dubbo version: 2.7.8, current host: 10.3.107.40
2020-09-24 15:03:56.165, [TID:N/A], INFO, org.apache.zookeeper.ClientCnxn, ClientCnxn.java, 879, primeConnection Socket connection established to xx/xx:2181, initiating session
2020-09-24 15:03:56.346, [TID:N/A], INFO, org.apache.zookeeper.ClientCnxn, ClientCnxn.java, 1299, onConnected Session establishment complete on server xx/xx:2181, sessionid = 0xe72dc060e3f1d06, negotiated timeout = 40000
2020-09-24 15:04:08.135, [TID:N/A], WARN, org.apache.dubbo.remoting.zookeeper.curator.CuratorZookeeperClient, Log4jLogger.java, 81, warn  [DUBBO] Curator zookeeper connection recovered from connection lose, reuse the old session e72dc060e3f1d06, dubbo version: 2.7.8, current host: 10.3.107.40
2020-09-24 15:04:53.046, [TID:N/A], WARN, org.apache.zookeeper.ClientCnxn, ClientCnxn.java, 1164, run Session 0xe72dc060e3f1d06 for server xx/xx:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Packet len18594616 is out of range!
	at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:113)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2020-09-24 15:04:54.626, [TID:N/A], WARN, org.apache.dubbo.remoting.zookeeper.curator.CuratorZookeeperClient, Log4jLogger.java, 81, warn  [DUBBO] Curator zookeeper connection of session e72dc060e3f1d06 timed out. connection timeout value is 3000, session expire timeout value is 60000, dubbo version: 2.7.8, current host: 10.3.107.40

4、然后还有上百万条的如下日志

2020-09-24 15:37:50.309, [TID:N/A], ERROR, org.apache.curator.framework.imps.CuratorFrameworkImpl, CuratorFrameworkImpl.java, 694, logError Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:647)
	at org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152)
	at org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:187)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:604)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)

5、运维收到告警,ZK挂了

6、回滚应用

问题原因

主要有以下疑惑:

  1. 为什么其它环境没有问题,而线上环境有问题。线上的ZK有问题吗,不存在的
  2. org.apache.dubbo.metadata.report.MetadataReport是干嘛用的?为啥 2.7.4没有问题,2.7.8有问题?

排查过程:

  1. Packet lenxx is out of range!
2020-09-24 15:03:53.734, [TID:N/A], WARN, org.apache.zookeeper.ClientCnxn, ClientCnxn.java, 1164, run Session 0xe72dc060e3f1d06 for server xx/xx:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Packet len18594568 is out of range!
	at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:113)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)

之前在操作ZK的时候遇到过这个问题,当时是一个节点里面写着很长很长一串乱七八糟的数字,在获取这个节点的时候,就报了个这样的错。在网上查了一些资料,

public static final int packetLen = Integer.getInteger("jute.maxbuffer",4096 * 1024);
protected void readLength() throws IOException {
    int len = incomingBuffer.getInt();
    if (len < 0 || len >= ClientCnxn.packetLen) {
        throw new IOException("Packet len" + len + " is out of range!");
    }
    incomingBuffer = ByteBuffer.allocate(len);
}

大概就是超过了某个限制,可以粗暴的通过设置jute.maxbuffer大小解决这个问题。但是为什么会报这个错呢?因为某个节点太大吗?可是在 2.7.4 版本中并没有这个错,会不会是在新版中对zk做了什么监听导致的?

2、在日志中心看到一条关键日志,发现里面涉及到MetadataReport

Injection of @DubboReference dependencies is failed; nested exception is java.lang.IllegalStateException: Failed to receive INITIALIZED event from zookeeper, pls. check if url zookeeper://xx:2181/org.apache.dubbo.metadata.report.MetadataReport?client=&dubbo.config-center.root-path=/ is correct

在目前的dubbo版本中,元数据中心只是一个用于存放元数据的地方,并没有涉及到其它复杂的逻辑。并且我们已经通过配置中心,为所有应用开启了元数据中心,并将元数据中心改成了Redis,这里为什么会生成这么个东西呢?

3、在MetadataReportInstance的init方法中,涉及到获取MetadataReport的逻辑,最终会导致ZookeeperDynamicConfiguration初始化

4、为什么在其它环境没出现,而仅在日常出现? 因为其它环境都通过dubbo全局配置中心为所有的客户端配置了元数据中心为redis,而线上没有配置。所以导致这里使用了默认元数据中心,及ZK。客户端在没有配置元数据中心的时候,Dubbo会默认使用注册中心作为元数据中心