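Introduction

ZooKeeper plays an important role in Hadoop. It is a distributed coordination service for applications, used mainly to solve consistency problems among application systems in a distributed cluster. ZooKeeper is built on the ZAB (ZooKeeper Atomic Broadcast) protocol, a Paxos-like atomic broadcast protocol, and in Hadoop it provides distributed coordination, configuration management, high availability, and naming services. It underpins the stability, reliability, and scalability of a Hadoop cluster.

Because ZooKeeper is one of the most basic and most important Hadoop components, a ZooKeeper failure directly affects the availability of the whole cluster. Several ZooKeeper failures have come up during routine cluster maintenance, so it is worth recording how each one was resolved and reviewing its cause. These notes on analyzing and fixing ZooKeeper failures will be updated from time to time.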

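Problem 1: ZooKeeper service failure: Checksum failed

1.1 Problem description

After the Kerberos tickets were changed, ZooKeeper failed to start. The service log showed the following error:

[Screenshot: 1695570397(1).png]

Key information: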

javax.security.auth.login.LoginException: Checksum failed

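1.2 Root cause analysis

1. Check the service status: only one ZooKeeper node was waiting for leader election; the other two nodes had failed to start.

2. Retrace the steps used to configure Kerberos:

① Configure the new Kerberos environment on the Administration -> Security -> Kerberos configuration page.

② After the Kerberos configuration is complete, regenerate the tickets.

③ Stop the services on all nodes and regenerate the keytab on every node.

See "Kerberos: regenerating node tickets" for details.

3. Consider how Checksum failed relates to the keytab: the keytab authentication may no longer be valid. Log in to each of the three ZooKeeper nodes, locate the keytab, and verify it.

Path of the ZooKeeper process instances: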

ll /var/run/cloudera-scm-agent/process | grep zookeeper

Find the keytab file in the ZooKeeper directory with the highest id, here 3384-zookeeper-server:
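Picking the newest process directory by eye is error-prone when several restarts have left many directories behind. The following is a minimal sketch, assuming the directory names follow the <id>-zookeeper-server pattern seen above; the function name latest_zk_process_dir is made up for illustration and is not part of Cloudera Manager or ZooKeeper.

```shell
# latest_zk_process_dir [BASE]
# Print the name of the *-zookeeper-server directory with the highest
# numeric id. BASE defaults to the agent process directory used in this
# article. (Hypothetical helper, for illustration only.)
latest_zk_process_dir() {
    # Keep only names such as 3384-zookeeper-server, sort them by the
    # numeric prefix, and keep the last (largest) one.
    ls -1 "${1:-/var/run/cloudera-scm-agent/process}" 2>/dev/null \
        | grep -E '^[0-9]+-zookeeper-server$' \
        | sort -t- -k1,1n \
        | tail -n 1
}
```

On a ZooKeeper node this could be combined with the check that follows, for example: klist -kt "/var/run/cloudera-scm-agent/process/$(latest_zk_process_dir)/zookeeper.keytab".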

#klist -kt zookeeper.keytab

Keytab name: FILE:zookeeper.keytab

KVNO Timestamp Principal

5 04/25/2021 18:19:46 zookeeper/HADOOP.COM@HADOOP.COM

5 04/25/2021 18:19:46 zookeeper/HADOOP.COM@HADOOP.COM

The keytab already contains the latest entries, yet running kinit -kt zookeeper.keytab zookeeper/HADOOP.COM@HADOOP.COM fails with an error.
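A keytab whose entries look current is not proof that a login with it succeeds, so an explicit login test is worth running on each node. The sketch below wraps the two commands used above; the function names keytab_principals and try_keytab are hypothetical, and the temporary credential cache is an added precaution so that existing tickets on the node are not overwritten.

```shell
# keytab_principals: read `klist -kt` output on stdin and print each
# distinct "KVNO principal" pair found in it.
keytab_principals() {
    # Entry lines start with a numeric KVNO; header lines do not.
    awk '$1 ~ /^[0-9]+$/ { print $1, $NF }' | sort -u
}

# try_keytab KEYTAB PRINCIPAL: attempt a login with the keytab, using a
# throwaway credential cache so existing tickets are left untouched.
# Returns the exit status of kinit.
try_keytab() {
    cache=$(mktemp) || return 1
    if KRB5CCNAME="FILE:$cache" kinit -kt "$1" "$2"; then
        status=0
    else
        status=$?
    fi
    rm -f "$cache"
    return "$status"
}
```

For example, klist -kt zookeeper.keytab | keytab_principals lists the principals to test, and try_keytab zookeeper.keytab zookeeper/HADOOP.COM@HADOOP.COM reports through its exit status whether the login works on that node.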

Further investigation led to the krb5.conf file on the two failed nodes: it had not been updated when the keytab was switched.
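Configuration drift of this kind is easy to detect by comparing checksums of each node's copy of the file. The following sketch assumes the copies have already been collected into one place (for example with scp); the function name all_files_identical and the file names in the usage note are hypothetical.

```shell
# all_files_identical FILE...
# Succeed (exit 0) only when every listed file is readable and all of
# them have the same SHA-256 digest. Exit 2 on missing input.
all_files_identical() {
    [ "$#" -ge 1 ] || return 2
    for f in "$@"; do
        [ -r "$f" ] || return 2
    done
    # Count how many different digests appear across the files.
    distinct=$(sha256sum "$@" | awk '{ print $1 }' | sort -u | wc -l)
    [ "$distinct" -eq 1 ]
}
```

With copies saved as krb5.conf.node1, krb5.conf.node2 and krb5.conf.node3, a call such as all_files_identical krb5.conf.node1 krb5.conf.node2 krb5.conf.node3 || echo "krb5.conf differs between nodes" would have flagged the stale files before ZooKeeper was restarted.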

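1.3 Solution

Update the krb5.conf file on all nodes, then restart ZooKeeper and the services that depend on it. This resolved the problem.

Summary: when Checksum failed appears, check that the keytab itself is valid and that the related Kerberos files, such as krb5.conf, have been regenerated and updated on every node.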

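Problem 2: ZooKeeper service failure: Unable to run quorum server

2.1 Problem description

In a development and test environment, ZooKeeper would not start after the cluster went down for an unknown reason. The ZooKeeper log showed the following:

[Screenshot: 1695570856(1).png]

2.2 Root cause analysis

The previous abnormal exit of ZooKeeper left the files under /var/lib/zookeeper in an inconsistent state. A search for this problem turned up the following explanation: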

We had this issue completely take down our 5-node cluster last night. After digging around we found that the culprit. It's a snapshot that started with nodes /A/B/C present. Before it got to B, both B and C got removed. So, the snapshot contains only /A, but the txn log starts with "delete C". When it replays the log, it tries to increment the cversion of the parent, which is /A/B. It isn't present in the snapshot and the recovery crashes.

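Reference: issues.apache.org/jira/browse…

2.3 Solution

For each ZooKeeper instance that fails in this way, delete the data on that instance and then restart the ZooKeeper service.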

Removing data from /var/zookeeper/version-2 then restart seems to "fix" the problem (it gets a snapshot from one of the other nodes in the quorum).
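Deleting the data is irreversible, so a more cautious variant is to rename the version-2 directory instead, which has the same effect on startup and keeps a copy for later inspection. The sketch below assumes the ZooKeeper server on this node is already stopped, that the remaining quorum members are healthy enough to serve a snapshot, and that the data directory is passed in explicitly; set_aside_zk_data is a hypothetical helper name.

```shell
# set_aside_zk_data DATA_DIR
# Rename DATA_DIR/version-2 to DATA_DIR/version-2.bak.<timestamp> and
# print the new path. The myid file in DATA_DIR is left in place.
# (Hypothetical helper; stop the ZooKeeper server on this node first.)
set_aside_zk_data() {
    src="$1/version-2"
    [ -d "$src" ] || { echo "no version-2 directory under $1" >&2; return 1; }
    dest="$src.bak.$(date +%Y%m%d%H%M%S)"
    mv "$src" "$dest" && printf '%s\n' "$dest"
}
```

For the environment in this article the call would be set_aside_zk_data /var/lib/zookeeper, followed by a restart of the ZooKeeper role. If a separate transaction log directory is configured, it has its own version-2 directory that would need the same treatment.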

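Problem 3: Error when deleting a directory from the ZooKeeper command line

3.1 Problem description

After logging in to the ZooKeeper command line from the ZooKeeper client, deleting data on the instance failed with an error.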

# zookeeper-client -server master:2181

Connecting to master:2181

Welcome to ZooKeeper!

JLine support is enabled

[zk: master:2181(CONNECTING) 0]

[zk: localhost:2181(CONNECTED) 1] rmr /hbase

The command 'rmr' has been deprecated. Please use 'deleteall' instead.

Authentication is not valid : /hbase/splitWAL

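3.2 Root cause analysis

The delete requires ZooKeeper superuser privileges, because the current session is not authorized for /hbase/splitWAL.

3.3 Solution

Add superuser credentials to the ZooKeeper configuration: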

Enable the ZooKeeper superuser by adding the zookeeper.DigestAuthenticationProvider.superDigest property.

Using Cloudera Manager, navigate to ZooKeeper > Configuration and search for Java Configuration Options for ZooKeeper Server

Add the following property

-Dzookeeper.DigestAuthenticationProvider.superDigest=super:cY+9eK20soteVC3fQ83SXDvwlP0=

Save the configuration

Restart the ZooKeeper server. To maintain security, Cloudera recommends only restarting and using only one ZooKeeper role.

From a shell on any ZooKeeper host, use the ZooKeeper command line to connect to the ZooKeeper server

zookeeper-client -server <zk-host>

Authenticate as the superuser

> addauth digest super:cloudera

Perform the required actions as superuser

Exit the ZooKeeper command line when finished

> quit
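The value after super: in the superDigest property is not the password itself. In ZooKeeper's digest scheme it is the Base64-encoded SHA-1 hash of the string user:password. The sketch below reproduces that calculation with standard tools so that a site-specific password can be used instead of the documented example; the function name zk_super_digest is made up for illustration, and sha1sum and base64 are assumed to be installed.

```shell
# zk_super_digest USER PASSWORD
# Print USER:<base64 of SHA-1("USER:PASSWORD")>, the form expected by
# zookeeper.DigestAuthenticationProvider.superDigest.
# (Hypothetical helper; assumes sha1sum and base64 are available.)
zk_super_digest() {
    # Hex SHA-1 of the literal string "user:password" (no trailing newline).
    hex=$(printf '%s' "$1:$2" | sha1sum | cut -d' ' -f1)
    # Turn each hex pair back into a raw byte, then Base64-encode the bytes.
    encoded=$(
        for byte in $(printf '%s' "$hex" | sed 's/../& /g'); do
            printf '%b' "\\0$(printf '%03o' "0x$byte")"
        done | base64
    )
    printf '%s:%s\n' "$1" "$encoded"
}
```

Running zk_super_digest super cloudera prints a string in the same super:<digest> form as the property above; the password cloudera is taken from the addauth step, and a real deployment should choose its own. The password given to addauth must be the one the digest was generated from.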
