Introduction
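ZooKeeper plays an important role in Hadoop. It is a distributed coordination service for applications, used mainly to solve consistency problems among application systems in a distributed cluster. ZooKeeper is built on the ZAB (ZooKeeper Atomic Broadcast) protocol, a Paxos-like consensus protocol, and in Hadoop it provides distributed coordination, configuration management, high availability, and naming services. It underpins the stability, reliability, and scalability of a Hadoop cluster.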
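As one of the most fundamental and important Hadoop components, ZooKeeper directly affects the availability of the whole cluster when its service fails. I have run into a number of zk service failures during routine cluster maintenance, so it is worth recording how each one was resolved and reviewing its root cause. This write-up of zk failure analysis and fixes will be updated from time to time.
Problem 1: ZooKeeper service failure: Checksum failed
1.1 Problem description
After the Kerberos credentials were changed, ZooKeeper failed to start. The service log showed the following error:
Key information: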
javax.security.auth.login.LoginException: Checksum failed
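1.2 Root cause analysis
1. Check the service status: only one ZooKeeper node was up, waiting for leader election; the other two nodes had failed to start.
2. Retrace the Kerberos configuration steps:
① On the Administration → Security → Kerberos configuration page, configure the new Kerberos environment
② After the Kerberos configuration is complete, regenerate the credentials
③ Stop the services on all nodes and regenerate the keytab on every node
For details, see "Kerberos: regenerating node credentials".
3. Since Checksum failed is likely related to the keytab, the keytab authentication may have become invalid. Log in to each of the three ZooKeeper nodes, locate the keytab, and verify it.
Path of the running ZooKeeper instance: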
ll /var/run/cloudera-scm-agent/process | grep zookeeper
Find the keytab file in the zookeeper directory with the highest id, here 3384-zookeeper-server:
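Picking the newest process directory by eye is easy to get wrong, because the ids must be compared numerically rather than alphabetically. The following is a minimal sketch, assuming the Cloudera agent directory layout shown above; the function name latest_zk_process_dir is hypothetical and not part of any Cloudera tooling.

```shell
# Print the zookeeper-server process directory with the highest numeric id.
# Usage: latest_zk_process_dir [BASE_DIR]
# BASE_DIR defaults to the Cloudera agent process directory from the text.
latest_zk_process_dir() {
    base="${1:-/var/run/cloudera-scm-agent/process}"
    # Keep only entries named <id>-zookeeper-server, sort by the numeric id,
    # and print the last (largest) one.
    ls -1 "$base" \
        | grep -E '^[0-9]+-zookeeper-server$' \
        | sort -t- -k1,1n \
        | tail -n 1
}
```

With that directory name in hand, the keytab can be inspected with klist -kt from inside the directory.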
# klist -kt zookeeper.keytab
Keytab name: FILE:zookeeper.keytab
KVNO Timestamp Principal
5 04/25/2021 18:19:46 zookeeper/HADOOP.COM@HADOOP.COM
5 04/25/2021 18:19:46 zookeeper/HADOOP.COM@HADOOP.COM
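The keytab was already the latest one, but running kinit -kt zookeeper.keytab zookeeper/HADOOP.COM@HADOOP.COM still returned an error.
Further investigation showed that the krb5.conf file on these two nodes had not been updated when the keytab was switched.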
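A stale krb5.conf like this can be spotted by comparing each node's copy against a known-good one. The sketch below assumes the files have already been copied to one machine (for example with scp); the function name report_krb5_drift and the file names in the usage line are hypothetical.

```shell
# Print every candidate file whose content differs from the reference file.
# A candidate that is missing or unreadable is reported as well.
# Usage: report_krb5_drift REFERENCE CANDIDATE...
report_krb5_drift() {
    reference="$1"
    shift
    for candidate in "$@"; do
        # cmp -s prints nothing and returns non-zero unless the files are identical
        if ! cmp -s "$reference" "$candidate"; then
            echo "DIFFERS: $candidate"
        fi
    done
}
```

For example, report_krb5_drift krb5.conf.good krb5.conf.node1 krb5.conf.node2 prints nothing when every node is in sync.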
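1.3 Solution
Update the krb5.conf file on all nodes, then restart zk and the services that depend on it. This resolved the problem.
Summary: when Checksum failed appears, check the keytab itself and whether the related Kerberos files, krb5.conf included, have been regenerated and updated on every node.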
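Problem 2: ZooKeeper service failure: Unable to run quorum server
2.1 Problem description
In a development/test environment, after the cluster went down for unknown reasons, zk would not start again. The zk log showed the following:
2.2 Root cause analysis
Because ZK exited abnormally the last time, the files under /var/lib/zookeeper ended up in an inconsistent state. A search for this problem turned up the following explanation: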
We had this issue completely take down our 5-node cluster last night. After digging around we found that the culprit. It's a snapshot that started with nodes /A/B/C present. Before it got to B, both B and C got removed. So, the snapshot contains only /A, but the txn log starts with "delete C". When it replays the log, it tries to increment the cversion of the parent, which is /A/B. It isn't present in the snapshot and the recovery crashes.
Reference link: issues.apache.org/jira/browse…
2.3 Solution
On the zk instance that reports the error, delete the instance's data and then restart the zk service.
Removing data from /var/zookeeper/version-2 then restart seems to "fix" the problem (it gets a snapshot from one of the other nodes in the quorum).
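Deleting the data outright leaves no way back if the rest of the quorum turns out to be unhealthy too. As a more cautious variant, the sketch below moves the version-2 directory aside instead of removing it. It assumes the zk instance on that node has already been stopped and that the data directory is the /var/lib/zookeeper path mentioned earlier; the function name backup_zk_data is hypothetical.

```shell
# Move DATA_DIR/version-2 to a timestamped backup instead of deleting it.
# The myid file sits directly in DATA_DIR and is left untouched.
# Prints the backup path on success; returns 1 if there is nothing to move.
# Usage: backup_zk_data DATA_DIR
backup_zk_data() {
    data_dir="$1"
    src="$data_dir/version-2"
    dest="$data_dir/version-2.bak.$(date +%Y%m%d%H%M%S)"
    if [ ! -d "$src" ]; then
        echo "nothing to back up: $src not found" >&2
        return 1
    fi
    mv "$src" "$dest" && echo "$dest"
}
```

After running backup_zk_data /var/lib/zookeeper, restarting the instance lets it pull a fresh snapshot from the other quorum members; the backup can be removed once the node has rejoined.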
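Problem 3: Error when deleting an instance's directory from the ZooKeeper command line
3.1 Problem description
After logging in to the zk command line from the zk client, deleting data on the instance fails with an error.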
# zookeeper-client -server master:2181
Connecting to master:2181
Welcome to ZooKeeper!
JLine support is enabled
[zk: master:2181(CONNECTING) 0]
[zk: localhost:2181(CONNECTED) 1] rmr /hbase
The command 'rmr' has been deprecated. Please use 'deleteall' instead.
Authentication is not valid : /hbase/splitWAL
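3.2 Root cause analysis
The znode is protected by an ACL, so deleting it requires ZooKeeper superuser privileges.
3.3 Solution
Add superuser permissions in the ZooKeeper configuration: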
Enable the ZooKeeper superuser by adding the zookeeper.DigestAuthenticationProvider.superDigest property.
Using Cloudera Manager, navigate to ZooKeeper > Configuration and search for Java Configuration Options for ZooKeeper Server
Add the following property
-Dzookeeper.DigestAuthenticationProvider.superDigest=super:cY+9eK20soteVC3fQ83SXDvwlP0=
Save the configuration
Restart the ZooKeeper server. To maintain security, Cloudera recommends only restarting and using only one ZooKeeper role.
From a shell on any ZooKeeper host, use the ZooKeeper command line to connect to the ZooKeeper server
zookeeper-client -server <zk-host>
Authenticate as the superuser
> addauth digest super:cloudera
Perform the required actions as superuser
Exit the ZooKeeper command line when finished
> quit
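The superDigest value has the form super:<digest>, where the digest is the Base64-encoded SHA-1 hash of the string "super:<password>". To use a password of your own rather than the sample one, a value in that form could be generated roughly as follows. This is a sketch that assumes openssl is installed; zk_super_digest is a hypothetical helper name.

```shell
# Print USER:<base64(sha1("USER:PASSWORD"))>, the format expected by
# zookeeper.DigestAuthenticationProvider.superDigest.
# Usage: zk_super_digest USER PASSWORD
zk_super_digest() {
    user="$1"
    password="$2"
    # Hash the literal "user:password" string, then Base64-encode the raw bytes.
    digest=$(printf '%s:%s' "$user" "$password" \
        | openssl dgst -sha1 -binary \
        | openssl base64)
    printf '%s:%s\n' "$user" "$digest"
}
```

The printed value goes after -Dzookeeper.DigestAuthenticationProvider.superDigest=, and the same password is then used with addauth digest super:<password>. Removing the property and restarting once the cleanup is done keeps the superuser from staying enabled longer than needed.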