Background
About a month after starting an HDFS cluster, I needed to restart it, only to find it could not be stopped; I had no choice but to kill the processes and start it again. Rather than let the problem go, I wanted to find out what actually caused it.
Examining the stop scripts
[hadoop@hadoop001 sbin]$ vim stop-dfs.sh
....
# namenodes
NAMENODES=$($HADOOP_PREFIX/bin/hdfs getconf -namenodes)
echo "Stopping namenodes on [$NAMENODES]"
"$HADOOP_PREFIX/sbin/hadoop-daemons.sh" \ <--- 调用另一个脚本
--config "$HADOOP_CONF_DIR" \
--hostnames "$NAMENODES" \
--script "$bin/hdfs" stop namenode
...
[hadoop@hadoop001 sbin]$ vim hadoop-daemons.sh
...
# This in turn calls the hadoop-daemon.sh script
exec "$bin/slaves.sh" --config $HADOOP_CONF_DIR cd "$HADOOP_PREFIX" \; "$bin/hadoop-daemon.sh"
--config $HADOOP_CONF_DIR "$@"
...
[hadoop@hadoop001 sbin]$ vim hadoop-daemon.sh
...
# HADOOP_PID_DIR The pid files are stored. /tmp by default.
# HADOOP_IDENT_STRING A string representing this instance of hadoop. $USER by default
pid=$HADOOP_PID_DIR/hadoop-$HADOOP_IDENT_STRING-$command.pid
...
  (stop)

    if [ -f $pid ]; then  # <-- reads the pid file to find the process to kill
      TARGET_PID=`cat $pid`
      if kill -0 $TARGET_PID > /dev/null 2>&1; then
        echo stopping $command
        kill $TARGET_PID
        sleep $HADOOP_STOP_TIMEOUT
        if kill -0 $TARGET_PID > /dev/null 2>&1; then
          echo "$command did not stop gracefully after $HADOOP_STOP_TIMEOUT seconds: killing with kill -9"
          kill -9 $TARGET_PID
        fi
      else
        echo no $command to stop
      fi
      rm -f $pid
    else
      echo no $command to stop
    fi
    ;;

  (*)
    echo $usage
    exit 1
    ;;

esac
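The stop branch above can be reduced to a standalone sketch. The function name `stop_by_pidfile` is my own, and the 5-second timeout is an assumption (it matches the usual `HADOOP_STOP_TIMEOUT` default); the TERM-then-KILL sequence mirrors hadoop-daemon.sh:

```shell
#!/bin/sh
# Sketch of the pid-file stop pattern used by hadoop-daemon.sh.
# stop_by_pidfile is a hypothetical helper; 5s is an assumed timeout default.
HADOOP_STOP_TIMEOUT=${HADOOP_STOP_TIMEOUT:-5}

stop_by_pidfile() {
  pid_file=$1
  if [ -f "$pid_file" ]; then
    target=$(cat "$pid_file")
    if kill -0 "$target" > /dev/null 2>&1; then  # kill -0 only tests liveness
      kill "$target"                             # polite SIGTERM first
      sleep "$HADOOP_STOP_TIMEOUT"
      # still alive after the grace period? force it
      kill -0 "$target" > /dev/null 2>&1 && kill -9 "$target"
    else
      echo "no process to stop"
    fi
    rm -f "$pid_file"
  else
    echo "no pid file to read, nothing to stop"  # the failure mode seen below
  fi
}
```

Note that without the pid file the function never even looks for a running process, which is exactly the symptom described next.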
As the scripts show, stopping HDFS relies entirely on the pid files under /tmp; each file contains nothing but the daemon's pid, like this:
[hadoop@hadoop001 tmp]$ pwd
/tmp
[hadoop@hadoop001 tmp]$ ll
-rw-rw-r-- 1 hadoop hadoop 6 Jul 6 12:54 hadoop-hadoop-datanode.pid
-rw-rw-r-- 1 hadoop hadoop 6 Jul 6 12:54 hadoop-hadoop-namenode.pid
-rw-rw-r-- 1 hadoop hadoop 6 Jul 6 12:54 hadoop-hadoop-secondarynamenode.pid
[hadoop@hadoop001 hadoop]$ jps
25330 NameNode
25636 SecondaryNameNode
25463 DataNode
25751 Jps
[root@hadoop001 tmp]# cat hadoop-hadoop-datanode.pid
25463
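The file names in the listing above follow directly from the variables in hadoop-daemon.sh; a quick sketch of how the path is composed (using `datanode` as the example command):

```shell
# Reconstruct the pid file path the same way hadoop-daemon.sh does:
# HADOOP_PID_DIR defaults to /tmp, HADOOP_IDENT_STRING to $USER.
HADOOP_PID_DIR=${HADOOP_PID_DIR:-/tmp}
HADOOP_IDENT_STRING=${HADOOP_IDENT_STRING:-$USER}
command=datanode

pid=$HADOOP_PID_DIR/hadoop-$HADOOP_IDENT_STRING-$command.pid
echo "$pid"   # for user "hadoop": /tmp/hadoop-hadoop-datanode.pid
```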
Experiment
Delete the datanode pid file under /tmp, then stop HDFS again and see whether the stop fails:
[hadoop@hadoop001 tmp]$ rm -rf hadoop-hadoop-datanode.pid
[hadoop@hadoop001 tmp]$ ll
-rw-rw-r-- 1 hadoop hadoop 6 Jul 6 12:54 hadoop-hadoop-namenode.pid
-rw-rw-r-- 1 hadoop hadoop 6 Jul 6 12:54 hadoop-hadoop-secondarynamenode.pid
[hadoop@hadoop001 hadoop]$ sbin/stop-dfs.sh
19/07/06 14:01:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [hadoop001]
hadoop001: stopping namenode
hadoop001: no datanode to stop
Stopping secondary namenodes [hadoop001]
hadoop001: stopping secondarynamenode
19/07/06 14:02:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop001 hadoop]$ jps
24906 Jps
17391 DataNode
Sure enough, the DataNode cannot be stopped: the process is still running, because the stop script cannot find the DataNode's pid. So why did the pid file disappear from /tmp?
Linux automatically deletes files under /tmp
Searching for information on the /tmp directory turned up the cause: Linux has a mechanism that periodically cleans /tmp, by default removing files every 30 days. See the references on how Linux systems clean the /tmp directory for details.
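Which cleaner is responsible depends on the distribution, so the paths below are assumptions to check rather than guarantees: older RHEL/CentOS releases run a daily `tmpwatch` cron job, while systemd-based systems use `systemd-tmpfiles`:

```shell
# Figure out which mechanism (if any) cleans /tmp on this machine.
# Both paths are distro-dependent assumptions; check what exists locally.
if [ -f /etc/cron.daily/tmpwatch ]; then
  cleaner="tmpwatch (daily cron job)"
elif [ -f /usr/lib/tmpfiles.d/tmp.conf ]; then
  cleaner="systemd-tmpfiles (tmp.conf holds the age limit)"
else
  cleaner="none found at the usual locations"
fi
echo "/tmp cleanup mechanism: $cleaner"
```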
Solution
# Stop HDFS first, then edit the file, then restart
[hadoop@hadoop001 hadoop]$ vim hadoop-env.sh
# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by
# the user that will run the hadoop daemons. Otherwise there is the
# potential for a symlink attack.
# export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_PID_DIR=/home/hadoop/data/tmp <--- change this to the directory where you want pid files stored
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}
[hadoop@hadoop001 ~]$ ll data/tmp/
total 12
-rw-rw-r-- 1 hadoop hadoop 6 Jul 6 14:32 hadoop-hadoop-datanode.pid
-rw-rw-r-- 1 hadoop hadoop 6 Jul 6 14:32 hadoop-hadoop-namenode.pid
-rw-rw-r-- 1 hadoop hadoop 6 Jul 6 14:32 hadoop-hadoop-secondarynamenode.pid
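To confirm the fix holds up over time, a small check can walk the new pid directory and verify that every recorded pid still belongs to a live process. `check_pids` is a hypothetical helper of my own; the directory argument is the `HADOOP_PID_DIR` value set in hadoop-env.sh above:

```shell
# check_pids: for every hadoop-*.pid file in a directory, report whether
# the recorded pid still refers to a live process (kill -0 tests liveness
# without actually sending a signal).
check_pids() {
  dir=$1
  for f in "$dir"/hadoop-*.pid; do
    [ -f "$f" ] || continue            # no pid files at all: print nothing
    p=$(cat "$f")
    if kill -0 "$p" 2>/dev/null; then
      echo "$f -> $p (alive)"
    else
      echo "$f -> $p (stale)"
    fi
  done
}

check_pids /home/hadoop/data/tmp       # the HADOOP_PID_DIR set above
```

A "stale" line after a clean shutdown is normal; a "stale" line while the daemons are supposedly running means the pid files and processes have drifted apart again.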