Handling a Hadoop Cluster Crash -- When the Data Cannot Be Recovered


Simulating a cluster crash

[muyi@hadoop102 module]$ jps
3010 DataNode
5799 Jps
3545 NodeManager
2879 NameNode
[muyi@hadoop102 module]$ kill -9 3010
[muyi@hadoop102 module]$ jps
3545 NodeManager
5817 Jps
2879 NameNode
[muyi@hadoop102 module]$ 
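On a larger cluster it is easy to overlook a dead daemon in the jps listing. A minimal sketch of checking programmatically whether the DataNode survived; the jps output here is hard-coded from the session above, while on a real node you would capture it with `jps_out=$(jps)`:

```shell
# Simulated jps output copied from the session above;
# on a real node: jps_out=$(jps)
jps_out=$'3545 NodeManager\n5817 Jps\n2879 NameNode'

# grep -q exits with status 0 only if the pattern matches
if echo "$jps_out" | grep -q 'DataNode'; then
  echo "DataNode is running"
else
  echo "DataNode is DOWN"
fi
```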

Now suppose the data folder under the Hadoop installation directory is accidentally deleted:

[muyi@hadoop102 hadoop-3.1.3]$ ll
total 180
drwxr-xr-x. 2 muyi muyi    183 9  12 2019 bin
drwxrwxr-x. 4 muyi muyi     37 11 12 07:59 data
drwxr-xr-x. 3 muyi muyi     20 9  12 2019 etc
drwxr-xr-x. 2 muyi muyi    106 9  12 2019 include
drwxr-xr-x. 3 muyi muyi     20 9  12 2019 lib
drwxr-xr-x. 4 muyi muyi    288 9  12 2019 libexec
-rw-rw-r--. 1 muyi muyi 147145 9   4 2019 LICENSE.txt
drwxrwxr-x. 3 muyi muyi   4096 11 12 07:59 logs
-rw-rw-r--. 1 muyi muyi  21867 9   4 2019 NOTICE.txt
-rw-rw-r--. 1 muyi muyi   1366 9   4 2019 README.txt
drwxr-xr-x. 3 muyi muyi   4096 9  12 2019 sbin
drwxr-xr-x. 4 muyi muyi     31 9  12 2019 share
drwxrwxr-x. 2 muyi muyi     22 11 10 08:43 wcinput
drwxr-xr-x. 2 muyi muyi     88 11 10 08:50 wcoutput
[muyi@hadoop102 hadoop-3.1.3]$ rm -rf data/
[muyi@hadoop102 hadoop-3.1.3]$ ll
total 180
drwxr-xr-x. 2 muyi muyi    183 9  12 2019 bin
drwxr-xr-x. 3 muyi muyi     20 9  12 2019 etc
drwxr-xr-x. 2 muyi muyi    106 9  12 2019 include
drwxr-xr-x. 3 muyi muyi     20 9  12 2019 lib
drwxr-xr-x. 4 muyi muyi    288 9  12 2019 libexec
-rw-rw-r--. 1 muyi muyi 147145 9   4 2019 LICENSE.txt
drwxrwxr-x. 3 muyi muyi   4096 11 12 07:59 logs
-rw-rw-r--. 1 muyi muyi  21867 9   4 2019 NOTICE.txt
-rw-rw-r--. 1 muyi muyi   1366 9   4 2019 README.txt
drwxr-xr-x. 3 muyi muyi   4096 9  12 2019 sbin
drwxr-xr-x. 4 muyi muyi     31 9  12 2019 share
drwxrwxr-x. 2 muyi muyi     22 11 10 08:43 wcinput
drwxr-xr-x. 2 muyi muyi     88 11 10 08:50 wcoutput
[muyi@hadoop102 hadoop-3.1.3]$ 

On hadoop103, the data folder is also accidentally deleted:

[muyi@hadoop103 hadoop-3.1.3]$ rm -rf data

Can files stored on the cluster still be downloaded now?

Trying it in the HDFS web UI shows that the file can still be downloaded (screenshot omitted).
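The download still works because HDFS keeps three replicas of each block by default (`dfs.replication = 3`): losing data/ on hadoop102 and hadoop103 still leaves the replica on hadoop104. On a real cluster you could confirm where the replicas live with fsck; the HDFS path and the location line below are illustrative, not taken from this cluster:

```shell
# On a real cluster (hypothetical HDFS path):
#   hdfs fsck /wcinput -files -blocks -locations
# Each surviving replica appears as one DatanodeInfoWithStorage[...] entry.
# Illustrative single-replica location line, counted the same way
# you would count replicas in the real fsck output:
locations='DatanodeInfoWithStorage[192.168.10.104:9866,DS-example,DISK]'
echo "$locations" | grep -c 'DatanodeInfoWithStorage'
```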

But what if the data folder on hadoop104 is deleted as well?

[muyi@hadoop104 hadoop-3.1.3]$ rm -rf data

After clicking Download, the file can no longer be retrieved: every replica is gone, and the cluster is effectively broken.


Attempting to reformat


[muyi@hadoop102 hadoop-3.1.3]$ hdfs namenode -format
namenode is running as process 2879.  Stop it first.
[muyi@hadoop102 hadoop-3.1.3]$

So we first stop the cluster:

[muyi@hadoop103 hadoop-3.1.3]$ sbin/stop-yarn.sh
Stopping nodemanagers
hadoop102: WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill with kill -9
hadoop103: WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill with kill -9
hadoop104: WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill with kill -9
Stopping resourcemanager
[muyi@hadoop103 hadoop-3.1.3]$ 


[muyi@hadoop102 hadoop-3.1.3]$ sbin/stop-dfs.sh 
Stopping namenodes on [hadoop102]
Stopping datanodes
Stopping secondary namenodes [hadoop104]
[muyi@hadoop102 hadoop-3.1.3]$

Then restart it:

[muyi@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
Starting namenodes on [hadoop102]
Starting datanodes
Starting secondary namenodes [hadoop104]
[muyi@hadoop102 hadoop-3.1.3]$ jps
7045 DataNode
7295 Jps

Notice that the NameNode process is missing from jps. That is because we deleted the data folder, which held the NameNode's metadata, so it cannot start.
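When a daemon is missing from jps, its log usually says why. A sketch of scanning the NameNode log; the log filename follows the usual hadoop-&lt;user&gt;-namenode-&lt;host&gt;.log pattern, and the error line below is illustrative rather than copied from this cluster:

```shell
# On a real node you would inspect the actual log, e.g.:
#   tail -n 100 logs/hadoop-muyi-namenode-hadoop102.log
# A NameNode whose name directory is gone typically fails with a line like
# the illustrative one below; count matches the same way you would grep
# the real log file:
log_line='ERROR namenode.NameNode: Failed to start namenode.'
echo "$log_line" | grep -c 'Failed to start namenode'
```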

Now, try formatting:

[muyi@hadoop102 hadoop-3.1.3]$ ll data/dfs/
total 0
drwx------. 2 muyi muyi 6 11 12 09:49 data
[muyi@hadoop102 hadoop-3.1.3]$ hdfs namenode -format


[muyi@hadoop102 hadoop-3.1.3]$ ll data/dfs/
total 0
drwx------. 2 muyi muyi  6 11 12 09:49 data
drwxrwxr-x. 3 muyi muyi 21 11 12 09:53 name

But even after formatting, the cluster still cannot be used normally.
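The usual culprit is a clusterID mismatch: `hdfs namenode -format` writes a fresh clusterID into data/dfs/name/current/VERSION, while each DataNode keeps the clusterID from the previous format in data/dfs/data/current/VERSION, and a DataNode refuses to register with a NameNode whose clusterID differs. A sketch of extracting the ID for comparison; the VERSION contents below are made up for illustration:

```shell
# Fake VERSION file standing in for data/dfs/name/current/VERSION;
# all field values here are illustrative.
cat > /tmp/VERSION <<'EOF'
namespaceID=123456789
clusterID=CID-example-0000
cTime=1636675980000
storageType=NAME_NODE
EOF

# On a real cluster, run the same grep against the VERSION file on the
# NameNode and on each DataNode, then compare the extracted IDs.
grep '^clusterID=' /tmp/VERSION | cut -d= -f2
```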

The fix

Step 1: stop all the processes

[muyi@hadoop102 hadoop-3.1.3]$ sbin/stop-dfs.sh 
Stopping namenodes on [hadoop102]
Stopping datanodes
Stopping secondary namenodes [hadoop104]
[muyi@hadoop102 hadoop-3.1.3]$ jps
7861 Jps

Note: this must be done on every node in the cluster.

Step 2: delete the data and logs folders on every node

[muyi@hadoop102 hadoop-3.1.3]$ rm -rf data/ logs/

[muyi@hadoop103 hadoop-3.1.3]$ rm -rf data/ logs/

[muyi@hadoop104 hadoop-3.1.3]$ rm -rf data/ logs/
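Since start-dfs.sh already relies on passwordless SSH between these nodes, the per-node cleanup can be driven from hadoop102 in a single loop. A dry-run sketch that only prints each command so you can review it first; drop the leading `echo` to actually execute:

```shell
# Dry run: print the cleanup command for each node instead of running it.
# Remove the leading echo once you have reviewed the output.
for host in hadoop102 hadoop103 hadoop104; do
  echo ssh "$host" "rm -rf /opt/module/hadoop-3.1.3/data /opt/module/hadoop-3.1.3/logs"
done
```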

Step 3: format the NameNode

[muyi@hadoop102 hadoop-3.1.3]$ hdfs namenode -format

Step 4: start the cluster

[muyi@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh 
Starting namenodes on [hadoop102]
Starting datanodes
hadoop103: WARNING: /opt/module/hadoop-3.1.3/logs does not exist. Creating.
hadoop104: WARNING: /opt/module/hadoop-3.1.3/logs does not exist. Creating.
Starting secondary namenodes [hadoop104]
[muyi@hadoop102 hadoop-3.1.3]$ jps
8195 NameNode
8357 DataNode
8597 Jps
[muyi@hadoop102 hadoop-3.1.3]$ 
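To confirm the DataNodes actually rejoined, `hdfs dfsadmin -report` lists the live nodes. A sketch of pulling the live count out of that report; the report text is simulated here, and the IP address is illustrative:

```shell
# On a real cluster: report=$(hdfs dfsadmin -report)
# Simulated first lines of a healthy three-node report:
report=$'Live datanodes (3):\nName: 192.168.10.102:9866 (hadoop102)'

# -o prints only the matched fragment, i.e. the live-node count
echo "$report" | grep -o 'Live datanodes ([0-9]*)'
```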

Now the web UI can be accessed again.


Why this works

Every time the cluster is formatted, a new DataNode version ID is generated. If we delete only the data folder, information tied to the previous format's DataNode version ID still survives (notably under logs), so the newly generated DataNode state conflicts with it and the web UI ends up inaccessible. That is why both data and logs must be wiped on every node before reformatting.

The version ID and the DataNodes are bound one-to-one.