Hot-removing a DataNode from a Hadoop cluster

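Why

When a machine has a hardware failure or needs to be retired, the Hadoop DataNode on it has to be removed online, without affecting the running cluster.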

Procedure

We manage our Hadoop cluster with CDH, so the steps here follow the official CDH documentation:

docs.cloudera.com/documentati…

Removing a DataNode

Minimum Required Role: Operator (also provided by Configurator, Cluster Administrator, Full Administrator)

  1. The number of DataNodes in your cluster must be greater than or equal to the replication factor you have configured for HDFS. (This value is typically 3.) In order to satisfy this requirement, add the DataNode roles on other hosts as required and start the role instances before removing any DataNodes.

  2. Ensure the DataNode that is to be removed is running

  3. Decommission the DataNode role. When asked to select the role instance to decommission, select the DataNode role instance.

  4. The decommissioning process moves the data blocks to the other available DataNodes. Important: There must be at least as many DataNodes running as the replication factor or the decommissioning process will not complete.

  5. Once decommissioning is completed, stop the DataNode role. When asked to select the role instance to stop, select the DataNode role instance.

  6. Verify the integrity of the HDFS service:

    1. Run the following command to identify any problems in the HDFS file system:

      hdfs fsck /
      
    2. Fix any errors reported by the fsck command. If required, create a Cloudera support case.

  7. After all errors are resolved:

    1. Remove the DataNode role.
    2. Manually remove the DataNode data directories. You can determine the location of these directories by examining the DataNode Data Directory property in the HDFS configuration. In Cloudera Manager, go to the HDFS service, select the Configuration tab and search for the property.
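The replica-count requirement in steps 1 and 4 can be checked from a shell before decommissioning starts. The following is a minimal sketch, not part of the Cloudera procedure. The helper names `live_datanodes` and `can_remove` are hypothetical, and it assumes that `hdfs dfsadmin -report` prints a line of the form `Live datanodes (N):`, which may differ between Hadoop versions. It also reads the requirement as "the DataNodes left after removal must still be at least the replication factor".

```shell
# Print N from a line like "Live datanodes (4):" read on stdin.
# Assumption: this is how `hdfs dfsadmin -report` reports live nodes.
live_datanodes() {
  sed -n 's/.*Live datanodes (\([0-9][0-9]*\)).*/\1/p' | head -n 1
}

# can_remove LIVE REPLICATION [REMOVE]
# Succeeds (exit 0) if removing REMOVE DataNodes (default 1)
# still leaves at least REPLICATION DataNodes running.
can_remove() {
  [ $(( $1 - ${3:-1} )) -ge "$2" ]
}

# Possible usage on a cluster node (not executed here):
#   live=$(hdfs dfsadmin -report | live_datanodes)
#   rep=$(hdfs getconf -confKey dfs.replication)
#   can_remove "$live" "$rep" && echo "safe to decommission one DataNode"
```

If `can_remove` fails, add DataNode roles on other hosts first, as step 1 describes.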

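Notes: requirements and steps

1. By default HDFS keeps 3 replicas, so the number of machines in the cluster must be greater than this value.

2. The DataNode to be removed must be running before it is removed.

3. Log in to the CDH UI. The minimum required role is Operator (also provided by Configurator, Cluster Administrator, Full Administrator).

4. Select the host to remove and click 解除授权 (the UI label for Decommission).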

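How long decommissioning takes depends on how much data the DataNode holds: roughly 1 hour per 1 TB of data.

Decommissioning amounts to copying the data to other nodes and then stopping the decommissioned DataNode.

The decommissioned DataNode's data is copied to other nodes, but the data on the node itself stays in place: it is a copy, not a move. Once decommissioning has finished, you can dispose of that data yourself.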

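5. Once decommissioning has completed, the HDFS UI shows the corresponding DataNode as down.

6. Check the HDFS status:

hdfs fsck /

Run the check. If the reported status is HEALTHY, the filesystem is fine.

7. In the CDH UI, select the node and delete it.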
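The health check in step 6 can also be scripted, so that step 7 only proceeds when fsck reports a healthy filesystem. This is a small sketch under assumptions: the helper name `fsck_is_healthy` is hypothetical, and it assumes the fsck summary contains either `Status: HEALTHY` or a line ending in `is HEALTHY`, which may vary by Hadoop version.

```shell
# Read `hdfs fsck` output on stdin and succeed (exit 0) only if
# it reports a HEALTHY status.
# Assumption: the summary contains "Status: HEALTHY" or "... is HEALTHY".
fsck_is_healthy() {
  grep -Eq '(is|Status:) HEALTHY'
}

# Possible usage on a cluster node (not executed here):
#   if hdfs fsck / | fsck_is_healthy; then
#     echo "HDFS is healthy; the DataNode role can be removed"
#   else
#     echo "fsck reported problems; fix them before removing the role" >&2
#   fi
```

Reading from stdin keeps the check separate from the `hdfs` call, so the same function can be run against saved fsck output.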