Hadoop 3.x Pseudo-Distributed Deployment and Operations Guide

(Figure: Hadoop Ecosystem)

1. Preparing the Pseudo-Distributed Environment

1.1 Architecture Diagram

graph TD
    NN[NameNode] --> DN[DataNode]
    RM[ResourceManager] --> NM[NodeManager]
    DN -->|Heartbeat| NN
    NM -->|Resource report| RM
    style NN fill:#4CAF50
    style RM fill:#2196F3

1.2 Prerequisite Checks

# Environment verification script
import subprocess

def check_environment():
    checks = {
        "Java Version": "java -version",
        "SSH Localhost": "ssh localhost hostname",
        "Disk Space": "df -h /"
    }
    
    for desc, cmd in checks.items():
        try:
            output = subprocess.check_output(cmd, shell=True, stderr=subprocess.STDOUT)
            print(f"✅ {desc} 验证成功")
        except subprocess.CalledProcessError as e:
            print(f"❌ {desc} 失败: {e.output.decode()}")

check_environment()

2. Installing and Configuring Hadoop 3.x

2.1 Installation Workflow

flowchart TD
    A[Download Hadoop] --> B[Extract and install]
    B --> C[Configure environment variables]
    C --> D[Edit configuration files]
    D --> E[Format HDFS]
    E --> F[Start services]

2.2 Step-by-Step Instructions

  1. Download and extract Hadoop
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
sudo tar -xzf hadoop-3.3.6.tar.gz -C /opt
sudo ln -s /opt/hadoop-3.3.6 /opt/hadoop
  2. Configure environment variables
# Generate the environment configuration with a Python script
# (writing under /etc/profile.d requires root privileges)
with open("/etc/profile.d/hadoop.sh", "w") as f:
    f.write("""\
export HADOOP_HOME=/opt/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
""")
# Note: run `source /etc/profile` in your interactive shell afterwards;
# sourcing it from a Python subprocess only affects the child shell.
  3. Edit the core configuration files

core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/var/hadoop/data</value>
    </property>
</configuration>

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file://${hadoop.tmp.dir}/namenode</value>
    </property>
</configuration>

mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
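
If you prefer to script these edits, the sketch below writes all four files with the exact property values shown above. The write_site_xml helper is hypothetical, and the target directory assumes the HADOOP_CONF_DIR set in step 2.

import os

# Render a Hadoop *-site.xml file from a dict of property name/value pairs
CONF_DIR = os.environ.get("HADOOP_CONF_DIR", "/opt/hadoop/etc/hadoop")

def write_site_xml(filename, properties):
    entries = "\n".join(
        f"    <property>\n        <name>{name}</name>\n"
        f"        <value>{value}</value>\n    </property>"
        for name, value in properties.items()
    )
    with open(os.path.join(CONF_DIR, filename), "w") as f:
        f.write(f"<configuration>\n{entries}\n</configuration>\n")

write_site_xml("core-site.xml", {
    "fs.defaultFS": "hdfs://localhost:9000",
    "hadoop.tmp.dir": "/var/hadoop/data",
})
write_site_xml("hdfs-site.xml", {
    "dfs.replication": "1",
    "dfs.namenode.name.dir": "file://${hadoop.tmp.dir}/namenode",
})
write_site_xml("mapred-site.xml", {"mapreduce.framework.name": "yarn"})
write_site_xml("yarn-site.xml", {"yarn.nodemanager.aux-services": "mapreduce_shuffle"})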

3. HDFS File System Operations

3.1 Starting and Stopping Services

# Format the NameNode (first install only)
hdfs namenode -format

# Start HDFS
start-dfs.sh

# Start YARN
start-yarn.sh

# Stop all services (deprecated in 3.x; prefer stop-dfs.sh and stop-yarn.sh)
stop-all.sh
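
To confirm the daemons actually came up, you can inspect the JVM process list. A minimal sketch using the JDK's jps tool; the expected daemon set reflects a standard pseudo-distributed deployment:

import subprocess

# Daemons expected in a pseudo-distributed deployment
EXPECTED = {"NameNode", "DataNode", "SecondaryNameNode",
            "ResourceManager", "NodeManager"}

# jps ships with the JDK and lists local JVM processes as "<pid> <name>"
running = set(subprocess.check_output(["jps"]).decode().split())
missing = EXPECTED - running
print("Missing daemons:", sorted(missing) if missing else "none")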

3.2 File System Operation Examples

import subprocess

class HDFSClient:
    def __init__(self, user="hadoop"):
        self.user = user

    def run_cmd(self, cmd):
        # The -D generic option overrides fs.defaultFS per invocation;
        # note that split() breaks on paths containing spaces
        full_cmd = f"hdfs dfs -D fs.defaultFS=hdfs://localhost:9000 -{cmd}"
        result = subprocess.run(full_cmd.split(), capture_output=True)
        if result.returncode != 0:
            return result.stderr.decode()
        return result.stdout.decode()

    def mkdir(self, path):
        return self.run_cmd(f"mkdir -p /user/{self.user}/{path}")

    def put_file(self, local, remote):
        return self.run_cmd(f"put {local} /user/{self.user}/{remote}")

# Usage example
hdfs = HDFSClient()
print(hdfs.mkdir("input"))
print(hdfs.put_file("localfile.txt", "input/"))

3.3 Web UI Verification

(Figure: HDFS web UI screenshot.) With the services running, the NameNode web UI is reachable at http://localhost:9870 and the ResourceManager UI at http://localhost:8088 (see the port table in the appendix).
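
For a scripted reachability check of the two UIs (ports per the appendix table), a small sketch:

import urllib.request

# Web UIs exposed by a default pseudo-distributed setup (see appendix)
UIS = {
    "NameNode":        "http://localhost:9870",
    "ResourceManager": "http://localhost:8088",
}

for name, url in UIS.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"✅ {name} UI reachable (HTTP {resp.status})")
    except OSError as e:
        print(f"❌ {name} UI not reachable: {e}")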

4. YARN Resource Management

4.1 Resource Scheduling Model

\text{AvailableResources} = \sum_{i=1}^{n} \left( \text{NodeManager}_i.\text{memory} \times \text{NodeManager}_i.\text{vcores} \right)
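
As a worked instance of the formula, with made-up node sizes for a single-node setup:

# memory in MB, paired with vcores, one tuple per NodeManager;
# the values below are illustrative, not Hadoop defaults
node_managers = [(8192, 4)]  # a single node in pseudo-distributed mode

available = sum(mem * vcores for mem, vcores in node_managers)
print(f"Available resources: {available}")  # 8192 * 4 = 32768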

4.2 Submitting a MapReduce Job

# Run the WordCount example
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar \
wordcount /user/hadoop/input /user/hadoop/output
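
Once the job completes, the word counts can be read back from the output directory; part-r-00000 is the conventional name of the first reducer's output file:

import subprocess

# Read the WordCount results back from HDFS
result = subprocess.run(
    ["hdfs", "dfs", "-cat", "/user/hadoop/output/part-r-00000"],
    capture_output=True,
)
print(result.stdout.decode())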

4.3 Resource Monitoring Commands

# YARN application monitoring script
import subprocess

def yarn_app_monitor():
    cmd = "yarn application -list -appStates ALL"
    output = subprocess.check_output(cmd.split()).decode()
    # The first two lines are headers; data rows are tab-separated
    rows = [
        [field.strip() for field in line.split("\t")]
        for line in output.splitlines()[2:]
        if line.strip()
    ]
    return {
        "running": sum(1 for row in rows if "RUNNING" in row),
        "completed": sum(1 for row in rows if "SUCCEEDED" in row),
    }

print(yarn_app_monitor())

5. Troubleshooting Common Issues

5.1 Fault Diagnosis Table

| Symptom | Likely Cause | Resolution |
| --- | --- | --- |
| NameNode fails to start | Port conflict / directory permission problem | Check port 9000; fix permissions on the hadoop.tmp.dir directory |
| DataNode does not register | Inconsistent cluster ID | Clean out all data directories before reformatting |
| YARN job hangs | Insufficient memory allocation | Increase yarn.nodemanager.resource.memory-mb |
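
The cluster-ID mismatch in the second row can be confirmed directly. A sketch comparing the clusterID recorded by the NameNode and the DataNode; the paths assume hadoop.tmp.dir=/var/hadoop/data from core-site.xml and the default DataNode directory ${hadoop.tmp.dir}/dfs/data, so adjust them if your layout differs:

# Compare the clusterID stored in the two VERSION files
def read_cluster_id(version_file):
    with open(version_file) as f:
        for line in f:
            if line.startswith("clusterID="):
                return line.strip().split("=", 1)[1]

nn_id = read_cluster_id("/var/hadoop/data/namenode/current/VERSION")
dn_id = read_cluster_id("/var/hadoop/data/dfs/data/current/VERSION")
print("clusterIDs match" if nn_id == dn_id else f"MISMATCH: {nn_id} vs {dn_id}")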

5.2 Log Inspection Guide

flowchart LR
    A[Log type] --> B[NameNode]
    A --> C[DataNode]
    A --> D[ResourceManager]
    B --> E[$HADOOP_HOME/logs/hadoop-*-namenode-*.log]
    C --> F[$HADOOP_HOME/logs/hadoop-*-datanode-*.log]
    D --> G[$HADOOP_HOME/logs/yarn-*-resourcemanager-*.log]
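
A small sketch automating an ERROR scan across these logs (assumes HADOOP_HOME=/opt/hadoop as configured earlier):

import glob
import os

# Scan Hadoop daemon logs for ERROR-level entries
log_dir = os.path.join(os.environ.get("HADOOP_HOME", "/opt/hadoop"), "logs")

for path in sorted(glob.glob(os.path.join(log_dir, "*.log"))):
    with open(path, errors="replace") as f:
        errors = [line.rstrip() for line in f if " ERROR " in line]
    if errors:
        print(f"{os.path.basename(path)}: {len(errors)} ERROR line(s)")
        print("  first:", errors[0][:120])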

6. Environment Verification Checklist

  1. Basic HDFS operations work (create a directory / upload a file)
  2. Web UIs are reachable and show healthy node status
  3. YARN successfully runs a MapReduce job
  4. Log files contain no ERROR-level entries

Coming up next: with single-node deployment covered, we will move on to Spark Standalone deployment and learn how to run PySpark applications and tune their performance.

Appendix: Hadoop 3.x Port Reference

| Service | Port | Protocol | Purpose |
| --- | --- | --- | --- |
| NameNode | 9000 | TCP | HDFS file system access |
| NameNode Web | 9870 | HTTP | Metadata management UI |
| DataNode | 9864 | HTTP | Data block storage status |
| ResourceManager | 8088 | HTTP | Cluster resource management UI |
| NodeManager | 8042 | HTTP | Node resource status reporting |
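
A quick scripted connectivity check against this table (assuming everything runs on localhost):

import socket

# Ports from the table above
PORTS = {
    "NameNode RPC": 9000,
    "NameNode Web": 9870,
    "DataNode Web": 9864,
    "ResourceManager Web": 8088,
    "NodeManager Web": 8042,
}

for name, port in PORTS.items():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(3)
        # connect_ex returns 0 when the TCP connection succeeds
        state = "listening" if s.connect_ex(("localhost", port)) == 0 else "not reachable"
        print(f"{name:>19} {port}: {state}")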