Setting Up Hadoop 3.3.2 on an Alibaba Cloud Server and Deploying Hadoop 2.7.0 with Docker

Installing Hadoop with Docker

Search for a suitable Hadoop image

[root@administrator ~]# docker search hadoop
NAME                             DESCRIPTION                                     STARS     OFFICIAL   AUTOMATED
sequenceiq/hadoop-docker         An easy way to try Hadoop                       663                  [OK]
uhopper/hadoop                   Base Hadoop image with dynamic configuration…   103                  [OK]
harisekhon/hadoop                Apache Hadoop (HDFS + Yarn, tags 2.2 - 2.8)     67                   [OK]
bde2020/hadoop-namenode          Hadoop namenode of a hadoop cluster             52                   [OK]
bde2020/hadoop-datanode          Hadoop datanode of a hadoop cluster             41                   [OK]

Pull the image

[root@administrator ~]# docker pull sequenceiq/hadoop-docker

Create and start the container

docker run -dit --name hadoop --privileged=true -p 50070:50070 -p 8088:8088  -p 9000:9000 sequenceiq/hadoop-docker /etc/bootstrap.sh -bash

Enter the container

docker exec -it hadoop /bin/bash

Run a Hadoop command

bash-4.1# hadoop fs -ls
bash: hadoop: command not found

hadoop: command not found means the Hadoop binaries are not on the PATH; add the Hadoop bin directory (to make this permanent, append the line to ~/.bashrc):

PATH=$PATH:/usr/local/hadoop/bin/

Run the Hadoop command again

bash-4.1# PATH=$PATH:/usr/local/hadoop/bin/
bash-4.1# hadoop fs -ls
Found 1 items
drwxr-xr-x   - root supergroup          0 2015-07-22 11:17 input
bash-4.1# 

bash-4.1# hadoop version
Hadoop 2.7.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r d4c8d4d4d203c934e8074b31289a28724c0842cf
Compiled by jenkins on 2015-04-10T18:40Z
Compiled with protoc 2.5.0
From source with checksum a9e90912c37a35c3195d23951fd18f
This command was run using /usr/local/hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0.jar
bash-4.1# 

Access the Web UI

View the cluster status: IP:8088

Browse the HDFS file system: IP:50070

Testing the HDFS API (a JUnit test using the hadoop-client API; replace IP with the server's address)

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.junit.Test;

public class HdfsApiTest {

    @Test
    public void listFile() throws Exception {
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://IP:9000"), new Configuration());
        // List all files under the given path; the second argument enables recursive traversal
        RemoteIterator<LocatedFileStatus> iterator = fileSystem.listFiles(new Path("/"), true);
        while (iterator.hasNext()) {
            // Details of each file
            LocatedFileStatus fileStatus = iterator.next();
            // Print the full storage path and the file name
            System.out.println(fileStatus.getPath());
            System.out.println(fileStatus.getPath().getName());
        }
        fileSystem.close();
    }
}
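
As a quick follow-up test, here is a minimal upload sketch that can be added to the same test class (both file paths are placeholders; adjust them as needed):

    @Test
    public void uploadFile() throws Exception {
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://IP:9000"), new Configuration());
        // Copy a local file into HDFS
        fileSystem.copyFromLocalFile(new Path("test.txt"), new Path("/test.txt"));
        fileSystem.close();
    }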

Setting Up Pseudo-Distributed Hadoop on Linux

Official site: https://hadoop.apache.org/

Installing Hadoop

Download: https://archive.apache.org/dist/hadoop/core/

wget http://archive.apache.org/dist/hadoop/core/hadoop-3.3.2/hadoop-3.3.2.tar.gz

Extract and rename

tar -zxvf hadoop-3.3.2.tar.gz

mv hadoop-3.3.2 hadoop

Configure environment variables

vi /etc/profile

Add the Hadoop environment settings:

# hadoop
export HADOOP_HOME=/usr/local/program/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

Apply the changes

 source /etc/profile

Verify

hadoop version

Hostname alias configuration

Change the hostname, either by editing the file directly:

vim /etc/hostname

or with hostnamectl:

hostnamectl set-hostname node01

Check the hostname

hostname

hostnamectl

Update /etc/hosts to map the IP to the hostname alias:

172.22.4.21     node01

Modifying the configuration files

All Hadoop configuration files live under hadoop/etc/hadoop. The main files to modify are core-site.xml, hadoop-env.sh, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.

Note:

After trying the various configurations found online and hitting plenty of pitfalls, each configuration below is split in two: a minimal base configuration followed by an optional extension. Beginners are advised to start with just the base configuration of each file.

core-site.xml

vim  hadoop/etc/hadoop/core-site.xml

Base configuration

<configuration>
    <!-- File system type: the HDFS endpoint address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node01:9000</value>
    </property>
    <!-- Directory for temporary files -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/program/hadoop/datas/tmp</value>
    </property>
</configuration>

Optional configuration (these properties go inside the same <configuration> element):

<!-- Buffer size; tune to the server's capacity in production -->
<property>
    <name>io.file.buffer.size</name>
    <value>8192</value>
</property>
<!-- Enable the HDFS trash so deleted data can be recovered; retention is in minutes -->
<property>
    <name>fs.trash.interval</name>
    <value>10080</value>
</property>
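
To illustrate the trash setting, deletions made through the Java API can also go through the trash rather than removing data outright. A minimal hedged sketch (the target path is hypothetical; depending on the Hadoop version, the trash interval is resolved from the server or from the client configuration):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://node01:9000"), conf);
        // With fs.trash.interval > 0, this moves the file into the current
        // user's .Trash directory instead of deleting it immediately
        boolean moved = Trash.moveToAppropriateTrash(fs, new Path("/input/old.txt"), conf);
        System.out.println("Moved to trash: " + moved);
        fs.close();
    }
}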

hadoop-env.sh

hadoop-env.sh is the Hadoop environment configuration file; set JAVA_HOME in it:

vim  hadoop/etc/hadoop/hadoop-env.sh
# export JAVA_HOME=
 export JAVA_HOME=/usr/local/jdk1.8/

hdfs-site.xml

vim hadoop/etc/hadoop/hdfs-site.xml
<!-- Directory for NameNode metadata -->
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/program/hadoop/datas/namenode/namenodedatas</value>
</property>
<!-- Directory for DataNode data -->
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/program/hadoop/datas/datanode/datanodeDatas</value>
</property>
<!-- HDFS replication factor; set to 1 because pseudo-distributed mode has only one node -->
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<!-- HDFS block size; defaults to 128 MB -->
<property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
</property>
<!-- HDFS web UI address: port 50070 in 2.x, 9870 in 3.x -->
<property>
    <name>dfs.namenode.http-address</name>
    <value>node01:9870</value>
</property>
<!-- Disable HDFS permission checking -->
<property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
</property>
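
Once HDFS is running, these values can be sanity-checked from a Java client. A small sketch, assuming the hadoop-client dependency and the node01 host mapping from above (note the Path-based getters may fall back to client-side defaults if the cluster configuration is not on the classpath):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSettingsCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://node01:9000"), new Configuration());
        Path root = new Path("/");
        // Expect 134217728 (128 MB) and 1, matching hdfs-site.xml above
        System.out.println("Default block size:  " + fs.getDefaultBlockSize(root));
        System.out.println("Default replication: " + fs.getDefaultReplication(root));
        fs.close();
    }
}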

mapred-site.xml

vim hadoop/etc/hadoop/mapred-site.xml
<!-- Execution framework for MapReduce -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
<!-- The following are required; without them MapReduce jobs fail with a prompt to check this configuration -->
<property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/usr/local/program/hadoop</value>
</property>
<property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/usr/local/program/hadoop</value>
</property>
<property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/usr/local/program/hadoop</value>
</property>
<!-- Memory for a map task container -->
<property>
    <name>mapreduce.map.memory.mb</name>
    <value>1024</value>
</property>
<!-- JVM heap for map tasks -->
<property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx512M</value>
</property>
<!-- Memory for a reduce task container -->
<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>1024</value>
</property>
<!-- JVM heap for reduce tasks -->
<property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx512M</value>
</property>
<!-- Memory buffer used to sort map output; defaults to 100 MB -->
<property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>256</value>
</property>
<!-- Number of streams merged at once while sorting files -->
<property>
    <name>mapreduce.task.io.sort.factor</name>
    <value>100</value>
</property>
<!-- Number of copier threads that fetch map output -->
<property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>25</value>
</property>
<!-- JobHistory server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>node01:10020</value>
</property>
<!-- JobHistory web UI address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>node01:19888</value>
</property>
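
These mapred-site.xml values are cluster-wide defaults; individual jobs can override them at submission time. A hedged sketch of what that looks like with the standard Job API (the job name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryTunedJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Per-job overrides of the defaults configured in mapred-site.xml
        conf.set("mapreduce.map.memory.mb", "1024");
        conf.set("mapreduce.map.java.opts", "-Xmx512M");
        conf.set("mapreduce.reduce.memory.mb", "1024");
        conf.set("mapreduce.reduce.java.opts", "-Xmx512M");
        Job job = Job.getInstance(conf, "memory-tuned-job");
        // ... configure mapper, reducer, input and output paths as usual,
        // then submit with job.waitForCompletion(true)
    }
}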

yarn-site.xml

vim hadoop/etc/hadoop/yarn-site.xml
<!-- Auxiliary service run on the NodeManager; must be mapreduce_shuffle for MapReduce jobs to run -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- Whether to enable log aggregation; defaults to false -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Address the ResourceManager exposes to clients, e.g. for submitting applications -->
<property>
    <name>yarn.resourcemanager.address</name>
    <value>node01:8032</value>
</property>
<!-- Address the RM exposes to ApplicationMasters for requesting and releasing resources -->
<property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>node01:8030</value>
</property>
<!-- Address the RM exposes to NodeManagers for heartbeats and task assignment -->
<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>node01:8031</value>
</property>
<!-- Address administrators use to send management commands to the RM -->
<property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>node01:8033</value>
</property>
<!-- RM web UI address for viewing cluster information in a browser -->
<property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>node01:8088</value>
</property>
<!-- Hostname of the RM -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node01</value>
</property>
<!-- Minimum memory allocation per container, in MB -->
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
</property>
<!-- Maximum memory allocation per container, in MB -->
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
</property>
<!-- Ratio of virtual memory to physical memory -->
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
</property>
<!-- Disable the virtual-memory check, which otherwise fails jobs when memory is tight -->
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
<!-- Total physical memory available to the NodeManager, in MB; cannot be changed dynamically once set -->
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>1024</value>
</property>
<!-- A task on YARN needs at least 1.5 GB of memory by default; lower this value on a small VM or jobs will fail -->
<property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>1024</value>
</property>

Formatting HDFS

HDFS must be formatted before its first start: formatting creates the directories that hold the NameNode's metadata (the fsimage and edit log) and initializes a new, empty file system namespace in the NameNode.

hdfs namenode -format

Starting the services

Starting and stopping HDFS

Starting HDFS launches three processes: NameNode, DataNode, and SecondaryNameNode.

Since the Hadoop environment variables are configured, start-dfs.sh can be run directly; otherwise, run ./start-dfs.sh from the sbin directory under the Hadoop home.
start-dfs.sh

stop-dfs.sh

Use jps to check the Java processes

[root@administrator program]# jps
31348 DataNode
31995 SecondaryNameNode
31069 NameNode
717 Jps

Commands to start daemons individually (in Hadoop 3.x these scripts still work but are deprecated in favor of hdfs --daemon start namenode and the like):

hadoop-daemon.sh start namenode             # Start NameNode
hadoop-daemon.sh start datanode             # Start DataNode
hadoop-daemon.sh start secondarynamenode    # Start SecondaryNameNode

hadoop-daemon.sh stop namenode              # Stop NameNode
hadoop-daemon.sh stop datanode              # Stop DataNode
hadoop-daemon.sh stop secondarynamenode     # Stop SecondaryNameNode

Starting and stopping YARN

start-yarn.sh

stop-yarn.sh

Commands to start or stop daemons individually (yarn --daemon start/stop in Hadoop 3.x):

yarn-daemon.sh start resourcemanager    # Start ResourceManager
yarn-daemon.sh start nodemanager        # Start NodeManager

yarn-daemon.sh stop resourcemanager     # Stop ResourceManager
yarn-daemon.sh stop nodemanager         # Stop NodeManager

Verify

[root@administrator hadoop]# jps
9923 NodeManager
2915 NameNode
15956 Jps
3130 DataNode
9692 ResourceManager
3630 SecondaryNameNode

Start or stop HDFS and YARN together

start-all.sh

stop-all.sh

Troubleshooting startup errors

1. Startup fails with:

Starting namenodes on [IP]
ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.
Starting datanodes
ERROR: Attempting to operate on hdfs datanode as root
ERROR: but there is no HDFS_DATANODE_USER defined. Aborting operation.
Starting secondary namenodes [IP]
ERROR: Attempting to operate on hdfs secondarynamenode as root
ERROR: but there is no HDFS_SECONDARYNAMENODE_USER defined. Aborting operation.
Add the following at the end of the hadoop/etc/hadoop/hadoop-env.sh configuration file:
export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"

2. Starting again produces:

[root@administrator program]# start-dfs.sh
Starting namenodes on [IP]
Last login: Sun Mar  6 21:47:15 CST 2022 on pts/4
IP: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
Starting datanodes
Last login: Sun Mar  6 21:58:58 CST 2022 on pts/4
localhost: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
Starting secondary namenodes [IP]
Last login: Sun Mar  6 21:58:58 CST 2022 on pts/4
IP: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).

Even SSH connections to the local machine require key-based authorization, so generate a key pair with ssh-keygen and append the public key to the authorized-keys file:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

3. The NameNode fails to start because its port is reported in use:

java.net.BindException: Port in use:

Checking the port shows no process actually using it:

lsof -i:port

Fix the hostname (vim /etc/hostname); in a cluster, node names must be unique:

[root@administrator logs]# cat /etc/hostname
node01

If using a hostname alias, bind it to the internal IP in /etc/hosts; otherwise use the internal IP directly. Check the internal IP with ifconfig, and use either it or 127.0.0.1:

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.22.4.21  netmask 255.255.192.0  broadcast 172.22.63.255
        ether 00:16:3e:02:73:19  txqueuelen 1000  (Ethernet)
        RX packets 59092928  bytes 13499777151 (12.5 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 38224223  bytes 9940817189 (9.2 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

cat /etc/hosts

172.22.4.21     node01

Firewall settings

systemctl start firewalld.service  # Start the firewall

systemctl stop firewalld.service  # Stop the firewall

systemctl disable firewalld.service  # Disable the firewall at boot

Access the Web UI

Remember to open the corresponding ports.

View the cluster status at IP:8088. Browse HDFS at IP:9870 (stop the firewall, or open the ports, before accessing).

Testing a job

Prepare the data: vim test.txt

MapReduce is a programming
paradigm that enables
massive scalability across
hundreds or thousands of
servers in a Hadoop cluster.
As the processing component,
MapReduce is the heart of Apache Hadoop.
The term "MapReduce" refers to two separate
and distinct tasks that Hadoop programs perform.

Upload test.txt to HDFS

hdfs dfs -put test.txt /input

Submit the job

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.2.jar wordcount /input  /out

Check the output: hdfs dfs -ls /out/

Found 2 items
-rw-r--r--   1 root supergroup          0 2022-03-08 13:49 /out/_SUCCESS
-rw-r--r--   1 root supergroup        332 2022-03-08 13:49 /out/part-r-00000
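
For reference, the wordcount class invoked above is essentially the classic WordCount from the Hadoop MapReduce tutorial. A condensed sketch follows (the examples jar's actual source may differ slightly):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in a line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: sum the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /out
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}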