Installing Hadoop with Docker
Search for a suitable Hadoop image
[root@administrator ~]# docker search hadoop
NAME                       DESCRIPTION                                     STARS  OFFICIAL  AUTOMATED
sequenceiq/hadoop-docker   An easy way to try Hadoop                       663              [OK]
uhopper/hadoop             Base Hadoop image with dynamic configuration…   103              [OK]
harisekhon/hadoop          Apache Hadoop (HDFS + Yarn, tags 2.2 - 2.8)     67               [OK]
bde2020/hadoop-namenode    Hadoop namenode of a hadoop cluster             52               [OK]
bde2020/hadoop-datanode    Hadoop datanode of a hadoop cluster             41               [OK]
Pull the image
[root@administrator ~]# docker pull sequenceiq/hadoop-docker
Create and start a container
docker run -dit --name hadoop --privileged=true -p 50070:50070 -p 8088:8088 -p 9000:9000 sequenceiq/hadoop-docker /etc/bootstrap.sh -bash
Enter the container
docker exec -it hadoop /bin/bash
Run a Hadoop command
bash-4.1# hadoop fs -ls
bash: hadoop: command not found
The hadoop binary is not on the PATH, so add it:
PATH=$PATH:/usr/local/hadoop/bin/
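Note that setting PATH this way only lasts for the current shell session. One option to make it persistent inside the container (a sketch, assuming root's ~/.bashrc is sourced by the container shell) is:
echo 'export PATH=$PATH:/usr/local/hadoop/bin/' >> ~/.bashrc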
Run the Hadoop command again
bash-4.1# PATH=$PATH:/usr/local/hadoop/bin/
bash-4.1# hadoop fs -ls
Found 1 items
drwxr-xr-x - root supergroup 0 2015-07-22 11:17 input
bash-4.1#
bash-4.1# hadoop version
Hadoop 2.7.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r d4c8d4d4d203c934e8074b31289a28724c0842cf
Compiled by jenkins on 2015-04-10T18:40Z
Compiled with protoc 2.5.0
From source with checksum a9e90912c37a35c3195d23951fd18f
This command was run using /usr/local/hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0.jar
bash-4.1#
Access the Web UI
View cluster status: IP:8088
Browse HDFS files: IP:50070
Testing HDFS API operations
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.junit.Test;

import java.net.URI;

public class HdfsApiTest {
    @Test
    public void listFile() throws Exception {
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://IP:9000"), new Configuration());
        // List the files under the given path; the boolean controls recursive traversal
        RemoteIterator<LocatedFileStatus> locatedFileStatusRemoteIterator = fileSystem.listFiles(new Path("/"), true);
        while (locatedFileStatusRemoteIterator.hasNext()) {
            // Detailed information for each file
            LocatedFileStatus fileStatus = locatedFileStatusRemoteIterator.next();
            // Print each file's full path, then just its name
            System.out.println(fileStatus.getPath());
            System.out.println(fileStatus.getPath().getName());
        }
        fileSystem.close();
    }
}
Setting up a Hadoop pseudo-distributed cluster on Linux
Official site: https://hadoop.apache.org/
Install Hadoop
Download from: https://archive.apache.org/dist/hadoop/core/
wget http://archive.apache.org/dist/hadoop/core/hadoop-3.3.2/hadoop-3.3.2.tar.gz
Extract and rename
tar -zxvf hadoop-3.3.2.tar.gz
mv hadoop-3.3.2 hadoop
Configure environment variables
vi /etc/profile
Add the Hadoop environment settings
# hadoop
export HADOOP_HOME=/usr/local/program/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
Apply the changes
source /etc/profile
Verify
hadoop version
Hostname and alias configuration
Change the hostname, either by editing /etc/hostname or with hostnamectl:
vim /etc/hostname
hostnamectl set-hostname node01
Check the hostname
hostname
hostnamectl
Update /etc/hosts to map the IP to the host alias:
172.22.4.21 node01
Edit the configuration files
Hadoop's configuration files all live under hadoop/etc/hadoop. The main files to edit are core-site.xml, hadoop-env.sh, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
Note:
After hitting many pitfalls with configurations found online, each file below is split into two parts: a minimal base configuration and an optional extended configuration. Beginners are advised to start with just the base configuration.
core-site.xml
vim hadoop/etc/hadoop/core-site.xml
Base configuration
<configuration>
<!-- Default file system: the HDFS endpoint -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://node01:9000</value>
</property>
<!-- Directory for temporary files -->
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/program/hadoop/datas/tmp</value>
</property>
Optional configuration
<!-- I/O buffer size; tune to server capacity in production -->
<property>
<name>io.file.buffer.size</name>
<value>8192</value>
</property>
<!-- Enable the HDFS trash so that deleted data can be recovered; interval in minutes -->
<property>
<name>fs.trash.interval</name>
<value>10080</value>
</property>
</configuration>
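With the environment variables in effect, an optional sanity check confirms that Hadoop picks up this setting:
hdfs getconf -confKey fs.defaultFS
# should print hdfs://node01:9000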
hadoop-env.sh
hadoop-env.sh is Hadoop's environment configuration file.
vim hadoop/etc/hadoop/hadoop-env.sh
# export JAVA_HOME=
export JAVA_HOME=/usr/local/jdk1.8/
hdfs-site.xml
vim hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<!-- NameNode metadata directory -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///usr/local/program/hadoop/datas/namenode/namenodedatas</value>
</property>
<!-- DataNode data directory -->
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///usr/local/program/hadoop/datas/datanode/datanodeDatas</value>
</property>
<!-- HDFS replication factor; this pseudo-distributed setup has a single node, so use 1 -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<!-- HDFS block size; the default is 128 MB -->
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>
<!-- HDFS web UI address: port 50070 in 2.x, 9870 in 3.x -->
<property>
<name>dfs.namenode.http-address</name>
<value>node01:9870</value>
</property>
<!-- Disable HDFS permission checking -->
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
</configuration>
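As a quick arithmetic check on the block size above: 134217728 = 128 × 1024 × 1024 bytes, i.e. 128 MB. The effective value can also be read back (optional):
hdfs getconf -confKey dfs.blocksize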
mapred-site.xml
vim hadoop/etc/hadoop/mapred-site.xml
<configuration>
<!-- Run MapReduce on the YARN framework -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- The following are required; otherwise MapReduce jobs abort and ask you to check this setting -->
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/program/hadoop</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/program/hadoop</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/program/hadoop</value>
</property>
<!-- Memory for a map container -->
<property>
<name>mapreduce.map.memory.mb</name>
<value>1024</value>
</property>
<!-- JVM heap for map tasks -->
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx512M</value>
</property>
<!-- Memory for a reduce container -->
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1024</value>
</property>
<!-- JVM heap for reduce tasks -->
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx512M</value>
</property>
<!-- Buffer size for sorting map output; the default is 100 MB -->
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>256</value>
</property>
<!-- Number of streams merged at once when sorting files -->
<property>
<name>mapreduce.task.io.sort.factor</name>
<value>100</value>
</property>
<!-- Number of copier threads that fetch map output -->
<property>
<name>mapreduce.reduce.shuffle.parallelcopies</name>
<value>25</value>
</property>
<!-- Job history server RPC address -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>node01:10020</value>
</property>
<!-- Job history web UI address -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>node01:19888</value>
</property>
</configuration>
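The two jobhistory addresses above are only served once the history server daemon is running; it is not started by start-all.sh. In Hadoop 3.x it can be started with:
mapred --daemon start historyserver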
yarn-site.xml
vim hadoop/etc/hadoop/yarn-site.xml
<configuration>
<!-- Auxiliary service run on the NodeManager; must be mapreduce_shuffle for MapReduce jobs to run -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Whether to enable log aggregation; default false -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- Address the RM exposes to clients, used to submit applications, etc. -->
<property>
<name>yarn.resourcemanager.address</name>
<value>node01:8032</value>
</property>
<!-- Address the RM exposes to ApplicationMasters for requesting and releasing resources -->
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>node01:8030</value>
</property>
<!-- Address the RM exposes to NodeManagers for heartbeats and task assignment -->
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>node01:8031</value>
</property>
<!-- Address administrators use to send management commands to the RM -->
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>node01:8033</value>
</property>
<!-- The RM web UI address, for viewing cluster information in a browser -->
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>node01:8088</value>
</property>
<!-- Hostname of the RM -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node01</value>
</property>
<!-- Minimum memory that can be requested per allocation -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<!-- Maximum memory that can be requested per allocation -->
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<!-- Ratio of virtual memory to physical memory -->
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
<!-- Disable the virtual memory check, which otherwise fails on low-memory machines -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<!-- Total physical memory available to the NodeManager, in MB; cannot be changed dynamically once set -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1024</value>
</property>
<!-- By default a YARN job needs at least about 1.5 GB of memory; lower this value if the VM has less memory, or the job will fail -->
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>1024</value>
</property>
</configuration>
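Once YARN is up (see the startup section below), an optional check is to confirm that the NodeManager registered:
yarn node -list -all
# in Hadoop 3.x, adding -showDetails also prints each node's configured resources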
Format HDFS
HDFS must be formatted before its first startup: formatting creates the directories that hold the NameNode's metadata (the fsimage and edit log) and initializes an empty file system namespace recorded by the NameNode. It does not touch DataNode data; block information is reported by DataNodes at runtime.
hdfs namenode -format
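If formatting succeeds, the NameNode metadata directory configured in hdfs-site.xml should now contain an initial image, for example:
ls /usr/local/program/hadoop/datas/namenode/namenodedatas/current
# expect files such as fsimage_0000000000000000000, seen_txid and VERSION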
Startup
Starting and stopping HDFS
Starting HDFS launches three processes: NameNode, DataNode, and SecondaryNameNode.
Since the Hadoop environment variables are configured, start-dfs.sh works from anywhere; otherwise run ./start-dfs.sh from the sbin directory under the Hadoop home.
start-dfs.sh
stop-dfs.sh
Use jps to check the Java processes
[root@administrator program]# jps
31348 DataNode
31995 SecondaryNameNode
31069 NameNode
717 Jps
Starting daemons individually
hadoop-daemon.sh start namenode           # start the NameNode
hadoop-daemon.sh start datanode           # start the DataNode
hadoop-daemon.sh start secondarynamenode  # start the SecondaryNameNode
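In Hadoop 3.x the hadoop-daemon.sh script is deprecated; assuming a 3.x install, the equivalent per-daemon commands are:
hdfs --daemon start namenode
hdfs --daemon start datanode
hdfs --daemon start secondarynamenode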
Starting and stopping YARN
start-yarn.sh
stop-yarn.sh
Starting daemons individually
yarn-daemon.sh start resourcemanager  # start the ResourceManager
yarn-daemon.sh start nodemanager      # start the NodeManager
yarn-daemon.sh stop resourcemanager   # stop the ResourceManager
yarn-daemon.sh stop nodemanager       # stop the NodeManager
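Likewise, yarn-daemon.sh is deprecated in 3.x in favor of:
yarn --daemon start resourcemanager
yarn --daemon start nodemanager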
Verify
[root@administrator hadoop]# jps
9923 NodeManager
2915 NameNode
15956 Jps
3130 DataNode
9692 ResourceManager
3630 SecondaryNameNode
Start or stop HDFS and YARN together
start-all.sh
stop-all.sh
Troubleshooting startup errors
1. Startup fails with:
Starting namenodes on [IP]
ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.
Starting datanodes
ERROR: Attempting to operate on hdfs datanode as root
ERROR: but there is no HDFS_DATANODE_USER defined. Aborting operation.
Starting secondary namenodes [IP]
ERROR: Attempting to operate on hdfs secondarynamenode as root
ERROR: but there is no HDFS_SECONDARYNAMENODE_USER defined. Aborting operation.
Append the following to the end of hadoop/etc/hadoop/hadoop-env.sh:
export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"
2. On the next startup, another error appears:
[root@administrator program]# start-dfs.sh
Starting namenodes on [IP]
Last login: Sun Mar 6 21:47:15 CST 2022 on pts/4
IP: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
Starting datanodes
Last login: Sun Mar 6 21:58:58 CST 2022 on pts/4
localhost: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
Starting secondary namenodes [IP]
Last login: Sun Mar 6 21:58:58 CST 2022 on pts/4
IP: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
Even when SSHing to itself, the machine needs key-based authorization, so generate a key pair with ssh-keygen and append the public key to the authorized keys file:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
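sshd rejects keys whose files are too permissive, so it is worth tightening the permissions and testing a passwordless login:
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
ssh node01 exit   # should complete without a password prompt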
3. The NameNode fails to start because its port is reported in use:
java.net.BindException: Port in use:
Checking the port shows no process actually occupying it:
lsof -i:port
Fix the hostname (vim /etc/hostname); in a cluster, node names must not repeat:
[root@administrator logs]# cat /etc/hostname
node01
If using an alias, bind the internal IP to it in /etc/hosts; otherwise use the internal IP directly. Find the internal IP with ifconfig, and use either the internal IP or 127.0.0.1.
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.22.4.21 netmask 255.255.192.0 broadcast 172.22.63.255
ether 00:16:3e:02:73:19 txqueuelen 1000 (Ethernet)
RX packets 59092928 bytes 13499777151 (12.5 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 38224223 bytes 9940817189 (9.2 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
cat /etc/hosts
172.22.4.21 node01
Firewall settings
systemctl start firewalld.service    # start firewalld
systemctl stop firewalld.service     # stop firewalld
systemctl disable firewalld.service  # keep firewalld from starting at boot
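Instead of disabling the firewall outright, a narrower option (assuming firewalld) is to open only the ports Hadoop uses:
firewall-cmd --permanent --add-port=9870/tcp   # HDFS web UI
firewall-cmd --permanent --add-port=8088/tcp   # YARN web UI
firewall-cmd --permanent --add-port=9000/tcp   # HDFS RPC
firewall-cmd --reload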
Access the Web UI
Remember to open the corresponding ports first.
YARN cluster status: IP:8088
HDFS browser: IP:9870
Running a test job
Prepare some data: vim test.txt
MapReduce is a programming
paradigm that enables
massive scalability across
hundreds or thousands of
servers in a Hadoop cluster.
As the processing component,
MapReduce is the heart of Apache Hadoop.
The term "MapReduce" refers to two separate
and distinct tasks that Hadoop programs perform.
Upload test.txt to HDFS (the older hadoop dfs form is deprecated; use hdfs dfs):
hdfs dfs -put test.txt /input
Submit the job
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.2.jar wordcount /input /out
Check the output: hdfs dfs -ls /out/
Found 2 items
-rw-r--r-- 1 root supergroup 0 2022-03-08 13:49 /out/_SUCCESS
-rw-r--r-- 1 root supergroup 332 2022-03-08 13:49 /out/part-r-00000
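The _SUCCESS marker is empty; the word counts are in the part file and can be printed directly:
hdfs dfs -cat /out/part-r-00000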