Hadoop Fully Distributed Installation and Configuration
Architecture: master, slave1, slave2
1. Preparation
1.1 Install Java
yum install -y java-1.8.0-openjdk-devel.x86_64
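To confirm the JDK is installed on each node, check the version (the exact build string in the output will vary):
java -version
javac -version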
1.2 Download the Hadoop Package
Run the following on all nodes.
Download the package:
wget https://downloads.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
Extract the package and move it into place:
tar -zxf hadoop-3.4.0.tar.gz
mv hadoop-3.4.0 /usr/local/hadoop
On all nodes, edit ~/.bashrc and add the following:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Apply the changes:
source ~/.bashrc
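With HADOOP_HOME on the PATH, a quick check that the binaries are usable:
hadoop version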
1.3 Environment Preparation
Run on all nodes.
Add the following name-resolution entries to /etc/hosts:
10.0.0.1 master
10.0.0.2 slave1
10.0.0.3 slave2
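A minimal sketch for adding these entries (run as root on every node; the IP addresses are the examples used in this guide, so substitute your own, then verify resolution):
cat >> /etc/hosts <<EOF
10.0.0.1 master
10.0.0.2 slave1
10.0.0.3 slave2
EOF
ping -c 1 slave1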
# Run on all nodes
Create the hadoop user:
useradd hadoop
Grant the hadoop user ownership of the Hadoop directory extracted earlier:
chown -R hadoop:hadoop /usr/local/hadoop
Set up passwordless SSH: generate a key pair on master, then distribute the public key to slave1 and slave2. The cluster start scripts require passwordless login.
su - hadoop
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
Copy the public key to every node (master included):
ssh-copy-id master
ssh-copy-id slave1
ssh-copy-id slave2
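To confirm passwordless login works, each of the following, run on master as the hadoop user, should print the remote hostname without asking for a password:
ssh master hostname
ssh slave1 hostname
ssh slave2 hostname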
2. Deployment Configuration
a. Edit core-site.xml
All nodes (master, slave1, slave2) need to edit $HADOOP_HOME/etc/hadoop/core-site.xml and add the following:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
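Optionally (not required by this guide), hadoop.tmp.dir can be added inside the same <configuration> block so that Hadoop's default temporary and data paths live under the installation directory rather than /tmp; the path below is only an example:
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>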
b. Edit hdfs-site.xml
On the master node, edit $HADOOP_HOME/etc/hadoop/hdfs-site.xml and add the following:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/namenode</value>
</property>
</configuration>
On slave1 and slave2, edit $HADOOP_HOME/etc/hadoop/hdfs-site.xml and add the following:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/datanode</value>
</property>
</configuration>
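The data directories referenced above do not exist yet. Hadoop will create them on format/startup, but creating them in advance as the hadoop user keeps the ownership explicit:
# On master
mkdir -p /usr/local/hadoop/data/namenode
# On slave1 and slave2
mkdir -p /usr/local/hadoop/data/datanode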
c. Edit mapred-site.xml
All nodes (master, slave1, slave2) need to edit $HADOOP_HOME/etc/hadoop/mapred-site.xml and add the following:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
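On Hadoop 3.x, MapReduce jobs submitted to YARN also need to know where the MapReduce framework is installed. If jobs later fail because the MRAppMaster class cannot be found, the following properties can be added to the same mapred-site.xml (the value assumes HADOOP_HOME=/usr/local/hadoop as configured above):
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>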
d. Edit yarn-site.xml
On the master node, edit $HADOOP_HOME/etc/hadoop/yarn-site.xml and add the following:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
On slave1 and slave2, edit $HADOOP_HOME/etc/hadoop/yarn-site.xml and add the following:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
e. Edit workers
Only on the master node, edit $HADOOP_HOME/etc/hadoop/workers (the list of worker hosts that run DataNode and NodeManager) and add the following:
slave1
slave2
f. Configure JAVA_HOME for Hadoop
Run on all nodes.
This must be configured; otherwise the cluster startup scripts will fail with an error that JAVA_HOME cannot be found.
Change to the Hadoop configuration directory:
cd $HADOOP_HOME/etc/hadoop
# hadoop-env.sh is read by the Hadoop startup scripts when the cluster starts
vim hadoop-env.sh
Set JAVA_HOME to the actual JDK installation path, for example:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.342.b07-1.el7_9.x86_64
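If you are unsure of the JDK path on a node, one way to discover it (assuming the OpenJDK devel package installed in step 1.1) is:
dirname $(dirname $(readlink -f $(which javac)))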
g. Format HDFS
Format HDFS only on the master node:
hdfs namenode -format
3. Start the Cluster
Run only on the master node:
cd /usr/local/hadoop/sbin/
./start-dfs.sh
./start-yarn.sh
Run jps on all nodes to check the running processes.
On master:
[hadoop@iZbp19msfaakpeah42elrcZ hadoop]$ jps
23648 NameNode
25588 ResourceManager
31337 Jps
23882 SecondaryNameNode
On slave1:
[hadoop@iZbp17h4v08g38r9aqn0o7Z ~]$ jps
21392 DataNode
25872 Jps
On slave2:
[hadoop@iZbp17h4v08g38r9aqn0o6Z ~]$ jps
22644 Jps
20093 DataNode
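Beyond jps, the cluster can be checked from master with the standard Hadoop 3.x commands and default web UI ports:
hdfs dfsadmin -report   # should list the live DataNodes
yarn node -list         # lists the NodeManagers registered with the ResourceManager
# Web UIs: NameNode at http://master:9870, ResourceManager at http://master:8088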
Flume Real-Time Data Collection
1. Install and Configure Flume
a. Install Flume
Install Flume on the master node:
wget https://downloads.apache.org/flume/1.11.0/apache-flume-1.11.0-bin.tar.gz
tar -xzvf apache-flume-1.11.0-bin.tar.gz
mv apache-flume-1.11.0-bin /usr/local/flume
b. Configure environment variables
On the master node, edit ~/.bashrc and add the following:
export FLUME_HOME=/usr/local/flume
export PATH=$PATH:$FLUME_HOME/bin
Apply the changes:
source ~/.bashrc
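With FLUME_HOME on the PATH, verify the installation:
flume-ng version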
c. Create a Flume configuration file
Create a new Flume configuration file, for example /usr/local/flume/conf/flume-kafka-hdfs.conf, with the content below. The netcat source replicates each event to two channels so that it reaches both the Kafka sink and the HDFS sink.
Replace master:9092 with the address of your Kafka broker.
# Define the agent's source, channels, and sinks
agent.sources = netcatSrc
agent.channels = memoryChannel memoryChannel2
agent.sinks = kafkaSink hdfsSink
# Configure the source: listen for newline-terminated text on a TCP port
agent.sources.netcatSrc.type = netcat
agent.sources.netcatSrc.bind = localhost
agent.sources.netcatSrc.port = 10050
# Replicate every event to both channels so each sink receives a copy
agent.sources.netcatSrc.selector.type = replicating
# Configure the channels (one per sink)
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 1000
agent.channels.memoryChannel.transactionCapacity = 100
agent.channels.memoryChannel2.type = memory
agent.channels.memoryChannel2.capacity = 1000
agent.channels.memoryChannel2.transactionCapacity = 100
# Configure the Kafka sink
agent.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafkaSink.kafka.bootstrap.servers = master:9092
agent.sinks.kafkaSink.kafka.topic = 00tianhuiping
agent.sinks.kafkaSink.flumeBatchSize = 20
agent.sinks.kafkaSink.kafka.producer.acks = 1
# Configure the HDFS sink
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = /user/test/flumebackup
agent.sinks.hdfsSink.hdfs.filePrefix = events-
agent.sinks.hdfsSink.hdfs.fileType = DataStream
agent.sinks.hdfsSink.hdfs.writeFormat = Text
agent.sinks.hdfsSink.hdfs.rollInterval = 3600
agent.sinks.hdfsSink.hdfs.rollSize = 0
agent.sinks.hdfsSink.hdfs.rollCount = 10
# Bind the source and sinks to the channels
agent.sources.netcatSrc.channels = memoryChannel memoryChannel2
agent.sinks.kafkaSink.channel = memoryChannel
agent.sinks.hdfsSink.channel = memoryChannel2
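Before starting the agent, the Kafka topic and the HDFS target directory should exist. A sketch, assuming Kafka is installed under /usr/local/kafka (this path is an assumption; adjust it to your installation):
hdfs dfs -mkdir -p /user/test/flumebackup
/usr/local/kafka/bin/kafka-topics.sh --create --topic 00tianhuiping --partitions 1 --replication-factor 1 --bootstrap-server master:9092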
d. Start the Flume Agent
Start the Flume agent with the following command:
flume-ng agent --conf /usr/local/flume/conf --conf-file /usr/local/flume/conf/flume-kafka-hdfs.conf --name agent -Dflume.root.logger=INFO,console
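Once the agent is running, a quick end-to-end test is to send an event into the netcat source and check both destinations. A sketch, assuming nc (netcat) is installed and Kafka lives under /usr/local/kafka (an assumed path); note that HDFS files keep a .tmp suffix until they roll:
echo "hello flume" | nc localhost 10050
hdfs dfs -ls /user/test/flumebackup
/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server master:9092 --topic 00tianhuiping --from-beginning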