Big Data Technology: Setting Up the Spark 4.0.0 Compute Engine


Linux environment: JDK 21, Hadoop 3.4.1, CentOS Stream 10, Spark 4.0.0

Chapter 1 Local Mode

        hadoop102        hadoop103    hadoop104
Spark   Master, Worker
# Download
[whboy@hadoop102 ]$ wget https://archive.apache.org/dist/spark/spark-4.0.0-preview2/spark-4.0.0-preview2-bin-hadoop3.tgz
# Extract
[whboy@hadoop102 ]$ tar -zxvf spark-4.0.0-preview2-bin-hadoop3.tgz -C /opt/module
[whboy@hadoop102 ]$ mv spark-4.0.0-preview2-bin-hadoop3 spark-4.0.0-local
# Configure environment variables
[whboy@hadoop102 ]$ sudo vim /etc/profile.d/my_env.sh
# SPARK_HOME
export SPARK_HOME=/opt/module/spark-4.0.0-local
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
[whboy@hadoop102 ]$ source /etc/profile.d/my_env.sh
# Submit the example job
[whboy@hadoop102 spark-4.0.0-local]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local \
./examples/jars/spark-examples_2.13-4.0.0-preview2.jar \
100
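
The SparkPi example estimates π by Monte Carlo sampling: each of the 100 tasks scatters random points in the unit square and counts how many land inside the quarter circle. A plain-Python sketch of the same idea (illustrative only, not Spark's actual code):

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Estimate pi: sample points in the unit square and count
    those that fall inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print(estimate_pi(1_000_000))  # converges toward 3.14159...
```

Spark's version does the same thing, but spreads the sampling across the tasks given by the final `100` argument and sums the counts with a reduce.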

Chapter 2 Standalone Mode

        hadoop102        hadoop103    hadoop104
Spark   Master, Worker   Worker       Worker

2.1 Spark Installation

# Download
[whboy@hadoop102 ]$ wget https://archive.apache.org/dist/spark/spark-4.0.0-preview2/spark-4.0.0-preview2-bin-hadoop3.tgz
# Extract
[whboy@hadoop102 ]$ tar -zxvf spark-4.0.0-preview2-bin-hadoop3.tgz -C /opt/module
[whboy@hadoop102 ]$ mv spark-4.0.0-preview2-bin-hadoop3 spark-4.0.0-standalone

# Configure environment variables
[whboy@hadoop102 ]$ sudo vim /etc/profile.d/my_env.sh
# SPARK_HOME
export SPARK_HOME=/opt/module/spark-4.0.0-standalone
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
[whboy@hadoop102 ]$ source /etc/profile.d/my_env.sh

2.2 Spark Configuration Files

2.2.1 spark-env.sh

[whboy@hadoop102 spark-4.0.0-standalone]$ mv conf/spark-env.sh.template conf/spark-env.sh
[whboy@hadoop102 spark-4.0.0-standalone]$ vim conf/spark-env.sh
export JAVA_HOME=/opt/module/jdk-21.0.5
export SPARK_MASTER_HOST=hadoop102
export SPARK_MASTER_PORT=7077

2.2.2 workers

[whboy@hadoop102 spark-4.0.0-standalone]$ mv conf/workers.template conf/workers
[whboy@hadoop102 spark-4.0.0-standalone]$ vim conf/workers
hadoop102
hadoop103
hadoop104
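
The workers file is a plain host list: one hostname per line, read by Spark's launch scripts (which, as I understand it, strip `#` comments and blank lines). A small sketch of that parsing rule, with a helper name of my own:

```python
def parse_workers(text: str) -> list[str]:
    """Mimic how Spark's start scripts read conf/workers:
    drop '#' comments and blank lines, keep one host per line."""
    hosts = []
    for line in text.splitlines():
        host = line.split("#", 1)[0].strip()
        if host:
            hosts.append(host)
    return hosts

print(parse_workers("hadoop102\nhadoop103\n# spare node\nhadoop104\n"))
```

Every host listed here gets a Worker when `start-all.sh` runs, so the three lines above match the deployment table at the top of this chapter.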

2.3 Configuring the History Server

2.3.1 spark-env.sh

[whboy@hadoop102 spark-4.0.0-standalone]$ vim conf/spark-env.sh
export SPARK_HISTORY_OPTS="
-Dspark.history.ui.port=18080 
-Dspark.history.fs.logDirectory=hdfs://hadoop102:8020/directory 
-Dspark.history.retainedApplications=100"
# Option 1: the history server web UI listens on port 18080
# Option 2: the HDFS path where event logs are stored
# Option 3: the number of application entries kept; beyond this limit, the oldest application records are evicted. This caps the applications held in memory, not the number shown on the page.
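
SPARK_HISTORY_OPTS is simply a string of `-Dkey=value` JVM system properties handed to the history server process. A quick sketch of how such a string breaks down into key/value pairs (illustrative helper, not Spark code):

```python
def parse_jvm_opts(opts: str) -> dict[str, str]:
    """Split a string of -Dkey=value JVM options into a dict."""
    props = {}
    for token in opts.split():
        if token.startswith("-D") and "=" in token:
            key, value = token[2:].split("=", 1)
            props[key] = value
    return props

opts = ("-Dspark.history.ui.port=18080 "
        "-Dspark.history.fs.logDirectory=hdfs://hadoop102:8020/directory "
        "-Dspark.history.retainedApplications=100")
print(parse_jvm_opts(opts)["spark.history.ui.port"])  # 18080
```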

2.3.2 spark-defaults.conf

[whboy@hadoop102 spark-4.0.0-standalone]$ vim conf/spark-defaults.conf
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://hadoop102:8020/directory
spark.history.ui.port            18080
spark.history.fs.logDirectory    hdfs://hadoop102:8020/directory
spark.yarn.historyServer.address hadoop102:18080

# Distribute to the cluster
[whboy@hadoop102 module]$ my_rsync_script.sh spark-4.0.0-standalone
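
Spark loads spark-defaults.conf as a Java properties file, so both the `key value` and `key=value` forms seen above are accepted. A simplified sketch of that tolerant parsing (real properties files also allow `:` separators and escapes, which this ignores):

```python
import re

def parse_defaults(text: str) -> dict[str, str]:
    """Parse spark-defaults.conf-style lines: '#' lines are comments,
    and the key is separated from the value by '=' or whitespace."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = re.split(r"=|\s+", line, maxsplit=1)
        if len(parts) == 2:
            props[parts[0]] = parts[1].strip()
    return props

conf = """
spark.eventLog.enabled      true
spark.history.ui.port=18080
"""
print(parse_defaults(conf))
```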

2.4 Starting the Cluster

# Note: the Hadoop cluster must be running, and the directory path must exist on HDFS beforehand
[whboy@hadoop102 spark-4.0.0-standalone]$ start-dfs.sh
[whboy@hadoop102 spark-4.0.0-standalone]$ hadoop fs -mkdir /directory

# Start the Spark cluster and the history server
[whboy@hadoop102 spark-4.0.0-standalone]$ sbin/start-all.sh
[whboy@hadoop102 spark-4.0.0-standalone]$ sbin/start-history-server.sh

# Submit an application
[whboy@hadoop102 spark-4.0.0-standalone]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
./examples/jars/spark-examples_2.13-4.0.0-preview2.jar \
100
# History server UI: http://hadoop102:18080

Chapter 3 YARN Mode

3.1 Spark Installation

# Download
[whboy@hadoop102 ]$ wget https://archive.apache.org/dist/spark/spark-4.0.0-preview2/spark-4.0.0-preview2-bin-hadoop3.tgz
# Extract
[whboy@hadoop102 ]$ tar -zxvf spark-4.0.0-preview2-bin-hadoop3.tgz -C /opt/module
[whboy@hadoop102 ]$ mv spark-4.0.0-preview2-bin-hadoop3 spark-4.0.0-yarn

# Configure environment variables
[whboy@hadoop102 ]$ sudo vim /etc/profile.d/my_env.sh
# SPARK_HOME
export SPARK_HOME=/opt/module/spark-4.0.0-yarn
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
[whboy@hadoop102 ]$ source /etc/profile.d/my_env.sh

3.2 Spark Configuration Files

3.2.1 spark-env.sh

[whboy@hadoop102 spark-4.0.0-yarn]$ vim conf/spark-env.sh
export JAVA_HOME=/opt/module/jdk-21.0.5
export HADOOP_CONF_DIR=/opt/module/hadoop-3.4.1/etc/hadoop
export YARN_CONF_DIR=/opt/module/hadoop-3.4.1/etc/hadoop

3.2.2 spark-defaults.conf

[whboy@hadoop102 spark-4.0.0-yarn]$ vim conf/spark-defaults.conf
spark.eventLog.enabled    true
spark.eventLog.dir        hdfs://hadoop102:8020/directory

3.3 Spark History Server

To avoid clashing with ports used by other services (8080, for example, is taken by the standalone Master's web UI), we set the history server web UI port explicitly to 18080.

3.3.1 spark-env.sh

[whboy@hadoop102 spark-4.0.0-yarn]$ vim conf/spark-env.sh
# History server
export SPARK_HISTORY_OPTS="
-Dspark.history.ui.port=18080 
-Dspark.history.fs.logDirectory=hdfs://hadoop102:8020/directory
-Dspark.history.retainedApplications=100"

3.3.2 spark-defaults.conf

[whboy@hadoop102 spark-4.0.0-yarn]$ vim conf/spark-defaults.conf
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://hadoop102:8020/directory
spark.history.ui.port            18080
spark.history.fs.logDirectory    hdfs://hadoop102:8020/directory
spark.yarn.historyServer.address hadoop102:18080
# Start the Spark history server
[whboy@hadoop102 spark-4.0.0-yarn]$ sbin/start-history-server.sh 

3.4 Submitting Jobs

# Submit in client mode
[whboy@hadoop102 spark-4.0.0-yarn]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
./examples/jars/spark-examples_2.13-4.0.0-preview2.jar \
100

# Submit in cluster mode
[whboy@hadoop102 spark-4.0.0-yarn]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
./examples/jars/spark-examples_2.13-4.0.0-preview2.jar \
100
# View logs in the web UI: http://hadoop102:18080/
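
The two submissions differ only in `--deploy-mode`: in client mode the driver runs on the submitting machine, while in cluster mode YARN launches it on a cluster node. A sketch that assembles the argv list (the helper and defaults are mine, matching the commands above):

```python
def spark_submit_args(deploy_mode: str, app_jar: str,
                      main_class: str, *app_args: str) -> list[str]:
    """Build a spark-submit command line for YARN as an argv list
    suitable for subprocess.run."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return ["bin/spark-submit",
            "--class", main_class,
            "--master", "yarn",
            "--deploy-mode", deploy_mode,
            app_jar, *app_args]

cmd = spark_submit_args("cluster",
                        "./examples/jars/spark-examples_2.13-4.0.0-preview2.jar",
                        "org.apache.spark.examples.SparkPi", "100")
print(" ".join(cmd))
```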

3.5 Appendix: PySpark on YARN

[pySpark On Yarn]Cannot run program "/opt/module/anaconda3/bin/python3": error=2, No such file or directory

[Attempt 2] Add PATH=$PYTHON_HOME/bin:$PATH to yarn-env.sh.

[Fix] With that change, PySpark jobs started running successfully. In general, for tasks launched through YARN, any environment variables those tasks need should also be set in yarn-env.sh.

[whboy@hadoop102 ~]$ vim /etc/profile.d/my_env.sh 
# PYTHON_HOME
export PYTHON_HOME=/opt/module/anaconda3
export PATH=$PATH:$PYTHON_HOME/bin
[whboy@hadoop102 ~]$ vim $HADOOP_HOME/etc/hadoop/yarn-env.sh 
export PATH=$PYTHON_HOME/bin:$PATH
[whboy@hadoop102 ~]$ my_rsync_script.sh $HADOOP_HOME/etc/hadoop/yarn-env.sh
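
The root cause above is that YARN containers resolve the executors' Python through their own PATH, which does not inherit the login shell's my_env.sh. The lookup itself is ordinary PATH resolution, sketched below (directory names in the usage line are illustrative):

```python
import os

def resolve_executable(name: str, path_env: str) -> "str | None":
    """Mimic PATH lookup: return the first entry in path_env that
    contains an executable file called `name`, else None."""
    for directory in path_env.split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

# If anaconda3/bin is missing from the container's PATH, this returns None,
# which is exactly the "No such file or directory" failure above.
print(resolve_executable("python3", "/usr/bin:/opt/module/anaconda3/bin"))
```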