First-Time Installation, Configuration and Use of Spark + Hadoop on Linux
1. Preface
Notes from a first-time Spark + Hadoop installation. The Spark cluster was built with Docker Compose, using the open-source bitnami/spark image; for Hadoop, bde2020/hadoop-namenode was chosen. Everything runs on a single machine.
2. Setting up the Spark cluster
The configuration is as follows:
version: "3.8"
services:
spark-master:
image: bitnami/spark:3.4.0
container_name: spark-master
environment:
- SPARK_MODE=master
- SPARK_MASTER_HOST=spark-master
ports:
- "8080:8080" # Spark Master UI
- "7077:7077" # Spark Master通信端口
networks:
- spark-network
spark-worker-1:
image: bitnami/spark:3.4.0
container_name: spark-worker-1
environment:
- SPARK_MODE=worker
- SPARK_MASTER_URL=spark://spark-master:7077
depends_on:
- spark-master
networks:
- spark-network
spark-worker-2:
image: bitnami/spark:3.4.0
container_name: spark-worker-2
environment:
- SPARK_MODE=worker
- SPARK_MASTER_URL=spark://spark-master:7077
depends_on:
- spark-master
networks:
- spark-network
spark-history:
image: bitnami/spark:3.4.0
container_name: spark-history
environment:
- SPARK_MODE=history-server
- SPARK_HISTORY_SERVER_SPARK_MASTER=spark://spark-master:7077
- SPARK_HISTORY_SERVER_LOG_DIRECTORY=/tmp/spark-events
command: ./bin/spark-class org.apache.spark.deploy.history.HistoryServer
volumes:
- ./spark-events:/tmp/spark-events # persist history logs
depends_on:
- spark-master
ports:
- "18080:18080" # Spark History UI
networks:
- spark-network
networks:
spark-network:
driver: bridge
On the master node, ports 7077 and 8080 are exposed: 7077 for clients connecting to Spark, and 8080 for viewing the Spark UI in a browser.
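To bring the cluster up and confirm these ports respond, something like the following should work (a minimal sketch, assuming the file above is saved as docker-compose.yml in the current directory):
docker-compose up -d    # start the master, both workers and the history server
docker-compose ps       # every container should show an "Up" status
# Master UI: http://localhost:8080   History Server UI: http://localhost:18080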
If you want to run the official example, pay attention to the class name and the jar path, for example:
spark-submit \
  --class org.apache.spark.examples.JavaSparkPi \
  --master local[*] \
  /opt/bitnami/spark/examples/jars/spark-examples_2.12-3.4.0.jar
-- When packaging the project, the dependency jars do not need to be bundled in, but the pom file must still declare them; the required jars can later be copied into the container, for example:
docker cp /home/jars/mysql-connector-java-8.0.28.jar spark-master:/opt/bitnami/spark/jars
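To confirm Spark can see the copied jar afterwards, a quick check (illustrative):
docker exec spark-master ls /opt/bitnami/spark/jars | grep mysql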
Note: when writing test examples, the Spark dependencies you pull in must match the Spark version of the cluster; a sketch of possible pom entries follows below.
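A minimal sketch of the dependency declarations, assuming a Maven project built against the 3.4.0 cluster above; the provided scope keeps Spark itself out of the application jar, while the MySQL driver is copied into the container as described above:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.4.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.4.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.28</version>
    <scope>provided</scope>
</dependency>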
3. Using Spark with HDFS
If you need a file system, e.g. for storing models or reading data files, Spark has to be used together with HDFS. Merging the Hadoop docker-compose.yml mentioned earlier with the one above gives the new docker-compose.yml:
version: "3"
services:
namenode:
image: bde2020/hadoop-namenode:latest
container_name: namenode
environment:
- CLUSTER_NAME=test
ports:
- "9870:9870" # HDFS Web UI
- "8020:8020" # HDFS RPC端口
volumes:
- hadoop_namenode:/hadoop/dfs/name
- ./hdfs/core-site.xml:/etc/hadoop/core-site.xml
- ./hdfs/hdfs-site.xml:/etc/hadoop/hdfs-site.xml
networks:
- spark-network
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:9870 || exit 1"]
interval: 30s
timeout: 10s
retries: 3
datanode:
image: bde2020/hadoop-datanode:latest
container_name: datanode
environment:
- CLUSTER_NAME=test
volumes:
- hadoop_datanode:/hadoop/dfs/data
- ./hdfs/core-site.xml:/etc/hadoop/core-site.xml
- ./hdfs/hdfs-site.xml:/etc/hadoop/hdfs-site.xml
depends_on:
- namenode
networks:
- spark-network
healthcheck:
test: ["CMD-SHELL", "curl -f http://namenode:9870 || exit 1"]
interval: 30s
timeout: 10s
retries: 3
socks5:
image: serjs/go-socks5-proxy
container_name: socks5
ports:
- 10802:1080
restart: always
networks:
- spark-network
spark-master:
image: bitnami/spark:3.4.0
container_name: spark-master
environment:
- SPARK_MODE=master
- SPARK_MASTER_HOST=spark-master
- SPARK_EVENTLOG_ENABLED=true
- SPARK_EVENTLOG_DIR=hdfs://namenode:8020/spark-logs
- SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=hdfs://namenode:8020/spark-logs
- spark.hadoop.fs.defaultFS=hdfs://namenode:8020
- SPARK_WORKER_INSTANCES=2
- spark.scheduler.mode=FAIR # enable the fair scheduler
- spark.scheduler.allocation.file=/opt/bitnami/spark/conf/fairscheduler.xml # path to the fair scheduler config file
- spark.dynamicAllocation.enabled=true # enable dynamic resource allocation
- spark.shuffle.service.enabled=true # enable the shuffle service
ports:
- "8080:8080" # Spark Master UI
- "7077:7077" # Spark Master通信端口
volumes:
- ./spark-cron:/etc/cron.d/spark-cron
- ./fairscheduler.xml:/opt/bitnami/spark/conf/fairscheduler.xml # mount the fair scheduler config file
depends_on:
- namenode
networks:
- spark-network
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:8080 || exit 1"]
interval: 30s
timeout: 10s
retries: 3
spark-worker-1:
image: bitnami/spark:3.4.0
container_name: spark-worker-1
environment:
- SPARK_MODE=worker
- SPARK_MASTER_URL=spark://spark-master:7077
- SPARK_WORKER_CORES=2
- SPARK_WORKER_MEMORY=1G
depends_on:
- spark-master
networks:
- spark-network
healthcheck:
test: ["CMD-SHELL", "curl -f http://spark-master:8080 || exit 1"]
interval: 30s
timeout: 10s
retries: 3
spark-worker-2:
image: bitnami/spark:3.4.0
container_name: spark-worker-2
environment:
- SPARK_MODE=worker
- SPARK_MASTER_URL=spark://spark-master:7077
- SPARK_WORKER_CORES=2
- SPARK_WORKER_MEMORY=1G
depends_on:
- spark-master
networks:
- spark-network
healthcheck:
test: ["CMD-SHELL", "curl -f http://spark-master:8080 || exit 1"]
interval: 30s
timeout: 10s
retries: 3
spark-history:
image: bitnami/spark:3.4.0
container_name: spark-history
environment:
- SPARK_MODE=history-server
- spark.hadoop.fs.defaultFS=hdfs://namenode:8020
- SPARK_HISTORY_SERVER_SPARK_MASTER=spark://spark-master:7077
- SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=hdfs://namenode:8020/spark-logs
- SPARK_EVENTLOG_ENABLED=true
- SPARK_EVENTLOG_DIR=hdfs://namenode:8020/spark-logs
command: ./bin/spark-class org.apache.spark.deploy.history.HistoryServer
volumes:
- ./spark-events:/tmp/spark-events # persist history logs
depends_on:
- spark-master
ports:
- "18080:18080" # Spark History UI
networks:
- spark-network
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:18080 || exit 1"]
interval: 30s
timeout: 10s
retries: 3
volumes:
hadoop_namenode:
hadoop_datanode:
networks:
spark-network:
driver: bridge
core-site.xml configuration:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/hadoop/data/tmp</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
</configuration>
hdfs-site.xml configuration:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///hadoop/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///hadoop/dfs/data</value>
</property>
</configuration>
The Hadoop images can also be pulled manually in advance, e.g. docker pull bde2020/hadoop-namenode:latest. After startup, if you get a security.AccessControlException, enter the HDFS container and grant permissions on the directories:
docker exec -it namenode /bin/bash
hdfs dfs -mkdir -p /user/hadoop/spark-data
hdfs dfs -chmod 777 /user/hadoop/spark-data
hdfs dfs -chown -R spark:spark /user/hadoop/spark-model
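Once the directory exists and is writable, a quick way to verify that Spark can reach HDFS is to put a file in and read it back through spark-shell (a minimal sketch; the file name and path are only examples):
docker exec -it namenode bash -c "echo hello > /tmp/sample.txt && hdfs dfs -put -f /tmp/sample.txt /user/hadoop/spark-data/"
docker exec -it spark-master /opt/bitnami/spark/bin/spark-shell --master spark://spark-master:7077
scala> spark.read.textFile("hdfs://namenode:8020/user/hadoop/spark-data/sample.txt").count()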
4. Using Spark with cron
When several jobs need to run on a schedule, Spark can be combined with cron. Build a spark-cron image from the Dockerfile below; the build fails easily, so configuring a registry mirror first is recommended.
FROM bitnami/spark:3.4.0
# Switch to root for the installation
USER root
# Install cron (needed for the CMD below)
RUN apt-get update && apt-get install -y cron
# Set the time zone (adjust as needed)
ENV TZ=Asia/Shanghai
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
# Copy the cron file and set its permissions
COPY spark-cron /etc/cron.d/spark-cron
RUN chmod 644 /etc/cron.d/spark-cron
# Switch back to the default user
USER 1001
# Start the cron service
CMD ["cron", "-f"]
Build the image with: docker build -t spark-cron .
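A quick sanity check that cron actually made it into the image (illustrative; the entrypoint override is only for this one-off check):
docker run --rm --user root --entrypoint bash spark-cron:latest -c "which cron && cat /etc/cron.d/spark-cron"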
Combining this with the configuration above, the new docker-compose file is as follows:
version: "3"
services:
namenode:
image: bde2020/hadoop-namenode:latest
container_name: namenode
environment:
- CLUSTER_NAME=test
ports:
- "9870:9870" # HDFS Web UI
- "8020:8020" # HDFS RPC端口
volumes:
- hadoop_namenode:/hadoop/dfs/name
- ./hdfs/core-site.xml:/etc/hadoop/core-site.xml
- ./hdfs/hdfs-site.xml:/etc/hadoop/hdfs-site.xml
networks:
- spark-network
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:9870 || exit 1"]
interval: 30s
timeout: 10s
retries: 3
datanode:
image: bde2020/hadoop-datanode:latest
container_name: datanode
environment:
- CLUSTER_NAME=test
volumes:
- hadoop_datanode:/hadoop/dfs/data
- ./hdfs/core-site.xml:/etc/hadoop/core-site.xml
- ./hdfs/hdfs-site.xml:/etc/hadoop/hdfs-site.xml
depends_on:
- namenode
networks:
- spark-network
healthcheck:
test: ["CMD-SHELL", "curl -f http://namenode:9870 || exit 1"]
interval: 30s
timeout: 10s
retries: 3
socks5:
image: serjs/go-socks5-proxy
container_name: socks5
ports:
- 10802:1080
restart: always
networks:
- spark-network
spark-master:
image: bitnami/spark:3.4.0
container_name: spark-master
environment:
- SPARK_MODE=master
- SPARK_MASTER_HOST=spark-master
- SPARK_EVENTLOG_ENABLED=true
- SPARK_EVENTLOG_DIR=hdfs://namenode:8020/spark-logs
- SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=hdfs://namenode:8020/spark-logs
- spark.hadoop.fs.defaultFS=hdfs://namenode:8020
- SPARK_WORKER_INSTANCES=2
- spark.scheduler.mode=FAIR # enable the fair scheduler
- spark.scheduler.allocation.file=/opt/bitnami/spark/conf/fairscheduler.xml # path to the fair scheduler config file
- spark.dynamicAllocation.enabled=true # enable dynamic resource allocation
- spark.shuffle.service.enabled=true # enable the shuffle service
ports:
- "8080:8080" # Spark Master UI
- "7077:7077" # Spark Master通信端口
volumes:
- ./spark-cron:/etc/cron.d/spark-cron
- ./fairscheduler.xml:/opt/bitnami/spark/conf/fairscheduler.xml # mount the fair scheduler config file
depends_on:
- namenode
networks:
- spark-network
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:8080 || exit 1"]
interval: 30s
timeout: 10s
retries: 3
spark-worker-1:
image: bitnami/spark:3.4.0
container_name: spark-worker-1
environment:
- SPARK_MODE=worker
- SPARK_MASTER_URL=spark://spark-master:7077
- SPARK_WORKER_CORES=2
- SPARK_WORKER_MEMORY=1G
depends_on:
- spark-master
networks:
- spark-network
healthcheck:
test: ["CMD-SHELL", "curl -f http://spark-master:8080 || exit 1"]
interval: 30s
timeout: 10s
retries: 3
spark-worker-2:
image: bitnami/spark:3.4.0
container_name: spark-worker-2
environment:
- SPARK_MODE=worker
- SPARK_MASTER_URL=spark://spark-master:7077
- SPARK_WORKER_CORES=2
- SPARK_WORKER_MEMORY=1G
depends_on:
- spark-master
networks:
- spark-network
healthcheck:
test: ["CMD-SHELL", "curl -f http://spark-master:8080 || exit 1"]
interval: 30s
timeout: 10s
retries: 3
spark-history:
image: bitnami/spark:3.4.0
container_name: spark-history
environment:
- SPARK_MODE=history-server
- spark.hadoop.fs.defaultFS=hdfs://namenode:8020
- SPARK_HISTORY_SERVER_SPARK_MASTER=spark://spark-master:7077
- SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=hdfs://namenode:8020/spark-logs
- SPARK_EVENTLOG_ENABLED=true
- SPARK_EVENTLOG_DIR=hdfs://namenode:8020/spark-logs
command: ./bin/spark-class org.apache.spark.deploy.history.HistoryServer
volumes:
- ./spark-events:/tmp/spark-events # persist history logs
depends_on:
- spark-master
ports:
- "18080:18080" # Spark History UI
networks:
- spark-network
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:18080 || exit 1"]
interval: 30s
timeout: 10s
retries: 3
spark-cron:
image: spark-cron:latest
container_name: spark-cron
user: root
environment:
- SPARK_WORKER_INSTANCES=2
- spark.scheduler.mode=FAIR # enable the fair scheduler
- spark.scheduler.allocation.file=/opt/bitnami/spark/conf/fairscheduler.xml # path to the fair scheduler config file
- spark.dynamicAllocation.enabled=true # enable dynamic resource allocation
- spark.shuffle.service.enabled=true # enable the shuffle service
volumes:
- ./spark-cron:/etc/cron.d/spark-cron
- ./fairscheduler.xml:/opt/bitnami/spark/conf/fairscheduler.xml # mount the fair scheduler config file
command: ["sh", "-c", "chmod 644 /etc/cron.d/spark-cron && cron -f"]
depends_on:
- spark-master
networks:
- spark-network
volumes:
hadoop_namenode:
hadoop_datanode:
networks:
spark-network:
driver: bridge
If you only run a single job and do not need jobs to run in parallel, the scheduler can be left unconfigured; otherwise a scheduler is needed to divide resources between jobs.
fairscheduler.xml configuration:
<?xml version="1.0"?>
<allocations>
<pool name="default">
<minShare>1</minShare>
<weight>1.0</weight>
</pool>
<!-- adjust these pools to match your actual server resources -->
<pool name="fixed-resource-pool">
<minResources>1024 mb,2 cores</minResources>
<maxResources>2048 mb,4 cores</maxResources>
</pool>
<pool name="fair-scheduling-pool">
<minResources>512 mb,1 core</minResources>
<weight>1.0</weight>
<schedulingMode>FAIR</schedulingMode>
</pool>
</allocations>
spark-cron file configuration; note that the spark-cron file must end with a trailing newline:
# Set environment variables
SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
SPARK_HOME=/opt/bitnami/spark
JAVA_HOME=/opt/bitnami/java
JRE_HOME=${JAVA_HOME}/jre
# Run once every minute (to avoid errors from line breaks, keep each command on a single line)
* * * * * ${SPARK_HOME}/bin/spark-submit --class com.bim.sql.<ClassName> --master spark://<server-ip>:7077 --conf spark.scheduler.mode=FAIR --conf spark.scheduler.pool=fair-scheduling-pool --jars /opt/bitnami/spark/jars/mysql-connector-java-8.0.28.jar /opt/bitnami/spark/examples/jars/bim_system-1.0-SNAPSHOT.jar >> /var/log/spark-cron.log 2>&1
Without these environment variables the jobs may fail to run; from here on, all job jars are placed in the spark-cron container and run from there.
To use the fair scheduler, add --conf spark.scheduler.mode=FAIR --conf spark.scheduler.pool=fair-scheduling-pool to the submit command (plus --jars for any extra dependency jars).
To customize resources, add --total-executor-cores 1 --executor-memory 512m --num-executors 1 --conf spark.dynamicAllocation.enabled=false; the spark.dynamicAllocation.enabled=false setting disables dynamic resource allocation, otherwise the custom resource settings would not take effect.
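Putting the flags above together, a full submit command might look like this (a sketch; the class, jar and resource numbers are just the examples used above):
${SPARK_HOME}/bin/spark-submit \
  --class com.bim.sql.<ClassName> \
  --master spark://<server-ip>:7077 \
  --total-executor-cores 1 --executor-memory 512m --num-executors 1 \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.scheduler.pool=fair-scheduling-pool \
  --jars /opt/bitnami/spark/jars/mysql-connector-java-8.0.28.jar \
  /opt/bitnami/spark/examples/jars/bim_system-1.0-SNAPSHOT.jar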
After the containers are up, enter the spark-cron container and run:
crontab /etc/cron.d/spark-cron   # load the scheduled jobs
crontab -l                       # list the jobs that were loaded
cat /var/log/spark-cron.log      # check the job output
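If the jobs write event logs to HDFS as configured above, finished applications should also show up in the history server; a quick check via its REST API (assuming port 18080 is mapped as in the compose file):
curl http://localhost:18080/api/v1/applications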