The formatting here is a bit rough; if it is hard to read, please refer to the original post via its link.
This post walks through building an image with a Dockerfile, standing up a Spark cluster with docker-compose, and running KMeans on the iris dataset with PySpark. I hope it helps.
File structure
- The /apps folder holds the scripts
- The /data folder holds the data
- docker-compose.yml builds the containers
- Dockerfile builds the image
- start-spark.sh starts the Spark processes inside the containers
- /apps and /data are mounted into the containers via the volumes configuration
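For reference, the project layout ends up looking roughly like this (main.py and iris.csv are the script and data file used later in this post):

```
.
├── Dockerfile
├── docker-compose.yml
├── start-spark.sh
├── apps/
│   └── main.py      # PySpark KMeans script, submitted later as /opt/apps/main.py
└── data/
    └── iris.csv     # iris dataset, read later as file:/opt/data/iris.csv
```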
Cluster setup
- Create a custom Docker network:

  ```bash
  docker network create spark_net
  ```
- Create the `Dockerfile`:

  ```dockerfile
  # Use openjdk as the base image
  FROM openjdk:11.0.11-jre-slim-buster

  # Set environment variables
  ENV SPARK_VERSION=3.5.1
  ENV HADOOP_VERSION=3
  ENV SPARK_HOME=/opt/spark
  ENV PYTHONHASHSEED=1

  # Install required dependencies
  RUN apt-get update && apt-get install -y wget procps python3 python3-pip python3-numpy python3-matplotlib python3-scipy python3-pandas python3-simpy \
      && apt-get clean
  RUN update-alternatives --install "/usr/bin/python" "python" "$(which python3)" 1

  # Download and install Apache Spark
  RUN wget -q https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
      && tar -xzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz -C /opt \
      && mv /opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} $SPARK_HOME \
      && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz

  # Add Spark to the PATH and set the Python interpreter used by PySpark
  ENV PATH=$PATH:$SPARK_HOME/bin
  ENV PYSPARK_PYTHON=python3

  # Copy the startup script into the container
  COPY start-spark.sh /start-spark.sh
  RUN chmod +x /start-spark.sh

  # Set the working directory
  WORKDIR $SPARK_HOME

  # Set Spark environment variables
  ENV SPARK_MASTER_PORT=7077 \
      SPARK_MASTER_WEBUI_PORT=8080 \
      SPARK_LOG_DIR=/opt/spark/logs \
      SPARK_MASTER_LOG=/opt/spark/logs/spark-master.out \
      SPARK_WORKER_LOG=/opt/spark/logs/spark-worker.out \
      SPARK_WORKER_WEBUI_PORT=8080 \
      SPARK_WORKER_PORT=7000 \
      SPARK_MASTER="spark://spark-master:7077" \
      SPARK_WORKLOAD="master"

  # Expose the Spark master and worker ports
  EXPOSE 8080 7077 7000

  # Create the log directory and redirect the logs to stdout
  RUN mkdir -p $SPARK_LOG_DIR && \
      touch $SPARK_MASTER_LOG && \
      touch $SPARK_WORKER_LOG && \
      ln -sf /dev/stdout $SPARK_MASTER_LOG && \
      ln -sf /dev/stdout $SPARK_WORKER_LOG

  # Set the startup command
  CMD ["/bin/bash", "/start-spark.sh"]
  ```
  - Note that the Python packages installed above are required for the job to run; the downloads occasionally fail, so retry the build a few times if needed.
  - Note that SPARK_VERSION and HADOOP_VERSION must match a release that is actually listed on the Apache archive page used in the wget URL; a quick check is sketched below.
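    A minimal sketch (assuming curl is available on the host) that checks whether the chosen version combination exists on the archive before building:

    ```bash
    # Expect a 200 status line; a 404 means the SPARK_VERSION/HADOOP_VERSION combination is wrong
    curl -sI https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz | head -n 1
    ```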
- The `start-spark.sh` startup script decides, via SPARK_WORKLOAD, whether the container runs as a master or a worker:

  ```bash
  #!/bin/bash

  . "/opt/spark/bin/load-spark-env.sh"

  if [ "$SPARK_WORKLOAD" == "master" ]; then
      export SPARK_MASTER_HOST=$(hostname)
      cd /opt/spark/bin && ./spark-class org.apache.spark.deploy.master.Master --webui-port $SPARK_MASTER_WEBUI_PORT >>$SPARK_MASTER_LOG
  elif [ "$SPARK_WORKLOAD" == "worker" ]; then
      cd /opt/spark/bin && ./spark-class org.apache.spark.deploy.worker.Worker --webui-port $SPARK_WORKER_WEBUI_PORT $SPARK_MASTER >>$SPARK_WORKER_LOG
  elif [ "$SPARK_WORKLOAD" == "submit" ]; then
      echo "SPARK SUBMIT"
  else
      echo "Undefined Workload Type $SPARK_WORKLOAD, must specify: master, worker, submit"
  fi
  ```
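  As a quick, hypothetical sanity check of this switch (after the image is built in the next step), running the image with an unrecognized workload should just print the error branch and exit:

  ```bash
  # SPARK_WORKLOAD defaults to "master" in the Dockerfile; overriding it with an
  # unknown value exercises the else branch of start-spark.sh
  docker run --rm -e SPARK_WORKLOAD=demo docker-spark-demo
  ```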
- Build the `docker-spark-demo` image from the `Dockerfile`:

  ```bash
  docker build -t docker-spark-demo .
  ```
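  To confirm the build produced the image, an optional check:

  ```bash
  docker image ls docker-spark-demo
  ```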
- Build the cluster from the `docker-spark-demo` image:
  - Create `docker-compose.yml`:

    ```yaml
    version: "3.7"
    services:
      spark-master:
        image: docker-spark-demo
        ports:
          - "8000:8080"
          - "7077:7077"
        volumes:
          - ./apps:/opt/apps
          - ./data:/opt/data
        environment:
          - SPARK_LOCAL_IP=spark-master
          - SPARK_WORKLOAD=master
        networks:
          - spark-net
      spark-worker-1:
        image: docker-spark-demo
        volumes:
          - ./apps:/opt/apps
          - ./data:/opt/data
        ports:
          - "8001:8080"
          - "7001:7000"
        depends_on:
          - spark-master
        environment:
          - SPARK_MASTER=spark://spark-master:7077
          - SPARK_WORKER_CORES=1
          - SPARK_WORKER_MEMORY=1G
          - SPARK_DRIVER_MEMORY=1G
          - SPARK_EXECUTOR_MEMORY=1G
          - SPARK_WORKLOAD=worker
          - SPARK_LOCAL_IP=spark-worker-1
        networks:
          - spark-net
      spark-worker-2:
        image: docker-spark-demo
        volumes:
          - ./apps:/opt/apps
          - ./data:/opt/data
        ports:
          - "8002:8080"
          - "7002:7000"
        depends_on:
          - spark-master
        environment:
          - SPARK_MASTER=spark://spark-master:7077
          - SPARK_WORKER_CORES=1
          - SPARK_WORKER_MEMORY=1G
          - SPARK_DRIVER_MEMORY=1G
          - SPARK_EXECUTOR_MEMORY=1G
          - SPARK_WORKLOAD=worker
          - SPARK_LOCAL_IP=spark-worker-2
        networks:
          - spark-net
    networks:
      spark-net:
    ```
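    Before starting anything, the file can optionally be validated; docker-compose prints the resolved configuration or an error:

    ```bash
    docker-compose config
    ```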
  - Start the containers from `docker-compose.yml`:

    ```bash
    docker-compose up -d
    ```

  - Check that the containers have started (see the command below).
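    A simple way to do that, assuming the command is run from the project directory:

    ```bash
    # Shows the three services and their published ports; all should be "Up"
    docker-compose ps
    ```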
Program implementation
- KMeans implemented with PySpark (this is the main.py under /apps):

  ```python
  '''
  Author       : wyx-hhhh
  Date         : 2024-06-11
  LastEditTime : 2024-06-11
  Description  :
  '''
  from pyspark.sql import SparkSession
  from pyspark.ml.feature import VectorAssembler
  from pyspark.ml.clustering import KMeans

  # Initialize the Spark session
  spark = SparkSession.builder.appName("KMeansApp").getOrCreate()

  # Read the external dataset
  data = spark.read.csv("file:/opt/data/iris.csv", header=True, inferSchema=True)

  # Assemble the feature vector
  feature_cols = data.columns
  feature_cols.remove("label")  # exclude the label column
  assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
  data = assembler.transform(data)

  # Cluster with KMeans
  kmeans = KMeans().setK(3).setSeed(1)
  model = kmeans.fit(data)

  # Print the cluster centers
  centers = model.clusterCenters()
  for center in centers:
      print(center)

  # Stop the Spark session
  spark.stop()
  ```
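  If you also want a quick look at the cluster assignments and clustering quality, a small optional extension (not part of the script above) could be appended before spark.stop():

  ```python
  from pyspark.ml.evaluation import ClusteringEvaluator

  # Assign each row to a cluster; KMeans adds a "prediction" column by default
  predictions = model.transform(data)
  predictions.groupBy("prediction").count().show()

  # Silhouette score as a rough quality measure (closer to 1 is better)
  evaluator = ClusteringEvaluator(featuresCol="features", predictionCol="prediction",
                                  metricName="silhouette")
  print("Silhouette:", evaluator.evaluate(predictions))
  ```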
- Start the program:
  - List the running containers to find their IDs:

    ```bash
    docker container ls
    ```

  - Open a shell in any of the containers by ID (the ID below is from this run; use one of yours):

    ```bash
    docker exec -it e85bf2f9b601 /bin/bash
    ```
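    Inside the container, Spark is already on the PATH (set in the Dockerfile), so a quick sanity check could be:

    ```bash
    spark-submit --version
    ```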
  - Launching from spark-master:
    - Run the script with spark-submit:

      ```bash
      /opt/spark/bin/spark-submit --master spark://spark-master:7077 /opt/apps/main.py
      ```

    - Watch the application while it runs (the master Web UI is published on host port 8000 in docker-compose.yml).
    - Check the output: the three cluster centers are printed by the driver; see the sketch just below for picking them out of the spark-submit output.
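      A hypothetical way to filter just the printed centers out of the rest of the spark-submit logging (each center is printed as one bracketed array per line):

      ```bash
      /opt/spark/bin/spark-submit --master spark://spark-master:7077 /opt/apps/main.py 2>/dev/null | grep '^\['
      ```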
  - Launching from a single spark-worker:
    - Run the script with spark-submit, limiting the job to one executor core:

      ```bash
      /opt/spark/bin/spark-submit --deploy-mode client --master spark://spark-master:7077 --total-executor-cores 1 /opt/apps/main.py
      ```

    - Check the output: as before, the cluster centers are printed by the driver.
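When you are done experimenting, the usual docker-compose cleanup applies:

```bash
# Stops and removes the containers and the compose-managed network
docker-compose down
```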