Upgrading Spark to 3.3.0 on CDH 6.3.2

Background

Cloudera's distribution is no longer freely available after CDH 6.3.2, so commonly used components such as Spark can only be upgraded by building them yourself. According to reports around the web, Spark 3 SQL can run up to 20% faster than Spark 2; I have not verified that claim myself, but Spark 3's AQE (Adaptive Query Execution) really is excellent: it can automatically mitigate data skew in Spark SQL.
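
For reference, the skew handling lives behind ordinary spark-defaults.conf settings. Both of these default to true as of Spark 3.2, so they are shown here only to make the knobs explicit:

spark.sql.adaptive.enabled=true
spark.sql.adaptive.skewJoin.enabled=true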

Download the Software

Software versions: jdk-1.8, maven-3.8.4, scala-2.12.15, spark-3.3.0

Note: do not change the minor versions of maven or scala. If you do, update the corresponding version numbers in the pom as well, otherwise the build will fail with version errors.
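
A quick way to confirm the versions the pom actually expects, run after unpacking the Spark source (the property names below are taken from the stock Spark 3.3.0 pom):

grep -E '<(maven|scala)\.version>' /opt/spark-3.3.0/pom.xml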

wget  http://distfiles.macports.org/scala2.12/scala-2.12.15.tgz
wget  https://archive.apache.org/dist/maven/maven-3/3.8.4/binaries/apache-maven-3.8.4-bin.tar.gz
wget  https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0.tgz

Put the tarballs under /opt, extract them all, and set the environment variables for jdk, scala, and maven:

vim /etc/profile

export JAVA_HOME=/opt/jdk1.8.0_181-cloudera
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export HADOOP_CLASSPATH=`hadoop classpath`
export MAVEN_HOME=/opt/maven-3.8.4
export SCALA_HOME=/opt/scala-2.12.15
export PATH=$JAVA_HOME/bin:$PATH:$SCALA_HOME/bin:$HADOOP_CONF_DIR:$HADOOP_HOME:$MAVEN_HOME/bin
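
After sourcing the profile, a quick sanity check that the toolchain on the PATH matches the versions listed above:

source /etc/profile
java -version
mvn -v
scala -version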

Compile Spark 3

Edit the Spark 3 pom, /opt/spark-3.3.0/pom.xml, to add the aliyun mirror and the cloudera maven repository.

Under the repositories tag, add:

        <repository>
            <id>aliyun</id>
            <url>https://maven.aliyun.com/nexus/content/groups/public</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>

Change the Hadoop version in the pom file:

<hadoop.version>3.0.0-cdh6.3.2</hadoop.version>
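
If you prefer to patch it from the shell, a minimal sed sketch (this assumes the stock value in the Spark 3.3.0 pom is 3.3.2; check your copy first):

sed -i 's|<hadoop.version>3.3.2</hadoop.version>|<hadoop.version>3.0.0-cdh6.3.2</hadoop.version>|' /opt/spark-3.3.0/pom.xml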

Modify make-distribution.sh

vim /opt/spark-3.3.0/dev/make-distribution.sh

### Raise the memory available to the maven build
export MAVEN_OPTS="-Xmx4g -XX:ReservedCodeCacheSize=2g"
### Point MVN at the maven we installed
MVN="/opt/maven-3.8.4/bin/mvn"

Reset the Scala version

cd  /opt/spark-3.3.0
./dev/change-scala-version.sh 2.12

Start the build

./dev/make-distribution.sh --name 3.0.0-cdh6.3.2 --tgz -Pyarn -Phadoop-3 -Phive -Phive-thriftserver -Dhadoop.version=3.0.0-cdh6.3.2 -X

The build uses Spark's make-distribution.sh script, which is itself a wrapper around maven:

  • --tgz packages the distribution as a .tgz archive
  • --name sets the suffix of the generated tarball name (here, the Hadoop version)
  • -Pyarn enables YARN support
  • -Dhadoop.version=3.0.0-cdh6.3.2 pins the Hadoop version to compile against

After a fairly long build, the distribution tarball spark-3.3.0-bin-3.0.0-cdh6.3.2.tgz appears in the root of the source tree.

Deploy the Spark 3 Client

Upload the tarball to the machine where the spark3 client will be deployed:

tar -zxvf spark-3.3.0-bin-3.0.0-cdh6.3.2.tgz -C /opt/cloudera/parcels/CDH/lib
cd /opt/cloudera/parcels/CDH/lib
mv spark-3.3.0-bin-3.0.0-cdh6.3.2/ spark3

Copy the CDH cluster's spark-env.sh into /opt/cloudera/parcels/CDH/lib/spark3/conf:

cp /etc/spark/conf/spark-env.sh  /opt/cloudera/parcels/CDH/lib/spark3/conf
chmod +x /opt/cloudera/parcels/CDH/lib/spark3/conf/spark-env.sh

# Edit spark-env.sh
vim /opt/cloudera/parcels/CDH/lib/spark3/conf/spark-env.sh

export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark3
HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}

Copy the gateway node's hive-site.xml into the spark3/conf directory; it needs no changes:

cp /etc/hive/conf/hive-site.xml /opt/cloudera/parcels/CDH/lib/spark3/conf/
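
A quick way to confirm the client will reach the right Hive metastore (the property name is standard; the value is cluster-specific):

grep -A1 'hive.metastore.uris' /opt/cloudera/parcels/CDH/lib/spark3/conf/hive-site.xml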

Create spark-sql

vim /opt/cloudera/parcels/CDH/bin/spark-sql

#!/bin/bash 
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
SOURCE="${BASH_SOURCE[0]}"  
BIN_DIR="$( dirname "$SOURCE" )"  
while [ -h "$SOURCE" ]  
do  
 SOURCE="$(readlink "$SOURCE")"  
 [[ $SOURCE != /* ]] && SOURCE="$BIN_DIR/$SOURCE"  
 BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"  
done  
BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"  
LIB_DIR=$BIN_DIR/../lib  
export HADOOP_HOME=$LIB_DIR/hadoop  
  
# Autodetect JAVA_HOME if not defined  
. $LIB_DIR/bigtop-utils/bigtop-detect-javahome  
  
exec $LIB_DIR/spark3/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver "$@"

Set up the spark-sql shortcut

chmod +x /opt/cloudera/parcels/CDH/bin/spark-sql
alternatives --install /usr/bin/spark-sql spark-sql /opt/cloudera/parcels/CDH/bin/spark-sql 1
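
To confirm the registration took effect:

which spark-sql
alternatives --display spark-sql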

Configure conf

cd /opt/cloudera/parcels/CDH/lib/spark3/conf
## Enable the log4j2 configuration
mv log4j2.properties.template log4j2.properties
## Start from the Spark 2 spark-defaults.conf
cp /opt/cloudera/parcels/CDH/lib/spark/conf/spark-defaults.conf ./

# Edit spark-defaults.conf
vim /opt/cloudera/parcels/CDH/lib/spark3/conf/spark-defaults.conf

Remove the spark.extraListeners, spark.sql.queryExecutionListeners, and spark.yarn.jars entries (the CDH-specific listener classes do not exist in this Apache build, and the old jars path points at Spark 2), then add:

spark.yarn.jars=hdfs:///spark/3versionJars/*

Upload the Spark 3 jars to that HDFS path:

hadoop fs -mkdir -p /spark/3versionJars
cd /opt/cloudera/parcels/CDH/lib/spark3/jars
hadoop fs -put *.jar /spark/3versionJars
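
At this point spark-sql should work end to end. A quick check that the jars landed in HDFS, plus a minimal smoke test (this assumes the Hive metastore is reachable from this node):

hadoop fs -ls /spark/3versionJars | head
spark-sql -e "show databases;"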

Create spark3-submit

vim /opt/cloudera/parcels/CDH/bin/spark3-submit

#!/usr/bin/env bash
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
SOURCE="${BASH_SOURCE[0]}"
BIN_DIR="$( dirname "$SOURCE" )"
while [ -h "$SOURCE" ]
do
 SOURCE="$(readlink "$SOURCE")"
 [[ $SOURCE != /* ]] && SOURCE="$BIN_DIR/$SOURCE"
 BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
done
BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
LIB_DIR=/opt/cloudera/parcels/CDH/lib
export HADOOP_HOME=$LIB_DIR/hadoop

# Autodetect JAVA_HOME if not defined
. $LIB_DIR/bigtop-utils/bigtop-detect-javahome

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

exec $LIB_DIR/spark3/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

Set up the spark3-submit shortcut

chmod +x /opt/cloudera/parcels/CDH/bin/spark3-submit
alternatives --install /usr/bin/spark3-submit spark3-submit /opt/cloudera/parcels/CDH/bin/spark3-submit 1

Test spark3-submit

spark3-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 --queue root.default /opt/cloudera/parcels/CDH/lib/spark3/examples/jars/spark-examples*.jar 10

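In cluster mode the result is written to the driver container's log. A hedged way to verify after the job finishes, assuming YARN log aggregation is enabled (substitute the real application id):

yarn logs -applicationId <application_id> | grep "Pi is roughly"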

Caveats

If dynamic resource allocation (Spark Dynamic Allocation) is enabled, jobs will fail with errors like the following:

FetchFailed(BlockManagerId(2, n6, 7337, None), shuffleId=57, mapId=136, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException
	at org.apache.spark.errors.SparkCoreErrors$.fetchFailedError(SparkCoreErrors.scala:312)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1169)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:904)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:85)
	at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
	.......
Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: Unknown message type: 9
	at org.apache.spark.network.shuffle.protocol.BlockTransferMessage$Decoder.fromByteBuffer(BlockTransferMessage.java:71)
	at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:81)
	at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:150)
	......

This happens because the Spark 3 job fetches shuffle blocks through the Spark 2.4 external shuffle service that ships with CDH, which cannot decode the new fetch protocol. Tell Spark 3 to fall back to the old protocol by adding useOldFetchProtocol to spark-defaults.conf:

spark.shuffle.useOldFetchProtocol=true

With that, the Hadoop cluster has both the CDH-bundled Spark 2.4.0 and Apache Spark 3.3.0.

Done ~~~
