一、Background
Spark version: 2.3.1
Scala version: 2.11.8
二、Configuration options
| Option | Value | Description | Group |
|---|---|---|---|
| spark.sql.crossJoin.enabled | true | When true, SQL queries may perform Cartesian-product (cross) joins | 1 |
| spark.dynamicAllocation.enabled | true | When true, Spark starts the ExecutorAllocationManager and manages executors dynamically | 2 |
| spark.shuffle.service.enabled | true | When true, the external shuffle service is enabled; used together with the ExecutorAllocationManager | 2 |
| spark.dynamicAllocation.initialExecutors | number | Initial number of executors | 2 |
| spark.dynamicAllocation.maxExecutors | number | Maximum number of executors | 2 |
| spark.dynamicAllocation.minExecutors | number | Minimum number of executors | 2 |
| spark.default.parallelism | number | Task parallelism; 2 to 3 times num-executors * executor-cores is usually a good value; this parameter matters a lot | 3 |
| spark.sql.adaptive.enabled | true | Defaults to false; switch for the adaptive execution framework | 4 |
| spark.sql.adaptive.skewedJoin.enabled | true | Defaults to false; switch for skewed-join handling | 4 |
| spark.driver.extraJavaOptions | -Dlog4j.configuration=file:log4j.properties / -Xss30M | JVM options for the driver | 5 |
| spark.hadoop.ipc.client.fallback-to-simple-auth-allowed | true | Allow fallback to simple authentication, e.g. for cross-cluster HDFS data migration | 6 |
| spark.shuffle.memoryFraction | 0.3 | Fraction of executor memory given to shuffle read tasks for aggregation; defaults to 0.2 (20%) | 7 |
| spark.storage.memoryFraction | 0.5 | Fraction of executor memory available for persisted RDD data; defaults to 0.6, i.e. 60% of executor memory can hold persisted RDDs | 8 |
| hive.metastore.client.factory.class | com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory | Use the AWS Glue Data Catalog as the Hive metastore | 9 |
| hive.exec.dynamic.partition | true | Hive writes: enable dynamic partitioning | 10 |
| hive.exec.dynamic.partition.mode | nonstrict | Hive writes: dynamic-partition mode | 10 |
| spark.sql.sources.partitionOverwriteMode | dynamic | Overwrite Hive partitions dynamically | 10 |
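To illustrate how a group of these options fits together in code, here is a minimal sketch that enables dynamic allocation (group 2) and adaptive execution (group 4) when building a SparkSession. The numeric values are placeholders, not recommendations, and options that affect executors or the shuffle service must be set before the SparkContext starts (or passed on the spark-submit command line, as shown in section 三).

```scala
import org.apache.spark.sql.SparkSession

// Placeholder values for illustration; tune them for the actual cluster.
val spark = SparkSession.builder()
  .appName("conf_example")
  .config("spark.dynamicAllocation.enabled", "true")        // group 2: dynamic executor management
  .config("spark.shuffle.service.enabled", "true")          // group 2: required by dynamic allocation
  .config("spark.dynamicAllocation.initialExecutors", "2")
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .config("spark.sql.adaptive.enabled", "true")             // group 4: adaptive execution switch
  .config("spark.sql.adaptive.skewedJoin.enabled", "true")  // group 4: skewed-join handling
  .getOrCreate()
```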
三、How to set the configuration
3.1、Setting options in code
In Scala there are two ways to set options, as shown below:

```scala
import org.apache.spark.sql.SparkSession

// Way 1: set options on the builder before the session is created
val spark: SparkSession = SparkSession.builder()
  .config(
    "hive.metastore.client.factory.class",
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
  ) // use the AWS Glue Data Catalog as the Hive metastore
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", true)
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .getOrCreate()

// Way 2: set options on the running session's runtime config
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
```
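As a concrete use of the dynamic-partition settings above, here is a minimal sketch of a partition-overwriting write. The table names `db.events_staging` and `db.events` are hypothetical; it assumes `db.events` already exists and is partitioned (e.g. by `dt`).

```scala
import org.apache.spark.sql.SaveMode

// Hypothetical source and target tables; db.events is assumed to be partitioned by dt.
val df = spark.table("db.events_staging")

// With spark.sql.sources.partitionOverwriteMode=dynamic, an overwrite insert replaces only
// the partitions that appear in df, rather than truncating the whole table.
df.write
  .mode(SaveMode.Overwrite)
  .insertInto("db.events")
```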
3.2、Submit command example

```bash
spark-submit \
  --name conf_example \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 1 \
  --executor-cores 1 \
  --executor-memory 1G \
  --driver-memory 1G \
  --class xxx.xxxx.xxxxx.xxx.xxxx \
  --files conf.properties,log4j.properties,log4j2.xml \
  --conf spark.hadoop.ipc.client.fallback-to-simple-auth-allowed=true \
  --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
  --jars sss.jar,wwqq.jar \
  main.jar
```
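Files listed under `--files` are shipped to the containers and placed in the working directory of the executors (and of the driver in yarn-cluster mode), so the application can open them by bare file name. A minimal sketch of loading the `conf.properties` shipped above; the property key `jdbc.url` is purely hypothetical:

```scala
import java.io.FileInputStream
import java.util.Properties

// conf.properties was distributed with --files, so in yarn-cluster mode a relative
// path resolves inside the container's working directory.
val props = new Properties()
val in = new FileInputStream("conf.properties")
try props.load(in) finally in.close()

val jdbcUrl = props.getProperty("jdbc.url") // hypothetical key for illustration
```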
3.3、spark-submit template and examples

The general form of a spark-submit command (annotated template; replace the placeholders):

```bash
./bin/spark-submit \
  # entry point: the application's main class
  --class <main-class> \
  # application name
  --name <appname> \
  # resource files (e.g. from the resources directory) to ship with the job
  --files <files> \
  # extra jars the application depends on
  --jars <jars> \
  # memory per executor
  --executor-memory 1G \
  # number of executors
  --num-executors 1 \
  # master URL (run mode)
  --master <master-url> \
  # client or cluster mode; defaults to client
  --deploy-mode <deploy-mode> \
  # arbitrary Spark properties
  --conf <key>=<value> \
  # path to the application jar
  <application-jar> \
  # arguments passed to the main method
  [application-arguments]
```
A few complete examples:

```bash
# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \ # can be client for client mode
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
```
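To check which of the submitted `--conf` values the application actually picked up, they can be read back at runtime; a minimal sketch, assuming a `spark` session as in section 3.1:

```scala
// Dump every Spark property visible to the running application, e.g. to verify --conf values.
spark.sparkContext.getConf.getAll
  .sortBy(_._1)
  .foreach { case (key, value) => println(s"$key=$value") }

// Read a single SQL option back through the runtime config (with a default fallback).
println(spark.conf.get("spark.sql.sources.partitionOverwriteMode", "static"))
```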