Spark conf configuration tuning


1. Background

Spark version: 2.3.1

Scala version: 2.11.8

2. conf options

Options that share the same group number are typically used together.

| Option | Value | Description | Group |
| --- | --- | --- | --- |
| spark.sql.crossJoin.enabled | true | When true, SQL queries are allowed to perform Cartesian-product (cross) joins | 1 |
| spark.dynamicAllocation.enabled | true | When true, Spark starts the ExecutorAllocationManager and manages executors dynamically | 2 |
| spark.shuffle.service.enabled | true | When true, the external shuffle service is enabled; used together with the ExecutorAllocationManager | 2 |
| spark.dynamicAllocation.initialExecutors | numeric | Initial number of executors | 2 |
| spark.dynamicAllocation.maxExecutors | numeric | Maximum number of executors | 2 |
| spark.dynamicAllocation.minExecutors | numeric | Minimum number of executors | 2 |
| spark.default.parallelism | numeric | Task parallelism; 2-3 times num-executors * executor-cores is a good rule of thumb; this parameter matters a lot | 3 |
| spark.sql.adaptive.enabled | true | Defaults to false; switch for the adaptive execution framework | 4 |
| spark.sql.adaptive.skewedJoin.enabled | true | Defaults to false; switch for skewed-join handling | 4 |
| spark.driver.extraJavaOptions | -Dlog4j.configuration=file:log4j.properties / -Xss30M | JVM options for the driver (e.g. log4j config, stack size) | 5 |
| spark.hadoop.ipc.client.fallback-to-simple-auth-allowed | true | Allows falling back to simple authentication, e.g. for cross-cluster HDFS data migration | 6 |
| spark.shuffle.memoryFraction | 0.3 | Fraction of executor memory given to shuffle read tasks for aggregation; defaults to 20% | 7 |
| spark.storage.memoryFraction | 0.5 | Fraction of executor memory that persisted RDD data may occupy; defaults to 0.6, i.e. 60% of executor memory can hold cached RDD data | 8 |
| hive.metastore.client.factory.class | com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory | Use the AWS Glue Data Catalog as the Hive metastore | 9 |
| hive.exec.dynamic.partition | true | Hive writes: enable dynamic partitioning | 10 |
| hive.exec.dynamic.partition.mode | nonstrict | Hive writes: dynamic partition mode | 10 |
| spark.sql.sources.partitionOverwriteMode | dynamic | Overwrite Hive partitions dynamically (only the partitions being written) | 10 |
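
As a quick illustration, the dynamic-allocation, parallelism and adaptive-execution options above (groups 2-4) can be set together when building the session. This is only a minimal sketch: the option names are taken from the table, but the numeric values are placeholders that should be tuned per job, and in many clusters these options are passed via spark-submit --conf or spark-defaults.conf instead.

import org.apache.spark.sql.SparkSession

// Minimal sketch: dynamic allocation (group 2), default parallelism (group 3)
// and adaptive execution (group 4). All numbers are illustrative placeholders.
val spark = SparkSession.builder()
  .appName("conf_example")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")          // required by dynamic allocation
  .config("spark.dynamicAllocation.initialExecutors", "2")
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .config("spark.default.parallelism", "60")                // ~2-3x num-executors * executor-cores
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.skewedJoin.enabled", "true")
  .getOrCreate()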

3. Ways to set conf

3.1 Setting conf in code

In Scala, conf can be set in two ways, as shown below: on the SparkSession builder, or at runtime via spark.conf.set.

import org.apache.spark.sql.SparkSession

// 1) Set conf while building the SparkSession
val spark: SparkSession = SparkSession.builder()
  .config(
    "hive.metastore.client.factory.class",
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
  ) // use the AWS Glue Data Catalog as the Hive metastore
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", true)              // enable dynamic partitioning for Hive writes
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .getOrCreate()

// 2) Set conf at runtime on an existing session
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
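
With the dynamic-partition settings above and partitionOverwriteMode=dynamic, a partition-aware overwrite could look like the following sketch. The input path, the table name db.events and the partition column dt are hypothetical; only the partitions present in the DataFrame are replaced, other partitions of the table stay untouched.

import org.apache.spark.sql.SaveMode

// Hypothetical example: `db.events` is an existing Hive table partitioned by `dt`.
val df = spark.read.parquet("/path/to/input")   // placeholder input path

df.write
  .mode(SaveMode.Overwrite)   // with partitionOverwriteMode=dynamic, only matching partitions are overwritten
  .insertInto("db.events")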

3.2 Submitting with spark-submit

spark-submit \
--name conf_example \
--master yarn \
--deploy-mode cluster \
--num-executors 1 \
--executor-cores 1 \
--executor-memory 1G \
--driver-memory 1G \
--class xxx.xxxx.xxxxx.xxx.xxxx \
--files conf.properties,log4j.properties,log4j2.xml \
--conf spark.hadoop.ipc.client.fallback-to-simple-auth-allowed=true \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
--jars sss.jar,wwqq.jar \
main.jar 
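
One detail worth noting about --files: the listed files (conf.properties, log4j.properties, ...) are shipped to the working directory of the driver and executors when running in cluster mode, so the application can open them by their bare file name. Below is a minimal sketch of reading such a properties file on the driver, assuming conf.properties holds plain key=value pairs; the key db.url is hypothetical.

import java.io.FileInputStream
import java.util.Properties

// Read the conf.properties shipped with --files; in yarn-cluster mode it is
// placed in the container's working directory, so the bare name resolves.
// Adjust the path when running in client mode.
val props = new Properties()
val in = new FileInputStream("conf.properties")
try props.load(in) finally in.close()

val dbUrl = props.getProperty("db.url")   // hypothetical key in conf.properties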

3.3 spark-submit options

./bin/spark-submit \

  # main entry class
  --class <main-class> \

  # application name
  --name <appname> \

  # resource files (e.g. from the resources directory) shipped with the job
  --files <files> \

  # extra jars the application depends on
  --jars <jars> \

  # memory per executor
  --executor-memory 1G \

  # number of executors
  --num-executors 1 \

  # cluster manager / master URL
  --master <master-url> \

  # client or cluster mode; defaults to client
  --deploy-mode <deploy-mode> \

  # arbitrary Spark conf as key=value
  --conf <key>=<value> \

  # path to the application jar
  <application-jar> \

  # arguments passed to the main method
  [application-arguments]

  

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \  # can be client for client mode
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000