spark-sql explain: a detailed guide to execution plans

  1. The definition in SqlBase.g4
    EXPLAIN (LOGICAL | FORMATTED | EXTENDED | CODEGEN | COST)? statement

    What CODEGEN does:
    When spark.sql.codegen.wholeStage = true (whole-stage code generation is enabled), explain prints the generated code.

  2. Usage in the shell:
    explain query;
    explain codegen query;
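
    The same information can also be obtained from the Scala side. The snippet below is a minimal sketch for spark-shell; it assumes the userTable view defined in section 3 already exists, and uses debugCodegen() from the org.apache.spark.sql.execution.debug package, which prints the code generated for each whole-stage-codegen subtree.

import org.apache.spark.sql.execution.debug._   // adds debugCodegen() to Dataset

val q = sql("select concat(uid, '_dddd') as uid from userTable where uid > 'label_1'")

q.explain(true)    // parsed / analyzed / optimized logical plans + physical plan
q.debugCodegen()   // the generated code, same idea as `explain codegen` in the shell

// The SQL form works from Scala too; EXPLAIN returns a single string column.
sql("explain codegen select * from userTable where uid > 'label_1'")
  .collect()
  .foreach(row => println(row.getString(0)))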

  3. Reading the plan

    Define the tables and data:

Seq(
    ("uid_1", "f", 1, 21, Seq("label_1","label_2","label_3","label_4"), "ext_1,ext_2,ext_3")
    ,("uid_2", "m", 2, 23, Seq("label_1","label_2","label_3","label_4"), "ext_11,ext_2,ext_31")
    ,("uid_3", "m", 1, 24, Seq("label_1","label_2","label_3","label_4"), "ext_1,ext_2,ext_32")
    ,("uid_4", "f", 2, 36, Seq("label_1","label_2","label_3","label_4"), "ext_12,ext_2,ext_33")
    ,("uid_5", "m", 1, 21, Seq("label_1","label_2","label_3","label_4"), "ext_12,ext_21,ext_32")
  ).toDF("uid","sex","num1", "num2", "labels","ext").createOrReplaceTempView("userTable")
 
  Seq(
    ("uid_1", "f", 1, 21)
    ,("uid_2", "m", 2, 23)
    ,("uid_3", "m", 1, 24)
    ,("uid_4", "f", 2, 36)
    ,("uid_5", "m", 1, 21)
  ).toDF("uid","sex","num1", "num2").createOrReplaceTempView("userTable2")
  

val testsql =
    """
      |with t as (
      |select concat(uid, '_dddd') as uid
      |from userTable
      |
      |)
      |select t1.uid
      |from
      |   (
      |   select * from t where uid >'label_1'
      |   )as t1
      |   left join
      |   (
      |   select * from t where uid='label_1'
      |   )as t2 on t1.uid = t2.uid
      |
    """.stripMargin
    
  val df = sql(s"$testsql")
  df.explain(true)

Output: the final physical plan (explain(true) also prints the parsed, analyzed, and optimized logical plans; only the physical plan is shown here)

== Physical Plan ==
*(5) Project [uid#50]
+- SortMergeJoin [uid#50], [uid#54], LeftOuter
   :- *(2) Sort [uid#50 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(uid#50, 5)
   :     +- *(1) Project [concat(_1#14, _dddd) AS uid#50]
   :        +- *(1) Filter (concat(_1#14, _dddd) > label_1)
   :           +- LocalTableScan [_1#14, _2#15, _3#16, _4#17, _5#18, _6#19]
   +- *(4) Sort [uid#54 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(uid#54, 5)
         +- *(3) Project [concat(_1#14, _dddd) AS uid#54]
            +- *(3) Filter (concat(_1#14, _dddd) = label_1)
               +- LocalTableScan [_1#14, _2#15, _3#16, _4#17, _5#18, _6#19]

Notes:

1. `*` marks a node whose code is generated at runtime (whole-stage code generation).
2. The number in parentheses is the codegen stage id, i.e. the id of the code-generation stage.
    2.1 How codegen stages are divided:
        Adjacent (parent-child) nodes that support codegen are placed in the same codegen stage, until a node that does not support codegen is reached.
        e.g. Filter and Project support codegen while Exchange does not, so Filter and Project end up in the same stage.
    2.2 How to tell whether a node supports codegen:
        Check whether the node's source class mixes in CodegenSupport and has def supportCodegen: Boolean = true (see the sketch after this list).

3. Operators with the same codegen stage id are fused (collapsed) together into one generated function, reducing function calls.
4. #number is the attribute (column) id; attributes with the same id refer to the same column.
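
    To check point 2.2 interactively instead of reading the source, you can walk the physical plan tree yourself. The sketch below is for spark-shell and reuses the df defined in the example above; CodegenSupport and SparkPlan are the types in org.apache.spark.sql.execution as of Spark 2.x (they may move between versions).

import org.apache.spark.sql.execution.{CodegenSupport, SparkPlan}

// Walk the executed physical plan and report, for every node, whether it mixes in
// CodegenSupport and what supportCodegen returns. Note that in the executed plan
// the codegen stages are already wrapped in WholeStageCodegen / InputAdapter nodes.
def printCodegenSupport(plan: SparkPlan): Unit = plan.foreach {
  case p: CodegenSupport => println(s"${p.nodeName}: supportCodegen = ${p.supportCodegen}")
  case p                 => println(s"${p.nodeName}: no CodegenSupport")
}

printCodegenSupport(df.queryExecution.executedPlan)
// Project, Filter and Sort report true, while Exchange reports no CodegenSupport,
// which is exactly where the codegen stage boundaries fall in the plan above.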

Common nodes and what they mean

  1. Project
    Meaning: projection
    Source class: ProjectExec
    Produced by: select
    CodegenSupport: true
    explain output: Project [projected column#column id, ...]

  2. Filter
    Meaning: filter condition
    Source class: FilterExec
    Produced by: where
    CodegenSupport: true
    explain output: Filter (predicate expression)

  3. Exchange
    Meaning: shuffle node
    Source class: ShuffleExchangeExec
    Produced by: group by, order by, join, etc. may all produce one. In EnsureRequirements, whether to insert an Exchange is decided by comparing the parent's requiredChildDistribution (the distribution it requires of its child's output) with the child's outputPartitioning (the child's actual output partitioning).
    CodegenSupport: false
    explain output: Exchange partitioning-method(partition key, number of partitions)
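
    As an illustration (a sketch reusing the userTable view; the exact plan text varies by version): an aggregation requires its input to be clustered by the grouping key, so EnsureRequirements inserts a shuffle only when the child is not already partitioned that way.

import spark.implicits._   // for the $"col" syntax; already in scope in spark-shell

// The scan below the aggregate has no useful partitioning, so EnsureRequirements
// inserts Exchange hashpartitioning(sex, ...) between the partial and final
// HashAggregate nodes.
spark.table("userTable").groupBy($"sex").count().explain()

// If the input is already hash-partitioned by the grouping key, the requirement is
// satisfied and no additional Exchange is inserted (only the one produced by
// repartition itself appears in the plan).
spark.table("userTable").repartition($"sex").groupBy($"sex").count().explain()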

  4. Sort
    Meaning: sort
    Source class: SortExec
    Produced by: order by, group by (sort aggregate), join (sortMergeJoin). In EnsureRequirements, whether to insert a Sort is decided by comparing the parent's requiredChildOrdering (the ordering it requires of its child's output) with the child's outputOrdering (the child's actual ordering).
    CodegenSupport: true
    explain output: Sort [sort column#id sort direction (ASC|DESC) null ordering (NULLS FIRST: nulls sort first, NULLS LAST: nulls sort last)]
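
    Similarly for Sort (a sketch using the same view): a global order by requires a total ordering, so EnsureRequirements adds a Sort, with a range-partitioning Exchange below it.

// Expect something like:
//   Sort [num2 ASC NULLS FIRST], true, 0   (true = global sort)
//   +- Exchange rangepartitioning(num2 ASC NULLS FIRST, ...)
sql("select uid, num2 from userTable order by num2").explain()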

  5. SortMergeJoin
    Meaning: join by sorting both sides
    Source class: SortMergeJoinExec
    Produced by: join (used when the conditions for broadcastJoin and shuffledHashJoin are not met)
    CodegenSupport: false
    explain output: SortMergeJoin join keys join type
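
    With tables as small as the demo views, a plain join would normally be broadcast; disabling the broadcast threshold is a quick way to see a SortMergeJoin (a sketch; -1 turns the threshold off).

// With broadcasting disabled and preferSortMergeJoin left at its default (true),
// the equi-join below is planned as SortMergeJoin, with Exchange + Sort on both sides.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
sql("""
  select a.uid, b.num2
  from userTable a
  join userTable2 b on a.uid = b.uid
""").explain()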

  6. BroadcastHashJoin
    Meaning: broadcast join
    Source class: BroadcastHashJoinExec
    Produced by: join (a broadcast hint is used, or one side is smaller than the broadcast threshold)
    CodegenSupport: true
    explain output: BroadcastHashJoin join keys join type build side (BuildRight: the right table is broadcast, BuildLeft: the left table is broadcast)
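
    Conversely, a broadcast hint forces the broadcast path regardless of the size estimate (a sketch; both the SQL hint and functions.broadcast are standard Spark APIs).

import org.apache.spark.sql.functions.broadcast

// SQL hint form: ask the planner to broadcast userTable2.
sql("""
  select /*+ BROADCAST(userTable2) */ a.uid, b.num2
  from userTable a
  join userTable2 b on a.uid = b.uid
""").explain()

// DataFrame form: same effect via functions.broadcast().
spark.table("userTable").join(broadcast(spark.table("userTable2")), "uid").explain()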

  7. ShuffledHashJoin
    Meaning: shuffle first, then hash join
    Source class: ShuffledHashJoinExec
    Produced by: join (the BroadcastHashJoin conditions are not met, the per-partition size, i.e. total table size / number of partitions, is smaller than the broadcast threshold, and spark.sql.join.preferSortMergeJoin=false (default true))
    CodegenSupport: false
    explain output: ShuffledHashJoin join keys join type
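
    These conditions are easiest to read as configuration knobs. The sketch below only shows which settings are involved; with the tiny in-memory demo tables the size-based conditions will usually not hold, so the planner may still pick SortMergeJoin here.

// For ShuffledHashJoin (Spark 2.x JoinSelection) the build side must be
//   - larger than spark.sql.autoBroadcastJoinThreshold (otherwise it is broadcast),
//   - smaller than that threshold * spark.sql.shuffle.partitions,
//   - at least 3x smaller than the other side,
// and spark.sql.join.preferSortMergeJoin must be false.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (1 * 1024 * 1024).toString) // e.g. 1 MB
sql("""
  select a.uid, b.num2
  from userTable a
  join userTable2 b on a.uid = b.uid
""").explain()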

  8. HashAggregate
    Meaning: hash-based aggregation
    Source class: HashAggregateExec
    Produced by: group by, aggregate functions (count, sum), distinct
    CodegenSupport: true
    explain output: HashAggregate(grouping keys, aggregate functions, output)
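
    For example (a sketch reusing userTable), a group by with count/sum is planned as a partial HashAggregate, an Exchange on the grouping key, and a final HashAggregate.

// Expect a partial HashAggregate below the Exchange hashpartitioning(sex, ...)
// and a final HashAggregate above it, both keyed on sex.
sql("select sex, count(1) as cnt, sum(num2) as total from userTable group by sex").explain()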

  9. SortAggregate
    Meaning: sort-based aggregation
    Source class: SortAggregateExec
    Produced by: group by, aggregate functions (count, sum), distinct, when HashAggregate cannot be used
    CodegenSupport: false
    explain output: SortAggregate(grouping keys, aggregate functions, output)
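
    HashAggregate requires every field of the aggregation buffer to be a mutable fixed-width type that UnsafeRow can update in place; when that is not the case, the planner falls back to SortAggregate. A sketch, assuming Spark 2.x behavior:

// max over a string column keeps a string in its aggregation buffer, which
// UnsafeRow cannot update in place, so the plan uses SortAggregate (with a Sort
// on the grouping key below it) instead of HashAggregate.
sql("select sex, max(ext) as max_ext from userTable group by sex").explain()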

  10. ObjectHashAggregate
    Meaning: hash-based aggregation over object (non-UnsafeRow) aggregation buffers; differs from HashAggregate in how the buffer is stored
    Source class: ObjectHashAggregateExec
    Produced by: group by / distinct with aggregate functions that use object-based buffers (e.g. collect_list, collect_set, percentile)
    CodegenSupport: false
    explain output: ObjectHashAggregate(grouping keys, aggregate functions, output)
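
    ObjectHashAggregate handles aggregate functions whose buffer is an arbitrary JVM object rather than an UnsafeRow (TypedImperativeAggregate implementations such as collect_list, collect_set, percentile); it is enabled by spark.sql.execution.useObjectHashAggregateExec (default true). A sketch:

// collect_list keeps its intermediate state as a JVM object, so the plan uses
// ObjectHashAggregate; with spark.sql.execution.useObjectHashAggregateExec=false
// the same query falls back to SortAggregate.
sql("select sex, collect_list(uid) as uids from userTable group by sex").explain()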

  11. Window
    Meaning: window operation
    Source class: WindowExec
    Produced by: an over clause
    CodegenSupport: false
    explain output: Window [window function (e.g. row_number) window specification (partition by / order by), window frame]
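
    For example (a sketch reusing userTable), a row_number over a partition/order spec produces a Window node, with the Exchange and Sort below it inserted by EnsureRequirements as described earlier.

// Expect Window [row_number() ... partition by sex order by num2 ...] on top of
// Sort [sex ..., num2 ...] and Exchange hashpartitioning(sex, ...).
sql("""
  select uid, sex, num2,
         row_number() over (partition by sex order by num2) as rn
  from userTable
""").explain()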

Follow-up articles will walk through the implementation of each of these nodes.