-
SqlBase.g4 中的定义
EXPLAIN (LOGICAL | FORMATTED | EXTENDED | CODEGEN | COST)? statementCODEGEN 作用:
当spark.sql.codegen.wholeStage = true 时(开启全阶段代码生成), explain 会展示生成的代码。 -
shell 中的使用:
expalin query;
explain codegen query; -
计划含义解析
定义表及数据:
Seq(
("uid_1", "f", 1, 21, Seq("label_1","label_2","label_3","label_4"), "ext_1,ext_2,ext_3")
,("uid_2", "m", 2, 23, Seq("label_1","label_2","label_3","label_4"), "ext_11,ext_2,ext_31")
,("uid_3", "m", 1, 24, Seq("label_1","label_2","label_3","label_4"), "ext_1,ext_2,ext_32")
,("uid_4", "f", 2, 36, Seq("label_1","label_2","label_3","label_4"), "ext_12,ext_2,ext_33")
,("uid_5", "m", 1, 21, Seq("label_1","label_2","label_3","label_4"), "ext_12,ext_21,ext_32")
).toDF("uid","sex","num1", "num2", "labels","ext").createOrReplaceTempView("userTable")
Seq(
("uid_1", "f", 1, 21)
,("uid_2", "m", 2, 23)
,("uid_3", "m", 1, 24)
,("uid_4", "f", 2, 36)
,("uid_5", "m", 1, 21)
).toDF("uid","sex","num1", "num2").createOrReplaceTempView("userTable2")
val testsql =
"""
|with t as (
|select concat(uid,'_dddd')uid
|from userTable
|
|)
|select t1.uid
|from
| (
| select * from t where uid >'label_1'
| )as t1
| left join
| (
| select * from t where uid='label_1'
| )as t2 on t1.uid = t2.uid
|
""".stripMargin
val df = sql(s"$testsql")
df.explain(true)
输出: 最终的物理执行计划
== Physical Plan ==
*(5) Project [uid#50]
+- SortMergeJoin [uid#50], [uid#54], LeftOuter
:- *(2) Sort [uid#50 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(uid#50, 5)
: +- *(1) Project [concat(_1#14, _dddd) AS uid#50]
: +- *(1) Filter (concat(_1#14, _dddd) > label_1)
: +- LocalTableScan [_1#14, _2#15, _3#16, _4#17, _5#18, _6#19]
+- *(4) Sort [uid#54 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(uid#54, 5)
+- *(3) Project [concat(_1#14, _dddd) AS uid#54]
+- *(3) Filter (concat(_1#14, _dddd) = label_1)
+- LocalTableScan [_1#14, _2#15, _3#16, _4#17, _5#18, _6#19]
说明:
1. `*`代表当前节点可以动态生成代码
2. 括号中的数字,代表的是codegen stage id,即代码生成的阶段id。
2.1 codegen stage的划分:
相邻节点(父子节点)且节点支持codegen ,会被划分到同一个 codegen stage,直到不支持codegen的节点。
eg:Filter 、Project 节点支持,Exchange 节点不支持,那么Filter 、Project就会被划分到同一个stage中,
2.2 如何判断节点是否支持 codegen:
查看节点源代码是否 with CodegenSupport 且 def supportCodegen: Boolean = true
3. 相同的codegen stage id 代码会被折叠在一起,减少函数的调用。
4. #数字,代表字段id, id相同的代表同一个字段。
常用节点及其含义
-
Project
含义:投影
源代码类:ProjectExec
产生原因: select
CodegenSupport : true
explain 输出解释:Project [投影字段#投影字段的id,..] -
Filter
含义:过滤条件
源代码类:FilterExec
产生原因: where
CodegenSupport : true
explain 输出解释:Filter(条件表达式) -
Exchange
含义:shuffle节点
源代码类:ShuffleExchangeExec
产生原因: group by 、order by 、join 等都有可能产生。在EnsureRequirements中根据父节点的requiredChildDistribution(要求子节点输出分布)和子节点的outputPartitioning(输出分布)决定是否插入Exchange节点
CodegenSupport : false
explain 输出解释:Exchange 分区方式(分区key, 分区数) -
Sort
含义:排序
源代码类:SortExec
产生原因:order by, group by(sort aggregate)
,join(sortMergeJoin)。在EnsureRequirements中根据父节点的requiredChildOrdering(要求子节点排序情况)和子节点的outputOrdering(排序情况)决定是否插入Sort节点
CodegenSupport : true
explain 输出解释:Sort [排序字段#id 排序方向(ASC|DESC) null值在前还是在后(NULLS FIRST:null 排在前边, NULLS LAST:null 排在后边) ] -
SortMergeJoin
含义:排序方式进行join
源代码类:SortMergeJoinExec
产生原因:join (条件不满足使用 broadcastJoin 、shuffledHashJoin 时使用 sortmergejoin)
CodegenSupport : false
explain 输出解释:SortMergeJoin join的条件 join类型 -
BroadcastHashJoin
含义:广播方式进行join
源代码类:BroadcastHashJoinExec
产生原因:join(使用hint 、一侧表小于广播阈值)
CodegenSupport :true
explain 输出解释:BroadcastHashJoin join的条件 join类型 build侧(BuildRight:右表broadcast, BuildLeft :左表broadcast) -
ShuffledHashJoin
含义:先shuffle再hashjoin
源代码类:ShuffledHashJoinExec
产生原因:join(条件不满足BroadcastHashJoin,且表总大小/partition个数 小于 brocast阈值,且spark.sql.join.preferSortMergeJoin=false (默认等于true))
CodegenSupport :false
explain 输出解释:ShuffledHashJoin join的条件 join类型 -
HashAggregate
含义:hash 方式进行聚合
源代码类:HashAggregateExec
产生原因:group by , 聚合函数(count,sum), distinct
CodegenSupport :true
explain 输出解释: HashAggregate(聚合的key, 聚合函数, 输出) -
SortAggregate
含义:sort 方式进行聚合
源代码类:SortAggregateExec
产生原因:group by , 聚合函数(count,sum), distinct
CodegenSupport :false
explain 输出解释: SortAggregate(聚合的key, 聚合函数, 输出) -
ObjectHashAggregate
含义:hash 方式进行聚合, 与HashAggregate有所区别
源代码类:ObjectHashAggregateExec
产生原因:group by , 聚合函数(count,sum), distinct
CodegenSupport :false
explain 输出解释: ObjectHashAggregate(聚合的key, 聚合函数, 输出) -
Window
含义:窗口操作
源代码类:WindowExec
产生原因:使用了over 从句
CodegenSupport :false
explain 输出解释: Window [窗口函数(eg row_number ) 窗口定义(partition order ), 窗框范围]
接下来的文章将会对每个节点实现展开说明。