Learning Big Data Frameworks: Spark (4)

  • Run the SQL (q38 here) and include a screenshot of its SQL execution graph (see the sketch after the query below)
  • List the optimizer rules this SQL exercises
  • Describe two of those rules, each in at least 200 words
  • The SQL (any one of the queries may be chosen; q38 is used here):
SELECT count(*)
FROM (
    SELECT DISTINCT
        c_last_name,
        c_first_name,
        d_date
    FROM store_sales, date_dim, customer
    WHERE store_sales.ss_sold_date_sk = date_dim.d_date_sk
      AND store_sales.ss_customer_sk = customer.c_customer_sk
      AND d_month_seq BETWEEN 1200 AND 1200 + 11
    INTERSECT
    SELECT DISTINCT
        c_last_name,
        c_first_name,
        d_date
    FROM catalog_sales, date_dim, customer
    WHERE catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
      AND catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
      AND d_month_seq BETWEEN 1200 AND 1200 + 11
    INTERSECT
    SELECT DISTINCT
        c_last_name,
        c_first_name,
        d_date
    FROM web_sales, date_dim, customer
    WHERE web_sales.ws_sold_date_sk = date_dim.d_date_sk
      AND web_sales.ws_bill_customer_sk = customer.c_customer_sk
      AND d_month_seq BETWEEN 1200 AND 1200 + 11
) hot_cust
LIMIT 100
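One way to reproduce the execution graph and the rule list is sketched below. It assumes a Spark 3.x spark-shell session with the TPC-DS tables already registered as temp views, and assumes the query text above has been loaded into a string named `q38Text` (a hypothetical variable); `spark.sql.planChangeLog.level` is a real Spark 3.1+ config that logs every optimizer rule that actually rewrites the plan.

```scala
// Log each optimizer rule that changes the logical plan (one entry per effective rule).
spark.conf.set("spark.sql.planChangeLog.level", "WARN")

val q38 = spark.sql(q38Text)     // q38Text: the SQL string shown above (assumed loaded, e.g. from q38.sql)
q38.explain(mode = "extended")   // parsed, analyzed and optimized logical plans plus the physical plan
q38.show()                       // triggers execution; the SQL graph then appears in the Spark UI "SQL" tab
```

The WARN-level plan-change log prints one entry per rule that modified the plan, which is a convenient way to collect a rule list like the one below.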

Execution result:

[Screenshot of the q38 SQL execution graph]

Optimizer rules applied:

org.apache.spark.sql.catalyst.optimizer.CollapseProject
org.apache.spark.sql.catalyst.optimizer.ColumnPruning
org.apache.spark.sql.catalyst.optimizer.ConstantFolding
org.apache.spark.sql.catalyst.optimizer.EliminateLimits
org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints
org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin
org.apache.spark.sql.catalyst.optimizer.PushDownPredicates
org.apache.spark.sql.catalyst.optimizer.RemoveNoopOperators
org.apache.spark.sql.catalyst.optimizer.ReorderJoin
org.apache.spark.sql.catalyst.optimizer.ReplaceDistinctWithAggregate
org.apache.spark.sql.catalyst.optimizer.ReplaceIntersectWithSemiJoin
org.apache.spark.sql.catalyst.optimizer.RewritePredicateSubquery

CollapseProject: collapses two Project operators into one when their expressions are deterministic and independent of each other.

/**
 * Combines two [[Project]] operators into one and perform alias substitution,
 * merging the expressions into one single expression for the following cases.
 * 1. When two [[Project]] operators are adjacent.
 * 2. When two [[Project]] operators have LocalLimit/Sample/Repartition operator between them
 *    and the upper project consists of the same number of columns which is equal or aliasing.
 *    `GlobalLimit(LocalLimit)` pattern is also considered.
 */
object CollapseProject extends Rule[LogicalPlan] with AliasHelper {

  def apply(plan: LogicalPlan): LogicalPlan = plan.transformUpWithPruning(
    _.containsPattern(PROJECT), ruleId) {
    // ... (the pattern-match cases that merge adjacent Projects are elided)
  }
}

The two select operations are merged into a single Project and combined with the data source's logical plan; the column expressions of the two chained selects, where the second depends on the first, are merged together as well.
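A small illustration (a sketch, assuming a spark-shell session where spark.implicits._ is in scope; the column names `a` and `b` are chosen purely for demonstration): two chained selects end up as one Project in the optimized logical plan.

```scala
// Two stacked Projects over a Range source.
val df = spark.range(10)
  .select(($"id" + 1).as("a"))   // lower Project
  .select(($"a" * 2).as("b"))    // upper Project, depends on the lower one

df.explain(true)
// The optimized logical plan shows a single Project [((id + 1) * 2) AS b]
// directly above the Range source: the alias `a` has been substituted away.
```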

PushDownPredicates: combines two adjacent Filters and is a form of predicate pushdown. The goal is to push filter conditions as far down the plan as possible; the earlier rows are filtered out before a join, the more efficiently the whole plan executes.
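As a quick sketch (assuming a spark-shell session with the TPC-DS tables above registered, so that spark.table can load them), a filter written after a join ends up next to the scan in the optimized plan:

```scala
val store_sales = spark.table("store_sales")
val date_dim    = spark.table("date_dim")

val joined = store_sales
  .join(date_dim, $"ss_sold_date_sk" === $"d_date_sk")
  .filter($"d_month_seq".between(1200, 1211))   // written above the join...

joined.explain(true)
// ...but in the optimized logical plan the Filter on d_month_seq sits directly
// above the date_dim relation, below the Join, so fewer rows reach the join.
```

The Spark source quoted next is the related rule PushProjectionThroughUnion; as its doc comment notes, Filter pushdown itself is handled by PushDownPredicates, while this rule pushes Project operators through a Union.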

/**
 * Pushes Project operator to both sides of a Union operator.
 * Operations that are safe to pushdown are listed as follows.
 * Union:
 * Right now, Union means UNION ALL, which does not de-duplicate rows. So, it is
 * safe to pushdown Filters and Projections through it. Filter pushdown is handled by another
 * rule PushDownPredicates. Once we add UNION DISTINCT, we will not be able to pushdown Projections.
 */
object PushProjectionThroughUnion extends Rule[LogicalPlan] with PredicateHelper {

  /**
   * Maps Attributes from the left side to the corresponding Attribute on the right side.
   */
  private def buildRewrites(left: LogicalPlan, right: LogicalPlan): AttributeMap[Attribute] = {
    assert(left.output.size == right.output.size)
    AttributeMap(left.output.zip(right.output))
  }

  /**
   * Rewrites an expression so that it can be pushed to the right side of a
   * Union or Except operator. This method relies on the fact that the output attributes
   * of a union/intersect/except are always equal to the left child's output.
   */
  private def pushToRight[A <: Expression](e: A, rewrites: AttributeMap[Attribute]) = {
    val result = e transform {
      case a: Attribute => rewrites(a)
    } match {
      // Make sure exprId is unique in each child of Union.
      case Alias(child, alias) => Alias(child, alias)()
      case other => other
    }

    result.asInstanceOf[A]
  }

  def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning(
    _.containsAllPatterns(UNION, PROJECT)) {

    // Push down deterministic projections through the UNION ALL, replacing the Project above it.
    case p @ Project(projectList, u: Union) =>
      assert(u.children.nonEmpty)
      if (projectList.forall(_.deterministic)) {
        val newFirstChild = Project(projectList, u.children.head)
        val newOtherChildren = u.children.tail.map { child =>
          val rewrites = buildRewrites(u.children.head, child)
          Project(projectList.map(pushToRight(_, rewrites)), child)
        }
        u.copy(children = newFirstChild +: newOtherChildren)
      } else {
        p
      }
  }
}

Judging from the code, PushProjectionThroughUnion checks whether the projection is deterministic and, if so, pushes it into every child of the Union; in the same spirit, PushDownPredicates checks whether a condition can be pushed down and moves it as far below the plan (ahead of joins and unions) as possible, which speeds up the whole computation.
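To see both pushdowns at once, here is a minimal sketch (assuming a spark-shell session; the names `a`, `b`, and `bucket` are illustrative only): a select and a filter written on top of a union are pushed into each branch of the optimized plan.

```scala
val a = spark.range(0, 100).select($"id", ($"id" % 10).as("bucket"))
val b = spark.range(100, 200).select($"id", ($"id" % 10).as("bucket"))

val q = a.union(b)
  .select($"bucket")       // PushProjectionThroughUnion moves this Project below the Union
  .filter($"bucket" > 5)   // the predicate is likewise pushed into both branches

q.explain(true)
// In the optimized logical plan the Project and the Filter (rewritten in terms of id % 10)
// appear inside each child of the Union rather than above it, so the Union only
// combines rows that have already been reduced.
```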