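Spark-Radiant: Apache Spark Performance and Cost Optimizer

In this article, learn how to use Spark-Radiant to boost performance, reduce cost, and increase the observability of Spark applications.

Spark-Radiant is a performance and cost optimizer for Apache Spark. It helps optimize performance and cost through additional Catalyst optimizer rules, enhanced auto-scaling in Spark, collection of important metrics for Spark jobs, a Bloom filter index in Spark, and more.

Spark-Radiant is now available and ready to use. The dependency for Spark-Radiant 1.0.4 is available in Maven Central. In this blog, I will discuss the availability of Spark-Radiant 1.0.4 and its features for boosting performance, reducing cost, and increasing the observability of Spark applications.

How to Use Spark-Radiant-1.0.4 With Spark Jobs

For Maven projects, use the following dependencies in pom.xml.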

spark-radiant-sql:

<dependency>
    <groupId>io.github.saurabhchawla100</groupId>
    <artifactId>spark-radiant-sql</artifactId>
    <version>1.0.4</version>
</dependency>

spark-radiant-core:

<dependency>
    <groupId>io.github.saurabhchawla100</groupId>
    <artifactId>spark-radiant-core</artifactId>
    <version>1.0.4</version>
</dependency>

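Prerequisites

Spark-Radiant supports Spark 3.0.x and newer versions of Spark.
The supported Scala version is 2.12.x.
Spark-Radiant-1.0.4 works with Scala, PySpark, Java, and spark-sql.

Running Spark Jobs With Spark-Radiant

When running a Spark job, use the spark-radiant-sql-1.0.4.jar and spark-radiant-core-1.0.4.jar published on Maven Central.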

./bin/spark-shell --packages "io.github.saurabhchawla100:spark-radiant-sql:1.0.4,io.github.saurabhchawla100:spark-radiant-core:1.0.4"

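How to Use the Performance Features of Spark-Radiant

Below are some of the features and improvements in Spark-Radiant that help boost performance, reduce cost, and increase the observability of Spark applications.

Using Dynamic Filtering in Spark

Spark-Radiant's Dynamic Filter works well for joins of a star-schema type, where one table contains a large number of records compared to the other tables. Dynamic filtering works at runtime: it applies the predicates on the smaller table, collects the resulting join-column values, and uses those values to filter the bigger table. This reduces the number of records from the bigger table that take part in the join, which makes the join less expensive and improves the performance of Spark SQL queries. It works with inner join, right outer join, left semi join, left outer join, and left anti join.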
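The steps described above can be sketched with plain Scala collections, without Spark. This is only a conceptual illustration, not Spark-Radiant's implementation. The object name, case classes, and sample rows are all hypothetical.

```scala
// Conceptual sketch of dynamic filtering on in-memory collections.
// All names and data are hypothetical; this is not Spark-Radiant's code.
object DynamicFilterSketch {
  case class Fact(key: Int, amount: Double) // the big table
  case class Dim(key: Int, label: String)   // the small table

  val facts: Seq[Fact] =
    Seq(Fact(1, 10.0), Fact(2, 20.0), Fact(3, 30.0), Fact(1, 40.0), Fact(4, 50.0))
  val dims: Seq[Dim] =
    Seq(Dim(1, "value015"), Dim(2, "value020"), Dim(3, "value015"))

  // Step 1: apply the predicate on the small table.
  val filteredDims: Seq[Dim] = dims.filter(_.label == "value015")

  // Step 2: collect the join-column values that survived the predicate.
  val joinKeys: Set[Int] = filteredDims.map(_.key).toSet

  // Step 3: filter the big table with those values before the join.
  val prunedFacts: Seq[Fact] = facts.filter(f => joinKeys.contains(f.key))

  // Step 4: the inner join now only sees the pruned rows.
  val joined: Seq[(Fact, Dim)] = for {
    f <- prunedFacts
    d <- filteredDims
    if f.key == d.key
  } yield (f, d)
}
```

In this toy data set, only three of the five big-table rows reach the join, and the join result is the same as it would be without the pruning step.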

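Performance Improvement Factors

Improved network utilization: The dynamic filter reduces the number of records involved in the join operation, which helps reduce the shuffle data generated and minimizes network I/O.
Improved resource utilization: With dynamic filtering in Spark, fewer records are involved in the join. This lowers system resource requirements because fewer tasks are spawned for the join operation, so the job completes with fewer resources.
Improved disk I/O: The dynamic filter is pushed down to the FileSourceScan/Datasource so that only the filtered records are read. This reduces the pressure on disk I/O.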

./bin/spark-shell --packages "io.github.saurabhchawla100:spark-radiant-sql:1.0.4,io.github.saurabhchawla100:spark-radiant-core:1.0.4" --conf spark.sql.extensions=com.spark.radiant.sql.api.SparkRadiantSqlExtension

or

./bin/spark-submit --packages "io.github.saurabhchawla100:spark-radiant-sql:1.0.4,io.github.saurabhchawla100:spark-radiant-core:1.0.4" --conf spark.sql.extensions=com.spark.radiant.sql.api.SparkRadiantSqlExtension --class com.test.spark.examples.SparkTestDF /spark/examples/target/scala-2.12/jars/spark-test_2.12-3.1.1.jar
 
Example

val df = spark.sql("select * from table, table1, table2" +
  " where table._1 = table1._1 and table._1 = table2._1" +
  " and table1._3 <= 'value019' and table2._3 = 'value015'")
df.show()

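Dynamic Filter in Spark SQL

We saw an 8x performance improvement using the dynamic filter on a Spark join compared with the same query run with a regular Spark join.

For more information, refer to the documentation.

Using Size-Based Join Reordering in Spark

Spark-Radiant's size-based join reordering works well for joins. By default, Spark performs joins from left to right, regardless of whether a broadcast hash join (BHJ) comes before a sort-merge join (SMJ) or vice versa. This optimizer rule lets the smaller table be joined first, before the bigger table (BHJ before SMJ).

Size-based join reordering is supported in Scala, PySpark, Spark SQL, Java, and R through the following conf.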

--conf spark.sql.extensions=com.spark.radiant.sql.api.SparkRadiantSqlExtension --conf spark.sql.support.sizebased.join.reorder=true
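To see why joining the smaller table first can help, here is a small plain-Scala sketch that counts the rows flowing out of each join step for two join orders. It models only the idea of ordering by size, not the actual optimizer rule. The table names, key ranges, and helper function are hypothetical.

```scala
// Conceptual sketch: compare intermediate row counts for two join orders.
// Tables are modeled as lists of join keys; all data is hypothetical.
object JoinReorderSketch {
  case class Table(name: String, keys: Seq[Int])

  val base = Table("fact", (1 to 1000).toSeq)
  val big = Table("bigDim", (1 to 800).toSeq)
  val small = Table("smallDim", Seq(1, 2, 3))

  // Inner-join the running key list with each table in the given order and
  // record how many rows come out of every join step.
  def intermediateSizes(start: Table, order: Seq[Table]): Seq[Int] =
    order.scanLeft(start.keys) { (rows, t) =>
      val keySet: Set[Int] = t.keys.toSet
      rows.filter(keySet.contains)
    }.tail.map(_.size)

  val leftToRight: Seq[Table] = Seq(big, small)               // order as written
  val bySize: Seq[Table] = leftToRight.sortBy(_.keys.size)    // smallest first

  val defaultSizes: Seq[Int] = intermediateSizes(base, leftToRight)
  val reorderedSizes: Seq[Int] = intermediateSizes(base, bySize)
}
```

With this data, the left-to-right order carries 800 rows into the second join, while the size-based order carries only 3. Both orders end with the same 3 rows.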

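Performance Improvement Factors

Improved network utilization: Size-based join reordering performs the BHJ before the SMJ and therefore reduces the number of records involved in the join operation. This helps reduce the shuffle data generated and minimizes network I/O.
Improved resource utilization: With size-based join reordering in Spark, fewer records are involved in the join. This lowers system resource requirements because fewer tasks are spawned for the join operation, so the job completes with fewer resources.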

UnionReuseExchangeOptimizeRule

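This rule applies when a union is present with aggregations that have the same grouping column, and the union is between the same table/data source. In this scenario, instead of scanning the table/data source twice, it is scanned once and the other child of the union reuses that scan. This feature is enabled with the following conf.

--conf spark.sql.optimize.union.reuse.exchange.rule=true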
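The effect of reusing one scan for both children of the union can be sketched in plain Scala. This is a conceptual model only: scanTestDf1 is a hypothetical stand-in for reading the source, and the scan counter exists just to make the difference visible.

```scala
// Conceptual sketch: a union of two aggregations over the same source,
// with and without reusing the scan. All names and data are hypothetical.
object UnionReuseSketch {
  var scanCount = 0 // counts how often the source is read

  // Hypothetical stand-in for scanning testDf1; each call is one table scan.
  def scanTestDf1(): Seq[Int] = { scanCount += 1; Seq(1, 1, 2, 3, 3, 3) }

  // "select test11, count(*) ... group by test11"
  def countByKey(rows: Seq[Int]): Set[(Int, Long)] =
    rows.groupBy(identity).map { case (k, v) => (k, v.size.toLong) }.toSet

  // "select test11, sum(test11) ... group by test11"
  def sumByKey(rows: Seq[Int]): Set[(Int, Long)] =
    rows.groupBy(identity).map { case (k, v) => (k, v.map(_.toLong).sum) }.toSet

  // Without reuse: each child of the union scans the source on its own.
  def unionWithoutReuse(): Set[(Int, Long)] =
    countByKey(scanTestDf1()) ++ sumByKey(scanTestDf1())

  // With reuse: a single scan feeds both children of the union.
  def unionWithReuse(): Set[(Int, Long)] = {
    val rows = scanTestDf1()
    countByKey(rows) ++ sumByKey(rows)
  }
}
```

Both functions return the same set of rows (a Set is used because SQL union removes duplicates), but the second one reads the source once instead of twice. In Spark SQL, a query of this shape looks like the following.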

val df = spark.sql("select test11, count(*) as count from testDf1" +
  " group by test11 union select test11, sum(test11) as count" +
  " from testDf1 group by test11")

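Metrics Collector

The Metrics Collector is newly added as part of the spark-radiant-core module. It gives an overall picture of how a Spark application performed across its stages and tasks. This in turn helps in working out the SLA/RCA when a Spark application shows performance degradation or failures.

SparkJobMetricsCollector collects important metrics at the Spark job, stage, and task level (task failure information and task skew information). It is enabled with the conf below, with the JARs provided on the classpath as shown in the steps to run.

--conf spark.extraListeners=com.spark.radiant.core.SparkJobMetricsCollector

Steps to Run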

./bin/spark-shell --conf spark.extraListeners=com.spark.radiant.core.SparkJobMetricsCollector --packages "io.github.saurabhchawla100:spark-radiant-sql:1.0.4,io.github.saurabhchawla100:spark-radiant-core:1.0.4"

Response of the Metrics Collector

Spark-Radiant Metrics Collector
     
Total Time taken by Application:: 895 sec
      
*****Driver Metrics*****
Time spend in the Driver: 307 sec
Percentage of time spend in the Driver: 34. Try adding more parallelism to the Spark job for Optimal Performance
      
*****Stage Info Metrics*****
***** Stage Info Metrics Stage Id:0 *****
{
 "Stage Id": 0,
 "Final Stage Status": succeeded,
 "Number of Task": 10,
 "Total Executors ran to complete all Task": 2,
 "Stage Completion Time": 858 ms,
 "Average Task Completion Time": 139 ms
 "Number of Task Failed in this Stage": 0
 "Few Skew task info in Stage": Skew task in not present in this stage
 "Few Failed task info in Stage": Failed task in not present in this stage
}
***** Stage Info Metrics Stage Id:1 *****
{
 "Stage Id": 1,
 "Final Stage Status": succeeded,
 "Number of Task": 10,
 "Total Executors ran to complete all Task": 2,
 "Stage Completion Time": 53 ms,
 "Average Task Completion Time": 9 ms
 "Number of Task Failed in this Stage": 0
 "Few Skew task info in Stage": Skew task in not present in this stage
 "Few Failed task info in Stage": Failed task in not present in this stage
}

Skewed Task Info in the Metrics Collector

In this case, skewed tasks are present in the stage. The Metrics Collector shows the skewed task information.

***** Stage Info Metrics Stage Id:2 *****
{
 "Stage Id": 2,
 "Final Stage Status": succeeded,
 "Number of Task": 100,
 "Total Executors ran to complete all Task": 4,
 "Stage Completion Time": 11206 ms,
 "Average Task Completion Time": 221 ms
 "Number of Task Failed in this Stage": 0
 "Few Skew task info in Stage": List({
      "Task Id": 0,
      "Executor Id": 3,
      "Number of records read": 11887,
      "Number of shuffle read Record": 11887,
      "Number of records write": 0,
      "Number of shuffle write Record": 0,
      "Task Completion Time": 10656 ms
      "Final Status of task": SUCCESS
      "Failure Reason for task": NA
      }, {
      "Task Id": 4,
      "Executor Id": 1,
      "Number of records read": 11847,
      "Number of shuffle read Record": 11847,
      "Number of records write": 0,
      "Number of shuffle write Record": 0,
      "Task Completion Time": 10013 ms
      "Final Status of task": SUCCESS
      "Failure Reason for task": NA
      })
 "Few Failed task info in Stage": Failed task in not present in this stage
}
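The article does not state how the collector decides that a task is skewed. As a rough illustration only, the sketch below flags tasks whose completion time exceeds a multiple of the stage's average task time. The rule, the factor, and the three short task durations are assumptions. Only the two long durations are taken from the output above.

```scala
// Hypothetical skew heuristic, not Spark-Radiant's actual rule.
object SkewSketch {
  case class TaskInfo(taskId: Int, completionMs: Long)

  // Assumed rule: a task is "skewed" if it ran longer than
  // `factor` times the average task time of its stage.
  def skewedTasks(tasks: Seq[TaskInfo], factor: Double): Seq[TaskInfo] =
    if (tasks.isEmpty) Seq.empty
    else {
      val avg = tasks.map(_.completionMs).sum.toDouble / tasks.size
      tasks.filter(_.completionMs > factor * avg)
    }

  // Tasks 0 and 4 use the durations from the output above;
  // the other three durations are made up for the example.
  val sampleTasks: Seq[TaskInfo] = Seq(
    TaskInfo(0, 10656), TaskInfo(1, 120), TaskInfo(2, 110),
    TaskInfo(3, 130), TaskInfo(4, 10013))
}
```

With a factor of 2.0, SkewSketch.skewedTasks(SkewSketch.sampleTasks, 2.0) returns the entries for task ids 0 and 4.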

Failed Task Info in the Metrics Collector

If failed tasks are present in the stage, the Metrics Collector shows the failed task information.

***** Stage Info Metrics Stage Id:3 *****
{
 "Stage Id": 3,
 "Final Stage Status": failed,
 "Number of Task": 10,
 "Total Executors ran to complete all Task": 2,
 "Stage Completion Time": 53 ms,
 "Average Task Completion Time": 9 ms
 "Number of Task Failed in this Stage": 1
 "Few Skew task info in Stage": Skew task in not present in this    stage
 "Few Failed task info in Stage": List({
      "Task Id": 12,
      "Executor Id": 1,
      "Number of records read in task": 0,
      "Number of shuffle read Record in task": 0,
      "Number of records write in task": 0,
      "Number of shuffle write Record in task": 0,
      "Final Status of task": FAILED,
      "Task Completion Time": 7 ms,
      "Failure Reason for task": java.lang.Exception: Retry Task
            at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$res0$1(<console>:33)
            at scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:23)
            at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
            at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
      })
}

Support for Struct Type Columns in DropDuplicate

As of now, using a struct column in DropDuplicate throws the following exception.

case class StructDropDup(c1: Int, c2: Int)
val df = Seq(("d1", StructDropDup(1, 2)),
         ("d1", StructDropDup(1, 2))).toDF("a", "b")
df.dropDuplicates("a", "b.c1")
         
org.apache.spark.sql.AnalysisException: Cannot resolve column name "b.c1" among (a, b)
at org.apache.spark.sql.Dataset.$anonfun$dropDuplicates$1(Dataset.scala:2576)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)

Spark-Radiant adds support for using struct columns in DropDuplicate.

import com.spark.radiant.sql.api.SparkRadiantSqlApi
case class StructDropDup(c1: Int, c2: Int)
val df = Seq(("d1", StructDropDup(1, 2)),
            ("d1", StructDropDup(1, 2))).toDF("a", "b")
val sparkRadiantSqlApi = new SparkRadiantSqlApi()
val updatedDF = sparkRadiantSqlApi.dropDuplicateOfSpark(df, spark, Seq("a", "b.c1"))
               
updatedDF.show
 +---+------+
 |  a|     b|
 +---+------+
 | d1|{1, 2}|
 +---+------+

The same support is added in this PR in Apache Spark: [SPARK-37596][Spark-SQL] Add the support for struct type column in the DropDuplicate.

This also works well for map type columns in dropDuplicate:

val df = spark.createDataFrame(Seq(("d1", Map(1 -> 2)), ("d1", Map(1 -> 2))))

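Conclusion

In this blog, I discussed how to use Spark-Radiant. The new features added to Spark-Radiant, such as the Metrics Collector, drop duplicates for struct type, Dynamic Filter, SizeBasedJoinReOrdering, and UnionReuseExchange, provide benefits related to performance and cost optimization.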
