Spark Compute Engine Source Code Analysis: Prerequisites


Shuffle entry point

(Figure: DAG of a production job.) As shown in the figure, the job has 3 stages, and their TaskSets are executed in FIFO scheduling order.
Stages 56241 and 56242 consist of ShuffleMapTasks, while stage 56243 consists of ResultTasks.
ShuffleMapTask.runTask() comes down to:

dep.shuffleWriterProcessor.write(rdd, dep, mapId, context, partition)
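For context, an abbreviated sketch of the surrounding method (Spark 3.x; metrics, error handling, and the mapId computation are omitted): the broadcast task binary is deserialized into the stage's final RDD and its ShuffleDependency, and the actual write is delegated to the dependency's ShuffleWriteProcessor.

// ShuffleMapTask.runTask, abbreviated sketch
override def runTask(context: TaskContext): MapStatus = {
  val ser = SparkEnv.get.closureSerializer.newInstance()
  // taskBinary holds the serialized (RDD, ShuffleDependency) pair for this stage
  val rddAndDep = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  val rdd = rddAndDep._1
  val dep = rddAndDep._2
  dep.shuffleWriterProcessor.write(rdd, dep, mapId, context, partition)
}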

This calls ShuffleWriteProcessor.write(), which obtains a writer matching the shuffle type (determined by the ShuffleHandle) and then performs the shuffle write.

ShuffleWriteProcessor.write()

def write(
    rdd: RDD[_],
    dep: ShuffleDependency[_, _, _],
    mapId: Long,
    context: TaskContext,
    partition: Partition): MapStatus = {
  var writer: ShuffleWriter[Any, Any] = null
  // Ask the ShuffleManager for a writer; the concrete writer is picked according
  // to the ShuffleHandle chosen in SortShuffleManager.registerShuffle().
  val manager = SparkEnv.get.shuffleManager
  writer = manager.getWriter[Any, Any](
      dep.shuffleHandle,
      mapId,
      context,
      createMetricsReporter(context))
  // Compute this partition through the RDD's iterator chain and write the records.
  writer.write(
      rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
  // Close the writer and return the MapStatus describing the map output.
  writer.stop(success = true).get
}

Shuffle types

There are several shuffle types in the shuffle write phase:

  • BypassMergeSortShuffle: used when there is no mapSideCombine and the number of partitions is below SHUFFLE_SORT_BYPASS_MERGE_THRESHOLD (default 200), e.g. groupByKey with fewer than 200 partitions. This shuffle creates one temporary file per reduce task, then concatenates them into a single data file with a separate index file. It opens more disk files at once, but skips sorting and so avoids that cost.
  • tungsten-sort shuffle: used when the serializer supports relocation of serialized objects (e.g. KryoSerializer), there is no mapSideCombine, and the number of partitions does not exceed MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE (maximum partition ID + 1, i.e. 2^24 = 16777216). Project Tungsten itself will be covered separately.
  • SortShuffle: all other cases fall back to SortShuffle (the selection predicates are sketched after the registerShuffle snippet below).

SortShuffleManager.registerShuffle()

if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
  // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
  // need map-side aggregation, then write numPartitions files directly and just concatenate
  // them at the end. This avoids doing serialization and deserialization twice to merge
  // together the spilled files, which would happen with the normal code path. The downside is
  // having multiple files open at a time and thus more memory allocated to buffers.
  new BypassMergeSortShuffleHandle[K, V](
    shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
} else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
  // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
  new SerializedShuffleHandle[K, V](
    shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
} else {
  // Otherwise, buffer map outputs in a deserialized form:
  new BaseShuffleHandle(shuffleId, dependency)
}
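The two predicates used above decide which handle, and therefore which writer, is chosen. Roughly, abbreviated from Spark 3.x with logging omitted, they look like this:

// SortShuffleWriter.shouldBypassMergeSort, abbreviated: bypass is only possible
// when no map-side aggregation is needed and the partition count is small enough.
def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
  if (dep.mapSideCombine) {
    false
  } else {
    val bypassMergeThreshold: Int = conf.get(config.SHUFFLE_SORT_BYPASS_MERGE_THRESHOLD)
    dep.partitioner.numPartitions <= bypassMergeThreshold
  }
}

// SortShuffleManager.canUseSerializedShuffle, abbreviated: the serialized
// (tungsten-sort) path needs a relocatable serializer, no map-side combine,
// and a partition count within the 24-bit limit of the packed record pointer.
def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
  val numPartitions = dependency.partitioner.numPartitions
  if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
    false
  } else if (dependency.mapSideCombine) {
    false
  } else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
    false
  } else {
    true
  }
}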

Shuffle write

To produce the records to write, the writer calls the RDD's iterator method, which walks back through the lineage to the first RDD of the stage: the KafkaRDD's iterator in stage 56241, and the ShuffledRDD's iterator in stage 56242. iterator eventually calls compute, and it is in ShuffledRDD.compute that the shuffle read happens.
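For reference, RDD.iterator (abbreviated from RDD.scala) serves the partition from cache if the RDD is persisted, otherwise it falls through to computeOrReadCheckpoint, which ultimately calls this RDD's compute():

// RDD.iterator, abbreviated
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    // persisted RDD: fetch from or compute via the block manager
    getOrCompute(split, context)
  } else {
    // otherwise compute the partition (possibly from a checkpoint)
    computeOrReadCheckpoint(split, context)
  }
}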

ShuffleWriteProcessor.write()

writer.write(
  rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])

ShuffledRDD.compute()

override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
  val metrics = context.taskMetrics().createTempShuffleReadMetrics()
  SparkEnv.get.shuffleManager.getReader(
    dep.shuffleHandle, split.index, split.index + 1, context, metrics)
    .read()
    .asInstanceOf[Iterator[(K, C)]]
}

Returning MapStatus

For how the MapStatus is propagated, see the MapOutputTracker analysis.

ShuffleWriteProcessor.write()

writer.stop(success = true).get

ShuffleDependency

ShuffleDependency has several important properties: keyOrdering, aggregator, and mapSideCombine. During a shuffle they determine the shuffle type, the sorter, how shuffle files are merged, and so on. For details see PairRDDFunctions and OrderedRDDFunctions.
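As a rough sketch, abbreviated from PairRDDFunctions.combineByKeyWithClassTag and OrderedRDDFunctions.sortByKey, these properties are set when the ShuffledRDD (and hence its ShuffleDependency) is built:

// PairRDDFunctions.combineByKeyWithClassTag, abbreviated: reduceByKey and
// groupByKey both end up here, passing an Aggregator and their own
// mapSideCombine flag.
new ShuffledRDD[K, V, C](self, partitioner)
  .setSerializer(serializer)
  .setAggregator(aggregator)
  .setMapSideCombine(mapSideCombine)

// OrderedRDDFunctions.sortByKey, abbreviated: no aggregator, but a keyOrdering.
new ShuffledRDD[K, V, V](self, part)
  .setKeyOrdering(if (ascending) ordering else ordering.reverse)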

Operator properties

| Operator    | mapSideCombine | aggregator | keyOrdering |
| ----------- | -------------- | ---------- | ----------- |
| reduceByKey | true           | ✓          | ×           |
| groupByKey  | false          | ✓          | ×           |
| sortByKey   | false          | ×          | ✓           |

(Figure: computation requirements of each operator in the shuffle mechanism.)
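A minimal, hypothetical driver program (class and app names are illustrative, not from the original article) that exercises the table above: with 10 reduce partitions, groupByKey satisfies the bypass condition, while reduceByKey, because of its map-side combine, goes through the regular sort shuffle.

import org.apache.spark.sql.SparkSession

// Hypothetical demo. With fewer than 200 reduce partitions, groupByKey
// (mapSideCombine = false) takes the BypassMergeSortShuffle path, while
// reduceByKey (mapSideCombine = true) uses the regular SortShuffle.
object ShuffleTypeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("shuffle-type-demo")
      .getOrCreate()
    val pairs = spark.sparkContext.parallelize(1 to 1000).map(i => (i % 10, 1))

    // mapSideCombine = true, aggregator defined -> BaseShuffleHandle / SortShuffleWriter
    pairs.reduceByKey(_ + _, 10).count()

    // mapSideCombine = false, 10 (< 200) partitions -> BypassMergeSortShuffleHandle
    pairs.groupByKey(10).count()

    spark.stop()
  }
}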