Shuffle entry point
The figure above shows the DAG of a production job. It contains three stages, whose task sets are scheduled in FIFO order: stages 56241 and 56242 run ShuffleMapTasks, and stage 56243 runs ResultTasks.
ShuffleMapTask's runTask method:
```scala
dep.shuffleWriterProcessor.write(rdd, dep, mapId, context, partition)
```
It then calls ShuffleWriteProcessor.write(), which first determines the shuffle type and then performs the shuffle write.
ShuffleWriteProcessor.write()
```scala
def write(
    rdd: RDD[_],
    dep: ShuffleDependency[_, _, _],
    mapId: Long,
    context: TaskContext,
    partition: Partition): MapStatus = {
  var writer: ShuffleWriter[Any, Any] = null
  val manager = SparkEnv.get.shuffleManager
  // Pick the concrete writer that matches the handle registered for this shuffle.
  writer = manager.getWriter[Any, Any](
    dep.shuffleHandle,
    mapId,
    context,
    createMetricsReporter(context))
  // Iterate the RDD for this partition and write the records as shuffle output.
  writer.write(
    rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
  // Close the writer and return the MapStatus for this map task.
  writer.stop(success = true).get
}
```
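The concrete writer returned by manager.getWriter depends on the ShuffleHandle that registerShuffle produced for this dependency (see the next section). The mapping is sketched below with stand-in types rather than the real Spark classes:

```scala
// Illustrative stand-in sketch (not the Spark source): the handle type chosen by
// SortShuffleManager.registerShuffle determines which concrete ShuffleWriter the map task uses.
sealed trait Handle
case object SerializedShuffleHandle extends Handle       // tungsten-sort shuffle
case object BypassMergeSortShuffleHandle extends Handle  // bypass merge-sort shuffle
case object BaseShuffleHandle extends Handle             // generic sort-based shuffle

def writerNameFor(handle: Handle): String = handle match {
  case SerializedShuffleHandle      => "UnsafeShuffleWriter"
  case BypassMergeSortShuffleHandle => "BypassMergeSortShuffleWriter"
  case BaseShuffleHandle            => "SortShuffleWriter"
}
```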
Shuffle types
Different shuffle types are used in the shuffle write stage:
- BypassMergeSortShuffle: chosen when there is no mapSideCombine and the number of partitions does not exceed SHUFFLE_SORT_BYPASS_MERGE_THRESHOLD (spark.shuffle.sort.bypassMergeThreshold, default 200); for example, groupByKey with fewer than 200 partitions. This writer creates one temporary file per reduce task and finally concatenates the temporary files into a single data file with a separate index file. It opens more disk files at a time, but it performs no sorting, which saves that cost.
- tungsten-sort shuffle: chosen when the serializer supports relocation of serialized objects (e.g. KryoSerializer), there is no mapSideCombine, and the number of partitions does not exceed MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE (maximum partition ID + 1, i.e. 2^24 = 16777216). Details on Project Tungsten are left for a later write-up.
- SortShuffle: all other cases.

The two helper checks that implement these conditions are sketched after the registerShuffle snippet below.
SortShuffleManager.registerShuffle()
```scala
if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
  // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
  // need map-side aggregation, then write numPartitions files directly and just concatenate
  // them at the end. This avoids doing serialization and deserialization twice to merge
  // together the spilled files, which would happen with the normal code path. The downside is
  // having multiple files open at a time and thus more memory allocated to buffers.
  new BypassMergeSortShuffleHandle[K, V](
    shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
} else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
  // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
  new SerializedShuffleHandle[K, V](
    shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
} else {
  // Otherwise, buffer map outputs in a deserialized form:
  new BaseShuffleHandle(shuffleId, dependency)
}
```
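The two checks referenced above can be summarized as follows. This is a minimal, self-contained sketch of the conditions described earlier, not the exact Spark source; the constants stand in for spark.shuffle.sort.bypassMergeThreshold and MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE:

```scala
object ShuffleTypeSketch {
  // Defaults assumed from the description above.
  val bypassMergeThreshold = 200           // spark.shuffle.sort.bypassMergeThreshold
  val maxSerializedPartitions = 1 << 24    // 16777216 = maximum partition ID + 1

  // Bypass path: no map-side combine and few enough reduce partitions.
  def shouldBypassMergeSort(mapSideCombine: Boolean, numPartitions: Int): Boolean =
    !mapSideCombine && numPartitions <= bypassMergeThreshold

  // Serialized (tungsten-sort) path: relocatable serializer, no map-side combine,
  // and a partition count that fits the serialized-mode limit.
  def canUseSerializedShuffle(serializerSupportsRelocation: Boolean,
                              mapSideCombine: Boolean,
                              numPartitions: Int): Boolean =
    serializerSupportsRelocation && !mapSideCombine &&
      numPartitions <= maxSerializedPartitions
}
```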
shuffle write
Writing the RDD means calling its iterator method, which iterates recursively back to the first RDD of the stage: the KafkaRDD's iterator in stage 56241 and the ShuffledRDD's iterator in stage 56242. iterator ultimately calls compute, and for a ShuffledRDD that is where the shuffle read happens.
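Within a stage, the RDDs are pipelined: a narrow-dependency RDD's compute simply wraps its parent's iterator, which is why iterating the stage's last RDD walks back to its first RDD. Roughly as in MapPartitionsRDD (simplified from the Spark source):

```scala
// Simplified from MapPartitionsRDD.compute: the map function f is applied to the parent's
// iterator, so calling iterator() on the last RDD of a stage recursively reaches the first one.
override def compute(split: Partition, context: TaskContext): Iterator[U] =
  f(context, split.index, firstParent[T].iterator(split, context))
```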
ShuffleWriteProcessor.write()
```scala
writer.write(
  rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
```
ShuffledRDD.compute()
```scala
override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
  val metrics = context.taskMetrics().createTempShuffleReadMetrics()
  // Reads the partition range [split.index, split.index + 1): each reduce task
  // fetches exactly one reduce partition of the map outputs.
  SparkEnv.get.shuffleManager.getReader(
    dep.shuffleHandle, split.index, split.index + 1, context, metrics)
    .read()
    .asInstanceOf[Iterator[(K, C)]]
}
```
Returning the MapStatus
For details on how the MapStatus is propagated, see the MapOutputTracker analysis.
ShuffleWriteProcessor.write()
```scala
writer.stop(success = true).get
```
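writer.stop(success = true) returns the task's MapStatus. Conceptually it carries the information below; this is a hypothetical, simplified stand-in, not the real Spark trait:

```scala
// Hypothetical stand-in for what a MapStatus conveys (the real MapStatus is a Spark-internal
// trait with compressed implementations): where the map output lives and how many bytes each
// reduce partition will fetch from it.
case class SimpleMapStatus(
    location: String,                    // in Spark, a BlockManagerId
    bytesByReducePartition: Array[Long], // one (approximate) size per reduce partition
    mapTaskId: Long)
```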
ShuffleDependency
ShuffleDependency has several important attributes: keyOrdering, aggregator, and mapSideCombine. During a shuffle they determine the shuffle type, the sorter, how shuffle files are merged, and so on. For details, see PairRDDFunctions and OrderedRDDFunctions.
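For example, simplified from PairRDDFunctions and OrderedRDDFunctions (withScope and overloads omitted):

```scala
// reduceByKey: supplies an Aggregator and keeps the default mapSideCombine = true.
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] =
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)

// groupByKey: also has an Aggregator, but explicitly disables map-side combine,
// since pre-grouping values on the map side would not shrink the data.
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}

// sortByKey: no aggregator; it only sets a key ordering on the ShuffledRDD.
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = {
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
```

combineByKeyWithClassTag in turn builds the ShuffledRDD and calls setAggregator and setMapSideCombine on it, which is how these flags end up on the ShuffleDependency.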
Operator attributes
| Operator | mapSideCombine | aggregator | keyOrdering |
|---|---|---|---|
| reduceByKey | true | √ | × |
| groupByKey | false | √ | × |
| sortByKey | false | × | √ |
Computation requirements of the operators within the shuffle mechanism