The Spark Compute Engine


Spark compute entry point

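[Figure: DAG of the job (image.png)] The figure shows the DAG of a production job. It has 3 stages, whose TaskSets are scheduled and executed in FIFO order.
Stage 56241 and stage 56242 consist of ShuffleMapTasks; stage 56243 consists of ResultTasks.
Stage 56241 executes the runTask method of ShuffleMapTask: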

dep.shuffleWriterProcessor.write(rdd, dep, mapId, context, partition)

This calls the ShuffleWriteProcessor##write method:

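  1. Shuffle type: getWriter returns the writer that matches the ShuffleHandle, and the handle was chosen when the shuffle was registered (SortShuffleManager.registerShuffle):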
writer = manager.getWriter[Any, Any](
 dep.shuffleHandle,
 mapId,
 context,
 createMetricsReporter(context))
if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
  // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
  // need map-side aggregation, then write numPartitions files directly and just concatenate
  // them at the end. This avoids doing serialization and deserialization twice to merge
  // together the spilled files, which would happen with the normal code path. The downside is
  // having multiple files open at a time and thus more memory allocated to buffers.
  new BypassMergeSortShuffleHandle[K, V](
    shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
} else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
  // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
  new SerializedShuffleHandle[K, V](
    shuffleId, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
} else {
  // Otherwise, buffer map outputs in a deserialized form:
  new BaseShuffleHandle(shuffleId, dependency)
}
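  • BypassMergeSortShuffle: used when there is no mapSideCombine and the number of partitions does not exceed SHUFFLE_SORT_BYPASS_MERGE_THRESHOLD (default 200), for example a groupByKey with at most 200 partitions. This shuffle writes one temporary file per reduce partition, then concatenates the temporary files into a single data file and creates a separate index file. It opens more disk files, but it skips sorting and saves that cost.
  • tungsten-sort shuffle: used when the serializer supports relocation of serialized objects (e.g. KryoSerializer), there is no mapSideCombine, and the number of partitions is not greater than the constant MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE (maximum partition ID + 1, i.e. 2^24 = 16777216). Details on Project Tungsten are still to be written.
  • SortShuffle: every other case uses SortShuffle.
  2. Shuffle write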
writer.write(
  rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])

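The RDD's iterator method computes the partition by walking up the lineage until it reaches the first RDD of the stage. For example, the first stage ends up calling the compute method of KafkaRDD, while a stage that follows a shuffle ends up calling the compute method of ShuffleRDD, which is the shuffle read.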

  3. Return the MapStatus; see the MapOutputTracker analysis for details.
writer.stop(success = true).get
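
The choice between the three shuffle types described in step 1 can be summarised as a small decision function. This is only a sketch: `Dep`, `HandleKind` and the two constants are hypothetical stand-ins for `ShuffleDependency`, the handle classes and the Spark configuration, and the serializer check is reduced to a single boolean.

```scala
object ShuffleHandleChoice {
  sealed trait HandleKind
  case object BypassMergeSort extends HandleKind
  case object Serialized extends HandleKind
  case object Base extends HandleKind

  // Hypothetical, simplified view of the ShuffleDependency fields that matter here.
  final case class Dep(
      mapSideCombine: Boolean,
      numPartitions: Int,
      serializerSupportsRelocation: Boolean)

  val BypassMergeThreshold = 200        // assumed default of the bypass threshold
  val MaxSerializedPartitions = 1 << 24 // 16777216

  def choose(dep: Dep): HandleKind =
    if (!dep.mapSideCombine && dep.numPartitions <= BypassMergeThreshold) {
      BypassMergeSort // few partitions, no map-side combine: one file per partition, no sort
    } else if (dep.serializerSupportsRelocation && !dep.mapSideCombine &&
        dep.numPartitions <= MaxSerializedPartitions) {
      Serialized      // tungsten-sort: records are buffered in serialized form
    } else {
      Base            // everything else goes through SortShuffleWriter
    }
}
```

For instance, a dependency with 1000 partitions, no map-side combine and a relocatable serializer falls into the `Serialized` branch, while turning on map-side combine always leads to `Base`.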

Stage 56242 executes the runTask method of ShuffleMapTask:

Its first RDD is a ShuffleRDD, so the compute method runs and the shuffle read begins.

override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
  val metrics = context.taskMetrics().createTempShuffleReadMetrics()
  SparkEnv.get.shuffleManager.getReader(
    dep.shuffleHandle, split.index, split.index + 1, context, metrics)
    .read()
    .asInstanceOf[Iterator[(K, C)]]
}

shuffle write

SortShuffleWriter##write

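  1. Create an ExternalSorter; if mapSideCombine is not needed, pass None for both the aggregator and the ordering.
  2. Insert the data into the ExternalSorter.
  3. Persist the map output: produce a single data file on disk and create its index file.
  4. Create the MapStatus.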
override def write(records: Iterator[Product2[K, V]]): Unit = {
  sorter = if (dep.mapSideCombine) {
    new ExternalSorter[K, V, C](
      context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
  } else {
    // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
    // care whether the keys get sorted in each partition; that will be done on the reduce side
    // if the operation being run is sortByKey.
    new ExternalSorter[K, V, V](
      context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
  }
  sorter.insertAll(records)

  // Don't bother including the time to open the merged output file in the shuffle write time,
  // because it just opens a single file, so is typically too fast to measure accurately
  // (see SPARK-3570).
  val mapOutputWriter = shuffleExecutorComponents.createMapOutputWriter(
    dep.shuffleId, mapId, dep.partitioner.numPartitions)
  sorter.writePartitionedMapOutput(dep.shuffleId, mapId, mapOutputWriter)
  val partitionLengths = mapOutputWriter.commitAllPartitions() // creates the index file
  mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths, mapId)
}

ExternalSorter##insertAll

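  1. Check whether an aggregator is defined (given how ExternalSorter is constructed, this is really a check for map-side aggregation). If so, use PartitionedAppendOnlyMap; otherwise use PartitionedPairBuffer.
  2. PartitionedAppendOnlyMap: aggregates while inserting, and checks after every insert whether to spill to disk.
  3. PartitionedPairBuffer: appends to the buffer without aggregating, and checks after every insert whether to spill to disk.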
def insertAll(records: Iterator[Product2[K, V]]): Unit = {
  // TODO: stop combining if we find that the reduction factor isn't high
  val shouldCombine = aggregator.isDefined

  if (shouldCombine) { // 1. mapSideCombine is true
    // Combine values in-memory first using our AppendOnlyMap
    // Function that merges a new value into an existing combiner
    val mergeValue = aggregator.get.mergeValue
    // Function that creates the initial combiner from the first value
    val createCombiner = aggregator.get.createCombiner
    var kv: Product2[K, V] = null
    // 2. Update function: merge if the key already has a value, otherwise create the initial combiner
    val update = (hadValue: Boolean, oldValue: C) => {
      if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
    }
    while (records.hasNext) {
      // 3. Insert into the map
      addElementsRead()
      kv = records.next()
      // changeValue of AppendOnlyMap, which also samples the map size
      map.changeValue((getPartition(kv._1), kv._1), update)
      // 4. Spill to disk if necessary
      maybeSpillCollection(usingMap = true)
    }
  } else {
    // Stick values into our buffer
    while (records.hasNext) {
      addElementsRead()
      val kv = records.next()
      buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
      maybeSpillCollection(usingMap = false)
    }
  }
}

Aggregate on write
PartitionedAppendOnlyMap calls changeValue of its parent class SizeTrackingAppendOnlyMap, which aggregates and, at the same time, samples the size of the AppendOnlyMap.

override def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
  // 1. Call changeValue of the parent class AppendOnlyMap, which applies the in-memory aggregation.
  val newValue = super.changeValue(key, updateFunc)
  // 2. Call afterUpdate of the mixed-in SizeTracker trait to sample the size of the AppendOnlyMap.
  super.afterUpdate()
  newValue
}
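  1. Aggregation algorithm
    • Storage layout: keys and values sit in a single array as key0, value0, key1, value1, key2, value2. For a computed position pos, the key is stored at index 2 * pos and the value at index 2 * pos + 1.
    • Hash collisions: resolved by open addressing with a quadratic-style probe. The step grows by one on every probe (delta = i), so the cumulative offsets from the home slot are 1, 3, 6, 10, ..., which keeps the number of probes low. This differs from textbook quadratic probing with offsets i^2; it is meant to avoid the textbook problem of the map having to grow too quickly. The technique is worth borrowing in other projects.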
  /**
   * Set the value for key to updateFunc(hadValue, oldValue), where oldValue will be the old value
   * for key, if any, or null otherwise. Returns the newly updated value.
   */
  def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
    assert(!destroyed, destructionMessage)
    val k = key.asInstanceOf[AnyRef]
    if (k.eq(null)) {
      if (!haveNullValue) {
        incrementSize()
      }
      nullValue = updateFunc(haveNullValue, nullValue)
      haveNullValue = true
      return nullValue
    }
    var pos = rehash(k.hashCode) & mask
    var i = 1
    while (true) {
      val curKey = data(2 * pos)
      if (curKey.eq(null)) {
        val newValue = updateFunc(false, null.asInstanceOf[V])
        data(2 * pos) = k
        data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
        incrementSize()
        return newValue
      } else if (k.eq(curKey) || k.equals(curKey)) {
        val newValue = updateFunc(true, data(2 * pos + 1).asInstanceOf[V])
        data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
        return newValue
      } else {
        val delta = i
        pos = (pos + delta) & mask
        i += 1
      }
    }
    null.asInstanceOf[V] // Never reached but needed to keep compiler happy
  }
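
To see why the growing step works, the probe sequence can be replayed outside of Spark. The sketch below is an illustration only (the object and method names are made up here); it applies the same `pos = (pos + i) & mask` rule and relies on the capacity being a power of two, which is what makes the sequence reach every slot.

```scala
import scala.collection.mutable.ArrayBuffer

object TriangularProbing {
  /** Slots visited when probing from `start` in a table with `capacity` slots. */
  def probeSequence(start: Int, capacity: Int): Seq[Int] = {
    require(capacity > 0 && (capacity & (capacity - 1)) == 0, "capacity must be a power of two")
    val mask = capacity - 1
    var pos = start & mask
    val visited = ArrayBuffer(pos)
    var i = 1
    while (i < capacity) {
      pos = (pos + i) & mask // same step rule as in changeValue: the step grows by one each time
      visited += pos
      i += 1
    }
    visited.toList
  }
}
```

With 8 slots and a home slot of 0, the probes visit 0, 1, 3, 6, 2, 7, 5, 4, so every slot is reached exactly once before the sequence repeats.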
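  2. Sampling the size of the AppendOnlyMap
    The AppendOnlyMap cannot grow without bound, so its size has to be capped. Measuring the size after every update would badly hurt Spark's performance, so Spark samples the size and extrapolates the future size of the AppendOnlyMap from the samples.
    • A sample is taken when the sampling point is reached, i.e. nextSampleNum == numUpdates.
    • Sampling steps:
      1. Estimate the memory used by the AppendOnlyMap and enqueue it, together with the current update count (numUpdates), as a Sample in samples = new mutable.Queue[Sample].
      2. If there are now more than 2 samples, dequeue one so that only the last 2 samples are kept.
      3. Compute the growth per update: bytesPerUpdate = max(0, (latest.size - previous.size) / (latest.numUpdates - previous.numUpdates)).
        If there are fewer than 2 samples, bytesPerUpdate = 0.
      4. Compute the next sampling point: nextSampleNum = ceil(numUpdates * SAMPLE_GROWTH_RATE).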
protected def afterUpdate(): Unit = {
  numUpdates += 1
  if (nextSampleNum == numUpdates) {
    takeSample()
  }
}
private def takeSample(): Unit = {
  samples.enqueue(Sample(SizeEstimator.estimate(this), numUpdates))
  // Only use the last two samples to extrapolate
  if (samples.size > 2) {
    samples.dequeue()
  }
  val bytesDelta = samples.toList.reverse match {
    case latest :: previous :: tail =>
      (latest.size - previous.size).toDouble / (latest.numUpdates - previous.numUpdates)
    // If fewer than 2 samples, assume no change
    case _ => 0
  }
  bytesPerUpdate = math.max(0, bytesDelta)
  nextSampleNum = math.ceil(numUpdates * SAMPLE_GROWTH_RATE).toLong
}
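
The sampling scheme can be tried out in isolation. The following is a minimal sketch, not Spark's SizeTracker: `measure` is a hypothetical hook standing in for SizeEstimator.estimate, the growth rate of 1.1 is assumed, and `estimateSize` extrapolates linearly from the last sample, which is how the estimate is described above.

```scala
import scala.collection.mutable

object SizeTrackerSketch {
  final case class Sample(size: Long, numUpdates: Long)

  /** `measure` is a hypothetical stand-in for the expensive exact size measurement. */
  class Tracker(measure: () => Long, growthRate: Double = 1.1) {
    private val samples = mutable.Queue.empty[Sample]
    private var bytesPerUpdate = 0.0
    private var numUpdates = 1L
    private var nextSampleNum = 1L
    var samplesTaken = 0

    takeSample() // take an initial sample so that an estimate is always available

    def afterUpdate(): Unit = {
      numUpdates += 1
      if (numUpdates == nextSampleNum) takeSample()
    }

    private def takeSample(): Unit = {
      samples.enqueue(Sample(measure(), numUpdates))
      samplesTaken += 1
      if (samples.size > 2) samples.dequeue() // keep only the last two samples
      val delta = samples.toList.reverse match {
        case latest :: previous :: _ =>
          (latest.size - previous.size).toDouble / (latest.numUpdates - previous.numUpdates)
        case _ => 0.0 // fewer than two samples: assume no growth
      }
      bytesPerUpdate = math.max(0.0, delta)
      nextSampleNum = math.ceil(numUpdates * growthRate).toLong
    }

    /** Last measured size plus linear extrapolation for the updates since that sample. */
    def estimateSize(): Long = {
      val last = samples.last
      (last.size + bytesPerUpdate * (numUpdates - last.numUpdates)).toLong
    }
  }
}
```

Because the sampling points grow geometrically, a collection that receives 1000 updates is measured only a few dozen times, yet the estimate stays exact as long as the growth per update is constant.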

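SizeEstimator.estimate estimates the size of an object. It first adds the shellSize of the object's class, i.e. the size of its own fields, and then pushes every referenced object onto a queue and measures it the same way, until the queue is empty.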

blog.csdn.net/BIT_666/art…

Spilling to disk

  private def maybeSpillCollection(usingMap: Boolean): Unit = {
    var estimatedSize = 0L
    if (usingMap) { // With an aggregator: estimate the size of the PartitionedAppendOnlyMap
      estimatedSize = map.estimateSize()
      // Spill to disk if needed
      if (maybeSpill(map, estimatedSize)) {
        // Start a fresh map
        map = new PartitionedAppendOnlyMap[K, C]
      }
    } else {
      estimatedSize = buffer.estimateSize()
      if (maybeSpill(buffer, estimatedSize)) {
        buffer = new PartitionedPairBuffer[K, C]
      }
    }
    // Update the peak memory used by this ExternalSorter
    if (estimatedSize > _peakMemoryUsedBytes) {
      _peakMemoryUsedBytes = estimatedSize
    }
  }
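  1. Decide whether to spill
    The size of the map is first estimated from the per-update growth obtained by sampling. The check runs once every 32 elements read: when the estimated size is greater than or equal to the memory granted so far (myMemoryThreshold), the sorter tries to acquire 2 * currentMemory - myMemoryThreshold more bytes. If the threshold after the grant is still not above the estimated size, the collection is spilled. A spill is also forced once the number of elements read exceeds numElementsForceSpillThreshold.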
protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
  var shouldSpill = false
  // If the number of elements read is a multiple of 32 and the collection's current memory is at least myMemoryThreshold
  if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
    // Claim up to double our current memory from the shuffle memory pool
    val amountToRequest = 2 * currentMemory - myMemoryThreshold
    val granted = acquireMemory(amountToRequest)
    myMemoryThreshold += granted
    // If we were granted too little memory to grow further (either tryToAcquire returned 0,
    // or we already had more memory than myMemoryThreshold), spill the current collection
    shouldSpill = currentMemory >= myMemoryThreshold
  }
  shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
  // Actually spill
  if (shouldSpill) {
    _spillCount += 1
    logSpillage(currentMemory)
    spill(collection)
    _elementsRead = 0
    _memoryBytesSpilled += currentMemory
    releaseMemory()
  }
  shouldSpill
}
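
The threshold arithmetic is easier to follow with concrete numbers. Below is a small sketch of the decision only, under the assumption that memory is handed out by a simple pool; `Pool`, `check` and `forceSpillAfter` are names invented for this illustration, and nothing is actually spilled.

```scala
object SpillDecisionSketch {
  /** Hypothetical stand-in for the memory manager: grants at most `available` bytes. */
  final class Pool(var available: Long) {
    def acquire(requested: Long): Long = {
      val granted = math.max(0L, math.min(requested, available))
      available -= granted
      granted
    }
  }

  /** One spill check; returns (shouldSpill, newThreshold). */
  def check(
      elementsRead: Long,
      currentMemory: Long,
      threshold: Long,
      pool: Pool,
      forceSpillAfter: Long = Long.MaxValue): (Boolean, Long) = {
    var newThreshold = threshold
    var shouldSpill = false
    if (elementsRead % 32 == 0 && currentMemory >= threshold) {
      // Ask for enough memory to double the current usage
      newThreshold += pool.acquire(2 * currentMemory - threshold)
      // Spill only if the grant was too small to stay ahead of the current usage
      shouldSpill = currentMemory >= newThreshold
    }
    (shouldSpill || elementsRead > forceSpillAfter, newThreshold)
  }
}
```

With a usage of 6, a threshold of 5 and a well-filled pool, the request is 2 * 6 - 5 = 7, the threshold becomes 12 and nothing is spilled; with an empty pool the threshold stays at 5 and the collection must be spilled.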
  2. Spill
override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
  val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
  val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
  spills += spillFile
}
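  • The data is sorted while it is spilled. The comparator is chosen as follows:
    1. An ordering or an aggregator is defined (i.e. mapSideCombine is on)
      • With an ordering: sort by partition ID first, then by the ordering.
      • Without an ordering: sort by partition ID first, then by the hash code of the key.
    2. Neither is defined (no mapSideCombine)
      • Sort by partition ID only.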
def partitionedDestructiveSortedIterator(keyComparator: Option[Comparator[K]])
  : Iterator[((Int, K), V)] = {
  val comparator = keyComparator.map(partitionKeyComparator).getOrElse(partitionComparator)
  destructiveSortedIterator(comparator)
}
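
The comparator rules above can be sketched with plain java.util.Comparator instances. The object below is an illustration written for this article; it borrows the names partitionComparator and partitionKeyComparator from the snippet above, but the bodies and `hashComparator` are assumptions about their behaviour rather than Spark's code.

```scala
import java.util.Comparator

object PartitionKeyOrder {
  /** Orders (partitionId, key) pairs by partition ID only. */
  def partitionComparator[K]: Comparator[(Int, K)] = new Comparator[(Int, K)] {
    override def compare(a: (Int, K), b: (Int, K)): Int = Integer.compare(a._1, b._1)
  }

  /** Orders by partition ID first and breaks ties with the given key comparator. */
  def partitionKeyComparator[K](keyCmp: Comparator[K]): Comparator[(Int, K)] =
    new Comparator[(Int, K)] {
      override def compare(a: (Int, K), b: (Int, K)): Int = {
        val byPartition = Integer.compare(a._1, b._1)
        if (byPartition != 0) byPartition else keyCmp.compare(a._2, b._2)
      }
    }

  /** Fallback key comparator when no ordering is defined: compares hash codes only. */
  def hashComparator[K]: Comparator[K] = new Comparator[K] {
    override def compare(a: K, b: K): Int = {
      val h1 = if (a == null) 0 else a.hashCode()
      val h2 = if (b == null) 0 else b.hashCode()
      Integer.compare(h1, h2)
    }
  }
}
```

Note that the hash comparator yields only a partial order: two different keys with the same hash code compare as equal, which is why the merge step later has to compare the keys themselves.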
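  • A new iterator is then produced:
    1. Compact the data array toward the front (left).
    2. Sort with Sorter, KVArraySortDataFormat and the given comparator. This uses TimSort, a hybrid of merge sort and insertion sort.
    3. Return a new iterator over the sorted array.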
def destructiveSortedIterator(keyComparator: Comparator[K]): Iterator[(K, V)] = {
  destroyed = true
  // Pack KV pairs into the front of the underlying array
  var keyIndex, newIndex = 0
  while (keyIndex < capacity) {
    if (data(2 * keyIndex) != null) {
      data(2 * newIndex) = data(2 * keyIndex)
      data(2 * newIndex + 1) = data(2 * keyIndex + 1)
      newIndex += 1
    }
    keyIndex += 1
  }
  assert(curSize == newIndex + (if (haveNullValue) 1 else 0))

  new Sorter(new KVArraySortDataFormat[K, AnyRef]).sort(data, 0, newIndex, keyComparator)

  new Iterator[(K, V)] {
    var i = 0
    var nullValueReady = haveNullValue
    def hasNext: Boolean = (i < newIndex || nullValueReady)
    def next(): (K, V) = {
      if (nullValueReady) {
        nullValueReady = false
        (null.asInstanceOf[K], nullValue)
      } else {
        val item = (data(2 * i).asInstanceOf[K], data(2 * i + 1).asInstanceOf[V])
        i += 1
        item
      }
    }
  }
}
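  • Writing the spill: a temporary file is created, and by default the writer flushes to disk once every 10000 records.

Persisting the result

The temporary spill files and the data still in memory are written to the final output file.

  • No spill happened: sort in memory, then, for each partition, create a block and a partitionWriter for it and write the partition's records through that writer.
  • Spills happened: get the per-partition iterators, merge-sort them, and then write in the same way as above.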
private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
    : Iterator[(Int, Iterator[Product2[K, C]])] = {
  val readers = spills.map(new SpillReader(_))
  val inMemBuffered = inMemory.buffered
  (0 until numPartitions).iterator.map { p =>
    val inMemIterator = new IteratorForPartition(p, inMemBuffered)
    val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)
    if (aggregator.isDefined) {
      // Perform partial aggregation across partitions
      (p, mergeWithAggregation(
        iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
    } else if (ordering.isDefined) {
      // No aggregator given, but we have an ordering (e.g. used by reduce tasks in sortByKey);
      // sort the elements without trying to merge them
      (p, mergeSort(iterators, ordering.get))
    } else {
      (p, iterators.iterator.flatten)
    }
  }
}
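  1. Create a SpillReader for each spilled file.
  2. Wrap the inMemory iterator in a buffered iterator.
  3. For every partition, combine the spill iterators with the in-memory iterator into the partition's iterators, then merge them:
    • Create a per-partition iterator over inMemory.
    • If aggregation is needed, aggregate after the mergeSort.
    • If no aggregation is needed but an ordering is defined, mergeSort by that ordering.
    • If neither is needed, simply concatenate the (K, V) records into one iterator.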
private def mergeSort(iterators: Seq[Iterator[Product2[K, C]]], comparator: Comparator[K])
    : Iterator[Product2[K, C]] = {
  val bufferedIters = iterators.filter(_.hasNext).map(_.buffered)
  type Iter = BufferedIterator[Product2[K, C]]
  // Use the reverse order (compare(y,x)) because PriorityQueue dequeues the max
  val heap = new mutable.PriorityQueue[Iter]()(
    (x: Iter, y: Iter) => comparator.compare(y.head._1, x.head._1))
  heap.enqueue(bufferedIters: _*)  // Will contain only the iterators with hasNext = true
  new Iterator[Product2[K, C]] {
    override def hasNext: Boolean = heap.nonEmpty

    override def next(): Product2[K, C] = {
      if (!hasNext) {
        throw new NoSuchElementException
      }
      val firstBuf = heap.dequeue()
      val firstPair = firstBuf.next()
      if (firstBuf.hasNext) {
        heap.enqueue(firstBuf)
      }
      firstPair
    }
  }
}

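How mergeSort works:
Every iterator in iterators is already sorted.
A priority queue (heap) is created and every iterator is enqueued, ordered by its first element under our comparator. In next(), the iterator that is dequeued holds the smallest element; if that iterator still has elements, it is put back on the heap and ordered by its new first element. Each call therefore returns the minimum across all iterators.
This merges any number of sorted iterators into one sorted iterator, and is worth borrowing in other projects.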
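
Since the technique is reusable, here is a self-contained sketch of the same heap-based k-way merge outside of Spark. It uses an implicit Ordering instead of a Comparator on keys; the object name and signature are chosen for this example only.

```scala
import scala.collection.mutable

object KWayMerge {
  /** Merges already-sorted iterators into one sorted iterator. */
  def merge[A](inputs: Seq[Iterator[A]])(implicit ord: Ordering[A]): Iterator[A] = {
    // PriorityQueue dequeues the maximum, so order the iterators by their head in reverse
    val heap = mutable.PriorityQueue.empty[BufferedIterator[A]](
      Ordering.by[BufferedIterator[A], A](_.head)(ord.reverse))
    inputs.map(_.buffered).filter(_.hasNext).foreach(it => heap.enqueue(it))

    new Iterator[A] {
      override def hasNext: Boolean = heap.nonEmpty

      override def next(): A = {
        if (!hasNext) throw new NoSuchElementException
        val it = heap.dequeue()           // iterator whose head is the current minimum
        val value = it.next()
        if (it.hasNext) heap.enqueue(it)  // re-insert, now ordered by its new head
        value
      }
    }
  }
}
```

Only one element per input is held in memory at a time, which is what allows spill files much larger than memory to be merged.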
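mergeWithAggregation:

  1. No ordering defined: mergeSort first yields an iterator sorted by the comparator. Because no ordering is defined, records within a partition are sorted by the hash code of the key, so different keys can share a hash code (a partial ordering). Such keys are collected in the keys array, and values are merged whenever an identical key is found.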
val it = new Iterator[Iterator[Product2[K, C]]] {
  val sorted = mergeSort(iterators, comparator).buffered

  // Buffers reused across elements to decrease memory allocation
  val keys = new ArrayBuffer[K]
  val combiners = new ArrayBuffer[C]

  override def hasNext: Boolean = sorted.hasNext

  override def next(): Iterator[Product2[K, C]] = {
    if (!hasNext) {
      throw new NoSuchElementException
    }
    keys.clear()
    combiners.clear()
    val firstPair = sorted.next()
    keys += firstPair._1
    combiners += firstPair._2
    val key = firstPair._1
    while (sorted.hasNext && comparator.compare(sorted.head._1, key) == 0) {
      val pair = sorted.next()
      var i = 0
      var foundKey = false
      while (i < keys.size && !foundKey) {
        if (keys(i) == pair._1) {
          combiners(i) = mergeCombiners(combiners(i), pair._2)
          foundKey = true
        }
        i += 1
      }
      if (!foundKey) {
        keys += pair._1
        combiners += pair._2
      }
    }

    // Note that we return an iterator of elements since we could've had many keys marked
    // equal by the partial order; we flatten this below to get a flat iterator of (K, C).
    keys.iterator.zip(combiners.iterator)
  }
}
it.flatten
  2. Ordering defined
    This is a total ordering, so equal keys are adjacent and are merged directly.
new Iterator[Product2[K, C]] {
  val sorted = mergeSort(iterators, comparator).buffered

  override def hasNext: Boolean = sorted.hasNext

  override def next(): Product2[K, C] = {
    if (!hasNext) {
      throw new NoSuchElementException
    }
    val elem = sorted.next()
    val k = elem._1
    var c = elem._2
    while (sorted.hasNext && sorted.head._1 == k) {
      val pair = sorted.next()
      c = mergeCombiners(c, pair._2)
    }
    (k, c)
  }
}

Creating the index file

blockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, resolvedTmp);

During the write, the size of each block is stored in partitionLengths; when the data of a partition has been written, the close method is called: PartitionWriterStream##close

public void close() {
  isClosed = true;
  partitionLengths[partitionId] = count;
  bytesWrittenToMergedFile += count;
}

The block sizes are then converted to offsets and saved to the index file:

var offset = 0L
out.writeLong(offset)
for (length <- lengths) {
  offset += length
  out.writeLong(offset)
}
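
The loop above writes a running sum. As a small illustration (the object and helper names are made up for this example), the same offsets can be computed with scanLeft, and the byte range of a reduce partition is then simply two neighbouring entries.

```scala
object IndexOffsets {
  /** Cumulative offsets as written to the index file: lengths.length + 1 entries, starting at 0. */
  def offsets(lengths: Array[Long]): Array[Long] = lengths.scanLeft(0L)(_ + _)

  /** Byte range [start, end) of reduce partition `p` inside the data file. */
  def range(offsets: Array[Long], p: Int): (Long, Long) = (offsets(p), offsets(p + 1))
}
```

For partition lengths 10, 0 and 25 the offsets are 0, 10, 10 and 35, so partition 1 is empty and partition 2 occupies bytes 10 to 35 of the data file.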

The index file is laid out as shown in the figure: [Figure: index file layout (image.png)]

Passing map task status

[Figure: mapstatus.png]

shuffle read

SortShuffleManager##getReader

override def getReader[K, C](
    handle: ShuffleHandle,
    startPartition: Int,
    endPartition: Int,
    context: TaskContext,
    metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C] = {
  val blocksByAddress = SparkEnv.get.mapOutputTracker.getMapSizesByExecutorId(
    handle.shuffleId, startPartition, endPartition)
  new BlockStoreShuffleReader(
    handle.asInstanceOf[BaseShuffleHandle[K, _, C]], blocksByAddress, context, metrics,
    shouldBatchFetch = canUseBatchFetch(startPartition, endPartition, context))
}

Getting the map task status

MapOutputTrackerWorker##getMapSizesByExecutorId

override def getMapSizesByExecutorId(
    shuffleId: Int,
    startPartition: Int,
    endPartition: Int)
  : Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])] = {
  logDebug(s"Fetching outputs for shuffle $shuffleId, partitions $startPartition-$endPartition")
  val statuses = getStatuses(shuffleId, conf)
  try {
    MapOutputTracker.convertMapStatuses(
      shuffleId, startPartition, endPartition, statuses)
  } catch {
    case e: MetadataFetchFailedException =>
      // We experienced a fetch failure so our mapStatuses cache is outdated; clear it:
      mapStatuses.clear()
      throw e
  }
}
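  1. Get the MapStatus array for the given shuffle ID; if it is not cached locally, fetch it from the remote MapOutputTrackerMaster. The following log lines mark this path: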
logInfo("Don't have map outputs for shuffle " + shuffleId + ", fetching them")
logInfo("Doing the fetch; tracker endpoint = " + trackerEndpoint)  
logInfo("Got the output locations")  
logInfo("Asked to send map output locations for shuffle " + shuffleId + " to " + hostPort)  
private def getStatuses(shuffleId: Int, conf: SparkConf): Array[MapStatus] = {
  val statuses = mapStatuses.get(shuffleId).orNull
  if (statuses == null) {
    logInfo("Don't have map outputs for shuffle " + shuffleId + ", fetching them")
    val startTimeNs = System.nanoTime()
    fetchingLock.withLock(shuffleId) {
      var fetchedStatuses = mapStatuses.get(shuffleId).orNull
      if (fetchedStatuses == null) {
        logInfo("Doing the fetch; tracker endpoint = " + trackerEndpoint)
        val fetchedBytes = askTracker[Array[Byte]](GetMapOutputStatuses(shuffleId))
        fetchedStatuses = MapOutputTracker.deserializeMapStatuses(fetchedBytes, conf)
        logInfo("Got the output locations")
        mapStatuses.put(shuffleId, fetchedStatuses)
      }
      logDebug(s"Fetching map output statuses for shuffle $shuffleId took " +
        s"${TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startTimeNs)} ms")
      fetchedStatuses
    }
  } else {
    statuses
  }
}
  • Send the GetMapOutputStatuses(shuffleId) message to trackerEndpoint:
protected def askTracker[T: ClassTag](message: Any): T = {
    trackerEndpoint.askSync[T](message)
}
  • MapOutputTrackerMasterEndpoint.receiveAndReply handles the message:
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  case GetMapOutputStatuses(shuffleId: Int) =>
    val hostPort = context.senderAddress.hostPort
    logInfo("Asked to send map output locations for shuffle " + shuffleId + " to " + hostPort)
    tracker.post(new GetMapOutputMessage(shuffleId, context))
}

It calls tracker.post:

  def post(message: GetMapOutputMessage): Unit = {
    mapOutputRequests.offer(message)
  }

This adds the GetMapOutputMessage(shuffleId, context) message to mapOutputRequests, which is a linked blocking queue:

  private val mapOutputRequests = new LinkedBlockingQueue[GetMapOutputMessage]

MapOutputTrackerMaster.MessageLoop.run

MessageLoop runs on its own thread and keeps taking requests from mapOutputRequests:

private class MessageLoop extends Runnable {
  override def run(): Unit = {
    try {
      while (true) {
        try {
          val data = mapOutputRequests.take()
          if (data == PoisonPill) {
            // Put PoisonPill back so that other MessageLoops can see it.
            mapOutputRequests.offer(PoisonPill)
            return
          }
          val context = data.context
          val shuffleId = data.shuffleId
          val hostPort = context.senderAddress.hostPort
          logDebug("Handling request to send map output locations for shuffle " + shuffleId +
            " to " + hostPort)
          val shuffleStatus = shuffleStatuses.get(shuffleId).head
          context.reply(
            shuffleStatus.serializedMapStatus(broadcastManager, isLocal, minSizeForBroadcast,
              conf))
        } catch {
          case NonFatal(e) => logError(e.getMessage, e)
        }
      }
    } catch {
      case ie: InterruptedException => // exit
    }
  }
}
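  2. Map output address conversion: from the MapStatus array and the given partition range, produce an Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])], i.e. for each BlockManagerId the block ID, size and map index of every requested partition.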
def convertMapStatuses(
    shuffleId: Int,
    startPartition: Int,
    endPartition: Int,
    statuses: Array[MapStatus],
    mapIndex : Option[Int] = None): Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])] = {
  assert (statuses != null)
  val splitsByAddress = new HashMap[BlockManagerId, ListBuffer[(BlockId, Long, Int)]]
  val iter = statuses.iterator.zipWithIndex
  for ((status, mapIndex) <- mapIndex.map(index => iter.filter(_._2 == index)).getOrElse(iter)) {
    if (status == null) {
      val errorMessage = s"Missing an output location for shuffle $shuffleId"
      logError(errorMessage)
      throw new MetadataFetchFailedException(shuffleId, startPartition, errorMessage)
    } else {
      for (part <- startPartition until endPartition) {
        val size = status.getSizeForBlock(part)
        if (size != 0) {
          splitsByAddress.getOrElseUpdate(status.location, ListBuffer()) +=
            ((ShuffleBlockId(shuffleId, status.mapId, part), size, mapIndex))
        }
      }
    }
  }

  splitsByAddress.iterator
}

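Fetching the map output

Parameters:
spark.reducer.maxSizeInFlight and spark.reducer.maxReqsInFlight together bound the bytes and the number of fetch requests that may be in flight.

spark.reducer.maxSizeInFlight: default 48m.
The shuffle read buffer size. It determines how much data is fetched at a time; requests are cut at roughly maxBytesInFlight / 5.
If plenty of memory is available, raising this value reduces the number of fetch rounds.
spark.reducer.maxReqsInFlight: default Int.MaxValue. The maximum number of concurrent fetch requests.
spark.reducer.maxBlocksInFlightPerAddress: default Int.MaxValue. The maximum number of blocks being fetched from a single remote address at a time.
spark.maxRemoteBlockSizeFetchToMem: default 200m. A block larger than this is written straight to disk.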
config.SHUFFLE_DETECT_CORRUPT
SHUFFLE_DETECT_CORRUPT_MEMORY
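  • Constructing ShuffleBlockFetcherIterator runs its initialize() method:
    1. Split the blocks into local and remote ones and return remoteRequests = new ArrayBuffer[FetchRequest]. The target size of a remote request is math.max(maxBytesInFlight / 5, 1L), so that up to 5 requests can be fetched in parallel.
    2. Shuffle the FetchRequests into random order and add them to val fetchRequests = new Queue[FetchRequest].
    3. Send fetch requests until maxBytesInFlight is reached; a request larger than maxRemoteBlockSizeFetchToMem is written straight to disk.
    4. Fetch the local blocks.
Log output: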
logInfo(s"Started $numFetches remote fetches in ${Utils.getUsedTimeNs(startTimeNs)}")
private[this] def initialize(): Unit = {
  // Add a task completion callback (called in both success case and failure case) to cleanup.
  context.addTaskCompletionListener(onCompleteCallback)

  // Split local and remote blocks.
  val remoteRequests = splitLocalRemoteBlocks()
  // Add the remote requests into our queue in a random order
  fetchRequests ++= Utils.randomize(remoteRequests)
  assert ((0 == reqsInFlight) == (0 == bytesInFlight),
    "expected reqsInFlight = 0 but found reqsInFlight = " + reqsInFlight +
    ", expected bytesInFlight = 0 but found bytesInFlight = " + bytesInFlight)

  // Send out initial requests for blocks, up to our maxBytesInFlight
  fetchUpToMaxBytes()

  val numFetches = remoteRequests.size - fetchRequests.size
  logInfo(s"Started $numFetches remote fetches in ${Utils.getUsedTimeNs(startTimeNs)}")

  // Get Local Blocks
  fetchLocalBlocks()
  logDebug(s"Got local blocks in ${Utils.getUsedTimeNs(startTimeNs)}")
}
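  • Aggregate while fetching
    ShuffleBlockFetcherIterator.next()
    1. While no result is available, block on the result queue results = new LinkedBlockingQueue[FetchResult] and keep fetching.
    2. Send further fetch requests until maxBytesInFlight is reached.
    3. Return (blockId, inputStream).

fetchUpToMaxBytes is called when ShuffleBlockFetcherIterator is initialized and again on every iteration; at any time at most spark.reducer.maxSizeInFlight bytes are in flight. During initialization a small part of the requests may already hit the maxBytesInFlight limit, so many requests may still be unsent. Every iteration of ShuffleBlockFetcherIterator therefore also sends pending requests. A single large request is sent only once no other fetch is in flight, and the while loop in next() waits until a fetch result arrives. A request larger than maxRemoteBlockSizeFetchToMem is written straight to disk.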

def isRemoteBlockFetchable(fetchReqQueue: Queue[FetchRequest]): Boolean = {
  fetchReqQueue.nonEmpty &&
    (bytesInFlight == 0 ||
      (reqsInFlight + 1 <= maxReqsInFlight &&
        bytesInFlight + fetchReqQueue.front.size <= maxBytesInFlight))
}
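
How the request size and the in-flight limits interact can be sketched as follows. This is a simplified illustration, not the iterator's code: `Block`, `split` and `fetchable` are names invented here, all blocks are assumed to live on one remote address, and the per-address block limit is left out.

```scala
object FetchRequestSketch {
  final case class Block(id: String, size: Long)

  /** Groups blocks into requests; a request is closed once it reaches maxBytesInFlight / 5. */
  def split(blocks: Seq[Block], maxBytesInFlight: Long): Seq[Seq[Block]] = {
    val target = math.max(maxBytesInFlight / 5, 1L)
    val requests = Seq.newBuilder[Seq[Block]]
    var cur = Vector.empty[Block]
    var curSize = 0L
    for (b <- blocks) {
      cur :+= b
      curSize += b.size
      if (curSize >= target) { // request is large enough: emit it and start a new one
        requests += cur
        cur = Vector.empty
        curSize = 0L
      }
    }
    if (cur.nonEmpty) requests += cur // leftover blocks form the final request
    requests.result()
  }

  /** Same condition as isRemoteBlockFetchable, with the queue head passed in as `nextSize`. */
  def fetchable(
      nextSize: Long,
      bytesInFlight: Long,
      reqsInFlight: Int,
      maxBytesInFlight: Long,
      maxReqsInFlight: Int): Boolean =
    bytesInFlight == 0 ||
      (reqsInFlight + 1 <= maxReqsInFlight && bytesInFlight + nextSize <= maxBytesInFlight)
}
```

The `bytesInFlight == 0` clause is what lets a request larger than the whole budget go out eventually: it is sent as soon as nothing else is in flight.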

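Aggregation

  • If an aggregator is defined and map-side combine is on, ExternalAppendOnlyMap uses mergeCombiners as the merge function.
  • If an aggregator is defined and map-side combine is off, ExternalAppendOnlyMap uses mergeValue as the merge function.
  • If no aggregator is defined, no aggregation is needed and the iterator is returned as is.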
val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
  if (dep.mapSideCombine) {
    // We are reading values that are already combined
    val combinedKeyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, C)]]
    dep.aggregator.get.combineCombinersByKey(combinedKeyValuesIterator, context)
  } else {
    // We don't know the value type, but also don't care -- the dependency *should*
    // have made sure its compatible w/ this aggregator, which will convert the value
    // type to the combined type C
    val keyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, Nothing)]]
    dep.aggregator.get.combineValuesByKey(keyValuesIterator, context)
  }
} else {
  interruptibleIter.asInstanceOf[Iterator[Product2[K, C]]]
}
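
The difference between the two aggregation paths can be shown with an in-memory stand-in. `Agg` below is a hypothetical, simplified counterpart of the aggregator (three functions, no spilling, results in an immutable Map); it is meant only to show that both paths arrive at the same combiners.

```scala
object AggregatorSketch {
  /** Simplified stand-in: the same three functions as the aggregator described above. */
  final case class Agg[V, C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C) {

    /** Reduce side without map-side combine: raw values are folded with mergeValue. */
    def combineValuesByKey[K](iter: Iterator[(K, V)]): Map[K, C] =
      iter.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(c => mergeValue(c, v)).getOrElse(createCombiner(v)))
      }

    /** Reduce side with map-side combine: partial combiners are folded with mergeCombiners. */
    def combineCombinersByKey[K](iter: Iterator[(K, C)]): Map[K, C] =
      iter.foldLeft(Map.empty[K, C]) { case (acc, (k, c)) =>
        acc.updated(k, acc.get(k).map(old => mergeCombiners(old, c)).getOrElse(c))
      }
  }
}
```

As a usage example, take a (sum, count) combiner: whether the reduce side sees the raw values 1 and 3 for a key, or the partial combiners (1, 1) and (3, 1) from two map tasks, the result for that key is (4, 2).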

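ExternalAppendOnlyMap.insertAll(): unlike the map side, this step always aggregates. Aggregating on write and spilling to disk work the same way as in ExternalSorter. Its job is to complete the aggregation on the reduce side, whether or not the map side has already combined.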

def insertAll(entries: Iterator[Product2[K, V]]): Unit = {
  if (currentMap == null) {
    throw new IllegalStateException(
      "Cannot insert new elements into a map after calling iterator")
  }
  // An update function for the map that we reuse across entries to avoid allocating
  // a new closure each time
  var curEntry: Product2[K, V] = null
  val update: (Boolean, C) => C = (hadVal, oldVal) => {
    if (hadVal) mergeValue(oldVal, curEntry._2) else createCombiner(curEntry._2)
  }

  while (entries.hasNext) {
    curEntry = entries.next()
    val estimatedSize = currentMap.estimateSize()
    if (estimatedSize > _peakMemoryUsedBytes) {
      _peakMemoryUsedBytes = estimatedSize
    }
    if (maybeSpill(currentMap, estimatedSize)) {
      currentMap = new SizeTrackingAppendOnlyMap[K, C]
    }
    currentMap.changeValue(curEntry._1, update)
    addElementsRead()
  }
}

Afterwards, if the dependency defines a key ordering, the result is sorted with ExternalSorter:

val sorter =
  new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = dep.serializer)
sorter.insertAll(aggregatedIter)