SortShuffleWriter
The shuffle write flow:
- Create an ExternalSorter; if mapSideCombine is not required, pass aggregator = None and ordering = None
- Insert the records into the ExternalSorter
- Persist the map output, producing a single data file on disk, and create the matching index file
- Create the MapStatus
SortShuffleWriter.write()
override def write(records: Iterator[Product2[K, V]]): Unit = {
  sorter = if (dep.mapSideCombine) {
    new ExternalSorter[K, V, C](
      context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
  } else {
    // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
    // care whether the keys get sorted in each partition; that will be done on the reduce side
    // if the operation being run is sortByKey.
    new ExternalSorter[K, V, V](
      context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
  }
  sorter.insertAll(records)

  // Don't bother including the time to open the merged output file in the shuffle write time,
  // because it just opens a single file, so is typically too fast to measure accurately
  // (see SPARK-3570).
  val mapOutputWriter = shuffleExecutorComponents.createMapOutputWriter(
    dep.shuffleId, mapId, dep.partitioner.numPartitions)
  sorter.writePartitionedMapOutput(dep.shuffleId, mapId, mapOutputWriter)
  val partitionLengths = mapOutputWriter.commitAllPartitions() // also creates the index file
  mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths, mapId)
}
ExternalSorter insertion
- Check whether an aggregator is defined (given how the ExternalSorter was constructed above, this is equivalent to checking whether map-side aggregation is needed). If so, a PartitionedAppendOnlyMap is used; otherwise a PartitionedPairBuffer.
  - PartitionedAppendOnlyMap: aggregates while inserting; every insert checks whether a spill to disk is needed.
  - PartitionedPairBuffer: appends to the buffer without aggregation; every insert checks whether a spill to disk is needed.
ExternalSorter.insertAll()
def insertAll(records: Iterator[Product2[K, V]]): Unit = {
  // TODO: stop combining if we find that the reduction factor isn't high
  val shouldCombine = aggregator.isDefined

  if (shouldCombine) { // 1. mapSideCombine is true
    // Combine values in-memory first using our AppendOnlyMap
    // The merge function
    val mergeValue = aggregator.get.mergeValue
    // Creates the initial combiner value
    val createCombiner = aggregator.get.createCombiner
    var kv: Product2[K, V] = null
    // 2. Update function: merge if a value already exists, otherwise create the initial value
    val update = (hadValue: Boolean, oldValue: C) => {
      if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
    }
    while (records.hasNext) {
      // 3. Insert into the map
      addElementsRead()
      kv = records.next()
      // AppendOnlyMap.changeValue, which also samples the map size
      map.changeValue((getPartition(kv._1), kv._1), update)
      // 4. Possibly spill to disk
      maybeSpillCollection(usingMap = true)
    }
  } else {
    // Stick values into our buffer
    while (records.hasNext) {
      addElementsRead()
      val kv = records.next()
      buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
      maybeSpillCollection(usingMap = false)
    }
  }
}
Aggregation on write
PartitionedAppendOnlyMap calls changeValue on its parent class SizeTrackingAppendOnlyMap, which performs the aggregation and, at the same time, samples the size of the AppendOnlyMap.
override def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
  // 1. Call AppendOnlyMap.changeValue in the parent class to apply the in-memory aggregation.
  val newValue = super.changeValue(key, updateFunc)
  // 2. Call afterUpdate from the SizeTracker trait to sample the AppendOnlyMap's size.
  super.afterUpdate()
  newValue
}
Aggregation algorithm
- Storage layout: keys and values are stored in one flat array as key0, value0, key1, value1, key2, value2, …; for a computed position pos, the key lives at 2*pos and the value at 2*pos+1.
- If a value already exists for the key, it is updated with the aggregator's mergeValue: (C, V) => C; otherwise the initial value is created with createCombiner: V => C.
- Hash collisions are resolved with a variant of quadratic probing: the offset grows with each probe, which reduces the total number of probes. The implementation differs from textbook quadratic probing and avoids forcing the map to grow too quickly; the idea is worth borrowing in other projects.
  - Standard quadratic probing (en.wikipedia.org/wiki/Quadra…
  - The probing used by AppendOnlyMap: pos+1, pos+3, pos+6, pos+10 …
// 1. The merge function
val mergeValue = aggregator.get.mergeValue
// 2. Creates the initial combiner value
val createCombiner = aggregator.get.createCombiner
var kv: Product2[K, V] = null
// 3. Update function: merge if a value already exists, otherwise create the initial value
val update = (hadValue: Boolean, oldValue: C) => {
  if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
}
AppendOnlyMap.changeValue()
/**
 * Set the value for key to updateFunc(hadValue, oldValue), where oldValue will be the old value
 * for key, if any, or null otherwise. Returns the newly updated value.
 */
def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
  assert(!destroyed, destructionMessage)
  val k = key.asInstanceOf[AnyRef]
  if (k.eq(null)) {
    if (!haveNullValue) {
      incrementSize()
    }
    nullValue = updateFunc(haveNullValue, nullValue)
    haveNullValue = true
    return nullValue
  }
  var pos = rehash(k.hashCode) & mask
  var i = 1
  while (true) {
    val curKey = data(2 * pos)
    if (curKey.eq(null)) {
      val newValue = updateFunc(false, null.asInstanceOf[V])
      data(2 * pos) = k
      data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
      incrementSize()
      return newValue
    } else if (k.eq(curKey) || k.equals(curKey)) {
      val newValue = updateFunc(true, data(2 * pos + 1).asInstanceOf[V])
      data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
      return newValue
    } else {
      val delta = i
      pos = (pos + delta) & mask
      i += 1
    }
  }
  null.asInstanceOf[V] // Never reached but needed to keep compiler happy
}
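The else branch above is what produces the pos+1, pos+3, pos+6, pos+10 … sequence: the i-th probe moves by i slots, so the cumulative offset is the i-th triangular number. A minimal standalone sketch of that arithmetic (the capacity and starting position are made-up values for illustration):
// Probe-offset sketch: cumulative offsets form the triangular numbers 1, 3, 6, 10, 15, ...
// capacity and the starting pos are invented values, not taken from Spark.
val capacity = 64                  // must be a power of 2, as in AppendOnlyMap
val mask = capacity - 1
var pos = 17                       // pretend rehash(key.hashCode) & mask landed here
var i = 1
val probes = (1 to 5).map { _ =>
  pos = (pos + i) & mask           // the same step as the else branch above
  i += 1
  pos
}
println(probes.mkString(", "))     // 18, 20, 23, 27, 32  (offsets +1, +3, +6, +10, +15)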
Estimating the map size
The AppendOnlyMap cannot be allowed to grow without bound, so its size must be limited. Computing the exact size after every update would hurt Spark's performance badly, so Spark instead samples the size occasionally and extrapolates the future size of the AppendOnlyMap.
- Size sampling
  - A sample is taken whenever the sampling point is reached, i.e. when nextSampleNum == numUpdates.
  - Sampling steps:
    - Estimate the memory occupied by the AppendOnlyMap and enqueue it, together with the current update count (numUpdates), as a sample into samples = new mutable.Queue[Sample].
    - If there are now more than two samples, dequeue once so that only the last two samples are kept.
    - Compute the size added per update as bytesPerUpdate = (latest.size - previous.size) / (latest.numUpdates - previous.numUpdates); with fewer than two samples, bytesPerUpdate = 0.
    - Compute the next sampling point nextSampleNum.
SizeTracker.afterUpdate()
protected def afterUpdate(): Unit = {
  numUpdates += 1
  if (nextSampleNum == numUpdates) {
    takeSample()
  }
}
SizeTracker.takeSample()
private def takeSample(): Unit = {
  samples.enqueue(Sample(SizeEstimator.estimate(this), numUpdates))
  // Only use the last two samples to extrapolate
  if (samples.size > 2) {
    samples.dequeue()
  }
  val bytesDelta = samples.toList.reverse match {
    case latest :: previous :: tail =>
      (latest.size - previous.size).toDouble / (latest.numUpdates - previous.numUpdates)
    // If fewer than 2 samples, assume no change
    case _ => 0
  }
  bytesPerUpdate = math.max(0, bytesDelta)
  nextSampleNum = math.ceil(numUpdates * SAMPLE_GROWTH_RATE).toLong
}
- Size estimation
SizeEstimator.estimate estimates the size of an object: it first adds the object's shell size (the size of its own fields), then pushes every referenced object onto a queue and recursively sizes them until the queue is empty.
This sample-then-extrapolate approach makes efficient use of memory: instead of fixing a cache size up front, the collection grows according to the memory actually available, which is a technique worth borrowing.
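Combined with the latest sample, bytesPerUpdate gives the size estimate that maybeSpillCollection consumes below. A minimal sketch of that extrapolation, with local stand-ins for SizeTracker's fields (it mirrors, but is not copied from, the actual estimateSize):
// Local stand-ins for SizeTracker's state, for illustration only.
case class Sample(size: Long, numUpdates: Long)

def estimateSize(samples: Seq[Sample], bytesPerUpdate: Double, numUpdates: Long): Long = {
  // size of the last sample plus bytesPerUpdate for every update since it was taken
  val extrapolatedDelta = bytesPerUpdate * (numUpdates - samples.last.numUpdates)
  (samples.last.size + extrapolatedDelta).toLong
}

// e.g. last sample: 1 MB at update 1000, ~100 bytes per update, now at update 1100:
// estimateSize(Seq(Sample(1048576L, 1000L)), 100.0, 1100L) == 1058576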
Spilling to disk
private def maybeSpillCollection(usingMap: Boolean): Unit = {
  var estimatedSize = 0L
  if (usingMap) { // an aggregator is defined: estimate the size of the PartitionedAppendOnlyMap
    estimatedSize = map.estimateSize()
    // spill to disk if necessary
    if (maybeSpill(map, estimatedSize)) {
      // start a fresh map after spilling
      map = new PartitionedAppendOnlyMap[K, C]
    }
  } else {
    estimatedSize = buffer.estimateSize()
    if (maybeSpill(buffer, estimatedSize)) {
      buffer = new PartitionedPairBuffer[K, C]
    }
  }
  // update the peak memory used by this ExternalSorter
  if (estimatedSize > _peakMemoryUsedBytes) {
    _peakMemoryUsedBytes = estimatedSize
  }
}
- Deciding whether to spill
The collection size is first estimated, based on the per-update growth obtained from sampling. When the estimated size reaches the memory already granted (myMemoryThreshold), the sorter tries to acquire more memory, requesting 2 * currentMemory - myMemoryThreshold. If the memory granted is still not enough to cover the estimated size, the collection is spilled.
protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
  var shouldSpill = false
  // Only check every 32 elements read, and only if the collection's current size has reached myMemoryThreshold
  if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
    // Claim up to double our current memory from the shuffle memory pool
    val amountToRequest = 2 * currentMemory - myMemoryThreshold
    val granted = acquireMemory(amountToRequest)
    myMemoryThreshold += granted
    // If we were granted too little memory to grow further (either tryToAcquire returned 0,
    // or we already had more memory than myMemoryThreshold), spill the current collection
    shouldSpill = currentMemory >= myMemoryThreshold
  }
  shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
  // Actually spill
  if (shouldSpill) {
    _spillCount += 1
    logSpillage(currentMemory)
    spill(collection)
    _elementsRead = 0
    _memoryBytesSpilled += currentMemory
    releaseMemory()
  }
  shouldSpill
}
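A hedged numeric walk-through of that decision; the 5 MB starting threshold assumes the usual default of spark.shuffle.spill.initialMemoryThreshold, and the granted amounts are invented:
// Illustration only: the sizes below are made up to show the arithmetic.
val initialThreshold = 5L * 1024 * 1024          // assumed initial myMemoryThreshold (5 MB)
val currentMemory    = 7L * 1024 * 1024          // estimated collection size (7 MB)

val amountToRequest = 2 * currentMemory - initialThreshold   // 9 MB requested

// Case 1: the memory manager grants the full 9 MB -> threshold becomes 14 MB >= 7 MB, no spill.
// Case 2: it grants only 1 MB                     -> threshold becomes  6 MB <  7 MB, spill.
def wouldSpill(granted: Long): Boolean =
  currentMemory >= initialThreshold + granted

println(wouldSpill(9L * 1024 * 1024))  // false
println(wouldSpill(1L * 1024 * 1024))  // true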
- Spilling
override protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = {
  val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
  val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
  spills += spillFile
}
- The data is sorted while it is spilled to disk. The comparator is chosen as follows:
  - An ordering or an aggregator is defined (i.e. with mapSideCombine):
    - with an ordering: sort by partition ID first, then by the ordering
    - without an ordering: sort by partition ID first, then by the key's hashCode
  - Neither is defined (no mapSideCombine): sort by partition ID only
def partitionedDestructiveSortedIterator(keyComparator: Option[Comparator[K]])
  : Iterator[((Int, K), V)] = {
  val comparator = keyComparator.map(partitionKeyComparator).getOrElse(partitionComparator)
  destructiveSortedIterator(comparator)
}
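When no ordering is supplied, the key comparator falls back to comparing hash codes; the sketch below is a simplified rendering of that comparator (illustrative, not the verbatim source):
import java.util.Comparator

// Hash-based key comparator (simplified): used when no Ordering is defined, which is why
// keys inside a partition are only partially ordered (different keys may share a hash code).
def hashKeyComparator[K]: Comparator[K] = new Comparator[K] {
  override def compare(a: K, b: K): Int = {
    val h1 = if (a == null) 0 else a.hashCode()
    val h2 = if (b == null) 0 else b.hashCode()
    if (h1 < h2) -1 else if (h1 == h2) 0 else 1
  }
}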
- A new iterator is then produced:
  - Compact the data array towards the front, packing the live KV pairs to the left.
  - Sort it with Sorter, KVArraySortDataFormat and the chosen comparator; this uses TimSort, an optimized merge sort.
  - Return a new iterator over the sorted pairs.
def destructiveSortedIterator(keyComparator: Comparator[K]): Iterator[(K, V)] = {
  destroyed = true
  // Pack KV pairs into the front of the underlying array
  var keyIndex, newIndex = 0
  while (keyIndex < capacity) {
    if (data(2 * keyIndex) != null) {
      data(2 * newIndex) = data(2 * keyIndex)
      data(2 * newIndex + 1) = data(2 * keyIndex + 1)
      newIndex += 1
    }
    keyIndex += 1
  }
  assert(curSize == newIndex + (if (haveNullValue) 1 else 0))

  new Sorter(new KVArraySortDataFormat[K, AnyRef]).sort(data, 0, newIndex, keyComparator)

  new Iterator[(K, V)] {
    var i = 0
    var nullValueReady = haveNullValue
    def hasNext: Boolean = (i < newIndex || nullValueReady)
    def next(): (K, V) = {
      if (nullValueReady) {
        nullValueReady = false
        (null.asInstanceOf[K], nullValue)
      } else {
        val item = (data(2 * i).asInstanceOf[K], data(2 * i + 1).asInstanceOf[V])
        i += 1
        item
      }
    }
  }
}
- Writing the spill: a temporary file is created, and by default a flush is performed after every 10,000 records written, as sketched below.
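A rough sketch of that batched write loop; writeRecord and flush are placeholders standing in for the DiskBlockObjectWriter calls, and the batch size corresponds to spark.shuffle.spill.batchSize (default 10000). This is a simplification of spillMemoryIteratorToDisk, not the full method:
import scala.collection.mutable.ArrayBuffer

// Write records to a temp spill file in batches, flushing and recording a segment
// length every `serializerBatchSize` records.
def spillInBatches[T](records: Iterator[T],
                      writeRecord: T => Unit,                 // placeholder for writer.write(...)
                      flush: () => Long,                      // placeholder; returns bytes flushed
                      serializerBatchSize: Long = 10000L): Seq[Long] = {
  val batchSizes = new ArrayBuffer[Long]
  var objectsWritten = 0L
  while (records.hasNext) {
    writeRecord(records.next())
    objectsWritten += 1
    if (objectsWritten == serializerBatchSize) {   // flush every 10000 records
      batchSizes += flush()
      objectsWritten = 0
    }
  }
  if (objectsWritten > 0) batchSizes += flush()    // flush the final partial batch
  batchSizes.toSeq
}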
Persisting the map output
The temporary spill files and the remaining in-memory data are written into the final output file.
- No spills occurred: sort the in-memory data, then create one block file per partition with its own partitionWriter and write each partition into its block file.
- Spills occurred: obtain a per-partition iterator via merge, then write it out the same way as above.
merge
- Create a SpillReader for every spilled file
- Create a buffered iterator over the in-memory data
- For each partition, combine the per-spill iterators with the in-memory iterator and merge-sort them:
  - build an IteratorForPartition over the in-memory data and collect all iterators into a Seq[Iterator[Product2[K, C]]]
  - if aggregation is needed, aggregate after the mergeSort
  - if only an ordering is defined, mergeSort by that ordering
  - if neither is needed, simply flatten the iterators into one iterator of (K, V) pairs
ExternalSorter.merge()
private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
  : Iterator[(Int, Iterator[Product2[K, C]])] = {
  val readers = spills.map(new SpillReader(_))
  val inMemBuffered = inMemory.buffered
  (0 until numPartitions).iterator.map { p =>
    val inMemIterator = new IteratorForPartition(p, inMemBuffered)
    val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator)
    if (aggregator.isDefined) {
      // Perform partial aggregation across partitions
      (p, mergeWithAggregation(
        iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
    } else if (ordering.isDefined) {
      // No aggregator given, but we have an ordering (e.g. used by reduce tasks in sortByKey);
      // sort the elements without trying to merge them
      (p, mergeSort(iterators, ordering.get))
    } else {
      (p, iterators.iterator.flatten)
    }
  }
}
mergeSort
Note first that the data inside each individual iterator is already sorted.
To merge multiple sorted iterators:
- Build a priority queue (a heap) of the iterators, ordered by each iterator's head element according to our comparator.
- Return an iterator whose next method dequeues the iterator at the top of the heap and takes its next element, which is the smallest element across all iterators. If that iterator still has elements, it is re-enqueued, again ordered by its new head. Each call therefore yields the minimum of all the iterators.
This is a general way to merge several sorted iterators and is worth borrowing in other projects.
ExternalSorter.mergeSort()
private def mergeSort(iterators: Seq[Iterator[Product2[K, C]]], comparator: Comparator[K])
  : Iterator[Product2[K, C]] = {
  val bufferedIters = iterators.filter(_.hasNext).map(_.buffered)
  type Iter = BufferedIterator[Product2[K, C]]
  // Use the reverse order (compare(y,x)) because PriorityQueue dequeues the max
  val heap = new mutable.PriorityQueue[Iter]()(
    (x: Iter, y: Iter) => comparator.compare(y.head._1, x.head._1))
  heap.enqueue(bufferedIters: _*) // Will contain only the iterators with hasNext = true
  new Iterator[Product2[K, C]] {
    override def hasNext: Boolean = heap.nonEmpty
    override def next(): Product2[K, C] = {
      if (!hasNext) {
        throw new NoSuchElementException
      }
      val firstBuf = heap.dequeue()
      val firstPair = firstBuf.next()
      if (firstBuf.hasNext) {
        heap.enqueue(firstBuf)
      }
      firstPair
    }
  }
}
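A self-contained toy version of the same pattern, merging plain sorted iterators of (Int, String) pairs (all names here are invented for the demo):
import scala.collection.mutable

// Toy k-way merge over sorted iterators, mirroring mergeSort above.
def kWayMerge(iterators: Seq[Iterator[(Int, String)]]): Iterator[(Int, String)] = {
  type Iter = BufferedIterator[(Int, String)]
  // Reverse the ordering because PriorityQueue dequeues the maximum element.
  val heap = new mutable.PriorityQueue[Iter]()(Ordering.by[Iter, Int](_.head._1).reverse)
  heap.enqueue(iterators.filter(_.hasNext).map(_.buffered): _*)
  new Iterator[(Int, String)] {
    def hasNext: Boolean = heap.nonEmpty
    def next(): (Int, String) = {
      val it = heap.dequeue()
      val pair = it.next()
      if (it.hasNext) heap.enqueue(it)  // re-enqueue, ordered by its new head
      pair
    }
  }
}

// kWayMerge(Seq(Iterator(1 -> "a", 4 -> "d"), Iterator(2 -> "b", 3 -> "c"))).toList
// => List((1,a), (2,b), (3,c), (4,d))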
Aggregation during the merge
- No ordering defined:
mergeSort first produces an iterator sorted by the comparator, but since no ordering was given, keys within a partition are ordered only by their hashCode, so different keys can share the same hash (a partial ordering). The keys seen so far are collected into a keys array; a while loop walks all elements with the same hash, and whenever the same key appears again its value is merged, with the merged value kept in the parallel combiners array.
ExternalSorter.mergeWithAggregation()
val it = new Iterator[Iterator[Product2[K, C]]] {
  val sorted = mergeSort(iterators, comparator).buffered

  // Buffers reused across elements to decrease memory allocation
  val keys = new ArrayBuffer[K]
  val combiners = new ArrayBuffer[C]

  override def hasNext: Boolean = sorted.hasNext

  override def next(): Iterator[Product2[K, C]] = {
    if (!hasNext) {
      throw new NoSuchElementException
    }
    keys.clear()
    combiners.clear()
    val firstPair = sorted.next()
    keys += firstPair._1
    combiners += firstPair._2
    val key = firstPair._1
    while (sorted.hasNext && comparator.compare(sorted.head._1, key) == 0) {
      val pair = sorted.next()
      var i = 0
      var foundKey = false
      while (i < keys.size && !foundKey) {
        if (keys(i) == pair._1) {
          combiners(i) = mergeCombiners(combiners(i), pair._2)
          foundKey = true
        }
        i += 1
      }
      if (!foundKey) {
        keys += pair._1
        combiners += pair._2
      }
    }
    // Note that we return an iterator of elements since we could've had many keys marked
    // equal by the partial order; we flatten this below to get a flat iterator of (K, C).
    keys.iterator.zip(combiners.iterator)
  }
}
it.flatten
- An ordering is defined: the keys are totally ordered, so equal keys are adjacent and can be merged directly.
ExternalSorter.mergeWithAggregation()
new Iterator[Product2[K, C]] {
  val sorted = mergeSort(iterators, comparator).buffered

  override def hasNext: Boolean = sorted.hasNext

  override def next(): Product2[K, C] = {
    if (!hasNext) {
      throw new NoSuchElementException
    }
    val elem = sorted.next()
    val k = elem._1
    var c = elem._2
    while (sorted.hasNext && sorted.head._1 == k) {
      val pair = sorted.next()
      c = mergeCombiners(c, pair._2)
    }
    (k, c)
  }
}
Creating the index file
When each partition has been written and close is called on its writer, the block's size is recorded in long[] partitionLengths.
PartitionWriterStream.close()
public void close() {
  isClosed = true;
  partitionLengths[partitionId] = count;
  bytesWrittenToMergedFile += count;
}
The index file is then created: the per-partition block sizes are converted into cumulative offsets and written into the index file.
LocalDiskShuffleMapOutputWriter.commitAllPartitions()
blockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, resolvedTmp);
Inside writeIndexFileAndCommit, the offsets are written as:
var offset = 0L
out.writeLong(offset)
for (length <- lengths) {
  offset += length
  out.writeLong(offset)
}
The overall process is illustrated in the accompanying figure.
Passing the map task status
ByPassMergeSortShuffleWriter
BypassMergeSortShuffleWriter first obtains one DiskWriter per reduce partition, writing one file per reducer, then concatenates them into a single file and generates the index file. By default the file concatenation copies the bytes with NIO. Compared with SortShuffleWriter, it performs no sorting or aggregation, and the concatenation avoids serialization/deserialization overhead.
public void write(Iterator<Product2<K, V>> records) throws IOException {
  assert (partitionWriters == null);
  ShuffleMapOutputWriter mapOutputWriter = shuffleExecutorComponents
      .createMapOutputWriter(shuffleId, mapId, numPartitions);
  try {
    if (!records.hasNext()) {
      partitionLengths = mapOutputWriter.commitAllPartitions();
      mapStatus = MapStatus$.MODULE$.apply(
          blockManager.shuffleServerId(), partitionLengths, mapId);
      return;
    }
    final SerializerInstance serInstance = serializer.newInstance();
    final long openStartTime = System.nanoTime();
    // 1. Create one DiskBlockObjectWriter per reduce partition
    partitionWriters = new DiskBlockObjectWriter[numPartitions];
    partitionWriterSegments = new FileSegment[numPartitions];
    for (int i = 0; i < numPartitions; i++) {
      final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
          blockManager.diskBlockManager().createTempShuffleBlock();
      final File file = tempShuffleBlockIdPlusFile._2();
      final BlockId blockId = tempShuffleBlockIdPlusFile._1();
      partitionWriters[i] =
          blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics);
    }
    // Creating the file to write to and creating a disk writer both involve interacting with
    // the disk, and can take a long time in aggregate when we open many files, so should be
    // included in the shuffle write time.
    writeMetrics.incWriteTime(System.nanoTime() - openStartTime);
    // 2. Iterate over the records and write each one out
    while (records.hasNext()) {
      final Product2<K, V> record = records.next();
      final K key = record._1();
      // the partitioner decides which partition (and hence which writer) the record goes to
      partitionWriters[partitioner.getPartition(key)].write(key, record._2());
    }

    for (int i = 0; i < numPartitions; i++) {
      try (DiskBlockObjectWriter writer = partitionWriters[i]) {
        partitionWriterSegments[i] = writer.commitAndGet();
      }
    }

    // 3. Concatenate the per-partition files into a single file and generate the index file
    partitionLengths = writePartitionedData(mapOutputWriter);
    mapStatus = MapStatus$.MODULE$.apply(
        blockManager.shuffleServerId(), partitionLengths, mapId);
  } catch (Exception e) {
    try {
      mapOutputWriter.abort(e);
    } catch (Exception e2) {
      logger.error("Failed to abort the writer after failing to write map output.", e2);
      e.addSuppressed(e2);
    }
    throw e;
  }
}
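A hedged sketch of what the NIO-based concatenation boils down to: copying each per-partition temp file into the single output file with FileChannel.transferTo (a simplification for illustration, not the actual writePartitionedData code):
import java.io.{File, FileInputStream, FileOutputStream}

// Concatenate per-partition temp files into one output file using NIO channel transfer,
// avoiding any serialization/deserialization of the records themselves.
def concatWithNio(partitionFiles: Seq[File], output: File): Array[Long] = {
  val out = new FileOutputStream(output, true).getChannel
  try {
    partitionFiles.map { f =>
      val in = new FileInputStream(f).getChannel
      try {
        var pos = 0L
        while (pos < in.size()) {                       // transferTo may copy less than asked
          pos += in.transferTo(pos, in.size() - pos, out)
        }
        in.size()                                       // this partition's length, for the index
      } finally in.close()
    }.toArray
  } finally out.close()
}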