Big Data Shuffle: Principles and Practice | Youth Training Camp Notes

2022-08-01

This is day 8 of my participation in the note-writing activity of the 4th Youth Training Camp.

Shuffle overview

The classic shuffle process

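Map phase: computation over a small chunk of data, performed on a single machine.
Shuffle phase: building on the map output, moves the data to prepare for the reduce phase that follows.
Reduce phase: processes the moved data, again with each machine handling a small share of it.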
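The three phases above can be modeled in a few lines of plain Scala. This is only a sketch on in-memory collections, not Spark code: the object name `ShuffleSketch`, its method names, and the choice of hashing to pick a reducer are all assumptions made for illustration.

```scala
// A plain-Scala model of map -> shuffle -> reduce on in-memory data.
object ShuffleSketch {
  // Map phase: each input split is turned into (word, 1) pairs independently.
  def mapPhase(split: Seq[String]): Seq[(String, Int)] =
    split.flatMap(_.split(" ")).filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle phase: route every pair to the reducer chosen by its key's hash,
  // so all pairs that share a key arrive at the same reducer.
  def shufflePhase(mapOutputs: Seq[Seq[(String, Int)]],
                   numReducers: Int): Map[Int, Seq[(String, Int)]] =
    mapOutputs.flatten.groupBy { case (word, _) =>
      Math.floorMod(word.hashCode, numReducers)
    }

  // Reduce phase: each reducer sums the counts of the keys it received.
  def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, group) => word -> group.map(_._2).sum }

  def wordCount(splits: Seq[Seq[String]], numReducers: Int): Map[String, Int] = {
    val mapOutputs = splits.map(mapPhase)
    val shuffled = shufflePhase(mapOutputs, numReducers)
    // Reducers own disjoint key sets, so their results can simply be unioned.
    shuffled.values.map(reducePhase).foldLeft(Map.empty[String, Int])(_ ++ _)
  }

  def main(args: Array[String]): Unit =
    println(wordCount(Seq(Seq("a b a"), Seq("b c")), numReducers = 2))
}
```

Only `shufflePhase` moves data between "machines" in this model; the map and reduce steps each touch a single split or a single reducer's share.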
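Why does shuffle matter so much for performance?
- M × R network connections (one for each pair of map task and reduce task)
- Large volumes of data movement
- Risk of data loss
- Potentially many sort operations
- Heavy serialization and deserialization of data
- Data compression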
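To get a feel for the M × R term, the arithmetic can be sketched as below. The object name `ShuffleFanout` and the job sizes are hypothetical, chosen only for illustration.

```scala
object ShuffleFanout {
  // Every reduce task fetches its slice from every map task,
  // so the number of map-to-reduce connections is M * R.
  // Long is used because the product can exceed the Int range.
  def connections(mapTasks: Int, reduceTasks: Int): Long =
    mapTasks.toLong * reduceTasks.toLong

  def main(args: Array[String]): Unit =
    println(connections(mapTasks = 2000, reduceTasks = 1000)) // prints 2000000
}
```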
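Why is shuffle so important?
- A shuffle is the process by which data is exchanged between partitions, and performance varies widely across shuffle strategies. Shuffle is currently a focus of optimization in every engine; in Spark, it is the foundation that lets the framework perform large-scale, complex data processing.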

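Shuffle operators

Common operators that trigger a shuffle
- repartition
  - coalesce, repartition (coalesce shuffles only when called with shuffle = true)
- ByKey
  - groupByKey, reduceByKey, aggregateByKey, combineByKey, sortByKey, sortBy
- Join
  - cogroup, join
- distinct
  - distinct
Operator usage example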

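// Assumes an existing SparkContext named sc, as provided by spark-shell.
val text = sc.textFile("mytextfile.txt")
val counts = text
  .flatMap(line => line.split(" "))  // split each line into words
  .map(word => (word, 1))            // pair each word with a count of 1
  .reduceByKey(_ + _)                // shuffle: sum the counts per word
counts.collect()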

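Spark's abstraction of shuffle: wide and narrow dependencies
- Narrow dependency: each partition of the parent RDD is depended on by at most one partition of the child RDD
- Wide dependency: each partition of the parent RDD may be depended on by multiple partitions of the child RDD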
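The two definitions above can be turned into a small check. The sketch below is a plain-Scala model rather than Spark's own Dependency classes: it assumes a dependency is described as a map from each child partition to the set of parent partitions it reads, and the names `DependencySketch` and `isNarrow` are made up for this example.

```scala
object DependencySketch {
  // childToParents(c) = the parent partitions that child partition c reads.
  // The dependency is narrow when no parent partition has more than one reader.
  def isNarrow(childToParents: Map[Int, Set[Int]]): Boolean = {
    // Count how many child partitions read each parent partition.
    val readersPerParent: Map[Int, Int] =
      childToParents.values.flatten
        .groupBy(identity)
        .map { case (parent, uses) => parent -> uses.size }
    readersPerParent.values.forall(_ <= 1)
  }

  // One-to-one, as with a map-style operator.
  val oneToOne: Map[Int, Set[Int]] = Map(0 -> Set(0), 1 -> Set(1))
  // Many-to-one, as when 4 partitions are merged into 2 without a shuffle.
  val merged: Map[Int, Set[Int]] = Map(0 -> Set(0, 1), 1 -> Set(2, 3))
  // Every child reads every parent, as with a ByKey-style shuffle.
  val shuffled: Map[Int, Set[Int]] = Map(0 -> Set(0, 1), 1 -> Set(0, 1))

  def main(args: Array[String]): Unit =
    println(Seq(oneToOne, merged, shuffled).map(isNarrow))
}
```

Note that the many-to-one case still counts as narrow under the definition above, because each parent partition has a single reader; only the last shape needs data from one parent partition to be split across several children.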

Dependencies inside operators

Shuffle Dependency

Constructor

- A single key-value pair RDD, i.e. RDD[Product2[K, V]]
- Partitioner (available as the partitioner property)
- Serializer
- Optional key ordering (of Scala's scala.math.Ordering type)
- Optional Aggregator
- mapSideCombine flag, which is disabled (i.e. false) by default

Partitioner

Two interface methods: numPartitions and getPartition

abstract class Partitioner extends Serializable {
    def numPartitions: Int            // total number of output partitions
    def getPartition(key: Any): Int   // partition index for a key, in [0, numPartitions)
}

Classic implementation: HashPartitioner

class HashPartitioner(partitions: Int) extends Partitioner {
    require(partitions >= 0, s"......")

    def numPartitions: Int = partitions

    // null keys go to partition 0; other keys are placed by hashCode,
    // kept non-negative because hashCode may be negative
    def getPartition(key: Any): Int = key match {
        case null => 0
        case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
    }
}
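The call to Utils.nonNegativeMod matters because the JVM's % operator keeps the sign of the dividend, so a negative hashCode would otherwise yield a negative partition index. The sketch below reimplements that helper in plain Scala so it can be run without Spark. The object name `HashPartitionSketch` is hypothetical, and the helper is written from the behaviour its name describes, not copied from the Spark source.

```scala
object HashPartitionSketch {
  // Modulo that always lands in [0, mod): shift negative remainders up by mod.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Same placement rule as the getPartition shown above.
  def getPartition(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _    => nonNegativeMod(key.hashCode, numPartitions)
  }

  def main(args: Array[String]): Unit = {
    println(-7 % 4)                // prints -3
    println(nonNegativeMod(-7, 4)) // prints 1
  }
}
```

Because the partition depends only on the key's hashCode, equal keys always land in the same partition, which is what lets a reduce task see every value for the keys it owns.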

Aggregator

createCombiner: creates the initial combiner from the first value seen for a key
mergeValue: merges one more value into an existing combiner
mergeCombiners: merges two combiners
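How the three functions cooperate can be sketched in plain Scala. The `Agg` case class and the two helper methods below are hypothetical stand-ins written for this note, not Spark's Aggregator itself. The example keeps a (sum, count) combiner per key, the usual building block for a per-key average.

```scala
object AggregatorSketch {
  // Stand-in holding the three functions listed above; C is the combiner type.
  final case class Agg[V, C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C)

  // Map side: fold the values of one partition into one combiner per key.
  def combineValuesByKey[K, V, C](partition: Seq[(K, V)], agg: Agg[V, C]): Map[K, C] =
    partition.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      val combined = acc.get(k) match {
        case Some(c) => agg.mergeValue(c, v)   // key seen before
        case None    => agg.createCombiner(v)  // first value for this key
      }
      acc.updated(k, combined)
    }

  // Reduce side: merge the per-partition combiners that share a key.
  def combineCombinersByKey[K, V, C](partials: Seq[Map[K, C]], agg: Agg[V, C]): Map[K, C] =
    partials.flatten.foldLeft(Map.empty[K, C]) { case (acc, (k, c)) =>
      acc.updated(k, acc.get(k).map(agg.mergeCombiners(_, c)).getOrElse(c))
    }

  // (sum, count) combiner over Int values.
  val sumCount: Agg[Int, (Int, Int)] = Agg[Int, (Int, Int)](
    v => (v, 1),
    (c, v) => (c._1 + v, c._2 + 1),
    (a, b) => (a._1 + b._1, a._2 + b._2))

  def main(args: Array[String]): Unit = {
    val p1 = combineValuesByKey(Seq("a" -> 1, "b" -> 2, "a" -> 3), sumCount)
    val p2 = combineValuesByKey(Seq("a" -> 5), sumCount)
    println(combineCombinersByKey(Seq(p1, p2), sumCount))
  }
}
```

Combining on the map side first means each partition ships one combiner per key instead of every raw value, which is the point of the mapSideCombine flag mentioned earlier.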