Spark Partitioning Strategies

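Spark data can be partitioned and spread across the machines of a cluster, and Spark uses a Partitioner class to assign records to partitions by key.

Spark ships two partitioner classes, HashPartitioner and RangePartitioner, both of which extend Partitioner. To customize the partitioning logic, create your own class that extends Partitioner.

Specifying a partitioning strategy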

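// Repartition by key before writing the output
rdd.partitionBy(new CustomPartitioner(10)).saveAsTextFile("...")
// Repartition once and keep the result in a new value for reuse
val partitioned1 = rdd1.partitionBy(new CustomPartitioner(10))
// Partition the other side the same way before joining
rdd2.partitionBy(new CustomPartitioner(10)).join(partitioned1)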

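Default partitioning strategy

If no partitioner is specified, Spark uses HashPartitioner by default.

Partitioner defines the method defaultPartitioner, which determines both the number of partitions of the RDD produced by a shuffle and which partitioner to use.

Determining the number of partitions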

val defaultNumPartitions = 
if (rdd.context.conf.contains("spark.default.parallelism")) {  
    rdd.context.defaultParallelism
} else {  
    rdds.map(_.partitions.length).max
}

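If the program sets spark.default.parallelism, that value is used as the number of partitions. Otherwise, the largest partition count among all upstream RDDs becomes the partition count of the resulting RDD.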
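The rule above can be sketched in plain Scala without a SparkContext. This is only an illustration: configuredParallelism and upstreamPartitionCounts are hypothetical stand-ins for spark.default.parallelism and rdds.map(_.partitions.length), not Spark APIs.

```scala
object PartitionCountSketch {
  // configuredParallelism: hypothetical stand-in for spark.default.parallelism (None when unset).
  // upstreamPartitionCounts: hypothetical stand-in for rdds.map(_.partitions.length).
  def defaultNumPartitions(configuredParallelism: Option[Int],
                           upstreamPartitionCounts: Seq[Int]): Int =
    // Prefer the configured value; otherwise fall back to the largest upstream count.
    configuredParallelism.getOrElse(upstreamPartitionCounts.max)
}
```

With upstream RDDs of 4 and 16 partitions, the sketch yields 8 when the parallelism is configured as 8, and 16 when it is not configured.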

Choosing the partitioner

val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))

val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
    Some(hasPartitioner.maxBy(_.partitions.length))
} else {
    None
}

if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
    defaultNumPartitions < hasMaxPartitioner.get.getNumPartitions)) {
    hasMaxPartitioner.get.partitioner.get
} else {
    new HashPartitioner(defaultNumPartitions)
}

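Among all upstream RDDs, Spark keeps those that already have a partitioner and picks the one with the most partitions. Its partitioner is reused if its partition count is within one order of magnitude of the largest partition count across all upstream RDDs, or if the default number of partitions is smaller than its partition count. Otherwise Spark falls back to a HashPartitioner with the default number of partitions.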
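The decision can be illustrated with plain integers. The sketch below assumes the "within one order of magnitude" check is a difference of base-10 logarithms below 1; the object and parameter names are hypothetical and only mirror the snippet above.

```scala
object PartitionerChoiceSketch {
  // Assumption: the candidate is eligible when
  // log10(maxPartitions) - log10(candidatePartitions) < 1.
  def isEligible(candidatePartitions: Int, maxPartitions: Int): Boolean =
    math.log10(maxPartitions) - math.log10(candidatePartitions) < 1

  // candidatePartitions: partition count of the largest RDD that has a partitioner, if any.
  // Returns true when the existing partitioner would be reused,
  // false when a new HashPartitioner(defaultNumPartitions) would be created.
  def reusesExisting(candidatePartitions: Option[Int],
                     maxPartitions: Int,
                     defaultNumPartitions: Int): Boolean =
    candidatePartitions.exists { n =>
      isEligible(n, maxPartitions) || defaultNumPartitions < n
    }
}
```

For example, a partitioned RDD with 100 partitions next to an unpartitioned one with 500 is eligible, while 10 partitions next to 1000 is not, unless the default number of partitions is below 10.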

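Partitioner propagation

The partitioner used in one shuffle is not necessarily carried over to the next one. If the RDD produced by a shuffle is then passed through a transformation that can change keys (map, for example), its partitioner is reset to None, so a later shuffle that should use the same strategy has to specify it again.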

HashPartitioner

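HashPartitioner computes the key's hashCode and takes it modulo the number of partitions, adding the number of partitions if the result is negative. Null keys are supported and always map to partition 0.

With some key distributions, HashPartitioner sends a disproportionate share of the data to a few partitions, which causes data skew.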

def getPartition(key: Any): Int = key match {  
    case null => 0  
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
}

def nonNegativeMod(x: Int, mod: Int): Int = {  
    val rawMod = x % mod  
    rawMod + (if (rawMod < 0) mod else 0)
}
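A small standalone sketch of the same logic makes the two points above concrete: negative hash codes are shifted into range, and keys whose hash codes share a common factor with the partition count pile up in one partition. The histogram helper and the object name are hypothetical additions for illustration.

```scala
object HashPartitionSketch {
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  def partitionOf(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _ => nonNegativeMod(key.hashCode, numPartitions)
  }

  // Hypothetical helper: count how many keys land in each partition.
  def histogram(keys: Seq[Any], numPartitions: Int): Map[Int, Int] =
    keys.groupBy(partitionOf(_, numPartitions)).map { case (p, ks) => p -> ks.size }
}
```

Here nonNegativeMod(-7, 3) is 2, because -7 % 3 is -1 on the JVM. The integer keys 0, 10, 20, 30, 40 with 10 partitions all land in partition 0, a simple case of skew.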

RangePartitioner

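RangePartitioner tries to keep partition sizes balanced by assigning contiguous key ranges to partitions so that each holds a similar amount of data. It samples the RDD, adjusts the key ranges according to the distribution of the sample, and computes the largest key of each partition to build rangeBounds.

RangePartitioner is mainly used by sort-related APIs such as sortByKey.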

//samplePointsPerPartitionHint=20
val sampleSize = math.min(samplePointsPerPartitionHint.toDouble * partitions, 1e6)
val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)

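The target sample size is 20 times the number of partitions, capped at 1e6. Each partition of the input RDD is then sampled for sampleSizePerPartition items, which over-samples by a factor of 3 relative to an even split.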
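The arithmetic can be checked with a short sketch. The names targetPartitions and inputPartitions are hypothetical stand-ins for partitions and rdd.partitions.length in the snippet above.

```scala
object SampleSizeSketch {
  // targetPartitions: partitions requested for the RangePartitioner (stand-in for `partitions`).
  def sampleSize(targetPartitions: Int, samplePointsPerPartitionHint: Int = 20): Double =
    math.min(samplePointsPerPartitionHint.toDouble * targetPartitions, 1e6)

  // inputPartitions: stand-in for rdd.partitions.length of the RDD being sampled.
  def sampleSizePerPartition(targetPartitions: Int, inputPartitions: Int): Int =
    math.ceil(3.0 * sampleSize(targetPartitions) / inputPartitions).toInt
}
```

With 100 target partitions and 200 input partitions, the target sample size is 2000 and each input partition is asked for ceil(3 * 2000 / 200) = 30 samples. With 100000 target partitions the cap applies and the target sample size stays at 1e6.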

sketched.foreach { case (idx, n, sample) =>
  if (fraction * n > sampleSizePerPartition) {
    imbalancedPartitions += idx
  } else {
    // The weight is 1 over the sampling probability.
    val weight = (n.toDouble / sample.length).toFloat
    for (key <- sample) {
      candidates += ((key, weight))
    }
  }
}

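For each partition, Spark computes how many samples it should contribute (fraction * n, where n is the partition's record count and fraction is defined outside this snippet). If that number exceeds sampleSizePerPartition, the partition is marked as imbalanced and is sampled again so that the sample reflects the data distribution. Otherwise its sampled keys become candidates, each weighted by the inverse of the sampling probability.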
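The imbalance check can be sketched on plain record counts. This assumes fraction is the overall sampling ratio, that is, the target sample size divided by the total record count and capped at 1.0; the object and function names are hypothetical.

```scala
object ImbalanceSketch {
  // counts: number of records in each input partition.
  // Returns the indices of partitions that would be marked as imbalanced.
  def imbalanced(counts: Seq[Long], sampleSize: Double, sampleSizePerPartition: Int): Seq[Int] = {
    val numItems = counts.sum
    // Assumption: overall sampling ratio, capped at 1.0.
    val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
    counts.zipWithIndex.collect {
      case (n, idx) if fraction * n > sampleSizePerPartition => idx
    }
  }
}
```

Take 10 input partitions and 10 target partitions, so the target sample size is 200 and sampleSizePerPartition is 60. If nine partitions hold 1000 records and the last one holds 91000, fraction is 0.002 and the last partition should contribute 182 samples, more than 60, so only it is flagged.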

Finally, rangeBounds determines the partition of each key.

def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    var partition = 0
    if (rangeBounds.length <= 128) {
      // If we have less than 128 partitions naive search
      while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
        partition += 1
      }
    } else {
      // Determine which binary search method to use only once.
      partition = binarySearch(rangeBounds, k)
      // binarySearch either returns the match location or -[insertion point]-1
      if (partition < 0) {
        partition = -partition-1
      }
      if (partition > rangeBounds.length) {
        partition = rangeBounds.length
      }
    }
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }

When rangeBounds has at most 128 entries, the partition is found with a linear scan. With more entries, binary search is used instead.
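Both lookup paths should return the same partition for the same key. The sketch below shows this for ascending Int bounds. It uses java.util.Arrays.binarySearch as a stand-in for the binarySearch helper in the snippet above, and the object name is hypothetical.

```scala
import java.util.Arrays

object RangeLookupSketch {
  // Linear scan: advance while the key is greater than the current bound.
  def linear(bounds: Array[Int], key: Int): Int = {
    var partition = 0
    while (partition < bounds.length && key > bounds(partition)) partition += 1
    partition
  }

  // Binary search: a miss is reported as -(insertion point) - 1.
  def binary(bounds: Array[Int], key: Int): Int = {
    val found = Arrays.binarySearch(bounds, key)
    val partition = if (found < 0) -found - 1 else found
    math.min(partition, bounds.length)
  }
}
```

With bounds 10, 20, 30 there are four partitions: keys up to 10 go to partition 0, keys in (10, 20] to partition 1, keys in (20, 30] to partition 2, and larger keys to partition 3.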

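Custom partitioner

Extend Partitioner and implement these methods:

numPartitions: returns the number of partitions;

getPartition: computes the partition for a key according to custom logic; the result must be in the range 0 to numPartitions - 1;

For example, the following is a hand-written version of HashPartitioner.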

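import org.apache.spark.Partitioner

class CustomPartitioner(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int = {
    // hashCode may be negative, so shift negative remainders into [0, numPartitions)
    val partition = key.hashCode() % numPartitions
    if (partition < 0) partition + numPartitions else partition
  }

  // Instances with the same partition count compare equal, so RDDs partitioned
  // by separate instances can be recognized as co-partitioned
  override def equals(other: Any): Boolean = other match {
    case p: CustomPartitioner => p.numPartitions == numPartitions
    case _ => false
  }

  override def hashCode: Int = numPartitions
}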