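1. Overview

Spark supports hash partitioning (currently the default), range partitioning, and user-defined partitioning.

Partitioner: the partitioner determines the number of partitions in an RDD, which partition each record lands in after a shuffle, and the number of reduce tasks.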

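Notes:

(1) Only key-value RDDs have a partitioner; for a non-key-value RDD the partitioner is None.
(2) Partition IDs in every RDD range from 0 to numPartitions - 1, and a record's partition ID determines which partition it belongs to.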

2. Hash partitioning

class HashPartitioner(partitions: Int) extends Partitioner {
    require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")
    
    def numPartitions: Int = partitions
    
    def getPartition(key: Any): Int = key match {
        case null => 0
        case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
    }
    
    override def equals(other: Any): Boolean = other match {
        case h: HashPartitioner =>
            h.numPartitions == numPartitions
        case _ =>
            false
    }
    
    override def hashCode: Int = numPartitions
}
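
To make the getPartition logic above easier to follow, here is a minimal sketch in plain Scala that does not depend on Spark. The object name HashPartitionSketch and the helper partitionFor are hypothetical names used only for illustration, and nonNegativeMod is a local reimplementation of the idea behind Utils.nonNegativeMod: take the remainder and shift it into the range 0 to mod - 1 when it is negative.

```scala
object HashPartitionSketch {
  // Local stand-in for the idea behind Utils.nonNegativeMod:
  // Scala's % can return a negative remainder, so shift it into [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Hypothetical helper: same rule as getPartition above.
  // A null key goes to partition 0; any other key is placed by its hashCode.
  def partitionFor(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _    => nonNegativeMod(key.hashCode, numPartitions)
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq[Any](1, 2, 3, -7, null)
    // Every result is a valid partition ID between 0 and numPartitions - 1.
    keys.foreach(k => println(s"$k -> ${partitionFor(k, 3)}"))
  }
}
```

Because the partition is derived only from the key's hashCode, records with equal keys always land in the same partition, but nothing guarantees that the partitions end up evenly sized.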

3. Range partitioning
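
As a rough illustration of the idea only: range partitioning assigns sortable keys to partitions by comparing them against ordered boundary values, so each partition holds one contiguous range of keys. Spark's RangePartitioner derives its boundaries by sampling the RDD; the sketch below instead hard-codes hypothetical bounds (10 and 20). The names RangePartitionSketch, bounds, and partitionFor are assumptions made for this example, not Spark APIs.

```scala
object RangePartitionSketch {
  // Hypothetical, hard-coded upper bounds (an assumption for illustration):
  // keys <= 10 -> partition 0, keys in (10, 20] -> partition 1, the rest -> partition 2.
  val bounds: Array[Int] = Array(10, 20)

  // Find the first bound that the key does not exceed;
  // keys larger than every bound go to the last partition.
  def partitionFor(key: Int): Int = {
    val idx = bounds.indexWhere(bound => key <= bound)
    if (idx >= 0) idx else bounds.length
  }

  def main(args: Array[String]): Unit = {
    Seq(5, 10, 11, 25).foreach(k => println(s"$k -> ${partitionFor(k)}"))
  }
}
```

With n bounds there are n + 1 partitions, and every key in partition i is smaller than every key in partition i + 1, which is what makes this scheme useful for sorting.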

4. Custom partitioning

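import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object KeyValue01_partitionBy {
    def main(args: Array[String]): Unit = {
        // 1. Create the SparkConf and set the application name
        val conf: SparkConf = new SparkConf().setAppName("SparkCoreTest").setMaster("local[*]")

        // 2. Create the SparkContext, the entry point for submitting a Spark application
        val sc: SparkContext = new SparkContext(conf)

        // 3. Business logic
        // 3.1 Create the first RDD
        val rdd: RDD[(Int, String)] = sc.makeRDD(Array((1, "aaa"), (2, "bbb"), (3, "ccc")), 3)
        // 3.2 Repartition it with the custom partitioner
        val rdd3: RDD[(Int, String)] = rdd.partitionBy(new MyPartitioner(2))

        // 4. Print the data held in each partition
        val indexRdd: RDD[(Int, String)] = rdd3.mapPartitionsWithIndex(
            (index, datas) => {
                // Print each record with its partition index while passing it through.
                // map is used instead of foreach because foreach would exhaust the
                // iterator and leave the returned partition empty.
                datas.map(data => {
                    println(s"$index=>$data")
                    data
                })
            }
        )

        indexRdd.collect()

        // 5. Stop the SparkContext
        sc.stop()
    }
}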

// Custom partitioner
class MyPartitioner(num: Int) extends Partitioner {
    // Number of partitions
    override def numPartitions: Int = num

    // Partitioning logic: even Int keys go to partition 0, odd Int keys go to
    // partition 1, and keys of any other type fall back to partition 0
    override def getPartition(key: Any): Int = key match {
        case keyInt: Int => if (keyInt % 2 == 0) 0 else 1
        case _ => 0
    }
}
