The following code:
package com.baixw.study

import org.apache.spark.rdd.RDD
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object JoinDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName(this.getClass.getCanonicalName.init).setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    val random = scala.util.Random
    // col1: 49 (key, value) pairs with random keys in [0, 10); col2: a small (id, city) table
    val col1 = Range(1, 50).map(idx => (random.nextInt(10), s"user$idx"))
    val col2 = Array((0, "BJ"), (1, "SH"), (2, "GZ"), (3, "SZ"), (4, "TJ"), (5, "CQ"), (6, "HZ"), (7, "NJ"), (8, "WH"), (0, "CD"))
    val rdd1: RDD[(Int, String)] = sc.makeRDD(col1)
    val rdd2: RDD[(Int, String)] = sc.makeRDD(col2)

    // Join 1: neither rdd1 nor rdd2 has a partitioner
    val rdd3: RDD[(Int, (String, String))] = rdd1.join(rdd2)
    println(rdd3.dependencies)

    // Join 2: both sides are pre-partitioned with the same HashPartitioner(3)
    val rdd4: RDD[(Int, (String, String))] = rdd1.partitionBy(new HashPartitioner(3)).join(rdd2.partitionBy(new HashPartitioner(3)))
    println(rdd4.dependencies)

    sc.stop()
  }
}
Question 1: What do the two print statements output? Is the corresponding dependency a wide or a narrow dependency, and why?
// println(rdd3.dependencies)
List(org.apache.spark.OneToOneDependency@4c9e38)
rdd3: the first join is a wide dependency
// println(rdd4.dependencies)
List(org.apache.spark.OneToOneDependency@6dd93a21)
rdd4: the second join is a narrow dependency
Note that both statements print a OneToOneDependency: rdd3 and rdd4 are the MapPartitionsRDDs produced by the flatMapValues step inside join, and their direct dependency on their parent is always one-to-one. Whether the join itself is wide or narrow is decided further up the lineage, by the dependencies of the underlying CoGroupedRDD on rdd1 and rdd2 (see the source walkthrough in Question 2).
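To actually see the difference, you can walk two levels down the lineage to the CoGroupedRDD that the join builds. A rough sketch that could be added to the main method above after rdd4 is defined; cogroupDeps is a hypothetical helper introduced here, and the two hops assume the join = cogroup + mapValues + flatMapValues chain shown in Question 2:

    // Hop from the join result (flatMapValues step) over the mapValues step
    // down to the CoGroupedRDD, and print that RDD's dependencies on rdd1/rdd2.
    def cogroupDeps(joined: RDD[_]): Seq[org.apache.spark.Dependency[_]] =
      joined.dependencies.head.rdd   // MapPartitionsRDD from flatMapValues
        .dependencies.head.rdd       // MapPartitionsRDD from mapValues (inside cogroup)
        .dependencies                // CoGroupedRDD -> its two parents

    println(cogroupDeps(rdd3))  // expected: two ShuffleDependency entries  -> wide
    println(cogroupDeps(rdd4))  // expected: two OneToOneDependency entries -> narrow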
Question 2: When is a join operation a wide dependency, and when is it a narrow dependency?
Look at the source of join:
  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Performs a hash join across the cluster.
   */
  def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {
    join(other, defaultPartitioner(self, other))
  }
join uses the partitioner returned by defaultPartitioner(self, other). From its logic (source below), the first join should end up with as many partitions as the machine has cores (local[*]), while the second join keeps the 3 partitions we set with HashPartitioner(3):
  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val rdds = (Seq(rdd) ++ others)
    val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))

    val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
      Some(hasPartitioner.maxBy(_.partitions.length))
    } else {
      None
    }

    val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
      rdd.context.defaultParallelism
    } else {
      rdds.map(_.partitions.length).max
    }

    // If the existing max partitioner is an eligible one, or its partitions number is larger
    // than the default number of partitions, use the existing partitioner.
    if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
        defaultNumPartitions < hasMaxPartitioner.get.getNumPartitions)) {
      hasMaxPartitioner.get.partitioner.get
    } else {
      new HashPartitioner(defaultNumPartitions)
    }
  }
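A quick way to check this prediction (a sketch; these lines would go in the main method above after rdd4 is defined, and the first count depends on how many cores local[*] gives you):

    // Join 1: no parent has a partitioner, so the else-branch above builds
    // new HashPartitioner(defaultNumPartitions) -- the max parent partition count,
    // which for makeRDD under local[*] is the core count.
    println(rdd3.getNumPartitions)   // e.g. 8 on an 8-core machine
    println(rdd3.partitioner)        // a HashPartitioner covering that many partitions

    // Join 2: both parents already carry HashPartitioner(3), so the existing
    // partitioner is returned and reused.
    println(rdd4.getNumPartitions)   // 3
    println(rdd4.partitioner)        // the HashPartitioner(3) we set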
Step into the two-argument join that it calls (join(other, defaultPartitioner(self, other))):
  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
   */
  def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
    )
  }
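So join is nothing more than cogroup followed by a flatMapValues that pairs every value of a key from one side with every value of that key from the other side. A hand-expanded equivalent of rdd4 from the listing at the top (a sketch; part and manualJoin are names introduced here for illustration):

    val part = new HashPartitioner(3)
    // cogroup gathers both value lists per key; flatMapValues expands them pairwise,
    // which is exactly what the two-argument join above does.
    val manualJoin: RDD[(Int, (String, String))] =
      rdd1.partitionBy(part)
        .cogroup(rdd2.partitionBy(part), part)
        .flatMapValues { case (vs, ws) =>
          for (v <- vs.iterator; w <- ws.iterator) yield (v, w)
        }
    println(manualJoin.dependencies)  // same shape as rdd4: List(OneToOneDependency)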
Step into the cogroup method used by this join implementation:
  /**
   * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
   * list of values for that key in `this` as well as `other`.
   */
  def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
    if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
    val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
    cg.mapValues { case Array(vs, w1s) =>
      (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
    }
  }
The core of cogroup is the CoGroupedRDD, constructed with (Seq(rdd1, rdd2), partitioner). For the first join, neither RDD has a partitioner, so at this step both of them must first be shuffled according to the partitioner that is passed in, which makes the first join a wide dependency. For the second join, both RDDs are already partitioned with the same HashPartitioner(3) that the join chooses, so no further shuffle is needed, and the second join is a narrow dependency.
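The check CoGroupedRDD performs for each parent is essentially partitioner equality with the partitioner it was given (rdd.partitioner == Some(part) inside its getDependencies). A small helper mirroring that decision, reusing rdd1/rdd2 from the listing at the top (a sketch; describeJoinDeps is a hypothetical name, not Spark API):

    import org.apache.spark.Partitioner

    // For each parent of the CoGroupedRDD, report which kind of dependency it would get:
    // a parent whose partitioner equals the chosen one keeps its partitions (narrow);
    // any other parent must first be shuffled by `part` (wide).
    def describeJoinDeps(parents: Seq[RDD[_]], part: Partitioner): Seq[String] =
      parents.map { r =>
        if (r.partitioner.contains(part)) "OneToOneDependency (narrow)"
        else "ShuffleDependency (wide)"
      }

    // Join 1: neither parent has a partitioner -> both sides are wide.
    // (new HashPartitioner(rdd1.getNumPartitions) stands in for the partitioner
    //  defaultPartitioner would create here.)
    println(describeJoinDeps(Seq(rdd1, rdd2), new HashPartitioner(rdd1.getNumPartitions)))
    // Join 2: both parents use HashPartitioner(3), equal to the chosen partitioner -> narrow.
    val p = new HashPartitioner(3)
    println(describeJoinDeps(Seq(rdd1.partitionBy(p), rdd2.partitionBy(p)), p))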
Therefore: if the two RDDs being joined already carry partitioners that match the one the join uses and have the same number of partitions, equal keys already sit in the same partition and the join is a narrow dependency. Conversely, if the RDDs to be joined have no partitioner, or their partition counts differ, a shuffle is required during the join and it is a wide dependency.
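A quick way to see the "different partition counts" half of this conclusion: pre-partition the two sides with different sizes and inspect the CoGroupedRDD's dependencies again (a sketch reusing the hypothetical cogroupDeps helper from Question 1; here defaultPartitioner should pick the larger HashPartitioner(4), so the HashPartitioner(3) side still has to shuffle):

    val mixed = rdd1.partitionBy(new HashPartitioner(3))
      .join(rdd2.partitionBy(new HashPartitioner(4)))
    // Expected: the HashPartitioner(3) side gets a ShuffleDependency, the
    // HashPartitioner(4) side gets a OneToOneDependency -- the join still shuffles.
    println(cogroupDeps(mixed))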