1. Overview
RDDs fall into two broad categories: Value types and Key-Value types.
The operators covered so far act on Value-type RDDs; in practice, Key-Value RDDs, also called PairRDDs, are used even more often.
Value-type RDD operations mostly live in RDD.scala; Key-Value RDD operations live in PairRDDFunctions.scala.
They are connected by an implicit conversion in RDD.scala:
// companion object
object RDD {
private[spark] val CHECKPOINT_ALL_MARKED_ANCESTORS =
"spark.checkpoint.checkpointAllMarkedAncestors"
// The following implicit functions were in SparkContext before 1.3 and users had to
// `import SparkContext._` to enable them. Now we move them here to make the compiler find
// them automatically. However, we still keep the old functions in SparkContext for backward
// compatibility and forward to the following functions directly.
implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
(implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
new PairRDDFunctions(rdd)
}
... ...
}
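The same enrichment pattern can be mimicked on plain collections. The sketch below is illustrative only (the names `PairOps`, `seqToPairOps`, and the simplified `reduceByKey` are hypothetical, not Spark's code): an implicit conversion makes extra key-value methods appear on an ordinary `Seq[(K, V)]`, just as `rddToPairRDDFunctions` does for `RDD[(K, V)]`.

```scala
import scala.language.implicitConversions

// Illustrative stand-in for PairRDDFunctions: extra methods for pair data.
class PairOps[K, V](data: Seq[(K, V)]) {
  def reduceByKey(f: (V, V) => V): Map[K, V] =
    data.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) }
}

// Illustrative stand-in for rddToPairRDDFunctions: the compiler inserts this
// conversion whenever reduceByKey is called on a Seq of pairs.
implicit def seqToPairOps[K, V](data: Seq[(K, V)]): PairOps[K, V] =
  new PairOps(data)

val wordCounts = Seq(("a", 1), ("b", 1), ("a", 1)).reduceByKey(_ + _)
// wordCounts == Map("a" -> 2, "b" -> 1)
```

This is why `import SparkContext._` is no longer needed: the conversion now lives in the `RDD` companion object, which the compiler searches automatically.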
Most of the operators introduced earlier also work on PairRDDs. Beyond those, PairRDDs have Transformation and Action operators of their own.
Creating a PairRDD:
scala> val arr = (1 to 10).toArray
arr: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> val arr1 = arr.map(x => (x, x*10, x*100))
arr1: Array[(Int, Int, Int)] = Array((1,10,100), (2,20,200), (3,30,300), (4,40,400), (5,50,500), (6,60,600), (7,70,700), (8,80,800), (9,90,900), (10,100,1000))
// rdd1 is NOT a PairRDD: its elements are 3-tuples, not (key, value) pairs
scala> val rdd1 = sc.makeRDD(arr1)
rdd1: org.apache.spark.rdd.RDD[(Int, Int, Int)] = ParallelCollectionRDD[102] at makeRDD at <console>:26
// rdd2 IS a PairRDD: its elements are (key, value) pairs
scala> val arr2 = arr.map(x => (x, (x*10, x*100)))
arr2: Array[(Int, (Int, Int))] = Array((1,(10,100)), (2,(20,200)), (3,(30,300)), (4,(40,400)), (5,(50,500)), (6,(60,600)), (7,(70,700)), (8,(80,800)), (9,(90,900)), (10,(100,1000)))
scala> val rdd2 = sc.makeRDD(arr2)
rdd2: org.apache.spark.rdd.RDD[(Int, (Int, Int))] = ParallelCollectionRDD[103] at makeRDD at <console>:26
2. Transformation operations
(1) map-like operations
mapValues / flatMapValues / keys / values can all be expressed with map; they are convenience shortcuts.
scala> val a = sc.parallelize(List((1,2),(3,4),(5,6)))
a: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[104] at parallelize at <console>:24
// mapValues is more concise
scala> val b = a.mapValues(x=>1 to x)
b: org.apache.spark.rdd.RDD[(Int, scala.collection.immutable.Range.Inclusive)] = MapPartitionsRDD[105] at mapValues at <console>:25
scala> b.collect
res60: Array[(Int, scala.collection.immutable.Range.Inclusive)] = Array((1,Range 1 to 2), (3,Range 1 to 4), (5,Range 1 to 6))
// the same result implemented with map
scala> val b = a.map(x => (x._1, 1 to x._2))
b: org.apache.spark.rdd.RDD[(Int, scala.collection.immutable.Range.Inclusive)] = MapPartitionsRDD[106] at map at <console>:25
scala> b.collect
res61: Array[(Int, scala.collection.immutable.Range.Inclusive)] = Array((1,Range 1 to 2), (3,Range 1 to 4), (5,Range 1 to 6))
scala> val b = a.map{case (k, v) => (k, 1 to v)}
b: org.apache.spark.rdd.RDD[(Int, scala.collection.immutable.Range.Inclusive)] = MapPartitionsRDD[107] at map at <console>:25
scala> b.collect
res62: Array[(Int, scala.collection.immutable.Range.Inclusive)] = Array((1,Range 1 to 2), (3,Range 1 to 4), (5,Range 1 to 6))
// flatMapValues flattens the values
scala> val c = a.flatMapValues(x=>1 to x)
c: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[108] at flatMapValues at <console>:25
scala> c.collect
res63: Array[(Int, Int)] = Array((1,1), (1,2), (3,1), (3,2), (3,3), (3,4), (5,1), (5,2), (5,3), (5,4), (5,5), (5,6))
scala> val c = a.mapValues(x=>1 to x).flatMap{case (k, v) => v.map(x => (k, x))}
c: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[110] at flatMap at <console>:25
scala> c.collect
res64: Array[(Int, Int)] = Array((1,1), (1,2), (3,1), (3,2), (3,3), (3,4), (5,1), (5,2), (5,3), (5,4), (5,5), (5,6))
scala> c.keys
res65: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[111] at keys at <console>:26
scala> c.values
res66: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[112] at values at <console>:26
scala> c.map{case (k, v) => k}.collect
res67: Array[Int] = Array(1, 1, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5)
scala> c.map{case (k, _) => k}.collect
res68: Array[Int] = Array(1, 1, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5)
scala> c.map{case (_, v) => v}.collect
res69: Array[Int] = Array(1, 2, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6)
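The four shortcuts are easy to restate on a plain Scala list, which makes the equivalences explicit (a local analogue, not RDD code):

```scala
val a = List((1, 2), (3, 4), (5, 6))

val mapped = a.map { case (k, v) => (k, 1 to v) }              // like mapValues
val flat   = a.flatMap { case (k, v) => (1 to v).map((k, _)) } // like flatMapValues
val ks     = a.map(_._1)                                       // like keys
val vs     = a.map(_._2)                                       // like values
// flat starts with (1,1), (1,2), (3,1), ...
```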
(2) Aggregation operations [important]
PairRDD (k, v) aggregations are used very widely.
Operators: groupByKey / reduceByKey / foldByKey / aggregateByKey
Underlying implementation: combineByKey (old) / combineByKeyWithClassTag (new)
Operator: subtractByKey — like subtract, but removes elements whose key also appears in the other RDD.
A small case study:
Given the data ("spark", 12), ("hadoop", 26), ("hadoop", 23), ("spark", 15), ("scala", 26), ("spark", 25), ("spark", 23), ("hadoop", 16), ("scala", 24), ("spark", 16), where each key is a book title and each value is one day's sales of that book, compute the average value per key, i.e. each book's average daily sales.
scala> val rdd = sc.makeRDD(Array(("spark", 12), ("hadoop", 26), ("hadoop", 23), ("spark", 15),("scala", 26), ("spark", 25), ("spark", 23), ("hadoop", 16), ("scala", 24), ("spark", 16)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[116] at makeRDD at <console>:24
// groupByKey
scala> rdd.groupByKey().map(x=>(x._1, x._2.sum.toDouble/x._2.size)).collect
res70: Array[(String, Double)] = Array((scala,25.0), (spark,18.2), (hadoop,21.666666666666668))
scala> rdd.groupByKey().map{case (k, v) => (k, v.sum.toDouble/v.size)}.collect
res71: Array[(String, Double)] = Array((scala,25.0), (spark,18.2), (hadoop,21.666666666666668))
scala> rdd.groupByKey.mapValues(v => v.sum.toDouble/v.size).collect
res72: Array[(String, Double)] = Array((scala,25.0), (spark,18.2), (hadoop,21.666666666666668))
// reduceByKey
scala> rdd.mapValues((_, 1)).
| reduceByKey((x, y)=> (x._1+y._1, x._2+y._2)).
| mapValues(x => (x._1.toDouble / x._2)).
| collect()
res73: Array[(String, Double)] = Array((scala,25.0), (spark,18.2), (hadoop,21.666666666666668))
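The heart of the reduceByKey solution is pairing each value with a count, so a single associative merge can carry both the running sum and the running count. The same trick on a plain list (a local sketch, not RDD code, over a subset of the data):

```scala
val sales = List(("spark", 12), ("hadoop", 26), ("hadoop", 23), ("spark", 15))

val avg = sales
  .map { case (k, v) => (k, (v, 1)) }                 // value -> (sum, count)
  .groupBy(_._1)
  .map { case (k, kvs) =>
    // merge the (sum, count) pairs; this function is associative, which is
    // exactly what reduceByKey requires
    val (sum, cnt) = kvs.map(_._2).reduce((x, y) => (x._1 + y._1, x._2 + y._2))
    (k, sum.toDouble / cnt)
  }
// avg == Map("spark" -> 13.5, "hadoop" -> 24.5)
```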
// foldByKey
scala> rdd.mapValues((_, 1)).foldByKey((0, 0))((x, y) => {
| (x._1+y._1, x._2+y._2)
| }).mapValues(x=>x._1.toDouble/x._2).collect
res74: Array[(String, Double)] = Array((scala,25.0), (spark,18.2), (hadoop,21.666666666666668))
// aggregateByKey
// aggregateByKey => zero value + within-partition merge function + cross-partition merge function
scala> rdd.mapValues((_, 1)).
| aggregateByKey((0,0))(
| (x, y) => (x._1 + y._1, x._2 + y._2),
| (a, b) => (a._1 + b._1, a._2 + b._2)
| ).mapValues(x=>x._1.toDouble / x._2).
| collect
res75: Array[(String, Double)] = Array((scala,25.0), (spark,18.2), (hadoop,21.666666666666668))
// the zero value (a tuple) need not match the type of the RDD's values (Int)
scala> rdd.aggregateByKey((0, 0))(
| (x, y) => {println(s"x=$x, y=$y"); (x._1 + y, x._2 + 1)},
| (a, b) => {println(s"a=$a, b=$b"); (a._1 + b._1, a._2 + b._2)}
| ).mapValues(x=>x._1.toDouble/x._2).collect
x=(0,0), y=12
x=(0,0), y=26
x=(26,1), y=23
x=(12,1), y=15
x=(0,0), y=26
x=(27,2), y=25
x=(52,3), y=23
x=(49,2), y=16
x=(26,1), y=24
x=(75,4), y=16
res76: Array[(String, Double)] = Array((scala,25.0), (spark,18.2), (hadoop,21.666666666666668))
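The debug output above shows the two phases interleaved. They can be simulated locally (an illustrative sketch with a hypothetical two-partition split; none of these names come from Spark): `seqOp` folds raw values into the zero inside each partition, then `combOp` merges the per-partition results.

```scala
val zero   = (0, 0)                                             // (sum, count)
val seqOp  = (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1)
val combOp = (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)

val partitions = List(
  List(("spark", 12), ("hadoop", 26)),   // hypothetical partition 0
  List(("spark", 15), ("hadoop", 23))    // hypothetical partition 1
)

// phase 1: within each partition, fold values into the zero with seqOp
val perPartition = partitions.map(
  _.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).foldLeft(zero)(seqOp)) }
)

// phase 2: across partitions, merge the partial results with combOp
val merged = perPartition.flatten.groupBy(_._1).map {
  case (k, kvs) => (k, kvs.map(_._2).reduce(combOp))
}
// merged == Map("spark" -> (27, 2), "hadoop" -> (49, 2))
```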
// the within-partition and cross-partition merges can use different logic; this ArrayBuffer approach is inefficient!
scala> rdd.aggregateByKey(scala.collection.mutable.ArrayBuffer[Int]())(
| (x, y) => {x.append(y); x},
| (a, b) => {a++b}
| ).mapValues(v => v.sum.toDouble/v.size).collect
res77: Array[(String, Double)] = Array((scala,25.0), (spark,18.2), (hadoop,21.666666666666668))
// combineByKey (understanding it is enough)
scala> rdd.combineByKey(
| (x: Int) => {println(s"x=$x"); (x,1)},
| (x: (Int, Int), y: Int) => {println(s"x=$x, y=$y");(x._1+y, x._2+1)},
| (a: (Int, Int), b: (Int, Int)) => {println(s"a=$a, b=$b");(a._1+b._1, a._2+b._2)}
| ).mapValues(x=>x._1.toDouble/x._2).collect
x=12
x=26
x=(26,1), y=23
x=(12,1), y=15
x=26
x=(27,2), y=25
x=(52,3), y=23
x=(49,2), y=16
x=(26,1), y=24
x=(75,4), y=16
res78: Array[(String, Double)] = Array((scala,25.0), (spark,18.2), (hadoop,21.666666666666668))
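The trace above shows `createCombiner` firing on the first value seen for a key and `mergeValue` on every later one. A local sketch of the same two functions (the helper `combineByKeyLocal` is hypothetical; it treats the data as a single partition, so the third function, `mergeCombiners`, is not exercised here):

```scala
def combineByKeyLocal[K, V, C](
    data: Seq[(K, V)],
    createCombiner: V => C,     // first value for a key -> combiner
    mergeValue: (C, V) => C     // later values folded into the combiner
): Map[K, C] =
  data.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
    acc.updated(k, acc.get(k) match {
      case Some(c) => mergeValue(c, v)
      case None    => createCombiner(v)
    })
  }

val avgInput = List(("spark", 12), ("spark", 15), ("hadoop", 26))
val combined = combineByKeyLocal[String, Int, (Int, Int)](
  avgInput, v => (v, 1), (c, v) => (c._1 + v, c._2 + 1))
// combined == Map("spark" -> (27, 2), "hadoop" -> (26, 1))
```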
// subtractByKey
scala> val rdd1 = sc.makeRDD(Array(("spark", 12), ("hadoop", 26), ("hadoop", 23), ("spark", 15)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[138] at makeRDD at <console>:24
scala> val rdd2 = sc.makeRDD(Array(("spark", 100), ("hadoop", 300)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[139] at makeRDD at <console>:24
scala> rdd1.subtractByKey(rdd2).collect()
res79: Array[(String, Int)] = Array()
// subtractByKey
scala> val rdd = sc.makeRDD(Array(("a",1), ("b",2), ("c",3), ("a",5), ("d",5)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[141] at makeRDD at <console>:24
scala> val other = sc.makeRDD(Array(("a",10), ("b",20), ("c",30)))
other: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[142] at makeRDD at <console>:24
scala> rdd.subtractByKey(other).collect()
res80: Array[(String, Int)] = Array((d,5))
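subtractByKey keeps only the pairs whose key is absent from the other RDD, which explains both results above: in the first example every key in rdd1 also appears in rdd2, so the result is empty. A local sketch of the same semantics:

```scala
val left      = List(("a", 1), ("b", 2), ("c", 3), ("a", 5), ("d", 5))
val otherKeys = Set("a", "b", "c")   // keys of the other side

// keep pairs whose key does not occur in the other side
val remaining = left.filterNot { case (k, _) => otherKeys.contains(k) }
// remaining == List(("d", 5))
```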
Conclusion: when the alternatives perform similarly, use the one you know best; groupByKey is generally inefficient, so use it sparingly.
When starting out, getting a correct result matters most; once you have one, look for an operator to replace any groupByKey.
(3) Sorting
sortByKey: operates on a PairRDD and sorts it by key.
It is enabled by an implicit conversion in RDD.scala:
object RDD {
implicit def rddToOrderedRDDFunctions[K : Ordering : ClassTag, V: ClassTag](rdd: RDD[(K, V)])
: OrderedRDDFunctions[K, V, (K, V)] = {
new OrderedRDDFunctions[K, V, (K, V)](rdd)
}
...
}
It is implemented in org.apache.spark.rdd.OrderedRDDFunctions:
class OrderedRDDFunctions[K : Ordering : ClassTag,
V: ClassTag,
P <: Product2[K, V] : ClassTag] @DeveloperApi() (
self: RDD[P])
extends Logging with Serializable {
private val ordering = implicitly[Ordering[K]]
/**
* Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
* `collect` or `save` on the resulting RDD will return or output an ordered list of records
* (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
* order of the keys).
*/
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
: RDD[(K, V)] = self.withScope
{
val part = new RangePartitioner(numPartitions, self, ascending)
new ShuffledRDD[K, V, V](self, part)
.setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
... ...
}
In the REPL:
scala> val a = sc.parallelize(List("wyp", "iteblog", "com", "397090770", "test"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[144] at parallelize at <console>:24
scala> val b = sc.parallelize (1 to a.count.toInt)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[145] at parallelize at <console>:26
scala> val c = a.zip(b)
c: org.apache.spark.rdd.RDD[(String, Int)] = ZippedPartitionsRDD2[146] at zip at <console>:27
scala> c.sortByKey().collect
res81: Array[(String, Int)] = Array((397090770,4), (com,3), (iteblog,2), (test,5), (wyp,1))
scala> c.sortByKey(false).collect
res82: Array[(String, Int)] = Array((wyp,1), (test,5), (iteblog,2), (com,3), (397090770,4))
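sortByKey picks up an implicit `Ordering[K]` (that is what the `K : Ordering` context bound in `OrderedRDDFunctions` requires), and `sortByKey(false)` simply uses `ordering.reverse`. The same idea on a local list, reproducing the ascending and descending results above:

```scala
val pairs = List(("wyp", 1), ("iteblog", 2), ("com", 3), ("397090770", 4), ("test", 5))

val asc  = pairs.sortBy(_._1)                            // implicit Ordering[String]
val desc = pairs.sortBy(_._1)(Ordering[String].reverse)  // like sortByKey(false)
// asc.head  == ("397090770", 4)   -- digits sort before letters
// desc.head == ("wyp", 1)
```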
(4) join operations
cogroup / join / leftOuterJoin / rightOuterJoin / fullOuterJoin
From the source of org.apache.spark.rdd.PairRDDFunctions:
/**
* Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
* pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
* (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
*/
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
this.cogroup(other, partitioner).flatMapValues( pair =>
for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
)
}
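The source above says it all: join is cogroup followed by a for-comprehension over the two value collections, and an empty collection on either side yields nothing, which is exactly inner-join semantics. A local sketch of both steps (the helpers `cogroupLocal` and `joinLocal` are hypothetical names):

```scala
def cogroupLocal[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
  val lg = l.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
  val rg = r.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
  // every key from either side, paired with its (possibly empty) value lists
  (lg.keySet ++ rg.keySet).map { k =>
    (k, (lg.getOrElse(k, Nil), rg.getOrElse(k, Nil)))
  }.toMap
}

def joinLocal[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, W))] =
  cogroupLocal(l, r).toSeq.flatMap { case (k, (vs, ws)) =>
    for (v <- vs; w <- ws) yield (k, (v, w))  // empty side => no output
  }

val joined = joinLocal(List((3, "Kylin"), (1, "Spark")), List((3, "X"), (5, "Y")))
// joined == List((3, ("Kylin", "X")))
```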
scala> val rdd1 = sc.makeRDD(Array((1,"Spark"), (2,"Hadoop"), (3,"Kylin"), (4,"Flink")))
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[149] at makeRDD at <console>:24
scala> val rdd2 = sc.makeRDD(Array((3,"李四"), (4,"王五"), (5,"赵六"), (6,"冯七")))
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[150] at makeRDD at <console>:24
scala> val rdd3 = rdd1.cogroup(rdd2)
rdd3: org.apache.spark.rdd.RDD[(Int, (Iterable[String], Iterable[String]))] = MapPartitionsRDD[152] at cogroup at <console>:27
scala> rdd3.collect.foreach(println)
(4,(CompactBuffer(Flink),CompactBuffer(王五)))
(1,(CompactBuffer(Spark),CompactBuffer()))
(6,(CompactBuffer(),CompactBuffer(冯七)))
(3,(CompactBuffer(Kylin),CompactBuffer(李四)))
(5,(CompactBuffer(),CompactBuffer(赵六)))
(2,(CompactBuffer(Hadoop),CompactBuffer()))
scala> rdd3.filter{case (_, (v1, v2)) => v1.nonEmpty & v2.nonEmpty}.collect
res84: Array[(Int, (Iterable[String], Iterable[String]))] = Array((4,(CompactBuffer(Flink),CompactBuffer(王五))), (3,(CompactBuffer(Kylin),CompactBuffer(李四))))
// implementing join by hand, following the source
scala> rdd3.flatMapValues( pair =>
| for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
| )
res85: org.apache.spark.rdd.RDD[(Int, (String, String))] = MapPartitionsRDD[154] at flatMapValues at <console>:26
scala> val rdd1 = sc.makeRDD(Array(("1","Spark"),("2","Hadoop"),("3","Scala"),("4","Java")))
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[155] at makeRDD at <console>:24
scala> val rdd2 = sc.makeRDD(Array(("3","20K"),("4","18K"),("5","25K"),("6","10K")))
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[156] at makeRDD at <console>:24
scala> rdd1.join(rdd2).collect
res86: Array[(String, (String, String))] = Array((4,(Java,18K)), (3,(Scala,20K)))
scala> rdd1.leftOuterJoin(rdd2).collect
res87: Array[(String, (String, Option[String]))] = Array((4,(Java,Some(18K))), (2,(Hadoop,None)), (3,(Scala,Some(20K))), (1,(Spark,None)))
scala> rdd1.rightOuterJoin(rdd2).collect
res88: Array[(String, (Option[String], String))] = Array((4,(Some(Java),18K)), (5,(None,25K)), (6,(None,10K)), (3,(Some(Scala),20K)))
scala> rdd1.fullOuterJoin(rdd2).collect
res89: Array[(String, (Option[String], Option[String]))] = Array((4,(Some(Java),Some(18K))), (5,(None,Some(25K))), (6,(None,Some(10K))), (2,(Some(Hadoop),None)), (3,(Some(Scala),Some(20K))), (1,(Some(Spark),None)))
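The `Option` values in the outer-join results come from padding the missing side with `None`. A local sketch of leftOuterJoin semantics (the helper name is hypothetical):

```scala
def leftOuterJoinLocal[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, Option[W]))] = {
  val rg = r.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
  l.flatMap { case (k, v) =>
    rg.get(k) match {
      case Some(ws) => ws.map(w => (k, (v, Some(w))))  // matched: one row per pair
      case None     => Seq((k, (v, None)))             // unmatched: keep left, pad None
    }
  }
}

val lo = leftOuterJoinLocal(List(("3", "Scala"), ("2", "Hadoop")), List(("3", "20K")))
// lo == List(("3", ("Scala", Some("20K"))), ("2", ("Hadoop", None)))
```

rightOuterJoin mirrors this from the other side, and fullOuterJoin pads with `None` on whichever side is missing.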
3. Action operations
collectAsMap / countByKey / lookup(key)
countByKey source:
/**
* Count the number of elements for each key, collecting the results to a local Map.
*
* @note This method should only be used if the resulting map is expected to be small, as
* the whole thing is loaded into the driver's memory.
* To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
* returns an RDD[T, Long] instead of a map.
*/
def countByKey(): Map[K, Long] = self.withScope {
self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}
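The implementation is just the word-count pattern, mapping every value to 1L and summing per key. The same thing on a plain list (a local analogue):

```scala
val pairs = List(("1", "Spark"), ("2", "Hadoop"), ("1", "Java"))

// mirrors mapValues(_ => 1L).reduceByKey(_ + _) from the source above
val counts = pairs
  .map { case (k, _) => (k, 1L) }
  .groupBy(_._1)
  .map { case (k, ks) => (k, ks.map(_._2).sum) }
// counts == Map("1" -> 2L, "2" -> 1L)
```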
lookup(key): an efficient lookup; if the RDD has a partitioner, only the partition the key maps to is searched.
/**
* Return the list of values in the RDD for key `key`. This operation is done efficiently if the
* RDD has a known partitioner by only searching the partition that the key maps to.
*/
def lookup(key: K): Seq[V] = self.withScope {
self.partitioner match {
case Some(p) =>
val index = p.getPartition(key)
val process = (it: Iterator[(K, V)]) => {
val buf = new ArrayBuffer[V]
for (pair <- it if pair._1 == key) {
buf += pair._2
}
buf
} : Seq[V]
val res = self.context.runJob(self, process, Array(index))
res(0)
case None =>
self.filter(_._1 == key).map(_._2).collect()
}
}
In the REPL:
scala> val rdd1 = sc.makeRDD(Array(("1","Spark"),("2","Hadoop"),("3","Scala"),("1","Java")))
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[169] at makeRDD at <console>:24
scala> val rdd2 = sc.makeRDD(Array(("3","20K"),("4","18K"),("5","25K"),("6","10K")))
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[170] at makeRDD at <console>:24
scala> rdd1.lookup("1")
res90: Seq[String] = WrappedArray(Spark, Java)
scala> rdd2.lookup("3")
res91: Seq[String] = WrappedArray(20K)
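When the RDD has no partitioner, the `case None` branch of the source applies and lookup degenerates to filter + map + collect. A local sketch of that fallback path (the helper name is hypothetical):

```scala
// mirrors self.filter(_._1 == key).map(_._2).collect() from the source
def lookupLocal[K, V](data: Seq[(K, V)], key: K): Seq[V] =
  data.filter(_._1 == key).map(_._2)

val found = lookupLocal(List(("1", "Spark"), ("2", "Hadoop"), ("1", "Java")), "1")
// found == List("Spark", "Java")
```

With a partitioner, Spark instead computes `p.getPartition(key)` and runs the job on that single partition, which is why lookup is cheap on partitioned RDDs.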