【Spark】RDD Key-Value Operations



1. Overview

RDDs fall broadly into two categories: Value types and Key-Value types.

The operations covered so far apply to Value-type RDDs. In practice, Key-Value RDDs (also called PairRDDs) are used far more often.

  • Operations on Value-type RDDs live mostly in RDD.scala;
  • Operations on Key-Value RDDs are collected in PairRDDFunctions.scala.

For example, the implicit conversion in the RDD.scala source:

// Companion object
object RDD {

  private[spark] val CHECKPOINT_ALL_MARKED_ANCESTORS =
    "spark.checkpoint.checkpointAllMarkedAncestors"

  // The following implicit functions were in SparkContext before 1.3 and users had to
  // `import SparkContext._` to enable them. Now we move them here to make the compiler find
  // them automatically. However, we still keep the old functions in SparkContext for backward
  // compatibility and forward to the following functions directly.

  implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
    new PairRDDFunctions(rdd)
  }
 
  ... ...
}

Most of the operators introduced earlier also work on PairRDDs. In addition, PairRDDs have Transformation and Action operators of their own.

Creating a PairRDD

scala> val arr = (1 to 10).toArray
arr: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> val arr1 = arr.map(x => (x, x*10, x*100))
arr1: Array[(Int, Int, Int)] = Array((1,10,100), (2,20,200), (3,30,300), (4,40,400), (5,50,500), (6,60,600), (7,70,700), (8,80,800), (9,90,900), (10,100,1000))

// rdd1 is NOT a PairRDD: its elements are 3-tuples, not (key, value) pairs
scala> val rdd1 = sc.makeRDD(arr1)
rdd1: org.apache.spark.rdd.RDD[(Int, Int, Int)] = ParallelCollectionRDD[102] at makeRDD at <console>:26


// rdd2 IS a PairRDD: its elements are (key, value) pairs
scala> val arr2 = arr.map(x => (x, (x*10, x*100)))
arr2: Array[(Int, (Int, Int))] = Array((1,(10,100)), (2,(20,200)), (3,(30,300)), (4,(40,400)), (5,(50,500)), (6,(60,600)), (7,(70,700)), (8,(80,800)), (9,(90,900)), (10,(100,1000)))

scala> val rdd2 = sc.makeRDD(arr2)
rdd2: org.apache.spark.rdd.RDD[(Int, (Int, Int))] = ParallelCollectionRDD[103] at makeRDD at <console>:26

2. Transformation Operations

(1) map-like operations

mapValues / flatMapValues / keys / values can all be implemented with map; they are convenience shortcuts.

scala> val a = sc.parallelize(List((1,2),(3,4),(5,6)))
a: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[104] at parallelize at <console>:24

// mapValues is more concise
scala> val b = a.mapValues(x=>1 to x)
b: org.apache.spark.rdd.RDD[(Int, scala.collection.immutable.Range.Inclusive)] = MapPartitionsRDD[105] at mapValues at <console>:25

scala> b.collect
res60: Array[(Int, scala.collection.immutable.Range.Inclusive)] = Array((1,Range 1 to 2), (3,Range 1 to 4), (5,Range 1 to 6))



// the same thing implemented with plain map
scala> val b = a.map(x => (x._1, 1 to x._2))
b: org.apache.spark.rdd.RDD[(Int, scala.collection.immutable.Range.Inclusive)] = MapPartitionsRDD[106] at map at <console>:25

scala> b.collect
res61: Array[(Int, scala.collection.immutable.Range.Inclusive)] = Array((1,Range 1 to 2), (3,Range 1 to 4), (5,Range 1 to 6))

scala> val b = a.map{case (k, v) => (k, 1 to v)}
b: org.apache.spark.rdd.RDD[(Int, scala.collection.immutable.Range.Inclusive)] = MapPartitionsRDD[107] at map at <console>:25

scala> b.collect
res62: Array[(Int, scala.collection.immutable.Range.Inclusive)] = Array((1,Range 1 to 2), (3,Range 1 to 4), (5,Range 1 to 6))




// flatMapValues flattens the values
scala> val c = a.flatMapValues(x=>1 to x)
c: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[108] at flatMapValues at <console>:25

scala> c.collect
res63: Array[(Int, Int)] = Array((1,1), (1,2), (3,1), (3,2), (3,3), (3,4), (5,1), (5,2), (5,3), (5,4), (5,5), (5,6))

scala> val c = a.mapValues(x=>1 to x).flatMap{case (k, v) => v.map(x => (k, x))}
c: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[110] at flatMap at <console>:25

scala> c.collect
res64: Array[(Int, Int)] = Array((1,1), (1,2), (3,1), (3,2), (3,3), (3,4), (5,1), (5,2), (5,3), (5,4), (5,5), (5,6))

scala> c.keys
res65: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[111] at keys at <console>:26

scala> c.values
res66: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[112] at values at <console>:26

scala> c.map{case (k, v) => k}.collect
res67: Array[Int] = Array(1, 1, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5)

scala> c.map{case (k, _) => k}.collect
res68: Array[Int] = Array(1, 1, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5)

scala> c.map{case (_, v) => v}.collect
res69: Array[Int] = Array(1, 2, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6)

(2) Aggregation operations [important]

PairRDD[(K, V)] aggregations are used everywhere.

Operators: groupByKey / reduceByKey / foldByKey / aggregateByKey

Underlying implementation: combineByKey (old) / combineByKeyWithClassTag (new)

Operator: subtractByKey, similar to subtract; it removes the elements of this RDD whose keys also appear in the other RDD.

Example: given the data ("spark", 12), ("hadoop", 26), ("hadoop", 23), ("spark", 15), ("scala", 26), ("spark", 25), ("spark", 23), ("hadoop", 16), ("scala", 24), ("spark", 16), where each key is a book title and each value is that book's sales for one day, compute the average value per key, i.e. the average daily sales of each book.

scala> val rdd = sc.makeRDD(Array(("spark", 12), ("hadoop", 26), ("hadoop", 23), ("spark", 15),("scala", 26), ("spark", 25), ("spark", 23), ("hadoop", 16), ("scala", 24), ("spark", 16)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[116] at makeRDD at <console>:24


// groupByKey
scala> rdd.groupByKey().map(x=>(x._1, x._2.sum.toDouble/x._2.size)).collect
res70: Array[(String, Double)] = Array((scala,25.0), (spark,18.2), (hadoop,21.666666666666668))

scala> rdd.groupByKey().map{case (k, v) => (k, v.sum.toDouble/v.size)}.collect
res71: Array[(String, Double)] = Array((scala,25.0), (spark,18.2), (hadoop,21.666666666666668))

scala> rdd.groupByKey.mapValues(v => v.sum.toDouble/v.size).collect
res72: Array[(String, Double)] = Array((scala,25.0), (spark,18.2), (hadoop,21.666666666666668))



// reduceByKey
scala> rdd.mapValues((_, 1)).
     | reduceByKey((x, y)=> (x._1+y._1, x._2+y._2)).
     | mapValues(x => (x._1.toDouble / x._2)).
     | collect()
res73: Array[(String, Double)] = Array((scala,25.0), (spark,18.2), (hadoop,21.666666666666668))



// foldByKey
scala> rdd.mapValues((_, 1)).foldByKey((0, 0))((x, y) => {
     | (x._1+y._1, x._2+y._2)
     | }).mapValues(x=>x._1.toDouble/x._2).collect
res74: Array[(String, Double)] = Array((scala,25.0), (spark,18.2), (hadoop,21.666666666666668))



// aggregateByKey
// aggregateByKey => zero value + intra-partition combine function + inter-partition combine function
scala> rdd.mapValues((_, 1)).
     | aggregateByKey((0,0))(
     | (x, y) => (x._1 + y._1, x._2 + y._2),
     | (a, b) => (a._1 + b._1, a._2 + b._2)
     | ).mapValues(x=>x._1.toDouble / x._2).
     | collect
res75: Array[(String, Double)] = Array((scala,25.0), (spark,18.2), (hadoop,21.666666666666668))


// the zero value (a tuple here) need not have the same type as the RDD's values (Int)
scala> rdd.aggregateByKey((0, 0))(
     | (x, y) => {println(s"x=$x, y=$y"); (x._1 + y, x._2 + 1)},
     | (a, b) => {println(s"a=$a, b=$b"); (a._1 + b._1, a._2 + b._2)}
     | ).mapValues(x=>x._1.toDouble/x._2).collect
x=(0,0), y=12
x=(0,0), y=26
x=(26,1), y=23
x=(12,1), y=15
x=(0,0), y=26
x=(27,2), y=25
x=(52,3), y=23
x=(49,2), y=16
x=(26,1), y=24
x=(75,4), y=16
res76: Array[(String, Double)] = Array((scala,25.0), (spark,18.2), (hadoop,21.666666666666668))


// intra-partition and inter-partition merging can use different logic; this particular approach is inefficient!
scala> rdd.aggregateByKey(scala.collection.mutable.ArrayBuffer[Int]())(
     | (x, y) => {x.append(y); x},
     | (a, b) => {a++b}
     | ).mapValues(v => v.sum.toDouble/v.size).collect
res77: Array[(String, Double)] = Array((scala,25.0), (spark,18.2), (hadoop,21.666666666666668))


// combineByKey (understanding it is enough; it is rarely called directly)
scala> rdd.combineByKey(
     | (x: Int) => {println(s"x=$x"); (x,1)},
     | (x: (Int, Int), y: Int) => {println(s"x=$x, y=$y");(x._1+y, x._2+1)},
     | (a: (Int, Int), b: (Int, Int)) => {println(s"a=$a, b=$b");(a._1+b._1, a._2+b._2)}
     | ).mapValues(x=>x._1.toDouble/x._2).collect
x=12
x=26
x=(26,1), y=23
x=(12,1), y=15
x=26
x=(27,2), y=25
x=(52,3), y=23
x=(49,2), y=16
x=(26,1), y=24
x=(75,4), y=16
res78: Array[(String, Double)] = Array((scala,25.0), (spark,18.2), (hadoop,21.666666666666668))


// subtractByKey: every key in rdd1 also appears in rdd2, so nothing survives
scala> val rdd1 = sc.makeRDD(Array(("spark", 12), ("hadoop", 26), ("hadoop", 23), ("spark", 15)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[138] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(Array(("spark", 100), ("hadoop", 300)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[139] at makeRDD at <console>:24

scala> rdd1.subtractByKey(rdd2).collect()
res79: Array[(String, Int)] = Array()


// subtractByKey: only pairs whose key is absent from `other` survive
scala> val rdd = sc.makeRDD(Array(("a",1), ("b",2), ("c",3), ("a",5), ("d",5)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[141] at makeRDD at <console>:24

scala> val other = sc.makeRDD(Array(("a",10), ("b",20), ("c",30)))
other: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[142] at makeRDD at <console>:24

scala> rdd.subtractByKey(other).collect()
res80: Array[(String, Int)] = Array((d,5))

Conclusion: when the alternatives are equally efficient, use the one you know best; groupByKey is generally inefficient, so use it as little as possible.

For beginners: the most important thing is a working implementation; if you used groupByKey, look for a replacement operator.
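Why groupByKey is usually the worse choice can be sketched with plain Scala collections (no Spark required; the partition lists and helper below are purely illustrative): a groupByKey-style shuffle moves every raw pair across the network, while a reduceByKey-style job pre-aggregates inside each partition first and only ships the partial results.

```scala
// Simulate two partitions of (word, 1) pairs.
val p1 = List(("spark", 1), ("hadoop", 1), ("spark", 1))
val p2 = List(("spark", 1), ("hadoop", 1))

// groupByKey-style: every raw pair crosses the "shuffle" unchanged
// (5 records here), then the values are summed on the reduce side.
val grouped = (p1 ++ p2).groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

// reduceByKey-style: combine within each partition first, then merge
// the partial sums (only 4 records cross the "shuffle" here).
def combine(part: List[(String, Int)]): Map[String, Int] =
  part.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
val merged = (combine(p1).toList ++ combine(p2).toList)
  .groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

assert(grouped == merged)  // same result, less data moved
```

In Spark, reduceByKey, foldByKey, and aggregateByKey all get this map-side combine for free through combineByKeyWithClassTag, which is exactly why they are preferred over groupByKey.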


(3) Sorting

sortByKey: acts on a PairRDD and sorts it by key.

It is enabled in the source via an implicit conversion:

object RDD {

  implicit def rddToOrderedRDDFunctions[K : Ordering : ClassTag, V: ClassTag](rdd: RDD[(K, V)])
    : OrderedRDDFunctions[K, V, (K, V)] = {
    new OrderedRDDFunctions[K, V, (K, V)](rdd)
  }
  
  ...
}

And implemented in org.apache.spark.rdd.OrderedRDDFunctions:

class OrderedRDDFunctions[K : Ordering : ClassTag,
                          V: ClassTag,
                          P <: Product2[K, V] : ClassTag] @DeveloperApi() (
    self: RDD[P])
  extends Logging with Serializable {
  private val ordering = implicitly[Ordering[K]]

  /**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
   * `collect` or `save` on the resulting RDD will return or output an ordered list of records
   * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
   * order of the keys).
   */
  // TODO: this currently doesn't work on P other than Tuple2!
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }
  
  ... ...
}

In practice:

scala> val a = sc.parallelize(List("wyp", "iteblog", "com", "397090770", "test"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[144] at parallelize at <console>:24

scala> val b = sc.parallelize (1 to a.count.toInt)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[145] at parallelize at <console>:26

scala> val c = a.zip(b)
c: org.apache.spark.rdd.RDD[(String, Int)] = ZippedPartitionsRDD2[146] at zip at <console>:27

scala> c.sortByKey().collect
res81: Array[(String, Int)] = Array((397090770,4), (com,3), (iteblog,2), (test,5), (wyp,1))

scala> c.sortByKey(false).collect
res82: Array[(String, Int)] = Array((wyp,1), (test,5), (iteblog,2), (com,3), (397090770,4))
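Note that sortByKey only orders by key. To order by value, the generic sortBy transformation (available on any RDD) works on pairs too. A sketch, assuming a live spark-shell session with `sc` available:

```scala
// Sort a PairRDD by its values instead of its keys
// (assumes a SparkContext `sc`, as in the spark-shell session above).
val pairs = sc.parallelize(List(("wyp", 1), ("iteblog", 2), ("com", 3)))

// sortBy takes a key-extraction function; here we extract the value.
pairs.sortBy(_._2, ascending = false).collect
// Array((com,3), (iteblog,2), (wyp,1))
```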

(4) join operations

cogroup / join / leftOuterJoin / rightOuterJoin / fullOuterJoin

From the org.apache.spark.rdd.PairRDDFunctions source:

  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
   */
  def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
    )
  }

scala> val rdd1 = sc.makeRDD(Array((1,"Spark"), (2,"Hadoop"), (3,"Kylin"), (4,"Flink")))
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[149] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(Array((3,"李四"), (4,"王五"), (5,"赵六"), (6,"冯七")))
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[150] at makeRDD at <console>:24

scala> val rdd3 = rdd1.cogroup(rdd2)
rdd3: org.apache.spark.rdd.RDD[(Int, (Iterable[String], Iterable[String]))] = MapPartitionsRDD[152] at cogroup at <console>:27

scala> rdd3.collect.foreach(println)
(4,(CompactBuffer(Flink),CompactBuffer(王五)))
(1,(CompactBuffer(Spark),CompactBuffer()))
(6,(CompactBuffer(),CompactBuffer(冯七)))
(3,(CompactBuffer(Kylin),CompactBuffer(李四)))
(5,(CompactBuffer(),CompactBuffer(赵六)))
(2,(CompactBuffer(Hadoop),CompactBuffer()))

scala> rdd3.filter{case (_, (v1, v2)) => v1.nonEmpty & v2.nonEmpty}.collect
res84: Array[(Int, (Iterable[String], Iterable[String]))] = Array((4,(CompactBuffer(Flink),CompactBuffer(王五))), (3,(CompactBuffer(Kylin),CompactBuffer(李四))))



// implement join by hand, imitating the source above
scala> rdd3.flatMapValues( pair =>
     | for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
     | )
res85: org.apache.spark.rdd.RDD[(Int, (String, String))] = MapPartitionsRDD[154] at flatMapValues at <console>:26

scala> val rdd1 = sc.makeRDD(Array(("1","Spark"),("2","Hadoop"),("3","Scala"),("4","Java")))
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[155] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(Array(("3","20K"),("4","18K"),("5","25K"),("6","10K")))
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[156] at makeRDD at <console>:24

scala> rdd1.join(rdd2).collect
res86: Array[(String, (String, String))] = Array((4,(Java,18K)), (3,(Scala,20K)))

scala> rdd1.leftOuterJoin(rdd2).collect
res87: Array[(String, (String, Option[String]))] = Array((4,(Java,Some(18K))), (2,(Hadoop,None)), (3,(Scala,Some(20K))), (1,(Spark,None)))

scala> rdd1.rightOuterJoin(rdd2).collect
res88: Array[(String, (Option[String], String))] = Array((4,(Some(Java),18K)), (5,(None,25K)), (6,(None,10K)), (3,(Some(Scala),20K)))

scala> rdd1.fullOuterJoin(rdd2).collect
res89: Array[(String, (Option[String], Option[String]))] = Array((4,(Some(Java),Some(18K))), (5,(None,Some(25K))), (6,(None,Some(10K))), (2,(Some(Hadoop),None)), (3,(Some(Scala),Some(20K))), (1,(Some(Spark),None)))

3. Action Operations

collectAsMap / countByKey / lookup(key)

countByKey source:

  /**
   * Count the number of elements for each key, collecting the results to a local Map.
   *
   * @note This method should only be used if the resulting map is expected to be small, as
   * the whole thing is loaded into the driver's memory.
   * To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
   * returns an RDD[T, Long] instead of a map.
   */
  def countByKey(): Map[K, Long] = self.withScope {
    self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
  }

lookup(key): an efficient lookup; if the RDD has a partitioner, only the partition that the key maps to is searched.

  /**
   * Return the list of values in the RDD for key `key`. This operation is done efficiently if the
   * RDD has a known partitioner by only searching the partition that the key maps to.
   */
  def lookup(key: K): Seq[V] = self.withScope {
    self.partitioner match {
      case Some(p) =>
        val index = p.getPartition(key)
        val process = (it: Iterator[(K, V)]) => {
          val buf = new ArrayBuffer[V]
          for (pair <- it if pair._1 == key) {
            buf += pair._2
          }
          buf
        } : Seq[V]
        val res = self.context.runJob(self, process, Array(index))
        res(0)
      case None =>
        self.filter(_._1 == key).map(_._2).collect()
    }
  }

In practice:

scala> val rdd1 = sc.makeRDD(Array(("1","Spark"),("2","Hadoop"),("3","Scala"),("1","Java")))
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[169] at makeRDD at <console>:24

scala> val rdd2 = sc.makeRDD(Array(("3","20K"),("4","18K"),("5","25K"),("6","10K")))
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[170] at makeRDD at <console>:24

scala> rdd1.lookup("1")
res90: Seq[String] = WrappedArray(Spark, Java)

scala> rdd2.lookup("3")
res91: Seq[String] = WrappedArray(20K)
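collectAsMap and countByKey were listed at the start of this section but not demonstrated; a quick sketch against the rdd1 defined above (same spark-shell session assumed):

```scala
// countByKey counts the elements per key and returns a local Map to the driver.
rdd1.countByKey()
// e.g. Map(1 -> 2, 2 -> 1, 3 -> 1)

// collectAsMap pulls the whole RDD back as a Map; duplicate keys keep only
// one value, so the two pairs with key "1" collapse into a single entry.
rdd1.collectAsMap()
// e.g. Map(2 -> Hadoop, 1 -> Java, 3 -> Scala)
```

Like collect, both load the entire result into the driver's memory, so they should only be used when the result is known to be small.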