The difference between reduceByKey and groupByKey in Spark


When working with Spark we frequently need to group data by key. Spark provides two operators for this, and below we look at how they differ and how each one performs.

1. reduceByKey

Let's go straight to the underlying source code:

  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce.
   */
  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

As the source comment explains, before the shuffle sends data to the reducers, the values are first merged locally on the map side; this behavior is analogous to the Combiner in MapReduce.
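To make this concrete, here is a minimal word-count sketch (assuming an existing SparkContext named `sc` and a hypothetical input path `input.txt`): each partition first sums the counts for a given word locally, and only those partial sums are shuffled.

  val counts = sc.textFile("input.txt")      // hypothetical input path
    .flatMap(line => line.split("\\s+"))     // split each line into words
    .map(word => (word, 1))                  // pair each word with a count of 1
    .reduceByKey(_ + _)                      // partial sums on the map side, merged again after the shuffle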

2. groupByKey

  /**
   * Group the values for each key in the RDD into a single sequence. Hash-partitions the
   * resulting RDD with the existing partitioner/parallelism level. The ordering of elements
   * within each group is not guaranteed, and may even differ each time the resulting RDD is
   * evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   */
  def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(defaultPartitioner(self))
  }

groupByKey, by contrast, sends the key-value pairs to the reducer side as-is, and all of the computation happens there.
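By way of contrast, here is the same word count written with groupByKey (same assumptions as the sketch above): every single (word, 1) pair is shuffled to the reducer side before anything is summed.

  val countsViaGroup = sc.textFile("input.txt")   // hypothetical input path
    .flatMap(line => line.split("\\s+"))
    .map(word => (word, 1))
    .groupByKey()                                 // all values for a key cross the network
    .mapValues(values => values.sum)              // summing happens only after the shuffle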

3. Differences

In terms of performance, reduceByKey is better than groupByKey, for the following reasons:

  • 1. reduceByKey shrinks the amount of data that reaches the reduce side: because a merge already happens on the map side, far less data has to be transferred during the shuffle.
  • 2. This improves the performance of the distributed job as a whole; for network-bound workloads the benefit of reduceByKey is especially noticeable (see the sketch after this list).
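As the groupByKey docstring quoted above already hints, even an aggregation such as a per-key average does not require groupByKey. A minimal sketch, assuming an existing SparkContext `sc` and a hypothetical sample RDD named `scores`: carry a (sum, count) pair through reduceByKey so that most of the work is done before the shuffle.

  val scores = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))        // hypothetical sample data
  val avgPerKey = scores
    .mapValues(v => (v, 1L))                                                  // (running sum, running count)
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }          // combine partial (sum, count) pairs
    .mapValues { case (sum, count) => sum / count }                           // final average per key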