In Spark we frequently need to group data by key. Spark provides two common operators for this, reduceByKey and groupByKey; below we look at how they differ and how each performs.
1. reduceByKey
Let's start from the underlying source code:
```scala
/**
 * Merge the values for each key using an associative and commutative reduce function. This will
 * also perform the merging locally on each mapper before sending results to a reducer, similarly
 * to a "combiner" in MapReduce.
 */
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
```
As the source comment explains, before the shuffle sends data to the reducers, reduceByKey first performs a local merge on each map task, much like a combiner in MapReduce.
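As a minimal sketch of this map-side combine in action, the classic word count can be written with reduceByKey. The local SparkConf setup, the `words` RDD and the sample data are assumptions made purely for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReduceByKeyWordCount {
  def main(args: Array[String]): Unit = {
    // Hypothetical local setup, only for this sketch.
    val conf = new SparkConf().setAppName("ReduceByKeyWordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
    val counts = words
      .map(word => (word, 1))
      .reduceByKey(_ + _) // partial sums are merged on the map side before the shuffle

    counts.collect().foreach(println) // e.g. (a,3), (b,2), (c,1)
    sc.stop()
  }
}
```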
2. groupByKey
```scala
/**
 * Group the values for each key in the RDD into a single sequence. Hash-partitions the
 * resulting RDD with the existing partitioner/parallelism level. The ordering of elements
 * within each group is not guaranteed, and may even differ each time the resulting RDD is
 * evaluated.
 *
 * @note This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
 * or `PairRDDFunctions.reduceByKey` will provide much better performance.
 */
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(defaultPartitioner(self))
}
```
groupByKey, by contrast, performs no map-side merge: every key-value pair is sent to the reducer side as-is, and all computation on the grouped values happens there.
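The same word count written with groupByKey makes the difference visible; this sketch reuses the `words` RDD from the reduceByKey example above. Every individual (word, 1) pair crosses the shuffle before anything is summed:

```scala
val countsViaGroup = words
  .map(word => (word, 1))
  .groupByKey()                // every single pair is shuffled, unmerged
  .mapValues(ones => ones.sum) // summing only happens after the shuffle

countsViaGroup.collect().foreach(println)
```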
3. Differences
In terms of performance, reduceByKey is preferable to groupByKey, for the following reasons:
- 1. reduceByKey cuts down the amount of data reaching the reduce side: because values are already merged on the map side, far less data is transferred during the shuffle.
- 2. This improves the overall performance of the distributed job; for network-bound workloads in particular, the benefit of reduceByKey is especially pronounced.
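For aggregations that do not fit a single (V, V) => V function, such as a per-key average, the groupByKey docstring quoted above points to PairRDDFunctions.aggregateByKey instead. Below is a rough sketch, with a made-up `scores` RDD and reusing the `sc` from the earlier example:

```scala
val scores = sc.parallelize(Seq(("math", 90.0), ("math", 70.0), ("english", 80.0)))

val averages = scores
  .aggregateByKey((0.0, 0))(                 // accumulator: (runningSum, runningCount)
    (acc, v) => (acc._1 + v, acc._2 + 1),    // fold a value into the local accumulator
    (a, b) => (a._1 + b._1, a._2 + b._2)     // merge accumulators across partitions
  )
  .mapValues { case (sum, count) => sum / count }

averages.collect().foreach(println) // e.g. (math,80.0), (english,80.0)
```

Like reduceByKey, the (sum, count) accumulators are combined on the map side, so the shuffle stays small.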