0. Environment Setup
- Integrate the Scala plugin into IDEA
- Add the Spark dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.0.0</version>
</dependency>
1. Basic Version
Explanation:
/**
 * Basic WordCount: count by taking the size of each group
 *
 * 1. Read the file and get the lines
 *    "hello world"
 *    "hello java"
 *    "hello scala"
 *
 * 2. Split into words
 *    ["hello", "hello", "hello", "world", "java", "scala"]
 *
 * 3. Group
 *    ("hello", Iterable(hello, hello, hello))
 *    ("world", Iterable(world))
 *    ("java", Iterable(java))
 *    ("scala", Iterable(scala))
 *
 * 4. Transform
 *    ("hello", Iterable(hello, hello, hello)) --> ("hello", 3)
 *    ("world", Iterable(world)) --> ("world", 1)
 *    ("java", Iterable(java)) --> ("java", 1)
 *    ("scala", Iterable(scala)) --> ("scala", 1)
 */
Code:
def main(args: Array[String]): Unit = {
    // TODO 1. Connect to the Spark framework
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("wordCount")
    val sc = new SparkContext(sparkConf)
    // TODO 2. Business logic
    // 1. Read the files and get the lines
    val lines: RDD[String] = sc.textFile("datas/")
    // 2. Split into words (split + flatten): "hello world" => "hello", "world"
    val words: RDD[String] = lines.flatMap(word => word.split(" "))
    // 3. Group: <hello, Iterable(hello, hello, hello)>
    val wordGroup: RDD[(String, Iterable[String])] = words.groupBy(word => word)
    // 4. Transform each group
    //    <hello, Iterable(hello, hello, hello)> ==> <hello, 3>
    val wordToCount: RDD[(String, Int)] = wordGroup.map(
        kv => (kv._1, kv._2.size)
    )
    // 5. collect() is an action and triggers the job
    val array: Array[(String, Int)] = wordToCount.collect()
    array.foreach(println)
    // TODO 3. Release the connection
    sc.stop()
}
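Note: if no datas/ directory is available, the input lines can be stubbed with an in-memory collection; a minimal substitution for step 1 above, using the three sample lines from the explanation:
val lines: RDD[String] = sc.makeRDD(List("hello world", "hello java", "hello scala"))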
Tip: step 4 above can be written more concisely with pattern matching (the braces turn the argument into a partial-function literal):
val wordToCount: RDD[(String, Int)] = wordGroup.map { // note the outer { }
    case (word, list) => {
        (word, list.size)
    }
}
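Equivalently, since only the value side of each pair changes, mapValues is even more concise (this is also the form used in the operator summary below):
val wordToCount: RDD[(String, Int)] = wordGroup.mapValues(_.size)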
2. Classic Version
Explanation:
/**
 * Recommended WordCount: aggregation with reduce
 *
 * 1. Read the file and get the lines
 *    "hello world"
 *    "hello java"
 *    "hello scala"
 *
 * 2. Split into words
 *    ["hello", "hello", "hello", "world", "java", "scala"]
 *
 * 3. Transform
 *    (hello, 1) (hello, 1) (hello, 1)
 *    (world, 1) (java, 1) (scala, 1)
 *
 * 4. Group
 *    hello --> Iterable((hello, 1), (hello, 1), (hello, 1))
 *    world --> Iterable((world, 1))
 *    java --> Iterable((java, 1))
 *    scala --> Iterable((scala, 1))
 *
 * 5. Aggregate
 *    (hello, 1) (hello, 1) (hello, 1) ==> (hello, 3)
 *    (world, 1) ==> (world, 1)
 *    (java, 1) ==> (java, 1)
 *    (scala, 1) ==> (scala, 1)
 */
Code:
def main(args: Array[String]): Unit = {
    // TODO 1. Connect to the Spark framework
    val sparkConf = new SparkConf().setMaster("local").setAppName("WordCount")
    val sc = new SparkContext(sparkConf)
    // TODO 2. Business logic
    // 1. Read the files and get the lines
    val lines: RDD[String] = sc.textFile("datas")
    // 2. Split into words (split + flatten)
    val words: RDD[String] = lines.flatMap(_.split(" "))
    // 3. Transform for easier counting: hello => (hello, 1)
    val wordToOne: RDD[(String, Int)] = words.map(
        word => (word, 1)
    )
    /**
     * 4. Group and aggregate the transformed data
     *    reduceByKey(): groups by key and aggregates the values that share a key;
     *    its parameter func: (Int, Int) => Int is the aggregation logic applied to the values
     */
    val wordToCount: RDD[(String, Int)] = wordToOne.reduceByKey((x, y) => x + y)
    val array: Array[(String, Int)] = wordToCount.collect()
    array.foreach(println)
    // TODO 3. Release the connection
    sc.stop()
}
The final output (for the sample files placed under the datas directory) is:
(hello,4)
(world,2)
(spark,2)
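Since each step simply feeds the next, the whole job also reads naturally as a single chained expression; a one-line sketch of the same pipeline:
sc.textFile("datas").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)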
3. Other Operators
Spark provides a rich set of RDD operators. Using WordCount as the example, here is a summary of other operators that can solve the same problem:
- Grouping
// groupBy: "hello" -> Iterable("hello", "hello",...)
def wordcount1(sc : SparkContext): Unit = {
    val rdd = sc.makeRDD(List("Hello Scala", "Hello Spark"))
    val words = rdd.flatMap(_.split(" "))
    val group: RDD[(String, Iterable[String])] = words.groupBy(word => word)
    val wordCount: RDD[(String, Int)] = group.mapValues(iter => iter.size)
}
// groupByKey: "hello" -> Iterable(1, 1,...)
def wordcount2(sc : SparkContext): Unit = {
    val rdd = sc.makeRDD(List("Hello Scala", "Hello Spark"))
    val words = rdd.flatMap(_.split(" "))
    val wordOne = words.map((_,1)) // ("hello", 1)
    val group: RDD[(String, Iterable[Int])] = wordOne.groupByKey()
    val wordCount: RDD[(String, Int)] = group.mapValues(iter => iter.size)
}
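The difference between these two: groupBy keeps the whole element in each group, while groupByKey keeps only the values, e.g.:
// groupBy    : ("Hello", Iterable("Hello", "Hello"))
// groupByKey : ("Hello", Iterable(1, 1))
Neither operator combines values on the map side before the shuffle, which is why the reduce-style operators below are usually preferred for counting.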
- Reduction
// reduceByKey
def wordcount3(sc : SparkContext): Unit = {
    val rdd = sc.makeRDD(List("Hello Scala", "Hello Spark"))
    val words = rdd.flatMap(_.split(" "))
    val wordOne = words.map((_,1)) // ("hello", 1)
    val wordCount: RDD[(String, Int)] = wordOne.reduceByKey(_+_)
}
// aggregateByKey
def wordcount4(sc : SparkContext): Unit = {
    val rdd = sc.makeRDD(List("Hello Scala", "Hello Spark"))
    val words = rdd.flatMap(_.split(" "))
    val wordOne = words.map((_,1))
    val wordCount: RDD[(String, Int)] = wordOne.aggregateByKey(0)(_+_, _+_)
}
// foldByKey
def wordcount5(sc : SparkContext): Unit = {
    val rdd = sc.makeRDD(List("Hello Scala", "Hello Spark"))
    val words = rdd.flatMap(_.split(" "))
    val wordOne = words.map((_,1))
    val wordCount: RDD[(String, Int)] = wordOne.foldByKey(0)(_+_)
}
// combineByKey
def wordcount6(sc : SparkContext): Unit = {
    val rdd = sc.makeRDD(List("Hello Scala", "Hello Spark"))
    val words = rdd.flatMap(_.split(" "))
    val wordOne = words.map((_,1))
    val wordCount: RDD[(String, Int)] = wordOne.combineByKey(
        v => v,                   // createCombiner: the first value of a key becomes the initial count
        (x: Int, y) => x + y,     // mergeValue: fold further values within a partition
        (x: Int, y: Int) => x + y // mergeCombiners: merge partial counts across partitions
    )
}
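These four operators are increasingly general variants of the same idea; roughly (a conceptual sketch of how they relate, not exact source code):
// reduceByKey(f)          ~ foldByKey(0)(f)                  zero value is the identity of f
// foldByKey(z)(f)         ~ aggregateByKey(z)(f, f)          same function within and across partitions
// aggregateByKey(z)(f, g) ~ combineByKey(v => f(z, v), f, g) the combiner starts from the zero value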
- Counting
// countByKey
def wordcount7(sc : SparkContext): Unit = {
    val rdd = sc.makeRDD(List("Hello Scala", "Hello Spark"))
    val words = rdd.flatMap(_.split(" "))
    val wordOne = words.map((_,1)) // ("hello", 1)
    val wordCount: collection.Map[String, Long] = wordOne.countByKey()
}
// countByValue
def wordcount8(sc : SparkContext): Unit = {
    val rdd = sc.makeRDD(List("Hello Scala", "Hello Spark"))
    val words = rdd.flatMap(_.split(" "))
    val wordCount: collection.Map[String, Long] = words.countByValue()
}
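Note that countByKey and countByValue are actions: they compute the counts and return them to the driver as a local Map, so they are only suitable when the number of distinct keys is small. For this sample input, printing the result on the driver would look like:
println(wordCount) // Map(Hello -> 2, Scala -> 1, Spark -> 1)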
- Merging maps
// requires: import scala.collection.mutable
def wordcount9(sc : SparkContext): Unit = {
    val rdd = sc.makeRDD(List("Hello Scala", "Hello Spark"))
    val words = rdd.flatMap(_.split(" "))
    val mapRdd: RDD[mutable.Map[String, Long]] = words.map(
        word => mutable.Map(word -> 1L) // Map("Hello" -> 1); the Long literal matches the declared value type
    )
    // Merge the single-entry maps pairwise
    val value: mutable.Map[String, Long] = mapRdd.reduce(
        (map1, map2) => {
            map2.foreach {
                case (word, count) => {
                    val newCount = map1.getOrElse(word, 0L) + count
                    map1.update(word, newCount) // insert or update the entry
                }
            }
            map1
        }
    )
    println(value) // Map(Hello -> 2, Scala -> 1, Spark -> 1)
}
4. Summary
| Version | Approach |
|---|---|
| Basic version | group by word, then take the size of each group |
| Classic version | map each word to (word, 1), then reduce by key |
The classic version captures the real essence of WordCount: the (word, count) representation composes, e.g. reducing ("hello", 10) and ("hello", 20) yields ("hello", 30), so partial counts can be merged at any stage. The basic version cannot do this, because its groups only hold the raw words, so pre-aggregated results cannot be combined the same way.
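A minimal sketch of this composability (assuming a SparkContext named sc, as in the examples above): two pre-aggregated partial results can simply be unioned and reduced again.
val partA = sc.makeRDD(List(("hello", 10), ("world", 3)))
val partB = sc.makeRDD(List(("hello", 20), ("scala", 5)))
val merged = partA.union(partB).reduceByKey(_ + _)
merged.collect().foreach(println) // prints (hello,30), (world,3), (scala,5) in some order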