Classic Spark Example: WordCount


0. Environment Setup

  1. Set up Scala support in IDEA

  2. Add the Spark dependency:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.0.0</version>
</dependency>
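
If you build with sbt instead of Maven, the equivalent dependency would be the following (a sketch, assuming your project uses Scala 2.12 so that %% resolves to spark-core_2.12):

libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.0"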

1. Basic Version

Explanation:

/**
 *  Basic WordCount: count by taking the size of each group
 *
 * 1. Read the file and get the lines
 * "hello world"
 * "hello java"
 * "hello scala"
 *
 * 2. Split into words
 *  ["hello", "hello", "hello", "world", "java", "scala"]
 *
 * 3. Group by word
 * ("hello",  Iterable(hello, hello, hello))
 * ("world",  Iterable(world))
 * ("java",   Iterable(java))
 * ("scala",  Iterable(scala))
 *
 * 4. Map each group to its size
 *  ("hello",  Iterable(hello, hello, hello))  --> ("hello", 3)
 *  ("world",  Iterable(world))                --> ("world", 1)
 *  ("java",   Iterable(java))                 --> ("java",  1)
 *  ("scala",  Iterable(scala))                --> ("scala", 1)
 */
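
Before bringing in Spark, the same four steps can be traced on plain Scala collections (a minimal sketch, no RDDs involved):

val lines = List("hello world", "hello java", "hello scala")
val counts = lines
  .flatMap(_.split(" "))                   // 2. split into words
  .groupBy(identity)                       // 3. word -> List of its occurrences
  .map { case (w, ws) => (w, ws.size) }    // 4. group -> (word, count)
// counts: Map(hello -> 3, world -> 1, java -> 1, scala -> 1)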


Code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

def main(args: Array[String]): Unit = {
    // TODO 1. Connect to the Spark framework
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("wordCount")
    val sc = new SparkContext(sparkConf)

    // TODO 2. Business logic
    // 1. Read the files and get the lines
    val lines: RDD[String] = sc.textFile("datas/")
    // 2. Split + flatten: "hello world" => "hello", "world"
    val words: RDD[String] = lines.flatMap(line => line.split(" "))
    // 3. Group: <hello, Iterable(hello, hello, hello)>
    val wordGroup: RDD[(String, Iterable[String])] = words.groupBy(word => word)

    // 4. Map each group to its size
    // <hello, Iterable(hello, hello, hello)> ==> <hello, 3>
    val wordToCount: RDD[(String, Int)] = wordGroup.map(
      kv => (kv._1, kv._2.size)
    )

    // 5. collect() is an action: it triggers job execution
    val array: Array[(String, Int)] = wordToCount.collect()
    array.foreach(println)

    // TODO 3. Release resources
    sc.stop()
  }
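
For the three sample lines in the explanation above, this prints (tuple order may vary):

(hello,3)
(world,1)
(java,1)
(scala,1)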

Tip: step 4 above can be written more concisely with pattern matching:

val wordToCount: RDD[(String, Int)] = wordGroup.map {   // note the outer braces { }
  case (word, list) => (word, list.size)
}
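
Alternatively, since only the value side of each pair changes, mapValues expresses the same transformation without touching the keys:

val wordToCount: RDD[(String, Int)] = wordGroup.mapValues(_.size)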

2. Classic Version

Explanation:

/**
 *  Recommended WordCount: aggregate with reduce
 *
 * 1. Read the file and get the lines
 * "hello world"
 * "hello java"
 * "hello scala"
 *
 * 2. Split into words
 *  ["hello", "hello", "hello", "world", "java", "scala"]
 *
 * 3. Map each word to a pair
 *  (hello, 1)  (hello, 1)  (hello, 1)
 *  (world, 1)  (java, 1)   (scala, 1)
 *
 * 4. Group by key
 *  hello  --> Iterable((hello, 1), (hello, 1), (hello, 1))
 *  world  --> Iterable((world, 1))
 *  java   --> Iterable((java, 1))
 *  scala  --> Iterable((scala, 1))
 *
 * 5. Aggregate the values of each key
 *  (hello, 1) (hello, 1) (hello, 1) ==> (hello, 3)
 *  (world, 1)
 *  (java,  1)
 *  (scala, 1)
 */
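
Step 5 folds each key's values pairwise, exactly like reduce on a plain collection:

List(1, 1, 1).reduce(_ + _)   // (1 + 1) = 2, then (2 + 1) = 3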


Code:

def main(args: Array[String]): Unit = {
    // TODO 1. Connect to the Spark framework
    val sparkConf = new SparkConf().setMaster("local").setAppName("WordCount")
    val sc = new SparkContext(sparkConf)

    // TODO 2. Business logic
    // 1. Read the files and get the lines
    val lines: RDD[String] = sc.textFile("datas")
    // 2. Split + flatten
    val words: RDD[String] = lines.flatMap(_.split(" "))
    // 3. Map for counting: hello => (hello, 1)
    val wordToOne: RDD[(String, Int)] = words.map(
      word => (word, 1)
    )
    // 4. Group and aggregate in one step.
    //    reduceByKey(): groups by key and aggregates the values of each key;
    //    the argument func: (Int, Int) => Int is the value-aggregation logic.
    val wordToCount: RDD[(String, Int)] = wordToOne.reduceByKey((x, y) => x + y)

    val array: Array[(String, Int)] = wordToCount.collect()
    array.foreach(println)

    // TODO 3. Release resources
    sc.stop()
  }
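
For comparison, the same result could be produced with groupByKey followed by a sum, but reduceByKey is preferred because it pre-aggregates within each partition before the shuffle, sending less data over the network (a minimal sketch, using the wordToOne RDD from the code above):

val viaGroup: RDD[(String, Int)] = wordToOne.groupByKey().mapValues(_.sum)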

The final output, for the sample files under the datas directory, is:

(hello,4)
(world,2)
(spark,2)

3. Other Operators

Spark provides a rich set of RDD operators. Taking WordCount as the running example, here is a summary of the other operators that can solve the same problem (a small runnable comparison of the reduction variants follows after the list):

  • Grouping
// groupBy: "hello" -> Iterable("hello", "hello", ...)
def wordcount1(sc: SparkContext): Unit = {
    val rdd = sc.makeRDD(List("Hello Scala", "Hello Spark"))
    val words = rdd.flatMap(_.split(" "))
    val group: RDD[(String, Iterable[String])] = words.groupBy(word => word)
    val wordCount: RDD[(String, Int)] = group.mapValues(iter => iter.size)
}

// groupByKey: "hello" -> Iterable(1, 1, ...)
def wordcount2(sc: SparkContext): Unit = {
    val rdd = sc.makeRDD(List("Hello Scala", "Hello Spark"))
    val words = rdd.flatMap(_.split(" "))
    val wordOne = words.map((_, 1))     // ("hello", 1)
    val group: RDD[(String, Iterable[Int])] = wordOne.groupByKey()
    val wordCount: RDD[(String, Int)] = group.mapValues(iter => iter.size)
}
  • Reduction
def wordcount3(sc: SparkContext): Unit = {
    val rdd = sc.makeRDD(List("Hello Scala", "Hello Spark"))
    val words = rdd.flatMap(_.split(" "))
    val wordOne = words.map((_, 1))     // ("hello", 1)
    // reduceByKey
    val wordCount1: RDD[(String, Int)] = wordOne.reduceByKey(_ + _)
    // aggregateByKey: first func merges within a partition, second across partitions
    val wordCount2: RDD[(String, Int)] = wordOne.aggregateByKey(0)(_ + _, _ + _)
    // foldByKey: like aggregateByKey when both functions are the same
    val wordCount3: RDD[(String, Int)] = wordOne.foldByKey(0)(_ + _)
    // combineByKey: createCombiner, mergeValue, mergeCombiners
    val wordCount4: RDD[(String, Int)] = wordOne.combineByKey(
        v => v,
        (x: Int, y) => x + y,
        (x: Int, y: Int) => x + y
    )
}
  • Counting
// countByKey: counts how many elements share each key (an action; returns a Map to the driver)
def wordcount7(sc: SparkContext): Unit = {
    val rdd = sc.makeRDD(List("Hello Scala", "Hello Spark"))
    val words = rdd.flatMap(_.split(" "))
    val wordOne = words.map((_, 1))     // ("hello", 1)
    val wordCount: collection.Map[String, Long] = wordOne.countByKey()
}

// countByValue: counts how many times each element occurs, so no (word, 1) mapping is needed
def wordcount8(sc: SparkContext): Unit = {
    val rdd = sc.makeRDD(List("Hello Scala", "Hello Spark"))
    val words = rdd.flatMap(_.split(" "))
    val wordCount: collection.Map[String, Long] = words.countByValue()
}
  • Merging maps
// Map each word to a one-entry Map, then merge the Maps with reduce
// (needs: import scala.collection.mutable)
def wordcount9(sc: SparkContext): Unit = {
    val rdd = sc.makeRDD(List("Hello Scala", "Hello Spark"))
    val words = rdd.flatMap(_.split(" "))

    val mapRdd: RDD[mutable.Map[String, Long]] = words.map(
        word => mutable.Map(word -> 1L)     // Map("hello" -> 1)
    )

    // Merge two Maps at a time
    val value: mutable.Map[String, Long] = mapRdd.reduce(
        (map1, map2) => {
            map2.foreach {
                case (word, count) => {
                    val newCount = map1.getOrElse(word, 0L) + count
                    map1.update(word, newCount)     // insert or update the entry
                }
            }
            map1
        }
    )
    println(value)  // Map(Hello -> 2, Scala -> 1, Spark -> 1)
}
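
For WordCount all four reduction variants coincide, since both the in-partition and cross-partition steps are plain addition. A minimal runnable check (assuming a local SparkContext named sc, as in the snippets above; tuple order in the output may vary):

val pairs = sc.makeRDD(List(("hello", 1), ("hello", 1), ("spark", 1)))
println(pairs.reduceByKey(_ + _).collect().toList)               // e.g. List((hello,2), (spark,1))
println(pairs.aggregateByKey(0)(_ + _, _ + _).collect().toList)  // same result
println(pairs.foldByKey(0)(_ + _).collect().toList)              // same result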

4. Summary

| Version | Approach |
| ------- | -------- |
| Basic   | Group by word, then take the size of each group |
| Classic | Map each word to (word, 1), then reduce by key |

The classic version's approach is the real core of WordCount: reducing ("hello", 10) and ("hello", 20) yields ("hello", 30), so partial counts can be merged at any stage. The basic version cannot do this, because taking the size of a group only counts raw occurrences and cannot combine already-aggregated results.
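
A minimal sketch of that merging property (again assuming a local SparkContext named sc):

val partials = sc.makeRDD(List(("hello", 10), ("hello", 20)))
partials.reduceByKey(_ + _).collect().foreach(println)   // prints (hello,30)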