# Big Data (8i): Spark TopN Exercises


```scala
// Key each row by "city-advertisement"
val r1 = r0.map(row => {
  (row._2 + "-" + row._4, 1)
})
// Count clicks per city-advertisement pair
val r2 = r1.reduceByKey(_ + _)
// Split the key back into city and advertisement
val r3 = r2.map(kv => (kv._1.split('-')(0), (kv._1.split('-')(1), kv._2)))
// Group by city
val r4 = r3.groupByKey
// Within each city, sort by click count descending and take the top 2
val r6 = r4.mapValues(_.toSeq.sortBy(-_._2).take(2))
r6.foreach(println)
```




> Output:
>
> ![](https://p6-xtjj-sign.byteimg.com/tos-cn-i-73owjymdk6/276301acf9694c319b66be72e365404a~tplv-73owjymdk6-jj-mark-v1:0:0:0:0:5o6Y6YeR5oqA5pyv56S-5Yy6IEAg5py65Zmo5a2m5Lmg5LmL5b-DQUk=:q75.awebp?rk3s=f64ab15b&x-expires=1771262522&x-signature=OX8%2FUN0S6FXipBefjslJrrZ5hJ0%3D)



### SparkSQL implementation


![](https://p6-xtjj-sign.byteimg.com/tos-cn-i-73owjymdk6/bcdc014c5d3b45c99407b8cd8a59abc4~tplv-73owjymdk6-jj-mark-v1:0:0:0:0:5o6Y6YeR5oqA5pyv56S-5Yy6IEAg5py65Zmo5a2m5Lmg5LmL5b-DQUk=:q75.awebp?rk3s=f64ab15b&x-expires=1771262522&x-signature=MNrZIEnGm8c3f24kIXMaoO9t5X0%3D)



```scala
// Create the SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val c1: SparkConf = new SparkConf().setAppName("a1").setMaster("local[*]")
val spark: SparkSession = SparkSession.builder().config(c1).getOrCreate()
// Implicit conversions (e.g. toDF)
import spark.implicits._

// Register the sample data as temporary view t0
List(
  ("2020", "guangzhou", "Farseer", "A"),
  ("2020", "foshan", "Blade Master", "B"),
  ("2020", "foshan", "Warden", "B"),
  ("2020", "shenzhen", "Archmage", "D"),
  ("2020", "guangzhou", "Lich", "C"),
  ("2020", "foshan", "Mountain King", "B"),
  ("2021", "guangzhou", "Demon Hunter", "A"),
  ("2021", "foshan", "Blade Master", "C"),
  ("2021", "foshan", "Warden", "C"),
  ("2021", "shenzhen", "Death Knight", "D"),
  ("2021", "guangzhou", "Paladin", "D"),
  ("2021", "foshan", "Blade Master", "D"),
  ("2021", "foshan", "Wind Runner", "C"),
  ("2021", "guangzhou", "Crypt Lord", "D")
).toDF("time", "city", "user", "advertisement").createTempView("t0")

// Count clicks grouped by city and advertisement
spark.sql(
  """
    |SELECT city, advertisement, count(0) clicks FROM t0
    |GROUP BY city, advertisement
    |""".stripMargin).createTempView("t1")

// Window function: partition by city, rank by click count within each partition
spark.sql(
  """
    |SELECT
    |  city,
    |  advertisement,
    |  clicks,
    |  RANK() OVER(PARTITION BY city ORDER BY clicks DESC) AS r
    |FROM t1
    |""".stripMargin).createTempView("t2")

// Keep only the top 2 per city
spark.sql("SELECT city, advertisement, clicks FROM t2 WHERE r < 3").show()
```




> Output:
>
> ![](https://p6-xtjj-sign.byteimg.com/tos-cn-i-73owjymdk6/0ff5fe49283f4c60ba2c9a746813d1bb~tplv-73owjymdk6-jj-mark-v1:0:0:0:0:5o6Y6YeR5oqA5pyv56S-5Yy6IEAg5py65Zmo5a2m5Lmg5LmL5b-DQUk=:q75.awebp?rk3s=f64ab15b&x-expires=1771262522&x-signature=YUvfNk1wdkSwxh0vJ8NdPIOJtsA%3D)
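For reference (not part of the original post), the same ranking can be expressed with the DataFrame API instead of SQL strings, reusing the `t1` view registered above:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, rank}

// Rank advertisements within each city by click count, then keep the top 2
val w = Window.partitionBy("city").orderBy(desc("clicks"))
spark.table("t1")
  .withColumn("r", rank().over(w))
  .where(col("r") < 3)
  .select("city", "advertisement", "clicks")
  .show()
```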




## Requirement: top 2 provinces by clicks



### Data



```scala
// Create a SparkConf and set the application config
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("A").setMaster("local[8]")
// Create the SparkContext, Spark's entry point to the cluster
val sc = new SparkContext(conf)
// Sample data: city codes whose first two digits identify the province
val r0 = sc.makeRDD(Seq(
  4401, 4401, 4401, 4401, 4401, 4401, 4401, 4401, 4401, 4401, 4401, 4401, 4401,
  4406, 4406, 4406, 4406, 4406, 4406, 4406, 4406,
  4602, 4602, 4601,
  4301, 4301
))
```


### Method 1: reduceByKey by province



```scala
// Aggregate click counts per province (first two digits of the city code)
val r1 = r0.map(a => (a.toString.slice(0, 2), 1)).reduceByKey(_ + _)
// Inspect the elements in each partition
r1.mapPartitionsWithIndex((pId, iter) => {
  println("Partition " + pId + " elements: " + iter.toList)
  iter
}).collect
// Province TopN
r1.sortBy(-_._2).take(2).foreach(println)
```
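A small aside not in the original: for a fixed, small N, `takeOrdered` avoids the full sort and shuffle that `sortBy` performs by keeping only a bounded top-N per partition:

```scala
// Sketch: top 2 provinces without a full sort; r1 is the (province, count) RDD above
r1.takeOrdered(2)(Ordering.by[(String, Int), Int](t => -t._2)).foreach(println)
```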


### Method 2: reduceByKey by city first, then by province


Reducing by city first pre-aggregates on many more distinct keys, so the work is spread more evenly across partitions; the province-level reduce then only merges a handful of pre-aggregated records, which helps mitigate data skew. See the code below.



```scala
// Reduce by city (the full city code)
val r1 = r0.map((_, 1)).reduceByKey(_ + _)
// Inspect the elements in each partition
r1.mapPartitionsWithIndex((pId, iter) => {
  println("Partition " + pId + " elements: " + iter.toList)
  iter
}).collect
// Reduce by province (first two digits of the city code)
val r2 = r1.map(t => (t._1.toString.slice(0, 2), t._2)).reduceByKey(_ + _)
// Inspect the elements in each partition
r2.mapPartitionsWithIndex((pId, iter) => {
  println("Partition " + pId + " elements: " + iter.toList)
  iter
}).collect
// Province TopN
r2.sortBy(-_._2).take(2).foreach(println)
```


### Output


Partition contents after reducing by city:



```
Partition 4 elements: List()
Partition 3 elements: List()
Partition 7 elements: List()
Partition 0 elements: List()
Partition 2 elements: List((4602,2))
Partition 6 elements: List((4406,8))
Partition 5 elements: List((4301,2))
Partition 1 elements: List((4401,13), (4601,1))
```


Partition contents after reducing by province:



```
Partition 5 elements: List()
Partition 3 elements: List()
Partition 1 elements: List()
Partition 4 elements: List()
Partition 6 elements: List()
Partition 7 elements: List((43,2))
Partition 0 elements: List((44,21))
Partition 2 elements: List((46,3))
```


Result:



```
(44,21)
(46,3)
```


## TopN with a custom partitioner


A custom partitioner can also mitigate data skew by scattering a hot key across several partitions; the partial counts it produces then need a second aggregation to be merged, as in the code below.



```scala
import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}

import scala.util.Random

class MyPartitioner extends Partitioner {
  val random: Random = new Random
  // Total number of partitions
  override def numPartitions: Int = 8
  // Partition by key; here key "44" is assumed to be skewed and is scattered over partitions 0-6
  override def getPartition(key: Any): Int = key match {
    case "44" => random.nextInt(7)
    case _ => 7
  }
}

object Hello {
  def main(args: Array[String]): Unit = {
    // Create a SparkConf and set the application config
    val conf = new SparkConf().setAppName("A").setMaster("local[8]")
    // Create the SparkContext, Spark's entry point to the cluster
    val sc = new SparkContext(conf)
    // Sample data with a heavily skewed value (44)
    val r0 = sc.makeRDD(Seq(
      44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44,
      44, 44, 44, 44, 44, 44, 44, 44, 44, 44,
      46, 46, 46, 43, 43
    ))
    // Map to (province, 1) pairs
    val r1 = r0.map(a => (a.toString.slice(0, 2), 1))
    // First aggregation with the custom partitioner: the hot key "44" is spread over several partitions
    val r2 = r1.reduceByKey(partitioner = new MyPartitioner, func = _ + _)
    // Inspect the elements in each partition
    r2.mapPartitionsWithIndex((pId, iter) => {
      println("Partition " + pId + " elements: " + iter.toList)
      iter
    }).collect
    // Second aggregation: re-shuffle with a HashPartitioner to merge the partial counts of "44"
    val r3 = r2.reduceByKey(new HashPartitioner(8), _ + _)
    // Province TopN
    r3.sortBy(-_._2).take(2).foreach(println)
  }
}
```
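As an aside that is not in the original article, the more common way to achieve the same effect is key salting: append a random suffix to the hot key, aggregate, strip the suffix, and aggregate again. A rough sketch using the `r1` pairs from the code above (the separator `"#"` and the salt range 8 are arbitrary choices):

```scala
// Sketch: skew mitigation by salting instead of a custom partitioner.
// r1 is the (province, 1) pair RDD built above.
val salted  = r1.map { case (k, v) => (k + "#" + Random.nextInt(8), v) }
val partial = salted.reduceByKey(_ + _)              // first aggregation on salted keys
val merged  = partial
  .map { case (k, v) => (k.split('#')(0), v) }       // strip the salt
  .reduceByKey(_ + _)                                // second aggregation on the real keys
merged.sortBy(-_._2).take(2).foreach(println)
```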
