[Spark] Converting columns to rows in project data with a Spark DataFrame


1. Data source

Before converting anything, look at the structure of the data.

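Many rows contain nulls that have to be filtered out. Each status column (yes, maybe, invited, no) packs several user IDs into a single field, so the data needs to be reshaped into (event, userid, status) rows.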


scala> import org.apache.spark.sql.functions._

scala> val df = spark.read.format("csv").option("header","true").load("file:///opt/data/event_attendees.csv")

scala> df.printSchema
root
 |-- event: string (nullable = true)
 |-- yes: string (nullable = true)
 |-- maybe: string (nullable = true)
 |-- invited: string (nullable = true)
 |-- no: string (nullable = true)


2. First, map a single pair of columns (event and yes)

scala> df.filter(col("yes").isNotNull).select(col("event"),col("yes")).withColumn("userid",explode(split(col("yes")," "))).drop($"yes").withColumn("status",lit("yes")).show(3)
+----------+----------+------+
|     event|    userid|status|
+----------+----------+------+
|1159822043|1975964455|   yes|
|1159822043| 252302513|   yes|
|1159822043|4226086795|   yes|
+----------+----------+------+
only showing top 3 rows
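
The chain above does three things: it filters out null cells, splits the space-separated string into an array, and uses explode to emit one row per array element. A minimal sketch of the same steps on in-memory data follows. The toy rows, the names `toy` and `yesRows`, and the local SparkSession settings are assumptions for illustration and are not taken from event_attendees.csv.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, lit, split}

// Assumption: a local session for experimenting; inside spark-shell, getOrCreate reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("explode-sketch").getOrCreate()
import spark.implicits._

// Hypothetical toy rows shaped like the event and yes columns.
val toy = Seq[(String, String)](
  ("e1", "u1 u2 u3"),
  ("e2", null)
).toDF("event", "yes")

val yesRows = toy
  .filter(col("yes").isNotNull)                           // drop events without any "yes" users
  .withColumn("userid", explode(split(col("yes"), " ")))  // one output row per user ID
  .drop("yes")                                            // the packed column is no longer needed
  .withColumn("status", lit("yes"))                       // tag every row with its status

yesRows.show()
```

With this toy input, event e1 should expand into three rows (u1, u2, u3) and e2 should disappear, because its yes cell is null.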

3. Map the remaining columns the same way

scala> val no = df.filter(col("no").isNotNull).select(col("event"),col("no")).withColumn("userid",explode(split(col("no")," "))).drop($"no").withColumn("status",lit("no"))

scala> no.show(3)
+----------+----------+------+
|     event|    userid|status|
+----------+----------+------+
|1159822043|3575574655|    no|
|1159822043|1077296663|    no|
|1186208412|1728988561|    no|
+----------+----------+------+
only showing top 3 rows

scala> val invited = df.filter(col("invited").isNotNull).select(col("event"),col("invited")).withColumn("userid",explode(split(col("invited")," "))).drop($"invited").withColumn("status",lit("invited"))

scala> invited.show(3)
+----------+----------+-------+
|     event|    userid| status|
+----------+----------+-------+
|1159822043|1723091036|invited|
|1159822043|3795873583|invited|
|1159822043|4109144917|invited|
+----------+----------+-------+
only showing top 3 rows

scala> val maybe = df.filter(col("maybe").isNotNull).select(col("event"),col("maybe")).withColumn("userid",explode(split(col("maybe")," "))).drop($"maybe").withColumn("status",lit("maybe"))

scala> maybe.show(3)
+----------+----------+------+
|     event|    userid|status|
+----------+----------+------+
|1159822043|2733420590| maybe|
|1159822043| 517546982| maybe|
|1159822043|1350834692| maybe|
+----------+----------+------+
only showing top 3 rows
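
To arrive at the single (event, userid, status) table described in section 1, the per-status results still have to be combined. One possible way is sketched below: a small helper applies the same filter, split and explode steps to any status column, and the four results are merged with union. The helper name `explodeStatus`, the sample rows and the local SparkSession are assumptions for illustration; they do not come from the original data or code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, lit, split}

// Assumption: a local session for experimenting; inside spark-shell, getOrCreate reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("unpivot-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample with the same five columns as event_attendees.csv.
val sample = Seq[(String, String, String, String, String)](
  ("e1", "u1 u2", "u3", "u4 u5", null),
  ("e2", null, null, "u6", "u7")
).toDF("event", "yes", "maybe", "invited", "no")

// Hypothetical helper: turn one status column into (event, userid, status) rows.
def explodeStatus(df: DataFrame, status: String): DataFrame =
  df.filter(col(status).isNotNull)                                        // skip events with no users for this status
    .select(col("event"), explode(split(col(status), " ")).as("userid"))  // one row per user ID
    .withColumn("status", lit(status))                                    // record which column the ID came from

val statuses = Seq("yes", "maybe", "invited", "no")

// Apply the helper to every status column and stack the results.
val result = statuses.map(explodeStatus(sample, _)).reduce(_ union _)

result.show()
```

Because union matches columns by position, the helper always produces the columns in the same order (event, userid, status). The same call with `df` in place of `sample` would cover the real file, assuming its status columns are space-separated as in the steps above.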


