DataSet
The DataSet is a type-safe API: we define the DataSet's data type up front, and when the data is loaded it is processed strictly according to that type.
Like the DataFrame covered in the previous article, a DataSet is also split into transformations and actions: transformations define a set of operations, Spark's analyzer parses and optimizes them, and the whole pipeline is only executed when an action is finally invoked.
Here is a code example:
package example

import org.apache.spark.sql.SparkSession

object SparkExample {
  // Define the schema; in Scala this is done with a case class
  case class Flight(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String, count: BigInt)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SparkExample")
      .master("local").getOrCreate // create the SparkSession
    spark.sparkContext.setLogLevel("ERROR") // only log ERROR messages
    import spark.implicits._ // required: brings the implicit Encoders into scope

    // Load the dataset; at this point it is still a DataFrame
    val flightData = spark
      .read
      .parquet("/Users/ericklv/Documents/code/Spark-The-Definitive-Guide-master/data/flight-data/parquet/2010-summary.parquet/")

    // Convert the DataFrame into a DataSet
    val flights = flightData.as[Flight]
    val data = flights
      .filter(_.ORIGIN_COUNTRY_NAME != "Canada")
      .map(flight_now => flight_now)
      .take(5) // take is an Action and triggers execution
    println(data.mkString(","))
  }
}
Output:
Flight(United States,Romania,1),Flight(United States,Ireland,264),Flight(United States,India,69),Flight(Egypt,United States,24),Flight(Equatorial Guinea,United States,1)
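Because flights is a typed Dataset[Flight], the typed API is also available. A minimal sketch continuing the example above (the name countsByDest is just illustrative):

// Typed aggregation: extract the key with a Scala function and count per key;
// the result is a Dataset[(String, Long)]
val countsByDest = flights
  .groupByKey(_.DEST_COUNTRY_NAME)
  .count()
countsByDest.show(5)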
To summarize, the steps for using a DataSet are:
- Create the Spark environment, i.e. the SparkSession -- every Spark program starts this way
- Define a case class that specifies the schema
- Load the data as a DataFrame
- Convert the DataFrame into a DataSet
- Trigger an Action, which actually executes the whole pipeline
Structured Streaming
When Spark runs structured stream processing it is still essentially micro-batching, so we need to specify a time window for the aggregation; the window function is what does this windowed aggregation. Let's first look at window in a batch example:
package example

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{window, col}

object SparkExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SparkExample")
      .master("local").getOrCreate // create the SparkSession
    spark.conf.set("spark.sql.shuffle.partitions", "5")
    spark.sparkContext.setLogLevel("ERROR") // only log ERROR messages

    val staticDataFrame = spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/Users/ericklv/Documents/code/Spark-The-Definitive-Guide-master/data/retail-data/by-day/*.csv")

    // Register the DataFrame as a temp view so it can be queried with Spark SQL
    staticDataFrame.createOrReplaceTempView("retail_data")
    println(staticDataFrame.schema)

    staticDataFrame
      .selectExpr("CustomerId", "(UnitPrice * Quantity) as total_cost", "InvoiceDate")
      .groupBy(col("CustomerId"), window(col("InvoiceDate"), "1 day"))
      .sum("total_cost")
      .show(5)
  }
}
Output:
StructType(StructField(InvoiceNo,StringType,true), StructField(StockCode,StringType,true), StructField(Description,StringType,true), StructField(Quantity,IntegerType,true), StructField(InvoiceDate,StringType,true), StructField(UnitPrice,DoubleType,true), StructField(CustomerID,DoubleType,true), StructField(Country,StringType,true))
+----------+--------------------+-----------------+
|CustomerId| window| sum(total_cost)|
+----------+--------------------+-----------------+
| 16057.0|{2011-12-05 08:00...| -37.6|
| 14126.0|{2011-11-29 08:00...|643.6300000000001|
| 13500.0|{2011-11-16 08:00...|497.9700000000001|
| 17160.0|{2011-11-08 08:00...|516.8499999999999|
| 15608.0|{2011-11-11 08:00...| 122.4|
+----------+--------------------+-----------------+
only showing top 5 rows
As you can see, the window aggregation uses one day as its unit.
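The window function also takes an optional slide duration for overlapping windows. A minimal sketch based on the batch example above (the 12-hour slide is an arbitrary illustrative choice):

// Sliding windows: 1-day windows that start every 12 hours, so each
// invoice falls into two overlapping windows
staticDataFrame
  .selectExpr("CustomerId", "(UnitPrice * Quantity) as total_cost", "InvoiceDate")
  .groupBy(col("CustomerId"), window(col("InvoiceDate"), "1 day", "12 hours"))
  .sum("total_cost")
  .show(5)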
Building on the code above, we can switch to a streaming version. Note that the approach below (an in-memory sink) is not recommended for production use:
package example

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{window, col}

object SparkExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SparkExample")
      .master("local").getOrCreate // create the SparkSession
    spark.conf.set("spark.sql.shuffle.partitions", "5")
    spark.sparkContext.setLogLevel("ERROR") // only log ERROR messages

    val staticDataFrame = spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/Users/ericklv/Documents/code/Spark-The-Definitive-Guide-master/data/retail-data/by-day/*.csv")

    // Register the DataFrame as a temp view so it can be queried with Spark SQL
    staticDataFrame.createOrReplaceTempView("retail_data")
    val staticSchema = staticDataFrame.schema

    val streamingDataFrame = spark.readStream
      .schema(staticSchema) // reuse the schema from the static read, so the types are explicit
      .option("maxFilesPerTrigger", "1") // each micro-batch trigger reads at most one file
      .format("csv")
      .option("header", "true")
      .load("/Users/ericklv/Documents/code/Spark-The-Definitive-Guide-master/data/retail-data/by-day/*.csv")
    println(streamingDataFrame.isStreaming) // prints whether this is a streaming DataFrame

    // A plain count is not possible here; it has no meaning on an unbounded stream
    val purchaseByCustomerPerHour = streamingDataFrame
      .selectExpr("CustomerId", "(UnitPrice * Quantity) as total_cost", "InvoiceDate")
      .groupBy(col("CustomerId"), window(col("InvoiceDate"), "1 day"))
      .sum("total_cost")

    purchaseByCustomerPerHour.writeStream
      .format("memory") // write to an in-memory table
      .queryName("customer_purchases") // name of the in-memory table
      .outputMode("complete") // keep all rows of the aggregated result
      .start() // start the stream

    spark.sql(
      """
        |SELECT *
        |  FROM customer_purchases
        |ORDER BY `sum(total_cost)` DESC
        |""".stripMargin)
      .show(5)
  }
}
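Because the memory sink keeps updating the customer_purchases table as new files are picked up, the query above can simply be re-run to watch the aggregates change. A minimal sketch (the loop count and sleep interval are arbitrary):

// Re-run the same query a few times while the stream is still processing files
for (_ <- 1 to 3) {
  spark.sql("SELECT * FROM customer_purchases ORDER BY `sum(total_cost)` DESC")
    .show(5)
  Thread.sleep(5000) // give the stream time to finish a few more micro-batches
}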
To summarize, the steps for using Structured Streaming are:
- Define the data source and specify the schema to read
- Define the analysis steps and specify the trigger condition
- Specify where the streaming output is written; the stream must be written to a sink, here an in-memory table (see the sketch below for a non-memory alternative)
- Run the analysis query against that sink
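As noted above, the in-memory table is only meant for local experiments. A hedged sketch of a longer-running query, assuming a console sink for inspection and a hypothetical checkpoint path /tmp/spark-checkpoint (a real deployment would use a durable sink such as Kafka or files):

// Sketch only: the console sink is still for debugging, but the structure of a
// production query is the same: a sink, a checkpoint location, and a blocking wait
val query = purchaseByCustomerPerHour.writeStream
  .format("console")                                     // print each micro-batch result
  .outputMode("complete")                                // emit the full aggregate table each trigger
  .option("checkpointLocation", "/tmp/spark-checkpoint") // hypothetical path for recovery state
  .start()
query.awaitTermination() // block the driver so the streaming query keeps running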