02 - Introduction to the Spark Toolset

DataSet

DataSet is a type-safe API: we define the data types of a DataSet up front, and the data is then processed strictly according to those types when it is loaded.

Like DataFrame, the DataSet API is split into Transformations and Actions: transformations define a set of operations, the Spark analyzer then parses and optimizes them, and the whole pipeline is only triggered at an Action, exactly as with the DataFrame described in the previous article.

Here is a code example:

package example

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

object SparkExample {
  // Define the schema; in Scala this is done with a case class
  case class Flight(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String, count: BigInt)
  
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SparkExample")
      .master("local").getOrCreate // 创建SparkSession

    spark.sparkContext.setLogLevel("ERROR") // only log ERROR messages

    import spark.implicits._  // implicits are required for the DataFrame-to-DataSet conversion below
    
    // Load the dataset; at this point it comes back as a DataFrame
    val flightData = spark
      .read
      .parquet("/Users/ericklv/Documents/code/Spark-The-Definitive-Guide-master/data/flight-data/parquet/2010-summary.parquet/")
      
    // Convert the DataFrame into a DataSet
    val flights = flightData.as[Flight]
    
    val data = flights
      .filter(_.ORIGIN_COUNTRY_NAME != "Canada")
      .map(flight_now => flight_now)
      .take(5)
    println(data.mkString(","))
  }
}

Output:

Flight(United States,Romania,1),Flight(United States,Ireland,264),Flight(United States,India,69),Flight(Egypt,United States,24),Flight(Equatorial Guinea,United States,1)

To summarize the steps for using a DataSet:

  1. Create the Spark environment, i.e. the SparkSession -- every Spark program starts this way
  2. Define a case class to specify the schema
  3. Load the data as a DataFrame
  4. Convert the DataFrame into a DataSet
  5. Trigger an Action, which actually executes the whole pipeline (see the short sketch after this list)
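Step 5 deserves emphasis: the filter and map above only build up a plan, and Spark reads no data until the Action runs. A minimal sketch, reusing the flights DataSet from the example above (the pipeline name is just for illustration; explain() only prints the plan the optimizer produced):

val pipeline = flights
  .filter(_.ORIGIN_COUNTRY_NAME != "Canada")  // lazy: no data is read yet
  .map(flight_now => flight_now)              // still only building the plan

pipeline.explain()  // print the optimized plan, nothing executes
pipeline.take(5)    // the Action: only now does Spark run the whole pipeline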

Structured Streaming

When Spark performs structured stream processing, it still follows a micro-batch model under the hood, so we need a time window over which to aggregate; the window function is what provides this windowed aggregation. Let's first look at window in a batch example:

package example

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{desc, window, column, col}

object SparkExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SparkExample")
      .master("local").getOrCreate // 创建SparkSession

    spark.conf.set("spark.sql.shuffle.partitions", "5")
    spark.sparkContext.setLogLevel("ERROR") // only log ERROR messages

    val staticDataFrame = spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/Users/ericklv/Documents/code/Spark-The-Definitive-Guide-master/data/retail-data/by-day/*.csv")

    // Register the DataFrame as a temporary view so it can be queried with Spark SQL
    staticDataFrame.createOrReplaceTempView("retail_data")
    println(staticDataFrame.schema)
    staticDataFrame
      .selectExpr("CustomerId", "(UnitPrice * Quantity) as total_cost", "InvoiceDate")
      .groupBy(col("CustomerId"), window(col("InvoiceDate"), "1 day"))
      .sum("total_cost")
      .show(5)
  }
}

Output:

StructType(StructField(InvoiceNo,StringType,true), StructField(StockCode,StringType,true), StructField(Description,StringType,true), StructField(Quantity,IntegerType,true), StructField(InvoiceDate,StringType,true), StructField(UnitPrice,DoubleType,true), StructField(CustomerID,DoubleType,true), StructField(Country,StringType,true))

+----------+--------------------+-----------------+
|CustomerId|              window|  sum(total_cost)|
+----------+--------------------+-----------------+
|   16057.0|{2011-12-05 08:00...|            -37.6|
|   14126.0|{2011-11-29 08:00...|643.6300000000001|
|   13500.0|{2011-11-16 08:00...|497.9700000000001|
|   17160.0|{2011-11-08 08:00...|516.8499999999999|
|   15608.0|{2011-11-11 08:00...|            122.4|
+----------+--------------------+-----------------+
only showing top 5 rows

As you can see, the window aggregation is done in units of one day.
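The window function also takes an optional slide duration, so the same aggregation can be computed over overlapping windows instead of the tumbling one-day windows above. A minimal sketch (only the third argument is new; everything else matches the batch example):

staticDataFrame
  .selectExpr("CustomerId", "(UnitPrice * Quantity) as total_cost", "InvoiceDate")
  .groupBy(col("CustomerId"), window(col("InvoiceDate"), "1 day", "12 hours"))  // 1-day window, sliding every 12 hours
  .sum("total_cost")
  .show(5)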

Building on the code above, we can switch to a streaming version. Note that the approach below (an in-memory sink) is not recommended for production use:

package example

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{desc, window, column, col}

object SparkExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SparkExample")
      .master("local").getOrCreate // 创建SparkSession

    spark.conf.set("spark.sql.shuffle.partitions", "5")
    spark.sparkContext.setLogLevel("ERROR") // only log ERROR messages

    val staticDataFrame = spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/Users/ericklv/Documents/code/Spark-The-Definitive-Guide-master/data/retail-data/by-day/*.csv")

    // Register the DataFrame as a temporary view so it can be queried with Spark SQL
    staticDataFrame.createOrReplaceTempView("retail_data")

    val staticSchema = staticDataFrame.schema
    val streamingDataFrame = spark.readStream
      .schema(staticSchema)  // reuse the schema from the static read so the types are explicit
      .option("maxFilesPerTrigger", "1")  // 设置触发器,每读取一个文件出发流的更新
      .format("csv")
      .option("header", "true")
      .load("/Users/ericklv/Documents/code/Spark-The-Definitive-Guide-master/data/retail-data/by-day/*.csv")
    println(streamingDataFrame.isStreaming)  // prints whether this is a streaming DataFrame

    // We cannot simply run an action such as a count here; on a stream we only define the lazy aggregation
    val purchaseByCustomerPerHour = streamingDataFrame
      .selectExpr("CustomerId", "(UnitPrice * Quantity) as total_cost", "InvoiceDate")
      .groupBy(col("CustomerId"), window(col("InvoiceDate"), "1 day"))
      .sum("total_cost")

    val query = purchaseByCustomerPerHour.writeStream
      .format("memory")  // write to an in-memory table
      .queryName("customer_purchases")  // name of the in-memory table
      .outputMode("complete")  // keep every record of the aggregation in the table
      .start()  // start the stream (it runs asynchronously)

    // start() returns immediately; wait for the available files to be processed,
    // otherwise the in-memory table may still be empty when it is queried below
    query.processAllAvailable()

    spark.sql(
      """
        |SELECT *
        | FROM customer_purchases
        |ORDER BY `sum(total_cost)` DESC
        |""".stripMargin)
      .show(5)
  }
}

To summarize how Structured Streaming is used:

  1. Define the data source and specify the schema to read
  2. Define the analysis steps and specify the trigger condition
  3. Specify the output location (the sink); streaming results must be written somewhere, here an in-memory table (see the sketch after this list for an alternative)
  4. Write the streaming data to that sink
  5. Run the analysis query
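Since the memory sink is only meant for debugging, here is a minimal sketch, not from the original example, of swapping step 3 for the console sink, which prints each result batch to stdout; a real job would typically use a file or Kafka sink together with a checkpointLocation option instead:

// Same aggregation as above, but written to the console sink; consoleQuery is just an illustrative name
val consoleQuery = purchaseByCustomerPerHour.writeStream
  .format("console")
  .outputMode("complete")  // emit the full aggregation table on every trigger
  .start()

consoleQuery.awaitTermination()  // keep the main thread alive so the stream keeps running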