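1. Window overview

Real-world streams are generally unbounded. To process unbounded data, we usually split the infinite stream into finite data sets and run the business logic on those.
A window is one way of cutting an infinite stream into finite pieces: it distributes the stream's records into buckets of finite size for analysis.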

1. Window types

Time window:

Tumbling time window

Sliding time window

Session window

Count window:

Tumbling count window

Sliding count window

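2. Tumbling windows

[Figure: tumbling window diagram]

The data is split according to a fixed window length.

Windows are aligned in time, the window length is fixed, and windows do not overlap, so each record belongs to exactly one window.

Use case: BI-style statistics (an aggregate computed for each time period).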
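The alignment described above can be sketched in plain Java with no Flink dependency. The class and method names here are illustrative, and the sketch assumes non-negative millisecond timestamps and a zero window offset; it shows the idea of aligned, non-overlapping windows rather than Flink's actual implementation.

```java
final class TumblingWindowSketch {
    // Start (inclusive) of the tumbling window that contains the timestamp.
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    // End (exclusive) of that same window.
    static long windowEnd(long timestampMs, long sizeMs) {
        return windowStart(timestampMs, sizeMs) + sizeMs;
    }

    public static void main(String[] args) {
        long size = 5_000L; // 5-second windows
        for (long ts : new long[] {0L, 4_999L, 5_000L, 12_345L}) {
            System.out.println(ts + " -> [" + windowStart(ts, size)
                    + ", " + windowEnd(ts, size) + ")");
        }
    }
}
```

Because the start is derived from the timestamp alone, every record maps to exactly one window, which is what makes tumbling windows a natural fit for per-period aggregates.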

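3. Sliding windows

[Figure: sliding window diagram]

A sliding window is a more general form of the fixed (tumbling) window. It is defined by a fixed window length and a slide interval.

The window length is fixed and windows may overlap, so one record can belong to several windows.

Use case: statistics over the most recent period (for example, computing an interface's failure rate over the last 5 minutes to decide whether to raise an alert).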
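To make the overlap concrete, here is a minimal plain-Java sketch (no Flink dependency; names are illustrative) that lists every sliding window containing a given timestamp. It assumes non-negative millisecond timestamps, a zero offset, and a slide no larger than the window size.

```java
import java.util.ArrayList;
import java.util.List;

final class SlidingWindowSketch {
    // Start timestamps of all windows of length sizeMs, advancing by slideMs,
    // whose range [start, start + sizeMs) contains timestampMs.
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // The latest window that can contain the timestamp starts on a slide boundary.
        long lastStart = timestampMs - (timestampMs % slideMs);
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // 10-second windows sliding every 5 seconds: each record lands in two windows.
        System.out.println(windowStarts(12_000L, 10_000L, 5_000L));
    }
}
```

When the slide equals the window length, the loop yields a single window, which is the tumbling case.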

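4. Session windows

[Figure: session window diagram]

A session window consists of a series of events followed by a timeout gap of a specified length. In other words, when no new data arrives for that long, the current window closes and the next record starts a new window.

Characteristic: windows are not aligned in time.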
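The gap rule can be sketched in plain Java as follows (no Flink dependency; names are illustrative). The sketch assumes the timestamps of one key arrive in ascending order, and it treats a distance strictly greater than the gap as the start of a new session; how the exact-boundary case is handled is a choice made here for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class SessionWindowSketch {
    // Groups ascending timestamps into sessions separated by more than gapMs of silence.
    static List<List<Long>> sessions(long[] sortedTimestampsMs, long gapMs) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTimestampsMs) {
            boolean gapExceeded = !current.isEmpty()
                    && ts - current.get(current.size() - 1) > gapMs;
            if (gapExceeded) {
                result.add(current);          // close the current session
                current = new ArrayList<>();  // the record opens a new one
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        long[] events = {1_000L, 3_000L, 20_000L, 21_000L, 40_000L};
        System.out.println(sessions(events, 10_000L));
    }
}
```

Session boundaries depend only on when the data arrives, which is why these windows have no fixed alignment and no fixed length.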

5. Window API

Tumbling windows (tumbling time windows): .timeWindow(Time size)

// Source definition of the tumbling window method
public WindowedStream<T, KEY, TimeWindow> timeWindow(Time size) {
    if (environment.getStreamTimeCharacteristic() ==
            TimeCharacteristic.ProcessingTime) {
        // Time characteristic is ProcessingTime: assign windows by processing time
        return window(TumblingProcessingTimeWindows.of(size));
    } else {
        // Any other time characteristic: assign windows by event time
        return window(TumblingEventTimeWindows.of(size));
    }
}

Sliding windows (sliding time windows): .timeWindow(Time size, Time slide)

public WindowedStream<T, KEY, TimeWindow> timeWindow(Time size, Time slide) {
    if (environment.getStreamTimeCharacteristic() == 
                TimeCharacteristic.ProcessingTime) {
        return window(SlidingProcessingTimeWindows.of(size, slide));
    } else {
        return window(SlidingEventTimeWindows.of(size, slide));
    }
}

Session windows: .window(EventTimeSessionWindows.withGap(Time.seconds(10)))

Tumbling count windows: .countWindow(size: Long)

def countWindow(size: Long): WindowedStream[T, K, GlobalWindow] = {
    new WindowedStream(javaStream.countWindow(size))
}

Sliding count windows: .countWindow(size: Long, slide: Long)

def countWindow(size: Long, slide: Long): WindowedStream[T, K, GlobalWindow] = {
    new WindowedStream(javaStream.countWindow(size, slide))
}
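As a rough illustration of what the two count-window variants compute, the plain-Java sketch below (no Flink dependency; names are illustrative) processes the values of a single key. It assumes the sliding variant fires every `slide` records over at most the most recent `size` records, so early firings may see fewer than `size` records; the tumbling variant is modelled as the special case where the slide equals the size.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class CountWindowSketch {
    // Emits the sum of the most recent `size` values each time `slide` new values arrive.
    static List<Long> slidingCountSums(long[] values, int size, int slide) {
        Deque<Long> buffer = new ArrayDeque<>();
        List<Long> fired = new ArrayList<>();
        int sinceLastFire = 0;
        for (long v : values) {
            buffer.addLast(v);
            if (buffer.size() > size) {
                buffer.removeFirst(); // keep only the latest `size` values
            }
            sinceLastFire++;
            if (sinceLastFire == slide) {
                long sum = 0L;
                for (long b : buffer) {
                    sum += b;
                }
                fired.add(sum);
                sinceLastFire = 0;
            }
        }
        return fired;
    }

    // Tumbling count window: fire once per `size` values, with no overlap.
    static List<Long> tumblingCountSums(long[] values, int size) {
        return slidingCountSums(values, size, size);
    }

    public static void main(String[] args) {
        long[] values = {1L, 2L, 3L, 4L, 5L, 6L};
        System.out.println(slidingCountSums(values, 4, 2));
        System.out.println(tumblingCountSums(values, 3));
    }
}
```

Records that arrive after the last firing stay buffered and are not emitted until enough further records arrive.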

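6. Window functions

A window function defines the computation performed on the data collected in a window. Window functions fall into two main categories:

1. Incremental aggregation functions

The computation runs as each record arrives, and only a small piece of state is kept, which saves memory. Commonly used incremental aggregation functions:

ReduceFunction

AggregateFunction

// Definition of AggregateFunction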
// * @param <IN>  The type of the values that are aggregated (input values)
// * @param <ACC> The type of the accumulator (intermediate aggregate state).
// * @param <OUT> The type of the aggregated result
AggregateFunction<IN, ACC, OUT>

Example

import com.atguigu.day2.{SensorReading, SensorSource}
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object AvgTempByAggregateFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream = env.addSource(new SensorSource)

    stream
      .keyBy(_.id)
      .timeWindow(Time.seconds(5))
      .aggregate(new AvgTempAgg)
      .print()

    env.execute()
  }

  // Each (key, window) pair has its own accumulator.
  // First type parameter: the type of the elements in the stream
  // Second type parameter: the accumulator type, a tuple of (sensor ID, number of readings seen so far, sum of those readings)
  // Third type parameter: the output type of the incremental aggregation function, a tuple of (sensor ID, average temperature in the window)
  class AvgTempAgg extends AggregateFunction[SensorReading, (String, Long, Double), (String, Double)] {
    // Create an empty accumulator
    override def createAccumulator(): (String, Long, Double) = ("", 0L, 0.0)

    // Aggregation logic: fold one reading into the accumulator
    override def add(value: SensorReading, accumulator: (String, Long, Double)): (String, Long, Double) = {
      (value.id, accumulator._2 + 1, accumulator._3 + value.temperature)
    }

    // Result emitted when the window closes
    override def getResult(accumulator: (String, Long, Double)): (String, Double) = {
      (accumulator._1, accumulator._3 / accumulator._2)
    }

    // How two accumulators are merged
    override def merge(a: (String, Long, Double), b: (String, Long, Double)): (String, Long, Double) = {
      (a._1, a._2 + b._2, a._3 + b._3)
    }
  }
}
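The accumulator lifecycle used by the example above (create, add, merge, get result) can also be sketched without Flink. The class below is a hypothetical stand-in for the (count, sum) accumulator, not Flink's API. It shows that the state stays at two numbers regardless of how many readings arrive, and that merging two partial accumulators, as can happen when windows are merged (for example session windows), gives the same average as aggregating all readings in one accumulator.

```java
final class AvgAccumulatorSketch {
    long count;  // number of readings folded in so far
    double sum;  // running sum of those readings

    // Counterpart of createAccumulator(): an empty accumulator.
    static AvgAccumulatorSketch create() {
        return new AvgAccumulatorSketch();
    }

    // Counterpart of add(): fold one reading into the state.
    AvgAccumulatorSketch add(double temperature) {
        count++;
        sum += temperature;
        return this;
    }

    // Counterpart of merge(): combine two partial accumulators into a new one.
    static AvgAccumulatorSketch merge(AvgAccumulatorSketch a, AvgAccumulatorSketch b) {
        AvgAccumulatorSketch merged = create();
        merged.count = a.count + b.count;
        merged.sum = a.sum + b.sum;
        return merged;
    }

    // Counterpart of getResult(): assumes at least one reading was added.
    double result() {
        return sum / count;
    }

    public static void main(String[] args) {
        AvgAccumulatorSketch left = create().add(10.0).add(20.0);
        AvgAccumulatorSketch right = create().add(30.0);
        System.out.println(merge(left, right).result());
    }
}
```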

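2. Full window functions

ProcessWindowFunction first collects all of the window's data and then iterates over all of it when the computation runs. Source definition: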

  /* @tparam IN The type of the input value.
  * @tparam OUT The type of the output value.
  * @tparam KEY The type of the key.
  * @tparam W The type of the window.
  */
 ProcessWindowFunction[IN, OUT, KEY, W <: Window]

Example

import com.atguigu.day2.{SensorReading, SensorSource}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object AvgTempByProcessWindowFunction {

  case class AvgInfo(id: String, avgTemp: Double, windowStart: Long, windowEnd: Long)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream = env.addSource(new SensorSource)

    stream
      .keyBy(_.id)
      .timeWindow(Time.seconds(5))
      .process(new AvgTempFunc)
      .print()

    env.execute()
  }

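  // Drawback compared with an incremental aggregation function: every element of the window must be stored,
  // whereas an incremental aggregation function only needs to store one accumulator.
  // Advantage: a full window function can access the window's metadata.
  class AvgTempFunc extends ProcessWindowFunction[SensorReading, AvgInfo, String, TimeWindow] {
    // Called when the window closes
    override def process(key: String, context: Context, elements: Iterable[SensorReading], out: Collector[AvgInfo]): Unit = {
      val count = elements.size // number of temperature readings in the window when it closes
      var sum = 0.0 // sum of the temperature values
      for (r <- elements) {
        sum += r.temperature
      }
      // Window start and end timestamps, in milliseconds
      val windowStart = context.window.getStart
      val windowEnd = context.window.getEnd
      out.collect(AvgInfo(key, sum / count, windowStart, windowEnd))
    }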
  }
}
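To illustrate when buffering the whole window is worth its memory cost, the plain-Java sketch below (no Flink dependency; names are illustrative) computes a median, which needs all of the window's values at once and is chosen here only as an example of a computation that a small running accumulator does not express directly. It assumes the window contains at least one reading when it closes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class FullWindowSketch {
    // Every reading of the window is kept until the window closes.
    private final List<Double> buffered = new ArrayList<>();

    void add(double temperature) {
        buffered.add(temperature);
    }

    // Number of stored elements: grows with the window's contents.
    int stateSize() {
        return buffered.size();
    }

    // Called once at window close: iterates over (a sorted copy of) all readings.
    double median() {
        List<Double> sorted = new ArrayList<>(buffered);
        Collections.sort(sorted);
        int n = sorted.size();
        if (n % 2 == 1) {
            return sorted.get(n / 2);
        }
        return (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    public static void main(String[] args) {
        FullWindowSketch window = new FullWindowSketch();
        for (double t : new double[] {30.0, 10.0, 20.0}) {
            window.add(t);
        }
        System.out.println(window.median() + " from " + window.stateSize() + " buffered readings");
    }
}
```

For a plain average, the (count, sum) accumulator gives the same answer with constant state, so the buffered form is best reserved for cases that need every element or the window metadata.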

3. Combining incremental aggregation with a full window function

Here is the example:

import com.atguigu.day2.{SensorReading, SensorSource}
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object AvgTempByAggAndProcWindow {

  case class AvgInfo(id: String, avgTemp: Double, windowStart: Long, windowEnd: Long)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream = env.addSource(new SensorSource)

    stream
      .keyBy(_.id)
      .timeWindow(Time.seconds(5))
      .aggregate(new AvgTempAgg, new WindowResult)
      .print()

    env.execute()
  }

  // Each (key, window) pair has its own accumulator.
  // First type parameter: the type of the elements in the stream
  // Second type parameter: the accumulator type, a tuple of (sensor ID, number of readings seen so far, sum of those readings)
  // Third type parameter: the output type of the incremental aggregation function, a tuple of (sensor ID, average temperature in the window)
  class AvgTempAgg extends AggregateFunction[SensorReading, (String, Long, Double), (String, Double)] {
    // Create an empty accumulator
    override def createAccumulator(): (String, Long, Double) = ("", 0L, 0.0)

    // Aggregation logic: fold one reading into the accumulator
    override def add(value: SensorReading, accumulator: (String, Long, Double)): (String, Long, Double) = {
      (value.id, accumulator._2 + 1, accumulator._3 + value.temperature)
    }

    // Result emitted when the window closes
    override def getResult(accumulator: (String, Long, Double)): (String, Double) = {
      (accumulator._1, accumulator._3 / accumulator._2)
    }

    // How two accumulators are merged
    override def merge(a: (String, Long, Double), b: (String, Long, Double)): (String, Long, Double) = {
      (a._1, a._2 + b._2, a._3 + b._3)
    }
  }

  // Note: the input type parameter is the output type of the incremental aggregation function; this function attaches the window information.
  class WindowResult extends ProcessWindowFunction[(String, Double), AvgInfo, String, TimeWindow] {
    override def process(key: String, context: Context, elements: Iterable[(String, Double)], out: Collector[AvgInfo]): Unit = {
      // The iterable holds exactly one value: the aggregated result passed on by the incremental aggregation function.
      out.collect(AvgInfo(key, elements.head._2, context.window.getStart, context.window.getEnd))
    }
  }
}
