For example code, see (switch to the Flink1.12 branch for the latest code): github.com/perkinls/fl…
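After defining the window assigner, we need to specify the computation to perform on each window. This is the job of the window function, which processes the elements of each (keyed) window once the system determines that the window is ready for processing.
A ProcessWindowFunction can be combined with a ReduceFunction or an AggregateFunction to aggregate elements incrementally as they arrive at the window. When the window closes, the ProcessWindowFunction is given the aggregated result. This lets the window be computed incrementally while the ProcessWindowFunction still has access to the window metadata.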
ReduceFunction
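Incremental aggregation with ReduceFunction
A ReduceFunction specifies how two elements from the input are combined to produce an output element of the same type.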
val input: DataStream[(String, Long)] = ...

// Sums the second field of the tuples for all elements in the window
input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .reduce { (v1, v2) => (v1._1, v1._2 + v2._2) }
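To make "incremental" concrete, here is a minimal plain-Scala sketch of what a reduce does inside one keyed window. It does not use Flink at all; `ReduceSketch`, `sum` and `runWindow` are hypothetical names chosen for illustration, and the real operator keeps its running value in managed state rather than in a local variable.

```scala
object ReduceSketch {
  // Same shape as the reduce lambda above: keep the key, sum the second field.
  val sum: ((String, Long), (String, Long)) => (String, Long) =
    (v1, v2) => (v1._1, v1._2 + v2._2)

  // Simulates one keyed window: only a single running value is kept, and each
  // arriving element is folded into it immediately.
  def runWindow(arrivals: Seq[(String, Long)]): Option[(String, Long)] = {
    var state: Option[(String, Long)] = None
    for (element <- arrivals) {
      state = state match {
        case None          => Some(element)               // first element becomes the state
        case Some(current) => Some(sum(current, element)) // later elements are combined
      }
    }
    state
  }

  def main(args: Array[String]): Unit =
    println(runWindow(Seq(("a", 1L), ("a", 2L), ("a", 3L)))) // prints Some((a,6))
}
```

The point of the sketch is that the stored state stays at one element no matter how many elements arrive, which is why a reduce is cheaper than buffering the whole window.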
AggregateFunction
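An AggregateFunction is a generalized version of a ReduceFunction with three types: an input type (IN), an accumulator type (ACC), and an output type (OUT). The interface has a method for adding one input element to the accumulator, as well as methods for creating an initial accumulator, merging two accumulators into one, and extracting an output (of type OUT) from an accumulator.
As with ReduceFunction, Flink incrementally aggregates the input elements of a window as they arrive.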
import org.apache.flink.api.common.functions.AggregateFunction

/**
 * The accumulator holds a running sum and a count. The [getResult] method computes the average.
 * Computes the average of the second field of the elements in the window.
 */
class AverageAggregate extends AggregateFunction[(String, Long), (Long, Long), Double] {
  override def createAccumulator(): (Long, Long) = (0L, 0L)

  override def add(value: (String, Long), accumulator: (Long, Long)): (Long, Long) =
    (accumulator._1 + value._2, accumulator._2 + 1L)

  // Convert to Double before dividing; Long / Long would truncate the average.
  override def getResult(accumulator: (Long, Long)): Double =
    accumulator._1.toDouble / accumulator._2

  override def merge(a: (Long, Long), b: (Long, Long)): (Long, Long) =
    (a._1 + b._1, a._2 + b._2)
}

val input: DataStream[(String, Long)] = ...

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .aggregate(new AverageAggregate)
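The four methods are easier to follow when traced by hand. The sketch below is plain Scala without Flink: the `Agg` trait is only a stand-in that mirrors the shape of the interface described above, not Flink's real `AggregateFunction`, and `AggregateSketch`, `Average` and `accumulate` are hypothetical names. It also shows why the result should be converted to `Double` before dividing.

```scala
object AggregateSketch {
  // Stand-in with the same four methods as described above (not the Flink interface).
  trait Agg[IN, ACC, OUT] {
    def createAccumulator(): ACC
    def add(value: IN, accumulator: ACC): ACC
    def getResult(accumulator: ACC): OUT
    def merge(a: ACC, b: ACC): ACC
  }

  object Average extends Agg[(String, Long), (Long, Long), Double] {
    def createAccumulator(): (Long, Long) = (0L, 0L)
    def add(value: (String, Long), accumulator: (Long, Long)): (Long, Long) =
      (accumulator._1 + value._2, accumulator._2 + 1L)
    // 7L / 2L would give 3; converting first gives 3.5.
    def getResult(accumulator: (Long, Long)): Double =
      accumulator._1.toDouble / accumulator._2
    def merge(a: (Long, Long), b: (Long, Long)): (Long, Long) =
      (a._1 + b._1, a._2 + b._2)
  }

  // Folds the elements into a fresh accumulator, one at a time.
  def accumulate(elements: Seq[(String, Long)]): (Long, Long) =
    elements.foldLeft(Average.createAccumulator())((acc, e) => Average.add(e, acc))

  def main(args: Array[String]): Unit = {
    val left  = accumulate(Seq(("a", 3L), ("a", 4L))) // (7, 2)
    val right = accumulate(Seq(("a", 8L)))            // (8, 1)
    println(Average.getResult(left))                       // prints 3.5
    println(Average.getResult(Average.merge(left, right))) // prints 5.0
  }
}
```

Merging two accumulators matters when two windows are combined into one, as happens with merging window assigners such as session windows.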
ProcessWindowFunction
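A ProcessWindowFunction receives an Iterable containing all elements of the window, plus a Context object with access to time and state information, which makes it more flexible than the other window functions. This flexibility comes at the cost of performance and resource consumption: elements cannot be aggregated incrementally and must instead be buffered internally until the window is considered ready for processing.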
abstract class ProcessWindowFunction[IN, OUT, KEY, W <: Window] extends Function {

  /**
   * Evaluates the window and outputs none or several elements.
   *
   * @param key      The key for which this window is evaluated.
   * @param context  The context in which the window is being evaluated.
   * @param elements The elements in the window being evaluated.
   * @param out      A collector for emitting elements.
   * @throws Exception The function may throw exceptions to fail the program and trigger recovery.
   */
  def process(
      key: KEY,
      context: Context,
      elements: Iterable[IN],
      out: Collector[OUT])

  /**
   * The context holding window metadata
   */
  abstract class Context {
    /**
     * Returns the window that is being evaluated.
     */
    def window: W

    /**
     * Returns the current processing time.
     */
    def currentProcessingTime: Long

    /**
     * Returns the current event-time watermark.
     */
    def currentWatermark: Long

    /**
     * State accessor for per-key and per-window state.
     */
    def windowState: KeyedStateStore

    /**
     * State accessor for per-key global state.
     */
    def globalState: KeyedStateStore
  }
}
A ProcessWindowFunction can be defined and used like this:
val input: DataStream[(String, Long)] = ...

input
  .keyBy(_._1)
  // timeWindow(...) is deprecated as of Flink 1.12; an explicit assigner is used instead (event time assumed)
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))
  .process(new MyProcessWindowFunction())

/* Counts the elements in the window */
class MyProcessWindowFunction extends ProcessWindowFunction[(String, Long), String, String, TimeWindow] {
  override def process(key: String, context: Context, input: Iterable[(String, Long)], out: Collector[String]): Unit = {
    var count = 0L
    for (in <- input) {
      count = count + 1
    }
    out.collect(s"Window ${context.window} count: $count")
  }
}
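Counting could just as well be done incrementally, so it does not show why one would pay for buffering. The sketch below is a plain-Scala illustration (no Flink) of a computation that needs the complete window contents: a median. `FullWindowSketch`, `Window` and this `process` are hypothetical stand-ins that only mimic the role of the real method, and the sketch assumes a window is never evaluated while empty.

```scala
object FullWindowSketch {
  // Hypothetical stand-in for a time window's metadata.
  final case class Window(start: Long, end: Long)

  // Mimics the role of process(): called once per key and window with every
  // buffered element. A median needs the full, sorted contents of the window.
  def process(key: String, window: Window, elements: Iterable[Long]): String = {
    require(elements.nonEmpty, "this sketch assumes a non-empty window")
    val sorted = elements.toVector.sorted
    val median = sorted(sorted.size / 2) // upper median for even sizes
    s"key=$key window=[${window.start}, ${window.end}) median=$median"
  }

  def main(args: Array[String]): Unit =
    println(process("a", Window(0L, 300000L), Seq(5L, 1L, 3L)))
}
```

Because every element has to be kept until the window fires, memory use grows with the number of elements per window, which is the cost the text above refers to.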
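ProcessWindowFunction with incremental aggregation
A ProcessWindowFunction can be combined with either a ReduceFunction or an AggregateFunction to aggregate elements incrementally as they arrive at the window. When the window closes, the ProcessWindowFunction is given the aggregated result. This allows the window to be computed incrementally while still having access to the additional window metadata of the ProcessWindowFunction.
The legacy WindowFunction can also be used instead of ProcessWindowFunction for incremental window aggregation.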
Incremental Window Aggregation with ReduceFunction
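The following example shows how an incremental ReduceFunction can be combined with a ProcessWindowFunction to return the smallest event in a window together with the start time of that window.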
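val input: DataStream[SensorReading] = ...

input
  .keyBy(<key selector>)
  // timeWindow(...) is deprecated as of Flink 1.12; use a time-based window assigner here
  .window(<window assigner>)
  .reduce(
    // Incremental part: keep only the smaller of the two readings
    (r1: SensorReading, r2: SensorReading) => { if (r1.value > r2.value) r2 else r1 },
    // Window-function part: receives the single reduced element plus the window metadata
    ( key: String,
      context: ProcessWindowFunction[_, _, _, TimeWindow]#Context,
      minReadings: Iterable[SensorReading],
      out: Collector[(Long, SensorReading)] ) =>
      {
        val min = minReadings.iterator.next()
        out.collect((context.window.getStart, min))
      }
  )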
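The division of labor in this pattern can be traced in plain Scala without Flink. In the sketch below, `Reading` and `Window` are hypothetical stand-ins for `SensorReading` and `TimeWindow`, and `fire` only imitates what happens when the window is evaluated; it is meant as an illustration, not as a description of Flink's internals.

```scala
object IncrementalWithMetadataSketch {
  // Hypothetical stand-ins for SensorReading and TimeWindow.
  final case class Reading(id: String, value: Double)
  final case class Window(start: Long, end: Long)

  // Incremental part: applied as elements arrive, keeps only the smaller reading.
  def minReading(r1: Reading, r2: Reading): Reading =
    if (r1.value > r2.value) r2 else r1

  // Window-function part: sees one pre-aggregated element plus the window metadata.
  def emit(window: Window, aggregated: Iterable[Reading]): (Long, Reading) =
    (window.start, aggregated.iterator.next())

  // Imitates one window firing; assumes at least one element has arrived.
  def fire(window: Window, arrivals: Seq[Reading]): (Long, Reading) = {
    val state = arrivals.reduce(minReading) // the only value stored while the window is open
    emit(window, Seq(state))
  }

  def main(args: Array[String]): Unit = {
    val readings = Seq(Reading("s1", 4.2), Reading("s1", 1.5), Reading("s1", 3.0))
    println(fire(Window(0L, 60000L), readings))
  }
}
```

The Iterable handed to the second function contains exactly one element, which is why the example above can safely call `iterator.next()` once.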
Incremental Window Aggregation with AggregateFunction
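The following example shows how an incremental AggregateFunction can be combined with a ProcessWindowFunction to compute the average and emit the key and window along with the average.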
val input: DataStream[(String, Long)] = ...

input
  .keyBy(<key selector>)
  // timeWindow(...) is deprecated as of Flink 1.12; use a time-based window assigner here
  .window(<window assigner>)
  .aggregate(new AverageAggregate(), new MyProcessWindowFunction())

// Function definitions

/**
 * The accumulator is used to keep a running sum and a count. The [getResult] method
 * computes the average.
 */
class AverageAggregate extends AggregateFunction[(String, Long), (Long, Long), Double] {
  override def createAccumulator(): (Long, Long) = (0L, 0L)

  override def add(value: (String, Long), accumulator: (Long, Long)): (Long, Long) =
    (accumulator._1 + value._2, accumulator._2 + 1L)

  // Convert to Double before dividing; Long / Long would truncate the average.
  override def getResult(accumulator: (Long, Long)): Double =
    accumulator._1.toDouble / accumulator._2

  override def merge(a: (Long, Long), b: (Long, Long)): (Long, Long) =
    (a._1 + b._1, a._2 + b._2)
}

class MyProcessWindowFunction extends ProcessWindowFunction[Double, (String, Double), String, TimeWindow] {
  override def process(key: String, context: Context, averages: Iterable[Double], out: Collector[(String, Double)]): Unit = {
    // The Iterable holds exactly one element: the result of AverageAggregate
    val average = averages.iterator.next()
    out.collect((key, average))
  }
}
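Per-window state in ProcessWindowFunction
In addition to accessing keyed state (as any rich function can), a ProcessWindowFunction can also use keyed state that is scoped to the window the function is currently processing. In this context it is important to understand which window the per-window state refers to. There are different "windows" involved:
- The window that was defined when specifying the windowed operation: this might be tumbling windows of 1 hour or sliding windows of 2 hours that slide by 1 hour.
- An actual instance of a defined window for a given key: this might be the time window from 12:00 to 13:00 for user ID xyz. This is based on the window definition, and there will be many window instances, depending on the number of keys the job is currently processing and on which time slots the events fall into.
Per-window state is tied to the latter of the two. This means that if we process events for 1000 different keys and the events for all of them currently fall into the [12:00, 13:00) time window, there will be 1000 window instances, each with its own keyed per-window state.
The Context object that process() receives has two methods that give access to the two kinds of state:
- globalState(), which allows access to keyed state that is not scoped to a window
- windowState(), which allows access to keyed state that is also scoped to the window
This feature is helpful if you anticipate multiple firings for the same window, as can happen when you have late firings for data that arrives late or when you have a custom trigger that does speculative early firings. In such a case you would store information about previous firings, or the number of firings, in per-window state.
When using windowed state, it is important to also clean up that state when the window is cleared. This should be done in the clear() method.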
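The difference between the two scopes can be simulated in plain Scala without Flink. In this sketch, `SimulatedOperator`, `onFire` and `onClear` are hypothetical names, and the two maps merely imitate how the scopes are keyed; Flink's actual state backends work differently.

```scala
import scala.collection.mutable

object WindowStateSketch {
  // Hypothetical stand-in for a time window.
  final case class Window(start: Long, end: Long)

  final class SimulatedOperator {
    // Global scope: one entry per key, shared by all windows of that key.
    private val globalState = mutable.Map.empty[String, Int]
    // Per-window scope: one entry per (key, window instance).
    private val windowState = mutable.Map.empty[(String, Window), Int]

    // Called on every firing; returns (firings of this window, firings for this key).
    def onFire(key: String, window: Window): (Int, Int) = {
      val perWindow = windowState.getOrElse((key, window), 0) + 1
      val perKey = globalState.getOrElse(key, 0) + 1
      windowState.update((key, window), perWindow)
      globalState.update(key, perKey)
      (perWindow, perKey)
    }

    // Plays the role of clear(): drops per-window state when the window is purged.
    def onClear(key: String, window: Window): Unit =
      windowState.remove((key, window))

    def windowStateSize: Int = windowState.size
  }

  def main(args: Array[String]): Unit = {
    val op = new SimulatedOperator
    val noon = Window(12L, 13L)
    println(op.onFire("xyz", noon)) // prints (1,1)
    println(op.onFire("xyz", noon)) // prints (2,2): a second firing of the same window
    op.onClear("xyz", noon)
    println(op.windowStateSize)     // prints 0
  }
}
```

Without the cleanup step, the per-window map would keep one entry for every window instance that ever existed, which is the leak that clearing window state is meant to prevent.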
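WindowFunction (Legacy)
In some places where a ProcessWindowFunction can be used, you can also use a WindowFunction. This is an older version of ProcessWindowFunction that provides less contextual information and lacks some advanced features, such as per-window keyed state. This interface will be deprecated at some point.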
trait WindowFunction[IN, OUT, KEY, W <: Window] extends Function with Serializable {

  /**
   * Evaluates the window and outputs none or several elements.
   *
   * @param key    The key for which this window is evaluated.
   * @param window The window that is being evaluated.
   * @param input  The elements in the window being evaluated.
   * @param out    A collector for emitting elements.
   * @throws Exception The function may throw exceptions to fail the program and trigger recovery.
   */
  def apply(key: KEY, window: W, input: Iterable[IN], out: Collector[OUT])
}
It can be used like this:
val input: DataStream[(String, Long)] = ...

input
  .keyBy(<key selector>)
  .window(<window assigner>)
  .apply(new MyWindowFunction())
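The snippet above references MyWindowFunction without defining it. One possible definition is sketched below, assuming the Flink 1.12 Scala API is on the classpath, that the stream is keyed by the String field (for example with `_._1`), and that a time-based window assigner is used. The counting logic and the output format are illustrative choices, not part of the original example.

```scala
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Hypothetical implementation: counts the elements per key and window.
class MyWindowFunction extends WindowFunction[(String, Long), String, String, TimeWindow] {
  override def apply(
      key: String,
      window: TimeWindow,
      input: Iterable[(String, Long)],
      out: Collector[String]): Unit = {
    // The window is passed in directly; there is no Context and no per-window state.
    out.collect(s"Key $key, window [${window.getStart}, ${window.getEnd}): ${input.size} elements")
  }
}
```

Compared with the ProcessWindowFunction version, the only metadata available here is the window object itself.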