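Preface
We have been using windows all along in the earlier posts; this one walks through each window type with a demo.
1.1 Window overview
Aggregations (such as counts and sums) work differently on streams than in batch processing. For example, it is impossible to count all elements of a stream, because a stream is usually infinite (unbounded). Aggregations on streams therefore have to be scoped by a window, such as "count over the last 5 minutes" or "sum of the last 100 elements". A window is a way of slicing unbounded data into finite chunks.
1.2 Window types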
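- tumbling window: fixed-size windows that do not overlap

- sliding window: fixed-size windows that advance by a slide interval and may overlap

- session window: windows separated by a gap of inactivity, with no fixed size

- global window: all elements with the same key go into one never-ending window, which only produces results when combined with a custom trigger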

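Windows can also be divided into Keyed Windows and Non-Keyed Windows, depending on whether the stream has gone through the keyBy operator. A Keyed Window (keyBy(...).window(...)) groups the stream by key, assigns windows per key and runs the window computation per key, so different keys can be processed in parallel. A Non-Keyed Window (windowAll(...)) does not partition the stream: all elements stay together, and a single task does the window assignment and computation for the whole stream.
For more, see Keyed Window and Non-Keyed Window.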
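The difference between the two can be sketched without Flink. The following is a minimal plain-Java model, not Flink API: the `Event` type, the method names and the 5-second tumbling window are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of keyed vs non-keyed windowing; nothing here is Flink API.
final class KeyedVsNonKeyedSketch {
    /** One record: a key plus a timestamp in milliseconds (hypothetical type). */
    static final class Event {
        final String key;
        final long timestampMs;

        Event(String key, long timestampMs) {
            this.key = key;
            this.timestampMs = timestampMs;
        }
    }

    /** Keyed: counts are kept per (key, window start), so each key is windowed on its own. */
    static Map<String, Integer> keyedCounts(List<Event> events, long sizeMs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Event e : events) {
            long start = e.timestampMs - Math.floorMod(e.timestampMs, sizeMs);
            counts.merge(e.key + "@" + start, 1, Integer::sum);
        }
        return counts;
    }

    /** Non-keyed: counts are kept per window start only, all keys share the same windows. */
    static Map<Long, Integer> nonKeyedCounts(List<Event> events, long sizeMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (Event e : events) {
            long start = e.timestampMs - Math.floorMod(e.timestampMs, sizeMs);
            counts.merge(start, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("a", 1000), new Event("b", 2000),
                new Event("a", 4000), new Event("a", 6000));
        System.out.println(keyedCounts(events, 5000));    // {a@0=2, a@5000=1, b@0=1}
        System.out.println(nonKeyedCounts(events, 5000)); // {0=3, 5000=1}
    }
}
```

In the keyed case every (key, window) pair has its own result, which is what lets the work be spread over parallel tasks; in the non-keyed case there is one result per window for the whole stream.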
1.3 Window examples
1.3.1 tumbling windows
dataStream.keyBy(0)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
...;
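To make the 5-second tumbling window above concrete, here is a minimal plain-Java sketch of how a timestamp maps to exactly one window. It is a model under the assumption that windows are aligned to the epoch with no offset; the class and method names are made up for this example and are not Flink API.

```java
// Plain-Java model of tumbling window assignment; not Flink API.
final class TumblingWindowSketch {
    /** Start (inclusive, ms) of the tumbling window of the given size that contains the timestamp. */
    static long windowStart(long timestampMs, long sizeMs) {
        // floorMod keeps the result correct for negative timestamps as well
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    /** End (exclusive, ms) of that same window. */
    static long windowEnd(long timestampMs, long sizeMs) {
        return windowStart(timestampMs, sizeMs) + sizeMs;
    }

    public static void main(String[] args) {
        // An element stamped 7.3 s falls into the window [5 s, 10 s)
        System.out.println(windowStart(7300, 5000) + " .. " + windowEnd(7300, 5000)); // 5000 .. 10000
    }
}
```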
1.3.2 sliding windows
dataStream.keyBy(0)
.window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5))) // size 10 s, slide 5 s; a slide equal to the size would behave like a tumbling window
...;
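Unlike a tumbling window, a sliding window can place one element into several windows. The sketch below is a plain-Java model of that assignment, assuming epoch-aligned windows with no offset; the names are illustrative and not Flink API.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of sliding window assignment; not Flink API.
final class SlidingWindowSketch {
    /** Start timestamps (ms) of every sliding window [start, start + size) that contains the timestamp. */
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is not after the timestamp
        long lastStart = timestampMs - Math.floorMod(timestampMs, slideMs);
        // Walk backwards one slide at a time while the window still covers the timestamp
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(windowStarts(12000, 10000, 5000));  // [10000, 5000]
        System.out.println(windowStarts(12000, 10000, 10000)); // [10000]
    }
}
```

With a size of 10 s and a slide of 5 s every element belongs to two windows. When the slide equals the size, each element belongs to exactly one window, which is the tumbling case.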
1.3.3 session windows
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
/**
* If a word has not appeared again within 5 seconds, output how many times it appeared in that session
*/
public class SessionWindowDemo {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> localhost = see.socketTextStream("localhost", 8888);
SingleOutputStreamOperator<Tuple2<String, Integer>> sum = localhost.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
@Override
public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
for (String str : line.split(",")) {
out.collect(Tuple2.of(str, 1));
}
}
}).keyBy(0).window(ProcessingTimeSessionWindows.withGap(Time.seconds(5))).sum(1);
sum.print().setParallelism(1);
see.execute("SessionWindowDemo");
}
}
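The session logic of the demo can be modelled in a few lines of plain Java. This is a sketch, not Flink API: it handles the timestamps of a single key, assumes they arrive in ascending order, and starts a new session when the silence is strictly longer than the gap (how the exact boundary is treated is a choice of this sketch).

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of gap-based session windows for one key; not Flink API.
final class SessionWindowSketch {
    /** Splits ascending timestamps (ms) into sessions and returns the element count of each session. */
    static List<Integer> sessionCounts(long[] sortedTimestampsMs, long gapMs) {
        List<Integer> counts = new ArrayList<>();
        int current = 0;    // elements in the session that is still open
        long lastSeen = 0;  // timestamp of the previous element
        for (long ts : sortedTimestampsMs) {
            if (current > 0 && ts - lastSeen > gapMs) {
                counts.add(current); // the gap elapsed: close the session and emit its count
                current = 0;
            }
            current++;
            lastSeen = ts;
        }
        if (current > 0) {
            counts.add(current); // close the last open session
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] timestamps = {0, 1000, 3000, 9000, 9500, 20000};
        System.out.println(sessionCounts(timestamps, 5000)); // [3, 2, 1]
    }
}
```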
1.3.4 global windows
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;
import org.apache.flink.util.Collector;
/**
* Emit the running count each time a word has appeared 2 more times
*/
public class GlobalWindowDemo {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> localhost = see.socketTextStream("localhost", 8888);
SingleOutputStreamOperator<Tuple2<String, Integer>> sum = localhost.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
@Override
public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
for (String str : line.split(",")) {
out.collect(Tuple2.of(str, 1));
}
}
// GlobalWindows has to be combined with a trigger: a window that is assigned but never fires is useless,
// much like a transformation operator that needs an action to trigger it.
}).keyBy(0).window(GlobalWindows.create()).trigger(CountTrigger.of(2)).sum(1);
sum.print().setParallelism(1);
see.execute("GlobalWindowDemo");
}
}
Result (the sum keeps growing because CountTrigger fires without purging the window)
demo,2
demo,4
demo,6
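The 2, 4, 6 pattern above can be reproduced with a small plain-Java model of "one never-ending window per key plus a count trigger". It is a sketch under the assumption that firing leaves the window contents in place; class and method names are made up and are not Flink API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a global window with a count trigger; not Flink API.
final class CountTriggerSketch {
    /** Feeds words one by one and returns what would be emitted each time a key's trigger fires. */
    static List<String> run(List<String> words, int fireEvery) {
        Map<String, Integer> windowSum = new HashMap<>();     // sum of the never-ending window, per key
        Map<String, Integer> sinceLastFire = new HashMap<>(); // the trigger's counter, per key
        List<String> output = new ArrayList<>();
        for (String word : words) {
            int sum = windowSum.merge(word, 1, Integer::sum);
            int seen = sinceLastFire.merge(word, 1, Integer::sum);
            if (seen == fireEvery) {
                sinceLastFire.put(word, 0);                  // reset only the trigger counter
                output.add("(" + word + "," + sum + ")");  // the window itself is not cleared
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("demo", "demo", "demo", "demo", "demo", "demo");
        System.out.println(run(words, 2)); // [(demo,2), (demo,4), (demo,6)]
    }
}
```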
1.3.5 Custom trigger
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.streaming.api.windowing.windows.Window;
import org.apache.flink.util.Collector;
/**
* Emit the running count each time a word has appeared 2 more times
*/
public class TriggerWindowDemo {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> localhost = see.socketTextStream("localhost", 8888);
SingleOutputStreamOperator<Tuple2<String, Integer>> sum = localhost.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
@Override
public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
for (String str : line.split(",")) {
out.collect(Tuple2.of(str, 1));
}
}
// TriggerImpl follows the same logic as the CountTrigger source, so this behaves like .window(GlobalWindows.create()).trigger(CountTrigger.of(2)).sum(1);
}).keyBy(0).window(GlobalWindows.create()).trigger(new TriggerImpl(2L)).sum(1);
sum.print().setParallelism(1);
see.execute("TriggerWindowDemo");
}
/**
* Type parameters of Trigger<T, W>:
* T: the type of elements on which this {@code Trigger} works (the input type).
* W: the type of {@link Window Windows} on which this {@code Trigger} can operate (the window type).
*/
private static class TriggerImpl extends Trigger<Tuple2<String, Integer>, GlobalWindow> {
// Number of occurrences after which the window fires
private Long maxCount;
// State that records how many times the key has appeared
private ReducingStateDescriptor<Long> descriptor = new ReducingStateDescriptor<Long>("count", new ReduceFunction<Long>() {
@Override
public Long reduce(Long aLong, Long t1) throws Exception {
return aLong + t1;
}
}, Long.class);
public TriggerImpl(Long maxCount) {
this.maxCount = maxCount;
}
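/**
* Called for every element that is added to a window.
*
* @param element the element
* @param timestamp the timestamp of the element
* @param window the window the element belongs to
* @param ctx the trigger context
* @return one of:
* 1. TriggerResult.CONTINUE: do nothing with the window
* 2. TriggerResult.FIRE: trigger the window computation and keep the window's elements
* 3. TriggerResult.PURGE: discard all elements in the window without computing
* 4. TriggerResult.FIRE_AND_PURGE: trigger the window computation, then discard the window's elements
*/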
@Override
public TriggerResult onElement(Tuple2<String, Integer> element, long timestamp, GlobalWindow window, TriggerContext ctx) throws Exception {
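// Get the counter state for the current key and window
ReducingState<Long> count = ctx.getPartitionedState(descriptor);
// Add 1 to the count
count.add(1L);
// If the count for the current key has reached maxCount
if (count.get().equals(maxCount)) {
count.clear();
// Fire the window computation; FIRE keeps the window's elements (FIRE_AND_PURGE would drop them)
return TriggerResult.FIRE;
}
// Otherwise leave the window untouched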
return TriggerResult.CONTINUE;
}
// Called when a processing-time timer fires; this trigger registers none
@Override
public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
return TriggerResult.CONTINUE;
}
// Called when an event-time timer fires; this trigger registers none
@Override
public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
return TriggerResult.CONTINUE;
}
@Override
public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
ctx.getPartitionedState(descriptor).clear();
}
}
}
Result
demo,2
demo,4
demo,6
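Whether the emitted sums grow (2, 4, 6) or stay constant depends on whether the trigger purges the window when it fires. The following plain-Java model contrasts the two behaviours for a single key. It is a sketch, not Flink API; the boolean flag stands in for the choice between FIRE and FIRE_AND_PURGE.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model contrasting "fire" with "fire and purge" for one key; not Flink API.
final class FireVersusPurgeSketch {
    /**
     * Feeds elementCount elements (each carrying the value 1) into one window and returns
     * the sum emitted each time the count-based trigger fires.
     */
    static List<Integer> emittedSums(int elementCount, int maxCount, boolean purgeOnFire) {
        List<Integer> emitted = new ArrayList<>();
        int windowSum = 0;    // contents of the window
        int triggerCount = 0; // the trigger's own counter state
        for (int i = 0; i < elementCount; i++) {
            windowSum += 1;
            triggerCount += 1;
            if (triggerCount == maxCount) {
                triggerCount = 0;       // the trigger always resets its counter
                emitted.add(windowSum);
                if (purgeOnFire) {
                    windowSum = 0;      // only "fire and purge" drops the window contents
                }
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(emittedSums(6, 2, false)); // [2, 4, 6]
        System.out.println(emittedSums(6, 2, true));  // [2, 2, 2]
    }
}
```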
1.3.6 evictor
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.evictors.Evictor;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.streaming.api.windowing.windows.Window;
import org.apache.flink.streaming.runtime.operators.windowing.TimestampedValue;
import org.apache.flink.util.Collector;
import java.util.Iterator;
/**
* Every 2 occurrences of a word, count its 3 most recent occurrences
*/
public class EvictorWindowDemo {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> localhost = see.socketTextStream("localhost", 8888);
SingleOutputStreamOperator<Tuple2<String, Integer>> sum = localhost.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
@Override
public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
for (String str : line.split(",")) {
out.collect(Tuple2.of(str, 1));
}
}
// TriggerImpl follows the same logic as the CountTrigger source, i.e. .window(GlobalWindows.create()).trigger(CountTrigger.of(2)); the evictor then keeps the 3 most recent elements
}).keyBy(0).window(GlobalWindows.create()).trigger(new TriggerImpl(2L))
.evictor(new EvictorImpl(3)).sum(1);
sum.print().setParallelism(1);
see.execute("EvictorWindowDemo");
}
/**
* Type parameters of Trigger<T, W>:
* T: the type of elements on which this {@code Trigger} works (the input type).
* W: the type of {@link Window Windows} on which this {@code Trigger} can operate (the window type).
*/
private static class TriggerImpl extends Trigger<Tuple2<String, Integer>, GlobalWindow> {
// Number of occurrences after which the window fires
private Long maxCount;
// State that records how many times the key has appeared
private ReducingStateDescriptor<Long> descriptor = new ReducingStateDescriptor<Long>("count", new ReduceFunction<Long>() {
@Override
public Long reduce(Long aLong, Long t1) throws Exception {
return aLong + t1;
}
}, Long.class);
public TriggerImpl(Long maxCount) {
this.maxCount = maxCount;
}
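/**
* Called for every element that is added to a window.
*
* @param element the element
* @param timestamp the timestamp of the element
* @param window the window the element belongs to
* @param ctx the trigger context
* @return one of:
* 1. TriggerResult.CONTINUE: do nothing with the window
* 2. TriggerResult.FIRE: trigger the window computation and keep the window's elements
* 3. TriggerResult.PURGE: discard all elements in the window without computing
* 4. TriggerResult.FIRE_AND_PURGE: trigger the window computation, then discard the window's elements
*/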
@Override
public TriggerResult onElement(Tuple2<String, Integer> element, long timestamp, GlobalWindow window, TriggerContext ctx) throws Exception {
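// Get the counter state for the current key and window
ReducingState<Long> count = ctx.getPartitionedState(descriptor);
// Add 1 to the count
count.add(1L);
// If the count for the current key has reached maxCount
if (count.get().equals(maxCount)) {
count.clear();
// Fire the window computation; FIRE keeps the window's elements, the evictor decides what is dropped
return TriggerResult.FIRE;
}
// Otherwise leave the window untouched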
return TriggerResult.CONTINUE;
}
// Called when a processing-time timer fires; this trigger registers none
@Override
public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
return TriggerResult.CONTINUE;
}
// Called when an event-time timer fires; this trigger registers none
@Override
public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
return TriggerResult.CONTINUE;
}
@Override
public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
ctx.getPartitionedState(descriptor).clear();
}
}
private static class EvictorImpl implements Evictor<Tuple2<String, Integer>, GlobalWindow> {
// Number of most recent elements to keep in the window
private long windowCount;
public EvictorImpl(long windowCount) {
this.windowCount = windowCount;
}
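/**
* Removes elements before the window function is applied.
* @param elements all elements currently in the window
* @param size the number of elements currently in the window
* @param window the window
* @param evictorContext the evictor context
*/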
@Override
public void evictBefore(Iterable<TimestampedValue<Tuple2<String, Integer>>> elements,
int size, GlobalWindow window, EvictorContext evictorContext) {
if (size <= windowCount) {
return;
} else {
int evictorCount = 0;
Iterator<TimestampedValue<Tuple2<String, Integer>>> iterator = elements.iterator();
while (iterator.hasNext()) {
iterator.next();
evictorCount++;
// Keep removing from the head until (size - windowCount) elements have been evicted, then stop
if (evictorCount > size - windowCount) {
break;
} else {
iterator.remove();
}
}
}
}
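/**
* Removes elements after the window function has been applied (nothing to do in this demo).
* @param elements all elements currently in the window
* @param size the number of elements currently in the window
* @param window the window
* @param evictorContext the evictor context
*/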
@Override
public void evictAfter(Iterable<TimestampedValue<Tuple2<String, Integer>>> elements,
int size, GlobalWindow window, Evictor.EvictorContext evictorContext) {
}
}
}
Result
(a,2)
(a,3)
(a,3)
(a,3)
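The result above (2 on the first firing, then 3 on every later one) follows from the interplay of the trigger and the evictor. Here is a plain-Java model for a single key, not Flink API, assuming every element carries the value 1 and the oldest elements are evicted first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Plain-Java model of "fire every N elements, keep only the most recent K"; not Flink API.
final class EvictorSketch {
    /** Returns the sum emitted at each firing for elementCount elements of one key. */
    static List<Integer> emittedSums(int elementCount, int fireEvery, int keep) {
        Deque<Integer> window = new ArrayDeque<>(); // buffered elements, oldest first
        List<Integer> emitted = new ArrayList<>();
        for (int i = 1; i <= elementCount; i++) {
            window.addLast(1);
            if (i % fireEvery == 0) {
                // Evict from the head before computing, until at most 'keep' elements remain
                while (window.size() > keep) {
                    window.removeFirst();
                }
                int sum = 0;
                for (int value : window) {
                    sum += value;
                }
                emitted.add(sum);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        // 8 occurrences of one word, fire every 2, keep the 3 most recent
        System.out.println(emittedSums(8, 2, 3)); // [2, 3, 3, 3]
    }
}
```

The first firing sees only 2 elements, so nothing is evicted; from the second firing on, the buffer is trimmed to 3 elements before the sum is computed.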
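1.4 Incremental and full-window aggregation
1.4.1 Incremental aggregation
Each element is folded into the aggregate as soon as it enters the window, so only the running result has to be kept; when the window's time is up, the final result is emitted.
Common incremental aggregation operators:
reduce(reduceFunction), aggregate(aggregateFunction), sum(), min(), max()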
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
public class ReduceDemo {
public static void main(String[] args) throws Exception{
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> localhost = see.socketTextStream("localhost", 8888);
localhost.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
@Override
public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
for (String str : line.split(",")) {
out.collect(Tuple2.of(str, 1));
}
}
}).keyBy(0).timeWindow(Time.seconds(3)).reduce(new ReduceFunction<Tuple2<String, Integer>>() {
@Override
public Tuple2<String, Integer> reduce(Tuple2<String, Integer> reduceTuple, Tuple2<String, Integer> value) throws Exception {
return Tuple2.of(reduceTuple.f0,reduceTuple.f1 + value.f1);
}
}).print().setParallelism(1);
see.execute("ReduceDemo");
}
}
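What "incremental" means for the window state can be shown with a tiny plain-Java model: the window keeps a single accumulator, never the individual elements. This is a sketch, not Flink API; the class name and the `fire` method are assumptions made for illustration.

```java
import java.util.function.BinaryOperator;

// Plain-Java model of incremental (reduce-style) window aggregation; not Flink API.
final class IncrementalAggregationSketch<T> {
    private final BinaryOperator<T> reducer;
    private T accumulator; // the only state kept for the window

    IncrementalAggregationSketch(BinaryOperator<T> reducer) {
        this.reducer = reducer;
    }

    /** Folds the new element into the accumulator immediately; the element itself is not stored. */
    void add(T element) {
        accumulator = (accumulator == null) ? element : reducer.apply(accumulator, element);
    }

    /** Emits the final result when the window ends and resets the state (null if the window was empty). */
    T fire() {
        T result = accumulator;
        accumulator = null;
        return result;
    }

    public static void main(String[] args) {
        IncrementalAggregationSketch<Integer> window = new IncrementalAggregationSketch<>(Integer::sum);
        window.add(1);
        window.add(1);
        window.add(1);
        System.out.println(window.fire()); // 3
    }
}
```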
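1.4.2 Full-window aggregation
The computation starts only after all the data belonging to the window has arrived [this makes requirements such as sorting the data within a window possible].
apply(windowFunction)
process(processWindowFunction)
processWindowFunction exposes more context information than windowFunction, similar to how RichMapFunction relates to MapFunction.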
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.util.Iterator;
public class ProcessDemo {
public static void main(String[] args) throws Exception{
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> localhost = see.socketTextStream("localhost", 8888);
localhost.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
@Override
public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
for (String str : line.split(",")) {
out.collect(Tuple2.of(str, 1));
}
}
}).keyBy(0).timeWindow(Time.seconds(5)).process(new ProcessWindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple, TimeWindow>() {
@Override
public void process(Tuple tuple, Context context, Iterable<Tuple2<String, Integer>> elements, Collector<Tuple2<String, Integer>> out) throws Exception {
int sum = 0;
Iterator<Tuple2<String, Integer>> iterator = elements.iterator();
while (iterator.hasNext()){
sum += iterator.next().f1;
}
out.collect(Tuple2.of(tuple.getField(0),sum));
}
}).print().setParallelism(1);
see.execute("ProcessDemo");
}
}
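By contrast, a full-window aggregation has to buffer every element, which is what makes order-dependent results possible. The following plain-Java model, not Flink API, buffers the elements and computes a value that needs all of them at once; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java model of full-window aggregation; not Flink API.
final class FullWindowAggregationSketch {
    private final List<Integer> buffer = new ArrayList<>(); // every element is kept until the window fires

    void add(int element) {
        buffer.add(element);
    }

    /**
     * Sorts the buffered elements and returns the one at index size / 2
     * (the median for an odd number of elements), then clears the buffer.
     */
    int fireMiddleElement() {
        if (buffer.isEmpty()) {
            throw new IllegalStateException("window is empty");
        }
        List<Integer> sorted = new ArrayList<>(buffer);
        Collections.sort(sorted);
        buffer.clear();
        return sorted.get(sorted.size() / 2);
    }

    public static void main(String[] args) {
        FullWindowAggregationSketch window = new FullWindowAggregationSketch();
        for (int value : new int[] {5, 1, 9, 3, 7}) {
            window.add(value);
        }
        System.out.println(window.fireMiddleElement()); // 5
    }
}
```

The trade-off is memory: the state grows with the number of elements in the window, whereas the incremental style keeps a single value.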
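1.5 Window joins
Two streams can be joined window by window: elements from both streams that share a key and fall into the same window are paired. The join operation supports only three window types: tumbling windows, sliding windows and session windows.
Usage:
stream.join(otherStream) // join the two streams
.where(<KeySelector>) // key of the first stream to join on
.equalTo(<KeySelector>) // key of the second stream to join on
.window(<WindowAssigner>) // the window type
.apply(<JoinFunction>) // what to do with each matched pair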
Tumbling Window Join
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
...
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...
orangeStream.join(greenStream)
.where(<KeySelector>)
.equalTo(<KeySelector>)
.window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
.apply (new JoinFunction<Integer, Integer, String> (){
@Override
public String join(Integer first, Integer second) {
return first + "," + second;
}
});
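The pairing rule of a tumbling window join can be modelled in plain Java: two elements are joined only if they have the same key and land in the same window. This is a sketch, not Flink API; the `Event` type, its fields and the nested-loop matching are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of a tumbling window join; not Flink API.
final class WindowJoinSketch {
    /** One record: join key, timestamp in milliseconds and a payload (hypothetical type). */
    static final class Event {
        final int key;
        final long timestampMs;
        final String value;

        Event(int key, long timestampMs, String value) {
            this.key = key;
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    /** Emits "left,right" for every pair with equal keys whose timestamps fall into the same window. */
    static List<String> tumblingJoin(List<Event> left, List<Event> right, long sizeMs) {
        List<String> joined = new ArrayList<>();
        for (Event l : left) {
            for (Event r : right) {
                boolean sameKey = l.key == r.key;
                boolean sameWindow = windowStart(l.timestampMs, sizeMs) == windowStart(r.timestampMs, sizeMs);
                if (sameKey && sameWindow) {
                    joined.add(l.value + "," + r.value);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Event> left = Arrays.asList(new Event(1, 0, "L0"), new Event(1, 3, "L3"));
        List<Event> right = Arrays.asList(new Event(1, 1, "R1"), new Event(1, 2, "R2"), new Event(2, 0, "X"));
        System.out.println(tumblingJoin(left, right, 2)); // [L0,R1, L3,R2]
    }
}
```

An element with no partner in its window (here the key-2 element X) produces no output, so this behaves like an inner join.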

Sliding Window Join
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
...
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...
orangeStream.join(greenStream)
.where(<KeySelector>)
.equalTo(<KeySelector>)
.window(SlidingEventTimeWindows.of(Time.milliseconds(2) /* size */, Time.milliseconds(1) /* slide */))
.apply (new JoinFunction<Integer, Integer, String> (){
@Override
public String join(Integer first, Integer second) {
return first + "," + second;
}
});

Session Window Join
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
...
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...
orangeStream.join(greenStream)
.where(<KeySelector>)
.equalTo(<KeySelector>)
.window(EventTimeSessionWindows.withGap(Time.milliseconds(1)))
.apply (new JoinFunction<Integer, Integer, String> (){
@Override
public String join(Integer first, Integer second) {
return first + "," + second;
}
});

Interval Join
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
...
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...
orangeStream
.keyBy(<KeySelector>)
.intervalJoin(greenStream.keyBy(<KeySelector>))
// pair each orange element with the green elements stamped between 2 ms before and 1 ms after it
.between(Time.milliseconds(-2), Time.milliseconds(1))
.process(new ProcessJoinFunction<Integer, Integer, String>() {
@Override
public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
out.collect(left + "," + right);
}
});
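The between(-2 ms, 1 ms) condition can be checked by hand with a small plain-Java model. It is a sketch, not Flink API: it assumes all elements share one key, uses the timestamps themselves as values, and treats both bounds as inclusive.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the interval join condition for elements of one key; not Flink API.
final class IntervalJoinSketch {
    /** True if leftTs + lowerMs <= rightTs <= leftTs + upperMs (both bounds inclusive). */
    static boolean matches(long leftTs, long rightTs, long lowerMs, long upperMs) {
        return leftTs + lowerMs <= rightTs && rightTs <= leftTs + upperMs;
    }

    /** Emits "left,right" for every pair of timestamps that satisfies the interval condition. */
    static List<String> join(long[] leftTimestamps, long[] rightTimestamps, long lowerMs, long upperMs) {
        List<String> joined = new ArrayList<>();
        for (long l : leftTimestamps) {
            for (long r : rightTimestamps) {
                if (matches(l, r, lowerMs, upperMs)) {
                    joined.add(l + "," + r);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        long[] orange = {2, 5};
        long[] green = {0, 1, 6, 7};
        // orange 2 accepts green in [0, 3]; orange 5 accepts green in [3, 6]
        System.out.println(join(orange, green, -2, 1)); // [2,0, 2,1, 5,6]
    }
}
```

Unlike the window joins above, the match depends only on the distance between the two timestamps, not on window boundaries.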

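Closing
That wraps up the Flink demos. I cut corners in a few places, mainly because I had deleted my old test project. If you have any questions, leave a comment.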