Big Data Development, Advanced Flink: Watermark (Part 49)

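1. Watermark

When we process streaming data using EventTime, we run into out-of-order data. A record takes time to travel from where it is produced, through the Source, to a given operator, and this transit can cause records to arrive out of order.

This is especially true with Kafka, where ordering is not guaranteed across partitions. When records enter a window computation we cannot wait indefinitely, so there must be a mechanism that guarantees the window is triggered after a specific point in time. That mechanism is the Watermark.

Out-of-order data is handled by combining Watermark with EventTime. (In Chinese, Watermark is rendered as 水位线, literally "water level line".)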

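1.1 Watermarks in an Ordered Stream

[Figure: watermarks in an ordered data stream]

w(11): w denotes a watermark, here with value 11, meaning that all data with timestamps before 11 has arrived.

w(20): the watermark value is 20, meaning that all data before 20 has arrived and can now be computed.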

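1.2 Watermarks in an Out-of-Order Stream

[Figure: watermarks in an out-of-order data stream]

w(11): all data before 11 has arrived, so data before 11 can be computed; data with timestamps greater than 11 is not computed yet.

w(20): all data before 20 has arrived, so data before 20 can be computed; data with timestamps greater than 20 is not computed yet.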

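1.3 Watermarks in Multi-Parallelism Streams

With a parallelism greater than one, watermarks go through an alignment mechanism: an operator takes the minimum watermark across all of its input channels.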
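The alignment rule can be illustrated with a few lines of plain Java. This is only a sketch of the "take the minimum" idea, not Flink's internal implementation; the class name WatermarkAlignmentSketch and the method alignedWatermark are made up for this example.

```java
// Hypothetical helper illustrating watermark alignment; not part of the Flink API.
final class WatermarkAlignmentSketch {

    // Returns the watermark an operator may forward downstream:
    // the smallest watermark currently reported by any input channel.
    static long alignedWatermark(long[] channelWatermarks) {
        if (channelWatermarks.length == 0) {
            throw new IllegalArgumentException("at least one channel is required");
        }
        long min = Long.MAX_VALUE;
        for (long w : channelWatermarks) {
            min = Math.min(min, w);
        }
        return min;
    }

    public static void main(String[] args) {
        long[] channels = {20L, 11L, 15L};
        System.out.println(alignedWatermark(channels)); // prints 11

        // The slowest channel catches up, so the aligned watermark advances.
        channels[1] = 18L;
        System.out.println(alignedWatermark(channels)); // prints 15
    }
}
```

The slowest channel holds the operator's event-time clock back, which is why a single lagging partition can delay every window downstream.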

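1.4 How Watermarks Are Generated

A watermark can be generated immediately after data is received from the DataSource, or after a map or filter operation applied downstream of the DataSource.
Method 1: triggered periodically (with periodic watermarks)
- A watermark is automatically injected into the stream at a fixed interval. The interval is set by ExecutionConfig.setAutoWatermarkInterval and defaults to 200 ms
- A maximum allowed out-of-orderness can be defined; this is the more commonly used approach
Method 2: triggered by certain events (with punctuated watermarks)
- A watermark is injected into the stream based on events; every element gets a chance to decide whether a watermark should be generated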

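2. How to Generate Watermarks

2.1 General Principle

A perfect watermark is "absolutely correct": once a watermark appears, all data before that time has arrived, and no data with an earlier timestamp will ever show up afterwards. In practice we can only try to keep the watermark as correct as possible.

If result correctness matters a great deal and the window should collect all of its data, what can be done? Wait. Because the network is unreliable, capturing all late data means waiting longer, and how long depends on the characteristics of the business. For example, if events in the current business never arrive more than 5 s late, the watermark can be set to the maximum timestamp seen so far minus 5 s, which amounts to waiting 5 s.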
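The "maximum timestamp minus a fixed delay" rule can be sketched in plain Java as follows. It is a simplified model for illustration only: the class BoundedDelayWatermarkSketch and its methods are hypothetical names, and Flink's own generators differ in detail.

```java
// Hypothetical model of a bounded-delay watermark; not the Flink API.
final class BoundedDelayWatermarkSketch {
    private final long maxDelayMillis;
    private long maxTimestampSeen = Long.MIN_VALUE;

    BoundedDelayWatermarkSketch(long maxDelayMillis) {
        this.maxDelayMillis = maxDelayMillis;
    }

    // Called for every event: remember the largest event time seen so far.
    void onEvent(long eventTimeMillis) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimeMillis);
    }

    // Watermark = largest timestamp seen so far minus the tolerated delay.
    long currentWatermark() {
        if (maxTimestampSeen == Long.MIN_VALUE) {
            return Long.MIN_VALUE; // no event seen yet
        }
        return maxTimestampSeen - maxDelayMillis;
    }

    public static void main(String[] args) {
        BoundedDelayWatermarkSketch generator = new BoundedDelayWatermarkSketch(5_000L);

        generator.onEvent(10_000L);
        System.out.println(generator.currentWatermark()); // prints 5000

        generator.onEvent(17_000L);
        System.out.println(generator.currentWatermark()); // prints 12000

        // An out-of-order event does not move the watermark backwards.
        generator.onEvent(14_000L);
        System.out.println(generator.currentWatermark()); // prints 12000
    }
}
```

A larger delay tolerates more disorder but makes every window wait longer before it produces a result, so the value is a trade-off between completeness and latency.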

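2.2 Window Trigger Conditions

- watermark time >= window_end_time
- The interval [window_start_time, window_end_time) contains data (note that the interval is left-closed, right-open)

A window is triggered only when both conditions hold.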
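The two conditions can be written as a single predicate. The snippet below is a plain-Java sketch of the rule as stated above, with hypothetical names (WindowTriggerSketch, shouldFire); it does not call into Flink.

```java
// Hypothetical predicate mirroring the two trigger conditions; not the Flink API.
final class WindowTriggerSketch {

    // A window fires only if the watermark has reached its end
    // and at least one element fell into [windowStart, windowEnd).
    static boolean shouldFire(long watermark, long windowEnd, int elementsInWindow) {
        return watermark >= windowEnd && elementsInWindow > 0;
    }

    public static void main(String[] args) {
        long windowEnd = 6_000L;
        System.out.println(shouldFire(5_999L, windowEnd, 2)); // prints false: watermark too low
        System.out.println(shouldFire(6_000L, windowEnd, 2)); // prints true
        System.out.println(shouldFire(9_000L, windowEnd, 0)); // prints false: empty window
    }
}
```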

package com.strivelearn.flink.watermarkdemo;

import java.text.SimpleDateFormat;
import java.time.Duration;

import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

/**
 * @author strivelearn
 * @version WatermarkOp.java, 2023-01-23
 */
public class WatermarkOp {
    // Sample out-of-order input lines ("key,eventTimeMillis") to send over the socket
    public static String[] eventDataOutOfOrder = new String[] { "001,1674477177627", "001,1674477277637", "001,1674477177147", "001,1674477377657" };

    public static void main(String[] args) throws Exception {
        // "yyyy" is the calendar year; "YYYY" is the week-based year and gives wrong dates around New Year
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        // Use the time at which the data was produced: EventTime
        // (deprecated and unnecessary since Flink 1.12, where event time is the default)
        executionEnvironment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        // Set the global parallelism to 1
        executionEnvironment.setParallelism(1);
        // Emit watermarks periodically; 200 ms is also the default
        executionEnvironment.getConfig().setAutoWatermarkInterval(200);

        DataStreamSource<String> dataSourceWithSocket = executionEnvironment.socketTextStream("192.168.234.100", 9001);

        // Parse each "key,eventTimeMillis" line into a (key, timestamp) tuple
        SingleOutputStreamOperator<Tuple2<String, Long>> parsedStream = dataSourceWithSocket.flatMap(new FlatMapFunction<String, Tuple2<String, Long>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String, Long>> out) throws Exception {
                String[] words = value.split(",");
                out.collect(new Tuple2<>(words[0], Long.valueOf(words[1])));
            }
        });

        // Extract the timestamp from each record as its EventTime; tolerate 10 s of out-of-orderness
        SingleOutputStreamOperator<Tuple2<String, Long>> timestampedStream = parsedStream.assignTimestampsAndWatermarks(
            WatermarkStrategy.<Tuple2<String, Long>> forBoundedOutOfOrderness(Duration.ofSeconds(10))
                             .withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {

                                 // Tracked only so that the watermark can be printed below
                                 long currentMaxTimestamp = 0L;

                                 @Override
                                 public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
                                     long timestamp = element.f1;
                                     currentMaxTimestamp = Math.max(currentMaxTimestamp, timestamp);
                                     // Approximate watermark for display: currentMaxTimestamp - out-of-orderness
                                     // (Flink's built-in generator additionally subtracts 1 ms)
                                     long currentWaterMark = currentMaxTimestamp - 10000L;

                                     System.out.println("key: " + element.f0
                                                        + " eventTime: " + simpleDateFormat.format(timestamp)
                                                        + " currentMaxTimestamp=" + simpleDateFormat.format(currentMaxTimestamp)
                                                        + " currentWaterMark=" + simpleDateFormat.format(currentWaterMark));
                                     return timestamp;
                                 }
                             }));

        // Key by the first tuple field (replaces the deprecated keyBy(0))
        timestampedStream.keyBy(value -> value.f0)
                         // Assign windows by the record's EventTime
                         .window(TumblingEventTimeWindows.of(Time.seconds(3)))
                         // Process the full contents of each window at once
                         .apply(new WindowFunction<Tuple2<String, Long>, String, String, TimeWindow>() {
                             @Override
                             public void apply(String key, TimeWindow window, Iterable<Tuple2<String, Long>> input, Collector<String> out) throws Exception {
                                 String result = "tuple(" + key + ") window start: " + simpleDateFormat.format(window.getStart())
                                                 + " window end: " + simpleDateFormat.format(window.getEnd());
                                 out.collect(result);
                             }
                         })
                         .print();

        executionEnvironment.execute("WatermarkOp");
    }
}

Sample output after sending the first two input lines:

key: 001 eventTime: 2023-01-23 20:32:57 currentMaxTimestamp=2023-01-23 20:32:57 currentWaterMark=2023-01-23 20:32:47
key: 001 eventTime: 2023-01-23 20:34:37 currentMaxTimestamp=2023-01-23 20:34:37 currentWaterMark=2023-01-23 20:34:27
tuple(001) window start: 2023-01-23 20:32:57 window end: 2023-01-23 20:33:00

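Windows are divided along natural time boundaries. With a window size of 3 s, one minute is divided into the following windows (left-closed, right-open):

[00:00:00,00:00:03)
[00:00:03,00:00:06)
...
[00:00:57,00:01:00)

Window boundaries do not depend on the data; they are defined by the system. Each incoming record is assigned to a window according to its own EventTime. If a window contains data, it is triggered once the watermark time >= the window_end_time of that window. In other words, what ultimately decides when a window fires is the window_end_time of the window that the record's EventTime falls into, not the EventTime itself.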
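The window that a record falls into can be computed from its timestamp alone. The sketch below assumes non-negative timestamps and a window offset of zero; TumblingWindowSketch and its methods are illustrative names, not Flink classes. It uses the sample timestamps from the listing above with the 3 s window size.

```java
// Hypothetical helper showing how a timestamp maps to a tumbling window; not the Flink API.
final class TumblingWindowSketch {

    // Start of the window containing the timestamp: round down to a multiple of the window size.
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - (timestampMillis % windowSizeMillis);
    }

    // End of the window (exclusive), since windows are left-closed, right-open.
    static long windowEnd(long timestampMillis, long windowSizeMillis) {
        return windowStart(timestampMillis, windowSizeMillis) + windowSizeMillis;
    }

    public static void main(String[] args) {
        long size = 3_000L;

        // First and third sample records land in the same window.
        System.out.println(windowStart(1674477177627L, size)); // prints 1674477177000
        System.out.println(windowEnd(1674477177627L, size));   // prints 1674477180000
        System.out.println(windowStart(1674477177147L, size)); // prints 1674477177000

        // The second sample record belongs to a later window.
        System.out.println(windowStart(1674477277637L, size)); // prints 1674477276000
    }
}
```

With a 10 s out-of-orderness bound, the first window (ending at 1674477180000) can only fire after a record with a timestamp of at least 1674477190000 arrives, which is why the window output appears only after the second record in the sample run.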

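3. Handling Late Data

- Discard (the default)
- Specify how long data is allowed to be late: allowedLateness(Time.seconds(2))
- Collect late data: the sideOutputLateData function gathers late records in one place for unified storage, which makes later troubleshooting easier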
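How these options interact can be modelled in plain Java. This is a simplified sketch that follows the article's trigger condition (watermark >= window_end_time); Flink internally compares against the window's maximum timestamp, which is window_end_time minus 1 ms. The class LateDataSketch, the enum Fate and the method classify are hypothetical names.

```java
// Hypothetical model of what happens to a record depending on how late it is; not the Flink API.
final class LateDataSketch {

    enum Fate {
        ON_TIME,           // window has not fired yet; the record is simply added
        LATE_BUT_ACCEPTED, // window already fired, but allowed lateness has not expired
        TOO_LATE           // dropped by default, or routed to a side output if one is configured
    }

    // Classifies a record by the current watermark and the end of the window it belongs to.
    static Fate classify(long currentWatermark, long windowEnd, long allowedLatenessMillis) {
        if (currentWatermark < windowEnd) {
            return Fate.ON_TIME;
        }
        if (currentWatermark < windowEnd + allowedLatenessMillis) {
            return Fate.LATE_BUT_ACCEPTED;
        }
        return Fate.TOO_LATE;
    }

    public static void main(String[] args) {
        long windowEnd = 3_000L;
        long allowedLateness = 2_000L; // corresponds to allowedLateness(Time.seconds(2))

        System.out.println(classify(2_500L, windowEnd, allowedLateness)); // prints ON_TIME
        System.out.println(classify(4_000L, windowEnd, allowedLateness)); // prints LATE_BUT_ACCEPTED
        System.out.println(classify(5_000L, windowEnd, allowedLateness)); // prints TOO_LATE
    }
}
```

Allowed lateness keeps window state alive longer and can produce updated results for the same window, while a side output preserves records that arrive after even that grace period.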