Window in Flink | 青训营 (Youth Training Camp) Notes


This is day 4 of my participation in the note-writing activity of the 4th 青训营 (Youth Training Camp).

1. Streaming Computing

  • Streaming Computing vs. Batch Computing

    | Features | Batch Computing | Streaming Computing |
    | --- | --- | --- |
    | Data Storage | HDFS, Hive | Kafka, Pulsar |
    | Data Timeliness | Day level | Minute level |
    | Accuracy | Accurate | Trade-off between accuracy and timeliness |
    | Typical Computing Engines | Hive, Spark, Flink | Flink |
    | Computational Model | Exactly-Once | At-Least-Once / Exactly-Once |
    | Resource Model | Timed scheduling | Long-term holding |
    | Main Usage | Offline day-level data reports | Real-time data warehouse, real-time marketing, real-time risk control |
  • Batch process
    • The data warehouse architecture for the batch processing model is T+1: data is processed at the day level, so on any given day only the previous day's processed results are visible
    • Hive or Spark is usually used. The data is fully ready when processing starts, and the inputs and outputs are deterministic
  • Processing Time Window
    • Real-time computing: processing time window
    • Data flows in and is computed in real time; results are emitted directly when the window ends, with no need for periodically scheduled tasks
  • Processing Time vs. Event Time
    • Processing Time: The current time of the machine where the data is actually processed in the streaming system
    • Event Time: Time of data generation, e.g. when data is reported by clients, sensors, back-end code, etc.


  • Event Time Window
    • Real-time computing: Event Time Window
    • Data enters its event-time window in real time for computation, which effectively handles delayed and out-of-order data
    • But when will the window end?
  • Watermark
    • Insert watermarks into the data stream to indicate the current event time
    • Watermarks are especially important when the data is out of order

2. Watermark

  • What is WaterMark?
    • Watermark indicates the current event time as perceived by the system
    • Watermark is a mechanism for measuring the progress of event time; it flows with the data stream as a special element carrying a timestamp
    • A watermark is essentially a timestamp asserting that all events earlier than this timestamp have already arrived. The assumption that no events with smaller timestamps will arrive is the basis for triggering window computation: a window is closed and computed only when the watermark exceeds the window's end time.
    • Under this criterion, any data that arrives with a timestamp smaller than the current watermark is considered late data, and Flink provides mechanisms to handle it.
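To make the triggering rule concrete, here is a minimal Python sketch (illustrative only, not the Flink API): a single event-time tumbling window covering [0, 10) fires once the watermark passes its end, using a bounded-out-of-orderness watermark; records behind the watermark are flagged as late. All names and the out-of-orderness bound are assumptions for the example.

```python
WINDOW_SIZE = 10        # the window covers event-time range [0, 10)
OUT_OF_ORDERNESS = 2    # assumed bound on how late events may arrive

def process(events):
    """events: list of (timestamp, value) in arrival order.
    Returns (window result or None, list of late records)."""
    watermark = float("-inf")
    buffered, fired, late = [], None, []
    for ts, value in events:
        if ts < watermark:          # behind the watermark -> late data
            late.append((ts, value))
            continue
        if ts < WINDOW_SIZE:        # belongs to the window [0, 10)
            buffered.append(value)
        # the watermark advances with the largest timestamp seen, minus slack
        watermark = max(watermark, ts - OUT_OF_ORDERNESS)
        if fired is None and watermark >= WINDOW_SIZE:
            fired = sum(buffered)   # watermark passed the end: trigger once
    return fired, late
```

Feeding `[(1, 1), (5, 2), (3, 3), (12, 4), (2, 5)]` fires the window with sum 6 when the event at timestamp 12 pushes the watermark past 10; the record at timestamp 2 arriving afterwards is late.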
  • Passing of the Watermark
    • A watermark is broadcast to all downstream subtasks. When a subtask receives watermarks from multiple parallel upstream subtasks, it takes the minimum of them as its own watermark.
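The minimum-combination rule can be sketched in a few lines of Python (illustrative, not the Flink API):

```python
def combined_watermark(upstream_watermarks):
    """upstream_watermarks: {upstream_subtask_id: latest watermark received}.
    A subtask's own watermark is the minimum across all upstream subtasks."""
    return min(upstream_watermarks.values())
```

Taking the minimum is required for correctness: a slower upstream subtask may still emit events older than the faster subtasks' watermarks, so the combined watermark can only advance as fast as the slowest input.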
  • Problem 1
    • Per-partition VS per-subtask watermark generation
      • Per-subtask watermark generation
        • If a source subtask consumes multiple partitions, then data reading between multiple partitions may increase the degree of disorder
      • Per-partition watermark generation
        • Generating a watermark separately for each partition effectively avoids the problem above
  • Problem 2
    • Partial subtask/partition out of the flow
      • If the watermark of an upstream subtask is not updated, then all downstream watermarks are not updated
    • Solution: Idle source
      • When a subtask produces no data for longer than the configured idle timeout, it is marked idle and sends an idle status downstream. Downstream subtasks can then ignore the currently idle subtasks when computing their own watermark
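Extending the minimum-combination sketch with idleness (names are illustrative, not the Flink API): upstream subtasks flagged idle are simply excluded, so a stalled source no longer holds the combined watermark back.

```python
def combined_watermark_with_idleness(upstream):
    """upstream: {subtask_id: (latest_watermark, is_idle)}.
    Idle upstream subtasks are ignored when taking the minimum."""
    active = [wm for wm, is_idle in upstream.values() if not is_idle]
    return min(active) if active else None  # None: no active input yet
```

With subtask 1 idle at watermark 20, the combined watermark advances to 95 instead of being stuck at 20.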
  • Problem 3
    • Late data processing
      • Since the watermark indicates the current event time, the system treats data that arrives with a timestamp earlier than the watermark as late data.
      • The operator itself decides how to process late data:
        • Window Aggregate: late data is discarded by default
        • Join: late data cannot be joined with previously processed data
        • CEP: late data is discarded by default

3. Window

  • Window Classification
    • Typical Window:
      • Tumble Window
      • Sliding Window
      • Session Window
    • Other Window:
      • Global Window
      • Count Window
      • Accumulation Window
      • ...
  • Tumble Window
    • Window Division
      1. Each key is divided separately
      2. Each piece of data will only belong to one window
    • Window Trigger
      • Triggered once when Window end time is reached
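The tumbling-window division rule can be sketched as follows: each timestamp maps to exactly one window of the form [start, start + size).

```python
def tumble_window(ts, size):
    """Tumbling-window assignment: the single window [start, end) containing ts."""
    start = ts - ts % size
    return (start, start + size)
```

For example, with a window size of 10, timestamp 23 falls into window [20, 30), and 20 itself starts that window (intervals are closed on the left, open on the right).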
  • Sliding Window
    • Window Division
      1. Each key is divided separately
      2. Each piece of data may belong to more than one window
    • Window Trigger
      • Triggered once when Window end time is reached
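Sliding-window assignment differs from tumbling in that one timestamp belongs to every window whose range covers it; a sketch:

```python
def sliding_windows(ts, size, slide):
    """All sliding windows [start, start + size) that contain timestamp ts."""
    windows = []
    start = ts - ts % slide          # latest window start that is <= ts
    while start > ts - size:         # walk back while ts is still inside
        windows.append((start, start + size))
        start -= slide
    return windows
```

With size 10 and slide 5, timestamp 23 belongs to two windows, [20, 30) and [15, 25); when slide equals size this degenerates to the tumbling case with exactly one window.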
  • Session Window
    • A new window is started when data arrives after a period with no new data.
    • When no elements are received within a fixed period of time, i.e. an inactivity interval occurs, the window closes.
    • Window Division
      1. Each key is divided separately
      2. Each piece of data will be divided into a separate window, and if there is an intersection between windows, the windows will be merged
    • Window Trigger
      • Triggered once when Window end time is reached
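The divide-then-merge rule for session windows can be sketched like this: each element opens its own window [ts, ts + gap), and windows that intersect are merged.

```python
def session_windows(timestamps, gap):
    """Each element opens window [ts, ts + gap); overlapping windows merge."""
    merged = []
    for ts in sorted(timestamps):
        if merged and ts < merged[-1][1]:       # intersects the last session
            start, end = merged[-1]
            merged[-1] = (start, max(end, ts + gap))
        else:                                   # gap exceeded: new session
            merged.append((ts, ts + gap))
    return merged
```

With a gap of 5, elements at 1 and 3 merge into one session [1, 8), while the element at 10 starts a new session [10, 15).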
  • Late Data Process
    • How to define late data?
      • After a piece of data arrives, a WindowAssigner assigns it to a window; a time window is usually an interval such as [10:00, 11:00). If the end of the assigned window is smaller than the current watermark, that window has already been triggered, so the data is considered late.
    • When late data is generated?
      • Late data will be generated only under event time
    • Default processing of late data?
      • Discard
    • Allow lateness
      • This method requires configuring an allowed-lateness time. With it set, the window does not clean up its state immediately after the normal trigger but keeps it for the allowed-lateness duration; if data still arrives during this time, it is processed against the retained state.
      • Applicable: DataStream, SQL
    • SideOutput
      • This approach tags late data; the late-data stream is then obtained from the DataStream via this tag, and the business logic decides how to process it
      • Applicable: DataStream
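The side-output idea in plain Python (an illustrative sketch, not the Flink `OutputTag` API): records behind the watermark are routed to a tagged side stream instead of being dropped.

```python
def split_late(records, watermark):
    """records: (timestamp, value) pairs.
    Returns (main_stream, late_stream) split against the watermark."""
    main, late = [], []
    for ts, value in records:
        (late if ts < watermark else main).append((ts, value))
    return main, late
```

The late stream can then be handled separately, for example written to storage for later correction.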
    • Incremental Computing VS Full Window Function
      • Incremental Computing:
        • Every time data comes, the computation is done directly and window only stores the result of the computation. For example, to compute sum, only the result of sum needs to be stored in the state, not each piece of data
        • Typical function: reduce, aggregate
        • Aggregation in SQL only has incremental computing
      • Full Window Function:
        • Every time data arrives, it is stored in the window's state. When window triggers the computation, all the data is taken out and computed together.
        • Typical function: process
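The two styles can be contrasted with a sum over one window's elements (a sketch in plain Python, not the Flink functions themselves):

```python
def incremental_sum(elements):
    acc = 0                  # state holds only the accumulated result
    for x in elements:       # ReduceFunction/AggregateFunction style:
        acc += x             # fold each element in as it arrives
    return acc

def full_window_sum(elements):
    state = []               # state holds every element of the window
    for x in elements:
        state.append(x)      # buffer on arrival
    return sum(state)        # ProcessWindowFunction style: compute at trigger
```

Both return the same result, but the incremental version keeps O(1) state per window while the full-window version keeps all elements until the trigger fires.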
    • EMIT Trigger
      • What is EMIT?
        • Generally, a window produces output only when it ends; for example, a 1-hour tumble window emits its result only at the end of the hour. If the window is large, such as 1 day or more, the output delay of the computed result grows and real-time computation loses its meaning.
        • EMIT output means that the results of the window computation are output in advance when the window is not finished
      • How to achieve?
        • This can be done inside the DataStream by customizing the Trigger, the result of which can be:
          • CONTINUE
          • FIRE
          • PURGE
          • FIRE_AND_PURGE
        • Can also be used in SQL, by configuration:
          • table.exec.emit.early-fire.enabled=true
          • table.exec.emit.early-fire.delay={time}

4. Window - Advanced Optimization

  • Mini-batch Optimization
    • By default, the aggregation operator performs a "read accumulator state → modify state → write back state" cycle for every record it ingests. If the data volume is large, the overhead of these state operations grows and hurts efficiency. With Mini-Batch enabled, ingested records are first buffered inside the operator, and the aggregation logic runs only when a configured capacity or time threshold is reached.
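A sketch of the mini-batch idea (illustrative, not Flink internals): buffer incoming records and perform one read-modify-write of the keyed state per distinct key per flush, instead of one per record.

```python
from collections import Counter

class MiniBatchCounter:
    """Counts records per key, amortizing state access over a buffer."""
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []
        self.state = {}            # stand-in for the keyed accumulator state
        self.state_accesses = 0    # counts read-modify-write cycles

    def ingest(self, key):
        self.buffer.append(key)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # one state access per distinct key in the buffer, not per record
        for key, n in Counter(self.buffer).items():
            self.state[key] = self.state.get(key, 0) + n
            self.state_accesses += 1
        self.buffer.clear()
```

For a batch of `['a', 'a', 'b', 'a']`, the state is touched twice (once per distinct key) rather than four times; the saving grows with key repetition in the batch.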
  • local-global
    • Solves the data-skew problem in aggregation.
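A sketch of local-global aggregation (illustrative names): each upstream task pre-aggregates its share of the data locally, so the global task merges a few partial results per key instead of every raw record of a hot key.

```python
def local_global_count(partitions):
    """partitions: one list of keys per upstream task.
    Returns the global per-key counts via local pre-aggregation."""
    partials = []
    for part in partitions:                  # local (pre) aggregation
        agg = {}
        for key in part:
            agg[key] = agg.get(key, 0) + 1
        partials.append(agg)
    merged = {}
    for agg in partials:                     # global aggregation merges
        for key, n in agg.items():
            merged[key] = merged.get(key, 0) + n
    return merged
```

A hot key appearing millions of times in one partition reaches the global stage as a single partial count, which is what relieves the skewed subtask.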
  • State Reuse for Distinct Computation
    • Example: Counting the UV (Unique Visitors) of each sub-channel of the app, this use case has two features:

      1. Channels are enumerable
      2. High overlap of users on different channels
      SELECT
          channel,
          COUNT(DISTINCT device_id) AS uv
      FROM source_table
      GROUP BY channel
      

      The above is the most primitive form of the query.

      Before optimization: the group key is the channel, and one COUNT DISTINCT computes each channel's UV. Assume the channel has only three enumerable values: A, B, and other. The group key is the channel ID, the key of the MapState is the device ID, and the value is a 64-bit long in which each bit indicates whether the device appeared in the channel; in this simple scenario, the MapState value is 1.

      In the figure above, channel A has two devices, with IDs 1 and 2. Device 1 also accesses channel B, and device 2 also accesses channel Other. The maps of different channels can therefore overlap heavily; to reuse these keys, the SQL can be rewritten manually using the approach provided by the community.

      After rewriting, the query and the storage of the device-set state are as follows.

      SELECT TMP.channel, TMP.uv
      FROM (
          SELECT
              COUNT(DISTINCT device_id) FILTER (WHERE channel = 'A') AS uv_1,
              COUNT(DISTINCT device_id) FILTER (WHERE channel = 'B') AS uv_2,
              COUNT(DISTINCT device_id) FILTER (WHERE channel = 'OTHER') AS uv_3
          FROM source_table
      ) T1, LATERAL TABLE (
          toMultipleRow(uv_1, uv_2, uv_3)) AS TMP (channel, uv)
      

      After rewriting, the group key is empty, the MapState key is the device ID, and the MapState value is a 64-bit long in which each bit indicates whether the device is present in the corresponding channel. For example, the value for device 1 is 110, meaning the device accessed channels A and B, and the device with ID 3 accessed channels B and OTHER.

    • Advantages: Significant reduction in state

    • Disadvantages:

      • The SQL needs to be rewritten manually, and if a dimension has multiple values or has multiple enumerable dimensions, then the manually rewritten SQL will be very long
      • Column to row conversions need to be performed with custom table functions.
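The shared-state layout described above can be sketched in Python (bit positions per channel are an assumption for the example): one map keyed by device ID whose value is a bitmask, with bit i set when the device appeared in channel i.

```python
CHANNELS = {"A": 0, "B": 1, "OTHER": 2}   # assumed channel-to-bit mapping

def build_state(events):
    """events: (device_id, channel) pairs; returns {device_id: bitmask}."""
    state = {}
    for device_id, channel in events:
        state[device_id] = state.get(device_id, 0) | (1 << CHANNELS[channel])
    return state

def uv_per_channel(state):
    """Per channel, count devices whose mask has that channel's bit set."""
    return {ch: sum(1 for mask in state.values() if mask >> bit & 1)
            for ch, bit in CHANNELS.items()}
```

A device visiting several channels occupies one map entry instead of one per channel, which is where the state reduction comes from.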
  • Pane Optimization
    • The common solution in many systems for queries over sliding windows is to cache each input element until it no longer appears in any window (as time advances and the window slides, the element's timestamp falls out of every window's time range).
      • Two main problems:
        1. It requires the buffer size to be unbounded
        2. Processing each input element multiple times can lead to high computational overhead
    • Pane divides the window into several regular parts, which can be regarded as sub-windows and can be simply understood as slicing the window again.
    • Pane reduces the buffer space requirement by sub-aggregating windows, and reduces the amount of computation by sharing sub-aggregation results when computing windows.
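A sketch of pane-based sliding-window aggregation (illustrative only): slice the stream into panes of width equal to the slide, aggregate each pane once, then compute each window as the sum of the partial results of the panes it covers.

```python
def pane_sums(events, slide):
    """events: (timestamp, value) pairs; returns {pane_start: partial sum}."""
    panes = {}
    for ts, value in events:
        pane = ts - ts % slide               # each event lands in one pane
        panes[pane] = panes.get(pane, 0) + value
    return panes

def window_sum(panes, window_start, size, slide):
    """Sum over one sliding window [window_start, window_start + size),
    reusing the shared per-pane partial results."""
    return sum(panes.get(window_start + off, 0)
               for off in range(0, size, slide))
```

Each event is aggregated exactly once into its pane, and overlapping windows share those pane results instead of reprocessing the raw elements, which addresses both problems listed above.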