intro

Stateful Computations over Data Streams

Key mechanisms

www.bilibili.com/video/BV1rr…

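State
- Stores the intermediate results computed by operators
- Supports Flink's fault-tolerant recovery
Checkpoints
- Take snapshots that persist the state
Time
- Serves as the split point for data
- Fault-tolerant recovery also needs a point in time to restore to
Windows
- Split the data into slices, which also makes aggregation easier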

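Core idea: state management

What sets Flink apart most from other stream processing engines is state management.
Flink provides built-in state management, so working state can be kept inside Flink instead of in an external system. The benefits are:
- It reduces the engine's dependency on external systems, which makes deployment and operations simpler.
- It improves performance considerably, since state is read and written locally rather than through an external system.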

Runtime architecture

Core concepts

datastream

dataset

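Flink programs

A Flink program consists of three parts: Source, Transformation, and Sink.

Source reads the data and supports HDFS, Kafka, text files, and more.
Transformation applies conversion operations to the data.
Sink writes the final output and supports HDFS, Kafka, text files, and more. The data flowing between these parts is called a stream.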

Flink data sources

Bounded and unbounded streams

Examples

Stream processing

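// Count the visitors to a website every hour, grouped by region.
// Note: timeWindow(...) is deprecated since Flink 1.12; newer code uses window(...) with a window assigner.
val counts = visits
  .keyBy("region")
  .timeWindow(Time.hours(1))
  .sum("visits")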

Batch processing

val counts = visits
  .groupBy("region")
  .sum("visits")
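
The difference between the two snippets can be illustrated without a Flink dependency. The following is a minimal plain-Java sketch, not the Flink API: the class, the method names, and the sample data are all hypothetical. It sums visits per region once over the whole bounded input (batch style) and once per region and one-hour tumbling window (streaming style).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; none of this is the Flink API.
class VisitCountSketch {
    static final long HOUR_MS = 60L * 60L * 1000L;

    // One input record: a region, an event timestamp in ms, and a visit count.
    static final class Visit {
        final String region;
        final long timestampMs;
        final int visits;

        Visit(String region, long timestampMs, int visits) {
            this.region = region;
            this.timestampMs = timestampMs;
            this.visits = visits;
        }
    }

    // Batch style: one total per region over the whole bounded input.
    static Map<String, Integer> sumByRegion(List<Visit> input) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Visit v : input) {
            totals.merge(v.region, v.visits, Integer::sum);
        }
        return totals;
    }

    // Streaming style: one total per region and one-hour tumbling window.
    // The map key is "region@windowStartMs".
    static Map<String, Integer> sumByRegionAndHour(List<Visit> input) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Visit v : input) {
            long windowStart = v.timestampMs - (v.timestampMs % HOUR_MS);
            totals.merge(v.region + "@" + windowStart, v.visits, Integer::sum);
        }
        return totals;
    }

    // Hypothetical sample data: three records in the first hour, one in the second.
    static List<Visit> sampleVisits() {
        long minute = 60_000L;
        return Arrays.asList(
            new Visit("eu", 10 * minute, 1),
            new Visit("us", 20 * minute, 1),
            new Visit("eu", 50 * minute, 2),
            new Visit("eu", 70 * minute, 4));
    }

    public static void main(String[] args) {
        System.out.println(sumByRegion(sampleVisits()));
        System.out.println(sumByRegionAndHour(sampleVisits()));
    }
}
```

A real streaming job emits each window's result when that window fires, whereas this sketch sees all the input before printing anything. It only shows how the grouping key differs between the two modes.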

Processing model

Watermarks handle events that arrive at Flink out of order or late.
Windows split the data into slices.
Triggers decide when the data in a window is evaluated.

architecture

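Execution flow

The user first submits the Flink program to the JobClient, which processes, parses, and optimizes it and then submits it to the JobManager. The JobManager converts the logical dataflow graph into a physical dataflow graph, one that is actually executable and places concrete tasks on TaskManagers. Finally, the TaskManagers run the tasks. Each TaskManager divides its resources (memory) into task slots, and each slot can run different operators of the same job.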

graphs

zhuanlan.zhihu.com/p/678970264

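StreamGraph: The initial graph generated from the code the user writes with the Stream API. It represents the topology of the program. In short, it strings the operators together.

JobGraph: The StreamGraph is optimized into a JobGraph, the data structure that is submitted to the JobManager. The main optimization chains multiple eligible nodes into a single node, which reduces the serialization, deserialization, and transfer cost of moving data between nodes. In short, it fuses the operators that can be chained.

ExecutionGraph: The JobManager generates the ExecutionGraph from the JobGraph. It is the parallelized version of the JobGraph and the core data structure of the scheduling layer, which uses it to schedule execution.

Physical execution graph: The graph formed once the JobManager has scheduled the job according to the ExecutionGraph and deployed the tasks on the TaskManagers. In short, it is the final runtime picture.

Note: the physical execution graph is not a Flink data structure. It describes the physical relationships among the tasks distributed across nodes once the program is running.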

JobManager

ResourceManager

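The ResourceManager is responsible for providing, reclaiming, and allocating resources in a Flink cluster. It manages task slots, which are the unit of resource scheduling in a Flink cluster.

Dispatcher

The Dispatcher provides a REST interface for submitting Flink applications for execution and starts a new JobMaster for each submitted job. It also runs the Flink WebUI, which provides information about job executions.

JobMaster

A JobMaster manages the execution of a single JobGraph. Multiple jobs can run simultaneously in a Flink cluster, each with its own JobMaster.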

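TaskManagers

TaskManagers (also called workers) execute the tasks of a dataflow, and buffer and exchange the data streams.

The smallest unit of resource scheduling in a TaskManager is the task slot. The number of task slots in a TaskManager is the number of tasks it can process concurrently. Multiple operators can run in one task slot.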

operator

There are three main categories.

window

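Windows are at the heart of processing infinite streams. A window splits an infinite stream into buckets of finite size, over which we can apply computations.

By application type, windows fall into two categories:

CountWindow: data-driven. A window is created for a specified number of elements, independent of time.
TimeWindow: time-driven. Windows are created based on time.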

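TimeWindow categories

By how the window is implemented, TimeWindows fall into three categories:

Tumbling window
Sliding window
Session window
- A session window consists of a series of events followed by a timeout gap of a specified length, similar to a session in a web application. If no new data arrives for that long, the current window closes and the next event starts a new one.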
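
The three window types differ mainly in how a timestamp is mapped to windows. The sketch below is plain Java rather than the Flink API. The class and method names are hypothetical, timestamps are assumed to be non-negative, windows are assumed to be aligned to timestamp 0, and the session logic assumes the timestamps are already sorted.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of window assignment; not the Flink API.
class WindowAssignSketch {
    // Tumbling: each timestamp belongs to exactly one window [start, start + size).
    static long tumblingStart(long ts, long size) {
        return ts - (ts % size);
    }

    // Sliding: a timestamp belongs to every window [start, start + size)
    // whose start is a multiple of slide and whose range still covers ts.
    static List<Long> slidingStarts(long ts, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - (ts % slide);
        for (long start = lastStart; start > ts - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    // Session: sorted timestamps are split wherever two consecutive events
    // are more than gap apart. Returns {firstTs, lastTs} for each session.
    static List<long[]> sessions(long[] sortedTs, long gap) {
        List<long[]> result = new ArrayList<>();
        if (sortedTs.length == 0) {
            return result;
        }
        long first = sortedTs[0];
        long last = sortedTs[0];
        for (int i = 1; i < sortedTs.length; i++) {
            if (sortedTs[i] - last > gap) {
                result.add(new long[] {first, last});
                first = sortedTs[i];
            }
            last = sortedTs[i];
        }
        result.add(new long[] {first, last});
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tumblingStart(12, 5));
        System.out.println(slidingStarts(12, 10, 5));
        for (long[] s : sessions(new long[] {1, 2, 3, 10, 11, 30}, 5)) {
            System.out.println(s[0] + ".." + s[1]);
        }
    }
}
```

With a size of 10 and a slide of 5, every timestamp falls into two overlapping windows, which is why sliding windows cost more state than tumbling windows of the same size.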

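Notions of time

event time: the time at which the event occurred
ingestion time: the time at which the event entered the stream processing system
processing time: the time at which the system processed the event

In practice, the order in which events actually occurred differs somewhat from the order in which the system processes them, mainly because of network latency, varying processing times, and similar factors.

Out-of-order example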

watermark

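Watermarks solve the out-of-order problem. Out of order simply means that some events are delayed. We cannot wait for delayed elements indefinitely, so there must be a mechanism that guarantees the window is triggered for computation after a certain time. That mechanism is the watermark: it tells operators that late-arriving messages should no longer be accepted.

How does Flink ensure that an event-time window has processed all of its data by the time it is destroyed? This is what watermarks are for. A watermark carries a monotonically increasing timestamp t. Watermark(t) states that all data with a timestamp <= t has arrived and that no more data with a timestamp <= t will come, so the window can safely be triggered and destroyed.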
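
To make the Watermark(t) contract concrete, here is a minimal plain-Java simulation, not the Flink API. The assumptions are mine: tumbling windows aligned to 0, a watermark that advances after every event to the largest timestamp seen minus a fixed out-of-orderness bound minus 1, and a window [start, end) that fires as soon as the watermark reaches end - 1. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of watermark-driven window firing; not the Flink API.
class WatermarkSketch {
    final long windowSize;
    final long maxOutOfOrderness;
    long maxTimestamp = Long.MIN_VALUE;
    long watermark = Long.MIN_VALUE;
    // Open windows: window start -> number of events counted so far.
    final TreeMap<Long, Integer> openWindows = new TreeMap<>();
    final List<String> fired = new ArrayList<>();
    final List<Long> dropped = new ArrayList<>();

    WatermarkSketch(long windowSize, long maxOutOfOrderness) {
        this.windowSize = windowSize;
        this.maxOutOfOrderness = maxOutOfOrderness;
    }

    // Events are passed in arrival order, which may differ from timestamp order.
    void onEvent(long ts) {
        long start = ts - (ts % windowSize);
        long lastTsOfWindow = start + windowSize - 1;
        if (lastTsOfWindow <= watermark) {
            // The watermark already passed this window: the event is too late.
            dropped.add(ts);
        } else {
            openWindows.merge(start, 1, Integer::sum);
        }
        // Advance the watermark; it never moves backwards.
        maxTimestamp = Math.max(maxTimestamp, ts);
        watermark = Math.max(watermark, maxTimestamp - maxOutOfOrderness - 1);
        // Fire and destroy every window whose last timestamp is covered.
        while (!openWindows.isEmpty()
                && openWindows.firstKey() + windowSize - 1 <= watermark) {
            Map.Entry<Long, Integer> e = openWindows.pollFirstEntry();
            fired.add("[" + e.getKey() + "," + (e.getKey() + windowSize)
                + ") count=" + e.getValue());
        }
    }

    public static void main(String[] args) {
        WatermarkSketch w = new WatermarkSketch(10, 3);
        for (long ts : new long[] {1, 4, 3, 9, 12, 8, 14, 5, 21}) {
            w.onEvent(ts);
        }
        System.out.println("fired=" + w.fired + " dropped=" + w.dropped);
    }
}
```

In this run the event with timestamp 8 arrives after 12 but is still counted, because the watermark has only reached 8 and the window [0,10) is still open. The event with timestamp 5 arrives after the watermark has reached 10 and is dropped.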

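Handling late data

The Allowed Lateness mechanism lets the user set a maximum allowed lateness. After the end of a window has passed, Flink keeps the window's state until the allowed lateness has elapsed. Late events arriving during that period are not dropped; they trigger a recomputation of the window. Keeping window state costs extra memory, and if the window computation uses the ProcessWindowFunction API, every late event may trigger a full recomputation of the window, which is expensive. The allowed lateness should therefore not be set too long, and late events should not be too numerous.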
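
The following plain-Java sketch, not the Flink API, illustrates the lifecycle described above. The assumptions are mine: tumbling windows aligned to 0, watermarks passed in explicitly, a window [start, end) that first fires when the watermark reaches end - 1, and state that is cleaned up once the watermark reaches end - 1 + allowedLateness. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of allowed lateness; not the Flink API.
class AllowedLatenessSketch {
    final long windowSize;
    final long allowedLateness;
    long watermark = Long.MIN_VALUE;
    // Window start -> count; kept until the allowed lateness has elapsed.
    final TreeMap<Long, Integer> state = new TreeMap<>();
    final List<String> emitted = new ArrayList<>();
    final List<Long> dropped = new ArrayList<>();

    AllowedLatenessSketch(long windowSize, long allowedLateness) {
        this.windowSize = windowSize;
        this.allowedLateness = allowedLateness;
    }

    String describe(long start, int count) {
        return "[" + start + "," + (start + windowSize) + ") count=" + count;
    }

    void onEvent(long ts) {
        long start = ts - (ts % windowSize);
        long lastTs = start + windowSize - 1;
        if (lastTs + allowedLateness <= watermark) {
            dropped.add(ts); // beyond the allowed lateness: state is gone
            return;
        }
        int count = state.merge(start, 1, Integer::sum);
        if (lastTs <= watermark) {
            // Late but allowed: the window fires again with the updated result.
            emitted.add(describe(start, count) + " (late update)");
        }
    }

    void onWatermark(long newWatermark) {
        long old = watermark;
        watermark = Math.max(watermark, newWatermark);
        Iterator<Map.Entry<Long, Integer>> it = state.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Integer> e = it.next();
            long lastTs = e.getKey() + windowSize - 1;
            if (lastTs > old && lastTs <= watermark) {
                emitted.add(describe(e.getKey(), e.getValue())); // first firing
            }
            if (lastTs + allowedLateness <= watermark) {
                it.remove(); // lateness elapsed: free the window state
            }
        }
    }

    public static void main(String[] args) {
        AllowedLatenessSketch s = new AllowedLatenessSketch(10, 5);
        s.onEvent(3);
        s.onEvent(7);
        s.onWatermark(9);  // first firing of [0,10)
        s.onEvent(8);      // late but allowed: triggers an update
        s.onWatermark(14); // state of [0,10) is cleaned up
        s.onEvent(6);      // too late: dropped
        System.out.println(s.emitted + " dropped=" + s.dropped);
    }
}
```

The sketch keeps only a count per window, so a late update is cheap. With a function that buffers all elements, each late update would reprocess the whole buffer, which is the cost the paragraph above warns about.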

Fault tolerance

core concepts

Streams

A stream is a bounded or unbounded sequence of events.

A Flink application is a data processing pipeline. Your events flow through this pipeline, and they are operated on at each stage by code you write. We call this pipeline the job graph, and the nodes of this graph (or in other words, the stages of the processing pipeline) are called operators.

parallelism and subtask

Operators can run in parallel.

During execution, a stream has one or more stream partitions, and each operator has one or more operator subtasks, each operating independently on some subset of the events.

The number of operator subtasks is the parallelism of that particular operator. Different operators of the same program may have different levels of parallelism.

subtask and task slot

nightlies.apache.org/flink/flink…

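Task slots are designed to isolate resources (memory) between jobs.
Each worker (TaskManager) is a JVM process.
The number of task slots in a TaskManager is the number of tasks it can process concurrently.
By adjusting the number of task slots, users define how subtasks are isolated from each other. One slot per TaskManager means that each task group runs in a separate JVM (which can be started in a separate container, for example). Multiple slots mean that more subtasks share the same JVM. Tasks in the same JVM share TCP connections (via multiplexing) and heartbeat messages. They may also share data sets and data structures, which reduces the per-task overhead.
By default, Flink allows subtasks to share slots even if they are subtasks of different tasks, as long as they belong to the same job. As a result, one slot may hold an entire pipeline of the job.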
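
One consequence of slot sharing is that a job needs only as many slots as its highest task parallelism, because one slot can hold one subtask of every task. The plain-Java sketch below compares that with the case where every subtask gets its own slot. The class and method names and the example parallelisms are hypothetical, and a single slot sharing group is assumed.

```java
// Plain-Java arithmetic sketch; not the Flink API.
class SlotCountSketch {
    // With slot sharing, each slot can hold one subtask of every task,
    // so the highest parallelism determines the number of slots.
    static int slotsWithSharing(int[] taskParallelism) {
        int max = 0;
        for (int p : taskParallelism) {
            max = Math.max(max, p);
        }
        return max;
    }

    // Without slot sharing, every subtask occupies a slot of its own.
    static int slotsWithoutSharing(int[] taskParallelism) {
        int sum = 0;
        for (int p : taskParallelism) {
            sum += p;
        }
        return sum;
    }

    // Number of TaskManagers needed to provide the slots (rounded up).
    static int taskManagersNeeded(int slots, int slotsPerTaskManager) {
        return (slots + slotsPerTaskManager - 1) / slotsPerTaskManager;
    }

    public static void main(String[] args) {
        // Hypothetical job: source+map (6), keyBy/window (6), sink (1).
        int[] parallelism = {6, 6, 1};
        System.out.println(slotsWithSharing(parallelism));
        System.out.println(slotsWithoutSharing(parallelism));
        System.out.println(taskManagersNeeded(slotsWithSharing(parallelism), 3));
    }
}
```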

Scheduling tasks to TMs

Flink tasks go through two stages when they are deployed from the JobGraph to TMs.

The first stage allocates tasks to slots.

The existing strategy is LocalInputPreferredSlotSharingStrategy, which tries to reduce remote data exchanges. Unfortunately, it can assign many tasks to the same execution slot sharing group, which leads to an unbalanced task load at the slot level.

The second stage schedules slots to TMs.

Slots containing different numbers of tasks are deployed to TMs randomly. As a consequence, the task load is also unbalanced at the TM level.

transport data between two operators

Streams can transport data between two operators in a one-to-one (or forwarding) pattern, or in a redistributing pattern.

One-to-one streams (for example between the Source and the map() operators in the figure above) preserve the partitioning and ordering of the elements.

Redistributing streams (as between map() and keyBy/window above, as well as between keyBy/window and Sink) change the partitioning of streams.

stateful

Flink’s operations can be stateful. This means that how one event is handled can depend on the accumulated effect of all the events that came before it.

The set of parallel instances of a stateful operator is effectively a sharded key-value store. Each parallel instance is responsible for handling events for a specific group of keys, and the state for those keys is kept locally.

fault-tolerant, exactly-once

Flink is able to provide fault-tolerant, exactly-once semantics through a combination of state snapshots and stream replay. These snapshots capture the entire state of the distributed pipeline, recording offsets into the input queues as well as the state throughout the job graph that has resulted from having ingested the data up to that point. When a failure occurs, the sources are rewound, the state is restored, and processing is resumed. As depicted above, these state snapshots are captured asynchronously, without impeding the ongoing processing.
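
The snapshot-and-replay idea can be shown with a toy example in plain Java, not the Flink API. The assumptions are mine: the source is a replayable array addressed by offset, the only operator state is a running sum, and snapshots are taken synchronously every few events (the real mechanism is asynchronous and distributed). All names are hypothetical.

```java
// Plain-Java toy model of state snapshots plus source replay; not the Flink API.
class CheckpointReplaySketch {
    // A snapshot records the source offset together with the operator state.
    static final class Snapshot {
        final int offset;
        final long sum;

        Snapshot(int offset, long sum) {
            this.offset = offset;
            this.sum = sum;
        }
    }

    // Sums the input, snapshotting every checkpointInterval events.
    // One crash is simulated just before the event at index failAt
    // (pass -1 for no crash). Returns {finalSum, eventsProcessed}.
    static long[] run(int[] input, int checkpointInterval, int failAt) {
        Snapshot last = new Snapshot(0, 0L);
        int offset = 0;
        long sum = 0L;
        long processed = 0L;
        boolean failed = false;
        while (offset < input.length) {
            if (!failed && offset == failAt) {
                // Crash: in-memory state is lost. Restore the state and
                // rewind the source to the offset stored in the snapshot.
                failed = true;
                offset = last.offset;
                sum = last.sum;
                continue;
            }
            sum += input[offset];
            offset++;
            processed++;
            if (offset % checkpointInterval == 0) {
                last = new Snapshot(offset, sum);
            }
        }
        return new long[] {sum, processed};
    }

    public static void main(String[] args) {
        int[] input = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        long[] clean = run(input, 4, -1);
        long[] crashed = run(input, 4, 6);
        System.out.println(clean[0] + " after " + clean[1] + " events");
        System.out.println(crashed[0] + " after " + crashed[1] + " events");
    }
}
```

Some events are processed twice after the crash, yet the final state matches the run without a crash, because the state and the source offset are restored together. This is the sense in which each event affects the state exactly once.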

key by

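When we specify a key with the keyBy operator, Flink first uses that key to compute the corresponding keyGroupId, and then uses the keyGroupId to determine which parallel subtask the key is assigned to.

    // keyHash is the key's hashCode(); the key group depends only on maxParallelism.
    keyGroupId = MathUtils.murmurHash(keyHash) % maxParallelism;
    // Map the key group to the index of the parallel subtask that owns it.
    subtaskIndex = keyGroupId * parallelism / maxParallelism;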
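
The two formulas can be tried out in plain Java. This is a sketch under assumptions: the mix function below is a stand-in for Flink's MathUtils.murmurHash and does not reproduce it, so the concrete key group numbers will differ from Flink's. The class and method names are hypothetical.

```java
// Plain-Java sketch of key -> key group -> subtask; not the Flink implementation.
class KeyGroupSketch {
    // Stand-in hash (assumption): any well-mixing, non-negative hash is
    // enough for illustration. It is NOT Flink's MathUtils.murmurHash.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h & 0x7fffffff; // clear the sign bit so the modulo is non-negative
    }

    // Step 1: key -> key group in [0, maxParallelism).
    static int keyGroupFor(Object key, int maxParallelism) {
        return mix(key.hashCode()) % maxParallelism;
    }

    // Step 2: key group -> subtask index in [0, parallelism).
    // Consecutive key groups map to the same subtask, so each subtask
    // owns one contiguous range of key groups.
    static int subtaskFor(int keyGroupId, int parallelism, int maxParallelism) {
        return keyGroupId * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        int parallelism = 4;
        for (String key : new String[] {"region-eu", "region-us", "region-asia"}) {
            int group = keyGroupFor(key, maxParallelism);
            int subtask = subtaskFor(group, parallelism, maxParallelism);
            System.out.println(key + " -> key group " + group + " -> subtask " + subtask);
        }
    }
}
```

Because the key group depends only on maxParallelism, changing the parallelism moves whole key groups between subtasks without rehashing individual keys. With maxParallelism 128 and parallelism 4, key groups 0 to 31 belong to subtask 0, 32 to 63 to subtask 1, and so on.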

application patterns

flink.apache.org/2020/01/15/…

flink.apache.org/2020/03/24/…

flink.apache.org/2020/07/30/…
